The Problem
While writing a PowerShell script to clean a CSV file by removing invalid records, I mistakenly added -Encoding utf8 when using Out-File to write the result to the final CSV.
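The command looked roughly like the sketch below (the input file name and the filter are hypothetical; only the Out-File part matters). In Windows PowerShell, -Encoding utf8 means UTF-8 with a byte order mark (BOM):

    # Hypothetical cleanup pipeline; the real filter logic is not the point here.
    # The mistake: in Windows PowerShell, -Encoding utf8 writes UTF-8 *with* a BOM.
    Get-Content C:\raw.csv |
        Where-Object { $_ -notmatch '^\s*$' } |
        Out-File C:\1.csv -Encoding utf8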
Then I ran the following request to import the CSV file into Solr:
http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true
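Side note: if you issue this request from PowerShell itself, the URL must be quoted, because an unquoted & is a reserved token there. A minimal sketch using Invoke-WebRequest (any HTTP client works the same way):

    # Quote the URL; '&' is reserved in PowerShell.
    Invoke-WebRequest "http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true"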
Solr generates a unique key, docid, by concatenating f1, f2, and the first column of the CSV file, localId.
But to my surprise, there was only one document in Solr, with docid 12.
Checking http://localhost:8080/solr/cvcorefla4/admin/luke shows:
    <int name="numDocs">1</int>
    <int name="maxDoc">16420</int>
    <int name="deletedDocs">16419</int>
    <long name="version">2521</long>
If I run http://localhost:8080/solr/select?q=* and copy the response into a new file in Notepad++ with UTF-8 encoding, everything looks fine. But when I change the file's encoding to ASCII, it looks like this:
    <str name="docid">12</str>
    <arr name="ï»¿id">
        <str>f0e662cefe56a31c6eec5d53e64f988d</str>
    </arr>
Notice the mangled, normally invisible character before id. Also, the field is not the expected single string, but an array of strings. So I wrote a simple Java application to inspect the real content of "id":

    public void testUnicode() {
        // Pasted from the Solr response; the prefix character is invisible in most editors.
        String str = "\ufeffid";
        for (int i = 0; i < str.length(); i++) {
            System.out.println(str.charAt(i));
            System.out.println((int) str.charAt(i));
            System.out.println(escapeNonAscii(str.charAt(i) + ""));
        }
        System.out.println("***************");
        System.out.println(str.length());
        System.out.println(str.hashCode());
        System.out.println(escapeNonAscii(str));
        System.out.println("***************");
    }

    private static String escapeNonAscii(String str) {
        StringBuilder retStr = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int cp = Character.codePointAt(str, i);
            int charCount = Character.charCount(cp);
            if (charCount > 1) {
                i += charCount - 1; // skip the low surrogate of a supplementary code point
                if (i >= str.length()) {
                    throw new IllegalArgumentException("truncated unexpectedly");
                }
            }
            if (cp < 128) {
                retStr.appendCodePoint(cp);
            } else {
                retStr.append(String.format("\\u%x", cp));
            }
        }
        return retStr.toString();
    }

The invisible prefix is \ufeff, and U+FEFF is the byte order mark (BOM). Now the problem is obvious: Out-File -Encoding utf8 actually writes UTF-8 with a BOM, but Java reads the file as plain UTF-8 and does not strip the BOM. So to Java, the first column name on the first line is \ufefflocalId, not localId. Solr therefore never recognizes the localId column, every row gets the same generated docid of 12 (just f1 and f2 concatenated), and each imported document overwrites the previous one, which is exactly what numDocs=1 and deletedDocs=16419 showed.
The Solution
Actually the fix is simple: the default encoding of Out-File is Unicode (UTF-16 LE with a BOM, in PowerShell's naming), which works fine with Java. If we are sure all content is in the ASCII range, we can also specify -Encoding ascii.
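Concretely, assuming the same hypothetical pipeline sketched above, either of these avoids the UTF-8 BOM:

    # Default encoding: Unicode (UTF-16 LE), which the downstream Java reader handles.
    Get-Content C:\raw.csv |
        Where-Object { $_ -notmatch '^\s*$' } |
        Out-File C:\1.csv

    # Or, if every character is guaranteed to be in the ASCII range:
    Get-Content C:\raw.csv |
        Where-Object { $_ -notmatch '^\s*$' } |
        Out-File C:\1.csv -Encoding ascii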
Resources
Byte order mark
Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)