The Problem
While writing a PowerShell script to clean a CSV file by removing invalid records, I mistakenly added -encoding utf8 when using out-file to write the cleaned rows to the final CSV.
Then I ran the following command to import the CSV file into Solr:
http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true
Solr generates a unique id, docid, by concatenating f1, f2, and the first column of the CSV file, localId; a row with localId abc should therefore get docid 12abc.
But to my surprise, there was only one document in Solr, with docid 12: every row had been given the same docid, so each one overwrote the last.
Checking http://localhost:8080/solr/cvcorefla4/admin/luke confirms this:
<int name="numDocs">1</int>
<int name="maxDoc">16420</int>
<int name="deletedDocs">16419</int>
<long name="version">2521</long>
Running http://localhost:8080/solr/select?q=* and copying the response into a new file in Notepad++ with UTF-8 encoding, everything seems fine; but after changing the file encoding to ASCII, it looks like this:
<str name="docid">12</str>
<arr name="id">
<str>f0e662cefe56a31c6eec5d53e64f988d</str>
</arr>
Notice the garbled invisible character before id. Also, the field is not the expected string, but an array of strings.
So I wrote a simple Java program to inspect the real content of "id":
public void testUnicode() {
    // Pasted straight from the Solr response: there is an invisible
    // character in front of the visible "id".
    String str = "id";
    for (int i = 0; i < str.length(); i++) {
        System.out.println(str.charAt(i));           // the character itself
        System.out.println((int) str.charAt(i));     // its numeric value
        System.out.println(escapeNonAscii(str.charAt(i) + ""));
    }
    System.out.println("***************");
    System.out.println(str.length());
    System.out.println(str.hashCode());
    System.out.println(escapeNonAscii(str));
    System.out.println("***************");
}
private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // skip the low surrogate of a surrogate pair
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp);                 // keep plain ASCII as-is
        } else {
            retStr.append(String.format("\\u%x", cp));  // escape everything else
        }
    }
    return retStr.toString();
}
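Assuming the pasted literal really does carry the invisible prefix, the program prints roughly the following (the first println outputs the invisible character itself and shows as a blank line, omitted here):

65279
\ufeff
i
105
i
d
100
d
***************
3
62736474
\ufeffid
***************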
The invisible prefix is \ufeff. U+FEFF is the byte order mark (BOM).
So now the problem is obvious:
out-file -encoding utf8
actually writes UTF-8 with a BOM, while Java reads the file as plain UTF-8 and does not strip the BOM. So to Java, the first column header on the first line is \ufefflocalId, not localId. Since localId is never recognized, the generated docid degenerates to 12 for every row.
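If you cannot change the producer, the reader can strip the BOM defensively. Below is a minimal sketch, assuming the file path from the import command above; the class and helper names are mine, not part of any library:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomAwareReader {

    // Java's UTF-8 decoder passes a leading BOM through as U+FEFF,
    // so it has to be dropped by hand.
    static String firstLineWithoutBom(String file) throws IOException {
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
                line = line.substring(1); // strip the BOM from the header line
            }
            return line;
        }
    }

    public static void main(String[] args) throws IOException {
        // Header line now starts with localId, not \ufefflocalId.
        System.out.println(firstLineWithoutBom("C:\\1.csv"));
    }
}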
The Solution
Actually the fix is simple: just drop the -encoding parameter. The default encoding of out-file is Unicode (UTF-16 LE with a BOM), which works fine with Java here. If we are sure all content is in the ASCII range, we can also specify -encoding ascii.
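To verify what an encoding setting actually produced, check the first bytes of the file: a UTF-8 BOM is the sequence EF BB BF, and UTF-16 LE starts with FF FE. A minimal sketch (class name is mine, path as above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {
    public static void main(String[] args) throws IOException {
        byte[] head = Files.readAllBytes(Paths.get("C:\\1.csv"));
        // EF BB BF => UTF-8 with BOM; FF FE => UTF-16 LE with BOM
        boolean utf8Bom = head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        boolean utf16LeBom = head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE;
        System.out.println("UTF-8 BOM: " + utf8Bom + ", UTF-16 LE BOM: " + utf16LeBom);
    }
}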
Resources
Byte order mark
Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)