Programmer: Lifelong Learning: PowerShell and Java: Stumbling over UTF-8 with or without BOM

The Problem
While writing a PowerShell script to clean a CSV file by removing invalid records, I mistakenly added -encoding utf8 when using out-file to write the result to the final CSV.

Then I ran the following command to import the CSV file into Solr:
http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true
It generates a unique id, docid, by concatenating f1, f2, and the first column of the CSV file, localId.
But to my surprise, there was only one document in Solr, with docid 12.
The Luke handler at http://localhost:8080/solr/cvcorefla4/admin/luke shows:

  <int name="numDocs">1</int>
  <int name="maxDoc">16420</int>
  <int name="deletedDocs">16419</int>
  <long name="version">2521</long>

I ran http://localhost:8080/solr/select?q=* and copied the response into a new file in Notepad++ with encoding UTF-8. Everything seemed fine, but when I changed the file encoding to ASCII, it looked like this:

  <str name="docid">12</str>
  <arr name="ï»¿id">
  <str>f0e662cefe56a31c6eec5d53e64f988d</str>
  </arr>

Notice the garbled characters in front of id: ï»¿id. In UTF-8 they are invisible. Also, the field is not a string as expected, but an array of strings.

So I wrote a simple Java application to view the real content of "id":

  public void testUnicode() {
    // Paste the field name copied from the Solr response between the quotes;
    // any invisible prefix is copied along with it.
    String str = "id";
    for (int i = 0; i < str.length(); i++) {
      char c = str.charAt(i);
      System.out.println(c);                                 // the character itself
      System.out.println((int) c);                           // its numeric value
      System.out.println(escapeNonAscii(String.valueOf(c))); // escaped form
    }
    System.out.println("***************");
    System.out.println(str.length());
    System.out.println(str.hashCode());
    System.out.println(escapeNonAscii(str));
    System.out.println("***************");
  }

  // Replaces every non-ASCII code point with a backslash-u hex escape.
  private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); ) {
      int cp = str.codePointAt(i);
      i += Character.charCount(cp); // step over both halves of a surrogate pair
      if (cp < 128) {
        retStr.appendCodePoint(cp);
      } else {
        retStr.append(String.format("\\u%04x", cp));
      }
    }
    return retStr.toString();
  }

The invisible prefix is \ufeff. U+FEFF is the byte order mark (BOM).
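The garbled ï»¿ seen earlier is what the BOM looks like when its UTF-8 bytes are shown one byte per character. Here is a minimal sketch of that; the class and method names are made up for illustration, and ISO-8859-1 stands in for whatever single-byte encoding the editor actually applied (Windows-1252 maps these three bytes to the same characters).

```java
import java.nio.charset.StandardCharsets;

class BomMojibake {
    // The three bytes that encode U+FEFF in UTF-8.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Decodes raw bytes one byte per character, as a single-byte viewer would.
    static String viewAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String asUtf8 = new String(UTF8_BOM, StandardCharsets.UTF_8);
        String asLatin1 = viewAsLatin1(UTF8_BOM);

        // Decoded as UTF-8, the bytes form a single invisible character, U+FEFF.
        System.out.println(asUtf8.length() + " char, code " + Integer.toHexString(asUtf8.charAt(0)));
        // Decoded byte by byte, they form three visible characters: U+00EF, U+00BB, U+00BF.
        System.out.println(asLatin1.length() + " chars");
    }
}
```

So the same three bytes are either one invisible character or three visible ones, depending only on the encoding used to display them.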

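So now the problem is fairly obvious:
out-file -encoding utf8
actually writes UTF-8 with a BOM. Java's UTF-8 decoder does not strip the BOM when reading a file; it passes it through as the character U+FEFF. This causes the problem: to Java, the first column in the first line is \ufefflocalId, not localId. Presumably that is also why only one document survived: with the localId part missing, every row got the same docid, 12, and each document overwrote the previous one.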
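If the file cannot be regenerated, the reading side can drop the BOM itself. The following is only a sketch of that idea, not what Solr does internally; stripBom is a hypothetical helper, and "name" is a made-up second column used to build a sample header line.

```java
import java.nio.charset.StandardCharsets;

class BomHeader {
    // Removes one leading U+FEFF if present; otherwise returns the input unchanged.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // The start of a CSV file written as UTF-8 with BOM: EF BB BF, then "localId,name".
        byte[] fileStart = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                'l', 'o', 'c', 'a', 'l', 'I', 'd', ',', 'n', 'a', 'm', 'e'};

        // Java's UTF-8 decoder keeps the BOM as the first character of the string.
        String header = new String(fileStart, StandardCharsets.UTF_8);
        String firstColumn = header.split(",")[0];

        System.out.println(firstColumn.equals("localId"));           // false
        System.out.println(stripBom(firstColumn).equals("localId")); // true
    }
}
```

Only the very first character of the file needs this treatment. A U+FEFF anywhere else is ordinary content and is left alone.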

The Solution

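Actually the fix is simple: drop -encoding utf8. The default encoding of out-file is Unicode, which worked fine with Java in my case. If we are sure all content is in the ASCII range, we can also specify -encoding ascii. (This describes Windows PowerShell; in PowerShell 6 and later, utf8 means UTF-8 without a BOM and is the default.)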
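To check what out-file really wrote, it is enough to look at the first few bytes of the file. Below is a minimal sketch under a few assumptions: BomSniffer and detectBom are made-up names, only the UTF-8 and UTF-16 BOMs are recognized (a UTF-32LE file would be reported as UTF-16LE), and the path is taken from the first command-line argument.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class BomSniffer {
    // Names the BOM at the start of the given bytes, or "none" if no known BOM is found.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java BomSniffer <file>");
            return;
        }
        byte[] head = new byte[3];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(head); // -1 for an empty file
        }
        System.out.println(detectBom(Arrays.copyOf(head, Math.max(n, 0))));
    }
}
```

Run against C:\1.csv, this would be expected to report UTF-8 for the file written with -encoding utf8 in Windows PowerShell, and none for one written with -encoding ascii.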

Resource
Byte order mark
Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)
