PowerShell and Java: Stumbling over the UTF-8 BOM


The Problem
While writing a PowerShell script to clean a CSV file by removing invalid records, I mistakenly added -Encoding utf8 when using Out-File to write the result to the final CSV.

Then I ran the following command to import the CSV file into Solr:
http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true
Solr generates a unique id, docid, by concatenating f1, f2, and the first column of the CSV file, localId.
But to my surprise, there was only one document in Solr, with docid 12.
Checking http://localhost:8080/solr/cvcorefla4/admin/luke shows:
  <int name="numDocs">1</int>
  <int name="maxDoc">16420</int>
  <int name="deletedDocs">16419</int>
  <long name="version">2521</long>


Running http://localhost:8080/solr/select?q=* and copying the response to a new file in Notepad++ with UTF-8 encoding, everything seems fine. But when I change the file encoding to ASCII, it looks like this:
  <str name="docid">12</str>
  <arr name="id">
  <str>f0e662cefe56a31c6eec5d53e64f988d</str>
  </arr>
Notice the garbled invisible character before the field name id. Also, the field is not the expected string but an array of strings.

So I wrote a simple Java application to view the real content of "id":

  public void testUnicode() {
    // Pasted from the Solr response; the leading BOM is invisible in the editor.
    String str = "\uFEFFid";
    for (int i = 0; i < str.length(); i++) {
      System.out.println(str.charAt(i));
      System.out.println((int) str.charAt(i));
      System.out.println(escapeNonAscii(str.charAt(i) + ""));
    }
    System.out.println("***************");
    System.out.println(str.length());
    System.out.println(str.hashCode());
    System.out.println(escapeNonAscii(str));
    System.out.println("***************");
  }

  private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
      int cp = Character.codePointAt(str, i);
      int charCount = Character.charCount(cp);
      if (charCount > 1) {
        i += charCount - 1; // skip the low surrogate of a pair
        if (i >= str.length()) {
          throw new IllegalArgumentException("truncated unexpectedly");
        }
      }
      if (cp < 128) {
        retStr.appendCodePoint(cp);
      } else {
        retStr.append(String.format("\\u%x", cp));
      }
    }
    return retStr.toString();
  }
The invisible prefix is \ufeff. U+FEFF is the byte order mark (BOM). Now the problem is obvious: Out-File -Encoding utf8 actually writes UTF-8 with a BOM, but Java reads the file as UTF-8 without a BOM, keeping U+FEFF as an ordinary character. So to Java, the first column name in the first line is \ufefflocalId, not localId.
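A defensive option on the Java side is to skip a leading BOM explicitly when reading such a file. A minimal sketch (the stripBom helper is hypothetical, not part of any library):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BomSafeReader {
    // Hypothetical helper: consume a single leading U+FEFF if present.
    static BufferedReader stripBom(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        reader.mark(1);
        int first = reader.read();
        if (first != 0xFEFF) {
            reader.reset(); // no BOM: rewind to the start
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Simulates the first line of the CSV as UTF-8-with-BOM decoding produces it.
        BufferedReader r = stripBom(new StringReader("\uFEFFlocalId,other\n"));
        System.out.println(r.readLine()); // localId,other
    }
}
```

In a real pipeline the Reader would wrap an InputStreamReader over the CSV file; the mark/reset dance keeps the stream intact when there is no BOM.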
The Solution
Actually the fix is simple: the default encoding of Out-File is Unicode (UTF-16 LE), which works fine with Java. If we are sure all characters are in the ASCII range, we can also specify -Encoding ascii.
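The reason UTF-16 works while UTF-8 does not comes down to how Java's decoders treat the BOM: the "UTF-16" charset consumes a leading BOM and uses it to detect byte order, while the "UTF-8" decoder leaves U+FEFF in the decoded text. A small sketch illustrating the difference:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String header = "localId";

        // UTF-16 LE bytes with a BOM, as Out-File's default Unicode encoding produces.
        byte[] utf16 = ("\uFEFF" + header).getBytes(StandardCharsets.UTF_16LE);
        // Java's UTF-16 decoder consumes the BOM and detects the byte order.
        System.out.println(new String(utf16, StandardCharsets.UTF_16).equals(header)); // true

        // UTF-8 bytes with a BOM: the decoder keeps U+FEFF as an ordinary character.
        byte[] utf8 = ("\uFEFF" + header).getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(header)); // false
    }
}
```

So a UTF-16 file with a BOM round-trips cleanly into Java, while a UTF-8 file with a BOM smuggles the extra character into the first field name.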

Resource
Byte order mark
Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)
