PowerShell and Java: Tripped Up by UTF-8 With or Without a BOM

The Problem
While writing a PowerShell script that cleans a CSV file by removing invalid records, I mistakenly added -Encoding utf8 to the Out-File call that writes the final CSV.

Then I ran the following request to import the CSV file into Solr:
http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true
Solr generates a unique id, docid, by concatenating f1, f2, and the first column of the CSV file, localId.
To my surprise, there was only one document in Solr, with docid 12.
The Luke handler at http://localhost:8080/solr/cvcorefla4/admin/luke showed:
  <int name="numDocs">1</int>
  <int name="maxDoc">16420</int>
  <int name="deletedDocs">16419</int>
  <long name="version">2521</long>


I ran http://localhost:8080/solr/select?q=* and copied the response into a new Notepad++ file with UTF-8 encoding. Everything seemed fine, but when I changed the file encoding to ANSI, it looked like this:
  <str name="docid">12</str>
  <arr name="id">
  <str>f0e662cefe56a31c6eec5d53e64f988d</str>
  </arr>
Notice the stray characters in front of id in the field name: they are invisible in the UTF-8 view and only show up in the ANSI view. Also, the field is not the expected string but an array of strings.
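
To illustrate why the stray characters only appear in a single-byte view, here is a minimal sketch. It assumes the three extra bytes are the UTF-8 BOM (EF BB BF) and uses ISO-8859-1 as a stand-in for the editor's ANSI code page, which maps these three bytes to the same characters. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

public class BomAsLatin1 {
    // The three bytes of a UTF-8 encoded byte order mark.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Decode every byte as one character, roughly what a single-byte "ANSI" view does.
    static String viewAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] id = "id".getBytes(StandardCharsets.US_ASCII);
        byte[] raw = new byte[UTF8_BOM.length + id.length];
        System.arraycopy(UTF8_BOM, 0, raw, 0, UTF8_BOM.length);
        System.arraycopy(id, 0, raw, UTF8_BOM.length, id.length);

        // Five characters: U+00EF, U+00BB, U+00BF, then 'i' and 'd'.
        String seen = viewAsLatin1(raw);
        System.out.println(seen.length());
        System.out.println(seen);
    }
}
```

The same bytes decoded as UTF-8 give a single, zero-width character, which is why the UTF-8 view looks clean.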

So I wrote a simple Java test to look at the real content of "id":

  public void testUnicode() {
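    String str = "\ufeffid"; // field name copied from the Solr response; the invisible first character is written as an escape here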
    for (int i = 0; i < str.length(); i++) {
      System.out.println(str.charAt(i));
      System.out.println((int) str.charAt(i));
      System.out.println(escapeNonAscii(str.charAt(i) + ""));
    }
    System.out.println("***************");
    System.out.println(str.length());
    System.out.println(str.hashCode());
    System.out.println(escapeNonAscii(str));
    System.out.println("***************");
  }
  private static String escapeNonAscii(String str) {
    
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
      int cp = Character.codePointAt(str, i);
      int charCount = Character.charCount(cp);
      if (charCount > 1) {
        i += charCount - 1; // skip the second char of a surrogate pair
        if (i >= str.length()) {
          throw new IllegalArgumentException("truncated unexpectedly");
        }
      }      
      if (cp < 128) {
        retStr.appendCodePoint(cp);
      } else {
        retStr.append(String.format("\\u%04x", cp)); // pad to at least four hex digits
      }
    }
    return retStr.toString();
  }
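The invisible prefix is \ufeff. U+FEFF is the byte order mark (BOM), so the problem is now fairly obvious: Out-File -Encoding utf8 actually writes UTF-8 with a BOM, but Java's UTF-8 decoder does not strip a BOM when reading a file and passes it through as the character U+FEFF. To Java, the first column name in the first line is therefore \ufefflocalId, not localId.
Presumably Solr then found no localId column, so docid was built from f1 and f2 only. Every row got the same docid, 12, and each document overwrote the previous one, which matches the numDocs and deletedDocs values above.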
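
The effect can be reproduced without a file. The following is a small sketch, assuming a header line that consists of the BOM bytes followed by localId; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Bytes of a header cell as a BOM-writing tool would produce them: EF BB BF + "localId".
    static byte[] headerWithBom() {
        byte[] name = "localId".getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[3 + name.length];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(name, 0, out, 3, name.length);
        return out;
    }

    // Plain UTF-8 decoding; the BOM is kept as the character U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String header = decodeUtf8(headerWithBom());
        System.out.println(header.length());          // 8, not 7
        System.out.println((int) header.charAt(0));   // 65279, i.e. 0xFEFF
        System.out.println(header.equals("localId")); // false
    }
}
```

Because the decoded name is one character longer than localId, any lookup by the plain column name fails even though both print identically.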
The Solution
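The fix is simple: drop -Encoding utf8. In Windows PowerShell the default encoding of Out-File is Unicode (UTF-16LE), which worked fine with Java in my case. If we are sure all the content is in the ASCII range, we can also specify -Encoding ascii. In PowerShell 6 and later, -Encoding utf8NoBOM is also available and is the default.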
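
When the producer of the file cannot be changed, the BOM can also be dropped on the Java side. This is only a sketch for your own reading code, not something Solr does with stream.file; BomStripper, stripBom and openSkippingBom are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomStripper {
    private static final char BOM = '\uFEFF';

    // Remove a single leading U+FEFF from an already decoded string.
    static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
            return s.substring(1);
        }
        return s;
    }

    // Open a UTF-8 file and position the reader after the BOM, if there is one.
    static BufferedReader openSkippingBom(Path file) throws IOException {
        BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        reader.mark(1);
        if (reader.read() != BOM) {
            reader.reset(); // first character was real data (or the file is empty)
        }
        return reader;
    }

    public static void main(String[] args) {
        String header = stripBom("\uFEFFlocalId");
        System.out.println(header.equals("localId")); // true
    }
}
```

Only the first character is checked, so a U+FEFF that appears later in the data is left alone.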

Resource
Byte order mark
Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)