Using the Eclipse Display View While Debugging to Fix a Real Problem

Today a colleague on another team told me that importing a CSV file into a Solr server failed with the exception below.

We know the cause is an invalid character in that line, but the line is so long that the error log alone does not show which characters are responsible.

SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt,can't read line: 12450
        values={NO LINES AVAILABLE}
        at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
Caused by: java.io.IOException: (line 12450) invalid char between encapsulated token end delimiter.
     at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)
So I enabled remote debugging, set a breakpoint at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481), and re-posted the CSV file containing only the header and the failing line.
private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
    // save current line
    int startLineNumber = getLineNumber();
    for (;;) {
      c = in.read();
      if (c == '\\' && strategy.getUnicodeEscapeInterpretation() && 
      in.lookAhead()=='u') {
        tkn.content.append((char) unicodeEscapeLexer(c));
      } else if (c == strategy.getEscape()) {
        tkn.content.append((char)readEscape(c));
      } else if (c == strategy.getEncapsulator()) {
  ...
      } else if (isEndOfFile(c)) {
        // error condition (end of file before end of token)
        throw new IOException( // add a breakpoint here.
                "(startline " + startLineNumber + ")"
                        + "eof reached before encapsulated token finished"
        );
      } 
   ...
    }
  }
Now I know that the current character c is 'a', but that is not enough.
I could step through the method until it throws the exception, but the line has so many characters that there is no telling how long that would take.

Fortunately, when execution is paused at a breakpoint, we can run arbitrary Java code in the Display view, in the context of the current stack frame.

So we can enter the following code in the Display view:
return "" + String.valueOf((char)c) +  
    String.valueOf((char)in.read()) +  String.valueOf((char)in.read());
Then we select the code and click "Display Result of Evaluating Selected Text". The output is:
(java.lang.String) an,
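One caveat: each in.read() evaluated in the Display view really consumes a character from the stream, so the paused request will not continue exactly as it would have. That is fine for a throwaway debugging session. If you want to look ahead without consuming, a reader that supports mark/reset can do it. Below is a minimal sketch using a plain java.io.BufferedReader. The class and method names (PeekAhead, peek) are made up for illustration, and it is only an assumption that a given reader supports marking; Solr's own reader may not, so check in.markSupported() first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Solr: look at upcoming characters
// of a BufferedReader without consuming them.
class PeekAhead {

    /** Returns up to n upcoming characters and leaves the reader position unchanged. */
    static String peek(BufferedReader in, int n) {
        try {
            in.mark(n);                       // remember the current position
            char[] buf = new char[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break;                    // end of stream
                }
                total += read;
            }
            in.reset();                       // go back to the marked position
            return new String(buf, 0, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("an,xxxx\"@"));
        int c = in.read();                            // consumes 'a', like the lexer did
        System.out.println((char) c + peek(in, 2));   // prints: an,
        System.out.println((char) in.read());         // prints: n (nothing was consumed by peek)
    }
}
```

In the Display view itself you would paste only the body of peek as statements, ending with a return.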

Now search for "an," in the CSV file, which finds:
|  | "an,xxxx"@

Now the cause is obvious: this text is part of the from field, and it contains unescaped double quotes.
According to the CSV standard, when double quotes are used to enclose fields, a double quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
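The escaping rule can be sketched in a few lines of Java. This is only an illustration of the rule, not the code that generated our CSV; the names CsvQuote and quote are hypothetical.

```java
// Hypothetical helper illustrating the CSV quoting rule:
// wrap the field in double quotes and double every embedded double quote.
class CsvQuote {

    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String line = String.join(",", quote("aaa"), quote("b\"bb"), quote("ccc"));
        System.out.println(line); // prints: "aaa","b""bb","ccc"
    }
}
```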

For more information, see "Import CSV that Contains Double-Quotes into Solr".

The fix is simple: change the data to |  | ""an,xxxx""@, and the import works.
Next we need to fix the code that generates the CSV, but that is off topic.
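For next time, a small standalone check can locate this kind of problem without a debugger. The sketch below scans one line and reports the index of the first character that follows a closing double quote but is not the separator, which is roughly the condition behind "invalid char between encapsulated token end delimiter". It is a simplified approximation, not Solr's parser: it ignores whitespace handling, escape characters and multi-line fields, and the names CsvLineCheck and firstInvalidCharAfterQuote are made up. The sample line in main is invented around the "an,xxxx"@ fragment.

```java
// Hypothetical, simplified check (not Solr's CSVParser): find the first character
// that directly follows a closing quote and is not the field separator.
class CsvLineCheck {

    /** Returns the index of the offending character, or -1 if none is found. */
    static int firstInvalidCharAfterQuote(String line, char sep, char quote) {
        int i = 0;
        int n = line.length();
        while (i < n) {
            if (line.charAt(i) == quote) {
                i++;                                   // skip the opening quote
                boolean closed = false;
                while (i < n) {
                    if (line.charAt(i) == quote) {
                        if (i + 1 < n && line.charAt(i + 1) == quote) {
                            i += 2;                    // doubled quote: escaped, stay in field
                            continue;
                        }
                        closed = true;
                        i++;                           // skip the closing quote
                        break;
                    }
                    i++;
                }
                if (!closed) {
                    return -1;                         // unterminated field: a different error
                }
                if (i < n && line.charAt(i) != sep) {
                    return i;                          // something other than the separator
                }
                i++;                                   // skip the separator
            } else {
                while (i < n && line.charAt(i) != sep) {
                    i++;                               // unquoted field: run to the separator
                }
                i++;                                   // skip the separator
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String bad = "\"foo | \"an,xxxx\"@bar\",next";   // invented sample line
        int pos = firstInvalidCharAfterQuote(bad, ',', '"');
        System.out.println(pos + " -> " + bad.charAt(pos)); // prints: 8 -> a
        System.out.println(firstInvalidCharAfterQuote("\"aaa\",\"b\"\"bb\",\"ccc\"", ',', '"')); // prints: -1
    }
}
```

The reported character 'a' matches the value of c seen at the breakpoint.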

[Screenshot: the Eclipse window with the Display view]
