Improve Solr CSVParser to Log Invalid Characters

Recently, a colleague on another team told me that importing a CSV file into a Solr server failed with the following exception:
SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt, line=1,can't read line: 12450
        values={NO LINES AVAILABLE}
        at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
        at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
  ...
Caused by: java.io.IOException: (line 12450) invalid char between encapsulated token end delimiter.
        at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
        at org.apache.solr.internal.csv.CSVParser.nextToken(CSVParser.java:359)
        at org.apache.solr.internal.csv.CSVParser.getLine(CSVParser.java:231)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)
At the time, I enabled remote debugging and used the Eclipse Display view to inspect the invalid character and the two characters after it. I then searched the CSV file for them and found the cause: the value of the from field contains a double quote ("): |  | "an,xxxx"
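If the CSV file can be regenerated, one common remedy is to double every embedded double quote before wrapping the value in quotes, since CSV readers that follow the usual quoting convention read a doubled quote inside a quoted field as one literal quote. The sketch below only illustrates that convention; the class name CsvQuoteDemo, the method quoteField, and the sample value are hypothetical and are not part of Solr.

```java
class CsvQuoteDemo {

    // Hypothetical helper: wraps a value in double quotes and doubles
    // every embedded double quote so it is not read as the closing quote.
    static String quoteField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Made-up sample value with quotes and a comma inside the field.
        // prints "say ""an,xxxx"" here"
        System.out.println(quoteField("say \"an,xxxx\" here"));
    }
}
```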
For more information, please read: 
Use Eclipse Display View While Debugging to Fix Real Problem
Import CSV that Contains Double-Quotes into Solr

This prompted me to change Solr's code so that the next time a similar problem occurs, we can identify the cause directly from the log instead of attaching a remote debugger again.
The changed code is shown below:
private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
  for (;;) {
    c = in.read();
    // ... (unchanged code omitted)
    else if (c == strategy.getEncapsulator()) {
      if (in.lookAhead() == strategy.getEncapsulator()) {
        // ... (unchanged code omitted)
      } else {
        for (;;) {
          c = in.read();
          // ... (unchanged code omitted)
          else if (!isWhitespace(c)) {
            // error: invalid char between token and next delimiter
            throw new IOException(
                "(line "
                + getLineNumber()
                + ") invalid char between encapsulated token end delimiter, invalid char: "
                + String.valueOf((char) c) + ", context " + getContextChars(c));
          }
        }
      }
    }
  }
}
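// New method: return the invalid character followed by up to 3 more characters
private String getContextChars(int c) {
  StringBuilder moreChars = new StringBuilder();
  moreChars.append((char) c);
  for (int count = 0; count < 3; count++) {
    try {
      int tmpc = in.read();
      if (tmpc == -1) {
        // end of stream: nothing more to read
        break;
      }
      moreChars.append((char) tmpc);
    } catch (IOException e) {
      // best effort only: report whatever was read so far
      break;
    }
  }
  return moreChars.toString();
}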
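To try the idea outside Solr, here is a minimal standalone sketch of the same context-capturing technique against a plain java.io.Reader. The class name ContextCharsDemo, the method readContext, and the sample input are assumptions made for illustration; Solr's parser reads from its own internal reader class instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ContextCharsDemo {

    // Returns the offending character followed by up to `extra` more
    // characters from the reader. Stops early at end of input
    // (read() returns -1) or if reading fails.
    static String readContext(Reader in, int offending, int extra) {
        StringBuilder sb = new StringBuilder();
        sb.append((char) offending);
        for (int i = 0; i < extra; i++) {
            int next;
            try {
                next = in.read();
            } catch (IOException e) {
                break; // best effort: keep what was read so far
            }
            if (next == -1) {
                break; // end of input
            }
            sb.append((char) next);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up input: pretend the parser has just consumed a closing
        // quote and the next character, 'x', is the invalid one.
        Reader in = new StringReader("xyz,rest");
        int offending = in.read();
        // prints xyz,
        System.out.println(readContext(in, offending, 3));
    }
}
```

Reading ahead consumes characters from the stream, which is acceptable here only because the parser is about to throw and will not resume parsing.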