Improve Solr CSVParser to Log Invalid Characters


Recently, a colleague on another team told me that importing a CSV file into a Solr server failed with the following exception:
SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt, line=1,can't read line: 12450
        values={NO LINES AVAILABLE}
        at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
        at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
  ...
Caused by: java.io.IOException: (line 12450) invalid char between encapsulated token end delimiter.
        at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
        at org.apache.solr.internal.csv.CSVParser.nextToken(CSVParser.java:359)
        at org.apache.solr.internal.csv.CSVParser.getLine(CSVParser.java:231)
        at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)
At the time, I enabled remote debugging and used the Eclipse Display view to print the invalid character plus the two characters after it, then searched the CSV file for that text. The cause turned out to be a double quote (") inside the value of the from field: |  | "an,xxxx"
For more information, please read: 
Use Eclipse Display View While Debugging to Fix Real Problem
Import CSV that Contains Double-Quotes into Solr
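
To illustrate the cause: with the double quote as the encapsulator, a quote inside a quoted value has to be written twice, which is the case the lookAhead() check in encapsulatedTokenLexer handles. Below is a minimal sketch of quoting a value before it is written to the CSV file. The class and method names are hypothetical and not part of Solr, and it assumes the encapsulator is the double quote.

```java
// Hypothetical helper for illustration only; not part of Solr.
class CsvQuoteDemo {

  // Wraps a raw field value in double quotes and doubles every embedded
  // double quote, so the parser reads each pair as one literal quote.
  static String quoteField(String raw) {
    return "\"" + raw.replace("\"", "\"\"") + "\"";
  }

  public static void main(String[] args) {
    // Prints the value with its inner quotes doubled and outer quotes added.
    System.out.println(quoteField("say \"hi\", then leave"));
  }
}
```

A value quoted this way can contain both commas and double quotes without triggering the "invalid char between encapsulated token end delimiter" error.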

This led me to change Solr's code so that the next time a similar problem occurs, we can find the cause directly in the log instead of having to debug remotely again.
The modified code is shown below, abridged to the relevant branch:
private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
  for (;;) {
    c = in.read();
    // ... other branches omitted ...
    else if (c == strategy.getEncapsulator()) {
      if (in.lookAhead() == strategy.getEncapsulator()) {
        // ... omitted ...
      } else {
        for (;;) {
          c = in.read();
          // ... other branches omitted ...
          else if (!isWhitespace(c)) {
            // error: invalid char between token and next delimiter.
            // Changed: the message now includes the invalid char and some context.
            throw new IOException(
                "(line "
                + getLineNumber()
                + ") invalid char between encapsulated token end delimiter, invalid char: "
                + String.valueOf((char) c) + ", context " + getContextChars(c));
          }
        }
      }
    }
    // ... other branches omitted ...
  }
}
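// New method: reads up to 3 more characters after the invalid one, for context.
private String getContextChars(int c) {
  StringBuilder moreChars = new StringBuilder();
  moreChars.append((char) c);
  for (int count = 0; count < 3; count++) {
    try {
      int tmpc = in.read();
      if (tmpc == -1) {
        // End of stream: stop instead of appending (char) -1 to the message.
        break;
      }
      moreChars.append((char) tmpc);
    } catch (IOException e) {
      // Best effort: report whatever was read so far.
      break;
    }
  }
  return moreChars.toString();
}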
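
The same idea can be tried outside Solr. The sketch below is a stand-alone approximation that works on a plain java.io.Reader instead of the parser's internal reader. The class name, the method name, and the maxExtra parameter are hypothetical and exist only for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-alone demo; not part of Solr.
class ContextCharsDemo {

  // Returns the offending character followed by up to maxExtra characters
  // read from the reader, stopping early at end of stream.
  static String contextChars(Reader in, int offending, int maxExtra) {
    StringBuilder sb = new StringBuilder();
    sb.append((char) offending);
    try {
      for (int i = 0; i < maxExtra; i++) {
        int next = in.read();
        if (next == -1) { // end of stream: nothing more to show
          break;
        }
        sb.append((char) next);
      }
    } catch (IOException e) {
      // Best effort only: return whatever was collected so far.
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // 'a' plays the role of the invalid character already consumed by the lexer.
    System.out.println(contextChars(new StringReader("bcdef"), 'a', 3)); // prints "abcd"
    System.out.println(contextChars(new StringReader("b"), 'a', 3));     // prints "ab"
  }
}
```

Reading ahead consumes characters from the stream. That is acceptable here only because the parser is about to throw and abandon the input anyway.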
