Recently, a colleague on another team told me that importing a CSV file into a Solr server failed with the following exception:
SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt, line=1,can't read line: 12450
values={NO LINES AVAILABLE}
at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
at org.apache.solr.handler.loader.CSVLoader.load(CSVLoader.java:31)
...
Caused by: java.io.IOException: (line 12450) invalid char between encapsulated token end delimiter.
at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
at org.apache.solr.internal.csv.CSVParser.nextToken(CSVParser.java:359)
at org.apache.solr.internal.csv.CSVParser.getLine(CSVParser.java:231)
at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)
At the time, I enabled remote debugging and used the Eclipse Display view to inspect the invalid character and the two characters after it. I then searched the CSV file for that sequence and found the cause: the value of the from field contained a double-quote character ("): | | "an,xxxx"
For more information, please read:
Use Eclipse Display View While Debugging to Fix Real Problem
Import CSV that Contains Double-Quotes into Solr
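The rule that the parser enforces here can be illustrated with a small standalone check. This is only a sketch, not Solr's code: the class QuoteCheck, the method firstInvalidIndex, and the sample line are all hypothetical, and the whitespace handling is a simplifying assumption. It reports the position of the first character that follows a closing quote without being a delimiter or whitespace, which is the situation the exception describes.

```java
final class QuoteCheck {
    private static final int FIELD_START = 0, UNQUOTED = 1, QUOTED = 2, AFTER_QUOTE = 3;

    /**
     * Returns the index of the first character that illegally follows a
     * closing quote, or -1 if the line has no such character.
     * Simplified on purpose: one line, doubled quotes as the only escape.
     */
    static int firstInvalidIndex(String line, char delimiter, char quote) {
        int state = FIELD_START;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            switch (state) {
                case FIELD_START:
                    if (ch == delimiter) {
                        // empty field, stay at field start
                    } else if (ch == quote) {
                        state = QUOTED;
                    } else if (!Character.isWhitespace(ch)) {
                        state = UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (ch == delimiter) state = FIELD_START;
                    break;
                case QUOTED:
                    if (ch == quote) {
                        if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                            i++; // doubled quote: an escaped quote inside the field
                        } else {
                            state = AFTER_QUOTE; // closing quote
                        }
                    }
                    break;
                default: // AFTER_QUOTE: only a delimiter or whitespace may follow
                    if (ch == delimiter) state = FIELD_START;
                    else if (!Character.isWhitespace(ch)) return i;
                    break;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical line: the quoted field contains a bare quote before "an".
        String bad = "1,\"| | \"an,xxxx\"\",2";
        System.out.println(firstInvalidIndex(bad, ',', '"'));
    }
}
```

Doubling the inner quotes, as in "| | ""an,xxxx""", makes the same check return -1 for that line.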
That experience led me to change Solr's code so that if a similar problem happens again, we can find the cause directly from the log without another remote debugging session.
The changed code is shown below (unrelated branches omitted):
    private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
        for (;;) {
            c = in.read();
            // ... earlier branches omitted ...
            else if (c == strategy.getEncapsulator()) {
                if (in.lookAhead() == strategy.getEncapsulator()) {
                    // ... omitted ...
                } else {
                    for (;;) {
                        c = in.read();
                        // ... earlier branches omitted ...
                        else if (!isWhitespace(c)) {
                            // error: invalid char between token and next delimiter
                            throw new IOException(
                                "(line " + getLineNumber()
                                + ") invalid char between encapsulated token end delimiter, invalid char: "
                                + String.valueOf((char) c)
                                + ", context " + getContextChars(c));
                        }
                    }
                }
            }
        }
    }

    // New method: return the invalid character plus up to 3 characters that follow it.
    private String getContextChars(int c) {
        StringBuilder moreChars = new StringBuilder().append((char) c);
        for (int count = 0; count < 3; count++) {
            try {
                int tmpc = in.read();
                if (tmpc < 0) {
                    break; // end of stream: nothing more to show
                }
                moreChars.append((char) tmpc);
            } catch (Exception e) {
                break;
            }
        }
        return moreChars.toString();
    }
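To see what the extra context looks like in the log, the same read-ahead idea can be tried outside Solr. The sketch below is a standalone approximation: it uses a plain java.io.Reader instead of Solr's internal reader, and the class CsvErrorContext and method contextChars are hypothetical names, not part of Solr.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class CsvErrorContext {
    /**
     * Returns firstChar followed by up to `extra` characters read from `in`.
     * Stops early at end of stream or on a read error.
     */
    static String contextChars(Reader in, int firstChar, int extra) {
        StringBuilder sb = new StringBuilder().append((char) firstChar);
        for (int i = 0; i < extra; i++) {
            int next;
            try {
                next = in.read();
            } catch (IOException e) {
                break; // best effort: we are already building an error message
            }
            if (next == -1) {
                break; // end of stream
            }
            sb.append((char) next);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed situation: the lexer has just read 'a' after a closing quote,
        // and the rest of the input is still unread.
        Reader rest = new StringReader("n,xxxx\"");
        System.out.println(contextChars(rest, 'a', 3)); // prints "an,x"
    }
}
```

Reading ahead consumes input, which is acceptable here only because the parser throws immediately afterwards and does not continue with that stream.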