Today, a colleague on another team told me that importing a CSV file into our Solr server failed with the following exception.
We knew it was caused by invalid (badly formatted) characters in that line, but the line is very long, so from the error log alone we couldn't easily tell which characters were the problem.
SEVERE: org.apache.solr.common.SolrException: CSVLoader: input=file:/sample.txt,can't read line: 12450
values={NO LINES AVAILABLE}
at org.apache.solr.handler.loader.CSVLoaderBase.input_err(CSVLoaderBase.java:320)
at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:359)
Caused by: java.io.IOException: (line 12450) invalid char between encapsulated token end delimiter.
at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481)
at org.apache.solr.handler.loader.CSVLoaderBase.load(CSVLoaderBase.java:356)
So I enabled remote debugging, added a breakpoint at org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(CSVParser.java:481), and re-posted the CSV file with only the header and that one line. The relevant part of encapsulatedTokenLexer looks like this:
private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
    // save current line
    int startLineNumber = getLineNumber();
    for (;;) {
        c = in.read();
        if (c == '\\' && strategy.getUnicodeEscapeInterpretation() && in.lookAhead() == 'u') {
            tkn.content.append((char) unicodeEscapeLexer(c));
        } else if (c == strategy.getEscape()) {
            tkn.content.append((char) readEscape(c));
        } else if (c == strategy.getEncapsulator()) {
            ...
        } else if (isEndOfFile(c)) {
            // error condition (end of file before end of token)
            throw new IOException(    // add a breakpoint here
                "(startline " + startLineNumber + ")"
                + "eof reached before encapsulated token finished");
        }
        ...
    }
}
I could step through the method until it hit the exception, but with so many characters in that line there was no telling how many steps that would take.
Fortunately, while execution is paused at a breakpoint, we can evaluate arbitrary code in Eclipse's Display view.
So we can enter the following line in the Display view:
return "" + String.valueOf((char)c) + String.valueOf((char)in.read()) + String.valueOf((char)in.read());Then we select the code, and click "Display Result of Evaluating Selected Text", the output would be:
(java.lang.String) an,
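If two or three characters are not enough context to search for, a slightly longer peek can be evaluated the same way. This is just a sketch built on the same idea; c and in are the variables of the paused stack frame, and the count of 20 is arbitrary:

StringBuilder ctx = new StringBuilder();
ctx.append((char) c);                  // the character that triggered the exception
for (int i = 0; i < 20; i++) {
    ctx.append((char) in.read());      // consuming more characters is harmless here,
}                                      // since this import request fails anyway
return ctx.toString();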
Now, search "an," in the csv file, find it:
| | "an,xxxx"@
Now the reason is obvious: the value "an,xxxx", together with its surrounding double quotes, is part of the from field.
Per the CSV standard, if double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
For more info, please read Import CSV that Contains Double-Quotes into Solr
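To see the escaping rule in action outside Solr, the same example line can be parsed with Apache Commons CSV, which Solr's internal CSV parser is based on, as far as I know. This is only an illustrative sketch, not part of the original import:

import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuoteRuleDemo {
    public static void main(String[] args) throws Exception {
        String line = "\"aaa\",\"b\"\"bb\",\"ccc\"";
        // CSVFormat.DEFAULT follows the standard rule: a double quote inside a
        // quoted field is escaped by doubling it
        try (CSVParser parser = CSVFormat.DEFAULT.parse(new StringReader(line))) {
            for (CSVRecord record : parser) {
                System.out.println(record.get(1));   // prints: b"bb
            }
        }
    }
}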
The fix is simple: just change the data to | | ""an,xxxx""@, and the import works.
Of course, we still need to fix the code that generates the CSV, but that's off topic here.
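That said, the generating-side rule is simple enough to sketch: double every embedded double quote, then enclose the whole field in double quotes. A minimal sketch with a hypothetical helper (quoteCsvField is a made-up name, not our real generator code):

// quoteCsvField is a hypothetical helper, not the actual generator code
static String quoteCsvField(String value) {
    // double every embedded double quote, then wrap the whole field in double quotes
    return "\"" + value.replace("\"", "\"\"") + "\"";
}

// quoteCsvField("b\"bb") returns "b""bb", matching the example above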
The window in Eclipse looks like this: