Import CSV that Contains Double-Quotes into Solr


My colleague ran into a problem when trying to import into Solr a CSV file that contains double quotes inside a column value.
Looking at the CSV Standard (RFC 4180)
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"

After adding another double quote in front of each embedded double quote, the import works.
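If the CSV file is generated by your own code, the escaping rule above can be applied before the file is written. Here is a minimal sketch; the class name CsvQuoting and its methods are made up for illustration and are not part of Solr.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr: quotes CSV fields following the
// rule that an embedded double quote is escaped by doubling it.
final class CsvQuoting {

    // Wrap the value in double quotes and double every embedded double quote.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Quote every field and join the results with commas.
    static String toCsvLine(List<String> fields) {
        return fields.stream()
                .map(CsvQuoting::quote)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // The middle field contains one embedded double quote: b"bb
        System.out.println(toCsvLine(Arrays.asList("aaa", "b\"bb", "ccc")));
    }
}
```

Running main prints the same line as the example above: "aaa","b""bb","ccc"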
Implementation in Solr
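When Solr imports a CSV file, it follows the CSV standard.
In Solr, the default encapsulator is also the double quote ("). Please refer to: Updating a Solr Index with CSV
From org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(Token, int), we can see how Solr parses a quoted value: two consecutive encapsulators are read as one literal encapsulator, while a single encapsulator ends the field. After that, any non-whitespace character before the next delimiter raises an IOException, which is why the unescaped file failed to import.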

private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
  for (;;) {
    c = in.read();
    if (c == '\\' && strategy.getUnicodeEscapeInterpretation() && in.lookAhead() == 'u') {
      tkn.content.append((char) unicodeEscapeLexer(c));
    } else if (c == strategy.getEscape()) {
      tkn.content.append((char) readEscape(c));
    } else if (c == strategy.getEncapsulator()) {
      if (in.lookAhead() == strategy.getEncapsulator()) {
        // double or escaped encapsulator -> add single encapsulator to token
        c = in.read();
        tkn.content.append((char) c);
      } else {
        // token finish mark (encapsulator) reached: ignore whitespace till delimiter
        for (;;) {
          c = in.read();
          if (c == strategy.getDelimiter()) {
            tkn.type = TT_TOKEN;
            tkn.isReady = true;
            return tkn;
          } else if (isEndOfFile(c)) {
            tkn.type = TT_EOF;
            tkn.isReady = true;
            return tkn;
          } else if (isEndOfLine(c)) {
            // ok eo token reached
            tkn.type = TT_EORECORD;
            tkn.isReady = true;
            return tkn;
          } else if (!isWhitespace(c)) {
            // error invalid char between token and next delimiter
            throw new IOException(
              "(line " + getLineNumber()
                + ") invalid char between encapsulated token end delimiter"
            );
          }
        }
      }
    } else if (isEndOfFile(c)) {
      // error condition (end of file before end of token)
      throw new IOException(
        "(startline " + startLineNumber + ")"
          + "eof reached before encapsulated token finished"
      );
    } else {
      // consume character
      tkn.content.append((char) c);
    }
  }
}
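
To see the doubled-encapsulator branch in isolation, here is a simplified stand-alone sketch of the same idea. It is not Solr's code: the class SimpleCsvLine is made up for illustration, it hard-codes " as the encapsulator and , as the delimiter, handles a single line only, and skips the error checks that Solr performs (it does not throw on stray characters after a closing quote or on an unterminated field).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified parser for one CSV line; not Solr's implementation.
final class SimpleCsvLine {

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        // Doubled quote: keep one literal quote, skip the second.
                        current.append('"');
                        i++;
                    } else {
                        // Single quote: the quoted section ends here.
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                // Opening quote of a quoted section.
                inQuotes = true;
            } else if (c == ',') {
                // Delimiter outside quotes: the current field is complete.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last field has no trailing delimiter.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Prints the three fields aaa, b"bb and ccc on separate lines.
        for (String field : parse("\"aaa\",\"b\"\"bb\",\"ccc\"")) {
            System.out.println(field);
        }
    }
}
```

The look-ahead at index i + 1 plays the role of in.lookAhead() in the Solr method: it is what lets the parser tell an escaped quote apart from the end of the field.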
