Import CSV that Contains Double-Quotes into Solr

A colleague ran into a problem while importing into Solr a CSV file that contains double quotes inside a column value.
The CSV standard (RFC 4180) says:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"

After adding another double quote in front of each double quote inside the value, the import works.
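The fix can be applied when the CSV file is generated. The following is a minimal sketch, assuming the file is produced by your own Java code; the class name CsvQuoting and the method quote are hypothetical helpers and not part of Solr.

```java
final class CsvQuoting {
    // Hypothetical helper: encloses a field in double quotes and escapes
    // every embedded double quote by doubling it.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // The middle value contains one double quote: b"bb
        String line = String.join(",", quote("aaa"), quote("b\"bb"), quote("ccc"));
        System.out.println(line); // prints "aaa","b""bb","ccc"
    }
}
```

Quoting every field, even those without special characters, keeps the helper simple and is still valid CSV.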
Implementation in Solr
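When Solr imports a CSV file, it follows the CSV standard.
In Solr, the default encapsulator is also ". Please refer to: Updating a Solr Index with CSV
The method org.apache.solr.internal.csv.CSVParser.encapsulatedTokenLexer(Token, int) shows how Solr parses an encapsulated value: when it reads an encapsulator that is immediately followed by a second one, it appends a single encapsulator to the value; otherwise it treats the encapsulator as the end of the value.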

private Token encapsulatedTokenLexer(Token tkn, int c) throws IOException {
  // remember the line where this token starts, for the EOF error message
  int startLineNumber = getLineNumber();
  for (;;) {
    c = in.read();
    if (c == '\\' && strategy.getUnicodeEscapeInterpretation() && in.lookAhead() == 'u') {
      tkn.content.append((char) unicodeEscapeLexer(c));
    } else if (c == strategy.getEscape()) {
      tkn.content.append((char) readEscape(c));
    } else if (c == strategy.getEncapsulator()) {
      if (in.lookAhead() == strategy.getEncapsulator()) {
        // double or escaped encapsulator -> add single encapsulator to token
        c = in.read();
        tkn.content.append((char) c);
      } else {
        // token finish mark (encapsulator) reached: ignore whitespace till delimiter
        for (;;) {
          c = in.read();
          if (c == strategy.getDelimiter()) {
            tkn.type = TT_TOKEN;
            tkn.isReady = true;
            return tkn;
          } else if (isEndOfFile(c)) {
            tkn.type = TT_EOF;
            tkn.isReady = true;
            return tkn;
          } else if (isEndOfLine(c)) {
            // ok eo token reached
            tkn.type = TT_EORECORD;
            tkn.isReady = true;
            return tkn;
          } else if (!isWhitespace(c)) {
            // error invalid char between token and next delimiter
            throw new IOException(
              "(line " + getLineNumber()
                + ") invalid char between encapsulated token end delimiter"
            );
          }
        }
      }
    } else if (isEndOfFile(c)) {
      // error condition (end of file before end of token)
      throw new IOException(
        "(startline " + startLineNumber + ")"
          + "eof reached before encapsulated token finished"
      );
    } else {
      // consume character
      tkn.content.append((char) c);
    }
  }
}
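To see the look-ahead logic in isolation, here is a simplified, self-contained sketch of the same idea. It is not Solr's code: the class MiniCsvLine and its parse method are hypothetical, it handles a single line, assumes every field is enclosed in double quotes, and hard-codes the comma as delimiter.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniCsvLine {
    // Hypothetical helper: splits one line of fully quoted CSV into fields.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c); // ordinary character inside the field
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote -> one literal quote
                    i++;                 // skip the second quote
                } else {
                    inQuotes = false;    // closing quote of the field
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote of the next field
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else if (!Character.isWhitespace(c)) {
                // mirrors the "invalid char between encapsulated token" error
                throw new IllegalArgumentException("invalid char outside quotes at index " + i);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("end of line reached before closing quote");
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"aaa\",\"b\"\"bb\",\"ccc\"")); // prints [aaa, b"bb, ccc]
    }
}
```

With an unescaped quote, as in "b"bb", the sketch closes the field at the second quote and then fails on the following b, which is the same kind of failure the original import ran into.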