Learning Solr: How to import CSV String into Solr

Examples we find on web are usually to import csv files into Solr, here I want to to show how to import CSV string into Solr.
Why we want to import CSV string?
1 Compared to importing csv files: importing csv string would be much faster, as no unnecessary IO: no need to write the csv files in client side and read it at server side.
2 Compared to importing XML String, importing csv string should be also faster: 
It is faster than writing xml string in the client would and reading/parsing it in server side.

3. CSV string is usually much smaller than XML, less time spend on network transfer, also can save a little bandwidth.
How to import CSV String into Solr?
1. Use stream.body to specify csv data, one thing we should pay attention, we have to use ASCII code %0D%0A to separate lines - use \r\r wouldn't work.
2. In the stream body, you have to escape special character, change “ to \”, \ to \\.

The following request would import 2 lines into Solr.
curl -d "stream.body=2,0,1,0,1,\"c:\\\",1,0,\"c:\",0,1,16 %0D%0A 2,0,1,0,1,\"x:\\\",2,0,\"x:\",0,1,16 &separator=,&fieldnames=omiited&literal.id=9000&stream.contentType=text/csv;charset=utf-8&commit=true" http://localhost:8080/solr/update/csv

In order to simplify client code, we can create a new requestHandler, which sets fieldnames, so client need not specify fieldnames in each request. 
Solr Code

org.apache.solr.handler.loader.CSVLoaderBase.load(SolrQueryRequest, SolrQueryResponse, ContentStream, UpdateRequestProcessor)
org.apache.solr.internal.csv.CSVParser
getLine()
nextToken(Token)
  private boolean isEndOfLine(int c) throws IOException {
    // check if we have \r\n...
    if (c == '\r') {
      if (in.lookAhead() == '\n') {
        // note: does not change c outside of this method !!
        c = in.read();
      }
    }
    return (c == '\n');
  }
References:
UpdateCSV - Solr Wiki
Post a Comment

Labels

Java (159) Lucene-Solr (110) All (58) Interview (58) J2SE (53) Algorithm (43) Soft Skills (36) Eclipse (34) Code Example (31) Linux (24) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (15) Defects (14) Text Mining (14) J2EE (13) Network (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) UIMA (9) html (9) Dynamic Languages (8) Http Client (8) Maven (8) Security (8) Trouble Shooting (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) Problem Solving (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Python (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Troubleshooting (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts