Learning Solr Code: Import Data to Solr

The main logic is in solr.servlet.SolrRequestParsers.parse(SolrCore, String, HttpServletRequest):
public SolrQueryRequest parse( SolrCore core, String path, HttpServletRequest req ) throws Exception
  {
    SolrRequestParser parser = standard;

    // The parser fills this list with any content streams it finds in the request.
    ArrayList<ContentStream> streams = new ArrayList<ContentStream>(1);
    SolrParams params = parser.parseParamsAndFillStreams( req, streams );
    SolrQueryRequest sreq = buildRequestFrom( core, params, streams );
    sreq.getContext().put( "path", path );
    return sreq;
  }
There are multiple SolrRequestParser implementations: FormDataRequestParser, MultipartRequestParser, RawRequestParser, SimpleRequestParser, StandardRequestParser. SolrRequestParsers uses StandardRequestParser by default, and StandardRequestParser delegates to the others depending on the request.

In StandardRequestParser.parseParamsAndFillStreams, a GET or HEAD request has its query string parsed into a SolrParams.
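
As a rough illustration of what parsing a query string into multi-valued parameters involves, here is a minimal sketch. QueryStringSketch is a hypothetical helper written for this post, not Solr's implementation, and it assumes UTF-8 and Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not Solr's actual code.
class QueryStringSketch {

    // Splits "a=1&b=2&a=3" into name -> list of values, URL-decoding both sides.
    static Map<String, List<String>> parse(String queryString) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (queryString == null || queryString.isEmpty()) {
            return params;
        }
        for (String pair : queryString.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            String name = URLDecoder.decode(rawName, StandardCharsets.UTF_8);
            String value = URLDecoder.decode(rawValue, StandardCharsets.UTF_8);
            // A parameter may repeat, so every name maps to a list.
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }
}
```

A repeated parameter such as fq=a&fq=b ends up as one entry with two values, which is why the result maps each name to a list.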

For an ordinary form POST, StandardRequestParser.parseParamsAndFillStreams delegates to FormDataRequestParser, which parses the form data into a SolrParams.
The following curl request is handled by FormDataRequestParser:
curl -d "stream.body=<add><doc><field name='contentid'>content1</field></doc></add>&clientId=client123&batchId=1" http://host:port/solr/update

If the data is uploaded as a file, like below:
curl http://host:port/solr/update -F "fieldName=@data.xml"
The fieldName doesn't matter and can be anything.

StandardRequestParser.parseParamsAndFillStreams delegates to MultipartRequestParser, which uses Apache Commons FileUpload to create a fileupload.FileItem for each part and wraps it in a servlet.FileItemContentStream.
How does it determine whether the request is multipart?
It calls ServletFileUpload.isMultipartContent(req), which checks whether the content type starts with "multipart/".

For a POST request that matches neither of the formats above, it uses RawRequestParser, which creates a servlet.HttpRequestContentStream from the request.
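
The dispatch described above can be summarized in a small sketch. ParserChoiceSketch and chooseParser are hypothetical names; the method returns the name of the parser that would handle the request rather than calling Solr. Rejecting methods other than GET, HEAD and POST is an assumption made for simplicity, and the exact checks may differ between Solr versions.

```java
import java.util.Locale;

// Hypothetical summary of the dispatch; not Solr's actual code.
class ParserChoiceSketch {

    static String chooseParser(String method, String contentType) {
        String m = method.toUpperCase(Locale.ROOT);
        if (m.equals("GET") || m.equals("HEAD")) {
            return "query string";
        }
        if (!m.equals("POST")) {
            // Assumption: anything else is rejected in this sketch.
            throw new IllegalArgumentException("Unsupported method: " + method);
        }
        if (contentType != null) {
            String ct = contentType.toLowerCase(Locale.ROOT);
            // Drop parameters such as "; charset=UTF-8" before comparing.
            int semi = ct.indexOf(';');
            String base = (semi >= 0 ? ct.substring(0, semi) : ct).trim();
            if (base.equals("application/x-www-form-urlencoded")) {
                return "FormDataRequestParser";
            }
            if (ct.startsWith("multipart/")) {
                return "MultipartRequestParser";
            }
        }
        // Everything else is treated as a raw request body.
        return "RawRequestParser";
    }
}
```

For example, curl -d sends application/x-www-form-urlencoded by default, curl -F sends multipart/form-data, and a POST with -H "Content-Type: text/xml" falls through to the raw case.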

Then SolrRequestParsers.buildRequestFrom reads the stream.file, stream.body and stream.url parameters and constructs a ContentStreamBase.FileStream, StringStream or URLStream for each. The file that stream.file points to must be local to the Solr server.
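
To make the idea concrete, here is a minimal sketch of turning the stream.* parameters into openable streams. StreamParamsSketch, SimpleStream and Opener are hypothetical stand-ins for Solr's ContentStream classes, and the sketch assumes single-valued parameters and Java 9 or later; the real parameters can carry several values.

```java
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; not Solr's ContentStream classes.
class StreamParamsSketch {

    interface Opener {
        InputStream open() throws IOException;
    }

    static final class SimpleStream {
        final String kind;    // "url", "file" or "body"
        final Opener opener;  // opens the data lazily
        SimpleStream(String kind, Opener opener) {
            this.kind = kind;
            this.opener = opener;
        }
    }

    static List<SimpleStream> fromParams(Map<String, String> params) {
        List<SimpleStream> streams = new ArrayList<>();
        String url = params.get("stream.url");
        if (url != null) {
            streams.add(new SimpleStream("url", () -> URI.create(url).toURL().openStream()));
        }
        String file = params.get("stream.file");
        if (file != null) {
            // Opened on the server, so the path must exist on the server's disk.
            streams.add(new SimpleStream("file", () -> new FileInputStream(file)));
        }
        String body = params.get("stream.body");
        if (body != null) {
            streams.add(new SimpleStream("body",
                () -> new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8))));
        }
        return streams;
    }

    // Reads a stream fully as UTF-8 text, wrapping I/O failures.
    static String readAll(SimpleStream s) {
        try (InputStream in = s.opener.open()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing is opened until the request handler asks for the data, so a wrong stream.file path only fails when the stream is read.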
Subclasses of ContentStreamBase:
HttpRequestContentStream
Wraps an HttpServletRequest as a ContentStream.
public InputStream getStream() throws IOException {
  return req.getInputStream();
}
FileItemContentStream
Wraps an org.apache.commons.fileupload.FileItem as a ContentStream.
ContentStreamBase.FileStream
ContentStreamBase.URLStream
ContentStreamBase.StringStream
DocumentAnalysisRequestHandlerTest.ByteStream
Using curl to send requests to Solr
curl -d "stream.body=<add><doc><field name=\"id\">id1</field></doc></add>&clientId=client123" http://host:port/solr/update
curl -d "stream.body=<add><commit/></add>&clientId=client123" http://host:port/solr/update
Error:
The value of -d must be enclosed in double quotes because it contains characters that are special to the shell, such as < and &. Without the quotes, the Windows command prompt reports an error:
curl -d stream.body=<add><doc><field name=\"id\">id1</field></doc></add>&clientId=client123 http://host:port/solr/update
< was unexpected at this time.

In the stream body, XML attribute values must be enclosed in quotes, as in name=\"id\". The following request fails:
curl -d "stream.body=<add><doc><field name=id>id1</field></doc></add>&clientId=client123" http://host:port/solr/update
org.apache.solr.common.SolrException: Unexpected character 'i' (code 105) in start tag Expected a quote
at [row,col {unknown-source}]: [1,23]
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character 'i' (code 105) in start tag Expected a quote
at [row,col {unknown-source}]: [1,23]
at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)

Correct Usage:
Enclose the value of -d in double quotes.
Use \ to escape special characters: " becomes \", and \ becomes \\.
curl -d "stream.body=2,0,1,0,1,\"c:\\\",1,0,\"c:\",0,1,16 %0D%0A 2,0,1,0,1,\"x:\\\",2,0,\"x:\",0,1,16 &separator=,&fieldnames=omiited&literal.id=9000&stream.contentType=text/csv;charset=utf-8&commit=true" http://localhost:8080/solr/update/csv
Code:
The "Expected a quote" error above is raised in com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(char c):
 // And then a quote:
 if (c != '"' && c != '\'') {
  throwUnexpectedChar(c, SUFFIX_IN_ELEMENT+" Expected a quote");
 }
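
Shell quoting is only half of the story: inside a form body, characters such as &, = and < also carry meaning or need encoding. When the request is built from Java instead of curl, one way to sidestep both problems is to URL-encode each name and value. FormBodySketch is a hypothetical helper, shown as a sketch that assumes UTF-8 and Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper for illustration only.
class FormBodySketch {

    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&'); // separator between pairs, never encoded
            }
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }
}
```

With this encoding, an & or = inside stream.body can no longer be mistaken for a parameter separator, and no shell escaping of quotes or backslashes is involved.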
Upload csv content
curl -d "stream.body=id1&clientId=client123&fieldnames=id" http://host:port/solr/update/csv
Upload XML File
curl -F "fieldName=@data.xml" http://host:port/solr/update
curl -F "fieldName=@data.xml;type=application/xml" http://host:port/solr/update
;type=application/xml sets the MIME content type of the file.
curl -F fieldName=@data.xml -F clientId=client123 -F batchId=2 http://host:port/solr/update
Use a separate -F for each form field. The form -F "key1=value1&key2=value2" doesn't work: it sets a single pair whose key is key1 and whose value is value1&key2=value2.
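
The reason is that in a multipart/form-data body every field is its own part, separated by a boundary line, and & has no special meaning. The sketch below builds such a body for plain text fields. MultipartBodySketch is a hypothetical helper; it ignores file parts and per-part content types, and the boundary passed in is arbitrary.

```java
import java.util.Map;

// Hypothetical helper for illustration; handles plain text fields only.
class MultipartBodySketch {

    static String build(String boundary, Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // Each field becomes one part: boundary line, headers, blank line, value.
            body.append("--").append(boundary).append("\r\n")
                .append("Content-Disposition: form-data; name=\"")
                .append(e.getKey()).append("\"\r\n")
                .append("\r\n")
                .append(e.getValue()).append("\r\n");
        }
        // The closing boundary carries a trailing "--".
        body.append("--").append(boundary).append("--\r\n");
        return body.toString();
    }
}
```

Passing clientId and batchId as two map entries yields two parts, which is the equivalent of two -F options; a single entry with the value client123&batchId=2 would be sent as one part containing that literal text.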
Delete Data
curl -d "stream.body=<delete><query>*:*</query></delete>&commit=true" http://host:port/solr/update
Curl Usage
-d, --data <data>
(HTTP) Sends the specified data in a POST request to the HTTP server.
--data-binary <data>
(HTTP) This posts data exactly as specified with no extra processing whatsoever.
curl http://host:port/solr/update -H "Content-Type: text/xml" -d @C:\jeffery\data.xml
-F, --form <name=content>
curl -F password=@/etc/passwd www.mypasswords.com
curl -F "name=daniel;type=text/foo" url.com

Set the Request Method: -X POST
Set Request Headers: -H "Authorization: OAuth 2c4419d1aabeec"
View Response Headers: -i
Debug request: -v

-o (lowercase o): save the result to the file name given on the command line
-O (uppercase O): save the result under the file name taken from the URL
-L: follow HTTP Location headers (redirects)

To POST to a page
curl -d "item=bottle&category=consumer&submit=ok" www.example.com/process.php

Referer & User Agent
curl -e http://some_referring_site.com http://www.example.com/
curl -A "Mozilla/5.0 (compatible; MSIE 7.01; Windows NT 5.0)" http://www.example.com

Limit the Rate of Data Transfer
curl --limit-rate 1000B -O http://www.gnu.org/software/gettext/manual/gettext.html

Continue/Resume a Previous Download: -C -
curl -C - -O http://www.gnu.org/software/gettext/manual/gettext.html

Pass HTTP Authentication in cURL
curl -u username:password URL

Download Files from FTP server
curl -u ftpuser:ftppass -O ftp://ftp_server/public_html/xss.php

List/Download using Ranges
curl ftp://ftp.uk.debian.org/debian/pool/main/[a-z]/

Upload Files to FTP Server
curl -u ftpuser:ftppass -T myfile.txt ftp://ftp.testserver.com
curl -u ftpuser:ftppass -T "{file1,file2}" ftp://ftp.testserver.com
curl ftp://username:password@example.com

Use Proxy to Download a File
curl -x proxysever.test.com:3128 http://google.co.in
