Solr: Escaping Special Characters When Importing Data

We are importing XML (or CSV) data into Solr via a curl GET request. To make this work, we need to escape two kinds of special characters: XML special characters and URL special characters.

First, we escape the XML special characters & < > " ' as &amp; &lt; &gt; &quot; &apos;. In code, we can use org.apache.commons.lang.StringEscapeUtils.escapeXml(String).
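To make the mapping concrete, here is a minimal sketch of the XML escaping step that uses only the JDK. The class XmlEscapeSketch and its escapeXml method are hypothetical stand-ins written for illustration; they are not the Commons Lang implementation, which should be preferred in real code.

```java
// Hypothetical helper: a hand-rolled stand-in for StringEscapeUtils.escapeXml.
final class XmlEscapeSketch {

    // Replace the five XML special characters with their predefined entities.
    static String escapeXml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Tom &amp; Jerry &lt;b&gt;
        System.out.println(escapeXml("Tom & Jerry <b>"));
    }
}
```

Because & is handled in the same pass as the other characters, the ampersands introduced by the entities themselves are not escaped a second time.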

Then we use java.net.URLEncoder.encode(String, String) to escape URL special characters, in particular $ & + , / : ; = ? @.
URLEncoder.encode also converts a line break (\r\n) to %0D%0A.
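A small sketch of what URLEncoder does to these characters may help. The class name UrlEncodeSketch and its encode wrapper are made up for this example; only URLEncoder.encode itself comes from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper around URLEncoder, for illustration only.
final class UrlEncodeSketch {

    // URL-encode a value as UTF-8; UTF-8 is always supported, so the
    // checked exception is wrapped rather than propagated.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints: %24+%26+%2B+%2C+%2F+%3A+%3B+%3D+%3F+%40
        System.out.println(encode("$ & + , / : ; = ? @"));
        // prints: line1%0D%0Aline2
        System.out.println(encode("line1\r\nline2"));
    }
}
```

Note that URLEncoder produces form encoding, so a space becomes + rather than %20. That is fine for a query-string parameter such as stream.body.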

For example, if the field content includes the following two lines of data:
xml special: & < > " '
url special: $ & + , / : ; = ? @

The curl GET request to import the data would look like this:
http://localhost:8080/solr/update?stream.body=<add><doc><field name="id">id1</field><field name="content">xml+special%3A+%26amp%3B+%26lt%3B+%26gt%3B+%26quot%3B+%26apos%3B%0D%0Aurl+special%3A+%24+%26amp%3B+%2B+%2C+%2F+%3A+%3B+%3D+%3F+%40</field></doc></add>&commit=true
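Code to convert the XML field data:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import org.apache.commons.lang.StringEscapeUtils;

// Escape XML special characters first, then URL-encode the result.
private String escapeXmlEncodeUrl(String str)
  throws UnsupportedEncodingException {
 return URLEncoder.encode(StringEscapeUtils.escapeXml(str), "UTF-8");
}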
From org.apache.solr.client.solrj.util.ClientUtils.escapeQueryChars, we can see that the following special characters in a query string need to be escaped (prefixed with \): \, +, -, !, (, ), :, ^, [, ], ", {, }, ~, *, ?, |, &, ;, /, and whitespace.
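The rule can be sketched in a few lines of plain Java. This is a rough imitation written for illustration, not SolrJ's code: the class QueryEscapeSketch is hypothetical, and the character set (which includes the double quote) is my reading of what ClientUtils.escapeQueryChars checks. In a real project, call the SolrJ method directly.

```java
// Hypothetical re-implementation of the escaping rule, for illustration only.
final class QueryEscapeSketch {

    // Characters that get a leading backslash (assumed set, see lead-in).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Prefix every special character and every whitespace character with '\'.
    static String escapeQueryChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: title\:\(a\+b\)
        System.out.println(escapeQueryChars("title:(a+b)"));
    }
}
```

This escaping applies to the query text itself. If the query is then sent in a URL, it still needs URL encoding on top.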
Resources
Online XML Escape
Online URL Encoder/Decoder
RFC 1738: Uniform Resource Locators (URL) specification
http://www.xmlnews.org/docs/xml-basics.html