Removing Invalid Control Characters from XML

The Problem
Today, our application reported the following error when importing XML data into Solr:
<str name="msg">Illegal character ((CTRL-CHAR, code 22)) at [row,col {unknown-source}]: [1,115]</str>

Using Eclipse remote debugging, I captured the XML data the client sent:
<field name="lnks">\MB\abc def....</field>
The cause is an invalid control character in the XML data: code 22 (U+0016), the "synchronous idle" (SYN) character.

Putting the field value in a CDATA section won't help in this case. As http://msdn.microsoft.com/en-us/library/ms256076.aspx explains:
Content within CDATA sections must be within the range of characters permitted for XML content; control characters and compatibility characters cannot be escaped this way. In addition, the sequence ]]> cannot appear within a CDATA section because this sequence signals the end of the section. This means that CDATA sections cannot be nested. The sequence also appears in some scripts. Within scripts, it is usually possible to substitute ] ]> for ]]>.
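
The point about CDATA can be checked with the JDK's built-in parser. The following is a minimal sketch, not the code from our application: the class name CdataControlCharDemo and the helper parses are made up for illustration, and the element name field simply mirrors the Solr document above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class: shows that wrapping a control character in CDATA
// does not make the document well-formed.
public final class CdataControlCharDemo {
    private CdataControlCharDemo() {}

    /** Returns true if the JDK's default DOM parser accepts the given XML text. */
    public static boolean parses(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejects the document, e.g. because of an invalid XML character.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // U+0016 (SYN) inside a CDATA section: still rejected.
        System.out.println(parses("<field><![CDATA[a\u0016b]]></field>"));
        // The same document without the control character: accepted.
        System.out.println(parses("<field><![CDATA[a b]]></field>"));
    }
}
```

The default parser may also print a "[Fatal Error]" line to standard error for the rejected document; that is expected.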

The Fix
To fix this, we have to remove these control characters from the XML data, or replace them with another character such as a space or a dash.
We can do this easily on the client side: write a function that replaces or removes these control characters, or reuse an existing library such as Guava:
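str = CharMatcher.JAVA_ISO_CONTROL.removeFrom(str);
or with a regular expression, which removes every ASCII control character, including tab, CR, and LF:
str = str.replaceAll("\\p{Cntrl}", "");
or, to keep tab, CR, and LF (all three are valid in XML), with a character-class intersection:
str = str.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]+", "");
Note that a ^ in the middle of a character class is a literal caret, so "[\\p{Cntrl}^\r\n\t]+" would also strip tab, CR, LF, and the caret itself.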
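
If you prefer not to depend on Guava or on regex character classes, the same idea can be written by hand. The following is a sketch only: the class XmlSanitizer and its methods are hypothetical names, and it assumes the target is XML 1.0, whose Char production allows tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF.

```java
// Hypothetical helper: removes every code point that XML 1.0 does not allow.
public final class XmlSanitizer {
    private XmlSanitizer() {}

    /** Returns the input with all code points outside the XML 1.0 Char production removed. */
    public static String stripInvalidXmlChars(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** True if the code point may appear in an XML 1.0 document. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Calling XmlSanitizer.stripInvalidXmlChars on the field value before building the Solr document drops the SYN character while leaving tabs and line breaks alone. Unlike the ISO-control approach, it leaves U+007F to U+009F in place, since XML 1.0 permits them.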

Explanation

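CharMatcher.JAVA_ISO_CONTROL matches the characters in the ranges '\u0000' - '\u001F' and '\u007F' - '\u009F'. Apart from tab, line feed, and carriage return, they are invisible and carry no meaning in ordinary text. In XML 1.0 the first range is invalid except for those three characters (U+0009, U+000A, U+000D); the second range is permitted but discouraged.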

The Javadoc for Character.isISOControl defines the same set:
Determines if the referenced character (Unicode code point) is an ISO control character. A character is considered to be an ISO control character if its code is in the range '\u0000' through '\u001F' or in the range '\u007F' through '\u009F'.

The code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 and 1.1 document.
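In XML 1.1 the following ranges are restricted: they may appear only as numeric character references, never as literal characters.
U+0001–U+0008, U+000B–U+000C, U+000E–U+001F: this includes most (not all) C0 control characters
U+007F–U+0084, U+0086–U+009F: this includes the delete character (U+007F) and all but one C1 control character (U+0085, next line)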
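
To see how the ISO control set and the XML 1.1 restricted ranges relate, they can be compared code point by code point. This is an illustrative sketch: the class ControlCharReport and the method isXml11Restricted are hypothetical names, and the ranges are exactly the ones listed above.

```java
// Hypothetical helper: compares Java's ISO control test with the XML 1.1 restricted ranges.
public final class ControlCharReport {
    private ControlCharReport() {}

    /** True if the code point falls in the XML 1.1 restricted ranges listed above. */
    public static boolean isXml11Restricted(int cp) {
        return (cp >= 0x01 && cp <= 0x08)
                || cp == 0x0B || cp == 0x0C
                || (cp >= 0x0E && cp <= 0x1F)
                || (cp >= 0x7F && cp <= 0x84)
                || (cp >= 0x86 && cp <= 0x9F);
    }

    /** One-line summary for a code point. */
    public static String describe(int cp) {
        return String.format("U+%04X isoControl=%b xml11Restricted=%b",
                cp, Character.isISOControl(cp), isXml11Restricted(cp));
    }

    public static void main(String[] args) {
        // NUL, tab, SYN (the character from the Solr error), DEL, and NEL.
        int[] samples = {0x00, 0x09, 0x16, 0x7F, 0x85};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

Tab (U+0009) and next line (U+0085) are ISO control characters but are not restricted, which is why blindly removing every ISO control character strips more than XML requires. U+0000 is not in the restricted ranges either, but only because it is forbidden outright.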

Resources
http://msdn.microsoft.com/en-us/library/ms256076.aspx
http://stackoverflow.com/questions/14028716/how-to-remove-control-characters-from-java-string
http://unicode-table.com/en/