Debugging and Optimizing Regular Expression

In last post, Using HTML Parser Jsoup and Regular Expression to Extract Text between Tow TagsI introduced how to use Jsoup and Regular Expression to extract content between two anchor tags. 

Current regex is: 
<span[^>]*\bid\s*=\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)</span>(.*?)<span[^>]*\bid\s*=\s*(?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*</span>.*
Writing efficient Regex is complicated, luckily there are several tools that can help us.

The best one is RegexBuddy. The trail RegexBuddyCookbook.exe is fully functional for seven days of actual use. 

Select Regex language(Java, Python, etc), type regex string, paste content in test panel, then click "Debug till end" from Debug drop-dwon buttons. In the debug panel, it will list all steps the regex engine executes.

In the Create panel, RegexBuddy will give the explanation of the regex. Select one token then click "Explain Token". We can also export the regex string and the explanation.
In the convert panel, we can convert between different regex languages.

Other Tools
regexplanet
We can use regexplanet to covert regex string to a java string. 

Regex Optimization Tips
Use the any (dot) operator sparingly.
Use the non-capturing group (?:pattern) if possible.
Use the atomic group (or non-backtracking subexpression) when applicable (?>pattern).
Alternation: The order of alternation counts, so place the more common options in the front so they can be matched faster.

Resources
Optimizing regular expressions in Java
regexplanet
regexbuddy
Post a Comment

Labels

Java (159) Lucene-Solr (111) Interview (61) All (58) J2SE (53) Algorithm (45) Soft Skills (37) Eclipse (33) Code Example (31) Linux (24) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (16) Defects (14) Text Mining (14) J2EE (13) Network (13) Troubleshooting (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) Problem Solving (9) UIMA (9) html (9) Http Client (8) Maven (8) Security (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) Dynamic Languages (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) How to Interview (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Python (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts