Debugging and Optimizing Regular Expressions


In the last post, Using HTML Parser Jsoup and Regular Expression to Extract Text between Two Tags, I introduced how to use Jsoup and a regular expression to extract the content between two anchor tags.

The current regex is:
<span[^>]*\bid\s*=\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)</span>(.*?)<span[^>]*\bid\s*=\s*(?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*</span>.*
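
For reference, here is a minimal sketch of how the regex above could be compiled and run in Java before loading it into a debugger. The class name, the sample HTML, and the choice of the DOTALL flag are assumptions made for this illustration; they are not taken from the previous post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JdkSectionExtractor {
    // The regex from this post, escaped as a Java string literal.
    // Assumption: DOTALL is enabled so that '.' also matches line breaks,
    // because real HTML between the two spans usually spans several lines.
    static final Pattern SECTION = Pattern.compile(
        "<span[^>]*\\bid\\s*=\\s*(?:\"|')?JDK_contents(?:'|\")?[^>]*>([^<]*)</span>(.*?)"
        + "<span[^>]*\\bid\\s*=\\s*(?:\"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|\")?[^>]*>[^<]*</span>.*",
        Pattern.DOTALL);

    /** Returns the text between the two anchor spans (group 2), or null if there is no match. */
    static String extract(String html) {
        Matcher m = SECTION.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the real page.
        String html = "<span class=\"h\" id=\"JDK_contents\">JDK contents</span><p>body</p>"
            + "<span id='Ambiguity_between_a_JDK_and_an_SDK'>Ambiguity</span><p>rest</p>";
        System.out.println(extract(html)); // prints <p>body</p>
    }
}
```

Group 1 holds the text of the first span and group 2 holds everything up to the second span, so only group 2 is returned here.
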
Writing an efficient regex is complicated. Luckily, there are several tools that can help us.

The best one is RegexBuddy. The trial RegexBuddyCookbook.exe is fully functional for seven days of actual use.

Select the regex language (Java, Python, etc.), type the regex string, paste the content into the test panel, then click "Debug till end" from the Debug drop-down button. The debug panel then lists every step the regex engine executes.

In the Create panel, RegexBuddy explains the regex. Select a token, then click "Explain Token". We can also export the regex string and the explanation.
In the Convert panel, we can convert the regex between different regex languages.

Other Tools
regexplanet
We can use regexplanet to convert a regex string to a Java string.
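
To make the conversion concrete: a raw regex cannot be pasted into Java source as is, because backslashes and double quotes have to be escaped inside a string literal. The sketch below shows that escaping step with a hypothetical helper, toJavaLiteral. It is not how regexplanet is implemented, and it only handles backslashes and double quotes, not line breaks or other control characters.

```java
class RegexToJavaString {
    /**
     * Hypothetical helper: wraps a raw regex in double quotes and escapes
     * backslashes and double quotes so it can be pasted into Java source.
     */
    static String toJavaLiteral(String regex) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : regex.toCharArray()) {
            if (c == '\\' || c == '"') {
                sb.append('\\'); // escape the character for a Java string literal
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // Raw regex fragment from this post: \bid\s*=\s*(?:"|')?JDK_contents
        String raw = "\\bid\\s*=\\s*(?:\"|')?JDK_contents";
        System.out.println(toJavaLiteral(raw));
        // prints "\\bid\\s*=\\s*(?:\"|')?JDK_contents"
    }
}
```
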

Regex Optimization Tips
Use the any (dot) operator sparingly; a negated character class such as [^<]* usually gives the engine less to backtrack over.
Use the non-capturing group (?:pattern) when the matched text is not needed later.
Use the atomic group (or non-backtracking subexpression) (?>pattern) when applicable.
Alternation: the order of the alternatives counts, so place the more common options first so they can be matched faster.
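
The tips above can be sketched in Java as follows. The patterns, sample strings, and the class name RegexTips are made up for this illustration. Note that an atomic group can change what matches, not only how fast, which is why the tip says "when applicable".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTips {
    // Negated character class instead of the dot: [^<]* cannot run past the next tag.
    static final Pattern TAG_TEXT = Pattern.compile("<b>([^<]*)</b>");

    // Non-capturing groups for the quotes: only the id value is captured.
    static final Pattern QUOTED_ID = Pattern.compile("\\bid\\s*=\\s*(?:\"|')(\\w+)(?:\"|')");

    // Atomic group: once one alternative has matched, the engine does not go
    // back to try the remaining ones when the trailing \b fails.
    static final Pattern KEYWORD = Pattern.compile("\\b(?>integer|insert|in)\\b");

    // Same idea with the short alternative first: "in" wins inside "integer",
    // the trailing \b fails, and the atomic group refuses to try "integer".
    static final Pattern SHORT_FIRST = Pattern.compile("\\b(?>in|integer)\\b");

    // Alternatives are tried from left to right.
    static final Pattern ALTERNATION = Pattern.compile("Java|JavaScript");

    /** Returns the first match of p in s, or null if there is none. */
    static String firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(TAG_TEXT, "<b>bold</b> and <b>more</b>")); // <b>bold</b>
        System.out.println(QUOTED_ID.matcher("").groupCount());                   // 1
        System.out.println(firstMatch(KEYWORD, "an insert statement"));           // insert
        System.out.println(firstMatch(KEYWORD, "integers"));                      // null
        System.out.println(firstMatch(SHORT_FIRST, "integer"));                   // null
        System.out.println(firstMatch(ALTERNATION, "JavaScript"));                // Java
    }
}
```

The SHORT_FIRST case shows why the order of alternatives matters even more inside an atomic group: without the atomic group, the engine would backtrack and still match "integer".
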

Resources
Optimizing regular expressions in Java
regexplanet
regexbuddy
