Using HTML Parser Jsoup and Regular Expression to Get Text between Two Tags

The Task
In this article, we use jsoup to parse an HTML page and collect all TOC (table of contents) anchor links, then use a regular expression to get the text content that belongs to each anchor link.

The Solution
Jsoup is a Java HTML parser. Its jQuery-like selector syntax, which also supports regex-based selectors, makes it very easy to extract content from an HTML page.

Normally a site follows some convention about where it puts its TOC anchor links, and from this we can compose a CSS selector that selects all of them. We will take the Java_Development_Kit Wikipedia page as an example.

Use Jsoup to Get All Anchor Links
To try out a CSS selector, open Chrome Developer Tools and, in the Console tab, run document.querySelectorAll("CSS_SELECTOR_HERE"); to test it.

Our final css selector would be:
div[id=toc]>ul>li a[href^='#']:not([href='#'])
In the div with id=toc, take its direct child ul element, then that element's direct child li elements, and find all links whose href attribute starts with # (meaning the link points to an anchor on the same page) but is not just '#'.

The Code
One caveat: Jsoup doesn't like ' or " around the attribute value, so the quoted selector shown earlier produces no match in Jsoup.
The final CSS selector for Jsoup is: div[id=toc] ul>li a[href^=#]:not([href=#])
Document doc = Jsoup.connect(url).get();
Element rootElement = doc.select(PATTERN_BODY_ROOT).first();
Set<String> anchors = new LinkedHashSet<String>();
Elements elements = rootElement.select(TOC_ANCHOR);
if (!elements.isEmpty()) {
  for (Element element : elements) {
    String href = element.attr("href");
    anchors.add(href.substring(1));
  }
}
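
The loop above strips the leading # from each href and collects the names in a LinkedHashSet. As a minimal sketch of why that set type fits (this is an illustration, not part of the original code: the class AnchorNames, the method toAnchorNames, and the sample href values are hypothetical), the following runs without Jsoup and shows that duplicates are dropped while page order is kept:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AnchorNames {
  // Turns href values such as "#JDK_contents" into anchor names,
  // keeping first-seen order and dropping duplicates.
  static Set<String> toAnchorNames(List<String> hrefs) {
    Set<String> anchors = new LinkedHashSet<String>();
    for (String href : hrefs) {
      // The selector already excludes a bare "#"; this guard is only defensive.
      if (href.startsWith("#") && href.length() > 1) {
        anchors.add(href.substring(1));
      }
    }
    return anchors;
  }

  public static void main(String[] args) {
    // Hypothetical href values, as element.attr("href") might return them.
    List<String> hrefs = Arrays.asList("#JDK_contents", "#See_also", "#JDK_contents");
    System.out.println(toAnchorNames(hrefs)); // prints [JDK_contents, See_also]
  }
}
```

Order matters here because, later on, each anchor is paired with the one that follows it on the page.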
Using Regular Expression and Jsoup to Get Text of each Anchor
First, a definition: in our case, the content of an anchor is all the content between the current anchor and the next anchor.

The regular expression that gets all HTML content between the anchor JDK_contents and the anchor Ambiguity_between_a_JDK_and_an_SDK looks like this:
<span[^>]*\s*(?:"|')?JDK_contents(?:'|")?[^>]*>([^<]*)</span>(.*)(<span[^>]*\s*(?:"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|")?[^>]*>[^<]*</span>.*)
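
To see what the three capture groups hold, here is a small sketch that applies this exact expression with java.util.regex. The class name AnchorRegexDemo and the SAMPLE fragment are made up for illustration; real Wikipedia markup may differ from this hand-written fragment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRegexDemo {
  // The expression from the article, escaped as a Java string literal.
  static final String REGEX =
      "<span[^>]*\\s*(?:\"|')?JDK_contents(?:'|\")?[^>]*>([^<]*)</span>"
      + "(.*)"
      + "(<span[^>]*\\s*(?:\"|')?Ambiguity_between_a_JDK_and_an_SDK(?:'|\")?[^>]*>[^<]*</span>.*)";

  // Hand-written sample fragment (an assumption, not copied from the real page).
  static final String SAMPLE =
      "<h2><span class=\"mw-headline\" id=\"JDK_contents\">JDK contents</span></h2>"
      + "<p>The JDK has tools.</p>"
      + "<h2><span class=\"mw-headline\" id=\"Ambiguity_between_a_JDK_and_an_SDK\">Ambiguity</span></h2>"
      + "<p>More text.</p>";

  // Returns {anchor text, HTML between the two anchors, remaining HTML},
  // or null when the expression does not match.
  static String[] split(String html) {
    // DOTALL lets "." cross line breaks, which real pages contain.
    Matcher m = Pattern.compile(REGEX, Pattern.DOTALL).matcher(html);
    if (!m.find()) {
      return null;
    }
    return new String[] { m.group(1), m.group(2), m.group(3) };
  }

  public static void main(String[] args) {
    for (String part : split(SAMPLE)) {
      System.out.println(part);
    }
  }
}
```

On this sample, group 1 is the heading text of the first anchor, group 2 is the HTML up to the second anchor's span, and group 3 starts at that span and runs to the end, so it can be fed into the next round.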

In another post, we will show how to use the tool RegexBuddy to test and compose this regular expression, and how to improve the regular expression to boost performance.

After getting the HTML content, we call Jsoup.parse(html).text(); to get the combined text.

The Code
public String getContentBetweenAnchor(StringBuilder remaining,
    String anchor1, String anchor2, String anchorElement,
    String anchorAttribute) throws IOException {
  StringBuilder sb = new StringBuilder();
  // the first group is the anchor text
  sb.append(matchAnchorRegexStr(anchor1, anchorElement, true))
      // the second group is the text between these 2 anchors
      .append("(.*)")
      // the third group is the remaining text
      .append("(").append(matchAnchorRegexStr(anchor2, anchorElement, false))
      .append(".*)");

  System.out.println(sb);
  Matcher matcher = Pattern.compile(sb.toString(),
      Pattern.DOTALL | Pattern.MULTILINE).matcher(remaining);
  String matchedText = "";
  if (matcher.find()) {
    String anchorText = Jsoup.parse(matcher.group(1)).text();
    matchedText = anchorText + " " + Jsoup.parse(matcher.group(2)).text();
    String newRemaining = matcher.group(3);
    remaining.setLength(0);
    remaining.append(newRemaining);
  }
  return matchedText;
}
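
The method above is meant to be called once per pair of neighbouring anchors, each call shrinking `remaining` to the part that has not been consumed yet. The sketch below walks through that loop using only the JDK, under several assumptions that are not in the original code: the class SectionSplitter and its methods are hypothetical, stripTags is a crude stand-in for Jsoup.parse(html).text(), the span pattern is simplified to require an id attribute, and the anchor name is wrapped in Pattern.quote so that names containing regex metacharacters are matched literally.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionSplitter {
  // Crude stand-in for Jsoup.parse(html).text(): drops tags, collapses whitespace.
  static String stripTags(String html) {
    return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
  }

  // Simplified anchor pattern: a span whose id equals the given anchor name.
  static String spanWithId(String id, boolean captureText) {
    return "<span[^>]*\\bid=[\"']" + Pattern.quote(id) + "[\"'][^>]*>"
        + (captureText ? "([^<]*)" : "[^<]*") + "</span>";
  }

  // Maps each anchor name to the text between it and the next anchor.
  static Map<String, String> split(String html, List<String> anchors) {
    Map<String, String> sections = new LinkedHashMap<String, String>();
    String remaining = html;
    for (int i = 0; i + 1 < anchors.size(); i++) {
      String regex = spanWithId(anchors.get(i), true)
          + "(.*)"
          + "(" + spanWithId(anchors.get(i + 1), false) + ".*)";
      Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(remaining);
      if (m.find()) {
        sections.put(anchors.get(i), stripTags(m.group(1) + " " + m.group(2)));
        // Continue with the part that starts at the next anchor.
        remaining = m.group(3);
      }
    }
    // Whatever is left belongs to the last anchor.
    if (!anchors.isEmpty()) {
      String last = stripTags(remaining);
      if (!last.isEmpty()) {
        sections.put(anchors.get(anchors.size() - 1), last);
      }
    }
    return sections;
  }

  public static void main(String[] args) {
    // Hand-written sample page with two anchors, A and B.
    String html = "<p>Intro</p><h2><span id=\"A\">Alpha</span></h2><p>First body.</p>"
        + "<h2><span id=\"B\">Beta</span></h2><p>Second body.</p>";
    System.out.println(split(html, Arrays.asList("A", "B")));
    // prints {A=Alpha First body., B=Beta Second body.}
  }
}
```

Because the middle group is greedy, this relies on each anchor id appearing only once in the remaining HTML.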

The Complete Code
package org.codeexample.lifelongprogrammer.anchorlinks;

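import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import com.google.common.base.Stopwatch;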

public class JsoupExample {
  private static final String TOC_ANCHOR = "div[id=toc] ul>li a[href^=#]:not([href=#])";
  private static final String PLAIN_ANCHOR_A_TAG = "a[href^=#]:not([href=#])";

  private static final int MAX_ANCHOR_LINKS = 5;
  // only <div id="bodyContent"> section
  private static final String PATTERN_BODY_ROOT = "div[id=bodyContent]";

  public Map<String, String> parseHTML(String url) throws IOException {
    Map<String, String> anchorContents = new LinkedHashMap<String, String>();

    Document doc = Jsoup.connect(url).get();
    Element rootElement = doc.select(PATTERN_BODY_ROOT).first();
    if (rootElement == null)
      return anchorContents;
    Set<String> anchors = getAnchors(rootElement);
    if (anchors.isEmpty())
      return anchorContents;
    StringBuilder remaining = new StringBuilder(rootElement.toString());

    Iterator<String> it = anchors.iterator();
    String current = it.next();
    while (it.hasNext() && remaining.length() > 0) {
      String next = it.next();
      // text between the current anchor and the next one
      anchorContents.put(current,
          getContentBetweenAnchor(remaining, current, next, "span", "id"));
      current = next;
    }
    // last one
    String lastTxt = Jsoup.parse(remaining.toString()).text();
    if (StringUtils.isNotBlank(lastTxt)) {
      anchorContents.put(current, lastTxt);
    }
    return anchorContents;
  }

  public Set<String> getAnchors(Element rootElement) {
    Set<String> anchors = new LinkedHashSet<String>() {
      private static final long serialVersionUID = 1L;

      @Override
      public boolean add(String e) {
        if (size() >= MAX_ANCHOR_LINKS)
          return false;
        return super.add(e);
      }
    };
    getAnchorsImpl(rootElement, TOC_ANCHOR, anchors);
    if (anchors.isEmpty()) {
      // no TOC anchor found, so fall back to any in-page anchor link
      getAnchorsImpl(rootElement, PLAIN_ANCHOR_A_TAG, anchors);
    }
    return anchors;
  }

  public void getAnchorsImpl(Element rootElement, String anchorPattern,
      Set<String> anchors) {
    Elements elements = rootElement.select(anchorPattern);
    if (!elements.isEmpty()) {
      for (Element element : elements) {
        String href = element.attr("href");
        anchors.add(href.substring(1));
      }
    }
  }

  public String getContentBetweenAnchor(StringBuilder remaining,
      String anchor1, String anchor2, String anchorElement,
      String anchorAttribute) throws IOException {
    StringBuilder sb = new StringBuilder();
    // the first group is the anchor text
    sb.append(matchAnchorRegexStr(anchor1, anchorElement, true))
        // the second group is the text between these 2 anchors
        .append("(.*)")
        // the third group is the remaining text
        .append("(").append(matchAnchorRegexStr(anchor2, anchorElement, false))
        .append(".*)");

    System.out.println(sb);
    Matcher matcher = Pattern.compile(sb.toString(),
        Pattern.DOTALL | Pattern.MULTILINE).matcher(remaining);
    String matchedText = "";
    if (matcher.find()) {
      String anchorText = Jsoup.parse(matcher.group(1)).text();
      matchedText = anchorText + " " + Jsoup.parse(matcher.group(2)).text();
      String newRemaining = matcher.group(3);
      remaining.setLength(0);
      remaining.append(newRemaining);
    }
    return matchedText;
  }

  // Builds the regex for one anchor element, e.g. <span ... id="anchor1" ...>text</span>
  public String matchAnchorRegexStr(String anchor1, String anchorElement,
      boolean captureAnchorText) {
    StringBuilder sb = new StringBuilder().append("<").append(anchorElement)
        .append("[^>]*").append("\\s*").append("(?:\"|')?").append(anchor1)
        .append("(?:'|\")?[^>]*>");
    if (captureAnchorText) {
      // capture the anchor's own text as a group
      sb.append("([^<]*)");
    } else {
      sb.append("[^<]*");
    }
    return sb.append("</").append(anchorElement).append(">").toString();
  }

  @Test
  public void testWiki() throws IOException {
    Stopwatch stopwatch = Stopwatch.createStarted();
    String url = "http://en.wikipedia.org/wiki/Java_Development_Kit";
    Map<String, String> anchorContents = parseHTML(url);
    System.out.println(anchorContents);
    System.out.println("Took " + stopwatch.elapsed(TimeUnit.MILLISECONDS));
    stopwatch.stop();
  }  
}

Resources
Comparison of HTML parsers
jsoup
CSS Selector Reference