Nutch2: Extend Nutch2 to Crawl IFrame Pages

I recently used Nutch2 and Solr 4.x to crawl our documentation site.
The site uses iframes: the main page has almost no useful content of its own and mainly includes two iframes, like this:
<iframe name="left" src="foldera/leftmenu.htm" width="24%" height="85%" frameborder="0" target="main" style="border-right-style: solid; border-right-width: 1px; border-color: #808080">Your browser does not support inline frames..</iframe>
<iframe name="main" src="../products/overview.htm" width="75%" height="85%" border="0" frameborder="0">Your browser does not support inline frames.</iframe>
Goals
When running a search, we want to ignore the left menu.
A search usually matches content in the main iframe page. We want the result to link to the main page rather than the iframe page, because the iframe page on its own has no left menu (so users cannot navigate), no header, no footer, and no legal information.
How to Implement
Nutch2 Html Parser
When Nutch2 parses an HTML page, we check whether it contains iframes named left or main. If so, we add two new metatags to HTMLMetaTags.getGeneralTags():
metatag.ext.mainframe:   the absolute url of main iframe.
metatag.ext.ignoreframe: the absolute url of left iframe.
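To make the check concrete, here is a minimal sketch that uses only the JDK's DOM parser instead of Nutch's own classes. The class name IframeSrcFinder and its methods are hypothetical, and the sample page is well-formed markup so that the XML parser accepts it; real pages go through Nutch's HTML parsing instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: finds the src of a named iframe.
class IframeSrcFinder {

  /** Returns the src of the first iframe with the given name, or null if there is none. */
  static String findIframeSrc(Document doc, String frameName) {
    NodeList iframes = doc.getElementsByTagName("iframe");
    for (int i = 0; i < iframes.getLength(); i++) {
      Element iframe = (Element) iframes.item(i);
      // getAttribute returns "" when the attribute is missing, so unnamed
      // iframes are simply skipped and the loop continues
      if (frameName.equalsIgnoreCase(iframe.getAttribute("name"))) {
        String src = iframe.getAttribute("src");
        return src.isEmpty() ? null : src;
      }
    }
    return null;
  }

  /** Parses well-formed markup into a DOM document. */
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sample page", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<iframe name=\"left\" src=\"foldera/leftmenu.htm\"></iframe>"
        + "<iframe name=\"main\" src=\"../products/overview.htm\"></iframe>"
        + "</body></html>";
    Document doc = parse(page);
    System.out.println("metatag.ext.mainframe   = " + findIframeSrc(doc, "main"));
    System.out.println("metatag.ext.ignoreframe = " + findIframeSrc(doc, "left"));
  }
}
```

The values found this way are still relative; they have to be resolved against the page URL before they are stored in the metatags.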
Solr Post Process Handler
After the crawl finishes, we run a Solr request handler to post-process the data.
In the post handler, we run the query (metatag.ext.mainframe:* OR metatag.ext.ignoreframe:*) to find all pages where at least one of these two fields is set.
For each such page, we set ignore:true on the left menu page (the doc whose url is the value of the metatag.ext.ignoreframe field).
We then copy body_stored (the content Nutch2 crawled) from the main iframe doc (the doc whose url is the value of metatag.ext.mainframe) to this page, and set ignore:true on the main iframe doc.

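Finally, we modify the search handler to add fq=ignore:false so that only main pages are searched. Note that fq=ignore:false matches only docs whose ignore field is explicitly false, so this assumes the field is false on all other docs (for example through a schema default); fq=-ignore:true is an alternative that also matches docs with no ignore value.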
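The post-processing rules above can be sketched in plain Java on an in-memory map that stands in for the Solr index. This is only an illustration of the rules, not the handler itself: the class FramePostProcessSketch, its methods, and the example.com URLs are hypothetical, while the field names are the ones used in this post.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of the post-processing rules.
class FramePostProcessSketch {

  /** index maps each doc's url to its fields; the field maps are updated in place. */
  static void postProcess(Map<String, Map<String, Object>> index) {
    for (Map<String, Object> page : index.values()) {
      // Rule 1: hide the left menu page referenced by this page
      Map<String, Object> menuDoc = lookup(index, page.get("metatag.ext.ignoreframe"));
      if (menuDoc != null) {
        menuDoc.put("ignore", true);
      }
      // Rule 2: copy the main iframe content to this page, then hide the iframe page
      Map<String, Object> mainFrameDoc = lookup(index, page.get("metatag.ext.mainframe"));
      if (mainFrameDoc != null) {
        page.put("body_stored", mainFrameDoc.get("body_stored"));
        mainFrameDoc.put("ignore", true);
      }
    }
  }

  private static Map<String, Object> lookup(Map<String, Map<String, Object>> index, Object url) {
    return url == null ? null : index.get(url);
  }

  /** Creates a doc with the given body that is visible (ignore=false) by default. */
  static Map<String, Object> doc(String body) {
    Map<String, Object> fields = new HashMap<>();
    fields.put("body_stored", body);
    fields.put("ignore", false);
    return fields;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Object>> index = new LinkedHashMap<>();
    Map<String, Object> mainPage = doc("");
    mainPage.put("metatag.ext.ignoreframe", "http://example.com/docs/foldera/leftmenu.htm");
    mainPage.put("metatag.ext.mainframe", "http://example.com/products/overview.htm");
    index.put("http://example.com/docs/index.htm", mainPage);
    index.put("http://example.com/docs/foldera/leftmenu.htm", doc("menu links"));
    index.put("http://example.com/products/overview.htm", doc("product overview"));

    postProcess(index);
    for (Map.Entry<String, Map<String, Object>> e : index.entrySet()) {
      System.out.println(e.getKey() + " ignore=" + e.getValue().get("ignore")
          + " body_stored=" + e.getValue().get("body_stored"));
    }
  }
}
```

After the run, only the main page is still visible, and it carries the text of the main iframe, which is what a search filtered on the ignore field should match and link to.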
Implementation Code
The complete source code can be found on GitHub.
Nutch2 Html Parser
We can write a custom parse-html plugin based on the HtmlParser of the existing parse-html plugin.
The following code gets the URLs of the left and main iframes and stores them in metatag.ext.ignoreframe and metatag.ext.mainframe.

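package org.jefferyyuan.codeexample.nutch.parse.html;
// Excerpt only. The class declaration below is assumed to mirror the stock
// HtmlParser; the earlier part of getParse, which is expected to set text,
// title, outlinks, status, metaTags, root and baseUrl, is omitted.
public class HtmlParser implements Parser {
  public Parse getParse(String url, WebPage page) {
    // ... parsing code copied from the stock HtmlParser (omitted) ...
    Parse parse = new Parse(text, title, outlinks, status);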
    if (metaTags.getNoCache()) {             // not okay to cache
      page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
          ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
    }
    // Our own code: read the src of the main and left iframes from the DOM
    StringBuilder newSB = new StringBuilder();
    DOMContentUtils.getMainFrame(newSB, root);
    String mainFrame = newSB.toString();

    newSB.setLength(0);
    DOMContentUtils.getIgnoreFrame(newSB, root);
    String ignoreframe = newSB.toString();
 
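    // toAbsolutePath is a helper of this class (not shown here) that resolves
    // a relative src against the parent directory of the page URL
    String parent = baseUrl.substring(0, baseUrl.lastIndexOf('/'));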
    HashMap<String, String[]> generalMetaTags =  metaTags.getGeneralTags();
    if(!mainFrame.isEmpty())
    {
      mainFrame = toAbsolutePath(parent, mainFrame);
      LOG.info("Extension: add metatag.ext.mainframe: " + mainFrame + ", baseUrl: " + baseUrl);
      generalMetaTags.put("metatag.ext.mainframe", new String[] {mainFrame});
    }
    if(!ignoreframe.isEmpty())
    {
      ignoreframe = toAbsolutePath(parent, ignoreframe);
      LOG.info("Extension: add metatag.ext.ignoreframe: " + ignoreframe + ", baseUrl: " + baseUrl);
      generalMetaTags.put("metatag.ext.ignoreframe", new String[] {ignoreframe});
    }
    parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
    return parse;
  }
} 
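The toAbsolutePath helper called above is not shown in this post. One possible way to do the same job, sketched here as an assumption rather than as the original implementation, is to let java.net.URI resolve the iframe src against the page URL. The class FrameUrlResolver, the method toAbsoluteUrl, and the example.com URLs are hypothetical; unlike the call above, this sketch takes the full page URL rather than its parent directory.

```java
import java.net.URI;

// Hypothetical stand-in for the toAbsolutePath helper, not the original code.
class FrameUrlResolver {

  /**
   * Resolves an iframe src against the URL of the page that contains it.
   * URI.create throws IllegalArgumentException for a src that is not a valid
   * URI (for example one with unescaped spaces), so real code should guard for that.
   */
  static String toAbsoluteUrl(String pageUrl, String frameSrc) {
    return URI.create(pageUrl).resolve(frameSrc.trim()).toString();
  }

  public static void main(String[] args) {
    String pageUrl = "http://example.com/docs/index.htm";
    // prints http://example.com/docs/foldera/leftmenu.htm
    System.out.println(toAbsoluteUrl(pageUrl, "foldera/leftmenu.htm"));
    // prints http://example.com/products/overview.htm
    System.out.println(toAbsoluteUrl(pageUrl, "../products/overview.htm"));
  }
}
```

Resolving through URI also handles ../ segments and src values that are already absolute, which plain string concatenation with the parent path would not.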
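The following code extracts the src of the left and main iframes from the parsed page.
package org.jefferyyuan.codeexample.nutch.parse.html;

import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
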
public class DOMContentUtils {
  public static boolean getMainFrame(StringBuilder sb, Node node) {
    return getNodeAttributeValue(sb, node, "iframe", "name", "main", "src");
  }
  public static boolean getIgnoreFrame(StringBuilder sb, Node node) {
    return getNodeAttributeValue(sb, node, "iframe", "name", "left", "src");
  }
  
  public static boolean getNodeAttributeValue(StringBuilder sb, Node node,
      String checkedNodeName, String checkedAttName, String checkedAttValue,
      String returnAttName) {

    NodeWalker walker = new NodeWalker(node);

    while (walker.hasNext()) {
      Node currentNode = walker.nextNode();
      String nodeName = currentNode.getNodeName();
      short nodeType = currentNode.getNodeType();
      if (nodeType == Node.ELEMENT_NODE
          && checkedNodeName.equalsIgnoreCase(nodeName)) {
        NamedNodeMap attributes = currentNode.getAttributes();
        Node nameNode = attributes.getNamedItem(checkedAttName);
        if (nameNode == null) {
          nameNode = attributes.getNamedItem(checkedAttName.toUpperCase());
        }
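        if (nameNode == null) {
          // This iframe has no name attribute: skip it and keep walking
          // instead of aborting the search for the named iframe
          continue;
        }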
        if (checkedAttValue.equalsIgnoreCase(nameNode.getTextContent())) {
          Node srcNode = attributes.getNamedItem(returnAttName);
          if (srcNode != null) {
            String text = srcNode.getTextContent();
            if (text != null) {
              sb.append(text);
            }
          }
          return true;
        }
      }
    }
    return false;
  }
}  
Solr Post Process Handler
The following Solr post-process handler is called after the crawl finishes. It sets ignore:true on the left menu pages and the main iframe pages, and copies the body_stored field of the main iframe page to the main page.
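package org.jefferyyuan.codeexample.nutch.solr4;

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestHandler;
import org.apache.solr.response.ResultContext;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.UpdateHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
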
public class DocIndexPostProcessHandler extends RequestHandlerBase {
  private static final String PARAM_LOGGING = "logging_process";
  protected static final Logger logger = LoggerFactory
      .getLogger(DocIndexPostProcessHandler.class);
  private static final String FL_METATAG_EXT_MAINFRAME = "metatag.ext.mainframe";
  private static final String FL_METATAG_EXT_IGNOREFRAME = "metatag.ext.ignoreframe";

  private static final String FL_IGNORE = "ignore";
  private static final String FL_IGNORE_REASON = "ignore_reason";
  private static final String FL_EXTRA_MSG = "extra_msg";
  private static final String FL_PAGETYPE = "pagetype";
  private boolean defaultLogging = false;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      defaultLogging = params.getBool(PARAM_LOGGING, false);
    }
  }
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    SolrCore core = req.getCore();
    SolrParams params = req.getParams();
    boolean logging = params.getBool(PARAM_LOGGING, defaultLogging);
    SolrQueryRequest newReq = new LocalSolrQueryRequest(core,
        new ModifiableSolrParams());
    try {
      SolrRequestHandler searchHandler = core.getRequestHandler("/select");
      SolrIndexSearcher searcher = req.getSearcher();
      IndexSchema schema = core.getSchema();
      final String uniqueFname = schema.getUniqueKeyField().getName();

      UpdateHandler updateHandler = core.getUpdateHandler();
      int start = 0;
      int rows = 100;

      boolean hasMore = true;
      while (hasMore) {
        SolrQuery query = new SolrQuery(FL_METATAG_EXT_MAINFRAME + ":* OR "
            + FL_METATAG_EXT_IGNOREFRAME + ":*").setRows(rows).setStart(start);
        SolrQueryResponse newRsp = new SolrQueryResponse();
        newReq.setParams(query);
        searchHandler.handleRequest(newReq, newRsp);

        NamedList valuesNL = newRsp.getValues();
        Object rspObj = valuesNL.get("response");
        // if the request is sent to this core itself, rspObj will be
        // ResultContext
        if (rspObj instanceof ResultContext) {
          ResultContext resultContext = (ResultContext) rspObj;
          DocList doclist = resultContext.docs;
          if (doclist.size() < rows) {
            hasMore = false;
          }
          DocIterator dit = doclist.iterator();
          while (dit.hasNext()) {
            // Per-document state, reset for every page so that values found
            // for one page never leak into the next one
            String leftMenuBodyStored = null;
            StringBuilder extraMsgSb = new StringBuilder();
            int docid = dit.nextDoc();
            Document originDoc = searcher.doc(docid, new HashSet<String>());
            String originDocID = originDoc.get(uniqueFname);

            // Left menu frame: mark its doc as ignored
            String ignoreFrame = originDoc.get(FL_METATAG_EXT_IGNOREFRAME);
            if (ignoreFrame != null) {
              Query luceneQuery = new TermQuery(new Term(uniqueFname,
                  ignoreFrame));
              TopDocs topDocs = searcher.search(luceneQuery, 1);
              ScoreDoc[] sdocs = topDocs.scoreDocs;
              if (sdocs.length == 1) {
                ScoreDoc doc = sdocs[0];
                Document ignoreDoc = searcher.doc(doc.doc);

                SolrInputDocument ignoreSolrDoc = copyToSolrDoc(schema,
                    ignoreDoc);
                ignoreSolrDoc.setField(FL_IGNORE, true);
                ignoreSolrDoc.setField(FL_PAGETYPE, "menuframe");
                ignoreSolrDoc.setField(FL_IGNORE_REASON,
                    "Is marked as ignore.frame, should be left menu of page: "
                        + originDocID);
                if (logging)
                  logger
                      .info("DocSearchPostProcess: "
                          + ignoreSolrDoc.getFieldValue(uniqueFname)
                          + " is marked as ignore.frame, should be left menu of page: "
                          + originDocID);
                leftMenuBodyStored = ignoreDoc.get("body_stored");

                addDoc(req, updateHandler, ignoreSolrDoc);
              } else {
                extraMsgSb.append("Can't find " + ignoreFrame + "\t.");
              }
            }

            // Main frame: mark its doc as ignored and remember its content
            String mainBodyStored = null;
            String mainFrame = originDoc.get(FL_METATAG_EXT_MAINFRAME);
            if (mainFrame != null) {
              Query luceneQuery = new TermQuery(
                  new Term(uniqueFname, mainFrame));
              TopDocs topDocs = searcher.search(luceneQuery, 1);
              ScoreDoc[] sdocs = topDocs.scoreDocs;
              if (sdocs.length == 1) {
                ScoreDoc doc = sdocs[0];
                Document mainDoc = searcher.doc(doc.doc);
                SolrInputDocument mainSolrDoc = copyToSolrDoc(schema, mainDoc);
                mainSolrDoc.setField(FL_PAGETYPE, "mainframe");
                mainSolrDoc.setField(FL_IGNORE, true);
                mainSolrDoc.setField(FL_IGNORE_REASON,
                    "Is marked as main.frame, its content will be copied to "
                        + originDocID);
                if (logging)
                  logger
                      .info("DocSearchPostProcess: "
                          + mainSolrDoc.getFieldValue(uniqueFname)
                          + " is marked as main.frame, its content will be copied to "
                          + originDocID);

                mainBodyStored = mainDoc.get("body_stored");
                addDoc(req, updateHandler, mainSolrDoc);

              } else {
                extraMsgSb.append("Can't find " + mainFrame + "\t.");
              }
            }

            // Update the main page only if something was found or went wrong
            boolean needUpdateOriginDoc = false;
            if (mainBodyStored != null || leftMenuBodyStored != null
                || extraMsgSb.length() != 0) {
              needUpdateOriginDoc = true;
            }
            if (needUpdateOriginDoc) {
              SolrInputDocument originSolrDoc = copyToSolrDoc(schema, originDoc);
              // Keep the original body in origin_body_stored, then replace
              // body_stored with the content of the main frame
              if (mainBodyStored != null) {
                originSolrDoc.setField("origin_body_stored",
                    originSolrDoc.getFieldValue("body_stored"));
                originSolrDoc.setField("body_stored", mainBodyStored);
                originSolrDoc.setField(FL_PAGETYPE, "mainpage");
              }
              if (leftMenuBodyStored != null) {
                originSolrDoc.setField("menu_body_stored", leftMenuBodyStored);
              }
              if (extraMsgSb.length() > 0) {
                originSolrDoc.setField(FL_EXTRA_MSG, extraMsgSb.toString());
                if (logging)
                  logger.info("DocSearchPostProcess: when handle "
                      + originDocID + ", msg: " + extraMsgSb.toString());
              }
              addDoc(req, updateHandler, originSolrDoc);
            }
          }
        } else if (rspObj instanceof SolrDocumentList) {
          throw new RuntimeException("Not implemented yet as this request will not be sent to remote core.");
        }
        start += rows;
      }

    } finally {
      newReq.close();
    }
    commitChange(core);
  }
  private void addDoc(SolrQueryRequest req, UpdateHandler updateHandler,
      SolrInputDocument mainSolrDoc) throws IOException {
    AddUpdateCommand updateCommand = new AddUpdateCommand(req);
    updateCommand.solrDoc = mainSolrDoc;
    updateHandler.addDoc(updateCommand);
  }
  private void commitChange(final SolrCore core) {
    SolrRequestHandler commitHandler = core.getRequestHandler("/update");
    ModifiableSolrParams commitParams = new ModifiableSolrParams();
    commitParams.set("commit", "true");
    SolrQueryRequest commitReq = new LocalSolrQueryRequest(core, commitParams);
    try {
      commitHandler.handleRequest(commitReq, new SolrQueryResponse());
    } finally {
      commitReq.close();
    }
  }
  private SolrInputDocument copyToSolrDoc(IndexSchema schema, Document mainDoc) {
    SolrInputDocument mainSolrDoc = new SolrInputDocument();
    // still we keep all fields
    Iterator<IndexableField> it = mainDoc.iterator();
    while (it.hasNext()) {
      IndexableField indexableField = it.next();
      SchemaField solrSchemaField = schema.getField(indexableField.name());
      Object obj = solrSchemaField.getType().toObject(indexableField);
      mainSolrDoc.addField(indexableField.name(), obj);
    }
    return mainSolrDoc;
  }
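  // RequestHandlerBase in Solr 4.x declares these two methods abstract. They
  // are not shown in the post, so the bodies below are placeholders.
  @Override
  public String getDescription() {
    return "Post-processes crawled docs: hides frame pages and copies main frame content";
  }
  @Override
  public String getSource() {
    return "";
  }
}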