Nutch2: Index Raw Content and Outlinks into Solr


By default, Nutch2 does not index raw HTML content or outlinks into Solr, but in some cases we need to store them there.
We can write a Nutch2 plugin to do this.
How to Implement
We create our own IndexingFilter and override its getFields method so that it returns WebPage.Field.CONTENT and WebPage.Field.OUTLINKS. This causes Nutch to read these two fields from the underlying storage into the WebPage instance that is passed to org.apache.nutch.indexer.IndexerJob.IndexerMapper.map(String, WebPage, Context).
The fields are collected in org.apache.nutch.indexer.IndexerJob.createIndexJob(Configuration, String, String):
Collection<WebPage.Field> fields = getFields(job);
StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class, IndexerMapper.class);

In our IndexingFilter, we then read these two fields and add them to the NutchDocument.
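The effect of getFields can be pictured with a small stand-alone model. This is only a sketch of the idea described above, not Nutch's actual code: the Field enum, the FieldSource interface, the fieldsToLoad method and the default field set are all hypothetical stand-ins, chosen so the example runs with the JDK alone.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FieldUnionSketch {
  // Hypothetical stand-in for WebPage.Field.
  enum Field { BASE_URL, TEXT, TITLE, CONTENT, OUTLINKS }

  // Hypothetical stand-in for the getFields() part of an IndexingFilter.
  interface FieldSource {
    Collection<Field> getFields();
  }

  // Union of the job's default fields and the fields each filter asks for.
  // Only fields in this set would be read from storage for each page.
  static Set<Field> fieldsToLoad(Collection<Field> defaults, List<FieldSource> filters) {
    Set<Field> fields = EnumSet.noneOf(Field.class);
    fields.addAll(defaults);
    for (FieldSource filter : filters) {
      fields.addAll(filter.getFields());
    }
    return fields;
  }

  public static void main(String[] args) {
    // Assumed defaults, for illustration only.
    Set<Field> defaults = EnumSet.of(Field.BASE_URL, Field.TEXT, Field.TITLE);
    // Our filter requests the two extra fields.
    FieldSource myFilter = () -> EnumSet.of(Field.CONTENT, Field.OUTLINKS);
    System.out.println(fieldsToLoad(defaults, Arrays.asList(myFilter)));
    // prints [BASE_URL, TEXT, TITLE, CONTENT, OUTLINKS]
  }
}
```

Without the filter's contribution, CONTENT and OUTLINKS would be missing from the set, so the corresponding getters on the page would have nothing to return at indexing time.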
Implementation Code
Two properties, myindexer.index.rawcontent and myindexer.index.outlinks, control whether raw content and outlinks are indexed.
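package org.apache.nutch.indexer.myindexer;

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyIndexingFilter implements IndexingFilter {
  public static final String FL_RAWCONTENT = "rawcontent";
  public static final String FL_OUTLINKS = "outlinks";
  private static final Logger LOG = LoggerFactory.getLogger(MyIndexingFilter.class);
  private Configuration conf;
  private boolean indexRawContent;
  private boolean indexOutlinks;

  // Fields Nutch must load from storage before filter() is called.
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
  }

  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }

  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    try {
      if (indexRawContent) {
        ByteBuffer bb = page.getContent();
        if (bb != null) {
          // Decode only the readable region of the buffer. UTF-8 is an
          // assumption; pages in other encodings need their own charset.
          doc.add(FL_RAWCONTENT, new String(bb.array(),
              bb.arrayOffset() + bb.position(), bb.remaining(),
              Charset.forName("UTF-8")));
        }
      }
      Map<Utf8, Utf8> outlinks = page.getOutlinks();
      if (indexOutlinks && outlinks != null) {
        // Skip outlinks that differ only by letter case.
        HashSet<String> set = new HashSet<String>();
        for (Utf8 value : outlinks.keySet()) {
          String outlink = TableUtil.toString(value);
          String outLinkLower = outlink.toLowerCase();
          if (!set.contains(outLinkLower)) {
            doc.add(FL_OUTLINKS, outlink);
            set.add(outLinkLower);
          }
        }
      }
    } catch (Exception e) {
      LOG.error(this.getClass().getName() + " throws exception: ", e);
      throw new IndexingException(e);
    }
    return doc;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // Both features are off unless enabled in nutch-site.xml.
    indexRawContent = conf.getBoolean("myindexer.index.rawcontent", false);
    indexOutlinks = conf.getBoolean("myindexer.index.outlinks", false);
  }
}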
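Two details of the filter method can be tried in isolation with plain JDK classes: turning a ByteBuffer into a String, and the case-insensitive de-duplication of outlinks. The sketch below is not part of the plugin. The class and method names (FilterDetailsDemo, decode, dedupIgnoreCase) are made up for illustration, and UTF-8 is assumed as the page encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterDetailsDemo {

  // Decode only the readable region [position, limit) of an array-backed buffer.
  static String decode(ByteBuffer bb) {
    return new String(bb.array(), bb.arrayOffset() + bb.position(),
        bb.remaining(), StandardCharsets.UTF_8);
  }

  // Keep the first spelling of each link, comparing case-insensitively.
  static List<String> dedupIgnoreCase(List<String> links) {
    Set<String> seen = new HashSet<String>();
    List<String> result = new ArrayList<String>();
    for (String link : links) {
      if (seen.add(link.toLowerCase(Locale.ROOT))) {
        result.add(link);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    byte[] backing = "xx<html>hi</html>yyyy".getBytes(StandardCharsets.UTF_8);
    // The buffer exposes only 15 bytes starting at offset 2.
    ByteBuffer bb = ByteBuffer.wrap(backing, 2, 15);

    // array() returns the whole backing array, padding included.
    System.out.println(new String(bb.array(), StandardCharsets.UTF_8));
    // prints xx<html>hi</html>yyyy

    System.out.println(decode(bb));
    // prints <html>hi</html>

    List<String> links = Arrays.asList(
        "http://Example.com/a", "http://example.com/A", "http://example.com/b");
    System.out.println(dedupIgnoreCase(links));
    // prints [http://Example.com/a, http://example.com/b]
  }
}
```

Note that comparing whole URLs case-insensitively also merges links whose paths differ only by case, which some servers treat as different pages. Whether that is acceptable depends on the crawl.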
nutch-site.xml
We then define myindexer.index.rawcontent and myindexer.index.outlinks in nutch-site.xml.
<property>
 <name>myindexer.index.rawcontent</name>
 <value>true</value>
</property>
<property>
 <name>myindexer.index.outlinks</name>
 <value>true</value>
</property>
This post omits the code that packages the filter as a Nutch2 plugin, and the rawcontent and outlinks field definitions that must be added to Solr's schema.xml.
