Nutch2: Index Raw Content and Outlinks into Solr

By default, Nutch2 doesn't index raw HTML content or outlinks into Solr, but in some cases we may need to save them there.
We can create a Nutch2 plugin to do this.
How to Implement
We create our own IndexingFilter and override its getFields method so that the returned Collection includes WebPage.Field.CONTENT and WebPage.Field.OUTLINKS. This causes Nutch to read these two fields from the underlying storage into the WebPage instance passed to IndexerMapper: org.apache.nutch.indexer.IndexerJob.IndexerMapper.map(String, WebPage, Context).
The fields are collected when the job is created, in org.apache.nutch.indexer.IndexerJob.createIndexJob(Configuration, String, String):
Collection<WebPage.Field> fields = getFields(job);
StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class, IndexerMapper.class);

In our IndexingFilter, we then read these two fields and add them to the NutchDocument.
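To make the mechanism concrete, here is a minimal sketch of the idea using only the JDK: the set of fields loaded from storage is the union of what the job itself needs and what each filter asks for via getFields. The Field enum, the FieldSource interface and the fieldsToLoad method are hypothetical stand-ins for illustration, not Nutch classes, and the real IndexerJob may differ in detail.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class FieldUnionSketch {
  // Hypothetical stand-in for WebPage.Field; only a few constants are listed.
  enum Field { BASE_URL, TEXT, TITLE, CONTENT, OUTLINKS }

  // Hypothetical stand-in for the getFields() part of an indexing filter.
  interface FieldSource {
    Collection<Field> getFields();
  }

  // Union of the job's own fields and the fields requested by every filter.
  static Set<Field> fieldsToLoad(Collection<Field> jobFields, List<FieldSource> filters) {
    Set<Field> all = EnumSet.noneOf(Field.class);
    all.addAll(jobFields);
    for (FieldSource filter : filters) {
      all.addAll(filter.getFields());
    }
    return all;
  }

  public static void main(String[] args) {
    // A filter that asks for raw content and outlinks, like the one in this post.
    FieldSource myFilter = () -> EnumSet.of(Field.CONTENT, Field.OUTLINKS);
    Set<Field> fields = fieldsToLoad(EnumSet.of(Field.TEXT, Field.TITLE),
        Collections.singletonList(myFilter));
    System.out.println(fields);
  }
}
```

A field that no filter requests is never loaded, which is why page.getContent() would come back empty in a filter that does not list CONTENT in getFields.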
Implementation Code
We use two properties, myindexer.index.rawcontent and myindexer.index.outlinks, to control whether raw content and outlinks are indexed.
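package org.apache.nutch.indexer.myindexer;

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.Collection;
import java.util.HashSet;

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyIndexingFilter implements IndexingFilter {
  private static final Logger LOG = LoggerFactory.getLogger(MyIndexingFilter.class);
  public static final String FL_RAWCONTENT = "rawcontent";
  public static final String FL_OUTLINKS = "outlinks";
  private Configuration conf;
  private boolean indexRawContent;
  private boolean indexOutlinks;

  // Fields Nutch must load from storage before filter() is called.
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
  }

  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }

  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    try {
      if (indexRawContent) {
        ByteBuffer bb = page.getContent();
        if (bb != null) {
          // Decode only the readable bytes (position..limit) instead of the
          // whole backing array. UTF-8 is an assumption; the page's actual
          // encoding may differ.
          doc.add(FL_RAWCONTENT, Charset.forName("UTF-8").decode(bb.duplicate()).toString());
        }
      }
      if (indexOutlinks && page.getOutlinks() != null) {
        // Skip outlinks that differ only by case.
        HashSet<String> seen = new HashSet<String>();
        for (Utf8 value : page.getOutlinks().keySet()) {
          String outlink = TableUtil.toString(value);
          if (seen.add(outlink.toLowerCase())) {
            doc.add(FL_OUTLINKS, outlink);
          }
        }
      }
    } catch (Exception e) {
      LOG.error(this.getClass().getName() + " throws exception: ", e);
      throw new IndexingException(e);
    }
    return doc;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // Both options are off unless enabled in nutch-site.xml.
    indexRawContent = conf.getBoolean("myindexer.index.rawcontent", false);
    indexOutlinks = conf.getBoolean("myindexer.index.outlinks", false);
  }
}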
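One detail worth checking when turning page content into a String: ByteBuffer.array() returns the entire backing array regardless of the buffer's position and limit, and new String(byte[]) uses the platform default charset. The following is a minimal sketch, using only the JDK, of helpers that decode just the readable region and deduplicate links case-insensitively. The class and method names are hypothetical, and UTF-8 is an assumption rather than something Nutch guarantees.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class IndexFieldHelpers {

  // Decode only the readable region (position..limit) of the buffer.
  // Assumption: the bytes are UTF-8; the page's real encoding may differ.
  static String decode(ByteBuffer bb) {
    if (bb == null) {
      return null;
    }
    // duplicate() shares the bytes but has its own position and limit,
    // so the caller's buffer is left untouched.
    return StandardCharsets.UTF_8.decode(bb.duplicate()).toString();
  }

  // Keep the first spelling of each link, comparing case-insensitively.
  static List<String> dedupeIgnoreCase(Iterable<String> links) {
    Set<String> seen = new HashSet<>();
    List<String> unique = new ArrayList<>();
    for (String link : links) {
      if (seen.add(link.toLowerCase(Locale.ROOT))) {
        unique.add(link);
      }
    }
    return unique;
  }

  public static void main(String[] args) {
    byte[] backing = "xx<html>hi</html>yy".getBytes(StandardCharsets.UTF_8);
    // The buffer exposes only the 15 bytes between the "xx" and "yy" padding.
    ByteBuffer slice = ByteBuffer.wrap(backing, 2, 15);
    System.out.println(decode(slice));
    System.out.println(dedupeIgnoreCase(
        Arrays.asList("http://A.com/", "http://a.com/", "http://b.com/")));
  }
}
```

Note that URL paths are case-sensitive on many servers, so case-insensitive deduplication can merge links that point to different pages; whether that trade-off is acceptable depends on the crawl.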
nutch-site.xml
We then define myindexer.index.rawcontent and myindexer.index.outlinks in nutch-site.xml.
<property>
 <name>myindexer.index.rawcontent</name>
 <value>true</value>
</property>
<property>
 <name>myindexer.index.outlinks</name>
 <value>true</value>
</property>
Here we omit the code that registers the filter as a Nutch2 plugin, and the changes that add the rawcontent and outlinks fields to Solr's schema.xml.