By default, Nutch2 doesn't index raw HTML content or outlinks into Solr. But in some cases, we may need to save them into Solr.
We can create a Nutch2 plugin to do this.
How to Implement
We create our own IndexingFilter, override its getFields method, and add WebPage.Field.CONTENT and WebPage.Field.OUTLINKS to the returned Collection. This causes Nutch to read these two fields from the underlying storage into the WebPage instance that is passed to IndexerMapper: org.apache.nutch.indexer.IndexerJob.IndexerMapper.map(String, WebPage, Context)
The requested fields are handed to the mapper in org.apache.nutch.indexer.IndexerJob.createIndexJob(Configuration, String, String):
  Collection<WebPage.Field> fields = getFields(job);
  StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class, IndexerMapper.class);
In our IndexingFilter, we then read these two fields and add them to the NutchDocument.
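One detail worth checking when reading the content field: page.getContent() returns a java.nio.ByteBuffer, and calling array() on it yields the whole backing array, not just the readable window between position and limit. The sketch below is plain Java with no Nutch dependency. RawContentDecoder is a hypothetical helper name, and UTF-8 is an assumption, since a crawled page may be in another encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Nutch: decodes the readable bytes of a buffer. */
final class RawContentDecoder {

    private RawContentDecoder() {}

    /** Decodes only the bytes between position and limit. */
    static String decode(ByteBuffer buffer, Charset charset) {
        // duplicate() shares the bytes but has its own position and limit,
        // so decoding does not disturb the caller's buffer.
        return charset.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        byte[] backing = "xx<html>hi</html>yy".getBytes(StandardCharsets.UTF_8);
        // A view that skips the first two and last two bytes of the backing array.
        ByteBuffer view = ByteBuffer.wrap(backing, 2, backing.length - 4);

        // Decodes the whole backing array, including the bytes outside the view.
        System.out.println(new String(view.array(), StandardCharsets.UTF_8));
        // Decodes the readable window only.
        System.out.println(decode(view, StandardCharsets.UTF_8));
    }
}
```

Whether the buffer Nutch hands over ever has a non-zero position or a limit shorter than the array depends on the storage backend, so treat this as a defensive habit rather than a confirmed bug.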
Implementation Code
We use two properties, myindexer.index.rawcontent and myindexer.index.outlinks, to control whether to index raw content and outlinks.
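package org.apache.nutch.indexer.myindexer;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.HashSet;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyIndexingFilter implements IndexingFilter {
  public static final String FL_RAWCONTENT = "rawcontent";
  public static final String FL_OUTLINKS = "outlinks";
  private static final Logger LOG = LoggerFactory.getLogger(MyIndexingFilter.class);
  private Configuration conf;
  private boolean indexRawContent;
  private boolean indexOutlinks;
  // Extra WebPage fields that Nutch must load from storage before calling filter().
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
  }

  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }

  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    try {
      if (indexRawContent) {
        ByteBuffer bb = page.getContent();
        if (bb != null) {
          // Decode only the readable bytes. UTF-8 is an assumption here;
          // the page may have been served in another charset.
          doc.add(FL_RAWCONTENT, new String(bb.array(),
              bb.arrayOffset() + bb.position(), bb.remaining(), StandardCharsets.UTF_8));
        }
      }
      if (indexOutlinks) {
        // Skip outlinks that differ from an earlier one only by case.
        HashSet<String> set = new HashSet<String>();
        for (Utf8 value : page.getOutlinks().keySet()) {
          String outlink = TableUtil.toString(value);
          String outLinkLower = outlink.toLowerCase();
          if (!set.contains(outLinkLower)) {
            doc.add(FL_OUTLINKS, outlink);
            set.add(outLinkLower);
          }
        }
      }
    } catch (Exception e) {
      LOG.error(this.getClass().getName() + " throws exception: ", e);
      throw new IndexingException(e);
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // Both switches default to false, so nothing extra is indexed unless enabled.
    indexRawContent = conf.getBoolean("myindexer.index.rawcontent", false);
    indexOutlinks = conf.getBoolean("myindexer.index.outlinks", false);
  }

  // Required by the Configurable interface that IndexingFilter extends.
  public Configuration getConf() {
    return conf;
  }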
}
Nutch-site.xml
We then define myindexer.index.rawcontent and myindexer.index.outlinks in nutch-site.xml.
<property>
  <name>myindexer.index.rawcontent</name>
  <value>true</value>
</property>
<property>
  <name>myindexer.index.outlinks</name>
  <value>true</value>
</property>
Here we omit the code that packages this filter as a Nutch2 plugin, and the changes that add the rawcontent and outlinks fields to Solr's schema.xml.
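The outlink loop in filter keeps the first spelling of each link and drops later ones that differ only by case. That logic can be tried in isolation with plain Java. OutlinkDeduper and the example.com URLs below are hypothetical and only illustrate the behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: case-insensitive de-duplication of links. */
final class OutlinkDeduper {

    private OutlinkDeduper() {}

    /** Returns the links in input order, keeping the first spelling of each duplicate. */
    static List<String> dedupeIgnoringCase(Iterable<String> links) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String link : links) {
            // Set.add returns false when the lower-cased key was already present.
            if (seen.add(link.toLowerCase(Locale.ROOT))) {
                unique.add(link);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://example.com/A", "http://example.com/a", "http://example.com/b");
        System.out.println(dedupeIgnoringCase(links));
    }
}
```

Note that URL paths are case-sensitive on many servers, so this comparison can merge two outlinks that point to different pages. If that matters for your index, compare the links exactly instead.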