By default, Nutch2 does not index a page's raw HTML content or its outlinks into Solr, but in some cases we need to store them there.
We can create a Nutch2 plugin to do this.
How to Implement
We create our own IndexingFilter and implement its getFields method so that the returned Collection contains WebPage.Field.CONTENT and WebPage.Field.OUTLINKS. This causes Nutch to read these two fields from the underlying storage into the WebPage instance that is passed to the mapper, org.apache.nutch.indexer.IndexerJob.IndexerMapper.map(String, WebPage, Context).
The requested fields are collected when the job is set up, in org.apache.nutch.indexer.IndexerJob.createIndexJob(Configuration, String, String):

    Collection<WebPage.Field> fields = getFields(job);
    StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class, IndexerMapper.class);

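The effect of getFields can be pictured with a small stand-alone sketch. It does not use the Nutch classes: FieldUnionSketch, Field, FieldAware and fieldsToLoad are hypothetical stand-ins, and the sketch only assumes that the job loads the union of its own columns and whatever each filter declares.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Stand-ins for illustration only; these are NOT the Nutch types.
class FieldUnionSketch {

    /** Hypothetical subset of the columns a stored page can carry. */
    enum Field { STATUS, TEXT, CONTENT, OUTLINKS }

    /** Hypothetical counterpart of a filter that declares the columns it needs. */
    interface FieldAware {
        Collection<Field> getFields();
    }

    /** Union of the job's own columns and everything the filters ask for. */
    static Set<Field> fieldsToLoad(Collection<Field> jobDefaults, List<FieldAware> filters) {
        Set<Field> columns = EnumSet.noneOf(Field.class);
        columns.addAll(jobDefaults);
        for (FieldAware filter : filters) {
            columns.addAll(filter.getFields());
        }
        return columns;
    }

    public static void main(String[] args) {
        // A filter that asks for the two extra columns, like the one in this post
        FieldAware rawContentFilter = () -> Arrays.asList(Field.CONTENT, Field.OUTLINKS);
        Set<Field> columns = fieldsToLoad(
                EnumSet.of(Field.STATUS, Field.TEXT),
                Arrays.asList(rawContentFilter));
        System.out.println(columns); // [STATUS, TEXT, CONTENT, OUTLINKS]
    }
}
```

A column that no filter declares is never part of the set, which is why a filter that reads content or outlinks without declaring them would see them missing from the page it is handed.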
In the filter method of our IndexingFilter, we then read these two fields and add them to the NutchDocument.
Implementation Code
Two properties, myindexer.index.rawcontent and myindexer.index.outlinks, control whether the raw content and the outlinks are indexed. Both default to false.
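    package org.apache.nutch.indexer.myindexer;

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.avro.util.Utf8;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.TableUtil;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class MyIndexingFilter implements IndexingFilter {
      public static final String FL_RAWCONTENT = "rawcontent";
      public static final String FL_OUTLINKS = "outlinks";

      private static final Logger LOG = LoggerFactory.getLogger(MyIndexingFilter.class);

      // Fields Nutch must load from storage before filter() is called
      private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
      static {
        FIELDS.add(WebPage.Field.CONTENT);
        FIELDS.add(WebPage.Field.OUTLINKS);
      }

      private Configuration conf;
      private boolean indexRawContent;
      private boolean indexOutlinks;

      public Collection<WebPage.Field> getFields() {
        return FIELDS;
      }

      public NutchDocument filter(NutchDocument doc, String url, WebPage page)
          throws IndexingException {
        try {
          if (indexRawContent) {
            ByteBuffer bb = page.getContent();
            if (bb != null) {
              // Decode only the readable part of the buffer. UTF-8 is an
              // assumption here; the page's real encoding may differ.
              doc.add(FL_RAWCONTENT, Charset.forName("UTF-8").decode(bb.duplicate()).toString());
            }
          }
          if (indexOutlinks && page.getOutlinks() != null) {
            // Skip outlinks that differ only in letter case
            Set<String> seen = new HashSet<String>();
            for (Utf8 value : page.getOutlinks().keySet()) {
              String outlink = TableUtil.toString(value);
              if (seen.add(outlink.toLowerCase())) {
                doc.add(FL_OUTLINKS, outlink);
              }
            }
          }
        } catch (Exception e) {
          LOG.error(getClass().getName() + " throws exception: ", e);
          throw new IndexingException(e);
        }
        return doc;
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
        indexRawContent = conf.getBoolean("myindexer.index.rawcontent", false);
        indexOutlinks = conf.getBoolean("myindexer.index.outlinks", false);
      }

      public Configuration getConf() {
        return conf;
      }
    }
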
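The two transformations inside filter can be tried outside Nutch. The following is a minimal sketch using only the JDK; FilterHelpersSketch, decode and dedupeIgnoreCase are hypothetical names, and UTF-8 is an assumed encoding. It shows why reading bb.array() directly can return more than the page content when the buffer's readable window is smaller than its backing array, and how the case-insensitive outlink check keeps the first spelling of each link.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only; not part of the plugin.
class FilterHelpersSketch {

    /** Decodes only the readable window of the buffer, without moving its position. */
    static String decode(ByteBuffer buffer, Charset charset) {
        // duplicate() shares the bytes but has its own position and limit
        return charset.decode(buffer.duplicate()).toString();
    }

    /** Keeps the first spelling of each link, comparing case-insensitively. */
    static List<String> dedupeIgnoreCase(Iterable<String> links) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String link : links) {
            // Set.add returns false when the lower-cased form was already present
            if (seen.add(link.toLowerCase(Locale.ROOT))) {
                unique.add(link);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        byte[] backing = "xx<html>hi</html>yy".getBytes(StandardCharsets.UTF_8);
        // A buffer whose readable window is smaller than its backing array
        ByteBuffer window = ByteBuffer.wrap(backing, 2, backing.length - 4);
        System.out.println(decode(window, StandardCharsets.UTF_8));             // <html>hi</html>
        System.out.println(new String(window.array(), StandardCharsets.UTF_8)); // xx<html>hi</html>yy

        System.out.println(dedupeIgnoreCase(Arrays.asList(
                "http://Example.com/a", "http://example.com/A", "http://example.com/b")));
        // [http://Example.com/a, http://example.com/b]
    }
}
```

Lower-casing the whole URL also merges links whose paths differ only in case. Many servers treat paths as case-sensitive, so this check may drop links that point to different pages.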
We then define myindexer.index.rawcontent and myindexer.index.outlinks in nutch-site.xml.
    <property>
      <name>myindexer.index.rawcontent</name>
      <value>true</value>
    </property>
    <property>
      <name>myindexer.index.outlinks</name>
      <value>true</value>
    </property>

Here we omit the code that packages this as a Nutch2 plugin and the changes that add the rawcontent and outlinks fields to Solr's schema.xml.