Boost Search Relevancy Using boilerpipe, Nutch and Solr

The Problem
We use Nutch to crawl web sites and save the content into Solr for search.

A website usually applies a template that defines the header, footer, and navigation menu. Take a fictional storage-related documentation site as an example: the word "storage" appears multiple times in the header, footer, and menus.
The website also has some very simple pages, such as contact-us or login. Because these pages contain little content, a search for "storage" is likely to rank them highly and list them on the first results page.
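The reason short pages win is Lucene's length normalization: with the classic TF-IDF similarity, a term match in a short field contributes roughly tf × idf × 1/sqrt(fieldLength), so one "storage" in a 20-word login page can outscore several occurrences in a 2000-word article. A toy sketch of the effect (simplified; real Lucene also factors in idf and encodes norms lossily):

```java
public class LengthNormDemo {
  // Simplified classic TF-IDF score for a single query term:
  // tf weight * length norm (idf is constant across docs, so omitted).
  public static double score(int termFreq, int fieldLength) {
    return Math.sqrt(termFreq) * (1.0 / Math.sqrt(fieldLength));
  }

  public static void main(String[] args) {
    // "storage" once in a 20-word login page...
    double shortPage = score(1, 20);
    // ...vs. five times in a 2000-word documentation page.
    double longPage = score(5, 2000);
    System.out.println(shortPage > longPage); // prints "true": the short page wins
  }
}
```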

We want to avoid this. We would like to save the main content into a dedicated field in Solr, and boost that field at query time.
The Solution
Change Nutch to Send Raw Content to Solr
By default, Nutch sends Solr the content with HTML tags stripped, not the raw HTML page content.
To send raw content to Solr, we have to create an extra Nutch plugin:

import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

public class ExtraIndexingFilter implements IndexingFilter {
  public static final String FL_RAWCONTENT = "rawcontent";
  private Configuration conf;
  private boolean indexRawContent;
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();

  static {
    // Tell Nutch this filter needs the raw page content loaded.
    FIELDS.add(WebPage.Field.CONTENT);
  }

  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    if (indexRawContent) {
      ByteBuffer bb = page.getContent();
      if (bb != null) {
        // Honor the buffer's offset and limit instead of assuming
        // the backing array is fully used.
        doc.add(FL_RAWCONTENT,
            new String(bb.array(), bb.arrayOffset() + bb.position(), bb.remaining()));
      }
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    indexRawContent = conf.getBoolean("index-extra.rawcontent", false);
  }

  public Configuration getConf() {
    return conf;
  }

  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }
}
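For Nutch to load the plugin, it also needs a plugin descriptor. A minimal plugin.xml sketch (the plugin id, jar name, and filter package are assumptions; adjust them to match your build):

```xml
<plugin id="index-extra" name="Extra Indexing Filter" version="1.0.0"
        provider-name="lifelongprogrammer">
  <runtime>
    <!-- The jar produced by the plugin's build. -->
    <library name="index-extra.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.lifelongprogrammer.nutch.indexer.extra"
             name="Extra Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <!-- Fully qualified class name of the filter above (package assumed). -->
    <implementation id="ExtraIndexingFilter"
                    class="org.lifelongprogrammer.nutch.ExtraIndexingFilter"/>
  </extension>
</plugin>
```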
Then edit nutch-site.xml: add the plugin (index-extra) to plugin.includes,
set index-extra.rawcontent to true, and set http.content.limit to -1 so Nutch fetches the whole page.
<property>
  <name>index-extra.rawcontent</name>
  <value>true</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
In solrindex-mapping.xml, add:
<field dest="rawcontent" source="rawcontent" />
Using boilerpipe to Remove Boilerplate Content in Solr
Next, we define a Solr update request processor that uses boilerpipe to remove the surplus "clutter" (boilerplate such as headers, footers, and navigation).
BoilerpipeProcessor uses boilerpipe to strip the boilerplate from originfield and saves the extracted main content into strippedField.
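boilerpipe classifies text blocks using shallow features such as word count and link density: navigation blocks are short and link-heavy, while article blocks are long runs of prose. A rough, self-contained illustration of that idea (not boilerpipe's actual algorithm, just the intuition behind it):

```java
public class TextDensityDemo {
  // Keep a block if it has enough words and is not mostly link text.
  public static boolean isContent(int words, int linkWords) {
    double linkDensity = words == 0 ? 1.0 : (double) linkWords / words;
    return words >= 10 && linkDensity < 0.33;
  }

  public static void main(String[] args) {
    System.out.println(isContent(3, 3));   // a "Home | Storage | Login" menu: false
    System.out.println(isContent(120, 2)); // a long paragraph of prose: true
  }
}
```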
import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Preconditions;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeProcessorFactory extends UpdateRequestProcessorFactory {
  private static final Logger logger = LoggerFactory
      .getLogger(BoilerpipeProcessorFactory.class);
  private boolean enabled = true;
  private String originfield, strippedField;
  private boolean removeOriginfield = true;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      enabled = params.getBool("enabled", true);
      if (!enabled) return;
      removeOriginfield = params.getBool("removeOriginfield", true);
      originfield = Preconditions.checkNotNull(params.get("originfield"),
          "Must set originfield.");
      strippedField = Preconditions.checkNotNull(params.get("strippedField"),
          "Must set strippedField.");
    }
  }

  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    // Returning null tells Solr to skip this processor in the chain.
    if (!enabled) return null;
    return new BoilerpipeProcessor(next);
  }

  private class BoilerpipeProcessor extends UpdateRequestProcessor {
    public BoilerpipeProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      Collection<Object> values = doc.getFieldValues(originfield);
      if (values != null) {
        for (Object obj : values) {
          if (obj == null) continue;
          try {
            // Extract the main article text from the raw HTML.
            String strippedText = ArticleExtractor.getInstance().getText(
                obj.toString());
            doc.addField(strippedField, strippedText);
          } catch (BoilerpipeProcessingException e) {
            logger.error("Error when using boilerpipe to strip text.", e);
          }
        }
        // Remove the raw field only after processing all of its values.
        if (removeOriginfield) {
          doc.removeField(originfield);
        }
      }
      super.processAdd(cmd);
    }
  }
}
Add the processor to the default update chain in solrconfig.xml:
<updateRequestProcessorChain name="defaultChain" default="true">
  <processor
   class="org.lifelongprogrammer.BoilerpipeProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="originfield">rawcontent</str>
    <str name="strippedField">main_content</str>
    <bool name="removeOriginfield">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" /> 
</updateRequestProcessorChain>

Add the main_content field to schema.xml:
<field name="main_content" type="text_rev" indexed="true" stored="true"  omitNorms="false" />
After all this, we can change our search handler to boost matches on the main_content field:
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Omitted -->
    <str name="qf">main_content^10 body_stored</str>
  </lst>
</requestHandler>
Resources
boilerpipe library
Filtering Source Code Using boilerpipe