Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr

This series explains how to use Nutch and Solr to implement a feature similar to Google Search's "Jump to" and anchor links. This article shows how to use an UpdateRequestProcessor to store the anchor tag and content in Solr.
The Problem 
To help users jump directly to the section they may be interested in, we want to add anchor links below the page description in the search results, just like Google Search's "Jump to" and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch.
Also refer to
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
2. Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
This step is described in this article.
3. Using DocTransformer to Add Anchor Tag and Content into the Response

Task: Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
In the previous article, we used Nutch to extract the anchor tag, text and content from each web page and add them to the Solr document as three fields: anchorTags, anchorTexts and anchorContents. Each of these fields is a list of strings.

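On the Solr side, an UpdateRequestProcessor removes these three fields from the web page document and adds a new document for each anchor. A docType field distinguishes the two kinds of document: 0 means the document is a web page, 1 means it is an anchor.
The web page document and its anchor documents have a parent-child relationship: each anchor document stores the unique key of its page in a foreign key field.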
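Before looking at the Solr code, the split itself can be sketched in plain Java. This is only an illustration under stated assumptions: a Map stands in for SolrInputDocument, the class name AnchorSplitSketch and the unique key field name "id" are hypothetical, and "url" is used as the foreign key field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Plain-Java sketch of the split; Map stands in for SolrInputDocument.
final class AnchorSplitSketch {

  /** Builds one child (anchor) document per entry of the three parallel lists. */
  static List<Map<String, Object>> toAnchorDocs(String parentKey,
      List<String> tags, List<String> texts, List<String> contents) {
    if (tags.size() != texts.size() || tags.size() != contents.size()) {
      throw new IllegalArgumentException("anchor lists differ in size");
    }
    List<Map<String, Object>> children = new ArrayList<>();
    for (int i = 0; i < tags.size(); i++) {
      Map<String, Object> child = new LinkedHashMap<>();
      child.put("id", UUID.randomUUID().toString()); // hypothetical unique key field
      child.put("url", parentKey);                   // foreign key back to the page
      child.put("anchorTag", tags.get(i));
      child.put("anchorText", texts.get(i));
      child.put("anchorContent", contents.get(i));
      child.put("anchorOrder", i);                   // position of the anchor on the page
      child.put("docType", 1);                       // 1 = anchor, 0 = web page
      children.add(child);
    }
    return children;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> docs = toAnchorDocs("http://example.com/page",
        Arrays.asList("#intro", "#setup"),
        Arrays.asList("Introduction", "Setup"),
        Arrays.asList("Intro text", "Setup text"));
    for (Map<String, Object> doc : docs) {
      System.out.println(doc.get("anchorOrder") + " " + doc.get("anchorTag")
          + " -> " + doc.get("url"));
    }
  }
}
```

The parent page keeps docType 0 and loses the three list fields; only the children carry the per-anchor values.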
Code
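import java.io.IOException;
import java.util.Collection;
import java.util.Iterator;
import java.util.Locale;
import java.util.UUID;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// checkNotNull is assumed to be Guava's Preconditions.checkNotNull
import static com.google.common.base.Preconditions.checkNotNull;

public class AnchorContentProcessorFactory extends
    UpdateRequestProcessorFactory {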
  
  private String fromFlAnchorTags, fromFlAnchorTexts, fromFlAnchorContents;
  private String toFlAnchorTag, toFlAnchorText, toFlAnchorContent,
      toFlAnchorOrder, flForeignKey;
  
  @Override
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromFlAnchorTags = checkNotNull(params.get("fromFlAnchorTags"),
          "fromFlAnchorTags can't be null");
      fromFlAnchorTexts = checkNotNull(params.get("fromFlAnchorTexts"),
          "fromFlAnchorTexts can't be null");
      fromFlAnchorContents = checkNotNull(params.get("fromFlAnchorContents"),
          "fromFlAnchorContents can't be null");
      
      toFlAnchorTag = checkNotNull(params.get("toFlAnchorTag"),
          "toFlAnchorTag can't be null");
      toFlAnchorText = checkNotNull(params.get("toFlAnchorText"),
          "toFlAnchorText can't be null");
      toFlAnchorContent = checkNotNull(params.get("toFlAnchorContent"),
          "toFlAnchorContent can't be null");
      toFlAnchorOrder = checkNotNull(params.get("toFlAnchorOrder"),
          "toFlAnchorOrder can't be null");
      flForeignKey = checkNotNull(params.get("flForeignKey"),
          "flForeignKey can't be null");
    }
  }
  
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new AnchorContentProcessor(next);
  }
  
  class AnchorContentProcessor extends UpdateRequestProcessor {
    
    public AnchorContentProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    
    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      
      SolrInputDocument oldDoc = cmd.solrDoc;
      // docType 0 marks a full web page.
      // docType 1 marks an anchor.
      oldDoc.setField("docType", 0);
      Collection<Object> fromAnchorTags = oldDoc
          .getFieldValues(fromFlAnchorTags);
      Collection<Object> fromAnchorTexts = oldDoc
          .getFieldValues(fromFlAnchorTexts);
      Collection<Object> fromAnchorContents = oldDoc
          .getFieldValues(fromFlAnchorContents);
      
      if (fromAnchorTags != null && fromAnchorTexts != null
          && fromAnchorContents != null) {
        if (fromAnchorTags.size() != fromAnchorTexts.size()
            || fromAnchorTags.size() != fromAnchorContents.size()) {
          throw new RuntimeException(
              "size doesn't match: size of fromAnchorTags: "
                  + fromAnchorTags.size() + ", size of fromAnchorTexts: "
                  + fromAnchorTexts.size() + ", size of fromAnchorContents: "
                  + fromAnchorContents.size());
        }
        
        // add a new document
        AddUpdateCommand newCmd = new AddUpdateCommand(cmd.getReq());
        SolrInputDocument newDoc = new SolrInputDocument();
        
        Iterator<Object> it1 = fromAnchorTags.iterator(), it2 = fromAnchorTexts
            .iterator(), it3 = fromAnchorContents.iterator();
        int order = 0;
        while (it1.hasNext()) {
          // reuse one SolrInputDocument instead of constructing a new one per anchor
          newDoc.clear();
          newDoc.addField(toFlAnchorTag, it1.next().toString());
          newDoc.addField(toFlAnchorText, it2.next().toString());
          newDoc.addField(toFlAnchorContent, it3.next().toString());
          newDoc.addField(toFlAnchorOrder, order++);
          
          String uniqueFl = newCmd.getReq().getSchema().getUniqueKeyField()
              .getName();
          // each anchor document gets its own random unique key
          newDoc.addField(uniqueFl,
              UUID.randomUUID().toString().toLowerCase(Locale.ROOT));
          // the foreign key field points back to the parent web page
          newDoc.addField(flForeignKey, oldDoc.getFieldValue(uniqueFl)
              .toString());
          // set docType 1 for the anchor item
          newDoc.addField("docType", 1);
          newCmd.solrDoc = newDoc;
          super.processAdd(newCmd);
        }
      }
      
      oldDoc.removeField(fromFlAnchorTags);
      oldDoc.removeField(fromFlAnchorTexts);
      oldDoc.removeField(fromFlAnchorContents);
      super.processAdd(cmd);
    }
  } 
}
solrconfig.xml
<processor
    class="com.commvault.solr.update.processor.AnchorContentProcessorFactory">
  <str name="fromFlAnchorTags">anchorTags</str>
  <str name="fromFlAnchorTexts">anchorTexts</str>
  <str name="fromFlAnchorContents">anchorContents</str>

  <str name="toFlAnchorTag">anchorTag</str>
  <str name="toFlAnchorText">anchorText</str>
  <str name="toFlAnchorContent">anchorContent</str>
  <str name="toFlAnchorOrder">anchorOrder</str>
  <str name="flForeignKey">url</str>
</processor>
schema.xml
<field name="docType" type="tint" indexed="true" stored="true" multiValued="false" />
<field name="anchorTag" type="string" indexed="false" stored="true" multiValued="false" />
<field name="anchorText" type="string" indexed="false" stored="true" multiValued="false" />
<field name="anchorContent" type="text_rev" indexed="true" stored="false" multiValued="false" />
<field name="anchorOrder" type="tint" indexed="true" stored="true" multiValued="false" />
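With these fields in place, the anchor documents of one page can be fetched by filtering on docType and the foreign key and sorting by anchorOrder. The following is a minimal sketch of how such request parameters might be built. The class and method names are hypothetical, and it assumes the foreign key field (url in the configuration above) is indexed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds Solr request parameters for the anchors of one page.
final class AnchorQuerySketch {

  /** Returns a URL-encoded parameter string selecting a page's anchors in page order. */
  static String anchorQueryParams(String foreignKeyField, String pageKey) {
    // Quote the page key as a phrase and escape backslashes and quotes inside it.
    String escaped = pageKey.replace("\\", "\\\\").replace("\"", "\\\"");
    String parentFilter = foreignKeyField + ":\"" + escaped + "\"";
    return "q=" + enc("*:*")
        + "&fq=" + enc("docType:1")          // anchors only
        + "&fq=" + enc(parentFilter)         // anchors of this page only
        + "&sort=" + enc("anchorOrder asc"); // same order as on the page
  }

  private static String enc(String s) {
    try {
      return URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(anchorQueryParams("url", "http://example.com/page"));
  }
}
```

Since anchorTag and anchorText are stored but not indexed, they can be returned in results but not searched; anchorContent is the reverse.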
Resources
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression