This series shows how to use Nutch and Solr to implement features like Google Search's "Jump to" and anchor links. This article covers the second step: using an UpdateRequestProcessor to store the anchor tags and content in Solr.
The Problem
In the search results, we want to add anchor links below the page description so that users can jump straight to the section they are interested in, just like Google Search's "Jump to" and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch.
Also refer to
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
2. Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
This is described in this article.
3. Using DocTransformer to Add Anchor tag and content into response.
Task: Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
In the previous article, we used Nutch to extract the anchor tags, texts and content from each web page and added them to the Solr document as three fields: anchorTags, anchorTexts and anchorContents. Each of these fields is a list of strings.
On the Solr side, an UpdateRequestProcessor removes these three fields from the page document and adds a new document for each anchor. A docType field tells the two kinds apart: 0 means the document is a web page, and 1 means it is an anchor.
The web page document and its anchor documents have a parent-child relationship: each anchor document stores the unique key of its page in a foreign-key field.
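The split described above can be sketched in plain Java, without any Solr classes, to make the data shape clear. This is only an illustration under assumptions: the class name AnchorSplitSketch, the method split and the unique key name "id" are hypothetical, and maps stand in for Solr documents. The other field names follow the ones used in this article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Plain-Java illustration of the split; maps stand in for Solr documents. */
final class AnchorSplitSketch {

    /**
     * Turns three parallel lists into one map per anchor.
     * parentId plays the role of the page's unique key, stored here
     * in the "url" foreign-key field.
     */
    static List<Map<String, Object>> split(String parentId, List<String> tags,
            List<String> texts, List<String> contents) {
        if (tags.size() != texts.size() || tags.size() != contents.size()) {
            throw new IllegalArgumentException("anchor lists must have the same size");
        }
        List<Map<String, Object>> anchors = new ArrayList<>();
        for (int i = 0; i < tags.size(); i++) {
            Map<String, Object> anchor = new LinkedHashMap<>();
            // "id" is an assumed unique key name; each anchor gets its own random id.
            anchor.put("id", UUID.randomUUID().toString());
            anchor.put("url", parentId);   // link back to the parent page
            anchor.put("docType", 1);      // 1 marks an anchor, 0 a web page
            anchor.put("anchorTag", tags.get(i));
            anchor.put("anchorText", texts.get(i));
            anchor.put("anchorContent", contents.get(i));
            anchor.put("anchorOrder", i);  // keeps the order of anchors on the page
            anchors.add(anchor);
        }
        return anchors;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> anchors = split(
                "http://example.com/page",
                List.of("#install", "#usage"),
                List.of("Install", "Usage"),
                List.of("Download the archive and unpack it.", "Run the start script."));
        for (Map<String, Object> anchor : anchors) {
            System.out.println(anchor.get("anchorOrder") + " " + anchor.get("anchorTag")
                    + " -> parent " + anchor.get("url"));
        }
    }
}
```

The anchorOrder value is what later lets the anchors be shown in the same order as they appear on the page.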
Code
import static com.google.common.base.Preconditions.checkNotNull;

import java.io.IOException;
import java.util.Collection;
import java.util.Iterator;
import java.util.UUID;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class AnchorContentProcessorFactory extends UpdateRequestProcessorFactory {
  // Names of the multi-valued source fields on the page document.
  private String fromFlAnchorTags, fromFlAnchorTexts, fromFlAnchorContents;
  // Names of the single-valued target fields on each anchor document.
  private String toFlAnchorTag, toFlAnchorText, toFlAnchorContent,
      toFlAnchorOrder, flForeignKey;

  @Override
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromFlAnchorTags = checkNotNull(params.get("fromFlAnchorTags"),
          "fromFlAnchorTags can't be null");
      fromFlAnchorTexts = checkNotNull(params.get("fromFlAnchorTexts"),
          "fromFlAnchorTexts can't be null");
      fromFlAnchorContents = checkNotNull(params.get("fromFlAnchorContents"),
          "fromFlAnchorContents can't be null");
      toFlAnchorTag = checkNotNull(params.get("toFlAnchorTag"),
          "toFlAnchorTag can't be null");
      toFlAnchorText = checkNotNull(params.get("toFlAnchorText"),
          "toFlAnchorText can't be null");
      toFlAnchorContent = checkNotNull(params.get("toFlAnchorContent"),
          "toFlAnchorContent can't be null");
      toFlAnchorOrder = checkNotNull(params.get("toFlAnchorOrder"),
          "toFlAnchorOrder can't be null");
      flForeignKey = checkNotNull(params.get("flForeignKey"),
          "flForeignKey can't be null");
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new AnchorContentProcessor(next);
  }

  class AnchorContentProcessor extends UpdateRequestProcessor {
    public AnchorContentProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument oldDoc = cmd.solrDoc;
      // docType 0 means this item is a full web page.
      // docType 1 means this item is an anchor.
      oldDoc.setField("docType", 0);

      Collection<Object> fromAnchorTags = oldDoc.getFieldValues(fromFlAnchorTags);
      Collection<Object> fromAnchorTexts = oldDoc.getFieldValues(fromFlAnchorTexts);
      Collection<Object> fromAnchorContents = oldDoc.getFieldValues(fromFlAnchorContents);

      if (fromAnchorTags != null && fromAnchorTexts != null
          && fromAnchorContents != null) {
        if (fromAnchorTags.size() != fromAnchorTexts.size()
            || fromAnchorTags.size() != fromAnchorContents.size()) {
          throw new RuntimeException("size doesn't match: size of fromAnchorTags: "
              + fromAnchorTags.size() + ", size of fromAnchorTexts: "
              + fromAnchorTexts.size() + ", size of fromAnchorContents: "
              + fromAnchorContents.size());
        }

        String uniqueFl = cmd.getReq().getSchema().getUniqueKeyField().getName();
        Iterator<Object> it1 = fromAnchorTags.iterator(),
            it2 = fromAnchorTexts.iterator(),
            it3 = fromAnchorContents.iterator();
        int order = 0;
        while (it1.hasNext()) {
          // Use a fresh command and document for each anchor so that no
          // state cached on a previous command is carried over.
          AddUpdateCommand newCmd = new AddUpdateCommand(cmd.getReq());
          SolrInputDocument newDoc = new SolrInputDocument();
          newDoc.addField(toFlAnchorTag, it1.next().toString());
          newDoc.addField(toFlAnchorText, it2.next().toString());
          newDoc.addField(toFlAnchorContent, it3.next().toString());
          newDoc.addField(toFlAnchorOrder, order++);
          // Each anchor gets its own random unique key and points back to its page.
          newDoc.addField(uniqueFl, UUID.randomUUID().toString());
          newDoc.addField(flForeignKey, oldDoc.getFieldValue(uniqueFl).toString());
          // set docType 1 for the anchor item
          newDoc.addField("docType", 1);
          newCmd.solrDoc = newDoc;
          super.processAdd(newCmd);
        }
      }

      // The page document itself no longer carries the three list fields.
      oldDoc.removeField(fromFlAnchorTags);
      oldDoc.removeField(fromFlAnchorTexts);
      oldDoc.removeField(fromFlAnchorContents);
      super.processAdd(cmd);
    }
  }
}
SolrConfig.xml
<processor class="com.commvault.solr.update.processor.CVAnchorContentProcessorFactory">
  <str name="fromFlAnchorTags">anchorTags</str>
  <str name="fromFlAnchorTexts">anchorTexts</str>
  <str name="fromFlAnchorContents">anchorContents</str>
  <str name="toFlAnchorTag">anchorTag</str>
  <str name="toFlAnchorText">anchorText</str>
  <str name="toFlAnchorContent">anchorContent</str>
  <str name="toFlAnchorOrder">anchorOrder</str>
  <str name="flForeignKey">url</str>
</processor>
Schema.xml
<field name="docType" type="tint" indexed="true" stored="true" multiValued="false" />
<field name="anchorTag" type="string" indexed="false" stored="true" multiValued="false" />
<field name="anchorText" type="string" indexed="false" stored="true" multiValued="false" />
<field name="anchorContent" type="text_rev" indexed="true" stored="false" multiValued="false" />
<field name="anchorOrder" type="tint" indexed="true" stored="true" multiValued="false" />
Resource
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression