Programmer: Lifelong Learning: Nutch2: Crawl and Index Extra(image alt) Tag

By default, Nutch only save the text conetnt of a webpage into "content" field.
For our documentation site, our boss wants to crawl the value of img alt property into content field and save index into Solr.
To do this, we can easily extend Nutch's DOMContentUtils.getTextHelper(StringBuilder, Node, boolean, int).

Implementation Code

private boolean getTextHelper(StringBuilder sb, Node node,
  boolean abortOnNestedAnchors, int anchorDepth) {
boolean abort = false;
NodeWalker walker = new NodeWalker(node);

while (walker.hasNext()) {
  Node currentNode = walker.nextNode();
  String nodeName = currentNode.getNodeName();
  short nodeType = currentNode.getNodeType();
  // omitted... 
  // get img alt value
  if (nodeType == Node.ELEMENT_NODE) {
 if ("img".equalsIgnoreCase(nodeName)) {
   NamedNodeMap attributes = currentNode.getAttributes();
   Node nameNode = attributes.getNamedItem("alt");
   if (nameNode != null) {
  sb.append(nameNode.getTextContent());
   }
 }
  }
}
return abort;
}

You may also read
Nutch2: Index Raw Content and Outlinks into Solr
Nutch2: Parse All Content and Get All Outlinks
Nutch2 : Extend Nutch2 to Get Custom Outlinks from Javascript Files
Nutch2: Extend Nutch2 to Crawl IFrames Pages

Nutch2: Crawl and Index Extra(image alt) Tag

Labels