Nutch2: Crawl and Index the Extra (img alt) Attribute


By default, Nutch only saves the text content of a web page into the "content" field.
For our documentation site, our boss wants the value of each img alt attribute to be crawled into the content field as well, and then indexed into Solr.
To do this, we can easily extend Nutch's DOMContentUtils.getTextHelper(StringBuilder, Node, boolean, int).

Implementation Code

private boolean getTextHelper(StringBuilder sb, Node node,
    boolean abortOnNestedAnchors, int anchorDepth) {
  boolean abort = false;
  NodeWalker walker = new NodeWalker(node);

  while (walker.hasNext()) {
    Node currentNode = walker.nextNode();
    String nodeName = currentNode.getNodeName();
    short nodeType = currentNode.getNodeType();
    // omitted...
    // Append the alt value of every img element to the extracted text.
    if (nodeType == Node.ELEMENT_NODE && "img".equalsIgnoreCase(nodeName)) {
      NamedNodeMap attributes = currentNode.getAttributes();
      Node altNode = attributes.getNamedItem("alt");
      if (altNode != null) {
        // Separate the alt text from preceding content so words do not run together.
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(altNode.getTextContent());
      }
    }
  }
  return abort;
}
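
The same idea can be tried outside Nutch. The sketch below is a minimal, standalone illustration rather than Nutch code: it assumes well-formed XHTML input and uses the JDK's DOM parser and a plain recursive walk in place of Nutch's HTML parser and NodeWalker. The class name ImgAltTextDemo and its helper methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Nutch.
class ImgAltTextDemo {

  // Parses well-formed XHTML and returns its text content plus all img alt values.
  static String extractText(String xhtml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
      StringBuilder sb = new StringBuilder();
      collect(doc.getDocumentElement(), sb);
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse input as XML", e);
    }
  }

  // Depth-first walk in document order (stands in for Nutch's NodeWalker).
  private static void collect(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      append(sb, node.getNodeValue());
    } else if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      // Same check as in getTextHelper: pick up the alt attribute if present.
      Node alt = node.getAttributes().getNamedItem("alt");
      if (alt != null) {
        append(sb, alt.getNodeValue());
      }
    }
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      collect(child, sb);
    }
  }

  // Collapses whitespace and joins fragments with a single space.
  private static void append(StringBuilder sb, String text) {
    String cleaned = text.replaceAll("\\s+", " ").trim();
    if (cleaned.isEmpty()) {
      return;
    }
    if (sb.length() > 0) {
      sb.append(' ');
    }
    sb.append(cleaned);
  }
}
```

For example, calling ImgAltTextDemo.extractText("<p>Install guide <img src=\"a.png\" alt=\"Setup wizard screenshot\"/> then restart.</p>") should return the paragraph text with the alt value placed where the image appears, which is the behavior we want in the "content" field.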

You may also read
Nutch2: Index Raw Content and Outlinks into Solr
Nutch2: Parse All Content and Get All Outlinks
Nutch2 : Extend Nutch2 to Get Custom Outlinks from Javascript Files
Nutch2: Extend Nutch2 to Crawl IFrames Pages
