By default, Nutch only saves the text content of a web page into the "content" field.
For our documentation site, our boss wants the value of each img alt attribute crawled into the content field as well, and indexed into Solr.
To do this, we can easily extend Nutch's DOMContentUtils.getTextHelper(StringBuilder, Node, boolean, int).
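Before touching Nutch itself, the idea can be tried in isolation. The following is only a minimal sketch, assuming a well-formed XHTML string and the JDK's built-in XML parser rather than Nutch's own parser and NodeWalker. The names ImgAltTextSketch, collectText, appendWithSpace and extract are hypothetical and not part of Nutch. Putting a single space between fragments is also an assumption made here for readability.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical standalone sketch; not part of Nutch.
class ImgAltTextSketch {

    // Walks the DOM in document order, collecting text nodes and img alt values.
    static void collectText(Node node, StringBuilder sb) {
        short nodeType = node.getNodeType();
        if (nodeType == Node.TEXT_NODE) {
            appendWithSpace(sb, node.getNodeValue());
        } else if (nodeType == Node.ELEMENT_NODE
                && "img".equalsIgnoreCase(node.getNodeName())) {
            NamedNodeMap attributes = node.getAttributes();
            Node altNode = attributes.getNamedItem("alt");
            if (altNode != null) {
                appendWithSpace(sb, altNode.getNodeValue());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), sb);
        }
    }

    // Collapses whitespace and separates fragments with a single space (an assumption).
    static void appendWithSpace(StringBuilder sb, String text) {
        String cleaned = text.replaceAll("\\s+", " ").trim();
        if (cleaned.isEmpty()) {
            return;
        }
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(cleaned);
    }

    // Parses a well-formed XHTML string and returns its text plus img alt values.
    static String extract(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        StringBuilder sb = new StringBuilder();
        collectText(doc.getDocumentElement(), sb);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Install guide</p>"
                + "<img src=\"a.png\" alt=\"Setup wizard screenshot\"/></body></html>";
        System.out.println(extract(page)); // prints "Install guide Setup wizard screenshot"
    }
}
```

Real-world HTML is often not well-formed XML, so this sketch only illustrates the attribute lookup. Inside Nutch the same check runs on the DOM that Nutch's HTML parser has already built.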
Implementation Code
```java
private boolean getTextHelper(StringBuilder sb, Node node,
    boolean abortOnNestedAnchors, int anchorDepth) {
  boolean abort = false;
  NodeWalker walker = new NodeWalker(node);

  while (walker.hasNext()) {
    Node currentNode = walker.nextNode();
    String nodeName = currentNode.getNodeName();
    short nodeType = currentNode.getNodeType();
    // omitted...

    // get img alt value
    if (nodeType == Node.ELEMENT_NODE) {
      if ("img".equalsIgnoreCase(nodeName)) {
        NamedNodeMap attributes = currentNode.getAttributes();
        Node nameNode = attributes.getNamedItem("alt");
        if (nameNode != null) {
          sb.append(nameNode.getTextContent());
        }
      }
    }
  }
  return abort;
}
```

You may also read
Nutch2: Index Raw Content and Outlinks into Solr
Nutch2: Parse All Content and Get All Outlinks
Nutch2: Extend Nutch2 to Get Custom Outlinks from Javascript Files
Nutch2: Extend Nutch2 to Crawl IFrames Pages