We use Nutch 2 to crawl a documentation site and store the index in Solr 4.x to implement a documentation search function.
But I met one problem: the documentation site uses COOLjsTree; its htm pages define the left-side menu in tree_nodes.js, which looks like this:
END_USER: {
NODES: [
["End User 1", "../../products/end_user1.htm", "_top"],
["End User 2", "../../products/end_user2.htm", "_top"],
],
TITLE: " End-User"
}
Nutch 2 provides the parse-js plugin to find outlinks defined in JavaScript files or embedded JavaScript sections. But it's not flexible: it uses the following regular expressions to find outlinks:
From org.apache.nutch.parse.js.JSParseFilter:

private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";

It can find links like http://site.com/folder/pagea.html, but it doesn't work for the links we defined in our tree_nodes.js.
Luckily, we can easily write our own Nutch plugin to modify or extend Nutch.
We can create our own ext-parse-js plugin and write our own ParseFilter and Parser to extract outlinks from our tree_nodes.js file.
Design Concept
We want to make our new ext-parse-js plugin configurable and extensible, so we will add the following parameters to nutch-site.xml:
ext.js.file.include.pattern: specifies which files to parse. In our case, it is .*/tree_nodes.js.
ext.js.extract.outlink.pattern: how to extract outlinks from the matched files. In our case, it is \"([^\"]*.[htm|html|pdf])\" (see the sketch after this list).
ext.js.absolute.url.pattern: which urls are treated as absolute urls. By default it is ^[http|https|www].*
ext.js.indexjs: whether to index js files to Solr.
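To see what the outlink pattern extracts from the tree_nodes.js sample above, here is a quick standalone sketch. It uses java.util.regex instead of the jakarta-oro (Perl5) classes the plugin itself uses, and a slightly stricter variant of the configured pattern, so treat it as an illustration of the idea rather than the plugin's exact behavior:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkPatternDemo {
  public static void main(String[] args) {
    // Sample content from tree_nodes.js (see the example above).
    String js = "[\"End User 1\", \"../../products/end_user1.htm\", \"_top\"],\n"
        + "[\"End User 2\", \"../../products/end_user2.htm\", \"_top\"],";

    // A tightened variant of ext.js.extract.outlink.pattern:
    // capture any quoted string that ends with .htm, .html or .pdf.
    Pattern outlinkPattern = Pattern.compile("\"([^\"]+\\.(?:htm|html|pdf))\"");

    Matcher matcher = outlinkPattern.matcher(js);
    while (matcher.find()) {
      // Prints ../../products/end_user1.htm and ../../products/end_user2.htm;
      // these relative paths are later resolved against the tree_nodes.js URL.
      System.out.println(matcher.group(1));
    }
  }
}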
Implementation Code
The complete source code can be found on GitHub.
First, check whether the file matches ext.js.file.include.pattern; if not, return directly. If it matches, extract links from the file using the regular expression ext.js.extract.outlink.pattern. For each extracted url, check whether it is an absolute url via ext.js.absolute.url.pattern; if not, convert it to an absolute url.
package org.jefferyyuan.codeexample.nutch.parse.js.treenodes;

// Imports omitted for brevity: org.apache.oro.text.regex.*, org.apache.nutch.parse.*,
// org.apache.nutch.storage.WebPage, org.apache.hadoop.conf.Configuration, org.w3c.dom.*, etc.

public class JSParseFilter implements ParseFilter, Parser {
  public static final Logger LOG = LoggerFactory.getLogger(JSParseFilter.class);

  private static final int MAX_TITLE_LEN = 80;
  private static final String DEFAULT_FILE_INCLUDE_PATTERN_STR = "*.js";
  private static final String ABSOLUTE_URL_PATTERN_STR = "^[http|https|www].*";

  private static Pattern fileIncludePath, absoluteURLPpattern, outlinkPattern;
  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;
    PatternCompiler patternCompiler = new Perl5Compiler();
    try {
      String str = conf.get("ext.js.file.include.pattern",
          DEFAULT_FILE_INCLUDE_PATTERN_STR);
      fileIncludePath = patternCompiler.compile(str,
          Perl5Compiler.READ_ONLY_MASK | Perl5Compiler.SINGLELINE_MASK);

      str = conf.get("ext.js.absolute.url.pattern", ABSOLUTE_URL_PATTERN_STR);
      absoluteURLPpattern = patternCompiler.compile(str,
          Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
              | Perl5Compiler.SINGLELINE_MASK);

      str = conf.get("ext.js.extract.outlink.pattern");
      if (!StringUtils.isBlank(str)) {
        outlinkPattern = patternCompiler.compile(str,
            Perl5Compiler.READ_ONLY_MASK | Perl5Compiler.MULTILINE_MASK);
      }
    } catch (MalformedPatternException e) {
      throw new RuntimeException(e);
    }
  }

  // Required by the Configurable interface; also used in getParse below.
  public Configuration getConf() {
    return conf;
  }

  private boolean shouldHandlePage(WebPage page) {
    boolean shouldHandle = false;
    String url = TableUtil.toString(page.getBaseUrl());
    PatternMatcher matcher = new Perl5Matcher();
    if (matcher.matches(url, fileIncludePath)) {
      shouldHandle = true;
    }
    return shouldHandle;
  }

  private static String toAbsolutePath(String baseUrl, String path)
      throws MalformedPatternException {
    PatternMatcher matcher = new Perl5Matcher();
    boolean isAbsolute = false;
    if (matcher.matches(path, absoluteURLPpattern)) {
      isAbsolute = true;
    }
    if (isAbsolute) {
      return path;
    }
    while (true) {
      if (!path.startsWith("../")) {
        break;
      }
      baseUrl = baseUrl.substring(0, baseUrl.lastIndexOf('/'));
      path = path.substring(3);
    }
    return baseUrl + "/" + path;
  }

  public static Outlink[] getJSLinks(String plainText, String anchor, String base) {
    long start = System.currentTimeMillis();
    // the base is always an absolute path: http://.../tree_nodes.js, change it
    // to its folder
    base = base.substring(0, base.lastIndexOf('/'));
    final List<Outlink> outlinks = new ArrayList<Outlink>();
    URL baseURL = null;
    try {
      baseURL = new URL(base);
    } catch (Exception e) {
      if (LOG.isErrorEnabled()) {
        LOG.error("error assigning base URL", e);
      }
    }
    try {
      final PatternMatcher matcher = new Perl5Matcher();
      final PatternMatcherInput input = new PatternMatcherInput(plainText);
      MatchResult result;
      String url;
      // loop the matches
      while (matcher.contains(input, outlinkPattern)) {
        if (System.currentTimeMillis() - start >= 60000L) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Time limit exceeded for getOutLinks");
          }
          break;
        }
        result = matcher.getMatch();
        url = result.group(1);
        // See if the candidate URL is parseable. If not, pass and move on to
        // the next match.
        try {
          url = new URL(toAbsolutePath(base, url)).toString();
          LOG.info("Extension added: " + url + " and baseURL " + baseURL);
        } catch (MalformedURLException ex) {
          LOG.info("Extension - failed URL parse '" + url + "' and baseURL '"
              + baseURL + "'", ex);
          continue;
        }
        try {
          outlinks.add(new Outlink(url.toString(), anchor));
        } catch (MalformedURLException mue) {
          LOG.warn("Extension Invalid url: '" + url + "', skipping.");
        }
      }
    } catch (Exception ex) {
      LOG.error("getOutlinks", ex);
    }
    final Outlink[] retval;
    if (outlinks != null && outlinks.size() > 0) {
      retval = outlinks.toArray(new Outlink[0]);
    } else {
      retval = new Outlink[0];
    }
    return retval;
  }

  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    if (shouldHandlePage(page)) {
      ArrayList<Outlink> outlinks = new ArrayList<Outlink>();
      walk(doc, parse, metaTags, url, outlinks);
      if (outlinks.size() > 0) {
        Outlink[] old = parse.getOutlinks();
        String title = parse.getTitle();
        List<Outlink> list = Arrays.asList(old);
        outlinks.addAll(list);
        ParseStatus status = parse.getParseStatus();
        String text = parse.getText();
        Outlink[] newlinks = outlinks.toArray(new Outlink[outlinks.size()]);
        return new Parse(text, title, newlinks, status);
      }
    }
    return parse;
  }

  private void walk(Node n, Parse parse, HTMLMetaTags metaTags, String base,
      List<Outlink> outlinks) {
    if (n instanceof Element) {
      String name = n.getNodeName();
      if (name.equalsIgnoreCase("script")) {
        @SuppressWarnings("unused")
        String lang = null;
        Node lNode = n.getAttributes().getNamedItem("language");
        if (lNode == null)
          lang = "javascript";
        else
          lang = lNode.getNodeValue();
        StringBuilder script = new StringBuilder();
        NodeList nn = n.getChildNodes();
        if (nn.getLength() > 0) {
          for (int i = 0; i < nn.getLength(); i++) {
            if (i > 0)
              script.append('\n');
            script.append(nn.item(i).getNodeValue());
          }
          Outlink[] links = getJSLinks(script.toString(), "", base);
          if (links != null && links.length > 0)
            outlinks.addAll(Arrays.asList(links));
          // no other children of interest here, go one level up.
          return;
        }
      } else {
        // process all HTML 4.0 events, if present...
        NamedNodeMap attrs = n.getAttributes();
        int len = attrs.getLength();
        for (int i = 0; i < len; i++) {
          Node anode = attrs.item(i);
          Outlink[] links = null;
          if (anode.getNodeName().startsWith("on")) {
            links = getJSLinks(anode.getNodeValue(), "", base);
          } else if (anode.getNodeName().equalsIgnoreCase("href")) {
            String val = anode.getNodeValue();
            if (val != null && val.toLowerCase().indexOf("javascript:") != -1) {
              links = getJSLinks(val, "", base);
            }
          }
          if (links != null && links.length > 0)
            outlinks.addAll(Arrays.asList(links));
        }
      }
    }
    NodeList nl = n.getChildNodes();
    for (int i = 0; i < nl.getLength(); i++) {
      walk(nl.item(i), parse, metaTags, base, outlinks);
    }
  }

  public Parse getParse(String url, WebPage page) {
    if (!shouldHandlePage(page)) {
      return ParseStatusUtils.getEmptyParse(
          ParseStatusCodes.FAILED_INVALID_FORMAT, "Content not JavaScript: '"
              + TableUtil.toString(page.getContentType()) + "'", getConf());
    }
    String script = new String(page.getContent().array());
    Outlink[] outlinks = getJSLinks(script, "", url);
    if (outlinks == null)
      outlinks = new Outlink[0];
    // Title? use the first line of the script...
    String title;
    int idx = script.indexOf('\n');
    if (idx != -1) {
      if (idx > MAX_TITLE_LEN)
        idx = MAX_TITLE_LEN;
      title = script.substring(0, idx);
    } else {
      idx = Math.min(MAX_TITLE_LEN, script.length());
      title = script.substring(0, idx);
    }
    Parse parse = new Parse(script, title, outlinks,
        ParseStatusUtils.STATUS_SUCCESS);
    return parse;
  }
}
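Like any Nutch plugin, ext-parse-js also needs a plugin.xml descriptor so the plugin framework can discover it. The following is only a rough sketch modeled on the bundled parse-js plugin; the jar name, extension ids, and the contentType/pathSuffix parameters are assumptions that must match your own build:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="ext-parse-js" name="Ext JS Parser/ParseFilter" version="1.0.0"
        provider-name="org.jefferyyuan">
  <runtime>
    <!-- assumed jar name produced by the plugin's build -->
    <library name="ext-parse-js.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <!-- register the class as a Parser for JavaScript content -->
  <extension id="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.JSParseFilter"
             name="ExtJsParser" point="org.apache.nutch.parse.Parser">
    <implementation id="ExtJsParser"
        class="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.JSParseFilter">
      <parameter name="contentType" value="application/javascript"/>
      <parameter name="pathSuffix" value="js"/>
    </implementation>
  </extension>
  <!-- and as a ParseFilter so it can scan scripts embedded in HTML pages -->
  <extension id="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.JSParseFilterFilter"
             name="ExtJsParseFilter" point="org.apache.nutch.parse.ParseFilter">
    <implementation id="ExtJsParseFilter"
        class="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.JSParseFilter"/>
  </extension>
</plugin>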
Configuration
Then we need to include ext-parse-js in nutch-site.xml and add the new properties:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|ext-parse-js|parse-(html|tika|metatags)|index-(basic|static|metadata|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|subcollection</value>
</property>
<property>
  <name>ext.js.file.include.pattern</name>
  <value>.*/tree_nodes.js</value>
</property>
<property>
  <name>ext.js.absolute.url.pattern</name>
  <value>^[http|https|www].*</value>
</property>
<property>
  <name>ext.js.extract.outlink.pattern</name>
  <value>\"([^\"]*.[htm|html|pdf])\"</value>
</property>
<property>
  <name>ext.js.indexjs</name>
  <value>false</value>
</property>

Then change parse-plugins.xml to make Nutch use the ext-parse-js plugin to parse JavaScript files, as sketched below.
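A rough sketch of that parse-plugins.xml change, assuming we map the common JavaScript mime types to ext-parse-js and alias the plugin id to our JSParseFilter class (the exact mime-type entries already present in your parse-plugins.xml may differ):

<mimeType name="application/javascript">
  <plugin id="ext-parse-js" />
</mimeType>
<mimeType name="text/javascript">
  <plugin id="ext-parse-js" />
</mimeType>
<mimeType name="application/x-javascript">
  <plugin id="ext-parse-js" />
</mimeType>

<aliases>
  <!-- keep the existing aliases and add this one -->
  <alias name="ext-parse-js"
         extension-id="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.JSParseFilter" />
</aliases>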
Then we need to change regex-urlfilter.txt so that Nutch will fetch JavaScript files: remove |js|JS from the following line.
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
Finally, since we don't need to store the content of JavaScript files in Solr, we can either write a Solr UpdateRequestProcessor to ignore documents whose url field ends with .js, or change org.apache.nutch.indexer.solr.SolrWriter.write(NutchDocument) like below:
public class SolrWriter implements NutchIndexWriter {
  private boolean indexjs;

  public void open(TaskAttemptContext job) throws IOException {
    Configuration conf = job.getConfiguration();
    indexjs = conf.getBoolean("ext.js.indexjs", false);
    // ... existing open() logic (creating the SolrServer, etc.) stays as-is ...
  }

  public void write(NutchDocument doc) throws IOException {
    String urlValue = doc.getFieldValue("url");
    if (!indexjs) {
      if (urlValue != null && urlValue.endsWith(".js")) {
        LOG.info("CVExtension ignore js file: " + urlValue);
        return;
      }
    }
    // ... existing logic that adds the document to Solr stays as-is ...
  }
}
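If you go with the Solr-side option instead of patching SolrWriter, a minimal sketch of such an UpdateRequestProcessor might look like the following. The class name and the url field name are assumptions; the factory still has to be registered in an updateRequestProcessorChain in solrconfig.xml and used by the update handler Nutch posts to:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class IgnoreJsUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object url = doc.getFieldValue("url");
        // Silently drop documents whose url ends with .js; everything else
        // is passed on to the rest of the update chain.
        if (url != null && url.toString().endsWith(".js")) {
          return;
        }
        super.processAdd(cmd);
      }
    };
  }
}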
References
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample
http://florianhartl.com/nutch-plugin-tutorial.html