Some links in our documentation site are dynamically generated: especially the left side menu. This cause Nutch2 and Google unable to crawl all pages in our site. So we decide to have one invisible link which lists all pages in our site.
But Nutch2 is unable to get all outlinks from the invisible listing-all-pages page.
In org.apache.nutch.parse.ParseUtil.process(String, WebPage), Nutch use parameter db.max.outlinks.per.page to specify the max number of outlinks Nutch fetches from a page.
int maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
We can set db.max.outlinks.per.page to -1 or tell Nutch to get all outlinks.
Meanwhile, we need change http.content.limit to -1, so Nutch will parse all content of a page, change http.timeout to some bigger number.
We will put our change in nutch-site.xml like below:
But Nutch2 is unable to get all outlinks from the invisible listing-all-pages page.
In org.apache.nutch.parse.ParseUtil.process(String, WebPage), Nutch use parameter db.max.outlinks.per.page to specify the max number of outlinks Nutch fetches from a page.
int maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
We can set db.max.outlinks.per.page to -1 or tell Nutch to get all outlinks.
Meanwhile, we need change http.content.limit to -1, so Nutch will parse all content of a page, change http.timeout to some bigger number.
We will put our change in nutch-site.xml like below:
db.max.outlinks.per.page -1 http.content.limit -1 http.timeout 1000000 db.ignore.internal.links false file.content.limit -1