Programmer: Lifelong Learning: Nutch2: Speed up Nutch Crawling

The fetch step is likely to take most of the time.
Increase the total number of fetcher threads (fetcher.threads.fetch) and the number of threads per queue (fetcher.threads.per.queue).
Decrease fetcher.server.delay, the delay in seconds between successive requests to the same server.
Add Solr documents asynchronously: change the "/update" request handler to an implementation that returns immediately and adds the documents to Solr in the background.

<property>
 <name>fetcher.server.delay</name>
 <value>0.1</value>
</property>
<property>
 <name>fetcher.threads.fetch</name>
 <value>100</value>
</property>
<property>
 <name>fetcher.threads.per.queue</name>
 <value>100</value>
</property>
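
For the asynchronous Solr update mentioned above, the custom handler would presumably be registered in solrconfig.xml in place of the stock "/update" handler. The fragment below is only a sketch: the class name com.example.AsyncUpdateRequestHandler is hypothetical and stands in for whatever implementation queues the documents and returns right away. It is not a class shipped with Solr.

```xml
<!-- solrconfig.xml (sketch): replaces the stock "/update" handler -->
<!-- com.example.AsyncUpdateRequestHandler is a hypothetical, user-written class
     that accepts the request, returns immediately, and indexes in the background -->
<requestHandler name="/update" class="com.example.AsyncUpdateRequestHandler" />
```

With a handler like this, Nutch's indexing step no longer waits for each batch to be written to Solr, but failures during the background add are not reported back to the crawler, so they would need to be logged on the Solr side.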

Resources
http://wiki.apache.org/nutch/OptimizingCrawls
http://www.supermind.org/blog/274/improving-nutch-for-constrained-crawls
http://tech--help.blogspot.com/2010/06/tweak-nutch-default-crawling-speed.html
