Nutch2: Speed up Nutch Crawling


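The fetch step is likely to take most of the crawl time, so tune it first.
Increase the total number of fetcher threads and the number of threads per queue:
fetcher.threads.fetch and fetcher.threads.per.queue
Decrease fetcher.server.delay, the delay between successive requests to the same server.
Add Solr docs asynchronously:
Change the "/update" request handler to an implementation that returns immediately and adds the Solr documents in the background.
Example settings for the fetcher properties (overrides like these normally go in conf/nutch-site.xml):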
<property>
 <name>fetcher.server.delay</name>
 <value>0.1</value>
</property>
<property>
 <name>fetcher.threads.fetch</name>
 <value>100</value>
</property>
<property>
 <name>fetcher.threads.per.queue</name>
 <value>100</value>
</property>
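
The "add Solr docs asynchronously" idea can be sketched roughly as below. This is only an illustration of the pattern, not the actual handler: AsyncIndexer, submit, slowIndex and shutdownAndWait are hypothetical names, and the in-memory list stands in for the real Solr index. The point is that the caller (here, the crawler's indexing step) gets control back right away while a background thread does the slow work.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: accept documents immediately, index them on a background thread. */
class AsyncIndexer {
    // Stand-in for the real index; a real handler would pass the document to Solr here.
    private final List<String> indexed = new CopyOnWriteArrayList<>();
    // A single worker thread keeps documents in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues the document and returns right away, without waiting for indexing. */
    public void submit(String doc) {
        worker.submit(() -> slowIndex(doc));
    }

    /** Simulates a slow index update (assumption: 50 ms per document). */
    private void slowIndex(String doc) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        indexed.add(doc);
    }

    /** Stops accepting new work and waits for queued documents; true if all finished in time. */
    public boolean shutdownAndWait(long timeoutSeconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public List<String> indexedDocs() {
        return indexed;
    }
}
```

The trade-off with this approach is that the caller no longer learns whether a document was actually indexed, so failures have to be logged or retried on the background side.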

Resources
http://wiki.apache.org/nutch/OptimizingCrawls

http://www.supermind.org/blog/274/improving-nutch-for-constrained-crawls
http://tech--help.blogspot.com/2010/06/tweak-nutch-default-crawling-speed.html
