Nutch2: Parse All Content and Get All Outlinks

Some links in our documentation site are dynamically generated, especially the left-side menu. This prevents Nutch2 and Google from crawling all pages on our site, so we decided to add one invisible link to a page that lists every page on the site.

But Nutch2 is unable to get all outlinks from the invisible listing-all-pages page.
In org.apache.nutch.parse.ParseUtil.process(String, WebPage), Nutch uses a configuration parameter to cap the number of outlinks it takes from a page:
int maxOutlinksPerPage = conf.getInt("", 100);
maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
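To make the effect of a negative value concrete, here is a minimal standalone sketch of the same rule. It does not use Nutch classes; the class name OutlinkLimitSketch and the helper limitOutlinks are hypothetical, and the real ParseUtil also filters and normalizes outlinks, which is left out here.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {
    // Same rule as the two lines above: a negative setting means "no limit".
    static int effectiveMaxOutlinks(int maxOutlinksPerPage) {
        return (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
    }

    // Hypothetical helper (not Nutch code): keeps at most the allowed number of outlinks.
    static List<String> limitOutlinks(List<String> outlinks, int maxOutlinksPerPage) {
        int max = effectiveMaxOutlinks(maxOutlinksPerPage);
        int count = Math.min(outlinks.size(), max);
        return new ArrayList<>(outlinks.subList(0, count));
    }

    public static void main(String[] args) {
        // A listing page with 250 links, like an "all pages" index.
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            links.add("http://example.com/page" + i);
        }
        System.out.println(limitOutlinks(links, 100).size()); // prints 100
        System.out.println(limitOutlinks(links, -1).size());  // prints 250
    }
}
```

With the default of 100, a page that lists 250 links loses 150 of them, which matches the symptom described above.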

We can set this parameter to -1 to tell Nutch to get all outlinks.
We also need to change http.content.limit to -1 so that Nutch parses the entire content of a page, and raise http.timeout to a larger value.
We put these changes in nutch-site.xml.
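The original configuration listing did not survive, so the following is only a sketch of what such a nutch-site.xml fragment could look like. Two things are assumptions: the outlink property name db.max.outlinks.per.page (the snippet above elides it; this is the name used in stock Nutch's nutch-default.xml), and the http.timeout value, which is an arbitrary example since the post only asks for a bigger number.

```xml
<configuration>
  <!-- Assumed property name (elided in the post): negative means no outlink limit -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
  <!-- -1 disables content truncation, so the whole page is parsed -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Illustrative value in milliseconds; pick what suits your site -->
  <property>
    <name>http.timeout</name>
    <value>30000</value>
  </property>
</configuration>
```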





Linux Notes All In One

Add third-party yum-repositories
rpm -Uvh
yum install --enablerepo=webtatic git-all
Using CentOS 5 Repos in RHEL 5 Server
rpm -e redhat-release-notes-5Server redhat-release-5Server --nodeps
rpm -ivh centos-release-notes-5.10-0.x86_64.rpm centos-release-5-10.el5.centos.x86_64.rpm
Installing RPMforge
rpm --import
rpm -ivh rpmforge-release-0.5.3-1.el5.rf.x86_64.rpm
yum search java | grep 'java-'
cd /etc/yum.repos.d

Open a file browser from bash
nautilus --browser .
Exit from git diff
Press q (the diff is shown in a pager)
VNC Server Setup
VNCSERVERS="1:vncuser 2:vncuser2 3:vncuser3"
VNCSERVERARGS[1]="-geometry 1600x1200"

service vncserver start|stop|restart
Create xstartup scripts
vi ~/.vnc/xstartup
Uncomment the following two lines (remove the "#" characters):
unset SESSION_MANAGER
exec /etc/X11/xinit/xinitrc
Managing your VNC sessions
vncserver -kill :1
List all VNC server sessions

ls ~/.vnc/*.pid
Check VNC version in Red Hat
rpm -qa | grep vnc-server
rpm -qf /usr/bin/vncserver

Copy and paste stops working in VNC session
Run vncconfig &

Install Java in Red Hat
Add CentOS repository
yum search java-1.7
yum install java-1.7*
alternatives --display java
/usr/sbin/alternatives --config java

alternatives --install /usr/bin/java java

