Nutch2: Parse All Content and Get All Outlinks


Some links on our documentation site are dynamically generated, especially the left-side menu. This prevents Nutch2 and Google from crawling all pages on our site. So we decided to add one invisible link to a page that lists all pages on our site.

But Nutch2 was unable to get all outlinks from that invisible listing page.
In org.apache.nutch.parse.ParseUtil.process(String, WebPage), Nutch uses the parameter db.max.outlinks.per.page to set the maximum number of outlinks it takes from a page:
int maxOutlinksPerPage = conf.getInt("db.max.outlinks.per.page", 100);
maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
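
To make the effect of that setting concrete, here is a minimal, self-contained sketch of the same capping logic. It does not use any Nutch classes: the class OutlinkLimit and its methods resolve and cap are hypothetical names used only for illustration, and the way Nutch itself truncates the outlink list may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimit {

    // Mirrors the expression above: a negative setting means "no limit".
    static int resolve(int maxOutlinksPerPage) {
        return (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE : maxOutlinksPerPage;
    }

    // Hypothetical helper: keeps at most resolve(setting) links, in original order.
    static List<String> cap(List<String> outlinks, int setting) {
        int limit = Math.min(resolve(setting), outlinks.size());
        return new ArrayList<>(outlinks.subList(0, limit));
    }

    public static void main(String[] args) {
        // Simulate a listing page with 250 links (example.com is a placeholder).
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            links.add("http://example.com/page" + i);
        }
        System.out.println(cap(links, 100).size()); // default of 100 keeps 100 links
        System.out.println(cap(links, -1).size());  // -1 keeps all 250 links
    }
}
```

With the default of 100, a listing page with several hundred links loses everything after the first 100, which matches the missing pages described above.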

We can set db.max.outlinks.per.page to -1 to tell Nutch to get all outlinks.
Meanwhile, we need to change http.content.limit to -1 so that Nutch parses the whole content of a page, and raise http.timeout to a larger value.
We put these changes in nutch-site.xml as below:

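<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>http.timeout</name>
  <value>1000000</value>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>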

Linux Notes All In One


Add third-party yum-repositories
rpm -Uvh http://repo.webtatic.com/yum/centos/5/latest.rpm
yum install --enablerepo=webtatic git-all
Using CentOS 5 Repos in RHEL 5 Server
wget http://mirrors.nl.kernel.org/centos/5/os/x86_64/CentOS/centos-release-notes-5.10-0.x86_64.rpm
wget http://mirrors.nl.kernel.org/centos/5/os/x86_64/CentOS/centos-release-5-10.el5.centos.x86_64.rpm
rpm -e redhat-release-notes-5Server redhat-release-5Server --nodeps
rpm -ivh centos-release-notes-5.10-0.x86_64.rpm centos-release-5-10.el5.centos.x86_64.rpm
Installing RPMforge
rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt
wget http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.3-1.el5.rf.x86_64.rpm
rpm -ivh rpmforge-release-0.5.3-1.el5.rf.x86_64.rpm
yum search java | grep 'java-'
cd /etc/yum.repos.d

Open a file browser from bash
nautilus --browser .
Exit from git diff
Press q (this quits the pager that git diff opens)
VNC Server Setup
Edit /etc/sysconfig/vncservers:
VNCSERVERS="1:vncuser 2:vncuser2 3:vncuser3"
VNCSERVERARGS[1]="-geometry 1600x1200"

service vncserver start|stop|restart
Create xstartup scripts
vi ~/.vnc/xstartup
Uncomment the following two lines (remove the "#" characters):
unset SESSION_MANAGER
exec /etc/X11/xinit/xinitrc
Managing your VNC sessions
vncserver -kill :1
List all VNC server sessions

ls ~/.vnc/*.pid
Check VNC version in Red Hat
rpm -qa | grep vnc-server
rpm -qf /usr/bin/vncserver

Copy and paste stops working in VNC session
Run vncconfig &

Install Java in Red Hat
Add the CentOS repository
yum search java-1.7
yum install java-1.7**
alternatives --display java
/usr/sbin/alternatives --config java

alternatives --install /usr/bin/java java
