Nutch2: Save Entire Title to Solr

The Problem
We use Nutch2 to crawl our internal documentation site and save the index to Solr. We noticed that if a title is longer than 100 characters, it is truncated to the first 100 characters.

For example: 
The original title is:
Getting Started -....(omit 62 characters) Firewall for Windows File System
In the Solr search results, the title would be:
Getting Started -....(omit 62 characters) Firewall for Wi
This is bad for the user experience. We want to save the entire title to Solr.
How Nutch Works
In the parsing phase, Nutch gets the entire title:
org.apache.nutch.parse.html.DOMContentUtils.getTitle(StringBuilder, Node)

org.apache.nutch.parse.html.HtmlParser.getParse(String, WebPage)
utils.getTitle(sb, root); // extract title
title = sb.toString().trim();
Parse parse = new Parse(text, title, outlinks, status);

But in the indexing phase, BasicIndexingFilter keeps only the first X characters of the title. X is defined by the property indexer.max.title.length.

It reads the property indexer.max.title.length from nutch-default.xml or nutch-site.xml. The value in nutch-site.xml overrides the one in nutch-default.xml.
public void setConf(Configuration conf) {
 this.conf = conf;
 this.MAX_TITLE_LENGTH = conf.getInt("indexer.max.title.length", 100);
 LOG.info("Maximum title length for indexing set to: " + this.MAX_TITLE_LENGTH);
}
public NutchDocument filter(NutchDocument doc, String url, WebPage page) throws IndexingException {
 // ... other fields omitted
 String title = TableUtil.toString(page.getTitle());
 if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
  title = title.substring(0, MAX_TITLE_LENGTH);
 }
 if (title.length() > 0) {
  doc.add("title", title);
 }
 // ... other fields omitted
 return doc;
}
The default value of indexer.max.title.length is 100, as defined in nutch-default.xml.
<property>
 <name>indexer.max.title.length</name>
 <value>100</value>
 <description>The maximum number of characters of a title that are
   ...
   Used by index-basic.
 </description>
</property>
The Solution
The fix is straightforward: set indexer.max.title.length to a larger value, such as 500, in nutch-site.xml.
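
The effect of the setting can be sketched in plain Java. This is only an illustration of the truncation step shown earlier: the class name TitleTruncationSketch and the method truncate are hypothetical and not part of Nutch, and the 140-character title is made up for the example.

```java
// Hypothetical stand-in for the truncation step in BasicIndexingFilter.
// The class and method names are illustrative and are not Nutch APIs.
class TitleTruncationSketch {

    /** Returns the title cut to at most maxTitleLength characters. */
    static String truncate(String title, int maxTitleLength) {
        if (title.length() > maxTitleLength) {
            return title.substring(0, maxTitleLength);
        }
        return title;
    }

    public static void main(String[] args) {
        // Build a made-up 140-character title.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 140; i++) {
            sb.append('x');
        }
        String longTitle = sb.toString();

        // With the default limit of 100 the title is cut to 100 characters.
        System.out.println(truncate(longTitle, 100).length());
        // With a limit of 500 the full 140 characters survive.
        System.out.println(truncate(longTitle, 500).length());
    }
}
```

Any limit larger than the longest title on the site has the same effect, so 500 is just a comfortable margin.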

Nutch includes many indexer plugins such as index-(basic|static|metadata), which add field name and value pairs. We can check all added fields by opening the call hierarchy on the method org.apache.nutch.indexer.NutchDocument.add(String, String).


Commonly Used Windows PowerShell Commands

One reason we like Linux is that it makes common administration tasks so easy to complete via the shell or scripting.

But sometimes we have to work on Windows and cannot install Cygwin.
Luckily, Microsoft provides PowerShell, which is preinstalled with Windows 7, Windows Server 2008 R2, and later Windows releases.

PowerShell is cool and useful. It differs from Linux shells in that it is completely object-oriented.
Common Folder/File Operations
Create a folder
mkdir c:\f1\f2\f3
md c:\f1\f2\f3
New-Item c:\f1\f2\f3 -ItemType directory
rm -r c:\f1\f2\f3
Create a file
New-Item c:\f1\f2\f3 -ItemType file -force -value "hello world"
cat c:\f1\f2\f3

Delete Files
Remove-Item -Recurse -Force .\incubator-blur #like linux rm -rf
Remove-Item c:\scripts\* -Include *.txt -Exclude *test*
Extract lines from files
Get first 10 lines as head -10 in linux
Get-Content -Path my.csv -TotalCount 10
Get last 10 lines as tail -10 in Linux
Get-Content -Path my.csv | Select-Object -Last 10
Get the 10th line (-Index is zero-based, so the 10th line has index 9)
Get-Content -Path my.csv | Select-Object -Index 9
Get the 10th to 100th lines
Get-Content -Path my.csv | Select-Object -Index (9..99)
Get the 10th and 100th lines
Get-Content -Path my.csv | Select-Object -Index 9, 99
Search recursively for a certain string within files
Get-ChildItem -Recurse -Filter *.log | Select-String Exception
Get-ChildItem -Recurse -Filter *.log | Select-String -CaseSensitive -Pattern Exception

Tail -f in PowerShell

PowerShell 3.0 and newer versions support the -Tail parameter:
Get-Content error.log -Tail 10 -Wait
Get-Content error.log -wait
Get-Content error.log -wait | Where-Object { $_ -match "Exception" } 
-match is case-insensitive. -cmatch is case-sensitive.

List All Java Files in All Subfolders
gci -Recurse -filter *.java | % { $_.FullName }

(select-string -path audit.log -pattern "logon failed").count
Select-String C:\Scripts\Test.lxt -pattern "failure" -context 3,1

Display the number of characters, words, and lines in the test.txt file.
get-content C:\test.txt | measure-object -character -line -word
get-childitem | measure-object -property length -minimum -maximum -average
import-csv d:\test\serviceyrs.csv | measure-object -property years -minimum -maximum -average

Find the five processes using the most memory
Get-Process | Sort-Object -Property WS -Descending | Select-Object -First 5

Delete all files within a directory

Remove-Item foldername -Recurse

Rename all .TXT files as .LOG files in the current directory:
Get-ChildItem -Path *.txt | Rename-Item -NewName { $_.Name -replace '\.txt$','.log' }

Restart a remote computer
Restart-Computer -Force -ComputerName TARGETMACHINE
Run a script on a remote computer
invoke-command -computername machine1, machine2 -filepath c:\Script\script.ps1

Using Get-WmiObject
List all WMI classes:
Get-WmiObject -List
Get-WmiObject -Class Win32_ComputerSystem 
Get-WmiObject -Class Win32_BIOS -ComputerName .
gwmi win32_service -filter "name like 'Oracle%'" | select name 
gwmi win32_service -filter "startmode='auto'" | select name,startmode
(gwmi win32_service -filter "name='alerter'").StopService()

A Complete DNS Setup Guide on Redhat(CentOS)

When installing a Cloudera cluster recently, I had to set up a private DNS server.
Private DNS server:
DNS client: and

Install bind and caching-nameserver
yum install bind  bind-utils bind-libs bind-chroot caching-nameserver -y

Run service named restart  to start named server first.

Configure DNS Server
Enable Caching nameserver and Create Zones
Edit /var/named/chroot/etc/named.conf:
1. Use forwarders block to forward DNS requests it can't resolve to upstream DNS server.
2. Add forward and reverse zones block for and
3. Add forward and reverse zones block for localhost and

vi /var/named/chroot/etc/named.conf 

acl localdomain-com { 172.19/16; };
options {
  directory  "/var/named";
  allow-query { localdomain-com; };
  # This block causes the caching name server to forward DNS requests it can't resolve to the upstream DNS servers.
  forwarders { upstream-dns-server1; upstream-dns-server2; };
  #forward only;
};
zone "" IN {
  type master;
  file "";
};
zone "" IN {
  type master;
  file "";
};

zone "localhost" IN {
  type master;
  file "";
};

zone "" {
  type master;
  file "named.local";
};
zone "." {
  type hint;
  file "";
};
Add Zone files
Go to /var/named/chroot/var/named directory, create files: and
cd /var/named/chroot/var/named
touch && chown named:named && chmod 644
touch && chown named:named && chmod 644

@             IN      SOA (
                                200612060                 ; serial
                                2H                        ; refresh slaves
                                5M                        ; retry
                                1W                        ; expire
                                1M                        ; Negative TTL

@                       IN      NS      bigdatam

bigdatam       IN      A
bigdata1       IN      A
bigdata2       IN      A


@       IN      SOA (
200612060       ; serial
2H              ; refresh slaves
5M              ; retry
1W              ; expire
1M              ; Negative TTL

        IN      NS
224.97      IN      PTR
66.101       IN      PTR
56.102       IN      PTR
The other zone files, such as named.local, are already in /var/named/chroot/var/named. They are created automatically, so we only need to reference them in /var/named/chroot/etc/named.conf.
Restart named server
service named restart 
chkconfig named on
Reload configuration and zones
rndc reload 
Toggle query logging
rndc querylog

Sometimes we need to disable SELinux and the firewall
Disable SELinux
setenforce 0
vi /etc/selinux/config
Disable firewall
/etc/init.d/iptables stop
chkconfig iptables off

Configure DNS Client
Perform the following steps on all 3 servers.
Prevent /etc/resolv.conf from being overwritten
Edit /etc/sysconfig/network-scripts/ifcfg-eth0 (replace eth0 with your network interface if different) and change PEERDNS=yes to PEERDNS=no
Setup DNS Name resolution 
vi /etc/resolv.conf
nameserver # the private dns server ip address.
Restart network
/etc/init.d/network restart

Test DNS Setup
Run nslookup to start a session, and run the following commands on all hosts.
# nslookup
Address:  name = localhost.
> localhost

Non-authoritative answer:
Name:   localhost
> bigdatam

> bigdata1

dig bigdatam
host -v -t A `hostname`

vi /etc/hosts       localhost.localdomain localhost
Synchronize System Clock Between Servers

How to set up a home DNS server
How to set up a home DNS server, part II

Using Decompiler JDEclipse-Realign to Debug Classes without Source in Eclipse

(Remote) debugging is a great way to troubleshoot and to figure out how code works. But sometimes we only have the jars and no source code, for example when the code is closed or proprietary and the source is not available anywhere.

Luckily, we can use JDEclipse-Realign to easily debug classes without sources in Eclipse.

1. Install and Configure JDEclipse-Realign in eclipse
Install via JDEclipse-Realign update site

Click "Window" -> "Preferences" and type "File Association". Select "class without source" and, in the dialog below, change the default from "Class File Viewer [Decompiled]" to "Class File Editor".

2. Find the jars containing the classes we want to debug
On Linux, use grep -r -s full_class_name * to find the jar.

3. Create a Java project with the jars
Create a Java project in Eclipse and add the jar (and any related jars) to the project's build path.
Now if we right-click the jar and select "Attach Source", we can see that "Decompiled Source" is selected.

4. Enable remote debug
Add -Xdebug -Xrunjdwp:transport=dt_socket,address=1044,server=y,suspend=y to the JVM options of the remote Java application.

Then configure Eclipse for remote debugging: click "Run" -> "Debug Configurations", create a new "Remote Java Application", and enter the host and the port number (1044 in this case). Be sure to select the previously created project in the Project text box.

5. Add breakpoints in compiled classes and run
After adding breakpoints, start the remote application. It will stop and wait for a remote debug client to connect to port 1044.
Start the remote debug session in Eclipse. The application will now stop at the breakpoints.

Other tools
JD-GUI from

- A decompiler that supports Java 8 language features, including lambda expressions and method references.

java -jar cfr.jar class_or_jar_file [method] [options]

Mchr3k - JDEclipse-Realign
JDEclipse-Realign Github

Advanced Usage of Linux Grep Command

Grep recursively: -R, -r, --recursive
grep -r "127" /etc

Use grep to search words only: -w, --word-regexp
Select only those lines containing matches that form whole words
grep -w "boo" file

-s, --no-messages
Suppress error messages about nonexistent or unreadable files.

Ignore binary files: -I
When we search, we may want to search only text files and ignore binary files for better performance. We can use -I.
-I Process a binary file as if it did not contain matching data; this is equivalent to the --binary-files=without-match option.

Combine grep and find
grep "some_string" `find some_folder -name "*.some_extension"`
find . -name "*.php" -exec grep -H "some_string" {} \;

Search only files whose base name matches GLOB
Search only files whose base name matches GLOB (using wildcard matching). A file-name glob can use *, ?, and [...] as wildcards, and \ to quote a wildcard or backslash character literally.
grep -rsI --include=*.html --include=*.php --include=*.htm "pattern" /some/path/
grep -rsI --include=*.{html,php,htm} "pattern" /some/path/
grep -rsI --include=*.{py,pyc} hue /etc

Find jars that contain matched classes
Grep can be used to find the jar that contains the matched class file:
grep -r com.cloudera.cmf.inspector.Inspector /usr/share/cmf/lib
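
Where grep is not available, roughly the same search can be sketched in Java with java.util.jar.JarFile. This is a different technique from the grep call above: instead of scanning raw bytes, it looks for an exact class entry inside each jar. The class name JarClassFinder and its method findJars are hypothetical names chosen for this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists the jars under a directory that contain a given class.
class JarClassFinder {

    static List<Path> findJars(Path root, String className) throws IOException {
        // com.example.Foo is stored in a jar as com/example/Foo.class
        String entryName = className.replace('.', '/') + ".class";

        // Collect every *.jar file below root, like grep -r walks the tree.
        List<Path> candidates;
        try (Stream<Path> paths = Files.walk(root)) {
            candidates = paths
                .filter(p -> Files.isRegularFile(p) && p.toString().endsWith(".jar"))
                .collect(Collectors.toList());
        }

        List<Path> hits = new ArrayList<>();
        for (Path candidate : candidates) {
            try (JarFile jar = new JarFile(candidate.toFile())) {
                if (jar.getEntry(entryName) != null) {
                    hits.add(candidate);
                }
            } catch (IOException e) {
                // Skip unreadable or corrupt archives, similar in spirit to grep -s.
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Example arguments: /usr/share/cmf/lib com.cloudera.cmf.inspector.Inspector
        for (Path jar : findJars(Paths.get(args[0]), args[1])) {
            System.out.println(jar);
        }
    }
}
```

Unlike grep, this only reports jars that hold the class itself, not jars that merely mention its name.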

-c, --count
Suppress normal output; instead print a count of matching lines for each input file
-n, --line-number
-h, --no-filename
-H, --with-filename
-v, --invert-match
--color[=WHEN], --colour[=WHEN]
grep --color root /etc/passwd

Grep Manual

Linux Mount and Unmount Remote File System

Create the Mount Point
sudo mkdir /mnt/ip-shared

Mount a windows shared folder 
mount //server-name/share-name /mnt/ip-shared -o username=shareuser,password=sharepassword,domain=sharedomain
Here, -o is used to specify mount options. In this case, we specify the login credentials: username, password, and domain. The format can also be -o username=sharedomain/shareuser,password=sharepassword
Here we don't use -t to specify the vfstype, so mount will try to guess the desired type. Check the list in the mount manual.

Common Mount Options
loop Mounts an image as a loop device.
ro Mounts the file system for reading only.
rw Mounts the file system for both reading and writing.

Unmount it
If we no longer need to access the remote file system, we can use umount to unmount it. Notice that it's umount, not unmount.
umount /mnt/ip-shared

Linux Mount Manual
Linux mount CIFS Windows Share

Using Chrome DevTools to Hack Client-Side Only Validation

Client-side validation can provide faster responses and a better user experience, and it reduces server load. But relying on client-side validation alone is a terrible idea.

Users can easily bypass the validation or change the values of JavaScript objects using tools like Chrome DevTools or Firebug. We must always perform appropriate server-side validation as well.
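
A minimal sketch of what such a server-side check could look like in Java follows. The class MemoryValidator, the method isValidMemory, and the limits passed in are all hypothetical; they are not taken from the site discussed in this post. The point is only that the server re-checks the raw request value against limits it controls, regardless of what the browser claims.

```java
// Hypothetical server-side check; names and limits are illustrative only.
class MemoryValidator {

    /**
     * Returns true only if the raw request value parses as an integer
     * and lies within [min, max]. The limits come from server-side
     * policy, never from the client.
     */
    static boolean isValidMemory(String rawValue, int min, int max) {
        if (rawValue == null) {
            return false;
        }
        try {
            int value = Integer.parseInt(rawValue.trim());
            return value >= min && value <= max;
        } catch (NumberFormatException e) {
            // Non-numeric input is rejected rather than passed through.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidMemory("4", 1, 4));   // within the limits
        System.out.println(isValidMemory("100", 1, 4)); // above the maximum
        System.out.println(isValidMemory("abc", 1, 4)); // not a number
    }
}
```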
Using Chrome DevTools to Hack it
One website limits the maximum value of one field, and I wondered whether I could bypass the verification.
1. Find the pattern
I noticed that when I enter an invalid value and move to the next field, the color of the field changes to red.
2. Break on DOM Attribute Modifications
So when the color of the field changes, I want Chrome to pause at that moment so I can figure out how this website does the verification.

So I right-click the input textbox and click "Inspect Element". This opens Chrome DevTools and selects the inspected element in the "Elements" tab.
<input title="" class="input_field_" type="text" name="Memory" id="memory">
Right click it, and select "Break on..." -> "Attributes modifications".

This causes Chrome to pause when an attribute of this input textbox changes.
It would be better if Chrome DevTools allowed us to pause when a particular attribute changes (in this case, the class attribute), or even better, when a particular attribute changes to a particular value. But this works anyway.
3. Reproduce It
Now enter the invalid value, then trigger the event that causes the client-side validation. Usually this happens when focus leaves the current field or when we click next or submit.

This causes Chrome to pause execution, switch to the "Sources" tab, and stop at the code that changes the attribute.
4. Check the Call Stack
Now we can check the call stack. 
To make the JavaScript much easier to read, we can press the "Pretty print" button (marked with curly braces {}) at the bottom of the "Sources" tab.

Following the call stack, I found that it does the verification like below:
function validateSteps(step) {
    var isStepValid = true;
    // omitted
    if (step == step_config) {
        return validateField('#memory', selectedPolicy.minMemory, selectedPolicy.maxMemory) && ...; // remaining checks omitted
    }
    return isStepValid; // we can set a breakpoint here and change isStepValid to true.
}
The code compares the value from the memory textbox with selectedPolicy.minMemory and selectedPolicy.maxMemory.
5. Hack it
Now there are several ways to hack it. We can step into the validateField method and change its return value.
The simpler way is to directly modify the selectedPolicy object and change maxMemory to a much bigger value. We go to the "Console" tab and type selectedPolicy.maxMemory, which prints the current value: 4. Then we type selectedPolicy.maxMemory=100.

Now disable all breakpoints, press resume, and submit the form. It works: this site doesn't check the value on the server side at all.

Later, if I want to do the same trick and bypass the verification, I can go to the "Console" tab directly and type selectedPolicy.maxMemory=100.

In the Console tab, we can do many things, such as running any valid JavaScript code in the current scope.
We can inspect the window object and discover any custom properties and functions.

Happy Hacking!!!
Tips And Tricks

Using Guava Splitter and Joiner

In my last post, Solr RssResponseWriter by Extending XMLWriter, I needed to parse a field-mapping string like topic:title,url:link,description:description into a map.

If we have to write our own code, it would look like below:
public void splitOnOurOwn() {
 String mapStr = "topic: title, url: link, description: description ";
 String[] pairs = mapStr.split(",");
 Map<String,String> map = new HashMap<String,String>();
 for (String pair : pairs) {
  String[] str = pair.split(":");
  map.put(str[0].trim(), str[1].trim());
 }
 System.out.println("toString: " + Objects.toString(map));
}
But if we use Guava, it would be just two lines:

// The key/value splitter trims too, so "topic: title" yields key "topic" and value "title".
private static MapSplitter splitter = Splitter.on(",").trimResults()
  .withKeyValueSeparator(Splitter.on(":").trimResults());
private static MapJoiner joiner = Joiner.on(",").withKeyValueSeparator(":");

public void guavaSplit() {
 String mapStr = "topic: title, url: link, description: description ";
 Map<String,String> map = splitter.split(mapStr.trim());
 System.out.println("toString: " + Objects.toString(map));
 System.out.println("join: " + joiner.join(map));
}
Guava Splitter provides some other useful methods such as fixedLength, omitEmptyStrings, trimResults, limit.

As described in code above, Guava provides a Joiner that can join text with a separator. Joiner also provides useful methods such as skipNulls, useForNull(nullText).
Strings Explained
Splitter Javadoc
Joiner Javadoc
Guava Splitter vs StringUtils

Configure Tomcat SSL Using PFX(PKCS12) Certificate

I am trying to import a certificate from Entrust into Tomcat.
Entrust provides a pfx file to us. PFX stands for Personal Information Exchange; it stores many cryptography objects in a single file. Read more about PKCS #12

To import the pfx (PKCS #12) file into Tomcat or another Java web server, the easy solution is to convert it to a Java KeyStore (JKS) file.
1. Using keytool
Since JDK6, we can use JDK keytool to convert pkcs12 to JKS.
keytool -importkeystore -srckeystore file.pfx -srcstoretype PKCS12 -destkeystore cert.jks -deststoretype JKS
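
If the conversion has to happen inside a program, something similar can be sketched with the JDK's java.security.KeyStore API. This is an approximation under stated assumptions, not a description of what keytool does internally: the class name Pkcs12ToJks is hypothetical, the command-line arguments are made up, and one password is assumed to protect the source store, the private keys, and the target store.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.Key;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.util.Collections;

// Hypothetical helper that copies all entries of a PKCS12 store into a new JKS store.
class Pkcs12ToJks {

    static KeyStore convert(KeyStore source, char[] password) throws Exception {
        KeyStore target = KeyStore.getInstance("JKS");
        target.load(null, null); // initialise an empty keystore

        for (String alias : Collections.list(source.aliases())) {
            if (source.isKeyEntry(alias)) {
                // Private key plus its certificate chain.
                Key key = source.getKey(alias, password);
                Certificate[] chain = source.getCertificateChain(alias);
                target.setKeyEntry(alias, key, password, chain);
            } else {
                // Trusted certificate entry.
                target.setCertificateEntry(alias, source.getCertificate(alias));
            }
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        // Assumed arguments: <file.pfx> <cert.jks> <password>
        char[] password = args[2].toCharArray();

        KeyStore pkcs12 = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            pkcs12.load(in, password);
        }

        KeyStore jks = convert(pkcs12, password);
        try (OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            jks.store(out, password);
        }
    }
}
```

For a one-off conversion the keytool command above is simpler; a sketch like this only makes sense when the conversion is part of a larger tool.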
2. Using XWSS
For older JDKs, we can use the XWSS utility to convert PKCS12 to JKS.
XWSS (XML and WebServices Security Project) is part of Project Metro in the GlassFish community. It provides some utilities that can be downloaded from here.

Download it and unzip it, and we can find pkcs12import.bat.
pkcs12import usage
pkcs12import -file pkcs12-file [ -keystore keystore-file ]
[ -pass pkcs12-password ]   [ -storepass store-password ]  [ -keypass key-password ] [ -alias alias ]

Add SSL Connector in server.xml

Restart tomcat, and try to access https://localhost/
PKCS 12 Wiki
Converting .pfx Files to .jks Files
How to import PFX file into JKS using pkcs12import utility

Using Guava Stopwatch

After reading the post TRIM, LTRIM and RTRIM in Java, I had a doubt about the performance difference between ltrim, which uses a regular expression, and the straightforward ltrim3. So I used Guava Stopwatch to test their performance.
Guava Usage
1. Use a Stopwatch static factory method to create an instance.
Stopwatch stopwatch = Stopwatch.createStarted();
2. Use stopwatch.elapsed(TimeUnit) to get the elapsed time.
3. We can reuse the same stopwatch instance to measure another operation, but we first have to call reset() and then start().
The reset() method resets the internal state: elapsedNanos=0, isRunning=false. Calling start() then begins measuring elapsed time again.
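
The start, elapsed, and reset behaviour described above can be sketched with nothing but System.nanoTime. SimpleStopwatch below is a hypothetical, much reduced stand-in written for illustration. It is not Guava's implementation; only the field names elapsedNanos and isRunning are borrowed from the description above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical minimal stopwatch for illustration only; not Guava's Stopwatch.
class SimpleStopwatch {
    private long elapsedNanos = 0;
    private long startTick;
    private boolean isRunning = false;

    SimpleStopwatch start() {
        if (isRunning) {
            throw new IllegalStateException("already running");
        }
        isRunning = true;
        startTick = System.nanoTime();
        return this;
    }

    SimpleStopwatch stop() {
        if (!isRunning) {
            throw new IllegalStateException("not running");
        }
        elapsedNanos += System.nanoTime() - startTick;
        isRunning = false;
        return this;
    }

    /** Clears the accumulated time and leaves the stopwatch stopped. */
    SimpleStopwatch reset() {
        elapsedNanos = 0;
        isRunning = false;
        return this;
    }

    long elapsed(TimeUnit unit) {
        long nanos = isRunning
            ? elapsedNanos + (System.nanoTime() - startTick)
            : elapsedNanos;
        return unit.convert(nanos, TimeUnit.NANOSECONDS);
    }

    boolean isRunning() {
        return isRunning;
    }

    public static void main(String[] args) {
        SimpleStopwatch stopwatch = new SimpleStopwatch().start();
        // ... first operation to measure ...
        System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS) + " ms for the first operation");

        // Reuse the same instance: reset() first, then start() again.
        stopwatch.reset().start();
        // ... second operation to measure ...
        System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS) + " ms for the second operation");
    }
}
```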
The test result is as I expected: the straightforward one gives the best performance.
458 for ltrim3(the straightforward one)
3967 for ltrim(using regular expression)

The code looks like below:

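import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

public class LTrimTester {
  private final static Pattern LTRIM = Pattern.compile("^\\s+");
  private final static Pattern RTRIM = Pattern.compile("\\s+$");

  public static void main(String[] args) {
    int times = 10000000;
    Stopwatch stopwatch = Stopwatch.createStarted();
    for (int i = 0; i < times; i++) {
      String str = "  hello world  ";
      ltrim(str);
    }
    // The time unit is an assumption; the original print statement was lost.
    System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS)
        + " for ltrim(using regular expression)");

    // Reset and restart before timing the second variant.
    stopwatch.reset();
    stopwatch.start();
    for (int i = 0; i < times; i++) {
      String str = "  hello world  ";
      ltrim3(str);
    }
    System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS)
        + " for ltrim3(the straightforward one)");
  }

  public static String ltrim(String s) {
    return LTRIM.matcher(s).replaceAll("");
  }

  public static String ltrim3(String s) {
    int i = 0;
    while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
      i++;
    }
    return s.substring(i);
  }
}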
Guava Stopwatch Javadoc
Guava Stopwatch

Solr: Creating a DocTransformer to Show Doc Offset in Response

On the client, the doc offset can be easily computed: start + (offset in current response), because Solr's start parameter is itself the zero-based offset of the first returned doc.
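
The client-side arithmetic can be sketched as follows. It relies on Solr's start parameter being the zero-based offset of the first returned doc, so the absolute position is start plus the doc's index within the current response. The class DocOffsetSketch and the sample numbers are made up for the illustration.

```java
// Hypothetical helper showing the client-side offset arithmetic.
class DocOffsetSketch {

    /**
     * @param start           value of the Solr start parameter (zero-based offset of the first doc)
     * @param indexInResponse zero-based position of the doc within the current response
     */
    static long absoluteOffset(long start, int indexInResponse) {
        return start + indexInResponse;
    }

    public static void main(String[] args) {
        // A query with start=150: the doc at index 6 of the response is doc 156 overall.
        System.out.println(absoluteOffset(150, 6)); // prints 156
    }
}
```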

But in some cases it is useful to show the offset explicitly. For example, when debugging search relevancy, a tester may ask why some docs are not shown on the first page. In this case, we may run the query with a large rows value, for example q="Nexus 7"&rows=500, and then search for the doc in the response XML. It would be helpful if the response showed the offset directly, like below:
 <str name="id">id3456</str>
 <long name="[offset]">156</long>
The field name would be [transformername], [offset] in this case.
Solr DocTransformers
Solr DocTransformers allow us to add, change, or remove fields in the response before it is returned to the client. By default, Solr provides [explain], [value], [shard], and [docid]. We can easily add our own DocTransformer implementation.

OffsetTransformerFactory Implementation
The implementation looks similar to ValueAugmenterFactory. To use it, we add the transformer to the fl parameter: q="Nexus 7"&fl=id,[offset]
public class OffsetTransformerFactory extends TransformerFactory {
  private boolean enabled = false;

  public void init(NamedList args) {
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      enabled = params.getBool("enabled", false);
      if (!enabled) return;
    }
  }

  /**
   * field is [offset] in this case.<br>
   * Notice augmenterArgs is the local params to this transformer like:
   * [myTransformer foo=1 bar=good], not parameters in SolrQueryRequest.
   */
  public DocTransformer create(String field, SolrParams augmenterArgs,
      SolrQueryRequest req) {
    SolrParams params = req.getParams();
    String str = params.get(CommonParams.START);
    long start = 0;
    if (StringUtils.isNotBlank(str)) {
      start = Long.valueOf(str);
    }
    // start is already the zero-based offset of the first returned doc,
    // so it must not be multiplied by rows.
    return new OffsetTransformer(field, start);
  }

  class OffsetTransformer extends DocTransformer {
    private String field;
    private long startOffset;
    private long offset = 0;

    public OffsetTransformer(String field, long startOffset) {
      this.field = field;
      this.startOffset = startOffset;
    }

    public void transform(SolrDocument doc, int docid) throws IOException {
      if (enabled) {
        doc.setField(field, startOffset + offset);
        offset++; // advance for the next doc; assumes docs are transformed in response order
      }
    }

    public String getName() {
      return OffsetTransformer.class.getName();
    }
  }
}
Configuration in solrconfig.xml
<transformer name="offset"
  <bool name="enabled">true</bool>
Solr DocTransformers

Git Clone a Specific Version/Tag/Branch

I am trying to learn Guava's implementation details, so I cloned the source code with git as described here.
git clone
Cloning into 'guava-libraries'...
error: error setting certificate verify locations:
  CAfile: chrome\depot_tools\git-1.8.0_bin/bin/curl-ca-bundle.crt
  CApath: none
 while accessing
fatal: HTTP request failed
I changed from https to http:
git clone 

It works. Then I ran mvn eclipse:eclipse, which failed with the following exception:
[ERROR] Failed to execute goal on project guava-testlib: Could not resolve dependencies for project Failure to find in was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced -> [Help 1]

So I tried to check out the old version: release 15.0.
I ran git tag to find the tag name of release 15.0: v15.0

To switch the local repo to release 15.0, I use: git checkout -b vlocal_15.0 v15.0
The format is: git checkout -b local_branch_name tag_name.

Then run mvn eclipse:eclipse and mvn install. It works.

To directly clone release 15.0, we can use:
git clone  -b v15.0
git-clone manual page

