Nutch2: Save Entire Title to Solr


The Problem
We use Nutch2 to crawl our internal documentation site and save the index to Solr. We noticed that if the title is too long (longer than 100 characters), it is truncated to the first 100 characters.

For example: 
The original title is:
Getting Started -....(omit 62 characters) Firewall for Windows File System
In solr search result, the title would be:
Getting Started -....(omit 62 characters) Firewall for Wi
This is bad for the user experience. We want to save entire title to Solr.
How Nutch Works
In the parsing phase, Nutch gets the entire title:
org.apache.nutch.parse.html.DOMContentUtils.getTitle(StringBuilder, Node)

org.apache.nutch.parse.html.HtmlParser.getParse(String, WebPage)
utils.getTitle(sb, root); // extract title
title = sb.toString().trim();
Parse parse = new Parse(text, title, outlinks, status);

But in the indexing phase, BasicIndexingFilter only indexes the first X characters of the title, where X is defined by the property indexer.max.title.length.

org.apache.nutch.indexer.basic.BasicIndexingFilter
It reads the property indexer.max.title.length from nutch-default.xml or nutch-site.xml. The value in nutch-site.xml overrides the one in nutch-default.xml.
public void setConf(Configuration conf) {
 this.conf = conf;
 this.MAX_TITLE_LENGTH = conf.getInt("indexer.max.title.length", 100);
 LOG.info("Maximum title length for indexing set to: " + this.MAX_TITLE_LENGTH);
}
public NutchDocument filter(NutchDocument doc, String url, WebPage page) throws IndexingException {
 String title = TableUtil.toString(page.getTitle());
 if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
  title = title.substring(0, MAX_TITLE_LENGTH);
 }
 if (title.length() > 0) {
  doc.add("title", title);
 }
 return doc;
}
The default value of indexer.max.title.length is 100, as defined in nutch-default.xml.
<property>
 <name>indexer.max.title.length</name>
 <value>100</value>
 <description>The maximum number of characters of a title that are
   indexed.
   Used by index-basic.
 </description>
</property>
The Solution
Now the fix is straightforward: define indexer.max.title.length with a larger value, such as 500, in nutch-site.xml.
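For example, the override in nutch-site.xml (500 is just an example value) would look like:
<property>
 <name>indexer.max.title.length</name>
 <value>500</value>
 <description>The maximum number of characters of a title that are
   indexed. Used by index-basic.
 </description>
</property>
The new value takes effect the next time the indexing job runs.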

Misc
Nutch includes many indexer plugins, such as index-(basic|static|metadata), each of which adds some field name/value pairs. We can check all added fields by opening the call hierarchy on the method org.apache.nutch.indexer.NutchDocument.add(String, String).

Resources
nutch-default.xml

Commonly Used Windows PowerShell Commands


One reason we like Linux is that it's so easy to complete common (administration) tasks via the shell or scripting.

But sometimes we have to work on Windows and are not able to install Cygwin.
Luckily, Microsoft provides PowerShell, and it's preinstalled on Windows 7, Windows Server 2008 R2, and later Windows releases.

PowerShell is cool and useful. It's different from the Linux shell in that it's completely object-oriented: commands pass .NET objects, not text, through the pipeline.
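For example, the objects' properties can be filtered and sorted directly (a small illustrative pipeline: list files larger than 1 MB, biggest first):
Get-ChildItem | Where-Object { $_.Length -gt 1MB } | Sort-Object -Property Length -Descending | Select-Object Name, Length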
Common Folder/File Operations
Create a folder
mkdir c:\f1\f2\f3
md c:\f1\f2\f3
New-Item c:\f1\f2\f3 -ItemType directory
Delete the folder
rm -r c:\f1\f2\f3
Create a file
New-Item c:\f1\f2\f3 -ItemType file -force -value "hello world"
cat c:\f1\f2\f3

Delete Files
Remove-Item -Recurse -Force .\incubator-blur #like linux rm -rf
Remove-Item c:\scripts\* -include *.txt -exclude *test
Extract lines from files
Get the first 10 lines, like head -10 in Linux
Get-Content -Path my.csv -TotalCount 10
Get the last 10 lines, like tail -10 in Linux
Get-Content -Path my.csv | Select-Object -Last 10
Get a single line by its (zero-based) index, e.g. index 10
Get-Content -Path my.csv | Select-Object -Index 10
Get the 10th to 100th lines
Get-Content -Path my.csv | Select-Object -Index (10..100)
Get the 10th and 100th lines
Get-Content -Path my.csv | Select-Object -Index (10, 100)
Search recursively for a certain string within files
Get-ChildItem -Recurse -Filter *.log | Select-String Exception
Get-ChildItem -Recurse -Filter *.log | Select-String -CaseSensitive -Pattern Exception

Tail -f in PowerShell

In PowerShell 3.0 and newer, Get-Content supports -Tail:
Get-Content error.log -Tail 10 -Wait
Get-Content error.log -wait
Get-Content error.log -wait | Where-Object { $_ -match "Exception" } 
-match is case-insensitive; -cmatch is case-sensitive.

List All Java Files in All Subfolders
gci -Recurse -filter *.java | % { $_.FullName }

Select-String
(select-string -path audit.log -pattern "logon failed").count
Select-String C:\Scripts\Test.txt -pattern "failure" -context 3,1

Measure-Object
Display the number of characters, words, and lines in a text file:
get-content C:\test.txt | measure-object -character -line -word
get-childitem | measure-object -property length -minimum -maximum -average
import-csv d:\test\serviceyrs.csv | measure-object -property years -minimum -maximum -average

Find the ten processes using the most memory
Get-Process | Sort-Object -Property WS -Descending | Select-Object -First 10

Delete all files within a directory

Remove-Item foldername -Recurse

Rename all .TXT files as .LOG files in the current directory:
Get-Childitem -Path *.txt | rename-item -NewName {$_.name -replace ".txt",".log"}

Misc
Restart-Computer -Force -ComputerName TARGETMACHINE
Run a script on a remote computer
invoke-command -computername machine1, machine2 -filepath c:\Script\script.ps1

Using Get-WmiObject
List all WMI classes:
Get-WmiObject -List
Get-WmiObject -Class Win32_ComputerSystem 
Get-WmiObject -Class Win32_BIOS -ComputerName .
gwmi win32_service -filter "name like 'Oracle%'" | select name 
gwmi win32_service -filter "startmode='auto'" | select name,startmode
(gwmi win32_service -filter "name='alerter'").StopService()

A Complete DNS Setup Guide on Red Hat (CentOS)


Background
While installing a Cloudera cluster recently, I had to set up a private DNS server.
Environment
Private DNS server: 172.19.97.224 (bigdatam.localdomain.com)
DNS clients: 172.19.101.66 (bigdata1.localdomain.com) and 172.19.102.56 (bigdata2.localdomain.com).

Install bind and caching-nameserver
yum install bind  bind-utils bind-libs bind-chroot caching-nameserver -y

Run service named restart to start the named server first.

Configure DNS Server
Enable Caching nameserver and Create Zones
Edit /var/named/chroot/etc/named.conf:
1. Use a forwarders block to forward DNS requests it can't resolve to upstream DNS servers.
2. Add forward and reverse zone blocks for localdomain.com and 19.172.in-addr.arpa.
3. Add forward and reverse zone blocks for localhost and 0.0.127.in-addr.arpa.

vi /var/named/chroot/etc/named.conf 

acl localdomain-com { 172.19/16; };
options {
  directory  "/var/named";
  allow-query { localdomain-com; };
 # The block will cause the caching name server to forward DNS requests it can't resolve to upstream DNS server.
  forwarders { upstream-dns-server1; upstream-dns-server2; };
  #forward only;
};
zone "localdomain.com" IN {
 type master;
 file "localdomain.com.zone";
};
zone "19.172.in-addr.arpa" IN {
 type master;
 file "172.19.zone";
};

zone "locahost" IN {
        type master;
        file "localhost.zone";
        allow-update{none;};
};

zone "0.0.127.in-addr.arpa" {
        type master;
        file "named.local";
};
zone "." {
 type hint;
 file "named.ca";
};
Add Zone files
Go to the /var/named/chroot/var/named directory and create the files localdomain.com.zone and 172.19.zone (the file names referenced in named.conf).
cd /var/named/chroot/var/named
touch localdomain.com.zone && chown named:named localdomain.com.zone && chmod 644 localdomain.com.zone
touch 172.19.zone && chown named:named 172.19.zone && chmod 644 172.19.zone

vi localdomain.com.zone
$TTL 1D
$ORIGIN localdomain.com.
@             IN      SOA     bigdatam.localdomain.com. foo.bar.tld. (
                                200612060                 ; serial
                                2H                        ; refresh slaves
                                5M                        ; retry
                                1W                        ; expire
                                1M                        ; Negative TTL
                                )

@                       IN      NS      bigdatam

bigdatam       IN      A       172.19.97.224
bigdata1       IN      A       172.19.101.66
bigdata2       IN      A       172.19.102.56

vi 172.19.zone
$TTL 1D
$ORIGIN 19.172.IN-ADDR.ARPA.

@       IN      SOA     bigdatam.localdomain.com. foo.bar.tld. (
200612060       ; serial
2H              ; refresh slaves
5M              ; retry
1W              ; expire
1M              ; Negative TTL
)

        IN      NS      bigdatam.localdomain.com.
224.97      IN      PTR     bigdatam.localdomain.com.
66.101       IN      PTR     bigdata1.localdomain.com.
56.102       IN      PTR     bigdata2.localdomain.com.
localhost.zone, named.local, and named.ca are already in /var/named/chroot/var/named; they are created automatically, so we just need to refer to them in /var/named/chroot/etc/named.conf.
Restart named server
service named restart 
chkconfig named on
Reload configuration and zones
rndc reload 
Toggle query logging
rndc querylog

Sometimes we need to disable SELinux and the firewall
Disable SELinux
setenforce 0
vi /etc/selinux/config
SELINUX=disabled
SELINUXTYPE=targeted
Disable firewall
/etc/init.d/iptables stop
chkconfig iptables off

Configure DNS Client
Do the following steps on all 3 servers.
Prevent /etc/resolv.conf from being overwritten
Edit /etc/sysconfig/network-scripts/ifcfg-eth0 (replace eth0 with your network interface if different) and change PEERDNS=yes to PEERDNS=no.
Set up DNS name resolution
vi /etc/resolv.conf
search localdomain.com
nameserver 172.19.97.224 # the private dns server ip address.
Restart network
/etc/init.d/network restart

Test DNS Setup
Run nslookup to start a session, and run the following commands on all hosts.
# nslookup
> 127.0.0.1
Server:         172.19.97.224
Address:        172.19.97.224#53

1.0.0.127.in-addr.arpa  name = localhost.
> localhost
Server:         172.19.97.224
Address:        172.19.97.224#53

Non-authoritative answer:
Name:   localhost
Address: 127.0.0.1
> bigdatam
Server:         172.19.97.224
Address:        172.19.97.224#53

Name:   bigdatam.localdomain.com
Address: 172.19.97.224
> bigdata1
Server:         172.19.97.224
Address:        172.19.97.224#53

Name:   bigdata1.localdomain.com
Address: 172.19.101.66
dig bigdatam
host -v -t A `hostname`

vi /etc/hosts
127.0.0.1       localhost.localdomain localhost
Synchronize System Clock Between Servers
ntpdate pool.ntp.org

Resources
How to set up a home DNS server
How to set up a home DNS server, part II

Using Decompiler JDEclipse-Realign to Debug Classes without Source in Eclipse


(Remote) debugging is a great way to troubleshoot and figure out how code works. But sometimes we only have the jars and no source code, for example when the code is closed or proprietary and there is nowhere to get the source.

Luckily, we can use JDEclipse-Realign to easily debug classes without source in Eclipse.

1. Install and Configure JDEclipse-Realign in Eclipse
Install via JDEclipse-Realign update site http://mchr3k-eclipse.appspot.com/.

Click "Window" -> "Preferences", type "File Association". Select "class without source", in the dialogue below,change the default from "Class File Viewer [Decompiled]" to "Class File Editor".

2. Find the jars containing the classes we want to debug
On Linux, use grep -r -s full_class_name * to find the jar.

3. Create a Java project with the jars
Create a Java project in Eclipse and add the jar (and related jars) to the project's build path.
Now if we right-click on the jar and select "Attach Source", we can see "Decompiled Source" is selected.

4. Enable remote debug
Add -Xdebug -Xrunjdwp:transport=dt_socket,address=1044,server=y,suspend=y to the JVM options of the remote Java application.

Then configure Eclipse for remote debugging by clicking "Run" -> "Debug Configurations", create a new "Remote Java Application", enter the host and port number (1044 in this case), and be sure to select the previously-created project in the project text box.

5. Add breakpoints in compiled classes and run
After adding breakpoints, run the remote application: it will stop and wait for a remote debug client to connect to port 1044.
Start the remote debug configuration in Eclipse; the application will now stop at the breakpoints.

Other tools
JD-GUI from http://jd.benow.ca/

CFR - A decompiler that supports Java 8 language features, including lambda expressions and method references.

java -jar cfr.jar class_or_jar_file [method] [options]

Resources
Mchr3k - JDEclipse-Realign
JDEclipse-Realign Github
JD-GUI

Advanced Usage of Linux Grep Command


Grep recursively: -R, -r, --recursive
grep -r "127" /etc

Use grep to search words only: -w, --word-regexp
Select only those lines containing matches that form whole words
grep -w "boo" file

-s, --no-messages
Suppress error messages about nonexistent or unreadable files.

Ignore binary files: -I
When we search, we may only want to search text files and ignore binary files for better performance. We can use -I.
-I Process a binary file as if it did not contain matching data; this is equivalent to the --binary-files=without-match option.

Combine grep and find
grep "some_string" `find some_folder -name "*.some_extension"`
find . -name "*.php" -exec grep -H "some_string" {} \;

--include=GLOB
Search only files whose base name matches GLOB
--exclude=GLOB
Skip files whose base name matches GLOB (using wildcard matching). A file-name glob can use *, ?, and [...] as wildcards, and \ to quote a wildcard or backslash character literally.
--exclude-dir=DIR
grep -rsI --include=*.html --include=*.php --include=*.htm "pattern" /some/path/
grep -rsI --include=*.{html,php,htm} "pattern" /some/path/
grep -rsI --include=*.{py,pyc} hue /etc

Find jars that contain matched classes
Grep can be used to find the jar that contains the matched class file:
grep -r com.cloudera.cmf.inspector.Inspector /usr/share/cmf/lib

Misc
-c, --count
Suppress normal output; instead print a count of matching lines for each input file
-n, --line-number
-h, --no-filename
-H, --with-filename
-v, --invert-match
--color[=WHEN], --colour[=WHEN]
grep --color root /etc/passwd

Resources
Grep Manual

Linux Mount and Unmount Remote File System


Create the Mount Point
sudo mkdir /mnt/ip-shared

Mount a Windows shared folder
mount //server-name/share-name /mnt/ip-shared -o username=shareuser,password=sharepassword,domain=sharedomain
Here, -o is used to specify mount options; in this case we specify the login credentials: username, password, and domain. The format can also be -o username=sharedomain/shareuser,password=sharepassword
Here we don't use -t to specify the vfstype, so mount will try to guess the desired type. Check the list in the mount manual.

Common Mount Options
loop - Mounts an image as a loop device.
ro - Mounts the file system for reading only.
rw - Mounts the file system for both reading and writing.
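For instance, these options can be combined with the credentials above; a sketch that mounts the same share explicitly as CIFS and read-only (server and share names are the same placeholders):
mount -t cifs //server-name/share-name /mnt/ip-shared -o ro,username=shareuser,password=sharepassword,domain=sharedomain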

Unmount it
When we no longer need to access the remote file system, we can use umount to unmount it. Notice it's umount, not unmount.
umount /mnt/ip-shared

Resource
Linux Mount Manual
Linux mount CIFS Windows Share

Using Chrome DevTools to Hack Client-Side Only Validation


Client-side validation provides faster responses and a better user experience, and reduces server load. But relying on client-side validation alone is a terrible idea.

Users can easily bypass the validation and change the values of JavaScript objects using tools like Chrome DevTools or Firebug, so we must always do appropriate server-side validation as well.
Using Chrome DevTools to Hack it
One website limits the max value of one field, and I wondered whether I could bypass the verification.
1. Find the Pattern
I noticed that when I input an invalid value and move to the next field, the field's color changes to red.
2. Break on DOM Attribute Modifications
When the color of the field changes, I want Chrome to stop at that moment so I can figure out how this website does the verification.

So I right-click the input text box and click "Inspect Element"; this opens Chrome DevTools and selects the element in the "Elements" tab.
<input title="" class="input_field_" type="text" name="Memory" id="memory">
Right click it, and select "Break on..." -> "Attributes modifications".

This causes Chrome to pause when an attribute of this input text box changes.
It would be better if Chrome DevTools allowed us to stop when a particular attribute changes (the class attribute in this case), or even better, when a particular attribute changes to a particular value. But anyhow, this works.
3. Reproduce It
Now input the invalid value, then trigger the event that causes the client-side validation. Usually this happens when focus leaves the current field or we click next or submit.

This causes Chrome to stop execution, switch to the "Sources" tab, and pause at the code that changes the attribute.
4. Check the Call Stack
Now we can check the call stack.
To make the JavaScript much easier to read, we can press the "Pretty print" button (marked with curly braces {}) at the bottom of the "Sources" tab.

Following the call stack, I found it does the verification like below:
function validateSteps(step) 
{
    var isStepValid = true;
  // omitted
    if (step == step_config) {
        return validateField('#memory', selectedPolicy.minMemory, selectedPolicy.maxMemory) &&...);
    }
    return isStepValid; // we can set a breakpoint here and change isStepValid to true.
}
The code compares the value from the memory text box with selectedPolicy.minMemory and selectedPolicy.maxMemory.
5. Hack it
Now there are several ways to hack it: we can step into the validateField method and change its return value.
A simpler way is to directly modify the selectedPolicy object and change maxMemory to a much bigger value. Go to the "Console" tab and type selectedPolicy.maxMemory; this prints the current value, which is 4. Then type selectedPolicy.maxMemory=100.

Now remove all breakpoints, press resume, and submit the form. It works: this site doesn't check the value on the server side at all.

Later, if I want to do the same trick and bypass the verification, I can go to the "Console" tab directly and type selectedPolicy.maxMemory=100.

In the Console tab we can do many things: run any valid JavaScript code in the current scope, inspect the window object, and discover any custom properties and functions.

Happy Hacking!!!
Resources
Tips And Tricks

Using Guava Splitter and Joiner


In my last post, Solr RssResponseWriter by Extending XMLWriter, I needed to parse a field mapping string like topic:title,url:link,description:description into a map.

If we wrote our own code, it would look like below:
public void splitOnOurOwn() {
 String mapStr = "topic: title, url: link, description: description ";
 String[] pairs = mapStr.split(",");
 Map<String,String> map = new HashMap<String,String>();
 for (String pair : pairs) {
  String[] str = pair.split(":");
  map.put(str[0].trim(), str[1].trim());
 }
 System.out.println("toString: " + Objects.toString(map));
}
But if we use Guava it would be just two lines:

private static MapSplitter splitter = Splitter.on(",").trimResults()
  .withKeyValueSeparator(Splitter.on(':').trimResults());
private static MapJoiner joiner = Joiner.on(",").withKeyValueSeparator(":");

public void guavaSplit() {
 String mapStr = "topic: title, url: link, description: description ";
 Map<String,String> map = splitter.split(mapStr.trim());
 System.out.println("toString: " + Objects.toString(map));
 System.out.println("join: " + joinner.join(map));
}
Guava Splitter provides some other useful methods such as fixedLength, omitEmptyStrings, trimResults, limit.


As shown in the code above, Guava provides a Joiner that can join text with a separator. Joiner also provides useful methods such as skipNulls and useForNull(nullText).
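A small sketch of how these extra methods behave (the class below is just for illustration; the expected output is shown in the comments):
import com.google.common.base.Joiner;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

public class SplitterJoinerExtras {
  public static void main(String[] args) {
    // fixedLength splits the input into equal-sized chunks
    System.out.println(Lists.newArrayList(Splitter.fixedLength(3).split("abcdef"))); // [abc, def]
    // omitEmptyStrings and trimResults drop blank pieces
    System.out.println(Lists.newArrayList(
        Splitter.on(',').omitEmptyStrings().trimResults().split("a, ,b,,c"))); // [a, b, c]
    // limit stops splitting after the given number of pieces
    System.out.println(Lists.newArrayList(Splitter.on(',').limit(2).split("a,b,c"))); // [a, b,c]
    // skipNulls and useForNull control how Joiner handles null values
    System.out.println(Joiner.on(",").skipNulls().join("a", null, "b")); // a,b
    System.out.println(Joiner.on(",").useForNull("none").join("a", null, "b")); // a,none,b
  }
}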
Resources
Strings Explained
Splitter Javadoc
Joiner Javadoc
Guava Splitter vs StringUtils

Configure Tomcat SSL Using PFX(PKCS12) Certificate


I am trying to import a certificate from Entrust into Tomcat.
Entrust provided a pfx file to us. PFX means Personal Information Exchange; it stores many cryptography objects in a single file. Read more about PKCS #12.

To import the pfx (PKCS #12) file into Tomcat or another Java web server, the easy solution is to convert it to a Java KeyStore (JKS) file.
1. Using keytool
Since JDK 6, we can use the JDK keytool to convert PKCS #12 to JKS.
keytool -importkeystore -srckeystore file.pfx -srcstoretype PKCS12 -destkeystore cert.jks -deststoretype JKS
2. Using XWSS
For older JDKs, we can use the XWSS utility to convert PKCS #12 to JKS.
XWSS (the XML and WebServices Security Project) is part of Project Metro in the GlassFish community. It provides some utilities that can be downloaded from here.

Download pkcs12import.zip and unzip it; inside we can find pkcs12import.bat.
pkcs12import usage
pkcs12import -file pkcs12-file [ -keystore keystore-file ]
[ -pass pkcs12-password ]   [ -storepass store-password ]  [ -keypass key-password ] [ -alias alias ]

Add SSL Connector in server.xml
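A minimal HTTPS connector sketch for server.xml (the port, keystore path, and password are placeholders), using the cert.jks produced above:
<Connector port="443" protocol="HTTP/1.1" SSLEnabled="true"
           maxThreads="150" scheme="https" secure="true"
           clientAuth="false" sslProtocol="TLS"
           keystoreFile="conf/cert.jks" keystorePass="changeit" keystoreType="JKS" />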

 
Restart Tomcat and try to access https://localhost/
Resources
Keytool
PKCS 12 Wiki
Converting .pfx Files to .jks Files
How to import PFX file into JKS using pkcs12import utility

Using Guava Stopwatch


When reading the post TRIM, LTRIM and RTRIM in Java, I had a doubt about the performance difference between ltrim, which uses a regular expression, and the straightforward ltrim3. So I used Guava Stopwatch to test their performance.
Guava Usage
1. Use the Stopwatch static factory method to create an instance:
Stopwatch stopwatch = Stopwatch.createStarted();
2. Use stopwatch.elapsed(TimeUnit) to get the elapsed time.
3. We can reuse the same Stopwatch instance to measure another operation, but we have to call reset() and then start() first.
The reset() method resets the internal state (elapsedNanos=0, isRunning=false); calling start() starts measuring elapsed time again.
Conclusion
The test result is what I expected: the straightforward one gives the best performance (times in milliseconds).
458 for ltrim3 (the straightforward one)
3967 for ltrim (using regular expression)

The code looks like below:
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

import com.google.common.base.Stopwatch;

public class LTrimTester {
  private final static Pattern LTRIM = Pattern.compile("^\\s+");
  private final static Pattern RTRIM = Pattern.compile("\\s+$");

  public static void main(String[] args) {
    int times = 10000000;
    Stopwatch stopwatch = Stopwatch.createStarted();
    for (int i = 0; i < times; i++) {
      String str = "  hello world  ";
      ltrim(str);
    }
    System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS)
        + " for ltrim(using regular expression)");

    stopwatch.reset();
    stopwatch.start();
    for (int i = 0; i < times; i++) {
      String str = "  hello world  ";
      ltrim3(str);
    }
    System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS)
        + " for ltrim3(the straightforward one)");
    stopwatch.stop();
  }

  public static String ltrim(String s) {
    return LTRIM.matcher(s).replaceAll("");
  }

  public static String ltrim3(String s) {
    int i = 0;
    while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
      i++;
    }
    return s.substring(i);
  }
}
Resources
TRIM, LTRIM and RTRIM in Java
Guava Stopwatch Javadoc
Guava Stopwatch

Solr: Creating a DocTransformer to Show Doc Offset in Response


On the client, the doc offset can be easily computed: start * rows + (offset in current response).

But in some cases it is useful to show the offset explicitly. For example, when debugging search relevancy, a tester may ask why some docs are not shown on the first page. In that case, we may run the query with a big rows value, for example q="Nexus 7"&rows=500, and then search for the doc in the response XML. It would be helpful if the response could show the offset directly, like below:
<doc>
 <str name="id">id3456</str>
 <long name="[offset]">156</long>
</doc>
The field name would be [transformername], which is [offset] in this case.
Solr DocTransformers
Solr DocTransformers allow us to add, change, or remove fields in the response before it is returned to the client. By default, Solr provides [explain], [value], [shard], [docid]. We can easily add our own DocTransformer implementation.

OffsetTransformerFactory Implementation
The implementation looks similar to ValueAugmenterFactory. To use it, we add the transformer to the fl parameter: q="Nexus 7"&fl=id,[offset]
public class OffsetTransformerFactory extends TransformerFactory {
  private boolean enabled = false;
  public void init(NamedList args) {
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      enabled = params.getBool("enabled", false);
      if (!enabled) return;
    }
    super.init(args);
  }
  
  /*
   * field is [offset] in this case.<br>
   * Notice augmenterArgs contains the local params of this transformer, like
   * [myTransformer foo=1 bar=good], not the parameters in SolrQueryRequest.
   */
  public DocTransformer create(String field, SolrParams augmenterArgs,
      SolrQueryRequest req) {
    SolrParams params = req.getParams();
    String str = params.get(CommonParams.START);
    long start = 0;
    if (StringUtils.isNotBlank(str)) {
      start = Long.valueOf(str);
    }
    long rows = Long.valueOf(params.get(CommonParams.ROWS));
    long startOffset = start * rows;
    return new OffsetTransformer(field, startOffset);
  }
  
  class OffsetTransformer extends DocTransformer {
    private String field;
    private long startOffset;
    private long offset = 0;
    
    public OffsetTransformer(String field, long startOffset) {
      this.field = field;
      this.startOffset = startOffset;
    }
    public void transform(SolrDocument doc, int docid) throws IOException {
      if (enabled) {
        doc.setField(field, startOffset + offset);
        ++offset;
      }
    }
    public String getName() {
      return OffsetTransformer.class.getName();
    }  
  }
}
Configuration in solrconfig.xml
<transformer name="offset"
  class="OffsetTransformerFactory">
  <bool name="enabled">true</bool>
</transformer>
Resources
Solr DocTransformers

Git Clone a Specific Version/Tag/Branch


I am trying to learn Guava implementation details, so I git-cloned the source code as described here.
https://code.google.com/p/guava-libraries/source/checkout
git clone https://code.google.com/p/guava-libraries/
Cloning into 'guava-libraries'...
error: error setting certificate verify locations:
  CAfile: chrome\depot_tools\git-1.8.0_bin/bin/curl-ca-bundle.crt
  CApath: none
 while accessing https://code.google.com/p/guava-libraries/info/refs?service=git-upload-pack
fatal: HTTP request failed
I changed from https to http:
git clone http://code.google.com/p/guava-libraries/ 

It worked. Then I ran mvn eclipse:eclipse, and it failed with the following exception:
[ERROR] Failed to execute goal on project guava-testlib: Could not resolve dependencies for project com.google.guava:guava-testlib:jar:16.0-SNAPSHOT: Failure to find com.google.guava:guava:jar:16.0-SNAPSHOT in https://oss.sonatype.org/content/repositories/snapshots was cached in the local repository, resolution will not be reattempted until the update interval of sonatype-nexus-snapshots has elapsed or updates are forced -> [Help 1]

So I tried to check out the old version, release 15.0.
I ran git tag to list all tags and found the tag name of release 15.0: v15.0

To switch the local repo to release 15.0, I use: git checkout v15.0 -b local_15.0
The format is: git checkout tag_name -b new_branch_name

Then I ran mvn eclipse:eclipse and mvn install, and it worked.

To directly clone release 15.0, we can use:
git clone http://code.google.com/p/guava-libraries/  -b v15.0
Resources
git-clone manual page
http://stackoverflow.com/questions/3231079/how-to-see-all-tags-in-a-git-repository-in-command-line
http://stackoverflow.com/questions/791959/how-to-use-git-to-download-a-particular-tag

Solr RssResponseWriter by Extending XMLWriter


The problem
Customers want to show search results from Solr in an RSS reader, so we need to customize the Solr response into RSS format.
There are several ways to do this in Solr:
1. We can use the XSLT Response Writer: write an XSLT stylesheet to transform the XML into RSS format. Check XsltResponseWriter:
wt=xslt&tr=example_rss.xsl
We can change the example_rss.xsl or example_atom.xsl that Solr provides to match our needs.
2. We can write our own Solr ResponseWriter class to write the RSS-format response, as described in this post.
Solr ResponseWriter
Solr defines several Response Writers, such as XMLResponseWriter, XsltResponseWriter, CSVResponseWriter, etc.
TextResponseWriter is the base class for text-oriented response writers. Solr also allows us to define our own new Response Writers.
The Solution
As the RSS format is similar to the Solr XML response, we can extend XMLWriter and reuse existing code as much as possible.

The difference between Solr XML and Expected RSS format
1. The overall structural difference
In Solr, the format is like: response->result->doc. 
In Rss, the format is like below:
<rss version="2.0">
  <channel>
    <title>title here</title> //channel metadata
    <link>link here</link>    //channel metadata
    <description>description here</description> //channel metadata
    <item>
       <title>item1 title</title> //item metadata
       <link>item1 link</link> //item metadata
       <description>item1 description</description> //item metadata
    </item>
  </channel>
</rss>
For this, we need to update the writeResponse method to change the overall structure.
2. The element structural difference
In Solr, the element format is like:
<element-type name="id"> //  like str, int, arr
</element-type>
In RSS:
<element-name name="id"> //  like title, link, etc
</element-name>
For this, we need to update the writeStr/Int/Long implementations.
3. Field name mapping
The field names in Solr may not be what we expect; for example, we may want to map the field "url" to "link". We can define a new parameter, flmap, and define the mapping in solrconfig.xml.
<str name="fl">title,url,id,score,physicalpath</str>

<str name="flmap">title,link,,,physicalpath</str> 
In the above example, url is renamed to link, fields id and score are ignored, and title and physicalpath remain the same.
Or we can add fl and flmap as request parameters.

RssResponseWriter Implementation
import com.google.common.base.CharMatcher;
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

public class RssWriter extends XMLWriter {
  private static final Splitter split = Splitter.on(CharMatcher.anyOf(","))
      .trimResults();
  private static final char[] XML_START1 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      .toCharArray();
  private Map<String,String> oldToNewFLMapping = new HashMap<String,String>();
  private String baseURL;
  
  public RssWriter(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp)
      throws IOException {
    super(writer, req, rsp);
    SolrParams solrParams = req.getParams();
    String fl = solrParams.get("fl");
    
    
    String flmap = solrParams.get("flmap");
    if (fl == null || flmap == null) {
      throw new IOException("do not get fl or flmap parameter");
    }
    
    ArrayList<String> oldFLs = Lists.newArrayList(split.split(fl));
    ArrayList<String> newFLs = Lists.newArrayList(split.split(flmap));
    if (oldFLs.size() != newFLs.size()) {
      throw new IOException("field count different in fl and rnamefl parameter");
    }
    
    Iterator<String> oldIt = oldFLs.iterator(), newIt = newFLs.iterator();
    while (newIt.hasNext()) {
      String oldFl = oldIt.next();
      String newFl = newIt.next();
      if (!StringUtils.isBlank(newFl)) {
        oldToNewFLMapping.put(oldFl, newFl);
      }
    }
    getBaseUrl(req);
    
  }
  @Override
  public void writeResponse() throws IOException {
    writer.write(XML_START1);
    writer.write("<rss version=\"2.0\">");
    writer.write("<channel>");
    String qstr = req.getParams().get(CommonParams.Q);
    writeVal("title", qstr);
    String fullUrl = req.getContext().get("fullUrl").toString();
    writeCdata("link", fullUrl);
    writeVal("copyright", "Copyright ......");
    
    NamedList<?> lst = rsp.getValues();
    Object obj = lst.get("response");
    DocList docList = null;
    if (obj instanceof ResultContext) {
      ResultContext context = (ResultContext) obj;
      docList = context.docs;
    } else if (obj instanceof DocList) {
      docList = (DocList) obj;
    } else {
      throw new RuntimeException("Unkown type: " + obj.getClass());
    }
    writeVal("numFound", Integer.toString(docList.matches()));
    writeVal("start", Integer.toString(docList.offset()));
    writeVal("maxScore", Float.toString(docList.maxScore()));
    
    Set<String> fields = new HashSet<String>(oldToNewFLMapping.keySet());
    SolrIndexSearcher searcher = req.getSearcher();
    DocIterator iterator = docList.iterator();
    int sz = docList.size();
    for (int i = 0; i < sz; i++) {
      int id = iterator.nextDoc();
      Document doc = searcher.doc(id, fields);
      writeVal("item", doc);
    }
    writer.write("\n</channel>");
    writer.write("\n</rss>");
  } 
  @Override
  public void writeSolrDocument(String name, SolrDocument doc,
      ReturnFields returnFields, int idx) throws IOException {
    startTag("item", false);
    incLevel();
    boolean hasLink = false;
    
    Set<String> oldFLs = oldToNewFLMapping.keySet();
    for (String oldFL : returnFields.getLuceneFieldNames()) {
      String newName = oldFL;
      if (oldFLs.contains(oldFL)) {
        newName = oldToNewFLMapping.get(oldFL);
      }
      Object val = doc.getFieldValue(oldFL);
      writeVal(newName, val);
      if ("link".equalsIgnoreCase(newName)) {
        hasLink = true;
      }
    }
    if (!hasLink) {
      String uniqueKey = schema.getUniqueKeyField().getName();
      String uniqueKeyValue = "";
      if (uniqueKey != null) {
        Object obj = doc.getFieldValue(uniqueKey);
        if (obj instanceof Field) {
          Field field = (Field) obj;
          uniqueKeyValue = field.stringValue();
        } else {
          uniqueKeyValue = obj.toString();
        }
      }
      writeCdata("link", baseURL + "viewsourceservlet?docid=" + uniqueKeyValue);
    }
    decLevel();
    if (doIndent) indent();
    writer.write("</item>");
  }
  @Override
  public void writeArray(String name, Iterator iter) throws IOException {
    if (iter.hasNext()) {
      incLevel();
      while (iter.hasNext()) {
        writeVal(name, iter.next());
      }
      decLevel();
    } else {
      startTag(name, true);
    }
  }
  @Override
  public void writeStr(String name, String val, boolean escape)
      throws IOException {
    writePrim(name, val, escape);
  }
  public void writeCdata(String tag, String val) throws IOException {
    writer.write("<" + tag + ">");
    writer.write("<![CDATA[" + val + "]]>");
    writer.write("</" + tag + ">");
  }
  private void writePrim(String name, String val, boolean escape)
      throws IOException {
    int contentLen = val == null ? 0 : val.length();
    
    startTag(name, contentLen == 0);
    if (contentLen == 0) return;
    
    if (escape) {
      XML.escapeCharData(val, writer);
    } else {
      writer.write(val, 0, contentLen);
    }
    writer.write('<');
    writer.write('/');
    writer.write(name);
    writer.write('>');
  }  
  void startTag(String name, boolean closeTag) throws IOException {
    if (doIndent) indent();
    
    writer.write('<');
    writer.write(name);
    if (closeTag) {
      writer.write("/>");
    } else {
      writer.write('>');
    }
  }
  public void getBaseUrl(SolrQueryRequest req) {
    String url = req.getContext().get("url").toString();
    int i = 0;
    int j = 0;
    for (j = 0; j < url.length() && i < 3; ++j) {
      if (url.charAt(j) == '/') {
        ++i;
      }
    }
    baseURL = url.substring(0, j);
  }
  
  @Override
  public void writeNull(String name) throws IOException {
    writePrim(name, "", false);
  }
  
  @Override
  public void writeInt(String name, String val) throws IOException {
    writePrim(name, val, false);
  }
  
  @Override
  public void writeLong(String name, String val) throws IOException {
    writePrim(name, val, false);
  }
  
  @Override
  public void writeBool(String name, String val) throws IOException {
    writePrim(name, val, false);
  }
  
  @Override
  public void writeFloat(String name, String val) throws IOException {
    writePrim(name, val, false);
  }
  
  @Override
  public void writeDouble(String name, String val) throws IOException {
    writePrim(name, val, false);
  }
  
  @Override
  public void writeDate(String name, String val) throws IOException {
    writePrim(name, val, false);
  }
}
RSSResponseWriter
public class RSSResponseWriter implements QueryResponseWriter {
  public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp)
      throws IOException {
    RssWriter rssWriter = new RssWriter(writer, req, rsp);
    try {
      rssWriter.writeResponse();
    } finally {
      rssWriter.close();
    }
  }
  public String getContentType(SolrQueryRequest request,
      SolrQueryResponse response) {
    return CONTENT_TYPE_XML_UTF8;
  }
  public void init(NamedList args) {}
}
Configuration
<requestHandler name="/rss" class="solr.SearchHandler">
 <lst name="defaults">
  <str name="rows">10</str>
  <str name="wt">rss</str>
  <!--default mapping-->
  <str name="fl">title,url,id,score,physicalpath</str>
  <str name="flmap">title,link,,,,physicalpath</str> 
 </lst>
</requestHandler>
<queryResponseWriter name="rss" class="org.apache.solr.response.RSSResponseWriter"/>
Resources
QueryResponseWriter
Solr Search Result (attribute-to-tag) customization using XsltResponseWriter
RSS 2.0 Specification
RSS Tutorial

Solr: Add new fields with Default Value for Existing Documents


In some cases, we have to upgrade an existing Solr application to add new fields, and we don't want to, or can't, reindex the old data.
For example, in the old Solr application we only store information about regular files. Now we need to upgrade it to store other types of files, so we want to add a field fileType: fileType=0 means a regular file, fileType=1 means a folder. In the future we may add other file types.

We add the following definition in schema.xml:
<field name="fileType" type="TINT" indexed="true" stored="true" default="-1"/>

Adding this definition to schema.xml doesn't affect existing data: old documents still don't have a fileType field. There is no fileType value in the response, and no terms in the fileType field to query against: the query fileType:[* TO *] returns an empty result.

To fix this issue, we have to consider two parts: the search query and the search response.
Fix Search Query by Querying the NULL Field
All old data is about regular files, so if a document has no value for fileType, it's a regular file.
When we search for regular files, the query should be adjusted as below; it fetches docs where the value of fileType is 0, or where fileType has no value.
-(-fileType:0 AND fileType:[* TO *])

No change is needed when searching for other file types. We can wrap this change in our own search handler (if the query includes fileType:0, change it to -(-fileType:0 AND fileType:[* TO *])), or we can write a new query parser.
Fix Search Response by Using a DocTransformer to Add the Default Value
For the old data, there is no value for fileType, so we need to add fileType=0 to the search response. To do this, we can define a Solr DocTransformer.
DocTransformers allow us to modify the fields that are returned to the user.
In our DocTransformer, we check the value of fileType; if there is no value, we set it to the default value. Now the response data (XML or JSON) shows fileType=0 for old data.
NullDefaultValueTransformerFactory Implementation
public class NullDefaultValueTransformerFactory extends TransformerFactory {
  private Map<String,String> nullDefaultMap = new HashMap<String,String>();
  private boolean enabled = false;
  protected static Logger logger = LoggerFactory
      .getLogger(NullDefaultValueTransformerFactory.class);
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      enabled = params.getBool("enabled", false);
      if (!enabled) return;
      
      List<String> fieldNames = new ArrayList<String>();
      String str = params.get("fields");
      if (str != null) {
        fieldNames = StrUtils.splitSmart(str, ',');
      }
      List<String> nullDefaultvalue = new ArrayList<String>();
      str = params.get("nullDefaultValue");
      if (str != null) {
        nullDefaultvalue = StrUtils.splitSmart(str, ',');
      }
      if (fieldNames.size() != nullDefaultvalue.size()) {
        logger.error("Size doesn't match, fieldNames.size: "
            + fieldNames.size() + ",nullDefaultvalue.size: "
            + nullDefaultvalue.size());
        enabled = false;
      } else {
        if (fieldNames.isEmpty()) {
          logger.error("No fields are set.");
          enabled = false;
        }
      }
      if (!enabled) return;
      
      for (int i = 0; i < fieldNames.size(); i++) {
        nullDefaultMap.put(fieldNames.get(i).trim(), nullDefaultvalue.get(i)
            .trim());
      }
    }
  }
  public DocTransformer create(String field, SolrParams params,
      SolrQueryRequest req) {
    return new NullDefaultValueTransformer();
  }
  
  class NullDefaultValueTransformer extends DocTransformer {
    public String getName() {
      return NullDefaultValueTransformer.class.getName();
    }
    public void transform(SolrDocument doc, int docid) throws IOException {
      if (enabled) {
        Iterator<Entry<String,String>> it = nullDefaultMap.entrySet()
            .iterator();
        while (it.hasNext()) {
          Entry<String,String> entry = it.next();
          String fieldName = entry.getKey();
          Object obj = doc.getFieldValue(fieldName);
          if (obj == null) {
            doc.setField(fieldName, entry.getValue());
          }
        }
      }
    }
  }
}
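To enable it, register the factory in solrconfig.xml with parameters matching the init() code above; a sketch (the transformer name "nulldefault" and the package-less class name are placeholders):
<transformer name="nulldefault" class="NullDefaultValueTransformerFactory">
  <bool name="enabled">true</bool>
  <str name="fields">fileType</str>
  <str name="nullDefaultValue">0</str>
</transformer>
Then request it via the fl parameter, for example fl=*,[nulldefault].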
With the previous two changes, the client application can mostly treat the old data as having the default value 0 for fileType. Be aware that some functions, such as sort and stats, will still not work on the old data.
Resources
Solr: Use DocTransformer to Change Response
Searching for date range or null/no field in Solr
Solr DocTransformers
