Nutch2 Http Form Authentication-Part3: Integrate Http Form Post Authentication in Nutch2

The Problem
HTTP form-based authentication is a very commonly used mechanism to protect web resources.
When crawling, Nutch supports NTLM, Basic, and Digest authentication to authenticate itself to websites, but it doesn't support HTTP POST form authentication.

This series of articles talks about how to extend Nutch2 to support Http Post Form Authentication.
Main Steps
Use Apache Http Client to do http post form authentication.
Make http post form authentication work.
Integrate http form authentication in Nutch2.

After the previous two steps, we can now integrate HTTP form authentication into Nutch2.
Define Http Form Post Authentication Properties in httpclient-auth.xml
First, in nutch-site.xml, change plugin.includes to use the protocol-httpclient plugin instead of the default protocol-http.
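A minimal nutch-site.xml fragment might look like the following; the exact plugin list is an assumption and varies by Nutch version and setup — the key point is replacing protocol-http with protocol-httpclient:

```xml
<!-- Illustrative only: a typical plugin.includes value with protocol-http
     swapped for protocol-httpclient; adjust the rest to your setup. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```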

Nutch uses the http.auth.file property to locate the XML file that defines credential info; the default value is httpclient-auth.xml. We extend httpclient-auth.xml to include HTTP form authentication properties. The httpclient-auth.xml for the ASP.NET web application from the last post looks like this:

<?xml version="1.0"?>
<auth-configuration>
  <credentials authMethod="formAuth" loginUrl="http://localhost:44444/Account/Login.aspx" loginFormId="ctl01" loginRedirect="true">
    <loginPostData>
      <field name="ctl00$MainContent$LoginUser$UserName" value="admin"/>
      <field name="ctl00$MainContent$LoginUser$Password" value="admin123"/>
    </loginPostData>
    <removedFormFields>
      <field name="ctl00$MainContent$LoginUser$RememberMe"/>
    </removedFormFields>
  </credentials>
</auth-configuration>
Read Http Form Post Authentication Properties from the Configuration XML File
In Nutch's protocol-httpclient plugin, change the org.apache.nutch.protocol.httpclient.Http.setCredentials() method to read the authentication info from the configuration file into the variable formConfigurer.
Then change the Http.resolveCredentials() method: if formConfigurer is not null, use HttpFormAuthentication to do the form post login.
package org.apache.nutch.protocol.httpclient;
public class Http extends HttpBase {
 private void resolveCredentials(URL url) {
  if (formConfigurer != null) {
   HttpFormAuthentication formAuther = new HttpFormAuthentication(
     formConfigurer, client, this);
   try {
    formAuther.login();
   } catch (Exception e) {
    throw new RuntimeException(e);
   }
   return;
  }
  }
 private static synchronized void setCredentials()
   throws ParserConfigurationException, SAXException, IOException {

  if (authRulesRead)
   return;

  authRulesRead = true; // Avoid re-attempting to read
  InputStream is = conf.getConfResourceAsInputStream(authFile);
  if (is != null) {
   Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().parse(is);

   Element rootElement = doc.getDocumentElement();
   if (!"auth-configuration".equals(rootElement.getTagName())) {
    if (LOG.isWarnEnabled())
     LOG.warn("Bad auth conf file: root element <"
       + rootElement.getTagName() + "> found in "
       + authFile + " - must be <auth-configuration>");
   }

   // For each set of credentials
   NodeList credList = rootElement.getChildNodes();
   for (int i = 0; i < credList.getLength(); i++) {
    Node credNode = credList.item(i);
    if (!(credNode instanceof Element))
     continue;

    Element credElement = (Element) credNode;
    if (!"credentials".equals(credElement.getTagName())) {
     if (LOG.isWarnEnabled())
      LOG.warn("Bad auth conf file: Element <"
        + credElement.getTagName()
        + "> not recognized in " + authFile
        + " - expected <credentials>");
     continue;
    }
        // read http form post auth info
    String authMethod = credElement.getAttribute("authMethod");
    if (StringUtils.isNotBlank(authMethod)) {
     formConfigurer = readFormAuthConfigurer(credElement,
       authMethod);
     continue;
    }
      }
    }
  }
 private static HttpFormAuthConfigurer readFormAuthConfigurer(
   Element credElement, String authMethod) {
  if ("formAuth".equals(authMethod)) {
   HttpFormAuthConfigurer formConfigurer = new HttpFormAuthConfigurer();

   String str = credElement.getAttribute("loginUrl");
   if (StringUtils.isNotBlank(str)) {
    formConfigurer.setLoginUrl(str.trim());
   } else {
    throw new IllegalArgumentException("Must set loginUrl.");
   }
   str = credElement.getAttribute("loginFormId");
   if (StringUtils.isNotBlank(str)) {
    formConfigurer.setLoginFormId(str.trim());
   } else {
    throw new IllegalArgumentException("Must set loginFormId.");
   }
   str = credElement.getAttribute("loginRedirect");
   if (StringUtils.isNotBlank(str)) {
    formConfigurer.setLoginRedirect(Boolean.parseBoolean(str));
   }

   NodeList nodeList = credElement.getChildNodes();
   for (int j = 0; j < nodeList.getLength(); j++) {
    Node node = nodeList.item(j);
    if (!(node instanceof Element))
     continue;

    Element element = (Element) node;
    if ("loginPostData".equals(element.getTagName())) {
     Map<String, String> loginPostData = new HashMap<String, String>();
     NodeList childNodes = element.getChildNodes();
     for (int k = 0; k < childNodes.getLength(); k++) {
      Node fieldNode = childNodes.item(k);
      if (!(fieldNode instanceof Element))
       continue;

      Element fieldElement = (Element) fieldNode;
      String name = fieldElement.getAttribute("name");
      String value = fieldElement.getAttribute("value");
      loginPostData.put(name, value);
     }
     formConfigurer.setLoginPostData(loginPostData);
    } else if ("additionalPostHeaders".equals(element.getTagName())) {
     Map<String, String> additionalPostHeaders = new HashMap<String, String>();
     NodeList childNodes = element.getChildNodes();
     for (int k = 0; k < childNodes.getLength(); k++) {
      Node fieldNode = childNodes.item(k);
      if (!(fieldNode instanceof Element))
       continue;

      Element fieldElement = (Element) fieldNode;
      String name = fieldElement.getAttribute("name");
      String value = fieldElement.getAttribute("value");
      additionalPostHeaders.put(name, value);
     }
     formConfigurer
       .setAdditionalPostHeaders(additionalPostHeaders);
    } else if ("removedFormFields".equals(element.getTagName())) {
     Set<String> removedFormFields = new HashSet<String>();
     NodeList childNodes = element.getChildNodes();
     for (int k = 0; k < childNodes.getLength(); k++) {
      Node fieldNode = childNodes.item(k);
      if (!(fieldNode instanceof Element))
       continue;

      Element fieldElement = (Element) fieldNode;
      String name = fieldElement.getAttribute("name");
      removedFormFields.add(name);
     }
     formConfigurer.setRemovedFormFields(removedFormFields);
    }
   }
   return formConfigurer;
  } else {
   throw new IllegalArgumentException("Unsupported authMethod: "
     + authMethod);
  }
 }  
}  

Nutch2 Http Form Authentication-Part2: Make Http Post Form Authentication Work

The Problem
HTTP form-based authentication is a very commonly used mechanism to protect web resources.
When crawling, Nutch supports NTLM, Basic, and Digest authentication to authenticate itself to websites, but it doesn't support HTTP POST form authentication.

This series of articles talks about how to extend Nutch2 to support Http Post Form Authentication.
Main Steps
Use Apache Http Client to do http post form authentication.
Make http post form authentication work.
Integrate form authentication in Nutch2.

This article focuses on how to make HTTP POST form authentication work, through a practical example.
Create and Run ASP.NET Web Application
In Visual Studio, create an ASP.NET (MVC2) web application. The default generated application supports form authentication, which makes it a good target for testing our HTTP form login.

Write Test Code
To use HttpFormAuthentication to do HTTP POST form authentication, we first have to figure out the loginFormId: we can find it by searching for "<form" in the page source. Using Chrome DevTools' "Inspect element" function, we can also easily find the names of the username and password fields. Be sure to use the name attribute of the input element, not the id attribute.
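For reference, the relevant part of the rendered login page looks roughly like this (illustrative markup; the auto-generated ASP.NET ids and names can differ between projects):

```html
<!-- Hypothetical page source: the form's id attribute gives loginFormId,
     and the inputs' name attributes (not their ids) go into loginPostData. -->
<form id="ctl01" action="Login.aspx" method="post">
  <input name="ctl00$MainContent$LoginUser$UserName" type="text" />
  <input name="ctl00$MainContent$LoginUser$Password" type="password" />
  <input name="ctl00$MainContent$LoginUser$RememberMe" type="checkbox" />
  <input name="ctl00$MainContent$LoginUser$LoginButton" type="submit" value="Log In" />
</form>
```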

Now we can write test code:
private static void authTestAspWebApp() throws Exception {
  HttpFormAuthConfigurer authConfigurer = new HttpFormAuthConfigurer();
  authConfigurer.setLoginUrl("http://localhost:44444/Account/Login.aspx")
    .setLoginFormId("ctl01").setLoginRedirect(true);
  Map<String, String> loginPostData = new HashMap<String, String>();
  loginPostData.put("ctl00$MainContent$LoginUser$UserName", "admin");
  loginPostData.put("ctl00$MainContent$LoginUser$Password", "admin123");
  authConfigurer.setLoginPostData(loginPostData);

  Set<String> removedFormFields = new HashSet<String>();
  removedFormFields.add("ctl00$MainContent$LoginUser$RememberMe");
  authConfigurer.setRemovedFormFields(removedFormFields);

  HttpFormAuthentication example = new HttpFormAuthentication(
    authConfigurer);

  // example.client.getHostConfiguration().setProxy("127.0.0.1", 8888);

  String proxyHost = System.getProperty("http.proxyHost");
  String proxyPort = System.getProperty("http.proxyPort");
  if (StringUtils.isNotBlank(proxyHost)
    && StringUtils.isNotBlank(proxyPort)) {
   example.client.getHostConfiguration().setProxy(proxyHost,
     Integer.parseInt(proxyPort));
  }

  example.login();
  String result = example
    .httpGetPageContent("http://localhost:44444/secret/needlogin.aspx");
  System.out.println(result);
 }
Run the previous test code, then check the response code, response headers, and response body. We can copy the whole response body to JS Bin, where we can view the HTML much more easily.

What to Do if It Doesn't Work?
Sometimes things are not that simple: the previous code may still not work, the user is not logged in, and we can't access the protected resource.

When this happens, we need to compare the request Apache HttpClient sends with the request Chrome sends, including headers and request body.

We can use Chrome DevTools to get the request headers and post body; we can even copy the request as a cURL command and execute it on the command line.

We can also start Fiddler as a proxy, add example.client.getHostConfiguration().setProxy("127.0.0.1", 8888); to the test code, then monitor in Fiddler the requests and responses Apache HttpClient sends and receives.

Compare them and check whether some headers are missing; if so, add them to additionalPostHeaders. Check whether we need to remove some fields; if so, add them to removedFormFields. Check whether we need to add more fields; if so, add them to loginPostData.

After all this, we should be able to make it work.
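As a concrete sketch (the header value and field names below are hypothetical; use whatever your own capture shows), these fix-ups usually end up as plain map and set entries that are then handed to HttpFormAuthConfigurer via setAdditionalPostHeaders(...) and setRemovedFormFields(...):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HeaderDiffExample {
    public static void main(String[] args) {
        // A header Chrome sent but Apache HttpClient did not (hypothetical):
        Map<String, String> additionalPostHeaders = new HashMap<String, String>();
        additionalPostHeaders.put("Referer",
                "http://localhost:44444/Account/Login.aspx");

        // A form field the server rejects, so we exclude it from the post:
        Set<String> removedFormFields = new HashSet<String>();
        removedFormFields.add("ctl00$MainContent$LoginUser$RememberMe");

        // These would be passed to authConfigurer.setAdditionalPostHeaders(...)
        // and authConfigurer.setRemovedFormFields(...) before calling login().
        System.out.println(additionalPostHeaders.size() + " " + removedFormFields.size());
    }
}
```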

Nutch2 Http Form Authentication-Part1: Using Apache Http Client to Do Http Post Form Authentication

The Problem
HTTP form-based authentication is a very commonly used mechanism to protect web resources.
When crawling, Nutch supports NTLM, Basic, and Digest authentication to authenticate itself to websites, but it doesn't support HTTP POST form authentication.

This series of articles talks about how to extend Nutch2 to support Http Post Form Authentication.
Main Steps
Use Apache Http Client to do http post form authentication.
Make http post form authentication work.
Integrate form authentication in Nutch2.

Use Apache Http Client to Do Http Post Form Authentication
HttpFormAuthConfigurer
First let's look at the HttpFormAuthConfigurer class. loginUrl and loginFormId need no explanation. loginPostData stores the names and values of the login fields, such as username:user1 and password:password1. removedFormFields lists the input fields we want to remove, and additionalPostHeaders is used when we have to add additional header names and values to the login post. If loginRedirect is true and the login post returns a redirect code (301 or 302), HttpClient will automatically follow the redirect.
package org.apache.nutch.protocol.httpclient;
public class HttpFormAuthConfigurer {
 private String loginUrl;
 private String loginFormId;
 private Map<String, String> loginPostData;
 private Set<String> removedFormFields; 
 private Map<String, String> additionalPostHeaders;
 private boolean loginRedirect;
} 
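Only the fields are shown above; the test code in this series chains the setters (setLoginUrl(...).setLoginFormId(...)), which implies they return this. A minimal self-contained sketch of that fluent style, assuming nothing beyond what the chained calls require:

```java
public class FluentConfigSketch {
    // Sketch (assumption): each setter returns `this`, which is what makes
    // chained calls like setLoginUrl(...).setLoginFormId(...) possible.
    static class Config {
        private String loginUrl;
        private String loginFormId;
        private boolean loginRedirect;

        Config setLoginUrl(String url) { this.loginUrl = url; return this; }
        Config setLoginFormId(String id) { this.loginFormId = id; return this; }
        Config setLoginRedirect(boolean redirect) { this.loginRedirect = redirect; return this; }
    }

    public static void main(String[] args) {
        Config c = new Config()
                .setLoginUrl("http://localhost:44444/Account/Login.aspx")
                .setLoginFormId("ctl01")
                .setLoginRedirect(true);
        System.out.println(c.loginFormId + " " + c.loginRedirect);
    }
}
```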
HttpFormAuthentication 
In the login method, it first calls CookieHandler.setDefault(new CookieManager()); so that if login succeeds, subsequent requests will not require logging in again.

Then it sends an HTTP GET request to the loginUrl and uses Jsoup.parse(pageContent) to parse the response. It iterates over all input fields in the login form, adding each field name and value to the List params, sets the values of the username and password fields stored in loginPostData, and removes any form fields listed in removedFormFields. Finally it sends a POST request to the loginUrl with the List params as data.

The following code uses Apache HttpClient 3.x, as Nutch2 still uses this rather old HttpClient library.
package org.apache.nutch.protocol.httpclient;

public class HttpFormAuthentication {
 private static final Logger LOGGER = LoggerFactory
   .getLogger(HttpFormAuthentication.class);
 private static Map<String, String> defaultLoginHeaders = new HashMap<String, String>();
 static {
  defaultLoginHeaders.put("User-Agent", "Mozilla/5.0");
  defaultLoginHeaders
    .put("Accept",
      "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
  defaultLoginHeaders.put("Accept-Language", "en-US,en;q=0.5");
  defaultLoginHeaders.put("Connection", "keep-alive");
  defaultLoginHeaders.put("Content-Type",
    "application/x-www-form-urlencoded");
 }

 private HttpClient client;
 private HttpFormAuthConfigurer authConfigurer = new HttpFormAuthConfigurer();
 private String cookies;

 public HttpFormAuthentication(HttpFormAuthConfigurer authConfigurer) {
  this.authConfigurer = authConfigurer;
  this.client = new HttpClient();
 }
 public HttpFormAuthentication(HttpFormAuthConfigurer authConfigurer,
   HttpClient client, Http http) {
  this.authConfigurer = authConfigurer;
  this.client = client;
  defaultLoginHeaders.put("Accept", http.getAccept());
  defaultLoginHeaders.put("Accept-Language", http.getAcceptLanguage());
  defaultLoginHeaders.put("User-Agent", http.getUserAgent());
 }
 public void login() throws Exception {
  // make sure cookie handling is turned on
  CookieHandler.setDefault(new CookieManager());
  String pageContent = httpGetPageContent(authConfigurer.getLoginUrl());
  List<NameValuePair> params = getLoginFormParams(pageContent);
  sendPost(authConfigurer.getLoginUrl(), params);
 }

 private void sendPost(String url, List<NameValuePair> params)
   throws Exception {
  PostMethod post = null;
  try {
   if (authConfigurer.isLoginRedirect()) {
    post = new PostMethod(url) {
     @Override
     public boolean getFollowRedirects() {
      return true;
     }
    };
   } else {
    post = new PostMethod(url);
   }
   // we can't use post.setFollowRedirects(true) as it will throw
   // IllegalArgumentException:
   // Entity enclosing requests cannot be redirected without user
   // intervention
   setLoginHeader(post);
   post.addParameters(params.toArray(new NameValuePair[0]));
   // post.setEntity(new UrlEncodedFormEntity(postParams));

   int rspCode = client.executeMethod(post);
   if (LOGGER.isDebugEnabled()) {
    LOGGER.info("rspCode: " + rspCode);
    LOGGER.info("\nSending 'POST' request to URL : " + url);

    LOGGER.info("Post parameters : " + params);
    LOGGER.info("Response Code : " + rspCode);

    for (Header header : post.getRequestHeaders()) {
     LOGGER.info("Request headers : " + header);
    }
   }
   String rst = IOUtils.toString(post.getResponseBodyAsStream());
   LOGGER.debug("login post result: " + rst);
  } finally {
   if (post != null) {
    post.releaseConnection();
   }
  }
 }

 private void setLoginHeader(PostMethod post) {
  Map<String, String> headers = new HashMap<String, String>();
  headers.putAll(defaultLoginHeaders);
  // additionalPostHeaders can overwrite value in defaultLoginHeaders
  headers.putAll(authConfigurer.getAdditionalPostHeaders());
  for (Entry<String, String> entry : headers.entrySet()) {
   post.addRequestHeader(entry.getKey(), entry.getValue());
  }
  post.addRequestHeader("Cookie", getCookies());
 }

 private String httpGetPageContent(String url) throws IOException {

  GetMethod get = new GetMethod(url);
  try {
   for (Entry<String, String> entry : authConfigurer
     .getAdditionalPostHeaders().entrySet()) {
    get.addRequestHeader(entry.getKey(), entry.getValue());
   }
   client.executeMethod(get);
      
   Header cookieHeader = get.getResponseHeader("Set-Cookie");
   if (cookieHeader != null) {
    setCookies(cookieHeader.getValue());
   }
   return IOUtils.toString(get.getResponseBodyAsStream());
  } finally {
   get.releaseConnection();
  }
 }

 private List<NameValuePair> getLoginFormParams(String pageContent)
   throws UnsupportedEncodingException {
  List<NameValuePair> params = new ArrayList<NameValuePair>();
  Document doc = Jsoup.parse(pageContent);
  Element loginform = doc.getElementById(authConfigurer.getLoginFormId());
  if (loginform == null) {
   throw new IllegalArgumentException("No form exists: "
     + authConfigurer.getLoginFormId());
  }
  Elements inputElements = loginform.getElementsByTag("input");

  // skip fields in removedFormFields or loginPostData
  for (Element inputElement : inputElements) {
   String key = inputElement.attr("name");
   String value = inputElement.attr("value");
   if (authConfigurer.getLoginPostData().containsKey(key)
     || authConfigurer.getRemovedFormFields().contains(key)) {
    continue;
   }
   params.add(new NameValuePair(key, value));
  }
  // add key and value in loginPostData
  for (Entry<String, String> entry : authConfigurer.getLoginPostData()
    .entrySet()) {
   params.add(new NameValuePair(entry.getKey(), entry.getValue()));
  }
  return params;
 }
}
Http Form Authentication in Apache Http Client 4.x
public class HttpClientFormLoginExample {
  private static final Logger LOGGER = LoggerFactory
      .getLogger(HttpClientFormLoginExample.class);
  private DefaultHttpClient client = new DefaultHttpClient();
  private String loginUrl, loginForm;  
  private static Map<String,String> defaultLoginHeaders = new HashMap<String,String>();  
  static {
    defaultLoginHeaders.put("User-Agent", "Mozilla/5.0");
    defaultLoginHeaders.put("Accept",
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    defaultLoginHeaders.put("Accept-Language", "en-US,en;q=0.5");
    defaultLoginHeaders.put("Connection", "keep-alive");
    // defaultLoginHeaders.put("Referer",
    // "https://accounts.google.com/ServiceLoginAuth");
    defaultLoginHeaders
        .put("Content-Type", "application/x-www-form-urlencoded");
  }
  private Map<String,String> loginPostData;
  private Map<String,String> additionalPostHeaders;
  private Set<String> removedFormFields;
  private String cookies;
  
  public HttpClientFormLoginExample(String loginUrl, String loginForm,
      Map<String,String> loginPostData,
      Map<String,String> additionalPostHeaders, Set<String> removedFormFields) {
    this.loginUrl = loginUrl;
    this.loginForm = loginForm;
    this.loginPostData = loginPostData == null ? new HashMap<String,String>()
        : loginPostData;
    this.additionalPostHeaders = additionalPostHeaders == null ? new HashMap<String,String>()
        : additionalPostHeaders;
    this.removedFormFields = removedFormFields == null ? new HashSet<String>()
        : removedFormFields;
  }
    
  public void login() throws Exception {
    client.setRedirectStrategy(new LaxRedirectStrategy());
    // make sure cookie handling is turned on
    CookieHandler.setDefault(new CookieManager());
    String pageContent = httpGetPageContent(loginUrl);
    List<NameValuePair> postParams = getLoginFormParams(pageContent);
    sendPost(loginUrl, postParams);
  }
  
  private void sendPost(String url, List<NameValuePair> postParams)
      throws Exception {
    HttpPost post = new HttpPost(url);
    try {
      setLoginHeader(post);
      post.setEntity(new UrlEncodedFormEntity(postParams));      
      HttpResponse response = client.execute(post);      
      int responseCode = response.getStatusLine().getStatusCode();
      if (LOGGER.isDebugEnabled()) {
        LOGGER.info("rspCode: " + responseCode);
        LOGGER.info("\nSending 'POST' request to URL : " + url);
        LOGGER.info("Post parameters : " + postParams);
        for (Header header : response.getAllHeaders()) {
          LOGGER.info("Response headers : " + header);
        }
      }
      String rst = IOUtils.toString(response.getEntity().getContent());
      LOGGER.debug("login post result: " + rst);
    } finally {
      post.releaseConnection();
    }
  }
  
  private void setLoginHeader(HttpPost post) {
    Map<String,String> headers = new HashMap<String,String>();
    headers.putAll(defaultLoginHeaders);
    // additionalPostHeaders can overwrite value in defaultLoginHeaders
    headers.putAll(additionalPostHeaders);
    for (Entry<String,String> entry : headers.entrySet()) {
      post.setHeader(entry.getKey(), entry.getValue());
    }
    post.setHeader("Cookie", getCookies());
  }
  
  private String httpGetPageContent(String url) throws IOException {    
    HttpGet get = new HttpGet(url);
    try {
      for (Entry<String,String> entry : additionalPostHeaders.entrySet()) {
        get.setHeader(entry.getKey(), entry.getValue());
      }
      HttpResponse response = client.execute(get);
      setCookies(response.getFirstHeader("Set-Cookie") == null ? "" : response
          .getFirstHeader("Set-Cookie").toString());
      return IOUtils.toString(response.getEntity().getContent());
    } finally {
      get.releaseConnection();
    }    
  }
  
  private List<NameValuePair> getLoginFormParams(String pageContent)
      throws UnsupportedEncodingException {
    Document doc = Jsoup.parse(pageContent);
    List<NameValuePair> paramList = new ArrayList<NameValuePair>();
    Element loginform = doc.getElementById(loginForm);
    if (loginform == null) {
      throw new IllegalArgumentException("No form exists: " + loginForm);
    }
    Elements inputElements = loginform.getElementsByTag("input");
    // skip fields in removedFormFields or loginPostData
    for (Element inputElement : inputElements) {
      String key = inputElement.attr("name");
      String value = inputElement.attr("value");
      if (loginPostData.containsKey(key) || removedFormFields.contains(key)) {
        continue;
      }
      paramList.add(new BasicNameValuePair(key, value));
    }
    // add key and value in loginPostData
    for (Entry<String,String> entry : loginPostData.entrySet()) {
      paramList.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
    }
    return paramList;
  }
}
Resources
Cookie Handling in Java SE 6
Apache HttpClient – Automate login Google

Run Commands Faster in PowerShell

In Linux, we can use ! to execute commands faster: !! or !-1 or the up arrow executes the last command, and !prefix runs the last command that starts with a specific word.

We can do the same thing in PowerShell.

Get-History: alias h
Invoke-History: alias r
Call r to execute the last command.
Call r prefix to execute the last command that starts with a specific word:
r ant               Run the last ant command.
r "git push"        Run the last git push command. Notice that if there is a space in the prefix, we have to put it in double quotes.

Use Get-History to show the ids of commands, then run:
r <id> (for example, r 3)
The Invoke-History cmdlet accepts only a single id; if we want to run multiple commands, run r 3; r 5
The Last Command: $^

Run Multiple PowerShell Consoles in Tab Mode
Use ConEmu to run multiple PowerShell consoles in tabs.
Another option is Console2.

Resources
ConEmu - The Windows Terminal/Console/Prompt we've been waiting for?

Http Proxy Setting In HttpURLConnection and Apache HTTP Client

During development, we often need to use Fiddler to monitor and debug requests and responses. This article introduces how to set a proxy in code or on the command line so we can use Fiddler as a proxy.

Set Proxy When Using HttpURLConnection
If we are using Java's HttpURLConnection, we can set the following system properties in test code:
System.setProperty("http.proxyHost", "localhost");
System.setProperty("http.proxyPort", "8888");
or set them as JVM parameters in command line:
-Dhttp.proxyHost=localhost -Dhttp.proxyPort=8888

Set Proxy in Code When Using Apache HTTP Client 4.x
HttpHost proxy = new HttpHost("127.0.0.1", 8888, "http");
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY, proxy);
Set Proxy When Using Apache HTTP Client 3.x
HttpClient client = new HttpClient();
client.getHostConfiguration().setProxy("127.0.0.1", 8888);

Set Proxy on the Command Line When Using Apache HTTP Client 4.2 or Newer
If we are using Apache HTTP Client 4.2 or newer, we can use SystemDefaultHttpClient, which honors JSSE and networking system properties such as http.proxyHost and http.proxyPort.

How is this implemented in SystemDefaultHttpClient?
SystemDefaultHttpClient uses ProxySelector.getDefault(), which returns a DefaultProxySelector; DefaultProxySelector uses NetProperties to read the system properties.
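We can see the same mechanism with the plain JDK ProxySelector that SystemDefaultHttpClient delegates to; this small sketch uses only standard library classes:

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ProxySelectorDemo {
    public static void main(String[] args) throws Exception {
        // The default selector reads http.proxyHost / http.proxyPort,
        // which is why SystemDefaultHttpClient honors them.
        System.setProperty("http.proxyHost", "127.0.0.1");
        System.setProperty("http.proxyPort", "8888");
        List<Proxy> proxies = ProxySelector.getDefault()
                .select(new URI("http://example.com/"));
        // Prints the proxy type chosen for this URI.
        System.out.println(proxies.get(0).type());
    }
}
```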

Set Proxy on the Command Line When Using Apache HTTP Client 3.x
If we are using Apache HTTP Client 3.x, we can read the system properties http.proxyHost and http.proxyPort ourselves; if they are not empty, set the proxy.
String proxyHost = System.getProperty("http.proxyHost");
String proxyPort = System.getProperty("http.proxyPort");

if (StringUtils.isNotBlank(proxyHost)
  && StringUtils.isNotBlank(proxyPort)) {
 client.getHostConfiguration().setProxy(proxyHost,
   Integer.parseInt(proxyPort));
}
We can use similar logic to set the proxy when using older Apache HTTP Client 4.x versions.

Resources
Java Networking and Proxies

C# Parse Negative Number When Using Double.Parse(String, NumberStyles)

The Problem
Our C# application sends a query to a Solr server, parses the response, and generates a graph report. Today the application threw the error "Input string was not in a correct format." in one test environment.

At first we thought it was due to the language and region settings, as described in the post C# Parsing is Regional(Culture) Sensitive.

But we found that in the customer's environment it doesn't always fail, only in some rare cases.
The Analysis
We re-executed the Solr stats query and found some unexpected numbers in the response: the min value of the stats query was negative. This should not happen in normal cases.

The real problem is where these negative values come from; we should reject invalid values when pushing data to Solr, and we will fix that separately.

But why does it fail when C# parses the response? The same code is used to parse all double values from the Solr response, which may contain negative values.

We checked the code and ran it with the negative number:

string value = "-10.01";
double dvalue = double.Parse(value, System.Globalization.NumberStyles.AllowExponent | System.Globalization.NumberStyles.AllowDecimalPoint);
Console.WriteLine(dvalue);
It failed. Now it's clear that this is caused by Double.Parse(String, NumberStyles).
From Double.Parse Method (String, NumberStyles)
Converts the string representation of a number in a specified style to its double-precision floating-point number equivalent.

As we only specify AllowExponent and AllowDecimalPoint, the sign symbol is disallowed: the parse accepts only non-negative values and throws an exception when given a negative value. This code should be updated to use Double.Parse(String), or to also pass NumberStyles.AllowLeadingSign.

The Double.Parse Method allows string in format: [ws][sign][integral-digits[,]]integral-digits[.[fractional-digits]][E[sign]exponential-digits][ws]

Resources
C# Parsing is Regional(Culture) Sensitive
Solr: Extend StatsComponent to Support stats.query, stats.facet and facet.topn

PowerShell: Working with CSV Files

Background
When importing a CSV file to Solr, the import may fail because the CSV is incorrectly formatted: mostly this is related to double quotes in column values, or a row not having enough columns.

When this happens, we may have to dig into the CSV files. PowerShell is a great tool for this.
Task: Get Line Number of the CSV Record
When Solr fails to import a CSV file, it may report the following error:
SEVERE: Import csv1.csv failed: org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/csv1.csv, line=134370,expected 19 values but got 17
                values={field_values_in_this_row}
Solr says the error happened at line 134370, but if we use Get-Content csv1.csv | Select-Object -index 134370, we may find the content at that line is totally different. This is because if there are multiline records in the CSV file, the reported line number will not match the record number:
  /**
   * ATTENTION: in case your csv has multiline-values the returned
   *            number does not correspond to the record-number
   * 
   * @return  current line number
   */
  public int org.apache.solr.internal.csv.CSVParser.getLineNumber() {
    return in.getLineNumber();  
  }

To get the correct line of the CSV record, use the following PowerShell command:
select-string -pattern 'field_values_in_this_row' csv1.csv | select Line,LineNumber
Line                                                                                              LineNumber
----                                                                                               ----------
field_values_in_this_row                                                                134378
Task: Get the Record Number of a CSV File
Users want to know whether all records were imported to Solr. To do this, we need to count all non-empty records in the CSV file. The line count of the file is not useful, as there may be empty lines or multi-line records in the CSV file.

We can use the following PowerShell command; the Where-Object excludes empty records:
(Import-Csv csv1.csv | Where-Object { ($_.PSObject.Properties | ForEach-Object {$_.Value}) -ne $null} | Measure-Object).count

The previous command is slow. If we are sure there are no empty records (lines) in the CSV file, we can use the following command:
(Import-Csv .\csv1.csv | Measure-Object).count

Other CSV-Related PowerShell Commands
Select fields from CSV file:
Import-Csv csv1.csv | select f1,f2 | Export-Csv -Path csv2.csv –NoTypeInformation
Add new fields into CSV file:
Import-CSV csv1.csv | Select @{Name="Surname";Expression={$_."Last Name"}}, @{Name="GivenName";Expression={$_."First Name"}} | Export-Csv -Path csv2.csv –NoTypeInformation
Import-Csv .\1.txt | select-object id | sort id –Unique | Measure-Object
Resources
Import CSV that Contains Double-Quotes into Solr
Improve Solr CSVParser to Log Invalid Characters

Part2: Run Time-Consuming Solr Query Faster: Use Guava CacheBuilder to Cache Response

The Problem
In our web application, the very first request to the Solr server is a stats query. When there are more than 50 million documents, the first stats query may take one, two, or more minutes, as Solr needs to load millions of documents and terms.

Subsequent stats queries run faster as Solr loads data into its caches, but they still take 5 to 15 or more seconds, as a stats query is a compute-intensive task and there is a lot of data.

We need to make it run faster so the web GUI is more responsive.
Main Steps
1. Automatically run queries X minutes after startup or after a commit, once there are no further updates, to make the first stats query run faster.
2. Use Guava CacheBuilder to cache the Solr response.
This article describes the second step.

Task: Use Guava CacheBuilder to Cache Solr Response
We would like to store the responses of time-consuming requests in a cache, so later requests will be much faster.

The Implementation
CacheManager
CacheManager is the key class in the implementation. The key of the outer ConcurrentHashMap is the SolrCore; its value is another ConcurrentHashMap, whose key is the cacheType (such as solr request) and whose value is a Guava Cache.

By default the cache is created as CacheBuilder.newBuilder().concurrencyLevel(16).expireAfterAccess(10, TimeUnit.MINUTES).softValues().recordStats().build(). We can specify the parameter -DcacheSpec=concurrencyLevel=10,expireAfterAccess=5m,softValues to build a different kind of cache.

It adds responses to the cache asynchronously.
public class CacheManager implements CacheStatsOpMXBean {
  protected static final Logger logger = LoggerFactory
      .getLogger(CacheManager.class);
  public static final String CACHE_TAG_SOLR_REQUEST = "CACHE_TAG_SOLR_REQUEST";
  @SuppressWarnings("rawtypes")
  private ConcurrentHashMap<SolrCore,ConcurrentHashMap<String,Cache>> cacheMap = new ConcurrentHashMap<SolrCore,ConcurrentHashMap<String,Cache>>();
  
  private static volatile CacheManager instance = null;
  private ExecutorService executors;
  
  private static String cacheSpec;
  
  private CacheManager() {
    cacheSpec = System.getProperty("cacheSpec");
    executors = Executors.newCachedThreadPool();
  }
  
  public static CacheManager getInstance() {
    if (instance == null) {
      synchronized (CacheManager.class) {
        if (instance == null) {
          instance = new CacheManager();
        }
      }
    }
    return instance;
  }
  
  private <K,V> Cache<K,V> newCache() {
    Cache<K,V> result = null;
    if (StringUtils.isNotBlank(cacheSpec)) {
      try {
        result = CacheBuilder.from(cacheSpec).build();
      } catch (Exception e) {
        logger.error("Invalid cacheSpec: " + cacheSpec, e);
      }
    }
    if (result == null) {
      // default cache
      result = CacheBuilder.newBuilder().concurrencyLevel(16)
          .expireAfterAccess(10, TimeUnit.MINUTES).softValues()
          .recordStats().build();
    }
    return result;
  }
  
  public <K,V> Cache<K,V> getCache(SolrCore core, String cacheTag) {
    cacheMap.putIfAbsent(core, new ConcurrentHashMap<String,Cache>());
    ConcurrentHashMap<String,Cache> coreCache = cacheMap.get(core);
    coreCache.putIfAbsent(cacheTag, newCache());
    return coreCache.get(cacheTag);
  }
  
  public void invalidateAll(SolrCore core) {
    ConcurrentHashMap<String,Cache> coreCache = cacheMap.get(core);
    if (coreCache != null) {
      for (Cache cache : coreCache.values()) {
        cache.invalidateAll();
      }
    }
  }

  public void addToCache(final SolrCore core, final String cacheTag,
      final CacheKeySolrQueryRequest cacheKey, final Object rspObj) {
    executors.submit(new Runnable() {
      @Override
      public void run() {
        Cache<CacheKeySolrQueryRequest,Object> cache = CacheManager
            .getInstance().getCache(core, cacheTag);
        cache.put(cacheKey, rspObj);
      }
    });
  }
}
CacheKeySolrQueryRequest
We can't use SolrQueryRequest as the key of the Guava cache, because it doesn't implement the hashCode and equals methods: the hashCode would differ for different request objects with the same Solr query, and equals would return false.
So we extract the params map (Map<String,String[]>) from SolrQueryRequest and implement hashCode and equals ourselves. The order of entries in the map and of values in the String[] arrays doesn't matter.

We could also use deepHashCode and deepEquals from the java-util library.
public class CacheKeySolrQueryRequest implements Serializable {
  
  private static final long serialVersionUID = 1L;
  Map<String,String[]> paramsMap;
  String url;
  
  private CacheKeySolrQueryRequest(SolrQueryRequest request) {
    this.paramsMap = SolrParams.toMultiMap(request.getParams().toNamedList());
    // remove unimportant params
    paramsMap.remove(CommonParams.TIME_ALLOWED);
    if (request.getContext().get("url") != null) {
      this.url = request.getContext().get("url").toString();
    }
  }
  
  public static CacheKeySolrQueryRequest create(SolrQueryRequest request) {
    CacheKeySolrQueryRequest result = null;
    if ((request.getContentStreams() == null || !request.getContentStreams()
        .iterator().hasNext())) {
      result = new CacheKeySolrQueryRequest(request);
    }
    return result;    
  }

  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((url == null) ? 0 : url.hashCode());
    // the order in the map and in the value arrays doesn't matter:
    // sum the per-entry hashes so iteration order has no effect
    if (paramsMap != null) {
      int mapHashCode = 0;
      for (Entry<String,String[]> entry : paramsMap.entrySet()) {
        int entryHashCode = (entry.getKey() == null ? 0 : entry.getKey()
            .hashCode());
        int valuesHashCode = 0;
        for (String value : entry.getValue()) {
          valuesHashCode += (value == null ? 0 : value.hashCode());
        }
        mapHashCode += prime * entryHashCode + valuesHashCode;
      }
      result = prime * result + mapHashCode;
    }
    return result;
  }

  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (getClass() != obj.getClass()) return false;
    CacheKeySolrQueryRequest other = (CacheKeySolrQueryRequest) obj;
    if (url == null) {
      if (other.url != null) return false;
    } else if (!url.equals(other.url)) return false;
    
    if (paramsMap == null) {
      if (other.paramsMap != null) return false;
    } else {
      if (paramsMap.size() != other.paramsMap.size()) return false;
      
      for (Entry<String,String[]> entry : paramsMap.entrySet()) {
        String[] thisValues = entry.getValue();
        String[] otherValues = other.paramsMap.get(entry.getKey());
        if (!haveSameElements(thisValues, otherValues)) return false;
      }
    }
    return true;
  }
  
  // helper class, so we don't have to do a whole lot of autoboxing
  private static class Count {
    public int count = 0;
  }
  // from: http://stackoverflow.com/questions/13501142/java-arraylist-how-can-i-tell-if-two-lists-are-equal-order-not-mattering
  public boolean haveSameElements(String[] list1, String[] list2) {
    if (list1 == list2) return true;
    if (list1 == null || list2 == null || list1.length != list2.length) return false;
    HashMap<String,Count> counts = new HashMap<String,Count>();

    for (String item : list1) {
      if (!counts.containsKey(item)) counts.put(item, new Count());
      counts.get(item).count += 1;
    }
    for (String item : list2) {
      // If the map doesn't contain the item here, then this item wasn't in
      // list1
      if (!counts.containsKey(item)) return false;
      counts.get(item).count -= 1;
    }
    for (Map.Entry<String,Count> entry : counts.entrySet()) {
      if (entry.getValue().count != 0) return false;
    }
    return true;
  }  
}
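The order-insensitive comparison done by haveSameElements can be exercised on its own. Here is a minimal stdlib sketch of the same counting idea (the class name is hypothetical, for illustration only):

```java
import java.util.HashMap;
import java.util.Map;

public class SameElementsDemo {
  // true if the two arrays contain the same elements with the same
  // multiplicities, regardless of order (a multiset comparison)
  public static boolean haveSameElements(String[] a, String[] b) {
    if (a == b) return true;
    if (a == null || b == null || a.length != b.length) return false;
    Map<String, Integer> counts = new HashMap<>();
    for (String s : a) counts.merge(s, 1, Integer::sum);
    for (String s : b) {
      Integer c = counts.get(s);
      if (c == null) return false;              // element not present in a
      if (c == 1) counts.remove(s); else counts.put(s, c - 1);
    }
    return counts.isEmpty();
  }

  public static void main(String[] args) {
    System.out.println(haveSameElements(
        new String[] {"a", "b", "b"}, new String[] {"b", "a", "b"})); // true
    System.out.println(haveSameElements(
        new String[] {"a", "b"}, new String[] {"a", "a"}));           // false
  }
}
```

This is also why hashCode must be order-insensitive over the arrays: two keys that compare equal this way must hash identically.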
ResponseCachedSearchHandler
If useCache is true, ResponseCachedSearchHandler first tries to load the response from the cache; if the response is already cached, it returns it directly. If this is the first time the request is executed, it runs the request and, if the execution time is longer than minExecuteTime, puts the response into the cache. By default minExecuteTime is -1, meaning we always cache the response.
We can change the value of minExecuteTime so Solr only caches the response if the request takes more than the specified minimum time.
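The minExecuteTime policy itself is independent of Solr. Here is a minimal stdlib sketch of "only cache results that were expensive to compute" (the class name is hypothetical; the real handler below times the request with Guava's Stopwatch):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class MinExecuteTimeCache {
  private final Map<String, Object> cache = new ConcurrentHashMap<>();
  private final long minExecuteTimeMillis;

  // minExecuteTimeMillis = -1 means: always cache
  public MinExecuteTimeCache(long minExecuteTimeMillis) {
    this.minExecuteTimeMillis = minExecuteTimeMillis;
  }

  public Object get(String key, Supplier<Object> compute) {
    Object cached = cache.get(key);
    if (cached != null) {
      return cached;                      // cache hit: skip the expensive call
    }
    long start = System.nanoTime();
    Object result = compute.get();
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    // only keep responses that took longer than the threshold to produce
    if (result != null && elapsedMs > minExecuteTimeMillis) {
      cache.put(key, result);
    }
    return result;
  }
}
```

With the default threshold of -1, every response is cached, matching the behavior described above.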

Before returning a cached response, we have to call oldRsp.setReturnFields(new SolrReturnFields(oldReq)); this sets which fields to return based on the fl parameter of the request. Otherwise Solr would return all fields, as no return fields would be set.

A subclass can extend ResponseCachedSearchHandler: override the isUseCache() method to determine whether Solr should cache the response, and override beforeReturnFromCache to do something before the cached response is returned.
public class ResponseCachedSearchHandler extends SearchHandler {  
  protected static final String PARAM_USE_CACHE = "useCache",
      PARAM_MIN_EXECUTE_TIME = "minExecuteTime";
  
  protected boolean defUseCache = false;
  protected int defMinExecuteTime = -1;
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      defUseCache = defaults.getBool(PARAM_USE_CACHE, false);
      defMinExecuteTime = defaults.getInt(PARAM_MIN_EXECUTE_TIME, -1);
    }
  }
  
  public void handleRequestBody(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) throws Exception {
    
    boolean useCache = isUseCache(oldReq);
    CacheKeySolrQueryRequest cacheKey = null;
    if (useCache) {
      Cache<CacheKeySolrQueryRequest,Object> cache = CacheManager
          .getInstance().getCache(oldReq.getCore(),
              CacheManager.CACHE_TAG_SOLR_REQUEST);
      
      cacheKey = CacheKeySolrQueryRequest.create(oldReq);
      if (cacheKey != null) {
        Object cachedRsp = cache.getIfPresent(cacheKey);
        if (cachedRsp != null) {
          NamedList<Object> valuesNL = oldRsp.getValues();
          valuesNL.add("response", cachedRsp);
          // SolrReturnFields defines which fields to return.
          oldRsp.setReturnFields(new SolrReturnFields(oldReq));
          beforeReturnFromCache(oldReq, oldRsp);
          return;
        }
      }
    }
    Stopwatch stopwatch = new Stopwatch().start();
    executeRequest(oldReq, oldRsp);
    long executeTime = stopwatch.elapsedTime(TimeUnit.MILLISECONDS);
    stopwatch.stop();
    beforeReturnNoCache(oldReq, oldRsp);
    addRspToCache(oldReq, oldRsp, useCache, cacheKey, executeTime);
  }
  
  protected void addRspToCache(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp, boolean useCache,
      CacheKeySolrQueryRequest cacheKey, long executeTime) {
    long minExecuteTime = oldReq.getParams().getInt(PARAM_MIN_EXECUTE_TIME,
        defMinExecuteTime);
    if (useCache && cacheKey != null && executeTime > minExecuteTime) {
      NamedList<Object> valuesNL = oldRsp.getValues();
      Object rspObj = (Object) valuesNL.get("response");
      CacheManager.getInstance().addToCache(oldReq.getCore(),
          CacheManager.CACHE_TAG_SOLR_REQUEST, cacheKey, rspObj);      
    }
  }
  
  /**
   * SubClass can extend this to check whether the request is stats query etc.
   */
  protected boolean isUseCache(SolrQueryRequest oldReq) {
    return oldReq.getParams().getBool(PARAM_USE_CACHE, defUseCache);
  }
  
  protected void beforeReturnNoCache(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) {}

  protected void beforeReturnFromCache(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) {}
      
  /**
   * by default, call searchHander.executeRequest
   */
  protected void executeRequest(SolrQueryRequest oldReq,
      SolrQueryResponse oldRsp) throws Exception {
    super.handleRequestBody(oldReq, oldRsp);
  }
}
CacheStatsFacetRequestHandler
CacheStatsFacetRequestHandler extends ResponseCachedSearchHandler so that Solr only stores the responses of stats and facet requests. We change the default requestHandler to use CacheStatsFacetRequestHandler.
<requestHandler name="/select" class="CacheStatsFacetRequestHandler" default="true">
    <!-- omitted -->
  </requestHandler>
public class CacheStatsFacetRequestHandler extends ResponseCachedSearchHandler {
  protected boolean isUseCache(SolrQueryRequest oldReq) {
    boolean useCache = super.isUseCache(oldReq);
    if (useCache) {
      SolrParams params = oldReq.getParams();
      useCache = params.getBool(StatsParams.STATS, false)
          || params.getBool(FacetParams.FACET, false);
    }
    return useCache;
  }
}
InvalidateCacheProcessorFactory
We need to invalidate the caches after a Solr commit, so we add InvalidateCacheProcessorFactory to the default processor chain and to every other updateRequestProcessorChain.
<updateRequestProcessorChain name="defaultChain" default="true">
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
    <processor class="InvalidateCacheProcessorFactory" />
    <processor
        class="AutoRunQueriesProcessorFactory"/>      
  </updateRequestProcessorChain>
public class InvalidateCacheProcessorFactory extends
    UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new InvalidateCacheProcessor(next);
  }  
  private static class InvalidateCacheProcessor extends
      UpdateRequestProcessor {    
    public InvalidateCacheProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    public void processCommit(CommitUpdateCommand cmd) throws IOException {
      super.processCommit(cmd);
      CacheManager.getInstance().invalidateAll(cmd.getReq().getCore());
    }
  }
}

Part1: Run Time-Consuming Solr Query Faster: Auto Run Queries X Minutes after Startup and Commit

The Problem
In our web application, the very first request to the Solr server is a stats query. When there are more than 50 million documents, the first stats query may take one, two or more minutes, as it needs to load millions of documents and terms into Solr.
Subsequent stats queries run faster because Solr has loaded the data into its caches, but they still take 5 to 10 or more seconds, as the stats query is a compute-intensive task and there is a lot of data.


We want these stats queries to run faster to make the web GUI more responsive.
Main Steps
1. Make the first stats query run faster. This is described in this article: automatically run queries X minutes after the last update following startup or commit.
2. Make subsequent stats queries run faster.
Task: Make the first stats query run faster
The first stats query is like this: q=*&stats=true&stats.field=szkb&stats.pagination=true&f.szkb.stats.query=*&f.szkb.stats.facet=file_type.
Solr firstSearcher and newSearcher

From Solr wiki:
A firstSearcher event is fired whenever a new searcher is being prepared but there is no current registered searcher to handle requests or to gain autowarming data from (ie: on Solr startup). A newSearcher event is fired whenever a new searcher is being prepared and there is a current searcher handling requests (aka registered).

In our application, we can't use firstSearcher: as there is so much data and multiple cores in one Solr server, startup would be very slow, possibly taking 3 to 5 minutes.
A commit may also take 1 to 2 minutes. Moreover, during the data-push phase the client pushes a lot of data and commits multiple times; we don't want to slow down commits or run the queries after every commit.
Expected Solution
We want to run the defined queries after server startup once there has been no update for 5 minutes, and after a commit once there has been no update for 10 minutes.
This way we don't run these queries too often: we only run them when the data is reasonably stable, i.e. there has been no update within the quiet period.
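The quiet-period idea is essentially a debounce, and it can be sketched with the stdlib scheduler alone (the class name is hypothetical and it uses milliseconds instead of minutes; the real QueryAutoRunner below applies the same pattern per SolrCore):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class QuietPeriodRunner {
  private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1, r -> {
    Thread t = new Thread(r);
    t.setDaemon(true); // don't keep the JVM alive for the scheduler
    return t;
  });
  private final AtomicLong lastUpdateTime = new AtomicLong(System.currentTimeMillis());
  private final long quietMillis;
  private final Runnable task;
  private ScheduledFuture<?> pending;

  public QuietPeriodRunner(long quietMillis, Runnable task) {
    this.quietMillis = quietMillis;
    this.task = task;
  }

  /** Call this on every add/delete/commit. */
  public synchronized void onUpdate() {
    lastUpdateTime.set(System.currentTimeMillis());
    if (pending != null) pending.cancel(false); // restart the quiet-period timer
    pending = scheduler.schedule(this::maybeRun, quietMillis, TimeUnit.MILLISECONDS);
  }

  private synchronized void maybeRun() {
    long idle = System.currentTimeMillis() - lastUpdateTime.get();
    if (idle >= quietMillis) {
      task.run(); // data has been quiet long enough: run the warm-up queries
    } else {
      // an update sneaked in; wait out the remaining quiet time
      pending = scheduler.schedule(this::maybeRun, quietMillis - idle, TimeUnit.MILLISECONDS);
    }
  }
}
```

Every update restarts the timer, so the task fires only once the full quiet period has elapsed with no further updates, no matter how many updates arrived before that.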
The Implementation
QueryAutoRunner
This singleton class maintains the mapping between each SolrCore and its queries, and automatically runs them X minutes after the last update following startup or commit.
public class QueryAutoRunner {
  protected static final Logger logger = LoggerFactory
      .getLogger(QueryAutoRunner.class);
  
  public static final long DEFAULT_RUN_AUTO_QUERIES_AFTER_COMMIT = 1000 * 60 * 10;
  public static final long DEFAULT_RUN_AUTO_QUERIES_AFTER_STARTUP = 1000 * 60 * 2;
  
  public static long RUN_AUTO_QUERIES_AFTER_COMMIT = DEFAULT_RUN_AUTO_QUERIES_AFTER_COMMIT;
  public static long RUN_AUTO_QUERIES_AFTER_STARTUP = DEFAULT_RUN_AUTO_QUERIES_AFTER_STARTUP;
  private ConcurrentHashMap<SolrCore,CoreAutoRunnerState> autoRunQueries = new ConcurrentHashMap<SolrCore,CoreAutoRunnerState>();
  
  private static volatile QueryAutoRunner instance = null;  
  public static QueryAutoRunner getInstance() {
    if (instance == null) {
      synchronized (QueryAutoRunner.class) {
        if (instance == null) {
          instance = new QueryAutoRunner();
        }
      }
    }
    return instance;
  }

  public void scheduleAutoRunnerAfterCommit(SolrCore core) {
    CoreAutoRunnerState autoQueriesState = autoRunQueries.get(core);
    if (autoQueriesState == null) {
      return; // no auto queries configured for this core
    }
    autoQueriesState.setLastUpdateTime(new Date().getTime());
    autoQueriesState.schedule(RUN_AUTO_QUERIES_AFTER_COMMIT,
        RUN_AUTO_QUERIES_AFTER_COMMIT);
  }  
  public void updateLastUpdateTime(SolrCore core) {
    CoreAutoRunnerState autoQueriesState = autoRunQueries.get(core);
    if (autoQueriesState != null) {
      autoQueriesState.setLastUpdateTime(new Date().getTime());
    }
  }
  
  public synchronized void initQueries(SolrCore core, Set<NamedList> queries) {
    CoreAutoRunnerState autoQueriesState = new CoreAutoRunnerState(core,
        queries);
    autoRunQueries.put(core, autoQueriesState);
    // always run auto queries for first start
    autoQueriesState.schedule(RUN_AUTO_QUERIES_AFTER_STARTUP, -1);
  }
  private QueryAutoRunner() {
    String str = System.getProperty("RUN_AUTO_QUERIES_AFTER_COMMIT");
    if (StringUtils.isNotBlank(str)) {
      try {
        RUN_AUTO_QUERIES_AFTER_COMMIT = Long.parseLong(str);
      } catch (Exception e) {
        logger
            .error("RUN_AUTO_QUERIES_AFTER_COMMIT should be a positive number");
      }
    }
    str = System.getProperty("RUN_AUTO_QUERIES_AFTER_STARTUP");
    if (StringUtils.isNotBlank(str)) {
      try {
        RUN_AUTO_QUERIES_AFTER_STARTUP = Long.parseLong(str);
      } catch (Exception e) {
        logger
            .error("RUN_AUTO_QUERIES_AFTER_STARTUP should be a positive number");
      }
    }
  }
  
  private static class CoreAutoRunnerState {
    protected static final Logger logger = LoggerFactory
        .getLogger(CoreAutoRunnerState.class);
    
    private SolrCore core;
    private AtomicLong lastUpdateTime = new AtomicLong();
    private Set<NamedList> paramsSet = new LinkedHashSet<NamedList>();

    private ScheduledFuture pending;
    private final ScheduledExecutorService scheduler = Executors
        .newScheduledThreadPool(1);

    public CoreAutoRunnerState(SolrCore core, Set<NamedList> queries) {
      this.core = core;
      this.paramsSet = queries;
    }
    
    public void schedule(long withIn, long minTimeNoUpdate) {
      // if there is already one scheduled runner whose remaining time less
      // than withIn (almost always), cancel the old one.
      if (pending != null && pending.getDelay(TimeUnit.MILLISECONDS) < withIn) {
        pending.cancel(false);
        pending = null;
      }
      if (pending == null) {
        pending = scheduler.schedule(new AutoQueriesRunner(minTimeNoUpdate),
            withIn, TimeUnit.MILLISECONDS);
        logger.info("Scheduled to run queries in " + withIn);
      }
    }
    
    private class AutoQueriesRunner implements Runnable {
      private long minTimeNoUpdate;
      
      public AutoQueriesRunner(long minTimeNoUpdate) {
        this.minTimeNoUpdate = minTimeNoUpdate;
      }      
      @Override
      public void run() {
        if (minTimeNoUpdate > 0
            && (new Date().getTime() - lastUpdateTime.get()) < minTimeNoUpdate) {
          long remainingTime = minTimeNoUpdate
              - (new Date().getTime() - lastUpdateTime.get());
          if (remainingTime > 1000) {
            // reschedule the auto runner for the remaining quiet time
            pending = scheduler.schedule(
                new AutoQueriesRunner(minTimeNoUpdate), remainingTime,
                TimeUnit.MILLISECONDS);
            return;
          }
        }
        logger.info("Started to execute auto runner for " + core.getName());
        // if there is no update in less than X minutes,
        for (NamedList params : paramsSet) {
          SolrQueryRequest request = null;
          try {
            request = new LocalSolrQueryRequest(core, params);
            
            String qt = request.getParams().get(CommonParams.QT);
            if (StringUtils.isBlank(qt)) {
              qt = "/select";
            }
            request.getContext().put("url", qt);
            core.execute(core.getRequestHandler(request.getParams().get(
                CommonParams.QT)), request, new SolrQueryResponse());
          } catch (Exception e) {
            logger.error("Error happened when running auto query for "
                + core.getName() + ": " + params, e);
          } finally {
            if (request != null) {
              request.close();
            }
          }
        }
        logger.info("Executed auto runner for " + core.getName());
      }
    }
    public CoreAutoRunnerState setLastUpdateTime(long lastUpdateTime) {
      this.lastUpdateTime.set(lastUpdateTime);
      return this;
    }
  }
}
AutoRunQueriesRequestHandler
This request handler is an abstract handler, not meant to be called via HTTP. It is used to define the list of queries that will be run automatically at some point; it also schedules an AutoRunner to run 2 minutes after startup.
Its definition in solrConfig.xml looks like this:
<requestHandler name="/abstracthandler_autorunqueries" class="AutoRunQueriesRequestHandler" >
  <lst name="defaults">
    <arr name="autoRunQueries">
      <lst> 
        <str name="q">*</str>
        <str name="rows">0</str>                 
        <str name="stats">true</str>
        <str name="stats.pagination">true</str>
        <str name="f.szkbround1.stats.query">*</str>
        <str name="stats.field">szkbround1</str>
        <str name="f.szkbround1.stats.facet">ext_name</str>
      </lst>
    </arr>
  </lst>
</requestHandler>
public class AutoRunQueriesRequestHandler extends RequestHandlerBase
    implements SolrCoreAware {  
  private Set<NamedList> paramsSet = new LinkedHashSet<NamedList>();
  private static final String PARAM_AUTO_RUN_QUERIES = "autoRunQueries";
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      NamedList nl = (NamedList) args.get("defaults");
      List<NamedList> allLists = (List<NamedList>) nl
          .get(PARAM_AUTO_RUN_QUERIES);
      if (allLists == null) return;
      for (NamedList nlst : allLists) {
        if (nlst.get("distrib") == null) {
          nlst.add("distrib", false);
        }
        paramsSet.add(nlst);
      }
    }
  }
  public void inform(SolrCore core) {
    if (!paramsSet.isEmpty()) {
      QueryAutoRunner.getInstance().initQueries(core, paramsSet);
    }
  }
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    throw new SolrServerException("Abstract handler, not meant to be called.");
  }
}
AutoRunQueriesProcessorFactory
This processor factory needs to be added to the default processor chain and to every other updateRequestProcessorChain. The InvalidateCacheProcessorFactory shown in the chain below is used to invalidate the Solr response cache; it is described in a separate post.
<updateRequestProcessorChain name="defaultChain" default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
  <processor class="InvalidateCacheProcessorFactory" />
  <processor
   class="AutoRunQueriesProcessorFactory"/>      
</updateRequestProcessorChain>
Its processAdd and processDelete methods update the lastUpdateTime of CoreAutoRunnerState, and its processCommit method schedules an AutoRunner to run in 10 minutes.
public class AutoRunQueriesProcessorFactory extends
    UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new AutoRunQueriesProcessor(next);
  }
  
  private static class AutoRunQueriesProcessor extends UpdateRequestProcessor {
    public AutoRunQueriesProcessor(UpdateRequestProcessor next) {
      super(next);
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      updateLastUpdateTime(cmd);
      super.processAdd(cmd);
    }
    public void processDelete(DeleteUpdateCommand cmd) throws IOException {
      updateLastUpdateTime(cmd);
      super.processDelete(cmd);
    }
    public void processCommit(CommitUpdateCommand cmd) throws IOException {
      super.processCommit(cmd);
      QueryAutoRunner.getInstance().scheduleAutoRunnerAfterCommit(
          cmd.getReq().getCore());
    }
    public void updateLastUpdateTime(UpdateCommand cmd) {
      QueryAutoRunner.getInstance().updateLastUpdateTime(
          cmd.getReq().getCore());
    }
  }
}

PowerShell Tips: Get a Random Sample from CSV File

The Problem

I am trying to write and test an R script against some data from a customer. But the data is too big: it would take a lot of time to load the data and run the script. So it would be better to extract a small sample from the original data.

The Solution
First extract the first line from the original csv file, write to destination file.
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt

Notice that in Windows PowerShell the Out-File cmdlet and the redirection operator >> write Unicode (UTF-16) by default, which may not be what the application reading the file expects. Hence we explicitly use -Encoding utf8 here.

Then we read all lines except the first line: Get-Content big.csv | where {$_.readcount -gt 1 }

Then randomly select 100 lines and append them to the destination file.
Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt

The Complete Script
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt; Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt

Related Script: Get default system encoding
[System.Text.Encoding]::Default
[System.Text.Encoding]::Default.EncodingName

Resource
PSTip: Get-Random
Get-Random Cmdlet

Using Solr DocTransformer to Add Anchor Tag and Text into Response

This series talks about how to use Nutch and Solr to implement Google Search's "Jump to" and Anchor links features. This article introduces how to use a Solr DocTransformer to add the anchor tag and text into the search response.
The Problem
In the search results, to help users jump directly to the sections they may be interested in, we want to add anchor links below the page description, just like Google Search's "Jump to" and anchor links features.
Main Steps
1. Extract anchor tag, text and content in Nutch
Please refer to
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
2. Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
3. Using Solr DocTransformer to Add Anchor Tag and Content into Response
This is described in current article.

Task: Using Solr DocTransformer to Add Anchor Tag and Content into Response
In previous articles, we used Nutch to extract the anchor tag, text and content from web pages, and saved the content into Solr as separate docs with docType 1.

The first thought was to use Solr group feature: &q=keyword&fl=anchorTag,anchorText,anchorContent&group=true&group.field=url_sort&group.limit=6. 

Then we would ignore the main page's url in each group and convert the anchors in the group response into an anchors map: the key is the anchorTag, the value is the anchorText.

But there is one critical issue in this approach: the groups are sorted by the score of the top document within each group. 

A web page may contain an anchor section whose content is small and matches the keyword well (say, score 0.9), while the whole web page is not really related to the keyword (say, score 0.01). Because Solr sorts groups by the score of the top document within each group, this group's score would be 0.9 and it would be listed first. This is unacceptable.

To return tag information for the web page that matches the query, we decide to use Solr DocTransformer to add fields into response.

[Future thought]
We could change Solr's code to run DocTransformers in parallel to improve performance.

AnchorTransformerFactory
DocTransformer is powerful and useful: it allows us to add, remove or update fields before the response is returned. But it has one limitation: by itself it only adds one field, whose name must be [transformer_name].

AnchorTransformer adds two fields, anchorTag and anchorText, into the SolrDocument. If we just use fl=[anchors], the response would not contain these fields; we have to use fl=[anchors],anchorTag,anchorText. Listing anchorTag,anchorText tells Solr to add them to SolrReturnFields. Please refer to the code in SolrReturnFields.add(String, NamedList<String>, DocTransformers, SolrQueryRequest).
public class AnchorTransformerFactory extends TransformerFactory {
  
  private String defaultSort;
  private int defaultAnchorRows = 5;
  private static final String SORT_BY_ORDER = "order";
  protected static Logger logger = LoggerFactory
      .getLogger(AnchorTransformerFactory.class);
  public void init(NamedList args) {
    super.init(args);
    Object obj = args.get("sort");
    if (obj != null) {
      defaultSort = (String) obj;
    }
    obj = args.get("anchorRows");
    if (obj != null) {
      defaultAnchorRows = Integer.parseInt(obj.toString());
    }
  }
  @Override
  public DocTransformer create(String field, SolrParams params,
      SolrQueryRequest req) {
    String sort = defaultSort;
    if (!StringUtils.isBlank(params.get("sort"))) {
      sort = params.get("sort");
    }
    int anchorRows = defaultAnchorRows;
    if (StringUtils.isNotBlank(params.get("anchorRows"))) {
      anchorRows = Integer.parseInt(params.get("anchorRows"));
    }
    return new AnchorTransformer(field, req, sort, anchorRows);
  }
  
  private static class AnchorTransformer extends DocTransformer {
    private SolrQueryRequest req;
    private String sort;
    private int anchorRows;
    
    public AnchorTransformer(String field, SolrQueryRequest req, String sort,
        int anchorRows) {
      this.req = req;
      this.sort = sort;
      this.anchorRows = anchorRows;
    }
    
    @Override
    public void transform(SolrDocument doc, int docid) throws IOException {
      String oldQuery = req.getParams().get(CommonParams.Q);
      Object idObj = doc.getFieldValue("contentid");
      
      // java.lang.RuntimeException: When this is called? obj.type:class
      // org.apache.lucene.document.LazyDocument$LazyField at
      String id;
      if (idObj instanceof org.apache.lucene.document.Field) {
        org.apache.lucene.document.Field field = (Field) idObj;
        id = field.stringValue();
      } else if (idObj instanceof IndexableField) {
        IndexableField field = (IndexableField) idObj;
        id = field.stringValue();
      } else {
        throw new RuntimeException("When this is called? obj.type:"
            + idObj.getClass());
      }
      SolrQuery query = new SolrQuery();
      query
          .setQuery(
              "anchorContent:" + ClientUtils.escapeQueryChars(oldQuery)
                  + " AND url: " + ClientUtils.escapeQueryChars(id))
          .addFilterQuery("docType:1").setRows(anchorRows)
          .setFields("anchorTag", "anchorText");
      if (SORT_BY_ORDER.equals(sort)) {
        query.setSort("anchorOrder", ORDER.asc);
      }
      // else default, sort by score
      List<Map<String,String>> anchorMap = extractSingleFieldValues(
          req.getCore(), "/select", query, "anchorTag", "anchorText");
      for (Map<String,String> map : anchorMap) {
        doc.addField("anchorTag", map.get("anchorTag"));
        doc.addField("anchorText", map.get("anchorText"));
      }
    }
    
  public static List<Map<String,String>> extractSingleFieldValues(
      SolrCore core, String handlerName, SolrQuery query, String... fls)
      throws IOException {
    SolrRequestHandler requestHandler = core.getRequestHandler(handlerName);
    query.setFields(fls);
    SolrQueryRequest newReq = new LocalSolrQueryRequest(core, query);
    try {
      SolrQueryResponse queryRsp = new SolrQueryResponse();
      requestHandler.handleRequest(newReq, queryRsp);
      return extractSingleFieldValues(newReq, queryRsp, fls);
    } finally {
      newReq.close();
    }
  }
  
  @SuppressWarnings("rawtypes")
  public static List<Map<String,String>> extractSingleFieldValues(
      SolrQueryRequest newReq, SolrQueryResponse newRsp, String[] fls)
      throws IOException {
    List<Map<String,String>> rst = new ArrayList<Map<String,String>>();
    NamedList contentIdNL = newRsp.getValues();
    
    Object rspObj = contentIdNL.get("response");
    SolrIndexSearcher searcher = newReq.getSearcher();    
    if (rspObj instanceof ResultContext) {
      ResultContext resultContext = (ResultContext) rspObj;
      DocList doclist = resultContext.docs;
      DocIterator dit = doclist.iterator();
      while (dit.hasNext()) {
        int docid = dit.nextDoc();
        Document doc = searcher.doc(docid, new HashSet<String>());
        Map<String,String> row = new HashMap<String,String>();
        for (String fl : fls) {
          row.put(fl, doc.get(fl));
        }
        rst.add(row);
      }
    } else if (rspObj instanceof SolrDocumentList) {
      SolrDocumentList docList = (SolrDocumentList) rspObj;
      Iterator<SolrDocument> docIt = docList.iterator();
      while (docIt.hasNext()) {
        SolrDocument doc = docIt.next();
        docIt.remove();
        Map<String,String> row = new HashMap<String,String>();
        for (String fl : fls) {
          Object tmp = doc.getFieldValue(fl);
          if (tmp != null) {
            row.put(fl, tmp.toString());
          }
        }
        rst.add(row);
      }
    }
    return rst;
  }    
  } 
}
solrconfig.xml
<transformer name="anchors" class="AnchorTransformerFactory">
  <int name="anchorRows">5</int>
</transformer>
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="fl">otherfields,[anchors],anchorTag,anchorText</str>
  </lst>
</requestHandler>
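With this configuration the [anchors] transformer runs by default for /select requests, so the anchorTag and anchorText fields are appended to each returned document. It can also be requested explicitly through the fl parameter; a hypothetical request (the host, port, and core name below are assumptions, not from the original post) might look like:

```
http://localhost:8983/solr/collection1/select?q=*:*&fl=id,[anchors],anchorTag,anchorText
```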
Resource
Using UpdateRequestProcessor to Store Anchor Tag and Content into Solr
Using Nutch to Extract Anchor Tag and Content
Using HTML Parser Jsoup and Regex to Extract Text between Two Tags
Debugging and Optimizing Regular Expression
