Nutch2 Http Form Authentication-Part3: Integrate Http Form Post Authentication in Nutch2

The Problem
Http form-based authentication is a very commonly used mechanism for protecting web resources.
When crawling, Nutch can authenticate itself to websites using NTLM, Basic, or Digest authentication, but it doesn't support Http Post Form Authentication.

This series of articles talks about how to extend Nutch2 to support Http Post Form Authentication.
Main Steps
Use Apache Http Client to do http post form authentication.
Make http post form authentication work.
Integrate http form authentication in Nutch2.

With the previous two steps done, we can now integrate http form authentication into Nutch2.
Define Http Form Post Authentication Properties in httpclient-auth.xml
First, in nutch-site.xml, change plugin.includes to use the protocol-httpclient plugin instead of the default protocol-http.

Nutch uses the http.auth.file property to locate the XML file that defines credentials; its default value is httpclient-auth.xml. We extend httpclient-auth.xml to include the http form authentication properties. For the asp.net web application from the last post, httpclient-auth.xml looks like this:

<?xml version="1.0"?>
<auth-configuration>
  <credentials authMethod="formAuth" loginUrl="http://localhost:44444/Account/Login.aspx" loginFormId="ctl01" loginRedirect="true">
    <loginPostData>
      <field name="ctl00$MainContent$LoginUser$UserName" value="admin"/>
      <field name="ctl00$MainContent$LoginUser$Password" value="admin123"/>
    </loginPostData>
    <removedFormFields>
      <field name="ctl00$MainContent$LoginUser$RememberMe"/>
    </removedFormFields>
  </credentials>
</auth-configuration>
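Before starting a crawl it can help to confirm that the file parses and that the login fields are what you expect. The following is a minimal standalone sketch, not part of Nutch: the class name FormAuthConfigCheck and the method readLoginPostData are hypothetical, and it only uses the JDK's DOM parser against the element and attribute names shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper (not a Nutch class): sanity-checks a formAuth config.
final class FormAuthConfigCheck {

    /**
     * Returns the name/value pairs under <loginPostData> of the first
     * <credentials authMethod="formAuth"> element, in document order.
     */
    static Map<String, String> readLoginPostData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList credentials = doc.getElementsByTagName("credentials");
            for (int i = 0; i < credentials.getLength(); i++) {
                Element cred = (Element) credentials.item(i);
                if (!"formAuth".equals(cred.getAttribute("authMethod"))) {
                    continue;
                }
                // loginUrl and loginFormId are mandatory in the reader shown in this post.
                if (cred.getAttribute("loginUrl").trim().isEmpty()
                        || cred.getAttribute("loginFormId").trim().isEmpty()) {
                    throw new IllegalArgumentException("loginUrl and loginFormId must be set");
                }
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList postData = cred.getElementsByTagName("loginPostData");
                for (int j = 0; j < postData.getLength(); j++) {
                    // Only <field> elements nested in <loginPostData> are collected,
                    // so entries under <removedFormFields> are not picked up.
                    NodeList fieldNodes = ((Element) postData.item(j)).getElementsByTagName("field");
                    for (int k = 0; k < fieldNodes.getLength(); k++) {
                        Element field = (Element) fieldNodes.item(k);
                        fields.put(field.getAttribute("name"), field.getAttribute("value"));
                    }
                }
                return fields;
            }
            throw new IllegalArgumentException("No formAuth <credentials> element found");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse auth configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<auth-configuration>"
                + "<credentials authMethod=\"formAuth\""
                + " loginUrl=\"http://localhost:44444/Account/Login.aspx\""
                + " loginFormId=\"ctl01\" loginRedirect=\"true\">"
                + "<loginPostData>"
                + "<field name=\"ctl00$MainContent$LoginUser$UserName\" value=\"admin\"/>"
                + "<field name=\"ctl00$MainContent$LoginUser$Password\" value=\"admin123\"/>"
                + "</loginPostData>"
                + "</credentials>"
                + "</auth-configuration>";
        // Print only the field names so the password value is not echoed.
        System.out.println(readLoginPostData(xml).keySet());
    }
}
```

The sketch fails fast with an IllegalArgumentException when loginUrl or loginFormId is missing, which mirrors the checks the plugin code performs when it reads the same file.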
Read Http Form Post Authentication from Configuration XML File
In Nutch's protocol-httpclient plugin, change the org.apache.nutch.protocol.httpclient.Http.setCredentials() method to read the authentication info from the configuration file into the variable formConfigurer.
Then change the Http.resolveCredentials() method: if formConfigurer is not null, use HttpFormAuthentication to do the form post login.
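package org.apache.nutch.protocol.httpclient;

// Imports used by the excerpt below; most are already present in Http.java.
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
// StringUtils comes from Apache Commons Lang; import it from the version on your classpath.

public class Http extends HttpBase {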
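 private void resolveCredentials(URL url) {
  // If form authentication is configured, log in through the form
  // and skip the other authentication schemes.
  if (formConfigurer != null) {
   HttpFormAuthentication formAuther = new HttpFormAuthentication(
     formConfigurer, client, this);
   try {
    formAuther.login();
   } catch (Exception e) {
    throw new RuntimeException(e);
   }
   return;
  }
  // ... existing NTLM/Basic/Digest credential resolution, omitted here ...
 }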
 private static synchronized void setCredentials()
   throws ParserConfigurationException, SAXException, IOException {

  if (authRulesRead)
   return;

  authRulesRead = true; // Avoid re-attempting to read
  InputStream is = conf.getConfResourceAsInputStream(authFile);
  if (is != null) {
   Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().parse(is);

   Element rootElement = doc.getDocumentElement();
   if (!"auth-configuration".equals(rootElement.getTagName())) {
    if (LOG.isWarnEnabled())
     LOG.warn("Bad auth conf file: root element <"
       + rootElement.getTagName() + "> found in "
       + authFile + " - must be <auth-configuration>");
   }

   // For each set of credentials
   NodeList credList = rootElement.getChildNodes();
   for (int i = 0; i < credList.getLength(); i++) {
    Node credNode = credList.item(i);
    if (!(credNode instanceof Element))
     continue;

    Element credElement = (Element) credNode;
    if (!"credentials".equals(credElement.getTagName())) {
     if (LOG.isWarnEnabled())
      LOG.warn("Bad auth conf file: Element <"
        + credElement.getTagName()
        + "> not recognized in " + authFile
        + " - expected <credentials>");
     continue;
    }
    // Read http form post auth info
    String authMethod = credElement.getAttribute("authMethod");
    if (StringUtils.isNotBlank(authMethod)) {
     formConfigurer = readFormAuthConfigurer(credElement,
       authMethod);
     continue;
    }
    // ... existing handling of NTLM/Basic/Digest <credentials>, omitted here ...
   }
  }
 }
 private static HttpFormAuthConfigurer readFormAuthConfigurer(
   Element credElement, String authMethod) {
  if ("formAuth".equals(authMethod)) {
   HttpFormAuthConfigurer formConfigurer = new HttpFormAuthConfigurer();

   String str = credElement.getAttribute("loginUrl");
   if (StringUtils.isNotBlank(str)) {
    formConfigurer.setLoginUrl(str.trim());
   } else {
    throw new IllegalArgumentException("Must set loginUrl.");
   }
   str = credElement.getAttribute("loginFormId");
   if (StringUtils.isNotBlank(str)) {
    formConfigurer.setLoginFormId(str.trim());
   } else {
    throw new IllegalArgumentException("Must set loginFormId.");
   }
   str = credElement.getAttribute("loginRedirect");
   if (StringUtils.isNotBlank(str)) {
    formConfigurer.setLoginRedirect(Boolean.parseBoolean(str));
   }

   NodeList nodeList = credElement.getChildNodes();
   for (int j = 0; j < nodeList.getLength(); j++) {
    Node node = nodeList.item(j);
    if (!(node instanceof Element))
     continue;

    Element element = (Element) node;
    if ("loginPostData".equals(element.getTagName())) {
     Map<String, String> loginPostData = new HashMap<String, String>();
     NodeList childNodes = element.getChildNodes();
     for (int k = 0; k < childNodes.getLength(); k++) {
      Node fieldNode = childNodes.item(k);
      if (!(fieldNode instanceof Element))
       continue;

      Element fieldElement = (Element) fieldNode;
      String name = fieldElement.getAttribute("name");
      String value = fieldElement.getAttribute("value");
      loginPostData.put(name, value);
     }
     formConfigurer.setLoginPostData(loginPostData);
    } else if ("additionalPostHeaders".equals(element.getTagName())) {
     Map<String, String> additionalPostHeaders = new HashMap<String, String>();
     NodeList childNodes = element.getChildNodes();
     for (int k = 0; k < childNodes.getLength(); k++) {
      Node fieldNode = childNodes.item(k);
      if (!(fieldNode instanceof Element))
       continue;

      Element fieldElement = (Element) fieldNode;
      String name = fieldElement.getAttribute("name");
      String value = fieldElement.getAttribute("value");
      additionalPostHeaders.put(name, value);
     }
     formConfigurer
       .setAdditionalPostHeaders(additionalPostHeaders);
    } else if ("removedFormFields".equals(element.getTagName())) {
     Set<String> removedFormFields = new HashSet<String>();
     NodeList childNodes = element.getChildNodes();
     for (int k = 0; k < childNodes.getLength(); k++) {
      Node fieldNode = childNodes.item(k);
      if (!(fieldNode instanceof Element))
       continue;

      Element fieldElement = (Element) fieldNode;
      String name = fieldElement.getAttribute("name");
      removedFormFields.add(name);
     }
     formConfigurer.setRemovedFormFields(removedFormFields);
    }
   }
   return formConfigurer;
  } else {
   throw new IllegalArgumentException("Unsupported authMethod: "
     + authMethod);
  }
 }  
}  
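The reader above fills an HttpFormAuthConfigurer, which is not shown in this part. As a rough sketch of what such a holder could look like: the setter names below are taken from the calls in readFormAuthConfigurer(), while the getters, the field names, and the empty-collection defaults are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a plain holder for the values read from httpclient-auth.xml.
// Setters match the calls made in readFormAuthConfigurer(); getters are assumed.
final class HttpFormAuthConfigurer {
    private String loginUrl;
    private String loginFormId;
    // Stays false when the loginRedirect attribute is absent.
    private boolean loginRedirect;
    // Form fields whose values we supply (for example user name and password).
    private Map<String, String> loginPostData = new HashMap<>();
    // Extra HTTP headers to send with the login POST.
    private Map<String, String> additionalPostHeaders = new HashMap<>();
    // Form fields to drop before posting (for example a "remember me" checkbox).
    private Set<String> removedFormFields = new HashSet<>();

    public String getLoginUrl() { return loginUrl; }
    public void setLoginUrl(String loginUrl) { this.loginUrl = loginUrl; }

    public String getLoginFormId() { return loginFormId; }
    public void setLoginFormId(String loginFormId) { this.loginFormId = loginFormId; }

    public boolean isLoginRedirect() { return loginRedirect; }
    public void setLoginRedirect(boolean loginRedirect) { this.loginRedirect = loginRedirect; }

    public Map<String, String> getLoginPostData() { return loginPostData; }
    public void setLoginPostData(Map<String, String> loginPostData) {
        this.loginPostData = loginPostData;
    }

    public Map<String, String> getAdditionalPostHeaders() { return additionalPostHeaders; }
    public void setAdditionalPostHeaders(Map<String, String> additionalPostHeaders) {
        this.additionalPostHeaders = additionalPostHeaders;
    }

    public Set<String> getRemovedFormFields() { return removedFormFields; }
    public void setRemovedFormFields(Set<String> removedFormFields) {
        this.removedFormFields = removedFormFields;
    }
}
```

Starting the collections empty means the login code can iterate over them without null checks when the configuration file omits additionalPostHeaders or removedFormFields.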