The Problem
HTTP form-based authentication is a very commonly used mechanism for protecting web resources.
When crawling, Nutch supports NTLM, Basic, or Digest authentication to authenticate itself to websites, but it doesn't support HTTP POST form authentication.
This series of articles talks about how to extend Nutch2 to support HTTP POST form authentication.
Main Steps
Use Apache HttpClient to do HTTP POST form authentication.
Make HTTP POST form authentication work.
Integrate form authentication in Nutch2.
This article focuses on the second step: how to make HTTP POST form authentication work, via a practical example.
Create and Run ASP.NET Web Application
In Visual Studio, create an ASP.NET (MVC2) web application. The web application created by default supports form authentication, so it is a good target for testing our HTTP form login.
Write Test Code
To use HttpFormAuthentication to do HTTP POST form authentication, we first have to figure out the loginFormId: this can be done by searching for "<form" in the page source. Then, using the "Inspect element" function of Chrome DevTools, we can easily find the names of the username and password fields. Be sure to use the name attribute of each input element, not its id attribute.
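If the login page source has already been saved or downloaded, the field names can also be listed programmatically instead of being read off in the browser. The following is only a minimal sketch: the class FormFieldLister and its method are hypothetical names invented for this illustration, and it uses simple regular expressions rather than a real HTML parser, so it assumes well-formed markup with quoted attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch or HttpClient): lists the name
// attributes of the <input> elements inside the form with the given id.
class FormFieldLister {

    static List<String> inputNames(String html, String formId) {
        List<String> names = new ArrayList<>();

        // Find the <form ... id="formId" ...> element and capture its body.
        Pattern formPattern = Pattern.compile(
            "<form\\b[^>]*\\sid\\s*=\\s*[\"']" + Pattern.quote(formId)
                + "[\"'][^>]*>(.*?)</form>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher form = formPattern.matcher(html);
        if (!form.find()) {
            return names; // no form with that id on the page
        }

        // Collect the name (not the id) of every input in that form.
        Pattern inputPattern =
            Pattern.compile("<input\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        Pattern namePattern = Pattern.compile(
            "\\sname\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
        Matcher input = inputPattern.matcher(form.group(1));
        while (input.find()) {
            Matcher name = namePattern.matcher(input.group());
            if (name.find()) {
                names.add(name.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample of an ASP.NET login page.
        String html =
            "<form method=\"post\" action=\"Login.aspx\" id=\"ctl01\">"
          + "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"abc\" />"
          + "<input name=\"ctl00$MainContent$LoginUser$UserName\" type=\"text\""
          + " id=\"MainContent_LoginUser_UserName\" />"
          + "</form>";
        for (String name : inputNames(html, "ctl01")) {
            System.out.println(name);
        }
    }
}
```

Hidden fields such as __VIEWSTATE show up in the list too, which is a useful reminder that the login POST usually has to carry more than just the username and password.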
Now we can write test code:
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    // commons-lang 2.x; with commons-lang3 the package is org.apache.commons.lang3
    import org.apache.commons.lang.StringUtils;

    // HttpFormAuthConfigurer and HttpFormAuthentication come from the first step of this series.
    private static void authTestAspWebApp() throws Exception {
        HttpFormAuthConfigurer authConfigurer = new HttpFormAuthConfigurer();
        authConfigurer.setLoginUrl("http://localhost:44444/Account/Login.aspx")
                .setLoginFormId("ctl01").setLoginRedirect(true);

        // Username and password, keyed by the name attribute of each input.
        Map<String, String> loginPostData = new HashMap<String, String>();
        loginPostData.put("ctl00$MainContent$LoginUser$UserName", "admin");
        loginPostData.put("ctl00$MainContent$LoginUser$Password", "admin123");
        authConfigurer.setLoginPostData(loginPostData);

        // Form fields that should not be sent with the login POST.
        Set<String> removedFormFields = new HashSet<String>();
        removedFormFields.add("ctl00$MainContent$LoginUser$RememberMe");
        authConfigurer.setRemovedFormFields(removedFormFields);

        HttpFormAuthentication example = new HttpFormAuthentication(authConfigurer);

        // To route traffic through Fiddler directly:
        // example.client.getHostConfiguration().setProxy("127.0.0.1", 8888);
        // Otherwise use the proxy from the standard system properties, if set.
        String proxyHost = System.getProperty("http.proxyHost");
        String proxyPort = System.getProperty("http.proxyPort");
        if (StringUtils.isNotBlank(proxyHost) && StringUtils.isNotBlank(proxyPort)) {
            example.client.getHostConfiguration().setProxy(proxyHost,
                    Integer.parseInt(proxyPort));
        }

        // Log in, then fetch a page that requires authentication.
        example.login();
        String result = example
                .httpGetPageContent("http://localhost:44444/secret/needlogin.aspx");
        System.out.println(result);
    }

Run the test code above, then check the response code, response headers, and response body. We can copy the whole response body into JS Bin, where the HTML is much easier to view.
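Reading the response body by eye works, but a quick programmatic check can also tell whether the login went through. ASP.NET forms authentication normally redirects unauthenticated requests back to the login page, so a protected URL can return status 200 and still contain the login form. The sketch below is a heuristic under that assumption; LoginCheck and looksLikeLoginPage are hypothetical names, not part of HttpFormAuthentication.

```java
// Hypothetical helper: guesses whether a response body is still the login page
// by looking for an input whose name attribute is the password field name.
class LoginCheck {

    static boolean looksLikeLoginPage(String body, String passwordFieldName) {
        if (body == null) {
            return false;
        }
        // Accept both double-quoted and single-quoted attribute values.
        return body.contains("name=\"" + passwordFieldName + "\"")
            || body.contains("name='" + passwordFieldName + "'");
    }

    public static void main(String[] args) {
        String passwordField = "ctl00$MainContent$LoginUser$Password";

        // Made-up sample bodies for illustration only.
        String loginPage = "<form id=\"ctl01\"><input name=\""
            + passwordField + "\" type=\"password\" /></form>";
        String secretPage = "<h1>Members only</h1>";

        System.out.println(looksLikeLoginPage(loginPage, passwordField));  // true
        System.out.println(looksLikeLoginPage(secretPage, passwordField)); // false
    }
}
```

In the test method, this could be called on the result string right after httpGetPageContent returns, to fail fast when the body is the login page again.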
What to Do if it doesn't Work?
But sometimes things are not that simple, and the previous code may still not work: the user is not logged in, and we can't access the protected resource.
When this happens, we need to compare the request Apache HttpClient sends with the request Chrome sends, including the headers and the request body.
We can use Chrome DevTools to get the request headers and POST body; we can even copy the request as a cURL command and execute it on the command line.
We can also start Fiddler as a proxy, add example.client.getHostConfiguration().setProxy("127.0.0.1", 8888); to the test code, and then monitor in Fiddler the requests Apache HttpClient sends and the responses it receives.
Compare the two requests and check whether some headers are missing; if so, add them to additionalPostHeaders. Check whether some fields need to be removed; if so, add them to removedFormFields. Check whether more fields need to be added; if so, add them to loginPostData.
After all this, we should be able to make it work.
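The header comparison described above can also be done in code once both sets of headers have been copied out of Chrome DevTools and Fiddler. This is a minimal sketch, assuming the headers are already available as name/value maps; HeaderDiff and missingHeaders are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: reports which headers the browser sent that the
// HTTP client did not. Header names are compared case-insensitively.
class HeaderDiff {

    static Map<String, String> missingHeaders(Map<String, String> browserHeaders,
                                              Map<String, String> clientHeaders) {
        // Case-insensitive set of the header names the client already sends.
        Set<String> sent = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        sent.addAll(clientHeaders.keySet());

        // Keep the browser's order so the report is easy to read.
        Map<String, String> missing = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : browserHeaders.entrySet()) {
            if (!sent.contains(header.getKey())) {
                missing.put(header.getKey(), header.getValue());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample values; real ones come from DevTools and Fiddler.
        Map<String, String> browser = new LinkedHashMap<>();
        browser.put("Content-Type", "application/x-www-form-urlencoded");
        browser.put("Referer", "http://localhost:44444/Account/Login.aspx");

        Map<String, String> client = new LinkedHashMap<>();
        client.put("content-type", "application/x-www-form-urlencoded");

        // Prints the headers that are candidates for additionalPostHeaders.
        for (Map.Entry<String, String> header : missingHeaders(browser, client).entrySet()) {
            System.out.println(header.getKey() + ": " + header.getValue());
        }
    }
}
```

Not every missing header matters; headers such as Referer or a cookie set by an earlier response are the usual suspects, so it is reasonable to add them one at a time and retest.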