Showing posts with label GAE. Show all posts

Using lucene-appengine & google-http-java-client to Crawl Blogger on GAE


The Goal
In my latest project, I needed to develop a GAE Java application that crawls a Blogger site and saves the index into Lucene on GAE.

This post introduces how to deploy lucene-appengine, use google-http-java-client to parse sitemap.xml to get all posts, crawl each post, save the index to lucene-appengine on GAE, and use a GAE cron task to index new posts periodically.

Creating Maven GAE project & Adding Dependencies
First, follow GAE: Using Apache Maven to create a Maven project from the appengine-skeleton-archetype.

Then download the lucene-appengine-examples source code, copy the needed dependencies from its pom.xml, and add google-http-client, google-http-client-appengine and google-http-client-xml to pom.xml.

Using google-http-java-client to Parse sitemap.xml
The google-http-java-client library allows us to easily convert an XML response into a Java object with com.google.api.client.http.HttpResponse.parseAs(SomeClass.class); all we need to do is define the Java class.

Check Blogger's sitemap.xml, for example the lifelongprogrammer sitemap.xml:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://lifelongprogrammer.blogspot.com/2014/11/using-solr-classifier-to-categorize-articles.html</loc>
    <lastmod>2014-11-04T22:49:54Z</lastmod>
  </url>
</urlset>

So we can map it to two classes, Urlset and TUrl. The key is to use @com.google.api.client.util.Key to map each Java field to the XML element of the same name.
public class Urlset {
 @Key
 protected List<TUrl> url = new ArrayList<>();

 public List<TUrl> getUrl() {
  return url;
 }
}
public class TUrl {
 @Key
 protected String loc;
 @Key
 protected String lastmod;
  // omitted the getters
}
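
LAEUtil below calls url.getLastmodDate(), which is one of the omitted getters. The original post does not show how it is implemented, so the following is only a minimal sketch. It assumes lastmod always looks like the value in the sitemap above (seconds precision, trailing Z); LastmodDates and parse are hypothetical names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper: the original post does not show getLastmodDate().
final class LastmodDates {
    private LastmodDates() {}

    /**
     * Parses a sitemap lastmod value such as "2014-11-04T22:49:54Z".
     * Returns null when the value is missing or cannot be parsed.
     */
    static Date parse(String lastmod) {
        if (lastmod == null) {
            return null;
        }
        // SimpleDateFormat is not thread-safe, so create one per call.
        // The X pattern letter (Java 7+) accepts "Z" as the UTC designator.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssX");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        try {
            return format.parse(lastmod.trim());
        } catch (ParseException e) {
            return null;
        }
    }
}
```

With such a helper, TUrl.getLastmodDate() could simply return LastmodDates.parse(lastmod). Returning null on bad input matches the null check that the crawl loop performs before using the date.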

Then use the following code to parse sitemap.xml into a Urlset Java object.
static final HttpTransport HTTP_TRANSPORT = new NetHttpTransport();
static final XmlNamespaceDictionary XML_DICT = new XmlNamespaceDictionary();

HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory(new HttpRequestInitializer() {
      @Override
      public void initialize(HttpRequest request) {
        request.setParser(new XmlObjectParser(XML_DICT));
      }
    });

HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(sitemapUrl));
HttpResponse response = request.execute();
Urlset urls = response.parseAs(Urlset.class);

When crawling each post, we can use the following code to get the post's HTML as a string:
HttpRequestFactory requestFactory = HTTP_TRANSPORT.createRequestFactory();
HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(url.getLoc()));
HttpResponse response = request.execute();

String html = response.parseAsString();

LAEUtil
The following is the complete code, which parses the sitemap, crawls each post and saves the index into lucene-appengine.
public class LAEUtil {
 private static final Logger logger = LoggerFactory.getLogger(LAEUtil.class);
 private static final Version LUCENE_VERSION = Version.LUCENE_4_10_2;

 static final HttpTransport HTTP_TRANSPORT = new NetHttpTransport();
 static final XmlNamespaceDictionary XML_DICT = new XmlNamespaceDictionary();

 public static void crawl(String indexName, String sitemapUrl,
   long maxSeconds) throws IOException {
  Stopwatch stopwatch = Stopwatch.createStarted();
  IndexReader reader = null;
  try (GaeDirectory directory = new GaeDirectory(indexName)) {
   try {
    reader = DirectoryReader.open(directory);
   } catch (IndexNotFoundException e) {
    createIndex(directory);
    reader = DirectoryReader.open(directory);
   }

   IndexSearcher searcher = new IndexSearcher(reader);
   Date crawledMinDate = getCrawledMinMaxDate(searcher, false);
   Date crawlMaxDate = getCrawledMinMaxDate(searcher, true);

   reader.close();
   crawl(directory, stopwatch, indexName, sitemapUrl, crawledMinDate,
     crawlMaxDate, maxSeconds);
  } catch (IOException e) {
   logger.error("crawl failed with error", e);
  }
 }

 private static void createIndex(GaeDirectory directory) throws IOException {
  try (IndexWriter writer = new IndexWriter(directory,
    getIndexWriterConfig(LUCENE_VERSION, getAnalyzer()))) {
  }
 }

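 /** Returns the largest LASTMOD in the index when maxDate is true, otherwise the smallest; null if the index is empty. */
 private static Date getCrawledMinMaxDate(IndexSearcher searcher,
   boolean maxDate) throws IOException {
  Query q = new MatchAllDocsQuery();
  Date minMaxDate = null;
  // reverse = true sorts descending, so the first hit is the newest post
  boolean reverse = maxDate;
  TopFieldDocs docs = searcher.search(q, 1, new Sort(new SortField(
    Fields.LASTMOD, SortField.Type.LONG, reverse)));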

  ScoreDoc[] hits = docs.scoreDocs;
  if (hits.length != 0) {
   Document doc = searcher.doc(hits[0].doc);
   minMaxDate = new Date(Long.parseLong(doc.get(Fields.LASTMOD)));
  }
  return minMaxDate;
 }

 /** posts with lastmod in [crawledMinDate, crawlMaxDate] are already crawled */
 private static void crawl(GaeDirectory directory, Stopwatch stopwatch,
   String indexName, String sitemapUrl, Date crawledMinDate,
   Date crawlMaxDate, long maxSeconds) throws IOException {
  HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory(new HttpRequestInitializer() {
     @Override
     public void initialize(HttpRequest request) {
      request.setParser(new XmlObjectParser(XML_DICT));
     }
    });

  HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(
    sitemapUrl));

  HttpResponse response = request.execute();
  Urlset urls = response.parseAs(Urlset.class);
  PorterAnalyzer analyzer = getAnalyzer();

  // posts are sorted by lastMod in sitemap.xml
  int added = 0;
  try (IndexWriter writer = new IndexWriter(directory,
    getIndexWriterConfig(LUCENE_VERSION, analyzer))) {

   for (TUrl url : urls.getUrl()) {
    Date lastmod = url.getLastmodDate();
    // should not happen: skip entries without a lastmod
    if (lastmod == null)  continue;

    if (stopwatch.elapsed(TimeUnit.SECONDS) >= maxSeconds) {
     logger.error("Exceeded time limit " + maxSeconds
       + ", already ran "
       + stopwatch.elapsed(TimeUnit.SECONDS) + " seconds");
     break;
    }
    boolean post = false;
    if (crawlMaxDate == null || crawledMinDate == null) {
     post = true;
    }
    if (crawlMaxDate != null && lastmod.after(crawlMaxDate)) {
     post = true;
    } else if (crawledMinDate != null
      && url.getLastmodDate().before(crawledMinDate)) {
     post = true;
    }
    if (post) {
     crawlPost(url, writer);
     ++added;
     if (added == 20) {
      writer.commit();
      added = 0;
     }
    } else {
     logger.debug("ignore " + url.getLoc() + " : lastmod " + lastmod
       + ", crawlMaxDate: " + crawlMaxDate
       + ", crawledMinDate: " + crawledMinDate);
    }
   }
   logger.info("started to commit");
   writer.commit();
   logger.info("commit finished.");
  }
 }

 private static PorterAnalyzer getAnalyzer() {
  return new PorterAnalyzer(LUCENE_VERSION);
 }
  
 private static void crawlPost(TUrl url, IndexWriter writer)
   throws IOException {
  logger.info(url.getLoc() + " : " + url.getLastmod());
  HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory();
  HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(url
    .getLoc()));
  HttpResponse response = request.execute();

  String html = response.parseAsString();
  Document luceneDoc = new Document();
  luceneDoc.add(new StringField(Fields.ID, url.getLoc(), Store.YES));
  luceneDoc.add(new TextField(Fields.URL, url.getLoc(), Store.YES));

  luceneDoc.add(new TextField(Fields.RAWCONTENT, html, Store.YES));

  ArticleExtractor articleExtractor = ArticleExtractor.getInstance();

  org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
  luceneDoc.add(new TextField(Fields.TITLE, jsoupDoc.title(), Store.YES));

  html = normalize(jsoupDoc);
  try {
   String mainContent = articleExtractor.getText(html);
   luceneDoc.add(new TextField(Fields.MAINCONTENT, mainContent,
     Store.YES));
  } catch (BoilerpipeProcessingException e) {
   throw new RuntimeException(e);
  }
  luceneDoc.add(new LongField(Fields.LASTMOD, url.getLastmodDate()
    .getTime(), Store.YES));
  writer.addDocument(luceneDoc);
 }
}
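
The decision inside the crawl loop (index a post only when its lastmod falls outside the already-crawled range) can be isolated as a pure function, which makes the boundary cases easier to check. This is a sketch of the same logic, not part of the original LAEUtil; CrawlWindow and shouldCrawl are hypothetical names.

```java
import java.util.Date;

// Hypothetical helper, not part of the original LAEUtil.
final class CrawlWindow {
    private CrawlWindow() {}

    /**
     * Returns true when a post with the given lastmod still needs crawling.
     * Posts whose lastmod lies in [crawledMinDate, crawledMaxDate] are
     * treated as already indexed.
     */
    static boolean shouldCrawl(Date lastmod, Date crawledMinDate, Date crawledMaxDate) {
        if (lastmod == null) {
            return false; // the crawl loop skips entries without a lastmod
        }
        if (crawledMinDate == null || crawledMaxDate == null) {
            return true; // empty index: everything is new
        }
        return lastmod.before(crawledMinDate) || lastmod.after(crawledMaxDate);
    }
}
```

Note that a post edited after it was indexed gets a newer lastmod and is crawled again; because crawlPost uses addDocument, the index may then hold two documents for the same URL unless the old one is removed first.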
BloggerCrawler Servlet
We can call the BloggerCrawler servlet manually to test our crawler. When we test or call the servlet manually, we set maxseconds to a smaller value because of the GAE request handler time limit; when we call it from a cron task, we set it to 8 minutes (the time limit for a task is 10 minutes).
public class BloggerCrawler extends HttpServlet {
 private static final Logger logger = LoggerFactory
   .getLogger(BloggerCrawler.class);
 protected void doGet(HttpServletRequest req, HttpServletResponse resp)
   throws ServletException, IOException {

  String site = Preconditions.checkNotNull(req.getParameter("sitename"),
    "site can't be null");

  String indexName = site;
  if (site.endsWith("blogspot.com")) {
   throw new IllegalArgumentException("not valid sitename: " + site);
  }
  String sitemapUrl = "http://" + site + ".blogspot.com/sitemap.xml";

  int maxseconds = getMaxSeconds(req);
  logger.info("started to crawl " + sitemapUrl);
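  LAEUtil.crawl(indexName, sitemapUrl, maxseconds);
  // HttpServlet.doGet() replies with an error status by default, so write our own response
  resp.setContentType("text/plain");
  resp.getWriter().println("crawl finished: " + sitemapUrl);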
 }
 private int getMaxSeconds(HttpServletRequest req) {
  int maxseconds = 40;
  String str = req.getParameter("maxseconds");
  if (str != null) {
   maxseconds = Integer.parseInt(str);
  }
  return maxseconds;
 }
}
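
getMaxSeconds above throws NumberFormatException when maxseconds is not a number, and it accepts any value, including one larger than the task time limit. A more defensive variant might look like the sketch below. The 480-second cap is an assumption taken from the 8-minute budget mentioned above, and CrawlParams and parseMaxSeconds are hypothetical names.

```java
// Hypothetical helper, not part of the original servlet.
final class CrawlParams {
    static final int DEFAULT_MAX_SECONDS = 40;  // same default as getMaxSeconds
    static final int UPPER_LIMIT_SECONDS = 480; // assumption: the 8-minute cron budget

    private CrawlParams() {}

    /** Parses the raw maxseconds parameter, falling back to the default on bad input. */
    static int parseMaxSeconds(String raw) {
        if (raw == null) {
            return DEFAULT_MAX_SECONDS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            if (value <= 0) {
                return DEFAULT_MAX_SECONDS;
            }
            // Never exceed the budget that fits within the task time limit.
            return Math.min(value, UPPER_LIMIT_SECONDS);
        } catch (NumberFormatException e) {
            return DEFAULT_MAX_SECONDS;
        }
    }
}
```

The servlet could then call CrawlParams.parseMaxSeconds(req.getParameter("maxseconds")) instead of parsing the parameter itself.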

Scheduled Crawler with GAE Cron
We can use GAE cron to call the crawler servlet periodically, for example every 12 hours. Check Scheduled Tasks With Cron for Java for more about GAE cron.
Note that the local development server does not execute cron jobs and does not have the Cron Jobs link; the deployed App Engine application shows cron jobs and executes them.
All we need to do is add the cron task to cron.xml:
<cronentries>
  <cron>
    <url>/crawl?sitename=lifelongprogrammer&amp;maxseconds=480</url>
    <description>Crawl lifelongprogrammer every 12 hours</description>
    <schedule>every 12 hours</schedule>
  </cron>
</cronentries>
References
lucene-appengine
GAE: Using Apache Maven
Scheduled Tasks With Cron for Java

Image Transformer (Resize, Rotate, flip and Enhance)




The server side is deployed on GAE; it uses GAE's ImagesService to resize, rotate, flip and enhance images.
GAE Images Java API Overview

Using Maven with Google App Engine


Maven is very good at managing a project's dependencies, so I also chose Maven when developing GAE projects.

The Google Eclipse plugin doesn't support GAE Maven development very well: we can't use it to directly run or debug the app, or to deploy it to App Engine.

To run the app in local GAE server:
cd ${mypp}\${mypp-ear}
mvn -f ..\pom.xml clean install && mvn appengine:devserver

To debug the app, add the following to pom.xml:
<plugins>
  <plugin>
    <groupId>com.google.appengine</groupId>
    <artifactId>appengine-maven-plugin</artifactId>
    <configuration>
      <jvmFlags>
        <jvmFlag>-Xdebug</jvmFlag>
        <jvmFlag>-agentlib:jdwp=transport=dt_socket,address=9999,server=y,suspend=n</jvmFlag>
      </jvmFlags>
      <disableUpdateCheck>true</disableUpdateCheck>
    </configuration>
  </plugin>
</plugins>

Start the local GAE server, then create a Remote Java Application debug configuration in Eclipse that connects to localhost:9999. Now we can debug the GAE Maven application in Eclipse.

Change Application Id
For several reasons, we may want to deploy the same application under multiple application ids.
- We may use GAE as the backend, with a mobile app or even a Google Blogger site as the client (as Google doesn't allow ads in a GAE app, we may use Blogger as the front end that talks to the GAE server to do the real work).
- When our application gets popular and exceeds the free quota, we may want to duplicate it and deploy it under another application id.

If we are using Maven to build and deploy, we need to change the application id in ${mypp}\${mypp-ear}\src\main\application\META-INF\appengine-application.xml.

Then deploy it to the new application id:
cd ${mypp}\${mypp-ear}
mvn -f ..\pom.xml clean install && mvn appengine:update


GAE: Java compiler level does not match the version of the installed Java project facet


The Problem
I was creating a new GAE (1.9.1) project today. There is no place to set the JDK version when creating a Google Web Application Project, and the default JRE in Eclipse is 1.6. I don't want to change the default JRE to JDK 7, as most of our projects still use Java 6.

So after creating the GAE project, I wanted to switch it to JDK 7 so that I could use the new JDK 7 features.
I right-clicked the project -> Properties -> Java Build Path -> Libraries, removed the JDK 6 runtime and added the JDK 7 runtime; then, in the Java Compiler tab, I changed the Compiler compliance level to 1.7.

This rebuilds the project. After it's done, the Eclipse Problems view shows the error:
Java compiler level does not match the version of the installed Java project facet. HelloGAE Unknown Faceted Project Problem (Java Version Mismatch)

We can still write code and run this GAE project, but if there is a syntax (compiler level) error in our code, it will not show in the Problems view. This is kind of annoying.
The Solution
This is because the GAE project was created with JDK 6, which sets the Java Project Facet level to 1.6.
The value can also be seen in .settings/org.eclipse.wst.common.project.facet.core.xml

The fix is to change the Java Project Facet level to 1.7: right-click the GAE project -> Properties -> Project Facets, and in the right panel change the Java version to 1.7.

It will rebuild the project; now we can see the errors and warnings in the Problems view.

Happy coding :)
