Google: Help Sharpen Our Professional Skills

We all strive to improve our professional skills.
It's not easy, but we can never give up, as this is how we earn the money to support our families and raise our children.

It's a lifelong and tough task. We would appreciate it if Google could help with this.

As developers, photographers, or teachers, we want to sharpen our programming, photography, or teaching skills.
As students, we want to know what to learn, find problems similar to the ones we just solved, and improve our problem-solving skills.

So it would be great if Google could help in our mission to sharpen our professional skills.

Recommend technical and professional-skills content in YouTube, Google+, and Google Now

YouTube
Recommend to You (Professional Skills)
YouTube is a great website for listening to music (the recently added Music tab is awesome) and watching casual videos.

But there are also a lot of videos on YouTube about professional skills (programming, teaching, etc.). Many people don't know these videos exist, even though they would want to watch them.

These videos are rarely on the popular or top-video lists, which hurts both the passion and the revenue of their producers.

A lot of people are eager to watch them but don't know they exist, so the view counts for these videos are usually very small.

Therefore, it would be great if YouTube could help us find these videos so that we know about them and watch them.
There can be two kinds of professional-skills videos: short, interesting, easy-to-understand videos that we can watch in our spare time (while eating, travelling, or commuting), and long, in-depth videos that we have to sit down and take time to understand.

In addition, this feature would benefit both the audience and YouTube: people would watch more on YouTube, which means more revenue for YouTube.

Watch Offline if Eligible
Allow us to watch professional-skills videos offline on our phones or tablets if the producer hasn't put ads on them.

Open Courses on YouTube, or Video for Education
There are a lot of popular open-course sites like Coursera and edX, which demonstrates that educational videos appeal to people. So why doesn't YouTube build its own platform?

Google+
Recommend to You (Professional Skills): recommend professional-skills posts to us
I'm not a very social person, so I don't use Facebook or Twitter.
As a fan of Google, I mainly use G+, especially its What's Hot and Explore features (Technology category).

Like most people, I am always working hard to improve my professional skills. For me, these are programming, coding, algorithms, and (recently) coding interviews.
I am sure there are a lot of useful and interesting posts related to these topics on G+.

So it would be great if G+ could help us find these posts and recommend them to us, so that we can read them when we check G+ in our spare time (waiting for a train, shopping with family, etc.).

We can have fun and sharpen our professional skills at the same time.

"what's hot" and "recommend to you" in community
What's Hot, Explore, and Communities are great features.
Even though I have joined multiple programming-related communities, I don't check them often.
Tons of useful posts are posted and updated in these communities every day, and it is difficult for me to catch up on them all at once.

So G+ could do us a favor: each community could have a "What's Hot" or "Recommend to You" category. It would be a great gift to us!

Allow users to create groups that aggregate content from similar communities in G+
It would be awesome if we could create a group that aggregates content from multiple communities, so we can check all updates in one place: my group.
The group would list the "What's Hot" and "Recommend to You" posts from all the communities defined in it.


Group similar topics in Play Newsstand
Allow us to create our own topic or group to aggregate posts from multiple topics or feeds.
Add more topics related to professional skills; for me, coding skills and algorithms.

Google Now
Google Now has done a great job on this; I hope it can do even better in the future.

Google Alerts: Job Recommendation
At some point, we may decide to move on or look for a new challenge.

LinkedIn is great, but there are also a lot of other job search sites: Lever, Jobvite, etc. They can't

Google can actually help with this, as Google Search knows about all the jobs recently posted on companies' sites.

Users could use Google Alerts to define what kinds of jobs they are interested in, for example:
Location: New York or Bay Area
Keyword: Lucene Solr Hadoop

Then Google Alerts could notify us whenever matching jobs are posted online.

Java: Using classmexer MemoryUtil to Get Object Deep Memory

The Problem
In some cases, we may want to get the deep memory usage of an object.
For example, in a recent project, I developed a Solr request handler that copies docs from a remote Solr server to the local Solr.

The request looks like this: /solr/core/pulldocs?remoteSolr=solrurl&q=query&fl=fields&rows=ROWS&start=START
Internally, it fetches 100 docs at a time: first START to START+100, then START+100 to START+200, and so on. There are actually 5 threads pulling docs and inserting them into the local Solr at the same time.
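
The batching described above can be sketched in plain Java. This is only an illustration of how a start/rows request splits into 100-doc windows; the class name PullWindows and the method windows are hypothetical and are not part of the actual request handler.

```java
import java.util.ArrayList;
import java.util.List;

class PullWindows {
    /**
     * Splits the range [start, start + rows) into consecutive batches.
     * Each entry is {offset, size}; the last batch may be smaller.
     */
    static List<int[]> windows(int start, int rows, int batchSize) {
        List<int[]> result = new ArrayList<>();
        int end = start + rows;
        for (int offset = start; offset < end; offset += batchSize) {
            int size = Math.min(batchSize, end - offset);
            result.add(new int[] {offset, size});
        }
        return result;
    }

    public static void main(String[] args) {
        // Each window would become one query against the remote Solr.
        for (int[] w : windows(4000, 250, 100)) {
            System.out.println("start=" + w[0] + "&rows=" + w[1]);
        }
    }
}
```

In the real handler, each window would be handed to one of the 5 worker threads; timing each window separately is what makes an abnormally large batch stand out.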

But in one test environment, the tester reported that the 100-doc requests got slower and slower. My guess was that the requests were not slowing down in general, but that some batches of 100 docs were abnormally large.

So I needed to prove it: I wanted to print the execution time of each request and the size of the response from the remote Solr server.

Solution: Use classmexer MemoryUtil to Get Deep Memory Usage
So, how do we get the deep memory usage of a Java object?
Searching Google, I found that we can use Java Instrumentation to get an object's size (Instrumentation.getObjectSize), but that only gives the shallow size of the object.

Then I found MemoryUtil from classmexer, which can get the deep memory usage of an object:
MemoryUtil.deepMemoryUsageOf(object)

Integrate classmexer MemoryUtil to Web Application
In order to use MemoryUtil in our Solr application, I added -javaagent:C:\mysolrapp\extra\classmexer.jar to the Java startup parameters.

Then I changed the code as below:
QueryResponse rsp = solrServer.query(fetchQuery);
logger.info("start: " + fetchQuery.getStart() + ", deep size: "
  + MemoryUtil.deepMemoryUsageOf(rsp));
Copy the newly built class to WEB-INF/classes, restart the server, and rerun the test. In the log, I could easily find the huge response from the remote Solr:
INFO: start: 4000, deep size: 714,778,104 ==> approximately 700 MB; in the normal case, it should be between 1 and 10 MB.
INFO: Added 100, start: 4000, took 1195796

Then I cleaned the data and reran the test with start=4000&rows=100.

I checked the Solr index: its size was more than 5 GB. Using Luke to analyze the index, I found that 99.99% of it was the content field, which had more than 41 million terms.

The real root cause was on the server side: when the server extracts text from a file and the file is corrupted, it gets the binary data and adds it to the content field, which becomes huge. We fixed the server-side code, and everything works fine.

Scala & Java: Merge K Sorted List

Recently I started to learn Scala, and the best way to learn a new language is to write code that solves real problems.

So here is my Scala code for the classic algorithm question: merge K sorted lists.
The code works, but as I am just a beginner in Scala, it doesn't use Scala's full power or features. I just translated my Java version to Scala.

Scala Code: Merge K Sorted List
As the list can be an ArrayList or a LinkedList, we use an Iterator to check whether it still has elements and to get the next element.
package org.lifelongprogrammer.scala.algorithms

import scala.collection.mutable
import scala.collection.mutable.PriorityQueue
object MergeKArrays {
  case class Element[E <: Comparable[E]](var value: E, iterator: Iterator[E]) extends Ordered[Element[E]] {
    def compare(that: Element[E]) = that.value.compareTo(this.value)
  }
  def merge[E <: Comparable[E]](lists: List[List[E]]): List[E] =
    {
      if (lists == null || lists.isEmpty)
        return List[E]();

      val pq = new PriorityQueue[Element[E]]()

      for (list <- lists) {
        if (list != null && !list.isEmpty) {
          val it = list.iterator;
          pq.enqueue(Element(it.next, it))
        }
      }

      val result = mutable.ListBuffer[E]();
      while (pq.size > 1) {
        val first = pq.dequeue;
        result.append(first.value)

        val it = first.iterator;
        if (it.hasNext) {
          // reuse first element
          first.value = it.next;
          pq.enqueue(first)
        }
      }

      if (!pq.isEmpty) {
        val first = pq.dequeue;
        result.append(first.value);

        val it = first.iterator;
        while (it.hasNext) {
          result.append(it.next);
        }
      }
      return result.toList;
    }

  def main(args: Array[String]) {
    val lists: List[List[Integer]] = List(List(1, 3, 5), List(2, 4, 6, 8))
    val result = merge(lists);
    print(result)
  }
}
Java: Merge K Sorted List
Here is my Java code:
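package org.codeexample.algorithms;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeKArray {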

 private static class Element<E extends Comparable<E>> implements
   Comparable<Element<E>> {
  private E value;
  private Iterator<E> iterator;
  @Override
  public int compareTo(Element<E> o) {
   return this.value.compareTo(o.value);
  }
 }
 /**
  * Preconditions: List can't contain null
  */
 public static <E extends Comparable<E>> List<E> merge(List<List<E>> lists) {
  if (lists == null || lists.isEmpty())
   return new ArrayList<>();

  PriorityQueue<Element<E>> pq = new PriorityQueue<>(lists.size()
  // ,new ElementComparator<E>() // or use external comparator
  );

  int allSize = 0;
  for (List<E> list : lists) {
   if (list != null && !list.isEmpty()) {
    Element<E> e = new Element<>();
    e.iterator = list.iterator();
    e.value = e.iterator.next();
    assert e.value != null;
    pq.add(e);

    allSize += list.size();
   }
  }
  List<E> result = new ArrayList<>(allSize);

  while (pq.size() > 1) {
   Element<E> e = pq.poll();
   assert e.value != null;
   result.add(e.value);
   Iterator<E> iterator = e.iterator;
   if (iterator.hasNext()) {
    e.value = iterator.next();
    assert e.value != null;
    pq.add(e);
   }
  }

  if (!pq.isEmpty()) {
   Element<E> e = pq.poll();
   result.add(e.value);
   while (e.iterator.hasNext()) {
    result.add(e.iterator.next());
   }
  }
  return result;
 }

 private static class ElementComparator<E extends Comparable<E>> implements
   Comparator<Element<E>> {
  public int compare(Element<E> o1, Element<E> o2) {
   return o1.value.compareTo(o2.value);
  }
 }

 public static void main(String[] args) {
  List<List<Integer>> lists = new ArrayList<>();
  lists.add(Arrays.asList(1, 3, 5));
  lists.add(Arrays.asList(2, 4, 6, 8));
  lists.add(Arrays.asList(0, 10, 13, 15));
  System.out.println(merge(lists));
 }
}

Eclipse Debugging Tips: Find Which Jar Contains the Class the Application Is Using

The problem:
Today I was adding Solr Cell (Tika) to our Solr application. During testing, it threw the following exception:
Caused by: java.lang.NoSuchMethodError: org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set;
        at org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52)
        at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)

This looks like a conflicting tika-core jar.
I used the Java decompiler JD-GUI to check the jar I added, solr\contrib\extraction\lib\tika-core-1.3.jar: its MediaType class does contain the method set(MediaType[] types).

So some other jar must also contain the MediaType class. I checked solr.war\WEB-INF\lib, but found no obvious hint.

Using Eclipse Display View to Check Which Jar Contains the Class
I enabled remote debugging, added a breakpoint at Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52), and reran the Solr Cell request. When execution stopped at the breakpoint, I ran the following in the Display view:
java.security.CodeSource src = org.apache.tika.mime.MediaType.class.getProtectionDomain().getCodeSource();
return src.getLocation();
The output:
(java.net.URL) file:omitted/webapps/server/WEB-INF/lib/crawler4j-dependency.jar

After checking crawler4j-dependency.jar, the root cause is obvious.
The culprit is that someone added crawler4j to the Solr application and packed all of its dependencies into crawler4j-dependency.jar. It bundles tika-core-1.0.jar, whose MediaType class doesn't contain the method set(MediaType[] types).

We can use the following code to return all the methods in MediaType:
java.lang.reflect.Method[] methods = org.apache.tika.mime.MediaType.class.getMethods();
return methods;
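
Instead of scanning the whole method list by eye, the check can be narrowed to one signature. The sketch below uses only standard reflection; the class name MethodCheck and the helper hasPublicMethod are hypothetical names for illustration. For the Tika case, the call would presumably be hasPublicMethod(MediaType.class, "set", MediaType[].class).

```java
import java.lang.reflect.Method;

class MethodCheck {
    /**
     * Returns true if cls exposes a public method with exactly this name
     * and these parameter types (inherited public methods included).
     */
    static boolean hasPublicMethod(Class<?> cls, String name, Class<?>... parameterTypes) {
        try {
            Method m = cls.getMethod(name, parameterTypes);
            return m != null;
        } catch (NoSuchMethodException e) {
            // This is the reflective counterpart of NoSuchMethodError at run time.
            return false;
        }
    }

    public static void main(String[] args) {
        // String.valueOf(char[]) exists, String.valueOf(java.util.Set) does not.
        System.out.println(hasPublicMethod(String.class, "valueOf", char[].class));
        System.out.println(hasPublicMethod(String.class, "valueOf", java.util.Set.class));
    }
}
```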

The Problem - Breakpoint Doesn't Work
When we step through classes from some jars in Eclipse, we may find that the code doesn't match or that breakpoints don't work at all.

This usually means there are multiple versions of the same class or library in the application, and the one Java uses is not the same as the one Eclipse loads for debugging.
You can check which one Java is using with this.getClass().getProtectionDomain().getCodeSource().getLocation() or XClass.class.getProtectionDomain().getCodeSource().getLocation().

You can check which jar Eclipse is loading in the package view, if "Link with Editor" is enabled.
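
The same lookup can be wrapped in a small helper for logging outside the debugger. This is a minimal sketch; WhereIsClass and locationOf are hypothetical names. It guards against a null CodeSource, which is what classes loaded by the bootstrap class loader (such as java.lang.String) typically report.

```java
import java.net.URL;
import java.security.CodeSource;

class WhereIsClass {
    /**
     * Returns the jar or directory a class was loaded from,
     * or null when the class has no code source (e.g. JDK core classes).
     */
    static URL locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        return src == null ? null : src.getLocation();
    }

    public static void main(String[] args) {
        System.out.println("WhereIsClass loaded from: " + locationOf(WhereIsClass.class));
        System.out.println("String loaded from: " + locationOf(String.class));
    }
}
```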

References
Uploading Data with Solr Cell using Apache Tika
Solr ExtractingRequestHandler

Spark Basic Statistics - Using Scala

Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data:
1 2 3
10 20 30
100 200 300

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
  
val data = sc.textFile("E:/jeffery/src/ML/data/statistics.txt").cache();  
val parsedData = data.map( line =>  Vectors.dense(line.split(' ').map(x => x.toDouble).toArray) )
val summary = Statistics.colStats(parsedData);
println(summary.count)
println(summary.min)
println(summary.max)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
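
To make the summary values concrete, here is a plain Java sketch (no Spark involved) that computes the column-wise mean, variance, and nonzero count for the test data above. The class ColumnStats is hypothetical, and the variance uses the unbiased n - 1 denominator, which I assume matches what colStats reports.

```java
class ColumnStats {
    /** Column-wise mean of a row-major matrix with at least one row. */
    static double[] mean(double[][] rows) {
        double[] m = new double[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < m.length; j++) m[j] += row[j];
        }
        for (int j = 0; j < m.length; j++) m[j] /= rows.length;
        return m;
    }

    /** Column-wise sample variance (divides by n - 1); needs at least two rows. */
    static double[] variance(double[][] rows) {
        double[] m = mean(rows);
        double[] v = new double[m.length];
        for (double[] row : rows) {
            for (int j = 0; j < v.length; j++) v[j] += (row[j] - m[j]) * (row[j] - m[j]);
        }
        for (int j = 0; j < v.length; j++) v[j] /= (rows.length - 1);
        return v;
    }

    /** Number of nonzero entries in each column. */
    static long[] numNonzeros(double[][] rows) {
        long[] n = new long[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < n.length; j++) if (row[j] != 0.0) n[j]++;
        }
        return n;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2, 3}, {10, 20, 30}, {100, 200, 300}};
        System.out.println(java.util.Arrays.toString(mean(data)));
        System.out.println(java.util.Arrays.toString(variance(data)));
        System.out.println(java.util.Arrays.toString(numNonzeros(data)));
    }
}
```

For the first column (1, 10, 100) the mean is 37 and the sample variance is (36² + 27² + 63²) / 2 = 2997.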


Stratified sampling

Stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.

The sampleByKey method flips a coin to decide whether an observation will be sampled or not; it therefore requires one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but provides the exact sample size with 99.99% confidence.
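
The coin-flip idea behind sampleByKey can be illustrated without Spark. The sketch below is not Spark's implementation; StratifiedSample and its sampleByKey method are hypothetical names. It keeps each pair independently with the probability configured for its key, so the sample size per key is only an expected value.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class StratifiedSample {
    /**
     * One pass over the data: keep each (key, value) pair with probability
     * fractions.get(key). Keys without a configured fraction are dropped.
     */
    static <K, V> List<Map.Entry<K, V>> sampleByKey(
            List<Map.Entry<K, V>> data, Map<K, Double> fractions, long seed) {
        Random rnd = new Random(seed);
        List<Map.Entry<K, V>> sample = new ArrayList<>();
        for (Map.Entry<K, V> pair : data) {
            Double fraction = fractions.get(pair.getKey());
            // nextDouble() is in [0, 1), so 1.0 always keeps and 0.0 never keeps.
            if (fraction != null && rnd.nextDouble() < fraction) {
                sample.add(pair);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> data = new ArrayList<>();
        data.add(new AbstractMap.SimpleEntry<>("man", 6));
        data.add(new AbstractMap.SimpleEntry<>("woman", 14));
        data.add(new AbstractMap.SimpleEntry<>("woman", 19));
        data.add(new AbstractMap.SimpleEntry<>("child", 6));

        Map<String, Double> fractions = new HashMap<>();
        fractions.put("man", 0.5);
        fractions.put("woman", 0.5);
        fractions.put("child", 0.5);

        System.out.println(sampleByKey(data, fractions, 42L));
    }
}
```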


Test data:
man 6
woman 14
woman 19
child 6
baby 1
child 3
woman 26
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
val data = sc.textFile("E:/jeffery/src/ML/data/sampling.txt").cache();  
val parsedData = data.map{line => {
  val sp = line.split(' '); 
  (sp(0), sp(1).toInt);
}
}.cache()

parsedData.foreach(println)
var fractions = Map[String, Double]()

fractions += ("man" ->  0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3);
val approxSample = parsedData.sampleByKey(false, fractions).collect();
val exactSample = parsedData.sampleByKeyExact(false, fractions).collect();
print(approxSample.mkString(" "));
print(exactSample.mkString(" "));

Random data generation
import org.apache.spark.mllib.random.RandomRDDs._
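val u = normalRDD(sc, 100L, 2); // 100 values drawn from N(0, 1), in 2 partitions
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
println(u.collect().mkString(" ")) // print(u.collect()) would only print the array reference
println(v.collect().mkString(" "))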

val u = poissonRDD(sc, 10, 100L);
val v = u.map(x => 1.0 + 2.0 * x).collect()

val u = uniformRDD(sc, 100L);
val v = u.map(x => 1.0 + 2.0 * x).collect()
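
Why does 1.0 + 2.0 * x give N(1, 4)? Shifting a standard normal by 1 moves the mean to 1, and scaling by 2 multiplies the variance by 2² = 4. The plain Java sketch below checks this empirically with java.util.Random instead of Spark; NormalTransform and its methods are hypothetical names.

```java
import java.util.Random;

class NormalTransform {
    /** Draws n values from N(0, 1) and maps each x to 1.0 + 2.0 * x. */
    static double[] sample(int n, long seed) {
        Random rnd = new Random(seed);
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = 1.0 + 2.0 * rnd.nextGaussian();
        }
        return values;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double variance(double[] xs) {
        double m = mean(xs);
        double sum = 0;
        for (double x : xs) sum += (x - m) * (x - m);
        return sum / (xs.length - 1);
    }

    public static void main(String[] args) {
        double[] v = sample(100000, 42L);
        // Both values should be close to 1 and 4 respectively.
        System.out.println("mean = " + mean(v) + ", variance = " + variance(v));
    }
}
```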

Histogram
val ints = sc.parallelize(1 to 100)
ints.histogram(5) // 5 evenly spaced buckets
res92: (Array[Double], Array[Long]) = (Array(1.0, 20.8, 40.6, 60.4, 80.2, 100.0),Array(20, 20, 20, 20, 20))

Correlations


MLlib - Basic Statistics
Spark 1.1.0 Basic Statistics (Part 1)

Hack Scala REPL Classpath

The Problem
I was running the Latent Semantic Analysis (LSA) Wikipedia example from the book Advanced Analytics with Spark, from the Spark 1.2 spark-shell.cmd.

It depends on the Stanford NLP libraries, so I needed to add the Stanford NLP jars to the Scala REPL. I didn't want to add these jars to Spark's spark-shell.cmd. We can use :cp to add a jar to the current Scala shell session.
But as there are multiple jars (7, actually) in the stanford-corenlp-full-2014-10-31 folder, I didn't want to add them one by one.

The Solution
import java.io.File

for (
  file <- new File("E:/jeffery/src/textmining/standfordnlp/stanford-corenlp-full-2014-10-31").listFiles.filter(f =>
    f.getName().endsWith(".jar") &&
      !f.getName().contains("-sources") &&
      !f.getName().contains("-src") &&
      !f.getName().contains("-javadoc"))
) { println(":cp " + file) }

This prints all the stanford-corenlp jars except those containing source code or javadoc:
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\ejml-0.23.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\javax.json.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\joda-time.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\jollyday.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\stanford-corenlp-3.5.0-models.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\stanford-corenlp-3.5.0.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\xom.jar

Then just copy the output and paste it into the Scala shell; Scala will add these jars to the current shell classpath.
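
The same filter can also be written in Java, for example to generate the :cp lines from a build script. This is a sketch under that assumption; ClasspathCommands and cpCommands are hypothetical names, and the directory is passed in rather than hard-coded.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClasspathCommands {
    /**
     * Builds one ":cp <jar>" line for every jar in dir, skipping jars that
     * hold sources or javadoc. Returns an empty list if dir cannot be listed.
     */
    static List<String> cpCommands(File dir) {
        List<String> commands = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return commands; // not a directory, or an I/O error occurred
        }
        Arrays.sort(files); // listFiles() gives no ordering guarantee
        for (File file : files) {
            String name = file.getName();
            if (name.endsWith(".jar")
                    && !name.contains("-sources")
                    && !name.contains("-src")
                    && !name.contains("-javadoc")) {
                commands.add(":cp " + file.getPath());
            }
        }
        return commands;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String command : cpCommands(dir)) {
            System.out.println(command);
        }
    }
}
```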

Happy Hacking.

Build Spark Failure: Nonzero exit code (128): git clone sbt-pom-reader.git

The Problem
I downloaded Spark 1.2 from GitHub and tried to build it by running sbt assembly.
It always failed with this error:
[error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader
[error] Use 'last' for the full log.
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Retrying didn't work, even though I could access https://github.com/ScrapCodes/sbt-pom-reader.git and git clone it manually.
I'm not sure why it failed.

The Solution
To fix this, I opened a new cmd terminal and ran the following commands to create the staging folder and git clone into the destination folder:
mkdir C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader
git clone https://github.com/ScrapCodes/sbt-pom-reader.git C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader

Then I typed r to retry. As sbt-pom-reader was already there, sbt just happily took it.
After several minutes, Spark built successfully.

Happy hacking.

Running Stanford Sentiment Analysis in UIMA

The Goal
In the previous post, we introduced how to run Stanford NER (Named Entity Recognition) in UIMA; now we are integrating Stanford Sentiment Analysis into UIMA.

StanfordNLPAnnotator
Feature Structure: org.apache.uima.stanfordnlp.input:action
We use StanfordNLPAnnotator as the gateway or facade: the client uses org.apache.uima.stanfordnlp.input:action to specify what to extract: action=ner to run named entity extraction, or action=sentiment to run sentiment analysis.

The feature org.apache.uima.stanfordnlp.output:type specifies the sentiment of the whole article: very negative, negative, neutral, positive, or very positive.

The configuration parameter SentiwordnetFile specifies the path of the SentiWordNet file.

How it Works
First, it ignores sentences that don't contain an opinionated word. It uses SentiWordNet to check whether a sentence contains a non-neutral adjective.

Then it calls the Stanford NLP Sentiment Analysis tool to process the text.
Stanford NLP Sentiment Analysis has two model files: edu/stanford/nlp/models/sentiment/sentiment.ser.gz, which maps sentiment to 5 classes (very negative, negative, neutral, positive, or very positive), and edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz, which maps sentiment to 2 classes (negative or positive).

We use edu/stanford/nlp/models/sentiment/sentiment.ser.gz, but it sometimes seems inclined to mistakenly map non-negative text to negative.

For example, it maps the following sentence to negative, while the binary model correctly maps it to positive.
I was able to stream video and surf the internet for well over 7 hours without any hiccups .

So to fix this, when the 5-class model (sentiment.ser.gz) maps a sentence to negative, we run the binary model to recheck it: if the binary model agrees (also reports negative), there is no change; otherwise we change it to positive.

We calculate the score of every sentence and map the average score to the 5 classes. We give a negative sentence a smaller weight, as we don't trust that label.
package org.lifelongprogrammer.nlp;
public class StanfordNLPAnnotator extends JCasAnnotator_ImplBase {
	public static final String STANFORDNLP_ACTION_SENTIMENT = "sentiment";
	public static final String TYPE_STANDFORDNLP_OUTPUT = "org.apache.uima.standfordnlp.output";
	public static final String FS_STANDFORDNLP_OUTPUT_TYPE = TYPE_STANDFORDNLP_OUTPUT
			+ ":type";
	public static final String TYPE_STANFORDNLP_INPUT = "org.apache.uima.stanfordnlp.input";
	public static final String FS_STANFORDNLP_INPUT_ACTION = TYPE_STANFORDNLP_INPUT
			+ ":action";

	private static Splitter splitter = Splitter.on(",").trimResults()
			.omitEmptyStrings();
	public static final String SENTIWORDNET_FILE_PARAM = "SentiwordnetFile";

	private StanfordCoreNLP sentiment5ClassesPipeline,
			sentiment2ClassesPipeline;
	private SWN3 sentiwordnet;
	private ExecutorService threadpool;
	private Logger logger;
	public void initialize(UimaContext aContext)
			throws ResourceInitializationException {
		super.initialize(aContext);
		this.logger = getContext().getLogger();
		reconfigure();
	}

	public void reconfigure() throws ResourceInitializationException {
		try {
			threadpool = Executors.newCachedThreadPool();
			String dataPath = getContext().getDataPath();
			Properties props = new Properties();
			props.setProperty("annotators",
					"tokenize, ssplit, parse, sentiment");
			props.put("sentiment.model",
					"edu/stanford/nlp/models/sentiment/sentiment.ser.gz");

			sentiment5ClassesPipeline = new StanfordCoreNLP(props);
			props.put("sentiment.model",
					"edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz");
			sentiment2ClassesPipeline = new StanfordCoreNLP(props);

			String sentiwordnetFile = (String) getContext()
					.getConfigParameterValue(SENTIWORDNET_FILE_PARAM);
			sentiwordnet = new SWN3(
					new File(dataPath, sentiwordnetFile).getPath());
		} catch (Exception e) {
			logger.log(Level.SEVERE, e.getMessage());
			throw new ResourceInitializationException(e);
		}
	}
	public void process(JCas jcas) throws AnalysisEngineProcessException {
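		// final so the anonymous Callable below can capture it
		final CAS cas = jcas.getCas();
		ArrayList<String> action = getAction(cas);
		List<Future<Void>> futures = new ArrayList<Future<Void>>();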
		if (action.contains(STANFORDNLP_ACTION_SENTIMENT)) {
			Future<Void> future = threadpool.submit(new Callable<Void>() {
				@Override
				public Void call() throws Exception {
					checkSentiment(cas);
					return null;
				}
			});
			futures.add(future);
		}
		for (Future<Void> future : futures) {
			try {
				future.get();
			} catch (InterruptedException | ExecutionException e) {
				throw new AnalysisEngineProcessException(e);
			}
		}
		logger.log(Level.FINE, "StanfordNERAnnotator done.");
	}

  
	private void checkSentiment(CAS cas) {
		String sentimentText = getSentimentSentence(cas.getDocumentText())
				.toString();

		Annotation annotation = sentiment5ClassesPipeline.process(sentimentText);
		TypeSystem ts = cas.getTypeSystem();
		Type dyOutputType = ts.getType(TYPE_STANDFORDNLP_OUTPUT);
		org.apache.uima.cas.Feature dyOutputTypeFt = ts
				.getFeatureByFullName(FS_STANDFORDNLP_OUTPUT_TYPE);
        
		SentimentAccumulator accumulator = new SentimentAccumulator();
		for (CoreMap sentenceCore : annotation
				.get(CoreAnnotations.SentencesAnnotation.class)) {
			Tree tree = sentenceCore
					.get(SentimentCoreAnnotations.AnnotatedTree.class);
			int predictedClass = RNNCoreAnnotations.getPredictedClass(tree);
			String sentence = sentenceCore.toString();
			if (predictedClass == 1) {
				int old = predictedClass;
				predictedClass = checkNegative(sentence);
				System.out.println("Sentiment changed from " + old + " to "
						+ predictedClass + " String: " + sentence);
			} 
			accumulator.accumulate(predictedClass, sentence.length());
		}
		AnnotationFS dyAnnFS = cas.createAnnotation(dyOutputType, 0, 0);
		dyAnnFS.setStringValue(dyOutputTypeFt, accumulator.getResult());
		cas.getIndexRepository().addFS(dyAnnFS);
	}
  
	private ArrayList<String> getAction(CAS cas) {
		TypeSystem ts = cas.getTypeSystem();
		Type dyInputType = ts.getType(TYPE_STANFORDNLP_INPUT);
		org.apache.uima.cas.Feature dyInputTypesFt = ts
				.getFeatureByFullName(FS_STANFORDNLP_INPUT_ACTION);
		FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();
		String action = "";
		while (dyIt.hasNext()) {
			// TODO this is kind of weird
			AnnotationFS afs = (AnnotationFS) dyIt.next();
			String str = afs.getStringValue(dyInputTypesFt);
			if (str != null) {
				action = str;
			}
		}
		return Lists.newArrayList(splitter.split(action));
	}
  
  

	class SentimentAccumulator {
		private double totalScore;
		private int sentCount;
		public SentimentAccumulator() {}
		public void accumulate(int type, int sentLen) {
			// sentLen is currently unused: every sentence has the same weight
			calc5ClassModel(type);
		}
		private void calc5ClassModel(int type) {
			++sentCount;
			// very negative
			switch (type) {
			case 0:
				totalScore += -5;
				break;
			case 1:
				totalScore += -1; // give smaller value
				break;
			case 2:
				totalScore += 0;
				break;
			case 3:
				totalScore += 2;
				break;
			case 4:
				totalScore += 5;
				break;
			default:
				// ignore this
				logger.log(Level.SEVERE, "unknown type:" + type);
				--sentCount;
			}
		}

		public String getResult() {
			// No sentence was scored: avoid 0/0 (NaN), which would
			// otherwise fall through to "very negative".
			if (sentCount == 0) {
				return "neutral";
			}
			double avgScore = totalScore / sentCount;
			logger.log(Level.INFO, "avgScore: " + avgScore
					+ ", totalScore: " + totalScore + ", sentCount: "
					+ sentCount);

			if (avgScore > 2) {
				return "very positive";
			} else if (avgScore > 0.5) {
				return "positive";
			} else if (avgScore > -0.5) {
				// (-0.5, 0.5]: neutral
				return "neutral";
			} else if (avgScore > -2) {
				return "negative";
			} else {
				return "very negative";
			}
		}
	}

	public StringBuilder getSentimentSentence(String text) {
		DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(
				text));
		// List<String> sentenceList = new LinkedList<String>();
		StringBuilder sentenceList = new StringBuilder();
		Iterator<List<HasWord>> it = dp.iterator();
		while (it.hasNext()) {
			StringBuilder sentenceSb = new StringBuilder();
			List<HasWord> sentence = it.next();

			boolean hasFeeling = false;
			Iterator<HasWord> inner = sentence.iterator();
			while (inner.hasNext()) {
				HasWord token = inner.next();
				sentenceSb.append(token.word());

				if (inner.hasNext()) {
					sentenceSb.append(" ");
				}
				String feeling = sentiwordnet.extractFelling(token.word(), "a");
				if (!"neutral".equals(feeling)) {
					hasFeeling = true;
					System.out.println(feeling + ":" + token);
				}
			}
			if (hasFeeling) {
				// keep sentences separated so the pipeline can split them again
				sentenceList.append(sentenceSb.toString()).append(" ");
			}
		}
		return sentenceList;
	}

	private int checkNegative(String sentence) {
		Annotation annotation = sentiment2ClassesPipeline.process(sentence);

		for (CoreMap sentenceCore : annotation
				.get(CoreAnnotations.SentencesAnnotation.class)) {

			Tree tree = sentenceCore
					.get(SentimentCoreAnnotations.AnnotatedTree.class);
			int newPredict = RNNCoreAnnotations.getPredictedClass(tree);
			// if binary checker still returns negative then use negative
			if (newPredict == 0) {
				return 1;
			} else {
				return 3;
			}
		}
		return 1;
	}  
}
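
To try the scoring rules without UIMA or the Stanford models, here is a standalone sketch of the recheck and averaging logic described in "How it Works". SentimentScore is a hypothetical class; the weights and thresholds are copied from SentimentAccumulator, and returning "neutral" for an empty input is my assumption.

```java
class SentimentScore {
    // Weights for the 5 classes 0..4; class 1 (negative) gets a small weight.
    static final int[] WEIGHTS = {-5, -1, 0, 2, 5};

    /**
     * Rechecks a 5-class "negative" (1) with the binary model's answer:
     * binary 0 (negative) confirms it, anything else turns it into positive (3).
     */
    static int recheck(int fiveClass, int binaryClass) {
        if (fiveClass != 1) {
            return fiveClass;
        }
        return binaryClass == 0 ? 1 : 3;
    }

    /** Averages the per-sentence weights and maps the average to a label. */
    static String label(int[] sentenceClasses) {
        if (sentenceClasses.length == 0) {
            return "neutral"; // assumption: nothing opinionated to score
        }
        double total = 0;
        for (int c : sentenceClasses) {
            total += WEIGHTS[c];
        }
        double avg = total / sentenceClasses.length;
        if (avg > 2) return "very positive";
        if (avg > 0.5) return "positive";
        if (avg > -0.5) return "neutral";
        if (avg > -2) return "negative";
        return "very negative";
    }

    public static void main(String[] args) {
        // Three sentences: very positive, positive, and a "negative" that
        // the binary model overrules.
        int[] raw = {4, 3, 1};
        int[] rechecked = {4, 3, recheck(1, 1)};
        System.out.println(label(raw));       // average (5 + 2 - 1) / 3 = 2.0
        System.out.println(label(rechecked)); // average (5 + 2 + 2) / 3 = 3.0
    }
}
```

With the raw labels the average is exactly 2.0, which is not greater than 2, so the result is "positive"; after the recheck the average is 3.0 and the result becomes "very positive".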
Descriptor File: StanfordNLPAnnotator.xml
We define uima types: org.apache.uima.stanfordnlp.input and org.apache.uima.stanfordnlp.output, and the configuration parameter: SentiwordnetFile.
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
	<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
	<primitive>true</primitive>
	<annotatorImplementationName>org.lifelongprogrammer.nlp.StanfordNLPAnnotator
	</annotatorImplementationName>
	<analysisEngineMetaData>
		<name>StanfordNLPAnnotatorAE</name>
		<description>StanfordNLPAnnotator Wrapper.</description>
		<version>1.0</version>
		<vendor>LifeLong Programmer, Inc.</vendor>
		<configurationParameters>
			<configurationParameter>
				<name>SentiwordnetFile</name>
				<description>Filename of the sentiwordnet file.</description>
				<type>String</type>
				<multiValued>false</multiValued>
				<mandatory>true</mandatory>
			</configurationParameter>
		</configurationParameters>
		<configurationParameterSettings>
			<nameValuePair>
				<name>SentiwordnetFile</name>
				<value>
					<string>dicts\SentiWordNet_3.0.0_20130122.txt</string>
				</value>
			</nameValuePair>
		</configurationParameterSettings>
		<typeSystemDescription>
			<typeDescription>
				<name>org.apache.uima.stanfordnlp.input</name>
				<description />
				<supertypeName>uima.tcas.Annotation</supertypeName>
				<features>
					<featureDescription>
						<name>action</name>
						<description />
						<rangeTypeName>uima.cas.String</rangeTypeName>
					</featureDescription>
				</features>
			</typeDescription>
			<typeDescription>
				<name>org.apache.uima.standfordnlp.output</name>
				<description />
				<supertypeName>uima.tcas.Annotation</supertypeName>
				<features>
					<featureDescription>
						<name>type</name>
						<description />
						<rangeTypeName>uima.cas.String</rangeTypeName>
					</featureDescription>
				</features>
			</typeDescription>
		</typeSystemDescription>
	</analysisEngineMetaData>
</analysisEngineDescription>
Annotator Test case
See the previous post on how to use sujitpal's UimaUtils.java to test the StanfordNLPAnnotator.

Running Stanford Named Entity Recognition in UIMA

The Goal
To improve our text analytics project, after integrating OpenNLP with UIMA, we are now trying to integrate Stanford NLP NER (Named Entity Recognition) into UIMA.

StanfordNLPAnnotator
Feature Structure: org.apache.uima.stanfordnlp.input:action
We use StanfordNLPAnnotator as the gateway or facade: the client uses org.apache.uima.stanfordnlp.input:action to specify what to extract: action=ner to run named entity extraction, or action=sentiment to run sentiment analysis.

We use a dynamic output entity, org.apache.uima.stanfordnlp.output, whose type specifies whether it's a person, an organization, etc.

The configuration parameter ClassifierFile specifies the model file that NER uses.

package org.lifelongprogrammer.nlp;
public class StanfordNLPAnnotator extends JCasAnnotator_ImplBase {
 public static final String STANFORDNLP_ACTION_NER = "ner";
 public static final String TYPE_STANDFORDNLP_OUTPUT = "org.apache.uima.standfordnlp.output";
 public static final String FS_STANDFORDNLP_OUTPUT_TYPE = TYPE_STANDFORDNLP_OUTPUT
   + ":type";
 public static final String TYPE_STANFORDNLP_INPUT = "org.apache.uima.stanfordnlp.input";
 public static final String FS_STANFORDNLP_INPUT_ACTION = TYPE_STANFORDNLP_INPUT
   + ":action";

 // http://nlp.stanford.edu/software/CRF-NER.shtml
 private static final Set<String> NER_TYPES = new HashSet<String>(
   Arrays.asList("PERSON", "ORGANIZATION", "LOCATION", "MISC", "TIME",
     "MONEY", "PERCENT", "DATE"));
          
 private static Splitter splitter = Splitter.on(",").trimResults()
   .omitEmptyStrings();
 public static final String CLASSIFIER_FILE_PARAM = "ClassifierFile";
 private CRFClassifier<CoreLabel> crf;
 private ExecutorService threadpool;
 private Logger logger;

 public void initialize(UimaContext aContext)
   throws ResourceInitializationException {
  super.initialize(aContext);
  this.logger = getContext().getLogger();
  reconfigure();
 }
 public void reconfigure() throws ResourceInitializationException {
  try {
   threadpool = Executors.newCachedThreadPool();
   String dataPath = getContext().getDataPath();

   String classifierFile = (String) getContext()
     .getConfigParameterValue(CLASSIFIER_FILE_PARAM);
   System.out.println(classifierFile);
   crf = CRFClassifier
     .getClassifier(new File(dataPath, classifierFile));
  } catch (Exception e) {
   logger.log(Level.SEVERE, e.getMessage());
   throw new ResourceInitializationException(e);
  }
 }
  
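 // jcas is final so that the anonymous Callable below can capture it (required before Java 8)
 public void process(final JCas jcas) throws AnalysisEngineProcessException {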
  CAS cas = jcas.getCas();
  ArrayList<String> action = getAction(cas);
  List<Future<Void>> futures = new ArrayList<Future<Void>>();
  if (action.contains(STANFORDNLP_ACTION_NER)) {
   Future<Void> future = threadpool.submit(new Callable<Void>() {
    @Override
    public Void call() throws Exception {
     getNer(jcas);
     return null;
    }
   });

   futures.add(future);
  }
    //...
  for (Future<Void> future : futures) {
   try {
    future.get();
   } catch (InterruptedException | ExecutionException e) {
    throw new AnalysisEngineProcessException(e);
   }
  }
  logger.log(Level.FINE, "StanfordNERAnnotator done.");
 }
  
 private ArrayList<String> getAction(CAS cas) {
  TypeSystem ts = cas.getTypeSystem();
  Type dyInputType = ts.getType(TYPE_STANFORDNLP_INPUT);
  org.apache.uima.cas.Feature dyInputTypesFt = ts
    .getFeatureByFullName(FS_STANFORDNLP_INPUT_ACTION);

  FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();
  String action = "";
  while (dyIt.hasNext()) {
   // TODO this is kind of weird
   AnnotationFS afs = (AnnotationFS) dyIt.next();
   String str = afs.getStringValue(dyInputTypesFt);
   if (str != null) {
    action = str;
   }
  }
  return Lists.newArrayList(splitter.split(action));
 }
  
 private void getNer(JCas jcas) {
    CAS cas=jcas.getCas();
  String docText = jcas.getDocumentText();
  List<List<CoreLabel>> classify = crf.classify(docText);

  MatchedNER preNER = null;

  TypeSystem ts = jcas.getTypeSystem();
  Type dyOutputType = ts.getType(TYPE_STANDFORDNLP_OUTPUT);
  org.apache.uima.cas.Feature dyOutputTypeFt = ts
    .getFeatureByFullName(FS_STANDFORDNLP_OUTPUT_TYPE);

  // merge co-located same entity
  for (List<CoreLabel> coreLabels : classify) {
   for (CoreLabel coreLabel : coreLabels) {
    String category = coreLabel
      .get(CoreAnnotations.AnswerAnnotation.class);
    if (NER_TYPES.contains(category)) {
     if (preNER == null) {
      preNER = new MatchedNER(category,
        coreLabel.beginPosition(),
        coreLabel.endPosition());
     } else if (category.equals(preNER.getCategory())) {
      preNER = new MatchedNER(category,
        preNER.getEntityBegin(),
        coreLabel.endPosition());
     } else {
      // add preNER
      addNER(preNER, cas, dyOutputType, dyOutputTypeFt);
      preNER = new MatchedNER(category,
        coreLabel.beginPosition(),
        coreLabel.endPosition());
     }
    } else {
     if (preNER != null) {
      addNER(preNER, cas, dyOutputType, dyOutputTypeFt);
      preNER = null;
     }

    }
   }
  }
  if (preNER != null) {
   addNER(preNER, cas, dyOutputType, dyOutputTypeFt);
  }
 }
 private void addNER(MatchedNER preNER, CAS cas, Type dyOutputType,
   org.apache.uima.cas.Feature dyOutputTypeFt) {
  AnnotationFS dyAnnFS = cas.createAnnotation(dyOutputType,
    preNER.getEntityBegin(), preNER.getEntityEnd());
  dyAnnFS.setStringValue(dyOutputTypeFt, preNER.getCategory()
    .toLowerCase());
  cas.getIndexRepository().addFS(dyAnnFS);
 }

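 class MatchedNER {
  private final String cat;
  private final int entityBegin, entityEnd;

  public MatchedNER(String cat, int entityBegin, int entityEnd) {
   this.cat = cat;
   this.entityBegin = entityBegin;
   this.entityEnd = entityEnd;
  }

  // getters used by getNer() and addNER()
  public String getCategory() { return cat; }
  public int getEntityBegin() { return entityBegin; }
  public int getEntityEnd() { return entityEnd; }
 }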
}
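The "merge co-located same entity" loop in getNer() is easier to follow in isolation. Here is a minimal, library-free sketch of the same idea: adjacent tokens with the same label are folded into one span. The Token class, the NerSpanMerger name and the use of "O" as the background label are assumptions made for this illustration; the annotator above filters with its NER_TYPES set instead.

```java
import java.util.ArrayList;
import java.util.List;

final class NerSpanMerger {

    /** One classified token: its label and character offsets (hypothetical helper type). */
    static final class Token {
        final String label;
        final int begin;
        final int end;

        Token(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Renders a span as "LABEL[begin,end)". */
    static String format(Token t) {
        return t.label + "[" + t.begin + "," + t.end + ")";
    }

    /** Merges runs of adjacent tokens that share the same non-"O" label. */
    static List<String> merge(List<Token> tokens) {
        List<String> spans = new ArrayList<>();
        Token open = null; // the span currently being extended, if any
        for (Token t : tokens) {
            boolean isEntity = !"O".equals(t.label);
            if (open != null && isEntity && t.label.equals(open.label)) {
                // same label as the open span: extend its end offset
                open = new Token(open.label, open.begin, t.end);
                continue;
            }
            if (open != null) {
                spans.add(format(open)); // label changed or entity ended: flush
            }
            open = isEntity ? t : null;
        }
        if (open != null) {
            spans.add(format(open)); // flush the last open span
        }
        return spans;
    }
}
```

Feeding it two consecutive PERSON tokens followed by a background token yields a single PERSON span covering both tokens, which is what the annotator writes into the CAS.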
Descriptor File: StanfordNLPAnnotator.xml
We define the UIMA types org.apache.uima.stanfordnlp.input and org.apache.uima.standfordnlp.output, and the configuration parameter ClassifierFile.
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
 <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
 <primitive>true</primitive>
 <annotatorImplementationName>org.lifelongprogrammer.nlp.StanfordNLPAnnotator
 </annotatorImplementationName>
 <analysisEngineMetaData>
  <name>StanfordNLPAnnotatorAE</name>
  <description>StanfordNLPAnnotator Wrapper.</description>
  <version>1.0</version>
  <vendor>LifeLong Programmer, Inc.</vendor>
  <configurationParameters>
   <configurationParameter>
    <name>ClassifierFile</name>
    <description>Filename of the classifier file.</description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>true</mandatory>
   </configurationParameter>
  </configurationParameters>
  <configurationParameterSettings>
   <nameValuePair>
    <name>ClassifierFile</name>
    <value>
     <!-- relative to pear resource file -->
     <string>models\classifiers\english.muc.7class.distsim.crf.ser.gz
     </string>
    </value>
   </nameValuePair>
  </configurationParameterSettings>
  <typeSystemDescription>
   <typeDescription>
    <name>org.apache.uima.stanfordnlp.input</name>
    <description />
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
     <featureDescription>
      <name>action</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
     </featureDescription>
    </features>
   </typeDescription>

   <typeDescription>
    <name>org.apache.uima.standfordnlp.output</name>
    <description />
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
     <featureDescription>
      <name>type</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
     </featureDescription>
    </features>
   </typeDescription>
  </typeSystemDescription>
 </analysisEngineMetaData>
</analysisEngineDescription>
Annotator Test case
Here we use sujitpal's UimaUtils.java. The test adds the feature org.apache.uima.stanfordnlp.input:action=ner to the CAS, runs the analysis engine on it, and then reads the org.apache.uima.standfordnlp.output annotations from the result.
private static final Joiner joiner = Joiner.on(",");
@Test
public void testStanfordNLPAnnotator() throws Exception {
  AnalysisEngine ae = UimaUtils.getAE("%ABS_PATH%\\StanfordNLPAnnotator.xml", null);
  for (String input : INPUTS) {
    JCas jcas = ae.newJCas();
    addFSAction(jcas,Lists.newArrayList(StanfordNLPAnnotator.STANFORDNLP_ACTION_NER));
    jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT, jcas);

    Feature feature = jcas.getTypeSystem().getFeatureByFullName(
        "org.apache.uima.standfordnlp.output:type");
    org.apache.uima.cas.TypeSystem ts = jcas.getTypeSystem();
    org.apache.uima.cas.Type dyOutputType = ts
        .getType("org.apache.uima.standfordnlp.output");

    FSIndex<? extends Annotation> index = jcas
        .getAnnotationIndex(dyOutputType);
    for (Iterator<? extends Annotation> it = index.iterator(); it
        .hasNext();) {
      Annotation annotation = it.next();
      System.out.println("...(" + annotation.getBegin() + ","
          + annotation.getEnd() + "): "
          + annotation.getCoveredText() + ", type: "
          + annotation.getFeatureValueAsString(feature));
    }
  }
  ae.destroy();
}
private void addFSAction(JCas jcas, List<String> action) {
  TypeSystem ts = jcas.getTypeSystem();
  Feature ft = ts
      .getFeatureByFullName(StanfordNLPAnnotator.FS_STANFORDNLP_INPUT_ACTION);
  Type type = ts.getType(StanfordNLPAnnotator.TYPE_STANFORDNLP_INPUT);

  FeatureStructure fs = jcas.getCas().createFS(type);
  fs.setStringValue(ft, joiner.join(action));
  jcas.addFsToIndexes(fs);
}

Using lucene-appengine & google-http-java-client to Crawl Blogger on GAE

The Goal
In my latest project, I need to develop a GAE Java application that crawls a Blogger site and saves the index into Lucene on GAE.

This post shows how to deploy lucene-appengine and use google-http-java-client to parse sitemap.xml to get all posts, crawl each post, save the index to lucene-appengine on GAE, and then use a GAE cron task to index new posts periodically.

Creating Maven GAE project & Adding Dependencies
First, see GAE: Using Apache Maven to create a Maven project from appengine-skeleton-archetype.

Then download the lucene-appengine-examples source code, copy the dependencies it needs from its pom.xml, and add google-http-client, google-http-client-appengine and google-http-client-xml to our pom.xml.

Using google-http-java-client to Parse sitemap.xml
The google-http-java-client library lets us convert an XML response into a Java object with com.google.api.client.http.HttpResponse.parseAs(SomeClass.class); all we need to do is define the Java class.

Here is Blogger's sitemap.xml for lifelongprogrammer:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://lifelongprogrammer.blogspot.com/2014/11/using-solr-classifier-to-categorize-articles.html</loc>
    <lastmod>2014-11-04T22:49:54Z</lastmod>
  </url>
</urlset>

So we can map it to two classes, Urlset and TUrl. The key is to use @com.google.api.client.util.Key to map each Java field to the XML element of the same name.
public class Urlset {
 @Key
 protected List<TUrl> url = new ArrayList<>();

 public List<TUrl> getUrl() {
  return url;
 }
}
public class TUrl {
 @Key
 protected String loc;
 @Key
 protected String lastmod;
  // omitted the getters
}
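To make the shape of the data concrete, here is a JDK-only sketch that reads the same loc and lastmod pairs with the built-in DOM parser. It is not how google-http-java-client works internally, just an equivalent reading of the document. The SitemapSketch name is made up for this example, and it assumes lastmod uses the UTC "Z" form shown in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class SitemapSketch {

    /** Returns loc -> lastmod for every url entry, in document order. */
    static Map<String, Instant> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // sitemap elements live in a default namespace
            org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Map<String, Instant> result = new LinkedHashMap<>();
            NodeList urls = doc.getElementsByTagNameNS("*", "url");
            for (int i = 0; i < urls.getLength(); i++) {
                Element url = (Element) urls.item(i);
                String loc = childText(url, "loc");
                String lastmod = childText(url, "lastmod");
                if (loc != null && lastmod != null) {
                    // Instant.parse handles ISO-8601 instants such as 2014-11-04T22:49:54Z
                    result.put(loc, Instant.parse(lastmod));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable sitemap", e);
        }
    }

    /** Text of the first descendant element with the given local name, or null. */
    private static String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS("*", localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

The omitted getLastmodDate() getter in TUrl would need a similar conversion from the lastmod string to a date.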

Then use the following code to parse sitemap.xml into a Urlset Java object.
static final HttpTransport HTTP_TRANSPORT = new NetHttpTransport();
static final XmlNamespaceDictionary XML_DICT = new XmlNamespaceDictionary();

HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory(new HttpRequestInitializer() {
      @Override
      public void initialize(HttpRequest request) {
        request.setParser(new XmlObjectParser(XML_DICT));
      }
    });

HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(sitemapUrl));
HttpResponse response = request.execute();
Urlset urls = response.parseAs(Urlset.class);

When parsing each post, we can use the following code to get the post's HTML as a string:
HttpRequestFactory requestFactory = HTTP_TRANSPORT.createRequestFactory();
HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(url.getLoc()));
HttpResponse response = request.execute();

String html = response.parseAsString();

LAEUtil
The following is the complete code, which parses the sitemap, then crawls each post and saves the index into lucene-appengine.
public class LAEUtil {
 private static final Logger logger = LoggerFactory.getLogger(LAEUtil.class);
 private static final Version LUCENE_VERSION = Version.LUCENE_4_10_2;

 static final HttpTransport HTTP_TRANSPORT = new NetHttpTransport();
 static final XmlNamespaceDictionary XML_DICT = new XmlNamespaceDictionary();

 public static void crawl(String indexName, String sitemapUrl,
   long maxSeconds) throws IOException {
  Stopwatch stopwatch = Stopwatch.createStarted();
  IndexReader reader = null;
  try (GaeDirectory directory = new GaeDirectory(indexName)) {
   try {
    reader = DirectoryReader.open(directory);
   } catch (IndexNotFoundException e) {
    createIndex(directory);
    reader = DirectoryReader.open(directory);
   }

   IndexSearcher searcher = new IndexSearcher(reader);
   Date crawledMinDate = getCrawledMinMaxDate(searcher, false);
   Date crawlMaxDate = getCrawledMinMaxDate(searcher, true);

   reader.close();
   crawl(directory, stopwatch, indexName, sitemapUrl, crawledMinDate,
     crawlMaxDate, maxSeconds);
  } catch (IOException e) {
   logger.error("crawl failed with error", e);
  }
 }

 private static void createIndex(GaeDirectory directory) throws IOException {
  try (IndexWriter writer = new IndexWriter(directory,
    getIndexWriterConfig(LUCENE_VERSION, getAnalyzer()))) {
  }
 }

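 /** Returns the newest LASTMOD in the index when newest is true, otherwise the oldest; null if the index is empty. */
 private static Date getCrawledMinMaxDate(IndexSearcher searcher,
   boolean newest) throws IOException {
  Query q = new MatchAllDocsQuery();
  Date minMaxDate = null;
  // reverse=true sorts descending, so the first hit is the newest document
  boolean reverse = newest;
  TopFieldDocs docs = searcher.search(q, 1, new Sort(new SortField(
    Fields.LASTMOD, SortField.Type.LONG, reverse)));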

  ScoreDoc[] hits = docs.scoreDocs;
  if (hits.length != 0) {
   Document doc = searcher.doc(hits[0].doc);
   minMaxDate = new Date(Long.parseLong(doc.get(Fields.LASTMOD)));
  }
  return minMaxDate;
 }

 /** post between [crawledMinDate to crawledMaxDate] is already crawled  */
 private static void crawl(GaeDirectory directory, Stopwatch stopwatch,
   String indexName, String sitemapUrl, Date crawledMinDate,
   Date crawlMaxDate, long maxSeconds) throws IOException {
  HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory(new HttpRequestInitializer() {
     @Override
     public void initialize(HttpRequest request) {
      request.setParser(new XmlObjectParser(XML_DICT));
     }
    });

  HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(
    sitemapUrl));

  HttpResponse response = request.execute();
  Urlset urls = response.parseAs(Urlset.class);
  PorterAnalyzer analyzer = getAnalyzer();

  // posts are sorted by lastMod in sitemap.xml
  int added = 0;
  try (IndexWriter writer = new IndexWriter(directory,
    getIndexWriterConfig(LUCENE_VERSION, analyzer))) {

   for (TUrl url : urls.getUrl()) {
    // will not happen
    Date lastmod = url.getLastmodDate();
    if (lastmod == null)  continue;

    if (stopwatch.elapsed(TimeUnit.SECONDS) >= maxSeconds) {
     logger.error("Exceeded time limit " + maxSeconds
       + ", already run "
       + stopwatch.elapsed(TimeUnit.SECONDS) + " seconds");
     break;
    }
    boolean post = false;
    if (crawlMaxDate == null || crawledMinDate == null) {
     post = true;
    }
    if (crawlMaxDate != null && lastmod.after(crawlMaxDate)) {
     post = true;
    } else if (crawledMinDate != null
      && url.getLastmodDate().before(crawledMinDate)) {
     post = true;
    }
    if (post) {
     crawlPost(url, writer);
     ++added;
     if (added == 20) {
      writer.commit();
      added = 0;
     }
    } else {
     logger.debug("ignore " + url + " : lastmod " + lastmod
       + ", crawlMaxDate: " + crawlMaxDate
       + ", crawledMinDate: " + crawledMinDate);
    }
   }
   logger.info("started to commit");
   writer.commit();
   logger.info("commit finished.");
  }
 }

 private static PorterAnalyzer getAnalyzer() {
  return new PorterAnalyzer(LUCENE_VERSION);
 }
  
 private static void crawlPost(TUrl url, IndexWriter writer)
   throws IOException {
  logger.info(url.getLoc() + " : " + url.getLastmod());
  HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory();
  HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(url
    .getLoc()));
  HttpResponse response = request.execute();

  String html = response.parseAsString();
  Document luceneDoc = new Document();
  luceneDoc.add(new StringField(Fields.ID, url.getLoc(), Store.YES));
  luceneDoc.add(new TextField(Fields.URL, url.getLoc(), Store.YES));

  luceneDoc.add(new TextField(Fields.RAWCONTENT, html, Store.YES));

  ArticleExtractor articleExtractor = ArticleExtractor.getInstance();

  org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
  luceneDoc.add(new TextField(Fields.TITLE, jsoupDoc.title(), Store.YES));

  html = normalize(jsoupDoc);
  try {
   String mainContent = articleExtractor.getText(html);
   luceneDoc.add(new TextField(Fields.MAINCONTENT, mainContent,
     Store.YES));
  } catch (BoilerpipeProcessingException e) {
   throw new RuntimeException(e);
  }
  luceneDoc.add(new LongField(Fields.LASTMOD, url.getLastmodDate()
    .getTime(), Store.YES));
  writer.addDocument(luceneDoc);
 }
}
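The decision of whether a post still needs to be crawled is the core of the incremental crawl, so here it is on its own. This is a small sketch of the same rule as in crawl(): anything outside the already indexed window [crawledMinDate, crawlMaxDate] is crawled. The class and method names are made up for this example.

```java
import java.util.Date;

final class CrawlWindow {

    /**
     * Returns true when a post with the given lastmod falls outside the
     * already indexed window [crawledMin, crawledMax].
     */
    static boolean needsCrawl(Date lastmod, Date crawledMin, Date crawledMax) {
        if (crawledMin == null || crawledMax == null) {
            return true; // empty index: everything is new
        }
        // newer than the newest indexed post, or older than the oldest one
        return lastmod.after(crawledMax) || lastmod.before(crawledMin);
    }

    public static void main(String[] args) {
        Date min = new Date(100L);
        Date max = new Date(200L);
        System.out.println(needsCrawl(new Date(150L), min, max));
        System.out.println(needsCrawl(new Date(250L), min, max));
    }
}
```

Because the window only grows at its two ends, a run that stops early at the time limit can pick up the remaining older posts on the next run.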
BloggerCrawler Servlet
We can call the BloggerCrawler servlet manually to test our crawler. When we test or call the servlet manually, we set maxseconds to a smaller value because of the GAE request handler time limit; when we call it from a cron task, we set it to 8 minutes (the time limit for a task is 10 minutes).
public class BloggerCrawler extends HttpServlet {
 private static final Logger logger = LoggerFactory
   .getLogger(BloggerCrawler.class);
 protected void doGet(HttpServletRequest req, HttpServletResponse resp)
   throws ServletException, IOException {

  String site = Preconditions.checkNotNull(req.getParameter("sitename"),
    "site can't be null");

  String indexName = site;
  if (site.endsWith("blogspot.com")) {
   throw new IllegalArgumentException("not valid sitename: " + site);
  }
  String sitemapUrl = "http://" + site + ".blogspot.com/sitemap.xml";

  int maxseconds = getMaxSeconds(req);
  logger.info("started to crawl " + sitemapUrl);
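  LAEUtil.crawl(indexName, sitemapUrl, maxseconds);
  // Don't call super.doGet(): HttpServlet's default implementation replies with an error status.
  resp.setContentType("text/plain");
  resp.getWriter().println("crawl finished for " + sitemapUrl);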
 }
 private int getMaxSeconds(HttpServletRequest req) {
  int maxseconds = 40;
  String str = req.getParameter("maxseconds");
  if (str != null) {
   maxseconds = Integer.parseInt(str);
  }
  return maxseconds;
 }
}

Scheduled Crawler with GAE Cron
We can use GAE cron to call the crawler servlet periodically, for example every 12 hours. See Scheduled Tasks With Cron for Java for more about GAE cron.
Note that the local development server does not execute cron jobs and does not show the Cron Jobs link; the deployed App Engine application shows cron jobs and executes them.
All we need to do is add the cron task to cron.xml:
<cronentries>
  <cron>
    <url>/crawl?sitename=lifelongprogrammer&amp;maxseconds=480</url>
    <description>Crawl lifelongprogrammer every 12 hours</description>
    <schedule>every 12 hours</schedule>
  </cron>
</cronentries>
References
lucene-appengine
GAE: Using Apache Maven
Scheduled Tasks With Cron for Java

Handling gzip Response in Apache HttpClient 4.2

The Problem
My application uses Apache HttpClient 4.2, but for some web pages the response comes back as garbled characters.

Replaying the request in Fiddler's Composer showed that the response is gzipped:
Content-Encoding: gzip

The Solution
In Apache HttpClient 4.2, the DefaultHttpClient doesn't support compression, so it doesn't decompress the response. We have to use DecompressingHttpClient.
public void usingDefaultHttpClient() throws Exception {
  // output would be garbled characters in HttpClient 4.2.
  HttpClient httpClient = new DefaultHttpClient();
  getContent(httpClient, new URI(URL_STRING));
}

public void usingDecompressingHttpClient() throws Exception {
  // use DecompressingHttpClient to handle gzip responses in HttpClient 4.2.
  HttpClient httpClient = new DecompressingHttpClient(
      new DefaultHttpClient());
  getContent(httpClient, new URI(URL_STRING));
}

private void getContent(HttpClient httpClient, URI url) throws IOException,
    ClientProtocolException {
  HttpGet httpGet = new HttpGet(url);
  HttpResponse httpRsp = httpClient.execute(httpGet);
  String text = EntityUtils.toString(httpRsp.getEntity());

  for (Header header : httpRsp.getAllHeaders()) {
    System.out.println(header);
  }
  System.out.println(text);
}
The problem can also be fixed by upgrading HttpClient to 4.3.5: in this version the default HTTP client supports compression.

In HttpClient 4.3.5, DefaultHttpClient is deprecated, and it's recommended to use HttpClientBuilder instead:
public void usingHttpClientBuilderIn43() throws Exception {
  HttpClientBuilder builder = HttpClientBuilder.create();
  CloseableHttpClient httpClient = builder.build();
  getContent(httpClient, new URI(URL_STRING));
}
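To see why the undecoded body looks garbled, it helps to reproduce the effect without any HTTP at all. The sketch below uses only java.util.zip: it compresses a string, checks the two gzip magic bytes (0x1f 0x8b) and decompresses it again. The GzipSketch class and its method names are made up for this example; this is the kind of decoding a decompressing client does for us when it sees Content-Encoding: gzip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {

    /** Compresses text the way a server would for Content-Encoding: gzip. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // stream is closed, so the trailer is written
    }

    /** True when the body starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksGzipped(byte[] body) {
        return body.length >= 2
                && (body[0] & 0xff) == 0x1f
                && (body[1] & 0xff) == 0x8b;
    }

    /** Decompresses a gzip body back into text. */
    static String gunzip(byte[] body) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Printing the compressed bytes as a string gives the same kind of garbage that EntityUtils.toString() returned with the plain DefaultHttpClient.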

Solr: Using Classifier to Categorize Articles

The Goal
In my latest project, I use crawler4j to crawl websites and the Solr summarizer to add a summary to each article.
Now I would like to use Solr Classification to sort articles into categories such as Java, Linux, News etc.

Using Solr Classifier
There are two steps when using Solr Classification:

Train
First we add docs with a known category. We can crawl known websites and, for example, assign java to the cat field for articles from javarevisited, linux for articles from linuxcommando, solr for articles from solrpl, and so on.
localhost:23456/solr/crawler/crawler?action=create,start&name=linuxcommando.blogspot&seeds=http://linuxcommando.blogspot.com/&maxCount=50&parsePaths=http://linuxcommando.blogspot.com/\d{4}/\d{2}/.*&constants=cat:linux

localhost:23456/solr/crawler/crawler?action=create,start&name=javarevisited.blogspot&seeds=http://javarevisited.blogspot.com/&maxCount=50&parsePaths=http://javarevisited.blogspot.com/\d{4}/\d{2}/.*&constants=cat:java

localhost:23456/solr/crawler/crawler?action=create,start&name=solrpl&seeds=http://solr.pl/en/&maxCount=50&parsePaths=http://solr.pl/en/\d{4}/\d{2}/.*&constants=cat:solr

Solr ClassfierUpdateProcessorFactory
public class ClassfierUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {  
  private boolean defaultDoClassifer;
  private String formField;
  private String catField;
  Classifier<BytesRef> classifier = null;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      defaultDoClassifer = params.getBool("doClassifer", false);
      if (defaultDoClassifer) {
        formField = Preconditions.checkNotNull(params.get("fromField"),
            "Have to set fromField");
        catField = Preconditions.checkNotNull(params.get("catField"),
            "Have to set catField");
        
        String classifierStr = params.get("classifier", "simpleNaive");
        if ("simpleNaive".equals(classifierStr)) {
          classifier = new SimpleNaiveBayesClassifier();
        } else if ("knearest".equalsIgnoreCase(classifierStr)) {
          classifier = new KNearestNeighborClassifier(10);
        } else {
          throw new IllegalArgumentException("Unsupported classifier: "
              + classifierStr);
        }
      }
    }
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new ClassfierUpdateProcessor(req, next);
  }
  
  private class ClassfierUpdateProcessor extends UpdateRequestProcessor {
    private SolrQueryRequest req;
    public ClassfierUpdateProcessor(SolrQueryRequest req,
        UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrParams params = req.getParams();
      boolean doClassifer = params.getBool("doClassifer", false);
      
      if (doClassifer) {
        try {
          classifier.train(req.getSearcher().getAtomicReader(), formField,
              catField, new StandardAnalyzer(Version.LUCENE_42));
          SolrInputDocument doc = cmd.solrDoc;
          Object obj = doc.getFieldValue(formField);
          if (obj != null) {
            String text = obj.toString();
            ClassificationResult<BytesRef> result = classifier
                .assignClass(text);
            
            String classified = result.getAssignedClass().utf8ToString();
            doc.addField(catField, classified);
          }
        } catch (IOException e) {
          throw new IOException(e);
        }
      }
      super.processAdd(cmd);
    } 
  } 
}
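For readers who have not used a naive Bayes classifier before, the train and assignClass steps can be sketched without Lucene. The class below is a toy multinomial naive Bayes with add-one smoothing; it is an illustration of the idea only, not the algorithm as implemented in SimpleNaiveBayesClassifier, and the TinyNaiveBayes name and its whitespace-style tokenizer are assumptions of this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TinyNaiveBayes {
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs;

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    /** Adds one labelled document to the model. */
    void train(String category, String text) {
        totalDocs++;
        docCounts.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String term : tokenize(text)) {
            if (term.isEmpty()) {
                continue;
            }
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(category, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    /** Returns the category with the highest log-probability, or null if untrained. */
    String assignClass(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // log prior: share of training documents in this category
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = termCounts.get(category);
            int total = totalTerms.getOrDefault(category, 0);
            for (String term : tokenize(text)) {
                if (term.isEmpty()) {
                    continue;
                }
                int count = counts.getOrDefault(term, 0);
                // add-one smoothing so unseen terms do not zero out the score
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

Training it on a few java and linux snippets and then calling assignClass on new text mirrors what the update processor does with the cat field, just at a much smaller scale.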
solrconfig.xml
Please check the previous post about the implementation of MainContentUpdateProcessorFactory.
<updateRequestProcessorChain name="crawlerUpdateChain">
  <processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
    <str name="fromField">rawcontent</str>
    <str name="mainContentField">maincontent</str>      
  </processor>

  <processor class="org.lifelongprogrammer.solr.update.processor.ClassfierUpdateProcessorFactory">
    <bool name="doClassifer">true</bool>
    <str name="fromField">maincontent</str>
    <str name="catField">cat</str>
  </processor>
  
  <processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory" >
  </processor>

  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<field name="rawcontent" type="text" indexed="false" stored="true" multiValued="true" />
<field name="maincontent" type="text" indexed="true" stored="true" multiValued="true" />
<field name="cat" type="string" indexed="true" stored="true" multiValued="true" />
<field name="summary" type="text_rev" indexed="true" stored="true" multiValued="true" />
Test Solr Classifier
Next, when we crawl a website that contains multiple categories, we can use Solr Classification to assign a category to each article.

For example, let's crawl lifelongprogrammer.blogspot:
localhost:23456/solr/crawler/crawler?action=create,start&name=lifelongprogrammer.blogspot&seeds=http://lifelongprogrammer.blogspot.com/&maxCount=50&parsePaths=http://lifelongprogrammer.blogspot.com/\d{4}/\d{2}/.*&doClassifer=true

We set doClassifer=true, so ClassfierUpdateProcessorFactory calls the Solr classifier to assign a label to the category field.

From the result, we can see that some articles are assigned to java, some to linux, and some to solr.

About Accuracy
The accuracy of Solr Classification is worse than Mahout's, but its performance is much better, and it's good enough for my application.


References
[SOLR-3975] Document Summarization toolkit, using LSA techniques
Comparing Document Classification Functions of Lucene and Mahout
Text categorization with Lucene and Solr

Solr: Using Summarizer (SOLR-3975) to Get Summaries of Articles

The Goal
In my latest project, I use crawler4j to crawl websites, and I would like to add a summary to each article.

After a Google search, I found the Solr Jira issue SOLR-3975, Document Summarization toolkit, using LSA techniques, and its author's articles (Document Summarization with LSA #1: Introduction), which describe how it works.

The patch is not checked in, but it works fine for me.
So I based my work on it: use boilerpipe to get the main content of the web page, then use SOLR-3975 to get the most important sentences.

Normalize Html Text and Get Main Content: MainContentUpdateProcessorFactory
First, I use JSoup to normalize the HTML: remove links, as they are usually used for navigation or contain JavaScript code, and also remove invisible blocks: style~=display:\\s*none

To help SOLR-3975 find the important sentences, I add a period (.) after div, span and textarea elements whose own text doesn't end with a period.
<processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
  <str name="fromField">rawcontent</str>
  <str name="mainContentField">maincontent</str>      
</processor>
It parses fromField, which contains the raw content of the web page, and stores the extracted main content in mainContentField.
public class MainContentUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {
  
  private String fromField;
  private String mainContentField;
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromField = Preconditions.checkNotNull(params.get("fromField"),
          "Have to set fromField");
      mainContentField = Preconditions.checkNotNull(
          params.get("mainContentField"), "Have to set mainContentField");
    }
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new MainContentUpdateProcessor(req, next, fromField,
        mainContentField);
  }
  
  private static class MainContentUpdateProcessor extends
      UpdateRequestProcessor {
    private String fromField;
    private String mainContentField;
    private ArticleExtractor articleExtractor;
    
    public MainContentUpdateProcessor(SolrQueryRequest req,
        UpdateRequestProcessor next, String fromField, String mainContentField) {
      super(next);
      this.fromField = fromField;
      this.mainContentField = mainContentField;
      articleExtractor = ArticleExtractor.getInstance();
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      Object obj = doc.getFieldValue(fromField);
      if (obj != null) {
        try {
          String text = obj.toString();
          text = normalize(text);
          String mainContent = articleExtractor.getText(text);

          Document jsoupDoc = Jsoup.parse(mainContent);
          mainContent = jsoupDoc.text();
          doc.addField(mainContentField, mainContent);
        } catch (BoilerpipeProcessingException e) {
          throw new IOException(e);
        }
      }
      super.processAdd(cmd);
    }
    
    private String normalize(String text) {
      Document doc = Jsoup.parse(text);
      doc.select("a, [style~=display:\\s*none]").remove();
      Elements divs = doc.select("textarea, span, div");
      for (Element tmp : divs) {
        String html = tmp.html();
        if (tmp.childNodeSize() == 1) {
          // && !html.endsWith(".")
          String ownText = tmp.ownText();
          if (ownText != null && !ownText.trim().equals("")
              && !ownText.endsWith(".")) {
            html += ".";
            tmp.html(html);
          }
        }
      }
      return doc.html();
    }
  }
}
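The reason for appending periods is easier to see on plain strings. In this sketch, which uses no JSoup and whose class and method names are made up for the example, each block of text gets a trailing period if it lacks one, so that a heading and the sentence after it are not glued together when the text is flattened.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceBoundarySketch {

    /** Appends "." to non-blank text that does not already end with one. */
    static String terminate(String blockText) {
        String trimmed = blockText.trim();
        if (trimmed.isEmpty() || trimmed.endsWith(".")) {
            return trimmed;
        }
        return trimmed + ".";
    }

    /** Flattens blocks into one string, skipping blank blocks. */
    static String join(List<String> blocks) {
        List<String> parts = new ArrayList<>();
        for (String block : blocks) {
            String terminated = terminate(block);
            if (!terminated.isEmpty()) {
                parts.add(terminated);
            }
        }
        return String.join(" ", parts);
    }
}
```

Without this step, a div containing "Title" followed by a div containing "First sentence." would flatten to "Title First sentence.", and a sentence-based summarizer would treat the two as one sentence.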
Get Summarization
Define DocumentSummaryUpdateProcessorFactory in solrconfig.xml
Let's first look at the definition of DocumentSummaryUpdateProcessorFactory:
<processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory" >
  <str name="summary.type">text_lsa</str>
  <str name="summary.fromField">maincontent</str>
  <str name="summary.summaryField">summary</str>
  <str name="summary.hl_start"/>
  <str name="summary.hl_end" />     
  <bool name="summary.simpleformat">true</bool>
  <int name="summary.count">3</int>
</processor>
It parses summary.fromField (maincontent in this case), takes the summary.count (3) most important sentences and puts them into summary.summaryField (summary in this case). summary.hl_start and summary.hl_end are empty because we just need the text and don't want HTML tags (like em or bold) to highlight important words.
summary.simpleformat is an internal argument that tells the summarizer to return only the highlighted section: no stats, terms or sentences sections.
DocumentSummaryUpdateProcessorFactory
Some web pages define og:description, which gives a one- or two-sentence description that we can use directly.
If og:description is defined, we use the summarizer to get only the summary.count (3) - 1 = 2 most important sentences.
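Before the full processor, the selection rule can be sketched on its own: take og:description first if the page has one, then fill the remaining slots with the summarizer's top sentences. The SummaryPlan class, its build method and the assumption that the summarizer hands back sentences already ranked by importance are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SummaryPlan {

    /**
     * Builds a summary of at most count entries: og:description first
     * (when present), then the highest ranked sentences.
     */
    static List<String> build(String ogDescription, List<String> rankedSentences, int count) {
        List<String> summary = new ArrayList<>();
        if (count > 0 && ogDescription != null && !ogDescription.trim().isEmpty()) {
            summary.add(ogDescription.trim());
        }
        for (String sentence : rankedSentences) {
            if (summary.size() >= count) {
                break; // all slots used
            }
            if (!summary.contains(sentence)) {
                summary.add(sentence); // skip a sentence identical to the description
            }
        }
        return summary;
    }
}
```

With count = 3 and an og:description present, only two sentences are taken from the summarizer, which matches the count - 1 rule described above.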
public class DocumentSummaryUpdateProcessorFactory extends
    UpdateRequestProcessorFactory implements SolrCoreAware {
  private SummarizerOutputFormat outputFormat;
  private Map<String,String> summarizerParams = new HashMap<>();
  private Analyzer analyzer;
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      
      Iterator<String> it = params.getParameterNamesIterator();
      String prefix = "summary.";
      
      while (it.hasNext()) {
        String paramName = it.next();
        if (paramName.startsWith(prefix)) {
          summarizerParams.put(paramName.substring(prefix.length()),
              params.get(paramName));
        }
      }
      outputFormat = getSummarizeOutputFormat(summarizerParams);
    }
  }
  public void inform(SolrCore core) {
    analyzer = getAnalyzer(core, summarizerParams);
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    
    return new DocumentSummaryUpdateProcessor(next, req, analyzer,
        summarizerParams, outputFormat);
  }
  
  private Analyzer getAnalyzer(SolrCore core, Map<String,String> params) {
    FieldType fType = null;
    if (params.containsKey("type")) {
      fType = core.getSchema().getFieldTypeByName(params.get("type"));
      if (fType == null) {
        throw new IllegalArgumentException("field type not found: "
            + params.get("type"));
      } else {
        return fType.getAnalyzer();
      }
    } else if (params.containsKey("fl")) {
      fType = core.getSchema().getFieldType(params.get("fl"));
      if (fType == null) {
        // Report the field name ("fl"), not the unset "type" parameter.
        throw new IllegalArgumentException("field not found: "
            + params.get("fl"));
      } else {
        return fType.getAnalyzer();
      }
    } else {
      throw new IllegalArgumentException("need field name or type");
    }
  }
  
  private SummarizerOutputFormat getSummarizeOutputFormat(
      Map<String,String> params) {
    SummarizerOutputFormat outputFormat = new SummarizerOutputFormat();
    boolean simpleformat = false;
    if (params.containsKey("simpleformat")) {
      simpleformat = Boolean.parseBoolean(params.remove("simpleformat"));
    }
    outputFormat.setHighlightedOnly(simpleformat);
    int count = -1;
    if (params.containsKey("count")) {
      count = Integer.parseInt(params.remove("count"));
    }
    outputFormat.setHighlightedCount(count);
    return outputFormat;
  }
  
  private static class DocumentSummaryUpdateProcessor extends
      UpdateRequestProcessor {
    private SolrQueryRequest req;
    private SummarizerOutputFormat outputFormat;
    private Analyzer analyzer;
    private String fromField;
    private String summaryField;
    private SchemaSummarizer summarizer;
    public DocumentSummaryUpdateProcessor(UpdateRequestProcessor next,
        SolrQueryRequest req, Analyzer analyzer,
        Map<String,String> summarizerParams, SummarizerOutputFormat outputFormat) {
      super(next);
      this.req = req;
      this.analyzer = analyzer;
      this.outputFormat = outputFormat;
      fromField = Preconditions.checkNotNull(summarizerParams.get("fromField"),
          "have to set fromField");
      
      summaryField = Preconditions.checkNotNull(
          summarizerParams.get("summaryField"), "have to set summaryField");
      summarizer = new SchemaSummarizer(summarizerParams, Locale.getDefault());
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      // use og:description
      String og_description = null;
      Object obj = doc.getFieldValue("og:description");
      
      int count = 0;
      if (obj != null) {
        og_description = obj.toString();
        doc.addField(summaryField, og_description);
        ++count;
      }
      
      obj = doc.getFieldValue(fromField);
      if (obj != null) {
        NamedList summary = doSummary(summarizer, analyzer, obj.toString(),
            req.getParams());
        NamedList highlighted = (NamedList) summary.get("highlighted");
        List<NamedList> list = highlighted.getAll("sentence");
        
        for (NamedList<Object> sentence : list) {
          if (count < outputFormat.getHighlightedCount()) {
            String value = sentence.get("text").toString();
            if (value.equals(og_description)) continue;
            ++count;
            doc.addField(summaryField, value);
          } else {
            break;
          }
        }
      }
      super.processAdd(cmd);
    }
    
    private NamedList<Object> doSummary(Summarizer sz, Analyzer analyzer,
        String text, SolrParams solrParams) throws IOException {
      long start = System.currentTimeMillis();
      sz.startSummary();
      sz.addDocument(text, analyzer);
      NamedList<Object> summary = new NamedList<Object>();
      sz.finishSummary(summary, outputFormat, start);
      return summary;
    }
  }
}
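The selection logic in processAdd can be isolated from Solr to make it easier to follow. The following is a minimal sketch, assuming the summarizer has already returned its sentences in ranked order; the class SummarySelectionSketch and the method select are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SummarySelectionSketch {

  /**
   * Puts og:description first (when present), then fills the remaining slots
   * with ranked sentences, skipping any sentence identical to og:description.
   */
  static List<String> select(String ogDescription, List<String> ranked, int maxCount) {
    List<String> result = new ArrayList<>();
    if (ogDescription != null) {
      result.add(ogDescription);
    }
    for (String sentence : ranked) {
      if (result.size() >= maxCount) {
        break;
      }
      if (sentence.equals(ogDescription)) {
        continue; // avoid storing the same text twice
      }
      result.add(sentence);
    }
    return result;
  }

  public static void main(String[] args) {
    // Placeholder sentences, not real summarizer output.
    List<String> ranked = Arrays.asList("S1", "S2", "S3", "S4");
    System.out.println(select("OG", ranked, 3)); // og:description plus two sentences
    System.out.println(select(null, ranked, 3)); // no og:description: three sentences
  }
}
```

As in the processor above, og:description counts toward summary.count, so a page that defines it gets one fewer summarizer sentence.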
Summarizer in Action
Now let's use our crawler to crawl one web page, "Official: Debris Sign of Spaceship Breaking Up", and check the summarization.
curl "http://localhost:23456/solr/crawler/crawler?action=start&seeds=http://abcnews.go.com/Health/wireStory/investigators-branson-spacecraft-crash-site-26619288&maxCount=1&constants=cat:news"
The summaries saved in the doc:
<arr name="summary">
  <str>
Investigators looking into what caused the crash of a Virgin Galactic prototype spacecraft that killed one of two test pilots said a 5-mile path of debris across the California desert indicates the aircraft broke up in flight. "When the wreckage is dispersed like that, it indicates the...
  </str>
  <str>
"We are determined to find out what went wrong," he said, asserting that safety has always been the top priority of the program that envisions taking wealthy tourists six at a time to the edge of space for a brief experience of weightlessness and a view of Earth below.
  </str>
  <str>
In grim remarks at the Mojave Air and Space Port, where the craft known as SpaceShipTwo was under development, Branson gave no details of Friday's accident and deferred to the NTSB, whose team began its first day of investigation Saturday.
  </str>
</arr>
The first one is the og:description defined in the web page; the other two are the two most important sentences that the summarizer found.
References
SOLR-3975: Document Summarization toolkit, using LSA techniques
Document Summarization with LSA #1: Introduction
