Spark Basic Statistics - Using Scala


Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data:
1 2 3
10 20 30
100 200 300

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
  
val data = sc.textFile("E:/jeffery/src/ML/data/statistics.txt").cache();  
val parsedData = data.map( line =>  Vectors.dense(line.split(' ').map(x => x.toDouble).toArray) )
val summary = Statistics.colStats(parsedData);
println(summary.count)
println(summary.min)
println(summary.max)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column


Stratified sampling

Stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.

The sampleByKey method flips a coin to decide whether an observation will be sampled, so it requires only one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but provides the exact sample size with 99.99% confidence.


Test data:
man 6
woman 14
woman 19
child 6
baby 1
child 3
woman 26
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
val data = sc.textFile("E:/jeffery/src/ML/data/sampling.txt").cache();  
val parsedData = data.map{line => {
  val sp = line.split(' '); 
  (sp(0), sp(1).toInt);
}
}.cache()

parsedData.foreach(println)
var fractions = Map[String, Double]()

fractions += ("man" ->  0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3);
val approxSample = parsedData.sampleByKey(false, fractions).collect();
val exactSample = parsedData.sampleByKeyExact(false, fractions).collect();
print(approxSample.mkString(" "));
print(exactSample.mkString(" "));

Random data generation
import org.apache.spark.mllib.random.RandomRDDs._
val u = normalRDD(sc, 100L, 2);
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
println(u.collect().mkString(" "))
println(v.collect().mkString(" "))

val u = poissonRDD(sc, 10, 100L);
val v = u.map(x => 1.0 + 2.0 * x).collect()

val u = uniformRDD(sc, 100L);
val v = u.map(x => 1.0 + 2.0 * x).collect()

Histogram
val ints = sc.parallelize(1 to 100)
ints.histogram(5) // 5 evenly spaced buckets
res92: (Array[Double], Array[Long]) = (Array(1.0, 20.8, 40.6, 60.4, 80.2, 100.0),Array(20, 20, 20, 20, 20))

Correlations
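A minimal sketch of MLlib's correlation API (Statistics.corr), using the same test data as the summary statistics example above; the method can be "pearson" (the default) or "spearman":
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// correlation between two series of the same length
val seriesX = sc.parallelize(Array(1.0, 2.0, 3.0))
val seriesY = sc.parallelize(Array(10.0, 20.0, 30.0))
println(Statistics.corr(seriesX, seriesY, "pearson"))

// pairwise correlation matrix between the columns of an RDD[Vector]
val vectors = sc.parallelize(Seq(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(10.0, 20.0, 30.0), Vectors.dense(100.0, 200.0, 300.0)))
println(Statistics.corr(vectors, "spearman"))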


References
MLlib - Basic Statistics
Spark 1.1.0 Basic Statistics (Part 1)

Hack Scala REPL Classpath


The Problem
I was running the Latent Semantic Analysis (LSA) Wikipedia example from the book Advanced Analytics with Spark in the Spark 1.2 spark-shell.cmd.

It depends on the Stanford NLP libraries, so I needed to add the Stanford NLP jars to the Scala REPL session - I didn't want to add these jars to Spark's spark-shell.cmd. We can use :cp to add a jar to the current Scala shell session.
But as there are multiple jars (seven, actually) in the stanford-corenlp-full-2014-10-31 folder, I didn't want to add them one by one.

The Solution
import java.io.File

for (file <- new File("E:/jeffery/src/textmining/standfordnlp/stanford-corenlp-full-2014-10-31").listFiles.filter(f => f.getName().endsWith(".jar")&& !f.getName().contains("-sources") && !f.getName().contains("-src") && !f.getName().contains("-javadoc"))) { println(":cp " + file) } 

This prints a :cp command for every stanford-corenlp jar, excluding the sources and javadoc jars:
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\ejml-0.23.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\javax.json.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\joda-time.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\jollyday.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\stanford-corenlp-3.5.0-models.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\stanford-corenlp-3.5.0.jar
:cp E:\jeffery\src\textmining\standfordnlp\stanford-corenlp-full-2014-10-31\xom.jar

Then just copy the output and paste it into the Scala shell; Scala will add these jars to the current session's classpath.

Happy Hacking.

Build Spark Failure: Nonzero exit code (128): git clone sbt-pom-reader.git


The Problem
I downloaded Spark 1.2 from GitHub and tried to build it by running sbt assembly.
It always failed with the error:
[error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader
[error] Use 'last' for the full log.
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?

Retry didn't work, even though I could access https://github.com/ScrapCodes/sbt-pom-reader.git and git clone it manually.
Not sure why it failed.

The Solution
To fix this, I opened a new cmd terminal and ran the following commands to create the staging folder and git clone into the destination folder:
mkdir C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader
git clone https://github.com/ScrapCodes/sbt-pom-reader.git C:\Users\jyuan\.sbt\0.13\staging\ad8e8574a5bcb2d22d23\sbt-pom-reader

Then I typed r to retry. As sbt-pom-reader was already there, sbt happily picked it up.
After several minutes, Spark built successfully.

Happy hacking.

Running Stanford Sentiment Analysis in UIMA


The Goal
In a previous post we introduced how to run Stanford NER (Named Entity Recognition) in UIMA; now we are integrating Stanford Sentiment Analysis into UIMA.

StanfordNLPAnnotator
Feature Structure: org.apache.uima.stanfordnlp.input:action
We use StanfordNLPAnnotator as the gateway or facade: the client uses org.apache.uima.stanfordnlp.input:action to specify what to extract: action=ner to run named entity extraction, or action=sentiment to run sentiment analysis.

The feature org.apache.uima.stanfordnlp.output:type specifies the sentiment of the whole article: very negative, negative, neutral, positive or very positive.

The configuration parameter SentiwordnetFile specifies the path of the SentiWordNet file.

How it Works
First it ignores sentences that don't contain an opinionated word; it uses SentiWordNet to check whether a sentence contains a non-neutral adjective.

Then it calls the Stanford NLP Sentiment Analysis tool to process the text.
Stanford NLP Sentiment Analysis has two model files: edu/stanford/nlp/models/sentiment/sentiment.ser.gz, which maps sentiment to 5 classes (very negative, negative, neutral, positive or very positive), and edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz, which maps sentiment to 2 classes (negative or positive).

We use edu/stanford/nlp/models/sentiment/sentiment.ser.gz, but it sometimes mistakenly maps non-negative text to negative.

For example, it maps the following sentence to negative, but the binary model correctly maps it to positive:
I was able to stream video and surf the internet for well over 7 hours without any hiccups .

So to fix this, when the 5-class model (sentiment.ser.gz) maps a sentence to negative, we run the binary model to recheck it: if the binary model agrees (also reports negative), we keep it; otherwise we change it to positive.

We calculate a score for each sentence and map the average score back to the 5 classes. We give negative sentences a smaller weight, as we don't fully trust them.
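For example, a three-sentence document whose sentences score very positive, neutral and negative accumulates 5 + 0 + (-1) = 4; the average 4/3 ≈ 1.33 is above 0.5 but not above 2, so the document is reported as positive (see the SentimentAccumulator below).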
package org.lifelongprogrammer.nlp;
public class StanfordNLPAnnotator extends JCasAnnotator_ImplBase {
	public static final String STANFORDNLP_ACTION_SENTIMENT = "sentiment";
	public static final String TYPE_STANDFORDNLP_OUTPUT = "org.apache.uima.standfordnlp.output";
	public static final String FS_STANDFORDNLP_OUTPUT_TYPE = TYPE_STANDFORDNLP_OUTPUT
			+ ":type";
	public static final String TYPE_STANFORDNLP_INPUT = "org.apache.uima.stanfordnlp.input";
	public static final String FS_STANFORDNLP_INPUT_ACTION = TYPE_STANFORDNLP_INPUT
			+ ":action";

	private static Splitter splitter = Splitter.on(",").trimResults()
			.omitEmptyStrings();
	public static final String SENTIWORDNET_FILE_PARAM = "SentiwordnetFile";

	private StanfordCoreNLP sentiment5ClassesPipeline,
			sentiment2ClassesPipeline;
	private SWN3 sentiwordnet;
	private ExecutorService threadpool;
	private Logger logger;
	public void initialize(UimaContext aContext)
			throws ResourceInitializationException {
		super.initialize(aContext);
		this.logger = getContext().getLogger();
		reconfigure();
	}

	public void reconfigure() throws ResourceInitializationException {
		try {
			threadpool = Executors.newCachedThreadPool();
			String dataPath = getContext().getDataPath();
			Properties props = new Properties();
			props.setProperty("annotators",
					"tokenize, ssplit, parse, sentiment");
			props.put("sentiment.model",
					"edu/stanford/nlp/models/sentiment/sentiment.ser.gz");

			sentiment5ClassesPipeline = new StanfordCoreNLP(props);
			props.put("sentiment.model",
					"edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz");
			sentiment2ClassesPipeline = new StanfordCoreNLP(props);

			String sentiwordnetFile = (String) getContext()
					.getConfigParameterValue(SENTIWORDNET_FILE_PARAM);
			sentiwordnet = new SWN3(
					new File(dataPath, sentiwordnetFile).getPath());
		} catch (Exception e) {
			logger.log(Level.SEVERE, e.getMessage());
			throw new ResourceInitializationException(e);
		}
	}
	public void process(JCas jcas) throws AnalysisEngineProcessException {
		CAS cas = jcas.getCas();
		ArrayList<String> action = getAction(cas);
		List<Future<Void>> futures = new ArrayList<Future<Void>>();
		if (action.contains(STANFORDNLP_ACTION_SENTIMENT)) {
			Future<Void> future = threadpool.submit(new Callable<Void>() {
				@Override
				public Void call() throws Exception {
					checkSentiment(cas);
					return null;
				}
			});
			futures.add(future);
		}
		for (Future<Void> future : futures) {
			try {
				future.get();
			} catch (InterruptedException | ExecutionException e) {
				throw new AnalysisEngineProcessException(e);
			}
		}
		logger.log(Level.FINE, "StanfordNERAnnotator done.");
	}

  
	private void checkSentiment(CAS cas) {
		String sentimentText = getSentimentSentence(cas.getDocumentText())
				.toString();

		Annotation annotation = sentiment5ClassesPipeline.process(sentimentText);
		TypeSystem ts = cas.getTypeSystem();
		Type dyOutputType = ts.getType(TYPE_STANDFORDNLP_OUTPUT);
		org.apache.uima.cas.Feature dyOutputTypeFt = ts
				.getFeatureByFullName(FS_STANDFORDNLP_OUTPUT_TYPE);
        
		SentimentAccumulator accumulator = new SentimentAccumulator();
		for (CoreMap sentenceCore : annotation
				.get(CoreAnnotations.SentencesAnnotation.class)) {
			Tree tree = sentenceCore
					.get(SentimentCoreAnnotations.AnnotatedTree.class);
			int predictedClass = RNNCoreAnnotations.getPredictedClass(tree);
			String sentence = sentenceCore.toString();
			if (predictedClass == 1) {
				int old = predictedClass;
				predictedClass = checkNegative(sentence);
				System.out.println("Sentiment changed from " + old + " to "
						+ predictedClass + " String: " + sentence);
			} 
			accumulator.accumulate(predictedClass, sentence.length());
		}
		AnnotationFS dyAnnFS = cas.createAnnotation(dyOutputType, 0, 0);
		dyAnnFS.setStringValue(dyOutputTypeFt, accumulator.getResult());
		cas.getIndexRepository().addFS(dyAnnFS);
	}
  
	private ArrayList<String> getAction(CAS cas) {
		TypeSystem ts = cas.getTypeSystem();
		Type dyInputType = ts.getType(TYPE_STANFORDNLP_INPUT);
		org.apache.uima.cas.Feature dyInputTypesFt = ts
				.getFeatureByFullName(FS_STANFORDNLP_INPUT_ACTION);
		FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();
		String action = "";
		while (dyIt.hasNext()) {
			// TODO this is kind of weird
			AnnotationFS afs = (AnnotationFS) dyIt.next();
			String str = afs.getStringValue(dyInputTypesFt);
			if (str != null) {
				action = str;
			}
		}
		return Lists.newArrayList(splitter.split(action));
	}
  
  

	class SentimentAccumulator {
		private double totalScore;
		private int sentCount;
		public SentimentAccumulator() {}
		public void accumulate(int type, int sentLen) {
		  calc5ClassModel(type);
		}
		private void calc5ClassModel(int type) {
			++sentCount;
			// very negative
			switch (type) {
			case 0:
				totalScore += -5;
				break;
			case 1:
				totalScore += -1; // give smaller value
				break;
			case 2:
				totalScore += 0;
				break;
			case 3:
				totalScore += 2;
				break;
			case 4:
				totalScore += 5;
				break;
			default:
				// ignore this
				logger.log(Level.SEVERE, "unknown type:" + type);
				--sentCount;
			}
		}

		public String getResult() {
      double avgScore = (double) totalScore / sentCount;
      logger.log(Level.INFO, "avgScore: " + avgScore
          + ", totalScore: " + totalScore + ", sentCount: "
          + sentCount);

      if (avgScore > 2) {
        return "very positive";
      } else if (avgScore > 0.5) {
        return "positive";
        // [-0.5 TO 0]: neutral
      } else if (avgScore > -0.5) {
        return "neutral";
      } else if (avgScore > -2) {
        return "negative";
      } else {
        return "very negative";
      }
		}
	}

	public StringBuilder getSentimentSentence(String text) {
		DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(
				text));
		// List<String> sentenceList = new LinkedList<String>();
		StringBuilder sentenceList = new StringBuilder();
		Iterator<List<HasWord>> it = dp.iterator();
		while (it.hasNext()) {
			StringBuilder sentenceSb = new StringBuilder();
			List<HasWord> sentence = it.next();

			boolean hasFeeling = false;
			Iterator<HasWord> inner = sentence.iterator();
			while (inner.hasNext()) {
				HasWord token = inner.next();
				sentenceSb.append(token.word());

				if (inner.hasNext()) {
					sentenceSb.append(" ");
				}
				String feeling = sentiwordnet.extractFelling(token.word(), "a");
				if (!"neutral".equals(feeling)) {
					hasFeeling = true;
					System.out.println(feeling + ":" + token);
				}
			}
			if (hasFeeling) {
				sentenceList.append(sentenceSb.toString());
			}
		}
		return sentenceList;
	}

	private int checkNegative(String sentence) {
		Annotation annotation = sentiment2ClassesPipeline.process(sentence);

		for (CoreMap sentenceCore : annotation
				.get(CoreAnnotations.SentencesAnnotation.class)) {

			Tree tree = sentenceCore
					.get(SentimentCoreAnnotations.AnnotatedTree.class);
			int newPredict = RNNCoreAnnotations.getPredictedClass(tree);
			// if binary checker still returns negative then use negative
			if (newPredict == 0) {
				return 1;
			} else {
				return 3;
			}
		}
		return 1;
	}  
}
Descriptor File: StanfordNLPAnnotator.xml
We define uima types: org.apache.uima.stanfordnlp.input and org.apache.uima.stanfordnlp.output, and the configuration parameter: SentiwordnetFile.
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
	<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
	<primitive>true</primitive>
	<annotatorImplementationName>org.lifelongprogrammer.nlp.StanfordNLPAnnotator
	</annotatorImplementationName>
	<analysisEngineMetaData>
		<name>StanfordNLPAnnotatorAE</name>
		<description>StanfordNLPAnnotator Wrapper.</description>
		<version>1.0</version>
		<vendor>LifeLong Programmer, Inc.</vendor>
		<configurationParameters>
			<configurationParameter>
				<name>SentiwordnetFile</name>
				<description>Filename of the sentiwordnet file.</description>
				<type>String</type>
				<multiValued>false</multiValued>
				<mandatory>true</mandatory>
			</configurationParameter>
		</configurationParameters>
		<configurationParameterSettings>
			<nameValuePair>
				<name>SentiwordnetFile</name>
				<value>
					<string>dicts\SentiWordNet_3.0.0_20130122.txt</string>
				</value>
			</nameValuePair>
		</configurationParameterSettings>
		<typeSystemDescription>
			<typeDescription>
				<name>org.apache.uima.stanfordnlp.input</name>
				<description />
				<supertypeName>uima.tcas.Annotation</supertypeName>
				<features>
					<featureDescription>
						<name>action</name>
						<description />
						<rangeTypeName>uima.cas.String</rangeTypeName>
					</featureDescription>
				</features>
			</typeDescription>
			<typeDescription>
				<name>org.apache.uima.standfordnlp.output</name>
				<description />
				<supertypeName>uima.tcas.Annotation</supertypeName>
				<features>
					<featureDescription>
						<name>type</name>
						<description />
						<rangeTypeName>uima.cas.String</rangeTypeName>
					</featureDescription>
				</features>
			</typeDescription>
		</typeSystemDescription>
	</analysisEngineMetaData>
</analysisEngineDescription>
Annotator Test case
Check the previous post for how to use sujitpal's UimaUtils.java to test the StanfordNLPAnnotator.
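For the sentiment action, a minimal test sketch, assuming the same UimaUtils, addFSAction and INPUTS helpers used in the NER test shown later on this page (the descriptor path is a placeholder):
@Test
public void testStanfordNLPSentimentAnnotator() throws Exception {
  AnalysisEngine ae = UimaUtils.getAE("%ABS_PATH%/StanfordNLPAnnotator.xml", null);
  for (String input : INPUTS) {
    JCas jcas = ae.newJCas();
    // ask the annotator to run sentiment analysis instead of NER
    addFSAction(jcas, Lists.newArrayList(StanfordNLPAnnotator.STANFORDNLP_ACTION_SENTIMENT));
    jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT, jcas);

    // read back the document-level sentiment written to org.apache.uima.standfordnlp.output:type
    Feature feature = jcas.getTypeSystem().getFeatureByFullName(
        StanfordNLPAnnotator.FS_STANDFORDNLP_OUTPUT_TYPE);
    Type outputType = jcas.getTypeSystem().getType(
        StanfordNLPAnnotator.TYPE_STANDFORDNLP_OUTPUT);
    FSIndex<? extends Annotation> index = jcas.getAnnotationIndex(outputType);
    for (Iterator<? extends Annotation> it = index.iterator(); it.hasNext();) {
      Annotation annotation = it.next();
      System.out.println(input + " -> " + annotation.getFeatureValueAsString(feature));
    }
  }
  ae.destroy();
}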

Running Stanford Named Entity Recognition in UIMA


The Goal
To improve our text analytics project, after integrating OpenNLP with UIMA, we are now integrating Stanford NLP NER (Named Entity Recognition) into UIMA.

StanfordNLPAnnotator
Feature Structure: org.apache.uima.stanfordnlp.input:action
We use StanfordNLPAnnotator as the gateway or facade: the client uses org.apache.uima.stanfordnlp.input:action to specify what to extract: action=ner to run named entity extraction, or action=sentiment to run sentiment analysis.

We use the dynamic output type org.apache.uima.stanfordnlp.output; its type feature specifies whether the entity is a person, an organization, etc.

The configuration parameter ClassifierFile specifies the model file NER uses.

package org.lifelongprogrammer.nlp;
public class StanfordNLPAnnotator extends JCasAnnotator_ImplBase {
 public static final String STANFORDNLP_ACTION_NER = "ner";
 public static final String TYPE_STANDFORDNLP_OUTPUT = "org.apache.uima.standfordnlp.output";
 public static final String FS_STANDFORDNLP_OUTPUT_TYPE = TYPE_STANDFORDNLP_OUTPUT
   + ":type";
 public static final String TYPE_STANFORDNLP_INPUT = "org.apache.uima.stanfordnlp.input";
 public static final String FS_STANFORDNLP_INPUT_ACTION = TYPE_STANFORDNLP_INPUT
   + ":action";

 // http://nlp.stanford.edu/software/CRF-NER.shtml
 private static final Set<String> NER_TYPES = new HashSet<String>(
   Arrays.asList("PERSON", "ORGANIZATION", "LOCATION", "MISC", "TIME",
     "MONEY", "PERCENT", "DATE"));
          
 private static Splitter splitter = Splitter.on(",").trimResults()
   .omitEmptyStrings();
 public static final String CLASSIFIER_FILE_PARAM = "ClassifierFile";
 private CRFClassifier<CoreLabel> crf;
 private ExecutorService threadpool;
 private Logger logger;

 public void initialize(UimaContext aContext)
   throws ResourceInitializationException {
  super.initialize(aContext);
  this.logger = getContext().getLogger();
  reconfigure();
 }
 public void reconfigure() throws ResourceInitializationException {
  try {
   threadpool = Executors.newCachedThreadPool();
   String dataPath = getContext().getDataPath();

   String classifierFile = (String) getContext()
     .getConfigParameterValue(CLASSIFIER_FILE_PARAM);
   System.out.println(classifierFile);
   crf = CRFClassifier
     .getClassifier(new File(dataPath, classifierFile));
  } catch (Exception e) {
   logger.log(Level.SEVERE, e.getMessage());
   throw new ResourceInitializationException(e);
  }
 }
  
 public void process(JCas jcas) throws AnalysisEngineProcessException {
  CAS cas = jcas.getCas();
  ArrayList<String> action = getAction(cas);
  List<Future<Void>> futures = new ArrayList<Future<Void>>();
  if (action.contains(STANFORDNLP_ACTION_NER)) {
   Future<Void> future = threadpool.submit(new Callable<Void>() {
    @Override
    public Void call() throws Exception {
     getNer(jcas);
     return null;
    }
   });

   futures.add(future);
  }
    //...
  for (Future<Void> future : futures) {
   try {
    future.get();
   } catch (InterruptedException | ExecutionException e) {
    throw new AnalysisEngineProcessException(e);
   }
  }
  logger.log(Level.FINE, "StanfordNERAnnotator done.");
 }
  
 private ArrayList<String> getAction(CAS cas) {
  TypeSystem ts = cas.getTypeSystem();
  Type dyInputType = ts.getType(TYPE_STANFORDNLP_INPUT);
  org.apache.uima.cas.Feature dyInputTypesFt = ts
    .getFeatureByFullName(FS_STANFORDNLP_INPUT_ACTION);

  FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();
  String action = "";
  while (dyIt.hasNext()) {
   // TODO this is kind of weird
   AnnotationFS afs = (AnnotationFS) dyIt.next();
   String str = afs.getStringValue(dyInputTypesFt);
   if (str != null) {
    action = str;
   }
  }
  return Lists.newArrayList(splitter.split(action));
 }
  
 private void getNer(JCas jcas) {
    CAS cas=jcas.getCas();
  String docText = jcas.getDocumentText();
  List<List<CoreLabel>> classify = crf.classify(docText);

  MatchedNER preNER = null;

  TypeSystem ts = jcas.getTypeSystem();
  Type dyOutputType = ts.getType(TYPE_STANDFORDNLP_OUTPUT);
  org.apache.uima.cas.Feature dyOutputTypeFt = ts
    .getFeatureByFullName(FS_STANDFORDNLP_OUTPUT_TYPE);

  // merge co-located same entity
  for (List<CoreLabel> coreLabels : classify) {
   for (CoreLabel coreLabel : coreLabels) {
    String category = coreLabel
      .get(CoreAnnotations.AnswerAnnotation.class);
    if (NER_TYPES.contains(category)) {
     if (preNER == null) {
      preNER = new MatchedNER(category,
        coreLabel.beginPosition(),
        coreLabel.endPosition());
     } else if (category.equals(preNER.getCategory())) {
      preNER = new MatchedNER(category,
        preNER.getEntityBegin(),
        coreLabel.endPosition());
     } else {
      // add preNER
      addNER(preNER, cas, dyOutputType, dyOutputTypeFt);
      preNER = new MatchedNER(category,
        coreLabel.beginPosition(),
        coreLabel.endPosition());
     }
    } else {
     if (preNER != null) {
      addNER(preNER, cas, dyOutputType, dyOutputTypeFt);
      preNER = null;
     }

    }
   }
  }
  if (preNER != null) {
   addNER(preNER, cas, dyOutputType, dyOutputTypeFt);
  }
 }
 private void addNER(MatchedNER preNER, CAS cas, Type dyOutputType,
   org.apache.uima.cas.Feature dyOutputTypeFt) {
  AnnotationFS dyAnnFS = cas.createAnnotation(dyOutputType,
    preNER.getEntityBegin(), preNER.getEntityEnd());
  dyAnnFS.setStringValue(dyOutputTypeFt, preNER.getCategory()
    .toLowerCase());
  cas.getIndexRepository().addFS(dyAnnFS);
 }

 class MatchedNER {
  private String cat;
  private int entityBegin, entityEnd;

  public MatchedNER(String cat, int entityBegin, int entityEnd) {
   this.cat = cat;
   this.entityBegin = entityBegin;
   this.entityEnd = entityEnd;
  }

  public String getCategory() { return cat; }
  public int getEntityBegin() { return entityBegin; }
  public int getEntityEnd() { return entityEnd; }
 }
}
Descriptor File: StanfordNLPAnnotator.xml
We define uima types: org.apache.uima.stanfordnlp.input and org.apache.uima.stanfordnlp.output, and the configuration parameter: ClassifierFile.
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
 <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
 <primitive>true</primitive>
 <annotatorImplementationName>org.lifelongprogrammer.nlp.StanfordNLPAnnotator
 </annotatorImplementationName>
 <analysisEngineMetaData>
  <name>StanfordNLPAnnotatorAE</name>
  <description>StanfordNLPAnnotator Wrapper.</description>
  <version>1.0</version>
  <vendor>LifeLong Programmer, Inc.</vendor>
  <configurationParameters>
   <configurationParameter>
    <name>ClassifierFile</name>
    <description>Filename of the classifier file.</description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>true</mandatory>
   </configurationParameter>
  </configurationParameters>
  <configurationParameterSettings>
   <nameValuePair>
    <name>ClassifierFile</name>
    <value>
     <!-- relative to pear resource file -->
     <string>models\classifiers\english.muc.7class.distsim.crf.ser.gz
     </string>
    </value>
   </nameValuePair>
  </configurationParameterSettings>
  <typeSystemDescription>
   <typeDescription>
    <name>org.apache.uima.stanfordnlp.input</name>
    <description />
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
     <featureDescription>
      <name>action</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
     </featureDescription>
    </features>
   </typeDescription>

   <typeDescription>
    <name>org.apache.uima.standfordnlp.output</name>
    <description />
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
     <featureDescription>
      <name>type</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
     </featureDescription>
    </features>
   </typeDescription>
  </typeSystemDescription>
 </analysisEngineMetaData>
</analysisEngineDescription>
Annotator Test case
Here we are using sujitpal's UimaUtils.java: the test adds the feature org.apache.uima.stanfordnlp.input:action=ner to the CAS, runs the analysis engine on the input text, and then checks the org.apache.uima.standfordnlp.output annotations in the result.
private static final Joiner joiner = Joiner.on(",");
@Test
public void testStanfordNLPAnnotator() throws Exception {
  AnalysisEngine ae = UimaUtils.getAE("%ABS_PATH%\StanfordNLPAnnotator.xml", null);
  for (String input : INPUTS) {
    JCas jcas = ae.newJCas();
    addFSAction(jcas,Lists.newArrayList(StanfordNLPAnnotator.STANFORDNLP_ACTION_NER));
    jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT, jcas);

    Feature feature = jcas.getTypeSystem().getFeatureByFullName(
        "org.apache.uima.standfordnlp.output:type");
    org.apache.uima.cas.TypeSystem ts = jcas.getTypeSystem();
    org.apache.uima.cas.Type dyOutputType = ts
        .getType("org.apache.uima.standfordnlp.output");

    FSIndex<? extends Annotation> index = jcas
        .getAnnotationIndex(dyOutputType);
    for (Iterator<? extends Annotation> it = index.iterator(); it
        .hasNext();) {
      Annotation annotation = it.next();
      System.out.println("...(" + annotation.getBegin() + ","
          + annotation.getEnd() + "): "
          + annotation.getCoveredText() + ", type: "
          + annotation.getFeatureValueAsString(feature));
    }
  }
  ae.destroy();
}
private void addFSAction(JCas jcas, List<String> action) {
  TypeSystem ts = jcas.getTypeSystem();
  Feature ft = ts
      .getFeatureByFullName(StanfordNLPAnnotator.FS_STANFORDNLP_INPUT_ACTION);
  Type type = ts.getType(StanfordNLPAnnotator.TYPE_STANFORDNLP_INPUT);

  FeatureStructure fs = jcas.getCas().createFS(type);
  fs.setStringValue(ft, joiner.join(action));
  jcas.addFsToIndexes(fs);
}

Using lucene-appengine & google-http-java-client to Crawl Blogger on GAE


The Goal
In my latest project, I needed to develop a GAE Java application to crawl a Blogger site and save the index into Lucene on GAE.

This post introduces how to deploy lucene-appengine, use google-http-java-client to parse sitemap.xml to get all posts, crawl each post and save the index to lucene-appengine on GAE, and then use a GAE cron task to index new posts periodically.

Creating Maven GAE project & Adding Dependencies
First, check GAE: Using Apache Maven to create an appengine-skeleton-archetype Maven project.

Then download the lucene-appengine-examples source code, copy the needed dependencies from its pom.xml, and add google-http-client, google-http-client-appengine and google-http-client-xml to pom.xml.

Using google-http-java-client to Parse sitemap.xml
The google-http-java-client library allows us to easily convert an XML response into a Java object via com.google.api.client.http.HttpResponse.parseAs(SomeClass.class); all we need to do is define the Java class.

Check blogger's sitemap.xml: lifelongprogrammer sitemap.xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://lifelongprogrammer.blogspot.com/2014/11/using-solr-classifier-to-categorize-articles.html</loc>
    <lastmod>2014-11-04T22:49:54Z</lastmod>
  </url>
</urlset>

So we can map it to two classes, Urlset and TUrl; the key here is to use @com.google.api.client.util.Key to map a Java field to an element in the XML.
public class Urlset {
 @Key
 protected List<TUrl> url = new ArrayList<>();

 public List<TUrl> getUrl() {
  return url;
 }
}
public class TUrl {
 @Key
 protected String loc;
 @Key
 protected String lastmod;
  // omitted the getters
}
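The crawler code below calls url.getLastmodDate(), which is among the omitted getters; a minimal sketch of it, assuming the W3C datetime format shown in the sitemap above (e.g. 2014-11-04T22:49:54Z):
// hypothetical getter for TUrl: parse the sitemap's lastmod string into a java.util.Date
// (uses java.text.SimpleDateFormat, java.text.ParseException, java.util.Date, java.util.TimeZone)
public Date getLastmodDate() {
  if (lastmod == null) {
    return null;
  }
  SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
  fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
  try {
    return fmt.parse(lastmod);
  } catch (ParseException e) {
    return null;
  }
}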

Then use the following code to parse sitemap.xml into a Urlset Java object.
static final HttpTransport HTTP_TRANSPORT = new NetHttpTransport();
static final XmlNamespaceDictionary XML_DICT = new XmlNamespaceDictionary();

HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory(new HttpRequestInitializer() {
      @Override
      public void initialize(HttpRequest request) {
        request.setParser(new XmlObjectParser(XML_DICT));
      }
    });

HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(sitemapUrl));
HttpResponse response = request.execute();
Urlset urls = response.parseAs(Urlset.class);

When parsing each post, we can use the following code to get the post's HTML string:
HttpRequestFactory requestFactory = HTTP_TRANSPORT.createRequestFactory();
HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(url.getLoc()));
HttpResponse response = request.execute();

String html = response.parseAsString();

LAEUtil
The following is the complete code, which parses the sitemap, then crawls each post and saves the index into lucene-appengine.
public class LAEUtil {
 private static final Logger logger = LoggerFactory.getLogger(LAEUtil.class);
 private static final Version LUCENE_VERSION = Version.LUCENE_4_10_2;

 static final HttpTransport HTTP_TRANSPORT = new NetHttpTransport();
 static final XmlNamespaceDictionary XML_DICT = new XmlNamespaceDictionary();

 public static void crawl(String indexName, String sitemapUrl,
   long maxSeconds) throws IOException {
  Stopwatch stopwatch = Stopwatch.createStarted();
  IndexReader reader = null;
  try (GaeDirectory directory = new GaeDirectory(indexName)) {
   try {
    reader = DirectoryReader.open(directory);
   } catch (IndexNotFoundException e) {
    createIndex(directory);
    reader = DirectoryReader.open(directory);
   }

   IndexSearcher searcher = new IndexSearcher(reader);
   Date crawledMinDate = getCrawledMinMaxDate(searcher, false);
   Date crawlMaxDate = getCrawledMinMaxDate(searcher, true);

   reader.close();
   crawl(directory, stopwatch, indexName, sitemapUrl, crawledMinDate,
     crawlMaxDate, maxSeconds);
  } catch (IOException e) {
   logger.error("crawl failed with error", e);
  }
 }

 private static void createIndex(GaeDirectory directory) throws IOException {
  try (IndexWriter writer = new IndexWriter(directory,
    getIndexWriterConfig(LUCENE_VERSION, getAnalyzer()))) {
  }
 }

 private static Date getCrawledMinMaxDate(IndexSearcher searcher,
   boolean minDate) throws IOException {
  Query q = new MatchAllDocsQuery();
  Date minMaxDate = null;
  boolean reverse = minDate;
  TopFieldDocs docs = searcher.search(q, 1, new Sort(new SortField(
    Fields.LASTMOD, SortField.Type.LONG, reverse)));

  ScoreDoc[] hits = docs.scoreDocs;
  if (hits.length != 0) {
   Document doc = searcher.doc(hits[0].doc);
   minMaxDate = new Date(Long.parseLong(doc.get(Fields.LASTMOD)));
  }
  return minMaxDate;
 }

 /** post between [crawledMinDate to crawledMaxDate] is already crawled  */
 private static void crawl(GaeDirectory directory, Stopwatch stopwatch,
   String indexName, String sitemapUrl, Date crawledMinDate,
   Date crawlMaxDate, long maxSeconds) throws IOException {
  HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory(new HttpRequestInitializer() {
     @Override
     public void initialize(HttpRequest request) {
      request.setParser(new XmlObjectParser(XML_DICT));
     }
    });

  HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(
    sitemapUrl));

  HttpResponse response = request.execute();
  Urlset urls = response.parseAs(Urlset.class);
  PorterAnalyzer analyzer = getAnalyzer();

  // posts are sorted by lastMod in sitemap.xml
  int added = 0;
  try (IndexWriter writer = new IndexWriter(directory,
    getIndexWriterConfig(LUCENE_VERSION, analyzer))) {

   for (TUrl url : urls.getUrl()) {
    // will not happen
    Date lastmod = url.getLastmodDate();
    if (lastmod == null)  continue;

    if (stopwatch.elapsed(TimeUnit.SECONDS) >= maxSeconds) {
     logger.error("Exceed timelimt " + maxSeconds
       + ", already run "
       + stopwatch.elapsed(TimeUnit.SECONDS) + " seconds");
     break;
    }
    boolean post = false;
    if (crawlMaxDate == null || crawledMinDate == null) {
     post = true;
    }
    if (crawlMaxDate != null && lastmod.after(crawlMaxDate)) {
     post = true;
    } else if (crawledMinDate != null
      && url.getLastmodDate().before(crawledMinDate)) {
     post = true;
    }
    if (post) {
     crawlPost(url, writer);
     ++added;
     if (added == 20) {
      writer.commit();
      added = 0;
     }
    } else {
     logger.debug("ingore " + url + " : lastmod " + lastmod
       + ", crawlMaxDate: " + crawlMaxDate
       + ", crawledMinDate: " + crawledMinDate);
    }
   }
   logger.error("started to commit");
   writer.commit();
   logger.error("commit finished.");
  }
 }

 private static PorterAnalyzer getAnalyzer() {
  return new PorterAnalyzer(LUCENE_VERSION);
 }
  
 private static void crawlPost(TUrl url, IndexWriter writer)
   throws IOException {
  logger.info(url.getLoc() + " : " + url.getLastmod());
  HttpRequestFactory requestFactory = HTTP_TRANSPORT
    .createRequestFactory();
  HttpRequest request = requestFactory.buildGetRequest(new GenericUrl(url
    .getLoc()));
  HttpResponse response = request.execute();

  String html = response.parseAsString();
  Document luceneDoc = new Document();
  luceneDoc.add(new StringField(Fields.ID, url.getLoc(), Store.YES));
  luceneDoc.add(new TextField(Fields.URL, url.getLoc(), Store.YES));

  luceneDoc.add(new TextField(Fields.RAWCONTENT, html, Store.YES));

  ArticleExtractor articleExtractor = ArticleExtractor.getInstance();

  org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
  luceneDoc.add(new TextField(Fields.TITLE, jsoupDoc.title(), Store.YES));

  html = normalize(jsoupDoc);
  try {
   String mainContent = articleExtractor.getText(html);
   luceneDoc.add(new TextField(Fields.MAINCONTENT, mainContent,
     Store.YES));
  } catch (BoilerpipeProcessingException e) {
   throw new RuntimeException(e);
  }
  luceneDoc.add(new LongField(Fields.LASTMOD, url.getLastmodDate()
    .getTime(), Store.YES));
  writer.addDocument(luceneDoc);
 }
}
BloggerCrawler Servlet
We can call the BloggerCrawler servlet manually to test our crawler. When we test or call the servlet manually, we set maxseconds to a smaller value because of the GAE request handler time limit; when we call it from the cron task, we set it to 8 minutes (the time limit for a task is 10 minutes).
public class BloggerCrawler extends HttpServlet {
 private static final Logger logger = LoggerFactory
   .getLogger(BloggerCrawler.class);
 protected void doGet(HttpServletRequest req, HttpServletResponse resp)
   throws ServletException, IOException {

  String site = Preconditions.checkNotNull(req.getParameter("sitename"),
    "site can't be null");

  String indexName = site;
  if (site.endsWith("blogspot.com")) {
   throw new IllegalArgumentException("not valid sitename: " + site);
  }
  String sitemapUrl = "http://" + site + ".blogspot.com/sitemap.xml";

  int maxseconds = getMaxSeconds(req);
  logger.info("started to crawl " + sitemapUrl);
  LAEUtil.crawl(indexName, sitemapUrl, maxseconds);
  super.doGet(req, resp);
 }
 private int getMaxSeconds(HttpServletRequest req) {
  int maxseconds = 40;
  String str = req.getParameter("maxseconds");
  if (str != null) {
   maxseconds = Integer.parseInt(str);
  }
  return maxseconds;
 }
}

Scheduled Crawler with GAE Cron
We can use GAE cron to call the crawler servlet periodically, for example every 12 hours. All we need to do is add the cron task to cron.xml:
Check Scheduled Tasks With Cron for Java for more about GAE cron.
Note that the local development server does not execute cron jobs, nor does it show the Cron Jobs link; the actual App Engine will show and execute them.
<cronentries>
  <cron>
    <url>/crawl?sitename=lifelongprogrammer&amp;maxseconds=480</url>
    <description>Crawl lifelongprogrammer every 12 hours</description>
    <schedule>every 12 hours</schedule>
  </cron>
</cronentries>
References
lucene-appengine
GAE: Using Apache Maven
Scheduled Tasks With Cron for Java

Handling gzip Response in Apache HttpClient 4.2


The Problem
My application uses Apache HttpClient 4.2, but when it sends requests to some web pages, the response is garbled characters.

Using Fiddler's Composer to execute the request, I found the response is gzipped:
Content-Encoding: gzip

The Solution
In Apache HttpClient 4.2, the DefaultHttpClient doesn't support compression, so it doesn't decompress the response. We have to use DecompressingHttpClient.
public void usingDefaultHttpClient() throws Exception {
  // output would be garbled characters in http client 4.2.
  HttpClient httpClient = new DefaultHttpClient();
  getContent(httpClient, new URI(URL_STRING));
}

public void usingDecompressingHttpClient() throws Exception {
  // use DecompressingHttpClient to handle gzip response in  http client 4.2.
  HttpClient httpClient = new DecompressingHttpClient(
      new DefaultHttpClient());
  getContent(httpClient, new URI(URL_STRING));
}

private void getContent(HttpClient httpClient, URI url) throws IOException,
    ClientProtocolException {
  HttpGet httpGet = new HttpGet(url);
  HttpResponse httpRsp = httpClient.execute(httpGet);
  String text = EntityUtils.toString(httpRsp.getEntity());

  for (Header header : httpRsp.getAllHeaders()) {
    System.out.println(header);
  }
  System.out.println(text);
}
The problem can also be fixed by upgrading HttpClient to 4.3.5: in this version the default http client supports compression.

Also, in HttpClient 4.3.5 DefaultHttpClient is deprecated; it's recommended to use HttpClientBuilder instead:
public void usingHttpClientBuilderIn43() throws Exception {
  HttpClientBuilder builder = HttpClientBuilder.create();
  CloseableHttpClient httpClient = builder.build();
  getContent(httpClient, new URI(URL_STRING));
}

Solr: Using Classifier to Categorize Articles


The Goal
In my latest project, I use crawler4j to crawl websites and the Solr summarizer to add a summary for each article.
Now I would like to use Solr Classification to categorize articles into different categories, such as Java, Linux, News, etc.

Using Solr Classifier
There are two steps when using Solr Classification:

Train
First we add docs with a known category. We can crawl known websites: for example, assign java to the cat field for articles from javarevisited, linux for articles from linuxcommando, solr for articles from solr.pl, and so on.
localhost:23456/solr/crawler/crawler?action=create,start&name=linuxcommando.blogspot&seeds=http://linuxcommando.blogspot.com/&maxCount=50&parsePaths=http://linuxcommando.blogspot.com/\d{4}/\d{2}/.*&constants=cat:linux

localhost:23456/solr/crawler/crawler?action=create,start&name=javarevisited.blogspot&seeds=http://javarevisited.blogspot.com/&maxCount=50&parsePaths=http://javarevisited.blogspot.com/\d{4}/\d{2}/.*&constants=cat:java

localhost:23456/solr/crawler/crawler?action=create,start&name=solrpl&seeds=http://solr.pl/en/&maxCount=50&parsePaths=http://solr.pl/en/\d{4}/\d{2}/.*&constants=cat:solr

Solr ClassfierUpdateProcessorFactory
public class ClassfierUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {  
  private boolean defaultDoClassifer;
  private String formField;
  private String catField;
  Classifier<BytesRef> classifier = null;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      defaultDoClassifer = params.getBool("doClassifer", false);
      if (defaultDoClassifer) {
        formField = Preconditions.checkNotNull(params.get("fromField"),
            "Have to set fromField");
        catField = Preconditions.checkNotNull(params.get("catField"),
            "Have to set catField");
        
        String classifierStr = params.get("classifier", "simpleNaive");
        if ("simpleNaive".equals(classifierStr)) {
          classifier = new SimpleNaiveBayesClassifier();
        } else if ("knearest".equalsIgnoreCase(classifierStr)) {
          classifier = new KNearestNeighborClassifier(10);
        } else {
          throw new IllegalArgumentException("Unsupported classifier: "
              + classifier);
        }
      }
    }
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new ClassfierUpdateProcessor(req, next);
  }
  
  private class ClassfierUpdateProcessor extends UpdateRequestProcessor {
    private SolrQueryRequest req;
    public ClassfierUpdateProcessor(SolrQueryRequest req,
        UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrParams params = req.getParams();
      boolean doClassifer = params.getBool("doClassifer", false);
      
      if (doClassifer) {
        try {
          classifier.train(req.getSearcher().getAtomicReader(), formField,
              catField, new StandardAnalyzer(Version.LUCENE_42));
          SolrInputDocument doc = cmd.solrDoc;
          Object obj = doc.getFieldValue(formField);
          if (obj != null) {
            String text = obj.toString();
            ClassificationResult<BytesRef> result = classifier
                .assignClass(text);
            
            String classified = result.getAssignedClass().utf8ToString();
            doc.addField(catField, classified);
          }
        } catch (IOException e) {
          throw new IOException(e);
        }
      }
      super.processAdd(cmd);
    } 
  } 
}
solrconfig.xml
Please check the previous post about the implementation of MainContentUpdateProcessorFactory.
<updateRequestProcessorChain name="crawlerUpdateChain">
  <processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
    <str name="fromField">rawcontent</str>
    <str name="mainContentField">maincontent</str>      
  </processor>

  <processor class="org.lifelongprogrammer.solr.update.processor.ClassfierUpdateProcessorFactory">
    <bool name="doClassifer">true</bool>
    <str name="fromField">maincontent</str>
    <str name="catField">cat</str>
  </processor>
  
  <processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory" >
  </processor>

  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<field name="rawcontent" type="text" indexed="false" stored="true" multiValued="true" />
<field name="maincontent" type="text" indexed="true" stored="true" multiValued="true" />
<field name="cat" type="string" indexed="true" stored="true" multiValued="true" />
<field name="summary" type="text_rev" indexed="true" stored="true" multiValued="true" />
Test Solr Classifier
Next, when we crawl a website that contains articles from multiple categories, we can use Solr Classification to assign a category to each article.

For example, let's crawl lifelongprogrammer.blogspot
localhost:23456/solr/crawler/crawler?action=create,start&name=lifelongprogrammer.blogspot&seeds=http://lifelongprogrammer.blogspot.com/&maxCount=50&parsePaths=http://lifelongprogrammer.blogspot.com/\d{4}/\d{2}/.*&doClassifer=true

We set doClassifer=true, so the ClassfierUpdateProcessorFactory will call the Solr Classifier to assign a label to the category field.

From the result, we can see some articles are assigned to Java, some to Linux, and some to Solr.

About Accuracy
The accuracy of Solr Classification is worse than Mahout's, but its performance is much better, and it's good enough for my application.


References
[SOLR-3975] Document Summarization toolkit, using LSA techniques
Comparing Document Classification Functions of Lucene and Mahout
Text categorization with Lucene and Solr

Solr: Using Summarizer (Solr-3975) to Get Summaries of Articles


The Goal
In my latest project, I use crawler4j to crawl websites, and then I would like to add a summary to each article.

After a Google search I found the Solr JIRA Solr-3975, Document Summarization toolkit, using LSA techniques, and the author's articles (Document Summarization with LSA #1: Introduction) describing how it works.

It's not checked in, but it works fine for me.
So I based my work on it: use boilerpipe to get the main content of the web page, then use Solr-3975 to get the most important sentences.

Normalize Html Text and Get Main Content: MainContentUpdateProcessorFactory
First, I use Jsoup to normalize the HTML text: remove links, as they are usually used for navigation or contain JavaScript code, and also remove invisible blocks (style~=display:\\s*none).

To help Solr-3975 pick out important sentences, I add a period (.) after div, span, and textarea elements whose own text doesn't end with a period.
<processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
  <str name="fromField">rawcontent</str>
  <str name="mainContentField">maincontent</str>      
</processor>
It parses fromField, which contains the raw content of the web page, and stores the extracted main content in mainContentField.
public class MainContentUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {
  
  private String fromField;
  private String mainContentField;
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromField = Preconditions.checkNotNull(params.get("fromField"),
          "Have to set fromField");
      mainContentField = Preconditions.checkNotNull(
          params.get("mainContentField"), "Have to set fromField");
    }
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new MainContentUpdateProcessor(req, next, fromField,
        mainContentField);
  }
  
  private static class MainContentUpdateProcessor extends
      UpdateRequestProcessor {
    private String fromField;
    private String mainContentField;
    private ArticleExtractor articleExtractor;
    
    public MainContentUpdateProcessor(SolrQueryRequest req,
        UpdateRequestProcessor next, String fromField, String mainContentField) {
      super(next);
      this.fromField = fromField;
      this.mainContentField = mainContentField;
      articleExtractor = ArticleExtractor.getInstance();
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      Object obj = doc.getFieldValue(fromField);
      if (obj != null) {
        try {
          String text = obj.toString();
          text = normalize(text);
          String mainContent = articleExtractor.getText(text);

          Document jsoupDoc = Jsoup.parse(mainContent);
          mainContent = jsoupDoc.text();
          doc.addField(mainContentField, mainContent);
        } catch (BoilerpipeProcessingException e) {
          throw new IOException(e);
        }
      }
      super.processAdd(cmd);
    }
    
    private String normalize(String text) {
      Document doc = Jsoup.parse(text);
      doc.select("a, [style~=display:\\s*none]").remove();
      Elements divs = doc.select("textarea, span, div");
      for (Element tmp : divs) {
        String html = tmp.html();
        if (tmp.childNodeSize() == 1) {
          // && !html.endsWith(".")
          String ownText = tmp.ownText();
          if (ownText != null && !ownText.trim().equals("")
              && !ownText.endsWith(".")) {
            html += ".";
            tmp.html(html);
          }
        }
      }
      return doc.html();
    }
  }
}
Get Summarization
Define DocumentSummaryUpdateProcessorFactory in solrconfig.xml
Let's first look at the definition of DocumentSummaryUpdateProcessorFactory:
<processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory" >
  <str name="summary.type">text_lsa</str>
  <str name="summary.fromField">maincontent</str>
  <str name="summary.summaryField">summary</str>
  <str name="summary.hl_start"/>
  <str name="summary.hl_end" />     
  <bool name="summary.simpleformat">true</bool>
  <int name="summary.count">3</int>
</processor>
It parses summary.fromField (maincontent in this case), gets the most important summary.count (3) sentences, and puts them into summary.summaryField (summary in this case). summary.hl_start and summary.hl_end are empty, as we just want the plain text and don't want HTML tags (like em or bold) highlighting important words.
summary.simpleformat is an internally used argument that tells the summarizer to only return the highlighted section: no stats, terms, or sentences sections.
DocumentSummaryUpdateProcessorFactory 
As some web pages define og:description, which gives one or two sentences, we can use it directly.
If og:description is defined, then we use the summarizer to get the most important summary.count (3) - 1 = 2 sentences.
public class DocumentSummaryUpdateProcessorFactory extends
    UpdateRequestProcessorFactory implements SolrCoreAware {
  private SummarizerOutputFormat outputFormat;
  private Map<String,String> summarizerParams = new HashMap<>();
  private Analyzer analyzer;
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      
      Iterator<String> it = params.getParameterNamesIterator();
      String prefix = "summary.";
      
      while (it.hasNext()) {
        String paramName = it.next();
        if (paramName.startsWith(prefix)) {
          summarizerParams.put(paramName.substring(prefix.length()),
              params.get(paramName));
        }
      }
      outputFormat = getSummarizeOutputFormat(summarizerParams);
    }
  }
  public void inform(SolrCore core) {
    analyzer = getAnalyzer(core, summarizerParams);
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    
    return new DocumentSummaryUpdateProcessor(next, req, analyzer,
        summarizerParams, outputFormat);
  }
  
  private Analyzer getAnalyzer(SolrCore core, Map<String,String> params) {
    FieldType fType = null;
    if (params.containsKey("type")) {
      fType = core.getSchema().getFieldTypeByName(params.get("type"));
      if (fType == null) {
        throw new IllegalArgumentException("field type not found: "
            + params.get("type"));
      } else {
        return fType.getAnalyzer();
      }
    } else if (params.containsKey("fl")) {
      fType = core.getSchema().getFieldType(params.get("fl"));
      if (fType == null) {
        throw new IllegalArgumentException("field not found: "
            + params.get("type"));
      } else {
        return fType.getAnalyzer();
      }
    } else {
      throw new IllegalArgumentException("need field name or type");
    }
  }
  
  private SummarizerOutputFormat getSummarizeOutputFormat(
      Map<String,String> params) {
    SummarizerOutputFormat outputFormat = new SummarizerOutputFormat();
    boolean simpleformat = false;
    if (params.containsKey("simpleformat")) {
      simpleformat = Boolean.parseBoolean(params.remove("simpleformat"));
    }
    outputFormat.setHighlightedOnly(simpleformat);
    int count = -1;
    if (params.containsKey("count")) {
      count = Integer.parseInt(params.remove("count"));
    }
    outputFormat.setHighlightedCount(count);
    return outputFormat;
  }
  
  private static class DocumentSummaryUpdateProcessor extends
      UpdateRequestProcessor {
    private SolrQueryRequest req;
    private SummarizerOutputFormat outputFormat;
    private Analyzer analyzer;
    private String fromField;
    private String summaryField;
    private SchemaSummarizer summarizer;
    public DocumentSummaryUpdateProcessor(UpdateRequestProcessor next,
        SolrQueryRequest req, Analyzer analyzer,
        Map<String,String> summarizerParams, SummarizerOutputFormat outputFormat) {
      super(next);
      this.req = req;
      this.analyzer = analyzer;
      this.outputFormat = outputFormat;
      fromField = Preconditions.checkNotNull(summarizerParams.get("fromField"),
          "have to set fromField");
      
      summaryField = Preconditions.checkNotNull(
          summarizerParams.get("summaryField"), "have to set summaryField");
      summarizer = new SchemaSummarizer(summarizerParams, Locale.getDefault());
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      // use og:description
      String og_description = null;
      Object obj = doc.getFieldValue("og:description");
      
      int count = 0;
      if (obj != null) {
        og_description = obj.toString();
        doc.addField(summaryField, og_description);
        ++count;
      }
      
      obj = doc.getFieldValue(fromField);
      if (obj != null) {
        NamedList summary = doSummary(summarizer, analyzer, obj.toString(),
            req.getParams());
        NamedList highlighted = (NamedList) summary.get("highlighted");
        List<NamedList> list = highlighted.getAll("sentence");
        
        for (NamedList<Object> sentence : list) {
          if (count < outputFormat.getHighlightedCount()) {
            String value = sentence.get("text").toString();
            if (value.equals(og_description)) continue;
            ++count;
            doc.addField(summaryField, value);
          } else {
            break;
          }
        }
      }
      super.processAdd(cmd);
    }
    
    private NamedList<Object> doSummary(Summarizer sz, Analyzer analyzer,
        String text, SolrParams solrParams) throws IOException {
      long start = System.currentTimeMillis();
      sz.startSummary();
      sz.addDocument(text, analyzer);
      NamedList<Object> summary = new NamedList<Object>();
      sz.finishSummary(summary, outputFormat, start);
      return summary;
    }
  }
}
Summarizer in Action
Now let's use our crawler to crawl one web page: Official: Debris Sign of Spaceship Breaking Up, and check the summarization.
curl "http://localhost:23456/solr/crawler/crawler?action=start&seeds=http://abcnews.go.com/Health/wireStory/investigators-branson-spacecraft-crash-site-26619288&maxCount=1&constants=cat:news"
The summaries saved in the doc:
<arr name="summary">
  <str>
Investigators looking into what caused the crash of a Virgin Galactic prototype spacecraft that killed one of two test pilots said a 5-mile path of debris across the California desert indicates the aircraft broke up in flight. "When the wreckage is dispersed like that, it indicates the...
  </str>
  <str>
"We are determined to find out what went wrong," he said, asserting that safety has always been the top priority of the program that envisions taking wealthy tourists six at a time to the edge of space for a brief experience of weightlessness and a view of Earth below.
  </str>
  <str>
In grim remarks at the Mojave Air and Space Port, where the craft known as SpaceShipTwo was under development, Branson gave no details of Friday's accident and deferred to the NTSB, whose team began its first day of investigation Saturday.
  </str>
</arr>
The first one is the og:description defined in the web page; the other two are the two most important sentences the summarizer found.
References
Solr-3975 Document Summarization toolkit, using LSA techniques
Document Summarization with LSA #1: Introduction
