Running Stanford Sentiment Analysis in UIMA

The Goal
In the previous post, we showed how to run Stanford NER (Named Entity Recognition) in UIMA. In this post we integrate Stanford Sentiment Analysis into UIMA.

StanfordNLPAnnotator
Feature Structure: org.apache.uima.stanfordnlp.input:action
We use StanfordNLPAnnotator as the gateway, or facade. The client sets org.apache.uima.stanfordnlp.input:action to specify what to extract: action=ner runs named entity extraction, and action=sentiment runs sentiment analysis.
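
The annotator code splits the action value on commas, so one request can ask for several actions. A minimal sketch of that parsing in plain Java, without Guava; the class name ActionParser is hypothetical and only illustrates the idea:

```java
import java.util.ArrayList;
import java.util.List;

final class ActionParser {
    // Split a comma-separated action value, trimming whitespace
    // and dropping empty entries.
    static List<String> parse(String action) {
        List<String> result = new ArrayList<>();
        if (action == null) {
            return result;
        }
        for (String part : action.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("ner, sentiment")); // prints [ner, sentiment]
    }
}
```

The annotator then only needs to check whether the resulting list contains "sentiment".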

The feature org.apache.uima.standfordnlp.output:type (spelled "standfordnlp" in the code and the descriptor) holds the sentiment of the whole article: very negative, negative, neutral, positive or very positive.

The configuration parameter SentiwordnetFile specifies the path of the SentiWordNet file.

How it Works
First, it ignores any sentence that doesn't contain an opinionated word. It uses SentiWordNet to check whether the sentence contains a non-neutral adjective.
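
The filtering idea can be sketched as follows. This is only an illustration under stated assumptions: OpinionFilter and its tiny in-memory lexicon are hypothetical stand-ins for the SentiWordNet lookup, and the scores are made-up values, not real SentiWordNet entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OpinionFilter {
    // Hypothetical stand-in for SentiWordNet: adjective -> polarity score.
    private final Map<String, Double> adjectiveScores = new HashMap<>();

    OpinionFilter() {
        // Illustrative values only.
        adjectiveScores.put("great", 0.75);
        adjectiveScores.put("terrible", -0.625);
        adjectiveScores.put("daily", 0.0);
    }

    // A sentence is kept when at least one token is a non-neutral adjective.
    boolean isOpinionated(List<String> tokens) {
        for (String token : tokens) {
            Double score = adjectiveScores.get(token.toLowerCase());
            if (score != null && score != 0.0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        OpinionFilter filter = new OpinionFilter();
        System.out.println(filter.isOpinionated(
                Arrays.asList("The", "battery", "is", "great", "."))); // prints true
        System.out.println(filter.isOpinionated(
                Arrays.asList("I", "charge", "it", "daily", "."))); // prints false
    }
}
```

Only the sentences that pass this check are handed to the sentiment model, which keeps purely factual sentences from diluting the result.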

Then it calls the Stanford NLP Sentiment Analysis tool to process the text.
Stanford NLP Sentiment Analysis ships two model files: edu/stanford/nlp/models/sentiment/sentiment.ser.gz, which maps sentiment to 5 classes (very negative, negative, neutral, positive or very positive), and edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz, which maps sentiment to 2 classes (negative or positive).

We use edu/stanford/nlp/models/sentiment/sentiment.ser.gz, but it sometimes tends to misclassify non-negative text as negative.

For example, it maps the following sentence to negative, while the binary model correctly maps it to positive:
I was able to stream video and surf the internet for well over 7 hours without any hiccups .

To fix this, when the 5-class model (sentiment.ser.gz) maps a sentence to negative, we run the binary model to recheck it. If the binary model agrees (it also reports negative), we keep the result; otherwise we change it to positive.
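
The recheck rule can be sketched in isolation. The class name SentimentRecheck is hypothetical; the class indices follow the annotator code, where the 5-class model uses 1 for negative and 3 for positive, and the binary model uses 0 for negative. The binary model is passed as a supplier so that it only runs when the 5-class result is negative.

```java
import java.util.function.IntSupplier;

final class SentimentRecheck {
    static final int NEGATIVE = 1; // 5-class index for "negative"
    static final int POSITIVE = 3; // 5-class index for "positive"

    // binaryModel yields 0 (negative) or 1 (positive) and is only
    // invoked when the 5-class prediction is "negative".
    static int reconcile(int fiveClass, IntSupplier binaryModel) {
        if (fiveClass != NEGATIVE) {
            return fiveClass;
        }
        return binaryModel.getAsInt() == 0 ? NEGATIVE : POSITIVE;
    }

    public static void main(String[] args) {
        System.out.println(reconcile(1, () -> 1)); // prints 3: binary model disagrees
        System.out.println(reconcile(1, () -> 0)); // prints 1: binary model agrees
        System.out.println(reconcile(4, () -> 0)); // prints 4: not rechecked
    }
}
```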

We compute a score for every sentence and map the average score to one of the 5 classes. A negative sentence gets a smaller weight because we trust that prediction less.
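
As a worked example of the averaging, here is a small standalone sketch. The weights (-5, -1, 0, 2, 5) and the thresholds come from the annotator code; the class name SentimentAverager and the choice to return "neutral" when no sentence was scored are assumptions made for this illustration.

```java
final class SentimentAverager {
    // Per-class weights; the index is the 5-class prediction (0..4).
    private static final double[] WEIGHTS = {-5, -1, 0, 2, 5};

    static String label(int[] predictedClasses) {
        double total = 0;
        int count = 0;
        for (int c : predictedClasses) {
            if (c < 0 || c >= WEIGHTS.length) {
                continue; // skip unknown classes
            }
            total += WEIGHTS[c];
            count++;
        }
        if (count == 0) {
            return "neutral"; // assumption: nothing scored means neutral
        }
        double avg = total / count;
        if (avg > 2) return "very positive";
        if (avg > 0.5) return "positive";
        if (avg > -0.5) return "neutral";
        if (avg > -2) return "negative";
        return "very negative";
    }

    public static void main(String[] args) {
        // Two positive sentences and one negative: (2 + 2 - 1) / 3 = 1.0
        System.out.println(label(new int[] {3, 3, 1})); // prints positive
    }
}
```

Because a negative sentence only contributes -1 while a positive one contributes 2, a single negative prediction cannot cancel out a positive one. The annotator's own SentimentAccumulator, in the full source, applies the same weights and thresholds.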
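package org.lifelongprogrammer.nlp;

import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Level;
import org.apache.uima.util.Logger;

import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;
// RNNCoreAnnotations must also be imported; its package differs between
// CoreNLP releases (edu.stanford.nlp.rnn in older ones,
// edu.stanford.nlp.neural.rnn in newer ones).
// SWN3 is the SentiWordNet helper class (not shown in this post);
// import it from wherever it lives in your project.
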
public class StanfordNLPAnnotator extends JCasAnnotator_ImplBase {
	public static final String STANFORDNLP_ACTION_SENTIMENT = "sentiment";
	public static final String TYPE_STANDFORDNLP_OUTPUT = "org.apache.uima.standfordnlp.output";
	public static final String FS_STANDFORDNLP_OUTPUT_TYPE = TYPE_STANDFORDNLP_OUTPUT
			+ ":type";
	public static final String TYPE_STANFORDNLP_INPUT = "org.apache.uima.stanfordnlp.input";
	public static final String FS_STANFORDNLP_INPUT_ACTION = TYPE_STANFORDNLP_INPUT
			+ ":action";

	private static Splitter splitter = Splitter.on(",").trimResults()
			.omitEmptyStrings();
	public static final String SENTIWORDNET_FILE_PARAM = "SentiwordnetFile";

	private StanfordCoreNLP sentiment5ClassesPipeline,
			sentiment2ClassesPipeline;
	private SWN3 sentiwordnet;
	private ExecutorService threadpool;
	private Logger logger;
	public void initialize(UimaContext aContext)
			throws ResourceInitializationException {
		super.initialize(aContext);
		this.logger = getContext().getLogger();
		reconfigure();
	}

	public void reconfigure() throws ResourceInitializationException {
		try {
			threadpool = Executors.newCachedThreadPool();
			String dataPath = getContext().getDataPath();
			Properties props = new Properties();
			props.setProperty("annotators",
					"tokenize, ssplit, parse, sentiment");
			props.put("sentiment.model",
					"edu/stanford/nlp/models/sentiment/sentiment.ser.gz");

			sentiment5ClassesPipeline = new StanfordCoreNLP(props);
			props.put("sentiment.model",
					"edu/stanford/nlp/models/sentiment/sentiment.binary.ser.gz");
			sentiment2ClassesPipeline = new StanfordCoreNLP(props);

			String sentiwordnetFile = (String) getContext()
					.getConfigParameterValue(SENTIWORDNET_FILE_PARAM);
			sentiwordnet = new SWN3(
					new File(dataPath, sentiwordnetFile).getPath());
		} catch (Exception e) {
			logger.log(Level.SEVERE, e.getMessage());
			throw new ResourceInitializationException(e);
		}
	}
	public void process(JCas jcas) throws AnalysisEngineProcessException {
		// final so the anonymous Callable below can capture it
		final CAS cas = jcas.getCas();
		ArrayList<String> action = getAction(cas);
		// Collect the submitted tasks so we can wait for all of them below.
		List<Future<Void>> futures = new ArrayList<Future<Void>>();
		if (action.contains(STANFORDNLP_ACTION_SENTIMENT)) {
			Future<Void> future = threadpool.submit(new Callable<Void>() {
				@Override
				public Void call() throws Exception {
					checkSentiment(cas);
					return null;
				}
			});
			futures.add(future);
		}
		// Block until every submitted task has finished.
		for (Future<Void> future : futures) {
			try {
				future.get();
			} catch (InterruptedException | ExecutionException e) {
				throw new AnalysisEngineProcessException(e);
			}
		}
		logger.log(Level.FINE, "StanfordNLPAnnotator done.");
	}

  
	private void checkSentiment(CAS cas) {
		// Keep only the sentences that contain an opinionated adjective.
		String sentimentText = getSentimentSentence(cas.getDocumentText())
				.toString();

		Annotation annotation = sentiment5ClassesPipeline.process(sentimentText);
		TypeSystem ts = cas.getTypeSystem();
		Type dyOutputType = ts.getType(TYPE_STANDFORDNLP_OUTPUT);
		org.apache.uima.cas.Feature dyOutputTypeFt = ts
				.getFeatureByFullName(FS_STANDFORDNLP_OUTPUT_TYPE);
        
		SentimentAccumulator accumulator = new SentimentAccumulator();
		for (CoreMap sentenceCore : annotation
				.get(CoreAnnotations.SentencesAnnotation.class)) {
			Tree tree = sentenceCore
					.get(SentimentCoreAnnotations.AnnotatedTree.class);
			int predictedClass = RNNCoreAnnotations.getPredictedClass(tree);
			String sentence = sentenceCore.toString();
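			if (predictedClass == 1) {
				// The 5-class model tends to over-predict "negative",
				// so recheck with the binary model.
				int old = predictedClass;
				predictedClass = checkNegative(sentence);
				if (predictedClass != old) {
					System.out.println("Sentiment changed from " + old + " to "
							+ predictedClass + " String: " + sentence);
				}
			}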
			accumulator.accumulate(predictedClass, sentence.length());
		}
		AnnotationFS dyAnnFS = cas.createAnnotation(dyOutputType, 0, 0);
		dyAnnFS.setStringValue(dyOutputTypeFt, accumulator.getResult());
		cas.getIndexRepository().addFS(dyAnnFS);
	}
  
	private ArrayList<String> getAction(CAS cas) {
		TypeSystem ts = cas.getTypeSystem();
		Type dyInputType = ts.getType(TYPE_STANFORDNLP_INPUT);
		org.apache.uima.cas.Feature dyInputTypesFt = ts
				.getFeatureByFullName(FS_STANFORDNLP_INPUT_ACTION);
		FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();
		String action = "";
		while (dyIt.hasNext()) {
			// TODO this is kind of weird
			AnnotationFS afs = (AnnotationFS) dyIt.next();
			String str = afs.getStringValue(dyInputTypesFt);
			if (str != null) {
				action = str;
			}
		}
		return Lists.newArrayList(splitter.split(action));
	}
  
  

	class SentimentAccumulator {
		private double totalScore;
		private int sentCount;
		public SentimentAccumulator() {}
		public void accumulate(int type, int sentLen) {
			// sentLen is currently unused: every sentence has equal weight.
			calc5ClassModel(type);
		}
		private void calc5ClassModel(int type) {
			++sentCount;
			switch (type) {
			case 0: // very negative
				totalScore += -5;
				break;
			case 1:
				totalScore += -1; // give smaller value
				break;
			case 2:
				totalScore += 0;
				break;
			case 3:
				totalScore += 2;
				break;
			case 4:
				totalScore += 5;
				break;
			default:
				// ignore this
				logger.log(Level.SEVERE, "unknown type:" + type);
				--sentCount;
			}
		}

		public String getResult() {
			if (sentCount == 0) {
				// No sentence was scored; avoid dividing by zero.
				return "neutral";
			}
			double avgScore = totalScore / sentCount;
			logger.log(Level.INFO, "avgScore: " + avgScore
					+ ", totalScore: " + totalScore + ", sentCount: "
					+ sentCount);

			if (avgScore > 2) {
				return "very positive";
			} else if (avgScore > 0.5) {
				return "positive";
			} else if (avgScore > -0.5) {
				// (-0.5, 0.5]: neutral
				return "neutral";
			} else if (avgScore > -2) {
				return "negative";
			} else {
				return "very negative";
			}
		}
	}

	public StringBuilder getSentimentSentence(String text) {
		DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(
				text));
		// List<String> sentenceList = new LinkedList<String>();
		StringBuilder sentenceList = new StringBuilder();
		Iterator<List<HasWord>> it = dp.iterator();
		while (it.hasNext()) {
			StringBuilder sentenceSb = new StringBuilder();
			List<HasWord> sentence = it.next();

			boolean hasFeeling = false;
			Iterator<HasWord> inner = sentence.iterator();
			while (inner.hasNext()) {
				HasWord token = inner.next();
				sentenceSb.append(token.word());

				if (inner.hasNext()) {
					sentenceSb.append(" ");
				}
				String feeling = sentiwordnet.extractFelling(token.word(), "a");
				if (!"neutral".equals(feeling)) {
					hasFeeling = true;
					System.out.println(feeling + ":" + token);
				}
			}
			if (hasFeeling) {
				// Add a separator so consecutive sentences are not fused together.
				sentenceList.append(sentenceSb).append(' ');
			}
		}
		return sentenceList;
	}

	private int checkNegative(String sentence) {
		Annotation annotation = sentiment2ClassesPipeline.process(sentence);

		for (CoreMap sentenceCore : annotation
				.get(CoreAnnotations.SentencesAnnotation.class)) {

			Tree tree = sentenceCore
					.get(SentimentCoreAnnotations.AnnotatedTree.class);
			int newPredict = RNNCoreAnnotations.getPredictedClass(tree);
			// if binary checker still returns negative then use negative
			if (newPredict == 0) {
				return 1;
			} else {
				return 3;
			}
		}
		return 1;
	}  
}
Descriptor File: StanfordNLPAnnotator.xml
We define the UIMA types org.apache.uima.stanfordnlp.input and org.apache.uima.standfordnlp.output (the spelling used in the code), and the configuration parameter SentiwordnetFile.
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
	<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
	<primitive>true</primitive>
	<annotatorImplementationName>org.lifelongprogrammer.nlp.StanfordNLPAnnotator</annotatorImplementationName>
	<analysisEngineMetaData>
		<name>StanfordNLPAnnotatorAE</name>
		<description>StanfordNLPAnnotator Wrapper.</description>
		<version>1.0</version>
		<vendor>LifeLong Programmer, Inc.</vendor>
		<configurationParameters>
			<configurationParameter>
				<name>SentiwordnetFile</name>
				<description>Filename of the sentiwordnet file.</description>
				<type>String</type>
				<multiValued>false</multiValued>
				<mandatory>true</mandatory>
			</configurationParameter>
		</configurationParameters>
		<configurationParameterSettings>
			<nameValuePair>
				<name>SentiwordnetFile</name>
				<value>
					<string>dicts\SentiWordNet_3.0.0_20130122.txt</string>
				</value>
			</nameValuePair>
		</configurationParameterSettings>
		<typeSystemDescription>
			<typeDescription>
				<name>org.apache.uima.stanfordnlp.input</name>
				<description />
				<supertypeName>uima.tcas.Annotation</supertypeName>
				<features>
					<featureDescription>
						<name>action</name>
						<description />
						<rangeTypeName>uima.cas.String</rangeTypeName>
					</featureDescription>
				</features>
			</typeDescription>
			<typeDescription>
				<name>org.apache.uima.standfordnlp.output</name>
				<description />
				<supertypeName>uima.tcas.Annotation</supertypeName>
				<features>
					<featureDescription>
						<name>type</name>
						<description />
						<rangeTypeName>uima.cas.String</rangeTypeName>
					</featureDescription>
				</features>
			</typeDescription>
		</typeSystemDescription>
	</analysisEngineMetaData>
</analysisEngineDescription>
Annotator Test case
See the previous post for how to use sujitpal's UimaUtils.java to test the StanfordNLPAnnotator.