Using ResultSpecification to Filter Annotators and Boost OpenNLP UIMA Performance

The Problem:
We use opennlp-uima to extract entities such as person, organization, location, date, time, money, and percentage. But in most cases, the client only wants one or a few kinds of entities: for example, just person and location.

OpenNlpTextAnalyzer.pear runs all 12 annotators in sequence, whether or not their output is needed, which hurts performance. See the flowConstraints/fixedFlow definition in OpenNlpTextAnalyzer.xml.

We want opennlp-uima to run only the annotators that are actually needed, to boost its performance.

The solution: Using ResultSpecification
UIMA's descriptors include a section under the XML capabilities element where the descriptor may specify inputs and outputs.  These end up informing the ResultSpecification which is provided to the annotator.  The ResultSpecification can be queried by the annotator code to see what the annotator ought to produce.

PersonTitleAnnotator and TutorialDateTime in the uimaj-examples project use ResultSpecification to check whether the annotator needs to run at all, and return early if it does not:

public void process(CAS aCAS) throws AnalysisEngineProcessException {
    // If the ResultSpec doesn't include the PersonTitle type, we have nothing to do.
    if (!getResultSpecification().containsType("example.PersonTitle", aCAS.getDocumentLanguage())) {
      if (!warningMsgShown) {
        // m is the warning text built by the example (omitted here)
        logger.log(Level.WARNING, m);
        warningMsgShown = true;
      }
      return;
    }
    // ... annotation logic omitted
}
We need to make the following changes so that opennlp-uima honors the ResultSpecification and skips annotators that are not needed.
1. Update each annotator's analysisEngineDescription outputs to reflect its capabilities
Take PersonNameFinder.xml as an example: we need to add opennlp.uima.Person to its outputs. Make a similar change, with the matching output type, in these files: LocationNameFinder.xml, OrganizationNameFinder.xml, DateNameFinder.xml, TimeNameFinder.xml, MoneyNameFinder.xml, PercentageNameFinder.xml, PosTagger.xml, Tokenizer.xml, Parser.xml, Chunker.xml. For PersonNameFinder.xml, the capabilities section looks like this:
<capabilities>
  <capability>
    <inputs />
    <outputs>
      <type>opennlp.uima.Person</type>
    </outputs>
    <languagesSupported>
      <language>en</language>
    </languagesSupported>
  </capability>
</capabilities>
Due to a bug in opennlp-uima, we also need to change the NameType value in the nameValuePair of TimeNameFinder.xml from opennlp.uima.Person to opennlp.uima.Time.
See "Wrong NameType in TimeNameFinder.xml" for details; without this fix, the annotator classifies time phrases such as "this afternoon" and "tomorrow morning" as Persons instead of Times.
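After editing this many descriptors, it can help to confirm that each one really declares its outputs. The following is a minimal sketch using only the JDK's XML parser; DescriptorOutputs and outputTypes are hypothetical names, not part of UIMA or opennlp-uima, and the sketch only looks at type elements directly under outputs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper (not a UIMA API): lists the types a descriptor declares as outputs. */
final class DescriptorOutputs {

  /** Returns the text of every type element that is a direct child of an outputs element. */
  static List<String> outputTypes(String descriptorXml) {
    Document doc;
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Namespace-aware so that local names match with or without a default xmlns.
      factory.setNamespaceAware(true);
      doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(descriptorXml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Cannot parse descriptor XML", e);
    }

    List<String> types = new ArrayList<>();
    NodeList outputs = doc.getElementsByTagNameNS("*", "outputs");
    for (int i = 0; i < outputs.getLength(); i++) {
      NodeList children = outputs.item(i).getChildNodes();
      for (int j = 0; j < children.getLength(); j++) {
        Node child = children.item(j);
        if (child.getNodeType() == Node.ELEMENT_NODE && "type".equals(child.getLocalName())) {
          types.add(child.getTextContent().trim());
        }
      }
    }
    return types;
  }

  public static void main(String[] args) {
    // Same capabilities fragment as in the PersonNameFinder.xml example above.
    String xml = "<capabilities><capability><inputs />"
        + "<outputs><type>opennlp.uima.Person</type></outputs>"
        + "<languagesSupported><language>en</language></languagesSupported>"
        + "</capability></capabilities>";
    System.out.println(outputTypes(xml));
  }
}
```

A descriptor whose list comes back empty has no declared output types and is worth a second look.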

2. Change the annotators' code to honor ResultSpecification
The annotators configured by PersonNameFinder.xml, LocationNameFinder.xml, OrganizationNameFinder.xml, DateNameFinder.xml, and TimeNameFinder.xml extend the same parent class: opennlp.uima.namefind.AbstractNameFinder. We can change its process method as follows:
public final void process(CAS cas) {
  ResultSpecification rs = getResultSpecification();
  boolean run = rs.containsType(mNameType.getName())
      || rs.containsType(mNameType.getName(), cas.getDocumentLanguage());
  if (!run) {
    return;
  }
  // omitted ....
}
opennlp.uima.parser.Parser:
public void process(CAS cas) {
  ResultSpecification rs = getResultSpecification();
  boolean run = rs.containsType("opennlp.uima.Parse")
      || rs.containsType("opennlp.uima.Parse", cas.getDocumentLanguage());
  if (!run) {
    return;
  }
  // omitted ....
}
opennlp.uima.chunker.Chunker:
public void process(CAS tcas) {
  ResultSpecification rs = getResultSpecification();
  boolean run = rs.containsType("opennlp.uima.Chunk")
      || rs.containsType("opennlp.uima.Chunk", tcas.getDocumentLanguage());
  if (!run) {
    return;
  }
  // omitted ....
}
opennlp.uima.postag.POSTagger:
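public void process(CAS tcas) {
  ResultSpecification rs = getResultSpecification();
  // "opennlp.uima.Token:pos" names a feature (type:feature); ResultSpecification
  // also provides containsFeature(...) for feature names.
  boolean run = rs.containsType("opennlp.uima.Token:pos")
      || rs.containsType("opennlp.uima.Token:pos", tcas.getDocumentLanguage());
  if (!run) {
    return;
  }
  // omitted ....
}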
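The same guard is repeated in all four process methods above, so it could be pulled into one helper. Here is a minimal sketch of that idea. To keep it compilable without UIMA on the classpath, it uses a stand-in interface, TypeRequestView, that mirrors only the two containsType overloads used in this post; TypeRequestView, FixedRequest, and OutputGuard are all hypothetical names. In a real annotator the helper would take org.apache.uima.analysis_engine.ResultSpecification directly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Stand-in (assumption) for the two ResultSpecification.containsType overloads used above. */
interface TypeRequestView {
  boolean containsType(String typeName);
  boolean containsType(String typeName, String language);
}

/** Toy in-memory request used only for this demo; it ignores the language argument. */
final class FixedRequest implements TypeRequestView {
  private final Set<String> types;

  FixedRequest(String... types) {
    this.types = new HashSet<>(Arrays.asList(types));
  }

  @Override
  public boolean containsType(String typeName) {
    return types.contains(typeName);
  }

  @Override
  public boolean containsType(String typeName, String language) {
    return types.contains(typeName);
  }
}

/** Hypothetical helper holding the guard that each process method repeats. */
final class OutputGuard {
  /** True if the type was requested either language-independently or for the given language. */
  static boolean isRequested(TypeRequestView rs, String typeName, String language) {
    return rs.containsType(typeName) || rs.containsType(typeName, language);
  }

  public static void main(String[] args) {
    TypeRequestView rs = new FixedRequest("opennlp.uima.Person", "opennlp.uima.Location");
    System.out.println(isRequested(rs, "opennlp.uima.Person", "en"));
    System.out.println(isRequested(rs, "opennlp.uima.Chunk", "en"));
  }
}
```

With such a helper, each process method would start with a single check on its own output type and return when the check fails.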
Changes on the Client Side
On the client side, we need to add the wanted result types to a ResultSpecification and pass it when calling org.apache.uima.analysis_engine.AnalysisEngine.process(CAS, ResultSpecification):
  ResultSpecification rs = UIMAFramework.getResourceSpecifierFactory()
      .createResultSpecification();
  rs.addResultType("opennlp.uima.Person", true);
  rs.addResultType("opennlp.uima.Location", true);
  this.ae.process(this.cas, rs);
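If the client passes entity kinds as plain strings, they first have to be translated into UIMA type names. A minimal sketch of that step is below; RequestedTypes and its lower-case keys are hypothetical, and the map only contains type names that appear in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical client-side helper: maps requested entity kinds to UIMA type names. */
final class RequestedTypes {
  // Assumption: lower-case keys chosen for this sketch; values are type names from this post.
  private static final Map<String, String> TYPE_BY_KIND = new LinkedHashMap<>();

  static {
    TYPE_BY_KIND.put("person", "opennlp.uima.Person");
    TYPE_BY_KIND.put("location", "opennlp.uima.Location");
    TYPE_BY_KIND.put("time", "opennlp.uima.Time");
  }

  /** Returns the distinct type names for the given kinds, in request order. */
  static List<String> toTypeNames(List<String> kinds) {
    List<String> typeNames = new ArrayList<>();
    for (String kind : kinds) {
      String typeName = TYPE_BY_KIND.get(kind.trim().toLowerCase(Locale.ROOT));
      if (typeName == null) {
        // Fail fast rather than silently running no annotator for a misspelled kind.
        throw new IllegalArgumentException("Unknown entity kind: " + kind);
      }
      if (!typeNames.contains(typeName)) {
        typeNames.add(typeName);
      }
    }
    return typeNames;
  }

  public static void main(String[] args) {
    System.out.println(toTypeNames(Arrays.asList("person", "location")));
  }
}
```

Each returned name would then be passed to rs.addResultType(name, true), as in the client snippet above.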
In our project, we also use UIMA's Regular Expression Annotator to extract entities such as SSNs, phone numbers, and credit card numbers. We define more than 20 entities and their corresponding regexes in its concepts.xml.

Resources
UIMA Result Specifications
UIMA References
http://comments.gmane.org/gmane.comp.apache.uima.general/5670