The Problem:
We use opennlp-uima to extract entities such as person, organization, location, date, time, money, and percentage. In most cases, however, a client wants only one or a few kinds of entities, for example just person and location.
OpenNlpTextAnalyzer.pear runs all 12 annotators in sequence regardless of what the client asks for, which hurts performance. See the flowConstraints/fixedFlow definition in OpenNlpTextAnalyzer.xml.
We want opennlp-uima to run only the annotators that are actually needed.
The solution: Using ResultSpecification
UIMA's descriptors include a section under the XML capabilities element where the descriptor may specify inputs and outputs. These end up informing the ResultSpecification which is provided to the annotator. The ResultSpecification can be queried by the annotator code to see what the annotator ought to produce.
PersonTitleAnnotator and TutorialDateTime in the uimaj-examples project use the ResultSpecification to check whether the annotator needs to run at all:
    public void process(CAS aCAS) throws AnalysisEngineProcessException {
        // If the ResultSpec doesn't include the PersonTitle type, we have nothing to do.
        if (!getResultSpecification().containsType("example.PersonTitle", aCAS.getDocumentLanguage())) {
            if (!warningMsgShown) {
                logger.log(Level.WARNING, m);
                warningMsgShown = true;
            }
            return;
        }
    }
We need to make the following changes so that opennlp-uima honors the ResultSpecification and skips annotators that are not needed.
1. Update Annotator's analysisEngineDescription outputs to reflect its capabilities
Take PersonNameFinder.xml as an example: we need to add <type>opennlp.uima.Person</type> to its outputs, like below:
    <capabilities>
        <capability>
            <inputs />
            <outputs>
                <type>opennlp.uima.Person</type>
            </outputs>
            <languagesSupported>
                <language>en</language>
            </languagesSupported>
        </capability>
    </capabilities>
Make similar changes in these files: PersonNameFinder.xml, LocationNameFinder.xml, OrganizationNameFinder.xml, DateNameFinder.xml, TimeNameFinder.xml, MoneyNameFinder.xml, PercentageNameFinder.xml, PosTagger.xml, Tokenizer.xml, Parser.xml, Chunker.xml.
Due to a bug in opennlp-uima, in TimeNameFinder.xml we also need to change NameType in nameValuePair from opennlp.uima.Person to opennlp.uima.Time. Otherwise the annotator would classify time phrases such as "this afternoon" and "tomorrow morning" as Persons instead of Times. See "Wrong NameType in TimeNameFinder.xml" for details.
2. Change Annotator's code to honor ResultSpecification
The name finder annotators described by PersonNameFinder.xml, LocationNameFinder.xml, OrganizationNameFinder.xml, DateNameFinder.xml, and TimeNameFinder.xml all extend the same parent class: opennlp.uima.namefind.AbstractNameFinder. We can change its process method like below:
    public final void process(CAS cas) {
        ResultSpecification rs = getResultSpecification();
        boolean run = rs.containsType(mNameType.getName())
                || rs.containsType(mNameType.getName(), cas.getDocumentLanguage());
        if (!run) {
            return;
        }
        // omitted ....
    }
opennlp.uima.parser.Parser:
    public void process(CAS cas) {
        ResultSpecification rs = getResultSpecification();
        boolean run = rs.containsType("opennlp.uima.Parse")
                || rs.containsType("opennlp.uima.Parse", cas.getDocumentLanguage());
        if (!run) {
            return;
        }
    }
opennlp.uima.chunker.Chunker:
    public void process(CAS tcas) {
        ResultSpecification rs = getResultSpecification();
        boolean run = rs.containsType("opennlp.uima.Chunk")
                || rs.containsType("opennlp.uima.Chunk", tcas.getDocumentLanguage());
        if (!run) {
            return;
        }
    }
opennlp.uima.postag.POSTagger:
    public void process(CAS tcas) {
        ResultSpecification rs = getResultSpecification();
        boolean run = rs.containsType("opennlp.uima.Token:pos")
                || rs.containsType("opennlp.uima.Token:pos", tcas.getDocumentLanguage());
        if (!run) {
            return;
        }
    }
Change in Client Side
On the client side, we need to add the desired result types to a ResultSpecification and pass it to org.apache.uima.analysis_engine.AnalysisEngine.process(CAS, ResultSpecification):
    ResultSpecification rs = UIMAFramework.getResourceSpecifierFactory()
            .createResultSpecification();
    rs.addResultType("opennlp.uima.Person", true);
    rs.addResultType("opennlp.uima.Location", true);
    // Pass the result specification built above.
    this.ae.process(this.cas, rs);
In our project, we use UIMA's Regular Expression Annotator to extract entities such as SSNs, phone numbers, and credit card numbers. We define more than 20 entities and their corresponding regexes in its concepts.xml.
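To illustrate the overall effect of the annotator and client changes, here is a small self-contained sketch in plain Java. It does not use the UIMA API: the Annotator class, the alwaysRun flag, and the exampleFlow list are hypothetical stand-ins, and a Set<String> plays the role of the ResultSpecification. It assumes that the Tokenizer keeps running unconditionally (its process method is not changed above), while the other annotators return early when their output type was not requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ResultSpecFilterSketch {

    /** Hypothetical stand-in for an annotator with a single declared output type. */
    static final class Annotator {
        final String name;
        final String outputType;
        final boolean alwaysRun; // assumption: prerequisites such as the tokenizer are never skipped

        Annotator(String name, String outputType, boolean alwaysRun) {
            this.name = name;
            this.outputType = outputType;
            this.alwaysRun = alwaysRun;
        }

        /** Mirrors the early-return check added to each process() method; returns true if work was done. */
        boolean process(Set<String> requestedTypes) {
            if (!alwaysRun && !requestedTypes.contains(outputType)) {
                return false; // output type not requested: skip the expensive work
            }
            // ... the real annotator would analyze the document here ...
            return true;
        }
    }

    /** Runs a fixed flow and reports which annotators actually did work. */
    static List<String> run(List<Annotator> flow, Set<String> requestedTypes) {
        List<String> executed = new ArrayList<>();
        for (Annotator annotator : flow) {
            if (annotator.process(requestedTypes)) {
                executed.add(annotator.name);
            }
        }
        return executed;
    }

    /** A shortened, illustrative flow; the real descriptor lists 12 annotators. */
    static List<Annotator> exampleFlow() {
        return Arrays.asList(
                new Annotator("Tokenizer", "opennlp.uima.Token", true),
                new Annotator("PersonNameFinder", "opennlp.uima.Person", false),
                new Annotator("LocationNameFinder", "opennlp.uima.Location", false),
                new Annotator("TimeNameFinder", "opennlp.uima.Time", false),
                new Annotator("Chunker", "opennlp.uima.Chunk", false),
                new Annotator("Parser", "opennlp.uima.Parse", false));
    }

    public static void main(String[] args) {
        // The client asks only for persons and locations, as in the snippet above.
        Set<String> requested = new HashSet<>(
                Arrays.asList("opennlp.uima.Person", "opennlp.uima.Location"));
        // prints [Tokenizer, PersonNameFinder, LocationNameFinder]
        System.out.println(run(exampleFlow(), requested));
    }
}
```

Note that the fixed flow still visits every annotator; the saving comes from each annotator returning before it does any real work. Annotators whose output other annotators depend on should not be skipped this way unless the dependent types are also absent from the request.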
Resources
UIMA Result Specifications
UIMA References
http://comments.gmane.org/gmane.comp.apache.uima.general/5670