Text Mining: Integrate OpenNLP, UIMA AS and Solr

This series shows how to integrate OpenNLP, UIMA and Solr.
Integrate OpenNLP with UIMA
How to install UIMA, build the OpenNLP pear, and run it in CVD or UIMA Simple Server.
Integrate OpenNLP, UIMA and Solr via SOAP Web Service
How to deploy the OpenNLP UIMA pear as a SOAP web service and integrate it with Solr.
Integrate OpenNLP, UIMA AS and Solr
How to deploy the OpenNLP UIMA pear as a UIMA AS service and integrate it with Solr.

Please refer to part 1 for how to install UIMA and build the OpenNLP UIMA pear.
Deploy OpenNLP Pear as UIMA AS Service
UIMA AS (Asynchronous Scaleout) is the next-generation scalability replacement for the Collection Processing Manager (CPM).

Download the UIMA AS binary package, unzip it, then run bin/startBroker.bat to start the ActiveMQ broker, which must be running before UIMA AS services can be deployed.
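
Since deployment fails when the broker is not up, it can help to check the broker port before running the deploy script. The following is a minimal sketch, not part of UIMA AS: the class name BrokerCheck and the 2-second timeout are assumptions, and tcp://localhost:61616 is the broker URL used in the descriptors later in this post. It only tests that the TCP port accepts connections, not that ActiveMQ itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;

// Hypothetical helper (not shipped with UIMA AS): checks that the broker port accepts TCP connections.
public class BrokerCheck {

    /** Returns true if a TCP connection to the broker URL can be opened within timeoutMs. */
    public static boolean isBrokerReachable(String brokerUrl, int timeoutMs) {
        URI uri = URI.create(brokerUrl); // e.g. tcp://localhost:61616
        if (uri.getHost() == null || uri.getPort() < 0) {
            throw new IllegalArgumentException("Expected a URL like tcp://host:port, got: " + brokerUrl);
        }
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(uri.getHost(), uri.getPort()), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // connection refused, timed out, or host not resolvable
        }
    }

    public static void main(String[] args) {
        // Assumed default: the broker URL used in this post's descriptors.
        String brokerUrl = args.length > 0 ? args[0] : "tcp://localhost:61616";
        boolean up = isBrokerReachable(brokerUrl, 2000);
        System.out.println(brokerUrl + (up ? " is reachable" : " is NOT reachable"));
    }
}
```

Run it after startBroker.bat; if it reports that the broker is not reachable, fix that before deploying any service.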

Then use deployAsyncService.cmd (deployAsyncService.sh on Linux) to deploy UIMA AS services: deployAsyncService.cmd [testDD.xml] [-brokerURL url]

To deploy a pear, we have to use UIMA AS 2.4.2 or newer; 2.3.1 doesn't work.

First unzip OpenNlpTextAnalyzer.pear to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer.
Create the pear descriptor opennlp.uima.OpenNlpTextAnalyzer_pear.xml in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer:
<?xml version="1.0" encoding="UTF-8"?>
<pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
    <pearPath>%PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer</pearPath>
</pearSpecifier>
Then, in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer, create a UIMA AS deployment descriptor, Deploy_OpenNLP.xml, like the one below. For reference, see the AS deployment descriptors in uima-as-%version%-bin\examples\deploy\as.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDeploymentDescription
  xmlns="http://uima.apache.org/resourceSpecifier">
  <name>OpenNLP Text Analyzer</name>
  <description>Deploys OpenNLP text analyzer.</description>

  <deployment protocol="jms" provider="activemq">
    <service>
      <inputQueue endpoint="OpenNLP-service"
                  brokerURL="tcp://localhost:61616"/>
      <topDescriptor>
        <import location="opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
      </topDescriptor>
    </service>
  </deployment> 
</analysisEngineDeploymentDescription>
Then run: deployAsyncService.cmd %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer\Deploy_OpenNLP.xml
Test OpenNLP Pear UIMA AS Service in CVD
Following uima-as-%version%-bin\examples\descriptors\as\MeetingDetectorAsyncAE.xml, we create a client descriptor, OpenNLPAsyncAEClient.xml; we just need to change the endpoint to OpenNLP-service.
<customResourceSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
   <resourceClassName>org.apache.uima.aae.jms_adapter.JmsAnalysisEngineServiceAdapter</resourceClassName>
   <parameters>
     <parameter name="brokerURL" value="tcp://localhost:61616"/>
     <parameter name="endpoint" value="OpenNLP-service"/>
     <parameter name="timeout" value="5000"/>
     <parameter name="getmetatimeout" value="5000"/>
     <parameter name="cpctimeout" value="5000"/>
   </parameters>
</customResourceSpecifier>
Then in CVD, click "Run" -> "Load AE" to load OpenNLPAsyncAEClient.xml, then test it.
Integrate OpenNLP-UIMA with Solr
We can use Solr's UIMAUpdateRequestProcessorFactory to send text to the OpenNLP UIMA AS service for analysis whenever a document is added to Solr; UIMAUpdateRequestProcessorFactory then saves the information UIMA extracts into Solr fields.

To call the UIMA AS service, we first need to put the client descriptor, OpenNLPAsyncAEClient.xml, in the solr/collection1/conf folder.
Then we wrap the UIMA AS service as a delegate of an aggregate analysis engine.

We create an analysis engine descriptor file, AggragateOpenNLPAsyncService.xml, like the one below.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="OpenNLPAsyncAE">
      <import location="OpenNLPAsyncAEClient.xml"/>
    </delegateAnalysisEngine>    
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>ExtServicesAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters searchStrategy="language_fallback">
    </configurationParameters>
    <flowConstraints>
      <fixedFlow>
         <node>OpenNLPAsyncAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs/>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

Define dynamicField *_mxf in schema.xml:
<dynamicField name="*_mxf" type="text" indexed="true" stored="true"  multiValued="true"/>
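
The single *_mxf pattern is what lets every entity field used in this post (person_mxf, location_mxf, and so on) exist without its own declaration in schema.xml. As a rough illustration of the matching rule, here is a small sketch that mimics a Solr dynamicField glob with one leading or trailing asterisk; the class DynamicFieldCheck is hypothetical and is not Solr's own implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that imitates dynamicField name matching; not Solr code.
public class DynamicFieldCheck {

    /** Matches a pattern with a single '*' at the start or the end, as dynamicField names use. */
    public static boolean matches(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName); // no wildcard: exact match only
    }

    public static void main(String[] args) {
        // Field names taken from the fieldMappings in this post, plus the plain "content" field.
        List<String> fields = Arrays.asList(
                "date_mxf", "location_mxf", "money_mxf", "organization_mxf",
                "percentage_mxf", "sentence_mxf", "time_mxf", "person_mxf", "content");
        for (String field : fields) {
            System.out.println(field + " matches *_mxf: " + matches("*_mxf", field));
        }
    }
}
```

Any mapped field whose name does not end in _mxf would need its own field definition in schema.xml.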
Now we update solrconfig.xml to include this UIMA analysis engine.
<updateRequestProcessorChain name="opennlp-uima-as" default="true">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
        </lst>
        <str name="analysisEngine">%REPLACE_THIS%\AggragateOpenNLPAsyncService.xml</str>
        <bool name="ignoreErrors">false</bool>
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>content</str>
          </arr>
        </lst>
        <lst name="fieldMappings">
          <lst name="type">
            <str name="name">opennlp.uima.Date</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">date_mxf</str>
            </lst>
          </lst>           
          <lst name="type">
            <str name="name">opennlp.uima.Location</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">location_mxf</str>
            </lst>
          </lst> 

          <lst name="type">
            <str name="name">opennlp.uima.Money</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">money_mxf</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">opennlp.uima.Organization</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">organization_mxf</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">opennlp.uima.Percentage</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">percentage_mxf</str>
            </lst>
          </lst> 
          <lst name="type">
            <str name="name">opennlp.uima.Sentence</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">sentence_mxf</str>
            </lst>
          </lst> 
          <lst name="type">
            <str name="name">opennlp.uima.Time</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">time_mxf</str>
            </lst>
          </lst>           
          <lst name="type">
            <str name="name">opennlp.uima.Person</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">person_mxf</str>
            </lst>
          </lst>           
        </lst>
      </lst>
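    </processor>
    <!-- Without RunUpdateProcessorFactory the analyzed document is never actually indexed. -->
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>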
After that we can call:
http://localhost:8080/solr/update?update.chain=opennlp-uima-as&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>

Then we query http://localhost:8080/solr/select?q=id:1 and can see the extracted entities, such as organization, person name, location, time, date, money, and percentage.
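
When the update request is sent from code rather than pasted into a browser, the XML in stream.body has to be URL-encoded. Here is a minimal sketch of building that URL with the JDK only; the class SolrUpdateUrl and its build method are hypothetical names, and the base URL and the chain name opennlp-uima-as are taken from the configuration above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the Solr update URL with a URL-encoded stream.body.
public class SolrUpdateUrl {

    public static String build(String solrBase, String chain, String addXml) {
        try {
            return solrBase + "/update?update.chain=" + URLEncoder.encode(chain, "UTF-8")
                    + "&commit=true&stream.body=" + URLEncoder.encode(addXml, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<add><doc><field name=\"id\">1</field>"
                + "<field name=\"content\">some text here</field></doc></add>";
        // Prints a URL that can be passed to any HTTP client.
        System.out.println(build("http://localhost:8080/solr", "opennlp-uima-as", doc));
    }
}
```

The chain name passed here must match the name attribute of the updateRequestProcessorChain in solrconfig.xml; otherwise the request will not go through the UIMA processor.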
Resources
UIMA Documentation Overview
UIMA Asynchronous Scaleout Documentation Overview
Refer to Re: Error deploying pear on AS 2.4.2