Text Mining: Integrate OpenNLP, UIMA and Solr via SOAP Web Service


This series shows how to integrate OpenNLP, UIMA and Solr.
Integrate OpenNLP with UIMA
How to install UIMA, build the OpenNLP pear, and run the OpenNLP pear in CVD or UIMA Simple Server.
Integrate OpenNLP, UIMA and Solr via SOAP Web Service
How to deploy the OpenNLP UIMA pear as a SOAP web service and integrate it with Solr.
Integrate OpenNLP, UIMA AS and Solr
How to deploy the OpenNLP UIMA pear as a UIMA AS service and integrate it with Solr.

Please refer to part 1 for how to install UIMA and build the OpenNLP UIMA pear.
Deploy OpenNLP Pear as SOAP Web Service
See Working with Remote Services for how to deploy a UIMA component as a SOAP web service.
First, unzip OpenNlpTextAnalyzer.pear to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer.
Then create the pear descriptor opennlp.uima.OpenNlpTextAnalyzer_pear.xml in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer, like below.

<?xml version="1.0" encoding="UTF-8"?>
<pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <pearPath>%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer</pearPath>
</pearSpecifier>
Then create the web services deployment descriptor, Deploy_OpenNLP.wsdd. Example WSDD files are provided in the examples/deploy/soap directory of the UIMA SDK. All we need to do is copy one of them (for example, Deploy_NamesAndPersonTitles.wsdd), change the service name to urn:OpenNLP, and change resourceSpecifierPath to point to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer\opennlp.uima.OpenNlpTextAnalyzer_pear.xml. Replace %PEARS_HOME_REPLACE_THIS% with the real location.
<deployment name="OpenNLP" xmlns="http://xml.apache.org/axis/wsdd/"
    xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <service name="urn:OpenNLP" provider="java:RPC">
    <parameter name="scope" value="Request"/>
    <parameter name="className" value="org.apache.uima.adapter.soap.AxisAnalysisEngineService_impl"/>
    <parameter name="allowedMethods" value="getMetaData process"/>
    <parameter name="allowedRoles" value="*"/>
    <parameter name="resourceSpecifierPath" value="%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
    <parameter name="numInstances" value="3"/>
    <parameter name="enableLogging" value="true"/>
    <!-- typeMapping omitted -->
  </service>
</deployment>
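Replacing the placeholder by hand is easy to get wrong, so here is a minimal sketch of how it could be scripted. The helper name fill_placeholder and the C:\pears location are assumptions for illustration only; they are not part of UIMA or Axis. The script substitutes the placeholder and then parses the result, so a broken descriptor fails early instead of at deploy time.

```python
import xml.etree.ElementTree as ET


def fill_placeholder(template: str, placeholder: str, real_path: str) -> str:
    """Hypothetical helper: swap a %...% placeholder for a real path."""
    filled = template.replace(placeholder, real_path)
    # Raises xml.etree.ElementTree.ParseError if the result is not well-formed XML.
    ET.fromstring(filled)
    return filled


# Trimmed copy of the WSDD above; only the parameter with a placeholder is kept.
WSDD_TEMPLATE = r"""<deployment name="OpenNLP" xmlns="http://xml.apache.org/axis/wsdd/"
    xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <service name="urn:OpenNLP" provider="java:RPC">
    <parameter name="resourceSpecifierPath" value="%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
  </service>
</deployment>"""

# C:\pears is an assumed example location, not a path from the UIMA SDK.
filled_wsdd = fill_placeholder(
    WSDD_TEMPLATE,
    "%REPLACE_THIS%",
    r"C:\pears\opennlp.uima.OpenNlpTextAnalyzer",
)
print(filled_wsdd)
```

Plain string replacement is used on purpose: it leaves the rest of the file, including the xmlns:java declaration that the java:RPC provider value relies on, exactly as written.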
If we are using Tomcat, set CATALINA_HOME to the location where Tomcat is installed. If we are using another application server, we may need to update UIMA_CLASSPATH in runUimaClass.bat to include axis\WEB-INF\lib and axis\WEB-INF\classes.

Then run deploytool %FOLDER%\Deploy_OpenNLP.wsdd to deploy the pear as a SOAP service.
Test OpenNLP SOAP Service in CVD
See How to Call a UIMA Service for details.
We need to define a SOAP Service Client Descriptor, OpenNLPSOAPServiceClient.xml:
<?xml version="1.0" encoding="UTF-8" ?> 
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
 <resourceType>AnalysisEngine</resourceType>
 <uri>http://localhost:8080/axis/services/urn:OpenNLP</uri>
 <protocol>SOAP</protocol>
</uriSpecifier>
Then in CVD, click "Run" -> "Load AE" to load OpenNLPSOAPServiceClient.xml, then test it.

Integrate OpenNLP-UIMA with Solr
We can use Solr's UIMAUpdateRequestProcessorFactory to send text to the OpenNLP-UIMA SOAP web service for analysis whenever a document is added to Solr. UIMAUpdateRequestProcessorFactory then saves the information UIMA extracted into Solr fields.

To call the SOAP web service, we first need to put the SOAP Service Client Descriptor, OpenNLPSOAPServiceClient.xml, in the solr/collection1/conf folder.
Then we wrap the SOAP service as part of an aggregate analysis engine.

We create an analysis engine descriptor file, AggragateOpenNLPSOAPService.xml, like below.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="OpenNLPSOAPService">
      <import location="OpenNLPSOAPServiceClient.xml"/>
    </delegateAnalysisEngine>    
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>ExtServicesAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters searchStrategy="language_fallback">
    </configurationParameters>
    <flowConstraints>
      <fixedFlow>
         <node>OpenNLPSOAPService</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs/>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>
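In an aggregate descriptor like this one, each node in fixedFlow has to name a delegateAnalysisEngine key. A quick way to catch a mismatch before Solr loads the file might look like the sketch below. The function name unmatched_flow_nodes is made up for this example, and the descriptor is a trimmed copy of the one above.

```python
import xml.etree.ElementTree as ET

# Namespace used by UIMA descriptors, as declared in the descriptor above.
NS = {"u": "http://uima.apache.org/resourceSpecifier"}


def unmatched_flow_nodes(descriptor_xml: str) -> list:
    """Hypothetical check: return fixedFlow nodes that have no delegate key."""
    root = ET.fromstring(descriptor_xml)
    keys = {d.get("key") for d in root.iterfind(".//u:delegateAnalysisEngine", NS)}
    nodes = [n.text.strip() for n in root.iterfind(".//u:fixedFlow/u:node", NS)]
    return [n for n in nodes if n not in keys]


# Trimmed descriptor: only the delegate list and the flow are kept.
AGGREGATE_SAMPLE = """<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="OpenNLPSOAPService">
      <import location="OpenNLPSOAPServiceClient.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <flowConstraints>
      <fixedFlow>
        <node>OpenNLPSOAPService</node>
      </fixedFlow>
    </flowConstraints>
  </analysisEngineMetaData>
</analysisEngineDescription>"""

missing = unmatched_flow_nodes(AGGREGATE_SAMPLE)
print("unmatched flow nodes:", missing)
```

An empty list means every flow node resolves to a delegate; any name in the list points at a typo in either the key or the node.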
Define dynamicField *_mxf in schema.xml:
<dynamicField name="*_mxf" type="text" indexed="true" stored="true"  multiValued="true"/>

Now we update solrconfig.xml to include this UIMA analysis engine.
<updateRequestProcessorChain name="opennlp-uima-soap" default="true">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
      </lst>
      <str name="analysisEngine">%REPLACE_THIS%\AggragateOpenNLPSOAPService.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">opennlp.uima.Date</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">date_mxf</str>
          </lst>
        </lst>           
        <lst name="type">
          <str name="name">opennlp.uima.Location</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">location_mxf</str>
          </lst>
        </lst> 

        <lst name="type">
          <str name="name">opennlp.uima.Money</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">money_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Organization</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">organization_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Percentage</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">percentage_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Sentence</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">sentence_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Time</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">time_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Person</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">person_mxf</str>
          </lst>
        </lst>           
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain> 
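With this many mappings, it helps to list which UIMA type goes to which Solr field and to confirm that every target field matches the *_mxf dynamic field. The following is a rough sketch; the function name field_mappings is an assumption, and the sample is a shortened copy of the chain above with only two of the eight types.

```python
import xml.etree.ElementTree as ET


def field_mappings(chain_xml: str) -> dict:
    """Hypothetical helper: map each UIMA type name to its target Solr field."""
    root = ET.fromstring(chain_xml)
    result = {}
    for type_lst in root.iterfind(".//lst[@name='fieldMappings']/lst[@name='type']"):
        type_name = type_lst.find("str[@name='name']").text
        field = type_lst.find("lst[@name='mapping']/str[@name='field']").text
        result[type_name] = field
    return result


# Shortened copy of the update chain; the other six types follow the same shape.
CHAIN_SAMPLE = """<updateRequestProcessorChain name="opennlp-uima-soap" default="true">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">opennlp.uima.Location</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">location_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Person</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">person_mxf</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
</updateRequestProcessorChain>"""

mappings = field_mappings(CHAIN_SAMPLE)
# Fields that would not be covered by the *_mxf dynamic field in schema.xml.
not_dynamic = [f for f in mappings.values() if not f.endswith("_mxf")]
print(mappings, not_dynamic)
```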
After that, we can add a document by calling:
http://localhost:8080/solr/update?update.chain=opennlp-uima-soap&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>
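The URL above carries raw XML in stream.body, which a browser will usually encode for us but a script has to encode itself. Here is a small sketch of building the same request URL with proper escaping. The function name build_update_url is hypothetical; the host, chain name and field names are the ones used in this post.

```python
from urllib.parse import parse_qs, urlencode, urlparse
from xml.sax.saxutils import escape


def build_update_url(base_url: str, chain: str, doc_id: str, content: str) -> str:
    """Hypothetical helper: build a Solr update URL with an encoded stream.body."""
    # escape() protects &, < and > inside the field values.
    body = (
        '<add><doc><field name="id">' + escape(doc_id) + "</field>"
        '<field name="content">' + escape(content) + "</field></doc></add>"
    )
    # urlencode() percent-encodes the XML so it is safe inside a query string.
    query = urlencode({"update.chain": chain, "commit": "true", "stream.body": body})
    return base_url + "/update?" + query


update_url = build_update_url(
    "http://localhost:8080/solr", "opennlp-uima-soap", "1", "some text here"
)
print(update_url)

# Decoding the query again gives back the original XML body.
decoded_body = parse_qs(urlparse(update_url).query)["stream.body"][0]
print(decoded_body)
```

The resulting URL can be pasted into a browser or passed to any HTTP client; sending the request is left out here so the sketch runs without a Solr server.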

Then open http://localhost:8080/solr/select?q=id:1. We can see that it extracts entities such as organization, person name, location, time, date, money, and percentage.
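To pull the extracted entities out of the query result in a script, something like the sketch below could be used. The sample response is made up for illustration: it follows the general shape of Solr's XML response, but the values are invented and are not real output from this setup. The function name extracted_entities is also an assumption.

```python
import xml.etree.ElementTree as ET


def extracted_entities(response_xml: str) -> dict:
    """Hypothetical helper: collect the multi-valued *_mxf fields from a Solr XML response."""
    root = ET.fromstring(response_xml)
    entities = {}
    for arr in root.iterfind(".//doc/arr"):
        name = arr.get("name", "")
        if name.endswith("_mxf"):
            entities[name] = [s.text for s in arr.findall("str")]
    return entities


# Invented sample response; real values depend on the indexed text and the models.
RESPONSE_SAMPLE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <arr name="person_mxf"><str>John Smith</str></arr>
      <arr name="location_mxf"><str>London</str><str>Paris</str></arr>
    </doc>
  </result>
</response>"""

entities = extracted_entities(RESPONSE_SAMPLE)
print(entities)
```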
