Text Mining: Integrate OpenNLP, UIMA and Solr via SOAP Web Service

This series shows how to integrate OpenNLP, UIMA and Solr.
Integrate OpenNLP with UIMA
Covers how to install UIMA, build the OpenNLP pear, and run the OpenNLP pear in CVD or the UIMA Simple Server.
Integrate OpenNLP, UIMA and Solr via SOAP Web Service
Covers how to deploy the OpenNLP UIMA pear as a SOAP web service and integrate it with Solr.
Integrate OpenNLP, UIMA AS and Solr
Covers how to deploy the OpenNLP UIMA pear as a UIMA AS service and integrate it with Solr.

Please refer to part 1 for how to install UIMA and build the OpenNLP UIMA pear.
Deploy OpenNLP Pear as SOAP Web Service
See Working with Remote Services for how to deploy a UIMA component as a SOAP web service.
First, unzip OpenNlpTextAnalyzer.pear to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer.
Then create the pear descriptor opennlp.uima.OpenNlpTextAnalyzer_pear.xml in %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer, as shown below.

<?xml version="1.0" encoding="UTF-8"?>
<pearSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
  <pearPath>%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer</pearPath>
</pearSpecifier>
Next, create the web services deployment descriptor, Deploy_OpenNLP.wsdd. Example WSDD files are provided in the examples/deploy/soap directory of the UIMA SDK. All we need to do is copy one of them (for example, Deploy_NamesAndPersonTitles.wsdd), change the service name to urn:OpenNLP, and change resourceSpecifierPath to point to %PEARS_HOME_REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer\opennlp.uima.OpenNlpTextAnalyzer_pear.xml. Replace %PEARS_HOME_REPLACE_THIS% with the real location.
<deployment name="OpenNLP" xmlns="http://xml.apache.org/axis/wsdd/"
    xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <service name="urn:OpenNLP" provider="java:RPC">
    <parameter name="scope" value="Request"/>
    <parameter name="className" value="org.apache.uima.adapter.soap.AxisAnalysisEngineService_impl"/>
    <parameter name="allowedMethods" value="getMetaData process"/>
    <parameter name="allowedRoles" value="*"/>
    <parameter name="resourceSpecifierPath" value="%REPLACE_THIS%\opennlp.uima.OpenNlpTextAnalyzer_pear.xml"/>
    <parameter name="numInstances" value="3"/>
    <parameter name="enableLogging" value="true"/>
    <!-- typeMapping omitted -->
  </service>
</deployment>
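Before deploying, it can help to sanity-check the descriptor, since a misspelled element name is easy to miss by eye. The following is only a sketch in Python (standard library only) and is not part of UIMA or Axis. The function name check_wsdd, the set of expected parameter names, and the rule that any child element other than parameter, typeMapping or beanMapping gets reported are all my own assumptions, based on the descriptor above.

```python
import xml.etree.ElementTree as ET

WSDD_NS = "{http://xml.apache.org/axis/wsdd/}"

# Parameter names taken from the descriptor above. This is an assumption of
# the sketch, not an official list of what Axis or UIMA requires.
EXPECTED = {"className", "allowedMethods", "resourceSpecifierPath"}

# Child elements of <service> that this sketch accepts without comment.
KNOWN_CHILDREN = {"parameter", "typeMapping", "beanMapping"}


def check_wsdd(xml_text):
    """Return a list of human-readable problems found in a WSDD document."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    problems = []
    for service in root.iter(WSDD_NS + "service"):
        service_name = service.get("name")
        seen = set()
        for child in service:
            local = child.tag.replace(WSDD_NS, "")
            if local == "parameter":
                seen.add(child.get("name"))
            elif local not in KNOWN_CHILDREN:
                problems.append(f"{service_name}: unexpected element <{local}>")
        for missing in sorted(EXPECTED - seen):
            problems.append(f"{service_name}: missing parameter '{missing}'")
    return problems
```

Calling check_wsdd(open("Deploy_OpenNLP.wsdd", encoding="utf-8").read()) returns an empty list when nothing suspicious is found.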
If we are using Tomcat, set CATALINA_HOME to the location where Tomcat is installed. If we are using another application server, we may need to update UIMA_CLASSPATH in runUimaClass.bat to include axis\WEB-INF\lib and axis\WEB-INF\classes.

Then run deploytool %FOLDER%\Deploy_OpenNLP.wsdd to deploy the pear as a SOAP service.
Test OpenNLP SOAP Service in CVD
See How to Call a UIMA Service for details.
We need to define a SOAP service client descriptor, OpenNLPSOAPServiceClient.xml:
<?xml version="1.0" encoding="UTF-8" ?> 
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
 <resourceType>AnalysisEngine</resourceType>
 <uri>http://localhost:8080/axis/services/urn:OpenNLP</uri>
 <protocol>SOAP</protocol>
</uriSpecifier>
Then, in CVD, click "Run" -> "Load AE" to load OpenNLPSOAPServiceClient.xml and test it.

Integrate OpenNLP-UIMA with Solr
We can use Solr's UIMAUpdateRequestProcessorFactory to send text to the OpenNLP-UIMA SOAP web service for analysis whenever a document is added to Solr. UIMAUpdateRequestProcessorFactory then saves the information extracted by UIMA into Solr fields.

To call the SOAP web service, we first need to put the SOAP service client descriptor, OpenNLPSOAPServiceClient.xml, in the solr/collection1/conf folder.
Then we wrap the SOAP service as part of an aggregate analysis engine.

We create an analysis engine descriptor file, AggragateOpenNLPSOAPService.xml, as shown below.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="OpenNLPSOAPService">
      <import location="OpenNLPSOAPServiceClient.xml"/>
    </delegateAnalysisEngine>    
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>ExtServicesAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters searchStrategy="language_fallback">
    </configurationParameters>
    <flowConstraints>
      <fixedFlow>
         <node>OpenNLPSOAPService</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs/>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>
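In an aggregate descriptor, each node in the fixedFlow has to name the key of one of the delegate analysis engines, and a mismatch between the two is an easy mistake when copying descriptors around. As a rough sketch (Python standard library only, with a function name of my own choosing), the check could look like this:

```python
import xml.etree.ElementTree as ET

# Namespace used by UIMA resource specifiers, as in the descriptor above.
NS = {"u": "http://uima.apache.org/resourceSpecifier"}


def flow_nodes_without_delegate(descriptor_xml):
    """Return fixedFlow node names that match no delegateAnalysisEngine key."""
    root = ET.fromstring(descriptor_xml)
    keys = {
        delegate.get("key")
        for delegate in root.iterfind(".//u:delegateAnalysisEngine", NS)
    }
    nodes = [
        (node.text or "").strip()
        for node in root.iterfind(".//u:fixedFlow/u:node", NS)
    ]
    return [name for name in nodes if name not in keys]
```

For the descriptor above the result should be an empty list, because the only flow node, OpenNLPSOAPService, is also the key of the only delegate.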
Define dynamicField *_mxf in schema.xml:
<dynamicField name="*_mxf" type="text" indexed="true" stored="true"  multiValued="true"/>

Now we update solrconfig.xml to include this UIMA analysis engine.
<updateRequestProcessorChain name="opennlp-uima-soap" default="true">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
      </lst>
      <str name="analysisEngine">%REPLACE_THIS%\AggragateOpenNLPSOAPService.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">opennlp.uima.Date</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">date_mxf</str>
          </lst>
        </lst>           
        <lst name="type">
          <str name="name">opennlp.uima.Location</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">location_mxf</str>
          </lst>
        </lst> 

        <lst name="type">
          <str name="name">opennlp.uima.Money</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">money_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Organization</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">organization_mxf</str>
          </lst>
        </lst>
        <lst name="type">
          <str name="name">opennlp.uima.Percentage</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">percentage_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Sentence</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">sentence_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Time</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">time_mxf</str>
          </lst>
        </lst> 
        <lst name="type">
          <str name="name">opennlp.uima.Person</str>
          <lst name="mapping">
            <str name="feature">coveredText</str>
            <str name="field">person_mxf</str>
          </lst>
        </lst>           
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain> 
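Each entry under fieldMappings sends one feature of one annotation type to a Solr field, and every target field has to exist in the schema, here through the *_mxf dynamic field. With eight nearly identical entries it is easy to mistype one, so the sketch below lists the mappings and flags fields that the pattern would not catch. It is an illustration in Python (standard library only). The function names are mine, and using fnmatch to imitate Solr's dynamicField matching is an assumption that only holds for simple patterns with a single leading or trailing *.

```python
import fnmatch
import xml.etree.ElementTree as ET


def field_mappings(chain_xml):
    """Return (type name, feature, field) triples from an update chain."""
    root = ET.fromstring(chain_xml)
    triples = []
    for type_lst in root.iter("lst"):
        if type_lst.get("name") != "type":
            continue
        type_name = type_lst.findtext("str[@name='name']")
        for mapping in type_lst.findall("lst[@name='mapping']"):
            triples.append((
                type_name,
                mapping.findtext("str[@name='feature']"),
                mapping.findtext("str[@name='field']"),
            ))
    return triples


def unmatched_fields(triples, pattern="*_mxf"):
    """Return target fields that the given dynamicField pattern would miss."""
    return [
        field for _, _, field in triples
        if not fnmatch.fnmatchcase(field, pattern)
    ]
```

Passing the updateRequestProcessorChain element above as a string to field_mappings gives one triple per annotation type, and unmatched_fields on that result should come back empty.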
After that, we can add a document by calling:
http://localhost:8080/solr/update?update.chain=opennlp-uima-soap&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>
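The XML passed in stream.body contains characters such as <, >, quotes and spaces that are not allowed unescaped in a URL. A browser will typically encode them for us, but when the request comes from a script we have to do it ourselves. Here is a minimal sketch in Python (standard library only). The function name solr_update_url is made up for this example, and the field names id and content simply follow the request above.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape


def solr_update_url(base_url, chain, doc_id, content):
    """Build an update URL whose stream.body is XML-escaped and URL-encoded."""
    # Escape &, < and > in the values so the <add> document stays well-formed.
    body = (
        "<add><doc>"
        f'<field name="id">{escape(str(doc_id))}</field>'
        f'<field name="content">{escape(content)}</field>'
        "</doc></add>"
    )
    # urlencode percent-encodes each value and joins the pairs with '&'.
    query = urlencode({
        "update.chain": chain,
        "commit": "true",
        "stream.body": body,
    })
    return f"{base_url}/update?{query}"
```

For example, solr_update_url("http://localhost:8080/solr", "opennlp-uima-soap", 1, "some text here") builds the same request as above in a form that is safe to hand to any HTTP client.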

Then open http://localhost:8080/solr/select?q=id:1. We can see that it extracts entities such as organization, person name, location, time, date, money, and percentage.
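To look at only the extracted entities, we can ask Solr for JSON by adding wt=json to the select URL and keep just the fields that end in _mxf. The helper below is a small sketch in Python (standard library only). Its name is invented for this post, and it assumes the usual Solr JSON layout with the documents under response.docs.

```python
import json


def extracted_entities(response_text, suffix="_mxf"):
    """Return, for each document, only the fields whose name ends in suffix."""
    data = json.loads(response_text)
    docs = data["response"]["docs"]
    return [
        {name: value for name, value in doc.items() if name.endswith(suffix)}
        for doc in docs
    ]
```

The input is the body of a request such as http://localhost:8080/solr/select?q=id:1&wt=json, fetched with any HTTP client.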