Text Mining: Integrate UIMA Regular Expression Annotator with Solr

UIMA RegexAnnotator
UIMA RegexAnnotator is an Apache UIMA analysis engine that uses regular expressions to detect entities such as email addresses, URLs, phone numbers, zip codes, or any other entity that a pattern can describe.

This article shows how to deploy RegexAnnotator as a SOAP web service, add extra regular expressions to extract other types of entities, and integrate it with Solr.
Deploy RegexAnnotator as SOAP Web Service
For detailed steps, please refer to this post.
First, copy all jars in %uima-addons-home%\addons\annotator\RegularExpressionAnnotator\lib\ to axis.war\WEB-INF\lib, and copy RegularExpressionAnnotator\desc\concepts.xml to axis.war\WEB-INF\classes.

Then we need to create a web services deployment descriptor (WSDD). Example WSDD files are provided in the examples/deploy/soap directory of the UIMA SDK. All we need to do is copy one of them (for example, Deploy_NamesAndPersonTitles.wsdd), change the service name to urn:RegExAnnotator, and change resourceSpecifierPath to point to %PEARS_HOME_REPLACE_THIS%\addons/annotator/RegularExpressionAnnotator/desc/RegExAnnotator.xml.

<deployment name="RegExAnnotator">
 <service name="urn:RegExAnnotator" provider="java:RPC">
  <parameter name="resourceSpecifierPath" value="%PEARS_HOME_REPLACE_THIS%\addons/annotator/RegularExpressionAnnotator/desc/RegExAnnotator.xml"/>
If we are using Tomcat, set CATALINA_HOME to the location where Tomcat is installed. If we are using another application server, we may need to update UIMA_CLASSPATH in runUimaClass.bat to include axis\WEB-INF\lib and axis\WEB-INF\classes.

Then run deploytool %FOLDER%\RegExAnnotator.wsdd to deploy the PEAR as a SOAP service.
Test RegexAnnotator SOAP Service in CVD
See How to Call a UIMA Service for details.

We need to define a SOAP Service Client Descriptor, RegExAnnotatorSoapServiceClient.xml:
<uriSpecifier xmlns="http://uima.apache.org/resourceSpecifier">
Then in CVD, click "Run" -> "Load AE" to load RegExAnnotatorSoapServiceClient.xml and test it.
Adding RegEx to extract other types of entities
Regular expressions can be used to extract many types of entities.
For example, we can use the regex from this post, \(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4}), to extract North American phone numbers.
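To get a feel for which formats this pattern accepts, here is a minimal sketch in Python. Python is used only for illustration: RegexAnnotator itself runs on Java's regex engine, but this pattern relies only on syntax the two engines share. The helper name parse_phone and the sample strings are made up for this example.

```python
import re

# The pattern quoted above for North American phone numbers.
PHONE_RE = re.compile(r"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})")


def parse_phone(candidate):
    """Return (area code, exchange, line number) if the whole string
    is a phone number according to the pattern, otherwise None.

    parse_phone is a hypothetical helper, not part of UIMA.
    """
    match = PHONE_RE.fullmatch(candidate)
    return match.groups() if match else None


if __name__ == "__main__":
    samples = ["(555) 123-4567", "555.123.4567", "5551234567", "555-12-4567"]
    for sample in samples:
        print(sample, "->", parse_phone(sample))
```

The first three samples match with different separators, while the last one is rejected because its middle group has only two digits. Note that the pattern is permissive: the parentheses are each optional on their own, so an input such as "(555 123-4567" also matches.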

To add this feature to UIMA, we create ExtraRegExAnnotator.xml, which is similar to RegExAnnotator.xml except that it uses a different concept XML file and type definition:
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <!--configurationParameters omitted here -->
    <!-- operationalProperties omitted here -->
The concept file, extra-concepts.xml, which we also copy to axis.war\WEB-INF\classes:
<conceptSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  <concept name="usaPhoneNumberDetection">
      <rule regEx="\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})"
    matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"
    confidence="1.0" />
      <annotation id="usaPhoneNumber"
        <begin group="0" />
        <end group="0" />
        <setFeature name="confidence" type="Confidence" />
The web services deployment descriptor, Deploy_ExtraRegExAnnotator.wsdd, is similar to Deploy_RegExAnnotator.wsdd.

The SOAP Service Client Descriptor, ExtraRegExAnnotatorSoapServiceClient.xml, is similar to RegExAnnotatorSoapServiceClient.xml. Load it into CVD and test it.
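When checking the result in CVD, it helps to know which spans to expect. The usaPhoneNumberDetection rule uses matchStrategy="matchAll" and takes the annotation's begin and end from group 0, that is, the whole match. The sketch below imitates that behaviour in Python as a rough model only; it is not how RegexAnnotator is implemented, and find_phone_spans and the sample text are invented for illustration.

```python
import re

# Same pattern as the regEx attribute of the rule in extra-concepts.xml.
PHONE_RE = re.compile(r"\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})")


def find_phone_spans(text):
    """Return one (begin, end, covered_text) tuple per match.

    Group 0 is the entire match, mirroring <begin group="0"/> and
    <end group="0"/> in the concept file. Hypothetical helper.
    """
    return [(m.start(0), m.end(0), m.group(0)) for m in PHONE_RE.finditer(text)]


if __name__ == "__main__":
    document = "Call (555) 123-4567 or 555.987.6543 today."
    for begin, end, covered in find_phone_spans(document):
        print(begin, end, repr(covered))
```

For the sample sentence this yields two spans, one per phone number, and the covered text of each span is what the Solr mapping later stores when it reads the coveredText feature.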
Integrate UIMA RegexAnnotator with Solr
Solr uses UIMAUpdateRequestProcessorFactory for this integration. When a document is added to Solr, it sends the text to the SOAP web service, parses the SOAP response, and saves the information UIMA extracted into Solr fields.

To call the SOAP web services, we first need to put the SOAP Service Client Descriptors, RegExAnnotatorSoapServiceClient.xml and ExtraRegExAnnotatorSoapServiceClient.xml, in the solr/collection1/conf folder.

Then wrap the two SOAP services, urn:RegExAnnotator and urn:ExtraRegExAnnotator, in an aggregate analysis engine:
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <delegateAnalysisEngine key="RegExAnnotatorService">
      <import location="RegExAnnotatorSoapServiceClient.xml"/>
    <delegateAnalysisEngine key="ExtraRegExAnnotatorService">
      <import location="ExtraRegExAnnotatorSoapServiceClient.xml"/>
    <configurationParameters searchStrategy="language_fallback">
    <!-- capabilities omitted -->
    <!-- operationalProperties omitted -->
Then define the update chain uima-regex in solrconfig.xml:
<updateRequestProcessorChain name="uima-regex" default="true">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
        <str name="analysisEngine">file:///%REPLACE_THIS%\solr\collection1\conf\RegExAnnotatorAE.xml</str>
        <bool name="ignoreErrors">false</bool>
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
        <lst name="fieldMappings">
        <lst name="type">
            <str name="name">org.lifelongprogrammer.USAPhoneNumber</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">usaphone_mxf</str>
          <lst name="type">
            <str name="name">org.apache.uima.EmailAddress</str>
            <lst name="mapping">
              <str name="feature">normalizedEmail</str>
              <str name="field">email_mxf</str>
          <lst name="type">
            <str name="name">org.apache.uima.ISBNNumber</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">isbn_mxf</str>

          <lst name="type">
            <str name="name">org.apache.uima.MoneyAmount</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">money_mxf</str>
          <lst name="type">
            <str name="name">org.apache.uima.CreditCardNumber</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">creditcard_mxf</str>
            <lst name="mapping">
              <str name="feature">cardType</str>
              <str name="field">creditcardType_mxf</str>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
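The fieldMappings section reads as a lookup: for each annotation type, copy the named feature into the named Solr field. The following toy model in Python illustrates that reading. It is only a sketch of the idea, not Solr's actual code; the dictionary layout, the map_annotations helper, and the sample annotations are all assumptions made for this example.

```python
# Toy copy of part of the fieldMappings above:
# UIMA type name -> {feature name: Solr field name}.
FIELD_MAPPINGS = {
    "org.lifelongprogrammer.USAPhoneNumber": {"coveredText": "usaphone_mxf"},
    "org.apache.uima.EmailAddress": {"normalizedEmail": "email_mxf"},
    "org.apache.uima.CreditCardNumber": {
        "coveredText": "creditcard_mxf",
        "cardType": "creditcardType_mxf",
    },
}


def map_annotations(annotations, mappings=FIELD_MAPPINGS):
    """Collect feature values into Solr-style multi-valued fields.

    Each annotation is a dict with a "type" and a "features" dict;
    this shape is invented for the sketch.
    """
    fields = {}
    for annotation in annotations:
        for feature, field in mappings.get(annotation["type"], {}).items():
            if feature in annotation["features"]:
                fields.setdefault(field, []).append(annotation["features"][feature])
    return fields


if __name__ == "__main__":
    sample = [
        {"type": "org.lifelongprogrammer.USAPhoneNumber",
         "features": {"coveredText": "(555) 123-4567"}},
        {"type": "org.apache.uima.EmailAddress",
         "features": {"coveredText": "Bob@Example.com",
                      "normalizedEmail": "bob@example.com"}},
    ]
    print(map_annotations(sample))
```

In this model an annotation whose type has no mapping contributes nothing, and a type with two mappings, such as the credit card type, fills two fields from a single annotation.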
After that, we can index a test document by calling:
http://localhost:8080/solr/update?update.chain=uima-regex&commit=true&stream.body=<add><doc><field name="id">1</field><field name="content">some text here</field></doc></add>
Then open http://localhost:8080/solr/select?q=id:1, and we can see that it extracts entities such as phone numbers, email addresses, credit card numbers, and ISBNs.
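The update URL above carries raw XML in stream.body, so characters such as <, >, quotes, and spaces should be URL-encoded when the request is sent from a script rather than pasted into a browser. Here is a minimal sketch in Python that builds such a URL with the standard library. The host, port, chain name, and field names come from the example above; build_update_url and decode_body are hypothetical helpers.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from xml.sax.saxutils import escape

# Update endpoint used in the example above.
SOLR_UPDATE = "http://localhost:8080/solr/update"


def build_update_url(doc_id, content, chain="uima-regex"):
    """Build the update URL with a URL-encoded stream.body parameter."""
    # Escape &, < and > so the text cannot break the <add> document.
    body = (
        '<add><doc><field name="id">%s</field>'
        '<field name="content">%s</field></doc></add>'
    ) % (escape(doc_id), escape(content))
    query = urlencode(
        {"update.chain": chain, "commit": "true", "stream.body": body}
    )
    return SOLR_UPDATE + "?" + query


def decode_body(url):
    """Recover the stream.body XML from a URL, for inspection."""
    return parse_qs(urlsplit(url).query)["stream.body"][0]


if __name__ == "__main__":
    url = build_update_url("1", "Call (555) 123-4567 or mail bob@example.com")
    print(url)
    print(decode_body(url))
```

The resulting string can be fetched with any HTTP client. Decoding it again, as decode_body does, is a quick way to confirm that the XML survives the round trip unchanged.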