UIMA: Run Custom Regex Dynamically

The Problem
Extend the UIMA Regular Expression Annotator so that users can run custom regexes dynamically.

The Regular Expression Annotator lets us easily define entity names (such as credit card or email) and the regexes that extract those entities.

But we can never define every useful entity up front, so it is helpful to let customers add their own entities and regexes and have the UIMA Regular Expression Annotator run them dynamically.

We could create and deploy a new annotator, but we decided to simply extend the UIMA RegExAnnotator.

How it Works
Client Side
We create one type, org.apache.uima.input.dynamicregex, with two features: types and regexes.
In our HTTP interface, the client specifies the entity names and their regexes:
host:port/nlp?text=abcxxdef&customTypes=mytype1,mytype2&customRegexes=abc.*,def.*
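
Regexes often contain characters that are reserved in URLs (`+`, `\`, `&`, `?`), so the parameter values generally need to be URL-encoded before the request is sent. The following is a minimal sketch, not the post's actual client: `DynamicRegexQuery` and `buildQuery` are hypothetical names, the parameter names are taken from the example URL above, and the `Charset` overload of `URLEncoder.encode` assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of UIMA: builds the query string for the
// /nlp endpoint shown above and URL-encodes every parameter value.
final class DynamicRegexQuery {

  static String buildQuery(String text, List<String> customTypes,
      List<String> customRegexes) {
    return "text=" + encode(text)
        + "&customTypes=" + encode(String.join(",", customTypes))
        + "&customRegexes=" + encode(String.join(",", customRegexes));
  }

  private static String encode(String value) {
    // Java 10+; on older JDKs use URLEncoder.encode(value, "UTF-8").
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // The regex \d+\+\d+ contains '\' and '+', which must not be sent raw.
    System.out.println(buildQuery("call 555+1234",
        List.of("phone"), List.of("\\d+\\+\\d+")));
  }
}
```

The server is expected to decode the values before splitting them, which any standard servlet container does for query parameters.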

The client then adds feature structures of that type to the CAS, setting org.apache.uima.input.dynamicregex:types to mytype1,mytype2 and org.apache.uima.input.dynamicregex:regexes to abc.*,def.*:
public void addCustomRegex(List<String> customTypes,
    List<String> customRegexes, CAS cas) {
  if (customTypes != null && customRegexes != null) {
    if (customTypes.size() != customRegexes.size()) {
      throw new IllegalArgumentException(
          "Size doesn't match: customTypes size: "
              + customTypes.size() + ", customRegexes size: "
              + customRegexes.size());
    }
    TypeSystem ts = cas.getTypeSystem();
    Feature ft = ts
        .getFeatureByFullName("org.apache.uima.input.dynamicregex:types");
    Type type = ts.getType("org.apache.uima.input.dynamicregex");

    if (type != null) {
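      // Only add the feature structure if the remote annotator or PEAR
      // declares the type org.apache.uima.input.dynamicregex;
      // otherwise do nothing.
      // joiner is a field defined elsewhere (presumably a comma joiner,
      // since the annotator splits the value on ",").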
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(customTypes));
      cas.addFsToIndexes(fs);
    }

    ft = ts.getFeatureByFullName("org.apache.uima.input.dynamicregex:regexes");
    type = ts.getType("org.apache.uima.input.dynamicregex");

    if (type != null) {
      // Only add the feature structure if the remote annotator or PEAR
      // declares the type org.apache.uima.input.dynamicregex;
      // otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(customRegexes));
      cas.addFsToIndexes(fs);
    }
  }
}
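
Because the regexes come from users, a malformed pattern would otherwise only fail later, inside the annotator. A minimal sketch of a check that could run on the client before `addCustomRegex` is called; the class and method names are hypothetical, and it uses only `java.util.regex`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper: reports every user-supplied regex that does not compile.
final class CustomRegexValidator {

  // Returns one message per invalid regex, prefixed with its entity type.
  // An empty list means all regexes compiled.
  static List<String> findInvalid(List<String> customTypes,
      List<String> customRegexes) {
    if (customTypes.size() != customRegexes.size()) {
      throw new IllegalArgumentException("Size doesn't match: customTypes size: "
          + customTypes.size() + ", customRegexes size: " + customRegexes.size());
    }
    List<String> errors = new ArrayList<>();
    for (int i = 0; i < customRegexes.size(); i++) {
      try {
        Pattern.compile(customRegexes.get(i));
      } catch (PatternSyntaxException e) {
        errors.add(customTypes.get(i) + ": " + e.getDescription());
      }
    }
    return errors;
  }
}
```

Rejecting the request when the returned list is non-empty gives the user an immediate error instead of a failed CAS.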
public Result process(String text, String lang, List<String> uimaTypes,
    List<String> customTypes, List<String> customRegexes,
    Long waitMillseconds) throws Exception {
  CAS cas = this.ae.getCAS();
  String casId;
  try {
    cas.setDocumentText(text);
    cas.setDocumentLanguage(lang);
    TypeSystem ts = cas.getTypeSystem();
    Feature ft = ts.getFeatureByFullName(UIMA_ENTITIES_FS);
    Type type = ts.getType(UIMA_ENTITIES);
    if (type != null) {
      // if remote annotator or pear supports type
      // org.apache.uima.entities:entities, add it to indexes,
      // otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(uimaTypes));
      cas.addFsToIndexes(fs);
    }
    addCustomRegex(customTypes, customRegexes, cas);
    casId = this.ae.sendCAS(cas);
  } catch (ResourceProcessException e) {
    // http://t17251.apache-uima-general.apachetalk.us/uima-as-client-is-blocking-t17251.html
    cas.release();
    logger.error("Exception thrown when process cas " + cas, e);
    throw e;
  }
  Result rst = this.listener.waitFinished(casId, waitMillseconds);
  return rst;
}
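
One caveat with this design: each list is joined into a single string, and the annotator shown later splits it again on `,`. A regex that itself contains a comma, such as `\d{1,3}`, therefore no longer lines up with its type. The sketch below reproduces the problem with plain strings and shows one possible workaround, a separator that is unlikely to occur in a regex. It assumes that `joiner` joins on `,` (its definition is not shown in the post), and the `\u0001` separator is an illustrative choice, not what the code above does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: joins values and splits them again,
// mimicking the client (join) and the annotator (split).
final class SeparatorDemo {

  // Illustrative separator; a control character rarely appears in a regex.
  static final String SAFE_SEPARATOR = "\u0001";

  static String[] roundTrip(List<String> values, String separator) {
    String joined = String.join(separator, values);
    // Pattern.quote makes split treat the separator literally.
    return joined.split(Pattern.quote(separator));
  }

  public static void main(String[] args) {
    List<String> regexes = Arrays.asList("\\d{1,3}", "abc.*");
    // With "," the two regexes come back as three pieces.
    System.out.println(Arrays.toString(roundTrip(regexes, ",")));
    // With the control character they come back unchanged.
    System.out.println(Arrays.toString(roundTrip(regexes, SAFE_SEPARATOR)));
  }
}
```

If the separator is changed, the client and the annotator have to agree on it, and the HTTP layer still needs its own way to pass a list of regexes.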
Define Feature Structures in RegExAnnotator.xml
org.apache.uima.input.dynamicregex is used as the input parameter: the client can specify values for its features, types and regexes. org.apache.uima.output.dynamicregex is the output type.
<typeDescription>
  <name>org.apache.uima.input.dynamicregex</name>
  <description />
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>types</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>            
    <featureDescription>
      <name>regexes</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>            
  </features>          
</typeDescription>
<!-- output params -->
<typeDescription>
  <name>org.apache.uima.output.dynamicregex</name>
  <description />
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>type</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>

Run Custom Regex and Return Extracted Entities in RegExAnnotator
Next, in the RegExAnnotator.process method, we read the input types and regexes, run the custom regexes, and add the entities found to the CAS indexes.
public void process(CAS cas) throws AnalysisEngineProcessException {
  procressCutsomRegex(cas);
  //... omitted
}
private void procressCutsomRegex(CAS cas) {
  TypeSystem ts = cas.getTypeSystem();
  Type dyInputType = ts.getType("org.apache.uima.input.dynamicregex");
  org.apache.uima.cas.Feature dyInputTypesFt = ts
      .getFeatureByFullName("org.apache.uima.input.dynamicregex:types");
  org.apache.uima.cas.Feature dyInputRegexesFt = ts
      .getFeatureByFullName("org.apache.uima.input.dynamicregex:regexes");
  String dyTypes = null, dyRegexes = null;
  FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();

  AnnotationFS dyInputTypesFs = null, dyInputRegexesFs = null;
  while (dyIt.hasNext()) {
    // TODO this is kind of weird
    AnnotationFS afs = (AnnotationFS) dyIt.next();
    if (afs.getStringValue(dyInputTypesFt) != null) {
      dyTypes = afs.getStringValue(dyInputTypesFt);
      dyInputTypesFs = afs;
    }
    if (afs.getStringValue(dyInputRegexesFt) != null) {
      dyRegexes = afs.getStringValue(dyInputRegexesFt);
      dyInputRegexesFs = afs;
    }
  }
  if (dyInputTypesFs != null) {
    cas.removeFsFromIndexes(dyInputTypesFs);
  }
  if (dyInputRegexesFs != null) {
    cas.removeFsFromIndexes(dyInputRegexesFs);
  }
  // No custom regex was supplied with this CAS, so there is nothing to run.
  if (dyTypes == null || dyRegexes == null) {
    return;
  }
  String[] dyTypesArr = dyTypes.split(",");
  String[] dyRegexesArr = dyRegexes.split(",");
  if (dyTypesArr.length != dyRegexesArr.length) {
    throw new IllegalArgumentException(
        "Size of custom regex doesn't match. types: "
            + dyTypesArr.length + ",  regexes: "
            + dyRegexesArr.length);
  }
  if (dyTypesArr.length == 0) {
    return;
  }
  logger.log(Level.FINE, "User specifies custom regex: type: " + dyTypes
      + ", regexes: " + dyRegexes);
  String docText = cas.getDocumentText();
  Type dyOutputType = ts.getType("org.apache.uima.output.dynamicregex");
  org.apache.uima.cas.Feature dyOutputTypeFt = ts
      .getFeatureByFullName("org.apache.uima.output.dynamicregex:type");
  FSIndexRepository indexRepository = cas.getIndexRepository();
  for (int i = 0; i < dyTypesArr.length; i++) {
    Pattern pattern = Pattern.compile(dyRegexesArr[i]);
    Integer captureGroupPos = getNamedGrpupPosition(pattern, "capture");
    Matcher matcher = pattern.matcher(docText);

    while (matcher.find()) {
      AnnotationFS dyAnnFS;
      // if named group capture exists
      if (captureGroupPos != null) {
        dyAnnFS = cas.createAnnotation(dyOutputType,
            matcher.start(captureGroupPos),
            matcher.end(captureGroupPos));
      } else {
        dyAnnFS = cas.createAnnotation(dyOutputType,
            matcher.start(), matcher.end());
      }
      dyAnnFS.setStringValue(dyOutputTypeFt, dyTypesArr[i]);
      indexRepository.addFS(dyAnnFS);
    }
  }
}
/**
 * Uses reflection to call the non-public Pattern.namedGroups() method
 * (written against JDK 7).
 *
 * @return the index of the named group, or null if the pattern does not
 *         define a group with that name
 */
@SuppressWarnings("unchecked")
private Integer getNamedGrpupPosition(Pattern pattern, String namedGroup) {
  try {
    Method namedGroupsMethod = Pattern.class
        .getDeclaredMethod("namedGroups");
    namedGroupsMethod.setAccessible(true);

    Map<String, Integer> namedGroups = (Map<String, Integer>) namedGroupsMethod
        .invoke(pattern);
    return namedGroups.get(namedGroup);
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}
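
The reflective call depends on a JDK-internal method, and newer Java versions may refuse to open it up through `setAccessible`. A possible alternative, assuming Java 8 or later, is to ask the `Matcher` for the named group directly: `Matcher.start(String)` throws `IllegalArgumentException` when the pattern has no group of that name. The sketch below is plain Java without UIMA; `CaptureSpans` and `findSpans` are hypothetical names, and the returned `{begin, end}` pairs stand in for the arguments passed to `cas.createAnnotation` above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the spans a custom regex would annotate,
// preferring the named group "capture" when the pattern defines it.
final class CaptureSpans {

  static final String CAPTURE_GROUP = "capture";

  static List<int[]> findSpans(String regex, String text) {
    Matcher matcher = Pattern.compile(regex).matcher(text);
    List<int[]> spans = new ArrayList<>();
    while (matcher.find()) {
      int begin = -1;
      int end = -1;
      try {
        // Returns -1 if the group exists but did not take part in this match.
        begin = matcher.start(CAPTURE_GROUP);
        end = matcher.end(CAPTURE_GROUP);
      } catch (IllegalArgumentException e) {
        // The pattern has no group named "capture".
      }
      if (begin < 0) {
        // Fall back to the whole match.
        begin = matcher.start();
        end = matcher.end();
      }
      spans.add(new int[] {begin, end});
    }
    return spans;
  }

  public static void main(String[] args) {
    for (int[] span : findSpans("id=(?<capture>\\d+)", "id=42 id=7")) {
      System.out.println(span[0] + "-" + span[1]);
    }
  }
}
```

Catching the exception on every match is simple but not free; for hot paths the result of the first lookup could be cached per pattern.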
References
UIMA References
Apache UIMA Regular Expression Annotator Documentation