UIMA: Using Dedicated Feature Structure to Control Annotator Behavior

The Problem
In a previous post, Using ResultSpecification to Filter Annotator to Boost Opennlp UIMA Performance, I showed how to use ResultSpecification so that OpenNLP.pear runs only the annotators that are needed.

Recently, however, we moved our content analyzer project to UIMA-AS so that it scales out better. UIMA-AS doesn't support specifying a ResultSpecification on the client side, so we had to find another solution.

Luckily, UIMA provides a more general mechanism, feature structures, that lets us control an annotator's behavior.

Using Dedicated Feature Structure to Control Server Behavior
This time we take RegExAnnotator.pear as the example: we have defined more than 10 regexes and entities in RegExAnnotator, and the client specifies which entities it is interested in.

The client sets the value of the feature org.apache.uima.entities:entities, for example "ssn,creditcard,email". RegExAnnotator checks that value and runs only the regexes that are needed.

Specify Feature Value at Client Side
First, we have a properties file, uima.properties, which defines the mapping from entity name to UIMA type:
regex_type_ssn=org.apache.uima.ssn
regex_type_CreditCard=org.apache.uima.CreditCardNumber
regex_type_Email=org.apache.uima.EmailAddress
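
As a minimal sketch of how a client might turn the requested entity names into the comma-separated feature value, the following loads the mapping above and resolves each name to its UIMA type. The class EntityTypeMapper and its method toFeatureValue are hypothetical names used only for illustration, and the case-insensitive lookup (so that "creditcard" matches regex_type_CreditCard) is an assumption, not something the original project is known to do.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper (not part of UIMA): resolves entity names to UIMA type names.
class EntityTypeMapper {
    private static final String PREFIX = "regex_type_";

    // Assumption: entity names are matched case-insensitively.
    private final Map<String, String> typeByEntity =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    EntityTypeMapper(Properties props) {
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                typeByEntity.put(key.substring(PREFIX.length()),
                        props.getProperty(key).trim());
            }
        }
    }

    // Returns the comma-separated UIMA type names; unknown entity names are skipped.
    String toFeatureValue(List<String> entities) {
        List<String> types = new ArrayList<>();
        for (String entity : entities) {
            String type = typeByEntity.get(entity.trim());
            if (type != null) {
                types.add(type);
            }
        }
        return String.join(",", types);
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(
                "regex_type_ssn=org.apache.uima.ssn\n"
              + "regex_type_CreditCard=org.apache.uima.CreditCardNumber\n"
              + "regex_type_Email=org.apache.uima.EmailAddress\n"));
        EntityTypeMapper mapper = new EntityTypeMapper(props);
        // prints org.apache.uima.ssn,org.apache.uima.CreditCardNumber,org.apache.uima.EmailAddress
        System.out.println(mapper.toFeatureValue(
                Arrays.asList("ssn", "creditcard", "email")));
    }
}
```

Sending fully qualified type names keeps the annotator side simple, because it can compare the values directly with the type names its concepts produce.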


public class UIMAASService extends AbstractService {
 private static final String UIMA_ENTITIES = "org.apache.uima.entities";
 private static final String UIMA_ENTITIES_FS = UIMA_ENTITIES + ":entities";
 private static Joiner joiner = Joiner.on(",");

 public Result process(String text, String lang, List<String> types,
   Long waitMillseconds) throws Exception {
  CAS cas = this.ae.getCAS();
  String casId;
  try {
   cas.setDocumentText(text);
   cas.setDocumentLanguage(lang);
   TypeSystem ts = cas.getTypeSystem();
   Feature ft = ts.getFeatureByFullName(UIMA_ENTITIES_FS);

   Type type = ts.getType(UIMA_ENTITIES);
   if (type != null && ft != null) {
    // if remote annotator or pear supports type
    // org.apache.uima.entities:entities, add it to indexes,
    // otherwise do nothing.
    FeatureStructure fs = cas.createFS(type);
    fs.setStringValue(ft, joiner.join(types));
    cas.addFsToIndexes(fs);
   }
   casId = this.ae.sendCAS(cas);
  } catch (ResourceProcessException e) {
   // http://t17251.apache-uima-general.apachetalk.us/uima-as-client-is-blocking-t17251.html
   // The UIMA AS framework code throws an
   // Exception and the application must catch it and release a CAS
   // before continuing. 
   cas.release();
   logger.error("Exception thrown when process cas " + cas, e);
   throw e;
  }
  Result rst = this.listener.waitFinished(casId, waitMillseconds);
  return rst;
 }
 protected static final Logger logger = LoggerFactory
   .getLogger(UIMAASService.class);

 private UimaAsynchronousEngine ae = null;
 protected UimaAsListener listener;
 private String serverUrl;
 private String endpoint;
 private static final int RETRIES = 10;

 public UIMAASService(String serverUrl, String endpoint) {
  this.serverUrl = serverUrl;
  this.endpoint = endpoint;
 }
 public void configureAE() throws SimpleServerException, IOException,
   XmlException, ResourceInitializationException {
  boolean success = false;
  for (int i = 0; (i < RETRIES) && (!success); i++) {
   try {
    this.ae = new BaseUIMAAsynchronousEngine_impl();
    this.listener = new UimaAsListener(this);
    this.ae.addStatusCallbackListener(this.listener);
    Map<String, Object> deployCtx = new HashMap<String, Object>();
    deployCtx.put("ServerURI", this.serverUrl);
    deployCtx.put("Endpoint", this.endpoint);
    deployCtx.put("Timeout", 60000);
    deployCtx.put("CasPoolSize", 20);
    deployCtx.put("GetMetaTimeout", 20000);

    if (StringUtils.isNotBlank(System.getProperty("uimaDebug"))) {
     deployCtx.put("-uimaEeDebug", Boolean.valueOf(true));
    }

    this.ae.initialize(deployCtx);
    success = true;
   } catch (ResourceInitializationException e) {
    // Retry on every attempt except the last one, which rethrows.
    if (i < RETRIES - 1) {
     logger.error(
       getName()
         + " configureAE failed when deploy , will retry, retried times: "
         + i, e);
    } else {
     logger.error(getName()
       + " configureAE failed, retried times: " + i, e);
     throw e;
    }
   }
  }
  configure(null);
 }

 public UimaAsListener getListener() {
  return this.listener;
 }

 public void deployPear(String appHome, File pearFile, File installationDir,
   String deployFileName) throws Exception {
  PackageBrowser instPear = PackageInstaller.installPackage(
    installationDir, pearFile, true);

  File deployFile = new File(instPear.getRootDirectory(), deployFileName);

  logger.info(getName() + " deployFile: " + deployFile);
  updateDeployFile(deployFile, this.serverUrl, this.endpoint);

  Map<String, Object> deployCtx = new HashMap<String, Object>();
  deployCtx.put("DD2SpringXsltFilePath", new File(appHome,
    "resources/uima/config/dd2spring.xsl").getAbsolutePath());

  deployCtx.put(
    "SaxonClasspath",
    "file:"
      + new File(appHome, "resources/uima/lib/saxon8.jar")
        .getAbsolutePath());

  BaseUIMAAsynchronousEngine_impl tmpAE = new BaseUIMAAsynchronousEngine_impl();
  tmpAE.deploy(deployFile.getAbsolutePath(), deployCtx);

  logger.info(getName() + " deployed " + pearFile.getAbsolutePath());
 }
 private static void updateDeployFile(File deployFile, String serverUrl,
   String endpoint) throws FileNotFoundException, IOException {
  String fileContext = null;
  InputStream is = new FileInputStream(deployFile);
  try {
   fileContext = IOUtils.toString(is);
  } finally {
   IOUtils.closeQuietly(is);
  }
  fileContext = fileContext.replace("${endpoint}", endpoint);
  fileContext = fileContext.replace("${brokerURL}", serverUrl);

  OutputStream os = new FileOutputStream(deployFile);
  try {
   IOUtils.write(fileContext, os);
  } finally {
   IOUtils.closeQuietly(os);
  }
 }
 public void deployPear(File pearFile, File installationDir,
   String deployFileName) throws Exception {
  String appHome = System.getProperty("cv.app.running.home").trim();
  deployPear(appHome, pearFile, installationDir, deployFileName);
 } 
}
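
The intent of the retry loop in configureAE, trying a fixed number of times and surfacing the failure once the attempts are used up, can be sketched in isolation. This is a simplified illustration rather than the project's code: the Retry class and withRetries method are hypothetical, and there is no delay between attempts.

```java
import java.util.concurrent.Callable;

// Hypothetical helper illustrating a bounded retry; not part of UIMA-AS.
class Retry {
    // Runs the action up to maxAttempts times and rethrows the last failure.
    static <T> T withRetries(int maxAttempts, Callable<T> action) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                System.err.println("attempt " + attempt + " of " + maxAttempts
                        + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetries(3, () -> {
            // Fail twice, then succeed, to simulate a broker that is slow to start.
            if (++calls[0] < 3) {
                throw new IllegalStateException("broker not ready");
            }
            return "connected";
        });
        // prints "connected after 3 attempts"
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```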
Add Feature in RegExAnnotator.xml
<typeSystemDescription>
  <typeDescription>
    <name>org.apache.uima.entities</name>
    <description />
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
      <featureDescription>
        <name>entities</name>
        <description/>
        <rangeTypeName>uima.cas.String</rangeTypeName>
      </featureDescription>
    </features>
  </typeDescription>
</typeSystemDescription>
Check Feature Value at RegExAnnotator
RegExAnnotator reads the value of org.apache.uima.entities:entities. If it is set, RegExAnnotator goes through all configured regex Concepts and adds a concept to runConcepts when that concept produces one of the requested UIMA types. If it is not set, all concepts run as before.
public class RegExAnnotator extends CasAnnotator_ImplBase {
 private static final String UIMA_ENTITIES = "org.apache.uima.entities";
 private static final String UIMA_ENTITIES_FS = UIMA_ENTITIES + ":entities";
  
 public void process(CAS cas) throws AnalysisEngineProcessException {
  TypeSystem ts = cas.getTypeSystem();
  org.apache.uima.cas.Type entitiesType = ts.getType(UIMA_ENTITIES);
  FSIterator<?> it = cas.getAnnotationIndex(entitiesType).iterator();

  org.apache.uima.cas.Feature ft = ts
      .getFeatureByFullName(UIMA_ENTITIES_FS);
  String onlyRegexStr = null;
  AnnotationFS entitiesFs = null;
  while (it.hasNext()) {
    AnnotationFS afs = (AnnotationFS) it.next();
    String value = afs.getStringValue(ft);
    if (value != null) {
      // Remember the requested types and the FS that carried them.
      onlyRegexStr = value.trim();
      entitiesFs = afs;
    }
  }
  logger.log(Level.FINE, "Only run " + onlyRegexStr);
  if (entitiesFs != null) {
    cas.removeFsFromIndexes(entitiesFs);
  }

  List<String> types = null;
  List<Concept> runConcepts = new ArrayList<Concept>();

  if (onlyRegexStr != null) {
    // The client sends the requested UIMA type names as a comma-separated list.
    types = Arrays.asList(onlyRegexStr.split(","));
    for (Concept concept : regexConcepts) {
      Annotation[] annotations = concept.getAnnotations();
      if (annotations != null) {
        for (Annotation annotation : annotations) {
          if (types.contains(annotation.getAnnotationType()
              .getName())) {
            runConcepts.add(concept);
            break;
          }
        }
      }
    }
  } else {
    runConcepts = this.regexConcepts;
  }
  // Iterate over the local runConcepts instead of this.regexConcepts.
  for (int i = 0; i < runConcepts.size(); i++) {
    // Same as the original RegExAnnotator loop; omitted here.
  }
 }
}
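
The selection step above can be tried without a UIMA runtime. The following is a minimal sketch under stated assumptions: the nested Concept class is a simplified stand-in for the RegExAnnotator concept (it only records the type names it produces), ConceptFilter and select are hypothetical names, and a HashSet replaces the List lookup so each membership check is constant time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for illustration only; the real RegExAnnotator classes differ.
class ConceptFilter {
    static final class Concept {
        final String name;
        final List<String> producedTypes;

        Concept(String name, String... producedTypes) {
            this.name = name;
            this.producedTypes = Arrays.asList(producedTypes);
        }
    }

    // Returns the concepts that produce at least one requested type.
    // A null or blank feature value means "run everything".
    static List<Concept> select(List<Concept> all, String featureValue) {
        if (featureValue == null || featureValue.trim().isEmpty()) {
            return all;
        }
        Set<String> wanted = new HashSet<>();
        for (String type : featureValue.split(",")) {
            if (!type.trim().isEmpty()) {
                wanted.add(type.trim());
            }
        }
        List<Concept> selected = new ArrayList<>();
        for (Concept concept : all) {
            for (String type : concept.producedTypes) {
                if (wanted.contains(type)) {
                    selected.add(concept);
                    break; // one matching type is enough
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Concept> all = Arrays.asList(
                new Concept("ssn", "org.apache.uima.ssn"),
                new Concept("creditcard", "org.apache.uima.CreditCardNumber"),
                new Concept("email", "org.apache.uima.EmailAddress"));
        for (Concept c : select(all, "org.apache.uima.ssn,org.apache.uima.EmailAddress")) {
            System.out.println(c.name); // prints "ssn" then "email"
        }
    }
}
```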
References
UIMA References - feature structures
Apache UIMA Regular Expression Annotator Documentation
http://comments.gmane.org/gmane.comp.apache.uima.general/5866