The Problem
While writing a PowerShell script to clean a CSV file by removing invalid records, I mistakenly added -encoding utf8 when using out-file to write the cleaned rows to the final CSV.
Then I ran the following command to import the CSV file into Solr:
http://localhost:8080/solr/update/csv?literal.f1=1&literal.f2=2&header=true&stream.file=C:\1.csv&literal.cistate=17&commit=true
Solr generates a unique id, docid, by concatenating f1, f2, and the first column of the CSV file, localId; a row with localId abc should therefore get docid 12abc.
But to my surprise, there was only one document in Solr, with docid 12: every row had been given the same docid, so each one overwrote the last.
Checking http://localhost:8080/solr/cvcorefla4/admin/luke confirms this:
<int name="numDocs">1</int>
<int name="maxDoc">16420</int>
<int name="deletedDocs">16419</int>
<long name="version">2521</long>
Running http://localhost:8080/solr/select?q=* and copying the response into a new file in Notepad++ with UTF-8 encoding, everything seems fine; but after changing the file encoding to ASCII, it looks like this:
<str name="docid">12</str>
<arr name="id">
<str>f0e662cefe56a31c6eec5d53e64f988d</str>
</arr>
Notice the garbled invisible character before id. Also, the field is not the expected string, but an array of strings.
So I wrote a simple Java program to inspect the real content of "id":
public void testUnicode() {
    // Pasted straight from the Solr response: there is an invisible
    // character in front of the visible "id".
    String str = "id";
    for (int i = 0; i < str.length(); i++) {
        System.out.println(str.charAt(i));           // the character itself
        System.out.println((int) str.charAt(i));     // its numeric value
        System.out.println(escapeNonAscii(str.charAt(i) + ""));
    }
    System.out.println("***************");
    System.out.println(str.length());
    System.out.println(str.hashCode());
    System.out.println(escapeNonAscii(str));
    System.out.println("***************");
}
private static String escapeNonAscii(String str) {
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // skip the low surrogate of a surrogate pair
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp);                 // keep plain ASCII as-is
        } else {
            retStr.append(String.format("\\u%x", cp));  // escape everything else
        }
    }
    return retStr.toString();
}
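Assuming the pasted literal really does carry the invisible prefix, the program prints roughly the following (the first println outputs the invisible character itself and shows as a blank line, omitted here):

65279
\ufeff
i
105
i
d
100
d
***************
3
62736474
\ufeffid
***************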
The invisible prefix is \ufeff. U+FEFF is the byte order mark (BOM).
So now the problem is obvious:
out-file -encoding utf8
actually writes UTF-8 with a BOM, while Java reads the file as plain UTF-8 and does not strip the BOM. So to Java, the first column header on the first line is \ufefflocalId, not localId. Since localId is never recognized, the generated docid degenerates to 12 for every row.
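If you cannot change the producer, the reader can strip the BOM defensively. Below is a minimal sketch, assuming the file path from the import command above; the class and helper names are mine, not part of any library:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomAwareReader {

    // Java's UTF-8 decoder passes a leading BOM through as U+FEFF,
    // so it has to be dropped by hand.
    static String firstLineWithoutBom(String file) throws IOException {
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
                line = line.substring(1); // strip the BOM from the header line
            }
            return line;
        }
    }

    public static void main(String[] args) throws IOException {
        // Header line now starts with localId, not \ufefflocalId.
        System.out.println(firstLineWithoutBom("C:\\1.csv"));
    }
}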
The Solution
Actually the fix is simple: just drop the -encoding parameter. The default encoding of out-file is Unicode (UTF-16 LE with a BOM), which works fine with Java here. If we are sure all content is in the ASCII range, we can also specify -encoding ascii.
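To verify what an encoding setting actually produced, check the first bytes of the file: a UTF-8 BOM is the sequence EF BB BF, and UTF-16 LE starts with FF FE. A minimal sketch (class name is mine, path as above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {
    public static void main(String[] args) throws IOException {
        byte[] head = Files.readAllBytes(Paths.get("C:\\1.csv"));
        // EF BB BF => UTF-8 with BOM; FF FE => UTF-16 LE with BOM
        boolean utf8Bom = head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        boolean utf16LeBom = head.length >= 2
                && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE;
        System.out.println("UTF-8 BOM: " + utf8Bom + ", UTF-16 LE BOM: " + utf16LeBom);
    }
}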
Resources
Byte order mark
Unicode Character 'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)