Solr: Using LengthFilterFactory to Reduce Index Size and Memory Usage


Our Goal: Use less Memory
Because our Solr application runs on client machines, it's important to use as little memory as possible.

In Solr, we have fields that are used only for search. One way to reduce the index size and memory usage is to remove terms longer than a threshold, for example 50 characters.

The reasoning: users are very unlikely to search on such long terms, so why keep them in the index?

The Definition of the Content Field
 <fieldType name="text_rev_trucated" class="solr.TextField" positionIncrementGap="100" >
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="50"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>       
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
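To sanity-check the length cutoff outside Solr, here is a minimal stand-alone sketch (an assumption-level example, not part of our application; it needs the Lucene 4.x core and analyzers-common jars) that runs a whitespace-tokenized string through LengthFilterFactory with the same min/max and prints the surviving tokens:

import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class LengthFilterDemo {
  public static void main(String[] args) throws Exception {
    // "short" survives; the 60-character token exceeds max=50 and is dropped.
    String text = "short " + new String(new char[60]).replace('\0', 'x');

    Map<String, String> params = new HashMap<String, String>();
    params.put("luceneMatchVersion", Version.LUCENE_4_9 + "");
    params.put("min", "1");
    params.put("max", "50");
    LengthFilterFactory factory = new LengthFilterFactory(params);

    Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_4_9,
        new StringReader(text));
    TokenStream stream = factory.create(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term.toString()); // prints only "short"
    }
    stream.end();
    stream.close();
  }
}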

Other Tricks
Reduce the fetch size when retrieving data from a remote server (see the JDBC sketch below).
Commit frequently.
Reduce the cache sizes.
Monitor memory usage and run GC when needed.
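For the first trick, here is a hedged sketch (it assumes the remote data is pulled over JDBC during indexing; the connection settings, table, and the indexDocument helper are made up for illustration):

// Stream rows in small batches instead of buffering the whole result set.
void indexFromRemoteDb(String jdbcUrl, String user, String password) throws Exception {
  try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
      Statement stmt = conn.createStatement()) {
    stmt.setFetchSize(100); // smaller fetch size => fewer rows held in memory at once
    try (ResultSet rs = stmt.executeQuery("SELECT id, content FROM documents")) {
      while (rs.next()) {
        indexDocument(rs.getString("id"), rs.getString("content")); // hypothetical helper
      }
    }
  }
}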

Deploying Hadoop and Solr with Docker


Docker is an open platform for building, shipping, and running distributed applications. There are many Docker images with different OSes, bundled with different applications such as Hadoop or MongoDB.

When we want to learn or try out some tool, we can just call docker run with the specific image, for example: docker run --name some-mongo -d mongo
This won't mess up our host environment; when we are done, we can just call docker kill to stop the running container.

We can also use Docker to create a consistent environment that can be run on any Docker-enabled machine.

In this article, I would like to show how to run Hadoop and Solr in Docker.

Install Hadoop Image and Run it
Search for Hadoop in the Docker registry (https://registry.hub.docker.com); I chose the most popular image, sequenceiq/hadoop-docker.
Then run the following command on the Ubuntu host:
docker run -i -t sequenceiq/hadoop-docker /etc/bootstrap.sh -bash

This downloads the hadoop-docker image and starts it. After several minutes, it opens a bash shell inside the hadoop-docker container.

Install Solr in Hadoop Container
Run the following commands; they download Solr 4.10.1 and unpack it.
mkdir -p /home/lifelongprogrammer/src/solr; cd /home/lifelongprogrammer/src/solr
curl -O http://mirrors.advancedhosters.com/apache/lucene/solr/4.10.1/solr-4.10.1.tgz
tar -xf solr-4.10.1.tgz
cd /home/lifelongprogrammer/src/solr/solr-4.10.1/example

Then run the following command; it runs Solr on HDFS with the default port 8983.
java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.data.dir=hdfs://$(hostname):9000/solr/datadir \
     -Dsolr.updatelog=hdfs://$(hostname):9000/solr/updateLog -jar start.jar

Run Solr in background on Startup
Edit /etc/bootstrap.sh and add the following command after the $HADOOP_PREFIX/sbin/start-yarn.sh line:
cd /home/lifelongprogrammer/src/solr/solr-4.10.1/example && nohup java -Dsolr.directoryFactory=HdfsDirectoryFactory \
   -Dsolr.lock.type=hdfs \
   -Dsolr.data.dir=hdfs://$(hostname):9000/solr/datadir \
   -Dsolr.updatelog=hdfs://$(hostname):9000/solr/updateLog -jar start.jar &

Commit changes and Create Docker Images
First run docker ps to get the container id:
CONTAINER ID        IMAGE 
2cd8fadba668        93186936bee2

Then let's commit the change and create our own docker images:
docker commit 2cd8fadba668   hadoop_docker_withsolr

Run exit in the opened Docker bash to log out of it. Then run:
docker run -d -t -p 8983:8983 hadoop_docker_withsolr /etc/bootstrap.sh -d

The first -d tells Docker to run the image in detached mode; -p tells Docker to publish a container port to the host.
The last -d is a parameter of /etc/bootstrap.sh.

After several minutes, we can open http://linuxhostip:8983/solr/#/ to access the Solr admin page. Now Solr is running in the Hadoop Docker container.

After we are done with our test, we run docker ps to get its container id, then call docker kill $container_id to kill it. 

Persist Modified Image
Now let's save our modified docker image:
docker save hadoop_docker_withsolr  > hadoop_docker_withsolr_save.tar

Now we can copy this tar to another machine, and load it:
docker load < hadoop_docker_withsolr_save.tar



Powershell: How Can I Determine Local Path from UNC Path?


The Problem
Usually we know the UNC path (like \\server\share\file_path) and need to get the local physical path, so we can log in to that machine, go to that path, and make some changes.

The Solution
We can use WMI (Windows Management Instrumentation) to get and operate on Windows management information. Win32_Share represents a shared resource on a computer system running Windows.

In PowerShell, we can use get-wmiobject to get the WMI class and -filter to specify the share name.

So now the solution is obvious: just one command line:
get-wmiobject -class "Win32_Share" -namespace "root\cimv2" -computername "computername" -filter "name='uncpath'" | select name,path

Output:
name              path
----              ----
share-name        e:\users\somefolder
Reference
get-wmiobject
Win32_Share

UIMA: Run Custom Regex Dynamically


The Problem
Extend the UIMA Regex Annotator to allow users to run custom regexes dynamically.

The Regular Expression Annotator allows us to easily define entity names (such as credit card, email) and the regexes to extract these entities.

But we can never define all useful entities, so it's good to allow customers to add their own entities and regexes, which the UIMA Regular Expression Annotator then runs dynamically.

We could create and deploy a new annotator, but we decided to just extend the UIMA RegExAnnotator.

How it Works
Client Side
We create one type, org.apache.uima.input.dynamicregex, with the features types and regexes.
In our HTTP interface, the client specifies the entity names and their regexes:
host:port/nlp?text=abcxxdef&customTypes=mytype1,mytype2&customRegexes=abc.*,def.*

The client then adds a feature structure with org.apache.uima.input.dynamicregex:types=mytype1,mytype2 and org.apache.uima.input.dynamicregex:regexes=abc.*,def.*
public void addCustomRegex(List<String> customTypes,
    List<String> customRegexes, CAS cas) {
  if (customTypes != null && customRegexes != null) {
    if (customTypes.size() != customRegexes.size()) {
      throw new IllegalArgumentException(
          "Size doesn't match: customTypes size: "
              + customTypes.size() + ", customRegexes size: "
              + customRegexes.size());
    }
    TypeSystem ts = cas.getTypeSystem();
    Feature ft = ts
        .getFeatureByFullName("org.apache.uima.input.dynamicregex:types");
    Type type = ts.getType("org.apache.uima.input.dynamicregex");

    if (type != null) {
      // if the remote annotator or pear supports type
      // org.apache.uima.input.dynamicregex, add the feature structure
      // to the indexes; otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(customTypes));
      cas.addFsToIndexes(fs);
    }

    ft = ts.getFeatureByFullName("org.apache.uima.input.dynamicregex:regexes");
    type = ts.getType("org.apache.uima.input.dynamicregex");

    if (type != null) {
      // if the remote annotator or pear supports type
      // org.apache.uima.input.dynamicregex, add the feature structure
      // to the indexes; otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(customRegexes));
      cas.addFsToIndexes(fs);
    }
  }
}
public Result process(String text, String lang, List<String> uimaTypes,
    List<String> customTypes, List<String> customRegexes,
    Long waitMillseconds) throws Exception {
  CAS cas = this.ae.getCAS();
  String casId;
  try {
    cas.setDocumentText(text);
    cas.setDocumentLanguage(lang);
    TypeSystem ts = cas.getTypeSystem();
    Feature ft = ts.getFeatureByFullName(UIMA_ENTITIES_FS);
    Type type = ts.getType(UIMA_ENTITIES);
    if (type != null) {
      // if remote annotator or pear supports type
      // org.apache.uima.entities:entities, add it to indexes,
      // otherwise do nothing.
      FeatureStructure fs = cas.createFS(type);
      fs.setStringValue(ft, joiner.join(uimaTypes));
      cas.addFsToIndexes(fs);
    }
    addCustomRegex(customTypes, customRegexes, cas);
    casId = this.ae.sendCAS(cas);
  } catch (ResourceProcessException e) {
    // http://t17251.apache-uima-general.apachetalk.us/uima-as-client-is-blocking-t17251.html
    cas.release();
    logger.error("Exception thrown when process cas " + cas, e);
    throw e;
  }
  Result rst = this.listener.waitFinished(casId, waitMillseconds);
  return rst;
}
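A hypothetical caller of the method above might then look like this (the service instance, entity name, regex, and wait time are just for illustration):

// Extract "productcode" entities matching PROD-\d+ from the document text.
List<String> customTypes = Arrays.asList("productcode");
List<String> customRegexes = Arrays.asList("PROD-\\d+");
Result result = service.process("Order PROD-12345 shipped today", "en",
    Collections.<String> emptyList(), customTypes, customRegexes, 5000L);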
Define Feature Structures in RegExAnnotator.xml
org.apache.uima.input.dynamicregex is used as the input parameter; the client can specify values for its features types and regexes. org.apache.uima.output.dynamicregex is the output type.
<typeDescription>
  <name>org.apache.uima.input.dynamicregex</name>
  <description />
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>types</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>            
    <featureDescription>
      <name>regexes</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>            
  </features>          
</typeDescription>
<!-- output params -->
<typeDescription>
  <name>org.apache.uima.output.dynamicregex</name>
  <description />
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>type</name>
      <description />
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>

Run Custom Regex and Return Extracted Entities in RegExAnnotator
Next, in the RegExAnnotator.process method, we get the values of the input types and regexes, run the custom regexes, and add the found entities to the CAS indexes.
public void process(CAS cas) throws AnalysisEngineProcessException {
  procressCutsomRegex(cas);
  //... omitted
}
private void procressCutsomRegex(CAS cas) {
  TypeSystem ts = cas.getTypeSystem();
  Type dyInputType = ts.getType("org.apache.uima.input.dynamicregex");
  org.apache.uima.cas.Feature dyInputTypesFt = ts
      .getFeatureByFullName("org.apache.uima.input.dynamicregex:types");
  org.apache.uima.cas.Feature dyInputRegexesFt = ts
      .getFeatureByFullName("org.apache.uima.input.dynamicregex:regexes");
  String dyTypes = null, dyRegexes = null;
  FSIterator<?> dyIt = cas.getAnnotationIndex(dyInputType).iterator();

  AnnotationFS dyInputTypesFs = null, dyInputRegexesFs = null;
  while (dyIt.hasNext()) {
    // TODO this is kind of weird
    AnnotationFS afs = (AnnotationFS) dyIt.next();
    if (afs.getStringValue(dyInputTypesFt) != null) {
      dyTypes = afs.getStringValue(dyInputTypesFt);
      dyInputTypesFs = afs;
    }
    if (afs.getStringValue(dyInputRegexesFt) != null) {
      dyRegexes = afs.getStringValue(dyInputRegexesFt);
      dyInputRegexesFs = afs;
    }
  }
  if (dyInputTypesFs != null) {
    cas.removeFsFromIndexes(dyInputTypesFs);
  }
  if (dyInputRegexesFs != null) {
    cas.removeFsFromIndexes(dyInputRegexesFs);
  }
  if (dyTypes == null || dyRegexes == null) {
    // the client didn't supply any custom types/regexes
    return;
  }
  String[] dyTypesArr = dyTypes.split(","), dyRegexesArr = dyRegexes
      .split(",");
  if (dyTypesArr.length != dyRegexesArr.length) {
    throw new IllegalArgumentException(
        "Size of custom regex doesn't match. types: "
            + dyTypesArr.length + ",  regexes: "
            + dyRegexesArr.length);
  }
  if (dyTypesArr.length == 0)
    return;
  logger.log(Level.FINE, "User specifies custom regex: type: " + dyTypes
      + ", regexes: " + dyRegexes);
  String docText = cas.getDocumentText();
  Type dyOutputType = ts.getType("org.apache.uima.output.dynamicregex");
  org.apache.uima.cas.Feature dyOutputTypeFt = ts
      .getFeatureByFullName("org.apache.uima.output.dynamicregex:type");
  FSIndexRepository indexRepository = cas.getIndexRepository();
  for (int i = 0; i < dyTypesArr.length; i++) {
    Pattern pattern = Pattern.compile(dyRegexesArr[i]);
    Integer captureGroupPos = getNamedGrpupPosition(pattern, "capture");
    Matcher matcher = pattern.matcher(docText);

    while (matcher.find()) {
      AnnotationFS dyAnnFS;
      // if named group capture exists
      if (captureGroupPos != null) {
        dyAnnFS = cas.createAnnotation(dyOutputType,
            matcher.start(captureGroupPos),
            matcher.end(captureGroupPos));
      } else {
        dyAnnFS = cas.createAnnotation(dyOutputType,
            matcher.start(), matcher.end());
      }
      dyAnnFS.setStringValue(dyOutputTypeFt, dyTypesArr[i]);
      indexRepository.addFS(dyAnnFS);
    }
  }
}
/**
 * Use reflection to call namedGroups in JDK7
 */
@SuppressWarnings("unchecked")
private Integer getNamedGrpupPosition(Pattern pattern, String namedGroup) {
  try {
    Method namedGroupsMethod = Pattern.class.getDeclaredMethod(
        "namedGroups", null);
    namedGroupsMethod.setAccessible(true);

    Map<String, Integer> namedGroups = (Map<String, Integer>) namedGroupsMethod
        .invoke(pattern, null);
    return namedGroups.get(namedGroup);
  } catch (Exception e) {
    throw new RuntimeException(e);
  }
}
References
UIMA References
Apache UIMA Regular Expression Annotator Documentation

Get Start End Offset of Named Group in JDK7


The Problem
We want to know the start and end offsets of a named group, but Matcher's start() and end() in JDK 7 don't accept a group name as a parameter.

JDK 7 adds support for named groups:
(1) (?<NAME>X) defines a named group "NAME".
(2) \k<NAME> back-references the named group "NAME".
(3) ${NAME} references the captured group in the matcher's replacement string.

We can use matcher.group(String name) to return the input subsequence captured by the given named group, but the matcher's start() and end() don't accept a group name as a parameter.

The Solution
Check the JDK code to see how Matcher.group(String name) is implemented:
public String group(String name) {
    if (name == null)
        throw new NullPointerException("Null group name");
    if (first < 0)
        throw new IllegalStateException("No match found");
    if (!parentPattern.namedGroups().containsKey(name))
        throw new IllegalArgumentException("No group with name <" + name + ">");
    int group = parentPattern.namedGroups().get(name);
    if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
        return null;
    return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
It uses int group = parentPattern.namedGroups().get(name) to get the group position of the named group. Check the Pattern code: namedGroups() is not public; it is package-visible only.
Map<String, Integer> namedGroups() {
    if (namedGroups == null)
        namedGroups = new HashMap<>(2);
    return namedGroups;
}
We can't call it directly, but we can use Java reflection to call this package-visible method.

public void testGetNamedGrpupPositionInJDK7() throws Exception {
  Pattern pattern = Pattern.compile("((?<capture>abc).*d)(ef)");
  Integer groupPos = getNamedGrpupPositionInJDK7(pattern, "capture");
  if (groupPos == null) {
    System.out
        .println("Doesn't contain named group: capture, the pattern: "
            + pattern.toString());
  }
  Matcher matcher = pattern.matcher("abcxxdef");
  while (matcher.find()) {
    String matchedText = matcher.group("capture");
    matchedText = matcher.group(groupPos);
    System.out.println(matchedText + " " + matcher.start(groupPos)
        + ":" + matcher.end(groupPos));
  }
}

@SuppressWarnings("unchecked")
// don't use int, it would throw NPE if the regex doesn't contain the named
// group
private Integer getNamedGrpupPositionInJDK7(Pattern pattern,
    String namedGroup) throws NoSuchMethodException,
    IllegalAccessException, InvocationTargetException {
  Method namedGroupsMethod = Pattern.class.getDeclaredMethod(
      "namedGroups", null);
  namedGroupsMethod.setAccessible(true);

  Map<String, Integer> namedGroups = (Map<String, Integer>) namedGroupsMethod
      .invoke(pattern, null);
  return namedGroups.get(namedGroup);
}
Get Start End Offset of Named Group in JDK8
JDK 8 addressed this problem by adding start(String groupName) and end(String groupName) to get the start and end offsets of a named group.
public void testGetNamedGrpupPositionInJDK8() throws Exception {
  Pattern pattern = Pattern.compile("((?<capture>abc).*d)(ef)");
  Matcher matcher = pattern.matcher("abcxxdef");
  while (matcher.find()) {
    // if the regex doesn't contain the named group, it would throw
    // IllegalArgumentException: No group with name <capture>
    System.out.println(matcher.group("capture") + " "
        + matcher.start("capture") + ":" + matcher.end("capture"));
  }
}
References
Named Capturing Group in JDK7 RegEx

UIMA: Using Dedicated Feature Structure to Control Annotator Behavior


The Problem
In a previous post, Using ResultSpecification to Filter Annotator to Boost Opennlp UIMA Performance, I introduced how to use ResultSpecification to make OpenNLP.pear run only the needed annotators.

But recently we changed our content analyzer project to use UIMA-AS for better scale-out. UIMA-AS doesn't support specifying a ResultSpecification on the client side, so we had to find another solution.

Luckily, UIMA provides a more general mechanism, feature structures, which allows us to control an annotator's behavioral characteristics.

Using a Dedicated Feature Structure to Control Server Behavior
This time we will take RegExAnnotator.pear as the example: we have defined more than 10 regexes and entities in RegExAnnotator, and the client specifies which entities it is interested in.

The client specifies a value for the feature org.apache.uima.entities:entities, such as "ssn,creditcard,email"; RegExAnnotator checks the setting and runs only the needed regexes.

Specify Feature Value at Client Side
First we have a properties file, uima.properties, which defines the mapping of entity names to UIMA types:
regex_type_ssn=org.apache.uima.ssn
regex_type_CreditCard=org.apache.uima.CreditCardNumber
regex_type_Email=org.apache.uima.EmailAddress
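A small sketch of how the client might use these properties to translate user-facing entity names into UIMA type names before calling process (the loading code and variable names are assumptions; only the key prefix comes from the snippet above):

// Map user-facing entity names, e.g. "ssn,CreditCard", to UIMA type names.
Properties props = new Properties();
try (InputStream in = new FileInputStream("uima.properties")) {
  props.load(in);
}
List<String> types = new ArrayList<String>();
for (String entity : "ssn,CreditCard".split(",")) {
  String uimaType = props.getProperty("regex_type_" + entity);
  if (uimaType != null) {
    types.add(uimaType); // e.g. org.apache.uima.ssn
  }
}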


public class UIMAASService extends AbstractService {
 private static final String UIMA_ENTITIES = "org.apache.uima.entities";
 private static final String UIMA_ENTITIES_FS = UIMA_ENTITIES + ":entities";
 private static Joiner joiner = Joiner.on(",");

 public Result process(String text, String lang, List<String> types,
   Long waitMillseconds) throws Exception {
  CAS cas = this.ae.getCAS();
  String casId;
  try {
   cas.setDocumentText(text);
   cas.setDocumentLanguage(lang);
   TypeSystem ts = cas.getTypeSystem();
   Feature ft = ts.getFeatureByFullName(UIMA_ENTITIES_FS);

   Type type = ts.getType(UIMA_ENTITIES);
   if (type != null) {
    // if remote annotator or pear supports type
    // org.apache.uima.entities:entities, add it to indexes,
    // otherwise do nothing.
    FeatureStructure fs = cas.createFS(type);
    fs.setStringValue(ft, joiner.join(types));
    cas.addFsToIndexes(fs);
   }
   casId = this.ae.sendCAS(cas);
  } catch (ResourceProcessException e) {
   // http://t17251.apache-uima-general.apachetalk.us/uima-as-client-is-blocking-t17251.html
   // The UIMA AS framework code throws an
   // Exception and the application must catch it and release a CAS
   // before continuing. 
   cas.release();
   logger.error("Exception thrown when process cas " + cas, e);
   throw e;
  }
  Result rst = this.listener.waitFinished(casId, waitMillseconds);
  return rst;
 }
 protected static final Logger logger = LoggerFactory
   .getLogger(UIMAASService.class);

 private UimaAsynchronousEngine ae = null;
 protected UimaAsListener listener;
 private String serverUrl;
 private String endpoint;
 private static final int RETRIES = 10;

 public UIMAASService(String serverUrl, String endpoint) {
  this.serverUrl = serverUrl;
  this.endpoint = endpoint;
 }
 public void configureAE() throws SimpleServerException, IOException,
   XmlException, ResourceInitializationException {
  boolean success = false;
  for (int i = 0; (i < RETRIES) && (!success); i++) {
   try {
    this.ae = new BaseUIMAAsynchronousEngine_impl();
    this.listener = new UimaAsListener(this);
    this.ae.addStatusCallbackListener(this.listener);
    Map<String, Object> deployCtx = new HashMap<String, Object>();
    deployCtx.put("ServerURI", this.serverUrl);
    deployCtx.put("Endpoint", this.endpoint);
    deployCtx.put("Timeout", 60000);
    deployCtx.put("CasPoolSize", 20);
    deployCtx.put("GetMetaTimeout", 20000);

    if (StringUtils.isNotBlank(System.getProperty("uimaDebug"))) {
     deployCtx.put("-uimaEeDebug", Boolean.valueOf(true));
    }

    this.ae.initialize(deployCtx);
    success = true;
   } catch (ResourceInitializationException e) {
    if (i < RETRIES - 1) {
     logger.error(
       getName()
         + " configureAE failed when deploy , will retry, retried times: "
         + i, e);
    } else {
     logger.error(getName()
       + " configureAE failed, retried times: " + i, e);
     throw e;
    }
   }
  }
  configure(null);
 }

 public UimaAsListener getListener() {
  return this.listener;
 }

 public void deployPear(String appHome, File pearFile, File installationDir,
   String deployFileName) throws Exception {
  PackageBrowser instPear = PackageInstaller.installPackage(
    installationDir, pearFile, true);

  File deployFile = new File(instPear.getRootDirectory(), deployFileName);

  logger.info(getName() + " deployFile: " + deployFile);
  updateDeployFile(deployFile, this.serverUrl, this.endpoint);

  Map<String, Object> deployCtx = new HashMap<String, Object>();
  deployCtx.put("DD2SpringXsltFilePath", new File(appHome,
    "resources/uima/config/dd2spring.xsl").getAbsolutePath());

  deployCtx.put(
    "SaxonClasspath",
    "file:"
      + new File(appHome, "resources/uima/lib/saxon8.jar")
        .getAbsolutePath());

  BaseUIMAAsynchronousEngine_impl tmpAE = new BaseUIMAAsynchronousEngine_impl();
  tmpAE.deploy(deployFile.getAbsolutePath(), deployCtx);

  logger.info(getName() + " deployed " + pearFile.getAbsolutePath());
 }
 private static void updateDeployFile(File deployFile, String serverUrl,
   String endpoint) throws FileNotFoundException, IOException {
  String fileContext = null;
  InputStream is = new FileInputStream(deployFile);
  try {
   fileContext = IOUtils.toString(is);
  } finally {
   IOUtils.closeQuietly(is);
  }
  fileContext = fileContext.replace("${endpoint}", endpoint);
  fileContext = fileContext.replace("${brokerURL}", serverUrl);

  OutputStream os = new FileOutputStream(deployFile);
  try {
   IOUtils.write(fileContext, os);
  } finally {
   IOUtils.closeQuietly(os);
  }
 }
 public void deployPear(File pearFile, File installationDir,
   String deployFileName) throws Exception {
  String appHome = System.getProperty("cv.app.running.home").trim();
  deployPear(appHome, pearFile, installationDir, deployFileName);
 } 
}
Add Feature in RegExAnnotator.xml
<typeSystemDescription>
  <typeDescription>
    <name>org.apache.uima.entities</name>
    <description />
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
      <featureDescription>
        <name>entities</name>
        <description/>
        <rangeTypeName>uima.cas.String</rangeTypeName>
      </featureDescription>
    </features>
  </typeDescription>
</typeSystemDescription>
Check Feature Value at RegExAnnotator
RegExAnnotator gets the value of org.apache.uima.entities:entities; if it is set, it checks all configured regex Concepts and adds a concept to runConcepts only if that concept produces one of the UIMA types of these entities.
public class RegExAnnotator extends CasAnnotator_ImplBase {
 private static final String UIMA_ENTITIES = "org.apache.uima.entities";
 private static final String UIMA_ENTITIES_FS = UIMA_ENTITIES + ":entities";
  
 public void process(CAS cas) throws AnalysisEngineProcessException {
  TypeSystem ts = cas.getTypeSystem();
  org.apache.uima.cas.Type entitiesType = ts.getType(UIMA_ENTITIES);
  FSIterator<?> it = cas.getAnnotationIndex(entitiesType).iterator();

  org.apache.uima.cas.Feature ft = ts
      .getFeatureByFullName(UIMA_ENTITIES_FS);
  String onlyRegexStr = null;
  AnnotationFS entitiesFs = null;
  while (it.hasNext()) {
    // TODO this is kind of weird
    AnnotationFS afs = (AnnotationFS) it.next();
    if (afs.getStringValue(ft) != null) {
      System.out.println(afs.getType().getName());
      onlyRegexStr = afs.getStringValue(ft).trim();
      entitiesFs = afs;
    }
    // onlyRegexStr = afs.getStringValue(ft).trim();
    logger.log(Level.FINE, "Only run " + onlyRegexStr);
  }
  if (entitiesFs != null) {
    cas.removeFsFromIndexes(entitiesFs);
  }

  List<String> types = null;
  List<Concept> runConcepts = new ArrayList<Concept>();

  if (onlyRegexStr != null) {
    types = Arrays.asList(onlyRegexStr.split(","));
    for (Concept concept : regexConcepts) {
      Annotation[] annotations = concept.getAnnotations();
      if (annotations != null) {
        for (Annotation annotation : annotations) {
          if (types.contains(annotation.getAnnotationType()
              .getName())) {
            runConcepts.add(concept);
            break;
          }
        }
      }
    }
  } else {
    runConcepts = this.regexConcepts;
  }
  // change this.regexConcepts.length to local variable: runConcepts
  for (int i = 0; i < runConcepts.size(); i++) { 
       // same and omitted...
    }
  }
}  
References
UIMA References - feature structures
Apache UIMA Regular Expression Annotator Documentation
http://comments.gmane.org/gmane.comp.apache.uima.general/5866

Lucene Highlighter HowTo


In practice, we may want to highlight the matched words in the query response, so users can easily spot the matched section and jump to it.

package org.lifelongprogrammer.learningLucene;
public class LuceneHighlighterInAction {

 public static void main(String[] args) throws Exception {
  Directory directory = new RAMDirectory();
  StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);

  String fieldName = "content";
  writeDocs(directory, analyzer, fieldName);
  // use Highlighter
  try (DirectoryReader indexReader = DirectoryReader.open(directory);) {
   IndexSearcher searcher = new IndexSearcher(indexReader);
   TermQuery query = new TermQuery(new Term(fieldName, "love"));

   TopDocs topDocs = searcher.search(query, 10);
   System.out.println("Total hits: " + topDocs.totalHits);
   ScoreDoc[] scoreDocs = topDocs.scoreDocs;

   // use SimpleHTMLFormatter
   System.out.println("use SimpleHTMLFormatter");
   QueryScorer scorer = new QueryScorer(query);
   Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(
     "<font color='red'>", "</font>"), scorer);
   Fragmenter fragmenter = new SimpleFragmenter(200);
   highlighter.setTextFragmenter(fragmenter);

   for (int i = 0; i < Math.min(scoreDocs.length, 10); ++i) {
    Document doc = searcher.doc(scoreDocs[i].doc);
    String fieldContent = doc.get(fieldName);
    System.out.println(fieldContent + " , " + scoreDocs[i].score);
    System.out.println(highlighter.getBestFragment(analyzer,
      fieldName, fieldContent));
   }

   // use SimpleSpanFragmenter
   System.out.println("use SimpleSpanFragmenter");
   highlighter = new Highlighter(scorer);
   //default is Highlighter.DEFAULT_MAX_CHARS_TO_ANALYZE 50*1024
   highlighter.setMaxDocCharsToAnalyze(10240);
   fragmenter = new SimpleSpanFragmenter(new QueryScorer(query), 10);
   for (int i = 0; i < Math.min(scoreDocs.length, 10); ++i) {
    Document doc = searcher.doc(scoreDocs[i].doc);
    String fieldContent = doc.get(fieldName);
    System.out.println(fieldContent + " , " + scoreDocs[i].score);
    TokenStream tokenStream = analyzer.tokenStream(fieldName,
      fieldContent);
    String result = highlighter.getBestFragments(tokenStream,
      fieldContent, 2, "...");
    System.out.println(result);
   }
  }
 }

 private static void writeDocs(Directory directory,
   StandardAnalyzer analyzer, String fieldName) throws IOException {
  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
    analyzer);
  config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
  try (IndexWriter writer = new IndexWriter(directory, config)) {

   FieldType fieldType = new FieldType();
   fieldType.setIndexed(true);
   fieldType.setStored(true);
   fieldType.setTokenized(true);
   fieldType.setStoreTermVectors(true);
   fieldType.setStoreTermVectorOffsets(true);
   fieldType.setStoreTermVectorPositions(true);
   fieldType.setOmitNorms(false);
   fieldType.freeze();

   Document doc = new Document();
   doc.add(new Field(
     fieldName,
     "I am a lifelong programmer, I love coding; I am a lifelong programmer, I love programming.",
     fieldType));
   writer.addDocument(doc);

   doc = new Document();
   doc.add(new Field(
     fieldName,
     "I am a lifelong programmer, I love the world; I am a lifelong programmer, I love the life.",
     fieldType));
   writer.addDocument(doc);
  }
 }
}
Main code: org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int) 
Highlighter in Solr
https://cwiki.apache.org/confluence/display/solr/Highlighting
http://wiki.apache.org/solr/HighlightingParameters

Paginating Lucene Search Results


Use IndexSearcher.searchAfter
/**
 * useSearcherAfter, need client record the returned last ScoreDoc
 * lastBottom, and pass it in next round.
 */
private void useSearcherAfter(DirectoryReader indexReader,
    IndexSearcher searcher, int pageSize) throws IOException {
  Query query = new TermQuery(new Term("title", "java"));
  // query = new MatchAllDocsQuery();
  ScoreDoc lastBottom = null;
  while (true) {
    TopDocs paged = null;
    paged = searcher.searchAfter(lastBottom, query, null, pageSize);
    if (paged.scoreDocs.length == 0) {
      // no more data, break;
      break;
    }
    ScoreDoc[] scoreDocs = paged.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
      Utils.printDoc(searcher.doc(scoreDoc.doc), "id", "title");
    }

    lastBottom = paged.scoreDocs[paged.scoreDocs.length - 1];
  }
}

Skip Previous Docs
Not good for performance or memory usage.
private void skipPreviousRows(DirectoryReader indexReader,
    IndexSearcher searcher, int pageStart, int pageSize)
    throws IOException {
  Query query = new TermQuery(new Term("title", "java"));
  int pageEnd = pageStart - 1 + pageSize;
  TopDocs hits = searcher.search(query, pageEnd);

  for (int i = pageStart - 1; i < pageEnd; i++) {
    int docId = hits.scoreDocs[i].doc;

    // load the document
    Document doc = searcher.doc(docId);
    Utils.printDocAndExplain(doc, searcher, query, docId, "id", "title");
  }
}

In Solr 4.7+, we can do deep paging with cursorMark:
Solr Deep Pagination Problem Fixed in Solr-5463
Sorting, Paging, and Deep Paging in Solr
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=*

http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D
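The same loop in SolrJ might look like the hedged sketch below (Solr/SolrJ 4.7+; the base URL and sort fields echo the URLs above, everything else is an assumption). The sort must end with a unique field such as id so the cursor is stable:

SolrServer solr = new HttpSolrServer("http://solr1:8080/solr");
SolrQuery q = new SolrQuery("accesstime:[* TO NOW-5YEAR/DAY]");
q.setRows(1000);
q.setSort(SolrQuery.SortClause.desc("accesstime"));
q.addSort(SolrQuery.SortClause.asc("id"));

String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
boolean done = false;
while (!done) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = solr.query(q);
  // process rsp.getResults() ...
  String nextCursorMark = rsp.getNextCursorMark();
  if (cursorMark.equals(nextCursorMark)) {
    done = true; // the cursor stopped moving: no more results
  }
  cursorMark = nextCursorMark;
}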

Lucene Built-in Collectors


TotalHitCountCollector
Collector's collect method is called for each matched doc.
The main methods in the process:
org.apache.lucene.search.IndexSearcher.search(List, Weight, Collector)
org.apache.lucene.search.Weight.DefaultBulkScorer.scoreAll(Collector, Scorer)
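As a reference point for writing our own, a minimal custom collector might look like the sketch below (Lucene 4.x Collector API; the class is only for illustration and mirrors what TotalHitCountCollector does):

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class CountingCollector extends Collector {
  private int count;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    // not needed: we never look at scores
  }

  @Override
  public void collect(int doc) throws IOException {
    count++; // called once for every matching (segment-local) doc id
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // context.docBase would give the global doc id offset if we needed it
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // plain counting does not care about doc order
  }

  public int getCount() {
    return count;
  }
}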

TopScoreDocCollector
Create collector:
org.apache.lucene.search.TopScoreDocCollector.create(int, ScoreDoc, boolean)

public static TopScoreDocCollector create(int numHits, ScoreDoc after, boolean docsScoredInOrder) {
  if (docsScoredInOrder) {
    return after == null 
      ? new InOrderTopScoreDocCollector(numHits) 
      : new InOrderPagingScoreDocCollector(after, numHits);
  } else {
    return after == null
      ? new OutOfOrderTopScoreDocCollector(numHits)
      : new OutOfOrderPagingScoreDocCollector(after, numHits);
  }
}
The collector puts docs into a HitQueue (a PriorityQueue):
org.apache.lucene.search.TopScoreDocCollector.OutOfOrderTopScoreDocCollector.collect(int)
org.apache.lucene.search.HitQueue.lessThan(ScoreDoc, ScoreDoc)
TopFieldCollector
public static TopFieldCollector create(Sort sort, int numHits, FieldDoc after,
    boolean fillFields, boolean trackDocScores, boolean trackMaxScore,
    boolean docsScoredInOrder)
    throws IOException {
  FieldValueHitQueue<Entry> queue = FieldValueHitQueue.create(sort.fields, numHits);
  if (after == null) {
    if (queue.getComparators().length == 1) {
      if (docsScoredInOrder) {
        if (trackMaxScore) {
          return new OneComparatorScoringMaxScoreCollector(queue, numHits, fillFields);
        } else if (trackDocScores) {
          return new OneComparatorScoringNoMaxScoreCollector(queue, numHits, fillFields);
        } else {
          return new OneComparatorNonScoringCollector(queue, numHits, fillFields);
        }
      } else {
        if (trackMaxScore) {
          return new OutOfOrderOneComparatorScoringMaxScoreCollector(queue, numHits, fillFields);
        } else if (trackDocScores) {
          return new OutOfOrderOneComparatorScoringNoMaxScoreCollector(queue, numHits, fillFields);
        } else {
          return new OutOfOrderOneComparatorNonScoringCollector(queue, numHits, fillFields);
        }
      }
    }
    // multiple comparators.
    if (docsScoredInOrder) {
      if (trackMaxScore) {
        return new MultiComparatorScoringMaxScoreCollector(queue, numHits, fillFields);
      } else if (trackDocScores) {
        return new MultiComparatorScoringNoMaxScoreCollector(queue, numHits, fillFields);
      } else {
        return new MultiComparatorNonScoringCollector(queue, numHits, fillFields);
      }
    } else {
      if (trackMaxScore) {
        return new OutOfOrderMultiComparatorScoringMaxScoreCollector(queue, numHits, fillFields);
      } else if (trackDocScores) {
        return new OutOfOrderMultiComparatorScoringNoMaxScoreCollector(queue, numHits, fillFields);
      } else {
        return new OutOfOrderMultiComparatorNonScoringCollector(queue, numHits, fillFields);
      }
    }
  } else {
    return new PagingFieldCollector(queue, after, numHits, fillFields, trackDocScores, trackMaxScore);
  }
}
org.apache.lucene.search.FieldValueHitQueue
org.apache.lucene.search.FieldValueHitQueue.OneComparatorFieldValueHitQueue
org.apache.lucene.search.FieldValueHitQueue.MultiComparatorsFieldValueHitQueue

Test Lucene Built-in Collectors
public class LearningCollector {

 @Before
 public void setup() throws IOException {
  Utils.writeIndex();
 }

 @Test
 public void testBuiltCollector() throws IOException {
  try (Directory directory = FSDirectory.open(new File(
    Utils.INDEX_FOLDER_PATH));
    DirectoryReader indexReader = DirectoryReader.open(directory);) {
   IndexSearcher searcher = new IndexSearcher(indexReader);

   usingTotalHitCollector(searcher);
   usingTopScoreDocCollector(searcher);
   usingTopFieldCollector(searcher);
   usingLuceneGroup(searcher);
  }
 }

 private void usingTotalHitCollector(IndexSearcher searcher)
   throws IOException {
  TotalHitCountCollector collector = new TotalHitCountCollector();
  TermQuery query = new TermQuery(new Term("title", "java"));
  searcher.search(query, collector);
  System.out.println("total hits:" + collector.getTotalHits());
 }

 private void usingLuceneGroup(IndexSearcher searcher) throws IOException {
  String groupField = "title";
  TermQuery query = new TermQuery(new Term("title", "java"));
  Sort groupSort = new Sort(new SortField("title", Type.STRING));
  Sort docSort = new Sort((new SortField("price", Type.INT, true)));
  groupBy(searcher, query, groupField, groupSort, docSort);
 }

 // Use TermFirstPassGroupingCollector, TermSecondPassGroupingCollector,
 // CachingCollector, TermAllGroupsCollector,MultiCollector
 private void groupBy(IndexSearcher searcher, Query query,
   String groupField, Sort groupSort, Sort docSort) throws IOException {
  // return ngroups every page
  int topNGroups = 10;
  int groupOffset = 0;
  boolean fillFields = true;

  int docOffset = 0;
  boolean requiredTotalGroupCount = true;

  TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector(
    groupField, groupSort, topNGroups);
  boolean cacheScores = true;
  double maxCacheRAMMB = 16.0;
  CachingCollector cachedCollector = CachingCollector.create(c1,
    cacheScores, maxCacheRAMMB);
  searcher.search(query, cachedCollector);

  Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(
    groupOffset, fillFields);

  if (topGroups == null) {
   // No groups matched
   return;
  }

  Collector secondPassCollector = null;

  boolean getScores = true;
  boolean getMaxScores = true;
  boolean fillSortFields = true;
  int docsPerGroup = 10;
  TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector(
    groupField, topGroups, groupSort, docSort, docsPerGroup,
    getScores, getMaxScores, fillSortFields);

  // Optionally compute total group count
  TermAllGroupsCollector allGroupsCollector = null;
  if (requiredTotalGroupCount) {
   allGroupsCollector = new TermAllGroupsCollector(groupField);
   secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
  } else {
   secondPassCollector = c2;
  }

  if (cachedCollector.isCached()) {
   // Cache fit within maxCacheRAMMB, so we can replay it:
   cachedCollector.replay(secondPassCollector);
  } else {
   // Cache was too large; must re-execute query:
   searcher.search(query, secondPassCollector);
  }

  int totalGroupCount = -1;
  int totalHitCount = -1;
  int totalGroupedHitCount = -1;
  if (requiredTotalGroupCount) {
   totalGroupCount = allGroupsCollector.getGroupCount();
  }
  System.out.println("groupCount: " + totalGroupCount);

  TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
  totalHitCount = groupsResult.totalHitCount;
  totalGroupedHitCount = groupsResult.totalGroupedHitCount;
  System.out.println("groupsResult.totalHitCount:" + totalHitCount);
  System.out.println("groupsResult.totalGroupedHitCount:"
    + totalGroupedHitCount);

  int groupIdx = 0;
  for (GroupDocs<BytesRef> groupDocs : groupsResult.groups) {
   groupIdx++;
   System.out.println("group[" + groupIdx + "]:"
     + groupDocs.groupValue);
   System.out
     .println("group[" + groupIdx + "]:" + groupDocs.totalHits);
   int docIdx = 0;
   for (ScoreDoc scoreDoc : groupDocs.scoreDocs) {
    docIdx++;
    System.out.println("group[" + groupIdx + "][" + docIdx + "]:"
      + scoreDoc.doc + "/" + scoreDoc.score);
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.println("group[" + groupIdx + "][" + docIdx + "]:"
      + doc);
   }
  }
 }

 private void usingTopFieldCollector(IndexSearcher searcher)
   throws IOException {
  TermQuery query = new TermQuery(new Term("title", "java"));
  // reverse is true: sort=price desc
  Sort sort = new Sort(new SortField("price", Type.INT, true));
  TopFieldCollector collector = TopFieldCollector.create(sort, 10, false,
    false, false, false);

  searcher.search(query, collector);
  printAndExplainSearchResult(searcher, collector, true, query, "price");
  // set these to true: fillFields, trackDocScores, trackMaxScore
  collector = TopFieldCollector.create(sort, 10, true, true, true, false);

  searcher.search(query, collector);
  printAndExplainSearchResult(searcher, collector, true, query, "price");

  // sort by multiple field
  sort = new Sort(new SortField("price", Type.INT, true), new SortField(
    "title", Type.STRING, false));
  collector = TopFieldCollector.create(sort, 10, true, true, true, false);

  searcher.search(query, collector);
  printAndExplainSearchResult(searcher, collector, true, query, "price",
    "title");
 }

 private void usingTopScoreDocCollector(IndexSearcher searcher)
   throws IOException {
  TermQuery query = new TermQuery(new Term("title", "java"));
  TopScoreDocCollector collector = TopScoreDocCollector.create(10, false);
  searcher.search(query, collector);
  printAndExplainSearchResult(searcher, collector, true, query, "title",
    "author");
   // TODO: searchAfter example
 }
}

Solr Wildcard Query with Stemming


The Problem
Today I was asked to take a look at one query issue:
When a user searches file, files, or file*, Solr returns matches correctly, but if the user searches files*, Solr doesn't return a match.

The Solution
A Google search found the solution on this page:
Stemming not working with wildcard search

Wildcards and stemming are incompatible at query time - you need to manually stem the term before applying your wildcard.

Wildcards are not supported in quoted phrases. They will be treated as punctuation, and ignored by the standard tokenizer or the word delimiter filter.

In this case, it is a PrefixQuery, which works similarly to a wildcard query.

The solution is to add KeywordRepeatFilterFactory and RemoveDuplicatesTokenFilterFactory around the stem filter factory:
<fieldType name="text_rev" class="solr.TextField"
  positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1" generateNumberParts="1" catenateWords="1"
      catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
      preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordRepeatFilterFactory"/> 
    <filter class="solr.PorterStemFilterFactory"/> 
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
    <filter class="solr.ReversedWildcardFilterFactory"
      withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2"
      maxFractionAsterisk="0.33" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
      ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1" generateNumberParts="1" catenateWords="0"
      catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
      preserveOriginal="1" />        
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.PorterStemFilterFactory"/> 
  </analyzer>
</fieldType>
Test
Next, let's write a unit test to verify the change.
public void testWildcardStemFromSchema() {
  try {
    URLClassLoader urlClassLoader = (URLClassLoader) ClassLoader
        .getSystemClassLoader();
    // from
    // http://www.hangar.org/docs/activitats/SummerLAB/Pure%20Data/OSC%20-%20OpenSoundControl/SwingOSC/src/de/sciss/util/DynamicURLClassLoader.java
    DynamicURLClassLoader dynaLoader = new DynamicURLClassLoader(
        urlClassLoader);
    dynaLoader.addURL(new File(CONF_FOLDER).toURI().toURL());
    Thread.currentThread().setContextClassLoader(dynaLoader);
    InputSource solrCfgIs = new InputSource(new FileReader(new File(
        CONF_FOLDER, "solrconfig.xml")));
    SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);
    InputSource solrSchemaIs = new InputSource(new FileReader(new File(
        CONF_FOLDER, "schema.xml")));
    IndexSchema solrSchema = new IndexSchema(solrConfig, "mySchema",
        solrSchemaIs);
    Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
    listAllFieldTypes(fieldTypes);

    // now test text_rev
    String inputText = "files";
    FieldType fieldTypeText = fieldTypes.get("text_rev");
    Analyzer indexAnalyzer = fieldTypeText.getIndexAnalyzer();
    Analyzer queryAnalyzer = fieldTypeText.getQueryAnalyzer();
    System.out.println("Indexing analysis:");
    testIndexerSearcher(solrSchema, indexAnalyzer, queryAnalyzer);

    TokenStream tokenStream = indexAnalyzer.tokenStream("content",
        new StringReader(inputText));
    CharTermAttribute termAttr = tokenStream
        .getAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAttr = tokenStream
        .getAttribute(OffsetAttribute.class);
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
      System.out.println(termAttr.toString() + " offset: "
          + offsetAttr.startOffset() + ":" + offsetAttr.endOffset()
          + ", type:" + typeAttr.type());
    }
    tokenStream.end();
    tokenStream.close();

    String searchText = "files*";
    System.out.println("\r\nQuerying analysis:");
    tokenStream = queryAnalyzer.tokenStream("content", new StringReader(
        searchText));
    tokenStream.reset();
    CharTermAttribute termAttr2 = (CharTermAttribute) tokenStream
        .getAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
      System.out.println(termAttr2.toString());
    }
    tokenStream.end();
    tokenStream.close();
  } catch (Exception e) {
    e.printStackTrace();
  }
}

private void testIndexerSearcher(IndexSchema solrSchema,
    Analyzer indexAnalyzer, Analyzer queryAnalyzer) throws IOException,
    ParseException {
  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
      indexAnalyzer);
  // recreate the index on each execution
  config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
  config.setUseCompoundFile(false);
  // if we setInfoStream, add the below annotation to the TestClass
  // @SuppressSysoutChecks(bugUrl = "Solr logs to JUL")
  // config.setInfoStream(System.out);
  // be sure to close Directory and IndexWriter
  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      IndexWriter writer = new IndexWriter(directory, config)) {
    Document doc = new Document();
    IndexableField field = solrSchema.getField("content").createField(
        "files", 1.0f);
    doc.add(field);
    writer.addDocument(doc);
    writer.commit();
  }
  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory);) {
    IndexSearcher searcher = new IndexSearcher(indexReader);
    QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
        "content", queryAnalyzer);
    Query query = queryParser.parse("files*");
    System.out.println("queryParser query:" + query.toString());
    TopDocs docs = searcher.search(query, 10);
    LuceneUtil.printAndExplaunSearchResult(searcher, docs, query,
        "content");
  }
}

public void testUsingAnalyzer() {
  try {
    URLClassLoader urlClassLoader = (URLClassLoader) ClassLoader
        .getSystemClassLoader();
    // from
    // http://www.hangar.org/docs/activitats/SummerLAB/Pure%20Data/OSC%20-%20OpenSoundControl/SwingOSC/src/de/sciss/util/DynamicURLClassLoader.java
    DynamicURLClassLoader dynaLoader = new DynamicURLClassLoader(
        urlClassLoader);
    dynaLoader.addURL(new File(CONF_FOLDER).toURI().toURL());
    Thread.currentThread().setContextClassLoader(dynaLoader);

    StringReader inputText = new StringReader("pictures files");
    Map<String, String> commonArgs = ImmutableMap
        .<String, String> builder()
        .put(AbstractAnalysisFactory.LUCENE_MATCH_VERSION_PARAM,
            Version.LUCENE_4_9 + "").build();
    // These factories remove consumed elements from the args map so they can
    // detect unwanted parameters and catch typos:
    // org.apache.lucene.analysis.util.AbstractAnalysisFactory.AbstractAnalysisFactory(Map<String, String>)
    //   args.remove(CLASS_NAME); // consume the class arg
    // org.apache.lucene.analysis.core.WhitespaceTokenizerFactory.WhitespaceTokenizerFactory(Map<String, String>)
    //   if (!args.isEmpty()) {
    //     throw new IllegalArgumentException("Unknown parameters: " + args);
    //   }
    TokenizerFactory tkf = new WhitespaceTokenizerFactory(
        new HashMap<String, String>(commonArgs));
    Tokenizer tkz = tkf.create(inputText);

    HashMap<String, String> stopFilterParmas = new HashMap<String, String>(
        commonArgs);
    stopFilterParmas.put("words", "stopwords.txt");
    // CONF_FOLDER is added to the classpath
    ResourceLoader loader = new ClasspathResourceLoader();
    // ResourceLoader loader = new FilesystemResourceLoader(new File(CONF_FOLDER));
    StopFilterFactory stf = new StopFilterFactory(stopFilterParmas);
    stf.inform(loader);
    TokenStream st = stf.create(tkz);

    WordDelimiterFilterFactory wdff = new WordDelimiterFilterFactory(
        new HashMap<String, String>(commonArgs));
    TokenFilter wdf = wdff.create(st);
    LowerCaseFilterFactory lcf = new LowerCaseFilterFactory(
        new HashMap<String, String>(commonArgs));
    TokenStream lcts = lcf.create(wdf);
    KeywordRepeatFilterFactory krff = new KeywordRepeatFilterFactory(
        new HashMap<String, String>(commonArgs));
    TokenStream kdrf = krff.create(lcts);
    TokenFilterFactory psff = new PorterStemFilterFactory(
        new HashMap<String, String>(commonArgs));
    TokenStream psf = psff.create(kdrf);
    RemoveDuplicatesTokenFilterFactory rdtff = new RemoveDuplicatesTokenFilterFactory(
        new HashMap<String, String>(commonArgs));
    RemoveDuplicatesTokenFilter rdtf = rdtff.create(psf);
    ReversedWildcardFilterFactory rwff = new ReversedWildcardFilterFactory(
        new HashMap<String, String>(commonArgs));
    TokenStream rwf = rwff.create(rdtf);

    CharTermAttribute termAttrib = (CharTermAttribute) rwf
        .getAttribute(CharTermAttribute.class);
    rwf.reset();
    while (rwf.incrementToken()) {
      String term = termAttrib.toString();
      System.out.println(term);
    }
    rwf.end();
    rwf.close();
  } catch (Exception e) {
    e.printStackTrace();
  }
}
Output of the test case:
text_rev:TokenizerChain(org.apache.lucene.analysis.core.WhitespaceTokenizerFactory@48ae9b55, org.apache.lucene.analysis.core.StopFilterFactory@1700915, org.apache.lucene.analysis.miscellaneous.WordDelimiterFilterFactory@21de60b4, org.apache.lucene.analysis.core.LowerCaseFilterFactory@c267ef4, org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilterFactory@30ee2816, org.apache.lucene.analysis.en.PorterStemFilterFactory@31d7b7bf, org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilterFactory@635eaaf1, org.apache.solr.analysis.ReversedWildcardFilterFactory@5c30a9b0)
queryParser query:content:files*
Found : 1 hits.
1. files
1.0 = (MATCH) ConstantScore(content:files*), product of:
  1.0 = boost
  1.0 = queryNorm

Indexing analysis:
selif offset: 0:5, type:word
files offset: 0:5, type:word
elif offset: 0:5, type:word
file offset: 0:5, type:word

Querying analysis:
files*

file
References
Stemming not working with wildcard search
Testing Solr schema, analyzers and tokenization

Maven: Non Existing Library jdk.tools.jar


The Problem
Today, after adding some dependencies to Maven, I found that Maven refused to compile. The Problems view shows the error:
The container 'Maven Dependencies' references non existing library 'C:\Users\administrator\.m2\repository\jdk\tools\jdk.tools\1.6\jdk.tools-1.6.jar'

Checking my pom.xml, there is no direct dependency on jdk.tools-1.6.jar, so I used the Maven dependency:tree tool to figure out which library indirectly depends on it.
mvn dependency:tree -Dverbose -Dincludes=jdk.tools
[INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ learningLucene ---
[INFO] org.lifelongprogrammer:learningLucene:jar:1.0
[INFO] \- org.apache.solr:solr-core:jar:4.9.0:compile
[INFO]    \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO]       \- jdk.tools:jdk.tools:jar:1.7:system

It turns out solr-core depends on hadoop-annotations, which in turn needs jdk.tools.jar.

The Solution
A Google search turned up the following fix: add an explicit jdk.tools dependency with a system scope pointing to the local JDK's tools.jar:

<dependency>
  <groupId>jdk.tools</groupId>
  <artifactId>jdk.tools</artifactId>
  <scope>system</scope>
  <systemPath>C:/Program Files/Java/jdk1.8.0/lib/tools.jar</systemPath>
  <!-- have to include the version, otherwise eclipse throws exception: 
    Errors running builder 'Maven Project Builder' on project 'learningLucene'. 
    java.lang.NullPointerException -->
  <version>8.0</version>
</dependency>
</dependencies>
<build>
<plugins>
  <plugin>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
      <source>1.8</source>
      <target>1.8</target>
    </configuration>
  </plugin>
</plugins>
</build>

Jetty: insufficient threads configured for SelectChannelConnector


The Problem
Today when I ran our Solr application on one machine, it reported a warning during startup:
Oct 6, 2014 7:25:15 PM org.eclipse.jetty.server.AbstractConnector doStart
WARNING: insufficient threads configured for SelectChannelConnector@0.0.0.0:12345

Trying an HTTP request in the browser gave no response; it just hung forever.

Inspecting the Solr server in VisualVM, the Threads tab shows 238 live threads, with a lot of selectors (128) and acceptors (72). This looks very suspicious:
qtp1287645725-145 Selector127
   java.lang.Thread.State: BLOCKED
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
        at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:273)
        at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:255)
        at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:136)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked (a sun.nio.ch.Util$2)

"qtp1287645725-217 Acceptor71 SelectChannelConnector@0.0.0.0:12345"
   java.lang.Thread.State: BLOCKED
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:134)
- waiting to lock (a java.lang.Object) owned by "qtp1287645725-221 Acceptor71 SelectChannelConnector@0.0.0.0:12345" t@221
at org.eclipse.jetty.server.nio.SelectChannelConnector.accept(SelectChannelConnector.java:109)


Then check the code: when starting Jetty, the code sets the acceptors to the number of CPU cores * 2. This machine has 64 cores, which causes Jetty to start 64*2 = 128 selectors and acceptors.
connector.setAcceptors(2 * Runtime.getRuntime().availableProcessors());

The default number of acceptors is (Runtime.getRuntime().availableProcessors()+3)/4, which is 16 in this case.
setAcceptors(Math.max(1,(Runtime.getRuntime().availableProcessors()+3)/4));

So to fix this issue, I just commented out the custom acceptors code: connector.setAcceptors(2 * Runtime.getRuntime().availableProcessors());
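For reference, a hedged sketch of the embedded-Jetty setup after the fix (Jetty 7/8 API; the port and pool sizes are examples, and our application's actual wiring differs):

Server server = new Server();

// size the worker pool explicitly instead of over-sizing the acceptors
QueuedThreadPool pool = new QueuedThreadPool();
pool.setMinThreads(10);
pool.setMaxThreads(200);
server.setThreadPool(pool);

SelectChannelConnector connector = new SelectChannelConnector();
connector.setPort(12345);
// connector.setAcceptors(2 * Runtime.getRuntime().availableProcessors()); // removed: keep the default
server.addConnector(connector);
server.start();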

Lesson Learned
Be careful when tuning server performance settings; make sure you truly understand what each one means.

Configure ThreadPool in jetty.xml
<Set name="ThreadPool">
  <New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
    <Set name="minThreads">10</Set>
    <Set name="maxThreads">200</Set>
    <Set name="detailedDump">false</Set>
  </New>
</Set>
Jetty Code:
During startup, Jetty starts one selector set per acceptor (128 in this case):
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart()
protected void doStart() throws Exception
{
    _manager.setSelectSets(getAcceptors());
    super.doStart();
}
org.eclipse.jetty.io.nio.SelectorManager.doStart()
protected void doStart() throws Exception
{
    _selectSet = new SelectSet[_selectSets];
    for (int i=0;i<_selectSet.length;i++)
        _selectSet[i]= new SelectSet(i);

    super.doStart();

    // start a thread to Select for each SelectSet
    for (int i=0;i<getSelectSets();i++)
    {
        final int id=i;
        boolean selecting=dispatch(new Runnable()
        {
            public void run()
            {
            // ....
            }

        });
    }
}

Lucene Internal APIs


BytesRef
Represents byte[], as a slice (offset + length) into an existing byte[].
byte bytes[] = new byte[] { (byte)'a', (byte)'b', (byte)'c', (byte)'d' };
BytesRef b = new BytesRef(bytes);
BytesRef b2 = new BytesRef(bytes, 1, 3);
assertEquals("bcd", b2.utf8ToString());

public String utf8ToString() {
  final char[] ref = new char[length];
  final int len = UnicodeUtil.UTF8toUTF16(bytes, offset, length, ref);
  return new String(ref, 0, len);
}

Term
public final class Term implements Comparable<Term> {
  String field;
  BytesRef bytes;
}
A Term represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in.
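As a quick illustration (the field name and text here are made up), a Term pairs a field with the token text, and TermQuery turns it into the most basic kind of query:

Term term = new Term("categories", "books");  // field "categories", text "books"
Query query = new TermQuery(term);            // matches docs whose "categories" field contains "books"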

TermsEnum
TermsEnum is an iterator to seek to a term (seekCeil(BytesRef), seekExact(BytesRef)) or step through terms (next()), to obtain frequency information for the current term (docFreq()), and to get a DocsEnum or DocsAndPositionsEnum for the current term (docs(), docsAndPositions()).

Term enumerations are always ordered by the comparator (getComparator()); each term in the enumeration is greater than the one before it.

The TermsEnum is unpositioned when you first obtain it: you must first successfully call next() or one of the seek methods.

org.apache.lucene.index.TestTermsEnum

DocsEnum
DocsEnum iterates through the documents that contain the current term, together with the term frequency in each document. NOTE: you must first call nextDoc() before using any of the per-doc methods.
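To tie these APIs together, here is a minimal sketch against the Lucene 4.x API used in this article; the index path and the field name "contents" are assumptions for illustration. It walks every term of a field with TermsEnum, then iterates the matching documents with DocsEnum:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class DumpTerms {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(new File("/path/to/index"));
        DirectoryReader reader = DirectoryReader.open(dir)) {
      Terms terms = MultiFields.getTerms(reader, "contents");
      if (terms == null) return;
      TermsEnum termsEnum = terms.iterator(null); // unpositioned until next() or seek
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        System.out.println(term.utf8ToString() + " docFreq=" + termsEnum.docFreq());
        DocsEnum docs = termsEnum.docs(null, null); // liveDocs=null: include all docs
        int doc;
        while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
          System.out.println("  doc=" + doc + " freq=" + docs.freq());
        }
      }
    }
  }
}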

Learning Lucene: Filtering





TermRangeFilter matches only documents containing terms within a specified range of terms.
It’s exactly the same as TermRangeQuery, without scoring.
NumericRangeFilter

FieldCacheRangeFilter
FieldCacheTermsFilter

QueryWrapperFilter turns any Query into a Filter by using only the matching documents
from the Query as the filtered space, discarding the document scores (a short usage sketch follows this list).

PrefixFilter
SpanQueryFilter

CachingWrapperFilter is a decorator over another filter, caching its results to increase
performance when used again.
CachingSpanFilter
FilteredDocIdSet allows you to filter another filter, one document at a time. In order to use it, you
must subclass it and define the match method in your subclass.
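As a minimal usage sketch (Lucene 4.x; the index path, field, and term are assumptions for illustration), here is QueryWrapperFilter restricting a search without affecting scoring:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FilterExample {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(new File("/path/to/index"));
        DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Only docs whose "categories" field contains "books" pass the filter;
      // the filter contributes nothing to the score.
      Filter filter = new QueryWrapperFilter(new TermQuery(new Term("categories", "books")));
      TopDocs hits = searcher.search(new MatchAllDocsQuery(), filter, 10);
      System.out.println("matching docs: " + hits.totalHits);
    }
  }
}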



References
Filtering a Lucene search

Learning Lucene: Collectors


Lucene Built-in Collectors
Check Lucene Javadoc for all Lucene built-in collectors.
Lucene's core collectors are derived from Collector. Likely your application can use one of these classes, or subclass TopDocsCollector, instead of implementing Collector directly.
It is a good start to read the code of Lucene's built-in collectors to learn how to build our own:

TotalHitCountCollector: just counts the number of hits:
public void collect(int doc) { totalHits++; }

PositiveScoresOnlyCollector: only forwards a doc to the wrapped collector if its score is positive:
if (scorer.score() > 0) { c.collect(doc); } // only include the doc if its score > 0

TimeLimitingCollector: uses an external counter and checks the timeout in collect, throwing TimeExceededException if the allowed time has passed:
long time = clock.get();
if (timeout < time) { throw new TimeExceededException(timeout - t0, time - t0, docBase + doc); }
Also, TestTimeLimitingCollector.MyHitCollector is an example of a custom collector.

FilterCollector: a collector that filters out incoming doc ids that are not in the filter; used by grouping.
Using TimeLimitingCollector to Stop Slow Query

public void testTimeLimitingCollector() throws IOException {
  // SimulateSlowCollector is a copy of
  // org.apache.lucene.search.TestTimeLimitingCollector.MyHitCollector
  SimulateSlowCollector slowCollector = new SimulateSlowCollector();
  slowCollector.setSlowDown(1000 * 10);
  final Counter clock = Counter.newCounter(true);

  final int tick = 10;
  TimeLimitingCollector collector = new TimeLimitingCollector(
      slowCollector, clock, tick);
  collector.setBaseline(0);

  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory);) {
    IndexSearcher searcher = new IndexSearcher(indexReader);
    try {
      new Thread() {
        public void run() {
          // the TimeLimitingCollector will abort indexSearcher.search(...)
          // after 10 ticks (about 10 seconds)
          while (clock.get() <= tick) {
            try {
              Thread.sleep(1000);
              clock.addAndGet(1);
            } catch (InterruptedException e) {
              e.printStackTrace();
            }
          }
        }
      }.start();

      searcher.search(new MatchAllDocsQuery(), collector);
      System.out.println(slowCollector.hitCount());
    } catch (TimeExceededException e) {
      // it throws exception here.
      System.out.println("Too much time taken.");
      e.printStackTrace();
    }
  }
}
Write a Custom Collector
public class FacetCountCollector extends Collector {
  private final Map<String, Long> countMap = new HashMap<>();
  private final IndexSearcher searcher;
  // absolute doc base of the current segment, recorded in setNextReader
  private int docBase;

  public FacetCountCollector(IndexSearcher searcher) {
    this.searcher = searcher;
  }

  @Override
  public void collect(int doc) {
    try {
      // collect(doc) receives a segment-relative id; add docBase to load
      // the document through the top-level searcher.
      Document document = searcher.doc(docBase + doc);
      if (document != null) {
        IndexableField[] categories = document.getFields("categories");
        if (categories != null) {
          for (IndexableField category : categories) {
            String value = category.stringValue();
            Long count = countMap.get(value);
            countMap.put(value, count == null ? 1L : count + 1);
          }
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public Map<String, Long> getCountMap() {
    return Collections.unmodifiableMap(countMap);
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    // scores are not needed for counting
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    this.docBase = context.docBase; // record the reader's absolute doc base
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    // true: this collector does not require the matching doc ids to be
    // delivered in order (smallest to largest)
    return true;
  }
}
Using the Custom Collector
public void testFacetCountCollector() throws IOException {
  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory)) {
    IndexSearcher searcher = new IndexSearcher(indexReader);
    FacetCountCollector collector = new FacetCountCollector(searcher);
    searcher.search(new MatchAllDocsQuery(), collector);
    System.out.println(collector.getCountMap());
  }
}
References
Lucene Built-in Collectors
