Learning Lucene: Analyzers, Tokenizers, and Filters


Lucene analyzes content before indexing it, and it ships with several built-in Analyzers, Tokenizers, and TokenFilters. It's crucial to choose the ones that match our needs.

Create our own Analyzer
Let's create our own ShingleAnalyzer:
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class ShingleAnalyzer extends Analyzer {
 @Override
 protected TokenStreamComponents createComponents(String fieldName,
   Reader reader) {
  Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_4_9, reader);

  // Notice the order is important: stopFilter -> lowerCaseFilter ->
  // stemFilter -> shingleFilter
  TokenFilter stopFilter = new StopFilter(Version.LUCENE_4_9, tokenizer,
    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
  // or we can create our own stop words:
  // TokenFilter stopFilter = new StopFilter(Version.LUCENE_4_9, tokenizer,
  //   StopFilter.makeStopSet(Version.LUCENE_4_9, "and", "of", "the", "to",
  //     "is", "their", "can", "all"));
  TokenFilter lowerCaseFilter = new LowerCaseFilter(Version.LUCENE_4_9,
    stopFilter);
  TokenFilter stemFilter = new PorterStemFilter(lowerCaseFilter);

  // Notice ShingleFilter doesn't work well with SynonymFilter
  // https://issues.apache.org/jira/browse/LUCENE-3475
  // TokenFilter synonymFilter = new SynonymFilter(stemFilter,
  //   getSynonymMap(), true);
  // ShingleFilter shingleFilter = new ShingleFilter(synonymFilter);
  ShingleFilter shingleFilter = new ShingleFilter(stemFilter);
  shingleFilter.setMinShingleSize(2);
  shingleFilter.setMaxShingleSize(2);
  shingleFilter.setOutputUnigrams(false);

  return new TokenStreamComponents(tokenizer, shingleFilter);
 }

 // We may also build the synonym map from a dictionary or properties file
 private SynonymMap getSynonymMap() {
  SynonymMap.Builder sb = new SynonymMap.Builder(true);
  sb.add(new CharsRef("jump"), new CharsRef("leap"), true);
  sb.add(new CharsRef("lazy"), new CharsRef("sluggardly"), true);
  SynonymMap smap = null;
  try {
   smap = sb.build();
  } catch (IOException e) {
   e.printStackTrace();
  }
  return smap;
 }
}
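To sanity-check the analyzer, we can run it over a sample sentence and print each token it emits. A minimal sketch; the method name, the field name "content", and the sample text are just for illustration:
public static void printShingles() throws IOException {
 Analyzer analyzer = new ShingleAnalyzer();
 TokenStream ts = analyzer.tokenStream("content",
   new StringReader("quick brown fox jumped"));
 CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
 ts.reset();
 while (ts.incrementToken()) {
  // with min = max = 2 and unigrams off, every token is a two-word shingle,
  // e.g. "quick brown", "brown fox", "fox jump" (Porter stems "jumped")
  System.out.println(termAtt.toString());
 }
 ts.end();
 ts.close();
}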
We can also create our own Tokenizer: extend org.apache.lucene.analysis.Tokenizer and implement public boolean incrementToken(). incrementToken() returns false at EOF and true otherwise; a minimal sketch follows the example links below.

Examples:
Anatomy of a Lucene Tokenizer
Lucene.Net – Custom Synonym Analyzer - CodeProject
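For illustration, here is a minimal whitespace-splitting Tokenizer in the Lucene 4.x style. The class name and the splitting rule are arbitrary, and a production tokenizer should also track offsets (OffsetAttribute plus correctOffset()), which this sketch skips:
public class WhitespaceOnlyTokenizer extends Tokenizer {
 private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

 public WhitespaceOnlyTokenizer(Reader input) {
  super(input);
 }

 @Override
 public boolean incrementToken() throws IOException {
  clearAttributes();
  int length = 0;
  int c;
  while ((c = input.read()) != -1) {
   if (Character.isWhitespace(c)) {
    if (length > 0) {
     break; // end of the current token
    }
    continue; // skip leading whitespace
   }
   termAtt.append((char) c);
   length++;
  }
  return length > 0; // false signals EOF
 }
}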
Run Analyzer Separately
We can also reuse Lucene's tokenizing, stemming, and stop-word removal in other NLP tasks.
public void runAnalyzer() {
 Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
 String text = "the red fox jumped over the lazy dog";
 Reader reader = new StringReader(text);
 TokenStream ts = null;
 try {
  ts = analyzer.tokenStream(null, reader);

  // define and reuse Attributes outside of the while loop
  CharTermAttribute charTermAttr = ts
    .getAttribute(CharTermAttribute.class);
  OffsetAttribute offsetAtt = ts.getAttribute(OffsetAttribute.class);
  PositionIncrementAttribute posAtt = ts
    .getAttribute(PositionIncrementAttribute.class);
  TypeAttribute typeAtt = ts.getAttribute(TypeAttribute.class);

  ts.reset();
  while (ts.incrementToken()) {
   System.out.println(charTermAttr.toString() + ", offset:"
     + offsetAtt.startOffset() + "-" + offsetAtt.endOffset()
     + ", position:" + posAtt.getPositionIncrement()
     + ", type:" + typeAtt.type());
  }
  ts.end();
 } catch (IOException e) {
  e.printStackTrace();
 } finally {
  // close the TokenStream so the Analyzer can be reused
  if (ts != null) {
   try {
    ts.close();
   } catch (IOException e) {
    // ignore
   }
  }
 }
}
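For the sample sentence, this should print something like the following; StandardAnalyzer drops stop words such as "the", which is why some position increments are 2:
red, offset:4-7, position:2, type:<ALPHANUM>
fox, offset:8-11, position:1, type:<ALPHANUM>
jumped, offset:12-18, position:1, type:<ALPHANUM>
over, offset:19-23, position:1, type:<ALPHANUM>
lazy, offset:28-32, position:2, type:<ALPHANUM>
dog, offset:33-36, position:1, type:<ALPHANUM>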
Different analyzers for each field
In the previous post, we used the same Analyzer for every field:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);

But in practice we often need a different analyzer per field: for example, no tokenization for a keyword field, or lowercasing only for certain fields.

We can use PerFieldAnalyzerWrapper to set a different analyzer for each field. In the following example, StandardAnalyzer will be used for all fields except "firstname" and "lastname", for which KeywordAnalyzer will be used.
Map<String,Analyzer> analyzerPerField = new HashMap<>();
analyzerPerField.put("firstname", new KeywordAnalyzer());
analyzerPerField.put("lastname", new KeywordAnalyzer());

PerFieldAnalyzerWrapper aWrapper =
 new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_4_9), analyzerPerField);
IndexWriterConfig iwConfig = new IndexWriterConfig(Version.LUCENE_4_9, aWrapper);
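To confirm the wrapper routes analysis by field name, we can tokenize the same text under different field names. A small sketch; the method name and the "bio" field are just for illustration:
public static void showPerFieldAnalysis(Analyzer aWrapper) throws IOException {
 // "firstname" uses KeywordAnalyzer -> one token: "Jean Luc"
 // "bio" falls back to StandardAnalyzer -> two tokens: "jean", "luc"
 for (String field : new String[] { "firstname", "bio" }) {
  TokenStream ts = aWrapper.tokenStream(field, new StringReader("Jean Luc"));
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
   System.out.println(field + ": " + termAtt.toString());
  }
  ts.end();
  ts.close();
 }
}
Remember to use the same wrapper at query time so query text is analyzed the same way as the indexed text.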

Lucene Source Code
Tokenizer and TokenFilter both extend TokenStream.
A TokenStream enumerates a sequence of tokens, either from the Fields of a Document or from query text; its main methods are incrementToken(), end(), close(), and reset().

TokenStream extends AttributeSource, which holds two maps: Map<Class<? extends Attribute>, AttributeImpl> attributes and Map<Class<? extends AttributeImpl>, AttributeImpl> attributeImpls.
Attribute is an interface; AttributeImpl implements Attribute.
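One practical consequence of those maps: attribute instances are cached per stream, so asking the same TokenStream for the same attribute interface twice returns one shared instance. A small sketch (the field name and sample text are arbitrary):
public static void showSharedAttributes(Analyzer analyzer) throws IOException {
 TokenStream ts = analyzer.tokenStream("content", new StringReader("some text"));
 // addAttribute returns the cached AttributeImpl if one is already registered
 CharTermAttribute first = ts.addAttribute(CharTermAttribute.class);
 CharTermAttribute second = ts.addAttribute(CharTermAttribute.class);
 System.out.println(first == second); // true: one shared instance per stream
 ts.close();
}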
 
References
Lucene Analyzer
Anatomy of a Lucene Tokenizer
Lucene.Net – Custom Synonym Analyzer - CodeProject
