Programmer: Lifelong Learning: Learning Lucene: Collectors

Lucene Built-in Collectors
See the Lucene Javadoc for the full list of built-in collectors.

Lucene's core collectors are derived from Collector. Likely your application can use one of these classes, or subclass TopDocsCollector, instead of implementing Collector directly:

TopDocsCollector is an abstract base class that assumes you will retrieve the top N docs, according to some criteria, after collection is done.
TopScoreDocCollector is a concrete subclass of TopDocsCollector and sorts according to score + docID. This is used internally by the IndexSearcher search methods that do not take an explicit Sort. It is likely the most frequently used collector.
TopFieldCollector subclasses TopDocsCollector and sorts according to a specified Sort object (sort by field). This is used internally by the IndexSearcher search methods that take an explicit Sort.
TimeLimitingCollector wraps any other Collector and aborts the search if it has taken too much time.
PositiveScoresOnlyCollector wraps any other Collector and prevents collection of hits whose score is <= 0.0.
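
To make "top N docs sorted by score + docID" concrete, here is a minimal sketch that does not use Lucene at all. Hit and TopNSketch are hypothetical stand-ins (not Lucene classes), and the bounded priority queue is only one plausible way to keep the best N hits; Lucene's own implementation differs in detail.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a scored hit: a doc ID plus its score.
final class Hit {
    final int doc;
    final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
}

// Hypothetical, Lucene-free sketch of a top-N collector.
final class TopNSketch {
    // "Worst first" ordering: lowest score at the head of the queue; on equal
    // scores the higher doc ID counts as worse, so lower doc IDs win ties.
    private static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    private final int n;
    private final PriorityQueue<Hit> queue = new PriorityQueue<>(WORST_FIRST);
    private int totalHits;

    TopNSketch(int n) { this.n = n; }

    // Called once per matching document.
    void collect(int doc, float score) {
        totalHits++;
        queue.add(new Hit(doc, score));
        if (queue.size() > n) {
            queue.poll(); // evict the current worst hit
        }
    }

    int totalHits() { return totalHits; }

    // Returns the retained hits, best first.
    List<Hit> topDocs() {
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(WORST_FIRST.reversed());
        return hits;
    }
}
```

The results are only read after collection is done, which matches the contract described for TopDocsCollector above.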

Reading the source of Lucene's built-in collectors is a good way to learn how to build your own.

TotalHitCountCollector: just counts the number of hits.
public void collect(int doc) { totalHits++; }

PositiveScoresOnlyCollector: only passes a doc on to the wrapped collector if its score is > 0.
if (scorer.score() > 0) { c.collect(doc); }

TimeLimitingCollector: uses an external counter as a clock, compares it against the timeout in collect(), and throws TimeExceededException if the allowed time has passed:
long time = clock.get();
if (timeout < time) {
  throw new TimeExceededException(timeout - t0, time - t0, docBase + doc);
}
TestTimeLimitingCollector.MyHitCollector is also an example of a custom collector.
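
The wrap-and-check pattern behind TimeLimitingCollector can be sketched without Lucene. Everything below is a hypothetical stand-in: DocCollector is not Lucene's Collector, TimeBudgetExceeded is not TimeExceededException, and an AtomicLong plays the role of the Counter clock. It only illustrates the idea of checking an external clock on every collect call.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a collector; not Lucene's Collector API.
interface DocCollector {
    void collect(int doc);
}

// Hypothetical unchecked exception, standing in for TimeExceededException.
final class TimeBudgetExceeded extends RuntimeException {
    final int lastDoc;
    TimeBudgetExceeded(long allowed, long elapsed, int lastDoc) {
        super("Elapsed " + elapsed + " ticks, allowed " + allowed
                + ", stopped at doc " + lastDoc);
        this.lastDoc = lastDoc;
    }
}

// Wraps another collector and aborts once the external clock passes the budget.
final class TimeLimitedSketch implements DocCollector {
    private final DocCollector delegate;
    private final AtomicLong clock; // advanced by some other thread
    private final long t0;          // clock value when collection started
    private final long timeout;     // absolute tick after which we abort

    TimeLimitedSketch(DocCollector delegate, AtomicLong clock, long ticksAllowed) {
        this.delegate = delegate;
        this.clock = clock;
        this.t0 = clock.get();
        this.timeout = t0 + ticksAllowed;
    }

    @Override
    public void collect(int doc) {
        long time = clock.get();
        if (timeout < time) {
            throw new TimeBudgetExceeded(timeout - t0, time - t0, doc);
        }
        delegate.collect(doc);
    }
}
```

Because the check happens only inside collect, a search that matches no further documents is never interrupted by this pattern; the exception can only surface on the next hit.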

FilterCollector: a collector that drops incoming doc IDs that are not in the filter. Used by Grouping.

Using TimeLimitingCollector to Stop Slow Query

public void testTimeLimitingCollector() throws IOException {
  // SimulateSlowCollector is a copy of
  // org.apache.lucene.search.TestTimeLimitingCollector.MyHitCollector
  SimulateSlowCollector slowCollector = new SimulateSlowCollector();
  slowCollector.setSlowDown(1000 * 10);
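  // Thread-safe counter used as the clock; final so that the anonymous
  // clock thread below can capture it on Java 7 as well.
  final Counter clock = Counter.newCounter(true);

  // Number of clock ticks the search is allowed to run.
  final int tick = 10;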
  TimeLimitingCollector collector = new TimeLimitingCollector(
      slowCollector, clock, tick);
  collector.setBaseline(0);

  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory);) {
    IndexSearcher searcher = new IndexSearcher(indexReader);
    try {
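      // Clock thread: advances the counter once per second. Once the
      // counter exceeds the tick budget (about 10 seconds), the
      // TimeLimitingCollector aborts searcher.search(...).
      new Thread() {
        @Override
        public void run() {
          while (clock.get() <= tick) {
            try {
              Thread.sleep(1000);
              clock.addAndGet(1);
            } catch (InterruptedException e) {
              // Restore the interrupt flag and stop ticking.
              Thread.currentThread().interrupt();
              return;
            }
          }
        }
      }.start();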

      searcher.search(new MatchAllDocsQuery(), collector);
      System.out.println(slowCollector.hitCount());
    } catch (TimeExceededException e) {
      // it throws exception here.
      System.out.println("Too much time taken.");
      e.printStackTrace();
    }
  }
}

Write a Custom Collector

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Scorer;

public class FacetCountCollector extends Collector {
 // Category value -> number of matching documents carrying that value.
 private final Map<String, Long> countMap = new HashMap<>();
 // Doc base of the segment currently being collected.
 private int docBase;
 private final IndexSearcher searcher;

 public FacetCountCollector(IndexSearcher searcher) {
  this.searcher = searcher;
 }

 @Override
 public void collect(int doc) throws IOException {
  // doc is relative to the current segment; searcher.doc() expects an
  // index-wide ID, so add the segment's docBase.
  Document document = searcher.doc(docBase + doc);
  for (IndexableField category : document.getFields("categories")) {
   String value = category.stringValue();
   Long count = countMap.get(value);
   countMap.put(value, count == null ? 1L : count + 1);
  }
 }

 public Map<String, Long> getCountMap() {
  return Collections.unmodifiableMap(countMap);
 }

 @Override
 public void setScorer(Scorer scorer) throws IOException {
  // Scores are not needed for counting, so the scorer is ignored.
 }

 @Override
 public void setNextReader(AtomicReaderContext context) throws IOException {
  this.docBase = context.docBase; // Record the reader's absolute doc base
 }

 @Override
 public boolean acceptsDocsOutOfOrder() {
  // Return true if this collector does not require the matching docIDs to
  // be delivered in int sort order (smallest to largest) to collect.
  return true;
 }
}
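
The counting step inside collect is independent of Lucene and can be written more compactly. The sketch below assumes Java 8 or later for Map.merge; CategoryCounterSketch and its add method are hypothetical names used only for illustration, with a plain list of strings standing in for the values of the "categories" field.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Lucene-free sketch of the per-category counting logic.
final class CategoryCounterSketch {
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per matching document with that document's category values.
    void add(List<String> categories) {
        for (String category : categories) {
            // Insert 1 for a new key, otherwise add 1 to the existing count.
            counts.merge(category, 1L, Long::sum);
        }
    }

    // Read-only view of the counts collected so far.
    Map<String, Long> counts() {
        return Collections.unmodifiableMap(counts);
    }
}
```

merge replaces the containsKey/get/put sequence with a single call, which avoids repeated lookups of the same key.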

Using Custom Collector

public void testFacetCountCollector() throws IOException {
 try (Directory directory = FSDirectory.open(new File(FILE_PATH));
   DirectoryReader indexReader = DirectoryReader.open(directory)) {
  IndexSearcher searcher = new IndexSearcher(indexReader);
  FacetCountCollector collector = new FacetCountCollector(searcher);
  // Match every document so that all category values are counted.
  searcher.search(new MatchAllDocsQuery(), collector);
  // Prints each category value with the number of documents carrying it.
  System.out.println(collector.getCountMap());
 }
}