Learning Lucene: Collectors


Lucene Built-in Collectors
Check the Lucene Javadoc for the full list of built-in collectors.
Lucene's core collectors are derived from Collector. Your application can likely use one of these classes, or subclass TopDocsCollector, instead of implementing Collector directly.
Reading the code of Lucene's built-in collectors is a good way to learn how to build our own:

TotalHitCountCollector: just counts the number of hits.
public void collect(int doc) { totalHits++; }

PositiveScoresOnlyCollector: passes a doc on to the wrapped collector only if its score is positive.
if (scorer.score() > 0) { c.collect(doc); } // only include the doc if its score > 0

TimeLimitingCollector: reads an external counter (the clock) in collect, compares it with the timeout, and throws TimeExceededException if the allowed time has passed:
long time = clock.get();
if (timeout < time) { throw new TimeExceededException(timeout - t0, time - t0, docBase + doc); }
TestTimeLimitingCollector.MyHitCollector is also an example of a custom collector.
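
The timeout check can be modelled in plain Java, without Lucene on the classpath. The sketch below is only an illustration under assumptions: DocCollector, TickLimitingCollector and TimeExceeded are hypothetical stand-ins for Collector, TimeLimitingCollector and TimeExceededException, and an AtomicLong plays the role of the Counter clock.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Plain-Java model of a tick-based time limit. These are NOT Lucene's classes. */
final class TickLimitSketch {

    /** Hypothetical stand-in for a collector callback. */
    interface DocCollector {
        void collect(int doc);
    }

    /** Hypothetical stand-in for TimeExceededException. */
    static final class TimeExceeded extends RuntimeException {
        TimeExceeded(long allowed, long elapsed, int lastDoc) {
            super("allowed " + allowed + " ticks, elapsed " + elapsed
                    + ", stopped at doc " + lastDoc);
        }
    }

    /** Wraps a delegate and checks the shared clock before every collect call. */
    static final class TickLimitingCollector implements DocCollector {
        private final DocCollector delegate;
        private final AtomicLong clock;
        private final long t0;      // clock value when the search started
        private final long timeout; // last clock value that is still allowed

        TickLimitingCollector(DocCollector delegate, AtomicLong clock, long ticksAllowed) {
            this.delegate = delegate;
            this.clock = clock;
            this.t0 = clock.get();
            this.timeout = t0 + ticksAllowed;
        }

        @Override
        public void collect(int doc) {
            long time = clock.get();
            if (timeout < time) {
                throw new TimeExceeded(timeout - t0, time - t0, doc);
            }
            delegate.collect(doc);
        }
    }

    /**
     * Feeds docs 0..docs-1 to a limited collector, advancing the clock by one
     * tick after each doc, and returns how many docs were collected.
     */
    static int run(int docs, long ticksAllowed) {
        AtomicLong clock = new AtomicLong();
        int[] collected = {0};
        DocCollector limited =
                new TickLimitingCollector(doc -> collected[0]++, clock, ticksAllowed);
        try {
            for (int doc = 0; doc < docs; doc++) {
                limited.collect(doc);
                clock.incrementAndGet();
            }
        } catch (TimeExceeded e) {
            // The search stops here; docs collected so far are kept.
        }
        return collected[0];
    }
}
```

Because the check is timeout < time, a doc that arrives exactly at the last allowed tick is still collected: run(10, 3) returns 4 in this model. In a real search the clock is advanced by another thread rather than by the loop itself.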

FilterCollector: a collector that drops incoming doc IDs that are not in the filter. Used by grouping.
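
The filtering idea can be shown with a small plain-Java model. This is a sketch under assumptions, not the FilterCollector source: filtering and demo are hypothetical names, a java.util.BitSet stands in for the filter's doc ID set, and an IntConsumer stands in for the wrapped collector.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntConsumer;

/** Plain-Java model of a filtering collector. These are NOT Lucene's classes. */
final class FilterSketch {

    /** Wraps a delegate so that only doc IDs set in {@code allowed} reach it. */
    static IntConsumer filtering(BitSet allowed, IntConsumer delegate) {
        return doc -> {
            if (allowed.get(doc)) {
                delegate.accept(doc);
            }
        };
    }

    /** Sends docs 0..4 through a filter that allows only docs 1 and 3. */
    static List<Integer> demo() {
        BitSet allowed = new BitSet();
        allowed.set(1);
        allowed.set(3);

        List<Integer> collected = new ArrayList<>();
        IntConsumer collector = filtering(allowed, doc -> collected.add(doc));
        for (int doc = 0; doc < 5; doc++) {
            collector.accept(doc);
        }
        return collected; // [1, 3]
    }
}
```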
Using TimeLimitingCollector to Stop Slow Query

public void testTimeLimitingCollector() throws IOException {
  // SimulateSlowCollector is a copy of
  // org.apache.lucene.search.TestTimeLimitingCollector.MyHitCollector
  SimulateSlowCollector slowCollector = new SimulateSlowCollector();
  slowCollector.setSlowDown(1000 * 10);
  // final, so the anonymous Thread below can read them on Java 7
  final Counter clock = Counter.newCounter(true);

  final int tick = 10;
  TimeLimitingCollector collector = new TimeLimitingCollector(
      slowCollector, clock, tick);
  collector.setBaseline(0);

  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory);) {
    IndexSearcher searcher = new IndexSearcher(indexReader);
    try {
      new Thread() {
        public void run() {
          // will kill the indexSearcher.search(...) after 10
          // ticks (10 seconds)
          while (clock.get() <= tick) {
            try {
              Thread.sleep(1000);
              clock.addAndGet(1);
            } catch (InterruptedException e) {
              e.printStackTrace();
            }
          }
        }
      }.start();

      searcher.search(new MatchAllDocsQuery(), collector);
      System.out.println(slowCollector.hitCount());
    } catch (TimeExceededException e) {
      // it throws exception here.
      System.out.println("Too much time taken.");
      e.printStackTrace();
    }
  }
}
Write a Custom Collector
public class FacetCountCollector extends Collector {
 // category value -> number of matching documents that carry it
 private final Map<String, Long> countMap = new HashMap<>();
 // scorer is not used; docBase turns segment-relative doc IDs into index-wide ones.
 private Scorer scorer;
 private int docBase;
 private final IndexSearcher searcher;
 public FacetCountCollector(IndexSearcher searcher) {
  this.searcher = searcher;
 }
 @Override
 public void collect(int doc) {
  try {
   // doc is relative to the current segment, while IndexSearcher.doc()
   // expects an index-wide ID, so add the doc base first.
   Document document = searcher.doc(docBase + doc);
   for (IndexableField category : document.getFields("categories")) {
    String value = category.stringValue();
    Long count = countMap.get(value);
    countMap.put(value, count == null ? 1L : count + 1);
   }
  } catch (IOException e) {
   e.printStackTrace();
  }
 }

 public Map<String, Long> getCountMap() {
  return Collections.unmodifiableMap(countMap);
 }
 public void setScorer(Scorer scorer) throws IOException {
  this.scorer = scorer;
 }
 public void setNextReader(AtomicReaderContext context) throws IOException {
  this.docBase = context.docBase; // record the reader's absolute doc base
 }
 public boolean acceptsDocsOutOfOrder() {
  // Return true if this collector does not require the matching docIDs to
  // be delivered in int sort order (smallest to largest) to collect.
  return true;
 }
}
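
A collector is called once per segment, and collect receives doc IDs relative to that segment, which is why setNextReader records the doc base. The plain-Java model below shows that rebasing together with the counting step. It is a sketch under assumptions: Segment, count and demo are hypothetical names, and a String[][] indexed by index-wide doc ID stands in for the stored "categories" fields.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of per-segment collection. These are NOT Lucene's classes. */
final class SegmentCountSketch {

    /** Hypothetical segment: its doc base plus the segment-local IDs that matched. */
    static final class Segment {
        final int docBase;
        final int[] matchingLocalDocs;

        Segment(int docBase, int... matchingLocalDocs) {
            this.docBase = docBase;
            this.matchingLocalDocs = matchingLocalDocs;
        }
    }

    /** Counts category values over all matches, looking docs up by index-wide ID. */
    static Map<String, Long> count(List<Segment> segments, String[][] categoriesByGlobalDoc) {
        Map<String, Long> counts = new TreeMap<>();
        for (Segment segment : segments) {
            for (int localDoc : segment.matchingLocalDocs) {
                int globalDoc = segment.docBase + localDoc; // rebase to an index-wide ID
                for (String category : categoriesByGlobalDoc[globalDoc]) {
                    counts.merge(category, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    /** Two segments holding five docs in total; every doc matches. */
    static Map<String, Long> demo() {
        String[][] categories = {
            {"java"}, {"java", "lucene"},    // segment 0: index-wide docs 0-1
            {"solr"}, {"lucene"}, {"java"}   // segment 1: index-wide docs 2-4
        };
        List<Segment> segments = Arrays.asList(
                new Segment(0, 0, 1),
                new Segment(2, 0, 1, 2));
        return count(segments, categories); // {java=3, lucene=2, solr=1}
    }
}
```

Without the docBase addition, the second segment's local IDs 0, 1 and 2 would be read as docs 0, 1 and 2 of the whole index, and the counts would come out wrong.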
Using Custom Collector
public void testFacetCountCollector() throws IOException {
 try (Directory directory = FSDirectory.open(new File(FILE_PATH));
   DirectoryReader indexReader = DirectoryReader.open(directory);) {
  IndexSearcher searcher = new IndexSearcher(indexReader);
  FacetCountCollector collector = new FacetCountCollector(searcher);
  searcher.search(new MatchAllDocsQuery(), collector);
  System.out.println(collector.getCountMap());
 }
}