Learning Lucene: Collectors

Lucene Built-in Collectors
Check Lucene Javadoc for all Lucene built-in collectors.
Lucene's core collectors are derived from Collector. Likely your application can use one of these classes, or subclass TopDocsCollector, instead of implementing Collector directly:
It's a good start to read Lucene's built-in collectors' code to learn how to build our own collectors:  TotalHitCountCollector: Just count the number of hits. public void collect(int doc) { totalHits++; } PositiveScoresOnlyCollector
if (scorer.score() > 0) { c.collect(doc); } // only include the doc if its score >0

TimeLimitingCollector: use an external counter, and compare timeout in collect, throw TimeExceededException if the allowed time has passed: 
long time = clock.get();    if (timeout < time) {throw new TimeExceededException( timeout-t0, time-t0, docBase + doc );} 
Also TestTimeLimitingCollector.MyHitCollector is an example of custom collector.

FilterCollector: A collector that filters incoming doc ids that are not in the filter. Used by Grouping.
Using TimeLimitingCollector to Stop Slow Query

public void testTimeLimitingCollector() throws IOException {
  // SimulateSlowCollector is a copy of
  // org.apache.lucene.search.TestTimeLimitingCollector.MyHitCollector
  SimulateSlowCollector slowCollector = new SimulateSlowCollector();
  slowCollector.setSlowDown(1000 * 10);
  Counter clock = Counter.newCounter(true);

  int tick = 10;
  TimeLimitingCollector collector = new TimeLimitingCollector(
      slowCollector, clock, tick);
  collector.setBaseline(0);

  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory);) {
    IndexSearcher searcher = new IndexSearcher(indexReader);
    try {
      new Thread() {
        public void run() {
          // will kill the indexSearcher.search(...) after 10
          // ticks (10 seconds)
          while (clock.get() <= tick) {
            try {
              Thread.sleep(1000);
              clock.addAndGet(1);
            } catch (InterruptedException e) {
              e.printStackTrace();
            }
          }
        }
      }.start();

      searcher.search(new MatchAllDocsQuery(), collector);
      System.out.println(slowCollector.hitCount());
    } catch (TimeExceededException e) {
      // it throws exception here.
      System.out.println("Too much time taken.");
      e.printStackTrace();
    }
  }
}
Write a Custom Collector
public class FacetCountCollector extends Collector {
 private Map countMap = new HashMap<>();
 // scorer and docBase are actually not used.
 private Scorer scorer;
 private int docBase;
 private IndexSearcher searcher = null;
 public FacetCountCollector(IndexSearcher searcher) {
  this.searcher = searcher;
 }
 @Override
 public void collect(int doc) {
  try {
   Document document = searcher.doc(doc);
   if (document != null) {
    IndexableField[] categoriesDoc = document
      .getFields("categories");

    if (categoriesDoc != null && categoriesDoc.length > 0) {
     for (int i = 0; i < categoriesDoc.length; i++) {
      if (countMap
        .containsKey(categoriesDoc[i].stringValue())) {
       countMap.put(categoriesDoc[i].stringValue(), Long
         .valueOf(countMap.get(categoriesDoc[i]
           .stringValue())) + 1);
      } else {
       countMap.put(categoriesDoc[i].stringValue(), 1L);
      }
     }
    }
   }
  } catch (IOException e) {
   e.printStackTrace();
  }
 }

 public Map getCountMap() {
  return Collections.unmodifiableMap(countMap);
 }
 public void setScorer(Scorer scorer) throws IOException {
  this.scorer = scorer;
 }
 public void setNextReader(AtomicReaderContext context) throws IOException {
  this.docBase = context.docBase;// Record the readers absolute doc base
 }
 public boolean acceptsDocsOutOfOrder() {
  // Return true if this collector does not require the matching docIDs to
  // be delivered in int sort order (smallest to largest) to collect.
  return true;
 }
}
Using Custom Collector
public void testFacetCountCollector() throws IOException {
 try (Directory directory = FSDirectory.open(new File(FILE_PATH));
   DirectoryReader indexReader = DirectoryReader.open(directory);) {
  IndexSearcher searcher = new IndexSearcher(indexReader);
  try {
   FacetCountCollector collector = new FacetCountCollector(
     searcher);
   searcher.search(new MatchAllDocsQuery(), collector);
   System.out.println(collector.getCountMap());
   // printResult(topDocsCollector, searcher);
  } catch (TimeExceededException e) {
   // it throws exception here.
   System.out.println("Too much time taken.");
   e.printStackTrace();
  }
 }
}
References
Lucene Built-in Collectors
Post a Comment

Labels

Java (159) Lucene-Solr (111) Interview (61) All (58) J2SE (53) Algorithm (45) Soft Skills (37) Eclipse (33) Code Example (31) Linux (24) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (16) Defects (14) Text Mining (14) J2EE (13) Network (13) Troubleshooting (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) Problem Solving (9) UIMA (9) html (9) Http Client (8) Maven (8) Security (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) Dynamic Languages (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) How to Interview (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Python (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts