Solr: Use DocTransformer to Dynamically Generate groupCount and Time Value for Group Docs

Summary
Use a DocTransformer to dynamically generate the groupCount and time value for group docs (type:1) efficiently: there is (almost) no need to run a separate Solr query for each group doc.

The Use Case
There are two types of docs in Solr. One is the child doc, with fields such as type (value 0), groupId, time, etc.
The other is the group doc, with type (value 1); group docs are really just fake placeholder docs.

We use a join query with includeParent=true together with result grouping in flat mode: group.main=true&group.sort=map(type,1,1,-1) asc. This makes sure that groups are sorted by time (the max value in the group) and that the group doc always comes in front of all of its child docs.
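To see why group.sort=map(type,1,1,-1) asc puts the group doc first, here is a minimal plain-Java sketch of the ordering. Solr's map(x,min,max,target) function returns target when x falls within [min,max] and x otherwise, so type 1 becomes -1 while type 0 stays 0. The class and method names below are made up for illustration and are not part of Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: mimics how "map(type,1,1,-1) asc" orders docs inside one group.
class GroupSortSketch {

  // Same contract as Solr's map(x,min,max,target): values in [min,max]
  // become target, everything else is returned unchanged.
  static int map(int x, int min, int max, int target) {
    return (x >= min && x <= max) ? target : x;
  }

  // Sorts the type values of one group ascending by map(type,1,1,-1).
  static List<Integer> sortGroup(List<Integer> types) {
    List<Integer> sorted = new ArrayList<>(types);
    sorted.sort(Comparator.comparingInt(t -> map(t, 1, 1, -1)));
    return sorted;
  }

  public static void main(String[] args) {
    // Two child docs (type 0) and one group doc (type 1), in arbitrary order.
    System.out.println(sortGroup(Arrays.asList(0, 1, 0)));
  }
}
```

The group doc sorts on -1 and the child docs on 0, so in ascending order the group doc leads its group.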

But Solr doesn't return groupCount in flat mode: in grouped mode, Solr returns the count in each group's header, but there is no such thing in flat mode.
So we have to dynamically generate the groupCount and time value for each group (type=1) doc.


Now the last step is to actually generate the groupCount and time value dynamically for each group doc (type:1).

The Solution
After we bump into a group doc, all we need to do is count how many child docs follow it (++lastGroupCount) until we bump into another group doc:
we update groupCount once we have iterated past the last doc in this group, that is, when we reach the next group doc,
we update the time field of the group doc when we iterate over the first doc in this group.

If we don't bump into another group doc before the end of the result list, we need to run a query to get the group count: the returned page may end in the middle of the group, so the accumulated lastGroupCount would be incomplete.

Updating the time value of a group doc is easy: when we hit its first child doc, we copy that child's time into the group doc. Note the boundary condition: if the group doc is the very last doc in the list, no child follows it, so we have to run a query for it.
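The single pass described above can be sketched without any Solr classes. The following is a simplified illustration, not the transformer itself: docs are plain maps, the field names type, time and groupCount follow the post, and the class GroupCountSketch and its methods are hypothetical. Instead of running a query, it only returns the group docs that would still need one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: one pass over a flat list of group docs (type "1")
// followed by their child docs (type "0").
class GroupCountSketch {

  // Fills in groupCount and time on group docs where the list allows it.
  // Returns the group docs whose values are incomplete and would need a query
  // (at most the last group in the list).
  static List<Map<String, Object>> annotate(List<Map<String, Object>> docs) {
    List<Map<String, Object>> needQuery = new ArrayList<>();
    Map<String, Object> lastGroupDoc = null;
    int lastGroupCount = 0;
    for (int i = 0; i < docs.size(); i++) {
      Map<String, Object> doc = docs.get(i);
      boolean isLast = (i == docs.size() - 1);
      if ("1".equals(doc.get("type"))) {
        // A new group starts: the previous group is complete, store its count.
        if (lastGroupDoc != null) {
          lastGroupDoc.put("groupCount", lastGroupCount);
        }
        lastGroupDoc = doc;
        lastGroupCount = 0;
        if (isLast) {
          needQuery.add(doc); // no child follows, so count and time are unknown
        }
      } else if (lastGroupDoc != null) {
        if (lastGroupCount == 0) {
          // First child of this group: copy its time to the group doc.
          lastGroupDoc.put("time", doc.get("time"));
        }
        lastGroupCount++;
        if (isLast) {
          needQuery.add(lastGroupDoc); // the list may end mid-group
        }
      }
    }
    return needQuery;
  }

  // Small helper to build a doc; time may be null for group docs.
  static Map<String, Object> doc(String type, Long time) {
    Map<String, Object> d = new HashMap<>();
    d.put("type", type);
    if (time != null) {
      d.put("time", time);
    }
    return d;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> docs = new ArrayList<>();
    docs.add(doc("1", null));
    docs.add(doc("0", 300L));
    docs.add(doc("0", 200L));
    docs.add(doc("1", null));
    docs.add(doc("0", 100L));
    System.out.println("need query: " + annotate(docs));
    System.out.println("all docs:   " + docs);
  }
}
```

Only the last group can end up in the returned list, which is why a whole page of results needs at most one extra query.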

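import java.io.IOException;
import java.util.Date;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.DateUtil;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformContext;
import org.apache.solr.response.transform.TransformerFactory;

// Note: SolrUtil, runQueryToGetGroupCount and runQueryToGetGroupCountAndTimeField
// are helpers that are not shown in this post.
public class UpdateGroupDocTransfomerFactory extends TransformerFactory {
  @Override
  public DocTransformer create(String field, SolrParams params,
      SolrQueryRequest req) {
    return new UpdateGroupDocTransfomer(req, params);
  }

  /**
   * Keeps state (lastGroupDoc, lastGroupCount) across transform() calls.
   * This should be safe because create() is called while the fl list of each
   * request is parsed, see
   * org.apache.solr.search.SolrReturnFields.parseFieldList(String[],
   * SolrQueryRequest), which builds a new DocTransformers; the instance is
   * therefore not shared between requests.
   */
  private static class UpdateGroupDocTransfomer extends DocTransformer {
    private SolrQueryRequest req;
    // The most recent group doc (type 1) seen so far, and the number of
    // child docs counted after it.
    private SolrDocument lastGroupDoc = null;
    private int lastGroupCount = 0;
    private TransformContext transContext;

    public UpdateGroupDocTransfomer(SolrQueryRequest req, SolrParams params) {
      this.req = req;
    }

    @Override
    public void setContext(TransformContext context) {
      this.transContext = context;
    }

    // Required by DocTransformer; the name used here is only illustrative.
    @Override
    public String getName() {
      return "updateGroupDoc";
    }

    @Override
    public void transform(SolrDocument doc, int docid) throws IOException {
      String type = SolrUtil.getFieldValue(doc, "type");
      if ("1".equals(type)) {
        // A new group starts: the previous group is complete, store its count.
        if (lastGroupDoc != null) {
          lastGroupDoc.setField("[groupCount]", lastGroupCount);
        }
        lastGroupDoc = doc;
        lastGroupCount = 0;

        if (!transContext.iterator.hasNext()) {
          // This group doc is the last doc: no child follows it, so run a
          // query to get both the group count and the time value.
          runQueryToGetGroupCountAndTimeField(doc);
        }
      } else if (lastGroupDoc != null) {
        if (lastGroupCount == 0) {
          // The first child doc in this group: copy its time to the group doc.
          long time = Long.parseLong(SolrUtil.getFieldValue(doc, "time"));
          lastGroupDoc.setField("time",
              DateUtil.getThreadLocalDateFormat().format(new Date(time)));
        }
        if (!transContext.iterator.hasNext()) {
          // This is the last doc: lastGroupCount may be incomplete for
          // lastGroupDoc, so run a query to get the group count.
          runQueryToGetGroupCount(lastGroupDoc);
        } else {
          ++lastGroupCount;
        }
      }
      // else: lastGroupDoc == null and this is a normal doc, nothing to do.
    }
  }
}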
Resources
Solr Join: Return Parent and Child Documents
Use Solr map function query(group.sort=map(type,1,1,-1) ) in group flat mode
Solr: Update other Document in DocTransformer by Writing custom SolrWriter