Solr: Update Other Documents in DocTransformer by Writing a Custom SolrWriter

Summary
Write our own XMLWriter so that a DocTransformer can update other SolrDocuments, or even delete the current document.

The Use Case
There are two types of docs in Solr. One is the child doc, with fields such as type (value 0), groupId, time, etc.
The other is the group doc, with type (value 1); these are really just fake docs.

We use a join query with includeParent=true, and make sure that groups are sorted by time (the max value in the group) and that the group doc always comes in front of all its child docs.

But Solr doesn't return groupCount in flat mode: in grouped mode Solr returns groupCount in the group header, but there is no such thing in flat mode.

So we have to dynamically generate the groupCount and time value for each group (type=1) doc.

I tried several solutions:
In DocTransformer, when the current doc is a group doc (type=1), run a query to get the number of docs in this group:
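// Count the docs that match baseQuery and belong to this group (the groupId term acts as the filter).
int groupCount = SolrPluginUtils.numDocs(req.getSearcher(), baseQuery, new TermQuery(new Term("groupId", groupId)));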

Later I optimized it by pre-computing a baseDocSet that matches q and fq:
DocSet baseDocSet = req.getSearcher().getDocSet(baseQuery);
int groupCount = req.getSearcher().getDocSet(new TermQuery(new Term("groupId", groupId)), baseDocSet).size();

The Solution
But neither seemed good to me: since all child (type==0) docs follow their group doc (type=1), there should be no need to run a Solr query at all. We can easily calculate the groupCount and its mtm value from the docs that are already in the result.
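Here is a minimal sketch of that one-pass calculation. It uses plain maps as stand-ins for SolrDocument, and the GroupStats class and its method names are made up for illustration, not part of Solr. It also assumes the time value is a long and that every child of a group is in the returned list; with paging, the count would only cover the children on the current page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps play the role of SolrDocument.
final class GroupStats {

    // Walks the flat, ordered list once. Each group doc (type=1) is followed
    // by its child docs (type=0); every child updates the group doc before it.
    static void annotate(List<Map<String, Object>> docs) {
        Map<String, Object> currentGroup = null;
        for (Map<String, Object> doc : docs) {
            int type = (Integer) doc.get("type");
            if (type == 1) {
                currentGroup = doc;
                doc.put("groupCount", 0);
            } else if (currentGroup != null
                    && currentGroup.get("groupId").equals(doc.get("groupId"))) {
                currentGroup.put("groupCount", (Integer) currentGroup.get("groupCount") + 1);
                long time = (Long) doc.get("time");
                Long max = (Long) currentGroup.get("time");
                if (max == null || time > max) {
                    currentGroup.put("time", time); // keep the max child time
                }
            }
        }
    }

    static Map<String, Object> doc(int type, String groupId, Long time) {
        Map<String, Object> d = new HashMap<>();
        d.put("type", type);
        d.put("groupId", groupId);
        if (time != null) {
            d.put("time", time);
        }
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc(1, "g1", null));
        docs.add(doc(0, "g1", 10L));
        docs.add(doc(0, "g1", 30L));
        docs.add(doc(1, "g2", null));
        docs.add(doc(0, "g2", 20L));
        annotate(docs);
        // Group docs now carry the count and max time of their children.
        System.out.println(docs.get(0).get("groupCount") + " " + docs.get(0).get("time"));
        System.out.println(docs.get(3).get("groupCount") + " " + docs.get(3).get("time"));
    }
}
```

Note that the group doc is changed after it was first seen, which is exactly what a writer that emits each doc immediately cannot support.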

But the problem is that a DocTransformer can only change the current SolrDocument, because each doc is written to the output right after it is transformed:
org.apache.solr.response.TextResponseWriter.writeDocuments(String, ResultContext, ReturnFields)
for (int i=0; i<sz; i++) {
 if( transformer != null ) {
  transformer.transform( sdoc, id);
 }
 // SolrWriter writes the doc to output stream
 writeSolrDocument( null, sdoc, returnFields, i );
}
writeEndDocumentList();

One way is to change Solr's code directly to support this.
We can change the code above as follows:
// Opt-in request parameter: buffer all docs before writing any of them.
boolean cachMode = req.getParams().getBool("cachMode", false);
SolrDocument[] cachedDocs = cachMode ? new SolrDocument[sz] : null;
for (int i = 0; i < sz; i++) {
  SolrDocument sdoc = toSolrDocument(doc);
  if (transformer != null) {
    transformer.transform(sdoc, id);
  }
  if (cachMode) {
    // Hold the doc so that later transform() calls can still change it.
    cachedDocs[i] = sdoc;
  } else {
    writeSolrDocument(null, sdoc, returnFields, i);
  }
}
if (transformer != null) {
  transformer.setContext(null);
}
if (cachMode) {
  // All docs have been transformed; now write them in their original order.
  for (int i = 0; i < sz; i++) {
    writeSolrDocument(null, cachedDocs[i], returnFields, i);
  }
}
writeEndDocumentList();


Or we can write our own writer, so we don't have to change Solr's code.

Custom Solr Writer: CachedXMLWriter
The implementation is simple: we just cache each SolrDocument in writeSolrDocument and write them all in writeEndDocumentList.
We can also allow a DocTransformer to delete a doc by adding one special field, "_del_": if this field is set, we will not write the doc to the output stream.
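import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.response.XMLWriter;
import org.apache.solr.search.ReturnFields;

public class CachedXMLWriter extends XMLWriter {
  static class SolrDocumentHolder {
    SolrDocument doc;
    String name;
    int idx;
  }

  private final List<SolrDocumentHolder> holders = new ArrayList<SolrDocumentHolder>();

  // Constructor used by CachedXMLResponseWriter.
  public CachedXMLWriter(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp) {
    super(writer, req, rsp);
  }

  @Override
  public void writeSolrDocument(String name, SolrDocument doc,
      ReturnFields returnFields, int idx) throws IOException {
    // Skip docs that a DocTransformer marked as deleted; buffer the rest.
    Object del = doc.getFieldValue("_del_");
    if (del == null) {
      SolrDocumentHolder holder = new SolrDocumentHolder();
      holder.doc = doc;
      holder.name = name;
      holder.idx = idx;
      holders.add(holder);
    }
  }

  @Override
  public void writeEndDocumentList() throws IOException {
    // All docs have been transformed by now, so write the buffered ones.
    for (SolrDocumentHolder holder : holders) {
      super.writeSolrDocument(holder.name, holder.doc, returnFields, holder.idx);
    }
    holders.clear();
    super.writeEndDocumentList();
  }
}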
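To see why the buffering matters, here is a small stand-alone sketch of the same idea without any Solr classes. BufferingWriterDemo and its methods are hypothetical names, and maps again stand in for SolrDocument. One deliberate difference from the class above: this sketch checks "_del_" at flush time rather than when the doc is buffered, which is only an assumption about how one might extend it so a later transform call could also hide an earlier doc.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CachedXMLWriter; maps play the role of SolrDocument.
final class BufferingWriterDemo {
    private final List<Map<String, Object>> buffered = new ArrayList<>();

    // Plays the role of writeSolrDocument: remember the doc, write nothing yet.
    void writeDocument(Map<String, Object> doc) {
        buffered.add(doc);
    }

    // Plays the role of writeEndDocumentList: render every doc not marked "_del_".
    List<String> endDocumentList() {
        List<String> lines = new ArrayList<>();
        for (Map<String, Object> doc : buffered) {
            if (doc.get("_del_") != null) {
                continue; // assumption: the marker is checked at flush time
            }
            Object count = doc.get("groupCount");
            lines.add(doc.get("id") + (count == null ? "" : " groupCount=" + count));
        }
        buffered.clear();
        return lines;
    }

    static Map<String, Object> doc(String id) {
        Map<String, Object> d = new HashMap<>();
        d.put("id", id);
        return d;
    }

    public static void main(String[] args) {
        BufferingWriterDemo writer = new BufferingWriterDemo();

        Map<String, Object> group = doc("g1");
        group.put("groupCount", 0);
        writer.writeDocument(group);

        Map<String, Object> child = doc("c1");
        group.put("groupCount", 1);   // a later transform updates the earlier doc
        writer.writeDocument(child);

        Map<String, Object> hidden = doc("c2");
        hidden.put("_del_", true);    // marked for deletion, never rendered
        writer.writeDocument(hidden);

        System.out.println(writer.endDocumentList());
    }
}
```

Because nothing is rendered until the end, the change made to the group doc while handling its child still shows up in the output.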
CachedXMLResponseWriter
Here is the companion class:
public class CachedXMLResponseWriter implements QueryResponseWriter {
  // Required by QueryResponseWriter; this writer needs no configuration.
  public void init(NamedList args) {
  }

  public void write(Writer writer, SolrQueryRequest req, SolrQueryResponse rsp)
      throws IOException {
    CachedXMLWriter w = new CachedXMLWriter(writer, req, rsp);
    try {
      w.writeResponse();
    } finally {
      w.close();
    }
  }

  public String getContentType(SolrQueryRequest request,
      SolrQueryResponse response) {
    return CONTENT_TYPE_XML_UTF8;
  }
}
Finally, declare the writer in solrconfig.xml:
<queryResponseWriter name="cachexml" class="solr.CachedXMLResponseWriter" startup="lazy"/>
Now we can use it: wt=cachexml&fl=f1,[groupCount] 

To hide the implementation from the client side, we can encapsulate the logic in our request handler: set wt=cachexml if the [groupCount] transformer is present in fl.
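A rough sketch of that check, written against a plain Map instead of Solr's request parameters so it can run on its own. WriterSelector and withCachedWriter are made-up names; in a real request handler the same test would presumably be applied to the request's params before the response is written.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a Map stands in for the request parameters.
final class WriterSelector {
    static final String TRANSFORMER = "[groupCount]";

    // Returns a copy of the params with wt=cachexml when fl uses [groupCount].
    static Map<String, String> withCachedWriter(Map<String, String> params) {
        Map<String, String> result = new HashMap<>(params);
        String fl = params.get("fl");
        if (fl != null && fl.contains(TRANSFORMER)) {
            result.put("wt", "cachexml");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("q", "*:*");
        params.put("fl", "f1,[groupCount]");
        System.out.println(withCachedWriter(params).get("wt"));
    }
}
```

Requests that don't use the transformer keep whatever wt they asked for, so the extra buffering is only paid for when it is needed.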

Miscs:
Transforming Result Documents
[value] - ValueAugmenterFactory
greeting:[value v='hello']
fl=id,my_number:[value v=42 t=int],my_string:[value v=42]
newname:oldname RenameFieldTransformer

[explain] doesn't work with grouping:
if (grouping.mainResult != null) {
  ResultContext ctx = new ResultContext();
  ctx.query = null; // TODO? add the query?
}
[child] - ChildDocTransformerFactory
[shard] - ShardAugmenterFactory

public abstract class TransformerWithContext extends DocTransformer

Resources
Solr Join: Return Parent and Child Documents
Use Solr map function query(group.sort=map(type,1,1,-1) ) in group flat mode
Solr: Use DocTransformer to dynamically Generate groupCount and time value for group doc
SOLR-7097: Update other Document in DocTransformer