Solr: Using docid within the same Searcher to boost performance

We all know that a docid in Lucene/Solr is volatile: it may change when we delete some docs and Solr merges segments.

For example:
We add 3 docs: doc0, doc1, doc2
http://localhost:12345/solr/update?stream.body=<add><doc><field name="id">doc0</field></doc><doc><field name="id">doc1</field></doc><doc><field name="id">doc2</field></doc></add>&commit=true
Their docids would be something like: doc0:0, doc1:1, doc2:2

Then we delete doc0 and commit with expungeDeletes=true (deleted docs are also purged whenever segments are merged):
http://localhost:12345/solr/update?stream.body=<delete><query>id:doc0</query></delete>&commit=true&expungeDeletes=true

Now their docids have changed: doc1:0, doc2:1

But in the following request handler, can the docid change between the two queries?
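// Imports assume Lucene/Solr 4.x.
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.SolrIndexSearcher;

public class TestDocIdHandler extends RequestHandlerBase {
  // getDescription() and getSource(), required by RequestHandlerBase in Solr 4.x, are omitted for brevity.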
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    int docid = getLookupDocId(req.getSearcher(), "doc1");
    // Pause here (e.g. at a breakpoint) and delete doc0:
    // http://localhost:12345/solr/update?stream.body=<delete><query>id:doc0</query></delete>&commit=true&expungeDeletes=true
    // Then look up the same doc again to check whether its docid has changed.
    int newdocid = getLookupDocId(req.getSearcher(), "doc1");
    
    System.out.println(docid == newdocid);
  }
  
  private int getLookupDocId(SolrIndexSearcher searcher, String lookup)
      throws IOException {
    // The docs above were indexed with the unique key field "id".
    TermQuery tq = new TermQuery(new Term("id", lookup));
    TopDocs hits = searcher.search(tq, 1);
    ScoreDoc[] docs = hits.scoreDocs;
    if (docs.length == 1) {
      return docs[0].doc;
    }
    return -1; // not found
  }
}

The answer is no.
The docid stays the same, because both lookups use the same SolrIndexSearcher. A SolrIndexSearcher holds a snapshot of the index (data) as of a specific point in time; it does not reflect the changes (adds, deletes, etc.) we make until a new searcher is opened.
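
To make the snapshot idea concrete, here is a toy model in plain Java. It is only a sketch under loud assumptions: Snapshot and ToyIndex are made-up names, not Lucene or Solr classes, and a docid is modeled simply as a position in a list. It shows the one property we rely on: a snapshot that is already held keeps its numbering, while a freshly obtained one sees the renumbered docs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model only: none of these classes are Lucene or Solr APIs. */
class SnapshotDemo {

  /** An immutable point-in-time view; a doc's "docid" is its list position. */
  static final class Snapshot {
    private final List<String> ids;

    Snapshot(List<String> ids) {
      this.ids = Collections.unmodifiableList(new ArrayList<>(ids));
    }

    /** Returns the position of the id in this snapshot, or -1 when absent. */
    int docId(String id) {
      return ids.indexOf(id);
    }
  }

  /** Stands in for the index: every change publishes a new snapshot. */
  static final class ToyIndex {
    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot(Collections.<String>emptyList()));

    /** Hands out the current snapshot, like a request obtaining its searcher. */
    Snapshot searcher() {
      return current.get();
    }

    void add(String id) {
      List<String> next = new ArrayList<>(current.get().ids);
      next.add(id);
      current.set(new Snapshot(next));
    }

    /** Delete plus merge: surviving docs are renumbered in the NEW snapshot only. */
    void deleteAndMerge(String id) {
      List<String> next = new ArrayList<>(current.get().ids);
      next.remove(id);
      current.set(new Snapshot(next));
    }
  }

  public static void main(String[] args) {
    ToyIndex index = new ToyIndex();
    index.add("doc0");
    index.add("doc1");
    index.add("doc2");

    Snapshot held = index.searcher();   // what one request would hold
    int before = held.docId("doc1");
    index.deleteAndMerge("doc0");       // happens while the request is running
    int after = held.docId("doc1");     // same snapshot, same docid
    int fresh = index.searcher().docId("doc1"); // new snapshot, renumbered

    System.out.println(before + " " + after + " " + fresh); // prints "1 1 0"
  }
}
```

The real SolrIndexSearcher is of course far more involved, but the design point is the same: as long as a handler keeps using the searcher it got from the request, docids can be compared safely.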

Next, we demonstrate how we can use this property in our code.

Practical Example: Use docid to boost performance
The Use Case:
Given some query (q and fq, possibly with join or group), we want to know the position of one doc in the results, given its id.

We can first get the docid of this document, then run the query:
SolrIndexSearcher.search().scoreDocs
then iterate over all the docids until we find it.
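// Imports assume Lucene/Solr 4.x and Guava.
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Objects;
import java.util.Set;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.ChainedFilter;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.GroupParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.search.ExtendedDismaxQParserPlugin;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.util.SolrPluginUtils;

import com.google.common.base.Preconditions;

public class GetDocPositionReqHandler extends RequestHandlerBase {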
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    SolrParams params = req.getParams();
    String lookup = Preconditions.checkNotNull(params.get("lookup"));
    
    SolrIndexSearcher searcher = req.getSearcher();
    int lookupId = getLookupDocId(searcher, lookup);
    
    if (lookupId != -1) {
      boolean isGroup = params.getBool(GroupParams.GROUP, false);
      if (!isGroup) {
        nonGroupImpl(req, rsp, lookupId);
      } else {
        // Grouped request: the position is computed after regrouping the results.
        groupImpl(req, rsp, params, lookupId);
      }
    }
  }
    
  private void nonGroupImpl(SolrQueryRequest req, SolrQueryResponse rsp,
      int lookupId) throws SyntaxError, IOException {
    int lookupPos = -1;
    ScoreDoc[] docs = runReqQuery(req);
    int newPos = 0;
    for (ScoreDoc doc : docs) {
      newPos++;
      if (doc.doc == lookupId) {
        lookupPos = newPos;
        break;
      }
    }
    
    rsp.add("newPos", lookupPos);
  }
  private void groupImpl(SolrQueryRequest req, SolrQueryResponse rsp,
      SolrParams params, int lookupId) throws SyntaxError, IOException {
    ScoreDoc[] docs = runReqQuery(req);
    // Split the results into groups, keeping groups in order of first appearance.
    // In our case, group.field is a string field and group.sort is a long field.
    Map<String,Set<Integer>> groupMap = new LinkedHashMap<String,Set<Integer>>();
    String lookupGroup = null;
    
    String groupField = Objects.requireNonNull(
        params.get(GroupParams.GROUP_FIELD),
        "No group field in the request string.");
    BinaryDocValues groupCache = FieldCache.DEFAULT.getTerms(req.getSearcher()
        .getAtomicReader(), groupField);
    for (ScoreDoc doc : docs) {
      int docid = doc.doc;
      BytesRef result = new BytesRef();
      groupCache.get(docid, result);
      String groupValue = result.utf8ToString();
      Set<Integer> groupItems = groupMap.get(groupValue);
      if (groupItems == null) {
        groupItems = new LinkedHashSet<Integer>();
        groupMap.put(groupValue, groupItems);
      }
      groupItems.add(docid);
      if (doc.doc == lookupId) {
        lookupGroup = groupValue;
      }
    }
    int lookupPos = -1;
    if (lookupGroup != null) {
      // then iterate the map to get the position
      int newPos = 0;
      Iterator<Entry<String,Set<Integer>>> it = groupMap.entrySet().iterator();
      
      outer: while (it.hasNext()) {
        Entry<String,Set<Integer>> entry = it.next();
        String groupName = entry.getKey();
        if (lookupGroup.equals(groupName)) {
          Set<Integer> items = entry.getValue();
          for (Integer item : items) {
            newPos++;
            if (item == lookupId) {
              lookupPos = newPos;
              break outer;
            }
          }
        } else {
          newPos += entry.getValue().size();
        }
      }
    }
    rsp.add("newPos", lookupPos);
  }
  
  private ScoreDoc[] runReqQuery(SolrQueryRequest req) throws SyntaxError,
      IOException {
    SolrParams params = req.getParams();
    SolrIndexSearcher searcher = req.getSearcher();
    String qstr = params.get(CommonParams.Q);
    
    QParser parser = QParser.getParser(qstr, ExtendedDismaxQParserPlugin.NAME,
        req);
    Query newQuery = parser.parse();
    Sort sort = SolrPluginUtils.getSort(req);
    
    String[] fqs = params.getParams(CommonParams.FQ);
    ChainedFilter chainedFilter = null;
    if (fqs != null) {
      Filter[] filters = new Filter[fqs.length];
      int i = 0;
      for (String fq : fqs) {
        filters[i++] = new QueryWrapperFilter(QParser.getParser(fq,
            ExtendedDismaxQParserPlugin.NAME, req).parse());
      }
      // Every fq must match, so chain with AND (ChainedFilter defaults to OR).
      chainedFilter = new ChainedFilter(filters, ChainedFilter.AND);
    }
    TopDocs topDocs;
    if (sort != null) {
      topDocs = searcher.search(newQuery, chainedFilter, searcher.maxDoc(),
          sort);
    } else {
      topDocs = searcher.search(newQuery, chainedFilter, searcher.maxDoc());
    }
    ScoreDoc[] docs = topDocs.scoreDocs;
    return docs;
  }
  
  private int getLookupDocId(SolrIndexSearcher searcher, String lookup)
      throws IOException {
    TermQuery tq = new TermQuery(new Term("id", lookup));
    TopDocs hits = searcher.search(tq, 1);
    ScoreDoc[] docs = hits.scoreDocs;
    if (docs.length == 1) {
      return docs[0].doc;
    }
    return -1; // not found
  }
  
}
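
The grouped case is the least obvious part of the handler, so here is a small standalone sketch of just that computation, with no Solr involved. The class name, method name and sample data are made up for illustration. It assumes, like groupImpl above, that groups are ordered by the first appearance of one of their docs in the ranked results, and that docs keep their relative order inside a group.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Standalone sketch of the regrouping step; names and data are hypothetical. */
class GroupedPositionDemo {

  /**
   * Returns the 1-based position of lookupId once the ranked results are
   * regrouped, or -1 if it is not in the results.
   * docIds[i] is the i-th ranked doc and groups[i] is its group value.
   */
  static int groupedPosition(int[] docIds, String[] groups, int lookupId) {
    // LinkedHashMap keeps groups in order of first appearance.
    Map<String, List<Integer>> byGroup = new LinkedHashMap<>();
    for (int i = 0; i < docIds.length; i++) {
      byGroup.computeIfAbsent(groups[i], g -> new ArrayList<>()).add(docIds[i]);
    }
    // Walk the groups in order, counting docs until we reach the lookup doc.
    int pos = 0;
    for (List<Integer> members : byGroup.values()) {
      for (int docId : members) {
        pos++;
        if (docId == lookupId) {
          return pos;
        }
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    int[] docIds = {7, 3, 9, 4, 1};
    String[] groups = {"a", "b", "a", "c", "b"};
    // Regrouped order: a:[7, 9], b:[3, 1], c:[4]
    System.out.println(groupedPosition(docIds, groups, 1)); // prints 4
    System.out.println(groupedPosition(docIds, groups, 4)); // prints 5
    System.out.println(groupedPosition(docIds, groups, 8)); // prints -1
  }
}
```

Unlike the handler, this sketch simply walks every group instead of skipping non-matching groups by their size; the resulting position is the same.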