Solr Deep Pagination Problem Fixed in SOLR-5463

The Problem
We need to iterate over and dump millions of documents into a file. The data is spread across multiple Solr servers running on multiple virtual machines, each with just 4 GB of memory.
When I tried to run the dump task, these VMs froze completely for hours, with memory usage above 98%.

The cause is a long-standing problem in Solr:
When we request documents 1,000,000 to 1,001,000, Solr has to load 1,001,000 sorted documents from the index and then return only the last 1,000.

In the case of SolrCloud, the problem gets even worse: every shard has to sort 1,001,000 documents and send them all to one destination Solr server, which then iterates over all of them to pick the 1,000 requested.
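
To make the arithmetic above concrete, here is a rough sketch of how much work one start/rows page implies. The function name and the 3-shard figure are only illustrative assumptions (3 matches the number of shards in the example URL later in this post); it is a back-of-the-envelope estimate, not a measurement of Solr internals.

```python
def deep_paging_cost(start: int, rows: int, shards: int = 1) -> dict:
    """Estimate how many sorted entries are handled for one start/rows page.

    Hypothetical helper for illustration only.
    """
    per_shard = start + rows      # each shard must rank its top start+rows docs
    merged = per_shard * shards   # the destination server walks all candidates
    return {"per_shard": per_shard, "merged": merged, "returned": rows}


cost = deep_paging_cost(start=1_000_000, rows=1000, shards=3)
print(cost)
# {'per_shard': 1001000, 'merged': 3003000, 'returned': 1000}
```

So roughly three million entries are touched to return one thousand, and the cost keeps growing with every page.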

In older releases, developers found some workarounds, such as the one described at Solr Deep Pagination.

Solution: SOLR-5463 (LUCENE-3514)
The upcoming Solr 4.7 solves this problem in SOLR-5463.

The basic idea is as follows.
Get the first 1000 rows:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=*


Parse the response to get the value of nextCursorMark:
<response>
<str name="nextCursorMark">AoJ42tmu/Z4CKTQxMDMyMzEwMw==</str>
</response>
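
As a minimal sketch of this parsing step, assuming the default XML response format and that nextCursorMark appears as a direct child of the response element, as in the snippet above (the function name is made up for illustration):

```python
import xml.etree.ElementTree as ET
from typing import Optional


def extract_next_cursor_mark(xml_text: str) -> Optional[str]:
    """Return the nextCursorMark value from a Solr XML response, or None."""
    root = ET.fromstring(xml_text)
    # Look for <str name="nextCursorMark"> directly under <response>.
    node = root.find("str[@name='nextCursorMark']")
    return node.text if node is not None else None


sample = """<response>
<str name="nextCursorMark">AoJ42tmu/Z4CKTQxMDMyMzEwMw==</str>
</response>"""
print(extract_next_cursor_mark(sample))
# AoJ42tmu/Z4CKTQxMDMyMzEwMw==
```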
Then get the next 1000 rows [1001-2000], passing the URL-encoded nextCursorMark value as cursorMark:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D
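
Note that the cursorMark value in the URL above is percent-encoded, because the raw value contains "/" and "=". A small sketch of that encoding step using Python's standard library (the helper name is only illustrative):

```python
from urllib.parse import quote


def encode_cursor_mark(cursor_mark: str) -> str:
    """Percent-encode a cursor mark so it is safe inside a query string."""
    # safe="" forces "/" to be encoded too; quote() leaves it alone by default.
    return quote(cursor_mark, safe="")


print(encode_cursor_mark("AoJ42tmu/Z4CKTQxMDMyMzEwMw=="))
# AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D
```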

Repeat until the nextCursorMark value stops changing, or you have collected as many docs as you need.
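
The loop itself could be sketched roughly like this. To keep the example runnable without a Solr server, the HTTP call is hidden behind a fetch_page callable, which is an assumption of this sketch: it takes a cursorMark and returns the page's documents plus the nextCursorMark. The fake fetcher below only imitates that contract and is not how Solr builds its cursor marks.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], str]  # (documents, nextCursorMark)


def iterate_with_cursor(fetch_page: Callable[[str], Page],
                        max_docs: Optional[int] = None) -> Iterator[dict]:
    """Yield docs page by page until the cursor stops changing.

    max_docs, if given, is assumed to be a positive number.
    """
    cursor = "*"  # "*" asks for a cursor starting at the beginning
    yielded = 0
    while True:
        docs, next_cursor = fetch_page(cursor)
        for doc in docs:
            yield doc
            yielded += 1
            if max_docs is not None and yielded >= max_docs:
                return  # collected as many docs as we need
        if next_cursor == cursor:
            return  # nextCursorMark stopped changing: no more results
        cursor = next_cursor


def make_fake_fetch(all_docs: List[dict], rows: int) -> Callable[[str], Page]:
    """Stand-in for a real Solr request; the cursor is just a list offset."""
    def fetch(cursor: str) -> Page:
        offset = 0 if cursor == "*" else int(cursor)
        page = all_docs[offset:offset + rows]
        # Past the end, hand back the same cursor, as the loop expects.
        next_cursor = str(offset + len(page)) if page else cursor
        return page, next_cursor
    return fetch


docs = [{"id": str(i)} for i in range(5)]
print([d["id"] for d in iterate_with_cursor(make_fake_fetch(docs, rows=2))])
# ['0', '1', '2', '3', '4']
```

In a real dump task, fetch_page would issue the HTTP request shown above and write each page to the file as it arrives, so memory use stays bounded by the page size.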

Basic Usage from SOLR-5463
start must be "0" in all requests when using cursorMark
sort can be anything, but must include the uniqueKey field (as a tie breaker)
rows=N: "N" can be any number you want per page
cursorMark=*: "*" denotes that you want a cursor starting at the beginning
In each subsequent request, replace the "*" value from your initial request params with the nextCursorMark value from the previous response
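
These rules can be checked on the client side before a request is sent. The following is only a sketch: build_cursor_params is a hypothetical helper, and "id" is assumed to be the uniqueKey field, which depends on your schema.

```python
from urllib.parse import urlencode


def build_cursor_params(query: str, sort: str, unique_key: str,
                        rows: int, cursor_mark: str = "*") -> dict:
    """Build request params that follow the cursorMark usage rules."""
    # Field names are the first token of each comma-separated sort clause.
    sort_fields = [c.split()[0] for c in sort.split(",") if c.strip()]
    if unique_key not in sort_fields:
        raise ValueError(f"sort must include the uniqueKey field '{unique_key}'")
    if rows <= 0:
        raise ValueError("rows must be a positive number")
    return {
        "q": query,
        "sort": sort,
        "rows": rows,
        "start": 0,                 # start must always be 0 with cursorMark
        "cursorMark": cursor_mark,  # "*" on the first request
    }


params = build_cursor_params(
    query="accesstime:[* TO NOW-5YEAR/DAY]",
    sort="accesstime desc,id asc",
    unique_key="id",  # assumption: the schema's uniqueKey is "id"
    rows=1000,
)
print(urlencode(params))  # query string with every value percent-encoded
```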

Resources
SOLR-5463: Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")
LUCENE-3514: deep paging with Sort
Coming Soon to Solr: Efficient Cursor Based Iteration of Large Result Sets
Deep Paging Problem