The Problem
We need to iterate over and dump millions of documents into a file. The data is spread across multiple Solr servers, each running in a virtual machine with only 4 GB of memory.
When I ran the dump task, these VMs froze completely for hours, with memory usage above 98%.
The cause is a long-standing problem in Solr:
To fetch documents 1,000,000 to 1,001,000, Solr has to load 1,001,000 sorted documents from the index and then return only the last 1,000.
In SolrCloud the problem gets even worse: every shard has to sort 1,001,000 documents and send them all to one destination Solr server, which then iterates over all of them to pick the final 1,000.
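To make that concrete, with traditional start/rows paging the request for that window looks something like the one below (the host and query are just the ones from the example later in this post); Solr must collect and sort 1,001,000 documents to serve it:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc,id asc&rows=1000&start=1000000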
In older releases, developers found workarounds, such as the one described in Solr Deep Pagination.
Solution: SOLR-5463 (LUCENE-3514)
The upcoming Solr 4.7 solves this problem with SOLR-5463.
The basic idea is:
Get the first 1000 rows:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=*
Parse the response to get the value of nextCursorMark:
<response> <str name="nextCursorMark">AoJ42tmu/Z4CKTQxMDMyMzEwMw==</str> </response>
Then, to get the next 1000 rows [1001-2000], send the same request with cursorMark set to that value:
http://solr1:8080/solr/select?q=accesstime:[* TO NOW-5YEAR/DAY]&sort=accesstime desc, contentid asc &shards=solr1:8080/solr/cvcorefla4,solr2:8080/solr/cvcorefla4,solr3:8080/solr/cvcorefla4&overwrite=true&sort=accesstime desc,id asc&rows=1000&start=0&cursorMark=AoJ42tmu%2FZ4CKTQxMDMyMzEwMw%3D%3D
Repeat until the nextCursorMark value stops changing, or you have collected as many docs as you need.
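Here is a minimal sketch of that loop using SolrJ (requires SolrJ 4.7+ for getNextCursorMark()). The core URL, query, and sort fields are just the ones from the example above, and it assumes the uniqueKey field is id; adjust them for your own schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorDump {
    public static void main(String[] args) throws Exception {
        // Core URL from this post's example; replace with your own.
        HttpSolrServer server = new HttpSolrServer("http://solr1:8080/solr/cvcorefla4");

        SolrQuery query = new SolrQuery("accesstime:[* TO NOW-5YEAR/DAY]");
        query.setRows(1000);                                     // page size ("N")
        query.addSort("accesstime", SolrQuery.ORDER.desc);
        query.addSort("id", SolrQuery.ORDER.asc);                // uniqueKey as tie breaker
        // start defaults to 0 and must stay 0 when cursorMark is used.

        String cursorMark = CursorMarkParams.CURSOR_MARK_START;  // "*"
        boolean done = false;
        while (!done) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = server.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                // dump the document to the file here
            }
            String nextCursorMark = rsp.getNextCursorMark();
            // Stop when the cursor stops moving, i.e. there are no more results.
            done = cursorMark.equals(nextCursorMark);
            cursorMark = nextCursorMark;
        }
        server.shutdown();
    }
}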
Basic Usage from SOLR-5463
start must be "0" in all requests when using cursorMark
sort can be anything, but must include the uniqueKey field (as a tie breaker)
"N" can be any number you want per page
"*" denotes you want to use a cursor starting at the beginning mark
In each subsequent request, replace the "*" value with the nextCursorMark value returned by the previous response (see the example below)
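In other words, only cursorMark changes between requests; q, sort, rows, and start stay fixed. Using this post's request as a template (the query and field names are just the earlier example's):
first page:      ...&q=accesstime:[* TO NOW-5YEAR/DAY]&rows=1000&start=0&sort=accesstime desc,id asc&cursorMark=*
following pages: ...&q=accesstime:[* TO NOW-5YEAR/DAY]&rows=1000&start=0&sort=accesstime desc,id asc&cursorMark=<nextCursorMark from the previous response>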
Resources
SOLR-5463: Provide cursor/token based "searchAfter" support that works with arbitrary sorting (ie: "deep paging")
LUCENE-3514: deep paging with Sort
Coming Soon to Solr: Efficient Cursor Based Iteration of Large Result Sets
Deep Paging Problem