Solr: How to Speed Up Indexing


Store Less And Index Less
Please refer to How to Shrink Solr Index Size
Outline: 
Set indexed=false or stored=false on fields that do not need to be searched or returned.
Use the smallest field type that fits the data, such as tint or tlong.
Clean Data
Round Data
Increase precisionStep
Set omitNorms=true
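
As an illustration of the "Round Data" item, here is a minimal sketch that assumes rounding means reducing the precision of timestamps before indexing, so that fewer distinct terms are produced. The class name RoundData and the choice of hour granularity are hypothetical and not taken from the referenced article.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class RoundData {
    /**
     * Truncates an epoch-millisecond timestamp to the start of its hour (UTC).
     * Hour granularity is an assumption; pick whatever precision queries need.
     */
    static long roundToHour(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .truncatedTo(ChronoUnit.HOURS)
                .toEpochMilli();
    }
}
```

Calling RoundData.roundToHour(value) on each date before building the document keeps only the precision that searches actually use.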

Increase Java Heap Size
java  -server -Xms8192M -Xmx8192M 
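
To confirm that the heap setting was actually picked up, a small check like the one below can be run with the same JVM flags. It is only a sketch; the class name HeapCheck is hypothetical, and the reported value may be slightly lower than the -Xmx value.

```java
final class HeapCheck {
    /** Returns the maximum heap the JVM will try to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```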

Set overwrite to false
If the unique key is generated automatically (either a UUID or an ID generated in our code), or we can guarantee there is no duplicate data, we can set overwrite to false. With overwrite=false, Solr adds the document directly instead of first looking up and deleting any existing document with the same key; see the code in org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand).
After the push is finished, we can run a facet query with facet=true&facet.field=idfield&facet.mincount=2 to find out whether there are duplicate IDs. If there are, either delete the old document or check the data for errors.
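
The duplicate check can be sketched as a small helper that builds the query string. This is a minimal example, not part of Solr: the class name DuplicateIdQuery is hypothetical, and q=*:* with rows=0 are assumptions added so that only facet counts come back. Faceting on a unique field with many millions of values can be memory-hungry, so treat it as a one-off verification step.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class DuplicateIdQuery {
    /** Builds the parameters of a facet query listing ID values that occur at least twice. */
    static String build(String idField) {
        return "q=" + encode("*:*")
                + "&rows=0"              // assumption: only the facet counts are needed
                + "&facet=true"
                + "&facet.field=" + encode(idField)
                + "&facet.mincount=2";   // only values that appear two or more times
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting string is appended to the select URL of the core being checked, for example after ".../select?".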

Increase ramBufferSizeMB, maxBufferedDocs, and mergeFactor in solrconfig.xml
This reduces the number of disk I/O operations during indexing.
After committing the data, you may run optimize to improve query speed.

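Increase the size of the buffered reader to reduce the number of I/O operations
To do this, you have to change the Solr code:
// Read the configured buffer size; 0 (the default) keeps the existing reader.
BUFFER_READER_SIZE = params.getInt(PARAM_BUFFER_READER_SIZE, 0);
if (BUFFER_READER_SIZE != 0) {
    // Wrap the reader so that each read pulls a larger chunk from the stream.
    reader = new BufferedReader(reader, BUFFER_READER_SIZE);
}
Then configure the size of the BufferedReader in solrconfig.xml.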
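
Outside of Solr, the same idea can be tried in isolation with a sketch like the following. The class name ReaderBuffering and the method withBuffer are hypothetical; unlike the snippet above, this version also ignores negative sizes, since BufferedReader rejects them.

```java
import java.io.BufferedReader;
import java.io.Reader;

final class ReaderBuffering {
    /**
     * Wraps the reader in a BufferedReader of the given size (in chars).
     * A size of 0 or less leaves the reader untouched.
     */
    static Reader withBuffer(Reader reader, int bufferSize) {
        return bufferSize > 0 ? new BufferedReader(reader, bufferSize) : reader;
    }
}
```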

Use multiple threads to upload multiple files at the same time
Please refer to Solr: Use Multiple Threads to Import Local stream Files
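
On the client side, uploading several files in parallel can be sketched with a fixed thread pool. This is a minimal illustration and not the code from the referenced post: the class ParallelUploader is hypothetical, and the uploadOne callback stands in for whatever actually sends one file to Solr.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelUploader {
    /**
     * Runs uploadOne for every file on a pool of the given size and
     * returns the number of files processed. Blocks until all are done.
     */
    static int uploadAll(List<Path> files, int threads, Consumer<Path> uploadOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (Path file : files) {
                tasks.add(() -> {
                    uploadOne.accept(file); // hypothetical upload of a single file
                    return null;
                });
            }
            int done = 0;
            for (Future<Void> result : pool.invokeAll(tasks)) {
                result.get(); // surfaces any failure from the worker thread
                done++;
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Upload interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count is worth tuning by measurement; adding threads past the point where disk or CPU is saturated is unlikely to help.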

Use multiple update processor threads
https://issues.apache.org/jira/browse/SOLR-3585
Apply this improvement to your Solr build.

Use Solr Multiple Cores
In my test, uploading 56 million records to one core took 70 minutes, while using 2 cores in one Solr server took 40 minutes. Increasing to 3 cores gave no further improvement (in fact, it was slower).
I think this is because while one core is busy with I/O, the other core can do CPU-bound work.

If the cores are deployed in different web server instances, in separate JVMs, the performance will be better.
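
The post does not say how the data was divided between cores. One possible approach, shown here purely as an assumption, is to route each document by a hash of its ID so that the same ID always lands in the same core. The class CoreRouter, the base URL, and the core name prefix are all hypothetical.

```java
final class CoreRouter {
    /** Picks a core index in [0, coreCount) from the document ID. */
    static int coreIndexFor(String docId, int coreCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), coreCount);
    }

    /** Builds the update URL of the core that should receive this document. */
    static String updateUrlFor(String baseUrl, String corePrefix, String docId, int coreCount) {
        return baseUrl + "/" + corePrefix + coreIndexFor(docId, coreCount) + "/update";
    }
}
```

Splitting by input file instead (one set of files per core) works just as well when IDs never repeat across files.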

Solr Cloud
I tested Solr Cloud and found it unsuitable for my task: it requires the Solr transaction log to be enabled, which is quite slow, and ZooKeeper adds overhead. Using Solr Cloud with 2 nodes, the same import took 4 hours, which is much slower.
Solr Cloud should be more suitable when the index is too large to be stored on one machine.
