Solr: How to Speed Up Indexing

Store Less And Index Less
Please refer to How to Shrink Solr Index Size
Outline: 
Set indexed=false or stored=false on fields that do not need to be searched or returned.
Use the best-fitting, smallest field type, such as tlong or tint.
Clean Data
Round Data
Increase precisionStep
Set omitNorms=true

Increase the Java Heap Size
java  -server -Xms8192M -Xmx8192M 

Set overwrite to false
If the unique key is generated automatically (either a UUID or a value generated in our own code), or if we can guarantee there is no duplicate data, we can set overwrite to false. See the code in org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand).
After the push is finished, we can run facet.field=idfield&facet.mincount=2 to find out whether there are duplicate ids. If there are, either delete the old documents or check whether there is an error in the data.
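The duplicate check above can be sketched as a small helper that builds the facet query URL. This is only an illustration: the host, port, and core name (localhost:8983, collection1) are assumptions, and q, rows, facet=true, and facet.limit=-1 are extra parameters I added so that the request returns only facet counts and is not cut off at the default facet limit.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class DuplicateIdCheck {
    // Builds a select URL whose facet section lists every value of idField
    // that occurs in at least 2 documents (facet.mincount=2).
    static String buildUrl(String coreUrl, String idField) {
        String q = URLEncoder.encode("*:*", StandardCharsets.UTF_8);
        String field = URLEncoder.encode(idField, StandardCharsets.UTF_8);
        return coreUrl + "/select?q=" + q
                + "&rows=0"            // no documents needed, only facet counts
                + "&facet=true"
                + "&facet.field=" + field
                + "&facet.mincount=2"  // keep only ids seen more than once
                + "&facet.limit=-1";   // assumption: report all duplicates, not only the first 100
    }

    public static void main(String[] args) {
        // Hypothetical server and core; "idfield" is the unique key field used in the text above.
        System.out.println(buildUrl("http://localhost:8983/solr/collection1", "idfield"));
    }
}
```

If the facet section of the response is empty, no id appears more than once. On a very large index this facet can be expensive, so it is best run once after the push.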

Increase ramBufferSizeMB, maxBufferedDocs, and mergeFactor in solrconfig.xml
This reduces the number of disk IO operations during indexing.
After committing the data, you may run optimize to increase query speed.

Increase the size of the BufferedReader to reduce the number of IO operations.
To do this, you have to change the Solr code:
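// Read the configured buffer size; 0 (the default) keeps the original reader.
BUFFER_READER_SIZE = params.getInt(PARAM_BUFFER_READER_SIZE, 0);
// BufferedReader rejects sizes <= 0, so wrap only for positive values.
if (BUFFER_READER_SIZE > 0) {
    reader = new BufferedReader(reader, BUFFER_READER_SIZE);
}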
Then configure the size of the BufferedReader in solrconfig.xml.
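To see the wrapping logic outside of Solr, here is a minimal standalone sketch using only the JDK. The class and method names (SizedReaderDemo, wrap, countLines) are made up for illustration and are not part of Solr; the 1 MB buffer size is an arbitrary example value.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SizedReaderDemo {
    // Wraps the reader in a BufferedReader of the given size;
    // a size of 0 or less returns the reader unchanged.
    static Reader wrap(Reader reader, int bufferSize) {
        return bufferSize > 0 ? new BufferedReader(reader, bufferSize) : reader;
    }

    // Counts lines, reusing the reader if it is already buffered.
    static int countLines(Reader reader) {
        BufferedReader br = reader instanceof BufferedReader
                ? (BufferedReader) reader
                : new BufferedReader(reader);
        try {
            int lines = 0;
            while (br.readLine() != null) {
                lines++;
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A 1 MB buffer: each underlying read fetches far more than the 8192-char default.
        Reader reader = wrap(new StringReader("id,name\n1,a\n2,b\n"), 1 << 20);
        System.out.println(countLines(reader));
    }
}
```

A larger buffer means fewer read calls on the underlying stream, which is where the saving comes from when the source is a large local file.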

Use multiple threads to upload multiple files at the same time.
Please refer to Solr: Use Multiple Threads to Import Local stream Files
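The general idea of uploading several files in parallel can be sketched with a fixed thread pool. This is not the code from the referenced post: ParallelUpload and uploadAll are hypothetical names, and the uploader is passed in as a callback, so the actual request to Solr is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

final class ParallelUpload {
    // Runs the uploader once per file on a fixed-size pool and returns
    // the number of files whose upload finished without an exception.
    static int uploadAll(List<String> files, int threads, Consumer<String> uploader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        for (String file : files) {
            futures.add(pool.submit(() -> uploader.accept(file)));
        }
        pool.shutdown(); // accept no new tasks; queued uploads still run
        int succeeded = 0;
        for (Future<?> future : futures) {
            try {
                future.get(); // wait for this upload to finish
                succeeded++;
            } catch (ExecutionException e) {
                System.err.println("Upload failed: " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        // Hypothetical file names; the uploader here only prints instead of posting to Solr.
        List<String> files = List.of("part1.csv", "part2.csv", "part3.csv");
        int done = uploadAll(files, 2, f -> System.out.println("uploading " + f));
        System.out.println(done + " files uploaded");
    }
}
```

The pool size is worth tuning by experiment: past a certain point, extra threads only compete for the same disk and CPU.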

Use multiple update processor threads
https://issues.apache.org/jira/browse/SOLR-3585
Merge this improvement into your own Solr build.

Use Multiple Solr Cores
In my test, uploading 56 million records took 70 minutes with one core and 40 minutes with 2 cores in one Solr server. Increasing to 3 cores brought no further improvement (in fact, it was worse).
I think this is because while one core is busy with IO, another core can do CPU-intensive work.

Deploying the cores in different web server instances, in different JVMs, gives better performance.

Solr Cloud
I tested Solr Cloud and found it is not suitable for my task, because it requires enabling Solr transaction logs, which are quite slow, and because of the overhead of ZooKeeper. Using Solr Cloud with 2 nodes, the upload took 4 hours, which is much slower.
Solr Cloud should be more suitable when the index is so large that it can't be stored on one machine.