Solr: How to Shrink Index Size


To reduce index size, we should do our best to understand the application’s requirements: what each field means, what type it should be (for example, tlong or tint), which tokenizer or filters should be used, and what queries users may run.

Indexed and Stored
If users will never search (or sort or facet) on a field, we can set indexed=false for that field.
If a field is used for search only and customers will never retrieve its original content, we can set stored=false.
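
As a minimal sketch of the two cases above, the schema.xml fragment below uses hypothetical field names (category, thumbnail_url); only the indexed and stored attributes matter here.

```xml
<!-- schema.xml fragment; field names are hypothetical -->

<!-- Searched on but never returned to the client: index it, do not store it -->
<field name="category" type="string" indexed="true" stored="false"/>

<!-- Returned in results but never searched on: store it, do not index it -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>
```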

Use the smallest field type that fits the data: for example, tint rather than tlong when the values fit in 32 bits.
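
For illustration, assuming the tint and tlong types from the Solr 4 example schema are defined, and using hypothetical field names, the choice might look like this:

```xml
<!-- schema.xml fragment; field names are hypothetical -->

<!-- Small counts fit comfortably in a 32-bit int -->
<field name="quantity" type="tint" indexed="true" stored="true"/>

<!-- Byte sizes can exceed the 32-bit range, so a long is justified here -->
<field name="file_size_bytes" type="tlong" indexed="true" stored="true"/>
```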

Clean data before indexing it.
For instance, remove garbage values such as NA.

Round Data
For example, for a date field, users may only care about the date part, not the hh:mm:ss part, so we can round the date: 2012-12-21T12:12:12.234Z becomes 2012-12-21T00:00:00Z. This reduces the number of unique terms in the index.
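
One way to do the rounding, sketched here under the assumption that the field uses one of Solr's date field types, is to append a date math suffix to the value in the update message. Date field values accept date math, and /DAY rounds down to the start of the day in UTC. The field names are hypothetical; rounding in the client before sending the document works just as well.

```xml
<!-- XML update message; field names are hypothetical -->
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- "/DAY" rounds down to midnight UTC, so this is indexed as 2012-12-21T00:00:00Z -->
    <field name="created_date">2012-12-21T12:12:12.234Z/DAY</field>
  </doc>
</add>
```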

Use StopFilterFactory to remove stop words.
More generally, think about which analyzers and filters are applied to the input at index time.
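
A minimal sketch of a text field type with a stop filter follows. The type name text_stop is hypothetical, and stopwords.txt is assumed to exist in the core's conf directory, as it does in the Solr example configuration.

```xml
<!-- schema.xml fragment; the type name "text_stop" is hypothetical -->
<fieldType name="text_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drop every token listed in conf/stopwords.txt, ignoring case -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because a single analyzer element is declared, the same chain applies at index time and at query time, so stop words are dropped from queries as well.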
Range Query and precisionStep
For fields that don't need range queries, or where range query performance is not important, we can set precisionStep to a larger number. This reduces the number of terms indexed per value, at the cost of slower range queries.
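
To make the trade-off concrete, here is a sketch of three Trie long field types. The first matches the tlong definition in the Solr 4 example schema; the other two type names are hypothetical.

```xml
<!-- schema.xml fragment; "tlong_coarse" and "long_single" are hypothetical names -->

<!-- precisionStep=8: a 64-bit value is indexed as 8 terms; fastest range queries -->
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>

<!-- precisionStep=16: 4 terms per value; smaller index, slower range queries -->
<fieldType name="tlong_coarse" class="solr.TrieLongField" precisionStep="16" positionIncrementGap="0"/>

<!-- precisionStep=0: one term per value; smallest index, slowest range queries -->
<fieldType name="long_single" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
```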
termVectors, termPositions and termOffsets
For fields that don't need highlighting, set these three properties to false. This tells Solr not to store term vectors (the per-document term, position, and offset information) for those fields.
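
As a sketch, with a hypothetical field name and assuming the text_general type from the Solr example schema:

```xml
<!-- schema.xml fragment; the field name is hypothetical -->
<!-- No highlighting needed on this field, so skip all term vector data -->
<field name="body" type="text_general" indexed="true" stored="false"
       termVectors="false" termPositions="false" termOffsets="false"/>
```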
omitNorms
Norms hold index-time boosts and field length normalization, which make a match in a shorter field score higher.
Set omitNorms=true for text fields that are usually short and need neither index-time boosts nor length normalization.

(For primitive types such as string, integer, and so on, omitNorms is turned on by default in Solr 4.0.) Omitting norms shrinks the index a bit more and also saves some memory during queries.
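
For example, with a hypothetical field name and assuming the text_general type from the Solr example schema:

```xml
<!-- schema.xml fragment; the field name is hypothetical -->
<!-- Short text field that needs no index-time boost or length normalization -->
<field name="product_name" type="text_general" indexed="true" stored="true" omitNorms="true"/>
```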
