Learning Lucene: Indexing

After using Solr for about two years, learning how Lucene works and reading its source code is a long-delayed task.

Setup
Import latest Solr source code into Eclipse
First, download the latest Solr source code from Apache Solr, unzip it, and run "ant eclipse" so we can import it into Eclipse.

Lucene and Solr provide a lot of test cases, which are a great resource for learning how to use their APIs.

Download Luke
Next, download the latest Luke from GitHub (DmitryKey/luke); the latest release supports Lucene 4.9.
Then start Luke: java -jar luke-with-deps.jar

Create Maven Project with Lucene Dependencies
Here I chose Lucene 4.9, as the latest Luke doesn't support Lucene 4.10 yet.
Next, create a Maven project in Eclipse with the following pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
 <modelVersion>4.0.0</modelVersion>

 <groupId>org.lifelongprogrammer</groupId>
 <artifactId>learningLucene</artifactId>
 <version>1.0</version>
 <packaging>jar</packaging>

 <properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <lucene.version>4.9.0</lucene.version>
 </properties>

 <dependencies>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-core</artifactId>
   <version>${lucene.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-analyzers-common</artifactId>
   <version>${lucene.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-codecs</artifactId>
   <version>${lucene.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-queries</artifactId>
   <version>${lucene.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-test-framework</artifactId>
   <version>${lucene.version}</version>
  </dependency>
  <dependency>
   <groupId>junit</groupId>
   <artifactId>junit</artifactId>
   <version>4.11</version>
  </dependency>
 </dependencies>
</project>
Indexing
The Lucene test class org.apache.lucene.index.TestIndexWriter is a good starting point for learning Lucene indexing.
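import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The methods below live inside the test class.
// FILE_PATH is a String constant of that class pointing at the index directory.
public void testIndexWriter() throws IOException {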
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);

  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
      analyzer);
  // recreate the index on each execution
  config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

  // set codec to SimpleTextCodec if we want to check
  // http://blog.florian-hopf.de/2012/12/looking-at-plaintext-lucene-index.html
  // config.setCodec(new SimpleTextCodec());

  config.setUseCompoundFile(false);
  // if we setInfoStream, add the below annotation to the TestClass
  // @SuppressSysoutChecks(bugUrl = "Solr logs to JUL")
  config.setInfoStream(System.out);

  config.setMaxBufferedDocs(1000);
  config.setRAMBufferSizeMB(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);

  LogDocMergePolicy mergePolicy = new LogDocMergePolicy();
  // set the merge factor to a very small number on purpose, so that
  // merges are triggered after only a few commits
  mergePolicy.setMergeFactor(2);
  // default merge policy is TieredMergePolicy
  config.setMergePolicy(mergePolicy);

  // be sure to close Directory and IndexWriter
  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      IndexWriter writer = new IndexWriter(directory, config)) {

    addDocWithIndex(writer, 1);
    writer.commit();
    addDocWithIndex(writer, 2);
    writer.commit();

    addDocWithIndex(writer, 3);
    writer.commit();
  }
}

// Create field type, and set its properties
public static final FieldType CONTENT_FIELD = new FieldType();
static {
  CONTENT_FIELD.setIndexed(true);
  CONTENT_FIELD.setStored(true);
  // CONTENT_FIELD.setTokenized(true);

  CONTENT_FIELD.setStoreTermVectors(true);
  CONTENT_FIELD.setStoreTermVectorPositions(true);
  CONTENT_FIELD.setStoreTermVectorOffsets(true);

  CONTENT_FIELD.setOmitNorms(false);
  // CONTENT_FIELD.setStoreTermVectorPayloads(true);
  CONTENT_FIELD.freeze();
}
static void addDocWithIndex(IndexWriter writer, int i) throws IOException {
  Document doc = new Document();
  doc.add(new LongField("id", i, Field.Store.YES));
  doc.add(new TextField("title", "title " + i, Field.Store.YES));
  doc.add(new IntField("page", 100 + i, Field.Store.YES));
  doc.add(new TextField("author", "author " + i, Field.Store.YES));
  doc.add(new TextField("description", "description " + i,
      Field.Store.YES));
  doc.add(new Field("editor_comment", "editor comment " + i,
      CONTENT_FIELD));

  // multiple value fields for search
  doc.add(new Field("content", "title " + i, CONTENT_FIELD));
  doc.add(new Field("content", "author " + i, CONTENT_FIELD));
  doc.add(new Field("content", "description " + i, CONTENT_FIELD));
  doc.add(new Field("content", "editor comment  " + i, CONTENT_FIELD));

  writer.addDocument(doc);
}
See the following links to understand the Lucene index file format:
lucene410 Index File Formats
Hacking Lucene - the Index Format
Lucene学习总结 (Lucene Learning Summary)

Main Classes
IndexWriterConfig 
IndexWriterConfig.OpenMode {CREATE, APPEND, CREATE_OR_APPEND }
Use OpenMode.CREATE in a test environment: it creates a new index or overwrites an existing one.

IndexWriter
forceMerge(int maxNumSegments)
Forces merge policy to merge segments until there are <= maxNumSegments.

IndexWriter.addIndexes(Directory... dirs)
IndexWriter.addIndexes(IndexReader... readers)
Use these two methods to merge indexes.
Use Case: Bulk Index
When we need to index a huge amount of data, we can use multiple threads to index it into different directories first, then finally use addIndexes to merge them into the final directory.
This approach can improve indexing speed, because IndexWriter does not start indexing threads on its own; the parallelism has to come from the application.
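The partition-then-merge pattern can be sketched without Lucene. In the toy model below, an in-memory map from term to document ids stands in for each per-thread directory, and mergeIndexes plays the role of addIndexes. BulkIndexSketch and all of its method names are hypothetical and not part of Lucene; with Lucene itself, each worker would own an IndexWriter on its own Directory, and the final writer would call addIndexes on those directories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: NOT Lucene code. A "partial index" is a term -> doc ids map.
class BulkIndexSketch {

    // Build one partial index from one shard of documents (doc id -> text).
    static Map<String, TreeSet<Integer>> buildPartialIndex(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    // Stand-in for addIndexes: fold all partial indexes into one final index.
    static Map<String, TreeSet<Integer>> mergeIndexes(List<Map<String, TreeSet<Integer>>> parts) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> part : parts) {
            for (Map.Entry<String, TreeSet<Integer>> e : part.entrySet()) {
                merged.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    // Index every shard on its own thread, then merge the results at the end.
    static Map<String, TreeSet<Integer>> bulkIndex(List<Map<Integer, String>> shards) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<Map<String, TreeSet<Integer>>>> futures = new ArrayList<>();
            for (Map<Integer, String> shard : shards) {
                Callable<Map<String, TreeSet<Integer>>> task = () -> buildPartialIndex(shard);
                futures.add(pool.submit(task));
            }
            List<Map<String, TreeSet<Integer>>> parts = new ArrayList<>();
            for (Future<Map<String, TreeSet<Integer>>> future : futures) {
                parts.add(future.get()); // wait for each worker to finish
            }
            return mergeIndexes(parts);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> shard1 = new HashMap<>();
        shard1.put(1, "title 1");
        shard1.put(2, "title 2");
        Map<Integer, String> shard2 = new HashMap<>();
        shard2.put(3, "title 3");
        System.out.println(bulkIndex(Arrays.asList(shard1, shard2)));
    }
}
```

The workers share no state while indexing, so the only synchronization point is the final merge, which is the same property the per-directory approach relies on.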

Directory
  BaseDirectory
    FSDirectory (.open, sync)
      MMapDirectory, SimpleFSDirectory, NIOFSDirectory
    CompoundFileDirectory
    HdfsDirectory (from Solr)

Segment and Merge Policy
Every time indexWriter.commit() is called and there are pending documents, Lucene creates a new segment.

Also during indexWriter.commit(), Lucene checks whether it needs to merge segments.

We can use config.setMergePolicy(mergePolicy); to configure the merge policy. The default is TieredMergePolicy. The following call stack shows how commit() reaches the merge policy's findMerges:
LogDocMergePolicy(LogMergePolicy).findMerges(MergeTrigger, SegmentInfos, IndexWriter) line: 458
IndexWriter.updatePendingMerges(MergePolicy, MergeTrigger, int) line: 2005
IndexWriter.maybeMerge(MergePolicy, MergeTrigger, int) line: 1969
IndexWriter.prepareCommitInternal(MergePolicy) line: 2987
IndexWriter.commitInternal(MergePolicy) line: 3092
IndexWriter.commit() line: 3059
LearningLucene.testIndexWriter() line: 105
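
Why a small merge factor leads to frequent merges can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not Lucene's LogMergePolicy: segments are represented only by their document counts, each commit adds one segment, and the newest mergeFactor segments are merged whenever they have equal size. The real policy also treats all segments below a minimum size as being on the same level and applies further limits, and merges may run in background threads, so the segments of an actual index can look different. ToyLogMerge and its commit method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Lucene's LogMergePolicy.
class ToyLogMerge {

    // Returns the segment sizes (oldest first) after committing newDocs documents.
    static List<Integer> commit(List<Integer> segments, int newDocs, int mergeFactor) {
        List<Integer> result = new ArrayList<>(segments);
        if (newDocs > 0) {
            result.add(newDocs); // a commit with pending documents adds one segment
        }
        boolean merged = true;
        while (merged) { // merges can cascade, like carries in a counter
            merged = false;
            int n = result.size();
            if (n < mergeFactor) {
                break;
            }
            int size = result.get(n - 1);
            boolean sameSize = true;
            for (int i = n - mergeFactor; i < n; i++) {
                if (result.get(i) != size) {
                    sameSize = false;
                }
            }
            if (sameSize) {
                // replace the newest mergeFactor segments with one merged segment
                for (int i = 0; i < mergeFactor; i++) {
                    result.remove(result.size() - 1);
                }
                result.add(size * mergeFactor);
                merged = true;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> segments = new ArrayList<>();
        for (int i = 1; i <= 4; i++) {
            segments = commit(segments, 1, 2); // one document per commit, merge factor 2
            System.out.println("after commit " + i + ": " + segments);
        }
        // prints:
        // after commit 1: [1]
        // after commit 2: [2]
        // after commit 3: [2, 1]
        // after commit 4: [4]
    }
}
```

With a merge factor of 2, every second commit triggers a merge in this model, which is why such a small value is convenient for observing merge behavior in a test.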
