After using Solr for about two years, learning how Lucene works and reading its source code has been a long-delayed task.
Setup
Import latest Solr source code into Eclipse
First, download the latest Solr source code from Apache Solr, unzip it, and run "ant eclipse" so we can import it into Eclipse.
Lucene and Solr provide a lot of test cases, which are a great resource for learning how to use their APIs.
Download Luke
Next, download the latest Luke from GitHub (DmitryKey/luke); the latest release supports Lucene 4.9.
Then start Luke: java -jar luke-with-deps.jar
Create Maven Project with Lucene Dependencies
Next, create a Maven project in Eclipse with the following pom.xml:
Here I chose Lucene 4.9, as the latest Luke doesn't support Lucene 4.10 yet.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.lifelongprogrammer</groupId>
  <artifactId>learningLucene</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <lucene.version>4.9.0</lucene.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-codecs</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queries</artifactId>
      <version>${lucene.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-test-framework</artifactId>
      <version>4.9.0</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
    </dependency>
  </dependencies>
</project>

Indexing
Lucene's test code, org.apache.lucene.index.TestIndexWriter, is a good starting place to learn Lucene indexing.
public void testIndexWriter() throws IOException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
    // recreate the index on each execution
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    // set the codec to SimpleTextCodec if we want to inspect the index as plain text:
    // http://blog.florian-hopf.de/2012/12/looking-at-plaintext-lucene-index.html
    // config.setCodec(new SimpleTextCodec());
    config.setUseCompoundFile(false);
    // if we call setInfoStream, add the annotation below to the test class:
    // @SuppressSysoutChecks(bugUrl = "Solr logs to JUL")
    config.setInfoStream(System.out);
    config.setMaxBufferedDocs(1000);
    config.setRAMBufferSizeMB(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);
    LogDocMergePolicy mergePolicy = new LogDocMergePolicy();
    // set the merge factor to a very small number on purpose,
    // so we can see merges happen after only a few commits
    mergePolicy.setMergeFactor(2);
    // the default merge policy is TieredMergePolicy
    config.setMergePolicy(mergePolicy);
    // be sure to close Directory and IndexWriter
    try (Directory directory = FSDirectory.open(new File(FILE_PATH));
            IndexWriter writer = new IndexWriter(directory, config)) {
        addDocWithIndex(writer, 1);
        writer.commit();
        addDocWithIndex(writer, 2);
        writer.commit();
        addDocWithIndex(writer, 3);
        writer.commit();
    }
}

// Create the field type and set its properties
public static final FieldType CONTENT_FIELD = new FieldType();
static {
    CONTENT_FIELD.setIndexed(true);
    CONTENT_FIELD.setStored(true);
    // CONTENT_FIELD.setTokenized(true);
    CONTENT_FIELD.setStoreTermVectors(true);
    CONTENT_FIELD.setStoreTermVectorPositions(true);
    CONTENT_FIELD.setStoreTermVectorOffsets(true);
    CONTENT_FIELD.setOmitNorms(false);
    // CONTENT_FIELD.setStoreTermVectorPayloads(true);
    CONTENT_FIELD.freeze();
}

static void addDocWithIndex(IndexWriter writer, int i) throws IOException {
    Document doc = new Document();
    doc.add(new LongField("id", i, Field.Store.YES));
    doc.add(new TextField("title", "title " + i, Field.Store.YES));
    doc.add(new IntField("page", 100 + i, Field.Store.YES));
    doc.add(new TextField("author", "author " + i, Field.Store.YES));
    doc.add(new TextField("description", "description " + i, Field.Store.YES));
    doc.add(new Field("editor_comment", "editor comment " + i, CONTENT_FIELD));
    // multi-valued field for search
    doc.add(new Field("content", "title " + i, CONTENT_FIELD));
    doc.add(new Field("content", "author " + i, CONTENT_FIELD));
    doc.add(new Field("content", "description " + i, CONTENT_FIELD));
    doc.add(new Field("content", "editor comment " + i, CONTENT_FIELD));
    writer.addDocument(doc);
}

Check the following links to understand the Lucene index file format:
lucene410 Index File Formats
Hacking Lucene - the Index Format
Lucene学习总结 (Lucene Learning Summary, in Chinese)
Main Classes
IndexWriterConfig
IndexWriterConfig.OpenMode {CREATE, APPEND, CREATE_OR_APPEND }
Use OpenMode.CREATE in a test environment; it creates a new index or overwrites an existing one.
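For reference, a minimal sketch of the three open modes against the Lucene 4.9 API used in this post (the class and method names here are mine, not Lucene's):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.util.Version;

public class OpenModeExample {
    // Build a config that always recreates the index -- good for tests.
    static IndexWriterConfig build() {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
                new StandardAnalyzer(Version.LUCENE_4_9));
        config.setOpenMode(OpenMode.CREATE);
        // OpenMode.APPEND: open an existing index, fail if none exists
        // OpenMode.CREATE_OR_APPEND (the default): append, create if missing
        return config;
    }
}
```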
IndexWriter
forceMerge(int maxNumSegments)
Forces the merge policy to merge segments until there are <= maxNumSegments segments left.
IndexWriter.addIndexes(Directory... dirs)
IndexWriter.addIndexes(IndexReader... readers)
Use these two methods to merge indexes.
Use Case: Bulk Index
When we need to index a huge amount of data, we can use multiple threads to index into different directories first, and at the end use addIndexes to merge them into the final directory.
This approach can improve indexing speed, as by default Lucene indexes with only a single thread.
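The bulk-index pattern above can be sketched as follows; the class name, directory layout, and one-document-per-thread workload are illustrative assumptions, not a production recipe:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BulkIndexExample {

    // Index in parallel into per-thread directories under baseDir, then
    // merge everything into baseDir/final; returns the merged doc count.
    static int bulkIndex(File baseDir, int threads) throws Exception {
        List<Directory> parts = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final Directory part = FSDirectory.open(new File(baseDir, "part-" + t));
            parts.add(part);
            final int id = t;
            Thread worker = new Thread(() -> {
                try (IndexWriter w = new IndexWriter(part, newConfig())) {
                    // each worker indexes its own slice of the data
                    Document doc = new Document();
                    doc.add(new TextField("content",
                            "document from thread " + id, Field.Store.YES));
                    w.addDocument(doc);
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread w : workers) {
            w.join();
        }

        // Merge all partial indexes into the final directory,
        // then forceMerge down to a single segment.
        try (Directory finalDir = FSDirectory.open(new File(baseDir, "final"));
                IndexWriter writer = new IndexWriter(finalDir, newConfig())) {
            writer.addIndexes(parts.toArray(new Directory[0]));
            writer.forceMerge(1);
            for (Directory d : parts) {
                d.close();
            }
            return writer.numDocs();
        }
    }

    static IndexWriterConfig newConfig() {
        return new IndexWriterConfig(Version.LUCENE_4_9,
                new StandardAnalyzer(Version.LUCENE_4_9))
                .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("merged docs: " + bulkIndex(new File("/tmp/bulk-demo"), 4));
    }
}
```

Note that addIndexes(Directory...) copies segments without re-analyzing documents, which is why the final merge is cheap compared with re-indexing.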
Directory
BaseDirectory
FSDirectory(.open, sync)
MMapDirectory,SimpleFSDirectory,NIOFSDirectory
CompoundFileDirectory
HdfsDirectory
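A short sketch of choosing a Directory implementation (the path is hypothetical); FSDirectory.open picks what it considers the best implementation for the current platform, or you can pick one explicitly:

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.store.SimpleFSDirectory;

public class DirectoryExample {
    public static void main(String[] args) throws IOException {
        File path = new File("/tmp/lucene-dir-demo"); // hypothetical path

        // Let Lucene pick: typically MMapDirectory on 64-bit JVMs,
        // otherwise SimpleFSDirectory (Windows) or NIOFSDirectory.
        try (Directory auto = FSDirectory.open(path)) {
            System.out.println("FSDirectory.open chose: "
                    + auto.getClass().getSimpleName());
        }

        // Or choose one explicitly:
        try (Directory mmap = new MMapDirectory(path);       // memory-mapped files
                Directory nio = new NIOFSDirectory(path);    // java.nio positional reads
                Directory simple = new SimpleFSDirectory(path)) { // plain RandomAccessFile
            // use whichever fits the deployment
        }
    }
}
```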
Segment and Merge Policy
Every time indexWriter.commit() is called and there are pending documents, Lucene creates a new segment.
Also at indexWriter.commit(), Lucene checks whether it needs to merge segments.
We can use config.setMergePolicy(mergePolicy) to configure the merge policy; the default is TieredMergePolicy.
The call stack when the merge policy runs at commit time (captured in the Eclipse debugger):
LogDocMergePolicy(LogMergePolicy).findMerges(MergeTrigger, SegmentInfos, IndexWriter) line: 458
IndexWriter.updatePendingMerges(MergePolicy, MergeTrigger, int) line: 2005
IndexWriter.maybeMerge(MergePolicy, MergeTrigger, int) line: 1969
IndexWriter.prepareCommitInternal(MergePolicy) line: 2987
IndexWriter.commitInternal(MergePolicy) line: 3092
IndexWriter.commit() line: 3059
LearningLucene.testIndexWriter() line: 105
Reference