Lucene Highlighter HowTo

In practice, we often want to highlight the matched terms in the search response, so that users can quickly spot the matching section and jump to it. The example below targets Lucene 4.9.

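package org.lifelongprogrammer.learningLucene;

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
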
public class LuceneHighlighterInAction {

 public static void main(String[] args) throws Exception {
  Directory directory = new RAMDirectory();
  StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);

  String fieldName = "content";
  writeDocs(directory, analyzer, fieldName);
  // use Highlighter
  try (DirectoryReader indexReader = DirectoryReader.open(directory);) {
   IndexSearcher searcher = new IndexSearcher(indexReader);
   TermQuery query = new TermQuery(new Term(fieldName, "love"));

   TopDocs topDocs = searcher.search(query, 10);
   System.out.println("Total hits: " + topDocs.totalHits);
   ScoreDoc[] scoreDocs = topDocs.scoreDocs;

   // use SimpleHTMLFormatter
   System.out.println("use SimpleHTMLFormatter");
   QueryScorer scorer = new QueryScorer(query);
   Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(
     "<font color='red'>", "</font>"), scorer);
   Fragmenter fragmenter = new SimpleFragmenter(200);
   highlighter.setTextFragmenter(fragmenter);

   for (int i = 0; i < Math.min(scoreDocs.length, 10); ++i) {
    Document doc = searcher.doc(scoreDocs[i].doc);
    String fieldContent = doc.get(fieldName);
    System.out.println(fieldContent + " , " + scoreDocs[i].score);
    System.out.println(highlighter.getBestFragment(analyzer,
      fieldName, fieldContent));
   }

   // use SimpleSpanFragmenter
   System.out.println("use SimpleSpanFragmenter");
   highlighter = new Highlighter(scorer);
   //default is Highlighter.DEFAULT_MAX_CHARS_TO_ANALYZE 50*1024
   highlighter.setMaxDocCharsToAnalyze(10240);
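   // reuse the highlighter's own scorer so the fragmenter sees the span
   // terms the highlighter initializes for each token stream
   fragmenter = new SimpleSpanFragmenter(scorer, 10);
   // without this call the new fragmenter is never used
   highlighter.setTextFragmenter(fragmenter);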
   for (int i = 0; i < Math.min(scoreDocs.length, 10); ++i) {
    Document doc = searcher.doc(scoreDocs[i].doc);
    String fieldContent = doc.get(fieldName);
    System.out.println(fieldContent + " , " + scoreDocs[i].score);
    TokenStream tokenStream = analyzer.tokenStream(fieldName,
      fieldContent);
    String result = highlighter.getBestFragments(tokenStream,
      fieldContent, 2, "...");
    System.out.println(result);
   }
  }
 }

 private static void writeDocs(Directory directory,
   StandardAnalyzer analyzer, String fieldName) throws IOException {
  IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
    analyzer);
  config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
  try (IndexWriter writer = new IndexWriter(directory, config)) {

   FieldType fieldType = new FieldType();
   fieldType.setIndexed(true);
   fieldType.setStored(true);
   fieldType.setTokenized(true);
   fieldType.setStoreTermVectors(true);
   fieldType.setStoreTermVectorOffsets(true);
   fieldType.setStoreTermVectorPositions(true);
   fieldType.setOmitNorms(false);
   fieldType.freeze();

   Document doc = new Document();
   doc.add(new Field(
     fieldName,
     "I am a lifelong programmer, I love coding; I am a lifelong programmer, I love programming.",
     fieldType));
   writer.addDocument(doc);

   doc = new Document();
   doc.add(new Field(
     fieldName,
     "I am a lifelong programmer, I love the world; I am a lifelong programmer, I love the life.",
     fieldType));
   writer.addDocument(doc);
  }
 }
}

The core highlighting logic lives in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int).
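
As a rough intuition for what the highlighter does with the text, here is a small Lucene-free sketch: split the text into fragments, score each fragment by how many times the query term occurs, and wrap the hits in the best fragment with pre and post tags. This is a simplification and not Lucene's actual implementation; HighlightSketch and all of its method names are made up for illustration, and the word matching is a crude stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified illustration of fragment-and-score highlighting.
// None of these names are Lucene APIs.
class HighlightSketch {

    // Splits text on whitespace into fragments of at most fragmentSize chars
    // (a single word longer than fragmentSize becomes its own fragment).
    static List<String> fragment(String text, int fragmentSize) {
        List<String> fragments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (current.length() > 0
                    && current.length() + 1 + word.length() > fragmentSize) {
                fragments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            fragments.add(current.toString());
        }
        return fragments;
    }

    // Drops punctuation and lower-cases: a rough stand-in for analysis.
    static String normalize(String word) {
        return word.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
    }

    // Score = number of words in the fragment that match the term.
    static int score(String fragment, String term) {
        int hits = 0;
        for (String word : fragment.split("\\s+")) {
            if (normalize(word).equals(term)) {
                hits++;
            }
        }
        return hits;
    }

    // Wraps every matching word (including any attached punctuation) in tags.
    static String markUp(String fragment, String term, String pre, String post) {
        StringBuilder out = new StringBuilder();
        for (String word : fragment.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(normalize(word).equals(term) ? pre + word + post : word);
        }
        return out.toString();
    }

    // Returns the first highest-scoring fragment with hits marked up,
    // or null when no fragment contains the term.
    static String bestFragment(String text, String term, int fragmentSize,
            String pre, String post) {
        String best = null;
        int bestScore = 0;
        for (String candidate : fragment(text, fragmentSize)) {
            int s = score(candidate, term);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best == null ? null : markUp(best, term, pre, post);
    }

    public static void main(String[] args) {
        String text = "I am a lifelong programmer, I love coding; "
                + "I am a lifelong programmer, I love programming.";
        System.out.println(bestFragment(text, "love", 200, "<B>", "</B>"));
        System.out.println(bestFragment(text, "love", 30, "<B>", "</B>"));
    }
}
```

With a fragment size of 200 the whole sentence fits into one fragment and both occurrences of "love" are wrapped; with a size of 30 only the first fragment that contains the term is returned. The real Highlighter works on analyzed tokens and offsets rather than whitespace-split words, so treat this only as a mental model.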
Highlighter in Solr
https://cwiki.apache.org/confluence/display/solr/Highlighting
http://wiki.apache.org/solr/HighlightingParameters