Lucene: Fix highlighter issue to always honor hl.fragsize

Summary
There is a bug in Lucene's Highlighter: it appends all remaining text after the last analyzed token to the last fragment. In some cases the last fragment is the most relevant one, so it is returned to the client in the highlighting section. As a result, the highlighter can output far more characters than hl.fragsize.

This article describes how to fix the issue so that the highlighter always honors hl.fragsize and returns no more than hl.fragsize characters per fragment in the highlighting section.

The Problem
Recently we hit a problem with the highlighter. I set hl.fragsize to 300, as shown below:
<str name="hl">on</str>
<str name="hl.fl">title,body_stored</str>
<str name="hl.fragsize">300</str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.body_stored.hl.fragsize">300</str>
But the highlight section for one document still outputs more than 2000 characters.

Looking into the code: in org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(TokenStream, String, boolean, int), after the for loop, the whole remaining text is appended to the last fragment:
if (
  // if there is text beyond the last token considered..
  (lastEndOffset < text.length())
  &&
  // and that text is not too large...
  (text.length()<= maxDocCharsToAnalyze)
 )
{
 //append it to the last fragment
 newText.append(encoder.encodeText(text.substring(lastEndOffset)));
}
currentFrag.textEndPos = newText.length();

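This code is problematic: in some cases the last fragment is the most relevant one and is selected to be returned to the client, together with all of the text appended to it.

The Solution
I changed the code as shown below. When the fragmenter is a SimpleFragmenter, only as many trailing characters are appended as the current fragment still has room for; for other fragmenters the remaining text is still appended in full.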
// Test what remains of the original text beyond the point where we stopped analyzing
if (lastEndOffset < text.length())
{
  if (textFragmenter instanceof SimpleFragmenter)
  {
    SimpleFragmenter simpleFragmenter = (SimpleFragmenter) textFragmenter;
    // room left in the current fragment before it reaches the configured size
    int remain = simpleFragmenter.getFragmentSize() - (newText.length() - currentFrag.textStartPos);
    if (remain > 0)
    {
      int endIndex = lastEndOffset + remain;
      if (endIndex > text.length())
      {
        endIndex = text.length();
      }
      newText.append(encoder.encodeText(text.substring(lastEndOffset, endIndex)));
    }
  }
  else
  {
    newText.append(encoder.encodeText(text.substring(lastEndOffset)));
  }
}
currentFrag.textEndPos = newText.length();
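
To make the arithmetic in the patch easier to follow, here is a small standalone sketch in plain Java that does not depend on Lucene. The class FragmentTailDemo and the methods unboundedTail and clampedTail are hypothetical names used only for illustration. The parameter fragmentSize stands in for SimpleFragmenter.getFragmentSize(), and currentFragLength stands in for newText.length() - currentFrag.textStartPos. The encoder is left out, so this only approximates what the patched Highlighter does.

```java
class FragmentTailDemo {

    // Original behavior: everything after the last token is appended.
    static String unboundedTail(String text, int lastEndOffset) {
        return lastEndOffset < text.length() ? text.substring(lastEndOffset) : "";
    }

    // Patched behavior (sketch): append only what still fits in the fragment.
    // fragmentSize and currentFragLength are stand-ins, see the text above.
    static String clampedTail(String text, int lastEndOffset,
                              int fragmentSize, int currentFragLength) {
        if (lastEndOffset >= text.length()) {
            return "";
        }
        int remain = fragmentSize - currentFragLength;
        if (remain <= 0) {
            return ""; // fragment is already full, append nothing
        }
        int endIndex = Math.min(lastEndOffset + remain, text.length());
        return text.substring(lastEndOffset, endIndex);
    }

    public static void main(String[] args) {
        // Build a 2500-character document.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2500; i++) {
            sb.append('x');
        }
        String text = sb.toString();

        // Assume the last token ended at offset 250 and the current
        // fragment already holds 250 characters, with a fragment size of 300.
        System.out.println(unboundedTail(text, 250).length());         // 2250
        System.out.println(clampedTail(text, 250, 300, 250).length()); // 50
    }
}
```

With the original logic the last fragment grows by 2250 characters; with the clamped version it grows by only the 50 characters that still fit under the fragment size of 300.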

Resources
https://issues.apache.org/jira/browse/LUCENE-5381