Learning Lucene: Search

Lucene Queries
Check this page for all Lucene queries:

public void testIndexSearcher() throws Exception {

  try (Directory directory = FSDirectory.open(new File(FILE_PATH));
      DirectoryReader indexReader = DirectoryReader.open(directory);) {

    IndexSearcher searcher = new IndexSearcher(indexReader);

    // there are different ways to search in Lucene
    // 1. Use QueryParser
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
    QueryParser parser = new QueryParser(Version.LUCENE_4_9,
        "description", analyzer);
    Query query = parser.parse("description");

    TopDocs hits = searcher.search(query, 10);
    printSearchResult(searcher, hits);

    // Using TermQuery
    query = new TermQuery(new Term("description", "description"));
    hits = searcher.search(query, 10);
    printSearchResult(searcher, hits);

    // Search multiple fields using MultiFieldQueryParser, the default
    // operator is OR
    query = new MultiFieldQueryParser(Version.LUCENE_4_9, new String[] {
        "title", "description" }, new StandardAnalyzer(
        Version.LUCENE_4_9)).parse("title");
    hits = searcher.search(query, 10);
    printSearchResult(searcher, hits);

    // Use the static MultiFieldQueryParser.parse with Occur.MUST on every
    // field, so the term must match in both fields (AND); returns 0 hits here
    query = MultiFieldQueryParser.parse(Version.LUCENE_4_9,
        "description", new String[] { "title", "description" },
        new BooleanClause.Occur[] { BooleanClause.Occur.MUST,
            BooleanClause.Occur.MUST }, new StandardAnalyzer(
            Version.LUCENE_4_9));
    hits = searcher.search(query, 10);
    printSearchResult(searcher, hits);

    // Use BooleanQuery to combine queries
    BooleanQuery searchingBooks2004 = new BooleanQuery();
    searchingBooks2004.add(new TermQuery(new Term("title", "title")),
        BooleanClause.Occur.MUST);
    Query priceQuery = NumericRangeQuery.newIntRange("price", 20, 80,
        true, true);
    searchingBooks2004.add(priceQuery, BooleanClause.Occur.MUST);
    // Search with the combined BooleanQuery, not the previous query
    hits = searcher.search(searchingBooks2004, 10);
    printSearchResult(searcher, hits);
  }
}
How Search Works
score(q,d) = queryNorm(q) · coord(q,d) · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) ) (t in q)
Abstract TFIDFSimilarity
tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score.
Math.sqrt(freq)

idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms contribute more to the total score.
Math.log(numDocs/(double)(docFreq+1)) + 1.0
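
The two formulas above can be tried outside Lucene. The following is a minimal sketch in plain Java, not Lucene's own classes; the class name TfIdfSketch and its method names are made up for illustration.

```java
// Illustrative only: re-implements the tf and idf formulas quoted above.
final class TfIdfSketch {

  // tf(t in d): square root of the number of times the term occurs in the document.
  static double tf(int freq) {
    return Math.sqrt(freq);
  }

  // idf(t): 1 + ln(numDocs / (docFreq + 1)); rarer terms get a larger value.
  static double idf(long docFreq, long numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  public static void main(String[] args) {
    System.out.println(tf(4));           // prints 2.0
    System.out.println(idf(999, 1000));  // prints 1.0 (term appears in nearly every document)
    // A term found in 9 of 1000 documents outweighs one found in 499 of 1000.
    System.out.println(idf(9, 1000) > idf(499, 1000));  // prints true
  }
}
```

Because tf grows with the square root, quadrupling the number of occurrences only doubles the factor.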

Query Coordination
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. It is computed at search time.
overlap / (float)maxOverlap

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.
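
As a small illustration of the coord formula, here is a sketch in plain Java. The class and method names are hypothetical; only the expression overlap / (float) maxOverlap comes from the text above.

```java
// Illustrative only: the coordination factor as described above.
final class CoordSketch {

  // overlap: number of query terms found in the document.
  // maxOverlap: total number of terms in the query.
  static float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
  }

  public static void main(String[] args) {
    System.out.println(coord(2, 4));  // prints 0.5: half of the query terms matched
    System.out.println(coord(4, 4));  // prints 1.0: every query term matched
  }
}
```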

queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
1.0 / Math.sqrt(sumOfSquaredWeights)
The sumOfSquaredWeights is calculated by squaring the IDF of each term in the query and adding the squares together.
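
A minimal sketch of that calculation in plain Java, assuming (as the description above does) that each term's weight is just its IDF and that no query boosts are involved. The class and method names are made up for illustration.

```java
// Illustrative only: queryNorm as 1 / sqrt(sum of squared idf values).
final class QueryNormSketch {

  static double queryNorm(double[] idfs) {
    double sumOfSquaredWeights = 0.0;
    for (double idf : idfs) {
      sumOfSquaredWeights += idf * idf;
    }
    return 1.0 / Math.sqrt(sumOfSquaredWeights);
  }

  public static void main(String[] args) {
    // Two query terms with idf 3.0 and 4.0: 1 / sqrt(9 + 16) = 1 / 5
    System.out.println(queryNorm(new double[] {3.0, 4.0}));  // prints 0.2
  }
}
```

Every document scored for the same query is multiplied by this one value, which is why it does not change the ranking.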

t.getBoost() is a search-time boost of term t in the query q, as specified in the query text.
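
In query text, a boost is written with a caret, for example title:lucene^2. The sketch below is plain Java rather than Lucene code, with hypothetical names and made-up input values. It shows how the boost enters one term's part of the sum in the scoring formula, tf · idf² · boost · norm.

```java
// Illustrative only: one term's contribution to the sum in the scoring formula.
final class TermBoostSketch {

  static double termWeight(int freq, long docFreq, long numDocs,
      double boost, double norm) {
    double tf = Math.sqrt(freq);
    double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    return tf * idf * idf * boost * norm;
  }

  public static void main(String[] args) {
    double plain = termWeight(4, 9, 1000, 1.0, 0.5);
    double boosted = termWeight(4, 9, 1000, 2.0, 0.5);
    // A boost of 2 doubles this term's contribution.
    System.out.println(boosted / plain);  // prints 2.0
  }
}
```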

Index-Time Field-Level Boosting
We strongly recommend against using field-level index-time boosts.

norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm (field-length norm) - computed when the document is added to the index, from the number of tokens of this field in the document, so that shorter fields contribute more to the score. lengthNorm is computed by the Similarity class in effect at indexing.
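
A sketch of how these two factors combine, assuming the DefaultSimilarity behavior of multiplying the field boost by 1 / sqrt(numTerms). This is plain Java with hypothetical names. It also ignores the lossy single-byte encoding Lucene applies when it stores the norm, so real stored values are coarser than what this prints.

```java
// Illustrative only: field boost times 1 / sqrt(number of tokens in the field).
final class LengthNormSketch {

  static float norm(float fieldBoost, int numTerms) {
    return fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
  }

  public static void main(String[] args) {
    System.out.println(norm(1.0f, 4));    // prints 0.5: short field
    System.out.println(norm(1.0f, 100));  // prints 0.1: long field contributes less
    System.out.println(norm(2.0f, 4));    // prints 1.0: index-time boost doubles it
  }
}
```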

DefaultSimilarity extends TFIDFSimilarity
https://lucene.apache.org/core/5_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
https://www.elastic.co/guide/en/elasticsearch/guide/master/practical-scoring-function.html
Boolean Model
Vector Space Model
A vector is really just a one-dimensional array containing numbers.
The nice thing about vectors is that they can be compared. By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document.
https://www.elastic.co/guide/en/elasticsearch/guide/master/scoring-theory.html
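
To make the angle comparison concrete, here is a minimal cosine-similarity sketch in plain Java. The class name and the example term weights are made up, and this illustrates the vector space idea rather than Lucene's actual scoring code.

```java
// Illustrative only: cosine of the angle between a query vector and a document vector.
final class CosineSketch {

  static double cosine(double[] a, double[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same length");
    }
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;  // a zero vector has no direction, so treat it as no match
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] query = {1.0, 0.0};     // weights for two terms
    double[] sameTerm = {3.0, 0.0};  // document using only the first term
    double[] otherTerm = {0.0, 2.0}; // document using only the second term
    System.out.println(cosine(query, sameTerm));   // prints 1.0: same direction
    System.out.println(cosine(query, otherTerm));  // prints 0.0: no terms in common
  }
}
```

A smaller angle gives a cosine closer to 1, so documents can be ranked by this value.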

Query Normalization Factor (queryNorm)
The query normalization factor is an attempt to normalize a query so that the results from one query may be compared with the results of another.

http://stackoverflow.com/questions/14512885/is-there-a-way-to-remove-the-calculation-of-length-norms-for-fields-in-elastic-s

The lengthNorm and field level boosting, as you said, are both stored in the norm. So no, you can't have one without the other.

But you don't actually need field boosting at index time. You can apply it at search time instead, and that way you have more flexibility when you want to tweak the boost level later on.

Not only that, by setting omit_norms you reduce the amount of data you have to store at index time by quite a lot, so it is to be recommended where appropriate (such as in your case).

TODO
https://www.elastic.co/guide/en/elasticsearch/guide/master/query-time-boosting.html