Thoughts about Auto Completion in Solr

I am trying to implement an auto-suggestion feature for our documentation site.
When a user types a phrase such as "network p" in the search box, the browser sends an Ajax request to the auto-suggester request handler in Solr.
My task is to work out how to implement that request handler.
Utilize query history information
The following is based on this reasoning:
If a phrase is searched frequently, users are probably interested in it and are more likely to search for it again.
If a user recently searched for "network proxy" and then types "netw" or "network p", the user very likely wants to search for "network proxy" again.

So whenever a user runs a query in our application, or reaches our page by typing a query into a search engine such as Google, we can save query information in Solr: the search phrase, the execution count, and the items that match the query.
We also save user search information in Solr, such as the user ID (either a real login ID or an ID we store in a client cookie) and the time the query was executed.

In the auto-suggester request handler, we can first query the user search information to get the queries that the current user searched recently and that start with the phrase the user has typed. The results are sorted by search time, most recent first.

Then we can search the query information to get the queries that were searched by all users and that start with what the current user has typed. The results are sorted by execution count, highest first.
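
The two lookups above can be sketched in plain Java. This is only an in-memory illustration of the merge order (the user's own recent queries first, then globally popular ones), not Solr code. The class, field, and method names are hypothetical, and in a real handler the two lists would come from Solr queries instead of Java lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class HistorySuggester {
    // Hypothetical row of the "user search" information: who searched what, and when.
    static final class UserSearch {
        final String userId; final String phrase; final long timeMillis;
        UserSearch(String userId, String phrase, long timeMillis) {
            this.userId = userId; this.phrase = phrase; this.timeMillis = timeMillis;
        }
    }

    // Hypothetical row of the "query" information: a phrase and how often it was run.
    static final class QueryStat {
        final String phrase; final int executionCount;
        QueryStat(String phrase, int executionCount) {
            this.phrase = phrase; this.executionCount = executionCount;
        }
    }

    static List<String> suggest(String userId, String typed, int limit,
                                List<UserSearch> userSearches, List<QueryStat> queryStats) {
        String prefix = typed.toLowerCase(Locale.ROOT);
        Set<String> merged = new LinkedHashSet<>(); // keeps insertion order, drops duplicates

        // Step 1: the current user's own queries with this prefix, most recent first.
        userSearches.stream()
                .filter(s -> s.userId.equals(userId))
                .filter(s -> s.phrase.toLowerCase(Locale.ROOT).startsWith(prefix))
                .sorted(Comparator.comparingLong((UserSearch s) -> s.timeMillis).reversed())
                .forEach(s -> merged.add(s.phrase));

        // Step 2: queries from all users with this prefix, highest execution count first.
        queryStats.stream()
                .filter(q -> q.phrase.toLowerCase(Locale.ROOT).startsWith(prefix))
                .sorted(Comparator.comparingInt((QueryStat q) -> q.executionCount).reversed())
                .forEach(q -> merged.add(q.phrase));

        List<String> all = new ArrayList<>(merged);
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }
}
```

Putting the personal history first is a design choice that follows the reasoning above: a phrase the user searched recently is treated as a stronger signal than overall popularity.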

We can even write up a list of queries and use a request handler to warm up these two sets of information (tables), updating the search phrases and their execution counts. This can guide what users search for.


Besides helping to implement auto suggestion, these two sets of information (tables) can also help us find out what users are interested in, which queries return no matches, user statistics, and so on.
Use ShingleFilterFactory
ShingleFilterFactory combines adjacent tokens into a single token. For example, take the sentence:
The Network Proxy preference tool enables you to configure how your system connects to the Internet.
When minShingleSize=2 and maxShingleSize=4, "Network Proxy preference tool" becomes a token in the field. This way, if a user types "Network Pr", we can offer "Network Proxy preference tool" as a suggestion. This favors words that appear near each other.
We can also put other filters before ShingleFilterFactory: StopFilterFactory to remove stop words, LengthFilterFactory to remove words shorter than a minimum length, TrimFilterFactory to trim whitespace, or KStemFilterFactory to do very basic stemming.
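
To make the shingle idea concrete, here is a minimal plain-Java sketch of which multi-word tokens come out of a piece of text for a given minimum and maximum shingle size. It is not the Lucene implementation: the class and method names are hypothetical, tokens are split on whitespace only, and the single-word tokens that the real filter also emits by default are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShingleSketch {
    // Returns every run of minSize..maxSize adjacent words, joined by a single space.
    static List<String> shingles(String text, int minSize, int maxSize) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the 2- to 4-word shingles of the example phrase, one per line.
        for (String s : shingles("Network Proxy preference tool", 2, 4)) {
            System.out.println(s);
        }
    }
}
```

With sizes 2 to 4, the phrase "Network Proxy preference tool" yields six shingles, including the full four-word phrase, so a prefix such as "Network Pr" can match any shingle that begins with "Network Proxy".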
Use UIMA to only do auto suggestion on nouns
After all of the above, if there are still fewer than X suggestions (usually 5), we have to run a facet query to get more: the query is what the user typed, and facet.prefix is the last word.
The problem is that there can be many results, and the words that match are often meaningless as suggestions.
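
The fallback facet request might be assembled as in the sketch below. It only builds the parameter map and sends nothing to Solr. The parameter names (q, rows, facet, facet.field, facet.prefix, facet.limit, facet.mincount) are standard Solr parameters, but everything else is an assumption: the class and method names and the field name suggest_nouns are hypothetical, the query is built from the completed words before the last one (one reading of "the query is what the user typed"), and the prefix is lowercased on the assumption that the field's analyzer lowercases terms.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class FacetFallback {
    static Map<String, String> facetParams(String typed, String field, int limit) {
        String text = typed.trim().toLowerCase(Locale.ROOT);
        int lastSpace = text.lastIndexOf(' ');
        // The unfinished last word becomes the facet prefix.
        String prefix = text.substring(lastSpace + 1);
        // Assumption: completed words restrict the documents; with none, match everything.
        String q = lastSpace < 0 ? "*:*" : text.substring(0, lastSpace).trim();

        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", q);
        params.put("rows", "0");              // only facet counts are needed, not documents
        params.put("facet", "true");
        params.put("facet.field", field);
        params.put("facet.prefix", prefix);
        params.put("facet.limit", String.valueOf(limit));
        params.put("facet.mincount", "1");    // skip terms that match no document
        return params;
    }

    public static void main(String[] args) {
        // "suggest_nouns" is a hypothetical field name.
        System.out.println(facetParams("network p", "suggest_nouns", 5));
    }
}
```

Each returned facet term can then be appended to the completed words to form a suggestion.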

We can create a field that contains only nouns, and add other filters to remove unwanted words (such as StopFilterFactory and LengthFilterFactory). This way we can eliminate many meaningless words.