Auto Completion - Using a Trie to Find Strings Starting with a Prefix

We are all familiar with the auto-completion feature provided by IDEs. For example, in Eclipse, if we type Collections.un, Eclipse lists all methods that start with "un", such as unmodifiableCollection and unmodifiableList.

So how can we implement this feature?
How can we find, repeatedly and efficiently, all strings that start with a given prefix?

Answer:
We need to preprocess the list of strings so that later searches are fast.

One way is to sort the string list alphabetically. To search for a prefix (say "app"), we binary search the list twice: once for the lower index, the first string greater than or equal to "app", and once for the higher index, the first string greater than or equal to "apq" (the prefix with its last character incremented). All strings in the half-open range [lower index, higher index) start with the prefix.
Locating the range takes O(log n) string comparisons per query, where n is the length of the string list; returning the matches then costs time proportional to their number.
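
The sorted-list approach can be sketched as follows. This is a minimal illustration, not code from the original project: the class name SortedPrefixSearch and its methods are hypothetical, and it assumes the last character of the prefix is not Character.MAX_VALUE (incrementing it would overflow).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post's code.
class SortedPrefixSearch {
    private final List<String> sorted;

    SortedPrefixSearch(List<String> words) {
        sorted = new ArrayList<>(words);
        Collections.sort(sorted); // preprocess once: O(n log n)
    }

    // Index of the first element that is >= key (size() if there is none).
    private int lowerBound(String key) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    List<String> wordsWithPrefix(String prefix) {
        if (prefix.isEmpty()) {
            return new ArrayList<>(sorted);
        }
        int from = lowerBound(prefix);
        // Smallest string that sorts after every string starting with the prefix:
        // the prefix with its last character incremented ("app" -> "apq").
        // Assumption: the last character is not Character.MAX_VALUE.
        char last = prefix.charAt(prefix.length() - 1);
        String upper = prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
        int to = lowerBound(upper);
        return new ArrayList<>(sorted.subList(from, to));
    }
}
```

For the words "ap", "append", "apple", "apply", "apq" and "apron", a query for "app" selects the range [1, 4) of the sorted list, that is append, apple and apply.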

A better way is to build a tree (a trie) from the string list, with one character per node. For example, the path for the string "append" would look like this:
  [root node(flag)]
         /
        a
       / \
     [ST] p
          \
          p -- return all strings from this sub tree
         /
         e
         \
         n
        / \
        d [Sub Tree]
       /
[leaf node(flag)]
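So when we search for all strings that start with "app", we walk down the tree from the root along a, p, p and return every string in the subtree of the second p node. Reaching that node takes time proportional to the length of the prefix, regardless of the length of the string list; collecting the results then takes time proportional to the size of that subtree. This is much better.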
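
As a condensed sketch of the lookup just described: walk down one node per prefix character, then collect the subtree. This is a deliberately different and smaller design than the WordTree class in this post. The name PrefixTrie is hypothetical, it marks word ends with a boolean flag instead of a null-keyed leaf node, and it uses a TreeMap so results come back in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical compact trie for illustration; not the post's WordTree.
class PrefixTrie {
    private static final class Node {
        // TreeMap keeps children ordered by character, so output is sorted.
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfWord; // true if a word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    List<String> wordsWithPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        Node node = root;
        // Step 1: descend one node per prefix character, O(prefix length).
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return result; // no word starts with this prefix
            }
        }
        // Step 2: collect every word in the subtree below that node.
        collect(node, new StringBuilder(prefix), result);
        return result;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.endOfWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack
        }
    }
}
```

After adding "append", "apple", "apply", "apron" and "banana", wordsWithPrefix("app") returns append, apple and apply, and the cost of step 1 does not change however many other words the trie holds.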

Code:
The complete algorithm and test code, along with many other algorithm problems and solutions, are available on GitHub.

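package org.codeexample.jefferyyuan.autocomplete;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;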
public class WordTree {
 private WordNode root;
 public WordTree() {
  root = new WordNode(null);
 }
 /**
  * Add a string into this word tree, if word is null or an empty string, do
  * nothing
  */
 public void addWord(String word) {
  if (word == null)
   return;
  word = word.trim();
  if ("".equals(word)) {
   return;
  }
  WordNode parentNode = root, currentNode;
  for (int i = 0; i < word.length(); i++) {
   char character = word.charAt(i);
   Map<Character, WordNode> children = parentNode.getChildrenMap();
   if (children.containsKey(character)) {
    currentNode = children.get(character);
   } else {
    currentNode = new WordNode(character);
    parentNode.addChild(currentNode);
   }
   parentNode = currentNode;
  }
  // at last, add a leaf node - whose character value is null to indicate
  // the end of the word
  currentNode = new WordNode(null);
  parentNode.addChild(currentNode);
 }
 /**
  * @param prefix the prefix to search for; it is trimmed before the lookup
  * @return all words in this tree that start with the prefix, <br>
  *         if prefix is null, return an empty list, if prefix is an empty
  *         string, return all words in this word tree.
  */
 public List<String> wordsPrefixWith(String prefix) {
  List<String> words = new ArrayList<String>();
  if (prefix == null)
   return words;
  prefix = prefix.trim();
  WordNode currentNode = root;
  for (int i = 0; i < prefix.length(); i++) {
   char character = prefix.charAt(i);
   Map<Character, WordNode> children = currentNode.getChildrenMap();
   if (!children.containsKey(character)) {
    return words;
   }
   currentNode = children.get(character);
  }
  return currentNode.subWords();
 }
 /**
  * @param word
  * @return whether this tree contains this word, <br>
  *         if the word is null return false, if word is empty string, return
  *         true.
  */
 public boolean hasWord(String word) {
  if (word == null)
   return false;
  word = word.trim();
  if ("".equals(word))
   return true;
  WordNode currentNode = root;
  for (int i = 0; i < word.length(); i++) {
   char character = word.charAt(i);
   Map<Character, WordNode> children = currentNode.getChildrenMap();
   if (!children.containsKey(character)) {
    return false;
   }
   currentNode = children.get(character);
  }
  // at last, check whether the parent node contains one null key - the
  // leaf node, if so return true, else return false.
  return currentNode.getChildrenMap().containsKey(null);
 }
}
class WordNode {
 private Character character;
 private WordNode parent;
 private Map<Character, WordNode> childrenMap = new HashMap<Character, WordNode>();
 public WordNode(Character character) {
  this.character = character;
 }
 /**
  * @return all strings of this sub tree
  */
 public List<String> subWords() {
  List<String> subWords = new ArrayList<String>();
  String prefix = getPrefix();
  List<String> noPrefixSubWords = subWordsImpl();
  for (String noPrefixSubWord : noPrefixSubWords) {
   subWords.add(prefix + noPrefixSubWord);
  }
  return subWords;
 }
 private List<String> subWordsImpl() {
  List<String> words = new ArrayList<String>();
  Iterator<Character> keyIterator = childrenMap.keySet().iterator();
  while (keyIterator.hasNext()) {
   Character key = keyIterator.next();
   if (key == null) {
    words.add(convertToString(this.character));
   } else {
    WordNode node = childrenMap.get(key);
    List<String> childWords = node.subWordsImpl();
    for (String childWord : childWords) {
     words.add(convertToString(this.character) + childWord);
    }
   }
  }
  return words;
 }
 public void addChild(WordNode child) {
  child.parent = this;
  childrenMap.put(child.getCharacter(), child);
 }
 public Character getCharacter() {
  return character;
 }
 public WordNode getParent() {
  return parent;
 }
 public Map<Character, WordNode> getChildrenMap() {
  return childrenMap;
 }
 private String convertToString(Character character) {
  return (character == null) ? "" : String.valueOf(character);
 }
 private String getPrefix() {
  StringBuilder sb = new StringBuilder();
  WordNode parentNode = this.parent;
  while (parentNode != null) {
   if (parentNode.getCharacter() != null) {
    sb.append(parentNode.getCharacter());
   }
   parentNode = parentNode.parent;
  }
  return sb.reverse().toString();
 }
}
From my old blog.