Solr Boost to Improve Search Relevancy

An important step when using Solr is tuning search relevancy.
Boosting Fields: qf
Not all fields are equally important. We can boost some of them, for example the keyword or title fields.
qf=keyword^10 title^5
Boosting Phrases: field~slop^boost
pf boosts documents in which the query terms appear close together as a phrase. We can also specify an optional slop factor directly in "pf" with the syntax field~slop.
pf=title^20 content^10
This will translate a user query like foo bar into:
title:"foo bar"^20 OR content:"foo bar"^10
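The expansion described here can be illustrated with a small, self-contained sketch. It only mimics the textual rewrite for illustration and is not how Solr implements pf internally; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PhraseBoostSketch {
    /** Builds one boosted phrase clause per pf field and joins them with OR. */
    public static String expand(String userQuery, Map<String, Integer> pf) {
        StringJoiner clauses = new StringJoiner(" OR ");
        for (Map.Entry<String, Integer> e : pf.entrySet()) {
            // field:"whole user query"^boost
            clauses.add(e.getKey() + ":\"" + userQuery + "\"^" + e.getValue());
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in the order they were listed in pf
        Map<String, Integer> pf = new LinkedHashMap<>();
        pf.put("title", 20);
        pf.put("content", 10);
        System.out.println(expand("foo bar", pf));
        // prints: title:"foo bar"^20 OR content:"foo bar"^10
    }
}
```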
Boost Queries
A boost query is an additional query that is executed in the background along with the user query and boosts the documents that match it.
bq=sponsored:true or bq=instock:true
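To see how qf, pf and bq travel together in one request, here is a minimal sketch that assembles a URL query string using only the JDK. The class name and the choice of bq=instock:true are illustrative assumptions; defType=edismax is included because these parameters belong to the dismax family of query parsers.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostParamsSketch {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Collects the boost-related parameters for one user query. */
    public static Map<String, String> boostedParams(String userQuery) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("defType", "edismax");
        p.put("q", userQuery);
        p.put("qf", "keyword^10 title^5");   // field boosts
        p.put("pf", "title^20 content^10");  // phrase boosts
        p.put("bq", "instock:true");         // boost query
        return p;
    }

    /** Encodes the parameters as a URL query string. */
    public static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQueryString(boostedParams("foo bar")));
    }
}
```

The resulting string can be appended to a select handler URL; note that ^ and : are percent-encoded and spaces become +.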
Boost Functions
The boost Parameter in edismax
Boost exact match
Copy the content into two fields: one with LowerCaseFilterFactory for case-insensitive search, and one without LowerCaseFilterFactory for exact matching. Then give the exact-match field the higher boost.
title titleExact^10
title^10 titleExact^100
Minimum 'Should' Match: mm

Boost records that contain all terms: boost when mm=100%
Use a pf to boost on a phrase in those same two fields (just common sense)
Set up a boost query (bq) to boost the score if all the search terms are present
'q'='_query_:"{!dismax qf=$f1 mm=$mm1 pf=$f1 bq=$bq1 v=$q1}"',
'f1'='author^3 title^1',
'q1'='Dueber Constructivism',
'bq1'='_query_:"{!dismax qf=$f1 mm=\'100%\' v=$q1 }"^5',
'fl' ='score,*'
Boosting Documents in Solr by Recency, Popularity and User Preferences
Comparing boost methods in Solr


Nutch2: Crawl and Index Extra(image alt) Tag

By default, Nutch only saves the text content of a web page into the "content" field.
For our documentation site, our boss wants to crawl the value of the img alt attribute into the content field and index it into Solr.
To do this, we can extend Nutch's DOMContentUtils.getTextHelper(StringBuilder, Node, boolean, int).

Implementation Code

private boolean getTextHelper(StringBuilder sb, Node node,
    boolean abortOnNestedAnchors, int anchorDepth) {
  boolean abort = false;
  NodeWalker walker = new NodeWalker(node);

  while (walker.hasNext()) {
    Node currentNode = walker.nextNode();
    String nodeName = currentNode.getNodeName();
    short nodeType = currentNode.getNodeType();
    // omitted...
    // get img alt value
    if (nodeType == Node.ELEMENT_NODE) {
      if ("img".equalsIgnoreCase(nodeName)) {
        NamedNodeMap attributes = currentNode.getAttributes();
        Node nameNode = attributes.getNamedItem("alt");
        if (nameNode != null) {
          // assumed: append the alt text to the extracted content
          sb.append(nameNode.getNodeValue()).append(' ');
        }
      }
    }
    // omitted...
  }
  return abort;
}
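The same idea can be tried outside Nutch. The sketch below is a standalone approximation that uses only the JDK's DOM parser and plain recursion instead of Nutch's NodeWalker, and it expects well-formed XHTML input; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ImgAltSketch {
    /** Appends text nodes and img alt values under node, in document order. */
    static void collectText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                sb.append(text).append(' ');
            }
        } else if (node.getNodeType() == Node.ELEMENT_NODE
                && "img".equalsIgnoreCase(node.getNodeName())) {
            NamedNodeMap attributes = node.getAttributes();
            Node alt = attributes.getNamedItem("alt");
            if (alt != null && !alt.getNodeValue().trim().isEmpty()) {
                sb.append(alt.getNodeValue().trim()).append(' ');
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    /** Parses well-formed XHTML and returns its text including img alt values. */
    public static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder sb = new StringBuilder();
            collectText(sb, doc.getDocumentElement());
            return sb.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<p>Install guide <img src=\"a.png\" alt=\"Setup wizard\"/> done</p>";
        System.out.println(extract(page));
        // prints: Install guide Setup wizard done
    }
}
```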

Nutch2: Index Raw Content and Outlinks into Solr

By default, Nutch2 doesn't index raw HTML content or outlinks into Solr, but in some cases we may need to save them there.
We can create a Nutch2 plugin to do this.
How to Implement
We create our own IndexingFilter, override its getFields method, and add WebPage.Field.CONTENT and WebPage.Field.OUTLINKS to the returned Collection. This causes Nutch to read these 2 fields from the underlying storage into the WebPage instance that IndexerMapper receives:
org.apache.nutch.indexer.IndexerJob.createIndexJob(Configuration, String, String)
Collection fields = getFields(job);
StorageUtils.initMapperJob(job, fields, String.class, NutchDocument.class,IndexerMapper.class);

In our IndexingFilter, we then read these 2 fields and add them to the NutchDocument.
Implementation Code
We use two properties, myindexer.index.rawcontent and myindexer.index.outlinks, to control whether to index raw content and outlinks.
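package org.apache.nutch.indexer.myindexer;
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.HashSet;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.TableUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MyIndexingFilter implements IndexingFilter {
  private static final Logger LOG = LoggerFactory.getLogger(MyIndexingFilter.class);
  public static final String FL_RAWCONTENT = "rawcontent";
  public static final String FL_OUTLINKS = "outlinks";
  private Configuration conf;
  private boolean indexRawContent;
  private boolean indexOutlinks;

  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    // fields that Nutch must load from storage for this filter
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
  }
  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    try {
      if (indexRawContent) {
        ByteBuffer bb = page.getContent();
        if (bb != null) {
          doc.add(FL_RAWCONTENT, new String(bb.array()));
        }
      }
      if (indexOutlinks) {
        HashSet<String> set = new HashSet<String>();
        for (Utf8 value : page.getOutlinks().keySet()) {
          String outlink = TableUtil.toString(value);
          String outLinkLower = outlink.toLowerCase();
          if (!set.contains(outLinkLower)) {
            doc.add(FL_OUTLINKS, outlink);
            set.add(outLinkLower); // remember it so duplicates are skipped
          }
        }
      }
    } catch (Exception e) {
      LOG.error(this.getClass().getName() + " throws exception: ", e);
      throw new IndexingException(e);
    }
    return doc;
  }
  public void setConf(Configuration conf) {
    this.conf = conf;
    indexRawContent = conf.getBoolean("myindexer.index.rawcontent", false);
    indexOutlinks = conf.getBoolean("myindexer.index.outlinks", false);
  }
  public Configuration getConf() {
    return conf;
  }
}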
We then define myindexer.index.rawcontent and myindexer.index.outlinks in nutch-site.xml.
Here we omit the code to create a Nutch2 plugin and the changes that add the rawcontent and outlinks fields to Solr's schema.xml.
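Two details of the filter are easy to get wrong, so here is a small standalone sketch of them using only the JDK. It decodes just the readable part of a ByteBuffer (bb.array() returns the whole backing array, which may hold more than the content) and removes outlinks that differ only by case. Treating the bytes as UTF-8 is an assumption, since real pages may use another encoding; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class IndexFieldSketch {
    /** Decodes only the bytes between position and limit, assuming UTF-8. */
    public static String decode(ByteBuffer bb) {
        // duplicate() leaves the caller's position and limit untouched
        return StandardCharsets.UTF_8.decode(bb.duplicate()).toString();
    }

    /** Keeps the first spelling of each outlink, comparing case-insensitively. */
    public static List<String> dedupe(Iterable<String> outlinks) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String link : outlinks) {
            // Set.add returns false when the lower-cased link was already seen
            if (seen.add(link.toLowerCase(Locale.ROOT))) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] raw = "hello world".getBytes(StandardCharsets.UTF_8);
        ByteBuffer bb = ByteBuffer.wrap(raw, 0, 5); // only "hello" is readable
        System.out.println(decode(bb));
        System.out.println(dedupe(Arrays.asList(
                "http://A.com/x", "http://a.com/x", "http://b.com")));
    }
}
```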

Nutch2: Extend Nutch2 to Crawl via Http API

In Nutch 2.x, we can run "nutch server 8080" to start the embedded Jetty server. Nutch provides a basic REST HTTP interface (using Restlet), such as /nutch/admin/status|stop.

We can easily extend Nutch to add a new HTTP interface: /crawler. A user can call /crawler with the required parameters to crawl a website and index it into Solr.

How to Implement
Our /crawler API accepts parameters wrapped in a CrawlerConfigureEntity class. Via CrawlerConfigureEntity, the client can tell /crawler the solrURL, crawlDepth, seed URLs, included URLs, excluded URLs, included file types, excluded file types, crawlID, taskName, etc.

We can also add some predefined tasks, for which we ship the seed URL file, subcollections.xml, nutch-site.xml and other files. The client then only needs to specify the task name, Solr URL and crawl depth, which is easier to use.

In CrawlerResource, we create a folder tasks/${taskName}, create or copy the seed URL file and all needed files in the conf folder, and copy the bin folder to tasks/${taskName}. Then we start the crawl script.

Save Index into Temporary Solr Core
We crawl automatically whenever we change our documentation site, and while crawling we don't want to disrupt the live Solr server. So we save the index into a temporary core and, after the crawl finishes, swap it with the core that is serving user requests.
Also we can save (up to X) index history, so we can revert to previous index if the crawl failed for some reason.
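As a rough illustration of the swap step, the sketch below builds the CoreAdmin request URL (action=SWAP with the core and other parameters) from the live core's URL. It assumes the configured Solr URL ends with the core name; the host, port and core names are made up for the example, and the class is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class CoreSwapSketch {
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Drops the trailing core name: host:port/solr/core -> host:port/solr. */
    public static String solrBaseUrl(String coreUrl) {
        String url = coreUrl.endsWith("/")
                ? coreUrl.substring(0, coreUrl.length() - 1)
                : coreUrl;
        return url.substring(0, url.lastIndexOf('/'));
    }

    /** Builds the CoreAdmin URL that swaps the live core with the temporary one. */
    public static String swapUrl(String liveCoreUrl, String liveCore, String tmpCore) {
        return solrBaseUrl(liveCoreUrl) + "/admin/cores?action=SWAP"
                + "&core=" + enc(liveCore)
                + "&other=" + enc(tmpCore);
    }

    public static void main(String[] args) {
        System.out.println(swapUrl("http://localhost:8983/solr/docs/", "docs", "docs-tmp"));
        // prints: http://localhost:8983/solr/admin/cores?action=SWAP&core=docs&other=docs-tmp
    }
}
```

Because the swap only exchanges the two core names, the previous index stays available under the temporary name and can be swapped back if the crawl turns out to be bad.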
Implementation Code
1. Register CrawlerResource in NutchApp
public synchronized Restlet createInboundRoot() {
  // ...
  router.attach("/"+ CrawlerResource.PATH, CrawlerResource.class);
  return router;
}
2. CrawlerResource handles the /crawler request
package org.apache.nutch.api;
public class CrawlerResource extends ServerResource {
  private static final String FILE_SUFFIX_URLFILTER_TXT = "suffix-urlfilter.txt";
  private static final String FILE_REGEX_URLFILTER_TXT = "regex-urlfilter.txt";
  private static final String DIR_SEED = "urls";
  private static final String DIR_BIN = "bin";
  public static final String PATH = "crawler";

  @Get @Post
  public Object crawl(final CrawlerConfigureEntity config) throws Exception {
    validateParameter(config);
    LOG.info("Accept one request: " + config.toString());
    final File baseLocation = getBaseLocation();
    final File thisTaskBaseDir = createTask(config, baseLocation);

    Map<String, Object> result = new HashMap<String, Object>();
    result.put("start", new Date().toString());

    Thread thread = new Thread(new  Runnable() {
      public void run() {
        try {
          doCrawl(config, baseLocation, thisTaskBaseDir);
        } catch (Exception e) {
    result.put("msg", "Crawl in background.");
    result.put("end", new Date().toString());
    return result;

  private void doCrawl(CrawlerConfigureEntity config, File baseLocation,
      File thisTaskBaseDir) throws MalformedURLException, SolrServerException,
      IOException, InterruptedException {
    long start = new Date().getTime();
    File thisTaskLogsDir = new File(thisTaskBaseDir, "logs");

    ProcessBuilder processBuilder = new ProcessBuilder();
    String crawlScriptPath = thisTaskBaseDir.getAbsolutePath() + File.separator
        + DIR_BIN + File.separator + "crawl";
    String seedDir = thisTaskBaseDir.getAbsolutePath() + File.separator
        + DIR_SEED;

    StringBuilder sb = new StringBuilder();

    sb.append("export MY_NUTCH_HOME='").append(baseLocation).append("';")
        .append("export MY_NUTCH_CONF_DIR='")
        .append(thisTaskBaseDir + File.separator + "conf").append("'; ");
    String solrIndexParmas = "";
    if (!StringUtils.isEmpty(config.getSolrinexParams())) {
      solrIndexParmas += " --SOLRINEX_PARAMS=\"" + config.getSolrinexParams()
          + "\"";
    }
    String crawlCmd = crawlScriptPath + " " + seedDir + " "
        + config.getCrawlID() + " "
        + (config.getTmpCoreName() == null ? config.getSolrURL()
            : getTmpSolrServerURL(config)) + " "
        + config.getCrawlDepth() + solrIndexParmas + " >> "
        + thisTaskLogsDir.getAbsolutePath() + File.separator + "log" + " 2>&1 ";
    // assumed: the crawl command is appended to the shell script before it runs
    sb.append(crawlCmd);
    processBuilder.command("/bin/bash", "--login", "-c", sb.toString());
    Map<String, String> env = processBuilder.environment();
    env.put("MY_NUTCH_HOME", baseLocation.getAbsolutePath());
    env.put("MY_NUTCH_CONF_DIR", thisTaskBaseDir + File.separator + "conf");

    Process process = processBuilder.start();
    int exitValue = process.waitFor();
    LOG.info("Crawl took " + (new Date().getTime() - start) / 1000
        + " seconds, exitCode: " + exitValue);

  public static void updateSolr(CrawlerConfigureEntity config)
      throws MalformedURLException, SolrServerException, IOException {
    String tmpSolrServer = getTmpSolrServerURL(config);
    LOG.info("Start to swap " + tmpSolrServer + " back to "
        + config.getSolrURL());
    // tmp Solr server: host:port/solr/core-tmp
    int idx = tmpSolrServer.lastIndexOf("/");
    // solrBaseUrl: host:port/solr
    String solrBaseUrl = tmpSolrServer.substring(0, idx);

    CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(solrBaseUrl);

    String oldSolrServer = config.getSolrURL();
    String remainStr = oldSolrServer.substring(idx);
    String oldCoreName = "";
    if ("".equals(remainStr) || "/".equals(remainStr)) {
      oldCoreName = "collection1";
    } else {
      if (remainStr.charAt(0) == '/') {
        oldCoreName = remainStr.substring(1);
      } else {
        oldCoreName = remainStr;
    String tmpCore = config.getTmpCoreName();
    swapCore(solrServer, oldCoreName, tmpCore);
 //TODO Save old index to core-archive-date

  private static void swapCore(CommonsHttpSolrServer solrServer,
      String corename1, String corename2) throws SolrServerException,
      IOException {
    CoreAdminRequest adminReq = new CoreAdminRequest();

  private File createTask(CrawlerConfigureEntity config, File baseLocation)
      throws IOException, Exception {
    File tasksBaseDir = new File(baseLocation, "tasks");
    if (!tasksBaseDir.exists()) {

    boolean isPredined = config.getPreDefinedTask() != null;
    File thisTaskBaseDir = new File(tasksBaseDir, config.getTaskName());
    if (!isPredined && thisTaskBaseDir.exists()) {
      // for develop user only
      if (config.isDeleteIfExist()) {
      } else {
        throw new Exception("Folder " + thisTaskBaseDir + " already exists.");
    } else {
    createTaskSeed(config, baseLocation, thisTaskBaseDir);
    createTaskConfs(config, baseLocation, thisTaskBaseDir);
    copyBinFolder(config, baseLocation, thisTaskBaseDir);
    if (!isPredined && !config.getSubCollections().isEmpty()) {
      updateSubCollections(new File(thisTaskBaseDir, "conf"),
    return thisTaskBaseDir;
  private void cleanDataIfNeeded(CrawlerConfigureEntity config)
      throws MalformedURLException, SolrServerException, IOException {
    if (config.isUpdateDirectly()) {
      if (config.isCleanData()) {
        String solrServerUrl = config.getSolrURL();
    } else {
      String tmpSolrServerStr = getTmpSolrServerURL(config);
  private static String getTmpSolrServerURL(CrawlerConfigureEntity config) {
    String oldSolrServerUrl = config.getSolrURL();

    if (oldSolrServerUrl.endsWith("/")) {
      oldSolrServerUrl = oldSolrServerUrl.substring(0,
          oldSolrServerUrl.length() - 1);
    int idx = oldSolrServerUrl.lastIndexOf("/");
    String str = oldSolrServerUrl.substring(idx + 1);

    String tmpSolrServerStr;
    if (str.equals("solr")) {
      tmpSolrServerStr = oldSolrServerUrl + "/" + config.getTmpCoreName();
    } else {
      tmpSolrServerStr = oldSolrServerUrl.substring(0, idx) + "/"
          + config.getTmpCoreName();

    return tmpSolrServerStr;
  private void cleanData(String solrServerUlr) throws MalformedURLException,
      SolrServerException, IOException {
    CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(solrServerUlr);
  // Create the new core if it doesn't exist.
  private void creatCore(String solrServerUlr, String newCoreName)
      throws MalformedURLException, SolrServerException {
    CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(solrServerUlr);
    SolrQuery query = new SolrQuery("*:*").setRows(0);

  private void validateParameter(CrawlerConfigureEntity config)
      throws Exception {
    if (StringUtils.isEmpty(config.getSolrURL())) {
      throw new Exception("Must set solrURL");
    if (StringUtils.isEmpty(config.getCrawlID())) {
    if (config.getCrawlDepth() == 0) {
    if (config.isUpdateDirectly()
        && StringUtils.isBlank(config.getTmpCoreName())) {
      String tmpCoreName = "core-" + RandomStringUtils.random(3);
    boolean isPredined = config.getPreDefinedTask() != null;
    if (isPredined) {
      if (StringUtils.isBlank(config.getSolrinexParams())) {
    } else {
      if (config.getUrls() == null || config.getUrls().isEmpty()) {
        throw new Exception("Must set urls to crawl.");
      if (StringUtils.isEmpty(config.getTaskName())) {
  private File copyBinFolder(CrawlerConfigureEntity config, File baseLocation,
      File thisTaskBaseDir) throws IOException {
    File destBinDir = new File(thisTaskBaseDir, DIR_BIN);
    if (config.getPreDefinedTask() != null) {
      if (destBinDir.exists()) {
        return destBinDir;
    File srcBinDir = new File(baseLocation, DIR_BIN);
    FileUtils.copyDirectory(srcBinDir, destBinDir);
    // make all files in destBinDir executable
    File[] files = destBinDir.listFiles();
    if (files != null) {
      for (File file : files) {
    return destBinDir;

  private void overwriteWithPredefinedFile(File oldFile,
      CrawlerConfigureEntity config) throws IOException {
    String fileName = oldFile.getName();
    int indx = fileName.indexOf(".");
    String preDefinedFN = fileName.substring(0, indx) + "-"
        + config.getPreDefinedTask() + fileName.substring(indx);
    File preDefinedFile = new File(oldFile.getParentFile(), preDefinedFN);
    if (preDefinedFile.exists()) {
      FileUtils.copyFile(preDefinedFile, oldFile);
  private void createTaskConfs(CrawlerConfigureEntity config,
      File baseLocation, File thisTaskBaseDir) throws IOException {
    File srcConfDir = new File(baseLocation, "conf");
    File thisTaskConfDir = new File(thisTaskBaseDir, "conf");
    String preDeinedTask = config.getPreDefinedTask();
    if (preDeinedTask != null) {
      if (thisTaskConfDir.exists()) {
      } else {
        FileUtils.copyDirectory(srcConfDir, thisTaskConfDir);
        String[] fileStrs = { "nutch-site.xml", "subcollections.xml",
        for (String str : fileStrs) {
          overwriteWithPredefinedFile(new File(thisTaskConfDir, str), config);
    } else {
      FileUtils.copyDirectory(srcConfDir, thisTaskConfDir);

      // handle include and exclude paths
      List<String> paths = config.getIncludePaths();
      File regexUrlfilterFile = new File(thisTaskConfDir,
      if (!paths.isEmpty()) {
        appendLines(regexUrlfilterFile, paths);

      paths = config.getExcludePaths();
      if (!paths.isEmpty()) {
        appendLines(regexUrlfilterFile, paths);
      // handle types
      File suffixUrlfilterFile = new File(thisTaskConfDir,
      List<String> suffixFilters = FileUtils.readLines(suffixUrlfilterFile);
      List<String> excludeTypes = config.getExcludeFileTypes();

      for (String excludeType : excludeTypes) {
        if (!suffixFilters.contains(excludeType)) {
      List<String> includeTypes = config.getIncludeFileTypes();
      for (String includeType : includeTypes) {
        if (suffixFilters.contains(includeType)) {
      FileUtils.writeLines(suffixUrlfilterFile, suffixFilters);
  public static void appendLines(File file, String encoding,
      Collection<String> lines, String lineEnding) throws IOException {
    OutputStream out = null;
    try {
      out = new FileOutputStream(file, true);
      IOUtils.writeLines(lines, lineEnding, out, encoding);
    } finally {
  public static void appendLines(File file, Collection<String> lines)
      throws IOException {
    appendLines(file, null, lines, null);
  private File createTaskSeed(CrawlerConfigureEntity config, File baseLocation,
      File thisTaskBaseDir) throws IOException {
    // create a file urls/seeds.txt, and copy content of urls to it.
    File urlsDir = new File(thisTaskBaseDir, DIR_SEED);
    File seedFile = new File(urlsDir, "seeds.txt");
    if (config.getPreDefinedTask() == null) {
      FileUtils.writeLines(seedFile, config.getUrls());
    } else {
      if (!seedFile.exists()) {
        FileUtils.copyFile(new File(baseLocation, DIR_SEED + "/" + "seeds-"
            + config.getPreDefinedTask() + ".txt"), seedFile);

    return seedFile;
  //Save collections into file
  public void updateSubCollections(File thisTaskConfDir,
      List<SubCollectionEntity> subcollections) throws IOException {
    StringBuilder sb = new StringBuilder();
    final Iterator<SubCollectionEntity> iterator = subcollections.iterator();
    while (iterator.hasNext()) {
      final SubCollectionEntity subCol = iterator.next();
        new File(thisTaskConfDir, "subcollections.xml"), sb.toString());
  private File getBaseLocation() throws UnsupportedEncodingException {
    File jarPath = new File(this.getClass().getProtectionDomain()
    String baseLocation = jarPath.getParentFile().getParent();
    baseLocation = URLDecoder.decode(baseLocation,
    return new File(baseLocation);
3. CrawlerConfigureEntity
public class CrawlerConfigureEntity {
  private String preDefinedTask;
  private boolean updateDirectly, cleanData;
  private String tmpCoreName, taskName;
  private List<String> urls;
  private String solrURL, crawlID;
  private int crawlDepth;
  private List<String> includePaths, excludePaths, includeFileTypes, excludeFileTypes;
  private List<SubCollectionEntity> subCollections = new ArrayList<SubCollectionEntity>();
  private String solrinexParams;
  private boolean sync = false, deleteIfExist = false;
  // getters and setters omitted
}

public class SubCollectionEntity {
  private String name, id;
  private List<String> blackList, whiteList;
  // getters and setters omitted
}

Nutch2: Speed up Nutch Crawling

The fetch step is likely to take most of the time.
Increase the number of fetcher threads and the number of threads per queue:
fetcher.threads.fetch and fetcher.threads.per.queue
Decrease fetcher.server.delay
Add Solr docs asynchronously
Change the "/update" request handler to an implementation that returns immediately and adds the Solr documents asynchronously.


Thoughts about Auto Completion in Solr

I am trying to implement an auto-suggestion feature for our documentation site.
When a user types a phrase such as "network p" in the search box, the browser sends an Ajax request to the auto-suggest request handler in Solr.
My task now is to implement that request handler.
Utilize query history information
The following is based on this reasoning:
If a phrase is searched frequently, (potential) users are probably interested in it and are more likely to search for it.
If a user searched for "network proxy" recently and then types netw or "network p", the user very likely wants to search for "network proxy" again.

So whenever a user runs a query in our application, or reaches our page by typing a query into a search engine like Google, we can save query information into Solr, such as the search phrase, the execution count, and the items that match the query.
We also save user and user search information into Solr, such as the user ID (either a real login ID or just an ID we store in a client cookie) and the time the query was executed.

In the auto-suggest request handler, we can first query the user search information to get queries that the current user searched recently and that start with the phrase the user has typed. Results are sorted by search time, most recent first.

Then we can search the query information to get queries that were searched by all users and that start with what the current user has typed. Results are sorted by execution count in descending order.
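The two-step lookup can be sketched in plain Java. In this sketch, in-memory lists stand in for the two Solr collections, and the class, record and field names are hypothetical; a real handler would run the equivalent prefix queries with sorting in Solr.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class HistorySuggestSketch {
    /** A query one user ran, with the time it was executed. */
    public static class UserQuery {
        final String phrase;
        final long timestamp;
        public UserQuery(String phrase, long timestamp) {
            this.phrase = phrase;
            this.timestamp = timestamp;
        }
    }

    /** A query phrase with its execution count across all users. */
    public static class QueryStat {
        final String phrase;
        final int count;
        public QueryStat(String phrase, int count) {
            this.phrase = phrase;
            this.count = count;
        }
    }

    private static boolean matches(String phrase, String typed) {
        return phrase.toLowerCase(Locale.ROOT).startsWith(typed.toLowerCase(Locale.ROOT));
    }

    /** The user's own recent queries first, then popular queries of all users. */
    public static List<String> suggest(String typed, List<UserQuery> userHistory,
            List<QueryStat> globalStats, int limit) {
        Set<String> ordered = new LinkedHashSet<>(); // keeps order, drops duplicates
        List<UserQuery> mine = new ArrayList<>(userHistory);
        mine.sort(Comparator.comparingLong((UserQuery q) -> q.timestamp).reversed());
        for (UserQuery q : mine) {
            if (matches(q.phrase, typed)) {
                ordered.add(q.phrase);
            }
        }
        List<QueryStat> all = new ArrayList<>(globalStats);
        all.sort(Comparator.comparingInt((QueryStat s) -> s.count).reversed());
        for (QueryStat s : all) {
            if (matches(s.phrase, typed)) {
                ordered.add(s.phrase);
            }
        }
        List<String> result = new ArrayList<>(ordered);
        return result.size() > limit ? new ArrayList<>(result.subList(0, limit)) : result;
    }

    public static void main(String[] args) {
        List<UserQuery> history = new ArrayList<>();
        history.add(new UserQuery("network printer", 100L));
        history.add(new UserQuery("network proxy", 200L));
        List<QueryStat> stats = new ArrayList<>();
        stats.add(new QueryStat("network policy", 50));
        stats.add(new QueryStat("network port", 70));
        System.out.println(suggest("network p", history, stats, 5));
        // prints: [network proxy, network printer, network port, network policy]
    }
}
```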

We can even write up a list of queries and use a request handler to warm up these 2 tables of information, that is, to seed the search phrases and execution counts. This can guide what users search for.

Besides helping to implement auto-suggestion, these 2 tables can also help us find what users are interested in, which queries return no matches, user statistics, and so on.
Use ShingleFilterFactory
ShingleFilterFactory creates combinations of tokens as a single token. For example:
The Network Proxy preference tool enables you to configure how your system connects to the Internet.
With minShingleSize=2 and maxShingleSize=4, "Network Proxy preference tool" becomes a token in the field. This way, if a user types "Network Pr", we can offer "Network Proxy preference tool" as a suggestion. This favors words that appear near each other.
Before ShingleFilterFactory, we can also use StopFilterFactory to remove stop words, LengthFilterFactory to remove words shorter than a minimum length, and TrimFilterFactory or KStemFilterFactory for basic trimming and stemming.
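To make the shingle idea concrete, here is a simplified sketch in plain Java. It is not Solr's ShingleFilterFactory: it splits on whitespace only (so punctuation stays attached to words), emits only multi-word shingles, and matches the typed prefix case-insensitively. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ShingleSketch {
    /** Builds every run of minSize..maxSize adjacent tokens, joined by a space. */
    public static List<String> shingles(String text, int minSize, int maxSize) {
        String[] tokens = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return out;
    }

    /** Returns the shingles that start with what the user has typed so far. */
    public static List<String> suggest(String typed, List<String> shingles) {
        String prefix = typed.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String shingle : shingles) {
            if (shingle.toLowerCase(Locale.ROOT).startsWith(prefix)) {
                matches.add(shingle);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "The Network Proxy preference tool enables you to configure "
                + "how your system connects to the Internet.";
        System.out.println(suggest("Network Pr", shingles(text, 2, 4)));
        // prints: [Network Proxy, Network Proxy preference, Network Proxy preference tool]
    }
}
```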
Use UIMA to only do auto suggestion on nouns
If, after all of the above, there are still fewer than X (usually 5) suggestions, we have to run a facet query to get more: the query is what the user typed, and facet.prefix is the last word.
But the problem is that there can be many results, and the words that match the query are often meaningless.

We can create a field that contains only nouns, and we can also add other filters to remove unwanted words (such as StopFilterFactory and LengthFilterFactory). This way we can eliminate many meaningless words.

