Solr: Use UpdateRequestProcessorChain to Execute a Processor Chain Before Importing Data


Scenario
I am sending a query to a middle proxy layer which talks with a remote Solr server, then importing the received documents into the local Solr server.

There are different ways to put data into the local Solr server.
1. Use core.getUpdateHandler(): we can add a SolrInputDocument, or delete by id or query (see the sketch after this list).
2. We can use SolrRequestHandler like below:
SolrRequestHandler updateHandler = core.getRequestHandler("/update");
SolrQueryRequest req = new LocalSolrQueryRequest(core,solrParams);
updateHandler.handleRequest(req, solrQueryRsp);
Be sure to close the created SolrQueryRequest.
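For method 1, a minimal sketch looks like this (assuming core is the SolrCore and request is a SolrQueryRequest; the complete example is at the end of this post):

// Method 1 sketch: add a document directly through the UpdateHandler.
UpdateHandler updateHandler = core.getUpdateHandler();
AddUpdateCommand command = new AddUpdateCommand(request);
command.solrDoc = new SolrInputDocument();
command.solrDoc.addField("id", "doc1");
updateHandler.addDoc(command);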

Compared with method 2, method 1 is better, as it skips these phases: 1) create an XML or JSON document payload; 2) parse the payload; 3) create a SolrInputDocument from the payload.
But with method 1, we can't tell Solr to execute a processor chain.
Solution
If we want to tell Solr to execute a processor chain before importing data, we can get an UpdateRequestProcessorChain, create a processor, and use the processor to add a new SolrInputDocument.
UpdateRequestProcessorChain updateChain = request.getCore().getUpdateProcessingChain("dedup");
UpdateRequestProcessor processor = updateChain.createProcessor(request, response);
AddUpdateCommand command = new AddUpdateCommand(request);
command.solrDoc = toSolrServerSolrDoc;
processor.processAdd(command);
The complete code looks like below:

private static boolean importDocs(String remoteDocsStr,
      SolrQueryRequest request) throws IOException {
    // parse the JSON response returned by the remote Solr server
    Object obj = ObjectBuilder.fromJSON(remoteDocsStr);
    HashMap map = (HashMap) obj;
    HashMap responseMap = (HashMap) map.get("response");
    List lists = (List) responseMap.get("docs");

    UpdateRequestProcessorChain updateChain = request.getCore()
        .getUpdateProcessingChain("dedup");
    for (Object docJson : lists) {
      SolrInputDocument toSolrServerSolrDoc = convertToSolrDoc((HashMap) docJson);
      AddUpdateCommand command = new AddUpdateCommand(request);
      command.solrDoc = toSolrServerSolrDoc;
      SolrQueryResponse response = new SolrQueryResponse();
      UpdateRequestProcessor processor = updateChain.createProcessor(request,
          response);
      processor.processAdd(command);
      // let the chain flush any buffered work for this request
      processor.finish();
    }
    return true;
}

Nutch2: Extend Nutch2 to Crawl IFrame Pages


Recently I have been using Nutch2 and Solr 4.x to crawl our documentation site.
The site uses iframes: the main page contains almost no useful content; it mainly includes 2 iframes like below:
<iframe name="left" src="foldera/leftmenu.htm" width="24%" height="85%" frameborder="0" target="main" style="border-right-style: solid; border-right-width: 1px; border-color: #808080">Your browser does not support inline frames..</iframe>
<iframe name="main" src="../products/overview.htm" width="75%" height="85%" border="0" frameborder="0">Your browser does not support inline frames.</iframe>
Goals
When running a search, we want to ignore the left menu.
Usually a search would match content in the main iframe page. We want to show the link of the main page instead of the iframe page, as the iframe page has no left menu (so users can't navigate) and no header, footer, or legal information.
How to Implement
Nutch2 Html Parser
When Nutch2 parses an html page, we test whether it has iframes named left or main. If so, we add two new metatags into HTMLMetaTags.getGeneralTags():
metatag.ext.mainframe: the absolute url of the main iframe.
metatag.ext.ignoreframe: the absolute url of the left iframe.
Solr Post Process Handler
After the crawl is finished, we run a Solr request handler to post-process the data.
In the post handler, we run the query (metatag.ext.mainframe:* OR metatag.ext.ignoreframe:*) to find all pages where one of these 2 fields is set.
Then we set ignore:true for the left menu page (the url in the metatag.ext.ignoreframe field).
We copy the body_stored field (the content Nutch2 crawled) from the main iframe doc (url = metatag.ext.mainframe) to this page, and set ignore:true for that main iframe doc.

Then we modify the search handler, adding fq=ignore:false so that only main pages are searched.
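For example, the filter query can be baked into the search handler's invariants in solrconfig.xml; a sketch, assuming the handler is named /searchdoc and that the schema gives ignore a default value of false:

<requestHandler name="/searchdoc" class="solr.SearchHandler">
 <lst name="invariants">
  <str name="fq">ignore:false</str>
 </lst>
</requestHandler>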
Implementation Code
The complete source code can be found at Github.
Nutch2 Html Parser
We can write a new custom parse-html plugin based on the existing parse-html plugin's HtmlParser.
The following code gets the urls of the left and main iframes and stores them in metatag.ext.ignoreframe and metatag.ext.mainframe.

package org.jefferyyuan.codeexample.nutch.parse.html;

// Excerpt: only getParse is shown; the code before the marked section is the
// same as the stock parse-html plugin's HtmlParser.
public class HtmlParser implements Parser {
  public Parse getParse(String url, WebPage page) {
    // ... stock parse-html logic builds text, title, outlinks, status,
    // metaTags, root (the DOM) and baseUrl ...
    Parse parse = new Parse(text, title, outlinks, status);
    if (metaTags.getNoCache()) {             // not okay to cache
      page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
          ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
    }
    // Our own code
    StringBuilder newSB = new StringBuilder();
    DOMContentUtils.getMainFrame(newSB, root);
    String mainFrame = newSB.toString();

    newSB.setLength(0);
    DOMContentUtils.getIgnoreFrame(newSB, root);
    String ignoreframe = newSB.toString();
 
    String parent = baseUrl.substring(0, baseUrl.lastIndexOf('/'));
    HashMap<String, String[]> generalMetaTags =  metaTags.getGeneralTags();
    if(!mainFrame.isEmpty())
    {
      mainFrame = toAbsolutePath(parent, mainFrame);
      LOG.info("Extension: add metatag.ext.mainframe: " + mainFrame + ", baseUrl: " + baseUrl);
      generalMetaTags.put("metatag.ext.mainframe", new String[] {mainFrame});
    }
    if(!ignoreframe.isEmpty())
    {
      ignoreframe = toAbsolutePath(parent, ignoreframe);
      LOG.info("Extension: add metatag.ext.ignoreframe: " + ignoreframe + ", baseUrl: " + baseUrl);
      generalMetaTags.put("metatag.ext.ignoreframe", new String[] {ignoreframe});
    }
    parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
    return parse;
  }
} 
The code to get the left and main iframes from the web page:
package org.jefferyyuan.codeexample.nutch.parse.html;

import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class DOMContentUtils {
  public static boolean getMainFrame(StringBuilder sb, Node node) {
    return getNodeAttributeValue(sb, node, "iframe", "name", "main", "src");
  }
  public static boolean getIgnoreFrame(StringBuilder sb, Node node) {
    return getNodeAttributeValue(sb, node, "iframe", "name", "left", "src");
  }
  
  public static boolean getNodeAttributeValue(StringBuilder sb, Node node,
      String checkedNodeName, String checkedAttName, String checkedAttValue,
      String returnAttName) {

    NodeWalker walker = new NodeWalker(node);

    while (walker.hasNext()) {
      Node currentNode = walker.nextNode();
      String nodeName = currentNode.getNodeName();
      short nodeType = currentNode.getNodeType();
      if (nodeType == Node.ELEMENT_NODE
          && checkedNodeName.equalsIgnoreCase(nodeName)) {
        NamedNodeMap attributes = currentNode.getAttributes();
        Node nameNode = attributes.getNamedItem(checkedAttName);
        if (nameNode == null) {
          nameNode = attributes.getNamedItem(checkedAttName.toUpperCase());
        }
        if (nameNode == null) {
          continue; // this element lacks the attribute; keep walking the tree
        }
        if (checkedAttValue.equalsIgnoreCase(nameNode.getTextContent())) {
          Node srcNode = attributes.getNamedItem(returnAttName);
          if (srcNode != null) {
            String text = srcNode.getTextContent();
            if (text != null) {
              sb.append(text);
            }
          }
          return true;
        }
      }
    }
    return false;
  }
}  
Solr Post Process Handler
The following Solr post-process handler is called after the crawl is finished. It sets ignore:true for left menu pages and main iframe pages, and copies the body_stored field of the main iframe page to the main page.
package org.jefferyyuan.codeexample.nutch.solr4;
public class DocIndexPostProcessHandler extends RequestHandlerBase {
  private static final String PARAM_LOGGING = "logging_process";
  protected static final Logger logger = LoggerFactory
      .getLogger(DocIndexPostProcessHandler.class);
  private static final String FL_METATAG_EXT_MAINFRAME = "metatag.ext.mainframe";
  private static final String FL_METATAG_EXT_IGNOREFRAME = "metatag.ext.ignoreframe";

  private static final String FL_IGNORE = "ignore";
  private static final String FL_IGNORE_REASON = "ignore_reason";
  private static final String FL_EXTRA_MSG = "extra_msg";
  private static final String FL_PAGETYPE = "pagetype";
  private boolean defaultLogging = false;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      defaultLogging = params.getBool(PARAM_LOGGING, false);
    }
  }
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    SolrCore core = req.getCore();
    SolrParams params = req.getParams();
    boolean logging = params.getBool(PARAM_LOGGING, defaultLogging);
    SolrQueryRequest newReq = new LocalSolrQueryRequest(core,
        new ModifiableSolrParams());
    try {
      SolrRequestHandler searchHandler = core.getRequestHandler("/select");
      SolrIndexSearcher searcher = req.getSearcher();
      IndexSchema schema = core.getSchema();
      final String uniqueFname = schema.getUniqueKeyField().getName();

      UpdateHandler updateHandler = core.getUpdateHandler();
      int start = 0;
      int rows = 100;

      boolean hasMore = true;
      while (hasMore) {
        SolrQuery query = new SolrQuery(FL_METATAG_EXT_MAINFRAME + ":* OR "
            + FL_METATAG_EXT_IGNOREFRAME + ":*").setRows(rows).setStart(start);
        SolrQueryResponse newRsp = new SolrQueryResponse();
        newReq.setParams(query);
        searchHandler.handleRequest(newReq, newRsp);

        NamedList valuesNL = newRsp.getValues();
        Object rspObj = valuesNL.get("response");
        // if the request is sent to this core itself, rspObj will be
        // ResultContext
        if (rspObj instanceof ResultContext) {
          ResultContext resultContext = (ResultContext) rspObj;
          DocList doclist = resultContext.docs;
          if (doclist.size() < rows) {
            hasMore = false;
          }
          DocIterator dit = doclist.iterator();
          while (dit.hasNext()) {
            // per-document state; reset each iteration so values don't leak
            String leftMenuBodyStored = null;
            StringBuilder extraMsgSb = new StringBuilder();
            int docid = dit.nextDoc();
            Document originDoc = searcher.doc(docid, new HashSet<String>());
            String originDocID = originDoc.get(uniqueFname);
            String ignoreFrame = originDoc.get(FL_METATAG_EXT_IGNOREFRAME);
            if (ignoreFrame != null) {
              Query luceneQuery = new TermQuery(new Term(uniqueFname,
                  ignoreFrame));
              TopDocs topDocs = searcher.search(luceneQuery, 1);
              ScoreDoc[] sdocs = topDocs.scoreDocs;
              if (sdocs.length == 1) {
                ScoreDoc doc = sdocs[0];
                Document ignoreDoc = searcher.doc(doc.doc);

                SolrInputDocument ignoreSolrDoc = copyToSolrDoc(schema,
                    ignoreDoc);
                ignoreSolrDoc.setField(FL_IGNORE, true);
                ignoreSolrDoc.setField(FL_PAGETYPE, "menuframe");
                ignoreSolrDoc.setField(FL_IGNORE_REASON,
                    "Is marked as ignore.frame, should be left menu of page: "
                        + originDocID);
                if (logging)
                  logger
                      .info("DocSearchPostProcess: "
                          + ignoreSolrDoc.getFieldValue(uniqueFname)
                          + " is marked as ignore.frame, should be left menu of page: "
                          + originDocID);
                leftMenuBodyStored = ignoreDoc.get("body_stored");

                addDoc(req, updateHandler, ignoreSolrDoc);
              } else {
                extraMsgSb.append("Can't find " + ignoreFrame + "\t.");
              }
            }

            String mainBodyStored = null;
            String mainFrame = originDoc.get(FL_METATAG_EXT_MAINFRAME);
            if (mainFrame != null) {
              Query luceneQuery = new TermQuery(
                  new Term(uniqueFname, mainFrame));
              TopDocs topDocs = searcher.search(luceneQuery, 1);
              ScoreDoc[] sdocs = topDocs.scoreDocs;
              if (sdocs.length == 1) {
                ScoreDoc doc = sdocs[0];
                Document mainDoc = searcher.doc(doc.doc);
                SolrInputDocument mainSolrDoc = copyToSolrDoc(schema, mainDoc);
                mainSolrDoc.setField(FL_PAGETYPE, "mainframe");
                mainSolrDoc.setField(FL_IGNORE, true);
                mainSolrDoc.setField(FL_IGNORE_REASON,
                    "Is marked as main.frame, its content will be copied to "
                        + originDocID);
                if (logging)
                  logger
                      .info("DocSearchPostProcess: "
                          + mainSolrDoc.getFieldValue(uniqueFname)
                          + " is marked as main.frame, its content will be copied to "
                          + originDocID);

                mainBodyStored = mainDoc.get("body_stored");
                addDoc(req, updateHandler, mainSolrDoc);

              } else {
                extraMsgSb.append("Can't find " + mainFrame + "\t.");
              }
            }

            boolean needUpdateOriginDoc = false;
            if (mainBodyStored != null || leftMenuBodyStored != null
                || extraMsgSb.length() != 0) {
              needUpdateOriginDoc = true;
            }
            if (needUpdateOriginDoc) {
              SolrInputDocument originSolrDoc = copyToSolrDoc(schema, originDoc);
              // body_stored, content
              if (mainBodyStored != null) {
                originSolrDoc.setField("origin_body_stored",
                    originSolrDoc.getFieldValue("body_stored"));
                originSolrDoc.setField("body_stored", mainBodyStored);
                originSolrDoc.setField(FL_PAGETYPE, "mainpage");
              }
              if (leftMenuBodyStored != null) {
                originSolrDoc.setField("menu_body_stored", leftMenuBodyStored);
              }
              if (extraMsgSb.length() > 0) {
                originSolrDoc.setField(FL_EXTRA_MSG, extraMsgSb.toString());
                if (logging)
                  logger.info("DocSearchPostProcess: when handle "
                      + originDocID + ", msg: " + extraMsgSb.toString());
              }
              addDoc(req, updateHandler, originSolrDoc);
            }
          }
        } else if (rspObj instanceof SolrDocumentList) {
          throw new RuntimeException("Not implemeneted yet as this request will not be sent to remote core.");
        }
        start += rows;
      }

    } finally {
      newReq.close();
    }
    commitChange(core);
  }
  private void addDoc(SolrQueryRequest req, UpdateHandler updateHandler,
      SolrInputDocument mainSolrDoc) throws IOException {
    AddUpdateCommand updateCommand = new AddUpdateCommand(req);
    updateCommand.solrDoc = mainSolrDoc;
    updateHandler.addDoc(updateCommand);
  }
  private void commitChange(final SolrCore core) {
    SolrRequestHandler commitHandler = core.getRequestHandler("/update");
    ModifiableSolrParams commitParams = new ModifiableSolrParams();
    commitParams.set("commit", "true");
    SolrQueryRequest commitReq = new LocalSolrQueryRequest(core, commitParams);
    try {
      commitHandler.handleRequest(commitReq, new SolrQueryResponse());
    } finally {
      commitReq.close();
    }
  }
  private SolrInputDocument copyToSolrDoc(IndexSchema schema, Document mainDoc) {
    SolrInputDocument mainSolrDoc = new SolrInputDocument();
    // still we keep all fields
    Iterator<IndexableField> it = mainDoc.iterator();
    while (it.hasNext()) {
      IndexableField indexableField = it.next();
      SchemaField solrSchemaField = schema.getField(indexableField.name());
      Object obj = solrSchemaField.getType().toObject(indexableField);
      mainSolrDoc.addField(indexableField.name(), obj);
    }
    return mainSolrDoc;
  }
}

Nutch2: Extend Nutch2 to Get Custom Outlinks from Javascript Files


We use Nutch2 to crawl a documentation site and store the index in Solr 4.x to implement the documentation search function.

But I met one problem: the documentation site uses COOLjsTree; in its htm pages, it defines the left-side menu in tree_nodes.js.
END_USER: {
  NODES: [
   ["End User 1", "../../products/end_user1.htm", "_top"],
   ["End User 2", "../../products/end_user2.htm", "_top"],
  ],
  TITLE: " End-User"
}
Nutch2 provides the parse-js plugin to find outlinks defined in javascript files or embedded javascript sections.
But it's not flexible. It uses the following regular expressions to find outlinks:
org.apache.nutch.parse.js.JSParseFilter
  private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
  private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";
It can find links like http://site.com/folder/pagea.html, but it doesn't work for the links defined in our tree_nodes.js.

But luckily, we can easily write our own Nutch plugin to modify or extend Nutch.

We can create our own ext-parse-js plugin and write our own ParseFilter and Parser to parse outlinks from our tree_nodes.js file.
Design Concept
We want to make our new ext-parse-js plugin configurable and extensible, so we will add the following parameters to nutch-site.xml:
ext.js.file.include.pattern: which files to parse. In our case, it is .*/tree_nodes.js.
ext.js.extract.outlink.pattern: how to extract outlinks from the matched files. In our case it is \"([^\"]*\.(htm|html|pdf))\".
ext.js.absolute.url.pattern: which urls are treated as absolute urls; by default, ^(http|https|www).*
ext.js.indexjs: whether to index js files into Solr.
Implementation Code
The complete source code can be found at Github.
First check whether the file matches ext.js.file.include.pattern; if not, return directly. If it matches, extract links from the file using the regular expression in ext.js.extract.outlink.pattern. We then check whether each extracted url is absolute via ext.js.absolute.url.pattern; if not, we convert it to an absolute url.

package org.jefferyyuan.codeexample.nutch.parse.js.treenodes;
public class JSParseFilter implements ParseFilter, Parser {
  public static final Logger LOG = LoggerFactory
      .getLogger(JSParseFilter.class);

  private static final int MAX_TITLE_LEN = 80;
  // note: these are regular expressions; grouped alternation, not a character class
  private static final String DEFAULT_FILE_INCLUDE_PATTERN_STR = ".*\\.js";
  private static final String ABSOLUTE_URL_PATTERN_STR = "^(http|https|www).*";
  private static Pattern fileIncludePath, absoluteURLPattern, outlinkPattern;
  private Configuration conf;

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    PatternCompiler patternCompiler = new Perl5Compiler();

    try {
      String str = conf.get("ext.js.file.include.pattern",
          DEFAULT_FILE_INCLUDE_PATTERN_STR);
      fileIncludePath = patternCompiler.compile(str,
          Perl5Compiler.READ_ONLY_MASK | Perl5Compiler.SINGLELINE_MASK);
      str = conf.get("ext.js.absolute.url.pattern", ABSOLUTE_URL_PATTERN_STR);
      absoluteURLPattern = patternCompiler.compile(str,
          Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
              | Perl5Compiler.SINGLELINE_MASK);

      str = conf.get("ext.js.extract.outlink.pattern");
      if (!StringUtils.isBlank(str)) {
        outlinkPattern = patternCompiler.compile(str,
            Perl5Compiler.READ_ONLY_MASK | Perl5Compiler.MULTILINE_MASK);
      }
    } catch (MalformedPatternException e) {
      throw new RuntimeException(e);
    }
  }
  private boolean shouldHandlePage(WebPage page) {
    boolean shouldHandle = false;
    String url = TableUtil.toString(page.getBaseUrl());
    PatternMatcher matcher = new Perl5Matcher();
    if (matcher.matches(url, fileIncludePath)) {
      shouldHandle = true;
    }
    return shouldHandle;
  }
  private static String toAbsolutePath(String baseUrl, String path)
      throws MalformedPatternException {
    PatternMatcher matcher = new Perl5Matcher();
    boolean isAbsolute = false;
    if (matcher.matches(path, absoluteURLPattern)) {
      isAbsolute = true;
    }

    if (isAbsolute) {
      return path;
    }
    while (true) {
      if (!path.startsWith("../")) {
        break;
      }
      baseUrl = baseUrl.substring(0, baseUrl.lastIndexOf('/'));
      path = path.substring(3);
    }
    return baseUrl + "/" + path;
  }

  public static Outlink[] getJSLinks(String plainText, String anchor,
      String base) {
    long start = System.currentTimeMillis();
  
    // the base is always absolute path: http://.../tree_nodes.js, change it to
    // folder
    base = base.substring(0, base.lastIndexOf('/'));
    final List<Outlink> outlinks = new ArrayList<Outlink>();
    URL baseURL = null;
  
    try {
      baseURL = new URL(base);
    } catch (Exception e) {
      if (LOG.isErrorEnabled()) {
        LOG.error("error assigning base URL", e);
      }
    }
    try {
      final PatternMatcher matcher = new Perl5Matcher();
      final PatternMatcherInput input = new PatternMatcherInput(plainText);
      MatchResult result;
      String url;
      // loop the matches
      while (matcher.contains(input, outlinkPattern)) {
        if (System.currentTimeMillis() - start >= 60000L) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Time limit exceeded for getOutLinks");
          }
          break;
        }
        result = matcher.getMatch();
        url = result.group(1);
        // See if candidate URL is parseable. If not, pass and move on to
        // the next match.
        try {
          url = new URL(toAbsolutePath(base, url)).toString();
          LOG.info("Extension added: " + url + " and baseURL " + baseURL);
        } catch (MalformedURLException ex) {
          LOG.info("Extension - failed URL parse '" + url + "' and baseURL '"
              + baseURL + "'", ex);
          continue;
        }
        try {
          outlinks.add(new Outlink(url.toString(), anchor));
        } catch (MalformedURLException mue) {
          LOG.warn("Extension Invalid url: '" + url + "', skipping.");
        }
      }
    } catch (Exception ex) {
      LOG.error("getOutlinks", ex);
    }
    final Outlink[] retval;
    if (outlinks != null && outlinks.size() > 0) {
      retval = outlinks.toArray(new Outlink[0]);
    } else {
      retval = new Outlink[0];
    }
  
    return retval;
  }
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    if (shouldHandlePage(page)) {
      ArrayList<Outlink> outlinks = new ArrayList<Outlink>();
      walk(doc, parse, metaTags, url, outlinks);
      if (outlinks.size() > 0) {
        Outlink[] old = parse.getOutlinks();
        String title = parse.getTitle();
        List<Outlink> list = Arrays.asList(old);
        outlinks.addAll(list);
        ParseStatus status = parse.getParseStatus();
        String text = parse.getText();
        Outlink[] newlinks = outlinks.toArray(new Outlink[outlinks.size()]);
        return new Parse(text, title, newlinks, status);
      }
    }
    return parse;
  }
  private void walk(Node n, Parse parse, HTMLMetaTags metaTags, String base,
      List<Outlink> outlinks) {
    if (n instanceof Element) {
      String name = n.getNodeName();
      if (name.equalsIgnoreCase("script")) {
        @SuppressWarnings("unused")
        String lang = null;
        Node lNode = n.getAttributes().getNamedItem("language");
        if (lNode == null)
          lang = "javascript";
        else
          lang = lNode.getNodeValue();
        StringBuilder script = new StringBuilder();
        NodeList nn = n.getChildNodes();
        if (nn.getLength() > 0) {
          for (int i = 0; i < nn.getLength(); i++) {
            if (i > 0)
              script.append('\n');
            script.append(nn.item(i).getNodeValue());
          }
          Outlink[] links = getJSLinks(script.toString(), "", base);
          if (links != null && links.length > 0)
            outlinks.addAll(Arrays.asList(links));
          // no other children of interest here, go one level up.
          return;
        }
      } else {
        // process all HTML 4.0 events, if present...
        NamedNodeMap attrs = n.getAttributes();
        int len = attrs.getLength();
        for (int i = 0; i < len; i++) {
          Node anode = attrs.item(i);
          Outlink[] links = null;
          if (anode.getNodeName().startsWith("on")) {
            links = getJSLinks(anode.getNodeValue(), "", base);
          } else if (anode.getNodeName().equalsIgnoreCase("href")) {
            String val = anode.getNodeValue();
            if (val != null && val.toLowerCase().indexOf("javascript:") != -1) {
              links = getJSLinks(val, "", base);
            }
          }
          if (links != null && links.length > 0)
            outlinks.addAll(Arrays.asList(links));
        }
      }
    }
    NodeList nl = n.getChildNodes();
    for (int i = 0; i < nl.getLength(); i++) {
      walk(nl.item(i), parse, metaTags, base, outlinks);
    }
  }
  public Parse getParse(String url, WebPage page) {
    if (!shouldHandlePage(page)) {
      return ParseStatusUtils.getEmptyParse(
          ParseStatusCodes.FAILED_INVALID_FORMAT, "Content not JavaScript: '"
              + TableUtil.toString(page.getContentType()) + "'", getConf());
    }
    String script = new String(page.getContent().array());
    Outlink[] outlinks = getJSLinks(script, "", url);
    if (outlinks == null)
      outlinks = new Outlink[0];
    // Title? use the first line of the script...
    String title;
    int idx = script.indexOf('\n');
    if (idx != -1) {
      if (idx > MAX_TITLE_LEN)
        idx = MAX_TITLE_LEN;
      title = script.substring(0, idx);
    } else {
      idx = Math.min(MAX_TITLE_LEN, script.length());
      title = script.substring(0, idx);
    }
    Parse parse = new Parse(script, title, outlinks,
        ParseStatusUtils.STATUS_SUCCESS);
    return parse;
  }
}

Configuration
Then we need to include ext-parse-js in nutch-site.xml:

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|ext-parse-js|parse-(html|tika|metatags)|index-(basic|static|metadata|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|subcollection</value>
</property>

<property>
 <name>ext.js.file.include.pattern</name>
 <value>.*/tree_nodes.js</value>
</property>

<property>
 <name>ext.js.absolute.url.pattern</name>
 <value>^(http|https|www).*</value>
</property>

<property>
 <name>ext.js.extract.outlink.pattern</name>
 <value>\"([^\"]*\.(htm|html|pdf))\"</value>
</property>

<property>
 <name>ext.js.indexjs</name>
 <value>false</value>
</property>

Then change parse-plugins.xml to make Nutch use the ext-parse-js plugin to parse javascript files.
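A sketch of the parse-plugins.xml entries; the mime type here is an assumption, adjust it to whatever content type your server reports for .js files:

<mimeType name="application/x-javascript">
 <plugin id="ext-parse-js" />
</mimeType>

<aliases>
 <alias name="ext-parse-js"
  extension-id="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.JSParseFilter" />
</aliases>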
Then we need to change regex-urlfilter.txt so Nutch handles javascript files: remove |js|JS from the following line.
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
At last, as we don't need to store content from javascript files in Solr, we can either write a Solr UpdateRequestProcessor to ignore a document whose url field ends with .js, or change org.apache.nutch.indexer.solr.SolrWriter.write(NutchDocument) like below:
public class SolrWriter implements NutchIndexWriter {
  private boolean indexjs;
  public void open(TaskAttemptContext job)
  throws IOException {
    Configuration conf = job.getConfiguration();
    indexjs= conf.getBoolean("ext.js.indexjs", false);
  }
  public void write(NutchDocument doc) throws IOException {
    String urlValue = doc.getFieldValue("url");
    if (!indexjs) {
      if (urlValue != null && urlValue.endsWith(".js")) {
        LOG.info("CVExtension ignore js file: " + urlValue);
        return;
      }
    }
    // ... rest of the original write logic (send the document to Solr) ...
  }
}
References
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample
http://florianhartl.com/nutch-plugin-tutorial.html

Extend Nutch2 to Get Outlinks from COOLjsTree Javascript File


We use Nutch2 to crawl a documentation site and store the index in Solr 4.x to implement the documentation search function.

But I met one problem: the documentation site uses COOLjsTree; in its htm pages, it defines the left-side menu in tree_nodes.js.
END_USER: {
  NODES: [
   ["End User 1", "../../products/end_user1.htm", "_top"],
   ["End User 2", "../../products/end_user2.htm", "_top"],
  ],
  TITLE: " End-User"
}
Nutch2 provides the parse-js plugin to find outlinks defined in javascript files or embedded javascript sections.
It uses the following regular expressions to find outlinks:
org.apache.nutch.parse.js.JSParseFilter
  private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
  private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";
It can find links like http://site.com/folder/pagea.html, but it doesn't work for the links defined in our tree_nodes.js.

But luckily, we can easily write our own Nutch plugin to modify or extend Nutch.

We can create our own parse-tree-nodes-js plugin and write our own ParseFilter and Parser to parse outlinks from our tree_nodes.js file.
Implementation Code
First check whether the file is a javascript file ending with tree_nodes.js; if so, get links from the file via the regular-expression pattern below:

  private static final String URL_PATTERN_IN_TREE_NODE_JS = "\"([^\"]*\\.(htm|html|pdf))\"";

package org.jefferyyuan.codeexample.nutch.parse.js.treenodes;

public class TreeNodesJSParseFilter implements ParseFilter, Parser {
  public static final Logger LOG = LoggerFactory
      .getLogger(TreeNodesJSParseFilter.class);
  private Configuration conf;

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }

  private static final int MAX_TITLE_LEN = 80;
  // grouped alternations, not character classes
  private static final String ABSOLUTE_URL_PATTERN_STR = "^(http|https|www).*";
  private static final String CV_TREE_NODE_LINK_PATTERN_STR = "\"([^\"]*\\.(htm|html|pdf))\"";
  private static final PatternCompiler patternCompiler = new Perl5Compiler();
  private static Pattern ABSOLUTE_URL_PATTERN, CV_TREE_NODE_LINK_PATTERN;

  static {
    try {
      ABSOLUTE_URL_PATTERN = patternCompiler.compile(ABSOLUTE_URL_PATTERN_STR,
          Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
              | Perl5Compiler.SINGLELINE_MASK);
      CV_TREE_NODE_LINK_PATTERN = patternCompiler.compile(
          CV_TREE_NODE_LINK_PATTERN_STR, Perl5Compiler.CASE_INSENSITIVE_MASK
              | Perl5Compiler.READ_ONLY_MASK | Perl5Compiler.MULTILINE_MASK);
    } catch (MalformedPatternException e) {
      e.printStackTrace();
    }
  }
  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    if (shouldHandle(page)) {
      ArrayList<Outlink> outlinks = new ArrayList<Outlink>();

      walk(doc, parse, metaTags, url, outlinks);
      if (outlinks.size() > 0) {
        Outlink[] old = parse.getOutlinks();
        String title = parse.getTitle();
        List<Outlink> list = Arrays.asList(old);
        outlinks.addAll(list);
        ParseStatus status = parse.getParseStatus();
        String text = parse.getText();
        Outlink[] newlinks = outlinks.toArray(new Outlink[outlinks.size()]);
        return new Parse(text, title, newlinks, status);
      }
    }
    return parse;
  }

  private void walk(Node n, Parse parse, HTMLMetaTags metaTags, String base,
      List<Outlink> outlinks) {
    if (n instanceof Element) {
      String name = n.getNodeName();
      if (name.equalsIgnoreCase("script")) {
        @SuppressWarnings("unused")
        String lang = null;
        Node lNode = n.getAttributes().getNamedItem("language");
        if (lNode == null)
          lang = "javascript";
        else
          lang = lNode.getNodeValue();
        StringBuffer script = new StringBuffer();
        NodeList nn = n.getChildNodes();
        if (nn.getLength() > 0) {
          for (int i = 0; i < nn.getLength(); i++) {
            if (i > 0)
              script.append('\n');
            script.append(nn.item(i).getNodeValue());
          }
          // This logging makes the output very messy.
          // if (LOG.isInfoEnabled()) {
          // LOG.info("script: language=" + lang + ", text: " +
          // script.toString());
          // }
          Outlink[] links = getJSLinks(script.toString(), "", base);
          if (links != null && links.length > 0)
            outlinks.addAll(Arrays.asList(links));
          // no other children of interest here, go one level up.
          return;
        }
      } else {
        // process all HTML 4.0 events, if present...
        NamedNodeMap attrs = n.getAttributes();
        int len = attrs.getLength();
        for (int i = 0; i < len; i++) {
          // Window: onload,onunload
          // Form: onchange,onsubmit,onreset,onselect,onblur,onfocus
          // Keyboard: onkeydown,onkeypress,onkeyup
          // Mouse:
          // onclick,ondbclick,onmousedown,onmouseout,onmousover,onmouseup
          Node anode = attrs.item(i);
          Outlink[] links = null;
          if (anode.getNodeName().startsWith("on")) {
            links = getJSLinks(anode.getNodeValue(), "", base);
          } else if (anode.getNodeName().equalsIgnoreCase("href")) {
            String val = anode.getNodeValue();
            if (val != null && val.toLowerCase().indexOf("javascript:") != -1) {
              links = getJSLinks(val, "", base);
            }
          }
          if (links != null && links.length > 0)
            outlinks.addAll(Arrays.asList(links));
        }
      }
    }
    NodeList nl = n.getChildNodes();
    for (int i = 0; i < nl.getLength(); i++) {
      walk(nl.item(i), parse, metaTags, base, outlinks);
    }
  }

  private boolean shouldHandle(WebPage page) {
    boolean shouldHandle = false;

    String url = TableUtil.toString(page.getBaseUrl());
    if (url != null && url.endsWith("tree_nodes.js")) {
      shouldHandle = true;
    }
    return shouldHandle;
  }

  @Override
  public Parse getParse(String url, WebPage page) {
    if (!shouldHandle(page)) {
      return ParseStatusUtils.getEmptyParse(
          ParseStatusCodes.FAILED_INVALID_FORMAT, "Content not JavaScript: '"
              + TableUtil.toString(page.getContentType()) + "'", getConf());
    }
    String script = new String(page.getContent().array());
    Outlink[] outlinks = getJSLinks(script, "", url);
    if (outlinks == null)
      outlinks = new Outlink[0];
    // Title? use the first line of the script...
    String title;
    int idx = script.indexOf('\n');
    if (idx != -1) {
      if (idx > MAX_TITLE_LEN)
        idx = MAX_TITLE_LEN;
      title = script.substring(0, idx);
    } else {
      idx = Math.min(MAX_TITLE_LEN, script.length());
      title = script.substring(0, idx);
    }
    Parse parse = new Parse(script, title, outlinks,
        ParseStatusUtils.STATUS_SUCCESS);
    return parse;
  }

  /**
   * This method extracts URLs from literals embedded in JavaScript.
   */
  private static Outlink[] getJSLinks(String plainText, String anchor,
      String base) {
    long start = System.currentTimeMillis();

    // the base is always an absolute path like http://.../tree_nodes.js; remove the last file name
    base = base.substring(0, base.lastIndexOf('/'));
    final List<Outlink> outlinks = new ArrayList<Outlink>();
    URL baseURL = null;

    try {
      baseURL = new URL(base);
    } catch (Exception e) {
      if (LOG.isErrorEnabled()) {
        LOG.error("error assigning base URL", e);
      }
    }

    try {
      final PatternMatcher matcher = new Perl5Matcher();
      final PatternMatcherInput input = new PatternMatcherInput(plainText);

      MatchResult result;
      String url;
      // loop the matches
      while (matcher.contains(input, CV_TREE_NODE_LINK_PATTERN)) {
        // if this is taking too long, stop matching
        // (SHOULD really check cpu time used so that heavily loaded systems
        // do not unnecessarily hit this limit.)
        if (System.currentTimeMillis() - start >= 60000L) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Time limit exceeded for getJSLinks");
          }
          break;
        }
        result = matcher.getMatch();
        url = result.group(1);
        // See if candidate URL is parseable. If not, pass and move on to
        // the next match.
        try {
          url = new URL(toAbsolutePath(base, url)).toString();
          LOG.info("Extension added: " + url + " and baseURL " + baseURL);
        } catch (MalformedURLException ex) {
          if (LOG.isTraceEnabled()) {
            LOG.trace("Extension - failed URL parse '" + url + "' and baseURL '"
              + baseURL + "'", ex);
          }
          continue;
        }
        try {
          outlinks.add(new Outlink(url.toString(), anchor));
        } catch (MalformedURLException mue) {
          LOG.warn("Extension Invalid url: '" + url + "', skipping.");
        }
      }
    } catch (Exception ex) {
      if (LOG.isErrorEnabled()) {
        LOG.error("getJSLinks", ex);
      }
    }

    final Outlink[] retval;

    // create array of the Outlinks
    if (outlinks != null && outlinks.size() > 0) {
      retval = outlinks.toArray(new Outlink[0]);
    } else {
      retval = new Outlink[0];
    }

    return retval;
  }

  private static String toAbsolutePath(String baseUrl, String path)
      throws MalformedPatternException {
    final PatternMatcher matcher = new Perl5Matcher();

    final PatternMatcherInput input = new PatternMatcherInput(path);
    boolean isAbsolute = false;

    if (matcher.matches(input, ABSOLUTE_URL_PATTERN)) {
      isAbsolute = true;
    }

    if (isAbsolute) {
      return path;
    }
    while (true) {
      if (!path.startsWith("../")) {
        break;
      }
      baseUrl = baseUrl.substring(0, baseUrl.lastIndexOf('/'));
      path = path.substring(3);
    }
    // now path is relative, like foldera/fileb, with no leading '/'

    return baseUrl + "/" + path;
  }
}
Configuration
Then we need to include parse-tree-nodes-js in nutch-site.xml:

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-tree-nodes-js|parse-(html|tika|metatags)|index-(basic|static|metadata|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|subcollection</value>
</property>

Then change parse-plugins.xml to make Nutch use the parse-tree-nodes-js plugin to parse javascript files.
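A sketch of the parse-plugins.xml entries; the mime type here is an assumption, adjust it to whatever content type your server reports for .js files:

<mimeType name="application/x-javascript">
 <plugin id="parse-tree-nodes-js" />
</mimeType>

<aliases>
 <alias name="parse-tree-nodes-js"
  extension-id="org.jefferyyuan.codeexample.nutch.parse.js.treenodes.TreeNodesJSParseFilter" />
</aliases>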
Then we need to change regex-urlfilter.txt so Nutch handles javascript files: remove |js|JS from the following line.
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
At last, as we don't need to store content from javascript files in Solr, we can either write a Solr UpdateRequestProcessor to ignore a document whose url field ends with .js, or change org.apache.nutch.indexer.solr.SolrWriter.write(NutchDocument) like below:
public void write(NutchDocument doc) throws IOException {
    String urlValue = doc.getFieldValue("url");
    if(urlValue!=null && urlValue.endsWith(".js"))
    {
      LOG.trace("Extension ignore js file: " + urlValue);
      return;
    }
    // ... rest of the original write logic ...
}
References
http://wiki.apache.org/nutch/AboutPlugins
http://wiki.apache.org/nutch/WritingPluginExample
http://florianhartl.com/nutch-plugin-tutorial.html

Extend Waffle to Limit Which Windows User Can Access the Application


We want to use Windows integrated authentication to authenticate users, and only allow the user who started the application to access it.

We can easily implement this by extending Waffle.
Waffle does the Windows integrated authentication for us; after that, we just need to check whether the user name and domain of the logged-on user are the same as those of the account that started the application.
Implementation Code
Running after waffle.servlet.NegotiateSecurityFilter, the following filter checks whether the user name and domain of the remote user match those of the user who started the web application.
The complete code can be found at Github.
We can get the user who started the web application with the following code:
NTSystem system = new NTSystem();
String runasUser = system.getName();
String runasDomain = system.getDomain();

There are several ways to get the user name and domain of the remote logged-on user:
1. waffle.servlet.NegotiateSecurityFilter saves a waffle.servlet.WindowsPrincipal instance in the session. We can get the user name and domain info from the WindowsPrincipal.
request.getSession().setAttribute(PRINCIPAL_SESSION_KEY, windowsPrincipal);
2. waffle.servlet.NegotiateSecurityFilter adds the windowsPrincipal into a Subject, and saves the Subject into the session. From the Subject instance, we can get the needed info.
subject.getPrincipals().add(windowsPrincipal);
session.setAttribute("javax.security.auth.subject", subject);

package org.codeexample.jeffery.misc.waffle;

import java.io.IOException;
import java.security.Principal;
import java.util.Set;

import javax.security.auth.Subject;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import waffle.servlet.NegotiateSecurityFilter;
import waffle.servlet.WindowsPrincipal;

import com.sun.jna.platform.win32.Secur32;
import com.sun.jna.platform.win32.Secur32Util;
import com.sun.security.auth.module.NTSystem;

public class OnlyAllowUserStartItFilter implements Filter {
  protected static final Logger logger = LoggerFactory
      .getLogger(OnlyAllowUserStartItFilter.class);
  private static final String PRINCIPAL_SESSION_KEY = NegotiateSecurityFilter.class
      .getName() + ".PRINCIPAL";

  @Override
  public void init(FilterConfig filterConfig) throws ServletException {
    // nothing to configure
  }

  @Override
  public void destroy() {
  }

  @Override
  public void doFilter(ServletRequest sreq, ServletResponse sres,
      FilterChain chain) throws IOException, ServletException {
    HttpServletRequest request = (HttpServletRequest) sreq;
    boolean valid = false;    
    HttpSession session = request.getSession(false);
    if (session != null) {
      WindowsPrincipal winPrincipal = (WindowsPrincipal) session
          .getAttribute(PRINCIPAL_SESSION_KEY);
      valid = validateRemoteUser(winPrincipal);
    }
    if (!valid) {
      sendUnauthorized(sres);
    } else {
      chain.doFilter(sreq, sres);
    }
  }
  private boolean validateRemoteUserViaWinPrincipal(Subject subject) {
    boolean valid = false;
    Set<Principal> principals = subject.getPrincipals();
    WindowsPrincipal winPrincipal = null;
    for (Principal principal : principals) {
      if (principal instanceof WindowsPrincipal) {
        winPrincipal = (WindowsPrincipal) principal;
      }
    }
    valid = validateRemoteUser(winPrincipal);
    return valid;
  }
  private boolean validateRemoteUser(WindowsPrincipal winPrincipal) {
    boolean valid = false;
    if (winPrincipal != null) {
      String fqn = winPrincipal.getName();
      int atIdx = fqn.indexOf('\\');
      String remoteDomain = null, remoteUser = null;
      if (atIdx > -1) {
        remoteDomain = fqn.substring(0, atIdx);
        remoteUser = fqn.substring(atIdx + 1);
      } else {
        remoteUser = fqn;
      }
      NTSystem system = new NTSystem();
      valid = validDomain(remoteDomain, system)
          && validateUser(remoteUser, system);
    }
    return valid;
  }
  private boolean validateRemoteUserViaSecur32() {
    boolean valid = false;
    // NameSamCompatible returns "domain\user"
    String remoteUserInfo = Secur32Util
        .getUserNameEx(Secur32.EXTENDED_NAME_FORMAT.NameSamCompatible);
    if (remoteUserInfo != null) {
      int atIdx = remoteUserInfo.indexOf('\\');
      String remoteDomain = null, remoteUser = null;
      if (atIdx > -1) {
        remoteDomain = remoteUserInfo.substring(0, atIdx);
        remoteUser = remoteUserInfo.substring(atIdx + 1);
      } else {
        remoteUser = remoteUserInfo;
      }
      
      NTSystem system = new NTSystem();
      valid = validDomain(remoteDomain, system)
          && validateUser(remoteUser, system);
    }
    return valid;
  }
  
  private boolean validateUser(String remoteUser, NTSystem system) {
    boolean valid = false;
    String runasUser = system.getName();
    if (runasUser != null) {
      if (runasUser.equals(remoteUser)) {
        valid = true;
      }
    } else {
      // this is unlikely to happen
      logger.error("runasUser is null, remoteUser: " + remoteUser);
      if (remoteUser == null) {
        valid = true;
      }
    }
    return valid;
  }
  private boolean validDomain(String remoteDomain, NTSystem system) {
    boolean valid = false;
    String runasDomain = system.getDomain();
    if (runasDomain != null) {
      if (runasDomain.equalsIgnoreCase(remoteDomain)) {
        valid = true;
      }
    } else {
      if (remoteDomain == null) {
        valid = true;
      }
    }
    return valid;
  }
  private void sendUnauthorized(ServletResponse sres) throws IOException {
    HttpServletResponse response = (HttpServletResponse) sres;
    response.setHeader("Connection", "close");
    response.sendError(HttpServletResponse.SC_UNAUTHORIZED,
        "This application can be only accessed by user who started it.");
    response.flushBuffer();
  }
}
Define waffle.servlet.NegotiateSecurityFilter and OnlyAllowUserStartItFilter
Next, we need to define these 2 filters in web.xml. We must define waffle.servlet.NegotiateSecurityFilter first, and then OnlyAllowUserStartItFilter, so that NegotiateSecurityFilter runs first.

As we are using jetty, and all applications in the jetty instance need this feature, we define these 2 filters in our own webdefault.xml.
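A sketch of the web.xml (or webdefault.xml) entries; the url-patterns are assumptions, and the OnlyAllowUserStartItFilter class name comes from the code above. The filter-mapping order controls execution order:

<filter>
 <filter-name>NegotiateSecurityFilter</filter-name>
 <filter-class>waffle.servlet.NegotiateSecurityFilter</filter-class>
</filter>
<filter>
 <filter-name>OnlyAllowUserStartItFilter</filter-name>
 <filter-class>org.codeexample.jeffery.misc.waffle.OnlyAllowUserStartItFilter</filter-class>
</filter>

<filter-mapping>
 <filter-name>NegotiateSecurityFilter</filter-name>
 <url-pattern>/*</url-pattern>
</filter-mapping>
<filter-mapping>
 <filter-name>OnlyAllowUserStartItFilter</filter-name>
 <url-pattern>/*</url-pattern>
</filter-mapping>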

Solr: Use DocTransformer to Change Response


We have two documentation websites: one for internal test and one for production. The content of the internal and production sites is exactly the same.
I am using Nutch2 and Solr to crawl the internal documentation site and index it into a Solr 4.x server: make some changes, re-crawl, test and check the result. All is good.

Next I need to deploy the all-in-one package and Solr index to the production machine. (We package embedded jetty, the Solr server, Solr home and the index in one package to make it easy to deploy and test.)

Now I find that I need to change the values of two fields: url and urlfolder. The values in the index start with http://internalsite/doc; I need to change them all to http://externalsite/doc.

I could recrawl the production documentation site, but this would take some time and is not flexible.
It is better if I can reuse the internal site index unchanged, and just rewrite these 2 fields when returning the response.

Luckily, Solr 4.0 provides Document Transformers, which give us a way to modify the fields and documents that are returned to the user.
For example, [value] adds a constant value; there are also [docid], [shard] and [explain].
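For example, a request like the following (host and fields are just for illustration) returns the internal Lucene document id with each result:
http://localhost:8080/solr/select?q=*:*&fl=id,title,[docid]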

So I can write a document transformer to change the url and urlfolder fields after Solr executes the search, but before it returns the response to the client.
Implementation Code
The complete source code can be found at Github.
package org.codeexample.jeffery.solr.transform;
public class PrefixReplaceTransformerFactory extends TransformerFactory {
  private List<String> fieldNames = new ArrayList<String>();
  private List<String> fieldPrefixs = new ArrayList<String>();
  private List<String> fieldReplaces = new ArrayList<String>();
  private boolean enabled = false;
  protected static Logger logger = LoggerFactory
      .getLogger(PrefixReplaceTransformerFactory.class);

  @SuppressWarnings("rawtypes")
  @Override
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      enabled = params.getBool("enabled", false);
      String str = params.get("fields");
      if (str != null) {
        fieldNames = StrUtils.splitSmart(str, ',');
      }
      str = params.get("prefixes");
      if (str != null) {
        fieldPrefixs = StrUtils.splitSmart(str, ',');
      }
      str = params.get("replaces");
      if (str != null) {
        fieldReplaces = StrUtils.splitSmart(str, ',');
      }
      if (fieldPrefixs.size() != fieldReplaces.size())
        throw new RuntimeException(
            "Size of prefixes and replaces must be same, fieldPrefixs.size: "
                + fieldPrefixs.size() + ",fieldReplace.size: "
                + fieldReplaces.size());
    }
  }

  @Override
  public DocTransformer create(String field, SolrParams params,
      SolrQueryRequest req) {
    return new PrefixReplaceTransformer();
  }

  class PrefixReplaceTransformer extends DocTransformer {
    @Override
    public String getName() {
      return PrefixReplaceTransformer.class.getName();
    }

    @Override
    public void transform(SolrDocument doc, int docid) throws IOException {
      if (enabled) {
        for (int i = 0; i < fieldNames.size(); i++) {
          String fieldName = fieldNames.get(i);

          Object obj = doc.getFieldValue(fieldName);
          if (obj == null)
            continue;
          if (obj instanceof Field) {
            Field field = (Field) obj;
            String fieldValue = field.stringValue();

            boolean match = false;
            int j = 0;
            while (!match && j < fieldPrefixs.size()) {
              String prefix = fieldPrefixs.get(j);
              if (fieldValue.startsWith(prefix)) {
                match = true;
                fieldValue = fieldReplaces.get(j)
                    + fieldValue.substring(prefix.length());
                field.setStringValue(fieldValue);
              }
              ++j;
            }
          } else {
            logger.error("Should not happen: obj.type:" + obj.getClass());
          }
        }
      }
    }
  }
}
Register Document Transformer in SolrConfig.xml
Next we need to register the new doc transformer in solrconfig.xml, and add it to the request handler's invariants.
<transformer name="valuereplace"
 class="com.lifelongprogrammer.response.transform.CVPrefixReplaceTransformerFactory">
 <bool name="enabled">true</bool>
 <str name="fields">url,urlfolder,contentid</str>
 <str name="prefixes">http://internalsite/</str>
 <str name="replaces">http://externalsite/</str>
</transformer>

<requestHandler name="/searchdoc" class="solr.SearchHandler">
 <lst name="invariants">
  <str name="fl">title,url,score,[valuereplace]</str>
 </lst>
</requestHandler>
References
http://wiki.apache.org/solr/DocTransformers
http://solr.pl/en/2011/12/05/solr-4-0-doctransformers-first-look/

Solr: Escape Special Characters when Importing Data


We are importing XML (CSV) data via a curl GET request; to make it work, we need to escape special characters: XML special characters and URL special characters.

We first need to escape the XML special characters & < > " ' to &amp; &lt; &gt; &quot; &apos;. In code, we can use org.apache.commons.lang.StringEscapeUtils.escapeXml(String).

Then we use java.net.URLEncoder.encode(String, String) to escape URL special characters, especially $ & + , / : ; = ? @.
URLEncoder.encode will also convert the line break (\r\n) to %0D%0A.

For example, if the field content includes the following 2 lines of data:
xml special: & < > " '
url special: $ & + , / : ; = ? @

The Curl Get request to import the data would be like below:
http://localhost:8080/solr/update?stream.body=<add><doc><field name="id">id1</field><field name="content">xml+special%3A+%26amp%3B+%26lt%3B+%26gt%3B+%26quot%3B+%26apos%3B%0D%0Aurl+special%3A+%24+%26amp%3B+%2B+%2C+%2F+%3A+%3B+%3D+%3F+%40</field></doc></add>&commit=true
Code to convert the XML field data:
private String escapeXmlEncodeUrl(String str)
    throws UnsupportedEncodingException {
  return URLEncoder.encode(StringEscapeUtils.escapeXml(str), "UTF-8");
}
From org.apache.solr.client.solrj.util.ClientUtils.escapeQueryChars,
we know that we need to escape (prefix with \) the following special characters in a query string: \ + - ! ( ) : ^ [ ] " { } ~ * ? | & ; / and whitespace.
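A small usage sketch with SolrJ's ClientUtils, assuming solr-solrj is on the classpath:

import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeQueryDemo {
  public static void main(String[] args) {
    // backslash-escapes the query metacharacters listed above
    String escaped = ClientUtils.escapeQueryChars("title:(a+b)");
    System.out.println(escaped); // prints title\:\(a\+b\)
  }
}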
Resources
Online XML Escape
Online URL Encoder/Decoder
RFC 1738: Uniform Resource Locators (URL) specification
http://www.xmlnews.org/docs/xml-basics.html
