Solr: Using Classifier to Categorize Articles

The Goal
In my latest project, I use crawler4j to crawl websites and a Solr summarizer to add a summary to each article.
Now I want to use Solr Classification to sort articles into categories such as Java, Linux, News, etc.

Using Solr Classifier
Using Solr Classification takes two steps: train the classifier on docs with a known category, then let it classify new docs.

Train
First we add docs with a known category. We can crawl known websites and set the cat field accordingly: for example, assign java to articles from javarevisited, linux to articles from linuxcommando, solr to articles from solrpl, and so on.
localhost:23456/solr/crawler/crawler?action=create,start&name=linuxcommando.blogspot&seeds=http://linuxcommando.blogspot.com/&maxCount=50&parsePaths=http://linuxcommando.blogspot.com/\d{4}/\d{2}/.*&constants=cat:linux

localhost:23456/solr/crawler/crawler?action=create,start&name=javarevisited.blogspot&seeds=http://javarevisited.blogspot.com/&maxCount=50&parsePaths=http://javarevisited.blogspot.com/\d{4}/\d{2}/.*&constants=cat:java

localhost:23456/solr/crawler/crawler?action=create,start&name=solrpl&seeds=http://solr.pl/en/&maxCount=50&parsePaths=http://solr.pl/en/\d{4}/\d{2}/.*&constants=cat:solr
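
To make the train-then-classify idea concrete, here is a toy naive Bayes classifier in plain Java. It is only a sketch of the general technique, not Lucene's SimpleNaiveBayesClassifier: the class name ToyNaiveBayes, the tokenizer, and the add-one smoothing are my own assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial naive Bayes; a hypothetical illustration, not Lucene code. */
class ToyNaiveBayes {
  // category -> number of training docs
  private final Map<String, Integer> docCounts = new HashMap<>();
  // category -> (word -> occurrences in that category)
  private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
  // category -> total number of words seen in that category
  private final Map<String, Integer> totalWords = new HashMap<>();
  private final Set<String> vocabulary = new HashSet<>();
  private int totalDocs = 0;

  // Very rough tokenizer: lowercase, split on anything that is not a-z or 0-9.
  private static String[] tokenize(String text) {
    return text.toLowerCase().split("[^a-z0-9]+");
  }

  /** Step 1: add a doc whose category is already known. */
  void train(String category, String text) {
    docCounts.merge(category, 1, Integer::sum);
    totalDocs++;
    Map<String, Integer> counts =
        wordCounts.computeIfAbsent(category, k -> new HashMap<>());
    for (String word : tokenize(text)) {
      if (word.isEmpty()) continue;
      counts.merge(word, 1, Integer::sum);
      totalWords.merge(category, 1, Integer::sum);
      vocabulary.add(word);
    }
  }

  /** Step 2: pick the most likely category, or null if nothing was trained. */
  String classify(String text) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String category : docCounts.keySet()) {
      // log prior + sum of log likelihoods with add-one smoothing
      double score = Math.log(docCounts.get(category) / (double) totalDocs);
      Map<String, Integer> counts = wordCounts.get(category);
      int total = totalWords.getOrDefault(category, 0);
      for (String word : tokenize(text)) {
        if (word.isEmpty()) continue;
        int count = counts.getOrDefault(word, 0);
        score += Math.log((count + 1.0) / (total + vocabulary.size()));
      }
      if (score > bestScore) {
        bestScore = score;
        best = category;
      }
    }
    return best;
  }
}
```

In the real setup, the "train" calls correspond to the crawled docs that already carry a cat value, and "classify" corresponds to what the update processor does for new docs.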

Solr ClassfierUpdateProcessorFactory
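import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.Classifier;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

import com.google.common.base.Preconditions;

public class ClassfierUpdateProcessorFactory extends
    UpdateRequestProcessorFactory {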
  private boolean defaultDoClassifer;
  private String formField;
  private String catField;
  Classifier<BytesRef> classifier = null;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      defaultDoClassifer = params.getBool("doClassifer", false);
      if (defaultDoClassifer) {
        formField = Preconditions.checkNotNull(params.get("fromField"),
            "Have to set fromField");
        catField = Preconditions.checkNotNull(params.get("catField"),
            "Have to set catField");
        
        String classifierStr = params.get("classifier", "simpleNaive");
        if ("simpleNaive".equals(classifierStr)) {
          classifier = new SimpleNaiveBayesClassifier();
        } else if ("knearest".equalsIgnoreCase(classifierStr)) {
          classifier = new KNearestNeighborClassifier(10);
        } else {
          // Report the configured name; the classifier field is still null here.
          throw new IllegalArgumentException("Unsupported classifier: "
              + classifierStr);
        }
      }
    }
  }
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new ClassfierUpdateProcessor(req, next);
  }
  
  private class ClassfierUpdateProcessor extends UpdateRequestProcessor {
    private SolrQueryRequest req;
    public ClassfierUpdateProcessor(SolrQueryRequest req,
        UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrParams params = req.getParams();
      boolean doClassifer = params.getBool("doClassifer", false);
      
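      // The classifier is only created when doClassifer is enabled in
      // solrconfig.xml, so guard against a null classifier here.
      if (doClassifer && classifier != null) {
        // Retrain on the current index, then classify the incoming doc.
        // Retraining on every add is simple but expensive.
        // train() throws IOException, which processAdd already declares.
        classifier.train(req.getSearcher().getAtomicReader(), formField,
            catField, new StandardAnalyzer(Version.LUCENE_42));
        SolrInputDocument doc = cmd.solrDoc;
        Object obj = doc.getFieldValue(formField);
        if (obj != null) {
          String text = obj.toString();
          ClassificationResult<BytesRef> result = classifier
              .assignClass(text);

          String classified = result.getAssignedClass().utf8ToString();
          doc.addField(catField, classified);
        }
      }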
      super.processAdd(cmd);
    } 
  } 
}
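
The processor above retrains the classifier on every added doc, which gets slow as the index grows. One possible way to soften that, sketched here under my own assumptions, is to retrain at most once per time interval. ThrottledTrainer and trainIfStale are hypothetical names, not part of Solr or Lucene, and the clock is injected only so the behaviour is easy to test.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: runs an expensive training step at most once per interval. */
class ThrottledTrainer {
  private final long minIntervalMillis;
  private final LongSupplier clock; // e.g. System::currentTimeMillis
  private long lastTrainedAt;
  private boolean trainedOnce = false;

  ThrottledTrainer(long minIntervalMillis, LongSupplier clock) {
    this.minIntervalMillis = minIntervalMillis;
    this.clock = clock;
  }

  /**
   * Runs trainStep if it has never run, or if at least minIntervalMillis
   * has passed since the last run. Returns true when the step was run.
   */
  synchronized boolean trainIfStale(Runnable trainStep) {
    long now = clock.getAsLong();
    if (trainedOnce && now - lastTrainedAt < minIntervalMillis) {
      return false; // the model is still considered fresh
    }
    trainStep.run();
    lastTrainedAt = now;
    trainedOnce = true;
    return true;
  }
}
```

In processAdd, the call to classifier.train(...) would go inside the Runnable, with its IOException wrapped in an UncheckedIOException. The trade-off is that docs indexed between two training runs do not influence classification until the next run.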
solrconfig.xml
Please check the previous post for the implementation of MainContentUpdateProcessorFactory.
<updateRequestProcessorChain name="crawlerUpdateChain">
  <processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
    <str name="fromField">rawcontent</str>
    <str name="mainContentField">maincontent</str>      
  </processor>

  <processor class="org.lifelongprogrammer.solr.update.processor.ClassfierUpdateProcessorFactory">
    <bool name="doClassifer">true</bool>
    <str name="fromField">maincontent</str>
    <str name="catField">cat</str>
  </processor>
  
  <processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory" >
  </processor>

  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<field name="rawcontent" type="text" indexed="false" stored="true" multiValued="true" />
<field name="maincontent" type="text" indexed="true" stored="true" multiValued="true" />
<field name="cat" type="string" indexed="true" stored="true" multiValued="true" />
<field name="summary" type="text_rev" indexed="true" stored="true" multiValued="true" />
Test Solr Classifier
Next, when we crawl a website that covers multiple categories, we can use Solr Classification to assign a category to each article.

For example, let's crawl lifelongprogrammer.blogspot
localhost:23456/solr/crawler/crawler?action=create,start&name=lifelongprogrammer.blogspot&seeds=http://lifelongprogrammer.blogspot.com/&maxCount=50&parsePaths=http://lifelongprogrammer.blogspot.com/\d{4}/\d{2}/.*&doClassifer=true

Because we set doClassifer=true, ClassfierUpdateProcessorFactory calls the Solr classifier to assign a label to the category field (cat).
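
The parsePaths value contains regex characters such as \ and {, which some HTTP clients reject unless they are URL-encoded. Here is a minimal sketch that builds the same classification request programmatically. The parameter names are the ones used in the URL above; CrawlRequestUrl and its methods are hypothetical helpers I made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds a URL-encoded request for the crawler handler. */
class CrawlRequestUrl {

  /** Appends the params to baseUrl as an encoded query string, keeping map order. */
  static String build(String baseUrl, Map<String, String> params) {
    StringBuilder sb = new StringBuilder(baseUrl);
    char separator = '?';
    for (Map.Entry<String, String> entry : params.entrySet()) {
      sb.append(separator)
          .append(encode(entry.getKey()))
          .append('=')
          .append(encode(entry.getValue()));
      separator = '&';
    }
    return sb.toString();
  }

  private static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 should always be available", e);
    }
  }

  /** Builds a crawl request that asks the update chain to classify each article. */
  static String classifyCrawl(String name, String seed, int maxCount,
      String parsePaths) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("action", "create,start");
    params.put("name", name);
    params.put("seeds", seed);
    params.put("maxCount", String.valueOf(maxCount));
    params.put("parsePaths", parsePaths);
    params.put("doClassifer", "true");
    return build("http://localhost:23456/solr/crawler/crawler", params);
  }
}
```

For the training crawls, the same build method could be used with a constants entry (for example cat:linux) instead of doClassifer.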

From the result, we can see that some articles are assigned to java, some to linux, and some to solr.

About Accuracy
The accuracy of Solr Classification is worse than Mahout's, but it is much faster, and it is good enough for my application.
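
I did not measure the accuracy precisely. One simple way to put a number on it, assuming you hold back some docs whose category you already know, is to compare the known labels with the assigned ones. AccuracyCheck below is a hypothetical helper, not part of Solr or Mahout.

```java
/** Hypothetical helper: fraction of docs whose predicted category matches the known one. */
class AccuracyCheck {
  static double accuracy(String[] expected, String[] predicted) {
    if (expected.length != predicted.length) {
      throw new IllegalArgumentException("Both arrays must have the same length");
    }
    if (expected.length == 0) {
      return 0.0; // nothing to evaluate
    }
    int correct = 0;
    for (int i = 0; i < expected.length; i++) {
      if (expected[i].equals(predicted[i])) {
        correct++;
      }
    }
    return correct / (double) expected.length;
  }
}
```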


References
[SOLR-3975] Document Summarization toolkit, using LSA techniques
Comparing Document Classification Functions of Lucene and Mahout
Text categorization with Lucene and Solr