The Goal
In my latest project, I use crawler4j to crawl websites and a Solr document summarizer to add a summary to each article.
Now I would like to use Solr Classification to categorize the articles into different categories, such as Java, Linux, News, etc.
Using Solr Classifier
There are two steps when using Solr Classification: first train the classifier with documents whose category is already known, then let it assign a category to new documents as they are indexed.
Train
First we add documents with a known category. We can crawl known websites and set the cat field with a constant: for example, assign java for articles from javarevisited, linux for articles from linuxcommando, and solr for articles from solr.pl.
localhost:23456/solr/crawler/crawler?action=create,start&name=linuxcommando.blogspot&seeds=http://linuxcommando.blogspot.com/&maxCount=50&parsePaths=http://linuxcommando.blogspot.com/\d{4}/\d{2}/.*&constants=cat:linux
localhost:23456/solr/crawler/crawler?action=create,start&name=javarevisited.blogspot&seeds=http://javarevisited.blogspot.com/&maxCount=50&parsePaths=http://javarevisited.blogspot.com/\d{4}/\d{2}/.*&constants=cat:java
localhost:23456/solr/crawler/crawler?action=create,start&name=solrpl&seeds=http://solr.pl/en/&maxCount=50&parsePaths=http://solr.pl/en/\d{4}/\d{2}/.*&constants=cat:solr
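Before classifying anything, it is worth confirming that the training crawls actually filled the cat field as expected. Here is a minimal SolrJ sketch, assuming SolrJ 4.x is on the classpath and the crawler core is reachable at the same address used above; the class name is just for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TrainingSetCheck {
  public static void main(String[] args) throws Exception {
    // Count how many training documents ended up in each category.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:23456/solr/crawler");
    for (String cat : new String[] {"java", "linux", "solr"}) {
      SolrQuery q = new SolrQuery("cat:" + cat);
      q.setRows(0); // we only need the count
      QueryResponse rsp = solr.query(q);
      System.out.println(cat + ": " + rsp.getResults().getNumFound() + " training docs");
    }
    solr.shutdown();
  }
}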
Solr ClassfierUpdateProcessorFactory
package org.lifelongprogrammer.solr.update.processor;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.Classifier;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

import com.google.common.base.Preconditions;

/**
 * Trains a Lucene classifier on the documents already in the index and uses it
 * to fill the category field of each newly added document.
 */
public class ClassfierUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  private boolean defaultDoClassifer;
  private String fromField;
  private String catField;
  private Classifier<BytesRef> classifier = null;

  @Override
  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      defaultDoClassifer = params.getBool("doClassifer", false);
      if (defaultDoClassifer) {
        // fromField holds the text to classify, catField receives the assigned category
        fromField = Preconditions.checkNotNull(params.get("fromField"), "Have to set fromField");
        catField = Preconditions.checkNotNull(params.get("catField"), "Have to set catField");
        String classifierStr = params.get("classifier", "simpleNaive");
        if ("simpleNaive".equalsIgnoreCase(classifierStr)) {
          classifier = new SimpleNaiveBayesClassifier();
        } else if ("knearest".equalsIgnoreCase(classifierStr)) {
          classifier = new KNearestNeighborClassifier(10);
        } else {
          throw new IllegalArgumentException("Unsupported classifier: " + classifierStr);
        }
      }
    }
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
      UpdateRequestProcessor next) {
    return new ClassfierUpdateProcessor(req, next);
  }

  private class ClassfierUpdateProcessor extends UpdateRequestProcessor {
    private final SolrQueryRequest req;

    public ClassfierUpdateProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
      super(next);
      this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrParams params = req.getParams();
      // doClassifer can be turned on or off per request
      boolean doClassifer = params.getBool("doClassifer", false);
      if (doClassifer) {
        // train on the documents already in the index, then classify the new document's text
        classifier.train(req.getSearcher().getAtomicReader(), fromField, catField,
            new StandardAnalyzer(Version.LUCENE_42));
        SolrInputDocument doc = cmd.solrDoc;
        Object obj = doc.getFieldValue(fromField);
        if (obj != null) {
          ClassificationResult<BytesRef> result = classifier.assignClass(obj.toString());
          String classified = result.getAssignedClass().utf8ToString();
          doc.addField(catField, classified);
        }
      }
      super.processAdd(cmd);
    }
  }
}
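The factory retrains the classifier from the documents already in the index every time it processes an add. To experiment with the underlying Lucene classifier outside of the update chain, a standalone sketch might look like the following; the index path, class name, and sample text are made up, and the index must already contain indexed maincontent and cat fields:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class ClassifierSandbox {
  public static void main(String[] args) throws Exception {
    // Hypothetical local index containing the training documents.
    Directory dir = FSDirectory.open(new File("./solr-index"));
    DirectoryReader reader = DirectoryReader.open(dir);
    try {
      SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
      // Train on the indexed documents: text from "maincontent", label from "cat".
      classifier.train(SlowCompositeReaderWrapper.wrap(reader), "maincontent", "cat",
          new StandardAnalyzer(Version.LUCENE_42));
      // Classify a new piece of text and print the assigned category.
      ClassificationResult<BytesRef> result =
          classifier.assignClass("how to use grep and awk on the command line");
      System.out.println("assigned: " + result.getAssignedClass().utf8ToString()
          + " score: " + result.getScore());
    } finally {
      reader.close();
    }
  }
}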
solrconfig.xml
Please check the previous post about the implementation of MainContentUpdateProcessorFactory; the ClassfierUpdateProcessorFactory is registered in the same update chain:
<updateRequestProcessorChain name="crawlerUpdateChain">
  <processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
    <str name="fromField">rawcontent</str>
    <str name="mainContentField">maincontent</str>
  </processor>
  <processor class="org.lifelongprogrammer.solr.update.processor.ClassfierUpdateProcessorFactory">
    <bool name="doClassifer">true</bool>
    <str name="fromField">maincontent</str>
    <str name="catField">cat</str>
  </processor>
  <processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory" />
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
schema.xml
<field name="rawcontent" type="text" indexed="false" stored="true" multiValued="true" /> <field name="maincontent" type="text" indexed="true" stored="true" multiValued="true" /> <field name="cat" type="string" indexed="true" stored="true" multiValued="true" /> <field name="summary" type="text_rev" indexed="true" stored="true" multiValued="true" />Test Solr Classifier
Next, when we crawl a website that contains articles from multiple categories, we can use Solr Classification to assign a category to each article.
For example, let's crawl lifelongprogrammer.blogspot
localhost:23456/solr/crawler/crawler?action=create,start&name=lifelongprogrammer.blogspot&seeds=http://lifelongprogrammer.blogspot.com/&maxCount=50&parsePaths=http://lifelongprogrammer.blogspot.com/\d{4}/\d{2}/.*&doClassifer=true
Because we set doClassifer=true, the ClassfierUpdateProcessorFactory will call the Solr classifier to assign a label to the category field.
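To see how the crawled documents are spread across categories, we can facet on the cat field. A small SolrJ sketch, again assuming SolrJ 4.x and the same crawler core; note that it counts the training documents as well:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CategoryDistribution {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:23456/solr/crawler");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("cat"); // the category field filled in by the classifier
    QueryResponse rsp = solr.query(q);
    for (FacetField.Count c : rsp.getFacetField("cat").getValues()) {
      System.out.println(c.getName() + ": " + c.getCount());
    }
    solr.shutdown();
  }
}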
From the result, we can see that some articles are assigned to Java, some to Linux, and some to Solr.
About Accuracy
The accuracy of Solr Classification is worse than Mahout's, but its performance is much better, and it is good enough for my application.
References
[SOLR-3975] Document Summarization toolkit, using LSA techniques
Comparing Document Classification Functions of Lucene and Mahout
Text categorization with Lucene and Solr