The Goal
In my latest project, I use crawler4j to crawl websites and then want to add a short summary to each article.
After a Google search I found the Solr JIRA issue SOLR-3975 (Document Summarization toolkit, using LSA techniques) and the author's article series (Document Summarization with LSA #1: Introduction) describing how it works.
The patch is not checked in yet, but it works fine for me.
So I based my work on it: use boilerpipe to extract the main content of the web page, then use SOLR-3975 to pick out the most important sentences.
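For reference, boilerpipe can be tried on its own before wiring it into Solr. Below is a minimal sketch using the same ArticleExtractor the processor later relies on; the HTML string is just an invented example:

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
  public static void main(String[] args) throws BoilerpipeProcessingException {
    String html = "<html><body>"
        + "<div id=\"nav\"><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
        + "<div id=\"article\"><p>Investigators said a five-mile debris field "
        + "indicates the spacecraft broke up in flight.</p></div>"
        + "</body></html>";
    // ArticleExtractor keeps the densest text blocks and drops likely boilerplate
    // such as navigation.
    String mainContent = ArticleExtractor.getInstance().getText(html);
    System.out.println(mainContent);
  }
}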
Normalize Html Text and Get Main Content: MainContentUpdateProcessorFactory
First, I use Jsoup to normalize the HTML text: remove links, as they are usually used for navigation or contain JavaScript code, and also remove invisible blocks (elements matching [style~=display:\\s*none]).
To help SOLR-3975 split sentences correctly, I append a period (.) to div, span, and textarea elements whose own text doesn't already end with a period.
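As a side note, the same normalization rules can be exercised outside Solr; here is a small standalone sketch with made-up sample HTML:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NormalizeDemo {
  public static void main(String[] args) {
    String html = "<div>First paragraph</div>"
        + "<a href=\"/nav\">navigation link</a>"
        + "<span style=\"display: none\">hidden tracking text</span>"
        + "<div>Already ends with a period.</div>";
    Document doc = Jsoup.parse(html);
    // drop links and invisible blocks
    doc.select("a, [style~=display:\\s*none]").remove();
    // append a period to block text that doesn't end with one
    for (Element el : doc.select("textarea, span, div")) {
      String ownText = el.ownText();
      if (el.childNodeSize() == 1 && !ownText.trim().isEmpty() && !ownText.endsWith(".")) {
        el.html(el.html() + ".");
      }
    }
    System.out.println(doc.text()); // First paragraph. Already ends with a period.
  }
}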
<processor class="org.lifelongprogrammer.solr.update.processor.MainContentUpdateProcessorFactory">
  <str name="fromField">rawcontent</str>
  <str name="mainContentField">maincontent</str>
</processor>

It reads fromField, which contains the raw content of the web page, and stores the extracted main content in mainContentField.
public class MainContentUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  private String fromField;
  private String mainContentField;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      fromField = Preconditions.checkNotNull(params.get("fromField"),
          "Have to set fromField");
      mainContentField = Preconditions.checkNotNull(params.get("mainContentField"),
          "Have to set mainContentField");
    }
  }

  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
      UpdateRequestProcessor next) {
    return new MainContentUpdateProcessor(req, next, fromField, mainContentField);
  }

  private static class MainContentUpdateProcessor extends UpdateRequestProcessor {
    private String fromField;
    private String mainContentField;
    private ArticleExtractor articleExtractor;

    public MainContentUpdateProcessor(SolrQueryRequest req, UpdateRequestProcessor next,
        String fromField, String mainContentField) {
      super(next);
      this.fromField = fromField;
      this.mainContentField = mainContentField;
      articleExtractor = ArticleExtractor.getInstance();
    }

    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      Object obj = doc.getFieldValue(fromField);
      if (obj != null) {
        try {
          String text = obj.toString();
          text = normalize(text);
          String mainContent = articleExtractor.getText(text);
          // strip any remaining tags so only plain text is stored
          Document jsoupDoc = Jsoup.parse(mainContent);
          mainContent = jsoupDoc.text();
          doc.addField(mainContentField, mainContent);
        } catch (BoilerpipeProcessingException e) {
          throw new IOException(e);
        }
      }
      super.processAdd(cmd);
    }

    private String normalize(String text) {
      Document doc = Jsoup.parse(text);
      // remove links and invisible blocks
      doc.select("a, [style~=display:\\s*none]").remove();
      // append a period to block-level text that doesn't end with one,
      // so the sentence splitter sees proper sentence boundaries
      Elements divs = doc.select("textarea, span, div");
      for (Element tmp : divs) {
        String html = tmp.html();
        if (tmp.childNodeSize() == 1) {
          String ownText = tmp.ownText();
          if (ownText != null && !ownText.trim().equals("") && !ownText.endsWith(".")) {
            html += ".";
            tmp.html(html);
          }
        }
      }
      return doc.html();
    }
  }
}

Get Summarization
Define DocumentSummaryUpdateProcessorFactory in solrconfig.xml
Let's first look at the definition of DocumentSummaryUpdateProcessorFactory:
<processor class="org.lifelongprogrammer.solr.update.processor.DocumentSummaryUpdateProcessorFactory">
  <str name="summary.type">text_lsa</str>
  <str name="summary.fromField">maincontent</str>
  <str name="summary.summaryField">summary</str>
  <str name="summary.hl_start"/>
  <str name="summary.hl_end"/>
  <bool name="summary.simpleformat">true</bool>
  <int name="summary.count">3</int>
</processor>

It reads summary.fromField (maincontent in this case), extracts the most important summary.count (3) sentences, and puts them into summary.summaryField (summary in this case). summary.hl_start and summary.hl_end are empty because we just want the plain text and don't want HTML tags (such as em or b) to highlight important words.
summary.simpleformat is an internally used argument that tells the summarizer to return only the highlighted section: no stats, terms, or sentences sections.
DocumentSummaryUpdateProcessorFactory
Some web pages define og:description, which gives a one- or two-sentence description, so we can use it directly.
If og:description is defined, we only ask the summarizer for the summary.count (3) - 1 = 2 most important additional sentences.
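The processor below assumes the crawler has already copied the page's og:description meta tag into an og:description field on the Solr document. If you need to populate that field yourself, a hypothetical Jsoup-based helper could look like this (the field name is the one used in this post; everything else is illustrative):

import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;

public class OgDescriptionHelper {
  // Copies the og:description meta tag, if present, from the raw HTML
  // into the Solr input document.
  public static void addOgDescription(String rawHtml, SolrInputDocument doc) {
    String og = Jsoup.parse(rawHtml)
        .select("meta[property=og:description]")
        .attr("content");
    if (!og.trim().isEmpty()) {
      doc.addField("og:description", og.trim());
    }
  }
}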
public class DocumentSummaryUpdateProcessorFactory extends UpdateRequestProcessorFactory
    implements SolrCoreAware {
  private SummarizerOutputFormat outputFormat;
  private Map<String,String> summarizerParams = new HashMap<>();
  private Analyzer analyzer;

  public void init(NamedList args) {
    super.init(args);
    if (args != null) {
      SolrParams params = SolrParams.toSolrParams(args);
      Iterator<String> it = params.getParameterNamesIterator();
      String prefix = "summary.";
      // collect all summary.* parameters, with the prefix stripped
      while (it.hasNext()) {
        String paramName = it.next();
        if (paramName.startsWith(prefix)) {
          summarizerParams.put(paramName.substring(prefix.length()), params.get(paramName));
        }
      }
      outputFormat = getSummarizeOutputFormat(summarizerParams);
    }
  }

  public void inform(SolrCore core) {
    analyzer = getAnalyzer(core, summarizerParams);
  }

  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
      UpdateRequestProcessor next) {
    return new DocumentSummaryUpdateProcessor(next, req, analyzer, summarizerParams, outputFormat);
  }

  private Analyzer getAnalyzer(SolrCore core, Map<String,String> params) {
    FieldType fType = null;
    if (params.containsKey("type")) {
      fType = core.getSchema().getFieldTypeByName(params.get("type"));
      if (fType == null) {
        throw new IllegalArgumentException("field type not found: " + params.get("type"));
      } else {
        return fType.getAnalyzer();
      }
    } else if (params.containsKey("fl")) {
      fType = core.getSchema().getFieldType(params.get("fl"));
      if (fType == null) {
        throw new IllegalArgumentException("field not found: " + params.get("fl"));
      } else {
        return fType.getAnalyzer();
      }
    } else {
      throw new IllegalArgumentException("need field name or type");
    }
  }

  private SummarizerOutputFormat getSummarizeOutputFormat(Map<String,String> params) {
    SummarizerOutputFormat outputFormat = new SummarizerOutputFormat();
    boolean simpleformat = false;
    if (params.containsKey("simpleformat")) {
      simpleformat = Boolean.parseBoolean(params.remove("simpleformat"));
    }
    outputFormat.setHighlightedOnly(simpleformat);
    int count = -1;
    if (params.containsKey("count")) {
      count = Integer.parseInt(params.remove("count"));
    }
    outputFormat.setHighlightedCount(count);
    return outputFormat;
  }

  private static class DocumentSummaryUpdateProcessor extends UpdateRequestProcessor {
    private SolrQueryRequest req;
    private SummarizerOutputFormat outputFormat;
    private Analyzer analyzer;
    private String fromField;
    private String summaryField;
    private SchemaSummarizer summarizer;

    public DocumentSummaryUpdateProcessor(UpdateRequestProcessor next, SolrQueryRequest req,
        Analyzer analyzer, Map<String,String> summarizerParams,
        SummarizerOutputFormat outputFormat) {
      super(next);
      this.req = req;
      this.analyzer = analyzer;
      this.outputFormat = outputFormat;
      fromField = Preconditions.checkNotNull(summarizerParams.get("fromField"),
          "have to set fromField");
      summaryField = Preconditions.checkNotNull(summarizerParams.get("summaryField"),
          "have to set summaryField");
      summarizer = new SchemaSummarizer(summarizerParams, Locale.getDefault());
    }

    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.solrDoc;
      // use og:description first if the page defines it
      String og_description = null;
      Object obj = doc.getFieldValue("og:description");
      int count = 0;
      if (obj != null) {
        og_description = obj.toString();
        doc.addField(summaryField, og_description);
        ++count;
      }
      obj = doc.getFieldValue(fromField);
      if (obj != null) {
        NamedList summary = doSummary(summarizer, analyzer, obj.toString(), req.getParams());
        NamedList highlighted = (NamedList) summary.get("highlighted");
        List<NamedList> list = highlighted.getAll("sentence");
        // fill the remaining slots with the top sentences, skipping duplicates
        for (NamedList<Object> sentence : list) {
          if (count < outputFormat.getHighlightedCount()) {
            String value = sentence.get("text").toString();
            if (value.equals(og_description)) continue;
            ++count;
            doc.addField(summaryField, value);
          } else {
            break;
          }
        }
      }
      super.processAdd(cmd);
    }

    private NamedList<Object> doSummary(Summarizer sz, Analyzer analyzer, String text,
        SolrParams solrParams) throws IOException {
      long start = System.currentTimeMillis();
      sz.startSummary();
      sz.addDocument(text, analyzer);
      NamedList<Object> summary = new NamedList<Object>();
      sz.finishSummary(summary, outputFormat, start);
      return summary;
    }
  }
}

Summarizer in Action
Now let's use our crawler to crawl one web page, "Official: Debris Sign of Spaceship Breaking Up", and check the summarization.
curl "http://localhost:23456/solr/crawler/crawler?action=start&seeds=http://abcnews.go.com/Health/wireStory/investigators-branson-spacecraft-crash-site-26619288&maxCount=1&constants=cat:news"
The summaries saved in the doc:
<arr name="summary">
  <str>
    Investigators looking into what caused the crash of a Virgin Galactic prototype spacecraft that killed one of two test pilots said a 5-mile path of debris across the California desert indicates the aircraft broke up in flight. "When the wreckage is dispersed like that, it indicates the...
  </str>
  <str>
    "We are determined to find out what went wrong," he said, asserting that safety has always been the top priority of the program that envisions taking wealthy tourists six at a time to the edge of space for a brief experience of weightlessness and a view of Earth below.
  </str>
  <str>
    In grim remarks at the Mojave Air and Space Port, where the craft known as SpaceShipTwo was under development, Branson gave no details of Friday's accident and deferred to the NTSB, whose team began its first day of investigation Saturday.
  </str>
</arr>

The first entry is the og:description defined in the web page; the other two are the two most important sentences the summarizer found.
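If the summary field is stored (an assumption about the schema here, since the post doesn't show it), you can also pull the result back with a plain select query against the same core:
curl "http://localhost:23456/solr/crawler/select?q=cat:news&fl=summary&wt=xml"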
References
SOLR-3975: Document Summarization toolkit, using LSA techniques
Document Summarization with LSA #1: Introduction