Solr: Use UpdateRequestProcessor to Round Data

We can extend UpdateRequestProcessor to extend Solr to do many things, clean data, transform date, etc.

Sometimes, we need round the passed in data, for example: a date value, 2012-12-21T12:12:12.234Z, customer may only cares about date part, doesn't care about hour, minute parts.

So to reduce index size, and improve query performance, we can use UpdateRequestProcessor round date to 2012-12-21T00:00:00Z.
In solrconfig.xml, we can configure a processor to specify round what fields to what format, in the following code, we round them to only keep date part.
<updateRequestProcessorChain name="dateRoundChain">
  <processor class="solr.LogUpdateProcessorFactory" />
  </processor>
  <processor class="org.codeexample.jeffery.solr.DateRoundProcessorFactory" >
   <bool name="ignoreError">true</bool>
   <str name="date.fields">access_time,modify_time,mtm</str>
   <str name="date.round.fields">day,day,day</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

  <requestHandler name="/import/csv" class="solr.CSVRequestHandler">
  <lst name="defaults">
   <str name="stream.contentType">application/csv</str>
   <str name="update.chain">dateRoundChain</str>
  </lst>
 </requestHandler>

The code is like below:
It now only support rounding date to only keep date or second parts, but you can easily add code to round date to only keep year, month, hour, minute part.
package org.codeexample.jeffery.solr;
public class DateRoundProcessorFactory extends UpdateRequestProcessorFactory {

	private List<String> dateFields;
	private List<String> dateRoundFields;
	// ignoreError
	private boolean ignoreError;

	private static String ROUND_DAY = "DAY";
	private static String FORMAT_DAY = "yyyy-MM-dd'T'00:00:00.0'Z'";

	// yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
	private static String ROUND_SECOND = "SECOND";
	private static String FORMAT_SECOND = "yyyy-MM-dd'T'HH:mm:ss'Z'";

	@SuppressWarnings("rawtypes")
	@Override
	public void init(final NamedList args) {
		if (args != null) {
			SolrParams params = SolrParams.toSolrParams(args);
			Object fields = args.get("date.fields");
			dateFields = fields == null ? null : StrUtils.splitSmart(
					(String) fields, ",", true);

			fields = args.get("date.round.fields");
			dateRoundFields = fields == null ? null : StrUtils.splitSmart(
					(String) fields, ",", true);

			if ((dateFields == null && dateRoundFields != null)
					|| (dateFields != null && dateRoundFields == null)
					|| (dateFields != null && dateRoundFields != null
							& dateFields.size() != dateRoundFields.size()))
				throw new IllegalArgumentException(
						"Size of date.fields and date.round.fields must be same.");
			ignoreError = params.getBool("ignoreError", false);
		}
	}

	@Override
	public UpdateRequestProcessor getInstance(SolrQueryRequest req,
			SolrQueryResponse rsp, UpdateRequestProcessor next) {
		return new DateRoundProcessor(req, next);
	}

	class DateRoundProcessor extends UpdateRequestProcessor {
		public DateRoundProcessor(SolrQueryRequest req,
				UpdateRequestProcessor next) {
			super(next);
		}

		@Override
		public void processAdd(AddUpdateCommand cmd) throws IOException {
			SolrInputDocument solrInputDocument = cmd.getSolrInputDocument();
			for (int i = 0; i < dateFields.size(); i++) {
				try {
					String dateField = dateFields.get(i);
					SolrInputField inputField = solrInputDocument
							.getField(dateField);

					if (inputField != null) {
						Object obj = inputField.getValue();
						Object result = null;
						if (obj instanceof String) {
							String value = (String) obj;
							Date solrDate = parseSolrDate(value);
							String roundTo = dateRoundFields.get(i);
							DateFormat df = null;
							if (ROUND_DAY.equalsIgnoreCase(roundTo)) {
								df = new SimpleDateFormat(FORMAT_DAY);
							} else if (ROUND_SECOND.equalsIgnoreCase(roundTo)) {
								df = new SimpleDateFormat(FORMAT_SECOND);
							}
							if (df != null) {
								result = df.format(solrDate);
								// only remove it, if there is no error
								solrInputDocument.removeField(dateField);
								solrInputDocument.addField(dateField, result);
							}
						}
					}
				} catch (Exception ex) {
					if (!ignoreError) {
						throw new IOException(ex);
					}
				}
			}
			super.processAdd(cmd);
		}
	}

	public Date parseSolrDate(String dateString) throws ParseException {
		SimpleDateFormat sdf = new SimpleDateFormat(
				"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.US);
		sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
		return sdf.parse(dateString);
	}
}

You can view the complete source code here:
https://github.com/jefferyyuan/solr.misc
Post a Comment

Labels

Java (159) Lucene-Solr (110) All (60) Interview (59) J2SE (53) Algorithm (37) Eclipse (35) Soft Skills (35) Code Example (31) Linux (26) JavaScript (23) Spring (22) Windows (22) Web Development (20) Tools (19) Nutch2 (18) Bugs (17) Debug (15) Defects (14) Text Mining (14) J2EE (13) Network (13) PowerShell (11) Chrome (9) Continuous Integration (9) How to (9) Learning code (9) Performance (9) UIMA (9) html (9) Design (8) Dynamic Languages (8) Http Client (8) Maven (8) Security (8) Trouble Shooting (8) bat (8) blogger (8) Big Data (7) Google (7) Guava (7) JSON (7) Problem Solving (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) IDE (5) Lesson Learned (5) Miscs (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) OpenNLP (4) Project Managment (4) Python (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Firefox (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Google Drive (2) Gson (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Bit Operation (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Troubleshooting (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts