Solr: Define an Custom Field Type to Round Data

In previous post: Solr: Use UpdateRequestProcessor to Round Data, I use UpdateRequestProcessor to round a date, in this post, I want to describe how to define a custom field type to round a date.

In Solr, for text field, we can define analyzers. Solr will index the data as specified by the analyzers, and we can set stored=true, this will keep the original data.
But for an number(various subtypes of TrieField) or date field(TrieDateField), we can't define analyzers.

But we can create a custom number field, which extends TrieLongField or other types, or create a custom date field which extends TrieDateField. In the implementation, we can change the stored and indexed data.

The code looks like below.

public class RoundDateField extends TrieDateField {
   
   private SimpleDateFormat sdf;
   
   private String fromFormat = null;
   private static String PARAM_FROM_FORMAT = "fromFormat";
   private static String DATE_FORMAT_UNIX_SECOND = "UNIX_SECOND";
   private static String DATE_FORMAT_UNIX_MILLSECOND = "UNIX_MILLSECOND";
   
   private static long MS_IN_DAY = 3600 * 24 * 1000;
   private static final long SECONDS_FROM_EPCO = new Date().getTime() / 1000;
   
   @Override
   protected void init(IndexSchema schema, Map<String,String> args) {
     if (args != null) {
       fromFormat = args.remove(PARAM_FROM_FORMAT);
     }
     sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.US);
     sdf.setTimeZone(UTC);
     super.init(schema, args);
   }
   
   /**
    * if value > SECONDS_FROM_EPCO, then treat value as milliseconds, otherwise
    * treat value as seconds
    * 
    * @param value
    * @return
    */
   private long convertToMillseconds(long value) {
     long result = value;
     if (value < SECONDS_FROM_EPCO) {
       result = result * 1000L;
     }
     return result;
   }
   
   @Override
   public IndexableField createField(SchemaField field, Object value, float boost) {
     
     try {
       long millseconds = -1;
       
       try {
         millseconds = Long.parseLong(value.toString());
         
         if (fromFormat != null) {
           if (DATE_FORMAT_UNIX_MILLSECOND.equalsIgnoreCase(fromFormat)) {
             // do nothing
           } else if (DATE_FORMAT_UNIX_SECOND.equalsIgnoreCase(fromFormat)) {
             millseconds = millseconds * 1000L;
           } else {
             throw new RuntimeException("Invalid fromFormat: " + fromFormat);
           }
         } else {
           millseconds = convertToMillseconds(millseconds);
         }
         
       } catch (Exception ex) {
         // so it should be a date string
         millseconds = sdf.parse(value.toString()).getTime();
       }
       
       millseconds = (millseconds / MS_IN_DAY) * MS_IN_DAY + (MS_IN_DAY / 2);
       // returned value must be a date time string
       value = new Date(millseconds);
     } catch (Exception ex) {
       throw new RuntimeException(ex);
     }
     
     return super.createField(field, value, boost);
   }
}

We may want to search on the rounded date, but still able to retrieve the original date. To do this, we can create one field to store the original content, but set indexed=false, as we won't search on it. Then we create another field to index the rounded date: we set indexed=true, but stored=false, as we will not retrieve or display the round value to user.

In solr.XML, the field, access_time is an normal date type, and it will store the original date value, we copy its value to another type access_time_rounded, which type is the custom type we define.
<fieldType name="roundededDate" class="org.codeexample.jeffery.solr.RoundDateField" omitNorms="true" precisionStep="6" 
  positionIncrementGap="0" fromFormat="UNIX_SECOND" /> 

<fieldType name="roundededDateSmart" class="org.codeexample.jeffery.solr.RoundDateField" omitNorms="true" precisionStep="6" 
  positionIncrementGap="0" /> 

<field name="access_time" type="tdate" indexed="false" stored="true" omitNorms="true"/>
<field name="access_time_rouned" type="roundededDate" indexed="false" stored="true" omitNorms="true"/>
<copyField source="access_time" dest="access_time_rouned"/>
Compared with previous version, this has some advantages that:
1. It can auto detect the format of the passed data, whether it is a valid solr date format string, or whether it uses seconds or million seconds to represent the date.
2. Easier to use, no need to configure processor factory in solrconfig.xml, just declare field type of your data.
You can view the complete source code here:
https://github.com/jefferyyuan/solr.misc
Post a Comment

Labels

Java (159) Lucene-Solr (110) All (58) Interview (58) J2SE (53) Algorithm (41) Soft Skills (36) Eclipse (34) Code Example (31) Linux (25) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (15) Defects (14) Text Mining (14) J2EE (13) Network (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) UIMA (9) html (9) Continuous Integration (8) Dynamic Languages (8) Http Client (8) Maven (8) Security (8) Trouble Shooting (8) bat (8) blogger (8) Big Data (7) Google (7) Guava (7) JSON (7) Problem Solving (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Python (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Troubleshooting (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts