Solr: Define a Custom Field Type to Round Data


In a previous post, Solr: Use UpdateRequestProcessor to Round Data, I used an UpdateRequestProcessor to round a date. In this post, I describe how to define a custom field type that rounds a date instead.

For a text field in Solr, we can define analyzers. Solr indexes the data as the analyzers specify, and setting stored=true keeps the original data.
But for a number field (the various subtypes of TrieField) or a date field (TrieDateField), we can't define analyzers.

Instead, we can create a custom number field that extends TrieLongField (or another Trie type), or a custom date field that extends TrieDateField. In the implementation, we can change the data that gets stored and indexed.

The code is shown below. It rounds every incoming value to noon (UTC) of the same day.

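import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Map;
import java.util.TimeZone;

import org.apache.lucene.index.IndexableField;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.TrieDateField;

public class RoundDateField extends TrieDateField {

  // SimpleDateFormat is not thread-safe, so access to it is synchronized.
  private SimpleDateFormat sdf;

  private String fromFormat = null;
  private static final String PARAM_FROM_FORMAT = "fromFormat";
  private static final String DATE_FORMAT_UNIX_SECOND = "UNIX_SECOND";
  private static final String DATE_FORMAT_UNIX_MILLSECOND = "UNIX_MILLSECOND";

  private static final long MS_IN_DAY = 3600L * 24L * 1000L;
  // Current time in seconds since the epoch, captured when the class is loaded.
  private static final long SECONDS_FROM_EPOCH = new Date().getTime() / 1000;

  @Override
  protected void init(IndexSchema schema, Map<String,String> args) {
    if (args != null) {
      // Remove our own parameter so the superclass does not reject it.
      fromFormat = args.remove(PARAM_FROM_FORMAT);
    }
    sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.US);
    sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
    super.init(schema, args);
  }

  /**
   * If value >= SECONDS_FROM_EPOCH, treat it as milliseconds; otherwise treat
   * it as seconds. This is a heuristic: a seconds value that lies in the
   * future would be mistaken for milliseconds.
   *
   * @param value seconds or milliseconds since the epoch
   * @return the value in milliseconds
   */
  private long convertToMillseconds(long value) {
    long result = value;
    if (value < SECONDS_FROM_EPOCH) {
      result = result * 1000L;
    }
    return result;
  }

  /** Parses a Solr date string such as 2014-01-01T08:30:00.000Z. */
  private long parseDateString(String text) {
    try {
      synchronized (sdf) {
        return sdf.parse(text).getTime();
      }
    } catch (ParseException ex) {
      throw new RuntimeException(ex);
    }
  }

  @Override
  public IndexableField createField(SchemaField field, Object value, float boost) {
    long millseconds;
    try {
      millseconds = Long.parseLong(value.toString());
      if (fromFormat == null) {
        // No format configured: guess between seconds and milliseconds.
        millseconds = convertToMillseconds(millseconds);
      } else if (DATE_FORMAT_UNIX_SECOND.equalsIgnoreCase(fromFormat)) {
        millseconds = millseconds * 1000L;
      } else if (!DATE_FORMAT_UNIX_MILLSECOND.equalsIgnoreCase(fromFormat)) {
        throw new RuntimeException("Invalid fromFormat: " + fromFormat);
      }
    } catch (NumberFormatException ex) {
      // Not a number, so it should be a date string.
      millseconds = parseDateString(value.toString());
    }

    // Round to noon (UTC) of the same day.
    millseconds = (millseconds / MS_IN_DAY) * MS_IN_DAY + (MS_IN_DAY / 2);
    // Pass a Date to the superclass, which indexes it as a normal date value.
    return super.createField(field, new Date(millseconds), boost);
  }
}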
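
To make the rounding step easier to follow, here is a minimal standalone sketch of the same arithmetic, without any Solr classes. The class name RoundToNoonDemo and the helpers roundToNoonUtc and format are made up for this illustration and are not part of the field type above. It assumes non-negative timestamps (dates on or after 1970-01-01).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class RoundToNoonDemo {
  private static final long MS_IN_DAY = 24L * 3600L * 1000L;

  /** Drops the time-of-day part, then adds half a day: 12:00:00 UTC of the same day. */
  public static long roundToNoonUtc(long millis) {
    return (millis / MS_IN_DAY) * MS_IN_DAY + MS_IN_DAY / 2;
  }

  /** Formats epoch milliseconds as a UTC date string. */
  public static String format(long millis) {
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
    sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
    return sdf.format(new Date(millis));
  }

  public static void main(String[] args) {
    long startOfDay = 1388534400000L;          // 2014-01-01T00:00:00Z
    long endOfDay = startOfDay + MS_IN_DAY - 1; // 2014-01-01T23:59:59.999Z
    // Both values fall on the same day, so both round to the same instant.
    System.out.println(format(roundToNoonUtc(startOfDay))); // 2014-01-01T12:00:00Z
    System.out.println(format(roundToNoonUtc(endOfDay)));   // 2014-01-01T12:00:00Z
  }
}
```

Because every timestamp within a day maps to the same value, a query on the rounded field only needs to match whole days.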

We may want to search on the rounded date but still be able to retrieve the original date. To do this, we create one field that stores the original value, with indexed=false because we won't search on it. Then we create a second field to index the rounded date, with indexed=true but stored=false because we never retrieve or display the rounded value to the user.

In schema.xml, the field access_time is a normal date type and stores the original date value. We copy its value to another field, access_time_rounded, whose type is the custom type we defined.
<fieldType name="roundedDate" class="org.codeexample.jeffery.solr.RoundDateField" omitNorms="true" precisionStep="6" 
  positionIncrementGap="0" fromFormat="UNIX_SECOND" /> 

<fieldType name="roundedDateSmart" class="org.codeexample.jeffery.solr.RoundDateField" omitNorms="true" precisionStep="6" 
  positionIncrementGap="0" /> 

<field name="access_time" type="tdate" indexed="false" stored="true" omitNorms="true"/>
<field name="access_time_rounded" type="roundedDate" indexed="true" stored="false" omitNorms="true"/>
<copyField source="access_time" dest="access_time_rounded"/>

Compared with the previous version, this approach has two advantages:
1. It can auto-detect the format of the incoming data: a valid Solr date string, or a number of seconds or milliseconds since the epoch.
2. It is easier to use: there is no need to configure a processor factory in solrconfig.xml; just declare the field type for your data.
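
The seconds-versus-milliseconds detection mentioned in point 1 can be sketched on its own as follows. This is only an illustration: the class TimestampUnitDemo and the method toMillis are hypothetical names, and the current time is passed in as a parameter (the field type above captures it once when the class loads) so that the behavior is easy to check.

```java
public class TimestampUnitDemo {

  /**
   * Guesses the unit of a numeric timestamp: anything smaller than the
   * current time in seconds is assumed to be seconds, anything else
   * is assumed to already be milliseconds.
   */
  public static long toMillis(long value, long nowSeconds) {
    return value < nowSeconds ? value * 1000L : value;
  }

  public static void main(String[] args) {
    long nowSeconds = 1700000000L; // a fixed "now", chosen for the example

    // A past date in seconds is scaled up to milliseconds.
    System.out.println(toMillis(1388534400L, nowSeconds));    // 1388534400000
    // The same date in milliseconds is left unchanged.
    System.out.println(toMillis(1388534400000L, nowSeconds)); // 1388534400000
    // Limitation: a future date in seconds is mistaken for milliseconds.
    System.out.println(toMillis(2000000000L, nowSeconds));    // 2000000000
  }
}
```

The last case shows why it can be safer to set fromFormat explicitly when the unit of the data is known.
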
You can view the complete source code here:
https://github.com/jefferyyuan/solr.misc
