Solr: How to Update Multiple Cores in One Request

Solr supports distributed search through the shards parameter, for example: http://localhost:8080/solr/select?shards=localhost:8080/solr,localhost:9090/solr&indent=true&q=nexus7.

Solr has no built-in way to send an update or upload files to multiple cores in one request, but this is easy to add.
We add a shards parameter that lists the URLs of the target cores, and parameters of the form shardN.parameter_name=parameter_value for values that should be sent only to shard N (N is the zero-based position in the shards list). Parameters that do not start with shardN are sent to all cores.
Example: upload all CSV files in folder1 to core1 and all CSV files in folder2 to core2:
http://localhost:8080/solr/cores?shards=http://localhost:8080/solr/collection1/,http://localhost:8080/solr/collection2/&url=/import/csv&shard0.stream.folder=folder1_path&shard1.stream.folder=folder2_path&stream.contentType=text/csv;charset=utf-8
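
The routing rule described above can be tried without Solr. The following is a minimal sketch in plain Java, not the handler itself: the class name ShardParamRouter and the use of Map<String, String> in place of SolrParams are assumptions made for illustration. Unlike the handler, it treats a name such as sharding.x (no numeric index) as a common parameter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits request parameters into one map per shard.
public class ShardParamRouter {
    // shardN.name, where N is the zero-based position in the shards list
    private static final Pattern SHARD_PARAM = Pattern.compile("shard(\\d{1,6})\\.(.+)");

    public static List<Map<String, String>> route(Map<String, String> params, int shardCount) {
        List<Map<String, String>> perShard = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            perShard.add(new LinkedHashMap<>());
        }
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String name = entry.getKey();
            if (name.equals("shards")) {
                continue; // the shard list itself is not forwarded
            }
            Matcher m = SHARD_PARAM.matcher(name);
            if (m.matches()) {
                int shard = Integer.parseInt(m.group(1));
                if (shard < shardCount) {
                    // shardN.foo=bar is sent to shard N only, as foo=bar
                    perShard.get(shard).put(m.group(2), entry.getValue());
                }
                // an index beyond the shard list is ignored
            } else {
                // everything else is a common parameter for all shards
                for (Map<String, String> target : perShard) {
                    target.put(name, entry.getValue());
                }
            }
        }
        return perShard;
    }
}
```

With the parameters of the example URL, shard 0 would receive url and stream.folder=folder1_path, and shard 1 would receive url and stream.folder=folder2_path.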

For how to upload multiple local stream files with multiple threads, and for the stream.folder and stream.file.pattern parameters, please refer to Solr: Use Multiple Threads to Import Local stream Files.

Commit to core1 and core2 in one request:
http://localhost:8080/solr/cores?shards=http://localhost:8080/solr/collection1/,http://localhost:8080/solr/collection2/&url=/update&commit=true

Now we can update multiple cores in one request, which makes it easy to write scripts around it.

The code is shown below. You can also view the complete source code here: https://github.com/jefferyyuan/solr.misc

public class MultiCoreUpdateRequestHandler extends UpdateRequestHandler {
  private static String PARAM_SHARDS = "shards";
  
  @Override
  public void handleRequestBody(final SolrQueryRequest req,
      final SolrQueryResponse rsp) throws Exception {
    try {
      
      SolrParams params = req.getParams();
      String shardsStr = params.get(PARAM_SHARDS);
      if (shardsStr == null) {
        throw new RuntimeException("No shards parameter found.");
      }
      List<String> shards = StrUtils.splitSmart(shardsStr, ',');
      
      List<ModifiableSolrParams> shardParams = new ArrayList<ModifiableSolrParams>();
      for (int i = 0; i < shards.size(); i++) {
        shardParams.add(new ModifiableSolrParams());
      }
      Iterator<String> iterator = params.getParameterNamesIterator();
      String shardParamPrefix = "shard";
      while (iterator.hasNext()) {
        String paramName = iterator.next();
        if (paramName.equals(PARAM_SHARDS)) continue;
        if (paramName.startsWith(shardParamPrefix)) {
          int index = paramName.indexOf(".");
          if (index < 0) continue;
          String numStr = paramName.substring(shardParamPrefix.length(), index);
          try {
            int shardNumber = Integer.parseInt(numStr);
            String shardParam = paramName.substring(index + 1);
            shardParams.get(shardNumber).add(shardParam, params.get(paramName));
          } catch (Exception e) {
            // do nothing
          }
        } else {
          // add common parameters
          for (ModifiableSolrParams tmp : shardParams) {
            tmp.add(paramName, params.get(paramName));
          }
        }
      }
      handleShards(shards, shardParams, rsp);
    } finally {}
  }
  
  private void handleShards(final List<String> shards,
      final List<ModifiableSolrParams> shardParams, final SolrQueryResponse rsp)
      throws InterruptedException {
    
    ExecutorService executor = null;
    
    executor = Executors.newFixedThreadPool(shards.size());
    
    for (int i = 0; i < shards.size(); i++) {
      final int index = i;
      executor.submit(new Runnable() {
        @SuppressWarnings("unchecked")
        @Override
        public void run() {
          Map<String,Object> resultMap = new LinkedHashMap<String,Object>();
          try {
            SolrServer solr = new HttpSolrServer(shards.get(index));
            
            ModifiableSolrParams params = shardParams.get(index);
            UpdateRequest request = new UpdateRequest(params.get("url"));
            resultMap.put("params", params.toNamedList());
            request.setParams(params);
            UpdateResponse response = request.process(solr);
            NamedList<Object> header = response.getResponseHeader();
            resultMap.put("responseHeader", header);
            System.err.println(response);
          } catch (Exception e) {
            NamedList<Object> error = new NamedList<Object>();
            error.add("msg", e.getMessage());
            StringWriter sw = new StringWriter();
            e.printStackTrace(new PrintWriter(sw));
            error.add("trace", sw.toString());
            resultMap.put("error", error);
            throw new RuntimeException(e);
          } finally {
            // several worker threads add to the same response, so guard the add
            synchronized (rsp) {
              rsp.add("shard" + index, resultMap);
            }
          }
        }
      });
    }
    executor.shutdown();
    
    boolean terminated = executor.awaitTermination(Long.MAX_VALUE,
        TimeUnit.SECONDS);
    if (!terminated) {
      throw new RuntimeException("Request takes too much time");
    }
  }
}
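
One thing to keep in mind with this pattern: an exception thrown inside a Runnable passed to executor.submit is captured by the returned Future and is not seen by the calling thread. A variant that collects the per-shard outcome on the calling thread can be sketched as follows. This is an illustration only, not the handler's code: ShardFanOut is a hypothetical name, and a Function<String, String> stands in for the HTTP update request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: run one task per shard and gather results in shard order.
public class ShardFanOut {
    public static Map<String, String> runOnAll(List<String> shards, Function<String, String> task) {
        Map<String, String> results = new LinkedHashMap<>();
        if (shards.isEmpty()) {
            return results;
        }
        ExecutorService executor = Executors.newFixedThreadPool(shards.size());
        try {
            List<Callable<String>> calls = new ArrayList<>();
            for (String shard : shards) {
                calls.add(() -> task.apply(shard));
            }
            // invokeAll waits for every task and returns futures in input order
            List<Future<String>> futures = executor.invokeAll(calls);
            for (int i = 0; i < futures.size(); i++) {
                try {
                    results.put("shard" + i, futures.get(i).get());
                } catch (ExecutionException e) {
                    // the failure of one shard is recorded, the others still report
                    results.put("shard" + i, "error: " + e.getCause().getMessage());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for shards", e);
        } finally {
            executor.shutdown();
        }
        return results;
    }
}
```

Because the result map is only written by the calling thread, no synchronization is needed around it.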

Solr: How to Speed Up Indexing

Store Less And Index Less
Please refer to Solr: How to Shrink Index Size.
Outline:
Set indexed=false or stored=false.
Use the best-fit, smallest field type, such as tlong or tint.
Clean data.
Round data.
Increase precisionStep.
Set omitNorms=true.

Increase the Java heap size
java -server -Xms8192M -Xmx8192M

Set overwrite to false
If the unique key is generated automatically (a UUID, or a key generated in our code), or we can guarantee there is no duplicate data, we can set overwrite=false so that Solr adds each document without first deleting an existing document with the same key. See the code: org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
After the push is finished, we can run a facet query with facet.field=idfield&facet.mincount=2 to find out whether there are duplicate ids, and then either delete the old document or check whether the data contains an error.

Increase ramBufferSizeMB, maxBufferedDocs, and mergeFactor in solrconfig.xml
This reduces the number of disk I/O operations.
After committing the data, you may run optimize to improve query speed.

Increase the size of the BufferedReader to reduce the number of I/O operations.
To do this, you have to change the Solr code:
BUFFER_READER_SIZE = params.getInt(PARAM_BUFFER_READER_SIZE, 0);
if (BUFFER_READER_SIZE != 0) {
  reader = new BufferedReader(reader, BUFFER_READER_SIZE);
}
Then configure the size of the BufferedReader in solrconfig.xml.
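
The effect of the snippet can be shown in isolation. This is a small sketch under assumptions: BufferedReaderSizing is a hypothetical class, and a StringReader stands in for the content stream that Solr would read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch: wrap a reader with a configurable buffer size.
public class BufferedReaderSizing {
    /** Uses a custom buffer only when a positive size is configured. */
    public static Reader withBuffer(Reader reader, int bufferSize) {
        return bufferSize > 0 ? new BufferedReader(reader, bufferSize) : reader;
    }

    /** Reads the input line by line and returns the number of lines. */
    public static int countLines(Reader reader) {
        BufferedReader lines = reader instanceof BufferedReader
                ? (BufferedReader) reader
                : new BufferedReader(reader);
        try {
            int count = 0;
            while (lines.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as countLines(withBuffer(new java.io.StringReader(csv), 1 << 20)) reads the same data with a 1 MB buffer instead of the default size; whether this helps depends on the underlying stream and disk.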

Use multiple threads to upload multiple files at the same time.
Please refer to Solr: Use Multiple Threads to Import Local stream Files

Use multiple update processor threads
https://issues.apache.org/jira/browse/SOLR-3585
Apply this improvement to your Solr build.

Use Multiple Solr Cores
In my test, uploading 56 million records took 70 minutes with one core and 40 minutes with 2 cores in one Solr server. Increasing to 3 cores brought no further improvement (in fact, it was slower).
I think this is because while one core is busy with I/O, another core can do CPU-bound work.

Deploying the cores in different web server instances, in different JVMs, should perform better.

Solr Cloud
I tested Solr Cloud and found it unsuitable for my task, because it requires the Solr transaction log to be enabled, which is quite slow, and because of the overhead of ZooKeeper. Using Solr Cloud with 2 nodes, the import took 4 hours, which is much slower.
Solr Cloud should be more suitable when the index is too large to be stored on one machine.

Solr: How to Shrink Index Size

To reduce index size, we should do our best to understand the application's requirements: what each field means, what type it should be (for example, tlong or tint), what tokenizers or filters should be used, and what queries users may make.

Indexed and Stored
If users will not search on a field, we can set indexed=false for that field.
If a field is for search only and users will never retrieve its original content, we can set stored=false.

Use the best-fit, smallest field type, such as tlong or tint.

Clean data before indexing it.
For instance, remove garbage values such as NA.

Round Data
For example, for a date field, users may only care about the date part, not the hh:mm:ss part, so we can round 2012-12-21T12:12:12.234Z to 2012-12-21T00:00:00Z. This reduces the number of distinct terms in the index.
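
The rounding itself is a one-liner with java.time. The sketch below is only an illustration (DayRounding is a hypothetical class, and it assumes the input is an ISO-8601 UTC string as in the example above); it also shows how several timestamps from the same day collapse into one value.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: drop the time-of-day part of a UTC timestamp.
public class DayRounding {
    public static String roundToDay(String solrDate) {
        return Instant.parse(solrDate).truncatedTo(ChronoUnit.DAYS).toString();
    }

    public static void main(String[] args) {
        Set<String> distinct = new TreeSet<>();
        distinct.add(roundToDay("2012-12-21T12:12:12.234Z"));
        distinct.add(roundToDay("2012-12-21T18:30:00Z"));
        distinct.add(roundToDay("2012-12-21T23:59:59.999Z"));
        // three input values, one distinct rounded value
        System.out.println(distinct); // prints [2012-12-21T00:00:00Z]
    }
}
```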

Use StopFilterFactory to remove stop words.
Decide which analyzers and filters to use to index the input.
Range Query and precisionStep
For fields that don't need range queries, or where range query performance is not important, we can set precisionStep to a larger number. This reduces the number of indexed terms at the cost of range query speed.
termVectors, termPositions and termOffsets
For fields that don't need highlighting, set these three properties to false so that Solr does not store term vector information for them in the index.
omitNorms
Norms are used for index-time boosts and field length normalization, so that shorter documents get a higher score.
Set omitNorms=true for text fields that are usually short and don't need a boost for short values. (For primitive types such as string and integer, it is turned on by default in Solr 4.0.) This shrinks the index a bit more and also saves some memory during queries.

Solr: Use Multiple Threads to Import Local stream Files

When importing data into Solr, users can pass several stream.file="path" parameters to import multiple local files. But Solr's UpdateRequestHandler imports them one by one:
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
for (ContentStream stream : streams) {
  documentLoader.load(req, rsp, stream, processor);
}
So to speed up indexing, we can use multiple threads to import files simultaneously.
In addition, I want to extend UpdateRequestHandler with a stream.folder parameter, which imports all files under one folder, and a stream.file.pattern parameter, which imports all files that match the pattern.
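
The core idea can be sketched without any Solr classes. The following is a simplified illustration, not the handler: ParallelImportSketch is a hypothetical name, stream names are plain strings, and a Consumer<String> stands in for the document loader. It shows the two points that matter: never start more threads than there are streams, and collect results in a thread-safe list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: import several streams on a bounded thread pool.
public class ParallelImportSketch {
    /** Never start more threads than there are streams to import. */
    public static int threadCount(int requested, int streamCount) {
        return Math.max(1, Math.min(requested, streamCount));
    }

    /** Loads every stream on a fixed pool and returns the names that were loaded. */
    public static List<String> importAll(List<String> streams, int requestedThreads,
                                         Consumer<String> loader) {
        // worker threads append concurrently, so the list must be synchronized
        List<String> done = Collections.synchronizedList(new ArrayList<>());
        ExecutorService executor =
                Executors.newFixedThreadPool(threadCount(requestedThreads, streams.size()));
        for (String stream : streams) {
            executor.execute(() -> {
                loader.accept(stream);
                done.add(stream);
            });
        }
        executor.shutdown();
        try {
            if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("Import timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted during import", e);
        }
        return done;
    }
}
```

The order of the returned names depends on thread scheduling, so callers should not rely on it.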

package org.codeexample.jeffery.solr;
public class ThreadedUpdateRequestHandler extends UpdateRequestHandler {

 private static String PARAM_THREAD_NUMBER = "threads";

 private static String PARAM_STREAM_FOLDER = "stream.folder";
 private static String PARAM_STREAM_FILE_PATTERN = "stream.file.pattern";

 private static final int DEFAULT_THREAD_NUMBER = 10;
 private static int DEFAULT_THREADS = DEFAULT_THREAD_NUMBER;

 @SuppressWarnings("rawtypes")
 @Override
 public void init(NamedList args) {
  super.init(args);
  if (args != null) {
   NamedList namedList = ((NamedList) args.get("defaults"));
   if (namedList != null) {
    Object obj = namedList.get(PARAM_THREAD_NUMBER);
    if (obj != null) {
     DEFAULT_THREADS = Integer.parseInt(obj.toString());
    }
   }
  }
 }

 @Override
 public void handleRequestBody(final SolrQueryRequest req,
   final SolrQueryResponse rsp) throws Exception {

  List<ContentStream> streams = new ArrayList<ContentStream>();

  handleReqStream(req, streams);
  // here, we handle the two new parameters: stream.folder and
  // stream.file.pattern
  handleStreamFolders(req, streams);
  handleFilePatterns(req, streams);
  if (streams.size() < 2) {
   // No need to use threadpool.
   SolrQueryRequestBase reqBase = (SolrQueryRequestBase) req;
   if (!streams.isEmpty()) {
    String contentType = req.getParams().get(
      CommonParams.STREAM_CONTENTTYPE);
    ContentStream stream = streams.get(0);
    if (stream instanceof ContentStreamBase) {
     ((ContentStreamBase) stream).setContentType(contentType);

    }
   }
   reqBase.setContentStreams(streams);
   super.handleRequestBody(req, rsp);
  } else {
   importStreamsMultiThreaded(req, rsp, streams);
  }
 }

 private void handleReqStream(final SolrQueryRequest req,
   List<ContentStream> streams) {
  Iterable<ContentStream> iterabler = req.getContentStreams();
  if (iterabler != null) {
   Iterator<ContentStream> iterator = iterabler.iterator();
   while (iterator.hasNext()) {
    streams.add(iterator.next());
    iterator.remove();
   }
  }
 }

 private ExecutorService importStreamsMultiThreaded(
   final SolrQueryRequest req, final SolrQueryResponse rsp,
   List<ContentStream> streams) throws InterruptedException,
   IOException {
  ExecutorService executor = null;
  SolrParams params = req.getParams();

  final UpdateRequestProcessorChain processorChain = req
    .getCore()
    .getUpdateProcessingChain(params.get(UpdateParams.UPDATE_CHAIN));

  UpdateRequestProcessor processor = processorChain.createProcessor(req,
    rsp);
  try {
   Map<String, Object> resultMap = new LinkedHashMap<String, Object>();

   resultMap.put("start_time", new Date());
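   // worker threads append to this list concurrently, so it must be synchronized
   List<Map<String, Object>> details = Collections
     .synchronizedList(new ArrayList<Map<String, Object>>());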

   try {

    int threads = determineThreadsNumber(params, streams.size());
    ThreadFactory threadFactory = new ThreadFactory() {
     public Thread newThread(Runnable r) {
      return new Thread(r, "threadedRequestHandler-"
        + new Date());
     }
    };
    executor = Executors.newFixedThreadPool(threads, threadFactory);
    String contentType = params
      .get(CommonParams.STREAM_CONTENTTYPE);

    Iterator<ContentStream> iterator = streams.iterator();
    while (iterator.hasNext()) {
     ContentStream stream = iterator.next();
     iterator.remove();
     if (stream instanceof ContentStreamBase) {
      ((ContentStreamBase) stream)
        .setContentType(contentType);

     }
     submitTask(req, rsp, processorChain, executor, stream,
       details);
    }

    executor.shutdown();

    boolean terminated = executor.awaitTermination(Long.MAX_VALUE,
      TimeUnit.SECONDS);
    if (!terminated) {
     throw new RuntimeException("Request takes too much time");
    }
    // Perhaps commit from the parameters
    RequestHandlerUtils.handleCommit(req, processor, params, false);
    RequestHandlerUtils.handleRollback(req, processor, params,
      false);
   } finally {
    resultMap.put("end_time", new Date());

    // check whether there is error in details
    for (Map<String, Object> map : details) {
     Exception ex = (Exception) map.get("exception");
     if (ex != null) {
      rsp.setException(ex);
      if (ex instanceof SolrException) {
       rsp.add("status", ((SolrException) ex).code());
      } else {
       rsp.add("status",
         SolrException.ErrorCode.BAD_REQUEST);
      }
      break;
     }
    }
   }
   resultMap.put("details", details);
   rsp.add("result", resultMap);
   return executor;
  } finally {
   if (executor != null && !executor.isShutdown()) {
    executor.shutdownNow();
   }
   // finish the request
   processor.finish();
  }
 }

 private int determineThreadsNumber(SolrParams params, int streamSize) {
  int threads = DEFAULT_THREADS;
  String str = params.get(PARAM_THREAD_NUMBER);
  if (str != null) {
   threads = Integer.parseInt(str);
  }

  if (streamSize < threads) {
   threads = streamSize;
  }
  return threads;
 }

 private void handleFilePatterns(final SolrQueryRequest req,
   List<ContentStream> streams) {
  String[] strs = req.getParams().getParams(PARAM_STREAM_FILE_PATTERN);
  if (strs != null) {
   for (String filePattern : strs) {
    // it may point to a file
    File file = new File(filePattern);
    if (file.isFile()) {
     streams.add(new ContentStreamBase.FileStream(file));
    } else {
     // only supports tail regular expression, such as
     // c:\foldera\c*.csv
     int lastIndex = filePattern.lastIndexOf(File.separator);
     if (lastIndex > -1) {
      File folder = new File(filePattern.substring(0,
        lastIndex));

      if (!folder.exists()) {
       throw new RuntimeException("Folder " + folder
         + " doesn't exists.");
      }

      String pattern = filePattern.substring(lastIndex + 1);
      pattern = convertPattern(pattern);
      final Pattern p = Pattern.compile(pattern);

      File[] files = folder.listFiles(new FilenameFilter() {
       @Override
       public boolean accept(File dir, String name) {
        Matcher matcher = p.matcher(name);
        return matcher.matches();
       }
      });

      if (files != null) {
       for (File tmp : files) {
        streams.add(new ContentStreamBase.FileStream(
          tmp));
       }
      }
     }
    }
   }
  }
 }

 private void handleStreamFolders(final SolrQueryRequest req,
   List<ContentStream> streams) {
  String[] strs = req.getParams().getParams(PARAM_STREAM_FOLDER);
  if (strs != null) {
   for (String folderStr : strs) {

    File folder = new File(folderStr);

    File[] files = folder.listFiles();

    if (files != null) {
     for (File file : files) {
      streams.add(new ContentStreamBase.FileStream(file));
     }
    }
   }
  }
 }

 /**
  * replace * to .*, replace . to \.
  */
 private String convertPattern(String pattern) {
  pattern = pattern.replaceAll("\\.", "\\\\.");
  pattern = pattern.replaceAll("\\*", ".*");
  return pattern;
 }

 private void submitTask(final SolrQueryRequest req,
   final SolrQueryResponse rsp,
   final UpdateRequestProcessorChain processorChain,
   ExecutorService executor, final ContentStream stream,
   final List<Map<String, Object>> rspResult) {
  Thread thread = new Thread() {
   public void run() {
    Map<String, Object> map = new LinkedHashMap<String, Object>();
    map.put("start_time", new Date().toString());

    if (stream instanceof ContentStreamBase.FileStream) {
     map.put("Import File: ",
       ((ContentStreamBase.FileStream) stream).getName());
    }
    try {
     UpdateRequestProcessor processor = null;
     try {
      processor = processorChain.createProcessor(req, rsp);

      ContentStreamLoader documentLoader = newLoader(req,
        processor);

      documentLoader.load(req, rsp, stream, processor);
      System.err.println(rsp);

     } finally {
      if (processor != null) {
       // finish the request
       processor.finish();
      }
     }
    } catch (Exception e) {
     rsp.setException(e);
    } finally {
     map.put("end_time", new Date().toString());
     if (rsp.getException() != null) {
      map.put("exception", rsp.getException());
     }
     rspResult.add(map);
    }

   };
  };

  executor.execute(thread);
 }
}
You can view the complete source code here:
https://github.com/jefferyyuan/solr.misc
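
The wildcard conversion used for stream.file.pattern can be tried on its own. The variant below is a sketch, not the code from the repository: WildcardPattern is a hypothetical class, and it uses Pattern.quote so that every character other than * is matched literally, including parentheses or brackets in file names.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: convert a file-name wildcard such as c*.csv to a regex.
public class WildcardPattern {
    public static Pattern compile(String wildcard) {
        // split on '*' and keep empty pieces so leading/trailing '*' survive
        String[] literals = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' matches any run of characters
            }
            regex.append(Pattern.quote(literals[i])); // everything else is literal
        }
        return Pattern.compile(regex.toString());
    }

    public static boolean matches(String wildcard, String fileName) {
        return compile(wildcard).matcher(fileName).matches();
    }
}
```

For example, matches("c*.csv", "core1.csv") is true, while matches("c*.csv", "c1xcsv") is false because the dot is treated as a literal character.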

Solr: Define a Custom Field Type to Round Data

In a previous post, Solr: Use UpdateRequestProcessor to Round Data, I used an UpdateRequestProcessor to round a date. In this post, I describe how to define a custom field type to round a date.

In Solr, we can define analyzers for a text field. Solr indexes the data as specified by the analyzers, and if we set stored=true, it also keeps the original data.
But for a number field (the various subtypes of TrieField) or a date field (TrieDateField), we can't define analyzers.

Instead, we can create a custom number field that extends TrieLongField or another type, or a custom date field that extends TrieDateField. In the implementation, we can change the stored and indexed data.

The code is shown below.

public class RoundDateField extends TrieDateField {
   
   private SimpleDateFormat sdf;
   
   private String fromFormat = null;
   private static String PARAM_FROM_FORMAT = "fromFormat";
   private static String DATE_FORMAT_UNIX_SECOND = "UNIX_SECOND";
   private static String DATE_FORMAT_UNIX_MILLSECOND = "UNIX_MILLSECOND";
   
   private static long MS_IN_DAY = 3600 * 24 * 1000;
   private static final long SECONDS_FROM_EPCO = new Date().getTime() / 1000;
   
   @Override
   protected void init(IndexSchema schema, Map<String,String> args) {
     if (args != null) {
       fromFormat = args.remove(PARAM_FROM_FORMAT);
     }
     sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.US);
     sdf.setTimeZone(UTC);
     super.init(schema, args);
   }
   
   /**
    * if value > SECONDS_FROM_EPCO, then treat value as milliseconds, otherwise
    * treat value as seconds
    * 
    * @param value
    * @return
    */
   private long convertToMillseconds(long value) {
     long result = value;
     if (value < SECONDS_FROM_EPCO) {
       result = result * 1000L;
     }
     return result;
   }
   
   @Override
   public IndexableField createField(SchemaField field, Object value, float boost) {
     
     try {
       long millseconds = -1;
       
       try {
         millseconds = Long.parseLong(value.toString());
         
         if (fromFormat != null) {
           if (DATE_FORMAT_UNIX_MILLSECOND.equalsIgnoreCase(fromFormat)) {
             // do nothing
           } else if (DATE_FORMAT_UNIX_SECOND.equalsIgnoreCase(fromFormat)) {
             millseconds = millseconds * 1000L;
           } else {
             throw new RuntimeException("Invalid fromFormat: " + fromFormat);
           }
         } else {
           millseconds = convertToMillseconds(millseconds);
         }
         
       } catch (Exception ex) {
         // so it should be a date string
         millseconds = sdf.parse(value.toString()).getTime();
       }
       
       // keep only the day part: every time of day maps to 12:00:00 UTC of that day
       millseconds = (millseconds / MS_IN_DAY) * MS_IN_DAY + (MS_IN_DAY / 2);
       // hand the rounded value to the parent class as a Date
       value = new Date(millseconds);
     } catch (Exception ex) {
       throw new RuntimeException(ex);
     }
     
     return super.createField(field, value, boost);
   }
}

We may want to search on the rounded date but still be able to retrieve the original date. To do this, we create one field to store the original content and set indexed=false, as we won't search on it. Then we create another field to index the rounded date: we set indexed=true but stored=false, as we will not retrieve or display the rounded value to the user.

In schema.xml, the field access_time is a normal date type that stores the original date value. We copy its value to another field, access_time_rounded, whose type is the custom type we defined.
<fieldType name="roundededDate" class="org.codeexample.jeffery.solr.RoundDateField" omitNorms="true" precisionStep="6" 
  positionIncrementGap="0" fromFormat="UNIX_SECOND" /> 

<fieldType name="roundededDateSmart" class="org.codeexample.jeffery.solr.RoundDateField" omitNorms="true" precisionStep="6" 
  positionIncrementGap="0" /> 

<field name="access_time" type="tdate" indexed="false" stored="true" omitNorms="true"/>
<field name="access_time_rounded" type="roundededDate" indexed="true" stored="false" omitNorms="true"/>
<copyField source="access_time" dest="access_time_rounded"/>
Compared with the previous version, this has some advantages:
1. It can auto-detect the format of the passed data: whether it is a valid Solr date string, or whether it uses seconds or milliseconds to represent the date.
2. It is easier to use: there is no need to configure a processor factory in solrconfig.xml; just declare the field type of your data.
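
The auto-detection and the rounding can be checked separately from Solr. The sketch below is an illustration under assumptions: EpochRounding is a hypothetical class, the current time is passed in as a parameter so the result is reproducible, and Math.floorDiv is used instead of plain division so that dates before 1970 also round down. Note the limit of the heuristic: a millisecond value smaller than the current time in seconds (roughly the first weeks of 1970) would be misread as seconds.

```java
import java.time.Instant;

// Hypothetical sketch of the seconds-vs-milliseconds heuristic and day rounding.
public class EpochRounding {
    private static final long MS_IN_DAY = 24L * 60 * 60 * 1000;

    /** Values smaller than the current time in seconds are assumed to be seconds. */
    public static long toMillis(long value, long nowSeconds) {
        return value < nowSeconds ? value * 1000L : value;
    }

    /** Maps every instant of a UTC day to 12:00:00 of that day. */
    public static long roundToMidday(long millis) {
        return Math.floorDiv(millis, MS_IN_DAY) * MS_IN_DAY + MS_IN_DAY / 2;
    }

    public static void main(String[] args) {
        long now = Instant.now().getEpochSecond();
        long seconds = Instant.parse("2012-12-21T12:12:12Z").getEpochSecond();
        // the same instant given as seconds or as milliseconds rounds to the same day
        System.out.println(Instant.ofEpochMilli(roundToMidday(toMillis(seconds, now))));
        System.out.println(Instant.ofEpochMilli(roundToMidday(toMillis(seconds * 1000L, now))));
    }
}
```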
You can view the complete source code here:
https://github.com/jefferyyuan/solr.misc

Solr: Use UpdateRequestProcessor to Round Data

We can extend UpdateRequestProcessor to make Solr do many things: clean data, transform dates, and so on.

Sometimes we need to round the incoming data. For example, for a date value such as 2012-12-21T12:12:12.234Z, the customer may only care about the date part, not the hour and minute parts.

So to reduce index size and improve query performance, we can use an UpdateRequestProcessor to round the date to 2012-12-21T00:00:00Z.
In solrconfig.xml, we configure a processor that specifies which fields to round and to what precision. In the following configuration, we round them to keep only the date part.
<updateRequestProcessorChain name="dateRoundChain">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="org.codeexample.jeffery.solr.DateRoundProcessorFactory" >
   <bool name="ignoreError">true</bool>
   <str name="date.fields">access_time,modify_time,mtm</str>
   <str name="date.round.fields">day,day,day</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

  <requestHandler name="/import/csv" class="solr.CSVRequestHandler">
  <lst name="defaults">
   <str name="stream.contentType">application/csv</str>
   <str name="update.chain">dateRoundChain</str>
  </lst>
 </requestHandler>
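
The way date.fields and date.round.fields are paired position by position can be sketched in plain Java. This is only an illustration under assumptions: FieldRoundingSketch and its methods are hypothetical names, a document is modelled as a Map<String, String>, and java.time replaces the SimpleDateFormat approach of the processor.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pair field names with rounding units and apply them.
public class FieldRoundingSketch {
    /** Pairs the i-th field name with the i-th rounding unit (DAY or SECOND). */
    public static Map<String, ChronoUnit> pair(List<String> fields, List<String> units) {
        if (fields.size() != units.size()) {
            throw new IllegalArgumentException(
                    "date.fields and date.round.fields must have the same size");
        }
        Map<String, ChronoUnit> rules = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) {
            rules.put(fields.get(i), toUnit(units.get(i)));
        }
        return rules;
    }

    private static ChronoUnit toUnit(String name) {
        if ("DAY".equalsIgnoreCase(name)) {
            return ChronoUnit.DAYS;
        }
        if ("SECOND".equalsIgnoreCase(name)) {
            return ChronoUnit.SECONDS;
        }
        throw new IllegalArgumentException("Unsupported rounding: " + name);
    }

    /** Rounds the configured fields of a document in place; other fields are untouched. */
    public static void round(Map<String, String> doc, Map<String, ChronoUnit> rules) {
        for (Map.Entry<String, ChronoUnit> rule : rules.entrySet()) {
            String value = doc.get(rule.getKey());
            if (value != null) {
                doc.put(rule.getKey(),
                        Instant.parse(value).truncatedTo(rule.getValue()).toString());
            }
        }
    }
}
```

Validating the two lists once, when the configuration is read, keeps the per-document work to a simple map lookup.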

The code is shown below.
It currently only supports rounding a date to keep the date or second parts, but you can easily add code to round to the year, month, hour, or minute.
package org.codeexample.jeffery.solr;
public class DateRoundProcessorFactory extends UpdateRequestProcessorFactory {

	private List<String> dateFields;
	private List<String> dateRoundFields;
	// ignoreError
	private boolean ignoreError;

	private static String ROUND_DAY = "DAY";
	private static String FORMAT_DAY = "yyyy-MM-dd'T'00:00:00.0'Z'";

	// yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
	private static String ROUND_SECOND = "SECOND";
	private static String FORMAT_SECOND = "yyyy-MM-dd'T'HH:mm:ss'Z'";

	@SuppressWarnings("rawtypes")
	@Override
	public void init(final NamedList args) {
		if (args != null) {
			SolrParams params = SolrParams.toSolrParams(args);
			Object fields = args.get("date.fields");
			dateFields = fields == null ? null : StrUtils.splitSmart(
					(String) fields, ",", true);

			fields = args.get("date.round.fields");
			dateRoundFields = fields == null ? null : StrUtils.splitSmart(
					(String) fields, ",", true);

			if ((dateFields == null && dateRoundFields != null)
					|| (dateFields != null && dateRoundFields == null)
					|| (dateFields != null && dateRoundFields != null
							&& dateFields.size() != dateRoundFields.size()))
				throw new IllegalArgumentException(
						"Size of date.fields and date.round.fields must be same.");
			ignoreError = params.getBool("ignoreError", false);
		}
	}

	@Override
	public UpdateRequestProcessor getInstance(SolrQueryRequest req,
			SolrQueryResponse rsp, UpdateRequestProcessor next) {
		return new DateRoundProcessor(req, next);
	}

	class DateRoundProcessor extends UpdateRequestProcessor {
		public DateRoundProcessor(SolrQueryRequest req,
				UpdateRequestProcessor next) {
			super(next);
		}

		@Override
		public void processAdd(AddUpdateCommand cmd) throws IOException {
			SolrInputDocument solrInputDocument = cmd.getSolrInputDocument();
			for (int i = 0; i < dateFields.size(); i++) {
				try {
					String dateField = dateFields.get(i);
					SolrInputField inputField = solrInputDocument
							.getField(dateField);

					if (inputField != null) {
						Object obj = inputField.getValue();
						Object result = null;
						if (obj instanceof String) {
							String value = (String) obj;
							Date solrDate = parseSolrDate(value);
							String roundTo = dateRoundFields.get(i);
							DateFormat df = null;
							if (ROUND_DAY.equalsIgnoreCase(roundTo)) {
								df = new SimpleDateFormat(FORMAT_DAY);
							} else if (ROUND_SECOND.equalsIgnoreCase(roundTo)) {
								df = new SimpleDateFormat(FORMAT_SECOND);
							}
							if (df != null) {
								// format in UTC, the same zone used for parsing
								df.setTimeZone(TimeZone.getTimeZone("UTC"));
								result = df.format(solrDate);
								// only remove it, if there is no error
								solrInputDocument.removeField(dateField);
								solrInputDocument.addField(dateField, result);
							}
						}
					}
				} catch (Exception ex) {
					if (!ignoreError) {
						throw new IOException(ex);
					}
				}
			}
			super.processAdd(cmd);
		}
	}

	public Date parseSolrDate(String dateString) throws ParseException {
		SimpleDateFormat sdf = new SimpleDateFormat(
				"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.US);
		sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
		return sdf.parse(dateString);
	}
}

You can view the complete source code here:
https://github.com/jefferyyuan/solr.misc

Part 3: Use Pack200 to Shrink Solr Application Size

Part 1: Shrink Solr Application Size
Part 2: Use Proguard to Shrink Solr Application Size
Part 3: Use Pack200 to Shrink Solr Application Size
To further reduce the size of the installation file, I decided to use Pack200 to shrink the jars.

Please refer to http://docs.oracle.com/javase/1.5.0/docs/guide/deployment/deployment-guide/pack200.html
https://blogs.oracle.com/manveen/entry/pack200_and_compression_through_ant

This reduces the total size of all jars from 6.02 MB to 4.44 MB, about 26% less.

The following is the Ant script to pack all jars:
<property name="jarpack-task.jar" value="C:\pathto\Pack200Task.jar" />
<taskdef name="pack200" classname="com.sun.tools.apache.ant.pack200.Pack200Task" classpath="${jarpack-task.jar}" />
<taskdef name="unpack200" classname="com.sun.tools.apache.ant.pack200.Unpack200Task" classpath="${jarpack-task.jar}" />

<target name="pack.all.jars">
 <ac:foreach target="pack.jar" param="file.name">
  <path>
   <fileset dir="${final.jars.output}" includes="*.jar" />
  </path>
 </ac:foreach>
</target>

<target name="pack.jar" description="Applying the pack utility on jars">
 <basename property="file.basename" file="${file.name}" />
 <echo message="pack ${file.name} to ${final.jars.output}/${file.basename}.pack" />
 <pack200 src="${file.name}" destfile="${final.jars.output}/${file.basename}.pack" stripdebug="true" deflatehint="keep" unknownattribute="pass" keepfileorder="true" />
 <delete file="${file.name}" />
</target>

We can use ANT to unpack these jars:
<target name="unpack.all.jars" >
  <ac:foreach target="unpack.jar" param="file.name">
   <path>
    <fileset dir="${runtime.home}" includes="*.pack" />
    <fileset dir="${runtime.home.lib}" includes="*.pack" />
    <fileset dir="${runtime.solr.war.lib}" includes="*.pack" />
    <fileset dir="${runtime.solr.core.lib}" includes="*.pack" />
   </path>
  </ac:foreach>
 </target>

 <target name="unpack.jar">
  <propertyregex property="file.unpack.name" input="${file.name}" regexp="(.*)\.pack$" select="\1" />
  <echo message="unpack file ${file.name} to ${file.unpack.name}" />
  <unpack200 src="${file.name}" dest="${file.unpack.name}" />
  <delete file="${file.name}" />
 </target>
Or we can use a Windows batch script (or an equivalent Linux shell script) to do this:
@ECHO OFF 

echo "Unpack startjetty.jar.pack"
CALL :unpackjar startjetty.jar.pack 

echo "Unpack jars in folder ./lib"
For %%X in (lib\*.pack) do CALL :unpackjar %%X

echo "Unpack jars in folder ./solr.war\WEB-INF\lib"
For %%X in (solr.war\WEB-INF\lib\*.pack) do CALL :unpackjar %%X

echo "Unpack jars in folder ./solr-home\collection1\lib"
For %%X in (solr-home\collection1\lib\*.pack) do CALL :unpackjar %%X

GOTO :EOF

:unpackjar
set packedfile=%1
set unpackedfile=%packedfile:~0,-5%
echo unpack file: %unpackedfile% %packedfile%
unpack200 %packedfile% %unpackedfile%
DEL /Q %packedfile%
GOTO :EOF

@ECHO ON

After all these steps, we use 7-Zip to compress the application; the final size is 1,779 KB.
You can view all the source code on GitHub:
https://github.com/jefferyyuan/tools/tree/master/ant-scripts/shrink-solr
