Solr: Use JSON (Gson) Streaming to Reduce Memory Usage

My Solr application runs on users' laptops with the max heap set to 512 MB. It pulls JSON data from a remote proxy that talks to the remote Solr server: 100 documents at a time, committing after every 20 fetches.

Our code read the whole JSON response into memory, then used UpdateRequestProcessor.processAdd(AddUpdateCommand) to add each document to the local Solr.

Recently it started throwing OutOfMemoryError. After using Eclipse Memory Analyzer (MAT) to analyze the heap dump file, I found the cause: the data returned from the remote proxy is too large. One document is 50-60 KB on average, but some are huge, so 100 documents can add up to 60 MB. This is a rare case, but when it happens the application throws OutOfMemoryError and stops working.

To fix this and reduce memory usage on the client side, I took several measures:

1. Restart the application when OutOfMemoryError happens.
2. Run a thread to monitor free memory: below a 40% threshold, run GC; below 30%, decrease the fetch size (100 to 50, then to 25) and the commit interval (from every 20 fetches to every 10); below 50 MB of free memory, restart the application. A sketch of this monitor follows the list.
3. Enable autoSoftCommit and autoCommit, and reduce the Solr cache sizes.
4. Use streaming JSON - this is the topic of this article: read documents one by one from the HTTP input stream and put them on a queue, instead of reading the whole big response into memory. Another thread is responsible for writing the documents to the local Solr.
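Measure 2 is straightforward to implement with a scheduled task. Below is a minimal sketch: the thresholds are the ones listed above, while the 10-second period and the two helper methods are assumptions for illustration, not the production code.

// Minimal sketch of the memory monitor in measure 2. The 10-second period
// and the two helpers are hypothetical; the thresholds match the list above.
private static void startMemoryMonitor() {
  ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
  monitor.scheduleAtFixedRate(new Runnable() {
    public void run() {
      Runtime rt = Runtime.getRuntime();
      long max = rt.maxMemory();
      // free memory in the current heap plus what the heap can still grow by
      long available = rt.freeMemory() + (max - rt.totalMemory());
      double freeRatio = (double) available / max;
      if (available < 50L * 1024 * 1024) {
        restartApplication();                 // hypothetical helper
      } else if (freeRatio < 0.3) {
        decreaseFetchSizeAndCommitInterval(); // hypothetical helper
      } else if (freeRatio < 0.4) {
        System.gc(); // only a hint; the JVM may ignore it
      }
    }
  }, 0, 10, TimeUnit.SECONDS);
}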

The same approach applies if we use XML: we can use StAX or SAX to read documents one by one, as in the sketch below.
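With StAX, for instance, a cursor-style loop keeps only the current event in memory; the doc element name and the handleOneDoc helper here are assumptions about the XML layout:

// StAX cursor API: pull one event at a time instead of building a DOM tree.
private static void readDocs(InputStream in) throws XMLStreamException {
  XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
  while (xml.hasNext()) {
    if (xml.next() == XMLStreamConstants.START_ELEMENT
        && "doc".equals(xml.getLocalName())) {
      handleOneDoc(xml); // hypothetical: consume fields up to the matching end tag
    }
  }
  xml.close();
}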

I use Gson; for how to use Gson streaming to read and write JSON, please read Gson Streaming.
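The writing side of the Gson streaming API is symmetric. It is not needed in this application, but a minimal example looks like this:

// Minimal Gson JsonWriter usage: stream a small response object out.
private static void writeDocs(OutputStream out) throws IOException {
  JsonWriter writer = new JsonWriter(new OutputStreamWriter(out, "UTF-8"));
  writer.beginObject();
  writer.name("numFound").value(2);
  writer.name("docs");
  writer.beginArray();
  writer.beginObject().name("id").value("1").endObject();
  writer.beginObject().name("id").value("2").endObject();
  writer.endArray();
  writer.endObject();
  writer.close(); // flushes and closes the underlying stream
}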

The code to read documents one by one from the HTTP stream. It uses the Gson streaming API, plus Java executors and Futures to wait until all threads have finished, meaning all docs are imported:
/**
 * Use Gson Streaming API to read documents one by one to reduce memory usage
 */
private static ImportedResult handleResponse(SolrQueryRequest request,
    InputStream in, int fetchSize) throws UnsupportedEncodingException,
    IOException {
  ImportedResult importedResult = new ImportedResult();
  JsonReader reader = null;
  List<Future<Void>> futures = new ArrayList<Future<Void>>();
  
  try {
    reader = new JsonReader(new InputStreamReader(in, "UTF-8"));
    reader.beginObject(); // the root object
    String str = reader.nextName(); // the wrapper name, e.g. "response"
    reader.beginObject(); // the inner object holding numFound, start and docs
    int fetchedSize = 0;
    int numFound = -1, start = -1;
    while (reader.hasNext()) {
      str = reader.nextName();
      if ("numFound".equals(str)) {
        numFound = Integer.valueOf(reader.nextString());
      } else if ("start".equals(str)) {
        start = Integer.valueOf(reader.nextString());
      } else if ("docs".equals(str)) {
        reader.beginArray();
        // read documents
        while (reader.hasNext()) {
          fetchedSize++;
          // collect the Future so waitComplete below has something to wait on
          futures.add(readOneDoc(request, reader));
        }
        
        reader.endArray();
      }
    }
    
    reader.endObject(); // close the inner object
    reader.endObject(); // close the root object
    waitComplete(futures);
    importedResult.setFetched(fetchedSize);
    importedResult.setHasMore((fetchedSize + start) < numFound);
    importedResult.setImportedData((fetchedSize != 0));
    return importedResult;
  } finally {
    if (reader != null) {
      reader.close();
    }
  }
}

// Read one document's fields and submit an asynchronous import task.
private static java.util.concurrent.Future<Void> readOneDoc(
    SolrQueryRequest request, JsonReader reader) throws IOException {
  String str;
  reader.beginObject();
  String id = null, binaryDoc = null;
  while (reader.hasNext()) {
    str = reader.nextName();
    
    if ("id".equals(str)) {
      id = reader.nextString();
    } else if ("binaryDoc".equals(str)) {
      binaryDoc = reader.nextString();
    }
  }
  reader.endObject();
  return CVSyncDataImporter.getInstance().importData(request, id,
      binaryDoc);
}
The code to write documents to the local Solr:
public Future<Void> importData(SolrQueryRequest request, String id,
    String binaryDoc) {
  if (id == null) {
    throw new IllegalArgumentException("id can't be null.");
  }
  if (binaryDoc == null) {
    throw new IllegalArgumentException("binaryDoc can't be null.");
  }
  SolrDataImporter task = new SolrDataImporter(request, id, binaryDoc);
  // executor is a class field, e.g. Executors.newFixedThreadPool(n); not shown here
  return executor.submit(task);
}
// Base64-decode the payload, then unzip it back into a SolrInputDocument.
private static SolrInputDocument convertToSolrDoc(String id,
    String binaryDoc) throws IOException {
  byte[] bindata = Base64.base64ToByteArray(binaryDoc);
  SolrInputDocument resultDoc = (SolrInputDocument) readZippedFile(bindata);
  resultDoc.setField("id", id);
  return resultDoc;
}

private class SolrDataImporter implements Callable<Void> {
  private SolrQueryRequest request;
  private String id, binaryDoc;

  SolrDataImporter(SolrQueryRequest request, String id, String binaryDoc) {
    this.request = request;
    this.id = id;
    this.binaryDoc = binaryDoc;
  }

  @Override
  public Void call() {
    try {
      UpdateRequestProcessorChain updateChain = request.getCore()
          .getUpdateProcessingChain("mychain");
      SolrInputDocument toSolrServerSolrDoc = convertToSolrDoc(id,
          binaryDoc);
      binaryDoc = null; // release the big base64 string as early as possible
      AddUpdateCommand command = new AddUpdateCommand(request);
      command.solrDoc = toSolrServerSolrDoc;
      SolrQueryResponse response = new SolrQueryResponse();
      UpdateRequestProcessor processor = updateChain.createProcessor(request,
          response);
      processor.processAdd(command);
    } catch (Exception e) {
      logger.error("Exception happened when importing data, id: " + id, e);
    }
    return null;
  }
}
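The readZippedFile helper is not shown either. Judging from the cast in convertToSolrDoc, my guess is that the proxy sends a gzipped, Java-serialized SolrInputDocument; under that assumption it would look roughly like this:

// Assumption: bindata is a gzipped, Java-serialized object.
private static Object readZippedFile(byte[] bindata) throws IOException {
  ObjectInputStream ois = null;
  try {
    ois = new ObjectInputStream(
        new GZIPInputStream(new ByteArrayInputStream(bindata)));
    return ois.readObject();
  } catch (ClassNotFoundException e) {
    throw new IOException(e);
  } finally {
    if (ois != null) {
      ois.close();
    }
  }
}

One final design note: if the executor's work queue is unbounded and the local Solr indexes more slowly than the proxy delivers, queued documents can still pile up in memory. Bounding the queue so that submit blocks the reader thread keeps the memory ceiling flat regardless of document size.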