Java: Use Zip Stream and Base64 Encoder to Compress Large String Data

Problem
In my project, we run queries against a Solr server and return a combined response to the client. Some text fields are very large, so we would like to reduce their size.

Use GZIPOutputStream or ZipOutputStream?
To compress data, Java provides two stream classes: GZIPOutputStream and ZipOutputStream. What is the difference, and which should we use?
The compression algorithm and performance are almost the same (both formats use DEFLATE). The difference is the container format: GZIP compresses a single stream of data, while ZIP is an archive format that can hold many files in a single archive, using putNextEntry and closeEntry to add each entry.

In this case we use GZIPOutputStream: we are not adding multiple files to a single archive, so there is no need for ZipOutputStream, and the code for GZIPOutputStream is also a little simpler.
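To illustrate the difference, here is a minimal sketch (the class and method names are made up for this example) of how ZipOutputStream requires an explicit entry per file via putNextEntry/closeEntry, which GZIPOutputStream does not:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipArchiveSketch {
  /** Compress two "files" into one in-memory ZIP archive. */
  public static byte[] zipTwoEntries(String name1, String body1,
                                     String name2, String body2) throws IOException {
    ByteArrayOutputStream bao = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(bao)) {
      zos.putNextEntry(new ZipEntry(name1)); // start the first archive entry
      zos.write(body1.getBytes(StandardCharsets.UTF_8));
      zos.closeEntry();
      zos.putNextEntry(new ZipEntry(name2)); // a second entry in the same archive
      zos.write(body2.getBytes(StandardCharsets.UTF_8));
      zos.closeEntry();
    }
    return bao.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] zip = zipTwoEntries("a.txt", "hello", "b.txt", "world");
    System.out.println(zip.length > 0); // prints true
  }
}
```

Since we only ever compress one string at a time, this entry bookkeeping is pure overhead for our use case.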

Use GZIPOutputStream and Base64 Encoder to Compress String
At the server side, we use GZIPOutputStream to compress a string into a byte array held in a ByteArrayOutputStream. But we can't transfer the raw byte array as text in an HTTP response, so we use a Base64 encoder, such as org.apache.commons.codec.binary.Base64.encodeBase64String(), to encode the byte array as a Base64 string. Then we add the compressed text as a field in the Solr document (not shown in the code below).
/**
   * At the server side, use GZIPOutputStream to compress the text to a byte
   * array, then encode the byte array as a Base64 string so it can be
   * transferred in an HTTP response.
   */
  public static String compressString(String srcTxt)
      throws IOException {
    ByteArrayOutputStream rstBao = new ByteArrayOutputStream();
    GZIPOutputStream zos = new GZIPOutputStream(rstBao);
    zos.write(srcTxt.getBytes("UTF-8"));
    // Closing the stream flushes the compressor and writes the GZIP trailer.
    IOUtils.closeQuietly(zos);
    
    byte[] bytes = rstBao.toByteArray();
    // In my Solr project, I use org.apache.solr.common.util.Base64:
    // return org.apache.solr.common.util.Base64.byteArrayToBase64(bytes, 0,
    //     bytes.length);
    return Base64.encodeBase64String(bytes);
  }
Test Data:
The original text in memory is about 134,479,520 bytes (134 MB); the gzipped byte array is about 9,001,240 bytes (9 MB); the Base64 string is 16,198,528 bytes (16 MB).
The transferred size is reduced by about 88%. This is a huge saving and well worth it.
Use Base64 Decoder and GZIPInputStream to Uncompress String
At the remote client side, we first read the text response from the stream. For how to read one Solr document using a streaming API, please read:
Solr: Use STAX Parser to Read XML Response to Reduce Memory Usage
Solr: Use SAX Parser to Read XML Response to Reduce Memory Usage
Solr: Use JSON(GSon) Streaming to Reduce Memory Usage

Then we use org.apache.commons.codec.binary.Base64.decodeBase64() to decode the Base64 string back to a byte array, use GZIPInputStream to read the compressed byte array and recover the original string, and add it to the Solr document as a field.
/**
   * When the client receives the gzipped Base64 string, it first decodes the
   * Base64 string to a byte array, then uses GZIPInputStream to revert the
   * byte array to the original string.
   */
  public static String uncompressString(String zippedBase64Str)
      throws IOException {
    String result = null;
    
    // In my solr project, I use org.apache.solr.common.util.Base64.
    // byte[] bytes =
    // org.apache.solr.common.util.Base64.base64ToByteArray(zippedBase64Str);
    byte[] bytes = Base64.decodeBase64(zippedBase64Str);
    GZIPInputStream zi = null;
    try {
      zi = new GZIPInputStream(new ByteArrayInputStream(bytes));
      result = IOUtils.toString(zi, "UTF-8");
    } finally {
      IOUtils.closeQuietly(zi);
    }
    return result;
  }
Test Code
public static void main(String... args) throws IOException {
    String source = "-original-file-path-";
    String zippedFile = "-base-64-zip-file-path-";
    FileInputStream fis = new FileInputStream(source);
    String srcTxt = IOUtils.toString(fis, "UTF-8");
    IOUtils.closeQuietly(fis);
    
    String str = compressString(srcTxt);
    FileWriter fw = new FileWriter(zippedFile);
    IOUtils.write(str, fw);
    IOUtils.closeQuietly(fw);
    
    fis = new FileInputStream(zippedFile);
    String zippedBase64Str = IOUtils.toString(fis, "UTF-8");
    IOUtils.closeQuietly(fis);
    
    String originalStr = uncompressString(zippedBase64Str);
    fw = new FileWriter("-reverted-file-path-");
    IOUtils.write(originalStr, fw);
    IOUtils.closeQuietly(fw);
  }
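For a quick sanity check without writing any files, here is a self-contained round-trip sketch. Note it swaps commons-codec's Base64 for the JDK's own java.util.Base64 (available since Java 8) and uses try-with-resources instead of IOUtils, so it has no third-party dependencies; the class name is made up for this example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBase64RoundTrip {

  public static String compress(String src) throws IOException {
    ByteArrayOutputStream bao = new ByteArrayOutputStream();
    try (GZIPOutputStream zos = new GZIPOutputStream(bao)) {
      zos.write(src.getBytes(StandardCharsets.UTF_8));
    } // closing the stream writes the GZIP trailer
    return Base64.getEncoder().encodeToString(bao.toByteArray());
  }

  public static String uncompress(String zippedBase64) throws IOException {
    byte[] bytes = Base64.getDecoder().decode(zippedBase64);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream zi = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = zi.read(buf)) != -1) {
        out.write(buf, 0, n); // copy decompressed bytes
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // Repetitive text compresses well, like the large Solr fields in practice.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      sb.append("some highly repetitive text ");
    }
    String src = sb.toString();
    String zipped = compress(src);
    System.out.println(src.equals(uncompress(zipped))); // prints true
  }
}
```

The behavior is the same as compressString/uncompressString above; only the Base64 implementation and resource handling differ.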
Resource
Solr: Use STAX Parser to Read XML Response to Reduce Memory Usage
Solr: Use SAX Parser to Read XML Response to Reduce Memory Usage
Solr: Use JSON(GSon) Streaming to Reduce Memory Usage
Tips and pitfalls when using Java’s ZipOutputStream
GZIPOutputStream vs ZipOutputStream