Thursday, September 24, 2015

File Compression Using Codecs in MapReduce



 File compression brings two major benefits: 

  • it reduces the space needed to store files, and 
  • it speeds up data transfer across the network or to or from disk.

If the input file is compressed, fewer bytes are read in from HDFS, which means less time spent reading data. This saving benefits overall job execution time.
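To see how much compression can save, here is a small self-contained illustration using plain java.util.zip rather than the Hadoop codec API (the class name GzipRoundTrip is mine): repetitive text, like typical log data, compresses to a small fraction of its size, and decompression recovers it exactly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws IOException {
        // Repetitive input, similar to typical log data
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("2015-09-24 INFO job started\n");
        byte[] original = sb.toString().getBytes("UTF-8");

        // Compress into an in-memory buffer
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(original);
        }
        byte[] compressed = buf.toByteArray();
        System.out.println(compressed.length < original.length / 10); // true

        // Decompress and verify the round trip
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] b = new byte[4096];
            for (int n; (n = in.read(b)) != -1; ) out.write(b, 0, n);
        }
        System.out.println(Arrays.equals(original, out.toByteArray())); // true
    }
}
```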

If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz is identified as a gzip-compressed file and is therefore read with GzipCodec.
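The extension-to-codec lookup that MapReduce performs can also be done directly with CompressionCodecFactory. A minimal sketch, assuming a Hadoop classpath is available (the class name CodecLookup and the example path are mine):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // getCodec() picks the codec from the filename extension,
        // or returns null if no registered codec matches
        CompressionCodec codec = factory.getCodec(new Path("logs/events.gz"));
        if (codec != null) {
            System.out.println(codec.getClass().getSimpleName()); // typically GzipCodec
        }
    }
}
```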


Compression has some tradeoffs too. Most compressed formats are not splittable: a gzip file, for example, must be read by a single map task in a Hadoop MapReduce job no matter how large it is. Of the common codecs, only bzip2 supports splitting of compressed files.

Here are the key steps to compress a file from code using a Hadoop codec:


  • Register the codec class
       Class<?> codecClass = Class.forName(codecClassName);
  • Create an instance of the codec
       CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
  • Create a compression output stream to write through
       CompressionOutputStream outFile = codec.createOutputStream(fs.create(new Path(args[1])));
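When the goal is to have MapReduce itself compress job output, rather than compressing by hand as above, the usual approach is to configure it on the Job. A minimal sketch, assuming the newer mapreduce API (the class and job names are mine):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compressed-output");
        // Ask the framework to compress job output with gzip
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }
}
```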


Full code to compress a file:

package com.valassis.codec;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompression {

    /**
     * @param args
     * @throws ClassNotFoundException
     * @throws IOException
     */
    public static void main(String[] args) throws ClassNotFoundException, IOException {
        String codecClassName = args[0]; // e.g. org.apache.hadoop.io.compress.GzipCodec
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Register the codec class and create an instance of it
        Class<?> codecClass = Class.forName(codecClassName);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Wrap the HDFS output stream with the codec's compression stream
        CompressionOutputStream outFile = codec.createOutputStream(fs.create(new Path(args[1])));
        // Copy stdin into the compressed stream, then flush and close
        IOUtils.copyBytes(System.in, outFile, 4096, false);
        outFile.finish();
        outFile.close();
    }

}
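For completeness, the reverse direction, reading a compressed file from HDFS and writing the decompressed bytes to stdout, follows the same pattern, except the codec is inferred from the filename via CompressionCodecFactory. A sketch under the same assumptions (the class name StreamDecompression is mine):

```java
package com.valassis.codec;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class StreamDecompression {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inPath = new Path(args[0]);
        // Infer the codec from the filename extension (.gz, .bz2, ...)
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(inPath);
        if (codec == null) {
            System.err.println("No codec found for " + inPath);
            return;
        }
        // Wrap the HDFS input stream with the codec's decompression stream
        CompressionInputStream in = codec.createInputStream(fs.open(inPath));
        IOUtils.copyBytes(in, System.out, 4096, true);
    }
}
```

Both programs would be run from a node with Hadoop installed, via hadoop jar with the class name and the HDFS path as arguments.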
