File compression brings two major benefits:
- it reduces the space needed to store files, and
- it speeds up data transfer across the network or to or from disk.
If the input file is compressed, then the bytes read in from HDFS is reduced, which means less time to read data. This time conservation is beneficial to the performance of job execution.
If the input files are compressed, they will be decompressed automatically as they are read by MapReduce, using the filename extension to determine which codec to use. For example, a file ending in .gz can be identified as gzip-compressed file and thus read with GzipCodec.
Compression has some tradeoffs too. Compressed data is not splittable. so only one map is going to read gzip file in hadoop map-reduce. Only bzip2 support splitting with compression.
Here is the code to compress the file/input in map-reduce.
- Register a codec
- create a instance of codec
- create compression output stream to write
CompressionOutputStream outFile = codec.createOutputStream(fs.create(new Path(args[1])));
Code to compression the file
package com.valassis.codec;import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;
public class SteamCompression {
/**
* @param args
* @throws ClassNotFoundException
* @throws IOException
*/
public static void main(String[] args) throws ClassNotFoundException, IOException {
// TODO Auto-generated method stub
String codecClassName = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Class<?> codecClass = Class.forName(codecClassName);
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
CompressionOutputStream outFile = codec.createOutputStream(fs.create(new Path(args[1])));
IOUtils.copyBytes(System.in, outFile, 4096,false);
outFile.finish();
}
}
No comments:
Post a Comment