Monday, September 7, 2015

Pig and its Transformations


This blog gives you an overview of Pig and its various transformations.



Pig Programming  -  a dataflow language developed at Yahoo for data scientists who have less exposure to Java. Its scripting language is Pig Latin.

It has two execution modes: local for standalone use (files on the local file system) and mapreduce for distributed execution on a Hadoop cluster.
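For example, the mode can be chosen when launching the grunt shell (mapreduce is the default):

pig -x local
pig -x mapreduce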

It is very powerful for ETL operations such as slicing and filtering data.


In mapreduce mode, the input file should be in HDFS.
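For example, the emp.txt file used below could be copied into HDFS with the standard shell command:

hadoop fs -put emp.txt /usr/hadoop/training/emp.txt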

Various transformations in Pig

Load - loads the data into aliases a and b, which hold relations (datasets, in general terms):

a = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);
b = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

STORE - stores the relation into the specified HDFS output folder:
STORE b INTO '<output folder>' USING PigStorage('\t');

-- union

c = UNION a, b;

only distinct (returns distinct records)

d = distinct c;

split data
split c into f if empno == 10, g if empno == 20;
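Rows matching neither condition are simply dropped; Pig 0.10 and later also support an otherwise branch to catch the rest:

split c into f if empno == 10, g if empno == 20, h otherwise;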

sample data of 40%
x = sample c 0.4;

filter data (like a WHERE clause)

g = filter c by empno > 20;

order clause
y = order c by name;
y = order c by name desc;

--grouping
z = group c by name;

loop over each record (foreach projects the listed fields)
loopresult = foreach c generate name, sal;

grouping multiple relations
cogroupvar = cogroup a by name, b by name;

kind of inner join (empty groups from either relation are filtered out)
cogroupvar = cogroup a by name inner, b by name inner;


Note the difference: join creates a flat set of output tuples, whereas cogroup creates groups of bags.

-- describe command (shows the schema of a relation)


describe cogroupvar;

--join
h = join a by empno, b by empno;

cross product
r = cross a, b;

Describe, illustrate and explain

grunt> describe h;                    
h: {a::empno: int,a::name: chararray,a::sal: int,b::empno: int,b::name: chararray,b::sal: int}
grunt> describe cogroupvar;
cogroupvar: {group: chararray,a: {(empno: int,name: chararray,sal: int)},b: {(empno: int,name: chararray,sal: int)}}
grunt>


describe <alias> - shows the relation schema (table definition)

illustrate <alias> - shows a sample execution of the logical plan on a small subset of the data

explain <alias> - shows the logical, physical, and MapReduce execution plans
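For example, on the join relation h from above:

illustrate h;
explain h;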

Built-in functions - AVG, COUNT, CONCAT, DIFF, MAX, MIN, SIZE, SUM.
TOKENIZE - splits a chararray into a bag of words
IsEmpty - checks whether a bag or map is empty
encrypt(ssn) - an example of a custom function (see user-defined functions below)
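Reusing the grouped relation z from above, a minimal sketch of a few of these built-ins in action:

sal_stats = foreach z generate group as name, COUNT(c) as cnt, AVG(c.sal) as avg_sal, MAX(c.sal) as max_sal;
dump sal_stats;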

[parallel m] - requests m reduce tasks; it can be attached to any transformation that triggers a reduce phase (group, order, join, etc.)
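For example:

z2 = group c by name parallel 4;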


--------------User-defined functions

To write a filter UDF in Java, extend FilterFunc and override public Boolean exec(Tuple t) throws IOException.
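On the Pig side, a minimal sketch of registering and using such a UDF, assuming a hypothetical class myudfs.IsHighPaid compiled into myudfs.jar (both names are illustrative, not from this post):

register myudfs.jar;
highpaid = filter c by myudfs.IsHighPaid(sal);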


A basic Pig word count program:

lines = load '/usr/hadoop/training/input.txt' as (line:chararray);
-- split each line into a bag of words and flatten to one word per record
words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;
-- keep only tokens made of word characters
filter_words = filter words by word matches '\\w+';
word_groups = group filter_words by word;
word_count = foreach word_groups generate COUNT(filter_words) as count, group as word;

dump word_count;

Cloudera Certified Hadoop Developer - CDH5





I would like to share some tips on preparing for the Cloudera certification. Most importantly, the Cloudera exam is now on CDH version 5. Please note that you should have hands-on experience with CDH5 as well as good exposure to MapReduce programming and design patterns; otherwise consider yourself only half prepared. 30-40% of the questions in the exam were scenario based, covering MapReduce programming and the options for Sqoop, Hive, or file system commands.

I suggest downloading and installing the QuickStart VM on your machine and practicing:
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html


If you are very serious about the Hadoop certification, I highly recommend the Cloudera Developer Training for Apache Hadoop.

Material I referred to during preparation:


  • Hadoop: The Definitive Guide (4th Edition) by Tom White
  • MapReduce Design Patterns by Donald Miner
  • A good understanding of Sqoop 1.4, Sqoop2, and Hive is required for passing the exam. An overview of Pig, Oozie, Flume, Avro, and Crunch is also required, as you may see a few questions on them in the exam.
  • An overview of the ecosystem can be found via the link below: http://hadoopecosystemtable.github.io/
  • https://developer.yahoo.com/hadoop/tutorial/
  • Understand HDFS commands with their various options.
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
  • A deep understanding of classic MapReduce and the YARN architecture, replicated and repartition joins, sorting, etc.
  • You should be good at core Java, IO/NIO, and regular expressions.
  • Knowledge of the available input/output formats; you should be able to create your own input and output formats.
  • Custom Writable, WritableComparable, RawComparator
  • You should be comfortable writing a MapReduce job given a complex SQL (HiveQL) query.
Here is my certificate:



