Monday, September 7, 2015

Pig and its Transformations


This blog gives you an overview of Pig and its various transformations.



Pig Programming  -  a dataflow language developed at Yahoo for data scientists who have less exposure to Java. Its scripting language is Pig Latin.

It has two execution modes: local for standalone use (files on the local file system) and mapreduce for distributed execution on a Hadoop cluster.
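For example, the mode can be chosen when launching the grunt shell (mapreduce is the default):

pig -x local
pig -x mapreduce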

It is very powerful for ETL operations such as slicing and filtering data.


In mapreduce mode, the input file should be in HDFS.
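For example, the emp.txt file used below could be copied into HDFS with the standard shell command:

hadoop fs -put emp.txt /usr/hadoop/training/emp.txt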

Various transformations in Pig

Load - loads the data into aliases a and b, which hold relations (datasets, in general terms):

a = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);
b = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

STORE - stores the relation into the specified HDFS output folder:
STORE b INTO '<output folder>' USING PigStorage('\t');

-- union

c = UNION a, b;

only distinct (returns distinct records)

d = distinct c;

split data
split c into f if empno == 10, g if empno == 20;
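Rows matching neither condition are simply dropped; Pig 0.10 and later also support an otherwise branch to catch the rest:

split c into f if empno == 10, g if empno == 20, h otherwise;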

sample data of 40%
x = sample c 0.4;

filter data (like a WHERE clause)

g = filter c by empno > 20;

order clause
y = order c by name;
y = order c by name desc;

--grouping
z = group c by name;

loop over each record (foreach projects the listed fields)
loopresult = foreach c generate name, sal;

grouping multiple relations
cogroupvar = cogroup a by name, b by name;

kind of inner join (empty groups from either relation are filtered out)
cogroupvar = cogroup a by name inner, b by name inner;


Note the difference: join creates a flat set of output tuples, whereas cogroup creates groups of bags.

-- describe command (shows the schema of a relation)


describe cogroupvar;

--join
h = join a by empno, b by empno;

cross product
r = cross a, b;

Describe, illustrate and explain

grunt> describe h;                    
h: {a::empno: int,a::name: chararray,a::sal: int,b::empno: int,b::name: chararray,b::sal: int}
grunt> describe cogroupvar;
cogroupvar: {group: chararray,a: {(empno: int,name: chararray,sal: int)},b: {(empno: int,name: chararray,sal: int)}}
grunt>


describe <alias> - shows the relation schema (table definition)

illustrate <alias> - shows a sample execution of the logical plan on a small subset of the data

explain <alias> - shows the logical, physical, and MapReduce execution plans
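For example, on the join relation h from above:

illustrate h;
explain h;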

Built-in functions - AVG, COUNT, CONCAT, DIFF, MAX, MIN, SIZE, SUM.
TOKENIZE - splits a chararray into a bag of words
IsEmpty - checks whether a bag or map is empty
encrypt(ssn) - an example of a custom function (see user-defined functions below)
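Reusing the grouped relation z from above, a minimal sketch of a few of these built-ins in action:

sal_stats = foreach z generate group as name, COUNT(c) as cnt, AVG(c.sal) as avg_sal, MAX(c.sal) as max_sal;
dump sal_stats;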

[parallel m] - requests m reduce tasks; it can be attached to any transformation that triggers a reduce phase (group, order, join, etc.)
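For example:

z2 = group c by name parallel 4;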


--------------User-defined functions

To write a filter UDF in Java, extend FilterFunc and override public Boolean exec(Tuple t) throws IOException.
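On the Pig side, a minimal sketch of registering and using such a UDF, assuming a hypothetical class myudfs.IsHighPaid compiled into myudfs.jar (both names are illustrative, not from this post):

register myudfs.jar;
highpaid = filter c by myudfs.IsHighPaid(sal);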


A basic Pig word count program:

lines = load '/usr/hadoop/training/input.txt' as (line:chararray);
-- split each line into a bag of words and flatten to one word per record
words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;
-- keep only tokens made of word characters
filter_words = filter words by word matches '\\w+';
word_groups = group filter_words by word;
word_count = foreach word_groups generate COUNT(filter_words) as count, group as word;

dump word_count;

Cloudera Certified Hadoop Developer - CDH5





I would like to share some tips on preparing for the Cloudera certification. Most importantly, the Cloudera exam is now on CDH version 5. Please note that you should have hands-on experience with CDH5 as well as good exposure to MapReduce programming and design patterns; otherwise consider yourself only half prepared. 30-40% of the questions in the exam were scenario based, covering MapReduce programming and the options for Sqoop, Hive, or file system commands.

I suggest downloading and installing the QuickStart VM on your machine and practicing:
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html


If you are very serious about the Hadoop certification, I highly recommend the Cloudera Developer Training for Apache Hadoop.

Material I referred to during preparation:


  • Hadoop: The Definitive Guide (4th Edition) by Tom White
  • MapReduce Design Patterns by Donald Miner
  • A good understanding of Sqoop 1.4, Sqoop2, and Hive is required for passing the exam. An overview of Pig, Oozie, Flume, Avro, and Crunch is also required, as you may see a few questions on them in the exam.
  • An overview of the ecosystem can be found via the link below: http://hadoopecosystemtable.github.io/
  • https://developer.yahoo.com/hadoop/tutorial/
  • Understand HDFS commands with their various options.
  • http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
  • A deep understanding of classic MapReduce and the YARN architecture, replicated and repartition joins, sorting, etc.
  • You should be good at core Java, IO/NIO, and regular expressions.
  • Knowledge of the available input/output formats; you should be able to create your own input and output formats.
  • Custom Writable, WritableComparable, RawComparator
  • You should be comfortable writing a MapReduce job given a complex SQL (HiveQL) query.
Here is my certificate:



