Monday, September 7, 2015

Pig and its Transformations


This post gives an overview of Pig and its most common transformations.



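Pig Programming - A dataflow platform developed at Yahoo for data scientists who have little exposure to Java. Its language is Pig Latin.

It has two commonly used execution modes: local (pig -x local) for standalone runs against the local file system, and mapreduce (pig -x mapreduce, the default) for distributed runs on a Hadoop cluster.

It is well suited to ETL operations such as slicing or filtering data.


In mapreduce mode the input file should be in HDFS.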

Various transformations in Pig

Load - Load the data into the aliases a and b, which hold relations (datasets, in general terms)

a = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);
b = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

Store - write a relation to the specified HDFS output folder, here as tab-separated text
STORE b INTO '<output folder>' USING PigStorage('\t');

-- union

c = UNION a,b;

only distinct - returns the distinct records of a relation

d = distinct c;

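split data - a record goes to every relation whose condition it satisfies, and to none if it matches no condition
split c into f if empno == 10, g if empno == 20;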
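
In the split above, a record whose empno is neither 10 nor 20 lands in no relation. A minimal sketch, assuming Pig 0.10 or later where the otherwise branch is available, that sends the leftover records to a catch-all relation. The thresholds and the alias names low_emps, mid_emps and other_emps are illustrative and not from the original example:

```pig
-- Illustrative thresholds; c is the union of a and b built earlier.
split c into low_emps if empno < 20,
             mid_emps if (empno >= 20 and empno < 40),
             other_emps otherwise;
```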

sample roughly 40% of the data (sampling is random, so the exact row count varies)
x = sample c 0.4;

filter data (like a where clause)

g = filter c by empno > 20;

order clause (ascending by default, desc for descending)
y = order c by name;
y = order c by name desc;

-- grouping
z = group c by name;

loop (foreach) - runs over every record, here keeping only the name and sal fields
loopresult = foreach c generate name, sal;
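
foreach can also compute new fields from expressions. A small sketch, assuming the same relation c; the 10% raise, the alias raised and the field name new_sal are made up for illustration:

```pig
-- Keep name and sal, and derive a third field from an expression.
-- sal is an int and 1.1 is a double literal, so new_sal comes out as a double.
raised = foreach c generate name, sal, sal * 1.1 as new_sal;
```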

grouping multiple relations
cogroupvar = cogroup a by name, b by name;

kind of inner join - inner drops the keys that have no match in that relation, so mark both sides
cogroupvar = cogroup a by name inner, b by name inner;


Note: the difference is that join creates a flat set of output records, whereas cogroup creates one record per key holding a bag of matching tuples from each relation.
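
The relationship between the two can be sketched by flattening the bags of a cogroup. Assuming the relations a and b loaded earlier (the aliases grouped and flat are new here), flattening both bags pairs up the matching records per key and drops keys whose bag is empty on either side, which mirrors an inner join on name:

```pig
grouped = cogroup a by name, b by name;
-- FLATTEN un-nests each bag; an empty bag produces no output rows for that key.
flat = foreach grouped generate FLATTEN(a), FLATTEN(b);
describe flat;
```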

-- describe command (shows the schema of a relation)

describe cogroupvar;

--join
h = join a by empno, b by empno;

cross product
r = cross a, b;

Describe, illustrate and explain

grunt> describe h;                    
h: {a::empno: int,a::name: chararray,a::sal: int,b::empno: int,b::name: chararray,b::sal: int}
grunt> describe cogroupvar;
cogroupvar: {group: chararray,a: {(empno: int,name: chararray,sal: int)},b: {(empno: int,name: chararray,sal: int)}}
grunt>


describe <alias> - shows the relation's schema (table definition)

illustrate <alias> - shows a sample execution of the logical plan, step by step, on a small sample of the data

explain <alias> - shows the logical, physical and MapReduce plans

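built-in functions - AVG, COUNT, CONCAT, DIFF, MAX, MIN, SIZE, SUM (function names are case sensitive)
TOKENIZE - splits a chararray into a bag of words
IsEmpty - checks whether a bag or map is empty
encrypt function - encrypt(ssn); this is not among the standard built-ins, so it has to be supplied as a user-defined function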
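
The aggregate functions are normally applied to the bag produced by a group. A minimal sketch, assuming the relation c from above; the aliases by_name and stats and the output field names are illustrative:

```pig
by_name = group c by name;
-- Inside the foreach, c is the bag of records for one name
-- and c.sal is the bag of their salaries.
stats = foreach by_name generate
    group as name,
    COUNT(c) as n,
    AVG(c.sal) as avg_sal,
    MAX(c.sal) as max_sal;
```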

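[parallel m] - sets the number of reduce tasks to m. It takes effect only on operators that trigger a reduce phase, such as group, cogroup, join, cross, distinct and order.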
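
A short sketch of the clause on two reduce-side operators, assuming the relation c from above; the value 4 is an arbitrary choice for illustration:

```pig
-- Ask for 4 reduce tasks for the grouping and for the sort.
z = group c by name parallel 4;
y = order c by name parallel 4;
```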


--------------User defined function

Extend FilterFunc and override public Boolean exec(Tuple input) throws IOException. The method returns true for the records the filter should keep.
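
On the Pig Latin side, a filter UDF is registered and then called inside a filter statement. A sketch under stated assumptions: the jar name myudfs.jar, the class com.example.pig.HighEarner and the alias high are hypothetical, and the class is assumed to extend FilterFunc as described above:

```pig
-- Hypothetical jar and class names; replace with your own build output.
register 'myudfs.jar';
define HighEarner com.example.pig.HighEarner();
-- Records for which exec() returns true are kept.
high = filter c by HighEarner(sal);
```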


Basic Pig word count program -
lines = load '/usr/hadoop/training/input.txt' as (line:chararray);
words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;
filter_words = filter words by word matches '\\w+';
word_groups = group filter_words by word;
word_count = foreach word_groups generate COUNT(filter_words) as count , group as word;

dump word_count;
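
A possible follow-up, sketched here rather than taken from the original program, is to list only the most frequent words. The cut-off of 10 and the aliases ordered_counts and top_words are assumptions:

```pig
-- $0 is the count field of word_count, $1 is the word.
ordered_counts = order word_count by $0 desc;
top_words = limit ordered_counts 10;
dump top_words;
```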
