My Notes: Pig and its Transformations

Monday, September 7, 2015

Pig and its Transformations

This post gives an overview of Pig and its various transformations.

Pig - a dataflow platform developed at Yahoo for data scientists who have little exposure to Java. Its scripting language is Pig Latin.

It has two execution modes: local (pig -x local) for standalone runs and mapreduce (pig -x mapreduce, the default) for distributed runs on a Hadoop cluster.

It's very powerful for ETL operations such as slicing or filtering data.

In mapreduce mode the input file should be in HDFS; in local mode it is read from the local file system.

Various transformations in Pig

Load - loads the data into aliases a and b, which hold relations (datasets, in general terms).

a = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

b = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

Store - writes a relation to the specified HDFS output folder.
STORE b INTO '<output folder>' USING PigStorage('\t');

-- union

c = union a, b;

distinct - returns only the distinct records

d = distinct c;

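split - route records into separate relations by condition (records matching none of the conditions are dropped)

split c into f if empno == 10, g if empno == 20;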
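To keep the records that match neither condition, split can take a catch-all branch. A minimal sketch, assuming a Pig version that supports the otherwise keyword (0.10 or later, as far as I know); the alias rest is made up for illustration:

```pig
-- rest receives every record that matched neither empno == 10 nor empno == 20
split c into f if empno == 10, g if empno == 20, rest otherwise;
```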

sample - take a random sample of roughly 40% of the data

x = sample c 0.4;

filter data (like a SQL where clause)

g = filter c by empno > 20;

order clause - sorts ascending by default, or descending with desc

y = order c by name;

y = order c by name desc;

--grouping

z = group c by name;

foreach - loop over every record and generate the listed columns

loopresult = foreach c generate name, sal;
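foreach can also compute new columns, not just pick existing ones. A small sketch, assuming relation c from above; the alias raised, the column sal_with_raise and the 10% figure are only illustrative:

```pig
-- keep the original columns and add a derived one (int * double gives a double)
raised = foreach c generate empno, name, sal, sal * 1.1 as sal_with_raise;
```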

grouping multiple relations

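cogroupvar = cogroup a by name, b by name;

kind of inner join - the inner keyword drops keys whose bag for that relation is empty, so it is needed on both relations to keep only keys present in both

cogroupvar = cogroup a by name inner, b by name inner;

Note: the difference is that join produces a flat set of output tuples, whereas group and cogroup produce one tuple per key holding bags of the matching tuples.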
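The difference shows up when a cogroup is flattened by hand. A minimal sketch, assuming relations a and b loaded as above; the aliases grouped, flat and joined are made up for illustration:

```pig
-- cogroup keeps one tuple per key, with a bag of matching tuples from each relation
grouped = cogroup a by name inner, b by name inner;

-- flattening both bags gives one flat row per (a, b) pairing, like an inner join
flat = foreach grouped generate FLATTEN(a), FLATTEN(b);

-- the corresponding join on the same key
joined = join a by name, b by name;

-- compare the nested schema with the two flat ones
describe grouped;
describe flat;
describe joined;
```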

-- describe table command

describe cogroupvar;

--join

h = join a by empno, b by empno;

cross product

r = cross a, b;

Describe, illustrate and explain

grunt> describe h;

h: {a::empno: int,a::name: chararray,a::sal: int,b::empno: int,b::name: chararray,b::sal: int}

grunt> describe cogroupvar;

cogroupvar: {group: chararray,a: {(empno: int,name: chararray,sal: int)},b: {(empno: int,name: chararray,sal: int)}}

grunt>

describe <alias> - shows the relation's schema (table definition)

illustrate <alias> - shows a sample execution of the logical plan on a small subset of the data

explain <alias> - shows the logical, physical and MapReduce execution plans

Built-in functions (names are case sensitive) - AVG, COUNT, CONCAT, DIFF, MAX, MIN, SIZE, SUM.

TOKENIZE - splits a chararray into a bag of words

IsEmpty - checks whether a bag or map is empty

encrypt function - encrypt(ssn)
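The aggregate functions work on bags, so they are normally applied after a group. A short sketch, assuming relations a, b and c from above; the aliases by_name, stats, cg and only_in_a are made up for illustration:

```pig
-- group first: each output tuple holds the key and a bag named c
by_name = group c by name;

-- aggregate over the bag for each key
stats = foreach by_name generate
    group      as name,
    COUNT(c)   as num_records,
    AVG(c.sal) as avg_sal,
    MAX(c.sal) as max_sal,
    MIN(c.sal) as min_sal,
    SUM(c.sal) as total_sal;

dump stats;

-- IsEmpty on a cogroup: keep keys that appear in a but have no match in b
cg = cogroup a by name, b by name;
only_in_a = filter cg by IsEmpty(b);
```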

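[parallel m] - requests m reduce tasks. It only has an effect on operators that start a reduce phase, such as group, cogroup, join, cross, distinct and order.

-- User-defined functions (UDF)

For a filter UDF, extend FilterFunc in Java and override public Boolean exec(Tuple input) throws IOException; it returns true for the records to keep.

Basic Pig word count program: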
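As a quick illustration of the parallel clause, assuming relation c from above (the value 4 is arbitrary):

```pig
-- ask for 4 reduce tasks for this group only
z = group c by name parallel 4;

-- or set a script-wide default for the reduce-side operators that follow
set default_parallel 4;
```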
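On the Pig Latin side, a compiled filter UDF is registered and then used like a built-in. A sketch only: the jar name myudfs.jar, the class myudfs.IsHighSalary and the alias high_paid are hypothetical and stand in for your own build:

```pig
-- hypothetical jar and class names; replace them with your own artifacts
register 'myudfs.jar';
define IsHighSalary myudfs.IsHighSalary();

-- a FilterFunc returns a Boolean, so it can be used directly as the filter condition
high_paid = filter c by IsHighSalary(sal);
```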

lines = load '/usr/hadoop/training/input.txt' as (line:chararray);

words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;

filter_words = filter words by word matches '\\w+';

word_groups = group filter_words by word;

word_count = foreach word_groups generate COUNT(filter_words) as count , group as word;

dump word_count;
