This post gives an overview of Pig and its common transformations.
Pig Programming - Pig is a dataflow platform developed at Yahoo for data scientists who have little exposure to Java. Its language is Pig Latin.
It has two execution modes: local (pig -x local) for standalone runs and mapreduce (pig -x mapreduce) for distributed runs on a Hadoop cluster.
It is well suited to ETL operations such as slicing or filtering data.
In mapreduce mode, the input files should be in HDFS.
Various transformations in Pig
Load - loads the data into the aliases a and b, which hold relations (datasets, in general terms)
a = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);
b = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);
Store - writes the relation to the specified HDFS output folder, here tab-delimited
STORE b INTO '<output folder>' USING PigStorage('\t');
-- union
c = UNION a, b;
distinct - returns only the distinct records
d = distinct c;
split data
split c into f if empno == 10, g if empno == 20;
sample roughly 40% of the data (sampling is random, so the exact number of records varies)
x = sample c 0.4;
filter data (like a SQL where clause)
g = filter c by empno > 20;
order clause
y = order c by name;
y = order c by name desc;
--grouping
z = group c by name;
loop (foreach) - processes every record and generates the listed columns
loopresult = foreach c generate name, sal;
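foreach can also compute new columns, not just project existing ones. A minimal sketch, assuming the same emp.txt layout as above and assuming, purely for illustration, that sal is a monthly figure; the aliases emp and annual and the column annual_sal are made-up names:

```pig
-- Same file and schema as the load example above.
emp = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

-- Keep two columns and add a derived one (annual_sal is an illustrative name).
annual = foreach emp generate name, sal, sal * 12 as annual_sal;

dump annual;
```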
grouping multiple relations
cogroupvar = cogroup a by name, b by name;
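cogroup with inner - behaves somewhat like an inner join: groups where the bag for the relation marked inner (here b) is empty are dropped
cogroupvar = cogroup a by name, b by name inner;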
Note: the difference is that join produces a flat set of output tuples, whereas cogroup produces one tuple per key that holds a bag of records from each relation.
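To make the difference concrete, the sketch below starts from a cogroup and derives join-style output by dropping keys that are missing on either side and then flattening both bags. It assumes the same emp.txt layout as above; the aliases cg, matched and flat are illustrative names, not part of the original script.

```pig
-- Load the same file twice, as in the load example above.
a = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);
b = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

-- cogroup: one tuple per name, with a bag of rows from a and a bag of rows from b.
cg = cogroup a by name, b by name;

-- Keep only the names that occur in both relations.
matched = filter cg by not IsEmpty(a) and not IsEmpty(b);

-- Flattening both bags gives one flat tuple per matching pair, like a join.
flat = foreach matched generate flatten(a), flatten(b);

-- Compare the nested schema with the flat one.
describe cg;
describe flat;
```

Running describe on both aliases shows the nested bags in cg and the flat field list in flat.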
-- describe command (shows the schema)
describe cogroupvar;
--join
h = join a by empno, b by empno;
cross product
r = cross a, b;
Describe, illustrate and explain
grunt> describe h;
h: {a::empno: int,a::name: chararray,a::sal: int,b::empno: int,b::name: chararray,b::sal: int}
grunt> describe cogroupvar;
cogroupvar: {group: chararray,a: {(empno: int,name: chararray,sal: int)},b: {(empno: int,name: chararray,sal: int)}}
describe <alias> - shows the relation's schema (table definition)
illustrate <alias> - shows a sample execution of the logical plan
explain <alias> - shows the logical, physical and MapReduce execution plans
Built-in functions (names are case-sensitive) -
AVG, COUNT, CONCAT, DIFF, MAX, MIN, SIZE, SUM.
TOKENIZE - splits a chararray into a bag of words
IsEmpty - checks whether a bag or map is empty
encrypt function - encrypt(ssn) (this appears to be a user-defined function rather than a standard Pig built-in)
[parallel m] - sets the number of reduce tasks to m. It applies to operators that start a reduce phase, such as group, cogroup, join, order, distinct and cross.
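As a rough illustration of the aggregate functions and the parallel clause together, the sketch below groups the employee data by name and computes a few statistics per group. It assumes the same emp.txt layout as above; the aliases emp, by_name and stats, the output column names, and the value 4 are arbitrary choices for the example.

```pig
-- Same file and schema as the load example above.
emp = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

-- group starts a reduce phase, so parallel applies here; 4 is an arbitrary value.
by_name = group emp by name parallel 4;

-- Built-in function names are case-sensitive (AVG, not avg).
stats = foreach by_name generate
    group        as name,
    COUNT(emp)   as num_rows,
    AVG(emp.sal) as avg_sal,
    MAX(emp.sal) as max_sal;

dump stats;
```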
-------------- User-defined functions
To write a filter UDF in Java, extend FilterFunc and override public Boolean exec(Tuple input) throws IOException.
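Once such a class is compiled into a jar, it is called from a Pig Latin script much like a built-in function. The sketch below shows the general shape only: the jar name myudfs.jar, the class com.example.pig.IsHighEarner and the alias IsHighEarner are hypothetical and stand in for whatever your own build produces.

```pig
-- Hypothetical jar and class names; replace them with your own build output.
REGISTER myudfs.jar;
DEFINE IsHighEarner com.example.pig.IsHighEarner();

-- Same file and schema as the load example above.
emp = load '/usr/hadoop/training/emp.txt' using PigStorage(',') as (empno:int, name:chararray, sal:int);

-- A FilterFunc returns a Boolean, so it can be used directly as a filter condition.
high = filter emp by IsHighEarner(sal);

dump high;
```

DEFINE is optional; without it the function can be called by its fully qualified class name.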
Basic Pig word count program -
lines = load '/usr/hadoop/training/input.txt' as (line:chararray);
words = foreach lines generate FLATTEN(TOKENIZE(line)) as word;
filter_words = filter words by word matches '\\w+';
word_groups = group filter_words by word;
word_count = foreach word_groups generate COUNT(filter_words) as count, group as word;
dump word_count;