Monday, April 14, 2014

MongoDB - Part I

What is MongoDB?

Of all the definitions that I read, I liked Wikipedia's the most.

MongoDB  is a cross-platform document-oriented database system. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.

All the mumbo jumbo above only means that MongoDB is a document database whose structure is JSON-like, quite unlike a traditional relational database. The definition will become clearer when we start performing the CRUD operations (create, read, update, delete) later. For now, just go with the flow.

A few quick pointers:

1.   A document in MongoDB is equivalent to a row/tuple in a relational database system.
2.   A collection is equivalent to a table, but a collection is schema-free: a table in an RDBMS has a fixed structure, whereas in MongoDB (as in most NoSQL stores) each document/row/tuple has its own schema.

Let us take an example of a relational database system. Here is how data looks in our traditional relational database management systems.
 ID         NAME          AGE   
 1          Honey         10  
 2          Roney         20  
As is evident (and already known), the schema/table structure is the same for every record (ID=1 or ID=2; look at the data). You might be thinking, "I still do not understand what he is trying to say, I can see no difference between the two records." If that is what you are thinking, you are heading in the right direction. Look at the column/field names: they are static. ID, NAME and AGE do not change. Do they change when a new record is inserted? Most definitely not!
Now let's take a look at the MongoDB documents below (just think of a document as a fancy name for a row/tuple); don't worry much about the "_id" field yet. The data is of the format column_name : column_value, i.e. key : value pairs.
This means:
The first document has the columns user_id, age and status, with their values.
The second document has only the columns name and location.
If I were from a traditional database background, I would say there is no way this type of storage is possible, but here it is: MongoDB makes it possible.
 > db.ab.find()  
 { "_id" : ObjectId("529ca24c3dc2f256c6bfcbb9"), "user_id" : "555", "age" : 100, "status" : "D" }  
 { "_id" : ObjectId("52e72d956cf8a5335f4ac07c"), "name" : "Honey singh", "location" : "Mumbai" }  
Does it look different? I bet it does. In simple words, every document in a MongoDB database can have its own schema, and hence MongoDB is a SCHEMA-FREE database. There is no fixed schema for the entire table/collection.

Every document/tuple/row has a unique value identifying itself, and that is "_id". It is of type ObjectId.
If you were to compare "_id" with a relational database, it would be similar to a primary key.
This is how the structure of "_id" looks:
 0 1 2 3   |  4 5 6  |  7 8  |  9 10 11     
 Timestamp   Machine    PID     Increment  
A few facts about ObjectId:
1.   ObjectIds use 12 bytes of storage, which gives them a string representation of 24 hexadecimal digits: 2 digits for each byte (look at the 24-character ObjectId values above for an example).
2.   The first four bytes of an ObjectId are a timestamp in seconds since the epoch (an epoch is an instant in time chosen as the origin of a particular era; it serves as a reference point from which time is measured). We can verify this from the shell, as shown right after this list.
3.   The next three bytes of an ObjectId are a unique identifier of the machine on which it was generated. This is usually a hash of the machine's hostname.
4.   To provide uniqueness among different processes generating ObjectIds concurrently on a single machine, the next two bytes are taken from the process identifier (PID) of the ObjectId-generating process.
5.   The last three bytes are simply an incrementing counter that is responsible for uniqueness within a second in a single process.
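To see fact #2 in action, the shell's getTimestamp() helper decodes the timestamp embedded in an ObjectId. A minimal sketch, using one of the ids shown above (the date in the output is simply what those first four bytes encode):
 > ObjectId("529ca24c3dc2f256c6bfcbb9").getTimestamp()  
 ISODate("2013-12-02T15:07:56Z")  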

Before we start with the basic operations, here are a few steps which will help us get going.
Disclaimer: you need to have MongoDB installed already. If not, go ahead and do it; it is very simple.
 // connecting to mongo   
 hadoop@ubuntu-server64:~/lab/install/mongodb/bin$ mongo  
 MongoDB shell version: 2.4.8  
 connecting to: test  
 // This will show all the databases available in the mongodb  
 > show databases;  
 balaramdb    0.203125GB  
 local  0.078125GB  
 new_db 0.203125GB  
 test  0.203125GB  
 testing 0.203125GB  
 //Very similar to mysql or sql-server , this will take us inside one of the db's . In this case new_db  
 > use new_db  
 switched to db new_db  
 //This will show us all the tables/collections in the database. Don't worry too much about the indexes now; I will explain them later.  
 > show tables;  
 ab  
 system.indexes  
 // This is equivalent to "select * from table " in relational databases;  
 > db.ab.find()  
 { "_id" : ObjectId("529ca24c3dc2f256c6bfcbb9"), "user_id" : "555", "age" : 100, "status" : "D" }  
Basic Operations with the Shell:
Even before we start our MongoDB journey, you need to understand one thing very clearly: all database objects in MongoDB are referred to through "db". So if you have to access a table/collection, it has to be db.<table_name>.findOne() or just db.<table_name>.insert(), and so on and so forth. Let's start with the CRUD (create, read, update, delete) operations.
Create:
 // let us first check which database we are in , the first line   
 > db.stats()  
 //o/p will be similar to below which means we are in the database : new_db  
 // "db" : "new_db"  
 // let us drop if there is an existing table "blog" that we are trying to create.  
 > db.blog.drop()  
 //o/p  
 //true  
 //Now let us try to see how do we create a collection/table  
 //first we define a local variable which is a javascript object representing our document.  
 post = {"title" : "My Blog Post",  
         "content" : "Here's my blog post.",  
         "date" : new Date()}  
 // this will immediately show the below .The below means that the object "post" is valid.  
 {  
     "title" : "My Blog Post",  
     "content" : "Here's my blog post.",  
     "date" : ISODate("2014-01-28T09:27:52.976Z")  
 }  
 //Let us invoke the command below to insert the data.  
 //Did we forget to create the table first before insertion?  
 //Nevertheless, let's try once and see what happens.  
 > db.blog.insert(post)  
 //Oops, there is no error. Does that mean the table/collection was created and the data inserted on the fly?  
 //Yes, and that is why MongoDB is called a schema-free NoSQL database. If the collection does not exist, it just creates one.
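You can also create a collection explicitly before inserting anything into it, using the shell's createCollection() helper. A quick sketch with a hypothetical collection name "posts" (the listing below is roughly what you would see in this database):
 > db.createCollection("posts")  
 { "ok" : 1 }  
 > show tables  
 ab  
 blog  
 posts  
 system.indexes  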
The SQL statement select * from blog is equivalent to the MongoDB find() shown in the Read section below.
Read :
This is how you read all the fields/columns of all documents in a collection/table:
 > db.blog.find()  // or db.blog.find({}); both are equivalent commands
 { "_id" : ObjectId("52e779116cf8a5335f4ac083"), "title" : "My Blog Post", "content" : "Here's my blog post.", "date" : ISODate("2014-01-28T09:27:52.976Z") }  
 { "_id" : ObjectId("52e79c216cf8a5335f4ac084"), "name" : "Honey singh", "location" : "Mumbai" }  
How do I read only those records which I want? Let's say I want to pull records only for "Honey Singh", i.e. the equivalent of the SQL given below:
select * from blog where name = "Honey Singh"
Read Filtering:
 > db.blog.find({"name":"Honey singh"})  
 { "_id" : ObjectId("533a641a0216f306dfcb7163"), "name" : "Honey singh", "location" : "Mumbai" }  
 // If I want only a few selected columns/fields, here is how I would write my find() function  
 // the projection "content" : 1 means: pull only "content" from all the documents  
 > db.blog.find({}, {"content" : 1})  
 { "_id" : ObjectId("52e779116cf8a5335f4ac083"), "content" : "Here's my blog post." }  
 { "_id" : ObjectId("52e79c216cf8a5335f4ac084" } // Only _id is returned when there is no data available 
 //select content,name from blog   
 // but this will include "_id" as given below  
 > db.blog.find({}, {"content" : 1, "name" : 1})  
 { "_id" : ObjectId("52e15de00b50d96fdf69790f"), "content" : "Here's my blog post." }  
 { "_id" : ObjectId("533a641a0216f306dfcb7163"), "name" : "Honey singh" }  
 //select content,name from blog   
 //Let me remove the unwanted _id field from my result set and get a clean data  
 > db.blog.find({}, {_id : 0,"content" : 1, "name" : 1})  
 { "content" : "Here's my blog post." }  
 { "name" : "Honey singh" }  

Thursday, January 23, 2014

Pig coding made easy.

First and foremost, yes, you need to install Pig, and don't worry, it is extremely easy. There are many blogs and documents available online for Pig installation; refer to one and proceed. The Apache URL below might be useful. Good luck!

http://pig.apache.org/docs/r0.10.0/start.html#Pig+Setup

Look at the picture below. Some of you might have already seen it and will think "yes, I have seen this picture already". I am not trying to reinvent the wheel here, but rather to collate the information scattered everywhere and to explain what Pig is and what it does.


   #1 in the diagram above states that "Pig is a dataflow language and Pig Latin is used to express it". Does not make sense? Let me take an extract from Apache's website:

A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE which read data from and write data to the file system.) Pig Latin statements can span multiple lines and must end with a semi-colon ( ; ). Pig Latin statements are generally organized in the following manner:
  1. A LOAD statement reads data from the file system - this reads data stored on the hard drive so that Pig can process it or display it in the terminal.
  2. A series of "transformation" statements process the data - think of this as the business logic being applied.
  3. A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen - this writes the data back to the hard disk (or prints it).
Simple, isn't it? If you are still not able to completely comprehend it, move on; by the time you are done reading this blog, you will have a clear understanding of what it is. A tiny sketch of this three-step shape is given below.
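To make it concrete, here is a minimal sketch of the three-step shape (the file name input.txt and the filter condition are placeholders, not from any real dataset):
 -- 1. LOAD: read data from the file system  
 lines = LOAD 'input.txt' AS (line:chararray);  
 -- 2. transformation: a placeholder business-logic step (drop empty lines)  
 cleaned = FILTER lines BY line != '';  
 -- 3. DUMP: display the output on the screen (STORE would write it back to the file system)  
 DUMP cleaned;  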

To read more , go to the website below .
http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html#Overview
   #2 in the diagram above states that Pig has 2 execution modes or exectypes:

  • Local Mode - This means you are running Pig on only one machine, not in a distributed environment with clusters. Local mode is started using the -x flag, the syntax of which is given below.

     pig -x local  
    
  • Mapreduce Mode - This signifies that you need access to a Hadoop cluster and an HDFS installation, and Hadoop should already be running on your machine. How do we know that? Type jps as given below.

     hadoop@ubuntu-server64:~/lab/install/hadoop-1.0.4/bin$ jps  
     2837 Jps  
     2255 SecondaryNameNode  
     1848 DataNode  
     2344 JobTracker  
     2590 TaskTracker  
     1606 NameNode  
     hadoop@ubuntu-server64:~/lab/install/hadoop-1.0.4/bin$  
Please note that not all of the above services might be running. The DataNode and TaskTracker generally run on the slave machines, so if you are not able to see them when executing jps, don't worry; that is all right.
Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag. Pig can also be started in local and mapreduce mode using the java command, which we will not discuss here.
The discussion on Pig cannot reach a logical conclusion without discussing how to run Pig in its various modes.

Running PIG !!
Pig Latin and Pig Commands are executed in 2 modes 
    1.   Interactive Mode.
    2.   Batch Mode.

Interactive Mode.
Pig can be run in interactive mode using the Grunt shell. As shown above, enter Pig's grunt shell in either local or mapreduce mode. Hope you have not forgotten how already; if you have, refer to the section above.
 grunt> transaction = LOAD 'retail/txn.csv' USING PigStorage(',') AS(txn_id,txn_dt,cust_id,amt:double,cat,sub_cat,adr1,adr2,trans_type);   
 grunt> dump transaction;   
Note: if you want to work with these statements, refer to the sample data I have provided in my previous blog on Pig.

Batch Mode.
A group of Pig Latin statements can be put together and run together. This is similar to packaging or creating procedures in PL/SQL, or you can think of it as a bunch of statements written together to perform a task.

Here is a piece of code.
Copy the statements below into a text file and save it with a name, say "trans.pig", in your local directory wherever Pig is installed, for example /user/data/trans.pig.
 transactions = LOAD 'retail/txn.csv' USING PigStorage(',') AS(txn_id,txn_dt,cust_id,amt:double,cat,sub_cat,adr1,adr2,trans_type);  
 txn_100plus = FILTER transactions BY amt > 100.00;  
 txn_grpd = GROUP txn_100plus BY cat;  
 txn_cnt_bycat = FOREACH txn_grpd GENERATE group , COUNT(txn_100plus);  
 STORE txn_cnt_bycat INTO '/home/hadoop';  

Now how do we execute it? Here are the details.
Local Mode:
 $ pig -x local trans.pig  
Mapreduce Mode:
 $ pig trans.pig  
 or  
 $ pig -x mapreduce trans.pig  

Comments in code :
Before we start our journey of exploring the code, it is always good to know how to comment out a piece of code or a bunch of lines. We used to learn this first in our college classes: during the practical test, all we had to do was, at the very least, make the code error-free, and what better way than commenting out the erroneous code.

Getting to the point, there are 2 ways to comment code in Pig:

1.   single-line comments using --
2.   multi-line comments using /* …. */

Oracle PL/SQL developers, does this look exactly the same? Yes, it is the same.
Let me give an example : 


  /* I am commenting the 3 lines below   
  transactions = LOAD 'retail/txn.csv' USING PigStorage(',') AS(txn_id,txn_dt,cust_id,amt:double,cat,sub_cat,adr1,adr2,trans_type);    
  txn_grpd = GROUP txn_100plus BY cat;   
  txn_cnt_bycat = FOREACH txn_grpd GENERATE group , COUNT(txn_100plus) commented till here */  
  STORE txn_cnt_bycat INTO '/home/hadoop'; -- This part is commented again  

The first thing we do whenever we start learning a new language is to understand its data types, so let's quickly walk through them.

There are 2 categories of data types in Pig:
       1.   Scalar
       2.   Complex

  • Scalar data types are very similar to those in other languages, and there are 6 different types:
    int       : stores a 4-byte signed integer. Example: 10
    long      : stores an 8-byte signed integer. Example: 50000000000
    float     : stores a 4-byte floating-point value. In some calculations a float loses precision, which is obvious from its limited storage capacity. Example: 3.14
    double    : stores an 8-byte floating-point value. Example: 2.71828, or in exponent format: 6.626e-34
    chararray : a string or character array. Example: 'Hello'; the value for Ctrl-A is expressed as \u0001
    bytearray : a blob or array of bytes; there is no way to specify a bytearray constant

    Don't worry too much if you are not able to understand 100% of it; that is fine. A small LOAD example using these scalar types follows, and the more interesting concepts come after that.
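    As a quick illustration of the scalar types, here is a sketch of a LOAD schema that uses all six (the file name measures.txt and the field names are made up for the example):
     m = LOAD 'measures.txt' USING PigStorage(',') AS (id:int, population:long, ratio:float, energy:double, label:chararray, raw:bytearray);  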

    Complex data types are given below :
    Map : This is a key/value pair (k,v). The key is a chararray used as an index to find the element referred to as the value. Because Pig does not know the type of the value, it assumes it is a bytearray; however, the actual value might be of some other data type. We can CAST the value to a different data type if we know what it is. By default there is no requirement that all values in a map be of the same type: it is legitimate to have a map with two keys, name and age, where the value for name is a chararray and the value for age is an int. A map is written using square brackets []. For example, ['name'#'bob', 'age'#55] creates a map with two keys, "name" and "age".

    Tuple : A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type; they do not all need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns. Tuple constants use parentheses to indicate the tuple and commas to delimit fields within the tuple. For example, ('bob', 55) describes a tuple constant with two fields.

    Bag : A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by position. Bag constants are constructed using braces, with the tuples in the bag separated by commas. For example, {('bob', 55), ('sally', 52), ('john', 25)} constructs a bag with three tuples, each with two fields.

    Please NOTE: Bag is the one type in Pig that is not required to fit into memory. Because bags are used to store collections when grouping (you might not fully understand this yet; hold your breath, the explanation comes later), bags can become quite large. Pig has the ability to spill bags to disk when necessary, keeping only part of the bag in memory. The size of a bag is limited to the amount of local disk available for spilling it.
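    The complex types can also appear in a LOAD schema. A sketch, assuming a hypothetical file people.txt, where props is an (untyped) map and scores is a bag of (subject, marks) tuples:
     people = LOAD 'people.txt' AS (name:chararray, props:map[], scores:bag{t:tuple(subject:chararray, marks:int)});  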

    Enough of theory, isn't it? Let's now work with some practical examples. Before we proceed, here is a snapshot of the data on which we will be performing our operations. The data has 3 fields, as given below:
    x: Name 
    y: Number
    z: Number
    Let us call this file "foo.txt" and save it in the default Pig directory.
     Bill     10     20  
     Mark     5     5  
     Larry     2     2  
     Bill     5     5  
    
    Let us now login into the grunt shell.
     grunt> ls  
     hdfs://localhost:8020/user/hadoop/FIRST_NAME  <dir>  
     hdfs://localhost:8020/user/hadoop/NYSE_daily.txt<r 1>  1471331  
     hdfs://localhost:8020/user/hadoop/NYSE_dividends.txt<r 1>    17695  
     hdfs://localhost:8020/user/hadoop/TEMP_FIRST_NAME    <dir>  
     hdfs://localhost:8020/user/hadoop/foo.txt<r 1> 41  
     hdfs://localhost:8020/user/hadoop/out1 <dir>  
    
    Execute the below command to load the data from foo.txt to a relation A
     grunt> A = load 'foo.txt' as (x:chararray, y:int, z:int);  
     grunt> dump A;  
     /* The command runs as a MapReduce job. There is a lot of log output between the dump and the result set given below; I am skipping it. */  
     (Bill,10,20)  
     (Mark,5,5)  
     (Larry,2,2)  
     (Bill,5,5)  
    
    Please note: each line of the result set above is a tuple (similar to a row in SQL).
    How do I access field-level data (data for only x or y) from the relation A above?
    In SQL we would have written select x from <table_name>, but in Pig we write:
     grunt> B = foreach A generate x;  
     grunt> dump B  
     (Bill)  
     (Mark)  
     (Larry)  
     (Bill)  
    
    The most important concept here is to understand that each step of a Pig script generates a relation. The FOREACH statement above says: parse the entire set of records of relation A, one by one until the last record, and separate field x from each. This is then dumped to the terminal.
    Note: commands or statements are not case sensitive in Pig. For example, load and LOAD are the same. Relations, however, are case sensitive: "A" above is not the same as "a".
     grunt> C = group A by x;  
     grunt> dump C  
     (Bill,{(Bill,10,20),(Bill,5,5)})  
     (Mark,{(Mark,5,5)})  
     (Larry,{(Larry,2,2)})  
     grunt> describe C;  
     C: {group: chararray,A: {(x: chararray,y: int,z: int)}}  
    
    The result returned here is a tuple with a bag inside it. Remember, a tuple is denoted by () and a bag by {}.
    Look at the description of C: it clearly shows that the grouping has been done only to consolidate data into a single bag per key.
    If we look at the data in foo.txt, Bill appears twice, and hence when we dump C, the Bill group has 2 Bill tuples inside it:
     (Bill,{(Bill,10,20),(Bill,5,5)})  
    
    This means that grouping in Pig is entirely different from grouping in SQL.
    Grouping in SQL needs an aggregate function, whereas in Pig it just brings together the tuples sharing the same key (group-by value).

    Going a little off-topic to explain how SQL grouping is different from Pig grouping.

    Here is my explanation ! Watch carefully 
    Let us consider the below table in SQL .Let us call it employee table 
     Name       Dept       Salary  
     Bill       10          100  
     Mark       10          200  
     Larry      20          500  
     King       20          50  
     George     30          1000  
    
    In SQL, a grouping would look like this:
     select dept,sum(salary) from employee group by dept  
    
    The result-set after firing the above query will look as given below 
     Dept      Salary  
     10          300  
     20          550  
     30          1000  
    
    Let us now load the same data into a Pig relation, assuming it is stored as comma-separated values:
     employee = load 'employee.csv' USING PigStorage(',') as (name:chararray, dept:int, salary:int);  
     dept_groupby = group employee by dept;  
     dump dept_groupby;  
     (10,{(bill,10,100),(Mark,10,200)})  
     (20,{(Larry,20,500),(King,20,50)})  
     (30,{(George,30,1000)})  
    
    So there is no aggregation happening; it is just grouping a set of tuples/records together. Hence Pig's grouping and SQL's grouping are two different operations.

    Now you might ask: how do we aggregate or sum in Pig, since the group statement alone does not do it? Here is the answer: you sum it up as a separate step. Don't understand the statement? Follow the example below.
     C = foreach dept_groupby generate employee.dept,SUM(employee.salary);  
     dump C;  
     o/p   
     ({(10),(10)},300)  
     ({(20),(20)},550)  
     ({(30)},1000)  
    
    This looks a little different, and difficult as well, for someone who has worked their whole life with high-level languages such as SQL, but didn't we tell you that this is a DATA FLOW language? You have to take one step at a time; remember, no skipping any step here. Get back to basics.
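    If you want output that looks closer to the SQL result above (just the dept and its total), a slightly different foreach does it. A sketch, reusing the dept_groupby relation from above:
     D = foreach dept_groupby generate group as dept, SUM(employee.salary) as total_salary;  
     dump D;  
     -- expected, matching the SQL result above:  
     -- (10,300)  
     -- (20,550)  
     -- (30,1000)  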

    Now the next question that follows is: what are the different relational operators available in Pig? Let us explore them one by one.
    Foreach

    Foreach takes each record from a relation, applies some expression to it (sometimes no expression at all), and from these expressions generates new records to send down
    the pipeline to the next operator. Look at the example below.
     grunt> cat foo.txt  
     Bill   10   20  
     Mark    5    5  
     Larry   2    2  
     Bill    5    5  
     grunt> A = LOAD 'foo.txt' as ( name,x:int,y:int);  
     grunt> B = foreach A generate name;  
     grunt> dump B;  
     (Bill)  
     (Mark)  
     (Larry)  
     (Bill)  
    
    It is simple, isn't it? The dump of B shows only the name field from the foo.txt data.
    That was a basic example; let's make it a little more complex.
     C = foreach A generate y-x;  
    OR 
     D = foreach A generate $2-$1;
     (10)  
     (0)  
     (0)  
     (0)  
    
    We can either use the field names that have been defined or use the positional references $0, $1, ..., where $0 signifies the first field, $1 the second field, and so on.
    Some more foreach related operations .
     beginning = foreach A generate ..x; -- produces name and x: all the fields from the beginning up to (and including) x.  
     middle = foreach A generate name..y; -- produces name, x and y: all the fields from name through y, which in our case is every field.  
     end = foreach A generate x..; -- produces x and y: all the fields from x to the end.  
    

    UDFs in foreach :

    User Defined Functions (UDFs) can be invoked in foreach. These are called evaluation
    functions, or eval funcs. Because they are part of a foreach statement, these UDFs take
    one record at a time and produce one output. The important thing to note is that although an eval function takes one argument, that argument can be a bag, and hence we can indirectly pass more than one value to the function. The same goes for the output of the function.

     A = load 'foo.txt' as (name:chararray, x:int, y:int);  
     D = foreach A generate UPPER(name) as name ,x;  
     E = group D by name;  
     F = foreach E generate group , sum(D.x);  
     o/p  
     014-01-23 02:26:30,254 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve sum using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]  
    
    Whoops! Why do I see an error in spite of doing everything right? Any guesses?

    The reason is very simple and yet so easy to forget. I struggled for almost 30 minutes before I could figure out what the problem was: UDFs are CASE SENSITIVE. Change sum to SUM.
     F = foreach E generate group , SUM(D.x);  
     o/p  
     (BILL,15)  
     (MARK,5)  
     (LARRY,2)  
    
    Hell Yeah , it works now !!

    Filter :

    The filter statement allows you to select which records you want to retain in your data pipeline. All the usual comparison operators can be used in a filter, such as ==, !=, >=, etc. This is the same as in other languages.

    Now some important points 

    1.   To use these operators with tuples, both tuples must either have the same schema or no schema.
    2.   None of these operators can be applied to bags.
    3.   Pig Latin follows the operator precedence that is standard in most programming languages. Remember the BODMAS rule in maths? Yes, it is just that.

    A SQL user/developer will recognise that this is nothing but the "WHERE" clause of SQL queries. Now we ask: is there something like regular-expression matching?


    Yes there is , look at the example below .
     grunt> describe F;  
     F: {group: chararray,long}  
     grunt> G = filter F by $0 matches 'BI.*';  
     O/P  
     (BILL,15)  
    
    The describe on relation F above shows that the second field has not been named, so I have used positional references; $0 gives me the first field (the group key).

    The ".*" above tells the engine to match everything after BI and hence the result is (BILL,15)

    Group : 

    Having been mentioned multiple times above, and having been explained once already: group by shares its syntax with SQL.
     grunt> C = group A by name;   
      grunt> dump C   
      (Bill,{(Bill,10,20),(Bill,5,5)})   
      (Mark,{(Mark,5,5)})   
      (Larry,{(Larry,2,2)})   
      grunt> describe C;   
      C: {group: chararray,A: {(name: chararray,y: int,z: int)}}   
    
    Look at what is happening here: Pig's group only brings together the set of tuples sharing the same key, so in this case all of Bill's records are in one bag now.
    The difference with SQL is that SQL needs an aggregate operation to be performed while grouping, such as SUM, COUNT, AVG, MIN or MAX, whereas in Pig, GROUP literally means the word: it just groups the data together.

    Few pointers: 
    1.   You can also group on multiple keys (a sketch follows after these pointers). 
    2.   You can also use all to group together all of the records in your pipeline. The record coming out of group all has the chararray literal all as its key:
     grunt> I = group A all ;  
     grunt> describe I;  
     I: {group: chararray,A: {(name: chararray,x: int,y: int)}}  
     grunt> dump I;  
     o/p  
     (all,{(Bill,10,20),(Mark,5,5),(Larry,2,2),(Bill,5,5)})  
    
    3.   group is the operator that usually forces a reduce phase. If the pipeline is in a map phase, this forces it to shuffle and then reduce. If the pipeline is already in a reduce, this forces it to pass through map, shuffle, and reduce phases again. 
    4.   group handles nulls in the same way that SQL handles them: by collecting all records with a null key into the same group.
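    Here is the promised sketch of grouping on multiple keys, again reusing relation A; the group key becomes a tuple of the two fields (the order of the groups in the output is not guaranteed):
     grunt> J = group A by (name, x);  
     grunt> dump J;  
     ((Bill,5),{(Bill,5,5)})  
     ((Bill,10),{(Bill,10,20)})  
     ((Larry,2),{(Larry,2,2)})  
     ((Mark,5),{(Mark,5,5)})  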

    Having discussed all this, let us take a step back and understand why Pig is a good fit for MapReduce.


    In MapReduce we often get skewed results, meaning the reducers may receive very different amounts of data at the end of the map, shuffle and sort phases, the simple reason being that some keys have more values than others.
    Just because you have specified that your job has 100 reducers, there is no reason to expect that the number of values per key will be distributed evenly.

    For example,

    suppose you have an index of web pages and you group by the base URL. Certain values,
    such as yahoo.com (login.yahoo.com, in.answers.yahoo.com), are going to have far more entries than, say, indiatimes.com, which means that some reducers get far more data than others.

    Now, since our MapReduce job is not finished (and any subsequent ones cannot start) until all our reducers have finished, this skew will significantly slow our processing. In some cases it might not even be possible for one reducer to manage so much data. What do we do?


    Pig has a number of ways in which it tries to manage this skew and balance the load across reducers. The one that applies to grouping is Hadoop's combiner. Let's talk a little more about combiners.


    Combiners:
    The combiner gives applications a chance to apply their reducer logic early. As the map phase writes output, it is serialized and placed into an in-memory buffer. When this buffer fills, MapReduce sorts the buffer and then runs the combiner, if the application has provided an implementation for it. The resulting output is then written to local disk, to be picked up by the shuffle phase and sent to the reducers. MapReduce might choose not to run the combiner if it determines it will be more efficient not to.

    A typical MapReduce job will look like this.


    If you don't understand where the combiner is placed in MapReduce, here is another diagram for you. A picture is worth a thousand words, isn't it?

    This means that the combiner does the task of a reducer even before the reducer comes into the picture.

    Now here is what you have been waiting for !


    Pig’s operators and built-in UDFs use the combiner whenever possible, because of its skew-reducing features and because early aggregation greatly reduces the amount of data shipped over the network and written to disk, thus speeding performance significantly.



    Order by :

    I believe we all understand order by, but let me explain very briefly what it is. The order statement sorts your data, producing a total order of your output data. Remember: sorting by maps, tuples, or bags produces errors.

    For all data types, nulls are taken to be smaller than all possible values
    for that type, and thus will always appear first (or last when desc is used).
     grunt> A = load 'foo.txt' as (name:chararray, x:int, y:int);  
     grunt> B = order A by name;  
     grunt> dump B;  
     o/p  
     (Bill,10,20)  
     (Bill,5,5)  
     (Larry,2,2)  
     (Mark,5,5)  
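    Sorting in descending order only needs the desc keyword; a sketch (ties, such as the two rows with x = 5, can come back in either order):
     grunt> C = order A by x desc;  
     grunt> dump C;  
     (Bill,10,20)  
     (Mark,5,5)  
     (Bill,5,5)  
     (Larry,2,2)  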
    

    Distinct :

    The distinct statement removes duplicate records. It works only on entire records, not on individual fields.
     grunt> A = load 'foo.txt' as (name:chararray, x:int, y:int);  
     grunt> B = foreach A generate name;  
     grunt> C = distinct B;  
     grunt> dump C;  
     o/p  
     (Bill)  
     (Mark)  
     (Larry)  
    

    Limit :

    Sometimes we want to see only a limited set of results, for various purposes; sometimes just to check whether we are getting any output at all. Limit helps us achieve that.
     grunt> D = limit C 2;  
     grunt> dump D;  
     o/p  
     (Bill)  
     (Mark)  
    
    A fact about limit in Pig: in spite of us limiting the records (in this example, to 2), it still reads the entire dataset and just returns 2 records as the result. Therefore it is not a performance optimisation.

    Except for order, none of the relational operators in Pig guarantees to return the same result set if you execute the same query multiple times, and limit is no exception to the rule. In our example above, if we do the limit again it might return different values this time. For example:
     grunt> D = limit C 2;   
      grunt> dump D;   
      o/p   
      (Larry)   
      (Mark)  
    
    This is how pig has been designed . 

    Sample : 

    Sample offers a simple way to get a sample of your data. It reads through all of your data but returns only a percentage of the rows. The percentage it returns is expressed as a double value between 0 and 1. So, in the following example, 0.4 indicates 40%:



    The percentage will not be an exact match, but close.
     grunt> E = sample A 0.4;  
    
    Internally, sample A 0.4 is rewritten to something like filter A by random() <= 0.4.

    Parallel

    One of Pig's core claims is that it provides a language for parallel data processing, so Pig prefers that you tell it how parallel to be. To do this, it provides the parallel clause.
    The parallel clause can be attached to any relational operator in Pig Latin. However, it controls only reduce-side parallelism, so it makes sense only for operators that force a reduce phase. These are: group*, order, distinct, join*, limit, cogroup*, and cross. Operators marked with an asterisk have multiple implementations, some of which force a reduce and some of which do not.
     grunt> B = group A by name parallel 10;  
    
    In this example, parallel will cause the MapReduce job spawned by Pig to have 10
    reducers. A parallel clause applies only to the statement to which it is attached; it
    does not carry through the script.
    If, however, you do not want to set parallel separately for every reduce-invoking operator
    in your script, you can set a script-wide value using the set command:
     grunt> set default_parallel 10;  
    
    I will talk about advanced operators in my next blog. Watch this space.

    Tuesday, January 7, 2014

    PIG Overview !!!!

    Apache PIG !!

    Point to ponder , why is pig sitting on Hadoop ?

    What is PIG?


    Apache Pig is a platform for analyzing large, very large, sets of data. The next question generally asked is: WHY PIG? Aren't there other languages that can do the same stuff? Traditionally, large sets of data are analysed using a data warehouse. In today's world, Oracle/DB2 databases are able to process them, although they take hours, days, weeks and sometimes forever. Sounds like deja vu? If you are a database developer, you will concur with me.

    On top of that, think of the powerful servers that are used to process this data, and think of the cost associated with them. If I were a small organization, would I ever imagine processing large data using traditional tools and technologies? Probably not.

    One simple reason Pig is gaining popularity is that it provides an engine for executing data flows in parallel on top of Hadoop (as depicted in the picture), as well as in standalone mode.

    Pig is a dataflow programming environment for processing very large files . 

    DataFlow !! Now what is that ?

    Pig is sequential in its approach; there are no control statements like IF and ELSE. That means control cannot jump from one statement to another, skipping a few in between.

    A quick comparison between Pig and SQL .



    One major, and very interesting, difference between Pig and most other programming languages is this:

    Most programming languages are compiled, and the steps below are performed during compilation:


    1.  Lexical analysis

    2.  Parsing
    3.  Semantic Analysis
    4.  Optimization
    5.  Code generation.

    When a Pig script is compiled, Pig only checks whether the code is syntactically correct; in other words, it performs only the first 3 steps: lexical analysis, parsing and semantic analysis.

    For more details on compilers , please refer to the below :


    https://prod-c2g.s3.amazonaws.com/CS143/Spring2013/files/lecture01.pdf?Signature=CT%2B6cKEwsgXlGMlWr2XG3CqW2F4%3D&Expires=1704596893&AWSAccessKeyId=AKIAIINO3Q3NXKJA2PXQ

    Analysis of why we should use Pig .

    1. Pigs Eat Anything : Pig can operate on data and metadata . It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
    2. Pigs Live Anywhere : Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but it is not intended to run only on Hadoop.
    3. Pigs fly : Pig processes data quickly
    4. Pigs Are Domestic Animals:
      • Pig is designed to be easily controlled and modified by its users.
      • Pig allows easy integration with Java or languages that can compile down to Java, such as Jython. Pig supports user-defined load and store functions. If you do not understand this now, kindly move on; you will understand it later.
      • Pig has an optimizer that rearranges some operations in Pig Latin scripts to give better performance, combines Map Reduce jobs together,etc. However, users can easily turn this optimizer off to prevent it from making changes that do not make sense in their situation.

    Let us now analyze the basic difference  between PIG over Hadoop and SQL.

    First and foremost, Pig Latin should be a natural choice for constructing data pipelines. Now what the hell is a data pipeline??

    Wikipedia states that : pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

    In simple terms, we can imagine pipelines as the water distribution system of a city: numerous pipes merge into one big pipe carrying water, and the water is then distributed throughout the households.

    If you are interested in reading more about pipelining, read the Wikipedia article on it.


    1.   Pig Latin is procedural whereas SQL is declarative : Look at the example below.
    SQL : The chart below shows several of the SQL language elements that compose a single statement, which is why it is declarative.


    PIG : 
    Sample data :  
    File Name : txn.csv

     00031941,02-20-2011,4003793,110.97,Water Sports,Kitesurfing,Springfield,Illinois,credit  
     00031942,08-12-2011,4004260,037.22,Outdoor Recreation,Skateboarding,Indianapolis ,Indiana,cash  
     00031943,01-18-2011,4004986,188.70,Team Sports,Rugby,Santa Ana,California,credit  
     00031944,02-18-2011,4005434,035.64,Games,Board Games,Charlotte,North Carolina,cash  
     00031945,11-15-2011,4004357,126.87,Exercise & Fitness,Weightlifting Belts,Columbia,South Carolina,credit  
     00031946,02-11-2011,4008482,090.33,Exercise & Fitness,Cardio Machine Accessories,Newark,New Jersey,credit  
     00031947,01-21-2011,4000145,074.79,Team Sports,Soccer,Colorado Springs,Colorado,credit  
    

    Code Snapshot : 
     transaction = LOAD 'retail/txn.csv' USING PigStorage(',') AS(txn_id,txn_dt,cust_id,amt:double,cat,sub_cat,adr1,adr2,trans_type);  
    

    The above data will be loaded into a relation named "transaction". Most of us coming from other technical backgrounds have the tendency to call "transaction" a variable, but it should actually be called a relation.

    This relation is somewhat similar to a relation in Oracle. Put simply, a "relation" is a table: the heading is the definition of the structure, and the rows are the data.


    Also, interestingly, every step in Pig returns a data set, as shown below.


    Below, the data set is reduced to 10 records:
     grunt> transaction = LOAD 'retail/txn.csv' USING PigStorage(',') AS(txn_id,txn_dt,cust_id,amt:double,cat,sub_cat,adr1,adr2,trans_type);  
     grunt> limit_transaction = limit transaction 10;  -- limiting the data to 10 
     grunt> dump limit_transaction  
     2014-01-06 22:55:03,568 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT  
     2014-01-06 22:55:03,890 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false  
     2014-01-06 22:55:03,957 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 2  
     2014-01-06 22:55:03,958 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 2  
     2014-01-06 22:55:04,063 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job  
     2014-01-06 22:55:04,098 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3  
     2014-01-06 22:55:04,101 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1  
     2014-01-06 22:55:04,102 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job5187351683488465485.jar  
     2014-01-06 22:55:08,236 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job5187351683488465485.jar created  
     2014-01-06 22:55:08,265 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job  
     2014-01-06 22:55:08,273 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.  
     2014-01-06 22:55:08,274 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche  
     2014-01-06 22:55:08,276 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []  
     2014-01-06 22:55:08,370 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.  
     2014-01-06 22:55:08,838 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1  
     2014-01-06 22:55:08,839 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1  
     2014-01-06 22:55:08,856 [JobControl] INFO org.apache.hadoop.util.NativeCodeLoader - Loaded the native-hadoop library  
     2014-01-06 22:55:08,857 [JobControl] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library not loaded  
     2014-01-06 22:55:08,861 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 2  
     2014-01-06 22:55:08,874 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete  
     2014-01-06 22:55:09,694 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201401062108_0002  
     2014-01-06 22:55:09,694 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases limit_transaction,transaction  
     2014-01-06 22:55:09,694 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: transaction[1,14],limit_transaction[2,20] C: R:  
     2014-01-06 22:55:09,694 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201401062108_0002  
     2014-01-06 22:55:27,792 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 25% complete  
     2014-01-06 22:55:36,838 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 33% complete  
     2014-01-06 22:55:42,864 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete  
     2014-01-06 22:55:49,426 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job  
     2014-01-06 22:55:49,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3  
     2014-01-06 22:55:49,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1  
     2014-01-06 22:55:49,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6292515056400790331.jar  
     2014-01-06 22:55:52,856 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6292515056400790331.jar created  
     2014-01-06 22:55:52,866 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job  
     2014-01-06 22:55:52,867 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.  
     2014-01-06 22:55:52,868 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche  
     2014-01-06 22:55:52,868 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []  
     2014-01-06 22:55:52,906 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.  
     2014-01-06 22:55:53,119 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1  
     2014-01-06 22:55:53,120 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1  
     2014-01-06 22:55:53,122 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1  
     2014-01-06 22:55:53,408 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201401062108_0003  
     2014-01-06 22:55:53,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases transaction  
     2014-01-06 22:55:53,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: C: R: transaction[-1,-1]  
     2014-01-06 22:55:53,410 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://localhost:50030/jobdetails.jsp?jobid=job_201401062108_0003  
     2014-01-06 22:56:10,004 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 75% complete  
     2014-01-06 22:56:33,334 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete  
     2014-01-06 22:56:33,337 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:  
     HadoopVersion  PigVersion   UserId StartedAt    FinishedAt   Features  
     1.0.4  0.11.0 hadoop 2014-01-06 22:55:04   2014-01-06 22:56:33   LIMIT  
     Success!  
     Job Stats (time in seconds):  
     JobId  Maps  Reduces MaxMapTime   MinMapTIme   AvgMapTime   MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime    Alias Feature Outputs  
     job_201401062108_0002  2    1    9    9    9    9    15   15   15   15   limit_transaction,transaction  
     job_201401062108_0003  1    1    6    6    6    6    15   15   15   15   transaction       hdfs://localhost:8020/tmp/temp-833806321/tmp715439762,  
     Input(s):  
     Successfully read 20 records (8927 bytes) from: "hdfs://localhost:8020/user/hadoop/retail/txn.csv"  
     Output(s):  
     Successfully stored 10 records (1109 bytes) in: "hdfs://localhost:8020/tmp/temp-833806321/tmp715439762"  
     Counters:  
     Total records written : 10  
     Total bytes written : 1109  
     Spillable Memory Manager spill count : 0  
     Total bags proactively spilled: 0  
     Total records proactively spilled: 0  
     Job DAG:  
     job_201401062108_0002  ->   job_201401062108_0003,  
     job_201401062108_0003  
     2014-01-06 22:56:33,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!  
     2014-01-06 22:56:33,358 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.  
     2014-01-06 22:56:33,366 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1  
     2014-01-06 22:56:33,367 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1  
     (00000000,06-26-2011,4007024,40.33,Exercise & Fitness,Cardio Machine Accessories,Clarksville,Tennessee,credit)  
     (00000001,05-26-2011,4006742,198.44,Exercise & Fitness,Weightlifting Gloves,Long Beach,California,credit)  
     (00000002,06-01-2011,4009775,5.58,Exercise & Fitness,Weightlifting Machine Accessories,Anaheim,California,credit)  
     (00000003,06-05-2011,4002199,198.19,Gymnastics,Gymnastics Rings,Milwaukee,Wisconsin,credit)  
     (00000004,12-17-2011,4002613,98.81,Team Sports,Field Hockey,Nashville ,Tennessee,credit)  
     (00000005,02-14-2011,4007591,193.63,Outdoor Recreation,Camping & Backpacking & Hiking,Chicago,Illinois,credit)  
     (00000006,10-28-2011,4002190,27.89,Puzzles,Jigsaw Puzzles,Charleston,South Carolina,credit)  
     (00000007,07-14-2011,4002964,96.01,Outdoor Play Equipment,Sandboxes,Columbus,Ohio,credit)  
     (00000008,01-17-2011,4007361,10.44,Winter Sports,Snowmobiling,Des Moines,Iowa,credit)  
     (00000009,05-17-2011,4004798,152.46,Jumping,Bungee Jumping,St. Petersburg,Florida,credit)  

    2.   Pig Latin allows pipeline developers to decide where to checkpoint data (save the data) in the pipeline: Pig allows storage of data at every point in the pipeline. That way, when a failure occurs, the whole pipeline does not have to be rerun. This is done using the LOAD and STORE commands, as shown below.

     STORE limit_transaction INTO 'retail/txn_10.csv' USING PigStorage ('*');  
    

    This will store the data in hdfs location :

     hdfs://localhost:8020/user/hadoop/retail/txn_10.csv  

    This is how the data will be displayed :
     grunt> cat hdfs://localhost:8020/user/hadoop/retail/txn_10.csv  
     00000000*06-26-2011*4007024*40.33*Exercise & Fitness*Cardio Machine Accessories*Clarksville*Tennessee*credit  
     00000001*05-26-2011*4006742*198.44*Exercise & Fitness*Weightlifting Gloves*Long Beach*California*credit  
     00000002*06-01-2011*4009775*5.58*Exercise & Fitness*Weightlifting Machine Accessories*Anaheim*California*credit  
     00000003*06-05-2011*4002199*198.19*Gymnastics*Gymnastics Rings*Milwaukee*Wisconsin*credit  
     00000004*12-17-2011*4002613*98.81*Team Sports*Field Hockey*Nashville *Tennessee*credit  
     00000005*02-14-2011*4007591*193.63*Outdoor Recreation*Camping & Backpacking & Hiking*Chicago*Illinois*credit  
     00000006*10-28-2011*4002190*27.89*Puzzles*Jigsaw Puzzles*Charleston*South Carolina*credit  
     00000007*07-14-2011*4002964*96.01*Outdoor Play Equipment*Sandboxes*Columbus*Ohio*credit  
     00000008*01-17-2011*4007361*10.44*Winter Sports*Snowmobiling*Des Moines*Iowa*credit  
     00000009*05-17-2011*4004798*152.46*Jumping*Bungee Jumping*St. Petersburg*Florida*credit  
    

    3.   Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer : By definition, a declarative language allows the developer to specify what must be done, not how it is done. Thus in SQL, the user can specify that data from two tables must be joined, but not which join implementation to use. This means that the optimizer is free to choose whichever algorithm it deems best to fetch the data for the given statement.

    SQL developers: look at an explain plan to see what the optimizer does. Pig keeps it simple; it will just do what you ask it to do.

    Currently Pig supports four different join implementations and two grouping implementations. It also allows users to specify the parallelism of operations inside a Pig Latin script, and does not require that every operator in the script have the same parallelization factor. This is important because data sizes often grow and shrink as data flows through the pipeline. This will be discussed later in detail.
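    For example, a join implementation can be named explicitly with a USING clause. A sketch with two hypothetical relations, txns and customers, where customers is small enough to fit in memory:
     -- 'replicated' asks for a fragment-replicate (map-side) join; the right-hand relation must fit in memory  
     joined = JOIN txns BY cust_id, customers BY cust_id USING 'replicated';  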

    4.   Pig Latin supports splits in the pipeline : as is evident below, one relation can feed several downstream relations. Watch the operators and the data splits below.
     transaction = LOAD 'retail/txn.csv' USING PigStorage(',') AS(txn_id,txn_dt,cust_id,amt:double,cat,sub_cat,adr1,adr2,trans_type);  
     less_than_100 = FILTER transaction BY amt < 100.00;  
     greater_than_100 = FILTER transaction BY amt > 100.00;  
     equal_to_100 = FILTER transaction BY amt == 100.00;  
    

    5.   Pig Latin allows developers to insert their own code almost anywhere in the data pipeline : I will discuss this later when I talk about UDFs (don't be perplexed by the term; a UDF is a user-defined function, much like an ordinary function in Oracle, C or Java).

    Before I proceed to Pig scripting, I would like to briefly talk about the difference between Pig and Hive. For those of you who do not know Hive, please ignore this paragraph.

    Read this wonderful blog : http://developer.yahoo.com/blogs/hadoop/pig-hive-yahoo-464.html

    Taking a cue from the blog :

    Hive and Pig both are used for data processing but have distinct functionality.
    Data processing often splits into three separate tasks: data collection, data preparation, and data presentation

    The data preparation phase is often known as ETL (Extract Transform Load) or the data factory. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. PIG is apt for these operations.

    The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves. HIVE is good here.

    My Next Blog on PIG will cover PIG scripting in details .