HIVE VS PIG

| Feature                        | Hive              | Pig            |
|--------------------------------|-------------------|----------------|
| Language                       | SQL-like          | Pig Latin      |
| Schemas/Types                  | Yes (explicit)    | Yes (implicit) |
| Partitions                     | Yes               | No             |
| Server                         | Optional (Thrift) | No             |
| User Defined Functions (UDF)   | Yes (Java)        | Yes (Java)     |
| Custom Serializer/Deserializer | Yes               | Yes            |
| DFS Direct Access              | Yes (implicit)    | Yes (explicit) |
| Join/Order/Sort                | Yes               | Yes            |
| Shell                          | Yes               | Yes            |
| Streaming                      | Yes               | Yes            |
| Web Interface                  | Yes               | No             |
| JDBC/ODBC                      | Yes (limited)     | No             |
Apache
Pig and Hive are two projects that layer on top of Hadoop, and provide a
higher-level language for using Hadoop's MapReduce library. Apache Pig
provides a scripting language for describing operations like reading,
filtering, transforming, joining, and writing data -- exactly the
operations that MapReduce was originally designed for. Rather than
expressing these operations in thousands of lines of Java code that uses
MapReduce directly, Pig lets users express them in a language not
unlike a bash or perl script. Pig is excellent for prototyping and
rapidly developing MapReduce-based jobs, as opposed to coding MapReduce
jobs in Java itself.
If Pig is "Scripting for Hadoop", then Hive is "SQL queries for Hadoop".
Apache Hive offers an even more specific and higher-level language, for
querying data by running Hadoop jobs, rather than directly scripting
step-by-step the operation of several MapReduce jobs on Hadoop. The
language is, by design, extremely SQL-like. Hive is still intended as a
tool for long-running batch-oriented queries over massive data; it's not
"real-time" in any sense. Hive is an excellent tool for analysts and
business development types who are accustomed to SQL-like queries and
Business Intelligence systems; it will let them easily leverage your
shiny new Hadoop cluster to perform ad-hoc queries or generate report
data across the data stored in Hadoop.
WORD COUNT EXAMPLE - PIG SCRIPT
Q) How do you find the number of occurrences of each word in a file using a Pig script?
You can find the famous word count example written as MapReduce programs on the Apache website. Here we will write a simple Pig script for the word count problem.
The following Pig script finds the number of times each word is repeated in a file:
Word Count Example Using Pig Script:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
The above Pig script first splits each line into words using the TOKENIZE function, which produces a bag of words. Using the FLATTEN operator, the bag is converted into individual tuples, one per word. In the third statement, the words are grouped together so that the count can be computed, which is done in the fourth statement.
With just five lines of Pig, we have solved the word count problem very easily.
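As a rough sketch of how this might be run, assume the five statements above are saved in a file named wordcount.pig and that the input is the small hypothetical two-line file shown below; the file names, paths, and contents are illustrative only, and the order of the DUMP output may differ.
> cat input.txt
hello pig
hello hadoop
> hadoop fs -put input.txt /user/hadoop/HDFS_File.txt
> pig wordcount.pig
(hello,2)
(pig,1)
(hadoop,1)
Each output tuple pairs a word (the group key) with the number of times it occurred.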
HOW TO FILTER RECORDS - PIG TUTORIAL EXAMPLES
Pig allows you to remove unwanted records based on a condition. The
FILTER functionality is similar to the WHERE clause in SQL. The FILTER
operator in Pig is used to remove unwanted records from the data file.
The syntax of the FILTER operator is shown below:
<new relation> = FILTER <relation> BY <condition>
Here relation is the data set on which the filter is applied, condition is the filter condition and new relation is the relation created after filtering the rows.
Pig Filter Examples:
Let's consider the sales data set below as an example:
year,product,quantity
---------------------
2000, iphone, 1000
2001, iphone, 1500
2002, iphone, 2000
2000, nokia, 1200
2001, nokia, 1500
2002, nokia, 900
1. select products whose quantity is greater than or equal to 1000.
grunt> A = LOAD '/user/hadoop/sales' USING PigStorage(',') AS (year:int, product:chararray, quantity:int);
grunt> B = FILTER A BY quantity >= 1000;
grunt> DUMP B;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2000,nokia,1200)
(2001,nokia,1500)
2. select products whose quantity is greater than 1000 and year is 2001
grunt> C = FILTER A BY quantity > 1000 AND year == 2001;
grunt> DUMP C;
(2001,iphone,1500)
(2001,nokia,1500)
3. select products with year not in 2000
grunt> D = FILTER A BY year != 2000;
grunt> DUMP D;
(2001,iphone,1500)
(2002,iphone,2000)
(2001,nokia,1500)
(2002,nokia,900)
You can use all the logical operators (NOT, AND, OR) and relational operators (< , >, ==, !=, >=, <= ) in the filter conditions.
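As a sketch combining these operators (reusing the relation A loaded above; the relation name E is just an illustrative choice), the following selects rows where the product is not nokia or the quantity is below 1000:
grunt> E = FILTER A BY NOT (product == 'nokia') OR quantity < 1000;
grunt> DUMP E;
(2000,iphone,1000)
(2001,iphone,1500)
(2002,iphone,2000)
(2002,nokia,900)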
CREATING SCHEMA, READING AND WRITING DATA - PIG TUTORIAL
The first step in processing a data set using pig is to define a schema
for the data set. A schema is a representation of the data set in terms
of fields. Let's see how to define a schema with an example.
Consider the following products data set in Hadoop as an example:
10, iphone, 1000
20, samsung, 2000
30, nokia, 3000
Here the first field is the product id, the second field is the product name, and the third field is the product price.
Defining Schema:
The LOAD operator is used to define a schema for a data set. Let's see different usages of the LOAD operator for defining the schema for the above dataset.
1. Creating Schema without specifying any fields.
In this method, we don't specify any field names for creating the schema. An example is shown below:
grunt> A = LOAD '/user/hadoop/products';
Pig is a data flow language. Each operational statement in Pig consists of a relation and an operation: the left side of the statement is the relation and the right side is the operation. Pig statements must be terminated with a semicolon. Here A is a relation, and /user/hadoop/products is the file in HDFS.
To view the schema of a relation, use the describe statement which is shown below:
grunt> describe A;
Schema for A unknown.
As no fields are defined, the above describe statement on A shows "Schema for A unknown". To display the contents on the console, use the DUMP operator.
grunt> DUMP A;
(10,iphone,1000)
(20,samsung,2000)
(30,nokia,3000)
To write the data set into HDFS, use the STORE operator as shown below:
grunt> STORE A INTO 'hadoop directory name';
2. Defining schema without specifying any data types.
We can create a schema just by specifying the field names without any data types. An example is shown below:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id, product_name, price);
grunt> describe A;
A: {id: bytearray,product_name: bytearray,price: bytearray}
grunt> STORE A INTO '/user/hadoop/products_out' USING PigStorage('|'); -- writes the data with pipe as the delimiter into the HDFS products_out directory
PigStorage is used to specify the field delimiter. The default field delimiter is tab; if your data is tab-separated, you can omit the USING PigStorage clause. In the STORE operation, you can also use PigStorage to specify the output separator.
You have to specify the field names in the AS clause. As we didn't specify any data types, Pig assigns bytearray as the data type for the fields by default.
3. Defining schema with field names and data types.
To specify the data type use the colon. Take a look at the below example:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);
grunt> describe A;
A: {id: int,product_name: chararray,price: int}
Accessing the Fields:
So far, we have seen how to define a schema, how to print the contents of the data on the console and how to write data to hdfs. Now we will see how to access the fields.
The fields can be accessed in two ways:
- Field names: We can specify the field name to access the values of that particular field.
- Positional parameters: Field positions start at 0. $0 refers to the first field, $1 to the second, and so on.
Example:
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray, price:int);
grunt> B = FOREACH A GENERATE id;
grunt> C = FOREACH A GENERATE $1,$2;
grunt> DUMP B;
(10)
(20)
(30)
grunt> DUMP C;
(iphone,1000)
(samsung,2000)
(nokia,3000)
FOREACH is like a for loop used to iterate over the records of a relation. The GENERATE keyword specifies what to produce for each record. In the above example, GENERATE is used to project fields from the relation A.
Note: It is always good practice to look at the schema of a relation using the describe statement before performing an operation. By knowing the schema, you will know how to access its fields.
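Following that advice, a quick sketch with the relations B and C defined above would look roughly like this (the exact describe output can vary slightly between Pig versions):
grunt> describe B;
B: {id: int}
grunt> describe C;
C: {product_name: chararray,price: int}
Projecting by position keeps the original field names, so C still exposes product_name and price.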
PIG DATA TYPES - PRIMITIVE AND COMPLEX
Pig has a very limited set of data types. Pig data types are classified into two types. They are:
- Primitive
- Complex
Primitive Data Types: The primitive datatypes are also called simple datatypes. The simple data types that Pig supports are listed below (a small sketch follows the list):
- int : A signed 32-bit integer, similar to Java's Integer.
- long : A signed 64-bit integer, similar to Java's Long.
- float : A 32-bit floating point number, similar to Java's Float.
- double : A 64-bit floating point number, similar to Java's Double.
- chararray : A character array in Unicode UTF-8 format. This corresponds to Java's String.
- bytearray : Used to represent bytes. It is the default data type: if you don't specify a data type for a field, bytearray is assigned.
- boolean : Represents true/false values.
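As a small sketch, the LOAD below declares fields of several simple types; the file path and field names are hypothetical, and declaring boolean in a schema assumes a reasonably recent Pig release:
grunt> A = LOAD '/user/hadoop/items' USING PigStorage(',') AS (id:long, name:chararray, weight:float, price:double, in_stock:boolean, raw:bytearray);
grunt> describe A;
A: {id: long,name: chararray,weight: float,price: double,in_stock: boolean,raw: bytearray}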
Complex Types: Pig supports three complex data types. They are listed below:
- Tuple : An ordered set of fields. A tuple is enclosed in parentheses. Example: (1,2)
- Bag : A collection of tuples. A bag is enclosed in curly braces. Example: {(1,2),(3,4)}
- Map : A set of key-value pairs. A map is enclosed in square brackets, and the # separates key and value. Example: [key#value]
Pig allows nesting of complex data structures. For example, you can nest a tuple inside another tuple, a bag, or a map, as sketched below.
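As a rough sketch, the schema below declares one field of each complex type; the path, field names, and the single example record are hypothetical, and the input file would need to use Pig's textual tuple/bag/map layout with tab-separated fields:
grunt> A = LOAD '/user/hadoop/complex_data' AS (t:tuple(x:int, y:int), b:bag{tp:tuple(word:chararray)}, m:map[]);
grunt> DUMP A;
((1,2),{(hello),(world)},[color#red])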
Null: Null is not a datatype. Null represents an undefined or corrupted value. Example: let's say you have declared a field as int type, but that field actually contains character values. When reading data from this field, Pig converts those corrupted character values into nulls. Any operation with null results in null. Null in Pig is similar to NULL in SQL.
RELATIONS, BAGS, TUPLES, FIELDS - PIG TUTORIAL
In this article, we will see what a relation, bag, tuple, and field are. Let's look at each of these in detail.
Let's consider the following products dataset as an example:
Id, product_name
----------------
10, iphone
20, samsung
30, Nokia
- Field: A field is a piece of data. In the above data set, product_name is a field.
- Tuple: A tuple is a set of fields. Here Id and product_name form a tuple. Tuples are enclosed in parentheses. Example: (10, iphone).
- Bag: A bag is a collection of tuples. A bag is enclosed in curly braces. Example: {(10,iphone),(20, samsung),(30,Nokia)}.
- Relation: A relation represents the complete data set. A relation is a bag; to be precise, it is an outer bag. We can call a relation a bag of tuples. A small sketch follows this list.
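To tie these terms to Pig output, here is a rough sketch using a hypothetical HDFS path for the products file: each line printed by DUMP is a tuple, id and product_name are its fields, and the relation A as a whole is the outer bag.
grunt> A = LOAD '/user/hadoop/products' USING PigStorage(',') AS (id:int, product_name:chararray);
grunt> DUMP A;
(10,iphone)
(20,samsung)
(30,Nokia)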
HOW TO RUN PIG PROGRAMS - EXAMPLES
Pig programs can be run in three modes:
- Script Mode
- Grunt Mode
- Embedded Mode
Script Mode or Batch Mode: In script mode, Pig runs the commands specified in a script file. The following example shows how to run a Pig program from a script file:
> cat scriptfile.pig
A = LOAD 'script_file';
DUMP A;
> pig scriptfile.pig
(pig script mode example)
(pig runs on top of hadoop)
Grunt Mode or Interactive Mode: The grunt mode can also be called interactive mode. Grunt is Pig's interactive shell. It is started when no file is specified for Pig to run.
> pig
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;
(pig grunt or interactive mode example)
(pig runs on top of hadoop)
You can also run pig scripts from grunt using run and exec commands.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Embedded Mode: You can embed Pig programs in Java and run them from Java.