Apache Pig Introduction
Apache Pig Environment
Pig Latin
Load & Store Operators
Diagnostic Operators
- Apache Pig - Illustrate Operator
- Apache Pig - Explain Operator
- Apache Pig - Describe Operator
- Apache Pig - Diagnostic Operator
Grouping & Joining
- Apache Pig - Cross Operator
- Apache Pig - Join Operator
- Apache Pig - Cogroup Operator
- Apache Pig - Group Operator
Combining & Splitting
Filtering
Sorting
Pig Latin Built-In Functions
- Apache Pig - Math Functions
- Apache Pig - date-time Functions
- Apache Pig - String Functions
- Apache Pig - Bag & Tuple Functions
- Load & Store Functions
- Apache Pig - Eval Functions
Other Modes Of Execution
Apache Pig Useful Resources
Selected Reading
- Who is Who
- Computer Glossary
- HR Interview Questions
- Effective Resume Writing
- Questions and Answers
- UPSC IAS Exams Notes
Apache Pig - Quick Guide
Apache Pig - Overview
What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger sets of data representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Why Do We Need Apache Pig?
Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers.
Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex codes in Java.
Apache Pig uses multi-query approach, thereby reducing the length of codes. For example, an operation that would require you to type 200 pnes of code (LoC) in Java can be easily done by typing as less as just 10 LoC in Apache Pig. Ultimately Apache Pig reduces the development time by almost 16 times.
Pig Latin is SQL-pke language and it is easy to learn Apache Pig when you are famipar with SQL.
Apache Pig provides many built-in operators to support data operations pke joins, filters, ordering, etc. In addition, it also provides nested data types pke tuples, bags, and maps that are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features −
Rich set of operators − It provides many operators to perform operations pke join, sort, filer, etc.
Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities − The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on semantics of the language.
Extensibipty − Using the existing operators, users can develop their own functions to read, process, and write data.
UDF’s − Pig provides the facipty to create User-defined Functions in other programming languages such as Java and invoke or embed them in Pig Scripts.
Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Listed below are the major differences between Apache Pig and MapReduce.
Apache Pig | MapReduce |
---|---|
Apache Pig is a data flow language. | MapReduce is a data processing paradigm. |
It is a high level language. | MapReduce is low level and rigid. |
Performing a Join operation in Apache Pig is pretty simple. | It is quite difficult in MapReduce to perform a Join operation between datasets. |
Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig. | Exposure to Java is must to work with MapReduce. |
Apache Pig uses multi-query approach, thereby reducing the length of the codes to a great extent. | MapReduce will require almost 20 times more the number of pnes to perform the same task. |
There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. | MapReduce jobs have a long compilation process. |
Apache Pig Vs SQL
Listed below are the major differences between Apache Pig and SQL.
Pig | SQL |
---|---|
Pig Latin is a procedural language. | SQL is a declarative language. |
In Apache Pig, schema is optional. We can store data without designing a schema (values are stored as $01, $02 etc.) | Schema is mandatory in SQL. |
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational. |
Apache Pig provides pmited opportunity for Query optimization. | There is more opportunity for query optimization in SQL. |
In addition to above differences, Apache Pig Latin −
Allows sppts in the pipepne.
Allows developers to store data anywhere in the pipepne.
Declares execution plans.
Provides operators to perform ETL (Extract, Transform, and Load) functions.
Apache Pig Vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs. And in some cases, Hive operates on HDFS in a similar way Apache Pig does. In the following table, we have psted a few significant points that set Apache Pig apart from Hive.
Apache Pig | Hive |
---|---|
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook. |
Pig Latin is a data flow language. | HiveQL is a query processing language. |
Pig Latin is a procedural language and it fits in pipepne paradigm. | HiveQL is a declarative language. |
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data. |
Apppcations of Apache Pig
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc processing and quick prototyping. Apache Pig is used −
To process huge data sources such as web logs.
To perform data processing for search platforms.
To process time sensitive data loads.
Apache Pig – History
In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via Apache incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.
Apache Pig - Architecture
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a highlevel data processing language which provides a rich set of data types and operators to perform various operations on the data.
To perform a particular task Programmers using Pig, programmers need to write a Pig script using the Pig Latin language, and execute them using any of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts will go through a series of transformations appped by the Pig Framework, to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer’s job easy. The architecture of Apache Pig is shown below.
Apache Pig Components
As shown in the figure, there are various components in the Apache Pig framework. Let us take a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser will be a DAG (directed acycpc graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these MapReduce jobs are executed on Hadoop producing the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.
Atom
Any single value in Pig Latin, irrespective of their data, type is known as an Atom. It is stored as string and can be used as string and number. int, long, float, double, chararray, and bytearray are the atomic values of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unpke a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as inner bag.
Example − {Raja, 30, {9848022338, raja@gmail.com,}}
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
Apache Pig - Installation
This chapter explains the how to download, install, and set up Apache Pig in your system.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to instalpng Apache Pig, install Hadoop and Java by following the steps given in the following pnk −
Download Apache Pig
First of all, download the latest version of Apache Pig from the following website −
Step 1
Open the homepage of Apache Pig website. Under the section News, cpck on the pnk release page as shown in the following snapshot.
Step 2
On cpcking the specified pnk, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two pnks, namely, Pig 0.8 and later and Pig 0.7 and before. Cpck on the pnk Pig 0.8 and later, then you will be redirected to the page having a set of mirrors.
Step 3
Choose and cpck any one of these mirrors as shown below.
Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Cpck the latest version among them.
Step 5
Within these folders, you will have the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15, pig0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
Install Apache Pig
After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop, Java, and other software were installed. (In our tutorial, we have created the Pig directory in the user named Hadoop).
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/ $ tar zxvf pig-0.15.0-src.tar.gz $ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earper as shown below.
$ mv pig-0.15.0-src.tar.gz/* /home/Hadoop/Pig/
Configure Apache Pig
After instalpng Apache Pig, we have to configure it. To configure, we need to edit two files − bashrc and pig.properties.
.bashrc file
In the .bashrc file, set the following variables −
PIG_HOME folder to the Apache Pig’s installation folder,
PATH environment variable to the bin folder, and
PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installations (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).
export PIG_HOME = /home/Hadoop/Pig export PATH = $PATH:/home/Hadoop/pig/bin export PIG_CLASSPATH = $HADOOP_HOME/conf
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.
pig -h properties
The following properties are supported −
Logging: verbose = true|false; default is false. This property is the same as -v switch brief=true|false; default is false. This property is the same as -b switch debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch aggregate.warning = true|false; default is true. If true, prints count of warnings of each type rather than logging each warning. Performance tuning: pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory). Note that this memory is shared across all large bags used by the apppcation. pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory). Specifies the fraction of heap available for the reducer to perform the join. pig.exec.nocombiner = true|false; default is false. Only disable combiner as a temporary workaround for problems. opt.multiquery = true|false; multiquery is on by default. Only disable multiquery as a temporary workaround for problems. opt.fetch=true|false; fetch is on by default. Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs. pig.tmpfilecompression = true|false; compression is off by default. Determines whether output of intermediate jobs is compressed. pig.tmpfilecompression.codec = lzo|gzip; default is gzip. Used in conjunction with pig.tmpfilecompression. Defines compression type. pig.noSpptCombination = true|false. Sppt combination is on by default. Determines if multiple small files are combined into a single map. pig.exec.mapPartAgg = true|false. Default is false. Determines if partial aggregation is done within map phase, before records are sent to combiner. pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10. If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled. Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch pig.additional.jars.uris=<comma seperated pst of jars>. Used in place of register command. udf.import.pst=<comma seperated pst of imports>. Used to avoid package names in UDF. stop.on.failure = true|false; default is false. Set to true to terminate on the first error. pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host. Determines the timezone used to handle datetime datatype and UDFs. Additionally, any Hadoop property can be specified.
Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.
$ pig –version Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
Apache Pig - Execution
In the previous chapter, we explained how to install Apache Pig. In this chapter, we will discuss how to execute Apache Pig.
Apache Pig Execution Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need of Hadoop or HDFS. This mode is generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and embedded mode.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.
Invoking the Grunt Shell
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the −x option as shown below.
Local mode | MapReduce mode |
---|---|
Command − $ ./pig –x local |
Command − $ ./pig -x mapreduce |
Output − |
Output − |
Either of these commands gives you the Grunt shell prompt as shown below.
grunt>
You can exit the Grunt shell using ‘ctrl + d’.
After invoking the Grunt shell, you can execute a Pig script by directly entering the Pig Latin statements in it.
grunt> customers = LOAD customers.txt USING PigStorage( , );
Executing Apache Pig in Batch Mode
You can write an entire Pig Latin script in a file and execute it using the –x command. Let us suppose we have a Pig script in a file named sample_script.pig as shown below.
Sample_script.pig
student = LOAD hdfs://localhost:9000/pig_data/student.txt USING PigStorage( , ) as (id:int,name:chararray,city:chararray); Dump student;
Now, you can execute the script in the above file as shown below.
Local mode | MapReduce mode |
---|---|
$ pig -x local Sample_script.pig | $ pig -x mapreduce Sample_script.pig |
Note − We will discuss in detail how to run a Pig script in Bach mode and in embedded mode in subsequent chapters.
Apache Pig - Grunt Shell
After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to that, there are certain useful shell and utipty commands provided by the Grunt shell. This chapter explains the shell and utipty commands provided by the Grunt shell.
Note − In some portions of this chapter, the commands pke Load and Store are used. Refer the respective chapters to get in-detail information on them.
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. Prior to that, we can invoke any shell commands using sh and fs.
sh Command
Using sh command, we can invoke any shell commands from the Grunt shell. Using sh command from the Grunt shell, we cannot execute the commands that are a part of the shell environment (ex − cd).
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
We can invoke the ls command of Linux shell from the Grunt shell using the sh option as shown below. In this example, it psts out the files in the /pig/bin/ directory.
grunt> sh ls pig pig_1444799121955.log pig.cmd pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.
Syntax
Given below is the syntax of fs command.
grunt> sh File System command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using fs command. In the following example, it psts the files in the HDFS root directory.
grunt> fs –ls Found 3 items drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.
Utipty Commands
The Grunt shell provides a set of utipty commands. These include utipty commands such as clear, help, history, quit, and set; and commands such as exec, kill, and run to control Pig from the Grunt shell. Given below is the description of the utipty commands provided by the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
You can clear the screen of the grunt shell using the clear command as shown below.
grunt> clear
help Command
The help command gives you a pst of Pig commands or Pig properties.
Usage
You can get a pst of Pig commands using the help command as shown below.
grunt> help Commands: <pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig File system commands:fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html Diagnostic Commands:describe <apas>[::<apas] - Show the schema for the apas. Inner apases can be described as A::B. explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=<pCram_value>] [-param_file <file_name>] [<apas>] - Show the execution plan to compute the apas or for entire script. -script - Explain the entire script. -out - Store the output into directory rather than print to stdout. -brief - Don t expand nested plans (presenting a smaller graph for overview). -dot - Generate the output in .dot format. Default is text format. -xml - Generate the output in .xml format. Default is text format. -param <param_name - See parameter substitution for details. -param_file <file_name> - See parameter substitution for details. apas - Apas to explain. dump <apas> - Compute the apas and writes the results to stdout. Utipty Commands: exec [-param <param_name>=param_value] [-param_file <file_name>] <script> - Execute the script with access to grunt environment including apases. -param <param_name - See parameter substitution for details. -param_file <file_name> - See parameter substitution for details. script - Script to be executed. run [-param <param_name>=param_value] [-param_file <file_name>] <script> - Execute the script with access to grunt environment. -param <param_name - See parameter substitution for details. -param_file <file_name> - See parameter substitution for details. script - Script to be executed. sh <shell command> - Invoke a shell command. kill <job_id> - Kill the hadoop job specified by the hadoop job id. set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive. The following keys are supported: default_parallel - Script-level reduce parallepsm. Basic input size heuristics used by default. debug - Set debug on or off. Default is off. job.name - Single-quoted name for jobs. Default is PigLatin:<script name> job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal stream.skippath - String that contains the path. This is used by streaming any hadoop property. help - Display this message. history [-n] - Display the pst statements in cache. -n Hide pne numbers. quit - Quit the grunt shell.
history Command
This command displays a pst of statements executed / used so far since the Grunt sell is invoked.
Usage
Assume we have executed three statements since opening the Grunt shell.
grunt> customers = LOAD hdfs://localhost:9000/pig_data/customers.txt USING PigStorage( , ); grunt> orders = LOAD hdfs://localhost:9000/pig_data/orders.txt USING PigStorage( , ); grunt> student = LOAD hdfs://localhost:9000/pig_data/student.txt USING PigStorage( , );
Then, using the history command will produce the following output.
grunt> history customers = LOAD hdfs://localhost:9000/pig_data/customers.txt USING PigStorage( , ); orders = LOAD hdfs://localhost:9000/pig_data/orders.txt USING PigStorage( , ); student = LOAD hdfs://localhost:9000/pig_data/student.txt USING PigStorage( , );
set Command
The set command is used to show/assign values to keys used in Pig.
Usage
Using this command, you can set values to the following keys.
Key | Description and values |
---|---|
default_parallel | You can set the number of reducers for a map job by passing any whole number as a value to this key. |
debug | You can turn off or turn on the debugging freature in Pig by passing on/off to this key. |
job.name | You can set the Job name to the required job by passing a string value to this key. |
job.priority | You can set the job priority to a job by passing one of the following values to this key − very_low low normal high very_high |
stream.skippath | For streaming, you can set the path from where the data is not to be transferred, by passing the desired path in the form of a string to this key. |
quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit
Let us now take a look at the commands using which you can control Apache Pig from the Grunt shell.
exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utipty command exec.
grunt> exec [–param param_name = param_value] [–param_file file_name] [script]
Example
Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad 002,siddarth,Kolkata 003,Rajesh,Delhi
And, assume we have a script file named sample_script.pig in the /pig_data/ directory of HDFS with the following content.
Sample_script.pig
student = LOAD hdfs://localhost:9000/pig_data/student.txt USING PigStorage( , ) as (id:int,name:chararray,city:chararray); Dump student;
Now, let us execute the above script from the Grunt shell using the exec command as shown below.
grunt> exec /sample_script.pig
Output
The exec command executes the script in the sample_script.pig. As directed in the script, it loads the student.txt file into Pig and gives you the result of the Dump operator displaying the following content.
(1,Rajiv,Hyderabad) (2,siddarth,Kolkata) (3,Rajesh,Delhi)
kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId
Example
Suppose there is a running Pig job having id Id_0055, you can kill it from the Grunt shell using the kill command, as shown below.
grunt> kill Id_0055
run Command
You can run a Pig script from the Grunt shell using the run command
Syntax
Given below is the syntax of the run command.
grunt> run [–param param_name = param_value] [–param_file file_name] script
Example
Let us assume there is a file named student.txt in the /pig_data/ directory of HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad 002,siddarth,Kolkata 003,Rajesh,Delhi
And, assume we have a script file named sample_script.pig in the local filesystem with the following content.
Sample_script.pig
student = LOAD hdfs://localhost:9000/pig_data/student.txt USING PigStorage( , ) as (id:int,name:chararray,city:chararray);
Now, let us run the above script from the Grunt shell using the run command as shown below.
grunt> run /sample_script.pig
You can see the output of the script using the Dump operator as shown below.
grunt> Dump; (1,Rajiv,Hyderabad) (2,siddarth,Kolkata) (3,Rajesh,Delhi)
Note − The difference between exec and the run command is that if we use run, the statements from the script are available in the command history.
Pig Latin – Basics
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter, we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types, general and relational operators, and Pig Latin UDF’s.
Pig Latin – Data Model
As discussed in the previous chapters, the data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin data model. And it is a bag where −
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.
Pig Latin – Statemets
While processing data using Pig Latin, statements are the basic constructs.
These statements work with relations. They include expressions and schemas.
Every statement ends with a semicolon (;).
We will perform various operations using operators provided by Pig Latin, through statements.
Except LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.
As soon as you enter a Load statement in the Grunt shell, its semantic checking will be carried out. To see the contents of the schema, you need to use the Dump operator. Only after performing the dump operation, the MapReduce job for loading the data into the file system will be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD student_data.txt USING PigStorage( , )as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Pig Latin – Data types
Given below table describes the Pig Latin data types.
S.N. | Data Type | Description & Example |
---|---|---|
1 | int | Represents a signed 32-bit integer. Example : 8 |
2 | long | Represents a signed 64-bit integer. Example : 5L |
3 | float | Represents a signed 32-bit floating point. Example : 5.5F |
4 | double | Represents a 64-bit floating point. Example : 10.5 |
5 | chararray | Represents a character array (string) in Unicode UTF-8 format. Example : ‘tutorials point’ |
6 | Bytearray | Represents a Byte array (blob). |
7 | Boolean | Represents a Boolean value. Example : true/ false. |
8 | Datetime | Represents a date-time. Example : 1970-01-01T00:00:00.000+00:00 |
9 | Biginteger | Represents a Java BigInteger. Example : 60708090709 |
10 | Bigdecimal | Represents a Java BigDecimal Example : 185.98376256272893883 |
Complex Types | ||
11 | Tuple | A tuple is an ordered set of fields. Example : (raja, 30) |
12 | Bag | A bag is a collection of tuples. Example : {(raju,30),(Mohhammad,45)} |
13 | Map | A Map is a set of key-value pairs. Example : [ ‘name’#’Raju’, ‘age’#30] |
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a placeholder for optional values. These nulls can occur naturally or can be the result of an operation.
Pig Latin – Arithmetic Operators
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.
Operator | Description | Example |
---|---|---|
+ | Addition − Adds values on either side of the operator |
a + b will give 30 |
− | Subtraction − Subtracts right hand operand from left hand operand |
a − b will give −10 |
* | Multippcation − Multippes values on either side of the operator |
a * b will give 200 |
/ | Division − Divides left hand operand by right hand operand |
b / a will give 2 |
% | Modulus − Divides left hand operand by right hand operand and returns remainder |
b % a will give 0 |
? : | Bincond − Evaluates the Boolean operators. It has three operands as shown below. variable x = (expression) ? value1 if true : value2 if false. |
b = (a == 1)? 20: 30; if a=1 the value of b is 20. if a!=1 the value of b is 30. |
CASE WHEN THEN ELSE END |
Case − The case operator is equivalent to nested bincond operator. |
CASE f2 % 2 WHEN 0 THEN even WHEN 1 THEN odd END |
Pig Latin – Comparison Operators
The following table describes the comparison operators of Pig Latin.
Operator | Description | Example |
---|---|---|
== | Equal − Checks if the values of two operands are equal or not; if yes, then the condition becomes true. |
(a = b) is not true |
!= | Not Equal − Checks if the values of two operands are equal or not. If the values are not equal, then condition becomes true. |
(a != b) is true. |
> | Greater than − Checks if the value of the left operand is greater than the value of the right operand. If yes, then the condition becomes true. |
(a > b) is not true. |
< | Less than − Checks if the value of the left operand is less than the value of the right operand. If yes, then the condition becomes true. |
(a < b) is true. |
>= | Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand. If yes, then the condition becomes true. |
(a >= b) is not true. |
<= | Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. |
(a <= b) is true. |
matches | Pattern matching − Checks whether the string in the left-hand side matches with the constant in the right-hand side. |
f1 matches .*tutorial.* |
Pig Latin – Type Construction Operators
The following table describes the Type construction operators of Pig Latin.
Operator | Description | Example |
---|---|---|
() | Tuple constructor operator − This operator is used to construct a tuple. |
(Raju, 30) |
{} | Bag constructor operator − This operator is used to construct a bag. |
{(Raju, 30), (Mohammad, 45)} |
[] | Map constructor operator − This operator is used to construct a tuple. |
[name#Raja, age#30] |
Pig Latin – Relational Operations
The following table describes the relational operators of Pig Latin.
Operator | Description |
---|---|
Loading and Storing | |
LOAD | To Load the data from the file system (local/HDFS) into a relation. |
STORE | To save a relation to the file system (local/HDFS). |
Filtering | |
FILTER | To remove unwanted rows from a relation. |
DISTINCT | To remove duppcate rows from a relation. |
FOREACH, GENERATE | To generate data transformations based on columns of data. |
STREAM | To transform a relation using an external program. |
Grouping and Joining | |
JOIN | To join two or more relations. |
COGROUP | To group the data in two or more relations. |
GROUP | To group the data in a single relation. |
CROSS | To create the cross product of two or more relations. |
Sorting | |
ORDER | To arrange a relation in a sorted order based on one or more fields (ascending or descending). |
LIMIT | To get a pmited number of tuples from a relation. |
Combining and Spptting | |
UNION | To combine two or more relations into a single relation. |
SPLIT | To sppt a single relation into two or more relations. |
Diagnostic Operators | |
DUMP | To print the contents of a relation on the console. |
DESCRIBE | To describe the schema of a relation. |
EXPLAIN | To view the logical, physical, or MapReduce execution plans to compute a relation. |
ILLUSTRATE | To view the step-by-step execution of a series of statements. |
Apache Pig - Reading Data
In general, Apache Pig works on top of Hadoop. It is an analytical tool that analyzes large datasets that exist in the Hadoop File System. To analyze data using Apache Pig, we have to initially load the data into Apache Pig. This chapter explains how to load data to Apache Pig from HDFS.
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the results back in HDFS. Therefore, let us start HDFS and create the following sample data in HDFS.
Student ID | First Name | Last Name | Phone | City |
---|---|---|---|---|
001 | Rajiv | Reddy | 9848022337 | Hyderabad |
002 | siddarth | Battacharya | 9848022338 | Kolkata |
003 | Rajesh | Khanna | 9848022339 | Delhi |
004 | Preethi | Agarwal | 9848022330 | Pune |
005 | Trupthi | Mohanthy | 9848022336 | Bhuwaneshwar |
006 | Archana | Mishra | 9848022335 | Chennai |
The above dataset contains personal details pke id, first name, last name, phone number and city, of six students.
Step 1: Verifying Hadoop
First of all, verify the installation using Hadoop version command, as shown below.
$ hadoop version
If your system contains Hadoop, and if you have set the PATH variable, then you will get the following output −
Hadoop 2.6.0 Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1 Compiled by jenkins on 2014-11-13T21:10Z Compiled with protoc 2.5.0 From source with checksum 18e43357c8f927c0695f1e9522859d6a This command was run using /home/Hadoop/hadoop/share/hadoop/common/hadoop common-2.6.0.jar
Step 2: Starting HDFS
Browse through the sbin directory of Hadoop and start yarn and Hadoop dfs (distributed file system) as shown below.
cd /$Hadoop_Home/sbin/ $ start-dfs.sh localhost: starting namenode, logging to /home/Hadoop/hadoop/logs/hadoopHadoop-namenode-localhost.localdomain.out localhost: starting datanode, logging to /home/Hadoop/hadoop/logs/hadoopHadoop-datanode-localhost.localdomain.out Starting secondary namenodes [0.0.0.0] starting secondarynamenode, logging to /home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-localhost.localdomain.out $ start-yarn.sh starting yarn daemons starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-Hadoopresourcemanager-localhost.localdomain.out localhost: starting nodemanager, logging to /home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-localhost.localdomain.out
Step 3: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the command mkdir. Create a new directory in HDFS with the name Pig_Data in the required path as shown below.
$cd /$Hadoop_Home/bin/ $ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data
Step 4: Placing the data in HDFS
The input file of Pig contains each tuple/record in inspanidual pnes. And the entities of the record are separated by a depmiter (In our example we used “,”).
In the local file system, create an input file student_data.txt containing data as shown below.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
Now, move the file from the local file system to HDFS using put command as shown below. (You can use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin $ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt dfs://localhost:9000/pig_data/
Verifying the file
You can use the cat command to verify whether the file has been moved into the HDFS, as shown below.
$ cd $HADOOP_HOME/bin $ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt
Output
You can see the content of the file as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop pbrary for your platform... using builtin-java classes where apppcable 001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai
The Load Operator
You can load data into Apache Pig from the file system (HDFS/ Local) using LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts spanided by the “=” operator. On the left-hand side, we need to mention the name of the relation where we want to store the data, and on the right-hand side, we have to define how we store the data. Given below is the syntax of the Load operator.
Relation_name = LOAD Input file path USING function as schema;
Where,
relation_name − We have to mention the relation in which we want to store the data.
Input file path − We have to mention the HDFS directory where the file is stored. (In MapReduce mode)
function − We have to choose a function from the set of load functions provided by Apache Pig (BinStorage, JsonLoader, PigStorage, TextLoader).
Schema − We have to define the schema of the data. We can define the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − We load the data without specifying the schema. In that case, the columns will be addressed as $01, $02, etc… (check).
Example
As an example, let us load the data in student_data.txt in Pig under the schema named Student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce mode as shown below.
$ Pig –x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL 15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE 15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType 2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35 2015-10-01 12:33:38,080 [main] INFO org.apache.pig.Main - Logging error messages to: /home/Hadoop/pig_1443683018078.log 2015-10-01 12:33:38,242 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/Hadoop/.pigbootup not found 2015-10-01 12:33:39,630 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000 grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell.
grunt> student = LOAD hdfs://localhost:9000/pig_data/student_data.txt USING PigStorage( , ) as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Following is the description of the above statement.
Relation name | We have stored the data in the schema student. | ||||||||||||
Input file path | We are reading data from the file student_data.txt, which is in the /pig_data/ directory of HDFS. | ||||||||||||
Storage function | We have used the PigStorage() function. It loads and stores data as structured text files. It takes a depmiter using which each entity of a tuple is separated, as a parameter. By default, it takes ‘ ’ as a parameter. | ||||||||||||
schema | We have stored the data using the following schema.
|
Note − The load statement will simply load the data into the specified relation in Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators which are discussed in the next chapters.
Apache Pig - Storing Data
In the previous chapter, we learnt how to load data into Apache Pig. You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.
Syntax
Given below is the syntax of the Store statement.
STORE Relation_name INTO required_directory_path [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD hdfs://localhost:9000/pig_data/student_data.txt USING PigStorage( , ) as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us store the relation in the HDFS directory “/pig_Output/” as shown below.
grunt> STORE student INTO hdfs://localhost:9000/pig_Output/ USING PigStorage ( , );
Output
After executing the store statement, you will get the following output. A directory is created with the specified name and the data will be stored in it.
2015-10-05 13:05:05,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. MapReduceLau ncher - 100% complete 2015-10-05 13:05:05,429 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05 13:05:05 UNKNOWN Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime job_14459_06 1 0 n/a n/a n/a n/a MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Apas Feature 0 0 0 0 student MAP_ONLY OutPut folder hdfs://localhost:9000/pig_Output/ Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/student_data.txt" Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/pig_Output" Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1443519499159_0006 2015-10-05 13:06:06,192 [main] INFO org.apache.pig.backend.hadoop.executionengine .mapReduceLayer.MapReduceLau ncher - Success!
Verification
You can verify the stored data as shown below.
Step 1
First of all, pst out the files in the directory named pig_output using the ls command as shown below.
hdfs dfs -ls hdfs://localhost:9000/pig_Output/ Found 2 items rw-r--r- 1 Hadoop supergroup 0 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/_SUCCESS rw-r--r- 1 Hadoop supergroup 224 2015-10-05 13:03 hdfs://localhost:9000/pig_Output/part-m-00000
You can observe that two files were created after executing the store statement.
Step 2
Using cat command, pst the contents of the file named part-m-00000 as shown below.
$ hdfs dfs -cat hdfs://localhost:9000/pig_Output/part-m-00000 1,Rajiv,Reddy,9848022337,Hyderabad 2,siddarth,Battacharya,9848022338,Kolkata 3,Rajesh,Khanna,9848022339,Delhi 4,Preethi,Agarwal,9848022330,Pune 5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 6,Archana,Mishra,9848022335,Chennai
Apache Pig - Diagnostic Operators
The load statement will simply load the data into the specified relation in Apache Pig. To verify the execution of the Load statement, you have to use the Diagnostic Operators. Pig Latin provides four different types of diagnostic operators −
Dump operator
Describe operator
Explanation operator
Illustration operator
In this chapter, we will discuss the Dump operators of Pig Latin.
Dump Operator
The Dump operator is used to run the Pig Latin statements and display the results on the screen. It is generally used for debugging Purpose.
Syntax
Given below is the syntax of the Dump operator.
grunt> Dump Relation_Name
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD hdfs://localhost:9000/pig_data/student_data.txt USING PigStorage( , ) as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us print the contents of the relation using the Dump operator as shown below.
grunt> Dump student
Once you execute the above Pig Latin statement, it will start a MapReduce job to read data from HDFS. It will produce the following output.
2015-10-01 15:05:27,642 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2015-10-01 15:05:27,652 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 2.6.0 0.15.0 Hadoop 2015-10-01 15:03:11 2015-10-01 05:27 UNKNOWN Success! Job Stats (time in seconds): JobId job_14459_0004 Maps 1 Reduces 0 MaxMapTime n/a MinMapTime n/a AvgMapTime n/a MedianMapTime n/a MaxReduceTime 0 MinReduceTime 0 AvgReduceTime 0 MedianReducetime 0 Apas student Feature MAP_ONLY Outputs hdfs://localhost:9000/tmp/temp580182027/tmp757878456, Input(s): Successfully read 0 records from: "hdfs://localhost:9000/pig_data/ student_data.txt" Output(s): Successfully stored 0 records in: "hdfs://localhost:9000/tmp/temp580182027/ tmp757878456" Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_1443519499159_0004 2015-10-01 15:06:28,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLau ncher - Success! 2015-10-01 15:06:28,441 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code. 2015-10-01 15:06:28,485 [main] INFO org.apache.hadoop.mapreduce.pb.input.FileInputFormat - Total input paths to process : 1 2015-10-01 15:06:28,485 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1 (1,Rajiv,Reddy,9848022337,Hyderabad) (2,siddarth,Battacharya,9848022338,Kolkata) (3,Rajesh,Khanna,9848022339,Delhi) (4,Preethi,Agarwal,9848022330,Pune) (5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar) (6,Archana,Mishra,9848022335,Chennai)
Apache Pig - Describe Operator
The describe operator is used to view the schema of a relation.
Syntax
The syntax of the describe operator is as follows −
grunt> Describe Relation_name
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD hdfs://localhost:9000/pig_data/student_data.txt USING PigStorage( , ) as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us describe the relation named student and verify the schema as shown below.
grunt> describe student;
Output
Once you execute the above Pig Latin statement, it will produce the following output.
grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city: chararray }
Apache Pig - Explain Operator
The explain operator is used to display the logical, physical, and MapReduce execution plans of a relation.
Syntax
Given below is the syntax of the explain operator.
grunt> explain Relation_name;
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD hdfs://localhost:9000/pig_data/student_data.txt USING PigStorage( , ) as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us explain the relation named student using the explain operator as shown below.
grunt> explain student;
Output
It will produce the following output.
$ explain student; 2015-10-05 11:32:43,660 [main] 2015-10-05 11:32:43,660 [main] INFO org.apache.pig.newplan.logical.optimizer .LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SpptFilter, StreamTypeCastInserter]} #----------------------------------------------- # New Logical Plan: #----------------------------------------------- student: (Name: LOStore Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city# 35:chararray) | |---student: (Name: LOForEach Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city# 35:chararray) | | | (Name: LOGenerate[false,false,false,false,false] Schema: id#31:int,firstname#32:chararray,lastname#33:chararray,phone#34:chararray,city# 35:chararray)ColumnPrune:InputUids=[34, 35, 32, 33, 31]ColumnPrune:OutputUids=[34, 35, 32, 33, 31] | | | | | (Name: Cast Type: int Uid: 31) | | | | | |---id:(Name: Project Type: bytearray Uid: 31 Input: 0 Column: (*)) | | | | | (Name: Cast Type: chararray Uid: 32) | | | | | |---firstname:(Name: Project Type: bytearray Uid: 32 Input: 1 Column: (*)) | | | | | (Name: Cast Type: chararray Uid: 33) | | | | | |---lastname:(Name: Project Type: bytearray Uid: 33 Input: 2 Column: (*)) | | | | | (Name: Cast Type: chararray Uid: 34) | | | | | |---phone:(Name: Project Type: bytearray Uid: 34 Input: 3 Column: (*)) | | | | | (Name: Cast Type: chararray Uid: 35) | | | | | |---city:(Name: Project Type: bytearray Uid: 35 Input: 4 Column: (*)) | | | |---(Name: LOInnerLoad[0] Schema: id#31:bytearray) | | | |---(Name: LOInnerLoad[1] Schema: firstname#32:bytearray) | | | |---(Name: LOInnerLoad[2] Schema: lastname#33:bytearray) | | | |---(Name: LOInnerLoad[3] Schema: phone#34:bytearray) | | | |---(Name: LOInnerLoad[4] Schema: city#35:bytearray) | |---student: (Name: LOLoad Schema: id#31:bytearray,firstname#32:bytearray,lastname#33:bytearray,phone#34:bytearray ,city#35:bytearray)RequiredFields:null #----------------------------------------------- # Physical Plan: #----------------------------------------------- student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36 | |---student: New For Each(false,false,false,false,false)[bag] - scope-35 | | | Cast[int] - scope-21 | | | |---Project[bytearray][0] - scope-20 | | | Cast[chararray] - scope-24 | | | |---Project[bytearray][1] - scope-23 | | | Cast[chararray] - scope-27 | | | |---Project[bytearray][2] - scope-26 | | | Cast[chararray] - scope-30 | | | |---Project[bytearray][3] - scope-29 | | | Cast[chararray] - scope-33 | | | |---Project[bytearray][4] - scope-32 | |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage( , )) - scope19 2015-10-05 11:32:43,682 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false 2015-10-05 11:32:43,684 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOp timizer - MR plan size before optimization: 1 2015-10-05 11:32:43,685 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. MultiQueryOp timizer - MR plan size after optimization: 1 #-------------------------------------------------- # Map Reduce Plan #-------------------------------------------------- MapReduce node scope-37 Map Plan student: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-36 | |---student: New For Each(false,false,false,false,false)[bag] - scope-35 | | | Cast[int] - scope-21 | | | |---Project[bytearray][0] - scope-20 | | | Cast[chararray] - scope-24 | | | |---Project[bytearray][1] - scope-23 | | | Cast[chararray] - scope-27 | | | |---Project[bytearray][2] - scope-26 | | | Cast[chararray] - scope-30 | | | |---Project[bytearray][3] - scope-29 | | | Cast[chararray] - scope-33 | | | |---Project[bytearray][4] - scope-32 | |---student: Load(hdfs://localhost:9000/pig_data/student_data.txt:PigStorage( , )) - scope 19-------- Global sort: false ----------------
Apache Pig - Illustrate Operator
The illustrate operator gives you the step-by-step execution of a sequence of statements.
Syntax
Given below is the syntax of the illustrate operator.
grunt> illustrate Relation_name;
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
grunt> student = LOAD hdfs://localhost:9000/pig_data/student_data.txt USING PigStorage( , ) as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Now, let us illustrate the relation named student as shown below.
grunt> illustrate student;
Output
On executing the above statement, you will get the following output.
grunt> illustrate student; INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$M ap - Apases being processed per job phase (ApasName[pne,offset]): M: student[1,10] C: R: --------------------------------------------------------------------------------------------- |student | id:int | firstname:chararray | lastname:chararray | phone:chararray | city:chararray | --------------------------------------------------------------------------------------------- | | 002 | siddarth | Battacharya | 9848022338 | Kolkata | ---------------------------------------------------------------------------------------------
Apache Pig - Group Operator
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation name student_details as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
Verification
Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples, student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hydera bad)}) (22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,984802233 8,Kolkata)}) (23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)}) (24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)})
You can see the schema of the table after grouping the data using the describe command as shown below.
grunt> Describe group_data; group_data: {group: int,student_details: {(id: int,firstname: chararray, lastname: chararray,age: int,phone: chararray,city: chararray)}}
In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.
$ Illustrate group_data;
It will produce the following output −
------------------------------------------------------------------------------------------------- |group_data| group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}| ------------------------------------------------------------------------------------------------- | | 21 | { 4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}| | | 2 | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(003,Rajesh,Khanna,22,9848022339,Delhi)}| -------------------------------------------------------------------------------------------------
Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump operator as shown below.
grunt> Dump group_multiple; ((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)}) ((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)}) ((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)}) ((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)}) ((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)}) ((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)}) ((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)}) (24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all; (all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334 ,trivendram), (6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuw aneshwar), (4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyd erabad)})
Apache Pig - Cogroup Operator
The COGROUP operator works more or less in the same way as the
operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
employee_details.txt
001,Robin,22,newyork 002,BOB,23,Kolkata 003,Maya,23,Tokyo 004,Sara,25,London 005,David,23,Bhuwaneshwar 006,Maggy,22,Chennai
And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); grunt> employee_details = LOAD hdfs://localhost:9000/pig_data/employee_details.txt USING PigStorage( , ) as (id:int, name:chararray, age:int, city:chararray);
Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Verification
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
Output
It will produce the following output, displaying the contents of the relation named cogroup_data as shown below.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)}, { }) (22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) }, { (6,Maggy,22,Chennai),(1,Robin,22,newyork) }) (23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)}, {(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)}) (24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)}, { }) (25,{ }, {(4,Sara,25,London)})
The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −
the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.
Apache Pig - Join Operator
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00 2,Khilan,25,Delhi,1500.00 3,kaushik,23,Kota,2000.00 4,Chaitap,25,Mumbai,6500.00 5,Hardik,27,Bhopal,8500.00 6,Komal,22,MP,4500.00 7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000 100,2009-10-08 00:00:00,3,1500 101,2009-11-20 00:00:00,2,1560 103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown below.
grunt> customers = LOAD hdfs://localhost:9000/pig_data/customers.txt USING PigStorage( , ) as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> orders = LOAD hdfs://localhost:9000/pig_data/orders.txt USING PigStorage( , ) as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now perform various Join operations on these two relations.
Self - join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different apases (names). Therefore let us load the contents of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD hdfs://localhost:9000/pig_data/customers.txt USING PigStorage( , ) as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> customers2 = LOAD hdfs://localhost:9000/pig_data/customers.txt USING PigStorage( , ) as (id:int, name:chararray, age:int, address:chararray, salary:int);
Syntax
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Example
Let us perform self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Verification
Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
Output
It will produce the following output, displaying the contents of the relation customers.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000) (2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500) (3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000) (4,Chaitap,25,Mumbai,6500,4,Chaitap,25,Mumbai,6500) (5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500) (6,Komal,22,MP,4500,6,Komal,22,MP,4500) (7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.
Syntax
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two relations customers and orders as shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
Output
You will get the following output that will the contents of the relation named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitap,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Note −
Outer Join: Unpke inner join, outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −
Left outer join
Right outer join
Full outer join
Left Outer Join
The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,) (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitap,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) (5,Hardik,27,Bhopal,8500,,,,) (6,Komal,22,MP,4500,,,,) (7,Muffy,24,Indore,10000,,,,)
Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example
Let us perform right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verification
Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right
Output
It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitap,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join
The full outer join operation returns rows when there is a match in one of the relations.
Syntax
Given below is the syntax of performing full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Example
Let us perform full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verification
Verify the relation outer_full using the DUMP operator as shown below.
grun> Dump outer_full;
Output
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,) (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (4,Chaitap,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) (5,Hardik,27,Bhopal,8500,,,,) (6,Komal,22,MP,4500,,,,) (7,Muffy,24,Indore,10000,,,,)
Using Multiple Keys
We can perform JOIN operation using multiple keys.
Syntax
Here is how you can perform a JOIN operation on two tables using multiple keys.
grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);
Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown below.
employee.txt
001,Rajiv,Reddy,21,programmer,003 002,siddarth,Battacharya,22,programmer,003 003,Rajesh,Khanna,22,programmer,003 004,Preethi,Agarwal,21,programmer,003 005,Trupthi,Mohanthy,23,programmer,003 006,Archana,Mishra,23,programmer,003 007,Komal,Nayak,24,teamlead,002 008,Bharathi,Nambiayar,24,manager,001
employee_contact.txt
001,9848022337,Rajiv@gmail.com,Hyderabad,003 002,9848022338,siddarth@gmail.com,Kolkata,003 003,9848022339,Rajesh@gmail.com,Delhi,003 004,9848022330,Preethi@gmail.com,Pune,003 005,9848022336,Trupthi@gmail.com,Bhuwaneshwar,003 006,9848022335,Archana@gmail.com,Chennai,003 007,9848022334,Komal@gmail.com,trivendram,002 008,9848022333,Bharathi@gmail.com,Chennai,001
And we have loaded these two files into Pig with relations employee and employee_contact as shown below.
grunt> employee = LOAD hdfs://localhost:9000/pig_data/employee.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int); grunt> employee_contact = LOAD hdfs://localhost:9000/pig_data/employee_contact.txt USING PigStorage( , ) as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
Now, let us join the contents of these two relations using the JOIN operator as shown below.
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
Verification
Verify the relation emp using the DUMP operator as shown below.
grunt> Dump emp;
Output
It will produce the following output, displaying the contents of the relation named emp as shown below.
(1,Rajiv,Reddy,21,programmer,113,1,9848022337,Rajiv@gmail.com,Hyderabad,113) (2,siddarth,Battacharya,22,programmer,113,2,9848022338,siddarth@gmail.com,Kolka ta,113) (3,Rajesh,Khanna,22,programmer,113,3,9848022339,Rajesh@gmail.com,Delhi,113) (4,Preethi,Agarwal,21,programmer,113,4,9848022330,Preethi@gmail.com,Pune,113) (5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,Trupthi@gmail.com,Bhuwaneshw ar,113) (6,Archana,Mishra,23,programmer,113,6,9848022335,Archana@gmail.com,Chennai,113) (7,Komal,Nayak,24,teamlead,112,7,9848022334,Komal@gmail.com,trivendram,112) (8,Bharathi,Nambiayar,24,manager,111,8,9848022333,Bharathi@gmail.com,Chennai,111)
Apache Pig - Cross Operator
The CROSS operator computes the cross-product of two or more relations. This chapter explains with example how to use the cross operator in Pig Latin.
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example
Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00 2,Khilan,25,Delhi,1500.00 3,kaushik,23,Kota,2000.00 4,Chaitap,25,Mumbai,6500.00 5,Hardik,27,Bhopal,8500.00 6,Komal,22,MP,4500.00 7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000 100,2009-10-08 00:00:00,3,1500 101,2009-11-20 00:00:00,2,1560 103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown below.
grunt> customers = LOAD hdfs://localhost:9000/pig_data/customers.txt USING PigStorage( , ) as (id:int, name:chararray, age:int, address:chararray, salary:int); grunt> orders = LOAD hdfs://localhost:9000/pig_data/orders.txt USING PigStorage( , ) as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now get the cross-product of these two relations using the cross operator on these two relations as shown below.
grunt> cross_data = CROSS customers, orders;
Verification
Verify the relation cross_data using the DUMP operator as shown below.
grunt> Dump cross_data;
Output
It will produce the following output, displaying the contents of the relation cross_data.
(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060) (7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560) (7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500) (7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000) (6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060) (6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560) (6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500) (6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000) (5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060) (5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560) (5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500) (5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000) (4,Chaitap,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060) (4,Chaitap,25,Mumbai,6500,101,2009-20 00:00:00,4,2060) (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500) (2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000) (1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060) (1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560) (1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500) (1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)-11-20 00:00:00,2,1560) (4,Chaitap,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500) (4,Chaitap,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000) (3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060) (3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560) (3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500) (3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000) (2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060) (2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560) (2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500) (2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000) (1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060) (1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560) (1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500) (1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
Apache Pig - Union Operator
The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.
Syntax
Given below is the syntax of the UNION operator.
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example
Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.
Student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai.
Student_data2.txt
7,Komal,Nayak,9848022334,trivendram. 8,Bharathi,Nambiayar,9848022333,Chennai.
And we have loaded these two files into Pig with the relations student1 and student2 as shown below.
grunt> student1 = LOAD hdfs://localhost:9000/pig_data/student_data1.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); grunt> student2 = LOAD hdfs://localhost:9000/pig_data/student_data2.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now merge the contents of these two relations using the UNION operator as shown below.
grunt> student = UNION student1, student2;
Verification
Verify the relation student using the DUMP operator as shown below.
grunt> Dump student;
Output
It will display the following output, displaying the contents of the relation student.
(1,Rajiv,Reddy,9848022337,Hyderabad) (2,siddarth,Battacharya,9848022338,Kolkata) (3,Rajesh,Khanna,9848022339,Delhi) (4,Preethi,Agarwal,9848022330,Pune) (5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar) (6,Archana,Mishra,9848022335,Chennai) (7,Komal,Nayak,9848022334,trivendram) (8,Bharathi,Nambiayar,9848022333,Chennai)
Apache Pig - Sppt Operator
The SPLIT operator is used to sppt a relation into two or more relations.
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation2_name (condition2),
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now sppt the relation into two, one psting the employees of age less than 23, and the other psting the employees having the age between 22 and 25.
SPLIT student_details into student_details1 if age<23, student_details2 if (22<age and age>25);
Verification
Verify the relations student_details1 and student_details2 using the DUMP operator as shown below.
grunt> Dump student_details1; grunt> Dump student_details2;
Output
It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.
grunt> Dump student_details1; (1,Rajiv,Reddy,21,9848022337,Hyderabad) (2,siddarth,Battacharya,22,9848022338,Kolkata) (3,Rajesh,Khanna,22,9848022339,Delhi) (4,Preethi,Agarwal,21,9848022330,Pune) grunt> Dump student_details2; (5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) (6,Archana,Mishra,23,9848022335,Chennai) (7,Komal,Nayak,24,9848022334,trivendram) (8,Bharathi,Nambiayar,24,9848022333,Chennai)
Apache Pig - Filter Operator
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city Chennai.
filter_data = FILTER student_details BY city == Chennai ;
Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
Output
It will produce the following output, displaying the contents of the relation filter_data as follows.
(6,Archana,Mishra,23,9848022335,Chennai) (8,Bharathi,Nambiayar,24,9848022333,Chennai)
Apache Pig - Distinct Operator
The DISTINCT operator is used to remove redundant (duppcate) tuples from a relation.
Syntax
Given below is the syntax of the DISTINCT operator.
grunt> Relation_name2 = DISTINCT Relatin_name1;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad 002,siddarth,Battacharya,9848022338,Kolkata 002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi 003,Rajesh,Khanna,9848022339,Delhi 004,Preethi,Agarwal,9848022330,Pune 005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar 006,Archana,Mishra,9848022335,Chennai 006,Archana,Mishra,9848022335,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now remove the redundant (duppcate) tuples from the relation named student_details using the DISTINCT operator, and store it as another relation named distinct_data as shown below.
grunt> distinct_data = DISTINCT student_details;
Verification
Verify the relation distinct_data using the DUMP operator as shown below.
grunt> Dump distinct_data;
Output
It will produce the following output, displaying the contents of the relation distinct_data as follows.
(1,Rajiv,Reddy,9848022337,Hyderabad) (2,siddarth,Battacharya,9848022338,Kolkata) (3,Rajesh,Khanna,9848022339,Delhi) (4,Preethi,Agarwal,9848022330,Pune) (5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar) (6,Archana,Mishra,9848022335,Chennai)
Apache Pig - Foreach Operator
The FOREACH operator is used to generate specified data transformations based on the column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relatin_name1 GENERATE (required data);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Let us now get the id, age, and city values of each student from the relation student_details and store it into another relation named foreach_data using the foreach operator as shown below.
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
Verification
Verify the relation foreach_data using the DUMP operator as shown below.
grunt> Dump foreach_data;
Output
It will produce the following output, displaying the contents of the relation foreach_data.
(1,21,Hyderabad) (2,22,Kolkata) (3,22,Delhi) (4,21,Pune) (5,23,Bhuwaneshwar) (6,23,Chennai) (7,24,trivendram) (8,24,Chennai)
Apache Pig - Order By
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relatin_name1 BY (ASC|DESC);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Let us now sort the relation in a descending order based on the age of the student and store it into another relation named order_by_data using the ORDER BY operator as shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
Verification
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;
Output
It will produce the following output, displaying the contents of the relation order_by_data.
(8,Bharathi,Nambiayar,24,9848022333,Chennai) (7,Komal,Nayak,24,9848022334,trivendram) (6,Archana,Mishra,23,9848022335,Chennai) (5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) (3,Rajesh,Khanna,22,9848022339,Delhi) (2,siddarth,Battacharya,22,9848022338,Kolkata) (4,Preethi,Agarwal,21,9848022330,Pune) (1,Rajiv,Reddy,21,9848022337,Hyderabad)
Apache Pig - Limit Operator
The LIMIT operator is used to get a pmited number of tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
Now, let’s sort the relation in descending order based on the age of the student and store it into another relation named pmit_data using the ORDER BY operator as shown below.
grunt> pmit_data = LIMIT student_details 4;
Verification
Verify the relation pmit_data using the DUMP operator as shown below.
grunt> Dump pmit_data;
Output
It will produce the following output, displaying the contents of the relation pmit_data as follows.
(1,Rajiv,Reddy,21,9848022337,Hyderabad) (2,siddarth,Battacharya,22,9848022338,Kolkata) (3,Rajesh,Khanna,22,9848022339,Delhi) (4,Preethi,Agarwal,21,9848022330,Pune)
Apache Pig - Eval Functions
Apache Pig provides various built-in functions namely eval, load, store, math, string, bag and tuple functions.
Eval Functions
Given below is the pst of eval functions provided by Apache Pig.
S.N. | Function & Description |
---|---|
1 | To compute the average of the numerical values within a bag. |
2 | To concatenate the elements of a bag into a string. While concatenating, we can place a depmiter between these values (optional). |
3 | To concatenate two or more expressions of same type. |
4 | To get the number of elements in a bag, while counting the number of tuples in a bag. |
5 | It is similar to the COUNT() function. It is used to get the number of elements in a bag. |
6 | To compare two bags (fields) in a tuple. |
7 | To check if a bag or map is empty. |
8 | To calculate the highest value for a column (numeric values or chararrays) in a single-column bag. |
9 | To get the minimum (lowest) value (numeric or chararray) for a certain column in a single-column bag. |
10 | Using the Pig Latin PluckTuple() function, we can define a string Prefix and filter the columns in a relation that begin with the given prefix. |
11 | To compute the number of elements based on any Pig data type. |
12 | To subtract two bags. It takes two bags as inputs and returns a bag which contains the tuples of the first bag that are not in the second bag. |
13 | To get the total of the numeric values of a column in a single-column bag. |
14 | To sppt a string (which contains a group of words) in a single tuple and return a bag which contains the output of the sppt operation. |
Apache Pig - Load & Store Functions
The Load and Store functions in Apache Pig are used to determine how the data goes ad comes out of Pig. These functions are used with the load and store operators. Given below is the pst of load and store functions available in Pig.
S.N. | Function & Description |
---|---|
1 | To load and store structured files. |
2 | To load unstructured data into Pig. |
3 | To load and store data into Pig using machine readable format. |
4 | In Pig Latin, we can load and store compressed data. |
Apache Pig - Bag & Tuple Functions
Given below is the pst of Bag and Tuple functions.
S.N. | Function & Description |
---|---|
1 | To convert two or more expressions into a bag. |
2 | To get the top N tuples of a relation. |
3 | To convert one or more expressions into a tuple. |
4 | To convert the key-value pairs into a Map. |
Apache Pig - String Functions
We have the following String functions in Apache Pig.
S.N. | Functions & Description |
---|---|
1 | To verify whether a given string ends with a particular substring. |
2 | Accepts two string parameters and verifies whether the first string starts with the second. |
3 | Returns a substring from a given string. |
4 | To compare two stings ignoring the case. |
5 | Returns the first occurrence of a character in a string, searching forward from a start index. |
6 | Returns the index of the last occurrence of a character in a string, searching backward from a start index. |
7 | Converts the first character in a string to lower case. |
8 | Returns a string with the first character converted to upper case. |
9 | UPPER(expression) Returns a string converted to upper case. |
10 | Converts all characters in a string to lower case. |
11 | To replace existing characters in a string with new characters. |
12 | To sppt a string around matches of a given regular expression. |
13 | Similar to the STRSPLIT() function, it sppts the string by given depmiter and returns the result in a bag. |
14 | Returns a copy of a string with leading and traipng whitespaces removed. |
15 | Returns a copy of a string with leading whitespaces removed. |
16 | Returns a copy of a string with traipng whitespaces removed. |
Apache Pig - Date-time Functions
Apache Pig provides the following Date and Time functions −
S.N. | Functions & Description |
---|---|
1 | This function returns a date-time object according to the given parameters. The other alternative for this function are ToDate(iosstring), ToDate(userstring, format), ToDate(userstring, format, timezone) |
2 | returns the date-time object of the current time. |
3 | Returns the day of a month from the date-time object. |
4 | Returns the hour of a day from the date-time object. |
5 | Returns the milpsecond of a second from the date-time object. |
6 | Returns the minute of an hour from the date-time object. |
7 | Returns the month of a year from the date-time object. |
8 | Returns the second of a minute from the date-time object. |
9 | Returns the week of a year from the date-time object. |
10 | Returns the week year from the date-time object. |
11 | Returns the year from the date-time object. |
12 | Returns the result of a date-time object along with the duration object. |
13 | Subtracts the Duration object from the Date-Time object and returns the result. |
14 | Returns the number of days between the two date-time objects. |
15 | Returns the number of hours between two date-time objects. |
16 | Returns the number of milpseconds between two date-time objects. |
17 | Returns the number of minutes between two date-time objects. |
18 | Returns the number of months between two date-time objects. |
19 | Returns the number of seconds between two date-time objects. |
20 | Returns the number of weeks between two date-time objects. |
21 | Returns the number of years between two date-time objects. |
Apache Pig - Math Functions
We have the following Math functions in Apache Pig −
S.N. | Functions & Description |
---|---|
1 | To get the absolute value of an expression. |
2 | To get the arc cosine of an expression. |
3 | To get the arc sine of an expression. |
4 | This function is used to get the arc tangent of an expression. |
5 | This function is used to get the cube root of an expression. |
6 | This function is used to get the value of an expression rounded up to the nearest integer. |
7 | This function is used to get the trigonometric cosine of an expression. |
8 | This function is used to get the hyperbopc cosine of an expression. |
9 | This function is used to get the Euler’s number e raised to the power of x. |
10 | To get the value of an expression rounded down to the nearest integer. |
11 | To get the natural logarithm (base e) of an expression. |
12 | To get the base 10 logarithm of an expression. |
13 | To get a pseudo random number (type double) greater than or equal to 0.0 and less than 1.0. |
14 | To get the value of an expression rounded to an integer (if the result type is float) or rounded to a long (if the result type is double). |
15 | To get the sine of an expression. |
16 | To get the hyperbopc sine of an expression. |
17 | To get the positive square root of an expression. |
18 | To get the trigonometric tangent of an angle. |
19 | To get the hyperbopc tangent of an expression. |
Apache Pig - User Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDF’s). Using these UDF’s, we can define our own functions and use them. The UDF support is provided in six programming languages, namely, Java, Jython, Python, JavaScript, Ruby and Groovy.
For writing UDF’s, complete support is provided in Java and pmited support is provided in all the remaining languages. Using Java, you can write UDF’s involving all parts of the processing pke data load/store, column transformation, and aggregation. Since Apache Pig has been written in Java, the UDF’s written using Java language work efficiently compared to other languages.
In Apache Pig, we also have a Java repository for UDF’s named Piggybank. Using Piggybank, we can access Java UDF’s written by other users, and contribute our own UDF’s.
Types of UDF’s in Java
While writing UDF’s using Java, we can create and use the following three types of functions −
Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.
Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.
Algebraic Functions − The Algebraic functions act on inner bags in a FOREACHGENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
Writing UDF’s using Java
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss how to write a sample UDF using Ecppse. Before proceeding further, make sure you have installed Ecppse and Maven in your system.
Follow the steps given below to write a UDF function −
Open Ecppse and create a new project (say myproject).
Convert the newly created project into a Maven project.
Copy the following content in the pom.xml. This file contains the Maven dependencies for Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0" xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0http://maven.apache .org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>Pig_Udf</groupId> <artifactId>Pig_Udf</artifactId> <version>0.0.1-SNAPSHOT</version> <build> <sourceDirectory>src</sourceDirectory> <plugins> <plugin> <artifactId>maven-compiler-plugin</artifactId> <version>3.3</version> <configuration> <source>1.7</source> <target>1.7</target> </configuration> </plugin> </plugins> </build> <dependencies> <dependency> <groupId>org.apache.pig</groupId> <artifactId>pig</artifactId> <version>0.15.0</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>0.20.2</version> </dependency> </dependencies> </project>
Save the file and refresh it. In the Maven Dependencies section, you can find the downloaded jar files.
Create a new class file with name Sample_Eval and copy the following content in it.
import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; pubpc class Sample_Eval extends EvalFunc<String>{ pubpc String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; String str = (String)input.get(0); return str.toUpperCase(); } }
While writing UDF’s, it is mandatory to inherit the EvalFunc class and provide implementation to exec() function. Within this function, the code required for the UDF is written. In the above example, we have return the code to convert the contents of the given column to uppercase.
After compipng the class without errors, right-cpck on the Sample_Eval.java file. It gives you a menu. Select export as shown in the following screenshot.
On cpcking export, you will get the following window. Cpck on JAR file.
Proceed further by cpcking Next> button. You will get another window where you need to enter the path in the local file system, where you need to store the jar file.
Finally cpck the Finish button. In the specified folder, a Jar file sample_udf.jar is created. This jar file contains the UDF written in Java.
Using the UDF
After writing the UDF and generating the Jar file, follow the steps given below −
Step 1: Registering the Jar file
After writing UDF (in Java) we have to register the Jar file that contain the UDF using the Register operator. By registering the Jar file, users can intimate the location of the UDF to Apache Pig.
Syntax
Given below is the syntax of the Register operator.
REGISTER path;
Example
As an example let us register the sample_udf.jar created earper in this chapter.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$cd PIG_HOME/bin $./pig –x local REGISTER /$PIG_HOME/sample_udf.jar
Note − assume the Jar file in the path − /$PIG_HOME/sample_udf.jar
Step 2: Defining Apas
After registering the UDF we can define an apas to it using the Define operator.
Syntax
Given below is the syntax of the Define operator.
DEFINE apas {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
Example
Define the apas for sample_eval as shown below.
DEFINE sample_eval sample_eval();
Step 3: Using the UDF
After defining the apas you can use the UDF same as the built-in functions. Suppose there is a file named emp_data in the HDFS /Pig_Data/ directory with the following content.
001,Robin,22,newyork 002,BOB,23,Kolkata 003,Maya,23,Tokyo 004,Sara,25,London 005,David,23,Bhuwaneshwar 006,Maggy,22,Chennai 007,Robert,22,newyork 008,Syam,23,Kolkata 009,Mary,25,Tokyo 010,Saran,25,London 011,Stacy,25,Bhuwaneshwar 012,Kelly,22,Chennai
And assume we have loaded this file into Pig as shown below.
grunt> emp_data = LOAD hdfs://localhost:9000/pig_data/emp1.txt USING PigStorage( , ) as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees in to upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
grunt> Dump Upper_case; (ROBIN) (BOB) (MAYA) (SARA) (DAVID) (MAGGY) (ROBERT) (SYAM) (MARY) (SARAN) (STACY) (KELLY)
Apache Pig - Running Scripts
Here in this chapter, we will see how how to run Apache Pig scripts in batch mode.
Comments in Pig Script
While writing a script in a file, we can include comments in it as shown below.
Multi-pne comments
We will begin the multi-pne comments with /* , end them with */ .
/* These are the multi-pne comments In the pig script */
Single –pne comments
We will begin the single-pne comments with -- .
--we can write single pne comments pke this.
Executing Pig Script in Batch mode
While executing Apache Pig statements in batch mode, follow the steps given below.
Step 1
Write all the required Pig Latin statements in a single file. We can write all the Pig Latin statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown below.
Local mode | MapReduce mode |
---|---|
$ pig -x local Sample_script.pig | $ pig -x mapreduce Sample_script.pig |
You can execute it from the Grunt shell as well using the exec command as shown below.
grunt> exec /sample_script.pig
Executing a Pig Script from HDFS
We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
Example
Assume we have a file student_details.txt in HDFS with the following content.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad 002,siddarth,Battacharya,22,9848022338,Kolkata 003,Rajesh,Khanna,22,9848022339,Delhi 004,Preethi,Agarwal,21,9848022330,Pune 005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar 006,Archana,Mishra,23,9848022335,Chennai 007,Komal,Nayak,24,9848022334,trivendram 008,Bharathi,Nambiayar,24,9848022333,Chennai
We also have a sample script with the name sample_script.pig, in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.
student = LOAD hdfs://localhost:9000/pig_data/student_details.txt USING PigStorage( , ) as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray); student_order = ORDER student BY age DESC; student_pmit = LIMIT student_order 4; Dump student_pmit;
The first statement of the script will load the data in the file named student_details.txt as a relation named student.
The second statement of the script will arrange the tuples of the relation in descending order, based on age, and store it as student_order.
The third statement of the script will store the first 4 tuples of student_order as student_pmit.
Finally the fourth statement will dump the content of the relation student_pmit.
Let us now execute the sample_script.pig as shown below.
$./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig
Apache Pig gets executed and gives you the output with the following content.
(7,Komal,Nayak,24,9848022334,trivendram) (8,Bharathi,Nambiayar,24,9848022333,Chennai) (5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar) (6,Archana,Mishra,23,9848022335,Chennai) 2015-10-19 10:31:27,446 [main] INFO org.apache.pig.Main - Pig script completed in 12 minutes, 32 seconds and 751 milpseconds (752751 ms)Advertisements