Apache Pig - User Defined Functions
In addition to the built-in functions, Apache Pig provides extensive support for User Defined Functions (UDFs). Using UDFs, we can define our own functions and use them in Pig scripts. UDF support is provided in six programming languages, namely Java, Jython, Python, JavaScript, Ruby and Groovy.
Complete support for writing UDFs is provided in Java, and limited support is provided in the remaining languages. Using Java, you can write UDFs covering all parts of the processing, such as data load/store, column transformation, and aggregation. Since Apache Pig itself is written in Java, UDFs written in Java run more efficiently than those written in the other languages.
Apache Pig also has a Java repository for UDFs named Piggybank. Using Piggybank, we can access Java UDFs written by other users and contribute our own UDFs.
Types of UDFs in Java
While writing UDFs using Java, we can create and use the following three types of functions −
Filter Functions − The filter functions are used as conditions in filter statements. These functions accept a Pig value as input and return a Boolean value.
Eval Functions − The Eval functions are used in FOREACH-GENERATE statements. These functions accept a Pig value as input and return a Pig result.
Algebraic Functions − The Algebraic functions act on inner bags in a FOREACH-GENERATE statement. These functions are used to perform full MapReduce operations on an inner bag.
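To make the idea behind algebraic functions more concrete, the following is a minimal plain-Java sketch of a count computed in three stages. It does not use the Pig API: the class AlgebraicCountSketch and its methods are hypothetical names chosen for illustration, and a list of lists stands in for a bag that has been split across several map tasks. The three stages correspond to the initial, intermediate and final steps that Pig's Algebraic interface exposes through getInitial(), getIntermed() and getFinal().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the initial / intermediate / final split.
// None of these names come from the Pig API; they are hypothetical.
class AlgebraicCountSketch {

    // Initial step (map side): each input record contributes a count of 1.
    static long initial(Object record) {
        return 1L;
    }

    // Intermediate step (combiner): merge partial counts into one partial count.
    static long intermediate(List<Long> partialCounts) {
        long sum = 0L;
        for (long c : partialCounts) {
            sum += c;
        }
        return sum;
    }

    // Final step (reduce side): merge the remaining partial counts into the result.
    static long finalStep(List<Long> partialCounts) {
        return intermediate(partialCounts);
    }

    // Counts a bag that has been split into chunks, as if each chunk were
    // processed by a separate map task.
    static long count(List<List<String>> chunks) {
        List<Long> combined = new ArrayList<>();
        for (List<String> chunk : chunks) {
            List<Long> ones = new ArrayList<>();
            for (String record : chunk) {
                ones.add(initial(record));
            }
            combined.add(intermediate(ones));
        }
        return finalStep(combined);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("Robin", "BOB", "Maya"),
            Arrays.asList("Sara", "David"));
        System.out.println(count(chunks)); // prints 5
    }
}
```

Because partial counts can be merged in any grouping, the work can be spread over the map, combine and reduce phases, which is what makes a function a candidate for the algebraic form.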
Writing UDFs using Java
To write a UDF using Java, we have to add the jar file Pig-0.15.0.jar to the project. In this section, we discuss how to write a sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven on your system.
Follow the steps given below to write a UDF function −
Open Eclipse and create a new project (say myproject).
Convert the newly created project into a Maven project.
Copy the following content in the pom.xml. This file contains the Maven dependencies for Apache Pig and Hadoop-core jar files.
<project xmlns = "http://maven.apache.org/POM/4.0.0"
   xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation = "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

   <modelVersion>4.0.0</modelVersion>
   <groupId>Pig_Udf</groupId>
   <artifactId>Pig_Udf</artifactId>
   <version>0.0.1-SNAPSHOT</version>

   <build>
      <sourceDirectory>src</sourceDirectory>
      <plugins>
         <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.3</version>
            <configuration>
               <source>1.7</source>
               <target>1.7</target>
            </configuration>
         </plugin>
      </plugins>
   </build>

   <dependencies>
      <dependency>
         <groupId>org.apache.pig</groupId>
         <artifactId>pig</artifactId>
         <version>0.15.0</version>
      </dependency>
      <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-core</artifactId>
         <version>0.20.2</version>
      </dependency>
   </dependencies>
</project>
Save the file and refresh it. In the Maven Dependencies section, you can find the downloaded jar files.
Create a new class file with name Sample_Eval and copy the following content in it.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String> {
   // Called once per input tuple; returns the first field in upper case.
   public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
         return null;
      String str = (String)input.get(0);
      return str.toUpperCase();
   }
}
When writing an eval UDF, the class must extend the EvalFunc class and provide an implementation of the exec() function. The code required for the UDF is written within this function. In the above example, we have written the code to convert the contents of the given column to uppercase.
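The example above returns null for an empty tuple, but it calls toUpperCase() directly on the first field, so a null field would raise a NullPointerException. The following is a minimal sketch of a more defensive version of the same logic. To keep it runnable without the Pig jar on the classpath, it is plain Java: the class UpperCaseSketch and the method toUpper are hypothetical names, and a List<Object> stands in for Pig's Tuple.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in for the body of exec(); a List<Object> plays the role of Pig's Tuple.
class UpperCaseSketch {

    // Returns the first field in upper case, or null when there is nothing to convert.
    static String toUpper(List<Object> input) {
        if (input == null || input.isEmpty()) {
            return null;
        }
        Object field = input.get(0);
        if (field == null) {
            return null; // a missing value is passed through as null
        }
        return field.toString().toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(toUpper(Arrays.<Object>asList("Robin")));       // prints ROBIN
        System.out.println(toUpper(Arrays.<Object>asList((Object) null))); // prints null
        System.out.println(toUpper(Collections.<Object>emptyList()));      // prints null
    }
}
```

The same two checks could be carried over into exec() in Sample_Eval, assuming that returning null for a missing value is the desired behaviour.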
After compiling the class without errors, right-click on the Sample_Eval.java file and select Export from the menu that appears.
In the export window, click on JAR file.
Proceed by clicking the Next> button. In the next window, enter the path in the local file system where the jar file should be stored.
Finally, click the Finish button. A Jar file named sample_udf.jar is created in the specified folder. This jar file contains the UDF written in Java.
Using the UDF
After writing the UDF and generating the Jar file, follow the steps given below −
Step 1: Registering the Jar file
After writing the UDF (in Java), we have to register the Jar file that contains the UDF using the Register operator. Registering the Jar file tells Apache Pig where the UDF is located.
Syntax
Given below is the syntax of the Register operator.
REGISTER path;
Example
As an example, let us register the sample_udf.jar created earlier in this chapter.
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local

grunt> REGISTER /$PIG_HOME/sample_udf.jar
Note − This assumes the Jar file is located at the path /$PIG_HOME/sample_udf.jar
Step 2: Defining Alias
After registering the UDF, we can define an alias for it using the Define operator.
Syntax
Given below is the syntax of the Define operator.
DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
Example
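Define the alias sample_eval for the UDF class Sample_Eval as shown below. Function names are case sensitive, so the class name must match the one used in the Java source.
DEFINE sample_eval Sample_Eval();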
Step 3: Using the UDF
After defining the alias, you can use the UDF in the same way as the built-in functions. Suppose there is a file named emp1.txt in the HDFS /pig_data/ directory with the following content.
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London
011,Stacy,25,Bhuwaneshwar
012,Kelly,22,Chennai
And assume we have loaded this file into Pig as shown below.
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees into upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
grunt> Dump Upper_case;

(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
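As a rough picture of what happens to each record in the steps above, the following plain-Java sketch splits comma-delimited lines into fields and upper-cases the name column. It is not Pig code and does not reproduce how Pig executes the script: the class ForeachSketch and the method upperCaseNames are hypothetical names, and only the first three rows of the sample data are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java walk-through of the LOAD + FOREACH ... GENERATE steps; not Pig code.
class ForeachSketch {

    // Splits each comma-delimited line and upper-cases the name column (index 1).
    static List<String> upperCaseNames(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");   // id, name, age, city
            result.add(fields[1].toUpperCase());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "001,Robin,22,newyork",
            "002,BOB,23,Kolkata",
            "003,Maya,23,Tokyo");
        for (String name : upperCaseNames(lines)) {
            System.out.println("(" + name + ")");
        }
        // prints (ROBIN), (BOB) and (MAYA), one per line
    }
}
```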