HCatalog - Loader and Storer
  • Date: 2024-11-05


The HCatLoader and HCatStorer APIs are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.

It is better to have some knowledge on Apache Pig scripts to understand this chapter better. For further reference, please go through our Apache Pig tutorial.

HCatLoader

HCatLoader is used with Pig scripts to read data from HCatalog-managed tables. Use the following syntax to load data into a Pig relation using HCatLoader.

A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();

You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database, then you must specify your input as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is default and is not required when specifying the table for HCatLoader.
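As a minimal sketch of how a loaded table is used, assume a hypothetical table employee in a database mydb with columns name, salary, and a partition column dt. Since HCatLoader supplies the table's schema to Pig, columns can be referenced by name directly:

```pig
-- Load the table through HCatLoader, using the dbname.tablename form
-- (table and column names here are hypothetical).
A = LOAD 'mydb.employee' USING org.apache.hcatalog.pig.HCatLoader();

-- Columns can be referenced by name because HCatLoader provides the schema.
B = FILTER A BY dt == '20241105';
C = FOREACH B GENERATE name, salary;
DUMP C;
```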

The following table lists the important methods of the HCatLoader class, with descriptions.

Sr.No. Method Name & Description
1

public InputFormat<?,?> getInputFormat() throws IOException

Returns the input format of the data to be loaded.

2

public String relativeToAbsolutePath(String location, Path curDir) throws IOException

Converts the given location into an absolute path and returns it as a String.

3

public void setLocation(String location, Job job) throws IOException

Sets the location (table) from which the job reads its input.

4

public Tuple getNext() throws IOException

Returns the next tuple from the input.

HCatStorer

HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. Use the following syntax for the store operation.

A = LOAD ...
B = FOREACH A ...
...
...
my_processed_data = ...

STORE my_processed_data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();

You must specify the table name in single quotes: STORE ... INTO 'tablename'. Both the database and the table must be created prior to running your Pig script. If you are using a non-default database, then you must specify your output as 'dbname.tablename'.

The Hive metastore lets you create tables without specifying a database. If you created tables this way, then the database name is default and you do not need to specify the database name in the store statement.
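Because the output table must exist before the Pig script runs, it is typically created in the Hive shell first. A minimal sketch, using hypothetical names (database mydb, table employee_sorted):

```sql
-- Run in the Hive shell before executing the Pig script.
-- Database and table names here are hypothetical.
CREATE DATABASE IF NOT EXISTS mydb;

CREATE TABLE mydb.employee_sorted (
   id INT,
   name STRING,
   salary FLOAT
)
STORED AS RCFILE;
```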

For the USING clause, you can pass a string argument that represents key/value pairs for partitions. This argument is mandatory when you write to a partitioned table and the partition column is not among the output columns. The values for partition keys should NOT be quoted.
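The partition argument can be sketched as follows, assuming a hypothetical partitioned table mydb.daily_report with partition column dt:

```pig
-- The partition key/value pairs go in the USING clause argument string.
-- Note that the partition value 20241105 is NOT quoted inside that string.
STORE my_processed_data INTO 'mydb.daily_report'
   USING org.apache.hcatalog.pig.HCatStorer('dt=20241105');
```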

The following table lists the important methods of the HCatStorer class, with descriptions.

Sr.No. Method Name & Description
1

public OutputFormat getOutputFormat() throws IOException

Returns the output format for the stored data.

2

public void setStoreLocation(String location, Job job) throws IOException

Sets the location (table) to which this store operation writes.

3

public void storeSchema(ResourceSchema schema, String arg1, Job job) throws IOException

Stores the schema of the output data.

4

public void prepareToWrite(RecordWriter writer) throws IOException

Prepares to write data using the given RecordWriter.

5

public void putNext(Tuple tuple) throws IOException

Writes the given tuple to the output.

Running Pig with HCatalog

Pig does not automatically pick up HCatalog jars. To bring in the necessary jars, you can either use a flag in the Pig command or set the environment variables PIG_CLASSPATH and PIG_OPTS as described below.

To bring in the appropriate jars for working with HCatalog, simply include the following flag −

pig -useHCatalog <sample_pig_script_file>

Setting the CLASSPATH for Execution

Use the following CLASSPATH setting to make HCatalog available to Apache Pig.

export HADOOP_HOME=<path_to_hadoop_install>
export HIVE_HOME=<path_to_hive_install>
export HCAT_HOME=<path_to_hcat_install>

export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-core*.jar:\
$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter*.jar:\
$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:\
$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_HOME/lib/libfb303-*.jar:\
$HIVE_HOME/lib/jdo2-api-*-ec.jar:$HIVE_HOME/conf:$HADOOP_HOME/conf:\
$HIVE_HOME/lib/slf4j-api-*.jar

Example

Assume we have a file student_details.txt in HDFS with the following content.

student_details.txt

001, Rajiv,    Reddy,       21, 9848022337, Hyderabad
002, siddarth, Battacharya, 22, 9848022338, Kolkata
003, Rajesh,   Khanna,      22, 9848022339, Delhi
004, Preethi,  Agarwal,     21, 9848022330, Pune
005, Trupthi,  Mohanthy,    23, 9848022336, Bhuwaneshwar
006, Archana,  Mishra,      23, 9848022335, Chennai
007, Komal,    Nayak,       24, 9848022334, trivendram
008, Bharathi, Nambiayar,   24, 9848022333, Chennai

We also have a sample script with the name sample_script.pig, in the same HDFS directory. This file contains statements performing operations and transformations on the student relation, as shown below.

student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
   PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray,
   age:int, phone:chararray, city:chararray);

student_order = ORDER student BY age DESC;
STORE student_order INTO 'student_order_table' USING org.apache.hcatalog.pig.HCatStorer();
student_limit = LIMIT student_order 4;
DUMP student_limit;

    The first statement of the script loads the data in the file named student_details.txt into a relation named student.

    The second statement arranges the tuples of the relation in descending order, based on age, and stores the result as student_order.

    The third statement stores the processed student_order data into a separate table named student_order_table.

    The fourth statement stores the first four tuples of student_order as student_limit.

    Finally, the fifth statement dumps the content of the relation student_limit.

Let us now execute the sample_script.pig as shown below.

$ ./pig -useHCatalog hdfs://localhost:9000/pig_data/sample_script.pig

Now, check your output directory (hdfs: user/tmp/hive) for the output files (part_0000, part_0001).
