Apache Pig - Cogroup Operator-alljchome-开发者的教程家园

Apache Pig Tutorial

Apache Pig - Home

Apache Pig Introduction

Apache Pig Environment

Pig Latin

Pig Latin - Basics

Load & Store Operators

Diagnostic Operators

Grouping & Joining

Combining & Splitting

Filtering

Sorting

Pig Latin Built-In Functions

Other Modes Of Execution

Apache Pig Useful Resources

Selected Reading

Apache Pig - Cogroup Operator

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork 
002,BOB,23,Kolkata 
003,Maya,23,Tokyo 
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai

And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.

grunt> student_details = LOAD  hdfs://localhost:9000/pig_data/student_details.txt  USING PigStorage( , )
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 
  
grunt> employee_details = LOAD  hdfs://localhost:9000/pig_data/employee_details.txt  USING PigStorage( , )
   as (id:int, name:chararray, age:int, city:chararray);

Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Verification

Verify the relation cogroup_data using the DUMP operator as shown below.

grunt> Dump cogroup_data;

Output

It will produce the following output, displaying the contents of the relation named cogroup_data as shown below.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)}, 
   {    })  
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },  
   { (6,Maggy,22,Chennai),(1,Robin,22,newyork) })  
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)}, 
   {(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)}) 
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)}, 
   { })  
(25,{   }, 
   {(4,Sara,25,London)})

The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value.

For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −

the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and

the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.

In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.

Apache Pig - Cogroup Operator

Grouping Two Relations using Cogroup

Verification

Output

友情链接