Talend - Map Reduce
In the previous chapter, we saw how Talend works with Big Data. In this chapter, let us understand how to use MapReduce with Talend.
Creating a Talend MapReduce Job
Let us learn how to run a MapReduce job on Talend. Here we will run a MapReduce word count example.
For this purpose, right-click Job Design and create a new job – MapreduceJob. Mention the details of the job and click Finish.
Adding Components to MapReduce Job
To add components to a MapReduce job, drag and drop five Talend components – tHDFSInput, tNormalize, tAggregateRow, tMap and tHDFSOutput – from the palette to the designer window. Right-click on tHDFSInput and create a Main link to tNormalize.
Right-click tNormalize and create a Main link to tAggregateRow. Then, right-click on tAggregateRow and create a Main link to tMap. Now, right-click on tMap and create a Main link to tHDFSOutput.
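The resulting flow in the designer window is:
tHDFSInput --(Main)--> tNormalize --(Main)--> tAggregateRow --(Main)--> tMap --(Main)--> tHDFSOutput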
Configuring Components and Transformations
In tHDFSInput, select the distribution Cloudera and its version. Note that the Namenode URI should be “hdfs://quickstart.cloudera:8020” and the username should be “cloudera”. In the file name option, give the path of your input file to the MapReduce job. Ensure that this input file is present on HDFS.
Now, select the file type, row separator, field separator and header according to your input file.
Click edit schema and add the field “line” as String type.
In tNormalize, the column to normalize will be line and the Item separator will be whitespace -> “ “. Now, click edit schema. tNormalize will have the line column and tAggregateRow will have 2 columns, word and wordcount, as shown below.
In tAggregateRow, put word as the output column in the Group by option. In Operations, put wordcount as the output column, the function as count and the Input column position as line.
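Taken together, tNormalize and tAggregateRow turn each input line into per-word counts. The following plain Java sketch mirrors that logic outside the Studio – it is not Talend-generated code, and the class name and sample lines are purely illustrative:

import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    public static void main(String[] args) {
        // Hypothetical input lines, standing in for the rows read by tHDFSInput
        String[] lines = { "Hello Talend", "Hello MapReduce" };

        Map<String, Integer> wordCount = new HashMap<>();
        for (String line : lines) {
            // tNormalize: split the "line" column on whitespace, one word per output row
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    // tAggregateRow: group by word and apply the count function
                    wordCount.merge(word, 1, Integer::sum);
                }
            }
        }
        wordCount.forEach((w, c) -> System.out.println(w + " " + c));
    }
}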
Now, double-click the tMap component to enter the map editor and map the input with the required output. In this example, word is mapped to word and wordcount is mapped to wordcount. In the expression column, click on […] to enter the expression builder.
Now, select StringHandling from the category list and the UPCASE function. Edit the expression to “StringHandling.UPCASE(row3.word)” and click Ok. Keep row3.wordcount in the expression column corresponding to wordcount as shown below.
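StringHandling is one of Talend's built-in routines, and UPCASE simply returns the upper-cased string. As a rough Java equivalent of the two tMap expressions (the values below are illustrative, standing in for the row3 columns):

public class UpcaseSketch {
    public static void main(String[] args) {
        // Illustrative values standing in for row3.word and row3.wordcount
        String word = "talend";
        int wordcount = 2;

        // StringHandling.UPCASE(row3.word) behaves like String.toUpperCase()
        String outWord = word.toUpperCase();
        System.out.println(outWord + " " + wordcount); // prints: TALEND 2
    }
}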
In tHDFSOutput, connect to the Hadoop cluster we created by setting the Property Type to Repository. Observe that the fields will get auto-populated. In File name, give the output path where you want to store the output. Keep the Action, row separator and field separator as shown below.
Executing the MapReduce Job
Once your configuration is successfully completed, click Run and execute your MapReduce job.
Go to your HDFS path and check the output. Note that all the words will be in uppercase with their wordcount.
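For instance, if the input file contained the two hypothetical lines “Hello Talend” and “Hello MapReduce”, the output part file would hold records along these lines (the exact layout depends on the field separator configured in tHDFSOutput):

HELLO 2
TALEND 1
MAPREDUCE 1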