Talend - Map Reduce
  • Date: 2024-12-22


In the previous chapter, we saw how Talend works with Big Data. In this chapter, let us understand how to use MapReduce with Talend.

Creating a Talend MapReduce Job

Let us learn how to run a MapReduce job on Talend. Here we will run a MapReduce word count example.

For this purpose, right-click Job Design and create a new job – MapreduceJob. Mention the details of the job and click Finish.

Map Reduce Job

Adding Components to MapReduce Job

To add components to a MapReduce job, drag and drop five Talend components – tHDFSInput, tNormalize, tAggregateRow, tMap and tHDFSOutput – from the palette to the designer window. Right-click tHDFSInput and create a main link to tNormalize.

Right-click tNormalize and create a main link to tAggregateRow. Then, right-click tAggregateRow and create a main link to tMap. Now, right-click tMap and create a main link to tHDFSOutput.

Adding Components Map Reduce

Configuring Components and Transformations

In tHDFSInput, select the distribution Cloudera and its version. Note that the Namenode URI should be “hdfs://quickstart.cloudera:8020” and the username should be “cloudera”. In the File name option, give the path of the input file for the MapReduce job. Ensure that this input file is present on HDFS.

Now, select the file type, row separator, field separator and header according to your input file.

Transformations

Click Edit schema and add the field “line” of String type.

String Type

In tNormalize, the column to normalize will be line and the item separator will be whitespace -> “ ”. Now, click Edit schema. tNormalize will have a line column and tAggregateRow will have two columns, word and wordcount, as shown below.
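To picture what this normalize step does, here is a minimal plain-Java sketch (the class and method names are hypothetical, not Talend-generated code): splitting one line on whitespace yields one word row per token, which is what tNormalize emits downstream.

```java
import java.util.Arrays;
import java.util.List;

public class NormalizeSketch {
    // Splits one input line into word rows, mirroring tNormalize with
    // the "line" column and a whitespace item separator (illustrative only).
    static List<String> normalize(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(normalize("hello talend hello"));
        // prints [hello, talend, hello]
    }
}
```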

Normalize Aggregate Row

In tAggregateRow, put word as the output column in the Group by option. In Operations, put wordcount as the output column, count as the function, and line as the Input column position.
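The grouping and counting that tAggregateRow performs can be sketched in plain Java as a map from each word to its occurrence count (class and method names are hypothetical, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AggregateSketch {
    // Groups word rows and counts occurrences, mirroring tAggregateRow's
    // "Group by word" with a count() operation into wordcount (illustrative only).
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(new String[]{"hello", "talend", "hello"}));
        // prints {hello=2, talend=1}
    }
}
```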

Word Count

Now, double-click the tMap component to enter the map editor and map the input to the required output. In this example, word is mapped to word and wordcount is mapped to wordcount. In the expression column, click on […] to enter the expression builder.

Now, select StringHandling from the Category list and the UPCASE function. Edit the expression to “StringHandling.UPCASE(row3.word)” and click Ok. Keep row3.wordcount in the expression column corresponding to wordcount as shown below.
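Talend's StringHandling.UPCASE routine simply upper-cases its argument; a plain-Java equivalent of this tMap expression could look like the sketch below (the class and method names are hypothetical and this is not the code Talend generates):

```java
public class UpcaseSketch {
    // Plain-Java equivalent of the tMap expression
    // StringHandling.UPCASE(row3.word): upper-case the word column.
    static String upcase(String word) {
        return word.toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(upcase("talend"));
        // prints TALEND
    }
}
```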

String Handling

In tHDFSOutput, connect to the Hadoop cluster we created by setting the property type as Repository. Observe that the fields get auto-populated. In File name, give the output path where you want to store the output. Keep the Action, row separator and field separator as shown below.

Field Separator

Executing the MapReduce Job

Once your configuration is completed successfully, click Run and execute your MapReduce job.

Configuration Success

Go to your HDFS path and check the output. Note that all the words will be in uppercase along with their wordcount.

HDFS Path