DynamoDB - MapReduce
  • Date: 2024-10-18


Amazon's Elastic MapReduce (EMR) allows you to quickly and efficiently process big data. EMR runs Apache Hadoop on EC2 instances, but simplifies the process. You use Apache Hive to query MapReduce job flows through HiveQL, a query language resembling SQL. Apache Hive serves as a way to optimize queries and your applications.

You can use the EMR tab of the management console, the EMR CLI, an API, or an SDK to launch a job flow. You also have the option to run Hive interactively or through a script.
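As a sketch of the CLI route, a Hive-enabled cluster can be launched with the AWS CLI; the cluster name, key pair name, release label, and instance settings below are illustrative placeholders, not values prescribed by this tutorial:

```shell
# Launch an EMR cluster with Hive installed (placeholder names/values).
aws emr create-cluster \
    --name "hive-dynamodb-demo" \
    --release-label emr-6.15.0 \
    --applications Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair
```

The command returns the new cluster's ID, which you can pass to later CLI calls such as describe-cluster.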

EMR read/write operations count against your table's throughput consumption; however, for large requests, EMR performs retries with the protection of a backoff algorithm. Also, running EMR jobs concurrently with other operations and tasks may result in throttling.
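Within a Hive session, you can cap the share of table throughput that EMR consumes. The property names below are standard settings of the EMR DynamoDB connector; the 0.5 values are only an illustration:

```sql
-- Let the EMR job use at most 50% of the table's provisioned
-- read and write capacity, leaving headroom for other traffic.
SET dynamodb.throughput.read.percent=0.5;
SET dynamodb.throughput.write.percent=0.5;
```

Lowering these percentages reduces the chance of throttling other workloads at the cost of a slower job.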

The DynamoDB/EMR integration does not support binary and binary set attributes.

DynamoDB/EMR Integration Prerequisites

Review this checklist of necessary items before using EMR −

    An AWS account

    A populated table under the same account employed in EMR operations

    A Hive version with DynamoDB connectivity support

    An S3 bucket (optional)

    An SSH client (optional)

    An EC2 key pair (optional)

Hive Setup

Before using EMR, create a key pair to run Hive in interactive mode. The key pair allows connection to EC2 instances and master nodes of job flows.

You can perform this by following the subsequent steps −

    Log in to the management console, and open the EC2 console located at https://console.aws.amazon.com/ec2/

    Select a region in the upper right-hand portion of the console. Ensure the region matches the DynamoDB region.

    In the Navigation pane, select Key Pairs.

    Select Create Key Pair.

    In the Key Pair Name field, enter a name and select Create.

    Download the resulting private key file, which uses the format filename.pem.

Note − You cannot connect to EC2 instances without the key pair.

Hive Cluster

Create a Hive-enabled cluster to run Hive. It builds the required environment of applications and infrastructure for a Hive-to-DynamoDB connection.

You can perform this task by using the following steps −

    Access the EMR console.

    Select Create Cluster.

    In the creation screen, set the cluster configuration: enter a descriptive name for Cluster name, select Yes for Termination protection, check Enabled for Logging, enter an S3 destination for Log folder S3 location, and select Enabled for Debugging.

    In the Software Configuration screen, ensure the fields hold Amazon for Hadoop distribution, the latest version for AMI version, a default Hive version for Applications to be Installed - Hive, and a default Pig version for Applications to be Installed - Pig.

    In the Hardware Configuration screen, ensure the fields hold Launch into EC2-Classic for Network, No Preference for EC2 Availability Zone, the default instance types for Master, Core, and Task, 2 for the Core Count, 0 for the Task Count, and no check for any of the Request Spot Instances options.

Be sure to set a limit providing sufficient capacity to prevent cluster failure.

    In the Security and Access screen, ensure fields hold your key pair in EC2 key pair, No other IAM users in IAM user access, and Proceed without roles in IAM role.

    Review the Bootstrap Actions screen, but do not modify it.

    Review settings, and select Create Cluster when finished.

A Summary pane appears when the cluster starts.

Activate SSH Session

You need an active SSH session to connect to the master node and execute CLI operations. Locate the master node by selecting the cluster in the EMR console. It lists the master node as Master Public DNS Name.

Install PuTTY if you do not have it. Then launch PuTTYgen and select Load. Choose your PEM file and open it. PuTTYgen informs you of a successful import. Select Save private key to save the key in PuTTY private key format (PPK), and choose Yes to save it without a passphrase. Then enter a name for the PuTTY key, hit Save, and close PuTTYgen.

Use PuTTY to make a connection with the master node by first starting PuTTY. Choose Session from the Category list. Enter hadoop@DNS within the Host Name field, substituting the Master Public DNS Name for DNS. Expand Connection > SSH in the Category list, and choose Auth. In the options screen, select Browse under Private key file for authentication. Then select your private key file and open it. Select Yes in the security alert pop-up.
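On macOS or Linux you can make the same connection with OpenSSH instead of PuTTY, using the downloaded .pem file directly; the DNS name below is a placeholder for your cluster's Master Public DNS Name:

```shell
# Restrict key file permissions (SSH refuses world-readable keys),
# then connect to the master node as the "hadoop" user.
chmod 400 filename.pem
ssh -i filename.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```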

When connected to the master node, a Hadoop command prompt appears, which means you can begin an interactive Hive session.

Hive Table

Hive serves as a data warehouse tool allowing queries on EMR clusters using HiveQL. The previous setups give you a working prompt. Run Hive commands interactively by entering "hive" at the prompt, followed by any commands you wish. See our Hive tutorial for more information on Hive.
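As a sketch of such a session, the statements below map a hypothetical DynamoDB table named Orders into Hive and query it. The table name, column names, and attribute mapping are assumptions for illustration; the storage handler class is the one EMR's Hive build provides for DynamoDB:

```sql
-- Map the (hypothetical) DynamoDB table "Orders" into Hive.
-- Note: binary and binary set attributes are not supported.
CREATE EXTERNAL TABLE hive_orders (
    order_id  string,
    customer  string,
    total     double
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name"     = "Orders",
    "dynamodb.column.mapping" = "order_id:OrderId,customer:Customer,total:Total"
);

-- A HiveQL query executed as a MapReduce job over the DynamoDB data.
SELECT customer, SUM(total) AS spend
FROM hive_orders
GROUP BY customer;
```

Because the table is EXTERNAL, dropping it in Hive removes only the mapping, not the underlying DynamoDB data.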
