How to Set Up Amazon EMR (Amazon Elastic MapReduce)


Amazon EMR is a web service that can easily and efficiently process enormous amounts of data. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3.

Step 1: Sign in to your AWS account and select Amazon EMR in the AWS Management Console.

Step 2: Create an Amazon S3 bucket for cluster logs and output data. (The procedure is explained in detail in the Amazon S3 section.)
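If you would rather script this step than use the console, the bucket can also be created with the AWS SDK for Python (boto3). The snippet below is only a minimal sketch and is not part of the original walkthrough: the helper names, the `logs/` and `output/` prefixes, and any bucket name you pass in are assumptions made for illustration. One detail it captures is that S3 expects no location constraint for `us-east-1`, while other regions need one.

```python
def build_create_bucket_args(bucket_name, region):
    """Return keyword arguments for boto3's S3 create_bucket call."""
    args = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        args["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return args


def create_emr_bucket(bucket_name, region):
    """Create the bucket. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**build_create_bucket_args(bucket_name, region))
    # Assumed layout: one prefix for cluster logs, one for query output.
    return {
        "log_uri": f"s3://{bucket_name}/logs/",
        "output_uri": f"s3://{bucket_name}/output/",
    }
```

Calling `create_emr_bucket("my-emr-tutorial-bucket", "us-west-2")` (a hypothetical name) would return the two S3 URIs that the later steps can reuse for logs and output.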

Step 3: Launch the Amazon EMR cluster.

– Use this link to open the Amazon EMR console − https://console.aws.amazon.com/elasticmapreduce/home

– Select Create cluster and provide the required details on the Cluster Configuration page.

– Leave the Tags section options as default and proceed.

– On the Software Configuration section, leave the options as default.

– On the File System Configuration section, leave the options for EMRFS as set by default. EMRFS is an implementation of HDFS; it allows Amazon EMR clusters to store data on Amazon S3.

– On the Hardware Configuration section, select m3.xlarge in the EC2 instance type field and leave other settings as default. Click the Next button.

– On the Security and Access section, select your key pair from the list in the EC2 key pair field and leave the other settings as default.

– On the Bootstrap Actions section, leave the fields as set by default and click the Add button. Bootstrap actions are scripts that run on every cluster node during setup, before Hadoop starts.

– On the Steps section, leave the settings as default and proceed.

– Click the Create Cluster button, and the Cluster Details page opens. From this page, run the Hive script as a cluster step and use the Hue web interface to query the data.
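The console choices in Step 3 correspond roughly to the parameters of boto3's `run_job_flow` call. The sketch below shows one way that mapping could look; treat it as an illustration under stated assumptions rather than the exact request the console sends. The release label, the instance count of 3, the application list, and the default role names `EMR_DefaultRole` and `EMR_EC2_DefaultRole` (which must already exist in your account) are all assumptions that do not come from the walkthrough above.

```python
def build_cluster_request(name, log_uri, key_pair, release_label,
                          instance_type="m3.xlarge", instance_count=3):
    """Return keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,               # S3 location for cluster logs (Step 2)
        "ReleaseLabel": release_label,   # assumption: choose a release you need
        # Assumed application set; Hive and Hue are used later in the tutorial.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Hue"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "Ec2KeyName": key_pair,      # the EC2 key pair selected in the console
            # Keep the cluster running so steps can be added afterwards.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Tags, BootstrapActions and Steps are left out on purpose, which
        # mirrors leaving those console sections at their defaults.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region):
    """Start the cluster and return its ID. Requires boto3 and credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Keeping the request builder separate from the network call makes it easy to inspect or print the request before any cluster is actually started and billed.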

Step 4: Run the Hive script using the following steps.

– Open the Amazon EMR console and select the desired cluster.

– Move to the Steps section and expand it. Then click the Add step button.

– The Add Step dialog box opens. Fill in the required fields, then click the Add button.
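The Add Step dialog can also be approximated in code. The sketch below builds a Hive step for boto3's `add_job_flow_steps` call using `command-runner.jar`, which applies to EMR release 4.x and later. The script location, input location, and output location are parameters you must supply, and the sketch assumes the Hive script reads its paths from `${INPUT}` and `${OUTPUT}` variables. The step name and the `CONTINUE` failure action are illustrative choices as well.

```python
def build_hive_step(name, script_uri, input_uri, output_uri):
    """Return one step definition for boto3's EMR add_job_flow_steps call."""
    return {
        "Name": name,
        # Assumption: keep the cluster running even if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_uri,               # S3 path of the Hive script
                "-d", f"INPUT={input_uri}",     # exposed to Hive as ${INPUT}
                "-d", f"OUTPUT={output_uri}",   # exposed to Hive as ${OUTPUT}
            ],
        },
    }


def add_hive_step(cluster_id, step, region):
    """Submit the step and return its ID. Requires boto3 and credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

The cluster ID passed to `add_hive_step` is the one shown on the Cluster Details page.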

– To view the output of the Hive script, use the following steps −

– Open the Amazon S3 console and select the S3 bucket used for the output data.

– Select the output folder.

– The query writes the results into a separate folder. Select os_requests.

– The output is stored in a text file. This file can be downloaded.
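The download can be scripted too. The following sketch lists the objects under the output folder and saves each one locally. It takes an already created boto3 S3 client as an argument; the bucket name, prefix, and destination directory are whatever you used in the earlier steps, and the helper name is made up for this example. For brevity it reads only the first page of results (up to 1,000 objects), which should be enough for a small query result.

```python
import os


def download_query_output(s3_client, bucket, prefix, dest_dir):
    """Download every object under prefix and return the local file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    saved = []
    for obj in response.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder objects
        local_path = os.path.join(dest_dir, os.path.basename(key))
        s3_client.download_file(bucket, key, local_path)
        saved.append(local_path)
    return saved
```

With boto3 installed, a call might look like `download_query_output(boto3.client("s3"), "my-emr-tutorial-bucket", "output/os_requests/", "results")`, where the bucket name is again a placeholder.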


Pamer
