
Execute a Shell Script on Every Node of an Amazon EMR Cluster

Amazon Elastic MapReduce (EMR) is a web service that uses Hadoop to process vast amounts of data quickly and cost-effectively by distributing the computational work across a cluster of virtual servers running in the AWS Cloud.

An EMR cluster has three types of nodes: master nodes, core nodes, and task nodes. Every cluster has exactly one master node, which manages the cluster and acts as the HDFS NameNode. MapReduce tasks are performed by the core nodes, which act as DataNodes and execute Hadoop jobs. There are use cases where you might want to install an application or run a bash script on every node of the cluster; this tutorial shows a convenient way to do that.
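The node roles can also be inspected from outside the cluster. A small sketch using the AWS CLI (the cluster ID passed as the argument is a placeholder you would substitute with your own):

```shell
# Sketch: list the private DNS names of a cluster's core nodes
# from any machine with the AWS CLI configured.
list_core_nodes() {
  aws emr list-instances \
    --cluster-id "$1" \
    --instance-group-types CORE \
    --query 'Instances[].PrivateDnsName' \
    --output text
}
# Usage (cluster ID is a placeholder): list_core_nodes j-XXXXXXXXXXXXX
```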

Note: If your script does not depend on core services such as Hadoop or HDFS, you can always use a Bootstrap Action to execute it instead. This is because Bootstrap Actions run before these services come alive on the cluster.
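As a sketch of that alternative, a bootstrap action can be attached when the cluster is created with the AWS CLI. The bucket path, cluster name, release label, and instance settings below are placeholders for illustration, not values from this tutorial:

```shell
# Sketch: run a script on every node via a bootstrap action instead.
# Upload the script to S3 first (bucket name is a placeholder):
#   aws s3 cp sample-script.sh s3://my-bucket/sample-script.sh
create_cluster_with_bootstrap() {
  # The bootstrap action runs on every node before Hadoop services start.
  aws emr create-cluster \
    --name "demo-cluster" \
    --release-label emr-6.10.0 \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://my-bucket/sample-script.sh
}
```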

Steps to execute a shell script on every node of an EMR Cluster

  1. Copy your SSH key (key.pem) to the master node of your EMR cluster using the command below:

     scp -i key.pem key.pem hadoop@<master_node_dns>:/home/hadoop/
    
  2. Disable strict host key checking to avoid manually verifying the identity of every host. To do so, create a file called config in the .ssh folder in /home/hadoop on the master node if it does not already exist, set its permissions to 400, and give it the following content:

     Host *
     StrictHostKeyChecking no
    
  3. Run the command below to get the FQDN (Fully Qualified Domain Name) of each core node and save the list in a variable named nodes:

     declare nodes=$(yarn node -list 2> /dev/null | grep internal | cut -d' ' -f1 | cut -d: -f1)
    
  4. Now run the sample loop given below. It iterates over the list nodes, connects to each core node over SSH, and executes the bash script (sample-script.sh):

     for node in $nodes
     do
       ssh -i key.pem hadoop@$node 'bash -s' < sample-script.sh
     done
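Putting steps 3 and 4 together, the whole procedure can be wrapped in a small helper run from the master node. A sketch, assuming key.pem is already in /home/hadoop as in step 1 (the function name is made up for illustration):

```shell
# Sketch: run a local bash script on every node registered with YARN.
run_on_all_nodes() {
  local script="$1"
  # Collect the private DNS name of every node YARN knows about.
  local nodes
  nodes=$(yarn node -list 2> /dev/null | grep internal | cut -d' ' -f1 | cut -d: -f1)
  for node in $nodes; do
    echo "Running ${script} on ${node}"
    # 'bash -s' makes the remote shell read the script from stdin,
    # so nothing needs to be copied to the node first.
    ssh -i /home/hadoop/key.pem "hadoop@${node}" 'bash -s' < "$script"
  done
}
# Usage: run_on_all_nodes sample-script.sh
```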