Software Development Engineer


Disable Local Hive CLI Execution in Amazon EMR

Hive is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster. Hive scripts use a SQL-like language called HiveQL (Hive Query Language) that abstracts programming models and supports typical data warehouse interactions.

Hive needs a metastore service to store its metadata, which is kept in tables in a relational database. By default, Hive records metastore information in a MySQL database on the master node's file system and gives clients (including Hive itself) access to this information through the metastore service API.

The two popular Hive clients are the Hive CLI and Beeline. Beeline connects to the Hive metastore through HiveServer2, which integrates authentication and authorisation mechanisms, whereas the Hive CLI has no such mechanism. For security reasons, it is therefore recommended that connections via the Hive CLI be disabled.

To disable direct connections to the Hive metastore and allow connections only through HiveServer2, log in to the master node of the EMR cluster and apply one of the following solutions:

  1. Disable the Hive CLI by adding the following lines to /etc/hive/conf/hive-env.sh:

     if [ "$SERVICE" = "cli" ]; then
       echo "Sorry! hive-shell is disabled for security purposes."
       exit 1
     fi
  2. Disable the Hive CLI by renaming or removing the hive binary in /usr/bin.

  3. Create an alias for 'hive' that redirects it to beeline. This can be done using the following command:

     alias hive="beeline"
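
The effect of the guard in the first solution can be checked without touching a real cluster. The sketch below is only an illustration: it assumes, as the snippet implies, that the Hive launcher sets the SERVICE variable and then sources hive-env.sh. The check_service helper and the temporary file are hypothetical stand-ins for that launcher and for /etc/hive/conf/hive-env.sh.

```shell
#!/bin/bash
# The guard from solution 1, held in a variable so it can be written to a
# throwaway file instead of the real /etc/hive/conf/hive-env.sh.
GUARD='if [ "$SERVICE" = "cli" ]; then
  echo "Sorry! hive-shell is disabled for security purposes."
  exit 1
fi'

# check_service NAME (hypothetical helper): source the guard in a subshell
# with SERVICE set to NAME and report whether the guard let it through.
check_service() {
  local guard_file
  guard_file="$(mktemp)"
  printf '%s\n' "$GUARD" > "$guard_file"
  # The subshell keeps the guard's "exit 1" from ending this script.
  if ( SERVICE="$1"; . "$guard_file" ) > /dev/null; then
    echo "$1: allowed"
  else
    echo "$1: blocked"
  fi
  rm -f "$guard_file"
}

check_service cli
check_service hiveserver2
check_service metastore
```

Only the cli service is stopped by the guard; other service names, such as the hiveserver2 and metastore values used here, pass through unchanged because the condition matches the string "cli" exactly.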

The first solution can be automated during the creation of the cluster by creating a step.

Automating the first solution with a step at cluster creation

We can submit work to Amazon EMR clusters using steps. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster. We can also run a command-line script at any point during step processing on the cluster. Such scripts are executed through either command-runner.jar or script-runner.jar. The command-runner.jar package executes the commands passed as arguments to the step, whereas script-runner.jar executes a script stored in an S3 bucket, whose URL is passed as an argument to the step.

You can use the following script to automate the first solution:

  #!/bin/bash

  # Append the guard to hive-env.sh as root; tee's echo of the text is discarded.
  echo -e "\nif [ \"\$SERVICE\" = \"cli\" ]; then echo \"Hive CLI is disabled for security purposes\"; exit 1; fi" | sudo tee --append /etc/hive/conf/hive-env.sh > /dev/null

The aforementioned echo command can also be added as an argument to the command-runner.jar step.
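
If the step might run more than once, the guard would be appended repeatedly. A possible variant, sketched below, checks for the guard before appending it. This is not part of the original script: the append_guard function and the temporary demo file are hypothetical, and on a real cluster the target would be /etc/hive/conf/hive-env.sh, written with root privileges.

```shell
#!/bin/bash
# append_guard FILE (hypothetical helper): add the Hive CLI guard to FILE
# unless a line with the same condition is already present.
append_guard() {
  local env_file
  local marker
  env_file="$1"
  marker='if [ "$SERVICE" = "cli" ]; then'
  if grep -qF "$marker" "$env_file" 2>/dev/null; then
    echo "guard already present in $env_file"
    return 0
  fi
  printf '\n%s\n  echo "Hive CLI is disabled for security purposes"\n  exit 1\nfi\n' "$marker" >> "$env_file"
  echo "guard appended to $env_file"
}

# Demonstration against a throwaway file rather than the real hive-env.sh.
demo_file="$(mktemp)"
append_guard "$demo_file"   # first call appends the guard
append_guard "$demo_file"   # second call finds it and changes nothing
rm -f "$demo_file"
```

The check uses a fixed-string grep on the condition line, so it only recognises a guard written with exactly that condition.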

To create a cluster with a step that runs the script, we can execute the following AWS CLI command, replacing region in the JAR path with the Region in which the cluster runs:

  aws emr create-cluster --name "Test cluster" --release-label emr-5.15.0 \
    --applications Name=Hive Name=Pig --use-default-roles \
    --ec2-attributes KeyName=myKey --instance-type m4.large --instance-count 3 \
    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
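
Since steps can also be submitted after a cluster is created, the same script-runner.jar step could be added to a running cluster with aws emr add-steps. The sketch below only prints the command as a dry run. The build_script_step helper, the step name DisableHiveCLI, the us-east-1 Region, and the cluster ID j-XXXXXXXXXXXXX are all placeholders chosen for illustration.

```shell
#!/bin/bash
# build_script_step S3_URL REGION (hypothetical helper): print the --steps
# shorthand for a script-runner.jar step that runs the script at S3_URL.
build_script_step() {
  local script_url
  local region
  script_url="$1"
  region="$2"
  printf 'Type=CUSTOM_JAR,Name=DisableHiveCLI,ActionOnFailure=CONTINUE,Jar=s3://%s.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[%s]\n' \
    "$region" "$script_url"
}

# Dry run: echo the command instead of calling AWS. Remove the leading
# "echo" and substitute a real cluster ID to submit the step.
echo aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps "$(build_script_step s3://mybucket/script-path/my_script.sh us-east-1)"
```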

© 2024 Ujjwal Bhardwaj. All Rights Reserved.