Transient EMR Clusters and Managing Metadata

As the name suggests, transient EMR (Elastic MapReduce) clusters on AWS are intended to run only for the lifetime of a specific set of tasks, which minimizes cost. Users who need to analyze data in an ad hoc manner, on the other hand, are better served by the long-running EMR cluster model.

There is also a third option: a hybrid of transient and long-running. In this model the cluster is only available when there is consistent demand to use it. The challenge with this model, though, is that the metadata describing the data structures (Hive DDL) and the users is lost when the cluster terminates. How can this metadata be persisted?

Fortunately, there are two options, both of which store the metadata outside the cluster. Both assume that the actual data being analyzed or processed is stored on S3 rather than locally on HDFS.

The first option is to back up and restore the metadata. Both Hive and Hue store their metadata in MySQL databases, which can easily be exported and copied to S3 before the terminate cluster command is run. At launch, the databases are copied back onto the master node of the cluster by a launch step.
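
The backup half of this approach could be sketched as follows. This is a minimal illustration rather than a tested procedure: the database name, the S3 bucket and prefix, and the assumption that the root MySQL user on the master node can run mysqldump without a password prompt are all hypothetical and would need to be checked against the actual cluster.

```python
import subprocess
from datetime import date


def dump_command(database, user="root"):
    """Build a mysqldump invocation for one metadata database."""
    # --databases adds CREATE DATABASE / USE statements to the dump,
    # so it can later be replayed without naming a target database.
    return ["mysqldump", f"--user={user}", "--single-transaction",
            "--databases", database]


def backup_key(prefix, database, day):
    """Build a dated S3 key, e.g. <prefix>/2020-01-02/<database>.sql."""
    return f"{prefix}/{day.isoformat()}/{database}.sql"


def backup_database(database, bucket, prefix, day=None):
    """Dump one database on the master node and upload the dump to S3."""
    import boto3  # imported here so the helpers above work without boto3

    day = day or date.today()
    dump = subprocess.run(dump_command(database), check=True,
                          capture_output=True).stdout
    key = backup_key(prefix, database, day)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=dump)
    return key
```

Calling `backup_database` once for the Hive database and once for the Hue database, just before the cluster is terminated, would leave one dated dump per database in the bucket.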

An example use case for this model is to have the cluster scheduled to be launched at the start of the working day and terminated at the end.
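
The launch and terminate sides of such a daily schedule might look like the sketch below, written with boto3. It assumes an EMR release on which `command-runner.jar` is available for steps, passwordless root access to MySQL on the master node, and a dump that was created with `--databases` so that it can be replayed as is. The function names and the `cluster_config` dictionary (the remaining `run_job_flow` arguments such as name, release and instances) are hypothetical, and the Hive and Hue services may need restarting after the restore.

```python
def restore_step(database, bucket, key):
    """Describe an EMR step that streams a dump from S3 into MySQL."""
    # "aws s3 cp <s3 uri> -" writes the object to stdout.
    command = f"aws s3 cp s3://{bucket}/{key} - | mysql --user=root"
    return {
        "Name": f"Restore {database} metadata",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["bash", "-c", command],
        },
    }


def launch_cluster(cluster_config, steps):
    """Start a cluster with the restore steps attached; return its id."""
    import boto3  # imported here so restore_step works without boto3

    response = boto3.client("emr").run_job_flow(Steps=steps, **cluster_config)
    return response["JobFlowId"]


def terminate_cluster(cluster_id):
    """Terminate the cluster (after the metadata has been backed up)."""
    import boto3

    boto3.client("emr").terminate_job_flows(JobFlowIds=[cluster_id])
```

A scheduler of choice (cron, for example) would call `launch_cluster` in the morning with one restore step per database, and run the backup followed by `terminate_cluster` in the evening.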

The second option is to use an external metastore rather than backing up and restoring the one on the cluster, as described here:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-dev-create-metastore-outside.html

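This is more suitable in some circumstances, for example when the backup and restore steps are a burden, but the external database is an additional resource to run and pay for, so it increases operating cost.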
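
As a rough illustration of the external metastore option, the sketch below builds a `hive-site` configuration that points Hive at a MySQL-compatible database outside the cluster. It assumes an EMR release that accepts configuration classifications at launch; the host, user, password and database name are placeholders, and the linked AWS guide remains the reference for the exact properties and driver on a given release.

```python
def hive_metastore_configuration(host, user, password,
                                 database="hive", port=3306):
    """Build a hive-site classification for an external metastore."""
    url = (f"jdbc:mysql://{host}:{port}/{database}"
           "?createDatabaseIfNotExist=true")
    return [{
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": url,
            # Driver name as used in AWS examples; verify for your release.
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": user,
            "javax.jdo.option.ConnectionPassword": password,
        },
    }]
```

The returned list would be passed as the `Configurations` argument when the cluster is created, so that every new cluster attaches to the same metastore instead of starting with an empty one.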