Monday, April 4, 2022

Working with EMR Best Practices

 


In this post I'll describe some of the best practices I've learned while working with AWS EMR.


Auto Terminate

Running an EMR cluster costs money for as long as it is up. To save money, configure the cluster to terminate automatically once it has been idle for a set period, for example 1 hour.
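For an already running cluster, the idle timeout can also be set from code. The following is a minimal sketch, assuming boto3 is available and using its EMR client call `put_auto_termination_policy`; the helper names and the cluster ID in the usage comment are hypothetical.

```python
def build_auto_termination_policy(idle_minutes=60):
    """Build the policy dict; EMR expects IdleTimeout in seconds."""
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive")
    return {"IdleTimeout": idle_minutes * 60}


def enable_auto_termination(emr_client, cluster_id, idle_minutes=60):
    """Attach an idle-timeout policy to an existing cluster.

    emr_client is expected to behave like boto3.client("emr").
    """
    policy = build_auto_termination_policy(idle_minutes)
    emr_client.put_auto_termination_policy(
        ClusterId=cluster_id,
        AutoTerminationPolicy=policy,
    )
    return policy


# Usage (hypothetical cluster ID, requires AWS credentials):
#   import boto3
#   enable_auto_termination(boto3.client("emr"), "j-XXXXXXXXXXXXX", idle_minutes=60)
```

Passing the client in as a parameter keeps the function easy to try out with a stand-in object before pointing it at a real cluster.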

AWS CLI

Do not create the EMR cluster manually every time. Once the cluster is configured to your needs, use the AWS CLI export button to generate the CLI command that creates it. Recreating a terminated cluster is then simple, and can even be automated.
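One way to automate the recreation is to keep the exported command in a small script. The sketch below assembles an `aws emr create-cluster` command from a few parameters; the function name and all values (cluster name, release label, instance type, S3 path) are hypothetical placeholders, and in practice the exported command from your own cluster should be the source of truth.

```python
import shlex


def build_create_cluster_command(name, release_label, instance_type,
                                 instance_count, bootstrap_s3_path=None,
                                 idle_timeout_seconds=3600):
    """Return the CLI command as a list of arguments (nothing is executed)."""
    cmd = [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--auto-termination-policy", f"IdleTimeout={idle_timeout_seconds}",
    ]
    if bootstrap_s3_path:
        # Optional bootstrap script stored on S3
        cmd += ["--bootstrap-actions", f"Path={bootstrap_s3_path}"]
    return cmd


command = build_create_cluster_command(
    "my-cluster", "emr-6.5.0", "m5.xlarge", 3,
    bootstrap_s3_path="s3://my-bucket/bootstrap.sh",
)
# Show the command as a single shell-safe line
print(shlex.join(command))

# To actually create the cluster (requires the AWS CLI and credentials):
#   import subprocess
#   subprocess.run(command, check=True)
```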



Use Bootstrap

A bootstrap script is a shell script that runs on the cluster before Spark starts. Use it to install the prerequisites your code needs. A common prerequisite is a set of Python libraries, for example:


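#!/bin/bash
# -y answers the confirmation prompt; a bootstrap script runs non-interactively
sudo yum install -y unzip
# install or upgrade the Python libraries the job needs
sudo python3 -m pip install -U boto3 paramiko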


Write Dynamic Code

When writing code we sometimes have, well... bugs...
To debug them, we can print debug messages to STDOUT and check the output in the logs.
Another way to debug is to run the code locally in your development environment, using the local Spark instance that the pyspark library creates automatically. However, some things need to behave differently on a development machine; for example, you might want to read local files instead of accessing files on S3. To check whether the code is running in a cluster or on a development machine, we can use the following simple method:


import os

def is_local_spark():
    return 'SPARK_PUBLIC_DNS' not in os.environ
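As an illustration of the S3 redirect mentioned above, here is a minimal sketch that maps an S3 URI to a local path when running on a development machine. The function `resolve_input_path`, the `environ` parameter, and the `/tmp/data` root are assumptions made for this example, not part of any library.

```python
import os
from pathlib import PurePosixPath


def is_local_spark(environ=os.environ):
    # Same check as above; environ is a parameter so it can be tried out easily
    return 'SPARK_PUBLIC_DNS' not in environ


def resolve_input_path(s3_uri, local_root="/tmp/data", environ=os.environ):
    """Return the S3 URI on the cluster, or a local mirror path otherwise."""
    if not is_local_spark(environ):
        return s3_uri
    # Drop the scheme ("s3://", "s3a://", ...) and keep "bucket/key"
    bucket_and_key = s3_uri.split("://", 1)[1]
    return str(PurePosixPath(local_root) / bucket_and_key)


# On a development machine (no SPARK_PUBLIC_DNS in the environment):
print(resolve_input_path("s3://my-bucket/input/data.csv", environ={}))
# On the cluster the URI is returned unchanged:
print(resolve_input_path("s3://my-bucket/input/data.csv",
                         environ={"SPARK_PUBLIC_DNS": "some-host"}))
```

With this approach the rest of the code keeps using a single path, and only the helper knows whether the data comes from S3 or from a local copy.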


Spark Context

The Spark context must be created only once. If it is kept in a global variable that is used by several modules, Python might initialize it more than once, causing errors saying that the Spark context has already been created. To avoid this, we use a singleton class.



from pyspark import SparkContext


class SingletonMeta(type):
    # one stored instance per class that uses this metaclass
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # create the instance on the first call only, then reuse it
        if cls not in cls._instances:
            instance = super().__call__(*args, **kwargs)
            cls._instances[cls] = instance
        return cls._instances[cls]


class SparkWrapper(metaclass=SingletonMeta):
    def __init__(self):
        self.spark_context = SparkContext.getOrCreate()


print(SparkWrapper().spark_context)
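To see the singleton behavior without starting Spark, here is a small sketch that uses the same metaclass with a stand-in class. `FakeWrapper` and its `init_calls` counter are hypothetical and exist only for this demonstration.

```python
class SingletonMeta(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Build the instance once, then hand back the stored one
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class FakeWrapper(metaclass=SingletonMeta):
    init_calls = 0  # counts how many times __init__ actually ran

    def __init__(self):
        FakeWrapper.init_calls += 1


first = FakeWrapper()
second = FakeWrapper()
print(first is second)         # True
print(FakeWrapper.init_calls)  # 1
```

No matter how many modules call the wrapper, the initialization code runs a single time and every caller gets the same object.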








