Submit Apache Spark Jobs to an Amazon EMR Cluster from Apache Airflow

If you’re running Spark on EMR and need to submit jobs remotely, you’re in the right place! You can run Airflow on an EC2 instance and use it to submit jobs to EMR, provided the two can reach each other over the network. There are several ways to trigger a spark-submit against a remote Spark server, EMR or otherwise, from Airflow:

Use SparkSubmitOperator

This operator requires that you have a spark-submit binary and YARN client configuration set up on the Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes, and returns the final status. It also streams the logs from spark-submit’s stdout and stderr.
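Under the hood, SparkSubmitOperator assembles and runs a spark-submit command line from its arguments. A minimal sketch of that mapping, with stdlib Python only (the application path, conf values, and arguments are illustrative assumptions):

```python
# Sketch: how SparkSubmitOperator-style arguments map onto the spark-submit
# command line it invokes. All paths and values below are illustrative.
def build_spark_submit_cmd(application, master="yarn", deploy_mode="client",
                           conf=None, application_args=None):
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]   # one --conf flag per setting
    cmd.append(application)                   # the job script or jar
    cmd += list(application_args or [])       # args passed to the job itself
    return cmd

cmd = build_spark_submit_cmd(
    "s3://my-bucket/jobs/etl_job.py",             # hypothetical job script
    conf={"spark.executor.memory": "4g"},
    application_args=["--run-date", "2021-01-01"],
)
print(" ".join(cmd))
```

The operator runs a command like this as a subprocess, then tails its output until the driver reports a terminal state.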

Use Apache Livy

This solution is independent of the remote server running Spark. Livy is an open-source REST interface for interacting with Apache Spark from anywhere; you pair it with Airflow’s SimpleHttpOperator and just make REST calls. Here’s an example from Robert Sanders’ Airflow Spark operator plugin.
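With Livy you submit a batch job by POSTing JSON to its /batches endpoint, then poll the returned batch id for its state. A sketch of the payload an HTTP task would send (the host, file path, and conf values are assumptions):

```python
import json

# Sketch: the JSON body for Livy's POST /batches endpoint, which an Airflow
# HTTP task (e.g. SimpleHttpOperator) would send. Host and paths are
# illustrative assumptions.
LIVY_URL = "http://emr-master.example.com:8998/batches"  # hypothetical host

payload = {
    "file": "s3://my-bucket/jobs/etl_job.py",   # application to run
    "args": ["--run-date", "2021-01-01"],       # args for the application
    "conf": {"spark.executor.memory": "4g"},    # Spark configuration
    "name": "airflow_livy_batch",
}
body = json.dumps(payload)

# An HTTP task would POST `body` with Content-Type: application/json, read
# the batch id from the response, and poll GET /batches/{id}/state until
# the batch reaches a terminal state.
```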

Use EmrSteps API

This solution is specific to EMR. It’s asynchronous, so you’ll need EmrAddStepsOperator to submit the step and EmrStepSensor to wait for it to finish. Note that EMR runs steps one at a time by default; newer EMR releases can run steps concurrently if you raise the cluster’s step concurrency level.
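With the Steps API, a Spark job is a step whose HadoopJarStep runs command-runner.jar with a spark-submit invocation. EmrAddStepsOperator takes this same structure via its steps argument, and EmrStepSensor then polls the step until it completes. A sketch of the step definition (bucket and script names are assumptions):

```python
# Sketch: the step structure passed to EmrAddStepsOperator (and, underneath,
# to boto3's add_job_flow_steps). Bucket and script names are assumptions.
SPARK_STEPS = [
    {
        "Name": "airflow_spark_step",
        "ActionOnFailure": "CONTINUE",   # don't terminate cluster on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar", # EMR's generic command launcher
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/etl_job.py",   # hypothetical job script
                "--run-date", "2021-01-01",
            ],
        },
    }
]

# In the DAG, EmrAddStepsOperator(job_flow_id=..., steps=SPARK_STEPS)
# returns the new step ids, and EmrStepSensor waits on each id.
```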

Use SSHHook / SSHOperator

This one’s again independent of the target server. Use SSHOperator to run Bash commands, such as spark-submit, on a remote server (over the SSH protocol, via the paramiko library). The benefit of this approach is that you don’t need to copy hdfs-site.xml or maintain any config file locally; the trade-off is that you have to build the full spark-submit command, with all its arguments, programmatically. Also note that a long-lived SSH connection isn’t entirely reliable and may drop mid-job, failing the Airflow task even though the Spark job itself may keep running.
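Since the command string is all SSHOperator sends, you typically assemble the spark-submit line yourself and shell-quote it carefully. A sketch using stdlib quoting (the paths, conf values, and connection id are assumptions):

```python
import shlex

# Sketch: building the remote command an SSHOperator task would execute.
# Paths, conf values, and arguments are illustrative assumptions.
def build_remote_spark_submit(application, conf=None, args=None):
    parts = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    parts.append(application)
    parts += list(args or [])
    # shell-quote every token so spaces and special characters survive
    # the remote shell that SSH hands the command to
    return " ".join(shlex.quote(p) for p in parts)

command = build_remote_spark_submit(
    "/home/hadoop/jobs/etl_job.py",
    conf={"spark.app.name": "nightly etl"},
    args=["--run-date", "2021-01-01"],
)
# SSHOperator(task_id="spark_submit", ssh_conn_id="emr_ssh", command=command)
```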

Specify Remote Master IP

This approach is also independent of the remote system, but it requires global configuration: you need local copies of the cluster’s Hadoop/YARN client config files and environment variables such as HADOOP_CONF_DIR set on the Airflow server so spark-submit can resolve the remote master.
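Concretely, you point HADOOP_CONF_DIR at local copies of the cluster’s yarn-site.xml and core-site.xml, and let --master yarn resolve the remote ResourceManager from them. A sketch of the environment setup (all directory and bucket paths are illustrative assumptions):

```python
import os

# Sketch: the environment a local spark-submit needs to target a remote EMR
# YARN master. All directory and bucket paths are illustrative assumptions.
env = dict(os.environ)
env.update({
    # directory holding copies of the cluster's yarn-site.xml / core-site.xml
    "HADOOP_CONF_DIR": "/etc/hadoop/emr-conf",
    "SPARK_HOME": "/opt/spark",
})

cmd = [
    "/opt/spark/bin/spark-submit",
    "--master", "yarn",                   # resolved via HADOOP_CONF_DIR
    "--deploy-mode", "cluster",
    "s3://my-bucket/jobs/etl_job.py",     # hypothetical job script
]
# subprocess.run(cmd, env=env) would then submit against the remote cluster.
```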