If you’re running Spark on EMR & need to submit jobs remotely, you’re in the right place! You can have Airflow running on an EC2 instance & use it to submit jobs to EMR, provided the two can reach each other over the network. There are several ways you can trigger a spark-submit to a remote Spark server, EMR or otherwise, via Airflow:
Use SparkSubmitOperator
This operator requires that you have a spark-submit binary and YARN client config set up on the Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes & returns the final status. It also streams the logs from the spark-submit command’s stdout & stderr.
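Here’s a minimal sketch of what that task could look like. The connection id, application path, and import path (which differs between Airflow 1.10 and 2+) are assumptions, not a definitive setup:

```python
# Hypothetical DAG: the import path assumes Airflow 2+ with the apache-spark
# provider installed (on Airflow 1.10 it lives under airflow.contrib.operators).
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        conn_id="spark_default",        # Spark connection defined in Airflow
        application="/path/to/app.py",  # placeholder path to your Spark application
        name="airflow_spark_submit",
        application_args=["--date", "{{ ds }}"],
        verbose=True,                   # stream spark-submit stdout/stderr into the task log
    )
```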
Use Apache Livy
This solution is independent of the remote server running Spark. Use SimpleHttpOperator with Livy. Livy is an open source REST interface for interacting with Apache Spark from anywhere; you just make REST calls. Here’s an example from Robert Sanders’ Airflow Spark operator plugin.
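As a rough sketch, the task below POSTs a batch to Livy’s /batches endpoint; the livy_http connection (pointing at the Livy server, usually on port 8998) and the S3 job path are assumptions. Since this call only submits the batch, you’d typically follow it with a sensor or another HTTP call that polls /batches/<id>/state:

```python
# Hypothetical DAG: "livy_http" is assumed to be an Airflow HTTP connection
# pointing at the Livy server; the S3 path is a placeholder.
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator

with DAG(
    dag_id="livy_batch_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_batch = SimpleHttpOperator(
        task_id="submit_batch",
        http_conn_id="livy_http",
        endpoint="batches",              # Livy's batch-session endpoint
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps(
            {
                "file": "s3://my-bucket/jobs/app.py",  # placeholder job location
                "args": ["--date", "{{ ds }}"],
            }
        ),
        # Livy answers 201 Created when the batch is accepted.
        response_check=lambda response: response.status_code == 201,
        log_response=True,
    )
```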
Use EmrSteps API
This solution is specific to EMR. It’s async, so you’ll need EmrStepSensor along with EmrAddStepsOperator. Note that an EMR cluster runs only one step at a time by default (newer EMR releases let you raise the step concurrency).
Use SSHHook / SSHOperator
This one’s again independent of the target server. Use this operator to run Bash commands on a remote server (over SSH via the paramiko library), such as spark-submit. The benefit of this approach is that you don’t need to copy the hdfs-site.xml or maintain any cluster config files on the Airflow server; you just have to build your spark-submit command (with all its arguments) programmatically. Note that the SSH connection isn’t very reliable & might break.
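For illustration, here’s a sketch that builds the spark-submit string in Python and runs it over an assumed emr_master_ssh connection to the master node; the paths and arguments are placeholders:

```python
# Hypothetical DAG: "emr_master_ssh" is assumed to be an Airflow SSH connection
# that can reach the EMR master node; the S3 path is a placeholder.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# Build the full spark-submit command string in Python.
SPARK_SUBMIT_CMD = (
    "spark-submit "
    "--master yarn "
    "--deploy-mode cluster "
    "--num-executors 4 "
    "s3://my-bucket/jobs/app.py --date {{ ds }}"
)

with DAG(
    dag_id="ssh_spark_submit_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_spark_submit = SSHOperator(
        task_id="run_spark_submit",
        ssh_conn_id="emr_master_ssh",
        command=SPARK_SUBMIT_CMD,
    )
```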
Specify Remote Master IP
This approach is also independent of the remote system, but it requires setting up global configuration files & environment variables on the Airflow host so spark-submit can resolve the remote master.
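A sketch of that setup with a plain BashOperator, assuming the Airflow host has spark-submit on its PATH and a copy of the remote cluster’s Hadoop client configs; every path here is a placeholder:

```python
# Hypothetical DAG: assumes the Airflow host has a spark-submit binary and copies
# of the remote cluster's Hadoop configs at /etc/hadoop/conf (both placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="remote_master_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_to_remote_master = BashOperator(
        task_id="submit_to_remote_master",
        bash_command=(
            # HADOOP_CONF_DIR must point at configs that reference the remote
            # cluster's ResourceManager / NameNode.
            "export HADOOP_CONF_DIR=/etc/hadoop/conf && "
            "spark-submit --master yarn --deploy-mode cluster /path/to/app.py"
        ),
    )
```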