If you’re running Spark on EMR and need to submit jobs remotely, you’re in the right place. You can run Airflow on an EC2 instance and use it to submit jobs to EMR, provided the two can reach each other over the network. There are several ways to trigger a
spark-submit on a remote Spark server, EMR or otherwise, via Airflow:
Use SparkSubmitOperator
This operator requires a
spark-submit binary and YARN client configuration set up on the Airflow server. It invokes the
spark-submit command with the given options, blocks until the job finishes, and returns the final status. It also streams the logs of the spark-submit command to the Airflow task logs.
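To make the behavior concrete, here is a rough sketch of the kind of command line the operator assembles from its options before invoking it; the application path, conf keys, and arguments are hypothetical examples, not part of the operator’s API:

```python
# Rough sketch of the spark-submit invocation SparkSubmitOperator builds
# from its options. All paths and values below are hypothetical.

def build_spark_submit_argv(application, master="yarn", deploy_mode="cluster",
                            conf=None, app_args=None):
    """Assemble the argv list that would be handed to spark-submit."""
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]
    argv.append(application)          # the job itself (jar or .py file)
    argv += list(app_args or [])      # arguments passed to the job
    return argv

argv = build_spark_submit_argv(
    "s3://my-bucket/jobs/etl.py",            # hypothetical application path
    conf={"spark.executor.memory": "4g"},
    app_args=["--date", "2020-01-01"],
)
print(" ".join(argv))
```

In a DAG you would not build this yourself: with the Apache Spark provider installed, a `SparkSubmitOperator(task_id=..., application=..., conn_id=...)` task assembles and runs the equivalent command using the connection’s settings.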
Use Apache Livy
This solution is independent of the remote server running Spark. Use
SimpleHttpOperator with Livy. Livy is an open-source REST interface for interacting with Apache Spark from anywhere: you just make REST calls, so no Spark binaries or client configs are needed on the Airflow server. Here’s an example from Robert Sanders’ Airflow Spark operator plugin.
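As a minimal sketch of what those REST calls look like, the snippet below builds a request against Livy’s batch endpoint (`POST /batches`) using only the standard library; the host, artifact path, and class name are hypothetical:

```python
import json
import urllib.request

# Sketch of submitting a Spark job through Livy's batch API (POST /batches).
# The endpoint, file path, and class name below are hypothetical.

LIVY_URL = "http://livy-host:8998/batches"

def livy_batch_request(file, class_name=None, args=None, conf=None):
    """Build the HTTP request for a Livy batch submission."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        LIVY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = livy_batch_request(
    "s3://my-bucket/jobs/etl.jar",     # hypothetical job artifact
    class_name="com.example.EtlJob",   # hypothetical main class
    args=["--date", "2020-01-01"],
)
# urllib.request.urlopen(req) would actually submit the batch; you then
# poll GET /batches/{id}/state to track the job's progress.
```

In Airflow, a `SimpleHttpOperator` task makes the same POST, and a sensor polling the batch state endpoint can wait for completion.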
Use EmrAddStepsOperator
This solution is specific to EMR. Submit the job as an EMR step with
EmrAddStepsOperator; since steps run asynchronously, pair it with EmrStepSensor to wait for completion. Note that, by default, an EMR cluster runs only one step at a time.
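For reference, EMR steps are plain dictionaries in the shape expected by `EmrAddStepsOperator` (and boto3’s `add_job_flow_steps`); on EMR, `spark-submit` is run through `command-runner.jar`. The step name and S3 paths below are hypothetical:

```python
# Sketch of an EMR step definition that runs spark-submit via
# command-runner.jar. Names and S3 paths are hypothetical.

def spark_step(name, spark_submit_args):
    """Build one EMR step dictionary in the AddJobFlowSteps format."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     *spark_submit_args],
        },
    }

steps = [
    spark_step("etl-job", ["s3://my-bucket/jobs/etl.py", "--date", "2020-01-01"]),
]
# In a DAG, pass this list as the `steps` argument of EmrAddStepsOperator,
# then follow with an EmrStepSensor to wait for the step to finish.
```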
Use SSHOperator
This one is again independent of the target server. Use this operator to run Bash commands on a remote server (over SSH, via the paramiko library), such as
spark-submit. The benefit of this approach is that you don’t need to copy the
hdfs-site.xml or maintain any files on the Airflow server. The trade-off is that you have to build your
spark-submit command (with all its arguments) programmatically. Note that SSH connections aren’t very reliable and may break mid-job.
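Since the command runs through a remote shell, it’s worth quoting each argument; a minimal sketch of building such a command string with the standard library (all paths, conf values, and connection IDs are hypothetical):

```python
import shlex

# Sketch of building a spark-submit command string suitable for
# SSHOperator's `command` parameter. shlex.quote keeps arguments
# containing spaces or shell metacharacters safe. Paths are hypothetical.

def ssh_spark_submit_command(application, conf=None, app_args=None):
    parts = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    parts.append(application)
    parts += list(app_args or [])
    return " ".join(shlex.quote(p) for p in parts)

command = ssh_spark_submit_command(
    "/opt/jobs/etl.py",                      # hypothetical path on the cluster
    conf={"spark.app.name": "nightly etl"},  # value with a space gets quoted
    app_args=["--date", "2020-01-01"],
)
# In a DAG (connection ID is hypothetical):
# SSHOperator(task_id="submit", ssh_conn_id="emr_master", command=command)
```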
Specify Remote Master IP
This approach is also independent of the remote system, but it requires configuring global Spark/Hadoop settings and environment variables on the Airflow server so that spark-submit can reach the remote master directly.
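A minimal sketch of what that server-side setup might look like, assuming a YARN-backed cluster; the config directory, hostname, port, and job file are all hypothetical:

```shell
# Point the local Spark client at the remote cluster's Hadoop/YARN
# configuration (copied from the cluster; paths are hypothetical).
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# Against a standalone Spark master, you can instead address it directly
# (hostname and job file are hypothetical):
spark-submit --master spark://remote-master-host:7077 my_job.py
```

With this in place, an Airflow BashOperator running a local spark-submit will dispatch work to the remote cluster.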