According to AWS Glue documentation:
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.— Providing Your Own Custom Scripts
But if you’re using Python shell jobs in Glue, there is a way to use Python packages like Pandas using Easy Install.
Easy Install is a python module (— Easy Install
easy_install) bundled with
setuptoolsthat lets you automatically download, build, install, and manage Python packages.
Just use the following code:
import os import site from setuptools.command import easy_install install_path = os.environ['GLUE_INSTALLATION'] easy_install.main( ["--install-dir", install_path, "<PACKAGE>"] ) reload(site) import <PACKAGE>
import os import site from setuptools.command import easy_install install_path = os.environ['GLUE_INSTALLATION'] easy_install.main( ["--install-dir", install_path, "https://files.pythonhosted.org/packages/83/03/10902758730d5cc705c0d1dd47072b6216edc652bc2e63a078b58c0b32e6/pg8000-1.12.5.tar.gz"] ) reload(site)
This will install the required packages at runtime, after which, you can import & use them as usual.
Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others.— Introducing Python Shell Jobs in AWS Glue
You can check what packages are installed using this script as Glue job:
import pip import logging logger = logging.getLogger(__name__) logger.setLevel(logging.INFO) if __name__ == '__main__': logger.info(pip._internal.main(['list']))
AWS Data Wrangler
AWS Data Wrangler is an open source initiative that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, Amazon QuickSight, etc).
Built on top of other open-source projects like Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2 and PyMySQL, it offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases.— What is AWS Data Wrangler?
AWS Data Wrangler can be used as a Lambda layer, in Glue Python shell jobs, Glue PySpark jobs, SageMaker notebooks & EMR!