Use Python Packages like NumPy & Pandas with AWS Glue

According to the AWS Glue documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

-- Providing Your Own Custom Scripts

But if you're using Python shell jobs in Glue, you can still use Python packages like pandas by installing them at runtime with Easy Install.

Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.

-- Easy Install

Use the following code, replacing <PACKAGE> with the package you need:

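import os
import site
from setuptools.command import easy_install

# Install directory taken from the Glue job's environment
install_path = os.environ['GLUE_INSTALLATION']
# Download, build, and install the package into that directory
easy_install.main(["--install-dir", install_path, "<PACKAGE>"])
# Reload site so the newly installed package is added to sys.path
# (reload is a builtin in Python 2.7)
reload(site)

import <PACKAGE>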

Example:

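import os
import site
from setuptools.command import easy_install

# Install directory taken from the Glue job's environment
install_path = os.environ['GLUE_INSTALLATION']

# Install pg8000 1.12.5 directly from its source archive on PyPI
easy_install.main(["--install-dir", install_path, "https://files.pythonhosted.org/packages/83/03/10902758730d5cc705c0d1dd47072b6216edc652bc2e63a078b58c0b32e6/pg8000-1.12.5.tar.gz"])

# Reload site so the newly installed package is added to sys.path
reload(site)

import pg8000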

This installs the required packages at runtime, after which you can import & use them as usual.
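The install-then-import pattern above can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name install_and_import and its parameters are hypothetical, the installer is passed in as a callable (for example easy_install.main) so the helper does not depend on a particular setuptools version, and site.addsitedir is used here as a narrower alternative to reload(site), not as the method from the original snippet.

```python
import importlib
import os
import site


def install_and_import(requirement, module_name, installer, install_dir=None):
    """Install a requirement, refresh sys.path, and return the imported module.

    `installer` is any callable that takes an easy_install-style argument
    list, such as `easy_install.main`. All names here are illustrative.
    """
    if install_dir is None:
        # Same environment variable as in the snippets above.
        install_dir = os.environ["GLUE_INSTALLATION"]

    installer(["--install-dir", install_dir, requirement])

    # Add the directory to sys.path and process any .pth files in it.
    # This is a narrower alternative to reloading the whole site module.
    site.addsitedir(install_dir)

    # Python 3 caches directory listings; clear them if the function exists.
    invalidate = getattr(importlib, "invalidate_caches", None)
    if invalidate is not None:
        invalidate()

    return importlib.import_module(module_name)
```

In a Glue Python shell job this might be called as install_and_import("pg8000", "pg8000", easy_install.main), assuming easy_install is imported as in the example above.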

Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others.

-- Introducing Python Shell Jobs in AWS Glue

You can check which packages are installed by running this script as a Glue job:

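import logging
# Import the submodule explicitly; "import pip" alone may not load it
import pip._internal

# Configure a handler so INFO messages are actually emitted
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    # "pip list" prints the installed packages to stdout;
    # main() itself only returns the exit code
    exit_code = pip._internal.main(['list'])
    logger.info("pip list exited with code %s", exit_code)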
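Since some libraries come pre-loaded, it can be worth checking whether a module is already importable before installing it at runtime. A minimal sketch of such a check follows; the function name is_importable is hypothetical and not part of Glue or pip.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True
```

For example, a job could call is_importable("pandas") and only run the Easy Install step when it returns False.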

AWS Data Wrangler

AWS Data Wrangler is an open source initiative that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, Amazon QuickSight, etc).

Built on top of other open-source projects like Pandas, Apache Arrow, Boto3, s3fs, SQLAlchemy, Psycopg2 and PyMySQL, it offers abstracted functions to execute usual ETL tasks like load/unload data from Data Lakes, Data Warehouses and Databases.

-- What is AWS Data Wrangler?

AWS Data Wrangler can be used as a Lambda layer, in Glue Python shell jobs, Glue PySpark jobs, SageMaker notebooks & EMR!
