Getting Started with pyIceberg and AWS Glue
As data teams increasingly adopt Apache Iceberg for their data lake needs, we will begin to see a need for better tooling in the space. At the moment, pyIceberg is the go-to library for working with Iceberg from Python. Given how new all the tooling around Iceberg is, getting started can be challenging. One common challenge is setting up and using pyIceberg with AWS Glue. I will share a step-by-step guide on how to get started so you don't melt in the process.
Prerequisites
- AWS access key and secret access key with appropriate permissions
- Python
- Basic familiarity with AWS Glue
- AWS CLI configured locally
If you are just getting started with Iceberg, check out our previous blog here.
We will be working with a simple Python script for this guide. Here is the gist if you can't wait till the end.
Step 1: Installation
First, install pyIceberg using pip:
pip install "pyiceberg[glue]"
Step 2: Initialize the Glue Catalog
Give Glue access to your access key and secret access key. Since the pyIceberg documentation presents all the configuration options in Pythonic style, I prefer using a Python script to set them.
Setting the client.* properties lets me use the same credentials for both S3 and Glue.
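A minimal sketch of such a script, assuming a recent pyIceberg release that supports the unified client.* properties. The helper names glue_catalog_properties and load_glue_catalog, the catalog name "default", and the default region are illustrative choices, not part of pyIceberg.

```python
def glue_catalog_properties(access_key_id, secret_access_key, region):
    """Build the property dict for a Glue-backed pyIceberg catalog."""
    return {
        "type": "glue",
        # client.* values are picked up by both the Glue client and S3 file IO.
        "client.access-key-id": access_key_id,
        "client.secret-access-key": secret_access_key,
        "client.region": region,
    }


def load_glue_catalog(access_key_id, secret_access_key,
                      region="us-east-1", name="default"):
    """Return a pyIceberg catalog object backed by AWS Glue."""
    # Imported here so the helper above works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    properties = glue_catalog_properties(access_key_id, secret_access_key, region)
    return load_catalog(name, **properties)
```

Calling load_glue_catalog with your own key, secret, and region returns the catalog object used in the remaining steps.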
Note: Please DO NOT hard code your access key and secret access key in a production script. This is just for the sake of the example.
If you'd prefer to use environment variables, you can set the same properties that way.
This can be a bit confusing at first because the environment variable names are nested.
For example, PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID sets s3.access-key-id for the default catalog.
A double underscore (__) denotes a nested value, and a single underscore (_) denotes a hyphen. This is explained further in the pyIceberg configuration documentation.
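To make the naming rule concrete, here is a small sketch that translates a variable name into the catalog name and property key it targets. The env_var_to_property helper is hypothetical and only illustrates the convention; it is not pyIceberg's own parser.

```python
PREFIX = "PYICEBERG_CATALOG__"


def env_var_to_property(env_var):
    """Translate an environment variable name into (catalog, property key)."""
    if not env_var.startswith(PREFIX):
        raise ValueError(f"{env_var} is not a pyIceberg catalog variable")
    # A double underscore separates nesting levels; the first level is the catalog.
    catalog, *parts = env_var[len(PREFIX):].lower().split("__")
    # A single underscore becomes a hyphen, and nesting levels are joined by dots.
    key = ".".join(part.replace("_", "-") for part in parts)
    return catalog, key
```

Under this rule, PYICEBERG_CATALOG__DEFAULT__CLIENT__REGION would map to client.region on the default catalog.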
Step 3: Creating an AWS Glue Database
In AWS Glue, tables are grouped only by database. The typical DATABASE.SCHEMA.TABLE format isn't present, so we only need to create a database to get started.
We can do the following in pyIceberg to create a database:
Note: The create_namespace_if_not_exists method follows the Pythonic convention of asking for forgiveness rather than permission. It simply tries to create the database and ignores the error if it already exists, which might not be ideal if you are expecting an empty database afterwards.
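As a sketch of the database step, the helper below creates the database and then lists its tables, so an unexpectedly non-empty database is at least visible. The function name ensure_database is illustrative; catalog is assumed to be a pyIceberg catalog object such as the one loaded in Step 2.

```python
def ensure_database(catalog, database):
    """Create the Glue database if needed and return the tables it already holds."""
    # Succeeds silently when the database already exists.
    catalog.create_namespace_if_not_exists(database)
    # list_tables returns identifiers such as ("analytics", "events").
    existing = catalog.list_tables(database)
    if existing:
        print(f"{database} already contains {len(existing)} table(s)")
    return existing
```

An empty return value means the database has no tables the catalog can see, whether it was just created or already existed.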
Step 4: Creating an Iceberg Table
To create a new table:
Note: You can find all supported types and how to use them here.
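A minimal sketch of table creation, assuming a catalog object from Step 2 and a database from Step 3. The column names, the events table name, and the S3 path layout are made up for illustration; with Glue you generally need to pass a location (or configure a warehouse path) so pyIceberg knows where to write.

```python
def build_events_schema():
    """Define a small example schema with explicit field IDs."""
    # Imported inside the function so importing this sketch does not need pyiceberg.
    from pyiceberg.schema import Schema
    from pyiceberg.types import LongType, NestedField, StringType, TimestampType

    return Schema(
        NestedField(field_id=1, name="event_id", field_type=LongType(), required=True),
        NestedField(field_id=2, name="event_type", field_type=StringType(), required=False),
        NestedField(field_id=3, name="created_at", field_type=TimestampType(), required=False),
    )


def create_events_table(catalog, schema, database, bucket):
    """Create <database>.events under s3://<bucket>/<database>/events."""
    identifier = f"{database}.events"
    location = f"s3://{bucket}/{database}/events"
    return catalog.create_table(identifier=identifier, schema=schema, location=location)
```

Passing build_events_schema() as the schema argument ties the two helpers together; keeping them separate makes it easy to swap in a different schema.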
If you want to make a table using the schema from an existing Glue table you can do the following:
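One possible approach, assuming the existing table is a plain (non-Iceberg) Glue table with simple column types: read its column definitions with a boto3 Glue client and map them to Iceberg types. The helper names and the type mapping below are assumptions for illustration and cover only a handful of primitive types; partition keys, which Glue stores separately, are ignored.

```python
def glue_columns(glue_client, database, table):
    """Return (name, glue_type) pairs for the columns of a Glue table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return [(column["Name"], column["Type"]) for column in columns]


def to_iceberg_schema(columns):
    """Convert (name, glue_type) pairs into a pyIceberg Schema."""
    # Imported inside the function so glue_columns works without pyiceberg.
    from pyiceberg.schema import Schema
    from pyiceberg.types import (
        BooleanType, DateType, DoubleType, FloatType, IntegerType,
        LongType, NestedField, StringType, TimestampType,
    )

    # Partial mapping; an unlisted Glue type raises KeyError on purpose.
    type_map = {
        "string": StringType, "int": IntegerType, "bigint": LongType,
        "float": FloatType, "double": DoubleType, "boolean": BooleanType,
        "date": DateType, "timestamp": TimestampType,
    }
    fields = [
        NestedField(field_id=index, name=name,
                    field_type=type_map[glue_type](), required=False)
        for index, (name, glue_type) in enumerate(columns, start=1)
    ]
    return Schema(*fields)
```

Here glue_client would be something like boto3.client("glue"), and the resulting schema can be passed to catalog.create_table in the same way as a hand-written one.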
Step 5: Working with Tables
To load and interact with existing tables:
Once you have the table object the rest of the pyIceberg API is available to you.
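As a short sketch of loading a table and reading a few rows, assuming the catalog from Step 2 and that pyarrow is installed (it is not pulled in by the glue extra alone). The preview_table helper and its default limit are illustrative.

```python
def preview_table(catalog, identifier, limit=10):
    """Load an Iceberg table and return up to `limit` rows as a pyarrow Table."""
    # The identifier has the form "database.table", for example "analytics.events".
    table = catalog.load_table(identifier)
    print(table.schema())
    # scan() plans the read; to_arrow() materializes the rows in memory.
    return table.scan(limit=limit).to_arrow()
```

The table object returned by load_table is also the entry point for writes and schema changes, so the same pattern extends beyond reads.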
Best Practices
- Always use the latest version of pyIceberg to access new features and fixes
- Review the pyIceberg documentation for exact details on how to use the library. Don't assume what a method does based on its name.
Conclusion
While pyIceberg's documentation might seem sparse, the library is actively maintained and provides robust functionality for working with AWS Glue. The recent acquisition of Tabular by Databricks hasn't affected the development of pyIceberg, and it remains a reliable tool for managing Iceberg tables. With AWS's recent release of S3 Tables, the future of pyIceberg looks bright.
Here is a link to a gist you can use to get started with pyIceberg and AWS Glue: pyIceberg and AWS Glue