Dec 6, 2024 How-To

Getting Started with pyIceberg and AWS Glue

As data teams increasingly adopt Apache Iceberg for their data lakes, the need for better tooling in the space will grow. At the moment, pyIceberg is the go-to Python library for working with Iceberg. Because the tooling around Iceberg is still so new, getting started can be challenging. One common hurdle is setting up and using pyIceberg with AWS Glue. I will share a step-by-step guide on how to get started so you don't melt in the process.

Prerequisites

AWS access key and secret access key with appropriate permissions
Python
Basic familiarity with AWS Glue
AWS CLI configured locally

If you are just getting started with Iceberg, check out our previous blog here.

We will be working with a simple Python script for this guide. Here is the gist if you can't wait until the end.

Step 1: Installation

First, install pyIceberg using pip:

pip install "pyiceberg[glue]"

Step 2: Initialize the Glue Catalog

pyIceberg needs your access key and secret access key to talk to Glue. Since the pyIceberg documentation shares all the configuration options in Pythonic style, I prefer using a Python script to set them.

Setting the client.* properties lets me use the same credentials for both S3 and Glue.

from pyiceberg.catalog import load_catalog
glue_catalog = load_catalog(
    'default',
    **{
        'client.access-key-id': '********',
        'client.secret-access-key': '********',
        'client.region': 'us-east-1'
    },
    type='glue'
)

Create the catalog connection

Note: Please DO NOT hard code your access key and secret access key in a production script. This is just for the sake of the example.
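One way to keep the secrets out of the script is to read them from the process environment at runtime. The sketch below assumes the credentials live in the standard AWS variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION. The helper names glue_properties_from_env and connect_to_glue are hypothetical and not part of pyIceberg.

```python
import os
from typing import Mapping


def glue_properties_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Build the client.* catalog properties from standard AWS variables.

    Hypothetical helper for illustration; raises KeyError if a key is missing.
    """
    return {
        "type": "glue",
        "client.access-key-id": env["AWS_ACCESS_KEY_ID"],
        "client.secret-access-key": env["AWS_SECRET_ACCESS_KEY"],
        # Assumption: fall back to the region used elsewhere in this guide
        "client.region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }


def connect_to_glue():
    # Imported lazily so the helper above can be used without pyIceberg installed
    from pyiceberg.catalog import load_catalog

    return load_catalog("default", **glue_properties_from_env())
```

Calling connect_to_glue() should then give the same catalog object as the hard-coded version above, without the keys ever appearing in source control.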

If you'd prefer to use environment variables you can do the following:

export PYICEBERG_CATALOG__DEFAULT__CLIENT__ACCESS_KEY_ID=********
export PYICEBERG_CATALOG__DEFAULT__CLIENT__SECRET_ACCESS_KEY=********
export PYICEBERG_CATALOG__DEFAULT__CLIENT__REGION=us-east-1

Connection variables set via environment variables

This can be a bit confusing at first because the environment variables are nested.

For example:

PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID

Sets:

s3.access-key-id for the default catalog.

A double underscore (__) denotes a nested value, and a single underscore (_) denotes a hyphen. This is explained further in the pyIceberg configuration documentation.
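To make the naming rule concrete, here is a small sketch that translates a variable name into the catalog name and property it sets. The function env_var_to_property is a hypothetical, simplified illustration of the rule described above, not pyIceberg's actual parser, so edge cases (for example catalog names that contain underscores) may be handled differently by the library.

```python
def env_var_to_property(env_var: str) -> tuple:
    """Translate a PYICEBERG_CATALOG__... variable into (catalog, property)."""
    prefix = "PYICEBERG_CATALOG__"
    if not env_var.upper().startswith(prefix):
        raise ValueError(f"not a pyIceberg catalog variable: {env_var}")
    # The first "__"-separated part is the catalog name; the rest is the property
    catalog_name, _, nested = env_var[len(prefix):].lower().partition("__")
    # "__" marks a nested level (written with a dot), "_" stands for a hyphen
    property_name = nested.replace("__", ".").replace("_", "-")
    return catalog_name, property_name


print(env_var_to_property("PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID"))
# ('default', 's3.access-key-id')
print(env_var_to_property("PYICEBERG_CATALOG__DEFAULT__CLIENT__REGION"))
# ('default', 'client.region')
```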

Step 3: Creating an AWS Glue Database

In AWS Glue, tables are grouped only by database. The typical DATABASE.SCHEMA.TABLE hierarchy isn't present, so we only need to create a database to get started. pyIceberg calls a Glue database a namespace.

We can do the following in pyIceberg to create a database:

glue_catalog.create_namespace("some_database_name")

# or with the *_if_not_exists clause

glue_catalog.create_namespace_if_not_exists("some_database_name")

Ways to create a Glue database with the pyIceberg library

Note: The create_namespace_if_not_exists method follows the Pythonic convention of asking for forgiveness rather than permission. It simply tries to create the database and ignores the error if the database already exists, leaving any tables in it untouched. That might not be ideal if you are expecting an empty database afterwards.
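If you do need the database to be empty, one option is to check for tables right after creating it. This is a minimal sketch: create_empty_database is a hypothetical helper name, and it relies only on the catalog's create_namespace_if_not_exists and list_tables methods.

```python
def create_empty_database(catalog, name: str) -> None:
    """Create the database, or fail loudly if it already holds tables."""
    catalog.create_namespace_if_not_exists(name)
    # list_tables returns the identifiers of the tables in the namespace
    existing = catalog.list_tables(name)
    if existing:
        raise RuntimeError(
            f"{name} already exists and contains {len(existing)} table(s)"
        )
```

With the catalog from Step 2 this would be called as create_empty_database(glue_catalog, "some_database_name").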

Step 4: Creating an Iceberg Table

To create a new table:

from pyiceberg.schema import Schema
from pyiceberg.types import (
    LongType,
    NestedField,
    StringType,
    TimestampType,
)

# Define your schema
schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "name", StringType()),
    NestedField(3, "created_at", TimestampType())
)

# Where you want the Iceberg table to be stored
table_location = "s3://your-bucket/path/to/table" 

# Create the table
table = glue_catalog.create_table(
    identifier=("some_database_name", "some_table_name"),
    schema=schema,
    location=table_location
)

Manually set the schema and create an Iceberg table

Note: You can find all supported types and how to use them here.

If you want to create a table using the schema of an existing Iceberg table in Glue, you can do the following:

# load_table takes a single identifier, so pass the database and table as one tuple
from_table = glue_catalog.load_table(("some_database_name", "some_table_name"))

from_table_schema = from_table.schema().as_arrow()

# Where you want the Iceberg table to be stored
new_table_location = "s3://your-bucket/path/to/another_table" 

new_table = glue_catalog.create_table(
    identifier=("some_database_name", "some_other_table_name"),
    schema=from_table_schema,
    location=new_table_location
)

Create an Iceberg table copying the schema from another Iceberg table

Step 5: Working with Tables

To load and interact with existing tables:

# Load an existing table
table = glue_catalog.load_table(("some_database_name", "some_table_name"))

# Get table metadata
print(f"Table name: {table.name()}")
print(f"Table location: {table.location()}")
print(f"Table schema: {table.schema()}")

# Run a DuckDB query (needs DuckDB and PyArrow, e.g. pip install "pyiceberg[glue,duckdb]")
connection = table.scan().to_duckdb(table_name="some_table_alias")
df = connection.execute("SELECT * FROM some_table_alias").arrow()
print(f"Query results: {df}")

Loading an Iceberg table via the Glue catalog

Once you have the table object the rest of the pyIceberg API is available to you.
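For example, a scan can filter rows and select columns before anything is loaded into memory. The sketch below assumes the id and name columns from the schema in Step 4 and that PyArrow is installed; preview_rows is a hypothetical helper name.

```python
def preview_rows(table, min_id: int, limit: int = 10) -> list:
    """Return up to `limit` rows with id >= min_id as a list of dicts."""
    scan = table.scan(
        row_filter=f"id >= {min_id}",    # filter applied by pyIceberg during the scan
        selected_fields=("id", "name"),  # only read these columns
        limit=limit,
    )
    # to_arrow() materializes the scan as a PyArrow table
    return scan.to_arrow().to_pylist()
```

With the table loaded above, preview_rows(table, min_id=100) would return at most ten matching rows.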

Best Practices

Use the latest version of pyIceberg to get new features and fixes.
Review the pyIceberg documentation for exact details on how to use the library. Don't assume what a method does based on its name.

Conclusion

While pyIceberg's documentation might seem sparse, the library is actively maintained and provides robust functionality for working with AWS Glue. The recent acquisition of Tabular by Databricks hasn't affected the development of pyIceberg, and it remains a reliable tool for managing Iceberg tables. With AWS's recent release of S3 Tables, the future of pyIceberg looks bright.

Here is a link to a gist you can use to get started with pyIceberg and AWS Glue: pyIceberg and AWS Glue