Querying Snowflake Managed Iceberg Tables with DuckDB

A rubber duck sitting on top of an iceberg.

Welcome to part two in our series on working with Iceberg tables. If you haven't already set up your Snowflake-managed Iceberg tables, be sure to check out part one in our series to follow along.


In our previous post, we explored setting up Iceberg tables in Snowflake using Amazon S3. Today, we'll dive into querying these tables locally using DuckDB, showcasing how open table formats enable analytics across different engines.

What's up with the DuckDB hype?

DuckDB is an embedded analytical database that's been generating plenty of hype in the data community (alongside Iceberg tables, of course). The embedded part means the database doesn't run as a separate server (unlike PostgreSQL or MySQL). It's an incredibly efficient, fully featured database engine packed into a lightweight binary of roughly 20MB, remarkably small compared to traditional database systems.

Let's try it out!

Prerequisites

Before starting, ensure you have:

  • An S3 bucket containing your Iceberg tables
  • Access to AWS to configure your credentials

Setting up AWS Credentials

Support for Iceberg tables in DuckDB is still rather limited, so for this tutorial we'll need direct access to our S3 bucket through an access key and secret key. We'll cover how to set up Polaris and read directly from our catalog via PyIceberg in a future post, so stay tuned!

Creating an AWS Access and Secret Key

In AWS, navigate to IAM > Users and create a user, then attach a policy that grants read access to your S3 bucket. Take note of the AWS region here as well; we're using ca-central-1 in this example.


Creating an AWS user and attaching a policy to it.

Create an access key and secret for this user, and record them in a safe location.


Creating an access and secret key.

Configure AWS CLI

This step is optional but preferred. Alternatively, store the access key and secret key in environment variables that we can reference later.

In your terminal, install the AWS CLI using your installer of choice; either works for this tutorial.

pip install awscli
brew install awscli

After the AWS CLI is installed, let's configure our credentials using the access key and secret we recorded in the previous step. In your terminal, follow the prompts after entering this command:

aws configure

Configuring AWS CLI.

aws configure lets us rely on the AWS SDK credential resolution chain to find and use our credentials; more on this later.

Querying with DuckDB CLI

Fire up your code editor of choice and make sure you've installed duckdb with pip or brew.

pip install duckdb

CLI Hack

I'll be writing SQL in a file in my code editor and using a keyboard shortcut to send commands to the CLI. This is a much better experience than working in the CLI alone (thanks to Mehdi for this hack).

I'm using Cursor; the instructions should be similar for VS Code. To configure a new keyboard shortcut, open the keybindings.json file, which you can find using Command + P or by navigating to Cursor > Settings > Profiles > Keyboard Shortcuts. Save the following keybinding into the file.

[
    {
        "key": "cmd+'",
        "command": "workbench.action.terminal.runSelectedText"
    }
]

Now, whenever I press Command + ', either the highlighted text or the line my cursor is on will be sent to the terminal.

In your terminal, type duckdb to enter the DuckDB CLI, and create a demo.sql file in your directory. To test the keybinding, type SELECT 1; in the file and hit Command + '. Your SQL should magically pass through to your terminal.


Sample query using the keyboard shortcut.

DuckDB is an In-Memory Database

Being an in-memory database means that DuckDB operates on data stored in memory, that is, your computer's RAM rather than disk. By default, DuckDB won't persist any data: if you exit the CLI and spin DuckDB back up, your tables will be gone.

To persist our data, we can "attach" a DuckDB database using the following command. This will create a database file on disk storing the contents of any tables you materialize.

.open duckdb_demo

The next time you spin up the DuckDB CLI, you can attach that database:

ATTACH '/path/to/your/database' AS mydb;
USE mydb;

Another way is to spin up DuckDB with that database already loaded. In your terminal enter:

duckdb /path/to/your/database

Querying Time!

Next, we need to configure DuckDB to work with HTTP, S3, and Iceberg. Enter the DuckDB CLI and run the following commands to install and load the required extensions:

INSTALL httpfs;
LOAD httpfs;
INSTALL iceberg;
LOAD iceberg;
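
If you want to confirm both extensions are available, you can query DuckDB's built-in duckdb_extensions() table function. This is just an optional sanity check and isn't required for the rest of the tutorial:

-- Optional: check that httpfs and iceberg are installed and loaded
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('httpfs', 'iceberg');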

Load our AWS credentials and turn on our query timer:

CREATE SECRET (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);
.timer on
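
If you skipped the AWS CLI setup, a minimal alternative is to create the secret with your keys inline instead of relying on the credential chain. The values below are placeholders, so substitute your own access key, secret key, and region:

-- Placeholder credentials: replace with your own access key, secret, and region
CREATE SECRET (
    TYPE S3,
    KEY_ID 'your-access-key-id',
    SECRET 'your-secret-access-key',
    REGION 'ca-central-1'
);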

DuckDB's support for Iceberg is still limited, so we won't be interacting with our Iceberg tables through a catalog. Instead, we need to point DuckDB to a metadata file that Iceberg generates which represents the current snapshot of our table.

To identify the current metadata file of your Snowflake managed Iceberg table, run the following command in Snowflake:

SELECT PARSE_JSON(SYSTEM$GET_ICEBERG_TABLE_INFORMATION('GREYBEAM_DB.ICEBERG.PAYMENTS'))['metadataLocation']::varchar;

This process will get easier when we use PyIceberg in a future blog post.
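
As an aside, the iceberg extension also ships an iceberg_snapshots() function for inspecting a table's snapshot history. This is a hedged example that assumes iceberg_snapshots() accepts the same metadata path we pass to iceberg_scan below:

-- Assumes iceberg_snapshots() can read the same metadata.json path used with iceberg_scan
SELECT *
FROM iceberg_snapshots('s3://duckdb-demo/snowflake-managed/payments/metadata/00001-3f15f1b8-0604-4c8c-b70f-8a089af413b7.metadata.json');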

Now, back in our SQL file, let's query our payments table using a function called iceberg_scan.

SELECT
    CUSTOMER_ID
    , COUNT(*)
FROM iceberg_scan('s3://duckdb-demo/snowflake-managed/payments/metadata/00001-3f15f1b8-0604-4c8c-b70f-8a089af413b7.metadata.json')
GROUP BY 1
;

This took 54 seconds to run. Slower than you imagined, right? That's because DuckDB needs to download the relevant files into our computer's memory before processing the aggregation. We can improve performance by materializing the table into memory:

CREATE TABLE payments AS SELECT * FROM iceberg_scan('s3://duckdb-demo/snowflake-managed/payments/metadata/00001-3f15f1b8-0604-4c8c-b70f-8a089af413b7.metadata.json');

This could take some time depending on the size of your table and your download speed; our table is 110 million rows and it took 335 seconds. Replacing the iceberg_scan(...) call with payments and re-running the query now takes only 1.5 seconds!
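
For reference, here's the same aggregation rewritten against the materialized table, simply swapping the iceberg_scan(...) call for the local payments table:

-- Same aggregation as before, now reading from the materialized table
SELECT
    CUSTOMER_ID
    , COUNT(*)
FROM payments
GROUP BY 1
;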


Running our queries in DuckDB.

Running a SELECT * on my payments table in DuckDB finished in 29 seconds, while running the same query on an XS warehouse in Snowflake finished in 72 seconds (I'm running 16GB of memory on an Apple M1). Of course, this isn't an apples-to-apples comparison, but it's incredibly impressive to be able to run queries on large datasets locally with blazing speed.

Wrapping Up

There's so much more to explore in DuckDB and Iceberg, but this should be enough to get you started working with your Iceberg tables locally in DuckDB. If you run into any issues, please don't hesitate to reach out or join their Slack communities. These two ecosystems are constantly changing, so expect to hit the occasional bump.

Kyle Cheung
