ASOF JOINs in Snowflake: 100x Performance on Time-Relative Joins

A sample SQL query using ASOF JOIN in Snowflake.
A sample query using ASOF JOIN.

ASOF JOINs are the most efficient way to relate the timestamp of one table with the timestamp of another table, even when those timestamps don't match exactly.

Feel free to skip right to the ASOF JOIN section using the table of contents on the left.

💡
ASOF JOIN is a type of join that pairs a record from two tables based on their proximity (usually temporal). For each row on the left side of the join, the operation finds the closest matching value from the right side. 
Sample data on how we want to match two timestamps using an ASOF JOIN in Snowflake.
Sample data on how we want to match two timestamps using an ASOF JOIN.

Let's dive in with...

An Example

Imagine you work at a company that tracks bridge events.

A bridge can either be up or down. When a bridge is down, cars can travel across it. Only when the last car exits the bridge can it be drawn up.

Your manager wants to investigate the time between cars exiting and the bridge being drawn up. So she tasks you with determining the timestamps for when the bridge was next drawn up for each car that travelled across the bridge.

We need to associate each car event with the closest bridge up event that happened after the car exited.

What we need to do to match the data between two timestamps in Snowflake.
What we need to do to match the data between two timestamps.

Our Data

BRIDGE_STATUS_EVENTS

Tracks the timestamps when a bridge changes status.

SELECT
    timestamp
    , region_id
    , status
FROM bridge_status_events
Sample Snowflake data for our ASOF JOIN.

CAR_CROSSING_EVENTS

One record for each car that crosses a bridge.

SELECT
    event_id
    , region_id
    , enter_time
    , exit_time
FROM car_crossing_events
Sample Snowflake data for our ASOF JOIN.
These cars drive fast!

Solving this Without ASOF JOIN

Before we dive into how to use ASOF JOIN, let's take a look at what we would've done without it. There are a number of ways to do this, I won't cover them all.

Interval Range Joins

Since a bridge can't be drawn up if it's already drawn up. We can find the two bridge up events that sandwich the car crossing, and take the later bridge up event:

bridge up > car crossing > previous bridge up

WITH bridge_up_events AS (
    SELECT
        timestamp AS bridge_up_at
        , LAG(timestamp) OVER (PARTITION BY region_id ORDER BY timestamp ASC) as last_bridge_up_at
        , region_id
        , status
    FROM bridge_status_events
    WHERE
        status = 'UP'
)
SELECT
    c.region_id
    , c.enter_time
    , c.exit_time
    , b.bridge_up_at
FROM car_crossing_events AS c
LEFT JOIN bridge_up_events AS b
    ON c.region_id = b.region_id
    AND c.exit_time <= b.bridge_up_at --range join
    AND (c.exit_time > b.last_bridge_up_at OR b.last_bridge_up_at IS NULL) --range join
;

As expected, this join is incredibly inefficient because this is a range join. This query took 25 minutes and 10 seconds on an XS warehouse with 11.6 million records in CAR_CROSSING_EVENTS.

Learn how to read query plans in our earlier post here.

Snowflake query plan for a range join between two timestamps.
Range join takes 25 minutes and 10 seconds on an XS warehouse.

Notice in the Join node above that Snowflake uses an equality condition before applying the additional join condition. Snowflake initially executes range joins as a cartesian product (also known as a CROSS JOIN) on the equality join, and then applies the additional join conditions. The records in the left table join to each record in the right table, causing an explosion of records and terribly slow performance. In our case, this is depicted below as cartesian product between all matching REGION_IDs, and then filtering based on the timestamps. Hypothetically, if both tables have only 1 region and 50,000 rows, then the intermediate dataset produced will have 2.5 billion rows.

Why Snowflake performance is slow on interval range joins.
Range joins, step by step.

Binning

Binning is another efficient approach but can get rather complex, it deserves a post by itself. Check out Select's blog on how to use bins!

ASOF JOIN

Let's finish the example and then explain how it works.

WITH bridge_up_events AS (
    SELECT
        timestamp AS bridge_up_at
        , region_id
        , status
    FROM greybeam_db.sandbox.bridge_status_events
    WHERE
        status = 'UP'
)
SELECT
    c.region_id
    , c.enter_time
    , c.exit_time
    , b.bridge_up_at
FROM greybeam_db.sandbox.car_crossing_events AS c
ASOF JOIN bridge_up_events AS b
    MATCH_CONDITION (c.exit_time <= b.bridge_up_at)
    ON c.region_id = b.region_id
;

This query finishes in 14 seconds on an XS warehouse, a 107x improvement compared to our range join. 31% of the 14 seconds was spent on processing the join.

Query plan for an ASOF JOIN in Snowflake.
Query plan for an ASOF JOIN in Snowflake, notice the node is an ASOF Join node.

How to Use ASOF JOIN

FROM <left_table> 
ASOF JOIN <right_table>
  MATCH_CONDITION ( <left_table.timecol> <comparison_operator> <right_table.timecol> )
  [ ON <table.col> = <table.col> [ AND ... ] | USING ( <column_list> ) ]

The most important part of an ASOF JOIN is the MATCH_CONDITION. This dictates how Snowflake matches on proximity. In our example, we want to find the nearest bridge_up_at event that happened after a car exited the bridge, so the car exit time must be less than or equal to the bridge up time.

MATCH_CONDITION (c.exit_time <= b.bridge_up_at)

If we wanted to find the most recent bridge down event before a car crossed the bridge, then we would flip the equality.

MATCH_CONDITION (c.exit_time >= b.bridge_down_at)

Finally, we want to ensure we only pair these records on matching regions.

ON c.region_id = b.region_id

How ASOF JOIN Works

ASOF JOIN is designed to simplify and optimize queries that need to match rows based on "closest" relationships, typically involving timestamps. ASOF JOIN queries will always attempt to match a single row in the left table with a single row in the right table.

When you use ASOF JOIN, you're essentially delegating the complex matching logic to Snowflake's query engine. The engine then converts the ASOF JOIN operation into a series of efficient steps:

  1. Alignment: Rows from both the left and right tables are aligned based on the join keys (specified in the ON clause) and the timestamp expression (in the MATCH_CONDITION).
  2. Partition-Aware Sorting: The data is sorted within partitions defined by the join keys.
  3. Closest Match Search: Depending on the comparison operator in the MATCH_CONDITION, Snowflake searches for the closest previous or next value for each row in the left table.

Limitations

While ASOF JOIN is powerful, it's not without its limitations:

  1. Picky About Operators: ASOF JOIN is like that friend who only eats at four specific restaurants. MATCH_CONDITION can only use >=, >, <, or <=.
  2. One Condition to Rule Them All: Your MATCH_CONDITION must be a single condition. It's not a fan of complex logic. If you need multiple conditions, you'll have to get creative!
  3. Time and Numbers Only: ASOF JOIN is picky about its data types. MATCH_CONDITION only plays nice with dates, datetimes, timestamps, or numbers.
  4. Equality in the ON Clause: The ON clause can only contain = equality conditions. You can't use complex logic or non-equality comparisons here.
  5. No "Between" Matches: ASOF JOIN finds the closest match in one direction. If you need to find events that occurred between two timestamps, you might need to combine multiple ASOF JOINs or use a different approach.

Important Notes

  1. NULL Handling: ASOF JOIN treats NULLs like a LEFT JOIN would. If there's no "bridge up" event after a car exits, you'll get NULL values for the right-side columns.
  2. Identical Matches is a Coin Toss: If multiple rows in the right table have the exact same timestamp, the results are non-deterministic because any one of the tying rows might be returned. When such ties exist and you run the same query multiple times, you might get different results.

Advanced Techniques

Once you're comfortable with basic ASOF JOIN usage, consider these advanced techniques:

  1. Multiple ASOF JOINs: You can chain multiple ASOF JOINs in a single query to match events from multiple tables.
  2. Bidirectional Matching: By using two ASOF JOINs, you can find both the previous and next events relative to each row in your main table.
  3. Combining with Window Functions: While ASOF JOIN eliminates the need for many window function use cases, combining the two can solve complex time-series problems.

Wrapping Up

ASOF JOIN is a powerful tool in your Snowflake toolkit, but it only works if your team uses it! At Greybeam, we've built a SQL optimization layer automatically optimizes every query sent to Snowflake, so you can get back to delivering value instead of worrying whether your team is using ASOF JOINs.

Kyle Cheung

Kyle Cheung

optimization for you, you, you, you, and you