ASOF JOINs in Snowflake: 100x Performance on Time-Relative Joins
ASOF JOINs are the most efficient way to relate the timestamp of one table with the timestamp of another table, even when those timestamps don't match exactly.
Feel free to skip right to the ASOF JOIN section using the table of contents on the left.
Let's dive in with...
An Example
Imagine you work at a company that tracks bridge events.
A bridge can either be up or down. When a bridge is down, cars can travel across it. Only when the last car exits the bridge can it be drawn up.
Your manager wants to investigate the time between cars exiting and the bridge being drawn up. So she tasks you with determining the timestamps for when the bridge was next drawn up for each car that travelled across the bridge.
We need to associate each car event with the closest bridge up event that happened after the car exited.
Our Data
BRIDGE_STATUS_EVENTS
Tracks the timestamps when a bridge changes status.
SELECT
timestamp
, region_id
, status
FROM bridge_status_events
CAR_CROSSING_EVENTS
One record for each car that crosses a bridge.
SELECT
event_id
, region_id
, enter_time
, exit_time
FROM car_crossing_events
Solving this Without ASOF JOIN
Before we dive into how to use ASOF JOIN, let's take a look at what we would've done without it. There are a number of ways to do this, I won't cover them all.
Interval Range Joins
Since a bridge can't be drawn up if it's already drawn up. We can find the two bridge up events that sandwich the car crossing, and take the later bridge up event:
bridge up > car crossing > previous bridge up
WITH bridge_up_events AS (
SELECT
timestamp AS bridge_up_at
, LAG(timestamp) OVER (PARTITION BY region_id ORDER BY timestamp ASC) as last_bridge_up_at
, region_id
, status
FROM bridge_status_events
WHERE
status = 'UP'
)
SELECT
c.region_id
, c.enter_time
, c.exit_time
, b.bridge_up_at
FROM car_crossing_events AS c
LEFT JOIN bridge_up_events AS b
ON c.region_id = b.region_id
AND c.exit_time <= b.bridge_up_at --range join
AND (c.exit_time > b.last_bridge_up_at OR b.last_bridge_up_at IS NULL) --range join
;
As expected, this join is incredibly inefficient because this is a range join. This query took 25 minutes and 10 seconds on an XS warehouse with 11.6 million records in CAR_CROSSING_EVENTS
.
Learn how to read query plans in our earlier post here.
Notice in the Join node above that Snowflake uses an equality condition before applying the additional join condition. Snowflake initially executes range joins as a cartesian product (also known as a CROSS JOIN
) on the equality join, and then applies the additional join conditions. The records in the left table join to each record in the right table, causing an explosion of records and terribly slow performance. In our case, this is depicted below as cartesian product between all matching REGION_ID
s, and then filtering based on the timestamps. Hypothetically, if both tables have only 1 region and 50,000 rows, then the intermediate dataset produced will have 2.5 billion rows.
Binning
Binning is another efficient approach but can get rather complex, it deserves a post by itself. Check out Select's blog on how to use bins!
ASOF JOIN
Let's finish the example and then explain how it works.
WITH bridge_up_events AS (
SELECT
timestamp AS bridge_up_at
, region_id
, status
FROM greybeam_db.sandbox.bridge_status_events
WHERE
status = 'UP'
)
SELECT
c.region_id
, c.enter_time
, c.exit_time
, b.bridge_up_at
FROM greybeam_db.sandbox.car_crossing_events AS c
ASOF JOIN bridge_up_events AS b
MATCH_CONDITION (c.exit_time <= b.bridge_up_at)
ON c.region_id = b.region_id
;
This query finishes in 14 seconds on an XS warehouse, a 107x improvement compared to our range join. 31% of the 14 seconds was spent on processing the join.
How to Use ASOF JOIN
FROM <left_table>
ASOF JOIN <right_table>
MATCH_CONDITION ( <left_table.timecol> <comparison_operator> <right_table.timecol> )
[ ON <table.col> = <table.col> [ AND ... ] | USING ( <column_list> ) ]
The most important part of an ASOF JOIN is the MATCH_CONDITION. This dictates how Snowflake matches on proximity. In our example, we want to find the nearest bridge_up_at
event that happened after a car exited the bridge, so the car exit time must be less than or equal to the bridge up time.
MATCH_CONDITION (c.exit_time <= b.bridge_up_at)
If we wanted to find the most recent bridge down event before a car crossed the bridge, then we would flip the equality.
MATCH_CONDITION (c.exit_time >= b.bridge_down_at)
Finally, we want to ensure we only pair these records on matching regions.
ON c.region_id = b.region_id
How ASOF JOIN Works
ASOF JOIN is designed to simplify and optimize queries that need to match rows based on "closest" relationships, typically involving timestamps. ASOF JOIN queries will always attempt to match a single row in the left table with a single row in the right table.
When you use ASOF JOIN, you're essentially delegating the complex matching logic to Snowflake's query engine. The engine then converts the ASOF JOIN operation into a series of efficient steps:
- Alignment: Rows from both the left and right tables are aligned based on the join keys (specified in the ON clause) and the timestamp expression (in the MATCH_CONDITION).
- Partition-Aware Sorting: The data is sorted within partitions defined by the join keys.
- Closest Match Search: Depending on the comparison operator in the MATCH_CONDITION, Snowflake searches for the closest previous or next value for each row in the left table.
Limitations
While ASOF JOIN is powerful, it's not without its limitations:
- Picky About Operators: ASOF JOIN is like that friend who only eats at four specific restaurants. MATCH_CONDITION can only use
>=
,>
,<
, or<=
. - One Condition to Rule Them All: Your MATCH_CONDITION must be a single condition. It's not a fan of complex logic. If you need multiple conditions, you'll have to get creative!
- Time and Numbers Only: ASOF JOIN is picky about its data types. MATCH_CONDITION only plays nice with dates, datetimes, timestamps, or numbers.
- Equality in the ON Clause: The ON clause can only contain
=
equality conditions. You can't use complex logic or non-equality comparisons here. - No "Between" Matches: ASOF JOIN finds the closest match in one direction. If you need to find events that occurred between two timestamps, you might need to combine multiple ASOF JOINs or use a different approach.
Important Notes
- NULL Handling: ASOF JOIN treats NULLs like a LEFT JOIN would. If there's no "bridge up" event after a car exits, you'll get NULL values for the right-side columns.
- Identical Matches is a Coin Toss: If multiple rows in the right table have the exact same timestamp, the results are non-deterministic because any one of the tying rows might be returned. When such ties exist and you run the same query multiple times, you might get different results.
Advanced Techniques
Once you're comfortable with basic ASOF JOIN usage, consider these advanced techniques:
- Multiple ASOF JOINs: You can chain multiple ASOF JOINs in a single query to match events from multiple tables.
- Bidirectional Matching: By using two ASOF JOINs, you can find both the previous and next events relative to each row in your main table.
- Combining with Window Functions: While ASOF JOIN eliminates the need for many window function use cases, combining the two can solve complex time-series problems.
Wrapping Up
ASOF JOIN is a powerful tool in your Snowflake toolkit, but it only works if your team uses it! At Greybeam, we've built a SQL optimization layer automatically optimizes every query sent to Snowflake, so you can get back to delivering value instead of worrying whether your team is using ASOF JOINs.
Comments ()