GA4 Query Optimization Strategies with Dataform

Krisztián Korpa
Published in Data to Value
6 min read · Dec 8, 2023


DALL-E’s interpretation of this article.

Article outline:
Concerning GA4
Partitioning vs Sharding
Incrementality
Clustering
Final Thoughts

Previous related articles:
Dataform: The Unsung Hero of Marketing Analytics
Dataform JavaScript Utility Functions for GA4 tables
GA4 Sessionization and Traffic Source handling in BigQuery
A Comprehensive Guide to Error Handling and Alerting in Dataform

By using a Cloud vendor’s data warehouse you outsource a lot of the storage and query optimization tasks that would come with running your own on-premise servers. The great benefit of this is that the barrier to entry is much lower and your costs scale with the amount of data you use. You just have to worry about optimizing your data usage. But how can you do that?

One of the horror stories that can scare the living hell out of a marketing analyst is when someone connects the GA4 raw event-level export to a Looker Studio dashboard. Oh lord, please let there be a billing cap on the project!

While that does seem like material you can frighten your peers with, it is not as uncommon as you might think.

Concerning GA4

Most properties’ GA4 export event volume won’t amount to crazy costs for querying or storage (if used sensibly), but optimizing your processes so that they scale well is always a good idea.

You can also sleep well after you give access to a new report to an eager dashboard-clicking fiend because it won’t break the bank.

While having access to a raw export in BigQuery is one of the greatest features of GA4 (a bit ironic), it also comes with a quite rigid schema. The processing, the columns, the nesting, and the nomenclature are all defined by Google, and you have to make do with what you get. Still, I gotta stop complaining because — as I said — it is a pretty sweet feature to have available for free.

One of the idiosyncrasies of the GA4 export is that it is optimized for storage, not querying. The most obvious sign of this is the number of nested fields.

A sign of storage optimization is the number of nested fields

Nesting can definitely make sense in a lot of cases, but if you run a lot of queries, the constant unnesting is computationally more expensive as well as a bit tedious.

The most obvious optimization step is the sessionization of the event-level data, which I wrote an extensive article about. Or you can create other event-level intermediary tables with unnested parameters. Both tables will be a lot cheaper to query than the raw export.
Always keep a clear purpose in mind for each table you create!
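As a rough sketch of such an intermediary table, the repeated event_params field can be flattened into ordinary columns with a subquery per parameter (the project and dataset names below are placeholders, and page_location is just one example parameter):

```sql
SELECT
  event_date,
  event_timestamp,
  event_name,
  user_pseudo_id,
  -- Pull one parameter out of the repeated event_params field
  (SELECT value.string_value
   FROM UNNEST(event_params)
   WHERE key = 'page_location') AS page_location
FROM `my-project.analytics_123456789.events_*`
```

Downstream queries can then select page_location directly, with no UNNEST in sight.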

Partitioning vs Sharding

Another interesting aspect of the GA4 export is that it uses sharding instead of partitioning. Sharding is the act of creating separate tables with the same schema, but different table name suffixes (e.g. _20231206). Then you can use wildcards and table suffixes to only query a certain date range (instead of the whole dataset).
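For instance, querying only the last 7 days of the sharded export looks something like this (project and dataset names are placeholders):

```sql
-- Query only the last 7 days of the sharded GA4 export
SELECT
  event_date,
  event_name,
  COUNT(*) AS event_count
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY event_date, event_name
```

BigQuery only scans the shards whose suffix falls inside the range, so you don’t pay for the whole dataset.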

This choice is interesting because even Google itself advises using partitioning instead of sharding.

But fear not!

When you create new tables with Dataform (be it session or event-level), you can partition them with a simple config block setting in the SQLX file.
Usually partitioning on a date column makes the most sense.

config {
  type: "incremental",
  tags: ["daily"],
  schema: "europe",
  description: "Sessionization of GA4 raw data",
  bigquery: {
    partitionBy: "date"
  }
}

When you connect your new table to Looker Studio, you can use the partitioned date dimension as the date range.

You can use the partitioned date field as the date range dimension
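Ad hoc queries benefit too: filtering on the partitioning column lets BigQuery prune whole partitions before scanning anything. A minimal sketch, assuming the sessionized table and column names from the config above:

```sql
-- Only the November partitions are scanned
SELECT
  date,
  COUNT(*) AS sessions
FROM `my-project.europe.ga4_sessions`
WHERE date BETWEEN '2023-11-01' AND '2023-11-30'
GROUP BY date
```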

If you are interested in how you can create sharded tables (because sometimes it is a requirement), feel free to check my article about JavaScript utility functions!

Incrementality

In most analytics use cases, the data that arrives in BigQuery does not change further. It has already been processed; what else could happen to an event that already occurred in the past?

Well, in the case of GA4, the export can change for up to 72 hours after collection for processing purposes, or at any time for bug-fixing exercises.

Those 72 hours are the window within which Analytics accepts Measurement Protocol hits. Accounting for “any time” is a harder task without refreshing the whole table, but that might be a margin of error you can live with.
Or if you can’t…well that’s your choice.

This is an important caveat because if you want to have the most complete dataset, you have to account for changes that might happen after the GA4 event table was created.

So, what is incrementality?

Incrementality is the principle of only adding new rows to an already existing table. You also have the option to update existing rows if one of their dimensions has changed. This way, you avoid querying the same data over and over again.
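In Dataform, the simplest form of this uses the built-in when() and incremental() functions to append only rows newer than what is already in the table. A minimal sketch (table and column names are assumptions, not a complete sessionization query):

```sql
config {
  type: "incremental",
  bigquery: {
    partitionBy: "date"
  }
}

SELECT
  PARSE_DATE('%Y%m%d', event_date) AS date,
  event_name,
  user_pseudo_id
FROM ${ref("events")}
-- On incremental runs, only pick up dates newer than what the table already holds
${when(incremental(),
  `WHERE PARSE_DATE('%Y%m%d', event_date) > (SELECT MAX(date) FROM ${self()})`
)}
```

On the first run (or with a full refresh) the whole query executes; on subsequent runs the WHERE clause restricts the scan to new data only.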

There are multiple ways you can tackle incrementality in Dataform. Taneli has written a very insightful and extensive article about these techniques; feel free to check it out!

To handle incrementality with plain SQL, you would have to construct a MERGE statement in BigQuery which is by no means impossible, but it ain’t no fun either.

By using either of the strategies from Taneli’s article and utilizing Dataform’s out-of-the-box solution, you don’t have to construct a query like this (which is what runs under the hood):

This is what Dataform runs under the hood
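As a simplified sketch, the generated statement looks roughly like this (assuming session_id as the unique key and a handful of made-up columns):

```sql
MERGE `my-project.europe.ga4_sessions` AS target
USING (
  -- the freshly computed increment from the SQLX query
  SELECT * FROM new_session_rows
) AS source
ON target.session_id = source.session_id
WHEN MATCHED THEN
  -- refresh rows whose dimensions may have changed
  UPDATE SET target.session_source_medium = source.session_source_medium
WHEN NOT MATCHED THEN
  INSERT ROW
```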

If this is not a useful feature, I don’t know what is!

Clustering

Clustering is essentially a user-defined sort order that can improve query performance and reduce query costs. It works similarly to an ORDER BY clause because it groups and sorts data into “storage blocks”.

The value of clustering comes into play when you filter or aggregate data based on the clustered column.

Adding clustering to a table in Dataform is as simple as adding it to the config block of a SQLX file.

config {
  type: "incremental",
  tags: ["daily"],
  schema: "europe",
  description: "Sessionization of GA4 raw data.",
  bigquery: {
    partitionBy: "date",
    clusterBy: ["session_source_medium"]
  }
}

Choose a column (or several) that you expect to be used a lot for aggregating or filtering data. BigQuery also has a partitioning and clustering recommendation feature that can be useful, though it is probably most relevant for very large BigQuery projects.
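With the config above, any query that filters on the clustered column lets BigQuery skip storage blocks that can’t contain matching rows. A small illustrative query (table name assumed from the earlier config):

```sql
-- Blocks without 'google / organic' sessions are skipped entirely
SELECT
  date,
  COUNT(*) AS sessions
FROM `my-project.europe.ga4_sessions`
WHERE session_source_medium = 'google / organic'
GROUP BY date
```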

Final Thoughts

When first dealing with the GA4 event export, you might not concern yourself with things like query optimization, but seemingly minor settings can pay huge dividends as your project grows.

With the help of Dataform, these optimization strategies can be managed in a single SQLX file.
It doesn’t get much better than that!

If you have any questions or suggestions about this topic, feel free to connect with me on LinkedIn or contact us if you need help with tackling any analytics problems!
