GA4 Sessionization and Traffic Source handling in BigQuery

Krisztián Korpa
Data to Value
Published in
9 min readNov 17, 2023

--

Piecing the puzzle together. According to DALL-E ;-)

One of the biggest changes from UA to GA4 is how the underlying data is modeled. If you still have a few blurry memories of the UA reports, you might recall that everything was centered around a concept called “session”. A session could have a landing page, acquisition sources, page views, conversions, and so on. It is basically a logic of grouping interactions together from the same user.

What constitutes a session is also quite arbitrary.

  • How long can a user be inactive before a new session starts?
  • Do we take into account if the tab is active or if the user is interacting with the page?
  • If the user re-enters the website from a different acquisition source during the same session, how do we handle that?
  • What if the session spans across multiple days?

However you define sessions you have to be consistent with it or communicate changes very well to avoid misunderstandings and data discrepancies.

Note: check this documentation to see how differently UA and GA4 generally handle sessionization.

I keep mentioning the word “sessionization”.

What the hell does it even mean?

Sessionization is the process of taking an event-level table (such as the GA4-BigQuery export) and querying it, so that in the result table every row will correspond to a different session (=the table is ‘session-level’). Querying the session-level table directly or connecting it to Looker Studio will result in much lower costs than doing the same with the raw, event-level export.

While there is no perfect definition or method of how a session should be calculated, (arguably) you cannot really avoid using it as it’s a fundamental aspect of how people use the internet.

If you use the GA4 UI, I highly recommend you read this article from Charles Farina to see just how complex attribution can be under the hood (and how many caveats GA4 has in this area as well).

Don’t forget: you always do a form of attribution modeling, it is just a matter of understanding what you are looking at. Even if you resort to all the standard reports and settings, you are subject to the platform’s definition of it.

If you use the GA4-BigQuery export, you have the luxury of creating your own sessionization and attribution logic. While being able to do this is very powerful, it also comes with great responsibility (did I just quote Uncle Ben?) and requires a profound understanding of how the event-level export and SQL work.

Yes, unfortunately, there are quite a few easter eggs here as well. The most notable one is the “traffic_source” record that contains “name” (=campaign), “medium” and “source” nested fields.
At first glance you might assume they are event or session-level, but these are the first acquisition channels the user visited the website from (they are user-level). Meaning that these values will never change, no matter how many times that user visits the website.

Previously we had to unnest source, medium etc. parameters from the “event_params”, but recently a “collected_traffic_source” record was also added that is event-level and does not require unnesting.
As this addition is not retroactive (appeared first in May-June 2023) if you query data before it was populated, you will get a bunch of null values. Unnesting event_params might just be the most stable solution for a while.

Traffic source related dimensions in BigQuery
The three records that can be used for attribution

One of the biggest challenges of sessionization is how you decide to handle which acquisition channel the user came from. Also, how do you handle multiple sources within the same session?

Let’s take a look at a few potential solutions and what pros and cons they have!

MAX()

Check out this example query:

select
user_pseudo_id,
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
max((select value.string_value from unnest(event_params) where key = 'source')) as source,
max((select value.string_value from unnest(event_params) where key = 'medium')) as medium,
max((select value.string_value from unnest(event_params) where key = 'campaign')) as campaign
from
`datatovaluenl-ga4.analytics_12356789.events_20231114`
group by 1,2

The MAX() function takes the biggest non-null value of the aggregated dimension (eg. source). But how does it handle strings?

You can test how it works with strings like this:

select max(x) as max
from unnest(['1234', '12345', '123', '12']) as x;

Which string do you think the above query should return?

If you change the array elements to regular source/medium strings (eg. ‘google / organic’), you will see the same logic applied.

It returns the longest string that was in the session.

If there is more than 1 channel in the same session, you probably should use a different approach in selecting which one should be used in the end.

The pro of the MAX() function is that it is very easy and readable, but not much else.

ARRAY_AGG()

This was my preferred approach for a while until I encountered a few issues with one of my client’s data.

Can you figure out what it is based on the below query?

select
user_pseudo_id,
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
array_agg((select value.string_value from unnest(event_params) where key = 'source') ignore nulls)[offset(0)] as session_source,
array_agg((select value.string_value from unnest(event_params) where key = 'medium') ignore nulls)[offset(0)] as session_medium,
array_agg((select value.string_value from unnest(event_params) where key = 'campaign') ignore nulls)[offset(0)] as session_campaign,
array_agg((select value.string_value from unnest(event_params) where key = 'term') ignore nulls)[offset(0)] as session_term,
array_agg((select value.string_value from unnest(event_params) where key = 'content') ignore nulls)[offset(0)] as session_content,
array_agg((select value.string_value from unnest(event_params) where key = 'gclid') ignore nulls)[offset(0)] as gclid,
array_agg((select value.string_value from unnest(event_params) where key = 'campaignid') ignore nulls)[offset(0)] as campaign_id,
array_agg((select value.string_value from unnest(event_params) where key = 'adgroupid') ignore nulls)[offset(0)] as adgroup_id,
array_agg((select value.string_value from unnest(event_params) where key = 'adid') ignore nulls)[offset(0)] as ad_id
from
`datatovaluenl-ga4.analytics_12356789.events_20231114`
group by 1,2

The ARRAY_AGG function takes all the values of the parameter from the same session and puts them in an array. The ‘ignore nulls’ part does what it says — leaves all null values out of the array. The [offset(0)] part selects the first element in the array, which is the first source the session originated from.

You can check how it works like this:

select array_agg(x ignore nulls) as array_agg
from unnest(['google / organic',null,'google / cpc','email / email']) as x;

And if you want to select the first value:

select array_agg(x ignore nulls)[offset(0)] as array_agg
from unnest(['google / organic',null,'google / cpc','email / email']) as x;

Looks pretty solid, so what’s the catch?

What happens when the user enters the website from 3 different traffic sources during the same session?

If a session has multiple acquisition channels, ARRAY_AGG will select the first non-null value per parameter. This means the above session would result in this:

session_source: google
session_medium: organic
session_campaign: organic
session_term: search-usa
session_content: watches

If you only query for source, medium, and campaign ARRAY_AGG works nicely (assuming you don’t have combinations where one of them is null), but if you add parameters that might be null for some channels and non-null for others, you will have some mixing going on.

FIRST_VALUE()

select
user_pseudo_id,
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
first_value(collected_traffic_source.manual_source) OVER (PARTITION BY user_pseudo_id, (SELECT value.int_value FROM unnest(event_params) WHERE key = 'ga_session_id') ORDER BY event_timestamp) as session_source,
first_value(collected_traffic_source.manual_medium) OVER (PARTITION BY user_pseudo_id, (SELECT value.int_value FROM unnest(event_params) WHERE key = 'ga_session_id') ORDER BY event_timestamp) as session_medium,
first_value(collected_traffic_source.manual_campaign_name) OVER (PARTITION BY user_pseudo_id, (SELECT value.int_value FROM unnest(event_params) WHERE key = 'ga_session_id') ORDER BY event_timestamp) as session_campaign
from
`datatovaluenl-ga4.analytics_12356789.events_20231114`

Credit: Luka Cempre from Adswerve. If you want to see how Luka creates different attribution models, check out his article!

FIRST_VALUE() is a window function that essentially creates the same logic as we did with ARRAY_AGG above, which is to select the first occurrence of the parameter partitioned by the session ID and ordered by event timestamp.
To my knowledge, a window function is a bit more computationally expensive than the ARRAY_AGG function (it is probably marginal in a GA4 dataset), but the query syntax might be easier to read.

Unfortunately, it runs into the same issue as ARRAY_AGG above.

AS STRUCT + ARRAY_AGG()

I’ve first seen Taneli use this solution, so I’m gonna credit him, but I’ve seen others use this approach as well on Measure Slack. This is my preferred method currently.

The query is gonna be a bit more complex, but bear with me.

-- First, we create a struct from the same event’s parameters called ‘traffic_source’
with prep as (
select
event_timestamp,
user_pseudo_id,
concat(user_pseudo_id, (select value.int_value from unnest(event_params) where key = 'ga_session_id')) as session_id,
(select
as struct
(select value.string_value from unnest(event_params) where key = 'source') as source_value,
(select value.string_value from unnest(event_params) where key = 'medium') as medium,
(select value.string_value from unnest(event_params) where key = 'campaign') as campaign,
(select value.string_value from unnest(event_params) where key = 'term') as term,
(select value.string_value from unnest(event_params) where key = 'content') as content,
(select value.string_value from unnest(event_params) where key = 'gclid') as gclid
) as traffic_source

from
`datatovaluenl-ga4.analytics_12356789.events_20231114`
),

-- Then create an array from events that contain non-null source, medium or gclid (either), and select the first one
prep2 as (
select
user_pseudo_id,
session_id,
array_agg(
if(coalesce(traffic_source.source_value, traffic_source.medium,traffic_source.gclid) is not null, traffic_source, null)
order by event_timestamp asc limit 1)[safe_offset(0)] as session_first_traffic_source
from prep
group by 1,2)

-- Fix gclid bug and add ‘direct’ and ‘(none)’ if source, medium is still null. Added lower() just for good measure
select
user_pseudo_id,
session_id,
lower(if(session_first_traffic_source.gclid is not null,'google',ifnull(session_first_traffic_source.source_value,'direct'))) as session_source,
lower(if(session_first_traffic_source.gclid is not null,'cpc',ifnull(session_first_traffic_source.medium,'(none)'))) as session_medium,
session_first_traffic_source.campaign as session_campaign,
session_first_traffic_source.term as session_term,
session_first_traffic_source.content as session_content,
session_first_traffic_source.gclid as gclid
from prep2

By first creating a struct out of the unnested event parameters, we group all the queried dimensions to their respective events, which means when we are using ARRAY_AGG in the second step, we select the “group” (= source/medium/etc.) as a whole, not the individual parameters.

Considering all touchpoints

One of the idiosyncrasies of using ‘AS STRUCT + ARRAY_AGG()’ together is that in the first CTE (prep) you create a struct of all the combinations of source/medium/etc. that the session had, and select the first non-null one in the second CTE.

But what if you don’t want to do that?

Looking at only the first acquisition channel could devalue a few mid-session sources in some cases. If you want to do attribution modeling for example, you could unnest all the distinct sources the session had and use all the touchpoints instead of a single one per session.
This way, you can build on top of the already existing CTE structure instead of creating a whole different query for the touchpoints.

Last non-direct click

If you have been around the block with GA4, you know that the UI uses the ‘last non-direct click’ attribution model, which overwrites the session-level ‘direct / (none)’ to the last found source/medium in a certain lookback window.

While this method is chosen by Google (would that increase Google’s share of the conversions? Who knows...), it is by no means a “must” or a “best practice” per se.

Still, if you want to go this route — as Taneli also pointed it out in his article — , I recommend adding the non-direct logic to the session-level table you created. Doing it on the event-level table in the same query will result in some extra costs.
Check out his article to see how he uses the LAST_VALUE() function to apply the non-direct aspect!

Finishing up

As a wise man once said: ‘there are many ways to skin a cat’. It certainly holds true when we are talking about GA4 session-level source/mediums in BigQuery. As the raw data export is now event-level (as opposed to UA 360, which was session-level), you have to find your own way of sessionizing it in order to reduce query costs and to overall make sense of it.

Whichever solution you choose to go with, it is always better if it is a conscious decision and with the caveats in mind, rather than going with a copy-paste approach and getting surprised down the line.

Let me know which method you prefer, or if there is any I missed!

If you have any questions or suggestions about this topic, feel free to connect with me on Linkedin or contact us if you need help with tackling any analytics problems!

--

--