Dataform JavaScript utility functions for GA4 tables

Krisztián Korpa
Data to Value
Published in
6 min readNov 24, 2023

--

JavaScript functions in a Dataform workspace

In a previous article, we discussed why Dataform is a great addition to the marketing analytics tool stack and in another, we contemplated how one might go about sessionizing GA4 event tables and selecting session-level source/mediums.

Next up, let’s take a look at a few JavaScript functions that make writing SQL queries in Dataform much less tedious!
Using JavaScript functions in Dataform will have the benefit of much cleaner code and typing a lot less.
As I work a lot with GA4’s BigQuery export, I will use that as an example, but I think most functions will translate well to other data sources.

Ways to Use JavaScript in Dataform

1. You can add it to a SQLX file (inline).

js {
const columnName = "foo";
}

SELECT 1 AS ${columnName} FROM "..."

2. You can reuse code across a single repository with the ‘includes’ folder.

You can store global JavaScript variables or functions in the ‘includes’ folder
// filename is includes/constants.js
const start_date = "'2023-10-01'";
module.exports = { start_date };

And then reference it in a SQLX file:

config {type: "table"}

SELECT * FROM source_table WHERE date > ${constants.start_date}

You might have noticed that the value of ‘const start_date’ is in both double and single quotation marks.
This is the reason why:

start_date is recognized as INT64 if not wrapped in double and single quotation marks

3. You can create a package to reuse code across multiple repositories.

This is useful if you manage a lot of similar repositories that need the same functions. Please read the documentation to see how to create and install packages in Dataform.

I generally use the “separate file” (2.) approach.

Interesting Use Cases

Unnesting event parameters

(credit: Artem Korneev)

One of the recurring themes of querying the GA4 event data is the unnesting of event parameters.
If you see a repetitive pattern in your code, that might be a sign to utilize a JavaScript function, and unnesting event parameters definitely satisfy that criterion!

const getEventParam = (
eventParamName,
eventParamType = "string",
columnName = false
) => {
let eventParamTypeName = "";
switch (eventParamType) {
case "string":
eventParamTypeName = "string_value";
break;
case "int":
eventParamTypeName = "int_value";
break;
case "double":
eventParamTypeName = "double_value";
break;
case "float":
eventParamTypeName = "float_value";
break;
default:
throw "eventType is not valid";
}
return `(SELECT ep.value.${eventParamTypeName} AS ${eventParamName}
FROM UNNEST(event_params) ep WHERE ep.key = '${eventParamName}') AS ${
columnName ? columnName : eventParamName
}`;
};

module.exports = { getEventParam };

In action:

SELECT
event_timestamp,
user_pseudo_id,
${helpers.getEventParam('page_location')},
${helpers.getEventParam('page_referrer')},
${helpers.getEventParam('ga_session_id','int')}
FROM
${ref(constants.GA4_TABLE)}

Selecting categories of columns

(credit: Snowplow Dataform repo)

This one is simple, yet quite effective. You can create template literals of dimensions that are often queried together. This way you can just reference a single JS variable to select multiple columns.
If you want to change the queried columns, you need to only change them in the JS variable instead of replacing them one by one in each table.
You can also add aliases to the template literals.

let USER_FIELDS = `user_id,
domain_userid,
network_userid,
user_ipaddress as ip,
domain_sessionidx,
domain_sessionid`;

let DEVICE_OS_FIELDS = `useragent as ua,
dvce_type,
dvce_ismobile,
dvce_screenheight,
dvce_screenwidth,
os_name,
os_family,
os_manufacturer`;

module.exports = { USER_FIELDS,
DEVICE_OS_FIELDS };

In action:

  SELECT
${event_fields.USER_FIELDS},
${event_fields.DEVICE_OS_FIELDS}
FROM
${ref(constants.EXAMPLE_TABLE)}

Custom channel grouping

Managing custom channel grouping is one of the most obvious use cases for JS functions. Keeping it up-to-date in a single JS file will save you from having to worry about whether all tables use the same logic.
You can achieve similar results with BigQuery’s user-defined functions (UDF), but as the project scales, I think Dataform makes more sense.

Channel grouping is definitely something that can be reused in multiple tables or that is apt to change over time, which makes it a great candidate for a JS function.

const channelGrouping = (source,medium,campaign,gclid) => {
return `
case
when regexp_contains(${campaign}, r'^(.*shop|shopping.*)$')
and regexp_contains(${medium}, r'^(.*cp.*|ppc|paid.*)$')
then 'shopping_paid'
when regexp_contains(${source}, r'^(twitter|facebook|fb|instagram|ig|linkedin|pinterest).*$')
and regexp_contains(${medium}, r'^(.*cp.*|ppc|paid.*|social_paid)$')
then 'social_paid'
when regexp_contains(${source}, r'^(youtube).*$')
and regexp_contains(${medium}, r'^(.*cp.*|ppc|paid.*)$')
then 'video_paid'
when regexp_contains(${medium}, r'^(display|banner|expandable|interstitial|cpm)$')
then 'display'
when regexp_contains(${source}, r'^(google|bing).*$')
and regexp_contains(${medium}, r'^(.*cp.*|ppc|paid.*)$') or
${gclid} is not null
then 'search_paid'
when regexp_contains(${medium}, r'^(.*cp.*|ppc|paid.*)$')
then 'other_paid'
when regexp_contains(${medium}, r'^(.*shop|shopping.*)$')
then 'shopping_organic'
when regexp_contains(${source}, r'^.*(twitter|t\.co|facebook|instagram|linkedin|lnkd\.in|pinterest).*')
or regexp_contains(${medium}, r'^(social|social_advertising|social-advertising|social_network|social-network|social_media|social-media|sm|social-unpaid|social_unpaid)$')
then 'social_organic'
when regexp_contains(${medium}, r'^(.*video.*)$')
then 'video_organic'
when regexp_contains(${source}, r'^(google|bing|yahoo|baidu|duckduckgo|yandex|ask)$')
or ${medium} = 'organic'
then 'search_organic'
when regexp_contains(${source}, r'^(email|mail|e-mail|e_mail|e mail|mail\.google\.com)$')
or regexp_contains(${medium}, r'^(email|mail|e-mail|e_mail|e mail)$')
then 'email'
when regexp_contains(${medium}, r'^(affiliate|affiliates)$')
then 'affiliate'
when ${medium} = 'referral'
then 'referral'
when ${medium} = 'audio'
then 'audio'
when ${medium} = 'sms'
then 'sms'
when ends_with(${medium}, 'push')
or regexp_contains(${medium}, r'.*(mobile|notification).*')
then 'mobile_push'
when (${source} = 'direct' or ${source} is null)
and (regexp_contains(${medium}, r'.*(not set|none).*') or ${medium} is null)
then 'direct'
else '(other)'
end`
}

module.exports = { channelGrouping };

In action:

SELECT
session_id,
${helpers.channelGrouping("session_source","session_medium","session_campaign","gclid")} as channel_grouping
FROM
`clean_ga4_session_table`

Coalesce event params

(credit: Alejandro Zielinsky)

GA4 event parameter processing has a tendency to not be 100% consistent with the intended data type. There are numerous examples where an event parameter can be either a string_value, int_value, double_value, or float_value.

By using the below COALESCE function, we can standardize a parameter no matter which value field it might arrive in.
Do note that the result column can only be a single data type, so you will have to pick which one makes the most sense and CAST all other fields to that type.
It is “string” in our example.

const coalesceEventParam = (param) => {
return `
(select coalesce(value.string_value,
cast(value.int_value as string),
cast(value.double_value as string),
cast(value.float_value as string))
from unnest(event_params) where key = '${param}')`
}

module.exports = { coalesceEventParam };

Get a custom date to create sharded tables

There are several date functions in BigQuery that you can use, but when you want to “shard” your tables (=create separate tables based on a date being its suffix eg. ‘table_name_20231123’), you will bump into some difficulties.

When you run CREATE OR REPLACE TABLE, you cannot really use those built-in functions to give a dynamic suffix to the result table’s name. Except maybe with EXECUTE IMMEDIATE, but I couldn’t make that work in Dataform when I first tried.
Let me know if you did!

The function:

    const getTodayMinusTwoDate = () => {
let date = new Date();
date.setDate(date.getDate()-2);
year = date.getFullYear();
month = (date.getMonth() + 1).toString().padStart(2, "0");
day = date.getDate().toString().padStart(2, "0");
return year + month + day; // returns '20231123'
}

module.exports = { getTodayMinusTwoDate };

In action:

config {
type: "operations",
hasOutput: true
}

create or replace table test.result_${helpers.getTodayMinusTwoDate()} as
select date,channel,clicks,sales
from ${ref('example_table')}
where date = parse_date('%Y%m%d', cast(${helpers.getTodayMinusTwoDate()} as string));

The above operation will create a new sharded table with ‘today minus 2 days date’ whenever this SQLX file is run.
Generally, I think partitioning is a better idea, but sharding might be needed if the result table needs to be exported via a Google Cloud Function for example (and the destination system requires date sharding).

Ending thoughts

One of Dataform’s greatest benefits is the reusability and readability of code which can be achieved by using JavaScript functions.
Every time you find yourself writing the same code again and again, you should give JS functions a whirl!

But proceed with care: if you start to create a function for every little thing, you will have to keep looking at the source file to remember what it does exactly.

Let me know what you think of the examples or if you can think of any other useful functions people can benefit from!

If you have any questions or suggestions about this topic, feel free to connect with me on Linkedin or contact us if you need help with tackling any analytics problems!

--

--