A Comprehensive Guide to Error Handling and Alerting in Dataform

Krisztián Korpa
Published in Data to Value
5 min read · Dec 1, 2023


DALL-E image of a robot helping you detect errors. It even has 2 coffees at hand!

Article outline:
The Importance of Error Handling
What Can Go Wrong?
Handling Errors in Google Workflows
Assertions in Dataform
How to Catch Errors?
Final Thoughts

Previous related articles:
Dataform: The Unsung Hero of Marketing Analytics
Dataform JavaScript Utility Functions for GA4 tables
GA4 Sessionization and Traffic Source handling in BigQuery

The Importance of Error Handling

On your journey of implementing a data pipeline, you can encounter different types of errors, be it data quality or pipeline orchestration issues.

And that is something to be expected, because let’s be honest, we all make mistakes!
Even if you have never made a mistake in your life: APIs still might change, a service might be down, authorization might be revoked or expired, and so on.

How severe the consequences of an oversight are depends largely on how quickly you catch the error, ideally allowing you to fix and rerun pipelines without stakeholders even noticing.

That’s why it is always a good policy to set up an error handling and alerting system that will result in you
- sleeping more deeply
- not looking over your shoulders all the time
- not picturing your stakeholders’ faces when they figure out their most important reports are flawed.

What Can Go Wrong?

Some things you know beforehand: primary keys or unique IDs should always be unique and non-null, metrics (like sessions or conversions) must always be above a certain threshold, and dates should not be missing even if they have metrics that are 0.
I’m sure you can list out a few yourself when you take a look at your tables!

However, some issues only come to light once the pipeline is already in production. Whether they are related to the codebase, orchestration, or APIs, they can arise at any time and must be dealt with.

Realistically you cannot prevent every issue beforehand, which is why it is so important to keep iterating on fixing errors while catching them early.
Preventing the same mistake from happening twice is much easier than predicting it happening once.
Still, it’s a good idea to rule out obvious ones from the beginning.

Handling Errors in Google Workflows

Let’s start with orchestration issues as that’s the first step of every pipeline.

Most people seem to use Google Workflows to trigger Dataform jobs, and I encountered a weird issue with this setup a while back.
When I used the Workflow YAML suggested by the documentation, this error came up (quite randomly, I might add):

Step createCompilationResult kept failing randomly

The error message was not super helpful either.

Github connection closed during operation (HTTP error 400)

I modified the YAML a bit to retry when it encounters a 400 error.

main:
  steps:
    - init:
        assign:
          - parent: projects/gcp-project/locations/europe-west3/repositories/client-dataform-repo
    - createCompilationResult:
        try:
          call: http.post
          args:
            url: ${"https://dataform.googleapis.com/v1beta1/" + parent + "/compilationResults"}
            auth:
              type: OAuth2
            body:
              gitCommitish: main
          result: compilationResult
        retry:
          predicate: ${custom_predicate}
          max_retries: 5
          backoff:
            initial_delay: 2
            max_delay: 60
            multiplier: 2
    - createWorkflowInvocation:
        call: http.post
        args:
          url: ${"https://dataform.googleapis.com/v1beta1/" + parent + "/workflowInvocations"}
          auth:
            type: OAuth2
          body:
            compilationResult: ${compilationResult.body.name}
            invocationConfig:
              fullyRefreshIncrementalTablesEnabled: false
              includedTags: ["daily"]
              includedTargets: []
              transitiveDependenciesIncluded: true
              transitiveDependentsIncluded: false
        result: workflowInvocation
    - complete:
        return: ${workflowInvocation.body.name + " complete"}

custom_predicate:
  params: [e]
  steps:
    - what_to_repeat:
        switch:
          - condition: ${e.code == 400}
            return: true
    - otherwise:
        return: false

With this change, the error was resolved!

Feel free to check out the Workflows documentation on retrying steps to dissect the above code.
I used custom_predicate because the default retry policy does not retry on the 400 error status.
You might need to tweak the workflow depending on the type of error you encounter!
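For instance, if you also want to retry on common transient statuses, you can extend the predicate's condition. A minimal sketch, where adding 429 and 503 is my own choice rather than something from the original setup:

```yaml
custom_predicate:
  params: [e]
  steps:
    - what_to_repeat:
        switch:
          # Retry on 400 (the GitHub issue above) plus typical transient statuses
          - condition: ${e.code == 400 or e.code == 429 or e.code == 503}
            return: true
    - otherwise:
        return: false
```

Keep the list of retryable codes narrow: retrying on every error just delays the alert for failures that were never going to succeed.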

Assertions in Dataform

Assertions in Dataform are essentially SQL queries that should not return any rows. If they do, that means the rules that we laid out for certain tables were violated.

You can add (simpler) assertions to the config block of a SQLX file:

config {
  type: "table",
  assertions: {
    uniqueKey: ["user_id"],
    nonNull: ["user_id", "customer_id"],
    rowConditions: [
      'signup_date is null or signup_date > "2019-01-01"',
      'email like "%@%.%"'
    ]
  }
}

Or you could create separate SQLX files with config type ‘assertion’:

config {
  type: "assertion"
}

-- Checks if a date in the last 7 days had 0 sales data (flagging potential errors)
with sales_per_date as (
  select
    date,
    sum(sales) as sales
  from ${ref("eu_product_data_feed")}
  where date >= date_sub(current_date(), interval 6 day)
  group by 1
)

select
  date,
  sales
from sales_per_date
where sales = 0

By default, a failing assertion does not block dependent queries from running unless the assertion was explicitly declared as a dependency.
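To make that dependency explicit, you can list the assertion in the downstream table's dependencies. A sketch, assuming the date check above lives in a file named sales_date_check.sqlx (a hypothetical name for this example):

```sqlx
config {
  type: "table",
  // This table will not build if the assertion fails
  dependencies: ["sales_date_check"]
}

select *
from ${ref("eu_product_data_feed")}
```

This is useful for gating important downstream tables, at the cost of halting that branch of the pipeline until you fix the data.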

Another cool use case in my opinion:

config {
  type: "assertion"
}

with date_range as (
  select
    date_sub(current_date(), interval 6 day) as min_date,
    date_sub(current_date(), interval 2 day) as max_date
),

distinct_dates as (
  select distinct date
  from ${ref("eu_product_data_feed")}
),

all_dates as (
  select min_date + interval n day as full_date
  from
    date_range,
    unnest(generate_array(0, date_diff(max_date, min_date, day))) as n
)

select full_date
from all_dates
left join distinct_dates
  on all_dates.full_date = distinct_dates.date
where distinct_dates.date is null

This assertion checks that every date in the range defined by the date_range CTE is present in the referenced table (the “eu_product_data_feed” table in this example) and returns any dates that are missing.

How to Catch Errors?

To catch errors in GCP, we can create an alert based on filtered logs that we can use to send emails and/or Slack notifications when an error occurs.

To do that, navigate to Logs Explorer.

Filter out the error logs that you want to be alerted by.
Here is a starting point that I usually use:

(resource.type="dataform.googleapis.com/Repository" OR
resource.type="workflows.googleapis.com/Workflow") AND
severity = "ERROR" AND
jsonPayload.@type != "type.googleapis.com/google.cloud.dataform.logging.v1.WorkflowInvocationCompletionLogEntry"

Click ‘Create alert’ if you are satisfied with your query.

Creating an alert from error logs in GCP

In the final step (Who should be notified?) you can set up different notification channels (like email or Slack) so that you can be immediately alerted when a new entry is made that satisfies your query criteria.

Slack notification when Dataform or Workflow error is triggered

Final Thoughts

I don’t think error handling and alerting need much justification, and I hope I gave you a few tools and ideas to prevent and mitigate issues that might come up in a Dataform pipeline.

Let us know if you can think of any other useful techniques to catch and handle errors!

If you have any questions or suggestions about this topic, feel free to connect with me on LinkedIn, or contact us if you need help tackling any analytics problems!
