Unnest bigquery example

Unnest bigquery example

Using BigQuery is a great way to generate some custom in-depth analysis of your Google Analytics data, but to really unlock that data, it helps to know a few tricks. Today, let’s talk about one of the most important ones: Using the UNNEST function to analyze event parameters and user properties that you receive along with your Analytics data.

One of the trickier parts of working with Firebase data in BigQuery — and this applies not just to Analytics data but to Crashlytics data, too — is that your data is not organized in nice little rows and columns like this:

Unnest bigquery example

How you’re probably imagining your BigQuery data

Instead, your data consists of rows that maybe look a little more like JSON objects, where columns could contain a single bit of data, but they could also contain arrays, structs, or even arrays of structs. Maybe something a little more like this:

Unnest bigquery example

What your BigQuery data is really like

Working with this data can be weird at first, but once you learn a couple of tricks, it’s not so bad.

Let’s start by looking at our public sample data for our game of Flood-it. Feel free to follow along; this data is available for anybody to play with.

In particular, I’m interested in the level_complete_quickplay event. This is the event that we record when the user successfully completes a quickplay level in the game. Here, let’s take a look at a few:

SELECT *
FROM `firebase-public-project.analytics_153293282.events_20180915`
WHERE event_name = "level_complete_quickplay"
LIMIT 10

If you run this, you’ll see a number of level complete events, along with a couple of other event parameters that seem interesting. We’ve got a value parameter, which is recording the number of moves it took for a player to complete the level, and a board parameter, which records the board size (S, M, or L)

Unnest bigquery example

You might also notice that each of these parameters doesn’t have a single value column, but several: int_value, string_value, double_value and float_value.

Basically, you can think of this as kind of a struct that we use to store your parameter values, and in practice most of them end up being nil. But it’s a nice way of making sure we can take a number of different parameters with a bunch of different formats and store them in a consistent way.

Unnest bigquery example

So, let’s figure out the average user score just by looking at event parameters where key is equal to “value” . Seems pretty simple, right?

SELECT *
FROM `firebase-public-project.analytics_153293282.events_20180915`
WHERE event_name = "level_complete_quickplay"
AND event_params.key = "value"
LIMIT 10

Unnest bigquery example

Oh, dear. That’s… a really weird error message.

The problem here is that event_params is essentially an array (actually in BigQuery parlance it’s a “repeated record”, but you can think of it as an array). So while it might contain several rows that themselves have a key field, it doesn’t have one itself.

This is where the UNNEST function comes in. It basically lets you take elements in an array and expand each one of these individual elements. You can then join your original row against each unnested element to add them to your table.

To use a simpler example, let’s imagine we have a table full of spaceships. And crew is an array, much like event_params was in our analytics tables.

Unnest bigquery example

Right now, if I wanted to find spaceships that contain a crew member named Zoe, I couldn’t do that with a line like this.

SELECT * FROM `spaceships` WHERE crew = "Zoe"

But if I were to write something like this…

SELECT * FROM `spaceships` CROSS JOIN UNNEST(crew) as crew_member

What BigQuery will do is take every individual member in my crew array, and add it on as a new column in my row called crew_member It will repeat my original row as needed to accompany each new value. Kinda like in this nifty animation:

Unnest bigquery example

Once I’ve done that, it’s easy to select spaceships with crew members named Zoe.

SELECT * FROM `spaceships` 
CROSS JOIN UNNEST(crew) as crew_member
WHERE crew_member = "Zoe"

Unnest bigquery example

You’ll find that in practice, though, most BigQuery developers will replace the CROSS JOIN with a comma, like so:

SELECT * FROM `spaceships`,
UNNEST(crew) as crew_member
WHERE crew_member = "Zoe"

It still does the same thing, it just looks cooler. (BigQuery developers are all about looking cool.)

So going back to our example, we essentially have an array (or repeated record) of event_params. So if I were to say:

SELECT * FROM `my_analtyics_table`,
UNNEST(event_params) as param

…what BigQuery will do is take each individual event parameter and expand them out into a new column called param, and repeat the original row as needed. Like this:

Unnest bigquery example

You’ll notice that param itself is still a nested object. But it’s a single record, not a repeating one. (Again, kinda like a struct.) So we can query for things like param.key or param.value.int_value. Specifically, we can now look just for rows where param.key is equal to “value”.

Since, in practice, it’s a little weird to have the full repeated event_params object alongside each unnested parameter, I usually drop it and just look at the param object instead.

So with all that in mind, let’s go back to the task of analyzing the average score for users who complete a quickplay level. I can now grab just those events along with the unnested value parameter.

SELECT event_name, param
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"

Unnest bigquery example

And then, it’s really easy to grab the actual value of the value parameter by looking at param.value.int_value.

SELECT event_name, param.value.int_value AS score
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"

Unnest bigquery example

From there, I can perform interesting tasks like calculating the mean, getting some quantiles, and figuring out the standard deviation…

SELECT AVG(param.value.int_value) AS average, 
APPROX_QUANTILES(param.value.int_value, 2) AS quantiles,
STDDEV(param.value.int_value) AS stddev
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"

Unnest bigquery example

Hey, maybe while I’m at it, let’s make a quick histogram of all the values!

SELECT param.value.int_value AS score, COUNT(1) AS count
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND param.key = "value"
GROUP BY 1
ORDER BY 1

Unnest bigquery example

So… this is interesting. There’s a pretty big peak around 21 or 22 moves… but if you keep looking, it seems like there’s also a couple of smaller peaks around 29 moves and in the 34–36 range.

The reason for that is the other parameter I was talking about — the board parameter. While the vast majority of games are being played on a small board, there’s a number of games being played on medium and large boards too, and those probably account for those smaller peaks.

In fact, we can kinda see that by looking at both the board and the value parameters for our event.

SELECT param
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")

Sure enough, it seems like the “value” parameter is higher when it follows a board of M or L.

Unnest bigquery example

But how can we actually prove this out in BigQuery? Right now, making this analysis is kinda difficult because we have different values in different rows.

Now, there is a way we can bunch these rows together. If we add back the pseudo_user_id (essentially, a unique ID assigned to each app instance) and event_timestamp to each of our columns, we can then group together parameters for the same event based on those two values.

SELECT MAX(if(param.key = "value", param.value.int_value, NULL)) AS score,
MAX(if(param.key = "board", param.value.string_value, NULL)) AS board_type
FROM (
SELECT user_pseudo_id, event_timestamp, param
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")
)
GROUP BY user_pseudo_id, event_timestamp

Unnest bigquery example

And then we can analyze score by board type…

SELECT ANY_VALUE(board_type) as board, AVG(score) as average_score
FROM (
SELECT MAX(if(param.key = "value", param.value.int_value, NULL)) AS score,
MAX(if(param.key = "board", param.value.string_value, NULL)) AS board_type
FROM (
SELECT user_pseudo_id, event_timestamp, param
FROM `firebase-public-project.analytics_153293282.events_20180915`,
UNNEST(event_params) AS param
WHERE event_name = "level_complete_quickplay"
AND (param.key = "value" OR param.key = "board")
)
GROUP BY user_pseudo_id, event_timestamp
)
GROUP BY board_type

Unnest bigquery example

But this seems like a pretty awkward way of doing things. Wouldn’t it be better if we could find a way of more easily getting multiple values in the same row? Well it turns out you can, and we’ll find out how… in my next post! (Woo! Cliffhanger ending!!)

How do you Unnest in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.

What does Unnest function do?

The UNNEST function returns a result table that includes a row for each element of the specified array. If there are multiple ordinary array arguments specified, the number of rows will match the array with the largest cardinality.

What is flatten in BigQuery?

FLATTEN. When you query nested data, BigQuery automatically flattens the table data for you.