I am unsure whether I am clustering correctly. Basically, I am looking at the GCP billing info of, say, 50 clients. Each client has a billing_account_id, and I cluster on that column. I use the clustered table for a Data Studio dashboard.

See the SQL query below for what I do right now:

CREATE OR REPLACE TABLE `dashboardgcp`
  PARTITION BY DATE(usage_start_time)
  CLUSTER BY billing_account_id
  AS
SELECT
  *
FROM
  `datagcp`
WHERE
 usage_start_time BETWEEN TIMESTAMP('2019-01-01')
  AND TIMESTAMP(CURRENT_DATE)

The table is successfully clustered this way, but I am not seeing a noticeable increase in query performance!

  • I think that would depend on the query you are running. Commented May 10, 2019 at 11:01
  • Cheers for the reply! I should add that by query performance I mean the loading times of a Data Studio report. The query checks which data should be shown in the report based on the person who is accessing it and the billing_account_id. A person can only have one billing_account_id, so I thought that by clustering on billing_account_id I should see an increase in dashboard performance. Commented May 10, 2019 at 11:06

1 Answer

So I thought by clustering it with billing_ID I should see an increase in dashboard performance

Please consider the following points:

Cluster structure
A clustering specification is an ordered list of columns, nested like boxes from outer to inner. As stated in the BigQuery documentation:

When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.

This means that, as @Gordon wrote, the WHERE clause of your query needs to filter on the clustering columns from the outermost to the innermost one to get the most out of your clustering. In your case, if userId is part of the WHERE clause, you need to change your clustering columns to match.
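As a rough sketch of what that could look like, the table below is clustered on two columns and the dashboard query filters on them in the same order. The names `user_id` and `cost`, the literal filter values, and the assumption that `user_id` is a top-level column of `datagcp` are all hypothetical; adjust them to your actual schema.

```sql
-- Hypothetical: assumes datagcp has a top-level user_id column.
CREATE OR REPLACE TABLE `dashboardgcp`
  PARTITION BY DATE(usage_start_time)
  CLUSTER BY billing_account_id, user_id  -- outer column first, inner column second
AS
SELECT
  *
FROM
  `datagcp`
WHERE
  usage_start_time BETWEEN TIMESTAMP('2019-01-01')
  AND TIMESTAMP(CURRENT_DATE());

-- Filters on the partitioning column, then on the clustering columns
-- in their declared order (billing_account_id before user_id).
SELECT
  SUM(cost) AS total_cost                         -- cost is an assumed column
FROM
  `dashboardgcp`
WHERE
  usage_start_time >= TIMESTAMP('2019-04-01')
  AND billing_account_id = 'XXXXXX-XXXXXX-XXXXXX'  -- placeholder value
  AND user_id = 'someone@example.com';             -- placeholder value
```

A query that filters only on `user_id` and skips `billing_account_id` would probably not benefit much from this clustering, because the data is sorted by `billing_account_id` first.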

Cluster limitation
Clustering typically works better for queries that scan more than 1 GB of data. If you are not scanning that much data, you won't see the improvement you are looking for.
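One way to check how much data the dashboard queries actually scan is to look at recent jobs in the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view. This is only a sketch: the `region-us` qualifier and the one-day window are assumptions, so substitute the region where your dataset lives.

```sql
-- Lists recent queries that touched the dashboard table and how much they scanned.
-- Assumption: the dataset is in the US multi-region (`region-us`).
SELECT
  job_id,
  creation_time,
  total_bytes_processed / POW(1024, 3) AS gib_processed,
  query
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND query LIKE '%dashboardgcp%'
ORDER BY
  creation_time DESC
LIMIT 20;
```

If `gib_processed` is well below 1 for the dashboard queries, clustering is unlikely to change the loading times much.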

Cluster with Ingestion tables
Assuming your data is not static and you keep adding data to your table, datagcp, you need to be aware that clustering (sorting the data by the clustering columns) is a process that BigQuery performs offline, apart from the insert operation and separately from partitioning.
The side effect is that your clustering "weakens" over time. To solve this, you will need to use a MERGE statement to rebuild the clustering and get the most out of it.

From the docs:

“Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.
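A minimal sketch of such a MERGE, assuming you only want to rewrite recent data: it deletes the rows in `dashboardgcp` from a cutoff date onward and re-inserts them from `datagcp` in a single statement, so that range is written again in one job. The cutoff date is an arbitrary example, and the sketch assumes both tables have the same columns in the same order.

```sql
-- Rewrites everything from the cutoff onward in one atomic statement.
-- ON FALSE means no source row ever matches a target row, so every
-- target row in range is deleted and every source row is inserted.
MERGE `dashboardgcp` AS t
USING (
  SELECT
    *
  FROM
    `datagcp`
  WHERE
    usage_start_time >= TIMESTAMP('2019-05-01')   -- example cutoff
) AS s
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND t.usage_start_time >= TIMESTAMP('2019-05-01') THEN
  DELETE
WHEN NOT MATCHED THEN
  INSERT ROW;
```

Running the original CREATE OR REPLACE TABLE statement again would also rewrite the whole table, at the cost of scanning all of `datagcp` each time.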
