
I am new to GCP and Terraform. I am developing Terraform scripts to provision around 50 BigQuery datasets, each with at least 10 tables. The tables do not all share the same schema.

I have developed scripts to create the datasets and tables, but I am struggling to add schemas to the tables and need help. I am using Terraform variables to build the scripts.

Here is my code. I need to integrate logic that creates schemas for the tables.

var.tf

variable "test_bq_dataset" {
  type = list(object({
    id       = string
    location = string
  }))
}

variable "test_bq_table" {
  type = list(object({
    dataset_id = string
    table_id   = string
  }))
}

terraform.tfvars

test_bq_dataset = [{
  id       = "ds1"
  location = "US"
  },
  {
    id       = "ds2"
    location = "US"
  }
]

test_bq_table = [{
  dataset_id = "ds1"
  table_id   = "table1"
  },
  {
    dataset_id = "ds2"
    table_id   = "table2"
  },
  {
    dataset_id = "ds1"
    table_id   = "table3"
  }
]

main.tf

resource "google_bigquery_dataset" "dataset" {
  count      = length(var.test_bq_dataset)
  dataset_id = var.test_bq_dataset[count.index]["id"]
  location   = var.test_bq_dataset[count.index]["location"]
  labels = {
    "environment" = "development"
  }
}


resource "google_bigquery_table" "table" {
  count = length(var.test_bq_table)
  dataset_id = var.test_bq_table[count.index]["dataset_id"]
  table_id   = var.test_bq_table[count.index]["table_id"]
  labels = {
    "environment" = "development"
  }
  depends_on = [
    google_bigquery_dataset.dataset,
  ]
}

I have tried every approach I could think of to create schemas for the tables in the datasets, but none worked.

2 Answers


Presumably all your tables share an identical schema...

I would try this way

In the

resource "google_bigquery_table" "table"

just after labels, for example, you can add:

schema = file("${path.root}/subdirectories-path/table_schema.json")

where the

  • ${path.root} - the directory of your root Terraform module
  • subdirectories-path - zero or more subdirectories
  • table_schema.json - a JSON file containing the schema
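
Applied to the resource from the question, that gives something like the following (the "bq-schema" subdirectory name is only an example; use whatever path matches your layout):

```hcl
resource "google_bigquery_table" "table" {
  count      = length(var.test_bq_table)
  dataset_id = var.test_bq_table[count.index]["dataset_id"]
  table_id   = var.test_bq_table[count.index]["table_id"]
  labels = {
    "environment" = "development"
  }
  # Load the table schema from a JSON file relative to the root module.
  schema = file("${path.root}/bq-schema/table_schema.json")
  depends_on = [
    google_bigquery_dataset.dataset,
  ]
}
```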

==> Update 14/02/2021

Following a request for an example where the table schemas differ... Only minimal modifications from the original question.

variables.tf

variable "project_id" {
  description = "The target project"
  type        = string
  default     = "ishim-sample"
}

variable "region" {
  description = "The region where resources are created => europe-west2"
  type        = string
  default     = "europe-west2"
}

variable "zone" {
  description = "The zone in the europe-west region for resources"
  type        = string
  default     = "europe-west2-b"
}

# ===========================
variable "test_bq_dataset" {
  type = list(object({
    id       = string
    location = string
  }))
}

variable "test_bq_table" {
  type = list(object({
    dataset_id = string
    table_id   = string
    schema_id  = string
  }))
}

terraform.tfvars

test_bq_dataset = [
  {
    id       = "ds1"
    location = "EU"
  },
  {
    id       = "ds2"
    location = "EU"
  }
]

test_bq_table = [
  {
    dataset_id = "ds1"
    table_id   = "table1"
    schema_id  = "table-schema-01.json"
  },
  {
    dataset_id = "ds2"
    table_id   = "table2"
    schema_id  = "table-schema-02.json"
  },
  {
    dataset_id = "ds1"
    table_id   = "table3"
    schema_id  = "table-schema-03.json"
  },
  {
    dataset_id = "ds2"
    table_id   = "table4"
    schema_id  = "table-schema-04.json"
  }
]

An example of a json schema file - table-schema-01.json

[
  {
    "name": "table_column_01",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": ""
  },
  {
    "name": "_gcs_file_path",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": "The GCS path to the file for loading."
  },
  {
    "name": "_src_file_ts",
    "mode": "REQUIRED",
    "type": "TIMESTAMP",
    "description": "The source file modification timestamp."
  },
  {
    "name": "_src_file_name",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": "The file name of the source file."
  },
  {
    "name": "_firestore_doc_id",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": "The hash code (based on the file name and its content, so each file has a unique hash) used as a Firestore document id."
  },
  {
    "name": "_ingested_ts",
    "mode": "REQUIRED",
    "type": "TIMESTAMP",
    "description": "The timestamp when this record was processed during ingestion into the BigQuery table."
  }
]

main.tf

provider "google" {
  project = var.project_id
  region  = var.region
  zone    = var.zone
}

resource "google_bigquery_dataset" "test_dataset_set" {
  project    = var.project_id
  count      = length(var.test_bq_dataset)
  dataset_id = var.test_bq_dataset[count.index]["id"]
  location   = var.test_bq_dataset[count.index]["location"]

  labels = {
    "environment" = "development"
  }
}

resource "google_bigquery_table" "test_table_set" {
  project    = var.project_id
  count      = length(var.test_bq_table)
  dataset_id = var.test_bq_table[count.index]["dataset_id"]
  table_id   = var.test_bq_table[count.index]["table_id"]
  schema     = file("${path.root}/bq-schema/${var.test_bq_table[count.index]["schema_id"]}")

  labels = {
    "environment" = "development"
  }
  depends_on = [
    google_bigquery_dataset.test_dataset_set,
  ]
}
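
As a side note, the same resource can also be written with for_each instead of count, so that each table is keyed by a stable name (the "dataset.table" key below is an arbitrary choice) and removing one entry from the list does not shift the indices of the remaining tables. A sketch using the same variables:

```hcl
resource "google_bigquery_table" "test_table_set" {
  # Key each instance by "dataset.table" so entries can be added or
  # removed without re-indexing the remaining tables.
  for_each = {
    for t in var.test_bq_table : "${t.dataset_id}.${t.table_id}" => t
  }

  project    = var.project_id
  dataset_id = each.value.dataset_id
  table_id   = each.value.table_id
  schema     = file("${path.root}/bq-schema/${each.value.schema_id}")

  labels = {
    "environment" = "development"
  }
  depends_on = [
    google_bigquery_dataset.test_dataset_set,
  ]
}
```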

Project directory structure - screenshot

Bear in mind the subdirectory name - "bq-schema" as it is used in the "schema" attribute of the "google_bigquery_table" resource in the "main.tf" file.
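
The screenshot is not reproduced here, but based on the paths used above the layout would look roughly like:

```text
.
├── main.tf
├── variables.tf
├── terraform.tfvars
└── bq-schema/
    ├── table-schema-01.json
    ├── table-schema-02.json
    ├── table-schema-03.json
    └── table-schema-04.json
```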

BigQuery console - screenshot

The result of the "terraform apply" command.


3 Comments

I should have added this to the question: the tables do not have identical schemas.
You can probably define a Terraform variable - a map of "table name => schema file name", or a list of schema file names - so that the correct file is chosen in the same count loop instead of the constant "table_schema.json".
Could you please share an example using a map?

Terraform includes an optional schema argument that expects a JSON string.

The linked documentation includes an example:

resource "google_bigquery_table" "default" {
  dataset_id = google_bigquery_dataset.default.dataset_id
  table_id   = "bar"

  time_partitioning {
    type = "DAY"
  }

  labels = {
    env = "default"
  }

  schema = <<EOF
[
  {
    "name": "permalink",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "The Permalink"
  },
  {
    "name": "state",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "State where the head office is located"
  }
]
EOF

}
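
If you prefer to keep the schema in HCL rather than in a heredoc string, Terraform's built-in jsonencode function produces the same JSON. A sketch of the equivalent schema argument:

```hcl
  # Equivalent to the heredoc above, but written as native HCL objects.
  schema = jsonencode([
    {
      name        = "permalink"
      type        = "STRING"
      mode        = "NULLABLE"
      description = "The Permalink"
    },
    {
      name        = "state"
      type        = "STRING"
      mode        = "NULLABLE"
      description = "State where the head office is located"
    }
  ])
```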

2 Comments

I have 50 BQ datasets, and every dataset has 10 tables; I would rather not hardcode values. I am trying to find a way to use variables to create the schemas, just as I have been doing to create the tables and datasets.
I see! I didn't know the schemas were different. That will definitely require another approach. I believe @al-dann's answer offers a much better one.
