
I am new to GCP and Terraform. I am developing Terraform scripts to provision around 50 BigQuery datasets, each with at least 10 tables. The tables do not all share the same schema.

I have developed scripts to create the datasets and tables, but I am struggling to add schemas to the tables and need help. I am using Terraform variables to build the scripts.

Here is my code. I need to integrate logic that creates schemas for the tables.

var.tf

variable "test_bq_dataset" {
  type = list(object({
    id       = string
    location = string
  }))
}

variable "test_bq_table" {
  type = list(object({
    dataset_id = string
    table_id   = string
  }))
}

terraform.tfvars

test_bq_dataset = [{
  id       = "ds1"
  location = "US"
  },
  {
    id       = "ds2"
    location = "US"
  }
]

test_bq_table = [{
  dataset_id = "ds1"
  table_id   = "table1"
  },
  {
    dataset_id = "ds2"
    table_id   = "table2"
  },
  {
    dataset_id = "ds1"
    table_id   = "table3"
  }
]

main.tf

resource "google_bigquery_dataset" "dataset" {
  count      = length(var.test_bq_dataset)
  dataset_id = var.test_bq_dataset[count.index]["id"]
  location   = var.test_bq_dataset[count.index]["location"]
  labels = {
    "environment" = "development"
  }
}


resource "google_bigquery_table" "table" {
  count = length(var.test_bq_table)
  dataset_id = var.test_bq_table[count.index]["dataset_id"]
  table_id   = var.test_bq_table[count.index]["table_id"]
  labels = {
    "environment" = "development"
  }
  depends_on = [
    google_bigquery_dataset.dataset,
  ]
}

I have tried every approach I could think of to create schemas for the tables in the datasets, but none worked.

2 Answers


Presumably all your tables share an identical schema...

I would try this way

In the

resource "google_bigquery_table" "table"

just after labels, for example, you can add:

schema = file("${path.root}/subdirectories-path/table_schema.json")

where the

  • ${path.root} - the directory of your root Terraform module
  • subdirectories-path - zero or more subdirectories
  • table_schema.json - a JSON file containing the schema
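
Applied to the resource from the question, that gives something like the following (the "bq-schema" subdirectory name is only an example; use whatever path matches your layout):

```hcl
resource "google_bigquery_table" "table" {
  count      = length(var.test_bq_table)
  dataset_id = var.test_bq_table[count.index]["dataset_id"]
  table_id   = var.test_bq_table[count.index]["table_id"]
  labels = {
    "environment" = "development"
  }
  # Load the table schema from a JSON file relative to the root module.
  schema = file("${path.root}/bq-schema/table_schema.json")
  depends_on = [
    google_bigquery_dataset.dataset,
  ]
}
```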

==> Update 14/02/2021

Following a request for an example where the table schemas differ... Only minimal modifications from the original question.

variables.tf

variable "project_id" {
  description = "The target project"
  type        = string
  default     = "ishim-sample"
}

variable "region" {
  description = "The region where resources are created => europe-west2"
  type        = string
  default     = "europe-west2"
}

variable "zone" {
  description = "The zone in the europe-west region for resources"
  type        = string
  default     = "europe-west2-b"
}

# ===========================
variable "test_bq_dataset" {
  type = list(object({
    id       = string
    location = string
  }))
}

variable "test_bq_table" {
  type = list(object({
    dataset_id = string
    table_id   = string
    schema_id  = string
  }))
}

terraform.tfvars

test_bq_dataset = [
  {
    id       = "ds1"
    location = "EU"
  },
  {
    id       = "ds2"
    location = "EU"
  }
]

test_bq_table = [
  {
    dataset_id = "ds1"
    table_id   = "table1"
    schema_id  = "table-schema-01.json"
  },
  {
    dataset_id = "ds2"
    table_id   = "table2"
    schema_id  = "table-schema-02.json"
  },
  {
    dataset_id = "ds1"
    table_id   = "table3"
    schema_id  = "table-schema-03.json"
  },
  {
    dataset_id = "ds2"
    table_id   = "table4"
    schema_id  = "table-schema-04.json"
  }
]

An example of a json schema file - table-schema-01.json

[
  {
    "name": "table_column_01",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": ""
  },
  {
    "name": "_gcs_file_path",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": "The GCS path to the file for loading."
  },
  {
    "name": "_src_file_ts",
    "mode": "REQUIRED",
    "type": "TIMESTAMP",
    "description": "The source file modification timestamp."
  },
  {
    "name": "_src_file_name",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": "The file name of the source file."
  },
  {
    "name": "_firestore_doc_id",
    "mode": "REQUIRED",
    "type": "STRING",
    "description": "The hash code (based on the file name and its content, so each file has a unique hash) used as a Firestore document id."
  },
  {
    "name": "_ingested_ts",
    "mode": "REQUIRED",
    "type": "TIMESTAMP",
    "description": "The timestamp when this record was processed during ingestion into the BigQuery table."
  }
]

main.tf

provider "google" {
  project = var.project_id
  region  = var.region
  zone    = var.zone
}

resource "google_bigquery_dataset" "test_dataset_set" {
  project    = var.project_id
  count      = length(var.test_bq_dataset)
  dataset_id = var.test_bq_dataset[count.index]["id"]
  location   = var.test_bq_dataset[count.index]["location"]

  labels = {
    "environment" = "development"
  }
}

resource "google_bigquery_table" "test_table_set" {
  project    = var.project_id
  count      = length(var.test_bq_table)
  dataset_id = var.test_bq_table[count.index]["dataset_id"]
  table_id   = var.test_bq_table[count.index]["table_id"]
  schema     = file("${path.root}/bq-schema/${var.test_bq_table[count.index]["schema_id"]}")

  labels = {
    "environment" = "development"
  }
  depends_on = [
    google_bigquery_dataset.test_dataset_set,
  ]
}
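
As a side note, the same resource can also be written with for_each instead of count, so that each table is keyed by a stable name (the "dataset.table" key below is an arbitrary choice) and removing one entry from the list does not shift the indices of the remaining tables. A sketch using the same variables:

```hcl
resource "google_bigquery_table" "test_table_set" {
  # Key each instance by "dataset.table" so entries can be added or
  # removed without re-indexing the remaining tables.
  for_each = {
    for t in var.test_bq_table : "${t.dataset_id}.${t.table_id}" => t
  }

  project    = var.project_id
  dataset_id = each.value.dataset_id
  table_id   = each.value.table_id
  schema     = file("${path.root}/bq-schema/${each.value.schema_id}")

  labels = {
    "environment" = "development"
  }
  depends_on = [
    google_bigquery_dataset.test_dataset_set,
  ]
}
```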

Project directory structure - screenshot

Bear in mind the subdirectory name - "bq-schema" as it is used in the "schema" attribute of the "google_bigquery_table" resource in the "main.tf" file.
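
The screenshot is not reproduced here, but based on the paths used above the layout would look roughly like:

```text
.
├── main.tf
├── variables.tf
├── terraform.tfvars
└── bq-schema/
    ├── table-schema-01.json
    ├── table-schema-02.json
    ├── table-schema-03.json
    └── table-schema-04.json
```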

BigQuery console - screenshot

The result of the "terraform apply" command.


3 Comments

I should have added this to the question: the tables do not have identical schemas.
You can probably define a Terraform variable - a map of "table name => schema file name", or a list of schema file names - so that the correct file is chosen in the same count loop instead of the constant "table_schema.json".
Could you please share an example using a map?

Terraform includes an optional schema argument that expects a JSON string.

The linked documentation includes an example:

resource "google_bigquery_table" "default" {
  dataset_id = google_bigquery_dataset.default.dataset_id
  table_id   = "bar"

  time_partitioning {
    type = "DAY"
  }

  labels = {
    env = "default"
  }

  schema = <<EOF
[
  {
    "name": "permalink",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "The Permalink"
  },
  {
    "name": "state",
    "type": "STRING",
    "mode": "NULLABLE",
    "description": "State where the head office is located"
  }
]
EOF

}
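
If you prefer to keep the schema in HCL rather than in a heredoc string, Terraform's built-in jsonencode function produces the same JSON. A sketch of the equivalent schema argument:

```hcl
  # Equivalent to the heredoc above, but written as native HCL objects.
  schema = jsonencode([
    {
      name        = "permalink"
      type        = "STRING"
      mode        = "NULLABLE"
      description = "The Permalink"
    },
    {
      name        = "state"
      type        = "STRING"
      mode        = "NULLABLE"
      description = "State where the head office is located"
    }
  ])
```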

2 Comments

I have 50 BQ datasets, and every dataset has 10 tables; I would rather not hardcode values. I am trying to find a way to use variables to create the schemas, just as I have been doing to create the tables and datasets.
I see! I didn't know the schemas were different. That will definitely require another approach. I believe @al-dann's answer offers a much better one.
