Converting arrays to nested fields in BigQuery

Question

I'm streaming Stackdriver logs into Bigquery, and they end up in a textPayload field in the following format:


member_id_hashed=123456789,

member_age -> Float(37.0,244),

operations=[92967,93486,86220,92814,92943,93279,...],

scores=[3.214899,2.3641025E-5,2.5823574,2.3818345,3.9919448,0.0,...],

[etc]

I then define a query/view on the table with the raw logging entries as follows:

SELECT
member_id_hashed as member_id, member_age,
split(operations,',') as operation,
split(scores,',') as score 
FROM
(
  SELECT
  REGEXP_EXTRACT(textPayload, r'member_id=([0-9]+)') as member_id_hashed,
  REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
  REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
  REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
  from `myproject.mydataset.mytable`
)

resulting in one row with two single fields and two arrays:

Ideally, for further analysis, I would like the two arrays to be nested (e.g. operation.id and operation.score) or flatten the arrays line by line while keeping the positions (i.e. line 1 of array 1 should appear next to line 1 of array 2, etc):

Can anybody point me to a way to make nested fields out of the arrays, or to flatten the arrays? I tried unnesting and joining, but that would give me too many possible cross-combinations in the result.

Thanks for your help!

Elliott Brossard · Accepted Answer · 2019-02-25 13:16:49Z

You can zip the two arrays like this. It unnests the array with operation IDs and gets the index of each element, then selects the corresponding element of the array with scores. Note that this assumes that the arrays have the same number of elements. If they don't, you could use SAFE_OFFSET instead of OFFSET in order to get NULL if there are more IDs than scores, for instance.

SELECT
  member_id_hashed, member_age,
  ARRAY(
    SELECT AS STRUCT id, split(scores,',')[OFFSET(off)] AS score
    FROM UNNEST(split(operations,',')) AS id WITH OFFSET off
    ORDER BY off
  ) AS operations
FROM (
  SELECT
    REGEXP_EXTRACT(textPayload, r'member_id=([0-9]+)') as member_id,
    REGEXP_EXTRACT(textPayload, r'member_age -> Float\(([0-9]+)') as member_age,
    REGEXP_EXTRACT(textPayload, r'operations=\[(.+)') as operations,
    REGEXP_EXTRACT(textPayload, r'scores=\[(.+)') as scores
  from `myproject.mydataset.mytable`
)

Collectives™ on Stack Overflow

Converting arrays to nested fields in BigQuery

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related