
I'm trying to load JSON data into BigQuery. The excerpt of my data causing problems looks like this:

{"page":"295"}
{"page":["295", "123"]}

I have defined the schema for this field as:

{
  "name": "page",
  "type": "string",
  "mode": "repeated"
}

I'm getting the error "Repeated field must be imported as a JSON array." I think the problem is caused by the structure of the JSON itself: the string in the first row should be a one-element array instead of a bare string. The data was delivered to me like this. Rather than fixing the file (which is hundreds of gigabytes), I am trying to find a way to force BigQuery to read this string as a one-element array. Is there any solution?

1 Answer


One option is to use BigQuery to transform the data itself. Supposing you are able to import the rows as CSV instead (pick an arbitrary field delimiter that doesn't appear in your data, so each line lands in a single string column), you can use the JSON_EXTRACT function to retrieve the value(s) of page across rows. For example,

#standardSQL
SELECT JSON_EXTRACT(json, '$.page') AS page
FROM UnprocessedTable;

You can use the SPLIT function or REGEXP_EXTRACT_ALL to retrieve the individual values afterward.

Edit: As a concrete example, you can try this query:

#standardSQL
WITH T AS (
  SELECT '{"page": "foo"}' AS json UNION ALL
  SELECT '{"page": ["bar", "baz"]}' AS json UNION ALL
  SELECT '{"page": ["a", "b", "c"]}' AS json
)
SELECT
  REGEXP_EXTRACT_ALL(JSON_EXTRACT(json, '$.page'), r'"([^"]*)"') AS pages
FROM T;

This returns the page value, whether it was a JSON array or a scalar string, as an ARRAY&lt;STRING&gt; column.
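If you want to sanity-check the regex before running it over the full table, the same logic can be reproduced locally with a small Python sketch (the function name is made up for illustration): it serializes the page value back to JSON, which is what JSON_EXTRACT hands to REGEXP_EXTRACT_ALL, and then applies the same pattern.

```python
import json
import re

def extract_pages(raw_line):
    """Mimic REGEXP_EXTRACT_ALL(JSON_EXTRACT(json, '$.page'), r'"([^"]*)"'):
    return the page value(s) as a list of strings, whether the field
    is a scalar string or an array of strings."""
    value = json.loads(raw_line).get("page")
    # Serialize back to JSON so both cases look like the string
    # that BigQuery's JSON_EXTRACT would return.
    blob = json.dumps(value)
    return re.findall(r'"([^"]*)"', blob)

print(extract_pages('{"page": "295"}'))           # ['295']
print(extract_pages('{"page": ["295", "123"]}'))  # ['295', '123']
```

Note this simple pattern assumes the strings themselves contain no escaped quotes; if yours do, you would need a more careful regex.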


6 Comments

Thanks! This indeed might work. The problem is that my data in fact contains around 60 columns; the given example is only the column causing problems. I should have mentioned it earlier. Am I correct that after loading the entire dataset as CSV I would need to extract each column separately, even those that are working correctly now?
You could potentially generate a query programmatically to call JSON_EXTRACT_SCALAR for each of the other columns. You should only need to do this transformation on the data once, at least.
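As a sketch of what "generate a query programmatically" could look like, the snippet below builds the one-time transformation query from a list of column names, calling JSON_EXTRACT_SCALAR for the ordinary columns and the regex treatment for page. The column and table names here are hypothetical placeholders.

```python
# Hypothetical column names standing in for the ~60 real ones.
scalar_columns = ["user_id", "event_time", "country"]

select_parts = [
    f"JSON_EXTRACT_SCALAR(json, '$.{col}') AS {col}"
    for col in scalar_columns
]
# The problematic repeated field gets the regex treatment instead.
select_parts.append(
    "REGEXP_EXTRACT_ALL(JSON_EXTRACT(json, '$.page'), r'\"([^\"]*)\"') AS page"
)

query = (
    "#standardSQL\nSELECT\n  "
    + ",\n  ".join(select_parts)
    + "\nFROM UnprocessedTable;"
)
print(query)
```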
Thanks. The other solution that would work for me would be to load the entire array ["295", "123"] as a string. Is that somehow possible? Would it be easier?
I think you might also get an error doing that, since the type would be declared as a string, but you could try it and see. If you're able to load it as a string, you could use REGEXP_EXTRACT_ALL as in the example above.
Unfortunately, I am not able to load it simply as a nullable string; I get the expected error "Array specified for non-repeated field." Is there any other way to achieve that which does not involve importing the rows as CSV?
