0

I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:

SELECT
  ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE 
FROM 
  JSON_TABLE 
WHERE 
  JSON_COL IS NOT NULL AND 
  json_exists(JSON_COL, 'ARRAY')==1

When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.

Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?

Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.

Example table that causes error:

| ID | JSON_COL                                                          |
| 1  | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]}  |
| 2  | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]}  |

Example table that does not cause error:

| ID | JSON_COL                                                          |
| 1  | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]}      |
| 2  | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]}  |

This question seems like it might hold the answer, but I was not able to get anything working from it.

1 Answer 1

1

You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:

SELECT
  ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE 
FROM (
  SELECT *
  FROM JSON_TABLE 
  WHERE 
    JSON_COL IS NOT NULL AND 
    json_exists(JSON_COL, 'ARRAY')==1
)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.