Azure Databricks, Python - convert json column string to dataframe

Question

I am using Azure Databricks and Python 3.

I have a data frame (df1) with a column called 'BodyJson' which is of 'string' data type.

'BodyJson' holds a complex JSON structure. An example of one row from df1 is shown below.

Column BodyJson From df1

{
  "Timestamp": 3690414400,
  "Sender": "10.99.45.6:32768:wifivm0002EF",
  "Type": "1.3.6.1.4.1.9.9.599.0.8",
  "CaptureTime": 637616722902708244,
  "Variables": [
    {
      "Key": "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0",
      "Value": "1"
    },
    {
      "Key": "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128",
      "Value": {
        "Hex": "66696E7362792D7761703033",
        "String": "123456-wap03"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.2.1.2.0",
      "Value": 1
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.2.1.3.0",
      "Value": {
        "Hex": "0A9603F4",
        "String": "\n?\u0003?"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.27.114.154.56.22.154.160",
      "Value": {
        "Hex": "766D6564776966692F646965676F33756B407961686F6F2E636F6D",
        "String": "vmedwifi/[email protected]"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.28.114.154.56.22.154.160",
      "Value": {
        "Hex": "56697267696E204D65646961",
        "String": "Virgin Media"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.38.114.154.56.22.154.160",
      "Value": {
        "Hex": "36306562663133322F37323A39613A33383A31363A39613A61302F3931323639363136",
        "String": "60ebf132/72:9a:38:16:9a:a0/91269616"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.8.114.154.56.22.154.160",
      "Value": {
        "Hex": "C8F9F9290080",
        "String": "???)\u0000?"
      }
    }
  ]
}

The only part of 'BodyJson' I am interested in is "Variables", which holds an array of JSON objects. These objects come in two forms; an example of each form, with example values, is shown below:

Form-1

{
  "Key": "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0",
  "Value": "1"
}

Form-2

{
  "Key": "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128",
  "Value": {
    "Hex": "66696E7362792D7761703033",
    "String": "123456-wap03"
  }
}

I would like to create two new data frames, one holding only form-1 rows and one holding only form-2 rows. For example, the columns would be...

New Data Frame holding only Form-1 rows...

Key(string) = "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0"
Value(string) = "1"

New Data Frame holding only Form-2 rows...

Key(string) = "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128"
Value(string) = "123456-wap03" (Populated with values from "Value"."String". NB: I am not interested in values from "Value"."Hex")

How do I go about extracting the data from the column 'BodyJson' and creating the 2 new data frames?

Kafels · Accepted Answer · 2021-07-12 21:22:12Z

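First you need to transform your JSON column into another dataframe. To do so, convert the BodyJson column into an RDD of strings and read it with spark.read.json.

After that, to identify which rows hold a JSON object in Value, use get_json_object and select $.String. If a row doesn't have it, the result is null. This relies on Value having mixed types (string, number, object) in the data: Spark then infers the column as a string, so the form-2 objects are kept as JSON text that get_json_object can parse.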

import pyspark.sql.functions as f

df = spark.createDataFrame([
  ["""{"Timestamp":3690414400,"Sender":"10.99.45.6:32768:wifivm0002EF","Type":"1.3.6.1.4.1.9.9.599.0.8","CaptureTime":637616722902708244,"Variables":[{"Key":"1.3.6.1.4.1.9.9.513.1.2.1.1.1.0","Value":"1"},{"Key":"1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128","Value":{"Hex":"66696E7362792D7761703033","String":"123456-wap03"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.2.1.2.0","Value":1},{"Key":"1.3.6.1.4.1.9.9.599.1.3.2.1.3.0","Value":{"Hex":"0A9603F4","String":"\n?\u0003?"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.27.114.154.56.22.154.160","Value":{"Hex":"766D6564776966692F646965676F33756B407961686F6F2E636F6D","String":"vmedwifi/[email protected]"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.28.114.154.56.22.154.160","Value":{"Hex":"56697267696E204D65646961","String":"Virgin Media"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.38.114.154.56.22.154.160","Value":{"Hex":"36306562663133322F37323A39613A33383A31363A39613A61302F3931323639363136","String":"60ebf132/72:9a:38:16:9a:a0/91269616"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.8.114.154.56.22.154.160","Value":{"Hex":"C8F9F9290080","String":"???)\u0000?"}}]}"""]
], schema='BodyJson string')

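# Turn the string column into an RDD of raw JSON strings and let Spark infer the schema.
# allowUnquotedControlChars is needed because some "String" values contain control characters.
rdd = df.rdd.map(lambda row: row.BodyJson)
body_df = spark.read.json(rdd, allowUnquotedControlChars=True)

# Explode the Variables array into one row per element, with columns Key and Value.
variables_df = body_df.selectExpr('inline(Variables)')
# ObjValue is Value.String for form-2 rows and null for form-1 rows.
variables_df = variables_df.withColumn('ObjValue', f.get_json_object('Value', '$.String'))

# Form-1: rows where Value is a plain scalar (no $.String to extract).
form_1_df = variables_df.where(f.col('ObjValue').isNull())
form_1_df = form_1_df.drop('ObjValue')
display(form_1_df)

# Form-2: rows where Value is an object; keep only its String field as Value.
form_2_df = variables_df.where(f.col('ObjValue').isNotNull())
form_2_df = form_2_df.select('Key', f.col('ObjValue').alias('Value'))
display(form_2_df)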
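
To sanity-check the split without a cluster, the same rule (a row is form-2 when its Value is an object with a "String" field, otherwise form-1) can be sketched in plain Python with the standard json module. This is only an illustration of the logic, not part of the Spark solution: the helper name split_variables and the shortened sample string are made up for this example, and edge cases such as a null "String" are not handled the way Spark would handle them.

```python
import json


def split_variables(body_json):
    """Split the "Variables" array of one BodyJson string into form-1 and form-2 rows.

    Hypothetical helper for local checks; mirrors the null test on $.String.
    """
    # strict=False lets json.loads accept raw control characters inside strings,
    # similar in spirit to allowUnquotedControlChars=True in spark.read.json.
    body = json.loads(body_json, strict=False)
    form_1, form_2 = [], []
    for item in body.get("Variables", []):
        value = item["Value"]
        if isinstance(value, dict) and "String" in value:
            # Form-2: keep only Value.String, ignore Value.Hex.
            form_2.append((item["Key"], value["String"]))
        else:
            # Form-1: scalars are kept as strings, like the string column in Spark.
            text = value if isinstance(value, str) else json.dumps(value)
            form_1.append((item["Key"], text))
    return form_1, form_2


# Shortened sample built from three of the rows shown in the question.
sample = (
    '{"Variables": ['
    '{"Key": "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0", "Value": "1"},'
    '{"Key": "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128",'
    ' "Value": {"Hex": "66696E7362792D7761703033", "String": "123456-wap03"}},'
    '{"Key": "1.3.6.1.4.1.9.9.599.1.3.2.1.2.0", "Value": 1}'
    ']}'
)

form_1, form_2 = split_variables(sample)
print(form_1)
print(form_2)
```

With this sample, the first and third rows should land in form_1 and the second in form_2, which is the same partition the where(...isNull()) and where(...isNotNull()) filters are meant to produce.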

Hi, thank you so much. Your code and answer were spot on! Thanks very much
