I am using Azure Databricks and Python 3.

I have a data frame (df1) with a column called 'BodyJson' which is of 'string' data type.

'BodyJson' is a complex json structure - an example is shown below of one row from (df1).

Column BodyJson From df1

{
  "Timestamp": 3690414400,
  "Sender": "10.99.45.6:32768:wifivm0002EF",
  "Type": "1.3.6.1.4.1.9.9.599.0.8",
  "CaptureTime": 637616722902708244,
  "Variables": [
    {
      "Key": "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0",
      "Value": "1"
    },
    {
      "Key": "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128",
      "Value": {
        "Hex": "66696E7362792D7761703033",
        "String": "123456-wap03"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.2.1.2.0",
      "Value": 1
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.2.1.3.0",
      "Value": {
        "Hex": "0A9603F4",
        "String": "\n?\u0003?"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.27.114.154.56.22.154.160",
      "Value": {
        "Hex": "766D6564776966692F646965676F33756B407961686F6F2E636F6D",
        "String": "vmedwifi/[email protected]"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.28.114.154.56.22.154.160",
      "Value": {
        "Hex": "56697267696E204D65646961",
        "String": "Virgin Media"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.38.114.154.56.22.154.160",
      "Value": {
        "Hex": "36306562663133322F37323A39613A33383A31363A39613A61302F3931323639363136",
        "String": "60ebf132/72:9a:38:16:9a:a0/91269616"
      }
    },
    {
      "Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.8.114.154.56.22.154.160",
      "Value": {
        "Hex": "C8F9F9290080",
        "String": "???)\u0000?"
      }
    }
  ]
}

The only part of 'BodyJson' I am interested in is "Variables", which holds an array of JSON objects. These objects come in two forms; an example of each, with example values, is shown below:

Form-1

{
  "Key": "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0",
  "Value": "1"
}

Form-2

{
  "Key": "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128",
  "Value": {
    "Hex": "66696E7362792D7761703033",
    "String": "123456-wap03"
  }
}

I would like to create two new data frames, one holding only form-1 rows and one holding only form-2 rows. For example, the columns would be...

New Data Frame holding only Form-1 rows...

  • Key(string) = "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0"
  • Value(string) = "1"

New Data Frame holding only Form-2 rows...

  • Key(string) = "1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128"
  • Value(string) = "123456-wap03" (Populated with values from "Value"."String". NB: I am not interested in values from "Value"."Hex")

How do I go about extracting the data from the column 'BodyJson' and create 2 new data frames?

1 Answer

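First, you need to turn your JSON column into its own dataframe. To do that, convert the BodyJson column into an RDD of strings and read it with spark.read.json.

Then, to identify which rows hold a JSON object in Value, use get_json_object and select $.String. Because Value mixes strings, numbers and objects, Spark infers the column as a string, so form-2 values arrive as JSON text. For rows with no String field (form-1), get_json_object returns null.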

import pyspark.sql.functions as f

df = spark.createDataFrame([
  ["""{"Timestamp":3690414400,"Sender":"10.99.45.6:32768:wifivm0002EF","Type":"1.3.6.1.4.1.9.9.599.0.8","CaptureTime":637616722902708244,"Variables":[{"Key":"1.3.6.1.4.1.9.9.513.1.2.1.1.1.0","Value":"1"},{"Key":"1.3.6.1.4.1.9.9.513.1.1.1.1.5.200.249.249.41.0.128","Value":{"Hex":"66696E7362792D7761703033","String":"123456-wap03"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.2.1.2.0","Value":1},{"Key":"1.3.6.1.4.1.9.9.599.1.3.2.1.3.0","Value":{"Hex":"0A9603F4","String":"\n?\u0003?"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.27.114.154.56.22.154.160","Value":{"Hex":"766D6564776966692F646965676F33756B407961686F6F2E636F6D","String":"vmedwifi/[email protected]"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.28.114.154.56.22.154.160","Value":{"Hex":"56697267696E204D65646961","String":"Virgin Media"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.38.114.154.56.22.154.160","Value":{"Hex":"36306562663133322F37323A39613A33383A31363A39613A61302F3931323639363136","String":"60ebf132/72:9a:38:16:9a:a0/91269616"}},{"Key":"1.3.6.1.4.1.9.9.599.1.3.1.1.8.114.154.56.22.154.160","Value":{"Hex":"C8F9F9290080","String":"???)\u0000?"}}]}"""]
], schema='BodyJson string')

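# Turn the string column into an RDD of JSON documents and let Spark infer the schema.
# allowUnquotedControlChars is needed because some "String" values contain raw control characters.
rdd = df.rdd.map(lambda row: row.BodyJson)
body_df = spark.read.json(rdd, allowUnquotedControlChars=True)

# Explode the Variables array into one row per element, with columns Key and Value.
variables_df = body_df.selectExpr('inline(Variables)')
# ObjValue is null for form-1 rows and holds "Value"."String" for form-2 rows.
variables_df = variables_df.withColumn('ObjValue', f.get_json_object('Value', '$.String'))

# Form-1: Value is a plain scalar, so there is no String field.
form_1_df = variables_df.where(f.col('ObjValue').isNull())
form_1_df = form_1_df.drop('ObjValue')
display(form_1_df)

# Form-2: keep Key and expose the extracted String as Value.
form_2_df = variables_df.where(f.col('ObjValue').isNotNull())
form_2_df = form_2_df.select('Key', f.col('ObjValue').alias('Value'))
display(form_2_df)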
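
If you want to check the split rule without a cluster, here is a minimal plain-Python sketch of the same logic for a single BodyJson string. It is not part of the Spark solution above; the function name split_variables and the shortened sample are made up for illustration. It assumes form-2 rows are exactly those whose Value is an object with a String field.

```python
import json


def split_variables(body_json):
    """Split the "Variables" array of one BodyJson string into two lists of (Key, Value)."""
    # strict=False accepts raw control characters inside JSON strings,
    # similar in spirit to allowUnquotedControlChars in the Spark reader.
    body = json.loads(body_json, strict=False)
    form_1, form_2 = [], []
    for item in body.get("Variables", []):
        value = item["Value"]
        if isinstance(value, dict) and "String" in value:
            # Form-2: keep only "Value"."String" and ignore "Hex".
            form_2.append((item["Key"], value["String"]))
        else:
            # Form-1: keep the scalar as text, so the number 1 becomes "1".
            text = value if isinstance(value, str) else json.dumps(value)
            form_1.append((item["Key"], text))
    return form_1, form_2


# Hypothetical shortened sample built from three of the rows in the question.
sample = json.dumps({
    "Variables": [
        {"Key": "1.3.6.1.4.1.9.9.513.1.2.1.1.1.0", "Value": "1"},
        {"Key": "1.3.6.1.4.1.9.9.599.1.3.2.1.2.0", "Value": 1},
        {"Key": "1.3.6.1.4.1.9.9.599.1.3.1.1.28.114.154.56.22.154.160",
         "Value": {"Hex": "56697267696E204D65646961", "String": "Virgin Media"}},
    ]
})

form_1_rows, form_2_rows = split_variables(sample)
```

With this sample, form_1_rows holds the two scalar rows and form_2_rows holds the single "Virgin Media" row, which mirrors what the isNull / isNotNull filters on ObjValue do in the Spark code.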

1 Comment

Hi, thank you so much. Your code and answer were spot on! Thanks very much
