
I have a nested JSON dict that I need to convert to a Spark DataFrame. The dict is stored in a DataFrame column. I have been trying to parse it using "from_json" and "get_json_object", but have been unable to read the data. Here's the smallest snippet of the source data that I've been trying to read:

{"value": "\u0000\u0000\u0000\u0000/{\"context\":\"data\"}"}

I need to extract the nested dict value. I used the code below to clean the data and read it into a dataframe:

from pyspark.sql.functions import *
from pyspark.sql.types import *

input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context", StringType(), True)])  # schema I'm providing
raw_df = spark.read.json(input_path)
cleansed_df = raw_df.withColumn("cleansed_value", regexp_replace(raw_df.value, '/', '')).select('cleansed_value')  # removed the extra '/' in the data

cleansed_df.select(from_json('cleansed_value', schema=schema1)).show(1, truncate=False)

I get a null dataframe each time I run the above code. Please help.

I tried the approach from this question and it didn't work: PySpark: Read nested JSON from a String Type Column and create columns

I also tried writing it out to a JSON file and reading it back, which didn't work either: reading a nested JSON file in pyspark

1 Answer


The null chars \u0000 break the JSON parsing. You can replace them as well, in the same regexp_replace that strips the '/':

from pyspark.sql import functions as F

df = spark.read.json('path')

df2 = df.withColumn(
    'cleansed_value',
    F.regexp_replace('value', '[\u0000/]', '')   # strip the null chars and the leading '/'
).withColumn(
    'parsed',
    F.from_json('cleansed_value', 'context string')   # parse the cleaned JSON string
)

df2.show(20,0)
+-----------------------+------------------+------+
|value                  |cleansed_value    |parsed|
+-----------------------+------------------+------+
|/{"context":"data"}|{"context":"data"}|[data]|
+-----------------------+------------------+------+
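
If you then want the nested value as its own top-level column (as asked in the question), you can select the struct field directly. A minimal sketch continuing from df2 above; the field name context comes from the schema string used in from_json:

# Pull the nested field out of the parsed struct into a regular column.
result = df2.select(F.col('parsed.context').alias('context'))
# Alternatively, expand every field of the struct at once:
# result = df2.select('parsed.*')

result.show()
# +-------+
# |context|
# +-------+
# |   data|
# +-------+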

2 Comments

Thanks, this solved the issue. I had assumed the \u0000 chars were being removed automatically when Spark read the data into the dataframe, since they did not show up in cleansed_df. This JSON file is created by a Scala notebook on Databricks and I assumed this was being handled internally.
The \u0000 char is not visible, but you might have noticed that the output of .show() is somewhat distorted because of that invisible char.
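
A quick way to confirm the hidden characters are actually there (a minimal sketch, assuming the same raw df read above) is to compare the column's length before and after stripping them:

# raw_length counts the invisible \u0000 chars; visible_length is after stripping them.
df.select(
    F.length('value').alias('raw_length'),
    F.length(F.regexp_replace('value', '\u0000', '')).alias('visible_length')
).show()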
