
I have a nested JSON dict that I need to convert to a Spark DataFrame. The dict is stored in a DataFrame column. I have been trying to parse it using "from_json" and "get_json_object", but have been unable to read the data. Here's the smallest snippet of the source data that I've been trying to read:

{"value": "\u0000\u0000\u0000\u0000/{\"context\":\"data\"}"}

I need to extract the nested dict value. I used the code below to clean the data and read it into a dataframe:

from pyspark.sql.functions import *
from pyspark.sql.types import *

input_path = '/FileStore/tables/enrl/context2.json'  # path for the above file
schema1 = StructType([StructField("context", StringType(), True)])  # schema I'm providing
raw_df = spark.read.json(input_path)
cleansed_df = raw_df.withColumn("cleansed_value", regexp_replace(raw_df.value, '/', '')).select('cleansed_value')  # removed the extra '/' in the data

cleansed_df.select(from_json('cleansed_value', schema=schema1)).show(1, truncate=False)

I get a null dataframe each time I run the above code. Please help.

I tried the approach from this question and it didn't work: PySpark: Read nested JSON from a String Type Column and create columns

I also tried writing it out to a JSON file and reading it back, which didn't work either: reading a nested JSON file in pyspark

1 Answer


The null chars \u0000 break the JSON parsing. You can replace them as well, in the same regexp_replace that strips the '/':

from pyspark.sql import functions as F

df = spark.read.json('path')

df2 = df.withColumn(
    'cleansed_value',
    F.regexp_replace('value', '[\u0000/]', '')   # strip the null chars and the leading '/'
).withColumn(
    'parsed',
    F.from_json('cleansed_value', 'context string')   # parse the cleaned JSON string
)

df2.show(20,0)
+-----------------------+------------------+------+
|value                  |cleansed_value    |parsed|
+-----------------------+------------------+------+
|/{"context":"data"}|{"context":"data"}|[data]|
+-----------------------+------------------+------+
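
If you then want the nested value as its own top-level column (as asked in the question), you can select the struct field directly. A minimal sketch continuing from df2 above; the field name context comes from the schema string used in from_json:

# Pull the nested field out of the parsed struct into a regular column.
result = df2.select(F.col('parsed.context').alias('context'))
# Alternatively, expand every field of the struct at once:
# result = df2.select('parsed.*')

result.show()
# +-------+
# |context|
# +-------+
# |   data|
# +-------+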

2 Comments

Thanks, this solved the issue. I had assumed the \u0000 chars were being removed automatically when Spark read the data into the dataframe, since they did not show up in cleansed_df. This JSON file is created by a Scala notebook on Databricks and I assumed this was being handled internally.
The \u0000 char is not visible, but you might have noticed that the output of .show() is somewhat distorted because of that invisible char.
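
A quick way to confirm the hidden characters are actually there (a minimal sketch, assuming the same raw df read above) is to compare the column's length before and after stripping them:

# raw_length counts the invisible \u0000 chars; visible_length is after stripping them.
df.select(
    F.length('value').alias('raw_length'),
    F.length(F.regexp_replace('value', '\u0000', '')).alias('visible_length')
).show()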
