0

I'm actually facing an issue I hope I can explain.

I'm trying to parse a CSV file with PySpark. This csv file has some JSON columns. Those Json columns have the same Schema, but are not filled the same way.

For instance i have :

{"targetUrl":"https://snowplowanalytics.com/products/snowplow-insights", "elementId":NULL, "elementClasses":NULL,"elementTarget":NULL}

or

{"targetUrl":"https://snowplowanalytics.com/request-demo/", "elementId":"button-request-demo-header-page", "elementClasses":["btn","btn-primary","call-to-action"]}

Atm, when I do :

simpleSchema = st.StructType([
    st.StructField("targetUrl",st.StringType(),True),
    st.StructField("elementId",st.StringType(),True),
    st.StructField("elementClasses",st.StringType(),True)
])

          
df = spark.read.format("csv").option("header","true").option("quoteAll","true").option("escape", "\"").load("./Sources/explore_snowplow_data_raw.csv")
df.select(fn.from_json(fn.col("link_click_event"),simpleSchema).alias("linkJson")).select("linkJson.*").show(50)

(link_click_event is my JSON column name)

Only my second JSON field is fully returned because no values are null.

My problem is that the first row is returned as

+--------------------+--------------------+--------------------+
|           targetUrl|           elementId|      elementClasses|
+--------------------+--------------------+--------------------+
|                null|                null|                null|

How can I reach a result as followed for my first line ?

+--------------------+--------------------+--------------------+
|           targetUrl|           elementId|      elementClasses|
+--------------------+--------------------+--------------------+
|"https://snowplo"...|                null|                null|

Many thanks

2 Answers 2

1

Since your json is not stringified (but in your case fine I think), it could not be read correctly for test case. So I made it.

col1
"{\"targetUrl\":\"https://snowplowanalytics.com/products/snowplow-insights\",\"elementId\":null,\"elementClasses\":null,\"elementTarget\":null}"
"{\"targetUrl\":\"https://snowplowanalytics.com/request-demo/\", \"elementId\":\"button-request-demo-header-page\", \"elementClasses\":[\"btn\",\"btn-primary\",\"call-to-action\"]}"

After that with this code,

import pyspark.sql.functions as f
from pyspark.sql import types as st

simpleSchema = st.StructType([
    st.StructField("targetUrl",st.StringType(),True),
    st.StructField("elementId",st.StringType(),True),
    st.StructField("elementClasses",st.ArrayType(st.StringType()),True),
    st.StructField("elementTarget",st.StringType(),True)
])

df.withColumn('col1', f.from_json('col1', simpleSchema)).show(10, False)

+-------------------------------------------------------------------------------------------------------------------+
|col1                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------+
|[https://snowplowanalytics.com/products/snowplow-insights,,,]                                                      |
|[https://snowplowanalytics.com/request-demo/, button-request-demo-header-page, [btn, btn-primary, call-to-action],]|
+-------------------------------------------------------------------------------------------------------------------+

it works fine.

Sign up to request clarification or add additional context in comments.

4 Comments

Hey Lamanus, Thank you for the answer. So if I understand well, I had a type problem with my Element Classes Attribute and a problem with my Json Field that is not stringified ?
So is it possible to stringify a JSON column from a spark Dataframe ?
try first to change NULL to null and see what happen.
You were right, NULL was a problem. If I change it to null, my row is shown in my results, many thanks.
0

primitivesAsString parameter as True worked for me.

primitivesAsString – infers all primitive values as a string type. If None is set, it uses the default value, false.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.