Create a PySpark DataFrame with JSON string values and a schema

Question

I am trying to manually create a dummy PySpark DataFrame.

I did the following:

from pyspark.sql.types import StructType, StructField, StringType

data2 = [('{"Time":"2020-08-01T08:14:20.650Z","version":null}')]

schema = StructType([
    StructField("raw_json", StringType(), True)
])

df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

But I got this error:

TypeError: StructType can not accept object '[{"Time:"2020-08-01T08:14:20.650Z","version":null}]' in type <class 'str'>

How can I put a JSON string into a PySpark DataFrame as a value?

My ideal result is:

+--------------------------------------------------+
|value                                             |
+--------------------------------------------------+
|{"Time":"2020-08-01T08:14:20.650Z","version":null}|
+--------------------------------------------------+

Surya · Accepted Answer · 2021-02-19 01:24:11Z

1

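The error is caused by the parentheses. Without a trailing comma, ('...') is just a parenthesized string, so data2 ends up as a list of strings, and Spark cannot map a bare string onto a StructType row. data2 should be a list of rows, for example a list of lists, so replace the inner parentheses with square brackets: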

data2 = [['{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}']]

schema = StructType([StructField("raw_json",StringType(),True)])
df = spark.createDataFrame(data=data2,schema=schema)

df.show(truncate=False)
+------------------------------------------------------------------+            
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
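
If the JSON strings arrive as a flat list, the same fix can be applied to all of them at once. The following is a minimal sketch in plain Python; the name json_strings and the second record are made up for illustration, and the final createDataFrame call is left as a comment because it assumes an active SparkSession named spark and the schema defined above.

```python
# Hypothetical input: JSON strings arriving as a flat list.
json_strings = [
    '{"Time":"2020-08-01T08:14:20.650Z","version":null}',
    '{"Time":"2020-08-02T09:30:00.000Z","version":null}',  # made-up second record
]

# Wrap each string in its own one-element list so that every item is a row
# with exactly one field, matching the single-column schema.
rows = [[s] for s in json_strings]

print(rows[0])

# With an active SparkSession named `spark` and the `schema` defined above,
# the rows could then be passed on unchanged:
# df = spark.createDataFrame(data=rows, schema=schema)
```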

mck · Accepted Answer · 2021-02-19 07:24:38Z

It also works if you specify data2 as a list of tuples: add a trailing comma inside the parentheses so that Python treats the value as a one-element tuple.

from pyspark.sql.types import StructType, StructField, StringType

# Note the trailing comma inside the parentheses
data2 = [('{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}',)]

schema = StructType([
    StructField("raw_json",StringType(),True)
])

df = spark.createDataFrame(data=data2,schema=schema)
df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
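
The effect of the trailing comma can be checked in plain Python, without Spark. This is a small illustrative sketch; the variable names are made up for the example.

```python
# Parentheses alone only group an expression: this is still a str.
not_a_tuple = ('{"version":null}')

# The trailing comma is what makes a one-element tuple.
one_tuple = ('{"version":null}',)

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__, len(one_tuple))
```

The first value is a plain string, which is why a list of such values is rejected for a StructType schema, while the second is a row with exactly one field.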

ags29 · Accepted Answer · 2021-02-19 20:51:12Z

0

Try this:

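import json

# Here data2 is the original list of plain strings and sc is the SparkContext.
# Each string is parsed into a dict, so raw_json holds parsed values, not the JSON text.
df = sc.parallelize(data2).map(lambda x: [json.loads(x)]).toDF(schema=['raw_json'])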
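
Note that json.loads changes what ends up in the column. The sketch below uses only the standard library to show what it returns for the sample string from the question; how Spark then types such a value depends on its schema inference, so treat this as an illustration of the Python side only.

```python
import json

raw = '{"Time":"2020-08-01T08:14:20.650Z","version":null}'

# json.loads turns the JSON text into a Python dict; JSON null becomes None.
parsed = json.loads(raw)

print(type(parsed).__name__)
print(parsed["version"] is None)
```

If the goal is a column containing the raw JSON text, as in the ideal result in the question, the string should be kept as it is rather than parsed.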
