I am trying to manually create a dummy PySpark DataFrame.

I did the following:

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data2 = [('{"Time":"2020-08-01T08:14:20.650Z","version":null}')
            ]

schema = StructType([ \
    StructField("raw_json",StringType(),True)
  ])

df = spark.createDataFrame(data=data2,schema=schema)
df.printSchema()
df.show(truncate=False)

But I got the error:

TypeError: StructType can not accept object '[{"Time:"2020-08-01T08:14:20.650Z","version":null}]' in type <class 'str'>

How can I put a JSON string into a PySpark DataFrame as a value?

My ideal result is:

+--------------------------------------------------+
|value                                             |
+--------------------------------------------------+
|{"Time":"2020-08-01T08:14:20.650Z","version":null}|
+--------------------------------------------------+

3 Answers

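The error comes from the parentheses. Without a trailing comma, ('...') is just a parenthesized string, not a tuple, so each element of data2 is a bare string instead of a row. data2 should be a list of lists, so replace the inner parentheses with square brackets: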

data2 = [['{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}']]

schema = StructType([StructField("raw_json",StringType(),True)])
df = spark.createDataFrame(data=data2,schema=schema)

df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+

It also works if you specify data2 as a list of tuples. Add a trailing comma inside the parentheses so that Python treats each element as a one-element tuple rather than a plain string.

from pyspark.sql.types import StructType, StructField, StringType

# Note the trailing comma inside the parentheses
data2 = [('{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}',)]

schema = StructType([
    StructField("raw_json",StringType(),True)
])

df = spark.createDataFrame(data=data2,schema=schema)
df.show(truncate=False)
+------------------------------------------------------------------+
|raw_json                                                          |
+------------------------------------------------------------------+
|{"applicationTimeStamp":"2020-08-01T08:14:20.650Z","version":null}|
+------------------------------------------------------------------+
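
The difference between the spellings can be checked in plain Python, without Spark. The sketch below is only an illustration; the variable names are made up for this example and the payload is the JSON string from the question.

```python
# The same JSON text wrapped three ways.
payload = '{"Time":"2020-08-01T08:14:20.650Z","version":null}'

bare = (payload)       # parentheses only group the expression: still a str
as_tuple = (payload,)  # the trailing comma makes a one-element tuple
as_list = [payload]    # square brackets make a one-element list

# Print the Python type of each spelling.
for name, value in [("bare", bare), ("as_tuple", as_tuple), ("as_list", as_list)]:
    print(name, type(value).__name__)
```

Only the tuple and list forms are row-shaped, which is why a list of either works with the single-field StructType while the bare string is rejected.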

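Alternatively, parse each string with json.loads and build the DataFrame from an RDD. This assumes data2 is the list of plain strings from the question and that sc is the active SparkContext. Note that the column then holds the parsed object rather than the original string:

import json

df = sc.parallelize(data2).map(lambda x: [json.loads(x)]).toDF(schema=['raw_json'])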
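
If the goal is to keep the JSON as a plain string value, as in the question, parsing is only needed to check that the text is valid. Here is a minimal sketch of that idea in plain Python. The helper name to_rows is hypothetical and not part of PySpark; it validates each string and wraps it in a one-element tuple.

```python
import json


def to_rows(json_strings):
    """Validate each JSON string and wrap it as a one-column row (hypothetical helper)."""
    rows = []
    for text in json_strings:
        json.loads(text)      # raises json.JSONDecodeError (a ValueError) on invalid JSON
        rows.append((text,))  # keep the original string, not the parsed dict
    return rows


raw = ['{"Time":"2020-08-01T08:14:20.650Z","version":null}']
rows = to_rows(raw)
print(rows)
```

The resulting rows have the same shape as the list of tuples in the answer above, so they could be passed to spark.createDataFrame(data=rows, schema=schema) with the single-field schema.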
