I have a list of strings in Python as follows:

['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
'start_column=column475;to_3=2020-09-07 10:29:34;']

I am trying to convert it into a DataFrame in the following way:

schema = StructType([
    StructField('Rows', ArrayType(StringType()), True)
])

rdd = sc.parallelize(test_list)
query_data = spark.createDataFrame(rdd,schema)
print(query_data.schema)
query_data.show()

I am getting the following error:

TypeError: StructType can not accept object 
  • What output are you looking for? Commented Nov 5, 2020 at 14:44
  • Actually, it would be best if the keywords became the column names, with their values in the corresponding columns. Something like this: stackoverflow.com/questions/47552045/… Commented Nov 5, 2020 at 14:54

3 Answers

You just need to wrap the list in another list when creating the DataFrame, so that it is treated as a single row with two columns:

a_list = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
'start_column=column475;to_3=2020-09-07 10:29:34;']
sparkdf = spark.createDataFrame([a_list],["col1", "col2"])
sparkdf.show(truncate=False)

+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|col1                                                                                              |col2                                            |
+--------------------------------------------------------------------------------------------------+------------------------------------------------+
|start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;|start_column=column475;to_3=2020-09-07 10:29:34;|
+--------------------------------------------------------------------------------------------------+------------------------------------------------+

You should use schema = StringType(), because your rows contain plain strings rather than structs of strings.
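For context, the TypeError in the question arises because each element of the RDD is a bare string, while a StructType schema expects row-like objects (tuples, lists, dicts or Rows). Below is a minimal sketch of an alternative that keeps a StructType by wrapping each string in a one-element tuple. The helper name as_single_field_rows is hypothetical, the field is declared as StringType rather than the ArrayType(StringType()) used in the question, and the Spark calls in the comments assume an active SparkSession named spark.

```python
test_list = [
    "start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;",
    "start_column=column475;to_3=2020-09-07 10:29:34;",
]


def as_single_field_rows(strings):
    """Wrap each string in a 1-tuple so it matches a one-field StructType."""
    return [(s,) for s in strings]


rows = as_single_field_rows(test_list)

# Assumption: an active SparkSession named `spark` is available.
#   from pyspark.sql.types import StructType, StructField, StringType
#   schema = StructType([StructField("Rows", StringType(), True)])
#   spark.createDataFrame(rows, schema).show(truncate=False)
```

With schema = StringType() passed directly, Spark wraps the type in a one-field struct for you and names the column "value"; the tuple approach is mainly useful when you want to choose the column name yourself.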


I have two possible solutions for you.

SOLUTION 1: Assuming you wanted a dataframe with just one row

I was able to make it work by wrapping the values of test_list in parentheses, so that they form a single tuple (one row), and by using StringType for each field.

from pyspark.sql.types import StructType, StructField, StringType

# One tuple = one row; its two elements become the two columns
v = [('start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
'start_column=column475;to_3=2020-09-07 10:29:34;')]

# Each field holds a plain string, so StringType is sufficient
schema = StructType([
    StructField('col_1', StringType(), True),
    StructField('col_2', StringType(), True),
])

rdd = sc.parallelize(v)
query_data = spark.createDataFrame(rdd, schema)
print(query_data.schema)
query_data.show(truncate=False)

SOLUTION 2: Assuming you wanted a dataframe with just one column

from pyspark.sql.types import StringType

v = ['start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;',
'start_column=column475;to_3=2020-09-07 10:29:34;']

# Each string becomes one row in a single string column
df = spark.createDataFrame(v, StringType())

df.show(truncate=False)
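The asker's comment mentions that the keywords should ideally become column names. Here is a rough sketch of one way to prepare the data for that, assuming every record is a ';'-separated list of key=value pairs as in the sample. The helper names parse_record and to_rows are hypothetical and not part of PySpark; the parsing itself is plain Python.

```python
test_list = [
    "start_column=column123;to_3=2020-09-07 10:29:24;to_1=2020-09-07 10:31:08;to_0=2020-09-07 10:31:13;",
    "start_column=column475;to_3=2020-09-07 10:29:34;",
]


def parse_record(record):
    """Split 'k1=v1;k2=v2;' into {'k1': 'v1', 'k2': 'v2'}."""
    # Skip empty pieces (trailing ';') and pieces without '='
    pairs = (item.split("=", 1) for item in record.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def to_rows(records):
    """Return (column names in first-seen order, one tuple per record)."""
    parsed = [parse_record(r) for r in records]
    columns = []
    for rec in parsed:
        for key in rec:
            if key not in columns:
                columns.append(key)
    # Keys missing from a record are filled with None
    rows = [tuple(rec.get(c) for c in columns) for rec in parsed]
    return columns, rows


columns, rows = to_rows(test_list)

# Assumption: an active SparkSession named `spark` is available.
#   df = spark.createDataFrame(rows, columns)
#   df.show(truncate=False)
```

All values stay as strings here. If a column were None in every record, Spark could not infer its type from the data, so an explicit schema would be needed in that case.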
