
I am trying to create a new dataframe with an ArrayType() column. I tried with and without defining a schema but couldn't get the desired result. My code below uses a schema:

from pyspark.sql.types import *
l = [[1,2,3],[3,2,4],[6,8,9]]
schema = StructType([
  StructField("data", ArrayType(IntegerType()), True)
])
df = spark.createDataFrame(l,schema)
df.show(truncate = False)

This gives error:

ValueError: Length of object (3) does not match with length of fields (1)
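The error arises because createDataFrame treats each top-level list as one row and each item inside it as a separate column value, so every row appears to have three fields while the schema declares only one. A pure-Python sketch of the (simplified) length check Spark performs, plus the wrapping that fixes it:

```python
# Why the ValueError occurs: createDataFrame treats each top-level element
# as one row, and each item inside that row as a separate column value.
rows = [[1, 2, 3], [3, 2, 4], [6, 8, 9]]

# Simplified version of Spark's check: every row must have as many items
# as the schema has fields (here, one field: "data").
print([len(r) for r in rows])      # [3, 3, 3] -> 3 items vs. 1 field

# Fix: wrap each list so every row is a 1-tuple whose only item is the array.
wrapped = [([1, 2, 3],), ([3, 2, 4],), ([6, 8, 9],)]
print([len(r) for r in wrapped])   # [1, 1, 1] -> matches the schema
```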

Desired output:

+---------+
|data     |
+---------+
|[1,2,3]  |
|[3,2,4]  |
|[6,8,9]  |
+---------+

Edit:

I found a strange thing (at least to me):

If we use the following code, it gives the expected result:

import pyspark.sql.functions as f
data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]

df = spark.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

This gives the following expected output:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

But if we remove the first column, it gives an unexpected result.

import pyspark.sql.functions as f
data = [
    (['john', 'sam', 'jane']),
    (['whiskers', 'rover', 'fido'])
]

df = spark.createDataFrame(data, ["names"])
df.show(truncate=False)

This gives the following output:

+--------+-----+----+
|names   |_2   |_3  |
+--------+-----+----+
|john    |sam  |jane|
|whiskers|rover|fido|
+--------+-----+----+
1 Comment

To create a tuple with a single element, add a comma at the end: (['john', 'sam', 'jane'],). The comma makes the tuple, not the parentheses; 1, is a tuple. Commented Sep 24, 2020 at 8:37

2 Answers


I think you already have the answer to your question. Another solution is:

>>> l = [([1,2,3],), ([3,2,4],),([6,8,9],)]
>>> df = spark.createDataFrame(l, ['data'])
>>> df.show()

+---------+
|     data|
+---------+
|[1, 2, 3]|
|[3, 2, 4]|
|[6, 8, 9]|
+---------+

or

>>> from pyspark.sql.functions import array

>>> l = [[1,2,3],[3,2,4],[6,8,9]]
>>> df = spark.createDataFrame(l)
>>> df = df.withColumn('data',array(df.columns))
>>> df = df.select('data')
>>> df.show()
+---------+
|     data|
+---------+
|[1, 2, 3]|
|[3, 2, 4]|
|[6, 8, 9]|
+---------+

Regarding the strange thing: it is not that strange, but you need to keep in mind that parentheses around a single value are just that value itself; only a trailing comma makes a one-element tuple.

>>> (['john', 'sam', 'jane'])
['john', 'sam', 'jane']

>>> type((['john', 'sam', 'jane']))
<class 'list'>

so createDataFrame sees a list, not a tuple.
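The point can be checked in plain Python, independent of Spark:

```python
# The parentheses here are redundant grouping; this is still a list.
just_a_list = (['john', 'sam', 'jane'])
# The trailing comma, not the parentheses, creates the one-element tuple.
one_tuple = (['john', 'sam', 'jane'],)

print(type(just_a_list).__name__)  # list
print(type(one_tuple).__name__)    # tuple
print(len(one_tuple))              # 1 -> one column per row
```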


2 Comments

So, createDataFrame takes a tuple for each row, and a single-element tuple is denoted by a trailing comma. Did I get it right?
Yes, according to the documentation the comma is one way to construct a tuple: docs.python.org/3.3/library/stdtypes.html?highlight=tuple#tuple

This is how you can build a PySpark dataframe containing a struct or a list of structs:

from pyspark.sql import Row

# Wrap the Row in a list: createDataFrame expects a collection of rows,
# not a single Row (which it would otherwise iterate field by field).
df = spark.createDataFrame([Row(events=[Row(a=278724874, b="toto")], id="toto")])
df.printSchema()
df.show()

This gives:

root
|-- events: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- a: long (nullable = true)
|    |    |-- b: string (nullable = true)
|-- id: string (nullable = true)

+-------------------+----+
|             events|  id|
+-------------------+----+
|[{278724874, toto}]|toto|
+-------------------+----+

