
I am learning PySpark, and it is convenient to be able to quickly create example dataframes to try out the functionality of the PySpark API.

The following code (where spark is a SparkSession):

df = [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
      {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}},
]
df = spark.createDataFrame(df)
df.printSchema()

infers a map for the data column (and does not interpret the array correctly; the values come out as strings):

root
 |-- data: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: long (nullable = true)

I need a struct instead. I can force a struct by supplying a schema:

import pyspark.sql.types as T
df = [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
      {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}},
]
schema = T.StructType([
    T.StructField('id', T.LongType()),
    T.StructField('data', T.StructType([
        T.StructField('x', T.StringType()),
        T.StructField('y', T.ArrayType(T.LongType())),
    ])),
])
df = spark.createDataFrame(df, schema=schema)
df.printSchema()

That indeed gives:

root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: array (nullable = true)
 |    |    |-- element: long (containsNull = true)

But this is too much typing.

Is there any other quick way to create the dataframe so that the data column is a struct without specifying the schema?

1 Answer

When creating an example dataframe, you can use Python tuples, which are transformed into Spark structs. But this way you cannot specify the struct field names:

df = spark.createDataFrame(
    [(1, ('mplah', [10,20,30])),
     (2, ('mplah2', [100,200,300]))],
    ['id', 'data']
)
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- _1: string (nullable = true)
#  |    |-- _2: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
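
A related trick, as a hedged sketch: if the struct already has the default _1/_2 names, a positional cast should rename the fields, since Spark casts struct fields by position (so _1 becomes x and _2 becomes y):

from pyspark.sql import functions as F
# Cast the struct column to a struct type with the desired field names;
# fields are matched positionally, so only the names change.
df = df.withColumn('data', F.col('data').cast('struct<x:string,y:array<bigint>>'))
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)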

Using this approach, you may want to add the schema, which can be written compactly as a DDL string:

df = spark.createDataFrame(
    [(1, ('mplah', [10,20,30])),
     (2, ('mplah2', [100,200,300]))],
    'id: bigint, data: struct<x:string,y:array<bigint>>'
)
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
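
As a sketch (assuming nested dict keys are matched to struct fields by name, as the schema verifier does), the same DDL string should also work with the question's original list-of-dicts data, avoiding the verbose StructType:

df = spark.createDataFrame(
    [{'id': 1, 'data': {'x': 'mplah', 'y': [10,20,30]}},
     {'id': 2, 'data': {'x': 'mplah2', 'y': [100,200,300]}}],
    'id: bigint, data: struct<x:string,y:array<bigint>>'
)
df.printSchema()
# Same schema as above: the nested dicts are converted to structs
# according to the given schema.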

However, I often prefer the approach using the struct function. This way no detailed schema has to be provided, and the struct field names are taken from the column names.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'mplah', [10,20,30]),
     (2, 'mplah2', [100,200,300])],
    ['id', 'x', 'y']
)
df = df.select('id', F.struct('x', 'y').alias('data'))

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = false)
#  |    |-- x: string (nullable = true)
#  |    |-- y: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
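
If you want struct field names that differ from the column names, you can alias the columns inside struct. A sketch, assuming df is the three-column dataframe created above (before the select):

df2 = df.select('id', F.struct(F.col('x').alias('a'), F.col('y').alias('b')).alias('data'))
df2.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- data: struct (nullable = false)
#  |    |-- a: string (nullable = true)
#  |    |-- b: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)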