
I have a DataFrame with this schema:

root
  |-- AUTHOR_ID: integer (nullable = false)
  |-- NAME: string (nullable = true)
  |-- Books: array (nullable = false)
  |    |-- element: struct (containsNull = false)
  |    |    |-- BOOK_ID: integer (nullable = false)
  |    |    |-- Chapters: array (nullable = true) 
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- NAME: string (nullable = true)
  |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

How can I flatten all the columns into a single level with PySpark?

1 Answer


Using the inline function:

df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")
       .selectExpr("*", "inline(Chapters)")
       .drop("Chapters")
       )

Or using explode, dropping the intermediate struct columns so the result is fully flat:

from pyspark.sql import functions as F

df2 = (df.withColumn("Books", F.explode("Books"))
       .select("*", "Books.*")
       .drop("Books")
       .withColumn("Chapters", F.explode("Chapters"))
       .select("*", "Chapters.*")
       .drop("Chapters")
       )
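To see what the two explode steps produce, here is a plain-Python sketch of the same row expansion (no Spark required). The sample data and the renamed keys AUTHOR_NAME / CHAPTER_NAME are made up for illustration; renaming avoids the clash between the author-level and chapter-level NAME fields that the Spark versions above would otherwise carry as duplicate column names.

```python
# Plain-Python sketch of what explode(Books) then explode(Chapters) does:
# every (author, book, chapter) combination becomes one flat row.
# Sample data is illustrative only.
authors = [
    {"AUTHOR_ID": 1, "NAME": "Ann",
     "Books": [
         {"BOOK_ID": 10,
          "Chapters": [
              {"NAME": "Intro", "NUMBER_PAGES": 12},
              {"NAME": "Basics", "NUMBER_PAGES": 30},
          ]},
     ]},
]

# The nested comprehension mirrors the two explodes back to back.
flat = [
    {"AUTHOR_ID": a["AUTHOR_ID"], "AUTHOR_NAME": a["NAME"],
     "BOOK_ID": b["BOOK_ID"],
     "CHAPTER_NAME": c["NAME"], "NUMBER_PAGES": c["NUMBER_PAGES"]}
    for a in authors
    for b in a["Books"]
    for c in b["Chapters"]
]
# flat now holds one row per chapter.
```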

Comments

How can we do the reverse? That is, after flattening, how do we rebuild the original nested DataFrame with PySpark?
Or maybe just by doing some groupBy and aggregations as usual?
@Smaillns Yes: first group by AUTHOR_ID + NAME + BOOK_ID to create the array of chapters, then group by AUTHOR_ID + NAME to create the array of books.
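The two-step grouping described in that comment (in Spark, F.collect_list(F.struct(...)) at each level) can be sketched in plain Python. The flat field names AUTHOR_NAME and CHAPTER_NAME are hypothetical, assuming the two NAME columns were disambiguated during flattening.

```python
from collections import defaultdict

# Example flat rows; field names are illustrative, not from the original post.
flat = [
    {"AUTHOR_ID": 1, "AUTHOR_NAME": "Ann", "BOOK_ID": 10,
     "CHAPTER_NAME": "Intro", "NUMBER_PAGES": 12},
    {"AUTHOR_ID": 1, "AUTHOR_NAME": "Ann", "BOOK_ID": 10,
     "CHAPTER_NAME": "Basics", "NUMBER_PAGES": 30},
]

# Step 1: collect chapters per (author, book) -- the first collect_list(struct(...)).
chapters = defaultdict(list)
for r in flat:
    key = (r["AUTHOR_ID"], r["AUTHOR_NAME"], r["BOOK_ID"])
    chapters[key].append({"NAME": r["CHAPTER_NAME"], "NUMBER_PAGES": r["NUMBER_PAGES"]})

# Step 2: collect books per author -- the second collect_list(struct(...)).
books = defaultdict(list)
for (aid, name, bid), chs in chapters.items():
    books[(aid, name)].append({"BOOK_ID": bid, "Chapters": chs})

nested = [{"AUTHOR_ID": aid, "NAME": name, "Books": bks}
          for (aid, name), bks in books.items()]
```

Note that collect_list does not guarantee element order, so the rebuilt arrays may not match the original ordering without an explicit sort.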
