
I have a DataFrame with this schema:

root
  |-- AUTHOR_ID: integer (nullable = false)
  |-- NAME: string (nullable = true)
  |-- Books: array (nullable = false)
  |    |-- element: struct (containsNull = false)
  |    |    |-- BOOK_ID: integer (nullable = false)
  |    |    |-- Chapters: array (nullable = true) 
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- NAME: string (nullable = true)
  |    |    |    |    |-- NUMBER_PAGES: integer (nullable = true)

How can I flatten all the columns into a single level with PySpark?

1 Answer


Using the inline function:

df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")
       .selectExpr("*", "inline(Chapters)")
       .drop("Chapters")
       )

Or using explode, dropping the intermediate struct columns so the result is fully flat:

from pyspark.sql import functions as F

df2 = (df.withColumn("Books", F.explode("Books"))
       .select("*", "Books.*")
       .drop("Books")
       .withColumn("Chapters", F.explode("Chapters"))
       .select("*", "Chapters.*")
       .drop("Chapters")
       )
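To see what the two explode steps produce, here is a plain-Python sketch of the same row expansion (no Spark required). The sample data and the renamed keys AUTHOR_NAME / CHAPTER_NAME are made up for illustration; renaming avoids the clash between the author-level and chapter-level NAME fields that the Spark versions above would otherwise carry as duplicate column names.

```python
# Plain-Python sketch of what explode(Books) then explode(Chapters) does:
# every (author, book, chapter) combination becomes one flat row.
# Sample data is illustrative only.
authors = [
    {"AUTHOR_ID": 1, "NAME": "Ann",
     "Books": [
         {"BOOK_ID": 10,
          "Chapters": [
              {"NAME": "Intro", "NUMBER_PAGES": 12},
              {"NAME": "Basics", "NUMBER_PAGES": 30},
          ]},
     ]},
]

# The nested comprehension mirrors the two explodes back to back.
flat = [
    {"AUTHOR_ID": a["AUTHOR_ID"], "AUTHOR_NAME": a["NAME"],
     "BOOK_ID": b["BOOK_ID"],
     "CHAPTER_NAME": c["NAME"], "NUMBER_PAGES": c["NUMBER_PAGES"]}
    for a in authors
    for b in a["Books"]
    for c in b["Chapters"]
]
# flat now holds one row per chapter.
```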

Comments

How can we do the reverse? That is, after flattening, how do we rebuild the original nested DataFrame with PySpark?
Or maybe just by doing some groupBy and aggregations as usual?
@Smaillns Yes: first group by AUTHOR_ID + NAME + BOOK_ID to create the array of chapters, then group by AUTHOR_ID + NAME to create the array of books.
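The two-step grouping described in that comment (in Spark, F.collect_list(F.struct(...)) at each level) can be sketched in plain Python. The flat field names AUTHOR_NAME and CHAPTER_NAME are hypothetical, assuming the two NAME columns were disambiguated during flattening.

```python
from collections import defaultdict

# Example flat rows; field names are illustrative, not from the original post.
flat = [
    {"AUTHOR_ID": 1, "AUTHOR_NAME": "Ann", "BOOK_ID": 10,
     "CHAPTER_NAME": "Intro", "NUMBER_PAGES": 12},
    {"AUTHOR_ID": 1, "AUTHOR_NAME": "Ann", "BOOK_ID": 10,
     "CHAPTER_NAME": "Basics", "NUMBER_PAGES": 30},
]

# Step 1: collect chapters per (author, book) -- the first collect_list(struct(...)).
chapters = defaultdict(list)
for r in flat:
    key = (r["AUTHOR_ID"], r["AUTHOR_NAME"], r["BOOK_ID"])
    chapters[key].append({"NAME": r["CHAPTER_NAME"], "NUMBER_PAGES": r["NUMBER_PAGES"]})

# Step 2: collect books per author -- the second collect_list(struct(...)).
books = defaultdict(list)
for (aid, name, bid), chs in chapters.items():
    books[(aid, name)].append({"BOOK_ID": bid, "Chapters": chs})

nested = [{"AUTHOR_ID": aid, "NAME": name, "Books": bks}
          for (aid, name), bks in books.items()]
```

Note that collect_list does not guarantee element order, so the rebuilt arrays may not match the original ordering without an explicit sort.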
