Assign list of values to columns with binary value in PySpark

Question

I have a pyspark dataframe with a column Categories that has a list of values, for a given ID. Looks like this, let's call this df1:

ID  | Categories
122   [16777277, 16777350, 16777483, 16777425, 16777303]

There are about 500 categories total, but we can see in the example above that ID 122 only has the above 5 categories. Let's call this df2.columns

I want transform each value in the list from the Categories column into its own column, like this (df3):

ID  | 16777277 | 16777350 | 16777483 | 16777425 | 44444 | 55555 | ....... 
122    1            1        1           1         0       0

Any idea how I can make this transformation? I tried using explode() then pivot() however, this seemed to incorrectly map the list values to columns.

What I tried:

df1 = df1.withColumn('exploded', F.explode('Categories'))
pivoted = df1.groupby('ID').pivot('exploded').count()

Thanks in advance!

wwnde · Accepted Answer · 2022-05-18 20:59:20Z

2

On the right track, being a list, explode, groupby and Pivot and lit(1)

 df1=spark.createDataFrame([(122  , [16777277, 16777350, 16777483, 16777425, 16777303])],
    ('ID'  , 'Categories'))


df1.withColumn('exploded', F.explode('Categories')).groupBy("ID").pivot("exploded").agg(F.lit(1)).na.fill(0).show()

+---+--------+--------+--------+--------+--------+
| ID|16777277|16777303|16777350|16777425|16777483|
+---+--------+--------+--------+--------+--------+
|122|       1|       1|       1|       1|       1|
+---+--------+--------+--------+--------+--------+

answered May 18, 2022 at 20:59

wwnde

26.7k6 gold badges22 silver badges38 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Assign list of values to columns with binary value in PySpark

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related