
I have a PySpark dataframe:

user_id  item_id  last_watch_dt  total_dur  watched_pct
      1        1     2021-05-11       4250           72
      1        2     2021-05-11         80           99
      2        3     2021-05-11       1000           80
      2        4     2021-05-11       5000           40

I used this code:

df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')

To get this:

user_id   1   2   3   4
      1  72  99   0   0
      2   0   0  80  40

But I got an error:

AttributeError: 'DataFrame' object has no attribute 'pivot'

What did I do wrong?

  • did you mean pivot_table? Commented Sep 22, 2022 at 8:11
  • I used both, and the error remains Commented Sep 22, 2022 at 8:17
  • Ah, just re-read your problem, this is not a pandas DataFrame? Well clearly this object does not have a pivot method. Commented Sep 22, 2022 at 8:23
  • A quick Google search shows pivot being called on a groupby object, not on the dataframe itself. Commented Sep 22, 2022 at 8:26
  • I can use df_new = df.groupBy("user_id").pivot("item_id"), but how do I fill it with watched_pct values? Commented Sep 22, 2022 at 8:33

1 Answer


You can only call .pivot on objects that have a pivot attribute (method or property). You tried df.pivot, so it would only work if df had such an attribute. You can inspect all the attributes of df (it's an object of the pyspark.sql.DataFrame class) in the API documentation. You will see many attributes there, but none of them is called pivot. That's why you get an AttributeError.
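
If you want to verify this yourself, Python's built-in hasattr shows which object actually exposes pivot. A minimal sketch, assuming a running SparkSession and the df from the question:

print(hasattr(df, "pivot"))                     # False - DataFrame has no pivot
print(hasattr(df.groupBy("user_id"), "pivot"))  # True  - GroupedData does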

pivot is a method of the pyspark.sql.GroupedData class. This means that, in order to use it, you must first create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame object. In your case, that is done with .groupBy():

df.groupBy("user_id").pivot("item_id")

This creates yet another pyspark.sql.GroupedData object. To turn it back into a dataframe, you need to use one of the methods of the GroupedData class; agg is the one you need. Inside it, you provide the Spark aggregation function that will be applied to the grouped elements (e.g. sum, first, etc.).

df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))

Full example:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1, '2021-05-11', 4250, 72),
     (1, 2, '2021-05-11', 80, 99),
     (2, 3, '2021-05-11', 1000, 80),
     (2, 4, '2021-05-11', 5000, 40)],
    ['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])

df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id|   1|   2|   3|   4|
# +-------+----+----+----+----+
# |      1|  72|  99|null|null|
# |      2|null|null|  80|  40|
# +-------+----+----+----+----+

If you want to replace nulls with 0, use fillna of the pyspark.sql.DataFrame class.

df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id|  1|  2|  3|  4|
# +-------+---+---+---+---+
# |      1| 72| 99|  0|  0|
# |      2|  0|  0| 80| 40|
# +-------+---+---+---+---+
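
Side note: since each (user_id, item_id) pair appears only once in this data, F.first would produce the same pivoted result as F.sum. Under that assumption, the aggregation line in the full example could equally be written as:

df = df.groupBy("user_id").pivot("item_id").agg(F.first("watched_pct"))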

