
I have a PySpark dataframe:

user_id  item_id  last_watch_dt  total_dur  watched_pct
      1        1     2021-05-11       4250           72
      1        2     2021-05-11         80           99
      2        3     2021-05-11       1000           80
      2        4     2021-05-11       5000           40

I used this code:

df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')

To get this:

user_id   1   2   3   4
      1  72  99   0   0
      2   0   0  80  40

But I got an error:

AttributeError: 'DataFrame' object has no attribute 'pivot'

What did I do wrong?

  • did you mean pivot_table? Commented Sep 22, 2022 at 8:11
  • I used both, and the error remains Commented Sep 22, 2022 at 8:17
  • Ah, just re-read your problem, this is not a pandas DataFrame? Well clearly this object does not have a pivot method. Commented Sep 22, 2022 at 8:23
  • A quick Google search shows pivot being called on a groupby object, not on the dataframe itself. Commented Sep 22, 2022 at 8:26
  • I can use df_new = df.groupBy("user_id").pivot("item_id"), but how do I fill it with watched_pct values? Commented Sep 22, 2022 at 8:33

1 Answer


You can only call .pivot on objects that have a pivot attribute (method or property). You tried df.pivot, so it would only work if df had such an attribute. You can inspect all the attributes of df (it's an object of the pyspark.sql.DataFrame class) in the API documentation. You will see many attributes there, but none of them is called pivot. That's why you get an AttributeError.
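
If you want to verify this yourself, Python's built-in hasattr shows which object actually exposes pivot. A minimal sketch, assuming a running SparkSession and the df from the question:

print(hasattr(df, "pivot"))                     # False - DataFrame has no pivot
print(hasattr(df.groupBy("user_id"), "pivot"))  # True  - GroupedData does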

pivot is a method of the pyspark.sql.GroupedData class. This means that, in order to use it, you must first create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame object. In your case, that is done with .groupBy():

df.groupBy("user_id").pivot("item_id")

This creates yet another pyspark.sql.GroupedData object. To turn it back into a dataframe, you need to use one of the methods of the GroupedData class; agg is the one you need. Inside it, you provide the Spark aggregation function that will be applied to the grouped elements (e.g. sum, first, etc.).

df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))

Full example:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1, '2021-05-11', 4250, 72),
     (1, 2, '2021-05-11', 80, 99),
     (2, 3, '2021-05-11', 1000, 80),
     (2, 4, '2021-05-11', 5000, 40)],
    ['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])

df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id|   1|   2|   3|   4|
# +-------+----+----+----+----+
# |      1|  72|  99|null|null|
# |      2|null|null|  80|  40|
# +-------+----+----+----+----+

If you want to replace nulls with 0, use fillna of the pyspark.sql.DataFrame class.

df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id|  1|  2|  3|  4|
# +-------+---+---+---+---+
# |      1| 72| 99|  0|  0|
# |      2|  0|  0| 80| 40|
# +-------+---+---+---+---+
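
Side note: since each (user_id, item_id) pair appears only once in this data, F.first would produce the same pivoted result as F.sum. Under that assumption, the aggregation line in the full example could equally be written as:

df = df.groupBy("user_id").pivot("item_id").agg(F.first("watched_pct"))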

