Let's assume that I have a dataset consisting the columns, movieId, title, year and ratings. Below is a sample subset of my data:
| movieId | title | rating | year |
|:-------:|:--------:|:------:|:----:|
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 5 | 1995 |
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 0.5 | 1995 |
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 3.5 | 1995 |
| 1 | abc | 3 | 1995 |
| 1 | abc | 4 | 1995 |
| 1 | abc | 5 | 1995 |
| 1 | abc | 4.5 | 1995 |
| 1 | abc | 5 | 1995 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 2 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 4 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 5 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 2 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 3 | 2000 |
| 2 | xyz | 4 | 2000 |
| 2 | xyz | 2 | 2000 |
| 2 | xyz | 3.5 | 2000 |
| 2 | xyz | 1 | 2000 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 2 | 1997 |
| 3 | pqr | 3.5 | 1997 |
| 3 | pqr | 3.5 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 3 | 1997 |
| 3 | pqr | 4.5 | 1997 |
| 3 | pqr | 3.5 | 1997 |
| 3 | pqr | 4 | 1997 |
| 3 | pqr | 1.5 | 1997 |
| 3 | pqr | 2 | 1997 |
| 3 | pqr | 2 | 1997 |
| 3 | pqr | 2.5 | 1997 |
| 4 | def | 3 | 1999 |
| 4 | def | 2.5 | 1999 |
| 4 | def | 2.5 | 1999 |
| 4 | def | 0.5 | 1999 |
| 4 | def | 2 | 1999 |
| 4 | def | 3 | 1999 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 1.5 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 2 | 2006 |
| 5 | movie123 | 2.5 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 0.5 | 2006 |
| 5 | movie123 | 1 | 2006 |
| 5 | movie123 | 3.5 | 2006 |
| 5 | movie123 | 2 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 5 | movie123 | 1.5 | 2006 |
| 5 | movie123 | 2.5 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 4 | 2006 |
| 5 | movie123 | 3.5 | 2006 |
| 5 | movie123 | 3 | 2006 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 3.5 | 2012 |
| 6 | movie456 | 3.5 | 2012 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 5 | 2012 |
| 6 | movie456 | 2.5 | 2012 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 4 | 2012 |
| 6 | movie456 | 3.5 | 2012 |
| 6 | movie456 | 5 | 2012 |
| 6 | movie456 | 2 | 2012 |
| 6 | movie456 | 4 | 2012 |
I want to calculate the average, count, minimum rating, and mean ratings across the whole dataset by defining a function. So, I first calculate the number and average of ratings for each movie.
avg_rating = df.groupby(['movieId','year'])['ratings'].agg([('Count','size'), ('Mean','mean')]).sort_values(by='Mean',ascending=False)
Since, some movies may have less number of reviews but a higher rating Vs other movies with higher number of reviews and higher rating, the analysis might get skewed. So, I want to calculate weighted average, for which I have defined a function.
# R = average for the movie (mean) = (Rating)
# v = number of ratings/reviews for the movie = (votes)
# m = minimum reviews required to be listed in the Top 250 movie list
# C = the mean rating across the whole report
def weighted_rating(R, v, m, C):
return (v/(v+m))*R + (m/(v+m))*C
avg_rating= avg_rating.assign(wr = weighted_rating(mean, count, 500, mean(mean)))
When I run the above last line of code, I get an error: name 'mean' is not defined
The same error persists for column count: name 'count' is not defined
How can I solve this error so that my final output contains the columns movieId, year, count, mean, and wr?