0

Let's assume that I have a dataset consisting the columns, movieId, title, year and ratings. Below is a sample subset of my data:

| movieId |   title  | rating | year |
|:-------:|:--------:|:------:|:----:|
| 1       | abc      | 3.5    | 1995 |
| 1       | abc      | 3      | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 3      | 1995 |
| 1       | abc      | 5      | 1995 |
| 1       | abc      | 3.5    | 1995 |
| 1       | abc      | 4.5    | 1995 |
| 1       | abc      | 0.5    | 1995 |
| 1       | abc      | 3.5    | 1995 |
| 1       | abc      | 4.5    | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 5      | 1995 |
| 1       | abc      | 4.5    | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 3      | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 3.5    | 1995 |
| 1       | abc      | 3      | 1995 |
| 1       | abc      | 4      | 1995 |
| 1       | abc      | 5      | 1995 |
| 1       | abc      | 4.5    | 1995 |
| 1       | abc      | 5      | 1995 |
| 2       | xyz      | 3      | 2000 |
| 2       | xyz      | 2      | 2000 |
| 2       | xyz      | 3.5    | 2000 |
| 2       | xyz      | 4      | 2000 |
| 2       | xyz      | 3.5    | 2000 |
| 2       | xyz      | 5      | 2000 |
| 2       | xyz      | 3.5    | 2000 |
| 2       | xyz      | 3      | 2000 |
| 2       | xyz      | 3      | 2000 |
| 2       | xyz      | 2      | 2000 |
| 2       | xyz      | 3.5    | 2000 |
| 2       | xyz      | 3      | 2000 |
| 2       | xyz      | 3      | 2000 |
| 2       | xyz      | 4      | 2000 |
| 2       | xyz      | 2      | 2000 |
| 2       | xyz      | 3.5    | 2000 |
| 2       | xyz      | 1      | 2000 |
| 3       | pqr      | 3      | 1997 |
| 3       | pqr      | 2      | 1997 |
| 3       | pqr      | 3.5    | 1997 |
| 3       | pqr      | 3.5    | 1997 |
| 3       | pqr      | 3      | 1997 |
| 3       | pqr      | 3      | 1997 |
| 3       | pqr      | 3      | 1997 |
| 3       | pqr      | 3      | 1997 |
| 3       | pqr      | 4.5    | 1997 |
| 3       | pqr      | 3.5    | 1997 |
| 3       | pqr      | 4      | 1997 |
| 3       | pqr      | 1.5    | 1997 |
| 3       | pqr      | 2      | 1997 |
| 3       | pqr      | 2      | 1997 |
| 3       | pqr      | 2.5    | 1997 |
| 4       | def      | 3      | 1999 |
| 4       | def      | 2.5    | 1999 |
| 4       | def      | 2.5    | 1999 |
| 4       | def      | 0.5    | 1999 |
| 4       | def      | 2      | 1999 |
| 4       | def      | 3      | 1999 |
| 5       | movie123 | 4      | 2006 |
| 5       | movie123 | 4      | 2006 |
| 5       | movie123 | 3      | 2006 |
| 5       | movie123 | 1.5    | 2006 |
| 5       | movie123 | 3      | 2006 |
| 5       | movie123 | 2      | 2006 |
| 5       | movie123 | 2.5    | 2006 |
| 5       | movie123 | 3      | 2006 |
| 5       | movie123 | 4      | 2006 |
| 5       | movie123 | 0.5    | 2006 |
| 5       | movie123 | 1      | 2006 |
| 5       | movie123 | 3.5    | 2006 |
| 5       | movie123 | 2      | 2006 |
| 5       | movie123 | 3      | 2006 |
| 5       | movie123 | 1.5    | 2006 |
| 5       | movie123 | 2.5    | 2006 |
| 5       | movie123 | 4      | 2006 |
| 5       | movie123 | 4      | 2006 |
| 5       | movie123 | 3.5    | 2006 |
| 5       | movie123 | 3      | 2006 |
| 6       | movie456 | 4      | 2012 |
| 6       | movie456 | 3.5    | 2012 |
| 6       | movie456 | 3.5    | 2012 |
| 6       | movie456 | 4      | 2012 |
| 6       | movie456 | 5      | 2012 |
| 6       | movie456 | 2.5    | 2012 |
| 6       | movie456 | 4      | 2012 |
| 6       | movie456 | 4      | 2012 |
| 6       | movie456 | 3.5    | 2012 |
| 6       | movie456 | 5      | 2012 |
| 6       | movie456 | 2      | 2012 |
| 6       | movie456 | 4      | 2012 |

I want to calculate the average, count, minimum rating, and mean ratings across the whole dataset by defining a function. So, I first calculate the number and average of ratings for each movie.

avg_rating = df.groupby(['movieId','year'])['ratings'].agg([('Count','size'), ('Mean','mean')]).sort_values(by='Mean',ascending=False)

Since, some movies may have less number of reviews but a higher rating Vs other movies with higher number of reviews and higher rating, the analysis might get skewed. So, I want to calculate weighted average, for which I have defined a function.

# R = average for the movie (mean) = (Rating)
# v = number of ratings/reviews for the movie = (votes)
# m = minimum reviews required to be listed in the Top 250 movie list
# C = the mean rating across the whole report

def weighted_rating(R, v, m, C): 
  return (v/(v+m))*R + (m/(v+m))*C

avg_rating= avg_rating.assign(wr = weighted_rating(mean, count, 500, mean(mean)))

When I run the above last line of code, I get an error: name 'mean' is not defined

The same error persists for column count: name 'count' is not defined

How can I solve this error so that my final output contains the columns movieId, year, count, mean, and wr?

1 Answer 1

2

You need to use the fields from your df:

avg_rating = avg_rating.assign(
        wr=weighted_rating(avg_rating['Mean'], avg_rating['Count'], 500, mean(avg_rating['Mean'])))

So the final code could be like:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import (division, absolute_import, print_function,
                        unicode_literals)

import pandas as pd
from statistics import mean


def weighted_rating(R, v, m, C):
    return (v / (v + m)) * R + (m / (v + m)) * C


def main():

    df = pd.read_csv('movies.csv')

    avg_rating = df.groupby(['movieId', 'year'])['rating'].agg(
        [('Count', 'size'), ('Mean', 'mean')]).sort_values(by='Mean', ascending=False)

    avg_rating = avg_rating.assign(
        wr=weighted_rating(avg_rating['Mean'], avg_rating['Count'], 500, mean(avg_rating['Mean'])))

    print(avg_rating)


if __name__ == '__main__':
    main()

Using the CSV:

movieId,title,rating,year
1,abc,3.5,1995
1,abc,3,1995
1,abc,4,1995
1,abc,3,1995
1,abc,5,1995
1,abc,3.5,1995
1,abc,4.5,1995
1,abc,0.5,1995
1,abc,3.5,1995
1,abc,4.5,1995
1,abc,4,1995
1,abc,5,1995
1,abc,4.5,1995
1,abc,4,1995
1,abc,4,1995
1,abc,4,1995
1,abc,4,1995
1,abc,3,1995
1,abc,4,1995
1,abc,3.5,1995
1,abc,3,1995
1,abc,4,1995
1,abc,5,1995
1,abc,4.5,1995
1,abc,5,1995
2,xyz,3,2000
2,xyz,2,2000
2,xyz,3.5,2000
2,xyz,4,2000
2,xyz,3.5,2000
2,xyz,5,2000
2,xyz,3.5,2000
2,xyz,3,2000
2,xyz,3,2000
2,xyz,2,2000
2,xyz,3.5,2000
2,xyz,3,2000
2,xyz,3,2000
2,xyz,4,2000
2,xyz,2,2000
2,xyz,3.5,2000
2,xyz,1,2000
3,pqr,3,1997
3,pqr,2,1997
3,pqr,3.5,1997
3,pqr,3.5,1997
3,pqr,3,1997
3,pqr,3,1997
3,pqr,3,1997
3,pqr,3,1997
3,pqr,4.5,1997
3,pqr,3.5,1997
3,pqr,4,1997
3,pqr,1.5,1997
3,pqr,2,1997
3,pqr,2,1997
3,pqr,2.5,1997
4,def,3,1999
4,def,2.5,1999
4,def,2.5,1999
4,def,0.5,1999
4,def,2,1999
4,def,3,1999
5,movie12,4,2006
5,movie12,4,2006
5,movie12,3,2006
5,movie12,1.5,2006
5,movie12,3,2006
5,movie12,2,2006
5,movie12,2.5,2006
5,movie12,3,2006
5,movie12,4,2006
5,movie12,0.5,2006
5,movie12,1,2006
5,movie12,3.5,2006
5,movie12,2,2006
5,movie12,3,2006
5,movie12,1.5,2006
5,movie12,2.5,2006
5,movie12,4,2006
5,movie12,4,2006
5,movie12,3.5,2006
5,movie12,3,2006
6,movie45,4,2012
6,movie45,3.5,2012
6,movie45,3.5,2012
6,movie45,4,2012
6,movie45,5,2012
6,movie45,2.5,2012
6,movie45,4,2012
6,movie45,4,2012
6,movie45,3.5,2012
6,movie45,5,2012
6,movie45,2,2012
6,movie45,4,2012
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the quick reply! If I need to save this output in a data frame instead of calling as a function, how can I do that? I tried assigning it to a data frame when defining a function but that gives me an error
Using a lambda is an option for you?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.