
I am trying to split a delimited string column into separate columns. My data source is ml-latest-small.zip (size: 1 MB) from https://grouplens.org/datasets/movielens/

import pandas as pd

movie_df = pd.read_csv('movies.csv')
movie_df.head(10)

Reading in the file gives me the raw dataframe.

I tried to do

movies_df = pd.read_csv('movies.csv', sep='|', encoding='latin-1',
names=['movie_id', 'movie_title','unknown', 'action','adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy','film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])
movies_df.head(10)

but this puts everything before the first separator into the first column, and the first genre from the split also ends up in that first column. Otherwise, it is what I need. See here.

How do I get each of my genres, which come in lists of varying lengths, into its own column after movieId and title? I want each genre to be a column, with NaN where the movie does not have that genre, to set up for creating dummy variables later.

Edit: I did movies_df.head(10).to_dict() and the output was:

 'title': {0: 'Toy Story (1995)',
  1: 'Jumanji (1995)',
  2: 'Grumpier Old Men (1995)',
  3: 'Waiting to Exhale (1995)',
  4: 'Father of the Bride Part II (1995)',
  5: 'Heat (1995)',
  6: 'Sabrina (1995)',
  7: 'Tom and Huck (1995)',
  8: 'Sudden Death (1995)',
  9: 'GoldenEye (1995)'},
 'genres': {0: 'Adventure|Animation|Children|Comedy|Fantasy',
  1: 'Adventure|Children|Fantasy',
  2: 'Comedy|Romance',
  3: 'Comedy|Drama|Romance',
  4: 'Comedy',
  5: 'Action|Crime|Thriller',
  6: 'Comedy|Romance',
  7: 'Adventure|Children',
  8: 'Action',
  9: 'Action|Adventure|Thriller'}}
  • Could you add the result of movie_df.head(10).to_dict() to your question so that we don't have to download the zip file? Commented Sep 29, 2022 at 22:33
  • I appreciate it, but I meant the raw dataframe when you read it in normally Commented Sep 29, 2022 at 22:45
  • @BenGrossmann Okay I tried again, I wrote it using the code indicators just so it listed nicely even though it's output Commented Sep 29, 2022 at 22:52

1 Answer

The following seems to work. That said, ideally it should be changed to avoid iterating through the genres column to get the list of genres, since a Python-level loop over the rows of a column is slow.

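import numpy as np
import pandas as pd

movie_df = pd.read_csv('movies.csv')
# Collect every distinct genre name from the '|'-separated strings.
genre_set = set()
for lst in movie_df['genres'].str.split('|'):
    genre_set.update(lst)
# One column per genre: the genre name where the movie has it, NaN otherwise.
for g in genre_set:
    # regex=False matches the genre name literally, not as a regular expression.
    has_genre = movie_df['genres'].str.contains(g, regex=False)
    movie_df[g] = has_genre.map({True: g, False: np.nan})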

The first for loop can be replaced with the single line genre_set.update(*movie_df['genres'].str.split('|')); I don't believe this changes performance.
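
As a possible way to avoid the explicit loop when collecting the genres, here is a minimal sketch using split followed by explode. The small frame below is a hypothetical stand-in for movies.csv (same 'title' and 'genres' column names as in the question), and the names exploded and genre_list are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv, using a few rows from the question.
movie_df = pd.DataFrame({
    "title": ["Toy Story (1995)", "Heat (1995)", "Sabrina (1995)"],
    "genres": ["Adventure|Animation", "Action|Crime|Thriller", "Comedy|Romance"],
})

# Split each string into a list, then give every list element its own row.
# The original row index is repeated for movies with several genres.
exploded = movie_df["genres"].str.split("|").explode()

# Distinct genre names, sorted so that the column order is reproducible.
genre_list = sorted(exploded.unique())
print(genre_list)
```

Sorting is optional; it is only there because a set has no fixed order, so the loop version can produce the genre columns in a different order from run to run.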

The resulting frame movie_df looks like this:

                                title  \
0                    Toy Story (1995)   
1                      Jumanji (1995)   
2             Grumpier Old Men (1995)   
3            Waiting to Exhale (1995)   
4  Father of the Bride Part II (1995)   
5                         Heat (1995)   
6                      Sabrina (1995)   
7                 Tom and Huck (1995)   
8                 Sudden Death (1995)   
9                    GoldenEye (1995)   

                                        genres  Action  Romance  Thriller  \
0  Adventure|Animation|Children|Comedy|Fantasy     NaN      NaN       NaN   
1                   Adventure|Children|Fantasy     NaN      NaN       NaN   
2                               Comedy|Romance     NaN  Romance       NaN   
3                         Comedy|Drama|Romance     NaN  Romance       NaN   
4                                       Comedy     NaN      NaN       NaN   
5                        Action|Crime|Thriller  Action      NaN  Thriller   
6                               Comedy|Romance     NaN  Romance       NaN   
7                           Adventure|Children     NaN      NaN       NaN   
8                                       Action  Action      NaN       NaN   
9                    Action|Adventure|Thriller  Action      NaN  Thriller   

   Adventure  Crime  Children  Comedy  Drama  Animation  Fantasy  
0  Adventure    NaN  Children  Comedy    NaN  Animation  Fantasy  
1  Adventure    NaN  Children     NaN    NaN        NaN  Fantasy  
2        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
3        NaN    NaN       NaN  Comedy  Drama        NaN      NaN  
4        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
5        NaN  Crime       NaN     NaN    NaN        NaN      NaN  
6        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
7  Adventure    NaN  Children     NaN    NaN        NaN      NaN  
8        NaN    NaN       NaN     NaN    NaN        NaN      NaN  
9  Adventure    NaN       NaN     NaN    NaN        NaN      NaN  
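
Since the stated goal is to create dummy variables later, it may be simpler to build 0/1 indicator columns directly with Series.str.get_dummies. The sketch below assumes the same 'genres' column as above and uses a small hypothetical frame instead of movies.csv; the names dummies, labelled and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies.csv, using a few rows from the question.
movie_df = pd.DataFrame({
    "title": ["Toy Story (1995)", "Heat (1995)", "Sabrina (1995)"],
    "genres": ["Adventure|Animation", "Action|Crime|Thriller", "Comedy|Romance"],
})

# One 0/1 column per genre found in the '|'-separated strings.
dummies = movie_df["genres"].str.get_dummies(sep="|")

# Optional: reproduce the "genre name or NaN" layout from the loop version.
labelled = dummies.apply(lambda col: col.map({1: col.name, 0: np.nan}))

# Attach the indicator columns to the original frame.
result = movie_df.join(dummies)
print(result.head())
```

If the NaN layout is only an intermediate step toward dummy variables, the dummies frame alone may already be what is needed.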

5 Comments

Is df supposed to be df, or is it supposed to be movie_df? When I use it strictly as df, it gives me the "can't find 'df'" error. But when I replace it with movie_df, I get the error "A value is trying to be set on a copy of a slice from a DataFrame".
Yes, df was supposed to be movie_df; that should be fixed now. The "error" you get in the second case is actually a warning rather than an error; in spite of the message, you should find that movie_df has the correct form in the end.
See this post regarding the warning
I've updated the code so that the warning no longer appears.
ahh! Thank you, it worked! I appreciate all the help!!
