How to get varying string splits into columns python pandas?

Question

I am trying to split delimited strings into separate columns. My data source is ml-latest-small.zip (size: 1 MB) from https://grouplens.org/datasets/movielens/

movie_df = pd.read_csv('movies.csv')
movie_df.head(10)

Reading in the file this way gives me the raw dataframe.

I tried to do

movies_df = pd.read_csv('movies.csv', sep='|', encoding='latin-1',
names=['movie_id', 'movie_title','unknown', 'action','adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy','film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war', 'western'])
movies_df.head(10)

but this squishes everything before the first separator into the first column, so the first genre of each movie also ends up in the first column. Otherwise, it is what I need.

How do I get all my genres, which vary in number from row to row, to each become their own column after movieId and title? I want each genre to be a column, with NaN where the movie does not have that genre, to set up for creating dummy variables later.

Edit: I did movies_df.head(10).to_dict() and the output was:

 'title': {0: 'Toy Story (1995)',
  1: 'Jumanji (1995)',
  2: 'Grumpier Old Men (1995)',
  3: 'Waiting to Exhale (1995)',
  4: 'Father of the Bride Part II (1995)',
  5: 'Heat (1995)',
  6: 'Sabrina (1995)',
  7: 'Tom and Huck (1995)',
  8: 'Sudden Death (1995)',
  9: 'GoldenEye (1995)'},
 'genres': {0: 'Adventure|Animation|Children|Comedy|Fantasy',
  1: 'Adventure|Children|Fantasy',
  2: 'Comedy|Romance',
  3: 'Comedy|Drama|Romance',
  4: 'Comedy',
  5: 'Action|Crime|Thriller',
  6: 'Comedy|Romance',
  7: 'Adventure|Children',
  8: 'Action',
  9: 'Action|Adventure|Thriller'}}

Could you add the result of movie_df.head(10).to_dict() to your question so that we don't have to download the zip file?
– Ben Grossmann, Commented Sep 29, 2022 at 22:33
I appreciate it, but I meant the raw dataframe when you read it in normally
– Ben Grossmann, Commented Sep 29, 2022 at 22:45
@BenGrossmann Okay I tried again, I wrote it using the code indicators just so it listed nicely even though it's output
– A Mere Pigeon, Commented Sep 29, 2022 at 22:52

Ben Grossmann · Accepted Answer · 2022-09-30 01:48:20Z

1

The following seems to work. That said, ideally it should avoid the Python-level loop over the genres column that builds the set of genres, since looping over a column's values one row at a time is slow compared with vectorized pandas operations.

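import numpy as np
import pandas as pd

movie_df = pd.read_csv('movies.csv')
# Collect every distinct genre that appears in the '|'-separated strings.
genre_set = set()
for lst in movie_df['genres'].str.split('|'):
    genre_set.update(lst)
# Add one column per genre: the genre name where it applies, NaN elsewhere.
for g in genre_set:
    movie_df[g] = np.nan
    # regex=False matches the name literally; some names contain regex metacharacters.
    movie_df.loc[movie_df['genres'].str.contains(g, regex=False), g] = g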

The first for loop can be replaced with the single line genre_set.update(*movie_df['genres'].str.split('|')); I don't believe this changes performance.
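As a quick check of that claim, here is a minimal sketch comparing the loop with the unpacked call. It uses a small stand-in Series built from genre strings in the question rather than the real movies.csv, and the variable names are only illustrative. It also shows Series.explode as one possible way to build the same set without an explicit loop; that variant is an addition here, not part of the answer's code.

```python
import pandas as pd

# Small stand-in for the 'genres' column of movies.csv (values copied from the question).
genres = pd.Series(['Adventure|Animation|Children|Comedy|Fantasy',
                    'Comedy|Romance',
                    'Action|Crime|Thriller'])

# Loop version: add each row's list of genres to the set.
loop_set = set()
for lst in genres.str.split('|'):
    loop_set.update(lst)

# One-line version: unpack every row's list as a separate argument to set.update.
unpacked_set = set()
unpacked_set.update(*genres.str.split('|'))

# Alternative without an explicit loop: explode flattens the lists to one genre per row.
exploded_set = set(genres.str.split('|').explode())

print(loop_set == unpacked_set == exploded_set)
```

All three approaches visit every genre string once, so any speed difference is likely to be small next to the cost of the split itself.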

The resulting frame movie_df looks like this:

                                title  \
0                    Toy Story (1995)   
1                      Jumanji (1995)   
2             Grumpier Old Men (1995)   
3            Waiting to Exhale (1995)   
4  Father of the Bride Part II (1995)   
5                         Heat (1995)   
6                      Sabrina (1995)   
7                 Tom and Huck (1995)   
8                 Sudden Death (1995)   
9                    GoldenEye (1995)   

                                        genres  Action  Romance  Thriller  \
0  Adventure|Animation|Children|Comedy|Fantasy     NaN      NaN       NaN   
1                   Adventure|Children|Fantasy     NaN      NaN       NaN   
2                               Comedy|Romance     NaN  Romance       NaN   
3                         Comedy|Drama|Romance     NaN  Romance       NaN   
4                                       Comedy     NaN      NaN       NaN   
5                        Action|Crime|Thriller  Action      NaN  Thriller   
6                               Comedy|Romance     NaN  Romance       NaN   
7                           Adventure|Children     NaN      NaN       NaN   
8                                       Action  Action      NaN       NaN   
9                    Action|Adventure|Thriller  Action      NaN  Thriller   

   Adventure  Crime  Children  Comedy  Drama  Animation  Fantasy  
0  Adventure    NaN  Children  Comedy    NaN  Animation  Fantasy  
1  Adventure    NaN  Children     NaN    NaN        NaN  Fantasy  
2        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
3        NaN    NaN       NaN  Comedy  Drama        NaN      NaN  
4        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
5        NaN  Crime       NaN     NaN    NaN        NaN      NaN  
6        NaN    NaN       NaN  Comedy    NaN        NaN      NaN  
7  Adventure    NaN  Children     NaN    NaN        NaN      NaN  
8        NaN    NaN       NaN     NaN    NaN        NaN      NaN  
9  Adventure    NaN       NaN     NaN    NaN        NaN      NaN
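
Since the stated goal is dummy variables, one possible way to skip both loops is Series.str.get_dummies, which splits on a separator and returns one 0/1 column per distinct value. The sketch below is an alternative to the code above, not a description of it. It assumes a small stand-in frame built from rows in the question, and the names dummies, labels and result are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for movies.csv, using rows from the question's to_dict() output.
movie_df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Heat (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Comedy|Romance',
               'Action|Crime|Thriller'],
})

# One 0/1 indicator column per distinct genre, with no Python-level loop over rows.
dummies = movie_df['genres'].str.get_dummies(sep='|')

# Optional: reproduce the "genre name or NaN" layout of the frame shown above.
labels = dummies.astype(bool).apply(
    lambda col: col.map({True: col.name, False: np.nan})
)

result = movie_df.join(labels)
print(result)
```

If the indicator columns themselves are what is needed, movie_df.join(dummies) can be used directly and the labels step dropped.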

edited Sep 30, 2022 at 1:48

answered Sep 29, 2022 at 23:06

Ben Grossmann


5 Comments

A Mere Pigeon Over a year ago

Is df supposed to be df, or is it supposed to be movie_df? When I use it strictly as df, I get the "can't find 'df'" error. But when I replace it with movie_df, I get the error "A value is trying to be set on a copy of a slice from a DataFrame".

Ben Grossmann Over a year ago

Yes, df was supposed to be movie_df; should be fixed now. The "error" that you get in the second case should actually be a warning rather than an error; in spite of this message, you should find that movie_df has the correct form in the end

Ben Grossmann Over a year ago

See this post regarding the warning

Ben Grossmann Over a year ago

I've updated the code so that the warning no longer appears.

A Mere Pigeon Over a year ago

ahh! Thank you, it worked! I appreciate all the help!!
