Python: Counting values for columns with multiple values per entry in dataframe

Question

I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].

Let's say it's something like this:

RestaurantName  City      RestaurantID  Cuisines
Restaurant A    Milan     31333         French, Spanish, Italian
Restaurant B    Shanghai  63551         Pizza, Burgers
Restaurant C    Dubai     7991          Burgers, Ice Cream

Here's copyable sample code:

import pandas as pd

rst = pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
                    'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333, 63551, 7991],
                    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream']})

I used string split to expand them into 8 different columns and added those to the dataframe.

csnsplit=rst.Cuisines.str.split(", ",expand=True)
rst["Cuisine1"]=csnsplit.loc[:,0]
rst["Cuisine2"]=csnsplit.loc[:,1]
rst["Cuisine3"]=csnsplit.loc[:,2]
rst["Cuisine4"]=csnsplit.loc[:,3]
rst["Cuisine5"]=csnsplit.loc[:,4]
rst["Cuisine6"]=csnsplit.loc[:,5]
rst["Cuisine7"]=csnsplit.loc[:,6]
rst["Cuisine8"]=csnsplit.loc[:,7]

Which leaves me with this: https://i.sstatic.net/AUSDY.png

Now I have no idea how to count individual cuisines, since they're spread across up to 8 different columns. For example, I'd like to see the top cuisine by city.

I also tried getting dummy columns for all of them, Cuisine1 to Cuisine8. This gives me duplicates like Cuisine1_Bakery, Cuisine2_Bakery, and so on. I could hypothetically merge like ones and keep only the one that has a count of "1," but I have no idea how to do that.

dummies=pd.get_dummies(data=rst,columns=["Cuisine1","Cuisine2","Cuisine3","Cuisine4","Cuisine5","Cuisine6","Cuisine7","Cuisine8"])
print(dummies.columns.tolist())

Which leaves me with all of these columns: https://i.sstatic.net/84spI.png

A third thing I tried was to get unique values from all 8 columns, and I have a deduped list of each type of cuisine. I can probably add all these columns to the dataframe, but wouldn't know how to fill the rows with a count for each one based on the column name.

import numpy as np

AllCsn=np.concatenate((rst.Cuisine1.unique(), 
                rst.Cuisine2.unique(),
                rst.Cuisine3.unique(),
                rst.Cuisine4.unique(),
                rst.Cuisine5.unique(),
                rst.Cuisine6.unique(),
                rst.Cuisine7.unique(),
                rst.Cuisine8.unique()
               ))
AllCsn=np.unique(AllCsn.astype(str))
AllCsn

Which leaves me with this:

https://i.sstatic.net/O9OpW.png

Later on I want to create a model where I might have a column for each cuisine. I could use the "unique" code above to get all the columns, but then I would need to figure out how to fill in a count based on the column header.

I am new to this, so please bear with me and let me know if I need to provide any more info.

Could you please include a small subset of your data as a copyable piece of code that can be used for testing, as well as your expected output for the provided data? See How to make good reproducible pandas examples for more information.
– Henry Ecker ♦, Commented Jun 18, 2021 at 16:55
This is an interesting problem, but your question is too broad. You need to figure out how to make your question more abstract. There's a good blog post on how to deal with list columns: towardsdatascience.com/… The pandas function explode should help you a lot: pandas.pydata.org/docs/reference/api/… There are related questions on SO: stackoverflow.com/questions/27263805/…
– Cornelius Roemer, Commented Jun 18, 2021 at 17:53

Henry Ecker · Accepted Answer · 2021-06-18 18:33:36Z

It sounds like you're looking for str.split without expanding, then explode:

rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')

This creates a frame like:

  RestaurantName      City  RestaurantID   Cuisines
0         Rest A     Milan         31333     French
0         Rest A     Milan         31333    Spanish
0         Rest A     Milan         31333    Italian
1         Rest B  Shanghai         63551      Pizza
1         Rest B  Shanghai         63551    Burgers
2         Rest C     Dubai          7991    Burgers
2         Rest C     Dubai          7991  Ice Cream
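
A minimal, self-contained sketch of the same split-then-explode step, built on the question's sample frame. The names exploded and renumbered are illustrative only and are not part of the original answer. It shows that explode repeats the original index label for every row that came from the same restaurant, and that ignore_index=True (available in pandas 1.1 and later) renumbers the rows instead.

```python
import pandas as pd

rst = pd.DataFrame({
    'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
    'City': ['Milan', 'Shanghai', 'Dubai'],
    'RestaurantID': [31333, 63551, 7991],
    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream'],
})

# Split each comma-separated string into a list without touching rst itself.
as_lists = rst.assign(Cuisines=rst['Cuisines'].str.split(', '))

# Give every list element its own row; the other columns are repeated.
exploded = as_lists.explode('Cuisines')

# Rows from the same restaurant share that restaurant's original index label.
print(exploded.index.tolist())  # [0, 0, 0, 1, 1, 2, 2]

# ignore_index=True renumbers the rows 0..n-1 instead.
renumbered = as_lists.explode('Cuisines', ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```

Keeping the repeated index can be handy for mapping exploded rows back to their source restaurant; renumbering is usually the safer choice before positional or label-based lookups.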

Then it sounds like you want either crosstab:

pd.crosstab(rst['City'], rst['Cuisines'])

Cuisines  Burgers  French  Ice Cream  Italian  Pizza  Spanish
City                                                         
Dubai           1       0          1        0      0        0
Milan           0       1          0        1      0        1
Shanghai        1       0          0        0      1        0

Or value_counts:

rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')

       City   Cuisines  counts
0     Dubai    Burgers       1
1     Dubai  Ice Cream       1
2     Milan     French       1
3     Milan    Italian       1
4     Milan    Spanish       1
5  Shanghai    Burgers       1
6  Shanghai      Pizza       1

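Max value_count per City via groupby head (value_counts sorts counts in descending order, so the first row in each City group holds that city's highest count; when cuisines tie, only one of them is kept):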

max_counts = (
    rst[['City', 'Cuisines']].value_counts()
        .groupby(level=0).head(1)
        .reset_index(name='counts')
)

max_counts:

       City Cuisines  counts
0     Dubai  Burgers       1
1     Milan   French       1
2  Shanghai  Burgers       1
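
The question also mentions wanting one column per cuisine for a later model, without duplicates such as Cuisine1_Bakery and Cuisine2_Bakery. One possible sketch, assuming the cuisines are still in the original comma-separated string column (that is, before the split and explode above) and that the separator is always exactly ", ". The names indicators, wide, and per_city are illustrative and not from the original answer.

```python
import pandas as pd

rst = pd.DataFrame({
    'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
    'City': ['Milan', 'Shanghai', 'Dubai'],
    'RestaurantID': [31333, 63551, 7991],
    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream'],
})

# One 0/1 indicator column per distinct cuisine, one row per restaurant.
# Each cuisine gets a single column no matter where it appears in the string.
indicators = rst['Cuisines'].str.get_dummies(sep=', ')

# Attach the indicators to the identifying columns (the indexes line up).
wide = rst[['RestaurantName', 'City']].join(indicators)

# Summing the indicators per city gives the number of restaurants
# offering each cuisine in that city.
per_city = wide.groupby('City')[indicators.columns.tolist()].sum()

print(wide)
print(per_city)
```

If the real data has inconsistent spacing around the commas, the strings would need cleaning first, since the separator here is matched literally.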

edited Jun 18, 2021 at 18:33

answered Jun 18, 2021 at 18:07

Henry Ecker♦


3 Comments

kdemi001 Over a year ago

This is great! 2 questions: After the "explode" step, why does it only show the first Cuisine in the Cuisine column? Are they still there, but somehow hidden? [I know I can just do Cuisine2 if I want to keep that column]. Is there a way when I do value counts, to only show the max for each city?

Henry Ecker Over a year ago

Explode turns nested values into rows. You'll notice the second Cuisine (that was in row 1) is now in row 2.

kdemi001 Over a year ago

Oh I see! My dataframe rows doubled. This is perfect, thank you so much!
