1

I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].

Let's say it's something like this:

RestaurantName City   Restaurant ID Cuisines
Restaurant A    Milan    31333         French, Spanish, Italian
Restaurant B    Shanghai 63551         Pizza, Burgers
Restaurant C    Dubai    7991          Burgers, Ice Cream

Here's a copy-able code as a sample:

rst= pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'], 
                   'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333,63551,7991],
                    'Cuisines':['French, Spanish, Italian','Pizza, Burgers','Burgers, Ice Cream']})

I used string split to expand them into 8 different columns and added it to the dataframe.

csnsplit=rst.Cuisines.str.split(", ",expand=True)
rst["Cuisine1"]=csnsplit.loc[:,0]
rst["Cuisine2"]=csnsplit.loc[:,1]
rst["Cuisine3"]=csnsplit.loc[:,2]
rst["Cuisine4"]=csnsplit.loc[:,3]
rst["Cuisine5"]=csnsplit.loc[:,4]
rst["Cuisine6"]=csnsplit.loc[:,5]
rst["Cuisine7"]=csnsplit.loc[:,6]
rst["Cuisine8"]=csnsplit.loc[:,7]

Which leaves me with this: https://i.sstatic.net/AUSDY.png

Now I have no idea how to count individual cuisines since they're across up to 8 different columns, let's say if I want to see top cuisine by city.

I also tried getting dummy columns for all of them, Cuisine 1 to Cuisine 8. This is causing me to have duplicates like Cuisine1_Bakery, Cusine2_Bakery, and so on. I could hypothetically merge like ones and keeping only the one that has a count of "1," but no idea how to do that.

dummies=pd.get_dummies(data=rst,columns=["Cuisine1","Cuisine2","Cuisine3","Cuisine4","Cuisine5","Cuisine6","Cuisine7","Cuisine8"])
print(dummies.columns.tolist())

Which leaves me with all of these columns: https://i.sstatic.net/84spI.png

A third thing I tried was to get unique values from all 8 columns, and I have a deduped list of each type of cuisine. I can probably add all these columns to the dataframe, but wouldn't know how to fill the rows with a count for each one based on the column name.

AllCsn=np.concatenate((rst.Cuisine1.unique(), 
                rst.Cuisine2.unique(),
                rst.Cuisine3.unique(),
                rst.Cuisine4.unique(),
                rst.Cuisine5.unique(),
                rst.Cuisine6.unique(),
                rst.Cuisine7.unique(),
                rst.Cuisine8.unique()
               ))
AllCsn=np.unique(AllCsn.astype(str))
AllCsn

Which leaves me with this:

https://i.sstatic.net/O9OpW.png

I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.

I am new to this, so please bear with me and let me know if I need to provide any more info.

3

1 Answer 1

2

It sounds like you're looking for str.split without expanding, then explode:

rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')

Creates a frame like:

  RestaurantName      City  RestaurantID   Cuisines
0         Rest A     Milan         31333     French
0         Rest A     Milan         31333    Spanish
0         Rest A     Milan         31333    Italian
1         Rest B  Shanghai         63551      Pizza
1         Rest B  Shanghai         63551    Burgers
2         Rest C     Dubai          7991    Burgers
2         Rest C     Dubai          7991  Ice Cream

Then it sounds like either crosstab:

pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines  Burgers  French  Ice Cream  Italian  Pizza  Spanish
City                                                         
Dubai           1       0          1        0      0        0
Milan           0       1          0        1      0        1
Shanghai        1       0          0        0      1        0

Or value_counts

rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
       City   Cuisines  counts
0     Dubai    Burgers       1
1     Dubai  Ice Cream       1
2     Milan     French       1
3     Milan    Italian       1
4     Milan    Spanish       1
5  Shanghai    Burgers       1
6  Shanghai      Pizza       1

Max value_count per City via groupby head:

max_counts = (
    rst[['City', 'Cuisines']].value_counts()
        .groupby(level=0).head(1)
        .reset_index(name='counts')
)

max_counts:

       City Cuisines  counts
0     Dubai  Burgers       1
1     Milan   French       1
2  Shanghai  Burgers       1
Sign up to request clarification or add additional context in comments.

3 Comments

This is great! 2 questions: After the "explode" step, why does it only show the first Cuisine in the Cuisine column? Are they still there, but somehow hidden? [I know I can just do Cuisine2 if I want to keep that column]. Is there a way when I do value counts, to only show the max for each city?
Explode turns nested values into rows. You'll notice the second Cuisine (that was in row 1) is now in row 2.
Oh I see! My dataframe rows doubled. This is perfect, thank you so much!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.