Creating a dataframe columns with lists of matching values from another dataframe

Question

I am trying to create a pandas dataframe column (in df1) where each element is a list of all values from another dataframe (df2) that match an existing columns in df1. This is different from a pandas left merge because that function would create new rows for each time the df1 value is duplicated in df2.

I have code that does exactly what I need, but it is very slow (on my large dataframe) and I'm almost sure there is a built-in pandas method to do what I'm looking for, but I haven't been able to find it.

Here is a simplified version of my current code.

import pandas as pd
df1=pd.DataFrame({'col4': [1,2,3,4,5,6]})
df2=pd.DataFrame({'col2':['a','b','c','d','e','f','g'],'col3':[1,1,2,3,4,4,6]})
df1['col1']=pd.Series(list(df2['col2'][df2['col3']==df1['col4'][i]]) for i in df1.index)

Is there a way to use the pandas apply method or something similar to do this instead of iterating over df1.index? Thanks!

Quang Hoang · Accepted Answer · 2019-05-08 19:15:59Z

3

Try:

df1.set_index('col4', inplace=True)
df1['col4'] = df2.groupby('col3').col2.apply(list)
df1.reset_index(drop=True, inplace=True)

edited May 8, 2019 at 19:15

answered May 8, 2019 at 19:09

Quang Hoang

151k11 gold badges64 silver badges86 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

yatu · Accepted Answer · 2019-05-08 19:16:05Z

2

This actually is a merge problem, however you also need to groupby and create lists from the groups:

(df1.merge(df2, left_on='col4', right_on='col3', how='right')
    .groupby('col3').col2
    .apply(list))

      col3
1    [a, b]
2       [c]
3       [d]
4    [e, f]
6       [g]
Name: col2, dtype: object

edited May 8, 2019 at 19:16

answered May 8, 2019 at 19:10

yatu

88.6k12 gold badges93 silver badges148 bronze badges

Comments

soundstripe · Accepted Answer · 2019-05-08 19:45:14Z

1

This might get you most of the way there, but I believe the inefficiency might remain because of trying to store a list in the DataFrame. I think you might want to look into a MultiIndex, but I'm not really sure what your use case is.

import pandas as pd
df1 = pd.DataFrame({'col4': [1,2,3,4,5,6]})
df2 = pd.DataFrame({'col2':['a','b','c','d','e','f','g'],'col3':[1,1,2,3,4,4,6]})
result = df1.join(df2.groupby('col3').agg(list), on='col4')

answered May 8, 2019 at 19:45

soundstripe

1,48411 silver badges22 bronze badges

Collectives™ on Stack Overflow

Creating a dataframe columns with lists of matching values from another dataframe

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related