comparing two pandas dataframes with list in column

Question

I have two data frames df1 and df2:

df1 : 
Name A_list
abcd (apple,orange,banana)
bcde (orange,mango)
cdef (apple,pineapple)

df2 :
City B_list
C1   (apple,mango,banana)
C2   (mango)
C3   (pineapple,banana)

I want to make a new dataframe df3

Name A_list City
abcd (apple,orange,banana) (C1,C3)
bcde (orange,mango) (C1,C2)
cdef (apple,pineapple) (C1,C3)

i.e., going through A_list in df1 and identifying the cities that each fruit came from. I am not sure how to merge df1 and df2 using the lists A_list and B_list.

For (orange,mango) in the second line, do you mean (C1,C2) rather than (C1,C3)?
– IanS, Commented Aug 25, 2016 at 15:20
Yes. Ignore elements in A_list that do not appear in B_list, and vice versa.
– Ssank, Commented Aug 25, 2016 at 15:25

piRSquared · Accepted Answer · 2016-08-25 16:15:30Z

Setup

import pandas as pd

df1 = pd.DataFrame([
        ['abcd', ('apple', 'orange', 'banana')],
        ['bcde', ('orange', 'mango')],
        ['cdef', ('apple', 'pineapple')]
    ], columns=['Name', 'A_list'])
df2 = pd.DataFrame([
        ['C1', ('apple', 'mango', 'banana')],
        ['C2', ('mango',)],  # trailing comma: a one-element tuple, not a string
        ['C3', ('pineapple', 'banana')]
    ], columns=['City', 'B_list'])
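One detail worth double-checking in a setup like this is how single-element tuples are written. The snippet below is only an illustration with made-up variable names (`one_string`, `one_tuple`); it shows that parentheses alone do not create a tuple in Python.

```python
# Parentheses on their own only group an expression.
one_string = ('mango')    # this is the str 'mango'
one_tuple = ('mango',)    # the trailing comma makes it a 1-tuple

print(type(one_string).__name__)  # str
print(type(one_tuple).__name__)   # tuple

# Iterating over the string yields characters rather than fruits.
print(tuple(one_string))  # ('m', 'a', 'n', 'g', 'o')
```

If a cell accidentally holds a plain string, any step that iterates over it may split it into characters, so the trailing comma is the safer spelling.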

Massage the data into long form (one row per fruit)

# Expand each tuple into columns, stack them into rows, then drop the position level
s2 = df2.set_index('City').squeeze() \
    .apply(pd.Series) \
    .stack().reset_index(level=1, drop=True)

s2

City
C1        apple
C1        mango
C1       banana
C2        mango
C3    pineapple
C3       banana
dtype: object

# Same reshaping for df1: one row per (Name, fruit) pair
s1 = df1.set_index('Name').squeeze() \
    .apply(pd.Series) \
    .stack().reset_index(level=1, drop=True)

s1

Name
abcd        apple
abcd       orange
abcd       banana
bcde       orange
bcde        mango
cdef        apple
cdef    pineapple
dtype: object
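As a possible alternative to the `apply(pd.Series).stack()` reshaping, `Series.explode` (available from pandas 0.25, so newer than this answer) produces the same long format in one call. This is a sketch, not the answer's method, and `s1_alt` is a name made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [
        ["abcd", ("apple", "orange", "banana")],
        ["bcde", ("orange", "mango")],
        ["cdef", ("apple", "pineapple")],
    ],
    columns=["Name", "A_list"],
)

# explode() emits one row per tuple element and repeats the index label,
# giving one (Name, fruit) pair per row.
s1_alt = df1.set_index("Name")["A_list"].explode()
print(s1_alt)
```

The resulting Series keeps the column name `A_list`, so it still needs the `rename('fruit')` step before merging.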

df3 = pd.merge(*[s.rename('fruit').reset_index() for s in [s1, s2]])

df3
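The merge above relies on two `pd.merge` defaults: it joins on the columns the two frames share (here only `fruit`) and it performs an inner join. The sketch below spells both out on hand-written long-form frames; the names `left`, `right`, and `pairs` are illustrative only.

```python
import pandas as pd

# Long-form (Name, fruit) and (City, fruit) pairs, typed out by hand.
left = pd.DataFrame({
    "Name": ["abcd", "abcd", "abcd", "bcde", "bcde", "cdef", "cdef"],
    "fruit": ["apple", "orange", "banana", "orange", "mango", "apple", "pineapple"],
})
right = pd.DataFrame({
    "City": ["C1", "C1", "C1", "C2", "C3", "C3"],
    "fruit": ["apple", "mango", "banana", "mango", "pineapple", "banana"],
})

# The inner join keeps only fruits present on both sides, so 'orange'
# (which no city lists) disappears from the result.
pairs = pd.merge(left, right, on="fruit", how="inner")
print(pairs)
```

Passing `how="left"` instead would keep the `orange` rows, with a missing value in `City`.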

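def tuplify(series):
    # Deduplicate, then sort so the tuple order is reproducible across runs
    return tuple(sorted(set(series)))

# Aggregate the 'fruit' and 'City' columns into one tuple per Name
df3.groupby('Name') \
    .agg(tuplify) \
    .rename(columns=dict(fruit='A_list')).reset_index()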

Notice that 'orange' is missing from A_list because no City has it in its B_list, so the inner merge drops it. If you want to keep the original A_list, overwrite the column from df1:

df3 = pd.merge(*[s.rename('fruit').reset_index() for s in [s1, s2]])
df3 = df3.groupby('Name') \
    .agg(tuplify) \
    .rename(columns=dict(fruit='A_list'))

# Both sides are indexed by Name here, so the assignment aligns row by row
df3['A_list'] = df1.set_index('Name')['A_list']
df3.reset_index()
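The whole pipeline can also be sketched end to end with `DataFrame.explode` and `groupby(...).agg(...)`. This is one possible restatement under the assumption of a recent pandas (0.25 or later), not the answer's original code; `add_cities` is a hypothetical helper name, and the city tuples are sorted only to make the output order predictable.

```python
import pandas as pd

def add_cities(df1, df2):
    """Return a copy of df1 with a City column of sorted tuples (hypothetical helper)."""
    # Long form: one row per (Name, fruit) and one row per (City, fruit).
    fruits = df1[["Name", "A_list"]].explode("A_list").rename(columns={"A_list": "fruit"})
    cities = df2[["City", "B_list"]].explode("B_list").rename(columns={"B_list": "fruit"})

    # Inner join on fruit: fruits that no city lists are dropped here.
    pairs = fruits.merge(cities, on="fruit", how="inner")

    # Collapse to one tuple of distinct cities per Name.
    city_by_name = pairs.groupby("Name")["City"].agg(lambda s: tuple(sorted(set(s))))

    # Map back onto df1 so the original A_list is left untouched;
    # a Name with no matching city would get NaN.
    out = df1.copy()
    out["City"] = out["Name"].map(city_by_name)
    return out

df1 = pd.DataFrame(
    [
        ["abcd", ("apple", "orange", "banana")],
        ["bcde", ("orange", "mango")],
        ["cdef", ("apple", "pineapple")],
    ],
    columns=["Name", "A_list"],
)
df2 = pd.DataFrame(
    [
        ["C1", ("apple", "mango", "banana")],
        ["C2", ("mango",)],
        ["C3", ("pineapple", "banana")],
    ],
    columns=["City", "B_list"],
)

result = add_cities(df1, df2)
print(result)
```

Mapping the aggregated cities back onto `df1` avoids the separate step of restoring `A_list`, since the original column is never modified.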
