I have two data frames df1 and df2:

df1 : 
Name A_list
abcd (apple,orange,banana)
bcde (orange,mango)
cdef (apple,pineapple)

df2 :
City B_list
C1   (apple,mango,banana)
C2   (mango)
C3   (pineapple,banana)

I want to make a new dataframe df3:

Name A_list City
abcd (apple,orange,banana) (C1,C3)
bcde (orange,mango) (C1,C2)
cdef (apple,pineapple) (C1,C3)

That is, I want to go through A_list in df1 and identify the cities that each fruit came from. I am not sure how to merge df1 and df2 using the lists A_list and B_list.

  • For (orange,mango) in the second line, do you mean (C1,C2) rather than (C1,C3)? Commented Aug 25, 2016 at 15:20
  • Yes, I corrected my post Commented Aug 25, 2016 at 15:21
  • And there is no orange in df2 so we just ignore it? Commented Aug 25, 2016 at 15:23
  • Yes, ignore elements in A_list that are not there in B_list and vice versa Commented Aug 25, 2016 at 15:25
  • Show what you have tried. Commented Aug 25, 2016 at 15:51

1 Answer

Setup

import pandas as pd

df1 = pd.DataFrame([
        ['abcd', ('apple', 'orange', 'banana')],
        ['bcde', ('orange', 'mango')],
        ['cdef', ('apple', 'pineapple')]
    ], columns=['Name', 'A_list'])
df2 = pd.DataFrame([
        ['C1', ('apple', 'mango', 'banana')],
        # trailing comma needed: ('mango') is just the string 'mango'
        ['C2', ('mango',)],
        ['C3', ('pineapple', 'banana')]
    ], columns=['City', 'B_list'])

Massage the data into long form, with one fruit per row

s2 = df2.set_index('City').squeeze() \
    .apply(pd.Series) \
    .stack().reset_index(1, drop=True)

s2

City
C1        apple
C1        mango
C1       banana
C2        mango
C3    pineapple
C3       banana
dtype: object

s1 = df1.set_index('Name').squeeze() \
    .apply(pd.Series) \
    .stack().reset_index(1, drop=True)

s1

Name
abcd        apple
abcd       orange
abcd       banana
bcde       orange
bcde        mango
cdef        apple
cdef    pineapple
dtype: object
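
On pandas 0.25 or later, the same flattening can probably be done with `explode` instead of `apply(pd.Series)` plus `stack`. This is a minimal sketch, not part of the original answer; the names `long1` and `long2` are made up for illustration, and it keeps the fruit in a regular column rather than in an indexed Series.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['abcd', 'bcde', 'cdef'],
    'A_list': [('apple', 'orange', 'banana'), ('orange', 'mango'), ('apple', 'pineapple')],
})
df2 = pd.DataFrame({
    'City': ['C1', 'C2', 'C3'],
    'B_list': [('apple', 'mango', 'banana'), ('mango',), ('pineapple', 'banana')],
})

# explode() emits one row per tuple element and repeats the other columns.
# long1 / long2 are hypothetical names, not from the original answer.
long1 = (df1.explode('A_list')
            .rename(columns={'A_list': 'fruit'})
            .reset_index(drop=True))
long2 = (df2.explode('B_list')
            .rename(columns={'B_list': 'fruit'})
            .reset_index(drop=True))
```

Both frames end up with a shared `fruit` column, which is what the merge step needs.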

df3 = pd.merge(*[s.rename('fruit').reset_index() for s in [s1, s2]])

df3
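
The one-line merge above relies on `pd.merge` defaulting to an inner join on the columns both frames share. Spelled out, it is roughly equivalent to the sketch below. The names `left`, `right` and `pairs` are illustrative only, and the two frames are typed in by hand to mirror `s1` and `s2` after `rename('fruit').reset_index()`.

```python
import pandas as pd

# Long-form data, one row per (Name, fruit) and per (City, fruit).
left = pd.DataFrame({
    'Name': ['abcd', 'abcd', 'abcd', 'bcde', 'bcde', 'cdef', 'cdef'],
    'fruit': ['apple', 'orange', 'banana', 'orange', 'mango', 'apple', 'pineapple'],
})
right = pd.DataFrame({
    'City': ['C1', 'C1', 'C1', 'C2', 'C3', 'C3'],
    'fruit': ['apple', 'mango', 'banana', 'mango', 'pineapple', 'banana'],
})

# Inner join on the shared column: every (Name, City) pair that has a
# fruit in common gets a row; fruits found on only one side are dropped.
pairs = pd.merge(left, right, on='fruit', how='inner')
```

Because it is an inner join, 'orange' (which appears in no city) does not show up in the result.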


def tuplify(series):
    return tuple(set(series))

df3.groupby('Name') \
    .apply(lambda df: df.drop('Name', axis=1).apply(tuplify)) \
    .rename(columns=dict(fruit='A_list')).reset_index()
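
Since `set` has no guaranteed order, the tuples produced by `tuplify` can come out in a different order from run to run. If a stable order matters, one option is to sort before building the tuple and use named aggregation (pandas 0.25+). This is a sketch under those assumptions; `unique_sorted` and `summary` are hypothetical names, and `pairs` is typed in by hand to stand for the merged frame.

```python
import pandas as pd

# Stand-in for the merged Name/fruit/City frame.
pairs = pd.DataFrame({
    'Name':  ['abcd', 'abcd', 'abcd', 'bcde', 'bcde', 'cdef', 'cdef'],
    'fruit': ['apple', 'banana', 'banana', 'mango', 'mango', 'apple', 'pineapple'],
    'City':  ['C1', 'C1', 'C3', 'C1', 'C2', 'C1', 'C3'],
})

def unique_sorted(values):
    """Collapse a group to a tuple of its distinct values, sorted."""
    return tuple(sorted(set(values)))

# One row per Name, with de-duplicated fruits and cities.
summary = (
    pairs.groupby('Name')
         .agg(A_list=('fruit', unique_sorted), City=('City', unique_sorted))
         .reset_index()
)
```

As with the original code, A_list here only contains fruits that matched at least one city.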


Notice that 'orange' is missing from A_list because no 'City' has it. If you want to keep the original A_list, assign it back from df1:

df3 = pd.merge(*[s.rename('fruit').reset_index() for s in [s1, s2]])
df3 = df3.groupby('Name') \
    .apply(lambda df: df.drop('Name', axis=1).apply(tuplify)) \
    .rename(columns=dict(fruit='A_list'))

df3['A_list'] = df1.set_index('Name')['A_list']
df3.reset_index()
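
If the frames are small, a plain-Python lookup may be easier to follow than the merge. The sketch below is an alternative to the approach above, not a restatement of it: it builds a fruit-to-cities dictionary from df2 and maps each A_list through it. `fruit_to_cities` and `cities_for` are names invented for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['abcd', 'bcde', 'cdef'],
    'A_list': [('apple', 'orange', 'banana'), ('orange', 'mango'), ('apple', 'pineapple')],
})
df2 = pd.DataFrame({
    'City': ['C1', 'C2', 'C3'],
    'B_list': [('apple', 'mango', 'banana'), ('mango',), ('pineapple', 'banana')],
})

# Map each fruit to the set of cities that list it.
fruit_to_cities = {}
for city, fruits in zip(df2['City'], df2['B_list']):
    for fruit in fruits:
        fruit_to_cities.setdefault(fruit, set()).add(city)

def cities_for(fruits):
    """Union of the cities for every fruit; unknown fruits are ignored."""
    found = set()
    for fruit in fruits:
        found |= fruit_to_cities.get(fruit, set())
    return tuple(sorted(found))

df3 = df1.copy()
df3['City'] = df3['A_list'].apply(cities_for)
```

This keeps A_list untouched, so 'orange' stays in it, and a Name whose fruits match no city would get an empty tuple instead of being dropped.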
