I have two data frames df1 and df2 as shown below:

df1:

  company     occupation
0       A  Administrator
1       B       Engineer
2       C       Engineer
3       D        Account
4       E  Administrator
5       F       Engineer

df2:

      occupation description
0        Account     balance
1       Engineer    database
2  Administrator      chores
3  Administrator     calling
4       Engineer    frontend
5       Engineer  backendend

What I want:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend

I tried pd.merge(df1, df2, how="inner"), but I always get duplicate rows:

   company     occupation description
0        A  Administrator      chores
1        A  Administrator     calling
2        E  Administrator      chores
3        E  Administrator     calling
4        B       Engineer    database
5        B       Engineer    frontend
6        B       Engineer  backendend
7        C       Engineer    database
8        C       Engineer    frontend
9        C       Engineer  backendend
10       F       Engineer    database
11       F       Engineer    frontend
12       F       Engineer  backendend
13       D        Account     balance

code:

import pandas as pd
df1 = pd.DataFrame({"company":["A","B","C","D","E","F"],"occupation":["Administrator","Engineer","Engineer","Account","Administrator","Engineer"]})
df2 = pd.DataFrame({"occupation":["Account","Engineer","Administrator","Administrator","Engineer","Engineer"],"description":["balance","database","chores","calling","frontend","backendend"]})
# df3 is the desired output (descriptions for C and D match the table above)
df3 = pd.DataFrame({"company":["A","B","C","D","E","F"],"occupation":["Administrator","Engineer","Engineer","Account","Administrator","Engineer"],"description":["chores","database","frontend","balance","calling","backendend"]})
df4 = pd.merge(df1,df2,how="inner")
display(df1)
display(df2)
display(df3)
display(df4)

  • Is your desired output accurate? I guess you want to match each occurrence of an occupation in df1 to the corresponding occurrence in df2, e.g. the 1st Engineer is assigned 'database'. If so, your desired output may be inaccurate. Commented Jul 19, 2021 at 15:52
  • I think frontend and balance might be swapped. Commented Jul 19, 2021 at 15:53
  • Yes, it's a typo, I revised it. Commented Jul 19, 2021 at 15:59

2 Answers

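A plain merge on occupation pairs every row of an occupation in df1 with every row of that occupation in df2, which is where the duplicates come from. Let's create a key column with groupby cumcount to track each row's position within its occupation, then merge on occupation and key: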

df1['key'] = df1.groupby('occupation').cumcount()
df2['key'] = df2.groupby('occupation').cumcount()
df4 = df1.merge(df2, on=['occupation', 'key']).drop('key', axis=1)

df4:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend

df4 without dropping key:

  company     occupation  key description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
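
To see where the key values come from, it may help to look at cumcount on its own. A minimal sketch using df1 from the question (the variable name occurrence is only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})

# cumcount numbers the rows inside each occupation group,
# starting at 0 and following the original row order
occurrence = df1.groupby("occupation").cumcount()
print(occurrence.tolist())  # [0, 0, 1, 0, 1, 2]
```

The third Engineer in df1 gets 2, so it can only pair with the third Engineer in df2, which is what removes the duplicates.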

You can also do this without modifying df1 or df2 by merging on the Series directly:

df4 = df1.merge(
    df2,
    left_on=['occupation', df1.groupby('occupation').cumcount()],
    right_on=['occupation', df2.groupby('occupation').cumcount()]
).drop('key_1', axis=1)

df4:

  company     occupation description
0       A  Administrator      chores
1       B       Engineer    database
2       C       Engineer    frontend
3       D        Account     balance
4       E  Administrator     calling
5       F       Engineer  backendend
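
If you need this in more than one place, the same idea could be wrapped in a small function. This is only a sketch: merge_by_occurrence and the temporary column name _occurrence are made up for illustration and are not part of pandas.

```python
import pandas as pd

def merge_by_occurrence(left, right, key, how="inner"):
    """Hypothetical helper: pair the n-th row of each `key` group in `left`
    with the n-th row of the same group in `right`."""
    tmp = "_occurrence"  # assumed not to clash with an existing column
    # assign returns new frames, so the inputs are left untouched
    left_keyed = left.assign(**{tmp: left.groupby(key).cumcount()})
    right_keyed = right.assign(**{tmp: right.groupby(key).cumcount()})
    merged = left_keyed.merge(right_keyed, on=[key, tmp], how=how)
    return merged.drop(columns=tmp)

df1 = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "occupation": ["Administrator", "Engineer", "Engineer",
                   "Account", "Administrator", "Engineer"],
})
df2 = pd.DataFrame({
    "occupation": ["Account", "Engineer", "Administrator",
                   "Administrator", "Engineer", "Engineer"],
    "description": ["balance", "database", "chores",
                    "calling", "frontend", "backendend"],
})

result = merge_by_occurrence(df1, df2, key="occupation")
print(result)
```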

You can synthesise the missing part of the merge condition: the position of each occupation within its data frame.

df1 = pd.DataFrame({'company': ['A', 'B', 'C', 'D', 'E', 'F'],
 'occupation': ['Administrator','Engineer','Engineer','Account','Administrator','Engineer']})

df2 = pd.DataFrame({'occupation': ['Account','Engineer','Administrator','Administrator','Engineer','Engineer'],
 'description': ['balance','database','chores','calling','frontend','backendend']})

df1.assign(oid=df1.groupby("occupation", as_index=False).cumcount()).merge(
    df2.assign(oid=df2.groupby("occupation", as_index=False).cumcount()),
    on=["occupation", "oid"],
)

  company     occupation  oid description
0       A  Administrator    0      chores
1       B       Engineer    0    database
2       C       Engineer    1    frontend
3       D        Account    0     balance
4       E  Administrator    1     calling
5       F       Engineer    2  backendend
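
One thing to watch for is frames where an occupation does not occur the same number of times on both sides. A small sketch with made-up data (left and right are not from the question): with the default inner merge the unmatched row would be dropped, while how="left" keeps it with a missing description.

```python
import pandas as pd

# Hypothetical data: three Engineers on the left, only two descriptions on the right
left = pd.DataFrame({"company": ["A", "B", "C"],
                     "occupation": ["Engineer", "Engineer", "Engineer"]})
right = pd.DataFrame({"occupation": ["Engineer", "Engineer"],
                      "description": ["database", "frontend"]})

out = left.assign(oid=left.groupby("occupation").cumcount()).merge(
    right.assign(oid=right.groupby("occupation").cumcount()),
    on=["occupation", "oid"],
    how="left",  # keep every row of left, even without a partner in right
)
print(out)
```

Here company C has oid 2, which has no counterpart in right, so its description comes back as NaN.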

2 Comments

Is this not the same as my answer?
@HenryEcker Semantically it is equivalent: both track position using cumcount().
