0

I have 2 below dataframes. I would like to merge both dataframes on "Item" column and below is my expected output and Code so far. However, below pandas code is displaying duplicate rows as shown below. Not sure what is wrong with the code ? I am thinking may be I have to do string search on the "Item"column and then merge the dataframe. However, I am not sure how to achieve this ? Please provide explanation with your code.

Thanks in advance for your time!

1st Dataframe

Item               ID
ABC                1
ABC                2
ABC                3
CD                 12
EF                 11

2nd DataFrame

Name      Item         Price
Name1     ABC           123.00
Name2     ABC           110
Name 3    ABC           100
Name4      CD           50

Expected Output

Name      Item        ID     Price
Name1     ABC          1    123.00
Name2     ABC          2     110
Name 3    ABC          3     100
 Name4    CD           12    50

Code

pd.merge(df1,df2,on="Item",how='outer')

Above code displays below output

    Item    ID  Name    Price
0   ABC 1   Name1   123.0
1   ABC 1   Name2   110.0
2   ABC 1   Name3   100.0
3   ABC 2   Name1   123.0
4   ABC 2   Name2   110.0
5   ABC 2   Name3   100.0
6   ABC 3   Name1   123.0
7   ABC 3   Name2   110.0
8   ABC 3   Name3   100.0
9   CD  12  Name4   50.0
10  EF  11  NaN     NaN

2 Answers 2

1

Enumerate Item with groupby cumcount in each DataFrame to allow for the positional alignment within the merge:

df3 = pd.merge(
    df1, df2,
    left_on=['Item', df1.groupby('Item').cumcount()],
    right_on=['Item', df2.groupby('Item').cumcount()]
)

df3:

  Item  key_1  ID    Name  Price
0  ABC      0   1   Name1  123.0
1  ABC      1   2   Name2  110.0
2  ABC      2   3  Name 3  100.0
3   CD      0  12   Name4   50.0

drop can be added to remove this added merge key column:

df3 = pd.merge(
    df1, df2,
    left_on=['Item', df1.groupby('Item').cumcount()],
    right_on=['Item', df2.groupby('Item').cumcount()]
).drop(columns='key_1')

df3:

  Item  ID    Name  Price
0  ABC   1   Name1  123.0
1  ABC   2   Name2  110.0
2  ABC   3  Name 3  100.0
3   CD  12   Name4   50.0

Setup and imports:

import pandas as pd

df1 = pd.DataFrame({
    'Item': ['ABC', 'ABC', 'ABC', 'CD', 'EF'],
    'ID': [1, 2, 3, 12, 11]
})

df2 = pd.DataFrame({
    'Name': ['Name1', 'Name2', 'Name 3', 'Name4'],
    'Item': ['ABC', 'ABC', 'ABC', 'CD'],
    'Price': [123.00, 110, 100, 50]
})
Sign up to request clarification or add additional context in comments.

Comments

1

The simplest method I can come up with is to add another column to your input dataframes which mimics the index [0, 1, 2, 3...]. This produces a unique entry even when the 'Item' is duplicated, and leads to a correct merge.

Example:

import pandas as pd

A = [[1, 2, 3, 4], ['A', 'A', 'B', 'C'], [0, 1, 2, 3]]
B = [[10, 20, 30, 40], ['A', 'A', 'B', 'C'], [0, 1, 2, 3]]

df = pd.DataFrame(A).T
df.columns = ['Val', 'ID', 'IDX']
print(df)

df2 = pd.DataFrame(B).T
df2.columns = ['Other', 'ID', 'IDX']
print(df2)

M = pd.merge(df, df2)
print(M)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.