Combine dataframes in Python but avoid duplicates

Question

I have two dataframes, df1 and df2, which I am joining on columns with different names. When I perform this join, the result contains far more rows than expected, many of them duplicates. How can I avoid this? I am using an outer join.

Data

df1

ID  Date
a   1/1/2022
a   1/1/2022
b   2/1/2022
b   2/1/2022
b   2/1/2022

df2

Quarter     State
1/1/2022    ny
4/3/2023    ca
6/1/2024    ca
7/1/2021    wa

Desired

ID  Date        Quarter     State
a   1/1/2022    1/1/2022    ny
a   1/1/2022    na          na
b   2/1/2022    na          na
b   2/1/2022    na          na
b   2/1/2022    na          na

Doing

join = pd.merge(df1, df2, left_on='Date', right_on='Quarter', how='outer')

However, the output has many more rows than I began with. I would think that a left join would solve this, but I am still getting duplicates. I am still researching this. Any suggestions are appreciated.

Out of curiosity, what are you trying to accomplish here? What are IDs 'a' and 'b', and what is the output table meant to represent?
– Ratler, Commented Jun 10, 2021 at 18:01

Mercury · Accepted Answer · 2021-06-10 17:22:28Z

3

Indeed, a left join should do it. Simply try changing "outer" to "left" in the how argument.

join = pd.merge( df1, df2, left_on='Date', right_on='Quarter', how='left')
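To see what each join type does on the sample data from the question, here is a minimal sketch. The dataframe definitions are reconstructed from the tables in the question (dates kept as plain strings), so the real data may behave differently.

```python
import pandas as pd

# Sample data reconstructed from the question (dates kept as strings)
df1 = pd.DataFrame({
    "ID": ["a", "a", "b", "b", "b"],
    "Date": ["1/1/2022", "1/1/2022", "2/1/2022", "2/1/2022", "2/1/2022"],
})
df2 = pd.DataFrame({
    "Quarter": ["1/1/2022", "4/3/2023", "6/1/2024", "7/1/2021"],
    "State": ["ny", "ca", "ca", "wa"],
})

# Outer join: keeps every df1 row plus the df2 rows that found no match
outer = pd.merge(df1, df2, left_on="Date", right_on="Quarter", how="outer")

# Left join: keeps only the df1 rows, filling unmatched columns with NaN
left = pd.merge(df1, df2, left_on="Date", right_on="Quarter", how="left")

print(len(outer))  # 8: five df1 rows plus three unmatched df2 rows
print(len(left))   # 5: one row per df1 row, since df2 keys are unique here
print(left)
```

On this data the left join returns five rows, but both 'a' rows pick up the 'ny' match, which differs from the Desired table where only the first 'a' row is matched.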


5 Comments

Lynn Over a year ago

When I perform the left join, I still get the duplicates.

Ratler Over a year ago

This works and should probably be the accepted answer since it's the simplest fix. @Lynn, if it's not working for you, please include your dataframe definitions in the question to investigate further.

Lynn Over a year ago

Hmm ok it is not working when I perform the above. The solution that works for me is the first answer.

Mercury Over a year ago

@Ratler the accepted answer is correct. Basically the solution OP needs is to drop the duplicates after left join. Note that row 0 and row 1 of df1 are identical, and will thus naturally create duplicate rows after left join - OP wants to drop those as well after join.

Ratler Over a year ago

@Mercury, you're right, I was focussing on the NaN rows and missed the duplicate first two rows!
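The "drop the duplicates after the left join" idea from the comments above can be sketched as follows. This assumes the sample data from the question, and it assumes that fully identical rows really should be collapsed; note that the result then has two rows, not the five shown in the Desired table.

```python
import pandas as pd

# Sample data reconstructed from the question (dates kept as strings)
df1 = pd.DataFrame({
    "ID": ["a", "a", "b", "b", "b"],
    "Date": ["1/1/2022", "1/1/2022", "2/1/2022", "2/1/2022", "2/1/2022"],
})
df2 = pd.DataFrame({
    "Quarter": ["1/1/2022", "4/3/2023", "6/1/2024", "7/1/2021"],
    "State": ["ny", "ca", "ca", "wa"],
})

deduped = (
    df1.merge(df2, left_on="Date", right_on="Quarter", how="left")
    .drop_duplicates()        # rows identical in every column collapse to one
    .reset_index(drop=True)   # renumber the surviving rows from 0
)
print(deduped)
```

If duplicates should be judged on a few columns only, drop_duplicates also accepts a subset argument.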

Nk03 · Accepted Answer · 2021-06-10 17:15:18Z

2

Create a temp column t with groupby/cumcount and just use that column for the merge.

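merged_df = (
    # t numbers repeated keys 0, 1, 2, ... within each Date / Quarter group
    df1.assign(t=df1.groupby('Date').cumcount())
    .merge(
        df2.assign(t=df2.groupby('Quarter').cumcount()),
        left_on=['Date', 't'],
        right_on=['Quarter', 't'],
        how='left')
    .drop(columns='t')  # remove the helper column
)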

OUTPUT:

  ID      Date   Quarter State
0  a  1/1/2022  1/1/2022    ny
1  a  1/1/2022       NaN   NaN
2  b  2/1/2022       NaN   NaN
3  b  2/1/2022       NaN   NaN
4  b  2/1/2022       NaN   NaN
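To make the helper column visible, here is a small sketch that builds it separately before merging. It assumes the sample data from the question; the names left_keys and right_keys are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (dates kept as strings)
df1 = pd.DataFrame({
    "ID": ["a", "a", "b", "b", "b"],
    "Date": ["1/1/2022", "1/1/2022", "2/1/2022", "2/1/2022", "2/1/2022"],
})
df2 = pd.DataFrame({
    "Quarter": ["1/1/2022", "4/3/2023", "6/1/2024", "7/1/2021"],
    "State": ["ny", "ca", "ca", "wa"],
})

# cumcount numbers each occurrence of a key within its group: 0, 1, 2, ...
left_keys = df1.assign(t=df1.groupby("Date").cumcount())
right_keys = df2.assign(t=df2.groupby("Quarter").cumcount())
print(left_keys["t"].tolist())   # [0, 1, 0, 1, 2]
print(right_keys["t"].tolist())  # [0, 0, 0, 0]

# Matching on (date, t) pairs the n-th occurrence on the left
# only with the n-th occurrence on the right
merged_df = left_keys.merge(
    right_keys,
    left_on=["Date", "t"],
    right_on=["Quarter", "t"],
    how="left",
).drop(columns="t")
print(merged_df)
```

Only the first '1/1/2022' row has t equal to 0 on both sides, so it is the only row that picks up a State.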

edited Jun 10, 2021 at 17:15

answered Jun 10, 2021 at 17:14


1 Comment

Lynn Over a year ago

Thank you I will try. What is 't'? Is this a common issue with joins? (duplicates?)
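On the general question of duplicates in joins: rows multiply whenever a key value is repeated on both sides, because every left occurrence is paired with every right occurrence. A minimal sketch with made-up data (the frames and column names below are hypothetical, not from the question) shows the effect and one way to detect it using the validate argument of merge.

```python
import pandas as pd

# Hypothetical frames: key "x" appears twice on the left, three times on the right
left = pd.DataFrame({"key": ["x", "x"], "l": [1, 2]})
right = pd.DataFrame({"key": ["x", "x", "x"], "r": [10, 20, 30]})

# Every left "x" pairs with every right "x": 2 * 3 = 6 rows
many = left.merge(right, on="key", how="left")
print(len(many))  # 6

# validate="many_to_one" asks pandas to check that right-hand keys are unique
raised = False
try:
    left.merge(right, on="key", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    raised = True
    print("right-hand keys are not unique:", exc)
```

Running the merge with validate first can be a cheap way to find out whether unexpected extra rows come from repeated keys.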
