I have two dataframes, df1 and df2, which I am joining on columns with different names (Date in df1 and Quarter in df2). For some reason, when I perform this join, the result contains far more rows than expected, many of them duplicates. How can I avoid this? I am using an outer join.

Data

df1

ID  Date
a   1/1/2022
a   1/1/2022
b   2/1/2022
b   2/1/2022
b   2/1/2022

df2

Quarter     State
1/1/2022    ny
4/3/2023    ca
6/1/2024    ca
7/1/2021    wa

Desired

ID  Date        Quarter     State
a   1/1/2022    1/1/2022    ny
a   1/1/2022    na          na
b   2/1/2022    na          na
b   2/1/2022    na          na
b   2/1/2022    na          na

Doing

join = pd.merge(df1, df2, left_on='Date', right_on='Quarter', how='outer')

However, the output has many more rows than I began with. I would have thought that a left join would solve this, but I am still getting duplicates. I am still researching this. Any suggestions are appreciated.

1
  • Out of curiosity, what are you trying to accomplish here? What are IDs 'a' and 'b', and what is the output table meant to represent? Commented Jun 10, 2021 at 18:01

2 Answers

3

Indeed, a left join should do it. Simply try changing "outer" to "left" in the how argument.

join = pd.merge(df1, df2, left_on='Date', right_on='Quarter', how='left')
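
To make the behavior easier to reproduce, here is a minimal sketch that rebuilds the two frames from the question and compares the outer and left joins. It assumes the dates are stored as plain strings; the real columns may well be datetimes.

```python
import pandas as pd

# Frames rebuilt from the question; dates kept as strings (an assumption).
df1 = pd.DataFrame({
    "ID": ["a", "a", "b", "b", "b"],
    "Date": ["1/1/2022", "1/1/2022", "2/1/2022", "2/1/2022", "2/1/2022"],
})
df2 = pd.DataFrame({
    "Quarter": ["1/1/2022", "4/3/2023", "6/1/2024", "7/1/2021"],
    "State": ["ny", "ca", "ca", "wa"],
})

# The outer join also keeps the df2 rows that have no match in df1.
outer = pd.merge(df1, df2, left_on="Date", right_on="Quarter", how="outer")

# The left join keeps only df1's rows, plus whatever they match in df2.
left = pd.merge(df1, df2, left_on="Date", right_on="Quarter", how="left")

print(len(outer))  # 8
print(len(left))   # 5
print(left)
```

With this data the left join keeps the row count at 5, but both 'a' rows pick up ny because they share the same Date. That differs from the desired output in the question, where only the first 'a' row is matched.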

5 Comments

When I perform the left join, I still get the duplicates.
This works and should probably be the accepted answer since it's the simplest fix. @Lynn, if it's not working for you, please include your dataframe definitions in the question to investigate further.
Hmm, OK, it is not working when I perform the above. The solution that works for me is the first answer.
@Ratler the accepted answer is correct. Basically, the solution OP needs is to drop the duplicates after the left join. Note that row 0 and row 1 of df1 are identical and will thus naturally produce duplicate rows after a left join; OP wants to drop those as well after the join.
@Mercury, you're right, I was focussing on the NaN rows and missed the duplicate first two rows!
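
The row multiplication discussed in these comments can be illustrated with a small sketch. The frames and column names below (left, right, key, l, r) are hypothetical and not taken from the question. When a key value is repeated on both sides, the merge emits one row per pairing; drop_duplicates removes rows that are identical afterwards, and the validate argument of merge can reject the join up front if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical frames: key "x" appears twice on the left, three times on the right.
left = pd.DataFrame({"key": ["x", "x"], "l": [1, 1]})
right = pd.DataFrame({"key": ["x", "x", "x"], "r": [9, 9, 9]})

# Every left row pairs with every matching right row: 2 * 3 rows.
merged = left.merge(right, on="key", how="left")
print(len(merged))  # 6

# Rows that are identical in every column collapse to one.
deduped = merged.drop_duplicates()
print(len(deduped))  # 1

# validate="many_to_one" requires the right-hand keys to be unique.
rejected = False
try:
    left.merge(right, on="key", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    rejected = True
    print("merge rejected:", exc)
```

Note that drop_duplicates only removes rows that match in every column, so whether it gives the intended result depends on which rows are meant to survive.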
2

Create a temporary column t with groupby/cumcount and add it to the merge keys, so that the n-th occurrence of a date in df1 can only match the n-th occurrence of that date in df2.

merged_df = (
    df1.assign(t=df1.groupby('Date').cumcount())
    .merge(
        df2.assign(t=df2.groupby('Quarter').cumcount()),
        left_on=['Date', 't'],
        right_on=['Quarter', 't'],
        how='left')
    .drop(columns='t')  # remove the helper column after the merge
)

OUTPUT:

  ID      Date   Quarter State
0  a  1/1/2022  1/1/2022    ny
1  a  1/1/2022       NaN   NaN
2  b  2/1/2022       NaN   NaN
3  b  2/1/2022       NaN   NaN
4  b  2/1/2022       NaN   NaN
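
As a sketch of what the temporary column holds, the snippet below rebuilds the frames from the question (dates as plain strings, which is an assumption) and prints the cumcount values before merging. cumcount numbers the rows within each group starting at 0, so the second 'a' row gets t=1 and finds no partner in df2, where every Quarter occurs only once.

```python
import pandas as pd

# Frames rebuilt from the question; dates kept as strings (an assumption).
df1 = pd.DataFrame({
    "ID": ["a", "a", "b", "b", "b"],
    "Date": ["1/1/2022", "1/1/2022", "2/1/2022", "2/1/2022", "2/1/2022"],
})
df2 = pd.DataFrame({
    "Quarter": ["1/1/2022", "4/3/2023", "6/1/2024", "7/1/2021"],
    "State": ["ny", "ca", "ca", "wa"],
})

# Occurrence number of each row within its key group, starting at 0.
t1 = df1.groupby("Date").cumcount()
t2 = df2.groupby("Quarter").cumcount()
print(t1.tolist())  # [0, 1, 0, 1, 2]
print(t2.tolist())  # [0, 0, 0, 0]

# Merge on (key, occurrence number), then drop the helper column.
merged_df = (
    df1.assign(t=t1)
    .merge(
        df2.assign(t=t2),
        left_on=["Date", "t"],
        right_on=["Quarter", "t"],
        how="left",
    )
    .drop(columns="t")
)
print(merged_df)
```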

1 Comment

Thank you, I will try it. What is 't'? Are duplicates like this a common issue with joins?
