
I have two dataframes, each with a key column that can contain duplicates; the two frames mostly share the same duplicated keys. I'd like to merge them on that key so that when both frames contain the same duplicate, the duplicates are paired up in order. If one dataframe has more duplicates of a key than the other, I'd like the missing values to be filled with NaN. For example:

import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A':   ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
                   columns=['key', 'A'])
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4'],
                    'B':   ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']},
                   columns=['key', 'B'])

  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K2  A3
4  K2  A4
5  K3  A5

  key   B
0  K0  B0
1  K1  B1
2  K2  B2
3  K2  B3
4  K3  B4
5  K3  B5
6  K4  B6

I'm trying to get the following output:

   key    A   B
0   K0   A0  B0
1   K1   A1  B1
2   K2   A2  B2
3   K2   A3  B3
6   K2   A4  NaN
8   K3   A5  B4
9   K3  NaN  B5
10  K4  NaN  B6

So basically, I'd like to treat the duplicated K2 keys as K2_1, K2_2, ... and then do the how='outer' merge on the dataframes. Any ideas how I can accomplish this?

2 Answers


faster again

%%cython
# Cython cell magic in a Jupyter notebook;
# run `%load_ext Cython` in another cell first
from collections import defaultdict

def cg(x):
    # yield a running occurrence count for each key, starting at 1
    cnt = defaultdict(int)
    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]

def fastcount(x):
    return list(cg(x))

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, on=['key', 'cc'], how='outer').drop(columns='cc')
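To illustrate what the counter produces on the example frames (occurrence numbers start at 1 here; any consistent numbering works, since the merge only needs matching (key, cc) pairs):

fastcount(df1.key.values)
# [1, 1, 1, 2, 3, 1]
fastcount(df2.key.values)
# [1, 1, 1, 2, 1, 2, 1]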

faster answer; not scalable

import numpy as np

def fastcount(x):
    # compare each unique key against the inverse index, then use a
    # row-wise cumulative sum to number the occurrences of each key
    unq, inv = np.unique(x, return_inverse=True)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, on=['key', 'cc'], how='outer').drop(columns='cc')
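Not part of the original answer, but for reference: the memory error mentioned in the comments comes from the len(unq) × len(x) boolean matrix above. A sort-based counter (hypothetical name argsort_cumcount) avoids that matrix entirely; a sketch:

import numpy as np

def argsort_cumcount(x):
    # a stable argsort groups equal keys into contiguous runs while
    # preserving their original relative order
    order = np.argsort(x, kind='stable')
    xs = x[order]
    # index where each run of equal keys starts
    run_start = np.r_[0, np.flatnonzero(xs[1:] != xs[:-1]) + 1]
    run_len = np.diff(np.r_[run_start, len(xs)])
    # 0, 1, 2, ... within each run
    within = np.arange(len(xs)) - np.repeat(run_start, run_len)
    # scatter the within-run positions back to the original order
    out = np.empty(len(x), dtype=np.intp)
    out[order] = within
    return out

df1['cc'] = argsort_cumcount(df1.key.values)

Because the sort is stable, the occurrence numbers match what groupby('key').cumcount() would assign, at O(n log n) time and O(n) memory.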

old answer

df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()

df1.merge(df2, on=['key', 'cc'], how='outer').drop(columns='cc')
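All three variants give the same merged frame on the example data (the row index runs 0–7 rather than the gapped index shown in the question):

  key    A    B
0  K0   A0   B0
1  K1   A1   B1
2  K2   A2   B2
3  K2   A3   B3
4  K2   A4  NaN
5  K3   A5   B4
6  K3  NaN   B5
7  K4  NaN   B6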



Comments

Is there a way to do this any faster? I'm working with dataframes with about 4M entries and while the merge is basically instantaneous, the cumcount() call takes about a minute.
@dcmm88 I wrote this function specially for this. Check to see if it improves the situation.
On the np.arange(...) == inv line, did you mean to use np.equal? At least for me, the result of '==' is a bool. Even when I use np.equal, I get a memory error there; maybe my dataframes are too big?
@dcmm88 Yes. It's bad. Does not scale. Still working on something
@dcmm88 I've improved performance, but it requires Cython. If you're using a Jupyter notebook, you should be OK.
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)

merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
merged_df.reset_index('key', drop=False, inplace=True)

Comments

It is usually more helpful if you can explain the keywords you've added instead of posting a code-only answer.
This does not remove duplicates for me.
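For context on why the duplicates survive: merging on a non-unique index pairs every left occurrence of a key with every right occurrence, so K2 (3 rows in df1, 2 in df2) produces 3 × 2 = 6 rows instead of lining the duplicates up positionally; how='inner' also drops K4 entirely, unlike the outer merge the question asks for:

pd.merge(df1, df2, left_index=True, right_index=True, how='inner').loc['K2']
# six rows: every (A, B) combination of A2/A3/A4 with B2/B3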
