Compare two pandas dataframes with different sizes

Question

I have one massive pandas dataframe with this structure:

And a second one, smaller like this:

I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G

I managed to do it with for loops, but the dataset is massive and the code runs really slowly, so I am looking for a pandas or NumPy way to do it.

Many thanks,

Boris

So, are all elements from df2.G guaranteed to be in df1.A? Is df2.G sorted? What are the shapes of the input dataframes in your actual use case?
– Divakar, Commented Jun 7, 2017 at 14:13
The input data contains more columns and rows, but the structure is the same. The function I needed was DataFrame.merge(), which works perfectly.
– boris, Commented Jun 7, 2017 at 14:22

Frans Sjöström · Accepted Answer · 2023-01-18 09:05:23Z

If you only want to fill in values for rows whose key exists in both dataframes (all other rows get NaN):

import pandas as pd

df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1    
   Name Special ability
0  Sara   Walk on water

df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
     Name  Age
0    Sara    4
1  Gustaf   12
2  Patrik   11

df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
     Name  Age Special ability
0    Sara    4   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
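Note that how='left' keeps every row of df2 and only fills in what it can. If the goal is to keep just the rows present in both frames, how='inner' is probably the better fit. A minimal sketch comparing the two on the same sample data (the variable names left and inner are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Sara'], 'Special ability': ['Walk on water']})
df2 = pd.DataFrame({'Name': ['Sara', 'Gustaf', 'Patrik'], 'Age': [4, 12, 11]})

# how='left' keeps all three rows of df2; unmatched rows get NaN
left = df2.merge(df1, on='Name', how='left')

# how='inner' keeps only the names that appear in both frames
inner = df2.merge(df1, on='Name', how='inner')

print(len(left), len(inner))
```

Since the key column has the same name in both frames, on='Name' is equivalent to passing left_on and right_on separately.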

This can also be done with more than one matching key. (In this example, Patrik from df1 has no match in df2 because the ages differ, so that row does not merge.)

df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})

df1
     Name Special ability  Age
0    Sara   Walk on water   12
1  Patrik       FireBalls   83

df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[12,12,11]})
df2
     Name  Age
0    Sara   12
1  Gustaf   12
2  Patrik   11

df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
     Name  Age Special ability
0    Sara   12   Walk on water
1  Gustaf   12             NaN
2  Patrik   11             NaN
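To see which rows actually found a partner, merge accepts indicator=True, which adds a _merge column. A small sketch, assuming sample frames where Sara has the same age in both (the names checked and unmatched are made up for this example):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Sara', 'Patrik'],
                    'Special ability': ['Walk on water', 'FireBalls'],
                    'Age': [12, 83]})
df2 = pd.DataFrame({'Name': ['Sara', 'Gustaf', 'Patrik'], 'Age': [12, 12, 11]})

# indicator=True adds a '_merge' column; for a left join it holds
# 'both' (matched) or 'left_only' (no match in df1)
checked = df2.merge(df1, on=['Name', 'Age'], how='left', indicator=True)

# Names from df2 that found no partner in df1 on (Name, Age)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)
```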

jezrael · Accepted Answer · 2017-06-07 14:06:20Z

4

You can use map with a Series created by set_index:

df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31
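For reference, here is a self-contained sketch of the map approach. The sample frames are an assumption, reconstructed (and shortened) from the printed output above, and the row with A == 3 is a hypothetical addition to show what happens when a key is missing from df2.G:

```python
import pandas as pd

# Sample data reconstructed from the printed output (shortened);
# A == 3 is a hypothetical key with no counterpart in df2.G
df1 = pd.DataFrame({'A': [0, 0, 1, 2, 3], 'B': [12, 15, 45, 78, 99]})
df2 = pd.DataFrame({'G': [0, 1, 2], 'H': [15, 45, 31]})

lookup = df2.set_index('G')['H']   # Series mapping G -> H
df1['C'] = df1['A'].map(lookup)    # NaN where A has no match in G

print(df1)
```

This relies on df2.G being unique, otherwise the lookup would be ambiguous. Unmatched keys come back as NaN, which also turns the column into floats; fillna can supply a default if needed.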

Or merge with drop and rename:

df = (df1.merge(df2, left_on="A", right_on="G", how='left')
         .drop('G', axis=1)
         .rename(columns={'H':'C'}))
print (df)
    A   B   C
0   0  12  15
1   0  15  15
2   0  17  15
3   0  18  15
4   1  45  45
5   1  78  45
6   1  96  45
7   1  32  45
8   2  45  31
9   2  78  31
10  2  44  31
11  2  10  31
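If df2.G might contain duplicates, a left merge would silently multiply rows of df1. One way to guard against that is the validate argument of merge. A sketch under the assumption that each value of G should appear at most once (the sample data is shortened from the output above):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [0, 0, 1, 1, 2, 2], 'B': [12, 15, 45, 78, 45, 10]})
df2 = pd.DataFrame({'G': [0, 1, 2], 'H': [15, 45, 31]})

# validate='many_to_one' raises pandas.errors.MergeError
# if the right-hand key (G) is not unique
df = (
    df1.merge(df2, left_on='A', right_on='G', how='left', validate='many_to_one')
       .drop(columns='G')
       .rename(columns={'H': 'C'})
)
print(df)
```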


WNG · Accepted Answer · 2017-06-07 14:05:02Z

2

You probably want to use a merge:

df=df1.merge(df2,left_on="A",right_on="G")

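will give you a dataframe with 4 columns (A, B, G and H), with the values you want in H. Because merge defaults to an inner join, rows of df1 whose A has no match in df2.G are dropped.

df = df.drop(columns="G").rename(columns={"H": "C"})

will then give you the column names you want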


Divakar · Accepted Answer · 2017-06-07 14:17:46Z

0

Here's one vectorized NumPy approach, assuming df2.G is sorted and every value of df1.A occurs in df2.G:

import numpy as np

idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]

idx could be computed more simply with df2.G.searchsorted(df1.A), but I don't think that would be any more efficient, since working on the underlying arrays via .values, as above, avoids pandas overhead.
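If those assumptions are not guaranteed, the same idea can be made safer by sorting first and masking the misses. This is only a sketch: the sample data and the helper names (order, g_sorted, h_sorted, idx_safe, found) are invented for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [2, 0, 1, 3], 'B': [45, 12, 78, 99]})  # 3 is absent from G
df2 = pd.DataFrame({'G': [2, 0, 1], 'H': [31, 15, 45]})         # deliberately unsorted

# Sort the lookup keys once so searchsorted is valid
order = np.argsort(df2['G'].to_numpy())
g_sorted = df2['G'].to_numpy()[order]
h_sorted = df2['H'].to_numpy()[order]

a = df1['A'].to_numpy()
idx = np.searchsorted(g_sorted, a)
idx_safe = np.clip(idx, 0, len(g_sorted) - 1)  # keep positions in range
found = g_sorted[idx_safe] == a                # True only for exact matches

# Take H where the key was found, NaN otherwise
df1['C'] = np.where(found, h_sorted[idx_safe], np.nan)
print(df1)
```

The extra sort and mask cost some time, so it is worth timing this against map or merge on the real data.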

edited Jun 7, 2017 at 14:17

answered Jun 7, 2017 at 14:10


Divakar Over a year ago

@boris Make sure to time it at your end. Should be pretty efficient :)
