Combine two dataframes based on index, replacing matching values in another column

Question

I have the following wide df1:

Area geotype  type    ...
1      a        2      ...
1      a        1      ... 
2      b        4      ...
4      b        8      ...

And the following two-column df2:

Area   geotype
1      London
4      Cambridge

And I want the following:

Area  geotype  type    ...
1     London     2      ...
1     London     1      ... 
2       b        4      ...
4     Cambridge  8      ...

So I need to match based on the non-unique Area column, and then only if there is a match, replace the set values in the geotype column.

Apologies if this is a duplicate, I did actually search hard for a solution to this.

piRSquared · Accepted Answer · 2017-01-20 02:32:09Z

3

Use update + map:

df1.geotype.update(df1.Area.map(df2.set_index('Area').geotype))

   Area    geotype  type
0     1     London     2
1     1     London     1
2     2          b     4
3     4  Cambridge     8
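For anyone who wants to run this end to end, here is a minimal self-contained sketch using the question's data. It assumes the elided `...` columns can be ignored and that Area is unique in df2. Updating a copy of the column and assigning it back is a precaution added here, not part of the one-liner above: an in-place update through `df1.geotype` may not reach the frame when pandas copy-on-write is active.

```python
import pandas as pd

# Data reconstructed from the question (extra columns omitted).
df1 = pd.DataFrame({
    "Area": [1, 1, 2, 4],
    "geotype": ["a", "a", "b", "b"],
    "type": [2, 1, 4, 8],
})
df2 = pd.DataFrame({"Area": [1, 4], "geotype": ["London", "Cambridge"]})

# Look up each Area in df2; areas that are missing from df2 become NaN.
mapped = df1["Area"].map(df2.set_index("Area")["geotype"])

# Series.update overwrites only the positions where `mapped` is not NaN.
geotype = df1["geotype"].copy()
geotype.update(mapped)
df1["geotype"] = geotype

print(df1)
```

Area 2 has no entry in df2, so its geotype stays as it was.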

edited Jan 20, 2017 at 2:32

answered Jan 19, 2017 at 22:24

piRSquared


1 Comment

piRSquared Over a year ago

@jezrael fixed.

jezrael · Accepted Answer · 2017-01-19 21:23:38Z

2

I think you can use map with a Series created by set_index, and then fill the resulting NaN values with combine_first or fillna:

df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).combine_first(df1.geotype)
#df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).fillna(df1.geotype)
print (df1)
   ID    geotype type
0   1     London    2
1   2          a    1
2   3          b    4
3   4  Cambridge   8e
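The same pattern applied to the question's Area column, as a runnable sketch. The frames are rebuilt from the question with the elided columns left out, and it assumes Area is unique in df2 (duplicates in df1 are fine).

```python
import pandas as pd

# Data reconstructed from the question (extra columns omitted).
df1 = pd.DataFrame({
    "Area": [1, 1, 2, 4],
    "geotype": ["a", "a", "b", "b"],
    "type": [2, 1, 4, 8],
})
df2 = pd.DataFrame({"Area": [1, 4], "geotype": ["London", "Cambridge"]})

# Lookup Series: index is Area, values are the replacement geotype.
lookup = df2.set_index("Area")["geotype"]

# map gives NaN for areas without a match; fillna restores the old value there.
df1["geotype"] = df1["Area"].map(lookup).fillna(df1["geotype"])

print(df1)
```

Both rows with Area 1 pick up the same replacement, because map looks up each row independently.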

Another solution with mask and numpy.in1d (deprecated since NumPy 2.0, where numpy.isin is the replacement):

df1.geotype = df1.geotype.mask(np.in1d(df1.ID, df2.ID),
                               df1.ID.map(df2.set_index('ID')['geotype']))
print (df1)
   ID    geotype type
0   1     London    2
1   2          a    1
2   3          b    4
3   4  Cambridge   8e

EDIT by comment:

The problem is non-unique ID values in df2, for example:

df2 = pd.DataFrame({'ID': [1, 1, 4], 'geotype': ['London', 'Paris', 'Cambridge']})
print (df2)
   ID    geotype
0   1     London
1   1      Paris
2   4  Cambridge

So map cannot choose the right value and raises an error.

The solution is to remove the duplicates with drop_duplicates, which keeps the first value by default:

df2 = df2.drop_duplicates('ID')
print (df2)
   ID    geotype
0   1     London
2   4  Cambridge

Or, if you need to keep the last value:

df2 = df2.drop_duplicates('ID', keep='last')
print (df2)
   ID    geotype
1   1      Paris
2   4  Cambridge
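Before choosing between keep='first' and keep='last', it can help to look at which IDs actually conflict. A small sketch of that check, using the duplicated df2 from above (the variable names are only illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({
    "ID": [1, 1, 4],
    "geotype": ["London", "Paris", "Cambridge"],
})

# True when at least one ID occurs more than once, i.e. the lookup is ambiguous.
has_duplicates = not df2["ID"].is_unique

# keep=False flags every row of a duplicated group, which is handy for inspection.
conflicts = df2[df2["ID"].duplicated(keep=False)]
print(conflicts)

# Resolve the ambiguity explicitly before building the lookup Series for map.
lookup = df2.drop_duplicates("ID", keep="first").set_index("ID")["geotype"]
print(lookup)
```

Once the index of the lookup Series is unique, map can be used as in the first solution.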

If you cannot remove the duplicates, there is another solution with an outer merge, but the result contains duplicated rows wherever an ID is duplicated in df2:

df1 = pd.merge(df1, df2, on='ID', how='outer', suffixes=('_',''))
df1.geotype = df1.geotype.combine_first(df1.geotype_)
df1 = df1.drop('geotype_', axis=1)
print (df1)
   ID type    geotype
0   1    2     London
1   1    2      Paris
2   2    1          a
3   3    4          b
4   4   8e  Cambridge
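If duplicated rows are not acceptable, a variation on the merge is a left join that refuses duplicated keys in df2. This is a sketch of a different variant, not the outer merge shown above; it uses the question's Area data, and the `_new` suffix is just an illustrative choice.

```python
import pandas as pd

# Data reconstructed from the question (extra columns omitted).
df1 = pd.DataFrame({
    "Area": [1, 1, 2, 4],
    "geotype": ["a", "a", "b", "b"],
    "type": [2, 1, 4, 8],
})
df2 = pd.DataFrame({"Area": [1, 4], "geotype": ["London", "Cambridge"]})

# how="left" keeps the rows of df1; validate="m:1" raises
# pandas.errors.MergeError if Area is duplicated in df2.
merged = df1.merge(df2, on="Area", how="left",
                   suffixes=("", "_new"), validate="m:1")

# Prefer the value from df2 and fall back to the original geotype.
merged["geotype"] = merged["geotype_new"].fillna(merged["geotype"])
merged = merged.drop(columns="geotype_new")

print(merged)
```

With unique keys in df2 the row count of df1 is preserved, so the result can replace df1 directly.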

edited Jan 19, 2017 at 21:23

answered Jan 19, 2017 at 20:23

jezrael


2 Comments

Thirst for Knowledge Over a year ago

Sorry, I got 'Reindexing only valid with uniquely valued Index objects' as the ID column is really an area column, so there are multiple entries.

jezrael Over a year ago

I see the problem: you have multiple values for one ID in df2, so map is impossible because pandas does not know which value to use. You need unique ID values in df2.

MaxU - stand with Ukraine · Accepted Answer · 2017-01-19 21:30:31Z

2

alternative solution:

In [78]: df1.loc[df1.ID.isin(df2.ID), 'geotype'] = df1.ID.map(df2.set_index('ID').geotype)

In [79]: df1
Out[79]:
   ID    geotype  type
0   1     London     2
1   2          a     1
2   3          b     4
3   4  Cambridge     8

UPDATE: this addresses the updated question. If you have duplicates in the Area column of df2:

In [152]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.set_index('Area').geotype)
...
skipped
...
InvalidIndexError: Reindexing only valid with uniquely valued Index objects

get rid of duplicates:

In [153]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.drop_duplicates(subset='Area').set_index('Area').geotype)

In [154]: df1
Out[154]:
   Area    geotype  type
0     1     London     2
1     1     London     1
2     2          b     4
3     4  Cambridge     8

edited Jan 19, 2017 at 21:30

answered Jan 19, 2017 at 20:24

MaxU - stand with Ukraine


3 Comments

Thirst for Knowledge Over a year ago

Sorry, I got 'Reindexing only valid with uniquely valued Index objects' as the ID column is really an area column, so there are multiple entries.

MaxU - stand with Ukraine Over a year ago

@ThirstforKnowledge, do you also have duplicates in the df2 DF?

Thirst for Knowledge Over a year ago

No just duplicates in df1 @MaxU
