Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

Question

I have two dataframes like the following examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                 'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

print(df)
     a    b    c
0   20  1.0  NaN
1   50  NaN  1.0
2  100  1.0  1.0

print(df_id)
       b    c
0     50  123
1   4954  100
2  93920    6
3     20  NaN

For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].

My desired result is as follows:

result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                 'c': [np.nan, np.nan, 1]})
print(result)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN    # df_id['c'] did not contain '50'
2  100  NaN  1.0    # df_id['b'] did not contain '100'

My attempt to do this is here:

for i, letter in enumerate(['b','c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                   .isin(df_id[letter].tolist()) else np.nan, axis = 1))

The error I get:

AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')

This is with Python 3.5.2 and pandas 0.20.1.

Rayhane Mama · Accepted Answer · 2017-07-07 15:46:44Z

1

You can solve your problem using this instead:

for letter in ['b', 'c']:  # enumerate dropped: the index was not used here
    allowed = df_id[letter].tolist()  # build the lookup list once per column
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in allowed else np.nan, axis=1)

In short, replace .isin with the in operator.

When you call apply on df with axis=1, x is one row of df, so x['a'] is a single element (here a str), not a Series.

isin is a method of Series and other pandas objects, not of str, which is why the AttributeError is raised. The in operator checks whether that single element is contained in the list.
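As a rough illustration of the difference described above, the sketch below runs both checks on the frames from the question. The names series_check, ids_b and scalar_check are made up for this example, and the set is an optional choice, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

# Series.isin is vectorized: it returns one boolean per element of df['a'].
series_check = df['a'].isin(df_id['b'])

# A single element such as df.loc[0, 'a'] is a plain str with no .isin,
# so membership has to be tested with the `in` operator instead.
ids_b = set(df_id['b'].dropna())  # hypothetical helper; a list works too
scalar_check = [value in ids_b for value in df['a']]

print(series_check.tolist())  # [True, True, False]
print(scalar_check)           # [True, True, False]
```

Both checks agree; the first works on the whole column at once, while the second is what happens element by element inside apply.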

Hope that was helpful. If you have any questions please ask.

answered Jul 7, 2017 at 15:46

Rayhane Mama


1 Comment

Riebeckite Over a year ago

I just read the docs again and realized that isin() returns a boolean series. Thanks for the clarification!

Riebeckite · Accepted Answer · 2017-07-07 15:48:25Z

0

Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:

for letter in ['b', 'c']:
    mask = df['a'].isin(df_id[letter])  # True where the id in 'a' appears in df_id[letter]
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]  # copy matching rows; the rest become NaN
    df[letter] = df[name]
    del df[name]

This isn't pretty, but seems to work.
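A possibly tidier variant, offered as a sketch and not as the method used in this answer: Series.where keeps values where the mask is True and fills NaN elsewhere, so the temporary column may not be needed. It was only checked against the sample data from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

for letter in ['b', 'c']:
    # True where the identifier in 'a' occurs anywhere in df_id[letter]
    mask = df['a'].isin(df_id[letter])
    # Keep the existing value where mask is True, otherwise set NaN
    df[letter] = df[letter].where(mask)

print(df)
```

On the sample frames this yields NaN in rows 1 and 2 of column b and in rows 0 and 1 of column c, which matches the desired result in the question.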

answered Jul 7, 2017 at 15:48

Riebeckite


P.Tillmann · Accepted Answer · 2017-07-07 22:59:18Z

0

If you have a larger DataFrame and performance matters, you can first build a boolean mask DataFrame and then apply it to your DataFrame. First create the mask:

mask = df_id.apply(lambda x: df['a'].isin(x))
print(mask)
       b      c
0   True  False
1   True  False
2  False   True

This can be applied to the original dataframe:

df.iloc[:, 1:] = df.iloc[:, 1:].mask(~mask, np.nan)
print(df)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN
2  100  NaN  1.0
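The same idea can be sketched with column labels instead of positions, assuming the columns to clean are known by name. The list cols is a name introduced for this example, and DataFrame.where(mask) is used here in place of mask(~mask); the two should be equivalent for a boolean mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

cols = ['b', 'c']  # hypothetical: the columns present in both frames

# One boolean column per label, indexed like df
mask = pd.DataFrame({col: df['a'].isin(df_id[col]) for col in cols})

# where() keeps values where mask is True and writes NaN elsewhere
df[cols] = df[cols].where(mask)

print(df)
```

Selecting by label avoids relying on 'a' being the first column, which the iloc slice assumes.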

edited Jul 7, 2017 at 22:59

answered Jul 7, 2017 at 15:54

P.Tillmann


3 Comments

Riebeckite Over a year ago

I originally tried using a mask, but ran into the same errors described in this question. Also, this doesn't seem to replicate the desired result.

P.Tillmann Over a year ago

Yes, you're right, I messed up the logic: the boolean mask passed to mask() indicates where the replacement should happen. I fixed it in the post by inverting the mask. What exactly was your problem with the mask? I think the solution in the question you linked was quite elegant.

Riebeckite Over a year ago

The poster of that question and I both got 'ValueError: Boolean array expected for the condition, not float64' when attempting to use a mask that way. It seems like there's some version-specific bug with either Python or Pandas or some mix of the two.
