
I have two dataframes like the following examples:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                 'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

print(df)
     a    b    c
0   20  1.0  NaN
1   50  NaN  1.0
2  100  1.0  1.0

print(df_id)
       b    c
0     50  123
1   4954  100
2  93920    6
3     20  NaN

For each identifier in df['a'], I want to set the value in df['b'] to NaN if that identifier does not appear in any row of df_id['b']. I want to do the same for df['c'], checking against df_id['c'].

My desired result is as follows:

result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                 'c': [np.nan, np.nan, 1]})
print(result)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN    # df_id['c'] did not contain '50'
2  100  NaN  1.0    # df_id['b'] did not contain '100'

My attempt to do this is here:

for i, letter in enumerate(['b','c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                   .isin(df_id[letter].tolist()) else np.nan, axis = 1))

The error I get:

AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')

This is in Python 3.5.2, pandas version 0.20.1.

3 Answers

1

You can solve your problem by replacing isin with in:

for letter in ['b', 'c']:  # enumerate removed: the index was not used here
    valid = df_id[letter].tolist()
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in valid else np.nan, axis=1)

That is the only change needed.

The problem is that when you use apply on df with axis=1, x represents a single row of df, so x['a'] selects one element (here a str), not a Series.

isin, however, is a method of Series and other list-like pandas objects, not of str, which is why the AttributeError is raised. Instead, use in to check whether that element is in the list.
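To make the distinction concrete, here is a minimal sketch on the question's data. The names valid_b, scalar_check and vector_check are only illustrative, and converting to a set is my own addition for faster lookups, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

# Scalar membership test: each element of df['a'] is a plain str,
# so it has no isin method and must be tested with `in`.
valid_b = set(df_id['b'])
scalar_check = [value in valid_b for value in df['a']]

# Vectorized membership test: isin works on the whole Series at once
# and returns a boolean Series.
vector_check = df['a'].isin(df_id['b'])

print(type(df['a'].iloc[0]))
print(scalar_check)
print(vector_check.tolist())
```

Both checks give the same answer per row; the vectorized form just avoids the row-by-row apply.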

Hope that was helpful. If you have any questions please ask.


1 Comment

I just read the docs again and realized that isin() returns a boolean series. Thanks for the clarification!

0

Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:

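for letter in ['b', 'c']:
    # True where the identifier in df['a'] appears in the matching df_id column
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    # copy only the matching rows into a temporary column; the rest become NaN
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]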

This isn't pretty, but seems to work.
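A possibly tidier variant, as a sketch rather than a tested replacement for the loop above: Series.where keeps values where the condition is True and fills NaN elsewhere, so the temporary column may not be needed. I have only checked this against the question's sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

for letter in ['b', 'c']:
    # True where the identifier in df['a'] appears in df_id[letter]
    mask = df['a'].isin(df_id[letter])
    # Keep the value where mask is True, set NaN where it is False
    df[letter] = df[letter].where(mask)

print(df)
```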

Comments

0

If you have a bigger DataFrame and performance matters, you can first build a boolean mask DataFrame and then apply it to your DataFrame in one step. First create the mask:

mask = df_id.apply(lambda x: df['a'].isin(x))
print(mask)
       b      c
0   True  False
1   True  False
2  False   True

This can be applied to the original dataframe:

df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
print(df)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN
2  100  NaN  1.0
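For a runnable end-to-end version of this idea, here is a sketch on the question's data. It departs from the snippet above in two ways that are my own choices: the mask is built with a dict comprehension rather than df_id.apply, and it is applied with DataFrame.where (keep where True) instead of mask(~mask). The dtype check is there because the comments on this answer report a ValueError about a non-boolean condition.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})

cols = ['b', 'c']

# One boolean column per target column, indexed like df
mask = pd.DataFrame({col: df['a'].isin(df_id[col]) for col in cols})

# Guard against a float or object mask before using it as a condition
assert all(pd.api.types.is_bool_dtype(t) for t in mask.dtypes)

# Keep values where the mask is True, NaN elsewhere
df[cols] = df[cols].where(mask)

print(mask)
print(df)
```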

3 Comments

I originally tried using a mask, but ran into the same errors described in this question. Also, this doesn't seem to replicate the desired result.
Yes, you're right, I messed up the logic, because the boolean mask indicates where the replacement should happen. I fixed it in the post by inverting the mask. What exactly was your problem with the mask? I think the solution in the question you linked was quite elegant.
The poster of that question and I both got 'ValueError: Boolean array expected for the condition, not float64' when attempting to use a mask that way. It seems like there's some version-specific bug with either Python or Pandas or some mix of the two.
