Using an if statement in a dataframe with lambda functions

Question

I am trying to add a new column to a dataframe based on an if statement depending on the values of two columns. i.e. if column x == None then column y else column x

below is the script I have written but doesn't work. any ideas?

dfCurrentReportResults['Retention'] =  dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)

Also I got this error message: AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')

fyi: BUSINESSUNIT_NAME is the first column name

Additional Info:

My data printed out looks like this and I want to add a 3rd column to take a value if there is one else keep NaN.

   Retention_x  Retention_y
0            1          NaN
1          NaN     0.672183
2          NaN     1.035613
3          NaN     0.771469
4          NaN     0.916667
5          NaN          NaN
6          NaN          NaN
7          NaN          NaN
8          NaN          NaN
9          NaN          NaN

UPDATE: In the end I was having issues referencing the Null or is Null in my dataframe the final line of code I used also including the axis = 1 answered my question.

 dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)

Thanks @EdChum, @strim099 and @aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.

is None a string or a NaN? And could you provide a sample set of your data frame so we can better debug any issues? — alacy
– alacy, Commented Jan 8, 2015 at 16:45
@aus_lacy my use of None was basically an attempt to identify if the value is empty, so I guess it is a NaN and is None? — IcemanBerlin
– IcemanBerlin, Commented Jan 8, 2015 at 16:46
what column are you calling your apply on? A sample of your data would help you get an answer much quicker. — alacy
– alacy, Commented Jan 8, 2015 at 16:48
I would like to apply the function to the new column and get the results by referencing the other two columns. The data is a bit messy and also confidentially, i will try and knock together some simple data for the question. — IcemanBerlin
– IcemanBerlin, Commented Jan 8, 2015 at 16:51

Jason Strimpel · Accepted Answer · 2015-01-08 16:54:19Z

4

You'r lambda is operating on the 0 axis which is columnwise. Simply add axis=1 to the apply arg list. This is clearly documented.

In [1]: import pandas

In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])

In [3]: dfCurrentReportResults['Retention_x'][1] = None

In [4]: dfCurrentReportResults['Retention_x'][3] = None

In [5]: dfCurrentReportResults
Out[5]:
  Retention_y Retention_x
0           a           b
1           c        None
2           e           f
3           g        None
4           i           j

In [6]: dfCurrentReportResults['Retention'] =  dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)

In [7]: dfCurrentReportResults
Out[7]:
  Retention_y Retention_x Retention
0           a           b         b
1           c        None         c
2           e           f         f
3           g        None         g
4           i           j         j

answered Jan 8, 2015 at 16:54

Jason Strimpel

15.7k25 gold badges81 silver badges110 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

IcemanBerlin Over a year ago

thanks strimp099 actually I had tried adding axis = 1 in some of my attempts but got the same message. I think the issue is also that my dataframe value is probably not None, i.e. where you have None I have just a blank. is it the same thing or how can i reference the blank if I can't use None?

IcemanBerlin Over a year ago

Actually anyway when I copy your code above and run your sample data i get this error: AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index 0')

IcemanBerlin Over a year ago

how do you recreate your example to be NaN instead of None?

Jason Strimpel Over a year ago

Assuming you're using numpy, just change your lambda function from x.Retention_x == None to numpy.isnan(x.Retention_x)

IcemanBerlin Over a year ago

Ok in the end this works. dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1) I didnt test your numpy.isnan(x.Retention_x) but i am sure that works as well so in the end i was just having issues with how to reference a NAN in my Lambda. I am marking your answer as correct because it is the closest solution for the title of the question I wrote.

EdChum · Accepted Answer · 2015-01-08 17:52:02Z

2

Just use np.where:

dfCurrentReportResults['Retention'] =  np.where(df.Retention_x == None, df.Retention_y, else df.Retention_x)

This uses the test condition, the first param and sets the value to df.Retention_y else df.Retention_x

also avoid using apply where possible as this is just going to loop over the values, np.where is a vectorised method and will scale much better.

UPDATE

OK no need to use np.where just use the following simpler syntax:

dfCurrentReportResults['Retention'] =  df.Retention_y.where(df.Retention_x == None, df.Retention_x)

Further update

dfCurrentReportResults['Retention'] =  df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)

edited Jan 8, 2015 at 17:52

answered Jan 8, 2015 at 16:52

EdChum

397k204 gold badges836 silver badges583 bronze badges

5 Comments

EdChum Over a year ago

@DSM I sometimes find the frames and series where syntax slightly confusing a while back due to some subtle differences so I started using np.where from that point on, maybe time to go back and look at it again, I'll post an update, thanks

IcemanBerlin Over a year ago

I was getting a syntex error on the np.where line. the updated line runs but gives me the following error...TypeError: Could not compare <type 'NoneType'> type with Series

alacy Over a year ago

I still find numpy syntax easier to read: dfCurrentReportResults['Retention'] = np.where(df.Retention_x.isnull(), df.Retention_y, df.Retention_x) but that is almost completely objective.

IcemanBerlin Over a year ago

Thanks EdChum your solution using .where also worked fine as another option and in the end I def will use that in future solutions. The final code I used to get yours to work was the following dfCurrentReportResults['RetentionWHERE'] = dfCurrentReportResults.Retention_y.where(dfCurrentReportResults.Retention_x.isnull(), dfCurrentReportResults.Retention_x)

EdChum Over a year ago

@IcemanBerlin no worries, the key thing to take away from this is to look for a vectorised method that will operate on the whole df or series rather than calling apply which loops over the values

Collectives™ on Stack Overflow

Using an if statement in a dataframe with lambda functions

2 Answers 2

5 Comments

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related