0

I am trying to add a new column to a dataframe based on an if statement depending on the values of two columns. i.e. if column x == None then column y else column x

below is the script I have written but doesn't work. any ideas?

dfCurrentReportResults['Retention'] =  dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)

Also I got this error message: AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')

fyi: BUSINESSUNIT_NAME is the first column name

Additional Info:

My data printed out looks like this and I want to add a 3rd column to take a value if there is one else keep NaN.

   Retention_x  Retention_y
0            1          NaN
1          NaN     0.672183
2          NaN     1.035613
3          NaN     0.771469
4          NaN     0.916667
5          NaN          NaN
6          NaN          NaN
7          NaN          NaN
8          NaN          NaN
9          NaN          NaN

UPDATE: In the end I was having issues referencing the Null or is Null in my dataframe the final line of code I used also including the axis = 1 answered my question.

 dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)

Thanks @EdChum, @strim099 and @aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.

4
  • is None a string or a NaN? And could you provide a sample set of your data frame so we can better debug any issues? Commented Jan 8, 2015 at 16:45
  • @aus_lacy my use of None was basically an attempt to identify if the value is empty, so I guess it is a NaN and is None? Commented Jan 8, 2015 at 16:46
  • what column are you calling your apply on? A sample of your data would help you get an answer much quicker. Commented Jan 8, 2015 at 16:48
  • I would like to apply the function to the new column and get the results by referencing the other two columns. The data is a bit messy and also confidentially, i will try and knock together some simple data for the question. Commented Jan 8, 2015 at 16:51

2 Answers 2

4

You'r lambda is operating on the 0 axis which is columnwise. Simply add axis=1 to the apply arg list. This is clearly documented.

In [1]: import pandas

In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])

In [3]: dfCurrentReportResults['Retention_x'][1] = None

In [4]: dfCurrentReportResults['Retention_x'][3] = None

In [5]: dfCurrentReportResults
Out[5]:
  Retention_y Retention_x
0           a           b
1           c        None
2           e           f
3           g        None
4           i           j

In [6]: dfCurrentReportResults['Retention'] =  dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)

In [7]: dfCurrentReportResults
Out[7]:
  Retention_y Retention_x Retention
0           a           b         b
1           c        None         c
2           e           f         f
3           g        None         g
4           i           j         j
Sign up to request clarification or add additional context in comments.

5 Comments

thanks strimp099 actually I had tried adding axis = 1 in some of my attempts but got the same message. I think the issue is also that my dataframe value is probably not None, i.e. where you have None I have just a blank. is it the same thing or how can i reference the blank if I can't use None?
Actually anyway when I copy your code above and run your sample data i get this error: AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index 0')
how do you recreate your example to be NaN instead of None?
Assuming you're using numpy, just change your lambda function from x.Retention_x == None to numpy.isnan(x.Retention_x)
Ok in the end this works. dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1) I didnt test your numpy.isnan(x.Retention_x) but i am sure that works as well so in the end i was just having issues with how to reference a NAN in my Lambda. I am marking your answer as correct because it is the closest solution for the title of the question I wrote.
2

Just use np.where:

dfCurrentReportResults['Retention'] =  np.where(df.Retention_x == None, df.Retention_y, else df.Retention_x)

This uses the test condition, the first param and sets the value to df.Retention_y else df.Retention_x

also avoid using apply where possible as this is just going to loop over the values, np.where is a vectorised method and will scale much better.

UPDATE

OK no need to use np.where just use the following simpler syntax:

dfCurrentReportResults['Retention'] =  df.Retention_y.where(df.Retention_x == None, df.Retention_x)

Further update

dfCurrentReportResults['Retention'] =  df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)

5 Comments

@DSM I sometimes find the frames and series where syntax slightly confusing a while back due to some subtle differences so I started using np.where from that point on, maybe time to go back and look at it again, I'll post an update, thanks
I was getting a syntex error on the np.where line. the updated line runs but gives me the following error...TypeError: Could not compare <type 'NoneType'> type with Series
I still find numpy syntax easier to read: dfCurrentReportResults['Retention'] = np.where(df.Retention_x.isnull(), df.Retention_y, df.Retention_x) but that is almost completely objective.
Thanks EdChum your solution using .where also worked fine as another option and in the end I def will use that in future solutions. The final code I used to get yours to work was the following dfCurrentReportResults['RetentionWHERE'] = dfCurrentReportResults.Retention_y.where(dfCurrentReportResults.Retention_x.isnull(), dfCurrentReportResults.Retention_x)
@IcemanBerlin no worries, the key thing to take away from this is to look for a vectorised method that will operate on the whole df or series rather than calling apply which loops over the values

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.