Pandas Apply lambda function null values

Question

I'm trying to split a column in two, but I know there are null values in my data. Imagine this dataframe:

df = pd.DataFrame(['fruit: apple','vegetable: asparagus',None, 'fruit: pear'], columns = ['text'])

df

                   text
0          fruit: apple
1  vegetable: asparagus
2                   None
3           fruit: pear

I'd like to split this into multiple columns like so:

df['cat'] = df['text'].apply(lambda x: 'unknown' if x is None else x.split(': ')[0])
df['value'] = df['text'].apply(lambda x: 'unknown' if x is None else x.split(': ')[1])

print(df)

                   text        cat      value
0          fruit: apple      fruit      apple
1  vegetable: asparagus  vegetable  asparagus
2                  None    unknown    unknown
3           fruit: pear      fruit       pear

However, if I have the following df instead:

df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])

splitting results in the following error:

df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-159-8e5bca809635> in <module>()
      1 df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
      2 #df.columns = ['col_name']
----> 3 df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])
      4 df['value'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[1])

C:\Python27\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2158             values = lib.map_infer(values, lib.Timestamp)
   2159 
-> 2160         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2161         if len(mapped) and isinstance(mapped[0], Series):
   2162             from pandas.core.frame import DataFrame

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:62187)()

<ipython-input-159-8e5bca809635> in <lambda>(x)
      1 df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
      2 #df.columns = ['col_name']
----> 3 df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])
      4 df['value'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[1])

AttributeError: 'float' object has no attribute 'split'

How do I do the same split with NaN values? Is there generally a better way to apply a split function that ignores null values?

Now imagine a non-string example, such as the following:

df = pd.DataFrame([2,4,6,8,10,np.nan,12], columns = ['numerics'])
df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)

I feel like Series.apply should almost take an argument that instructs it to skip null rows and just output them as nulls. I haven't found a better generic way to do transformations to a series without having to manually avoid nulls.

Try df['cat'] = df['text'].apply(lambda x: 'unknown' if pd.isnull(x) else x.split(': ')[0])
– EdChum, Commented May 5, 2016 at 21:18
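For context on why the comment's suggestion helps: NaN never compares equal to anything, including itself, so the `x == np.nan` test in the question is always False and the lambda falls through to `x.split` on a float. A minimal sketch of the difference, reusing the question's dataframe:

```python
import numpy as np
import pandas as pd

x = np.nan
eq_result = (x == np.nan)      # NaN is never equal to NaN, so this is False
isnull_nan = pd.isnull(x)      # pd.isnull recognises NaN
isnull_none = pd.isnull(None)  # and None as well

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# Same lambda as in the question, with the null test swapped for pd.isnull
df['cat'] = df['text'].apply(lambda v: 'unknown' if pd.isnull(v) else v.split(': ')[0])
df['value'] = df['text'].apply(lambda v: 'unknown' if pd.isnull(v) else v.split(': ')[1])
print(df)
```

Because `pd.isnull` treats both None and NaN as missing, the same lambda covers either version of the dataframe.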

unutbu · Accepted Answer · 2016-05-05 21:36:34Z

5

Instead of apply with a custom function you could use the Series.str.extract method:

import numpy as np
import pandas as pd
# df = pd.DataFrame(['fruit: apple','vegetable: asparagus',None, 'fruit: pear'], 
#                   columns = ['text'])
df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], 
                  columns = ['text'])
df[['cat', 'value']] = df['text'].str.extract(r'([^:]+):?\s*(.*)', expand=True).fillna('unknown')
print(df)

yields

                   text        cat      value
0          fruit: apple      fruit      apple
1  vegetable: asparagus  vegetable  asparagus
2                   NaN    unknown    unknown
3           fruit: pear      fruit       pear

apply with a custom function is generally slower than equivalent code which makes use of vectorized methods such as Series.str.extract. Under the hood, apply (with an unvectorizable function) essentially calls the custom function in a Python for-loop.
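As another illustration of the same idea (not part of the original answer), `Series.str.split` with `expand=True` may be a simpler fit when the separator is fixed. This is a sketch that assumes every non-null row contains `': '` at most once as the separator; the `parts` name is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# Split on the first ': ' only; missing rows stay null in both result columns
parts = df['text'].str.split(': ', n=1, expand=True)

# Fill the nulls afterwards instead of guarding against them inside a lambda
df['cat'] = parts[0].fillna('unknown')
df['value'] = parts[1].fillna('unknown')
print(df)
```

The `.str` methods pass nulls through untouched, so no explicit null check is needed before the split.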

Regarding the edited question: If you have

df = pd.DataFrame([2,4,6,8,10,np.nan,12], columns = ['numerics'])

then use

In [207]: df['numerics']/2
Out[207]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
6    6.0
Name: numerics, dtype: float64

instead of

df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)

Again, vectorized arithmetic beats apply with a custom function:

In [210]: df = pd.concat([df]*100, ignore_index=True)

In [211]: %timeit df['numerics']/2
10000 loops, best of 3: 93.8 µs per loop

In [212]: %timeit df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)
1000 loops, best of 3: 836 µs per loop

edited May 5, 2016 at 21:36

answered May 5, 2016 at 21:20

unutbu


5 Comments

unutbu Over a year ago

What version of pandas are you using? It works with version 0.18.0.

flyingmeatball Over a year ago

17.0, but this is a minor point - updating question to reflect.

MaxU - stand with Ukraine Over a year ago

@flyingmeatball, it works. Here is a bit more complicated test case: df = pd.DataFrame(['', 'fruit: apple',None,'vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])

flyingmeatball Over a year ago

@MaxU I believe the simple cases work, I'm less worried about the minor syntax for this example. I'm more interested in trying to find a good way to skip nulls in an apply statement. I find myself needing to do that relatively frequently.

flyingmeatball Over a year ago

@unutbu thanks - I'm beginning to get the sense that the answer to my underlying question is that there isn't a good way to do a vanilla apply and skip nulls - it depends on the individual column. My edited df was more an example of a non-text manipulation, I wouldn't actually use apply in that instance, just wanted to find a case that I couldn't apply a regular expression. Appreciate the help.
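On the underlying question of skipping nulls generically, two patterns may be worth a look; neither appears in the thread above, so treat this as a hedged sketch. `Series.map` accepts `na_action='ignore'`, which leaves missing values in place without calling the function on them, and `dropna()` followed by `reindex` achieves something similar with `apply`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

text = pd.Series(['fruit: apple', None, np.nan, 'fruit: pear'])
nums = pd.Series([2, 4, np.nan, 12])

# Option 1: Series.map with na_action='ignore' never calls the function on nulls
cat = text.map(lambda x: x.split(': ')[0], na_action='ignore').fillna('unknown')
halved = nums.map(lambda x: x / 2.0, na_action='ignore')

# Option 2: drop nulls, apply, then reindex to restore the original rows as NaN
cat_alt = (text.dropna()
               .apply(lambda x: x.split(': ')[0])
               .reindex(text.index)
               .fillna('unknown'))

print(cat.tolist())
print(halved.tolist())
```

Both options still run a Python-level function per row, so the vectorized alternatives from the answer remain preferable where they exist.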
