
I'm trying to split a column in two, but I know there are null values in my data. Imagine this dataframe:

df = pd.DataFrame(['fruit: apple','vegetable: asparagus',None, 'fruit: pear'], columns = ['text'])

df

                   text
0          fruit: apple
1  vegetable: asparagus
2                   None
3           fruit: pear

I'd like to split this into multiple columns like so:

df['cat'] = df['text'].apply(lambda x: 'unknown' if x is None else x.split(': ')[0])
df['value'] = df['text'].apply(lambda x: 'unknown' if x is None else x.split(': ')[1])

print(df)

                   text        cat      value
0          fruit: apple      fruit      apple
1  vegetable: asparagus  vegetable  asparagus
2                  None    unknown    unknown
3           fruit: pear      fruit       pear

However, if I have the following df instead:

df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])

splitting results in the following error:

df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-159-8e5bca809635> in <module>()
      1 df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
      2 #df.columns = ['col_name']
----> 3 df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])
      4 df['value'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[1])

C:\Python27\lib\site-packages\pandas\core\series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2158             values = lib.map_infer(values, lib.Timestamp)
   2159 
-> 2160         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2161         if len(mapped) and isinstance(mapped[0], Series):
   2162             from pandas.core.frame import DataFrame

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:62187)()

<ipython-input-159-8e5bca809635> in <lambda>(x)
      1 df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
      2 #df.columns = ['col_name']
----> 3 df['cat'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[0])
      4 df['value'] = df['text'].apply(lambda x: 'unknown' if x == np.nan else x.split(': ')[1])

AttributeError: 'float' object has no attribute 'split'

How do I do the same split with NaN values? Is there generally a better way to apply a split function that ignores null values?

Now imagine this weren't a string example, and I had the following instead:

df = pd.DataFrame([2,4,6,8,10,np.nan,12], columns = ['numerics'])
df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)

I feel like Series.apply should almost take an argument that instructs it to skip null rows and just output them as nulls. I haven't found a better generic way to do transformations to a series without having to manually avoid nulls.

    try df['cat'] = df['text'].apply(lambda x: 'unknown' if pd.isnull(x) else x.split(': ')[0]) Commented May 5, 2016 at 21:18
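
The suggestion above works because NaN is a float that never compares equal to anything, itself included. The check x == np.nan is therefore always False, and the lambda falls through to x.split on a float. A minimal sketch of the difference, reusing the question's dataframe:

```python
import numpy as np
import pandas as pd

# NaN is a float and is not equal to anything, not even itself.
print(np.nan == np.nan)   # False
print(pd.isnull(np.nan))  # True
print(pd.isnull(None))    # True, so one check covers both kinds of missing value

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# pd.isnull catches the NaN row before .split is ever called on it.
df['cat'] = df['text'].apply(lambda x: 'unknown' if pd.isnull(x) else x.split(': ')[0])
df['value'] = df['text'].apply(lambda x: 'unknown' if pd.isnull(x) else x.split(': ')[1])
print(df)
```

Because pd.isnull treats None and NaN alike, the same lambda handles both versions of the dataframe in the question.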

1 Answer

Instead of apply with a custom function you could use the Series.str.extract method:

import numpy as np
import pandas as pd
# df = pd.DataFrame(['fruit: apple','vegetable: asparagus',None, 'fruit: pear'], 
#                   columns = ['text'])
df = pd.DataFrame(['fruit: apple','vegetable: asparagus',np.nan, 'fruit: pear'], 
                  columns = ['text'])
# ':?\s*' consumes the colon and any spaces after it, so 'value' has no leading space.
df[['cat', 'value']] = df['text'].str.extract(r'([^:]+):?\s*(.*)', expand=True).fillna('unknown')
print(df)

yields

                   text        cat      value
0          fruit: apple      fruit      apple
1  vegetable: asparagus  vegetable  asparagus
2                   NaN    unknown    unknown
3           fruit: pear      fruit       pear

apply with a custom function is generally slower than equivalent code which makes use of vectorized methods such as Series.str.extract. Under the hood, apply (with an unvectorizable function) essentially calls the custom function in a Python for-loop.
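
Another option along the same lines, not part of the original answer, is Series.str.split with expand=True. The .str methods leave missing values as NaN rather than raising, so the nulls can be filled in afterwards. A sketch, assuming every non-null entry contains the ': ' separator (the name parts is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'],
                  columns=['text'])

# Split on the first ': ' only; the NaN row comes back as NaN in both columns.
parts = df['text'].str.split(': ', n=1, expand=True)

# Fill the missing rows after the split instead of guarding against them beforehand.
df['cat'] = parts[0].fillna('unknown')
df['value'] = parts[1].fillna('unknown')
print(df)
```

If some rows might lack the separator, the second column would be missing for those rows too, so they would also end up as 'unknown' here.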


Regarding the edited question: If you have

df = pd.DataFrame([2,4,6,8,10,np.nan,12], columns = ['numerics'])

then use

In [207]: df['numerics']/2
Out[207]: 
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    NaN
6    6.0
Name: numerics, dtype: float64

instead of

df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)

Again, vectorized arithmetic beats apply with a custom function:

In [210]: df = pd.concat([df]*100, ignore_index=True)

In [211]: %timeit df['numerics']/2
10000 loops, best of 3: 93.8 µs per loop

In [212]: %timeit df['numerics'].apply(lambda x: np.nan if pd.isnull(x) else x/2.0)
1000 loops, best of 3: 836 µs per loop
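
For transformations that have no vectorized equivalent, one option the answer does not mention is Series.map with na_action='ignore', which passes missing values through untouched instead of calling the function on them. A minimal sketch (the series names and lambdas are illustrative):

```python
import numpy as np
import pandas as pd

text = pd.Series(['fruit: apple', 'vegetable: asparagus', np.nan, 'fruit: pear'])

# na_action='ignore' skips missing entries, so the lambda only ever sees strings.
cat = text.map(lambda s: s.split(': ')[0], na_action='ignore')
print(cat.fillna('unknown'))

numerics = pd.Series([2, 4, 6, 8, 10, np.nan, 12])

# The same pattern on numbers: the NaN row stays NaN without an explicit null check.
halved = numerics.map(lambda x: x / 2.0, na_action='ignore')
print(halved)
```

This still calls the function once per element in Python, so it is likely no faster than apply; it only removes the manual null check.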

5 Comments

What version of pandas are you using? It works with version 0.18.0.
0.17.0, but this is a minor point. I'm updating the question to reflect that.
@flyingmeatball, it works. Here is a bit more complicated test case: df = pd.DataFrame(['', 'fruit: apple',None,'vegetable: asparagus',np.nan, 'fruit: pear'], columns = ['text'])
@MaxU I believe the simple cases work, I'm less worried about the minor syntax for this example. I'm more interested in trying to find a good way to skip nulls in an apply statement. I find myself needing to do that relatively frequently.
@unutbu thanks - I'm beginning to get the sense that the answer to my underlying question is that there isn't a good way to do a vanilla apply and skip nulls - it depends on the individual column. My edited df was more an example of a non-text manipulation, I wouldn't actually use apply in that instance, just wanted to find a case that I couldn't apply a regular expression. Appreciate the help.
