40

I'm trying to replace some NaN values in my data with an empty list []. However, the list ends up represented as a str, so I can't properly apply the len() function to it. Is there any way to replace a NaN value with an actual empty list in pandas?

In [28]: d = pd.DataFrame({'x' : [[1,2,3], [1,2], np.NaN, np.NaN], 'y' : [1,2,3,4]})

In [29]: d
Out[29]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2        NaN  3
3        NaN  4

In [32]: d.x.replace(np.NaN, '[]', inplace=True)

In [33]: d
Out[33]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [34]: d.x.apply(len)
Out[34]:
0    3
1    2
2    2
3    2
Name: x, dtype: int64

4 Answers

45

This works using isnull and loc to mask the series:

In [90]:
d.loc[d.isnull()] = d.loc[d.isnull()].apply(lambda x: [])
d

Out[90]:
0    [1, 2, 3]
1       [1, 2]
2           []
3           []
dtype: object

In [91]:
d.apply(len)

Out[91]:
0    3
1    2
2    0
3    0
dtype: int64

You have to use apply here so that the list object is not interpreted as an array to assign back to the df; pandas would otherwise try to align its shape with the original series.
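
As a rough illustration of the alignment point (a sketch, not the answerer's own code; the names s, mask and fill are chosen here only for illustration): the Series returned by apply keeps the index labels of the masked rows, which is what lets the assignment line up row by row.

```python
import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2], np.nan, np.nan])
mask = s.isnull()

# apply returns an object Series holding one new list per masked row,
# labelled with the original index of those rows.
fill = s.loc[mask].apply(lambda _: [])
print(fill.index.tolist())

# Assigning a Series aligns on those labels instead of unpacking a list.
s.loc[mask] = fill
print(s.apply(len).tolist())
```

Because each call to the lambda builds its own list, the filled rows do not share a single list object.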

EDIT

Using your updated sample the following works:

In [100]:
d.loc[d['x'].isnull(),['x']] = d.loc[d['x'].isnull(),'x'].apply(lambda x: [])
d

Out[100]:
           x  y
0  [1, 2, 3]  1
1     [1, 2]  2
2         []  3
3         []  4

In [102]:    
d['x'].apply(len)

Out[102]:
0    3
1    2
2    0
3    0
Name: x, dtype: int64

1 Comment

What if we want to extend this to multiple columns of the df?
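
One possible way to do that, offered as a sketch rather than a tested recipe, is to loop over the columns that hold lists and repeat the same mask-and-apply assignment for each. The column z and the list_columns variable are hypothetical names added for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [[1, 2, 3], [1, 2], np.nan, np.nan],
    "z": [np.nan, [7], [8, 9], np.nan],   # hypothetical second list column
    "y": [1, 2, 3, 4],
})

list_columns = ["x", "z"]  # assumption: the caller knows which columns hold lists

for col in list_columns:
    mask = df[col].isnull()
    # Same pattern as the single-column case, applied one column at a time.
    df.loc[mask, col] = df.loc[mask, col].apply(lambda _: [])

print(df)
```

Restricting the loop to known list columns leaves numeric columns such as y untouched.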
12

To extend the accepted answer, apply calls can be particularly expensive - the same task can be accomplished without it by constructing a numpy array from scratch.

isna = df['x'].isna()
df.loc[isna, 'x'] = pd.Series([[]] * isna.sum()).values
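
One caveat with [[]] * n, which is general Python behaviour rather than anything specific to pandas: the multiplication repeats references to a single list object, so mutating one filled cell in place later would change all of them. A small sketch of the difference, with a list comprehension as one alternative when the cells need to be independent:

```python
n = 3

# n references to the same list object
shared = [[]] * n
shared[0].append("a")
print(shared)

# n distinct list objects
fresh = [[] for _ in range(n)]
fresh[0].append("a")
print(fresh)
```

If the empty lists are only ever read (for example with len), the shared version is harmless.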

A quick timing comparison:

def empty_assign_1(s):
    s[s.isna()].apply(lambda x: [])

def empty_assign_2(s):
    [[]] * s.isna().sum()

series = pd.Series(np.random.choice([1, 2, np.nan], 1000000))

%timeit empty_assign_1(series)
>>> 61 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series)
>>> 2.17 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Roughly 28 times faster by the timings above!

EDIT: Fixed a bug pointed out by @valentin

You have to be somewhat careful with data types when performing assignment in this case. In the example above the test series is float, but adding [] elements coerces the entire series to object. Pandas will handle that for you if you do something like

idx = series.isna()
series[idx] = series[idx].apply(lambda x: [])

This works because the output of apply is itself a series. You can test live performance with assignment overhead like so (I've added a string value so the series will be an object; you could instead use a number as the replacement value rather than an empty list to avoid coercion).

def empty_assign_1(s):
    idx = s.isna()
    s[idx] = s[idx].apply(lambda x: [])

def empty_assign_2(s):
    idx = s.isna()
    s.loc[idx] = [[]] * idx.sum()

series = pd.Series(np.random.choice([1, 2, np.nan, '2'], 1000000))

%timeit empty_assign_1(series.copy())
>>> 45.1 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit empty_assign_2(series.copy())
>>> 24 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

About 4 ms of each timing is related to the copy. With the assignment included, the speedup shrinks to roughly 2x, which is still pretty great.
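
To make the dtype point concrete, here is a minimal sketch that casts a float series to object before filling, so the change of dtype is deliberate rather than a side effect of the assignment. The variable name as_object is illustrative only.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, 2.0, np.nan])  # float64 to begin with

# Cast up front so the series can hold Python lists.
as_object = series.astype(object)
idx = as_object.isna()
as_object.loc[idx] = as_object.loc[idx].apply(lambda _: [])

print(series.dtype, as_object.dtype)
print(as_object.tolist())
```

The original float series is left unchanged, which can be convenient if the numeric version is still needed elsewhere.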

2 Comments

This answer is misleading since the implementation of the first function empty_assign_1() seems incorrect. It applies the lambda function on every element in the series instead of only on those where the value is actually NaN. It should be s[s.isna()].apply(...). Performing the timing comparison after this fix actually reverses the results so that the first function becomes faster.
Hah! You actually did catch an error, I seem to have forgotten that isna is not the reciprocal of dropna. That being said, the original post is still correct. The reason you're observing a reversal is because of the unnecessary constructor call to pd.Series (which is also quite slow). Just use [[]]*s.isna().sum() and you'll be back in business. The context of this specific question is complicated by replacing nans with a list because of the way pandas interprets list inputs so you'll need to create series with dtype='object' and .loc for assignment (or replace with a non list).
9

You can also use a list comprehension for this:

d['x'] = [ [] if x is np.nan else x for x in d['x'] ]
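
The identity test only matches the np.nan singleton itself, so a NaN created another way (for example float("nan")) slips through. A hedged variant, assuming every valid entry in the column is a list, checks the type instead:

```python
import numpy as np
import pandas as pd

values = [[1, 2, 3], [1, 2], np.nan, float("nan")]

# Identity check: the float("nan") entry is a different object and is kept.
by_identity = [[] if x is np.nan else x for x in values]

# Type check: anything that is not a list becomes an empty list.
by_type = [x if isinstance(x, list) else [] for x in values]

d = pd.DataFrame({"x": values, "y": [1, 2, 3, 4]})
d["x"] = [x if isinstance(x, list) else [] for x in d["x"]]
print(d["x"].apply(len).tolist())
```

The type check would also overwrite any non-list value that is not NaN, so it only fits columns that are meant to contain lists and nothing else.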


0
import pandas as pd
import numpy as np

data = {'column1': [[1, 2], [2, 3], np.nan, [4, 5], np.nan],
        'column2': [np.nan, "Hi", "Hello", np.nan, "H"]}

df = pd.DataFrame(data)

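def replace_none_with_empty_list(x):
    # Test the value rather than identity: `x is np.nan` misses NaNs
    # that are not the np.nan singleton (e.g. float('nan')).
    if isinstance(x, float) and np.isnan(x):
        return []
    else:
        return x

# applymap is deprecated since pandas 2.1 in favour of DataFrame.map.
df = df.applymap(replace_none_with_empty_list)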

print(df)

Wherever a NaN is found, this replaces it with an empty list; otherwise it returns the value unchanged.

  column1 column2
0  [1, 2]      []
1  [2, 3]      Hi
2      []   Hello
3  [4, 5]      []
4      []       H

