Why can't pandas replace NaN with an array of 0s using masks/replace?

Question

I have a series like this

s = pd.Series([[1,2,3],[1,2,3],np.nan,[1,2,3],[1,2,3],np.nan])

and I simply want the NaN to be replaced by [0,0,0].

I have tried

s.fillna([0,0,0]) # TypeError: "value" parameter must be a scalar or dict, but you passed a "list"

s[s.isna()] = [[0,0,0],[0,0,0]] # just replaces the NaN with a single "0". WHY?!

s.fillna("NAN").replace({"NAN":[0,0,0]}) # ValueError: NumPy boolean array indexing assignment cannot 
                                          #assign 3 input values to the 2 output values where the mask is true


s.fillna("NAN").replace({"NAN":[[0,0,0],[0,0,0]]}) # TypeError: NumPy boolean array indexing assignment
                                                   # requires a 0 or 1-dimensional input, input has 2 dimensions

I really can't understand why the first two approaches don't work (maybe I get the first one, but I can't wrap my head around the second).

Thanks to this SO-question and answer, we can do it by

is_na = s.isna()
s.loc[is_na] = s.loc[is_na].apply(lambda x: [0,0,0])

but since apply is often rather slow, I cannot understand why we can't use replace or the mask assignment shown above.

jezrael · Accepted Answer · 2022-05-16 10:00:51Z

Working with lists inside pandas objects is painful; here is a hacky solution:

s = s.fillna(pd.Series([[0,0,0]] * len(s), index=s.index))
print (s)
0    [1, 2, 3]
1    [1, 2, 3]
2    [0, 0, 0]
3    [1, 2, 3]
4    [1, 2, 3]
5    [0, 0, 0]
dtype: object


Shubham Sharma · Accepted Answer · 2022-05-16 10:02:22Z

`Series.reindex`

s.dropna().reindex(s.index, fill_value=[0, 0, 0])

0    [1, 2, 3]
1    [1, 2, 3]
2    [0, 0, 0]
3    [1, 2, 3]
4    [1, 2, 3]
5    [0, 0, 0]
dtype: object


norok2 · Accepted Answer · 2022-05-16 14:50:39Z

The documentation of fillna() states that its value parameter cannot be a list:

Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

This is probably a limitation of the current implementation, and short of patching the source code you must resort to workarounds (as provided below).

However, if you are not planning to work with jagged arrays, what you really want to do is probably replace pd.Series() with pd.DataFrame(), e.g.:

import numpy as np
import pandas as pd


s = pd.DataFrame(
        [[1, 2, 3],
         [1, 2, 3],
         [np.nan],
         [1, 2, 3],
         [1, 2, 3],
         [np.nan]],
        dtype=pd.Int64Dtype())  # to mix integers with NaNs


s.fillna(0)
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  0  0  0
# 3  1  2  3
# 4  1  2  3
# 5  0  0  0

If you do need to use jagged arrays, you could use any of the workarounds proposed in the other answers, or you could make one of your attempts work, e.g.:

ii = s.isna()
nn = ii.sum()
s[ii] = pd.Series([[0, 0, 0]] * nn).to_numpy()
# 0    [1, 2, 3]
# 1    [1, 2, 3]
# 2    [0, 0, 0]
# 3    [1, 2, 3]
# 4    [1, 2, 3]
# 5    [0, 0, 0]
# dtype: object

which basically uses NumPy masking to fill in the Series. The trick is to generate a compatible object for the assignment that works at the NumPy level.
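As a minimal sketch of what "a compatible object" can look like: a nested list such as [[0,0,0],[0,0,0]] is read by NumPy as a 2-D block of numbers, whereas the assignment needs a 1-D object array with one list per masked position. The names mask and fill are only illustrative, and building the array with np.empty() is an alternative to the pd.Series(...).to_numpy() call above, not something the pandas documentation prescribes.

```python
import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2, 3], np.nan, [1, 2, 3], [1, 2, 3], np.nan])
mask = s.isna()

# A plain nested list is interpreted by NumPy as a 2-D array of numbers ...
print(np.asarray([[0, 0, 0], [0, 0, 0]]).shape)  # (2, 3)

# ... whereas a 1-D object array holds one list per masked position.
fill = np.empty(int(mask.sum()), dtype=object)
for i in range(len(fill)):
    fill[i] = [0, 0, 0]  # a fresh list for every element
print(fill.shape)  # (2,)

# Same mask-based assignment as above, with the hand-built object array.
s[mask] = fill
print(s.tolist())
```

Filling the array in a loop also gives every row its own list, whereas [[0, 0, 0]] * nn repeats references to one and the same list object.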

If the input contains many NaNs, it is probably faster to work the other way around, starting from a Series filled with the replacement value and copying over the entries selected by s.notna(), e.g.:

import pandas as pd


result = pd.Series([[0, 0, 0]] * len(s))
result[s.notna()] = s[s.notna()]

Let's try to do some benchmarking, where:

replace_nan_isna() is from above

import pandas as pd


def replace_nan_isna(s, value, inplace=False):
    if not inplace:
        s = s.copy()
    ii = s.isna()
    nn = ii.sum()
    s[ii] = pd.Series([value] * nn).to_numpy()
    return s

replace_nan_notna() is also from above

import pandas as pd


def replace_nan_notna(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    result = pd.Series([value] * len(s))
    result[s.notna()] = s[s.notna()]
    return result

replace_nan_reindex() is from @ShubhamSharma's answer

def replace_nan_reindex(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    # reindex() returns a new Series, so its result must be returned
    return s.dropna().reindex(s.index, fill_value=value)

replace_nan_fillna() is from @jezrael's answer

import pandas as pd


def replace_nan_fillna(s, value, inplace=False):
    if inplace:
        raise ValueError("In-place not supported!")
    # fillna() returns a new Series, so its result must be returned
    return s.fillna(pd.Series([value] * len(s), index=s.index))

with the following code:

import numpy as np
import pandas as pd


def gen_data(n=5, k=2, p=0.7, obj=(1, 2, 3)):
    return pd.Series(([obj] * int(p * n) + [np.nan] * (n - int(p * n))) * k)


funcs = replace_nan_isna, replace_nan_notna, replace_nan_reindex, replace_nan_fillna

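value = (0, 0, 0)  # must be defined before it is first used below

# : inspect results
s = gen_data(5, 1)
for func in funcs:
    print(f'{func.__name__:>20s}  {func(s, value)}')
print()

# : generate benchmarks (%timeit is an IPython magic; run in IPython/Jupyter)
s = gen_data(100, 1000)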
base = funcs[0](s, value)
for func in funcs:
    print(f'{func.__name__:>20s}  {(func(s, value) == base).all()!s:>5}', end='  ')
    %timeit func(s, value)
#     replace_nan_isna   True  100 loops, best of 5: 16.5 ms per loop
#    replace_nan_notna   True  10 loops, best of 5: 46.5 ms per loop
#  replace_nan_reindex   True  100 loops, best of 5: 9.74 ms per loop
#   replace_nan_fillna   True  10 loops, best of 5: 36.4 ms per loop

indicating that reindex() may be the fastest approach.
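Since %timeit only exists inside IPython/Jupyter, here is a rough sketch of a similar comparison as a plain script, using the standard-library timeit module. It covers only two of the four approaches, the helper best_ms() is a hypothetical name introduced for this example, and the input is smaller than in the benchmark above, so the numbers it prints are not comparable to the timings listed there.

```python
import timeit

import numpy as np
import pandas as pd


def replace_nan_isna(s, value):
    # Mask-based assignment with a 1-D object array of fill values.
    s = s.copy()
    ii = s.isna()
    s[ii] = pd.Series([value] * int(ii.sum())).to_numpy()
    return s


def replace_nan_reindex(s, value):
    # reindex() returns a new Series; the dropped labels come back as `value`.
    return s.dropna().reindex(s.index, fill_value=value)


def best_ms(func, *args, number=5, repeat=3):
    # Hypothetical helper: best average wall time per call, in milliseconds.
    times = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(times) / number * 1000


# Smaller input than gen_data(100, 1000), so the script finishes quickly.
s = pd.Series(([(1, 2, 3)] * 70 + [np.nan] * 30) * 100)
value = (0, 0, 0)

for func in (replace_nan_isna, replace_nan_reindex):
    print(f"{func.__name__:>20s}  {best_ms(func, s, value):.3f} ms")
```

Timings depend on the pandas version, the machine, and the share of NaNs in the input, so it is worth re-running this on data shaped like your own before settling on an approach.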
