
When applying an aggregation to a grouped pandas DataFrame, the aggregated output appears to contain different values for groups consisting only of missing values, depending on the dtype of the column. Below is a minimal example with three DataFrames, each containing one non-missing value (an integer, a string, and a tuple, respectively), one NaN, and one None:

import pandas as pd
import numpy as np

a1 = pd.DataFrame({'a': [3, np.nan, None], 'b': [0,1,2]})
a2 = pd.DataFrame({'a': ['tree', np.nan, None], 'b': [0,1,2]})
a3 = pd.DataFrame({'a': [(0,1,2), np.nan, None], 'b': [0,1,2]})

a1.groupby('b')['a'].first()
a2.groupby('b')['a'].first()
a3.groupby('b')['a'].first()

a1.groupby('b')['a'].agg('first')
a2.groupby('b')['a'].agg('first')
a3.groupby('b')['a'].agg('first')

Looking at the dtypes of column 'a', these are float64, object and object for a1, a2 and a3, respectively. The None in a1 is converted to NaN at DataFrame creation. Therefore I would expect the following

Expected output behavior:

  • a1: NaN for rows 1 and 2 (that is the case)
  • a2: NaN and None for rows 1 and 2 (not the case)
  • a3: NaN and None for rows 1 and 2 (not the case)
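
The dtype observation above can be checked directly; a quick sketch:

```python
import pandas as pd
import numpy as np

a1 = pd.DataFrame({'a': [3, np.nan, None], 'b': [0, 1, 2]})
a2 = pd.DataFrame({'a': ['tree', np.nan, None], 'b': [0, 1, 2]})

print(a1['a'].dtype)  # float64: None is coerced to NaN at creation
print(a2['a'].dtype)  # object: NaN and None are stored as distinct objects

# In the float64 column both missing entries are already NaN:
print(a1['a'].iloc[2])           # nan
# In the object column the original None survives creation:
print(a2['a'].iloc[2] is None)   # True
```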

Actual output:

b
0    3.0
1    NaN
2    NaN
Name: a, dtype: float64

b
0    tree
1    None
2    None
Name: a, dtype: object

b
0    (0, 1, 2)
1         None
2         None
Name: a, dtype: object

Why does the aggregation change the data from NaN to None for row 1 in a2 and a3? Since the column is of dtype object anyway, there should be no issue in returning NaN and None for rows 1 and 2, respectively; and this is not a scenario where any group to be aggregated contains both NaNs and Nones. The documentation (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html) is not very precise on this behavior either; it only mentions that the returned value for an all-NA column is NA.


Update:

As mentioned in @mozway's answer below, skipna=False can be used to preserve NaN and None, respectively, for groups consisting purely of NaN/None. However, this does not work when the data contains both mixed missing/non-missing groups and all-missing groups (e.g. [[np.nan, None, 'tree'], [np.nan, None]]): there we still want the first non-missing value, which requires skipna=True.

1 Answer

By default, groupby.first removes the NaNs.

DataFrameGroupBy.first(numeric_only=False, min_count=-1, skipna=True)

Compute the first entry of each column within each group.

Defaults to skipping NA elements.

Thus, the aggregation ignores all your NaNs and outputs the default NA value for your dtype (NaN for numeric, None for object).

You should use skipna=False (added to groupby.first in pandas 2.2.1):

a2.groupby('b')['a'].first(skipna=False)

# with agg
a3.groupby('b')['a'].agg('first', skipna=False)

Output:

# for a2
b
0    tree
1     NaN
2    None
Name: a, dtype: object

# for a3
b
0    (0, 1, 2)
1          NaN
2         None
Name: a, dtype: object

mixed NaN/None

If you have an object Series and a mix of NaN/None, then (with skipna=False) the first object is returned (as expected):

(pd.DataFrame({'a': [np.nan, None, None, np.nan, 'X'],
               'b': [0,0,1,1,2]})
   .groupby('b')['a'].first(skipna=False)
)

b
0     NaN
1    None
2       X
Name: a, dtype: object

custom first function:

If you want the first non-null value, or, when the group is entirely null, its first null with the original NaN/None representation preserved:

def first(s):
    # First non-null value; if the group is all-null, fall back to its
    # first element so the original NaN/None is preserved.
    return next(iter(s.dropna()), s.iloc[0])

(pd.DataFrame({'a': [np.nan, None, None, np.nan, np.nan, 'X'],
               'b': [0,0,1,1,2,2]})
   .groupby('b')['a'].agg(first)
)

Output:

b
0     NaN
1    None
2       X
Name: a, dtype: object

5 Comments

Thanks for the detailed answer. Has the default of skipna been chosen as True specifically to cover the intended handling of the mixed case?
I think it's rather to have the first non-missing value by default, which is usually the most useful
But that means that if I have both groups with mixed non-NaN/NaN values as well as all-NaN/None values, I cannot use skipna=False to fix the dtype.
Any way around post-replacing then? Of course I do want the first non-missing value, but if there are only missing values, then pandas should not change NaN to None.
The way I see it, returning the first non-NaN element and the returned missing value representation should not be combined in a single function arg.
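
One possible workaround along the lines of the post-replacement idea discussed above: aggregate with the default skipna=True to get the first non-missing value per group, then normalize the missing-value representation afterwards. A hedged sketch (data is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [np.nan, None, 'tree', np.nan, None],
                   'b': [0, 0, 0, 1, 1]})

out = df.groupby('b')['a'].first()  # all-missing group 1 comes back as None

# Series.where keeps values where the condition holds and replaces the
# rest, so every missing entry becomes NaN regardless of whether pandas
# produced NaN or None.
out = out.where(out.notna(), np.nan)
print(out)
```

Note this collapses None to NaN as well; if preserving the original NaN/None per group matters, the custom first function in the answer is the better route.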
