
I have a pandas DataFrame

        ID  Unique_Countries
    0  123       [Japan]
    1  124        [nan]
    2  125  [US, Brazil]
    ...

I got the Unique_Countries column by aggregating the unique countries within each ID group. Many IDs had only NaN values in the original country column; those now appear as in row 1. I would like to filter on these rows but can't seem to. When I type

    df.Unique_Countries[1]

I get

    array([nan], dtype=object)

I have tried several methods including

isnull() and isnan()

but it gets messed up because it is a numpy array.
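For reference, here is a minimal reconstruction of the setup (the raw data and the aggregation step are assumptions based on the description above):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one row per (ID, Country) observation
raw = pd.DataFrame({
    "ID": [123, 124, 124, 125, 125],
    "Country": ["Japan", np.nan, np.nan, "US", "Brazil"],
})

# Aggregate to the unique countries per ID, mirroring the step described above
df = raw.groupby("ID")["Country"].unique().reset_index(name="Unique_Countries")

print(df.Unique_Countries[1])  # [nan]
# Each cell holds a numpy object array, not a scalar, so the column itself
# is never "null" - isnull() on it is all False.
print(df.Unique_Countries.isnull().any())  # False
```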

3 Comments

  • Let's try df.Unique_Countries.str.contains('nan') Commented Oct 4, 2020 at 23:00
  • @wwnde it just lists every row with a NaN next to it. It does that if I try .contains('US') instead of .contains('nan') as well Commented Oct 4, 2020 at 23:08
  • I can't quite understand what you need. I thought all you needed was to select from the outcome of your initial operation. If you need to drop the NaN rows and keep the rest, try df[~df.Unique_Countries.str.contains('nan')] Commented Oct 4, 2020 at 23:11
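For what it's worth, the behaviour described in the comments is reproducible: the `.str` accessor yields NaN for any element that is not a string, and every cell here is an array, so `.str.contains` flags every row as NaN regardless of the pattern (toy data for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.array(["Japan"], dtype=object),
               np.array([np.nan], dtype=object),
               np.array(["US", "Brazil"], dtype=object)])

# .str methods only operate on strings; each cell is an ndarray,
# so the result is NaN for every row - whatever the pattern.
print(s.str.contains("nan"))
print(s.str.contains("US"))
```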

2 Answers


If your cell can have NaN somewhere other than the first position, try explode with groupby(...).all():

    df[df.Unique_Countries.explode().notna().groupby(level=0).all()]

or (note: Series.all(level=...) is deprecated since pandas 1.3 and removed in 2.0; prefer the groupby form)

    df[df.Unique_Countries.explode().notna().all(level=0)]

Let's try

    df.Unique_Countries.str[0].isna()   # True for the [nan] rows

    df.Unique_Countries.str[0].notna()  # False for the [nan] rows

To pick only the rows without NaN, use the mask above:

    df[df.Unique_Countries.str[0].notna()]
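A quick side-by-side of the two filters on toy data (values invented for illustration), showing why the explode form is needed when NaN is not the first element:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 124, 125, 126],
    "Unique_Countries": [
        np.array(["Japan"], dtype=object),
        np.array([np.nan], dtype=object),
        np.array(["US", "Brazil"], dtype=object),
        np.array(["Australia", np.nan], dtype=object),  # NaN not first
    ],
})

# str[0] only checks the first element, so ID 126 slips through
first_only = df[df.Unique_Countries.str[0].notna()]
print(first_only.ID.tolist())  # [123, 125, 126]

# explode + groupby.all keeps a row only if every element is non-NaN
clean = df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
print(clean.ID.tolist())       # [123, 125]
```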

5 Comments

I don't think it's either. Your answer returned False for every row. I say I don't think it's a string because when I type df.Unique_Countries[0] it gives me array(['US'], dtype=object), where there are quotation marks around US
could you share the output of this command: type(df.Unique_Countries[1][0])
I think we're getting somewhere. It says it's a float
that did the trick! I figured it was just a weird type of NaN - thanks a lot!
wait, real quick - this answer only works if nan is the first value of the array. Is there something I could add to make it work if it isn't the first? i.e. [Australia, nan]

I believe that the answers based on the string method contains would fail if a country name contains the substring nan.

In my opinion the solution should be this:

    df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)

This code drops the NaN values from your DataFrame and returns the dataset in its original form.

I am not sure from your question whether you want to drop the NaN values or to find the IDs of the records which have NaN in the Unique_Countries column. For the latter, you can use something like:

    long_ss = df.set_index('ID').squeeze().explode()
    long_ss[long_ss.isna()]
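Putting both pieces together on a small example (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 124, 125],
    "Unique_Countries": [["Japan"], [np.nan], ["US", "Brazil"]],
})

# IDs whose lists contained NaN
long_ss = df.set_index("ID").squeeze().explode()
print(long_ss[long_ss.isna()].index.tolist())  # [124]

# Drop the NaNs and re-aggregate back to one list per ID;
# note that IDs with only NaN (here 124) disappear entirely
clean = df.explode("Unique_Countries").dropna().groupby("ID", as_index=False).agg(list)
print(clean.ID.tolist())  # [123, 125]
```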

