
I have a pandas DataFrame

        ID  Unique_Countries
    0  123       [Japan]
    1  124        [nan]
    2  125  [US, Brazil]
    ...

I got the Unique_Countries column by aggregating the unique countries within each ID group. Many IDs had only NaN values in the original country column; those now appear as in row 1. I would like to filter on these rows but can't seem to. When I type

    df.Unique_Countries[1]

I get

    array([nan], dtype=object)

I have tried several methods including

isnull() and isnan()

but it gets messed up because it is a numpy array.
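For reference, here is a minimal reconstruction of the setup (the raw data and the aggregation step are assumptions based on the description above):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one row per (ID, Country) observation
raw = pd.DataFrame({
    "ID": [123, 124, 124, 125, 125],
    "Country": ["Japan", np.nan, np.nan, "US", "Brazil"],
})

# Aggregate to the unique countries per ID, mirroring the step described above
df = raw.groupby("ID")["Country"].unique().reset_index(name="Unique_Countries")

print(df.Unique_Countries[1])  # [nan]
# Each cell holds a numpy object array, not a scalar, so the column itself
# is never "null" - isnull() on it is all False.
print(df.Unique_Countries.isnull().any())  # False
```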

3 Comments

  • Let's try df.Unique_Countries.str.contains('nan') Commented Oct 4, 2020 at 23:00
  • @wwnde it just lists every row with a NaN next to it. It does that if I try .contains('US') instead of .contains('nan') as well Commented Oct 4, 2020 at 23:08
  • I can't quite understand what you need. I thought all you needed was to select from the outcome of your initial operation. If you need to drop the NaN rows and keep the rest, try df[~df.Unique_Countries.str.contains('nan')] Commented Oct 4, 2020 at 23:11
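For what it's worth, the behaviour described in the comments is reproducible: the `.str` accessor yields NaN for any element that is not a string, and every cell here is an array, so `.str.contains` flags every row as NaN regardless of the pattern (toy data for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.array(["Japan"], dtype=object),
               np.array([np.nan], dtype=object),
               np.array(["US", "Brazil"], dtype=object)])

# .str methods only operate on strings; each cell is an ndarray,
# so the result is NaN for every row - whatever the pattern.
print(s.str.contains("nan"))
print(s.str.contains("US"))
```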

2 Answers


If your cell can have NaN somewhere other than the first position, try explode with groupby(...).all():

    df[df.Unique_Countries.explode().notna().groupby(level=0).all()]

or (note: Series.all(level=...) is deprecated since pandas 1.3 and removed in 2.0; prefer the groupby form)

    df[df.Unique_Countries.explode().notna().all(level=0)]

Let's try

    df.Unique_Countries.str[0].isna()   # True for the [nan] rows

    df.Unique_Countries.str[0].notna()  # False for the [nan] rows

To pick only the rows without NaN, use the mask above:

    df[df.Unique_Countries.str[0].notna()]
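A quick side-by-side of the two filters on toy data (values invented for illustration), showing why the explode form is needed when NaN is not the first element:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 124, 125, 126],
    "Unique_Countries": [
        np.array(["Japan"], dtype=object),
        np.array([np.nan], dtype=object),
        np.array(["US", "Brazil"], dtype=object),
        np.array(["Australia", np.nan], dtype=object),  # NaN not first
    ],
})

# str[0] only checks the first element, so ID 126 slips through
first_only = df[df.Unique_Countries.str[0].notna()]
print(first_only.ID.tolist())  # [123, 125, 126]

# explode + groupby.all keeps a row only if every element is non-NaN
clean = df[df.Unique_Countries.explode().notna().groupby(level=0).all()]
print(clean.ID.tolist())       # [123, 125]
```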

5 Comments

I don't think it's either. Your answer returned False for every row. I say I don't think it's a string because when I type df.Unique_Countries[0] it gives me array(['US'], dtype=object), where there are quotation marks around US
could you share the output of this command: type(df.Unique_Countries[1][0])
I think we're getting somewhere. It says it's a float
that did the trick! I figured it was just a weird type of NaN - thanks a lot!
wait, real quick - this answer only works if nan is the first value of the array. Is there something I could add to make it work if it isn't the first? i.e. [Australia, nan]

I believe that the answers based on the string method contains would fail if a country name contains the substring nan.

In my opinion the solution should be this:

    df.explode('Unique_Countries').dropna().groupby('ID', as_index=False).agg(list)

This code drops the NaN values from your DataFrame and returns the dataset in its original form.

I am not sure from your question whether you want to drop the NaN values or to find the IDs of the records which have NaN in the Unique_Countries column. For the latter, you can use something like:

    long_ss = df.set_index('ID').squeeze().explode()
    long_ss[long_ss.isna()]
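Putting both pieces together on a small example (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [123, 124, 125],
    "Unique_Countries": [["Japan"], [np.nan], ["US", "Brazil"]],
})

# IDs whose lists contained NaN
long_ss = df.set_index("ID").squeeze().explode()
print(long_ss[long_ss.isna()].index.tolist())  # [124]

# Drop the NaNs and re-aggregate back to one list per ID;
# note that IDs with only NaN (here 124) disappear entirely
clean = df.explode("Unique_Countries").dropna().groupby("ID", as_index=False).agg(list)
print(clean.ID.tolist())  # [123, 125]
```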

