8

I know how to drop a row from a DataFrame containing all nulls OR a single null but can you drop a row based on the nulls for a specified set of columns?

For example, say I am working with data containing geographical info (city, latitude, and longitude) in addition to numerous other fields. I want to keep the rows that at a minimum contain a value for city OR for lat and long but drop rows that have null values for all three.

I am having trouble finding functionality for this in pandas documentation. Any guidance would be appreciated.

4
  • mate, it's in the documentation. Check the help for the dropna function Commented Feb 8, 2017 at 23:10
  • @GeneBurinsky, no, dropna() will work incorrectly in this case. Check a row with index 4 in my example. df.dropna(subset=['city','latitude','longitude'], how='all') will drop it... Commented Feb 8, 2017 at 23:12
  • 1
    @MaxU, that is a fair point. However, at least for your example, this will work: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2). But in general you're right, explicit logical statements for what is desired are superior to the dropna solution. Commented Feb 8, 2017 at 23:36
  • @GeneBurinsky, wow! i've completely missed out this parameter... Could you please write it as an answer? Commented Feb 8, 2017 at 23:38

5 Answers

8

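You can use DataFrame.dropna, but instead of how='all' with subset=[...], use the thresh parameter, which sets the minimum number of non-NA values (within subset) a row needs in order to be kept. In the city, long/lat example, thresh=2 works because every row we want to keep has at least two non-null values among those three columns. Keep in mind that this is a count rather than the exact rule: a row with a city but no coordinates has only one non-null value and would be dropped as well. Using the great example data set up by MaxU, we would do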

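import pandas as pd

## get MaxU's example data via copy/paste (i.e. read_clipboard)
df = pd.read_clipboard()

## remove undesired rows: keep rows with at least 2 non-null values in the subset
## (subset takes a flat list of column labels, not a nested list)
df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)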

This yields:

In [5]: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)
Out[5]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
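
A small sketch of where thresh=2 and the rule from the question diverge. It assumes a hypothetical row ('ccc') that has a city but neither coordinate; that row is not in MaxU's data. Because thresh only counts non-null values, the row is dropped, whereas an explicit boolean mask keeps it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a city but no latitude or longitude.
df = pd.DataFrame({
    "city": ["aaa", "ccc", None],
    "latitude": [11.1111, np.nan, np.nan],
    "longitude": [np.nan, np.nan, np.nan],
})
cols = ["city", "latitude", "longitude"]

# thresh=2 keeps rows with at least two non-null values among cols.
by_thresh = df.dropna(subset=cols, thresh=2)

# The rule as stated in the question: city OR (latitude AND longitude).
keep = df["city"].notna() | (df["latitude"].notna() & df["longitude"].notna())
by_mask = df[keep]

print(by_thresh.index.tolist())  # rows kept by the count-based approach
print(by_mask.index.tolist())    # rows kept by the explicit rule
```

If rows like 'ccc' can occur in your data, the explicit mask is probably the safer choice.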

1 Comment

Thanks! Clear and concise solution.
4

dropna has a parameter to apply the tests only on a subset of columns:

df.dropna(axis=0, how='all', subset=['city', 'latitude', 'longitude'])

1 Comment

Note that, as MaxU mentioned in the comments, this wouldn't quite work on the example test set.
3

Try this:

In [25]: df
Out[25]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
2  NaN       NaN        NaN  3  4
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1

In [26]: df.query("city == city or (latitude == latitude and longitude == longitude)")
Out[26]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2

If I understand the OP correctly, the row with index 4 must be dropped because it has no city and only one of the two coordinates. So dropna() with how='all' won't work "properly" in this case:

In [62]: df.dropna(subset=['city','latitude','longitude'], how='all')
Out[62]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1   # this row should be dropped...
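
The query() call above relies on NaN never comparing equal to itself, so "col == col" is True only where col is not null. A minimal sketch, with the example data rebuilt by hand, that checks the query against the same rule written with explicit notna() calls:

```python
import numpy as np
import pandas as pd

# The example data from this answer, rebuilt by hand.
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

# NaN != NaN, so each self-comparison acts as a "not null" test.
via_query = df.query(
    "city == city or (latitude == latitude and longitude == longitude)"
)

# The same rule spelled out with notna().
mask = df["city"].notna() | (df["latitude"].notna() & df["longitude"].notna())
via_mask = df[mask]

print(via_query.index.tolist())
print(via_mask.index.tolist())
```

The notna() form is longer but may be easier to read for anyone unfamiliar with the self-comparison trick.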

1 Comment

That's correct, index 4 would need to be dropped. This seems to be what I was looking for. I wasn't aware you could use the booleans in this way for query(). Thanks!
1

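Using a boolean mask and a weighted dot product (this is for @Boud). City gets weight 2 and each coordinate weight 1, so a row scores at least 2 only if it has a city or both coordinates: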

subset = ['city', 'latitude', 'longitude']
df[df[subset].notnull().dot([2, 1, 1]).ge(2)]

  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
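
To see why the weights work, here is a sketch that prints the intermediate score per row. The data is the example frame rebuilt by hand (only the three relevant columns), and the astype(int) step is an addition for clarity, not part of the one-liner above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
})
subset = ["city", "latitude", "longitude"]

# 1 where a value is present, 0 where it is null.
present = df[subset].notna().astype(int)

# Weighted sum per row: city counts 2, each coordinate counts 1.
score = present.dot([2, 1, 1])
print(score.tolist())

# A score of 2 or more means: city present, or both coordinates present.
result = df[score.ge(2)]
print(result.index.tolist())
```

Row 4 has only a longitude, so it scores 1 and is dropped; row 3 has both coordinates and scores exactly 2.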

0

You can select the rows with a boolean mask built from the element-wise operators ~ (not), | (or) and & (and).

## create example data
df = pd.DataFrame({'City': ['Gothenburg', None, None], 'Long': [None, 1, 1], 'Lat': [1, None, 1]})

## bitwise/logical operators
~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())
0     True
1    False
2     True
dtype: bool

## subset using above statement
df[~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())]
         City  Lat  Long
0  Gothenburg  1.0   NaN
2        None  1.0   1.0
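
The same selection can be sketched with notna() and named masks, which avoids the repeated ~...isnull() negations. The variable names has_city and has_coords are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Gothenburg", None, None],
    "Long": [None, 1, 1],
    "Lat": [1, None, 1],
})

# Name each condition so the final rule reads like the requirement.
has_city = df["City"].notna()
has_coords = df["Lat"].notna() & df["Long"].notna()

# Keep a row if it has a city, or if it has both coordinates.
kept = df.loc[has_city | has_coords]
print(kept.index.tolist())
```

Parentheses matter when the conditions are written inline, because & binds more tightly than |.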
