Delete row based on nulls in certain columns (pandas)

Question

I know how to drop a row from a DataFrame containing all nulls OR a single null but can you drop a row based on the nulls for a specified set of columns?

For example, say I am working with data containing geographical info (city, latitude, and longitude) in addition to numerous other fields. I want to keep the rows that at a minimum contain a value for city OR for lat and long but drop rows that have null values for all three.

I am having trouble finding functionality for this in pandas documentation. Any guidance would be appreciated.

mate, it's in the documentation. Check the help for the dropna function
– Gene Burinsky, Commented Feb 8, 2017 at 23:10
@GeneBurinsky, no, dropna() will work incorrectly in this case. Check the row with index 4 in my example. df.dropna(subset=['city','latitude','longitude'], how='all') will not drop it...
– MaxU - stand with Ukraine, Commented Feb 8, 2017 at 23:12
@MaxU, that is a fair point. However, at least for your example, this will work: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2). But in general, you're right, explicit logical statements for what is desired are superior to the dropna solution
– Gene Burinsky, Commented Feb 8, 2017 at 23:36
@GeneBurinsky, wow! I completely missed this parameter... Could you please write it as an answer?
– MaxU - stand with Ukraine, Commented Feb 8, 2017 at 23:38

Gene Burinsky · Accepted Answer · 2023-02-16 21:20:57Z

8

You can use DataFrame.dropna, but instead of combining how='all' with subset, combine subset with the thresh parameter. thresh sets the minimum number of non-NA values a row must have (within the subset columns) in order to be kept. In MaxU's example data for the city, long/lat case, thresh=2 works because every row we want to keep has at least two of the three values, while the rows we want to drop have at most one. Using the great example data set up by MaxU, we would do

## get MaxU's example data via copy/paste (i.e. read_clipboard)
import pandas as pd

df = pd.read_clipboard()

## remove undesired rows: keep rows with at least 2 non-null values in these columns
df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)

This yields:

In [5]: df.dropna(axis=0, subset=['city', 'longitude', 'latitude'], thresh=2)
Out[5]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
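
As noted in the comments, thresh=2 happens to fit the example data but is not the same rule as "city OR (latitude AND longitude)". A minimal sketch of the difference, using made-up rows that are not part of the original example: a row that has only a city is dropped by thresh=2, whereas the explicit condition keeps it.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: index 0 has only a city, index 1 has nothing,
# index 2 has both coordinates but no city.
df = pd.DataFrame({
    "city": ["aaa", np.nan, np.nan],
    "latitude": [np.nan, np.nan, 11.1111],
    "longitude": [np.nan, np.nan, 33.333],
})

cols = ["city", "latitude", "longitude"]

# thresh=2 keeps rows with at least two non-null values in `cols`,
# so the city-only row (index 0) is dropped along with the empty row.
by_thresh = df.dropna(subset=cols, thresh=2)

# Explicit rule from the question: city OR (latitude AND longitude).
by_rule = df[df["city"].notna() | (df["latitude"].notna() & df["longitude"].notna())]

print(by_thresh.index.tolist())  # [2]
print(by_rule.index.tolist())    # [0, 2]
```

So if city-only rows can occur in your data, the explicit boolean condition is the safer choice.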

edited Feb 16, 2023 at 21:20

answered Feb 8, 2017 at 23:58

Gene Burinsky


1 Comment

gesingle Over a year ago

Thanks! Clear and concise solution.

piRSquared · Accepted Answer · 2017-02-09 02:27:04Z

4

dropna has a parameter to apply the tests only on a subset of columns:

df.dropna(axis=0, how='all', subset=['city', 'latitude', 'longitude'])

edited Feb 9, 2017 at 2:27

piRSquared


answered Feb 8, 2017 at 22:55

Zeugma


1 Comment

Gene Burinsky Over a year ago

Note that, as MaxU mentioned in the comments, this wouldn't quite work on the example test set.

MaxU - stand with Ukraine · Accepted Answer · 2017-02-08 23:15:54Z

3

Try this:

In [25]: df
Out[25]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
2  NaN       NaN        NaN  3  4
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1

In [26]: df.query("city == city or (latitude == latitude and longitude == longitude)")
Out[26]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2

If I understand the OP correctly, the row with index 4 must be dropped because only one of its two coordinates is non-null. So dropna() with how='all' won't work "properly" in this case:

In [62]: df.dropna(subset=['city','latitude','longitude'], how='all')
Out[62]:
  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
4  NaN       NaN    44.4440  1  1   # this row should be dropped...
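
The query() expression above relies on the fact that NaN never compares equal to itself, so `col == col` is False exactly where the value is missing. Here is a small sketch that rebuilds the example frame by hand and checks this. It assumes the missing values are ordinary NaN; nullable extension dtypes that use pd.NA may behave differently.

```python
import numpy as np
import pandas as pd

# The five-row example frame from this answer, built explicitly.
df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

# Self-comparison is False for NaN, so it matches notna().
self_equal = df["latitude"] == df["latitude"]
print(self_equal.tolist())  # [True, False, False, True, False]
print(self_equal.equals(df["latitude"].notna()))  # True

# The same trick inside query(): keep city OR (latitude AND longitude).
kept = df.query("city == city or (latitude == latitude and longitude == longitude)")
print(kept.index.tolist())  # [0, 1, 3]
```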

edited Feb 8, 2017 at 23:15

answered Feb 8, 2017 at 22:55

MaxU - stand with Ukraine


1 Comment

gesingle Over a year ago

That's correct, index 4 would need to be dropped. This seems to be what I was looking for. I wasn't aware you could use the booleans in this way for query(). Thanks!

piRSquared · Accepted Answer · 2017-02-09 02:18:05Z

1

Using a boolean mask and some clever dot product (this is for @Boud)

subset = ['city', 'latitude', 'longitude']
df[df[subset].notnull().dot([2, 1, 1]).ge(2)]

  city  latitude  longitude  a  b
0  aaa   11.1111        NaN  1  2
1  bbb       NaN    22.2222  5  6
3  NaN   11.1111    33.3330  1  2
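
To make the weights less magical, here is a sketch of the intermediate score, again on a hand-built copy of the example frame. The names `weights` and `score` are only for illustration. A city is worth 2 points and each coordinate 1 point, so a row reaches 2 either with a city alone or with both coordinates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["aaa", "bbb", np.nan, np.nan, np.nan],
    "latitude": [11.1111, np.nan, np.nan, 11.1111, np.nan],
    "longitude": [np.nan, 22.2222, np.nan, 33.333, 44.444],
    "a": [1, 5, 3, 1, 1],
    "b": [2, 6, 4, 2, 1],
})

subset = ["city", "latitude", "longitude"]
weights = [2, 1, 1]  # city counts double; each coordinate counts once

# Boolean "is present" matrix times the weights gives one score per row.
score = df[subset].notna().dot(weights)
print(score.tolist())  # [3, 3, 0, 2, 1]

# Rows scoring at least 2 satisfy "city OR (latitude AND longitude)".
kept = df[score.ge(2)]
print(kept.index.tolist())  # [0, 1, 3]
```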

answered Feb 9, 2017 at 2:18

piRSquared


Gene Burinsky · Accepted Answer · 2017-02-09 00:02:43Z

0

You can perform selection by exploiting the bitwise operators.

## create example data
import pandas as pd

df = pd.DataFrame({'City': ['Gothenburg', None, None], 'Lat': [1, None, 1], 'Long': [None, 1, 1]})

## bitwise/logical operators
~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())
0     True
1    False
2     True
dtype: bool

## subset using above statement
df[~df.City.isnull() | (~df.Lat.isnull() & ~df.Long.isnull())]
         City  Lat  Long
0  Gothenburg  1.0   NaN
2        None  1.0   1.0
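
If you need this kind of rule in more than one place, the same mask can be built from groups of columns. The helper below is a hypothetical sketch (`keep_if_any_group_complete` is not a pandas function): a row is kept when at least one group has no nulls.

```python
import pandas as pd


def keep_if_any_group_complete(df, groups):
    """Keep rows where at least one group of columns is entirely non-null."""
    mask = pd.Series(False, index=df.index)
    for cols in groups:
        # True where every column in this group has a value.
        mask |= df[cols].notna().all(axis=1)
    return df[mask]


df = pd.DataFrame({
    "City": ["Gothenburg", None, None],
    "Lat": [1, None, 1],
    "Long": [None, 1, 1],
})

# Keep rows with a City, or with both Lat and Long.
result = keep_if_any_group_complete(df, [["City"], ["Lat", "Long"]])
print(result.index.tolist())  # [0, 2]
```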

edited Feb 9, 2017 at 0:02

Gene Burinsky


answered Feb 8, 2017 at 23:02

Jimmy C
