
I have a large dataframe in pandas that apart from the column used as index is supposed to have only numeric values:

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

How can I find the row of the dataframe df that has a non-numeric value in it?

In this example it's the fourth row in the dataframe, which has the string 'bad' in the a column. How can this row be found programmatically?



You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):

In [11]: df.applymap(np.isreal)
Out[11]:
          a     b
item
a      True  True
b      True  True
c      True  True
d     False  True
e      True  True

If all in the row are True then they are all numeric:

In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a        True
b        True
c        True
d       False
e        True
dtype: bool

So to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one non-numeric value):

In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
        a    b
item
d     bad  0.4

To find the location of the first offender, you could use argmin (note: in recent pandas versions, np.argmin on a Series returns the integer position, while Series.idxmin returns the label):

In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'

As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):

df.applymap(lambda x: isinstance(x, (int, float)))
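One caveat with isinstance(x, (int, float)): numpy integer scalars such as np.int64 are not subclasses of Python int, so they would be flagged as non-numeric. A sketch using the numbers abstract base class instead, which numpy scalar types register with (the hasattr guard is there because applymap is deprecated in favour of DataFrame.map in pandas >= 2.1):

```python
import numbers

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# DataFrame.map was added in pandas 2.1 as the replacement for applymap
map_fn = df.map if hasattr(df, 'map') else df.applymap

# numbers.Number also covers numpy scalars, complex, Decimal, etc.
mask = map_fn(lambda x: isinstance(x, numbers.Number))
print(df[~mask.all(axis=1)])  # the rows with at least one non-numeric value
```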

8 Comments

Great! Note that df.applymap(lambda x: isinstance(x, (int, float))) will be about 18% faster. Using a lambda with a built-in type check is sometimes reasonably fast.
@CTZhu Good point. I guess that's not so surprising (since np.isreal does some other stuff too, e.g. dealing with arrays, not just single values).
It's worth noting that np.isreal(None) is not True
How would you only check one column for values that aren't numbers?
@Superdooperhero use .apply on the column rather than .applymap on the DataFrame, i.e. df["a"].apply(np.isreal) and df[~df["a"].apply(np.isreal)]

There are already some great answers to this question; however, here is a nice snippet that I use regularly to drop rows that have non-numeric values in some columns:

# Eliminate invalid data from dataframe (see Example below for more context)

num_df = (df.drop(data_columns, axis=1)
         .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

num_df = num_df[num_df[data_columns].notnull().all(axis=1)]

The way this works: we first drop all the data_columns from df, then use a join to put them back in after passing them through pd.to_numeric (with errors='coerce', so that all non-numeric entries are converted to NaN). The result is saved to num_df.

On the second line we use a filter that keeps only rows where all values are not null.

Note that pd.to_numeric coerces to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example, '1.25' will be recognized as the numeric value 1.25.

Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0

Example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
   ...:                    "a": [1,2,3,"bad",5],
   ...:                    "b":[0.1,0.2,0.3,0.4,0.5]})

In [3]: df
Out[3]: 
     a    b item
0    1  0.1    a
1    2  0.2    b
2    3  0.3    c
3  bad  0.4    d
4    5  0.5    e

In [4]: data_columns = ['a', 'b']

In [5]: num_df = (df
   ...:           .drop(data_columns, axis=1)
   ...:           .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

In [6]: num_df
Out[6]: 
  item   a    b
0    a   1  0.1
1    b   2  0.2
2    c   3  0.3
3    d NaN  0.4
4    e   5  0.5

In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]: 
  item  a    b
0    a  1  0.1
1    b  2  0.2
2    c  3  0.3
4    e  5  0.5
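If you prefer to flag the offending rows instead of dropping them, the same coerce idea can be inverted (a sketch using the df and data_columns from the example above; note that rows with genuinely missing values would be flagged as well):

```python
import pandas as pd

df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
                   "a": [1, 2, 3, "bad", 5],
                   "b": [0.1, 0.2, 0.3, 0.4, 0.5]})
data_columns = ["a", "b"]

# Rows where coercion produced at least one NaN are the non-numeric ones
bad_rows = df[df[data_columns].apply(pd.to_numeric, errors="coerce").isna().any(axis=1)]
print(bad_rows)
```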

3 Comments

This is what I use; for some reason the top answer doesn't always work for me, and convert_objects is deprecated.
I do something similar, but like this: float_cols = ['a', 'b', 'c'], then for col in float_cols: df[col] = df[col].apply(pd.to_numeric, errors='coerce').fillna(0). This is particular to cases where I want the NaNs to be zero, not NaN. You could also leave out the fillna(0) bit.
Does this drop/join logic scale to millions of rows?
# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

Convert to numeric using errors='coerce', which replaces bad values with NaN:

a = pd.to_numeric(df.a, errors='coerce')

Use isna to return a boolean index:

idx = a.isna()

Apply that index to the data frame:

df[idx]

Output: the row with the bad data in it:

        a    b
item          
d     bad  0.4
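The three steps above can also be combined into a single expression (a sketch, using the same df as in the original code):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Coerce column 'a', test for NaN, and index the frame in one go
bad = df[pd.to_numeric(df['a'], errors='coerce').isna()]
print(bad)
```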



Sorry about the confusion; this should be the correct approach. Do you want to capture only 'bad', not things like 'good', or just any non-numeric values?

In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
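convert_objects has since been removed from pandas; a sketch of an equivalent using pd.to_numeric, which should produce the same positional result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# Coerce every column; non-convertible entries become NaN
coerced = df.apply(pd.to_numeric, errors='coerce')
print(np.where(coerced.isna().any(axis=1)))  # positions of the bad rows
```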

4 Comments

This will pick up 'bad', but won't pick up a rogue string number, though, because the conversion will succeed.
Or you can get into problems with coded strings representing valid numbers, like 'over_the_chart' standing for the maximum allowed value of a variable. Probably not a safe approach; I would rather tackle the problem at the point where the dataframe is built in the real-world case.
This depends whether you consider "4" bad or not, my gut feeling would be yes :S
convert_objects is deprecated

Just to give an idea: convert the column to string, since strings are easier to work with. Note, however, that this does not work with strings containing digits, like 'bad123', and that ~ takes the complement of the selection.

df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)

You can use '|'.join([str(i) for i in range(10)]) to generate '0|1|...|8|9'.

Or use the np.isreal() function, as in the top-voted answer:

df[~df['a'].apply(lambda x: np.isreal(x))]
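A string-based check that does handle mixed cases like 'bad123', and also accepts numeric strings like '1.25', is to simply try float() on each value (a sketch; is_number is a hypothetical helper, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

def is_number(x):
    """Return True if x can be interpreted as a float."""
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

print(df[~df['a'].apply(is_number)])  # rows whose 'a' value is not numeric
```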

2 Comments

Please explain your solution
@BookOfZeus Just convert it to a string and work with it; convert it back afterward.

In case you are working with a column of string values, you can use the very useful Series.str.isnumeric() method, for example:

a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])

What I do is copy that column to a new column, apply str.replace('.', '') and str.replace(',', ''), and then select the numeric values:

a = a.str.replace('.', '', regex=False)
a = a.str.replace(',', '', regex=False)
a.str.isnumeric()

Out[15]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool

Good luck all!
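If you would rather accept a decimal separator without stripping it first, an alternative is a small regular expression with Series.str.match (a sketch; the pattern below accepts either '.' or ',' as the separator, which is an assumption about the data, not part of the original answer):

```python
import pandas as pd

a = pd.Series(['hi', 'hola', '2.31', '288', '312', '1312', '0,21', '0.23'])

# Digits, optionally followed by one '.' or ',' separator and more digits
numeric_mask = a.str.match(r'^\d+([.,]\d+)?$')
print(numeric_mask)
```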

4 Comments

Why is 2.31 not numeric?
I think it is because it has a dot. What I do is copy that column to a new column, apply str.replace('.', ''), and then select the numeric values. I have recently faced this problem.
@toto_tico It is because of the dot. Per pandas.pydata.org/pandas-docs/stable/generated/… it "checks whether all characters in each string in the Series/Index are numeric".
@kekert, thanks, I forgot that, and there is actually no isreal (or similar) in str... @fra44, I think you should add the str.replace clarification directly in the answer, because it can be misleading as it is...

Did you convert your data using .astype()?

All the great answers above should solve 99% of cases, but if you are still in trouble, please also check whether you converted your data type.

Sometimes I force the data to type float16 to save memory. Using:

df[col] = df[col].astype(np.float16)

But this might silently break your code. So if you did any kind of data type transformation, double check for overflows. Disable the conversion and try again.
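As a quick illustration of the overflow risk (a sketch; float16 can only represent values up to about 65504, a documented property of the IEEE half-precision format):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 70000.0])

# 70000 exceeds the float16 maximum (~65504),
# so it silently becomes inf after the cast
s16 = s.astype(np.float16)
print(s16)
```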

It worked for me!

