
I checked this post: "Finding non-numeric rows in dataframe in pandas", but it doesn't really answer my question.

my sample data:

import pandas as pd


d = {
 'unit': ['UD', 'UD', 'UD', 'UD', 'UD','UD'],
 'N-D': [ 'Q1', 'Q2', 'Q3', 'Q4','Q5','Q6'],
 'num' : [ -1.48, 1.7, -6.18, 0.25, 'sum(d)', 0.25]

}
df = pd.DataFrame(d)

it looks like this:

  N-D     num unit
0  Q1   -1.48   UD
1  Q2     1.7   UD
2  Q3   -6.18   UD
3  Q4    0.25   UD
4  Q5  sum(d)   UD
5  Q6    0.25   UD

I want to keep only the rows whose value in column 'num' is NON-NUMERIC, returning all of the columns for those rows.

desired output:

  N-D   num   unit
4  Q5   sum(d) UD

my attempts:

import numpy as np

# didn't work: it pulled out everything, and besides, I want the
# condition to check only column 'num'
nonnumeric = df[~df.applymap(np.isreal).all(1)]

# didn't work: it pulled out the right rows, but only column 'num'
nonnumeric = df['num'][~df.applymap(np.isreal).all(1)]
A comment suggested:

non_num_rows = list(df[pd.to_numeric(df['num'], errors='coerce').isnull()].index)
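The comment above can be expanded into a runnable sketch (isna() is the newer pandas spelling of isnull()):

```python
import pandas as pd

df = pd.DataFrame({
    'unit': ['UD'] * 6,
    'N-D': ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6'],
    'num': [-1.48, 1.7, -6.18, 0.25, 'sum(d)', 0.25],
})

# Coerce 'num' to numeric; anything unparseable becomes NaN,
# so isna() marks exactly the non-numeric rows.
mask = pd.to_numeric(df['num'], errors='coerce').isna()
non_num_rows = list(df[mask].index)
print(non_num_rows)  # [4]
```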

5 Answers


Use boolean indexing with a mask created by to_numeric + isnull.
Note: this solution does not find or filter numbers saved as strings, like '1' or '22'.

print (pd.to_numeric(df['num'], errors='coerce'))
0   -1.48
1    1.70
2   -6.18
3    0.25
4     NaN
5    0.25
Name: num, dtype: float64

print (pd.to_numeric(df['num'], errors='coerce').isnull())
0    False
1    False
2    False
3    False
4     True
5    False
Name: num, dtype: bool

print (df[pd.to_numeric(df['num'], errors='coerce').isnull()])
  N-D     num unit
4  Q5  sum(d)   UD

Another solution with isinstance and apply:

print (df[df['num'].apply(lambda x: isinstance(x, str))])
  N-D     num unit
4  Q5  sum(d)   UD
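To see the caveat from the note above, compare the two solutions on a column where some numbers are stored as strings (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({'num': [-1.48, '1', 'sum(d)']})

# to_numeric parses the string '1', so only 'sum(d)' is flagged.
by_coerce = df[pd.to_numeric(df['num'], errors='coerce').isnull()]
print(by_coerce['num'].tolist())   # ['sum(d)']

# isinstance flags every str, including the numeric-looking '1'.
by_type = df[df['num'].apply(lambda x: isinstance(x, str))]
print(by_type['num'].tolist())     # ['1', 'sum(d)']
```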



Old topic, but if the numbers have been converted to str, type(x) == str no longer distinguishes them, because every value is a string.

Instead, it's better to use isnumeric() or isdigit():

df = df[df['num'].apply(lambda x: not x.isnumeric())]
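For reference, str.isnumeric accepts only unsigned, digit-only strings, which is worth knowing before relying on it (a quick check):

```python
# str.isnumeric is True only for unsigned, digit-only strings,
# so decimals and negative numbers are not treated as numeric.
checks = {s: s.isnumeric() for s in ['25', '4.5', '-1', 'sum(d)']}
print(checks)  # {'25': True, '4.5': False, '-1': False, 'sum(d)': False}
```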

I tested all three approaches on my own dataframe with 200k+ rows, assuming numbers have been converted to 'str' by pd.read_csv().

def f1():
    df[pd.to_numeric(df['num'], errors='coerce').isnull()]

def f2():
    df[~df.num.str.match(r'^-?(\d*\.?\d+|\d+\.?\d*)$')]

def f3():
    df[df['num'].apply(lambda x: not x.isnumeric())]

I got the following execution times by running each function 10 times:

timeit.timeit(f1, number=10)
1.04128568888882

timeit.timeit(f2, number=10)
1.959099448888992

timeit.timeit(f3, number=10)
0.48741375999998127

Conclusion: the fastest method is isnumeric(); the slowest is the regular-expression method.

=========================================

Edit: As @set92 commented, isnumeric() works for integers only. So the fastest applicable function is pd.to_numeric(), a universal solution that works for any type of numeric value.

It is possible to define an isfloat() function in Python, but it will be slower than the built-in functions, especially for big DataFrames.

import timeit
import pandas as pd

tmp = ['4.0', '4', '4.5', '1', 'test'] * 200000
df = pd.DataFrame(data=tmp, columns=['num'])

def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def f1():
    df[pd.to_numeric(df['num'], errors='coerce').isnull()]

def f2():
    df[df['num'].apply(lambda x: not isfloat(x))]

def f3():
    df[~df.num.str.match(r'^-?(\d*\.?\d+|\d+\.?\d*)$')]

print('to_numeric:', timeit.timeit(f1, number=10))
print('isfloat:', timeit.timeit(f2, number=10))
print('regular exp:', timeit.timeit(f3, number=10))

Results:

to_numeric: 8.303612694763615
isfloat: 9.972200270603594
regular exp: 11.420604273894583

5 Comments

It shouldn't work: df['num'] is float, and isnumeric() and isdigit() only work for integers. The only solution I found is to define a function:

def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

and with it change your solution to df['num'].loc[df['num'].apply(lambda x: not isfloat(x))]
But pd.to_numeric() works without any problem; it's a built-in function and fast. No need to define one!
With pd.to_numeric I got an error on the strings (ValueError: Unable to parse string "Unrated" at position 32). If you want to try it, the dataset was the ramen ratings one on Kaggle. The problem was that the Stars feature has float numbers and 3 cells with "Unrated", so I couldn't find a one-liner to count how many rows were not numbers.
You should use pd.to_numeric(..., errors='coerce') to skip those errors.
Only df[pd.to_numeric(df['num'], errors='coerce').isnull()] handles floats as well.

I used

df = df[df['num'].apply(lambda x: type(x) == str)]

and now df is

  N-D     num unit
4  Q5  sum(d)   UD



Assuming these are strings, you can filter based on a regular expression match of a floating point number.

df[~df.num.str.match(r'^-?(\d*\.?\d+|\d+\.?\d*)$')]

  N-D     num unit
4  Q5  sum(d)   UD
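A runnable sketch of this, with everything already stored as strings (since .str methods return NaN for non-string elements) and the pattern written as a raw string:

```python
import pandas as pd

# Assumes the column values are all strings.
df = pd.DataFrame({'num': ['-1.48', '1.70', 'sum(d)', '0.25']})

# Keep rows whose 'num' does NOT look like an optionally signed float.
pattern = r'^-?(\d*\.?\d+|\d+\.?\d*)$'
non_numeric = df[~df['num'].str.match(pattern)]
print(non_numeric['num'].tolist())  # ['sum(d)']
```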



There are many ways to detect non-numeric values in a pandas DataFrame column; here is one.

df[~df['num'].map(lambda x:x.isnumeric())]
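Note this assumes every value in 'num' is already a string of digits (e.g. after read_csv); on the question's mixed column the float entries would raise AttributeError. A sketch under that assumption:

```python
import pandas as pd

# All-string column of whole numbers plus one stray expression.
df = pd.DataFrame({'num': ['12', '7', 'sum(d)', '3']})

# isnumeric() is False for anything that is not digit-only.
flagged = df[~df['num'].map(lambda x: x.isnumeric())]
print(flagged['num'].tolist())  # ['sum(d)']
```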

