Ignoring non-numerical string values in pandas dataframe

Question

I have a DataFrame in which a column might have three kinds of values, integers (12331), integers as strings ('345') or some other string ('text').

Is there a way to drop all rows with the last kind of string from the dataframe, and convert the first kind of string into integers? Or at least some way to ignore the rows that cause type errors if I'm summing the column.

This dataframe is from reading a pretty big CSV file (25 GB), so I'd like some solution that would work when reading in chunks.

Marius · Accepted Answer · 2016-04-18 04:42:26Z

18

Pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric converts mixed columns like yours, and with errors='coerce' it turns non-numeric strings into NaN. This means you'll get a float column, not an integer one, since NumPy's integer dtypes cannot hold NaN. That usually doesn't matter too much, but it's good to be aware of.

import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]: 
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64

If you want to then drop all the NaN rows:

# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')

# Drop NA values, listing the converted columns explicitly
#   so NA values in other columns aren't dropped
df.dropna(subset=['mixed_types'])
Out[11]: 
   mixed_types
0      12331.0
1        345.0
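Since the question asks for integers, the float column can be cast back once the NaN rows are gone. A minimal sketch, reusing the example frame from this answer (the name `cleaned` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

# Non-numeric strings become NaN, which forces a float64 column
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')

# With the NaN rows removed, the column can be cast back to integers
cleaned = df.dropna(subset=['mixed_types']).copy()
cleaned['mixed_types'] = cleaned['mixed_types'].astype('int64')
```

The cast only works after the NaN rows are dropped; calling astype('int64') on a column that still contains NaN raises an error.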

edited Apr 18, 2016 at 4:42

answered Apr 18, 2016 at 4:28

Marius


3 Comments

devil0150 Over a year ago

Since the NaN is created after reading, would these values be dropped if I set na_values = 'NaN' and call dropna?

Marius Over a year ago

@devil0150 Yeah, doing dropna() once you've converted isn't too difficult, see my edit

Marius Over a year ago

@xtian you probably need to strip the '$' off before this will work. If you're struggling, ask a new question on SO giving examples of your data and what you've tried.

Anton Protopopov · Accepted Answer · 2016-04-18 04:28:37Z

5

You could use pd.to_numeric with errors='coerce' to replace your non-numeric values with NaN, and apply it to each column. Then you could use dropna or fillna, whichever you prefer.

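import pandas as pd

df = pd.read_csv('file.csv')
# Coerce every column; values that cannot be parsed become NaN
df = df.apply(pd.to_numeric, errors='coerce')
# Drop any row that contains a NaN in any column
df = df.dropna()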
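The question mentions a 25 GB file read in chunks, and the same conversion can be applied chunk by chunk. The following is only a sketch under assumptions: the column name `mixed_types`, the tiny in-memory CSV, and the chunk size of 2 are stand-ins chosen for illustration, not details from the question.

```python
import io
import pandas as pd

# Stand-in for the large CSV; in practice pass the file path instead
csv_data = io.StringIO("mixed_types\n12331\n345\ntext\n7\n")

total = 0
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Each chunk is converted on its own, so dtype differences
    # between chunks do not matter
    values = pd.to_numeric(chunk['mixed_types'], errors='coerce')
    # Drop the rows that failed to parse, then add this chunk's sum
    total += int(values.dropna().sum())
```

Accumulating a running total keeps only one chunk in memory at a time. If the cleaned rows themselves are needed, each filtered chunk could instead be appended to an output file.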

answered Apr 18, 2016 at 4:28

Anton Protopopov


PhilChang · Accepted Answer · 2016-04-18 06:22:24Z

1

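You can use df._get_numeric_data() directly. Note that the leading underscore marks it as a private pandas method, so it may change between versions.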

answered Apr 18, 2016 at 6:22

PhilChang


1 Comment

Domagoj Over a year ago

This drops mixed-type columns entirely instead of filtering out the non-numeric rows.
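To illustrate that comment, here is a small sketch using the public select_dtypes method, which selects columns by dtype in a similar way. The second column, `all_numbers`, is a hypothetical addition for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    'mixed_types': [12331, '345', 'text'],  # stored as object dtype
    'all_numbers': [1, 2, 3],               # stored as int64
})

# Selection happens per column, so the whole mixed column disappears
numeric_only = df.select_dtypes(include='number')
```

No rows are removed here; the mixed column is simply left out, including its valid values 12331 and '345'.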
