13

I have a DataFrame in which a column might have three kinds of values, integers (12331), integers as strings ('345') or some other string ('text').

Is there a way to drop all rows with the last kind of string from the dataframe, and convert the first kind of string into integers? Or at least some way to ignore the rows that cause type errors if I'm summing the column.

This dataframe is from reading a pretty big CSV file (25 GB), so I'd like some solution that would work when reading in chunks.

3 Answers

18

Pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric with errors='coerce' converts mixed columns like yours, turning non-numeric strings into NaN. This means you'll get a float column, not an integer one, since pandas' default integer dtype cannot hold NaN. That usually doesn't matter too much, but it's good to be aware of.

import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]: 
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64

If you want to then drop all the NaN rows:

# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')

# Drop NA values, listing the converted columns explicitly
#   so NA values in other columns aren't dropped
df.dropna(subset = ['mixed_types'])
Out[11]: 
   mixed_types
0      12331.0
1        345.0
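Since the question mentions reading a large CSV in chunks, here is a minimal sketch of how the same conversion might be applied per chunk while summing the column. The in-memory CSV, the chunk size, and the variable names are assumptions for illustration only; with a real file you would pass its path to read_csv and pick a much larger chunksize.

```python
import io
import pandas as pd

# Stand-in for the large CSV (assumed data); in practice pass the file path.
csv_data = io.StringIO("mixed_types\n12331\n345\ntext\n7\n")

total = 0
# dtype=str keeps every value as a raw string so each chunk is parsed the same way
for chunk in pd.read_csv(csv_data, chunksize=2, dtype=str):
    # Non-numeric strings become NaN and are then dropped
    values = pd.to_numeric(chunk['mixed_types'], errors='coerce').dropna()
    # No NaN remains, so casting back to int64 is safe for integer-valued data
    total += int(values.astype('int64').sum())

print(total)
```

Each chunk is converted and filtered independently, so only one chunk needs to fit in memory at a time.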

3 Comments

Since the NaN is created after reading, would these values be dropped if I set na_values = 'NaN' and execute dropna?
@devil0150 Yeah, doing dropna() once you've converted isn't too difficult, see my edit
@xtian you probably need to strip the '$' off before this will work. If you're struggling, ask a new question on SO giving examples of your data and what you've tried.
5

You could use pd.to_numeric with errors='coerce' to replace your non-numeric values with NaN, and apply it to each column. Then you could use dropna or fillna, whichever you prefer.

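df = pd.read_csv('file.csv')
# Coerce every column to numeric; non-numeric values become NaN
df = df.apply(pd.to_numeric, errors='coerce')
# Drop rows that have NaN in any column
df = df.dropna()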
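One caveat with converting every column: a column that is genuinely text would become all NaN, and dropna() would then remove every row. A possible variation, sketched below, converts only a chosen list of columns. The 'label' column and the numeric_cols list are hypothetical names added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'mixed_types': [12331, '345', 'text'],
    'label': ['a', 'b', 'c'],  # genuine text column that should be kept as-is
})

numeric_cols = ['mixed_types']  # hypothetical list of columns to convert

# Convert only the listed columns; other columns are left untouched
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop rows only where a converted column failed to parse
cleaned = df.dropna(subset=numeric_cols)
print(cleaned)
```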


1

You can use df._get_numeric_data() directly. Note that the leading underscore marks it as a private pandas method.

1 Comment

This omits mixed columns entirely from the pandas DataFrame.
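To illustrate the point in the comment above, here is a small sketch using select_dtypes, which is the public pandas method for picking columns by dtype. The 'ints' column is an assumed example. A mixed column has object dtype, so it is left out as a whole rather than cleaned row by row.

```python
import pandas as pd

df = pd.DataFrame({
    'ints': [1, 2, 3],                      # assumed purely numeric column
    'mixed_types': [12331, '345', 'text'],  # object dtype because of the strings
})

# Keep only columns whose dtype is numeric; the mixed column is excluded entirely
numeric_only = df.select_dtypes(include='number')
print(list(numeric_only.columns))
```

If the goal is to keep the numeric rows of a mixed column, converting with pd.to_numeric first is probably the better fit.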
