Removing strings from df column in Python

Question

I'm working in a Python 3 Jupyter notebook.

I'm trying to do some numerical calculations on a column in my dataframe which is made up of dollar amounts. Some of the rows have "$- " instead of numbers. How do I tell Python to ignore those rows so I can look at the valid data?

movie is my dataframe, and revenue is the column I'm looking at:

set(movie['revenue'])

I get this type of output:

{' $-   ',
 '1',
 '10',
 '100',
 '10000',
 '97250400',
 '98000000',
 '99000000'}

I've tried a few ways so far:

movie['revenue'] = pd.to_numeric(movie['revenue'])

movie['revenue'] = movie['revenue'].astype(np.float64)

Nothing seems to work. Please help!

It is a simple list, so why don't you just test all elements of the list, and remove them if you find a dollar symbol?
– Benjamin Barrois, Commented Jan 26, 2018 at 18:54

jpp · Accepted Answer · 2018-01-26 18:59:07Z

2

This is one way.

import pandas as pd

df = pd.DataFrame([[' $-   '], ['1'], ['10'], ['100'],
                   ['10000'], ['97250400'], ['98000000'],
                   ['99000000']], columns=['A'])

df['A'] = df['A'].apply(pd.to_numeric, errors='coerce')

df.dtypes

# A    float64
# dtype: object
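
As a follow-up sketch (not part of the original answer): once the column is coerced, most aggregations skip the NaN rows by default, or the rows can be dropped explicitly. The column name A follows the example above; the names total and valid are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [' $-   ', '1', '10', '100']})

# Strings that cannot be parsed, such as ' $-   ', become NaN.
df['A'] = pd.to_numeric(df['A'], errors='coerce')

# Aggregations such as sum() skip NaN by default (skipna=True).
total = df['A'].sum()

# Alternatively, drop the invalid rows before doing further work.
valid = df.dropna(subset=['A'])
```

Calling pd.to_numeric directly on the Series is equivalent to the apply() form above for this data and converts the column in a single call.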


SciGuyMcQ · Accepted Answer · 2018-01-26 19:03:23Z

0

I see two ways of handling this.

Given:

import pandas as pd
df = pd.DataFrame({'A':['12','$10','22','$99','100']})
df
    A
0   12
1  $10
2   22
3  $99
4  100

1) Coerce the values that pandas.to_numeric(...) can't convert to NaN. This way, most calculations will ignore them.

pd.to_numeric(df.A, errors='coerce')
0     12.0
1      NaN
2     22.0
3      NaN
4    100.0

2) Remove the '$' if present and convert to a number, so you are not losing data.

df.A.apply(lambda i: float(i[1:]) if i[0] == '$' else float(i)) 
0     12.0
1     10.0
2     22.0
3     99.0
4    100.0
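
A possible vectorized variant of option 2, offered as a sketch rather than the answerer's own method: the .str accessor can strip a leading '$' from every element before conversion. The name cleaned is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['12', '$10', '22', '$99', '100']})

# Strip any leading '$' characters from each string, then convert.
# Anything that still cannot be parsed becomes NaN under errors='coerce'.
cleaned = pd.to_numeric(df['A'].str.lstrip('$'), errors='coerce')
```

This avoids the per-element lambda and does not fail on empty strings, which would raise an IndexError on i[0].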


Benjamin Barrois · Accepted Answer · 2018-01-26 19:05:09Z

0

Here is a generic solution to remove the rows whose value contains '$':

# A pandas Series has no .remove() method, and removing items while
# iterating would skip elements, so build a boolean list and filter rows.
keep = ['$' not in elt for elt in movie['revenue']]
movie = movie[keep]


pault · Accepted Answer · 2018-01-26 20:07:05Z

You could also create a mask to ignore those rows:

import pandas as pd
movie = pd.DataFrame(
    {
        'revenue': [' $-   ','1','10','100','10000','97250400','98000000','99000000']
    }
)

print(movie[movie['revenue'].map(str.isdigit)])
#    revenue
#1         1
#2        10
#3       100
#4     10000
#5  97250400
#6  98000000
#7  99000000

str.isdigit() returns True if all the characters in the string are digits.

So movie['revenue'].map(str.isdigit) will return a pandas.Series (mask) of the same length as movie with boolean values indicating if the value is a number or not.

Then movie[movie['revenue'].map(str.isdigit)] returns a new pd.DataFrame with only the rows where the mask is True.
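
To go from the filtered rows to actual numbers, something like the following sketch could be used. It assumes every value in the column is a string; the names mask, valid and mean_revenue are illustrative.

```python
import pandas as pd

movie = pd.DataFrame({'revenue': [' $-   ', '1', '10', '100']})

# True only where the string consists entirely of digits.
mask = movie['revenue'].map(str.isdigit)

# Copy the filtered rows so the dtype change does not touch the original.
valid = movie[mask].copy()
valid['revenue'] = valid['revenue'].astype('int64')

mean_revenue = valid['revenue'].mean()
```

Note that str.isdigit also rejects strings such as '-5' or '1.5', so this mask only suits columns of non-negative integers.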

Update

If you know ahead of time that the bad value will always be a specific string, for instance ' $-   ' (including the surrounding whitespace), you can simply do the following:

movie[movie['revenue'] != ' $-   ']

This is faster because the comparison is vectorized (AFAIK), so you avoid calling apply() or map().

Update 2

Yet another method from the docs:

movie[~movie['revenue'].str.contains(r'\$')]
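
If escaping the dollar sign feels error-prone, one alternative sketch is to pass regex=False so that '$' is matched as a literal character. The names has_dollar and clean are illustrative.

```python
import pandas as pd

movie = pd.DataFrame({'revenue': [' $-   ', '1', '10']})

# regex=False treats '$' as plain text, so no backslash is needed.
has_dollar = movie['revenue'].str.contains('$', regex=False)

# Keep only the rows that do not contain a dollar sign.
clean = movie[~has_dollar]
```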
