1

I'm working in a Python 3 Jupyter notebook.

I'm trying to do some numerical calculations on a column in my DataFrame which is made up of dollar amounts. Some of the rows have "$- " instead of numbers. How do I tell Python to ignore those rows so I can look at the valid data?

movie is my DataFrame and revenue is the column I'm looking at:

set(movie['revenue'])

I get this type of output:

{' $-   ',
 '1',
 '10',
 '100',
 '10000',
 '97250400',
 '98000000',
 '99000000'}

I've tried a few ways so far:

movie['revenue'] = pd.to_numeric(movie['revenue'])

movie['revenue'] = movie['revenue'].astype(np.float64)

Nothing seems to work. Please help!

2
  • It is a simple list, so why don't you just test all elements of the list, and remove them if you find a dollar symbol? Commented Jan 26, 2018 at 18:54
  • How would you do that exactly? if else loop? Commented Jan 26, 2018 at 18:57

4 Answers

2

This is one way.

import pandas as pd

df = pd.DataFrame([[' $-   '], ['1'], ['10'], ['100'],
                   ['10000'], ['97250400'], ['98000000'],
                   ['99000000']], columns=['A'])

# Convert the whole column at once; values that cannot be parsed become NaN
df['A'] = pd.to_numeric(df['A'], errors='coerce')

df.dtypes

# A    float64
# dtype: object
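
As a follow-up sketch of why coercion helps here: pandas reductions skip NaN by default, so the converted column can be used directly. The shortened sample data and the names `total`, `average`, and `valid` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [' $-   ', '1', '10', '100']})

# Unparseable strings such as ' $-   ' become NaN instead of raising an error
df['A'] = pd.to_numeric(df['A'], errors='coerce')

# Reductions such as sum() and mean() skip NaN by default (skipna=True)
total = df['A'].sum()
average = df['A'].mean()

# Drop the NaN rows explicitly if only the valid rows are wanted
valid = df.dropna(subset=['A'])
```

If the placeholder rows should count as zero rather than be ignored, `df['A'].fillna(0)` is an alternative, but that changes the mean.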

0

Two ways I see going about handling this.

Given:

import pandas as pd
df = pd.DataFrame({'A':['12','$10','22','$99','100']})
df
    A
0   12
1  $10
2   22
3  $99
4  100

1) Coerce the values that pandas.to_numeric(...) can't convert to NaN. Most calculations will then ignore them.

pd.to_numeric(df.A, errors='coerce')
0     12.0
1      NaN
2     22.0
3      NaN
4    100.0

2) Remove the '$' if present and convert to a number so you don't lose data.

df.A.apply(lambda i: float(i[1:]) if i[0] == '$' else float(i))
0     12.0
1     10.0
2     22.0
3     99.0
4    100.0
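
The two approaches can also be combined with vectorized string methods. The sketch below assumes the column may contain leading or trailing whitespace, a bare '$-' placeholder like the one in the question, and thousands separators; the comma case and the names `s`, `cleaned`, and `numbers` are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(['12', '$10', ' $-   ', '$1,250', '100'])

# Remove dollar signs and thousands separators as literal characters,
# then trim surrounding whitespace
cleaned = (
    s.str.replace('$', '', regex=False)
     .str.replace(',', '', regex=False)
     .str.strip()
)

# The placeholder is now a bare '-', which is still not a number,
# so coerce it to NaN rather than letting the conversion fail
numbers = pd.to_numeric(cleaned, errors='coerce')
```

This keeps real amounts such as '$10' while still turning the placeholder into NaN.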


0

Here is a generic way to drop the rows whose value contains '$'. A pandas Series has no remove() method, so build a boolean mask and filter with it:

# True for every row whose revenue string has no '$' in it
keep = ['$' not in elt for elt in movie['revenue']]
movie = movie[keep]


0

You could also create a mask to ignore those rows:

import pandas as pd
movie = pd.DataFrame(
    {
        'revenue': [' $-   ','1','10','100','10000','97250400','98000000','99000000']
    }
)

print(movie[movie['revenue'].map(str.isdigit)])
#    revenue
#1         1
#2        10
#3       100
#4     10000
#5  97250400
#6  98000000
#7  99000000

str.isdigit() returns True if all the characters in the string are digits.

So movie['revenue'].map(str.isdigit) will return a pandas.Series (mask) of the same length as movie with boolean values indicating if the value is a number or not.

Then movie[movie['revenue'].map(str.isdigit)] returns a new pd.DataFrame with only the rows where the mask is True.
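
A minimal sketch of using such a mask and then converting the surviving rows, assuming every value in the column is a string. It uses the vectorized `Series.str.isdigit()` accessor in place of `map(str.isdigit)`; the names `mask` and `valid` are illustrative.

```python
import pandas as pd

movie = pd.DataFrame({'revenue': [' $-   ', '1', '10', '100']})

# Boolean mask: True where the string consists only of digits
mask = movie['revenue'].str.isdigit()

# Keep the valid rows; copy() gives an independent frame that is safe to modify
valid = movie[mask].copy()
valid['revenue'] = valid['revenue'].astype('int64')
```

Note that isdigit() also rejects strings such as '12.5' or '-3', so this filter only suits columns of non-negative whole numbers.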

Update

If you know ahead of time that the bad value will always be a specific string, for instance ' $-   ' (including its whitespace), you can simply do the following:

movie[movie['revenue'] != ' $-   ']

This is faster because the comparison is vectorized (AFAIK), so you avoid calling apply() or map().

Update 2

Yet another method from the docs:

movie[~movie['revenue'].str.contains(r'\$')]
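
As a variation on this, `str.contains` can match the dollar sign literally instead of as a regular expression. The sketch below assumes the column may also hold missing values, which is not stated in the question; `has_dollar` and `filtered` are illustrative names.

```python
import pandas as pd

movie = pd.DataFrame({'revenue': [' $-   ', '1', '10', None]})

# regex=False treats '$' as a literal character, so no escaping is needed;
# na=False makes missing values count as "does not contain"
has_dollar = movie['revenue'].str.contains('$', regex=False, na=False)

# Invert the mask to keep only the rows without a dollar sign
filtered = movie[~has_dollar]
```

Without `na=False`, a missing value would produce a missing entry in the mask, and inverting or indexing with that mask can fail.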
