2

I have a file that lists deposit balances as strings. In order to plot these numbers, I'm trying to convert the column from object to float. So I wrote code to remove the $ and to strip the spaces before and after the values.

member_clean.TotalDepositBalances = member_clean.TotalDepositBalances.str.replace('$', '')

member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.strip()

member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].astype(float)

When I run the code, I get an error message that says

ValueError: could not convert string to float:

That's it. Before I added the str.strip, the error message showed me that some values had spaces before and after, so I knew to remove those. But I'm a little confused about what else is causing it.

I looked at the values of the column after I removed the spaces and $, and everything looks normal. Here's a sample.

  1. 309.00
  2. 38.00
  3. 12,486.00
  4. 6,108.00
  5. 2,537.00

Any ideas of what I could check for in the column that may be causing this error?

2 Answers

3

You have to delete the commas, because Python's float() does not recognize them as part of a number. So, taking the list you gave as possible input:

str_num = ['309.00 ', ' 38.00 ', ' 12,486.00 ', '6,108.00', ' 2,537.00']

you have to do this:

list(map(lambda s: float(s.replace(',', '')), str_num))

which gives your list of floats:

[309.0, 38.0, 12486.0, 6108.0, 2537.0]

Note: You don't need to call str.strip(), because float() ignores leading and trailing whitespace when it converts a string.
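As a quick check of the note above, here is a minimal sketch of which strings float() accepts. The helper name try_float is made up for this illustration. It also shows that an empty string is rejected, which is one input that would produce a ValueError with nothing after the colon.

```python
def try_float(text):
    """Return float(text), or None if Python cannot parse the string."""
    try:
        return float(text)
    except ValueError:
        return None


print(try_float(" 38.00 "))    # leading/trailing whitespace is ignored
print(try_float("12,486.00"))  # a thousands separator is rejected
print(try_float("38 .00"))     # whitespace inside the number is rejected
print(try_float(""))           # an empty string is rejected
```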

Following your pipeline, before converting to float, you need to do:

member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.replace(',', '')

Or you can run your entire pipeline on one line of code as follows:

member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
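If astype(float) still raises after the cleanup, one way to find out which entries are responsible is to convert with pd.to_numeric and errors="coerce", then look at the rows that came back as NaN. The sketch below uses made-up sample data (the empty and whitespace-only entries are assumptions, not values from the question), so treat it as a starting point rather than a diagnosis.

```python
import pandas as pd

# Hypothetical sample data; the blank entries stand in for values that
# would make astype(float) fail with an empty error message.
member_clean = pd.DataFrame(
    {"TotalDepositBalances": ["$309.00 ", " $12,486.00", "", "   ", "$6,108.00"]}
)

cleaned = (
    member_clean["TotalDepositBalances"]
    .str.replace("$", "", regex=False)  # literal "$", whatever the regex default is
    .str.replace(",", "", regex=False)
    .str.strip()
)

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_float = pd.to_numeric(cleaned, errors="coerce")

# Show the original values that failed, via repr() so blanks are visible.
bad = member_clean.loc[as_float.isna(), "TotalDepositBalances"]
print(bad.map(repr))
```

Once the offending values are known, you can decide whether to drop them, fill them, or keep them as NaN before plotting.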

Extra: Performance

Here you will find tests comparing various methods for performing multiple substitutions in a string. Surprisingly, chaining replace calls (as in your pipeline) turns out to be more efficient than a regex for this type of operation. Give it a read.
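If you want to measure this on your own machine, a rough sketch with the standard timeit module could look like the following. The sample data and function names are made up for the illustration, and the timings will vary with Python version and input, so no particular result is claimed here.

```python
import re
import timeit

# Made-up sample data, repeated to give the timer something to measure.
samples = ["$309.00", "$12,486.00", "$6,108.00"] * 1000

pattern = re.compile(r"[$,]")


def with_replace(values):
    """Remove '$' and ',' with two chained str.replace calls."""
    return [v.replace("$", "").replace(",", "") for v in values]


def with_regex(values):
    """Remove '$' and ',' with a single precompiled regex substitution."""
    return [pattern.sub("", v) for v in values]


# Both approaches must agree before timing them is meaningful.
assert with_replace(samples) == with_regex(samples)

print("chained replace:", timeit.timeit(lambda: with_replace(samples), number=50))
print("regex sub:      ", timeit.timeit(lambda: with_regex(samples), number=50))
```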

1 Comment

Thank you. So I messed up when copying the output. I've actually removed the commas already and it's still giving the error message. I am going to take your suggestion and run it all on one line of code. Thank you for posting that and sharing the link!
0

A useful method for working with large datasets or series is to create a lookup dictionary of corrected values so that duplicate values aren't re-calculated:

import re

import numpy as np
import pandas as pd

def fast_num_conversion(s):
    """
    This is an extremely fast approach to parsing messy numbers to floats.
    For large data, the same values are often repeated. Rather than
    re-parse these, we store all unique values, parse them, and
    use a lookup to convert all figures.
    (Should be 10X faster than without lookup dict)

       Note, input must be a pandas series.
    """
    # Strip '$', '-', ',', '|' and spaces before parsing.
    f_convert = lambda x: re.sub(r'[$\-,| ]', '', x)
    # Empty strings become NaN instead of raising ValueError.
    f_float = lambda x: float(x) if x != '' else np.nan
    # Parse each distinct value once, then map the results back.
    vals = {curr: f_float(f_convert(curr)) for curr in s.unique()}
    return s.map(vals)

str_num = ['309.00', '38 .00 ', '12, 486.00', '6,108.00', '2,537.00']

print(fast_num_conversion(pd.Series(str_num)))
0      309.0
1       38.0
2    12486.0
3     6108.0
4     2537.0
