4

I have a column Column1 in a pandas dataframe which is of type str, values which are in the following form:

import pandas as pd
df = pd.read_table("filename.dat")
type(df["Column1"].ix[0])   #outputs 'str'
print(df["Column1"].ix[0])

which outputs '1/350'. So, this is currently a string. I would like to convert it into a float.

I tried this:

df["Column1"] = df["Column1"].astype('float64', raise_on_error = False)

But this didn't change the values into floats.

This also failed:

df["Column1"] = df["Column1"].convert_objects(convert_numeric=True)

And this failed:

df["Column1"] = df["Column1"].apply(pd.to_numeric, args=('coerce',))

How do I convert all the values of column "Column1" into floats? Could I somehow use regex to remove the parentheses?

EDIT:

The line

df["Meth"] = df["Meth"].apply(eval)

works, but only if I use it twice, i.e.

df["Meth"] = df["Meth"].apply(eval)
df["Meth"] = df["Meth"].apply(eval)

Why would this be?

3 Answers 3

4

You need to evaluate the expression (e.g. '1/350') in order to get the result, for which you can use Python's eval() function.

By wrapping Panda's apply() function around it, you can then execute the eval() function on every value in your column. Example:

df["Column1"].apply(eval)

As you're interpreting literals, you can also use the ast.literal_eval function as noted in the docs. Update: This won't work, as the use of literal_eval() is still restricted to additions and subtractions (source).

Remark: as mentioned in other answers and comments on this question, the use of eval() is not without risks, as you're basically executing whatever input is passed in. In other words, if your input contains malicious code, you're giving it a free pass.

Alternative option:

# Define a custom div function
def div(a,b):
    return int(a)/int(b)

# Split each string and pass the values to div
df_floats = df['col1'].apply(lambda x: div(*x.split('/')))

Second alternative in case of unclean data:

By using regular expressions, we can remove any non-digits appearing resp. before the numerator and after the denominator.

# Define a custom div function (unchanged)
def div(a,b):
    return int(a)/int(b)

# We'll import the re module and define a precompiled pattern
import re
regex = re.compile('\D*(\d+)/(\d+)\D*')

df_floats = df['col1'].apply(lambda x: div(*regex.findall(x)[0]))

We'll lose a bit of performance, but the upside is that even with input like '!erefdfs?^dfsdf1/350dqsd qsd qs d', we still end up with the value of 1/350.

Performance:

When timing both options on a dataframe with 100.000 rows, the second option (using the user defined div function) clearly wins:

  • using eval: 1 loop, best of 3: 1.41 s per loop
  • using div: 10 loops, best of 3: 159 ms per loop
  • using re: 1 loop, best of 3: 275 ms per loop
Sign up to request clarification or add additional context in comments.

6 Comments

"this is currently a string. I would like to convert it into a float"
No problem :) Thank you for the help!
+1 for the benchmark there, I was surprised at first that eval was slower, but maybe it makes sense. eval needs to account for whatever code might have been passed in, while your alternative explicitly works on the /
The error I get for the above div function (which I like) is alueError: invalid literal for int() with base 10: "'9" I'm not sure how to fix this
Looks like there's some errors in the formatting of the data then, with an extra single quote. That may mean there are other issues, but in this case, in this answer, change it to return int(a.strip("'")/int(b.strip("'")) - that is a double-quote, single-quote, double-quote. Should fix it.
|
3

I hate advocating for the use of eval. I didn't want to spend time on this answer but I was compelled because I don't want you to use eval.

So I wrote this function that works on a pd.Series

def do_math_in_string(s):
    op_map = {'/': '__div__', '*': '__mul__', '+': '__add__', '-': '__sub__'}
    df = s.str.extract(r'(\d+)(\D+)(\d+)', expand=True)
    df = df.stack().str.strip().unstack()
    df.iloc[:, 0] = pd.to_numeric(df.iloc[:, 0]).astype(float)
    df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2]).astype(float)
    def do_op(x):
        return getattr(x[0], op_map[x[1]])(x[2])
    return df.T.apply(do_op)

Demonstration

s = pd.Series(['1/2', '3/4', '4/5'])

do_math_in_string(s)

0    0.50
1    0.75
2    0.80
dtype: float64

do_math_in_string(pd.Series(['1/2', '3/4', '4/5', '6+5', '11-7', '9*10']))

0     0.50
1     0.75
2     0.80
3    11.00
4     4.00
5    90.00
dtype: float64

Please don't use eval.

6 Comments

"You'll have better self esteem later in life if you avoid it now." Why?
That comment is me being comedically dramatic. But in practice, using eval is a recipe for code that is difficult to manage, maintain, debug, and keep secure. It's very generally regarded as bad practice.
If he's not dealing with user input, how would using eval here be a problem?
I'll say a couple more things: One, an input file can very well be user input. Two, it still makes it more difficult to track bugs and maintain (my belief). Three, all this is coding philosophy and doesn't really matter as the OP can do what he wants. My answer still does what it does, regardless of opinions on eval.
I was actually just curious if there was anything more to it than the problem that it evaluates whatever a user types in. It's a good thing to note for the OP though, that eval can be dangerous if the assumption that this isn't somehow involving user input is wrong.
|
2

You can do it by applying eval to the column:

data = {'one':['1/20', '2/30']}
df = pd.DataFrame(data)

In [8]: df['one'].apply(eval)
Out[8]:
0    0.050000
1    0.066667
Name: one, dtype: float64

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.