Converting string objects to int/float using pandas

Question

import pandas as pd

path1 = "/home/supertramp/Desktop/100&life_180_data.csv"

mydf =  pd.read_csv(path1)

numcigar = {"Never":0 ,"1-5 Cigarettes/day" :1,"10-20 Cigarettes/day":4}

print mydf['Cigarettes']

mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)

print mydf['CigarNum']

mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')

The csv file "100&life_180_data.csv" contains columns like age, bmi,Cigarettes,Alocohol etc.

No                int64
Age               int64
BMI             float64
Alcohol          object
Cigarettes       object
dtype: object

Cigarettes column contains "Never" "1-5 Cigarettes/day","10-20 Cigarettes/day". I want to assign weights to these object (Never,1-5 Cigarettes/day ,....)

The expected output is new column CigarNum appended which consists only numbers 0,1,2 CigarNum is as expected till 8 rows and then shows Nan till last row in CigarNum column

0                     Never
1                     Never
2        1-5 Cigarettes/day
3                     Never
4                     Never
5                     Never
6                     Never
7                     Never
8                     Never
9                     Never
10                    Never
11                    Never
12     10-20 Cigarettes/day
13       1-5 Cigarettes/day
14                    Never
...
167                    Never
168                    Never
169     10-20 Cigarettes/day
170                    Never
171                    Never
172                    Never
173                    Never
174                    Never
175                    Never
176                    Never
177                    Never
178                    Never
179                    Never
180                    Never
181                    Never
Name: Cigarettes, Length: 182, dtype: object

The output I get shoudln't give NaN after few first rows.

0      0
1      0
2      1
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10   NaN
11   NaN
12   NaN
13   NaN
14     0
...
167   NaN
168   NaN
169   NaN
170   NaN
171   NaN
172   NaN
173   NaN
174   NaN
175   NaN
176   NaN
177   NaN
178   NaN
179   NaN
180   NaN
181   NaN
Name: CigarNum, Length: 182, dtype: float64

Are you sure row 10 and 11 actually equals 'Never' and that there isn't a space or other character in the value? — EdChum
– EdChum, Commented Jun 4, 2014 at 12:39
Yes,I didn't check the space till now.Really thanks.Could you help me with an efficient way to ignore these spaces.I have lot more columns with space in the beginning.Thanks in advance. — codex
– codex, Commented Jun 4, 2014 at 12:44

EdChum · Accepted Answer · 2017-01-30 15:25:08Z

36

OK, first problem is you have embedded spaces causing the function to incorrectly apply:

fix this using vectorised str:

mydf['Cigarettes'] = mydf['Cigarettes'].str.replace(' ', '')

now create your new column should just work:

mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)

UPDATE

Thanks to @Jeff as always for pointing out superior ways to do things:

So you can call replace instead of calling apply:

mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)

you can also use factorize method also.

Thinking about it why not just set the dict values to be floats anyway and then you avoid the type conversion?

So:

numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}

Version 0.17.0 or newer

convert_objects is deprecated since 0.17.0, this has been replaced with to_numeric

mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')

Here errors='coerce' will return NaN where the values cannot be converted to a numeric value, without this it will raise an exception

edited Jan 30, 2017 at 15:25

answered Jun 4, 2014 at 12:51

EdChum

397k204 gold badges836 silver badges583 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Jeff Over a year ago

you can use series.replace(dict) I believe to do the substitution, then convert_objects(convert_numeric=True) to change to float (forcibly); you can also factorize to make categoricals (e.g. map the strings to numbers)

EdChum Over a year ago

@Jeff so is replace faster than calling map or apply and passing a dict now? Wasn't aware of factorize also, when was this introduced?

Jeff Over a year ago

replace should be much faster; factorize has been their quite a while (but not advertised :))

haneulkim Over a year ago

@EdChum Hi, when I do this I get "A value is trying to be set on copy of a slice from a Dataframe" how do I change original dataframe?

EdChum Over a year ago

@h_musk please see stackoverflow.com/questions/20625582/… basically if you want to operate on a copy then you should call df.copy(), otherwise follow that link if you want to operate on a view

Apogentus · Accepted Answer · 2015-05-14 20:24:49Z

4

Try using this function for all problems of this kind:

def get_series_ids(x):
    '''Function returns a pandas series consisting of ids, 
       corresponding to objects in input pandas series x
       Example: 
       get_series_ids(pd.Series(['a','a','b','b','c'])) 
       returns Series([0,0,1,1,2], dtype=int)'''

    values = np.unique(x)
    values2nums = dict(zip(values,range(len(values))))
    return x.replace(values2nums)

edited May 14, 2015 at 20:24

answered May 14, 2015 at 20:17

Apogentus

6,6536 gold badges36 silver badges34 bronze badges

1 Comment

Denis Over a year ago

That's nice! Minor comments: requires import numpy as np; desired add-on: optionally, assign -1 or something like that for null values if such occur

Collectives™ on Stack Overflow

Converting string objects to int/float using pandas

2 Answers 2

5 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related