Pandas: Efficiently change multiple values in multiple columns

Question

My DataFrame is 94 columns by 728k rows. Each value is a string representing a colour. I'm aiming to convert each colour to a corresponding numeric value.

Here's a reproducible example. In this example I want to convert the strings as follows:

blue = 1  
green = 2  
red = 3  
grey = 4  
orange = 5

import pandas as pd

data = {'group1': ['red', 'grey', 'blue', 'orange'],
        'group2': ['red', 'green', 'blue', 'blue'],
        'group3': ['orange', 'blue', 'orange', 'green']}

data = pd.DataFrame(data)
data

    group1  group2  group3
0   red     red     orange  
1   grey    green   blue
2   blue    blue    orange
3   orange  blue    green

Output would be:

    group1  group2  group3
0        3       3       5  
1        4       2       1
2        1       1       5
3        5       1       2

How could I do this efficiently given the size of my actual data?

May not be exactly what you are looking for, but take a look at sklearn.preprocessing.LabelEncoder as well. scikit-learn.org/stable/modules/generated/…
– user2285236, Commented Mar 12, 2016 at 15:40

Alex Riley · Accepted Answer · 2016-03-12 15:50:30Z

You could first use a dictionary to map the strings to integers:

d = {'blue': 1, 'green': 2, 'red': 3, 'grey': 4, 'orange': 5}

Then use replace and pass in that dictionary:

>>> data.replace(d)
   group1  group2  group3
0       3       3       5
1       4       2       1
2       1       1       5
3       5       1       2
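If replace turns out to be slow on a frame of your size, one alternative worth timing is a per-column Series.map with the same dictionary. This is only a sketch on the example data, and I have not benchmarked it on 94 x 728k. Note one behavioural difference: map turns any string that is missing from the dictionary into NaN, whereas replace leaves it unchanged.

```python
import pandas as pd

d = {'blue': 1, 'green': 2, 'red': 3, 'grey': 4, 'orange': 5}

data = pd.DataFrame({'group1': ['red', 'grey', 'blue', 'orange'],
                     'group2': ['red', 'green', 'blue', 'blue'],
                     'group3': ['orange', 'blue', 'orange', 'green']})

# Look every value of each column up in d.
# Values that are not keys of d would become NaN.
mapped = data.apply(lambda col: col.map(d))
print(mapped)
```

On the example frame this gives the same integers as data.replace(d).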

A dictionary has the advantage of allowing you to pick which strings are mapped to which integers. If you don't mind the values being generated for you automatically, you could take advantage of pandas' categorical data type.

Ideally we'd write data.astype('category') and proceed from there, but as of pandas 0.17.1, two-dimensional categorical conversions are not implemented.

A work-around is to stack, cast, and unstack:

>>> c_data = data.stack().astype('category')
>>> c_data.cat.codes.unstack()
   group1  group2  group3
0       4       4       3
1       2       1       0
2       0       0       3
3       3       0       1
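To see which colour each generated code stands for, you can look at the categories of the stacked Series: code i is the position of the colour in c_data.cat.categories. A small sketch on the example data (the name code_to_colour is just for illustration):

```python
import pandas as pd

data = pd.DataFrame({'group1': ['red', 'grey', 'blue', 'orange'],
                     'group2': ['red', 'green', 'blue', 'blue'],
                     'group3': ['orange', 'blue', 'orange', 'green']})

# Stack to one long Series so all columns share a single set of categories.
c_data = data.stack().astype('category')
codes = c_data.cat.codes.unstack()

# Each code is the position of its colour in the categories index.
code_to_colour = dict(enumerate(c_data.cat.categories))
print(code_to_colour)
```

Here the categories come out in sorted order, which is why blue is 0 and red is 4.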

edited Mar 12, 2016 at 15:50

answered Mar 12, 2016 at 15:32

Alex Riley

1 Comment

Jeff, over a year ago

You can also explicitly pass in categories when casting to categorical, to get whatever numerical codes you want.
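A minimal sketch of that suggestion, assuming a pandas version that provides pandas.api.types.CategoricalDtype (newer than the 0.17.1 mentioned in the answer). The categories are listed in the order that makes code + 1 match the numbering from the question; colour_dtype is just an illustrative name.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

data = pd.DataFrame({'group1': ['red', 'grey', 'blue', 'orange'],
                     'group2': ['red', 'green', 'blue', 'blue'],
                     'group3': ['orange', 'blue', 'orange', 'green']})

# Position in this list determines the code: blue -> 0, green -> 1, ...
colour_dtype = CategoricalDtype(
    categories=['blue', 'green', 'red', 'grey', 'orange'])

# Codes start at 0, so add 1 to get blue = 1 ... orange = 5.
# A value not listed in categories gets code -1, i.e. 0 after the shift.
codes = data.apply(lambda col: col.astype(colour_dtype).cat.codes + 1)
print(codes)
```

The result holds small integers (int8), which may also help with memory on a large frame.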
