Convert multiple columns of a pandas data frame to dummy variables - Python

Question

I have this dataframe:

[Image of the DataFrame, not available]

As far as I know, to use scikit-learn for machine learning tasks in Python, categorical variables should be converted to dummy variables. So, for example, I tried to use scikit-learn to convert the values of the third column to dummy values, but my code didn't work:

from sklearn.preprocessing import LabelEncoder

x[:, 2] = LabelEncoder().fit_transform(x[:,2])

So what's wrong with my code? And how can I convert all the categorical variables in my data frame to dummy variables?

Edit: The full traceback is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-73-c0d726db979e> in <module>()
      1 from sklearn.preprocessing import LabelEncoder
      2 
----> 3 x[:, 2] = LabelEncoder().fit_transform(x[:,2])

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   2001             # get column
   2002             if self.columns.is_unique:
-> 2003                 return self._get_item_cache(key)
   2004 
   2005             # duplicate columns

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
    665             return cache[item]
    666         except Exception:
--> 667             values = self._data.get(item)
    668             res = self._box_item_values(item, values)
    669             cache[item] = res

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item)
   1653     def get(self, item):
   1654         if self.items.is_unique:
-> 1655             _, block = self._find_block(item)
   1656             return block.get(item)
   1657         else:

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in _find_block(self, item)
   1933 
   1934     def _find_block(self, item):
-> 1935         self._check_have(item)
   1936         for i, block in enumerate(self.blocks):
   1937             if item in block:

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in _check_have(self, item)
   1939 
   1940     def _check_have(self, item):
-> 1941         if item not in self.items:
   1942             raise KeyError('no item named %s' % com.pprint_thing(item))
   1943 

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\index.pyc in __contains__(self, key)
    317 
    318     def __contains__(self, key):
--> 319         hash(key)
    320         # work around some kind of odd cython bug
    321         try:

TypeError: unhashable type

You should provide the full traceback instead of just saying "it didn't work". I suspect the problem is that making dummy variables results in multiple columns (one for each distinct value in the original column), so you can't assign back to the original column. You will probably want to make a new DataFrame containing your dummy columns.
– BrenBarn, Commented Sep 29, 2014 at 3:32
In pandas questions, it's usually better to include a copy-pastable version of your DataFrame. I usually prefer the output of df.to_dict.
– tktk, Commented Sep 29, 2014 at 6:41

omun · Accepted Answer · 2015-05-10 10:17:07Z

I don't think LabelEncoder transforms your data to dummy variables (see scikit-learn.org/LabelEncoder). Instead, it creates new numerical labels for the variable, so each row still gets a single value in a single column.
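To make the difference concrete, here is a minimal sketch. The small DataFrame `df` is made up for illustration and is not the asker's data. It selects a column by position with `.iloc` (a DataFrame does not support NumPy-style `x[:, 2]` indexing, which is what the traceback in the question is complaining about) and shows that LabelEncoder returns one integer code per row rather than a set of dummy columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical example data, standing in for the asker's DataFrame.
df = pd.DataFrame({"Var1": ["a", "a", "b"],
                   "Var2": ["a", "b", "c"],
                   "Var3": [1, 2, 3]})

# Select the second column by position; df[:, 1] would not work on a DataFrame.
second_column = df.iloc[:, 1]

# LabelEncoder maps each distinct value to an integer (classes are sorted).
codes = LabelEncoder().fit_transform(second_column)
print(codes)  # [0 1 2]
```

The result is still a single column of integers, which most models would treat as ordered numbers, so it is usually not a substitute for dummy variables on input features.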

I use the get_dummies function from pandas to create dummy variables (see pandas.pydata.org/dummies). Below is a simple example.

Create a simple DataFrame with categorical and numerical data

import pandas as pd
X = pd.DataFrame({"Var1": ["a", "a", "b"],
                  "Var2": ["a", "b", "c"],
                  "Var3": [1, 2, 3]},
                  dtype = "category")
X["Var3"] = X["Var3"].astype(int)

Transform data to dummy variables

pd.get_dummies(X)

Out[4]:

   Var3  Var1_a  Var1_b  Var2_a  Var2_b  Var2_c
0     1       1       0       1       0       0
1     2       1       0       0       1       0
2     3       0       1       0       0       1

Notice that Var1 was transformed to two dummy variables, but you might want to have all three categories [a, b, c]. You will need to add the new category.

# Assign the result back: recent pandas versions no longer accept inplace=True here.
X["Var1"] = X["Var1"].cat.add_categories("c")

And the result:

pd.get_dummies(X)

Out[6]:

   Var3  Var1_a  Var1_b  Var1_c  Var2_a  Var2_b  Var2_c
0     1       1       0       0       1       0       0
1     2       1       0       0       0       1       0
2     3       0       1       0       0       0       1
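If only some columns should be encoded, or the DataFrame holds plain string columns rather than category dtype, a possible variant is sketched below. It assumes a reasonably recent pandas; the names `df` and `dummies` are illustrative. The `columns` argument restricts the encoding to the listed columns, `pd.Categorical` declares the full category set up front, and `dtype=int` requests 0/1 integers (newer pandas versions return booleans by default).

```python
import pandas as pd

# Hypothetical example data with ordinary string columns.
df = pd.DataFrame({"Var1": ["a", "a", "b"],
                   "Var2": ["a", "b", "c"],
                   "Var3": [1, 2, 3]})

# Declare all expected categories for Var1, including the unobserved "c".
df["Var1"] = pd.Categorical(df["Var1"], categories=["a", "b", "c"])

# Encode only the listed columns; Var3 is passed through unchanged.
dummies = pd.get_dummies(df, columns=["Var1", "Var2"], dtype=int)
print(dummies)
```

The unobserved category still gets its own column (`Var1_c`, all zeros), which helps keep the column layout the same across different data sets.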

Hope this helps
