I have this dataframe:

(image of the DataFrame, not reproduced here)

As far as I know, to use the scikit-learn package in Python for machine learning tasks, categorical variables should be converted to dummy variables. So, for example, I tried to use scikit-learn to convert the values of the third column to dummy values, but my code didn't work:

from sklearn.preprocessing import LabelEncoder

x[:, 2] = LabelEncoder().fit_transform(x[:,2])

So what's wrong with my code? And how can I convert all the categorical variables in my DataFrame to dummy variables?

Edit: The full traceback is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-73-c0d726db979e> in <module>()
      1 from sklearn.preprocessing import LabelEncoder
      2 
----> 3 x[:, 2] = LabelEncoder().fit_transform(x[:,2])

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   2001             # get column
   2002             if self.columns.is_unique:
-> 2003                 return self._get_item_cache(key)
   2004 
   2005             # duplicate columns

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
    665             return cache[item]
    666         except Exception:
--> 667             values = self._data.get(item)
    668             res = self._box_item_values(item, values)
    669             cache[item] = res

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item)
   1653     def get(self, item):
   1654         if self.items.is_unique:
-> 1655             _, block = self._find_block(item)
   1656             return block.get(item)
   1657         else:

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in _find_block(self, item)
   1933 
   1934     def _find_block(self, item):
-> 1935         self._check_have(item)
   1936         for i, block in enumerate(self.blocks):
   1937             if item in block:

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\internals.pyc in _check_have(self, item)
   1939 
   1940     def _check_have(self, item):
-> 1941         if item not in self.items:
   1942             raise KeyError('no item named %s' % com.pprint_thing(item))
   1943 

C:\Users\toshiba\Anaconda\lib\site-packages\pandas\core\index.pyc in __contains__(self, key)
    317 
    318     def __contains__(self, key):
--> 319         hash(key)
    320         # work around some kind of odd cython bug
    321         try:

TypeError: unhashable type
  • You should provide the full traceback instead of just saying "it didn't work". I suspect the problem is that making dummy variables results in multiple columns (one for each distinct value in the original column), so you can't assign them back to the original column. You will probably want to make a new DataFrame containing your dummy columns. Commented Sep 29, 2014 at 3:32
  • In pandas questions it's usually better to include a copy-pastable version of your DataFrame. I usually prefer the output of df.to_dict. Commented Sep 29, 2014 at 6:41

1 Answer

I don't think LabelEncoder transforms your data to dummy variables (see scikit-learn.org/LabelEncoder). It is a class that maps each distinct value of a variable to a numerical label, so the result is still a single column.
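
As a rough illustration of that point, here is a minimal sketch of what LabelEncoder returns. The DataFrame and its column names are made up for the example, since the original data was only posted as an image. It also selects the third column with .iloc, on the assumption that x is a pandas DataFrame as the traceback suggests: x[:, 2] is NumPy-style indexing, and a DataFrame tries to look that key up as a column label, which fails because the key contains an unhashable slice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the asker's DataFrame; the third column is categorical.
x = pd.DataFrame({"age": [25, 32, 47],
                  "income": [40.0, 55.5, 61.2],
                  "city": ["paris", "rome", "paris"]})

# Positional selection on a DataFrame goes through .iloc, not x[:, 2].
codes = LabelEncoder().fit_transform(x.iloc[:, 2])
print(codes)  # one integer label per row, not one column per category

# Store the labels in a copy so the original strings are kept.
encoded = x.copy()
encoded["city"] = codes
```

The result has the same number of columns as before, which is why this is label encoding rather than dummy (one-hot) encoding.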

I use the get_dummies function from pandas to do this (see pandas.pydata.org/dummies). A simple example follows.

Create a simple DataFrame with categorical and numerical data

import pandas as pd
X = pd.DataFrame({"Var1": ["a", "a", "b"],
                  "Var2": ["a", "b", "c"],
                  "Var3": [1, 2, 3]},
                  dtype = "category")
X["Var3"] = X["Var3"].astype(int)

Transform data to dummy variables

pd.get_dummies(X)

Out[4]:

   Var3  Var1_a  Var1_b  Var2_a  Var2_b  Var2_c
0     1       1       0       1       0       0
1     2       1       0       0       1       0
2     3       0       1       0       0       1

Notice that Var1 produced only two dummy variables, because only a and b occur in it. If you want all three categories [a, b, c] represented, add the missing category to the column:

# add_categories returns a new Series (the inplace option was removed in pandas 2.0)
X["Var1"] = X["Var1"].cat.add_categories("c")

And the result:

pd.get_dummies(X)

Out[6]:

   Var3  Var1_a  Var1_b  Var1_c  Var2_a  Var2_b  Var2_c
0     1       1       0       0       1       0       0
1     2       1       0       0       0       1       0
2     3       0       1       0       0       0       1
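
An alternative sketch, not part of the approach above: if the full set of levels is known in advance, it can be declared when the categorical columns are built, so no category has to be added afterwards. The name levels is introduced here for illustration only. Passing dtype=int is an assumption about the desired output; recent pandas versions return boolean dummy columns by default, whereas the tables above show 0/1.

```python
import pandas as pd

X = pd.DataFrame({"Var1": ["a", "a", "b"],
                  "Var2": ["a", "b", "c"],
                  "Var3": [1, 2, 3]})

# Declare the full category set up front, so the unused level "c"
# in Var1 still gets its own (all-zero) dummy column.
levels = ["a", "b", "c"]  # illustrative name, not from the original answer
for col in ["Var1", "Var2"]:
    X[col] = pd.Categorical(X[col], categories=levels)

# dtype=int requests 0/1 columns instead of booleans.
dummies = pd.get_dummies(X, dtype=int)
print(dummies.columns.tolist())
```

The numeric column Var3 is passed through unchanged, and only the categorical columns are expanded.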

Hope this helps
