
I have a dataset in CSV format made up of some text columns (each with a limited set of possible values) and some numeric columns. Is there a way to automatically map the text columns to numbers (for example, A becomes 0, B becomes 1, and so on) so that the dataset can be converted to an np.array?

This will later be used with scikit-learn, so it needs to be an np.array at the end of all the processing.

EDIT: Adding one line of the dataset:

ENABLED;ENABLED;10;MANUAL;ENABLED;ENABLED;1800000;OFF;0.175;5.0;0.13;OFF;NEITHER;ENABLED;-65;2417;"wifi01";65;-75;DISCONNECTED;NO;NO;2621454;432477;3759;2.2436838539123705E-6;
  • Can you give us an example (excerpt) from the file (or the "text columns") so we can better understand what you're working with? numpy's genfromtxt might be a good place to start, or possibly pandas.read_csv... Commented Nov 18, 2016 at 0:18
  • Added to the description. Each text column may have 3 or 4 possible values. Commented Nov 18, 2016 at 0:30

1 Answer

You can apply sklearn.preprocessing.LabelEncoder to each text column. Here is an example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['ON', 'ON', 'OFF', 'OFF', 'ON']})

# Labels are sorted before numbering, so 'OFF' -> 0 and 'ON' -> 1
lb = LabelEncoder()
df['encoded'] = lb.fit_transform(df.col2)
df

  col1  col2  encoded
0   1    ON     1
1   2    ON     1
2   3    OFF    0
3   4    OFF    0
4   5    ON     1

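I added the encoded values as a new column here, but you can overwrite the original column instead. You can also convert the DataFrame to a NumPy array (as_matrix() was removed in pandas 1.0, so use to_numpy()):

df.to_numpy()
array([[1, 'ON', 1],
       [2, 'ON', 1],
       [3, 'OFF', 0],
       [4, 'OFF', 0],
       [5, 'ON', 1]], dtype=object)

Note the dtype=object: col2 still holds strings, so drop or overwrite it before passing the array to scikit-learn.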
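
To get a fully numeric array, every text column can be encoded in place before the conversion. The following is only a minimal sketch: it assumes a small made-up ';'-separated sample with no header row, and the names csv_text and encoders are illustrative, not taken from the question's file.

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample in the same ';'-separated style as the question
csv_text = "ENABLED;10;OFF\nDISABLED;20;ON\nENABLED;30;OFF\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";", header=None)

# Keep one fitted encoder per text column so the codes can be mapped back later
encoders = {}
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        enc = LabelEncoder()
        df[col] = enc.fit_transform(df[col])  # overwrite text with integer codes
        encoders[col] = enc

# All columns are numeric now, so the array gets a float dtype instead of object
X = df.to_numpy(dtype=float)
print(X)
```

Each fitted encoder keeps its labels in classes_, so encoders[0].inverse_transform(...) can turn the codes of column 0 back into the original strings.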

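Here is how you can encode with numpy alone. In this example I am just passing a Python list:

import numpy as np

alist = ['ON', 'ON', 'OFF', 'OFF', 'ON']
# return_inverse=True also returns, for each element, its index in unique_values
unique_values, y = np.unique(alist, return_inverse=True)
print(unique_values)
print(y)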

The results are:

['OFF' 'ON']
[1 1 0 0 1]

2 Comments

Is it possible to do without Pandas? Looking to transform, though.
Yes, it is possible, but the pandas route seems much simpler to me: read your data into a pandas DataFrame and then follow the procedure above. Under the hood, sklearn's LabelEncoder uses numpy, so if you read the data as a numpy array you should be able to do the same thing.
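
For reference, a pandas-free version could look roughly like the sketch below. It assumes a small made-up ';'-separated sample with no header row and no quoted or empty fields. The helper is_number and the names csv_text and mappings are hypothetical, introduced only for illustration.

```python
import csv
import io

import numpy as np

# Hypothetical sample in the same ';'-separated style as the question
csv_text = "ENABLED;10;OFF\nDISABLED;20;ON\nENABLED;30;OFF\n"

# Read every field as a string first
rows = list(csv.reader(io.StringIO(csv_text), delimiter=";"))
raw = np.array(rows)


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


X = np.empty(raw.shape, dtype=float)
mappings = {}  # column index -> array of original labels, indexed by code
for j in range(raw.shape[1]):
    column = raw[:, j]
    if all(is_number(v) for v in column):
        X[:, j] = column.astype(float)
    else:
        # np.unique sorts the labels; codes are indices into that sorted array
        labels, codes = np.unique(column, return_inverse=True)
        X[:, j] = codes
        mappings[j] = labels

print(X)
```

mappings[j][code] recovers the original label for a code in column j, which plays the same role as LabelEncoder's classes_.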
