python - Transform data to numpy array for sklearn

Question

I have a dataset made up of some text columns (each with a limited set of possible values) and some numeric columns, stored in CSV format. Is there a way to automatically convert the text columns to numbers (for example, A becomes 0, B becomes 1, and so on) so that the dataset can be turned into an np.array?

This will later be used with scikit-learn, so it needs to be an np.array at the end of the processing.

EDIT: Adding one line of the dataset:

ENABLED;ENABLED;10;MANUAL;ENABLED;ENABLED;1800000;OFF;0.175;5.0;0.13;OFF;NEITHER;ENABLED;-65;2417;"wifi01";65;-75;DISCONNECTED;NO;NO;2621454;432477;3759;2.2436838539123705E-6;

Can you give us an example (excerpt) from the file (or "text columns") so we can better understand what you're working with? numpy's genfromtxt might be a good place to start, or possibly pandas.read_csv...
– mgilson, Commented Nov 18, 2016 at 0:18
Added to the description. Each text column may have 3 or 4 possible values.
– Minoru, Commented Nov 18, 2016 at 0:30

MhFarahani · Accepted Answer · 2016-11-18 00:53:25Z


You can apply sklearn.preprocessing.LabelEncoder to each text column. Here is an example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': ['ON', 'ON', 'OFF', 'OFF', 'ON']})

# LabelEncoder numbers the sorted unique labels: 'OFF' -> 0, 'ON' -> 1
lb = LabelEncoder()
df['encoded'] = lb.fit_transform(df['col2'])
df

  col1  col2  encoded
0   1    ON     1
1   2    ON     1
2   3    OFF    0
3   4    OFF    0
4   5    ON     1

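I added the encoded values as a new column, but you can overwrite the original column instead. You can also convert the DataFrame to a NumPy array (as_matrix() was removed in pandas 1.0, so use to_numpy()). Note that the result has dtype=object as long as the text column col2 is still present:

df.to_numpy()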
array([[1, 'ON', 1],
       [2, 'ON', 1],
       [3, 'OFF', 0],
       [4, 'OFF', 0],
       [5, 'ON', 1]], dtype=object)
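
To get a purely numeric array from a file like the one in the question, one option is to encode every non-numeric column in place and only then convert. The following is a minimal sketch, not a definitive implementation: the three rows are made up in the same semicolon-separated style as the question's data, and the names raw, encoders and X are chosen here for illustration.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows in the semicolon-separated style of the question's data.
raw = (
    "ENABLED;10;MANUAL;0.175\n"
    "DISABLED;20;AUTO;0.5\n"
    "ENABLED;30;MANUAL;0.25\n"
)

# No header row in the file, so columns are named 0, 1, 2, ...
df = pd.read_csv(io.StringIO(raw), sep=";", header=None)

encoders = {}  # one fitted encoder per text column, kept for decoding later
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        enc = LabelEncoder()
        df[col] = enc.fit_transform(df[col])  # overwrite text with integer codes
        encoders[col] = enc

# Every column is numeric now, so the array gets a float dtype, not object.
X = df.to_numpy(dtype=float)
print(X)
```

Keeping the fitted encoders around means the integer codes can be mapped back to the original labels with inverse_transform, and the same mapping can be reused on new data with transform.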

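Here is how you can do the same encoding with NumPy alone. In this example I am just passing a Python list:

import numpy as np

alist = ['ON', 'ON', 'OFF', 'OFF', 'ON']
# return_inverse=True also returns, for each item, its index into the sorted unique values
unique_values, y = np.unique(alist, return_inverse=True)
print(unique_values)
print(y)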

The results are:

['OFF' 'ON']
[1 1 0 0 1]
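
The same np.unique trick can be applied column by column to a whole table without pandas. The sketch below assumes the file is small enough to load as strings; the rows are made up for illustration, and encode_column is a hypothetical helper, not part of NumPy.

```python
import csv
import io

import numpy as np

# Made-up rows in the semicolon-separated style of the question's data.
raw = (
    "ENABLED;10;MANUAL;0.175\n"
    "DISABLED;20;AUTO;0.5\n"
    "ENABLED;30;MANUAL;0.25\n"
)

rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
table = np.array(rows)  # 2-D array of strings, one row per record


def encode_column(values):
    """Return a column as floats, label-encoding it if it is not numeric."""
    try:
        return values.astype(float)
    except ValueError:
        # Not parseable as numbers: replace each label by its index
        # in the sorted unique labels.
        _, codes = np.unique(values, return_inverse=True)
        return codes.astype(float)


X = np.column_stack([encode_column(table[:, j]) for j in range(table.shape[1])])
print(X)
```

With a real file, io.StringIO(raw) would be replaced by an open file handle.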

edited Nov 18, 2016 at 0:53

answered Nov 18, 2016 at 0:30

MhFarahani


2 Comments

Minoru Over a year ago

Is it possible to do without Pandas? Looking to transform, though.

MhFarahani Over a year ago

Yes, it is possible, but this way seems much simpler to me. You can read your data as a pandas DataFrame and then follow the above procedure. Under the hood, sklearn uses numpy in its LabelEncoder, so if you read the data as a numpy array you should be able to do the same thing.
