Issue when loading a data file with NumPy

Question

I want to train a classifier with scikit-learn, but first I need to load the data. I am using the data file available at:

https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/

When I open it in Word, it has the following contents:

ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT
ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT
ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT
AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC

Each field is separated by a double space and every line ends with a carriage return.

I want to read it with the following command:

f=open("yeast.data")
data = np.loadtxt(f,delimiter=" ")

and at the end I want to be able to use the following:

X = data[:,:-1]  # select all columns except the last
y = data[:, -1]   # select the last column

so that I can call:

X_train, X_test, y_train, y_test = train_test_split(X, y)

but when I try to read it the following error appears:

ValueError: could not convert string to float: ADT1_YEAST

So how can I read this file in Python so that I can use it later with MLPClassifier?

Thanks

I hadn't seen that my original solution gave a (n,) shaped array. Take a look at my update, I think it works.
– sacuL, Commented Aug 4, 2018 at 17:13
The usecols parameter will let you load the string and float columns separately.
– hpaulj, Commented Aug 4, 2018 at 21:44
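
Following the usecols suggestion in the comment above, here is a minimal sketch of loading the two kinds of columns separately. It assumes the ten-column layout shown in the question (name, eight numeric features, class label) and uses an in-memory copy of the four sample rows in place of yeast.data; the variable names are only illustrative.

```python
import io
import numpy as np

# Four sample rows copied from the question; with the real file you would
# pass "yeast.data" instead of io.StringIO(sample).
sample = """\
ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT
ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT
ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT
AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC
"""

# Columns 1..8 are the numeric features. Leaving delimiter unset makes
# loadtxt split on any run of whitespace, so double spaces are fine.
X = np.loadtxt(io.StringIO(sample), usecols=tuple(range(1, 9)))

# Column 9 is the class label, read as text rather than float.
y = np.loadtxt(io.StringIO(sample), usecols=9, dtype=str)

print(X.shape, y)
```

The name column (column 0) is skipped entirely here, since it is an identifier rather than a feature.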

sacuL · Accepted Answer · 2018-08-04 17:16:27Z

You can skip the f=open(...), and you can use dtype='O' to make sure numpy reads it as a mix of numbers and strings. Because of some inconsistencies in the data structure of the file you linked, it's best to use genfromtxt instead of loadtxt:

data = np.genfromtxt('yeast.data',dtype='O')

>>> data
array([[b'ADT1_YEAST', b'0.58', b'0.61', ..., b'0.48', b'0.22', b'MIT'],
       [b'ADT2_YEAST', b'0.43', b'0.67', ..., b'0.53', b'0.22', b'MIT'],
       [b'ADT3_YEAST', b'0.64', b'0.62', ..., b'0.53', b'0.22', b'MIT'],
       ..., 
       [b'ZNRP_YEAST', b'0.67', b'0.57', ..., b'0.56', b'0.22', b'ME2'],
       [b'ZUO1_YEAST', b'0.43', b'0.40', ..., b'0.53', b'0.39', b'NUC'],
       [b'G6PD_YEAST', b'0.65', b'0.54', ..., b'0.53', b'0.22', b'CYT']], dtype=object)

>>> data.shape
(1484, 10)

You can set the dtypes when you call genfromtxt (see the documentation), or you can convert them manually afterwards, like this:

data[:,0] = data[:,0].astype(str)
data[:,1:-1]= data[:,1:-1].astype(float)
data[:,-1] = data[:,-1].astype(str)

>>> data
array([['ADT1_YEAST', 0.58, 0.61, ..., 0.48, 0.22, 'MIT'],
       ['ADT2_YEAST', 0.43, 0.67, ..., 0.53, 0.22, 'MIT'],
       ['ADT3_YEAST', 0.64, 0.62, ..., 0.53, 0.22, 'MIT'],
       ..., 
       ['ZNRP_YEAST', 0.67, 0.57, ..., 0.56, 0.22, 'ME2'],
       ['ZUO1_YEAST', 0.43, 0.4, ..., 0.53, 0.39, 'NUC'],
       ['G6PD_YEAST', 0.65, 0.54, ..., 0.53, 0.22, 'CYT']], dtype=object)
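
To get from this mixed object array to the X and y the question asks for, the numeric block and the label column can be split out and given plain dtypes. This is only a sketch, assuming the column layout shown above; the small data array below is a hand-built stand-in for the real one. Note that the question's data[:, :-1] would also include the name column, so the slice starts at 1 here.

```python
import numpy as np

# Hand-built stand-in for the object array shown above:
# first column is the name, last column is the class label.
data = np.array([
    ['ADT1_YEAST', 0.58, 0.61, 0.47, 0.13, 0.50, 0.00, 0.48, 0.22, 'MIT'],
    ['ADT2_YEAST', 0.43, 0.67, 0.48, 0.27, 0.50, 0.00, 0.53, 0.22, 'MIT'],
    ['ADT3_YEAST', 0.64, 0.62, 0.49, 0.15, 0.50, 0.00, 0.53, 0.22, 'MIT'],
    ['AAR2_YEAST', 0.58, 0.44, 0.57, 0.13, 0.50, 0.00, 0.54, 0.22, 'NUC'],
], dtype=object)

# Drop the name column and convert the eight features to a float array.
X = data[:, 1:-1].astype(float)

# Keep the labels as a plain string array.
y = data[:, -1].astype(str)

# X and y can now be passed to train_test_split(X, y) as in the question.
print(X.dtype, X.shape, y)
```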

edited Aug 4, 2018 at 17:16

answered Aug 4, 2018 at 16:44

sacuL

1 Comment

hpaulj Over a year ago

Or set dtype=None to get a structured array - 1d with fields corresponding to the file's columns.
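
A rough sketch of what the dtype=None route can look like, again on an in-memory copy of the sample rows from the question rather than the real file. The encoding argument and the use of column_stack to rebuild a 2-D feature matrix are my own assumptions about how one might proceed, not something stated in the comment.

```python
import io
import numpy as np

# Four sample rows copied from the question.
sample = """\
ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT
ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT
ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT
AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC
"""

# dtype=None lets genfromtxt pick a type per column; the result is a
# 1-D structured array with one field per column (f0, f1, ...).
table = np.genfromtxt(io.StringIO(sample), dtype=None, encoding="utf-8")

names = table.dtype.names            # field names, one per file column
feature_fields = names[1:-1]         # skip the name and label columns

# Stack the numeric fields side by side into an (n_rows, 8) float matrix.
X = np.column_stack([table[f] for f in feature_fields])
y = table[names[-1]]                 # class labels as strings

print(table.shape, X.shape, y)
```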
