
I want to train a classifier with scikit-learn, but first I need to load the corresponding data. I am using the data file available at:

https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/

When I open it in Word, it has the following contents:

ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT
ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT
ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT
AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC

Each field is separated by a double space, and every line ends with a carriage return.

I want to read it with the following commands:

f=open("yeast.data")
data = np.loadtxt(f,delimiter=" ")

and in the end I want to be able to use the following:

X = data[:,:-1]  # select all columns except the last
y = data[:, -1]   # select the last column

so that I can call:

X_train, X_test, y_train, y_test = train_test_split(X, y)

but when I try to read the file, the following error appears:

ValueError: could not convert string to float: ADT1_YEAST

So how can I read this file in Python so that I can later use MLPClassifier?

Thanks

  • I hadn't seen that my original solution gave a (n,) shaped array. Take a look at my update, I think it works. Commented Aug 4, 2018 at 17:13
  • The usecols parameter will let you load the string and float columns separately. Commented Aug 4, 2018 at 21:44

1 Answer


You can skip the f=open(...) call and pass the filename directly, and you can use dtype='O' to make sure NumPy reads the file as a mix of numbers and strings. Because of some inconsistencies in the data structure of the file you linked, it's best to use genfromtxt instead of loadtxt:

data = np.genfromtxt('yeast.data',dtype='O')

>>> data
array([[b'ADT1_YEAST', b'0.58', b'0.61', ..., b'0.48', b'0.22', b'MIT'],
       [b'ADT2_YEAST', b'0.43', b'0.67', ..., b'0.53', b'0.22', b'MIT'],
       [b'ADT3_YEAST', b'0.64', b'0.62', ..., b'0.53', b'0.22', b'MIT'],
       ..., 
       [b'ZNRP_YEAST', b'0.67', b'0.57', ..., b'0.56', b'0.22', b'ME2'],
       [b'ZUO1_YEAST', b'0.43', b'0.40', ..., b'0.53', b'0.39', b'NUC'],
       [b'G6PD_YEAST', b'0.65', b'0.54', ..., b'0.53', b'0.22', b'CYT']], dtype=object)

>>> data.shape
(1484, 10)

You can set the dtypes when you call genfromtxt (see the documentation), or you can change them manually afterwards, like this:

data[:,0] = data[:,0].astype(str)
data[:,1:-1]= data[:,1:-1].astype(float)
data[:,-1] = data[:,-1].astype(str)

>>> data
array([['ADT1_YEAST', 0.58, 0.61, ..., 0.48, 0.22, 'MIT'],
       ['ADT2_YEAST', 0.43, 0.67, ..., 0.53, 0.22, 'MIT'],
       ['ADT3_YEAST', 0.64, 0.62, ..., 0.53, 0.22, 'MIT'],
       ..., 
       ['ZNRP_YEAST', 0.67, 0.57, ..., 0.56, 0.22, 'ME2'],
       ['ZUO1_YEAST', 0.43, 0.4, ..., 0.53, 0.39, 'NUC'],
       ['G6PD_YEAST', 0.65, 0.54, ..., 0.53, 0.22, 'CYT']], dtype=object)
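
As a sketch of the usecols suggestion from the comments, the numeric columns and the label column can also be read in two separate passes, which gives X and y directly in the shape train_test_split expects. The example below uses a hypothetical in-memory sample (the four rows quoted in the question) instead of yeast.data, and it assumes the layout is one name column, eight numeric columns and one label column.

```python
import io

import numpy as np

# Hypothetical stand-in for yeast.data: the four rows quoted in the question.
sample = (
    "ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT\n"
    "ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT\n"
    "ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT\n"
    "AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC\n"
)

# Columns 1..8 are the numeric features. With no delimiter given,
# genfromtxt splits on any run of whitespace, so double spaces are fine.
X = np.genfromtxt(io.StringIO(sample), usecols=range(1, 9))

# Column 9 is the class label, read as text.
y = np.genfromtxt(io.StringIO(sample), usecols=9, dtype=str, encoding="utf-8")

print(X.shape)
print(y)
```

With the real file, the io.StringIO(sample) arguments would be replaced by 'yeast.data'. X is then a plain float array and y an array of strings, so both can be passed to train_test_split(X, y) as in the question.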

1 Comment

Or set dtype=None to get a structured array - 1d with fields corresponding to the file's columns.
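
A minimal sketch of the dtype=None variant, again on a hypothetical in-memory sample rather than the real file. It assumes the first column is the name, the last column is the label, and everything in between is numeric. The field names are read from the array's dtype instead of being hard-coded.

```python
import io

import numpy as np

# Hypothetical stand-in for yeast.data: the four rows quoted in the question.
sample = (
    "ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT\n"
    "ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT\n"
    "ADT3_YEAST  0.64  0.62  0.49  0.15  0.50  0.00  0.53  0.22  MIT\n"
    "AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC\n"
)

# dtype=None lets genfromtxt infer a type per column; because the columns
# mix text and floats, the result is a 1-D structured array with one
# field per column.
records = np.genfromtxt(io.StringIO(sample), dtype=None, encoding="utf-8")
names = records.dtype.names

# Stack the numeric fields (everything between name and label) into a
# regular 2-D float array, and take the last field as the labels.
X = np.column_stack([records[name] for name in names[1:-1]])
y = records[names[-1]]

print(records.shape, X.shape)
print(y)
```

The structured array keeps each column in its own type, at the cost of one extra step to assemble the 2-D feature matrix.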
