
I am trying to split my numpy array of data points into test and training sets. To do that, I'm randomly selecting rows from the array to use as the training set and the remaining are the test set.

This is my code:

import numpy

matrix = numpy.loadtxt("matrix_vals.data", delimiter=',', dtype=float)
matrix_rows, matrix_cols = matrix.shape

# training set 
randvals = numpy.random.randint(matrix_rows, size=50)
train = matrix[randvals,:]
test = numpy.delete(matrix, randvals, 0)

print("matrix.shape:", matrix.shape)
print("train.shape:", train.shape)
print("test.shape:", test.shape)

But the output I get is:

matrix.shape: (130, 14)
train.shape: (50, 14)
test.shape: (89, 14)

This is obviously wrong: the row counts of train and test should add up to the total number of rows in the matrix, but here they add up to 139 instead of 130. Can anyone help me figure out what's going wrong?

2 Answers


Because you are generating random integers with replacement, randvals will almost certainly contain repeated indices.

Indexing with repeated indices will return the same row multiple times, so matrix[randvals, :] is guaranteed to give you an output with exactly 50 rows, regardless of whether some of them are repeated.

In contrast, np.delete(matrix, randvals, 0) will only remove unique row indices, so it will reduce the number of rows only by the number of unique values in randvals.

Try comparing:

print(np.unique(randvals).shape[0] == matrix_rows - test.shape[0])
# True
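The same effect can be seen on a tiny array with a hand-picked index list. This is only a sketch: the array a and the index list idx are made up for illustration and are not part of the original data.

```python
import numpy as np

# Hypothetical 5x2 array, used only for illustration.
a = np.arange(10).reshape(5, 2)
idx = np.array([0, 2, 2])  # index 2 appears twice

picked = a[idx]                   # fancy indexing returns row 2 twice
remaining = np.delete(a, idx, 0)  # a row can only be deleted once

print(picked.shape)     # (3, 2)
print(remaining.shape)  # (3, 2)
```

Three rows were picked but only two distinct rows were deleted, so the two pieces together hold 6 rows even though a has only 5.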

To generate a vector of unique random indices between 0 and matrix_rows - 1, you could use np.random.choice with replace=False:

uidx = np.random.choice(matrix_rows, size=50, replace=False)

Then matrix[uidx].shape[0] + np.delete(matrix, uidx, 0).shape[0] == matrix_rows.
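Put together, the whole split might look like the sketch below. Since matrix_vals.data is not available here, a random 130x14 array stands in for the loaded matrix, and the seed is arbitrary; both are assumptions made only so the example runs on its own.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for reproducibility

# Stand-in for numpy.loadtxt("matrix_vals.data", ...): same shape as in the question.
matrix = np.random.rand(130, 14)
matrix_rows = matrix.shape[0]

# 50 distinct row indices, sampled without replacement
uidx = np.random.choice(matrix_rows, size=50, replace=False)

train = matrix[uidx]               # the 50 selected rows
test = np.delete(matrix, uidx, 0)  # the other 80 rows

print(train.shape, test.shape)  # (50, 14) (80, 14)
```

Because every index in uidx is distinct, the number of rows removed by np.delete equals the number of rows selected, and the two sets partition the matrix.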


1 Comment

Thanks! That was exactly the problem

Why not use scikit-learn's train_test_split function instead and avoid all the hassle?

import numpy as np
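# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# train_size=50 keeps 50 rows for training; the remaining rows go to test
train, test = train_test_split(matrix, train_size=50)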

2 Comments

Is this going to give me a random split? EDIT: Just checked the doc and it does! Thanks for the alternate solution! I didn't know about this function. I'd still like to know why my code doesn't work though
Yes. You can test it for yourself if you would like. Check the documentation on the link in my answer above for info on how it works and what parameters it takes.
