Deleting rows from numpy array not working

Question

I am trying to split my numpy array of data points into test and training sets. To do that, I'm randomly selecting rows from the array to use as the training set, and the remaining rows become the test set.

This is my code:

import numpy

matrix = numpy.loadtxt("matrix_vals.data", delimiter=',', dtype=float)
matrix_rows, matrix_cols = matrix.shape

# training set
randvals = numpy.random.randint(matrix_rows, size=50)
train = matrix[randvals,:]
test = numpy.delete(matrix, randvals, 0)

print("matrix.shape:", matrix.shape)
print("train.shape:", train.shape)
print("test.shape:", test.shape)

But the output I get is:

matrix.shape: (130, 14)
train.shape: (50, 14)
test.shape: (89, 14)

This is obviously wrong: the number of rows in train and test should add up to the total number of rows in the matrix (130), but here they add up to 50 + 89 = 139. Can anyone help me figure out what's going wrong?

ali_m · Accepted Answer · 2016-02-05 21:34:08Z

4

Because you are generating random integers with replacement, randvals will almost certainly contain repeat indices.

Indexing with repeated indices will return the same row multiple times, so matrix[randvals, :] is guaranteed to give you an output with exactly 50 rows, regardless of whether some of them are repeated.

In contrast, np.delete(matrix, randvals, 0) removes each listed row only once, no matter how many times its index appears in randvals, so it reduces the number of rows only by the number of unique values in randvals.

Try comparing:

print(np.unique(randvals).shape[0] == matrix_rows - test.shape[0])
# True
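
The difference is easy to see on a toy array. The sketch below uses a hypothetical 5-row array and a hand-picked index list with one repeat (neither comes from the question's data):

```python
import numpy as np

# Hypothetical 5x2 array standing in for the real data
demo = np.arange(10).reshape(5, 2)

# Index list with a repeat: row 1 appears twice
idx = np.array([1, 3, 1])

picked = demo[idx]                        # fancy indexing keeps the duplicate
remaining = np.delete(demo, idx, axis=0)  # rows 1 and 3 are each removed once

print(picked.shape)     # (3, 2)
print(remaining.shape)  # (3, 2)
print(picked.shape[0] + remaining.shape[0])  # 6, one more than demo's 5 rows
```

The surplus (6 versus 5) equals the number of repeated entries in idx, which is the same effect that produces 139 instead of 130 in the question.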

To generate a vector of unique random indices between 0 and matrix_rows - 1, you could use np.random.choice with replace=False:

uidx = np.random.choice(matrix_rows, size=50, replace=False)

Then matrix[uidx].shape[0] + np.delete(matrix, uidx, 0).shape[0] == matrix_rows.
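
As a minimal sketch of the whole split, assuming random stand-in data with the same shape as in the question (the seed and the generated array are illustrative only), a boolean mask can be used as an alternative to np.delete:

```python
import numpy as np

np.random.seed(0)                  # seeded only to make the sketch repeatable
matrix = np.random.rand(130, 14)   # stand-in for the data loaded from file
matrix_rows = matrix.shape[0]

# 50 distinct row indices, sampled without replacement
uidx = np.random.choice(matrix_rows, size=50, replace=False)

# Boolean mask: True marks training rows, False marks test rows
is_train = np.zeros(matrix_rows, dtype=bool)
is_train[uidx] = True

train = matrix[is_train]
test = matrix[~is_train]

print(train.shape, test.shape)  # (50, 14) (80, 14)
```

Note that matrix[is_train] returns the training rows in their original order rather than in the order of uidx; use matrix[uidx] if the shuffled order matters.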

edited Feb 5, 2016 at 21:34

answered Feb 5, 2016 at 21:29

ali_m


1 Comment

SanjanaS801 Over a year ago

Thanks! That was exactly the problem

Charlie Haley · Accepted Answer · 2016-02-05 21:21:04Z

3

Why not use scikit-learn's train_test_split function instead and avoid all the hassle?

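import numpy as np
# sklearn.cross_validation was removed in later scikit-learn releases; use sklearn.model_selection
from sklearn.model_selection import train_test_split

# matrix is the array from the question. test_size is the fraction of rows sent to the
# test set; pass train_size=50 instead for a 50-row training set as in the question.
train, test = train_test_split(matrix, test_size=50.0/130.0)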

answered Feb 5, 2016 at 21:21

Charlie Haley


2 Comments

SanjanaS801 Over a year ago

Is this going to give me a random split? EDIT: Just checked the doc and it does! Thanks for the alternate solution! I didn't know about this function. I'd still like to know why my code doesn't work though

Charlie Haley Over a year ago

Yes. You can test it for yourself if you would like. Check the documentation on the link in my answer above for info on how it works and what parameters it takes.
