
I am very new to numpy. I need to take a dataset and create a test set and a training set out of it. If my dataset is a numpy array of 150 rows and 4 columns (last column is the labels), what is the correct way to populate the training and test arrays with the values from the dataset, given that the datasets can be different - i.e., I don't want to manually write the shapes for test and training sets?

What I want to do is, provided a split value, it will take a dataset and fill the test and training sets with the rows of dataset, split according to that value.

I need to write a method like so:

def split(dataset, value, training, test):
    training = np.array  # this is what I am confused about how to define
    test = np.array
    if random.random() < value:
        # this is where I am confused about how to populate the arrays
        append rows to training
    else:
        append rows to test
  • What do you mean by "split value"? Commented Mar 23, 2017 at 12:00

4 Answers

1

Unless you want to split the data manually for educational purposes, I would suggest using an existing solution. That way you can be sure it is correct*. Scikit-learn has various functions to perform cross-validation or simply split data into a training and a test set with train_test_split:

Split arrays or matrices into random train and test subsets

For example, to split a data set into 80 rows for training and 20 rows for testing:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.random.randn(100, 5)  # generate random data

x_train, x_test = train_test_split(x, train_size=0.8)

print(x_train.shape)  # (80, 5)
print(x_test.shape)  # (20, 5)

*At least the function will be implemented correctly. It is not necessarily the correct function to use - usually there are many ways to split data into train and test sets. Some of them can be more appropriate than others, depending on the specifics of the application.


3 Comments

Thank you, this was the best solution. Basically I can pass my dataset as x in your code and split it like this: train, test = train_test_split(data, train_size = split_value)
btw the cross_validation is deprecated, so I used model_selection instead
Right, I tend to forget that my version of scikit learn is rather old. Answer updated, thanks.
0

If you want to split your data at random into train and test, you can do it in the following way:

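import numpy as np
from sklearn.model_selection import train_test_split

m = 150      # number of rows
n = 4        # number of columns; the last one holds the labels
value = 0.2  # example value: fraction of rows used for testing
data = np.random.randint(5, size=[m, n])
# Features are the first n-1 columns, labels are the last column.
X_train, X_test, y_train, y_test = train_test_split(data[:, :n-1], data[:, n-1], test_size=value)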

In the above code, value is the fraction of the data (a float between 0 and 1) that will be used as test data.

If you want to split relative to the value rather than at random, which seems to be the case according to your revised code:

if random.random() < value:
    # this is where I am confused about how to populate the arrays
    append rows to training
else:
    append rows to test

you can do

data_train=data[data[:,n-1]<value]
data_test=data[data[:,n-1]>=value]
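To make the behaviour of this threshold split concrete, here is a small sketch on toy data. The array contents and the threshold of 1 are made up for illustration; the only assumption carried over from the question is that the last column holds numeric labels. Note that this split is deterministic: rows are assigned by their label value, not by a random draw.

```python
import numpy as np

# Toy data (hypothetical): 3 feature columns followed by a numeric label column.
data = np.array([
    [5.1, 3.5, 1.4, 0],
    [7.0, 3.2, 4.7, 1],
    [6.3, 3.3, 6.0, 2],
    [4.9, 3.0, 1.4, 0],
])
n = data.shape[1]  # number of columns, read from the array instead of hard-coded
value = 1          # hypothetical threshold on the label column

below = data[:, n - 1] < value   # boolean array, one entry per row
data_train = data[below]         # rows whose label is below the threshold
data_test = data[~below]         # all remaining rows

print(data_train.shape, data_test.shape)  # (2, 4) (2, 4)
```

Because the second mask is the negation of the first, every row ends up in exactly one of the two arrays.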

Comments

0

The implementation depends on how you want to split your data into training and test sets. A simple way is to split randomly with a boolean mask.

import numpy as np

data = np.random.rand(150, 4)
mask = np.random.rand(len(data)) < 0.5  # boolean array: each entry is True with probability 0.5
train = data[mask]
test = data[~mask]

This splits the data roughly in half: each row lands in train with probability 0.5, so the two sets are only approximately equal in size. You can vary the expected size of each set by changing 0.5.
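A random mask gives set sizes that vary from run to run. If you need the split wrapped in a function, or need an exact number of training rows, something along these lines may work. This is only a sketch: the function names split_by_mask and split_exact and the seed parameter are my own, not from the question, and the array shapes are taken from the input rather than written by hand.

```python
import numpy as np

def split_by_mask(dataset, value, seed=None):
    """Send each row to train with probability `value`; sizes vary with the draw."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(dataset)) < value
    return dataset[mask], dataset[~mask]

def split_exact(dataset, value, seed=None):
    """Shuffle the row indices, then cut so train gets round(value * n_rows) rows."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(dataset))
    n_train = int(round(value * len(dataset)))
    return dataset[indices[:n_train]], dataset[indices[n_train:]]

data = np.arange(600).reshape(150, 4)  # stand-in for a 150 x 4 dataset
train, test = split_exact(data, 0.8, seed=0)
print(train.shape, test.shape)  # (120, 4) (30, 4)
```

Both functions return new arrays, so there is no need to pass training and test in as arguments or to declare their shapes beforehand.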

Comments

0

You can simply do something like

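import numpy as np

n = 4    # number of input features (columns)
m = 120  # number of patterns (rows) in the training set

data = np.loadtxt('iris.txt')
train_X = data[:m, :n]
train_Y = data[:m, n:]
test_X = data[m:, :n]
test_Y = data[m:, n:]  # labels are the columns from n onward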

where n is the number of input dimensions (feature columns) and m is the number of patterns in the training set.
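One caveat with plain slicing: if the rows in the file are sorted by class, as they are in many copies of the Iris data, the first m rows will not contain a representative mix of labels. A possible workaround is to shuffle the rows before slicing. The sketch below uses synthetic data in place of the file, so the array contents and the fixed seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the example is reproducible

# Synthetic stand-in for the file: 150 rows, 4 feature columns plus 1 label column,
# with the labels sorted by class (50 rows each of 0, 1, 2).
data = np.column_stack([rng.random((150, 4)), np.repeat([0, 1, 2], 50)])

n = 4    # number of feature columns
m = 120  # number of training rows

# Reorder whole rows so features stay attached to their labels.
shuffled = data[rng.permutation(len(data))]

train_X, train_Y = shuffled[:m, :n], shuffled[:m, n:]
test_X, test_Y = shuffled[m:, :n], shuffled[m:, n:]

print(train_X.shape, train_Y.shape)  # (120, 4) (120, 1)
print(test_X.shape, test_Y.shape)    # (30, 4) (30, 1)
```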

2 Comments

The way I read the question, the OP wants to split on rows, not columns: x_test = data[:n, :] and x_train = data[n:, :], but splitting along columns may become interesting for separating features from labels.
This just splits the dataset into features and labels, I need to split the dataset according to an input value into 2 sets...
