0

I would like to split my data into two random sets. I've done the first part:

ind = np.random.choice(df.shape[0], size=[int(df.shape[0]*0.7)], replace=False)
X_train = df.iloc[ind]

Now I would like to select all the indices not in ind to create my test set. Can you please tell me how to do this?

I thought it would be

X_test = df.iloc[-ind]

but apparently it isn't

2
  • So you want to select 70% as training data and use the remaining 30% as test data? An easier way might be to use np.random.shuffle to shuffle the indices, then use the first 70% of the shuffled indices for training and the rest for testing. Commented May 29, 2017 at 15:47
  • Yes, that's exactly what I want Commented May 29, 2017 at 15:53

3 Answers

4

Check out scikit-learn's train_test_split().

Example from the docs:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

In your case you could do it like this:

larger, smaller = train_test_split(df, test_size=0.3)
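
As a minimal sketch of that call on a DataFrame: train_test_split accepts a pandas DataFrame directly and returns two DataFrames that keep the original index labels. The toy df and the random_state=0 seed below are assumptions for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's df.
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

# 70% of the rows go to X_train, the remaining 30% to X_test.
# random_state is optional; it only makes the split reproducible.
X_train, X_test = train_test_split(df, test_size=0.3, random_state=0)

# Every row ends up in exactly one of the two sets.
print(len(X_train), len(X_test))
print(X_train.index.intersection(X_test.index).empty)
```

Because both results are still DataFrames, they can be inspected the same way as the original data.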

1

Another way to get a 70/30 train/test split is to generate the row indices, shuffle them, and then split them 70/30.

ind = np.arange(df.shape[0])
np.random.shuffle(ind)  # shuffles ind in place
n_train = int(0.7 * df.shape[0])
X_train = df.iloc[ind[:n_train], :]
X_test = df.iloc[ind[n_train:], :]

I would suggest converting the pandas DataFrame to a numeric matrix and using scikit-learn's train_test_split to do the splitting, unless you really want to do it this way. (train_test_split also accepts a DataFrame directly, so the conversion is optional.)
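
The shuffle-and-slice idea can also be wrapped in a small helper so the split is reproducible. This is only a sketch: the function name shuffle_split, its parameters, the toy df, and the use of np.random.default_rng with a seed are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def shuffle_split(df, train_frac=0.7, seed=None):
    """Return (train, test) DataFrames from a random permutation of row positions."""
    rng = np.random.default_rng(seed)    # seeded generator for reproducibility
    perm = rng.permutation(len(df))      # shuffled positions 0..n-1
    n_train = int(train_frac * len(df))  # truncate, as in the shuffle answer
    return df.iloc[perm[:n_train]], df.iloc[perm[n_train:]]


# Hypothetical toy data standing in for the asker's df.
df = pd.DataFrame({"a": np.arange(10)})
X_train, X_test = shuffle_split(df, seed=0)
print(len(X_train), len(X_test))
```

Since both pieces are slices of one permutation, no row can land in both sets and none is left out.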

1 Comment

I like this method. Thanks. I've used train_test_split before (although I had forgotten about it), but I find the data easier to check and visualise in a dataframe.
0

Try this pure-Python approach.

ind_inversed = list(set(range(df.shape[0])) - set(ind))
X_test = df.iloc[ind_inversed]
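
For context on why df.iloc[-ind] does not work: negating a NumPy array negates each position, and iloc reads negative positions as counting from the end of the frame, so it picks another set of rows of the same size instead of the complement. Two NumPy-based alternatives to the set difference are sketched below; the toy df is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the asker's df.
df = pd.DataFrame({"a": np.arange(10)})
n = df.shape[0]

# Training positions, chosen as in the question.
ind = np.random.choice(n, size=int(n * 0.7), replace=False)
X_train = df.iloc[ind]

# Option 1: positions in 0..n-1 that are not in ind (returned sorted).
test_ind = np.setdiff1d(np.arange(n), ind)
X_test = df.iloc[test_ind]

# Option 2: boolean mask that is True only for rows not used for training.
mask = np.ones(n, dtype=bool)
mask[ind] = False
X_test_mask = df.iloc[mask]

print(X_test.equals(X_test_mask))
```

Both options return the test rows in their original order, which is usually fine because the randomness already comes from how ind was drawn.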

5 Comments

This doesn't randomize the two sets.
It does, since I assume ind was calculated the same way as in the original question. ind_inversed represents all the other indices not in ind.
I'm getting the error "int() argument must be a string, a bytes-like object or a number, not 'set'" with this technique.
@jlt199, I've updated my answer. I've tested this solution, and it does work.
Thank you, this works too. I've selected this one as the accepted answer simply because it's the one I ended up using. The other suggested techniques also work great and will no doubt be used in the future.
