Python splitting data into random sets

Question

I would like to split my data into two random sets. I've done the first part:

ind = np.random.choice(df.shape[0], size=[int(df.shape[0]*0.7)], replace=False)
X_train = df.iloc[ind]

Now I would like to select all indices not in ind to create my test set. Could you tell me how to do this?

I thought it would be

X_test = df.iloc[-ind]

but apparently it isn't

So you want to select 70% as training data and use the remaining 30% as test data? An easier way might be to use np.random.shuffle to shuffle the indices, then take the first 70% of the shuffled indices for training and the rest for testing.
– Some Guy, Commented May 29, 2017 at 15:47

redacted · Answer · 2017-05-29 15:49:16Z

4

Check out scikit-learn's train_test_split().

Example from the docs:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

In your case you could do it like this:

larger, smaller = train_test_split(df, test_size=0.3)
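As a minimal sketch of that call on a DataFrame: the df below is a hypothetical stand-in for the asker's data, and random_state=0 is only an assumption added so the split is repeatable. train_test_split accepts a pandas DataFrame directly and returns DataFrames that keep the original index labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's df (10 rows, 2 columns)
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})

# 70% of the rows go to X_train, the remaining 30% to X_test
X_train, X_test = train_test_split(df, test_size=0.3, random_state=0)

# Each row ends up in exactly one of the two sets
print(sorted(X_train.index), sorted(X_test.index))
```

Leaving out random_state gives a different split on every run, which is closer to what the np.random.choice line in the question does.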

answered May 29, 2017 at 15:49


Some Guy · Answer · 2017-05-29 15:58:14Z

1

Another way to get a 70/30 train/test split is to generate the row indices, shuffle them, and then use the first 70% for training and the remaining 30% for testing.

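import numpy as np

ind = np.arange(df.shape[0])          # row positions 0..n-1
np.random.shuffle(ind)                # shuffles in place
n_train = int(0.7 * df.shape[0])
X_train = df.iloc[ind[:n_train], :]   # first 70% of the shuffled positions
X_test = df.iloc[ind[n_train:], :]    # remaining 30%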

I would suggest converting the pandas DataFrame to a numeric matrix and using scikit-learn's train_test_split for the splitting, unless you specifically want to do it this way.
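If a repeatable split is useful, the same shuffle-and-slice idea can be wrapped in a small helper. This is only a sketch: split_frame, train_frac and seed are hypothetical names, the df is made-up data, and it swaps np.random.shuffle for NumPy's seeded Generator (np.random.default_rng) so the same seed gives the same split.

```python
import numpy as np
import pandas as pd

def split_frame(df, train_frac=0.7, seed=None):
    """Return (train, test) DataFrames cut from one shuffled set of row positions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(df))       # shuffled copy of 0..n-1
    n_train = int(train_frac * len(df))   # number of training rows
    return df.iloc[perm[:n_train]], df.iloc[perm[n_train:]]

# Hypothetical stand-in for the asker's df
df = pd.DataFrame({"value": np.arange(10)})
X_train, X_test = split_frame(df, seed=0)
```

Both pieces stay DataFrames, so they remain easy to inspect and plot.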

edited May 29, 2017 at 15:58

answered May 29, 2017 at 15:54


1 Comment

jlt199 Over a year ago

I like this method. Thanks. I've used train_test_split before (although I had forgotten about it), but I find the data easier to check and visualise in a dataframe.

arrakis_sun · Accepted Answer · 2017-05-29 16:01:27Z

0

Try this pure-Python approach.

ind_inversed = list(set(range(df.shape[0])) - set(ind))
X_test = df.iloc[ind_inversed]
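For reference, df.iloc[-ind] from the question does not exclude anything: it negates each position, and negative positions in iloc count from the end of the frame, so it simply selects a different set of rows. A possible alternative to the set difference, sketched here with a made-up df and a seeded generator (both assumptions, not part of the question), is a boolean mask, which also keeps the test rows in their original order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's df
df = pd.DataFrame({"value": np.arange(10)})

# Same idea as the question's np.random.choice line, with a seed for repeatability
rng = np.random.default_rng(0)
ind = rng.choice(len(df), size=int(len(df) * 0.7), replace=False)

# Start with every row marked as test, then unmark the training positions
mask = np.ones(len(df), dtype=bool)
mask[ind] = False

X_train = df.iloc[ind]
X_test = df.iloc[mask]   # rows whose positions are not in ind
```

np.setdiff1d(np.arange(len(df)), ind) would give the same complement as an array of sorted positions.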

edited May 29, 2017 at 16:01

answered May 29, 2017 at 15:48


5 Comments

redacted Over a year ago

This doesn't randomize the two sets

arrakis_sun Over a year ago

It does, since I assume ind was calculated the same way as in the original question. ind_inversed contains all the other indices, the ones not in ind.

jlt199 Over a year ago

I'm getting the error "int() argument must be a string, a bytes-like object or a number, not 'set'" with this technique.

arrakis_sun Over a year ago

@jlt199, I've updated my answer. I've tested this solution, it does work.

jlt199 Over a year ago

Thank you, this works too. I've selected this one as the approved answer simply because it's the one I have ended up using. The other suggested techniques also work great and no doubt will be used in the future.
