I have a csv file without headers which I'm importing into python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I go ahead and split this dataset into a training set and a testing set using pandas (80/20)?

Also, once that is done how would I also split each of those sets so that I can define x (all columns except the last one), and y (the last column)?

I've imported my file using:

dataset = pd.read_csv('example.csv', header=None, sep=',')

Thanks

3 Answers

I'd recommend using sklearn's train_test_split:

from sklearn.model_selection import train_test_split
# for older versions import from sklearn.cross_validation
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
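As a quick sanity check, the split preserves the 80/20 row ratio. Here's a toy sketch with a small synthetic frame standing in for the real CSV (the column count and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the headerless CSV: 4 "pixel" columns + 1 target column
rng = np.random.RandomState(0)
dataset = pd.DataFrame(np.hstack([rng.rand(100, 4),
                                  rng.randint(0, 2, (100, 1))]))

# x = everything except the last column, y = the last column
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```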

You can try this.

Separating the target class from the rest:

pixel_values = dataset[dataset.columns[:-1]]   # all columns except the last
target_class = dataset[dataset.columns[-1:]]   # the last column

Now to create test and training samples:

I would just use numpy's rand:

mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]

Now you have training and test samples in train and test with an 80:20 ratio.
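One caveat with this approach, sketched below on an illustrative frame and seed: build the mask once and apply it to the whole frame, so the rows of the features and the target stay aligned after the split.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 10 rows, last column plays the role of the target
dataset = pd.DataFrame(np.arange(50).reshape(10, 5))

np.random.seed(1)
mask = np.random.rand(len(dataset)) < 0.8  # True for roughly 80% of rows

train, test = dataset[mask], dataset[~mask]

# Split features/target after the row split so they remain aligned
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

print(len(train) + len(test) == len(dataset))  # True
```

Note that np.random.rand gives only approximately 80% of rows per split; the exact count varies from run to run.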

Thanks Randhawa. I'm not too sure what you're doing with rand and the mask, but after fixing a few things I did manage to split my target class and my input features. I do feel like using sklearn's built-in cross-validation split is a much better choice though.
rand generates a random float between 0 and 1 for each row of the dataframe, and the mask retains the roughly 80% of rows whose value falls below 0.8.

You can simply do:

choices = np.in1d(dataset.index,
                  np.random.choice(dataset.index, int(0.8 * len(dataset)),
                                   replace=False))
training = dataset[choices]
testing = dataset[~choices]

Then, to pass it as x and y to Scikit-Learn:

scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])

Let me know if this doesn't work.
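For reference, sampling indices with replace=False gives an exact 80/20 row count, unlike the probabilistic mask. A toy run on made-up data, using np.isin (the modern spelling of np.in1d):

```python
import numpy as np
import pandas as pd

# Illustrative frame: 20 rows, 3 columns
dataset = pd.DataFrame(np.arange(60).reshape(20, 3))

np.random.seed(0)
chosen = np.random.choice(dataset.index, int(0.8 * len(dataset)), replace=False)
choices = np.isin(dataset.index, chosen)  # boolean row mask

training = dataset[choices]
testing = dataset[~choices]

print(len(training), len(testing))  # 16 4
```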

Thanks Kartik. I imported numpy as np and defined training and testing as you describe above. When defining testing, though, I get: ValueError: Lengths must match to compare (raised in _evaluate_compare).
Sorry for the mistake, I forgot that training will have fewer rows than dataset, which caused the error. I have tested the code this time, and it should work. Again, I apologize.
Hmm, X is saying there are 0 columns instead of the correct amount, which should be 1024 ([3173 rows x 0 columns]).
Use .iloc instead of .ix, like in the edited answer.
Thanks Kartik, that definitely fixed the problem. With that said, I'll be going with @ayhan's answer because he recommended sklearn's train_test_split to split the data, which is the best way in my opinion, and his use of .iloc seems more appropriate. Thank you for your help though!
