I have a csv file without headers which I'm importing into python using pandas. The last column is the target class, while the rest of the columns are pixel values for images. How can I go ahead and split this dataset into a training set and a testing set using pandas (80/20)?

Also, once that is done how would I also split each of those sets so that I can define x (all columns except the last one), and y (the last column)?

I've imported my file using:

dataset = pd.read_csv('example.csv', header=None, sep=',')

Thanks

3 Answers

I'd recommend using sklearn's train_test_split:

from sklearn.model_selection import train_test_split
# for older versions import from sklearn.cross_validation
# from sklearn.cross_validation import train_test_split
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
kwargs = dict(test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, **kwargs)
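As a quick sanity check, the split preserves the 80/20 row ratio. Here's a toy sketch with a small synthetic frame standing in for the real CSV (the column count and values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the headerless CSV: 4 "pixel" columns + 1 target column
rng = np.random.RandomState(0)
dataset = pd.DataFrame(np.hstack([rng.rand(100, 4),
                                  rng.randint(0, 2, (100, 1))]))

# x = everything except the last column, y = the last column
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```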

You can try this.

Separating the target class from the rest:

pixel_values = dataset[dataset.columns[:-1]]   # all columns except the last
target_class = dataset[dataset.columns[-1:]]   # the last column

Now to create test and training samples:

I would just use numpy's rand:

mask = np.random.rand(len(pixel_values)) < 0.8
train = pixel_values[mask]
test = pixel_values[~mask]

Now you have training and test samples in train and test with an 80:20 ratio.
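One caveat with this approach, sketched below on an illustrative frame and seed: build the mask once and apply it to the whole frame, so the rows of the features and the target stay aligned after the split.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 10 rows, last column plays the role of the target
dataset = pd.DataFrame(np.arange(50).reshape(10, 5))

np.random.seed(1)
mask = np.random.rand(len(dataset)) < 0.8  # True for roughly 80% of rows

train, test = dataset[mask], dataset[~mask]

# Split features/target after the row split so they remain aligned
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

print(len(train) + len(test) == len(dataset))  # True
```

Note that np.random.rand gives only approximately 80% of rows per split; the exact count varies from run to run.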

Thanks Randhawa. I'm not too sure what you're doing with rand and the mask, but after fixing a few things I did manage to split my target class and my input features. I do feel like using sklearn's built-in cross-validation split is a much better choice though.
rand generates a random float between 0 and 1 for each row of the dataframe, and the mask retains the roughly 80% of rows whose value falls below 0.8.

You can simply do:

choices = np.in1d(dataset.index,
                  np.random.choice(dataset.index, int(0.8 * len(dataset)),
                                   replace=False))
training = dataset[choices]
testing = dataset[~choices]

Then, to pass it as x and y to Scikit-Learn:

scikit_func(x=training.iloc[:,0:-1], y=training.iloc[:,-1])

Let me know if this doesn't work.
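For reference, sampling indices with replace=False gives an exact 80/20 row count, unlike the probabilistic mask. A toy run on made-up data, using np.isin (the modern spelling of np.in1d):

```python
import numpy as np
import pandas as pd

# Illustrative frame: 20 rows, 3 columns
dataset = pd.DataFrame(np.arange(60).reshape(20, 3))

np.random.seed(0)
chosen = np.random.choice(dataset.index, int(0.8 * len(dataset)), replace=False)
choices = np.isin(dataset.index, chosen)  # boolean row mask

training = dataset[choices]
testing = dataset[~choices]

print(len(training), len(testing))  # 16 4
```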

Thanks Kartik. I imported numpy as np and defined training and testing as you describe above. When defining testing, though, I get: ValueError: Lengths must match to compare (raised in _evaluate_compare).
Sorry for the mistake, I forgot that training will have fewer rows than dataset, which caused the error. I have tested the code this time, and it should work. Again, I apologize.
Hmm, X is saying there are 0 columns instead of the correct amount, which should be 1024 ([3173 rows x 0 columns]).
Use .iloc instead of .ix, like in the edited answer.
Thanks Kartik, that definitely fixed the problem. With that said, I'll be going with @ayhan's answer because he recommended sklearn's train_test_split to split the data, which is the best way in my opinion, and his use of .iloc seems more appropriate. Thank you for your help though!
