0

I'm relatively new to using sklearn and Python for data analysis and am trying to run a linear regression on a dataset that I loaded from a .csv file.

I passed my data through train_test_split without any issues, but when I try to fit my training data I receive this error: ValueError: Expected 2D array, got 1D array instead: ... Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The error is raised at model = lm.fit(X_train, y_train)

Because I'm new to these packages, I'm trying to determine whether this is caused by not converting my imported CSV to a pandas DataFrame before running the regression, or by something else.

My CSV is in the format of:

Month,Date,Day of Week,Growth,Sunlight,Plants
7,7/1/17,Saturday,44,611,26
7,7/2/17,Sunday,30,507,14
7,7/5/17,Wednesday,55,994,25
7,7/6/17,Thursday,50,1014,23
7,7/7/17,Friday,78,850,49
7,7/8/17,Saturday,81,551,50
7,7/9/17,Sunday,59,506,29

Here is how I set up the regression:

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt


organic = pd.read_csv("linear-regression.csv")

organic.columns
Index(['Month', 'Date', 'Day of Week', 'Growth', 'Sunlight', 'Plants'], dtype='object')

# Set the dependent (Growth) and independent (Sunlight) variables
y = organic['Growth']
X = organic['Sunlight']

# Test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
(192,) (49,)
(192,) (49,)

lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)

# Error pointing to an array with values from Sunlight [611, 507, 994, ...]
1
  • For the sake of reproducibility, please delete the 'g' in Sunglight when you define your training data X. Commented Mar 27, 2018 at 16:14

3 Answers

5

You just need to change your last two lines to

lm = linear_model.LinearRegression()
model = lm.fit(X_train.values.reshape(-1,1), y_train)

and the model will fit. The reason for this is that the linear model from sklearn expects

X : numpy array or sparse matrix of shape [n_samples,n_features]

So with a single feature, the training data must have shape [n_samples, 1], which for the training split in the question is [192, 1].
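As a rough illustration of the shape change, here is a minimal sketch that uses only the seven sample rows shown in the question rather than the full CSV. The variable names sunlight and growth are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The seven sample rows from the question (not the full dataset)
sunlight = np.array([611, 507, 994, 1014, 850, 551, 506])
growth = np.array([44, 30, 55, 50, 78, 81, 59])

# -1 lets NumPy infer the number of rows; 1 is the number of features
X = sunlight.reshape(-1, 1)
print(sunlight.shape, X.shape)

# fit() accepts the 2D X; y can stay 1D
model = LinearRegression().fit(X, growth)
print(model.coef_.shape)
```

The first print should show the 1D shape next to the 2D one, which is the difference the error message is complaining about.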


1

You are only using one feature, so it tells you what to do within the error:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature.

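In scikit-learn the feature matrix X always has to be 2D, with shape (n_samples, n_features), even when there is only one feature. The target y can stay 1D.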

(Don't forget the typo in X = organic['Sunglight'])
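One way to avoid the reshape altogether, offered here as an alternative that the original post does not use, is to select the column with double brackets so that pandas returns a one-column DataFrame instead of a Series. The small frame below is built from the question's sample rows and stands in for the real CSV.

```python
import pandas as pd

# Stand-in for pd.read_csv("linear-regression.csv"), sample rows only
organic = pd.DataFrame({
    "Growth": [44, 30, 55, 50, 78, 81, 59],
    "Sunlight": [611, 507, 994, 1014, 850, 551, 506],
})

X_series = organic["Sunlight"]    # single brackets: 1D Series
X_frame = organic[["Sunlight"]]   # double brackets: 2D DataFrame

print(X_series.shape, X_frame.shape)
```

Passing X_frame to train_test_split should then give 2D training and test sets directly.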


0

When you pass the data to train_test_split(X, y, test_size=0.2), it returns X_train and X_test as pandas Series with shapes (192,) and (49,). As mentioned in the previous answers, sklearn expects X_train and X_test to be matrices of shape [n_samples, n_features]. You can simply convert the Series X_train and X_test to pandas DataFrames, which changes their shapes to (192, 1) and (49, 1).

lm = linear_model.LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)
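The same conversion is needed again at prediction time. Here is a minimal end-to-end sketch, assuming the question's sample rows in place of the full CSV; random_state=0 is an assumption added only to make the split repeatable.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in for the CSV, using the sample rows from the question
organic = pd.DataFrame({
    "Growth": [44, 30, 55, 50, 78, 81, 59],
    "Sunlight": [611, 507, 994, 1014, 850, 551, 506],
})
y = organic["Growth"]
X = organic["Sunlight"]

# random_state is not in the original code; it only fixes the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lm = LinearRegression()
model = lm.fit(X_train.to_frame(), y_train)

# predict() needs the same 2D layout that fit() received
predictions = model.predict(X_test.to_frame())
print(X_train.to_frame().shape, predictions.shape)
```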
