Error when using scikit-learn linear regression model in Python

Question

I am working through the linear regression example given on this scikit-learn page, using Python in an IPython notebook. My dataset looks like this:

KR,Alabama,97071129.11369997,186026.0,63.14000000000001,923.8600000000001
KR,Alabama,67445447.0459,187201.0,94.71,1385.79
KR,Alabama,66332319.626799986,186611.0,121.77000000000001,1781.73
KR,Alabama,75868163.65490001,188002.0,171.38,2507.62
KR,Alabama,104626353.3301,192055.0,62.300000000000004,924.2800000000001
KR,Alabama,82482715.69460002,193070.0,93.45,1386.4199999999998
KR,Alabama,81095032.9574,196819.0,120.15,1782.5400000000002
KR,Alabama,70076833.3433,196738.0,169.1,2508.76
KR,Alabama,111183092.64729999,195091.0,64.82000000000001,946.2600000000001
KR,Alabama,90909063.08510002,197789.0,97.22999999999999,1419.3899999999999
KR,Alabama,90934598.2206,201541.0,125.01,1824.93
KR,Alabama,107374172.93309999,203338.0,175.94,2568.42
KR,Arizona,1126677862.6940002,264600.0,63.14000000000001,923.8600000000001
KR,Arizona,838166771.0832,268153.0,94.71,1385.79
KR,Arizona,956037530.2797,268429.0,121.77000000000001,1781.73
KR,Arizona,984328946.5951,268792.0,171.38,2507.62
KR,Arizona,1257812174.3229997,270547.0,62.300000000000004,924.2800000000001
KR,Arizona,883093705.2885998,272764.0,93.45,1386.4199999999998
KR,Arizona,880652373.4425,276307.0,120.15,1782.5400000000002
KR,Arizona,910039260.961,279318.0,169.1,2508.76
KR,Arizona,1226385050.8268003,279983.0,64.82000000000001,946.2600000000001
KR,Arizona,1087126209.1170998,281409.0,97.22999999999999,1419.3899999999999
KR,Arizona,934971659.6374002,286590.0,125.01,1824.93
KR,Arizona,986475815.6928002,288644.0,175.94,2568.42
KR,California,7830776748.968867,2085424.0,63.14000000000001,923.8600000000001
KR,California,5999727784.478112,2103999.0,94.71,1385.79
KR,California,5804539962.436825,2138267.0,121.77000000000001,1781.73
KR,California,6547521069.504964,2172849.0,171.38,2507.62
KR,California,7945616026.08499,2157455.0,62.300000000000004,924.2800000000001
KR,California,6068949829.714768,2182688.0,93.45,1386.4199999999998
KR,California,5767177648.936179,2227205.0,120.15,1782.5400000000002
KR,California,6292965589.900258,2284617.0,169.1,2508.76
KR,California,8805205589.885035,2254347.0,64.82000000000001,946.2600000000001
KR,California,6855033176.090414,2292655.0,97.22999999999999,1419.3899999999999
KR,California,6930741761.859158,2341652.0,125.01,1824.93
KR,California,6916313224.326924,2357810.0,175.94,2568.42

In this dataset there are 12 records for each state within each company_id. For each company_id, and for each state within that company_id, I want to form a training set of 10 records and a test set of 2 records.

Here's my current updated code:

from sklearn import linear_model
import csv
import numpy as np


def process_chunk(chuk):

    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    # To divide into training and test sets, the 10th and 11th lines go in the test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9 or count == 10:
            test_set_feature_list.append(np.array(line[3:5], dtype=float))
            test_set_label_list.append(np.array(line[2], dtype=float))
        else:
            training_set_feature_list.append(np.array(line[3:5], dtype=float))
            training_set_label_list.append(np.array(line[2], dtype=float))

        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    print(regr.predict(test_set_feature_list))



# Load and parse the data
file_read = open('file.csv', 'r')

reader = csv.reader(file_read)

chunk, chunksize = [], 12

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder

process_chunk(chunk)

When I execute this code, I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 10]

It is raised at the line regr.fit(training_set_feature_list, training_set_label_list).

What is the mistake here, and how can I resolve it?
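
For context, scikit-learn's fit(X, y) expects X to be two-dimensional, with shape (n_samples, n_features), and y to have one entry per sample. The [ 1 10] in the message means one array was seen as having 1 sample and the other 10. One plausible cause (an assumption, since the code that raised the error is not shown) is a flat list of 10 feature values being treated as a single row. A minimal NumPy sketch of that shape check, with illustrative variable names and placeholder labels:

```python
import numpy as np

# Ten values from the fourth column of the sample data above.
flat_features = np.array([186026.0, 187201.0, 186611.0, 188002.0, 192055.0,
                          193070.0, 196819.0, 196738.0, 195091.0, 197789.0])
labels = np.arange(10, dtype=float)  # placeholder labels, one per sample

# A 1-D array has no separate "samples" and "features" axes.
print(flat_features.shape)

# Treated as a single row: 1 sample with 10 features, which does not match 10 labels.
as_one_row = flat_features.reshape(1, -1)
print(as_one_row.shape[0], labels.shape[0])

# Reshaped to a column: 10 samples with 1 feature each, which matches 10 labels.
as_column = flat_features.reshape(-1, 1)
print(as_column.shape[0], labels.shape[0])
```

Checking X.shape and y.shape just before the call to fit is usually the quickest way to spot this kind of mismatch.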

UPDATE: After applying the suggestion, this is my current output, which contains some weird-looking numbers:

[  1.01999724e+08   1.03189615e+08]
[  1.08523268e+09   1.05427929e+09]
[  7.77478189e+09   7.56564733e+09]
[  8.87437438e+08   8.77578642e+08]
[  1.62710654e+08   1.51921308e+08]
[  4.19988737e+09   4.00902600e+09]
[  7.70222690e+08   7.31282229e+08]
[  1.60301569e+09   1.51976018e+09]
[  9.31799698e+08   9.28243073e+08]
[ 51831980.55257727  53136008.17725636]
[  1.92207016e+08   1.85232202e+08]
[  3.82247927e+08   3.33879176e+08]
[  1.35276200e+09   1.34525871e+09]
[  1.62557223e+09   1.53895636e+09]
[  2.12376099e+09   2.08585811e+09]
[ 61386995.4473462   58500866.29796618]
[  3.18458112e+08   3.09384959e+08]
[  4.90038249e+08   4.87984249e+08]

Looks like you have a different number of samples and labels. E.g. the size of training_set_feature_list is not the same as that of training_set_label_list.
– Ibraim Ganiev, Commented Sep 17, 2015 at 12:26
Also, for such tasks you can use the pandas package and group your dataframe by company_id and state.
– Ibraim Ganiev, Commented Sep 17, 2015 at 12:56
@Olologin can you show how I should do it? Also, training_set_feature_list and training_set_label_list have the same number of records because they are formed together.
– Jason Donnald, Commented Sep 17, 2015 at 13:11
Can you share your CSV data, or maybe a CSV with just the first few lines? I want to debug it myself.
– Ibraim Ganiev, Commented Sep 17, 2015 at 15:13
@Olologin I have updated my post above to include some of the CSV data. Please check it.
– Jason Donnald, Commented Sep 17, 2015 at 15:36
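
The per-group split discussed in the comments above can also be sketched without pandas, using only the standard library. This is a minimal illustration, not the commenter's method. It assumes the rows are already sorted so that each (company_id, state) group is contiguous, as in the sample data, and that the last 2 rows of each group form the test set. The helper name split_groups and the shortened inline sample (with made-up profit values) are hypothetical:

```python
import csv
import io
from itertools import groupby

def split_groups(rows, n_test=2):
    """Yield ((company_id, state), train_rows, test_rows) for each group.

    Assumes rows are sorted so that each (company_id, state) group is contiguous.
    """
    for key, group in groupby(rows, key=lambda r: (r[0], r[1])):
        group = list(group)
        yield key, group[:-n_test], group[-n_test:]

# Tiny inline stand-in for the CSV file; profit values are made up for illustration.
sample = io.StringIO(
    "KR,Alabama,97.0,186026.0,63.14,923.86\n"
    "KR,Alabama,67.4,187201.0,94.71,1385.79\n"
    "KR,Alabama,66.3,186611.0,121.77,1781.73\n"
    "KR,Alabama,75.8,188002.0,171.38,2507.62\n"
    "KR,Arizona,1126.6,264600.0,63.14,923.86\n"
    "KR,Arizona,838.1,268153.0,94.71,1385.79\n"
    "KR,Arizona,956.0,268429.0,121.77,1781.73\n"
    "KR,Arizona,984.3,268792.0,171.38,2507.62\n"
)

splits = list(split_groups(csv.reader(sample)))
for key, train, test in splits:
    # Each group here has 4 rows, so 2 go to training and 2 to testing.
    print(key, len(train), len(test))
```

With the real file, each group has 12 rows, so the same call would give 10 training rows and 2 test rows per group.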

Sahil M · Accepted Answer · 2015-09-18 05:28:30Z

1

I think your data contains strings, and that is why it complains. There were some other problems as well, so I am posting a corrected version.

from sklearn import linear_model
import csv
import numpy as np
import matplotlib.pyplot as plt

def process_chunk(chuk):

    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    # Divide into training and test sets
    # Drop the first 2 columns (company_id, state); what remains is
    # profit (0), attr1 (1), attr2 (2), attr3 (3)
    chuk = [row[2:] for row in chuk]
    chunk = np.array(chuk, dtype=float)  # Convert strings to a float array
    ########## Testing dataset: rows from index 30 onward ##########
    test_set_feature_list = chunk[30:, 3:5]  # column 3 onward (only attr3 remains)
    test_set_label_list = chunk[30:, 2]  # column 2 (attr2)

    ########## Training dataset: the first 30 rows ##########
    training_set_feature_list = chunk[:30, 3:5]
    training_set_label_list = chunk[:30, 2]

    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    predictedTestSet = regr.predict(test_set_feature_list)

    # The coefficients
    print('Coefficients: {}'.format(regr.coef_))
    # Square of the mean residual (the mean squared error would instead be
    # np.mean((predictedTestSet - test_set_label_list) ** 2))
    print('Residual sum of squares: %.2f' % (np.mean(predictedTestSet - test_set_label_list) ** 2))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % regr.score(test_set_feature_list, test_set_label_list))
    X = [x for (y,x) in sorted(zip(test_set_label_list, predictedTestSet))]
    Y = [y for (y,x) in sorted(zip(test_set_label_list, predictedTestSet))]
    plt.plot(range(len(X)),X , 'r.', label='predicted')    
    plt.plot(range(len(Y)),Y , 'g-',label='test_set')    
    plt.legend()
    plt.show()
    return predictedTestSet


# Load and parse the data
file_read = open('file1.csv', 'r')

reader = csv.reader(file_read)

chunk = []

# Skip the header row (i == 0) and collect all remaining rows
for i, line in enumerate(reader):
    if i > 0:
        chunk.append(line)

predictedSet = process_chunk(chunk)
print(predictedSet)

Result:

Coefficients: [ 0.06821406]
Residual sum of squares: 0.00
Variance score: 1.00
[ 121.39022086  170.9286349    64.34416748   96.61828528  124.28181483
  174.99828567]
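
Note that once the first two columns are dropped, the remaining columns sit at indices 0 to 3: profit, attr1, attr2, attr3 (column order as given by the asker in the comments). So chunk[:, 3:5] selects only attr3 and chunk[:, 2] is attr2, which is consistent with the single coefficient in the output above. The following is a hedged sketch of fitting profit on attr1 and attr2 for one 12-row group instead, which is the split the question describes. It uses the Alabama rows from the question (rounded), holds out the last 2 rows as the test set, and computes the mean squared error as np.mean((predicted - y_test) ** 2). The variable names are illustrative, and this is not the answerer's code:

```python
import numpy as np
from sklearn import linear_model

# Columns: profit, attr1, attr2, attr3 (Alabama rows from the question, rounded).
group = np.array([
    [97071129.11, 186026.0, 63.14, 923.86],
    [67445447.05, 187201.0, 94.71, 1385.79],
    [66332319.63, 186611.0, 121.77, 1781.73],
    [75868163.65, 188002.0, 171.38, 2507.62],
    [104626353.33, 192055.0, 62.30, 924.28],
    [82482715.69, 193070.0, 93.45, 1386.42],
    [81095032.96, 196819.0, 120.15, 1782.54],
    [70076833.34, 196738.0, 169.10, 2508.76],
    [111183092.65, 195091.0, 64.82, 946.26],
    [90909063.09, 197789.0, 97.23, 1419.39],
    [90934598.22, 201541.0, 125.01, 1824.93],
    [107374172.93, 203338.0, 175.94, 2568.42],
])

X, y = group[:, 1:3], group[:, 0]   # features: attr1, attr2; label: profit
X_train, y_train = X[:10], y[:10]   # first 10 rows for training
X_test, y_test = X[10:], y[10:]     # last 2 rows for testing

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
predicted = regr.predict(X_test)

# Mean squared error: square each residual first, then average.
mse = np.mean((predicted - y_test) ** 2)
print('Coefficients: {}'.format(regr.coef_))
print('Mean squared error: %.2f' % mse)
```

With only 10 training rows per group the fit will be rough, so the error value is mainly useful for comparing groups or feature choices.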

Plots (with arbitrary x-axis) showing the fit:

edited Sep 18, 2015 at 5:28

answered Sep 17, 2015 at 13:37

Sahil M


9 Comments

Jason Donnald Over a year ago

I updated my code as per your suggestion, and when I execute it I see some weird results in the output. I have posted the updated code and output in my post above.

pbu Over a year ago

Have you checked for missing values?

Jason Donnald Over a year ago

I have updated my post above with some of the csv data that I have

Jason Donnald Over a year ago

@pbu I have updated my code in the above post and have also provided some of my csv data. There is no header in the dataset but the order is - company_id,state,profit,attr1,attr2,attr3

Sahil M Over a year ago

Those are not random weird values; they come from your data. Check lines 11 and 12. Secondly, keep the header intact when you run this code, since the code is meant to skip it. Also, the extra chunk was left over from a print statement I had kept there; I have removed it in the update.
