
I am working through the linear regression example given on this scikit-learn page, using Python in an IPython notebook. The dataset I have looks like this:

KR,Alabama,97071129.11369997,186026.0,63.14000000000001,923.8600000000001
KR,Alabama,67445447.0459,187201.0,94.71,1385.79
KR,Alabama,66332319.626799986,186611.0,121.77000000000001,1781.73
KR,Alabama,75868163.65490001,188002.0,171.38,2507.62
KR,Alabama,104626353.3301,192055.0,62.300000000000004,924.2800000000001
KR,Alabama,82482715.69460002,193070.0,93.45,1386.4199999999998
KR,Alabama,81095032.9574,196819.0,120.15,1782.5400000000002
KR,Alabama,70076833.3433,196738.0,169.1,2508.76
KR,Alabama,111183092.64729999,195091.0,64.82000000000001,946.2600000000001
KR,Alabama,90909063.08510002,197789.0,97.22999999999999,1419.3899999999999
KR,Alabama,90934598.2206,201541.0,125.01,1824.93
KR,Alabama,107374172.93309999,203338.0,175.94,2568.42
KR,Arizona,1126677862.6940002,264600.0,63.14000000000001,923.8600000000001
KR,Arizona,838166771.0832,268153.0,94.71,1385.79
KR,Arizona,956037530.2797,268429.0,121.77000000000001,1781.73
KR,Arizona,984328946.5951,268792.0,171.38,2507.62
KR,Arizona,1257812174.3229997,270547.0,62.300000000000004,924.2800000000001
KR,Arizona,883093705.2885998,272764.0,93.45,1386.4199999999998
KR,Arizona,880652373.4425,276307.0,120.15,1782.5400000000002
KR,Arizona,910039260.961,279318.0,169.1,2508.76
KR,Arizona,1226385050.8268003,279983.0,64.82000000000001,946.2600000000001
KR,Arizona,1087126209.1170998,281409.0,97.22999999999999,1419.3899999999999
KR,Arizona,934971659.6374002,286590.0,125.01,1824.93
KR,Arizona,986475815.6928002,288644.0,175.94,2568.42
KR,California,7830776748.968867,2085424.0,63.14000000000001,923.8600000000001
KR,California,5999727784.478112,2103999.0,94.71,1385.79
KR,California,5804539962.436825,2138267.0,121.77000000000001,1781.73
KR,California,6547521069.504964,2172849.0,171.38,2507.62
KR,California,7945616026.08499,2157455.0,62.300000000000004,924.2800000000001
KR,California,6068949829.714768,2182688.0,93.45,1386.4199999999998
KR,California,5767177648.936179,2227205.0,120.15,1782.5400000000002
KR,California,6292965589.900258,2284617.0,169.1,2508.76
KR,California,8805205589.885035,2254347.0,64.82000000000001,946.2600000000001
KR,California,6855033176.090414,2292655.0,97.22999999999999,1419.3899999999999
KR,California,6930741761.859158,2341652.0,125.01,1824.93
KR,California,6916313224.326924,2357810.0,175.94,2568.42

In this dataset there are 12 records for each state within each company_id. For each company_id and each state within it, I want to form a training set of 10 records and a test set of 2 records.

Here's my current updated code:

from sklearn import linear_model
import csv
import numpy as np


def process_chunk(chuk):

    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    # to divide into training & test, I am putting the 10th and 11th lines in the test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9 or count == 10:
            test_set_feature_list.append(np.array(line[3:5], dtype=float))
            test_set_label_list.append(np.array(line[2], dtype=float))
        else:
            training_set_feature_list.append(np.array(line[3:5], dtype=float))
            training_set_label_list.append(np.array(line[2], dtype=float))

        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    print(regr.predict(test_set_feature_list))



# Load and parse the data
file_read = open('file.csv', 'r')

reader = csv.reader(file_read)

chunk, chunksize = [], 12

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder

process_chunk(chunk)

When I execute this code, I get the following error at the line regr.fit(training_set_feature_list, training_set_label_list):

ValueError: Found arrays with inconsistent numbers of samples: [ 1 10]

What is the mistake here, and how do I resolve it?

UPDATE: After applying the suggestion, here is my current output, which contains some strange-looking numbers:

[  1.01999724e+08   1.03189615e+08]
[  1.08523268e+09   1.05427929e+09]
[  7.77478189e+09   7.56564733e+09]
[  8.87437438e+08   8.77578642e+08]
[  1.62710654e+08   1.51921308e+08]
[  4.19988737e+09   4.00902600e+09]
[  7.70222690e+08   7.31282229e+08]
[  1.60301569e+09   1.51976018e+09]
[  9.31799698e+08   9.28243073e+08]
[ 51831980.55257727  53136008.17725636]
[  1.92207016e+08   1.85232202e+08]
[  3.82247927e+08   3.33879176e+08]
[  1.35276200e+09   1.34525871e+09]
[  1.62557223e+09   1.53895636e+09]
[  2.12376099e+09   2.08585811e+09]
[ 61386995.4473462   58500866.29796618]
[  3.18458112e+08   3.09384959e+08]
[  4.90038249e+08   4.87984249e+08]
  • It looks like you have different numbers of samples and labels, i.e. the size of training_set_feature_list is not the same as the size of training_set_label_list. Commented Sep 17, 2015 at 12:26
  • Also, for such tasks you can use the pandas package and group your dataframe by company_id and state. Commented Sep 17, 2015 at 12:56
  • @Olologin Can you show how I should do it? Also, training_set_feature_list and training_set_label_list have the same number of records, because they are built together. Commented Sep 17, 2015 at 13:11
  • Can you share your csv data, or maybe a csv with the first few lines in it? I want to debug it myself. Commented Sep 17, 2015 at 15:13
  • @Olologin I have updated my post above to include some of the csv data. Please check it. Commented Sep 17, 2015 at 15:36
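
Following up on the pandas suggestion in the comments, here is a minimal sketch of a per-group split. It is not the asker's or any commenter's code. It assumes the column order company_id, state, profit, attr1, attr2, attr3, uses attr1 and attr2 as features (matching line[3:5] in the question), and, as a simplification, puts the last two records of each group in the test set. The inline sample (rows copied from the question) stands in for file.csv, and the names cols and splits are made up for illustration.

```python
import io
import pandas as pd

# Inline sample standing in for file.csv (rows copied from the question)
csv_text = """KR,Alabama,97071129.11369997,186026.0,63.14000000000001,923.8600000000001
KR,Alabama,67445447.0459,187201.0,94.71,1385.79
KR,Alabama,66332319.626799986,186611.0,121.77000000000001,1781.73
KR,Alabama,75868163.65490001,188002.0,171.38,2507.62
KR,Alabama,104626353.3301,192055.0,62.300000000000004,924.2800000000001
KR,Alabama,82482715.69460002,193070.0,93.45,1386.4199999999998
KR,Alabama,81095032.9574,196819.0,120.15,1782.5400000000002
KR,Alabama,70076833.3433,196738.0,169.1,2508.76
KR,Alabama,111183092.64729999,195091.0,64.82000000000001,946.2600000000001
KR,Alabama,90909063.08510002,197789.0,97.22999999999999,1419.3899999999999
KR,Alabama,90934598.2206,201541.0,125.01,1824.93
KR,Alabama,107374172.93309999,203338.0,175.94,2568.42
KR,Arizona,1126677862.6940002,264600.0,63.14000000000001,923.8600000000001
KR,Arizona,838166771.0832,268153.0,94.71,1385.79
KR,Arizona,956037530.2797,268429.0,121.77000000000001,1781.73
KR,Arizona,984328946.5951,268792.0,171.38,2507.62
KR,Arizona,1257812174.3229997,270547.0,62.300000000000004,924.2800000000001
KR,Arizona,883093705.2885998,272764.0,93.45,1386.4199999999998
KR,Arizona,880652373.4425,276307.0,120.15,1782.5400000000002
KR,Arizona,910039260.961,279318.0,169.1,2508.76
KR,Arizona,1226385050.8268003,279983.0,64.82000000000001,946.2600000000001
KR,Arizona,1087126209.1170998,281409.0,97.22999999999999,1419.3899999999999
KR,Arizona,934971659.6374002,286590.0,125.01,1824.93
KR,Arizona,986475815.6928002,288644.0,175.94,2568.42
"""

# Assumed column order; the file itself has no header row
cols = ["company_id", "state", "profit", "attr1", "attr2", "attr3"]
df = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

splits = {}  # hypothetical container: (company_id, state) -> train/test arrays
for key, group in df.groupby(["company_id", "state"], sort=False):
    train, test = group.iloc[:10], group.iloc[10:]  # 10 training rows, 2 test rows
    splits[key] = (
        train[["attr1", "attr2"]].to_numpy(),  # X_train, shape (10, 2)
        train["profit"].to_numpy(),            # y_train, shape (10,)
        test[["attr1", "attr2"]].to_numpy(),   # X_test, shape (2, 2)
        test["profit"].to_numpy(),             # y_test, shape (2,)
    )

for key, (X_train, y_train, X_test, y_test) in splits.items():
    print(key, X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Each entry of splits can then be passed to a separate regr.fit(X_train, y_train) call, so that one model is trained per company_id and state.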

1 Answer


I think your data contains strings, which is why it complains. There were some other problems as well, so I am posting a corrected version.

from sklearn import linear_model
import csv
import numpy as np
import matplotlib.pyplot as plt

def process_chunk(chuk):
    # Drop the first two columns, which hold strings rather than numbers
    chuk = [row[2:] for row in chuk]
    # Build a float array from the remaining string fields
    chunk = np.array(chuk, dtype=float)

    # Testing dataset: every row from index 30 onward
    # Features: columns of the trimmed chunk selected by the slice 3:5
    test_set_feature_list = chunk[30:, 3:5]
    # Labels: third column of the trimmed chunk (index 2)
    test_set_label_list = chunk[30:, 2]

    # Training dataset: the first 30 rows, same columns as above
    training_set_feature_list = chunk[:30, 3:5]
    training_set_label_list = chunk[:30, 2]

    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    predictedTestSet = regr.predict(test_set_feature_list)

    # The coefficients
    print('Coefficients: {}'.format(regr.coef_))
    # Square of the mean residual on the test set (** binds tighter than %)
    print('Residual sum of squares: %.2f' % np.mean(predictedTestSet - test_set_label_list) ** 2)
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % regr.score(test_set_feature_list, test_set_label_list))
    # Sort (label, prediction) pairs by label so both series share one x-axis order
    pairs = sorted(zip(test_set_label_list, predictedTestSet))
    X = [x for (y, x) in pairs]
    Y = [y for (y, x) in pairs]
    plt.plot(range(len(X)), X, 'r.', label='predicted')
    plt.plot(range(len(Y)), Y, 'g-', label='test_set')
    plt.legend()
    plt.show()
    return predictedTestSet


# Load and parse the data
file_read = open('file1.csv', 'r')

reader = csv.reader(file_read)

chunk = []

for i, line in enumerate(reader):
    # Skip the first line (the header row)
    if i > 0:
        chunk.append(line)

predictedSet = process_chunk(chunk)
print(predictedSet)

Result:

Coefficients: [ 0.06821406]
Residual sum of squares: 0.00
Variance score: 1.00
[ 121.39022086  170.9286349    64.34416748   96.61828528  124.28181483
  174.99828567]
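
A note on the "Residual sum of squares" line: the expression in the code squares the mean of the residuals, so positive and negative errors can cancel before squaring. The sketch below, with made-up numbers that are not from the dataset, shows how that quantity differs from the mean squared error and from the residual sum of squares. The variable names are illustrative only.

```python
import numpy as np

# Made-up values for illustration; not taken from the dataset
predicted = np.array([2.0, 4.0, 6.0])
actual = np.array([3.0, 3.0, 6.0])
residuals = predicted - actual           # [-1.0, 1.0, 0.0]

squared_mean = np.mean(residuals) ** 2   # square of the mean residual
mse = np.mean(residuals ** 2)            # mean squared error
rss = np.sum(residuals ** 2)             # residual sum of squares

print('square of mean residual: %.2f' % squared_mean)  # 0.00
print('mean squared error:      %.2f' % mse)           # 0.67
print('residual sum of squares: %.2f' % rss)           # 2.00
```

Here the errors of -1 and +1 cancel, so the first quantity is zero even though two of the three predictions are off by one.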

Plot (with an arbitrary x-axis) showing the fit:

[Figure: fit on an arbitrary X-axis]
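
On the original ValueError: the message says that one of the arrays passed to fit had 1 sample while the other had 10. A quick way to catch this before fitting is to convert both inputs to float arrays and compare their first dimensions. The helper below is a hypothetical sketch (check_xy is not part of scikit-learn), and the sample values are copied from the question's data.

```python
import numpy as np

def check_xy(features, labels):
    """Return (X, y) as float arrays, raising if the sample counts differ."""
    X = np.asarray(features, dtype=float)        # also converts numeric strings
    y = np.asarray(labels, dtype=float).ravel()  # labels as a flat 1-D array
    if X.ndim == 1:
        # Treat a flat list of n values as n samples of a single feature
        X = X.reshape(-1, 1)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X has %d samples but y has %d" % (X.shape[0], y.shape[0]))
    return X, y

# String fields as they come out of csv.reader (values from the question)
rows = [["186026.0", "63.14000000000001"],
        ["187201.0", "94.71"],
        ["186611.0", "121.77000000000001"]]
profits = ["97071129.11369997", "67445447.0459", "66332319.626799986"]

X, y = check_xy(rows, profits)
print(X.shape, y.shape)  # (3, 2) (3,)
```

If X comes out with a first dimension of 1, the feature rows were probably wrapped in an extra list or transposed somewhere before the call to fit.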


9 Comments

I updated my code as per your suggestion, and when I execute it I see some weird results in the output. I have posted the updated code and output in my post above.
Have you checked for missing values?
I have updated my post above with some of the csv data that I have
@pbu I have updated my code in the above post and have also provided some of my csv data. There is no header in the dataset but the order is - company_id,state,profit,attr1,attr2,attr3
Those are not random weird values; they come from your data (check lines 11 and 12). Secondly, keep the header intact when you run this code, since the code is meant to skip it. Also, the extra chunk was left over from a print statement I had kept there; I have removed it in the update.