
I'm new to Python and Machine Learning and I have homework to deliver next week. This is the code I have so far:

# to get in-line plots
%matplotlib inline
import matplotlib.pyplot as plt

import numpy as np
import scipy as sp
from scipy import stats

# Load the data
IDnumber = 1  # note: a literal with leading zeros (e.g. 0000001) is a SyntaxError in Python 3
np.random.seed(IDnumber)

filename = "ccpp_Data_clean2018.csv"

Data = np.genfromtxt(filename, delimiter=';',skip_header=1)
dataDescription = stats.describe(Data)
print(dataDescription)

Data.shape

#get number of total samples
num_total_samples = Data.shape[0]

print("Total number of samples: "+str(num_total_samples))

#size of each chunk of data for training, validation, testing
size_chunk = int(num_total_samples/3.)

print("Size of each chunk of data: "+str(size_chunk))

#shuffle the data
np.random.shuffle(Data)

#training data
X_training = np.delete(Data[:size_chunk], 4, 1)
Y_training = Data[:size_chunk, 4]
print("Training data input size: "+str(X_training.shape))
print("Training data output size: "+str(Y_training.shape))

#validation data, to be used to choose among different models
X_validation = np.delete(Data[size_chunk:size_chunk*2], 4, 1)
Y_validation = Data[size_chunk:size_chunk*2, 4]
print("Validation data input size: "+str(X_validation.shape))
print("Validation data output size: "+str(Y_validation.shape))

#test data, to be used to estimate the true loss of the final model(s)
X_test = np.delete(Data[size_chunk*2:num_total_samples], 4, 1)
Y_test = Data[size_chunk*2: num_total_samples, 4]
print("Test data input size: "+str(X_test.shape))
print("Test data output size: "+str(Y_test.shape))
#scale the data

# standardize the input matrix
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_training)
X_training = scaler.transform(X_training)
print("Mean of the training input data:"+str(X_training.mean(axis=0)))
print("Std of the training input data:"+str(X_training.std(axis=0)))
X_validation = scaler.transform(X_validation) # use the same transformation on validation data
print("Mean of the validation input data:"+str(X_validation.mean(axis=0)))
print("Std of the validation input data:"+str(X_validation.std(axis=0)))
X_test = scaler.transform(X_test) # use the same transformation on test data
print("Mean of the test input data:"+str(X_test.mean(axis=0)))
print("Std of the test input data:"+str(X_test.std(axis=0)))
#compute linear regression coefficients for training data

#add a 1 at the beginning of each sample for training, validation, and testing
m_training = # COMPLETE: NUMBER OF POINTS IN THE TRAINING SET
X_training = np.hstack((np.ones((m_training,1)),X_training))

m_validation = # COMPLETE: NUMBER OF POINTS IN THE VALIDATION SET
X_validation = np.hstack((np.ones((m_validation,1)),X_validation))

m_test = # COMPLETE: NUMBER OF POINTS IN THE TEST SET
X_test = np.hstack((np.ones((m_test,1)),X_test))

# Compute the coefficients for linear regression (LR) using linalg.lstsq
w_np, RSStr_np, rank_X_tr, sv_X_tr = #COMPLETE

print("LR coefficients with numpy lstsq: "+ str(w_np))

# compute Residual sums of squares by hand
print("RSS with numpy lstsq: "+str(RSStr_np))
print("Empirical risk with numpy lstsq:"+str(RSStr_np/m_training))

The way I split the set was part of the assignment; the data I have to predict is in the last column, and this is the dataset: http://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant.

My question is: in the last part of the code (the lines marked COMPLETE), are m_training, m_validation and m_test simply the shape of the corresponding X? I mean:

m_training = X_training.shape

and so on. I am not sure about that. Finally, what parameters do I have to pass to the linalg.lstsq function?

UPDATE: I'm moving forward with the code but I'm stuck again; this time I have to:

#compute predictions on training set, validation set, and test set
prediction_training = # COMPLETE
prediction_validation = # COMPLETE
prediction_test = # COMPLETE

#what about the RSS and loss for points in the validation data?
RSS_validation = # COMPLETE
RSS_test = # COMPLETE

print("RSS on validation data: "+str(RSS_validation))
print("Loss estimated from validation data:"+str(RSS_validation/m_validation))


#another measure of how good our linear fit is, is given by the following quantity (that is, 1 - R^2)
#compute 1 - R^2 for training, validation, and test set
Rmeasure_training = #COMPLETE
Rmeasure_validation = #COMPLETE 
Rmeasure_test = #COMPLETE

I am running into many difficulties, so if you have any good suggestions on where I can learn what I need, I would really appreciate it. I have a textbook, but it only covers theory, not programming.

1 Answer

You can use

m_training = len(X_training)

but a better way is indeed to use shape:

X_training.shape

that will return a tuple (m, n), where m is the number of rows, and n is the number of columns. Then

m_training = X_training.shape[0]

is what you are looking for. Indeed, in order to prepend a column of ones to your data, you need to know the number of rows.
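As a small sketch of the idea above (with toy numbers, not the homework data): take the row count from shape[0] and use it to build the column of ones.

```python
import numpy as np

# Toy input matrix: 3 samples, 2 features (shapes are what matters here)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

m = X.shape[0]  # number of rows (samples) -> 3
X_with_bias = np.hstack((np.ones((m, 1)), X))

print(X_with_bias.shape)   # (3, 3): one extra column of ones on the left
print(X_with_bias[:, 0])   # [1. 1. 1.]
```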

For the function linalg.lstsq you can look at the examples in: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.linalg.lstsq.html

In your case it should be:

np.linalg.lstsq(X_training, Y_training, rcond=None)

(matching the Y_training variable from your code; rcond=None keeps the current default behavior and silences a FutureWarning on recent NumPy versions).
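Here is a minimal runnable sketch of that call, using synthetic data in place of the homework arrays, to show the four values lstsq returns (this matches the unpacking `w_np, RSStr_np, rank_X_tr, sv_X_tr` in your code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the homework data: 6 samples, bias column + 2 features
m = 6
X = np.hstack((np.ones((m, 1)), rng.normal(size=(m, 2))))
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w  # exact linear data, so the residuals should be ~0

# lstsq returns (coefficients, residual sum of squares, rank, singular values)
w, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)

print(w)     # close to [1, 2, -3]
print(rank)  # 3, since the three columns are linearly independent
```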

2 Comments

Thank you so much. Do you know, by chance, where I can find some good examples (or even tutorials) about this? I've been looking online but I haven't found what I need.
@forzalupi1912 from my understanding you have to compute the dot product between your samples and your model coefficients, like prediction_training = np.dot(X_training, w_np) ...
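Putting the comment above into a runnable sketch with hypothetical stand-in arrays (the names here are illustrative; in the homework they would be the scaled matrices with the bias column already prepended):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in training data: 8 samples, bias column + 2 features, small noise
X_train = np.hstack((np.ones((8, 1)), rng.normal(size=(8, 2))))
y_train = X_train @ np.array([0.5, 1.0, -2.0]) + 0.01 * rng.normal(size=8)

w, _, _, _ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Predictions: dot product of each sample with the coefficient vector
pred_train = np.dot(X_train, w)

# Residual sum of squares, computed "by hand"
rss_train = np.sum((y_train - pred_train) ** 2)

# 1 - R^2: RSS divided by the total sum of squares around the mean
one_minus_r2 = rss_train / np.sum((y_train - y_train.mean()) ** 2)

print(rss_train, one_minus_r2)  # both small here, since the noise is small
```

The same three lines (predict, RSS, divide by total sum of squares) apply unchanged to the validation and test sets, always with the w fitted on training data.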
