Learn linear regression with hands-on projects
In the 1980s, Marvel Comics introduced Destiny, a fictional character with the ability to foresee future events. The exciting news is that predicting the future is no longer just a fantasy! With the progress made in machine learning, a machine can help forecast future events by learning from the past.
Exciting, right? Let’s start this journey with a simple prediction model. Regression is a predictive modeling approach in machine learning in which an algorithm learns how one or more independent variables (features) relate to a dependent variable (outcome) in order to predict continuous values. Rather than delving into theory, the focus here will be on building different regression models.
Before starting to build a Python regression model, one should examine the data. For instance, if an individual owns a fish farm and needs to predict a fish’s weight based on its dimensions, they can explore the dataset by displaying the top few rows of the DataFrame.
First, the pandas library is imported to read the data into a DataFrame. The data is then read from the Fish.txt file, with the column names defined explicitly. Finally, the top five rows of the DataFrame are printed. The three lengths define the vertical, diagonal, and cross lengths in cm.
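A minimal sketch of this step follows. Since the Fish.txt file itself isn’t reproduced here, a few made-up sample rows stand in for it; the column names are an assumption based on the features used later in this post.

```python
import pandas as pd
from io import StringIO

# Stand-in for the contents of Fish.txt -- these rows are illustrative,
# not the real dataset.
data = StringIO("""242.0,23.2,25.4,30.0,11.52,4.02
290.0,24.0,26.3,31.2,12.48,4.31
340.0,23.9,26.5,31.1,12.38,4.70
363.0,26.3,29.0,33.5,12.73,4.46
430.0,26.5,29.0,34.0,12.44,5.13
""")

# Column names as described in the post.
columns = ['Weight', 'V-Length', 'D-Length', 'X-Length', 'Height', 'Width']

# Read the comma-separated data into a DataFrame; in practice this would be
# pd.read_csv('Fish.txt', names=columns).
Fish = pd.read_csv(data, names=columns)

# Display the top five rows.
print(Fish.head())
```

In practice, the `StringIO` stand-in would simply be replaced by the path to the data file.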
Here, the fish’s length, height, and width are independent variables, with weight serving as the dependent variable. In machine learning, independent variables are often referred to as features and dependent variables as labels, and these terms will be used interchangeably throughout this blog.
Linear regression models, a fundamental concept you’ll encounter as you learn machine learning, are widely used in statistics and machine learning. These models use a straight line to describe the relationship between an independent variable and a dependent variable. For example, when analyzing the weight of fish, a linear regression model is used to describe the relationship between the weight of the fish and one of the independent variables as follows,
Where is the slope of the line that defines its steepness, and is the y-intercept, the point where line crosses the y-axis.
The dataset contains five independent variables. A simple linear regression model with only one feature can be built by selecting the feature most strongly correlated with the fish’s Weight. One way to accomplish this is to compute the correlation between Weight and each of the features.
After examining the first column of the correlation matrix, the following is observed: Weight has the strongest correlation with the feature X-Length and the weakest correlation with Height. Given this information, it is clear that if the individual is limited to using only one independent variable to predict the dependent variable, they should choose X-Length and not Height.
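The correlation check can be sketched as follows. Synthetic stand-in data is used here (the real Fish.txt isn’t available), constructed so that X-Length drives Weight while Height is unrelated, mirroring the pattern described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the fish data: X-Length strongly determines
# Weight, while Height is independent noise.
rng = np.random.default_rng(0)
x_length = rng.uniform(20, 45, 50)
Fish = pd.DataFrame({
    'Weight': 30 * x_length + rng.normal(0, 40, 50),
    'X-Length': x_length,
    'Height': rng.uniform(5, 15, 50),
})

# The first column of the correlation matrix shows how strongly each
# variable correlates with Weight.
print(Fish.corr()['Weight'])
```

On the real dataset, `Fish.corr()` would reveal the same kind of ranking across all five features.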
# Step 3: Separating the data into features and labels
X = Fish[['X-Length']]
y = Fish['Weight']
With the features and labels in place, the DataFrame can now be divided into training and test sets. The training set trains the model, while the test set evaluates its performance.
The train_test_split function is imported from the sklearn library to split the data.
The arguments of the train_test_split function can be examined as follows:
test_size=0.3 reserves 30% of the data for testing, leaving the remaining 70% for training. shuffle=True shuffles the rows before splitting so that the split does not depend on the order of the data. As a result, the training data is obtained in the variables X_train and y_train, and the test data in X_test and y_test.
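The split can be sketched as follows; a small synthetic DataFrame stands in for the fish data (the real file isn’t reproduced here), but the `train_test_split` call is the same.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the fish DataFrame.
rng = np.random.default_rng(1)
Fish = pd.DataFrame({'X-Length': rng.uniform(20, 45, 100)})
Fish['Weight'] = 30 * Fish['X-Length'] + rng.normal(0, 40, 100)

# Separate features and label.
X = Fish[['X-Length']]
y = Fish['Weight']

# 70% of the rows go to training and 30% to testing; rows are shuffled
# before splitting. random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

print(len(X_train), len(X_test))
```

With 100 rows, this yields 70 training samples and 30 test samples.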
At this point, the linear regression model can be created.
The LinearRegression class is imported from the sklearn library, and the model is fit on X_train and y_train. Remember, 30% of the data was set aside for testing. The Mean Absolute Error (MAE) can be calculated on this data as an indicator of the average absolute difference between the predicted and actual values; a lower MAE value indicates more accurate predictions. Other measures for model validation exist, but they won’t be explored in this context.
Here’s a complete running example that includes all of the steps mentioned above to perform a linear regression.
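The following sketch reconstructs that end-to-end workflow. Synthetic data stands in for Fish.txt (the numbers are made up), so the MAE values are illustrative only; on the real dataset the same steps apply unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the fish data.
rng = np.random.default_rng(2)
x_length = rng.uniform(20, 45, 150)
Fish = pd.DataFrame({'X-Length': x_length,
                     'Weight': 30 * x_length - 300 + rng.normal(0, 40, 150)})

# Separate the data into features and labels.
X = Fish[['X-Length']]
y = Fish['Weight']

# Split into training (70%) and test (30%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

# Create and train the linear regression model.
model = LinearRegression()
model.fit(X_train, y_train)

# MAE on the training data (seen by the model) ...
train_mae = mean_absolute_error(y_train, model.predict(X_train))
# ... and on the test data (unseen by the model).
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f'Train MAE: {train_mae:.1f}')
print(f'Test MAE:  {test_mae:.1f}')
```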
In this instance, the model.predict() function is applied first to the training data and then to the test data. But what does this show?
Essentially, this approach demonstrates the model’s performance on a known dataset when compared to an unfamiliar test dataset. The two MAE values suggest that the predictions on both train and test data are similar.
Note: It is essential to recall that X-Length was chosen as the feature because of its high correlation with the label. To verify this choice, one can replace it with Height, rerun the linear regression, and compare the two MAE values.
So far, only one feature, X-Length, has been used to train the model. However, additional features are available that can be utilized to improve the predictions: the vertical length, diagonal length, height, and width of the fish. All five features can be used to re-evaluate the linear regression model.
# Step 3: Separating the data into features and labels
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']
Mathematically, the multiple linear regression model can be written as follows:
Weight = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ
where wᵢ represents the weightage of feature xᵢ in predicting Weight, and n denotes the number of features.
Following the same steps as earlier, the performance of the model can be calculated using all the features.
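A sketch of the multiple-feature version follows; as before, synthetic data with the five column names from this post stands in for the real Fish.txt, so the MAE value is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in with all five features; the three lengths are
# correlated with each other, as real fish measurements would be.
rng = np.random.default_rng(3)
n = 150
base = rng.uniform(20, 45, n)
Fish = pd.DataFrame({
    'V-Length': base * 0.85 + rng.normal(0, 1, n),
    'D-Length': base * 0.92 + rng.normal(0, 1, n),
    'X-Length': base,
    'Height': rng.uniform(5, 15, n),
    'Width': rng.uniform(3, 7, n),
})
Fish['Weight'] = (25 * Fish['X-Length'] + 10 * Fish['Height']
                  + 15 * Fish['Width'] - 400 + rng.normal(0, 40, n))

# Use all five features this time.
X = Fish[['V-Length', 'D-Length', 'X-Length', 'Height', 'Width']]
y = Fish['Weight']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f'Test MAE: {test_mae:.1f}')
```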
The MAE values will be similar to the results obtained when using a single feature.
Next comes polynomial regression, which is used when the assumption of a linear relationship between the features and the label is not accurate. By allowing a more flexible fit to the data, polynomial regression can capture more complex relationships and lead to more accurate predictions.
For example, if the relationship between the dependent variable and the independent variables is not a straight line, a polynomial regression model can fit it more accurately.
Mathematically, the relationship between the dependent and independent variables is described using the following equation:
Weight = w₀ + w₁z₁ + w₂z₂ + … + wₘzₘ
The above equation looks very similar to the one used earlier to describe multiple linear regression. However, it includes transformed features, the zᵢ’s, which are polynomial versions of the xᵢ’s used in multiple linear regression. This can be further explained using an example of two features, x₁ and x₂, which can be used to create new features x₁, x₂, x₁², x₂², x₁x₂, x₁³, x₂³, and so on.
The new polynomial features can be created based on trial and error or techniques like cross-validation. The degree of the polynomial can also be chosen based on the complexity of the relationship between the variables.
The following example presents a polynomial regression and validates the model’s performance.
The features are transformed using the PolynomialFeatures class, which is imported from the sklearn library for this purpose.
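A sketch of the full polynomial-regression workflow follows; synthetic data with a nonlinear length–weight relationship stands in for the real dataset, so the MAE value is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic fish-like data: weight grows nonlinearly (cubically) with length.
rng = np.random.default_rng(5)
x = rng.uniform(20, 45, 150).reshape(-1, 1)
y = 0.02 * x.ravel() ** 3 + rng.normal(0, 40, 150)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, shuffle=True, random_state=42)

# Transform the raw feature into polynomial features of degree 3.
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit an ordinary linear regression on the transformed features.
model = LinearRegression()
model.fit(X_train_poly, y_train)

test_mae = mean_absolute_error(y_test, model.predict(X_test_poly))
print(f'Test MAE: {test_mae:.1f}')
```

Note that the regression itself is still linear; the nonlinearity comes entirely from the transformed features.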
It should be noted that the MAE value in this case is lower than that of the linear regression models, implying that the linear assumption was not entirely accurate.
This blog has provided a quick introduction to machine learning regression models with Python. Don’t stop here! Explore and practice different techniques and libraries to build more accurate and robust models. You can also check out the following courses on Educative:
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.
Mastering Machine Learning Theory and Practice
The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.
Hands-on Machine Learning with Scikit-Learn
Scikit-Learn is a powerful library that provides a handful of supervised and unsupervised learning algorithms. If you’re serious about having a career in machine learning, then scikit-learn is a must know. In this course, you will start by learning the various built-in datasets that scikit-learn offers, such as iris and mnist. You will then learn about feature engineering and more specifically, feature selection, feature extraction, and dimension reduction. In the latter half of the course, you will dive into linear and logistic regression where you’ll work through a few challenges to test your understanding. Lastly, you will focus on unsupervised learning and deep learning where you’ll get into k-means clustering and neural networks. By the end of this course, you will have a great new skill to add to your resume, and you’ll be ready to start working on your own projects that will utilize scikit-learn.