Introduction to Linear Regression using Python
Linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative. In this tutorial, we will cover linear regression using Python in more detail. We will discuss why to use linear regression and will cover the mathematical calculations behind the scenes.
We will also implement linear regression in Python and explain each part of the code in detail. In a nutshell, this tutorial contains all the necessary details about linear regression using Python, from the mathematical calculations to the Python implementation.
Getting Started with Linear Regression using Python
As we have already discussed, linear regression is used to estimate the relationship between independent and dependent variables using a straight line. In the upcoming sections, we will see how this straight line is built by looking at both the mathematical part and the Python implementation.
For this tutorial, we are using Python 3.8.10. You can check the Python version installed on your system by using either of the following commands, depending on how Python was installed:
python3 --version
python --version
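Alternatively, you can check the version from inside Python itself using the standard sys module:
# Print the version of the running Python interpreter
import sys
print(sys.version)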
We will use the pip3 command to install the different modules that we need in this tutorial. Python 3.4+ includes pip3 by default on most operating systems. If your Python version is older than 3.4, you should upgrade your Python installation, which will also install pip3. You can upgrade pip itself by using the following command:
python3 -m pip install --upgrade pip
In this tutorial, we will also be working with a data set stored in a CSV file named Data_for_LR.csv. This file contains the following data; you can simply copy and paste it into your own CSV file.
Hours,Score
1,76
2,78
2,85
4,88
2,72
1,69
5,94
4,94
2,88
4,92
4,90
3,75
6,96
5,90
3,82
4,85
6,99
2,83
1,62
2,76
Notice that there are two columns, one as input (Hours) and the other as output (Score). This is a perfect data set for simple linear regression. In the upcoming sections, we will use this data set to train a linear regression model and predict the output.
The mathematical part of the Linear regression algorithm
Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables. In other words, given an input x, we would like to compute an output y. The equation used to train a simple linear regression model is given below:
Y = mX + C
Where Y is the output, X is the input, m is the slope of the line, and C is a constant value (the intercept). The algorithm calculates the values of m and C from the training data set and then uses them to calculate the output in the testing part.
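As a quick illustration of the equation, the short sketch below plugs made-up values for m and C into Y = mX + C; in practice these values are learned from the training data, as we will see later in the tutorial.
# Hypothetical slope and constant, just to show how the line produces outputs
m = 5.0   # slope (assumed for illustration)
C = 70.0  # constant / intercept (assumed for illustration)

for X in [1, 3, 6]:
    Y = m * X + C
    print(f"X = {X} -> Y = {Y}")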
Types of Linear Regression
As we have already discussed, linear regression is used to estimate the relationship between independent and dependent variables using a straight line. Based on the number of independent variables, linear regression can be divided into two main categories: simple linear regression and multiple linear regression.
Simple Linear Regression
Simple linear regression is a type of linear regression with only one variable as input. The data set for simple linear regression contains pairs of values, one as the input or independent variable and the other as the output or dependent variable. The equation for simple linear regression is as follows:
f(x) = M + cx
- f(x): is the output value
- M: is the constant value (intercept)
- c: is the slope or coefficient of the input
- x: is the input variable
Multiple Linear Regression
Multiple linear regression is a type of linear regression with two or more variables as input or independent variables. The data set for multiple linear regression contains one output along with multiple input variables. The equation for multiple linear regression is as follows:
f(x₁, x₂, ...) = M + b₁x₁ + b₂x₂ + ...
- f(x₁, x₂, ...): is the output value or the dependent variable
- M: is the constant value (intercept) of the equation
- b₁, b₂, ...: are the coefficients of the input variables
- x₁, x₂, ...: are the input values or independent variables
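To make the multiple-input form concrete, here is a minimal sketch that evaluates f(x₁, x₂) = M + b₁x₁ + b₂x₂ with made-up values for M, b₁, and b₂; in practice these values are learned from data, just as in the simple case.
# Hypothetical intercept and coefficients, just to show the shape of the equation
M = 60.0        # constant / intercept (assumed for illustration)
b = [4.0, 1.5]  # coefficients for x1 and x2 (assumed for illustration)

def predict(x1, x2):
    # f(x1, x2) = M + b1*x1 + b2*x2
    return M + b[0] * x1 + b[1] * x2

print(predict(2, 3))  # output for inputs x1=2, x2=3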
Python code for Linear regression with a full explanation
In this section, we will discuss the Python program for linear regression using a simple data set. We will first train our model on a training data set and then test it on a separate testing data set. But before jumping into the Python program, let us first install all the required modules.
Set up the environment for linear regression testing using Python
Let us now set up the environment for our Python program. First of all, we need to install the following Python modules using the pip or pip3 command.
pip3 install matplotlib
pip3 install pandas
pip3 install scikit-learn
We are using the following versions of these modules in this tutorial.
matplotlib: 3.3.4
sklearn : 0.24.1
pandas : 1.2.3
You can use the following Python code to check the versions of the modules installed on your system. See the Python code below:
import matplotlib
import sklearn
import pandas
print("matplotlib:", matplotlib.__version__)
print("sklearn :", sklearn.__version__)
print("pandas :", pandas.__version__)
Explanation of Python code for linear regression
Now let us jump into the explanation part and see how we can implement linear regression in Python. First, we have to import the following modules.
# Importing the required modules for linear regression using python
import matplotlib.pyplot as plt
import pandas as pd
We will use the matplotlib module to visualize the training and testing parts, and pandas to import the data set and split it into input values and output values. After importing the modules, we have to import the data set using the pandas module.
See the following Python program.
# Importing the dataset
dataset = pd.read_csv('Linear regression/Data_for_LR.csv')
# input values: every column except the last one (Hours)
X = dataset.iloc[:, :-1].values
# output values: the second column (Score)
y = dataset.iloc[:, 1].values
Here we have imported our data from the CSV file using pandas and stored it in a variable named dataset. Then we divide our data set into input (independent) values as X and output (dependent) values as y. Now let us split our data set into training and testing parts. For that, we have to import the data-splitting function from the sklearn library. See the following Python program.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# 30% data for testing, random state 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
Here we have defined four different variables: X_train, X_test, y_train, and y_test. Then we split the data into the training part and the testing part. Inside the train_test_split() method, we have set test_size to .3, which means we are assigning 30% of the whole data set to the testing variables and 70% to the training variables. We have also set random_state to 0, which fixes the seed used for shuffling, so the split into training and test parts is the same every time you run the code. You can print these testing and training data sets to inspect the split.
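For example, printing the shapes of the four variables is a quick way to confirm the 70/30 split (a small sketch using the variables defined above):
# With 20 rows in total, a 30% test size gives 6 testing and 14 training samples
print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("y_train:", y_train.shape)
print("y_test :", y_test.shape)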
Now the next step is to train our model using linear regression. Fortunately, we don't have to write the whole Python code from scratch to train the model. We can use the sklearn module to train our model with the linear regression algorithm. See the Python training part below:
# Importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
# initializing the algorithm
regressor = LinearRegression()
# Fitting Simple Linear Regression to the Training set
regressor.fit(X_train, y_train)
Notice that here we are importing LinearRegression from the sklearn module and training our model. While training the model, we only provide the X_train and y_train data sets (which are 70% of the original data set, as split above).
See the following predicting part:
# Predicting the Test set results
y_pred = regressor.predict(X_test)
So here we are providing the input values without outputs. The outputs will be predicted by our model and stored in the y_pred variable. We can also provide individual values to predict the output. For example, in our original data set, the input value 1 has an output of 76. Let us provide the same input to our trained model and see what the predicted output will be. See the example below:
# Predicting a single value
pred_score = regressor.predict([[1]])
print(pred_score)
Output:
[72.36290323]
Notice that the output is not exactly the same, but it is close to the actual value. Once we have the output, we can use different performance evaluation metrics to see how accurately our algorithm predicts the output values.
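A quick way to eyeball the test-set predictions is to place the actual and predicted values side by side; a minimal sketch using pandas, assuming the y_test and y_pred arrays from above:
# Compare actual test values with the model's predictions
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)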
Create Visualization Chart
Now let us go to the visualization part and visually compare the predicted and actual values. See the Python program for visualization below:
# Visualizing the Test set results
viz_test = plt
# red dot colors for actual values
viz_test.scatter(X_test, y_test, color='red')
# Blue line for the predicted values
viz_test.plot(X_test, regressor.predict(X_test), color='blue')
# defining the title
viz_test.title('Hours vs Score')
# x label
viz_test.xlabel('Hours')
# y label
viz_test.ylabel('Score')
# showing the graph
viz_test.show()
Output:
Notice that our model's predictions are not perfectly accurate; this is because we have very little data in the training part. Usually, it takes many more data points to train a good model, but for the sake of learning we took a small data set.
Linear regression using sklearn
The following is the full code for linear regression using Python. See the following example.
# Importing the required modules for linear regression using python
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Linear regression/Data_for_LR.csv')
# input values: every column except the last one (Hours)
X = dataset.iloc[:, :-1].values
# output values: the second column (Score)
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# 30% data for testing, random state 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
# initializing the algorithm
regressor = LinearRegression()
# Fitting Simple Linear Regression to the Training set
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
# Visualizing the Training set results
viz_train = plt
viz_train.scatter(X_train, y_train, color='red')
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Hours vs Score')
viz_train.xlabel('Hours')
viz_train.ylabel('Score')
viz_train.show()
# Visualizing the Test set results
viz_test = plt
# red dot colors for actual values
viz_test.scatter(X_test, y_test, color='red')
# Blue line for the predicted values
viz_test.plot(X_test, regressor.predict(X_test), color='blue')
# defining the title
viz_test.title('Hours vs Score')
# x label
viz_test.xlabel('Hours')
# y label
viz_test.ylabel('Score')
# showing the graph
viz_test.show()
Output:
The first graph shows the training data points in red and the fitted model as the blue line.
The second graph shows the test inputs in red and the predicted outputs from our trained model as the blue line.
Performance evaluation of Linear regression
Performance evaluation, also known as evaluation metrics, is a measure of how well a model performs and how well it approximates the relationship. These metrics help us to know the accuracy and precision of our model in making predictions. There are multiple methods to evaluate the performance of linear regression; we will discuss some of them here:
Mean Squared Error (MSE): The most common metric for regression tasks is MSE. It is the average of the squared differences between the predicted and actual values. The following is the simple formula, where n is the number of samples, yᵢ is the actual value, and ŷᵢ is the predicted value:
MSE = (1/n) Σ (yᵢ - ŷᵢ)²
Mean Absolute Error (MAE): This is simply the average of the absolute differences between the target values and the values predicted by the model. It is not preferred in cases where outliers are prominent. The following is the simple formula for MAE:
MAE = (1/n) Σ |yᵢ - ŷᵢ|
Root Mean Squared Error (RMSE): This is the square root of the average of the squared differences between the predicted and actual values. The following is the simple formula for RMSE:
RMSE = √((1/n) Σ (yᵢ - ŷᵢ)²)
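These metrics are available in sklearn; the short sketch below computes them for the test-set predictions from earlier (assuming y_test and y_pred are already defined). Lower values mean the predictions are closer to the actual scores.
# Evaluating the model on the test set
from sklearn import metrics
import numpy as np

mse = metrics.mean_squared_error(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

print("MSE :", mse)
print("MAE :", mae)
print("RMSE:", rmse)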
Summary
Linear regression is the process of finding a line that best fits the data points available on the plot, so that we can use it to predict output values for inputs that are not present in the data set, with the assumption that those outputs fall on the line. In this tutorial, we learned about linear regression using Python. We discussed the mathematical calculations behind the linear regression algorithm and covered its implementation in Python in detail.
To summarize, this tutorial contains all the necessary details of implementing the linear regression algorithm using Python.
Further Reading
- Sklearn module
- Pandas module
- matplotlib module