Introduction to Linear Regression using Python
Linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative. In this tutorial, we will cover linear regression using Python in more detail. We will discuss why to use linear regression and will cover the mathematical calculations behind the scenes.
We will also implement linear regression in Python and explain each part of the code in detail. In a nutshell, this tutorial contains all the necessary details about linear regression using Python, from the mathematical calculations to the Python implementation.
Getting Started with Linear Regression using Python
As we have already discussed, linear regression is used to estimate the relationship between independent and dependent variables using a straight line. In the upcoming sections, we will see how this straight line is built by looking at both the mathematical part and the Python implementation.
For this tutorial, we are using Python 3.8.10. You can check the Python version installed on your system by using either of the following commands, depending on how Python was installed:
python3 --version
python --version
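Alternatively, you can check the version from inside Python itself using the standard sys module:
# Print the version of the running Python interpreter
import sys
print(sys.version)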
We will use the pip3 command to install the different modules that we need in this tutorial. Python 3.4+ includes pip3 by default on most operating systems. If your Python version is older than 3.4, you should upgrade your Python installation, which will also install pip3. You can upgrade pip itself by using the following command:
python3 -m pip install --upgrade pip
In this tutorial, we will also be working with a data set stored in a CSV file named Data_for_LR.csv. This file contains the following data; you can simply copy and paste it into your own CSV file.
Hours,Score
1,76
2,78
2,85
4,88
2,72
1,69
5,94
4,94
2,88
4,92
4,90
3,75
6,96
5,90
3,82
4,85
6,99
2,83
1,62
2,76
Notice that there are two columns, one as input (Hours) and the other as output (Score). This is a perfect data set for simple linear regression. In the upcoming sections, we will use this data set to train a linear regression model and predict the output.
The mathematical part of the Linear regression algorithm
Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables. In other words, given an input x, we would like to compute an output y. The equation used to train a simple linear regression model is given below:
Y = mX + C
Where Y is the output, X is the input, m is the slope of the line, and C is a constant value (the intercept). The algorithm calculates the values of m and C from the training data set and then uses them to calculate the output in the testing part.
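As a quick illustration of the equation, the short sketch below plugs made-up values for m and C into Y = mX + C; in practice these values are learned from the training data, as we will see later in the tutorial.
# Hypothetical slope and constant, just to show how the line produces outputs
m = 5.0   # slope (assumed for illustration)
C = 70.0  # constant / intercept (assumed for illustration)

for X in [1, 3, 6]:
    Y = m * X + C
    print(f"X = {X} -> Y = {Y}")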
Types of Linear Regression
As we have already discussed, linear regression is used to estimate the relationship between independent and dependent variables using a straight line. Based on the number of independent variables, linear regression can be divided into two main categories: simple linear regression and multiple linear regression.
Simple Linear Regression
Simple linear regression is a type of linear regression with only one variable as input. The data set for simple linear regression contains pairs of values, one as the input or independent variable and the other as the output or dependent variable. The equation for simple linear regression is as follows:
f(x) = M + cx
- f(x): is the output value
- M: is the constant value (intercept)
- c: is the slope or coefficient of the input
- x: is the input variable
Multiple Linear Regression
Multiple linear regression is a type of linear regression with two or more variables as input or independent variables. The data set for multiple linear regression contains one output along with multiple input variables. The equation for multiple linear regression is as follows:
f(x₁, x₂, ...) = M + b₁x₁ + b₂x₂ + ...
- f(x₁, x₂, ...): is the output value or the dependent variable
- M: is the constant value (intercept) of the equation
- b₁, b₂, ...: are the coefficients of the input variables
- x₁, x₂, ...: are the input values or independent variables
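To make the multiple-input form concrete, here is a minimal sketch that evaluates f(x₁, x₂) = M + b₁x₁ + b₂x₂ with made-up values for M, b₁, and b₂; in practice these values are learned from data, just as in the simple case.
# Hypothetical intercept and coefficients, just to show the shape of the equation
M = 60.0        # constant / intercept (assumed for illustration)
b = [4.0, 1.5]  # coefficients for x1 and x2 (assumed for illustration)

def predict(x1, x2):
    # f(x1, x2) = M + b1*x1 + b2*x2
    return M + b[0] * x1 + b[1] * x2

print(predict(2, 3))  # output for inputs x1=2, x2=3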
Python code for Linear regression with a full explanation
In this section, we will discuss the Python program for linear regression using a simple data set. We will first train our model on a training data set and then test it on a separate testing data set. But before jumping into the Python program, let us first install all the required modules.
Set up the environment for linear regression testing using Python
Let us now set up the environment for our Python program. First of all, we need to install the following Python modules using the pip or pip3 command.
pip3 install matplotlib
pip3 install pandas
pip3 install scikit-learn
We are using the following versions of these modules in this tutorial.
matplotlib: 3.3.4
sklearn : 0.24.1
pandas : 1.2.3
You can use the following Python code to check the versions of the modules installed on your system. See the Python code below:
import matplotlib
import sklearn
import pandas
print("matplotlib:", matplotlib.__version__)
print("sklearn :", sklearn.__version__)
print("pandas :", pandas.__version__)
Explanation of Python code for linear regression
Now let us jump into the explanation part and see how we can implement linear regression in Python. First, we have to import the following modules.
# Importing the required modules for linear regression using python
import matplotlib.pyplot as plt
import pandas as pd
We will use the matplotlib module to visualize the training and testing parts, and pandas to import the data set and split it into input values and output values. After importing the modules, we have to import the data set using the pandas module.
See the following Python program.
# Importing the dataset
dataset = pd.read_csv('Linear regression/Data_for_LR.csv')
# input values: every column except the last one (Hours)
X = dataset.iloc[:, :-1].values
# output values: the second column (Score)
y = dataset.iloc[:, 1].values
Here we have imported our data from the CSV file using pandas and stored it in a variable named dataset. Then we divide our data set into input (independent) values as X and output (dependent) values as y. Now let us split our data set into training and testing parts. For that, we have to import the data-splitting function from the sklearn library. See the following Python program.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# 30% data for testing, random state 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
Here we have defined four different variables: X_train, X_test, y_train, and y_test. Then we split the data into the training part and the testing part. Inside the train_test_split() method, we have set test_size to .3, which means we are assigning 30% of the whole data set to the testing variables and 70% to the training variables. We have also set random_state to 0, which fixes the seed used for shuffling, so the split into training and test parts is the same every time you run the code. You can print these testing and training data sets to inspect the split.
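For example, printing the shapes of the four variables is a quick way to confirm the 70/30 split (a small sketch using the variables defined above):
# With 20 rows in total, a 30% test size gives 6 testing and 14 training samples
print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("y_train:", y_train.shape)
print("y_test :", y_test.shape)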
Now the next step is to train our model using linear regression. Fortunately, we don't have to write the whole Python code from scratch to train the model. We can use the sklearn module to train our model with the linear regression algorithm. See the Python training part below:
# Importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
# initializing the algorithm
regressor = LinearRegression()
# Fitting Simple Linear Regression to the Training set
regressor.fit(X_train, y_train)
Notice that here we are importing LinearRegression from the sklearn module and training our model. While training the model, we only provide the X_train and y_train data sets (which are 70% of the original data set, as split above).
See the following predicting part:
# Predicting the Test set results
y_pred = regressor.predict(X_test)
So here we are providing the input values without outputs. The outputs will be predicted by our model and stored in the y_pred variable. We can also provide individual values to predict the output. For example, in our original data set, the input value 1 has an output of 76. Let us provide the same input to our trained model and see what the predicted output will be. See the example below:
# Predicting a single value
pred_score = regressor.predict([[1]])
print(pred_score)
Output:
[72.36290323]
Notice that the output is not exactly the same, but it is close to the actual value. Once we have the output, we can use different performance evaluation metrics to see how accurately our algorithm predicts the output values.
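A quick way to eyeball the test-set predictions is to place the actual and predicted values side by side; a minimal sketch using pandas, assuming the y_test and y_pred arrays from above:
# Compare actual test values with the model's predictions
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)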
Create Visualization Chart
Now let us go to the visualization part and visually compare the predicted and actual values. See the Python program for visualization below:
# Visualizing the Test set results
viz_test = plt
# red dot colors for actual values
viz_test.scatter(X_test, y_test, color='red')
# Blue line for the predicted values
viz_test.plot(X_test, regressor.predict(X_test), color='blue')
# defining the title
viz_test.title('Hours vs Score')
# x label
viz_test.xlabel('Hours')
# y label
viz_test.ylabel('Score')
# showing the graph
viz_test.show()
Output:
Notice that our model's predictions are not perfectly accurate; this is because we have very little data in the training part. Usually, it takes many more data points to train a good model, but for the sake of learning we took a small data set.
Linear regression using sklearn
The following is the full code for linear regression using Python. See the following example.
# Importing the required modules for linear regression using python
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Linear regression/Data_for_LR.csv')
# input values: every column except the last one (Hours)
X = dataset.iloc[:, :-1].values
# output values: the second column (Score)
y = dataset.iloc[:, 1].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
# 30% data for testing, random state 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
# Importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
# initializing the algorithm
regressor = LinearRegression()
# Fitting Simple Linear Regression to the Training set
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
# Visualizing the Training set results
viz_train = plt
viz_train.scatter(X_train, y_train, color='red')
viz_train.plot(X_train, regressor.predict(X_train), color='blue')
viz_train.title('Hours vs Score')
viz_train.xlabel('Hours')
viz_train.ylabel('Score')
viz_train.show()
# Visualizing the Test set results
viz_test = plt
# red dot colors for actual values
viz_test.scatter(X_test, y_test, color='red')
# Blue line for the predicted values
viz_test.plot(X_test, regressor.predict(X_test), color='blue')
# defining the title
viz_test.title('Hours vs Score')
# x label
viz_test.xlabel('Hours')
# y label
viz_test.ylabel('Score')
# showing the graph
viz_test.show()
Output:
The first graph shows the training data points in red and the fitted model as the blue line.
The second graph shows the test inputs in red and the predicted outputs from our trained model as the blue line.
Performance evaluation of Linear regression
Performance evaluation, also known as evaluation metrics, is a measure of how well a model performs and how well it approximates the relationship. These metrics help us to know the accuracy and precision of our model in making predictions. There are multiple methods to evaluate the performance of linear regression; we will discuss some of them here:
Mean Squared Error (MSE): The most common metric for regression tasks is MSE. It is the average of the squared differences between the predicted and actual values. The following is the simple formula, where n is the number of samples, yᵢ is the actual value, and ŷᵢ is the predicted value:
MSE = (1/n) Σ (yᵢ - ŷᵢ)²
Mean Absolute Error (MAE): This is simply the average of the absolute differences between the target values and the values predicted by the model. It is not preferred in cases where outliers are prominent. The following is the simple formula for MAE:
MAE = (1/n) Σ |yᵢ - ŷᵢ|
Root Mean Squared Error (RMSE): This is the square root of the average of the squared differences between the predicted and actual values. The following is the simple formula for RMSE:
RMSE = √((1/n) Σ (yᵢ - ŷᵢ)²)
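These metrics are available in sklearn; the short sketch below computes them for the test-set predictions from earlier (assuming y_test and y_pred are already defined). Lower values mean the predictions are closer to the actual scores.
# Evaluating the model on the test set
from sklearn import metrics
import numpy as np

mse = metrics.mean_squared_error(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

print("MSE :", mse)
print("MAE :", mae)
print("RMSE:", rmse)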
Summary
Linear regression is the process of finding a line that best fits the data points available on the plot, so that we can use it to predict output values for inputs that are not present in the data set, with the assumption that those outputs fall on the line. In this tutorial, we learned about linear regression using Python. We discussed the mathematical calculations behind the linear regression algorithm and covered its implementation in Python in detail.
To summarize, this tutorial contains all the necessary details of implementing the linear regression algorithm using Python.
Further Reading
- Sklearn module
- Pandas module
- matplotlib module