Getting Started with Linear Regression in Matlab (From Scratch)

2022-04-18

Tags: matlab, linear-regression, machine-learning, mathematics

There is no better way to understand a machine learning algorithm than to build it yourself, with nothing between you and the mathematics. Toolboxes are convenient — and I use them in production — but they hide the very thing you need to see when you are learning: the gradient, the cost surface, the moment the parameters converge. What follows is a from-scratch implementation of single-variable linear regression in Matlab, no toolboxes, no shortcuts.

Initialization

Let’s start with proper script initialization:

% Linear Regression with One Variable
% Clear environment
clear;
clc;
close all;

% Add library path for our custom functions
addpath('lib');

It’s good practice to clear the workspace variables and the command window, and to close any open figures, before starting. The lib folder will contain our custom functions.

What is Linear Regression?

From Wikipedia:

In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).

In its single-variable form, linear regression models the relationship between two continuous variables with a straight line. Often, the objective is to predict the value of an output (response) variable from the value of an input (predictor) variable.

The Hypothesis Function

The idea is to fit a linear function to a given dataset. This function is called the Hypothesis:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

Where:

$\theta_0$ is the y-intercept (bias)
$\theta_1$ is the slope
$x$ is the input variable
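
As a small illustration, the hypothesis can be written as an anonymous function. The name h and the parameter values below are assumptions chosen for the example, not part of the original script.

```matlab
% Hypothesis h_theta(x) = theta_0 + theta_1 * x as an anonymous function.
% The parameter values are arbitrary, chosen only for illustration.
theta0 = 0.5;   % y-intercept (bias)
theta1 = 2;     % slope
h = @(x) theta0 + theta1 .* x;

x = [0; 1; 2];
predictions = h(x);   % evaluates the line at each input
disp(predictions');
```

Changing theta0 shifts the line up or down, while changing theta1 tilts it.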

The Cost Function

The cost function measures half the mean squared error of the hypothesis over the m training samples:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

Our goal is to minimise the cost function, which gives the line that best approximates the m samples in the dataset.
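
The same cost can also be written in matrix form, which is the shape the implementation takes once a column of ones is added to the inputs. Here $X$ denotes the $m \times 2$ matrix whose rows are $[1, x^{(i)}]$, a notation assumed for this sketch:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^\top (X\theta - y), \qquad X = \begin{bmatrix} 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

Each entry of $X\theta - y$ is one prediction error, so the product sums the squared errors exactly as the summation does.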

Implementing the Cost Function

Let’s implement the cost function and save it as lib/compute_cost.m:

function J = compute_cost(X, y, theta)
    % COMPUTE_COST Compute cost for linear regression
    %   J = COMPUTE_COST(X, y, theta) computes the cost of using theta as the
    %   parameter for linear regression to fit the data points in X and y

    m = length(y);  % number of training examples

    predictions = X * theta;
    sqrErrors = (predictions - y).^2;

    J = 1/(2*m) * sum(sqrErrors);
end
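
A quick way to gain confidence in the function is to check it on cases where the answer is known by hand. The sketch below uses an anonymous function, cost_fn, that mirrors compute_cost so the snippet runs on its own; the name and the toy data are assumptions for illustration.

```matlab
% Stand-in for lib/compute_cost.m so this snippet is self-contained.
cost_fn = @(X, y, theta) sum((X * theta - y).^2) / (2 * numel(y));

% Toy data lying exactly on the line y = 1 + 2x.
x = [0; 1; 2; 3];
X = [ones(numel(x), 1), x];   % design matrix with a column of ones
y = 1 + 2 * x;

J_exact = cost_fn(X, y, [1; 2]);   % true parameters: every error is zero
J_off   = cost_fn(X, y, [0; 2]);   % intercept off by 1: every error is -1

fprintf('Cost at true parameters: %g\n', J_exact);
fprintf('Cost with intercept off by one: %g\n', J_off);
```

With every error equal to -1 over four samples, the cost is 4 / (2 * 4) = 0.5, which matches the formula.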

Visualizing the Data

Let’s create a basic dataset and visualize it:

% Sample dataset
X = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];

% Plot data
figure;
plot(X, y, 'rx', 'MarkerSize', 10);
xlabel('x');
ylabel('y');
title('Training Data');

Testing Different Hypotheses

Let’s try a simple hypothesis: $f(x) = x$ (i.e., $\theta_0 = 0$, $\theta_1 = 1$):

% Add column of ones for theta_0
X_with_ones = [ones(length(X), 1), X];
theta = [0; 1];  % theta_0 = 0, theta_1 = 1

cost = compute_cost(X_with_ones, y, theta);
% cost = 0.0250

This line passes through four of the five data points, so the cost is small but not zero. Let’s try another hypothesis: $f(x) = \frac{1}{2}x + \frac{1}{2}$:

theta = [0.5; 0.5];  % theta_0 = 0.5, theta_1 = 0.5
cost = compute_cost(X_with_ones, y, theta);
% cost = 0.6750

The cost is much higher, so this line is a worse fit. Comparing costs is how we rank hypotheses, and minimisation means finding the parameters with the lowest cost. But adjusting parameters by hand is impractical, which is where gradient descent comes in.
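
Before automating the search, it can help to see what the cost looks like as a function of a single parameter. The sketch below fixes $\theta_0 = 0$ and sweeps $\theta_1$ over a grid for the sample dataset; the grid range and the inline cost_fn are assumptions made so the snippet runs on its own.

```matlab
% Sample dataset from above, with a column of ones for theta_0.
x = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];
X = [ones(numel(x), 1), x];

% Stand-in for lib/compute_cost.m.
cost_fn = @(X, y, theta) sum((X * theta - y).^2) / (2 * numel(y));

% Sweep the slope while holding the intercept at zero.
theta1_grid = linspace(0, 2, 201);
J_grid = arrayfun(@(t1) cost_fn(X, y, [0; t1]), theta1_grid);

[J_min, idx] = min(J_grid);
fprintf('Lowest cost on the grid: %f at theta_1 = %.2f\n', J_min, theta1_grid(idx));

figure;
plot(theta1_grid, J_grid, 'b-', 'LineWidth', 2);
xlabel('\theta_1'); ylabel('Cost J');
title('Cost as a function of the slope (\theta_0 = 0)');
```

The curve is a parabola with a single minimum, which is why a method that repeatedly steps downhill can find it.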

Gradient Descent

From Wikipedia:

Gradient descent is a first-order iterative optimisation algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient of the function at the current point.

The Update Rule

For each iteration, we update the parameters:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

Where $\alpha$ is the learning rate.

The partial derivatives work out to:

$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})$$

$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$$
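
For completeness, here is a sketch of where these come from. Applying the chain rule to a single squared term, and using $h_\theta(x) = \theta_0 + \theta_1 x$:

$$\frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \left(h_\theta(x^{(i)}) - y^{(i)}\right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}, \qquad \frac{\partial h_\theta(x^{(i)})}{\partial \theta_0} = 1, \quad \frac{\partial h_\theta(x^{(i)})}{\partial \theta_1} = x^{(i)}$$

Averaging over the $m$ samples gives the two expressions. The 2 produced by differentiating the square cancels the $\frac{1}{2}$ in the cost, which is the reason that factor is there.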

Implementing the Derivatives

Save this as lib/compute_derivatives.m:

function [d_theta0, d_theta1] = compute_derivatives(X, y, theta)
    % COMPUTE_DERIVATIVES Compute partial derivatives for gradient descent

    m = length(y);
    predictions = X * theta;
    errors = predictions - y;

    d_theta0 = (1/m) * sum(errors);
    d_theta1 = (1/m) * sum(errors .* X(:, 2));
end
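
A common way to verify hand-derived gradients is to compare them against a finite-difference approximation of the cost. The sketch below does this with inline stand-ins for compute_cost and compute_derivatives; the names cost_fn and grad_fn, the test point, and the step size are all assumptions for illustration.

```matlab
% Sample dataset with a column of ones.
x = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];
X = [ones(numel(x), 1), x];
m = numel(y);

% Stand-ins for compute_cost and compute_derivatives.
cost_fn = @(theta) sum((X * theta - y).^2) / (2 * m);
grad_fn = @(theta) (X' * (X * theta - y)) / m;   % [d_theta0; d_theta1]

theta = [0.5; 0.5];   % arbitrary test point
step = 1e-6;          % finite-difference step size (assumed)

analytic = grad_fn(theta);
numeric = zeros(2, 1);
for j = 1:2
    e = zeros(2, 1);
    e(j) = step;
    % Central difference approximation of dJ/dtheta_j.
    numeric(j) = (cost_fn(theta + e) - cost_fn(theta - e)) / (2 * step);
end

fprintf('Analytic: %f, %f\n', analytic(1), analytic(2));
fprintf('Numeric:  %f, %f\n', numeric(1), numeric(2));
```

If the two rows agree to several decimal places, the derivative formulas and their implementation are very likely consistent with the cost function.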

The Gradient Descent Function

Save this as lib/gradient_descent.m:

function [theta, J_history] = gradient_descent(X, y, theta, alpha, num_iters)
    % GRADIENT_DESCENT Performs gradient descent to learn theta
    %   Returns the learned theta and the cost recorded after each iteration

    J_history = zeros(num_iters, 1);

    for iter = 1:num_iters
        % Compute both derivatives from the current theta before updating,
        % so the two parameters are updated simultaneously
        [d_theta0, d_theta1] = compute_derivatives(X, y, theta);

        theta(1) = theta(1) - alpha * d_theta0;
        theta(2) = theta(2) - alpha * d_theta1;

        J_history(iter) = compute_cost(X, y, theta);
    end
end
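
The two scalar updates can also be collapsed into a single matrix expression. The script below is a sketch of that vectorised form, written inline so it runs on its own; it is meant as an equivalent illustration, not a replacement for the function above.

```matlab
% Vectorised gradient descent on the sample dataset (illustrative sketch).
x = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];
X = [ones(numel(x), 1), x];
m = numel(y);

theta = [0; 0];
alpha = 0.01;
num_iters = 1000;
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    errors = X * theta - y;                       % m-by-1 residuals
    theta = theta - (alpha / m) * (X' * errors);  % updates both parameters at once
    J_history(iter) = sum((X * theta - y).^2) / (2 * m);
end

fprintf('Theta: %f, %f\n', theta(1), theta(2));
```

Because X' * errors computes both sums in one product from the same residuals, the update stays simultaneous without any temporary variables.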

Putting It All Together

% Clear environment
clear; clc; close all;
addpath('lib');

% Dataset
X = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];

% Add ones column
X_with_ones = [ones(length(X), 1), X];

% Initialize parameters
theta = [0; 0];
alpha = 0.01;        % Learning rate
iterations = 1000;

% Run gradient descent
[theta, J_history] = gradient_descent(X_with_ones, y, theta, alpha, iterations);

% Display results
fprintf('Theta found: %f, %f\n', theta(1), theta(2));
fprintf('Final cost: %f\n', J_history(end));

% Plot results
figure;
subplot(1, 2, 1);
plot(X, y, 'rx', 'MarkerSize', 10);
hold on;
plot(X, X_with_ones * theta, 'b-', 'LineWidth', 2);
xlabel('x'); ylabel('y');
title('Linear Regression Fit');
legend('Training data', 'Linear regression');

subplot(1, 2, 2);
plot(1:iterations, J_history, 'b-', 'LineWidth', 2);
xlabel('Iterations'); ylabel('Cost J');
title('Cost Function Convergence');
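
The learning rate has a large effect on how this curve behaves. The sketch below reruns the same update for several values of alpha and plots the cost on a log scale; the specific rates and the iteration count are assumptions picked to show slow, steady, oscillating, and diverging behaviour on this dataset.

```matlab
x = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];
X = [ones(numel(x), 1), x];
m = numel(y);

alphas = [0.001, 0.01, 0.1, 0.2];   % learning rates to compare (assumed values)
num_iters = 50;
J_all = zeros(num_iters, numel(alphas));

for k = 1:numel(alphas)
    theta = [0; 0];
    for iter = 1:num_iters
        theta = theta - (alphas(k) / m) * (X' * (X * theta - y));
        J_all(iter, k) = sum((X * theta - y).^2) / (2 * m);
    end
end

figure;
semilogy(1:num_iters, J_all, 'LineWidth', 2);
xlabel('Iterations'); ylabel('Cost J (log scale)');
legend(arrayfun(@(a) sprintf('\\alpha = %g', a), alphas, 'UniformOutput', false));
title('Effect of the learning rate');
```

For this dataset the cost grows without bound once alpha exceeds roughly 0.17, a threshold set by the curvature of the cost surface (2 divided by the largest eigenvalue of X'X/m).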

Results

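Gradient descent drives $\theta_0$ and $\theta_1$ toward the values that minimise the cost function, giving the line that best fits the training data. For this dataset the exact least-squares optimum is $\theta_0 = -0.1$, $\theta_1 = 1$, with a cost of 0.02. With $\alpha = 0.01$ the intercept is the slowest parameter to settle, so after 1000 iterations the result should be close to, but not exactly at, that optimum; the flattening cost curve in the right-hand plot shows the approach.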
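
One way to check the result is to compare it against the exact least-squares solution, which MATLAB's backslash operator computes directly for an overdetermined system. This is a sanity check rather than part of the from-scratch approach, and the variable names are assumptions for illustration.

```matlab
x = [1; 2; 3; 4; 5];
y = [1; 2; 2.5; 4; 5];
X = [ones(numel(x), 1), x];

% Least-squares solution via the backslash operator.
theta_exact = X \ y;
J_exact = sum((X * theta_exact - y).^2) / (2 * numel(y));

fprintf('Closed-form theta: %f, %f\n', theta_exact(1), theta_exact(2));
fprintf('Minimum cost: %f\n', J_exact);
```

The gap between these values and the ones printed by the gradient descent script shows how far the iteration still is from the minimum.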

Reflection

These implementations are deliberately naive — and that is the point. Vectorisation, adaptive learning rates, regularisation, and multivariate generalisation are all natural next steps, and any production system would demand them. But I find that the moment you hand everything to a library, you lose the intuition for why the loss curve flattens, or why a poorly chosen learning rate makes the parameters oscillate instead of converge. Implementing gradient descent by hand — watching the cost drop iteration by iteration — is the closest thing to understanding the algorithm with your hands, not just your eyes. The toolbox can come later. The intuition has to come first.

Implementing machine learning fundamentals without toolboxes.

Achraf SOLTANI — April 18, 2022

The Sanctuary