Linear regression is a supervised learning algorithm (equivalently, a statistical method) that models a dependent variable (target) as a function of one or more independent variables (features) by finding the line (or hyperplane) that best "fits" the data. Linear regression assumes the target is continuous (a number); logistic regression assumes the target is discrete (has a finite number of classes).
Simple Linear Regression
y=β0+β1x+ε
Multiple Linear Regression
y=Xβ+ε
Where:
y is the n×1 vector of dependent variable observations
X is the n×(p+1) matrix of independent variables (including a column of 1s for the intercept)
β is the (p+1)×1 vector of coefficients
ε is the n×1 vector of error terms
Learning the coefficients
Ordinary Least Squares (OLS) / Analytical solution
OLS is the most common method for estimating the parameters of a linear regression model.
Objective: Minimize the sum of squared residuals:
min Σᵢ (yᵢ − (β0 + β1x1i + ... + βpxpi))²
The closed-form solution is:
β̂ = (X′X)⁻¹X′y
Where:
β^ is the vector of estimated coefficients
X′ is the transpose of X
(X′X)−1 is the inverse of X′X
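The closed-form solution can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data (the true coefficients β0 = 2, β1 = 3 are chosen arbitrarily); solving the normal equations with `np.linalg.solve` is preferred over forming the explicit inverse for numerical stability:

```python
import numpy as np

# Synthetic data: y = 2 + 3x + noise (illustrative values, not from the text)
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])  # column of 1s for the intercept

# Solve the normal equations (X'X) beta = X'y directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimates close to the true [2, 3]
```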
Properties of OLS estimators under the classical assumptions:
Best Linear Unbiased Estimator (BLUE)
Consistent
Asymptotically normal
Estimation of σ²
σ̂² = SSE / n
where SSE = Σᵢ (yᵢ − ŷᵢ)² is the residual sum of squares (this is the MLE; dividing by n − p − 1 instead gives the unbiased estimator).
Coefficient validity
In statistics, the coefficients are usually paired with their p-values. These p-values come from null hypothesis statistical tests: t-tests measure whether a given coefficient is significantly different from zero (the null hypothesis being that a particular coefficient βi equals zero), while F-tests measure whether any of the terms in a regression model are significantly different from zero.
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation is an alternative method to OLS for estimating the parameters of a linear regression model. It’s particularly useful when dealing with non-normal error distributions. MLE finds the parameter values that maximize the likelihood of observing the given data.
Likelihood Function for Linear Regression
Assuming normally distributed errors:
L(β, σ² | y, X) = ∏ᵢ (1 / √(2πσ²)) · exp(−(yᵢ − xᵢ′β)² / (2σ²))
Under normality assumption, MLE and OLS produce the same estimates for β
MLE can be extended to non-normal error distributions
MLE provides a framework for hypothesis testing and model selection (e.g., likelihood ratio tests)
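A minimal sketch of MLE for linear regression, using synthetic data and SciPy's general-purpose optimizer to minimize the negative log-likelihood (the true coefficients and the parameterization via log σ are illustrative choices, not from the text). Under normal errors the result should coincide with OLS:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y = 1 + 2x + noise (illustrative values)
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.8, size=n)
X = np.column_stack([np.ones(n), x])

def neg_log_likelihood(params):
    # Optimize log(sigma) so sigma stays positive
    beta, log_sigma = params[:2], params[2]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + (resid @ resid) / (2 * sigma2)

res = minimize(neg_log_likelihood, x0=np.zeros(3))
beta_mle = res.x[:2]

# Under normality, MLE and OLS give the same beta estimates
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_mle, beta_ols)
```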
Gradient Descent for Linear Regression
When dealing with large datasets, the analytical solution may be computationally expensive. Gradient descent is an iterative optimization algorithm used to find the minimum of the cost function.
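A minimal batch gradient descent sketch on synthetic data, repeatedly applying the update β ← β − α·∇J(β) for the MSE cost J(β) = (1/n)‖y − Xβ‖² (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

# Synthetic data: y = 4 - 1.5x + noise (illustrative values)
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 4.0 - 1.5 * x + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x])

beta = np.zeros(2)
lr = 0.1  # learning rate (step size)
for _ in range(2000):
    # Gradient of the mean squared error cost with respect to beta
    grad = -2.0 / n * X.T @ (y - X @ beta)
    beta -= lr * grad

print(beta)  # converges to the OLS solution
```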
Assumptions
Linearity: The relationship between X and Y is linear.
Homoscedasticity: Constant variance of residuals (Var(ε∣X)=σ²). This means the spread of the errors is consistent for all values of the features; there should be no discernible patterns in the residuals.
No Multicollinearity: Independent variables shouldn’t be highly correlated with each other. This can be checked using correlation matrices or Variance Inflation Factor (VIF).
Normality: Residuals are normally distributed (ε ~ N(0, σ²)). This can be checked through Q-Q plots or histograms of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test. Strictly required only for MLE and for exact small-sample inference.
Independence: Observations are independent of each other
Exogeneity: E(ε∣X)=0, meaning the errors are uncorrelated with the predictors.
Consequences of violating assumptions:
Violating linearity: Biased and inconsistent estimates
Violating independence: Incorrect standard errors, inefficient estimates
Violating homoscedasticity: Inefficient estimates, incorrect standard errors
Violating normality: Hypothesis tests may be invalid for small samples
Perfect multicollinearity: Unable to estimate unique coefficients
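The multicollinearity check mentioned above can be computed from first principles: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A sketch on synthetic data (the near-collinear construction of x2 is an illustrative assumption):

```python
import numpy as np

# Synthetic features: x2 is nearly collinear with x1, x3 is independent
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress feature j on all the others (with intercept) and compute R^2
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    resid = X[:, j] - Z @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 show inflated VIFs
```

A common rule of thumb flags VIF values above 5 or 10 as problematic.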
Model Evaluation Metrics
R-squared (Coefficient of Determination): R² = 1 − SSR/SST, where SSR is the sum of squared residuals and SST is the total sum of squares
Adjusted R-squared: Adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1)
Mean Squared Error (MSE): MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Root Mean Squared Error (RMSE): RMSE = √MSE
Mean Absolute Error (MAE): MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
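These metrics can be computed directly from their definitions; a small worked example on made-up numbers:

```python
import numpy as np

# Toy predictions (illustrative values)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

sse = np.sum((y - y_hat) ** 2)       # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - sse / sst
mse = sse / len(y)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
print(r2, mse, rmse, mae)  # 0.975 0.125 0.3536 0.25
```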
Hypothesis Testing
t-test for individual coefficients: H0: βᵢ = 0, with test statistic t = β̂ᵢ / SE(β̂ᵢ)
F-test for overall model significance: F = (SSR/p) / (SSE/(n − p − 1)), where here SSR denotes the regression (explained) sum of squares and SSE the residual sum of squares
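Both test statistics can be computed from scratch; a sketch on synthetic data (the coefficients, sample size, and use of SciPy's t-distribution for two-sided p-values are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Synthetic data: y = 1 + 2*x1 + noise, x2 has no true effect (illustrative)
rng = np.random.default_rng(4)
n, p = 100, 2
X_raw = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_raw[:, 0] + rng.normal(size=n)
X = np.column_stack([np.ones(n), X_raw])

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sse = resid @ resid
sigma2_hat = sse / (n - p - 1)  # unbiased residual variance
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

t_stats = beta / se  # t-statistic for H0: beta_i = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)

ssr = np.sum((X @ beta - y.mean()) ** 2)  # explained sum of squares
F = (ssr / p) / (sse / (n - p - 1))
print(t_stats, p_values, F)
```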