Linear regression is a supervised learning algorithm (and classical statistical method) that models a dependent variable (target) as a function of one or more independent variables (features) by finding the line (or hyperplane) that best “fits” the data. Linear regression assumes the target is continuous (a number); logistic regression assumes the target is discrete (one of a finite number of classes).

Simple Linear Regression

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Multiple Linear Regression

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Where:

  • $\mathbf{y}$ is the vector of dependent variable observations
  • $\mathbf{X}$ is the matrix of independent variables (including a column of 1s for the intercept)
  • $\boldsymbol{\beta}$ is the vector of coefficients
  • $\boldsymbol{\varepsilon}$ is the vector of error terms

Learning the coefficients

Ordinary Least Squares (OLS) / Analytical solution

OLS is the most common method for estimating the parameters of a linear regression model.
Objective: Minimize the sum of squared residuals:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

Setting the gradient to zero yields the closed-form solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

Where:

  • $\hat{\boldsymbol{\beta}}$ is the vector of estimated coefficients
  • $\mathbf{X}^\top$ is the transpose of $\mathbf{X}$
  • $(\mathbf{X}^\top \mathbf{X})^{-1}$ is the inverse of $\mathbf{X}^\top \mathbf{X}$
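As a minimal illustration, here is a NumPy sketch of the closed-form solution on synthetic data (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 1, size=n)

# Design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones(n), x])

# Closed-form OLS; solving the normal equations is numerically
# safer than forming the explicit inverse of X^T X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [2, 3]
```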

Properties of OLS estimators under the classical assumptions:

  1. Best Linear Unbiased Estimator (BLUE), per the Gauss–Markov theorem
  2. Consistent
  3. Asymptotically normal

Estimation of σ²

The error variance is estimated from the residuals:

$$\hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n - p}$$

where $n$ is the number of observations and $p$ is the number of estimated parameters (including the intercept).

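Continuing the NumPy sketch above, the estimate follows directly from the residuals (here $p = 2$: intercept and slope):

```python
# Residuals of the fitted model from the previous sketch
resid = y - X @ beta_hat

# Unbiased estimate of the error variance, dividing by n - p
sigma2_hat = resid @ resid / (n - X.shape[1])
print(sigma2_hat)  # approximately 1, the true noise variance
```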
Coefficient validity
In statistics, the coefficients are usually paired with their p-values. These p-values come from null hypothesis significance tests: t-tests measure whether a given coefficient is significantly different from zero (the null hypothesis is that a particular coefficient $\beta_j$ equals zero), while F-tests measure whether any of the terms in the regression model are significantly different from zero.
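For example, statsmodels reports these tests out of the box; a sketch on the same kind of synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)

X = sm.add_constant(x)       # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())       # per-coefficient t-tests and p-values, plus the model F-test
```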

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is an alternative method to OLS for estimating the parameters of a linear regression model. It’s particularly useful when dealing with non-normal error distributions. MLE finds the parameter values that maximize the likelihood of observing the given data.

Likelihood Function for Linear Regression
Assuming normally distributed errors, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$:

$$L(\boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2}{2\sigma^2}\right)$$

Log-likelihood:

$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$$

Comparison with OLS

  • Under normality assumption, MLE and OLS produce the same estimates for β
  • MLE can be extended to non-normal error distributions
  • MLE provides a framework for hypothesis testing and model selection (e.g., likelihood ratio tests)
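A minimal sketch of MLE by numerical optimization, assuming normal errors (the parameterization and optimizer choice here are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
n = len(y)

def neg_log_likelihood(params):
    *beta, log_sigma = params          # log-parameterize sigma to keep it positive
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ np.asarray(beta)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

result = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")
print(result.x[:2])  # beta estimates; under normality these match OLS, about [2, 3]
```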

Gradient Descent for Linear Regression

When dealing with large datasets, the analytical solution may be computationally expensive, since it requires forming and inverting $\mathbf{X}^\top \mathbf{X}$. Gradient descent is an iterative optimization algorithm that minimizes the cost function

$$J(\boldsymbol{\beta}) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$$

by repeatedly stepping in the direction of the negative gradient:

$$\boldsymbol{\beta} \leftarrow \boldsymbol{\beta} + \frac{\alpha}{n} \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

where $\alpha$ is the learning rate.
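A minimal batch gradient descent sketch (the learning rate and iteration count are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
n = len(y)

beta = np.zeros(X.shape[1])
alpha = 0.01                              # learning rate
for _ in range(10_000):
    grad = -X.T @ (y - X @ beta) / n      # gradient of J(beta)
    beta -= alpha * grad

print(beta)  # converges toward the OLS solution, approximately [2, 3]
```

In practice, stochastic or mini-batch variants are used when even one full pass over the data per step is too expensive.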

Assumptions

  • Linearity: The relationship between $\mathbf{X}$ and $\mathbf{y}$ is linear.
  • Homoscedasticity: Constant variance of the residuals, $\mathrm{Var}(\varepsilon_i) = \sigma^2$. The spread of the errors should be consistent for all values of the features, with no discernible patterns.
  • No Multicollinearity: Independent variables shouldn’t be highly correlated with each other. This can be checked using correlation matrices or the Variance Inflation Factor (VIF); see the sketch after this list.
  • Normality: Residuals are normally distributed, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. This can be checked through Q-Q plots or histograms of the residuals, or through statistical tests such as the Kolmogorov–Smirnov test. Needed for MLE and for exact small-sample inference, not for the OLS estimates themselves.
  • Independence: Observations are independent of each other.
  • Exogeneity: $\mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = 0$, meaning the errors are uncorrelated with the predictors.
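A sketch of the VIF check mentioned above, using statsmodels (the data and the common rule-of-thumb threshold of 5–10 are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1

X = sm.add_constant(df).to_numpy()
for i, col in enumerate(df.columns, start=1):    # index 0 is the constant column
    print(col, variance_inflation_factor(X, i))  # x1 and x3 should show large VIFs
```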

Consequences of violating assumptions:

  • Violating linearity: Biased and inconsistent estimates
  • Violating independence: Incorrect standard errors, inefficient estimates
  • Violating homoscedasticity: Inefficient estimates, incorrect standard errors
  • Violating normality: Hypothesis tests may be invalid for small samples
  • Perfect multicollinearity: Unable to estimate unique coefficients

Model Evaluation Metrics

  1. R-squared (Coefficient of Determination): $R^2 = 1 - \frac{SSR}{SST}$, where $SSR = \sum_{i}(y_i - \hat{y}_i)^2$ is the sum of squared residuals and $SST = \sum_{i}(y_i - \bar{y})^2$ is the total sum of squares
  2. Adjusted R-squared: $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where $k$ is the number of predictors
  3. Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
  4. Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
  5. Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$
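All of these are one-liners with scikit-learn’s metrics module (a sketch with made-up predictions):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.3, 7.2, 10.4])

print(r2_score(y_true, y_pred))              # R-squared
mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse))                     # MSE and RMSE
print(mean_absolute_error(y_true, y_pred))   # MAE
```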

Hypothesis Testing

  1. t-test for individual coefficients:

$$t = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}$$

which follows a $t$ distribution with $n - p$ degrees of freedom under the null hypothesis $\beta_j = 0$.

  2. F-test for overall model significance:

$$F = \frac{(SST - SSR)/k}{SSR/(n - k - 1)}$$

which tests $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ against the alternative that at least one coefficient is non-zero.

Confidence Intervals

CI for $\beta_j$:

$$\hat{\beta}_j \pm t_{\alpha/2,\, n - p} \cdot \mathrm{SE}(\hat{\beta}_j)$$
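A NumPy/SciPy sketch computing standard errors, two-sided p-values, and 95% confidence intervals directly from the closed-form quantities above (synthetic data as before):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # SE(beta_j)

t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)        # two-sided t-tests
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se,
                      beta_hat + t_crit * se])               # 95% CIs
print(p_values)
print(ci)
```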

Extensions and Regularization

  1. Ridge Regression (L2): adds the penalty $\lambda \sum_j \beta_j^2$ to the least-squares objective, shrinking coefficients toward zero
  2. Lasso Regression (L1): adds the penalty $\lambda \sum_j \lvert \beta_j \rvert$, which can drive some coefficients exactly to zero (implicit feature selection)
  3. Elastic Net: Combination of the L1 and L2 penalties
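A scikit-learn sketch of all three (alpha plays the role of $\lambda$ and is illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1: zeros out weak coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # blend of L1 and L2
print(ridge.coef_)
print(lasso.coef_)   # note the exact zeros
print(enet.coef_)
```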

Modeling complex relationships

Polynomial Regression

Extends linear regression to model non-linear relationships by adding powers of the features:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_h x^h + \varepsilon$$

The model remains linear in the coefficients, so it can still be fit with OLS.

Interaction Terms

Allows for modeling the combined effect of two or more variables:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$
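Both kinds of terms can be generated with scikit-learn’s PolynomialFeatures and then fit with plain linear regression (degree and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = 1 + 2 * X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(size=200)

# degree=2 adds x1^2, x2^2, and the x1*x2 interaction;
# interaction_only=True would keep just the interaction terms
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R-squared on the training data
```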