Linear regression is a supervised learning algorithm (equivalently, a statistical method) that models a dependent variable (target) as a function of one or more independent variables (features) by finding the line (or hyperplane) that best "fits" the data. Linear regression assumes the target is continuous (a number); logistic regression assumes the target is discrete (has a finite number of classes).
Simple Linear Regression
y=β0+β1x+ε
Multiple Linear Regression
y=Xβ+ε
Where:
y is the n×1 vector of dependent variable observations
X is the n×(p+1) matrix of independent variables (including a column of 1s for the intercept)
β is the (p+1)×1 vector of coefficients
ε is the n×1 vector of error terms
Learning the coefficients
Ordinary Least Squares (OLS) / Analytical solution
OLS is the most common method for estimating the parameters of a linear regression model.
Objective: Minimize the sum of squared residuals:
min Σᵢ (yᵢ − (β0 + β1x1i + ... + βpxpi))²
The closed-form solution is:
β̂ = (X′X)⁻¹X′y
Where:
β^ is the vector of estimated coefficients
X′ is the transpose of X
(X′X)−1 is the inverse of X′X
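The closed-form solution can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data (the true coefficients β0 = 2, β1 = 3 are chosen arbitrarily); solving the normal equations with `np.linalg.solve` is preferred over forming the explicit inverse for numerical stability:

```python
import numpy as np

# Synthetic data: y = 2 + 3x + noise (illustrative values, not from the text)
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])  # column of 1s for the intercept

# Solve the normal equations (X'X) beta = X'y directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimates close to the true [2, 3]
```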
Properties of OLS estimators under the classical assumptions:
Best Linear Unbiased Estimator (BLUE)
Consistent
Asymptotically normal
Estimation of σ²
σ̂² = SSE / n
where SSE = Σᵢ (yᵢ − ŷᵢ)² is the residual sum of squares (this is the MLE; dividing by n − p − 1 instead gives the unbiased estimator).
Coefficient validity
In statistics, the coefficients are usually paired with their p-values. These p-values come from null hypothesis statistical tests: t-tests measure whether a given coefficient is significantly different from zero (the null hypothesis being that a particular coefficient βi equals zero), while F-tests measure whether any of the terms in a regression model are significantly different from zero.
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation is an alternative method to OLS for estimating the parameters of a linear regression model. It’s particularly useful when dealing with non-normal error distributions. MLE finds the parameter values that maximize the likelihood of observing the given data.
Likelihood Function for Linear Regression
Assuming normally distributed errors:
L(β, σ² | y, X) = ∏ᵢ (1 / √(2πσ²)) · exp(−(yᵢ − xᵢ′β)² / (2σ²))
Under normality assumption, MLE and OLS produce the same estimates for β
MLE can be extended to non-normal error distributions
MLE provides a framework for hypothesis testing and model selection (e.g., likelihood ratio tests)
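A minimal sketch of MLE for linear regression, using synthetic data and SciPy's general-purpose optimizer to minimize the negative log-likelihood (the true coefficients and the parameterization via log σ are illustrative choices, not from the text). Under normal errors the result should coincide with OLS:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: y = 1 + 2x + noise (illustrative values)
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.8, size=n)
X = np.column_stack([np.ones(n), x])

def neg_log_likelihood(params):
    # Optimize log(sigma) so sigma stays positive
    beta, log_sigma = params[:2], params[2]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + (resid @ resid) / (2 * sigma2)

res = minimize(neg_log_likelihood, x0=np.zeros(3))
beta_mle = res.x[:2]

# Under normality, MLE and OLS give the same beta estimates
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_mle, beta_ols)
```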
Gradient Descent for Linear Regression
When dealing with large datasets, the analytical solution may be computationally expensive. Gradient descent is an iterative optimization algorithm used to find the minimum of the cost function.
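A minimal batch gradient descent sketch on synthetic data, repeatedly applying the update β ← β − α·∇J(β) for the MSE cost J(β) = (1/n)‖y − Xβ‖² (the learning rate and iteration count are illustrative choices):

```python
import numpy as np

# Synthetic data: y = 4 - 1.5x + noise (illustrative values)
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = 4.0 - 1.5 * x + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x])

beta = np.zeros(2)
lr = 0.1  # learning rate (step size)
for _ in range(2000):
    # Gradient of the mean squared error cost with respect to beta
    grad = -2.0 / n * X.T @ (y - X @ beta)
    beta -= lr * grad

print(beta)  # converges to the OLS solution
```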
Assumptions
Linearity: The relationship between X and Y is linear.
Homoscedasticity: Constant variance of residuals (Var(ε∣X)=σ²). This means the spread of the errors is consistent for all values of the features; there should be no discernible patterns in the residuals.
No Multicollinearity: Independent variables shouldn’t be highly correlated with each other. This can be checked using correlation matrices or Variance Inflation Factor (VIF).
Normality: Residuals are normally distributed (ε ~ N(0, σ²)). This can be checked through Q-Q plots or histograms of the residuals, or through statistical tests such as the Kolmogorov-Smirnov test. Strictly required only for MLE and for exact small-sample inference.
Independence: Observations are independent of each other
Exogeneity: E(ε∣X)=0, meaning the errors are uncorrelated with the predictors.
Consequences of violating assumptions:
Violating linearity: Biased and inconsistent estimates
Violating independence: Incorrect standard errors, inefficient estimates
Violating homoscedasticity: Inefficient estimates, incorrect standard errors
Violating normality: Hypothesis tests may be invalid for small samples
Perfect multicollinearity: Unable to estimate unique coefficients
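The multicollinearity check mentioned above can be computed from first principles: VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. A sketch on synthetic data (the near-collinear construction of x2 is an illustrative assumption):

```python
import numpy as np

# Synthetic features: x2 is nearly collinear with x1, x3 is independent
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress feature j on all the others (with intercept) and compute R^2
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    resid = X[:, j] - Z @ beta
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 show inflated VIFs
```

A common rule of thumb flags VIF values above 5 or 10 as problematic.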
Model Evaluation Metrics
R-squared (Coefficient of Determination): R² = 1 − SSR/SST, where SSR is the sum of squared residuals and SST is the total sum of squares
Adjusted R-squared: Adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1)
Mean Squared Error (MSE): MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Root Mean Squared Error (RMSE): RMSE = √MSE
Mean Absolute Error (MAE): MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
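These metrics can be computed directly from their definitions; a small worked example on made-up numbers:

```python
import numpy as np

# Toy predictions (illustrative values)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

sse = np.sum((y - y_hat) ** 2)       # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1 - sse / sst
mse = sse / len(y)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
print(r2, mse, rmse, mae)  # 0.975 0.125 0.3536 0.25
```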
Hypothesis Testing
t-test for individual coefficients: H0: βᵢ = 0, with test statistic t = β̂ᵢ / SE(β̂ᵢ)
F-test for overall model significance: F = (SSR/p) / (SSE/(n − p − 1)), where here SSR denotes the regression (explained) sum of squares and SSE the residual sum of squares
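Both test statistics can be computed from scratch; a sketch on synthetic data (the coefficients, sample size, and use of SciPy's t-distribution for two-sided p-values are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Synthetic data: y = 1 + 2*x1 + noise, x2 has no true effect (illustrative)
rng = np.random.default_rng(4)
n, p = 100, 2
X_raw = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_raw[:, 0] + rng.normal(size=n)
X = np.column_stack([np.ones(n), X_raw])

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sse = resid @ resid
sigma2_hat = sse / (n - p - 1)  # unbiased residual variance
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

t_stats = beta / se  # t-statistic for H0: beta_i = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)

ssr = np.sum((X @ beta - y.mean()) ** 2)  # explained sum of squares
F = (ssr / p) / (sse / (n - p - 1))
print(t_stats, p_values, F)
```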