Understanding Regression Assumptions

Linear regression is one of the most widely used statistical tools, yet its validity depends on a set of assumptions that are often overlooked. In this post, I walk through the core assumptions, why they matter, and how to diagnose violations.

The Five Key Assumptions

1. Linearity

The expected value of the response must be a linear function of the predictors as they enter the model (transformed or polynomial terms are allowed). This seems obvious, but it is easy to miss when working with high-dimensional data. A residuals vs. fitted plot is the quickest diagnostic: if the residuals trace a curve instead of scattering evenly around zero, the linearity assumption is violated.
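As a minimal sketch of this diagnostic, the snippet below uses simulated data; the variable names and the quadratic data-generating process are assumptions made purely for illustration. A straight-line fit to curved data leaves an arc in the residuals, and adding the squared term removes it.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(n)   # true relationship is curved

fit_line <- lm(y ~ x)            # misspecified: straight line only
fit_quad <- lm(y ~ x + I(x^2))   # includes the missing squared term

# Residuals vs. fitted for both models, side by side
old_par <- par(mfrow = c(1, 2))
plot(fitted(fit_line), resid(fit_line), main = "y ~ x",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
plot(fitted(fit_quad), resid(fit_quad), main = "y ~ x + I(x^2)",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
par(old_par)

# Numeric counterpart: association between the residuals and x^2
curvature_line <- abs(cor(resid(fit_line), x^2))
curvature_quad <- abs(cor(resid(fit_quad), x^2))
c(line = curvature_line, quad = curvature_quad)
```

With real data the missing term is not known in advance, so the plot is the primary tool; the correlation check only works here because the simulation tells us which term was left out.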

2. Independence of Errors

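The errors must be independent of each other. This is most commonly violated in time series data, where consecutive observations are correlated. The Durbin-Watson test is a standard check for first-order autocorrelation: a statistic near 2 suggests none, while values toward 0 or 4 point to positive or negative autocorrelation, respectively.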
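To make the Durbin-Watson statistic concrete, here is a sketch that computes it by hand in base R on simulated data with AR(1) errors. The helper name durbin_watson and the simulation settings are my own assumptions for illustration; in practice, car::durbinWatsonTest() or lmtest::dwtest() also report a p-value.

```r
set.seed(2)
n <- 200
x <- rnorm(n)
e_ar <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # AR(1) errors
y <- 1 + 2 * x + e_ar

fit <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares
durbin_watson <- function(model) {
  e <- resid(model)
  sum(diff(e)^2) / sum(e^2)
}

dw <- durbin_watson(fit)
dw   # well below 2 here, pointing to positive autocorrelation

# The lag-1 autocorrelation of the residuals tells the same story
lag1 <- acf(resid(fit), plot = FALSE)$acf[2]
lag1
```

The statistic is roughly 2 * (1 - r), where r is the lag-1 autocorrelation of the residuals, which is why strongly positive autocorrelation pushes it toward 0.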

3. Homoscedasticity

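The variance of the residuals should be constant across all levels of the fitted values. In a scale-location plot (square root of the absolute standardized residuals vs. fitted values), heteroscedasticity shows up as an upward or downward trend; in a residuals vs. fitted plot it appears as a funnel shape. When the assumption is violated, weighted least squares often helps, as does a log transformation of the response when the spread grows with the mean.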
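The weighted least squares remedy can be sketched on simulated data as follows. The weights 1 / x^2 assume the error variance is known to be proportional to x^2, which is true here only because I simulated it that way; with real data the variance structure has to be estimated or argued for. The helper spread_trend is a hypothetical name for a numeric stand-in for the scale-location plot.

```r
set.seed(3)
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error SD grows with x

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)       # weights proportional to 1 / variance

# Correlation between sqrt(|standardized residuals|) and fitted values:
# near zero under constant variance, clearly positive when spread grows
spread_trend <- function(model) {
  cor(sqrt(abs(rstandard(model))), fitted(model))
}

trend_ols <- spread_trend(ols)
trend_wls <- spread_trend(wls)
c(ols = trend_ols, wls = trend_wls)

# The visual version for the unweighted fit (scale-location plot)
plot(ols, which = 3)
```

For a weighted fit, rstandard() works with the weighted residuals, so the two trend values are comparable.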

4. Normality of Residuals

The residuals should be approximately normally distributed. A Q-Q plot is the standard visual check. Note that this assumption matters most for inference (confidence intervals, hypothesis tests) and less so for prediction.
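Here is a short sketch of the Q-Q check on simulated data, comparing a model with normal errors to one with right-skewed errors. The data-generating choices are assumptions for illustration only. I pair the plots with shapiro.test() from base R as a numeric companion, keeping in mind that with large samples the test flags even trivial departures.

```r
set.seed(4)
n <- 200
x <- rnorm(n)
y_normal <- 1 + 2 * x + rnorm(n)        # normal errors
y_skewed <- 1 + 2 * x + (rexp(n) - 1)   # right-skewed errors with mean zero

fit_normal <- lm(y_normal ~ x)
fit_skewed <- lm(y_skewed ~ x)

# Q-Q plots of standardized residuals, side by side
old_par <- par(mfrow = c(1, 2))
qqnorm(rstandard(fit_normal), main = "Normal errors")
qqline(rstandard(fit_normal))
qqnorm(rstandard(fit_skewed), main = "Skewed errors")
qqline(rstandard(fit_skewed))
par(old_par)

# Shapiro-Wilk test on the residuals of each model
sw_normal <- shapiro.test(resid(fit_normal))
sw_skewed <- shapiro.test(resid(fit_skewed))
c(normal = sw_normal$p.value, skewed = sw_skewed$p.value)
```

In the skewed case the points bend away from the reference line in the upper tail, which is the pattern to look for in real residuals.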

5. No Multicollinearity

Predictors should not be highly correlated with each other, or with linear combinations of the other predictors. The Variance Inflation Factor (VIF) quantifies this: a VIF above 10 is a common threshold for concern, and some analysts use a stricter cutoff of 5. Removing one of the correlated predictors or using ridge regression are typical remedies.
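To show what the VIF actually measures, this sketch computes it by hand in base R using the built-in mtcars data; the choice of predictors and the helper name vif_one are assumptions for illustration. Each predictor is regressed on the others, and its VIF is 1 / (1 - R^2) from that auxiliary regression.

```r
# Built-in mtcars data; the choice of predictors is arbitrary
predictors <- c("disp", "hp", "wt")

# VIF for one predictor: regress it on the others and use 1 / (1 - R^2)
vif_one <- function(name, data, all_names) {
  others <- setdiff(all_names, name)
  aux <- lm(reformulate(others, response = name), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vif_manual <- sapply(predictors, vif_one, data = mtcars, all_names = predictors)

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_matrix <- diag(solve(cor(mtcars[, predictors])))

round(vif_manual, 2)
```

For a model with only numeric predictors, these values should agree with what vif() from the car package reports; a VIF of 1 means the predictor is uncorrelated with the rest.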

A Quick Diagnostic Workflow in R

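# Assumes `ames` is a data frame already loaded with these columns
library(car)  # provides vif()

model <- lm(SalePrice ~ GrLivArea + OverallQual + YearBuilt, data = ames)

# Four diagnostic plots at once: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)  # restore the previous plotting layout

# VIF check: one value per predictor
vif(model)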

When Assumptions Are Violated

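Not every violation is catastrophic. Inference from linear regression is fairly robust to mild departures from normality, especially with large samples, because the central limit theorem makes the coefficient estimates approximately normal. Heteroscedasticity and multicollinearity are more serious for inference: heteroscedasticity biases the usual standard errors, while multicollinearity inflates them and makes individual coefficient estimates unstable.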

The key is to diagnose first, then decide whether a transformation, a different model, or a robust estimation method is warranted.
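As one example of a robust estimation method, the sketch below computes heteroscedasticity-consistent (HC0, or "sandwich") standard errors by hand on simulated data. The function name robust_vcov_hc0 and the simulation are assumptions for illustration; in practice a package such as sandwich, via vcovHC(), provides this estimator along with small-sample variants.

```r
set.seed(5)
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # heteroscedastic errors

fit <- lm(y ~ x)

# HC0 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
robust_vcov_hc0 <- function(model) {
  X <- model.matrix(model)
  e <- resid(model)
  bread <- solve(crossprod(X))
  meat <- crossprod(X * e)   # each row of X scaled by its residual
  bread %*% meat %*% bread
}

V_robust <- robust_vcov_hc0(fit)
se_classical <- sqrt(diag(vcov(fit)))
se_robust <- sqrt(diag(V_robust))
cbind(classical = se_classical, robust = se_robust)
```

The coefficient estimates are unchanged; only the standard errors, and therefore the confidence intervals and tests, are adjusted.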