Explore the classical linear regression assumptions that ensure unbiased and efficient estimators, discover how each assumption applies to finance scenarios, and review practical steps to detect and address common violations.
So, you’ve probably heard that multiple regression is one of the most powerful workhorses in a quant’s toolbox. We use it to model relationships among variables, forecast future trends, and measure how different explanatory factors affect an outcome of interest. But, as flexible as multiple regression might be, there’s a catch: certain assumptions must hold true for us to fully trust our model’s results. Now, let’s begin our deep dive into these assumptions, see why they matter so much, look at real-world finance examples where they might go wrong, and figure out how to wrangle them effectively.
Below is a simple flowchart that shows a typical multiple regression process—from data collection to final inference. Most of our “assumption checking” happens during and after fitting the model, so pay attention to that residual diagnostics step.
```mermaid
flowchart LR
    A["Start <br/>Data Collection"] --> B["Check for Non-Collinearity <br/>(Correlations)"]
    B --> C["Fit OLS Model <br/>y = β0 + β1X1 + ... + βnXn + ε"]
    C --> D["Check Residuals <br/>for Homoskedasticity, Normality, etc."]
    D --> E["Evaluate Model Fit <br/>R² and Diagnostics"]
    E --> F["Make Inferences <br/>and Forecast"]
```
When these assumptions hold, the ordinary least squares (OLS) estimator is not only unbiased but also attains the minimum variance among all linear unbiased estimators, a property known as being BLUE (Best Linear Unbiased Estimator) under the Gauss–Markov Theorem. For exam-level mastery, it is super important to understand what each assumption entails, how to spot violations, and what to do if you see a big “Oh no!” in your residual plots.
One major assumption in classical regression is that the relationship between the dependent variable (often denoted as y) and the independent variables (Xs) is linear in parameters. That’s a slightly fancy way of saying that the equation we are estimating looks like:

\[ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon \]
where the βs are the parameters we want to estimate. It doesn’t mean that y and X have to be linearly related in the real world—maybe you transform Xs using logs or polynomials if needed—but the final form you use in your regression must be linear once you put it all together.
In finance, you might see polynomial terms for interest rates or log transformations for portfolio sizes. As long as the model is linear in the βs, you’re all good on this assumption.
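As a quick illustration, here is a minimal sketch (made-up data, hypothetical variable names) of a model that is nonlinear in the variables—a squared interest rate and a log of portfolio size—yet still linear in the parameters, so OLS applies without complaint:

```python
# Minimal sketch: nonlinear in the variables, linear in the parameters.
# Data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "rate": rng.uniform(0.01, 0.08, n),       # interest rate level
    "port_size": rng.lognormal(4, 1, n),      # portfolio size (right-skewed)
})
# Simulated response: quadratic in the rate, log-linear in portfolio size
df["y"] = (1.0 + 5.0 * df["rate"] - 20.0 * df["rate"] ** 2
           + 0.3 * np.log(df["port_size"]) + rng.normal(0, 0.1, n))

# Transform the regressors; the specification stays linear in the betas
X = pd.DataFrame({
    "rate": df["rate"],
    "rate_sq": df["rate"] ** 2,
    "log_size": np.log(df["port_size"]),
})
X = sm.add_constant(X)
model = sm.OLS(df["y"], X).fit()
print(model.params)  # estimates of beta0..beta3
```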
Anecdotally, I once had a friend who tried to model corporate bond spreads as an exponential function of GDP growth rates, forgetting to transform the specification into something linear in the parameters. He ended up with a weird “nonlinear in parameters” mess that gave nonsensical p-values. Once he transformed the data (taking logs, for example, or using a polynomial expansion), everything started to make sense.
Multicollinearity means that some of your X variables are highly correlated with each other. Perfect multicollinearity means they’re 100% correlated, making it impossible to untangle their individual effects on y. If two explanatory variables provide the exact same information, OLS basically throws up its hands: it cannot produce unique estimates of their separate coefficients.
In practical finance scenarios, you might see near-perfect collinearity when a basket of stocks and a market index move stride for stride. Or in factor models, sometimes two factors—like a size factor and a market factor—are so correlated that your regression model can hardly tell them apart.
While perfect multicollinearity is rare, “near” multicollinearity can cause large standard errors, making it hard to interpret results clearly. Common ways to detect it include looking at correlation matrices or computing variance inflation factors (VIFs).
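Here is a minimal sketch of both checks, assuming a statsmodels workflow and deliberately collinear simulated factors:

```python
# Sketch of two common multicollinearity checks: a correlation matrix and
# variance inflation factors (VIFs). Data are simulated for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
market = rng.normal(0, 1, 300)
size = 0.9 * market + rng.normal(0, 0.3, 300)   # deliberately collinear factor
momentum = rng.normal(0, 1, 300)

X = pd.DataFrame({"market": market, "size": size, "momentum": momentum})
print(X.corr().round(2))                         # pairwise correlations

X_const = sm.add_constant(X)
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)}
print(vifs)  # VIFs above roughly 5-10 usually signal troublesome collinearity
```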
This assumption states that the average of the error term, \(\varepsilon\), across all observations is zero. Symbolically:

\[ E(\varepsilon) = 0 \]
This basically means there’s no systematic bias in your residuals. If your model is missing an important variable that systematically bumps or nudges your predictions in one direction, you violate this assumption because your \(\varepsilon\) would consistently be positive or negative.
In finance, imagine forgetting to include the risk-free rate or a relevant macroeconomic control in your model. You might see that your residuals are systematically under- or over-predicting your dependent variable. That’s a sign you might need extra variables or a different functional form.
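The sketch below is an illustrative simulation (all numbers invented) of how leaving out a relevant, correlated variable biases the remaining coefficient, which is the practical footprint of violating this assumption:

```python
# Illustrative simulation: omitting a relevant, correlated regressor biases
# the remaining coefficient estimate. All parameter values are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
macro = rng.normal(0, 1, n)                      # relevant macro control
x = 0.6 * macro + rng.normal(0, 1, n)            # regressor correlated with it
y = 1.0 + 2.0 * x + 1.5 * macro + rng.normal(0, 1, n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x, macro]))).fit()
short = sm.OLS(y, sm.add_constant(x)).fit()

print(full.params[1])   # close to the true value of 2.0
print(short.params[1])  # biased upward because 'macro' was left out
```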
When you look at how scattered your residuals are, you want that scatter to be about the same across all levels of your independent variables. That’s homoskedasticity. If the residual variance is not uniform, we call it heteroskedasticity.
Heteroskedasticity is super common in real financial data. For example, cross-sectional stock returns often show residual variance that differs across market capitalizations, and the whole cross-section can get hammered more severely during a market sell-off. That is, the variability in returns might not be the same for large-cap and small-cap stocks.
While OLS can still produce unbiased coefficient estimates even if the errors aren’t homoskedastic, the usual standard errors of those estimates are no longer reliable. You can’t trust your t‑stats or p-values. In practice, analysts use robust standard errors or other corrections (like Weighted Least Squares) to fix this.
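A minimal sketch of the robust-standard-error route, assuming a statsmodels workflow and simulated data whose error variance grows with the regressor:

```python
# Sketch: compare ordinary and heteroskedasticity-robust (HC1, "White")
# standard errors on simulated heteroskedastic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(0, 1, n)
# Error variance grows with |x| -> heteroskedastic errors
y = 0.5 + 1.2 * x + rng.normal(0, 1, n) * (0.5 + np.abs(x))

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                    # classic standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust

print(ols.bse)     # likely misleading under heteroskedasticity
print(robust.bse)  # robust standard errors; coefficients are unchanged
```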
Autocorrelation means the residuals from one observation are correlated with the residuals from another. This often happens in time-series data—like stock returns or macroeconomic indicators—where residuals in consecutive quarters can be related to each other.
A big example is a time‑series model of daily returns. If big positive residuals today imply big positive residuals tomorrow, you’ve got positive serial correlation. In cross-sectional data (e.g., comparing many companies at a single time point), you can occasionally see patterns in the residuals if there is some underlying grouping unaccounted for, like firms in the same industry or region.
When residuals are autocorrelated, your standard errors can become too small (leading to inflated t‑statistics). A Durbin-Watson test or an inspection of the residuals’ autocorrelation function (ACF) can help detect it. If you do find autocorrelation, you might adopt time-series–specific corrections (e.g., Newey-West standard errors) or adjust your model specification.
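Here is a short sketch of both steps, using simulated AR(1) errors and a statsmodels workflow (lag length for the Newey-West correction is an assumption):

```python
# Sketch: detect serial correlation with the Durbin-Watson statistic, then
# re-estimate the covariance matrix with Newey-West (HAC) standard errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 500
x = rng.normal(0, 1, n)
# AR(1) errors -> positive serial correlation
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal(0, 1)
y = 0.2 + 0.8 * x + e

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
print(durbin_watson(res.resid))   # well below 2 => positive autocorrelation

hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(hac.bse)                    # Newey-West (HAC) standard errors
```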
Strictly speaking, OLS does not require normality of errors to produce unbiased estimates. However, normality matters if you want to construct reliable confidence intervals and hypothesis tests—especially in small samples. The Central Limit Theorem often rides to the rescue in large samples, letting your estimates approximate normal distributions even if the errors aren’t truly normal. But in smaller data sets, if your residuals are heavily skewed or have fat tails, your p-values and confidence intervals won’t be trustworthy.
In finance, heavy-tailed distributions are relatively common (think extreme negative returns during crises). So, be mindful that traditional inference might understate the risk of outlier-driven results. Tools such as bootstrapping or distribution-robust methods are used when normality is highly suspect.
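One quick diagnostic is a Jarque-Bera test on the residuals; the sketch below uses simulated fat-tailed errors to show how the test flags non-normality:

```python
# Sketch: Jarque-Bera test for residual normality on a model fitted to
# simulated data with heavy-tailed (Student-t) errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(4)
n = 400
x = rng.normal(0, 1, n)
y = 0.1 + 0.9 * x + rng.standard_t(df=3, size=n)   # heavy-tailed errors

res = sm.OLS(y, sm.add_constant(x)).fit()
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(res.resid)
print(jb_pvalue)   # small p-value => reject normality of the residuals
```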
When all these assumptions hold, the Gauss–Markov Theorem tells us that OLS provides the Best Linear Unbiased Estimates (BLUE) of the parameters. That means:
• Unbiasedness: On average, the estimated coefficients will reflect the true population parameters.
• Minimum Variance: Among all linear and unbiased estimators, your OLS estimates have the lowest possible variance. In simpler terms, they’re as efficient as you can get.
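In symbols, writing \(\hat{\beta}_j\) for the OLS estimate and \(\tilde{\beta}_j\) for any other linear unbiased estimator of the same parameter, these two properties amount to:

\[ E\!\left[\hat{\beta}_j\right] = \beta_j \quad \text{and} \quad \operatorname{Var}\!\left(\hat{\beta}_j\right) \le \operatorname{Var}\!\left(\tilde{\beta}_j\right). \]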
So, if your driving aim is to measure or forecast in a financial context—like predicting stock returns, bond yield spreads, or corporate earnings—ensuring (or at least approximating) these assumptions means you’ll typically get reliable coefficient estimates, standard errors, and test statistics.
Failure to adhere to these assumptions can distort your inferences. For instance, if you have autocorrelation but ignore it, you might incorrectly conclude that a coefficient is significant (maybe you invest in an alpha strategy that doesn’t really exist!). Likewise, if you have severe heteroskedasticity but keep using the usual OLS standard errors, you might unwittingly design a risky portfolio strategy based on faulty risk-return estimates.
• Time-Series Autocorrelation: Equity returns often exhibit autocorrelation during turbulent periods. For instance, consecutive negative shocks can cluster, leading to correlated residuals. If you assume independence, you might understate the risk or misjudge significance.
• Cross-Sectional Heteroskedasticity: In a cross-sectional regression of stock returns on firm size, leverage, and industry classification, large tech firms might have very different volatility (residual spread) than stable utility firms. This difference in variance violates the homoskedasticity assumption.
• Collinearity in Factor Models: Many popular factors in asset pricing (e.g., market factor, large-cap factor, momentum factor) show strong correlations. Near-multicollinearity inflates standard errors, making it difficult to identify which factor is truly driving returns.
I remember once running a multi-factor model for emerging market equities and seeing high correlation between a “value factor” and a “momentum factor.” The t-statistics for each factor’s coefficient were minimal, yet the overall model looked good on an R² basis. That was a textbook case of near-multicollinearity making each coefficient look non-significant individually.
• Non-stationarity: Although not among the classical OLS assumptions, stationarity is critical in finance time-series. If your data has a trend or structural break (e.g., post-2008 crisis), your model estimates can be thrown off. Testing for unit roots and differencing the data, or using cointegration techniques, often helps.
• Structural Breaks: A model that explains corporate bond spreads well pre-COVID might fail spectacularly after COVID. If the underlying data-generating process shifts drastically, your assumptions about linearity or error properties might go out the window. Always keep an eye out for such regime changes.
In more advanced or specialized finance models, you might see:
• Stationarity in Time-Series: “Stationarity” basically insists that the statistical properties of the series (mean, variance, autocorrelations) are constant over time. Non-stationary data (like integrated series) can lead to spurious regression results.
• Absence of Structural Breaks: If a market crash or policy shift changes the relationship in your data, the old model might not apply. You might test for structural breaks with something like the Chow test or look at rolling regressions to see if coefficients are stable.
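As an informal complement to a formal Chow test, a rolling regression lets you eyeball coefficient stability over time. A minimal sketch (simulated break, window length chosen arbitrarily):

```python
# Sketch: rolling-window OLS to visualize coefficient stability around a
# structural break. The break point and window length are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

rng = np.random.default_rng(5)
n = 600
x = rng.normal(0, 1, n)
beta = np.where(np.arange(n) < 300, 1.0, 2.0)    # true slope shifts mid-sample
y = 0.1 + beta * x + rng.normal(0, 0.5, n)

X = sm.add_constant(pd.Series(x, name="x"))
rolling = RollingOLS(y, X, window=100).fit()
print(rolling.params.tail())   # slope estimates drift toward 2.0 after the break
```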
• Homoskedasticity: Uniform variance of residuals across all levels of the regressors.
• Heteroskedasticity: The variance of residuals varies, potentially invalidating standard errors.
• Serial Correlation (Autocorrelation): Residuals are correlated with each other—often in time-series data, but can also happen cross-sectionally if there’s an unmodeled group factor.
• BLUE (Best Linear Unbiased Estimator): By the Gauss–Markov Theorem, the OLS estimator is the one with the lowest variance among all linear unbiased estimators, provided certain assumptions hold.
• Always Plot Residuals: A quick scatterplot of residuals can reveal patterns like increasing spread (heteroskedasticity) or clumping (autocorrelation).
• Check Correlation Matrices: Before running your regression, see if two or more X variables are suspiciously correlated. Consider dropping one or combining them (e.g., principal component analysis) if the correlation is extremely high.
• Consider Transformations: If your data look nonlinear, consider taking logs (especially with big ranges) or building polynomial terms. Make sure the final form is linear in the parameters.
• Use Robust Standard Errors: If you detect heteroskedasticity, robust estimators can help fix your standard errors without re-specifying the entire model.
• Check for Stationarity in Time-Series: Tools like the Augmented Dickey-Fuller (ADF) test can help ensure you’re not building a classic “spurious regression” (see the sketch after this list).
• Watch out for Big Data Quirks: In very large data sets, everything might look significant (p-values become super tiny). But if your assumptions are off, you might still get nonsense results.
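Here is a minimal ADF sketch, comparing a simulated price level (a random walk) with its first difference:

```python
# Sketch: Augmented Dickey-Fuller test on a simulated price level versus its
# first difference (returns). Non-stationary levels invite spurious regressions.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(6)
prices = 100 + np.cumsum(rng.normal(0, 1, 1000))   # random walk (non-stationary)
returns = np.diff(prices)                          # differenced series

for name, series in [("prices", prices), ("returns", returns)]:
    stat, pvalue, *_ = adfuller(series)
    print(name, round(pvalue, 4))   # high p-value => cannot reject a unit root
```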
In the next section (2.5 Identifying Violations from Residual Plots), we’ll dive deeper into practical ways to visually assess whether your model is living up to these assumptions.
• Damodar Gujarati & Dawn C. Porter (2010). “Essentials of Econometrics.” An accessible yet robust text on regression fundamentals.
• CFA Institute Practice Problems: Great for seeing how test questions address assumption violations—and how to correct them.
• Online Modules (NPTEL, Coursera) on Advanced Regression: Helpful for deeper dives into robust solutions, time-series corrections, and practical demos.