
Assumptions Underlying Multiple Regression

Explore the classical linear regression assumptions that ensure unbiased and efficient estimators, discover how each assumption applies to finance scenarios, and review practical steps to detect and address common violations.

Introduction

So, you’ve probably heard that multiple regression is one of the most powerful workhorses in a quant’s toolbox. We use it to model relationships among variables, forecast future trends, and measure how different explanatory factors affect an outcome of interest. But, as flexible as multiple regression might be, there’s a catch: certain assumptions must hold true for us to fully trust our model’s results. Now, let’s begin our deep dive into these assumptions, see why they matter so much, look at real-world finance examples where they might go wrong, and figure out how to wrangle them effectively.

Below is a simple flowchart that shows a typical multiple regression process—from data collection to final inference. Most of our “assumption checking” happens during and after fitting the model, so pay attention to that residual diagnostics step.

```mermaid
flowchart LR
    A["Start <br/>Data Collection"] --> B["Check for Non-Collinearity <br/>(Correlations)"]
    B --> C["Fit OLS Model <br/>y = β0 + β1X1 + ... + βnXn + ε"]
    C --> D["Check Residuals <br/>for Homoskedasticity, Normality, etc."]
    D --> E["Evaluate Model Fit <br/>R² and Diagnostics"]
    E --> F["Make Inferences <br/>and Forecast"]
```

When these assumptions hold, the ordinary least squares (OLS) estimator is not only unbiased but also has the minimum variance among all linear unbiased estimators—a property known as being BLUE (Best Linear Unbiased Estimator) under the Gauss–Markov Theorem. For exam-level mastery, it is super important to understand what each assumption entails, how to spot violations, and what to do if you see a big “Oh no!” in your residual plots.
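To make the flowchart concrete, here is a minimal sketch of that workflow in Python with statsmodels, run on simulated data. The variable names and the data-generating process are illustrative assumptions, not part of the curriculum text.

```python
# Minimal sketch of the flowchart's steps using statsmodels on simulated data.
# The data-generating process and variable names are illustrative assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 250

# Two hypothetical explanatory variables, e.g. a market factor and a size factor
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.2 * x1 - 0.7 * x2 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))   # include an intercept (beta_0)
model = sm.OLS(y, X).fit()                       # fit y = b0 + b1*x1 + b2*x2 + e

print(model.summary())                           # coefficients, R², t-stats
resid = model.resid                              # residuals for diagnostic checks
print("Mean residual (should be close to 0):", resid.mean())
```

The residuals pulled out in the last lines are the raw material for the diagnostic checks discussed in the rest of this section.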

Key Classical Linear Regression Assumptions

Linear Relationship

One major assumption in classical regression is that the relationship between the dependent variable (often denoted as y) and the independent variables (Xs) is linear in parameters. That’s a slightly fancy way of saying that the equation we are estimating looks like:

$$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \varepsilon, $$

where the βs are the parameters we want to estimate. It doesn’t mean that y and X have to be linearly related in the real world—you can transform the Xs using logs or polynomials if needed—but the final form you estimate must be linear in the βs once you put it all together.

In finance, you might see polynomial terms for interest rates or log transformations for portfolio sizes. As long as the model is linear in the βs, you’re all good on this assumption.
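A short, hypothetical sketch of the same idea: the regressors below are transformed with squares and logs, yet the model stays linear in the βs, so ordinary OLS still estimates it. The simulated data and coefficient values are assumptions for illustration only.

```python
# Illustrative only: transformed regressors (square, log) keep the model
# linear in the betas, so OLS still applies.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
rate = rng.uniform(0.01, 0.10, size=n)          # hypothetical interest rate
size = rng.lognormal(mean=10, sigma=1, size=n)  # hypothetical portfolio size

# Nonlinear in the X's, but still linear in the parameters:
# y = b0 + b1*rate + b2*rate^2 + b3*log(size) + e
y = 2.0 + 30 * rate - 80 * rate**2 + 0.3 * np.log(size) + rng.normal(scale=0.2, size=n)

X = sm.add_constant(np.column_stack([rate, rate**2, np.log(size)]))
print(sm.OLS(y, X).fit().params)                # recovers the betas (approximately)
```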

Anecdotally, I once had a friend who tried to model corporate bond spreads as an exponential function of GDP growth without transforming the equation first. He ended up with a specification that was nonlinear in its parameters and produced nonsensical p-values. Once he took logs so the model became linear in the βs, everything started to make sense.

No Perfect Multicollinearity

Multicollinearity means that some of your X variables are highly correlated with each other. Perfect multicollinearity means they’re 100% correlated, making it impossible to untangle their individual effects on y. OLS basically throws up its hands if two explanatory variables provide the exact same information and cannot produce unique estimates of their separate coefficients.

In practical finance scenarios, you might see near-perfect collinearity when, say, a large-cap index and a broad market index move stride for stride. Or in factor models, sometimes two factors—like a size factor and a market factor—are so correlated that your regression model can hardly tell them apart.

While perfect multicollinearity is rare, “near” multicollinearity can cause large standard errors, making it hard to interpret results clearly. Common ways to detect it include looking at correlation matrices or computing variance inflation factors (VIFs).
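As a rough illustration of those diagnostics, the sketch below builds a deliberately near-collinear “size” factor, then prints the correlation matrix and variance inflation factors. Factor names and data are hypothetical assumptions, not the curriculum’s own example.

```python
# Sketch: detecting near-multicollinearity with a correlation matrix and VIFs.
# Factor names and simulated data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 300
market = rng.normal(size=n)
size_factor = 0.9 * market + 0.1 * rng.normal(size=n)   # deliberately near-collinear
value = rng.normal(size=n)

factors = pd.DataFrame({"market": market, "size": size_factor, "value": value})
print(factors.corr().round(2))                           # pairwise correlations

exog = sm.add_constant(factors)
vifs = {col: variance_inflation_factor(exog.values, i)
        for i, col in enumerate(exog.columns) if col != "const"}
print(vifs)                                              # VIF above roughly 10 is a common warning sign
```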

Expected Value of Errors = 0

This assumption states that the average of the error term, \(\varepsilon\), across all observations is zero. Symbolically:

$$ E(\varepsilon) = 0. $$

This basically means there’s no systematic bias in your residuals. If your model is missing an important variable that systematically bumps or nudges your predictions in one direction, you violate this assumption because your \(\varepsilon\) would consistently be positive or negative.

In finance, imagine forgetting to include the risk-free rate or a relevant macroeconomic control in your model. You might see that your residuals are systematically under- or over-predicting your dependent variable. That’s a sign you might need extra variables or a different functional form.

Homoskedasticity (Constant Variance of Residuals)

When you look at how scattered your residuals are, you want that scatter to be about the same across all levels of your independent variables. That’s homoskedasticity. If the residual variance is not uniform, we call it heteroskedasticity.

Heteroskedasticity is super common in real financial data. For example, in a cross-section of stock returns, residual volatility may differ sharply between large-cap and small-cap names, and it can spike further during a market sell-off. That is, the variability in returns is rarely the same across high and low capitalizations.

While OLS can still produce unbiased coefficient estimates even if the errors aren’t homoskedastic, the estimated variances (and hence standard errors) of those coefficients are no longer reliable. You can’t trust your t‑stats or p-values. In practice, analysts use robust standard errors or other corrections (like Weighted Least Squares) to fix this.
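One possible way to carry this out, sketched below on simulated data: a Breusch-Pagan test flags the heteroskedasticity, and refitting with White-type (HC3) robust standard errors corrects the inference without changing the coefficient estimates. The data-generating process is an assumption for illustration only.

```python
# Sketch: testing for heteroskedasticity (Breusch-Pagan) and refitting with
# robust (White-type HC3) standard errors. Data are simulated for illustration.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
n = 400
size = rng.uniform(1, 10, size=n)                       # hypothetical firm size
# Error variance grows with size, so the data are heteroskedastic by construction
y = 1.0 + 0.5 * size + rng.normal(scale=0.2 * size, size=n)

X = sm.add_constant(size)
ols = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, _, _ = het_breuschpagan(ols.resid, ols.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)              # a small p-value signals heteroskedasticity

robust = sm.OLS(y, X).fit(cov_type="HC3")               # same betas, corrected standard errors
print(ols.bse, robust.bse)                              # compare the two sets of standard errors
```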

No Autocorrelation (No Serial Correlation in Residuals)

Autocorrelation means the residuals from one observation are correlated with the residuals from another. This often happens in time-series data—like stock returns or macroeconomic indicators—where residuals in consecutive quarters can be related to each other.

A big example is a time‑series model of daily returns. If big positive residuals today imply big positive residuals tomorrow, you’ve got positive serial correlation. In cross-sectional data (e.g., comparing many companies at a single time point), you can also see patterns in the residuals if some underlying grouping, like industry or region, is left unaccounted for.

When residuals are autocorrelated, your standard errors can become too small (leading to inflated t‑statistics). Durbin-Watson tests or analyzing the autocorrelation function (ACF) of residuals can help detect it. If you do find autocorrelation, you might adopt time-series–specific corrections (e.g., Newey-West standard errors) or adjust your model specification.
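A small sketch, again on simulated data, of how those diagnostics and corrections might look in practice: the errors are built with an AR(1) structure, the Durbin-Watson statistic flags the serial correlation, and a Newey-West (HAC) refit adjusts the standard errors. Everything about the data is assumed for illustration.

```python
# Sketch: detecting serial correlation with the Durbin-Watson statistic and
# correcting inference with Newey-West (HAC) standard errors. Simulated data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)

# Build AR(1) errors so residuals are positively autocorrelated by construction
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(scale=0.5)
y = 0.2 + 0.8 * x + e

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
print("Durbin-Watson:", durbin_watson(ols.resid))       # near 2 means no autocorrelation; well below 2 is positive

hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(ols.bse, hac.bse)                                 # HAC standard errors are typically larger here
```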

Residuals Are Normally Distributed

Strictly speaking, OLS does not require normality of errors to produce unbiased estimates. However, normality matters if you want to construct reliable confidence intervals and hypothesis tests—especially in small samples. The Central Limit Theorem often rides to the rescue in large samples, letting your estimates approximate normal distributions even if the errors aren’t truly normal. But in smaller data sets, if your residuals are heavily skewed or have fat tails, your p-values and confidence intervals won’t be trustworthy.

In finance, heavy-tailed distributions are relatively common (think extreme negative returns during crises). So, be mindful that traditional inference might understate the risk of outlier-driven results. Tools such as bootstrapping or distribution-robust methods are used when normality is highly suspect.
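As one possible quick check, the sketch below fits OLS on simulated heavy-tailed (Student-t) errors and runs a Jarque-Bera test on the residuals; the data and parameter values are purely illustrative assumptions.

```python
# Sketch: a quick normality check of residuals with the Jarque-Bera test.
# Simulated fat-tailed (Student-t) errors stand in for crisis-style returns.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 0.1 + 0.5 * x + rng.standard_t(df=3, size=n)        # heavy-tailed errors

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)
print(f"JB p-value = {jb_pvalue:.4f}, skew = {skew:.2f}, kurtosis = {kurtosis:.2f}")
# A tiny p-value flags non-normal residuals; consider bootstrapped inference in small samples.
```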

Implications: Why These Assumptions Matter

When all these assumptions hold, the Gauss–Markov Theorem tells us that OLS provides the Best Linear Unbiased Estimates (BLUE) of the parameters. That means:

• Unbiasedness: On average, the estimated coefficients will reflect the true population parameters.
• Minimum Variance: Among all linear and unbiased estimators, your OLS estimates have the lowest possible variance. In simpler terms, they’re as efficient as you can get.

So, if your driving aim is to measure or forecast in a financial context—like predicting stock returns, bond yield spreads, or corporate earnings—ensuring (or at least approximating) these assumptions means you’ll typically get reliable coefficient estimates, standard errors, and test statistics.

Failure to adhere to these assumptions can distort your inferences. For instance, if you have autocorrelation but ignore it, you might incorrectly conclude that a coefficient is significant (maybe you invest in an alpha strategy that doesn’t really exist!). Likewise, if you have severe heteroskedasticity but keep using standard OLS standard errors, you might unwittingly design a risky portfolio strategy based on faulty risk-return estimates.

Real-World Finance Examples of Violated Assumptions

• Time-Series Autocorrelation: Equity returns often exhibit autocorrelation during turbulent periods. For instance, consecutive negative shocks can cluster, leading to correlated residuals. If you assume independence, you might understate the risk or misjudge significance.

• Cross-Sectional Heteroskedasticity: In a cross-sectional regression of stock returns on firm size, leverage, and industry classification, large tech firms might have very different volatility (residual spread) than stable utility firms. This difference in variance violates the homoskedasticity assumption.

• Collinearity in Factor Models: Many popular factors in asset pricing (e.g., market factor, large-cap factor, momentum factor) show strong correlations. Near-multicollinearity inflates standard errors, making it difficult to identify which factor is truly driving returns.

I remember once running a multi-factor model for emerging market equities and seeing high correlation between a “value factor” and a “momentum factor.” The t-statistics for each factor’s coefficient were minimal, yet the overall model looked good on an R² basis. That was a textbook case of near-multicollinearity making each coefficient look non-significant individually.

• Non-stationarity: Although not among the classical OLS assumptions, stationarity is critical in finance time-series. If your data has a trend or structural break (e.g., post-2008 crisis), your model estimates can be thrown off. Testing for unit roots and differencing the data, or using cointegration techniques, often helps.

• Structural Breaks: A model that explains corporate bond spreads well pre-COVID might fail spectacularly after COVID. If the underlying data-generating process shifts drastically, your assumptions about linearity or error properties might go out the window. Always keep an eye out for such regime changes.

Additional Nuanced Assumptions for Advanced Finance Contexts

In more advanced or specialized finance models, you might see:

• Stationarity in Time-Series: “Stationarity” basically insists that the statistical properties of the series (mean, variance, autocorrelations) are constant over time. Non-stationary data (like integrated series) can lead to spurious regression results.

• Absence of Structural Breaks: If a market crash or policy shift changes the relationship in your data, the old model might not apply. You might test for structural breaks with something like the Chow test or look at rolling regressions to see if coefficients are stable.

Glossary

• Homoskedasticity: Uniform variance of residuals across all levels of the regressors.
• Heteroskedasticity: The variance of residuals varies, potentially invalidating standard errors.
• Serial Correlation (Autocorrelation): Residuals are correlated with each other—often in time-series data, but can also happen cross-sectionally if there’s an unmodeled group factor.
• BLUE (Best Linear Unbiased Estimator): By the Gauss–Markov Theorem, the OLS estimator is the one with the lowest variance among all linear unbiased estimators, provided certain assumptions hold.

Practical Tips and Common Pitfalls

• Always Plot Residuals: A quick scatterplot of residuals can reveal patterns like increasing spread (heteroskedasticity) or clumping (autocorrelation).
• Check Correlation Matrices: Before running your regression, see if two or more X variables are suspiciously correlated. Consider dropping one or combining them (e.g., principal component analysis) if the correlation is extremely high.
• Consider Transformations: If your data look nonlinear, consider taking logs (especially with big ranges) or building polynomial terms. Make sure the final form is linear in the parameters.
• Use Robust Standard Errors: If you detect heteroskedasticity, robust estimators can help fix your standard errors without re-specifying the entire model.
• Check for Stationarity in Time-Series: Tools like the Augmented Dickey-Fuller (ADF) test can help ensure you’re not building a classic “spurious regression” (see the sketch after this list).
• Watch out for Big Data Quirks: In very large data sets, everything might look significant (p-values become super tiny). But if your assumptions are off, you might still get nonsense results.
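Here is a minimal sketch of the ADF check mentioned above, applied to a simulated random walk and its first difference; the series and seed are hypothetical.

```python
# Sketch: an Augmented Dickey-Fuller (ADF) stationarity check before regressing
# one time series on another. The random-walk series is simulated for illustration.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
n = 500
random_walk = np.cumsum(rng.normal(size=n))             # non-stationary (unit root) by construction
returns = np.diff(random_walk)                          # first difference is stationary

for name, series in [("levels", random_walk), ("differences", returns)]:
    adf_stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF = {adf_stat:.2f}, p-value = {pvalue:.4f}")
# A large p-value on the levels suggests a unit root; difference the data or consider cointegration.
```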

In the next section (2.5 Identifying Violations from Residual Plots), we’ll dive deeper into practical ways to visually assess whether your model is living up to these assumptions.

References, Suggested Readings, and Further Insights

• Damodar Gujarati & Dawn C. Porter (2010). “Essentials of Econometrics.” An accessible yet robust text on regression fundamentals.
• CFA Institute Practice Problems: Great for seeing how test questions address assumption violations—and how to correct them.
• Online Modules (NPTEL, Coursera) on Advanced Regression: Helpful for deeper dives into robust solutions, time-series corrections, and practical demos.

Test Your Knowledge: Multiple Regression Assumptions Quiz

### Which of the following is true about linearity in parameters for multiple regression?

- [ ] The dependent variable (y) must be linearly distributed.
- [x] The model must be linear in terms of the β coefficients.
- [ ] The regressor variables cannot be transformed (e.g., logs).
- [ ] The error term must be linear in time.

> **Explanation:** The assumption requires that the model is linear in the β parameters, meaning the equation y = β0 + β1X1 + … + βnXn + ε is linear in β. You can still log or otherwise transform X or y, as long as the final expression is linear in β.

### In the context of no perfect multicollinearity, what is the major consequence if two regressors are exactly correlated?

- [ ] The residual sum of squares is minimized.
- [x] You cannot estimate unique coefficients for those regressors.
- [ ] The variance of the residuals is automatically smaller.
- [ ] Your regression becomes homoskedastic by default.

> **Explanation:** Perfect multicollinearity makes it impossible for OLS to separate the effect of one regressor from the other, meaning the model cannot estimate distinct βs for each perfectly correlated variable.

### Why is the assumption E(ε) = 0 important?

- [ ] It ensures the model is perfectly specified.
- [ ] It guarantees that residual plots have zero slope.
- [x] It assures that on average, the model doesn’t over- or underpredict.
- [ ] It forces heteroskedasticity to vanish.

> **Explanation:** E(ε) = 0 indicates the error terms don’t consistently bias estimates upward or downward. If errors had a non-zero average, it would reflect a systematic bias in the regression’s predictions.

### What does homoskedasticity imply?

- [ ] The errors follow a perfect bell curve.
- [x] The variance of the residuals is constant across all predicted values.
- [ ] There are no missing variables in the model.
- [ ] All explanatory variables are uncorrelated.

> **Explanation:** Homoskedasticity states that residuals have the same variance regardless of the level of the regressors or the fitted values. This assumption is crucial for standard error accuracy.

### What is a common symptom of autocorrelation in regression residuals? (Select all that apply.)

- [x] Residuals that systematically follow a pattern over time.
- [ ] Residuals that spread out evenly with no pattern.
- [ ] Perfect normal distribution of residuals.
- [x] Inflated t-statistics due to understated standard errors.

> **Explanation:** Autocorrelation often shows up as a pattern in the residuals across time (e.g., positive errors tend to follow positive errors). It can inflate t-statistics because the standard errors are underestimated when residuals are serially correlated.

### In small samples, why does normality of errors matter?

- [ ] It speeds up the regression process.
- [x] It ensures valid inferences from t-tests and confidence intervals.
- [ ] It guarantees the Gauss–Markov Theorem holds.
- [ ] It makes the dependent variable linear.

> **Explanation:** Small-sample inferences heavily rely on the assumption of normally distributed errors. If the errors are not normal, the t-tests and confidence intervals might not be correct.

### If a finance researcher suspects heteroskedastic residuals in a cross-sectional study, which of the following is a valid fix?

- [ ] Dropping many observations randomly.
- [ ] Ignoring standard errors altogether.
- [x] Using robust (White or HAC) standard errors.
- [ ] Re-running the regression with an added lag of the dependent variable.

> **Explanation:** Robust standard errors (such as White’s correction or HAC in certain time-series contexts) are a common and valid approach to address heteroskedasticity.

### Which term describes the phenomenon where certain variables are so closely correlated that the model can’t reliably estimate their coefficients?

- [ ] Stationarity
- [ ] Heteroskedasticity
- [ ] Normality
- [x] Multicollinearity

> **Explanation:** Multicollinearity arises when explanatory variables are highly related, making their separate effects difficult to isolate within the regression.

### One of the core ideas behind the Gauss–Markov Theorem is that if assumptions hold, OLS estimators are:

- [ ] Biased and consistent.
- [x] The Best Linear Unbiased Estimators.
- [ ] Guaranteed to be normally distributed in large samples.
- [ ] Only valid for non-time-series data.

> **Explanation:** Under the Gauss–Markov conditions, the OLS estimators have the minimum variance among all linear unbiased estimators; they’re called “Best Linear Unbiased Estimators.”

### True or False: Violation of the no autocorrelation assumption automatically implies a biased estimate of β.

- [ ] True
- [x] False

> **Explanation:** Serial correlation by itself biases the standard errors (and hence the t-statistics), not the coefficient estimates. The βs remain unbiased unless the autocorrelation stems from omitted dynamics or an unmodeled lagged dependent variable, in which case bias can arise.

Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.