Explore the key assumptions underlying simple linear regression, including linearity in parameters, random sampling, homoscedasticity, and more. Understand why these assumptions are crucial for accurate parameter estimates and reliable inference, and learn how to check for common violations using diagnostic tests and residual analysis.
So, when I was first learning linear regression, I have to admit I thought it was basically just drawing a line through a set of data points and calling it a day. Simple as that, right? Well, it turns out there’s a bit more nuance to it. The line itself—described by the regression equation—needs to follow certain assumptions to ensure that our estimates of the slope and intercept are meaningful and that all those fancy hypothesis tests we talk about in econometrics (like t-tests and confidence intervals) are valid.
Below, we walk through each major assumption in a simple linear regression model. We’ll dig into why each assumption matters, and we’ll also talk about some practical ways to see if your data might be violating these assumptions. While this article focuses on simple linear regression (one independent variable), many of the assumptions carry over to multiple regression contexts (explored in Chapter 14). And if you’re a budding financial analyst, keep an eye out for the ways these assumptions show up in topics like equity valuation, interest rate modeling, or even portfolio risk assessment.
A regression model is only as good as the assumptions on which it stands. In a financial context, messing up these assumptions can lead you to overestimate a security’s expected returns, misjudge risk exposures in a portfolio, or incorrectly conclude the significance of economic indicators. If you use erroneous outputs in real investment decisions, well, let’s just say your portfolio might take a hit.
Linearity in parameters means the model takes the following form:

Y = β₀ + β₁X + ε

where β₀ and β₁ appear to the first power only (no β₀², no √β₁, no bizarre transformations in the coefficients). This assumption does not forbid you from transforming X. For instance, you could have a model like Y = β₀ + β₁ ln(X) + ε. Even though X is logged, the model is still linear in β₀ and β₁.
• In finance, you might explore a log transformation of market capitalization (X) when modeling stock returns (Y). As long as ln(X) is just multiplied by β₁, the requirement of linearity in parameters is satisfied.
• This assumption ensures that the ordinary least squares (OLS) method can be applied directly to estimate β₀ and β₁.
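To make this concrete, here is a minimal sketch (simulated data with assumed coefficient values, not real market figures) showing that OLS needs no special machinery for a logged regressor, because the model remains linear in β₀ and β₁:

```python
import numpy as np

# Hypothetical illustration: Y = b0 + b1*ln(X) + eps is still "linear in
# parameters" because b0 and b1 enter to the first power only.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=500)   # e.g., market cap in arbitrary units
eps = rng.normal(0.0, 0.5, size=500)    # error term
Y = 2.0 + 0.8 * np.log(X) + eps         # assumed true values: b0 = 2.0, b1 = 0.8

# OLS applies directly: regress Y on the *transformed* regressor ln(X).
design = np.column_stack([np.ones_like(X), np.log(X)])
b0_hat, b1_hat = np.linalg.lstsq(design, Y, rcond=None)[0]
print(round(b0_hat, 2), round(b1_hat, 2))  # estimates land near 2.0 and 0.8
```

The transformation changes the shape of the fitted relationship between Y and X, but the estimation problem itself is still a straight-line fit in the transformed variable.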
We usually assume the data come from a random sample of the population. In practice, this is often approximated by:
• Drawing observations from a relatively large and well-mixed population.
• Ensuring independence and identical distribution (i.i.d.) of data points.
In many financial studies—like cross-sectional data on multiple firms—this assumption implies each firm’s returns and characteristics are (roughly) independent of others, or at least the data-gathering process is structured to reduce systemic biases. If we’re dealing with time-series data, we often rely on weaker forms of the random sampling assumption and must be more wary of autocorrelation.
We assume:

E(ε | X) = 0

Translated, this says that knowing X doesn’t give you any clue (on average) about the error term. Another way to read this is E(Y | X) = β₀ + β₁X. For everyday usage, it means that the regression line acts as an unbiased estimate of the actual conditional average of Y given X. For instance, if Y were stock returns and X a fundamental factor like the price-to-book ratio, we assume all the unaccounted factors (the error term) average out to zero at every X-value.
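One practical way to probe this assumption is to check whether residuals average to roughly zero in every slice of X. The sketch below (simulated data, assumed quadratic truth) shows how fitting a straight line to a curved relationship leaves residual means that bend systematically across X bins:

```python
import numpy as np

# Sketch: if E(eps | X) = 0 held for the fitted model, residuals would
# average near zero in every X bin. A curved truth fit by a line fails this.
rng = np.random.default_rng(4)
n = 2000
X = rng.uniform(-2.0, 2.0, size=n)
Y = 1.0 + 0.5 * X + 0.8 * X**2 + rng.normal(0.0, 0.3, size=n)  # assumed curved truth

# Fit a (misspecified) straight line by OLS.
design = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
resid = Y - design @ beta

# Mean residual in each of 4 equal-width X bins.
bins = np.digitize(X, [-1.0, 0.0, 1.0])
bin_means = [resid[bins == k].mean() for k in range(4)]
print([round(m, 2) for m in bin_means])  # outer bins positive, inner bins negative
```

A flat pattern of near-zero bin means is consistent with the assumption; a systematic U-shape like the one produced here signals that the conditional mean of the error is not zero at every X.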
Homoscedasticity means the variability (variance) of the error term is the same for all values of X. Symbolically,

Var(ε | X) = σ²,

a constant. If the variance of ε changes with X, that’s called heteroskedasticity. Why does this matter?
• If heteroskedasticity is present, the standard errors you compute for β₀ and β₁ might be wrong, leading you to make inaccurate inferences. Maybe you’ll think your slope is super significant when it’s not.
• Financial data can exhibit this problem often. For example, returns on low-priced assets might show more volatility (in percentage terms) than returns on high-priced assets. Or small-cap stocks might have more pronounced swings than large-cap stocks.
A quick approach is to look at a plot of residuals versus fitted values (or residuals versus X). If the spread of residuals grows or shrinks as X changes, you’re looking at potential heteroskedasticity. Formal statistical tests for this (like White’s test or the Breusch–Pagan test) are also used. This topic is explored in more depth in Chapter 14.5 when we handle multiple regression issues.
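The Breusch–Pagan idea can be sketched in a few lines: regress the squared OLS residuals on X and compute the LM statistic n·R², which is approximately chi-square(1) under homoscedasticity (the simulated data and parameter values below are assumptions for illustration):

```python
import numpy as np

# Minimal Breusch–Pagan sketch on simulated data where the error variance
# grows with X. LM = n * R^2 from the auxiliary regression; values above
# ~3.84 reject homoscedasticity at the 5% level (chi-square, 1 df).
rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(1.0, 10.0, size=n)
eps = rng.normal(0.0, 1.0, size=n) * X      # variance increases with X
Y = 1.0 + 0.5 * X + eps

design = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
resid = Y - design @ beta

# Auxiliary regression: squared residuals on X.
e2 = resid**2
gamma = np.linalg.lstsq(design, e2, rcond=None)[0]
fitted = design @ gamma
r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
lm_stat = n * r2
print(lm_stat > 3.84)  # True here: the variance clearly depends on X
```

In practice you would lean on a library implementation (e.g., the versions in standard econometrics packages), but the manual version makes the logic of the test transparent.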
No autocorrelation means that for any pair of distinct observations i ≠ j, the error terms εᵢ and εⱼ are not correlated:

Corr(εᵢ, εⱼ) = 0 for all i ≠ j.
With time-series data, this is often violated if, say, shocks to returns in one period carry over into the next period. In cross-sectional data, we usually assume independence across samples.
• Why is this a big deal? If errors are autocorrelated, your model might systematically over- or underestimate Y for certain periods or groups. Tests like Durbin–Watson or Ljung–Box are used to detect autocorrelation.
• For instance, in a financial time-series context, returns often have time dependencies, especially in volatility (leading to phenomena like GARCH processes). That can definitely break the simple assumption.
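The Durbin–Watson statistic is simple enough to compute by hand. The sketch below (simulated AR(1) errors with an assumed persistence of 0.7) shows how autocorrelated errors drag the statistic well below the "no autocorrelation" benchmark of 2:

```python
import numpy as np

# Durbin–Watson sketch: DW ≈ 2(1 - rho). Values near 2 suggest no
# first-order autocorrelation; values well below 2 suggest positive
# autocorrelation (assumed rho = 0.7 here for illustration).
rng = np.random.default_rng(2)
n = 500
eps = np.zeros(n)
for t in range(1, n):                    # AR(1) errors: eps_t = 0.7*eps_{t-1} + u_t
    eps[t] = 0.7 * eps[t - 1] + rng.normal()
X = rng.uniform(0.0, 1.0, size=n)
Y = 1.0 + 2.0 * X + eps

design = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
e = Y - design @ beta

dw = np.sum(np.diff(e) ** 2) / np.sum(e**2)
print(round(dw, 2))  # well below 2, flagging positive autocorrelation
```

With ρ = 0.7 the statistic lands near 2(1 − 0.7) = 0.6, a textbook red flag for positive serial correlation in the residuals.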
The classical linear regression model also assumes:

ε ~ N(0, σ²)

Strictly speaking, the normality of errors is relevant for small-sample inference—t-tests, confidence intervals, and F-tests rely on normally distributed errors when the sample is small. In large samples, the Central Limit Theorem (see Chapter 7.2) mitigates the normality requirement: OLS estimates remain consistent and approximately normally distributed as the sample size grows.
• In finance, error-term distributions can be thick-tailed (i.e., more prone to large outliers than a normal distribution would suggest), which is one reason real data might not fit this assumption perfectly.
• A helpful approach is to check normal probability plots or run tests like the Jarque–Bera test on your residuals. If everything looks relatively normal, you can proceed comfortably. Otherwise, you might need robust or alternative methods.
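The Jarque–Bera statistic combines sample skewness and excess kurtosis into one number: JB = (n/6)·(S² + (K − 3)²/4), which is approximately chi-square(2) under normality, so values above roughly 5.99 reject at the 5% level. A minimal sketch on simulated residuals (the sample sizes and the Student-t alternative are assumptions for illustration):

```python
import numpy as np

# Jarque–Bera sketch: JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is sample
# skewness and K is sample kurtosis. Approx. chi-square(2) under normality.
def jarque_bera(resid):
    n = resid.size
    z = (resid - resid.mean()) / resid.std()
    skew = np.mean(z**3)
    kurt = np.mean(z**4)
    return n / 6.0 * (skew**2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=2000)            # well-behaved residuals
fat_tailed = rng.standard_t(df=3, size=2000)    # thick tails, common in returns

print(round(jarque_bera(normal_resid), 1))      # typically small for normal data
print(round(jarque_bera(fat_tailed), 1))        # very large for fat-tailed data
```

The fat-tailed series produces an enormous JB statistic because its sample kurtosis far exceeds 3, exactly the pattern that trips up normality assumptions with financial returns.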
Exogeneity states:

Cov(X, ε) = 0

If X is correlated with the error ε, OLS estimates of β₁ are biased. In the real world, exogeneity is tricky. For instance, if you’re trying to see whether “advertising spending” drives “product sales,” but the company invests in advertising precisely when sales are expected to rise anyway, you get endogeneity—a correlation of X and ε. That can happen a lot in finance: suppose you regress a firm’s stock return on “analyst recommendations,” but analyst recommendations might themselves be responses to perfectly timed inside info about the firm’s performance. The causality is muddy, and your slope estimate can go off track.
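You can see the bias directly in a simulation (all parameter values below are assumed for illustration): give the regressor and the error term a common shock, and OLS drifts away from the true slope no matter how much data you feed it.

```python
import numpy as np

# Sketch of endogeneity bias: X and eps share a common shock u, so
# Cov(X, eps) > 0 and the OLS slope converges to the wrong value.
rng = np.random.default_rng(5)
n = 5000
u = rng.normal(size=n)           # common shock
eps = u + rng.normal(size=n)     # error term shares the shock...
X = u + rng.normal(size=n)       # ...and so does the regressor
Y = 2.0 + 1.0 * X + eps          # assumed true slope is 1.0

design = np.column_stack([np.ones(n), X])
b0_hat, b1_hat = np.linalg.lstsq(design, Y, rcond=None)[0]
print(round(b1_hat, 2))  # around 1.5, not 1.0: biased by Cov(X, eps)/Var(X)
```

Here the bias is Cov(X, ε)/Var(X) = 1/2, so the estimate settles near 1.5; more observations only make the wrong answer more precise, which is why endogeneity calls for design fixes (like instrumental variables) rather than bigger samples.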
Let’s walk through a tiny example. Suppose you’re analyzing the impact of an economic indicator (like interest rate changes) on a certain bond’s daily returns. You gather daily data for 200 days and run through the checklist: plot residuals against fitted values, test for heteroskedasticity (e.g., Breusch–Pagan), test for autocorrelation (e.g., Durbin–Watson), and check residual normality (e.g., Jarque–Bera).
If all looks good, you can trust your coefficient estimates more. If not, you might use heteroskedasticity-consistent standard errors, adjust for autocorrelation, or turn to more advanced techniques (like instrumental variables when exogeneity is violated).
Below is a simple Mermaid diagram to illustrate the general idea of simple linear regression with one predictor X, an error term ε, and the outcome Y.

```mermaid
flowchart LR
    X["Independent Variable (X)"] --> F["Linear Function: β₀ + β₁X"]
    E["Error Term (ε)"] --> Y["Dependent Variable (Y)"]
    F --> Y
```
In this schematic, X flows into the linear function, and we add the error term ε to get the dependent variable Y. The assumptions we’ve explored essentially specify how ε behaves relative to X and how Y is distributed around that line.
Violations of these assumptions can lead to:
• Biased parameter estimates (exogeneity issue).
• Inflated or deflated confidence levels (heteroskedasticity or autocorrelation).
• Invalid hypothesis testing and inference (non-normality in small samples).
In a real-world portfolio management setting, incorrectly concluding a factor is significant (when in reality your standard errors are off) can lead you to overweight or underweight assets, distorting your risk/return profile. For exam purposes, you’ll often be asked to:
• Identify potential violations from a residual plot.
• Discuss their impact on the reliability of the estimates.
• Suggest solutions or alternative tests, such as using robust standard errors or transformations.
• Always plot your data. It’s amazing how much you can detect visually—nonlinear patterns, outliers, or changes in variance.
• Be cautious about confounding variables. In finance, factors often correlate with each other (e.g., size and value style factors). If X is correlated with the error term, you might need instrumental variable approaches (discussed in advanced regressions).
• Don’t assume everything is normal. Financial returns are notorious for fat tails. Large sample sizes help, but you might consider robust or nonparametric techniques.
• Pay attention to time-series aspects. If you’re using daily or monthly returns, watch out for autocorrelation. Run those tests!
• Always interpret your results in the context of these assumptions. If you suspect some are violated, mention that caveat in your analysis.
Note to Readers:
Always remember that, in finance, real-world data can be messy—non-normal distributions, correlated errors, and structural changes over time are quite common. Don’t let these assumptions intimidate you, though: the key is knowing how to check each assumption and how to proceed (e.g., using heteroskedasticity-consistent standard errors, or adjusting for autocorrelation).
By mastering these assumptions, you’ll have a stronger feel for when your regression results are reliable and when they might be leading you astray. Good luck studying, and be sure to explore Chapter 10.6 for hypothesis testing on regression coefficients and Chapter 14 for multiple regression complexities!
Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.