Browse CFA Level 1

Assumptions of the Simple Linear Regression Model

Explore the key assumptions underlying simple linear regression, including linearity in parameters, random sampling, homoscedasticity, and more. Understand why these assumptions are crucial for accurate parameter estimates and reliable inference, and learn how to check for common violations using diagnostic tests and residual analysis.

Introduction§

So, when I was first learning linear regression, I have to admit I thought it was basically just drawing a line through a set of data points and calling it a day. Simple as that, right? Well, it turns out there’s a bit more nuance to it. The line itself—described by the regression equation—needs to follow certain assumptions to ensure that our estimates of the slope and intercept are meaningful and that all those fancy hypothesis tests we talk about in econometrics (like t-tests and confidence intervals) are valid.

Below, we walk through each major assumption in a simple linear regression model. We’ll dig into why each assumption matters, and we’ll also talk about some practical ways to see if your data might be violating these assumptions. While this article focuses on simple linear regression (one independent variable), many of the assumptions carry over to multiple regression contexts (explored in Chapter 14). And if you’re a budding financial analyst, keep an eye out for the ways these assumptions show up in topics like equity valuation, interest rate modeling, or even portfolio risk assessment.

Why These Assumptions Matter§

A regression model is only as good as the assumptions on which it stands. In a financial context, messing up these assumptions can lead you to overestimate a security’s expected returns, misjudge risk exposures in a portfolio, or incorrectly conclude the significance of economic indicators. If you use erroneous outputs in real investment decisions, well, let’s just say your portfolio might take a hit.

Linearity in Parameters§

Linearity in parameters means the model takes the following form:

Yi=β0+β1Xi+εi, Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,

where β0\beta_0 and β1\beta_1 appear to the first power only (no β12\beta_1^2 or β0β1\beta_0 \cdot \beta_1 or bizarre transformations in the coefficients). This assumption does not forbid you from transforming XX. For instance, you could have a model like Yi=β0+β1ln(Xi)+εi Y_i = \beta_0 + \beta_1 \ln(X_i) + \varepsilon_i . Even though XX is logged, the model is still linear in β0\beta_0 and β1\beta_1.

• In finance, you might explore a log transformation of market capitalization (X) when modeling stock returns (Y). As long as β1\beta_1 is just multiplied by ln(X)\ln(X), the requirement of linearity in parameters is satisfied.

• This assumption ensures that the ordinary least squares (OLS) method can be applied directly to estimate β0\beta_0 and β1\beta_1.

Random Sampling§

We usually assume the data (Xi,Yi)(X_i, Y_i) come from a random sample of the population. In practice, this is often approximated by:

• Drawing observations from a relatively large and well-mixed population.
• Ensuring independence and identical distribution (i.i.d.) of data points.

In many financial studies—like cross-sectional data on multiple firms—this assumption implies each firm’s returns and characteristics are (roughly) independent of others, or at least the data-gathering process is structured to reduce systemic biases. If we’re dealing with time-series data, we often rely on weaker forms of the random sampling assumption and must be more wary of autocorrelation.

Mean Zero Error Term§

We assume:

E[εiXi]=0. E[\varepsilon_i \mid X_i] = 0.

Translated, this says that knowing XiX_i doesn’t give you any clue (on average) about the error term. Another way to read this is E[YiXi]=β0+β1Xi.E[Y_i \mid X_i] = \beta_0 + \beta_1 X_i. For everyday usage, it means that the regression line acts as an unbiased estimate of the actual conditional average of YY. For instance, if YY were stock returns and XX a fundamental factor like price-to-book ratio, we assume all the unaccounted factors (the error term) average out to zero at every XX-value.

Homoscedasticity (Constant Variance of Errors)§

Homoscedasticity means the variability (variance) of the error term ε\varepsilon is the same for all values of XX. Symbolically,

Var(εiXi)=σ2, \text{Var}(\varepsilon_i \mid X_i) = \sigma^2,

a constant. If the variance of ε\varepsilon changes with XX, that’s called heteroskedasticity. Why does this matter?

• If heteroskedasticity is present, the standard errors you compute for your β^0\hat{\beta}_0 and β^1\hat{\beta}_1 might be wrong, leading you to make inaccurate inferences. Maybe you’ll think your slope is super significant when it’s not.

• Financial data can exhibit this problem often. For example, returns on low-priced assets might show more volatility (in percentage terms) than returns on high-priced assets. Or small-cap stocks might have more pronounced swings than large-cap stocks.

Checking for Heteroskedasticity§

A quick approach is to look at a plot of residuals versus fitted values (or residuals versus XX). If the spread of residuals grows or shrinks as XX changes, you’re looking at potential heteroskedasticity. Formal statistical tests for this (like White’s test or the Breusch–Pagan test) are also used. This topic is explored in more depth in Chapter 14.5 when we handle multiple regression issues.

No Autocorrelation (Uncorrelated Errors)§

No autocorrelation means that for any pair of distinct observations (i,j)(i, j), the error terms εi\varepsilon_i and εj\varepsilon_j are not correlated:

Cov(εi,εj)=0for ij. \text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j.

With time-series data, this is often violated if, say, shocks to returns in one period carry over into the next period. In cross-sectional data, we usually assume independence across samples.

• Why is this a big deal? If errors are autocorrelated, your model might systematically over-or-underestimate certain times or groups. Tests like Durbin–Watson or Ljung–Box are used to detect autocorrelation.

• For instance, in a financial time-series context, returns often have time dependencies, especially in volatility (leading to phenomena like GARCH processes). That can definitely break the simple assumption.

Normality of Errors§

The classical linear regression model also assumes:

εiN(0,σ2). \varepsilon_i \sim N(0, \sigma^2).

Strictly speaking, the normality of errors is relevant for small-sample inference—those t-tests, confidence intervals, and F-tests rely on synergy between normal errors and sample size. In large samples, the Central Limit Theorem (see Chapter 7.2) can mitigate the normality requirement, making OLS estimates remain consistent and nearly normally distributed (as the sample size grows).

• In finance, error term distributions can be thick-tailed (i.e., more prone to large outliers than a normal would expect), which is one reason real data might not fit perfectly into this assumption.

• A helpful approach is to check normal probability plots or run tests like the Jarque–Bera test on your residuals. If everything looks relatively normal, you can proceed comfortably. Otherwise, you might need robust or alternative methods.

Exogeneity (No Correlation Between XX and ε\varepsilon)§

Exogeneity states:

Cov(Xi,εi)=0. \text{Cov}(X_i, \varepsilon_i) = 0.

If XX is correlated with the error ε\varepsilon, OLS estimates of β1\beta_1 are biased. In the real world, exogeneity is tricky. For instance, if you’re trying to see if “advertising spending” drives “product sales,” but the company invests in advertising precisely when sales are expected to rise anyway, you get endogeneity—a correlation of XX and ε\varepsilon. That can happen a lot in finance: suppose you regress a firm’s stock return on “analyst recommendations,” but analyst recommendations might themselves be responses to perfectly timed inside info about the firm’s performance. The cause is muddy, and your slope estimate β^1\hat{\beta}_1 can go off track.

Practical Example of Checking Assumptions§

Let’s walk through a tiny example. Suppose you’re analyzing the impact of an economic indicator (like interest rate changes) on a certain bond’s daily returns. You gather daily data for 200 days:

  1. First, you run an OLS regression of Bond Return on Change in Interest Rate.
  2. Then you plot the residuals against fitted values to see if they form a uniform band around zero (checking homoscedasticity).
  3. You run the Durbin–Watson test for autocorrelation.
  4. You check the correlation between your predictor (change in interest rates) and the residuals to see if there’s exogeneity.
  5. Finally, you do a quick normality test on the residuals (e.g., Jarque–Bera).

If all looks good, you can trust your coefficient estimates more. If not, you might correct for heteroskedasticity, to handle autocorrelation, or use more advanced techniques (like instrumental variables when exogeneity is violated).

Visualizing the Regression Framework§

Below is a simple Mermaid diagram to illustrate the general idea of simple linear regression with one predictor XX, an error term ε\varepsilon, and the outcome YY.

In this schematic, XX flows into the linear function, and we add the error term ε\varepsilon to get the dependent variable YY. The assumptions we’ve explored essentially specify how ε\varepsilon behaves relative to XX and how YY is distributed around that line.

How Violations Affect Financial Analysis§

Violations of these assumptions can lead to:

• Biased parameter estimates (exogeneity issue).
• Inflated or deflated confidence levels (heteroskedasticity or autocorrelation).
• Invalid hypothesis testing and inference (non-normality in small samples).

In a real-world portfolio management setting, incorrectly concluding a factor is significant (when in reality your standard errors are off) can lead you to overweight or underweight assets, distorting your risk/return profile. For exam purposes, you’ll often be asked to:

• Identify potential violations from a residual plot.
• Discuss their impact on the reliability of the estimates.
• Suggest solutions or alternative tests, such as using robust standard errors or transformations.

Best Practices and Common Pitfalls§

• Always plot your data. It’s amazing how much you can detect visually—nonlinear patterns, outliers, or changes in variance.
• Be cautious about confounding variables. In finance, factors often correlate with each other (e.g., size and value style factors). If XX is correlated with the error term, you might need instrumental variable approaches (discussed in advanced regressions).
• Don’t assume everything is normal. Financial returns are notorious for fat tails. Large sample sizes help, but you might consider robust or nonparametric techniques.
• Pay attention to time-series aspects. If you’re using daily or monthly returns, watch out for autocorrelation. Run those tests!
• Always interpret your results in the context of these assumptions. If you suspect some are violated, mention that caveat in your analysis.

Exam Tips for CFA Candidates§

  1. Familiarize yourself with the typical tests: Durbin–Watson for autocorrelation, White’s or Breusch–Pagan for heteroskedasticity.
  2. Know how each assumption affects the validity of OLS estimators (particularly unbiasedness, consistency, and the correct standard errors).
  3. When a question states you suspect “unequal variances,” it’s hinting at heteroskedasticity. Arm yourself with the knowledge of how to fix it or properly interpret the question’s implications.
  4. Don’t ignore the line “the data are from a random sample.” This is your signal the exam question is referencing the random sampling assumption.
  5. Watch out for mention of “endogeneity,” “omitted variable bias,” or “simultaneity”—all of which break exogeneity.

References and Further Reading§

  • Stock, J. H. & Watson, M. W. (2020). Introduction to Econometrics. Pearson.
  • NIST/SEMATECH e-Handbook of Statistical Methods: https://www.itl.nist.gov/div898/handbook/
  • CFA Program Curriculum, particularly the sections on quantitative methods for deeper examples of testing these assumptions.

Test Your Knowledge: Simple Linear Regression Assumptions Quiz§


Note to Readers:
Always remember that, in finance, real-world data can be messy—non-normal distributions, correlated errors, and structural changes over time are quite common. Don’t let these assumptions intimidate you, though: the key is knowing how to check each assumption and how to proceed (e.g., using heteroskedasticity-consistent standard errors, or adjusting for autocorrelation).

By mastering these assumptions, you’ll have a stronger feel for when your regression results are reliable and when they might be leading you astray. Good luck studying, and be sure to explore Chapter 10.6 for hypothesis testing on regression coefficients and Chapter 14 for multiple regression complexities!

Monday, March 24, 2025 Friday, March 21, 2025

Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.