Learn how to extend simple linear regression to multiple regression by incorporating multiple explanatory variables, exploring OLS assumptions, evaluating model fit, and applying the framework in real-world finance scenarios.
If you’ve ever tried to predict a company’s future earnings based on just one factor—say, historical earnings growth—you probably realized pretty quickly that real-world outcomes are influenced by more than a single feature. That’s exactly where multiple regression steps in. It’s like moving from a single flashlight in a dark room to lighting up the entire space with multiple light sources. With multiple regression, we can use a whole set of explanatory variables (independent variables) to better understand and forecast a dependent variable such as sales, profits, stock returns, or practically any metric of interest.
Multiple regression is really valuable for finance professionals who want to isolate the effect of multiple factors on an outcome. In equity research, for instance, we might tie a stock’s return (dependent variable) to factors like market risk (beta), company size, valuation metrics, and momentum signals. In corporate finance, we might link a firm’s credit rating to leverage ratios, liquidity, and historical performance. This ability to capture many drivers at once is what makes multiple regression such a powerful tool.
Multiple regression, in its most common form, is estimated using Ordinary Least Squares (OLS). The model can be written as:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
$$

where:
• y is the dependent variable (e.g., a firm’s stock return).
• \(x_1, x_2, \dots, x_k\) are the independent (explanatory) variables.
• \(\beta_0\) is the intercept term.
• \(\beta_1, \beta_2, \dots, \beta_k\) are the regression coefficients (we call these “beta-hats” when estimated).
• \(\varepsilon\) is the error term, capturing influences not explicitly included in the model.
OLS finds those coefficient estimators \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\) that minimize the sum of squared residuals:

$$
\min_{\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k} \; \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i} - \cdots - \hat{\beta}_k x_{ki} \right)^2
$$
In simpler terms, OLS tries to fit the best “line” (or hyperplane, to be precise) through the data points in a multi-dimensional space of explanatory variables.
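To make the mechanics concrete, here is a minimal sketch of an OLS fit in Python using statsmodels. The simulated data and variable names (market_return, gdp_growth) are illustrative assumptions only, not the text’s own example.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data for illustration only: two hypothetical explanatory variables
rng = np.random.default_rng(42)
n = 120
market_return = rng.normal(0.01, 0.04, n)   # hypothetical factor 1
gdp_growth = rng.normal(0.005, 0.01, n)     # hypothetical factor 2
noise = rng.normal(0, 0.02, n)
stock_return = 0.002 + 0.65 * market_return - 0.34 * gdp_growth + noise

# Build the design matrix; add_constant appends the intercept column
X = sm.add_constant(np.column_stack([market_return, gdp_growth]))
results = sm.OLS(stock_return, X).fit()   # OLS minimizes the sum of squared residuals

print(results.params)    # estimated beta-hats: intercept, beta_1, beta_2
print(results.rsquared)  # share of variation in y explained by the model
```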
One of the first times I tried to interpret a multiple regression output, I got excited about the meaning of the coefficients—only to realize afterwards I’d broken a fundamental assumption in the data set. So let’s talk about these assumptions up front.
• Linearity. The relationship between the dependent variable and each independent variable should be linear in the parameters (coefficients). This doesn’t mean x must be linear in its own raw scale (transformations are allowed), but the parameters themselves must enter the model linearly.
• Random Sampling. Observations should be representative of the population, typically assumed random.
• No Perfect Multicollinearity. Two or more predictor variables should not be perfectly or near-perfectly correlated. High correlation among x’s can make coefficient estimates unreliable.
• Homoskedasticity. The variance of the error term is constant for all observations. If errors grow as x grows, you’d have heteroskedasticity.
• No Autocorrelation. Error terms from different observations should be uncorrelated with each other.
• Normality of Errors. While not always essential for unbiased estimates, normal residuals are typically assumed for valid hypothesis testing on coefficients.
These assumptions aren’t just academic. Violate them, and your inferences—like t-tests or F-tests—might become suspect. In real markets, though, perfect compliance with every assumption is rare, so it’s crucial to test for and, if needed, correct violations as part of the modeling process.
The specification of a multiple regression model involves deciding which explanatory variables to include and in what functional form. For example, you might want to predict a stock’s return from market risk, size, and valuation factors, or a firm’s credit rating from leverage, liquidity, and historical performance. When you plan your model, let economic or financial theory guide which variables to include, and consider whether transformations such as logs or polynomial terms are warranted.
Here’s a quick look at a typical workflow for building a multiple regression model:
```mermaid
flowchart LR
    A["Collect Data"] --> B["Check Data Quality <br/>(Missing Values, Outliers)"]
    B --> C["Specify Model <br/>(Choose Variables)"]
    C --> D["Estimate Coefficients <br/>(Using OLS)"]
    D --> E["Assess Model Fit <br/>(R², Residual Analysis)"]
    E --> F["Interpret Results <br/>& Conduct Diagnostics"]
```
Once you run the regression, you’ll often see output in a table that might look something like this:
| Coefficient | Estimate | Std. Error | t-Stat | p-Value |
|---|---|---|---|---|
| Intercept (β₀) | 2.35 | 0.78 | 3.01 | 0.01 |
| Market Return (β₁) | 0.65 | 0.12 | 5.42 | 0.00 |
| GDP Growth (β₂) | -0.34 | 0.09 | -3.78 | 0.00 |
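In practice, statistical software produces this kind of table for you. As a rough sketch, here is how you might generate and read the equivalent output with statsmodels; the data are simulated for illustration only.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data only; in practice you would load your own observations
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([2.35, 0.65, -0.34]) + rng.normal(scale=1.0, size=100)

results = sm.OLS(y, X).fit()
print(results.summary())   # full regression table: coefficients, std. errors, t-stats, p-values

# Individual components, if you prefer to work with them programmatically
print(results.params)      # coefficient estimates
print(results.bse)         # standard errors
print(results.tvalues)     # t-statistics
print(results.pvalues)     # p-values
```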
R² (Coefficient of Determination). R² measures how much of the variation in the dependent variable is explained by the regression model. R² is always between 0 and 1. A higher R² suggests the model captures more of the variation in y.
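In terms of sums of squares, R² is commonly written as:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$

where \(\hat{y}_i\) are the fitted values and \(\bar{y}\) is the sample mean of the dependent variable.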
Adjusted R². This modifies R² to penalize the inclusion of independent variables that do not genuinely improve model performance. The formula incorporates the sample size (n) and the number of predictors (k). If a new variable improves the model only marginally relative to the complexity it adds, adjusted R² might actually drop, signaling overfitting or irrelevance.
Multiple regression is awesome when done right, but it can be misleading if you overlook certain issues.
If two or more independent variables are heavily correlated, the model may have trouble isolating each variable’s individual effect. Coefficient estimates might become large in magnitude, flip signs unpredictably, or yield large standard errors. Tools like the Variance Inflation Factor (VIF) can help detect multicollinearity.
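As a brief sketch of how this check might look in Python, the snippet below computes a VIF for each predictor with statsmodels; the factor names and the deliberately correlated simulated data are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical, partly collinear predictors for illustration
rng = np.random.default_rng(1)
size_factor = rng.normal(size=200)
value_factor = 0.9 * size_factor + rng.normal(scale=0.3, size=200)  # deliberately correlated
momentum = rng.normal(size=200)

X = pd.DataFrame({"size": size_factor, "value": value_factor, "momentum": momentum})
X = sm.add_constant(X)

# A VIF well above 5-10 is often read as a warning sign of multicollinearity
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept's VIF is not meaningful
    print(name, variance_inflation_factor(X.values, i))
```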
When error-term variance changes with the level of x, standard OLS confidence intervals become unreliable. One way to check is to look at plots of residuals versus fitted values: if the spread of residuals increases with the fitted values, that’s a clue. The White test or the Breusch-Pagan test is often used for more formal detection. If heteroskedasticity is present, robust standard errors or Weighted Least Squares can help.
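For a rough illustration, the sketch below runs a Breusch-Pagan test and then refits the model with heteroskedasticity-robust (HC1) standard errors; the data are simulated so that the error spread widens with x.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative data with variance that grows with x (heteroskedastic by construction)
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # error spread widens with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value suggests heteroskedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, results.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)

# One remedy: refit with heteroskedasticity-robust (White/HC) standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(robust.bse)   # robust standard errors
```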
In time-series data, errors can be correlated across different time periods. The Durbin-Watson test is commonly used to detect first-order autocorrelation. If present, you might need to use specialized methods (e.g., Newey-West standard errors).
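A minimal sketch of these checks, assuming simulated monthly data with AR(1) errors, might look like this:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative monthly series with AR(1) errors (autocorrelated by construction)
rng = np.random.default_rng(3)
n = 120
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + rng.normal(scale=0.5)
y = 0.2 + 0.8 * x + e

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Durbin-Watson near 2 suggests no first-order autocorrelation;
# values well below 2 suggest positive autocorrelation
print("Durbin-Watson:", durbin_watson(results.resid))

# Newey-West (HAC) standard errors are robust to autocorrelation and heteroskedasticity
hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 6})
print(hac.bse)
```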
Sometimes the relationship between x and y is not linear. For instance, a company’s revenue might scale exponentially rather than linearly with time. In that case, log transformations or polynomial terms might improve the specification.
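As a quick illustration, the sketch below compares a log specification with a polynomial specification using the statsmodels formula interface; the revenue series is simulated and purely hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical revenue series that grows roughly exponentially with time
rng = np.random.default_rng(4)
df = pd.DataFrame({"t": np.arange(1, 61)})
df["revenue"] = 100 * np.exp(0.05 * df["t"]) * np.exp(rng.normal(scale=0.05, size=60))

# A log transformation makes the relationship linear in the parameters
log_model = smf.ols("np.log(revenue) ~ t", data=df).fit()

# Polynomial terms are another option when curvature is suspected
poly_model = smf.ols("revenue ~ t + I(t**2)", data=df).fit()

print(log_model.rsquared_adj, poly_model.rsquared_adj)
```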
If your data aren’t randomly selected or if some observations are systematically excluded, your regression results might be biased. Make sure your sample is representative of the population you aim to describe.
Let’s consider a simplified yet practical scenario. Suppose you’re analyzing a real estate investment trust’s (REIT) total return. You believe it’s influenced by (1) GDP growth, (2) a construction index, and (3) interest rates. Your multiple regression equation could be:

$$
\text{REIT Return} = \beta_0 + \beta_1 (\text{GDP Growth}) + \beta_2 (\text{Construction Index}) + \beta_3 (\text{Interest Rate}) + \varepsilon
$$
You gather monthly data for each variable over five years and run an OLS regression. Suppose the output shows a statistically significant negative coefficient on interest rates. You might spot, however, that your residual plot has a funnel shape, indicating potential heteroskedasticity. You then decide to use robust standard errors. The interest rate coefficient remains negative and stays significant, supporting your insight that interest rate hikes probably hamper REIT returns.
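A rough sketch of how that REIT regression with robust standard errors might be coded is shown below; the column names (reit_return, gdp_growth, construction_idx, interest_rate), the coefficients, and the simulated data are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical monthly data (60 observations = five years); values are illustrative
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "gdp_growth": rng.normal(0.005, 0.01, 60),
    "construction_idx": rng.normal(0.0, 1.0, 60),
    "interest_rate": rng.normal(0.03, 0.005, 60),
})
df["reit_return"] = (0.01 + 0.8 * df["gdp_growth"] + 0.02 * df["construction_idx"]
                     - 1.5 * df["interest_rate"] + rng.normal(0, 0.02, 60))

# Refit with heteroskedasticity-robust (HC1) standard errors
# after spotting the funnel-shaped residual plot
model = smf.ols("reit_return ~ gdp_growth + construction_idx + interest_rate", data=df)
robust_results = model.fit(cov_type="HC1")
print(robust_results.summary())
```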
Model building is iterative; you can’t just rely on a single pass. Estimate the model, run the diagnostics, refine the specification where needed, and re-estimate until the results hold up.
An enduring question: Should we rely on R² or adjusted R²? When you add new variables, R² typically rises (or at least never falls). But high R² doesn’t necessarily mean your model is good if you keep adding random noise variables. Adjusted R² attempts to correct for over-specification. It’s usually the safer measure when comparing two models with different numbers of explanatory variables.
The formula for adjusted R² is often given by:

$$
\text{Adjusted } R^2 = 1 - \left[ (1 - R^2) \times \frac{n - 1}{n - k - 1} \right]
$$

where \(n\) is the number of observations, and \(k\) is the number of predictors in the model (excluding the intercept).
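As a quick check of the formula, the sketch below computes adjusted R² by hand and compares it with the value statsmodels reports; the data are simulated for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data only
rng = np.random.default_rng(6)
n, k = 100, 3                                   # 100 observations, 3 predictors
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=n)

results = sm.OLS(y, X).fit()

# Adjusted R² by hand, matching the formula above
r2 = results.rsquared
adj_r2_manual = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adj_r2_manual, results.rsquared_adj)      # the two values should agree
```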
• Corroborate With Theory. Don’t just let a stepwise or computer-driven variable selection process choose your factors. Use economic or finance theory to guide variable inclusion.
• Collect Enough Data. More variables usually mean you need more observations. A rule of thumb is at least 15 observations per variable, though more is always better.
• Test Different Specifications. Sometimes, logs or polynomial terms significantly improve the fit and interpretability.
• Evaluate Outliers. Ask yourself: Are outliers data errors or real phenomena? If real, they can heavily influence your results. Sensitivity analysis is your friend.
• Implement Robust or Generalized Methods. If assumptions like homoskedasticity or normality fail, you can use robust standard errors, Weighted Least Squares (WLS), or Generalized Least Squares (GLS).
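If homoskedasticity fails and you have a view on how the error variance behaves, Weighted Least Squares is one option. A minimal sketch, assuming the error standard deviation is proportional to x, follows; the setup is illustrative only.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative heteroskedastic data: error variance grows with x
rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2 * x)

X = sm.add_constant(x)

# WLS downweights the noisier observations; assuming the error standard deviation
# is proportional to x, the weights are proportional to 1 / x**2
wls_results = sm.WLS(y, X, weights=1.0 / x**2).fit()
ols_results = sm.OLS(y, X).fit()

print("OLS std. errors:", ols_results.bse)
print("WLS std. errors:", wls_results.bse)
```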
I remember the first time I tried building a multi-factor model to forecast a firm’s equity returns. I kind of threw everything into the model—GDP, interest rates, consumer confidence, capital expenditures—and the model’s R² soared. But guess what? Half of those variables had huge p-values, and a few were strongly correlated with each other. Ultimately, the model was less stable than I expected. Adjusted R² told a different story, dropping whenever I threw in questionable variables. That was a valuable (though slightly embarrassing) learning experience that taught me: focusing on parsimony is crucial, and high R² alone can be misleading.
Multiple regression is a powerful extension of simple linear regression, enabling analysts to understand how various explanatory variables simultaneously impact a dependent variable. Whether you’re examining how multiple economic factors drive stock prices or studying how operational metrics affect a company’s bottom line, multiple regression can reveal nuanced relationships that single-factor models miss.
That said, you want to remain vigilant: watch out for assumption violations, keep an eye on diagnostics, and ensure you choose your explanatory variables wisely. With robust model checks and a good theoretical foundation, you’ll be able to harness the full power of multiple regression in your financial analyses.
• Clearly understand the main OLS assumptions; typical exam questions love to test your knowledge of which assumption was violated in a given scenario.
• Know how to interpret both R² and adjusted R².
• Be ready to apply hypothesis testing on coefficients (e.g., t-tests for significance) and the overall model (F-test).
• Expect scenario-based questions about detecting and handling violations like heteroskedasticity or multicollinearity.
• Practice reading output tables quickly, focusing on coefficient estimates, standard errors, t-stats, and p-values to see if a variable is telling a meaningful story.
Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.