Explore how to choose the right variables, functional forms, and diagnostics in multiple regression models, including common pitfalls and best-practice extensions.
Sometimes I think back to when I first learned regression. I remember sitting in a small study room with my friend, who was absolutely convinced that if he put every single variable he could think of into his model—things like yesterday’s temperature or the number of coffee cups he’d had that morning—he’d get the “best” analysis. Turns out, that approach sank faster than a poorly built boat. That was my first real lesson in the importance of proper model specification. In this section, we’ll talk about what that means, why we often make mistakes (misspecification), and how to extend our regression to capture more complex real-world relationships.
Before we jump into the details, keep in mind that this content builds upon our discussion in Section 1.1 concerning the fundamentals of Multiple Regression. If you need a quick refresher on ordinary least squares (OLS), hypothesis testing, or the assumptions behind regression models, you might want to glance back at that section.
Let’s dive right in.
Model specification is basically your game plan—like picking which players you want on the field and deciding your strategy. You’ve got to figure out:
• Which independent variables (predictors) are relevant.
• How these variables should enter the model (linear form vs. nonlinear transformation).
• The overall functional form (like a plain linear regression, a polynomial, or a log-linear model).
This is crucial because your final model’s validity hinges on it. Think of it like building a house: using the correct blueprint with the right materials ensures the final structure is sturdy. If you leave out crucial structural beams (i.e., important variables) or try to add decorative pillars that don’t fit (irrelevant predictors), you can compromise the entire project.
There’s a common dilemma in building statistical models:
• Under-specification means you might have omitted a relevant variable. This can cause a bias in the coefficients, especially if the missing variable is correlated with the included ones.
• Over-specification, on the other hand, feels like adding too much seasoning to a dish. Sure, you might stumble onto some interesting flavors, but you may also bury the main taste under a bunch of random spices. Similarly, in statistics, adding tons of irrelevant variables can inflate the variance of your estimates, making it tougher to detect true relationships.
Nobody’s perfect, and that’s particularly true in modeling. Mistakes happen—some big, some small. Here are some ways we can get it wrong.
Omitting a relevant variable is a big no-no, especially if that variable is correlated with the regressors you included. The classic example is a wage equation that leaves out years of education. If “years of education” is correlated with “age” or “industry experience,” your results suddenly become pretty wonky: the omitted variable’s effect gets lumped into the included regressors, biasing your coefficient estimates and muddying their interpretation.
• Real-World Example: Suppose you’re modeling housing prices in a Canadian city. If you forget to include something important like distance to a major train station, your model may attribute too much importance to other variables, such as the number of nearby grocery stores or local crime rates. That’s because distance to the train station is likely correlated with those variables, so their coefficients pick up part of its effect.
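To make the omitted-variable problem concrete, here is a minimal simulation sketch of the wage example. Everything in it is an assumption for illustration: the data are synthetic, the "true" coefficients (2.0 on experience, 3.0 on education) are made up, and the helper name `ols` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: education and experience are positively correlated,
# and both truly affect the wage (the coefficients 2.0 and 3.0 are made up).
education = rng.normal(0.0, 1.0, n)
experience = 0.6 * education + rng.normal(0.0, 1.0, n)
wage = 1.0 + 2.0 * experience + 3.0 * education + rng.normal(0.0, 1.0, n)


def ols(y, *columns):
    """OLS coefficients (intercept first) from a least-squares solve."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


full = ols(wage, experience, education)  # correctly specified model
short = ols(wage, experience)            # education omitted

print(f"experience slope, full model:  {full[1]:.3f}")
print(f"experience slope, short model: {short[1]:.3f}")
```

In this setup the full model should recover a slope near 2.0, while the short model's slope lands well above it, because the experience coefficient absorbs part of education's effect. The size of the bias depends on how strongly the omitted and included variables are correlated.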
Picture this scenario: you’ve got a strong suspicion that a new variable—maybe “average monthly rainfall”—should be included in your model of consumer spending. But let’s face it: unless your spending is heavily dependent on the weather (like umbrella sales), you might just be adding noise. Including irrelevant variables can:
• Increase the complexity of the model,
• Decrease the precision of your estimates (bigger standard errors),
• Confuse the interpretation of your results, especially if readers keep asking, “What’s with the rainfall variable?”
So, do you see how letting your model be guided by a “throw everything at the wall and see what sticks” mentality can lead to confusion?
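The precision cost of an irrelevant regressor can be sketched with simulated data. The assumptions here are mine: `z` stands in for an irrelevant variable (something like the rainfall series) that happens to be strongly correlated with the relevant regressor `x`, and the helper `ols_with_se` is a hypothetical name that computes classical OLS standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

x = rng.normal(size=n)                                  # relevant regressor
z = 0.9 * x + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # irrelevant, corr(x, z) about 0.9
y = 1.0 + 2.0 * x + rng.normal(size=n)                  # z plays no role in y


def ols_with_se(y, *columns):
    """Return OLS coefficients and classical standard errors."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))


beta_small, se_small = ols_with_se(y, x)
beta_big, se_big = ols_with_se(y, x, z)

print(f"SE of x slope without z: {se_small[1]:.4f}")
print(f"SE of x slope with z:    {se_big[1]:.4f}")
```

The slope on `x` stays unbiased either way, but its standard error grows once `z` is added. The inflation is large here only because `z` is highly correlated with `x`; an irrelevant variable that is uncorrelated with the other regressors costs much less precision.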
Sometimes a scatter plot of your data makes it pretty obvious that the relationship is curved, yet we still force-fit a straight line out of habit or convenience. That’s a “wrong functional form” misspecification.
• Consequences: The coefficients you estimate might not capture the real pattern, and predictions will be systematically off.
• Solutions: Add transformations (like polynomial terms or logs), or move to a more flexible specification (piecewise or nonparametric forms), so your model can capture the actual pattern in the data.
Example: If you suspect that the effect of an interest rate on corporate spending diminishes beyond a certain point, a polynomial or a piecewise linear function might work better than a plain linear slope.
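One way to see a wrong functional form in practice is to look for a systematic pattern in the residuals. The sketch below assumes a synthetic, made-up relationship in which the negative effect of the interest rate on spending weakens as the rate rises; the variable names and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Hypothetical curved relationship: the drag from higher rates diminishes.
rate = rng.uniform(0.0, 5.0, n)
spending = 10.0 - 2.0 * rate + 0.3 * rate**2 + rng.normal(0.0, 0.5, n)

# Misspecified straight-line fit.
X_lin = np.column_stack([np.ones(n), rate])
beta_lin, *_ = np.linalg.lstsq(X_lin, spending, rcond=None)
resid = spending - X_lin @ beta_lin

# Residuals should average zero everywhere if the form is right.
middle = (rate > 5 / 3) & (rate < 10 / 3)
mean_resid_middle = resid[middle].mean()
mean_resid_ends = resid[~middle].mean()

# Adding the squared term lets the model follow the curve.
X_quad = np.column_stack([np.ones(n), rate, rate**2])
beta_quad, *_ = np.linalg.lstsq(X_quad, spending, rcond=None)

print(f"mean residual, middle of range: {mean_resid_middle:.3f}")
print(f"mean residual, ends of range:   {mean_resid_ends:.3f}")
print(f"estimated coefficient on rate^2: {beta_quad[2]:.3f}")
```

With the straight line, the residuals are systematically negative in the middle of the range and positive at the ends, which is exactly the "predictions will be systematically off" symptom. A residual-versus-X plot would show the same thing visually.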
So, you’re collecting data for “annual advertising spend,” but your data is riddled with errors: some CFOs reported it for the entire region, others reported monthly averages, and so on. If your main independent variable is measured incorrectly, you’ll likely get a biased slope (with classical random measurement error, the slope is pulled toward zero) and confidence intervals that lie to you.
• Strategies to Overcome: “Instrumental variable” approaches, data cleaning, or robust measurement techniques can help when you suspect your variable measurement is fuzzy.
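A short simulation can illustrate what measurement error in a regressor does to the slope. This sketch assumes classical measurement error (random noise that is independent of the true value), which is only one of several ways data can be wrong; the variable names and the true slope of 2.0 are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical true advertising spend and its effect on sales.
ad_spend_true = rng.normal(0.0, 1.0, n)
sales = 1.0 + 2.0 * ad_spend_true + rng.normal(0.0, 1.0, n)

# What we actually observe: the true value plus independent reporting noise.
ad_spend_reported = ad_spend_true + rng.normal(0.0, 1.0, n)


def slope(y, x):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


slope_true = slope(sales, ad_spend_true)
slope_noisy = slope(sales, ad_spend_reported)

print(f"slope using true spend:     {slope_true:.3f}")
print(f"slope using reported spend: {slope_noisy:.3f}")
```

Because the noise variance equals the variance of the true regressor in this setup, the estimated slope shrinks to roughly half of its true value. Less noisy measurements would shrink it less.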
Let’s face it: not every influential factor in finance is numeric. Sometimes you want to model the presence or absence of an event: an acquisition, a shift in U.S. or Canadian trade policy, or even something intangible like “new CEO at the helm.”
Dummy variables let you represent these on-off conditions in your regression. If the event is present, you code it as 1; otherwise, it’s 0. This is super handy for:
• Structural breaks: Did Canadian interest rate policies change drastically in 2010? Pop a dummy variable for data points post-2010 to see how that new policy might have shifted a certain trend.
• Categorical Organization: Industries, regions, or big policy announcements (like NAFTA vs. USMCA trade environment) can also be represented.
Be mindful of the “dummy variable trap”: if you include a dummy for every category alongside an intercept, the dummies sum to the intercept column and you get perfect multicollinearity. Typically, if you have k categories, you include k-1 dummies, leaving one category as the baseline.
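The dummy variable trap shows up as a rank-deficient design matrix. Here is a small sketch with a hypothetical three-category industry variable (the labels and the sample are invented for illustration).

```python
import numpy as np

# Hypothetical sample with k = 3 industry categories.
industry = np.array(["energy", "banking", "tech", "energy", "tech",
                     "banking", "energy", "tech", "banking", "energy"])
categories = ["banking", "energy", "tech"]

# One 0/1 column per category.
dummies = np.column_stack([(industry == c).astype(float) for c in categories])
intercept = np.ones((len(industry), 1))

# Trap: intercept plus all k dummies (the dummies sum to the intercept column).
X_trap = np.hstack([intercept, dummies])
# Fix: intercept plus k-1 dummies, with "banking" as the baseline.
X_ok = np.hstack([intercept, dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(f"trap design:  {X_trap.shape[1]} columns, rank {rank_trap}")
print(f"fixed design: {X_ok.shape[1]} columns, rank {rank_ok}")
```

The trap design has four columns but only rank three, so the coefficients are not uniquely determined. Dropping one dummy restores full column rank, and each remaining dummy coefficient is then read relative to the baseline category.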
Let’s say you suspect that the impact of one variable depends on another. For instance, maybe rising oil prices affect Canadian energy stocks more strongly than stocks in other industries. You can capture that dependence by adding an interaction term: the product of the two variables.
• Example Format: Suppose we have Price of Oil = X1 and a dummy for Canadian Energy Sector = D. Then we define an interaction: X1 × D.
• Benefits: Captures the idea that changes in oil price might matter more (or less) for a specific segment.
In practice, you might interpret the interaction coefficient as how the slope of X1 changes when D = 1. In other words, “the effect of X1 on Y is shifted by this many units if D = 1.”
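The interpretation above can be sketched with the regression Y = β₀ + β₁X1 + β₂D + β₃(X1 × D) + ε, which is one common way to write an interaction model. The data below are simulated, and the coefficient values and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

oil = rng.normal(0.0, 1.0, n)                 # X1: standardized oil price change
energy = rng.integers(0, 2, n).astype(float)  # D: 1 if energy-sector stock

# Hypothetical truth: slope is 0.5 for other sectors and 0.5 + 1.5 for energy.
ret = (0.2 + 0.5 * oil + 0.1 * energy + 1.5 * oil * energy
       + rng.normal(0.0, 1.0, n))

# Columns: intercept, X1, D, X1 * D.
X = np.column_stack([np.ones(n), oil, energy, oil * energy])
beta, *_ = np.linalg.lstsq(X, ret, rcond=None)

slope_other = beta[1]             # effect of oil when D = 0
slope_energy = beta[1] + beta[3]  # effect of oil when D = 1

print(f"oil slope, other sectors: {slope_other:.3f}")
print(f"oil slope, energy sector: {slope_energy:.3f}")
print(f"interaction coefficient:  {beta[3]:.3f}")
```

The coefficient on the product term is the difference between the two slopes, while the coefficient on D alone shifts the intercept for the energy group.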
Anyone who’s ever had that bizarre data point in a scatter plot—like a revenue figure that’s 1000 times higher than the next highest—knows how strongly outliers can skew a regression line. Influence analysis helps us detect if a data point is too influential, has high leverage, or is exerting outsized control over the regression.
• Cook’s Distance: A measure capturing how much your regression estimates (coefficients) would change if you drop one observation.
• High-Leverage Points: Observations with unusual X-values relative to the rest of your data. If the X-value is far from the mean of X, it can shift your entire line.
• DFBETA: Another measure that looks at the impact on individual coefficients when a particular observation is excluded.
Practical Tip: If you spot a data point with suspiciously high influence, investigate whether it’s a genuine data point or a data error. If it’s genuine, you might want to try robust regression methods or segment your dataset.
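The three diagnostics above can be computed directly from the regression output. The sketch below uses the standard textbook formulas for leverage (the diagonal of the hat matrix) and Cook's Distance, plus an unscaled DFBETA obtained by refitting without the flagged point. The data are simulated, with one deliberately planted bad observation.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50

x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)
x[-1], y[-1] = 10.0, 0.0  # planted point: extreme X and far off the line

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Leverage: h_ii = x_i' (X'X)^{-1} x_i, the diagonal of the hat matrix.
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Cook's Distance: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2.
s2 = resid @ resid / (n - p)
cooks_d = resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

flagged = int(np.argmax(cooks_d))

# Unscaled DFBETA for the flagged point: coefficient change when it is dropped.
keep = np.arange(n) != flagged
beta_without, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
dfbeta = beta - beta_without

print(f"most influential observation: index {flagged}")
print(f"slope with it: {beta[1]:.3f}, without it: {beta_without[1]:.3f}")
```

Here the planted observation has both the highest leverage and the largest Cook's Distance, and dropping it moves the slope back toward the value used to generate the rest of the data. In real work the flagged point still needs a judgment call about whether it is an error or a genuine extreme.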
So far, we’ve covered ways we typically go wrong—or at least ways we might be a little misguided. But there are also ways to extend our straightforward linear regression for more advanced analyses.
Polynomial regressions can help you capture the idea that the effect of X changes in a nonlinear way. Maybe the effect of inflation on bond yields grows at a decreasing rate, or the effect of average monthly hours worked saturates after a certain point.
• Example:
Y = β₀ + β₁X + β₂X² + ε
Sometimes you might even include higher orders like X³, but watch for overfitting. As you raise the degree, you could end up modeling the noise rather than the real signal.
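The overfitting warning can be checked by comparing in-sample and out-of-sample errors across polynomial degrees. This is a sketch on simulated data whose true relationship is quadratic; the sample sizes, noise level, and degrees compared (1, 2, 6) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)


def simulate(n):
    """Draw a hypothetical sample where the true relationship is quadratic."""
    x = rng.uniform(-1.0, 1.0, n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, n)
    return x, y


x_train, y_train = simulate(20)    # small estimation sample
x_test, y_test = simulate(1000)    # fresh data for out-of-sample checks

train_mse, test_mse = {}, {}
for degree in (1, 2, 6):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

In-sample error can only fall as the degree rises, so it is a poor guide to which degree to keep. The out-of-sample error is the more honest check: it drops sharply from degree 1 to degree 2, and with a small estimation sample it often stops improving or gets worse for higher degrees.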
When your data grows exponentially—like GDP, population, or sales volumes—log transformations can often linearize your relationship. For instance:
• If you hypothesize that a 1% change in X leads to a certain percent change in Y, it’s typical to run a log-log regression:
ln(Y) = β₀ + β₁ ln(X) + ε
• Or if you suspect that a one-unit (absolute) change in X leads to a roughly constant percentage change in Y, a log-linear approach might be the ticket:
ln(Y) = β₀ + β₁ X + ε
Quick personal note: I find log-linear models super neat for analyzing economic time-series data like GDP, population, or corporate revenue, because they turn coefficients into percentage changes, which can be more intuitive than raw slopes.
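Both log specifications can be tried on simulated data to see how the coefficients are read. The numbers below are assumptions for illustration: an elasticity of 1.5 in the log-log case, and a growth rate of about 5% per period in the log-linear case.

```python
import numpy as np

rng = np.random.default_rng(7)


def slope(y, x):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


# Log-log: Y = 2 * X^1.5 * noise, so ln(Y) is linear in ln(X).
x = rng.uniform(1.0, 10.0, 500)
y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.1, 500))
elasticity = slope(np.log(y), np.log(x))

# Log-linear: revenue grows at a constant rate per period t.
t = np.arange(40, dtype=float)
revenue = np.exp(0.5 + 0.05 * t + rng.normal(0.0, 0.05, 40))
growth_coef = slope(np.log(revenue), t)

print(f"log-log slope (elasticity): {elasticity:.3f}")
print(f"log-linear slope: {growth_coef:.4f} "
      f"(about {100 * (np.exp(growth_coef) - 1):.2f}% growth per period)")
```

In the log-log fit, the slope is read as the percentage change in Y for a 1% change in X. In the log-linear fit, 100 times the slope approximates the percentage change in Y per one-unit change in X, and exp(β₁) - 1 gives the exact figure when the slope is not small.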
Below is a simple Mermaid diagram illustrating a step-by-step process for deciding on your model specification and refinement. Notice how we keep iterating until we land on something that’s both statistically sound and practically relevant.
flowchart LR
    A["Decide on theoretical model <br/> Hypotheses"] --> B["Select Variables"]
    B --> C["Check functional form <br/> (Linear or Non-linear)"]
    C --> D["Estimate regression"]
    D --> E["Diagnostic checks <br/> (Residual analysis)"]
    E --> F["Refine or finalize <br/> or consider new approach"]
To tie things together, let’s do a hypothetical example involving house prices. Suppose you want to model the relationship:
• Dependent variable (Y): House Price
• Independent variables: Square footage (X1), Age of building (X2), and a dummy for whether the home is located within the city center (D).
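One plausible way to write this setup is the linear specification Price = β₀ + β₁X1 + β₂X2 + β₃D + ε, which is my assumption rather than a model given in the text. The sketch below estimates it on simulated data; every number (price levels, coefficient values, sample size) is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000

# Hypothetical predictors.
sqft = rng.uniform(800.0, 3000.0, n)               # X1: square footage
age = rng.uniform(0.0, 50.0, n)                    # X2: age of building in years
city_center = rng.integers(0, 2, n).astype(float)  # D: 1 if in the city center

# Made-up "true" relationship used to generate prices.
price = (50_000.0 + 150.0 * sqft - 1_000.0 * age
         + 50_000.0 * city_center + rng.normal(0.0, 20_000.0, n))

# Columns: intercept, X1, X2, D.
X = np.column_stack([np.ones(n), sqft, age, city_center])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

resid = price - X @ beta
r_squared = 1.0 - resid.var() / price.var()

for name, b in zip(["intercept", "sqft", "age", "city_center"], beta):
    print(f"{name:>12}: {b:,.1f}")
print(f"R-squared: {r_squared:.3f}")
```

The dummy's coefficient is read as the average price premium for a city-center home, holding square footage and age fixed. From here, the natural next steps are the diagnostics discussed earlier: residual plots, influence measures, and checks for nonlinearity or an interaction such as sqft × city_center.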
• Underfitting: Model is too simple, excluding important variables/features; likely leads to biased or incomplete results.
• Overfitting: Model is too complex or includes a heap of irrelevant variables; it captures noise rather than the true signals, leading to poor out-of-sample performance.
• Dummy Variable: A 0/1 indicator representing presence or absence of a condition, category, or regime.
• Cook’s Distance: A measure to detect how much an observation impacts the fitted regression parameters.
• High-Leverage Point: A data point with unusual predictor values that can sway the regression line significantly.
• Interaction Term: A variable formed by multiplying two predictors, capturing how an effect shifts or changes in combination with another variable.
• Understand the Why: Always link your variables to a strong theoretical or economic rationale. If you can’t explain why a variable should be in your model, think twice before including it.
• Check Diagnostics: After running your regression, look at residual plots, test for omitted variables, and evaluate outliers. Don’t skip this step.
• Be Wary of “P-Hacking”: Resist the temptation to mindlessly add or remove variables just to get pretty p-values. You’ll sacrifice interpretability and possibly introduce biases.
• Nonlinearities: Don’t assume everything is linear. Investigate transformations, polynomials, or piecewise definitions if your domain knowledge suggests a curve.
• Policy Shifts: Especially relevant with cross-border finance in Canada and the U.S., remember to incorporate dummy variables for regulatory or policy breaks.
• Influence Analysis: Run influence metrics like Cook’s Distance or leverage scores to ensure that one or two outliers aren’t running the show.
In exams—particularly on the CFA Level II—be prepared for questions that combine real-world context with theoretical knowledge. You might see a vignette describing an analyst’s regression setup, then you have to spot if they omitted a variable or used the wrong functional form. Sharpen your radar for these pitfalls.
• CFA Institute Level II Curriculum (Quantitative Methods)
• Greene, W. (2018). Econometric Analysis.
• Dielman, T. E. (2005). Applied Regression Analysis.
• Canadian Journal of Statistics. (https://www.ssc.ca/en/cjs)
• US Bureau of Economic Analysis (BEA) data. (https://www.bea.gov)