Explore the foundations of logistic regression for binary outcomes in finance, including log-odds, odds ratios, goodness-of-fit measures, and real-world applications such as default prediction and classification of profitable trades.
So, I remember the first time I faced a binary classification problem: I was trying to figure out which small businesses might default on their loans, and I initially used a simple linear regression. Big mistake. The model started spitting out “probabilities” greater than 1 or less than 0, and I knew something fishy was going on. That’s pretty much the day I realized logistic regression is the go-to method for modeling the probability of binary events in finance—like default vs. no default, bankruptcy vs. non-bankruptcy, or even profitable trade vs. unprofitable trade.
Logistic regression constrains predicted probabilities to be between 0 and 1, making it a particularly good fit for these yes/no scenarios where the outcome can only take on two distinct values. In this section, we’ll explore how logistic regression works, how we interpret its coefficients, what tests we can run to assess its performance, and how to apply it in real-life financial contexts.
You might be wondering: “Why not just run a linear regression?” Well, if you were predicting a numeric outcome, linear regression is fantastic. But binary outcomes (like default = 1, no default = 0) have some quirks:
• Linear regression could predict negative probabilities or probabilities above 1, which just doesn’t make sense (like a –0.2 chance of default, or a 1.3 chance of default).
• The relationship between predictors and a binary outcome is often nonlinear. Logistic regression captures that curvature better.
• The error terms from a linear regression on a binary outcome are heteroskedastic by construction, violating the homoskedasticity assumption and undermining the reliability of standard errors and test statistics.
Logistic regression handles all of these issues by modeling the (transformed) probability of the event using the so-called log-odds (or logit) function.
The heart of logistic regression is the idea that we can transform the probability p of an event occurring (like default) into something that can then be modeled by a linear function. Specifically, logistic regression uses the logit transformation:

\[ \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]

where:
• \(p\) = Probability of a specific event (e.g., probability that a loan defaults).
• \(1 - p\) = Probability that the event does not occur (e.g., loan does not default).
• \(\frac{p}{1 - p}\) = The odds (chance of event happening vs. not happening).
• \(\beta_0, \beta_1, \ldots, \beta_k\) = Coefficients to be estimated from the data.
• \(x_1, \ldots, x_k\) = Predictor variables (e.g., debt ratio, income level, credit utilization).
This linear function in the log-odds space ensures that the predicted probability, once transformed back, is always between 0 and 1.
In standard logistic regression, the relationship between p and the linear predictors is:

\[ p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}} \]

That’s a mouthful, but it means we’ll never end up with probabilities outside of [0, 1].
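To see this guarantee in action, here’s a minimal Python sketch (the coefficients and debt-ratio values are made up purely for illustration) that pushes a linear predictor through the logistic function:

import numpy as np

def logistic(z):
    # Inverse of the logit: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: an intercept and a slope on a debt ratio
beta0, beta1 = -3.0, 4.0
debt_ratio = np.array([0.1, 0.5, 0.9, 2.0])

log_odds = beta0 + beta1 * debt_ratio  # can be any real number
p_default = logistic(log_odds)         # always strictly between 0 and 1

print(p_default)  # approximately [0.069, 0.269, 0.646, 0.993]

No matter how extreme the linear predictor gets, the output never escapes (0, 1).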
If you’re used to interpreting coefficients in a linear regression, you might feel a little weird the first time you see logistic regression coefficients. Here’s the deal:
• Each \(\beta_i\) is the change in the log-odds per one-unit increase in \(x_i\). Not as intuitive!
• The odds ratio is given by \(\exp(\beta_i)\). This number tells you how the odds change when \(x_i\) increases by one unit, holding all other variables constant. For example:
– If \(\exp(\beta_i) = 1.2\), the odds of a default increase by 20% for each unit increase in \(x_i\).
– If \(\exp(\beta_i) = 0.6\), the odds of a default decrease by 40% for each unit increase in \(x_i\).
Sometimes, it’s easier to talk about odds ratios than raw coefficients because it relates more naturally to a “percent change in odds.”
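As a quick illustration, here’s how you’d turn raw logistic coefficients into odds ratios in Python (the coefficient values below are hypothetical):

import numpy as np

# Hypothetical estimated coefficients from a default model
coefs = {"DebtRatio": 0.182, "Income": -0.511}

for name, beta in coefs.items():
    odds_ratio = np.exp(beta)
    pct_change = (odds_ratio - 1) * 100
    print(f"{name}: odds ratio = {odds_ratio:.2f} "
          f"({pct_change:+.0f}% change in odds per unit increase)")

# DebtRatio: odds ratio = 1.20 (+20% change in odds per unit increase)
# Income: odds ratio = 0.60 (-40% change in odds per unit increase)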
Unlike linear regression, the typical R² measure doesn’t directly apply. Logistic regression instead offers:
• Likelihood Ratio (LR) Test: Compares the fit of the full model (with all predictors) to a reduced model (with fewer predictors). A significant LR test suggests your additional predictors help explain the outcome better than the reduced set.
• Deviance: A measure of how far the fitted model is from a saturated model (one that fits the data perfectly). Smaller deviance indicates a better fit.
• Pseudo R² Measures: McFadden’s R², Cox & Snell R², and others approximate the concept of R² for logistic regression but don’t have the exact interpretation of the linear regression R². They’re still useful for comparing models or checking if adding variables helps.
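scikit-learn won’t hand you these fit statistics directly, so here’s a minimal sketch using statsmodels instead; the simulated data and seed are invented purely to keep the example self-contained:

import numpy as np
import statsmodels.api as sm

# Simulate a small binary-outcome dataset (the second predictor is pure noise)
rng = np.random.default_rng(42)
x = rng.normal(size=(500, 2))
true_logit = -1.0 + 1.5 * x[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# Fit the logistic model; statsmodels reports likelihood-based diagnostics
result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

print("Log-likelihood (fitted):", result.llf)
print("Log-likelihood (intercept only):", result.llnull)
# The LR test here compares the fitted model against an intercept-only model
print("LR statistic:", result.llr, "p-value:", result.llr_pvalue)
print("McFadden's pseudo R-squared:", result.prsquared)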
You’ll also see classification-oriented measures popular in practice:
• Classification Table (Confusion Matrix): This table lays out correct predictions vs. false positives and false negatives.
• ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve plots the trade-off between the true positive rate and the false positive rate at various probability thresholds. The area under this curve (AUC) is a handy summary of model discriminative power.
• Hosmer–Lemeshow Test: Tests whether observed event rates in subgroups of the data match predicted probabilities. If the test result is not significant, that typically indicates a good model fit.
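Here’s a minimal sketch of the first two measures using scikit-learn’s metrics module; the observed outcomes and predicted probabilities below are toy values standing in for real model output:

from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Toy observed 0/1 outcomes and predicted event probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
probs = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70]

# Classification table at the conventional 0.5 cutoff
y_pred = [1 if p >= 0.5 else 0 for p in probs]
print(confusion_matrix(y_true, y_pred))

# AUC summarizes discrimination across all possible cutoffs
print("AUC:", roc_auc_score(y_true, probs))

# fpr/tpr pairs trace out the ROC curve itself
fpr, tpr, thresholds = roc_curve(y_true, probs)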
I probably don’t have to sell you on how crucial logistic regression can be in finance. Let’s talk about a few biggies:
• Bankruptcy or Default Prediction: When analyzing corporate bonds or consumer credit risk, logistic regression gives you the probability that the issuer will default. Traditional Altman Z-score models tackle the same problem via discriminant analysis, though logistic regression is generally more flexible.
• Classification of Profitable Trades: Hedge funds or proprietary trading desks might label trades as “profitable” (1) or “not profitable” (0) and try to figure out which factors (e.g., market conditions, signals, risk exposures) drive that classification.
• Customer Churn in Banking: Similarly, for retail banking, you might treat “churn vs. non-churn” as your target outcome, with logistic regression identifying which elements of a client’s profile predict departure.
In each case, the logistic model gives you a probability—like the chance that a borrower defaults. Then you decide how to act on that probability.
Here’s a small but critical detail. Usually, we treat “predicted probability ≥ 0.5” as a “1” classification and anything below 0.5 as a “0.” But that threshold can be shifted depending on the costs of making an error:
• Type I Error (False Positive): Predicting a default (1) when the borrower actually does not default (0).
• Type II Error (False Negative): Failing to predict a default (0) when the borrower actually defaults (1).
If the cost of a false negative is super high (like a big financial loss from an unexpected default), we might pick a threshold lower than 0.5 to reduce Type II errors, even if that means more false positives. In practice, institutions will weigh these error costs and set a “cutoff” that best suits their risk profile.
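Here’s a small sketch of threshold shifting (the probabilities and outcomes are invented). Notice how dropping the cutoff from 0.5 to 0.3 eliminates the false negative at the price of two extra false positives:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predicted default probabilities and actual outcomes
probs = np.array([0.15, 0.35, 0.45, 0.55, 0.70, 0.30, 0.60, 0.20])
y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])

for cutoff in (0.5, 0.3):
    y_pred = (probs >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")

# cutoff=0.5: false positives=0, false negatives=1
# cutoff=0.3: false positives=2, false negatives=0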
Standard residual plots in logistic regression aren’t as straightforward as in linear models. A handful of approaches can help you evaluate your logistic regression’s performance:
• Classification Table: Tally up correct and incorrect predictions.
• ROC Curve: Graphically see how well your model distinguishes between events and nonevents.
• Hosmer–Lemeshow Test: Group observations based on predicted probabilities and then see how actual outcomes compare. Large discrepancies suggest poor fit.
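scikit-learn doesn’t ship a Hosmer–Lemeshow test, but a simplified version is easy to sketch by hand with pandas and scipy (treat this as an illustration, not a validated statistical routine):

import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, probs, groups=10):
    # Bin observations by predicted probability, then compare
    # observed event counts with the model's expected counts
    df = pd.DataFrame({"y": y_true, "p": probs})
    df["bin"] = pd.qcut(df["p"], q=groups, duplicates="drop")
    grouped = df.groupby("bin", observed=True)
    obs = grouped["y"].sum()    # observed events per bin
    exp = grouped["p"].sum()    # expected events per bin
    n = grouped["y"].count()    # observations per bin
    stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = len(obs) - 2
    return stat, chi2.sf(stat, dof)

# Usage: stat, pval = hosmer_lemeshow(y_observed, predicted_probs)

A small p-value flags a material gap between predicted and observed event rates; a large one is (cautiously) read as adequate fit.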
Let’s pull it all together in a quick end-to-end example. Suppose we have a dataset with a binary dependent variable “Default.” We can fit a logistic regression using scikit-learn:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the data and separate the predictors from the binary outcome
df = pd.read_csv("credit_dataset.csv")
X = df[['DebtRatio', 'CreditUtilization', 'Income']]
y = df['Default']

# Fit the logistic regression
model = LogisticRegression()
model.fit(X, y)

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Each row of predict_proba is [P(no default), P(default)]
print("Predicted probabilities:", model.predict_proba(X.iloc[:5]))
This kind of code helps you see how changes in DebtRatio or Income might increase or decrease the probability of default. Remember, the logistic function ensures those probabilities always lie between 0 and 1.
Below is a simple flowchart to illustrate a basic workflow for applying logistic regression in finance:
flowchart TB
    A["Data Collection <br/> (Financial Ratios)"]
    B["Data Cleaning <br/> (Missing Values, Outliers)"]
    C["Logistic Model <br/> Training"]
    D["Predicted Probability <br/> (P(Default))"]
    E["Classification <br/> (Above threshold = 1, else 0)"]
    A --> B
    B --> C
    C --> D
    D --> E
If all goes well, the final step is deciding whether the predicted probability is high enough to flag a case as risky.
Alright, let’s recap. Logistic regression is the real MVP for binary classification in finance, thanks to how it elegantly keeps probabilities in [0, 1] and offers interpretable odds ratio insights. It’s essential to remember:
• Always verify that logistic regression is the best approach for your binary outcome.
• Interpret coefficients as changes in log-odds or, more simply, look at the exponentiated coefficients to get odds ratios.
• Use an appropriate classification threshold to balance the risk of false positives vs. false negatives.
• Keep an eye on model diagnostics such as the LR test, pseudo R²s, and classification-based tools (ROC/AUC, confusion matrix).
• Maintain ethical and professional standards as per the CFA Institute Code and Standards, especially when your model impacts real people’s financial well-being (e.g., credit approvals).
As you delve deeper, you’ll find logistic regression appearing everywhere from credit scoring to compliance checks to algorithmic trading. Always keep learning, revisiting assumptions, and refining your approach. Maybe next time you stumble on probabilities above 1, you’ll smirk and say, “Nope, time for logistic regression.”
• CFA Institute Level II Program Curriculum (Quantitative Methods – Logistic Regression)
• Wooldridge, J. M. (2019). Introductory Econometrics (Chapter on limited dependent variable models)
• Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression