Explore key supervised machine learning techniques—penalized regression, SVM, k-NN, and ensemble methods—and see how they apply to real-world financial modeling.
Well, sometimes it feels like machine learning is everywhere—finance included. I remember the early days when a simple linear regression was fancy enough to impress folks. Now, words like “penalized regression,” “support vector machine,” and “random forest” are popping up in investor presentations. In this section, let’s explore the big supervised learning methods you should know for the CFA® Level II exam (and also for real-world applications). We’ll walk through penalized regression, SVM, k-NN, decision trees, and powerful ensemble models. We’ll also chat about how to interpret and deploy these models responsibly in finance—because let’s be honest, a misapplied model can be worse than no model at all.
Penalized regression methods introduce a penalty term to the model’s objective function, which helps control the complexity of the regression. In a typical ordinary least squares (OLS) setting, the cost function (let’s call it J) might look like:
J = ∑(yᵢ - xᵢβ)²
But penalized regression tacks on a penalty term that grows with larger coefficients. This encourages the model to keep coefficients small or even zero, reducing overfitting.
Ridge regression adds an L2 penalty. In other words, it adds the sum of squared coefficients, weighted by a penalty parameter λ:
J = ∑(yᵢ - xᵢβ)² + λ ∑βⱼ²
Ridge regression shrinks coefficients but generally does not force any to zero. It’s helpful when you have many correlated features. In equity factor modeling, for example, you may collect numerous potential factors (like value metrics, momentum scores, volatility measures). Using ridge can help reduce the magnitude of the factor coefficients so that no single factor dominates unpredictably.
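To make the shrinkage effect concrete, here is a minimal sketch with synthetic data standing in for five highly correlated factor signals (the data-generating process and the penalty value are assumptions chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical example: 200 observations of 5 highly correlated "factor" signals
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = base + 0.1 * rng.normal(size=(200, 5))   # five correlated columns
y = X @ np.array([0.5, 0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)          # alpha plays the role of the penalty parameter λ

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```

The ridge coefficients should come out smaller in magnitude than the OLS ones, but none will be exactly zero, which previews the contrast with Lasso below.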
Lasso uses an L1 penalty, which is the sum of absolute values of coefficients:
J = ∑(yᵢ - xᵢβ)² + λ ∑|βⱼ|
This penalty can drive some coefficients to exactly zero. Lasso is great for feature selection. In, say, a factor-based equity strategy with 50 possible factors, Lasso might push 40 of them to zero, leaving you with 10 that meaningfully explain returns.
Elastic Net mixes the L1 and L2 penalties, capturing benefits from both ridge and lasso. So you get:
J = ∑(yᵢ - xᵢβ)² + α λ ∑βⱼ² + (1-α) λ ∑|βⱼ|
Where α controls the weighting between ridge (α part) and lasso (1-α part). In finance, especially in high-dimensional data (like thousands of potential predictive signals), Elastic Net can be a balanced approach to shrink coefficients and drop less relevant features.
Imagine you’re building a style factor model to select undervalued stocks. You gather 20 fundamentals-based signals, 10 technical signals, and 5 macro factors. You suspect not all of them matter, and some could be correlated. Using Lasso or Elastic Net, you can systematically “prune” your signals while mitigating overfitting. The result is a more stable set of signals likely to generalize well.
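To see this pruning in action, here is a minimal sketch with simulated data in which only 5 of 35 candidate signals genuinely matter (all values and parameters are illustrative assumptions). One caveat on notation: scikit-learn's `l1_ratio` weights the L1 term, so it plays the role of (1 − α) in the Elastic Net formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Hypothetical example: 35 candidate signals, only a handful truly matter
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 35))
true_beta = np.zeros(35)
true_beta[:5] = [0.8, -0.6, 0.5, 0.4, -0.3]            # only 5 signals drive returns
y = X @ true_beta + rng.normal(scale=1.0, size=500)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio mixes the L1 and L2 penalties

print("Signals kept by Lasso:      ", np.sum(lasso.coef_ != 0))
print("Signals kept by Elastic Net:", np.sum(enet.coef_ != 0))
```

With a large enough penalty, Lasso zeroes out most of the noise signals; the Elastic Net typically retains a few more because part of its penalty is the ridge term.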
A support vector machine (SVM) is a classification (or regression) technique that aims to find the hyperplane (basically a boundary) that best separates the classes. The goal is to maximize the margin, which is the distance from the decision boundary to the nearest data points in each category. For finance, we might classify whether a loan applicant is likely to default or not, or whether the market is bullish or bearish based on certain macro/fundamental indicators.
The SVM can work in high-dimensional or even infinite-dimensional spaces. That’s thanks to the so-called kernel trick. Essentially, the kernel function allows the model to operate in a transformed feature space without explicitly computing the transformation. Polynomials, radial basis functions, or other kernels can help turn complicated, non-linear classification boundaries into linear hyperplanes in a higher dimension.
Picture a credit portfolio manager wanting to classify loans into “likely to default” vs. “safe.” Data includes credit scores, debt-to-income ratios, employment length, etc. Because these features might not be linearly separable (some low credit score folks might still pay, while some high score folks might default), the manager sets up an SVM with a radial basis function kernel. The model finds a hyperplane in a transformed space, giving a boundary that best segments default risks.
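Here is a minimal sketch of that credit classification setup with synthetic data (the features, the toy default rule, and the parameter choices are assumptions made purely for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical loan data: credit score, debt-to-income ratio, years employed
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.normal(680, 50, 1000),    # credit score
    rng.uniform(0.1, 0.6, 1000),  # debt-to-income ratio
    rng.uniform(0, 20, 1000),     # employment length in years
])
# Simplified "default" rule, purely for illustration
y = ((X[:, 0] < 650) & (X[:, 1] > 0.4)).astype(int)

# Scaling matters for SVM because the RBF kernel relies on distances between points
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, y)

print("Predicted class for a new applicant:", model.predict([[700, 0.35, 5]]))
```

Wrapping the scaler and the classifier in one pipeline ensures that any new applicant is standardized with the training-set statistics before classification.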
k-NN is about as intuitive as it gets. If you want to classify a new observation, you look at the k closest existing observations and “ask them how they’re classified.” For regression, you average their values. For classification, you pick the majority label.
• Distance Metric: Typically Euclidean distance, but you’ll see others in practice.
• Scaling: Because distance depends heavily on magnitude, you should normalize or standardize your data.
• Dimensionality: The curse of dimensionality can cause distance measures to become less meaningful as features expand.
Let’s say you want to group stocks by fundamentals—like price-to-earnings ratio (P/E) and return on equity (ROE). If these are your two features, you can scatter-plot them, pick a new stock point X, then find the k (say, 3 or 5) stocks closest to X on that plot. If most “neighbors” are in the “value” portfolio, you might similarly classify X as value-oriented.
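Here is a minimal sketch of that neighbor lookup (the P/E and ROE values, labels, and choice of k = 3 are made-up assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stocks already labeled "value" (1) or "growth" (0), described by P/E and ROE
X = np.array([[8, 0.12], [10, 0.10], [12, 0.15], [30, 0.25], [35, 0.30], [40, 0.28]])
y = np.array([1, 1, 1, 0, 0, 0])

# Standardize first: P/E is in the tens while ROE is a decimal,
# so raw Euclidean distances would be dominated by P/E
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)

new_stock = [[11, 0.13]]   # P/E of 11, ROE of 13%
print("Classified as value?", bool(knn.predict(new_stock)[0]))
```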
A decision tree partitions the feature space using simple if-then splits. For instance, a first split might separate companies with market cap > $5 billion from those below. Another split might separate high P/E from low P/E, and so on. Ultimately you get leaves containing groups of observations with certain characteristics.
Trees are super interpretable: you can literally trace a path. However, they can easily overfit. For instance, a fully grown tree that tries to perfectly classify training data usually ends up capturing a lot of noise. We often prune trees (by controlling depth or minimum sample splits) to strike a balance.
You might build a decision tree to segment borrowers, starting with "annual income > $70k?" If yes, the next question might be "FICO score > 700?"; if no, a different branch is followed, and so on. Each path leads to a different risk bucket, which helps a risk manager quickly see how the classification is derived.
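Here is a short sketch showing how such a tree can be fit with a depth limit and then printed back as human-readable if-then rules (the income/FICO rule and thresholds are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical borrower data: annual income (in $000s) and FICO score
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(30, 150, 500), rng.uniform(550, 820, 500)])
y = ((X[:, 0] < 70) & (X[:, 1] < 700)).astype(int)   # toy "high risk" label

# max_depth acts as a pruning control to avoid chasing noise in the training data
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# The fitted tree can be printed as a set of if-then rules
print(export_text(tree, feature_names=["income_k", "fico"]))
```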
Ensemble methods bring multiple “weak learners” (often shallow or pruned trees) into a single stronger model. In finance, ensemble methods have become a go-to for tasks like forecasting market volatility, predicting default probabilities, or even generating alpha signals. The idea: many lightly correlated or uncorrelated models combined can yield a better overall prediction.
Random Forest builds multiple decision trees in parallel, each one trained on a bootstrapped sample of the data. It also randomly selects a subset of candidate features at each split, which further decorrelates the trees. The predictions are then aggregated (for classification, via majority vote; for regression, via averaging). It’s relatively straightforward, generally robust to overfitting, and rarely the worst choice out of the box.
Below is a simple Mermaid diagram illustrating how a random forest might work:
```mermaid
graph LR
    A["Training <br/>Data"] --> B["Bootstrap <br/>Sample 1"]
    A --> C["Bootstrap <br/>Sample 2"]
    A --> D["Bootstrap <br/>Sample 3"]
    B --> E["Tree 1"]
    C --> F["Tree 2"]
    D --> G["Tree 3"]
    E --> H["Ensemble <br/>Voting"]
    F --> H
    G --> H
    H --> I["Final <br/>Prediction"]
```
Gradient boosting is another popular ensemble technique in which you iteratively add weak learners. Each new tree tries to predict the errors (residuals) that the previous ensemble made. Over many iterations, the model becomes better at predicting the hardest-to-fit observations. XGBoost is a widely used open-source implementation. In finance, gradient boosting can be a powerful way to capture non-linear relationships in stock returns or factor exposures.
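To make the residual-fitting idea concrete, below is a toy, from-scratch sketch that repeatedly fits shallow regression trees to the current residuals; in practice you would use a library implementation such as scikit-learn's GradientBoostingRegressor/Classifier or XGBoost rather than a hand-rolled loop:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy illustration of boosting: each shallow tree fits the previous residuals
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)   # non-linear target

learning_rate = 0.1
prediction = np.zeros(400)
for _ in range(100):
    residuals = y - prediction                       # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # add a small correction each round

print("In-sample RMSE after boosting:", round(np.sqrt(np.mean((y - prediction) ** 2)), 3))
```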
Ensemble methods often deliver strong performance but at the expense of interpretability. In a regulated financial context—for instance, explaining to regulators or clients why a credit decision was made—you may need to rely on additional interpretability techniques such as feature importance metrics (e.g., how many times a feature is used in splits).
• Overfitting: Models that memorize training data deliver poor out-of-sample performance. Techniques like cross-validation help mitigate this.
• Data Leakage: If your dataset includes future information or data not available at the time of prediction, your model’s performance metrics become inflated. In finance, always carefully consider time-stamping.
• Scaling: k-NN, SVM, and penalized regression often require data to be normalized or standardized. Overlooking such a detail can degrade performance.
• Model Tuning: Each method has hyperparameters (like λ in penalized regression, the cost parameter C in SVM, k in k-NN, max depth for trees, etc.). Tuning these, typically via cross-validation, is crucial; a grid-search sketch follows this list.
• Interpretability Needs: Regulatory contexts often require a simpler or more interpretable approach. If you’re using a black-box method (like a complex ensemble), you might need tools like Shapley values to show the contribution of each variable.
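As a quick illustration of the tuning and overfitting points above, here is a minimal sketch that cross-validates a small hyperparameter grid for a random forest (the synthetic data and grid values are placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data standing in for, say, default/no-default labels
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {"max_depth": [3, 5, 10], "n_estimators": [50, 100]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation guards against overfitting to one split
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
```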
Putting the pieces together, here’s a short end-to-end snippet showing how you might fit a random forest on a sample dataset:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# df is assumed to be a pandas DataFrame of features plus a binary 'target' column
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of observations for out-of-sample evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 100 trees, each limited to depth 5 to keep the individual learners "weak"
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)

# Fraction of test observations classified correctly
accuracy = (preds == y_test).mean()
print("Accuracy:", accuracy)
```
In a financial context, you might use this for a classification problem such as “will the bond default or not?” or “will the portfolio underperform its benchmark next quarter?”
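Because interpretability came up earlier, it is worth noting that a fitted random forest exposes split-based importance scores. A minimal follow-on sketch, assuming the `model` and feature DataFrame `X` from the snippet above are still in scope:

```python
import pandas as pd

# Assumes `model` (fitted RandomForestClassifier) and `X` (feature DataFrame) from above
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```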
• Know the difference between L1 and L2 penalties: L1 (Lasso) can shrink some coefficients completely to zero, which is great for variable selection. L2 (Ridge) shrinks them all but won’t make any exactly zero.
• SVM: Pay special attention to the kernel trick and max-margin concept. In exam questions, look carefully for hints about non-linear boundary needs.
• k-NN: The biggest issues revolve around choosing k and normalizing. Watch out for dimension pitfalls.
• Decision Trees & Ensembles: Expect item sets that show how a random forest or gradient boosting might be used. Also, be ready for questions about the trade-off between interpretability vs. predictive power.
• Practical Vignette Angle: The CFA exam might frame a scenario about factor modeling or classification of default risk. They’ll test your ability to identify which algorithm is most appropriate and why.
• Breiman, L. (2001). “Random Forests.” Machine Learning.
• Friedman, J. (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics.
• scikit-learn documentation on ensemble methods: https://scikit-learn.org/stable/modules/ensemble.html
• CFA Institute Code of Ethics and Standards of Professional Conduct (as applicable to model usage and communications).
Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.