Learn how to conduct effective Exploratory Data Analysis (EDA) and engineer powerful features—both numeric and textual—to enhance modeling in financial applications.
Data Exploration and Feature Engineering represent crucial steps in any quantitative finance workflow, helping you squeeze actionable insights out of raw data. And trust me, it can be quite exciting—there’s that moment when you see an interesting pattern in your data, and you think, “Hey, this might actually matter for my investment model.” On the flip side, ignoring these steps or rushing through them can doom even the most sophisticated machine learning or regression approach.
Throughout this section, we’ll discuss best practices for Exploratory Data Analysis (EDA) and feature engineering, including textual data extraction. We’ll get into how to handle messy data—like, say, those big financial text disclosures where half the words are legal boilerplate—and transform it into something your model can digest. We’ll also talk about how to spot interesting signals that might be hidden in your data. Let’s dive right in!
One of the most important tasks is to calculate basic descriptive metrics—mean, median, standard deviation, minimum, maximum—for each variable in your dataset. If you’re analyzing daily stock returns, you’d inspect the average return, volatility (standard deviation), and maybe skewness and kurtosis to see if the distribution is nearly normal or severely fat-tailed. This is also the phase where you might encounter something surprising. A while back, at a small quant shop I worked for, we discovered that 90% of the data for a particular equity factor was missing for certain months—yikes. Checking summary stats saved us from building a model with massive blind spots.
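As a quick illustration, here is a minimal sketch of how those summary statistics might be computed with pandas; the `returns` DataFrame and its ticker columns are made-up placeholders, not data from the anecdote above.

```python
import pandas as pd

# Hypothetical daily returns for three stocks (in practice, load from your data source)
returns = pd.DataFrame({
    "AAA": [0.012, -0.004, 0.003, 0.021, -0.015],
    "BBB": [0.001,  0.002, -0.030, 0.004,  0.006],
    "CCC": [None,   0.010,  0.008, -0.002, 0.005],   # note the missing value
})

# Basic descriptive metrics: mean, std, min, max, quartiles
print(returns.describe())

# Higher moments: skewness and excess kurtosis help flag fat tails
print(returns.skew())
print(returns.kurtosis())

# Always check how much data is missing before modeling
print(returns.isna().mean())   # fraction of missing observations per column
```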
EDA doesn’t stop at numeric summaries; visualization matters just as much (a short plotting sketch follows this list):
• Histograms show the distribution of a single variable. For returns, a histogram can reveal if there’s a large spike around zero or if you see extreme tails.
• Scatter plots let you see the relationship between two variables easily. Plot, for instance, return on the x-axis versus volume on the y-axis. Do you notice any trend?
• Boxplots help you spot outliers quickly, especially if you suspect your data might have anomalies (like returns that are suspiciously high). Boxplots for different sectors side-by-side are also quite insightful.
• Correlation matrices can help identify pairs of variables that are positively/negatively correlated. This is crucial if you’re going to feed your data into a linear model, because strong correlations among predictors might lead to multicollinearity headaches.
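As a rough sketch, the snippet below shows one way to produce these views with pandas and matplotlib; the DataFrame `df` and its columns (`return`, `volume`, `sector`) are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily data: returns, traded volume, and a sector label
df = pd.DataFrame({
    "return": [0.01, -0.02, 0.005, 0.03, -0.01, 0.002, 0.015, -0.004],
    "volume": [1.2e6, 2.5e6, 0.9e6, 3.1e6, 1.8e6, 1.1e6, 2.0e6, 1.4e6],
    "sector": ["Tech", "Tech", "Energy", "Energy", "Tech", "Energy", "Tech", "Energy"],
})

# Histogram: distribution of a single variable
df["return"].plot.hist(bins=20, title="Distribution of daily returns")
plt.show()

# Scatter plot: relationship between two variables
df.plot.scatter(x="return", y="volume", title="Return vs. volume")
plt.show()

# Boxplots by sector: quick visual check for outliers
df.boxplot(column="return", by="sector")
plt.show()

# Correlation matrix of the numeric columns
print(df[["return", "volume"]].corr())
```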
Below is a simple diagram illustrating the iterative EDA and modeling cycle. Notice that outcomes from one step feed back into previous steps:
```mermaid
flowchart LR
    A["Collect Data"]
    B["Perform EDA"]
    C["Feature Engineering"]
    D["Model Training & Validation"]
    E["Refine / Iterate"]
    A --> B
    B --> C
    C --> D
    D --> E
    E --> B
```
Feature engineering is about turning raw data into meaningful inputs for your model. It’s part art, part science. A well-engineered input can capture fundamental relationships (like momentum, volatility, or sentiment) better than a raw number sitting in your dataset.
Try creating combined signals that reflect real-life relationships. For instance, if you have daily returns, you might create a 20-day moving average to capture momentum. Or, if you have financial statement data, consider ratios that scale one metric by another, such as “Operating Cash Flow / Total Debt,” which gauges how comfortably a firm can service its debt. For equity analysts, combining price and volume data could yield features like “Price × Volume,” which approximates daily dollar volume and often shows up in liquidity risk assessments.
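For instance, a minimal pandas sketch along these lines (with made-up prices, volumes, and fundamentals) might look like this:

```python
import pandas as pd

# Hypothetical daily price/volume data
df = pd.DataFrame({
    "close":  [100.0, 101.5, 99.8, 102.3, 103.0, 101.9, 104.2, 105.0],
    "volume": [1.0e6, 1.3e6, 0.8e6, 1.6e6, 1.2e6, 0.9e6, 1.5e6, 1.1e6],
})

# Daily returns and a rolling average of returns as a simple momentum proxy
# (window shortened here only so the toy data produces output; use window=20 in practice)
df["ret"] = df["close"].pct_change()
df["mom"] = df["ret"].rolling(window=3).mean()

# Approximate daily dollar volume, often used in liquidity screens
df["dollar_volume"] = df["close"] * df["volume"]

# A fundamental ratio feature (placeholder values)
operating_cash_flow = 250.0
total_debt = 900.0
ocf_to_debt = operating_cash_flow / total_debt

print(df.tail())
print("OCF / Total Debt:", round(ocf_to_debt, 3))
```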
Sometimes a relationship between two variables is nonlinear. In that case, you can generate extra terms (e.g., squared or cubed versions) or products of two features (interactions). For instance, if you suspect that volume’s effect on returns is different at higher prices compared to lower prices, you might include the interaction term “Price × Volume.” Done carefully, these expansions can capture complexities that linear models might miss. But you’ve got to be mindful of overfitting, especially if you add too many features. A prudent approach could involve cross-validation to see which polynomial or interaction terms actually boost predictive power in out-of-sample tests.
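Here is one way you might generate squared and interaction terms with scikit-learn's PolynomialFeatures (requires scikit-learn 1.0 or later for the feature-name helper); the `price` and `volume` columns are placeholders:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical features
X = pd.DataFrame({
    "price":  [100.0, 101.5, 99.8, 102.3, 103.0],
    "volume": [1.0e6, 1.3e6, 0.8e6, 1.6e6, 1.2e6],
})

# degree=2 adds squared terms plus the price*volume interaction
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out(X.columns),
)

print(X_expanded.columns.tolist())
# ['price', 'volume', 'price^2', 'price volume', 'volume^2']
```

Whether any of these expanded terms survive should then be decided by out-of-sample testing, as noted above.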
In finance, experience and domain knowledge are crucial. If you’re analyzing an options strategy, you might create implied volatility features. For credit risk modeling, you might incorporate a company’s interest coverage ratio (EBIT / Interest Expense) or a measure of short-term liquidity (Current Ratio). If you’re dealing with currency markets, you might transform raw exchange rates to daily percentage changes or relative valuations (purchasing power parity deviation). That’s the real art: picking transformations that truly reflect an economic or financial rationale.
Textual data—like news articles, corporate filings, or social media chatter—can be a goldmine for signals, but it requires unique processing steps.
Text is messy. Often, you’ll strip punctuation, convert everything to lowercase, and remove non-informative “stop words” (the, is, at, and so on). In finance, be mindful that generic stop word lists might remove terms that are meaningful in your domain, such as “asset,” “return,” or “risk.” You might even create a custom dictionary of abbreviations, especially if you’re dealing with 10-K or 10-Q reports where terms like “md&a” (management discussion and analysis) reappear in sometimes abbreviated forms.
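A possible sketch of that kind of customization, using NLTK's English stop word list plus a hypothetical keep-list, drop-list, and abbreviation map, might look like this:

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # first run only

# Start from a generic English stop word list...
stop_words = set(stopwords.words('english'))

# ...but protect domain terms an aggressive generic list might discard,
# and drop extra boilerplate tokens common in filings (both lists hypothetical)
keep_in_finance = {"risk", "return", "asset"}
extra_boilerplate = {"herein", "thereof", "pursuant"}
custom_stop_words = (stop_words - keep_in_finance) | extra_boilerplate

# Map recurring 10-K / 10-Q abbreviations to a canonical form (hypothetical)
abbreviations = {"md&a": "management_discussion_and_analysis"}

# Naive whitespace split stands in for proper tokenization (covered next)
text = "pursuant to the md&a risk exposure increased".split()
cleaned = [abbreviations.get(t, t) for t in text if t not in custom_stop_words]
print(cleaned)
```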
Tokenization is the process of splitting text into words (tokens). For example, the sentence “Earnings soared 10% at XYZ Inc.” might be split into tokens like [“earnings”, “soared”, “10”, “xyz”, “inc”]. That’s your raw material for the next steps.
Once text is tokenized, you need a numerical representation—your models can’t handle words directly. Two common methods (sketched in code after this list):
• Bag-of-Words: Build a vocabulary of all tokens across documents, then represent each document by the counts of these tokens. For instance, if “growth” appears three times, you record a “3” for that word. This approach is simple but might overemphasize very frequent words.
• TF-IDF (Term Frequency – Inverse Document Frequency): A weighting scheme that downplays extremely common words (like “the”) and up-weights more distinctive terms. If “auto” appears frequently in only one company’s filing, that might be a more telling term than something that appears in every single filing.
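Here is a minimal sketch of both representations using scikit-learn's CountVectorizer and TfidfVectorizer on three made-up snippets of filing text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three toy "filings" (hypothetical text)
docs = [
    "growth in auto sales drove revenue growth",
    "revenue declined amid weak demand",
    "strong revenue growth and strong margins",
]

# Bag-of-words: raw token counts per document
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(bow_matrix.toarray())

# TF-IDF: down-weights words that appear in every document (e.g., "revenue")
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf_matrix.toarray().round(2))
```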
If you’re looking to gauge market sentiment from textual data, you’ll often rely on specialized sentiment dictionaries (lexicons) or custom machine learning models.
• Loughran–McDonald Dictionary is specialized for finance, distinguishing words that are positive, negative, uncertain, or litigious. So, if you see “concern,” that might belong in the negative sentiment bucket.
• Advanced Approaches: You can train big language models or deep learning classifiers on historical textual data labeled as “positive,” “negative,” or “neutral.” This is more complex, but sometimes more accurate than dictionary-based approaches.
Below is a tiny sample Python snippet showing how you might create a simple sentiment feature:
```python
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Download the tokenizer models and stop word list (first run only)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

call_text = "XYZ Inc. delivered strong performance, but management is concerned about future interest rates."

# Lowercase and split the text into tokens
tokens = nltk.word_tokenize(call_text.lower())

# Keep alphabetic tokens that are not generic stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Tiny illustrative word lists; typically a much bigger dictionary
lm_negative = {"concerned", "risk", "uncertain"}
lm_positive = {"strong", "success", "gain"}

word_counts = Counter(filtered_tokens)

# Count positive and negative dictionary hits
neg_count = sum(word_counts[w] for w in lm_negative)
pos_count = sum(word_counts[w] for w in lm_positive)

# Net sentiment: positive hits minus negative hits
sentiment_score = pos_count - neg_count
print("Sentiment Score:", sentiment_score)
```
In practice, you would incorporate more words into your dictionaries and handle additional complexities (e.g., negations like “not good”).
Now that you’ve engineered a bunch of potential predictors—numeric features, polynomial terms, textual sentiment scores—you might end up with dozens or hundreds of variables. Many will be redundant or have little predictive power.
Filter methods rank features by their correlation with the target or by simple statistical tests (e.g., ANOVA F-tests). If a feature has near-zero correlation with your target asset return, it might be excluded as unhelpful. These methods are fast, but they ignore interactions between features.
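A minimal sketch of a filter-style screen, using scikit-learn's univariate F-test on simulated data, could look like this:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Hypothetical predictors: only feature 0 is (noisily) related to the target
n = 500
X = rng.normal(size=(n, 5))
y = 0.8 * X[:, 0] + rng.normal(scale=1.0, size=n)

# Keep the k features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

print("F-scores:", selector.scores_.round(1))
print("Selected feature indices:", selector.get_support(indices=True))
```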
With wrapper methods, you repeatedly train and evaluate a model while adding or removing features. Techniques like Recursive Feature Elimination (RFE) or forward selection systematically explore which subset of features gives the best performance. The downside? They can be computationally expensive, especially if your dataset is large.
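Here's a small sketch of RFE wrapped around an ordinary linear regression on simulated data; the feature counts are arbitrary:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: only the first two features carry signal
n = 500
X = rng.normal(size=(n, 6))
y = 0.7 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=1.0, size=n)

# RFE repeatedly fits the model and drops the weakest feature until 2 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)

print("Kept features:", rfe.get_support(indices=True))
print("Ranking (1 = kept):", rfe.ranking_)
```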
Embedded methods fold feature selection into the training process itself: regularized models like LASSO, Ridge, or Elastic Net penalize coefficient size as they fit. LASSO is especially known for setting the coefficients of less important variables to zero, effectively discarding them. In finance, LASSO is handy when you suspect many variables provide minimal unique information.
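As a sketch, cross-validated LASSO on simulated data might look like the following; notice how most coefficients end up exactly at zero:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical data: many candidate predictors, few with real signal
n, p = 500, 20
X = rng.normal(size=(n, p))
y = 0.6 * X[:, 0] - 0.3 * X[:, 3] + rng.normal(scale=1.0, size=n)

# Standardize so the L1 penalty treats all features on the same scale
X_std = StandardScaler().fit_transform(X)

# Cross-validated LASSO: penalty strength chosen by out-of-sample performance
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

kept = np.flatnonzero(lasso.coef_)
print("Non-zero coefficients at features:", kept)
print("Chosen alpha:", round(lasso.alpha_, 4))
```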
Feature engineering and selection is not a one-and-done affair. If your model’s results look subpar, or if you run into suspiciously high error in out-of-sample tests, revisit your assumptions. Maybe you missed an important transformation, or maybe there’s a data quality issue that slipped through. Keep track of your data transformations and the rationale behind each new feature, so you can learn from both your successes and your dead ends.
• Exploratory Data Analysis (EDA): Summarizing data’s main characteristics and distributions, often visually.
• Tokenization: Splitting text into tokens (words or phrases) for computational analysis.
• TF-IDF (Term Frequency – Inverse Document Frequency): A numeric measure that highlights how important a word is in a particular document relative to all documents.
• Loughran–McDonald Dictionary: A finance-specific dictionary commonly used for sentiment and textual analysis of corporate filings.
• Feature Selection: Techniques to pare down the number of input variables, keeping only the ones that provide significant predictive power.
• Loughran, T., & McDonald, B. (2011). “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance.
• Hastie, T., Tibshirani, R., & Friedman, J. (2009). “The Elements of Statistical Learning.” Springer.
• CFA Institute Official Publication on NLP and textual analysis in investment research.
• Additional recommended reading: “Applied Text Analysis with Python” by Benjamin Bengfort et al., for a deeper dive into textual feature engineering.