Explore how technology and big data have reshaped factor investing, introducing non-traditional data sources, novel factor definitions, and the potential benefits and pitfalls of these emergent strategies.
Have you ever scrolled through a social media feed and thought, “Wait, could this be financial gold?” Well, it turns out that in many corners of the investment world, portfolio managers are doing exactly that—trawling online conversations, satellite imagery, app usage metrics, and more to glean new “factors” that might give them an edge. This approach, known broadly as using alternative data, is changing the way we define and measure risk factors in modern portfolios. Traditional factors like value, size, or momentum still matter (and indeed, you can find them covered thoroughly in earlier sections of this volume), but technology has made it possible for us to incorporate unique, non-traditional metrics, sometimes derived from truly unexpected places.
In this section, we’ll dive deep into the world of alternative factor definitions and data sources. We’ll talk about everything from how we can harvest tweets in real time for sentiment analysis, to how credit card transaction data might reveal consumer spending trends before official economic indicators are released. We’ll also show how these sources can be integrated into multi-factor models (see Chapter 9.1 for more on multi-factor approaches). And, um, since not everything that’s new is automatically good, we’ll also talk about spurious correlations, data scrubbing, and the all-important need to protect privacy and adhere to compliance. Let’s jump in.
When we discuss “factor investing,” we typically mean constructing portfolios around a handful of systematic drivers (a.k.a. factors) that explain much of the cross-sectional variation in returns. Classic factors often include the following (a brief computation sketch follows the list):
• Value (e.g., Price-to-Book)
• Size (i.e., market capitalization)
• Momentum (e.g., 12-month trailing returns)
• Quality (e.g., profitability metrics)
• Low volatility (e.g., beta or standard deviation measures)
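To make these definitions concrete, here is a minimal sketch of how such raw metrics might be turned into comparable cross-sectional factor scores. The tickers and values are entirely hypothetical, and the sign conventions (cheap stocks and small caps score high) follow common practice rather than any single prescribed method:

```python
import pandas as pd

# Hypothetical cross-section of five stocks (illustrative values only).
data = pd.DataFrame({
    "price_to_book": [0.8, 1.5, 3.2, 0.6, 2.1],
    "market_cap_bn": [5.0, 120.0, 45.0, 2.5, 300.0],
    "trailing_12m_return": [0.15, -0.05, 0.30, 0.08, 0.12],
}, index=["AAA", "BBB", "CCC", "DDD", "EEE"])

def zscore(s: pd.Series) -> pd.Series:
    """Cross-sectional z-score so different metrics are comparable."""
    return (s - s.mean()) / s.std()

# Signs are flipped where "low" is attractive: low price-to-book = value,
# small market cap = size.
factors = pd.DataFrame({
    "value": zscore(-data["price_to_book"]),
    "size": zscore(-data["market_cap_bn"]),
    "momentum": zscore(data["trailing_12m_return"]),
})
print(factors.round(2))
```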
However, technology’s rapid rise has challenged the assumption that these five or so long-standing factors represent the whole story. Advances in data processing (especially big data and machine learning) now allow us to incorporate a wide range of new signals to help explain asset price movements or forecast risk. This shift has enabled investors to define factors based on everything from the “tone” of news articles to geospatial data on foot traffic outside of retail stores.
By “alternative data,” we’re referring to non-traditional data sources used for investment analysis—social media feeds, satellite imagery, web scraping results, credit card transaction data, IoT device usage, geolocation information, job postings, and so on. These are not the usual earnings reports, market prices, or economic releases that are well-trodden territory for fundamental and quantitative analysts alike. Instead, they’re new streams of insight that can potentially predict market behavior before the conventional data hits the wires.
Key Term: Natural Language Processing (NLP)
Natural Language Processing (NLP) is a broad technology that helps us interpret and analyze large amounts of text or speech data. Picture this scenario: you have thousands of news articles discussing a single stock, and you want to understand if, on balance, the coverage has a positive or negative tone. That’s the job of NLP. The possibility of scaling this up to tens of thousands of companies, or millions of social media posts, is what’s really cool here.
As we saw in earlier chapters, factor returns hinge on consistent signals. With NLP, you might derive a sentiment factor for each company in your investment universe based on how optimistically or pessimistically the company is discussed. This new factor can then be integrated into a broader multi-factor model—potentially boosting alpha if your sentiment signals are robust.
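As a toy illustration of that aggregation step, the sketch below scores a few invented headlines with a tiny hand-built word lexicon and averages them into a per-company sentiment factor. A production system would use a trained NLP model rather than a word list, but the company-level aggregation logic would look much the same:

```python
import pandas as pd

# Toy lexicon -- a real system would use a trained sentiment model.
POSITIVE = {"beat", "growth", "strong", "upgrade", "record"}
NEGATIVE = {"miss", "decline", "weak", "downgrade", "lawsuit"}

def sentiment_score(text: str) -> float:
    """Net positive-word share of sentiment-bearing words, in [-1, 1]."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Hypothetical headlines keyed by ticker.
articles = {
    "AAA": ["Record growth and strong demand", "Analyst upgrade after beat"],
    "BBB": ["Earnings miss amid weak sales", "Downgrade follows lawsuit news"],
}

# Average article-level scores into one sentiment factor per company.
sentiment_factor = pd.Series(
    {tkr: sum(map(sentiment_score, docs)) / len(docs)
     for tkr, docs in articles.items()},
    name="sentiment",
)
print(sentiment_factor)
```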
Key Term: Social Media Sentiment
It’s one thing to read a few tweets about a company, and it’s another to systematically evaluate millions of them daily. Portfolio managers might measure the ratio of positive to negative tweets or track how viral a social media post becomes in a given hour to estimate the market’s sentiment factor. The quick reaction time in social media means that this factor might exhibit a shorter duration of predictive power—think tactical, not necessarily strategic. However, in certain markets, it can serve as an early warning sign or a near-real-time measure of consumer sentiment.
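Here is one way, sketched with invented hourly tweet counts for a single stock, to turn positive/negative tallies into a short-horizon signal. The log-ratio keeps the measure symmetric around zero, and the short-half-life moving average reflects the fast decay discussed above:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts of positive and negative tweets for one stock.
tweets = pd.DataFrame({
    "positive": [120, 340, 90, 610, 55],
    "negative": [80, 60, 150, 70, 160],
}, index=pd.date_range("2025-01-06 09:00", periods=5, freq="h"))

# Log-ratio of counts: symmetric around zero; the +1 guards against
# hours with zero tweets on one side.
log_ratio = np.log((tweets["positive"] + 1) / (tweets["negative"] + 1))

# Social-media signals decay quickly, so weight recent hours more heavily
# with a short-half-life exponential moving average.
signal = log_ratio.ewm(halflife=2).mean()
print(signal.round(3))
```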
In fact, we owe a significant chunk of these new insights to improvements in cloud computing, which let us access social media feeds in real time. And believe me (been there, done that in a previous role at a boutique hedge fund), the biggest challenge often isn’t pulling the data—it’s making sense of its sheer volume without drowning in random noise. It’s easy to end up with “garbage in, garbage out,” so we need advanced analytics just to handle the scale.
Satellite imagery can capture data on car counts in retail parking lots, changes in farmland conditions, or even shipping traffic at major ports. Portfolio managers then translate these observations into factors such as “retail foot traffic momentum” or “cargo shipment velocity.” If aggregated over time, these signals might provide an early sneak peek at upcoming sales figures or logistic slowdowns—before official stats come out.
We can illustrate a typical workflow for turning raw satellite imagery into a factor:
```mermaid
flowchart LR
    A["Raw Alternative Data <br/> (Satellite, Social Media)"]
    B["Data Scrubbing & Validation"]
    C["Feature Engineering <br/> (Factor Construction)"]
    D["Portfolio Model <br/> Integration"]
    A --> B
    B --> C
    C --> D
```
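As a concrete (and hypothetical) instance of the feature-engineering step above, the sketch below converts weekly parking-lot car counts into a simple "retail foot traffic momentum" signal. The counts are invented, and the smoothing and lookback windows are illustrative choices, not industry standards:

```python
import pandas as pd

# Hypothetical weekly car counts from satellite images of one retailer's
# parking lots (already scrubbed and validated, per the workflow above).
car_counts = pd.Series(
    [1050, 1100, 1210, 1180, 1350, 1420, 1500, 1475],
    index=pd.date_range("2025-01-05", periods=8, freq="W"),
)

# Smooth first to reduce image-to-image noise, then take the trailing
# 4-week growth rate as the "foot traffic momentum" factor input.
smoothed = car_counts.rolling(window=2).mean()
foot_traffic_momentum = smoothed.pct_change(periods=4)
print(foot_traffic_momentum.round(3))
```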
Some data providers (think: specialized data vendors) gather anonymized credit card transaction records from millions of consumers. Of course, they must adhere to strict privacy regulations—like GDPR in the EU—and ensure no personally identifiable information is stored. From these records, we can glean early insights into revenues of a particular restaurant chain or e-commerce site. This aggregated information might give us an “early call” on whether a retailer’s quarterly earnings will outperform or disappoint. In factor terms, the aggregated spending growth or decline could form a consumer demand factor.
Increasingly, we’re able to measure how many people visit a company’s website or download its mobile app, including how often they return or how much time they spend there. This can be grouped into an “engagement factor” that captures potential growth in user activity, brand recognition, or customer loyalty. Even more nuanced is how quickly a site’s traffic is accelerating compared to its peers—sometimes called “traffic momentum.”
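A minimal sketch of such a "traffic momentum" measure, assuming hypothetical monthly visit counts for three peer companies: we ask whether each company's growth is accelerating relative to its own recent history, then z-score the result across peers:

```python
import pandas as pd

# Hypothetical monthly website visits (millions) for three peer companies.
visits = pd.DataFrame({
    "AAA": [10.0, 10.5, 11.5, 13.0],
    "BBB": [20.0, 20.2, 20.1, 20.4],
    "CCC": [5.0, 5.8, 6.9, 8.5],
}, index=pd.period_range("2024-09", periods=4, freq="M"))

# "Traffic momentum": is growth itself accelerating? Compare the latest
# month-over-month growth rate with the average of the prior months.
growth = visits.pct_change()
acceleration = growth.iloc[-1] - growth.iloc[1:-1].mean()

# Rank each company against its peers as a cross-sectional z-score.
engagement_factor = (acceleration - acceleration.mean()) / acceleration.std()
print(engagement_factor.round(2))
```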
All this data is exciting, but it’s also, well, complicated. We’ve probably all heard that classic cautionary line: “Correlation is not causation.” With alternative data, the risk of generating spurious correlations is huge. If you gather enough random signals, you might accidentally find one that correlates strongly with an asset’s returns—at least for a while.
Spurious correlation arises when two variables appear to be closely connected, but the relationship is simply a coincidence or driven by an unseen factor. You might find an odd correlation between the monthly production of cheese in Denmark and biotech stock returns. It’s possible there’s a bizarre pattern, but more likely, it’s meaningless. Searching for new signals in the ocean of alternative data can feel like playing a cosmic game of “connect the dots.” The best practice is to use rigorous statistical validation, out-of-sample testing, and robust research design to filter out random patterns.
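The danger is easy to demonstrate with simulated data. In the sketch below, 500 candidate signals are generated as pure noise; the "best" one still shows an impressive in-sample correlation with returns, which then collapses out-of-sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# One year of daily "returns" plus 500 random candidate signals,
# all pure noise by construction.
returns = rng.normal(size=252)
signals = rng.normal(size=(500, 252))

# Search in-sample (first half of the year) for the most correlated signal.
half = 126
in_corr = [np.corrcoef(s[:half], returns[:half])[0, 1] for s in signals]
best = int(np.argmax(np.abs(in_corr)))

# The winner looks impressive in-sample, then collapses out-of-sample.
out_corr = np.corrcoef(signals[best][half:], returns[half:])[0, 1]
print(f"best in-sample corr:    {in_corr[best]:+.2f}")
print(f"its out-of-sample corr: {out_corr:+.2f}")
```

With enough candidate signals, a striking in-sample fit is nearly guaranteed; only the out-of-sample check reveals it as noise.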
Key Term: Data Scrubbing
Most alternative datasets are notoriously messy. We have to combine all sorts of data from disparate sources (some might come in raw images, some in text form, some in partial spreadsheets). Data scrubbing is crucial. Think of it like cleaning a house before the guests arrive—you’ve got to remove duplicates, fill in missing values where appropriate, or decide which outliers to exclude. Without consistent data scrubbing and normalization procedures, our factor signals might reflect data errors rather than fundamental insights.
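A small pandas sketch of those chores (dedupe, fill gaps, tame outliers), applied to an invented vendor file; the thresholds used here are illustrative, not prescriptive:

```python
import pandas as pd

# Messy hypothetical vendor file: a duplicate row, a gap, and a wild outlier.
raw = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "CCC", "DDD"],
    "spend_growth": [0.05, 0.05, None, 0.07, 9.50],
})

clean = (
    raw.drop_duplicates()                      # remove repeated rows
       .assign(spend_growth=lambda d:          # fill gaps with the median
               d["spend_growth"].fillna(d["spend_growth"].median()))
)

# Winsorize: clip extreme outliers to the 5th/95th percentile so one bad
# data point cannot dominate the factor signal.
lo, hi = clean["spend_growth"].quantile([0.05, 0.95])
clean["spend_growth"] = clean["spend_growth"].clip(lo, hi)
print(clean)
```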
Key Term: Data Privacy
Many alternative data sources involve aggregated or even personalized data. Strict privacy regulations—especially in jurisdictions like the EU—dictate how such data can be collected, stored, and employed for commercial use. The last thing any reputable investment firm wants is to run afoul of data protection laws, face reputational damage, or see clients jump ship. As a result, large firms often build entire compliance teams dedicated to ensuring their data science pipelines respect local and international data laws. This might mean anonymizing data at the source or only using aggregated measures that cannot be traced back to individual users.
At this point, you might be wondering: “Ok, but how do we actually incorporate these new data streams into, say, a multi-factor model?” Let’s take a quick look at a conceptual approach.
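Here is a conceptual sketch, using randomly generated placeholder scores: check how the new factor overlaps with existing ones, blend it into a composite with illustrative (not researched) weights, and rank the universe:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
tickers = [f"STK{i:02d}" for i in range(20)]

# Hypothetical pre-computed factor scores (cross-sectional z-scores).
factors = pd.DataFrame({
    "value": rng.normal(size=20),
    "momentum": rng.normal(size=20),
    "sentiment": rng.normal(size=20),   # the new alternative-data factor
}, index=tickers)

# Step 1: check overlap. If sentiment is highly correlated with momentum,
# it adds little new information to the model.
print(factors.corr().round(2))

# Step 2: blend factors into one composite score. Weights here are purely
# illustrative; in practice they come from research and backtesting.
weights = {"value": 0.4, "momentum": 0.4, "sentiment": 0.2}
composite = sum(w * factors[f] for f, w in weights.items())

# Step 3: rank the universe; top-ranked names are candidates for overweighting.
print(composite.sort_values(ascending=False).head())
```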
Once integrated, you use backtesting to evaluate performance, ideally over both an in-sample and out-of-sample period. Also, you might want to see how these new factors correlate with existing ones—sometimes, they overlap heavily with established factors like momentum, limiting the additional benefit they can bring.
Let’s say you manage a commodity-focused hedge fund with exposure to wheat and corn futures. By layering daily satellite images of farmland to measure ground coverage or vegetation indices, you might estimate crop yield earlier than government agencies like the USDA. If your analysis signals a supply shortage, you could allocate more heavily to that commodity ahead of official announcements. Ultimately, the difference between your “satellite factor forecast” and what the market expects can create alpha (excess returns).
Consider an e-commerce portal that sells electronics. You purchase aggregated credit card transaction data providing the total daily spending on that portal. E-commerce discretionary spending might be your factor of interest. If you find a strong correlation between daily spending changes and next-quarter revenue growth, then your factor may be predictive. You incorporate it into your factor model, weighting stocks in your portfolio more heavily if they show robust e-commerce spending signals.
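Such a predictiveness check might look like the following sketch, where both the spending-growth history and the subsequently reported revenue growth are invented for illustration:

```python
import pandas as pd

# Hypothetical history: quarterly growth in aggregated card spending on the
# portal, paired with the revenue growth reported one quarter later.
df = pd.DataFrame({
    "spend_growth":   [0.02, 0.05, -0.01, 0.04, 0.06, 0.00, 0.03, 0.05],
    "next_q_revenue": [0.03, 0.06,  0.00, 0.04, 0.07, 0.01, 0.03, 0.06],
})

# A simple predictiveness check: correlation between the spending signal
# and subsequently reported revenue growth.
corr = df["spend_growth"].corr(df["next_q_revenue"])
print(f"signal vs. next-quarter revenue correlation: {corr:.2f}")
```

Only if the relationship also holds out-of-sample (see the overfitting caveats below) would the signal be promoted into the factor model.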
Before we get swept up in the excitement, we have to be mindful:
• Overfitting: The more data you throw into a model, the higher the temptation to “curve-fit.” Always do out-of-sample tests and robust cross-validation.
• Data Lag and Timeliness: Some alternative data might look “fresh,” but there’s no guarantee it’s updated quickly enough to be valuable.
• Regulatory Complications: Different jurisdictions have different privacy regulations. Navigating them can be expensive and complex.
• Cost: Accessing unique alternative data can be costly, especially if it’s proprietary. Smaller firms often can’t afford the same depth of data as large institutions.
• Ethical Considerations: Collecting data from individuals (even if anonymized) can raise ethical concerns about consent and exploitation.
Despite these challenges, competition in the alternative data space is intensifying. Providers are increasingly specializing in niche data, such as detecting brand logos in daily news coverage or aggregating consumer reviews from tracking sites. Ultimately, success with alternative factor definitions depends on strong data governance, advanced analytics, and robust risk management—topics that tie directly to what we covered in Chapter 6 (Introduction to Risk Management) and Chapter 7 (Professional Practices in Portfolio Management).
Just because something is publicly available doesn’t mean it’s free of compliance concerns. Some social media data might be retrieved under constraints determined by the platform’s terms of service. Even when it’s aggregated, you need to ensure your usage does not violate any intellectual property rights or privacy protections. Over the past few years, enforcement of these rules has become stricter, with regulators like the SEC focusing on how investment managers obtain and use non-traditional data. In certain cases, regulators might question if an unfair advantage was gained, raising insider trading concerns if the data is deemed “material nonpublic information.” So, it’s crucial for firms to consult legal counsel and compliance officers before launching any alternative data-based strategy.
Given the ongoing explosion of Internet of Things (IoT) devices, 5G networks, and advanced computing, the scope of potential alternative data sources will only expand. We’ll likely see more widespread use of pattern recognition in large unstructured data sets—perhaps real-time kiosk data from malls, environmental sensors in manufacturing plants, or advanced smartphone usage metrics. In the not-too-distant future, entire risk models might revolve around purely digital footprints or real-time “social listening.” For now, though, these alternative factor definitions are frontier markets that demand high levels of caution, creativity, and due diligence.
• On the exam, remember that alternative factor definitions still follow the same principle as traditional factors: they must be systematic, persistent, and grounded in some underlying economic rationale.
• Understand the steps of data handling—especially how to clean and validate data—and be able to discuss the dangers of spurious correlations.
• Don’t forget to weigh the regulatory, ethical, and privacy dimensions.
• If a question asks you to design a multi-factor model using new data, emphasize how you would test the factor’s relevance, ensure data integrity, and maintain compliance.
These questions might appear in essay or item set format, often testing your grasp of both conceptual and practical aspects. Make sure you can articulate not just how to gather alternative data but also how to convert it into an actionable factor (Chapter 3.14 on risk-adjusted portfolio selection approaches might be a useful anchor here, too). Good luck, and remember, factor investing is becoming more creative by the day—but it still relies heavily on strong analytical foundations.
Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.