Alternative Factor Definitions and Data Sources

March 21, 2025 16 min read

Explore how technology and big data have reshaped factor investing, introducing non-traditional data sources, novel factor definitions, and the potential benefits and pitfalls of these emergent strategies.

On this page

Enhance Your Learning:

CFA Mock Exams (iOS) CFA Mock Exams (Android)

Introduction

Have you ever scrolled through a social media feed and thought, “Wait, could this be financial gold?” Well, it turns out that in many corners of the investment world, portfolio managers are doing exactly that—trawling online conversations, satellite imagery, app usage metrics, and more to glean new “factors” that might give them an edge. This approach, known broadly as using alternative data, is changing the way we define and measure risk factors in modern portfolios. Traditional factors like value, size, or momentum still matter (and indeed, you can find them covered thoroughly in earlier sections of this volume), but technology has made it possible for us to incorporate unique, non-traditional metrics, sometimes derived from truly unexpected places.

In this section, we’ll dive deep into the world of alternative factor definitions and data sources. We’ll talk about everything from how we can harvest tweets in real time for sentiment analysis, to how credit card transaction data might reveal consumer spending trends before official economic indicators are released. We’ll also show how these sources can be integrated into multi-factor models (see Chapter 9.1 for more on multi-factor approaches). And, um, since not everything that’s new is automatically good, we’ll also talk about spurious correlations, data scrubbing, and the all-important need to protect privacy and adhere to compliance. Let’s jump in.

The Evolution of Factor Definitions

When we discuss “factor investing,” we typically mean constructing portfolios around a handful of systematic drivers (a.k.a. factors) that explain much of the cross-sectional variation in returns. Classic factors often include:

• Value (e.g., Price-to-Book)
• Size (i.e., market capitalization)
• Momentum (e.g., 12-month trailing returns)
• Quality (e.g., profitability metrics)
• Low volatility (e.g., beta or standard deviation measures)

However, technology’s rapid rise has challenged the assumption that these five or so long-standing factors represent the whole story. Advances in data processing (especially big data and machine learning) now allow us to incorporate a wide range of new signals to help explain asset price movements or forecast risk. This shift has enabled investors to define factors based on everything from the “tone” of news articles to geospatial data on foot traffic outside of retail stores.

Alternative Data in a Nutshell

By “alternative data,” we’re referring to non-traditional data sources used for investment analysis—social media feeds, satellite imagery, web scraping results, credit card transaction data, IoT device usage, geolocation information, job postings, and so on. These are not the usual earnings reports, market prices, or economic releases that are well-trodden territory for fundamental and quantitative analysts alike. Instead, they’re new streams of insight that can potentially predict market behavior before the conventional data hits the wires.

Key Terms:

Alternative Data: Non-traditional data sources (social media feeds, web traffic, geolocation data) used for investment analysis.
Proprietary Data: Information collected or purchased by an entity that is not publicly available.

Technological Developments Leading to New Factor Definitions

Natural Language Processing (NLP)

Natural Language Processing (NLP) is a broad technology that helps us interpret and analyze large amounts of text or speech data. Picture this scenario: you have thousands of news articles discussing a single stock, and you want to understand if, on balance, the coverage has a positive or negative tone. That’s the job of NLP. The possibility of scaling this up to tens of thousands of companies, or millions of social media posts, is what’s really cool here.

As we saw in earlier chapters, factor returns hinge on consistent signals. With NLP, you might derive a sentiment factor for each company in your investment universe based on how optimistically or pessimistically the company is discussed. This new factor can then be integrated into a broader multi-factor model—potentially boosting alpha if your sentiment signals are robust.

Key Terms:

Natural Language Processing (NLP): A technology to interpret and analyze large amounts of text or speech data.
Sentiment Analysis: Measuring the positive or negative sentiment in news, social media, or other textual data.

It’s one thing to read a few tweets about a company, and it’s another to systematically evaluate millions of them daily. Portfolio managers might measure the ratio of positive to negative tweets or track how viral a social media post becomes in a given hour to estimate the market’s sentiment factor. The quick reaction time in social media means that this factor might exhibit a shorter duration of predictive power—think tactical, not necessarily strategic. However, in certain markets, it can serve as an early warning sign or a near-real-time measure of consumer sentiment.

In fact, we owe a significant chunk of these new insights to improvements in cloud computing, which let us access social media feeds in real time. And believe me (been there, done that in a previous role at a boutique hedge fund), the biggest challenge often isn’t pulling the data—it’s making sense of its sheer volume without drowning in random noise. It’s easy to end up with “garbage in, garbage out,” so we need advanced analytics just to handle the scale.

Emerging Alternative Data Sources

Satellite Imagery and Geospatial Data

Satellite imagery can capture data on car counts in retail parking lots, changes in farmland conditions, or even shipping traffic at major ports. Portfolio managers then translate these observations into factors such as “retail foot traffic momentum” or “cargo shipment velocity.” If aggregated over time, these signals might provide an early sneak peek at upcoming sales figures or logistic slowdowns—before official stats come out.

We can illustrate a typical workflow for turning raw satellite imagery into a factor:

    flowchart LR
	    A["Raw Alternative Data <br/> (Satellite, Social Media)"]
	    B["Data Scrubbing & Validation"]
	    C["Feature Engineering <br/> (Factor Construction)"]
	    D["Portfolio Model <br/> Integration"]
	
	    A --> B
	    B --> C
	    C --> D

Credit Card Transaction Data

Some data providers (think: specialized data vendors) gather anonymized credit card transaction records from millions of consumers. Of course, they must adhere to strict privacy regulations—like GDPR in the EU—and ensure no personal identifying information is stored. From these records, we can glean early insights into revenues of a particular restaurant chain or e-commerce site. This aggregated information might give us an “early call” on whether a retailer’s quarterly earnings will outperform or disappoint. In factor terms, the aggregated spending growth or decline could form a consumer demand factor.

Web Traffic and App Usage

Increasingly, we’re able to measure how many people visit a company’s website or download its mobile app, including how often they return or how much time they spend there. This can be grouped into an “engagement factor” that captures potential growth in user activity, brand recognition, or customer loyalty. Even more nuanced is how quickly a site’s traffic is accelerating compared to its peers—sometimes called “traffic momentum.”

Data Validation and Reliability Concerns

All this data is exciting, but it’s also, well, complicated. We’ve probably all heard that classic cautionary line: “Correlation is not causation.” With alternative data, the risk of generating spurious correlations is huge. If you gather enough random signals, you might accidentally find one that correlates strongly with an asset’s returns—at least for a while.

The Danger of Spurious Correlation

Spurious correlation arises when two variables appear to be closely connected, but the relationship is simply a coincidence or driven by an unseen factor. You might find an odd correlation between the monthly production of cheese in Denmark and biotech stock returns. It’s possible there’s a bizarre pattern, but more likely, it’s meaningless. When searching for new signals among the ocean of alternative data, it’s similar to rummaging through a cosmic game of “connect the dots.” The best practice is to use rigorous statistical validations, out-of-sample testing, and robust research design to filter out random patterns.

Key Term:

Spurious Correlation: A statistical relationship that arises by chance, not a meaningful causation.

Data Scrubbing and Normalization

Most alternative datasets are notoriously messy. We have to combine all sorts of data from disparate sources (some might come in raw images, some in text form, some in partial spreadsheets). Data scrubbing is crucial. Think of it like cleaning a house before the guests arrive—you’ve got to remove duplicates, fill in missing values where appropriate, or decide which outliers to exclude. Without consistent data scrubbing and normalization procedures, our factor signals might reflect data errors rather than fundamental insights.

Key Term:

Data Scrubbing: The process of cleaning and normalizing raw data to ensure accuracy and usefulness.

Privacy Regulations and Compliance

Many alternative data sources involve aggregated or even personalized data. Strict privacy regulations—especially in jurisdictions like the EU—dictate how such data can be collected, stored, and employed for commercial use. The last thing any reputable investment firm wants is to run afoul of data protection laws, face reputational damage, or see clients jump ship. As a result, large firms often build entire compliance teams dedicated to ensuring their data science pipelines respect local and international data laws. This might mean anonymizing data at the source or only using aggregated measures that cannot be traced back to individual users.

Key Term:

Privacy Regulations: Laws and guidelines (e.g., GDPR) governing the collection, storage, and usage of personal data.

Incorporating Non-Traditional Signals into Factor Models

At this point, you might be wondering: “Ok, but how do we actually incorporate these new data streams into, say, a multi-factor model?” Let’s take a quick look at a conceptual approach.

Data Ingestion: Gather raw data streams—tweets, satellite images, credit card records—and store them in a secure database.
Data Scrubbing: Clean up the data. Handle missing values, adjust for outliers, ensure consistent time stamps.
Feature Engineering: Derive meaningful numeric signals. For example, convert sentiment analysis from text into a “sentiment score.” Or track monthly changes in satellite-derived foot traffic as “foot traffic growth rate.”
Factor Construction: Blend and scale these signals into standardized factors. This might involve standardizing each factor so that it has a mean of zero and a standard deviation of 1 across your universe.
Model Integration: Incorporate these alternative factors into your portfolio optimization routine. This could mean standard multi-factor regression, or it could be a more advanced machine learning model that includes these new features as explanatory variables.

Once integrated, you use backtesting to evaluate performance, ideally over both an in-sample and out-of-sample period. Also, you might want to see how these new factors correlate with existing ones—sometimes, they overlap heavily with established factors like momentum, limiting the additional benefit they can bring.

Practical Applications and Case Studies

Case Study: Satellite Imagery for Agriculture

Let’s say you manage a commodity-focused hedge fund with exposure to wheat and corn futures. By layering daily satellite images of farmland to measure ground coverage or vegetation indices, you might estimate crop yield earlier than government agencies like the USDA. If your analysis signals a supply shortage, you could allocate more heavily to that commodity ahead of official announcements. Eventually, the difference between your “satellite factor forecast” and what the market expects can create alpha (excess returns).

Case Study: E-commerce Consumer Behavior

Consider an e-commerce portal that sells electronics. You purchase aggregated credit card transaction data providing the total daily spending on that portal. E-commerce discretionary spending might be your factor of interest. If you find a strong correlation between daily spending changes and next-quarter revenue growth, then your factor may be predictive. You incorporate it into your factor model, weighting stocks in your portfolio more heavily if they show robust e-commerce spending signals.

Risks, Pitfalls, and Best Practices

Before we get swept up in the excitement, we have to be mindful:

• Overfitting: The more data you throw into a model, the higher the temptation to “curve-fit.” Always do out-of-sample tests and robust cross-validation.
• Data Lag and Timeliness: Some alternative data might be “fresh,” but there’s no real rule guaranteeing that the data is updated quickly enough to be valuable.
• Regulatory Complications: Different jurisdictions have different privacy regulations. Navigating them can be expensive and complex.
• Cost: Accessing unique alternative data can be costly, especially if it’s proprietary. Smaller firms often can’t afford the same depth of data as large institutions.
• Ethical Considerations: Collecting data from individuals (even if anonymized) can raise ethical concerns about consent and exploitation.

Despite these challenges, competition in the alternative data space is intensifying. Providers are increasingly specializing in niche data, such as analyzing brand logos in daily news coverage or aggregator sites that track consumer reviews. Ultimately, success with alternative factor definitions depends on strong data governance, advanced analytics, and robust risk management—topics that tie directly to what we covered in Chapter 6 (Introduction to Risk Management) and Chapter 7 (Professional Practices in Portfolio Management).

Regulatory and Ethical Considerations

Just because something is publicly available doesn’t mean it’s free of compliance concerns. Some social media data might be retrieved under constraints determined by the platform’s terms of service. Even when it’s aggregated, you need to ensure your usage does not violate any intellectual property rights or privacy protections. Over the past few years, enforcement of these rules has become stricter, with regulators like the SEC focusing on how investment managers obtain and use non-traditional data. In certain cases, regulators might question if an unfair advantage was gained, raising insider trading concerns if the data is deemed “material nonpublic information.” So, it’s crucial for firms to consult legal counsel and compliance officers before launching any alternative data-based strategy.

Future Outlook

Given the ongoing explosion of Internet of Things (IoT) devices, 5G networks, and advanced computing, the scope of potential alternative data sources will only expand. We’ll likely see more widespread use of pattern recognition in large unstructured data sets—perhaps real-time kiosk data from malls, environmental sensors in manufacturing plants, or advanced smartphone usage metrics. In the not-too-distant future, entire risk models might revolve around purely digital footprints or real-time “social listening.” For now, though, these alternative factor definitions are frontier markets that demand high levels of caution, creativity, and due diligence.

Exam Tips and Final Thoughts

• On the exam, remember that alternative factor definitions still follow the same principle as traditional factors: they must be systematic, persistent, and grounded in some underlying economic rationale.
• Understand the steps of data handling—especially how to clean and validate data—and be able to discuss the dangers of spurious correlations.
• Don’t forget to weigh the regulatory, ethical, and privacy dimensions.
• If a question asks you to design a multi-factor model using new data, emphasize how you would test the factor’s relevance, ensure data integrity, and maintain compliance.

These questions might appear in essay or item set format, often testing your grasp of both conceptual and practical aspects. Make sure you can articulate not just how to gather alternative data but also how to convert it into an actionable factor (Chapter 3.14 on risk-adjusted portfolio selection approaches might be a useful anchor here, too). Good luck, and remember, factor investing is becoming more creative by the day—but it still relies heavily on strong analytical foundations.

References

Boehmer, E., Wu, J., & Zhang, X. (2017). “Microstructure-based modeling of intraday exchange rates using machine learning.” Journal of Banking & Finance.
AlternativeData.org – A hub for exploring non-traditional data providers.
CFA Institute Official Curriculum – Big Data and AI in Finance.

Test Your Knowledge: Alternative Factor Definitions and Big Data Quiz

### Which statement best describes alternative data in an investment context? - [ ] Data officially published in government statistical releases. - [x] Non-traditional sources such as social media feeds, geolocation data, or satellite imagery. - [ ] Primarily vendor-sponsored fundamental data. - [ ] Traditional factor-driven data like market cap and earnings announcements. > **Explanation:** Alternative data refers to new, non-traditional types of data used in investment analysis, such as satellite imagery or social media analysis. ### What is a potential pitfall when incorporating new alternative data sources into factor models? - [x] The risk of spurious correlation leading to incorrect inferences. - [ ] A guaranteed improvement in long-term portfolio returns. - [ ] Elimination of the need for traditional fundamental analysis. - [ ] Instant regulatory approval for usage. > **Explanation:** There’s a real danger of finding false correlations in large and noisy datasets, leading to incorrect factor signals. ### A sentiment factor derived from NLP on social media would: - [ ] Directly measure a stock’s fundamental value. - [x] Reflect the overall positive or negative tone of user-generated textual data. - [ ] Eliminate the need for risk management tools. - [ ] Provide a permanent alpha edge without ongoing maintenance. > **Explanation:** Sentiment factors capture the tone of textual content, such as tweets or news articles, but cannot directly substitute fundamental analysis or eliminate risk concerns. ### Which of the following is a key reason data scrubbing is essential? - [ ] To make the data appear more random. - [x] To remove errors, inconsistencies, and duplicates that can distort factor analysis. - [ ] To lower regulatory requirements. - [ ] To bypass privacy concerns. > **Explanation:** Data scrubbing helps ensure the usability and accuracy of alternative data and avoids misleading signals in factor models. ### An investor who purchases aggregated credit card transaction data must: - [x] Comply with privacy regulations to ensure data anonymity. - [ ] Force customers to reveal their individual spending patterns. - [x] Confirm that data is free from personal identifying information. - [ ] Publish it in real-time on public forums. > **Explanation:** Privacy regulations require that personal data be protected and anonymized. Investors typically work with aggregated, anonymized data that cannot be traced back to individual consumers. ### Why might credit card transaction data offer an investment edge? - [x] It sometimes reveals consumer spending trends before official economic reports. - [ ] It is always more accurate than company quarterly filings. - [ ] It is a guaranteed predictor of stock price movements. - [ ] It must be publicly released on an exchange. > **Explanation:** Credit card data can uncover real-time consumer behavior, often anticipating official data releases and earnings reports. ### What is one of the biggest obstacles to effectively integrating social media sentiment into investment strategies? - [x] The high volume of data, which can mask genuine signals with noise. - [ ] Automatic regulatory endorsement of sentiment-based strategies. - [x] Overreliance on a single tweet from the CEO of a company. - [ ] The elimination of all risk from the investment process. > **Explanation:** Social media data is vast, and extracting meaningful signals requires sophisticated processing and filtering to avoid random noise. ### Satellite imagery can be particularly valuable in analyzing: - [x] Consumer foot traffic at retail locations. - [ ] Only intangible technology sector assets. - [ ] An individual’s private bank statements. - [ ] Equities that do not trade on exchanges. > **Explanation:** Satellite imagery helps gauge real-world activity like crowd sizes, shipping volumes, and farmland conditions, all of which can inform investment decisions. ### Spurious correlations in alternative data can best be addressed by: - [x] Conducting robust out-of-sample testing and validation. - [ ] Selecting the top five correlations regardless of underlying logic. - [ ] Immediately adding them to a portfolio. - [ ] Ignoring all historical data and relying on real-time data only. > **Explanation:** To combat spurious correlations, one should do out-of-sample testing, cross-validation, and carefully assess whether the factor has real economic merit. ### True or False: Integrating alternative data sources automatically lessens the need for due diligence regarding insider trading regulations. - [ ] True - [x] False > **Explanation:** Even with alternative data sources, managers still have to ensure that no material nonpublic information is being used. Compliance and due diligence remain critical.

View the page source Edit the page History

Thursday, April 10, 2025 Friday, March 21, 2025

Important Notice: FinancialAnalystGuide.com provides supplemental CFA study materials, including mock exams, sample exam questions, and other practice resources to aid your exam preparation. These resources are not affiliated with or endorsed by the CFA Institute. CFA® and Chartered Financial Analyst® are registered trademarks owned exclusively by CFA Institute. Our content is independent, and we do not guarantee exam success. CFA Institute does not endorse, promote, or warrant the accuracy or quality of our products.

9.12 Regime Switching and Machine Learning in Factor Allocation

9.14 Shifts Between High Beta and Defensive Factors

Browse CFA Level 1

Alternative Factor Definitions and Data Sources

Introduction

The Evolution of Factor Definitions

Alternative Data in a Nutshell

Technological Developments Leading to New Factor Definitions

Natural Language Processing (NLP)

Social Media and Sentiment Factors

Emerging Alternative Data Sources

Satellite Imagery and Geospatial Data

Credit Card Transaction Data

Web Traffic and App Usage

Data Validation and Reliability Concerns

The Danger of Spurious Correlation

Data Scrubbing and Normalization

Privacy Regulations and Compliance

Incorporating Non-Traditional Signals into Factor Models

Practical Applications and Case Studies

Case Study: Satellite Imagery for Agriculture

Case Study: E-commerce Consumer Behavior

Risks, Pitfalls, and Best Practices

Regulatory and Ethical Considerations

Future Outlook

Exam Tips and Final Thoughts

References

Test Your Knowledge: Alternative Factor Definitions and Big Data Quiz