docs: add US alpha research design spec
docs/superpowers/specs/2026-04-17-us-alpha-research-design.md

# US High-Alpha Research Design

**Date:** 2026-04-17

## Goal

Build a research framework for US `long-only` equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over `1/2/3/5/10y` windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.

## Constraints

- Data sources must be free or already accessible from the current project environment.
- Portfolio construction must be `long-only`.
- The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
  - `strict` results from a point-in-time-clean universe.
  - `exploratory` results from a wider free-data universe that is not fully point-in-time-clean.
- All signals must use only information available at the time of decision.
- The framework must explicitly guard against:
  - survivorship bias
  - lookahead bias
  - static industry-label leakage
  - microcap and illiquidity contamination

## Success Criteria

The framework is successful if it produces:

1. A unified research and backtest pipeline for US strategies.
2. A ranked comparison of `3-5` high-value strategy families across `1/2/3/5/10y`.
3. Metrics that go beyond headline CAGR, including:
   - `CAGR`
   - `Sharpe`
   - `Sortino`
   - `MaxDD`
   - `Calmar`
   - `Turnover`
   - `Average positions`
   - `Median ADV usage`
   - `Subperiod stability`
4. Tiered interpretation of results:
   - `Tier A`: realistic and tradable under tighter liquidity assumptions
   - `Tier B`: strong alpha but lower capacity
   - `Tier C`: attractive only under loose assumptions and not suitable as a production candidate
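
A minimal sketch of how the return-based metrics in item 3 could be computed from a daily return series. The helper name `performance_summary`, the 252-day annualization factor, and the zero risk-free rate are assumptions made for illustration, not part of this spec. `Turnover`, `Average positions`, and `Median ADV usage` need position-level data and are left out.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization factor for daily data


def performance_summary(daily_returns: pd.Series) -> dict:
    """Headline metrics from daily simple returns (hypothetical helper)."""
    r = daily_returns.dropna()
    equity = (1.0 + r).cumprod()
    years = len(r) / TRADING_DAYS
    cagr = equity.iloc[-1] ** (1.0 / years) - 1.0

    ann_mean = r.mean() * TRADING_DAYS
    vol = r.std(ddof=1) * np.sqrt(TRADING_DAYS)
    # Downside deviation: root mean square of the negative returns only.
    downside = np.sqrt((np.minimum(r, 0.0) ** 2).mean()) * np.sqrt(TRADING_DAYS)

    # Measure drawdown against the running peak, never below starting capital.
    peak = equity.cummax().clip(lower=1.0)
    max_dd = (equity / peak - 1.0).min()

    return {
        "CAGR": cagr,
        "Sharpe": ann_mean / vol if vol > 0 else np.nan,
        "Sortino": ann_mean / downside if downside > 0 else np.nan,
        "MaxDD": max_dd,
        "Calmar": cagr / abs(max_dd) if max_dd < 0 else np.nan,
    }
```

`Subperiod stability` could then be approximated by calling the same helper on calendar-year slices and comparing the spread of the results.
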

Any strategy that reports near-`50% CAGR` must also explain:

- which market regime contributed most of the return
- whether performance depends on low-liquidity or small-cap tails
- whether results survive after removing the most extreme tail names

## Research Philosophy

This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a `10y 50% CAGR` should be treated as an upper-end outcome that may appear in selective windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over `3/5y`, still meaningfully outperform over `10y`, and remain robust after tightening assumptions.

## Strategy Families

The research effort will focus on four strategy families.

### 1. Earnings Drift Proxy

Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.

Primary implementation order:

- use free historical earnings date data if it is stable enough
- otherwise fall back to price-and-volume-defined event proxies

Core signal ingredients:

- strong post-event excess return over `1-3` days
- abnormal volume
- gap that does not immediately fill
- price holding near short- and medium-term highs after the event

### 2. Breakout After Compression

Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.

Core signal ingredients:

- proximity to `120d` or `252d` highs
- volatility compression over the prior `20-40` trading days
- rising dollar volume
- positive relative strength versus market and industry proxies

### 3. Gap-and-Go / High-Volume Continuation

Target the second phase of move continuation after abnormal return and volume shocks rather than blindly chasing the first event day.

Core signal ingredients:

- abnormal `1d` or `3d` return
- abnormal volume versus trailing `60d`
- post-event price holding above the event anchor
- subsequent breakout continuation

This family has high potential upside but is more sensitive to cost assumptions and market regime.
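
The event-then-hold logic can be sketched for a single ticker as follows. This is an illustration under stated assumptions: the event anchor is taken to be the event-day open, the thresholds (`ret_z_min`, `vol_mult_min`) and the three-day hold are placeholders, and the function name is hypothetical.

```python
import pandas as pd


def gap_and_go_candidates(
    open_: pd.Series,
    close: pd.Series,
    volume: pd.Series,
    lookback: int = 60,
    ret_z_min: float = 3.0,
    vol_mult_min: float = 2.0,
    hold_days: int = 3,
) -> pd.Series:
    """True on day t if an event occurred hold_days earlier and price held."""
    ret = close / close.shift(1) - 1.0

    # Baselines are shifted by one day so the event bar is excluded
    # from the window used to judge it.
    ret_z = ret / ret.rolling(lookback).std().shift(1)
    vol_ratio = volume / volume.rolling(lookback).median().shift(1)
    event = (ret_z > ret_z_min) & (vol_ratio > vol_mult_min)

    # Assumed anchor: the open of the event day, carried forward hold_days.
    anchor = open_.where(event).shift(hold_days)

    # Every close since the event (the last hold_days bars) stayed above it.
    held = close.rolling(hold_days).min() > anchor
    return held & anchor.notna()
```

Flagging the candidate only after the hold period is what separates this from chasing the first event day.
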

### 4. Regime-Gated Cross-Sectional Alpha

Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.

Core signal ingredients:

- market risk-on versus risk-off state
- industry ETF leadership
- relative strength
- recovery from drawdowns
- trend quality
- near-`52w` high behavior
- price/volume confirmation

This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
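
As a rough sketch of the first two ingredients, the helpers below derive a market state and an ETF leadership rank from daily closes. The 200-day moving-average rule, the 63-day return lookback, and both function names are assumptions chosen for illustration; the spec does not fix a regime definition.

```python
import pandas as pd


def risk_on(benchmark_close: pd.Series, ma_window: int = 200) -> pd.Series:
    """True on days the benchmark closes above its trailing moving average."""
    return benchmark_close > benchmark_close.rolling(ma_window).mean()


def etf_leadership(etf_close: pd.DataFrame, lookback: int = 63) -> pd.DataFrame:
    """Per-day percentile rank of trailing ETF returns (1.0 = strongest)."""
    trailing = etf_close / etf_close.shift(lookback) - 1.0
    return trailing.rank(axis=1, pct=True)
```

Both outputs describe the state at the `t` close and would gate positions from `t+1` onward.
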

## Prioritization

Recommended implementation order:

1. `Breakout After Compression`
2. `Regime-Gated Cross-Sectional Alpha`
3. `Gap-and-Go / High-Volume Continuation`
4. `Earnings Drift Proxy` only after validating free event-data quality

Rationale:

- `Breakout After Compression` is the most implementable and least ambiguous with free data.
- `Regime-Gated Cross-Sectional Alpha` provides a shared control layer for the rest of the framework.
- `Gap-and-Go` has higher upside but also higher sensitivity to assumptions.
- `Earnings Drift Proxy` is theoretically powerful but should not become the project bottleneck if free event history is incomplete.

## Data Layer

The framework needs a richer data layer than the current `close/open` setup.

### Required price fields

Daily US market data should support at least:

- `open`
- `high`
- `low`
- `close`
- `volume`

This is required to define:

- real breakouts
- gap events
- volatility compression
- abnormal dollar volume

### Required ETF layer

Add stable market and industry ETFs for regime and leadership analysis, at minimum:

- `SPY`
- `QQQ`
- `IWM`
- `MDY`
- `XLF`
- `XLK`
- `XLI`
- `XLV`
- `XLY`
- `XLP`
- `XLE`
- `XLU`
- `XLRE`
- `XLB`
- `SOXX`
- `IGV`
- `SMH`

### Universe modes

The framework must support two explicit modes.

#### Strict mode

Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.

#### Exploratory mode

Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.

## Universe Construction Rules

The tradable universe must be computed daily from lagged information.

### Daily eligibility rules

Each stock may enter the candidate set only if all required conditions hold as of `t-1`:

- enough listing history exists to compute the strategy lookbacks
- enough valid volume observations exist
- minimum lagged price threshold is met
- minimum lagged dollar-volume threshold is met

Representative defaults:

- `close[t-1] > 5`
- `median_dollar_volume_60d[t-1] > $20M` in `strict` mode
- `median_dollar_volume_60d[t-1] > $5M` in `exploratory` mode
- `>= 252` valid trading days before eligibility
- `>= 40` valid volume days in the trailing `60d`

Thresholds should be strategy-specific and tunable in robustness sweeps.
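
The representative defaults above could be expressed as a lagged boolean mask along these lines. `UniverseConfig`, `STRICT`, `EXPLORATORY`, and `eligibility_mask` are hypothetical names, and the sketch covers only the price and liquidity rules; point-in-time index membership for `strict` mode would be applied separately.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class UniverseConfig:
    min_price: float = 5.0
    min_median_dollar_volume: float = 20e6
    min_history_days: int = 252
    min_volume_days: int = 40
    volume_window: int = 60


STRICT = UniverseConfig()
EXPLORATORY = UniverseConfig(min_median_dollar_volume=5e6)


def eligibility_mask(
    close: pd.DataFrame, volume: pd.DataFrame, cfg: UniverseConfig = STRICT
) -> pd.DataFrame:
    """Date x ticker mask; True on day t only if all rules held at t-1."""
    dollar_vol = close * volume
    valid_volume = volume.notna() & (volume > 0)

    median_dv = dollar_vol.rolling(
        cfg.volume_window, min_periods=cfg.min_volume_days
    ).median()
    history = close.notna().cumsum()
    volume_days = valid_volume.astype(int).rolling(
        cfg.volume_window, min_periods=1
    ).sum()

    raw = (
        (close > cfg.min_price)
        & (median_dv > cfg.min_median_dollar_volume)
        & (history >= cfg.min_history_days)
        & (volume_days >= cfg.min_volume_days)
    )
    # Lag by one day so membership on day t depends only on data through t-1.
    return raw.shift(1, fill_value=False)
```

Keeping the thresholds in a frozen config object makes each robustness sweep a matter of constructing a new `UniverseConfig` rather than editing the mask logic.
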

### Industry mapping

Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over `63/126d` windows.
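
A small sketch of such a proxy, assuming daily return series for one stock and for the industry ETFs are already aligned on the same dates. The function names are hypothetical, and picking the ETF with the highest rolling correlation is just one possible assignment rule.

```python
import pandas as pd


def rolling_beta(stock_ret: pd.Series, etf_ret: pd.Series, window: int = 126) -> pd.Series:
    """Trailing-window beta of a stock to one ETF, from daily returns."""
    cov = stock_ret.rolling(window).cov(etf_ret)
    var = etf_ret.rolling(window).var()
    return cov / var


def best_industry_proxy(
    stock_ret: pd.Series, etf_rets: pd.DataFrame, window: int = 126
) -> pd.Series:
    """ETF with the highest trailing correlation to the stock on each date."""
    corr = etf_rets.apply(lambda col: stock_ret.rolling(window).corr(col))
    # Skip warm-up rows where no correlation is defined yet.
    return corr.dropna(how="all").idxmax(axis=1)
```

Because each value uses only the trailing window ending on that date, the mapping can change over time instead of inheriting today's sector label.
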

## Anti-Lookahead Rules

The framework must enforce the following rules consistently.

1. Signals computed from daily bars through `t` may be traded no earlier than `t+1`.
2. If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
3. Rolling inputs for liquidity, volatility, and breakout logic must use complete lagged windows with explicit timing semantics.
4. Cross-sectional ranking must happen only within the daily eligible universe.
5. Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.

## Execution Convention

Default execution convention:

- observe data through `t` close
- compute signal after the `t` close
- trade at `t+1`

The framework may compare `t+1 open` and `t+1 close` execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
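
The timing in this convention is easy to get wrong by one bar, so here is a sketch of the `t+1 close` variant using close-to-close returns. The function name is hypothetical, and the choice of the close variant is an assumption for illustration; the spec does not say which variant is the default.

```python
import pandas as pd


def next_close_portfolio_returns(
    target_weights: pd.DataFrame, close: pd.DataFrame
) -> pd.Series:
    """Daily portfolio returns when weights decided at t trade at the t+1 close."""
    asset_ret = close / close.shift(1) - 1.0

    # Weights decided after the t close are bought at the t+1 close, so the
    # first return they earn is the one from the t+1 close to the t+2 close.
    # Against close-to-close returns that is a shift of two rows, not one.
    held = target_weights.shift(2).fillna(0.0)
    return (held * asset_ret).sum(axis=1)
```

A shift of one row would instead credit the portfolio with the `t` to `t+1` return, which is only achievable by trading at the same `t` close that produced the signal.
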

## Backtest and Evaluation Framework

Every strategy family must run through a single pipeline that:

1. loads required market data
2. constructs the daily eligible universe
3. computes regime filters
4. computes strategy scores or event states
5. builds a `long-only` portfolio
6. applies transaction costs
7. reports `1/2/3/5/10y` windows
8. records robustness diagnostics
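
Steps 6 and 7 might look roughly like the sketch below. The linear cost model, the 10 bps default, the 252-day year, and both function names are assumptions for illustration; the spec does not prescribe a cost model.

```python
import numpy as np
import pandas as pd


def apply_costs(
    gross_returns: pd.Series, held_weights: pd.DataFrame, cost_bps: float = 10.0
) -> pd.Series:
    """Subtract a linear cost proportional to one-way turnover."""
    w = held_weights.fillna(0.0)
    # Compare with the previous day's weights; the book starts from cash.
    turnover = (w - w.shift(1).fillna(0.0)).abs().sum(axis=1)
    return gross_returns - turnover * cost_bps / 1e4


def trailing_window_cagr(
    net_returns: pd.Series,
    windows_years: tuple = (1, 2, 3, 5, 10),
    trading_days: int = 252,
) -> pd.Series:
    """CAGR over the most recent N years for each requested window."""
    out = {}
    for years in windows_years:
        n = years * trading_days
        if len(net_returns) < n:
            out[f"{years}y"] = np.nan  # not enough history for this window
            continue
        growth = (1.0 + net_returns.tail(n)).prod()
        out[f"{years}y"] = growth ** (1.0 / years) - 1.0
    return pd.Series(out)
```

Reporting `NaN` for windows without enough history avoids silently annualizing a shorter sample as if it were a full `10y` result.
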

### Portfolio defaults

Initial baseline settings:

- `long-only`
- concentrated books such as `top 5`, `top 10`, `top 20`
- start with `equal weight`
- add `inverse-vol` weighting only as a secondary comparison

Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
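
Both weighting schemes can be sketched in a few lines, assuming date x ticker frames for scores, eligibility, and daily returns. The function names and the 63-day volatility window are illustrative assumptions.

```python
import pandas as pd


def top_n_equal_weight(
    scores: pd.DataFrame, eligible: pd.DataFrame, n: int = 10
) -> pd.DataFrame:
    """Equal weights on the n highest-scoring eligible names per day."""
    # Rank only inside the eligible universe; ineligible names become NaN.
    ranks = scores.where(eligible).rank(axis=1, ascending=False, method="first")
    selected = (ranks <= n).astype(float)
    counts = selected.sum(axis=1)
    return selected.div(counts.where(counts > 0), axis=0).fillna(0.0)


def inverse_vol_weight(
    selected: pd.DataFrame, daily_returns: pd.DataFrame, vol_window: int = 63
) -> pd.DataFrame:
    """Reweight the selected names in proportion to 1 / trailing volatility."""
    vol = daily_returns.rolling(vol_window).std()
    inv = (1.0 / vol.where(vol > 0)).where(selected > 0)
    total = inv.sum(axis=1)
    return inv.div(total.where(total > 0), axis=0).fillna(0.0)
```

On days with fewer than `n` eligible names the book simply holds fewer positions, which is worth tracking through the `Average positions` metric.
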

### Required robustness checks

Any strategy candidate that looks strong must automatically be re-run under:

- tighter liquidity thresholds
- both fewer and more positions
- higher trading costs
- different rebalance frequencies
- exclusion of the lowest-liquidity or smallest-cap tail

Only strategies that survive these perturbations should be promoted to `Tier A`.
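
The automatic re-run could be driven by a simple parameter grid, as in this sketch. `run_backtest` is a hypothetical callable supplied by the pipeline that takes a parameter dict and returns a metrics dict; the `Sharpe` floor used as the survival test is likewise an assumption, since the spec does not define what "survive" means numerically.

```python
from itertools import product
from typing import Callable


def run_sweep(run_backtest: Callable[[dict], dict], base: dict, grid: dict) -> list:
    """Re-run a backtest for every combination of perturbed parameters."""
    keys = list(grid)
    rows = []
    for values in product(*(grid[k] for k in keys)):
        params = {**base, **dict(zip(keys, values))}
        rows.append({**params, **run_backtest(params)})
    return rows


def survives(rows: list, metric: str = "Sharpe", floor: float = 0.5) -> bool:
    """True only if the metric stays at or above the floor in every run."""
    return all(row[metric] >= floor for row in rows)
```

A grid for the checks listed above might vary keys such as `min_median_dollar_volume`, `n_positions`, `cost_bps`, and `rebalance_days`, with those key names being placeholders for whatever the pipeline actually exposes.
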

## Repository Changes

The following repository changes are required.

### New modules

#### `research/us_universe.py`

Responsibilities:

- build daily tradable-universe masks
- support `strict` and `exploratory` modes
- enforce lagged eligibility rules

#### `data_manager.py` extension or new `market_data.py`

Responsibilities:

- support daily US `OHLCV`
- support ETF data updates
- preserve existing price-loading workflows where practical

#### `research/regime_filters.py`

Responsibilities:

- market risk-on/risk-off filters
- ETF leadership signals
- breadth and relative-strength helpers

#### `research/event_factors.py`

Responsibilities:

- breakout-compression scores
- gap-continuation scores
- high-volume continuation logic
- earnings-drift proxy logic

#### `research/us_alpha_pipeline.py`

Responsibilities:

- orchestrate end-to-end research runs
- load data
- build universe masks
- run strategy families
- produce windowed rankings
- label output as `strict` or `exploratory`

#### `research/us_alpha_report.py`

Responsibilities:

- format tables and CSV outputs
- summarize results by family and horizon
- support markdown export if needed

## Research Phasing

The implementation should be split into two phases.

### Phase 1

Build the strict, defensible research backbone:

- PIT S&P 500 universe
- OHLCV data support
- ETF regime filters
- `Breakout After Compression`
- `Regime-Gated Cross-Sectional Alpha`
- `Gap-and-Go / High-Volume Continuation`
- unified backtest and reporting pipeline

This phase should produce a clean research system that is difficult to fool with future information.

### Phase 2

Expand into higher-upside exploratory research:

- wider US stock universe
- broader signal scanning
- stronger CAGR search
- explicit exploratory labeling

This phase is for alpha discovery, not for making final claims about unbiased production performance.

## Recommended Output

The finished framework should produce:

- a repeatable research entrypoint for US alpha studies
- CSV outputs for `1/2/3/5/10y` windows
- a ranked table of strategy families
- tier classification for candidates
- notes on where near-`50% CAGR` outcomes come from and whether they remain credible after tightening assumptions

## Non-Goals

This project does not aim to:

- promise stable `10y 50% CAGR`
- claim a fully point-in-time-clean all-US-stock universe from free data alone
- optimize to a single headline metric at the expense of realism
- treat exploratory full-market scans as production-quality evidence

## Key Decision

The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.