docs: add US alpha research design spec

2026-04-17 23:41:10 +08:00
parent 5e1c4a681d
commit 7239310be3
1 changed files with 376 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md
+++ b/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md
@@ -0,0 +1,376 @@
+# US High-Alpha Research Design
+
+**Date:** 2026-04-17
+
+## Goal
+
+Build a research framework for US `long-only` equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over `1/2/3/5/10y` windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.
+
+## Constraints
+
+- Data sources must be free or already accessible from the current project environment.
+- Portfolio construction must be `long-only`.
+- The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
+  - `strict` results from a point-in-time-clean universe.
+  - `exploratory` results from a wider free-data universe that is not fully point-in-time-clean.
+- All signals must use only information available at the time of decision.
+- The framework must explicitly guard against:
+  - survivorship bias
+  - lookahead bias
+  - static industry-label leakage
+  - microcap and illiquidity contamination
+
+## Success Criteria
+
+The framework is successful if it produces:
+
+1. A unified research and backtest pipeline for US strategies.
+2. A ranked comparison of `3-5` high-value strategy families across `1/2/3/5/10y`.
+3. Metrics that go beyond headline CAGR, including:
+   - `CAGR`
+   - `Sharpe`
+   - `Sortino`
+   - `MaxDD`
+   - `Calmar`
+   - `Turnover`
+   - `Average positions`
+   - `Median ADV usage`
+   - `Subperiod stability`
+4. Tiered interpretation of results:
+   - `Tier A`: realistic and tradable under tighter liquidity assumptions
+   - `Tier B`: strong alpha but lower capacity
+   - `Tier C`: attractive only under loose assumptions and not suitable as a production candidate
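A minimal sketch of how the return-based metrics in item 3 could be computed from a list of simple daily returns. The function name, the `252`-day annualization factor, the zero risk-free rate, and the zero downside target for Sortino are all assumptions made for illustration, not decisions taken by this spec.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor for daily returns


def performance_metrics(daily_returns):
    """Compute headline metrics from simple daily returns (risk-free rate assumed 0)."""
    n = len(daily_returns)
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        max_dd = min(max_dd, equity / peak - 1.0)  # most negative drawdown so far

    cagr = equity ** (TRADING_DAYS / n) - 1.0
    avg = mean(daily_returns)
    vol = stdev(daily_returns)
    sharpe = avg / vol * math.sqrt(TRADING_DAYS) if vol > 0 else float("nan")

    # Downside deviation: root mean square of negative returns (target = 0).
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in daily_returns) / n)
    sortino = avg / downside * math.sqrt(TRADING_DAYS) if downside > 0 else float("nan")

    calmar = cagr / abs(max_dd) if max_dd < 0 else float("nan")
    return {"CAGR": cagr, "Sharpe": sharpe, "Sortino": sortino,
            "MaxDD": max_dd, "Calmar": calmar}
```

Turnover, average positions, and ADV usage depend on holdings rather than returns, so they would need the portfolio weights as an additional input.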
+
+Any strategy result that shows near-`50% CAGR` must be accompanied by an explanation of:
+
+- which market regime contributed most of the return
+- whether performance depends on low-liquidity or small-cap tails
+- whether results survive after removing the most extreme tail names
+
+## Research Philosophy
+
+This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a `10y 50% CAGR` should be treated as an upper-end outcome that may appear in select windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over `3/5y`, still meaningfully outperform over `10y`, and remain robust after tightening assumptions.
+
+## Strategy Families
+
+The research effort will focus on four strategy families.
+
+### 1. Earnings Drift Proxy
+
+Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.
+
+Primary implementation order:
+
+- use free historical earnings date data if it is stable enough
+- otherwise fall back to price-and-volume-defined event proxies
+
+Core signal ingredients:
+
+- strong post-event excess return over `1-3` days
+- abnormal volume
+- gap that does not immediately fill
+- price holding near short- and medium-term highs after the event
+
+### 2. Breakout After Compression
+
+Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.
+
+Core signal ingredients:
+
+- proximity to `120d` or `252d` highs
+- volatility compression over the prior `20-40` trading days
+- rising dollar volume
+- positive relative strength versus market and industry proxies
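The first three ingredients above could be turned into per-stock features roughly as follows. This is a sketch under stated assumptions: the function name, the `20`/`120`-day windows, and the ratio-style definitions are hypothetical choices, and relative strength is left out because it needs benchmark data. Inputs are plain lists ordered oldest first and ending at the decision bar.

```python
from statistics import mean, pstdev


def pct_returns(closes):
    """Simple close-to-close returns."""
    return [b / a - 1.0 for a, b in zip(closes, closes[1:])]


def breakout_compression_features(highs, closes, volumes,
                                  high_window=252, comp_window=20, base_window=120):
    """Features as of the last bar in the inputs (all lists aligned)."""
    if len(closes) < max(high_window, base_window + 1):
        return None  # not enough history for the lookbacks

    # Proximity to the trailing high: 1.0 means the close sits at the high.
    proximity = closes[-1] / max(highs[-high_window:])

    # Compression: recent return volatility relative to a longer baseline.
    rets = pct_returns(closes)
    recent_vol = pstdev(rets[-comp_window:])
    base_vol = pstdev(rets[-base_window:])
    compression = recent_vol / base_vol if base_vol > 0 else float("nan")

    # Dollar-volume trend: recent average relative to the baseline average.
    dollar_vol = [c * v for c, v in zip(closes, volumes)]
    dv_trend = mean(dollar_vol[-comp_window:]) / mean(dollar_vol[-base_window:])

    return {"proximity": proximity, "compression": compression,
            "dollar_volume_trend": dv_trend}
```

A compression value well below `1` together with proximity near `1` and a dollar-volume trend above `1` would correspond to the setup described in this family; how the features are combined into a single score is left open here.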
+
+### 3. Gap-and-Go / High-Volume Continuation
+
+Target the continuation phase that follows an abnormal return-and-volume shock, rather than blindly chasing the first event day.
+
+Core signal ingredients:
+
+- abnormal `1d` or `3d` return
+- abnormal volume versus trailing `60d`
+- post-event price holding above the event anchor
+- subsequent breakout continuation
+
+This family has high potential upside but is more sensitive to cost assumptions and market regime.
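A sketch of how shock events and the "holding above the event anchor" condition might be detected for one stock. The thresholds (`5%` return, `3x` median volume, `3` hold days) and the choice of the pre-event close as the anchor are assumptions for illustration; the spec does not fix them.

```python
from statistics import median


def find_shock_events(closes, volumes, vol_window=60,
                      ret_threshold=0.05, vol_multiple=3.0):
    """Return indices t where the 1d return and volume are both abnormal.

    The volume baseline is the median of the `vol_window` days before t,
    so day t itself never enters its own baseline.
    """
    events = []
    for t in range(vol_window, len(closes)):
        ret = closes[t] / closes[t - 1] - 1.0
        base_volume = median(volumes[t - vol_window:t])
        if ret >= ret_threshold and volumes[t] >= vol_multiple * base_volume:
            events.append(t)
    return events


def holds_above_anchor(closes, lows, event_idx, hold_days=3):
    """True if every low in the hold window stays above the pre-event close."""
    window = lows[event_idx + 1:event_idx + 1 + hold_days]
    if len(window) < hold_days:
        return False  # hold period not complete yet, so no decision
    anchor = closes[event_idx - 1]  # assumed anchor: close before the shock
    return all(low > anchor for low in window)
```

Under this sketch the earliest entry decision falls at the close of the last hold day, which keeps the strategy in the continuation phase rather than on the event day itself.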
+
+### 4. Regime-Gated Cross-Sectional Alpha
+
+Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.
+
+Core signal ingredients:
+
+- market risk-on versus risk-off state
+- industry ETF leadership
+- relative strength
+- recovery from drawdowns
+- trend quality
+- near-`52w` high behavior
+- price/volume confirmation
+
+This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
+
+## Prioritization
+
+Recommended implementation order:
+
+1. `Breakout After Compression`
+2. `Regime-Gated Cross-Sectional Alpha`
+3. `Gap-and-Go / High-Volume Continuation`
+4. `Earnings Drift Proxy` only after validating free event-data quality
+
+Rationale:
+
+- `Breakout After Compression` is the most implementable and least ambiguous with free data.
+- `Regime-Gated Cross-Sectional Alpha` provides a shared control layer for the rest of the framework.
+- `Gap-and-Go` has higher upside but also higher sensitivity to assumptions.
+- `Earnings Drift Proxy` is theoretically powerful but should not become the project bottleneck if free event history is incomplete.
+
+## Data Layer
+
+The framework needs a richer data layer than the current `close/open` setup.
+
+### Required price fields
+
+Daily US market data should support at least:
+
+- `open`
+- `high`
+- `low`
+- `close`
+- `volume`
+
+This is required to define:
+
+- real breakouts
+- gap events
+- volatility compression
+- abnormal dollar volume
+
+### Required ETF layer
+
+Add stable market and industry ETFs for regime and leadership analysis, at minimum:
+
+- `SPY`
+- `QQQ`
+- `IWM`
+- `MDY`
+- `XLF`
+- `XLK`
+- `XLI`
+- `XLV`
+- `XLY`
+- `XLP`
+- `XLE`
+- `XLU`
+- `XLRE`
+- `XLB`
+- `SOXX`
+- `IGV`
+- `SMH`
+
+### Universe modes
+
+The framework must support two explicit modes.
+
+#### Strict mode
+
+Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.
+
+#### Exploratory mode
+
+Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.
+
+## Universe Construction Rules
+
+The tradable universe must be computed daily from lagged information.
+
+### Daily eligibility rules
+
+Each stock may enter the candidate set only if all required conditions hold as of `t-1`:
+
+- enough listing history exists to compute the strategy lookbacks
+- enough valid volume observations exist
+- minimum lagged price threshold is met
+- minimum lagged dollar-volume threshold is met
+
+Representative defaults:
+
+- `close[t-1] > 5`
+- `median_dollar_volume_60d[t-1] > $20M` in `strict` mode
+- `median_dollar_volume_60d[t-1] > $5M` in `exploratory` mode
+- `>= 252` valid trading days before eligibility
+- `>= 40` valid volume days in the trailing `60d`
+
+Thresholds should be strategy-specific and tunable in robustness sweeps.
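The representative defaults above can be expressed as a single lagged check. The sketch below assumes one stock's history is passed as lists that end at `t-1`, that a missing volume is marked with `None`, and that the function and constant names are hypothetical.

```python
from statistics import median

# Minimum 60d median dollar volume per universe mode (defaults from the spec).
MODE_MIN_DOLLAR_VOLUME = {"strict": 20_000_000, "exploratory": 5_000_000}


def is_eligible(closes, volumes, mode="strict", min_price=5.0,
                min_history=252, adv_window=60, min_volume_days=40):
    """Decide eligibility for day t using bars through t-1 only.

    `closes` and `volumes` are aligned, oldest first, and end at t-1.
    """
    if len(closes) < min_history:
        return False  # not enough listing history
    if closes[-1] <= min_price:
        return False  # lagged price threshold not met
    recent = zip(closes[-adv_window:], volumes[-adv_window:])
    dollar_volumes = [c * v for c, v in recent if v is not None and v > 0]
    if len(dollar_volumes) < min_volume_days:
        return False  # too few valid volume observations
    return median(dollar_volumes) > MODE_MIN_DOLLAR_VOLUME[mode]
```

Because the caller must slice the history at `t-1` before calling, the timing convention stays visible at the call site instead of being hidden inside the helper.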
+
+### Industry mapping
+
+Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over `63/126d` windows.
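One way the rolling-correlation proxy could look, as a sketch: compute trailing correlation and beta of a stock against each industry ETF and take the best match. The function names are hypothetical, and the return series are assumed to be aligned daily returns that end at `t-1`.

```python
import math


def trailing_corr_and_beta(stock_returns, etf_returns, window=63):
    """Correlation and OLS beta over the last `window` aligned daily returns."""
    if min(len(stock_returns), len(etf_returns)) < window:
        return None
    y, x = stock_returns[-window:], etf_returns[-window:]
    mx, my = sum(x) / window, sum(y) / window
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a flat series has no defined correlation
    return sxy / math.sqrt(sxx * syy), sxy / sxx


def best_industry_proxy(stock_returns, etf_returns_by_ticker, window=63):
    """Return (ticker, corr, beta) for the ETF with the highest trailing correlation."""
    best = None
    for ticker, etf_returns in etf_returns_by_ticker.items():
        stats = trailing_corr_and_beta(stock_returns, etf_returns, window)
        if stats is not None and (best is None or stats[0] > best[1]):
            best = (ticker, stats[0], stats[1])
    return best
```

Re-running this each day on lagged windows gives an industry assignment that can change over time, which is the property static sector labels lack.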
+
+## Anti-Lookahead Rules
+
+The framework must enforce the following rules consistently.
+
+1. Signals computed from day-`t` daily bars may be traded no earlier than `t+1`.
+2. If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
+3. Rolling inputs for liquidity, volatility, and breakout logic must use complete lagged windows with explicit timing semantics.
+4. Cross-sectional ranking must happen only within the daily eligible universe.
+5. Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.
+
+## Execution Convention
+
+Default execution convention:
+
+- observe data through `t` close
+- compute signal after the `t` close
+- trade at `t+1`
+
+The framework may compare `t+1 open` and `t+1 close` execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
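To make the timing concrete, here is a sketch of the `t+1 open` variant for a single asset. It assumes `signal[t]` is the target weight decided after the day-`t` close and that positions are marked open to open; the function name is hypothetical and transaction costs are ignored.

```python
def next_open_returns(signal, opens):
    """Per-period strategy returns under t+1 open execution.

    The weight signal[t] is filled at opens[t+1] and marked to opens[t+2],
    so no return earned before the fill is ever credited to the signal.
    """
    out = []
    for t in range(len(opens) - 2):
        out.append(signal[t] * (opens[t + 2] / opens[t + 1] - 1.0))
    return out
```

For example, with `opens = [100, 110, 121, 121]` and a signal of `1` on the first day, the credited return is `121/110 - 1`, not the `110/100 - 1` move that had already happened before the trade could be placed.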
+
+## Backtest and Evaluation Framework
+
+Every strategy family must run through a single pipeline that:
+
+1. loads required market data
+2. constructs the daily eligible universe
+3. computes regime filters
+4. computes strategy scores or event states
+5. builds a `long-only` portfolio
+6. applies transaction costs
+7. reports `1/2/3/5/10y` windows
+8. records robustness diagnostics
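Step 7 could be implemented by slicing one long return series into trailing windows, as sketched below. The helper names are hypothetical, and the windows are assumed to be trailing calendar-year spans ending on a common end date.

```python
from datetime import date


def trailing_window_start(end, years):
    """Date exactly `years` before `end` (Feb 29 falls back to Feb 28)."""
    try:
        return end.replace(year=end.year - years)
    except ValueError:
        return end.replace(year=end.year - years, day=28)


def slice_windows(dated_returns, end, horizons=(1, 2, 3, 5, 10)):
    """Map '1y', '2y', ... to the returns dated within each trailing window."""
    windows = {}
    for years in horizons:
        start = trailing_window_start(end, years)
        windows[f"{years}y"] = [r for d, r in dated_returns if start < d <= end]
    return windows
```

Slicing a single backtest run this way keeps every horizon on the same universe, costs, and execution assumptions, so the windows stay comparable.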
+
+### Portfolio defaults
+
+Initial baseline settings:
+
+- `long-only`
+- concentrated books such as `top 5`, `top 10`, `top 20`
+- start with `equal weight`
+- add `inverse-vol` weighting only as a secondary comparison
+
+Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
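A sketch of the equal-weight `top N` baseline. It assumes scores arrive as a ticker-to-score mapping with `None` for missing values, and that the eligible universe is a set of tickers; the function name and the alphabetical tie-break are illustrative choices.

```python
def top_n_equal_weight(scores, eligible, n=10):
    """Rank only eligible names, keep the top n, and weight them equally."""
    candidates = [t for t in scores if t in eligible and scores[t] is not None]
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-scores[t], t))
    picks = ranked[:n]
    if not picks:
        return {}
    return {t: 1.0 / len(picks) for t in picks}
```

Filtering by eligibility before ranking means an ineligible name can never take a slot, however high its score. When fewer than `n` names qualify, this sketch stays fully invested in the names it has; holding cash instead would be an equally valid choice.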
+
+### Required robustness checks
+
+Any strategy candidate that looks strong must automatically be re-run under:
+
+- tighter liquidity thresholds
+- both smaller and larger position counts
+- higher trading costs
+- different rebalance frequencies
+- exclusion of the lowest-liquidity or smallest-cap tail
+
+Only strategies that survive these perturbations should be promoted to `Tier A`.
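The automatic re-runs could be driven by a parameter grid like the one sketched below. Every value in the grid, the `Sharpe` pass threshold, and the required pass fraction are placeholders chosen for illustration, not recommendations from this spec.

```python
from itertools import product

# Hypothetical perturbation grid; values are placeholders.
ROBUSTNESS_GRID = {
    "min_dollar_volume": [20e6, 50e6],
    "top_n": [5, 10, 20],
    "cost_bps": [5, 10, 20],
    "rebalance_days": [1, 5, 21],
}


def robustness_configs(grid):
    """Expand the grid into one config dict per parameter combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def survives(results, min_sharpe=0.5, min_pass_fraction=0.8):
    """True if enough perturbed runs keep Sharpe at or above the threshold."""
    if not results:
        return False
    passed = sum(1 for r in results if r["Sharpe"] >= min_sharpe)
    return passed / len(results) >= min_pass_fraction
```

Each config would be passed to the same pipeline run, and `survives` applied to the collected metric dicts before a candidate is considered for `Tier A`.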
+
+## Repository Changes
+
+The following repository changes are required.
+
+### New modules
+
+#### `research/us_universe.py`
+
+Responsibilities:
+
+- build daily tradable-universe masks
+- support `strict` and `exploratory` modes
+- enforce lagged eligibility rules
+
+#### `data_manager.py` extension or new `market_data.py`
+
+Responsibilities:
+
+- support daily US `OHLCV`
+- support ETF data updates
+- preserve existing price-loading workflows where practical
+
+#### `research/regime_filters.py`
+
+Responsibilities:
+
+- market risk-on/risk-off filters
+- ETF leadership signals
+- breadth and relative-strength helpers
+
+#### `research/event_factors.py`
+
+Responsibilities:
+
+- breakout-compression scores
+- gap-continuation scores
+- high-volume continuation logic
+- earnings-drift proxy logic
+
+#### `research/us_alpha_pipeline.py`
+
+Responsibilities:
+
+- orchestrate end-to-end research runs
+- load data
+- build universe masks
+- run strategy families
+- produce windowed rankings
+- label output as `strict` or `exploratory`
+
+#### `research/us_alpha_report.py`
+
+Responsibilities:
+
+- format tables and CSV outputs
+- summarize results by family and horizon
+- support markdown export if needed
+
+## Research Phasing
+
+The implementation should be split into two phases.
+
+### Phase 1
+
+Build the strict, defensible research backbone:
+
+- PIT S&P 500 universe
+- OHLCV data support
+- ETF regime filters
+- `Breakout After Compression`
+- `Regime-Gated Cross-Sectional Alpha`
+- `Gap-and-Go / High-Volume Continuation`
+- unified backtest and reporting pipeline
+
+This phase should produce a clean research system that is difficult to fool with future information.
+
+### Phase 2
+
+Expand into higher-upside exploratory research:
+
+- wider US stock universe
+- broader signal scanning
+- stronger CAGR search
+- explicit exploratory labeling
+
+This phase is for alpha discovery, not for making final claims about unbiased production performance.
+
+## Recommended Output
+
+The finished framework should produce:
+
+- a repeatable research entrypoint for US alpha studies
+- CSV outputs for `1/2/3/5/10y` windows
+- a ranked table of strategy families
+- tier classification for candidates
+- notes on where near-`50% CAGR` outcomes come from and whether they remain credible after tightening assumptions
+
+## Non-Goals
+
+This project does not aim to:
+
+- promise stable `10y 50% CAGR`
+- claim a fully point-in-time-clean all-US-stock universe from free data alone
+- optimize to a single headline metric at the expense of realism
+- treat exploratory full-market scans as production-quality evidence
+
+## Key Decision
+
+The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.