docs: add US alpha research design spec

2026-04-17 23:41:10 +08:00
parent 5e1c4a681d
commit 7239310be3
1 changed files with 376 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md
+++ b/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md
@@ -0,0 +1,376 @@
 # US High-Alpha Research Design
 **Date:** 2026-04-17
 ## Goal
 Build a research framework for US `long-only` equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over `1/2/3/5/10y` windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.
 ## Constraints
 - Data sources must be free or already accessible from the current project environment.
 - Portfolio construction must be `long-only`.
 - The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
  - `strict` results from a point-in-time-clean universe.
  - `exploratory` results from a wider free-data universe that is not fully point-in-time-clean.
 - All signals must use only information available at the time of decision.
 - The framework must explicitly guard against:
  - survivorship bias
  - lookahead bias
  - static industry-label leakage
  - microcap and illiquidity contamination
 ## Success Criteria
 The framework is successful if it produces:
 1. A unified research and backtest pipeline for US strategies.
 2. A ranked comparison of `3-5` high-value strategy families across `1/2/3/5/10y`.
 3. Metrics that go beyond headline CAGR, including:
   - `CAGR`
   - `Sharpe`
   - `Sortino`
   - `MaxDD`
   - `Calmar`
   - `Turnover`
   - `Average positions`
   - `Median ADV usage`
   - `Subperiod stability`
 4. Tiered interpretation of results:
   - `Tier A`: realistic and tradable under tighter liquidity assumptions
   - `Tier B`: strong alpha but lower capacity
   - `Tier C`: attractive only under loose assumptions and not suitable as a production candidate
 Any strategy that reports a CAGR near `50%` must also come with an explanation of:
 - which market regime contributed most of the return
 - whether performance depends on low-liquidity or small-cap tails
 - whether results survive after removing the most extreme tail names
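As a rough illustration of the return-based metrics listed above, here is a minimal sketch. It assumes simple daily returns, a risk-free rate of zero, and 252 trading days per year; the function name `performance_metrics` and the constant `TRADING_DAYS` are hypothetical and not part of the existing repository. `Turnover`, `Average positions`, and `Median ADV usage` need position-level data and are left out.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization factor


def performance_metrics(daily_returns: pd.Series) -> dict:
    """Headline metrics from simple daily returns (risk-free rate assumed to be 0)."""
    r = daily_returns.dropna()
    equity = (1.0 + r).cumprod()
    years = len(r) / TRADING_DAYS
    cagr = equity.iloc[-1] ** (1.0 / years) - 1.0

    ann_mean = r.mean() * TRADING_DAYS
    ann_vol = r.std(ddof=1) * np.sqrt(TRADING_DAYS)
    # Downside deviation: annualized root-mean-square of the negative returns only.
    downside = np.sqrt((r.clip(upper=0.0) ** 2).mean() * TRADING_DAYS)

    # Drawdown is measured against the running peak, including the starting capital of 1.0.
    peak = equity.cummax().clip(lower=1.0)
    max_dd = (equity / peak - 1.0).min()

    return {
        "CAGR": cagr,
        "Sharpe": ann_mean / ann_vol if ann_vol > 0 else np.nan,
        "Sortino": ann_mean / downside if downside > 0 else np.nan,
        "MaxDD": max_dd,
        "Calmar": cagr / abs(max_dd) if max_dd < 0 else np.nan,
    }
```

Running the same function on each subperiod is one simple way to approximate the `Subperiod stability` check.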
 ## Research Philosophy
 This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a `10y 50% CAGR` should be treated as an upper-end outcome that may appear in selective windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over `3/5y`, still meaningfully outperform over `10y`, and remain robust after tightening assumptions.
 ## Strategy Families
 The research effort will focus on four strategy families.
 ### 1. Earnings Drift Proxy
 Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.
 Primary implementation order:
 - use free historical earnings date data if it is stable enough
 - otherwise fall back to price-and-volume-defined event proxies
 Core signal ingredients:
 - strong post-event excess return over `1-3` days
 - abnormal volume
 - gap that does not immediately fill
 - price holding near short- and medium-term highs after the event
 ### 2. Breakout After Compression
 Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.
 Core signal ingredients:
 - proximity to `120d` or `252d` highs
 - volatility compression over the prior `20-40` trading days
 - rising dollar volume
 - positive relative strength versus market and industry proxies
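The four ingredients above could be computed from daily OHLCV roughly as follows. This is a sketch under stated assumptions, not the spec's implementation: the function name, the column names (`high`, `close`, `volume`), the `20`/`60`-day dollar-volume windows, and the `120`-day volatility baseline are illustrative choices. All features are as of the close of day `t`.

```python
import pandas as pd


def breakout_compression_features(
    ohlcv: pd.DataFrame,
    benchmark_close: pd.Series,
    high_window: int = 252,
    compress_window: int = 20,
    baseline_window: int = 120,
    rs_window: int = 63,
) -> pd.DataFrame:
    """Per-stock breakout/compression features, all computed from data through day t."""
    close = ohlcv["close"]
    ret = close / close.shift(1) - 1.0

    # 1.0 means the close sits at the rolling high; lower values are further below it.
    proximity = close / ohlcv["high"].rolling(high_window).max()

    # Below 1.0 means recent realized volatility is compressed versus its longer baseline.
    compression = ret.rolling(compress_window).std() / ret.rolling(baseline_window).std()

    # Above 1.0 means short-term dollar volume exceeds its trailing 60-day average.
    dollar_volume = close * ohlcv["volume"]
    dv_trend = dollar_volume.rolling(20).mean() / dollar_volume.rolling(60).mean()

    # Relative strength: stock return minus benchmark return over the same window.
    rel_strength = close / close.shift(rs_window) - benchmark_close / benchmark_close.shift(rs_window)

    return pd.DataFrame(
        {
            "proximity_to_high": proximity,
            "vol_compression": compression,
            "dollar_volume_trend": dv_trend,
            "relative_strength": rel_strength,
        }
    )
```

How these features are combined into a single score (ranks, thresholds, or weights) is a separate design decision that the spec leaves open.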
 ### 3. Gap-and-Go / High-Volume Continuation
 Target the continuation phase that follows an abnormal return and volume shock, rather than blindly chasing the first event day.
 Core signal ingredients:
 - abnormal `1d` or `3d` return
 - abnormal volume versus trailing `60d`
 - post-event price holding above the event anchor
 - subsequent breakout continuation
 This family has high potential upside but is more sensitive to cost assumptions and market regime.
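One way to express the event and holding logic is sketched below. The thresholds (`3` trailing standard deviations, `2x` median volume), the `5`-day holding window, and the choice of the event-day low as the "event anchor" are all assumptions made for illustration; the spec does not define them.

```python
import pandas as pd


def gap_continuation_state(
    ohlcv: pd.DataFrame,
    ret_z_min: float = 3.0,
    vol_ratio_min: float = 2.0,
    lookback: int = 60,
    hold_days: int = 5,
) -> pd.DataFrame:
    """Flag abnormal return/volume days and whether price still holds above the anchor."""
    close, low, volume = ohlcv["close"], ohlcv["low"], ohlcv["volume"]
    ret = close / close.shift(1) - 1.0

    # Baselines are shifted by one day so each day is compared against its own past only.
    ret_z = ret / ret.rolling(lookback).std().shift(1)
    vol_ratio = volume / volume.rolling(lookback).median().shift(1)
    event = (ret_z > ret_z_min) & (vol_ratio > vol_ratio_min)

    # Assumed anchor: the low of the event day, carried forward for hold_days sessions.
    anchor = low.where(event).ffill(limit=hold_days)

    # "Holding" checks only the current close against the anchor, on days after the event.
    holding = anchor.notna() & ~event & (close > anchor)

    return pd.DataFrame({"event": event, "anchor": anchor, "holding": holding})
```

A continuation entry would then require `holding` plus a fresh breakout condition, so that the strategy acts on the second phase rather than on the event day itself.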
 ### 4. Regime-Gated Cross-Sectional Alpha
 Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.
 Core signal ingredients:
 - market risk-on versus risk-off state
 - industry ETF leadership
 - relative strength
 - recovery from drawdowns
 - trend quality
 - near-`52w` high behavior
 - price/volume confirmation
 This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
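A minimal sketch of a regime gate is shown below. The spec does not define "risk-on", so the rule used here (SPY close above its own `200`-day moving average) and the `63`-day leadership window are assumptions, as are all function names.

```python
import pandas as pd


def market_risk_on(spy_close: pd.Series, ma_window: int = 200) -> pd.Series:
    """Assumed risk-on rule: SPY closes above its trailing moving average."""
    # Days without a full window compare against NaN and therefore evaluate to False.
    return spy_close > spy_close.rolling(ma_window).mean()


def etf_leadership_rank(etf_close: pd.DataFrame, lookback: int = 63) -> pd.DataFrame:
    """Rank ETFs each day by trailing return; rank 1.0 is the current leader."""
    trailing_ret = etf_close / etf_close.shift(lookback) - 1.0
    return trailing_ret.rank(axis=1, ascending=False)


def gate_weights(weights: pd.DataFrame, risk_on: pd.Series) -> pd.DataFrame:
    """Move the whole book to cash on risk-off days by zeroing the target weights."""
    return weights.mul(risk_on.astype(float), axis=0)
```

Both the gate and the weights are as-of-close-`t` quantities here, so the usual `t+1` execution lag still has to be applied afterwards.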
 ## Prioritization
 Recommended implementation order:
 1. `Breakout After Compression`
 2. `Regime-Gated Cross-Sectional Alpha`
 3. `Gap-and-Go / High-Volume Continuation`
 4. `Earnings Drift Proxy` only after validating free event-data quality
 Rationale:
 - `Breakout After Compression` is the most implementable and least ambiguous with free data.
 - `Regime-Gated Cross-Sectional Alpha` provides a shared control layer for the rest of the framework.
 - `Gap-and-Go` has higher upside but also higher sensitivity to assumptions.
 - `Earnings Drift Proxy` is theoretically powerful but should not become the project bottleneck if free event history is incomplete.
 ## Data Layer
 The framework needs a richer data layer than the current `close/open` setup.
 ### Required price fields
 Daily US market data should support at least:
 - `open`
 - `high`
 - `low`
 - `close`
 - `volume`
 This is required to define:
 - real breakouts
 - gap events
 - volatility compression
 - abnormal dollar volume
 ### Required ETF layer
 Add stable market and industry ETFs for regime and leadership analysis, at minimum:
 - `SPY`
 - `QQQ`
 - `IWM`
 - `MDY`
 - `XLF`
 - `XLK`
 - `XLI`
 - `XLV`
 - `XLY`
 - `XLP`
 - `XLE`
 - `XLU`
 - `XLRE`
 - `XLB`
 - `SOXX`
 - `IGV`
 - `SMH`
 ### Universe modes
 The framework must support two explicit modes.
 #### Strict mode
 Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.
 #### Exploratory mode
 Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.
 ## Universe Construction Rules
 The tradable universe must be computed daily from lagged information.
 ### Daily eligibility rules
 Each stock may enter the candidate set only if all required conditions hold as of `t-1`:
 - enough listing history exists to compute the strategy lookbacks
 - enough valid volume observations exist
 - minimum lagged price threshold is met
 - minimum lagged dollar-volume threshold is met
 Representative defaults:
 - `close[t-1] > 5`
 - `median_dollar_volume_60d[t-1] > $20M` in `strict` mode
 - `median_dollar_volume_60d[t-1] > $5M` in `exploratory` mode
 - `>= 252` valid trading days before eligibility
 - `>= 40` valid volume days in the trailing `60d`
 Thresholds should be strategy-specific and tunable in robustness sweeps.
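The representative defaults above can be turned into a daily mask along these lines. This is a sketch, assuming wide `close` and `volume` frames (dates by tickers) and treating "valid volume day" as a day with positive volume; the function name and parameter names are hypothetical.

```python
import pandas as pd


def eligibility_mask(
    close: pd.DataFrame,
    volume: pd.DataFrame,
    min_price: float = 5.0,
    min_median_dollar_volume: float = 20e6,  # strict default; use 5e6 for exploratory
    min_history_days: int = 252,
    min_volume_days: int = 40,
    window: int = 60,
) -> pd.DataFrame:
    """Boolean dates-by-tickers mask: True where a stock is tradable on day t."""
    dollar_volume = close * volume
    median_dv = dollar_volume.rolling(window, min_periods=min_volume_days).median()
    history_days = close.notna().cumsum()
    valid_volume_days = (volume > 0).astype(float).rolling(window, min_periods=1).sum()

    eligible_asof_t = (
        (close > min_price)
        & (median_dv > min_median_dollar_volume)
        & (history_days >= min_history_days)
        & (valid_volume_days >= min_volume_days)
    )
    # Shift by one row so that day t's eligibility depends only on data through t-1.
    return eligible_asof_t.shift(1, fill_value=False).astype(bool)
```

The final `shift(1)` is what enforces the `t-1` rule: every condition is evaluated on day `t-1` data and only then applied to day `t`.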
 ### Industry mapping
 Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over `63/126d` windows.
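A rolling-beta and rolling-correlation proxy could look like the sketch below. It assumes daily simple returns for one stock and a frame of industry ETF returns on the same index; the function name and the rule "closest industry = highest rolling correlation" are illustrative assumptions.

```python
import pandas as pd


def rolling_etf_exposure(stock_ret: pd.Series, etf_ret: pd.DataFrame, window: int = 126):
    """Rolling beta and correlation of one stock to each industry ETF."""
    beta = pd.DataFrame(
        {
            name: stock_ret.rolling(window).cov(etf_ret[name]) / etf_ret[name].rolling(window).var()
            for name in etf_ret.columns
        }
    )
    corr = pd.DataFrame(
        {name: stock_ret.rolling(window).corr(etf_ret[name]) for name in etf_ret.columns}
    )
    # Assumed mapping rule: the ETF with the highest trailing correlation on each date.
    closest = corr.dropna(how="any").idxmax(axis=1)
    return beta, corr, closest
```

Because every window ends on the row's own date, the mapping for day `t` uses no information from after `t`; it should still be lagged like any other signal before trading.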
 ## Anti-Lookahead Rules
 The framework must enforce the following rules consistently.
 1. Signals computed from day-`t` daily bars may be traded no earlier than `t+1`.
 2. If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
 3. Rolling inputs for liquidity, volatility, and breakout logic must use complete windows that end at or before the decision date, and each input must state explicitly whether its window ends at `t` or `t-1`.
 4. Cross-sectional ranking must happen only within the daily eligible universe.
 5. Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.
 ## Execution Convention
 Default execution convention:
 - observe data through `t` close
 - compute signal after the `t` close
 - trade at `t+1`
 The framework may compare `t+1 open` and `t+1 close` execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
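The timing convention can be made concrete with a small sketch. It assumes `weights` holds target weights decided after the close of day `t`, and that P&L is booked on the day a fill-to-fill return is realized; the function name is hypothetical. Under that bookkeeping, both the `t+1 open` and the `t+1 close` variants use the same two-row lag and differ only in which price series is passed in.

```python
import pandas as pd


def strategy_returns(weights: pd.DataFrame, fill_prices: pd.DataFrame) -> pd.Series:
    """Daily portfolio returns when weights decided at t are filled at t+1.

    Pass open prices for t+1-open execution or close prices for t+1-close execution.
    """
    # Return realized on day d covers fill_prices[d-1] -> fill_prices[d].
    asset_ret = fill_prices / fill_prices.shift(1) - 1.0
    # A position filled on day d-1 was decided on day d-2, hence the two-row lag.
    held = weights.shift(2)
    return (held * asset_ret).sum(axis=1)
```

For example, a signal on day 0 filled at the day-1 open earns the day-1-open to day-2-open move, which this function books on day 2.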
 ## Backtest and Evaluation Framework
 Every strategy family must run through a single pipeline that:
 1. loads required market data
 2. constructs the daily eligible universe
 3. computes regime filters
 4. computes strategy scores or event states
 5. builds a `long-only` portfolio
 6. applies transaction costs
 7. reports `1/2/3/5/10y` windows
 8. records robustness diagnostics
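For step 7, the trailing windows could be cut from one daily return series as sketched below. The function name is hypothetical, and the rule of skipping a window when the history is too short to cover it is an assumption rather than something the spec states.

```python
import pandas as pd


def trailing_windows(daily_returns: pd.Series, years=(1, 2, 3, 5, 10)) -> dict:
    """Slice a date-indexed return series into trailing N-year windows."""
    end = daily_returns.index[-1]
    windows = {}
    for y in years:
        start = end - pd.DateOffset(years=y)
        # Skip windows the available history cannot fully cover.
        if daily_returns.index[0] > start:
            continue
        windows[f"{y}y"] = daily_returns.loc[daily_returns.index > start]
    return windows
```

Each slice can then be passed to the same metric function, so that every horizon is evaluated with identical logic.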
 ### Portfolio defaults
 Initial baseline settings:
 - `long-only`
 - concentrated books such as `top 5`, `top 10`, `top 20`
 - start with `equal weight`
 - add `inverse-vol` weighting only as a secondary comparison
 Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
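An equal-weight `top N` book can be sketched as follows. It assumes wide `scores` and boolean `eligible` frames (dates by tickers) with identical shape; the function name is hypothetical. Ranking happens only after masking to the eligible universe, in line with the anti-lookahead rules.

```python
import pandas as pd


def top_n_equal_weight(scores: pd.DataFrame, eligible: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Equal-weight the n highest-scoring eligible names on each date."""
    # Mask first, then rank, so ineligible names never compete for a slot.
    masked = scores.where(eligible)
    ranks = masked.rank(axis=1, ascending=False, method="first")
    selected = (ranks <= n).astype(float)

    # Divide by the number of names actually selected; dates with none stay in cash.
    counts = selected.sum(axis=1)
    return selected.div(counts.where(counts > 0), axis=0).fillna(0.0)
```

If fewer than `n` names are eligible on a date, the book is split equally across those that are, rather than leaving part of the capital idle; that choice is an assumption and may need revisiting.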
 ### Required robustness checks
 Any strategy candidate that looks strong must automatically be re-run under:
 - tighter liquidity thresholds
 - both fewer and more positions than the baseline
 - higher trading costs
 - different rebalance frequencies
 - exclusion of the lowest-liquidity or smallest-cap tail
 Only strategies that survive these perturbations should be promoted to `Tier A`.
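The re-run requirement lends itself to a simple parameter grid. The sketch below is illustrative only: the spec names the perturbation dimensions but not their levels, so every value in `PERTURBATIONS`, the Sharpe-based survival rule, and both function names are assumptions.

```python
from itertools import product

# Hypothetical perturbation levels; the spec defines the dimensions, not these values.
PERTURBATIONS = {
    "min_median_dollar_volume": [20e6, 50e6, 100e6],
    "top_n": [5, 10, 20],
    "cost_bps": [5, 10, 20],
    "rebalance_days": [1, 5, 21],
    "drop_liquidity_tail_pct": [0.0, 0.1],
}


def robustness_configs(perturbations=PERTURBATIONS):
    """Expand the perturbation dictionary into one config dict per combination."""
    keys = list(perturbations)
    return [dict(zip(keys, values)) for values in product(*(perturbations[k] for k in keys))]


def survives(results, min_sharpe=0.5, min_pass_fraction=0.8):
    """Assumed survival rule: enough perturbed runs keep Sharpe above a floor."""
    passed = sum(1 for r in results if r["Sharpe"] >= min_sharpe)
    return passed / len(results) >= min_pass_fraction
```

Each config would be fed to the backtest pipeline, and the resulting metric dictionaries passed to `survives` before a candidate is considered for `Tier A`.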
 ## Repository Changes
 The following repository changes are required.
 ### New modules
 #### `research/us_universe.py`
 Responsibilities:
 - build daily tradable-universe masks
 - support `strict` and `exploratory` modes
 - enforce lagged eligibility rules
 #### `data_manager.py` extension or new `market_data.py`
 Responsibilities:
 - support daily US `OHLCV`
 - support ETF data updates
 - preserve existing price-loading workflows where practical
 #### `research/regime_filters.py`
 Responsibilities:
 - market risk-on/risk-off filters
 - ETF leadership signals
 - breadth and relative-strength helpers
 #### `research/event_factors.py`
 Responsibilities:
 - breakout-compression scores
 - gap-continuation scores
 - high-volume continuation logic
 - earnings-drift proxy logic
 #### `research/us_alpha_pipeline.py`
 Responsibilities:
 - orchestrate end-to-end research runs
 - load data
 - build universe masks
 - run strategy families
 - produce windowed rankings
 - label output as `strict` or `exploratory`
 #### `research/us_alpha_report.py`
 Responsibilities:
 - format tables and CSV outputs
 - summarize results by family and horizon
 - support markdown export if needed
 ## Research Phasing
 The implementation should be split into two phases.
 ### Phase 1
 Build the strict, defensible research backbone:
 - PIT S&P 500 universe
 - OHLCV data support
 - ETF regime filters
 - `Breakout After Compression`
 - `Regime-Gated Cross-Sectional Alpha`
 - `Gap-and-Go / High-Volume Continuation`
 - unified backtest and reporting pipeline
 This phase should produce a clean research system that is difficult to fool with future information.
 ### Phase 2
 Expand into higher-upside exploratory research:
 - wider US stock universe
 - broader signal scanning
 - stronger CAGR search
 - explicit exploratory labeling
 This phase is for alpha discovery, not for making final claims about unbiased production performance.
 ## Recommended Output
 The finished framework should produce:
 - a repeatable research entrypoint for US alpha studies
 - CSV outputs for `1/2/3/5/10y` windows
 - a ranked table of strategy families
 - tier classification for candidates
 - notes on where near-`50% CAGR` outcomes come from and whether they remain credible after tightening assumptions
 ## Non-Goals
 This project does not aim to:
 - promise stable `10y 50% CAGR`
 - claim a fully point-in-time-clean all-US-stock universe from free data alone
 - optimize to a single headline metric at the expense of realism
 - treat exploratory full-market scans as production-quality evidence
 ## Key Decision
 The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.