diff --git a/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md b/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md
new file mode 100644
index 0000000..2a175df
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-17-us-alpha-research-design.md
@@ -0,0 +1,376 @@
+# US High-Alpha Research Design
+
+**Date:** 2026-04-17
+
+## Goal
+
+Build a research framework for US `long-only` equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over `1/2/3/5/10y` windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.
+
+## Constraints
+
+- Data sources must be free or already accessible from the current project environment.
+- Portfolio construction must be `long-only`.
+- The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
+  - `strict` results from a point-in-time-clean universe.
+  - `exploratory` results from a wider free-data universe that is not fully point-in-time-clean.
+- All signals must use only information available at the time of decision.
+- The framework must explicitly guard against:
+  - survivorship bias
+  - lookahead bias
+  - static industry-label leakage
+  - microcap and illiquidity contamination
+
+## Success Criteria
+
+The framework is successful if it produces:
+
+1. A unified research and backtest pipeline for US strategies.
+2. A ranked comparison of `3-5` high-value strategy families across `1/2/3/5/10y`.
+3. Metrics that go beyond headline CAGR, including:
+   - `CAGR`
+   - `Sharpe`
+   - `Sortino`
+   - `MaxDD`
+   - `Calmar`
+   - `Turnover`
+   - `Average positions`
+   - `Median ADV usage`
+   - `Subperiod stability`
+4. Tiered interpretation of results:
+   - `Tier A`: realistic and tradable under tighter liquidity assumptions
+   - `Tier B`: strong alpha but lower capacity
+   - `Tier C`: attractive only under loose assumptions and not suitable as a production candidate
+
+Any strategy that reports near-`50% CAGR` must also explain:
+
+- which market regime contributed most of the return
+- whether performance depends on low-liquidity or small-cap tails
+- whether results survive after removing the most extreme tail names
+
+## Research Philosophy
+
+This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a `10y 50% CAGR` should be treated as an upper-end outcome that may appear in selective windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over `3/5y`, still meaningfully outperform over `10y`, and remain robust after tightening assumptions.
+
+## Strategy Families
+
+The research effort will focus on four strategy families.
+
+### 1. Earnings Drift Proxy
+
+Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.
+
+Primary implementation order:
+
+- use free historical earnings date data if it is stable enough
+- otherwise fall back to price-and-volume-defined event proxies
+
+Core signal ingredients:
+
+- strong post-event excess return over `1-3` days
+- abnormal volume
+- gap that does not immediately fill
+- price holding near short- and medium-term highs after the event
+
+### 2. Breakout After Compression
+
+Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.
+
+Core signal ingredients:
+
+- proximity to `120d` or `252d` highs
+- volatility compression over the prior `20-40` trading days
+- rising dollar volume
+- positive relative strength versus market and industry proxies
+
+### 3. Gap-and-Go / High-Volume Continuation
+
+Target the second phase of move continuation after abnormal return and volume shocks rather than blindly chasing the first event day.
+
+Core signal ingredients:
+
+- abnormal `1d` or `3d` return
+- abnormal volume versus trailing `60d`
+- post-event price holding above the event anchor
+- subsequent breakout continuation
+
+This family has high potential upside but is more sensitive to cost assumptions and market regime.
+
+### 4. Regime-Gated Cross-Sectional Alpha
+
+Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.
+
+Core signal ingredients:
+
+- market risk-on versus risk-off state
+- industry ETF leadership
+- relative strength
+- recovery from drawdowns
+- trend quality
+- near-`52w` high behavior
+- price/volume confirmation
+
+This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
+
+## Prioritization
+
+Recommended implementation order:
+
+1. `Breakout After Compression`
+2. `Regime-Gated Cross-Sectional Alpha`
+3. `Gap-and-Go / High-Volume Continuation`
+4. `Earnings Drift Proxy` only after validating free event-data quality
+
+Rationale:
+
+- `Breakout After Compression` is the most implementable and least ambiguous with free data.
+- `Regime-Gated Cross-Sectional Alpha` provides a shared control layer for the rest of the framework.
+- `Gap-and-Go` has higher upside but also higher sensitivity to assumptions.
+- `Earnings Drift Proxy` is theoretically powerful but should not become the project bottleneck if free event history is incomplete.
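As an illustration of the `Breakout After Compression` ingredients listed earlier, the following is a minimal sketch rather than the project's implementation. It assumes date x ticker pandas frames for `high`, `close`, and `volume`. The function name `breakout_compression_score`, the window defaults, and the equal-weighted blend of cross-sectional ranks are all hypothetical choices. The relative-strength ingredient is omitted for brevity.

```python
import pandas as pd


def _cs_rank(frame: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional percentile rank per day (row), in (0, 1]."""
    return frame.rank(axis=1, pct=True)


def breakout_compression_score(
    high: pd.DataFrame,
    close: pd.DataFrame,
    volume: pd.DataFrame,
    high_window: int = 252,   # assumed default; the spec also mentions 120d
    vol_short: int = 20,      # assumed compression window
    vol_long: int = 120,      # assumed reference window for "normal" volatility
    dv_short: int = 20,
    dv_long: int = 60,
) -> pd.DataFrame:
    """Blend three ingredients into one score; higher is better.

    Row t uses data through the t close only, so the score may be
    acted on no earlier than t+1.
    """
    # Proximity to the trailing high: 1.0 means closing at the high.
    proximity = close / high.rolling(high_window).max()

    # Volatility compression: recent realized vol relative to longer-run vol.
    # Lower is more compressed, so it enters the blend with a minus sign.
    returns = close.pct_change()
    compression = returns.rolling(vol_short).std() / returns.rolling(vol_long).std()

    # Rising dollar volume: short-window median over long-window median.
    dollar_volume = close * volume
    dv_trend = (
        dollar_volume.rolling(dv_short).median()
        / dollar_volume.rolling(dv_long).median()
    )

    # Rank each ingredient across tickers so the units are comparable.
    score = _cs_rank(proximity) + _cs_rank(-compression) + _cs_rank(dv_trend)
    return score / 3.0
```

Rows inside the warm-up period come out as `NaN`, so a caller can drop them instead of ranking on partial windows. In a full pipeline the ranks would presumably be taken only within the daily eligible universe; that masking step is left out of this sketch.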
+
+## Data Layer
+
+The framework needs a richer data layer than the current `close/open` setup.
+
+### Required price fields
+
+Daily US market data should support at least:
+
+- `open`
+- `high`
+- `low`
+- `close`
+- `volume`
+
+This is required to define:
+
+- real breakouts
+- gap events
+- volatility compression
+- abnormal dollar volume
+
+### Required ETF layer
+
+Add stable market and industry ETFs for regime and leadership analysis, at minimum:
+
+- `SPY`
+- `QQQ`
+- `IWM`
+- `MDY`
+- `XLF`
+- `XLK`
+- `XLI`
+- `XLV`
+- `XLY`
+- `XLP`
+- `XLE`
+- `XLU`
+- `XLRE`
+- `XLB`
+- `SOXX`
+- `IGV`
+- `SMH`
+
+### Universe modes
+
+The framework must support two explicit modes.
+
+#### Strict mode
+
+Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.
+
+#### Exploratory mode
+
+Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.
+
+## Universe Construction Rules
+
+The tradable universe must be computed daily from lagged information.
+
+### Daily eligibility rules
+
+Each stock may enter the candidate set only if all required conditions hold as of `t-1`:
+
+- enough listing history exists to compute the strategy lookbacks
+- enough valid volume observations exist
+- minimum lagged price threshold is met
+- minimum lagged dollar-volume threshold is met
+
+Representative defaults:
+
+- `close[t-1] > 5`
+- `median_dollar_volume_60d[t-1] > $20M` in `strict` mode
+- `median_dollar_volume_60d[t-1] > $5M` in `exploratory` mode
+- `>= 252` valid trading days before eligibility
+- `>= 40` valid volume days in the trailing `60d`
+
+Thresholds should be strategy-specific and tunable in robustness sweeps.
+
+### Industry mapping
+
+Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over `63/126d` windows.
+
+## Anti-Lookahead Rules
+
+The framework must enforce the following rules consistently.
+
+1. Signals computed from day-`t` daily bars may be traded no earlier than `t+1`.
+2. If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
+3. Rolling inputs for liquidity, volatility, and breakout logic must use complete lagged windows with explicit timing semantics.
+4. Cross-sectional ranking must happen only within the daily eligible universe.
+5. Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.
+
+## Execution Convention
+
+Default execution convention:
+
+- observe data through `t` close
+- compute signal after the `t` close
+- trade at `t+1`
+
+The framework may compare `t+1 open` and `t+1 close` execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
+
+## Backtest and Evaluation Framework
+
+Every strategy family must run through a single pipeline that:
+
+1. loads required market data
+2. constructs the daily eligible universe
+3. computes regime filters
+4. computes strategy scores or event states
+5. builds a `long-only` portfolio
+6. applies transaction costs
+7. reports `1/2/3/5/10y` windows
+8. records robustness diagnostics
+
+### Portfolio defaults
+
+Initial baseline settings:
+
+- `long-only`
+- concentrated books such as `top 5`, `top 10`, `top 20`
+- start with `equal weight`
+- add `inverse-vol` weighting only as a secondary comparison
+
+Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
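The daily eligibility rules, anti-lookahead rules, execution convention, and portfolio defaults above can be sketched together as follows. This is a hedged illustration under stated assumptions, not the repository's code. The function names (`eligibility_mask`, `top_n_equal_weight`, `backtest_next_day`) are hypothetical, inputs are assumed to be date x ticker pandas frames, the cost model is a flat per-turnover charge in basis points, and execution is assumed to be at the `t+1` close.

```python
import numpy as np
import pandas as pd


def eligibility_mask(close, volume, min_price=5.0, min_median_dv=20e6,
                     min_history=252, min_volume_days=40, dv_window=60):
    """Boolean date x ticker mask; row t depends only on data through t-1."""
    dollar_volume = close * volume
    history = close.notna().cumsum()  # valid trading days seen so far
    valid_volume = (volume > 0).astype(float).rolling(dv_window, min_periods=1).sum()
    median_dv = dollar_volume.rolling(dv_window, min_periods=min_volume_days).median()
    raw = (
        (close > min_price)
        & (median_dv > min_median_dv)
        & (history >= min_history)
        & (valid_volume >= min_volume_days)
    )
    # Lag by one row so every condition is evaluated "as of t-1".
    return raw.shift(1, fill_value=False)


def top_n_equal_weight(score, eligible, top_n=10):
    """Equal-weight the top_n scores, ranking only inside the eligible set."""
    masked = score.where(eligible)  # ineligible names become NaN before ranking
    ranks = masked.rank(axis=1, ascending=False, method="first")
    picks = ranks <= top_n
    counts = picks.sum(axis=1).replace(0, np.nan)
    return picks.astype(float).div(counts, axis=0).fillna(0.0)


def backtest_next_day(weights, close, cost_bps=10.0):
    """Net daily returns when weights decided after the t close trade at the t+1 close."""
    returns = close.pct_change().fillna(0.0)
    # A weight decided at t is executed at the t+1 close, so it first earns r[t+2].
    held = weights.shift(2).fillna(0.0)
    gross = (held * returns).sum(axis=1)
    turnover = held.diff().abs().sum(axis=1)
    return gross - turnover * cost_bps / 1e4
```

Masking before ranking is what keeps the cross-sectional ranking inside the daily eligible universe. The two-row shift in `backtest_next_day` is the conservative `t+1 close` variant. A `t+1 open` variant would need open prices and a different return definition.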
+
+### Required robustness checks
+
+Any strategy candidate that looks strong must automatically be re-run under:
+
+- tighter liquidity thresholds
+- fewer and more positions
+- higher trading costs
+- different rebalance frequencies
+- exclusion of the lowest-liquidity or smallest-cap tail
+
+Only strategies that survive these perturbations should be promoted to `Tier A`.
+
+## Repository Changes
+
+The following repository changes are required.
+
+### New modules
+
+#### `research/us_universe.py`
+
+Responsibilities:
+
+- build daily tradable-universe masks
+- support `strict` and `exploratory` modes
+- enforce lagged eligibility rules
+
+#### `data_manager.py` extension or new `market_data.py`
+
+Responsibilities:
+
+- support daily US `OHLCV`
+- support ETF data updates
+- preserve existing price-loading workflows where practical
+
+#### `research/regime_filters.py`
+
+Responsibilities:
+
+- market risk-on/risk-off filters
+- ETF leadership signals
+- breadth and relative-strength helpers
+
+#### `research/event_factors.py`
+
+Responsibilities:
+
+- breakout-compression scores
+- gap-continuation scores
+- high-volume continuation logic
+- earnings-drift proxy logic
+
+#### `research/us_alpha_pipeline.py`
+
+Responsibilities:
+
+- orchestrate end-to-end research runs
+- load data
+- build universe masks
+- run strategy families
+- produce windowed rankings
+- label output as `strict` or `exploratory`
+
+#### `research/us_alpha_report.py`
+
+Responsibilities:
+
+- format tables and CSV outputs
+- summarize results by family and horizon
+- support markdown export if needed
+
+## Research Phasing
+
+The implementation should be split into two phases.
+
+### Phase 1
+
+Build the strict, defensible research backbone:
+
+- PIT S&P 500 universe
+- OHLCV data support
+- ETF regime filters
+- `Breakout After Compression`
+- `Regime-Gated Cross-Sectional Alpha`
+- `Gap-and-Go / High-Volume Continuation`
+- unified backtest and reporting pipeline
+
+This phase should produce a clean research system that is difficult to fool with future information.
+
+### Phase 2
+
+Expand into higher-upside exploratory research:
+
+- wider US stock universe
+- broader signal scanning
+- stronger CAGR search
+- explicit exploratory labeling
+
+This phase is for alpha discovery, not for making final claims about unbiased production performance.
+
+## Recommended Output
+
+The finished framework should produce:
+
+- a repeatable research entrypoint for US alpha studies
+- CSV outputs for `1/2/3/5/10y` windows
+- a ranked table of strategy families
+- tier classification for candidates
+- notes on where near-`50% CAGR` outcomes come from and whether they remain credible after tightening assumptions
+
+## Non-Goals
+
+This project does not aim to:
+
+- promise stable `10y 50% CAGR`
+- claim a fully point-in-time-clean all-US-stock universe from free data alone
+- optimize to a single headline metric at the expense of realism
+- treat exploratory full-market scans as production-quality evidence
+
+## Key Decision
+
+The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.
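For the `1/2/3/5/10y` windowed metrics named under Success Criteria and Recommended Output, a minimal sketch of the arithmetic might look like the following. It is illustrative only. It assumes a series of daily net returns, `252` trading days per year, and a zero risk-free rate. The names `window_metrics` and `summarize_windows` are hypothetical.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization factor


def window_metrics(daily_returns: pd.Series, years: int) -> dict:
    """CAGR, Sharpe, Sortino, MaxDD, and Calmar over the trailing window."""
    r = daily_returns.dropna().iloc[-years * TRADING_DAYS:]
    equity = (1.0 + r).cumprod()
    span_years = len(r) / TRADING_DAYS
    cagr = equity.iloc[-1] ** (1.0 / span_years) - 1.0

    ann_mean = r.mean() * TRADING_DAYS
    ann_vol = r.std(ddof=1) * np.sqrt(TRADING_DAYS)
    # Downside deviation against a zero target return.
    downside = np.sqrt((r.clip(upper=0.0) ** 2).mean()) * np.sqrt(TRADING_DAYS)

    # Drawdown is measured against the running peak, floored at starting capital.
    peak = equity.cummax().clip(lower=1.0)
    max_dd = (equity / peak - 1.0).min()

    return {
        "CAGR": cagr,
        "Sharpe": ann_mean / ann_vol if ann_vol > 0 else np.nan,
        "Sortino": ann_mean / downside if downside > 0 else np.nan,
        "MaxDD": max_dd,
        "Calmar": cagr / abs(max_dd) if max_dd < 0 else np.nan,
    }


def summarize_windows(daily_returns: pd.Series, windows=(1, 2, 3, 5, 10)) -> pd.DataFrame:
    """One row per window that the return history fully covers."""
    n = daily_returns.dropna().shape[0]
    rows = {f"{y}y": window_metrics(daily_returns, y)
            for y in windows if n >= y * TRADING_DAYS}
    return pd.DataFrame.from_dict(rows, orient="index")
```

Windows that the history does not fully cover are skipped, not reported on partial data, so a short backtest cannot pass as a `10y` result.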