US High-Alpha Research Design

Date: 2026-04-17

Goal

Build a research framework for US long-only equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over 1/2/3/5/10y windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.

Constraints

  • Data sources must be free or already accessible from the current project environment.
  • Portfolio construction must be long-only.
  • The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
    • strict results from a point-in-time-clean universe
    • exploratory results from a wider free-data universe that is not fully point-in-time-clean.
  • All signals must use only information available at the time of decision.
  • The framework must explicitly guard against:
    • survivorship bias
    • lookahead bias
    • static industry-label leakage
    • microcap and illiquidity contamination

Success Criteria

The framework is successful if it produces:

  1. A unified research and backtest pipeline for US strategies.
  2. A ranked comparison of 3-5 high-value strategy families across 1/2/3/5/10y.
  3. Metrics that go beyond headline CAGR, including:
    • CAGR
    • Sharpe
    • Sortino
    • MaxDD
    • Calmar
    • Turnover
    • Average positions
    • Median ADV usage
    • Subperiod stability
  4. Tiered interpretation of results:
    • Tier A: realistic and tradable under tighter liquidity assumptions
    • Tier B: strong alpha but lower capacity
    • Tier C: attractive only under loose assumptions and not suitable as a production candidate
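The return-based metrics in item 3 can be computed from a single daily return series. The following is a minimal sketch, not the project's implementation: it assumes daily simple returns, a 252-day year, a zero risk-free rate, and a zero downside target for Sortino. The function name `summarize_returns` is hypothetical.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor


def summarize_returns(daily_returns):
    """Return CAGR, Sharpe, Sortino, MaxDD, and Calmar for daily simple returns."""
    n = len(daily_returns)
    if n < 2:
        raise ValueError("need at least two daily returns")

    # Compound the returns into an equity curve that starts at 1.0.
    equity, curve = 1.0, []
    for r in daily_returns:
        equity *= 1.0 + r
        curve.append(equity)

    cagr = curve[-1] ** (TRADING_DAYS / n) - 1.0

    avg = mean(daily_returns)
    vol = stdev(daily_returns)
    sharpe = avg / vol * math.sqrt(TRADING_DAYS) if vol > 0 else float("nan")

    # Downside deviation: root mean square of the negative returns only.
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in daily_returns) / n)
    sortino = avg / downside * math.sqrt(TRADING_DAYS) if downside > 0 else float("nan")

    # Maximum drawdown: worst peak-to-trough decline of the equity curve.
    peak, max_dd = 1.0, 0.0
    for value in curve:
        peak = max(peak, value)
        max_dd = min(max_dd, value / peak - 1.0)

    calmar = cagr / abs(max_dd) if max_dd < 0 else float("nan")
    return {"cagr": cagr, "sharpe": sharpe, "sortino": sortino,
            "max_dd": max_dd, "calmar": calmar}
```

Turnover, average positions, and median ADV usage need the weight and volume history as well, so they are left out of this sketch.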

Any strategy that reports near-50% CAGR must also explain:

  • which market regime contributed most of the return
  • whether performance depends on low-liquidity or small-cap tails
  • whether results survive after removing the most extreme tail names

Research Philosophy

This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a 50% CAGR sustained over 10y should be treated as an upper-end outcome that may appear in selective windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over 3/5y, still meaningfully outperform over 10y, and remain robust after tightening assumptions.

Strategy Families

The research effort will focus on four strategy families.

1. Earnings Drift Proxy

Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.

Primary implementation order:

  • use free historical earnings date data if it is stable enough
  • otherwise fall back to price-and-volume-defined event proxies

Core signal ingredients:

  • strong post-event excess return over 1-3 days
  • abnormal volume
  • gap that does not immediately fill
  • price holding near short- and medium-term highs after the event

2. Breakout After Compression

Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.

Core signal ingredients:

  • proximity to 120d or 252d highs
  • volatility compression over the prior 20-40 trading days
  • rising dollar volume
  • positive relative strength versus market and industry proxies
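The four ingredients above can be sketched as per-day features for a single stock. This is an illustration under stated assumptions: the function name `breakout_compression_features`, the window defaults, and the use of a volatility ratio for compression are all hypothetical choices, and the features are returned separately rather than blended because the spec does not define a weighting.

```python
import pandas as pd


def breakout_compression_features(close, volume, bench_close,
                                  high_window=252, short_window=20, long_window=120):
    """Per-day features for one stock; each row uses bars through that day only."""
    ret = close.pct_change()

    # Proximity to the trailing high: 1.0 means a new high_window-day closing high.
    proximity = close / close.rolling(high_window).max()

    # Compression: recent realized volatility relative to a longer baseline.
    # Values below 1.0 indicate a quieter-than-usual recent range.
    compression = ret.rolling(short_window).std() / ret.rolling(long_window).std()

    # Rising dollar volume: short average over long average (above 1.0 = rising).
    dollar_volume = close * volume
    dv_trend = (dollar_volume.rolling(short_window).mean()
                / dollar_volume.rolling(long_window).mean())

    # Relative strength: stock return minus benchmark return over the long window.
    rel_strength = close.pct_change(long_window) - bench_close.pct_change(long_window)

    return pd.DataFrame({
        "proximity": proximity,
        "compression": compression,
        "dv_trend": dv_trend,
        "rel_strength": rel_strength,
    })
```

Because every feature on day t is built from bars through t, any score derived from them is tradable from t+1 at the earliest.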

3. Gap-and-Go / High-Volume Continuation

Target the second phase of move continuation after abnormal return and volume shocks rather than blindly chasing the first event day.

Core signal ingredients:

  • abnormal 1d or 3d return
  • abnormal volume versus trailing 60d
  • post-event price holding above the event anchor
  • subsequent breakout continuation

This family has high potential upside but is more sensitive to cost assumptions and market regime.
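A minimal sketch of the event-then-hold logic for one stock follows. The thresholds, the choice of the event-day close as the anchor, and the name `continuation_candidates` are assumptions made for illustration; the spec does not fix any of them.

```python
import pandas as pd


def continuation_candidates(close, volume, lookback=60,
                            ret_sigmas=3.0, vol_mult=3.0, hold_days=3):
    """Flag shock days and the later days on which price has held above the anchor."""
    ret = close.pct_change()

    # Trailing baselines are shifted one day so the shock day is not in its own baseline.
    base_sigma = ret.rolling(lookback).std().shift(1)
    base_volume = volume.rolling(lookback).median().shift(1)

    # Event: abnormal 1d return together with abnormal volume.
    event = (ret > ret_sigmas * base_sigma) & (volume > vol_mult * base_volume)

    # Anchor: close of the most recent event day, carried forward.
    anchor = close.where(event).ffill()

    # Number of trading days since the most recent event.
    position = pd.Series(range(len(close)), index=close.index, dtype=float)
    days_since = position - position.where(event).ffill()

    # Held: every close in the hold_days after the event stayed at or above the anchor.
    held = close.rolling(hold_days).min() >= anchor
    candidate = (days_since == hold_days) & held

    return pd.DataFrame({"event": event, "candidate": candidate})
```

The candidate flag fires only after the holding period, which matches the intent of targeting the second phase of the move rather than the event day itself.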

4. Regime-Gated Cross-Sectional Alpha

Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.

Core signal ingredients:

  • market risk-on versus risk-off state
  • industry ETF leadership
  • relative strength
  • recovery from drawdowns
  • trend quality
  • near-52w high behavior
  • price/volume confirmation

This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
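As an illustration of the first two ingredients, the sketch below derives a market state and an ETF leadership flag from a table of ETF closes. The specific rules (market ETF above its 200-day average, top-k ETFs by trailing 63-day return) are assumptions chosen for the example, not rules stated in this design, and `regime_state` is a hypothetical name.

```python
import pandas as pd


def regime_state(etf_close, market="SPY", trend_window=200,
                 leader_window=63, top_k=3):
    """Return (risk_on, leaders) computed from ETF closes through each day."""
    market_close = etf_close[market]

    # Risk-on when the market ETF closes above its own trailing average.
    # Rows without a full window compare against NaN and come out False.
    risk_on = market_close > market_close.rolling(trend_window).mean()

    # Leadership: rank the remaining ETFs by trailing return, best first.
    trailing = etf_close.drop(columns=[market]).pct_change(leader_window)
    leaders = trailing.rank(axis=1, ascending=False) <= top_k

    return risk_on, leaders
```

Both outputs describe the state as of the day-t close, so they should gate trades placed on t+1 or later.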

Prioritization

Recommended implementation order:

  1. Breakout After Compression
  2. Regime-Gated Cross-Sectional Alpha
  3. Gap-and-Go / High-Volume Continuation
  4. Earnings Drift Proxy only after validating free event-data quality

Rationale:

  • Breakout After Compression is the most implementable and least ambiguous with free data.
  • Regime-Gated Cross-Sectional Alpha provides a shared control layer for the rest of the framework.
  • Gap-and-Go has higher upside but also higher sensitivity to assumptions.
  • Earnings Drift Proxy is theoretically powerful but should not become the project bottleneck if free event history is incomplete.

Data Layer

The framework needs a richer data layer than the current close/open setup.

Required price fields

Daily US market data should support at least:

  • open
  • high
  • low
  • close
  • volume

This is required to define:

  • real breakouts
  • gap events
  • volatility compression
  • abnormal dollar volume

Required ETF layer

Add stable market and industry ETFs for regime and leadership analysis, at minimum:

  • SPY
  • QQQ
  • IWM
  • MDY
  • XLF
  • XLK
  • XLI
  • XLV
  • XLY
  • XLP
  • XLE
  • XLU
  • XLRE
  • XLB
  • SOXX
  • IGV
  • SMH

Universe modes

The framework must support two explicit modes.

Strict mode

Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.

Exploratory mode

Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.

Universe Construction Rules

The tradable universe must be computed daily from lagged information.

Daily eligibility rules

Each stock may enter the candidate set only if all required conditions hold as of t-1:

  • enough listing history exists to compute the strategy lookbacks
  • enough valid volume observations exist
  • minimum lagged price threshold is met
  • minimum lagged dollar-volume threshold is met

Representative defaults:

  • close[t-1] > $5
  • median_dollar_volume_60d[t-1] > $20M in strict mode
  • median_dollar_volume_60d[t-1] > $5M in exploratory mode
  • >= 252 valid trading days before eligibility
  • >= 40 valid volume days in the trailing 60d

Thresholds should be strategy-specific and tunable in robustness sweeps.
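The eligibility rules can be sketched as a boolean mask over wide date-by-ticker tables. This is a sketch under assumptions: `eligible_mask` is a hypothetical name, inputs are DataFrames of daily closes and share volumes, and a "valid" volume day is taken to mean a positive, non-missing volume.

```python
import pandas as pd


def eligible_mask(close, volume, min_price=5.0, min_dollar_volume=20e6,
                  min_history=252, min_volume_days=40, window=60):
    """Boolean date-by-ticker mask; the row for day t uses data through t-1 only."""
    dollar_volume = close * volume
    valid_volume = (volume > 0).astype(float)  # missing volume counts as invalid

    # Enough listing history: count of valid closes observed so far.
    history_ok = close.notna().cumsum() >= min_history

    # Enough valid volume observations in the trailing window.
    volume_ok = valid_volume.rolling(window, min_periods=1).sum() >= min_volume_days

    # Price and liquidity floors.
    price_ok = close > min_price
    median_dv = dollar_volume.rolling(window, min_periods=min_volume_days).median()
    liquidity_ok = median_dv > min_dollar_volume

    same_day = history_ok & volume_ok & price_ok & liquidity_ok

    # Lag by one row so that eligibility on day t reflects conditions as of t-1.
    return same_day.shift(1, fill_value=False)
```

Passing `min_dollar_volume=5e6` reproduces the exploratory default; the other thresholds are exposed as arguments so they can be varied in robustness sweeps.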

Industry mapping

Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over 63/126d windows.
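One way to build such a proxy is sketched below, assuming daily return series for the stock and for each industry ETF. The function names `rolling_etf_beta` and `closest_etf` are hypothetical, and both outputs are lagged one day so that the mapping used on day t depends only on returns through t-1.

```python
import pandas as pd


def rolling_etf_beta(stock_ret, etf_ret, window=126):
    """Rolling beta of a stock to one ETF, lagged by one day."""
    cov = stock_ret.rolling(window).cov(etf_ret)
    var = etf_ret.rolling(window).var()
    return (cov / var).shift(1)


def closest_etf(stock_ret, etf_rets, window=63):
    """Name of the ETF with the highest lagged rolling correlation on each day."""
    corr = pd.DataFrame({
        name: stock_ret.rolling(window).corr(etf_rets[name])
        for name in etf_rets.columns
    }).shift(1)
    # Drop warm-up rows where no correlation is available yet.
    return corr.dropna(how="all").idxmax(axis=1)
```

Because the assignment is recomputed every day, a stock can migrate between industry proxies over time, which is the behavior a static label cannot capture.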

Anti-Lookahead Rules

The framework must enforce the following rules consistently.

  1. Signals computed from daily bars through day t may be traded no earlier than t+1.
  2. If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
  3. Rolling inputs for liquidity, volatility, and breakout logic must use complete lagged windows with explicit timing semantics.
  4. Cross-sectional ranking must happen only within the daily eligible universe.
  5. Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.

Execution Convention

Default execution convention:

  • observe data through t close
  • compute signal after the t close
  • trade at t+1

The framework may compare t+1 open and t+1 close execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
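The timing convention has a consequence that is easy to get wrong in vectorized code: weights decided after the t close are filled on t+1 and first earn the move from t+1 to t+2. A minimal sketch, assuming daily rebalancing at a single execution price per day (pass opens for the t+1 open variant or closes for the t+1 close variant); `next_day_returns` is a hypothetical name.

```python
import pandas as pd


def next_day_returns(target_weights, exec_price):
    """Daily portfolio returns for weights decided after each day's close.

    Both inputs are date-by-ticker DataFrames on the same index.
    """
    # The value at row d is the move from the d-1 execution price to the d one.
    price_ret = exec_price.pct_change()

    # Weights from day t are filled at t+1 and first earn the t+1 -> t+2 move,
    # which pct_change labels at t+2: hence a two-row shift, not one.
    held = target_weights.shift(2)

    return (held * price_ret).sum(axis=1)
```

Shifting by only one row here would credit the portfolio with a move that started before the fill, which is a quiet form of lookahead.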

Backtest and Evaluation Framework

Every strategy family must run through a single pipeline that:

  1. loads required market data
  2. constructs the daily eligible universe
  3. computes regime filters
  4. computes strategy scores or event states
  5. builds a long-only portfolio
  6. applies transaction costs
  7. reports 1/2/3/5/10y windows
  8. records robustness diagnostics
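Steps 6 and 7 can be sketched as two small helpers. The cost model (a flat rate in basis points charged on one-way turnover) and the names `net_returns` and `trailing_cagr` are assumptions for illustration; the spec does not prescribe a cost model.

```python
import pandas as pd


def net_returns(gross, weights, cost_bps=10.0):
    """Subtract a flat per-trade cost, charged on the sum of absolute weight changes."""
    turnover = weights.diff().abs().sum(axis=1)
    turnover.iloc[0] = weights.iloc[0].abs().sum()  # initial build from cash
    return gross - turnover * (cost_bps / 1e4)


def trailing_cagr(returns, years=(1, 2, 3, 5, 10)):
    """CAGR over trailing windows ending at the last date of a daily return series."""
    end = returns.index[-1]
    out = {}
    for y in years:
        start = end - pd.DateOffset(years=y)
        if returns.index[0] > start:
            continue  # not enough history: skip instead of reporting a short window
        window = returns[returns.index > start]
        out[f"{y}y"] = (1.0 + window).prod() ** (1.0 / y) - 1.0
    return out
```

Skipping windows that lack full history keeps a 3y result from being reported under a 10y label.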

Portfolio defaults

Initial baseline settings:

  • long-only
  • concentrated books such as top 5, top 10, top 20
  • start with equal weight
  • add inverse-vol weighting only as a secondary comparison

Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
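A minimal sketch of the equal-weight baseline follows, assuming a date-by-ticker score table and a matching boolean eligibility mask. The name `top_n_equal_weight` is hypothetical. Scores are masked before ranking so that ineligible names can never displace eligible ones.

```python
import pandas as pd


def top_n_equal_weight(scores, eligible, n=10):
    """Equal weights on the n highest-scoring eligible names for each date."""
    # Rank within the eligible universe only; ineligible names become NaN.
    ranked = scores.where(eligible).rank(axis=1, ascending=False, method="first")
    selected = (ranked <= n).astype(float)

    # Divide by the number of names actually held, which may be fewer than n.
    counts = selected.sum(axis=1)
    return selected.div(counts.where(counts > 0), axis=0).fillna(0.0)
```

On dates with no eligible names the weights are all zero, so the book simply sits in cash.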

Required robustness checks

Any strategy candidate that looks strong must automatically be re-run under:

  • tighter liquidity thresholds
  • both fewer and more positions than the baseline book
  • higher trading costs
  • different rebalance frequencies
  • exclusion of the lowest-liquidity or smallest-cap tail

Only strategies that survive these perturbations should be promoted to Tier A.
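The automatic re-runs can be driven by a small configuration grid. The sketch below is illustrative only: the parameter names, the helper names `robustness_grid` and `survives`, and the idea of a single metric floor as the promotion test are assumptions, not part of this design.

```python
from itertools import product


def robustness_grid(base, perturbations):
    """Yield one config per combination of perturbed values, layered over base."""
    names = list(perturbations)
    for combo in product(*(perturbations[name] for name in names)):
        yield {**base, **dict(zip(names, combo))}


def survives(metric_by_run, floor):
    """True only if every perturbed run keeps the chosen metric at or above floor."""
    return all(value >= floor for value in metric_by_run)


# Example usage with hypothetical parameter names and values.
base = {"top_n": 10, "cost_bps": 10, "min_dollar_volume": 20e6, "rebalance": "W"}
perturbations = {"top_n": [5, 20], "cost_bps": [20, 40]}
configs = list(robustness_grid(base, perturbations))
```

Requiring every run to clear the floor, rather than the average run, keeps one lucky configuration from carrying a fragile strategy into Tier A.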

Repository Changes

The following repository changes are required.

New modules

research/us_universe.py

Responsibilities:

  • build daily tradable-universe masks
  • support strict and exploratory modes
  • enforce lagged eligibility rules

data_manager.py extension or new market_data.py

Responsibilities:

  • support daily US OHLCV
  • support ETF data updates
  • preserve existing price-loading workflows where practical

research/regime_filters.py

Responsibilities:

  • market risk-on/risk-off filters
  • ETF leadership signals
  • breadth and relative-strength helpers

research/event_factors.py

Responsibilities:

  • breakout-compression scores
  • gap-continuation scores
  • high-volume continuation logic
  • earnings-drift proxy logic

research/us_alpha_pipeline.py

Responsibilities:

  • orchestrate end-to-end research runs
  • load data
  • build universe masks
  • run strategy families
  • produce windowed rankings
  • label output as strict or exploratory

research/us_alpha_report.py

Responsibilities:

  • format tables and CSV outputs
  • summarize results by family and horizon
  • support markdown export if needed

Research Phasing

The implementation should be split into two phases.

Phase 1

Build the strict, defensible research backbone:

  • PIT S&P 500 universe
  • OHLCV data support
  • ETF regime filters
  • Breakout After Compression
  • Regime-Gated Cross-Sectional Alpha
  • Gap-and-Go / High-Volume Continuation
  • unified backtest and reporting pipeline

This phase should produce a clean research system that is difficult to fool with future information.

Phase 2

Expand into higher-upside exploratory research:

  • wider US stock universe
  • broader signal scanning
  • stronger CAGR search
  • explicit exploratory labeling

This phase is for alpha discovery, not for making final claims about unbiased production performance.

Deliverables

The finished framework should produce:

  • a repeatable research entrypoint for US alpha studies
  • CSV outputs for 1/2/3/5/10y windows
  • a ranked table of strategy families
  • tier classification for candidates
  • notes on where near-50% CAGR outcomes come from and whether they remain credible after tightening assumptions

Non-Goals

This project does not aim to:

  • promise stable 10y 50% CAGR
  • claim a fully point-in-time-clean all-US-stock universe from free data alone
  • optimize to a single headline metric at the expense of realism
  • treat exploratory full-market scans as production-quality evidence

Key Decision

The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.