gahow/quant

Gahow Wang 7239310be3 docs: add US alpha research design spec

2026-04-17 23:41:10 +08:00

US High-Alpha Research Design

Date: 2026-04-17

Goal

Build a research framework for US long-only equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over 1/2/3/5/10y windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.

Constraints

Data sources must be free or already accessible from the current project environment.
Portfolio construction must be long-only.
The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
- strict results from a point-in-time-clean universe
- exploratory results from a wider free-data universe that is not fully point-in-time-clean.
All signals must use only information available at the time of decision.
The framework must explicitly guard against:
- survivorship bias
- lookahead bias
- static industry-label leakage
- microcap and illiquidity contamination

Success Criteria

The framework is successful if it produces:

A unified research and backtest pipeline for US strategies.
A ranked comparison of 3-5 high-value strategy families across 1/2/3/5/10y.
Metrics that go beyond headline CAGR, including:
- CAGR
- Sharpe
- Sortino
- MaxDD
- Calmar
- Turnover
- Average positions
- Median ADV usage
- Subperiod stability
Tiered interpretation of results:
- Tier A: realistic and tradable under tighter liquidity assumptions
- Tier B: strong alpha but lower capacity
- Tier C: attractive only under loose assumptions and not suitable as a production candidate
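
The return-based metrics in the list above could be computed roughly as follows. This is a minimal sketch, not the project's implementation: it assumes simple daily returns in a plain list, 252 trading days per year, a zero risk-free rate, and a zero downside target for Sortino. All function names are hypothetical. Turnover, average positions, and ADV usage depend on portfolio state and are not covered here.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor for daily data


def equity_curve(returns):
    """Compound simple daily returns into an equity curve that starts from 1.0."""
    curve, level = [], 1.0
    for r in returns:
        level *= 1.0 + r
        curve.append(level)
    return curve


def cagr(returns):
    years = len(returns) / TRADING_DAYS
    return equity_curve(returns)[-1] ** (1.0 / years) - 1.0


def sharpe(returns):
    # Risk-free rate is assumed to be zero in this sketch.
    return mean(returns) / stdev(returns) * math.sqrt(TRADING_DAYS)


def sortino(returns):
    # Downside deviation: root mean square of the negative returns (target = 0).
    downside = math.sqrt(mean([min(r, 0.0) ** 2 for r in returns]))
    return mean(returns) / downside * math.sqrt(TRADING_DAYS)


def max_drawdown(returns):
    """Largest peak-to-trough loss, returned as a negative fraction."""
    peak, worst = 1.0, 0.0
    for level in equity_curve(returns):
        peak = max(peak, level)
        worst = min(worst, level / peak - 1.0)
    return worst


def calmar(returns):
    # Undefined (division by zero) when the series has no drawdown at all.
    return cagr(returns) / abs(max_drawdown(returns))
```

Reporting MaxDD as a negative fraction keeps its sign consistent with returns; Calmar takes the absolute value so that a higher ratio is better.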

Any strategy that reports near-50% CAGR must also explain:

which market regime contributed most of the return
whether performance depends on low-liquidity or small-cap tails
whether results survive after removing the most extreme tail names

Research Philosophy

This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a 10y 50% CAGR should be treated as an upper-end outcome that may appear in selective windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over 3/5y, still meaningfully outperform over 10y, and remain robust after tightening assumptions.

Strategy Families

The research effort will focus on four strategy families.

1. Earnings Drift Proxy

Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.

Primary implementation order:

use free historical earnings date data if it is stable enough
otherwise fall back to price-and-volume-defined event proxies

Core signal ingredients:

strong post-event excess return over 1-3 days
abnormal volume
gap that does not immediately fill
price holding near short- and medium-term highs after the event

2. Breakout After Compression

Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.

Core signal ingredients:

proximity to 120d or 252d highs
volatility compression over the prior 20-40 trading days
rising dollar volume
positive relative strength versus market and industry proxies
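
As an illustration, two of these ingredients (proximity to a trailing high and volatility compression) might be computed like this. The sketch is a simplification under stated assumptions: it uses closes rather than intraday highs, measures compression as the ratio of 20-day to 120-day return volatility (the 120-day baseline is an assumption, not part of the spec), and the function names are hypothetical.

```python
from statistics import pstdev


def daily_returns(closes):
    return [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def breakout_compression_features(closes, high_window=252, short_vol=20, long_vol=120):
    """Features as of the last bar in `closes` (oldest first).

    proximity   : last close / highest close in the trailing `high_window` bars
    compression : recent volatility / longer-run volatility (below 1 = compressed)
    """
    if len(closes) < max(high_window, long_vol + 1):
        return None  # not enough history for the lookbacks
    proximity = closes[-1] / max(closes[-high_window:])
    rets = daily_returns(closes)
    recent = pstdev(rets[-short_vol:])
    longer = pstdev(rets[-long_vol:])
    compression = recent / longer if longer > 0 else None
    return {"proximity": proximity, "compression": compression}
```

A candidate would then look like a proximity close to 1.0 combined with a compression ratio well below 1.0. To respect the timing rules, `closes` should end at the last fully observed bar.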

3. Gap-and-Go / High-Volume Continuation

Target the second phase of move continuation after abnormal return and volume shocks rather than blindly chasing the first event day.

Core signal ingredients:

abnormal 1d or 3d return
abnormal volume versus trailing 60d
post-event price holding above the event anchor
subsequent breakout continuation

This family has high potential upside but is more sensitive to cost assumptions and market regime.
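
A rough sketch of two of the ingredients above: the volume shock versus the trailing 60 days, and the "holding above the event anchor" check. The spec does not define the anchor price, so the caller supplies it here (for example the pre-event close, which is only an assumption). Function names are hypothetical.

```python
from statistics import median


def volume_shock(volumes, lookback=60):
    """Ratio of the latest volume to the median of the `lookback` bars before it."""
    if len(volumes) < lookback + 1:
        return None  # not enough history
    baseline = median(volumes[-lookback - 1:-1])  # excludes the latest bar
    return volumes[-1] / baseline if baseline > 0 else None


def holds_above_anchor(closes_after_event, anchor):
    """True if every close since the event stayed at or above the anchor price."""
    return len(closes_after_event) > 0 and all(c >= anchor for c in closes_after_event)
```

Excluding the event bar from the baseline keeps the shock itself from inflating its own reference level.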

4. Regime-Gated Cross-Sectional Alpha

Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.

Core signal ingredients:

market risk-on versus risk-off state
industry ETF leadership
relative strength
recovery from drawdowns
trend quality
near-52w high behavior
price/volume confirmation

This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
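
One possible shape for the market-state and ETF-leadership ingredients is sketched below. The specific rules are assumptions chosen for illustration only: risk-on is defined as the index ETF closing above the average of its prior 200 closes, and leadership as the ranking of trailing 63-bar returns. Neither rule is prescribed by this spec, and the function names are hypothetical.

```python
def risk_on(index_closes, window=200):
    """True when the latest close is above the mean of the prior `window` closes."""
    if len(index_closes) < window + 1:
        return False  # insufficient history: default to risk-off
    baseline = sum(index_closes[-window - 1:-1]) / window
    return index_closes[-1] > baseline


def leadership_rank(etf_closes, lookback=63):
    """Order tickers by trailing `lookback`-bar return, strongest first."""
    scores = {
        ticker: closes[-1] / closes[-lookback - 1] - 1.0
        for ticker, closes in etf_closes.items()
        if len(closes) > lookback
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Defaulting to risk-off when history is short is the conservative choice: the gate then reduces participation rather than adding it.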

Prioritization

Recommended implementation order:

Breakout After Compression
Regime-Gated Cross-Sectional Alpha
Gap-and-Go / High-Volume Continuation
Earnings Drift Proxy only after validating free event-data quality

Rationale:

Breakout After Compression is the most implementable and least ambiguous with free data.
Regime-Gated Cross-Sectional Alpha provides a shared control layer for the rest of the framework.
Gap-and-Go has higher upside but also higher sensitivity to assumptions.
Earnings Drift Proxy is theoretically powerful but should not become the project bottleneck if free event history is incomplete.

Data Layer

The framework needs a richer data layer than the current close/open setup.

Required price fields

Daily US market data should support at least:

open
high
low
close
volume

This is required to define:

real breakouts
gap events
volatility compression
abnormal dollar volume

Required ETF layer

Add stable market and industry ETFs for regime and leadership analysis, at minimum:

SPY
QQQ
IWM
MDY
XLF
XLK
XLI
XLV
XLY
XLP
XLE
XLU
XLRE
XLB
SOXX
IGV
SMH

Universe modes

The framework must support two explicit modes.

Strict mode

Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.

Exploratory mode

Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.

Universe Construction Rules

The tradable universe must be computed daily from lagged information.

Daily eligibility rules

Each stock may enter the candidate set only if all required conditions hold as of t-1:

enough listing history exists to compute the strategy lookbacks
enough valid volume observations exist
minimum lagged price threshold is met
minimum lagged dollar-volume threshold is met

Representative defaults:

close[t-1] > 5
median_dollar_volume_60d[t-1] > $20M in strict mode
median_dollar_volume_60d[t-1] > $5M in exploratory mode
>= 252 valid trading days before eligibility
>= 40 valid volume days in the trailing 60d

Thresholds should be strategy-specific and tunable in robustness sweeps.
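
The representative defaults above translate into a per-stock check along these lines. This is a sketch only: the inputs are assumed to be oldest-first lists that end at t-1, a volume of `None` marks a missing observation, and the function name is hypothetical.

```python
from statistics import median


def is_eligible(closes, volumes, strict=True):
    """Decide eligibility for day t from bars up to and including t-1."""
    min_dollar_volume = 20e6 if strict else 5e6
    if len(closes) < 252:
        return False  # not enough listing history
    if closes[-1] <= 5:
        return False  # lagged price floor not met
    recent = zip(closes[-60:], volumes[-60:])
    valid = [c * v for c, v in recent if v is not None and v > 0]
    if len(valid) < 40:
        return False  # too few valid volume days in the trailing 60d
    return median(valid) > min_dollar_volume
```

Because the function only ever sees bars through t-1, the lag is enforced by how the caller slices the data rather than inside the check itself.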

Industry mapping

Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over 63/126d windows.
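
A correlation-based proxy could look like the sketch below, which assigns a stock to whichever industry ETF its returns tracked most closely over the trailing window. It assumes aligned, oldest-first daily return lists that all end on the same lagged date; the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def closest_industry_etf(stock_returns, etf_returns, window=63):
    """Ticker of the ETF whose trailing `window` returns correlate most with the stock."""
    tail = stock_returns[-window:]
    if len(tail) < window:
        return None  # not enough stock history
    scores = {
        ticker: pearson(tail, rets[-window:])
        for ticker, rets in etf_returns.items()
        if len(rets) >= window
    }
    return max(scores, key=scores.get) if scores else None
```

Since the mapping is recomputed from trailing data on each date, a company that changed business lines is matched to the industry it actually traded with at the time.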

Anti-Lookahead Rules

The framework must enforce the following rules consistently.

Signals computed from daily bars through day t may be traded no earlier than t+1.
If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
Rolling inputs for liquidity, volatility, and breakout logic must use complete lagged windows with explicit timing semantics.
Cross-sectional ranking must happen only within the daily eligible universe.
Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.

Execution Convention

Default execution convention:

observe data through t close
compute signal after the t close
trade at t+1

The framework may compare t+1 open and t+1 close execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
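
To make the timing concrete, here is a sketch of the t+1 close variant. It assumes a single stock, a 0/1 signal computed after each close, and hypothetical function names. The key point is that a fill at the t+1 close first earns the return from close[t+1] to close[t+2], so the signal is lagged two bars against close-to-close returns.

```python
def close_to_close_returns(closes):
    """Return for day d is close[d] / close[d-1] - 1; undefined (None) on day 0."""
    return [None] + [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]


def returns_with_t1_close_execution(signals, closes):
    """Strategy returns when signals[t] (1 = hold, 0 = flat) is filled at the t+1 close."""
    rets = close_to_close_returns(closes)
    out = []
    for d in range(len(closes)):
        if d < 2:
            out.append(0.0)  # no position can be earning a return yet
        else:
            out.append(signals[d - 2] * rets[d])
    return out
```

Lagging by only one bar here would credit the strategy with the move into the t+1 close, which is exactly the price it is assumed to trade at.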

Backtest and Evaluation Framework

Every strategy family must run through a single pipeline that:

loads required market data
constructs the daily eligible universe
computes regime filters
computes strategy scores or event states
builds a long-only portfolio
applies transaction costs
reports 1/2/3/5/10y windows
records robustness diagnostics
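
For the transaction-cost step, a simple proportional model is one option. The sketch below assumes a flat cost in basis points per unit of traded notional, with weights held in dicts keyed by ticker. The 10 bps default and the function names are placeholders, not values taken from this spec.

```python
def turnover(old_weights, new_weights):
    """One-way turnover: half the sum of absolute weight changes."""
    names = set(old_weights) | set(new_weights)
    return 0.5 * sum(
        abs(new_weights.get(t, 0.0) - old_weights.get(t, 0.0)) for t in names
    )


def cost_drag(old_weights, new_weights, cost_bps=10.0):
    """Return drag from one rebalance, charging `cost_bps` on buys and sells alike."""
    traded = 2.0 * turnover(old_weights, new_weights)  # buys plus sells
    return traded * cost_bps / 1e4
```

The drag would be subtracted from the portfolio return on each rebalance date; the same `turnover` value can feed the Turnover metric in the report.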

Portfolio defaults

Initial baseline settings:

long-only
concentrated books such as top 5, top 10, top 20
start with equal weight
add inverse-vol weighting only as a secondary comparison

Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
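
A minimal sketch of the equal-weight top-N baseline, assuming `scores` maps tickers to a strategy score and `eligible` is the set of names in that day's tradable universe (both names, and the function name, are hypothetical):

```python
def top_n_equal_weight(scores, eligible, n=10):
    """Equal-weight the `n` highest-scoring names among the eligible set."""
    ranked = sorted(
        (t for t in scores if t in eligible),
        key=lambda t: scores[t],
        reverse=True,
    )
    picks = ranked[:n]
    return {t: 1.0 / len(picks) for t in picks} if picks else {}
```

Filtering to the eligible set before ranking mirrors the anti-lookahead rule that cross-sectional ranking happens only within the daily eligible universe.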

Required robustness checks

Any strategy candidate that looks strong must automatically be re-run under:

tighter liquidity thresholds
both fewer and more positions
higher trading costs
different rebalance frequencies
exclusion of the lowest-liquidity or smallest-cap tail

Only strategies that survive these perturbations should be promoted to Tier A.
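
The automatic re-runs could be driven by a small parameter grid, sketched below. Every config key and every perturbation value here is hypothetical and chosen only to illustrate the idea; the actual sweep ranges are left open by this spec.

```python
from itertools import product


def robustness_grid(base):
    """Yield perturbed copies of a baseline config dict."""
    grid = {
        "min_dollar_volume": [base["min_dollar_volume"], 2 * base["min_dollar_volume"]],
        "top_n": [5, 10, 20],
        "cost_bps": [base["cost_bps"], 2 * base["cost_bps"]],
        "rebalance_days": [1, 5, 21],
    }
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        # Later keys override the baseline; untouched keys pass through unchanged.
        yield {**base, **dict(zip(keys, values))}
```

For example, `list(robustness_grid({"min_dollar_volume": 20e6, "top_n": 10, "cost_bps": 10, "rebalance_days": 5}))` yields 36 configurations, each of which would be passed through the same pipeline as the baseline run.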

Repository Changes

The following repository changes are required.

New modules

`research/us_universe.py`

Responsibilities:

build daily tradable-universe masks
support strict and exploratory modes
enforce lagged eligibility rules

`data_manager.py` extension or new `market_data.py`

Responsibilities:

support daily US OHLCV
support ETF data updates
preserve existing price-loading workflows where practical

`research/regime_filters.py`

Responsibilities:

market risk-on/risk-off filters
ETF leadership signals
breadth and relative-strength helpers

`research/event_factors.py`

Responsibilities:

breakout-compression scores
gap-continuation scores
high-volume continuation logic
earnings-drift proxy logic

`research/us_alpha_pipeline.py`

Responsibilities:

orchestrate end-to-end research runs
load data
build universe masks
run strategy families
produce windowed rankings
label output as strict or exploratory

`research/us_alpha_report.py`

Responsibilities:

format tables and CSV outputs
summarize results by family and horizon
support markdown export if needed

Research Phasing

The implementation should be split into two phases.

Phase 1

Build the strict, defensible research backbone:

PIT S&P 500 universe
OHLCV data support
ETF regime filters
Breakout After Compression
Regime-Gated Cross-Sectional Alpha
Gap-and-Go / High-Volume Continuation
unified backtest and reporting pipeline

This phase should produce a clean research system that is difficult to fool with future information.

Phase 2

Expand into higher-upside exploratory research:

wider US stock universe
broader signal scanning
stronger CAGR search
explicit exploratory labeling

This phase is for alpha discovery, not for making final claims about unbiased production performance.

Recommended Output

The finished framework should produce:

a repeatable research entrypoint for US alpha studies
CSV outputs for 1/2/3/5/10y windows
a ranked table of strategy families
tier classification for candidates
notes on where near-50% CAGR outcomes come from and whether they remain credible after tightening assumptions

Non-Goals

This project does not aim to:

promise stable 10y 50% CAGR
claim a fully point-in-time-clean all-US-stock universe from free data alone
optimize to a single headline metric at the expense of realism
treat exploratory full-market scans as production-quality evidence

Key Decision

The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.
