Add factor attribution design spec
376
docs/superpowers/specs/2026-04-07-factor-attribution-design.md
Normal file
@@ -0,0 +1,376 @@
# Factor Attribution Design

Date: 2026-04-07
Repo: `/Users/gahow/projects/quant`

## Goal

Add a factor attribution module that explains strategy returns using:

- Standard external US factors when available: `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF`
- Local price-derived extension factors: `MOM`, `LOWVOL`, `RECOVERY`
- Local proxy fallback factors for markets without standard external data

The module must integrate with the current backtest workflow, reuse existing strategy equity curves, cache downloaded factor data locally, and produce both terminal summaries and exportable tabular outputs.

## Scope

In scope:

- New factor attribution module for research backtests
- US support using external standard factors plus local extension factors
- CN support using local proxy factors only
- CAPM, FF5, and FF5-plus-extension models
- CLI flags in `main.py` to enable attribution and export results
- Tests for parsing, factor construction, and regression behavior

Out of scope for this iteration:

- Intraday attribution
- Portfolio optimizer changes
- Live trader attribution in `trader.py`
- Notebook or plotting UI for attribution results
- External fundamental datasets beyond standard downloadable factor files

## Existing Context

The repo already has:

- A vectorized backtest engine in `main.py`
- Strategy implementations that produce daily target weights
- Performance metrics in `metrics.py`
- Local daily price caches in `data/us.csv`, `data/us_open.csv`, `data/cn.csv`

Current "alpha" in `trader.py simulate` is only total return minus benchmark return. The new module adds regression-based alpha and factor exposure analysis.

## Design Overview

Add a new module `factor_attribution.py` with four responsibilities:

1. Load and cache factor datasets
2. Build local extension and proxy factors from existing price data
3. Run regression models against strategy daily returns
4. Render summary tables and export detailed results

`main.py` remains the orchestration point. It will continue running backtests and benchmark normalization, then optionally invoke attribution on the resulting daily return series.

## Module Structure

### `factor_attribution.py`

Planned top-level responsibilities:

- `load_external_us_factors(...)`
  - Download Ken French daily factor files
  - Parse, normalize, convert percent to decimal
  - Cache to `data/factors/`
  - Fall back to cache when network fetch fails

- `build_extension_factors(price_data, benchmark, market)`
  - Build local daily factor return series for:
    - `MOM`
    - `LOWVOL`
    - `RECOVERY`

- `build_proxy_core_factors(price_data, benchmark, market)`
  - Used mainly for CN or when external factors are unavailable
  - Build daily proxy series for:
    - `MKT`
    - `SMB_PROXY`
    - `HML_PROXY`
    - `RMW_PROXY`
    - `CMA_PROXY`

- `prepare_factor_models(...)`
  - Merge standard factors and local factors
  - Produce factor matrices for:
    - `capm`
    - `ff5`
    - `ff5plus`

- `run_factor_regression(strategy_returns, factor_frame, risk_free_col)`
  - Fit OLS with intercept
  - Return alpha, annualized alpha, loadings, t-stats, p-values, R-squared, adjusted R-squared, residual volatility, date range, and observation count

- `attribute_strategies(results_df, benchmark_series, price_data, market, model_selection)`
  - Convert equity curves to returns
  - Run attribution for each strategy
  - Return structured summary and long-form loadings tables

- `print_attribution_summary(...)`
  - Render compact terminal output

- `export_attribution(...)`
  - Write CSV outputs

## Data Sources

### US Standard Factors

Preferred source:

- Ken French daily factor datasets for:
  - Fama-French 5 Factors daily
  - Momentum daily if separately required

Normalization rules:

- Convert index to pandas `DatetimeIndex`
- Convert values from percent to decimal returns
- Keep `RF` as decimal daily risk-free rate
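
The normalization rules above could look roughly like the sketch below. It assumes the raw table has already been read into a DataFrame with integer `YYYYMMDD` dates as the index and percent-valued columns. The function name `normalize_factor_frame` and the sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd


def normalize_factor_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Convert a raw factor table (YYYYMMDD index, percent units) to decimal returns."""
    out = raw.copy()
    # Assumption: dates arrive as integers or strings such as 20240102.
    out.index = pd.to_datetime(out.index.astype(str), format="%Y%m%d")
    out.index.name = "date"
    # Normalize headers so that e.g. "Mkt-RF" matches the spec's `MKT-RF`.
    out.columns = [str(col).strip().upper() for col in out.columns]
    # Percent to decimal applies to every column, including RF.
    return out.astype(float) / 100.0


# Illustrative values only, not real factor data.
raw = pd.DataFrame(
    {"Mkt-RF": [0.50, -1.20], "SMB": [0.10, 0.05], "RF": [0.02, 0.02]},
    index=[20240102, 20240103],
)
factors = normalize_factor_frame(raw)
```

Keeping the function free of any download logic means the parser unit test can feed it a small hand-written frame.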

Cache location:

- `data/factors/ff5_us_daily.csv`
- `data/factors/mom_us_daily.csv`

If the source format changes or download fails:

- Use the latest local cache if present
- Otherwise fall back to local proxy factors and mark the run as `proxy_only`
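
A minimal sketch of that fallback chain follows. The download and parse step is injected as a callable so the example stays offline. `load_with_cache_fallback`, `fetch`, and the source labels `external` and `cache` are hypothetical names; only `proxy_only` comes from this spec.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple

import pandas as pd


def load_with_cache_fallback(
    fetch: Callable[[], pd.DataFrame], cache_path: Path
) -> Tuple[Optional[pd.DataFrame], str]:
    """Return (factors, source); source is 'external', 'cache', or 'proxy_only'."""
    try:
        frame = fetch()  # hypothetical callable: download + parse + normalize
    except Exception:
        # Network errors and parse errors (format changes) both land here.
        if cache_path.exists():
            return pd.read_csv(cache_path, index_col=0, parse_dates=True), "cache"
        return None, "proxy_only"
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(cache_path)
    return frame, "external"


def failing_fetch() -> pd.DataFrame:
    raise ConnectionError("simulated download failure")


def working_fetch() -> pd.DataFrame:
    index = pd.to_datetime(["2024-01-02", "2024-01-03"])
    return pd.DataFrame({"MKT-RF": [0.005, -0.012], "RF": [0.0002, 0.0002]}, index=index)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "factors" / "ff5_us_daily.csv"
    first = load_with_cache_fallback(failing_fetch, cache)   # no cache yet
    second = load_with_cache_fallback(working_fetch, cache)  # writes the cache
    third = load_with_cache_fallback(failing_fetch, cache)   # served from cache
```

Returning the source label alongside the data gives the caller what it needs to fill `factor_source` and `proxy_only` in the summary output.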

### Local Price Inputs

Reuse repo price caches:

- US: `data/us.csv`, `data/us_open.csv`
- CN: `data/cn.csv`

Only adjusted close prices are required for attribution factor construction.

## Factor Definitions

### Standard Factors

For US:

- `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF` from external factor data

### Local Extension Factors

These are built from the same universe already used by the repo.

#### `MOM`

- Cross-sectional momentum long-short factor
- Rank stocks by 12-1 month return
- Long top quantile, short bottom quantile
- Equal weight within long and short legs
- Factor return is long return minus short return
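
One possible shape for the `MOM` construction is sketched below. It assumes a wide price frame (dates by tickers), 21 trading days per month, and a 20% quantile. None of these parameters are fixed by this spec, and the function names are hypothetical. The signal is lagged one day so that the return on day `t` only uses information through `t-1`.

```python
import numpy as np
import pandas as pd


def momentum_signal(prices: pd.DataFrame, lookback: int = 252, skip: int = 21) -> pd.DataFrame:
    """12-1 month return: from `lookback` days ago up to `skip` days ago."""
    return prices.shift(skip) / prices.shift(lookback) - 1.0


def long_short_factor(prices: pd.DataFrame, signal: pd.DataFrame, quantile: float = 0.2) -> pd.Series:
    """Equal-weight long-short daily return; long the top names, short the bottom."""
    rets = prices.pct_change(fill_method=None)
    lagged = signal.shift(1)  # yesterday's signal explains today's return
    ranks = lagged.rank(axis=1, method="first")
    n = lagged.notna().sum(axis=1)
    k = np.floor(n * quantile).clip(lower=1)  # names per leg, at least one
    long_leg = rets.where(ranks.gt(n - k, axis=0)).mean(axis=1)
    short_leg = rets.where(ranks.le(k, axis=0)).mean(axis=1)
    # Need at least two ranked names to form distinct legs.
    return (long_leg - short_leg).where(n >= 2)


# Synthetic universe with constant daily growth rates (illustrative only).
dates = pd.bdate_range("2023-01-02", periods=300)
t = np.arange(300)
growth = {"A": 0.001, "B": 0.0005, "C": -0.0005, "D": -0.001}
prices = pd.DataFrame({name: 100 * (1 + g) ** t for name, g in growth.items()}, index=dates)
mom = long_short_factor(prices, momentum_signal(prices))
```

With four names the long leg is `A` and the short leg is `D`, so the factor return settles at about 0.002 per day once the 252-day warmup has passed.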

#### `LOWVOL`

- Cross-sectional low-volatility factor
- Compute rolling volatility from daily returns
- Long lowest-vol quantile, short highest-vol quantile
- Equal weight within legs

#### `RECOVERY`

- Cross-sectional recovery factor
- Rank stocks by distance from rolling 63-day low
- Long strongest recovery names, short weakest recovery names
- Equal weight within legs
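
The ranking signals behind `LOWVOL` and `RECOVERY` might be computed as in the sketch below. The 63-day volatility window is an assumption, since the spec only fixes 63 days for the recovery low, and both function names are hypothetical. Each function returns a score where higher means "belongs in the long leg".

```python
import numpy as np
import pandas as pd


def low_vol_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
    """Negated rolling volatility, so calmer names get higher scores."""
    rets = prices.pct_change(fill_method=None)
    return -rets.rolling(window).std()


def recovery_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
    """Fractional distance of the current price above its rolling low."""
    rolling_low = prices.rolling(window).min()
    return prices / rolling_low - 1.0


# Illustrative prices: a calm name, a volatile name, and a steadily falling name.
dates = pd.bdate_range("2024-01-01", periods=100)
t = np.arange(100)
prices = pd.DataFrame(
    {
        "calm": 100 + 0.1 * (t % 2),
        "wild": 100 + 5.0 * (t % 2),
        "falling": 200.0 - t,
    },
    index=dates,
)
vol_score = low_vol_signal(prices).iloc[-1]
rec_score = recovery_signal(prices).iloc[-1]
```

A name sitting at its rolling low scores exactly zero on recovery, which places it in the short leg of the factor.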

### Proxy Core Factors

Used for CN by default and as fallback for US.

#### `MKT`

- Benchmark daily return if benchmark exists
- Otherwise equal-weight universe return
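
A small sketch of this choice, assuming the benchmark is stored as a column of the same price frame as the universe. The function name `market_proxy` and the ticker labels are hypothetical.

```python
from typing import Optional

import pandas as pd


def market_proxy(prices: pd.DataFrame, benchmark: Optional[str] = None) -> pd.Series:
    """Benchmark daily return when the column exists, else equal-weight universe return."""
    rets = prices.pct_change(fill_method=None)
    if benchmark is not None and benchmark in rets.columns:
        return rets[benchmark].rename("MKT")
    # Equal weight: simple cross-sectional mean of available daily returns.
    return rets.mean(axis=1).rename("MKT")


# Illustrative prices only.
prices = pd.DataFrame(
    {"AAA": [100.0, 102.0, 102.0], "BBB": [50.0, 50.0, 51.0], "BENCH": [10.0, 10.1, 10.1]},
    index=pd.bdate_range("2024-01-01", periods=3),
)
with_benchmark = market_proxy(prices, benchmark="BENCH")
without_benchmark = market_proxy(prices[["AAA", "BBB"]], benchmark="BENCH")
```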

#### `SMB_PROXY`

- Size proxy using inverse price level or market-cap proxy when only price data is available
- First iteration uses inverse price rank as a transparent proxy and explicitly labels it as proxy

#### `HML_PROXY`

- Value proxy using price-to-range or distance-to-trailing-low style signal
- This is not a true book-to-market factor and must be labeled proxy

#### `RMW_PROXY`

- Profitability proxy from return consistency and stability

#### `CMA_PROXY`

- Investment proxy from asset trend smoothness or expansion/contraction behavior inferred from price action

Proxy factors are included for model completeness, but the output must label them clearly as proxies rather than standard academic factors.

## Factor Construction Rules

- All local factors use only information available up to date `t` to explain returns at `t+1`
- No future data leakage
- Factor series are daily return series, not ranks
- Long-short factors should be approximately dollar-neutral
- Missing values are allowed during warmup windows and dropped during model alignment
- Quantile counts should adapt to available universe size
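
Two of these rules, adaptive quantile counts and dollar neutrality, can be illustrated with the sketch below. The 20% default quantile, the minimum leg size, and both function names are assumptions made for the example.

```python
import math
from typing import Dict


def leg_size(n_names: int, quantile: float = 0.2, min_leg: int = 1) -> int:
    """Names per leg; 0 means the universe is too small for a long-short split."""
    if n_names < 2 * min_leg:
        return 0
    # Never let the two legs overlap: cap each leg at half the universe.
    return max(min_leg, min(math.floor(n_names * quantile), n_names // 2))


def long_short_weights(scores: Dict[str, float], quantile: float = 0.2) -> Dict[str, float]:
    """Equal weight within legs: +1/k on the top k scores, -1/k on the bottom k."""
    ranked = sorted(scores, key=scores.get)
    k = leg_size(len(ranked), quantile)
    weights = {name: 0.0 for name in ranked}
    if k == 0:
        return weights
    for name in ranked[:k]:
        weights[name] = -1.0 / k
    for name in ranked[-k:]:
        weights[name] = 1.0 / k
    return weights


# Illustrative scores for a seven-name universe.
weights = long_short_weights({"a": 0.3, "b": -0.1, "c": 0.8, "d": 0.0, "e": 0.5, "f": -0.4, "g": 0.2})
```

Because both legs hold the same number of names at equal weight, the weights sum to zero, which is the dollar-neutral property the rules ask for.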

## Regression Models

### CAPM

Model:

- `strategy_excess_return ~ alpha + (MKT-RF)`

### FF5

Model:

- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA`

### FF5Plus

Model:

- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA + MOM + LOWVOL + RECOVERY`

### Proxy Model

For markets without standard factors:

- `strategy_return ~ alpha + MKT + SMB_PROXY + HML_PROXY + RMW_PROXY + CMA_PROXY + MOM + LOWVOL + RECOVERY`

The module should report which model family was actually used.
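
The four model families above could be encoded as a lookup from model name to regressor columns, as in this sketch. The key `proxy`, the function `resolve_model`, and the rule that a `proxy_only` run always maps to the proxy family are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

FF5 = ["MKT-RF", "SMB", "HML", "RMW", "CMA"]
EXTENSION = ["MOM", "LOWVOL", "RECOVERY"]

MODEL_FACTORS = {
    "capm": ["MKT-RF"],
    "ff5": FF5,
    "ff5plus": FF5 + EXTENSION,
    # Hypothetical key for the proxy family used when standard factors are missing.
    "proxy": ["MKT", "SMB_PROXY", "HML_PROXY", "RMW_PROXY", "CMA_PROXY"] + EXTENSION,
}


def resolve_model(requested: str, available: Iterable[str], proxy_only: bool) -> Tuple[str, List[str]]:
    """Return (family actually used, regressor columns), or raise if columns are missing."""
    family = "proxy" if proxy_only else requested
    factors = MODEL_FACTORS[family]
    have = set(available)
    missing = [name for name in factors if name not in have]
    if missing:
        raise KeyError(f"model {family!r} is missing factor columns: {missing}")
    return family, factors


used, columns = resolve_model("ff5", FF5 + ["RF"], proxy_only=False)
fallback, proxy_columns = resolve_model("ff5", MODEL_FACTORS["proxy"], proxy_only=True)
```

Returning the family that was actually used, rather than the one requested, is what lets the report state which model family a run ended up with.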

## Alignment Rules

- Convert all equity curves to daily returns
- Build factor frames at daily frequency
- Join strategy returns and factor returns on date intersection
- For standard factor models, subtract `RF` from strategy returns
- Keep benchmark return separately for active return diagnostics, but not as a replacement for `MKT-RF` in standard factor models
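
A sketch of the first four alignment rules, assuming the equity curve is a pandas Series indexed by date and the factor frame already holds decimal daily returns. The name `align_for_regression` and the sample values are hypothetical.

```python
from typing import Optional, Tuple

import numpy as np
import pandas as pd


def align_for_regression(
    equity: pd.Series, factors: pd.DataFrame, rf_col: Optional[str] = "RF"
) -> Tuple[pd.Series, pd.DataFrame]:
    """Return (dependent variable, regressors) on the shared date range."""
    strat = equity.pct_change(fill_method=None).rename("strategy_return")
    # Inner join keeps the date intersection; dropna removes warmup rows.
    joined = pd.concat([strat, factors], axis=1, join="inner").dropna()
    y = joined["strategy_return"]
    drop = ["strategy_return"]
    if rf_col is not None and rf_col in joined.columns:
        y = y - joined[rf_col]  # excess return for standard factor models
        drop.append(rf_col)
    return y.rename("y"), joined.drop(columns=drop)


# Illustrative inputs: the equity curve grows 1% a day, the factors start one day later.
equity = pd.Series(100 * 1.01 ** np.arange(4), index=pd.bdate_range("2024-01-01", periods=4))
factors = pd.DataFrame(
    {"MKT-RF": [0.004, -0.002, 0.001, 0.003], "RF": [0.0001] * 4},
    index=pd.bdate_range("2024-01-02", periods=4),
)
y, X = align_for_regression(equity, factors)
```

Passing `rf_col=None` leaves the raw strategy return as the dependent variable, which matches the proxy model definition.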

## Output Schema

### Summary Output

One row per strategy per model with fields including:

- `strategy`
- `market`
- `model`
- `factor_source`
- `proxy_only`
- `start_date`
- `end_date`
- `n_obs`
- `alpha_daily`
- `alpha_ann`
- `alpha_t_stat`
- `alpha_p_value`
- `r_squared`
- `adj_r_squared`
- `residual_vol_ann`

Selected factor loadings should also be flattened into summary columns when available:

- `beta_mkt`
- `beta_smb`
- `beta_hml`
- `beta_rmw`
- `beta_cma`
- `beta_mom`
- `beta_lowvol`
- `beta_recovery`

### Loadings Output

Long-form table:

- `strategy`
- `model`
- `factor`
- `beta`
- `t_stat`
- `p_value`
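
To show how the long-form loadings table relates to the flattened `beta_*` summary columns, here is a sketch using pandas `pivot`. The naming rule that maps `MKT-RF` and `SMB_PROXY` to `beta_mkt` and `beta_smb` is an assumption, and the strategy name and numbers are placeholders rather than results.

```python
import pandas as pd


def summary_column(factor: str) -> str:
    """Assumed naming rule: 'MKT-RF' -> 'beta_mkt', 'SMB_PROXY' -> 'beta_smb'."""
    base = factor.replace("_PROXY", "").split("-")[0]
    return "beta_" + base.lower()


def flatten_betas(loadings: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-form loadings into one row per (strategy, model)."""
    keyed = loadings.assign(column=loadings["factor"].map(summary_column))
    wide = keyed.pivot(index=["strategy", "model"], columns="column", values="beta")
    return wide.reset_index().rename_axis(columns=None)


# Placeholder loadings in the long-form schema.
loadings = pd.DataFrame(
    [
        {"strategy": "demo", "model": "capm", "factor": "MKT-RF", "beta": 1.10, "t_stat": 20.0, "p_value": 0.0},
        {"strategy": "demo", "model": "ff5", "factor": "MKT-RF", "beta": 1.00, "t_stat": 18.0, "p_value": 0.0},
        {"strategy": "demo", "model": "ff5", "factor": "SMB", "beta": 0.30, "t_stat": 2.5, "p_value": 0.012},
    ]
)
flat = flatten_betas(loadings)
```

Factors that a model does not include come out as missing values in the wide table, which fits the "when available" wording for the summary columns.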

## CLI Changes

Add arguments to `main.py`:

- `--attribution`
- `--attribution-model {capm,ff5,ff5plus,all}`
- `--attribution-export <dir>`

Behavior:

- If `--attribution` is not set, current behavior is unchanged
- If set, attribution runs after backtest metrics are printed
- If export path is set, write:
  - `summary.csv`
  - `loadings.csv`
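
The flags could be registered with `argparse` roughly as follows. This is a sketch: the helper name `add_attribution_args`, the help strings, and the default of `all` for `--attribution-model` are assumptions, since the spec does not state a default.

```python
import argparse


def add_attribution_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register the attribution flags on an existing parser."""
    parser.add_argument("--attribution", action="store_true",
                        help="run factor attribution after backtest metrics")
    parser.add_argument("--attribution-model", choices=["capm", "ff5", "ff5plus", "all"],
                        default="all",  # assumed default, not fixed by the spec
                        help="which regression model(s) to fit")
    parser.add_argument("--attribution-export", metavar="DIR", default=None,
                        help="directory for summary.csv and loadings.csv")
    return parser


parser = add_attribution_args(argparse.ArgumentParser())
args = parser.parse_args(["--attribution", "--attribution-model", "ff5"])
defaults = parser.parse_args([])
```

Because `--attribution` is a `store_true` flag, omitting it leaves `args.attribution` as `False`, so the existing code path stays unchanged.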

## Terminal Reporting

For each strategy and selected model, print a compact line containing:

- annualized alpha
- major factor loadings
- R-squared
- residual volatility

After the numeric table, print a short interpretation section:

- whether alpha remains after adding factors
- which factors explain most of the strategy
- whether the model fit is weak or strong

Interpretation should remain descriptive and avoid overclaiming statistical significance.
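
A compact line of this kind might be rendered as in the sketch below. It treats "major" loadings as the largest by absolute value, which is an assumption, and the function name, layout, and row values are illustrative only.

```python
from typing import Any, Dict


def format_summary_line(row: Dict[str, Any], top_n: int = 3) -> str:
    """One compact line: annualized alpha, largest loadings, R-squared, residual vol."""
    betas = {k: v for k, v in row.items() if k.startswith("beta_") and v is not None}
    # Assumption: "major" means largest absolute loading.
    major = sorted(betas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_n]
    beta_text = " ".join(f"{name[len('beta_'):]}={value:+.2f}" for name, value in major)
    return (
        f"{row['strategy']:<16} {row['model']:<8} "
        f"alpha_ann={row['alpha_ann']:+.2%} {beta_text} "
        f"R2={row['r_squared']:.2f} resid_vol={row['residual_vol_ann']:.2%}"
    )


# Placeholder summary row, not a real backtest result.
line = format_summary_line(
    {
        "strategy": "demo", "model": "ff5", "alpha_ann": 0.031,
        "beta_mkt": 0.92, "beta_smb": 0.15, "beta_hml": -0.40, "beta_rmw": 0.02,
        "r_squared": 0.81, "residual_vol_ann": 0.064,
    }
)
```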

## Error Handling

- External factor download failure:
  - Use cache if available
  - Otherwise downgrade to proxy mode
- Missing or short overlap window:
  - Skip that model and report insufficient data
- Singular matrix or severe multicollinearity:
  - Catch and report model failure or unstable fit
- Missing benchmark column:
  - Fall back to equal-weight universe market proxy where possible
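
The overlap and multicollinearity checks could be run before fitting, as in this sketch. The thresholds `MIN_OBS` and `MAX_CONDITION`, the status strings, and the function name are assumptions that would need tuning against real data.

```python
import numpy as np

MIN_OBS = 60          # assumed minimum overlap, in trading days
MAX_CONDITION = 1e8   # assumed cutoff for an unstable design matrix


def check_design(X: np.ndarray) -> str:
    """Classify a factor matrix as 'ok', 'insufficient_data', 'singular', or 'unstable'."""
    n, k = X.shape
    if n < max(MIN_OBS, k + 2):
        return "insufficient_data"
    design = np.column_stack([np.ones(n), X])  # intercept plus factors
    if np.linalg.matrix_rank(design) < design.shape[1]:
        return "singular"
    if np.linalg.cond(design) > MAX_CONDITION:
        return "unstable"
    return "ok"


rng = np.random.default_rng(0)
good = rng.normal(0.0, 0.01, size=(100, 2))
duplicated = np.column_stack([good[:, 0], good[:, 0]])  # two identical factors
status_ok = check_design(good)
status_short = check_design(good[:10])
status_singular = check_design(duplicated)
```

Returning a status string instead of raising lets the caller skip one model and still report on the others.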

## Testing Plan

### Unit Tests

- External factor parser converts dates and percent units correctly
- Cache loader returns cached data on download failure
- Extension factor builders produce expected columns and no future leakage
- Regression on synthetic data recovers approximate known alpha and betas

### Integration Tests

- End-to-end attribution on a small deterministic equity and factor dataset
- CLI export produces expected files and columns

### Regression Tests

- Fixed local US sample produces stable output shape and model naming

## Implementation Notes

- Prefer `numpy.linalg.lstsq` or `scipy` OLS utilities already available in dependencies
- Keep implementation dependency-light
- Keep factor construction functions separate from regression code for testability
- Avoid changing existing strategy behavior
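
As a dependency-light illustration of the first note, the sketch below fits OLS with an intercept using `numpy.linalg.lstsq` and derives the statistics listed for `run_factor_regression`. It makes several assumptions: alpha is annualized by simple scaling with 252 trading days, p-values use a normal approximation through `math.erfc` rather than the exact t distribution, and the function name and return layout are hypothetical.

```python
import math
from typing import Dict, List

import numpy as np


def ols_with_intercept(y: np.ndarray, X: np.ndarray, names: List[str],
                       periods_per_year: int = 252) -> Dict[str, object]:
    """Fit y = alpha + X @ beta and return coefficients with basic diagnostics."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = n - (k + 1)
    sigma2 = float(resid @ resid) / dof
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    t_stats = coef / std_err
    # Normal approximation to the two-sided p-value (fine for large n).
    p_values = [math.erfc(abs(t) / math.sqrt(2)) for t in t_stats]
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return {
        "alpha_daily": float(coef[0]),
        "alpha_ann": float(coef[0]) * periods_per_year,  # assumed simple scaling
        "betas": dict(zip(names, coef[1:].tolist())),
        "t_stats": dict(zip(["alpha"] + names, t_stats.tolist())),
        "p_values": dict(zip(["alpha"] + names, p_values)),
        "r_squared": r2,
        "adj_r_squared": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "residual_vol_ann": math.sqrt(sigma2 * periods_per_year),
        "n_obs": n,
    }


# Synthetic check: known alpha and betas plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(0.0, 0.01, size=(2000, 2))
y = 0.0002 + X @ np.array([1.0, -0.5]) + rng.normal(0.0, 0.001, size=2000)
fit = ols_with_intercept(y, X, ["MKT-RF", "SMB"])
```

The synthetic data at the end mirrors the unit test in the testing plan: the fitted alpha and betas should land close to the values used to generate `y`.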

## Risks

- Standard factor downloads may change source file formatting
- Proxy factor definitions for CN will be weaker than true academic factors
- Some strategy returns may be highly collinear with momentum-like factors, reducing interpretability
- Short or overlapping warmup windows can materially reduce sample size

## Success Criteria

- A user can run backtests with `--attribution` and receive factor-based explanations of returns
- US runs use standard external factors when available
- CN runs still produce a clearly labeled proxy attribution report
- Outputs distinguish residual alpha from factor exposure
- The module is easy to extend with new factors later