diff --git a/docs/superpowers/specs/2026-04-07-factor-attribution-design.md b/docs/superpowers/specs/2026-04-07-factor-attribution-design.md
new file mode 100644
index 0000000..d9c1bd3
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-07-factor-attribution-design.md
@@ -0,0 +1,376 @@
+# Factor Attribution Design
+
+Date: 2026-04-07
+Repo: `/Users/gahow/projects/quant`
+
+## Goal
+
+Add a factor attribution module that explains strategy returns using:
+
+- Standard external US factors when available: `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF`
+- Local price-derived extension factors: `MOM`, `LOWVOL`, `RECOVERY`
+- Local proxy fallback factors for markets without standard external data
+
+The module must integrate with the current backtest workflow, reuse existing strategy equity curves, cache downloaded factor data locally, and produce both terminal summaries and exportable tabular outputs.
+
+## Scope
+
+In scope:
+
+- New factor attribution module for research backtests
+- US support using external standard factors plus local extension factors
+- CN support using local proxy factors only
+- CAPM, FF5, and FF5-plus-extension models
+- CLI flags in `main.py` to enable attribution and export results
+- Tests for parsing, factor construction, and regression behavior
+
+Out of scope for this iteration:
+
+- Intraday attribution
+- Portfolio optimizer changes
+- Live trader attribution in `trader.py`
+- Notebook or plotting UI for attribution results
+- External fundamental datasets beyond standard downloadable factor files
+
+## Existing Context
+
+The repo already has:
+
+- A vectorized backtest engine in `main.py`
+- Strategy implementations that produce daily target weights
+- Performance metrics in `metrics.py`
+- Local daily price caches in `data/us.csv`, `data/us_open.csv`, `data/cn.csv`
+
+Current "alpha" in `trader.py simulate` is only total return minus benchmark return. The new module adds regression-based alpha and factor exposure analysis.
+
+## Design Overview
+
+Add a new module `factor_attribution.py` with four responsibilities:
+
+1. Load and cache factor datasets
+2. Build local extension and proxy factors from existing price data
+3. Run regression models against strategy daily returns
+4. Render summary tables and export detailed results
+
+`main.py` remains the orchestration point. It will continue running backtests and benchmark normalization, then optionally invoke attribution on the resulting daily return series.
+
+## Module Structure
+
+### `factor_attribution.py`
+
+Planned top-level responsibilities:
+
+- `load_external_us_factors(...)`
+  - Download Ken French daily factor files
+  - Parse, normalize, convert percent to decimal
+  - Cache to `data/factors/`
+  - Fall back to cache when network fetch fails
+
+- `build_extension_factors(price_data, benchmark, market)`
+  - Build local daily factor return series for:
+    - `MOM`
+    - `LOWVOL`
+    - `RECOVERY`
+
+- `build_proxy_core_factors(price_data, benchmark, market)`
+  - Used mainly for CN or when external factors are unavailable
+  - Build daily proxy series for:
+    - `MKT`
+    - `SMB_PROXY`
+    - `HML_PROXY`
+    - `RMW_PROXY`
+    - `CMA_PROXY`
+
+- `prepare_factor_models(...)`
+  - Merge standard factors and local factors
+  - Produce factor matrices for:
+    - `capm`
+    - `ff5`
+    - `ff5plus`
+
+- `run_factor_regression(strategy_returns, factor_frame, risk_free_col)`
+  - Fit OLS with intercept
+  - Return alpha, annualized alpha, loadings, t-stats, p-values, R-squared, adjusted R-squared, residual volatility, date range, and observation count
+
+- `attribute_strategies(results_df, benchmark_series, price_data, market, model_selection)`
+  - Convert equity curves to returns
+  - Run attribution for each strategy
+  - Return structured summary and long-form loadings tables
+
+- `print_attribution_summary(...)`
+  - Render compact terminal output
+
+- `export_attribution(...)`
+  - Write CSV outputs
+
+## Data Sources
+
+### US Standard Factors
+
+Preferred source:
+
+- Ken French daily factor datasets for:
+  - Fama-French 5 Factors daily
+  - Momentum daily if separately required
+
+Normalization rules:
+
+- Convert index to pandas `DatetimeIndex`
+- Convert values from percent to decimal returns
+- Keep `RF` as decimal daily risk-free rate
+
+Cache location:
+
+- `data/factors/ff5_us_daily.csv`
+- `data/factors/mom_us_daily.csv`
+
+If the source format changes or download fails:
+
+- Use the latest local cache if present
+- Otherwise fall back to local proxy factors and mark the run as `proxy_only`
+
+### Local Price Inputs
+
+Reuse repo price caches:
+
+- US: `data/us.csv`, `data/us_open.csv`
+- CN: `data/cn.csv`
+
+Only adjusted close prices are required for attribution factor construction.
+
+## Factor Definitions
+
+### Standard Factors
+
+For US:
+
+- `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF` from external factor data
+
+### Local Extension Factors
+
+These are built from the same universe already used by the repo.
+
+#### `MOM`
+
+- Cross-sectional momentum long-short factor
+- Rank stocks by 12-1 month return
+- Long top quantile, short bottom quantile
+- Equal weight within long and short legs
+- Factor return is long return minus short return
+
+#### `LOWVOL`
+
+- Cross-sectional low-volatility factor
+- Compute rolling volatility from daily returns
+- Long lowest-vol quantile, short highest-vol quantile
+- Equal weight within legs
+
+#### `RECOVERY`
+
+- Cross-sectional recovery factor
+- Rank stocks by distance from rolling 63-day low
+- Long strongest recovery names, short weakest recovery names
+- Equal weight within legs
+
+### Proxy Core Factors
+
+Used for CN by default and as fallback for US.
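The extension factors defined above, and the proxy factors that follow, share one cross-sectional long-short construction. The block below is a minimal sketch of that construction, not the planned implementation: the function names `momentum_signal` and `long_short_factor`, the 252/21 trading-day approximation of "12-1 month", and the default 20% quantile are all illustrative assumptions. The signal is lagged one day so that the return on day `t+1` is explained only by information available at `t`.

```python
import numpy as np
import pandas as pd


def momentum_signal(prices: pd.DataFrame, lookback: int = 252, skip: int = 21) -> pd.DataFrame:
    """12-1 month return, approximated as price[t-skip] / price[t-lookback] - 1.

    The 252/21 trading-day windows are an assumption, not taken from the spec.
    """
    return prices.shift(skip) / prices.shift(lookback) - 1.0


def long_short_factor(prices: pd.DataFrame, signal: pd.DataFrame, quantile: float = 0.2) -> pd.Series:
    """Equal-weight long-short daily return: long high-signal names, short low-signal names.

    `prices` and `signal` are assumed to share the same index and columns.
    """
    returns = prices / prices.shift(1) - 1.0
    # Lag the signal by one day so day-t returns only use signals known at t-1.
    lagged = signal.shift(1)
    factor = pd.Series(np.nan, index=prices.index, dtype=float)
    for date in prices.index:
        ranked = lagged.loc[date].dropna().sort_values()
        # Leg size adapts to the number of names with a valid signal.
        k = max(int(len(ranked) * quantile), 1)
        if len(ranked) < 2 * k:
            continue  # warmup or too few names: leave NaN for later alignment
        long_ret = returns.loc[date, ranked.index[-k:]].mean()
        short_ret = returns.loc[date, ranked.index[:k]].mean()
        factor.loc[date] = long_ret - short_ret
    return factor
```

Under the same assumptions, `LOWVOL` and `RECOVERY` would reuse `long_short_factor` with a different signal frame (for example, negative rolling volatility, or distance from the rolling 63-day low).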
+
+#### `MKT`
+
+- Benchmark daily return if benchmark exists
+- Otherwise equal-weight universe return
+
+#### `SMB_PROXY`
+
+- Size proxy using inverse price level or market-cap proxy when only price data is available
+- First iteration uses inverse price rank as a transparent proxy and explicitly labels it as proxy
+
+#### `HML_PROXY`
+
+- Value proxy using price-to-range or distance-to-trailing-low style signal
+- This is not a true book-to-market factor and must be labeled proxy
+
+#### `RMW_PROXY`
+
+- Profitability proxy from return consistency and stability
+
+#### `CMA_PROXY`
+
+- Investment proxy from asset trend smoothness or expansion/contraction behavior inferred from price action
+
+Proxy factors are included for model completeness, but the output must label them clearly as proxies rather than standard academic factors.
+
+## Factor Construction Rules
+
+- All local factors use only information available up to date `t` to explain returns at `t+1`
+- No future data leakage
+- Factor series are daily return series, not ranks
+- Long-short factors should be approximately dollar-neutral
+- Missing values are allowed during warmup windows and dropped during model alignment
+- Quantile counts should adapt to available universe size
+
+## Regression Models
+
+### CAPM
+
+Model:
+
+- `strategy_excess_return ~ alpha + (MKT-RF)`
+
+### FF5
+
+Model:
+
+- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA`
+
+### FF5Plus
+
+Model:
+
+- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA + MOM + LOWVOL + RECOVERY`
+
+### Proxy Model
+
+For markets without standard factors:
+
+- `strategy_return ~ alpha + MKT + SMB_PROXY + HML_PROXY + RMW_PROXY + CMA_PROXY + MOM + LOWVOL + RECOVERY`
+
+The module should report which model family was actually used.
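All four model families above reduce to the same OLS fit with an intercept; only the factor columns and the treatment of `RF` differ. The block below is a minimal sketch of that fit using `numpy.linalg.lstsq`, under stated assumptions: the name `run_factor_regression_sketch`, the returned dictionary keys, and the 252-day annualization constant are illustrative and not part of the spec. P-values are omitted so the sketch depends only on NumPy and pandas.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumption: annualization constant for daily data


def run_factor_regression_sketch(strategy_returns, factor_frame, risk_free_col=None):
    """OLS with intercept of (excess) strategy returns on factor returns."""
    # Align on the date intersection and drop warmup NaNs.
    data = factor_frame.join(strategy_returns.rename("strategy"), how="inner").dropna()
    factor_cols = [c for c in factor_frame.columns if c != risk_free_col]
    y = data["strategy"].to_numpy(dtype=float)
    if risk_free_col is not None:
        # Standard factor models explain returns in excess of the risk-free rate.
        y = y - data[risk_free_col].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(data)), data[factor_cols].to_numpy(dtype=float)])
    n_obs, n_params = X.shape
    if n_obs <= n_params:
        raise ValueError("insufficient overlapping observations")
    coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    if rank < n_params:
        raise ValueError("singular factor matrix")
    resid = y - X @ coef
    dof = n_obs - n_params
    sigma2 = float(resid @ resid) / dof
    # Classical (homoskedastic) standard errors.
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t_stats = coef / std_err
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r_squared = 1.0 - float(resid @ resid) / ss_tot
    return {
        "alpha_daily": float(coef[0]),
        "alpha_ann": float(coef[0]) * TRADING_DAYS,  # simple scaling, not compounded
        "alpha_t_stat": float(t_stats[0]),
        "loadings": dict(zip(factor_cols, coef[1:].tolist())),
        "t_stats": dict(zip(factor_cols, t_stats[1:].tolist())),
        "r_squared": r_squared,
        "adj_r_squared": 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof,
        "residual_vol_ann": float(np.sqrt(sigma2 * TRADING_DAYS)),
        "n_obs": int(n_obs),
    }
```

For the proxy model, the same function would be called with `risk_free_col=None` so that raw strategy returns are regressed on the proxy factor columns.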
+
+## Alignment Rules
+
+- Convert all equity curves to daily returns
+- Build factor frames at daily frequency
+- Join strategy returns and factor returns on date intersection
+- For standard factor models, subtract `RF` from strategy returns
+- Keep benchmark return separately for active return diagnostics, but not as a replacement for `MKT-RF` in standard factor models
+
+## Output Schema
+
+### Summary Output
+
+One row per strategy per model with fields including:
+
+- `strategy`
+- `market`
+- `model`
+- `factor_source`
+- `proxy_only`
+- `start_date`
+- `end_date`
+- `n_obs`
+- `alpha_daily`
+- `alpha_ann`
+- `alpha_t_stat`
+- `alpha_p_value`
+- `r_squared`
+- `adj_r_squared`
+- `residual_vol_ann`
+
+Selected factor loadings should also be flattened into summary columns when available:
+
+- `beta_mkt`
+- `beta_smb`
+- `beta_hml`
+- `beta_rmw`
+- `beta_cma`
+- `beta_mom`
+- `beta_lowvol`
+- `beta_recovery`
+
+### Loadings Output
+
+Long-form table:
+
+- `strategy`
+- `model`
+- `factor`
+- `beta`
+- `t_stat`
+- `p_value`
+
+## CLI Changes
+
+Add arguments to `main.py`:
+
+- `--attribution`
+- `--attribution-model {capm,ff5,ff5plus,all}`
+- `--attribution-export `
+
+Behavior:
+
+- If `--attribution` is not set, current behavior is unchanged
+- If set, attribution runs after backtest metrics are printed
+- If export path is set, write:
+  - `summary.csv`
+  - `loadings.csv`
+
+## Terminal Reporting
+
+For each strategy and selected model, print a compact line containing:
+
+- annualized alpha
+- major factor loadings
+- R-squared
+- residual volatility
+
+After the numeric table, print a short interpretation section:
+
+- whether alpha remains after adding factors
+- which factors explain most of the strategy
+- whether the model fit is weak or strong
+
+Interpretation should remain descriptive and avoid overclaiming statistical significance.
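The three flags listed under CLI Changes could be wired into the existing parser roughly as follows. This is a sketch only: the helper names `add_attribution_args` and `resolve_models`, the `DIR` metavar, and the `all` default for `--attribution-model` are assumptions, since the spec does not state a default or how `main.py` currently builds its parser.

```python
import argparse


def add_attribution_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register the attribution flags on an existing parser (hypothetical helper)."""
    parser.add_argument(
        "--attribution",
        action="store_true",
        help="run factor attribution after backtest metrics are printed",
    )
    parser.add_argument(
        "--attribution-model",
        choices=["capm", "ff5", "ff5plus", "all"],
        default="all",  # assumption: the spec does not name a default
        help="which regression model(s) to fit",
    )
    parser.add_argument(
        "--attribution-export",
        metavar="DIR",  # assumption: a directory receiving summary.csv and loadings.csv
        default=None,
        help="optional export location for CSV outputs",
    )
    return parser


def resolve_models(selection: str) -> list:
    """Expand 'all' into the individual model names."""
    return ["capm", "ff5", "ff5plus"] if selection == "all" else [selection]
```

With no flags passed, `args.attribution` is `False`, so the existing backtest path runs unchanged.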
+
+## Error Handling
+
+- External factor download failure:
+  - Use cache if available
+  - Otherwise downgrade to proxy mode
+- Missing or short overlap window:
+  - Skip that model and report insufficient data
+- Singular matrix or severe multicollinearity:
+  - Catch and report model failure or unstable fit
+- Missing benchmark column:
+  - Fall back to equal-weight universe market proxy where possible
+
+## Testing Plan
+
+### Unit Tests
+
+- External factor parser converts dates and percent units correctly
+- Cache loader returns cached data on download failure
+- Extension factor builders produce expected columns and no future leakage
+- Regression on synthetic data recovers approximate known alpha and betas
+
+### Integration Tests
+
+- End-to-end attribution on a small deterministic equity and factor dataset
+- CLI export produces expected files and columns
+
+### Regression Tests
+
+- Fixed local US sample produces stable output shape and model naming
+
+## Implementation Notes
+
+- Prefer `numpy.linalg.lstsq` or `scipy` OLS utilities already available in dependencies
+- Keep implementation dependency-light
+- Keep factor construction functions separate from regression code for testability
+- Avoid changing existing strategy behavior
+
+## Risks
+
+- Standard factor downloads may change source file formatting
+- Proxy factor definitions for CN will be weaker than true academic factors
+- Some strategy returns may be highly collinear with momentum-like factors, reducing interpretability
+- Short or overlapping warmup windows can materially reduce sample size
+
+## Success Criteria
+
+- A user can run backtests with `--attribution` and receive factor-based explanations of returns
+- US runs use standard external factors when available
+- CN runs still produce a clearly labeled proxy attribution report
+- Outputs distinguish residual alpha from factor exposure
+- The module is easy to extend with new factors later