Add factor attribution design spec

2026-04-07 15:01:57 +08:00
parent 14ec64c1da
commit 80493cb6af
1 changed files with 376 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-07-factor-attribution-design.md
+++ b/docs/superpowers/specs/2026-04-07-factor-attribution-design.md
@@ -0,0 +1,376 @@
+# Factor Attribution Design
+
+Date: 2026-04-07
+Repo: `/Users/gahow/projects/quant`
+
+## Goal
+
+Add a factor attribution module that explains strategy returns using:
+
+- Standard external US factors when available: `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF`
+- Local price-derived extension factors: `MOM`, `LOWVOL`, `RECOVERY`
+- Local proxy fallback factors for markets without standard external data
+
+The module must integrate with the current backtest workflow, reuse existing strategy equity curves, cache downloaded factor data locally, and produce both terminal summaries and exportable tabular outputs.
+
+## Scope
+
+In scope:
+
+- New factor attribution module for research backtests
+- US support using external standard factors plus local extension factors
+- CN support using local proxy factors only
+- CAPM, FF5, and FF5-plus-extension models
+- CLI flags in `main.py` to enable attribution and export results
+- Tests for parsing, factor construction, and regression behavior
+
+Out of scope for this iteration:
+
+- Intraday attribution
+- Portfolio optimizer changes
+- Live trader attribution in `trader.py`
+- Notebook or plotting UI for attribution results
+- External fundamental datasets beyond standard downloadable factor files
+
+## Existing Context
+
+The repo already has:
+
+- A vectorized backtest engine in `main.py`
+- Strategy implementations that produce daily target weights
+- Performance metrics in `metrics.py`
+- Local daily price caches in `data/us.csv`, `data/us_open.csv`, `data/cn.csv`
+
+The current "alpha" in `trader.py simulate` is simply total return minus benchmark return. The new module adds regression-based alpha and factor exposure analysis.
+
+## Design Overview
+
+Add a new module `factor_attribution.py` with four responsibilities:
+
+1. Load and cache factor datasets
+2. Build local extension and proxy factors from existing price data
+3. Run regression models against strategy daily returns
+4. Render summary tables and export detailed results
+
+`main.py` remains the orchestration point. It will continue running backtests and benchmark normalization, then optionally invoke attribution on the resulting daily return series.
+
+## Module Structure
+
+### `factor_attribution.py`
+
+Planned top-level responsibilities:
+
+- `load_external_us_factors(...)`
+  - Download Ken French daily factor files
+  - Parse, normalize, convert percent to decimal
+  - Cache to `data/factors/`
+  - Fall back to cache when network fetch fails
+
+- `build_extension_factors(price_data, benchmark, market)`
+  - Build local daily factor return series for:
+    - `MOM`
+    - `LOWVOL`
+    - `RECOVERY`
+
+- `build_proxy_core_factors(price_data, benchmark, market)`
+  - Used mainly for CN or when external factors are unavailable
+  - Build daily proxy series for:
+    - `MKT`
+    - `SMB_PROXY`
+    - `HML_PROXY`
+    - `RMW_PROXY`
+    - `CMA_PROXY`
+
+- `prepare_factor_models(...)`
+  - Merge standard factors and local factors
+  - Produce factor matrices for:
+    - `capm`
+    - `ff5`
+    - `ff5plus`
+
+- `run_factor_regression(strategy_returns, factor_frame, risk_free_col)`
+  - Fit OLS with intercept
+  - Return alpha, annualized alpha, loadings, t-stats, p-values, R-squared, adjusted R-squared, residual volatility, date range, and observation count
+
+- `attribute_strategies(results_df, benchmark_series, price_data, market, model_selection)`
+  - Convert equity curves to returns
+  - Run attribution for each strategy
+  - Return structured summary and long-form loadings tables
+
+- `print_attribution_summary(...)`
+  - Render compact terminal output
+
+- `export_attribution(...)`
+  - Write CSV outputs
+
+## Data Sources
+
+### US Standard Factors
+
+Preferred source:
+
+- Ken French daily factor datasets for:
+  - Fama-French 5 Factors daily
+  - Momentum daily, only if an external momentum series is required in addition to the local `MOM`
+
+Normalization rules:
+
+- Convert index to pandas `DatetimeIndex`
+- Convert values from percent to decimal returns
+- Keep `RF` as decimal daily risk-free rate
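+
+A minimal sketch of these normalization rules. It assumes the downloaded file is CSV text with a free-text preamble, a header row beginning with `,Mkt-RF`, `YYYYMMDD` date keys, and a trailing footer. The name `parse_french_daily` is hypothetical, and the layout assumption should be checked against the actual file:
+
+```python
+import io
+
+import pandas as pd
+
+
+def parse_french_daily(text: str) -> pd.DataFrame:
+    """Parse daily factor CSV text into decimal returns (hypothetical helper)."""
+    lines = text.splitlines()
+    # Assumption: the header row is the first line starting with ",Mkt-RF".
+    start = next(
+        i for i, line in enumerate(lines)
+        if line.replace(" ", "").startswith(",Mkt-RF")
+    )
+    rows = [lines[start]]
+    for line in lines[start + 1:]:
+        key = line.split(",")[0].strip()
+        if not (len(key) == 8 and key.isdigit()):
+            break  # stop at the first non-date row (blank line or footer)
+        rows.append(line)
+    frame = pd.read_csv(
+        io.StringIO("\n".join(rows)), index_col=0, skipinitialspace=True
+    )
+    # Integer keys such as 19630701 become a DatetimeIndex.
+    frame.index = pd.to_datetime(frame.index.astype(str), format="%Y%m%d")
+    frame.index.name = "date"
+    frame.columns = [col.strip().upper() for col in frame.columns]
+    # Percent to decimal for every column, including RF.
+    return frame.astype(float) / 100.0
+```
+
+Stopping at the first non-date row keeps the parser from reading any annual or footer block that may follow the daily data.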
+
+Cache location:
+
+- `data/factors/ff5_us_daily.csv`
+- `data/factors/mom_us_daily.csv`
+
+If the source format changes or download fails:
+
+- Use the latest local cache if present
+- Otherwise fall back to local proxy factors and mark the run as `proxy_only`
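+
+The fallback order above could be sketched as follows. The helper name `load_with_cache`, the `fetch` callable, and the returned source labels other than `proxy_only` are assumptions for illustration, not part of the planned API:
+
+```python
+from pathlib import Path
+from typing import Callable, Optional, Tuple
+
+import pandas as pd
+
+
+def load_with_cache(
+    fetch: Callable[[], pd.DataFrame], cache_path: Path
+) -> Tuple[Optional[pd.DataFrame], str]:
+    """Try a fresh download, then the local cache, then signal proxy mode."""
+    try:
+        frame = fetch()  # may raise on network or parse errors
+        cache_path.parent.mkdir(parents=True, exist_ok=True)
+        frame.to_csv(cache_path)
+        return frame, "external"
+    except Exception:
+        if cache_path.exists():
+            cached = pd.read_csv(cache_path, index_col=0, parse_dates=True)
+            return cached, "cache"
+        # No data at all: the caller should build proxy factors instead.
+        return None, "proxy_only"
+```
+
+Returning the source label alongside the frame lets the caller populate `factor_source` and `proxy_only` in the summary output without re-deriving them.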
+
+### Local Price Inputs
+
+Reuse repo price caches:
+
+- US: `data/us.csv`, `data/us_open.csv`
+- CN: `data/cn.csv`
+
+Only adjusted close prices are required for attribution factor construction.
+
+## Factor Definitions
+
+### Standard Factors
+
+For US:
+
+- `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF` from external factor data
+
+### Local Extension Factors
+
+These are built from the same universe already used by the repo.
+
+#### `MOM`
+
+- Cross-sectional momentum long-short factor
+- Rank stocks by 12-1 month return (trailing 12-month return, skipping the most recent month)
+- Long top quantile, short bottom quantile
+- Equal weight within long and short legs
+- Factor return is long return minus short return
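+
+As a rough illustration of the ranking signal, the 12-1 month return can be computed from a wide price frame (dates by tickers). The 21 and 252 trading-day conventions and the name `momentum_12_1` are assumptions:
+
+```python
+import pandas as pd
+
+
+def momentum_12_1(
+    prices: pd.DataFrame, month: int = 21, year: int = 252
+) -> pd.DataFrame:
+    """Trailing 12-month return that skips the most recent month.
+
+    Assumes about 21 trading days per month and 252 per year.
+    """
+    # Price one month ago divided by price twelve months ago, minus one.
+    return prices.shift(month) / prices.shift(year) - 1.0
+```
+
+The first `year` rows are NaN by construction, which matches the warmup behavior described under Factor Construction Rules.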
+
+#### `LOWVOL`
+
+- Cross-sectional low-volatility factor
+- Compute rolling volatility from daily returns
+- Long lowest-vol quantile, short highest-vol quantile
+- Equal weight within legs
+
+#### `RECOVERY`
+
+- Cross-sectional recovery factor
+- Rank stocks by distance from rolling 63-day low
+- Long strongest recovery names, short weakest recovery names
+- Equal weight within legs
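+
+The `LOWVOL` and `RECOVERY` ranking signals could be sketched in the same style. The 63-day volatility window is an assumption (the spec fixes the window only for `RECOVERY`), and both function names are hypothetical:
+
+```python
+import pandas as pd
+
+
+def lowvol_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
+    """Negative rolling volatility, so that a higher value means lower vol."""
+    daily_returns = prices / prices.shift(1) - 1.0
+    return -daily_returns.rolling(window).std()
+
+
+def recovery_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
+    """Fractional distance of the current price above its rolling low."""
+    rolling_low = prices.rolling(window).min()
+    return prices / rolling_low - 1.0
+```
+
+Negating volatility keeps one convention across all three signals: the long leg always takes the highest-ranked names.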
+
+### Proxy Core Factors
+
+Used for CN by default and as fallback for US.
+
+#### `MKT`
+
+- Benchmark daily return if benchmark exists
+- Otherwise equal-weight universe return
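+
+A small sketch of this fallback, assuming universe returns arrive as a wide frame and the benchmark as an optional series (the name `market_proxy` is hypothetical):
+
+```python
+from typing import Optional
+
+import pandas as pd
+
+
+def market_proxy(
+    universe_returns: pd.DataFrame,
+    benchmark_returns: Optional[pd.Series] = None,
+) -> pd.Series:
+    """Use the benchmark when it has data, else the equal-weight universe."""
+    if benchmark_returns is not None and benchmark_returns.notna().any():
+        return benchmark_returns.rename("MKT")
+    # Equal-weight mean across names; NaNs are skipped per row.
+    return universe_returns.mean(axis=1).rename("MKT")
+```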
+
+#### `SMB_PROXY`
+
+- Size proxy built from price data alone, using inverse price level or another market-cap proxy
+- First iteration uses inverse price rank as a transparent proxy and explicitly labels it as proxy
+
+#### `HML_PROXY`
+
+- Value proxy using a price-to-range or distance-to-trailing-low style signal
+- This is not a true book-to-market factor and must be labeled proxy
+
+#### `RMW_PROXY`
+
+- Profitability proxy from return consistency and stability
+
+#### `CMA_PROXY`
+
+- Investment proxy from asset trend smoothness or expansion/contraction behavior inferred from price action
+
+Proxy factors are included for model completeness, but the output must label them clearly as proxies rather than standard academic factors.
+
+## Factor Construction Rules
+
+- All local factors form their signals (ranks and leg membership) using only information available up to date `t`, and apply them to returns realized at `t+1`
+- No future data leakage
+- Factor series are daily return series, not ranks
+- Long-short factors should be approximately dollar-neutral
+- Missing values are allowed during warmup windows and dropped during model alignment
+- Quantile counts should adapt to available universe size
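+
+These rules can be illustrated with a generic long-short builder. This is a sketch under stated assumptions rather than the planned implementation: `long_short_factor`, the default quantile of 0.2, and the minimum leg size are all hypothetical:
+
+```python
+import numpy as np
+import pandas as pd
+
+
+def long_short_factor(
+    signal: pd.DataFrame,
+    returns: pd.DataFrame,
+    quantile: float = 0.2,
+    min_names: int = 1,
+) -> pd.Series:
+    """Equal-weight long-short daily return series from a ranking signal."""
+    lagged = signal.shift(1)  # signal known at t is applied to the t+1 return
+    out = pd.Series(np.nan, index=returns.index)
+    for date in returns.index:
+        day_returns = returns.loc[date]
+        scores = lagged.loc[date].dropna()
+        # Keep only names that also have a return on this date.
+        scores = scores[day_returns.reindex(scores.index).notna()]
+        # Leg size adapts to the number of names available that day.
+        leg = max(min_names, int(len(scores) * quantile))
+        if len(scores) < 2 * leg:
+            continue  # warmup or too few names: leave NaN
+        ranked = scores.sort_values()
+        short_names = ranked.index[:leg]
+        long_names = ranked.index[-leg:]
+        # Equal-weight legs of equal size keep the factor dollar-neutral.
+        out.loc[date] = (
+            day_returns[long_names].mean() - day_returns[short_names].mean()
+        )
+    return out
+```
+
+Shifting the signal by one row is what enforces the no-leakage rule: changing the signal on the final date cannot alter any factor return in the output.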
+
+## Regression Models
+
+### CAPM
+
+Model:
+
+- `strategy_excess_return ~ alpha + (MKT-RF)`
+
+### FF5
+
+Model:
+
+- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA`
+
+### FF5Plus
+
+Model:
+
+- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA + MOM + LOWVOL + RECOVERY`
+
+### Proxy Model
+
+For markets without standard factors:
+
+- `strategy_return ~ alpha + MKT + SMB_PROXY + HML_PROXY + RMW_PROXY + CMA_PROXY + MOM + LOWVOL + RECOVERY`
+
+The module should report which model family was actually used.
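+
+All four models share the same OLS-with-intercept fit and differ only in the factor columns passed in. A minimal NumPy sketch follows; the name `fit_ols`, the 252-day annualization, and the simple `alpha_daily * 252` scaling are assumptions, and p-values are omitted because they would need a t-distribution CDF (for example from `scipy.stats`):
+
+```python
+import numpy as np
+import pandas as pd
+
+
+def fit_ols(
+    y: pd.Series, factors: pd.DataFrame, periods_per_year: int = 252
+) -> dict:
+    """OLS with intercept of (excess) returns on factor returns."""
+    data = pd.concat([y.rename("_y"), factors], axis=1, join="inner").dropna()
+    target = data["_y"].to_numpy()
+    design = np.column_stack(
+        [np.ones(len(data)), data[factors.columns].to_numpy()]
+    )
+    n_obs, n_params = design.shape
+    if n_obs <= n_params:
+        raise ValueError("insufficient observations for this model")
+    coef, _, rank, _ = np.linalg.lstsq(design, target, rcond=None)
+    if rank < n_params:
+        raise ValueError("singular or collinear factor matrix")
+    resid = target - design @ coef
+    dof = n_obs - n_params
+    sigma2 = float(resid @ resid) / dof
+    # Classical (homoskedastic) covariance of the coefficient estimates.
+    cov = sigma2 * np.linalg.inv(design.T @ design)
+    t_stats = coef / np.sqrt(np.diag(cov))
+    ss_tot = float(((target - target.mean()) ** 2).sum())
+    r_squared = 1.0 - float(resid @ resid) / ss_tot
+    names = ["alpha"] + list(factors.columns)
+    return {
+        "n_obs": n_obs,
+        "start_date": data.index[0],
+        "end_date": data.index[-1],
+        "alpha_daily": float(coef[0]),
+        "alpha_ann": float(coef[0]) * periods_per_year,
+        "betas": dict(zip(names[1:], coef[1:])),
+        "t_stats": dict(zip(names, t_stats)),
+        "r_squared": r_squared,
+        "adj_r_squared": 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof,
+        "residual_vol_ann": float(np.sqrt(sigma2 * periods_per_year)),
+    }
+```
+
+For CAPM the caller would pass only the `MKT-RF` column, for FF5 the five standard columns, and so on. The two `ValueError` paths correspond to the insufficient-data and singular-matrix cases listed under Error Handling.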
+
+## Alignment Rules
+
+- Convert all equity curves to daily returns
+- Build factor frames at daily frequency
+- Join strategy returns and factor returns on date intersection
+- For standard factor models, subtract `RF` from strategy returns
+- Keep benchmark return separately for active return diagnostics, but not as a replacement for `MKT-RF` in standard factor models
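+
+A short sketch of the equity-to-excess-return step, assuming the equity curve and `RF` are date-indexed series (the name `to_excess_returns` is hypothetical):
+
+```python
+import pandas as pd
+
+
+def to_excess_returns(equity: pd.Series, rf: pd.Series) -> pd.Series:
+    """Daily strategy returns minus RF on the intersection of dates."""
+    daily = (equity / equity.shift(1) - 1.0).dropna()
+    aligned = pd.concat(
+        [daily.rename("ret"), rf.rename("rf")], axis=1, join="inner"
+    ).dropna()
+    return aligned["ret"] - aligned["rf"]
+```
+
+For the proxy model, which has no `RF` column, the plain daily returns would be used instead.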
+
+## Output Schema
+
+### Summary Output
+
+One row per strategy per model with fields including:
+
+- `strategy`
+- `market`
+- `model`
+- `factor_source`
+- `proxy_only`
+- `start_date`
+- `end_date`
+- `n_obs`
+- `alpha_daily`
+- `alpha_ann`
+- `alpha_t_stat`
+- `alpha_p_value`
+- `r_squared`
+- `adj_r_squared`
+- `residual_vol_ann`
+
+Selected factor loadings should also be flattened into summary columns when available:
+
+- `beta_mkt`
+- `beta_smb`
+- `beta_hml`
+- `beta_rmw`
+- `beta_cma`
+- `beta_mom`
+- `beta_lowvol`
+- `beta_recovery`
+
+### Loadings Output
+
+Long-form table:
+
+- `strategy`
+- `model`
+- `factor`
+- `beta`
+- `t_stat`
+- `p_value`
+
+## CLI Changes
+
+Add arguments to `main.py`:
+
+- `--attribution`
+- `--attribution-model {capm,ff5,ff5plus,all}`
+- `--attribution-export <dir>`
+
+Behavior:
+
+- If `--attribution` is not set, current behavior is unchanged
+- If set, attribution runs after backtest metrics are printed
+- If export path is set, write:
+  - `summary.csv`
+  - `loadings.csv`
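+
+The three flags could be wired with `argparse` roughly as follows. The helper name `add_attribution_args` and the default model value of `all` are assumptions, since the spec does not state a default:
+
+```python
+import argparse
+
+
+def add_attribution_args(
+    parser: argparse.ArgumentParser,
+) -> argparse.ArgumentParser:
+    """Register the attribution flags on an existing parser."""
+    parser.add_argument(
+        "--attribution",
+        action="store_true",
+        help="run factor attribution after backtest metrics are printed",
+    )
+    parser.add_argument(
+        "--attribution-model",
+        choices=["capm", "ff5", "ff5plus", "all"],
+        default="all",  # assumed default, not specified in this design
+        help="which regression model(s) to fit",
+    )
+    parser.add_argument(
+        "--attribution-export",
+        metavar="DIR",
+        default=None,
+        help="directory for summary.csv and loadings.csv",
+    )
+    return parser
+```
+
+Because `--attribution` is a `store_true` flag that defaults to `False`, existing invocations of `main.py` parse exactly as before.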
+
+## Terminal Reporting
+
+For each strategy and selected model, print a compact line containing:
+
+- annualized alpha
+- major factor loadings
+- R-squared
+- residual volatility
+
+After the numeric table, print a short interpretation section:
+
+- whether alpha remains after adding factors
+- which factors explain most of the strategy
+- whether the model fit is weak or strong
+
+Interpretation should remain descriptive and avoid overclaiming statistical significance.
+
+## Error Handling
+
+- External factor download failure:
+  - Use cache if available
+  - Otherwise downgrade to proxy mode
+- Missing or short overlap window:
+  - Skip that model and report insufficient data
+- Singular matrix or severe multicollinearity:
+  - Catch and report model failure or unstable fit
+- Missing benchmark column:
+  - Fall back to equal-weight universe market proxy where possible
+
+## Testing Plan
+
+### Unit Tests
+
+- External factor parser converts dates and percent units correctly
+- Cache loader returns cached data on download failure
+- Extension factor builders produce expected columns and no future leakage
+- Regression on synthetic data recovers approximate known alpha and betas
+
+### Integration Tests
+
+- End-to-end attribution on a small deterministic equity and factor dataset
+- CLI export produces expected files and columns
+
+### Regression Tests
+
+- Fixed local US sample produces stable output shape and model naming
+
+## Implementation Notes
+
+- Prefer OLS utilities already available in the dependencies, such as `numpy.linalg.lstsq` or `scipy`
+- Keep implementation dependency-light
+- Keep factor construction functions separate from regression code for testability
+- Avoid changing existing strategy behavior
+
+## Risks
+
+- Standard factor downloads may change source file formatting
+- Proxy factor definitions for CN will be weaker than true academic factors
+- Some strategy returns may be highly collinear with momentum-like factors, reducing interpretability
+- Short overlap windows, or long factor warmup windows, can materially reduce sample size
+
+## Success Criteria
+
+- A user can run backtests with `--attribution` and receive factor-based explanations of returns
+- US runs use standard external factors when available
+- CN runs still produce a clearly labeled proxy attribution report
+- Outputs distinguish residual alpha from factor exposure
+- The module is easy to extend with new factors later