Add factor attribution design spec

2026-04-07 15:01:57 +08:00
parent 14ec64c1da
commit 80493cb6af


@@ -0,0 +1,376 @@
# Factor Attribution Design
Date: 2026-04-07
Repo: `/Users/gahow/projects/quant`
## Goal
Add a factor attribution module that explains strategy returns using:
- Standard external US factors when available: `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF`
- Local price-derived extension factors: `MOM`, `LOWVOL`, `RECOVERY`
- Local proxy fallback factors for markets without standard external data
The module must integrate with the current backtest workflow, reuse existing strategy equity curves, cache downloaded factor data locally, and produce both terminal summaries and exportable tabular outputs.
## Scope
In scope:
- New factor attribution module for research backtests
- US support using external standard factors plus local extension factors
- CN support using local proxy factors only
- CAPM, FF5, and FF5-plus-extension models
- CLI flags in `main.py` to enable attribution and export results
- Tests for parsing, factor construction, and regression behavior
Out of scope for this iteration:
- Intraday attribution
- Portfolio optimizer changes
- Live trader attribution in `trader.py`
- Notebook or plotting UI for attribution results
- External fundamental datasets beyond standard downloadable factor files
## Existing Context
The repo already has:
- A vectorized backtest engine in `main.py`
- Strategy implementations that produce daily target weights
- Performance metrics in `metrics.py`
- Local daily price caches in `data/us.csv`, `data/us_open.csv`, `data/cn.csv`
Current "alpha" in `trader.py simulate` is only total return minus benchmark return. The new module adds regression-based alpha and factor exposure analysis.
## Design Overview
Add a new module `factor_attribution.py` with four responsibilities:
1. Load and cache factor datasets
2. Build local extension and proxy factors from existing price data
3. Run regression models against strategy daily returns
4. Render summary tables and export detailed results
`main.py` remains the orchestration point. It will continue running backtests and benchmark normalization, then optionally invoke attribution on the resulting daily return series.
## Module Structure
### `factor_attribution.py`
Planned top-level responsibilities:
- `load_external_us_factors(...)`
  - Download Ken French daily factor files
  - Parse, normalize, convert percent to decimal
  - Cache to `data/factors/`
  - Fall back to cache when network fetch fails
- `build_extension_factors(price_data, benchmark, market)`
  - Build local daily factor return series for:
    - `MOM`
    - `LOWVOL`
    - `RECOVERY`
- `build_proxy_core_factors(price_data, benchmark, market)`
  - Used mainly for CN or when external factors are unavailable
  - Build daily proxy series for:
    - `MKT`
    - `SMB_PROXY`
    - `HML_PROXY`
    - `RMW_PROXY`
    - `CMA_PROXY`
- `prepare_factor_models(...)`
  - Merge standard factors and local factors
  - Produce factor matrices for:
    - `capm`
    - `ff5`
    - `ff5plus`
- `run_factor_regression(strategy_returns, factor_frame, risk_free_col)`
  - Fit OLS with intercept
  - Return alpha, annualized alpha, loadings, t-stats, p-values, R-squared, adjusted R-squared, residual volatility, date range, and observation count
- `attribute_strategies(results_df, benchmark_series, price_data, market, model_selection)`
  - Convert equity curves to returns
  - Run attribution for each strategy
  - Return structured summary and long-form loadings tables
- `print_attribution_summary(...)`
  - Render compact terminal output
- `export_attribution(...)`
  - Write CSV outputs
## Data Sources
### US Standard Factors
Preferred source:
- Ken French daily factor datasets for:
  - Fama-French 5 Factors daily
  - Momentum daily if separately required
Normalization rules:
- Convert index to pandas `DatetimeIndex`
- Convert values from percent to decimal returns
- Keep `RF` as decimal daily risk-free rate
Cache location:
- `data/factors/ff5_us_daily.csv`
- `data/factors/mom_us_daily.csv`
If the source format changes or download fails:
- Use the latest local cache if present
- Otherwise fall back to local proxy factors and mark the run as `proxy_only`
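The normalization rules in this section could be sketched roughly as follows. This is a minimal illustration, not the final parser: it assumes the raw file has already been read into a frame indexed by `YYYYMMDD` dates with percent-valued columns. The helper name `normalize_factor_frame` is hypothetical and the sample rows are made-up numbers.

```python
import io

import pandas as pd


def normalize_factor_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn a raw percent-valued factor table into decimal daily returns."""
    out = raw.copy()
    # Assumption: the raw index holds dates as YYYYMMDD integers or strings.
    out.index = pd.to_datetime(out.index.astype(str), format="%Y%m%d")
    out.index.name = "date"
    # Upper-case the headers so "Mkt-RF" matches the spec's `MKT-RF`.
    out.columns = [str(c).strip().upper() for c in out.columns]
    # Percent to decimal; RF stays in the frame as a decimal daily rate.
    return out.astype(float) / 100.0


# Made-up rows, only to exercise the function.
sample = io.StringIO(
    ",Mkt-RF,SMB,HML,RMW,CMA,RF\n"
    "20240102,-0.70,0.10,0.85,0.20,0.35,0.02\n"
    "20240103,-1.01,-1.80,0.05,0.40,0.10,0.02\n"
)
factors = normalize_factor_frame(pd.read_csv(sample, index_col=0))
```

Keeping normalization separate from the download step means the parser can be unit-tested on an in-memory string without network access.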
### Local Price Inputs
Reuse repo price caches:
- US: `data/us.csv`, `data/us_open.csv`
- CN: `data/cn.csv`
Only adjusted close prices are required for attribution factor construction.
## Factor Definitions
### Standard Factors
For US:
- `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF` from external factor data
### Local Extension Factors
These are built from the same universe already used by the repo.
#### `MOM`
- Cross-sectional momentum long-short factor
- Rank stocks by 12-1 month return
- Long top quantile, short bottom quantile
- Equal weight within long and short legs
- Factor return is long return minus short return
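As a rough sketch of the 12-1 ranking signal only (not the long-short return itself), the code below assumes 252 trading days per year and 21 per month; both constants and the name `momentum_12_1` are illustrative assumptions, and the price data is synthetic.

```python
import numpy as np
import pandas as pd


def momentum_12_1(prices: pd.DataFrame, lookback: int = 252, skip: int = 21) -> pd.DataFrame:
    """Return from `lookback` days ago to `skip` days ago, per column.

    Skipping the most recent `skip` days is the "minus 1 month" part of 12-1.
    """
    return prices.shift(skip) / prices.shift(lookback) - 1.0


# Synthetic prices: one steadily rising name, one flat name.
dates = pd.bdate_range("2023-01-02", periods=300)
prices = pd.DataFrame(
    {"AAA": 100.0 * 1.001 ** np.arange(300), "BBB": np.full(300, 50.0)},
    index=dates,
)
signal = momentum_12_1(prices)
```

The first 252 rows are `NaN` by construction, which is the warmup window that later gets dropped during model alignment.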
#### `LOWVOL`
- Cross-sectional low-volatility factor
- Compute rolling volatility from daily returns
- Long lowest-vol quantile, short highest-vol quantile
- Equal weight within legs
#### `RECOVERY`
- Cross-sectional recovery factor
- Rank stocks by distance from rolling 63-day low
- Long strongest recovery names, short weakest recovery names
- Equal weight within legs
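The `LOWVOL` and `RECOVERY` ranking signals might look like the sketch below. The 63-day volatility window is an assumption (the spec fixes 63 days only for the recovery low), and both function names are hypothetical. Each signal is oriented so that a higher value means "belongs in the long leg".

```python
import numpy as np
import pandas as pd


def low_vol_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
    """Negative rolling volatility, so calmer names score higher."""
    daily = prices.pct_change(fill_method=None)
    return -daily.rolling(window).std()


def recovery_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
    """Fractional distance of the current price above its rolling low."""
    rolling_low = prices.rolling(window).min()
    return prices / rolling_low - 1.0


# Synthetic universe: a calm riser, a choppy name, and a flat name.
n = 100
dates = pd.bdate_range("2024-01-02", periods=n)
calm = np.where(np.arange(n) % 2 == 0, 0.001, 0.002)
wild = np.where(np.arange(n) % 2 == 0, 0.05, -0.04)
prices = pd.DataFrame(
    {
        "CALM": 100.0 * np.cumprod(1.0 + calm),
        "WILD": 100.0 * np.cumprod(1.0 + wild),
        "FLAT": np.full(n, 20.0),
    },
    index=dates,
)
low_vol = low_vol_signal(prices)
recovery = recovery_signal(prices)
```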
### Proxy Core Factors
Used for CN by default and as fallback for US.
#### `MKT`
- Benchmark daily return if benchmark exists
- Otherwise equal-weight universe return
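A minimal sketch of this selection rule, assuming the benchmark (if any) is stored as a column of the same price frame; `market_proxy` and the tickers are hypothetical.

```python
import pandas as pd


def market_proxy(prices: pd.DataFrame, benchmark=None) -> pd.Series:
    """Benchmark daily return if present, else equal-weight universe return."""
    daily = prices.pct_change(fill_method=None)
    if benchmark is not None and benchmark in daily.columns:
        return daily[benchmark].rename("MKT")
    # No usable benchmark: average across all columns, skipping NaNs.
    return daily.mean(axis=1).rename("MKT")


dates = pd.date_range("2024-01-01", periods=2, freq="D")
prices = pd.DataFrame(
    {"SPY": [400.0, 404.0], "A": [10.0, 11.0], "B": [20.0, 21.0]},
    index=dates,
)
with_benchmark = market_proxy(prices, benchmark="SPY")
without_benchmark = market_proxy(prices[["A", "B"]], benchmark="SPY")
```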
#### `SMB_PROXY`
- Size proxy using inverse price level or market-cap proxy when only price data is available
- First iteration uses inverse price rank as a transparent proxy and explicitly labels it as proxy
#### `HML_PROXY`
- Value proxy using price-to-range or distance-to-trailing-low style signal
- This is not a true book-to-market factor and must be labeled proxy
#### `RMW_PROXY`
- Profitability proxy from return consistency and stability
#### `CMA_PROXY`
- Investment proxy from asset trend smoothness or expansion/contraction behavior inferred from price action
Proxy factors are included for model completeness, but the output must label them clearly as proxies rather than standard academic factors.
## Factor Construction Rules
- Portfolios for each local factor are formed using only information available up to date `t` and earn the return realized at `t+1`
- No future data leakage
- Factor series are daily return series, not ranks
- Long-short factors should be approximately dollar-neutral
- Missing values are allowed during warmup windows and dropped during model alignment
- Quantile counts should adapt to available universe size
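One way to satisfy these rules is sketched below. It assumes `signal` and `returns` share the same dates and columns; the function name, the 20% default quantile, and the "at least one name per leg" rule are illustrative assumptions rather than settled design.

```python
import numpy as np
import pandas as pd


def long_short_factor(signal: pd.DataFrame, returns: pd.DataFrame, quantile: float = 0.2) -> pd.Series:
    """Equal-weight long-short return series built from a lagged ranking signal."""
    lagged = signal.shift(1)  # the signal known at t-1 picks the legs for the return at t
    factor = pd.Series(np.nan, index=returns.index, name="factor")
    for date in returns.index:
        day_ret = returns.loc[date]
        scores = lagged.loc[date]
        scores = scores[scores.notna() & day_ret.notna()]
        # Leg size scales with how many names are usable on this date.
        k = max(int(len(scores) * quantile), 1)
        if len(scores) < 2 * k:
            continue  # too few names for two disjoint legs; leave NaN
        ranked = scores.sort_values()  # note: tie order is not guaranteed
        short_leg = day_ret[ranked.index[:k]].mean()
        long_leg = day_ret[ranked.index[-k:]].mean()
        factor.loc[date] = long_leg - short_leg  # equal notional on each side
    return factor


dates = pd.date_range("2024-01-01", periods=3, freq="D")
signal = pd.DataFrame({"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}, index=dates)
returns = pd.DataFrame({"A": 0.01, "B": 0.0, "C": 0.0, "D": 0.03}, index=dates)
factor = long_short_factor(signal, returns, quantile=0.25)
```

The first date stays `NaN` because no lagged signal exists yet, which is the warmup behaviour the rules above allow.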
## Regression Models
### CAPM
Model:
- `strategy_excess_return ~ alpha + (MKT-RF)`
### FF5
Model:
- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA`
### FF5Plus
Model:
- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA + MOM + LOWVOL + RECOVERY`
### Proxy Model
For markets without standard factors:
- `strategy_return ~ alpha + MKT + SMB_PROXY + HML_PROXY + RMW_PROXY + CMA_PROXY + MOM + LOWVOL + RECOVERY`
The module should report which model family was actually used.
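All four model families reduce to the same OLS-with-intercept fit on different factor columns. The sketch below uses `numpy.linalg.lstsq` and computes classical (homoskedastic) standard errors by hand. Annualizing by simple scaling with 252 trading days is an assumption, `run_ols` is a hypothetical name, and the data is synthetic with known coefficients. P-values are left out; they could be derived from the t-statistics with a Student-t distribution on `n_obs - n_params` degrees of freedom.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumption: simple scaling for annualized figures


def run_ols(y: pd.Series, X: pd.DataFrame) -> dict:
    """OLS of y on an intercept plus the columns of X."""
    names = ["alpha"] + list(X.columns)
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
    target = y.to_numpy(dtype=float)
    n_obs, n_params = A.shape
    if n_obs <= n_params:
        raise ValueError("not enough observations for this model")
    coef, _, rank, _ = np.linalg.lstsq(A, target, rcond=None)
    if rank < n_params:
        raise ValueError("factor matrix is rank deficient")
    resid = target - A @ coef
    dof = n_obs - n_params
    sigma2 = float(resid @ resid) / dof
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    total_ss = float(((target - target.mean()) ** 2).sum())
    r2 = 1.0 - float(resid @ resid) / total_ss
    return {
        "coef": pd.Series(coef, index=names),
        "t_stat": pd.Series(coef / std_err, index=names),
        "alpha_ann": coef[0] * TRADING_DAYS,
        "r_squared": r2,
        "adj_r_squared": 1.0 - (1.0 - r2) * (n_obs - 1) / dof,
        "residual_vol_ann": np.sqrt(sigma2) * np.sqrt(TRADING_DAYS),
        "n_obs": n_obs,
    }


# Synthetic check: known alpha and betas plus small noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(0.0, 0.01, size=(2000, 2)), columns=["MKT-RF", "MOM"])
y = 0.0002 + 1.2 * X["MKT-RF"] - 0.5 * X["MOM"] + rng.normal(0.0, 0.001, size=2000)
fit = run_ols(y, X)
```

Raising on rank deficiency gives the caller a single place to catch and report a failed or unstable fit.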
## Alignment Rules
- Convert all equity curves to daily returns
- Build factor frames at daily frequency
- Join strategy returns and factor returns on date intersection
- For standard factor models, subtract `RF` from strategy returns
- Keep benchmark return separately for active return diagnostics, but not as a replacement for `MKT-RF` in standard factor models
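For the standard-factor case, the alignment steps could be sketched as below. `align_for_regression` is a hypothetical helper and the numbers are made up; proxy models would skip the `RF` subtraction.

```python
import pandas as pd


def align_for_regression(equity: pd.Series, factors: pd.DataFrame, rf_col: str = "RF"):
    """Equity curve -> daily returns, inner-join with factors, subtract RF."""
    strategy = equity.pct_change(fill_method=None).rename("strategy")
    # The inner join keeps only dates present on both sides; dropna removes warmup rows.
    joined = pd.concat([strategy, factors], axis=1, join="inner").dropna()
    excess = (joined["strategy"] - joined[rf_col]).rename("strategy_excess_return")
    return excess, joined.drop(columns=["strategy", rf_col])


equity = pd.Series(
    [100.0, 101.0, 102.01, 103.0301],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
factors = pd.DataFrame(
    {"MKT-RF": [0.004, 0.006, 0.005, 0.003], "RF": [0.0001] * 4},
    index=pd.date_range("2024-01-02", periods=4, freq="D"),
)
y, X = align_for_regression(equity, factors)
```

Here the equity curve covers January 1 to 4 and the factors cover January 2 to 5, so three overlapping dates survive.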
## Output Schema
### Summary Output
One row per strategy per model with fields including:
- `strategy`
- `market`
- `model`
- `factor_source`
- `proxy_only`
- `start_date`
- `end_date`
- `n_obs`
- `alpha_daily`
- `alpha_ann`
- `alpha_t_stat`
- `alpha_p_value`
- `r_squared`
- `adj_r_squared`
- `residual_vol_ann`
Selected factor loadings should also be flattened into summary columns when available:
- `beta_mkt`
- `beta_smb`
- `beta_hml`
- `beta_rmw`
- `beta_cma`
- `beta_mom`
- `beta_lowvol`
- `beta_recovery`
### Loadings Output
Long-form table:
- `strategy`
- `model`
- `factor`
- `beta`
- `t_stat`
- `p_value`
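The two outputs are related by a pivot: the flattened `beta_*` summary columns can be derived from the long-form loadings table. The sketch below shows one possible mapping for standard factors only. The strategy name and numbers are made up, and `beta_column` is a hypothetical naming helper that does not settle how proxy factors should be named.

```python
import pandas as pd

# Made-up long-form loadings for a single strategy and model.
loadings = pd.DataFrame(
    [
        ("example_strategy", "ff5", "MKT-RF", 0.92, 25.1, 0.000),
        ("example_strategy", "ff5", "SMB", -0.10, -1.8, 0.072),
        ("example_strategy", "ff5", "HML", 0.05, 0.9, 0.368),
    ],
    columns=["strategy", "model", "factor", "beta", "t_stat", "p_value"],
)


def beta_column(factor: str) -> str:
    """Map 'MKT-RF' -> 'beta_mkt', 'SMB' -> 'beta_smb' (standard factors only)."""
    return "beta_" + factor.split("-")[0].lower()


# One row per (strategy, model), one column per factor beta.
wide = loadings.set_index(["strategy", "model", "factor"])["beta"].unstack("factor")
wide.columns = [beta_column(c) for c in wide.columns]
wide = wide.reset_index()
```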
## CLI Changes
Add arguments to `main.py`:
- `--attribution`
- `--attribution-model {capm,ff5,ff5plus,all}`
- `--attribution-export <dir>`
Behavior:
- If `--attribution` is not set, current behavior is unchanged
- If set, attribution runs after backtest metrics are printed
- If export path is set, write:
  - `summary.csv`
  - `loadings.csv`
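A sketch of how these flags might be declared with `argparse`. The existing parser in `main.py` is not shown in this spec, so `build_parser` is a stand-in, and defaulting the model to `all` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser carrying only the new attribution flags."""
    parser = argparse.ArgumentParser(description="backtest runner (attribution flags only)")
    parser.add_argument(
        "--attribution",
        action="store_true",
        help="run factor attribution after backtest metrics are printed",
    )
    parser.add_argument(
        "--attribution-model",
        choices=["capm", "ff5", "ff5plus", "all"],
        default="all",  # assumption: the spec does not fix a default
        help="which regression model(s) to fit",
    )
    parser.add_argument(
        "--attribution-export",
        metavar="DIR",
        default=None,
        help="directory for summary.csv and loadings.csv",
    )
    return parser


args = build_parser().parse_args(
    ["--attribution", "--attribution-model", "ff5plus", "--attribution-export", "out"]
)
defaults = build_parser().parse_args([])
```

Because `--attribution` is a `store_true` flag that defaults to `False`, omitting it leaves the current behavior untouched.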
## Terminal Reporting
For each strategy and selected model, print a compact line containing:
- annualized alpha
- major factor loadings
- R-squared
- residual volatility
After the numeric table, print a short interpretation section:
- whether alpha remains after adding factors
- which factors explain most of the strategy
- whether the model fit is weak or strong
Interpretation should remain descriptive and avoid overclaiming statistical significance.
## Error Handling
- External factor download failure:
  - Use cache if available
  - Otherwise downgrade to proxy mode
- Missing or short overlap window:
  - Skip that model and report insufficient data
- Singular matrix or severe multicollinearity:
  - Catch and report model failure or unstable fit
- Missing benchmark column:
  - Fall back to equal-weight universe market proxy where possible
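The download, cache, and proxy fallback chain could be structured as in this sketch. The `fetch` callable stands in for whatever performs the real download, and the source labels `"external"` and `"cache"` are hypothetical; only `proxy_only` comes from this spec.

```python
from pathlib import Path

import pandas as pd


def load_factors_with_fallback(fetch, cache_path):
    """Try a fresh download, then the local cache, then signal proxy mode.

    Returns (frame_or_None, source_label).
    """
    cache_path = Path(cache_path)
    try:
        frame = fetch()
    except Exception:  # broad on purpose: network and parse errors both fall back
        if cache_path.exists():
            cached = pd.read_csv(cache_path, index_col=0, parse_dates=True)
            return cached, "cache"
        return None, "proxy_only"
    # A successful fetch refreshes the cache for later offline runs.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(cache_path)
    return frame, "external"
```

Returning the source label alongside the frame lets the caller populate `factor_source` and `proxy_only` in the summary output without re-deriving them.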
## Testing Plan
### Unit Tests
- External factor parser converts dates and percent units correctly
- Cache loader returns cached data on download failure
- Extension factor builders produce expected columns and no future leakage
- Regression on synthetic data recovers approximate known alpha and betas
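The "no future leakage" check can be made mechanical: tamper with prices after a cutoff and confirm that builder output up to the cutoff does not change. The sketch below shows the idea on a toy lagged signal. Both `lagged_momentum` and `has_lookahead` are hypothetical names and the prices are synthetic.

```python
import numpy as np
import pandas as pd


def lagged_momentum(prices: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Toy signal: trailing return, shifted so row t only sees data through t-1."""
    return (prices / prices.shift(window) - 1.0).shift(1)


def has_lookahead(builder, prices: pd.DataFrame, cutoff: int) -> bool:
    """True if changing prices after `cutoff` alters output at or before it."""
    baseline = builder(prices).iloc[: cutoff + 1]
    tampered = prices.copy()
    tampered.iloc[cutoff + 1 :] = tampered.iloc[cutoff + 1 :].to_numpy() * 10.0
    rerun = builder(tampered).iloc[: cutoff + 1]
    # DataFrame.equals treats NaNs in matching positions as equal.
    return not baseline.equals(rerun)


rng = np.random.default_rng(1)
prices = pd.DataFrame(
    100.0 * np.cumprod(1.0 + rng.normal(0.0, 0.01, size=(40, 3)), axis=0),
    index=pd.bdate_range("2024-01-02", periods=40),
    columns=["A", "B", "C"],
)
clean = has_lookahead(lagged_momentum, prices, cutoff=20)
leaky = has_lookahead(lambda p: p.shift(-1) / p - 1.0, prices, cutoff=20)
```

The second call deliberately uses a builder that peeks one day ahead, so the helper itself is verified to catch a real leak.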
### Integration Tests
- End-to-end attribution on a small deterministic equity and factor dataset
- CLI export produces expected files and columns
### Regression Tests
- Fixed local US sample produces stable output shape and model naming
## Implementation Notes
- Prefer `numpy.linalg.lstsq` or `scipy` OLS utilities already available in dependencies
- Keep implementation dependency-light
- Keep factor construction functions separate from regression code for testability
- Avoid changing existing strategy behavior
## Risks
- Standard factor downloads may change source file formatting
- Proxy factor definitions for CN will be weaker than true academic factors
- Some strategy returns may be highly collinear with momentum-like factors, reducing interpretability
- Long factor warmup windows or a short overlap between strategy and factor dates can materially reduce sample size
## Success Criteria
- A user can run backtests with `--attribution` and receive factor-based explanations of returns
- US runs use standard external factors when available
- CN runs still produce a clearly labeled proxy attribution report
- Outputs distinguish residual alpha from factor exposure
- The module is easy to extend with new factors later