gahow/quant

Gahow Wang 80493cb6af Add factor attribution design spec

2026-04-07 15:01:57 +08:00

Factor Attribution Design

Date: 2026-04-07
Repo: /Users/gahow/projects/quant

Goal

Add a factor attribution module that explains strategy returns using:

Standard external US factors when available: MKT-RF, SMB, HML, RMW, CMA, RF
Local price-derived extension factors: MOM, LOWVOL, RECOVERY
Local proxy fallback factors for markets without standard external data

The module must integrate with the current backtest workflow, reuse existing strategy equity curves, cache downloaded factor data locally, and produce both terminal summaries and exportable tabular outputs.

Scope

In scope:

New factor attribution module for research backtests
US support using external standard factors plus local extension factors
CN support using local proxy factors only
CAPM, FF5, and FF5-plus-extension models
CLI flags in main.py to enable attribution and export results
Tests for parsing, factor construction, and regression behavior

Out of scope for this iteration:

Intraday attribution
Portfolio optimizer changes
Live trader attribution in trader.py
Notebook or plotting UI for attribution results
External fundamental datasets beyond standard downloadable factor files

Existing Context

The repo already has:

A vectorized backtest engine in main.py
Strategy implementations that produce daily target weights
Performance metrics in metrics.py
Local daily price caches in data/us.csv, data/us_open.csv, data/cn.csv

The current "alpha" reported by simulate in trader.py is simply total return minus benchmark return. The new module adds regression-based alpha and factor exposure analysis.

Design Overview

Add a new module factor_attribution.py with four responsibilities:

Load and cache factor datasets
Build local extension and proxy factors from existing price data
Run regression models against strategy daily returns
Render summary tables and export detailed results

main.py remains the orchestration point. It will continue running backtests and benchmark normalization, then optionally invoke attribution on the resulting daily return series.

Module Structure

`factor_attribution.py`

Planned top-level responsibilities:

load_external_us_factors(...)
- Download Ken French daily factor files
- Parse, normalize, convert percent to decimal
- Cache to data/factors/
- Fall back to cache when network fetch fails
build_extension_factors(price_data, benchmark, market)
- Build local daily factor return series for:
  - MOM
  - LOWVOL
  - RECOVERY
build_proxy_core_factors(price_data, benchmark, market)
- Used mainly for CN or when external factors are unavailable
- Build daily proxy series for:
  - MKT
  - SMB_PROXY
  - HML_PROXY
  - RMW_PROXY
  - CMA_PROXY
prepare_factor_models(...)
- Merge standard factors and local factors
- Produce factor matrices for:
  - capm
  - ff5
  - ff5plus
run_factor_regression(strategy_returns, factor_frame, risk_free_col)
- Fit OLS with intercept
- Return alpha, annualized alpha, loadings, t-stats, p-values, R-squared, adjusted R-squared, residual volatility, date range, and observation count
attribute_strategies(results_df, benchmark_series, price_data, market, model_selection)
- Convert equity curves to returns
- Run attribution for each strategy
- Return structured summary and long-form loadings tables
print_attribution_summary(...)
- Render compact terminal output
export_attribution(...)
- Write CSV outputs

Data Sources

US Standard Factors

Preferred source:

Ken French daily factor datasets for:
- Fama-French 5 Factors daily
- Momentum daily if separately required

Normalization rules:

Convert index to pandas DatetimeIndex
Convert values from percent to decimal returns
Keep RF as decimal daily risk-free rate
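
A minimal sketch of the normalization rules above, assuming the raw file follows the usual Ken French daily CSV layout: a free-text preamble, a header row that starts with a comma, dates as YYYYMMDD, and values in percent. The name parse_french_daily and the sample values are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Illustrative sample only; the numbers are made up, not real factor data.
SAMPLE = """Illustrative preamble line describing the file.

,Mkt-RF,SMB,HML,RMW,CMA,RF
20240102,-0.70,0.10,0.85,0.20,0.55,0.02
20240103,-1.00,-1.50,0.40,0.10,0.30,0.02

Illustrative footer line
"""


def parse_french_daily(text: str) -> pd.DataFrame:
    """Parse a daily factor file into decimal returns on a DatetimeIndex."""
    lines = text.splitlines()
    # The header row is assumed to be the first line starting with a comma.
    start = next(i for i, ln in enumerate(lines) if ln.startswith(","))
    rows = [lines[start]]
    for ln in lines[start + 1:]:
        first = ln.split(",")[0].strip()
        if len(first) == 8 and first.isdigit():
            rows.append(ln)
        else:
            break  # stop at the first non-data line (blank line or footer)
    df = pd.read_csv(io.StringIO("\n".join(rows)), index_col=0)
    df.index = pd.to_datetime(df.index.astype(str), format="%Y%m%d")
    df.index.name = "date"
    df.columns = [c.strip().upper() for c in df.columns]
    # Percent to decimal, including RF.
    return df.astype(float) / 100.0


ff5 = parse_french_daily(SAMPLE)
```

Upper-casing the column names maps the file's Mkt-RF header onto the MKT-RF name used throughout this spec.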

Cache location:

data/factors/ff5_us_daily.csv
data/factors/mom_us_daily.csv

If the source format changes or download fails:

Use the latest local cache if present
Otherwise fall back to local proxy factors and mark the run as proxy_only
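
The fallback order above could be sketched as follows. This is an illustration under assumptions: load_with_cache_fallback is a hypothetical helper, and the injected fetch callable is assumed to raise on both network errors and unexpected file formats.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def load_with_cache_fallback(
    fetch: Callable[[], str], cache_path: Path
) -> Tuple[Optional[str], str]:
    """Return (raw_text, source); source is 'download', 'cache', or 'proxy_only'."""
    try:
        text = fetch()  # assumed to raise on download or format problems
    except Exception:
        if cache_path.exists():
            return cache_path.read_text(), "cache"
        return None, "proxy_only"  # caller switches to local proxy factors
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text)  # refresh the cache after a good download
    return text, "download"


def failing_fetch() -> str:
    raise ConnectionError("simulated network failure")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "factors" / "ff5_us_daily.csv"
    first = load_with_cache_fallback(failing_fetch, cache)       # no cache yet
    second = load_with_cache_fallback(lambda: "date,RF\n", cache)  # writes cache
    third = load_with_cache_fallback(failing_fetch, cache)       # reads cache
```

Passing the fetcher in as a callable keeps the fallback logic testable without network access.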

Local Price Inputs

Reuse repo price caches:

US: data/us.csv, data/us_open.csv
CN: data/cn.csv

Only adjusted close prices are required for attribution factor construction.

Factor Definitions

Standard Factors

For US:

MKT-RF, SMB, HML, RMW, CMA, RF from external factor data

Local Extension Factors

These are built from the same universe already used by the repo.

`MOM`

Cross-sectional momentum long-short factor
Rank stocks by 12-1 month return
Long top quantile, short bottom quantile
Equal weight within long and short legs
Factor return is long return minus short return
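
As a sketch of the ranking signal only, the 12-1 month return can be computed from a date-by-ticker price frame. The helper name momentum_12_1 and the 21-trading-days-per-month convention are assumptions, not part of the spec.

```python
import numpy as np
import pandas as pd


def momentum_12_1(prices: pd.DataFrame, month: int = 21) -> pd.DataFrame:
    """Return from 12 months ago to 1 month ago, skipping the latest month."""
    return prices.shift(month) / prices.shift(12 * month) - 1.0


# Synthetic single-name example: a price that grows 1% per day.
demo_prices = pd.DataFrame({"A": 1.01 ** np.arange(260)})
mom_signal = momentum_12_1(demo_prices)
```

The signal is NaN for the first 252 rows, which is the warmup window that later gets dropped during model alignment.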

`LOWVOL`

Cross-sectional low-volatility factor
Compute rolling volatility from daily returns
Long lowest-vol quantile, short highest-vol quantile
Equal weight within legs

`RECOVERY`

Cross-sectional recovery factor
Rank stocks by distance from rolling 63-day low
Long strongest recovery names, short weakest recovery names
Equal weight within legs
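
One possible reading of "distance from rolling 63-day low" is the percentage gap between the current price and the rolling minimum. The sketch below uses that interpretation; recovery_signal is a hypothetical name, and including the current day in the window is an assumption.

```python
import pandas as pd


def recovery_signal(prices: pd.DataFrame, window: int = 63) -> pd.DataFrame:
    """Percent distance of each price above its rolling-window low."""
    rolling_low = prices.rolling(window).min()
    return prices / rolling_low - 1.0


# Synthetic path: flat at 100, a drop to 50, then a rebound to 75.
values = [100.0] * 40 + [50.0] + [75.0] * 30
rec_signal = recovery_signal(pd.DataFrame({"A": values}))
```

Under this definition the signal is never negative, and a larger value means the name has rebounded further from its recent low.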

Proxy Core Factors

Used for CN by default and as fallback for US.

`MKT`

Benchmark daily return if benchmark exists
Otherwise equal-weight universe return
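
The benchmark-or-equal-weight rule could look like the sketch below. It assumes the benchmark is identified by a column name in the price frame; market_proxy is a hypothetical helper.

```python
from typing import Optional

import pandas as pd


def market_proxy(prices: pd.DataFrame, benchmark: Optional[str] = None) -> pd.Series:
    """Daily MKT series: benchmark return if present, else equal-weight universe."""
    rets = prices / prices.shift(1) - 1.0
    if benchmark is not None and benchmark in rets.columns:
        return rets[benchmark].rename("MKT")
    # Fallback: simple average of all available daily returns.
    return rets.mean(axis=1).rename("MKT")


toy = pd.DataFrame({"A": [100.0, 110.0, 121.0], "B": [100.0, 100.0, 90.0]})
mkt_from_benchmark = market_proxy(toy, benchmark="A")
mkt_equal_weight = market_proxy(toy, benchmark="MISSING")
```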

`SMB_PROXY`

Size proxy using inverse price level or market-cap proxy when only price data is available
First iteration uses inverse price rank as a transparent proxy and explicitly labels it as proxy

`HML_PROXY`

Value proxy using price-to-range or distance-to-trailing-low style signal
This is not a true book-to-market factor and must be labeled proxy

`RMW_PROXY`

Profitability proxy from return consistency and stability

`CMA_PROXY`

Investment proxy from asset trend smoothness or expansion/contraction behavior inferred from price action

Proxy factors are included for model completeness, but the output must label them clearly as proxies rather than standard academic factors.

Factor Construction Rules

All local factors use only information available up to date t when forming the portfolio that earns the return at t+1
No future data leakage
Factor series are daily return series, not ranks
Long-short factors should be approximately dollar-neutral
Missing values are allowed during warmup windows and dropped during model alignment
Quantile counts should adapt to available universe size
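
The rules above can be combined into one generic builder. The sketch below is an assumption about how that might look, not the planned implementation: long_short_factor is a hypothetical name, the signal is lagged by one day so that positions at t+1 use only data through t, and the leg size k adapts to the number of names with a valid signal.

```python
import numpy as np
import pandas as pd


def long_short_factor(
    signal: pd.DataFrame, prices: pd.DataFrame, quantile: float = 0.2
) -> pd.Series:
    """Equal-weight top-minus-bottom daily return from a lagged signal."""
    rets = prices / prices.shift(1) - 1.0
    # Lag the signal one day and ignore names without a return that day.
    lagged = signal.shift(1).where(rets.notna())
    n = lagged.notna().sum(axis=1)
    k = np.maximum((n * quantile).astype(int), 1)  # adaptive leg size per date
    asc = lagged.rank(axis=1, method="first")
    desc = lagged.rank(axis=1, method="first", ascending=False)
    long_ret = rets.where(desc.le(k, axis=0)).mean(axis=1)
    short_ret = rets.where(asc.le(k, axis=0)).mean(axis=1)
    # Require enough names for two non-overlapping legs; else leave NaN.
    return (long_ret - short_ret).where(n >= 2 * k)


toy_prices = pd.DataFrame({
    "A": [100.0, 101.0, 102.01],
    "B": [100.0, 100.0, 100.0],
    "C": [100.0, 100.0, 100.0],
    "D": [100.0, 103.0, 106.09],
})
toy_signal = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]] * 3, columns=toy_prices.columns
)
factor = long_short_factor(toy_signal, toy_prices, quantile=0.25)

# Changing only the last day's signal must not change any factor return.
changed = toy_signal.copy()
changed.iloc[-1] = [4.0, 3.0, 2.0, 1.0]
factor_changed = long_short_factor(changed, toy_prices, quantile=0.25)
```

Because both legs hold the same number of equally weighted names, the long and short weights sum to zero, which gives the approximate dollar neutrality the rules ask for.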

Regression Models

CAPM

Model:

strategy_excess_return ~ alpha + (MKT-RF)

FF5

Model:

strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA

FF5Plus

Model:

strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA + MOM + LOWVOL + RECOVERY

Proxy Model

For markets without standard factors:

strategy_return ~ alpha + MKT + SMB_PROXY + HML_PROXY + RMW_PROXY + CMA_PROXY + MOM + LOWVOL + RECOVERY

The module should report which model family was actually used.
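
All four model families reduce to the same OLS-with-intercept fit on different factor columns. The sketch below uses numpy.linalg.lstsq; ols_with_intercept is a hypothetical name, annualizing alpha by multiplying by 252 is an assumption, and p-values are omitted because they need a t-distribution CDF (for example from scipy.stats).

```python
import numpy as np


def ols_with_intercept(y, X, periods_per_year: int = 252) -> dict:
    """Fit y = alpha + X @ beta + e and return basic diagnostics."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])  # intercept column first
    coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
    if rank < A.shape[1]:
        raise np.linalg.LinAlgError("design matrix is rank deficient")
    resid = y - A @ coef
    dof = n - (k + 1)
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = coef / se
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - (resid @ resid) / ss_tot
    return {
        "alpha_daily": coef[0],
        "alpha_ann": coef[0] * periods_per_year,  # simple scaling (assumption)
        "betas": coef[1:],
        "alpha_t_stat": t_stats[0],
        "beta_t_stats": t_stats[1:],
        "r_squared": r2,
        "adj_r_squared": 1.0 - (1.0 - r2) * (n - 1) / dof,
        "residual_vol_ann": np.sqrt(sigma2 * periods_per_year),
        "n_obs": n,
    }


# Synthetic check: known alpha and betas plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(0.0, 0.01, size=(2000, 2))
y_demo = 0.0002 + X_demo @ np.array([1.2, -0.5]) + rng.normal(0.0, 0.001, 2000)
fit = ols_with_intercept(y_demo, X_demo)
```

For the standard models y is the strategy return minus RF; for the proxy model y is the raw strategy return, as in the formulas above.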

Alignment Rules

Convert all equity curves to daily returns
Build factor frames at daily frequency
Join strategy returns and factor returns on date intersection
For standard factor models, subtract RF from strategy returns
Keep benchmark return separately for active return diagnostics, but not as a replacement for MKT-RF in standard factor models
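
A sketch of the alignment steps for the standard factor models, assuming the equity curve is a pandas Series and the factor frame carries an RF column. The name align_excess_returns is hypothetical.

```python
import numpy as np
import pandas as pd


def align_excess_returns(equity: pd.Series, factors: pd.DataFrame, rf_col: str = "RF"):
    """Return (excess strategy returns, factor matrix) on the common dates."""
    strat = (equity / equity.shift(1) - 1.0).rename("strategy")
    # Inner join keeps the date intersection; dropna removes warmup rows.
    joined = pd.concat([strat, factors], axis=1, join="inner").dropna()
    y = joined["strategy"] - joined[rf_col]
    X = joined.drop(columns=["strategy", rf_col])
    return y, X


equity = pd.Series(
    [100.0, 101.0, 102.01, 103.0301],
    index=pd.date_range("2024-01-01", periods=4),
)
factors = pd.DataFrame(
    {"MKT-RF": [0.01, -0.02, 0.005, 0.0], "RF": [0.0001] * 4},
    index=pd.date_range("2024-01-02", periods=4),
)
y_aligned, X_aligned = align_excess_returns(equity, factors)
```

In this toy example only three dates overlap, and each excess return is the 1% daily return minus the 0.01% risk-free rate.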

Output Schema

Summary Output

One row per strategy per model with fields including:

strategy
market
model
factor_source
proxy_only
start_date
end_date
n_obs
alpha_daily
alpha_ann
alpha_t_stat
alpha_p_value
r_squared
adj_r_squared
residual_vol_ann

Selected factor loadings should also be flattened into summary columns when available:

beta_mkt
beta_smb
beta_hml
beta_rmw
beta_cma
beta_mom
beta_lowvol
beta_recovery

Loadings Output

Long-form table:

strategy
model
factor
beta
t_stat
p_value

CLI Changes

Add arguments to main.py:

--attribution
--attribution-model {capm,ff5,ff5plus,all}
--attribution-export <dir>

Behavior:

If --attribution is not set, current behavior is unchanged
If set, attribution runs after backtest metrics are printed
If export path is set, write:
- summary.csv
- loadings.csv
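
The flags above map directly onto argparse. In this sketch add_attribution_args is a hypothetical helper, and defaulting --attribution-model to all is an assumption the spec does not state.

```python
import argparse


def add_attribution_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Register the attribution flags on an existing parser."""
    parser.add_argument(
        "--attribution", action="store_true",
        help="run factor attribution after backtest metrics are printed",
    )
    parser.add_argument(
        "--attribution-model", choices=["capm", "ff5", "ff5plus", "all"],
        default="all",  # assumed default
        help="which factor model(s) to fit",
    )
    parser.add_argument(
        "--attribution-export", metavar="DIR", default=None,
        help="directory for summary.csv and loadings.csv",
    )
    return parser


cli = add_attribution_args(argparse.ArgumentParser())
defaults = cli.parse_args([])
enabled = cli.parse_args(
    ["--attribution", "--attribution-model", "ff5", "--attribution-export", "out"]
)
```

With no flags, attribution stays off, which preserves the current behavior.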

Terminal Reporting

For each strategy and selected model, print a compact line containing:

annualized alpha
major factor loadings
R-squared
residual volatility

After the numeric table, print a short interpretation section:

whether alpha remains after adding factors
which factors explain most of the strategy
whether the model fit is weak or strong

Interpretation should remain descriptive and avoid overclaiming statistical significance.

Error Handling

External factor download failure:
- Use cache if available
- Otherwise downgrade to proxy mode
Missing or short overlap window:
- Skip that model and report insufficient data
Singular matrix or severe multicollinearity:
- Catch and report model failure or unstable fit
Missing benchmark column:
- Fall back to equal-weight universe market proxy where possible

Testing Plan

Unit Tests

External factor parser converts dates and percent units correctly
Cache loader returns cached data on download failure
Extension factor builders produce expected columns and no future leakage
Regression on synthetic data recovers approximate known alpha and betas

Integration Tests

End-to-end attribution on a small deterministic equity and factor dataset
CLI export produces expected files and columns

Regression Tests

Fixed local US sample produces stable output shape and model naming

Implementation Notes

Prefer numpy.linalg.lstsq or SciPy least-squares utilities (for example scipy.linalg.lstsq) already available in dependencies
Keep implementation dependency-light
Keep factor construction functions separate from regression code for testability
Avoid changing existing strategy behavior

Risks

Standard factor downloads may change source file formatting
Proxy factor definitions for CN will be weaker than true academic factors
Some strategy returns may be highly collinear with momentum-like factors, reducing interpretability
Short overlap windows or long factor warmup windows can materially reduce sample size

Success Criteria

A user can run backtests with --attribution and receive factor-based explanations of returns
US runs use standard external factors when available
CN runs still produce a clearly labeled proxy attribution report
Outputs distinguish residual alpha from factor exposure
The module is easy to extend with new factors later
