Compare commits


39 Commits

Author SHA1 Message Date
cdaca4bc2a Merge branch 'feat/us-alpha-phase1' 2026-04-18 15:00:56 +08:00
f5e8c708f3 feat: add PIT OHLCV runner and fetch support 2026-04-18 14:59:48 +08:00
c015873ee1 feat: add strict US alpha research pipeline 2026-04-18 00:38:29 +08:00
bf6fccfd11 feat: add regime and breakout alpha modules 2026-04-18 00:31:16 +08:00
7853eafe55 feat: add PIT-aware tradable universe mask 2026-04-18 00:23:07 +08:00
1edce83430 fix: handle single-ticker yahoo panels 2026-04-18 00:03:07 +08:00
3abc51e3e3 feat: add OHLCV market data updater 2026-04-17 23:59:06 +08:00
7239310be3 docs: add US alpha research design spec 2026-04-17 23:41:10 +08:00
5e1c4a681d Add point-in-time S&P 500 backtest to expose survivorship bias
The existing framework fetches today's S&P 500 constituents from Wikipedia
and applies that list to the entire 10-year price history — classic
survivorship bias. Stocks that went bankrupt or were removed for poor
performance are absent, while today's winners (which may have been minor
names 10 years ago) are implicitly selected. This materially inflates
reported strategy returns.

New pipeline:
  - universe_history.py reconstructs per-ticker membership intervals by
    walking Wikipedia's "Selected changes" table backward from today.
  - research/fetch_historical.py downloads prices for all 848 tickers
    that were ever members (Yahoo returns ~675 of them; ~170 fully
    delisted names are unavailable — remaining partial bias).
  - research/pit_backtest.py masks prices to NaN outside membership
    windows so strategies naturally cannot select non-members.
  - research/strategies_plus.py adds RecoveryMomentumPlus (generalized
    Recovery+Momentum with configurable weighting / blend / regime hook)
    and an EnsembleStrategy.
  - research/optimize.py runs five experiments: bias drift, hyperparameter
    sweep (2016-2022 train / 2023-2026 test), SPY MA regime filter,
    weighting schemes, and an uncorrelated-config ensemble.

Headline finding: the biased backtest reports 40.9% CAGR for
recovery_mom_top10 over 2016-2026; the point-in-time version reports
22.4% (vs 14.0% SPY buy-and-hold). True edge is ~8pp CAGR, not ~27pp.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 16:26:02 +08:00
2015b62104 Charge 5 CNY per A-share trade via per-market fee table
Add MARKET_FEES {us: 2, cn: 5} so the monitor and cron (auto) paths
automatically apply the correct local-currency fixed commission without
needing a per-strategy override. CLI --fixed-fee still wins when set
explicitly for auto; monitor now always resolves from the table so its
banner and each strategy sub-call agree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 13:32:41 +08:00
b2176b0c3e Record daily snapshot in cmd_evening for monitor NAV tracking
cmd_evening (used by the monitor path) only updated the simple daily_equity
dict, so daily_log had gaps on every monitor-driven day. Mirror cmd_auto's
pattern and call record_daily_snapshot so each strategy's NAV is recorded
every trading day, even when no trades execute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 13:15:09 +08:00
ae25f2f6b5 Add 32 factor-combo strategies with configurable rebalancing frequency
New FactorComboStrategy class (strategies/factor_combo.py) implements
8 champion factor signals (4 US, 4 CN) discovered through iterative
factor research, each at 4 rebalancing frequencies (daily/weekly/
biweekly/monthly). Registered in trader.py as fc_{signal}_{freq}.

Existing strategies and state files are untouched — safe to git pull
and restart monitor on server.

Also includes factor research scripts (factor_loop.py, factor_research.py,
etc.) used to discover and validate these factors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-08 10:41:34 +08:00
a66b039d2d Reject empty attribution semantics headers 2026-04-07 18:10:21 +08:00
88d765713e Reject colliding attribution semantics headers 2026-04-07 18:10:21 +08:00
35a91ba6cc Honor complete attribution beta semantics labels 2026-04-07 18:10:21 +08:00
b3d87b3d92 Harden attribution beta semantics fallback 2026-04-07 18:10:21 +08:00
097131d962 Add attribution beta semantics metadata 2026-04-07 18:10:21 +08:00
82a3e63c2b Restore summary schema for proxy attribution 2026-04-07 18:10:21 +08:00
69a03f52d9 Fix proxy attribution benchmark and labeling 2026-04-07 18:10:21 +08:00
9c4a219c68 Integrate factor attribution into backtest CLI 2026-04-07 18:10:21 +08:00
f6670d9e6d Normalize one-point regression residual volatility 2026-04-07 17:02:00 +08:00
18174a9e11 Compute residual vol for square regressions 2026-04-07 16:57:58 +08:00
3d934b3316 Handle square factor regressions without inference 2026-04-07 16:53:16 +08:00
0876c0b6af Guard factor regressions against unidentified models 2026-04-07 16:48:23 +08:00
f2e14ec200 Add factor attribution regression engine 2026-04-07 16:40:24 +08:00
507565c556 Tighten benchmark mutation leakage test 2026-04-07 16:34:35 +08:00
26937f035e Split proxy leakage tests 2026-04-07 16:31:51 +08:00
7afc60dfcb Strengthen proxy factor builder tests 2026-04-07 16:28:00 +08:00
7e44ece569 Add factor builder leakage tests 2026-04-07 16:21:05 +08:00
7e8d24c1e9 Add local attribution factor builders 2026-04-07 16:16:59 +08:00
2382364a46 Handle HTTP protocol errors in factor download 2026-04-07 16:12:00 +08:00
71912b8358 Wrap additional network errors in factor download 2026-04-07 16:06:54 +08:00
7f0c5de574 Use explicit download errors for factor loader fallback 2026-04-07 16:01:51 +08:00
c46727b1ca Handle OSError download fallback for factor loader 2026-04-07 15:57:16 +08:00
0e94688066 Narrow factor loader format fallback handling 2026-04-07 15:51:57 +08:00
9e6da727a3 Implement Ken French factor download and cache fallback 2026-04-07 15:44:46 +08:00
e70922d9af Harden factor loader zip parsing and fallback 2026-04-07 15:38:49 +08:00
feb1864a4d Add factor loader and cache scaffolding 2026-04-07 15:27:44 +08:00
80493cb6af Add factor attribution design spec 2026-04-07 15:01:57 +08:00
36 changed files with 8365 additions and 9 deletions

.gitignore (vendored, +8 lines)

@@ -21,6 +21,14 @@ data/universe_*.json
# Trader state — per-machine, regenerated by auto/simulate
data/trader_*.json
# Factor attribution output and cached factors
data/attribution_*/
data/factors/
data/factors_review_tmp/
# External tool artifacts
docs/superpowers/
# IDE / editor
.idea/
.vscode/


@@ -44,7 +44,7 @@ No test suite or linter is configured.
**Backtest engine** (`main.py`): Orchestrates data loading, strategy execution, and visualization. The `backtest()` function is vectorized — it takes a strategy and price DataFrame, applies transaction costs (proportional + optional fixed per-trade fee) via turnover, and returns an equity curve. Supports two execution modes: `close` (classic) and `open-close` (signal on open prices, execute at close).
**Daily trader** (`trader.py`): Live/forward-testing system with persistent portfolio state in `data/trader_{market}_{strategy}.json`. The `auto` subcommand runs both signal generation and execution in a single invocation — designed for cron. The `simulate` subcommand replays a date range day-by-day with realistic portfolio tracking (fractional shares, cash, commissions). Available strategies: `recovery_mom_top10`, `recovery_mom_top20`, `momentum`, `momentum_quality`, `dual_momentum`, `inverse_vol`, `trend_following`, `buy_and_hold`, plus 32 factor-combo strategies (`fc_{signal}_{freq}` — see `strategies/factor_combo.py`).
**Strategy protocol** (`strategies/base.py`): All strategies inherit from `Strategy` ABC and implement `generate_signals(data) → DataFrame` where the returned DataFrame contains portfolio weights (rows = dates, columns = assets, values sum to ~1.0 per row). Each strategy is responsible for applying its own 1-day lag via `.shift(1)` to avoid lookahead bias — the backtest engine does not shift.
@@ -59,6 +59,7 @@ No test suite or linter is configured.
- `momentum_quality.py` — Momentum + return consistency + low drawdown
- `adaptive_momentum.py` — Momentum weighted by inverse volatility
- `recovery_momentum.py` — Recovery (price/63d low) + 12-1mo momentum composite. Best US performer.
- `factor_combo.py` — Configurable factor-combination strategies with daily/weekly/biweekly/monthly rebalancing. US champions: `rec_mfilt+deep_upvol` (50.7% CAGR monthly), `ma200+mom7m+rec126`, `rec_mfilt+ma200`, `mom7m+rec126`. CN champions: `up_cap+quality_mom` (26.1% CAGR monthly), `down_resil+qual_mom`, `rec63+mom_gap`, `up_cap+mom_gap`. All registered in trader.py as `fc_{signal}_{freq}` (e.g., `fc_rec_mfilt_deep_upvol_monthly`). 32 new strategies total.
**Metrics** (`metrics.py`): Standalone functions for portfolio analytics (Sharpe, Sortino, Calmar, max drawdown, etc.). `summary()` prints a formatted report and returns a dict.
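As a rough illustration of this protocol — a hypothetical strategy, not one from the repo, and the exact `Strategy` base-class signature may differ:

```python
import pandas as pd

from strategies.base import Strategy


class TopMomentumExample(Strategy):
    """Hypothetical example: equal-weight the 10 strongest 6-month movers."""

    def generate_signals(self, data: pd.DataFrame) -> pd.DataFrame:
        momentum = data.pct_change(126)                      # score built from data up to day t
        rank = momentum.rank(axis=1, ascending=False)
        weights = (rank <= 10).astype(float)
        weights = weights.div(weights.sum(axis=1), axis=0)   # rows sum to ~1.0
        # the strategy applies its own 1-day lag; the backtest engine does not shift
        return weights.shift(1).fillna(0.0)
```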

data/sp500_history.json (new file, +1 line)

File diff suppressed because one or more lines are too long


@@ -63,10 +63,11 @@ def _download(tickers: list[str], start: str, end: str | None = None,
     result = {}
     for field in fields:
         if field in raw.columns.get_level_values(0) if isinstance(raw.columns, pd.MultiIndex) else field in raw.columns:
-            if len(tickers) > 1:
-                result[field] = raw[field]
+            selected = raw[field]
+            if isinstance(selected, pd.Series):
+                result[field] = selected.to_frame(name=tickers[0])
             else:
-                result[field] = raw[field].to_frame(name=tickers[0])
+                result[field] = selected
         else:
             result[field] = pd.DataFrame()
     return result
@@ -83,10 +84,11 @@ def _download_period(tickers: list[str], period: str,
     result = {}
     for field in fields:
         if field in raw.columns.get_level_values(0) if isinstance(raw.columns, pd.MultiIndex) else field in raw.columns:
-            if len(tickers) > 1:
-                result[field] = raw[field]
+            selected = raw[field]
+            if isinstance(selected, pd.Series):
+                result[field] = selected.to_frame(name=tickers[0])
             else:
-                result[field] = raw[field].to_frame(name=tickers[0])
+                result[field] = selected
         else:
             result[field] = pd.DataFrame()
     return result
@@ -103,6 +105,66 @@ def _clean(data: pd.DataFrame) -> pd.DataFrame:
    return data

def _clean_market_data(data: pd.DataFrame, field: str) -> pd.DataFrame:
    """Clean market data while preserving volume gaps."""
    good = data.columns[data.notna().mean() > 0.5]
    dropped = set(data.columns) - set(good)
    if dropped:
        print(f"--- Dropped {len(dropped)} tickers with >50% missing data ---")
    data = data[good]
    if field == "volume":
        return data
    return data.ffill().dropna(how="all")

def _merge_market_panel(existing: pd.DataFrame | None, new_data: pd.DataFrame) -> pd.DataFrame:
    """Merge new data into an existing cached panel, preserving old columns and dates."""
    if existing is None or existing.empty:
        merged = new_data.copy()
    elif new_data.empty:
        merged = existing.copy()
    else:
        merged = existing.combine_first(new_data)
        merged.loc[new_data.index, new_data.columns] = new_data
    merged = merged.sort_index()
    merged = merged[~merged.index.duplicated(keep="last")]
    return merged

def update_market_data(market: str, tickers: list[str], fields: list[str]) -> dict[str, pd.DataFrame]:
    """Download, clean, persist, and return market data panels for requested Yahoo fields."""
    field_aliases = {
        "close": "Close",
        "open": "Open",
        "high": "High",
        "low": "Low",
        "volume": "Volume",
    }
    normalized_fields = []
    yahoo_fields = []
    for field in fields:
        normalized = field.lower()
        if normalized not in field_aliases:
            raise ValueError(f"Unsupported market data field: {field}")
        normalized_fields.append(normalized)
        yahoo_fields.append(field_aliases[normalized])
    os.makedirs(DATA_DIR, exist_ok=True)
    start = (datetime.now() - timedelta(days=365 * 10)).strftime("%Y-%m-%d")
    downloaded = _download(tickers, start=start, fields=yahoo_fields)
    cleaned = {}
    for normalized, yahoo_field in zip(normalized_fields, yahoo_fields):
        data = _clean_market_data(downloaded.get(yahoo_field, pd.DataFrame()), normalized)
        existing = load(market, normalized)
        data = _merge_market_panel(existing, data)
        path = _data_path(market, normalized)
        data.to_csv(path)
        print(f"--- Saved {data.shape[0]} days x {data.shape[1]} tickers to {path} ---")
        cleaned[normalized] = data
    return cleaned

def update(market: str, tickers: list[str],
           with_open: bool = False) -> pd.DataFrame | tuple[pd.DataFrame, pd.DataFrame]:
    """


@@ -0,0 +1,376 @@
# Factor Attribution Design
Date: 2026-04-07
Repo: `/Users/gahow/projects/quant`
## Goal
Add a factor attribution module that explains strategy returns using:
- Standard external US factors when available: `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF`
- Local price-derived extension factors: `MOM`, `LOWVOL`, `RECOVERY`
- Local proxy fallback factors for markets without standard external data
The module must integrate with the current backtest workflow, reuse existing strategy equity curves, cache downloaded factor data locally, and produce both terminal summaries and exportable tabular outputs.
## Scope
In scope:
- New factor attribution module for research backtests
- US support using external standard factors plus local extension factors
- CN support using local proxy factors only
- CAPM, FF5, and FF5-plus-extension models
- CLI flags in `main.py` to enable attribution and export results
- Tests for parsing, factor construction, and regression behavior
Out of scope for this iteration:
- Intraday attribution
- Portfolio optimizer changes
- Live trader attribution in `trader.py`
- Notebook or plotting UI for attribution results
- External fundamental datasets beyond standard downloadable factor files
## Existing Context
The repo already has:
- A vectorized backtest engine in `main.py`
- Strategy implementations that produce daily target weights
- Performance metrics in `metrics.py`
- Local daily price caches in `data/us.csv`, `data/us_open.csv`, `data/cn.csv`
Current "alpha" in `trader.py simulate` is only total return minus benchmark return. The new module adds regression-based alpha and factor exposure analysis.
## Design Overview
Add a new module `factor_attribution.py` with four responsibilities:
1. Load and cache factor datasets
2. Build local extension and proxy factors from existing price data
3. Run regression models against strategy daily returns
4. Render summary tables and export detailed results
`main.py` remains the orchestration point. It will continue running backtests and benchmark normalization, then optionally invoke attribution on the resulting daily return series.
## Module Structure
### `factor_attribution.py`
Planned top-level responsibilities:
- `load_external_us_factors(...)`
- Download Ken French daily factor files
- Parse, normalize, convert percent to decimal
- Cache to `data/factors/`
- Fall back to cache when network fetch fails
- `build_extension_factors(price_data, benchmark, market)`
- Build local daily factor return series for:
- `MOM`
- `LOWVOL`
- `RECOVERY`
- `build_proxy_core_factors(price_data, benchmark, market)`
- Used mainly for CN or when external factors are unavailable
- Build daily proxy series for:
- `MKT`
- `SMB_PROXY`
- `HML_PROXY`
- `RMW_PROXY`
- `CMA_PROXY`
- `prepare_factor_models(...)`
- Merge standard factors and local factors
- Produce factor matrices for:
- `capm`
- `ff5`
- `ff5plus`
- `run_factor_regression(strategy_returns, factor_frame, risk_free_col)`
- Fit OLS with intercept
- Return alpha, annualized alpha, loadings, t-stats, p-values, R-squared, adjusted R-squared, residual volatility, date range, and observation count
- `attribute_strategies(results_df, benchmark_series, price_data, market, model_selection)`
- Convert equity curves to returns
- Run attribution for each strategy
- Return structured summary and long-form loadings tables
- `print_attribution_summary(...)`
- Render compact terminal output
- `export_attribution(...)`
- Write CSV outputs
## Data Sources
### US Standard Factors
Preferred source:
- Ken French daily factor datasets for:
- Fama-French 5 Factors daily
- Momentum daily if separately required
Normalization rules:
- Convert index to pandas `DatetimeIndex`
- Convert values from percent to decimal returns
- Keep `RF` as decimal daily risk-free rate
Cache location:
- `data/factors/ff5_us_daily.csv`
- `data/factors/mom_us_daily.csv`
If the source format changes or download fails:
- Use the latest local cache if present
- Otherwise fall back to local proxy factors and mark the run as `proxy_only`
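A minimal sketch of these normalization and fallback rules (illustrative only; it assumes the file's preamble rows have already been stripped, and the actual loader in `factor_attribution.py` additionally handles ZIP extraction and distinguishes download errors from format errors):

```python
from pathlib import Path

import pandas as pd


def load_ff5_daily(csv_path: str, cache_path: Path) -> pd.DataFrame:
    """Parse a Ken French daily CSV, convert percent to decimal, fall back to cache."""
    try:
        factors = pd.read_csv(csv_path)
        date_col = factors.columns[0]
        factors[date_col] = pd.to_datetime(factors[date_col], format="%Y%m%d")
        factors = factors.set_index(date_col) / 100.0   # percent -> decimal daily returns
        factors.to_csv(cache_path)                       # refresh local cache on success
        return factors
    except (OSError, ValueError):
        return pd.read_csv(cache_path, index_col=0, parse_dates=True)  # latest local cache
```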
### Local Price Inputs
Reuse repo price caches:
- US: `data/us.csv`, `data/us_open.csv`
- CN: `data/cn.csv`
Only adjusted close prices are required for attribution factor construction.
## Factor Definitions
### Standard Factors
For US:
- `MKT-RF`, `SMB`, `HML`, `RMW`, `CMA`, `RF` from external factor data
### Local Extension Factors
These are built from the same universe already used by the repo.
#### `MOM`
- Cross-sectional momentum long-short factor
- Rank stocks by 12-1 month return
- Long top quantile, short bottom quantile
- Equal weight within long and short legs
- Factor return is long return minus short return
#### `LOWVOL`
- Cross-sectional low-volatility factor
- Compute rolling volatility from daily returns
- Long lowest-vol quantile, short highest-vol quantile
- Equal weight within legs
#### `RECOVERY`
- Cross-sectional recovery factor
- Rank stocks by distance from rolling 63-day low
- Long strongest recovery names, short weakest recovery names
- Equal weight within legs
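These three definitions share the same long-short mechanics; a sketch of that shared construction (the quantile width and one-day score lag mirror the implementation in `factor_attribution.py`):

```python
import pandas as pd


def long_short_factor(scores: pd.DataFrame, returns: pd.DataFrame, q: float = 0.3) -> pd.Series:
    """Equal-weight long top-quantile minus short bottom-quantile, using lagged scores."""
    lagged = scores.shift(1)                               # only information available before t
    long_mask = lagged.ge(lagged.quantile(1 - q, axis=1), axis=0)
    short_mask = lagged.le(lagged.quantile(q, axis=1), axis=0)
    return returns.where(long_mask).mean(axis=1) - returns.where(short_mask).mean(axis=1)


def recovery_scores(prices: pd.DataFrame) -> pd.DataFrame:
    """RECOVERY score: distance from the rolling 63-day low."""
    return prices / prices.rolling(63, min_periods=63).min() - 1.0
```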
### Proxy Core Factors
Used for CN by default and as fallback for US.
#### `MKT`
- Benchmark daily return if benchmark exists
- Otherwise equal-weight universe return
#### `SMB_PROXY`
- Size proxy using inverse price level or market-cap proxy when only price data is available
- First iteration uses inverse price rank as a transparent proxy and explicitly labels it as proxy
#### `HML_PROXY`
- Value proxy using price-to-range or distance-to-trailing-low style signal
- This is not a true book-to-market factor and must be labeled proxy
#### `RMW_PROXY`
- Profitability proxy from return consistency and stability
#### `CMA_PROXY`
- Investment proxy from asset trend smoothness or expansion/contraction behavior inferred from price action
Proxy factors are included for model completeness, but the output must label them clearly as proxies rather than standard academic factors.
## Factor Construction Rules
- All local factors use only information available up to date `t` to explain returns at `t+1`
- No future data leakage
- Factor series are daily return series, not ranks
- Long-short factors should be approximately dollar-neutral
- Missing values are allowed during warmup windows and dropped during model alignment
- Quantile counts should adapt to available universe size
## Regression Models
### CAPM
Model:
- `strategy_excess_return ~ alpha + (MKT-RF)`
### FF5
Model:
- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA`
### FF5Plus
Model:
- `strategy_excess_return ~ alpha + MKT-RF + SMB + HML + RMW + CMA + MOM + LOWVOL + RECOVERY`
### Proxy Model
For markets without standard factors:
- `strategy_return ~ alpha + MKT + SMB_PROXY + HML_PROXY + RMW_PROXY + CMA_PROXY + MOM + LOWVOL + RECOVERY`
The module should report which model family was actually used.
## Alignment Rules
- Convert all equity curves to daily returns
- Build factor frames at daily frequency
- Join strategy returns and factor returns on date intersection
- For standard factor models, subtract `RF` from strategy returns
- Keep benchmark return separately for active return diagnostics, but not as a replacement for `MKT-RF` in standard factor models
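Taken together, the alignment and regression steps reduce to a small amount of pandas/numpy; a minimal CAPM-only sketch (it assumes a daily factor frame with `MKT_RF` and `RF` columns and omits the t-stat and R-squared diagnostics the full engine reports):

```python
import numpy as np
import pandas as pd


def capm_alpha(equity: pd.Series, factors: pd.DataFrame) -> dict[str, float]:
    """Regress daily excess strategy returns on MKT_RF with an intercept."""
    returns = equity.sort_index().pct_change()
    frame = pd.concat([returns.rename("strategy"), factors[["MKT_RF", "RF"]]], axis=1).dropna()
    y = (frame["strategy"] - frame["RF"]).to_numpy()         # excess return of the strategy
    x = np.column_stack([np.ones(len(frame)), frame["MKT_RF"].to_numpy()])
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    return {"alpha_ann": float(coef[0]) * 252, "beta_mkt": float(coef[1])}
```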
## Output Schema
### Summary Output
One row per strategy per model with fields including:
- `strategy`
- `market`
- `model`
- `factor_source`
- `proxy_only`
- `start_date`
- `end_date`
- `n_obs`
- `alpha_daily`
- `alpha_ann`
- `alpha_t_stat`
- `alpha_p_value`
- `r_squared`
- `adj_r_squared`
- `residual_vol_ann`
Selected factor loadings should also be flattened into summary columns when available:
- `beta_mkt`
- `beta_smb`
- `beta_hml`
- `beta_rmw`
- `beta_cma`
- `beta_mom`
- `beta_lowvol`
- `beta_recovery`
### Loadings Output
Long-form table:
- `strategy`
- `model`
- `factor`
- `beta`
- `t_stat`
- `p_value`
## CLI Changes
Add arguments to `main.py`:
- `--attribution`
- `--attribution-model {capm,ff5,ff5plus,all}`
- `--attribution-export <dir>`
Behavior:
- If `--attribution` is not set, current behavior is unchanged
- If set, attribution runs after backtest metrics are printed
- If export path is set, write:
- `summary.csv`
- `loadings.csv`
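In `argparse` terms this amounts to something like the following (defaults here are illustrative, not necessarily what `main.py` will use):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--attribution", action="store_true",
                    help="run factor attribution after backtest metrics are printed")
parser.add_argument("--attribution-model", choices=["capm", "ff5", "ff5plus", "all"],
                    default="all", help="which factor model(s) to fit")
parser.add_argument("--attribution-export", metavar="DIR", default=None,
                    help="directory for summary.csv and loadings.csv")
```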
## Terminal Reporting
For each strategy and selected model, print a compact line containing:
- annualized alpha
- major factor loadings
- R-squared
- residual volatility
After the numeric table, print a short interpretation section:
- whether alpha remains after adding factors
- which factors explain most of the strategy
- whether the model fit is weak or strong
Interpretation should remain descriptive and avoid overclaiming statistical significance.
## Error Handling
- External factor download failure:
- Use cache if available
- Otherwise downgrade to proxy mode
- Missing or short overlap window:
- Skip that model and report insufficient data
- Singular matrix or severe multicollinearity:
- Catch and report model failure or unstable fit
- Missing benchmark column:
- Fall back to equal-weight universe market proxy where possible
## Testing Plan
### Unit Tests
- External factor parser converts dates and percent units correctly
- Cache loader returns cached data on download failure
- Extension factor builders produce expected columns and no future leakage
- Regression on synthetic data recovers approximate known alpha and betas
### Integration Tests
- End-to-end attribution on a small deterministic equity and factor dataset
- CLI export produces expected files and columns
### Regression Tests
- Fixed local US sample produces stable output shape and model naming
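For the synthetic-data case, a sketch of what such a test could look like (tolerances are illustrative):

```python
import numpy as np
import pandas as pd

from factor_attribution import run_factor_regression


def test_regression_recovers_known_alpha_and_beta():
    rng = np.random.default_rng(0)
    dates = pd.bdate_range("2020-01-01", periods=500)
    mkt = pd.Series(rng.normal(0.0004, 0.01, len(dates)), index=dates, name="MKT_RF")
    # construct returns with a known daily alpha of 2bp and a market beta of 1.2
    strategy = 0.0002 + 1.2 * mkt + rng.normal(0.0, 0.001, len(dates))
    result = run_factor_regression(strategy, mkt.to_frame(), ["MKT_RF"])
    assert abs(result["alpha_daily"] - 0.0002) < 2e-4
    assert abs(result["betas"]["MKT_RF"] - 1.2) < 0.05
```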
## Implementation Notes
- Prefer `numpy.linalg.lstsq` or `scipy` OLS utilities already available in dependencies
- Keep implementation dependency-light
- Keep factor construction functions separate from regression code for testability
- Avoid changing existing strategy behavior
## Risks
- Standard factor downloads may change source file formatting
- Proxy factor definitions for CN will be weaker than true academic factors
- Some strategy returns may be highly collinear with momentum-like factors, reducing interpretability
- Short overlap between strategy returns and factor history, together with warmup windows, can materially reduce sample size
## Success Criteria
- A user can run backtests with `--attribution` and receive factor-based explanations of returns
- US runs use standard external factors when available
- CN runs still produce a clearly labeled proxy attribution report
- Outputs distinguish residual alpha from factor exposure
- The module is easy to extend with new factors later


@@ -0,0 +1,376 @@
# US High-Alpha Research Design
**Date:** 2026-04-17
## Goal
Build a research framework for US `long-only` equity strategies that uses only free or already-accessible data, avoids lookahead and survivorship traps as much as the available data allows, and can rank candidate strategy families over `1/2/3/5/10y` windows. The objective is not to manufacture the single highest backtest CAGR, but to identify strategy families whose alpha survives realistic liquidity filters, transaction costs, and point-in-time constraints.
## Constraints
- Data sources must be free or already accessible from the current project environment.
- Portfolio construction must be `long-only`.
- The US research universe may extend beyond the S&P 500 into a broader US stock pool, but all conclusions must clearly distinguish between:
- `strict` results from a point-in-time-clean universe.
- `exploratory` results from a wider free-data universe that is not fully point-in-time-clean.
- All signals must use only information available at the time of decision.
- The framework must explicitly guard against:
- survivorship bias
- lookahead bias
- static industry-label leakage
- microcap and illiquidity contamination
## Success Criteria
The framework is successful if it produces:
1. A unified research and backtest pipeline for US strategies.
2. A ranked comparison of `3-5` high-value strategy families across `1/2/3/5/10y`.
3. Metrics that go beyond headline CAGR, including:
- `CAGR`
- `Sharpe`
- `Sortino`
- `MaxDD`
- `Calmar`
- `Turnover`
- `Average positions`
- `Median ADV usage`
- `Subperiod stability`
4. Tiered interpretation of results:
- `Tier A`: realistic and tradable under tighter liquidity assumptions
- `Tier B`: strong alpha but lower capacity
- `Tier C`: attractive only under loose assumptions and not suitable as a production candidate
Any strategy that reports near-`50% CAGR` must also explain:
- which market regime contributed most of the return
- whether performance depends on low-liquidity or small-cap tails
- whether results survive after removing the most extreme tail names
## Research Philosophy
This project should prefer honest, repeatable alpha discovery over spectacular but fragile backtests. Under the current constraints, a `10y 50% CAGR` should be treated as an upper-end outcome that may appear in selective windows, not as a baseline expectation. The more realistic goal is to find strategies that are strong over `3/5y`, still meaningfully outperform over `10y`, and remain robust after tightening assumptions.
## Strategy Families
The research effort will focus on four strategy families.
### 1. Earnings Drift Proxy
Target the post-information-repricing phase after major company-specific events. This is conceptually the highest-alpha family, but also the most dependent on event data quality.
Primary implementation order:
- use free historical earnings date data if it is stable enough
- otherwise fall back to price-and-volume-defined event proxies
Core signal ingredients:
- strong post-event excess return over `1-3` days
- abnormal volume
- gap that does not immediately fill
- price holding near short- and medium-term highs after the event
### 2. Breakout After Compression
Target stocks that transition from low-volatility congestion into sustained trend expansion. This is the cleanest strategy family to implement with free daily OHLCV data and is the best first candidate for a strict production-grade pipeline.
Core signal ingredients:
- proximity to `120d` or `252d` highs
- volatility compression over the prior `20-40` trading days
- rising dollar volume
- positive relative strength versus market and industry proxies
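A rough scoring sketch for this family (the specific windows and the way the ingredients are combined here are placeholders, not tuned research parameters):

```python
import pandas as pd


def breakout_compression_score(close: pd.DataFrame, volume: pd.DataFrame) -> pd.DataFrame:
    """Higher score = near highs, exiting tight congestion, on rising dollar volume."""
    ret = close.pct_change()
    near_high = close / close.rolling(252, min_periods=252).max()     # proximity to 252d high
    compression = ret.rolling(20).std() / ret.rolling(120).std()      # low = tight recent range
    dollar_vol = (close * volume).rolling(20).mean()
    rising_volume = dollar_vol / dollar_vol.shift(20)                 # expanding participation
    score = (
        near_high.rank(axis=1, pct=True)
        - compression.rank(axis=1, pct=True)
        + rising_volume.rank(axis=1, pct=True)
    )
    return score.shift(1)   # computed from day-t bars, tradable no earlier than t+1
```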
### 3. Gap-and-Go / High-Volume Continuation
Target the second phase of move continuation after abnormal return and volume shocks rather than blindly chasing the first event day.
Core signal ingredients:
- abnormal `1d` or `3d` return
- abnormal volume versus trailing `60d`
- post-event price holding above the event anchor
- subsequent breakout continuation
This family has high potential upside but is more sensitive to cost assumptions and market regime.
### 4. Regime-Gated Cross-Sectional Alpha
Use broad market and industry-state filters to improve the hit rate of the other strategy families and provide a lower-volatility baseline alpha engine.
Core signal ingredients:
- market risk-on versus risk-off state
- industry ETF leadership
- relative strength
- recovery from drawdowns
- trend quality
- near-`52w` high behavior
- price/volume confirmation
This family is not expected to produce the highest standalone CAGR, but it is expected to improve robustness and reduce participation in hostile environments.
## Prioritization
Recommended implementation order:
1. `Breakout After Compression`
2. `Regime-Gated Cross-Sectional Alpha`
3. `Gap-and-Go / High-Volume Continuation`
4. `Earnings Drift Proxy` only after validating free event-data quality
Rationale:
- `Breakout After Compression` is the most implementable and least ambiguous with free data.
- `Regime-Gated Cross-Sectional Alpha` provides a shared control layer for the rest of the framework.
- `Gap-and-Go` has higher upside but also higher sensitivity to assumptions.
- `Earnings Drift Proxy` is theoretically powerful but should not become the project bottleneck if free event history is incomplete.
## Data Layer
The framework needs a richer data layer than the current `close/open` setup.
### Required price fields
Daily US market data should support at least:
- `open`
- `high`
- `low`
- `close`
- `volume`
This is required to define:
- real breakouts
- gap events
- volatility compression
- abnormal dollar volume
### Required ETF layer
Add stable market and industry ETFs for regime and leadership analysis, at minimum:
- `SPY`
- `QQQ`
- `IWM`
- `MDY`
- `XLF`
- `XLK`
- `XLI`
- `XLV`
- `XLY`
- `XLP`
- `XLE`
- `XLU`
- `XLRE`
- `XLB`
- `SOXX`
- `IGV`
- `SMH`
### Universe modes
The framework must support two explicit modes.
#### Strict mode
Use point-in-time-clean universe membership, initially based on the existing PIT S&P 500 machinery in the repository. This is the baseline for formal, defensible results.
#### Exploratory mode
Use a wider free-data US stock pool to search for stronger alpha patterns. These results are useful for idea generation but must be labeled as exploratory unless later promoted into a point-in-time-clean setup.
## Universe Construction Rules
The tradable universe must be computed daily from lagged information.
### Daily eligibility rules
Each stock may enter the candidate set only if all required conditions hold as of `t-1`:
- enough listing history exists to compute the strategy lookbacks
- enough valid volume observations exist
- minimum lagged price threshold is met
- minimum lagged dollar-volume threshold is met
Representative defaults:
- `close[t-1] > 5`
- `median_dollar_volume_60d[t-1] > $20M` in `strict` mode
- `median_dollar_volume_60d[t-1] > $5M` in `exploratory` mode
- `>= 252` valid trading days before eligibility
- `>= 40` valid volume days in the trailing `60d`
Thresholds should be strategy-specific and tunable in robustness sweeps.
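A sketch of that mask under the strict-mode defaults (thresholds are the representative values above; every input is lagged one day before use):

```python
import pandas as pd


def eligibility_mask(close: pd.DataFrame, volume: pd.DataFrame,
                     min_price: float = 5.0, min_median_dv: float = 20e6) -> pd.DataFrame:
    """Daily boolean mask of tradable names, computed from t-1 information."""
    dollar_volume = close * volume
    median_dv_60 = dollar_volume.rolling(60, min_periods=40).median()
    history_ok = close.notna().astype(float).rolling(252, min_periods=252).sum() >= 252
    volume_ok = volume.notna().astype(float).rolling(60, min_periods=60).sum() >= 40
    raw = (close > min_price) & (median_dv_60 > min_median_dv) & history_ok & volume_ok
    return raw.shift(1).fillna(False).astype(bool)   # decision at t uses data through t-1
```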
### Industry mapping
Do not use today's static sector labels to explain historical behavior. For historical regime and industry alignment, prefer PIT-safe proxies such as rolling correlation or beta to industry ETFs over `63/126d` windows.
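One PIT-safe proxy is a trailing beta of each stock to a sector ETF, computed only from past returns, for example:

```python
import pandas as pd


def rolling_beta_to_etf(stock_returns: pd.DataFrame, etf_returns: pd.Series,
                        window: int = 126) -> pd.DataFrame:
    """Trailing beta of every stock to one industry ETF; no forward-looking labels."""
    cov = stock_returns.rolling(window).cov(etf_returns)   # per-stock covariance with the ETF
    var = etf_returns.rolling(window).var()
    return cov.div(var, axis=0)                            # beta_{i,t} = cov_{i,t} / var_t
```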
## Anti-Lookahead Rules
The framework must enforce the following rules consistently.
1. Signals computed using `t` daily bars may only be traded no earlier than `t+1`.
2. If an event is effectively published after market close, it becomes tradable no earlier than the next trading day after publication.
3. Rolling inputs for liquidity, volatility, and breakout logic must use complete lagged windows with explicit timing semantics.
4. Cross-sectional ranking must happen only within the daily eligible universe.
5. Universe membership, filters, and factor normalization must be applied before portfolio selection, not after.
## Execution Convention
Default execution convention:
- observe data through `t` close
- compute signal after the `t` close
- trade at `t+1`
The framework may compare `t+1 open` and `t+1 close` execution variants if the data path supports both, but the default research baseline should be conservative and consistent.
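In pandas terms, the default convention reduces to a one-day shift of whatever signal matrix a strategy produces:

```python
import pandas as pd


def lag_for_next_day_execution(signal: pd.DataFrame) -> pd.DataFrame:
    """Weights decided after the day-t close are first held on day t+1."""
    return signal.shift(1).fillna(0.0)
```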
## Backtest and Evaluation Framework
Every strategy family must run through a single pipeline that:
1. loads required market data
2. constructs the daily eligible universe
3. computes regime filters
4. computes strategy scores or event states
5. builds a `long-only` portfolio
6. applies transaction costs
7. reports `1/2/3/5/10y` windows
8. records robustness diagnostics
### Portfolio defaults
Initial baseline settings:
- `long-only`
- concentrated books such as `top 5`, `top 10`, `top 20`
- start with `equal weight`
- add `inverse-vol` weighting only as a secondary comparison
Equal-weight concentrated portfolios should be the first baseline because they are harder to over-engineer than adaptive weighting schemes.
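A sketch of that baseline (it mirrors the top-N construction already used in `factor_backtest.py`; the scores and eligibility mask are assumed to come from earlier pipeline steps):

```python
import numpy as np
import pandas as pd


def equal_weight_top_n(scores: pd.DataFrame, eligible: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Equal-weight the n best-scoring names within the daily eligible universe."""
    masked = scores.where(eligible)                        # rank only eligible names
    rank = masked.rank(axis=1, ascending=False)
    weights = (rank <= n).astype(float)
    row_sum = weights.sum(axis=1).replace(0.0, np.nan)
    return weights.div(row_sum, axis=0).fillna(0.0)        # rows sum to 1.0 when any name is held
```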
### Required robustness checks
Any strategy candidate that looks strong must automatically be re-run under:
- tighter liquidity thresholds
- fewer and more positions
- higher trading costs
- different rebalance frequencies
- exclusion of the lowest-liquidity or smallest-cap tail
Only strategies that survive these perturbations should be promoted to `Tier A`.
## Repository Changes
The following repository changes are required.
### New modules
#### `research/us_universe.py`
Responsibilities:
- build daily tradable-universe masks
- support `strict` and `exploratory` modes
- enforce lagged eligibility rules
#### `data_manager.py` extension or new `market_data.py`
Responsibilities:
- support daily US `OHLCV`
- support ETF data updates
- preserve existing price-loading workflows where practical
#### `research/regime_filters.py`
Responsibilities:
- market risk-on/risk-off filters
- ETF leadership signals
- breadth and relative-strength helpers
#### `research/event_factors.py`
Responsibilities:
- breakout-compression scores
- gap-continuation scores
- high-volume continuation logic
- earnings-drift proxy logic
#### `research/us_alpha_pipeline.py`
Responsibilities:
- orchestrate end-to-end research runs
- load data
- build universe masks
- run strategy families
- produce windowed rankings
- label output as `strict` or `exploratory`
#### `research/us_alpha_report.py`
Responsibilities:
- format tables and CSV outputs
- summarize results by family and horizon
- support markdown export if needed
## Research Phasing
The implementation should be split into two phases.
### Phase 1
Build the strict, defensible research backbone:
- PIT S&P 500 universe
- OHLCV data support
- ETF regime filters
- `Breakout After Compression`
- `Regime-Gated Cross-Sectional Alpha`
- `Gap-and-Go / High-Volume Continuation`
- unified backtest and reporting pipeline
This phase should produce a clean research system that is difficult to fool with future information.
### Phase 2
Expand into higher-upside exploratory research:
- wider US stock universe
- broader signal scanning
- stronger CAGR search
- explicit exploratory labeling
This phase is for alpha discovery, not for making final claims about unbiased production performance.
## Recommended Output
The finished framework should produce:
- a repeatable research entrypoint for US alpha studies
- CSV outputs for `1/2/3/5/10y` windows
- a ranked table of strategy families
- tier classification for candidates
- notes on where near-`50% CAGR` outcomes come from and whether they remain credible after tightening assumptions
## Non-Goals
This project does not aim to:
- promise stable `10y 50% CAGR`
- claim a fully point-in-time-clean all-US-stock universe from free data alone
- optimize to a single headline metric at the expense of realism
- treat exploratory full-market scans as production-quality evidence
## Key Decision
The core design choice is to build infrastructure that minimizes self-deception first, and only then search for extreme CAGR outcomes. Any other order is likely to produce attractive but unreliable results.

factor_attribution.py (new file, 727 lines)

@@ -0,0 +1,727 @@
from __future__ import annotations
import json
import http.client
import io
import re
import socket
import ssl
import warnings
import zipfile
from pathlib import Path
from urllib.error import URLError
from urllib.request import Request, urlopen
import numpy as np
import pandas as pd
from scipy import stats
KEN_FRENCH_DAILY_FF5_ZIP_URL = (
"https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/"
"F-F_Research_Data_5_Factors_2x3_daily_CSV.zip"
)
EXPECTED_FACTOR_COLUMNS = ["MKT_RF", "SMB", "HML", "RMW", "CMA", "RF"]
CAPM_FACTOR_COLUMNS = ["MKT_RF"]
FF5_FACTOR_COLUMNS = ["MKT_RF", "SMB", "HML", "RMW", "CMA"]
EXTENSION_FACTOR_COLUMNS = ["MOM", "LOWVOL", "RECOVERY"]
FF5PLUS_FACTOR_COLUMNS = FF5_FACTOR_COLUMNS + EXTENSION_FACTOR_COLUMNS
PROXY_FACTOR_COLUMNS = [
"MKT",
"SMB_PROXY",
"HML_PROXY",
"RMW_PROXY",
"CMA_PROXY",
] + EXTENSION_FACTOR_COLUMNS
TRADING_DAYS_PER_YEAR = 252
MISSING_BENCHMARK_SENTINEL = "__missing_benchmark__"
SUMMARY_BETA_COLUMN_BY_FACTOR = {
"MKT_RF": "beta_mkt",
"MKT": "beta_mkt",
"SMB": "beta_smb",
"SMB_PROXY": "beta_smb",
"HML": "beta_hml",
"HML_PROXY": "beta_hml",
"RMW": "beta_rmw",
"RMW_PROXY": "beta_rmw",
"CMA": "beta_cma",
"CMA_PROXY": "beta_cma",
"MOM": "beta_mom",
"LOWVOL": "beta_lowvol",
"RECOVERY": "beta_recovery",
}
SUMMARY_COLUMNS = [
"strategy",
"market",
"model",
"factor_source",
"proxy_only",
"beta_semantics",
"start_date",
"end_date",
"n_obs",
"alpha_daily",
"alpha_ann",
"alpha_t_stat",
"alpha_p_value",
"r_squared",
"adj_r_squared",
"residual_vol_ann",
"beta_mkt",
"beta_smb",
"beta_hml",
"beta_rmw",
"beta_cma",
"beta_mom",
"beta_lowvol",
"beta_recovery",
]
LOADING_COLUMNS = [
"strategy",
"market",
"model",
"factor_source",
"proxy_only",
"factor",
"beta",
"t_stat",
"p_value",
]
SEMANTIC_BETA_COLUMNS = [
"beta_mkt",
"beta_smb",
"beta_hml",
"beta_rmw",
"beta_cma",
"beta_mom",
"beta_lowvol",
"beta_recovery",
]
class ExternalFactorFormatError(ValueError):
pass
class ExternalFactorDownloadError(OSError):
pass
def _download_kf_zip_bytes() -> bytes:
request = Request(
KEN_FRENCH_DAILY_FF5_ZIP_URL,
headers={"User-Agent": "quant-factor-attribution/0.1"},
)
try:
with urlopen(request, timeout=30) as response:
return response.read()
except (
URLError,
TimeoutError,
ConnectionError,
socket.timeout,
socket.gaierror,
ssl.SSLError,
http.client.HTTPException,
http.client.IncompleteRead,
http.client.RemoteDisconnected,
) as exc:
raise ExternalFactorDownloadError(f"Failed to download external factor data: {exc}") from exc
def _parse_kf_daily_csv(raw_bytes: bytes) -> pd.DataFrame:
with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
member_names = [
name
for name in archive.namelist()
if not name.endswith("/") and name.lower().endswith((".csv", ".txt"))
]
if not member_names:
raise ExternalFactorFormatError("Ken French archive did not contain a CSV or TXT file")
try:
text = archive.read(member_names[0]).decode("utf-8-sig")
except UnicodeDecodeError as exc:
raise ExternalFactorFormatError("Ken French factor file was not valid UTF-8 text") from exc
lines = [line for line in text.splitlines() if line.strip()]
try:
header_index = next(i for i, line in enumerate(lines) if "Mkt-RF" in line)
except StopIteration as exc:
raise ExternalFactorFormatError("Ken French factor file was missing the daily factor header") from exc
table = "\n".join(lines[header_index:])
try:
factors = pd.read_csv(io.StringIO(table))
except pd.errors.ParserError as exc:
raise ExternalFactorFormatError("Ken French factor table could not be parsed") from exc
factors = factors.rename(columns={"Mkt-RF": "MKT_RF"})
date_column = factors.columns[0]
missing_columns = [column for column in EXPECTED_FACTOR_COLUMNS if column not in factors.columns]
if missing_columns:
raise ExternalFactorFormatError(
f"Ken French factor table was missing columns: {', '.join(missing_columns)}"
)
factors = factors[factors[date_column].astype(str).str.fullmatch(r"\d{8}")]
if factors.empty:
raise ExternalFactorFormatError("Ken French factor table did not contain daily rows")
try:
factors[date_column] = pd.to_datetime(factors[date_column], format="%Y%m%d")
except ValueError as exc:
raise ExternalFactorFormatError("Ken French factor table contained invalid dates") from exc
factors = factors.set_index(date_column)
factors.index.name = None
try:
factors = factors[EXPECTED_FACTOR_COLUMNS].astype(float) / 100.0
except ValueError as exc:
raise ExternalFactorFormatError("Ken French factor table contained non-numeric values") from exc
return factors
def _warn_and_load_cached_factors(cache_path: Path, reason: str) -> pd.DataFrame:
warnings.warn(
f"Using cached data from {cache_path} because {reason}.",
UserWarning,
stacklevel=2,
)
return pd.read_csv(cache_path, index_col=0, parse_dates=True)
def load_external_us_factors(cache_dir: Path | str = "data/factors") -> pd.DataFrame:
cache_path = Path(cache_dir) / "ff5_us_daily.csv"
cache_path.parent.mkdir(parents=True, exist_ok=True)
try:
raw_bytes = _download_kf_zip_bytes()
except ExternalFactorDownloadError as exc:
if cache_path.exists():
return _warn_and_load_cached_factors(cache_path, f"download failed: {exc}")
raise
try:
factors = _parse_kf_daily_csv(raw_bytes)
except zipfile.BadZipFile as exc:
if cache_path.exists():
return _warn_and_load_cached_factors(cache_path, f"the upstream ZIP was invalid: {exc}")
raise
except ExternalFactorFormatError as exc:
if cache_path.exists():
return _warn_and_load_cached_factors(
cache_path,
f"the upstream factor format was invalid: {exc}",
)
raise
factors.to_csv(cache_path)
return factors
def _select_stock_prices(price_data: pd.DataFrame, benchmark: str) -> pd.DataFrame:
stocks = price_data.drop(columns=[benchmark], errors="ignore")
return stocks.sort_index().astype(float)
def _long_short_factor(
scores: pd.DataFrame,
returns: pd.DataFrame,
quantile: float = 0.3,
) -> pd.Series:
lagged_scores = scores.shift(1)
high_cutoff = lagged_scores.quantile(1 - quantile, axis=1)
low_cutoff = lagged_scores.quantile(quantile, axis=1)
long_mask = lagged_scores.ge(high_cutoff, axis=0)
short_mask = lagged_scores.le(low_cutoff, axis=0)
long_returns = returns.where(long_mask).mean(axis=1)
short_returns = returns.where(short_mask).mean(axis=1)
return (long_returns - short_returns).rename(None)
def build_extension_factors(
price_data: pd.DataFrame,
benchmark: str,
market: str,
) -> pd.DataFrame:
del market
stocks = _select_stock_prices(price_data, benchmark)
returns = stocks.pct_change()
momentum_scores = stocks.shift(21).pct_change(231)
low_vol_scores = -returns.rolling(60, min_periods=60).std()
recovery_scores = stocks / stocks.rolling(63, min_periods=63).min() - 1.0
return pd.DataFrame(
{
"MOM": _long_short_factor(momentum_scores, returns),
"LOWVOL": _long_short_factor(low_vol_scores, returns),
"RECOVERY": _long_short_factor(recovery_scores, returns),
},
index=price_data.index,
)
def _positive_share(values: np.ndarray) -> float:
return float(np.mean(values > 0))
def build_proxy_core_factors(
price_data: pd.DataFrame,
benchmark: str,
market: str,
) -> pd.DataFrame:
del market
stocks = _select_stock_prices(price_data, benchmark)
returns = stocks.pct_change()
if benchmark in price_data:
market_factor = price_data[benchmark].astype(float).pct_change()
else:
market_factor = returns.mean(axis=1)
inverse_price_scores = -stocks
value_proxy_scores = -(stocks / stocks.rolling(252, min_periods=252).min() - 1.0)
profitability_proxy_scores = returns.rolling(63, min_periods=63).apply(_positive_share, raw=True)
investment_proxy_scores = -stocks.pct_change(126)
return pd.DataFrame(
{
"MKT": market_factor,
"SMB_PROXY": _long_short_factor(inverse_price_scores, returns),
"HML_PROXY": _long_short_factor(value_proxy_scores, returns),
"RMW_PROXY": _long_short_factor(profitability_proxy_scores, returns),
"CMA_PROXY": _long_short_factor(investment_proxy_scores, returns),
},
index=price_data.index,
)
def prepare_factor_models(
market: str,
extension_factors: pd.DataFrame,
proxy_factors: pd.DataFrame | None = None,
external_factors: pd.DataFrame | None = None,
) -> dict[str, object]:
market_name = market.lower()
if market_name == "us" and external_factors is not None:
factor_frame = pd.concat([external_factors, extension_factors], axis=1)
return {
"factor_frame": factor_frame,
"models": {
"capm": CAPM_FACTOR_COLUMNS.copy(),
"ff5": FF5_FACTOR_COLUMNS.copy(),
"ff5plus": FF5PLUS_FACTOR_COLUMNS.copy(),
},
"risk_free_col": "RF",
"factor_source": "external+local",
"proxy_only": False,
"model_family": "standard",
}
if proxy_factors is None:
raise ValueError("proxy_factors are required when external factors are unavailable")
factor_frame = pd.concat([proxy_factors, extension_factors], axis=1)
return {
"factor_frame": factor_frame,
"models": {"proxy": PROXY_FACTOR_COLUMNS.copy()},
"risk_free_col": None,
"factor_source": "proxy_only",
"proxy_only": True,
"model_family": "proxy",
}
def run_factor_regression(
strategy_returns: pd.Series,
factor_frame: pd.DataFrame,
factor_cols: list[str],
risk_free_col: str | None = None,
) -> dict[str, object]:
regression_frame = pd.concat(
[strategy_returns.rename("strategy"), factor_frame[factor_cols + ([risk_free_col] if risk_free_col else [])]],
axis=1,
).dropna()
if regression_frame.empty:
raise ValueError("No overlapping strategy and factor observations were available for regression")
y = regression_frame["strategy"].astype(float)
if risk_free_col is not None:
y = y - regression_frame[risk_free_col].astype(float)
x = regression_frame[factor_cols].astype(float).to_numpy()
x = np.column_stack([np.ones(len(regression_frame)), x])
n_obs = len(regression_frame)
param_count = x.shape[1]
if n_obs < param_count:
raise ValueError(
f"Insufficient observations for regression: need at least {param_count} rows, got {n_obs}"
)
coefficients, _, rank, _ = np.linalg.lstsq(x, y.to_numpy(), rcond=None)
if rank < param_count:
raise ValueError(
"Regression design matrix is rank-deficient; coefficients are not uniquely identified"
)
fitted = x @ coefficients
residuals = y.to_numpy() - fitted
residual_series = pd.Series(residuals, index=regression_frame.index)
if len(residual_series) == 1:
residual_vol_ann = 0.0
else:
residual_vol_ann = float(residual_series.std(ddof=1) * np.sqrt(TRADING_DAYS_PER_YEAR))
dof = n_obs - param_count
if dof > 0:
residual_variance = float((residuals @ residuals) / dof)
covariance = residual_variance * np.linalg.pinv(x.T @ x)
standard_errors = np.sqrt(np.diag(covariance))
with np.errstate(divide="ignore", invalid="ignore"):
t_stats = np.divide(
coefficients,
standard_errors,
out=np.full_like(coefficients, np.nan, dtype=float),
where=standard_errors > 0,
)
p_values = 2.0 * stats.t.sf(np.abs(t_stats), df=dof)
adj_r_squared_is_defined = True
else:
t_stats = np.full_like(coefficients, np.nan, dtype=float)
p_values = np.full_like(coefficients, np.nan, dtype=float)
adj_r_squared_is_defined = False
ss_total = float(((y - y.mean()) ** 2).sum())
ss_residual = float(np.sum(residuals**2))
r_squared = 1.0 - ss_residual / ss_total if ss_total else 0.0
if adj_r_squared_is_defined:
adj_r_squared = 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - param_count)
else:
adj_r_squared = float("nan")
factor_slice = slice(1, None)
return {
"alpha_daily": float(coefficients[0]),
"alpha_ann": float(coefficients[0] * TRADING_DAYS_PER_YEAR),
"alpha_t_stat": float(t_stats[0]),
"alpha_p_value": float(p_values[0]),
"betas": {name: float(value) for name, value in zip(factor_cols, coefficients[factor_slice])},
"t_stats": {name: float(value) for name, value in zip(factor_cols, t_stats[factor_slice])},
"p_values": {name: float(value) for name, value in zip(factor_cols, p_values[factor_slice])},
"r_squared": float(r_squared),
"adj_r_squared": float(adj_r_squared),
"residual_vol_ann": residual_vol_ann,
"start_date": regression_frame.index.min().date().isoformat(),
"end_date": regression_frame.index.max().date().isoformat(),
"n_obs": n_obs,
}
def _empty_attribution_frames() -> tuple[pd.DataFrame, pd.DataFrame]:
return (
pd.DataFrame(columns=SUMMARY_COLUMNS),
pd.DataFrame(columns=LOADING_COLUMNS),
)
def _select_model_names(
model_selection: str,
available_models: dict[str, list[str]],
) -> list[str]:
if model_selection == "all":
return list(available_models)
if model_selection in available_models:
return [model_selection]
return list(available_models)
def _resolve_benchmark_symbol(benchmark: str | None) -> str:
if benchmark is None:
return MISSING_BENCHMARK_SENTINEL
return benchmark
def _beta_semantics_map(proxy_only: bool) -> dict[str, str]:
return {
"beta_mkt": "MKT" if proxy_only else "MKT_RF",
"beta_smb": "SMB_PROXY" if proxy_only else "SMB",
"beta_hml": "HML_PROXY" if proxy_only else "HML",
"beta_rmw": "RMW_PROXY" if proxy_only else "RMW",
"beta_cma": "CMA_PROXY" if proxy_only else "CMA",
"beta_mom": "MOM",
"beta_lowvol": "LOWVOL",
"beta_recovery": "RECOVERY",
}
def _resolve_beta_semantics(row: pd.Series) -> dict[str, str]:
canonical = _beta_semantics_map(bool(row.get("proxy_only", False)))
raw_value = row.get("beta_semantics")
if isinstance(raw_value, str) and raw_value:
try:
parsed = json.loads(raw_value)
except json.JSONDecodeError:
return canonical
else:
if isinstance(parsed, dict):
parsed_mapping = {str(key): str(value) for key, value in parsed.items()}
if set(parsed_mapping) == set(SEMANTIC_BETA_COLUMNS) and all(
value.strip() for value in parsed_mapping.values()
) and _semantics_have_unique_headers(parsed_mapping):
return parsed_mapping
return canonical
def _beta_header_name(factor_name: str) -> str:
suffix = factor_name.strip().lower()
suffix = re.sub(r"[^a-z0-9]+", "_", suffix).strip("_")
if suffix == "mkt_rf":
suffix = "mkt"
return f"beta_{suffix}"
def _semantics_have_unique_headers(semantics: dict[str, str]) -> bool:
headers = [_beta_header_name(semantics[column]) for column in SEMANTIC_BETA_COLUMNS]
if any(header == "beta_" for header in headers):
return False
return len(headers) == len(set(headers))
def _section_beta_header_map(semantics: dict[str, str]) -> dict[str, str]:
header_map: dict[str, str] = {}
for beta_column, factor_name in semantics.items():
header_map[beta_column] = _beta_header_name(factor_name)
return header_map
def _section_key(row: pd.Series) -> tuple[bool, tuple[tuple[str, str], ...]]:
semantics = _resolve_beta_semantics(row)
return bool(row.get("proxy_only", False)), tuple((key, semantics[key]) for key in SEMANTIC_BETA_COLUMNS)
def attribute_strategies(
results_df: pd.DataFrame,
benchmark_label: str,
price_data: pd.DataFrame,
market: str,
model_selection: str = "all",
benchmark: str | None = None,
external_factors: pd.DataFrame | None = None,
) -> tuple[pd.DataFrame, pd.DataFrame]:
benchmark_symbol = _resolve_benchmark_symbol(benchmark)
extension_factors = build_extension_factors(price_data, benchmark=benchmark_symbol, market=market)
resolved_external_factors = external_factors
market_name = market.lower()
if market_name == "us" and resolved_external_factors is None:
try:
resolved_external_factors = load_external_us_factors()
except (ExternalFactorDownloadError, ExternalFactorFormatError, zipfile.BadZipFile) as exc:
warnings.warn(
f"Falling back to proxy factor attribution because external US factors were unavailable: {exc}",
UserWarning,
stacklevel=2,
)
resolved_external_factors = None
proxy_factors = None
if market_name != "us" or resolved_external_factors is None:
proxy_factors = build_proxy_core_factors(price_data, benchmark=benchmark_symbol, market=market)
prepared = prepare_factor_models(
market=market,
extension_factors=extension_factors,
proxy_factors=proxy_factors,
external_factors=resolved_external_factors,
)
model_names = _select_model_names(model_selection, prepared["models"])
strategy_returns = results_df.sort_index().pct_change(fill_method=None)
if strategy_returns.empty:
return _empty_attribution_frames()
summary_rows: list[dict[str, object]] = []
loading_rows: list[dict[str, object]] = []
for strategy_name in strategy_returns.columns:
if strategy_name == benchmark_label:
continue
for model_name in model_names:
factor_cols = prepared["models"][model_name]
try:
regression_result = run_factor_regression(
strategy_returns=strategy_returns[strategy_name],
factor_frame=prepared["factor_frame"],
factor_cols=factor_cols,
risk_free_col=prepared["risk_free_col"],
)
except ValueError as exc:
warnings.warn(
f"Skipping factor attribution for {strategy_name} ({model_name}): {exc}",
UserWarning,
stacklevel=2,
)
continue
summary_row: dict[str, object] = {
"strategy": strategy_name,
"market": market_name,
"model": model_name,
"factor_source": prepared["factor_source"],
"proxy_only": prepared["proxy_only"],
"beta_semantics": json.dumps(_beta_semantics_map(bool(prepared["proxy_only"])), sort_keys=True),
"start_date": regression_result["start_date"],
"end_date": regression_result["end_date"],
"n_obs": regression_result["n_obs"],
"alpha_daily": regression_result["alpha_daily"],
"alpha_ann": regression_result["alpha_ann"],
"alpha_t_stat": regression_result["alpha_t_stat"],
"alpha_p_value": regression_result["alpha_p_value"],
"r_squared": regression_result["r_squared"],
"adj_r_squared": regression_result["adj_r_squared"],
"residual_vol_ann": regression_result["residual_vol_ann"],
"beta_mkt": np.nan,
"beta_smb": np.nan,
"beta_hml": np.nan,
"beta_rmw": np.nan,
"beta_cma": np.nan,
"beta_mom": np.nan,
"beta_lowvol": np.nan,
"beta_recovery": np.nan,
}
for factor_name, beta in regression_result["betas"].items():
summary_column = SUMMARY_BETA_COLUMN_BY_FACTOR.get(factor_name)
if summary_column is not None:
summary_row[summary_column] = beta
loading_rows.append(
{
"strategy": strategy_name,
"market": market_name,
"model": model_name,
"factor_source": prepared["factor_source"],
"proxy_only": prepared["proxy_only"],
"factor": factor_name,
"beta": beta,
"t_stat": regression_result["t_stats"][factor_name],
"p_value": regression_result["p_values"][factor_name],
}
)
summary_rows.append(summary_row)
summary_df = pd.DataFrame(summary_rows, columns=SUMMARY_COLUMNS)
loadings_df = pd.DataFrame(loading_rows, columns=LOADING_COLUMNS)
return summary_df, loadings_df
def export_attribution(
summary_df: pd.DataFrame,
loadings_df: pd.DataFrame,
output_dir: Path | str,
) -> None:
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
summary_df.to_csv(output_path / "summary.csv", index=False)
loadings_df.to_csv(output_path / "loadings.csv", index=False)
def _describe_alpha(alpha_ann: float) -> str:
if alpha_ann > 0.02:
return "positive"
if alpha_ann < -0.02:
return "negative"
return "close to flat"
def _describe_fit(r_squared: float) -> str:
if r_squared >= 0.75:
return "strong"
if r_squared >= 0.4:
return "moderate"
return "weak"
def _top_loading_descriptions(row: pd.Series, limit: int = 2) -> str:
beta_columns = [column for column in row.index if column.startswith("beta_")]
factor_labels = _resolve_beta_semantics(row)
present = []
for column in beta_columns:
value = row.get(column)
label = factor_labels.get(column)
if label is not None and pd.notna(value):
present.append((label, float(value)))
if not present:
return "no material factor loadings were estimated"
top_loadings = sorted(present, key=lambda item: abs(item[1]), reverse=True)[:limit]
return ", ".join(f"{name} {value:.2f}" for name, value in top_loadings)
def _print_attribution_section(summary_df: pd.DataFrame, title: str, semantics: dict[str, str]) -> None:
display_columns = [
"strategy",
"market",
"model",
"factor_source",
"proxy_only",
"alpha_ann",
"r_squared",
"residual_vol_ann",
"beta_mkt",
"beta_smb",
"beta_hml",
"beta_rmw",
"beta_cma",
"beta_mom",
"beta_lowvol",
"beta_recovery",
]
table = summary_df.reindex(columns=display_columns).copy()
table = table.rename(columns=_section_beta_header_map(semantics))
numeric_columns = [
column
for column in table.columns
if column not in {"strategy", "market", "model", "factor_source", "proxy_only"}
]
table.loc[:, numeric_columns] = table.loc[:, numeric_columns].round(4)
print(f"\n{title}")
print(table.to_string(index=False, na_rep=""))
def print_attribution_summary(summary_df: pd.DataFrame) -> None:
if summary_df.empty:
print("Factor attribution: no usable regressions were produced.")
return
print("\nFactor attribution")
sections: dict[tuple[bool, tuple[tuple[str, str], ...]], list[int]] = {}
for index, row in summary_df.iterrows():
sections.setdefault(_section_key(row), []).append(index)
for (is_proxy, semantics_items), row_indexes in sections.items():
section_rows = summary_df.loc[row_indexes]
title = "Proxy factor attribution" if is_proxy else "Standard factor attribution"
_print_attribution_section(
section_rows,
title=title,
semantics=dict(semantics_items),
)
print("\nInterpretation")
for _, row in summary_df.iterrows():
print(
f"- {row['strategy']} / {row['model']}: estimated annualized alpha is "
f"{_describe_alpha(float(row['alpha_ann']))} ({row['alpha_ann']:.2%}); "
f"strongest loadings are {_top_loading_descriptions(row)}; "
f"model fit looks {_describe_fit(float(row['r_squared']))} (R^2={row['r_squared']:.2f})."
)

factor_backtest.py (Normal file, 213 lines added)

@@ -0,0 +1,213 @@
"""
Backtest best factor combinations with yearly return breakdown.
US best: momentum + recovery + low_downside_beta + short_term_reversal
CN best: momentum + anti_lottery + vol_reversal
"""
from __future__ import annotations
import argparse
import numpy as np
import pandas as pd
import data_manager
import metrics
from universe import UNIVERSES
from factor_research import (
factor_momentum_12_1,
factor_recovery,
factor_short_term_reversal,
factor_downside_beta_proxy,
factor_lottery_demand,
factor_turnover_reversal,
factor_52w_high_distance,
)
def build_strategy_signals(
prices: pd.DataFrame,
factor_funcs: list,
weights: list[float],
top_n: int = 10,
rebal_freq: int = 21,
) -> pd.DataFrame:
"""Build equal-weight top-N strategy from ranked factor combination."""
signals_list = [f(prices) for f in factor_funcs]
ranked = [s.rank(axis=1, pct=True, na_option="keep") for s in signals_list]
composite = sum(w * r for w, r in zip(weights, ranked))
# Warmup: need at least 252 days
warmup = 252
rank = composite.rank(axis=1, ascending=False, na_option="bottom")
n_valid = composite.notna().sum(axis=1)
enough = n_valid >= top_n
top_mask = (rank <= top_n) & enough.values.reshape(-1, 1)
raw = top_mask.astype(float)
row_sums = raw.sum(axis=1).replace(0, np.nan)
signals = raw.div(row_sums, axis=0).fillna(0.0)
# Rebalance every rebal_freq trading days (default 21, i.e. roughly monthly)
rebal_mask = pd.Series(False, index=prices.index)
rebal_indices = list(range(warmup, len(prices), rebal_freq))
rebal_mask.iloc[rebal_indices] = True
signals[~rebal_mask] = np.nan
signals = signals.ffill().fillna(0.0)
signals.iloc[:warmup] = 0.0
return signals.shift(1).fillna(0.0)
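# Note: the trailing shift(1) applies weights formed from day t's signal to day t+1
# returns, so the backtest below never trades on same-day information.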
def backtest_equity(signals: pd.DataFrame, prices: pd.DataFrame, cost: float = 0.001) -> pd.Series:
"""Simple vectorized backtest returning equity curve."""
returns = prices.pct_change().fillna(0.0)
port_ret = (signals * returns).sum(axis=1)
# Transaction costs via turnover
turnover = signals.diff().abs().sum(axis=1)
port_ret -= turnover * cost
equity = (1 + port_ret).cumprod() * 100000
return equity
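# Cost model: turnover sums |weight changes| across all names, so a full rotation of a
# 10-stock equal-weight book is ~2.0 turnover, i.e. ~0.2% at the default cost of 0.001
# (10 bps) per unit traded.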
def yearly_returns(equity: pd.Series) -> pd.DataFrame:
"""Compute calendar year returns from equity curve."""
daily_ret = equity.pct_change().fillna(0)
years = daily_ret.index.year
rows = []
for year in sorted(years.unique()):
mask = years == year
yr_ret = (1 + daily_ret[mask]).prod() - 1
# Also compute max drawdown for the year
eq_yr = equity[mask]
running_max = eq_yr.cummax()
dd = (eq_yr - running_max) / running_max
rows.append({
"year": year,
"return": yr_ret,
"max_dd": dd.min(),
"start_val": float(eq_yr.iloc[0]),
"end_val": float(eq_yr.iloc[-1]),
})
return pd.DataFrame(rows).set_index("year")
def run(market: str, years_list: list[int]):
config = UNIVERSES[market]
benchmark = config["benchmark"]
print(f"Loading {market.upper()} price data...")
prices = data_manager.load(market)
bench_prices = prices[benchmark] if benchmark in prices.columns else None
stocks = prices.drop(columns=[benchmark], errors="ignore")
if market == "us":
label = "Mom+Recovery+LowDBeta+STR"
factor_funcs = [factor_momentum_12_1, factor_recovery, factor_downside_beta_proxy, factor_short_term_reversal]
weights = [0.25, 0.25, 0.25, 0.25]
baseline_label = "Recovery+Mom (baseline)"
baseline_funcs = [factor_momentum_12_1, factor_recovery]
baseline_weights = [0.5, 0.5]
else:
label = "Mom+Near52wHigh+VolReversal"
factor_funcs = [factor_momentum_12_1, factor_52w_high_distance, factor_turnover_reversal]
weights = [0.40, 0.30, 0.30]
baseline_label = "Mom+Recovery (baseline)"
baseline_funcs = [factor_momentum_12_1, factor_recovery]
baseline_weights = [0.5, 0.5]
for top_n in [10]:
print(f"\n{'='*90}")
print(f" {market.upper()} — Top {top_n}{label}")
print(f"{'='*90}")
# Best combo
sig = build_strategy_signals(stocks, factor_funcs, weights, top_n=top_n)
eq = backtest_equity(sig, stocks)
# Baseline
sig_base = build_strategy_signals(stocks, baseline_funcs, baseline_weights, top_n=top_n)
eq_base = backtest_equity(sig_base, stocks)
# Benchmark
if bench_prices is not None:
bp = bench_prices.dropna()
eq_bench = bp / bp.iloc[0] * 100000
for n_years in years_list:
cutoff = stocks.index[-1] - pd.DateOffset(years=n_years)
eq_slice = eq[eq.index >= cutoff]
eq_base_slice = eq_base[eq_base.index >= cutoff]
if len(eq_slice) < 50:
continue
# Normalize to starting capital
eq_norm = eq_slice / eq_slice.iloc[0] * 100000
eq_base_norm = eq_base_slice / eq_base_slice.iloc[0] * 100000
yr = yearly_returns(eq_norm)
yr_base = yearly_returns(eq_base_norm)
if bench_prices is not None:
eq_bench_slice = eq_bench[eq_bench.index >= cutoff]
eq_bench_norm = eq_bench_slice / eq_bench_slice.iloc[0] * 100000
yr_bench = yearly_returns(eq_bench_norm)
print(f"\n--- Last {n_years} Years (from {eq_slice.index[0].date()}) ---\n")
# Combined table
print(f" {'Year':<6} | {label:>30} | {baseline_label:>25} | {'Benchmark':>12} | {'Alpha vs Bench':>14}")
print(f" {'-'*6}-+-{'-'*30}-+-{'-'*25}-+-{'-'*12}-+-{'-'*14}")
all_years = sorted(yr.index.tolist())
total_new = 1.0
total_base = 1.0
total_bench = 1.0
for y in all_years:
r_new = yr.loc[y, "return"] if y in yr.index else 0
dd_new = yr.loc[y, "max_dd"] if y in yr.index else 0
r_base = yr_base.loc[y, "return"] if y in yr_base.index else 0
r_bench = yr_bench.loc[y, "return"] if bench_prices is not None and y in yr_bench.index else 0
alpha = r_new - r_bench
total_new *= (1 + r_new)
total_base *= (1 + r_base)
total_bench *= (1 + r_bench)
print(f" {y:<6} | {r_new:>+14.2%} (dd {dd_new:>+7.2%}) | {r_base:>+25.2%} | {r_bench:>+12.2%} | {alpha:>+14.2%}")
total_r_new = total_new - 1
total_r_base = total_base - 1
total_r_bench = total_bench - 1
cagr_new = (total_new ** (1 / n_years)) - 1
cagr_base = (total_base ** (1 / n_years)) - 1
cagr_bench = (total_bench ** (1 / n_years)) - 1
print(f" {'-'*6}-+-{'-'*30}-+-{'-'*25}-+-{'-'*12}-+-{'-'*14}")
print(f" {'Total':<6} | {total_r_new:>+14.2%}{' '*16} | {total_r_base:>+25.2%} | {total_r_bench:>+12.2%} |")
print(f" {'CAGR':<6} | {cagr_new:>+14.2%}{' '*16} | {cagr_base:>+25.2%} | {cagr_bench:>+12.2%} |")
# Full period metrics
print(f"\n Full metrics ({label}):")
daily_ret = eq_norm.pct_change().dropna()
sharpe = daily_ret.mean() / daily_ret.std() * np.sqrt(252) if daily_ret.std() > 0 else 0
running_max = eq_norm.cummax()
max_dd = ((eq_norm - running_max) / running_max).min()
print(f" Sharpe: {sharpe:.2f} | Max Drawdown: {max_dd:.2%} | Win Rate: {(daily_ret > 0).mean():.2%}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--market", default="us", choices=["us", "cn"])
args = parser.parse_args()
run(args.market, years_list=[3, 5, 10])
if __name__ == "__main__":
main()

factor_deep_analysis.py (Normal file, 324 lines added)

@@ -0,0 +1,324 @@
"""
Deep factor analysis — orthogonality, proper correlations, residual alpha.
For the top factor candidates identified in factor_research.py, this script:
1. Computes proper daily cross-sectional rank correlations between factors
2. Tests residual IC after neutralizing known factors (momentum, recovery)
3. Runs sub-period breakdown (2-year windows)
4. Tests factor combinations
"""
from __future__ import annotations
import argparse
import warnings
import numpy as np
import pandas as pd
import data_manager
from universe import UNIVERSES
from factor_research import (
factor_momentum_12_1,
factor_recovery,
factor_inverse_vol,
factor_short_term_reversal,
factor_idio_vol_change,
factor_max_drawdown_recovery,
factor_mean_reversion_residual,
factor_skewness,
factor_high_low_range as factor_range_compression,
factor_52w_high_distance as factor_near_52w_high,
factor_downside_beta_proxy as factor_low_downside_beta,
factor_lottery_demand,
factor_turnover_reversal,
factor_gap_momentum as factor_smooth_momentum,
factor_up_down_vol_ratio,
factor_trend_strength,
factor_consecutive_up_days,
factor_volume_price_divergence,
factor_recovery_acceleration,
factor_relative_volume_momentum,
factor_price_level,
factor_liquidity_premium,
compute_ic,
)
warnings.filterwarnings("ignore", category=FutureWarning)
def daily_cross_sectional_correlation(
sig_a: pd.DataFrame, sig_b: pd.DataFrame
) -> pd.Series:
"""Daily cross-sectional Spearman correlation between two factor signals."""
common_idx = sig_a.index.intersection(sig_b.index)
common_cols = sig_a.columns.intersection(sig_b.columns)
a = sig_a.loc[common_idx, common_cols]
b = sig_b.loc[common_idx, common_cols]
corrs = {}
for date in common_idx:
va = a.loc[date].dropna()
vb = b.loc[date].dropna()
common = va.index.intersection(vb.index)
if len(common) < 30:
continue
c = va[common].corr(vb[common], method="spearman")
if np.isfinite(c):
corrs[date] = c
return pd.Series(corrs)
def proper_factor_correlation_matrix(factors: dict[str, pd.DataFrame]) -> pd.DataFrame:
"""Compute average daily cross-sectional Spearman correlations."""
names = list(factors.keys())
n = len(names)
matrix = pd.DataFrame(1.0, index=names, columns=names)
for i in range(n):
for j in range(i + 1, n):
corr_series = daily_cross_sectional_correlation(factors[names[i]], factors[names[j]])
avg_corr = corr_series.mean() if len(corr_series) > 0 else np.nan
matrix.loc[names[i], names[j]] = avg_corr
matrix.loc[names[j], names[i]] = avg_corr
return matrix
def residual_signal(
target: pd.DataFrame,
controls: list[pd.DataFrame],
) -> pd.DataFrame:
"""Cross-sectionally orthogonalize target signal against control signals.
For each day, regress target ranks on control ranks, return residual."""
ranked_target = target.rank(axis=1, pct=True, na_option="keep")
ranked_controls = [c.rank(axis=1, pct=True, na_option="keep") for c in controls]
residuals = pd.DataFrame(index=target.index, columns=target.columns, dtype=float)
for date in target.index:
y = ranked_target.loc[date].dropna()
xs = [rc.loc[date].reindex(y.index) for rc in ranked_controls if date in rc.index]
if not xs:
residuals.loc[date] = y
continue
x_df = pd.concat(xs, axis=1).dropna()
common = y.index.intersection(x_df.index)
if len(common) < 30:
continue
y_c = y[common].values
x_c = x_df.loc[common].values
x_c = np.column_stack([np.ones(len(common)), x_c])
try:
coef, _, _, _ = np.linalg.lstsq(x_c, y_c, rcond=None)
resid = y_c - x_c @ coef
residuals.loc[date, common] = resid
except np.linalg.LinAlgError:
residuals.loc[date, common] = y[common].values
return residuals
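# Interpretation: each day the target's ranks are regressed on the control ranks (with an
# intercept) and only the residual is kept, i.e. the slice of the signal that is
# cross-sectionally orthogonal to the controls (momentum and recovery in run_analysis).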
def subperiod_ic(signal: pd.DataFrame, prices: pd.DataFrame, horizon: int = 5, window_years: int = 2):
"""Compute IC for each rolling sub-period."""
fwd_ret = prices.pct_change(horizon).shift(-horizon)
ic_series = compute_ic(signal, fwd_ret)
if len(ic_series) == 0:
return pd.DataFrame()
window = 252 * window_years
results = []
start = ic_series.index[0]
while start < ic_series.index[-1]:
end = start + pd.DateOffset(years=window_years)
subset = ic_series[(ic_series.index >= start) & (ic_series.index < end)]
if len(subset) > 100:
results.append({
"period": f"{start.year}-{end.year}",
"ic_mean": subset.mean(),
"ic_std": subset.std(),
"icir": subset.mean() / subset.std() if subset.std() > 0 else 0,
"pct_positive": (subset > 0).mean(),
"n_days": len(subset),
})
start = end
return pd.DataFrame(results)
def test_factor_combination(
factors: dict[str, pd.DataFrame],
factor_names: list[str],
weights: list[float],
prices: pd.DataFrame,
label: str,
):
"""Test a weighted combination of factors."""
ranked = [factors[n].rank(axis=1, pct=True, na_option="keep") for n in factor_names]
combo = sum(w * r for w, r in zip(weights, ranked))
fwd_5d = prices.pct_change(5).shift(-5)
ic_series = compute_ic(combo, fwd_5d)
if len(ic_series) == 0:
return None
return {
"combo": label,
"ic_5d": ic_series.mean(),
"icir_5d": ic_series.mean() / ic_series.std() if ic_series.std() > 0 else 0,
"ic_stab": (ic_series.rolling(252).mean().dropna() > 0).mean() if len(ic_series) > 252 else np.nan,
}
def run_analysis(market: str):
config = UNIVERSES[market]
benchmark = config["benchmark"]
print(f"Loading {market.upper()} price data...")
prices = data_manager.load(market)
stocks = prices.drop(columns=[benchmark], errors="ignore")
print(f"Universe: {stocks.shape[1]} stocks, {stocks.shape[0]} days")
# Build factors
print("Computing factors...")
factors = {}
factors["momentum_12_1"] = factor_momentum_12_1(stocks)
factors["recovery"] = factor_recovery(stocks)
factors["inverse_vol"] = factor_inverse_vol(stocks)
factors["short_term_reversal"] = factor_short_term_reversal(stocks)
factors["drawdown_recovery"] = factor_max_drawdown_recovery(stocks)
factors["mean_rev_zscore"] = factor_mean_reversion_residual(stocks)
factors["neg_skewness"] = factor_skewness(stocks)
factors["near_52w_high"] = factor_near_52w_high(stocks)
factors["low_downside_beta"] = factor_low_downside_beta(stocks)
factors["smooth_momentum"] = factor_smooth_momentum(stocks)
factors["recovery_accel"] = factor_recovery_acceleration(stocks)
factors["range_compression"] = factor_range_compression(stocks)
if market == "cn":
factors["anti_lottery"] = factor_lottery_demand(stocks)
factors["vol_reversal"] = factor_turnover_reversal(stocks)
factors["low_price"] = factor_price_level(stocks)
factors["illiquidity"] = factor_liquidity_premium(stocks)
# ---- 1. Proper Cross-Sectional Correlation Matrix ----
print("\n" + "=" * 90)
print(f" 1. CROSS-SECTIONAL FACTOR CORRELATIONS — {market.upper()}")
print("=" * 90)
print("(Average daily Spearman correlation between factor ranks)\n")
corr = proper_factor_correlation_matrix(factors)
print(corr.round(3).to_string())
# ---- 2. Residual IC after neutralizing known factors ----
print("\n" + "=" * 90)
print(f" 2. RESIDUAL IC AFTER NEUTRALIZING KNOWN FACTORS — {market.upper()}")
print("=" * 90)
print("(IC of factor after cross-sectionally regressing out momentum + recovery)\n")
known = [factors["momentum_12_1"], factors["recovery"]]
fwd_5d = stocks.pct_change(5).shift(-5)
new_candidates = [k for k in factors if k not in ("momentum_12_1", "recovery", "inverse_vol")]
rows = []
for name in new_candidates:
resid = residual_signal(factors[name], known)
ic_series = compute_ic(resid, fwd_5d)
if len(ic_series) > 0:
rows.append({
"factor": name,
"raw_ic_5d": compute_ic(factors[name], fwd_5d).mean(),
"residual_ic_5d": ic_series.mean(),
"residual_icir_5d": ic_series.mean() / ic_series.std() if ic_series.std() > 0 else 0,
"pct_pos": (ic_series > 0).mean(),
})
resid_df = pd.DataFrame(rows).set_index("factor").sort_values("residual_icir_5d", ascending=False)
print(resid_df.round(4).to_string())
# ---- 3. Sub-Period Stability ----
print("\n" + "=" * 90)
print(f" 3. SUB-PERIOD IC STABILITY (2-year windows, 5-day horizon) — {market.upper()}")
print("=" * 90)
# Test top factors
if market == "us":
top_factors = ["low_downside_beta", "drawdown_recovery", "mean_rev_zscore", "short_term_reversal", "momentum_12_1"]
else:
top_factors = ["momentum_12_1", "anti_lottery", "inverse_vol", "vol_reversal", "near_52w_high"]
for name in top_factors:
if name not in factors:
continue
print(f"\n {name}:")
sp = subperiod_ic(factors[name], stocks, horizon=5)
if not sp.empty:
print(sp.to_string(index=False))
else:
print(" (insufficient data)")
# ---- 4. Factor Combinations ----
print("\n" + "=" * 90)
print(f" 4. FACTOR COMBINATIONS — {market.upper()}")
print("=" * 90)
print("(Testing multi-factor composites)\n")
combos = []
if market == "us":
tests = [
(["momentum_12_1", "low_downside_beta"], [0.5, 0.5], "mom+low_dbeta"),
(["momentum_12_1", "drawdown_recovery"], [0.5, 0.5], "mom+dd_recovery"),
(["momentum_12_1", "mean_rev_zscore"], [0.5, 0.5], "mom+mean_rev"),
(["momentum_12_1", "short_term_reversal"], [0.5, 0.5], "mom+STR"),
(["recovery", "low_downside_beta"], [0.5, 0.5], "recovery+low_dbeta"),
(["momentum_12_1", "recovery", "low_downside_beta"], [0.33, 0.33, 0.34], "mom+rec+low_dbeta"),
(["momentum_12_1", "recovery", "drawdown_recovery"], [0.33, 0.33, 0.34], "mom+rec+dd_rec"),
(["momentum_12_1", "recovery", "short_term_reversal"], [0.33, 0.33, 0.34], "mom+rec+STR"),
(["momentum_12_1", "recovery", "mean_rev_zscore"], [0.33, 0.33, 0.34], "mom+rec+meanrev"),
(["momentum_12_1", "recovery", "low_downside_beta", "short_term_reversal"],
[0.25, 0.25, 0.25, 0.25], "mom+rec+dbeta+STR"),
(["momentum_12_1", "recovery", "drawdown_recovery", "mean_rev_zscore"],
[0.25, 0.25, 0.25, 0.25], "mom+rec+ddrec+meanrev"),
]
else: # cn
tests = [
(["momentum_12_1", "anti_lottery"], [0.5, 0.5], "mom+anti_lottery"),
(["momentum_12_1", "inverse_vol"], [0.5, 0.5], "mom+inv_vol"),
(["momentum_12_1", "vol_reversal"], [0.5, 0.5], "mom+vol_reversal"),
(["momentum_12_1", "near_52w_high"], [0.5, 0.5], "mom+near52wh"),
(["momentum_12_1", "anti_lottery", "inverse_vol"], [0.33, 0.33, 0.34], "mom+alot+invvol"),
(["momentum_12_1", "anti_lottery", "vol_reversal"], [0.33, 0.33, 0.34], "mom+alot+volrev"),
(["momentum_12_1", "anti_lottery", "near_52w_high"], [0.33, 0.33, 0.34], "mom+alot+near52w"),
(["momentum_12_1", "recovery", "anti_lottery"], [0.33, 0.33, 0.34], "mom+rec+alot"),
(["momentum_12_1", "anti_lottery", "inverse_vol", "vol_reversal"],
[0.25, 0.25, 0.25, 0.25], "mom+alot+invvol+volrev"),
(["momentum_12_1", "anti_lottery", "near_52w_high", "vol_reversal"],
[0.25, 0.25, 0.25, 0.25], "mom+alot+52wh+volrev"),
]
# Also test the existing recovery+momentum baseline
baseline = test_factor_combination(factors, ["momentum_12_1", "recovery"], [0.5, 0.5], stocks, "BASELINE: mom+recovery")
if baseline:
combos.append(baseline)
for names, weights, label in tests:
if all(n in factors for n in names):
result = test_factor_combination(factors, names, weights, stocks, label)
if result:
combos.append(result)
combo_df = pd.DataFrame(combos).set_index("combo").sort_values("icir_5d", ascending=False)
print(combo_df.round(4).to_string())
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--market", default="us", choices=["us", "cn"])
args = parser.parse_args()
run_analysis(args.market)
if __name__ == "__main__":
main()

factor_final_check.py (Normal file, 150 lines added)

@@ -0,0 +1,150 @@
"""Final robustness check on champions from the discovery loop."""
from __future__ import annotations
import warnings
import numpy as np
import pandas as pd
import data_manager
from universe import UNIVERSES
from factor_loop import (
strat, bt, stats, combo, yearly,
f_rec_mom, f_rec_126, f_rec_63,
f_mom_12_1, f_mom_6_1, f_mom_intermediate,
f_above_ma200, f_golden_cross,
f_up_volume_proxy, f_gap_up_freq,
f_rec_mom_filtered, f_down_resilience,
f_up_capture, f_52w_high, f_str_10d,
f_earnings_drift, f_reversal_vol,
)
warnings.filterwarnings("ignore")
def f_quality_mom(p):
mom = f_mom_12_1(p)
consist_ret = p.pct_change()
consist = (consist_ret > 0).astype(float).rolling(252, min_periods=126).mean()
mom_r = mom.rank(axis=1, pct=True, na_option="keep")
con_r = consist.rank(axis=1, pct=True, na_option="keep")
up_r = f_up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
return 0.4 * mom_r + 0.3 * con_r + 0.3 * up_r
def f_mom_x_gap(p):
mom_r = f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
gap_r = f_gap_up_freq(p).rank(axis=1, pct=True, na_option="keep")
return mom_r * gap_r
def rolling_2yr(eq):
dr = eq.pct_change().dropna()
results = []
for end_i in range(504, len(dr), 63):
chunk = dr.iloc[end_i - 504:end_i]
tot = (1 + chunk).prod() - 1
ann = (1 + tot) ** (252 / len(chunk)) - 1
sh = chunk.mean() / chunk.std() * np.sqrt(252) if chunk.std() > 0 else 0
results.append({"end": chunk.index[-1].date(), "ann": ann, "sh": sh})
return pd.DataFrame(results)
def run_robustness(name, func, prices, label_prefix):
print(f"\n {name}:")
# Top-N sensitivity
print(f" Top-N: ", end="")
for n in [5, 10, 15, 20]:
w = strat(prices, func, top_n=n)
eq = bt(w, prices)
s = stats(eq)
print(f"N={n}: {s['cagr']:+.1%}/{s['sharpe']:.2f} ", end="")
print()
# Rebal sensitivity
print(f" Rebal: ", end="")
for r in [5, 10, 21, 42]:
w = strat(prices, func, top_n=10, rebal=r)
eq = bt(w, prices)
s = stats(eq)
print(f"{r}d: {s['cagr']:+.1%}/{s['sharpe']:.2f} ", end="")
print()
# Cost sensitivity
print(f" Cost: ", end="")
for c in [0, 0.001, 0.002, 0.005]:
w = strat(prices, func, top_n=10)
eq = bt(w, prices, cost=c)
s = stats(eq)
print(f"{c*1e4:.0f}bp: {s['cagr']:+.1%} ", end="")
print()
# Rolling 2-year
w = strat(prices, func, top_n=10)
eq = bt(w, prices)
roll = rolling_2yr(eq)
if not roll.empty:
pct_pos = (roll["ann"] > 0).mean()
print(f" 2yr rolling: mean={roll['ann'].mean():+.1%} min={roll['ann'].min():+.1%} "
f"max={roll['ann'].max():+.1%} %pos={pct_pos:.0%} mean_sharpe={roll['sh'].mean():.2f}")
def main():
# ============= US =============
prices_us = data_manager.load("us")
stocks_us = prices_us.drop(columns=["SPY"], errors="ignore")
print("=" * 95)
print(" US FINAL ROBUSTNESS — Champions vs Baseline")
print("=" * 95)
# The second champion nests a 50/50 rec_126d/up_volume blend as one leg of an outer 50/50 combo.
us_champs = [
("BASELINE: rec+mom", f_rec_mom),
("rec_mom_filt + rec_deep×upvol",
combo([(f_rec_mom_filtered, 0.5),
(combo([(f_rec_126, 0.5), (f_up_volume_proxy, 0.5)]), 0.5)])),
("above_ma200+mom_7m+rec_126d",
combo([(f_above_ma200, 0.33), (f_mom_intermediate, 0.33), (f_rec_126, 0.34)])),
("rec_mom_filtered+above_ma200",
combo([(f_rec_mom_filtered, 0.5), (f_above_ma200, 0.5)])),
("mom_7m+rec_126d",
combo([(f_mom_intermediate, 0.5), (f_rec_126, 0.5)])),
]
for name, func in us_champs:
run_robustness(name, func, stocks_us, "US")
# ============= CN =============
prices_cn = data_manager.load("cn")
stocks_cn = prices_cn.drop(columns=["000300.SS"], errors="ignore")
print(f"\n{'='*95}")
print(" CN FINAL ROBUSTNESS — Champions vs Baseline")
print("=" * 95)
cn_champs = [
("BASELINE: rec+mom", f_rec_mom),
("up_capture+quality_mom",
combo([(f_up_capture, 0.5), (f_quality_mom, 0.5)])),
("recovery_63d+mom×gap",
combo([(f_rec_63, 0.5), (f_mom_x_gap, 0.5)])),
("down_resilience+quality_mom",
combo([(f_down_resilience, 0.5), (f_quality_mom, 0.5)])),
("up_capture+mom×gap",
combo([(f_up_capture, 0.5), (f_mom_x_gap, 0.5)])),
]
for name, func in cn_champs:
run_robustness(name, func, stocks_cn, "CN")
if __name__ == "__main__":
main()

factor_loop.py (Normal file, 654 lines added)

@@ -0,0 +1,654 @@
"""
Iterative Factor Discovery Loop.
Round 1: Academic & practitioner hypotheses (30+ factors)
Round 2: Data-driven variations on Round 1 winners
Round 3: Interaction and conditional factors
Round 4: Parameter optimization on finalists
Round 5: Best combinations
Each factor is tested immediately as a top-10 equal-weight strategy
with monthly rebalancing and 10bps transaction costs.
"""
from __future__ import annotations
import argparse
import warnings
from typing import Callable
import numpy as np
import pandas as pd
import data_manager
from universe import UNIVERSES
warnings.filterwarnings("ignore")
FactorFunc = Callable[[pd.DataFrame], pd.DataFrame]
# ---------------------------------------------------------------------------
# Backtest infrastructure
# ---------------------------------------------------------------------------
def strat(
prices: pd.DataFrame,
signal_func: FactorFunc,
top_n: int = 10,
rebal: int = 21,
warmup: int = 252,
) -> pd.DataFrame:
sig = signal_func(prices)
rank = sig.rank(axis=1, ascending=False, na_option="bottom")
n_valid = sig.notna().sum(axis=1)
enough = n_valid >= top_n
mask = (rank <= top_n) & enough.values.reshape(-1, 1)
raw = mask.astype(float)
w = raw.div(raw.sum(axis=1).replace(0, np.nan), axis=0).fillna(0.0)
rmask = pd.Series(False, index=prices.index)
rmask.iloc[list(range(warmup, len(prices), rebal))] = True
w[~rmask] = np.nan
w = w.ffill().fillna(0.0)
w.iloc[:warmup] = 0.0
return w.shift(1).fillna(0.0)
def bt(weights: pd.DataFrame, prices: pd.DataFrame, cost: float = 0.001) -> pd.Series:
ret = prices.pct_change().fillna(0.0)
pr = (weights * ret).sum(axis=1)
pr -= weights.diff().abs().sum(axis=1) * cost
return (1 + pr).cumprod() * 100000
def stats(eq: pd.Series) -> dict:
dr = eq.pct_change().dropna()
if len(dr) < 200 or dr.std() == 0:
return {"cagr": np.nan, "sharpe": np.nan, "sortino": np.nan,
"maxdd": np.nan, "calmar": np.nan}
ny = len(dr) / 252
tot = eq.iloc[-1] / eq.iloc[0] - 1
cagr = (1 + tot) ** (1 / ny) - 1
sh = dr.mean() / dr.std() * np.sqrt(252)
sd = dr[dr < 0].std()
so = dr.mean() / sd * np.sqrt(252) if sd > 0 else 0
rm = eq.cummax()
dd = ((eq - rm) / rm).min()
cal = cagr / abs(dd) if dd != 0 else 0
return {"cagr": cagr, "sharpe": sh, "sortino": so, "maxdd": dd, "calmar": cal}
def yearly(eq: pd.Series) -> dict[int, float]:
dr = eq.pct_change().fillna(0)
return {y: float((1 + dr[dr.index.year == y]).prod() - 1) for y in sorted(dr.index.year.unique())}
def test_factor(name: str, func: FactorFunc, prices: pd.DataFrame,
top_n: int = 10) -> dict:
w = strat(prices, func, top_n=top_n)
eq = bt(w, prices)
s = stats(eq)
s["name"] = name
s["equity"] = eq
return s
def combo(fws: list[tuple[FactorFunc, float]]) -> FactorFunc:
def _c(p):
return sum(w * f(p).rank(axis=1, pct=True, na_option="keep") for f, w in fws)
return _c
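# Example (the factor functions are defined below): a 50/50 blend of percentile ranks.
#   blended = combo([(f_rec_63, 0.5), (f_mom_12_1, 0.5)])   # -> FactorFunc
#   scores = blended(prices)                                 # DataFrame, dates x stocks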
def print_results(results: list[dict], title: str):
df = pd.DataFrame([{k: v for k, v in r.items() if k != "equity"} for r in results])
df = df.set_index("name").sort_values("cagr", ascending=False)
print(f"\n{'='*95}")
print(f" {title}")
print(f"{'='*95}")
print(f" {'Factor':<45} {'CAGR':>7} {'Sharpe':>7} {'Sortino':>8} {'MaxDD':>7} {'Calmar':>7}")
print(f" {'-'*85}")
for name, row in df.iterrows():
flag = " <<<" if "BASELINE" in str(name) else ""
c = row['cagr']
if np.isnan(c):
continue
print(f" {str(name):<45} {c:>+6.1%} {row['sharpe']:>7.2f} {row['sortino']:>8.2f} "
f"{row['maxdd']:>+6.1%} {row['calmar']:>7.2f}{flag}")
return df
# =====================================================================
# ROUND 1 — Academic & Practitioner Hypotheses
# =====================================================================
# --- Momentum family ---
def f_mom_12_1(p): return p.shift(21).pct_change(231)
def f_mom_6_1(p): return p.shift(21).pct_change(105)
def f_mom_3_1(p): return p.shift(21).pct_change(42)
def f_mom_1_0(p): return p.pct_change(21) # 1-month (reversal in US)
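# Lookback convention: mom_12_1 skips the most recent ~1 month (21 trading days) and
# measures the prior ~11 months, i.e. p.shift(21).pct_change(231) equals p[t-21] / p[t-252] - 1.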
# --- Recovery family ---
def f_rec_63(p): return p / p.rolling(63, min_periods=63).min() - 1
def f_rec_126(p): return p / p.rolling(126, min_periods=126).min() - 1
def f_rec_21(p): return p / p.rolling(21, min_periods=21).min() - 1
# Novy-Marx 2012: intermediate momentum (7-12 month)
def f_mom_intermediate(p): return p.shift(21).pct_change(147) # ~7 month
# Asness et al: quality/profitability proxy via return consistency
def f_consistent_returns(p):
ret = p.pct_change()
return (ret > 0).astype(float).rolling(252, min_periods=126).mean()
# Da, Liu, Schaumburg 2014: information discreteness
# Stocks with many small positive days > stocks with few large positive days
def f_info_discrete(p):
ret = p.pct_change()
n_pos = (ret > 0).astype(float).rolling(60, min_periods=40).sum()
sum_pos = ret.where(ret > 0, 0).rolling(60, min_periods=40).sum()
avg_pos = sum_pos / n_pos.replace(0, np.nan)
# High count of positive days + low average positive = smooth accumulation
return n_pos / avg_pos.replace(0, np.nan)
# Accumulation proxy (worked in Round 1)
def f_up_volume_proxy(p):
ret = p.pct_change()
return ret.where(ret > 0, 0).rolling(20, min_periods=15).sum()
# George & Hwang 2004: 52-week high ratio
def f_52w_high(p):
return p / p.rolling(252, min_periods=126).max()
# Frequency of large up-moves (worked in Round 1)
def f_gap_up_freq(p):
ret = p.pct_change()
return (ret > 0.01).astype(float).rolling(60, min_periods=40).mean()
# Bali, Cakici, Whitelaw 2011: MAX effect (lottery demand)
def f_anti_max(p):
ret = p.pct_change()
return -ret.rolling(20, min_periods=15).max()
# Ang et al 2006: idiosyncratic volatility (negative)
def f_neg_ivol(p):
ret = p.pct_change()
return -ret.rolling(20, min_periods=15).std()
# Blitz & van Vliet 2007: low volatility anomaly
def f_low_vol_60(p):
ret = p.pct_change()
return -ret.rolling(60, min_periods=40).std()
# Hurst exponent proxy — autocorrelation of returns
# Stocks with positive autocorrelation = trending
def f_autocorrelation(p):
ret = p.pct_change()
def _ac(x):
x = x.dropna()
if len(x) < 20:
return np.nan
return np.corrcoef(x[:-1], x[1:])[0, 1]
return ret.rolling(60, min_periods=40).apply(_ac, raw=False)
# Short-term reversal (Jegadeesh 1990)
def f_str_5d(p): return -p.pct_change(5)
def f_str_10d(p): return -p.pct_change(10)
# Earnings drift proxy (worked in Round 1)
def f_earnings_drift(p):
ret_5d = p.pct_change(5)
vol = p.pct_change().rolling(60, min_periods=40).std() * np.sqrt(5)
z = ret_5d / vol.replace(0, np.nan)
return z.rolling(60, min_periods=20).mean()
# Risk-adjusted momentum (Sharpe-momentum)
def f_sharpe_mom(p):
ret = p.pct_change()
mu = ret.rolling(252, min_periods=126).mean()
sigma = ret.rolling(252, min_periods=126).std()
return mu / sigma.replace(0, np.nan)
# Trend strength: slope of log-price regression
def f_trend_slope(p):
log_p = np.log(p.replace(0, np.nan))
def _slope(x):
x = x.dropna().values
if len(x) < 30:
return np.nan
t = np.arange(len(x), dtype=float)
t -= t.mean()
return (t * (x - x.mean())).sum() / (t * t).sum()
return log_p.rolling(60, min_periods=30).apply(_slope, raw=False)
# Acceleration: recent momentum vs. longer-term momentum
def f_mom_accel(p):
m3 = p.shift(5).pct_change(58) # ~3mo
m12 = p.shift(21).pct_change(231) # ~12mo
return m3 - m12
# Mean reversion z-score
def f_mean_rev_z(p):
ma20 = p.rolling(20, min_periods=20).mean()
vol = p.pct_change().rolling(60, min_periods=40).std() * p
return -(p - ma20) / vol.replace(0, np.nan)
# Price relative to moving averages
def f_above_ma200(p):
return p / p.rolling(200, min_periods=200).mean() - 1
def f_above_ma50(p):
return p / p.rolling(50, min_periods=50).mean() - 1
# Dual MA signal: 50-day MA / 200-day MA
def f_golden_cross(p):
ma50 = p.rolling(50, min_periods=50).mean()
ma200 = p.rolling(200, min_periods=200).mean()
return ma50 / ma200 - 1
# Drawdown recovery rate
def f_dd_recovery_rate(p):
rm = p.rolling(252, min_periods=126).max()
dd = p / rm - 1 # negative when in drawdown
return dd - dd.shift(20) # positive = recovering from drawdown
# A-share specific: short-term reversal x volatility
def f_reversal_vol(p):
return -p.pct_change(5) * p.pct_change().rolling(20, min_periods=15).std()
# Recovery + momentum (baseline)
def f_rec_mom(p):
r1 = f_rec_63(p).rank(axis=1, pct=True, na_option="keep")
r2 = f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
return 0.5 * r1 + 0.5 * r2
# =====================================================================
# ROUND 2 — Second-order ideas from Round 1 analysis
# =====================================================================
# The key insight: "quality of returns" matters more than "magnitude of returns"
# Factors that measure HOW a stock goes up, not just that it went up.
# Smoothness-weighted momentum
def f_smooth_momentum(p):
"""Momentum penalized by path volatility. Stocks that go up smoothly."""
mom = p.shift(21).pct_change(231)
ret = p.pct_change()
vol = ret.rolling(252, min_periods=126).std()
return mom / (vol.replace(0, np.nan) ** 0.5) # sqrt to dampen
# Positive return ratio (like Sharpe numerator)
def f_pos_ratio_60(p):
"""Fraction of positive return days in 60 days. Quality signal."""
ret = p.pct_change()
return (ret > 0).astype(float).rolling(60, min_periods=40).mean()
# Cumulative positive returns vs cumulative negative returns
def f_up_down_asymmetry(p):
"""Ratio of cumulative up-move to cumulative down-move."""
ret = p.pct_change()
up = ret.where(ret > 0, 0).rolling(60, min_periods=40).sum()
down = (-ret.where(ret < 0, 0)).rolling(60, min_periods=40).sum()
return up / down.replace(0, np.nan)
# Streak momentum: max consecutive up days in last 40 days
def f_max_streak(p):
ret = p.pct_change()
pos = (ret > 0).astype(float)
def _max_streak(x):
x = x.dropna().values
if len(x) == 0:
return 0
best = cur = 0
for v in x:
cur = cur + 1 if v > 0.5 else 0
best = max(best, cur)
return best
return pos.rolling(40, min_periods=20).apply(_max_streak, raw=False)
# Overnight proxy: gap between yesterday's close and today's pattern
# Since we only have close prices, use close-to-close 1d return decomposition
def f_up_capture(p):
"""Up-market capture ratio over 60 days."""
ret = p.pct_change()
mkt = ret.mean(axis=1)
up_mkt = mkt > 0
arr = ret.values.copy()
arr[~up_mkt.values, :] = np.nan
stock_up = pd.DataFrame(arr, index=ret.index, columns=ret.columns)
mkt_up_vals = mkt.where(up_mkt, np.nan)
stock_avg = stock_up.rolling(60, min_periods=20).mean()
mkt_avg = mkt_up_vals.rolling(60, min_periods=20).mean()
return stock_avg.div(mkt_avg, axis=0)
# Down-market resilience
def f_down_resilience(p):
"""How much LESS a stock falls on down-market days."""
ret = p.pct_change()
mkt = ret.mean(axis=1)
down_mkt = mkt < 0
arr = ret.values.copy()
arr[~down_mkt.values, :] = np.nan
down_ret = pd.DataFrame(arr, index=ret.index, columns=ret.columns)
return -down_ret.rolling(120, min_periods=30).mean()
# Recovery from rolling max with momentum filter
def f_rec_mom_filtered(p):
"""Recovery factor only for stocks with positive 6-month momentum.
Filters out dead-cat bounces."""
rec = p / p.rolling(126, min_periods=126).min() - 1
mom = p.shift(21).pct_change(105)
return rec.where(mom > 0, np.nan)
# Information discreteness v2: using the sign ratio
def f_sign_ratio(p):
"""Ratio of (count of positive days)^2 * avg_size to total return.
High ratio = many small ups = institutional flow."""
ret = p.pct_change()
n_total = 60
n_pos = (ret > 0).astype(float).rolling(n_total, min_periods=40).sum()
total_ret = ret.rolling(n_total, min_periods=40).sum()
sign_vol = n_pos / n_total
# Stocks where most of the return comes from many small positive days
return sign_vol * total_ret.clip(lower=0)
# =====================================================================
# ROUND 3 — Interaction & conditional factors
# =====================================================================
def f_mom_x_recovery(p):
"""Momentum × Recovery interaction. The product, not the sum."""
mom_r = f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
rec_r = f_rec_63(p).rank(axis=1, pct=True, na_option="keep")
return mom_r * rec_r
def f_mom_x_upvol(p):
"""Momentum × Up-volume-proxy interaction."""
mom_r = f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
up_r = f_up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
return mom_r * up_r
def f_rec_deep_x_upvol(p):
"""Deep recovery × Up-volume interaction."""
rec_r = f_rec_126(p).rank(axis=1, pct=True, na_option="keep")
up_r = f_up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
return rec_r * up_r
def f_trend_x_mom(p):
"""Trend strength × Momentum. Trending + momentum = double signal."""
tr_r = f_trend_slope(p).rank(axis=1, pct=True, na_option="keep")
mom_r = f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
return tr_r * mom_r
def f_quality_mom(p):
"""Momentum filtered by consistency. Only persistent winners."""
mom = f_mom_12_1(p)
consist = f_consistent_returns(p)
mom_r = mom.rank(axis=1, pct=True, na_option="keep")
con_r = consist.rank(axis=1, pct=True, na_option="keep")
return 0.4 * mom_r + 0.3 * con_r + 0.3 * f_up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
def f_rec_deep_x_gap(p):
"""Deep recovery × gap-up frequency."""
rec_r = f_rec_126(p).rank(axis=1, pct=True, na_option="keep")
gap_r = f_gap_up_freq(p).rank(axis=1, pct=True, na_option="keep")
return rec_r * gap_r
def f_mom_x_gap(p):
"""Momentum × gap-up frequency."""
mom_r = f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
gap_r = f_gap_up_freq(p).rank(axis=1, pct=True, na_option="keep")
return mom_r * gap_r
# Regime-conditional: momentum with volatility filter
def f_mom_low_vol_regime(p):
"""Momentum only when market vol is below median.
Momentum crashes in high-vol regimes."""
mom = f_mom_12_1(p)
mkt_vol = p.pct_change().mean(axis=1).rolling(60).std()
vol_median = mkt_vol.rolling(252, min_periods=126).median()
low_vol = mkt_vol <= vol_median
mask = pd.DataFrame(
np.tile(low_vol.values[:, None], (1, mom.shape[1])),
index=mom.index, columns=mom.columns,
)
return mom.where(mask, 0)
# =====================================================================
# Main loop
# =====================================================================
def run_round(
name: str,
factors: list[tuple[str, FactorFunc]],
prices: pd.DataFrame,
top_n: int = 10,
) -> list[dict]:
results = []
for fname, func in factors:
r = test_factor(fname, func, prices, top_n=top_n)
results.append(r)
print_results(results, name)
return results
def run_market(market: str):
config = UNIVERSES[market]
benchmark = config["benchmark"]
prices = data_manager.load(market)
bench = prices[benchmark].dropna() if benchmark in prices.columns else None
stocks = prices.drop(columns=[benchmark], errors="ignore")
print(f"\n{'#'*95}")
print(f" FACTOR DISCOVERY LOOP — {market.upper()} MARKET")
print(f" {stocks.shape[1]} stocks, {stocks.shape[0]} days, "
f"{stocks.index[0].date()}{stocks.index[-1].date()}")
print(f"{'#'*95}")
# Benchmark
if bench is not None:
eq_bench = bench / bench.iloc[0] * 100000
bs = stats(eq_bench)
print(f"\n Benchmark: CAGR {bs['cagr']:+.1%}, Sharpe {bs['sharpe']:.2f}")
# ================================================================
# ROUND 1: Academic & practitioner factors
# ================================================================
r1_factors = [
("BASELINE:rec+mom", f_rec_mom),
# Momentum family
("mom_12_1", f_mom_12_1),
("mom_6_1", f_mom_6_1),
("mom_3_1", f_mom_3_1),
("mom_1_0", f_mom_1_0),
("mom_intermediate_7m", f_mom_intermediate),
("sharpe_momentum", f_sharpe_mom),
# Recovery family
("recovery_63d", f_rec_63),
("recovery_126d", f_rec_126),
("recovery_21d", f_rec_21),
# Trend
("trend_slope_60d", f_trend_slope),
("golden_cross", f_golden_cross),
("above_ma200", f_above_ma200),
# Volatility
("low_vol_60d", f_low_vol_60),
("neg_ivol_20d", f_neg_ivol),
# Reversal
("STR_5d", f_str_5d),
("STR_10d", f_str_10d),
# Quality / accumulation
("consistent_returns", f_consistent_returns),
("up_volume_proxy", f_up_volume_proxy),
("gap_up_freq", f_gap_up_freq),
("info_discrete", f_info_discrete),
("earnings_drift", f_earnings_drift),
# Other academic
("52w_high", f_52w_high),
("anti_max_20d", f_anti_max),
("dd_recovery_rate", f_dd_recovery_rate),
("mom_acceleration", f_mom_accel),
]
if market == "cn":
r1_factors.append(("reversal_vol_cn", f_reversal_vol))
r1 = run_round("ROUND 1 — Academic & Practitioner Hypotheses", r1_factors, stocks)
# Identify top-10 from round 1
r1_sorted = sorted(r1, key=lambda x: x.get("cagr", 0) or 0, reverse=True)
r1_top_names = [r["name"] for r in r1_sorted[:10] if r.get("cagr") and r["cagr"] > 0]
baseline_cagr = next((r["cagr"] for r in r1 if "BASELINE" in r["name"]), 0)
print(f"\n Baseline CAGR: {baseline_cagr:+.1%}")
print(f" Top 10: {r1_top_names}")
# ================================================================
# ROUND 2: Second-order ideas based on what worked
# ================================================================
r2_factors = [
("BASELINE:rec+mom", f_rec_mom),
("smooth_momentum", f_smooth_momentum),
("pos_ratio_60d", f_pos_ratio_60),
("up_down_asymmetry", f_up_down_asymmetry),
("max_streak_40d", f_max_streak),
("up_capture_60d", f_up_capture),
("down_resilience_120d", f_down_resilience),
("rec_mom_filtered", f_rec_mom_filtered),
("sign_ratio", f_sign_ratio),
("autocorrelation_60d", f_autocorrelation),
("mean_rev_z", f_mean_rev_z),
]
r2 = run_round("ROUND 2 — Return Quality & Behavioral Factors", r2_factors, stocks)
# ================================================================
# ROUND 3: Interaction & conditional factors
# ================================================================
r3_factors = [
("BASELINE:rec+mom", f_rec_mom),
("mom×recovery", f_mom_x_recovery),
("mom×upvol", f_mom_x_upvol),
("rec_deep×upvol", f_rec_deep_x_upvol),
("trend×mom", f_trend_x_mom),
("quality_mom", f_quality_mom),
("rec_deep×gap", f_rec_deep_x_gap),
("mom×gap", f_mom_x_gap),
("mom_low_vol_regime", f_mom_low_vol_regime),
]
r3 = run_round("ROUND 3 — Interaction & Conditional Factors", r3_factors, stocks)
# ================================================================
# Collect ALL results from all rounds
# ================================================================
all_results = r1 + r2 + r3
# Deduplicate baseline
seen = set()
unique = []
for r in all_results:
if r["name"] not in seen:
seen.add(r["name"])
unique.append(r)
unique_sorted = sorted(unique, key=lambda x: x.get("cagr", 0) or 0, reverse=True)
print(f"\n{'='*95}")
print(f" ALL ROUNDS COMBINED — TOP 15 FACTORS — {market.upper()}")
print(f"{'='*95}")
print(f" {'Factor':<45} {'CAGR':>7} {'Sharpe':>7} {'Sortino':>8} {'MaxDD':>7} {'Calmar':>7}")
print(f" {'-'*85}")
for r in unique_sorted[:15]:
flag = " <<<" if "BASELINE" in r["name"] else ""
print(f" {r['name']:<45} {r['cagr']:>+6.1%} {r['sharpe']:>7.2f} {r['sortino']:>8.2f} "
f"{r['maxdd']:>+6.1%} {r['calmar']:>7.2f}{flag}")
# ================================================================
# ROUND 4: Combine top non-baseline factors
# ================================================================
top_funcs = {}
func_map_all = dict(r1_factors + r2_factors + r3_factors)
non_baseline = [r for r in unique_sorted if "BASELINE" not in r["name"] and r.get("cagr", 0)]
for r in non_baseline[:12]:
if r["name"] in func_map_all:
top_funcs[r["name"]] = func_map_all[r["name"]]
top_names = list(top_funcs.keys())
print(f"\n Building combinations from top {len(top_names)} factors: {top_names}")
combo_factors = [("BASELINE:rec+mom", f_rec_mom)]
# All pairs from top-8
for i in range(min(8, len(top_names))):
for j in range(i + 1, min(8, len(top_names))):
n1, n2 = top_names[i], top_names[j]
combo_factors.append((
f"{n1[:20]}+{n2[:20]}",
combo([(top_funcs[n1], 0.5), (top_funcs[n2], 0.5)])
))
# Triple combos from top-5
for i in range(min(5, len(top_names))):
for j in range(i + 1, min(5, len(top_names))):
for k in range(j + 1, min(5, len(top_names))):
n1, n2, n3 = top_names[i], top_names[j], top_names[k]
combo_factors.append((
f"{n1[:15]}+{n2[:15]}+{n3[:15]}",
combo([(top_funcs[n1], 0.33), (top_funcs[n2], 0.33), (top_funcs[n3], 0.34)])
))
r4 = run_round("ROUND 4 — Factor Combinations", combo_factors, stocks)
# ================================================================
# ROUND 5: Yearly breakdown of top 5 combos
# ================================================================
r4_sorted = sorted(r4, key=lambda x: x.get("cagr", 0) or 0, reverse=True)
top5 = r4_sorted[:5]
# Make sure baseline is included
base = next((r for r in r4 if "BASELINE" in r["name"]), None)
if base and base not in top5:
top5.append(base)
print(f"\n{'='*95}")
print(f" ROUND 5 — YEARLY RETURNS OF BEST STRATEGIES — {market.upper()}")
print(f"{'='*95}")
cols = [(r["name"], r["equity"]) for r in top5]
if bench is not None:
eq_bench = bench / bench.iloc[0] * 100000
cols.append(("BENCHMARK", eq_bench))
# Header
header = f" {'Year':<6}"
for name, _ in cols:
header += f" | {name[:22]:>22}"
print(header)
print(" " + "-" * (6 + 25 * len(cols)))
all_years = sorted(set(y for _, eq in cols for y in eq.index.year.unique()))
for year in all_years:
line = f" {year:<6}"
for _, eq in cols:
dr = eq.pct_change().fillna(0)
yr = dr[dr.index.year == year]
r = float((1 + yr).prod() - 1) if len(yr) > 0 else 0
line += f" | {r:>+21.1%}"
print(line)
# Period CAGRs
for ny in [3, 5, 10]:
cutoff = stocks.index[-1] - pd.DateOffset(years=ny)
print(f"\n --- {ny}-year CAGR ---")
for name, eq in cols:
sl = eq[eq.index >= cutoff]
if len(sl) < 50:
continue
tot = sl.iloc[-1] / sl.iloc[0] - 1
cagr = (1 + tot) ** (1 / ny) - 1
print(f" {name[:50]:<50} {cagr:>+8.1%}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--market", default="us", choices=["us", "cn"])
args = parser.parse_args()
run_market(args.market)
if __name__ == "__main__":
main()

factor_real_backtest.py (Normal file, 449 lines added)

@@ -0,0 +1,449 @@
"""
Factor research v2 — Portfolio-first approach.
Instead of IC → portfolio, we go directly to:
1. Build factor signal
2. Select top-N stocks
3. Run real backtest with transaction costs
4. Measure CAGR, Sharpe, MaxDD, yearly returns
Tests single factors AND combinations. Compares everything against
the baseline recovery+momentum strategy.
"""
from __future__ import annotations
import argparse
import warnings
import numpy as np
import pandas as pd
import data_manager
import metrics
from universe import UNIVERSES
warnings.filterwarnings("ignore")
# ---------------------------------------------------------------------------
# Factor signals — each returns DataFrame (dates x stocks), higher = better
# ---------------------------------------------------------------------------
def f_momentum_12_1(p: pd.DataFrame) -> pd.DataFrame:
return p.shift(21).pct_change(231)
def f_recovery(p: pd.DataFrame) -> pd.DataFrame:
return p / p.rolling(63, min_periods=63).min() - 1
def f_recovery_mom(p: pd.DataFrame) -> pd.DataFrame:
"""The baseline composite: 50/50 recovery + momentum ranks."""
r1 = f_recovery(p).rank(axis=1, pct=True, na_option="keep")
r2 = f_momentum_12_1(p).rank(axis=1, pct=True, na_option="keep")
return 0.5 * r1 + 0.5 * r2
# --- New single factors ---
def f_short_term_reversal(p: pd.DataFrame) -> pd.DataFrame:
"""5-day return reversal."""
return -p.pct_change(5)
def f_vol_adjusted_mom(p: pd.DataFrame) -> pd.DataFrame:
"""Momentum divided by recent volatility. Sharpe-like signal.
Hypothesis: risk-adjusted momentum is more persistent."""
mom = p.shift(21).pct_change(231)
vol = p.pct_change().rolling(60, min_periods=40).std()
return mom / vol.replace(0, np.nan)
def f_acceleration(p: pd.DataFrame) -> pd.DataFrame:
"""3-month momentum minus 12-month momentum.
Hypothesis: accelerating stocks continue accelerating."""
mom_3m = p.shift(5).pct_change(63 - 5)
mom_12m = p.shift(21).pct_change(231)
return mom_3m - mom_12m
def f_breakout(p: pd.DataFrame) -> pd.DataFrame:
"""Price relative to 20-day high. Close to 1 = breaking out.
Hypothesis: breakouts from consolidation continue."""
return p / p.rolling(20, min_periods=20).max()
def f_recovery_deep(p: pd.DataFrame) -> pd.DataFrame:
"""Recovery from 126-day (6 month) low instead of 63-day.
Hypothesis: deeper recovery = stronger signal."""
return p / p.rolling(126, min_periods=126).min() - 1
def f_recovery_rate(p: pd.DataFrame) -> pd.DataFrame:
"""Speed of recovery: 20-day change in recovery factor.
Hypothesis: accelerating recovery predicts continuation."""
recovery = p / p.rolling(63, min_periods=63).min() - 1
return recovery - recovery.shift(20)
def f_drawdown_bounce(p: pd.DataFrame) -> pd.DataFrame:
"""20-day return from drawdown trough, only for stocks in drawdown.
Hypothesis: strong bounces from drawdowns persist."""
rolling_max = p.rolling(252, min_periods=126).max()
in_drawdown = p < rolling_max * 0.9 # at least 10% below peak
bounce_20d = p.pct_change(20)
# Only score stocks that were recently in drawdown
was_in_drawdown = in_drawdown.rolling(20, min_periods=1).max().astype(bool)
return bounce_20d.where(was_in_drawdown, np.nan)
def f_consistent_winner(p: pd.DataFrame) -> pd.DataFrame:
"""Fraction of months with positive returns over past 12 months.
Hypothesis: stocks that win consistently are higher quality momentum."""
monthly_ret = p.pct_change(21)
return (monthly_ret > 0).astype(float).rolling(252, min_periods=126).mean()
def f_gap_up_freq(p: pd.DataFrame) -> pd.DataFrame:
"""Fraction of days with >1% gain in past 60 days.
Hypothesis: frequent large gains = institutional buying."""
ret = p.pct_change()
return (ret > 0.01).astype(float).rolling(60, min_periods=40).mean()
def f_low_vol_mom(p: pd.DataFrame) -> pd.DataFrame:
"""Momentum only among low-volatility stocks. Combined rank.
Hypothesis: low-vol momentum is more persistent."""
mom = f_momentum_12_1(p).rank(axis=1, pct=True, na_option="keep")
vol = (-p.pct_change().rolling(60, min_periods=40).std()).rank(axis=1, pct=True, na_option="keep")
return 0.5 * mom + 0.5 * vol
def f_52w_channel_position(p: pd.DataFrame) -> pd.DataFrame:
"""Position within 252-day high-low channel. 1 = at high, 0 = at low.
Hypothesis: stocks near highs continue (anchoring + trend)."""
h = p.rolling(252, min_periods=126).max()
l = p.rolling(252, min_periods=126).min()
return (p - l) / (h - l).replace(0, np.nan)
def f_up_volume_proxy(p: pd.DataFrame) -> pd.DataFrame:
"""Proxy for accumulation: sum of returns on up days over 20 days.
Without volume data, use magnitude of positive returns as proxy."""
ret = p.pct_change()
up_ret = ret.where(ret > 0, 0)
return up_ret.rolling(20, min_periods=15).sum()
def f_relative_strength_ma(p: pd.DataFrame) -> pd.DataFrame:
"""Price above 50-day MA relative to 200-day MA position.
Dual MA trend strength."""
ma50 = p.rolling(50, min_periods=50).mean()
ma200 = p.rolling(200, min_periods=200).mean()
above_50 = (p / ma50 - 1)
above_200 = (p / ma200 - 1)
return 0.5 * above_50 + 0.5 * above_200
def f_earnings_drift_proxy(p: pd.DataFrame) -> pd.DataFrame:
"""Proxy for post-earnings drift using 5-day return spikes.
Identify large 5-day moves and bet on continuation.
Hypothesis: large moves driven by information continue."""
ret_5d = p.pct_change(5)
vol = p.pct_change().rolling(60, min_periods=40).std() * np.sqrt(5)
z_score = ret_5d / vol.replace(0, np.nan)
# Smooth: average z-score over past 60 days to capture multiple events
return z_score.rolling(60, min_periods=20).mean()
# --- A-share specific ---
def f_reversal_vol_cn(p: pd.DataFrame) -> pd.DataFrame:
"""Short-term reversal amplified by volatility.
High-vol oversold stocks bounce harder in A-shares."""
ret_5d = p.pct_change(5)
vol = p.pct_change().rolling(20, min_periods=15).std()
# Oversold (negative return) + high vol = positive score
return -ret_5d * vol
def f_momentum_6_1(p: pd.DataFrame) -> pd.DataFrame:
"""6-1 month momentum. Shorter lookback may work better in A-shares."""
return p.shift(21).pct_change(105)
def f_recovery_narrow(p: pd.DataFrame) -> pd.DataFrame:
"""Recovery from 21-day low. Faster recovery signal for A-shares."""
return p / p.rolling(21, min_periods=21).min() - 1
def f_range_breakout_cn(p: pd.DataFrame) -> pd.DataFrame:
"""Breakout from 60-day range. Tuned for A-share volatility."""
h60 = p.rolling(60, min_periods=40).max()
l60 = p.rolling(60, min_periods=40).min()
mid = (h60 + l60) / 2
rng = (h60 - l60) / mid.replace(0, np.nan)
position = (p - l60) / (h60 - l60).replace(0, np.nan)
# Reward stocks breaking out of narrow ranges
return position / rng.replace(0, np.nan)
# ---------------------------------------------------------------------------
# Strategy builder and backtester
# ---------------------------------------------------------------------------
def make_strategy(
prices: pd.DataFrame,
signal_func,
top_n: int = 10,
rebal_freq: int = 21,
warmup: int = 252,
) -> pd.DataFrame:
"""Turn a factor signal into a rebalanced top-N equal-weight strategy."""
signal = signal_func(prices)
rank = signal.rank(axis=1, ascending=False, na_option="bottom")
n_valid = signal.notna().sum(axis=1)
enough = n_valid >= top_n
top_mask = (rank <= top_n) & enough.values.reshape(-1, 1)
raw = top_mask.astype(float)
row_sums = raw.sum(axis=1).replace(0, np.nan)
weights = raw.div(row_sums, axis=0).fillna(0.0)
# Rebalance every rebal_freq trading days (default 21, i.e. roughly monthly)
rebal_mask = pd.Series(False, index=prices.index)
rebal_indices = list(range(warmup, len(prices), rebal_freq))
rebal_mask.iloc[rebal_indices] = True
weights[~rebal_mask] = np.nan
weights = weights.ffill().fillna(0.0)
weights.iloc[:warmup] = 0.0
return weights.shift(1).fillna(0.0)
def combo_signal(funcs_and_weights: list[tuple]) -> callable:
"""Create a combined signal function from [(func, weight), ...]."""
def _combo(p: pd.DataFrame) -> pd.DataFrame:
ranked = []
for func, w in funcs_and_weights:
sig = func(p)
ranked.append(w * sig.rank(axis=1, pct=True, na_option="keep"))
return sum(ranked)
return _combo
def run_backtest(
weights: pd.DataFrame,
prices: pd.DataFrame,
cost: float = 0.001,
) -> pd.Series:
"""Vectorized backtest returning equity curve."""
returns = prices.pct_change().fillna(0.0)
port_ret = (weights * returns).sum(axis=1)
turnover = weights.diff().abs().sum(axis=1)
port_ret -= turnover * cost
return (1 + port_ret).cumprod() * 100000
def compute_stats(equity: pd.Series, label: str) -> dict:
"""Compute strategy statistics."""
daily_ret = equity.pct_change().dropna()
if len(daily_ret) < 100 or daily_ret.std() == 0:
return {"name": label, "cagr": np.nan, "sharpe": np.nan, "maxdd": np.nan,
"total": np.nan, "win_rate": np.nan}
n_years = len(daily_ret) / 252
total_ret = equity.iloc[-1] / equity.iloc[0] - 1
cagr = (1 + total_ret) ** (1 / n_years) - 1
sharpe = daily_ret.mean() / daily_ret.std() * np.sqrt(252)
sortino_denom = daily_ret[daily_ret < 0].std()
sortino = daily_ret.mean() / sortino_denom * np.sqrt(252) if sortino_denom > 0 else 0
running_max = equity.cummax()
maxdd = ((equity - running_max) / running_max).min()
calmar = cagr / abs(maxdd) if maxdd != 0 else 0
win_rate = (daily_ret > 0).mean()
return {
"name": label, "cagr": cagr, "sharpe": sharpe, "sortino": sortino,
"maxdd": maxdd, "calmar": calmar, "total": total_ret, "win_rate": win_rate,
}
def yearly_returns(equity: pd.Series) -> dict[int, float]:
daily_ret = equity.pct_change().fillna(0)
years = daily_ret.index.year
result = {}
for year in sorted(years.unique()):
mask = years == year
result[year] = float((1 + daily_ret[mask]).prod() - 1)
return result
def run(market: str):
config = UNIVERSES[market]
benchmark = config["benchmark"]
print(f"Loading {market.upper()} price data...")
prices = data_manager.load(market)
bench = prices[benchmark].dropna() if benchmark in prices.columns else None
stocks = prices.drop(columns=[benchmark], errors="ignore")
print(f"Universe: {stocks.shape[1]} stocks, {stocks.shape[0]} days")
print(f"Period: {stocks.index[0].date()} to {stocks.index[-1].date()}\n")
# --- Define all strategies to test ---
strategies: list[tuple[str, callable]] = []
# Baseline
strategies.append(("BASELINE: recovery+mom", f_recovery_mom))
# Single factors
strategies.append(("momentum_12_1", f_momentum_12_1))
strategies.append(("recovery", f_recovery))
strategies.append(("vol_adj_momentum", f_vol_adjusted_mom))
strategies.append(("acceleration", f_acceleration))
strategies.append(("breakout_20d", f_breakout))
strategies.append(("recovery_deep_126d", f_recovery_deep))
strategies.append(("recovery_rate", f_recovery_rate))
strategies.append(("drawdown_bounce", f_drawdown_bounce))
strategies.append(("consistent_winner", f_consistent_winner))
strategies.append(("gap_up_freq", f_gap_up_freq))
strategies.append(("low_vol_momentum", f_low_vol_mom))
strategies.append(("52w_channel_position", f_52w_channel_position))
strategies.append(("up_volume_proxy", f_up_volume_proxy))
strategies.append(("relative_strength_ma", f_relative_strength_ma))
strategies.append(("earnings_drift_proxy", f_earnings_drift_proxy))
if market == "cn":
strategies.append(("reversal_vol_cn", f_reversal_vol_cn))
strategies.append(("momentum_6_1", f_momentum_6_1))
strategies.append(("recovery_narrow_21d", f_recovery_narrow))
strategies.append(("range_breakout_cn", f_range_breakout_cn))
# Run all single-factor backtests
print("=" * 110)
print(f" SINGLE FACTOR BACKTESTS — {market.upper()} (Top 10, monthly rebal, 10bps cost)")
print("=" * 110)
results = []
equities = {}
for name, func in strategies:
print(f" Running: {name}...")
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
equities[name] = eq
results.append(compute_stats(eq, name))
# Benchmark
if bench is not None:
eq_bench = bench / bench.iloc[0] * 100000
equities["BENCHMARK"] = eq_bench
results.append(compute_stats(eq_bench, "BENCHMARK"))
# Print results table
df = pd.DataFrame(results).set_index("name")
df = df.sort_values("cagr", ascending=False)
print(f"\n{'Strategy':<30} {'CAGR':>8} {'Sharpe':>8} {'Sortino':>8} {'MaxDD':>8} {'Calmar':>8} {'Total':>10}")
print("-" * 90)
for name, row in df.iterrows():
flag = " ***" if name == "BASELINE: recovery+mom" else ""
print(f"{name:<30} {row['cagr']:>+7.1%} {row['sharpe']:>8.2f} {row['sortino']:>8.2f} "
f"{row['maxdd']:>+7.1%} {row['calmar']:>8.2f} {row['total']:>+9.0%}{flag}")
# --- Identify factors that beat or match baseline ---
baseline_cagr = df.loc["BASELINE: recovery+mom", "cagr"]
winners = df[df["cagr"] >= baseline_cagr * 0.8].index.tolist()
winners = [w for w in winners if w not in ("BASELINE: recovery+mom", "BENCHMARK")]
print(f"\nFactors within 80% of baseline CAGR ({baseline_cagr:.1%}): {winners}")
# --- Test combinations of top performers ---
print(f"\n{'='*110}")
print(f" FACTOR COMBINATIONS — {market.upper()}")
print(f"{'='*110}")
# Get top single factors
single_only = df.drop(["BASELINE: recovery+mom", "BENCHMARK"], errors="ignore")
top_singles = single_only.nlargest(8, "cagr").index.tolist()
print(f" Top 8 singles: {top_singles}\n")
# Map names back to functions
func_map = dict(strategies)
combos: list[tuple[str, callable]] = []
# Baseline is always included
combos.append(("BASELINE: recovery+mom", f_recovery_mom))
# Top2 combinations
for i in range(min(6, len(top_singles))):
for j in range(i + 1, min(6, len(top_singles))):
n1, n2 = top_singles[i], top_singles[j]
label = f"{n1} + {n2}"
func = combo_signal([(func_map[n1], 0.5), (func_map[n2], 0.5)])
combos.append((label, func))
# Recovery+mom + each top single (3-factor)
for name in top_singles[:6]:
if name in ("momentum_12_1", "recovery"):
continue
label = f"rec+mom + {name}"
func = combo_signal([
(f_recovery, 0.33), (f_momentum_12_1, 0.33), (func_map[name], 0.34)
])
combos.append((label, func))
# Run combo backtests
combo_results = []
for name, func in combos:
print(f" Running: {name}...")
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
equities[name] = eq
combo_results.append(compute_stats(eq, name))
combo_df = pd.DataFrame(combo_results).set_index("name")
combo_df = combo_df.sort_values("cagr", ascending=False)
print(f"\n{'Combo':<55} {'CAGR':>8} {'Sharpe':>8} {'Sortino':>8} {'MaxDD':>8} {'Calmar':>8}")
print("-" * 105)
for name, row in combo_df.iterrows():
flag = " ***" if name == "BASELINE: recovery+mom" else ""
print(f"{name:<55} {row['cagr']:>+7.1%} {row['sharpe']:>8.2f} {row['sortino']:>8.2f} "
f"{row['maxdd']:>+7.1%} {row['calmar']:>8.2f}{flag}")
# --- Yearly breakdown for top 3 combos ---
top3 = combo_df.nlargest(3, "cagr").index.tolist()
if "BASELINE: recovery+mom" not in top3:
top3.append("BASELINE: recovery+mom")
print(f"\n{'='*110}")
print(f" YEARLY RETURNS — TOP STRATEGIES vs BASELINE — {market.upper()}")
print(f"{'='*110}")
yr_data = {}
for name in top3:
yr_data[name] = yearly_returns(equities[name])
if bench is not None:
yr_data["BENCHMARK"] = yearly_returns(equities["BENCHMARK"])
all_years = sorted(set(y for yd in yr_data.values() for y in yd.keys()))
# Print header
col_names = top3 + (["BENCHMARK"] if bench is not None else [])
header = f" {'Year':<6}"
for c in col_names:
header += f" | {c[:25]:>25}"
print(header)
print(" " + "-" * (6 + 28 * len(col_names)))
for year in all_years:
line = f" {year:<6}"
for c in col_names:
r = yr_data.get(c, {}).get(year, 0)
line += f" | {r:>+24.1%}"
print(line)
# Compute period summaries
for n_years in [3, 5, 10]:
cutoff = stocks.index[-1] - pd.DateOffset(years=n_years)
print(f"\n --- {n_years}-year CAGR ---")
for name in col_names:
eq = equities.get(name)
if eq is None:
continue
eq_slice = eq[eq.index >= cutoff]
if len(eq_slice) < 50:
continue
total = eq_slice.iloc[-1] / eq_slice.iloc[0] - 1
cagr = (1 + total) ** (1 / n_years) - 1
print(f" {name[:40]:<40} {cagr:>+8.1%}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--market", default="us", choices=["us", "cn"])
args = parser.parse_args()
run(args.market)
if __name__ == "__main__":
main()

547
factor_research.py Normal file
View File

@@ -0,0 +1,547 @@
"""
Factor Research Script — Professional QR-style factor mining.
Tests candidate alpha factors using:
- Information Coefficient (IC): rank correlation of signal vs forward returns
- IC Information Ratio (ICIR): mean(IC) / std(IC), measures signal consistency
- Quintile long-short spread: monotonicity of returns across signal buckets
- Turnover: daily rank change, proxy for trading cost
- Decay profile: IC at 1d, 5d, 10d, 20d horizons
- Sub-period stability: IC consistency across rolling windows
- Factor correlation matrix: ensures new factors are orthogonal to known ones
Usage:
uv run python factor_research.py --market us
uv run python factor_research.py --market cn
"""
from __future__ import annotations
import argparse
import warnings
import numpy as np
import pandas as pd
import data_manager
from universe import UNIVERSES
warnings.filterwarnings("ignore", category=FutureWarning)
HORIZONS = [1, 5, 10, 20]
# ---------------------------------------------------------------------------
# Factor definitions — each returns a DataFrame (dates x stocks) of scores
# ---------------------------------------------------------------------------
def _safe_rank(df: pd.DataFrame) -> pd.DataFrame:
return df.rank(axis=1, pct=True, na_option="keep")
def _rolling_ret(prices: pd.DataFrame, window: int) -> pd.DataFrame:
return prices.pct_change(window)
# --- Known factors (baselines) ---
def factor_momentum_12_1(prices: pd.DataFrame) -> pd.DataFrame:
"""Classic 12-1 month momentum."""
return prices.shift(21).pct_change(231)
def factor_recovery(prices: pd.DataFrame) -> pd.DataFrame:
"""Price / 63-day low - 1."""
return prices / prices.rolling(63, min_periods=63).min() - 1
def factor_inverse_vol(prices: pd.DataFrame) -> pd.DataFrame:
"""Negative 60-day realized volatility (low vol = high score)."""
return -prices.pct_change().rolling(60, min_periods=60).std()
# --- NEW candidate factors ---
def factor_short_term_reversal(prices: pd.DataFrame) -> pd.DataFrame:
"""5-day return reversal. Hypothesis: short-term mean reversion."""
return -prices.pct_change(5)
def factor_idio_vol_change(prices: pd.DataFrame) -> pd.DataFrame:
"""Change in idiosyncratic volatility (20d vs 60d).
Hypothesis: declining vol = stabilizing, predicts positive returns."""
ret = prices.pct_change()
vol_20 = ret.rolling(20, min_periods=20).std()
vol_60 = ret.rolling(60, min_periods=60).std()
return -(vol_20 / vol_60.replace(0, np.nan) - 1) # negative = vol declining
def factor_volume_price_divergence(prices: pd.DataFrame, volume: pd.DataFrame | None = None) -> pd.DataFrame:
"""Price up but momentum fading — proxy via acceleration.
Without volume data, use return acceleration as proxy."""
ret_5 = prices.pct_change(5)
ret_20 = prices.pct_change(20)
return ret_5 - (ret_20 / 4)  # 5-day return minus the pro-rated 20-day trend
def factor_max_drawdown_recovery(prices: pd.DataFrame) -> pd.DataFrame:
"""How much of the 60-day max drawdown has been recovered.
Hypothesis: stocks that recover from drawdowns continue recovering."""
rolling_max = prices.rolling(60, min_periods=60).max()
drawdown = prices / rolling_max - 1 # negative
rolling_min_dd = drawdown.rolling(60, min_periods=20).min() # worst drawdown
recovery_pct = drawdown / rolling_min_dd.replace(0, np.nan)
return recovery_pct # closer to 0 = more recovered
def factor_skewness(prices: pd.DataFrame) -> pd.DataFrame:
"""Negative 20-day return skewness.
Hypothesis: negatively skewed stocks are overpriced (lottery preference)."""
ret = prices.pct_change()
return -ret.rolling(20, min_periods=20).skew()
def factor_high_low_range(prices: pd.DataFrame) -> pd.DataFrame:
"""20-day high-low range relative to price.
Hypothesis: narrow range = consolidation, breakout ahead."""
high_20 = prices.rolling(20, min_periods=20).max()
low_20 = prices.rolling(20, min_periods=20).min()
mid = (high_20 + low_20) / 2
return -(high_20 - low_20) / mid.replace(0, np.nan) # negative range = narrow = high score
def factor_mean_reversion_residual(prices: pd.DataFrame) -> pd.DataFrame:
"""Distance from 20-day MA as fraction of 60-day vol.
Hypothesis: stocks far below MA revert. Z-score style."""
ma_20 = prices.rolling(20, min_periods=20).mean()
vol_60 = prices.pct_change().rolling(60, min_periods=60).std() * prices
return -(prices - ma_20) / vol_60.replace(0, np.nan) # below MA = high score
def factor_up_down_vol_ratio(prices: pd.DataFrame) -> pd.DataFrame:
"""Ratio of upside to downside semi-deviation (20d).
Hypothesis: stocks with more upside vol have positive momentum."""
ret = prices.pct_change()
up_vol = ret.where(ret > 0, 0).rolling(20, min_periods=15).std()
down_vol = ret.where(ret < 0, 0).rolling(20, min_periods=15).std()
return up_vol / down_vol.replace(0, np.nan)
def factor_consecutive_up_days(prices: pd.DataFrame) -> pd.DataFrame:
"""Fraction of positive return days in last 10 days.
Hypothesis: persistent winners keep winning (short-term)."""
ret = prices.pct_change()
return (ret > 0).astype(float).rolling(10, min_periods=10).mean()
def factor_gap_momentum(prices: pd.DataFrame) -> pd.DataFrame:
"""Cumulative overnight-like gaps: close-to-close vs intraday proxy.
Using 1-day returns smoothed over 20 days minus 5-day return.
Hypothesis: smooth consistent returns beat volatile ones."""
ret_1d = prices.pct_change()
smoothness = ret_1d.rolling(20, min_periods=20).mean() * 20
raw_20d = prices.pct_change(20)
return smoothness - raw_20d # positive = smoother path
def factor_recovery_acceleration(prices: pd.DataFrame) -> pd.DataFrame:
"""Rate of change of recovery factor.
Hypothesis: accelerating recovery is stronger signal than level."""
recovery = prices / prices.rolling(63, min_periods=63).min() - 1
return recovery.pct_change(5)
def factor_trend_strength(prices: pd.DataFrame) -> pd.DataFrame:
"""R-squared of log-price vs time over 60 days.
Hypothesis: stocks trending linearly (high R2) continue."""
log_p = np.log(prices.replace(0, np.nan))
def _r2(series):
y = series.dropna().values
if len(y) < 30:
return np.nan
x = np.arange(len(y), dtype=float)
x -= x.mean()
y_dm = y - y.mean()
ss_xy = (x * y_dm).sum()
ss_xx = (x * x).sum()
ss_yy = (y_dm * y_dm).sum()
if ss_xx == 0 or ss_yy == 0:
return np.nan
r2 = (ss_xy ** 2) / (ss_xx * ss_yy)
slope = ss_xy / ss_xx
return r2 if slope > 0 else -r2 # sign by direction
return log_p.rolling(60, min_periods=30).apply(_r2, raw=False)
def factor_relative_volume_momentum(prices: pd.DataFrame) -> pd.DataFrame:
"""Price momentum weighted by how 'cheap' a stock is relative to 52-week range.
Hypothesis: momentum in stocks near lows is more likely to persist."""
mom_20 = prices.pct_change(20)
high_252 = prices.rolling(252, min_periods=126).max()
low_252 = prices.rolling(252, min_periods=126).min()
position_in_range = (prices - low_252) / (high_252 - low_252).replace(0, np.nan)
return mom_20 * (1 - position_in_range) # momentum * cheapness
def factor_52w_high_distance(prices: pd.DataFrame) -> pd.DataFrame:
"""Distance from 52-week high.
Hypothesis: stocks near their highs continue (anchoring bias)."""
high_252 = prices.rolling(252, min_periods=126).max()
return prices / high_252 # closer to 1 = near high
def factor_downside_beta_proxy(prices: pd.DataFrame) -> pd.DataFrame:
"""Proxy for downside beta using co-movement on market down days.
Hypothesis: low downside beta outperforms (asymmetric risk)."""
ret = prices.pct_change()
market_ret = ret.mean(axis=1)
down_days = market_ret < 0
# Mask non-down-day returns to NaN, then rolling mean
# Use numpy for correct broadcasting, wider window (120d) so ~54 down
# days are available, well above min_periods=20
arr = ret.values.copy()
arr[~down_days.values, :] = np.nan
down_ret = pd.DataFrame(arr, index=ret.index, columns=ret.columns)
avg_down = down_ret.rolling(120, min_periods=20).mean()
return -avg_down # negative = less downside = good
# --- A-share specific factors ---
def factor_liquidity_premium(prices: pd.DataFrame) -> pd.DataFrame:
"""Amihud illiquidity proxy (using returns only, no volume).
Hypothesis: illiquid stocks earn premium in A-shares (retail driven)."""
ret = prices.pct_change()
# Full Amihud is |return| / dollar volume; without volume data, use mean |return| (higher = less liquid)
illiq = ret.abs().rolling(20, min_periods=15).mean()
return illiq
def factor_lottery_demand(prices: pd.DataFrame) -> pd.DataFrame:
"""Max daily return in past 20 days (negative).
Hypothesis: lottery stocks (high max return) underperform.
Strong in A-shares due to retail speculation."""
ret = prices.pct_change()
return -ret.rolling(20, min_periods=15).max()
def factor_turnover_reversal(prices: pd.DataFrame) -> pd.DataFrame:
"""Interaction of short-term returns with volatility.
High vol + negative return = oversold bounce candidate.
Common A-share alpha source."""
ret_5 = prices.pct_change(5)
vol_20 = prices.pct_change().rolling(20, min_periods=15).std()
return -ret_5 * vol_20 # oversold + high vol = positive
def factor_price_level(prices: pd.DataFrame) -> pd.DataFrame:
"""Negative absolute price level.
Hypothesis: low-priced stocks attract retail in A-shares (penny stock effect)."""
return -prices
# ---------------------------------------------------------------------------
# IC and analytics engine
# ---------------------------------------------------------------------------
def compute_ic(
signal: pd.DataFrame,
forward_ret: pd.DataFrame,
) -> pd.Series:
"""Cross-sectional rank IC (Spearman) per day."""
common_idx = signal.index.intersection(forward_ret.index)
common_cols = signal.columns.intersection(forward_ret.columns)
sig = signal.loc[common_idx, common_cols]
fwd = forward_ret.loc[common_idx, common_cols]
ics = []
for date in common_idx:
s = sig.loc[date].dropna()
f = fwd.loc[date].dropna()
common = s.index.intersection(f.index)
if len(common) < 30:
continue
ic = s[common].corr(f[common], method="spearman")
if np.isfinite(ic):
ics.append((date, ic))
if not ics:
return pd.Series(dtype=float)
return pd.Series(dict(ics))
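# Minimal sketch of how the per-day IC series rolls up into the summary numbers below.
# The helper name is illustrative only and is not referenced elsewhere in this module.
def _example_icir(signal: pd.DataFrame, prices: pd.DataFrame, horizon: int = 5) -> dict:
    fwd = prices.pct_change(horizon).shift(-horizon)  # forward return aligned to the signal date
    ic = compute_ic(signal, fwd)                      # daily cross-sectional Spearman IC
    if ic.empty or ic.std() == 0:
        return {"ic_mean": np.nan, "icir": np.nan, "hit_rate": np.nan}
    return {
        "ic_mean": float(ic.mean()),
        "icir": float(ic.mean() / ic.std()),          # consistency of the signal over time
        "hit_rate": float((ic > 0).mean()),           # fraction of days the signal ranks stocks correctly
    }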
def compute_quintile_returns(
signal: pd.DataFrame,
forward_ret: pd.DataFrame,
n_quantiles: int = 5,
) -> pd.DataFrame:
"""Average forward return by signal quintile, per day, then time-averaged."""
common_idx = signal.index.intersection(forward_ret.index)
common_cols = signal.columns.intersection(forward_ret.columns)
sig = signal.loc[common_idx, common_cols]
fwd = forward_ret.loc[common_idx, common_cols]
records = []
for date in common_idx:
s = sig.loc[date].dropna()
f = fwd.loc[date].dropna()
common = s.index.intersection(f.index)
if len(common) < 50:
continue
scores = s[common]
rets = f[common]
try:
quintile = pd.qcut(scores, n_quantiles, labels=False, duplicates="drop")
except ValueError:
continue
for q in range(n_quantiles):
mask = quintile == q
if mask.sum() > 0:
records.append({"date": date, "quintile": q + 1, "return": rets[mask].mean()})
if not records:
return pd.DataFrame()
df = pd.DataFrame(records)
return df.groupby("quintile")["return"].mean() * (252 / 5)  # annualize the 5-day forward returns used by analyze_factor
def compute_turnover(signal: pd.DataFrame) -> float:
"""Average daily rank change (turnover proxy)."""
ranked = signal.rank(axis=1, pct=True, na_option="keep")
daily_change = ranked.diff().abs().mean(axis=1)
return float(daily_change.mean())
def compute_factor_correlation(factors: dict[str, pd.DataFrame]) -> pd.DataFrame:
"""Cross-sectional IC correlation between all factor pairs."""
names = list(factors.keys())
n = len(names)
corr_matrix = pd.DataFrame(np.nan, index=names, columns=names)
# Use time-series of rank-averaged signals
avg_ranks = {}
for name, sig in factors.items():
ranked = sig.rank(axis=1, pct=True, na_option="keep")
avg_ranks[name] = ranked.mean(axis=1).dropna()
for i in range(n):
for j in range(i, n):
s1 = avg_ranks[names[i]]
s2 = avg_ranks[names[j]]
common = s1.index.intersection(s2.index)
if len(common) > 100:
c = s1[common].corr(s2[common])
corr_matrix.loc[names[i], names[j]] = c
corr_matrix.loc[names[j], names[i]] = c
if i == j:
corr_matrix.loc[names[i], names[j]] = 1.0
return corr_matrix
def analyze_factor(
name: str,
signal: pd.DataFrame,
prices: pd.DataFrame,
horizons: list[int] | None = None,
) -> dict:
"""Full single-factor analysis."""
if horizons is None:
horizons = HORIZONS
results = {"name": name}
# Forward returns at each horizon
for h in horizons:
fwd_ret = prices.pct_change(h).shift(-h)
ic_series = compute_ic(signal, fwd_ret)
if len(ic_series) == 0:
results[f"ic_{h}d"] = np.nan
results[f"icir_{h}d"] = np.nan
continue
ic_mean = ic_series.mean()
ic_std = ic_series.std()
icir = ic_mean / ic_std if ic_std > 0 else 0.0
results[f"ic_{h}d"] = ic_mean
results[f"icir_{h}d"] = icir
if h == 1:
results["ic_1d_series"] = ic_series
# Quintile analysis at 5-day horizon
fwd_5d = prices.pct_change(5).shift(-5)
quintiles = compute_quintile_returns(signal, fwd_5d)
if not quintiles.empty:
results["q5_return"] = float(quintiles.iloc[-1]) # top quintile
results["q1_return"] = float(quintiles.iloc[0]) # bottom quintile
results["long_short_ann"] = float(quintiles.iloc[-1] - quintiles.iloc[0])
results["monotonicity"] = float(quintiles.corr(pd.Series(range(1, len(quintiles) + 1), index=quintiles.index)))
results["quintile_returns"] = quintiles
else:
results["q5_return"] = np.nan
results["q1_return"] = np.nan
results["long_short_ann"] = np.nan
results["monotonicity"] = np.nan
# Turnover
results["turnover"] = compute_turnover(signal)
# Sub-period IC stability (rolling 252-day IC mean)
if "ic_1d_series" in results and len(results["ic_1d_series"]) > 252:
rolling_ic = results["ic_1d_series"].rolling(252).mean().dropna()
results["ic_stability"] = float((rolling_ic > 0).mean()) # fraction of time IC > 0
results["ic_worst_year"] = float(rolling_ic.min())
results["ic_best_year"] = float(rolling_ic.max())
else:
results["ic_stability"] = np.nan
results["ic_worst_year"] = np.nan
results["ic_best_year"] = np.nan
return results
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def get_all_factors(prices: pd.DataFrame, market: str) -> dict[str, pd.DataFrame]:
"""Build all candidate factor signals."""
factors = {}
# Known baselines
factors["momentum_12_1"] = factor_momentum_12_1(prices)
factors["recovery"] = factor_recovery(prices)
factors["inverse_vol"] = factor_inverse_vol(prices)
# New candidates — universal
factors["short_term_reversal"] = factor_short_term_reversal(prices)
factors["idio_vol_change"] = factor_idio_vol_change(prices)
factors["return_acceleration"] = factor_volume_price_divergence(prices)
factors["drawdown_recovery"] = factor_max_drawdown_recovery(prices)
factors["neg_skewness"] = factor_skewness(prices)
factors["range_compression"] = factor_high_low_range(prices)
factors["mean_rev_zscore"] = factor_mean_reversion_residual(prices)
factors["up_down_vol_ratio"] = factor_up_down_vol_ratio(prices)
factors["win_streak"] = factor_consecutive_up_days(prices)
factors["smooth_momentum"] = factor_gap_momentum(prices)
factors["recovery_accel"] = factor_recovery_acceleration(prices)
factors["trend_r2"] = factor_trend_strength(prices)
factors["cheap_momentum"] = factor_relative_volume_momentum(prices)
factors["near_52w_high"] = factor_52w_high_distance(prices)
factors["low_downside_beta"] = factor_downside_beta_proxy(prices)
# A-share specific (also test on US for comparison)
if market == "cn":
factors["illiquidity"] = factor_liquidity_premium(prices)
factors["anti_lottery"] = factor_lottery_demand(prices)
factors["vol_reversal"] = factor_turnover_reversal(prices)
factors["low_price"] = factor_price_level(prices)
return factors
def print_summary_table(results: list[dict], market: str) -> None:
"""Print a ranked summary of all factors."""
rows = []
for r in results:
rows.append({
"Factor": r["name"],
"IC_1d": r.get("ic_1d", np.nan),
"ICIR_1d": r.get("icir_1d", np.nan),
"IC_5d": r.get("ic_5d", np.nan),
"ICIR_5d": r.get("icir_5d", np.nan),
"IC_20d": r.get("ic_20d", np.nan),
"ICIR_20d": r.get("icir_20d", np.nan),
"LS_5d_ann": r.get("long_short_ann", np.nan),
"Mono": r.get("monotonicity", np.nan),
"Turnover": r.get("turnover", np.nan),
"IC_stab": r.get("ic_stability", np.nan),
"IC_worst_yr": r.get("ic_worst_year", np.nan),
})
df = pd.DataFrame(rows).set_index("Factor")
df = df.sort_values("ICIR_5d", ascending=False)
print(f"\n{'='*100}")
print(f" FACTOR RESEARCH RESULTS — {market.upper()} MARKET")
print(f"{'='*100}")
print("\nRanked by 5-day ICIR (most important metric for tradeable alpha):\n")
print(df.round(4).to_string())
# Highlight top factors
print(f"\n{'='*100}")
print(" TOP FACTORS (ICIR_5d > 0.05 and IC_stability > 0.6)")
print(f"{'='*100}")
top = df[(df["ICIR_5d"].abs() > 0.05) & (df["IC_stab"] > 0.6)]
if top.empty:
top = df.head(5)
print(" (No factor met strict threshold; showing top 5 by ICIR_5d)")
print(top.round(4).to_string())
# Quintile details for top factors
print(f"\n{'='*100}")
print(" QUINTILE RETURN PROFILES (annualized, 5-day forward)")
print(f"{'='*100}")
for r in sorted(results, key=lambda x: abs(x.get("icir_5d", 0)), reverse=True)[:8]:
qr = r.get("quintile_returns")
if qr is not None and not qr.empty:
q_str = " ".join(f"Q{int(k)}: {v:+.1%}" for k, v in qr.items())
ls = r.get("long_short_ann", 0)
print(f" {r['name']:25s} | {q_str} | L/S: {ls:+.1%}")
def main():
parser = argparse.ArgumentParser(description="Factor research")
parser.add_argument("--market", default="us", choices=["us", "cn"])
parser.add_argument("--years", type=int, default=None, help="Limit to last N years")
args = parser.parse_args()
market = args.market
config = UNIVERSES[market]
benchmark = config["benchmark"]
print(f"Loading {market.upper()} price data...")
prices = data_manager.load(market)
# Remove benchmark from stock universe
stocks = prices.drop(columns=[benchmark], errors="ignore")
if args.years:
cutoff = stocks.index[-1] - pd.DateOffset(years=args.years)
stocks = stocks[stocks.index >= cutoff]
print(f"Universe: {stocks.shape[1]} stocks, {stocks.shape[0]} trading days")
print(f"Date range: {stocks.index[0].date()} to {stocks.index[-1].date()}")
# Build all factor signals
print("\nComputing factor signals...")
factors = get_all_factors(stocks, market)
# Analyze each factor
print("Running factor analysis (this may take a few minutes)...")
results = []
for name, signal in factors.items():
print(f" Analyzing: {name}...")
r = analyze_factor(name, signal, stocks)
results.append(r)
# Print results
print_summary_table(results, market)
# Factor correlation matrix
print(f"\n{'='*100}")
print(" FACTOR CORRELATION MATRIX (rank-averaged cross-sectional)")
print(f"{'='*100}")
corr = compute_factor_correlation(factors)
# Show only top factors
top_names = [r["name"] for r in sorted(results, key=lambda x: abs(x.get("icir_5d", 0)), reverse=True)[:10]]
top_names_present = [n for n in top_names if n in corr.index]
print(corr.loc[top_names_present, top_names_present].round(2).to_string())
if __name__ == "__main__":
main()

323
factor_robustness.py Normal file
View File

@@ -0,0 +1,323 @@
"""
Robustness checks for winning factor strategies.
Tests:
1. Rolling 2-year window performance (stability)
2. Top-N sensitivity (5, 10, 15, 20)
3. Rebalance frequency sensitivity (5d, 10d, 21d, 42d)
4. Transaction cost sensitivity (0, 10bps, 20bps, 50bps)
5. Drawdown analysis
"""
from __future__ import annotations
import argparse
import warnings
import numpy as np
import pandas as pd
import data_manager
from universe import UNIVERSES
from factor_real_backtest import (
f_recovery_mom,
f_momentum_12_1,
f_recovery,
f_recovery_deep,
f_up_volume_proxy,
f_gap_up_freq,
f_earnings_drift_proxy,
f_reversal_vol_cn,
f_consistent_winner,
combo_signal,
make_strategy,
run_backtest,
compute_stats,
)
warnings.filterwarnings("ignore")
def rolling_window_performance(equity: pd.Series, window_years: int = 2):
"""Compute rolling window returns."""
daily_ret = equity.pct_change().dropna()
window = 252 * window_years
results = []
for end_idx in range(window, len(daily_ret), 63): # step 3 months
start_idx = end_idx - window
chunk = daily_ret.iloc[start_idx:end_idx]
total = (1 + chunk).prod() - 1
ann = (1 + total) ** (252 / len(chunk)) - 1
sharpe = chunk.mean() / chunk.std() * np.sqrt(252) if chunk.std() > 0 else 0
results.append({
"end_date": chunk.index[-1].date(),
"ann_return": ann,
"sharpe": sharpe,
})
return pd.DataFrame(results)
def drawdown_analysis(equity: pd.Series) -> pd.DataFrame:
"""Find top 5 drawdown episodes."""
running_max = equity.cummax()
drawdown = (equity - running_max) / running_max
# Find drawdown episodes: start when drawdown breaches -5%, end when equity reclaims its prior peak
episodes = []
in_dd = False
start = None
for i in range(len(drawdown)):
if drawdown.iloc[i] < -0.05 and not in_dd:
in_dd = True
start = i
elif drawdown.iloc[i] >= 0 and in_dd:
in_dd = False
trough_idx = drawdown.iloc[start:i].idxmin()
episodes.append({
"start": drawdown.index[start].date(),
"trough": trough_idx.date(),
"end": drawdown.index[i].date(),
"depth": drawdown.loc[trough_idx],
"duration_days": i - start,
})
# Handle ongoing drawdown
if in_dd:
trough_idx = drawdown.iloc[start:].idxmin()
episodes.append({
"start": drawdown.index[start].date(),
"trough": trough_idx.date(),
"end": "ongoing",
"depth": drawdown.loc[trough_idx],
"duration_days": len(drawdown) - start,
})
df = pd.DataFrame(episodes)
if df.empty:
return df
return df.nsmallest(5, "depth")  # depth is negative, so nsmallest = the 5 deepest episodes
def run_us(stocks: pd.DataFrame):
print("=" * 100)
print(" US ROBUSTNESS — Winner: momentum_12_1 + up_volume_proxy")
print("=" * 100)
winner_func = combo_signal([(f_momentum_12_1, 0.5), (f_up_volume_proxy, 0.5)])
baseline_func = f_recovery_mom
# 1. Rolling 2-year performance
print("\n--- 1. Rolling 2-Year Performance ---\n")
for label, func in [("Winner: mom+upvol", winner_func),
("Baseline: rec+mom", baseline_func)]:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
roll = rolling_window_performance(eq)
if roll.empty:
continue
win_pct = (roll["ann_return"] > 0).mean()
print(f" {label}:")
print(f" Mean 2yr ann return: {roll['ann_return'].mean():+.1%}")
print(f" Min 2yr ann return: {roll['ann_return'].min():+.1%}")
print(f" Max 2yr ann return: {roll['ann_return'].max():+.1%}")
print(f" % positive 2yr: {win_pct:.0%}")
print(f" Mean 2yr Sharpe: {roll['sharpe'].mean():.2f}")
print()
# 2. Top-N sensitivity
print("--- 2. Top-N Sensitivity ---\n")
header = f" {'Top-N':<8}"
for label in ["Winner: mom+upvol", "Baseline: rec+mom"]:
header += f" | {'CAGR':>8} {'Sharpe':>8} {'MaxDD':>8}"
print(header)
print(" " + "-" * 70)
for top_n in [5, 10, 15, 20, 30]:
line = f" {top_n:<8}"
for func in [winner_func, baseline_func]:
w = make_strategy(stocks, func, top_n=top_n)
eq = run_backtest(w, stocks)
s = compute_stats(eq, "")
line += f" | {s['cagr']:>+7.1%} {s['sharpe']:>8.2f} {s['maxdd']:>+7.1%}"
print(line)
# 3. Rebalance frequency sensitivity
print("\n--- 3. Rebalance Frequency Sensitivity ---\n")
header = f" {'Rebal':<8}"
for label in ["Winner: mom+upvol", "Baseline: rec+mom"]:
header += f" | {'CAGR':>8} {'Sharpe':>8} {'MaxDD':>8}"
print(header)
print(" " + "-" * 70)
for rebal in [5, 10, 21, 42, 63]:
line = f" {rebal}d{'':<5}"
for func in [winner_func, baseline_func]:
w = make_strategy(stocks, func, top_n=10, rebal_freq=rebal)
eq = run_backtest(w, stocks)
s = compute_stats(eq, "")
line += f" | {s['cagr']:>+7.1%} {s['sharpe']:>8.2f} {s['maxdd']:>+7.1%}"
print(line)
# 4. Transaction cost sensitivity
print("\n--- 4. Transaction Cost Sensitivity ---\n")
header = f" {'Cost':<8}"
for label in ["Winner: mom+upvol", "Baseline: rec+mom"]:
header += f" | {'CAGR':>8} {'Sharpe':>8}"
print(header)
print(" " + "-" * 50)
for cost in [0, 0.001, 0.002, 0.005]:
line = f" {cost*10000:.0f}bps{'':<4}"
for func in [winner_func, baseline_func]:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks, cost=cost)
s = compute_stats(eq, "")
line += f" | {s['cagr']:>+7.1%} {s['sharpe']:>8.2f}"
print(line)
# 5. Drawdown analysis
print("\n--- 5. Drawdown Episodes ---\n")
for label, func in [("Winner: mom+upvol", winner_func),
("Baseline: rec+mom", baseline_func)]:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
dd = drawdown_analysis(eq)
print(f" {label}:")
if dd.empty:
print(" No significant drawdowns")
else:
for _, row in dd.iterrows():
print(f" {row['start']}{row['trough']}{row['end']}: "
f"{row['depth']:+.1%} ({row['duration_days']}d)")
print()
# 6. Also test the runner-up combos
print("--- 6. Other Strong Combos (Top-10, 21d rebal, 10bps) ---\n")
other_combos = [
("rec_deep+upvol", combo_signal([(f_recovery_deep, 0.5), (f_up_volume_proxy, 0.5)])),
("rec_deep+mom", combo_signal([(f_recovery_deep, 0.5), (f_momentum_12_1, 0.5)])),
("mom+gap_up", combo_signal([(f_momentum_12_1, 0.5), (f_gap_up_freq, 0.5)])),
("rec_deep+upvol+mom", combo_signal([(f_recovery_deep, 0.33), (f_up_volume_proxy, 0.33), (f_momentum_12_1, 0.34)])),
("mom+upvol+gap", combo_signal([(f_momentum_12_1, 0.33), (f_up_volume_proxy, 0.33), (f_gap_up_freq, 0.34)])),
]
for label, func in other_combos:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
s = compute_stats(eq, "")
print(f" {label:<25} CAGR: {s['cagr']:>+7.1%} Sharpe: {s['sharpe']:.2f} MaxDD: {s['maxdd']:>+7.1%} Calmar: {s['calmar']:.2f}")
def run_cn(stocks: pd.DataFrame):
print("\n" + "=" * 100)
print(" CN ROBUSTNESS — Winners: reversal_vol + gap_up, earn_drift + reversal_vol")
print("=" * 100)
winner1_func = combo_signal([(f_reversal_vol_cn, 0.5), (f_gap_up_freq, 0.5)])
winner2_func = combo_signal([(f_earnings_drift_proxy, 0.5), (f_reversal_vol_cn, 0.5)])
baseline_func = f_recovery_mom
# 1. Rolling 2-year performance
print("\n--- 1. Rolling 2-Year Performance ---\n")
for label, func in [("W1: rev_vol+gap_up", winner1_func),
("W2: earn_drift+rev_vol", winner2_func),
("Baseline: rec+mom", baseline_func)]:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
roll = rolling_window_performance(eq)
if roll.empty:
continue
win_pct = (roll["ann_return"] > 0).mean()
print(f" {label}:")
print(f" Mean 2yr ann return: {roll['ann_return'].mean():+.1%}")
print(f" Min 2yr ann return: {roll['ann_return'].min():+.1%}")
print(f" Max 2yr ann return: {roll['ann_return'].max():+.1%}")
print(f" % positive 2yr: {win_pct:.0%}")
print(f" Mean 2yr Sharpe: {roll['sharpe'].mean():.2f}")
print()
# 2. Top-N sensitivity
print("--- 2. Top-N Sensitivity ---\n")
header = f" {'Top-N':<8}"
for label in ["W1: rev+gap", "W2: earn+rev", "Baseline"]:
header += f" | {'CAGR':>8} {'Sharpe':>8} {'MaxDD':>8}"
print(header)
print(" " + "-" * 100)
for top_n in [5, 10, 15, 20]:
line = f" {top_n:<8}"
for func in [winner1_func, winner2_func, baseline_func]:
w = make_strategy(stocks, func, top_n=top_n)
eq = run_backtest(w, stocks)
s = compute_stats(eq, "")
line += f" | {s['cagr']:>+7.1%} {s['sharpe']:>8.2f} {s['maxdd']:>+7.1%}"
print(line)
# 3. Rebalance frequency
print("\n--- 3. Rebalance Frequency ---\n")
header = f" {'Rebal':<8}"
for label in ["W1: rev+gap", "W2: earn+rev", "Baseline"]:
header += f" | {'CAGR':>8} {'Sharpe':>8}"
print(header)
print(" " + "-" * 75)
for rebal in [5, 10, 21, 42]:
line = f" {rebal}d{'':<5}"
for func in [winner1_func, winner2_func, baseline_func]:
w = make_strategy(stocks, func, top_n=10, rebal_freq=rebal)
eq = run_backtest(w, stocks)
s = compute_stats(eq, "")
line += f" | {s['cagr']:>+7.1%} {s['sharpe']:>8.2f}"
print(line)
# 4. Transaction cost sensitivity
print("\n--- 4. Transaction Cost Sensitivity ---\n")
header = f" {'Cost':<8}"
for label in ["W1: rev+gap", "W2: earn+rev", "Baseline"]:
header += f" | {'CAGR':>8} {'Sharpe':>8}"
print(header)
print(" " + "-" * 75)
for cost in [0, 0.001, 0.002, 0.005]:
line = f" {cost*10000:.0f}bps{'':<4}"
for func in [winner1_func, winner2_func, baseline_func]:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks, cost=cost)
s = compute_stats(eq, "")
line += f" | {s['cagr']:>+7.1%} {s['sharpe']:>8.2f}"
print(line)
# 5. Drawdown analysis
print("\n--- 5. Drawdown Episodes ---\n")
for label, func in [("W1: rev_vol+gap_up", winner1_func),
("W2: earn_drift+rev_vol", winner2_func),
("Baseline: rec+mom", baseline_func)]:
w = make_strategy(stocks, func, top_n=10)
eq = run_backtest(w, stocks)
dd = drawdown_analysis(eq)
print(f" {label}:")
if dd.empty:
print(" No significant drawdowns")
else:
for _, row in dd.iterrows():
print(f" {row['start']}{row['trough']}{row['end']}: "
f"{row['depth']:+.1%} ({row['duration_days']}d)")
print()
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--market", default="both", choices=["us", "cn", "both"])
args = parser.parse_args()
if args.market in ("us", "both"):
prices = data_manager.load("us")
stocks = prices.drop(columns=["SPY"], errors="ignore")
run_us(stocks)
if args.market in ("cn", "both"):
prices = data_manager.load("cn")
stocks = prices.drop(columns=["000300.SS"], errors="ignore")
run_cn(stocks)
if __name__ == "__main__":
main()

259
factor_yearly_fresh.py Normal file
View File

@@ -0,0 +1,259 @@
"""
Rebalancing frequency comparison: daily (1d) vs weekly (5d) vs biweekly (10d) vs monthly (21d).
Shows yearly returns and max drawdown for each frequency, for all champion strategies.
"""
from __future__ import annotations
import warnings
import numpy as np
import pandas as pd
import data_manager
from factor_loop import (
strat, bt, stats, combo,
f_rec_mom, f_rec_126, f_rec_63,
f_mom_12_1, f_mom_6_1, f_mom_intermediate,
f_above_ma200, f_golden_cross,
f_up_volume_proxy, f_gap_up_freq,
f_rec_mom_filtered, f_down_resilience,
f_up_capture, f_52w_high, f_str_10d,
f_earnings_drift, f_reversal_vol,
)
warnings.filterwarnings("ignore")
INITIAL = 10_000
REBAL_CONFIGS = [
("daily", 1),
("weekly", 5),
("biweekly", 10),
("monthly", 21),
]
def f_quality_mom(p):
mom = f_mom_12_1(p)
consist = (p.pct_change() > 0).astype(float).rolling(252, min_periods=126).mean()
mom_r = mom.rank(axis=1, pct=True, na_option="keep")
con_r = consist.rank(axis=1, pct=True, na_option="keep")
up_r = f_up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
return 0.4 * mom_r + 0.3 * con_r + 0.3 * up_r
def f_mom_x_gap(p):
return (f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep") *
f_gap_up_freq(p).rank(axis=1, pct=True, na_option="keep"))
def run_equity(func, prices, rebal=21, cost=0.001):
w = strat(prices, func, top_n=10, rebal=rebal)
eq = bt(w, prices, cost=cost)
return eq / eq.iloc[0] * INITIAL
def year_returns(eq: pd.Series) -> dict[int, float]:
dr = eq.pct_change().fillna(0)
return {y: float((1 + dr[dr.index.year == y]).prod() - 1)
for y in sorted(dr.index.year.unique())}
def max_drawdown(eq: pd.Series) -> float:
rm = eq.cummax()
dd = (eq - rm) / rm
return float(dd.min())
def max_drawdown_yearly(eq: pd.Series) -> dict[int, float]:
result = {}
for y in sorted(eq.index.year.unique()):
chunk = eq[eq.index.year == y]
if len(chunk) < 5:
continue
rm = chunk.cummax()
dd = (chunk - rm) / rm
result[y] = float(dd.min())
return result
def cagr(eq: pd.Series) -> float:
dr = eq.pct_change().dropna()
if len(dr) < 100:
return np.nan
ny = len(dr) / 252
tot = eq.iloc[-1] / eq.iloc[0] - 1
return (1 + tot) ** (1 / ny) - 1
def sharpe(eq: pd.Series) -> float:
dr = eq.pct_change().dropna()
if len(dr) < 100 or dr.std() == 0:
return np.nan
return float(dr.mean() / dr.std() * np.sqrt(252))
def turnover_annual(func, prices, rebal):
"""Estimate annualised turnover (one-way)."""
w = strat(prices, func, top_n=10, rebal=rebal)
daily_turn = w.diff().abs().sum(axis=1).mean()
return daily_turn * 252
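# Example read (pure arithmetic, illustrative numbers): a mean daily |weight change| of 0.08
# annualises to roughly 0.08 * 252, i.e. about 20x portfolio equity traded per year.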
def print_by_year(strat_defs, prices, bench_eq, bench_label, market_label, years):
"""For each year, print a table: strategies as rows, rebal frequencies as columns."""
freq_labels = [r for r, _ in REBAL_CONFIGS]
# Pre-compute all equities and returns
all_eqs = {} # {(sname, freq): equity}
for sname, func in strat_defs.items():
for rlabel, rdays in REBAL_CONFIGS:
all_eqs[(sname, rlabel)] = run_equity(func, prices, rebal=rdays)
all_rets = {} # {(sname, freq): {year: ret}}
for key, eq in all_eqs.items():
all_rets[key] = year_returns(eq)
bench_rets = year_returns(bench_eq)
snames = list(strat_defs.keys())
name_w = max(len(s) for s in snames) + 1
for year in years:
line_w = name_w + 4 + 20 * (len(freq_labels) + 1)
print(f"\n{'=' * line_w}")
print(f" {market_label}{year} (fresh $10,000)")
print(f"{'=' * line_w}")
# Header
print(f" {'Strategy':<{name_w}}", end="")
for f in freq_labels:
print(f" {f:>18}", end="")
print(f" {bench_label:>18}")
print(f" {'-'*name_w}", end="")
for _ in range(len(freq_labels) + 1):
print(f" {'-'*18}", end="")
print()
for sname in snames:
print(f" {sname:<{name_w}}", end="")
# Find best freq for this strategy this year
freq_vals = {}
for f in freq_labels:
r = all_rets[(sname, f)].get(year)
if r is not None and abs(r) > 0.0005:
freq_vals[f] = r
best_f = max(freq_vals, key=freq_vals.get) if freq_vals else None
for f in freq_labels:
r = all_rets[(sname, f)].get(year)
if r is not None and abs(r) > 0.0005:
v = INITIAL * (1 + r)
marker = "" if f == best_f else " "
print(f" ${v:>9,.0f} {r:>+5.0%}{marker}", end="")
else:
print(f" {'':>18}", end="")
# Benchmark (same for all strategies)
br = bench_rets.get(year)
if br is not None and abs(br) > 0.0005:
print(f" ${INITIAL*(1+br):>9,.0f} {br:>+5.0%} ", end="")
else:
print(f" {'':>18}", end="")
print()
# Best strategy per freq
print(f" {'-'*name_w}", end="")
for _ in range(len(freq_labels) + 1):
print(f" {'-'*18}", end="")
print()
print(f" {'BEST':<{name_w}}", end="")
for f in freq_labels:
best_r = -999
best_s = ""
for sname in snames:
r = all_rets[(sname, f)].get(year)
if r is not None and abs(r) > 0.0005 and r > best_r:
best_r = r
best_s = sname
if best_r > -999:
print(f" ${INITIAL*(1+best_r):>9,.0f} {best_r:>+5.0%} ", end="")
else:
print(f" {'':>18}", end="")
# bench
br = bench_rets.get(year)
if br is not None and abs(br) > 0.0005:
print(f" ${INITIAL*(1+br):>9,.0f} {br:>+5.0%} ", end="")
else:
print(f" {'':>18}", end="")
print()
def main():
years = list(range(2015, 2027))
# ===== US =====
print(f"\n{'#'*130}")
print(f"{'#'*50} US MARKET {'#'*50}")
print(f"{'#'*130}")
prices_us = data_manager.load("us")
bench_us = prices_us["SPY"].dropna()
stocks_us = prices_us.drop(columns=["SPY"], errors="ignore")
eq_spy = bench_us / bench_us.iloc[0] * INITIAL
us_strats = {
"rec_mfilt+deep×upvol": combo([
(f_rec_mom_filtered, 0.5),
(combo([(f_rec_126, 0.5), (f_up_volume_proxy, 0.5)]), 0.5),
]),
"ma200+mom7m+rec126": combo([
(f_above_ma200, 0.33), (f_mom_intermediate, 0.33), (f_rec_126, 0.34)
]),
"rec_mfilt+ma200": combo([
(f_rec_mom_filtered, 0.5), (f_above_ma200, 0.5)
]),
"mom7m+rec126": combo([
(f_mom_intermediate, 0.5), (f_rec_126, 0.5)
]),
"BASELINE:rec+mom": f_rec_mom,
}
print_by_year(us_strats, stocks_us, eq_spy, "SPY", "US", years)
# ===== CN =====
print(f"\n\n{'#'*130}")
print(f"{'#'*50} CN MARKET {'#'*50}")
print(f"{'#'*130}")
prices_cn = data_manager.load("cn")
bench_cn = prices_cn["000300.SS"].dropna() if "000300.SS" in prices_cn.columns else None
stocks_cn = prices_cn.drop(columns=["000300.SS"], errors="ignore")
cn_strats = {
"up_cap+quality_mom": combo([
(f_up_capture, 0.5), (f_quality_mom, 0.5)
]),
"down_resil+qual_mom": combo([
(f_down_resilience, 0.5), (f_quality_mom, 0.5)
]),
"rec63+mom×gap": combo([
(f_rec_63, 0.5), (f_mom_x_gap, 0.5)
]),
"up_cap+mom×gap": combo([
(f_up_capture, 0.5), (f_mom_x_gap, 0.5)
]),
"BASELINE:rec+mom": f_rec_mom,
}
if bench_cn is not None:
eq_csi = bench_cn / bench_cn.iloc[0] * INITIAL
else:
eq_csi = pd.Series(dtype=float)
print_by_year(cn_strats, stocks_cn, eq_csi, "CSI300", "CN", years)
if __name__ == "__main__":
main()

219
factor_yearly_report.py Normal file
View File

@@ -0,0 +1,219 @@
"""
Yearly ROI report for champion strategies vs SPY, starting from $10,000.
"""
from __future__ import annotations
import warnings
import numpy as np
import pandas as pd
import data_manager
from universe import UNIVERSES
from factor_loop import (
strat, bt, stats, combo,
f_rec_mom, f_rec_126, f_rec_63,
f_mom_12_1, f_mom_6_1, f_mom_intermediate,
f_above_ma200, f_golden_cross,
f_up_volume_proxy, f_gap_up_freq,
f_rec_mom_filtered, f_down_resilience,
f_up_capture, f_52w_high, f_str_10d,
f_earnings_drift, f_reversal_vol,
)
warnings.filterwarnings("ignore")
INITIAL = 10_000
def f_quality_mom(p):
mom = f_mom_12_1(p)
consist = (p.pct_change() > 0).astype(float).rolling(252, min_periods=126).mean()
mom_r = mom.rank(axis=1, pct=True, na_option="keep")
con_r = consist.rank(axis=1, pct=True, na_option="keep")
up_r = f_up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
return 0.4 * mom_r + 0.3 * con_r + 0.3 * up_r
def f_mom_x_gap(p):
return (f_mom_12_1(p).rank(axis=1, pct=True, na_option="keep") *
f_gap_up_freq(p).rank(axis=1, pct=True, na_option="keep"))
def run_equity(func, prices, cost=0.001):
w = strat(prices, func, top_n=10)
eq = bt(w, prices, cost=cost)
return eq / eq.iloc[0] * INITIAL
def yearly_table(equities: dict[str, pd.Series], title: str):
print(f"\n{'='*130}")
print(f" {title}")
print(f" Starting capital: ${INITIAL:,.0f}")
print(f"{'='*130}")
names = list(equities.keys())
all_years = sorted(set(y for eq in equities.values() for y in eq.index.year.unique()))
# Header
print(f"\n {'Year':<6}", end="")
for n in names:
print(f" | {n[:24]:>24}", end="")
print()
print(f" {'-'*6}", end="")
for _ in names:
print(f"-+-{'-'*24}", end="")
print()
# Track portfolio values
year_end_vals = {n: INITIAL for n in names}
for year in all_years:
print(f" {year:<6}", end="")
for n in names:
eq = equities[n]
yr_data = eq[eq.index.year == year]
if len(yr_data) < 2:
print(f" | {'':>24}", end="")
continue
start_val = yr_data.iloc[0]
end_val = yr_data.iloc[-1]
ret = end_val / start_val - 1
year_end_vals[n] = end_val
# Show both return % and portfolio value
print(f" | {ret:>+7.1%} ${end_val:>12,.0f}", end="")
print()
# Summary rows
print(f" {'-'*6}", end="")
for _ in names:
print(f"-+-{'-'*24}", end="")
print()
# Total return
print(f" {'Total':<6}", end="")
for n in names:
eq = equities[n]
total = eq.iloc[-1] / INITIAL - 1
print(f" | {total:>+7.0%} ${eq.iloc[-1]:>12,.0f}", end="")
print()
# CAGR
print(f" {'CAGR':<6}", end="")
for n in names:
eq = equities[n]
ny = len(eq) / 252
total = eq.iloc[-1] / INITIAL - 1
cagr = (1 + total) ** (1 / ny) - 1
print(f" | {cagr:>+7.1%} {'':>12}", end="")
print()
# Sharpe
print(f" {'Sharpe':<6}", end="")
for n in names:
eq = equities[n]
dr = eq.pct_change().dropna()
sh = dr.mean() / dr.std() * np.sqrt(252) if dr.std() > 0 else 0
print(f" | {sh:>7.2f} {'':>12}", end="")
print()
# Max DD
print(f" {'MaxDD':<6}", end="")
for n in names:
eq = equities[n]
rm = eq.cummax()
dd = ((eq - rm) / rm).min()
print(f" | {dd:>+7.1%} {'':>12}", end="")
print()
# Best/Worst year
print(f" {'Best':<6}", end="")
for n in names:
eq = equities[n]
dr = eq.pct_change().fillna(0)
yr_rets = {y: float((1 + dr[dr.index.year == y]).prod() - 1) for y in all_years}
# skip warmup year
active = {y: r for y, r in yr_rets.items() if abs(r) > 0.001}
if active:
best_y = max(active, key=active.get)
print(f" | {active[best_y]:>+7.1%} ({best_y}) ", end="")
else:
print(f" | {'':>24}", end="")
print()
print(f" {'Worst':<6}", end="")
for n in names:
eq = equities[n]
dr = eq.pct_change().fillna(0)
yr_rets = {y: float((1 + dr[dr.index.year == y]).prod() - 1) for y in all_years}
active = {y: r for y, r in yr_rets.items() if abs(r) > 0.001}
if active:
worst_y = min(active, key=active.get)
print(f" | {active[worst_y]:>+7.1%} ({worst_y}) ", end="")
else:
print(f" | {'':>24}", end="")
print()
def main():
# ===== US =====
prices_us = data_manager.load("us")
bench_us = prices_us["SPY"].dropna()
stocks_us = prices_us.drop(columns=["SPY"], errors="ignore")
eq_spy = bench_us / bench_us.iloc[0] * INITIAL
us_strats = {
"rec_mfilt+deep×upvol": combo([
(f_rec_mom_filtered, 0.5),
(combo([(f_rec_126, 0.5), (f_up_volume_proxy, 0.5)]), 0.5),
]),
"ma200+mom7m+rec126": combo([
(f_above_ma200, 0.33), (f_mom_intermediate, 0.33), (f_rec_126, 0.34)
]),
"rec_mfilt+ma200": combo([
(f_rec_mom_filtered, 0.5), (f_above_ma200, 0.5)
]),
"mom7m+rec126": combo([
(f_mom_intermediate, 0.5), (f_rec_126, 0.5)
]),
"BASELINE:rec+mom": f_rec_mom,
}
us_equities = {}
for name, func in us_strats.items():
us_equities[name] = run_equity(func, stocks_us)
us_equities["SPY (Benchmark)"] = eq_spy
yearly_table(us_equities, "US MARKET — Champion Strategies vs SPY — $10,000 Starting Capital")
# ===== CN =====
prices_cn = data_manager.load("cn")
bench_cn = prices_cn["000300.SS"].dropna() if "000300.SS" in prices_cn.columns else None
stocks_cn = prices_cn.drop(columns=["000300.SS"], errors="ignore")
cn_strats = {
"up_cap+quality_mom": combo([
(f_up_capture, 0.5), (f_quality_mom, 0.5)
]),
"down_resil+qual_mom": combo([
(f_down_resilience, 0.5), (f_quality_mom, 0.5)
]),
"rec63+mom×gap": combo([
(f_rec_63, 0.5), (f_mom_x_gap, 0.5)
]),
"up_cap+mom×gap": combo([
(f_up_capture, 0.5), (f_mom_x_gap, 0.5)
]),
"BASELINE:rec+mom": f_rec_mom,
}
cn_equities = {}
for name, func in cn_strats.items():
cn_equities[name] = run_equity(func, stocks_cn)
if bench_cn is not None:
cn_equities["CSI300 (Benchmark)"] = bench_cn / bench_cn.iloc[0] * INITIAL
yearly_table(cn_equities, "CN MARKET — Champion Strategies vs CSI 300 — $10,000 Starting Capital")
if __name__ == "__main__":
main()

27
main.py
View File

@@ -5,6 +5,7 @@ import numpy as np
import pandas as pd
import data_manager
import factor_attribution
import metrics
from strategies.adaptive_momentum import AdaptiveMomentumStrategy
from strategies.buy_and_hold import BuyAndHoldStrategy
@@ -163,6 +164,18 @@ def main() -> None:
help="Execution mode: 'close' (default, signal & execute on close) or " help="Execution mode: 'close' (default, signal & execute on close) or "
"'open-close' (signal on morning open, execute at close)", "'open-close' (signal on morning open, execute at close)",
) )
parser.add_argument(
"--attribution", action="store_true",
help="Run factor attribution after performance metrics",
)
parser.add_argument(
"--attribution-model", choices=["capm", "ff5", "ff5plus", "all"], default="all",
help="Factor model selection for attribution output",
)
parser.add_argument(
"--attribution-export", default=None,
help="Directory to export factor attribution CSVs",
)
args = parser.parse_args()
initial_capital = args.capital if args.capital is not None else 10_000
use_open = args.execution == "open-close"
@@ -238,6 +251,20 @@ def main() -> None:
continue
metrics.summary(eq, name=name)
if args.attribution:
summary_df, loadings_df = factor_attribution.attribute_strategies(
results_df=results_df,
benchmark_label=benchmark_label,
benchmark=benchmark,
price_data=data,
market=args.market,
model_selection=args.attribution_model,
)
factor_attribution.print_attribution_summary(summary_df)
if args.attribution_export:
factor_attribution.export_attribution(summary_df, loadings_df, args.attribution_export)
print(f"Attribution CSVs written to {args.attribution_export}")
# --- Visualization ---
if not args.no_plot:
plot_results(results_df.dropna())
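Rough usage of the new attribution flags (assuming main.py's existing --market argument, per args.market in the hunk above; the export directory name is illustrative):

    uv run python main.py --market us --attribution --attribution-model ff5 --attribution-export attribution_out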

0
research/__init__.py Normal file
View File

34
research/event_factors.py Normal file
View File

@@ -0,0 +1,34 @@
import numpy as np
import pandas as pd
TRAILING_HIGH_WINDOW = 60
COMPRESSION_WINDOW = 20
VOLUME_WINDOW = 20
def breakout_after_compression_score(
close: pd.DataFrame,
high: pd.DataFrame,
low: pd.DataFrame,
volume: pd.DataFrame,
) -> pd.DataFrame:
"""Score breakout setups and shift the result so it is tradable next day."""
close = close.sort_index()
high = high.reindex(index=close.index, columns=close.columns).sort_index()
low = low.reindex(index=close.index, columns=close.columns).sort_index()
volume = volume.reindex(index=close.index, columns=close.columns).sort_index()
trailing_high = close.rolling(TRAILING_HIGH_WINDOW, min_periods=TRAILING_HIGH_WINDOW).max()
proximity_to_high = close / trailing_high.replace(0, np.nan)
recent_high = high.rolling(COMPRESSION_WINDOW, min_periods=COMPRESSION_WINDOW).max()
recent_low = low.rolling(COMPRESSION_WINDOW, min_periods=COMPRESSION_WINDOW).min()
recent_mid = (recent_high + recent_low) / 2
compressed_range = -((recent_high - recent_low) / recent_mid.replace(0, np.nan))
median_volume = volume.rolling(VOLUME_WINDOW, min_periods=VOLUME_WINDOW).median()
abnormal_volume = volume / median_volume.replace(0, np.nan)
score = proximity_to_high + compressed_range + abnormal_volume
return score.shift(1)
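# Rough usage sketch (the helper below is illustrative only and not referenced elsewhere):
# turn the shifted score into a per-day top-N selection mask.
def _example_top_breakouts(close: pd.DataFrame, high: pd.DataFrame, low: pd.DataFrame,
                           volume: pd.DataFrame, top_n: int = 20) -> pd.DataFrame:
    score = breakout_after_compression_score(close, high, low, volume)
    ranks = score.rank(axis=1, ascending=False, na_option="keep")  # 1 = strongest setup
    return ranks <= top_n  # boolean mask of breakout candidates tradable the next day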

152
research/fetch_historical.py Normal file
View File

@@ -0,0 +1,152 @@
"""
Fetch price history for all tickers that were ever S&P 500 members — including
delisted ones — and save to data/us_pit.csv. This is the foundation for a
survivorship-bias-free backtest.
NOTE: Yahoo Finance no longer serves price data for many fully-delisted tickers
(bankruptcies, old mergers). Those are silently skipped. The result is still
a major improvement over "today's S&P 500 extrapolated 10 years back", but it
is NOT a perfect point-in-time dataset — only a dataset where the universe
mask is correct at each date. A subset of worst-outcome tickers (e.g., ABK,
ACAS) will be missing entirely. This caveat is documented in the run summary.
"""
import os
from datetime import datetime, timedelta
import pandas as pd
import yfinance as yf
import universe_history as uh
DATA_DIR = "data"
OUT_PATH = os.path.join(DATA_DIR, "us_pit.csv")
YEARS = 10
BATCH_SIZE = 50
def _field_out_paths() -> dict[str, str]:
return {
"Close": os.path.join(DATA_DIR, "us_pit_close.csv"),
"High": os.path.join(DATA_DIR, "us_pit_high.csv"),
"Low": os.path.join(DATA_DIR, "us_pit_low.csv"),
"Volume": os.path.join(DATA_DIR, "us_pit_volume.csv"),
}
def fetch_all_historical(force: bool = False) -> pd.DataFrame:
os.makedirs(DATA_DIR, exist_ok=True)
intervals = uh.load_sp500_history()
tickers = uh.all_tickers_ever(intervals) + ["SPY"]
tickers = sorted(set(tickers))
existing = None
if os.path.exists(OUT_PATH) and not force:
existing = pd.read_csv(OUT_PATH, index_col=0, parse_dates=True)
missing = [t for t in tickers if t not in existing.columns]
if not missing:
# Just append latest dates
last_date = existing.index[-1]
if (datetime.now() - last_date.to_pydatetime()).days < 2:
print(f"--- us_pit.csv already up to date: {existing.shape} ---")
return existing
tickers = list(existing.columns)
start = (last_date + timedelta(days=1)).strftime("%Y-%m-%d")
print(f"--- Appending new dates from {start} for {len(tickers)} tickers ---")
new = _download_batched(tickers, start=start)
if new is not None and not new.empty:
combined = pd.concat([existing, new]).sort_index()
combined = combined[~combined.index.duplicated(keep="last")]
combined.to_csv(OUT_PATH)
print(f"--- Saved {combined.shape} to {OUT_PATH} ---")
return combined
return existing
else:
print(f"--- Have {existing.shape[1]} cols; need {len(missing)} more ---")
tickers = missing
start = (datetime.now() - timedelta(days=365 * YEARS)).strftime("%Y-%m-%d")
new = _download_batched(tickers, start=start)
if existing is not None and new is not None and not new.empty:
combined = pd.concat([existing, new.reindex(existing.index)], axis=1)
# Add any new rows from `new` not in existing
new_only_idx = new.index.difference(existing.index)
if len(new_only_idx) > 0:
combined_new = new.loc[new_only_idx].reindex(columns=combined.columns)
combined = pd.concat([combined, combined_new]).sort_index()
else:
combined = new
combined.to_csv(OUT_PATH)
print(f"--- Saved {combined.shape} to {OUT_PATH} ---")
return combined
def fetch_all_historical_ohlcv(force: bool = False) -> dict[str, pd.DataFrame]:
os.makedirs(DATA_DIR, exist_ok=True)
intervals = uh.load_sp500_history()
tickers = uh.all_tickers_ever(intervals) + ["SPY"]
tickers = sorted(set(tickers))
start = (datetime.now() - timedelta(days=365 * YEARS)).strftime("%Y-%m-%d")
panels = _download_batched_fields(tickers, start=start, fields=["Close", "High", "Low", "Volume"])
if not panels:
raise RuntimeError("No PIT OHLCV data downloaded")
close = panels["Close"]
close.to_csv(OUT_PATH)
print(f"--- Saved {close.shape} to {OUT_PATH} ---")
result: dict[str, pd.DataFrame] = {"close": close}
for field, path in _field_out_paths().items():
panel = panels[field]
panel.to_csv(path)
print(f"--- Saved {panel.shape} to {path} ---")
result[field.lower()] = panel
return result
def _download_batched(tickers: list[str], start: str) -> pd.DataFrame | None:
panels = _download_batched_fields(tickers, start=start, fields=["Close"])
if not panels:
return None
return panels["Close"]
def _download_batched_fields(
tickers: list[str],
start: str,
fields: list[str],
) -> dict[str, pd.DataFrame]:
frames = {field: [] for field in fields}
n = len(tickers)
for i in range(0, n, BATCH_SIZE):
batch = tickers[i:i + BATCH_SIZE]
print(f" [{i}/{n}] fetching {len(batch)} tickers...", flush=True)
try:
raw = yf.download(batch, start=start, auto_adjust=True,
progress=False, threads=True)
if raw.empty:
continue
for field in fields:
if isinstance(raw.columns, pd.MultiIndex):
panel = raw[field]
else:
panel = raw[[field]].rename(columns={field: batch[0]})  # flat columns: single-ticker result, keyed by the batch ticker
panel = panel.dropna(axis=1, how="all")
if not panel.empty:
frames[field].append(panel)
except Exception as e:
print(f" batch failed: {e}")
result = {}
for field, field_frames in frames.items():
if field_frames:
panel = pd.concat(field_frames, axis=1).sort_index()
panel = panel.loc[:, ~panel.columns.duplicated()]
result[field] = panel
else:
result[field] = pd.DataFrame()
return result
if __name__ == "__main__":
fetch_all_historical()

299
research/optimize.py Normal file
View File

@@ -0,0 +1,299 @@
"""
End-to-end optimization study for the US recovery+momentum strategy family,
run on a point-in-time (survivorship-bias-mitigated) S&P 500 universe.
Experiments:
E1 — Baseline drift: biased vs point-in-time universe, current top10 params.
E2 — Hyperparameter sweep with 2016-2022 train / 2023-2026 test split.
E3 — SPY MA200 regime filter (compare base vs filtered).
E4 — Weighting schemes: equal vs inverse-vol vs rank.
E5 — Ensemble of top-3 uncorrelated configs.
Usage: uv run python -m research.optimize
"""
import os
import numpy as np
import pandas as pd
import data_manager
import research.pit_backtest as pit
from research.strategies_plus import (EnsembleStrategy, RecoveryMomentumPlus,
spy_ma200_filter)
from strategies.recovery_momentum import RecoveryMomentumStrategy
DATA_DIR = "data"
BIASED_CSV = os.path.join(DATA_DIR, "us.csv")
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def slice_period(df: pd.DataFrame, start: str | None, end: str | None) -> pd.DataFrame:
out = df
if start:
out = out[out.index >= start]
if end:
out = out[out.index <= end]
return out
def run_strategy(strategy, prices, benchmark=None, regime_filter=None,
fixed_fee: float = 0.0) -> pd.Series:
return pit.backtest(
strategy=strategy, prices=prices, initial_capital=10_000,
transaction_cost=0.001, fixed_fee=fixed_fee,
benchmark=benchmark, regime_filter=regime_filter,
)
# ---------------------------------------------------------------------------
# Experiment 1: bias drift
# ---------------------------------------------------------------------------
def exp1_bias_drift(pit_prices_masked: pd.DataFrame) -> pd.DataFrame:
print("\n" + "=" * 90)
print("E1 — Biased universe vs Point-in-time universe (recovery_mom_top10)")
print("=" * 90)
rows = []
# Biased: current 503 tickers extrapolated backward
biased = pd.read_csv(BIASED_CSV, index_col=0, parse_dates=True)
# Use same date range as PIT for a fair comparison
common_start = max(biased.index[0], pit_prices_masked.index[0])
common_end = min(biased.index[-1], pit_prices_masked.index[-1])
biased_window = slice_period(biased, str(common_start.date()), str(common_end.date()))
pit_window = slice_period(pit_prices_masked, str(common_start.date()), str(common_end.date()))
# Drop non-ticker columns (SPY is in PIT but not in the masked tickers)
biased_tickers = [c for c in biased_window.columns if c != "SPY"]
pit_tickers = [c for c in pit_window.columns if c != "SPY"]
# Use RecoveryMomentumPlus with identical defaults to recovery_mom_top10.
# The original strategy uses na_option="bottom" which misranks NaN-masked
# data (non-members appear "top"); the Plus variant uses na_option="keep".
strat = RecoveryMomentumPlus(top_n=10) # defaults match RecoveryMomentumStrategy
eq_biased = run_strategy(strat, biased_window[biased_tickers])
eq_pit = run_strategy(RecoveryMomentumPlus(top_n=10), pit_window[pit_tickers])
rows.append(pit.summarize(eq_biased, name="recovery_mom_top10 (BIASED)"))
rows.append(pit.summarize(eq_pit, name="recovery_mom_top10 (POINT-IN-TIME)"))
# Benchmark: SPY buy-and-hold in same window
if "SPY" in biased_window.columns:
spy_bh = (biased_window["SPY"] / biased_window["SPY"].iloc[0]) * 10_000
rows.append(pit.summarize(spy_bh, name="SPY buy-and-hold"))
for r in rows:
print(pit.fmt_row(r))
return pd.DataFrame(rows)
# ---------------------------------------------------------------------------
# Experiment 2: hyperparameter sweep with train/test split
# ---------------------------------------------------------------------------
def exp2_sweep(pit_masked: pd.DataFrame) -> pd.DataFrame:
print("\n" + "=" * 90)
print("E2 — Hyperparameter sweep (train: 2016-2022, test: 2023-2026)")
print("=" * 90)
tickers = [c for c in pit_masked.columns if c != "SPY"]
prices = pit_masked[tickers]
train = slice_period(prices, "2016-04-01", "2022-12-31")
test = slice_period(prices, "2023-01-01", None)
grid = []
for top_n in [5, 8, 10, 15]:
for rec_win in [42, 63, 126]:
for rec_w in [0.3, 0.5, 0.7]:
for rebal in [10, 21]:
grid.append(dict(top_n=top_n, recovery_window=rec_win,
rec_weight=rec_w, rebal_freq=rebal))
results = []
for i, cfg in enumerate(grid):
strat_train = RecoveryMomentumPlus(**cfg)
eq_tr = run_strategy(strat_train, train)
sum_tr = pit.summarize(eq_tr, name="train")
strat_test = RecoveryMomentumPlus(**cfg)
eq_te = run_strategy(strat_test, test)
sum_te = pit.summarize(eq_te, name="test")
results.append({
**cfg,
"train_CAGR": sum_tr["CAGR"],
"train_Sharpe": sum_tr["Sharpe"],
"train_MaxDD": sum_tr["MaxDD"],
"test_CAGR": sum_te["CAGR"],
"test_Sharpe": sum_te["Sharpe"],
"test_MaxDD": sum_te["MaxDD"],
"test_Calmar": sum_te["Calmar"],
})
if (i + 1) % 10 == 0 or i == len(grid) - 1:
print(f" ... {i+1}/{len(grid)} configs evaluated")
df = pd.DataFrame(results)
df = df.sort_values("test_Sharpe", ascending=False)
# Print top 10 by TEST Sharpe, then top 10 by TRAIN Sharpe to see overfit gap
print("\n --- Top 10 by TEST Sharpe (out-of-sample, 2023-2026) ---")
disp_cols = ["top_n", "recovery_window", "rec_weight", "rebal_freq",
"train_Sharpe", "test_Sharpe", "train_CAGR", "test_CAGR",
"test_MaxDD", "test_Calmar"]
print(df.head(10)[disp_cols].to_string(index=False,
formatters={"train_Sharpe": "{:.2f}".format, "test_Sharpe": "{:.2f}".format,
"train_CAGR": "{:.1%}".format, "test_CAGR": "{:.1%}".format,
"test_MaxDD": "{:.1%}".format, "test_Calmar": "{:.2f}".format}))
print("\n --- Top 10 by TRAIN Sharpe (for comparison / overfit check) ---")
df_tr = df.sort_values("train_Sharpe", ascending=False)
print(df_tr.head(10)[disp_cols].to_string(index=False,
formatters={"train_Sharpe": "{:.2f}".format, "test_Sharpe": "{:.2f}".format,
"train_CAGR": "{:.1%}".format, "test_CAGR": "{:.1%}".format,
"test_MaxDD": "{:.1%}".format, "test_Calmar": "{:.2f}".format}))
return df
# ---------------------------------------------------------------------------
# Experiment 3: regime filter
# ---------------------------------------------------------------------------
def exp3_regime(pit_masked: pd.DataFrame) -> pd.DataFrame:
print("\n" + "=" * 90)
print("E3 — SPY MA200 regime filter (out-of-sample 2023-2026)")
print("=" * 90)
tickers = [c for c in pit_masked.columns if c != "SPY"]
# Compute MA from FULL history so the filter is warmed up before 2023.
spy_full = pit_masked["SPY"].dropna() if "SPY" in pit_masked.columns else None
filt_full_200 = spy_ma200_filter(spy_full, ma_window=200) if spy_full is not None else None
filt_full_150 = spy_ma200_filter(spy_full, ma_window=150) if spy_full is not None else None
test = slice_period(pit_masked, "2023-01-01", None)
prices = test[tickers]
filt = filt_full_200.reindex(test.index).fillna(False).astype(bool) if filt_full_200 is not None else None
filt_150 = filt_full_150.reindex(test.index).fillna(False).astype(bool) if filt_full_150 is not None else None
rows = []
base = RecoveryMomentumPlus(top_n=10)
rows.append(pit.summarize(run_strategy(base, prices), name="top10 (no filter)"))
rows.append(pit.summarize(run_strategy(RecoveryMomentumPlus(top_n=10), prices,
regime_filter=filt),
name="top10 + SPY>MA200 filter"))
rows.append(pit.summarize(run_strategy(RecoveryMomentumPlus(top_n=10), prices,
regime_filter=filt_150),
name="top10 + SPY>MA150 filter"))
for r in rows:
print(pit.fmt_row(r))
return pd.DataFrame(rows)
# ---------------------------------------------------------------------------
# Experiment 4: weighting schemes
# ---------------------------------------------------------------------------
def exp4_weighting(pit_masked: pd.DataFrame) -> pd.DataFrame:
print("\n" + "=" * 90)
print("E4 — Weighting schemes (out-of-sample 2023-2026, top_n=10)")
print("=" * 90)
tickers = [c for c in pit_masked.columns if c != "SPY"]
test = slice_period(pit_masked[tickers], "2023-01-01", None)
rows = []
for w in ["equal", "inv_vol", "rank"]:
strat = RecoveryMomentumPlus(top_n=10, weighting=w)
eq = run_strategy(strat, test)
rows.append(pit.summarize(eq, name=f"top10 weighting={w}"))
for r in rows:
print(pit.fmt_row(r))
return pd.DataFrame(rows)
# ---------------------------------------------------------------------------
# Experiment 5: ensemble
# ---------------------------------------------------------------------------
def exp5_ensemble(pit_masked: pd.DataFrame, sweep_df: pd.DataFrame) -> pd.DataFrame:
print("\n" + "=" * 90)
print("E5 — Ensemble of 3 uncorrelated top configs (out-of-sample 2023-2026)")
print("=" * 90)
tickers = [c for c in pit_masked.columns if c != "SPY"]
test = slice_period(pit_masked[tickers], "2023-01-01", None)
# Pick top-20 by test_Sharpe, then greedily keep picks whose equity curves
# correlate < 0.9 with already-kept picks.
top20 = sweep_df.sort_values("test_Sharpe", ascending=False).head(20)
curves = []
components = []
for _, row in top20.iterrows():
cfg = dict(top_n=int(row["top_n"]),
recovery_window=int(row["recovery_window"]),
rec_weight=float(row["rec_weight"]),
rebal_freq=int(row["rebal_freq"]))
strat = RecoveryMomentumPlus(**cfg)
eq = run_strategy(strat, test)
if any(eq.pct_change().corr(c.pct_change()) > 0.9 for c in curves):
continue
curves.append(eq)
components.append((RecoveryMomentumPlus(**cfg), 1.0))
if len(components) >= 3:
break
print(f" Selected {len(components)} uncorrelated configs for ensemble:")
for strat, _ in components:
print(f" top_n={strat.top_n}, rec_win={strat.recovery_window}, "
f"rec_w={strat.rec_weight}, rebal={strat.rebal_freq}")
ens = EnsembleStrategy(components)
eq_ens = run_strategy(ens, test)
rows = [
pit.summarize(curves[i], name=f" component {i+1}") for i in range(len(curves))
]
rows.append(pit.summarize(eq_ens, name="ENSEMBLE (equal-weight)"))
# Also ensemble + regime filter (compute MA from full history)
if "SPY" in pit_masked.columns:
spy_full = pit_masked["SPY"].dropna()
filt = spy_ma200_filter(spy_full).reindex(test.index).fillna(False).astype(bool)
eq_ens_reg = run_strategy(EnsembleStrategy(components), test, regime_filter=filt)
rows.append(pit.summarize(eq_ens_reg, name="ENSEMBLE + SPY>MA200 filter"))
for r in rows:
print(pit.fmt_row(r))
return pd.DataFrame(rows)
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
print("Loading point-in-time price data...")
raw = pit.load_pit_prices()
print(f" Raw (union) shape: {raw.shape}, {raw.index[0].date()}{raw.index[-1].date()}")
masked = pit.pit_universe(raw)
# Sanity: how many ticker-days are masked out?
total = masked.size
valid = masked.notna().sum().sum()
print(f" Point-in-time valid ticker-days: {valid:,} / {total:,} ({valid/total*100:.1f}%)")
daily_universe = masked.notna().sum(axis=1)
print(f" Universe size per day: min={daily_universe.min()}, median={int(daily_universe.median())}, max={daily_universe.max()}")
e1 = exp1_bias_drift(masked)
sweep = exp2_sweep(masked)
e3 = exp3_regime(masked)
e4 = exp4_weighting(masked)
e5 = exp5_ensemble(masked, sweep)
# Save sweep for inspection
out = os.path.join(DATA_DIR, "research_sweep.csv")
sweep.to_csv(out, index=False)
print(f"\n Full sweep saved to {out}")
if __name__ == "__main__":
main()
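
For quick follow-up analysis, the saved sweep can be re-ranked offline without re-running any backtests. A minimal sketch, assuming the run above completed with the default DATA_DIR (column names match the dict assembled in exp2_sweep):

# Hedged follow-up sketch: re-rank the saved sweep by out-of-sample Calmar instead of Sharpe.
import pandas as pd

sweep = pd.read_csv("data/research_sweep.csv")
cols = ["top_n", "recovery_window", "rec_weight", "rebal_freq",
        "test_Sharpe", "test_CAGR", "test_Calmar"]
print(sweep.sort_values("test_Calmar", ascending=False).head(5)[cols].to_string(index=False))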

125
research/pit_backtest.py Normal file
View File

@@ -0,0 +1,125 @@
"""
Point-in-time backtest runner.
Key idea: mask price data to NaN outside S&P 500 membership windows before
passing to the strategy. The strategy's signal computations then naturally
exclude non-members — no refactoring of strategies required.
Caveat: a stock newly added to the index has no signal for ~252 days afterwards
(rolling windows need a non-NaN warm-up). This is conservative but unbiased.
"""
import os
import numpy as np
import pandas as pd
import metrics
import universe_history as uh
DATA_DIR = "data"
PIT_CSV = os.path.join(DATA_DIR, "us_pit.csv")
# ---------------------------------------------------------------------------
# Data loading
# ---------------------------------------------------------------------------
def load_pit_prices() -> pd.DataFrame:
"""Load the full historical S&P 500 price matrix (delisted included)."""
if not os.path.exists(PIT_CSV):
raise FileNotFoundError(
f"{PIT_CSV} not found. Run `uv run python -m research.fetch_historical` first."
)
df = pd.read_csv(PIT_CSV, index_col=0, parse_dates=True)
return df.sort_index()
def pit_universe(prices: pd.DataFrame) -> pd.DataFrame:
"""Return prices masked to S&P 500 membership at each date (NaN outside)."""
intervals = uh.load_sp500_history()
return uh.mask_prices(prices, intervals)
# ---------------------------------------------------------------------------
# Backtest engine (mirrors main.backtest but accepts masked prices)
# ---------------------------------------------------------------------------
def backtest(
strategy,
prices: pd.DataFrame,
initial_capital: float = 10_000,
transaction_cost: float = 0.001,
fixed_fee: float = 0.0,
benchmark: pd.Series | None = None,
regime_filter: pd.Series | None = None,
) -> pd.Series:
"""
Vectorized backtest with optional regime filter.
`regime_filter`: boolean series aligned to prices.index. True → be in the
market (use strategy weights). False → go to cash. When None, always invested.
"""
weights = strategy.generate_signals(prices)
weights = weights.reindex(prices.index).fillna(0.0)
if regime_filter is not None:
rf = regime_filter.reindex(prices.index).fillna(False).astype(float)
weights = weights.mul(rf, axis=0)
daily_returns = prices.pct_change().fillna(0.0)
portfolio_returns = (daily_returns * weights).sum(axis=1)
turnover = weights.diff().abs().sum(axis=1).fillna(0.0)
portfolio_returns -= turnover * transaction_cost
if fixed_fee > 0:
weight_changes = weights.diff().fillna(0.0)
n_trades = (weight_changes.abs() > 1e-8).sum(axis=1)
equity_running = (1 + portfolio_returns).cumprod() * initial_capital
fee_impact = (n_trades * fixed_fee) / equity_running.shift(1).fillna(initial_capital)
portfolio_returns -= fee_impact
equity = (1 + portfolio_returns).cumprod() * initial_capital
return equity
# ---------------------------------------------------------------------------
# Metrics helper
# ---------------------------------------------------------------------------
def summarize(equity: pd.Series, name: str = "") -> dict:
"""Return a dict of key performance metrics (no printing)."""
eq = equity.dropna()
if len(eq) < 2:
return {"name": name, "error": "insufficient data"}
daily = eq.pct_change().dropna()
total_return = eq.iloc[-1] / eq.iloc[0] - 1
years = (eq.index[-1] - eq.index[0]).days / 365.25
cagr = (eq.iloc[-1] / eq.iloc[0]) ** (1 / years) - 1 if years > 0 else 0.0
vol = daily.std() * np.sqrt(252)
sharpe = (daily.mean() * 252) / vol if vol > 0 else 0.0
downside = daily[daily < 0].std() * np.sqrt(252)
sortino = (daily.mean() * 252) / downside if downside > 0 else 0.0
dd = (eq / eq.cummax() - 1).min()
calmar = cagr / abs(dd) if dd < 0 else 0.0
return {
"name": name,
"CAGR": cagr,
"Sharpe": sharpe,
"Sortino": sortino,
"MaxDD": dd,
"Calmar": calmar,
"TotalRet": total_return,
"Vol": vol,
}
def fmt_row(r: dict) -> str:
return (f" {r['name']:<38s} "
f"CAGR={r['CAGR']*100:>6.1f}% "
f"Sharpe={r['Sharpe']:>5.2f} "
f"Sortino={r['Sortino']:>5.2f} "
f"MaxDD={r['MaxDD']*100:>6.1f}% "
f"Calmar={r['Calmar']:>5.2f} "
f"Total={r['TotalRet']*100:>7.1f}%")

23
research/regime_filters.py Normal file
View File

@@ -0,0 +1,23 @@
import pandas as pd
LONG_MA_WINDOW = 200
RS_WINDOW = 63
def build_regime_filter(etf_close: pd.DataFrame, market_col: str = "SPY") -> pd.Series:
"""Return a next-day tradable regime flag based on market trend and ETF leadership."""
prices = etf_close.sort_index()
if market_col not in prices.columns:
raise KeyError(f"{market_col} not found in etf_close")
market = prices[market_col]
market_ma = market.rolling(LONG_MA_WINDOW, min_periods=LONG_MA_WINDOW).mean()
market_ok = market.gt(market_ma)
rs = prices.pct_change(RS_WINDOW, fill_method=None)
non_market_rs = rs.drop(columns=[market_col], errors="ignore")
leader_ok = non_market_rs.gt(rs[market_col], axis=0).any(axis=1)
regime = (market_ok & leader_ok).astype(bool)
return regime.shift(1, fill_value=False)
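
A small illustrative check, mirroring the unit tests in tests/test_alpha_signals.py: with SPY trending above its 200-day MA and QQQ outpacing it over the 63-day RS window, the flag turns on one day later by construction. The tickers and drifts below are made up for illustration:

import pandas as pd
from research.regime_filters import build_regime_filter

dates = pd.date_range("2023-01-01", periods=260, freq="D")
etf_close = pd.DataFrame({
    "SPY": [100.0 + 1.0 * i for i in range(260)],
    "QQQ": [100.0 + 1.4 * i for i in range(260)],   # leads SPY on 63-day relative strength
}, index=dates)
regime = build_regime_filter(etf_close)
print(bool(regime.iloc[-1]))   # True once the MA200 has warmed up and a leader exists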

150
research/strategies_plus.py Normal file
View File

@@ -0,0 +1,150 @@
"""
Optimization variants of RecoveryMomentumStrategy.
Four dimensions explored:
1. Hyperparameters (top_n, recovery_window, mom_lookback, rebal_freq, weights)
2. Regime filter: zero-out weights when SPY < MA200
3. Weighting scheme: equal / inverse-vol / rank-weighted
4. Ensemble: weighted blend of multiple strategies
All strategies follow the same Strategy protocol (generate_signals → weights DF).
"""
import numpy as np
import pandas as pd
from strategies.base import Strategy
# ---------------------------------------------------------------------------
# Generalized Recovery+Momentum strategy
# ---------------------------------------------------------------------------
class RecoveryMomentumPlus(Strategy):
"""
Recovery + momentum composite with configurable blend, weighting, and
regime filter hooks.
Parameters
----------
recovery_window : int
Lookback for the recovery factor (price / rolling min - 1).
mom_lookback : int
Long-horizon momentum window total length.
mom_skip : int
Short-term reversal skip for momentum.
rebal_freq : int
Trading-day rebalance interval.
top_n : int
Number of stocks selected each rebalance.
rec_weight : float in [0, 1]
Weight of recovery factor in composite rank blend (mom_weight = 1 - rec_weight).
weighting : {"equal", "inv_vol", "rank"}
Portfolio weighting scheme for the selected top_n.
vol_window : int
Volatility lookback when weighting="inv_vol".
"""
def __init__(self,
recovery_window: int = 63,
mom_lookback: int = 252,
mom_skip: int = 21,
rebal_freq: int = 21,
top_n: int = 10,
rec_weight: float = 0.5,
weighting: str = "equal",
vol_window: int = 60):
if weighting not in ("equal", "inv_vol", "rank"):
raise ValueError(f"weighting must be equal|inv_vol|rank, got {weighting!r}")
self.recovery_window = recovery_window
self.mom_lookback = mom_lookback
self.mom_skip = mom_skip
self.rebal_freq = rebal_freq
self.top_n = top_n
self.rec_weight = rec_weight
self.weighting = weighting
self.vol_window = vol_window
def generate_signals(self, data: pd.DataFrame) -> pd.DataFrame:
# Factors
recovery = data / data.rolling(self.recovery_window).min() - 1
momentum = data.shift(self.mom_skip).pct_change(self.mom_lookback - self.mom_skip)
rec_rank = recovery.rank(axis=1, pct=True, na_option="keep")
mom_rank = momentum.rank(axis=1, pct=True, na_option="keep")
composite = self.rec_weight * rec_rank + (1 - self.rec_weight) * mom_rank
# Top-N selection
rank = composite.rank(axis=1, ascending=False, na_option="bottom")
n_valid = composite.notna().sum(axis=1)
enough = n_valid >= self.top_n
top_mask = (rank <= self.top_n) & enough.values.reshape(-1, 1)
# Weighting within top-N
if self.weighting == "equal":
raw = top_mask.astype(float)
elif self.weighting == "rank":
# Higher composite → higher weight within top-N
ranked_score = composite.where(top_mask, 0.0)
raw = ranked_score
elif self.weighting == "inv_vol":
# Use inverse realized-volatility as weights within top-N
rets = data.pct_change()
vol = rets.rolling(self.vol_window).std()
inv_vol = 1.0 / vol.replace(0, np.nan)
raw = inv_vol.where(top_mask, 0.0).fillna(0.0)
row_sums = raw.sum(axis=1).replace(0, np.nan)
signals = raw.div(row_sums, axis=0).fillna(0.0)
# Rebalance
warmup = max(self.mom_lookback, self.recovery_window, self.vol_window)
rebal_mask = pd.Series(False, index=data.index)
rebal_indices = list(range(warmup, len(data), self.rebal_freq))
rebal_mask.iloc[rebal_indices] = True
signals[~rebal_mask] = np.nan
signals = signals.ffill().fillna(0.0)
signals.iloc[:warmup] = 0.0
return signals.shift(1).fillna(0.0)
# ---------------------------------------------------------------------------
# Ensemble
# ---------------------------------------------------------------------------
class EnsembleStrategy(Strategy):
"""
Weighted blend of several sub-strategies. Each sub-strategy produces a
weight matrix; we linearly combine them. The result still sums to (at
most) 1 per row since each sub-strategy does.
"""
def __init__(self, components: list[tuple[Strategy, float]]):
total = sum(w for _, w in components)
self.components = [(s, w / total) for s, w in components]
def generate_signals(self, data: pd.DataFrame) -> pd.DataFrame:
out = None
for strat, w in self.components:
sig = strat.generate_signals(data).mul(w)
if out is None:
out = sig
else:
# Align columns (should be identical since same data passed)
out = out.add(sig, fill_value=0.0)
return out
# ---------------------------------------------------------------------------
# Regime filter helper
# ---------------------------------------------------------------------------
def spy_ma200_filter(spy: pd.Series, ma_window: int = 200) -> pd.Series:
"""
Boolean Series: True when SPY close > SPY MA(ma_window), shifted by 1 to
avoid lookahead. Use as `regime_filter=...` in pit_backtest.backtest().
"""
ma = spy.rolling(ma_window, min_periods=ma_window).mean()
signal = (spy > ma).fillna(False)
return signal.shift(1).fillna(False)
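
A toy sketch of the ensemble mechanics on synthetic prices (the ticker names and random panel are made up): component weights (2, 1, 1) are normalized to (0.5, 0.25, 0.25) in __init__, so the blended signal's gross exposure stays at or below 1 per row:

import numpy as np
import pandas as pd
from research.strategies_plus import EnsembleStrategy, RecoveryMomentumPlus

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=600, freq="B")
prices = pd.DataFrame(
    100.0 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, size=(600, 30)), axis=0)),
    index=dates,
    columns=[f"T{i:02d}" for i in range(30)],
)
ensemble = EnsembleStrategy([
    (RecoveryMomentumPlus(top_n=5, rec_weight=0.7), 2.0),
    (RecoveryMomentumPlus(top_n=10, weighting="inv_vol"), 1.0),
    (RecoveryMomentumPlus(top_n=10, weighting="rank", rebal_freq=10), 1.0),
])
weights = ensemble.generate_signals(prices)
print(weights.sum(axis=1).max())   # <= 1.0: long-only, fully or partially invested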

156
research/us_alpha_pipeline.py Normal file
View File

@@ -0,0 +1,156 @@
import numpy as np
import pandas as pd
import data_manager
import universe_history as uh
from research.event_factors import breakout_after_compression_score
from research.regime_filters import build_regime_filter
from research.us_alpha_report import summarize_equity_window
from research.us_universe import build_tradable_mask
MIN_PRICE = 5.0
MIN_DOLLAR_VOLUME = 20_000_000.0
MIN_HISTORY_DAYS = 252
MIN_VALID_VOLUME_DAYS = 40
LIQUIDITY_WINDOW = 60
TREND_WINDOW = 126
RECOVERY_WINDOW = 63
HIGH_PROX_WINDOW = 126
ETF_TICKERS = ["SPY", "QQQ", "IWM", "MDY", "XLK", "XLF", "XLI", "XLV"]
def _price_rank_blend_score(close: pd.DataFrame) -> pd.DataFrame:
"""Simple price-only cross-sectional blend, shifted for next-day trading."""
trend = close.pct_change(TREND_WINDOW, fill_method=None)
recovery = close / close.rolling(RECOVERY_WINDOW, min_periods=RECOVERY_WINDOW).min() - 1
high_proximity = close / close.rolling(HIGH_PROX_WINDOW, min_periods=HIGH_PROX_WINDOW).max().replace(0, np.nan)
trend_rank = trend.rank(axis=1, pct=True, na_option="keep")
recovery_rank = recovery.rank(axis=1, pct=True, na_option="keep")
high_rank = high_proximity.rank(axis=1, pct=True, na_option="keep")
return ((trend_rank + recovery_rank + high_rank) / 3.0).shift(1)
def _build_equal_weight_portfolio(
score: pd.DataFrame,
tradable_mask: pd.DataFrame,
regime_filter: pd.Series,
top_n: int,
) -> pd.DataFrame:
"""Build equal-weight top-n long-only weights from aligned scores."""
aligned_score = score.reindex(index=tradable_mask.index, columns=tradable_mask.columns)
eligible_score = aligned_score.where(tradable_mask)
rank = eligible_score.rank(axis=1, ascending=False, na_option="bottom", method="first")
selected = (rank <= top_n) & eligible_score.notna()
selected = selected & regime_filter.reindex(tradable_mask.index, fill_value=False).to_numpy().reshape(-1, 1)
raw = selected.astype(float)
row_sums = raw.sum(axis=1).replace(0.0, np.nan)
return raw.div(row_sums, axis=0).fillna(0.0)
def _equity_curve(close: pd.DataFrame, weights: pd.DataFrame) -> pd.Series:
"""Convert daily weights into a simple close-to-close equity curve."""
returns = close.pct_change(fill_method=None).fillna(0.0)
portfolio_returns = (returns * weights).sum(axis=1)
return (1.0 + portfolio_returns).cumprod()
def _read_panel_csv(path: str) -> pd.DataFrame:
return pd.read_csv(path, index_col=0, parse_dates=True).sort_index()
def load_saved_pit_market_data(data_dir: str = "data", prefix: str = "us_pit") -> dict[str, pd.DataFrame]:
"""Load saved PIT OHLCV panels from disk."""
panels = {}
for field in ("close", "high", "low", "volume"):
panels[field] = _read_panel_csv(f"{data_dir}/{prefix}_{field}.csv")
return panels
def load_saved_etf_close(data_dir: str = "data", market: str = "us_etf") -> pd.DataFrame:
"""Load saved ETF closes or populate them on demand."""
path = f"{data_dir}/{market}.csv"
try:
return _read_panel_csv(path)
except FileNotFoundError:
original_data_dir = data_manager.DATA_DIR
try:
data_manager.DATA_DIR = data_dir
return data_manager.update_market_data(market, ETF_TICKERS, ["close"])["close"]
finally:
data_manager.DATA_DIR = original_data_dir
def run_alpha_pipeline(
market_data,
etf_close,
pit_membership=None,
windows=(1, 2, 3, 5, 10),
top_n=10,
) -> pd.DataFrame:
"""Run a lightweight strict US alpha pipeline and summarize trailing windows."""
close = market_data["close"].sort_index()
high = market_data["high"].reindex(index=close.index, columns=close.columns).sort_index()
low = market_data["low"].reindex(index=close.index, columns=close.columns).sort_index()
volume = market_data["volume"].reindex(index=close.index, columns=close.columns).sort_index()
tradable_mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=pit_membership,
min_price=MIN_PRICE,
min_dollar_volume=MIN_DOLLAR_VOLUME,
min_history_days=MIN_HISTORY_DAYS,
min_valid_volume_days=MIN_VALID_VOLUME_DAYS,
liquidity_window=LIQUIDITY_WINDOW,
)
regime_filter = build_regime_filter(etf_close).reindex(close.index, fill_value=False)
strategy_scores = {
"breakout_regime": breakout_after_compression_score(close, high, low, volume),
"rank_blend_regime": _price_rank_blend_score(close),
}
summary_rows = []
for strategy_name, score in strategy_scores.items():
weights = _build_equal_weight_portfolio(score, tradable_mask, regime_filter, top_n)
equity = _equity_curve(close, weights)
for window_years in windows:
summary_rows.append(summarize_equity_window(equity, strategy_name, window_years))
return pd.DataFrame(summary_rows)
def run_saved_pit_alpha_pipeline(
data_dir: str = "data",
windows=(1, 2, 3, 5, 10),
top_n: int = 10,
) -> pd.DataFrame:
"""Load saved PIT OHLCV inputs and run the strict alpha pipeline."""
market_data = load_saved_pit_market_data(data_dir=data_dir)
etf_close = load_saved_etf_close(data_dir=data_dir)
intervals = uh.load_sp500_history()
pit_membership = uh.membership_mask(
market_data["close"].index,
intervals=intervals,
tickers=list(market_data["close"].columns),
)
return run_alpha_pipeline(
market_data=market_data,
etf_close=etf_close,
pit_membership=pit_membership,
windows=windows,
top_n=top_n,
)
def main() -> None:
summary = run_saved_pit_alpha_pipeline()
print(summary.to_string(index=False))
if __name__ == "__main__":
main()
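
run_alpha_pipeline can also be driven from in-memory panels (the shape the unit tests use), which is handy before the saved PIT files exist. A hedged sketch with made-up OHLCV frames; with constant toy volume the breakout signal may be flat or NaN, but the pipeline still returns one summary row per strategy and window:

import pandas as pd
from research.us_alpha_pipeline import run_alpha_pipeline

dates = pd.date_range("2023-01-01", periods=400, freq="D")
close = pd.DataFrame(
    {t: [50.0 + drift * i for i in range(400)]
     for t, drift in [("AAA", 0.20), ("BBB", 0.12), ("CCC", 0.05)]},
    index=dates,
)
market_data = {
    "close": close,
    "high": close + 1.0,
    "low": close - 1.0,
    "volume": pd.DataFrame(1_500_000.0, index=dates, columns=close.columns),
}
etf_close = pd.DataFrame(
    {"SPY": [300.0 + 0.8 * i for i in range(400)],
     "QQQ": [280.0 + 1.1 * i for i in range(400)]},
    index=dates,
)
summary = run_alpha_pipeline(market_data=market_data, etf_close=etf_close,
                             pit_membership=None, windows=(1,), top_n=2)
print(summary.to_string(index=False))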

37
research/us_alpha_report.py Normal file
View File

@@ -0,0 +1,37 @@
import numpy as np
import pandas as pd
TRADING_DAYS_PER_YEAR = 252
def summarize_equity_window(equity: pd.Series, strategy: str, window_years: int | float) -> dict:
"""Summarize a strategy equity curve over a trailing trading-day window."""
window_days = max(int(window_years * TRADING_DAYS_PER_YEAR), 1)
clean_equity = equity.dropna()
if len(clean_equity) < window_days + 1:
return {
"strategy": strategy,
"window_years": window_years,
"CAGR": np.nan,
"Sharpe": np.nan,
"MaxDD": np.nan,
"TotalRet": np.nan,
}
window_equity = clean_equity.tail(window_days + 1)
daily = window_equity.pct_change(fill_method=None).dropna()
total_ret = window_equity.iloc[-1] / window_equity.iloc[0] - 1
years = len(daily) / TRADING_DAYS_PER_YEAR
cagr = (window_equity.iloc[-1] / window_equity.iloc[0]) ** (1 / years) - 1 if years > 0 else np.nan
vol = daily.std() * np.sqrt(TRADING_DAYS_PER_YEAR)
sharpe = (daily.mean() * TRADING_DAYS_PER_YEAR) / vol if vol > 0 else 0.0
max_dd = (window_equity / window_equity.cummax() - 1).min()
return {
"strategy": strategy,
"window_years": window_years,
"CAGR": cagr,
"Sharpe": sharpe,
"MaxDD": max_dd,
"TotalRet": total_ret,
}
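
A quick sanity sketch of the window semantics: a 1-year window needs 252 daily returns, i.e. 253 equity points, so a request longer than the available history comes back as NaN. The equity series below is synthetic:

import pandas as pd
from research.us_alpha_report import summarize_equity_window

dates = pd.date_range("2023-01-01", periods=300, freq="B")
equity = pd.Series([1.0004 ** i for i in range(300)], index=dates)      # ~0.04% per day
print(summarize_equity_window(equity, "demo", window_years=1)["CAGR"])  # ~ 1.0004**252 - 1 ~ 10.6%
print(summarize_equity_window(equity, "demo", window_years=2)["CAGR"])  # NaN: needs 505 points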

53
research/us_universe.py Normal file
View File

@@ -0,0 +1,53 @@
import pandas as pd
def build_tradable_mask(
close: pd.DataFrame,
volume: pd.DataFrame,
pit_membership: pd.DataFrame | None,
min_price: float,
min_dollar_volume: float,
min_history_days: int,
min_valid_volume_days: int,
liquidity_window: int = 60,
) -> pd.DataFrame:
"""Build a point-in-time tradable universe mask using only lagged inputs."""
close = close.sort_index()
volume = volume.reindex(index=close.index, columns=close.columns).sort_index()
if pit_membership is None:
pit_mask = pd.DataFrame(True, index=close.index, columns=close.columns)
else:
pit_mask = pit_membership.reindex(
index=close.index,
columns=close.columns,
fill_value=False,
)
pit_mask = pit_mask.where(pit_mask.notna(), False).astype(bool)
eligible_close = close.where(pit_mask)
eligible_volume = volume.where(pit_mask)
lagged_close = eligible_close.shift(1)
lagged_volume = eligible_volume.shift(1)
lagged_dollar_volume = lagged_close * lagged_volume
price_ok = lagged_close.gt(min_price)
liquidity_ok = (
lagged_dollar_volume.rolling(window=liquidity_window, min_periods=1).median().gt(min_dollar_volume)
)
history_ok = (
lagged_close.notna()
.rolling(window=min_history_days, min_periods=min_history_days)
.sum()
.ge(min_history_days)
)
valid_volume_ok = (
lagged_dollar_volume.notna()
.rolling(window=liquidity_window, min_periods=1)
.sum()
.ge(min_valid_volume_days)
)
mask = price_ok & liquidity_ok & history_ok & valid_volume_ok
mask = mask & pit_mask
return mask.astype(bool)
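
A compact illustration of the lagging, mirroring the first case in tests/test_us_universe.py: a day becomes tradable only once the prior day already satisfied the price, liquidity, and history thresholds, so nothing in the mask depends on same-day data:

import pandas as pd
from research.us_universe import build_tradable_mask

dates = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.DataFrame({"AAA": [4.0, 10.0, 10.0, 10.0]}, index=dates)             # day 1 fails min_price
volume = pd.DataFrame({"AAA": [float("nan"), 200.0, 200.0, 200.0]}, index=dates)
mask = build_tradable_mask(close=close, volume=volume, pit_membership=None,
                           min_price=5.0, min_dollar_volume=1000.0,
                           min_history_days=2, min_valid_volume_days=2,
                           liquidity_window=2)
print(mask["AAA"].tolist())   # [False, False, False, True]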

218
strategies/factor_combo.py Normal file
View File

@@ -0,0 +1,218 @@
"""
Factor combination strategies discovered through iterative factor research.
US champions:
- rec_mfilt+deep_upvol: Recovery (momentum-filtered) + deep recovery × up-volume
- ma200+mom7m+rec126: Above MA200 + intermediate momentum + deep recovery
- rec_mfilt+ma200: Recovery (momentum-filtered) + above MA200
- mom7m+rec126: Intermediate momentum + deep recovery
CN champions:
- up_cap+quality_mom: Up-capture ratio + quality momentum composite
- down_resil+qual_mom: Down-resilience + quality momentum composite
- rec63+mom_gap: Recovery 63d + momentum × gap-up frequency
- up_cap+mom_gap: Up-capture + momentum × gap-up frequency
Each can be run at a daily/weekly/biweekly/monthly rebalancing frequency.
"""
import numpy as np
import pandas as pd
from strategies.base import Strategy
# ---------------------------------------------------------------------------
# Factor building blocks
# ---------------------------------------------------------------------------
def _mom_12_1(p):
return p.shift(21).pct_change(231)
def _mom_intermediate(p):
return p.shift(21).pct_change(147)
def _rec_63(p):
return p / p.rolling(63, min_periods=63).min() - 1
def _rec_126(p):
return p / p.rolling(126, min_periods=126).min() - 1
def _above_ma200(p):
return p / p.rolling(200, min_periods=200).mean() - 1
def _up_volume_proxy(p):
ret = p.pct_change()
return ret.where(ret > 0, 0).rolling(20, min_periods=15).sum()
def _gap_up_freq(p):
ret = p.pct_change()
return (ret > 0.01).astype(float).rolling(60, min_periods=40).mean()
def _consistent_returns(p):
ret = p.pct_change()
return (ret > 0).astype(float).rolling(252, min_periods=126).mean()
def _rec_mom_filtered(p):
rec = p / p.rolling(126, min_periods=126).min() - 1
mom = p.shift(21).pct_change(105)
return rec.where(mom > 0, np.nan)
def _up_capture(p):
ret = p.pct_change()
mkt = ret.mean(axis=1)
up_mkt = mkt > 0
arr = ret.values.copy()
arr[~up_mkt.values, :] = np.nan
stock_up = pd.DataFrame(arr, index=ret.index, columns=ret.columns)
mkt_up_vals = mkt.where(up_mkt, np.nan)
stock_avg = stock_up.rolling(60, min_periods=20).mean()
mkt_avg = mkt_up_vals.rolling(60, min_periods=20).mean()
return stock_avg.div(mkt_avg, axis=0)
def _down_resilience(p):
ret = p.pct_change()
mkt = ret.mean(axis=1)
down_mkt = mkt < 0
arr = ret.values.copy()
arr[~down_mkt.values, :] = np.nan
down_ret = pd.DataFrame(arr, index=ret.index, columns=ret.columns)
return -down_ret.rolling(120, min_periods=30).mean()
def _quality_mom(p):
mom_r = _mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
con_r = _consistent_returns(p).rank(axis=1, pct=True, na_option="keep")
up_r = _up_volume_proxy(p).rank(axis=1, pct=True, na_option="keep")
return 0.4 * mom_r + 0.3 * con_r + 0.3 * up_r
def _mom_x_gap(p):
mom_r = _mom_12_1(p).rank(axis=1, pct=True, na_option="keep")
gap_r = _gap_up_freq(p).rank(axis=1, pct=True, na_option="keep")
return mom_r * gap_r
# ---------------------------------------------------------------------------
# Combo signal constructors (weighted rank sums)
# ---------------------------------------------------------------------------
def _rank(df):
return df.rank(axis=1, pct=True, na_option="keep")
# US combos
def signal_rec_mfilt_deep_upvol(p):
rec_mfilt_r = _rank(_rec_mom_filtered(p))
deep_upvol_r = _rank(_rec_126(p)) * _rank(_up_volume_proxy(p))
deep_upvol_rr = _rank(deep_upvol_r)
return 0.5 * rec_mfilt_r + 0.5 * deep_upvol_rr
def signal_ma200_mom7m_rec126(p):
return (0.33 * _rank(_above_ma200(p))
+ 0.33 * _rank(_mom_intermediate(p))
+ 0.34 * _rank(_rec_126(p)))
def signal_rec_mfilt_ma200(p):
return 0.5 * _rank(_rec_mom_filtered(p)) + 0.5 * _rank(_above_ma200(p))
def signal_mom7m_rec126(p):
return 0.5 * _rank(_mom_intermediate(p)) + 0.5 * _rank(_rec_126(p))
# CN combos
def signal_up_cap_quality_mom(p):
return 0.5 * _rank(_up_capture(p)) + 0.5 * _rank(_quality_mom(p))
def signal_down_resil_qual_mom(p):
return 0.5 * _rank(_down_resilience(p)) + 0.5 * _rank(_quality_mom(p))
def signal_rec63_mom_gap(p):
return 0.5 * _rank(_rec_63(p)) + 0.5 * _rank(_mom_x_gap(p))
def signal_up_cap_mom_gap(p):
return 0.5 * _rank(_up_capture(p)) + 0.5 * _rank(_mom_x_gap(p))
# ---------------------------------------------------------------------------
# Signal registry: name -> callable(prices) -> DataFrame
# ---------------------------------------------------------------------------
SIGNAL_REGISTRY = {
# US
"rec_mfilt+deep_upvol": signal_rec_mfilt_deep_upvol,
"ma200+mom7m+rec126": signal_ma200_mom7m_rec126,
"rec_mfilt+ma200": signal_rec_mfilt_ma200,
"mom7m+rec126": signal_mom7m_rec126,
# CN
"up_cap+quality_mom": signal_up_cap_quality_mom,
"down_resil+qual_mom": signal_down_resil_qual_mom,
"rec63+mom_gap": signal_rec63_mom_gap,
"up_cap+mom_gap": signal_up_cap_mom_gap,
}
# ---------------------------------------------------------------------------
# Strategy class
# ---------------------------------------------------------------------------
class FactorComboStrategy(Strategy):
"""
Generic factor-combination strategy with configurable rebalancing frequency.
Parameters:
signal_name: key into SIGNAL_REGISTRY
rebal_freq: rebalancing interval in trading days (1=daily, 5=weekly, 10=biweekly, 21=monthly)
top_n: number of stocks to hold
"""
REBAL_LABELS = {1: "daily", 5: "weekly", 10: "biweekly", 21: "monthly"}
def __init__(self, signal_name: str, rebal_freq: int = 21, top_n: int = 10):
if signal_name not in SIGNAL_REGISTRY:
raise ValueError(f"Unknown signal: {signal_name}. "
f"Available: {list(SIGNAL_REGISTRY.keys())}")
self.signal_name = signal_name
self.signal_func = SIGNAL_REGISTRY[signal_name]
self.rebal_freq = rebal_freq
self.top_n = top_n
def generate_signals(self, data: pd.DataFrame) -> pd.DataFrame:
sig = self.signal_func(data)
# Select top_n by signal rank
rank = sig.rank(axis=1, ascending=False, na_option="bottom")
n_valid = sig.notna().sum(axis=1)
enough = n_valid >= self.top_n
top_mask = (rank <= self.top_n) & enough.values.reshape(-1, 1)
raw = top_mask.astype(float)
row_sums = raw.sum(axis=1).replace(0, np.nan)
signals = raw.div(row_sums, axis=0).fillna(0.0)
# Rebalance at configured frequency
warmup = 252
rebal_mask = pd.Series(False, index=data.index)
rebal_indices = list(range(warmup, len(data), self.rebal_freq))
rebal_mask.iloc[rebal_indices] = True
signals[~rebal_mask] = np.nan
signals = signals.ffill().fillna(0.0)
signals.iloc[:warmup] = 0.0
return signals.shift(1).fillna(0.0)
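
A short usage sketch on a synthetic price panel (tickers and the random walk are made up) — the same construction the trader.py registry entries below perform for fc_mom7m_rec126_weekly:

import numpy as np
import pandas as pd
from strategies.factor_combo import FactorComboStrategy

rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", periods=700, freq="B")
prices = pd.DataFrame(
    50.0 * np.exp(np.cumsum(rng.normal(0.0004, 0.012, size=(700, 40)), axis=0)),
    index=dates,
    columns=[f"S{i:02d}" for i in range(40)],
)
strat = FactorComboStrategy("mom7m+rec126", rebal_freq=5, top_n=10)
weights = strat.generate_signals(prices)
print((weights.iloc[-1] > 0).sum(), weights.iloc[-1].sum())   # 10 names, weights summing to 1 after warm-up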

118
tests/test_alpha_signals.py Normal file
View File

@@ -0,0 +1,118 @@
import unittest
import warnings
import numpy as np
import pandas as pd
class AlphaSignalTests(unittest.TestCase):
def test_build_regime_filter_requires_market_trend_and_non_market_leader(self):
from research.regime_filters import build_regime_filter
dates = pd.date_range("2023-01-01", periods=260, freq="D")
spy = pd.Series([100.0 + i for i in range(260)], index=dates)
qqq_leader = pd.Series([100.0 + 1.4 * i for i in range(260)], index=dates)
xlu = pd.Series([100.0 + 0.2 * i for i in range(260)], index=dates)
with warnings.catch_warnings(record=True) as caught:
warnings.simplefilter("always")
bullish = build_regime_filter(pd.DataFrame({"SPY": spy, "QQQ": qqq_leader, "XLU": xlu}))
qqq_laggard = pd.Series([100.0 + 0.5 * i for i in range(260)], index=dates)
no_leader = build_regime_filter(pd.DataFrame({"SPY": spy, "QQQ": qqq_laggard, "XLU": xlu}))
self.assertEqual(len(caught), 0)
self.assertFalse(bool(bullish.iloc[199]))
self.assertTrue(bool(bullish.iloc[-1]))
self.assertFalse(bool(no_leader.iloc[-1]))
def test_build_regime_filter_handles_internal_missing_prices_without_warnings(self):
from research.regime_filters import build_regime_filter
dates = pd.date_range("2023-01-01", periods=260, freq="D")
spy = pd.Series([100.0 + i for i in range(260)], index=dates)
qqq = pd.Series([100.0 + 1.4 * i for i in range(260)], index=dates)
qqq.iloc[120] = np.nan
etf_close = pd.DataFrame({"SPY": spy, "QQQ": qqq, "XLU": 100.0}, index=dates)
with warnings.catch_warnings(record=True) as caught:
warnings.simplefilter("always")
regime = build_regime_filter(etf_close)
self.assertEqual(len(caught), 0)
self.assertEqual(str(regime.dtype), "bool")
def test_breakout_after_compression_score_is_shifted_and_rewards_breakout_profile(self):
from research.event_factors import breakout_after_compression_score
dates = pd.date_range("2024-01-01", periods=80, freq="D")
aaa_close = [100.0 + i for i in range(60)] + [159.0 + 0.05 * i for i in range(20)]
bbb_close = [100.0 + i for i in range(60)] + [150.0 - i for i in range(20)]
close = pd.DataFrame({"AAA": aaa_close, "BBB": bbb_close}, index=dates)
high = pd.DataFrame(
{
"AAA": [value + 0.4 for value in aaa_close],
"BBB": [value + 4.0 for value in bbb_close],
},
index=dates,
)
low = pd.DataFrame(
{
"AAA": [value - 0.4 for value in aaa_close],
"BBB": [value - 4.0 for value in bbb_close],
},
index=dates,
)
volume = pd.DataFrame(
{
"AAA": [1_000.0] * 79 + [1_000.0],
"BBB": [1_000.0] * 80,
},
index=dates,
)
volume.loc[dates[-2], "AAA"] = 6_000.0
shifted_result = breakout_after_compression_score(close, high, low, volume)
self.assertGreater(
shifted_result.loc[dates[-1], "AAA"],
shifted_result.loc[dates[-1], "BBB"],
)
changed_last_day = close.copy()
changed_last_day_high = high.copy()
changed_last_day_low = low.copy()
changed_last_day_volume = volume.copy()
changed_last_day.loc[dates[-1], "AAA"] = 120.0
changed_last_day_high.loc[dates[-1], "AAA"] = 130.0
changed_last_day_low.loc[dates[-1], "AAA"] = 110.0
changed_last_day_volume.loc[dates[-1], "AAA"] = 20_000.0
last_day_changed_result = breakout_after_compression_score(
changed_last_day,
changed_last_day_high,
changed_last_day_low,
changed_last_day_volume,
)
self.assertEqual(
shifted_result.loc[dates[-1], "AAA"],
last_day_changed_result.loc[dates[-1], "AAA"],
)
def test_breakout_after_compression_score_keeps_float_output_when_denominators_hit_zero(self):
from research.event_factors import breakout_after_compression_score
dates = pd.date_range("2024-01-01", periods=70, freq="D")
close = pd.DataFrame({"AAA": [10.0] * 70}, index=dates)
high = pd.DataFrame({"AAA": [10.0] * 70}, index=dates)
low = pd.DataFrame({"AAA": [10.0] * 70}, index=dates)
volume = pd.DataFrame({"AAA": [0.0] * 70}, index=dates)
score = breakout_after_compression_score(close, high, low, volume)
self.assertEqual(str(score.dtypes["AAA"]), "float64")
self.assertTrue(pd.isna(score.iloc[-1]["AAA"]))
if __name__ == "__main__":
unittest.main()

File diff suppressed because it is too large

46
tests/test_fetch_historical.py Normal file
View File

@@ -0,0 +1,46 @@
import tempfile
import unittest
from pathlib import Path
from unittest import mock
import pandas as pd
from research import fetch_historical
class FetchHistoricalTests(unittest.TestCase):
def test_fetch_all_historical_ohlcv_writes_field_specific_csvs(self):
dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
raw = pd.DataFrame(
{
("Close", "AAA"): [10.0, 11.0],
("Close", "BBB"): [20.0, 21.0],
("High", "AAA"): [10.5, 11.5],
("High", "BBB"): [20.5, 21.5],
("Low", "AAA"): [9.5, 10.5],
("Low", "BBB"): [19.5, 20.5],
("Volume", "AAA"): [1000.0, 1100.0],
("Volume", "BBB"): [2000.0, 2100.0],
},
index=dates,
)
raw.columns = pd.MultiIndex.from_tuples(raw.columns)
with tempfile.TemporaryDirectory() as tmpdir:
with mock.patch.object(fetch_historical, "DATA_DIR", tmpdir):
with mock.patch.object(fetch_historical, "OUT_PATH", str(Path(tmpdir) / "us_pit.csv")):
with mock.patch("research.fetch_historical.uh.load_sp500_history", return_value={"AAA": [[None, None]], "BBB": [[None, None]]}):
with mock.patch("research.fetch_historical.uh.all_tickers_ever", return_value=["AAA", "BBB"]):
with mock.patch("research.fetch_historical.yf.download", return_value=raw):
panels = fetch_historical.fetch_all_historical_ohlcv(force=True)
self.assertEqual(set(panels.keys()), {"close", "high", "low", "volume"})
self.assertTrue((Path(tmpdir) / "us_pit.csv").exists())
self.assertTrue((Path(tmpdir) / "us_pit_close.csv").exists())
self.assertTrue((Path(tmpdir) / "us_pit_high.csv").exists())
self.assertTrue((Path(tmpdir) / "us_pit_low.csv").exists())
self.assertTrue((Path(tmpdir) / "us_pit_volume.csv").exists())
if __name__ == "__main__":
unittest.main()

144
tests/test_market_data.py Normal file
View File

@@ -0,0 +1,144 @@
import tempfile
import unittest
from pathlib import Path
from unittest import mock
import pandas as pd
import data_manager
class UpdateMarketDataTests(unittest.TestCase):
def test_update_market_data_accepts_lowercase_fields_and_does_not_fill_volume(self):
dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
raw = pd.DataFrame(
{
("Close", "AAA"): [10.0, 11.0, 12.0],
("Close", "BBB"): [20.0, float("nan"), 22.0],
("Open", "AAA"): [9.5, 10.5, 11.5],
("Open", "BBB"): [19.5, 20.5, 21.5],
("High", "AAA"): [10.5, 11.5, 12.5],
("High", "BBB"): [20.5, 21.5, 22.5],
("Low", "AAA"): [9.0, 10.0, 11.0],
("Low", "BBB"): [19.0, 20.0, 21.0],
("Volume", "AAA"): [1000, 1100, 1200],
("Volume", "BBB"): [2000, float("nan"), 2200],
},
index=dates,
)
raw.columns = pd.MultiIndex.from_tuples(raw.columns)
with tempfile.TemporaryDirectory() as tmpdir:
with mock.patch.object(data_manager, "DATA_DIR", tmpdir):
with mock.patch("data_manager.yf.download", return_value=raw) as mocked_download:
panels = data_manager.update_market_data(
"us",
["AAA", "BBB"],
["close", "open", "high", "low", "volume"],
)
self.assertEqual(set(panels), {"close", "open", "high", "low", "volume"})
self.assertEqual(panels["close"].loc[dates[1], "BBB"], 20.0)
self.assertTrue(pd.isna(panels["volume"].loc[dates[1], "BBB"]))
self.assertTrue((Path(tmpdir) / "us.csv").exists())
self.assertTrue((Path(tmpdir) / "us_open.csv").exists())
self.assertTrue((Path(tmpdir) / "us_high.csv").exists())
self.assertTrue((Path(tmpdir) / "us_low.csv").exists())
self.assertTrue((Path(tmpdir) / "us_volume.csv").exists())
saved_high = pd.read_csv(Path(tmpdir) / "us_high.csv", index_col=0, parse_dates=True)
pd.testing.assert_frame_equal(saved_high, panels["high"], check_freq=False)
self.assertEqual(mocked_download.call_args.args[0], ["AAA", "BBB"])
self.assertEqual(mocked_download.call_args.kwargs["auto_adjust"], True)
self.assertIn("start", mocked_download.call_args.kwargs)
def test_update_market_data_rejects_unsupported_fields(self):
with tempfile.TemporaryDirectory() as tmpdir:
with mock.patch.object(data_manager, "DATA_DIR", tmpdir):
with self.assertRaisesRegex(ValueError, "Unsupported market data field: adjusted_close"):
data_manager.update_market_data("us", ["AAA"], ["adjusted_close"])
def test_update_market_data_preserves_existing_cache_columns_and_dates(self):
existing_dates = pd.to_datetime(["2024-01-01", "2024-01-02"])
new_dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
existing_close = pd.DataFrame(
{
"AAA": [9.0, 10.0],
"CCC": [30.0, 31.0],
},
index=existing_dates,
)
downloaded_close = pd.DataFrame({"Close": [10.5, 11.5]}, index=new_dates)
with tempfile.TemporaryDirectory() as tmpdir:
existing_close.to_csv(Path(tmpdir) / "us.csv")
with mock.patch.object(data_manager, "DATA_DIR", tmpdir):
with mock.patch("data_manager.yf.download", return_value=downloaded_close):
panels = data_manager.update_market_data("us", ["AAA"], ["close"])
expected = pd.DataFrame(
{
"AAA": [9.0, 10.5, 11.5],
"CCC": [30.0, 31.0, float("nan")],
},
index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
saved_close = pd.read_csv(Path(tmpdir) / "us.csv", index_col=0, parse_dates=True)
pd.testing.assert_frame_equal(panels["close"], expected, check_freq=False)
pd.testing.assert_frame_equal(saved_close, expected, check_freq=False)
def test_update_market_data_volume_merge_can_clear_stale_cached_values(self):
existing_dates = pd.to_datetime(["2024-01-01", "2024-01-02"])
new_dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
existing_volume = pd.DataFrame(
{
"AAA": [1000.0, 9999.0],
"CCC": [3000.0, 3100.0],
},
index=existing_dates,
)
downloaded_volume = pd.DataFrame({"Volume": [float("nan"), 1200.0, 1300.0]}, index=new_dates)
with tempfile.TemporaryDirectory() as tmpdir:
existing_volume.to_csv(Path(tmpdir) / "us_volume.csv")
with mock.patch.object(data_manager, "DATA_DIR", tmpdir):
with mock.patch("data_manager.yf.download", return_value=downloaded_volume):
panels = data_manager.update_market_data("us", ["AAA"], ["volume"])
expected = pd.DataFrame(
{
"AAA": [1000.0, float("nan"), 1200.0, 1300.0],
"CCC": [3000.0, 3100.0, float("nan"), float("nan")],
},
index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
)
saved_volume = pd.read_csv(Path(tmpdir) / "us_volume.csv", index_col=0, parse_dates=True)
pd.testing.assert_frame_equal(panels["volume"], expected, check_freq=False)
pd.testing.assert_frame_equal(saved_volume, expected, check_freq=False)
def test_update_market_data_handles_single_ticker_multiindex_download(self):
dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
raw = pd.DataFrame(
{
("Close", "AAA"): [10.0, 11.0],
("Volume", "AAA"): [1000.0, 1100.0],
},
index=dates,
)
raw.columns = pd.MultiIndex.from_tuples(raw.columns)
with tempfile.TemporaryDirectory() as tmpdir:
with mock.patch.object(data_manager, "DATA_DIR", tmpdir):
with mock.patch("data_manager.yf.download", return_value=raw):
panels = data_manager.update_market_data("us", ["AAA"], ["close", "volume"])
expected_close = pd.DataFrame({"AAA": [10.0, 11.0]}, index=dates)
expected_volume = pd.DataFrame({"AAA": [1000.0, 1100.0]}, index=dates)
pd.testing.assert_frame_equal(panels["close"], expected_close, check_freq=False)
pd.testing.assert_frame_equal(panels["volume"], expected_volume, check_freq=False)
if __name__ == "__main__":
unittest.main()

164
tests/test_us_alpha_pipeline.py Normal file
View File

@@ -0,0 +1,164 @@
import unittest
from pathlib import Path
from unittest import mock
import pandas as pd
class USAlphaPipelineTests(unittest.TestCase):
def test_build_equal_weight_portfolio_caps_holdings_under_ties(self):
from research.us_alpha_pipeline import _build_equal_weight_portfolio
dates = pd.date_range("2024-01-01", periods=2, freq="D")
score = pd.DataFrame(
{
"AAA": [0.9, 0.9],
"BBB": [0.9, 0.9],
"CCC": [0.9, 0.9],
},
index=dates,
)
tradable_mask = pd.DataFrame(True, index=dates, columns=score.columns)
regime = pd.Series([True, True], index=dates)
weights = _build_equal_weight_portfolio(score, tradable_mask, regime, top_n=2)
self.assertEqual(int((weights.iloc[-1] > 0).sum()), 2)
self.assertAlmostEqual(float(weights.iloc[-1].sum()), 1.0)
def test_equity_curve_uses_prior_day_weights_for_returns(self):
from research.us_alpha_pipeline import _equity_curve
dates = pd.date_range("2024-01-01", periods=3, freq="D")
close = pd.DataFrame({"AAA": [1.0, 2.0, 4.0]}, index=dates)
weights = pd.DataFrame({"AAA": [0.0, 1.0, 0.0]}, index=dates)
equity = _equity_curve(close, weights)
self.assertEqual(float(equity.iloc[1]), 2.0)
self.assertEqual(float(equity.iloc[2]), 2.0)
def test_summarize_equity_window_returns_nans_when_history_is_too_short(self):
from research.us_alpha_report import summarize_equity_window
dates = pd.date_range("2024-01-01", periods=10, freq="D")
equity = pd.Series([1.0 + 0.01 * i for i in range(10)], index=dates)
summary = summarize_equity_window(equity, "demo", window_years=1)
self.assertTrue(pd.isna(summary["CAGR"]))
self.assertTrue(pd.isna(summary["Sharpe"]))
self.assertTrue(pd.isna(summary["MaxDD"]))
self.assertTrue(pd.isna(summary["TotalRet"]))
def test_run_alpha_pipeline_returns_expected_strategy_summary(self):
from research.us_alpha_pipeline import run_alpha_pipeline
dates = pd.date_range("2023-01-01", periods=400, freq="D")
aaa_close = [50.0 + 0.20 * i for i in range(400)]
bbb_close = [55.0 + 0.12 * i for i in range(400)]
ccc_close = [60.0 + 0.05 * i for i in range(400)]
close = pd.DataFrame(
{
"AAA": aaa_close,
"BBB": bbb_close,
"CCC": ccc_close,
},
index=dates,
)
high = pd.DataFrame(
{
"AAA": [value + 0.5 for value in aaa_close],
"BBB": [value + 1.0 for value in bbb_close],
"CCC": [value + 1.5 for value in ccc_close],
},
index=dates,
)
low = pd.DataFrame(
{
"AAA": [value - 0.5 for value in aaa_close],
"BBB": [value - 1.0 for value in bbb_close],
"CCC": [value - 1.5 for value in ccc_close],
},
index=dates,
)
volume = pd.DataFrame(
{
"AAA": [1_500_000.0] * 400,
"BBB": [1_400_000.0] * 400,
"CCC": [1_300_000.0] * 400,
},
index=dates,
)
volume.loc[dates[-2], "AAA"] = 4_000_000.0
etf_close = pd.DataFrame(
{
"SPY": [300.0 + 0.8 * i for i in range(400)],
"QQQ": [280.0 + 1.1 * i for i in range(400)],
"XLF": [200.0 + 0.4 * i for i in range(400)],
},
index=dates,
)
market_data = {
"close": close,
"high": high,
"low": low,
"volume": volume,
}
summary = run_alpha_pipeline(
market_data=market_data,
etf_close=etf_close,
pit_membership=None,
windows=(1,),
top_n=2,
)
required_columns = {"strategy", "window_years", "CAGR", "Sharpe", "MaxDD", "TotalRet"}
self.assertTrue(required_columns.issubset(summary.columns))
self.assertEqual(set(summary["strategy"]), {"breakout_regime", "rank_blend_regime"})
self.assertEqual(set(summary["window_years"]), {1})
self.assertEqual(len(summary), 2)
self.assertTrue(summary[["CAGR", "Sharpe", "MaxDD", "TotalRet"]].notna().all().all())
def test_run_saved_pit_alpha_pipeline_reads_saved_inputs(self):
from research.us_alpha_pipeline import run_saved_pit_alpha_pipeline
dates = pd.date_range("2024-01-01", periods=320, freq="D")
close = pd.DataFrame(
{
"AAA": [50.0 + 0.2 * i for i in range(320)],
"BBB": [40.0 + 0.1 * i for i in range(320)],
},
index=dates,
)
high = close + 1.0
low = close - 1.0
volume = pd.DataFrame({"AAA": [2_500_000.0] * 320, "BBB": [2_000_000.0] * 320}, index=dates)
etf_close = pd.DataFrame(
{"SPY": [300.0 + 0.8 * i for i in range(320)], "QQQ": [280.0 + 1.1 * i for i in range(320)]},
index=dates,
)
with self.subTest("saved_inputs"):
import tempfile
with tempfile.TemporaryDirectory() as tmpdir:
close.to_csv(Path(tmpdir) / "us_pit_close.csv")
high.to_csv(Path(tmpdir) / "us_pit_high.csv")
low.to_csv(Path(tmpdir) / "us_pit_low.csv")
volume.to_csv(Path(tmpdir) / "us_pit_volume.csv")
etf_close.to_csv(Path(tmpdir) / "us_etf.csv")
intervals = {"AAA": [[None, None]], "BBB": [[None, None]]}
with mock.patch("research.us_alpha_pipeline.uh.load_sp500_history", return_value=intervals):
summary = run_saved_pit_alpha_pipeline(data_dir=tmpdir, windows=(1,), top_n=1)
self.assertEqual(set(summary["strategy"]), {"breakout_regime", "rank_blend_regime"})
if __name__ == "__main__":
unittest.main()

213
tests/test_us_universe.py Normal file
View File

@@ -0,0 +1,213 @@
import unittest
import warnings
import pandas as pd
class BuildTradableMaskTests(unittest.TestCase):
def test_build_tradable_mask_uses_only_lagged_price_and_liquidity_inputs(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.DataFrame({"AAA": [4.0, 10.0, 10.0, 10.0]}, index=dates)
volume = pd.DataFrame({"AAA": [float("nan"), 200.0, 200.0, 200.0]}, index=dates)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=None,
min_price=5.0,
min_dollar_volume=1000.0,
min_history_days=2,
min_valid_volume_days=2,
liquidity_window=2,
)
expected = pd.DataFrame({"AAA": [False, False, False, True]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_uses_only_lagged_history(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.DataFrame({"AAA": [10.0, float("nan"), 10.0, 10.0]}, index=dates)
volume = pd.DataFrame({"AAA": [200.0, 200.0, 200.0, 200.0]}, index=dates)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=None,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=2,
min_valid_volume_days=1,
liquidity_window=1,
)
expected = pd.DataFrame({"AAA": [False, False, False, False]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_requires_membership_history_before_first_eligible_day(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.DataFrame({"AAA": [10.0, 10.0, 10.0, 10.0]}, index=dates)
volume = pd.DataFrame({"AAA": [200.0, 200.0, 200.0, 200.0]}, index=dates)
pit_membership = pd.DataFrame({"AAA": [False, False, True, True]}, index=dates)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=pit_membership,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=1,
min_valid_volume_days=1,
liquidity_window=1,
)
expected = pd.DataFrame({"AAA": [False, False, False, True]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_aligns_pit_membership_without_truthy_carryover(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=3, freq="D")
close = pd.DataFrame(
{
"AAA": [10.0, 10.0, 10.0],
"BBB": [12.0, 12.0, 12.0],
},
index=dates,
)
volume = pd.DataFrame(
{
"AAA": [1_000_000.0, 1_000_000.0, 1_000_000.0],
"BBB": [1_000_000.0, 1_000_000.0, 1_000_000.0],
},
index=dates,
)
pit_membership = pd.DataFrame(
{
"BBB": [True, True, False],
"CCC": [True, True, True],
},
index=pd.date_range("2024-01-02", periods=3, freq="D"),
)
with warnings.catch_warnings(record=True) as caught:
warnings.simplefilter("always")
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=pit_membership,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=1,
min_valid_volume_days=1,
liquidity_window=1,
)
self.assertEqual(len(caught), 0)
expected = pd.DataFrame(
{
"AAA": [False, False, False],
"BBB": [False, False, True],
},
index=dates,
dtype=bool,
)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_treats_missing_membership_cells_as_false(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=3, freq="D")
close = pd.DataFrame({"AAA": [10.0, 10.0, 10.0]}, index=dates)
volume = pd.DataFrame({"AAA": [1_000_000.0, 1_000_000.0, 1_000_000.0]}, index=dates)
pit_membership = pd.DataFrame(
{"AAA": [True, pd.NA, True]},
index=dates,
dtype="boolean",
)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=pit_membership,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=1,
min_valid_volume_days=1,
liquidity_window=1,
)
expected = pd.DataFrame({"AAA": [False, False, False]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_uses_strict_thresholds(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=3, freq="D")
close = pd.DataFrame({"AAA": [5.0, 5.0, 5.0]}, index=dates)
volume = pd.DataFrame({"AAA": [300.0, 300.0, 300.0]}, index=dates)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=None,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=1,
min_valid_volume_days=1,
liquidity_window=1,
)
expected = pd.DataFrame({"AAA": [False, False, False]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_uses_strict_dollar_volume_threshold(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=3, freq="D")
close = pd.DataFrame({"AAA": [8.0, 8.0, 8.0]}, index=dates)
volume = pd.DataFrame({"AAA": [125.0, 125.0, 125.0]}, index=dates)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=None,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=1,
min_valid_volume_days=1,
liquidity_window=1,
)
expected = pd.DataFrame({"AAA": [False, False, False]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
def test_build_tradable_mask_requires_valid_dollar_volume_history(self):
from research.us_universe import build_tradable_mask
dates = pd.date_range("2024-01-01", periods=4, freq="D")
close = pd.DataFrame({"AAA": [10.0, float("nan"), 10.0, 10.0]}, index=dates)
volume = pd.DataFrame({"AAA": [200.0, 200.0, 200.0, 200.0]}, index=dates)
mask = build_tradable_mask(
close=close,
volume=volume,
pit_membership=None,
min_price=5.0,
min_dollar_volume=1_000.0,
min_history_days=1,
min_valid_volume_days=2,
liquidity_window=2,
)
expected = pd.DataFrame({"AAA": [False, False, False, False]}, index=dates, dtype=bool)
pd.testing.assert_frame_equal(mask, expected)
if __name__ == "__main__":
unittest.main()

trader.py
View File

@@ -40,6 +40,7 @@ import yfinance as yf
import data_manager
from strategies.buy_and_hold import BuyAndHoldStrategy
from strategies.dual_momentum import DualMomentumStrategy
from strategies.factor_combo import FactorComboStrategy
from strategies.inverse_vol import InverseVolatilityStrategy
from strategies.momentum import MomentumStrategy
from strategies.momentum_quality import MomentumQualityStrategy
@@ -47,11 +48,22 @@ from strategies.recovery_momentum import RecoveryMomentumStrategy
from strategies.trend_following import TrendFollowingStrategy
from universe import UNIVERSES
# ---------------------------------------------------------------------------
# Per-market fixed trading fees (per trade, in the market's local currency)
# ---------------------------------------------------------------------------
# These are applied automatically by cmd_monitor and cmd_auto; they can still
# be overridden by explicitly passing --fixed-fee on the CLI.
MARKET_FEES = {
"us": 2.0, # USD per trade
"cn": 5.0, # CNY per trade (A-share minimum commission)
}
# ---------------------------------------------------------------------------
# Strategy registry
# ---------------------------------------------------------------------------
STRATEGY_REGISTRY = {
# --- Original strategies ---
"recovery_mom_top10": lambda **kw: RecoveryMomentumStrategy(top_n=10), "recovery_mom_top10": lambda **kw: RecoveryMomentumStrategy(top_n=10),
"recovery_mom_top20": lambda **kw: RecoveryMomentumStrategy(top_n=20), "recovery_mom_top20": lambda **kw: RecoveryMomentumStrategy(top_n=20),
"recovery_mom_top50": lambda **kw: RecoveryMomentumStrategy(top_n=50), "recovery_mom_top50": lambda **kw: RecoveryMomentumStrategy(top_n=50),
@@ -61,6 +73,40 @@ STRATEGY_REGISTRY = {
"inverse_vol": lambda **kw: InverseVolatilityStrategy(vol_window=20), "inverse_vol": lambda **kw: InverseVolatilityStrategy(vol_window=20),
"trend_following": lambda **kw: TrendFollowingStrategy(top_n=kw.get("top_n", 20)), "trend_following": lambda **kw: TrendFollowingStrategy(top_n=kw.get("top_n", 20)),
"buy_and_hold": lambda **kw: BuyAndHoldStrategy(), "buy_and_hold": lambda **kw: BuyAndHoldStrategy(),
# --- Factor combo: US champions ---
"fc_rec_mfilt_deep_upvol_daily": lambda **kw: FactorComboStrategy("rec_mfilt+deep_upvol", rebal_freq=1),
"fc_rec_mfilt_deep_upvol_weekly": lambda **kw: FactorComboStrategy("rec_mfilt+deep_upvol", rebal_freq=5),
"fc_rec_mfilt_deep_upvol_biweekly": lambda **kw: FactorComboStrategy("rec_mfilt+deep_upvol", rebal_freq=10),
"fc_rec_mfilt_deep_upvol_monthly": lambda **kw: FactorComboStrategy("rec_mfilt+deep_upvol", rebal_freq=21),
"fc_ma200_mom7m_rec126_daily": lambda **kw: FactorComboStrategy("ma200+mom7m+rec126", rebal_freq=1),
"fc_ma200_mom7m_rec126_weekly": lambda **kw: FactorComboStrategy("ma200+mom7m+rec126", rebal_freq=5),
"fc_ma200_mom7m_rec126_biweekly": lambda **kw: FactorComboStrategy("ma200+mom7m+rec126", rebal_freq=10),
"fc_ma200_mom7m_rec126_monthly": lambda **kw: FactorComboStrategy("ma200+mom7m+rec126", rebal_freq=21),
"fc_rec_mfilt_ma200_daily": lambda **kw: FactorComboStrategy("rec_mfilt+ma200", rebal_freq=1),
"fc_rec_mfilt_ma200_weekly": lambda **kw: FactorComboStrategy("rec_mfilt+ma200", rebal_freq=5),
"fc_rec_mfilt_ma200_biweekly": lambda **kw: FactorComboStrategy("rec_mfilt+ma200", rebal_freq=10),
"fc_rec_mfilt_ma200_monthly": lambda **kw: FactorComboStrategy("rec_mfilt+ma200", rebal_freq=21),
"fc_mom7m_rec126_daily": lambda **kw: FactorComboStrategy("mom7m+rec126", rebal_freq=1),
"fc_mom7m_rec126_weekly": lambda **kw: FactorComboStrategy("mom7m+rec126", rebal_freq=5),
"fc_mom7m_rec126_biweekly": lambda **kw: FactorComboStrategy("mom7m+rec126", rebal_freq=10),
"fc_mom7m_rec126_monthly": lambda **kw: FactorComboStrategy("mom7m+rec126", rebal_freq=21),
# --- Factor combo: CN champions ---
"fc_up_cap_quality_mom_daily": lambda **kw: FactorComboStrategy("up_cap+quality_mom", rebal_freq=1),
"fc_up_cap_quality_mom_weekly": lambda **kw: FactorComboStrategy("up_cap+quality_mom", rebal_freq=5),
"fc_up_cap_quality_mom_biweekly": lambda **kw: FactorComboStrategy("up_cap+quality_mom", rebal_freq=10),
"fc_up_cap_quality_mom_monthly": lambda **kw: FactorComboStrategy("up_cap+quality_mom", rebal_freq=21),
"fc_down_resil_qual_mom_daily": lambda **kw: FactorComboStrategy("down_resil+qual_mom", rebal_freq=1),
"fc_down_resil_qual_mom_weekly": lambda **kw: FactorComboStrategy("down_resil+qual_mom", rebal_freq=5),
"fc_down_resil_qual_mom_biweekly": lambda **kw: FactorComboStrategy("down_resil+qual_mom", rebal_freq=10),
"fc_down_resil_qual_mom_monthly": lambda **kw: FactorComboStrategy("down_resil+qual_mom", rebal_freq=21),
"fc_rec63_mom_gap_daily": lambda **kw: FactorComboStrategy("rec63+mom_gap", rebal_freq=1),
"fc_rec63_mom_gap_weekly": lambda **kw: FactorComboStrategy("rec63+mom_gap", rebal_freq=5),
"fc_rec63_mom_gap_biweekly": lambda **kw: FactorComboStrategy("rec63+mom_gap", rebal_freq=10),
"fc_rec63_mom_gap_monthly": lambda **kw: FactorComboStrategy("rec63+mom_gap", rebal_freq=21),
"fc_up_cap_mom_gap_daily": lambda **kw: FactorComboStrategy("up_cap+mom_gap", rebal_freq=1),
"fc_up_cap_mom_gap_weekly": lambda **kw: FactorComboStrategy("up_cap+mom_gap", rebal_freq=5),
"fc_up_cap_mom_gap_biweekly": lambda **kw: FactorComboStrategy("up_cap+mom_gap", rebal_freq=10),
"fc_up_cap_mom_gap_monthly": lambda **kw: FactorComboStrategy("up_cap+mom_gap", rebal_freq=21),
}
@@ -484,6 +530,12 @@ def cmd_evening(args):
post_value = portfolio_value(state["holdings"], close_prices, state["cash"])
state["daily_equity"][trade_date] = round(post_value, 2)
# Record daily snapshot so daily_log stays complete even on no-trade days
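# daily_equity was updated just above, so eq_vals[-1] is today's NAV and
# eq_vals[-2] is the previous trading day's close-of-day NAV.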
eq_vals = list(state["daily_equity"].values())
prev_eq = eq_vals[-2] if len(eq_vals) >= 2 else state["initial_capital"]
record_daily_snapshot(state, trade_date, close_prices, exec_trades, prev_eq)
state["pending_trades"] = None state["pending_trades"] = None
state["last_evening"] = trade_date state["last_evening"] = trade_date
save_state(state, market, strategy_name) save_state(state, market, strategy_name)
@@ -1046,12 +1098,13 @@ def cmd_monitor(args):
print(f" MONITOR MODE — {len(markets)} market(s), " print(f" MONITOR MODE — {len(markets)} market(s), "
f"{len(strategies)} strategies each") f"{len(strategies)} strategies each")
print(f" Capital: ${args.capital:,.0f} | " print(f" Capital: ${args.capital:,.0f} | "
f"Fee: ${args.fixed_fee:.2f}/trade | "
f"Integer shares: {args.integer_shares}") f"Integer shares: {args.integer_shares}")
for mkt, sched in market_schedules.items(): for mkt, sched in market_schedules.items():
fee = MARKET_FEES.get(mkt, args.fixed_fee)
print(f" {sched['label']}:") print(f" {sched['label']}:")
print(f" Morning: {sched['morn_h']:02d}:{sched['morn_m']:02d} {sched['tz']}") print(f" Morning: {sched['morn_h']:02d}:{sched['morn_m']:02d} {sched['tz']}")
print(f" Evening: {sched['eve_h']:02d}:{sched['eve_m']:02d} {sched['tz']}") print(f" Evening: {sched['eve_h']:02d}:{sched['eve_m']:02d} {sched['tz']}")
print(f" Fixed fee: {fee:.2f}/trade")
print(f" Strategies: {', '.join(strategies)}") print(f" Strategies: {', '.join(strategies)}")
print(f"{'='*60}") print(f"{'='*60}")
@@ -1096,10 +1149,12 @@ def cmd_monitor(args):
f"{now_local.strftime('%Y-%m-%d %H:%M:%S %Z')}") f"{now_local.strftime('%Y-%m-%d %H:%M:%S %Z')}")
print(f"[monitor] {'='*55}") print(f"[monitor] {'='*55}")
market_fee = MARKET_FEES.get(market, args.fixed_fee)
for strat_name in strategies:
sub_args = copy.copy(args)
sub_args.strategy = strat_name
sub_args.market = market
sub_args.fixed_fee = market_fee
print(f"\n[monitor] --- {market.upper()}:{strat_name} ---") print(f"\n[monitor] --- {market.upper()}:{strat_name} ---")
try: try:
@@ -1253,8 +1308,10 @@ def cmd_auto(args):
integer_shares=args.integer_shares
)
# Fall back to per-market fee when the user didn't explicitly override
fixed_fee = args.fixed_fee if args.fixed_fee > 0 else MARKET_FEES.get(market, 0.0)
execute_trades(state, trades, close_prices,
tx_cost=args.tx_cost, fixed_fee=args.fixed_fee,
tx_cost=args.tx_cost, fixed_fee=fixed_fee,
trade_date=today_str, integer_shares=args.integer_shares)
post_value = portfolio_value(state["holdings"], close_prices, state["cash"])

universe_history.py (new file, +230 lines)

@@ -0,0 +1,230 @@
"""
Point-in-time index membership reconstruction — fixes survivorship bias.
Approach: Wikipedia's "Selected changes to the list of S&P 500 components"
table lists every add/remove event (394 rows back to 1976, as of 2026). We
start from today's membership and walk the change log *backward*:
- An 'Added' ticker on date D was NOT a member before D.
- A 'Removed' ticker on date D WAS a member before D.
Applied iteratively, this yields the set of members on any historical date.
The membership info is cached in data/sp500_history.json so Wikipedia is hit
at most once per day. The cache stores per-ticker membership intervals:
{ "ticker": [[start, end_or_null], ...] }
where dates are YYYY-MM-DD strings.
"""
import io
import json
import os
import urllib.request
from datetime import date, datetime
import pandas as pd
CACHE_DIR = "data"
_HEADERS = {"User-Agent": "Mozilla/5.0 (quant-backtest)"}
# ---------------------------------------------------------------------------
# Fetch + parse Wikipedia
# ---------------------------------------------------------------------------
def _fetch_sp500_tables() -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (current_list, changes_log) from the S&P 500 Wikipedia page."""
    url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
    req = urllib.request.Request(url, headers=_HEADERS)
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8")
    tables = pd.read_html(io.StringIO(html))
    current = tables[0]
    changes = tables[1]
    changes.columns = [
        "_".join(c).strip() if isinstance(c, tuple) else c
        for c in changes.columns
    ]
    changes.columns = [
        c.replace("Effective Date_Effective Date", "Date") for c in changes.columns
    ]
    return current, changes


def _normalize_ticker(t: str) -> str:
    """Yahoo Finance ticker format: BRK.B → BRK-B."""
    return str(t).replace(".", "-").strip()
# ---------------------------------------------------------------------------
# Membership reconstruction
# ---------------------------------------------------------------------------
def build_sp500_history() -> dict[str, list[list[str | None]]]:
    """
    Reconstruct per-ticker membership intervals.

    Returns
    -------
    dict: ticker -> list of [start_date, end_date_or_None] pairs.
        end_date=None means the ticker is still a member as of today.
        Dates are YYYY-MM-DD strings.

    Algorithm: start from today's set of members, walk the change log from
    newest to oldest. For each event on date D:
      - The 'Added' ticker: its currently open interval starts on D, so pin
        the start and finalize it as [D, end] — it was NOT a member before D.
      - The 'Removed' ticker: it was a member up to D (exclusive).
        Open a new interval ending on D (start unknown for now; it will be
        pinned by an earlier 'Added' event or left open-start).
    After the walk, any ticker still "open" (never closed backward) has an
    interval reaching back before the earliest logged change.
    """
    current, changes = _fetch_sp500_tables()
    current_tickers = {_normalize_ticker(s) for s in current["Symbol"].tolist()}

    # Parse change log
    changes["dt"] = pd.to_datetime(changes["Date"], errors="coerce")
    changes = changes.dropna(subset=["dt"]).sort_values("dt", ascending=False)

    # For each ticker, collect intervals [start, end].
    # We track a "current open interval" per ticker during the backward walk.
    # intervals[ticker] = list of [start, end] completed intervals (oldest-first).
    # open_start[ticker] = start date of the currently open (most-recent) interval.
    intervals: dict[str, list[list[str | None]]] = {}
    open_end: dict[str, str | None] = {}  # end of currently-open interval

    # Initialize: today's members have an open interval ending = None (still in)
    for t in current_tickers:
        open_end[t] = None  # still a member today
        intervals[t] = []

    # Track the start date of each open interval as we walk backward.
    # For a member today, the interval started at the last "Added" event in the
    # changes log, OR before the log begins if never added.
    # We'll close the interval when we hit the "Added" event going backward.
    open_start: dict[str, str | None] = {t: None for t in current_tickers}

    for _, row in changes.iterrows():
        d = row["dt"].strftime("%Y-%m-%d")
        added = row.get("Added_Ticker")
        removed = row.get("Removed_Ticker")

        if pd.notna(added):
            a = _normalize_ticker(added)
            # This ticker was added on d → its open interval starts on d.
            if a in open_end:
                open_start[a] = d
                # Finalize the current open interval
                intervals[a].append([d, open_end[a]])
                # Pop: no further open interval backward in time for this ticker
                # (unless 'Removed' opens a new older one below)
                del open_end[a]

        if pd.notna(removed):
            r = _normalize_ticker(removed)
            # This ticker was removed on d → it WAS a member before d.
            # Open a new interval ending on d (start unknown yet).
            if r not in open_end:
                intervals.setdefault(r, [])
                open_end[r] = d  # end of the new older interval

    # Any ticker still with an open interval → start predates the log.
    # Use the oldest logged date as a conservative "unknown earlier" marker: None.
    for t, end in open_end.items():
        intervals.setdefault(t, []).append([None, end])

    # Sort intervals per ticker oldest→newest
    for t, ivs in intervals.items():
        ivs.sort(key=lambda iv: (iv[0] or "0000-00-00"))

    return intervals
# ---------------------------------------------------------------------------
# Cache I/O
# ---------------------------------------------------------------------------
def _cache_path() -> str:
    return os.path.join(CACHE_DIR, "sp500_history.json")


def load_sp500_history(force_refresh: bool = False) -> dict[str, list[list[str | None]]]:
    """Load cached membership history, or rebuild it if the cache was not written today."""
    path = _cache_path()
    if not force_refresh and os.path.exists(path):
        try:
            with open(path) as f:
                data = json.load(f)
            if data.get("date") == str(date.today()):
                return data["intervals"]
        except Exception:
            pass
    print("--- Rebuilding S&P 500 membership history from Wikipedia ---")
    intervals = build_sp500_history()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump({"date": str(date.today()), "intervals": intervals}, f)
    print(f"--- Cached {len(intervals)} tickers' membership intervals ---")
    return intervals
# ---------------------------------------------------------------------------
# Convert intervals → aligned mask DataFrame
# ---------------------------------------------------------------------------
def membership_mask(dates: pd.DatetimeIndex,
                    intervals: dict[str, list[list[str | None]]] | None = None,
                    tickers: list[str] | None = None) -> pd.DataFrame:
    """
    Boolean DataFrame: rows = dates, columns = tickers.
    True where the ticker was an S&P 500 member on that date.

    If `tickers` is given, restrict columns to that list (useful for aligning
    with a price DataFrame). Otherwise, include all tickers ever a member.
    """
    if intervals is None:
        intervals = load_sp500_history()
    cols = tickers if tickers is not None else sorted(intervals.keys())

    # Tickers not in `intervals` (e.g. SPY, benchmarks, ETFs) are treated as
    # always-members so callers can pass the full price matrix through
    # mask_prices without zeroing out benchmark series.
    mask = pd.DataFrame(False, index=dates, columns=cols)
    for t in cols:
        if t not in intervals:
            mask[t] = True
            continue
        for start, end in intervals[t]:
            s = pd.Timestamp(start) if start else dates[0]
            e = pd.Timestamp(end) if end else dates[-1] + pd.Timedelta(days=1)
            # Interval semantics: member on [start, end). A ticker removed on
            # date D was no longer a member on D.
            mask.loc[(mask.index >= s) & (mask.index < e), t] = True
    return mask
def all_tickers_ever(intervals: dict | None = None) -> list[str]:
    """All tickers that were ever S&P 500 members (for price data fetching)."""
    if intervals is None:
        intervals = load_sp500_history()
    return sorted(intervals.keys())


def mask_prices(prices: pd.DataFrame,
                intervals: dict | None = None) -> pd.DataFrame:
    """
    Return a copy of `prices` with NaN set for (date, ticker) pairs where
    the ticker was not an S&P 500 member on that date.

    This is the key survivorship-bias fix: strategies compute signals from
    the masked price data, so they naturally cannot select stocks outside
    the point-in-time index membership.

    Warm-up note: a newly-added member needs sufficient non-NaN history for
    its rolling windows to produce a valid signal. For this codebase's
    ~252-day lookbacks, a stock becomes "selectable" roughly 1 year after
    joining. This is conservative but correct: before that, we have no
    legitimate signal anyway.
    """
    mask = membership_mask(prices.index, intervals, tickers=list(prices.columns))
    return prices.where(mask)
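# Minimal usage sketch (assumes `prices` is a wide close-price DataFrame with a
# DatetimeIndex and Yahoo-style ticker columns; variable names are illustrative):
#
#     from universe_history import load_sp500_history, mask_prices
#
#     intervals = load_sp500_history()               # cached, refreshed daily
#     pit_prices = mask_prices(prices, intervals)    # NaN outside membership windows
#
# Rolling signals computed on pit_prices stay NaN until a new member has enough
# in-index history, which is the warm-up behaviour described in mask_prices.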