Compare commits

...

20 Commits

Author SHA1 Message Date
1906857ffd chore: update README 2026-04-17 19:15:22 +08:00
cad83cec2f Merge branch 'feature/bucket-aware-routing' 2026-04-17 18:16:56 +08:00
43ada0cfc0 feat: add bucket score global router 2026-04-17 17:55:54 +08:00
b5a6fb964c feat: wire bucket identities through driver outputs 2026-04-17 17:52:49 +08:00
3a84c15068 fix: harden bucket routing review follow-up 2026-04-17 15:15:18 +08:00
fa381b5db3 feat: add bucketed service and strict global routing 2026-04-17 15:03:10 +08:00
96019082cc fix: complete global router config and recoverable cluster init 2026-04-17 14:50:47 +08:00
008fe2fe5d fix: reject bucketed configs in cluster constructor 2026-04-17 14:44:21 +08:00
7de38fa998 fix: guard legacy runtime paths for bucketed configs 2026-04-17 14:35:09 +08:00
d8a0796506 fix: close bucketed cluster config model gaps 2026-04-17 14:21:34 +08:00
a723d7a811 feat: model explicit bucketed cluster config 2026-04-17 14:16:56 +08:00
bb280c8ba0 chore: ignore local worktrees 2026-04-17 13:37:25 +08:00
92d593d59b docs: add bucket-aware routing design 2026-04-17 13:26:51 +08:00
82b3e2985f chore 2026-04-17 10:56:30 +08:00
67eef78244 chore: git ignore 2026-04-16 14:30:29 +08:00
996511f300 feat: new router and benchmark setup 2026-04-16 14:23:53 +08:00
c86d931d8f feat(ablate): input-length bucketing + auto-instance sizing
- Add sim.input_length_{min,max} (+ CLI overrides) that drop requests
  outside the bucket after trace load, enabling per-bucket ablation
  (e.g. 0-40k) without rewriting the trace file. Applied uniformly in
  both `run`/`ablate` driver path and `oracle` analysis.

- Add cache_score_strong router (alpha=1, beta=1) to isolate how much
  of cache_affinity's win is reproducible by just retuning beta in the
  existing cache_score framework (no rendezvous, no meta-store bonus).

- Add --auto-instances to ablate: sweeps --auto-candidates ascending
  with --auto-probe-router and picks the smallest cluster size whose
  TTFT mean <= --auto-target-ttft-mean. Per-candidate calibration
  results are persisted under runs/<output_dir>/auto_instances/ so the
  pick is auditable; the chosen N is then used for the whole ablation.
2026-04-15 19:42:28 +08:00
a3f386c858 feat: update ttft modeling and add cache affinity 2026-04-15 19:08:10 +08:00
ff316c6873 fix: cache calculation 2026-04-15 17:31:39 +08:00
365ceac3be chore: update ablation and clean configs 2026-04-15 14:48:59 +08:00
68 changed files with 6786 additions and 1118 deletions

9
.gitignore vendored
View File

@@ -1,12 +1,21 @@
.claude
# Trace files
bailian-traces
# docs
docs
reports
scripts
tests/test_analyze_affinity_policy.py
# Rust build artifacts
/target/
**/*.rs.bk
# Simulation output
/runs/
.worktrees/
# Editor / IDE
.vscode/

560
README.md
View File

@@ -1,345 +1,386 @@
# kvcache-simulator
Discrete-event simulator for cluster-level LLM **prefill** serving with a
two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing.
It replays real production traces against a synthetic cluster so you can
ablate routing strategies and cache sizing without spinning up any GPUs.
The simulator assumes a **PD (prefill/decode) disaggregated** deployment:
only the prefill path is simulated, while decode is reduced to a small
completion tail for TTFT/E2E accounting.
It is intended for answering questions like:

- How much do different KV-aware routers help on the same trace?
- How much HBM/DRAM capacity is enough before routing dominates?
- How do prefix-locality policies behave under bucketed input-length pools?
- What is the gap between online LRU and offline-optimal cache capacity?
## What The Repo Models
- **Architecture-derived prefill cost** from model structure, including MoE,
MLA, GQA, sliding-window attention, and DSA.
- **Two-tier KV hierarchy** with L0 GPU HBM and L1 host DRAM, plus remote
RDMA fetches via a meta-store.
- **Single-pool and bucketed clusters**. Bucketed mode separates the service
into input-length buckets with isolated instance pools and meta-stores.
- **Local instance routing and global bucket routing** with detailed
per-request routing logs.
- **Trace replay with optional input-length filtering** so the same trace can
be sliced into buckets without rewriting the source file.
- **Offline oracle analysis** for unlimited-capacity, Belady, and LRU hit-rate
  ceilings.
- **Token-bucket link contention**: PCIe and RDMA bandwidth are modeled with
  reservation-based token-bucket queues.
## Highlights
- **HF `config.json` auto-loading**: point `model.config_json` at a model
config and the simulator derives architecture parameters automatically.
- **Hardware presets**: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`,
`a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`.
- **18 local router modes** covering baselines, load-based, cache-aware,
affinity, and TTFT-estimating policies.
- **2 global bucket router modes**: `strict_input_length` and `bucket_score`.
- **Detailed outputs**: `summary.json`, `per_request.csv`, `instances.csv`,
`routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable.
## Build

```bash
cargo build --release
# binary: target/release/kvcache-sim
```

If you want the public Qwen trace submodule as well:

```bash
git submodule update --init --recursive
```

## Quick Start

Validate a config without running it (parses the YAML, prints the derived
per-instance block budgets, and dumps the first 5 trace records so you can
sanity-check the path):

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml
```

Run one simulation:

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml
```

`run` prints `summary.json` to stdout and also writes the full output
directory under `sim.output_dir` (see [Outputs](#outputs) below).

Compare several routers on the same trace:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft
```

This writes one subdirectory per router plus a combined `ablation.json`
with side-by-side summaries.

Auto-pick the smallest cluster size that meets a TTFT target, then ablate at
that size:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0
```

Run the oracle to compute theoretical hit-rate ceilings:

```bash
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance
```
## Current Command Boundaries

The repository now supports both legacy single-pool clusters and bucketed
service topologies, but not every CLI path supports both yet.

- `run`: supports `cluster.num_instances` and `cluster.buckets`
- `validate`: supports `cluster.num_instances` and `cluster.buckets`
- `ablate`: currently **single-pool only**
- `ablate --evict-policies`: currently supports **`lru` only**
- `oracle`: currently **single-pool only**
- `--num-instances` override: currently **single-pool only**
- `--auto-instances`: currently **single-pool only**

In practice, bucket-aware experiments are ready in `run`, while fixed-placement
ablation and oracle analysis still reject `cluster.buckets`.
## Config Model

### Single-Pool Cluster

Use `cluster.num_instances` for the original flat instance pool:

```yaml
cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
```

### Bucketed Service

Use `cluster.buckets` plus a `global_router` to model explicit input-length
buckets:

```yaml
cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4
```

Rules enforced by config validation:

- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
- bucket ranges must not overlap
- every bucket must have `num_instances > 0`
- `input_length_min <= input_length_max`
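A minimal sketch of these checks, assuming a hypothetical `BucketConfig`
struct (the real validation lives in the simulator's config module):

```rust
// Hypothetical config struct for the sketch; field names follow the YAML above.
struct BucketConfig {
    name: String,
    input_length_min: u64,
    input_length_max: u64,
    num_instances: usize,
}

fn validate_buckets(buckets: &[BucketConfig]) -> Result<(), String> {
    let mut sorted: Vec<&BucketConfig> = buckets.iter().collect();
    sorted.sort_by_key(|b| b.input_length_min);
    for b in &sorted {
        if b.num_instances == 0 {
            return Err(format!("bucket {} has no instances", b.name));
        }
        if b.input_length_min > b.input_length_max {
            return Err(format!("bucket {} has an inverted length range", b.name));
        }
    }
    // Sorted by range start, so only neighbors can overlap.
    for pair in sorted.windows(2) {
        if pair[1].input_length_min <= pair[0].input_length_max {
            return Err(format!("buckets {} and {} overlap", pair[0].name, pair[1].name));
        }
    }
    Ok(())
}
```

(Mutual exclusivity with `cluster.num_instances` would be checked one level
up, where both fields are visible.)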
### CLI Overrides
These flags work on **all** subcommands and override the YAML in place, so
the same config can be reused across sweeps:
| Flag | Overrides |
|------|-----------|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |
| `--input-length-min <N>` | `sim.input_length_min` |
| `--input-length-max <N>` | `sim.input_length_max` |
Subcommand-specific additions:
- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`,
`--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`,
`--jobs`
- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`
## Routing Modes
### Global Bucket Routers
Configured through `cluster.global_router.mode`.
| Mode | What it does |
|------|---------------|
| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` contains the request. |
| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss. Can intentionally deviate from the strict length bucket. |
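
As a minimal sketch of the contrast (the view type here is hypothetical; the
real implementations live in the simulator's router module):

```rust
// Hypothetical per-bucket view for the sketch; not the simulator's actual struct.
struct BucketSummary {
    input_length_min: u64,
    input_length_max: u64,
}

/// strict_input_length: the unique bucket whose range contains the request length.
/// bucket_score instead scores every bucket (length mismatch, load, predicted
/// miss) and may return a different index than this strict lookup.
fn strict_pick(input_len: u64, buckets: &[BucketSummary]) -> Option<usize> {
    buckets
        .iter()
        .position(|b| (b.input_length_min..=b.input_length_max).contains(&input_len))
}
```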
### Local Instance Routers
Configured through `cluster.router.mode`. All of these names are accepted by
`run`, and any of them can be passed to `ablate --routers` on single-pool
configs.
| Mode | Aliases | What it does |
|------|---------|---------------|
| `random` | | Uniform random baseline. |
| `round_robin` | `rr` | Deterministic round-robin baseline. |
| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. |
| `least_tokens` | `lt` | Minimizes queued token work. |
| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances for actual cache contents and charges probe latency. |
| `min_pd` | `minpd`, `pd` | Minimizes `P * D` using ongoing load and prefix reuse. |
| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. |
| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. |
| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. |
| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. |
| `cache_score` | `cs` | Exponential score over queue length and miss blocks. |
| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than default `cache_score`. |
| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. |
| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. |
| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. |
| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. |
| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. |

For orientation, the modes span a cache-affinity vs. load-balance spectrum:

```
Cache-only                     Hybrid                          Load-only
(hot-spot risk)                                                (cache-blind)
┌───────────┬────────────┬─────────────┬──────────┬──────────┬──────────────┐
 ttl_aware    precise      cache_score    min_pd    prefix_     least_loaded
              cache_load                            affinity    least_tokens
              est_ttft                                          random
```

`prefix_affinity` sits in a unique position: it builds **proactive cache
locality** by consistently routing same-prefix requests to the same
instances (via rendezvous hashing), rather than reactively chasing
existing cache state. This yields the highest L0 hit rates while
maintaining load balance through within-group drain-time-aware selection.
Router tuning knobs in `cluster.router`:

| Field | Default | Used by | Description |
|-------|---------|---------|-------------|
| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families | Weight of queue\_len vs kv\_blocks\_used |
| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` | Load weight in `2^(alpha*load + beta*miss)` |
| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` | Cache-miss weight in `2^(alpha*load + beta*miss)` |
| `prefix_k` | `8` | prefix and affinity fingerprinting | Number of leading blocks in the prefix fingerprint |
| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` | Top-K affinity candidates (0 = auto: n/8, min 2) |
| `precise_probe_latency_us` | `50.0` | `precise` | Simulated per-probe latency (microseconds) |
| `precise_probe_topk` | `4` | `precise` | Number of instances probed |
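
To make the `cache_score` family concrete, here is a selection-rule sketch
under assumed field names (since the exponential is monotone, comparing
exponents is equivalent to comparing `2^(alpha*load + beta*miss)`):

```rust
// Sketch only; the real router lives in the simulator's router module.
struct InstanceView {
    queue_len: f64,
    miss_blocks: f64, // prefix blocks this instance would have to fetch or recompute
}

/// Returns the index minimizing 2^(alpha*queue_len + beta*miss_blocks).
fn cache_score_pick(instances: &[InstanceView], alpha: f64, beta: f64) -> Option<usize> {
    instances
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            let sa = alpha * a.queue_len + beta * a.miss_blocks;
            let sb = alpha * b.queue_len + beta * b.miss_blocks;
            sa.partial_cmp(&sb).unwrap()
        })
        .map(|(i, _)| i)
}
```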
## Model And Hardware Configuration
### Model Config
Recommended pattern:
```yaml
model:
config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1           # required (not in HF schema)
  block_size_tokens: 512   # required (not in HF schema)
```

Notes:

- `config_json` is resolved relative to the YAML file
- explicit YAML fields override values loaded from the model config
- `compute_dtype` selects the compute FLOPS tier
- `weight_dtype` controls model-weight bytes separately from KV-cache bytes
- `dtype_bytes` sizes the KV cache

The architecture loader auto-detects:

| Feature | Detection trigger | What it extracts |
|-----------|-------------------------------|----------------------------------------------|
| **MoE** | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
| **MLA** | `kv_lora_rank` present | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim |
| **DSA** | `first_k_dense_replace` present | Dense window, sparse stride, first dense layers |
| **Sliding window** | `sliding_window` present | Window size |
| **GQA** | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention |

### Inline specification

Alternatively, specify architecture fields directly:

```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```

When `hidden_size` is present, the compute model is auto-derived
(architecture mode). Without it, you must supply legacy manual
coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).

### Bundled model configs

| Model | Path | Architecture |
|-------|------|--------------|
| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |

### Hardware Presets
Set `hardware.type` to a preset name; individual fields can override it:

```yaml
hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```

Available preset families:

- `h100`, `h800`, `h20`, `h20-141g`
- `a100-80gb`, `a100-40gb`
- `b200`, `b300`
- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300`

Per-GPU preset specs:

| Preset | FLOPS | HBM | Mem BW | PCIe |
|-------------|------------|---------|------------|------|
| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 |
| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 |
| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 |
| `b200` | 2.25 PFLOPS| 192 GB | 8.0 TB/s | Gen6 |

Hardware can also be specified inline without a preset by setting
`gpu_flops` and `gpu_mem_bw` directly (e.g. `gpu_flops: 1.80e16`,
`gpu_mem_bw: 6.40e13` for an aggregated 8xB200 tensor-parallel group).

The `2x`/`4x`/`8x` prefixes denote tensor-parallel groups (e.g. `8xh20`):
FLOPS, memory bandwidth, and HBM scale linearly, while RDMA and DRAM are
set to sensible per-node defaults.
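
A sketch of that linear scaling rule (the struct and function are
illustrative, not the simulator's actual preset API):

```rust
// Illustrative preset type; only the linearly scaled fields are shown.
#[derive(Clone, Copy)]
struct GpuPreset {
    gpu_flops: f64,
    gpu_mem_bw: f64,
    hbm_bytes: f64,
}

/// `8xb200` behaves like `scale_tp(b200, 8.0)`: FLOPS, memory bandwidth,
/// and HBM scale linearly; RDMA/DRAM stay per-node and are set separately.
fn scale_tp(base: GpuPreset, tp: f64) -> GpuPreset {
    GpuPreset {
        gpu_flops: base.gpu_flops * tp,
        gpu_mem_bw: base.gpu_mem_bw * tp,
        hbm_bytes: base.hbm_bytes * tp,
    }
}
```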
## Architecture-Aware Compute Model

The simulator derives a **roofline prefill model** from the model
architecture:

```
prefill_time(N tokens) = max(compute_time, memory_time)

compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
```

- **MoE**: only active experts contribute to FLOPs and weight streaming
  (shared experts are always counted)
- **MLA**: compressed KV projections reduce attention FLOPs; the KV cache
  uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`,
  with the first K layers using full dense attention
- **GQA**: fewer KV heads reduce both attention compute and KV cache size
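
A numeric sketch of this roofline, with assumed parameter names and the
first-K-dense-layers detail of DSA omitted for brevity:

```rust
// Sketch only; parameter names are assumptions, not the simulator's API.
struct Roofline {
    layers: f64,
    linear_flops: f64,           // per-token linear-layer FLOPs per layer
    attn_coeff: f64,             // quadratic attention coefficient
    weight_bytes_per_layer: f64, // streamed weight bytes per layer
    gpu_flops: f64,
    gpu_mem_bw: f64,
    dense_window: f64,
    sparse_stride: f64,
}

impl Roofline {
    /// DSA effective context: dense inside the window, strided beyond it.
    fn effective_ctx(&self, n: f64) -> f64 {
        n.min(self.dense_window) + (n - self.dense_window).max(0.0) / self.sparse_stride
    }

    /// max(compute_time, memory_time) for an N-token prefill.
    fn prefill_time(&self, n: f64) -> f64 {
        let compute = self.layers
            * (n * self.linear_flops + self.attn_coeff * n * self.effective_ctx(n))
            / self.gpu_flops;
        let memory = self.layers * self.weight_bytes_per_layer / self.gpu_mem_bw;
        compute.max(memory)
    }
}
```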
## Bundled Configs

Representative configs in `configs/`:
| Config | Notes |
|--------|-------|
| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. |
| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. |
| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. |
Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment
points that use `sim.input_length_min` and `sim.input_length_max`.
## Trace Inputs
This repository currently contains two trace sources:
- `bailian-traces/`
- `glm_coder_blksz_512_040915-040917.jsonl`
- `qwen3_coder_blksz_512_040915-040917.jsonl`
- `qwen-bailian-usagetraces-anon/` submodule
- public 16-token-block Qwen traces such as
`qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl`

The submodule is the public Alibaba
[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
dataset:

| Trace | Requests | Description |
|-------|----------|-------------|
| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |
The simulator expects JSONL records with fields like:
```json
{
"chat_id": 159,
"parent_chat_id": 55,
"timestamp": 61.114,
"input_length": 521,
"output_length": 132,
"type": "text",
"turn": 2,
"hash_ids": [1089, 1090, 1091]
}
```
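
A minimal parsing sketch for records of this shape, assuming `serde` and
`serde_json` as dependencies (the simulator's own record type may differ;
unknown fields such as `type` are ignored by serde by default):

```rust
use serde::Deserialize;

// Hypothetical record struct mirroring the JSON example above.
#[derive(Debug, Deserialize)]
struct RequestRecord {
    chat_id: u64,
    #[serde(default)]
    parent_chat_id: Option<u64>,
    timestamp: f64,
    input_length: u64,
    output_length: u64,
    #[serde(default)]
    turn: Option<u32>,
    hash_ids: Vec<u64>, // content hashes of fixed-size token blocks
}

fn parse_line(line: &str) -> serde_json::Result<RequestRecord> {
    serde_json::from_str(line)
}
```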
`hash_ids` are content hashes of fixed-size token blocks; only the input
side of each record drives cache behavior. Only prefill-side behavior is
modeled; `output_length` is used only for a decode tail in completion
metrics.
## Outputs

Each `run` writes a directory under `sim.output_dir`:

| File | Contents |
|------|----------|
| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. |
| `per_request.csv` | Per-request latency and cache stats, including `bucket`, `instance`, and `length_bucket_match`. |
| `instances.csv` | Periodic per-instance samples with `bucket`, `instance`, `queue_len`, and KV usage. |
| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. |

Additional outputs:

- `ablate`: writes `ablation.json` with one summary per router
- `oracle`: writes `oracle.json` with the three hit-rate analyses
- `ablate --auto-instances`: writes calibration runs under
  `<output_dir>/auto_instances/`
Quick inspection examples:

```bash
# Pretty-print the summary
jq . runs/glm5_8xb200/summary.json

# Compare all routers from an ablation, sorted by TTFT
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
```
## Oracle Semantics

`oracle` computes three hit-rate references at a chosen cache capacity:

- `unlimited.hit_rate`: absolute ceiling with infinite capacity
- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity
- `lru_finite.hit_rate`: LRU at the same capacity

When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still
feeds the full trace into cache state but only counts requests inside the
selected input-length range. That matches the intended "measure one bucket
inside a mixed workload" interpretation.

The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The
gap from `belady_finite` to `unlimited` is pure capacity headroom.
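
For intuition, a self-contained sketch of the Belady computation behind
`belady_finite.hit_rate`, treating the trace as a flat block-hash sequence
(the real oracle additionally applies the input-length filter described
above):

```rust
use std::collections::HashMap;

// Offline-optimal (Belady) hit counting: on a miss with a full cache,
// evict the resident block whose next use is farthest in the future.
fn belady_hit_rate(accesses: &[u64], capacity: usize) -> f64 {
    if accesses.is_empty() {
        return 0.0;
    }
    // next_use[i]: position of the next access to accesses[i], or usize::MAX.
    let mut next_use = vec![usize::MAX; accesses.len()];
    let mut last_pos: HashMap<u64, usize> = HashMap::new();
    for (i, &blk) in accesses.iter().enumerate().rev() {
        next_use[i] = last_pos.get(&blk).copied().unwrap_or(usize::MAX);
        last_pos.insert(blk, i);
    }

    let mut cache: HashMap<u64, usize> = HashMap::new(); // block -> its next use
    let mut hits = 0usize;
    for (i, &blk) in accesses.iter().enumerate() {
        if cache.remove(&blk).is_some() {
            hits += 1;
        } else if cache.len() >= capacity {
            let victim = cache.iter().max_by_key(|entry| *entry.1).map(|(k, _)| *k).unwrap();
            cache.remove(&victim);
        }
        // Blocks never reused need not occupy capacity under optimal eviction.
        if next_use[i] != usize::MAX {
            cache.insert(blk, next_use[i]);
        }
    }
    hits as f64 / accesses.len() as f64
}
```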
## Testing
cargo test --release
```
The test suite covers config parsing, hardware presets, routing behavior,
bucket-aware service semantics, oracle logic, and smoke-style end-to-end
runs, including an integration smoke test that replays a synthetic
shared-prefix trace and asserts the expected hit-rate ordering.

View File

@@ -1,68 +0,0 @@
# GLM-5 (zai-org/GLM-5) on 8 x B200 SXM (192GB each).
# Architecture from HuggingFace config.json — all roofline coefficients
# are derived automatically.
model:
name: glm-5
# Core architecture (from HF config.json)
num_layers: 78
hidden_size: 6144
num_attention_heads: 64
num_kv_heads: 64 # nominal; MLA overrides KV cache sizing
head_dim: 64
intermediate_size: 12288 # shared expert FFN width
dtype_bytes: 2 # BF16
block_size_tokens: 512 # matches bailian-traces blksz_512
# MoE: 256 routed + 1 shared, 8 active per token
moe:
num_experts: 256
num_active_experts: 8
num_shared_experts: 1
expert_intermediate_size: 2048 # moe_intermediate_size
# MLA (Multi-head Latent Attention): compressed KV cache
mla:
kv_lora_rank: 512
q_lora_rank: 2048
qk_nope_head_dim: 192
qk_rope_head_dim: 64
v_head_dim: 256
# DSA (DeepSeek Sparse Attention): sub-quadratic past dense_window
attention:
type: dsa
dense_window: 4096
sparse_stride: 8
first_dense_layers: 3
hardware:
# Aggregate of 8 x B200 in one tensor-parallel group.
gpu_flops: 1.80e16 # 8 * 2.25 PFLOPS BF16 dense
gpu_mem_bw: 6.40e13 # 8 * 8 TB/s HBM3e
# KV budget after FP8 weights + activations. GLM-5 FP8 ~744GB of 1536GB.
hbm_bytes: 500.0e9
dram_bytes: 1.5e12 # ~1.5 TB usable CPU DRAM / v6d per node
pcie_bw: 128.0e9 # PCIe Gen6 x16
pcie_latency_us: 4.0
rdma_bw: 50.0e9 # ConnectX-7 400 Gbps
rdma_latency_us: 6.0
max_batch_slots: 256
prefill_chunk_tokens: 4096
cluster:
num_instances: 64
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_8xb200_blk512
sample_interval_s: 1.0
seed: 42

View File

@@ -1,40 +0,0 @@
# GLM-5 using HuggingFace config.json + hardware preset.
#
# This config demonstrates the simplified format:
# model.config_json — loads architecture from HF config.json
# hardware.type — loads GPU specs from built-in preset
#
# Only deployment-specific fields need to be set explicitly.
# Any field from config_json or the preset can be overridden in YAML.
model:
# Auto-detect architecture: MoE, MLA, DSA, head dims, etc.
config_json: ../models/GLM-5/config.json
name: glm-5 # override HF model_type
dtype_bytes: 1 # FP8 KV cache (not in HF config.json)
block_size_tokens: 512 # matches bailian-traces blksz_512
hardware:
type: 8xb200 # 8 x B200 SXM (192GB each)
# Override preset values for this specific deployment:
hbm_bytes: 500.0e9 # KV budget after FP8 weights + activations
dram_bytes: 1.5e12 # ~1.5 TB usable CPU DRAM per node
max_batch_slots: 256
cluster:
num_instances: 32
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_8xb200_hf
sample_interval_s: 1.0
seed: 42

View File

@@ -1,66 +1,39 @@
# GLM-5 (zai-org/GLM-5) served as a single tensor-parallel instance on
# 8 x NVIDIA B200 SXM (192GB each, 1.5 TB aggregate HBM).
# GLM-5 using HuggingFace config.json + hardware preset.
#
# GLM-5 is a 744B-total / 40B-active Mixture-of-Experts model (BF16),
# using DeepSeek Sparse Attention (DSA). The HF card does not publish
# layer/head shapes, so the values below are reasonable estimates based
# on the GLM-4.5 lineage; adjust once the official config.json is public.
# This config demonstrates the simplified format:
# model.config_json — loads architecture from HF config.json
# hardware.type — loads GPU specs from built-in preset
#
# Hardware values below represent the *aggregate* of the 8-GPU TP group
# (one simulated "instance" == one 8xB200 serving replica). This is how
# the roofline in src/instance/compute.rs wants to see it: gpu_flops and
# gpu_mem_bw are the effective peaks seen by the TP'd model.
#
# Calibrate `flops_per_token_prefill` and `attn_quadratic_coeff` against
# measured prefill latency before trusting absolute TTFT numbers.
# Only deployment-specific fields need to be set explicitly.
# Any field from config_json or the preset can be overridden in YAML.
model:
name: glm-5
# --- estimates; refine from official config.json when available ---
num_layers: 92
num_kv_heads: 8 # GQA
head_dim: 128
dtype_bytes: 2 # BF16
block_size_tokens: 16 # trace convention
# Active-params-driven roofline: MoE activates ~40B params per token,
# so non-attention prefill FLOPs/token ≈ 2 * 40e9 = 8e10.
flops_per_token_prefill: 8.0e10
# Quadratic attention term ≈ 2 * num_heads * head_dim. GLM-5 uses
# DeepSeek Sparse Attention which is sub-quadratic in practice, so
# this coefficient is an upper bound — lower it if your measurements
# show DSA kicking in for long prompts.
attn_quadratic_coeff: 2048.0
bytes_per_token_prefill: 0.0
# Auto-detect architecture: MoE, MLA, DSA, head dims, etc.
config_json: ../models/GLM-5/config.json
name: glm-5 # override HF model_type
dtype_bytes: 1 # FP8 KV cache (not in HF config.json)
block_size_tokens: 512 # matches bailian-traces blksz_512
hardware:
# Aggregate of 8 x B200 in one tensor-parallel group.
gpu_flops: 1.80e16 # 8 * 2.25 PFLOPS BF16 dense
gpu_mem_bw: 6.40e13 # 8 * 8 TB/s HBM3e
# KV-cache budget after weights + activations. GLM-5 @ BF16 is ~1.49TB,
# which barely fits in 1.5TB HBM; realistic serving uses FP8 weights
# (~744GB), leaving ~500GB for activations + KV cache. Adjust if your
# deployment uses a different weight dtype.
hbm_bytes: 500.0e9
dram_bytes: 1.5e12 # ~1.5 TB usable CPU DRAM / v6d per node
pcie_bw: 128.0e9 # PCIe Gen6 x16 ~ 128 GB/s per direction
pcie_latency_us: 4.0
rdma_bw: 50.0e9 # ConnectX-7 400 Gbps ≈ 50 GB/s
rdma_latency_us: 6.0
max_batch_slots: 256
prefill_chunk_tokens: 2048
type: 8xb200 # 8 x B200 SXM (192GB each)
# Override preset values for this specific deployment:
hbm_bytes: 500.0e9 # KV budget after FP8 weights + activations
dram_bytes: 1.5e12 # ~1.5 TB usable CPU DRAM per node
max_batch_slots: 256
cluster:
num_instances: 8 # 8 TP replicas -> 64 B200s cluster-wide
num_instances: 32
meta_store:
ttl_seconds: 120.0
ttl_seconds: 300.0
router:
mode: ttl_aware
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_8xb200
sample_interval_s: 1.0

View File

@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G for the 0-32k bucket.
# Chosen to keep the best policy's mean TTFT below 5s.
model:
config_json: ../models/GLM-5-FP8/config.json
name: glm-5-fp8
compute_dtype: fp8
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xh20-141g
hbm_bytes: 300.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 56
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_fp8_8xh20_141g_ablation_0_32768_n56
sample_interval_s: 1.0
seed: 42
input_length_min: 0
input_length_max: 32768

View File

@@ -0,0 +1,38 @@
# GLM-5-FP8 (ZhipuAI/GLM-5-FP8) on 8 x H20-141G (N3E).
# Tuned for the 0-32768 input-length slice of
# bailian-traces/glm_coder_blksz_512_040915-040917.jsonl.
model:
config_json: ../models/GLM-5-FP8/config.json
name: glm-5-fp8
compute_dtype: fp8
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xh20-141g
hbm_bytes: 300.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 64
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity
precise_probe_latency_us: 50.0
precise_probe_topk: 4
# Tuned on this filtered GLM coder workload: stronger queue penalty than
# the default 1.0 keeps cache_affinity's locality gains while reducing TTFT.
load_alpha: 1.5
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_fp8_8xh20_141g_ca_tuned
sample_interval_s: 1.0
seed: 42
input_length_min: 0
input_length_max: 32768

View File

@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G, 0-32768 slice.
# Analysis config: medium L1 (~10% of the default DRAM KV budget).
model:
config_json: ../models/GLM-5-FP8/config.json
name: glm-5-fp8
compute_dtype: fp8
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xh20-141g
hbm_bytes: 300.0e9
dram_bytes: 1.5e11
max_batch_slots: 256
cluster:
num_instances: 64
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_fp8_8xh20_141g_l1_medium
sample_interval_s: 1.0
seed: 42
input_length_min: 0
input_length_max: 32768

View File

@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G, 0-32768 slice.
# Analysis config: effectively disable L1/remote KV by shrinking DRAM to ~1 block.
model:
config_json: ../models/GLM-5-FP8/config.json
name: glm-5-fp8
compute_dtype: fp8
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xh20-141g
hbm_bytes: 300.0e9
dram_bytes: 1.0
max_batch_slots: 256
cluster:
num_instances: 64
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_fp8_8xh20_141g_l1_none
sample_interval_s: 1.0
seed: 42
input_length_min: 0
input_length_max: 32768

View File

@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G, 0-32768 slice.
# Analysis config: small L1 (~1% of the default DRAM KV budget).
model:
config_json: ../models/GLM-5-FP8/config.json
name: glm-5-fp8
compute_dtype: fp8
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xh20-141g
hbm_bytes: 300.0e9
dram_bytes: 1.5e10
max_batch_slots: 256
cluster:
num_instances: 64
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_fp8_8xh20_141g_l1_small
sample_interval_s: 1.0
seed: 42
input_length_min: 0
input_length_max: 32768

View File

@@ -0,0 +1,39 @@
# GLM-5-FP8 (ZhipuAI/GLM-5-FP8) on 8 x H20-141G (N3E).
# Architecture auto-loaded from the upstream ModelScope config.json.
#
# 8 x 141 GB = 1128 GB total HBM. With ~744 GB FP8 weights resident,
# keep the KV budget conservative to leave room for scales, BF16 holdouts,
# allocator slack, and runtime activations.
model:
config_json: ../models/GLM-5-FP8/config.json
name: glm-5-fp8
compute_dtype: fp8
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xh20-141g
hbm_bytes: 300.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 64
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_fp8_8xh20_141g
sample_interval_s: 1.0
seed: 42
input_length_min: 0
input_length_max: 32768

View File

@@ -0,0 +1,36 @@
# GLM-5-NVFP4 on 8 x B200 for the 32k-85k bucket.
# Chosen to keep the best policy's mean TTFT below 5s.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb200
hbm_bytes: 1150.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 5
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_8xb200_ablation_32769_87040_n5
sample_interval_s: 1.0
seed: 42
input_length_min: 32769
input_length_max: 87040

View File

@@ -0,0 +1,36 @@
# GLM-5-NVFP4 (nvidia/GLM-5-NVFP4) on 8 x B200 (192GB each).
# Architecture auto-loaded from HuggingFace config.json.
#
# FP4 weights: ~744B params * 0.5 bytes = ~372 GB across 8 GPUs.
# Total HBM: 8 * 192 GB = 1536 GB. Keep the KV budget below the raw
# remainder to leave room for runtime activations and allocator slack.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb200
hbm_bytes: 1150.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 8
meta_store:
ttl_seconds: 300.0
router:
mode: prefix_affinity
prefix_k: 8
load_alpha: 1.0
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_8xb200
sample_interval_s: 1.0
seed: 42

View File

@@ -0,0 +1,37 @@
# GLM-5-NVFP4 on 8 x B300 for the 128k++ bucket.
# A single instance already keeps mean TTFT below 5s, and routing is
# effectively irrelevant at N=1 because every request lands on the same node.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb300
hbm_bytes: 1900.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 1
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity
precise_probe_latency_us: 50.0
precise_probe_topk: 1
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_8xb300_ablation_131073_plus_n1
sample_interval_s: 1.0
seed: 42
input_length_min: 131073
input_length_max: 4294967295

View File

@@ -0,0 +1,36 @@
# GLM-5-NVFP4 on 8 x B300 for the 85k-128k bucket.
# Chosen to keep the best policy's mean TTFT below 5s.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb300
hbm_bytes: 1900.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 2
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity_strong_only
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
prefix_k: 8
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_8xb300_ablation_87041_131072_n2
sample_interval_s: 1.0
seed: 42
input_length_min: 87041
input_length_max: 131072

View File

@@ -7,7 +7,8 @@
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp4 # FP4 weights → selects FP4 tensor core FLOPS
compute_dtype: fp8 # FP8 tensor-core execution
weight_dtype: fp4 # NVFP4 weights still set the HBM budget
dtype_bytes: 1 # FP8 KV cache
block_size_tokens: 512

View File

@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B200 for the 32k-85k bucket.
# NVFP4 weights, FP8 compute. Chosen to keep the best policy below 5 s TTFT.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb200
hbm_bytes: 1150.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 9
meta_store:
ttl_seconds: 300.0
router:
mode: min_pd
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_fp8compute_8xb200_ablation_32769_87040_n9
sample_interval_s: 1.0
seed: 42
input_length_min: 32769
input_length_max: 87040

View File

@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B200 with FP8 tensor-core compute.
# Weights remain stored in NVFP4, so HBM budget follows FP4 storage.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb200
hbm_bytes: 1150.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 8
meta_store:
ttl_seconds: 300.0
router:
mode: prefix_affinity
prefix_k: 8
load_alpha: 1.0
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_fp8compute_8xb200
sample_interval_s: 1.0
seed: 42

View File

@@ -0,0 +1,33 @@
# GLM-5-NVFP4 on 8 x B300 for the 128k++ bucket.
# NVFP4 weights, FP8 compute. Routing is effectively irrelevant at one instance.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb300
hbm_bytes: 1900.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 1
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity
precise_probe_topk: 1
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_fp8compute_8xb300_ablation_131073_plus_n1
sample_interval_s: 1.0
seed: 42
input_length_min: 131073
input_length_max: 4294967295

View File

@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B300 for the 85k-128k bucket.
# NVFP4 weights, FP8 compute. Chosen to keep the best policy below 5 s TTFT.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb300
hbm_bytes: 1900.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 4
meta_store:
ttl_seconds: 300.0
router:
mode: cache_affinity
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_fp8compute_8xb300_ablation_87041_131072_n4
sample_interval_s: 1.0
seed: 42
input_length_min: 87041
input_length_max: 131072

View File

@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B300 with FP8 tensor-core compute.
# Weights remain stored in NVFP4, so HBM budget follows FP4 storage.
model:
config_json: ../models/GLM-5-NVFP4/config.json
name: glm-5-nvfp4
compute_dtype: fp8
weight_dtype: fp4
dtype_bytes: 1
block_size_tokens: 512
hardware:
type: 8xb300
hbm_bytes: 1900.0e9
dram_bytes: 1.5e12
max_batch_slots: 256
cluster:
num_instances: 8
meta_store:
ttl_seconds: 300.0
router:
mode: prefix_affinity
prefix_k: 8
load_alpha: 1.0
sim:
trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/glm5_nvfp4_fp8compute_8xb300
sample_interval_s: 1.0
seed: 42

View File

@@ -1,42 +0,0 @@
# Qwen2.5-Coder-32B (dense, GQA) on H800 SXM (80GB).
# Architecture from HuggingFace config.json — roofline auto-derived.
model:
name: qwen2.5-coder-32b
num_layers: 64
hidden_size: 5120
num_attention_heads: 40
num_kv_heads: 8 # GQA
head_dim: 128
intermediate_size: 27648 # SwiGLU FFN
dtype_bytes: 2 # BF16
block_size_tokens: 16
hardware:
gpu_flops: 9.89e14
gpu_mem_bw: 3.35e12
hbm_bytes: 20.0e9 # smaller budget: 32B weights are large
dram_bytes: 512.0e9
pcie_bw: 64.0e9
pcie_latency_us: 5.0
rdma_bw: 25.0e9
rdma_latency_us: 8.0
max_batch_slots: 128
prefill_chunk_tokens: 1024
cluster:
num_instances: 16
meta_store:
ttl_seconds: 60.0
router:
mode: ttl_aware
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
sim:
trace_path: traces/qwen_coder_blksz_16.jsonl
max_requests: null
output_dir: runs/qwen32b
sample_interval_s: 1.0
seed: 42

View File

@@ -1,42 +0,0 @@
# Qwen2.5-Coder-7B (dense, GQA) on a single H800 SXM (80GB).
# Architecture from HuggingFace config.json — roofline auto-derived.
model:
name: qwen2.5-coder-7b
num_layers: 28
hidden_size: 3584
num_attention_heads: 28
num_kv_heads: 4 # GQA: 28 query heads, 4 KV heads
head_dim: 128
intermediate_size: 18944 # SwiGLU FFN
dtype_bytes: 2 # BF16
block_size_tokens: 16 # matches qwen_coder_blksz_16 trace
hardware:
gpu_flops: 9.89e14 # H800 bf16 dense
gpu_mem_bw: 3.35e12 # 3.35 TB/s HBM3
hbm_bytes: 60.0e9 # leave headroom for weights/activations
dram_bytes: 512.0e9
pcie_bw: 64.0e9 # PCIe Gen5 x16
pcie_latency_us: 5.0
rdma_bw: 25.0e9 # ~200 Gbps NIC
rdma_latency_us: 8.0
max_batch_slots: 256
prefill_chunk_tokens: 2048
cluster:
num_instances: 16
meta_store:
ttl_seconds: 60.0
router:
mode: ttl_aware
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
sim:
trace_path: qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl
max_requests: null
output_dir: runs/qwen7b
sample_interval_s: 1.0
seed: 42

View File

@@ -1,36 +0,0 @@
# Qwen2.5-Coder-7B using hardware preset.
#
# Model architecture is specified inline (no config.json needed for simple
# models). Hardware uses preset "h800" with a single override for hbm_bytes.
model:
name: qwen2.5-coder-7b
num_layers: 28
hidden_size: 3584
num_attention_heads: 28
num_kv_heads: 4
head_dim: 128
intermediate_size: 18944
dtype_bytes: 2
block_size_tokens: 16
hardware:
type: h800 # single H800 SXM (80GB)
hbm_bytes: 60.0e9 # KV budget after 7B model weights
cluster:
num_instances: 16
meta_store:
ttl_seconds: 60.0
router:
mode: ttl_aware
precise_probe_latency_us: 50.0
precise_probe_topk: 4
load_alpha: 1.0
sim:
trace_path: qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl
max_requests: null
output_dir: runs/qwen7b_preset
sample_interval_s: 1.0
seed: 42

View File

@@ -5,16 +5,17 @@ model:
config_json: ../models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json
name: qwen3-coder-480b
dtype_bytes: 1 # FP8 inference
block_size_tokens: 16
block_size_tokens: 512
hardware:
type: 8xh20
hbm_bytes: 400.0e9 # KV budget after FP8 weights on 8x96GB
dram_bytes: 1.0e12 # ~1.0 TB usable CPU DRAM per node
cluster:
num_instances: 32
num_instances: 128
meta_store:
ttl_seconds: 120.0
ttl_seconds: 300.0
router:
mode: min_pd
precise_probe_latency_us: 50.0
@@ -22,7 +23,7 @@ cluster:
load_alpha: 1.0
sim:
trace_path: traces/qwen_coder_blksz_16.jsonl
trace_path: bailian-traces/qwen3_coder_blksz_512_040915-040917.jsonl
max_requests: null
output_dir: runs/qwen3_coder_8xh20
sample_interval_s: 1.0

View File

@@ -0,0 +1,449 @@
# Bucket-Aware Routing Design
## Background

The simulator currently has a single global `Cluster`:

- every request in trace replay shares one set of `instances`
- the router picks a target instance directly from the global instance pool
- the `meta_store`, L0/L1 cache visibility, and remote RDMA are all globally shared

This does not match the target architecture, which requires:

- multiple explicitly defined input-length buckets within one service
- an independent instance pool per bucket, with instance counts given explicitly in the config
- strict isolation of cache / meta-store / remote visibility between buckets
- a router that is not just an intra-bucket scheduler but can see buckets from a global viewpoint
- follow-up studies the simulator must support:
  - whether buckets should exist at all
  - the difference between strict and non-strict input-length dispatch
  - how bucket policy couples with intra-bucket instance policy, and what each contributes

Therefore this refactor must not hard-code "pick the bucket by input length first" into the service layer; bucket selection has to become part of the global router's decision surface.

## Goals

The goal of this design is to refactor the simulator into a two-level scheduling architecture:

1. a global router chooses the target bucket among all buckets
2. a local router chooses the target instance within the chosen bucket

subject to:

- buckets are defined explicitly in the config file
- all buckets share a single local-router configuration
- buckets are fully isolated and share no L0/L1/meta-store/remote view
- enough metrics / routing logs are kept to study the effect of bucket policies later
- the existing instance-level routing implementations in `src/router/*` are reused as much as possible, instead of rewriting every router as a flat cross-bucket scorer

Non-goals:

- phase one does not support per-bucket custom router algorithms
- phase one does not support cross-bucket cache sharing
- phase one does not auto-derive bucket boundaries or per-bucket instance counts
- phase one does not implement "global flat instance pool" semantics

## Design Alternatives

### Option A: the service layer always picks the bucket by input length; the router only picks instances within the bucket

Pros:

- least invasive to the existing code
- local routers barely change

Cons:

- cannot study "what happens if the router does not dispatch strictly by input length"
- the bucket policy is baked in and cannot be compared independently of the instance policy
- the global viewpoint can only observe, never actually decide

### Option B: two-level routing; a global router picks the bucket, a local router picks the instance inside it

Pros:

- bucket policy and instance policy are cleanly decoupled
- matches the target architecture and suits controlled experiments
- reuses the existing router implementations as local routers to the largest extent

Cons:

- needs a new service-level summary view and a global router interface
- driver / events / metrics must carry the bucket dimension explicitly

### Option C: a global flat router that scores all instances across buckets together

Pros:

- superficially the most flexible

Cons:

- bucket policy and instance policy get entangled, hurting experimental interpretability
- most existing routers would need rewriting
- it dilutes the physical boundary that a bucket is an independent instance pool

Recommendation: Option B.

Rationale: buckets are a service-level topology and isolation boundary, while instance selection is a local scheduling problem inside a bucket. The two levels should be modeled separately; otherwise it becomes impossible to cleanly answer "are buckets themselves valuable" and "is the instance-level routing algorithm effective".

## Configuration Design

`cluster` grows from today's single-pool configuration into an explicit bucket configuration.

Target YAML shape:
```yaml
cluster:
meta_store:
ttl_seconds: 1000.0
router:
mode: cache_affinity
precise_probe_latency_us: 10.0
precise_probe_topk: 4
load_alpha: 0.1
score_alpha: 1.0
score_beta: 0.1
prefix_k: 8
affinity_fan_out: 0
global_router:
mode: strict_input_length
length_penalty_weight: 1.0
load_weight: 1.0
cache_weight: 1.0
buckets:
- name: short
input_length_min: 0
input_length_max: 32768
num_instances: 3
- name: medium
input_length_min: 32769
input_length_max: 81920
num_instances: 4
- name: long
input_length_min: 81921
input_length_max: 131072
num_instances: 3
```
Here:

- `cluster.router` remains the intra-bucket local-router config, shared globally
- the new `cluster.global_router` selects the bucket-choice policy
- `cluster.buckets` explicitly describes the service topology

Phase one should stay compatible with old configs:

- if only `cluster.num_instances` is given, run in single-bucket mode
- if `cluster.buckets` is given, run in multi-bucket mode
- if both appear, fail immediately to avoid ambiguity

Config validation constraints:

1. `buckets` is non-empty
2. every bucket has `num_instances > 0`
3. `input_length_min <= input_length_max`
4. bucket ranges do not overlap
5. after sorting, a request length must hit exactly one bucket
6. old and new modes are mutually exclusive

## Runtime Architecture

The runtime splits into three layers.

### 1. BucketedService

A new service-level object that owns the buckets. Each bucket holds:

- a `bucket_id`
- the bucket config
- an independent `Cluster`

`BucketedService` is responsible for:

- building the per-bucket summary views for a request
- calling the global router to choose a bucket
- forwarding the request to the chosen bucket's `Cluster`
- exposing iteration over all buckets / instances so the driver can sample and schedule ticks

### 2. Cluster

The existing `Cluster` narrows to "the cluster inside one bucket":

- it owns only that bucket's `instances`
- it owns only that bucket's `meta_store`
- it runs only that bucket's local router

`Cluster::route_and_admit` no longer performs bucket selection; it only:

- calls the local router to pick an instance within the bucket
- executes the bucket-local L0/L1/remote/miss paths
- returns admission stats augmented with the bucket dimension

### 3. Router layering

Routers split explicitly into:

- `GlobalRouter`: bucket selection
- `LocalRouter`: instance selection within a bucket

Most of the existing algorithms in `src/router/*` migrate to `LocalRouter` implementations.

## Router Interface Design

### GlobalRouter

The global router sees only bucket summaries and never touches instance arrays directly.

Proposed interface:
```rust
trait GlobalRouter {
fn name(&self) -> &'static str;
fn route_bucket(
&mut self,
req: &RequestRecord,
buckets: &[BucketView],
now: f64,
) -> GlobalRouteDecision;
}
```
`BucketView` is a read-only summary containing at least:

- `bucket_id`
- `name`
- `input_length_min`
- `input_length_max`
- `num_instances`
- `queue_len_sum`
- `queue_len_max`
- `kv_blocks_used_sum`
- `kv_blocks_total_sum`
- `active_requests`
- `predicted_prefix`
- optionally `estimated_drain_time`

`predicted_prefix` is the bucket-level prefix-hit prediction for the current request; it lets the global router reason about bucket-level cache affinity.
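
For concreteness, a sketch of the view built from the field list above
(`BucketId` as in `src/router`; the exact types are assumptions and the
final struct may differ):

```rust
// Read-only per-bucket summary handed to the GlobalRouter.
pub struct BucketView {
    pub bucket_id: BucketId,
    pub name: String,
    pub input_length_min: u64,
    pub input_length_max: u64,
    pub num_instances: usize,
    pub queue_len_sum: usize,
    pub queue_len_max: usize,
    pub kv_blocks_used_sum: u64,
    pub kv_blocks_total_sum: u64,
    pub active_requests: usize,
    /// Bucket-level prefix-hit prediction (in blocks) for the current request.
    pub predicted_prefix: u64,
    /// Optional drain-time estimate for the bucket's queues.
    pub estimated_drain_time: Option<f64>,
}
```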
### LocalRouter

The local router keeps taking the bucket-internal instance pool as input.

The proposed interface stays close to the current semantics:
```rust
trait LocalRouter {
fn name(&self) -> &'static str;
fn route_instance(
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
) -> LocalRouteDecision;
}
```
The existing `RouteDecision` splits into two layers that are then merged back into one externally visible log structure:

- `GlobalRouteDecision`
- `LocalRouteDecision`
- `RouteDecision`: `chosen_bucket + chosen_instance +` both candidate lists

## Bucket Isolation Semantics

A bucket is an explicit physical isolation boundary, not a logical label.

The following must hold:

- once a request enters a bucket, it may only use that bucket's instance pool
- L0 / L1 are visible only within the bucket
- the `meta_store` describes only which instances inside the bucket hold which blocks
- remote RDMA may only fetch from other instances in the same bucket
- buckets share no owner information

This guarantees that simulator buckets correspond cleanly to real service topology, and prevents the situation where the global router "knows about buckets" while the underlying cache model still quietly shares state globally and contaminates the experimental conclusions.

## Initial Global Bucket Policies

Phase one implements only two global bucket policies.

### 1. strict_input_length

Semantics:

- only the bucket whose range contains `req.input_len` may be chosen

Purpose:

- baseline policy for strict length bucketing
- corresponds to the original target architecture diagram

### 2. bucket_score

Semantics:

- score every bucket and choose the best one

Phase one uses only a few strong signals:

- `length_penalty`: penalty for the request length deviating from the bucket's target range
- `load`: the bucket's total or maximum load
- `miss`: the bucket-level predicted miss, derived from the current request's `predicted_prefix` in that bucket

Target form:
```text
score = a * length_penalty + b * load + c * miss
```
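
A sketch of this scoring rule over `BucketView` (the `a`/`b`/`c` weights map
to `length_penalty_weight` / `load_weight` / `cache_weight`; the block-size
constant and the miss estimate are illustrative assumptions):

```rust
/// Illustrative block size, used only to convert request length to blocks.
const BLOCK_SIZE_TOKENS: u64 = 512;

fn bucket_score_pick(req_len: u64, buckets: &[BucketView], a: f64, b: f64, c: f64) -> usize {
    let mut best = (0usize, f64::INFINITY);
    for (i, bv) in buckets.iter().enumerate() {
        // Penalty for the request length falling outside the bucket's range.
        let length_penalty = if req_len < bv.input_length_min {
            (bv.input_length_min - req_len) as f64
        } else if req_len > bv.input_length_max {
            (req_len - bv.input_length_max) as f64
        } else {
            0.0
        };
        // Aggregate queue load of the bucket.
        let load = bv.queue_len_sum as f64;
        // Predicted miss: request blocks not covered by the bucket's predicted prefix.
        let miss = (req_len / BLOCK_SIZE_TOKENS).saturating_sub(bv.predicted_prefix) as f64;
        let score = a * length_penalty + b * load + c * miss;
        if score < best.1 {
            best = (i, score);
        }
    }
    best.0
}
```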
Design intent:

- make it possible to measure the gain or loss from dispatching non-strictly by input length
- keep a length-matching preference, so phase one does not degenerate into fully unconstrained scheduling

Not implemented yet:

- fully flat cross-bucket instance scoring
- per-bucket specialized global scoring
- cross-bucket fallback cache sharing

## Events And Driver Design

Today's `Event::BatchTick { instance }` assumes instances live in a single flat namespace. With multiple buckets it becomes:
```rust
Event::BatchTick { bucket: BucketId, instance: InstanceId }
```
The driver main loop becomes:

1. `Arrival`
2. read the request record
3. call `BucketedService::route_and_admit`
4. record the global + local routing decisions
5. schedule `BatchTick` events per `(bucket, instance)`

The `Sample` event can remain a global event, but sampling must iterate over all instances of all buckets.

The `inflight` structure stays keyed by `req_id`, but its value gains (see the sketch below):

- `bucket`
- `bucket_policy`
- `length_bucket_match`
- optionally `bucket_predicted_prefix`
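
A sketch of the extended inflight value (field names from this document;
types are assumptions):

```rust
// Value stored in the driver's inflight map, keyed by req_id.
struct InflightEntry {
    bucket: BucketId,
    bucket_policy: String,                // global router mode that chose the bucket
    length_bucket_match: bool,            // chosen bucket == strict length bucket?
    bucket_predicted_prefix: Option<u64>, // optional bucket-level prefix prediction
    // ... existing per-request fields (arrival time, instance, etc.)
}
```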
## Metrics 与可观测性
这次重构的重点之一是让 bucket policy 可研究,因此 metrics 必须明确区分 bucket 选择和 instance 选择。
### routing_log.jsonl
建议新增字段:
- `global_mode`
- `local_mode`
- `chosen_bucket`
- `chosen_instance`
- `bucket_candidates`
- `instance_candidates`
- `global_reason`
- `local_reason`
其中:
- `bucket_candidates` 记录每个 bucket 的摘要与分数
- `instance_candidates` 记录选中 bucket 内的 instance 级候选信息
### per_request.csv
建议新增字段:
- `bucket`
- `bucket_policy`
- `length_bucket_match`
- `bucket_predicted_prefix`
`length_bucket_match` 用于直接衡量“最终 bucket 是否等于严格长度命中的 bucket”是分析非严格分发影响的关键字段。
### instances.csv

Proposed new field:

- `bucket`

### summary.json

Keep the global rollup unchanged and add a bucket-level breakdown, either:

- by adding `per_bucket` inside `summary.json`
- or by emitting a separate `bucket_summary.json`

Phase one favors readability and ease of analysis over abstraction.

## Error handling

The following failure scenarios need explicit handling:

- a request matches 0 buckets
- a request matches multiple buckets
- an invalid bucket configuration
- in multi-bucket mode, an event whose `(bucket, instance)` cannot be resolved
- routing log / metrics missing bucket information

Configuration errors should fail at startup rather than surfacing midway through trace replay (see the validation sketch below).
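A self-contained sketch of the startup validation, with buckets reduced to `(input_length_min, input_length_max, num_instances)` tuples; the shipped checks live in `ClusterConfig::validate`, which the CLI calls after applying overrides.

```rust
use anyhow::{bail, Result};

/// Fail fast on overlapping or inverted ranges and empty instance pools,
/// instead of surfacing the error midway through trace replay (sketch).
fn validate_buckets(buckets: &[(u32, u32, u32)]) -> Result<()> {
    let mut sorted = buckets.to_vec();
    sorted.sort_by_key(|&(lo, _, _)| lo);
    for pair in sorted.windows(2) {
        let (_, hi_a, _) = pair[0];
        let (lo_b, _, _) = pair[1];
        if lo_b <= hi_a {
            bail!("bucket ranges overlap: [.., {hi_a}] and [{lo_b}, ..]");
        }
    }
    for &(lo, hi, n) in &sorted {
        if lo > hi {
            bail!("bucket range [{lo}, {hi}] is inverted");
        }
        if n == 0 {
            bail!("bucket [{lo}, {hi}] has num_instances == 0");
        }
    }
    Ok(())
}
```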
## Testing strategy

Tests come in three layers.

### 1. Configuration tests

- the legacy `num_instances` mode still loads successfully
- `buckets` mode loads successfully
- overlapping bucket ranges are rejected (sketched below)
- configuring both `num_instances` and `buckets` is rejected
- a request length not covered by any bucket is rejected, or fails clearly at runtime
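A sketch of the overlap test, assuming the two-bucket `test_config()` helper shown in the service tests later in this diff and the `ClusterConfig::validate` entry point called from `main.rs`.

```rust
#[test]
fn overlapping_bucket_ranges_are_rejected() {
    // Sketch only: shrink the long bucket's lower bound so it overlaps
    // the short bucket's [0, 32] range, then expect validation to fail.
    let mut cfg = test_config();
    cfg.cluster.buckets[1].input_length_min = 16;
    assert!(cfg.cluster.validate().is_err());
}
```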
### 2. Service / driver tests

- short requests land in the short bucket
- long requests land in the long bucket
- `bucket_score` can choose a non-default bucket when the length does not match exactly
- long-bucket requests cannot see the short bucket's meta-store / remote owners
- `BatchTick` keyed by `(bucket, instance)` advances correctly

### 3. Integration smoke test

Build a mixed trace with several input-length segments and shared-prefix patterns, then verify:

- under the strict bucket policy, requests land in the expected buckets
- `routing_log` records both bucket candidates and instance candidates
- `per_request` / `instances` carry the bucket fields
- `bucket_score` and `strict_input_length` produce an observable difference on the mixed trace
## Migration strategy

To control risk, the refactor proceeds in this order:

1. introduce the bucket structure and validation at the config layer, keeping the legacy single-bucket mode
2. narrow the existing `Cluster` to single-bucket semantics
3. add `BucketedService` and wire up the strict bucket policy first
4. extract the `GlobalRouter` interface and implement `strict_input_length`
5. adapt the existing instance-level routers into `LocalRouter` implementations
6. extend the driver / events / metrics to the `(bucket, instance)` dimension
7. implement `bucket_score` as the first non-strict bucket policy

This establishes the correct topology and logging first, then layers in experimental policies, avoiding a one-shot rewrite of too many core paths.

## Expected results

Once complete, the simulator will be able to:

- replay mixed-length traces over an explicit bucket topology
- study whether strict length bucketing actually pays off
- study what happens when the global router schedules non-strictly across buckets
- run ablations separately at the bucket-policy and local instance-policy levels

This provides a clear, observable, reusable simulator foundation for the follow-up questions: are buckets necessary, where should bucket boundaries sit, and should the global router deviate from length-based dispatch.

View File

@@ -0,0 +1,68 @@
{
"architectures": [
"GlmMoeDsaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"dtype": "bfloat16",
"eos_token_id": [
154820,
154827,
154829
],
"ep_size": 1,
"first_k_dense_replace": 3,
"hidden_act": "silu",
"head_dim": 64,
"hidden_size": 6144,
"index_head_dim": 128,
"index_n_heads": 32,
"index_topk": 2048,
"indexer_rope_interleave": true,
"initializer_range": 0.02,
"intermediate_size": 12288,
"kv_lora_rank": 512,
"max_position_embeddings": 202752,
"moe_intermediate_size": 2048,
"moe_layer_freq": 1,
"model_type": "glm_moe_dsa",
"n_group": 1,
"n_routed_experts": 256,
"n_shared_experts": 1,
"norm_topk_prob": true,
"num_attention_heads": 64,
"num_experts_per_tok": 8,
"num_hidden_layers": 78,
"num_key_value_heads": 64,
"num_nextn_predict_layers": 1,
"pad_token_id": 154820,
"pretraining_tp": 1,
"q_lora_rank": 2048,
"qk_head_dim": 256,
"qk_nope_head_dim": 192,
"qk_rope_head_dim": 64,
"quantization_config": {
"activation_scheme": "dynamic",
"fmt": "e4m3",
"quant_method": "fp8",
"weight_block_size": [
128,
128
]
},
"rms_norm_eps": 1e-05,
"rope_interleave": true,
"rope_parameters": {
"rope_theta": 1000000,
"rope_type": "default"
},
"routed_scaling_factor": 2.5,
"scoring_func": "sigmoid",
"tie_word_embeddings": false,
"topk_group": 1,
"topk_method": "noaux_tc",
"transformers_version": "5.0.2.dev0",
"use_cache": true,
"v_head_dim": 256,
"vocab_size": 154880
}

View File

@@ -0,0 +1,216 @@
use anyhow::Result;
use super::cluster::{AdmissionStats, Cluster};
use crate::config::{BucketConfig, Config, ModelConfig};
use crate::instance::Instance;
use crate::router::{self, BucketId, GlobalRouter};
use crate::trace::RequestRecord;
pub struct ServiceBucket {
pub id: BucketId,
pub cfg: BucketConfig,
pub cluster: Cluster,
}
impl ServiceBucket {
pub fn instances(&self) -> &[Instance] {
&self.cluster.instances
}
}
pub struct BucketedService {
pub buckets: Vec<ServiceBucket>,
pub global_router: Box<dyn GlobalRouter>,
}
impl BucketedService {
pub fn new(config: &Config, model: &ModelConfig) -> Self {
let buckets = config
.cluster
.effective_buckets()
.into_iter()
.enumerate()
.map(|(idx, cfg)| ServiceBucket {
id: idx as BucketId,
cluster: Cluster::new_for_bucket(config, model, idx as BucketId, cfg.num_instances)
.expect("bucket-local cluster construction should succeed"),
cfg,
})
.collect();
Self {
buckets,
global_router: router::build_global(config),
}
}
pub fn bucket(&self, bucket_id: BucketId) -> &ServiceBucket {
&self.buckets[bucket_id as usize]
}
pub fn route_and_admit(&mut self, req: &RequestRecord, now: f64) -> Result<AdmissionStats> {
let bucket_views = self
.buckets
.iter()
.map(|bucket| bucket.cluster.bucket_view(bucket.id, &bucket.cfg, req, now))
.collect::<Vec<_>>();
let global = self.global_router.route(req, &bucket_views, now)?;
let bucket = &mut self.buckets[global.chosen_bucket as usize];
Ok(bucket
.cluster
.route_and_admit_with_global(req, now, &global))
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::config::{
BucketConfig, CalibrationConfig, ClusterConfig, Config, GlobalRouterConfig,
GlobalRouterMode, HardwareConfig, MetaStoreConfig, ModelConfig, RouterConfig, RouterMode,
SimConfig,
};
use crate::trace::RequestRecord;
fn test_config() -> Config {
Config {
model: ModelConfig {
name: "test".into(),
num_layers: 4,
num_kv_heads: 2,
head_dim: 64,
dtype_bytes: 2,
block_size_tokens: 16,
flops_per_token_prefill: Some(1.0e9),
attn_quadratic_coeff: Some(64.0),
..Default::default()
},
hardware: HardwareConfig {
gpu_flops: 1.0e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1.0e12,
hbm_bytes: 1.0e9,
dram_bytes: 4.0e9,
host_dram_bw: 5.0e11,
pcie_bw: 32.0e9,
pcie_latency_us: 1.0,
rdma_bw: 12.0e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
},
calibration: CalibrationConfig::default(),
cluster: ClusterConfig {
num_instances: None,
buckets: vec![
BucketConfig {
name: "short".into(),
input_length_min: 0,
input_length_max: 32,
num_instances: 2,
},
BucketConfig {
name: "long".into(),
input_length_min: 33,
input_length_max: 96,
num_instances: 1,
},
],
global_router: GlobalRouterConfig {
mode: GlobalRouterMode::StrictInputLength,
length_penalty_weight: 1.0,
load_weight: 1.0,
cache_weight: 1.0,
},
meta_store: MetaStoreConfig {
ttl_seconds: 1000.0,
},
router: RouterConfig {
mode: RouterMode::LeastLoaded,
precise_probe_latency_us: 10.0,
precise_probe_topk: 2,
load_alpha: 0.0,
score_alpha: 1.0,
score_beta: 0.1,
prefix_k: 8,
affinity_fan_out: 2,
},
},
sim: SimConfig {
trace_path: String::new(),
max_requests: None,
output_dir: String::new(),
sample_interval_s: 0.0,
seed: 7,
input_length_min: None,
input_length_max: None,
},
}
}
fn req(req_id: u64, input_len: u32, hashes: &[u64]) -> RequestRecord {
RequestRecord {
req_id,
chat_id: req_id as i64,
parent_chat_id: -1,
turn: 0,
arrival: 0.0,
input_len,
output_len: 16,
hash_ids: hashes.to_vec(),
}
}
#[test]
fn strict_input_length_routes_into_matching_bucket() {
let cfg = test_config();
let mut service = BucketedService::new(&cfg, &cfg.model);
let stats = service
.route_and_admit(&req(1, 24, &[10, 11]), 0.0)
.unwrap();
assert_eq!(stats.bucket, 0);
assert_eq!(stats.decision.chosen_bucket, 0);
assert_eq!(
stats.decision.global_reason,
"unique bucket range contains input_length"
);
assert_eq!(
stats.decision.local_reason,
"argmin(kv_used + alpha * queue_len)"
);
assert_eq!(service.bucket(0).instances().len(), 2);
}
#[test]
fn bucket_meta_store_is_isolated() {
let cfg = test_config();
let mut service = BucketedService::new(&cfg, &cfg.model);
let _ = service
.route_and_admit(&req(1, 24, &[10, 11]), 0.0)
.unwrap();
let long_stats = service
.route_and_admit(&req(2, 64, &[10, 11, 12, 13]), 1.0)
.unwrap();
assert_eq!(long_stats.bucket, 1);
assert_eq!(long_stats.remote_hit_blocks, 0);
assert_eq!(long_stats.l1_hit_blocks, 0);
}
#[test]
fn unmatched_input_length_returns_recoverable_error() {
let mut cfg = test_config();
cfg.cluster.buckets[1].input_length_min = 40;
let mut service = BucketedService::new(&cfg, &cfg.model);
let err = service
.route_and_admit(&req(3, 36, &[20, 21, 22]), 0.0)
.unwrap_err();
assert!(err.to_string().contains("no bucket"));
assert!(err.to_string().contains("input_length=36"));
}
}

View File

@@ -1,17 +1,21 @@
//! Cluster: routes arrivals, performs the L0 / L1 / remote-RDMA fetch chain
//! described in the design diagram, and bookkeeps the global meta store.
use anyhow::Result;
use crate::cluster::meta_store::MetaStore;
use crate::config::{Config, ModelConfig};
use crate::config::{BucketConfig, Config, ModelConfig};
use crate::instance::instance::AdmittedRequest;
use crate::instance::kv_cache::L1Change;
use crate::instance::Instance;
use crate::router::{self, RouteDecision, Router};
use crate::router::{self, BucketId, BucketView, GlobalRouteDecision, RouteDecision, Router};
use crate::trace::RequestRecord;
use crate::ttft::{classify_prefix_tiers, TtftModel};
use crate::types::InstanceId;
#[derive(Debug, Clone)]
pub struct AdmissionStats {
pub bucket: BucketId,
pub instance: InstanceId,
pub l0_hit_blocks: u32,
pub l1_hit_blocks: u32,
@@ -31,52 +35,70 @@ pub struct Cluster {
pub router: Box<dyn Router>,
pub block_size_tokens: u32,
pub kv_block_bytes: u64,
pub ttft_model: TtftModel,
}
impl Cluster {
pub fn new(config: &Config, model: &ModelConfig) -> Self {
let mut instances = Vec::with_capacity(config.cluster.num_instances as usize);
for id in 0..config.cluster.num_instances {
instances.push(Instance::new(id as InstanceId, model, &config.hardware));
}
let meta_store = MetaStore::new(config.cluster.meta_store.ttl_seconds);
let router = router::build(config, config.sim.seed);
Self {
instances,
meta_store,
router,
block_size_tokens: model.block_size_tokens,
kv_block_bytes: model.kv_block_bytes(),
}
pub fn new(config: &Config, model: &ModelConfig) -> Result<Self> {
let total_instances = config.cluster.require_legacy_single_pool("Cluster::new")?;
Self::build_local_cluster(config, model, total_instances)
}
pub fn new_for_bucket(
config: &Config,
model: &ModelConfig,
_bucket_id: BucketId,
num_instances: u32,
) -> Result<Self> {
let mut local_config = config.clone();
local_config.cluster.num_instances = Some(num_instances);
local_config.cluster.buckets.clear();
Self::build_local_cluster(&local_config, model, num_instances)
}
/// Route + admit a request. Returns the chosen instance plus rich
/// per-request stats for metrics. Does NOT schedule the BatchTick — the
/// simulator driver does that based on the returned `ready_at`.
pub fn route_and_admit(&mut self, req: &RequestRecord, now: f64) -> AdmissionStats {
let global = GlobalRouteDecision::single_bucket(req.req_id, 0);
self.route_and_admit_with_global(req, now, &global)
}
pub fn route_and_admit_with_global(
&mut self,
req: &RequestRecord,
now: f64,
global: &GlobalRouteDecision,
) -> AdmissionStats {
let decision = self
.router
.route(req, &self.instances, &self.meta_store, now);
.route(req, &self.instances, &self.meta_store, now)
.with_global(global);
let inst_id = decision.chosen;
let probe_overhead_s = decision.probe_overhead_s;
let scheduler_overhead_s = self
.ttft_model
.scheduler_overhead_s(self.instances.len(), 3);
// The router probe overhead delays the request's effective start time.
let effective_now = now + probe_overhead_s;
// The router probe overhead and scheduler work delay the request's
// effective start time.
let effective_now = now + probe_overhead_s + scheduler_overhead_s;
let inst = &mut self.instances[inst_id as usize];
let residency = classify_prefix_tiers(&req.hash_ids, inst, &self.meta_store, now);
let total_blocks = req.hash_ids.len() as u32;
let l0_hits = residency.l0_hit_blocks;
let l1_hits = residency.l1_hit_blocks;
let remote_hit_blocks = residency.remote_hit_blocks;
// 1. L0 lookup (touches matched blocks).
let l0_hits = inst.cache.l0.longest_prefix(&req.hash_ids) as u32;
// 2. L1 lookup on the remaining suffix.
// 1. L1 lookup on the remaining suffix.
let suffix_after_l0 = &req.hash_ids[l0_hits as usize..];
let l1_hits = inst.cache.l1.longest_prefix_peek(suffix_after_l0) as u32;
// L1->L0 transfer cost
let l1_bytes = (l1_hits as u64) * self.kv_block_bytes;
let mut t = effective_now;
let mut l1_changes = Vec::new();
if l1_hits > 0 {
t += self.ttft_model.local_l1_prepare_time_s(l1_hits) - inst.links.pcie.cost(l1_bytes);
t = inst.links.pcie.reserve(t, l1_bytes);
l1_changes = inst
.cache
@@ -86,23 +108,14 @@ impl Cluster {
// 3. Remote v6d lookup for the still-remaining suffix.
let suffix_after_l1 = &suffix_after_l0[l1_hits as usize..];
let mut remote_hit_blocks: u32 = 0;
for &h in suffix_after_l1 {
// A block is remotely available iff some instance other than
// `inst_id` lists it (and not expired).
let owners = self.meta_store.instances_for(h, now);
let any_remote = owners.iter().any(|o| *o != inst_id);
if any_remote {
remote_hit_blocks += 1;
} else {
break; // contiguous prefix - stop on first miss
}
}
let remote_bytes = (remote_hit_blocks as u64) * self.kv_block_bytes;
if remote_hit_blocks > 0 {
let pulled = &suffix_after_l1[..remote_hit_blocks as usize];
let l1_changes = {
let inst = &mut self.instances[inst_id as usize];
t += self.ttft_model.remote_prepare_time_s(remote_hit_blocks)
- inst.links.rdma.cost(remote_bytes)
- inst.links.pcie.cost(remote_bytes);
t = inst.links.rdma.reserve(t, remote_bytes);
t = inst.links.pcie.reserve(t, remote_bytes);
inst.cache.fetch_remote_blocks_to_l0(pulled)
@@ -134,6 +147,7 @@ impl Cluster {
ready_at: t,
prefill_tokens_remaining: miss_tokens,
reserved_blocks,
completion_tail_s: self.ttft_model.first_token_tail_s(),
};
let inst = &mut self.instances[inst_id as usize];
inst.admit(admitted);
@@ -142,6 +156,7 @@ impl Cluster {
let fetch_time_s = (t - effective_now).max(0.0);
AdmissionStats {
bucket: decision.chosen_bucket,
instance: inst_id,
l0_hit_blocks: l0_hits,
l1_hit_blocks: l1_hits,
@@ -156,6 +171,30 @@ impl Cluster {
}
}
pub fn bucket_view(
&self,
bucket_id: BucketId,
cfg: &BucketConfig,
req: &RequestRecord,
now: f64,
) -> BucketView {
let predicted_prefix = self
.meta_store
.score_prefix(&req.hash_ids, now, self.instances.len())
.into_iter()
.max()
.unwrap_or(0);
BucketView {
id: bucket_id,
input_length_min: cfg.input_length_min,
input_length_max: cfg.input_length_max,
num_instances: self.instances.len() as u32,
total_queue_len: self.instances.iter().map(Instance::queue_len).sum(),
total_load_blocks: self.instances.iter().map(|inst| inst.kv_blocks_used).sum(),
predicted_prefix,
}
}
fn apply_l1_changes(
meta_store: &mut MetaStore,
inst_id: InstanceId,
@@ -169,4 +208,158 @@ impl Cluster {
}
}
}
fn build_local_cluster(
config: &Config,
model: &ModelConfig,
num_instances: u32,
) -> Result<Self> {
let mut instances = Vec::with_capacity(num_instances as usize);
for id in 0..num_instances {
instances.push(Instance::new(
id as InstanceId,
model,
&config.hardware,
&config.calibration,
));
}
let meta_store = MetaStore::new(config.cluster.meta_store.ttl_seconds);
let router = router::build(config, config.sim.seed);
Ok(Self {
instances,
meta_store,
router,
block_size_tokens: model.block_size_tokens,
kv_block_bytes: model.kv_block_bytes(),
ttft_model: TtftModel::new(
&config.hardware,
&config.calibration,
model.kv_block_bytes(),
),
})
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::config::{
BucketConfig, CalibrationConfig, ClusterConfig, Config, HardwareConfig, MetaStoreConfig,
ModelConfig, RouterConfig, RouterMode, SimConfig,
};
use crate::trace::RequestRecord;
fn test_config(mode: RouterMode) -> Config {
Config {
model: ModelConfig {
name: "test".into(),
num_layers: 4,
num_kv_heads: 2,
head_dim: 64,
dtype_bytes: 2,
block_size_tokens: 16,
flops_per_token_prefill: Some(1.0e9),
attn_quadratic_coeff: Some(64.0),
..Default::default()
},
hardware: HardwareConfig {
gpu_flops: 1.0e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1.0e12,
hbm_bytes: 1.0e9,
dram_bytes: 4.0e9,
host_dram_bw: 5.0e11,
pcie_bw: 32.0e9,
pcie_latency_us: 1.0,
rdma_bw: 12.0e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
},
calibration: CalibrationConfig {
dram_access_latency_us: 25.0,
layout_transform_fixed_us: 7.0,
..CalibrationConfig::default()
},
cluster: ClusterConfig {
num_instances: Some(1),
buckets: Vec::new(),
global_router: Default::default(),
meta_store: MetaStoreConfig {
ttl_seconds: 1000.0,
},
router: RouterConfig {
mode,
precise_probe_latency_us: 10.0,
precise_probe_topk: 2,
load_alpha: 0.0,
score_alpha: 0.0,
score_beta: 1.0,
prefix_k: 8,
affinity_fan_out: 2,
},
},
sim: SimConfig {
trace_path: String::new(),
max_requests: None,
output_dir: String::new(),
sample_interval_s: 0.0,
seed: 7,
input_length_min: None,
input_length_max: None,
},
}
}
#[test]
fn l1_ready_at_includes_dram_and_transform_overhead() {
let cfg = test_config(RouterMode::EstimatedTtft);
let mut cluster = Cluster::new(&cfg, &cfg.model).unwrap();
let req = RequestRecord {
req_id: 1,
chat_id: 0,
parent_chat_id: -1,
turn: 1,
arrival: 0.0,
input_len: 32,
output_len: 16,
hash_ids: vec![10, 11],
};
let mut evicted = Vec::new();
cluster.instances[0]
.cache
.l1
.insert_blocks(&req.hash_ids, &mut evicted);
let stats = cluster.route_and_admit(&req, 0.0);
let pure_pcie = cluster.instances[0]
.links
.pcie
.cost(cluster.kv_block_bytes * 2);
assert!(stats.ready_at > pure_pcie);
}
#[test]
fn cluster_new_rejects_bucketed_configs() {
let mut cfg = test_config(RouterMode::EstimatedTtft);
cfg.cluster.num_instances = None;
cfg.cluster.buckets = vec![BucketConfig {
name: "short".into(),
input_length_min: 0,
input_length_max: 64,
num_instances: 2,
}];
let result = Cluster::new(&cfg, &cfg.model);
assert!(result.is_err(), "bucketed Cluster::new should fail");
let err = result.err().unwrap();
assert!(err.to_string().contains("Cluster::new"));
assert!(err.to_string().contains("cluster.buckets"));
}
}

View File

@@ -1,6 +1,8 @@
pub mod meta_store;
pub mod bucketed_service;
#[allow(clippy::module_inception)]
pub mod cluster;
pub mod meta_store;
pub use bucketed_service::BucketedService;
pub use cluster::Cluster;
pub use meta_store::MetaStore;

File diff suppressed because it is too large.

View File

@@ -1,19 +1,34 @@
//! Simulation driver: pulls trace records, advances the event queue, runs
//! instance batch ticks, and emits metrics.
use anyhow::Result;
use std::collections::HashMap;
use anyhow::{anyhow, Context, Result};
use std::collections::{HashMap, VecDeque};
use std::path::Path;
use std::sync::{Arc, Mutex};
use crate::cluster::Cluster;
use crate::config::Config;
use crate::cluster::BucketedService;
use crate::config::{Config, RouterMode};
use crate::metrics::ablation::AblationRow;
use crate::metrics::per_request::{PerRequestRow, PerRequestWriter};
use crate::metrics::routing_log::RoutingLogWriter;
use crate::metrics::summary::Summary;
use crate::metrics::timeseries::{TimeseriesRow, TimeseriesWriter};
use crate::replay::ReplayEvictPolicy;
use crate::sim::{Event, EventQueue};
use crate::trace::{RequestRecord, TraceReader};
/// Drop records whose `input_len` falls outside `sim.input_length_{min,max}`.
/// Used to carve an ablation onto a specific input-length bucket (e.g. 0-40k)
/// without rewriting the trace file. No-op if both bounds are unset.
pub fn apply_input_length_filter(records: &mut Vec<RequestRecord>, cfg: &crate::config::SimConfig) {
let lo = cfg.input_length_min.unwrap_or(0);
let hi = cfg.input_length_max.unwrap_or(u32::MAX);
if lo == 0 && hi == u32::MAX {
return;
}
records.retain(|r| r.input_len >= lo && r.input_len <= hi);
}
pub struct RunOutputs {
pub summary: Summary,
pub rows: Vec<PerRequestRow>,
@@ -22,7 +37,9 @@ pub struct RunOutputs {
#[derive(Debug, Clone)]
struct InflightInfo {
arrival: f64,
bucket: u32,
instance: u32,
length_bucket_match: bool,
total_blocks: u32,
l0_hit_blocks: u32,
l1_hit_blocks: u32,
@@ -34,7 +51,7 @@ struct InflightInfo {
}
pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
let mut cluster = Cluster::new(config, &config.model);
let mut service = BucketedService::new(config, &config.model);
let mut q = EventQueue::new();
// Output directory
@@ -53,7 +70,21 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
// Load all records (cheap for moderate traces) so we can index by req_id.
// For very large traces a streaming approach with a peekable iterator
// would be better; this keeps the driver simple.
let records: Vec<RequestRecord> = (&mut trace).collect::<Result<Vec<_>, _>>()?;
let mut records: Vec<RequestRecord> = (&mut trace).collect::<Result<Vec<_>, _>>()?;
let raw_count = records.len();
apply_input_length_filter(&mut records, &config.sim);
if records.len() != raw_count {
eprintln!(
"[driver] input_length filter [{}, {}] kept {}/{} requests",
config.sim.input_length_min.unwrap_or(0),
config
.sim
.input_length_max
.map_or("∞".to_string(), |v| v.to_string()),
records.len(),
raw_count,
);
}
let mut by_id: HashMap<u64, RequestRecord> = HashMap::with_capacity(records.len());
for r in &records {
q.schedule(r.arrival, Event::Arrival { req_id: r.req_id });
@@ -79,13 +110,16 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
Some(r) => r.clone(),
None => continue,
};
let stats = cluster.route_and_admit(&req, now);
let stats = service.route_and_admit(&req, now)?;
rt_writer.write(&stats.decision)?;
let strict_bucket = config.cluster.bucket_index_for_input_len(req.input_len)?;
inflight.insert(
req_id,
InflightInfo {
arrival: req.arrival,
bucket: stats.bucket,
instance: stats.instance,
length_bucket_match: stats.bucket as usize == strict_bucket,
total_blocks: req.hash_ids.len() as u32,
l0_hit_blocks: stats.l0_hit_blocks,
l1_hit_blocks: stats.l1_hit_blocks,
@@ -96,15 +130,23 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
probe_overhead_s: stats.probe_overhead_s,
},
);
let inst = &mut cluster.instances[stats.instance as usize];
let inst = &mut service.buckets[stats.bucket as usize].cluster.instances
[stats.instance as usize];
if !inst.tick_scheduled {
inst.tick_scheduled = true;
let when = stats.ready_at.max(now);
q.schedule(when, Event::BatchTick { instance: stats.instance });
q.schedule(
when,
Event::BatchTick {
bucket: stats.bucket,
instance: stats.instance,
},
);
}
}
Event::BatchTick { instance } => {
let inst = &mut cluster.instances[instance as usize];
Event::BatchTick { bucket, instance } => {
let inst =
&mut service.buckets[bucket as usize].cluster.instances[instance as usize];
inst.tick_scheduled = false;
let result = inst.step(now);
for (rid, ttft, end) in result.completed {
@@ -114,7 +156,9 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
arrival: info.arrival,
ttft,
e2e: end - info.arrival,
bucket: info.bucket,
instance: info.instance,
length_bucket_match: info.length_bucket_match,
total_blocks: info.total_blocks,
l0_hit_blocks: info.l0_hit_blocks,
l1_hit_blocks: info.l1_hit_blocks,
@@ -129,24 +173,28 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
}
}
if let Some(next) = result.next_tick {
let inst = &mut cluster.instances[instance as usize];
let inst =
&mut service.buckets[bucket as usize].cluster.instances[instance as usize];
if !inst.tick_scheduled {
inst.tick_scheduled = true;
q.schedule(next.max(now), Event::BatchTick { instance });
q.schedule(next.max(now), Event::BatchTick { bucket, instance });
}
}
}
Event::Sample => {
for inst in &cluster.instances {
let busy = if inst.queue_len() > 0 { 1 } else { 0 };
ts_writer.write(&TimeseriesRow {
t: now,
instance: inst.id,
queue_len: inst.queue_len(),
kv_blocks_used: inst.kv_blocks_used,
kv_blocks_total: inst.hbm_block_budget,
busy,
})?;
for bucket in &service.buckets {
for inst in &bucket.cluster.instances {
let busy = if inst.queue_len() > 0 { 1 } else { 0 };
ts_writer.write(&TimeseriesRow {
t: now,
bucket: bucket.id,
instance: inst.id,
queue_len: inst.queue_len(),
kv_blocks_used: inst.kv_blocks_used,
kv_blocks_total: inst.hbm_block_budget,
busy,
})?;
}
}
}
Event::Stop => break,
@@ -168,3 +216,138 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
Ok(RunOutputs { summary, rows })
}
pub fn ablate_fixed_placement(
base: &Config,
routers: &[RouterMode],
evict_policies: &[ReplayEvictPolicy],
) -> Result<Vec<AblationRow>> {
ablate_fixed_placement_with_parallelism(base, routers, evict_policies, 1)
}
pub fn ablate_fixed_placement_with_parallelism(
base: &Config,
routers: &[RouterMode],
evict_policies: &[ReplayEvictPolicy],
jobs: usize,
) -> Result<Vec<AblationRow>> {
base.cluster
.require_legacy_single_pool("fixed-placement ablation")?;
let mut out = Vec::new();
for &policy in evict_policies {
if policy != ReplayEvictPolicy::Lru {
return Err(anyhow::anyhow!(
"exact belady is not supported for fixed-placement full-hierarchy ablation; \
the previous replay-based approximation has been removed"
));
}
}
if routers.is_empty() {
return Ok(out);
}
let worker_count = resolve_ablation_parallelism(jobs, routers.len());
if worker_count == 1 {
for &mode in routers {
out.extend(run_ablation_router(base, mode, evict_policies)?);
}
return Ok(out);
}
eprintln!(
"[ablate] running {} routers with {} workers",
routers.len(),
worker_count
);
let queue = Arc::new(Mutex::new(VecDeque::from(
routers
.iter()
.copied()
.enumerate()
.collect::<Vec<(usize, RouterMode)>>(),
)));
let mut ordered_results: Vec<(usize, Vec<AblationRow>)> = Vec::with_capacity(routers.len());
std::thread::scope(|scope| -> Result<()> {
let mut handles = Vec::with_capacity(worker_count);
for worker_idx in 0..worker_count {
let queue = Arc::clone(&queue);
let base = base.clone();
let policies = evict_policies.to_vec();
handles.push(
scope.spawn(move || -> Result<Vec<(usize, Vec<AblationRow>)>> {
let mut local = Vec::new();
loop {
let next = queue
.lock()
.map_err(|_| anyhow!("ablation work queue mutex poisoned"))?
.pop_front();
let Some((idx, mode)) = next else {
break;
};
eprintln!(
"[ablate] worker {}/{} router={} ...",
worker_idx + 1,
worker_count,
mode.as_str()
);
let rows = run_ablation_router(&base, mode, &policies)
.with_context(|| format!("ablation router={}", mode.as_str()))?;
local.push((idx, rows));
}
Ok(local)
}),
);
}
for handle in handles {
let local = handle
.join()
.map_err(|_| anyhow!("ablation worker thread panicked"))??;
ordered_results.extend(local);
}
Ok(())
})?;
ordered_results.sort_by_key(|(idx, _)| *idx);
for (_, rows) in ordered_results {
out.extend(rows);
}
Ok(out)
}
fn resolve_ablation_parallelism(jobs: usize, num_routers: usize) -> usize {
if num_routers == 0 {
return 1;
}
let requested = if jobs == 0 {
std::thread::available_parallelism()
.map(|n| n.get())
.unwrap_or(1)
} else {
jobs
};
requested.max(1).min(num_routers)
}
fn run_ablation_router(
base: &Config,
mode: RouterMode,
evict_policies: &[ReplayEvictPolicy],
) -> Result<Vec<AblationRow>> {
let mut cfg = base.clone();
cfg.cluster.router.mode = mode;
let placement_run = run(&cfg, Some(&format!("{}__placement_lru", mode.as_str())))?;
let mut rows = Vec::with_capacity(evict_policies.len());
for &policy in evict_policies {
rows.push(AblationRow::from_summary(
mode.as_str(),
policy,
"realized_lru",
&placement_run.summary,
));
}
Ok(rows)
}

View File

@@ -17,6 +17,7 @@ pub const AVAILABLE: &[&str] = &[
"h100",
"h800",
"h20",
"h20-141g",
"a100-80gb",
"a100-40gb",
"b200",
@@ -30,6 +31,9 @@ pub const AVAILABLE: &[&str] = &[
"2xh20",
"4xh20",
"8xh20",
"2xh20-141g",
"4xh20-141g",
"8xh20-141g",
"2xb200",
"4xb200",
"8xb200",
@@ -49,6 +53,7 @@ pub fn resolve(name: &str) -> Option<HardwareConfig> {
"h100" => Some(make_config(count, &H100)),
"h800" => Some(make_config(count, &H800)),
"h20" => Some(make_config(count, &H20)),
"h20141g" | "h20141gb" => Some(make_config(count, &H20_141G)),
"a10080gb" | "a100" => Some(make_config(count, &A100_80GB)),
"a10040gb" => Some(make_config(count, &A100_40GB)),
"b200" => Some(make_config(count, &B200)),
@@ -78,56 +83,65 @@ fn parse_count_gpu(s: &str) -> (u32, String) {
// -- Per-GPU base specs (single die, BF16 dense) -----------------------------
struct GpuBase {
flops: f64, // BF16 dense FLOPS
fp8_flops: f64, // FP8 dense FLOPS (0 = not supported)
fp4_flops: f64, // FP4 dense FLOPS (0 = not supported)
mem_bw: f64, // HBM bandwidth (B/s)
hbm: f64, // Total HBM (bytes)
pcie_gen: u32, // PCIe generation (4/5/6)
flops: f64, // BF16 dense FLOPS
fp8_flops: f64, // FP8 dense FLOPS (0 = not supported)
fp4_flops: f64, // FP4 dense FLOPS (0 = not supported)
mem_bw: f64, // HBM bandwidth (B/s)
hbm: f64, // Total HBM (bytes)
pcie_gen: u32, // PCIe generation (4/5/6)
}
const H100: GpuBase = GpuBase {
flops: 9.89e14, // 989 TFLOPS BF16 dense
flops: 9.89e14, // 989 TFLOPS BF16 dense
fp8_flops: 1.979e15, // 1979 TFLOPS FP8 dense
fp4_flops: 0.0, // not supported
mem_bw: 3.35e12, // 3.35 TB/s HBM3
hbm: 80.0e9, // 80 GB
fp4_flops: 0.0, // not supported
mem_bw: 3.35e12, // 3.35 TB/s HBM3
hbm: 80.0e9, // 80 GB
pcie_gen: 5,
};
const H800: GpuBase = GpuBase {
flops: 9.89e14, // same die as H100
flops: 9.89e14, // same die as H100
fp8_flops: 1.979e15,
fp4_flops: 0.0,
mem_bw: 3.35e12, // 3.35 TB/s HBM3
hbm: 80.0e9, // 80 GB
mem_bw: 3.35e12, // 3.35 TB/s HBM3
hbm: 80.0e9, // 80 GB
pcie_gen: 5,
};
const H20: GpuBase = GpuBase {
flops: 1.48e14, // 148 TFLOPS BF16 (China-export Hopper)
flops: 1.48e14, // 148 TFLOPS BF16 (China-export Hopper)
fp8_flops: 2.96e14, // 296 TFLOPS FP8
fp4_flops: 0.0, // not supported
mem_bw: 4.0e12, // 4.0 TB/s HBM3
hbm: 96.0e9, // 96 GB
fp4_flops: 0.0, // not supported
mem_bw: 4.0e12, // 4.0 TB/s HBM3
hbm: 96.0e9, // 96 GB
pcie_gen: 5,
};
const H20_141G: GpuBase = GpuBase {
flops: 1.48e14, // modeled as the same H20 compute envelope
fp8_flops: 2.96e14, // modeled as the same H20 FP8 throughput
fp4_flops: 0.0, // not supported
mem_bw: 4.8e12,      // 4.8 TB/s HBM3e (141 GB variant)
hbm: 141.0e9, // 141 GB
pcie_gen: 5,
};
const A100_80GB: GpuBase = GpuBase {
flops: 3.12e14, // 312 TFLOPS BF16
fp8_flops: 0.0, // A100 has no FP8 tensor cores
flops: 3.12e14, // 312 TFLOPS BF16
fp8_flops: 0.0, // A100 has no FP8 tensor cores
fp4_flops: 0.0,
mem_bw: 2.0e12, // 2.0 TB/s HBM2e
hbm: 80.0e9, // 80 GB
mem_bw: 2.0e12, // 2.0 TB/s HBM2e
hbm: 80.0e9, // 80 GB
pcie_gen: 4,
};
const A100_40GB: GpuBase = GpuBase {
flops: 3.12e14, // 312 TFLOPS BF16
flops: 3.12e14, // 312 TFLOPS BF16
fp8_flops: 0.0,
fp4_flops: 0.0,
mem_bw: 1.555e12, // 1.555 TB/s HBM2e
hbm: 40.0e9, // 40 GB
mem_bw: 1.555e12, // 1.555 TB/s HBM2e
hbm: 40.0e9, // 40 GB
pcie_gen: 4,
};
@@ -143,11 +157,11 @@ const B200: GpuBase = GpuBase {
// DGX B300 (8 GPU) specs: BF16 18 PFLOPS, FP8 ~54 PFLOPS, FP4 108 PFLOPS (dense)
const B300: GpuBase = GpuBase {
flops: 2.25e15, // 2250 TFLOPS BF16 dense (same GB202 die as B200)
fp8_flops: 6.75e15, // 6750 TFLOPS FP8 dense (estimated from FP4/2)
fp4_flops: 13.5e15, // 13500 TFLOPS FP4 dense (Blackwell Ultra enhanced)
mem_bw: 12.0e12, // 12 TB/s HBM3e 12-Hi
hbm: 288.0e9, // 288 GB HBM3e 12-Hi
flops: 2.25e15, // 2250 TFLOPS BF16 dense (same GB202 die as B200)
fp8_flops: 6.75e15, // 6750 TFLOPS FP8 dense (estimated from FP4/2)
fp4_flops: 13.5e15, // 13500 TFLOPS FP4 dense (Blackwell Ultra enhanced)
mem_bw: 12.0e12, // 12 TB/s HBM3e 12-Hi
hbm: 288.0e9, // 288 GB HBM3e 12-Hi
pcie_gen: 6,
};
@@ -162,15 +176,15 @@ fn make_config(n: u32, base: &GpuBase) -> HardwareConfig {
// PCIe per-GPU bandwidth and latency by generation
let (pcie_per_gpu, pcie_lat) = match base.pcie_gen {
6 => (128.0e9, 4.0), // Gen6 x16
5 => (64.0e9, 5.0), // Gen5 x16
_ => (32.0e9, 5.0), // Gen4 x16
6 => (128.0e9, 4.0), // Gen6 x16
5 => (64.0e9, 5.0), // Gen5 x16
_ => (32.0e9, 5.0), // Gen4 x16
};
// RDMA: base NIC speed by PCIe gen, scaled for multi-NIC servers
let (rdma_base, rdma_lat) = match base.pcie_gen {
6 => (50.0e9, 6.0), // 400 Gbps NIC
_ => (25.0e9, 8.0), // 200 Gbps NIC
6 => (50.0e9, 6.0), // 400 Gbps NIC
_ => (25.0e9, 8.0), // 200 Gbps NIC
};
let rdma_scale = if n >= 8 { 2.0 } else { 1.0 };
@@ -188,10 +202,18 @@ fn make_config(n: u32, base: &GpuBase) -> HardwareConfig {
gpu_mem_bw: base.mem_bw * f,
hbm_bytes: base.hbm * f,
dram_bytes: dram,
host_dram_bw: if n >= 8 { 9.0e11 } else { 5.0e11 },
pcie_bw: pcie_per_gpu * f,
pcie_latency_us: pcie_lat,
rdma_bw: rdma_base * rdma_scale,
rdma_latency_us: rdma_lat,
intra_node_tp_bw: if base.pcie_gen >= 6 {
1.8e12 * f
} else {
9.0e11 * f
},
intra_node_tp_latency_us: if base.pcie_gen >= 6 { 1.0 } else { 2.0 },
tp_degree: n,
max_batch_slots: 256,
prefill_chunk_tokens: if n >= 4 { 4096 } else { 2048 },
}
@@ -223,6 +245,7 @@ mod tests {
assert!(resolve("H100").is_some());
assert!(resolve("8xB200").is_some());
assert!(resolve("8x-B200").is_some());
assert!(resolve("8xH20-141G").is_some());
assert!(resolve("a100-80gb").is_some());
assert!(resolve("A100_80GB").is_some());
assert!(resolve("a100_80gb").is_some());
@@ -254,4 +277,13 @@ mod tests {
assert!((s4.gpu_mem_bw - s1.gpu_mem_bw * 4.0).abs() < 1.0);
assert!((s8.hbm_bytes - s1.hbm_bytes * 8.0).abs() < 1.0);
}
#[test]
fn h20_141g_variant_has_larger_hbm() {
let h20 = resolve("8xh20").unwrap();
let h20_141g = resolve("8xh20-141g").unwrap();
assert!((h20_141g.gpu_flops - h20.gpu_flops).abs() < 1.0);
assert!(h20_141g.hbm_bytes > h20.hbm_bytes);
assert!(h20_141g.gpu_mem_bw > h20.gpu_mem_bw);
}
}

View File

@@ -7,7 +7,7 @@ use anyhow::{Context, Result};
use serde_json::Value;
use std::path::Path;
use crate::config::{AttentionConfig, MlaConfig, MoeConfig, ModelConfig};
use crate::config::{AttentionConfig, MlaConfig, ModelConfig, MoeConfig};
/// Parse a HuggingFace config.json and return a partially-populated
/// [`ModelConfig`]. The caller must still set `dtype_bytes` and
@@ -34,8 +34,7 @@ fn parse_value(v: &Value) -> Result<ModelConfig> {
let num_layers = u32_field(v, "num_hidden_layers");
let hidden_size = u32_field(v, "hidden_size");
let num_attention_heads = u32_field(v, "num_attention_heads");
let num_kv_heads = u32_field(v, "num_key_value_heads")
.or(num_attention_heads); // default to MHA
let num_kv_heads = u32_field(v, "num_key_value_heads").or(num_attention_heads); // default to MHA
let head_dim = u32_field(v, "head_dim").or_else(|| {
// Infer: hidden_size / num_attention_heads
match (hidden_size, num_attention_heads) {
@@ -70,25 +69,24 @@ fn parse_value(v: &Value) -> Result<ModelConfig> {
});
// --- Attention pattern ---
let attention =
if let Some(first_dense) = u32_field(v, "first_k_dense_replace") {
// DSA-style model (GLM-5, DeepSeek-V3).
// dense_window and sparse_stride are typically not in config.json;
// use sensible defaults the user can override in YAML.
Some(AttentionConfig::Dsa {
dense_window: 4096,
sparse_stride: 8,
first_dense_layers: first_dense,
})
} else if let Some(sw) = v
.get("sliding_window")
.and_then(|x| x.as_u64())
.map(|x| x as u32)
{
Some(AttentionConfig::SlidingWindow { window_size: sw })
} else {
None // dense by default
};
let attention = if let Some(first_dense) = u32_field(v, "first_k_dense_replace") {
// DSA-style model (GLM-5, DeepSeek-V3).
// dense_window and sparse_stride are typically not in config.json;
// use sensible defaults the user can override in YAML.
Some(AttentionConfig::Dsa {
dense_window: 4096,
sparse_stride: 8,
first_dense_layers: first_dense,
})
} else if let Some(sw) = v
.get("sliding_window")
.and_then(|x| x.as_u64())
.map(|x| x as u32)
{
Some(AttentionConfig::SlidingWindow { window_size: sw })
} else {
None // dense by default
};
Ok(ModelConfig {
name,
@@ -188,6 +186,12 @@ mod tests {
assert_eq!(moe.num_active_experts, 8);
let mla = m.mla.as_ref().unwrap();
assert_eq!(mla.kv_lora_rank, 512);
assert!(matches!(m.attention, Some(AttentionConfig::Dsa { first_dense_layers: 3, .. })));
assert!(matches!(
m.attention,
Some(AttentionConfig::Dsa {
first_dense_layers: 3,
..
})
));
}
}

View File

@@ -22,7 +22,7 @@
//! `effective_ctx(N)` equals `N` for dense attention (→ O(N²) total) but
//! is sub-linear for DSA / sliding-window.
use crate::config::{AttentionConfig, HardwareConfig, ModelConfig};
use crate::config::{AttentionConfig, CalibrationConfig, HardwareConfig, ModelConfig};
/// Resolved attention pattern used at runtime.
#[derive(Debug, Clone)]
@@ -33,7 +33,10 @@ pub enum AttentionPattern {
SlidingWindow { window: f64 },
/// DeepSeek Sparse Attention: effective_ctx = min(N, dense_window) +
/// max(0, N - dense_window) / sparse_stride.
Dsa { dense_window: f64, sparse_stride: f64 },
Dsa {
dense_window: f64,
sparse_stride: f64,
},
}
#[derive(Debug, Clone)]
@@ -52,24 +55,46 @@ pub struct ComputeModel {
pub attn_pattern: AttentionPattern,
/// Weight bytes read from HBM per layer (for memory-bound check).
pub weight_bytes_per_layer: f64,
/// Approximate bytes moved by each TP collective, per token per layer.
pub tp_bytes_per_token: f64,
/// Number of TP collectives per layer on the critical path.
pub tp_collective_count_per_layer: f64,
/// Peak GPU FLOPs (aggregate across TP group).
pub gpu_flops: f64,
/// Peak GPU memory bandwidth (aggregate across TP group).
pub gpu_mem_bw: f64,
/// Peak node-local TP bandwidth.
pub intra_node_tp_bw: f64,
/// Fixed latency per TP collective.
pub intra_node_tp_latency_s: f64,
/// Effective utilization for GEMM-like linear kernels.
pub matmul_util: f64,
/// Effective utilization for attention kernels.
pub attention_util: f64,
/// Effective utilization for HBM streaming.
pub hbm_bw_util: f64,
/// Effective utilization for TP bandwidth.
pub tp_bw_util: f64,
/// Fraction of TP communication that can overlap with compute.
pub tp_overlap_ratio: f64,
/// Fixed per-layer non-FLOP overhead.
pub misc_layer_overhead_s: f64,
/// Fixed launch overhead per prefill chunk.
pub chunk_launch_overhead_s: f64,
}
impl ComputeModel {
pub fn new(model: &ModelConfig, hw: &HardwareConfig) -> Self {
pub fn new(model: &ModelConfig, hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
if model.is_arch_mode() {
Self::from_arch(model, hw)
Self::from_arch(model, hw, calib)
} else {
Self::from_manual(model, hw)
Self::from_manual(model, hw, calib)
}
}
// ----- Architecture-derived construction --------------------------------
fn from_arch(model: &ModelConfig, hw: &HardwareConfig) -> Self {
fn from_arch(model: &ModelConfig, hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
let h = model.hidden_size.unwrap() as f64;
let n_heads = model.num_attention_heads.unwrap_or(model.num_kv_heads) as f64;
let n_kv = model.num_kv_heads as f64;
@@ -101,8 +126,9 @@ impl ComputeModel {
// --- MLP FLOPs/token/layer (SwiGLU: gate + up + down = 3 matmuls) ---
let mlp = if let Some(moe) = &model.moe {
let expert_inter = moe.expert_intermediate_size
.unwrap_or(model.intermediate_size.unwrap_or(0)) as f64;
let expert_inter =
moe.expert_intermediate_size
.unwrap_or(model.intermediate_size.unwrap_or(0)) as f64;
let active = moe.num_active_experts as f64;
let shared = moe.num_shared_experts as f64;
active * 6.0 * h * expert_inter + shared * 6.0 * h * inter
@@ -111,6 +137,11 @@ impl ComputeModel {
};
let linear_flops = attn_linear + mlp;
let tp_bytes_per_token = if hw.tp_degree > 1 {
h * model.dtype_bytes as f64
} else {
0.0
};
// --- Attention quadratic coefficient ---
// attn_flops_per_layer(N) = attn_coeff * N * effective_ctx(N)
@@ -132,16 +163,14 @@ impl ComputeModel {
let qk_hd = (mla.qk_nope_head_dim + mla.qk_rope_head_dim) as f64;
let qk_rd = mla.qk_rope_head_dim as f64;
let vhd = mla.v_head_dim as f64;
(h * qlr + qlr * n_heads * qk_hd
+ h * (kvlr + qk_rd)
+ n_heads * vhd * h)
* wdtype
(h * qlr + qlr * n_heads * qk_hd + h * (kvlr + qk_rd) + n_heads * vhd * h) * wdtype
} else {
((n_heads + 2.0 * n_kv) * hd * h + n_heads * hd * h) * wdtype
};
let mlp_wt = if let Some(moe) = &model.moe {
let expert_inter = moe.expert_intermediate_size
.unwrap_or(model.intermediate_size.unwrap_or(0)) as f64;
let expert_inter =
moe.expert_intermediate_size
.unwrap_or(model.intermediate_size.unwrap_or(0)) as f64;
let active = moe.num_active_experts as f64;
let shared = moe.num_shared_experts as f64;
(active * 3.0 * h * expert_inter + shared * 3.0 * h * inter) * wdtype
@@ -169,10 +198,9 @@ impl ComputeModel {
},
0.0,
),
Some(AttentionConfig::Dense) | None => (
AttentionPattern::Dense,
model.num_layers as f64,
),
Some(AttentionConfig::Dense) | None => {
(AttentionPattern::Dense, model.num_layers as f64)
}
};
Self {
@@ -182,14 +210,29 @@ impl ComputeModel {
attn_coeff,
attn_pattern,
weight_bytes_per_layer: weight_bytes,
tp_bytes_per_token,
tp_collective_count_per_layer: if hw.tp_degree > 1 {
calib.tp_collective_count_per_layer
} else {
0.0
},
gpu_flops: hw.gpu_flops,
gpu_mem_bw: hw.gpu_mem_bw,
intra_node_tp_bw: hw.intra_node_tp_bw,
intra_node_tp_latency_s: hw.intra_node_tp_latency_us * 1e-6,
matmul_util: calib.matmul_util,
attention_util: calib.attention_util,
hbm_bw_util: calib.hbm_bw_util,
tp_bw_util: calib.tp_bw_util,
tp_overlap_ratio: calib.tp_overlap_ratio,
misc_layer_overhead_s: calib.misc_layer_overhead_us * 1e-6,
chunk_launch_overhead_s: calib.chunk_launch_overhead_us * 1e-6,
}
}
// ----- Legacy manual construction ---------------------------------------
fn from_manual(model: &ModelConfig, hw: &HardwareConfig) -> Self {
fn from_manual(model: &ModelConfig, hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
Self {
num_layers: model.num_layers as f64,
first_dense_layers: model.num_layers as f64,
@@ -197,8 +240,19 @@ impl ComputeModel {
attn_coeff: model.attn_quadratic_coeff.unwrap_or(0.0),
attn_pattern: AttentionPattern::Dense,
weight_bytes_per_layer: 0.0,
tp_bytes_per_token: 0.0,
tp_collective_count_per_layer: 0.0,
gpu_flops: hw.gpu_flops,
gpu_mem_bw: hw.gpu_mem_bw,
intra_node_tp_bw: hw.intra_node_tp_bw,
intra_node_tp_latency_s: hw.intra_node_tp_latency_us * 1e-6,
matmul_util: calib.matmul_util,
attention_util: calib.attention_util,
hbm_bw_util: calib.hbm_bw_util,
tp_bw_util: calib.tp_bw_util,
tp_overlap_ratio: calib.tp_overlap_ratio,
misc_layer_overhead_s: calib.misc_layer_overhead_us * 1e-6,
chunk_launch_overhead_s: calib.chunk_launch_overhead_us * 1e-6,
}
}
@@ -231,30 +285,47 @@ impl ComputeModel {
return 0.0;
}
let n = n as f64;
let linear = n * self.linear_flops_per_token;
let linear_flops = n * self.linear_flops_per_token;
// Compute FLOPs across all layers (dense + sparse may differ).
let dense_layers = self.first_dense_layers;
let sparse_layers = self.num_layers - dense_layers;
let dense_flops = dense_layers
* (linear + self.attn_coeff * n * self.effective_ctx(n, true));
let sparse_flops = sparse_layers
* (linear + self.attn_coeff * n * self.effective_ctx(n, false));
let total_flops = dense_flops + sparse_flops;
let linear_total_flops = self.num_layers * linear_flops;
let dense_attn_flops = dense_layers * (self.attn_coeff * n * self.effective_ctx(n, true));
let sparse_attn_flops =
sparse_layers * (self.attn_coeff * n * self.effective_ctx(n, false));
let attn_total_flops = dense_attn_flops + sparse_attn_flops;
let compute_time = total_flops / self.gpu_flops;
let linear_time = linear_total_flops / (self.gpu_flops * self.matmul_util.max(1e-6));
let attn_time = attn_total_flops / (self.gpu_flops * self.attention_util.max(1e-6));
let compute_time = linear_time + attn_time + self.num_layers * self.misc_layer_overhead_s;
// Weight stream: all layers' active weights read once from HBM.
let mem_time = self.weight_bytes_per_layer * self.num_layers / self.gpu_mem_bw;
let mem_time = self.weight_bytes_per_layer * self.num_layers
/ (self.gpu_mem_bw * self.hbm_bw_util.max(1e-6));
let tp_comm_time = if self.tp_collective_count_per_layer > 0.0
&& self.tp_bytes_per_token > 0.0
&& self.intra_node_tp_bw > 0.0
{
self.num_layers
* (self.tp_collective_count_per_layer * self.intra_node_tp_latency_s
+ self.tp_collective_count_per_layer * self.tp_bytes_per_token * n
/ (self.intra_node_tp_bw * self.tp_bw_util.max(1e-6)))
} else {
0.0
};
let tp_tail = (tp_comm_time - self.tp_overlap_ratio * (linear_time + attn_time)).max(0.0);
compute_time.max(mem_time)
self.chunk_launch_overhead_s + compute_time.max(mem_time) + tp_tail
}
/// Print human-readable derived coefficients (for `validate` output).
pub fn describe(&self) -> String {
let pattern_str = match &self.attn_pattern {
AttentionPattern::Dense => "dense".to_string(),
AttentionPattern::SlidingWindow { window } => format!("sliding_window({})", *window as u64),
AttentionPattern::SlidingWindow { window } => {
format!("sliding_window({})", *window as u64)
}
AttentionPattern::Dsa {
dense_window,
sparse_stride,
@@ -266,8 +337,7 @@ impl ComputeModel {
format!(
"linear_flops/tok/layer={:.3e}, attn_coeff={:.0}, pattern={}, \
weight_bytes/layer={:.2e}",
self.linear_flops_per_token, self.attn_coeff, pattern_str,
self.weight_bytes_per_layer,
self.linear_flops_per_token, self.attn_coeff, pattern_str, self.weight_bytes_per_layer,
)
}
}
@@ -275,6 +345,7 @@ impl ComputeModel {
#[cfg(test)]
mod tests {
use super::*;
use crate::config::CalibrationConfig;
fn cm_legacy() -> ComputeModel {
ComputeModel {
@@ -284,8 +355,19 @@ mod tests {
attn_coeff: 1024.0,
attn_pattern: AttentionPattern::Dense,
weight_bytes_per_layer: 0.0,
tp_bytes_per_token: 0.0,
tp_collective_count_per_layer: 0.0,
gpu_flops: 9.89e14,
gpu_mem_bw: 3.35e12,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_s: 2.0e-6,
matmul_util: 1.0,
attention_util: 1.0,
hbm_bw_util: 1.0,
tp_bw_util: 1.0,
tp_overlap_ratio: 1.0,
misc_layer_overhead_s: 0.0,
chunk_launch_overhead_s: 0.0,
}
}
@@ -325,8 +407,19 @@ mod tests {
attn_coeff: 139264.0,
attn_pattern: AttentionPattern::Dense,
weight_bytes_per_layer: 0.0,
tp_bytes_per_token: 0.0,
tp_collective_count_per_layer: 0.0,
gpu_flops: 1.8e16,
gpu_mem_bw: 6.4e13,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_s: 2.0e-6,
matmul_util: 1.0,
attention_util: 1.0,
hbm_bw_util: 1.0,
tp_bw_util: 1.0,
tp_overlap_ratio: 1.0,
misc_layer_overhead_s: 0.0,
chunk_launch_overhead_s: 0.0,
};
let dsa = ComputeModel {
attn_pattern: AttentionPattern::Dsa {
@@ -358,8 +451,19 @@ mod tests {
attn_coeff: 1.0,
attn_pattern: AttentionPattern::Dense,
weight_bytes_per_layer: 1.0e12, // 1 TB per layer
tp_bytes_per_token: 0.0,
tp_collective_count_per_layer: 0.0,
gpu_flops: 1.0e15,
gpu_mem_bw: 1.0e12,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_s: 2.0e-6,
matmul_util: 1.0,
attention_util: 1.0,
hbm_bw_util: 1.0,
tp_bw_util: 1.0,
tp_overlap_ratio: 1.0,
misc_layer_overhead_s: 0.0,
chunk_launch_overhead_s: 0.0,
};
let t1 = m.prefill_time(1);
let t8 = m.prefill_time(8);
@@ -391,18 +495,122 @@ mod tests {
gpu_mem_bw: 1e12,
hbm_bytes: 1e9,
dram_bytes: 4e9,
host_dram_bw: 5.0e11,
pcie_bw: 32e9,
pcie_latency_us: 1.0,
rdma_bw: 12e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
};
let cm = ComputeModel::new(&model, &hw);
let cm = ComputeModel::new(&model, &hw, &CalibrationConfig::default());
assert!(cm.linear_flops_per_token > 0.0);
assert!(cm.attn_coeff > 0.0);
assert!(cm.weight_bytes_per_layer > 0.0);
let t = cm.prefill_time(1024);
assert!(t > 0.0);
}
#[test]
fn lower_utilization_increases_prefill_time() {
let model = ModelConfig {
name: "test".into(),
num_layers: 8,
num_kv_heads: 4,
head_dim: 128,
dtype_bytes: 2,
block_size_tokens: 16,
hidden_size: Some(1024),
num_attention_heads: Some(8),
intermediate_size: Some(4096),
..Default::default()
};
let hw = HardwareConfig {
gpu_flops: 1e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1e12,
hbm_bytes: 1e9,
dram_bytes: 4e9,
host_dram_bw: 5.0e11,
pcie_bw: 32e9,
pcie_latency_us: 1.0,
rdma_bw: 12e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
};
let fast = ComputeModel::new(&model, &hw, &CalibrationConfig::default());
let slow = ComputeModel::new(
&model,
&hw,
&CalibrationConfig {
matmul_util: 0.2,
attention_util: 0.15,
..CalibrationConfig::default()
},
);
assert!(slow.prefill_time(4096) > fast.prefill_time(4096));
}
#[test]
fn tp_communication_adds_tail_when_overlap_is_limited() {
let model = ModelConfig {
name: "test".into(),
num_layers: 8,
num_kv_heads: 4,
head_dim: 128,
dtype_bytes: 2,
block_size_tokens: 16,
hidden_size: Some(2048),
num_attention_heads: Some(16),
intermediate_size: Some(8192),
..Default::default()
};
let hw = HardwareConfig {
gpu_flops: 1e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1e12,
hbm_bytes: 1e9,
dram_bytes: 4e9,
host_dram_bw: 5.0e11,
pcie_bw: 32e9,
pcie_latency_us: 1.0,
rdma_bw: 12e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 1.0e10,
intra_node_tp_latency_us: 20.0,
tp_degree: 8,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
};
let no_tp = ComputeModel::new(
&model,
&hw,
&CalibrationConfig {
tp_overlap_ratio: 1.0,
tp_bw_util: 1.0,
..CalibrationConfig::default()
},
);
let tp_tail = ComputeModel::new(
&model,
&hw,
&CalibrationConfig {
tp_overlap_ratio: 0.0,
tp_bw_util: 0.2,
..CalibrationConfig::default()
},
);
assert!(tp_tail.prefill_time(2048) > no_tp.prefill_time(2048));
}
}

View File

@@ -19,7 +19,7 @@
use std::collections::VecDeque;
use crate::config::{HardwareConfig, ModelConfig};
use crate::config::{CalibrationConfig, HardwareConfig, ModelConfig};
use crate::instance::compute::ComputeModel;
use crate::instance::kv_cache::TwoTierCache;
use crate::network::InstanceLinks;
@@ -37,6 +37,8 @@ pub struct AdmittedRequest {
/// KV blocks reserved on this instance's HBM for the lifetime of this
/// request's prefill (= number of input blocks).
pub reserved_blocks: u32,
/// Tail latency between prefill completion and first-token visibility.
pub completion_tail_s: f64,
}
#[derive(Debug)]
@@ -68,15 +70,20 @@ pub struct Instance {
}
impl Instance {
pub fn new(id: InstanceId, model: &ModelConfig, hw: &HardwareConfig) -> Self {
pub fn new(
id: InstanceId,
model: &ModelConfig,
hw: &HardwareConfig,
calib: &CalibrationConfig,
) -> Self {
let block_bytes = model.kv_block_bytes() as f64;
let hbm_blocks = (hw.hbm_bytes / block_bytes).max(1.0) as u32;
let dram_blocks = (hw.dram_bytes / block_bytes).max(1.0) as u32;
Self {
id,
cache: TwoTierCache::new(hbm_blocks as usize, dram_blocks as usize),
links: InstanceLinks::from_hw(hw),
compute: ComputeModel::new(model, hw),
links: InstanceLinks::from_hw(hw, calib),
compute: ComputeModel::new(model, hw, calib),
block_size_tokens: model.block_size_tokens,
hbm_block_budget: hbm_blocks,
dram_block_budget: dram_blocks,
@@ -141,9 +148,10 @@ impl Instance {
self.kv_blocks_used += r.reserved_blocks;
if r.prefill_tokens_remaining == 0 {
// Full cache hit: nothing to compute. TTFT == fetch time.
let ttft = now - r.arrival;
let t_done = now + r.completion_tail_s;
let ttft = t_done - r.arrival;
self.kv_blocks_used = self.kv_blocks_used.saturating_sub(r.reserved_blocks);
completed.push((r.req_id, ttft, now));
completed.push((r.req_id, ttft, t_done));
} else {
self.prefilling.push_back(r);
}
@@ -171,9 +179,10 @@ impl Instance {
head.prefill_tokens_remaining -= chunk_tokens;
if head.prefill_tokens_remaining == 0 {
let done = self.prefilling.pop_front().unwrap();
let ttft = t_end - done.arrival;
let t_done = t_end + done.completion_tail_s;
let ttft = t_done - done.arrival;
self.kv_blocks_used = self.kv_blocks_used.saturating_sub(done.reserved_blocks);
completed.push((done.req_id, ttft, t_end));
completed.push((done.req_id, ttft, t_done));
}
StepResult {

View File

@@ -1,6 +1,6 @@
pub mod compute;
pub mod kv_cache;
#[allow(clippy::module_inception)]
pub mod instance;
pub mod kv_cache;
pub use instance::Instance;

View File

@@ -7,7 +7,9 @@ pub mod instance;
pub mod metrics;
pub mod network;
pub mod oracle;
pub mod replay;
pub mod router;
pub mod sim;
pub mod trace;
pub mod ttft;
pub mod types;

View File

@@ -3,6 +3,7 @@ use clap::{Args, Parser, Subcommand};
use std::path::PathBuf;
use kvcache_simulator::config::{Config, RouterMode};
use kvcache_simulator::replay::ReplayEvictPolicy;
use kvcache_simulator::{driver, oracle, trace::TraceReader};
#[derive(Debug, Parser)]
@@ -37,12 +38,26 @@ struct ConfigOverrides {
/// Override `cluster.meta_store.ttl_seconds`.
#[arg(long)]
ttl_seconds: Option<f64>,
/// Override `sim.input_length_min` — requests with `input_length` below
/// this value are dropped from the replay.
#[arg(long)]
input_length_min: Option<u32>,
/// Override `sim.input_length_max` — requests with `input_length` above
/// this value are dropped from the replay. Combine with `--input-length-min`
/// to carve out a specific input-length bucket for ablation.
#[arg(long)]
input_length_max: Option<u32>,
}
impl ConfigOverrides {
fn apply(&self, cfg: &mut Config) {
fn apply(&self, cfg: &mut Config) -> Result<()> {
if let Some(n) = self.num_instances {
cfg.cluster.num_instances = n;
if !cfg.cluster.buckets.is_empty() {
anyhow::bail!(
"--num-instances does not support cluster.buckets until Task 2 lands"
);
}
cfg.cluster.num_instances = Some(n);
}
if let Some(m) = self.max_requests {
cfg.sim.max_requests = Some(m);
@@ -62,6 +77,13 @@ impl ConfigOverrides {
if let Some(ttl) = self.ttl_seconds {
cfg.cluster.meta_store.ttl_seconds = ttl;
}
if let Some(lo) = self.input_length_min {
cfg.sim.input_length_min = Some(lo);
}
if let Some(hi) = self.input_length_max {
cfg.sim.input_length_max = Some(hi);
}
Ok(())
}
}
@@ -74,7 +96,8 @@ enum Cmd {
#[command(flatten)]
overrides: ConfigOverrides,
},
/// Run the same trace under multiple routers and compare summaries.
/// Run the same trace under multiple routers and fixed-placement eviction
/// policies, then compare cache-hit summaries.
Ablate {
#[arg(short, long)]
config: PathBuf,
@@ -85,6 +108,31 @@ enum Cmd {
default_value = "random,least_loaded,least_tokens,ttl_aware,min_pd,cache_load,cache_score,estimated_ttft,prefix_affinity"
)]
routers: String,
/// Comma-separated eviction policies for ablation aggregation.
/// Currently only `lru` is supported.
#[arg(long, default_value = "lru")]
evict_policies: String,
/// Sweep `num_instances` from `--auto-candidates` with the
/// `--auto-probe-router` and pick the smallest cluster size whose
/// TTFT mean ≤ `--auto-target-ttft-mean`. Overrides the YAML
/// `num_instances` for the ablation run.
#[arg(long, default_value_t = false)]
auto_instances: bool,
/// Target TTFT mean (seconds) for auto-instances calibration.
#[arg(long, default_value_t = 4.0)]
auto_target_ttft_mean: f64,
/// Comma-separated candidate cluster sizes (ascending).
#[arg(long, default_value = "4,8,16,24,32,48,64,96,128")]
auto_candidates: String,
/// Router used as the calibration baseline. The smallest candidate
/// where this router's TTFT mean ≤ target is picked — all ablation
/// routers are then run at that cluster size.
#[arg(long, default_value = "cache_score")]
auto_probe_router: String,
/// Maximum number of routers to simulate concurrently.
/// `0` means auto-detect from available CPU parallelism.
#[arg(long, default_value_t = 0)]
jobs: usize,
#[command(flatten)]
overrides: ConfigOverrides,
},
@@ -125,8 +173,24 @@ fn main() -> Result<()> {
Cmd::Ablate {
config,
routers,
evict_policies,
auto_instances,
auto_target_ttft_mean,
auto_candidates,
auto_probe_router,
jobs,
overrides,
} => cmd_ablate(&config, &routers, &overrides),
} => cmd_ablate(
&config,
&routers,
&evict_policies,
auto_instances,
auto_target_ttft_mean,
&auto_candidates,
&auto_probe_router,
jobs,
&overrides,
),
Cmd::Validate { config, overrides } => cmd_validate(&config, &overrides),
Cmd::Oracle {
config,
@@ -134,13 +198,20 @@ fn main() -> Result<()> {
capacity_blocks,
per_instance,
out,
} => cmd_oracle(&config, &overrides, capacity_blocks, per_instance, out.as_deref()),
} => cmd_oracle(
&config,
&overrides,
capacity_blocks,
per_instance,
out.as_deref(),
),
}
}
fn load(config: &PathBuf, overrides: &ConfigOverrides) -> Result<Config> {
let mut cfg = Config::from_yaml_path(config)?;
overrides.apply(&mut cfg);
overrides.apply(&mut cfg)?;
cfg.cluster.validate()?;
Ok(cfg)
}
@@ -151,8 +222,19 @@ fn cmd_run(path: &PathBuf, overrides: &ConfigOverrides) -> Result<()> {
Ok(())
}
fn cmd_ablate(path: &PathBuf, routers: &str, overrides: &ConfigOverrides) -> Result<()> {
let base = load(path, overrides)?;
#[allow(clippy::too_many_arguments)]
fn cmd_ablate(
path: &PathBuf,
routers: &str,
evict_policies: &str,
auto_instances: bool,
auto_target_ttft_mean: f64,
auto_candidates: &str,
auto_probe_router: &str,
jobs: usize,
overrides: &ConfigOverrides,
) -> Result<()> {
let mut base = load(path, overrides)?;
let modes: Vec<RouterMode> = routers
.split(',')
.map(|s| s.trim())
@@ -160,15 +242,63 @@ fn cmd_ablate(path: &PathBuf, routers: &str, overrides: &ConfigOverrides) -> Res
.map(RouterMode::parse)
.collect::<Result<Vec<_>>>()
.with_context(|| format!("parsing --routers='{routers}'"))?;
let mut all = Vec::new();
for mode in modes {
let mut cfg = base.clone();
cfg.cluster.router.mode = mode;
let sub = mode.as_str().to_string();
eprintln!("[ablate] running router={}", sub);
let out = driver::run(&cfg, Some(&sub))?;
all.push(out.summary);
let policies: Vec<ReplayEvictPolicy> = evict_policies
.split(',')
.map(|s| s.trim())
.filter(|s| !s.is_empty())
.map(ReplayEvictPolicy::parse)
.collect::<Result<Vec<_>>>()
.with_context(|| format!("parsing --evict-policies='{evict_policies}'"))?;
if auto_instances {
let candidates: Vec<u32> = auto_candidates
.split(',')
.map(|s| s.trim())
.filter(|s| !s.is_empty())
.map(|s| {
s.parse::<u32>()
.with_context(|| format!("parsing --auto-candidates entry '{s}'"))
})
.collect::<Result<Vec<_>>>()?;
if candidates.is_empty() {
return Err(anyhow::anyhow!("--auto-candidates is empty"));
}
// Ascending so the first hit is the smallest cluster meeting the target.
let mut sorted = candidates.clone();
sorted.sort_unstable();
let probe_mode = RouterMode::parse(auto_probe_router)
.with_context(|| format!("parsing --auto-probe-router='{auto_probe_router}'"))?;
let chosen = auto_select_instances(&base, &sorted, probe_mode, auto_target_ttft_mean)?;
eprintln!(
"[ablate] auto-instances chose num_instances={} (target ttft_mean ≤ {:.3}s, probe_router={})",
chosen,
auto_target_ttft_mean,
probe_mode.as_str()
);
base.cluster.num_instances = Some(chosen);
base.cluster.buckets.clear();
}
eprintln!(
"[ablate] routers={} evict_policies={} num_instances={} jobs={}",
modes
.iter()
.map(RouterMode::as_str)
.collect::<Vec<_>>()
.join(","),
policies
.iter()
.map(ReplayEvictPolicy::as_str)
.collect::<Vec<_>>()
.join(","),
base.cluster.total_instances(),
if jobs == 0 {
"auto".to_string()
} else {
jobs.to_string()
},
);
let all = driver::ablate_fixed_placement_with_parallelism(&base, &modes, &policies, jobs)?;
let agg_path = std::path::Path::new(&base.sim.output_dir).join("ablation.json");
std::fs::create_dir_all(&base.sim.output_dir)?;
std::fs::write(&agg_path, serde_json::to_string_pretty(&all)?)?;
@@ -177,23 +307,140 @@ fn cmd_ablate(path: &PathBuf, routers: &str, overrides: &ConfigOverrides) -> Res
Ok(())
}
/// Sweep candidate cluster sizes ascending and return the smallest one whose
/// TTFT mean under `probe` is ≤ `target_ttft_mean`. Per-candidate calibration
/// summaries are written under `<output_dir>/auto_instances/` so the picked
/// N is auditable. If no candidate meets the target, returns an error naming
/// the best achievable TTFT.
fn auto_select_instances(
base: &Config,
candidates: &[u32],
probe: RouterMode,
target_ttft_mean: f64,
) -> Result<u32> {
base.cluster
.require_legacy_single_pool("auto-instances calibration")?;
#[derive(serde::Serialize)]
struct CalibRow {
num_instances: u32,
router: String,
ttft_mean: f64,
ttft_p50: f64,
ttft_p95: f64,
ttft_p99: f64,
num_requests: u64,
hit_rate_l0: f64,
passed: bool,
}
let out_root = std::path::Path::new(&base.sim.output_dir).join("auto_instances");
std::fs::create_dir_all(&out_root)?;
let mut log: Vec<CalibRow> = Vec::new();
let mut chosen: Option<u32> = None;
for &n in candidates {
let mut cfg = base.clone();
cfg.cluster.num_instances = Some(n);
cfg.cluster.router.mode = probe;
// Isolate calibration output so ablation runs don't overwrite it.
cfg.sim.output_dir = out_root
.join(format!("n{n}__{}", probe.as_str()))
.to_string_lossy()
.into_owned();
eprintln!(
"[auto-instances] probing num_instances={} router={} ...",
n,
probe.as_str()
);
let run = driver::run(&cfg, None)?;
let passed = run.summary.ttft_mean <= target_ttft_mean;
eprintln!(
"[auto-instances] ttft_mean={:.3}s p95={:.3}s hit_l0={:.4} -> {}",
run.summary.ttft_mean,
run.summary.ttft_p95,
run.summary.hit_rate_l0,
if passed { "PASS" } else { "fail" },
);
log.push(CalibRow {
num_instances: n,
router: probe.as_str().to_string(),
ttft_mean: run.summary.ttft_mean,
ttft_p50: run.summary.ttft_p50,
ttft_p95: run.summary.ttft_p95,
ttft_p99: run.summary.ttft_p99,
num_requests: run.summary.num_requests,
hit_rate_l0: run.summary.hit_rate_l0,
passed,
});
if passed && chosen.is_none() {
chosen = Some(n);
// Stop at the smallest passing N. Drop this break to keep sweeping
// and record the full TTFT-vs-N calibration curve instead.
break;
}
}
// Persist the calibration log either way so failures are debuggable.
let log_path = out_root.join("calibration.json");
std::fs::write(
&log_path,
serde_json::to_string_pretty(&serde_json::json!({
"target_ttft_mean": target_ttft_mean,
"probe_router": probe.as_str(),
"candidates": candidates,
"chosen": chosen,
"runs": log,
}))?,
)?;
eprintln!("[auto-instances] wrote {}", log_path.display());
chosen.ok_or_else(|| {
let best = log
.iter()
.min_by(|a, b| a.ttft_mean.partial_cmp(&b.ttft_mean).unwrap())
.map(|r| (r.num_instances, r.ttft_mean))
.unwrap_or((0, f64::INFINITY));
anyhow::anyhow!(
"no candidate met target ttft_mean ≤ {:.3}s; best was n={} at {:.3}s — \
widen --auto-candidates or raise --auto-target-ttft-mean",
target_ttft_mean,
best.0,
best.1,
)
})
}
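The selection logic above reduces to "first passing candidate in ascending order". A minimal standalone sketch of just that rule, assuming the per-candidate results have already been collected as `(num_instances, ttft_mean)` pairs sorted ascending by cluster size (`smallest_passing` is hypothetical, not part of the simulator):

```rust
/// Smallest cluster size whose probe TTFT mean meets the target,
/// or `None` if no candidate passes.
fn smallest_passing(results: &[(u32, f64)], target_ttft_mean: f64) -> Option<u32> {
    results
        .iter()
        .find(|&&(_, ttft_mean)| ttft_mean <= target_ttft_mean)
        .map(|&(n, _)| n)
}

// smallest_passing(&[(4, 9.1), (8, 4.6), (16, 3.2)], 4.0) == Some(16)
```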
fn cmd_validate(path: &PathBuf, overrides: &ConfigOverrides) -> Result<()> {
use kvcache_simulator::instance::compute::ComputeModel;
let cfg = load(path, overrides)?;
eprintln!("config OK: {}", cfg.model.name);
eprintln!("mode = {}", if cfg.model.is_arch_mode() { "architecture-derived" } else { "legacy manual" });
let cm = ComputeModel::new(&cfg.model, &cfg.hardware);
eprintln!(
"mode = {}",
if cfg.model.is_arch_mode() {
"architecture-derived"
} else {
"legacy manual"
}
);
let cm = ComputeModel::new(&cfg.model, &cfg.hardware, &cfg.calibration);
eprintln!("compute: {}", cm.describe());
eprintln!("kv_block_bytes = {} ({:.2} MB{})",
eprintln!(
"kv_block_bytes = {} ({:.2} MB{})",
cfg.model.kv_block_bytes(),
cfg.model.kv_block_bytes() as f64 / 1e6,
if cfg.model.mla.is_some() { ", MLA compressed" } else { "" },
if cfg.model.mla.is_some() {
", MLA compressed"
} else {
""
},
);
let block_bytes = cfg.model.kv_block_bytes() as f64;
let hbm_blocks = (cfg.hardware.hbm_bytes / block_bytes) as u64;
let dram_blocks = (cfg.hardware.dram_bytes / block_bytes) as u64;
eprintln!("per-instance HBM blocks = {hbm_blocks}, DRAM blocks = {dram_blocks}");
eprintln!("num_instances = {}", cfg.cluster.num_instances);
eprintln!("num_instances = {}", cfg.cluster.total_instances());
// Sample prefill times at a few prompt lengths.
eprintln!("prefill_time samples:");
for &n in &[256, 1024, 4096, 16384, 65536, 131072] {
@@ -224,9 +471,10 @@ fn cmd_oracle(
out_path: Option<&std::path::Path>,
) -> Result<()> {
let cfg = load(path, overrides)?;
cfg.cluster.require_legacy_single_pool("oracle analysis")?;
let block_bytes = cfg.model.kv_block_bytes() as f64;
let per_instance_blocks = (cfg.hardware.hbm_bytes / block_bytes).max(1.0) as u64;
let aggregate_blocks = per_instance_blocks * cfg.cluster.num_instances as u64;
let aggregate_blocks = per_instance_blocks * cfg.cluster.total_instances() as u64;
let capacity = match (capacity_blocks, per_instance) {
(Some(_), true) => {
return Err(anyhow::anyhow!(
@@ -244,17 +492,53 @@ fn cmd_oracle(
);
let reader = TraceReader::open(&cfg.sim.trace_path, cfg.sim.max_requests)?;
let records: Vec<_> = reader.collect::<Result<Vec<_>, _>>()?;
// Build a count-mask: all records feed the cache, but only records inside
// the input-length range contribute to hit/miss accounting. This way a
// 128k+ bucket still benefits from prefix blocks populated by shorter
// requests, matching the real mixed-workload ceiling.
let lo = cfg.sim.input_length_min.unwrap_or(0);
let hi = cfg.sim.input_length_max.unwrap_or(u32::MAX);
let has_filter = lo > 0 || hi < u32::MAX;
let count_mask: Option<Vec<bool>> = if has_filter {
let mask: Vec<bool> = records
.iter()
.map(|r| r.input_len >= lo && r.input_len <= hi)
.collect();
let counted = mask.iter().filter(|&&v| v).count();
eprintln!(
"[oracle] input_length filter [{}, {}] counting {}/{} requests \
(all {} used for cache state)",
lo,
if hi == u32::MAX {
"".to_string()
} else {
hi.to_string()
},
counted,
records.len(),
records.len(),
);
Some(mask)
} else {
None
};
eprintln!(
"[oracle] loaded {} requests; analyzing with capacity = {} blocks \
({} per-instance × {} instances{})",
records.len(),
capacity,
per_instance_blocks,
cfg.cluster.num_instances,
if per_instance { ", per-instance mode" } else { "" }
cfg.cluster.total_instances(),
if per_instance {
", per-instance mode"
} else {
""
}
);
let result = oracle::analyze(&records, capacity);
let result = oracle::analyze(&records, capacity, count_mask.as_deref());
let json = serde_json::to_string_pretty(&result)?;
println!("{}", json);
@@ -269,3 +553,123 @@ fn cmd_oracle(
eprintln!("[oracle] wrote {}", target.display());
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
use kvcache_simulator::config::{
BucketConfig, CalibrationConfig, ClusterConfig, GlobalRouterConfig, HardwareConfig,
MetaStoreConfig, ModelConfig, RouterConfig, SimConfig,
};
fn bucketed_config(out_dir: &str) -> Config {
Config {
model: ModelConfig {
name: "test".into(),
num_layers: 4,
num_kv_heads: 2,
head_dim: 64,
dtype_bytes: 2,
block_size_tokens: 16,
flops_per_token_prefill: Some(1.0e9),
attn_quadratic_coeff: Some(64.0),
..Default::default()
},
hardware: HardwareConfig {
gpu_flops: 1.0e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1.0e12,
hbm_bytes: 1.0e9,
dram_bytes: 4.0e9,
host_dram_bw: 5.0e11,
pcie_bw: 32.0e9,
pcie_latency_us: 1.0,
rdma_bw: 12.0e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
},
calibration: CalibrationConfig::default(),
cluster: ClusterConfig {
num_instances: None,
buckets: vec![
BucketConfig {
name: "short".into(),
input_length_min: 0,
input_length_max: 64,
num_instances: 2,
},
BucketConfig {
name: "long".into(),
input_length_min: 65,
input_length_max: 128,
num_instances: 1,
},
],
global_router: GlobalRouterConfig::default(),
meta_store: MetaStoreConfig {
ttl_seconds: 1000.0,
},
router: RouterConfig {
mode: RouterMode::Random,
precise_probe_latency_us: 10.0,
precise_probe_topk: 4,
load_alpha: 0.1,
score_alpha: 1.0,
score_beta: 0.1,
prefix_k: 8,
affinity_fan_out: 0,
},
},
sim: SimConfig {
trace_path: "unused.jsonl".into(),
max_requests: None,
output_dir: out_dir.into(),
sample_interval_s: 0.0,
seed: 7,
input_length_min: None,
input_length_max: None,
},
}
}
#[test]
fn num_instances_override_rejects_bucketed_configs() {
let mut cfg = bucketed_config(std::env::temp_dir().to_str().unwrap());
let overrides = ConfigOverrides {
num_instances: Some(8),
..ConfigOverrides::default()
};
let err = overrides.apply(&mut cfg).unwrap_err();
assert!(err.to_string().contains("--num-instances"));
assert!(err.to_string().contains("cluster.buckets"));
}
#[test]
fn auto_instances_rejects_bucketed_configs() {
let cfg = bucketed_config(std::env::temp_dir().to_str().unwrap());
let err = auto_select_instances(&cfg, &[4, 8], RouterMode::Random, 1.0).unwrap_err();
assert!(err.to_string().contains("auto-instances"));
assert!(err.to_string().contains("cluster.buckets"));
}
#[test]
fn oracle_rejects_bucketed_configs() {
let tmp = std::env::temp_dir().join("kvcache_sim_main_tests");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let path = tmp.join("bucketed.yaml");
let cfg = bucketed_config(tmp.to_str().unwrap());
std::fs::write(&path, serde_yaml::to_string(&cfg).unwrap()).unwrap();
let err = cmd_oracle(&path, &ConfigOverrides::default(), None, false, None).unwrap_err();
assert!(err.to_string().contains("oracle analysis"));
assert!(err.to_string().contains("cluster.buckets"));
}
}

50
src/metrics/ablation.rs Normal file

@@ -0,0 +1,50 @@
use serde::Serialize;
use crate::metrics::Summary;
use crate::replay::ReplayEvictPolicy;
#[derive(Debug, Clone, Serialize)]
pub struct AblationRow {
pub router: String,
pub evict_policy: String,
pub placement_source: String,
pub num_requests: u64,
pub total_blocks: u64,
pub ttft_mean: f64,
pub ttft_p50: f64,
pub ttft_p95: f64,
pub ttft_p99: f64,
pub hit_rate_l0: f64,
pub hit_rate_l1: f64,
pub hit_rate_remote: f64,
pub miss_rate: f64,
pub total_rdma_bytes: u64,
pub total_pcie_bytes: u64,
}
impl AblationRow {
pub fn from_summary(
router: &str,
policy: ReplayEvictPolicy,
placement_source: &str,
summary: &Summary,
) -> Self {
Self {
router: router.to_string(),
evict_policy: policy.as_str().to_string(),
placement_source: placement_source.to_string(),
num_requests: summary.num_requests,
total_blocks: summary.total_blocks,
ttft_mean: summary.ttft_mean,
ttft_p50: summary.ttft_p50,
ttft_p95: summary.ttft_p95,
ttft_p99: summary.ttft_p99,
hit_rate_l0: summary.hit_rate_l0,
hit_rate_l1: summary.hit_rate_l1,
hit_rate_remote: summary.hit_rate_remote,
miss_rate: summary.miss_rate,
total_rdma_bytes: summary.total_rdma_bytes,
total_pcie_bytes: summary.total_pcie_bytes,
}
}
}


@@ -1,7 +1,9 @@
pub mod ablation;
pub mod per_request;
pub mod routing_log;
pub mod summary;
pub mod timeseries;
pub use ablation::AblationRow;
pub use per_request::PerRequestRow;
pub use summary::Summary;


@@ -8,7 +8,9 @@ pub struct PerRequestRow {
pub arrival: f64,
pub ttft: f64,
pub e2e: f64,
pub bucket: u32,
pub instance: u32,
pub length_bucket_match: bool,
pub total_blocks: u32,
pub l0_hit_blocks: u32,
pub l1_hit_blocks: u32,


@@ -12,7 +12,9 @@ pub struct RoutingLogWriter {
impl RoutingLogWriter {
pub fn create<P: AsRef<Path>>(path: P) -> Result<Self> {
let f = File::create(path)?;
Ok(Self { inner: BufWriter::new(f) })
Ok(Self {
inner: BufWriter::new(f),
})
}
pub fn write(&mut self, decision: &RouteDecision) -> Result<()> {


@@ -5,6 +5,7 @@ use std::path::Path;
#[derive(Debug, Clone, Serialize)]
pub struct TimeseriesRow {
pub t: f64,
pub bucket: u32,
pub instance: u32,
pub queue_len: u32,
pub kv_blocks_used: u32,
@@ -19,7 +20,9 @@ pub struct TimeseriesWriter {
impl TimeseriesWriter {
pub fn create<P: AsRef<Path>>(path: P) -> Result<Self> {
let f = std::fs::File::create(path)?;
Ok(Self { inner: csv::Writer::from_writer(f) })
Ok(Self {
inner: csv::Writer::from_writer(f),
})
}
pub fn write(&mut self, row: &TimeseriesRow) -> Result<()> {


@@ -5,7 +5,7 @@
//! by `bytes / bw`. Latency is added on top of transfer time. This captures
//! contention without simulating individual packets.
use crate::config::HardwareConfig;
use crate::config::{CalibrationConfig, HardwareConfig};
#[derive(Debug, Clone)]
pub struct LinkModel {
@@ -53,10 +53,10 @@ pub struct InstanceLinks {
}
impl InstanceLinks {
pub fn from_hw(hw: &HardwareConfig) -> Self {
pub fn from_hw(hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
Self {
pcie: LinkModel::new(hw.pcie_bw, hw.pcie_latency_us * 1e-6),
rdma: LinkModel::new(hw.rdma_bw, hw.rdma_latency_us * 1e-6),
pcie: LinkModel::new(hw.pcie_bw * calib.pcie_bw_util, hw.pcie_latency_us * 1e-6),
rdma: LinkModel::new(hw.rdma_bw * calib.rdma_bw_util, hw.rdma_latency_us * 1e-6),
}
}
}


@@ -51,43 +51,75 @@ impl TierResult {
capacity_blocks,
hits,
misses,
hit_rate: if total == 0 { 0.0 } else { hits as f64 / total as f64 },
hit_rate: if total == 0 {
0.0
} else {
hits as f64 / total as f64
},
}
}
}
pub fn analyze(records: &[RequestRecord], capacity_blocks: u64) -> OracleResult {
// total / unique counters
let total_blocks: u64 = records.iter().map(|r| r.hash_ids.len() as u64).sum();
/// Run the oracle analysis over `records`.
///
/// When `count_mask` is `Some`, **all** records still feed the caches (so the
/// cache state reflects the full workload), but only records where
/// `count_mask[i]` is `true` contribute to hit / miss / total accounting.
/// This is the correct way to answer "what is the theoretical hit-rate for a
/// particular input-length bucket within a mixed workload?" — the cache sees
/// every request, but the metric only measures the bucket of interest.
///
/// When `count_mask` is `None`, every record is counted (original behaviour).
pub fn analyze(
records: &[RequestRecord],
capacity_blocks: u64,
count_mask: Option<&[bool]>,
) -> OracleResult {
// Build a default all-true mask when none is supplied.
let default_mask;
let mask: &[bool] = match count_mask {
Some(m) => {
assert_eq!(m.len(), records.len());
m
}
None => {
default_mask = vec![true; records.len()];
&default_mask
}
};
// total / unique counters — only for counted records
let mut total_blocks: u64 = 0;
let mut unique = AHashSet::new();
for r in records {
for &h in &r.hash_ids {
unique.insert(h);
let mut num_requests: u64 = 0;
for (i, r) in records.iter().enumerate() {
if mask[i] {
total_blocks += r.hash_ids.len() as u64;
num_requests += 1;
for &h in &r.hash_ids {
unique.insert(h);
}
}
}
// 1. Unlimited cache
let unlimited_hits = run_unlimited(records);
let unlimited = TierResult::from_counts(
"unlimited",
u64::MAX,
unlimited_hits,
total_blocks,
);
let unlimited_hits = run_unlimited(records, mask);
let unlimited = TierResult::from_counts("unlimited", u64::MAX, unlimited_hits, total_blocks);
// 2. Precompute next-use index for Belady
// 2. Precompute next-use index for Belady (over ALL records — eviction
// decisions must consider future accesses from the full workload)
let next_use = build_next_use(records);
// 3. Belady at the given capacity
let belady_hits = run_belady(records, &next_use, capacity_blocks as usize);
let belady_hits = run_belady(records, &next_use, capacity_blocks as usize, mask);
let belady = TierResult::from_counts("belady", capacity_blocks, belady_hits, total_blocks);
// 4. LRU baseline at the same capacity
let lru_hits = run_lru(records, capacity_blocks as usize);
let lru_hits = run_lru(records, capacity_blocks as usize, mask);
let lru = TierResult::from_counts("lru", capacity_blocks, lru_hits, total_blocks);
OracleResult {
num_requests: records.len() as u64,
num_requests,
total_blocks,
unique_blocks: unique.len() as u64,
unlimited,
@@ -96,18 +128,21 @@ pub fn analyze(records: &[RequestRecord], capacity_blocks: u64) -> OracleResult
}
}
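As a usage sketch of the count-mask contract (variable names assumed to be in scope as in `cmd_oracle`): to measure the theoretical ceiling for a long-input bucket inside a mixed trace, mask short requests out of the accounting while still letting every record populate the caches:

```rust
// Count only requests with >= 131072 input tokens; all records still
// feed the simulated cache state.
let mask: Vec<bool> = records.iter().map(|r| r.input_len >= 131_072).collect();
let long_bucket = analyze(&records, capacity_blocks, Some(&mask));
```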
fn run_unlimited(records: &[RequestRecord]) -> u64 {
fn run_unlimited(records: &[RequestRecord], mask: &[bool]) -> u64 {
let mut seen: AHashSet<u64> = AHashSet::with_capacity(1 << 18);
let mut hits: u64 = 0;
for r in records {
// Longest prefix match against `seen`
for &h in &r.hash_ids {
if seen.contains(&h) {
hits += 1;
} else {
break;
for (i, r) in records.iter().enumerate() {
// Longest prefix match against `seen` — only count for masked records
if mask[i] {
for &h in &r.hash_ids {
if seen.contains(&h) {
hits += 1;
} else {
break;
}
}
}
// Always populate the cache so all requests contribute to cache state
for &h in &r.hash_ids {
seen.insert(h);
}
@@ -115,15 +150,20 @@ fn run_unlimited(records: &[RequestRecord]) -> u64 {
hits
}
fn run_lru(records: &[RequestRecord], capacity: usize) -> u64 {
fn run_lru(records: &[RequestRecord], capacity: usize, mask: &[bool]) -> u64 {
if capacity == 0 {
return 0;
}
let mut cache = LruBlocks::new(capacity);
let mut hits: u64 = 0;
let mut evicted = Vec::new();
for r in records {
hits += cache.longest_prefix(&r.hash_ids) as u64;
for (i, r) in records.iter().enumerate() {
// Always touch the cache (longest_prefix updates LRU recency) so that
// the eviction order reflects the real mixed workload.
let prefix_len = cache.longest_prefix(&r.hash_ids) as u64;
if mask[i] {
hits += prefix_len;
}
evicted.clear();
cache.insert_blocks(&r.hash_ids, &mut evicted);
}
@@ -156,7 +196,12 @@ fn build_next_use(records: &[RequestRecord]) -> Vec<Vec<u32>> {
/// Implementation: lazy-deletion max-heap keyed by next-use index. Each
/// cache entry has a version; the heap may contain stale entries from
/// previous insertions, which we skip on pop.
fn run_belady(records: &[RequestRecord], next_use: &[Vec<u32>], capacity: usize) -> u64 {
fn run_belady(
records: &[RequestRecord],
next_use: &[Vec<u32>],
capacity: usize,
mask: &[bool],
) -> u64 {
if capacity == 0 {
return 0;
}
@@ -169,16 +214,19 @@ fn run_belady(records: &[RequestRecord], next_use: &[Vec<u32>], capacity: usize)
let mut hits: u64 = 0;
for (i, r) in records.iter().enumerate() {
// 1. Longest-prefix hit accounting against current cache.
for &h in &r.hash_ids {
if in_cache.contains_key(&h) {
hits += 1;
} else {
break;
// 1. Longest-prefix hit accounting — only count for masked records.
if mask[i] {
for &h in &r.hash_ids {
if in_cache.contains_key(&h) {
hits += 1;
} else {
break;
}
}
}
// 2. Insert / update each block in the request with its new next-use.
// Always executed so the cache reflects the full workload.
for (j, &h) in r.hash_ids.iter().enumerate() {
let nu = next_use[i][j];
if let Some(slot) = in_cache.get_mut(&h) {
@@ -222,6 +270,8 @@ mod tests {
RequestRecord {
req_id: id,
chat_id: id as i64,
parent_chat_id: -1,
turn: 1,
arrival: t,
input_len: (hashes.len() as u32) * 16,
output_len: 16,
@@ -236,7 +286,7 @@ mod tests {
req(1, 1.0, vec![1, 2, 3, 4]),
req(2, 2.0, vec![1, 2, 3, 4, 5]),
];
let out = analyze(&recs, 100);
let out = analyze(&recs, 100, None);
// total = 3 + 4 + 5 = 12
assert_eq!(out.total_blocks, 12);
// unique = {1,2,3,4,5} = 5
@@ -255,7 +305,7 @@ mod tests {
for (i, &h) in pattern.iter().enumerate() {
recs.push(req(i as u64, i as f64, vec![h]));
}
let out = analyze(&recs, 2);
let out = analyze(&recs, 2, None);
assert!(
out.belady_finite.hits >= out.lru_finite.hits,
"belady should be at least as good as lru: belady={} lru={}",
@@ -272,8 +322,39 @@ mod tests {
req(2, 2.0, vec![60]),
req(3, 3.0, vec![10, 20, 30, 40, 50, 60]),
];
let out = analyze(&recs, 3);
let out = analyze(&recs, 3, None);
assert!(out.unlimited.hit_rate >= out.belady_finite.hit_rate);
assert!(out.belady_finite.hit_rate >= out.lru_finite.hit_rate - 1e-9);
}
#[test]
fn count_mask_filters_accounting_not_cache() {
// req 0 populates blocks [1,2,3] but is not counted.
// req 1 has prefix [1,2,3,4] — the first 3 blocks are cache hits
// because req 0 populated them, even though req 0 is masked out.
let recs = vec![req(0, 0.0, vec![1, 2, 3]), req(1, 1.0, vec![1, 2, 3, 4])];
let mask = vec![false, true];
let out = analyze(&recs, 100, Some(&mask));
// Only req 1 is counted: total = 4, hits = 3 (prefix [1,2,3] hit)
assert_eq!(out.num_requests, 1);
assert_eq!(out.total_blocks, 4);
assert_eq!(out.unlimited.hits, 3);
assert!((out.unlimited.hit_rate - 3.0 / 4.0).abs() < 1e-9);
}
#[test]
fn count_mask_none_matches_all_true() {
let recs = vec![
req(0, 0.0, vec![1, 2, 3]),
req(1, 1.0, vec![1, 2, 3, 4]),
req(2, 2.0, vec![1, 2, 3, 4, 5]),
];
let out_none = analyze(&recs, 10, None);
let all_true = vec![true; recs.len()];
let out_all = analyze(&recs, 10, Some(&all_true));
assert_eq!(out_none.unlimited.hits, out_all.unlimited.hits);
assert_eq!(out_none.belady_finite.hits, out_all.belady_finite.hits);
assert_eq!(out_none.lru_finite.hits, out_all.lru_finite.hits);
assert_eq!(out_none.total_blocks, out_all.total_blocks);
}
}

610
src/replay.rs Normal file

@@ -0,0 +1,610 @@
use ahash::{AHashMap, AHashSet};
use anyhow::{anyhow, Result};
use serde::Serialize;
use std::cmp::min;
use std::collections::BinaryHeap;
use crate::config::Config;
use crate::instance::kv_cache::LruBlocks;
use crate::trace::RequestRecord;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize)]
#[serde(rename_all = "snake_case")]
pub enum ReplayEvictPolicy {
Lru,
Belady,
}
impl ReplayEvictPolicy {
pub fn parse(s: &str) -> Result<Self> {
match s {
"lru" => Ok(Self::Lru),
"belady" => Err(anyhow!(
"exact belady is not supported for fixed-placement full-hierarchy ablation"
)),
other => Err(anyhow!("unknown evict policy: {other}")),
}
}
pub fn as_str(&self) -> &'static str {
match self {
Self::Lru => "lru",
Self::Belady => "belady",
}
}
}
#[derive(Debug, Clone)]
pub struct PlacementEntry {
pub req_id: u64,
pub instance: u32,
}
#[derive(Debug, Clone, Serialize, Default)]
pub struct ReplaySummary {
pub num_requests: u64,
pub total_blocks: u64,
pub l0_hit_blocks: u64,
pub l1_hit_blocks: u64,
pub remote_hit_blocks: u64,
pub miss_blocks: u64,
pub hit_rate_l0: f64,
pub hit_rate_l1: f64,
pub hit_rate_remote: f64,
pub miss_rate: f64,
pub total_rdma_bytes: u64,
pub total_pcie_bytes: u64,
}
impl ReplaySummary {
fn from_counts(
num_requests: usize,
total_blocks: u64,
l0_hit_blocks: u64,
l1_hit_blocks: u64,
remote_hit_blocks: u64,
miss_blocks: u64,
total_rdma_bytes: u64,
total_pcie_bytes: u64,
) -> Self {
let denom = total_blocks.max(1) as f64;
Self {
num_requests: num_requests as u64,
total_blocks,
l0_hit_blocks,
l1_hit_blocks,
remote_hit_blocks,
miss_blocks,
hit_rate_l0: l0_hit_blocks as f64 / denom,
hit_rate_l1: l1_hit_blocks as f64 / denom,
hit_rate_remote: remote_hit_blocks as f64 / denom,
miss_rate: miss_blocks as f64 / denom,
total_rdma_bytes,
total_pcie_bytes,
}
}
}
#[derive(Debug, Clone, Copy)]
enum FutureKind {
L0,
L1,
}
#[derive(Debug)]
struct FutureIndex {
local: AHashMap<(u32, u64), Vec<usize>>,
global: AHashMap<u64, Vec<(usize, u32)>>,
}
impl FutureIndex {
fn build(records: &[RequestRecord], placement: &[u32]) -> Self {
let mut local: AHashMap<(u32, u64), Vec<usize>> = AHashMap::new();
let mut global: AHashMap<u64, Vec<(usize, u32)>> = AHashMap::new();
for (req_idx, record) in records.iter().enumerate() {
let inst = placement[req_idx];
let mut seen = AHashSet::new();
for &block in &record.hash_ids {
if !seen.insert(block) {
continue;
}
local.entry((inst, block)).or_default().push(req_idx);
global.entry(block).or_default().push((req_idx, inst));
}
}
Self { local, global }
}
fn next_local(&self, inst: u32, block: u64, current_req_idx: usize) -> usize {
match self.local.get(&(inst, block)) {
Some(indices) => next_after(indices, current_req_idx),
None => usize::MAX,
}
}
fn next_other(&self, inst: u32, block: u64, current_req_idx: usize) -> usize {
let Some(indices) = self.global.get(&block) else {
return usize::MAX;
};
let start = first_after_pair(indices, current_req_idx);
for &(req_idx, owner_inst) in indices.iter().skip(start) {
if owner_inst != inst {
return req_idx;
}
}
usize::MAX
}
fn next_use(&self, kind: FutureKind, inst: u32, block: u64, current_req_idx: usize) -> usize {
match kind {
FutureKind::L0 => self.next_local(inst, block, current_req_idx),
FutureKind::L1 => min(
self.next_local(inst, block, current_req_idx),
self.next_other(inst, block, current_req_idx),
),
}
}
}
fn next_after(indices: &[usize], current_req_idx: usize) -> usize {
let pos = indices.partition_point(|&idx| idx <= current_req_idx);
indices.get(pos).copied().unwrap_or(usize::MAX)
}
fn first_after_pair(indices: &[(usize, u32)], current_req_idx: usize) -> usize {
indices.partition_point(|&(idx, _)| idx <= current_req_idx)
}
#[derive(Debug)]
struct BeladyTier {
capacity: usize,
resident: AHashSet<u64>,
versions: AHashMap<u64, u64>,
heap: BinaryHeap<(usize, u64, u64)>,
next_version: u64,
}
impl BeladyTier {
fn new(capacity: usize) -> Self {
Self {
capacity,
resident: AHashSet::with_capacity(capacity),
versions: AHashMap::with_capacity(capacity),
heap: BinaryHeap::with_capacity(capacity),
next_version: 0,
}
}
fn contains(&self, key: u64) -> bool {
self.resident.contains(&key)
}
fn remove(&mut self, key: u64) -> bool {
if self.resident.remove(&key) {
self.versions.remove(&key);
true
} else {
false
}
}
fn touch(
&mut self,
key: u64,
current_req_idx: usize,
kind: FutureKind,
inst: u32,
futures: &FutureIndex,
) -> bool {
if !self.resident.contains(&key) {
return false;
}
self.next_version += 1;
let version = self.next_version;
let next_use = futures.next_use(kind, inst, key, current_req_idx);
self.versions.insert(key, version);
self.heap.push((next_use, version, key));
true
}
fn insert(
&mut self,
key: u64,
current_req_idx: usize,
kind: FutureKind,
inst: u32,
futures: &FutureIndex,
) -> Option<u64> {
if self.touch(key, current_req_idx, kind, inst, futures) {
return None;
}
if self.capacity == 0 {
return Some(key);
}
let mut evicted = None;
if self.resident.len() == self.capacity {
evicted = self.evict(current_req_idx, kind, inst, futures);
}
self.next_version += 1;
let version = self.next_version;
let next_use = futures.next_use(kind, inst, key, current_req_idx);
self.resident.insert(key);
self.versions.insert(key, version);
self.heap.push((next_use, version, key));
evicted
}
fn evict(
&mut self,
current_req_idx: usize,
kind: FutureKind,
inst: u32,
futures: &FutureIndex,
) -> Option<u64> {
while let Some((stored_next_use, version, key)) = self.heap.pop() {
if !self.resident.contains(&key) {
continue;
}
let Some(current_version) = self.versions.get(&key).copied() else {
continue;
};
if current_version != version {
continue;
}
let actual_next_use = futures.next_use(kind, inst, key, current_req_idx);
if actual_next_use != stored_next_use {
self.next_version += 1;
let new_version = self.next_version;
self.versions.insert(key, new_version);
self.heap.push((actual_next_use, new_version, key));
continue;
}
self.resident.remove(&key);
self.versions.remove(&key);
return Some(key);
}
None
}
}
#[derive(Debug)]
enum Tier {
Lru(LruBlocks),
Belady(BeladyTier),
}
impl Tier {
fn new(policy: ReplayEvictPolicy, capacity: usize) -> Self {
match policy {
ReplayEvictPolicy::Lru => Self::Lru(LruBlocks::new(capacity)),
ReplayEvictPolicy::Belady => Self::Belady(BeladyTier::new(capacity)),
}
}
fn contains(&self, key: u64) -> bool {
match self {
Self::Lru(tier) => tier.contains(key),
Self::Belady(tier) => tier.contains(key),
}
}
fn remove(&mut self, key: u64) -> bool {
match self {
Self::Lru(tier) => tier.remove(key),
Self::Belady(tier) => tier.remove(key),
}
}
fn touch(
&mut self,
key: u64,
req_idx: usize,
kind: FutureKind,
inst: u32,
futures: &FutureIndex,
) -> bool {
match self {
Self::Lru(tier) => tier.touch(key),
Self::Belady(tier) => tier.touch(key, req_idx, kind, inst, futures),
}
}
fn insert(
&mut self,
key: u64,
req_idx: usize,
kind: FutureKind,
inst: u32,
futures: &FutureIndex,
) -> Option<u64> {
match self {
Self::Lru(tier) => tier.insert_block(key),
Self::Belady(tier) => tier.insert(key, req_idx, kind, inst, futures),
}
}
fn longest_prefix_touch(
&mut self,
hashes: &[u64],
req_idx: usize,
kind: FutureKind,
inst: u32,
futures: &FutureIndex,
) -> usize {
match self {
Self::Lru(tier) => tier.longest_prefix(hashes),
Self::Belady(tier) => {
let mut matched = 0usize;
for &hash in hashes {
if !tier.touch(hash, req_idx, kind, inst, futures) {
break;
}
matched += 1;
}
matched
}
}
}
fn longest_prefix_peek(&self, hashes: &[u64]) -> usize {
match self {
Self::Lru(tier) => tier.longest_prefix_peek(hashes),
Self::Belady(tier) => {
let mut matched = 0usize;
for &hash in hashes {
if !tier.contains(hash) {
break;
}
matched += 1;
}
matched
}
}
}
}
#[derive(Debug)]
struct ReplayInstanceCache {
l0: Tier,
l1: Tier,
}
impl ReplayInstanceCache {
fn new(policy: ReplayEvictPolicy, l0_cap: usize, l1_cap: usize) -> Self {
Self {
l0: Tier::new(policy, l0_cap),
l1: Tier::new(policy, l1_cap),
}
}
fn promote_l1_blocks_to_l0(
&mut self,
hashes: &[u64],
req_idx: usize,
inst: u32,
futures: &FutureIndex,
owners: &mut AHashMap<u64, AHashSet<u32>>,
) {
for &hash in hashes {
if self.l1.remove(hash) {
remove_owner(owners, hash, inst);
}
self.insert_block_into_l0(hash, req_idx, inst, futures, owners);
}
}
fn fetch_remote_blocks_to_l0(
&mut self,
hashes: &[u64],
req_idx: usize,
inst: u32,
futures: &FutureIndex,
owners: &mut AHashMap<u64, AHashSet<u32>>,
) {
for &hash in hashes {
self.stage_remote_block_in_l1(hash, req_idx, inst, futures, owners);
if self.l1.remove(hash) {
remove_owner(owners, hash, inst);
}
self.insert_block_into_l0(hash, req_idx, inst, futures, owners);
}
}
fn insert_blocks_into_l0(
&mut self,
hashes: &[u64],
req_idx: usize,
inst: u32,
futures: &FutureIndex,
owners: &mut AHashMap<u64, AHashSet<u32>>,
) {
for &hash in hashes {
self.insert_block_into_l0(hash, req_idx, inst, futures, owners);
}
}
fn insert_block_into_l0(
&mut self,
hash: u64,
req_idx: usize,
inst: u32,
futures: &FutureIndex,
owners: &mut AHashMap<u64, AHashSet<u32>>,
) {
if self.l0.touch(hash, req_idx, FutureKind::L0, inst, futures) {
return;
}
if self.l1.remove(hash) {
remove_owner(owners, hash, inst);
}
if let Some(evicted_l0) = self.l0.insert(hash, req_idx, FutureKind::L0, inst, futures) {
self.demote_into_l1(evicted_l0, req_idx, inst, futures, owners);
}
}
fn stage_remote_block_in_l1(
&mut self,
hash: u64,
req_idx: usize,
inst: u32,
futures: &FutureIndex,
owners: &mut AHashMap<u64, AHashSet<u32>>,
) {
if self.l0.contains(hash) || self.l1.contains(hash) {
return;
}
if let Some(evicted_l1) = self.l1.insert(hash, req_idx, FutureKind::L1, inst, futures) {
remove_owner(owners, evicted_l1, inst);
}
add_owner(owners, hash, inst);
}
fn demote_into_l1(
&mut self,
hash: u64,
req_idx: usize,
inst: u32,
futures: &FutureIndex,
owners: &mut AHashMap<u64, AHashSet<u32>>,
) {
if self.l1.touch(hash, req_idx, FutureKind::L1, inst, futures) {
return;
}
if let Some(evicted_l1) = self.l1.insert(hash, req_idx, FutureKind::L1, inst, futures) {
remove_owner(owners, evicted_l1, inst);
}
add_owner(owners, hash, inst);
}
}
fn add_owner(owners: &mut AHashMap<u64, AHashSet<u32>>, hash: u64, inst: u32) {
owners.entry(hash).or_default().insert(inst);
}
fn remove_owner(owners: &mut AHashMap<u64, AHashSet<u32>>, hash: u64, inst: u32) {
if let Some(bucket) = owners.get_mut(&hash) {
bucket.remove(&inst);
if bucket.is_empty() {
owners.remove(&hash);
}
}
}
pub fn replay_fixed_placement(
cfg: &Config,
records: &[RequestRecord],
placements: &[PlacementEntry],
policy: ReplayEvictPolicy,
) -> Result<ReplaySummary> {
cfg.cluster
.require_legacy_single_pool("fixed-placement replay")?;
if records.len() != placements.len() {
return Err(anyhow!(
"records/placements length mismatch: {} vs {}",
records.len(),
placements.len()
));
}
let placement_by_req: AHashMap<u64, u32> =
placements.iter().map(|p| (p.req_id, p.instance)).collect();
let ordered_placement: Vec<u32> = records
.iter()
.map(|r| {
placement_by_req
.get(&r.req_id)
.copied()
.ok_or_else(|| anyhow!("missing placement for req_id={}", r.req_id))
})
.collect::<Result<_>>()?;
let futures = FutureIndex::build(records, &ordered_placement);
let block_bytes = cfg.model.kv_block_bytes() as f64;
let l0_cap = (cfg.hardware.hbm_bytes / block_bytes).max(1.0) as usize;
let l1_cap = (cfg.hardware.dram_bytes / block_bytes).max(1.0) as usize;
let num_instances = cfg.cluster.total_instances() as usize;
let mut caches: Vec<ReplayInstanceCache> = (0..num_instances)
.map(|_| ReplayInstanceCache::new(policy, l0_cap, l1_cap))
.collect();
let mut owners: AHashMap<u64, AHashSet<u32>> = AHashMap::new();
let mut total_blocks = 0u64;
let mut l0_hit_blocks = 0u64;
let mut l1_hit_blocks = 0u64;
let mut remote_hit_blocks = 0u64;
let mut miss_blocks = 0u64;
let mut total_rdma_bytes = 0u64;
let mut total_pcie_bytes = 0u64;
for (req_idx, record) in records.iter().enumerate() {
let inst = ordered_placement[req_idx];
let cache = &mut caches[inst as usize];
total_blocks += record.hash_ids.len() as u64;
let l0_hits = cache.l0.longest_prefix_touch(
&record.hash_ids,
req_idx,
FutureKind::L0,
inst,
&futures,
);
let suffix_after_l0 = &record.hash_ids[l0_hits..];
let l1_hits = cache.l1.longest_prefix_peek(suffix_after_l0);
if l1_hits > 0 {
cache.promote_l1_blocks_to_l0(
&suffix_after_l0[..l1_hits],
req_idx,
inst,
&futures,
&mut owners,
);
}
let suffix_after_l1 = &suffix_after_l0[l1_hits..];
let mut remote_hits = 0usize;
for &hash in suffix_after_l1 {
let any_remote = owners
.get(&hash)
.map(|bucket| bucket.iter().any(|owner| *owner != inst))
.unwrap_or(false);
if any_remote {
remote_hits += 1;
} else {
break;
}
}
if remote_hits > 0 {
cache.fetch_remote_blocks_to_l0(
&suffix_after_l1[..remote_hits],
req_idx,
inst,
&futures,
&mut owners,
);
}
let misses = record.hash_ids.len() - l0_hits - l1_hits - remote_hits;
let new_input = &record.hash_ids[(l0_hits + l1_hits + remote_hits)..];
if !new_input.is_empty() {
cache.insert_blocks_into_l0(new_input, req_idx, inst, &futures, &mut owners);
}
l0_hit_blocks += l0_hits as u64;
l1_hit_blocks += l1_hits as u64;
remote_hit_blocks += remote_hits as u64;
miss_blocks += misses as u64;
let kv_block_bytes = cfg.model.kv_block_bytes();
total_rdma_bytes += (remote_hits as u64) * kv_block_bytes;
total_pcie_bytes += ((l1_hits + remote_hits) as u64) * kv_block_bytes;
}
Ok(ReplaySummary::from_counts(
records.len(),
total_blocks,
l0_hit_blocks,
l1_hit_blocks,
remote_hit_blocks,
miss_blocks,
total_rdma_bytes,
total_pcie_bytes,
))
}
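The `BeladyTier` above is built on a lazy-deletion max-heap: pushes never remove superseded entries, and pops discard any entry whose version no longer matches the live one. A self-contained sketch of just that pattern (simplified; not the simulator's API):

```rust
use std::collections::{BinaryHeap, HashMap};

/// Max-heap of (next_use, version, key) with lazy deletion: re-pushing a
/// key bumps its version, so stale heap entries are skipped on pop.
#[derive(Default)]
struct LazyMaxHeap {
    heap: BinaryHeap<(usize, u64, u64)>,
    versions: HashMap<u64, u64>,
    next_version: u64,
}

impl LazyMaxHeap {
    fn push(&mut self, key: u64, next_use: usize) {
        self.next_version += 1;
        self.versions.insert(key, self.next_version);
        self.heap.push((next_use, self.next_version, key));
    }

    /// Pop the live entry with the farthest next use (Belady's victim).
    fn pop_farthest(&mut self) -> Option<u64> {
        while let Some((_, version, key)) = self.heap.pop() {
            if self.versions.get(&key) == Some(&version) {
                self.versions.remove(&key);
                return Some(key);
            }
        }
        None
    }
}
```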


@@ -0,0 +1,254 @@
//! Adaptive affinity routing for coding-agent workloads.
//!
//! The trace has two distinct regimes:
//!
//! 1. Short root prompts with a tiny shared stem. Their *current-request*
//! miss cost is small, so pure TTFT minimization tends to scatter them.
//! That destroys future cache locality for the dominant system prompts.
//! 2. Continuations or already-warm requests with a long local prefix. Their
//! immediate reuse is already visible, so a first-principles TTFT estimate
//! is the right objective.
//!
//! This router separates the two:
//!
//! - Keep a lightweight per-prefix heat map over the first few blocks.
//! - Only when a prefix family is both short and hot do we enforce affinity.
//! - The hot prefix is mapped to a deterministic rendezvous-ranked home set,
//! and the set widens logarithmically as the family gets hotter.
//! - Within that home set we still minimize estimated TTFT, and we fall back
//! to the global TTFT optimum if the affinity choice is clearly overloaded.
use std::collections::HashMap;
use crate::cluster::meta_store::MetaStore;
use crate::config::Config;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
use crate::ttft::{classify_prefix_tiers, TtftModel};
#[derive(Debug, Clone, Copy, Default)]
struct PrefixStat {
seen: u16,
last_seen: f64,
}
#[derive(Debug, Clone, Copy)]
struct CostedCandidate {
idx: usize,
cost: f64,
reusable_blocks: u32,
queue_len: u32,
rendezvous: u64,
}
pub struct AdaptiveAffinityRouter {
ttft_model: TtftModel,
fingerprint_k: usize,
short_request_blocks: usize,
warm_prefix_blocks: u32,
hot_threshold: u16,
hot_ttl_s: f64,
max_fan_out: usize,
overload_ratio: f64,
overload_abs_s: f64,
prefix_stats: HashMap<u64, PrefixStat>,
}
impl AdaptiveAffinityRouter {
pub fn new(config: &Config) -> Self {
let n = config.cluster.total_instances() as usize;
let configured_fan_out = config.cluster.router.affinity_fan_out;
let max_fan_out = if configured_fan_out > 0 {
configured_fan_out.max(2).min(n)
} else {
(n / 8).max(8).min(n)
};
Self {
ttft_model: TtftModel::new(
&config.hardware,
&config.calibration,
config.model.kv_block_bytes(),
),
// Coding-trace reuse is dominated by the system prompt stem.
fingerprint_k: config.cluster.router.prefix_k.clamp(1, 4),
short_request_blocks: 12,
warm_prefix_blocks: 8,
hot_threshold: 4,
hot_ttl_s: config.cluster.meta_store.ttl_seconds.max(1.0),
max_fan_out,
overload_ratio: 1.25,
overload_abs_s: 0.25,
prefix_stats: HashMap::new(),
}
}
fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
let take = hash_ids.len().min(k.max(1));
let mut fp: u64 = 0xcbf29ce484222325;
for &h in &hash_ids[..take] {
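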
fp ^= h;
fp = fp.wrapping_mul(0x100000001b3);
}
fp
}
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
h = h.wrapping_add(0x9e3779b97f4a7c15);
h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
h ^ (h >> 31)
}
fn active_heat(&self, fp: u64, now: f64) -> u16 {
self.prefix_stats
.get(&fp)
.filter(|stat| now - stat.last_seen <= self.hot_ttl_s)
.map(|stat| stat.seen)
.unwrap_or(0)
}
fn observe(&mut self, fp: u64, now: f64) {
let stat = self.prefix_stats.entry(fp).or_default();
if now - stat.last_seen > self.hot_ttl_s {
stat.seen = 0;
}
stat.last_seen = now;
stat.seen = stat.seen.saturating_add(1);
}
fn fan_out(&self, heat: u16, n: usize) -> usize {
if heat < self.hot_threshold {
return 2.min(n);
}
let multiples = (heat / self.hot_threshold).max(1) as u32;
let extra = multiples.ilog2() as usize;
(2 + extra).min(self.max_fan_out).min(n).max(2)
}
fn candidate_cost(
&self,
req: &RequestRecord,
inst: &Instance,
meta: &MetaStore,
now: f64,
scheduler_s: f64,
) -> (f64, u32) {
let residency = classify_prefix_tiers(&req.hash_ids, inst, meta, now);
let reusable =
residency.l0_hit_blocks + residency.l1_hit_blocks + residency.remote_hit_blocks;
let miss_tokens = residency.miss_blocks.saturating_mul(inst.block_size_tokens);
let kv_prepare = self.ttft_model.kv_prepare_time_s(residency);
let cost = inst.estimated_drain_time()
+ scheduler_s
+ kv_prepare
+ inst.compute.prefill_time(miss_tokens)
+ self.ttft_model.first_token_tail_s();
(cost, reusable)
}
fn better(a: CostedCandidate, b: CostedCandidate) -> bool {
a.cost < b.cost
|| (a.cost == b.cost && a.queue_len < b.queue_len)
|| (a.cost == b.cost
&& a.queue_len == b.queue_len
&& a.reusable_blocks > b.reusable_blocks)
}
}
impl Router for AdaptiveAffinityRouter {
fn name(&self) -> &'static str {
"adaptive_affinity"
}
fn route(
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
) -> RouteDecision {
let n = instances.len();
let scheduler_s = self.ttft_model.scheduler_overhead_s(n, 3);
let fp = Self::fingerprint(&req.hash_ids, self.fingerprint_k);
let active_heat = self.active_heat(fp, now);
let mut best_local_l0 = 0u32;
let mut candidates = Vec::with_capacity(n);
let mut scored = Vec::with_capacity(n);
for (idx, inst) in instances.iter().enumerate() {
let l0_hit = inst.cache.l0.longest_prefix_peek(&req.hash_ids) as u32;
best_local_l0 = best_local_l0.max(l0_hit);
let (cost, reusable_blocks) = self.candidate_cost(req, inst, meta, now, scheduler_s);
let rendezvous = Self::rendezvous(fp, inst.id);
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: reusable_blocks,
load_blocks: inst.kv_blocks_used,
queue_len: inst.queue_len(),
});
scored.push(CostedCandidate {
idx,
cost,
reusable_blocks,
queue_len: inst.queue_len(),
rendezvous,
});
}
let mut global_best = scored[0];
for cand in scored.iter().copied().skip(1) {
if Self::better(cand, global_best) {
global_best = cand;
}
}
let should_affinitize = req.hash_ids.len() <= self.short_request_blocks
&& best_local_l0 <= self.warm_prefix_blocks
&& active_heat.saturating_add(1) >= self.hot_threshold;
let (chosen_idx, reason) = if should_affinitize {
let fan_out = self.fan_out(active_heat.saturating_add(1), n);
scored.sort_unstable_by(|a, b| b.rendezvous.cmp(&a.rendezvous));
let mut home_best = scored[0];
for cand in scored.iter().copied().take(fan_out).skip(1) {
let better = Self::better(cand, home_best)
|| (cand.cost == home_best.cost
&& cand.queue_len == home_best.queue_len
&& cand.reusable_blocks == home_best.reusable_blocks
&& cand.rendezvous > home_best.rendezvous);
if better {
home_best = cand;
}
}
let home_cost_ok =
home_best.cost <= global_best.cost * self.overload_ratio + self.overload_abs_s;
if home_cost_ok {
(home_best.idx, "adaptive affinity: hot short prefix homeset")
} else {
(
global_best.idx,
"adaptive affinity fallback: global min estimated ttft",
)
}
} else {
(global_best.idx, "global min estimated ttft")
};
self.observe(fp, now);
crate::router::local_route_decision(
req.req_id,
"adaptive_affinity",
instances[chosen_idx].id,
0.0,
candidates,
reason,
)
}
}
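The fan-out schedule above widens the home set by one instance per doubling of heat past the threshold. A standalone check of that schedule, assuming the shipped `hot_threshold` of 4 and ignoring the `max_fan_out` / cluster-size caps:

```rust
/// Mirror of `fan_out` above with hot_threshold = 4 and no caps.
fn homes(heat: u16) -> usize {
    if heat < 4 {
        2
    } else {
        2 + ((heat / 4).max(1) as u32).ilog2() as usize
    }
}

fn main() {
    assert_eq!(homes(3), 2);  // below threshold: minimal pair
    assert_eq!(homes(4), 2);  // 1x threshold: ilog2(1) = 0
    assert_eq!(homes(8), 3);  // 2x threshold: ilog2(2) = 1
    assert_eq!(homes(32), 5); // 8x threshold: ilog2(8) = 3
}
```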


@@ -0,0 +1,217 @@
//! Cache-affinity routing tuned for coding-agent workloads.
//!
//! Motivation — the coding trace has three dominant patterns:
//!
//! 1. **Short system-prompt-only requests** (≤10 blocks): novel per-chat but
//! sharing a small set of system prompts across millions of invocations.
//! 2. **Long multi-turn chains**: parent→child prefixes share ~60+ blocks
//! and grow by ~6 blocks per turn. Sticking the chain to one instance
//! maximises L0 hits for every subsequent turn.
//! 3. **Completely novel one-shots**: no existing cache anywhere; should be
//! placed to maximise *future* reuse, not just minimise current load.
//!
//! `cache_score` minimises `α·queue_len + β·miss_blocks`. With the shipping
//! defaults (α=1, β=0.1) a single extra queue position is worth ten extra
//! miss blocks, so short novel requests — the bulk of traffic — reduce to
//! pure least-loaded routing and scatter the same system prompt across
//! dozens of instances. Each scattered copy burns HBM that could have held a
//! different hot prefix, depressing the cluster-wide L0 hit-rate.
//!
//! `cache_affinity` fixes this with two changes:
//!
//! * **Strong cache weight** — cost is `α·queue_len − γ·l0_hit`, with
//!   γ ≫ α, so any real L0 hit beats load-balancing. A soft
//! bonus (`δ·meta_only_hit`) still rewards instances that have the prefix
//! in L1/DRAM even when L0 is empty.
//!
//! * **Deterministic rendezvous tiebreak** — among instances that tie on
//! `(cost, hit, queue)`, we rank by `rendezvous(fingerprint, instance_id)`
//! where `fingerprint` is an FNV hash of the first few block hashes. This
//! turns cold routing from "first-found" (which piles on instance 0 until
//! it fills, then spills sequentially) into a consistent hash that maps
//! each distinct prefix to the *same* small set of homes. Repeat traffic
//! for that prefix therefore concentrates on its home, building a strong
//! L0 working set.
//!
//! Overload protection: if the rendezvous-chosen home already has
//! `queue_len > overload_threshold`, the load term dominates and the router
//! naturally spills to the next-best instance.
use crate::cluster::meta_store::MetaStore;
use crate::instance::Instance;
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct CacheAffinityRouter {
/// Router display / trace name.
name: &'static str,
/// Weight on queue length (per queued request).
load_alpha: f64,
/// Reward per L0-hit block (real, locally cached).
l0_gamma: f64,
/// Reward per block present via meta-store but not in L0 (L1 / remote).
meta_delta: f64,
/// Number of leading block hashes folded into the prefix fingerprint.
fingerprint_k: usize,
/// Whether to break ties by rendezvous hash (sticky consistent placement)
/// or by first-found order (matches cache_score behaviour).
use_rendezvous: bool,
}
impl CacheAffinityRouter {
pub fn new(load_alpha: f64, fingerprint_k: usize) -> Self {
Self {
name: "cache_affinity",
load_alpha,
l0_gamma: 1.0,
meta_delta: 0.25,
fingerprint_k: fingerprint_k.max(1),
use_rendezvous: true,
}
}
/// Ablation: cache_score-style weights (γ=0.1, δ=0) but keep rendezvous
/// tiebreak. Isolates the contribution of deterministic sticky placement.
pub fn weak_with_rendezvous(load_alpha: f64, fingerprint_k: usize) -> Self {
Self {
name: "cache_affinity_weak_rend",
load_alpha,
l0_gamma: 0.1,
meta_delta: 0.0,
fingerprint_k: fingerprint_k.max(1),
use_rendezvous: true,
}
}
/// Ablation: strong cache weights (γ=1.0, δ=0.25) but first-found tiebreak
/// instead of rendezvous. Isolates the contribution of reweighting alone.
pub fn strong_no_rendezvous(load_alpha: f64, fingerprint_k: usize) -> Self {
Self {
name: "cache_affinity_strong_only",
load_alpha,
l0_gamma: 1.0,
meta_delta: 0.25,
fingerprint_k: fingerprint_k.max(1),
use_rendezvous: false,
}
}
/// FNV-1a over the first `k` block hashes — identifies the prefix family
/// (system-prompt + early agent context) that drives cache reuse.
fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
let take = hash_ids.len().min(k.max(1));
let mut fp: u64 = 0xcbf29ce484222325;
for &h in &hash_ids[..take] {
fp ^= h;
fp = fp.wrapping_mul(0x100000001b3);
}
// Empty requests hash zero blocks and keep the FNV offset basis as a
// deterministic fingerprint.
fp
}
/// Splitmix64-style rendezvous score for (fingerprint, instance_id).
/// Uniform over u64; higher = preferred home.
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
h = h.wrapping_add(0x9e3779b97f4a7c15);
h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
h ^ (h >> 31)
}
}
impl Router for CacheAffinityRouter {
fn name(&self) -> &'static str {
self.name
}
fn route(
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
) -> RouteDecision {
let n = instances.len();
let l0 = local_l0_scores(req, instances);
// Meta-store predicted prefix — includes L1/remote-reachable blocks.
let meta_scores = meta.score_prefix(&req.hash_ids, now, n);
let fp = Self::fingerprint(&req.hash_ids, self.fingerprint_k);
let mut candidates = Vec::with_capacity(n);
let mut best_idx: usize = 0;
let mut best_cost = f64::INFINITY;
let mut best_hit = 0u32;
let mut best_queue = u32::MAX;
let mut best_rend: u64 = 0;
for (i, inst) in instances.iter().enumerate() {
let hit = l0[i];
// meta_only = extra blocks reachable by RDMA/L1 beyond L0 hit.
let meta_only = meta_scores[i].saturating_sub(hit);
let q = inst.queue_len();
// Cost to minimise — lower is better.
// load term: α · queue_len
// cache term: −γ · l0_hit − δ · meta_only
// Short novel prefixes yield hit=0 on every instance, so cost
// reduces to α·q and the rendezvous tiebreak picks the home.
let cost = self.load_alpha * q as f64
- self.l0_gamma * hit as f64
- self.meta_delta * meta_only as f64;
let rend = Self::rendezvous(fp, inst.id);
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: hit,
load_blocks: inst.kv_blocks_used,
queue_len: q,
});
// Tiebreak chain (descending preference):
// 1. lowest cost
// 2. highest hit (break cost ties toward real L0 work)
// 3. lowest queue
// 4. highest rendezvous (deterministic sticky home), optional
let better = if cost < best_cost {
true
} else if cost > best_cost {
false
} else if hit > best_hit {
true
} else if hit < best_hit {
false
} else if q < best_queue {
true
} else if q > best_queue {
false
} else if self.use_rendezvous {
rend > best_rend
} else {
// First-found wins on full tie (matches cache_score behaviour).
false
};
if better {
best_cost = cost;
best_hit = hit;
best_queue = q;
best_rend = rend;
best_idx = i;
}
}
crate::router::local_route_decision(
req.req_id,
"cache_affinity",
instances[best_idx].id,
0.0,
candidates,
"argmin(α·q γ·l0_hit δ·meta_only) + rendezvous tiebreak",
)
}
}
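To see why the rendezvous tiebreak produces sticky homes, note that ranking every instance by `rendezvous(fp, id)` gives each fingerprint its own stable preference order. A self-contained sketch (the hash is the same as the router's; `top_homes` is illustrative):

```rust
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
    let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
    h = h.wrapping_add(0x9e3779b97f4a7c15);
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// Top-k preferred homes for a prefix fingerprint over `n` instances.
fn top_homes(fp: u64, n: u32, k: usize) -> Vec<u32> {
    let mut ids: Vec<u32> = (0..n).collect();
    ids.sort_unstable_by_key(|&id| std::cmp::Reverse(rendezvous(fp, id)));
    ids.truncate(k);
    ids
}
```

The same fingerprint always yields the same ordering, and removing an instance only displaces the prefixes whose top pick was that instance, which is the standard rendezvous-hashing property.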


@@ -13,7 +13,7 @@
use crate::cluster::meta_store::MetaStore;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct CacheLoadRouter;
@@ -39,11 +39,11 @@ impl Router for CacheLoadRouter {
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
_meta: &MetaStore,
_now: f64,
) -> RouteDecision {
let n = instances.len();
let scores = meta.score_prefix(&req.hash_ids, now, n);
let scores = local_l0_scores(req, instances);
// Step 1: least-loaded 1/4 of instances (by queue_len).
let pool_size = (n / 4).max(2).min(n);
@@ -77,13 +77,13 @@ impl Router for CacheLoadRouter {
});
}
RouteDecision {
req_id: req.req_id,
mode: "cache_load",
chosen: instances[best_idx].id,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"cache_load",
instances[best_idx].id,
0.0,
candidates,
reason: "least-loaded 1/4, then best prefix",
}
"least-loaded 1/4, then best local L0 prefix",
)
}
}


@@ -32,7 +32,7 @@
use crate::cluster::meta_store::MetaStore;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct CacheScoreRouter {
@@ -55,11 +55,11 @@ impl Router for CacheScoreRouter {
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
_meta: &MetaStore,
_now: f64,
) -> RouteDecision {
let n = instances.len();
let scores = meta.score_prefix(&req.hash_ids, now, n);
let scores = local_l0_scores(req, instances);
let input_blocks = req.hash_ids.len() as f64;
let mut best_idx: usize = 0;
@@ -99,13 +99,13 @@ impl Router for CacheScoreRouter {
}
}
RouteDecision {
req_id: req.req_id,
mode: "cache_score",
chosen: instances[best_idx].id,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"cache_score",
instances[best_idx].id,
0.0,
candidates,
reason: "argmin 2^(α·load + β·miss)",
}
"argmin 2^(α·load + β·miss)",
)
}
}


@@ -0,0 +1,86 @@
//! Cache-score routing using TTL meta-store prefix predictions.
//!
//! This keeps the same scoring rule as `cache_score`:
//!
//! ```text
//! exponent_i = alpha * queue_len_i + beta * miss_i
//! ```
//!
//! The only difference is that `miss_i` is computed from the global TTL
//! meta-store prefix score instead of the real local-L0 prefix.
use crate::cluster::meta_store::MetaStore;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct CacheScoreTtlRouter {
alpha: f64,
beta: f64,
}
impl CacheScoreTtlRouter {
pub fn new(alpha: f64, beta: f64) -> Self {
Self { alpha, beta }
}
}
impl Router for CacheScoreTtlRouter {
fn name(&self) -> &'static str {
"cache_score_ttl"
}
fn route(
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
) -> RouteDecision {
let n = instances.len();
let scores = meta.score_prefix(&req.hash_ids, now, n);
let input_blocks = req.hash_ids.len() as f64;
let mut best_idx: usize = 0;
let mut best_exp = f64::INFINITY;
let mut best_queue = u32::MAX;
let mut best_prefix = 0u32;
let mut candidates = Vec::with_capacity(n);
for (i, inst) in instances.iter().enumerate() {
let prefix = scores[i] as f64;
let miss = (input_blocks - prefix).max(0.0);
let q = inst.queue_len() as f64;
let exponent = self.alpha * q + self.beta * miss;
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: scores[i],
load_blocks: inst.kv_blocks_used,
queue_len: inst.queue_len(),
});
let better = exponent < best_exp
|| (exponent == best_exp && inst.queue_len() < best_queue)
|| (exponent == best_exp
&& inst.queue_len() == best_queue
&& scores[i] > best_prefix);
if better {
best_exp = exponent;
best_idx = i;
best_queue = inst.queue_len();
best_prefix = scores[i];
}
}
crate::router::local_route_decision(
req.req_id,
"cache_score_ttl",
instances[best_idx].id,
0.0,
candidates,
"argmin 2^(alpha*load + beta*meta_store_miss)",
)
}
}
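A worked example of the scoring rule, taking α = 1 and β = 0.1 (the values used in this repo's test config; actual values come from `router.score_alpha` / `score_beta`):

```rust
let (alpha, beta) = (1.0_f64, 0.1_f64);
let exponent = |queue_len: f64, miss_blocks: f64| alpha * queue_len + beta * miss_blocks;
// Idle-but-cold instance: 0 queued, 40 predicted miss blocks -> 4.0.
// Busy-but-warm instance: 3 queued, full predicted prefix    -> 3.0.
assert!(exponent(3.0, 0.0) < exponent(0.0, 40.0)); // the warm instance wins
```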


@@ -1,57 +1,31 @@
//! First-principles TTFT-optimal routing.
//! First-principles TTFT-estimate routing with calibrated compute and
//! tier-aware KV prepare costs.
//!
//! Estimates the actual time-to-first-token for each candidate instance:
//!
//! `TTFT(r,i) = drain(i) + fetch(r,i) + prefill(miss)`
//!
//! - **drain** — exact queue drain time: sum of per-request `prefill_time()`
//! using the architecture-aware compute model (quadratic / DSA).
//!
//! - **fetch** — RDMA fetch time for blocks cached elsewhere in the cluster
//! but not on instance `i` locally.
//!
//! - **prefill** — compute for cluster-wide cache-miss tokens (constant
//! across instances, cancels in the argmin).
//!
//! The router minimises `drain(i) + fetch(r,i)`, with ties broken by
//! lowest `queue_len` then most local cache. The fetch overlap with queue
//! drain is handled by keeping the additive form: this gives double
//! incentive to prefer instances with local cache, which empirically
//! outperforms the `max(drain, fetch)` alternative because even small
//! RDMA savings compound across thousands of routing decisions.
//! `TTFT(r,i) = drain(i) + scheduler + kv_prepare(r,i) + prefill(miss_i) + first_token_tail`
use crate::cluster::meta_store::MetaStore;
use crate::config::Config;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
use crate::ttft::{classify_prefix_tiers, TtftModel};
pub struct EstimatedTtftRouter {
/// Bytes per KV block (for RDMA cost estimation).
kv_block_bytes: f64,
/// RDMA bandwidth in bytes/s.
rdma_bw: f64,
/// RDMA per-transfer latency in seconds.
rdma_latency_s: f64,
ttft_model: TtftModel,
}
impl EstimatedTtftRouter {
pub fn new(config: &Config) -> Self {
Self {
kv_block_bytes: config.model.kv_block_bytes() as f64,
rdma_bw: config.hardware.rdma_bw,
rdma_latency_s: config.hardware.rdma_latency_us * 1e-6,
ttft_model: TtftModel::new(
&config.hardware,
&config.calibration,
config.model.kv_block_bytes(),
),
}
}
/// Estimate RDMA fetch time for `remote_blocks` blocks.
fn fetch_time(&self, remote_blocks: u32) -> f64 {
if remote_blocks == 0 {
return 0.0;
}
let bytes = remote_blocks as f64 * self.kv_block_bytes;
bytes / self.rdma_bw + self.rdma_latency_s
}
}
impl Router for EstimatedTtftRouter {
@@ -66,63 +40,62 @@ impl Router for EstimatedTtftRouter {
meta: &MetaStore,
now: f64,
) -> RouteDecision {
let scheduler = self.ttft_model.scheduler_overhead_s(instances.len(), 3);
let n = instances.len();
let scores = meta.score_prefix(&req.hash_ids, now, n);
// Cluster-wide max prefix: blocks reachable via RDMA from any peer.
let cluster_prefix = scores.iter().copied().max().unwrap_or(0);
let mut best: u32 = 0;
let mut best_cost = f64::INFINITY;
let mut best_queue = u32::MAX;
let mut best_local = 0u32;
let mut best_reuse = 0u32;
let mut candidates = Vec::with_capacity(n);
for inst in instances {
let i = inst.id as usize;
let local_prefix = scores[i];
let residency = classify_prefix_tiers(&req.hash_ids, inst, meta, now);
// 1. Exact queue drain time (architecture-aware, per-request sum).
let drain = inst.estimated_drain_time();
// 2. RDMA fetch cost for blocks not locally cached.
let remote_blocks = cluster_prefix.saturating_sub(local_prefix);
let fetch = self.fetch_time(remote_blocks);
// Additive cost: drain + fetch.
// The additive form gives explicit incentive to prefer local cache
// (lower fetch) even when the queue is non-empty, which reduces
// total RDMA traffic and improves TTFT in aggregate.
let cost = drain + fetch;
let miss_tokens = residency.miss_blocks.saturating_mul(inst.block_size_tokens);
let kv_prepare = self.ttft_model.kv_prepare_time_s(residency);
let first_token_tail = self.ttft_model.first_token_tail_s();
let cost = drain
+ scheduler
+ kv_prepare
+ inst.compute.prefill_time(miss_tokens)
+ first_token_tail;
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: local_prefix,
predicted_prefix: residency.l0_hit_blocks
+ residency.l1_hit_blocks
+ residency.remote_hit_blocks,
load_blocks: inst.kv_blocks_used,
queue_len: inst.queue_len(),
});
// Minimise (cost, queue_len, -reusable_prefix).
let ql = inst.queue_len();
let reusable =
residency.l0_hit_blocks + residency.l1_hit_blocks + residency.remote_hit_blocks;
let better = cost < best_cost
|| (cost == best_cost && ql < best_queue)
|| (cost == best_cost && ql == best_queue && local_prefix > best_local);
|| (cost == best_cost && ql == best_queue && reusable > best_reuse);
if better {
best_cost = cost;
best = inst.id;
best_queue = ql;
best_local = local_prefix;
best_reuse = reusable;
}
}
RouteDecision {
req_id: req.req_id,
mode: "estimated_ttft",
chosen: best,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"estimated_ttft",
best,
0.0,
candidates,
reason: "argmin(drain_time + fetch_time)",
}
"argmin(drain + scheduler + kv_prepare + prefill + first_token_tail)",
)
}
}
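
For intuition, a minimal standalone sketch of the RDMA fetch-cost helper above (a bandwidth term plus one fixed per-transfer latency); the block size and link numbers below are illustrative, not calibrated values:

```rust
/// Standalone copy of the fetch-cost arithmetic: zero when nothing is
/// remote, otherwise bytes / bandwidth plus a fixed setup latency.
fn fetch_time(remote_blocks: u32, kv_block_bytes: f64, rdma_bw: f64, rdma_latency_s: f64) -> f64 {
    if remote_blocks == 0 {
        return 0.0;
    }
    remote_blocks as f64 * kv_block_bytes / rdma_bw + rdma_latency_s
}

fn main() {
    // 32 blocks of 256 KiB over a 12 GB/s link with 5 us setup latency.
    let t = fetch_time(32, 256.0 * 1024.0, 12.0e9, 5.0e-6);
    println!("fetch ~= {:.0} us", t * 1e6); // ~704 us, bandwidth-dominated
}
```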

src/router/global_bucket.rs (new file, +371 lines)

@@ -0,0 +1,371 @@
use anyhow::{anyhow, Result};
use serde::Serialize;
use crate::config::{Config, GlobalRouterMode};
use crate::trace::RequestRecord;
pub type BucketId = u32;
#[derive(Debug, Clone, Serialize)]
pub struct BucketView {
pub id: BucketId,
pub input_length_min: u32,
pub input_length_max: u32,
pub num_instances: u32,
pub total_queue_len: u32,
pub total_load_blocks: u32,
pub predicted_prefix: u32,
}
#[derive(Debug, Clone, Serialize)]
pub struct BucketCandidate {
pub bucket: BucketId,
pub input_length_min: u32,
pub input_length_max: u32,
pub num_instances: u32,
pub total_queue_len: u32,
pub total_load_blocks: u32,
pub predicted_prefix: u32,
pub matches_input_len: bool,
pub score: f64,
}
#[derive(Debug, Clone, Serialize)]
pub struct GlobalRouteDecision {
pub req_id: u64,
pub mode: &'static str,
pub chosen_bucket: BucketId,
pub candidates: Vec<BucketCandidate>,
pub reason: &'static str,
}
impl GlobalRouteDecision {
pub fn single_bucket(req_id: u64, chosen_bucket: BucketId) -> Self {
Self {
req_id,
mode: "single_pool",
chosen_bucket,
candidates: Vec::new(),
reason: "single pool uses bucket 0",
}
}
}
pub trait GlobalRouter: Send {
fn name(&self) -> &'static str;
fn route(
&mut self,
req: &RequestRecord,
buckets: &[BucketView],
now: f64,
) -> Result<GlobalRouteDecision>;
}
struct StrictInputLengthRouter {
reported_mode: &'static str,
reason: &'static str,
}
impl StrictInputLengthRouter {
fn new(reported_mode: &'static str, reason: &'static str) -> Self {
Self {
reported_mode,
reason,
}
}
}
impl GlobalRouter for StrictInputLengthRouter {
fn name(&self) -> &'static str {
self.reported_mode
}
fn route(
&mut self,
req: &RequestRecord,
buckets: &[BucketView],
_now: f64,
) -> Result<GlobalRouteDecision> {
let candidates = buckets
.iter()
.map(|view| BucketCandidate {
bucket: view.id,
input_length_min: view.input_length_min,
input_length_max: view.input_length_max,
num_instances: view.num_instances,
total_queue_len: view.total_queue_len,
total_load_blocks: view.total_load_blocks,
predicted_prefix: view.predicted_prefix,
matches_input_len: view.input_length_min <= req.input_len
&& req.input_len <= view.input_length_max,
score: if view.input_length_min <= req.input_len
&& req.input_len <= view.input_length_max
{
0.0
} else {
f64::INFINITY
},
})
.collect::<Vec<_>>();
let matches = candidates
.iter()
.filter(|candidate| candidate.matches_input_len)
.map(|candidate| candidate.bucket)
.collect::<Vec<_>>();
let chosen_bucket = match matches.as_slice() {
[bucket] => *bucket,
[] => {
return Err(anyhow!(
"cluster.global_router.mode={} has no bucket for input_length={}",
self.reported_mode,
req.input_len
));
}
_ => {
return Err(anyhow!(
"cluster.global_router.mode={} matched multiple buckets for input_length={}",
self.reported_mode,
req.input_len
));
}
};
Ok(GlobalRouteDecision {
req_id: req.req_id,
mode: self.reported_mode,
chosen_bucket,
candidates,
reason: self.reason,
})
}
}
struct BucketScoreRouter {
length_penalty_weight: f64,
load_weight: f64,
cache_weight: f64,
}
impl BucketScoreRouter {
fn new(full: &Config) -> Self {
Self {
length_penalty_weight: full.cluster.global_router.length_penalty_weight,
load_weight: full.cluster.global_router.load_weight,
cache_weight: full.cluster.global_router.cache_weight,
}
}
fn length_penalty(&self, req: &RequestRecord, bucket: &BucketView) -> f64 {
if req.input_len < bucket.input_length_min {
(bucket.input_length_min - req.input_len) as f64
} else if req.input_len > bucket.input_length_max {
(req.input_len - bucket.input_length_max) as f64
} else {
0.0
}
}
}
impl GlobalRouter for BucketScoreRouter {
fn name(&self) -> &'static str {
"bucket_score"
}
fn route(
&mut self,
req: &RequestRecord,
buckets: &[BucketView],
_now: f64,
) -> Result<GlobalRouteDecision> {
let mut chosen_bucket = None;
let mut best_score = f64::INFINITY;
let mut candidates = Vec::with_capacity(buckets.len());
for bucket in buckets {
let length_penalty = self.length_penalty(req, bucket);
let miss = req
.hash_ids
.len()
.saturating_sub(bucket.predicted_prefix as usize) as f64;
let score = self.length_penalty_weight * length_penalty
+ self.load_weight * bucket.total_queue_len as f64
+ self.cache_weight * miss;
candidates.push(BucketCandidate {
bucket: bucket.id,
input_length_min: bucket.input_length_min,
input_length_max: bucket.input_length_max,
num_instances: bucket.num_instances,
total_queue_len: bucket.total_queue_len,
total_load_blocks: bucket.total_load_blocks,
predicted_prefix: bucket.predicted_prefix,
matches_input_len: bucket.input_length_min <= req.input_len
&& req.input_len <= bucket.input_length_max,
score,
});
let better = score < best_score
|| (score == best_score && chosen_bucket.is_none_or(|best| bucket.id < best));
if better {
best_score = score;
chosen_bucket = Some(bucket.id);
}
}
Ok(GlobalRouteDecision {
req_id: req.req_id,
mode: self.name(),
chosen_bucket: chosen_bucket.ok_or_else(|| anyhow!("no buckets available"))?,
candidates,
reason: "weighted length/load/cache bucket score",
})
}
}
pub fn build(full: &Config) -> Box<dyn GlobalRouter> {
match full.cluster.global_router.mode {
GlobalRouterMode::StrictInputLength => Box::new(StrictInputLengthRouter::new(
"strict_input_length",
"unique bucket range contains input_length",
)) as Box<dyn GlobalRouter>,
GlobalRouterMode::BucketScore => {
Box::new(BucketScoreRouter::new(full)) as Box<dyn GlobalRouter>
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::config::{
ClusterConfig, GlobalRouterConfig, MetaStoreConfig, RouterConfig, RouterMode,
};
fn cfg() -> Config {
Config {
model: crate::config::ModelConfig::default(),
hardware: crate::config::HardwareConfig {
gpu_flops: 1.0,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1.0,
hbm_bytes: 1.0,
dram_bytes: 1.0,
host_dram_bw: 1.0,
pcie_bw: 1.0,
pcie_latency_us: 1.0,
rdma_bw: 1.0,
rdma_latency_us: 1.0,
intra_node_tp_bw: 1.0,
intra_node_tp_latency_us: 1.0,
tp_degree: 1,
max_batch_slots: 1,
prefill_chunk_tokens: 1,
},
calibration: crate::config::CalibrationConfig::default(),
cluster: ClusterConfig {
num_instances: None,
buckets: Vec::new(),
global_router: GlobalRouterConfig {
mode: GlobalRouterMode::BucketScore,
length_penalty_weight: 1.0,
load_weight: 1.0,
cache_weight: 1.0,
},
meta_store: MetaStoreConfig { ttl_seconds: 1.0 },
router: RouterConfig {
mode: RouterMode::LeastLoaded,
precise_probe_latency_us: 1.0,
precise_probe_topk: 1,
load_alpha: 1.0,
score_alpha: 1.0,
score_beta: 1.0,
prefix_k: 8,
affinity_fan_out: 1,
},
},
sim: crate::config::SimConfig {
trace_path: String::new(),
max_requests: None,
output_dir: String::new(),
sample_interval_s: 0.0,
seed: 0,
input_length_min: None,
input_length_max: None,
},
}
}
fn req(input_len: u32) -> RequestRecord {
RequestRecord {
req_id: 1,
chat_id: 0,
parent_chat_id: -1,
turn: 0,
arrival: 0.0,
input_len,
output_len: 16,
hash_ids: vec![10, 11, 12],
}
}
#[test]
fn bucket_score_prefers_matching_bucket_when_load_is_equal() {
let mut router = BucketScoreRouter::new(&cfg());
let buckets = vec![
BucketView {
id: 0,
input_length_min: 0,
input_length_max: 32,
num_instances: 2,
total_queue_len: 1,
total_load_blocks: 0,
predicted_prefix: 0,
},
BucketView {
id: 1,
input_length_min: 33,
input_length_max: 96,
num_instances: 2,
total_queue_len: 1,
total_load_blocks: 0,
predicted_prefix: 0,
},
];
let decision = router.route(&req(24), &buckets, 0.0).unwrap();
assert_eq!(decision.chosen_bucket, 0);
}
#[test]
fn bucket_score_can_override_length_match_when_load_gap_is_large() {
let mut full = cfg();
full.cluster.global_router.load_weight = 5.0;
full.cluster.global_router.cache_weight = 1.0;
full.cluster.global_router.length_penalty_weight = 1.0;
let mut router = BucketScoreRouter::new(&full);
let buckets = vec![
BucketView {
id: 0,
input_length_min: 0,
input_length_max: 32,
num_instances: 2,
total_queue_len: 20,
total_load_blocks: 0,
predicted_prefix: 0,
},
BucketView {
id: 1,
input_length_min: 33,
input_length_max: 96,
num_instances: 2,
total_queue_len: 0,
total_load_blocks: 0,
predicted_prefix: 2,
},
];
let decision = router.route(&req(24), &buckets, 0.0).unwrap();
assert_eq!(decision.chosen_bucket, 1);
}
}
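
For a concrete feel of the weighted score, here is a self-contained re-derivation of the arithmetic the override test above exercises; `bucket_score` is a local sketch, not part of the crate:

```rust
/// score = w_len * length_penalty + w_load * total_queue_len + w_cache * miss_blocks
fn bucket_score(
    input_len: u32,
    req_blocks: usize,
    (min, max): (u32, u32),
    queue: u32,
    predicted_prefix: u32,
    (w_len, w_load, w_cache): (f64, f64, f64),
) -> f64 {
    let len_pen = if input_len < min {
        (min - input_len) as f64
    } else if input_len > max {
        (input_len - max) as f64
    } else {
        0.0
    };
    let miss = req_blocks.saturating_sub(predicted_prefix as usize) as f64;
    w_len * len_pen + w_load * queue as f64 + w_cache * miss
}

fn main() {
    let w = (1.0, 5.0, 1.0); // (length_penalty_weight, load_weight, cache_weight)
    // Request: input_len = 24, 3 hash blocks, as in the override test above.
    let short = bucket_score(24, 3, (0, 32), 20, 0, w); // matching range, long queue
    let long = bucket_score(24, 3, (33, 96), 0, 2, w); // mismatched range, idle
    assert!(long < short); // 9 + 0 + 1 = 10 beats 0 + 100 + 3 = 103
    println!("short={short} long={long}");
}
```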

src/router/least_loaded.rs

@@ -29,8 +29,7 @@ impl Router for LeastLoadedRouter {
let mut best_score = f64::INFINITY;
let mut candidates = Vec::with_capacity(instances.len());
for inst in instances {
let load = inst.kv_blocks_used as f64
+ self.alpha * inst.queue_len() as f64;
let load = inst.kv_blocks_used as f64 + self.alpha * inst.queue_len() as f64;
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: 0,
@@ -42,13 +41,13 @@ impl Router for LeastLoadedRouter {
best = inst.id;
}
}
RouteDecision {
req_id: req.req_id,
mode: "least_loaded",
chosen: best,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"least_loaded",
best,
0.0,
candidates,
reason: "argmin(kv_used + alpha * queue_len)",
}
"argmin(kv_used + alpha * queue_len)",
)
}
}

src/router/least_tokens.rs

@@ -61,13 +61,13 @@ impl Router for LeastTokensRouter {
}
}
RouteDecision {
req_id: req.req_id,
mode: "least_tokens",
chosen: best,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"least_tokens",
best,
0.0,
candidates,
reason: "argmin(waiting_prefill_tokens)",
}
"argmin(waiting_prefill_tokens)",
)
}
}

src/router/lineage_affinity.rs (new file)

@@ -0,0 +1,243 @@
//! Lineage-aware reuse routing for agentic coding workloads.
//!
//! Workload hypothesis:
//! - turn-1 requests are diverse but recur by prefix family and benefit from
//! deterministic home placement instead of diffusion across the cluster;
//! - child requests usually extend the immediately preceding request and
//! should stay on the parent's instance whenever that instance is not
//! clearly overloaded.
//!
//! The router therefore uses three modes:
//! - strong local cache scoring for already-warm requests;
//! - parent stickiness for continuations with a known parent placement;
//! - family homesets (rendezvous-ranked top-K) for cold / weakly-warm
//! requests, with a global fallback if the homeset is substantially worse.
use std::collections::HashMap;
use crate::cluster::meta_store::MetaStore;
use crate::config::Config;
use crate::instance::Instance;
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
#[derive(Debug, Clone, Copy, Default)]
struct FamilyStat {
seen: u16,
last_seen: f64,
}
#[derive(Debug, Clone, Copy)]
struct CandidateCost {
idx: usize,
cost: f64,
l0_hit: u32,
meta_only: u32,
queue_len: u32,
rendezvous: u64,
}
pub struct LineageAffinityRouter {
load_alpha: f64,
l0_gamma: f64,
meta_delta: f64,
fingerprint_k: usize,
warm_prefix_blocks: u32,
hot_ttl_s: f64,
max_fan_out: usize,
parent_cost_slack: f64,
homeset_cost_slack: f64,
family_stats: HashMap<u64, FamilyStat>,
request_home: HashMap<i64, u32>,
}
impl LineageAffinityRouter {
pub fn new(config: &Config) -> Self {
let n = config.cluster.total_instances() as usize;
let configured_fan_out = config.cluster.router.affinity_fan_out;
let max_fan_out = if configured_fan_out > 0 {
configured_fan_out.max(2).min(n)
} else {
(n / 8).max(8).min(n)
};
Self {
load_alpha: config.cluster.router.load_alpha.max(1.0),
l0_gamma: 1.0,
meta_delta: 0.25,
fingerprint_k: config.cluster.router.prefix_k.clamp(2, 8),
warm_prefix_blocks: 12,
hot_ttl_s: config.cluster.meta_store.ttl_seconds.max(1.0),
max_fan_out,
// Measured in the same score units as α·queue - γ·hit.
parent_cost_slack: 6.0,
homeset_cost_slack: 2.0,
family_stats: HashMap::new(),
request_home: HashMap::new(),
}
}
fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
// Guard k == 0 and an empty hash list: an empty slice hashes to the
// bare FNV offset basis instead of panicking on `[..1]`.
let take = hash_ids.len().min(k.max(1));
let mut fp: u64 = 0xcbf29ce484222325;
for &h in &hash_ids[..take] {
fp ^= h;
fp = fp.wrapping_mul(0x100000001b3);
}
fp
}
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
h = h.wrapping_add(0x9e3779b97f4a7c15);
h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
h ^ (h >> 31)
}
fn active_heat(&self, fp: u64, now: f64) -> u16 {
self.family_stats
.get(&fp)
.filter(|stat| now - stat.last_seen <= self.hot_ttl_s)
.map(|stat| stat.seen)
.unwrap_or(0)
}
fn observe(&mut self, fp: u64, now: f64) {
let stat = self.family_stats.entry(fp).or_default();
if now - stat.last_seen > self.hot_ttl_s {
stat.seen = 0;
}
stat.last_seen = now;
stat.seen = stat.seen.saturating_add(1);
}
fn fan_out(&self, heat: u16, n: usize) -> usize {
let base = 2usize;
let extra = match heat {
0..=1 => 0,
2..=3 => 1,
4..=7 => 2,
8..=15 => 3,
_ => 4,
};
(base + extra).min(self.max_fan_out).min(n).max(2)
}
fn better(a: CandidateCost, b: CandidateCost) -> bool {
a.cost < b.cost
|| (a.cost == b.cost && a.l0_hit > b.l0_hit)
|| (a.cost == b.cost && a.l0_hit == b.l0_hit && a.queue_len < b.queue_len)
|| (a.cost == b.cost
&& a.l0_hit == b.l0_hit
&& a.queue_len == b.queue_len
&& a.meta_only > b.meta_only)
}
}
impl Router for LineageAffinityRouter {
fn name(&self) -> &'static str {
"lineage_affinity"
}
fn route(
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
) -> RouteDecision {
let n = instances.len();
let l0 = local_l0_scores(req, instances);
let meta_scores = meta.score_prefix(&req.hash_ids, now, n);
let family_fp = Self::fingerprint(&req.hash_ids, self.fingerprint_k);
let family_heat = self.active_heat(family_fp, now).saturating_add(1);
let parent_home = if req.parent_chat_id >= 0 {
self.request_home.get(&req.parent_chat_id).copied()
} else {
None
};
let mut candidates = Vec::with_capacity(n);
let mut scored = Vec::with_capacity(n);
let mut best_local_l0 = 0u32;
for (idx, inst) in instances.iter().enumerate() {
let l0_hit = l0[idx];
best_local_l0 = best_local_l0.max(l0_hit);
let meta_only = meta_scores[idx].saturating_sub(l0_hit);
let queue_len = inst.queue_len();
let cost = self.load_alpha * queue_len as f64
- self.l0_gamma * l0_hit as f64
- self.meta_delta * meta_only as f64;
let rend = Self::rendezvous(family_fp, inst.id);
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: l0_hit,
load_blocks: inst.kv_blocks_used,
queue_len,
});
scored.push(CandidateCost {
idx,
cost,
l0_hit,
meta_only,
queue_len,
rendezvous: rend,
});
}
let mut global_best = scored[0];
for cand in scored.iter().copied().skip(1) {
if Self::better(cand, global_best) {
global_best = cand;
}
}
let mut chosen = global_best;
let reason = if let Some(parent_inst) = parent_home {
let parent = scored[parent_inst as usize];
if parent.cost <= global_best.cost + self.parent_cost_slack {
chosen = parent;
"lineage affinity: parent stickiness"
} else {
"lineage affinity: parent overloaded, global best"
}
} else if best_local_l0 < self.warm_prefix_blocks {
let fan_out = self.fan_out(family_heat, n);
scored.sort_unstable_by(|a, b| b.rendezvous.cmp(&a.rendezvous));
let mut home_best = scored[0];
for cand in scored.iter().copied().take(fan_out).skip(1) {
let better = Self::better(cand, home_best)
|| (cand.cost == home_best.cost
&& cand.l0_hit == home_best.l0_hit
&& cand.queue_len == home_best.queue_len
&& cand.meta_only == home_best.meta_only
&& cand.rendezvous > home_best.rendezvous);
if better {
home_best = cand;
}
}
if home_best.cost <= global_best.cost + self.homeset_cost_slack {
chosen = home_best;
"lineage affinity: family homeset"
} else {
"lineage affinity: homeset overloaded, global best"
}
} else {
"lineage affinity: warm request global locality"
};
self.observe(family_fp, now);
self.request_home
.insert(req.chat_id, instances[chosen.idx].id);
crate::router::local_route_decision(
req.req_id,
"lineage_affinity",
instances[chosen.idx].id,
0.0,
candidates,
reason,
)
}
}
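
The homeset above is plain rendezvous (highest-random-weight) hashing: each instance gets a deterministic pseudo-random weight for the family fingerprint, and the top-K weights form the home set, so resizing the cluster only remaps a proportional share of families. A minimal standalone sketch of that selection, reusing the same mixing constants:

```rust
// Same avalanche mix as LineageAffinityRouter::rendezvous above.
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
    let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
    h = h.wrapping_add(0x9e3779b97f4a7c15);
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// Top-K instances by rendezvous weight: the "family homeset".
fn homeset(fp: u64, n: u32, k: usize) -> Vec<u32> {
    let mut ids: Vec<u32> = (0..n).collect();
    ids.sort_unstable_by_key(|&id| std::cmp::Reverse(rendezvous(fp, id)));
    ids.truncate(k);
    ids
}

fn main() {
    let fp = 0xcbf29ce484222325u64; // any family fingerprint
    // Growing the cluster disturbs only families whose top-K weights the
    // new instance beats; most homesets are unchanged.
    println!("n=8 -> {:?}", homeset(fp, 8, 3));
    println!("n=9 -> {:?}", homeset(fp, 9, 3));
}
```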

src/router/min_pd.rs

@@ -1,4 +1,4 @@
//! Minimum P*D routing.
//! Minimum P*D routing using real local L0 hits only.
//!
//! For each instance compute:
//! - `P` = real prefill tokens this request will do if routed there
@@ -7,30 +7,18 @@
//!
//! Score = `P * D`, pick the instance that minimizes it.
//!
//! `P` accounts for the **actual** prefill work after the cluster fetch
//! chain runs: the fetch chain serves any block cached anywhere in the
//! cluster (L0 → L1 → remote v6d via RDMA), so prefill compute only runs
//! for blocks that are absent cluster-wide *and* for blocks past the
//! instance-local prefix (the cluster only fetches a contiguous leading
//! prefix — any gap ends the fetch chain and the rest must be prefilled).
//! `P` accounts only for blocks that miss in the candidate instance's
//! current L0 cache. L1 / remote reuse may still reduce execution-time
//! work later in the cluster fetch chain, but they do not count as
//! `kvcache hit` for routing.
//!
//! Concretely, for instance `i`:
//!
//! ```text
//! local_prefix_i = meta_store.score_prefix(req, now)[i] // blocks
//! cluster_prefix = max over all j of meta_store_score[j] // blocks
//! effective_prefix_i = min(cluster_prefix, input_blocks)
//! - if local_prefix_i == cluster_prefix the fetch chain stays local,
//! - otherwise the prefill still skips cluster_prefix blocks because
//! the missing tail is fetched via RDMA from a peer.
//! P_i = (input_blocks - effective_prefix_i) * block_size_tokens
//! local_prefix_i = longest L0 prefix on instance i // blocks
//! P_i = (input_blocks - local_prefix_i) * block_size_tokens
//! ```
//!
//! This makes `P` nearly instance-independent on well-populated clusters
//! (so `min_pd` degenerates to balanced load with a cache-affinity
//! tiebreak), which is exactly what you want when RDMA is cheap relative
//! to prefill compute.
//!
//! Tiebreaks (essential on 128-instance clusters where many instances are
//! idle and the raw product collapses to zero):
//! 1. minimum `P*D`
@@ -40,7 +28,7 @@
use crate::cluster::meta_store::MetaStore;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct MinPdRouter;
@@ -66,36 +54,26 @@ impl Router for MinPdRouter {
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
_meta: &MetaStore,
_now: f64,
) -> RouteDecision {
let n = instances.len();
let scores = meta.score_prefix(&req.hash_ids, now, n);
let scores = local_l0_scores(req, instances);
let block_size = instances[0].block_size_tokens as u64;
let input_blocks = req.hash_ids.len() as u64;
// Cluster-wide max prefix: longest contiguous prefix that EXISTS
// somewhere in the cluster (and will be fetched via remote RDMA if
// not local). This determines the effective prefill work for every
// candidate, not just the one that owns the blocks.
let cluster_prefix_blocks = scores.iter().copied().max().unwrap_or(0) as u64;
let effective_prefix_blocks = cluster_prefix_blocks.min(input_blocks);
let miss_blocks = input_blocks.saturating_sub(effective_prefix_blocks);
let p_base = miss_blocks.saturating_mul(block_size); // tokens to prefill
let mut candidates = Vec::with_capacity(n);
let mut best: u32 = instances[0].id;
// Minimize (P*D, D, -local_prefix).
// P is nearly instance-independent; D is the real discriminator.
// When tied on D, prefer the instance with the best local prefix
// (avoids the RDMA fetch cost).
let mut best_key: (u128, u64, i64) = (u128::MAX, u64::MAX, i64::MAX);
for inst in instances {
let i = inst.id as usize;
let d = inst.queue_len() as u64;
let pd = p_base as u128 * d as u128;
let local_prefix = scores[i] as i64;
let miss_blocks = input_blocks.saturating_sub(scores[i] as u64);
let p = miss_blocks.saturating_mul(block_size);
let pd = p as u128 * d as u128;
candidates.push(CandidateInfo {
instance: inst.id,
@@ -112,13 +90,13 @@ impl Router for MinPdRouter {
}
}
RouteDecision {
req_id: req.req_id,
mode: "min_pd",
chosen: best,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"min_pd",
best,
0.0,
candidates,
reason: "argmin(P*D), P=cluster-wide miss tokens, D=ongoing reqs",
}
"argmin(P*D), P=local-L0 miss tokens, D=ongoing reqs",
)
}
}
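
A worked instance of the `P*D` score from the doc comment above, as a self-contained sketch with illustrative numbers:

```rust
/// P = tokens still to prefill after the local L0 prefix, D = ongoing requests.
fn pd_score(input_blocks: u64, local_prefix: u64, block_tokens: u64, queue_len: u64) -> u128 {
    let p = input_blocks.saturating_sub(local_prefix) * block_tokens;
    p as u128 * queue_len as u128
}

fn main() {
    // 8-block request, 16 tokens per block.
    // Instance A holds a 6-block L0 prefix but has 3 requests in flight;
    // instance B is cache-cold with 1 request in flight.
    let a = pd_score(8, 6, 16, 3); // (2 * 16) * 3 = 96
    let b = pd_score(8, 0, 16, 1); // (8 * 16) * 1 = 128
    assert!(a < b); // the warm-but-busier instance still wins
    println!("A={a} B={b}");
}
```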

src/router/mod.rs

@@ -1,10 +1,15 @@
//! Cluster-level routing strategies.
pub mod adaptive_affinity;
pub mod cache_affinity;
pub mod cache_load;
pub mod cache_score;
pub mod cache_score_ttl;
pub mod estimated_ttft;
pub mod global_bucket;
pub mod least_loaded;
pub mod least_tokens;
pub mod lineage_affinity;
pub mod min_pd;
pub mod precise_aware;
pub mod prefix_affinity;
@@ -19,6 +24,8 @@ use crate::instance::Instance;
use crate::trace::RequestRecord;
use crate::types::InstanceId;
pub use global_bucket::{BucketCandidate, BucketId, BucketView, GlobalRouteDecision, GlobalRouter};
#[derive(Debug, Clone, Serialize)]
pub struct CandidateInfo {
pub instance: InstanceId,
@@ -30,11 +37,25 @@ pub struct CandidateInfo {
#[derive(Debug, Clone, Serialize)]
pub struct RouteDecision {
pub req_id: u64,
pub global_mode: &'static str,
pub mode: &'static str,
pub global_reason: &'static str,
pub local_reason: &'static str,
pub chosen_bucket: BucketId,
pub chosen: InstanceId,
pub probe_overhead_s: f64,
pub bucket_candidates: Vec<BucketCandidate>,
pub candidates: Vec<CandidateInfo>,
pub reason: &'static str,
}
impl RouteDecision {
pub fn with_global(mut self, decision: &GlobalRouteDecision) -> Self {
self.global_mode = decision.mode;
self.global_reason = decision.reason;
self.chosen_bucket = decision.chosen_bucket;
self.bucket_candidates = decision.candidates.clone();
self
}
}
pub trait Router: Send {
@@ -48,6 +69,39 @@ pub trait Router: Send {
) -> RouteDecision;
}
pub(crate) fn local_l0_prefix(req: &RequestRecord, inst: &Instance) -> u32 {
inst.cache.l0.longest_prefix_peek(&req.hash_ids) as u32
}
pub(crate) fn local_l0_scores(req: &RequestRecord, instances: &[Instance]) -> Vec<u32> {
instances
.iter()
.map(|inst| local_l0_prefix(req, inst))
.collect()
}
pub fn local_route_decision(
req_id: u64,
mode: &'static str,
chosen: InstanceId,
probe_overhead_s: f64,
candidates: Vec<CandidateInfo>,
reason: &'static str,
) -> RouteDecision {
RouteDecision {
req_id,
global_mode: "single_pool",
mode,
global_reason: "single pool uses bucket 0",
local_reason: reason,
chosen_bucket: 0,
chosen,
probe_overhead_s,
bucket_candidates: Vec::new(),
candidates,
}
}
pub fn build(full: &Config, seed: u64) -> Box<dyn Router> {
use crate::config::RouterMode::*;
let cfg = &full.cluster.router;
@@ -66,15 +120,300 @@ pub fn build(full: &Config, seed: u64) -> Box<dyn Router> {
MinPd => Box::new(min_pd::MinPdRouter::new()) as Box<dyn Router>,
LeastTokens => Box::new(least_tokens::LeastTokensRouter::new()) as Box<dyn Router>,
CacheLoad => Box::new(cache_load::CacheLoadRouter::new()) as Box<dyn Router>,
CacheScore => {
Box::new(cache_score::CacheScoreRouter::new(cfg.score_alpha, cfg.score_beta))
as Box<dyn Router>
CacheAffinity => Box::new(cache_affinity::CacheAffinityRouter::new(
cfg.load_alpha,
cfg.prefix_k,
)) as Box<dyn Router>,
CacheAffinityWeakRend => Box::new(
cache_affinity::CacheAffinityRouter::weak_with_rendezvous(cfg.load_alpha, cfg.prefix_k),
) as Box<dyn Router>,
CacheAffinityStrongOnly => Box::new(
cache_affinity::CacheAffinityRouter::strong_no_rendezvous(cfg.load_alpha, cfg.prefix_k),
) as Box<dyn Router>,
CacheScore => Box::new(cache_score::CacheScoreRouter::new(
cfg.score_alpha,
cfg.score_beta,
)) as Box<dyn Router>,
// Parity probe for the cache_affinity reweight claim: same scoring
// framework as cache_score, but α=β=1.0, so a single L0-hit block
// fully offsets one queue position. Demonstrates how much of
// cache_affinity's gain is reproducible by just retuning the weights
// (no rendezvous, no meta-store bonus).
CacheScoreStrong => {
Box::new(cache_score::CacheScoreRouter::new(1.0, 1.0)) as Box<dyn Router>
}
CacheScoreTtl => Box::new(cache_score_ttl::CacheScoreTtlRouter::new(
cfg.score_alpha,
cfg.score_beta,
)) as Box<dyn Router>,
EstimatedTtft => {
Box::new(estimated_ttft::EstimatedTtftRouter::new(full)) as Box<dyn Router>
}
PrefixAffinity => {
Box::new(prefix_affinity::PrefixAffinityRouter::new(full)) as Box<dyn Router>
}
AdaptiveAffinity => {
Box::new(adaptive_affinity::AdaptiveAffinityRouter::new(full)) as Box<dyn Router>
}
LineageAffinity => {
Box::new(lineage_affinity::LineageAffinityRouter::new(full)) as Box<dyn Router>
}
}
}
pub fn build_global(full: &Config) -> Box<dyn GlobalRouter> {
global_bucket::build(full)
}
#[cfg(test)]
mod tests {
use super::*;
use crate::config::{
CalibrationConfig, ClusterConfig, HardwareConfig, MetaStoreConfig, ModelConfig,
RouterConfig, RouterMode, SimConfig,
};
use crate::instance::instance::AdmittedRequest;
use crate::router::cache_load::CacheLoadRouter;
use crate::router::cache_score::CacheScoreRouter;
use crate::router::cache_score_ttl::CacheScoreTtlRouter;
use crate::router::estimated_ttft::EstimatedTtftRouter;
use crate::router::min_pd::MinPdRouter;
use crate::router::precise_aware::PreciseRouter;
use crate::router::prefix_affinity::PrefixAffinityRouter;
use crate::router::ttl_aware::TtlAwareRouter;
use crate::trace::RequestRecord;
fn test_model() -> ModelConfig {
ModelConfig {
name: "test".into(),
num_layers: 4,
num_kv_heads: 2,
head_dim: 64,
dtype_bytes: 2,
block_size_tokens: 16,
flops_per_token_prefill: Some(1.0e9),
attn_quadratic_coeff: Some(64.0),
..Default::default()
}
}
fn test_hardware() -> HardwareConfig {
HardwareConfig {
gpu_flops: 1.0e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1.0e12,
hbm_bytes: 1.0e9,
dram_bytes: 4.0e9,
host_dram_bw: 5.0e11,
pcie_bw: 32.0e9,
pcie_latency_us: 1.0,
rdma_bw: 12.0e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
}
}
fn test_config(mode: RouterMode) -> Config {
Config {
model: test_model(),
hardware: test_hardware(),
calibration: CalibrationConfig::default(),
cluster: ClusterConfig {
num_instances: Some(2),
buckets: Vec::new(),
global_router: Default::default(),
meta_store: MetaStoreConfig {
ttl_seconds: 1000.0,
},
router: RouterConfig {
mode,
precise_probe_latency_us: 10.0,
precise_probe_topk: 2,
load_alpha: 0.0,
score_alpha: 0.0,
score_beta: 1.0,
prefix_k: 8,
affinity_fan_out: 2,
},
},
sim: SimConfig {
trace_path: String::new(),
max_requests: None,
output_dir: String::new(),
sample_interval_s: 0.0,
seed: 7,
input_length_min: None,
input_length_max: None,
},
}
}
fn make_instances(n: usize) -> Vec<Instance> {
let model = test_model();
let hw = test_hardware();
let calib = CalibrationConfig::default();
(0..n)
.map(|id| Instance::new(id as u32, &model, &hw, &calib))
.collect()
}
fn make_request(hashes: &[u64]) -> RequestRecord {
RequestRecord {
req_id: 1,
chat_id: 0,
parent_chat_id: -1,
turn: 1,
arrival: 0.0,
input_len: hashes.len() as u32 * 16,
output_len: 16,
hash_ids: hashes.to_vec(),
}
}
fn insert_l0(inst: &mut Instance, hashes: &[u64]) {
let mut evicted = Vec::new();
inst.cache.l0.insert_blocks(hashes, &mut evicted);
}
fn insert_l1(inst: &mut Instance, hashes: &[u64]) {
let mut evicted = Vec::new();
inst.cache.l1.insert_blocks(hashes, &mut evicted);
}
fn publish_meta(meta: &mut MetaStore, inst_id: u32, hashes: &[u64], now: f64) {
for &h in hashes {
meta.insert(h, inst_id, now);
}
}
fn enqueue_requests(inst: &mut Instance, count: u32, tokens: u32) {
for req_id in 0..count {
inst.admit(AdmittedRequest {
req_id: req_id as u64,
arrival: 0.0,
ready_at: 0.0,
prefill_tokens_remaining: tokens,
reserved_blocks: 0,
completion_tail_s: 0.0,
});
}
}
#[test]
fn precise_uses_real_l0_prefix_not_l1_prefix() {
let req = make_request(&[10, 11, 12]);
let mut instances = make_instances(2);
let mut meta = MetaStore::new(1000.0);
insert_l1(&mut instances[0], &[10, 11, 12]);
publish_meta(&mut meta, 0, &[10, 11, 12], 0.0);
insert_l0(&mut instances[1], &[10, 11]);
let mut router = PreciseRouter::new(2, 10e-6, 0.0);
let decision = router.route(&req, &instances, &meta, 0.0);
assert_eq!(decision.chosen, 1);
}
#[test]
fn ttl_aware_uses_meta_store_scores() {
let req = make_request(&[20, 21, 22]);
let mut instances = make_instances(2);
let mut meta = MetaStore::new(1000.0);
// Instance 0 only looks hot in the meta store; its real L0 prefix is zero.
publish_meta(&mut meta, 0, &[20, 21, 22], 0.0);
// Instance 1 really holds the first two blocks in HBM.
insert_l0(&mut instances[1], &[20, 21]);
let mut ttl = TtlAwareRouter::new(0.0);
assert_eq!(ttl.route(&req, &instances, &meta, 0.0).chosen, 0);
}
#[test]
fn cache_aware_routers_compare_real_l0_not_meta_store_scores() {
let req = make_request(&[20, 21, 22]);
let mut instances = make_instances(2);
let mut meta = MetaStore::new(1000.0);
// Instance 0 only looks hot in the meta store; its real L0 prefix is zero.
publish_meta(&mut meta, 0, &[20, 21, 22], 0.0);
// Instance 1 really holds the first two blocks in HBM.
insert_l0(&mut instances[1], &[20, 21]);
let mut cache_load = CacheLoadRouter::new();
assert_eq!(cache_load.route(&req, &instances, &meta, 0.0).chosen, 1);
let mut cache_score = CacheScoreRouter::new(0.0, 1.0);
assert_eq!(cache_score.route(&req, &instances, &meta, 0.0).chosen, 1);
let mut min_pd = MinPdRouter::new();
assert_eq!(min_pd.route(&req, &instances, &meta, 0.0).chosen, 1);
let cfg = test_config(RouterMode::EstimatedTtft);
let mut est = EstimatedTtftRouter::new(&cfg);
assert_eq!(est.route(&req, &instances, &meta, 0.0).chosen, 1);
}
#[test]
fn cache_score_ttl_uses_meta_store_prefix() {
let req = make_request(&[50, 51, 52]);
let mut instances = make_instances(2);
let mut meta = MetaStore::new(1000.0);
// Instance 0 only looks hot in the meta store.
publish_meta(&mut meta, 0, &[50, 51, 52], 0.0);
// Instance 1 has the true local L0 prefix.
insert_l0(&mut instances[1], &[50, 51]);
let mut cache_score = CacheScoreRouter::new(0.0, 1.0);
assert_eq!(cache_score.route(&req, &instances, &meta, 0.0).chosen, 1);
let mut cache_score_ttl = CacheScoreTtlRouter::new(0.0, 1.0);
assert_eq!(
cache_score_ttl.route(&req, &instances, &meta, 0.0).chosen,
0
);
}
#[test]
fn prefix_affinity_fallback_uses_real_l0_not_meta_store_scores() {
let req = make_request(&[30, 31, 32]);
let mut instances = make_instances(2);
let mut meta = MetaStore::new(1000.0);
publish_meta(&mut meta, 0, &[30, 31, 32], 0.0);
insert_l0(&mut instances[1], &[30, 31]);
enqueue_requests(&mut instances[0], 5, 64);
enqueue_requests(&mut instances[1], 5, 64);
let cfg = test_config(RouterMode::PrefixAffinity);
let mut router = PrefixAffinityRouter::new(&cfg);
let decision = router.route(&req, &instances, &meta, 0.0);
assert_eq!(decision.local_reason, "affinity fallback: min(drain+prefill)");
assert_eq!(decision.chosen, 1);
}
#[test]
fn estimated_ttft_can_prefer_full_l1_hit_over_shorter_queue() {
let req = make_request(&[40, 41, 42, 43]);
let mut instances = make_instances(2);
let meta = MetaStore::new(1000.0);
// Instance 0 can satisfy the whole prefix from local DRAM/L1.
insert_l1(&mut instances[0], &[40, 41, 42, 43]);
// Instance 1 has a slightly shorter queue but no reusable prefix.
enqueue_requests(&mut instances[0], 1, 16);
let cfg = test_config(RouterMode::EstimatedTtft);
let mut est = EstimatedTtftRouter::new(&cfg);
assert_eq!(est.route(&req, &instances, &meta, 0.0).chosen, 0);
}
}
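
To show how a bucketed run stitches the two routing layers together, a minimal sketch of the intended composition (this assumes `local_route_decision`, `GlobalRouteDecision::single_bucket`, and `RouteDecision::with_global` are reachable as in the diff; the real driver wiring lives elsewhere):

```rust
use kvcache_simulator::router::{local_route_decision, GlobalRouteDecision};

fn main() {
    // Global layer: pick a bucket (here the trivial single-pool decision).
    let global = GlobalRouteDecision::single_bucket(42, 0);
    // Local layer: pick an instance inside that bucket.
    let local = local_route_decision(
        42,
        "least_loaded",
        3,
        0.0,
        Vec::new(),
        "argmin(kv_used + alpha * queue_len)",
    );
    // Stitch both into the single record the routing log serialises.
    let decision = local.with_global(&global);
    assert_eq!(decision.chosen_bucket, 0);
    assert_eq!(decision.global_mode, "single_pool");
}
```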

src/router/precise_aware.rs

@@ -1,28 +1,13 @@
//! KV-aware routing via meta-store candidate selection + precise probing.
//! L0-aware routing via exact per-instance probing.
//!
//! The global meta store is used as a *candidate pre-filter*: we score
//! every instance's predicted prefix from the store, take the top-K by
//! (predicted_prefix DESC, load ASC), and then exact-probe those K
//! candidates' actual L0+L1 caches to get the true longest prefix. This
//! catches two cases where the meta store is wrong:
//!
//! - the store is stale (block evicted from L0/L1 but TTL not yet up),
//! - the store undercounts because some blocks' TTL expired individually.
//!
//! Because the candidate set is sourced from the meta store rather than
//! from a load ranking, this router is a strict superset of `ttl_aware`:
//! any instance the meta store would pick is a candidate here, and the
//! exact probe can only move the decision toward a truthfully-better
//! instance. Each probe adds `probe_latency_s` to the request's
//! effective arrival time.
//!
//! If the meta store returns zero-prefix for every instance (e.g. cold
//! start, or a request whose blocks have never been seen), we fall back
//! to the top-K least-loaded instances so we still place the request.
//! Every instance is compared using its *real current L0 prefix length*
//! only. L1 / remote availability can still reduce execution-time misses
//! later in the cluster fetch chain, but they do not count as `kvcache hit`
//! for routing.
use crate::cluster::meta_store::MetaStore;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::router::{local_l0_prefix, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct PreciseRouter {
@@ -33,7 +18,11 @@ pub struct PreciseRouter {
impl PreciseRouter {
pub fn new(topk: u32, probe_latency_s: f64, alpha: f64) -> Self {
Self { topk, probe_latency_s, alpha }
Self {
topk,
probe_latency_s,
alpha,
}
}
fn load_of(&self, inst: &Instance) -> f64 {
@@ -50,50 +39,15 @@ impl Router for PreciseRouter {
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
_meta: &MetaStore,
_now: f64,
) -> RouteDecision {
let n = instances.len();
let k = (self.topk as usize).min(n).max(1);
// 1. Meta-store candidate set: rank all instances by
// (predicted_prefix DESC, load ASC) and take the top-K.
let meta_scores = meta.score_prefix(&req.hash_ids, now, n);
let any_meta_hit = meta_scores.iter().any(|&p| p > 0);
let mut ranked: Vec<usize> = (0..n).collect();
if any_meta_hit {
ranked.sort_by(|&a, &b| {
let pa = meta_scores[a];
let pb = meta_scores[b];
// prefix desc, then load asc
pb.cmp(&pa)
.then_with(|| {
self.load_of(&instances[a])
.partial_cmp(&self.load_of(&instances[b]))
.unwrap_or(std::cmp::Ordering::Equal)
})
});
} else {
// Cold start fallback: pure load order.
ranked.sort_by(|&a, &b| {
self.load_of(&instances[a])
.partial_cmp(&self.load_of(&instances[b]))
.unwrap_or(std::cmp::Ordering::Equal)
});
}
let probed = &ranked[..k];
// 2. Exact probe each candidate and pick
// argmax(exact_prefix, tiebreak: -load).
let mut candidates = Vec::with_capacity(k);
let mut best = probed[0] as u32;
let mut candidates = Vec::with_capacity(n);
let mut best = instances[0].id;
let mut best_key: (i64, f64) = (i64::MIN, f64::INFINITY);
for &i in probed {
let inst = &instances[i];
let l0 = inst.cache.l0.longest_prefix_peek(&req.hash_ids);
let l1 = inst.cache.l1.longest_prefix_peek(&req.hash_ids[l0..]);
let predicted = (l0 + l1) as u32;
for inst in instances {
let predicted = local_l0_prefix(req, inst);
let load = self.load_of(inst);
candidates.push(CandidateInfo {
instance: inst.id,
@@ -108,13 +62,13 @@ impl Router for PreciseRouter {
}
}
RouteDecision {
req_id: req.req_id,
mode: "precise",
chosen: best,
probe_overhead_s: k as f64 * self.probe_latency_s,
crate::router::local_route_decision(
req.req_id,
"precise",
best,
n as f64 * self.probe_latency_s,
candidates,
reason: "exact-probe top-K meta-store candidates",
}
"exact-probe all instances' L0 cache",
)
}
}

src/router/prefix_affinity.rs

@@ -36,7 +36,7 @@
use crate::cluster::meta_store::MetaStore;
use crate::config::Config;
use crate::instance::Instance;
use crate::router::{CandidateInfo, RouteDecision, Router};
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
use crate::trace::RequestRecord;
pub struct PrefixAffinityRouter {
@@ -47,17 +47,11 @@ pub struct PrefixAffinityRouter {
/// Queue-length threshold: if all top candidates exceed this, expand to
/// the full instance set.
overload_threshold: u32,
/// Bytes per KV block (for RDMA cost estimation in fallback path).
kv_block_bytes: f64,
/// RDMA bandwidth in bytes/s.
rdma_bw: f64,
/// RDMA per-transfer latency in seconds.
rdma_latency_s: f64,
}
impl PrefixAffinityRouter {
pub fn new(config: &Config) -> Self {
let n = config.cluster.num_instances as usize;
let n = config.cluster.total_instances() as usize;
let cfg_fan = config.cluster.router.affinity_fan_out;
// fan_out: if configured, use it; otherwise auto = max(2, n/8).
let fan_out = if cfg_fan > 0 {
@@ -69,9 +63,6 @@ impl PrefixAffinityRouter {
prefix_k: config.cluster.router.prefix_k,
fan_out,
overload_threshold: 4,
kv_block_bytes: config.model.kv_block_bytes() as f64,
rdma_bw: config.hardware.rdma_bw,
rdma_latency_s: config.hardware.rdma_latency_us * 1e-6,
}
}
@@ -96,15 +87,6 @@ impl PrefixAffinityRouter {
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
h ^ (h >> 31)
}
/// Estimate RDMA fetch time for `remote_blocks` blocks.
fn fetch_time(&self, remote_blocks: u32) -> f64 {
if remote_blocks == 0 {
return 0.0;
}
let bytes = remote_blocks as f64 * self.kv_block_bytes;
bytes / self.rdma_bw + self.rdma_latency_s
}
}
impl Router for PrefixAffinityRouter {
@@ -116,8 +98,8 @@ impl Router for PrefixAffinityRouter {
&mut self,
req: &RequestRecord,
instances: &[Instance],
meta: &MetaStore,
now: f64,
_meta: &MetaStore,
_now: f64,
) -> RouteDecision {
let n = instances.len();
let fp = Self::fingerprint(&req.hash_ids, self.prefix_k);
@@ -129,7 +111,7 @@ impl Router for PrefixAffinityRouter {
ranked.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // descending score
// Collect candidate info for logging (also needed for fallback).
let scores = meta.score_prefix(&req.hash_ids, now, n);
let scores = local_l0_scores(req, instances);
let candidates: Vec<CandidateInfo> = instances
.iter()
.map(|inst| CandidateInfo {
@@ -165,14 +147,14 @@ impl Router for PrefixAffinityRouter {
let reason;
if all_overloaded {
reason = "affinity fallback: min(drain+prefill)";
let cluster_prefix = scores.iter().copied().max().unwrap_or(0);
let mut best_cost = f64::INFINITY;
for &(_, idx) in ranked.iter() {
let inst = &instances[idx];
let drain = inst.estimated_drain_time();
let local_prefix = scores[idx];
let remote_blocks = cluster_prefix.saturating_sub(local_prefix);
let cost = drain + self.fetch_time(remote_blocks);
let miss_tokens = (req.hash_ids.len() as u32)
.saturating_sub(scores[idx])
.saturating_mul(inst.block_size_tokens);
let cost = drain + inst.compute.prefill_time(miss_tokens);
let ql = inst.queue_len();
if cost < best_cost || (cost == best_cost && ql < best_ql) {
best_cost = cost;
@@ -184,13 +166,13 @@ impl Router for PrefixAffinityRouter {
reason = "prefix affinity: top-K min drain";
}
RouteDecision {
req_id: req.req_id,
mode: "prefix_affinity",
chosen: instances[best_idx].id,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"prefix_affinity",
instances[best_idx].id,
0.0,
candidates,
reason,
}
)
}
}


@@ -13,7 +13,9 @@ pub struct RandomRouter {
impl RandomRouter {
pub fn new(seed: u64) -> Self {
Self { rng: ChaCha8Rng::seed_from_u64(seed) }
Self {
rng: ChaCha8Rng::seed_from_u64(seed),
}
}
}
@@ -31,19 +33,19 @@ impl Router for RandomRouter {
) -> RouteDecision {
let n = instances.len();
let chosen = self.rng.gen_range(0..n) as InstanceId;
RouteDecision {
req_id: req.req_id,
mode: "random",
crate::router::local_route_decision(
req.req_id,
"random",
chosen,
probe_overhead_s: 0.0,
candidates: vec![CandidateInfo {
0.0,
vec![CandidateInfo {
instance: chosen,
predicted_prefix: 0,
load_blocks: instances[chosen as usize].kv_blocks_used,
queue_len: instances[chosen as usize].queue_len(),
}],
reason: "uniform random",
}
"uniform random",
)
}
}
@@ -73,18 +75,18 @@ impl Router for RoundRobinRouter {
let n = instances.len() as u32;
let chosen = self.next % n;
self.next = self.next.wrapping_add(1);
RouteDecision {
req_id: req.req_id,
mode: "round_robin",
crate::router::local_route_decision(
req.req_id,
"round_robin",
chosen,
probe_overhead_s: 0.0,
candidates: vec![CandidateInfo {
0.0,
vec![CandidateInfo {
instance: chosen,
predicted_prefix: 0,
load_blocks: instances[chosen as usize].kv_blocks_used,
queue_len: instances[chosen as usize].queue_len(),
}],
reason: "round robin",
}
"round robin",
)
}
}

src/router/ttl_aware.rs

@@ -32,8 +32,7 @@ impl Router for TtlAwareRouter {
let mut candidates = Vec::with_capacity(n);
for inst in instances {
let p = scores[inst.id as usize];
let load = inst.kv_blocks_used as f64
+ self.alpha * inst.queue_len() as f64;
let load = inst.kv_blocks_used as f64 + self.alpha * inst.queue_len() as f64;
candidates.push(CandidateInfo {
instance: inst.id,
predicted_prefix: p,
@@ -47,13 +46,13 @@ impl Router for TtlAwareRouter {
best = inst.id;
}
}
RouteDecision {
req_id: req.req_id,
mode: "ttl_aware",
chosen: best,
probe_overhead_s: 0.0,
crate::router::local_route_decision(
req.req_id,
"ttl_aware",
best,
0.0,
candidates,
reason: "max meta_store prefix, tie -> least loaded",
}
"max meta_store prefix, tie -> least loaded",
)
}
}


@@ -56,7 +56,11 @@ impl EventQueue {
pub fn schedule(&mut self, time: f64, event: Event) {
let t = time.max(self.now);
self.seq += 1;
self.heap.push(Slot { time: t, seq: self.seq, event });
self.heap.push(Slot {
time: t,
seq: self.seq,
event,
});
}
pub fn pop(&mut self) -> Option<(f64, Event)> {
@@ -84,9 +88,27 @@ mod tests {
#[test]
fn pops_in_time_order() {
let mut q = EventQueue::new();
q.schedule(2.0, Event::BatchTick { instance: 0 as InstanceId });
q.schedule(1.0, Event::BatchTick { instance: 1 });
q.schedule(1.5, Event::BatchTick { instance: 2 });
q.schedule(
2.0,
Event::BatchTick {
bucket: 0,
instance: 0 as InstanceId,
},
);
q.schedule(
1.0,
Event::BatchTick {
bucket: 0,
instance: 1,
},
);
q.schedule(
1.5,
Event::BatchTick {
bucket: 0,
instance: 2,
},
);
let (t1, _) = q.pop().unwrap();
let (t2, _) = q.pop().unwrap();
let (t3, _) = q.pop().unwrap();
@@ -98,12 +120,24 @@ mod tests {
#[test]
fn equal_time_fifo() {
let mut q = EventQueue::new();
q.schedule(1.0, Event::BatchTick { instance: 7 });
q.schedule(1.0, Event::BatchTick { instance: 8 });
q.schedule(
1.0,
Event::BatchTick {
bucket: 0,
instance: 7,
},
);
q.schedule(
1.0,
Event::BatchTick {
bucket: 1,
instance: 8,
},
);
let (_, e1) = q.pop().unwrap();
let (_, e2) = q.pop().unwrap();
match (e1, e2) {
(Event::BatchTick { instance: a }, Event::BatchTick { instance: b }) => {
(Event::BatchTick { instance: a, .. }, Event::BatchTick { instance: b, .. }) => {
assert_eq!(a, 7);
assert_eq!(b, 8);
}


@@ -1,5 +1,6 @@
//! Event types for the discrete-event engine.
use crate::router::BucketId;
use crate::types::{InstanceId, ReqId};
#[derive(Debug)]
@@ -7,7 +8,10 @@ pub enum Event {
/// New trace request arrives at the cluster router.
Arrival { req_id: ReqId },
/// Per-instance scheduler tick (continuous batching).
BatchTick { instance: InstanceId },
BatchTick {
bucket: BucketId,
instance: InstanceId,
},
/// Periodic time-series sample of all instances.
Sample,
/// Stop the simulation early (used internally).

src/trace.rs

@@ -21,12 +21,16 @@ struct RawRecord {
#[serde(default)]
chat_id: i64,
#[serde(default)]
parent_chat_id: i64,
#[serde(default)]
timestamp: f64,
#[serde(default)]
input_length: i64,
#[serde(default)]
output_length: i64,
#[serde(default)]
turn: i64,
#[serde(default)]
hash_ids: Vec<i64>,
}
@@ -34,6 +38,8 @@ struct RawRecord {
pub struct RequestRecord {
pub req_id: u64,
pub chat_id: i64,
pub parent_chat_id: i64,
pub turn: i64,
pub arrival: f64,
pub input_len: u32,
pub output_len: u32,
@@ -50,8 +56,7 @@ pub struct TraceReader {
impl TraceReader {
pub fn open<P: AsRef<Path>>(path: P, max_requests: Option<u64>) -> Result<Self> {
let path = path.as_ref();
let f = File::open(path)
.with_context(|| format!("opening trace {}", path.display()))?;
let f = File::open(path).with_context(|| format!("opening trace {}", path.display()))?;
Ok(Self {
inner: BufReader::with_capacity(1 << 20, f),
next_id: 0,
@@ -89,6 +94,8 @@ impl Iterator for TraceReader {
return Some(Ok(RequestRecord {
req_id: id,
chat_id: raw.chat_id,
parent_chat_id: raw.parent_chat_id,
turn: raw.turn,
arrival: raw.timestamp,
input_len: raw.input_length.max(0) as u32,
output_len: raw.output_length.max(0) as u32,
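
A trace line carrying the new lineage fields, built the same way the integration tests below construct their synthetic traces (the field values here are made up):

```rust
fn main() {
    // RawRecord fields above; parent_chat_id and turn are the new lineage fields.
    let line = serde_json::json!({
        "chat_id": 7,
        "parent_chat_id": 3,
        "turn": 2,
        "timestamp": 12.5,
        "input_length": 96,
        "output_length": 16,
        "hash_ids": [10, 11, 12, 13, 14, 15]
    });
    println!("{line}"); // one JSONL row for TraceReader
}
```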

src/ttft.rs (new file, +180 lines)

@@ -0,0 +1,180 @@
use crate::cluster::meta_store::MetaStore;
use crate::config::{CalibrationConfig, HardwareConfig};
use crate::instance::Instance;
#[derive(Debug, Clone, Copy, Default)]
pub struct PrefixResidency {
pub l0_hit_blocks: u32,
pub l1_hit_blocks: u32,
pub remote_hit_blocks: u32,
pub miss_blocks: u32,
}
#[derive(Debug, Clone)]
pub struct TtftModel {
kv_block_bytes: u64,
host_dram_bw: f64,
pcie_bw: f64,
pcie_latency_s: f64,
rdma_bw: f64,
rdma_latency_s: f64,
scheduler_base_s: f64,
scheduler_per_candidate_s: f64,
cache_probe_per_tier_s: f64,
batch_pack_s: f64,
dram_access_s: f64,
remote_metadata_s: f64,
layout_transform_s: f64,
first_token_tail_s: f64,
}
impl TtftModel {
pub fn new(hw: &HardwareConfig, calib: &CalibrationConfig, kv_block_bytes: u64) -> Self {
Self {
kv_block_bytes,
host_dram_bw: hw.host_dram_bw * calib.dram_bw_util,
pcie_bw: hw.pcie_bw * calib.pcie_bw_util,
pcie_latency_s: hw.pcie_latency_us * 1e-6,
rdma_bw: hw.rdma_bw * calib.rdma_bw_util,
rdma_latency_s: hw.rdma_latency_us * 1e-6,
scheduler_base_s: calib.scheduler_base_overhead_us * 1e-6,
scheduler_per_candidate_s: calib.scheduler_per_candidate_us * 1e-6,
cache_probe_per_tier_s: calib.cache_probe_us_per_tier * 1e-6,
batch_pack_s: calib.batch_pack_overhead_us * 1e-6,
dram_access_s: calib.dram_access_latency_us * 1e-6,
remote_metadata_s: calib.remote_metadata_us * 1e-6,
layout_transform_s: calib.layout_transform_fixed_us * 1e-6,
first_token_tail_s: (calib.final_sync_us + calib.first_token_ready_us) * 1e-6,
}
}
pub fn scheduler_overhead_s(&self, num_candidates: usize, num_tiers: usize) -> f64 {
self.scheduler_base_s
+ self.scheduler_per_candidate_s * num_candidates as f64
+ self.cache_probe_per_tier_s * num_tiers as f64
+ self.batch_pack_s
}
pub fn first_token_tail_s(&self) -> f64 {
self.first_token_tail_s
}
pub fn block_bytes(&self, blocks: u32) -> u64 {
self.kv_block_bytes * blocks as u64
}
pub fn local_l1_prepare_time_s(&self, blocks: u32) -> f64 {
if blocks == 0 {
return 0.0;
}
let bytes = self.block_bytes(blocks);
self.dram_access_s
+ bytes as f64 / self.host_dram_bw.max(1.0)
+ self.pcie_cost_s(bytes)
+ self.layout_transform_s
}
pub fn remote_prepare_time_s(&self, blocks: u32) -> f64 {
if blocks == 0 {
return 0.0;
}
let bytes = self.block_bytes(blocks);
self.remote_metadata_s
+ self.rdma_cost_s(bytes)
+ self.pcie_cost_s(bytes)
+ self.layout_transform_s
}
pub fn pcie_cost_s(&self, bytes: u64) -> f64 {
if bytes == 0 {
self.pcie_latency_s
} else {
self.pcie_latency_s + bytes as f64 / self.pcie_bw.max(1.0)
}
}
pub fn rdma_cost_s(&self, bytes: u64) -> f64 {
if bytes == 0 {
self.rdma_latency_s
} else {
self.rdma_latency_s + bytes as f64 / self.rdma_bw.max(1.0)
}
}
pub fn kv_prepare_time_s(&self, residency: PrefixResidency) -> f64 {
self.local_l1_prepare_time_s(residency.l1_hit_blocks)
+ self.remote_prepare_time_s(residency.remote_hit_blocks)
}
}
pub fn classify_prefix_tiers(
req_hashes: &[u64],
inst: &Instance,
meta: &MetaStore,
now: f64,
) -> PrefixResidency {
let total_blocks = req_hashes.len() as u32;
let l0_hit_blocks = inst.cache.l0.longest_prefix_peek(req_hashes) as u32;
let suffix_after_l0 = &req_hashes[l0_hit_blocks as usize..];
let l1_hit_blocks = inst.cache.l1.longest_prefix_peek(suffix_after_l0) as u32;
let suffix_after_l1 = &suffix_after_l0[l1_hit_blocks as usize..];
let mut remote_hit_blocks = 0;
for &h in suffix_after_l1 {
let owners = meta.instances_for(h, now);
if owners.iter().any(|o| *o != inst.id) {
remote_hit_blocks += 1;
} else {
break;
}
}
PrefixResidency {
l0_hit_blocks,
l1_hit_blocks,
remote_hit_blocks,
miss_blocks: total_blocks - l0_hit_blocks - l1_hit_blocks - remote_hit_blocks,
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::config::{CalibrationConfig, HardwareConfig};
#[test]
fn remote_prepare_includes_fixed_overheads() {
let hw = HardwareConfig {
gpu_flops: 1.0e14,
gpu_fp8_flops: 0.0,
gpu_fp4_flops: 0.0,
gpu_mem_bw: 1.0e12,
hbm_bytes: 1.0e9,
dram_bytes: 4.0e9,
host_dram_bw: 5.0e11,
pcie_bw: 32.0e9,
pcie_latency_us: 1.0,
rdma_bw: 12.0e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
};
let model = TtftModel::new(
&hw,
&CalibrationConfig {
remote_metadata_us: 11.0,
layout_transform_fixed_us: 7.0,
..CalibrationConfig::default()
},
4096,
);
let transport_only = model.rdma_cost_s(4096) + model.pcie_cost_s(4096);
let total = model.remote_prepare_time_s(1);
assert!(total > transport_only);
}
}
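
To make the tier costs concrete, a minimal sketch that reuses the hardware numbers from the test above and prices a mixed-residency prefix (this assumes `ttft` is exported as a public module of the crate; the residency split is illustrative):

```rust
use kvcache_simulator::config::{CalibrationConfig, HardwareConfig};
use kvcache_simulator::ttft::{PrefixResidency, TtftModel};

fn main() {
    let hw = HardwareConfig {
        gpu_flops: 1.0e14,
        gpu_fp8_flops: 0.0,
        gpu_fp4_flops: 0.0,
        gpu_mem_bw: 1.0e12,
        hbm_bytes: 1.0e9,
        dram_bytes: 4.0e9,
        host_dram_bw: 5.0e11,
        pcie_bw: 32.0e9,
        pcie_latency_us: 1.0,
        rdma_bw: 12.0e9,
        rdma_latency_us: 5.0,
        intra_node_tp_bw: 9.0e11,
        intra_node_tp_latency_us: 2.0,
        tp_degree: 1,
        max_batch_slots: 32,
        prefill_chunk_tokens: 1024,
    };
    let model = TtftModel::new(&hw, &CalibrationConfig::default(), 4096);
    // 4 blocks already in HBM (free at prepare time), 2 staged from host
    // DRAM over PCIe, 1 fetched from a peer over RDMA, nothing missed.
    let residency = PrefixResidency {
        l0_hit_blocks: 4,
        l1_hit_blocks: 2,
        remote_hit_blocks: 1,
        miss_blocks: 0,
    };
    println!("kv_prepare = {:.1} us", model.kv_prepare_time_s(residency) * 1e6);
}
```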


@@ -6,6 +6,7 @@ use std::io::Write;
use kvcache_simulator::config::*;
use kvcache_simulator::driver;
use kvcache_simulator::replay::{self, PlacementEntry, ReplayEvictPolicy};
fn base_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
Config {
@@ -27,16 +28,25 @@ fn base_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
gpu_mem_bw: 1.0e12,
hbm_bytes: 1.0e9,
dram_bytes: 4.0e9,
host_dram_bw: 5.0e11,
pcie_bw: 32.0e9,
pcie_latency_us: 1.0,
rdma_bw: 12.0e9,
rdma_latency_us: 5.0,
intra_node_tp_bw: 9.0e11,
intra_node_tp_latency_us: 2.0,
tp_degree: 1,
max_batch_slots: 32,
prefill_chunk_tokens: 1024,
},
calibration: CalibrationConfig::default(),
cluster: ClusterConfig {
num_instances: 4,
meta_store: MetaStoreConfig { ttl_seconds: 1000.0 },
num_instances: Some(4),
buckets: Vec::new(),
global_router: Default::default(),
meta_store: MetaStoreConfig {
ttl_seconds: 1000.0,
},
router: RouterConfig {
mode,
precise_probe_latency_us: 10.0,
@@ -54,10 +64,32 @@ fn base_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
output_dir: out_dir.into(),
sample_interval_s: 0.0,
seed: 7,
input_length_min: None,
input_length_max: None,
},
}
}
fn bucketed_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
let mut cfg = base_config(trace_path, out_dir, mode);
cfg.cluster.num_instances = None;
cfg.cluster.buckets = vec![
BucketConfig {
name: "short".into(),
input_length_min: 0,
input_length_max: 64,
num_instances: 2,
},
BucketConfig {
name: "long".into(),
input_length_min: 65,
input_length_max: 128,
num_instances: 1,
},
];
cfg
}
fn write_synthetic_trace(path: &std::path::Path) {
// 5 distinct conversations, each with 8 turns. Within a conversation,
// turn k+1 reuses the prefix of turn k (shared first ~10 blocks) and
@@ -94,9 +126,11 @@ fn write_synthetic_trace(path: &std::path::Path) {
}
}
fn run(mode: RouterMode, trace_path: &std::path::Path, out_root: &std::path::Path)
-> kvcache_simulator::metrics::Summary
{
fn run(
mode: RouterMode,
trace_path: &std::path::Path,
out_root: &std::path::Path,
) -> kvcache_simulator::metrics::Summary {
let cfg = base_config(
trace_path.to_str().unwrap(),
out_root.to_str().unwrap(),
@@ -119,9 +153,8 @@ fn ablation_hit_rate_ordering() {
let s_ttl = run(RouterMode::TtlAware, &trace_path, &tmp);
let s_prec = run(RouterMode::Precise, &trace_path, &tmp);
let total_hit = |s: &kvcache_simulator::metrics::Summary| {
s.hit_rate_l0 + s.hit_rate_l1 + s.hit_rate_remote
};
let total_hit =
|s: &kvcache_simulator::metrics::Summary| s.hit_rate_l0 + s.hit_rate_l1 + s.hit_rate_remote;
let h_rand = total_hit(&s_random);
let h_ll = total_hit(&s_ll);
@@ -135,23 +168,286 @@ fn ablation_hit_rate_ordering() {
eprintln!(
" remote+local hit ratio L0/L1/remote: \
random=({:.2},{:.2},{:.2}) precise=({:.2},{:.2},{:.2})",
s_random.hit_rate_l0, s_random.hit_rate_l1, s_random.hit_rate_remote,
s_prec.hit_rate_l0, s_prec.hit_rate_l1, s_prec.hit_rate_remote,
s_random.hit_rate_l0,
s_random.hit_rate_l1,
s_random.hit_rate_remote,
s_prec.hit_rate_l0,
s_prec.hit_rate_l1,
s_prec.hit_rate_remote,
);
// ttl_aware and precise should outperform random / least_loaded for
// a workload built entirely of shared-prefix conversations.
let eps = 1e-6;
assert!(
h_ttl + eps >= h_rand,
"ttl_aware should >= random hit rate"
);
assert!(
h_prec + eps >= h_rand,
"precise should >= random hit rate"
);
assert!(h_ttl + eps >= h_rand, "ttl_aware should >= random hit rate");
assert!(h_prec + eps >= h_rand, "precise should >= random hit rate");
assert!(
h_prec + eps >= h_ll,
"precise should >= least_loaded hit rate"
);
}
#[test]
fn ablation_lru_preserves_ttft_fields() {
let tmp = std::env::temp_dir().join("kvcache_sim_replay");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let trace_path = tmp.join("trace.jsonl");
write_synthetic_trace(&trace_path);
let cfg = base_config(
trace_path.to_str().unwrap(),
tmp.to_str().unwrap(),
RouterMode::Random,
);
let online = driver::run(&cfg, Some("online_lru")).expect("online lru run");
let out =
driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Lru])
.expect("ablate lru");
assert_eq!(out.len(), 1);
let row = &out[0];
let online_hit =
online.summary.hit_rate_l0 + online.summary.hit_rate_l1 + online.summary.hit_rate_remote;
let ablate_hit = row.hit_rate_l0 + row.hit_rate_l1 + row.hit_rate_remote;
assert!(
(ablate_hit - online_hit).abs() < 1e-9,
"ablation lru should match online lru hit rate: online={online_hit} ablate={ablate_hit}"
);
assert!((row.ttft_mean - online.summary.ttft_mean).abs() < 1e-9);
assert!((row.ttft_p50 - online.summary.ttft_p50).abs() < 1e-9);
assert!((row.ttft_p95 - online.summary.ttft_p95).abs() < 1e-9);
assert!((row.ttft_p99 - online.summary.ttft_p99).abs() < 1e-9);
}
#[test]
fn ablate_rejects_belady_until_exact_algorithm_exists() {
let tmp = std::env::temp_dir().join("kvcache_sim_ablate_evict");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let trace_path = tmp.join("trace.jsonl");
write_synthetic_trace(&trace_path);
let cfg = base_config(
trace_path.to_str().unwrap(),
tmp.to_str().unwrap(),
RouterMode::Random,
);
let err =
driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Belady])
.expect_err("belady should be rejected");
assert!(
err.to_string().contains("exact belady"),
"unexpected error: {err:#}"
);
}
#[test]
fn ablation_parallel_matches_serial() {
let tmp = std::env::temp_dir().join("kvcache_sim_ablate_parallel");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let trace_path = tmp.join("trace.jsonl");
write_synthetic_trace(&trace_path);
let cfg = base_config(
trace_path.to_str().unwrap(),
tmp.to_str().unwrap(),
RouterMode::Random,
);
let routers = [
RouterMode::Random,
RouterMode::LeastLoaded,
RouterMode::TtlAware,
RouterMode::Precise,
];
let serial = driver::ablate_fixed_placement_with_parallelism(
&cfg,
&routers,
&[ReplayEvictPolicy::Lru],
1,
)
.expect("serial ablate");
let parallel = driver::ablate_fixed_placement_with_parallelism(
&cfg,
&routers,
&[ReplayEvictPolicy::Lru],
2,
)
.expect("parallel ablate");
assert_eq!(parallel.len(), serial.len());
for (lhs, rhs) in parallel.iter().zip(serial.iter()) {
assert_eq!(lhs.router, rhs.router);
assert_eq!(lhs.evict_policy, rhs.evict_policy);
assert_eq!(lhs.placement_source, rhs.placement_source);
assert!((lhs.ttft_mean - rhs.ttft_mean).abs() < 1e-9);
assert!((lhs.ttft_p50 - rhs.ttft_p50).abs() < 1e-9);
assert!((lhs.ttft_p95 - rhs.ttft_p95).abs() < 1e-9);
assert!((lhs.ttft_p99 - rhs.ttft_p99).abs() < 1e-9);
assert!((lhs.hit_rate_l0 - rhs.hit_rate_l0).abs() < 1e-12);
assert!((lhs.hit_rate_l1 - rhs.hit_rate_l1).abs() < 1e-12);
assert!((lhs.hit_rate_remote - rhs.hit_rate_remote).abs() < 1e-12);
assert!((lhs.miss_rate - rhs.miss_rate).abs() < 1e-12);
}
}
#[test]
fn strict_bucket_run_emits_bucket_fields_in_outputs() {
let tmp = std::env::temp_dir().join("kvcache_sim_bucket_outputs");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let trace_path = tmp.join("trace.jsonl");
let mut f = std::fs::File::create(&trace_path).unwrap();
writeln!(
f,
"{}",
serde_json::json!({
"chat_id": 1,
"parent_chat_id": -1,
"timestamp": 0.0,
"input_length": 32,
"output_length": 16,
"type": "text",
"turn": 0,
"hash_ids": [1, 2]
})
)
.unwrap();
writeln!(
f,
"{}",
serde_json::json!({
"chat_id": 2,
"parent_chat_id": -1,
"timestamp": 0.1,
"input_length": 80,
"output_length": 16,
"type": "text",
"turn": 0,
"hash_ids": [3, 4, 5, 6, 7]
})
)
.unwrap();
let mut cfg = bucketed_config(
trace_path.to_str().unwrap(),
tmp.to_str().unwrap(),
RouterMode::LeastLoaded,
);
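    // Strict mode routes purely by input length, pinning each request to the
    // bucket covering its length.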
cfg.cluster.global_router.mode = GlobalRouterMode::StrictInputLength;
cfg.sim.sample_interval_s = 0.05;
let _ = driver::run(&cfg, Some("strict_bucket")).expect("bucketed run");
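    // All three output artifacts should carry the bucket fields.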
let per_request = std::fs::read_to_string(tmp.join("strict_bucket/per_request.csv")).unwrap();
assert!(per_request.contains("bucket"));
assert!(per_request.contains("length_bucket_match"));
let instances = std::fs::read_to_string(tmp.join("strict_bucket/instances.csv")).unwrap();
assert!(instances.contains("bucket"));
let routing_log = std::fs::read_to_string(tmp.join("strict_bucket/routing_log.jsonl")).unwrap();
assert!(routing_log.contains("\"chosen_bucket\""));
assert!(routing_log.contains("\"bucket_candidates\""));
assert!(routing_log.contains("\"global_reason\""));
}
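
// Bucketed cluster configs are only supported by the bucket-aware runtime;
// the legacy fixed-placement ablation and replay entry points must reject
// them with an error pointing at `cluster.buckets`.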
#[test]
fn bucketed_configs_are_rejected_by_legacy_fixed_placement_paths() {
let tmp = std::env::temp_dir().join("kvcache_sim_bucketed_reject");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let trace_path = tmp.join("trace.jsonl");
write_synthetic_trace(&trace_path);
let cfg = bucketed_config(
trace_path.to_str().unwrap(),
tmp.to_str().unwrap(),
RouterMode::Random,
);
let err =
driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Lru])
.expect_err("bucketed ablation should fail");
assert!(err.to_string().contains("cluster.buckets"));
let err = replay::replay_fixed_placement(
&cfg,
&[],
&Vec::<PlacementEntry>::new(),
ReplayEvictPolicy::Lru,
)
.expect_err("bucketed replay should fail");
assert!(err.to_string().contains("cluster.buckets"));
}
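
// Under bucket-score routing, load and length-penalty weights can pull a
// request out of its strict length bucket; strict mode never deviates.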
#[test]
fn bucket_score_can_deviate_from_strict_length_bucket() {
let tmp = std::env::temp_dir().join("kvcache_sim_bucket_score");
let _ = std::fs::remove_dir_all(&tmp);
std::fs::create_dir_all(&tmp).unwrap();
let trace_path = tmp.join("trace.jsonl");
let mut f = std::fs::File::create(&trace_path).unwrap();
for req_id in 0..3 {
writeln!(
f,
"{}",
serde_json::json!({
"chat_id": req_id,
"parent_chat_id": -1,
"timestamp": 0.0,
"input_length": 24,
"output_length": 16,
"type": "text",
"turn": 0,
"hash_ids": [100 + req_id, 200 + req_id]
})
)
.unwrap();
}
let mut strict_cfg = bucketed_config(
trace_path.to_str().unwrap(),
tmp.to_str().unwrap(),
RouterMode::LeastLoaded,
);
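    // Two single-instance buckets; every trace request (input_length 24) falls
    // into "short" under strict length routing.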
strict_cfg.cluster.buckets = vec![
BucketConfig {
name: "short".into(),
input_length_min: 0,
input_length_max: 32,
num_instances: 1,
},
BucketConfig {
name: "long".into(),
input_length_min: 33,
input_length_max: 96,
num_instances: 1,
},
];
strict_cfg.cluster.global_router.mode = GlobalRouterMode::StrictInputLength;
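    // Bucket-score mode with a dominant load weight and zero cache weight, so
    // the router is free to spill short requests into the "long" bucket to
    // spread load.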
let mut score_cfg = strict_cfg.clone();
score_cfg.cluster.global_router.mode = GlobalRouterMode::BucketScore;
score_cfg.cluster.global_router.length_penalty_weight = 1.0;
score_cfg.cluster.global_router.load_weight = 10.0;
score_cfg.cluster.global_router.cache_weight = 0.0;
let _ = driver::run(&strict_cfg, Some("strict_score_cmp")).expect("strict run");
let _ = driver::run(&score_cfg, Some("bucket_score_cmp")).expect("bucket score run");
let strict_log =
std::fs::read_to_string(tmp.join("strict_score_cmp/routing_log.jsonl")).unwrap();
let score_log =
std::fs::read_to_string(tmp.join("bucket_score_cmp/routing_log.jsonl")).unwrap();
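    // Strict routing keeps all three requests in bucket 0 ("short");
    // bucket-score routing should place at least one of them in bucket 1.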
assert!(strict_log.contains("\"chosen_bucket\":0"));
assert!(score_log.contains("\"global_mode\":\"bucket_score\""));
assert!(score_log.contains("\"chosen_bucket\":1"));
}