Compare commits

...

1 Commits

Author SHA1 Message Date
1906857ffd chore: update README 2026-04-17 19:15:22 +08:00

README.md

# kvcache-simulator

Discrete-event simulator for cluster-level LLM **prefill** serving with a
two-tier KV cache, built for routing experiments. The simulator models a
PD-disaggregated deployment: only the **prefill** path is simulated, while
decode is reduced to a small completion tail for TTFT/E2E accounting.

It is intended for answering questions like:

- How much do different KV-aware routers help on the same trace?
- How much HBM/DRAM capacity is enough before routing dominates?
- How do prefix-locality policies behave under bucketed input-length pools?
- What is the gap between online LRU and offline-optimal cache capacity?

## What The Repo Models

- **Architecture-derived prefill cost** from model structure, including MoE,
  MLA, GQA, sliding-window attention, and DSA.
- **Two-tier KV hierarchy** with L0 GPU HBM and L1 host DRAM, plus remote
  RDMA fetches via a meta-store.
- **Single-pool and bucketed clusters**. Bucketed mode separates the service
  into input-length buckets with isolated instance pools and meta-stores.
- **Local instance routing and global bucket routing** with detailed
  per-request routing logs.
- **Trace replay with optional input-length filtering**, so the same trace
  can be sliced into buckets without rewriting the source file.
- **Offline oracle analysis** for unlimited-capacity, Belady, and LRU
  hit-rate ceilings.

## Highlights

- **HF `config.json` auto-loading**: point `model.config_json` at a model
  config and the simulator derives architecture parameters automatically.
- **Hardware presets**: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`,
  `a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`.
- **18 local router modes** covering baselines, load-based, cache-aware,
  affinity, and TTFT-estimating policies.
- **2 global bucket router modes**: `strict_input_length` and `bucket_score`.
- **Detailed outputs**: `summary.json`, `per_request.csv`, `instances.csv`,
  `routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable.
## Build

```bash
cargo build --release
```

If you want the public Qwen trace submodule as well:

```bash
git submodule update --init --recursive
```

The release binary is:

```bash
target/release/kvcache-sim
```
## Quick Start

Validate a config:

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml
```

Run one simulation:

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml
```

Compare several routers on the same trace:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft
```

Auto-pick the smallest cluster size that meets a TTFT target, then ablate at
that size:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --num-instances 64 --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0
```

Run the oracle:

```bash
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance
```

`run` prints `summary.json` to stdout and also writes the full output
directory under `sim.output_dir`.
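Because the summary goes to stdout, a `run` composes directly with `jq`. A
minimal sketch, assuming stdout carries only the JSON summary and that the
field names match the ablation examples later in this README:

```bash
# Pull two headline numbers straight from a run (field names assumed).
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml \
  | jq '{ttft_mean, hit_rate_l0}'
```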
## Current Command Boundaries

The repository now supports both legacy single-pool clusters and bucketed
service topologies, but not every CLI path supports both yet:

- `run`: supports `cluster.num_instances` and `cluster.buckets`
- `validate`: supports `cluster.num_instances` and `cluster.buckets`
- `ablate`: currently **single-pool only**
- `ablate --evict-policies`: currently supports **`lru` only**
- `oracle`: currently **single-pool only**
- `--num-instances` override: currently **single-pool only**
- `--auto-instances`: currently **single-pool only**

In practice, bucket-aware experiments are ready in `run`, while fixed-placement
ablation and oracle analysis still reject `cluster.buckets`.
## Config Model

### Single-Pool Cluster

Use `cluster.num_instances` for the original flat instance pool:

```yaml
cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
```

### Bucketed Service

Use `cluster.buckets` plus a `global_router` to model explicit input-length
buckets:

```yaml
cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4
```

Rules enforced by config validation:

- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
- bucket ranges must not overlap
- every bucket must have `num_instances > 0`
- `input_length_min <= input_length_max`
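These checks run at config-load time, so `validate` is the quickest way to
exercise them; the config path below is a placeholder for your own bucketed
YAML:

```bash
# Fails fast if bucket ranges overlap or num_instances is also set.
target/release/kvcache-sim validate --config configs/my-buckets.yaml
```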
### CLI Overrides

These flags apply on top of the YAML config:

| Flag | Overrides |
|------|-----------|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |
| `--input-length-min <N>` | `sim.input_length_min` |
| `--input-length-max <N>` | `sim.input_length_max` |

Subcommand-specific additions:

- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`,
  `--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`,
  `--jobs`
- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`
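Overrides make one YAML reusable across a sweep. An illustrative single-pool
example (the output path is arbitrary):

```bash
# Same config, larger cluster, fixed seed, separate output directory.
target/release/kvcache-sim run \
  --config configs/glm5-8xb200.yaml \
  --num-instances 64 \
  --seed 7 \
  --output-dir runs/glm5_n64_seed7
```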
## Routing Modes

### Global Bucket Routers

Configured through `cluster.global_router.mode`:

| Mode | What it does |
|------|--------------|
| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` range contains the request. |
| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss; it can intentionally deviate from the strict length bucket. |

### Local Instance Routers

Configured through `cluster.router.mode`. All of these names are accepted by
`run`, and any of them can be passed to `ablate --routers` on single-pool
configs.

| Mode | Aliases | What it does |
|------|---------|--------------|
| `random` | | Uniform random baseline. |
| `round_robin` | `rr` | Deterministic round-robin baseline. |
| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. |
| `least_tokens` | `lt` | Minimizes queued token work. |
| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. |
| `precise` | `precise_aware` | Probes the top-K least-loaded instances for actual cache contents and charges probe latency. |
| `min_pd` | `minpd`, `pd` | Minimizes `P * D` using ongoing load and prefix reuse. |
| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. |
| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. |
| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. |
| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. |
| `cache_score` | `cs` | Exponential score over queue length and miss blocks. |
| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than default `cache_score`. |
| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. |
| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. |
| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. |
| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. |
| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. |

Router tuning knobs in `cluster.router`:

| Field | Default | Used by |
|-------|---------|---------|
| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families |
| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` |
| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` |
| `prefix_k` | `8` | prefix and affinity fingerprinting |
| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` |
| `precise_probe_latency_us` | `50.0` | `precise` |
| `precise_probe_topk` | `4` | `precise` |
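One natural use of the names above is a head-to-head ablation within a single
family, for example isolating cache weighting in the `cache_score` variants on
a single-pool config:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers cache_score,cache_score_strong,cache_score_ttl
```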
## Model And Hardware Configuration

### Model Config

Recommended pattern:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512
```

Notes:

- `config_json` is resolved relative to the YAML file
- explicit YAML fields override values loaded from the model config
- `compute_dtype` selects the compute FLOPS tier
- `weight_dtype` controls model-weight bytes separately from KV-cache bytes
- `dtype_bytes` sizes the KV cache

The architecture loader understands:

- MoE expert counts and active experts
- MLA LoRA ranks and attention dimensions
- DSA sparse-attention parameters
- sliding-window attention
- GQA from KV-head count

### Hardware Presets

Recommended pattern:

```yaml
hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256
```

Available preset families:

- `h100`, `h800`, `h20`, `h20-141g`
- `a100-80gb`, `a100-40gb`
- `b200`, `b300`
- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300`
## Bundled Configs

Representative configs in `configs/`:

| Config | Notes |
|--------|-------|
| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. |
| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. |
| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. |

Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment
points that use `sim.input_length_min` and `sim.input_length_max`.
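The same slicing works from the CLI via the override flags, without editing
any YAML; the boundary below matches the bucketed example earlier, and the
output path is illustrative:

```bash
# Replay only the 0-32k slice of the configured trace.
target/release/kvcache-sim run \
  --config configs/glm5-8xb200.yaml \
  --input-length-min 0 \
  --input-length-max 32768 \
  --output-dir runs/glm5_short_slice
```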
## Trace Inputs

This repository currently contains two trace sources:

- `bailian-traces/`
  - `glm_coder_blksz_512_040915-040917.jsonl`
  - `qwen3_coder_blksz_512_040915-040917.jsonl`
- `qwen-bailian-usagetraces-anon/` submodule
  - public 16-token-block Qwen traces such as
    `qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl`

The simulator expects JSONL records with fields like:

```json
{
  "chat_id": 159,
  "parent_chat_id": 55,
  "timestamp": 61.114,
  "input_length": 521,
  "output_length": 132,
  "type": "text",
  "turn": 2,
  "hash_ids": [1089, 1090, 1091]
}
```

Only prefill-side behavior is modeled; `output_length` is used only for a
decode tail in completion metrics.
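Because bucket behavior hinges on `input_length`, it can help to see how a
trace splits before sizing buckets. A sketch with `jq`, using the short/long
boundary from the bucketed example above:

```bash
# Count records on each side of the 32k boundary.
jq -s 'map(.input_length)
       | {short: (map(select(. <= 32768)) | length),
          long:  (map(select(. >  32768)) | length)}' \
  bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
```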
## Outputs

Each `run` writes a directory under `sim.output_dir`:

| File | Contents |
|------|----------|
| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. |
| `per_request.csv` | Per-request latency and cache stats, including `bucket`, `instance`, and `length_bucket_match`. |
| `instances.csv` | Periodic per-instance samples with `bucket`, `instance`, `queue_len`, and KV usage. |
| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. |

Additional outputs:

- `ablate`: writes `ablation.json`
- `oracle`: writes `oracle.json`
- `ablate --auto-instances`: writes calibration runs under
  `<output_dir>/auto_instances/`

Quick inspection examples:

```bash
jq . runs/glm5_8xb200/summary.json
```

```bash
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
```
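The routing log is worth a glance too; each line carries the fields listed in
the table above:

```bash
# Inspect one routing decision, trimmed to the headline fields.
head -n 1 runs/glm5_8xb200/routing_log.jsonl \
  | jq '{global_mode, mode, chosen_bucket}'
```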
## Oracle Semantics

`oracle` computes three hit-rate references at a chosen cache capacity:

- `unlimited.hit_rate`: absolute ceiling with infinite capacity
- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity
- `lru_finite.hit_rate`: LRU at the same capacity

When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still
feeds the full trace into cache state but only counts requests inside the
selected input-length range. That matches the intended "measure one bucket
inside a mixed workload" interpretation.

The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The
gap from `belady_finite` to `unlimited` is pure capacity headroom.
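Capacity can also be pinned explicitly to sweep out the headroom curve; the
200000-block figure is only an example, and `--out` keeps each sweep point in
its own file:

```bash
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --capacity-blocks 200000 \
  --out runs/oracle_200k.json
```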
## Testing

```bash
cargo test --release
```

The test suite covers config parsing, hardware presets, routing behavior,
bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.