From 8d41123418869edbaa99be76de89678e3bff69f6 Mon Sep 17 00:00:00 2001
From: Gahow Wang <gahow.wang@gmail.com>
Date: Tue, 14 Apr 2026 01:21:28 +0800
Subject: [PATCH] Update README with full feature documentation

Cover all 11 routing policies (including new prefix_affinity, cache_load,
cache_score, estimated_ttft, least_tokens), HuggingFace config.json
auto-parsing, GPU hardware presets, architecture-aware compute model
(MoE/MLA/DSA/GQA), router parameter tuning, bundled model configs and
config files, and available traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
 README.md | 266 +++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 222 insertions(+), 44 deletions(-)

diff --git a/README.md b/README.md
index 1dde7d3..b859653 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,26 @@ ablate routing strategies and cache sizing without spinning up any GPUs.
 Assumes **PD (prefill/decode) disaggregation** — only the prefill path is
 modeled.
 
+## Features
+
+- **Architecture-derived roofline compute** — auto-derives FLOPs,
+  attention coefficients, and weight-streaming costs from model
+  architecture (MoE, MLA, GQA, DSA, sliding window).
+- **HuggingFace config.json auto-parsing** — drop in any HF
+  `config.json` and the simulator extracts layer counts, attention heads,
+  MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
+- **Built-in GPU hardware presets** — H100, H800, H20, A100-80GB,
+  A100-40GB, B200 with tensor-parallel scaling (e.g. `8xb200`).
+- **Two-tier KV cache hierarchy** — L0 (GPU HBM) + L1 (CPU DRAM) with
+  LRU eviction and cross-instance RDMA fetch via a cluster-wide
+  meta-store.
+- **11 routing policies** — from baselines (random, round-robin) to
+  cache-aware (min\_pd, prefix\_affinity) for systematic ablation.
+- **Token-bucket link contention** — PCIe and RDMA bandwidth modeled with
+  reservation-based token-bucket queues.
+- **Oracle analysis** — computes theoretical hit-rate ceilings (infinite
+  cache, Belady optimal, LRU) for gap analysis.
+
 ## Build
 
 ```bash
@@ -26,7 +46,7 @@ git submodule update --init --recursive
 ### 1. Run a single simulation
 
 ```bash
-target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml
+target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
 ```
 
 Prints `summary.json` to stdout and writes the full output directory
@@ -36,29 +56,28 @@ Prints `summary.json` to stdout and writes the full output directory
 
 ```bash
 target/release/kvcache-sim ablate \
-    --config configs/qwen2.5-coder-7b-h800.yaml \
-    --num-instances 64 \
-    --output-dir runs/qwen7b_n64 \
-    --routers random,least_loaded,ttl_aware,precise
+    --config configs/glm5-8xb200-hf.yaml \
+    --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
+    --output-dir runs/glm5_ablation
 ```
 
 Writes one subdirectory per router plus a combined
-`runs/qwen7b_n64/ablation.json` with side-by-side summaries.
+`ablation.json` with side-by-side summaries.
 
 ### 3. Compute theoretical hit-rate ceilings (oracle)
 
 ```bash
 # Cluster-aggregate capacity (default)
 target/release/kvcache-sim oracle \
-    --config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64
+    --config configs/glm5-8xb200-hf.yaml --num-instances 64
 
 # A single instance's HBM budget
 target/release/kvcache-sim oracle \
-    --config configs/qwen2.5-coder-7b-h800.yaml --per-instance
+    --config configs/glm5-8xb200-hf.yaml --per-instance
 
-# Explicit capacity in 16-token blocks
+# Explicit capacity in blocks
 target/release/kvcache-sim oracle \
-    --config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000
+    --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
 ```
 
 Reports three numbers:
@@ -74,7 +93,7 @@ only reachable by adding capacity.
 ### 4. Validate a config without running
 
 ```bash
-target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml
+target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
 ```
 
 Parses the YAML, prints derived per-instance block budgets, and dumps
@@ -102,15 +121,179 @@ and `--out <PATH>`. `ablate` additionally takes `--routers <csv>`.
 
 Set `cluster.router.mode` in the YAML or list in `--routers`:
 
-| Mode           | What it does                                                       |
-|----------------|--------------------------------------------------------------------|
-| `random`       | Uniform random. Baseline.                                          |
-| `round_robin`  | Deterministic round-robin. Baseline.                               |
-| `least_loaded` | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind.            |
-| `ttl_aware`    | Picks instance with longest prefix in the global TTL meta store.   |
-| `precise`      | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
+| Mode              | Aliases          | What it does                                                                         |
+|-------------------|------------------|--------------------------------------------------------------------------------------|
+| `random`          |                  | Uniform random. Baseline.                                                            |
+| `round_robin`     | `rr`             | Deterministic round-robin. Baseline.                                                 |
+| `least_loaded`    |                  | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance.                 |
+| `least_tokens`    | `lt`             | `argmin(waiting_tokens)`. Pure load balance by queued compute work.                  |
+| `ttl_aware`       | `ttl`            | Picks instance with longest prefix in the global TTL meta-store. Cache-only.         |
+| `precise`         | `precise_aware`  | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
+| `min_pd`          | `minpd`, `pd`    | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware.        |
+| `cache_load`      | `cl`             | Filters to least-loaded 1/4 instances, then picks best cache prefix.                 |
+| `cache_score`     | `cs`             | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`.                   |
+| `estimated_ttft`  | `ettft`,`optimal`| Estimates `drain_time + fetch_time` per instance using architecture-aware compute.    |
+| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality.             |
 
-Expected hit-rate ordering: `random ≲ least_loaded ≲ ttl_aware ≲ precise`.
+### Router parameters
+
+These fields in `cluster.router` tune specific routers:
+
+| Field                    | Default | Used by          | Description                                          |
+|--------------------------|---------|------------------|------------------------------------------------------|
+| `load_alpha`             | `1.0`   | `least_loaded`   | Weight of queue\_len vs kv\_blocks\_used              |
+| `score_alpha`            | `1.0`   | `cache_score`    | Load weight in `2^(alpha*load + beta*miss)`           |
+| `score_beta`             | `0.1`   | `cache_score`    | Cache-miss weight in `2^(alpha*load + beta*miss)`     |
+| `prefix_k`              | `8`     | `prefix_affinity`| Number of leading blocks for the prefix fingerprint   |
+| `affinity_fan_out`       | `0`     | `prefix_affinity`| Top-K affinity candidates (0 = auto: n/8, min 2)     |
+| `precise_probe_latency_us`| `50.0`| `precise`        | Simulated per-probe latency (microseconds)            |
+| `precise_probe_topk`    | `4`     | `precise`        | Number of instances probed                            |
+
+### Router design spectrum
+
+```
+        Cache-only                  Hybrid                  Load-only
+        (hot-spot risk)                                     (cache-blind)
+  ┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐
+  ttl_aware  precise  cache_score  min_pd     prefix_    least_     random
+                                   cache_load  affinity   loaded
+                                   est_ttft               least_tokens
+```
+
+`prefix_affinity` sits in a unique position: it builds **proactive cache
+locality** by consistently routing same-prefix requests to the same
+instances (via rendezvous hashing), rather than reactively chasing
+existing cache state. This yields the highest L0 hit rates while
+maintaining load balance through within-group drain-time-aware selection.
+
+## Model configuration
+
+### HuggingFace config.json (recommended)
+
+Point `model.config_json` at any HF `config.json` to auto-extract
+architecture:
+
+```yaml
+model:
+  config_json: ../models/GLM-5/config.json
+  dtype_bytes: 2          # required (not in HF schema)
+  block_size_tokens: 512  # required (not in HF schema)
+```
+
+Auto-detected features:
+
+| Feature   | Detection trigger             | What it extracts                             |
+|-----------|-------------------------------|----------------------------------------------|
+| **MoE**   | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
+| **MLA**   | `kv_lora_rank` present        | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim |
+| **DSA**   | `first_k_dense_replace` present| Dense window, sparse stride, first dense layers |
+| **Sliding window** | `sliding_window` present | Window size                                 |
+| **GQA**   | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention  |
+
+Explicit YAML fields always override the auto-detected values.
+
+### Inline specification
+
+Alternatively, specify architecture fields directly:
+
+```yaml
+model:
+  name: qwen2.5-coder-7b
+  num_layers: 28
+  hidden_size: 3584
+  num_attention_heads: 28
+  num_kv_heads: 4
+  head_dim: 128
+  intermediate_size: 18944
+  dtype_bytes: 2
+  block_size_tokens: 16
+```
+
+When `hidden_size` is present, the compute model is auto-derived
+(architecture mode). Without it, you must supply legacy manual
+coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
+
+### Bundled model configs
+
+| Model | Path | Architecture |
+|-------|------|--------------|
+| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
+| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |
+
+## Hardware configuration
+
+### Using presets (recommended)
+
+Set `hardware.type` to a preset name — individual fields can override:
+
+```yaml
+hardware:
+  type: 8xb200
+  hbm_bytes: 500.0e9     # override KV budget (after model weights)
+```
+
+Available presets:
+
+| Preset      | FLOPS      | HBM     | Mem BW     | PCIe |
+|-------------|------------|---------|------------|------|
+| `h100`      | 989 TFLOPS | 80 GB   | 3.35 TB/s  | Gen5 |
+| `h800`      | 989 TFLOPS | 80 GB   | 3.35 TB/s  | Gen5 |
+| `h20`       | 148 TFLOPS | 96 GB   | 4.0 TB/s   | Gen5 |
+| `a100-80gb` | 312 TFLOPS | 80 GB   | 2.0 TB/s   | Gen4 |
+| `a100-40gb` | 312 TFLOPS | 40 GB   | 1.555 TB/s | Gen4 |
+| `b200`      | 2.25 PFLOPS| 192 GB  | 8.0 TB/s   | Gen6 |
+
+Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g.
+`8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and
+DRAM are set to sensible per-node defaults.
+
+### Inline specification
+
+```yaml
+hardware:
+  gpu_flops: 1.80e16
+  gpu_mem_bw: 6.40e13
+  hbm_bytes: 500.0e9
+  dram_bytes: 1.5e12
+  pcie_bw: 128.0e9
+  pcie_latency_us: 4.0
+  rdma_bw: 50.0e9
+  rdma_latency_us: 6.0
+  max_batch_slots: 256
+  prefill_chunk_tokens: 4096
+```
+
+## Architecture-aware compute model
+
+The simulator derives a **roofline prefill model** from model
+architecture:
+
+```
+prefill_time(N tokens) = max(compute_time, memory_time)
+
+compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
+memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
+```
+
+- **MoE**: only active experts contribute to FLOPs and weight streaming
+  (shared experts always counted)
+- **MLA**: compressed KV projections reduce attention FLOPs; KV cache
+  uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
+- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`,
+  with the first K layers using full dense attention
+- **GQA**: fewer KV heads reduce both attention compute and KV cache size
+
+## Bundled config files
+
+| Config | Model | Hardware | Instances | Trace |
+|--------|-------|----------|-----------|-------|
+| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
+| `glm5-8xb200-blk512.yaml` | GLM-5 inline | 8xB200 inline | 64 | GLM coder blk512 |
+| `glm5-8xb200.yaml` | GLM-5 inline | 8xB200 inline | 8 | GLM coder blk512 |
+| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 |
+| `qwen2.5-coder-7b-h800.yaml` | Qwen2.5-7B inline | H800 inline | 16 | Qwen coder blk16 |
+| `qwen2.5-coder-7b-preset.yaml` | Qwen2.5-7B inline | H800 preset | 16 | Qwen coder blk16 |
+| `qwen2.5-coder-32b-h800.yaml` | Qwen2.5-32B inline | H800 inline | 16 | Qwen coder blk16 |
 
 ## Outputs
 
@@ -123,46 +306,40 @@ Each run writes a directory under `sim.output_dir`:
 | `instances.csv`      | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample      |
 | `routing_log.jsonl`  | One JSON per request: all router candidates + chosen instance + reason    |
 
-For `ablate`: an extra `ablation.json` with one summary per router.  
+For `ablate`: an extra `ablation.json` with one summary per router.
 For `oracle`: an `oracle.json` with the three hit-rate analyses.
 
 ### Reading results quickly
 
 ```bash
 # Pretty-print the summary
-cat runs/qwen7b/summary.json | jq .
+cat runs/glm5_8xb200_hf/summary.json | jq .
 
 # Compare all routers from an ablation
-cat runs/qwen7b_n64/ablation.json | jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}'
+cat runs/glm5_8xb200_hf/ablation.json | \
+  jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'
 
-# Hit-rate ceilings vs LRU at the same capacity
-cat runs/qwen7b/oracle.json | jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}'
+# Sort by TTFT
+cat runs/glm5_8xb200_hf/ablation.json | \
+  jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
 ```
 
-## Config
-
-A config is a single YAML file with four sections. A working example
-lives at
-[`configs/qwen2.5-coder-7b-h800.yaml`](configs/qwen2.5-coder-7b-h800.yaml);
-copy and edit for other models/hardware.
-
-```yaml
-model:      # shape + prefill roofline coefficients
-hardware:   # per-instance GPU/PCIe/RDMA capabilities + batch knobs
-cluster:    # num_instances, meta_store TTL, router mode
-sim:        # trace_path, max_requests, output_dir, seed
-```
-
-Only prefill-side model coefficients are used; any decode fields in
-legacy YAMLs are accepted and ignored.
-
 ## Trace format
 
 The simulator reads the Alibaba
 [`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
 JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
-`output_length`, and `hash_ids` (16-token block hashes). Only the
-input side is used.
+`output_length`, and `hash_ids` (block hashes, typically 16 tokens each).
+Only the input side is used.
+
+Available traces in the submodule:
+
+| Trace | Requests | Description |
+|-------|----------|-------------|
+| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
+| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
+| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
+| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |
 
 ## Testing
 
@@ -170,5 +347,6 @@ input side is used.
 cargo test --release
 ```
 
-16 tests: 15 unit + 1 smoke that runs all four routers on a synthetic
+28 tests: 27 unit tests (compute model, HF config parsing, hardware
+presets) + 1 integration smoke test that runs routers on a synthetic
 shared-prefix trace and asserts the expected hit-rate ordering.