KVCache simulator for LLM serving cluster routing research

Discrete-event simulator for evaluating KV cache-aware routing policies in prefill-disaggregated LLM serving clusters. Models a two-tier KV cache hierarchy (L0 GPU HBM + L1 CPU DRAM) with RDMA/PCIe link contention, architecture-derived roofline compute (MoE, MLA, DSA), and a cluster-wide meta-store for prefix-aware routing decisions. Includes 11 routing policies (random, round_robin, least_loaded, least_tokens, ttl_aware, precise, min_pd, cache_load, cache_score, estimated_ttft, prefix_affinity), HuggingFace config.json auto-parsing, built-in GPU hardware presets (H100/H800/H20/A100/B200), and ablation tooling for systematic policy comparison across real Alibaba serving traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 01:16:02 +08:00
commit ec73a95e05
52 changed files with 6005 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,174 @@
+# kvcache-simulator
+
+Discrete-event simulator for cluster-level LLM **prefill** serving with a
+two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing.
+Replays real production traces against a synthetic cluster so you can
+ablate routing strategies and cache sizing without spinning up any GPUs.
+
+Assumes **PD (prefill/decode) disaggregation** — only the prefill path is
+modeled.
+
+## Build
+
+```bash
+cargo build --release
+# binary: target/release/kvcache-sim
+```
+
+Fetch the upstream trace (consumed as a git submodule):
+
+```bash
+git submodule update --init --recursive
+```
+
+## Usage
+
+### 1. Run a single simulation
+
+```bash
+target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml
+```
+
+Prints the contents of `summary.json` to stdout and writes the full
+output directory (see [Outputs](#outputs) below).
+
+### 2. Compare routers on the same trace (ablation)
+
+```bash
+target/release/kvcache-sim ablate \
+    --config configs/qwen2.5-coder-7b-h800.yaml \
+    --num-instances 64 \
+    --output-dir runs/qwen7b_n64 \
+    --routers random,least_loaded,ttl_aware,precise
+```
+
+Writes one subdirectory per router plus a combined
+`runs/qwen7b_n64/ablation.json` with side-by-side summaries.
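
A sweep over cluster sizes can be scripted around the `ablate` subcommand. The sketch below is a hypothetical Python helper, not part of the repository: it only builds the argument lists, using the flags shown in this README, and the `n<N>` output-directory naming is an assumption made for illustration.

```python
from pathlib import Path

BINARY = "target/release/kvcache-sim"  # binary path from the Build section


def ablate_cmd(config: str, num_instances: int, routers: list[str], out_root: str) -> list[str]:
    """Build the argv for one `ablate` run using only flags shown in this README."""
    out_dir = Path(out_root) / f"n{num_instances}"  # hypothetical naming scheme
    return [
        BINARY, "ablate",
        "--config", config,
        "--num-instances", str(num_instances),
        "--output-dir", out_dir.as_posix(),
        "--routers", ",".join(routers),
    ]


commands = [
    ablate_cmd(
        "configs/qwen2.5-coder-7b-h800.yaml",
        n,
        ["random", "least_loaded", "ttl_aware", "precise"],
        "runs/sweep",
    )
    for n in (16, 32, 64)
]

for cmd in commands:
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(" ".join(cmd))
```

Building the argv as a list rather than a shell string avoids quoting problems if paths contain spaces.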
+
+### 3. Compute theoretical hit-rate ceilings (oracle)
+
+```bash
+# Cluster-aggregate capacity (default)
+target/release/kvcache-sim oracle \
+    --config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64
+
+# A single instance's HBM budget
+target/release/kvcache-sim oracle \
+    --config configs/qwen2.5-coder-7b-h800.yaml --per-instance
+
+# Explicit capacity in 16-token blocks
+target/release/kvcache-sim oracle \
+    --config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000
+```
+
+Reports three numbers:
+
+- `unlimited.hit_rate` — absolute ceiling (infinite cache)
+- `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity
+- `lru_finite.hit_rate` — production LRU at the same capacity
+
+The gap between `lru_finite` and `belady_finite` is the headroom a smarter
+eviction policy could recover. The gap between `belady_finite` and
+`unlimited` is headroom reachable only by adding capacity.
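
To make the three ceilings concrete, here is a toy Python sketch of the three policies on a short block-access sequence. It is an illustration under simplifying assumptions, not the simulator's Rust implementation: every access is a single block, the cache is one flat pool, and Belady is the demand-fetch variant that always inserts on a miss.

```python
from collections import OrderedDict


def unlimited_hits(trace):
    """Hits with an infinite cache: every access after a block's first one."""
    seen, hits = set(), 0
    for block in trace:
        hits += block in seen
        seen.add(block)
    return hits


def lru_hits(trace, capacity):
    """Hits under least-recently-used eviction."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


def belady_hits(trace, capacity):
    """Hits under Belady's MIN: evict the block whose next use is farthest away."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use, upcoming = [float("inf")] * len(trace), {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = upcoming.get(trace[i], float("inf"))
        upcoming[trace[i]] = i

    cache, hits = {}, 0  # block -> index of its next use
    for i, block in enumerate(trace):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            del cache[max(cache, key=cache.get)]
        cache[block] = next_use[i]
    return hits


trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
n = len(trace)
print({
    "lru_finite": lru_hits(trace, 3) / n,
    "belady_finite": belady_hits(trace, 3) / n,
    "unlimited": unlimited_hits(trace) / n,
})
```

On any trace the three values are ordered `lru_finite <= belady_finite <= unlimited`, which is why the two gaps are both non-negative.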
+
+### 4. Validate a config without running
+
+```bash
+target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml
+```
+
+Parses the YAML, prints derived per-instance block budgets, and dumps
+the first 5 trace records so you can sanity-check the path.
+
+## CLI overrides
+
+These flags work on **all** subcommands and override the YAML in place,
+so the same config can be reused across sweeps:
+
+| Flag                     | Overrides                                 |
+|--------------------------|-------------------------------------------|
+| `--num-instances <N>`    | `cluster.num_instances`                   |
+| `--max-requests <N>`     | `sim.max_requests`                        |
+| `--trace <PATH>`         | `sim.trace_path`                          |
+| `--output-dir <PATH>`    | `sim.output_dir`                          |
+| `--seed <N>`             | `sim.seed`                                |
+| `--precise-topk <N>`     | `cluster.router.precise_probe_topk`       |
+| `--ttl-seconds <S>`      | `cluster.meta_store.ttl_seconds`          |
+
+`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance`
+and `--out <PATH>`. `ablate` additionally takes `--routers <csv>`.
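
The right-hand column of the table is a dotted path into the YAML. As a rough illustration of what "override in place" means, the Python sketch below applies such a path to a nested dict. The `apply_override` helper and the sample config values are hypothetical; the simulator itself does this in Rust and may differ in detail.

```python
def apply_override(config: dict, dotted_key: str, value) -> None:
    """Set a nested key such as 'cluster.router.precise_probe_topk' in place."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # create missing sections on the way down
    node[leaf] = value


# Flag-to-key mapping copied from the table above.
FLAG_TO_KEY = {
    "--num-instances": "cluster.num_instances",
    "--max-requests": "sim.max_requests",
    "--trace": "sim.trace_path",
    "--output-dir": "sim.output_dir",
    "--seed": "sim.seed",
    "--precise-topk": "cluster.router.precise_probe_topk",
    "--ttl-seconds": "cluster.meta_store.ttl_seconds",
}

# Illustrative config values, not taken from the repository.
config = {
    "cluster": {"num_instances": 8, "router": {"mode": "random"}},
    "sim": {"seed": 0},
}
apply_override(config, FLAG_TO_KEY["--num-instances"], 64)
apply_override(config, FLAG_TO_KEY["--ttl-seconds"], 30.0)
print(config)
```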
+
+## Router modes
+
+Set `cluster.router.mode` in the YAML, or list one or more modes in `--routers`:
+
+| Mode           | What it does                                                       |
+|----------------|--------------------------------------------------------------------|
+| `random`       | Uniform random. Baseline.                                          |
+| `round_robin`  | Deterministic round-robin. Baseline.                               |
+| `least_loaded` | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind.            |
+| `ttl_aware`    | Picks the instance with the longest matching prefix in the global TTL meta store. |
+| `precise`      | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
+
+Expected hit-rate ordering: `random ≲ least_loaded ≲ ttl_aware ≲ precise`.
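
The three non-trivial modes can be sketched in a few lines of Python. This is a toy model written from the table descriptions only: the `Instance` fields, the meta-store layout (`name -> (blocks, written_at)`), one cached prefix per instance, and returning `None` when no unexpired entry matches are all assumptions, not the simulator's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    kv_blocks_used: int
    queue_len: int
    cached: list = field(default_factory=list)  # block hashes of one cached prefix


def load(inst, alpha):
    return inst.kv_blocks_used + alpha * inst.queue_len


def least_loaded(instances, alpha=1.0):
    """argmin(kv_blocks_used + alpha * queue_len); ignores cache contents."""
    return min(instances, key=lambda inst: load(inst, alpha))


def prefix_len(request_blocks, cached_blocks):
    """Number of leading blocks shared by the request and a cached prefix."""
    n = 0
    for a, b in zip(request_blocks, cached_blocks):
        if a != b:
            break
        n += 1
    return n


def ttl_aware(request_blocks, meta_store, now, ttl_seconds):
    """Pick the instance whose unexpired meta-store entry matches the longest prefix."""
    best, best_len = None, 0
    for name, (blocks, written_at) in meta_store.items():
        if now - written_at > ttl_seconds:
            continue  # expired entries are ignored
        n = prefix_len(request_blocks, blocks)
        if n > best_len:
            best, best_len = name, n
    return best, best_len


def precise(request_blocks, instances, topk, alpha=1.0):
    """Probe the real caches of the top-K least-loaded instances."""
    candidates = sorted(instances, key=lambda inst: load(inst, alpha))[:topk]
    choice = max(candidates, key=lambda inst: prefix_len(request_blocks, inst.cached))
    return choice, len(candidates)  # probe count stands in for the latency charged to TTFT


instances = [
    Instance("a", 100, 2, [1, 2, 3]),
    Instance("b", 50, 1, [1]),
    Instance("c", 400, 9, [1, 2, 3, 4]),
]
request = [1, 2, 3, 4, 5]
meta_store = {"a": ([1, 2, 3], 0.0), "c": ([1, 2, 3, 4], 0.0)}

print(least_loaded(instances).name)
print(ttl_aware(request, meta_store, now=10.0, ttl_seconds=60.0))
print(precise(request, instances, topk=2)[0].name)
```

In this toy setup `precise` never looks at instance `c` because it falls outside the top-2 by load, which illustrates the trade-off between probe cost and hit rate that `--precise-topk` controls.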
+
+## Outputs
+
+Each run writes a directory under `sim.output_dir`:
+
+| File                 | Contents                                                                   |
+|----------------------|----------------------------------------------------------------------------|
+| `summary.json`       | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
+| `per_request.csv`    | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
+| `instances.csv`      | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample      |
+| `routing_log.jsonl`  | One JSON per request: all router candidates + chosen instance + reason    |
+
+For `ablate`: an extra `ablation.json` with one summary per router.  
+For `oracle`: an `oracle.json` with the three hit-rate analyses.
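
As a sketch of how `per_request.csv` might be post-processed, the Python snippet below computes block-weighted hit rates per tier. It assumes that `l0_hit`, `l1_hit`, `remote_hit`, and `miss` are per-request block counts that sum to `total_blocks`; the README does not state this explicitly, and the two sample rows are made up.

```python
import csv
import io

# Two made-up rows using the per_request.csv header from the table above.
SAMPLE = """\
req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s
0,0.00,0.120,0.120,3,10,6,2,0,2,0,4096,0.0
1,0.05,0.300,0.300,7,20,0,5,5,10,8192,2048,0.001
"""

TIER_COLUMNS = ("l0_hit", "l1_hit", "remote_hit", "miss")


def tier_hit_rates(rows):
    """Block-weighted share per tier: sum(tier blocks) / sum(total_blocks)."""
    total = sum(int(r["total_blocks"]) for r in rows)
    return {col: sum(int(r[col]) for r in rows) / total for col in TIER_COLUMNS}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
rates = tier_hit_rates(rows)
print(rates)
```

For a real run, replace `io.StringIO(SAMPLE)` with `open(path, newline="")`, where `path` points at the `per_request.csv` inside the run's output directory.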
+
+### Reading results quickly
+
+```bash
+# Pretty-print the summary
+jq . runs/qwen7b/summary.json
+
+# Compare all routers from an ablation
+jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}' runs/qwen7b_n64/ablation.json
+
+# Hit-rate ceilings vs LRU at the same capacity
+jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}' runs/qwen7b/oracle.json
+```
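
The same `oracle.json` fields can be turned into the two headroom gaps described in the oracle section. The Python sketch below is a hypothetical post-processing helper; the key names come from the `jq` filter above, and the numbers are illustrative, not measured results.

```python
import json


def headroom(oracle: dict) -> dict:
    """Split the distance from LRU to the infinite-cache ceiling into two gaps."""
    unlimited = oracle["unlimited"]["hit_rate"]
    belady = oracle["belady_finite"]["hit_rate"]
    lru = oracle["lru_finite"]["hit_rate"]
    return {
        "eviction_headroom": belady - lru,        # recoverable by smarter eviction
        "capacity_headroom": unlimited - belady,  # recoverable only by adding capacity
    }


# Illustrative numbers only; for a real run use json.load on the oracle.json file.
example = json.loads(
    '{"unlimited": {"hit_rate": 0.62}, '
    '"belady_finite": {"hit_rate": 0.55}, '
    '"lru_finite": {"hit_rate": 0.47}}'
)
print(headroom(example))
```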
+
+## Config
+
+A config is a single YAML file with four sections. A working example
+lives at
+[`configs/qwen2.5-coder-7b-h800.yaml`](configs/qwen2.5-coder-7b-h800.yaml);
+copy and edit for other models/hardware.
+
+```yaml
+model:      # shape + prefill roofline coefficients
+hardware:   # per-instance GPU/PCIe/RDMA capabilities + batch knobs
+cluster:    # num_instances, meta_store TTL, router mode
+sim:        # trace_path, max_requests, output_dir, seed
+```
+
+Only prefill-side model coefficients are used; any decode fields in
+legacy YAMLs are accepted and ignored.
+
+## Trace format
+
+The simulator reads the Alibaba
+[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
+JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
+`output_length`, and `hash_ids` (16-token block hashes). Only the
+input side is used.
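
A minimal Python sketch of reading this schema and measuring prefix reuse follows. The two records are made up, the field types (numeric `chat_id` and `timestamp`) are assumptions, and "reuse" here means leading blocks seen in any earlier request, which corresponds to an infinite shared cache rather than to any particular router.

```python
import json

# Two made-up records with the fields named above (values are illustrative).
SAMPLE_JSONL = """\
{"chat_id": 1, "timestamp": 0.0, "input_length": 48, "output_length": 12, "hash_ids": [11, 12, 13]}
{"chat_id": 2, "timestamp": 1.5, "input_length": 64, "output_length": 30, "hash_ids": [11, 12, 27, 28]}
"""

BLOCK_TOKENS = 16  # block size stated in this README


def load_requests(text):
    """Parse JSONL into (timestamp, hash_ids) pairs sorted by arrival time."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    pairs = [(r["timestamp"], r["hash_ids"]) for r in records]  # output side is dropped
    return sorted(pairs, key=lambda pair: pair[0])


def reusable_prefix_blocks(hash_ids, seen):
    """Count leading blocks already in `seen`, stopping at the first new block."""
    n = 0
    for h in hash_ids:
        if h not in seen:
            break
        n += 1
    return n


seen = set()
for ts, blocks in load_requests(SAMPLE_JSONL):
    hit = reusable_prefix_blocks(blocks, seen)
    print(f"t={ts}: {hit}/{len(blocks)} blocks ({hit * BLOCK_TOKENS} tokens) reusable")
    seen.update(blocks)
```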
+
+## Testing
+
+```bash
+cargo test --release
+```
+
+16 tests: 15 unit tests plus 1 smoke test that runs all four routers on a
+synthetic shared-prefix trace and asserts the expected hit-rate ordering.