Discrete-event simulator for evaluating KV cache-aware routing policies in prefill-disaggregated LLM serving clusters. Models a two-tier KV cache hierarchy (L0 GPU HBM + L1 CPU DRAM) with RDMA/PCIe link contention, architecture-derived roofline compute (MoE, MLA, DSA), and a cluster-wide meta-store for prefix-aware routing decisions. Includes 11 routing policies (random, round_robin, least_loaded, least_tokens, ttl_aware, precise, min_pd, cache_load, cache_score, estimated_ttft, prefix_affinity), HuggingFace config.json auto-parsing, built-in GPU hardware presets (H100/H800/H20/A100/B200), and ablation tooling for systematic policy comparison across real Alibaba serving traces. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kvcache-simulator
Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs.
The simulator assumes PD (prefill/decode) disaggregation: only the prefill path is modeled.
Build
cargo build --release
# binary: target/release/kvcache-sim
Fetch the upstream trace (consumed as a git submodule):
git submodule update --init --recursive
Usage
1. Run a single simulation
target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml
Prints summary.json to stdout and writes the full output directory
(see Outputs below).
2. Compare routers on the same trace (ablation)
target/release/kvcache-sim ablate \
--config configs/qwen2.5-coder-7b-h800.yaml \
--num-instances 64 \
--output-dir runs/qwen7b_n64 \
--routers random,least_loaded,ttl_aware,precise
Writes one subdirectory per router plus a combined
runs/qwen7b_n64/ablation.json with side-by-side summaries.
3. Compute theoretical hit-rate ceilings (oracle)
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
--config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64
# A single instance's HBM budget
target/release/kvcache-sim oracle \
--config configs/qwen2.5-coder-7b-h800.yaml --per-instance
# Explicit capacity in 16-token blocks
target/release/kvcache-sim oracle \
--config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000
Reports three numbers:
- unlimited.hit_rate: absolute ceiling (infinite cache)
- belady_finite.hit_rate: optimal-eviction ceiling at the given capacity
- lru_finite.hit_rate: production LRU at the same capacity
Gap between lru_finite and belady_finite = headroom from a smarter
eviction policy. Gap between belady_finite and unlimited = headroom
only reachable by adding capacity.
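To make the three numbers concrete, here is a minimal Python sketch that computes them over a flat sequence of block accesses. It is an illustration, not the simulator's oracle code: the function names are hypothetical, the Belady variant is the classic demand-fetch one (every miss is inserted, no bypass), and any prefix structure the real oracle may exploit is ignored.

```python
from collections import OrderedDict

def unlimited_hit_rate(accesses):
    """Infinite cache: a block hits whenever it has been seen before."""
    seen, hits = set(), 0
    for block in accesses:
        hits += block in seen
        seen.add(block)
    return hits / len(accesses)

def lru_hit_rate(accesses, capacity):
    """Finite cache with least-recently-used eviction."""
    cache, hits = OrderedDict(), 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits / len(accesses)

def belady_hit_rate(accesses, capacity):
    """Finite cache that evicts the block whose next use is farthest away."""
    n = len(accesses)
    next_use = [float("inf")] * n
    later = {}
    for i in range(n - 1, -1, -1):         # scan backwards to find next uses
        next_use[i] = later.get(accesses[i], float("inf"))
        later[accesses[i]] = i
    cache, hits = {}, 0                    # block -> index of its next use
    for i, block in enumerate(accesses):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[block] = next_use[i]
    return hits / len(accesses)

# Toy access sequence (made up for illustration), capacity of 3 blocks.
ACCESSES = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(lru_hit_rate(ACCESSES, 3), belady_hit_rate(ACCESSES, 3), unlimited_hit_rate(ACCESSES))
```

On this toy sequence the LRU rate is below the Belady rate, which is below the unlimited rate, which is the ordering the two gaps above rely on.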
4. Validate a config without running
target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml
Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path.
CLI overrides
These flags work on all subcommands and override the YAML in place, so the same config can be reused across sweeps:
| Flag | Overrides |
|---|---|
| --num-instances <N> | cluster.num_instances |
| --max-requests <N> | sim.max_requests |
| --trace <PATH> | sim.trace_path |
| --output-dir <PATH> | sim.output_dir |
| --seed <N> | sim.seed |
| --precise-topk <N> | cluster.router.precise_probe_topk |
| --ttl-seconds <S> | cluster.meta_store.ttl_seconds |
oracle additionally takes --capacity-blocks <N> / --per-instance
and --out <PATH>. ablate additionally takes --routers <csv>.
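The flag-to-field mapping can be pictured as writing a value at a dotted path in the parsed config. The Python sketch below only illustrates that idea: `FLAG_TO_KEY` copies the mapping from the table, while `apply_overrides` and the sample config values are hypothetical and say nothing about how the Rust binary implements overrides.

```python
import copy

# Mapping taken from the CLI overrides table.
FLAG_TO_KEY = {
    "--num-instances": "cluster.num_instances",
    "--max-requests": "sim.max_requests",
    "--trace": "sim.trace_path",
    "--output-dir": "sim.output_dir",
    "--seed": "sim.seed",
    "--precise-topk": "cluster.router.precise_probe_topk",
    "--ttl-seconds": "cluster.meta_store.ttl_seconds",
}

def apply_overrides(config, overrides):
    """Return a copy of `config` with each flag's value written at its dotted key."""
    merged = copy.deepcopy(config)         # leave the parsed YAML untouched
    for flag, value in overrides.items():
        *parents, leaf = FLAG_TO_KEY[flag].split(".")
        node = merged
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections on the way
        node[leaf] = value
    return merged

# Hypothetical parsed config; the values are placeholders.
base = {"cluster": {"num_instances": 8, "router": {"mode": "precise"}}, "sim": {"seed": 0}}
swept = apply_overrides(base, {"--num-instances": 64, "--precise-topk": 4})
print(swept["cluster"])
```

Because the base config is copied rather than mutated, one config can back a whole sweep, which matches the reuse described above.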
Router modes
Set cluster.router.mode in the YAML or list in --routers:
| Mode | What it does |
|---|---|
| random | Uniform random. Baseline. |
| round_robin | Deterministic round-robin. Baseline. |
| least_loaded | argmin(kv_blocks_used + alpha * queue_len). KV-blind. |
| ttl_aware | Picks instance with longest prefix in the global TTL meta store. |
| precise | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
Expected hit-rate ordering: random ≲ least_loaded ≲ ttl_aware ≲ precise.
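The three non-baseline modes can be sketched in a few lines of Python. This is a simplified reading of the table, not the simulator's code: the `Instance` type, the shape of the meta store (instance name to block hash to last-seen timestamp), and the tie-breaking are all assumptions, and the probe latency that precise charges into TTFT is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:                       # hypothetical stand-in for a prefill instance
    name: str
    kv_blocks_used: int
    queue_len: int
    cached_blocks: set = field(default_factory=set)

def load_score(inst, alpha):
    return inst.kv_blocks_used + alpha * inst.queue_len

def least_loaded(instances, alpha=1.0):
    """KV-blind: pick the instance with the smallest load score."""
    return min(instances, key=lambda inst: load_score(inst, alpha))

def prefix_len(hash_ids, is_cached):
    """Number of leading blocks of the request for which is_cached(block) holds."""
    n = 0
    for block in hash_ids:
        if not is_cached(block):
            break
        n += 1
    return n

def ttl_aware(meta_store, hash_ids, now, ttl_seconds):
    """Pick the instance name with the longest unexpired prefix in the meta store."""
    def live(entries):
        return lambda b: b in entries and now - entries[b] <= ttl_seconds
    return max(meta_store, key=lambda name: prefix_len(hash_ids, live(meta_store[name])))

def precise(instances, hash_ids, topk, alpha=1.0):
    """Probe the top-K least-loaded instances and pick the longest real prefix."""
    candidates = sorted(instances, key=lambda inst: load_score(inst, alpha))[:topk]
    return max(candidates, key=lambda inst: prefix_len(hash_ids, inst.cached_blocks.__contains__))

# Made-up cluster state for illustration.
CLUSTER = [
    Instance("a", kv_blocks_used=100, queue_len=2, cached_blocks={1, 2, 3}),
    Instance("b", kv_blocks_used=50, queue_len=0, cached_blocks={1}),
    Instance("c", kv_blocks_used=500, queue_len=10, cached_blocks={1, 2, 3, 4}),
]
REQUEST = [1, 2, 3, 4]
print(least_loaded(CLUSTER, alpha=10).name, precise(CLUSTER, REQUEST, topk=2, alpha=10).name)
```

With `topk=2` the precise router never looks at instance `c`, even though it holds the longest prefix. That is the trade-off `--precise-topk` controls: more probes can find better hits but add probe latency.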
Outputs
Each run writes a directory under sim.output_dir:
| File | Contents |
|---|---|
| summary.json | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| per_request.csv | req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s |
| instances.csv | t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy (one row per sample) |
| routing_log.jsonl | One JSON per request: all router candidates + chosen instance + reason |
For ablate: an extra ablation.json with one summary per router.
For oracle: an oracle.json with the three hit-rate analyses.
Reading results quickly
# Pretty-print the summary
jq . runs/qwen7b/summary.json
# Compare all routers from an ablation
jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}' runs/qwen7b_n64/ablation.json
# Hit-rate ceilings vs LRU at the same capacity
jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}' runs/qwen7b/oracle.json
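For questions that summary.json does not answer, per_request.csv can be aggregated directly. The Python sketch below assumes that l0_hit, l1_hit, remote_hit, and miss are per-request block counts that sum to total_blocks; the README lists the columns but not their units, so check that against a real run. The sample rows are invented for illustration.

```python
import csv
import io
import statistics

# Invented sample rows using the per_request.csv column order.
SAMPLE = """\
req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s
0,0.0,0.120,0.120,3,10,6,2,0,2,0,4096,0.0
1,0.5,0.400,0.400,7,20,0,5,5,10,8192,2048,0.001
2,1.0,0.050,0.050,3,10,10,0,0,0,0,0,0.0
"""

TIERS = ("l0_hit", "l1_hit", "remote_hit", "miss")

def summarize(csv_file):
    """Block-weighted share per tier plus the median TTFT over all requests."""
    rows = list(csv.DictReader(csv_file))
    total_blocks = sum(int(r["total_blocks"]) for r in rows)
    shares = {t: sum(int(r[t]) for r in rows) / total_blocks for t in TIERS}
    ttft_median = statistics.median(float(r["ttft"]) for r in rows)
    return {"requests": len(rows), "ttft_median": ttft_median, **shares}

print(summarize(io.StringIO(SAMPLE)))
```

To run it on real output, replace `io.StringIO(SAMPLE)` with `open("runs/qwen7b/per_request.csv", newline="")`, adjusting the path to your sim.output_dir.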
Config
A config is a single YAML file with four sections. A working example lives at configs/qwen2.5-coder-7b-h800.yaml; copy and edit it for other models/hardware.
model: # shape + prefill roofline coefficients
hardware: # per-instance GPU/PCIe/RDMA capabilities + batch knobs
cluster: # num_instances, meta_store TTL, router mode
sim: # trace_path, max_requests, output_dir, seed
Only prefill-side model coefficients are used; any decode fields in legacy YAMLs are accepted and ignored.
Trace format
The simulator reads the Alibaba qwen-bailian-usagetraces-anon JSONL schema. Each record has chat_id, timestamp, input_length, output_length, and hash_ids (16-token block hashes). Only the input side is used.
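As a rough illustration of why hash_ids matter for routing, the Python sketch below parses two records and counts how many leading blocks they share, which is the part of the second prompt a warm cache could skip. The field names come from the schema above; the sample values, the field types, and the helper names are invented for the example.

```python
import json

BLOCK_TOKENS = 16  # each hash_id covers one 16-token block

# Invented records in the documented schema, one JSON object per line.
SAMPLE_TRACE = """\
{"chat_id": 1, "timestamp": 0.0, "input_length": 48, "output_length": 12, "hash_ids": [11, 12, 13]}
{"chat_id": 2, "timestamp": 1.5, "input_length": 60, "output_length": 30, "hash_ids": [11, 12, 99, 100]}
"""

def load_records(lines):
    """Parse JSONL, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def shared_prefix_blocks(a, b):
    """Count leading block hashes that two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

records = load_records(SAMPLE_TRACE.splitlines())
shared = shared_prefix_blocks(records[0]["hash_ids"], records[1]["hash_ids"])
reusable_tokens = min(shared * BLOCK_TOKENS, records[1]["input_length"])
print(shared, reusable_tokens)
```

Only a matching prefix counts: once two requests diverge at a block, any later blocks that happen to match are not counted as reusable in this sketch.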
Testing
cargo test --release
16 tests: 15 unit + 1 smoke that runs all four routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.