# kvcache-simulator
Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs.
Assumes PD (prefill/decode) disaggregation — only the prefill path is modeled.

## Features

- Architecture-derived roofline compute — auto-derives FLOPs, attention coefficients, and weight-streaming costs from model architecture (MoE, MLA, GQA, DSA, sliding window).
- HuggingFace `config.json` auto-parsing — drop in any HF `config.json` and the simulator extracts layer counts, attention heads, MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
- Built-in GPU hardware presets — H100, H800, H20, A100-80GB, A100-40GB, B200, with tensor-parallel scaling (e.g. `8xb200`).
- Two-tier KV cache hierarchy — L0 (GPU HBM) + L1 (CPU DRAM) with LRU eviction and cross-instance RDMA fetch via a cluster-wide meta-store.
- 11 routing policies — from baselines (random, round-robin) to cache-aware (min_pd, prefix_affinity) for systematic ablation.
- Token-bucket link contention — PCIe and RDMA bandwidth modeled with reservation-based token-bucket queues.
- Oracle analysis — computes theoretical hit-rate ceilings (infinite cache, Belady optimal, LRU) for gap analysis.

## Build

```bash
cargo build --release
# binary: target/release/kvcache-sim
```
Fetch the upstream trace (consumed as a git submodule):
```bash
git submodule update --init --recursive
```

## Usage

### 1. Run a single simulation

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
```

Prints `summary.json` to stdout and writes the full output directory (see Outputs below).

### 2. Compare routers on the same trace (ablation)

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200-hf.yaml \
  --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
  --evict-policies lru \
  --output-dir runs/glm5_ablation
```

Writes `ablation.json` with one row per `router` × `evict_policy` combination.

`ablate` currently supports only `lru` as a valid eviction policy. The aggregated output keeps the online prefill-time metrics (`ttft_mean`/`p50`/`p95`/`p99`) and omits `e2e`.

The previous replay-based `belady` approximation has been removed from the CLI because it was not an exact full-hierarchy Belady algorithm and could produce misleading comparisons against `lru`.

### 3. Compute theoretical hit-rate ceilings (oracle)

```bash
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --per-instance

# Explicit capacity in blocks
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
```

Reports three numbers:

- `unlimited.hit_rate` — absolute ceiling (infinite cache)
- `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity
- `lru_finite.hit_rate` — production LRU at the same capacity

The gap between `lru_finite` and `belady_finite` is the headroom available from a smarter eviction policy. The gap between `belady_finite` and `unlimited` is headroom only reachable by adding capacity.
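
To put numbers on those two gaps, the oracle output can be post-processed with `jq` (a sketch; it assumes the `oracle.json` written via `--out` nests the three results under the field names listed above):

```bash
# Eviction-policy headroom vs. capacity headroom
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --num-instances 64 --out oracle.json

jq '{evict_headroom:    (.belady_finite.hit_rate - .lru_finite.hit_rate),
     capacity_headroom: (.unlimited.hit_rate     - .belady_finite.hit_rate)}' oracle.json
```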

### 4. Validate a config without running

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
```
Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path.

## CLI overrides

These flags work on all subcommands and override the YAML in place, so the same config can be reused across sweeps:
| Flag | Overrides |
|---|---|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |

`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance` and `--out <PATH>`. `ablate` additionally takes `--routers <csv>` and `--evict-policies <csv>` (currently only `lru`).
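
For example, one config can drive a cluster-size sweep entirely through overrides (the output paths below are illustrative):

```bash
# Sweep cluster size without editing the YAML; each run gets its own output dir
for n in 16 32 64; do
  target/release/kvcache-sim run \
    --config configs/glm5-8xb200-hf.yaml \
    --num-instances "$n" \
    --output-dir "runs/glm5_n${n}"
done
```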

## Router modes

Set `cluster.router.mode` in the YAML or list modes in `--routers`:

| Mode | Aliases | What it does |
|---|---|---|
| `random` | | Uniform random. Baseline. |
| `round_robin` | `rr` | Deterministic round-robin. Baseline. |
| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. |
| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. |
| `ttl_aware` | `ttl` | Picks the instance with the longest prefix in the global TTL meta-store. Cache-only. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
| `min_pd` | `minpd`, `pd` | Minimizes P × D (prefill tokens × ongoing requests). Cluster-wide RDMA-aware. |
| `cache_load` | `cl` | Filters to the least-loaded 1/4 of instances, then picks the best cache prefix. |
| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. |
| `estimated_ttft` | `ettft`, `optimal` | Estimates `drain_time + fetch_time` per instance using architecture-aware compute. |
| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. |

### Router parameters

These fields in `cluster.router` tune specific routers:

| Field | Default | Used by | Description |
|---|---|---|---|
| `load_alpha` | 1.0 | `least_loaded` | Weight of `queue_len` vs `kv_blocks_used` |
| `score_alpha` | 1.0 | `cache_score` | Load weight in `2^(alpha*load + beta*miss)` |
| `score_beta` | 0.1 | `cache_score` | Cache-miss weight in `2^(alpha*load + beta*miss)` |
| `prefix_k` | 8 | `prefix_affinity` | Number of leading blocks for the prefix fingerprint |
| `affinity_fan_out` | 0 | `prefix_affinity` | Top-K affinity candidates (0 = auto: n/8, min 2) |
| `precise_probe_latency_us` | 50.0 | `precise` | Simulated per-probe latency (microseconds) |
| `precise_probe_topk` | 4 | `precise` | Number of instances probed |
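
Putting the two tables together, a `cluster.router` block for the affinity router could look like the sketch below (only `cluster.router.mode` and the field names above come from this README; the exact YAML nesting is assumed):

```yaml
cluster:
  router:
    mode: prefix_affinity   # or any mode/alias from the table above
    prefix_k: 8             # leading blocks used for the prefix fingerprint
    affinity_fan_out: 0     # 0 = auto (n/8, min 2)
```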

### Router design spectrum

```text
Cache-only                             Hybrid                           Load-only
(hot-spot risk)                                                         (cache-blind)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
ttl_aware    precise      cache_score  min_pd       prefix_      least_       random
             cache_load                             affinity     loaded
             est_ttft                                            least_tokens
```
prefix_affinity sits in a unique position: it builds proactive cache
locality by consistently routing same-prefix requests to the same
instances (via rendezvous hashing), rather than reactively chasing
existing cache state. This yields the highest L0 hit rates while
maintaining load balance through within-group drain-time-aware selection.

## Model configuration

### HuggingFace config.json (recommended)

Point `model.config_json` at any HF `config.json` to auto-extract the architecture:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2         # required (not in HF schema)
  block_size_tokens: 512 # required (not in HF schema)
```
Auto-detected features:
| Feature | Detection trigger | What it extracts |
|---|---|---|
| MoE | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
| MLA | `kv_lora_rank` present | KV/Q LoRA ranks, qk_rope/nope dims, `v_head_dim` |
| DSA | `first_k_dense_replace` present | Dense window, sparse stride, first dense layers |
| Sliding window | `sliding_window` present | Window size |
| GQA | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention |
Explicit YAML fields always override the auto-detected values.
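
As a small illustration (the override field name is taken from the inline specification below; the value itself is arbitrary):

```yaml
# Parse the HF config, but pin one field explicitly for a what-if run
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2
  block_size_tokens: 512
  num_layers: 40   # explicit YAML value wins over the auto-detected one
```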

### Inline specification

Alternatively, specify architecture fields directly:

```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```

When `hidden_size` is present, the compute model is auto-derived (architecture mode). Without it, you must supply the legacy manual coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
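
A minimal sketch of the legacy form, using only the two coefficient names mentioned above (values are placeholders, and the full set of manual coefficients is not documented here):

```yaml
model:
  name: my-legacy-model             # hypothetical
  flops_per_token_prefill: 2.0e10   # placeholder value
  attn_quadratic_coeff: 1.0e5       # placeholder value
  dtype_bytes: 2
  block_size_tokens: 16
```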

### Bundled model configs

| Model | Path | Architecture |
|---|---|---|
| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |

## Hardware configuration

### Using presets (recommended)

Set `hardware.type` to a preset name — individual fields can still override it:

```yaml
hardware:
  type: 8xb200
  hbm_bytes: 500.0e9 # override KV budget (after model weights)
```
Available presets:
| Preset | FLOPS | HBM | Mem BW | PCIe |
|---|---|---|---|---|
| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 |
| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 |
| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 |
| `b200` | 2.25 PFLOPS | 192 GB | 8.0 TB/s | Gen6 |

Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g. `8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and DRAM are set to sensible per-node defaults.
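
For example, a 4-way tensor-parallel H20 group is just the preset name with a prefix:

```yaml
hardware:
  type: 4xh20   # 4x the h20 FLOPS/HBM/mem-BW; RDMA and DRAM stay at per-node defaults
```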

### Inline specification

```yaml
hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```

## Architecture-aware compute model

The simulator derives a roofline prefill model from the model architecture:

```text
prefill_time(N tokens) = max(compute_time, memory_time)
compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
```

- MoE: only active experts contribute to FLOPs and weight streaming (shared experts are always counted)
- MLA: compressed KV projections reduce attention FLOPs; the KV cache uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- DSA: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`, with the first K layers using full dense attention
- GQA: fewer KV heads reduce both attention compute and KV cache size

## Bundled config files

| Config | Model | Hardware | Instances | Trace |
|---|---|---|---|---|
| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF config.json | 8xH20 preset | 32 | Qwen coder blk16 |

## Outputs

Each run writes a directory under `sim.output_dir`:

| File | Contents |
|---|---|
| `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
| `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample |
| `routing_log.jsonl` | One JSON object per request: all router candidates + chosen instance + reason |

For `ablate`: an extra `ablation.json` with one summary per router. For `oracle`: an `oracle.json` with the three hit-rate analyses.

### Reading results quickly

```bash
# Pretty-print the summary
cat runs/glm5_8xb200_hf/summary.json | jq .

# Compare all routers from an ablation
cat runs/glm5_8xb200_hf/ablation.json | \
  jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'

# Sort by TTFT
cat runs/glm5_8xb200_hf/ablation.json | \
  jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
```
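
The routing log can be summarized the same way (a sketch; it assumes each `routing_log.jsonl` record exposes a top-level `reason` field as described under Outputs — the exact key names may differ):

```bash
# Tally routing decisions by reason
jq -r '.reason' runs/glm5_8xb200_hf/routing_log.jsonl | sort | uniq -c | sort -rn
```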

## Trace format

The simulator reads the Alibaba `qwen-bailian-usagetraces-anon` JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`, `output_length`, and `hash_ids` (block hashes, typically 16 tokens each). Only the input side is used.
Available traces in the submodule:
| Trace | Requests | Description |
|---|---|---|
| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |
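
To sanity-check a trace before pointing `sim.trace_path` at it (the directory prefix depends on where the submodule is checked out; field names follow the schema above):

```bash
# Inspect the first record's schema fields
head -n 1 <submodule-dir>/qwen_coder_blksz_16.jsonl | \
  jq '{chat_id, timestamp, input_length, output_length, num_hash_ids: (.hash_ids | length)}'
```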

## Testing

```bash
cargo test --release
```
28 tests: 27 unit tests (compute model, HF config parsing, hardware presets) + 1 integration smoke test that runs routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.