# kvcache-simulator
Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs.
Assumes PD (prefill/decode) disaggregation — only the prefill path is modeled.

## Features

- Architecture-derived roofline compute — auto-derives FLOPs, attention coefficients, and weight-streaming costs from model architecture (MoE, MLA, GQA, DSA, sliding window).
- HuggingFace `config.json` auto-parsing — drop in any HF `config.json` and the simulator extracts layer counts, attention heads, MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
- Built-in GPU hardware presets — H100, H800, H20, A100-80GB, A100-40GB, B200, with tensor-parallel scaling (e.g. `8xb200`).
- Two-tier KV cache hierarchy — L0 (GPU HBM) + L1 (CPU DRAM) with LRU eviction and cross-instance RDMA fetch via a cluster-wide meta-store.
- 11 routing policies — from baselines (random, round-robin) to cache-aware (min_pd, prefix_affinity) for systematic ablation.
- Token-bucket link contention — PCIe and RDMA bandwidth modeled with reservation-based token-bucket queues.
- Oracle analysis — computes theoretical hit-rate ceilings (infinite cache, Belady optimal, LRU) for gap analysis.

## Build

```bash
cargo build --release
# binary: target/release/kvcache-sim
```
Fetch the upstream trace (consumed as a git submodule):
```bash
git submodule update --init --recursive
```

## Usage

### 1. Run a single simulation

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
```

Prints `summary.json` to stdout and writes the full output directory (see Outputs below).

### 2. Compare routers on the same trace (ablation)

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200-hf.yaml \
  --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
  --evict-policies lru \
  --output-dir runs/glm5_ablation
```

Writes `ablation.json` with one row per `router` × `evict_policy` combination.

`ablate` currently supports only `lru` as a valid eviction policy. The aggregated output keeps the online prefill-time metrics (`ttft_mean`/`p50`/`p95`/`p99`) and omits `e2e`.

The previous replay-based `belady` approximation has been removed from the CLI because it was not an exact full-hierarchy Belady algorithm and could produce misleading comparisons against `lru`.

### 3. Compute theoretical hit-rate ceilings (oracle)

```bash
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --per-instance

# Explicit capacity in blocks
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
```

Reports three numbers:

- `unlimited.hit_rate` — absolute ceiling (infinite cache)
- `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity
- `lru_finite.hit_rate` — production LRU at the same capacity

The gap between `lru_finite` and `belady_finite` is the headroom available from a smarter eviction policy. The gap between `belady_finite` and `unlimited` is headroom only reachable by adding capacity.
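
To put numbers on those two gaps, the oracle output can be post-processed with `jq` (a sketch; it assumes the `oracle.json` written via `--out` nests the three results under the field names listed above):

```bash
# Eviction-policy headroom vs. capacity headroom
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --num-instances 64 --out oracle.json

jq '{evict_headroom:    (.belady_finite.hit_rate - .lru_finite.hit_rate),
     capacity_headroom: (.unlimited.hit_rate     - .belady_finite.hit_rate)}' oracle.json
```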

### 4. Validate a config without running

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
```
Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path.

## CLI overrides

These flags work on all subcommands and override the YAML in place, so the same config can be reused across sweeps:
| Flag | Overrides |
|---|---|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |

`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance` and `--out <PATH>`. `ablate` additionally takes `--routers <csv>` and `--evict-policies <csv>` (currently only `lru`).
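
For example, one config can drive a cluster-size sweep entirely through overrides (the output paths below are illustrative):

```bash
# Sweep cluster size without editing the YAML; each run gets its own output dir
for n in 16 32 64; do
  target/release/kvcache-sim run \
    --config configs/glm5-8xb200-hf.yaml \
    --num-instances "$n" \
    --output-dir "runs/glm5_n${n}"
done
```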

## Router modes

Set `cluster.router.mode` in the YAML or list modes in `--routers`:

| Mode | Aliases | What it does |
|---|---|---|
| `random` | | Uniform random. Baseline. |
| `round_robin` | `rr` | Deterministic round-robin. Baseline. |
| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. |
| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. |
| `ttl_aware` | `ttl` | Picks the instance with the longest prefix in the global TTL meta-store. Cache-only. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
| `min_pd` | `minpd`, `pd` | Minimizes P × D (prefill tokens × ongoing requests). Cluster-wide RDMA-aware. |
| `cache_load` | `cl` | Filters to the least-loaded 1/4 of instances, then picks the best cache prefix. |
| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. |
| `estimated_ttft` | `ettft`, `optimal` | Estimates `drain_time + fetch_time` per instance using architecture-aware compute. |
| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. |

### Router parameters

These fields in `cluster.router` tune specific routers:

| Field | Default | Used by | Description |
|---|---|---|---|
| `load_alpha` | 1.0 | `least_loaded` | Weight of `queue_len` vs `kv_blocks_used` |
| `score_alpha` | 1.0 | `cache_score` | Load weight in `2^(alpha*load + beta*miss)` |
| `score_beta` | 0.1 | `cache_score` | Cache-miss weight in `2^(alpha*load + beta*miss)` |
| `prefix_k` | 8 | `prefix_affinity` | Number of leading blocks for the prefix fingerprint |
| `affinity_fan_out` | 0 | `prefix_affinity` | Top-K affinity candidates (0 = auto: n/8, min 2) |
| `precise_probe_latency_us` | 50.0 | `precise` | Simulated per-probe latency (microseconds) |
| `precise_probe_topk` | 4 | `precise` | Number of instances probed |
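
Putting the two tables together, a `cluster.router` block for the affinity router could look like the sketch below (only `cluster.router.mode` and the field names above come from this README; the exact YAML nesting is assumed):

```yaml
cluster:
  router:
    mode: prefix_affinity   # or any mode/alias from the table above
    prefix_k: 8             # leading blocks used for the prefix fingerprint
    affinity_fan_out: 0     # 0 = auto (n/8, min 2)
```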

### Router design spectrum

```text
Cache-only                             Hybrid                           Load-only
(hot-spot risk)                                                         (cache-blind)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
ttl_aware    precise      cache_score  min_pd       prefix_      least_       random
             cache_load                             affinity     loaded
             est_ttft                                            least_tokens
```
prefix_affinity sits in a unique position: it builds proactive cache
locality by consistently routing same-prefix requests to the same
instances (via rendezvous hashing), rather than reactively chasing
existing cache state. This yields the highest L0 hit rates while
maintaining load balance through within-group drain-time-aware selection.

## Model configuration

### HuggingFace config.json (recommended)

Point `model.config_json` at any HF `config.json` to auto-extract the architecture:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2         # required (not in HF schema)
  block_size_tokens: 512 # required (not in HF schema)
```
Auto-detected features:
| Feature | Detection trigger | What it extracts |
|---|---|---|
| MoE | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
| MLA | `kv_lora_rank` present | KV/Q LoRA ranks, qk_rope/nope dims, `v_head_dim` |
| DSA | `first_k_dense_replace` present | Dense window, sparse stride, first dense layers |
| Sliding window | `sliding_window` present | Window size |
| GQA | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention |
Explicit YAML fields always override the auto-detected values.
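
As a small illustration (the override field name is taken from the inline specification below; the value itself is arbitrary):

```yaml
# Parse the HF config, but pin one field explicitly for a what-if run
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2
  block_size_tokens: 512
  num_layers: 40   # explicit YAML value wins over the auto-detected one
```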

### Inline specification

Alternatively, specify architecture fields directly:

```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```

When `hidden_size` is present, the compute model is auto-derived (architecture mode). Without it, you must supply the legacy manual coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
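
A minimal sketch of the legacy form, using only the two coefficient names mentioned above (values are placeholders, and the full set of manual coefficients is not documented here):

```yaml
model:
  name: my-legacy-model             # hypothetical
  flops_per_token_prefill: 2.0e10   # placeholder value
  attn_quadratic_coeff: 1.0e5       # placeholder value
  dtype_bytes: 2
  block_size_tokens: 16
```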

### Bundled model configs

| Model | Path | Architecture |
|---|---|---|
| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |

## Hardware configuration

### Using presets (recommended)

Set `hardware.type` to a preset name — individual fields can still override it:

```yaml
hardware:
  type: 8xb200
  hbm_bytes: 500.0e9 # override KV budget (after model weights)
```
Available presets:
| Preset | FLOPS | HBM | Mem BW | PCIe |
|---|---|---|---|---|
| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 |
| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 |
| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 |
| `b200` | 2.25 PFLOPS | 192 GB | 8.0 TB/s | Gen6 |

Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g. `8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and DRAM are set to sensible per-node defaults.
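
For example, a 4-way tensor-parallel H20 group is just the preset name with a prefix:

```yaml
hardware:
  type: 4xh20   # 4x the h20 FLOPS/HBM/mem-BW; RDMA and DRAM stay at per-node defaults
```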

### Inline specification

```yaml
hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```

## Architecture-aware compute model

The simulator derives a roofline prefill model from the model architecture:

```text
prefill_time(N tokens) = max(compute_time, memory_time)
compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
```

- MoE: only active experts contribute to FLOPs and weight streaming (shared experts are always counted)
- MLA: compressed KV projections reduce attention FLOPs; the KV cache uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- DSA: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`, with the first K layers using full dense attention
- GQA: fewer KV heads reduce both attention compute and KV cache size

## Bundled config files

| Config | Model | Hardware | Instances | Trace |
|---|---|---|---|---|
| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF config.json | 8xH20 preset | 32 | Qwen coder blk16 |

## Outputs

Each run writes a directory under `sim.output_dir`:

| File | Contents |
|---|---|
| `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
| `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample |
| `routing_log.jsonl` | One JSON object per request: all router candidates + chosen instance + reason |

For `ablate`: an extra `ablation.json` with one summary per router. For `oracle`: an `oracle.json` with the three hit-rate analyses.

### Reading results quickly

```bash
# Pretty-print the summary
cat runs/glm5_8xb200_hf/summary.json | jq .

# Compare all routers from an ablation
cat runs/glm5_8xb200_hf/ablation.json | \
  jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'

# Sort by TTFT
cat runs/glm5_8xb200_hf/ablation.json | \
  jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
```
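
The routing log can be summarized the same way (a sketch; it assumes each `routing_log.jsonl` record exposes a top-level `reason` field as described under Outputs — the exact key names may differ):

```bash
# Tally routing decisions by reason
jq -r '.reason' runs/glm5_8xb200_hf/routing_log.jsonl | sort | uniq -c | sort -rn
```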

## Trace format

The simulator reads the Alibaba `qwen-bailian-usagetraces-anon` JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`, `output_length`, and `hash_ids` (block hashes, typically 16 tokens each). Only the input side is used.
Available traces in the submodule:
| Trace | Requests | Description |
|---|---|---|
| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |
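
To sanity-check a trace before pointing `sim.trace_path` at it (the directory prefix depends on where the submodule is checked out; field names follow the schema above):

```bash
# Inspect the first record's schema fields
head -n 1 <submodule-dir>/qwen_coder_blksz_16.jsonl | \
  jq '{chat_id, timestamp, input_length, output_length, num_hash_ids: (.hash_ids | length)}'
```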

## Testing

```bash
cargo test --release
```
28 tests: 27 unit tests (compute model, HF config parsing, hardware presets) + 1 integration smoke test that runs routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.