# kvcache-simulator

Discrete-event simulator for cluster-level LLM **prefill** serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs. Assumes **PD (prefill/decode) disaggregation** — only the prefill path is modeled.

## Features

- **Architecture-derived roofline compute** — auto-derives FLOPs, attention coefficients, and weight-streaming costs from the model architecture (MoE, MLA, GQA, DSA, sliding window).
- **HuggingFace config.json auto-parsing** — drop in any HF `config.json` and the simulator extracts layer counts, attention heads, MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
- **Built-in GPU hardware presets** — H100, H800, H20, A100-80GB, A100-40GB, B200, with tensor-parallel scaling (e.g. `8xb200`).
- **Two-tier KV cache hierarchy** — L0 (GPU HBM) + L1 (CPU DRAM) with LRU eviction and cross-instance RDMA fetch via a cluster-wide meta-store.
- **11 routing policies** — from baselines (random, round-robin) to cache-aware (`min_pd`, `prefix_affinity`) for systematic ablation.
- **Token-bucket link contention** — PCIe and RDMA bandwidth modeled with reservation-based token-bucket queues.
- **Oracle analysis** — computes theoretical hit-rate ceilings (infinite cache, Belady optimal, LRU) for gap analysis.

## Build

```bash
cargo build --release
# binary: target/release/kvcache-sim
```

Fetch the upstream trace (consumed as a git submodule):

```bash
git submodule update --init --recursive
```

## Usage

### 1. Run a single simulation

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
```

Prints `summary.json` to stdout and writes the full output directory (see [Outputs](#outputs) below).

### 2. Compare routers on the same trace (ablation)

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200-hf.yaml \
  --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
  --evict-policies lru \
  --output-dir runs/glm5_ablation
```

Writes `ablation.json` with one row per `router x evict_policy`. `ablate` currently supports only `lru` as a valid eviction policy. The aggregated output keeps the online prefill-time metrics (`ttft_mean/p50/p95/p99`) and omits `e2e`. The previous replay-based `belady` approximation has been removed from the CLI because it was not an exact full-hierarchy Belady algorithm and could produce misleading comparisons against `lru`.

### 3. Compute theoretical hit-rate ceilings (oracle)

```bash
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --per-instance

# Explicit capacity in blocks
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
```

Reports three numbers:

- `unlimited.hit_rate` — absolute ceiling (infinite cache)
- `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity
- `lru_finite.hit_rate` — production LRU at the same capacity

Gap between `lru_finite` and `belady_finite` = headroom from a smarter eviction policy. Gap between `belady_finite` and `unlimited` = headroom only reachable by adding capacity.
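To read the two gaps directly off a run, the headrooms can be pulled from `oracle.json` with `jq`. A minimal sketch, assuming the file nests each analysis under the dotted names above; the output path is illustrative:

```bash
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --per-instance \
  --out runs/oracle/oracle.json

# eviction_headroom: what a smarter eviction policy could recover
# capacity_headroom: what only extra capacity could recover
jq '{eviction_headroom: (.belady_finite.hit_rate - .lru_finite.hit_rate),
     capacity_headroom: (.unlimited.hit_rate - .belady_finite.hit_rate)}' \
  runs/oracle/oracle.json
```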
### 4. Validate a config without running

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
```

Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path.

## CLI overrides

These flags work on **all** subcommands and override the YAML in place, so the same config can be reused across sweeps:

| Flag                   | Overrides                           |
|------------------------|-------------------------------------|
| `--num-instances <N>`  | `cluster.num_instances`             |
| `--max-requests <N>`   | `sim.max_requests`                  |
| `--trace <PATH>`       | `sim.trace_path`                    |
| `--output-dir <DIR>`   | `sim.output_dir`                    |
| `--seed <N>`           | `sim.seed`                          |
| `--precise-topk <K>`   | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <SECS>` | `cluster.meta_store.ttl_seconds`    |

`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance` and `--out <PATH>`. `ablate` additionally takes `--routers <LIST>` and `--evict-policies <LIST>` (currently only `lru`).

## Router modes

Set `cluster.router.mode` in the YAML or list modes in `--routers`:

| Mode              | Aliases            | What it does                                                                              |
|-------------------|--------------------|-------------------------------------------------------------------------------------------|
| `random`          |                    | Uniform random. Baseline.                                                                  |
| `round_robin`     | `rr`               | Deterministic round-robin. Baseline.                                                       |
| `least_loaded`    |                    | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance.                       |
| `least_tokens`    | `lt`               | `argmin(waiting_tokens)`. Pure load balance by queued compute work.                        |
| `ttl_aware`       | `ttl`              | Picks the instance with the longest prefix in the global TTL meta-store. Cache-only.       |
| `precise`         | `precise_aware`    | Probes the top-K least-loaded instances' actual caches; charges probe latency into TTFT.   |
| `min_pd`          | `minpd`, `pd`      | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware.              |
| `cache_load`      | `cl`               | Filters to the least-loaded 1/4 of instances, then picks the best cache prefix.            |
| `cache_score`     | `cs`               | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`.                         |
| `estimated_ttft`  | `ettft`, `optimal` | Estimates `drain_time + fetch_time` per instance using architecture-aware compute.         |
| `prefix_affinity` | `affinity`, `pa`   | Rendezvous-hashed prefix fingerprinting for deterministic cache locality.                  |
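For a first pass over this table, a truncated ablation is usually enough to see the spread between the cache-only and load-only modes. A sketch using only documented flags, with an illustrative request cap and output path:

```bash
# Quick smoke ablation: one cache-only, one load-only, one hybrid router
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200-hf.yaml \
  --routers ttl_aware,least_loaded,prefix_affinity \
  --evict-policies lru \
  --max-requests 5000 \
  --output-dir runs/router_smoke
```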
### Router parameters

These fields in `cluster.router` tune specific routers:

| Field                      | Default | Used by           | Description                                          |
|----------------------------|---------|-------------------|------------------------------------------------------|
| `load_alpha`               | `1.0`   | `least_loaded`    | Weight of queue_len vs kv_blocks_used                |
| `score_alpha`              | `1.0`   | `cache_score`     | Load weight in `2^(alpha*load + beta*miss)`          |
| `score_beta`               | `0.1`   | `cache_score`     | Cache-miss weight in `2^(alpha*load + beta*miss)`    |
| `prefix_k`                 | `8`     | `prefix_affinity` | Number of leading blocks for the prefix fingerprint  |
| `affinity_fan_out`         | `0`     | `prefix_affinity` | Top-K affinity candidates (0 = auto: n/8, min 2)     |
| `precise_probe_latency_us` | `50.0`  | `precise`         | Simulated per-probe latency (microseconds)           |
| `precise_probe_topk`       | `4`     | `precise`         | Number of instances probed                           |

### Router design spectrum

```
Cache-only                              Hybrid                       Load-only
(hot-spot risk)                                                  (cache-blind)
┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐
ttl_aware precise     cache_score  min_pd      prefix_     least_ random
          cache_load                           affinity    loaded
                                               est_ttft    least_tokens
```

`prefix_affinity` sits in a unique position: it builds **proactive cache locality** by consistently routing same-prefix requests to the same instances (via rendezvous hashing), rather than reactively chasing existing cache state. This yields the highest L0 hit rates while maintaining load balance through within-group drain-time-aware selection.

## Model configuration

### HuggingFace config.json (recommended)

Point `model.config_json` at any HF `config.json` to auto-extract architecture:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2            # required (not in HF schema)
  block_size_tokens: 512    # required (not in HF schema)
```

Auto-detected features:

| Feature            | Detection trigger                                         | What it extracts                                               |
|--------------------|-----------------------------------------------------------|----------------------------------------------------------------|
| **MoE**            | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
| **MLA**            | `kv_lora_rank` present                                    | KV/Q LoRA ranks, qk_rope/nope dims, v_head_dim                 |
| **DSA**            | `first_k_dense_replace` present                           | Dense window, sparse stride, first dense layers                |
| **Sliding window** | `sliding_window` present                                  | Window size                                                    |
| **GQA**            | `num_key_value_heads < num_attention_heads`               | KV head count for grouped-query attention                      |

Explicit YAML fields always override the auto-detected values.

### Inline specification

Alternatively, specify architecture fields directly:

```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```

When `hidden_size` is present, the compute model is auto-derived (architecture mode). Without it, you must supply legacy manual coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).
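As a back-of-envelope check on cache sizing for the inline spec above, the per-block KV footprint under the plain GQA layout (`2 * num_kv_heads * head_dim` bytes per token per layer, the factor the MLA note below contrasts against) works out as follows. This is a rough estimate, not necessarily the simulator's exact accounting:

```bash
# num_layers * 2 * num_kv_heads * head_dim * dtype_bytes * block_size_tokens
echo "$(( 28 * 2 * 4 * 128 * 2 * 16 )) bytes per 16-token block"   # ~0.9 MB
```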
### Bundled model configs

| Model | Path | Architecture |
|-------|------|--------------|
| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |

## Hardware configuration

### Using presets (recommended)

Set `hardware.type` to a preset name — individual fields can override:

```yaml
hardware:
  type: 8xb200
  hbm_bytes: 500.0e9   # override KV budget (after model weights)
```

Available presets:

| Preset      | FLOPS       | HBM    | Mem BW     | PCIe |
|-------------|-------------|--------|------------|------|
| `h100`      | 989 TFLOPS  | 80 GB  | 3.35 TB/s  | Gen5 |
| `h800`      | 989 TFLOPS  | 80 GB  | 3.35 TB/s  | Gen5 |
| `h20`       | 148 TFLOPS  | 96 GB  | 4.0 TB/s   | Gen5 |
| `a100-80gb` | 312 TFLOPS  | 80 GB  | 2.0 TB/s   | Gen4 |
| `a100-40gb` | 312 TFLOPS  | 40 GB  | 1.555 TB/s | Gen4 |
| `b200`      | 2.25 PFLOPS | 192 GB | 8.0 TB/s   | Gen6 |

Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g. `8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and DRAM are set to sensible per-node defaults.

### Inline specification

```yaml
hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```

## Architecture-aware compute model

The simulator derives a **roofline prefill model** from model architecture:

```
prefill_time(N tokens) = max(compute_time, memory_time)
compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw
```

- **MoE**: only active experts contribute to FLOPs and weight streaming (shared experts always counted)
- **MLA**: compressed KV projections reduce attention FLOPs; KV cache uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`, with the first K layers using full dense attention
- **GQA**: fewer KV heads reduce both attention compute and KV cache size

## Bundled config files

| Config | Model | Hardware | Instances | Trace |
|--------|-------|----------|-----------|-------|
| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 |

## Outputs

Each run writes a directory under `sim.output_dir`:

| File | Contents |
|------|----------|
| `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
| `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample |
| `routing_log.jsonl` | One JSON per request: all router candidates + chosen instance + reason |

For `ablate`: an extra `ablation.json` with one summary per router. For `oracle`: an `oracle.json` with the three hit-rate analyses.

### Reading results quickly

```bash
# Pretty-print the summary
cat runs/glm5_8xb200_hf/summary.json | jq .
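# Per-request tier breakdown (illustrative sketch: assumes the l0/l1/remote/miss
# columns of per_request.csv hold per-request block counts, per the schema above)
awk -F, 'NR > 1 {l0 += $7; l1 += $8; rem += $9; miss += $10}
         END {t = l0 + l1 + rem + miss;
              printf "L0 %.3f  L1 %.3f  remote %.3f  miss %.3f\n", l0/t, l1/t, rem/t, miss/t}' \
  runs/glm5_8xb200_hf/per_request.csv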
# Compare all routers from an ablation
cat runs/glm5_8xb200_hf/ablation.json | \
  jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'

# Sort by TTFT
cat runs/glm5_8xb200_hf/ablation.json | \
  jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
```

## Trace format

The simulator reads the Alibaba [`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon) JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`, `output_length`, and `hash_ids` (block hashes, typically 16 tokens each). Only the input side is used.

Available traces in the submodule:

| Trace | Requests | Description |
|-------|----------|-------------|
| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |

## Testing

```bash
cargo test --release
```

28 tests: 27 unit tests (compute model, HF config parsing, hardware presets) + 1 integration smoke test that runs routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.
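To see the individual test names behind that count before running the full suite, the standard cargo/libtest listing flag works:

```bash
# List test names without executing them (standard libtest behaviour)
cargo test --release -- --list
```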