kvcache-simulator

Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs.

Assumes PD (prefill/decode) disaggregation — only the prefill path is modeled.

Features

  • Architecture-derived roofline compute — auto-derives FLOPs, attention coefficients, and weight-streaming costs from model architecture (MoE, MLA, GQA, DSA, sliding window).
  • HuggingFace config.json auto-parsing — drop in any HF config.json and the simulator extracts layer counts, attention heads, MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
  • Built-in GPU hardware presets — H100, H800, H20, A100-80GB, A100-40GB, B200 with tensor-parallel scaling (e.g. 8xb200).
  • Two-tier KV cache hierarchy — L0 (GPU HBM) + L1 (CPU DRAM) with LRU eviction and cross-instance RDMA fetch via a cluster-wide meta-store.
  • 11 routing policies — from baselines (random, round-robin) to cache-aware (min_pd, prefix_affinity) for systematic ablation.
  • Token-bucket link contention — PCIe and RDMA bandwidth modeled with reservation-based token-bucket queues.
  • Oracle analysis — computes theoretical hit-rate ceilings (infinite cache, Belady optimal, LRU) for gap analysis.

Build

cargo build --release
# binary: target/release/kvcache-sim

Fetch the upstream trace (consumed as a git submodule):

git submodule update --init --recursive

Usage

1. Run a single simulation

target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml

Prints summary.json to stdout and writes the full set of output files to the run's output directory (see Outputs below).

2. Compare routers on the same trace (ablation)

target/release/kvcache-sim ablate \
    --config configs/glm5-8xb200-hf.yaml \
    --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
    --evict-policies lru \
    --output-dir runs/glm5_ablation

Writes ablation.json with one row per router x evict_policy.

ablate currently supports only lru as a valid eviction policy. The aggregated output keeps the online prefill-time metrics (ttft_mean/p50/p95/p99) and omits e2e.

The previous replay-based belady approximation has been removed from the CLI because it was not an exact full-hierarchy Belady algorithm and could produce misleading comparisons against lru.

3. Compute theoretical hit-rate ceilings (oracle)

# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
    --config configs/glm5-8xb200-hf.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
    --config configs/glm5-8xb200-hf.yaml --per-instance

# Explicit capacity in blocks
target/release/kvcache-sim oracle \
    --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000

Reports three numbers:

  • unlimited.hit_rate — absolute ceiling (infinite cache)
  • belady_finite.hit_rate — optimal-eviction ceiling at the given capacity
  • lru_finite.hit_rate — production LRU at the same capacity

The gap between lru_finite and belady_finite is the headroom a smarter eviction policy could recover; the gap between belady_finite and unlimited is headroom only reachable by adding capacity.
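
A quick way to quantify both gaps, assuming oracle.json exposes the three analyses under the key names above (adjust the path to your output location):

cat runs/glm5_8xb200_hf/oracle.json | jq \
  '{eviction_headroom: (.belady_finite.hit_rate - .lru_finite.hit_rate),
    capacity_headroom:  (.unlimited.hit_rate - .belady_finite.hit_rate)}'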

4. Validate a config without running

target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml

Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path.

CLI overrides

These flags work on all subcommands and override the corresponding YAML fields at runtime, so the same config file can be reused across sweeps:

| Flag | Overrides |
| --- | --- |
| --num-instances <N> | cluster.num_instances |
| --max-requests <N> | sim.max_requests |
| --trace <PATH> | sim.trace_path |
| --output-dir <PATH> | sim.output_dir |
| --seed <N> | sim.seed |
| --precise-topk <N> | cluster.router.precise_probe_topk |
| --ttl-seconds <S> | cluster.meta_store.ttl_seconds |

oracle additionally takes --capacity-blocks <N> / --per-instance and --out <PATH>. ablate additionally takes --routers <csv> and --evict-policies <csv> (currently only lru).
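
Because of these overrides, a scaling sweep can reuse a single config file; for example (run names and sizes here are illustrative):

for n in 16 32 64 128; do
  target/release/kvcache-sim run \
    --config configs/glm5-8xb200-hf.yaml \
    --num-instances $n \
    --output-dir runs/glm5_scale_$n
done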

Router modes

Set cluster.router.mode in the YAML, or pass a comma-separated list to --routers when ablating:

| Mode | Aliases | What it does |
| --- | --- | --- |
| random | | Uniform random. Baseline. |
| round_robin | rr | Deterministic round-robin. Baseline. |
| least_loaded | | argmin(kv_blocks_used + alpha * queue_len). KV-blind load balance. |
| least_tokens | lt | argmin(waiting_tokens). Pure load balance by queued compute work. |
| ttl_aware | ttl | Picks instance with longest prefix in the global TTL meta-store. Cache-only. |
| precise | precise_aware | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
| min_pd | minpd, pd | Minimizes P*D (prefill tokens x ongoing requests). Cluster-wide RDMA-aware. |
| cache_load | cl | Filters to least-loaded 1/4 instances, then picks best cache prefix. |
| cache_score | cs | Exponential scoring: 2^(alpha * queue_len + beta * miss_blocks). |
| estimated_ttft | ettft, optimal | Estimates drain_time + fetch_time per instance using architecture-aware compute. |
| prefix_affinity | affinity, pa | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. |
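
For a single run, the mode is set in the YAML; a minimal sketch, assuming the dotted path above maps to nested keys:

cluster:
  router:
    mode: min_pd    # any mode or alias from the table above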

Router parameters

These fields in cluster.router tune specific routers:

| Field | Default | Used by | Description |
| --- | --- | --- | --- |
| load_alpha | 1.0 | least_loaded | Weight of queue_len vs kv_blocks_used |
| score_alpha | 1.0 | cache_score | Load weight in 2^(alpha*load + beta*miss) |
| score_beta | 0.1 | cache_score | Cache-miss weight in 2^(alpha*load + beta*miss) |
| prefix_k | 8 | prefix_affinity | Number of leading blocks for the prefix fingerprint |
| affinity_fan_out | 0 | prefix_affinity | Top-K affinity candidates (0 = auto: n/8, min 2) |
| precise_probe_latency_us | 50.0 | precise | Simulated per-probe latency (microseconds) |
| precise_probe_topk | 4 | precise | Number of instances probed |
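
A sketch of tuning cache_score through these fields (the nesting mirrors cluster.router; values shown are just the defaults):

cluster:
  router:
    mode: cache_score
    score_alpha: 1.0   # load weight in the exponential score
    score_beta: 0.1    # cache-miss weight in the exponential score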

Router design spectrum

        Cache-only                  Hybrid                  Load-only
        (hot-spot risk)                                     (cache-blind)
  ┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐
  ttl_aware  precise  cache_score  min_pd     prefix_    least_     random
                                   cache_load  affinity   loaded
                                   est_ttft               least_tokens

prefix_affinity sits in a unique position: it builds proactive cache locality by consistently routing same-prefix requests to the same instances (via rendezvous hashing), rather than reactively chasing existing cache state. This yields the highest L0 hit rates while maintaining load balance through within-group drain-time-aware selection.

Model configuration

Point model.config_json at any HF config.json to auto-extract architecture:

model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2          # required (not in HF schema)
  block_size_tokens: 512  # required (not in HF schema)

Auto-detected features:

| Feature | Detection trigger | What it extracts |
| --- | --- | --- |
| MoE | n_routed_experts, num_local_experts, or num_experts | Expert count, active experts, shared experts, expert FFN width |
| MLA | kv_lora_rank present | KV/Q LoRA ranks, qk_rope/nope dims, v_head_dim |
| DSA | first_k_dense_replace present | Dense window, sparse stride, first dense layers |
| Sliding window | sliding_window present | Window size |
| GQA | num_key_value_heads < num_attention_heads | KV head count for grouped-query attention |

Explicit YAML fields always override the auto-detected values.
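
For example, a sketch that keeps the HF-derived architecture but pins one field explicitly (the overridden field and value are purely illustrative):

model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2
  block_size_tokens: 512
  num_layers: 60          # hypothetical override; the explicit value wins over the parsed one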

Inline specification

Alternatively, specify architecture fields directly:

model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16

When hidden_size is present, the compute model is auto-derived (architecture mode). Without it, you must supply legacy manual coefficients (flops_per_token_prefill, attn_quadratic_coeff, etc.).
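
A loose sketch of the legacy form, showing only the coefficients named above (values are invented and the full required field set may be larger):

model:
  name: legacy-example
  dtype_bytes: 2
  block_size_tokens: 16
  # no hidden_size, so manual coefficients are required; numbers below are illustrative only
  flops_per_token_prefill: 1.4e10
  attn_quadratic_coeff: 3.5e5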

Bundled model configs

| Model | Path | Architecture |
| --- | --- | --- |
| GLM-5 (744B/40B-active) | models/GLM-5/config.json | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| GLM-5-FP8 | models/GLM-5-FP8/config.json | GLM-5 architecture + upstream FP8 quantization metadata |
| Qwen3-Coder-480B-A35B FP8 | models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json | MoE (160 experts, 8 active) + GQA |

Hardware configuration

Set hardware.type to a preset name; individual fields can then be overridden:

hardware:
  type: 8xb200
  hbm_bytes: 500.0e9     # override KV budget (after model weights)

Available presets:

| Preset | FLOPS | HBM | Mem BW | PCIe |
| --- | --- | --- | --- | --- |
| h100 | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| h800 | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| h20 | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 |
| h20-141g | 148 TFLOPS | 141 GB | 4.8 TB/s | Gen5 |
| a100-80gb | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 |
| a100-40gb | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 |
| b200 | 2.25 PFLOPS | 192 GB | 8.0 TB/s | Gen6 |

Prefix with 2x, 4x, or 8x for tensor-parallel groups (e.g. 8xh20). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and DRAM are set to sensible per-node defaults.
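
For example, a 4-way tensor-parallel H100 group with one field overridden on top of the preset (the override value is illustrative):

hardware:
  type: 4xh100
  hbm_bytes: 250.0e9   # hypothetical KV-budget override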

Inline specification

hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096

Architecture-aware compute model

The simulator derives a roofline prefill model from model architecture:

prefill_time(N tokens) = max(compute_time, memory_time)

compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time  = layers * weight_bytes_per_layer / gpu_mem_bw

  • MoE: only active experts contribute to FLOPs and weight streaming (shared experts always counted)
  • MLA: compressed KV projections reduce attention FLOPs; KV cache uses kv_lora_rank + qk_rope_head_dim instead of 2 * kv_heads * head_dim
  • DSA: effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride, with the first K layers using full dense attention
  • GQA: fewer KV heads reduce both attention compute and KV cache size

Bundled config files

| Config | Model | Hardware | Instances | Trace |
| --- | --- | --- | --- | --- |
| glm5-8xb200-hf.yaml | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
| glm5-fp8-8xh20-141g.yaml | GLM-5-FP8 via ModelScope config.json | 8xH20-141G preset | 128 | GLM coder blk512 |
| glm5-nvfp4-8xb300.yaml | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
| qwen3-coder-480b-8xh20.yaml | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 |

Outputs

Each run writes a directory under sim.output_dir:

| File | Contents |
| --- | --- |
| summary.json | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| per_request.csv | req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s |
| instances.csv | t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy per sample |
| routing_log.jsonl | One JSON per request: all router candidates + chosen instance + reason |

For ablate: an extra ablation.json with one summary per router. For oracle: an oracle.json with the three hit-rate analyses.

Reading results quickly

# Pretty-print the summary
cat runs/glm5_8xb200_hf/summary.json | jq .

# Compare all routers from an ablation
cat runs/glm5_8xb200_hf/ablation.json | \
  jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'

# Sort by TTFT
cat runs/glm5_8xb200_hf/ablation.json | \
  jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'

Trace format

The simulator reads the Alibaba qwen-bailian-usagetraces-anon JSONL schema. Each record has chat_id, timestamp, input_length, output_length, and hash_ids (block hashes, typically 16 tokens each). Only the input side is used.
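
For orientation, a record looks roughly like this; only the field names come from the schema description above, and all values are made up:

{"chat_id": "c-000001", "timestamp": 1700000000, "input_length": 4096, "output_length": 512, "hash_ids": [101, 102, 103, 104]}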

Available traces in the submodule:

| Trace | Requests | Description |
| --- | --- | --- |
| qwen_coder_blksz_16.jsonl | 43k | Qwen Coder serving traffic |
| qwen_traceA_blksz_16.jsonl | 43k | Qwen general traffic A |
| qwen_traceB_blksz_16.jsonl | 173k | Qwen general traffic B |
| qwen_thinking_blksz_16.jsonl | 11k | Qwen reasoning/thinking traffic |

Testing

cargo test --release

28 tests: 27 unit tests (compute model, HF config parsing, hardware presets) + 1 integration smoke test that runs routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.