# kvcache-simulator
Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache and routing experiments. The simulator models a PD-disaggregated deployment: only the prefill path is simulated, while decode is reduced to a small completion tail for TTFT/E2E accounting.
It is intended for answering questions like:
- How much do different KV-aware routers help on the same trace?
- How much HBM/DRAM capacity is enough before routing dominates?
- How do prefix-locality policies behave under bucketed input-length pools?
- What is the gap between online LRU and offline-optimal cache capacity?
## What The Repo Models
- Architecture-derived prefill cost from model structure, including MoE, MLA, GQA, sliding-window attention, and DSA.
- Two-tier KV hierarchy with L0 GPU HBM and L1 host DRAM, plus remote RDMA fetches via a meta-store.
- Single-pool and bucketed clusters. Bucketed mode separates the service into input-length buckets with isolated instance pools and meta-stores.
- Local instance routing and global bucket routing with detailed per-request routing logs.
- Trace replay with optional input-length filtering so the same trace can be sliced into buckets without rewriting the source file.
- Offline oracle analysis for unlimited capacity, Belady, and LRU hit-rate ceilings.
## Highlights

- HF `config.json` auto-loading: point `model.config_json` at a model config and the simulator derives architecture parameters automatically.
- Hardware presets: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`, `a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`.
- 18 local router modes covering baselines, load-based, cache-aware, affinity, and TTFT-estimating policies.
- 2 global bucket router modes: `strict_input_length` and `bucket_score`.
- Detailed outputs: `summary.json`, `per_request.csv`, `instances.csv`, `routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable.
## Build

```sh
cargo build --release
```
If you want the public Qwen trace submodule as well:
```sh
git submodule update --init --recursive
```
The release binary is `target/release/kvcache-sim`.
## Quick Start
Validate a config:
```sh
target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml
```
Run one simulation:
```sh
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml
```
Compare several routers on the same trace:
```sh
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft
```
Auto-pick the smallest cluster size that meets a TTFT target, then ablate at that size:
```sh
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0
```
Run the oracle:
```sh
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance
```
`run` prints `summary.json` to stdout and also writes the full output directory under `sim.output_dir`.
## Current Command Boundaries
The repository now supports both legacy single-pool clusters and bucketed service topologies, but not every CLI path supports both yet.
- `run`: supports `cluster.num_instances` and `cluster.buckets`
- `validate`: supports `cluster.num_instances` and `cluster.buckets`
- `ablate`: currently single-pool only
- `ablate --evict-policies`: currently supports `lru` only
- `oracle`: currently single-pool only
- `--num-instances` override: currently single-pool only
- `--auto-instances`: currently single-pool only
In practice, bucket-aware experiments are ready in `run`, while fixed-placement ablation and oracle analysis still reject `cluster.buckets`.
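Until `ablate` and `oracle` accept `cluster.buckets`, one workable split is to run the full bucketed topology with `run` and approximate a single bucket's ablation on a single-pool config via the input-length filter flags. A sketch, assuming a bucketed config saved as `configs/my-buckets.yaml` (hypothetical) and the 0-32768 `short` range from the example in the Config Model section below:

```sh
# Bucket-aware end-to-end run (bucketed config path is hypothetical):
target/release/kvcache-sim run --config configs/my-buckets.yaml

# Per-slice router ablation on a single-pool config, approximating the
# short bucket with the input-length filter flags:
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --input-length-min 0 \
  --input-length-max 32768 \
  --routers cache_affinity,cache_score,estimated_ttft
```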
## Config Model

### Single-Pool Cluster

Use `cluster.num_instances` for the original flat instance pool:
```yaml
cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
```
### Bucketed Service

Use `cluster.buckets` plus a `global_router` to model explicit input-length buckets:
```yaml
cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4
```
Rules enforced by config validation:
- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
- bucket ranges must not overlap
- every bucket must have `num_instances > 0` and `input_length_min <= input_length_max`
## CLI Overrides
These flags apply on top of the YAML config:
| Flag | Overrides |
|---|---|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
| `--output-dir <PATH>` | `sim.output_dir` |
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |
| `--input-length-min <N>` | `sim.input_length_min` |
| `--input-length-max <N>` | `sim.input_length_max` |
Subcommand-specific additions:
- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`, `--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`, `--jobs`
- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`
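For example, one baseline config can drive several experiment points entirely from the command line; the run directory and seed here are illustrative:

```sh
target/release/kvcache-sim run \
  --config configs/glm5-8xb200.yaml \
  --trace bailian-traces/glm_coder_blksz_512_040915-040917.jsonl \
  --max-requests 20000 \
  --seed 7 \
  --output-dir runs/glm5_seed7
```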
## Routing Modes

### Global Bucket Routers

Configured through `cluster.global_router.mode`.
| Mode | What it does |
|---|---|
| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` range contains the request. |
| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss. Can intentionally deviate from the strict length bucket. |
### Local Instance Routers

Configured through `cluster.router.mode`. All of these names are accepted by `run`, and any of them can be passed to `ablate --routers` on single-pool configs.
| Mode | Aliases | What it does |
|---|---|---|
| `random` | | Uniform random baseline. |
| `round_robin` | `rr` | Deterministic round-robin baseline. |
| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. |
| `least_tokens` | `lt` | Minimizes queued token work. |
| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances for actual cache contents and charges probe latency. |
| `min_pd` | `minpd`, `pd` | Minimizes P * D using ongoing load and prefix reuse. |
| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. |
| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. |
| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. |
| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. |
| `cache_score` | `cs` | Exponential score over queue length and miss blocks. |
| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than default `cache_score`. |
| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. |
| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. |
| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. |
| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. |
| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. |
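The two `cache_affinity` ablation modes exist so the rendezvous-placement and cache-weight contributions can be separated. On a single-pool config they can be compared head-to-head:

```sh
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers cache_affinity,cache_affinity_weak_rend,cache_affinity_strong_only
```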
Router tuning knobs in `cluster.router`:
| Field | Default | Used by |
|---|---|---|
| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families |
| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` |
| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` |
| `prefix_k` | `8` | prefix and affinity fingerprinting |
| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` |
| `precise_probe_latency_us` | `50.0` | `precise` |
| `precise_probe_topk` | `4` | `precise` |
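Two of these knobs also have CLI overrides (`--precise-topk`, `--ttl-seconds`), which makes small sweeps cheap. A sketch, assuming a config whose `router.mode` is `precise` (the config path and output naming are hypothetical):

```sh
for k in 2 4 8; do
  target/release/kvcache-sim run \
    --config configs/my-precise.yaml \
    --precise-topk "$k" \
    --output-dir "runs/precise_topk_$k"
done
```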
## Model And Hardware Configuration

### Model Config
Recommended pattern:
```yaml
model:
  config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512
```
Notes:
- `config_json` is resolved relative to the YAML file
- explicit YAML fields override values loaded from the model config
- `compute_dtype` selects the compute FLOPS tier
- `weight_dtype` controls model-weight bytes separately from KV-cache bytes
- `dtype_bytes` sizes the KV cache
The architecture loader understands:
- MoE expert counts and active experts
- MLA LoRA ranks and attention dimensions
- DSA sparse-attention parameters
- sliding-window attention
- GQA from KV-head count
### Hardware Presets
Recommended pattern:
```yaml
hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256
```
Available preset families:
- `h100`, `h800`, `h20`, `h20-141g`
- `a100-80gb`, `a100-40gb`
- `b200`, `b300`
- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300`
## Bundled Configs

Representative configs in `configs/`:
| Config | Notes |
|---|---|
| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. |
| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. |
| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. |
Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment points that use `sim.input_length_min` and `sim.input_length_max`.
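After editing configs locally, a quick sanity pass is to `validate` everything in the directory:

```sh
for f in configs/*.yaml; do
  target/release/kvcache-sim validate --config "$f"
done
```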
## Trace Inputs
This repository currently contains two trace sources:
- `bailian-traces/`:
  - `glm_coder_blksz_512_040915-040917.jsonl`
  - `qwen3_coder_blksz_512_040915-040917.jsonl`
- `qwen-bailian-usagetraces-anon/` submodule:
  - public 16-token-block Qwen traces such as `qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl`
The simulator expects JSONL records with fields like:
```json
{
  "chat_id": 159,
  "parent_chat_id": 55,
  "timestamp": 61.114,
  "input_length": 521,
  "output_length": 132,
  "type": "text",
  "turn": 2,
  "hash_ids": [1089, 1090, 1091]
}
```
Only prefill-side behavior is modeled; `output_length` is used only for a decode tail in completion metrics.
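Because `input_length` is a plain JSONL field, a trace can be sized up per bucket before committing to a topology. For example, a rough count of how many requests the 0-32768 `short` bucket from the earlier example would capture:

```sh
jq -c 'select(.input_length <= 32768)' \
  bailian-traces/glm_coder_blksz_512_040915-040917.jsonl | wc -l
```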
## Outputs

Each run writes a directory under `sim.output_dir`:
| File | Contents |
|---|---|
| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. |
| `per_request.csv` | Per-request latency and cache stats, including bucket, instance, and `length_bucket_match`. |
| `instances.csv` | Periodic per-instance samples with bucket, instance, `queue_len`, and KV usage. |
| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. |
Additional outputs:
- `ablate`: writes `ablation.json`
- `oracle`: writes `oracle.json`
- `ablate --auto-instances`: writes calibration runs under `<output_dir>/auto_instances/`
Quick inspection examples:
```sh
jq . runs/glm5_8xb200/summary.json

jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
```
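For bucketed runs, `routing_log.jsonl` supports the same style of spot check; for example, counting route decisions per chosen bucket (output path illustrative):

```sh
jq -r '.chosen_bucket' runs/bucketed_run/routing_log.jsonl | sort | uniq -c
```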
## Oracle Semantics

`oracle` computes three hit-rate references at a chosen cache capacity:

- `unlimited.hit_rate`: absolute ceiling with infinite capacity
- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity
- `lru_finite.hit_rate`: LRU at the same capacity
When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still feeds the full trace into cache state but only counts requests inside the selected input-length range. That matches the intended "measure one bucket inside a mixed workload" interpretation.
The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The gap from `belady_finite` to `unlimited` is pure capacity headroom.
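Both gaps can be read straight out of `oracle.json`, assuming the key layout listed above (output path illustrative):

```sh
jq '{eviction_headroom: (.belady_finite.hit_rate - .lru_finite.hit_rate),
     capacity_headroom: (.unlimited.hit_rate - .belady_finite.hit_rate)}' \
  runs/glm5_8xb200/oracle.json
```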
## Testing

```sh
cargo test --release
```
The test suite covers config parsing, hardware presets, routing behavior, bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.