Gahow Wang ec73a95e05 KVCache simulator for LLM serving cluster routing research

Discrete-event simulator for evaluating KV cache-aware routing policies
in prefill-disaggregated LLM serving clusters. Models a two-tier KV cache
hierarchy (L0 GPU HBM + L1 CPU DRAM) with RDMA/PCIe link contention,
architecture-derived roofline compute (MoE, MLA, DSA), and a cluster-wide
meta-store for prefix-aware routing decisions.

Includes 11 routing policies (random, round_robin, least_loaded,
least_tokens, ttl_aware, precise, min_pd, cache_load, cache_score,
estimated_ttft, prefix_affinity), HuggingFace config.json auto-parsing,
built-in GPU hardware presets (H100/H800/H20/A100/B200), and ablation
tooling for systematic policy comparison across real Alibaba serving traces.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-14 01:16:02 +08:00

Repository contents (each entry last changed by the commit above, 2026-04-14 01:16:02 +08:00):

configs
models
qwen-bailian-usagetraces-anon @ 27cfe19920
src
tests
.gitignore
.gitmodules
Cargo.lock
Cargo.toml
README.md

README.md

kvcache-simulator

Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. Replays real production traces against a synthetic cluster so you can ablate routing strategies and cache sizing without spinning up any GPUs.

Assumes PD (prefill/decode) disaggregation — only the prefill path is modeled.

Build

cargo build --release
# binary: target/release/kvcache-sim

Fetch the upstream trace (consumed as a git submodule):

git submodule update --init --recursive

Usage

1. Run a single simulation

target/release/kvcache-sim run --config configs/qwen2.5-coder-7b-h800.yaml

Prints summary.json to stdout and writes the full output directory (see Outputs below).

2. Compare routers on the same trace (ablation)

target/release/kvcache-sim ablate \
    --config configs/qwen2.5-coder-7b-h800.yaml \
    --num-instances 64 \
    --output-dir runs/qwen7b_n64 \
    --routers random,least_loaded,ttl_aware,precise

Writes one subdirectory per router plus a combined runs/qwen7b_n64/ablation.json with side-by-side summaries.

3. Compute theoretical hit-rate ceilings (oracle)

# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
    --config configs/qwen2.5-coder-7b-h800.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
    --config configs/qwen2.5-coder-7b-h800.yaml --per-instance

# Explicit capacity in 16-token blocks
target/release/kvcache-sim oracle \
    --config configs/qwen2.5-coder-7b-h800.yaml --capacity-blocks 200000

Reports three numbers:

unlimited.hit_rate — absolute ceiling (infinite cache)
belady_finite.hit_rate — optimal-eviction ceiling at the given capacity
lru_finite.hit_rate — production LRU at the same capacity

The gap between lru_finite and belady_finite is the headroom available from a smarter eviction policy. The gap between belady_finite and unlimited is headroom that only additional capacity can reach.
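To make the three ceilings concrete, here is a minimal sketch of how each hit rate could be computed over a flat sequence of block hashes. It is not the simulator's oracle code: the function names are hypothetical, the trace is assumed to be flattened into one access per block, and the Belady variant is assumed to allow bypassing a block that will not be reused soon.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Fraction of accesses that hit when the cache never evicts.
fn unlimited_hit_rate(trace: &[u64]) -> f64 {
    let mut seen = HashSet::new();
    // `insert` returns false when the block was already present, i.e. a hit.
    let hits = trace.iter().filter(|&&b| !seen.insert(b)).count();
    hits as f64 / trace.len().max(1) as f64
}

/// Hit rate of a plain LRU cache holding `capacity` blocks.
fn lru_hit_rate(trace: &[u64], capacity: usize) -> f64 {
    let mut order: VecDeque<u64> = VecDeque::new(); // front = least recently used
    let mut hits = 0usize;
    for &b in trace {
        if let Some(pos) = order.iter().position(|&x| x == b) {
            order.remove(pos); // hit: re-queued as most recent below
            hits += 1;
        } else if order.len() == capacity {
            order.pop_front(); // miss on a full cache: evict the LRU block
        }
        if capacity > 0 {
            order.push_back(b);
        }
    }
    hits as f64 / trace.len().max(1) as f64
}

/// Hit rate under Belady's rule: evict the block reused farthest in the future.
fn belady_hit_rate(trace: &[u64], capacity: usize) -> f64 {
    // next_use[i] = index of the next access to trace[i], or usize::MAX if none.
    let mut next_use = vec![usize::MAX; trace.len()];
    let mut later: HashMap<u64, usize> = HashMap::new();
    for i in (0..trace.len()).rev() {
        if let Some(&j) = later.get(&trace[i]) {
            next_use[i] = j;
        }
        later.insert(trace[i], i);
    }
    let mut cache: HashMap<u64, usize> = HashMap::new(); // block -> next use
    let mut hits = 0usize;
    for (i, &b) in trace.iter().enumerate() {
        if cache.insert(b, next_use[i]).is_some() {
            hits += 1; // block was resident; its next-use index is refreshed
        } else if cache.len() > capacity {
            // Over capacity: drop the farthest-reused block (possibly `b` itself).
            let victim = cache.iter().max_by_key(|e| *e.1).map(|(k, _)| *k).unwrap();
            cache.remove(&victim);
        }
    }
    hits as f64 / trace.len().max(1) as f64
}
```

On the cyclic trace `[1, 2, 3, 1, 2, 3]` with a capacity of 2 blocks, LRU never hits, Belady hits on two of six accesses, and the unlimited cache hits on three. That gives the ordering lru_finite ≤ belady_finite ≤ unlimited described above.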

4. Validate a config without running

target/release/kvcache-sim validate --config configs/qwen2.5-coder-7b-h800.yaml

Parses the YAML, prints derived per-instance block budgets, and dumps the first 5 trace records so you can sanity-check the path.
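As a rough cross-check for the printed block budgets, the sketch below estimates how many 16-token blocks fit in a given KV budget. It assumes a plain MHA/GQA model that stores one K and one V tensor per layer; the simulator derives its numbers from the model architecture (including MoE, MLA, and DSA variants), so its output may differ. The function names and parameters are hypothetical.

```rust
/// KV-cache bytes for one block, assuming a K and a V vector of
/// `num_kv_heads * head_dim` elements per token and per layer.
fn kv_bytes_per_block(
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    dtype_bytes: u64,
    block_tokens: u64,
) -> u64 {
    2 * num_layers * num_kv_heads * head_dim * dtype_bytes * block_tokens
}

/// Whole blocks that fit in a byte budget (partial blocks are discarded).
fn block_budget(kv_budget_bytes: u64, bytes_per_block: u64) -> u64 {
    kv_budget_bytes / bytes_per_block
}
```

With illustrative values of 28 layers, 4 KV heads, a head dimension of 128, 2-byte elements, and 16-token blocks, one block takes 917,504 bytes, so a 40 GiB budget holds 46,811 blocks.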

CLI overrides

These flags work on all subcommands and override the corresponding YAML values for that run, so the same config can be reused across sweeps:

Flag	Overrides
`--num-instances <N>`	`cluster.num_instances`
`--max-requests <N>`	`sim.max_requests`
`--trace <PATH>`	`sim.trace_path`
`--output-dir <PATH>`	`sim.output_dir`
`--seed <N>`	`sim.seed`
`--precise-topk <N>`	`cluster.router.precise_probe_topk`
`--ttl-seconds <S>`	`cluster.meta_store.ttl_seconds`

The oracle subcommand additionally takes --capacity-blocks <N> or --per-instance, and --out <PATH>. The ablate subcommand additionally takes --routers <csv>.
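The override behaviour amounts to "a flag, when present, wins over the YAML value". A minimal sketch of that precedence rule follows, with hypothetical struct and function names and only four of the settings from the table above; the real CLI parsing is not shown here.

```rust
/// Values as loaded from the YAML file (subset, for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Config {
    num_instances: usize, // cluster.num_instances
    seed: u64,            // sim.seed
    ttl_seconds: f64,     // cluster.meta_store.ttl_seconds
    output_dir: String,   // sim.output_dir
}

/// One optional field per flag; `None` means the flag was not passed.
#[derive(Debug, Default)]
struct Overrides {
    num_instances: Option<usize>, // --num-instances
    seed: Option<u64>,            // --seed
    ttl_seconds: Option<f64>,     // --ttl-seconds
    output_dir: Option<String>,   // --output-dir
}

/// Returns the config with every provided override applied.
fn apply_overrides(mut cfg: Config, ov: Overrides) -> Config {
    if let Some(n) = ov.num_instances {
        cfg.num_instances = n;
    }
    if let Some(s) = ov.seed {
        cfg.seed = s;
    }
    if let Some(t) = ov.ttl_seconds {
        cfg.ttl_seconds = t;
    }
    if let Some(d) = ov.output_dir {
        cfg.output_dir = d;
    }
    cfg
}
```

Keeping the overrides separate from the loaded config means a sweep can vary one field per run while everything else stays as written in the YAML.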

Router modes

Set cluster.router.mode in the YAML, or list the modes to compare in --routers:

Mode	What it does
`random`	Uniform random. Baseline.
`round_robin`	Deterministic round-robin. Baseline.
`least_loaded`	`argmin(kv_blocks_used + alpha * queue_len)`. KV-blind.
`ttl_aware`	Picks instance with longest prefix in the global TTL meta store.
`precise`	Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT.

Expected hit-rate ordering: random ≲ least_loaded ≲ ttl_aware ≲ precise.
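The two load-based rows of the table can be sketched as follows. The `Instance` struct and its `cached_prefix_blocks` field are hypothetical stand-ins for the result of probing an instance's cache, and choosing the longest cached prefix among the probed instances is an assumption about how `precise` decides. Probe latency is not modeled here.

```rust
#[derive(Debug, Clone)]
struct Instance {
    id: usize,
    kv_blocks_used: u64,
    queue_len: u64,
    /// Hypothetical: leading request blocks already cached on this instance.
    cached_prefix_blocks: usize,
}

/// The KV-blind load score from the table: kv_blocks_used + alpha * queue_len.
fn load_score(inst: &Instance, alpha: f64) -> f64 {
    inst.kv_blocks_used as f64 + alpha * inst.queue_len as f64
}

/// `least_loaded`: argmin of the load score (first instance wins ties).
fn least_loaded(instances: &[Instance], alpha: f64) -> Option<usize> {
    instances
        .iter()
        .min_by(|a, b| load_score(a, alpha).total_cmp(&load_score(b, alpha)))
        .map(|i| i.id)
}

/// `precise` (assumed rule): probe the `top_k` least-loaded instances and
/// pick the one with the longest cached prefix, preferring lower load on ties.
fn precise(instances: &[Instance], alpha: f64, top_k: usize) -> Option<usize> {
    let mut by_load: Vec<&Instance> = instances.iter().collect();
    by_load.sort_by(|a, b| load_score(a, alpha).total_cmp(&load_score(b, alpha)));
    by_load.truncate(top_k);
    // `max_by_key` keeps the last maximum, so iterate from most to least loaded.
    by_load
        .iter()
        .rev()
        .max_by_key(|i| i.cached_prefix_blocks)
        .map(|i| i.id)
}
```

A larger `top_k` widens the set of caches that get inspected, which can raise the hit rate at the cost of more probes per request.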

Outputs

Each run writes a directory under sim.output_dir:

File	Contents
`summary.json`	Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes
`per_request.csv`	`req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s`
`instances.csv`	`t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample
`routing_log.jsonl`	One JSON per request: all router candidates + chosen instance + reason

For ablate: an extra ablation.json with one summary per router.
For oracle: an oracle.json with the three hit-rate analyses.
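For ad hoc analysis beyond summary.json, per_request.csv can be aggregated directly. The sketch below uses only the standard library and looks columns up by header name. It assumes `l0_hit` and `total_blocks` are per-request block counts and uses a nearest-rank percentile, which may not match how the simulator computes its own summary figures. Both function names are hypothetical.

```rust
/// Nearest-rank percentile (`p` in 0..=100); `None` for an empty slice.
fn percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Parses CSV text with a header row and returns the TTFT samples plus the
/// block-weighted L0 hit fraction. Returns `None` on a malformed file.
fn summarize(csv: &str) -> Option<(Vec<f64>, f64)> {
    let mut lines = csv.lines();
    let header: Vec<&str> = lines.next()?.split(',').collect();
    let col = |name: &str| header.iter().position(|h| *h == name);
    let (ttft_i, total_i, l0_i) = (col("ttft")?, col("total_blocks")?, col("l0_hit")?);
    let (mut ttfts, mut total, mut l0) = (Vec::new(), 0.0, 0.0);
    for line in lines.filter(|l| !l.trim().is_empty()) {
        let fields: Vec<&str> = line.split(',').collect();
        ttfts.push(fields.get(ttft_i)?.trim().parse::<f64>().ok()?);
        total += fields.get(total_i)?.trim().parse::<f64>().ok()?;
        l0 += fields.get(l0_i)?.trim().parse::<f64>().ok()?;
    }
    Some((ttfts, if total > 0.0 { l0 / total } else { 0.0 }))
}
```

Read the file with `std::fs::read_to_string`, pass the text to `summarize`, then call `percentile(&ttfts, 95.0)` for a p95 figure.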

Reading results quickly

# Pretty-print the summary
jq . runs/qwen7b/summary.json

# Compare all routers from an ablation
jq '.[] | {router, ttft_p50, hit_rate_l0, total_rdma_bytes}' runs/qwen7b_n64/ablation.json

# Hit-rate ceilings vs LRU at the same capacity
jq '{unlimited: .unlimited.hit_rate, belady: .belady_finite.hit_rate, lru: .lru_finite.hit_rate}' runs/qwen7b/oracle.json

Config

A config is a single YAML file with four sections. A working example lives at configs/qwen2.5-coder-7b-h800.yaml; copy and edit for other models/hardware.

model:      # shape + prefill roofline coefficients
hardware:   # per-instance GPU/PCIe/RDMA capabilities + batch knobs
cluster:    # num_instances, meta_store TTL, router mode
sim:        # trace_path, max_requests, output_dir, seed

Only prefill-side model coefficients are used; any decode fields in legacy YAMLs are accepted and ignored.

Trace format

The simulator reads the Alibaba qwen-bailian-usagetraces-anon JSONL schema. Each record has chat_id, timestamp, input_length, output_length, and hash_ids (16-token block hashes). Only the input side is used.
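The record fields listed above map onto a struct like the one below, and hash_ids is what makes prefix matching possible. This is a sketch: the field types are assumptions (the schema only names the fields), and `cached_prefix_len` is a hypothetical helper. A cached block is assumed reusable only when every block before it also hits, because each block's KV depends on all preceding tokens.

```rust
use std::collections::HashSet;

/// One trace record; field names follow the schema, types are assumed.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct TraceRecord {
    chat_id: u64,
    timestamp: f64,
    input_length: u32,
    output_length: u32, // present in the trace, unused by the prefill model
    hash_ids: Vec<u64>, // one hash per 16-token block of the input
}

/// Number of leading blocks of a request that are already in `cached`.
/// Counting stops at the first miss, so later matches do not count.
fn cached_prefix_len(hash_ids: &[u64], cached: &HashSet<u64>) -> usize {
    hash_ids.iter().take_while(|h| cached.contains(*h)).count()
}
```

For a request with hash_ids `[1, 2, 3, 4]` and a cache holding `{1, 2, 4}`, the reusable prefix is 2 blocks: block 4 is present but unreachable because block 3 is missing.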

Testing

cargo test --release

The suite has 16 tests: 15 unit tests plus 1 smoke test that runs all four routers on a synthetic shared-prefix trace and asserts the expected hit-rate ordering.