kvcache-simulator

Discrete-event simulator for cluster-level LLM prefill serving with a two-tier KV cache and routing experiments. The simulator models a prefill/decode (PD) disaggregated deployment: only the prefill path is simulated, while decode is reduced to a small completion tail for TTFT/E2E accounting.

It is intended for answering questions like:

  • How much do different KV-aware routers help on the same trace?
  • How much HBM/DRAM capacity is enough before routing dominates?
  • How do prefix-locality policies behave under bucketed input-length pools?
  • What is the gap between online LRU and offline-optimal cache capacity?

What The Repo Models

  • Architecture-derived prefill cost from model structure, including MoE, MLA, GQA, sliding-window attention, and DSA.
  • Two-tier KV hierarchy with L0 GPU HBM and L1 host DRAM, plus remote RDMA fetches via a meta-store.
  • Single-pool and bucketed clusters. Bucketed mode separates the service into input-length buckets with isolated instance pools and meta-stores.
  • Local instance routing and global bucket routing with detailed per-request routing logs.
  • Trace replay with optional input-length filtering so the same trace can be sliced into buckets without rewriting the source file.
  • Offline oracle analysis for unlimited capacity, Belady, and LRU hit-rate ceilings.

Highlights

  • HF config.json auto-loading: point model.config_json at a model config and the simulator derives architecture parameters automatically.
  • Hardware presets: h100, h800, h20, h20-141g, a100-80gb, a100-40gb, b200, and b300, plus TP variants such as 8xb200.
  • 18 local router modes covering baselines, load-based, cache-aware, affinity, and TTFT-estimating policies.
  • 2 global bucket router modes: strict_input_length and bucket_score.
  • Detailed outputs: summary.json, per_request.csv, instances.csv, routing_log.jsonl, plus ablation.json / oracle.json when applicable.

Build

cargo build --release

If you want the public Qwen trace submodule as well:

git submodule update --init --recursive

The release binary is:

target/release/kvcache-sim

Quick Start

Validate a config:

target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml

Run one simulation:

target/release/kvcache-sim run --config configs/glm5-8xb200.yaml

Compare several routers on the same trace:

target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft

Auto-pick the smallest cluster size that meets a TTFT target, then ablate at that size:

target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0

Run the oracle:

target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance

run prints summary.json to stdout and also writes the full output directory under sim.output_dir.
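Since run emits summary.json on stdout, a quick sanity check can pipe it straight into jq. This sketch assumes stdout carries only the JSON document; the ttft_mean field name is inferred from the ablation example later in this README, not confirmed:

target/release/kvcache-sim run --config configs/glm5-8xb200.yaml | jq '.ttft_mean'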

Current Command Boundaries

The repository now supports both legacy single-pool clusters and bucketed service topologies, but not every CLI path supports both yet.

  • run: supports cluster.num_instances and cluster.buckets
  • validate: supports cluster.num_instances and cluster.buckets
  • ablate: currently single-pool only
  • ablate --evict-policies: currently supports lru only
  • oracle: currently single-pool only
  • --num-instances override: currently single-pool only
  • --auto-instances: currently single-pool only

In practice, bucket-aware experiments are ready in run, while fixed-placement ablation and oracle analysis still reject cluster.buckets.
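Until ablate and oracle learn about cluster.buckets, one workaround is to approximate a single bucket on a single-pool config with the shared input-length filter flags. A sketch, assuming those flags apply to ablate the same way they apply to run:

target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --input-length-min 0 \
  --input-length-max 32768 \
  --routers cache_affinity,cache_score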

Config Model

Single-Pool Cluster

Use cluster.num_instances for the original flat instance pool:

cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity

Bucketed Service

Use cluster.buckets plus a global_router to model explicit input-length buckets:

cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4

Rules enforced by config validation:

  • cluster.num_instances and cluster.buckets are mutually exclusive
  • bucket ranges must not overlap
  • every bucket must have num_instances > 0
  • input_length_min <= input_length_max
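For example, validate should reject the layout below, because the two ranges share the boundary value 32768 (all numbers here are illustrative):

cluster:
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32768   # overlaps "short" at exactly 32768
      input_length_max: 131072
      num_instances: 4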

CLI Overrides

These flags apply on top of the YAML config:

  • --num-instances <N>: overrides cluster.num_instances
  • --max-requests <N>: overrides sim.max_requests
  • --trace <PATH>: overrides sim.trace_path
  • --output-dir <PATH>: overrides sim.output_dir
  • --seed <N>: overrides sim.seed
  • --precise-topk <N>: overrides cluster.router.precise_probe_topk
  • --ttl-seconds <S>: overrides cluster.meta_store.ttl_seconds
  • --input-length-min <N>: overrides sim.input_length_min
  • --input-length-max <N>: overrides sim.input_length_max

Subcommand-specific additions:

  • ablate: --routers, --evict-policies, --auto-instances, --auto-target-ttft-mean, --auto-candidates, --auto-probe-router, --jobs
  • oracle: --capacity-blocks, --per-instance, --out
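A sketch combining several of the shared overrides on one run (all values illustrative):

target/release/kvcache-sim run \
  --config configs/glm5-8xb200.yaml \
  --num-instances 16 \
  --max-requests 50000 \
  --ttl-seconds 600 \
  --seed 7 \
  --output-dir runs/glm5_16inst_seed7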

Routing Modes

Global Bucket Routers

Configured through cluster.global_router.mode.

  • strict_input_length: Routes to the unique bucket whose [input_length_min, input_length_max] range contains the request's input length.
  • bucket_score: Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss; it can intentionally deviate from the strict length bucket.
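A minimal sketch of switching the bucketed config above to bucket_score; the weights are illustrative starting points, not tuned values:

cluster:
  global_router:
    mode: bucket_score
    length_penalty_weight: 1.0   # weight on length-bucket mismatch
    load_weight: 1.0             # weight on aggregate queue load
    cache_weight: 0.5            # weight on predicted cache miss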

Local Instance Routers

Configured through cluster.router.mode. All of these names are accepted by run, and any of them can be passed to ablate --routers on single-pool configs.

  • random: Uniform random baseline.
  • round_robin (alias: rr): Deterministic round-robin baseline.
  • least_loaded: Minimizes kv_blocks_used + alpha * queue_len.
  • least_tokens (alias: lt): Minimizes queued token work.
  • ttl_aware (alias: ttl): Uses the global TTL meta-store to chase the longest reusable prefix.
  • precise (alias: precise_aware): Probes the top-K least-loaded instances for actual cache contents and charges probe latency.
  • min_pd (aliases: minpd, pd): Minimizes P * D using ongoing load and prefix reuse.
  • cache_load (alias: cl): Filters to lightly loaded instances, then chooses the best cache prefix.
  • cache_affinity (aliases: caff, ca): Strong cache-first scoring with rendezvous-based sticky homes for prefix families.
  • cache_affinity_weak_rend (alias: caff_weak): Ablation: weak cache weights plus rendezvous placement.
  • cache_affinity_strong_only (alias: caff_strong): Ablation: strong cache weights without rendezvous tie-breaking.
  • cache_score (alias: cs): Exponential score over queue length and miss blocks.
  • cache_score_strong (aliases: cs_strong, css): Parity probe with stronger cache weighting than the default cache_score.
  • cache_score_ttl (aliases: csttl, cs_ttl): cache_score variant that also uses TTL/meta-store visibility.
  • estimated_ttft (aliases: ettft, optimal): First-principles TTFT estimate per instance using compute plus KV movement.
  • prefix_affinity (aliases: affinity, pa): Deterministic prefix fingerprinting with affinity fan-out and load-aware selection.
  • adaptive_affinity (alias: aa): Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise.
  • lineage_affinity (alias: la): Combines parent stickiness, family homesets, and strong local cache scoring.

Router tuning knobs in cluster.router:

  • load_alpha (default 1.0): used by least_loaded, ttl_aware, and the affinity families
  • score_alpha (default 1.0): used by cache_score, cache_score_ttl
  • score_beta (default 0.1): used by cache_score, cache_score_ttl
  • prefix_k (default 8): used by prefix and affinity fingerprinting
  • affinity_fan_out (default 0): used by prefix_affinity, adaptive_affinity, lineage_affinity
  • precise_probe_latency_us (default 50.0): used by precise
  • precise_probe_topk (default 4): used by precise
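As a sketch, a precise router configured through these knobs might look like this; the values just spell out the documented defaults:

cluster:
  router:
    mode: precise
    precise_probe_topk: 4            # probe the 4 least-loaded instances
    precise_probe_latency_us: 50.0   # latency charged per probe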

Model And Hardware Configuration

Model Config

Recommended pattern:

model:
  config_json: ../models/GLM-5/config.json
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

Notes:

  • config_json is resolved relative to the YAML file
  • explicit YAML fields override values loaded from the model config
  • compute_dtype selects the compute FLOPS tier
  • weight_dtype controls model-weight bytes separately from KV-cache bytes
  • dtype_bytes sizes the KV cache

The architecture loader understands:

  • MoE expert counts and active experts
  • MLA LoRA ranks and attention dimensions
  • DSA sparse-attention parameters
  • sliding-window attention
  • GQA from KV-head count
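As a rough illustration, a trimmed HF-style config.json carrying such fields might look like the following. The key names follow common Hugging Face conventions (DeepSeek-style MLA/MoE keys, Llama-style GQA keys) and the values roughly mirror DeepSeek-V3's published config; the exact keys this loader reads are not specified here:

{
  "num_hidden_layers": 61,
  "hidden_size": 7168,
  "num_attention_heads": 128,
  "num_key_value_heads": 128,
  "q_lora_rank": 1536,
  "kv_lora_rank": 512,
  "n_routed_experts": 256,
  "num_experts_per_tok": 8
}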

Hardware Presets

Recommended pattern:

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

Available preset families:

  • h100, h800, h20, h20-141g
  • a100-80gb, a100-40gb
  • b200, b300
  • TP forms such as 2xh100, 4xh20, 8xb200, 8xb300

Bundled Configs

Representative configs in configs/:

  • glm5-8xb200.yaml: GLM-5 on 8xb200; single-pool baseline config.
  • glm5-fp8-8xh20-141g.yaml: GLM-5-FP8 on 8xh20-141g, with a 0-32k input-length filter.
  • glm5-fp8-8xh20-141g-ca-tuned.yaml: same family as above, tuned for cache_affinity.
  • glm5-nvfp4-8xb300.yaml: GLM-5-NVFP4 on 8xb300.
  • glm5-nvfp4-fp8compute-8xb300.yaml: NVFP4 weights with FP8 compute on 8xb300.
  • qwen3-coder-480b-8xh20.yaml: Qwen3-Coder-480B-A35B on 8xh20.

Many of the glm5-<n>.yaml configs are bucket/slice-specific experiment points that use sim.input_length_min and sim.input_length_max.

Trace Inputs

This repository currently contains two trace sources:

  • bailian-traces/
    • glm_coder_blksz_512_040915-040917.jsonl
    • qwen3_coder_blksz_512_040915-040917.jsonl
  • qwen-bailian-usagetraces-anon/ submodule
    • public 16-token-block Qwen traces such as qwen_coder_blksz_16.jsonl and qwen_traceB_blksz_16.jsonl

The simulator expects JSONL records with fields like:

{
  "chat_id": 159,
  "parent_chat_id": 55,
  "timestamp": 61.114,
  "input_length": 521,
  "output_length": 132,
  "type": "text",
  "turn": 2,
  "hash_ids": [1089, 1090, 1091]
}

Only prefill-side behavior is modeled; output_length is used only for a decode tail in completion metrics.
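A quick way to eyeball a trace before replaying it is a jq pass over the JSONL (shown for the bundled GLM trace; note that jq -s slurps the whole file into memory):

jq -s 'map(.input_length) | {requests: length, min_len: min, max_len: max}' \
  bailian-traces/glm_coder_blksz_512_040915-040917.jsonl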

Outputs

Each run writes a directory under sim.output_dir:

  • summary.json: aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes.
  • per_request.csv: per-request latency and cache stats, including bucket, instance, and length_bucket_match.
  • instances.csv: periodic per-instance samples with bucket, instance, queue_len, and KV usage.
  • routing_log.jsonl: one JSON route decision per request, including global_mode, mode, chosen_bucket, candidate buckets, and candidate instances.

Additional outputs:

  • ablate: writes ablation.json
  • oracle: writes oracle.json
  • ablate --auto-instances: writes calibration runs under <output_dir>/auto_instances/

Quick inspection examples:

jq . runs/glm5_8xb200/summary.json
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
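routing_log.jsonl from a bucketed run can be sliced the same way, e.g. to see which requests the global router sent to the long bucket (bucket name from the sample config above; the run directory is illustrative):

jq -c 'select(.chosen_bucket == "long") | {global_mode, mode, chosen_bucket}' \
  runs/bucketed_run/routing_log.jsonl | head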

Oracle Semantics

oracle computes three hit-rate references at a chosen cache capacity:

  • unlimited.hit_rate: absolute ceiling with infinite capacity
  • belady_finite.hit_rate: offline-optimal eviction at the chosen capacity
  • lru_finite.hit_rate: LRU at the same capacity

When sim.input_length_min / sim.input_length_max are set, oracle still feeds the full trace into cache state but only counts requests inside the selected input-length range. That matches the intended "measure one bucket inside a mixed workload" interpretation.

The gap from lru_finite to belady_finite is eviction-policy headroom. The gap from belady_finite to unlimited is pure capacity headroom.
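Assuming oracle.json exposes the three references under exactly those key paths, both gaps can be computed in one jq call (run directory as in the inspection examples above):

jq '{eviction_headroom: (.belady_finite.hit_rate - .lru_finite.hit_rate),
     capacity_headroom: (.unlimited.hit_rate - .belady_finite.hit_rate)}' \
  runs/glm5_8xb200/oracle.json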

Testing

cargo test --release

The test suite covers config parsing, hardware presets, routing behavior, bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.
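cargo's built-in name filter can narrow a run to one of those areas; the filter strings below are guesses at test-name substrings, not verified module names:

cargo test --release oracle    # hypothetical filter: tests with "oracle" in the name
cargo test --release routing   # hypothetical filter: routing-behavior tests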
