diff --git a/README.md b/README.md index 9206b36..083a53b 100644 --- a/README.md +++ b/README.md @@ -1,353 +1,386 @@ # kvcache-simulator Discrete-event simulator for cluster-level LLM **prefill** serving with a -two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing. -Replays real production traces against a synthetic cluster so you can -ablate routing strategies and cache sizing without spinning up any GPUs. +two-tier KV cache and routing experiments. The simulator models a +PD-disaggregated deployment: only the **prefill** path is simulated, while +decode is reduced to a small completion tail for TTFT/E2E accounting. -Assumes **PD (prefill/decode) disaggregation** — only the prefill path is -modeled. +It is intended for answering questions like: -## Features +- How much do different KV-aware routers help on the same trace? +- How much HBM/DRAM capacity is enough before routing dominates? +- How do prefix-locality policies behave under bucketed input-length pools? +- What is the gap between online LRU and offline-optimal cache capacity? -- **Architecture-derived roofline compute** — auto-derives FLOPs, - attention coefficients, and weight-streaming costs from model - architecture (MoE, MLA, GQA, DSA, sliding window). -- **HuggingFace config.json auto-parsing** — drop in any HF - `config.json` and the simulator extracts layer counts, attention heads, - MoE expert configs, MLA LoRA ranks, and DSA sparse parameters. -- **Built-in GPU hardware presets** — H100, H800, H20, A100-80GB, - A100-40GB, B200 with tensor-parallel scaling (e.g. `8xb200`). -- **Two-tier KV cache hierarchy** — L0 (GPU HBM) + L1 (CPU DRAM) with - LRU eviction and cross-instance RDMA fetch via a cluster-wide - meta-store. -- **11 routing policies** — from baselines (random, round-robin) to - cache-aware (min\_pd, prefix\_affinity) for systematic ablation. -- **Token-bucket link contention** — PCIe and RDMA bandwidth modeled with - reservation-based token-bucket queues. -- **Oracle analysis** — computes theoretical hit-rate ceilings (infinite - cache, Belady optimal, LRU) for gap analysis. +## What The Repo Models + +- **Architecture-derived prefill cost** from model structure, including MoE, + MLA, GQA, sliding-window attention, and DSA. +- **Two-tier KV hierarchy** with L0 GPU HBM and L1 host DRAM, plus remote + RDMA fetches via a meta-store. +- **Single-pool and bucketed clusters**. Bucketed mode separates the service + into input-length buckets with isolated instance pools and meta-stores. +- **Local instance routing and global bucket routing** with detailed + per-request routing logs. +- **Trace replay with optional input-length filtering** so the same trace can + be sliced into buckets without rewriting the source file. +- **Offline oracle analysis** for unlimited capacity, Belady, and LRU hit-rate + ceilings. + +## Highlights + +- **HF `config.json` auto-loading**: point `model.config_json` at a model + config and the simulator derives architecture parameters automatically. +- **Hardware presets**: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`, + `a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`. +- **18 local router modes** covering baselines, load-based, cache-aware, + affinity, and TTFT-estimating policies. +- **2 global bucket router modes**: `strict_input_length` and `bucket_score`. +- **Detailed outputs**: `summary.json`, `per_request.csv`, `instances.csv`, + `routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable. 
## Build ```bash cargo build --release -# binary: target/release/kvcache-sim ``` -Fetch the upstream trace (consumed as a git submodule): +If you want the public Qwen trace submodule as well: ```bash git submodule update --init --recursive ``` -## Usage - -### 1. Run a single simulation +The release binary is: ```bash -target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml +target/release/kvcache-sim ``` -Prints `summary.json` to stdout and writes the full output directory -(see [Outputs](#outputs) below). +## Quick Start -### 2. Compare routers on the same trace (ablation) +Validate a config: + +```bash +target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml +``` + +Run one simulation: + +```bash +target/release/kvcache-sim run --config configs/glm5-8xb200.yaml +``` + +Compare several routers on the same trace: ```bash target/release/kvcache-sim ablate \ - --config configs/glm5-8xb200-hf.yaml \ - --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \ - --evict-policies lru \ - --output-dir runs/glm5_ablation + --config configs/glm5-8xb200.yaml \ + --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft ``` -Writes `ablation.json` with one row per `router x evict_policy`. - -`ablate` currently supports only `lru` as a valid eviction policy. The -aggregated output keeps the online prefill-time metrics -(`ttft_mean/p50/p95/p99`) and omits `e2e`. - -The previous replay-based `belady` approximation has been removed from -the CLI because it was not an exact full-hierarchy Belady algorithm and -could produce misleading comparisons against `lru`. - -### 3. Compute theoretical hit-rate ceilings (oracle) +Auto-pick the smallest cluster size that meets a TTFT target, then ablate at +that size: ```bash -# Cluster-aggregate capacity (default) -target/release/kvcache-sim oracle \ - --config configs/glm5-8xb200-hf.yaml --num-instances 64 - -# A single instance's HBM budget -target/release/kvcache-sim oracle \ - --config configs/glm5-8xb200-hf.yaml --per-instance - -# Explicit capacity in blocks -target/release/kvcache-sim oracle \ - --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000 +target/release/kvcache-sim ablate \ + --config configs/glm5-8xb200.yaml \ + --auto-instances \ + --auto-probe-router cache_score \ + --auto-target-ttft-mean 4.0 ``` -Reports three numbers: - -- `unlimited.hit_rate` — absolute ceiling (infinite cache) -- `belady_finite.hit_rate` — optimal-eviction ceiling at the given capacity -- `lru_finite.hit_rate` — production LRU at the same capacity - -Gap between `lru_finite` and `belady_finite` = headroom from a smarter -eviction policy. Gap between `belady_finite` and `unlimited` = headroom -only reachable by adding capacity. - -### 4. Validate a config without running +Run the oracle: ```bash -target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml +target/release/kvcache-sim oracle \ + --config configs/glm5-8xb200.yaml \ + --per-instance ``` -Parses the YAML, prints derived per-instance block budgets, and dumps -the first 5 trace records so you can sanity-check the path. +`run` prints `summary.json` to stdout and also writes the full output directory +under `sim.output_dir`. -## CLI overrides +## Current Command Boundaries -These flags work on **all** subcommands and override the YAML in place, -so the same config can be reused across sweeps: +The repository now supports both legacy single-pool clusters and bucketed +service topologies, but not every CLI path supports both yet. 
-| Flag | Overrides | -|--------------------------|-------------------------------------------| -| `--num-instances ` | `cluster.num_instances` | -| `--max-requests ` | `sim.max_requests` | -| `--trace ` | `sim.trace_path` | -| `--output-dir ` | `sim.output_dir` | -| `--seed ` | `sim.seed` | -| `--precise-topk ` | `cluster.router.precise_probe_topk` | -| `--ttl-seconds ` | `cluster.meta_store.ttl_seconds` | +- `run`: supports `cluster.num_instances` and `cluster.buckets` +- `validate`: supports `cluster.num_instances` and `cluster.buckets` +- `ablate`: currently **single-pool only** +- `ablate --evict-policies`: currently supports **`lru` only** +- `oracle`: currently **single-pool only** +- `--num-instances` override: currently **single-pool only** +- `--auto-instances`: currently **single-pool only** -`oracle` additionally takes `--capacity-blocks ` / `--per-instance` -and `--out `. `ablate` additionally takes `--routers ` and -`--evict-policies ` (currently only `lru`). +In practice, bucket-aware experiments are ready in `run`, while fixed-placement +ablation and oracle analysis still reject `cluster.buckets`. -## Router modes +## Config Model -Set `cluster.router.mode` in the YAML or list in `--routers`: +### Single-Pool Cluster -| Mode | Aliases | What it does | -|-------------------|------------------|--------------------------------------------------------------------------------------| -| `random` | | Uniform random. Baseline. | -| `round_robin` | `rr` | Deterministic round-robin. Baseline. | -| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. | -| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. | -| `ttl_aware` | `ttl` | Picks instance with longest prefix in the global TTL meta-store. Cache-only. | -| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. | -| `min_pd` | `minpd`, `pd` | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware. | -| `cache_load` | `cl` | Filters to least-loaded 1/4 instances, then picks best cache prefix. | -| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. | -| `estimated_ttft` | `ettft`,`optimal`| Estimates `drain_time + fetch_time` per instance using architecture-aware compute. | -| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. 
|
+
+Use `cluster.num_instances` for the original flat instance pool:
-
-### Router parameters
-
-These fields in `cluster.router` tune specific routers:
-
-| Field | Default | Used by | Description |
-|--------------------------|---------|------------------|------------------------------------------------------|
-| `load_alpha` | `1.0` | `least_loaded` | Weight of queue\_len vs kv\_blocks\_used |
-| `score_alpha` | `1.0` | `cache_score` | Load weight in `2^(alpha*load + beta*miss)` |
-| `score_beta` | `0.1` | `cache_score` | Cache-miss weight in `2^(alpha*load + beta*miss)` |
-| `prefix_k` | `8` | `prefix_affinity`| Number of leading blocks for the prefix fingerprint |
-| `affinity_fan_out` | `0` | `prefix_affinity`| Top-K affinity candidates (0 = auto: n/8, min 2) |
-| `precise_probe_latency_us`| `50.0`| `precise` | Simulated per-probe latency (microseconds) |
-| `precise_probe_topk` | `4` | `precise` | Number of instances probed |
-
-### Router design spectrum
-
-```
- Cache-only                   Hybrid                           Load-only
- (hot-spot risk)                                               (cache-blind)
- ┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐
- ttl_aware   precise     cache_score  min_pd      prefix_     least_      random
-             cache_load                           affinity    loaded
-             est_ttft                                         least_tokens
+```yaml
+cluster:
+  num_instances: 32
+  meta_store:
+    ttl_seconds: 300.0
+  router:
+    mode: cache_affinity
 ```
-
-`prefix_affinity` sits in a unique position: it builds **proactive cache
-locality** by consistently routing same-prefix requests to the same
-instances (via rendezvous hashing), rather than reactively chasing
-existing cache state. This yields the highest L0 hit rates while
-maintaining load balance through within-group drain-time-aware selection.
+
+### Bucketed Service
-
-## Model configuration
+
+Use `cluster.buckets` plus a `global_router` to model explicit input-length
+buckets:
-
-### HuggingFace config.json (recommended)
+
+```yaml
+cluster:
+  meta_store:
+    ttl_seconds: 300.0
+  router:
+    mode: cache_affinity
+    load_alpha: 1.5
+    prefix_k: 8
+  global_router:
+    mode: strict_input_length
+    length_penalty_weight: 1.0
+    load_weight: 1.0
+    cache_weight: 1.0
+  buckets:
+    - name: short
+      input_length_min: 0
+      input_length_max: 32768
+      num_instances: 8
+    - name: long
+      input_length_min: 32769
+      input_length_max: 131072
+      num_instances: 4
+```
-
-Point `model.config_json` at any HF `config.json` to auto-extract
-architecture:
+
+Rules enforced by config validation:
+
+- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
+- bucket ranges must not overlap
+- every bucket must have `num_instances > 0`
+- `input_length_min <= input_length_max`
+
+### CLI Overrides
+
+These flags apply on top of the YAML config:
+
+| Flag | Overrides |
+|------|-----------|
+| `--num-instances <n>` | `cluster.num_instances` |
+| `--max-requests <n>` | `sim.max_requests` |
+| `--trace <path>` | `sim.trace_path` |
+| `--output-dir <dir>` | `sim.output_dir` |
+| `--seed <n>` | `sim.seed` |
+| `--precise-topk <k>` | `cluster.router.precise_probe_topk` |
+| `--ttl-seconds <seconds>` | `cluster.meta_store.ttl_seconds` |
+| `--input-length-min <tokens>` | `sim.input_length_min` |
+| `--input-length-max <tokens>` | `sim.input_length_max` |
+
+Subcommand-specific additions:
+
+- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`,
+  `--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`,
+  `--jobs`
+- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`
+
+## Routing Modes
+
+### Global Bucket Routers
+
+Configured through `cluster.global_router.mode`.
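+
+For example, a `bucket_score` setup (described in the table below) only needs
+the `global_router` block from the bucketed example above swapped out; the
+weight values here are illustrative, not defaults:
+
+```yaml
+cluster:
+  global_router:
+    mode: bucket_score
+    length_penalty_weight: 1.0
+    load_weight: 1.0
+    cache_weight: 0.5
+```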
+ +| Mode | What it does | +|------|---------------| +| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` contains the request. | +| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss. Can intentionally deviate from the strict length bucket. | + +### Local Instance Routers + +Configured through `cluster.router.mode`. All of these names are accepted by +`run`, and any of them can be passed to `ablate --routers` on single-pool +configs. + +| Mode | Aliases | What it does | +|------|---------|---------------| +| `random` | | Uniform random baseline. | +| `round_robin` | `rr` | Deterministic round-robin baseline. | +| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. | +| `least_tokens` | `lt` | Minimizes queued token work. | +| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. | +| `precise` | `precise_aware` | Probes top-K least-loaded instances for actual cache contents and charges probe latency. | +| `min_pd` | `minpd`, `pd` | Minimizes `P * D` using ongoing load and prefix reuse. | +| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. | +| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. | +| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. | +| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. | +| `cache_score` | `cs` | Exponential score over queue length and miss blocks. | +| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than default `cache_score`. | +| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. | +| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. | +| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. | +| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. | +| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. 
| + +Router tuning knobs in `cluster.router`: + +| Field | Default | Used by | +|-------|---------|---------| +| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families | +| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` | +| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` | +| `prefix_k` | `8` | prefix and affinity fingerprinting | +| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` | +| `precise_probe_latency_us` | `50.0` | `precise` | +| `precise_probe_topk` | `4` | `precise` | + +## Model And Hardware Configuration + +### Model Config + +Recommended pattern: ```yaml model: config_json: ../models/GLM-5/config.json - dtype_bytes: 2 # required (not in HF schema) - block_size_tokens: 512 # required (not in HF schema) + name: glm-5 + compute_dtype: fp8 + weight_dtype: fp4 + dtype_bytes: 1 + block_size_tokens: 512 ``` -Auto-detected features: +Notes: -| Feature | Detection trigger | What it extracts | -|-----------|-------------------------------|----------------------------------------------| -| **MoE** | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width | -| **MLA** | `kv_lora_rank` present | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim | -| **DSA** | `first_k_dense_replace` present| Dense window, sparse stride, first dense layers | -| **Sliding window** | `sliding_window` present | Window size | -| **GQA** | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention | +- `config_json` is resolved relative to the YAML file +- explicit YAML fields override values loaded from the model config +- `compute_dtype` selects the compute FLOPS tier +- `weight_dtype` controls model-weight bytes separately from KV-cache bytes +- `dtype_bytes` sizes the KV cache -Explicit YAML fields always override the auto-detected values. +The architecture loader understands: -### Inline specification +- MoE expert counts and active experts +- MLA LoRA ranks and attention dimensions +- DSA sparse-attention parameters +- sliding-window attention +- GQA from KV-head count -Alternatively, specify architecture fields directly: +### Hardware Presets -```yaml -model: - name: qwen2.5-coder-7b - num_layers: 28 - hidden_size: 3584 - num_attention_heads: 28 - num_kv_heads: 4 - head_dim: 128 - intermediate_size: 18944 - dtype_bytes: 2 - block_size_tokens: 16 -``` - -When `hidden_size` is present, the compute model is auto-derived -(architecture mode). Without it, you must supply legacy manual -coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.). 
- -### Bundled model configs - -| Model | Path | Architecture | -|-------|------|--------------| -| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA | -| GLM-5-FP8 | `models/GLM-5-FP8/config.json` | GLM-5 architecture + upstream FP8 quantization metadata | -| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA | - -## Hardware configuration - -### Using presets (recommended) - -Set `hardware.type` to a preset name — individual fields can override: +Recommended pattern: ```yaml hardware: - type: 8xb200 - hbm_bytes: 500.0e9 # override KV budget (after model weights) -``` - -Available presets: - -| Preset | FLOPS | HBM | Mem BW | PCIe | -|-------------|------------|---------|------------|------| -| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 | -| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 | -| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 | -| `h20-141g` | 148 TFLOPS | 141 GB | 4.8 TB/s | Gen5 | -| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 | -| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 | -| `b200` | 2.25 PFLOPS| 192 GB | 8.0 TB/s | Gen6 | - -Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g. -`8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and -DRAM are set to sensible per-node defaults. - -### Inline specification - -```yaml -hardware: - gpu_flops: 1.80e16 - gpu_mem_bw: 6.40e13 - hbm_bytes: 500.0e9 + type: 8xb300 + hbm_bytes: 1900.0e9 dram_bytes: 1.5e12 - pcie_bw: 128.0e9 - pcie_latency_us: 4.0 - rdma_bw: 50.0e9 - rdma_latency_us: 6.0 max_batch_slots: 256 - prefill_chunk_tokens: 4096 ``` -## Architecture-aware compute model +Available preset families: -The simulator derives a **roofline prefill model** from model -architecture: +- `h100`, `h800`, `h20`, `h20-141g` +- `a100-80gb`, `a100-40gb` +- `b200`, `b300` +- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300` -``` -prefill_time(N tokens) = max(compute_time, memory_time) +## Bundled Configs -compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops -memory_time = layers * weight_bytes_per_layer / gpu_mem_bw +Representative configs in `configs/`: + +| Config | Notes | +|--------|-------| +| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. | +| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. | +| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. | +| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. | +| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. | +| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. | + +Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment +points that use `sim.input_length_min` and `sim.input_length_max`. 
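+
+As a sketch, such a slice only needs the filter fields under `sim`; the paths
+and values below are illustrative rather than copied from a bundled config:
+
+```yaml
+sim:
+  trace_path: ../bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
+  output_dir: runs/glm5_short_slice
+  input_length_min: 0
+  input_length_max: 32768
+```
+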
+ +## Trace Inputs + +This repository currently contains two trace sources: + +- `bailian-traces/` + - `glm_coder_blksz_512_040915-040917.jsonl` + - `qwen3_coder_blksz_512_040915-040917.jsonl` +- `qwen-bailian-usagetraces-anon/` submodule + - public 16-token-block Qwen traces such as + `qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl` + +The simulator expects JSONL records with fields like: + +```json +{ + "chat_id": 159, + "parent_chat_id": 55, + "timestamp": 61.114, + "input_length": 521, + "output_length": 132, + "type": "text", + "turn": 2, + "hash_ids": [1089, 1090, 1091] +} ``` -- **MoE**: only active experts contribute to FLOPs and weight streaming - (shared experts always counted) -- **MLA**: compressed KV projections reduce attention FLOPs; KV cache - uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim` -- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`, - with the first K layers using full dense attention -- **GQA**: fewer KV heads reduce both attention compute and KV cache size - -## Bundled config files - -| Config | Model | Hardware | Instances | Trace | -|--------|-------|----------|-----------|-------| -| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 | -| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 via ModelScope config.json | 8xH20-141G preset | 128 | GLM coder blk512 | -| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 | -| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 | +Only prefill-side behavior is modeled; `output_length` is used only for a +decode tail in completion metrics. ## Outputs -Each run writes a directory under `sim.output_dir`: +Each `run` writes a directory under `sim.output_dir`: -| File | Contents | -|----------------------|----------------------------------------------------------------------------| -| `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes | -| `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` | -| `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample | -| `routing_log.jsonl` | One JSON per request: all router candidates + chosen instance + reason | +| File | Contents | +|------|----------| +| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. | +| `per_request.csv` | Per-request latency and cache stats, including `bucket`, `instance`, and `length_bucket_match`. | +| `instances.csv` | Periodic per-instance samples with `bucket`, `instance`, `queue_len`, and KV usage. | +| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. | -For `ablate`: an extra `ablation.json` with one summary per router. -For `oracle`: an `oracle.json` with the three hit-rate analyses. +Additional outputs: -### Reading results quickly +- `ablate`: writes `ablation.json` +- `oracle`: writes `oracle.json` +- `ablate --auto-instances`: writes calibration runs under + `/auto_instances/` + +Quick inspection examples: ```bash -# Pretty-print the summary -cat runs/glm5_8xb200_hf/summary.json | jq . 
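+# Tally routing decisions per chosen bucket in a bucketed run
+# (chosen_bucket is one of the routing_log.jsonl fields listed above;
+# the run path here is illustrative)
+jq -r '.chosen_bucket' runs/my_bucketed_run/routing_log.jsonl | sort | uniq -c
+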
- -# Compare all routers from an ablation -cat runs/glm5_8xb200_hf/ablation.json | \ - jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}' - -# Sort by TTFT -cat runs/glm5_8xb200_hf/ablation.json | \ - jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}' +jq . runs/glm5_8xb200/summary.json ``` -## Trace format +```bash +jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \ + runs/glm5_8xb200/ablation.json +``` -The simulator reads the Alibaba -[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon) -JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`, -`output_length`, and `hash_ids` (block hashes, typically 16 tokens each). -Only the input side is used. +## Oracle Semantics -Available traces in the submodule: +`oracle` computes three hit-rate references at a chosen cache capacity: -| Trace | Requests | Description | -|-------|----------|-------------| -| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic | -| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A | -| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B | -| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic | +- `unlimited.hit_rate`: absolute ceiling with infinite capacity +- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity +- `lru_finite.hit_rate`: LRU at the same capacity + +When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still +feeds the full trace into cache state but only counts requests inside the +selected input-length range. That matches the intended "measure one bucket +inside a mixed workload" interpretation. + +The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The +gap from `belady_finite` to `unlimited` is pure capacity headroom. ## Testing @@ -355,6 +388,5 @@ Available traces in the submodule: cargo test --release ``` -28 tests: 27 unit tests (compute model, HF config parsing, hardware -presets) + 1 integration smoke test that runs routers on a synthetic -shared-prefix trace and asserts the expected hit-rate ordering. +The test suite covers config parsing, hardware presets, routing behavior, +bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.
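+
+As a closing quick reference, the three oracle ceilings from the Oracle
+Semantics section can be pulled straight out of `oracle.json`; the run path is
+illustrative:
+
+```bash
+jq '{unlimited: .unlimited.hit_rate,
+     belady: .belady_finite.hit_rate,
+     lru: .lru_finite.hit_rate}' runs/my_oracle_run/oracle.json
+```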