Compare commits: 365ceac3be...main (19 commits)
| Author | SHA1 | Date |
|---|---|---|
| | 1906857ffd | |
| | cad83cec2f | |
| | 43ada0cfc0 | |
| | b5a6fb964c | |
| | 3a84c15068 | |
| | fa381b5db3 | |
| | 96019082cc | |
| | 008fe2fe5d | |
| | 7de38fa998 | |
| | d8a0796506 | |
| | a723d7a811 | |
| | bb280c8ba0 | |
| | 92d593d59b | |
| | 82b3e2985f | |
| | 67eef78244 | |
| | 996511f300 | |
| | c86d931d8f | |
| | a3f386c858 | |
| | ff316c6873 | |
9
.gitignore
vendored
@@ -1,12 +1,21 @@
.claude

# Trace files
bailian-traces

# docs
docs
reports
scripts
tests/test_analyze_affinity_policy.py

# Rust build artifacts
/target/
**/*.rs.bk

# Simulation output
/runs/
.worktrees/

# Editor / IDE
.vscode/
547
README.md
@@ -1,119 +1,179 @@
# kvcache-simulator

Discrete-event simulator for cluster-level LLM **prefill** serving with a
two-tier KV cache (GPU HBM + CPU DRAM / v6d) and KV-aware request routing.
Replays real production traces against a synthetic cluster so you can
ablate routing strategies and cache sizing without spinning up any GPUs.
two-tier KV cache and routing experiments. The simulator models a
PD-disaggregated deployment: only the **prefill** path is simulated, while
decode is reduced to a small completion tail for TTFT/E2E accounting.

Assumes **PD (prefill/decode) disaggregation**; only the prefill path is
modeled.
It is intended for answering questions like:

## Features
- How much do different KV-aware routers help on the same trace?
- How much HBM/DRAM capacity is enough before routing dominates?
- How do prefix-locality policies behave under bucketed input-length pools?
- What is the gap between online LRU and offline-optimal cache capacity?

- **Architecture-derived roofline compute**: auto-derives FLOPs,
  attention coefficients, and weight-streaming costs from model
  architecture (MoE, MLA, GQA, DSA, sliding window).
- **HuggingFace config.json auto-parsing**: drop in any HF
  `config.json` and the simulator extracts layer counts, attention heads,
  MoE expert configs, MLA LoRA ranks, and DSA sparse parameters.
- **Built-in GPU hardware presets**: H100, H800, H20, A100-80GB,
  A100-40GB, B200 with tensor-parallel scaling (e.g. `8xb200`).
- **Two-tier KV cache hierarchy**: L0 (GPU HBM) + L1 (CPU DRAM) with
  LRU eviction and cross-instance RDMA fetch via a cluster-wide
  meta-store.
- **11 routing policies**: from baselines (random, round-robin) to
  cache-aware (min\_pd, prefix\_affinity) for systematic ablation.
- **Token-bucket link contention**: PCIe and RDMA bandwidth modeled with
  reservation-based token-bucket queues.
- **Oracle analysis**: computes theoretical hit-rate ceilings (infinite
  cache, Belady optimal, LRU) for gap analysis.
## What The Repo Models

- **Architecture-derived prefill cost** from model structure, including MoE,
  MLA, GQA, sliding-window attention, and DSA.
- **Two-tier KV hierarchy** with L0 GPU HBM and L1 host DRAM, plus remote
  RDMA fetches via a meta-store.
- **Single-pool and bucketed clusters**. Bucketed mode separates the service
  into input-length buckets with isolated instance pools and meta-stores.
- **Local instance routing and global bucket routing** with detailed
  per-request routing logs.
- **Trace replay with optional input-length filtering** so the same trace can
  be sliced into buckets without rewriting the source file.
- **Offline oracle analysis** for unlimited capacity, Belady, and LRU hit-rate
  ceilings.

## Highlights

- **HF `config.json` auto-loading**: point `model.config_json` at a model
  config and the simulator derives architecture parameters automatically.
- **Hardware presets**: `h100`, `h800`, `h20`, `h20-141g`, `a100-80gb`,
  `a100-40gb`, `b200`, and `b300`, plus TP variants such as `8xb200`.
- **18 local router modes** covering baselines, load-based, cache-aware,
  affinity, and TTFT-estimating policies.
- **2 global bucket router modes**: `strict_input_length` and `bucket_score`.
- **Detailed outputs**: `summary.json`, `per_request.csv`, `instances.csv`,
  `routing_log.jsonl`, plus `ablation.json` / `oracle.json` when applicable.

## Build

```bash
cargo build --release
# binary: target/release/kvcache-sim
```

Fetch the upstream trace (consumed as a git submodule):
If you want the public Qwen trace submodule as well:

```bash
git submodule update --init --recursive
```

## Usage

### 1. Run a single simulation
The release binary is:

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200-hf.yaml
target/release/kvcache-sim
```

Prints `summary.json` to stdout and writes the full output directory
(see [Outputs](#outputs) below).
## Quick Start

### 2. Compare routers on the same trace (ablation)
Validate a config:

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200.yaml
```

Run one simulation:

```bash
target/release/kvcache-sim run --config configs/glm5-8xb200.yaml
```

Compare several routers on the same trace:

```bash
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200-hf.yaml \
  --routers random,least_loaded,least_tokens,min_pd,prefix_affinity \
  --evict-policies lru \
  --output-dir runs/glm5_ablation
  --config configs/glm5-8xb200.yaml \
  --routers random,least_loaded,cache_score,cache_affinity,estimated_ttft
```

Writes `ablation.json` with one row per `router x evict_policy`.

`ablate` currently supports only `lru` as a valid eviction policy. The
aggregated output keeps the online prefill-time metrics
(`ttft_mean/p50/p95/p99`) and omits `e2e`.

The previous replay-based `belady` approximation has been removed from
the CLI because it was not an exact full-hierarchy Belady algorithm and
could produce misleading comparisons against `lru`.

### 3. Compute theoretical hit-rate ceilings (oracle)
Auto-pick the smallest cluster size that meets a TTFT target, then ablate at
that size:

```bash
# Cluster-aggregate capacity (default)
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --num-instances 64

# A single instance's HBM budget
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --per-instance

# Explicit capacity in blocks
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200-hf.yaml --capacity-blocks 200000
target/release/kvcache-sim ablate \
  --config configs/glm5-8xb200.yaml \
  --auto-instances \
  --auto-probe-router cache_score \
  --auto-target-ttft-mean 4.0
```

Reports three numbers:

- `unlimited.hit_rate`: absolute ceiling (infinite cache)
- `belady_finite.hit_rate`: optimal-eviction ceiling at the given capacity
- `lru_finite.hit_rate`: production LRU at the same capacity

Gap between `lru_finite` and `belady_finite` = headroom from a smarter
eviction policy. Gap between `belady_finite` and `unlimited` = headroom
only reachable by adding capacity.

### 4. Validate a config without running
Run the oracle:

```bash
target/release/kvcache-sim validate --config configs/glm5-8xb200-hf.yaml
target/release/kvcache-sim oracle \
  --config configs/glm5-8xb200.yaml \
  --per-instance
```

Parses the YAML, prints derived per-instance block budgets, and dumps
the first 5 trace records so you can sanity-check the path.
`run` prints `summary.json` to stdout and also writes the full output directory
under `sim.output_dir`.

## CLI overrides
## Current Command Boundaries

These flags work on **all** subcommands and override the YAML in place,
so the same config can be reused across sweeps:
The repository now supports both legacy single-pool clusters and bucketed
service topologies, but not every CLI path supports both yet.

- `run`: supports `cluster.num_instances` and `cluster.buckets`
- `validate`: supports `cluster.num_instances` and `cluster.buckets`
- `ablate`: currently **single-pool only**
- `ablate --evict-policies`: currently supports **`lru` only**
- `oracle`: currently **single-pool only**
- `--num-instances` override: currently **single-pool only**
- `--auto-instances`: currently **single-pool only**

In practice, bucket-aware experiments are ready in `run`, while fixed-placement
ablation and oracle analysis still reject `cluster.buckets`.

## Config Model

### Single-Pool Cluster

Use `cluster.num_instances` for the original flat instance pool:

```yaml
cluster:
  num_instances: 32
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
```

### Bucketed Service

Use `cluster.buckets` plus a `global_router` to model explicit input-length
buckets:

```yaml
cluster:
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    load_alpha: 1.5
    prefix_k: 8
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 8
    - name: long
      input_length_min: 32769
      input_length_max: 131072
      num_instances: 4
```

Rules enforced by config validation:

- `cluster.num_instances` and `cluster.buckets` are mutually exclusive
- bucket ranges must not overlap
- every bucket must have `num_instances > 0`
- `input_length_min <= input_length_max`
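
The validation rules above can be sketched in a few lines of Python. This is only an illustration of the listed rules, not the simulator's Rust implementation: the `Bucket` class, `validate_buckets`, and the error strings are hypothetical names, and ranges are assumed to be inclusive on both ends, as the `[input_length_min, input_length_max]` notation elsewhere in the README suggests.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    # Field names mirror the YAML keys under cluster.buckets.
    name: str
    input_length_min: int
    input_length_max: int
    num_instances: int


def validate_buckets(buckets, num_instances=None):
    """Return a list of rule violations; an empty list means the config passes."""
    errors = []
    if num_instances is not None and buckets:
        errors.append("cluster.num_instances and cluster.buckets are mutually exclusive")
    for b in buckets:
        if b.num_instances <= 0:
            errors.append(f"bucket {b.name!r}: num_instances must be > 0")
        if b.input_length_min > b.input_length_max:
            errors.append(f"bucket {b.name!r}: input_length_min > input_length_max")
    # Assumption: ranges are inclusive, so two neighbours (sorted by lower bound)
    # overlap when the next lower bound is <= the previous upper bound.
    ordered = sorted(buckets, key=lambda b: b.input_length_min)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.input_length_min <= prev.input_length_max:
            errors.append(f"buckets {prev.name!r} and {cur.name!r} overlap")
    return errors


# The two buckets from the bucketed-service YAML example.
buckets = [Bucket("short", 0, 32768, 8), Bucket("long", 32769, 131072, 4)]
print(validate_buckets(buckets))
```

With the `short`/`long` example the list is empty; changing the `long` lower bound to `32768` would report an overlap, because both buckets would then claim a request of exactly 32768 tokens.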

### CLI Overrides

These flags apply on top of the YAML config:

| Flag | Overrides |
|--------------------------|-------------------------------------------|
|------|-----------|
| `--num-instances <N>` | `cluster.num_instances` |
| `--max-requests <N>` | `sim.max_requests` |
| `--trace <PATH>` | `sim.trace_path` |
@@ -121,230 +181,206 @@ so the same config can be reused across sweeps:
| `--seed <N>` | `sim.seed` |
| `--precise-topk <N>` | `cluster.router.precise_probe_topk` |
| `--ttl-seconds <S>` | `cluster.meta_store.ttl_seconds` |
| `--input-length-min <N>` | `sim.input_length_min` |
| `--input-length-max <N>` | `sim.input_length_max` |

`oracle` additionally takes `--capacity-blocks <N>` / `--per-instance`
and `--out <PATH>`. `ablate` additionally takes `--routers <csv>` and
`--evict-policies <csv>` (currently only `lru`).
Subcommand-specific additions:

## Router modes
- `ablate`: `--routers`, `--evict-policies`, `--auto-instances`,
  `--auto-target-ttft-mean`, `--auto-candidates`, `--auto-probe-router`,
  `--jobs`
- `oracle`: `--capacity-blocks`, `--per-instance`, `--out`

Set `cluster.router.mode` in the YAML or list in `--routers`:
## Routing Modes

### Global Bucket Routers

Configured through `cluster.global_router.mode`.

| Mode | What it does |
|------|---------------|
| `strict_input_length` | Routes to the unique bucket whose `[input_length_min, input_length_max]` contains the request. |
| `bucket_score` | Scores every bucket using weighted length mismatch, aggregate queue load, and predicted cache miss. Can intentionally deviate from the strict length bucket. |
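
To make the difference between the two global modes concrete, here is a small Python sketch. `strict_input_length` follows the table directly. The README does not give the exact `bucket_score` formula, so the version below assumes a plain weighted sum (lowest score wins) over the three ingredients the table names; the `queue_load` and `predicted_miss_blocks` fields and all function names are hypothetical.

```python
def length_mismatch(bucket, input_length):
    """Tokens by which the request falls outside the bucket's inclusive range."""
    if input_length < bucket["input_length_min"]:
        return bucket["input_length_min"] - input_length
    if input_length > bucket["input_length_max"]:
        return input_length - bucket["input_length_max"]
    return 0


def strict_input_length(buckets, input_length):
    """Pick the unique bucket whose range contains the request length."""
    matches = [b for b in buckets if length_mismatch(b, input_length) == 0]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one bucket for length {input_length}")
    return matches[0]["name"]


def bucket_score(buckets, input_length, weights):
    """ASSUMPTION: score = weighted sum of mismatch, load, and predicted miss."""
    def score(b):
        return (weights["length_penalty_weight"] * length_mismatch(b, input_length)
                + weights["load_weight"] * b["queue_load"]
                + weights["cache_weight"] * b["predicted_miss_blocks"])
    return min(buckets, key=score)["name"]


# Illustrative state: the long bucket is heavily queued.
buckets = [
    {"name": "short", "input_length_min": 0, "input_length_max": 32768,
     "queue_load": 2, "predicted_miss_blocks": 10},
    {"name": "long", "input_length_min": 32769, "input_length_max": 131072,
     "queue_load": 500, "predicted_miss_blocks": 10},
]
weights = {"length_penalty_weight": 1.0, "load_weight": 1.0, "cache_weight": 1.0}

print(strict_input_length(buckets, 33000))    # always the containing bucket
print(bucket_score(buckets, 33000, weights))  # may deviate when load dominates
```

In this made-up state a 33000-token request belongs to `long` under the strict rule, but the assumed score sends it to `short` because the length mismatch (232 tokens) costs less than the queue backlog on `long`. The real scoring may normalize these terms differently.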

### Local Instance Routers

Configured through `cluster.router.mode`. All of these names are accepted by
`run`, and any of them can be passed to `ablate --routers` on single-pool
configs.

| Mode | Aliases | What it does |
|-------------------|------------------|--------------------------------------------------------------------------------------|
| `random` | | Uniform random. Baseline. |
| `round_robin` | `rr` | Deterministic round-robin. Baseline. |
| `least_loaded` | | `argmin(kv_blocks_used + alpha * queue_len)`. KV-blind load balance. |
| `least_tokens` | `lt` | `argmin(waiting_tokens)`. Pure load balance by queued compute work. |
| `ttl_aware` | `ttl` | Picks instance with longest prefix in the global TTL meta-store. Cache-only. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances' actual caches; charges probe latency into TTFT. |
| `min_pd` | `minpd`, `pd` | Minimizes `P*D` (prefill tokens x ongoing requests). Cluster-wide RDMA-aware. |
| `cache_load` | `cl` | Filters to least-loaded 1/4 instances, then picks best cache prefix. |
| `cache_score` | `cs` | Exponential scoring: `2^(alpha * queue_len + beta * miss_blocks)`. |
| `estimated_ttft` | `ettft`,`optimal`| Estimates `drain_time + fetch_time` per instance using architecture-aware compute. |
| `prefix_affinity` | `affinity`, `pa` | Rendezvous-hashed prefix fingerprinting for deterministic cache locality. |
|------|---------|---------------|
| `random` | | Uniform random baseline. |
| `round_robin` | `rr` | Deterministic round-robin baseline. |
| `least_loaded` | | Minimizes `kv_blocks_used + alpha * queue_len`. |
| `least_tokens` | `lt` | Minimizes queued token work. |
| `ttl_aware` | `ttl` | Uses the global TTL meta-store to chase the longest reusable prefix. |
| `precise` | `precise_aware` | Probes top-K least-loaded instances for actual cache contents and charges probe latency. |
| `min_pd` | `minpd`, `pd` | Minimizes `P * D` using ongoing load and prefix reuse. |
| `cache_load` | `cl` | Filters to lightly loaded instances, then chooses the best cache prefix. |
| `cache_affinity` | `caff`, `ca` | Strong cache-first scoring with rendezvous-based sticky homes for prefix families. |
| `cache_affinity_weak_rend` | `caff_weak` | Ablation: weak cache weights plus rendezvous placement. |
| `cache_affinity_strong_only` | `caff_strong` | Ablation: strong cache weights without rendezvous tie-breaking. |
| `cache_score` | `cs` | Exponential score over queue length and miss blocks. |
| `cache_score_strong` | `cs_strong`, `css` | Parity probe with stronger cache weighting than default `cache_score`. |
| `cache_score_ttl` | `csttl`, `cs_ttl` | `cache_score` variant that also uses TTL/meta-store visibility. |
| `estimated_ttft` | `ettft`, `optimal` | First-principles TTFT estimate per instance using compute plus KV movement. |
| `prefix_affinity` | `affinity`, `pa` | Deterministic prefix fingerprinting with affinity fan-out and load-aware selection. |
| `adaptive_affinity` | `aa` | Uses hot-prefix detection: affinity for short hot stems, TTFT optimization otherwise. |
| `lineage_affinity` | `la` | Combines parent stickiness, family homesets, and strong local cache scoring. |

### Router parameters
Router tuning knobs in `cluster.router`:

These fields in `cluster.router` tune specific routers:
| Field | Default | Used by |
|-------|---------|---------|
| `load_alpha` | `1.0` | `least_loaded`, `ttl_aware`, affinity families |
| `score_alpha` | `1.0` | `cache_score`, `cache_score_ttl` |
| `score_beta` | `0.1` | `cache_score`, `cache_score_ttl` |
| `prefix_k` | `8` | prefix and affinity fingerprinting |
| `affinity_fan_out` | `0` | `prefix_affinity`, `adaptive_affinity`, `lineage_affinity` |
| `precise_probe_latency_us` | `50.0` | `precise` |
| `precise_probe_topk` | `4` | `precise` |

| Field | Default | Used by | Description |
|--------------------------|---------|------------------|------------------------------------------------------|
| `load_alpha` | `1.0` | `least_loaded` | Weight of queue\_len vs kv\_blocks\_used |
| `score_alpha` | `1.0` | `cache_score` | Load weight in `2^(alpha*load + beta*miss)` |
| `score_beta` | `0.1` | `cache_score` | Cache-miss weight in `2^(alpha*load + beta*miss)` |
| `prefix_k` | `8` | `prefix_affinity`| Number of leading blocks for the prefix fingerprint |
| `affinity_fan_out` | `0` | `prefix_affinity`| Top-K affinity candidates (0 = auto: n/8, min 2) |
| `precise_probe_latency_us`| `50.0`| `precise` | Simulated per-probe latency (microseconds) |
| `precise_probe_topk` | `4` | `precise` | Number of instances probed |
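
The two scoring formulas quoted in the router tables are simple enough to compare side by side. The sketch below uses the documented defaults (`load_alpha = 1.0`, `score_alpha = 1.0`, `score_beta = 0.1`). It assumes the lowest score wins for `cache_score`, which the README implies but does not state, and the per-instance numbers are invented for illustration.

```python
def least_loaded_score(kv_blocks_used, queue_len, load_alpha=1.0):
    """kv_blocks_used + alpha * queue_len, as in the least_loaded row."""
    return kv_blocks_used + load_alpha * queue_len


def cache_score(queue_len, miss_blocks, score_alpha=1.0, score_beta=0.1):
    """2^(alpha * queue_len + beta * miss_blocks), as in the cache_score row."""
    return 2.0 ** (score_alpha * queue_len + score_beta * miss_blocks)


# Hypothetical per-instance state for one incoming request; miss_blocks is the
# number of the request's blocks that are NOT already cached on that instance.
instances = [
    {"id": 0, "kv_blocks_used": 900, "queue_len": 2, "miss_blocks": 0},
    {"id": 1, "kv_blocks_used": 100, "queue_len": 1, "miss_blocks": 40},
    {"id": 2, "kv_blocks_used": 500, "queue_len": 0, "miss_blocks": 40},
]

by_load = min(instances,
              key=lambda i: least_loaded_score(i["kv_blocks_used"], i["queue_len"]))
by_cache = min(instances,  # assumption: lower cache_score is better
               key=lambda i: cache_score(i["queue_len"], i["miss_blocks"]))

print("least_loaded picks instance", by_load["id"])
print("cache_score picks instance", by_cache["id"])
```

Here `least_loaded` chooses the emptiest instance (1) and ignores the cache, while `cache_score` accepts a slightly longer queue on instance 0 because all of the request's blocks are already resident there.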
## Model And Hardware Configuration

### Router design spectrum
### Model Config

```
Cache-only                       Hybrid                             Load-only
(hot-spot risk)                                                 (cache-blind)
┌─────────┬───────────┬───────────┬────────────┬───────────┬───────────┐
ttl_aware precise     cache_score  min_pd      prefix_     least_      random
                      cache_load               affinity    loaded
                      est_ttft                             least_tokens
```

`prefix_affinity` sits in a unique position: it builds **proactive cache
locality** by consistently routing same-prefix requests to the same
instances (via rendezvous hashing), rather than reactively chasing
existing cache state. This yields the highest L0 hit rates while
maintaining load balance through within-group drain-time-aware selection.
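
A minimal sketch of the idea described above: fingerprint the first `prefix_k` block hashes, rank instances by rendezvous (highest-random-weight) hashing, keep the top `affinity_fan_out` candidates, and pick the candidate with the shortest drain time. The hash function (SHA-256), the function names, and the `drain_time_s` input are assumptions for illustration; only `prefix_k`, `affinity_fan_out`, and the "0 = auto: n/8, min 2" rule come from the parameter table.

```python
import hashlib


def prefix_fingerprint(hash_ids, prefix_k=8):
    """Fingerprint the first prefix_k block hashes of a request."""
    head = ",".join(str(h) for h in hash_ids[:prefix_k])
    return hashlib.sha256(head.encode()).hexdigest()


def affinity_candidates(fingerprint, num_instances, fan_out=0):
    """Rendezvous hashing: rank instances by hash(fingerprint, instance)."""
    if fan_out == 0:
        fan_out = max(2, num_instances // 8)  # documented auto rule: n/8, min 2

    def weight(instance):
        digest = hashlib.sha256(f"{fingerprint}:{instance}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    ranked = sorted(range(num_instances), key=weight, reverse=True)
    return ranked[:fan_out]


def route(hash_ids, drain_time_s, prefix_k=8, fan_out=0):
    """Pick the least-backlogged instance inside the affinity group."""
    fingerprint = prefix_fingerprint(hash_ids, prefix_k)
    candidates = affinity_candidates(fingerprint, len(drain_time_s), fan_out)
    return min(candidates, key=lambda i: drain_time_s[i])


# Two requests sharing their first 8 blocks land in the same candidate group,
# regardless of what follows the shared prefix.
shared = list(range(8))
group_a = affinity_candidates(prefix_fingerprint(shared + [100]), 32)
group_b = affinity_candidates(prefix_fingerprint(shared + [200, 201]), 32)
print(group_a == group_b, len(group_a))
```

Because the ranking depends only on the fingerprint and the instance index, no shared state is needed to keep same-prefix requests together, and removing an instance only moves the prefixes that ranked it in their top group.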

## Model configuration

### HuggingFace config.json (recommended)

Point `model.config_json` at any HF `config.json` to auto-extract
architecture:
Recommended pattern:

```yaml
model:
  config_json: ../models/GLM-5/config.json
  dtype_bytes: 2 # required (not in HF schema)
  block_size_tokens: 512 # required (not in HF schema)
  name: glm-5
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512
```

Auto-detected features:
Notes:

| Feature | Detection trigger | What it extracts |
|-----------|-------------------------------|----------------------------------------------|
| **MoE** | `n_routed_experts`, `num_local_experts`, or `num_experts` | Expert count, active experts, shared experts, expert FFN width |
| **MLA** | `kv_lora_rank` present | KV/Q LoRA ranks, qk\_rope/nope dims, v\_head\_dim |
| **DSA** | `first_k_dense_replace` present| Dense window, sparse stride, first dense layers |
| **Sliding window** | `sliding_window` present | Window size |
| **GQA** | `num_key_value_heads < num_attention_heads` | KV head count for grouped-query attention |
- `config_json` is resolved relative to the YAML file
- explicit YAML fields override values loaded from the model config
- `compute_dtype` selects the compute FLOPS tier
- `weight_dtype` controls model-weight bytes separately from KV-cache bytes
- `dtype_bytes` sizes the KV cache

Explicit YAML fields always override the auto-detected values.
The architecture loader understands:

### Inline specification
- MoE expert counts and active experts
- MLA LoRA ranks and attention dimensions
- DSA sparse-attention parameters
- sliding-window attention
- GQA from KV-head count

Alternatively, specify architecture fields directly:
### Hardware Presets

```yaml
model:
  name: qwen2.5-coder-7b
  num_layers: 28
  hidden_size: 3584
  num_attention_heads: 28
  num_kv_heads: 4
  head_dim: 128
  intermediate_size: 18944
  dtype_bytes: 2
  block_size_tokens: 16
```

When `hidden_size` is present, the compute model is auto-derived
(architecture mode). Without it, you must supply legacy manual
coefficients (`flops_per_token_prefill`, `attn_quadratic_coeff`, etc.).

### Bundled model configs

| Model | Path | Architecture |
|-------|------|--------------|
| GLM-5 (744B/40B-active) | `models/GLM-5/config.json` | MoE (256 routed, 8 active, 1 shared) + MLA + DSA |
| Qwen3-Coder-480B-A35B FP8 | `models/Qwen3-Coder-480B-A35B-Instruct-FP8/config.json` | MoE (160 experts, 8 active) + GQA |

## Hardware configuration

### Using presets (recommended)

Set `hardware.type` to a preset name; individual fields can override:
Recommended pattern:

```yaml
hardware:
  type: 8xb200
  hbm_bytes: 500.0e9 # override KV budget (after model weights)
```

Available presets:

| Preset | FLOPS | HBM | Mem BW | PCIe |
|-------------|------------|---------|------------|------|
| `h100` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h800` | 989 TFLOPS | 80 GB | 3.35 TB/s | Gen5 |
| `h20` | 148 TFLOPS | 96 GB | 4.0 TB/s | Gen5 |
| `a100-80gb` | 312 TFLOPS | 80 GB | 2.0 TB/s | Gen4 |
| `a100-40gb` | 312 TFLOPS | 40 GB | 1.555 TB/s | Gen4 |
| `b200` | 2.25 PFLOPS| 192 GB | 8.0 TB/s | Gen6 |

Prefix with `2x`, `4x`, or `8x` for tensor-parallel groups (e.g.
`8xh20`). FLOPS, memory bandwidth, and HBM scale linearly; RDMA and
DRAM are set to sensible per-node defaults.
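
The linear tensor-parallel scaling can be illustrated with a short Python sketch. The per-GPU numbers are copied from the preset table; `PRESETS` and `resolve_preset` are hypothetical names, and the real loader also sets PCIe, RDMA, and DRAM fields that are left out here.

```python
import re

# Per-GPU values from the preset table: (FLOPS, HBM bytes, memory bandwidth B/s).
PRESETS = {
    "h100": (989e12, 80e9, 3.35e12),
    "h800": (989e12, 80e9, 3.35e12),
    "h20": (148e12, 96e9, 4.0e12),
    "a100-80gb": (312e12, 80e9, 2.0e12),
    "a100-40gb": (312e12, 40e9, 1.555e12),
    "b200": (2.25e15, 192e9, 8.0e12),
}


def resolve_preset(name):
    """Split an optional 2x/4x/8x prefix and scale the per-GPU preset linearly."""
    match = re.fullmatch(r"(?:(2|4|8)x)?(.+)", name.lower())
    tp = int(match.group(1) or 1)
    flops, hbm, mem_bw = PRESETS[match.group(2)]  # KeyError for unknown presets
    return {
        "tp": tp,
        "gpu_flops": tp * flops,
        "hbm_bytes": tp * hbm,
        "gpu_mem_bw": tp * mem_bw,
    }


print(resolve_preset("8xb200"))
```

For `8xb200` this gives `gpu_flops = 1.8e16` and `gpu_mem_bw = 6.4e13`, which are the same values that appear in the README's inline hardware specification.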

### Inline specification

```yaml
hardware:
  gpu_flops: 1.80e16
  gpu_mem_bw: 6.40e13
  hbm_bytes: 500.0e9
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  pcie_bw: 128.0e9
  pcie_latency_us: 4.0
  rdma_bw: 50.0e9
  rdma_latency_us: 6.0
  max_batch_slots: 256
  prefill_chunk_tokens: 4096
```

## Architecture-aware compute model
Available preset families:

The simulator derives a **roofline prefill model** from model
architecture:
- `h100`, `h800`, `h20`, `h20-141g`
- `a100-80gb`, `a100-40gb`
- `b200`, `b300`
- TP forms such as `2xh100`, `4xh20`, `8xb200`, `8xb300`

```
prefill_time(N tokens) = max(compute_time, memory_time)
## Bundled Configs

compute_time = layers * (N * linear_flops + attn_coeff * N * effective_ctx(N)) / gpu_flops
memory_time = layers * weight_bytes_per_layer / gpu_mem_bw
Representative configs in `configs/`:

| Config | Notes |
|--------|-------|
| `glm5-8xb200.yaml` | GLM-5 on `8xb200`, single-pool baseline config. |
| `glm5-fp8-8xh20-141g.yaml` | GLM-5-FP8 on `8xh20-141g`, with a 0-32k input-length filter. |
| `glm5-fp8-8xh20-141g-ca-tuned.yaml` | Same family as above, tuned for `cache_affinity`. |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 on `8xb300`. |
| `glm5-nvfp4-fp8compute-8xb300.yaml` | NVFP4 weights with FP8 compute on `8xb300`. |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder-480B-A35B on `8xh20`. |

Many of the `glm5-*n*.yaml` configs are bucket/slice-specific experiment
points that use `sim.input_length_min` and `sim.input_length_max`.

## Trace Inputs

This repository currently contains two trace sources:

- `bailian-traces/`
  - `glm_coder_blksz_512_040915-040917.jsonl`
  - `qwen3_coder_blksz_512_040915-040917.jsonl`
- `qwen-bailian-usagetraces-anon/` submodule
  - public 16-token-block Qwen traces such as
    `qwen_coder_blksz_16.jsonl` and `qwen_traceB_blksz_16.jsonl`

The simulator expects JSONL records with fields like:

```json
{
  "chat_id": 159,
  "parent_chat_id": 55,
  "timestamp": 61.114,
  "input_length": 521,
  "output_length": 132,
  "type": "text",
  "turn": 2,
  "hash_ids": [1089, 1090, 1091]
}
```

- **MoE**: only active experts contribute to FLOPs and weight streaming
  (shared experts always counted)
- **MLA**: compressed KV projections reduce attention FLOPs; KV cache
  uses `kv_lora_rank + qk_rope_head_dim` instead of `2 * kv_heads * head_dim`
- **DSA**: `effective_ctx = min(N, dense_window) + max(0, N - dense_window) / sparse_stride`,
  with the first K layers using full dense attention
- **GQA**: fewer KV heads reduce both attention compute and KV cache size
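
The roofline formulas for `prefill_time`, `compute_time`, and `memory_time`, together with the DSA `effective_ctx` rule in the list above, can be written out directly. This Python sketch is a simplification: it applies the same `effective_ctx` to every layer (ignoring the first-K-dense-layers rule), and the model coefficients in `params` are made-up round numbers. Only `gpu_flops` and `gpu_mem_bw` are taken from the README's inline hardware example.

```python
def effective_ctx(n_tokens, dense_window=None, sparse_stride=1):
    """DSA rule: dense inside the window, strided sampling beyond it."""
    if dense_window is None:  # plain dense attention
        return float(n_tokens)
    return min(n_tokens, dense_window) + max(0, n_tokens - dense_window) / sparse_stride


def prefill_time(n_tokens, layers, linear_flops, attn_coeff, weight_bytes_per_layer,
                 gpu_flops, gpu_mem_bw, dense_window=None, sparse_stride=1):
    """Roofline: the slower of compute and weight streaming bounds the prefill."""
    ctx = effective_ctx(n_tokens, dense_window, sparse_stride)
    compute_time = layers * (n_tokens * linear_flops + attn_coeff * n_tokens * ctx) / gpu_flops
    memory_time = layers * weight_bytes_per_layer / gpu_mem_bw
    return max(compute_time, memory_time)


# Illustrative coefficients (NOT derived from a real model config).
params = dict(layers=4, linear_flops=1e9, attn_coeff=1e4, weight_bytes_per_layer=1e9,
              gpu_flops=1.80e16, gpu_mem_bw=6.40e13)

print(prefill_time(1, **params))     # tiny prefill: weight streaming dominates
print(prefill_time(8192, **params))  # long prefill: compute dominates
print(prefill_time(8192, dense_window=2048, sparse_stride=4, **params))
```

With these numbers a one-token prefill is memory-bound (the weights must still be streamed once), while an 8192-token prefill is compute-bound, and enabling the sparse window shrinks the attention term because `effective_ctx(8192)` drops from 8192 to 3584.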

## Bundled config files

| Config | Model | Hardware | Instances | Trace |
|--------|-------|----------|-----------|-------|
| `glm5-8xb200-hf.yaml` | GLM-5 via HF config.json | 8xB200 preset | 32 | GLM coder blk512 |
| `glm5-nvfp4-8xb300.yaml` | GLM-5-NVFP4 via HF config.json | 8xB300 preset | 8 | GLM coder blk512 |
| `qwen3-coder-480b-8xh20.yaml` | Qwen3-Coder via HF | 8xH20 preset | 32 | Qwen coder blk16 |
Only prefill-side behavior is modeled; `output_length` is used only for a
decode tail in completion metrics.
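
For quick experiments outside the simulator, the JSONL trace format and the `sim.input_length_min` / `sim.input_length_max` filter can be mimicked in a few lines of Python. `load_trace` is a hypothetical helper, the bounds are assumed to be inclusive, and the second sample record is invented to show a request that falls outside the 0-32768 slice.

```python
import json


def load_trace(lines, input_length_min=None, input_length_max=None):
    """Yield trace records whose input_length lies in the optional inclusive range."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        length = record["input_length"]
        if input_length_min is not None and length < input_length_min:
            continue
        if input_length_max is not None and length > input_length_max:
            continue
        yield record


sample = [
    # Record from the README example.
    json.dumps({"chat_id": 159, "parent_chat_id": 55, "timestamp": 61.114,
                "input_length": 521, "output_length": 132, "type": "text",
                "turn": 2, "hash_ids": [1089, 1090, 1091]}),
    # Hypothetical long request, outside the 0-32768 slice.
    json.dumps({"chat_id": 160, "parent_chat_id": 159, "timestamp": 75.0,
                "input_length": 40000, "output_length": 64, "type": "text",
                "turn": 3, "hash_ids": [1089, 1090, 1091, 2000]}),
]

kept = list(load_trace(sample, input_length_min=0, input_length_max=32768))
print([r["chat_id"] for r in kept])
```

To read a real trace, pass an open file object instead of the `sample` list, for example `load_trace(open(path), 0, 32768)`.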

## Outputs

Each run writes a directory under `sim.output_dir`:
Each `run` writes a directory under `sim.output_dir`:

| File | Contents |
|----------------------|----------------------------------------------------------------------------|
| `summary.json` | Router, throughput, TTFT p50/p95/p99, hit rates per tier, total RDMA/PCIe bytes |
| `per_request.csv` | `req_id,arrival,ttft,e2e,instance,total_blocks,l0_hit,l1_hit,remote_hit,miss,rdma_bytes,pcie_bytes,probe_overhead_s` |
| `instances.csv` | `t,instance,queue_len,kv_blocks_used,kv_blocks_total,busy` per sample |
| `routing_log.jsonl` | One JSON per request: all router candidates + chosen instance + reason |
|------|----------|
| `summary.json` | Aggregate throughput, TTFT/E2E percentiles, hit rates, RDMA bytes, PCIe bytes. |
| `per_request.csv` | Per-request latency and cache stats, including `bucket`, `instance`, and `length_bucket_match`. |
| `instances.csv` | Periodic per-instance samples with `bucket`, `instance`, `queue_len`, and KV usage. |
| `routing_log.jsonl` | One JSON route decision per request, including `global_mode`, `mode`, `chosen_bucket`, candidate buckets, and candidate instances. |

For `ablate`: an extra `ablation.json` with one summary per router.
For `oracle`: an `oracle.json` with the three hit-rate analyses.
Additional outputs:

### Reading results quickly
- `ablate`: writes `ablation.json`
- `oracle`: writes `oracle.json`
- `ablate --auto-instances`: writes calibration runs under
  `<output_dir>/auto_instances/`

Quick inspection examples:

```bash
# Pretty-print the summary
cat runs/glm5_8xb200_hf/summary.json | jq .

# Compare all routers from an ablation
cat runs/glm5_8xb200_hf/ablation.json | \
  jq '.[] | {router, ttft_mean, ttft_p50, hit_rate_l0, miss_rate}'

# Sort by TTFT
cat runs/glm5_8xb200_hf/ablation.json | \
  jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0}'
jq . runs/glm5_8xb200/summary.json
```

## Trace format
```bash
jq 'sort_by(.ttft_mean) | .[] | {router, ttft_mean, hit_rate_l0, miss_rate}' \
  runs/glm5_8xb200/ablation.json
```

The simulator reads the Alibaba
[`qwen-bailian-usagetraces-anon`](https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon)
JSONL schema. Each record has `chat_id`, `timestamp`, `input_length`,
`output_length`, and `hash_ids` (block hashes, typically 16 tokens each).
Only the input side is used.
## Oracle Semantics

Available traces in the submodule:
`oracle` computes three hit-rate references at a chosen cache capacity:

| Trace | Requests | Description |
|-------|----------|-------------|
| `qwen_coder_blksz_16.jsonl` | 43k | Qwen Coder serving traffic |
| `qwen_traceA_blksz_16.jsonl` | 43k | Qwen general traffic A |
| `qwen_traceB_blksz_16.jsonl` | 173k | Qwen general traffic B |
| `qwen_thinking_blksz_16.jsonl` | 11k | Qwen reasoning/thinking traffic |
- `unlimited.hit_rate`: absolute ceiling with infinite capacity
- `belady_finite.hit_rate`: offline-optimal eviction at the chosen capacity
- `lru_finite.hit_rate`: LRU at the same capacity

When `sim.input_length_min` / `sim.input_length_max` are set, `oracle` still
feeds the full trace into cache state but only counts requests inside the
selected input-length range. That matches the intended "measure one bucket
inside a mixed workload" interpretation.

The gap from `lru_finite` to `belady_finite` is eviction-policy headroom. The
gap from `belady_finite` to `unlimited` is pure capacity headroom.
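
The three references can be reproduced on a toy block-access sequence. The sketch below is a block-level simplification in Python, not the simulator's oracle: it treats the trace as a flat list of block IDs, uses classic demand-paging Belady (every miss is inserted), and does not model the two-tier hierarchy or the input-length filter. All function names are hypothetical.

```python
from collections import OrderedDict


def unlimited_hit_rate(accesses):
    """Infinite cache: only the first touch of each block misses."""
    seen, hits = set(), 0
    for block in accesses:
        hits += block in seen
        seen.add(block)
    return hits / len(accesses)


def lru_hit_rate(accesses, capacity):
    """LRU with `capacity` blocks (capacity must be >= 1)."""
    cache, hits = OrderedDict(), 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return hits / len(accesses)


def belady_hit_rate(accesses, capacity):
    """Belady: on a miss with a full cache, evict the block reused farthest ahead."""
    inf = float("inf")
    next_use, last_seen = [inf] * len(accesses), {}
    for i in range(len(accesses) - 1, -1, -1):  # precompute next-use indices
        next_use[i] = last_seen.get(accesses[i], inf)
        last_seen[accesses[i]] = i
    cache, hits = {}, 0                          # block -> index of its next use
    for i, block in enumerate(accesses):
        if block in cache:
            hits += 1
        elif len(cache) >= capacity:
            del cache[max(cache, key=cache.get)]
        cache[block] = next_use[i]
    return hits / len(accesses)


accesses = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
print(unlimited_hit_rate(accesses), belady_hit_rate(accesses, 3), lru_hit_rate(accesses, 3))
```

On this sequence with a capacity of 3 blocks the rates are 0.6 (unlimited), 0.5 (Belady), and 0.4 (LRU): a 0.1 gap that a better eviction policy could close, and a further 0.1 that only more capacity can recover.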

## Testing

@@ -352,6 +388,5 @@ Available traces in the submodule:
cargo test --release
```

28 tests: 27 unit tests (compute model, HF config parsing, hardware
presets) + 1 integration smoke test that runs routers on a synthetic
shared-prefix trace and asserts the expected hit-rate ordering.
The test suite covers config parsing, hardware presets, routing behavior,
bucket-aware service semantics, oracle logic, and smoke-style end-to-end runs.

35
configs/glm5-fp8-8xh20-141g-0-32k-n56.yaml
Normal file
@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G for the 0-32k bucket.
# Chosen to keep the best policy's mean TTFT below 5s.

model:
  config_json: ../models/GLM-5-FP8/config.json
  name: glm-5-fp8
  compute_dtype: fp8
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xh20-141g
  hbm_bytes: 300.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 56
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_fp8_8xh20_141g_ablation_0_32768_n56
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 0
  input_length_max: 32768
38
configs/glm5-fp8-8xh20-141g-ca-tuned.yaml
Normal file
@@ -0,0 +1,38 @@
# GLM-5-FP8 (ZhipuAI/GLM-5-FP8) on 8 x H20-141G (N3E).
# Tuned for the 0-32768 input-length slice of
# bailian-traces/glm_coder_blksz_512_040915-040917.jsonl.

model:
  config_json: ../models/GLM-5-FP8/config.json
  name: glm-5-fp8
  compute_dtype: fp8
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xh20-141g
  hbm_bytes: 300.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 64
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    # Tuned on this filtered GLM coder workload: stronger queue penalty than
    # the default 1.0 keeps cache_affinity's locality gains while reducing TTFT.
    load_alpha: 1.5
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_fp8_8xh20_141g_ca_tuned
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 0
  input_length_max: 32768
35
configs/glm5-fp8-8xh20-141g-l1-medium.yaml
Normal file
@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G, 0-32768 slice.
# Analysis config: medium L1 (~10% of the default DRAM KV budget).

model:
  config_json: ../models/GLM-5-FP8/config.json
  name: glm-5-fp8
  compute_dtype: fp8
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xh20-141g
  hbm_bytes: 300.0e9
  dram_bytes: 1.5e11
  max_batch_slots: 256

cluster:
  num_instances: 64
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: min_pd
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_fp8_8xh20_141g_l1_medium
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 0
  input_length_max: 32768
35
configs/glm5-fp8-8xh20-141g-l1-none.yaml
Normal file
@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G, 0-32768 slice.
# Analysis config: effectively disable L1/remote KV by shrinking DRAM to ~1 block.

model:
  config_json: ../models/GLM-5-FP8/config.json
  name: glm-5-fp8
  compute_dtype: fp8
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xh20-141g
  hbm_bytes: 300.0e9
  dram_bytes: 1.0
  max_batch_slots: 256

cluster:
  num_instances: 64
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: min_pd
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_fp8_8xh20_141g_l1_none
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 0
  input_length_max: 32768
35
configs/glm5-fp8-8xh20-141g-l1-small.yaml
Normal file
@@ -0,0 +1,35 @@
# GLM-5-FP8 on 8 x H20-141G, 0-32768 slice.
# Analysis config: small L1 (~1% of the default DRAM KV budget).

model:
  config_json: ../models/GLM-5-FP8/config.json
  name: glm-5-fp8
  compute_dtype: fp8
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xh20-141g
  hbm_bytes: 300.0e9
  dram_bytes: 1.5e10
  max_batch_slots: 256

cluster:
  num_instances: 64
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: min_pd
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_fp8_8xh20_141g_l1_small
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 0
  input_length_max: 32768
39
configs/glm5-fp8-8xh20-141g.yaml
Normal file
@@ -0,0 +1,39 @@
# GLM-5-FP8 (ZhipuAI/GLM-5-FP8) on 8 x H20-141G (N3E).
# Architecture auto-loaded from the upstream ModelScope config.json.
#
# 8 x 141 GB = 1128 GB total HBM. With ~744 GB FP8 weights resident,
# keep the KV budget conservative to leave room for scales, BF16 holdouts,
# allocator slack, and runtime activations.

model:
  config_json: ../models/GLM-5-FP8/config.json
  name: glm-5-fp8
  compute_dtype: fp8
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xh20-141g
  hbm_bytes: 300.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 64
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: min_pd
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_fp8_8xh20_141g
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 0
  input_length_max: 32768
36
configs/glm5-nvfp4-8xb200-32k-85k-n5.yaml
Normal file
@@ -0,0 +1,36 @@
# GLM-5-NVFP4 on 8 x B200 for the 32k-85k bucket.
# Chosen to keep the best policy's mean TTFT below 5s.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb200
  hbm_bytes: 1150.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 5
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_8xb200_ablation_32769_87040_n5
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 32769
  input_length_max: 87040
36
configs/glm5-nvfp4-8xb200.yaml
Normal file
@@ -0,0 +1,36 @@
# GLM-5-NVFP4 (nvidia/GLM-5-NVFP4) on 8 x B200 (192GB each).
# Architecture auto-loaded from HuggingFace config.json.
#
# FP4 weights: ~744B params * 0.5 bytes = ~372 GB across 8 GPUs.
# Total HBM: 8 * 192 GB = 1536 GB. Keep the KV budget below the raw
# remainder to leave room for runtime activations and allocator slack.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb200
  hbm_bytes: 1150.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 8
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: prefix_affinity
    prefix_k: 8
    load_alpha: 1.0

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_8xb200
  sample_interval_s: 1.0
  seed: 42
37
configs/glm5-nvfp4-8xb300-128k-plus-n1.yaml
Normal file
@@ -0,0 +1,37 @@
# GLM-5-NVFP4 on 8 x B300 for the 128k++ bucket.
# A single instance already keeps mean TTFT below 5s, and routing is
# effectively irrelevant at N=1 because every request lands on the same node.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 1
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    precise_probe_latency_us: 50.0
    precise_probe_topk: 1
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_8xb300_ablation_131073_plus_n1
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 131073
  input_length_max: 4294967295
36
configs/glm5-nvfp4-8xb300-85k-128k-n2.yaml
Normal file
@@ -0,0 +1,36 @@
# GLM-5-NVFP4 on 8 x B300 for the 85k-128k bucket.
# Chosen to keep the best policy's mean TTFT below 5s.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 2
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity_strong_only
    precise_probe_latency_us: 50.0
    precise_probe_topk: 4
    load_alpha: 1.0
    prefix_k: 8

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_8xb300_ablation_87041_131072_n2
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 87041
  input_length_max: 131072
@@ -7,7 +7,8 @@
 model:
   config_json: ../models/GLM-5-NVFP4/config.json
   name: glm-5-nvfp4
-  compute_dtype: fp4  # FP4 weights → selects FP4 tensor core FLOPS
+  compute_dtype: fp8  # FP8 tensor-core execution
+  weight_dtype: fp4  # NVFP4 weights still set the HBM budget
   dtype_bytes: 1  # FP8 KV cache
   block_size_tokens: 512

32
configs/glm5-nvfp4-fp8compute-8xb200-32k-85k-n9.yaml
Normal file
@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B200 for the 32k-85k bucket.
# NVFP4 weights, FP8 compute. Chosen to keep the best policy below 5 s TTFT.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb200
  hbm_bytes: 1150.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 9
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: min_pd

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_fp8compute_8xb200_ablation_32769_87040_n9
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 32769
  input_length_max: 87040
32
configs/glm5-nvfp4-fp8compute-8xb200.yaml
Normal file
@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B200 with FP8 tensor-core compute.
# Weights remain stored in NVFP4, so HBM budget follows FP4 storage.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb200
  hbm_bytes: 1150.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 8
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: prefix_affinity
    prefix_k: 8
    load_alpha: 1.0

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_fp8compute_8xb200
  sample_interval_s: 1.0
  seed: 42
33
configs/glm5-nvfp4-fp8compute-8xb300-128k-plus-n1.yaml
Normal file
@@ -0,0 +1,33 @@
# GLM-5-NVFP4 on 8 x B300 for the 128k++ bucket.
# NVFP4 weights, FP8 compute. Routing is effectively irrelevant at one instance.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 1
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity
    precise_probe_topk: 1

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_fp8compute_8xb300_ablation_131073_plus_n1
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 131073
  input_length_max: 4294967295
32
configs/glm5-nvfp4-fp8compute-8xb300-85k-128k-n4.yaml
Normal file
@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B300 for the 85k-128k bucket.
# NVFP4 weights, FP8 compute. Chosen to keep the best policy below 5 s TTFT.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 4
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: cache_affinity

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_fp8compute_8xb300_ablation_87041_131072_n4
  sample_interval_s: 1.0
  seed: 42
  input_length_min: 87041
  input_length_max: 131072
32
configs/glm5-nvfp4-fp8compute-8xb300.yaml
Normal file
@@ -0,0 +1,32 @@
# GLM-5-NVFP4 on 8 x B300 with FP8 tensor-core compute.
# Weights remain stored in NVFP4, so HBM budget follows FP4 storage.

model:
  config_json: ../models/GLM-5-NVFP4/config.json
  name: glm-5-nvfp4
  compute_dtype: fp8
  weight_dtype: fp4
  dtype_bytes: 1
  block_size_tokens: 512

hardware:
  type: 8xb300
  hbm_bytes: 1900.0e9
  dram_bytes: 1.5e12
  max_batch_slots: 256

cluster:
  num_instances: 8
  meta_store:
    ttl_seconds: 300.0
  router:
    mode: prefix_affinity
    prefix_k: 8
    load_alpha: 1.0

sim:
  trace_path: bailian-traces/glm_coder_blksz_512_040915-040917.jsonl
  max_requests: null
  output_dir: runs/glm5_nvfp4_fp8compute_8xb300
  sample_interval_s: 1.0
  seed: 42
449
docs/superpowers/specs/2026-04-17-bucket-aware-routing-design.md
Normal file
@@ -0,0 +1,449 @@
# Bucket-Aware Routing Design

## Background

The simulator currently has a single global `Cluster`:

- All requests in the trace replay share one set of `instances`
- The router picks the target instance directly from the global instance pool
- `meta_store`, L0/L1 cache visibility, and remote RDMA are also shared globally

This does not match the target architecture, which requires:

- Multiple explicitly defined input-length buckets inside the service
- An independent instance pool per bucket, with the instance count given explicitly in the config
- Strict isolation of cache / meta-store / remote visibility between buckets
- A router that is not only an intra-bucket scheduler but is also aware of buckets from a global view
- Using the simulator later to study:
  - Whether buckets should be distinguished at all
  - The difference between strict and non-strict dispatch by input length
  - The coupling and benefit between the bucket policy and the intra-bucket instance policy

Therefore, this refactor must not hard-code "pick the bucket by input length first" into the service layer; bucket selection has to become part of the global router's decision surface.

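## Goals

The goal of this design is to refactor the simulator into a two-level scheduling architecture:

1. The global router selects the target bucket among all buckets
2. The local router selects the target instance within the chosen bucket

It must also satisfy the following:

- Buckets are defined explicitly in the config file
- All buckets share the same local router configuration
- Buckets are fully isolated and do not share L0/L1/meta-store/remote views
- Enough metrics / routing logs are kept to study the effect of bucket policies later
- The instance-level router implementations in `src/router/*` are reused as much as possible, rather than rewriting every router into a flat cross-bucket scorer

Non-goals:

- Phase one does not support per-bucket custom router algorithms
- Phase one does not support cross-bucket cache sharing
- Phase one does not derive bucket boundaries or per-bucket instance counts automatically
- Phase one does not implement "global flat instance pool" semantics
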
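## Options Compared

### Option A: the service layer picks the bucket by input length; the router only handles instances inside the bucket

Pros:

- Least invasive to the existing code
- The local router needs almost no changes

Cons:

- Cannot study "what happens if the router does not dispatch strictly by input length"
- The bucket policy is frozen and cannot be compared independently of the instance policy
- The global view can only observe; it cannot make real decisions

### Option B: two-level routing; the global router picks the bucket, the local router picks the instance inside it

Pros:

- Bucket policy and instance policy are cleanly decoupled
- Matches the target architecture and suits controlled experiments
- Reuses the existing router implementations as local routers to the greatest extent

Cons:

- Requires a new service-layer summary view and a global router interface
- driver / events / metrics must carry the bucket dimension explicitly

### Option C: a flat global router that scores every instance across buckets together

Pros:

- Superficially the most flexible

Cons:

- Bucket policy and instance policy are mixed together, so experiments are hard to interpret
- Most existing routers would have to be rewritten
- It tends to blur the physical boundary that "a bucket is an independent instance pool"

Recommended: Option B.

Rationale: a bucket is a service-level topology and isolation boundary, while instance selection is a local scheduling problem inside a bucket. The two levels should be modeled separately; otherwise it will not be possible to answer cleanly "do buckets themselves add value" and "is the instance-level routing algorithm effective".
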
## Configuration Design

`cluster` is extended from today's single-instance-pool configuration to an explicit bucket configuration.

Target YAML shape:

```yaml
cluster:
  meta_store:
    ttl_seconds: 1000.0
  router:
    mode: cache_affinity
    precise_probe_latency_us: 10.0
    precise_probe_topk: 4
    load_alpha: 0.1
    score_alpha: 1.0
    score_beta: 0.1
    prefix_k: 8
    affinity_fan_out: 0
  global_router:
    mode: strict_input_length
    length_penalty_weight: 1.0
    load_weight: 1.0
    cache_weight: 1.0
  buckets:
    - name: short
      input_length_min: 0
      input_length_max: 32768
      num_instances: 3
    - name: medium
      input_length_min: 32769
      input_length_max: 81920
      num_instances: 4
    - name: long
      input_length_min: 81921
      input_length_max: 131072
      num_instances: 3
```

Where:

- `cluster.router` continues to describe the intra-bucket local router configuration, shared by all buckets
- `cluster.global_router` is new and describes the bucket selection policy
- `cluster.buckets` explicitly describes the service topology

Phase one should stay compatible with the old configuration:

- If only `cluster.num_instances` is given, run in single-bucket mode
- If `cluster.buckets` is given, run in multi-bucket mode
- If both are present, fail with an error to avoid ambiguity

Configuration validation constraints:

1. `buckets` is non-empty
2. Every bucket has `num_instances > 0`
3. `input_length_min <= input_length_max`
4. Bucket ranges do not overlap
5. After sorting the buckets, a lookup must hit exactly one bucket
6. The old mode and the new mode are mutually exclusive

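The constraints above, together with the legacy `cluster.num_instances` fallback, could be checked at load time roughly as follows. This is a minimal sketch, not the repository's implementation: `BucketSpec` and `validate_buckets` are hypothetical names, the `u32` field types are an assumption, and treating the legacy mode as one catch-all bucket named `default` is only one possible reading of "single-bucket mode".

```rust
/// Hypothetical mirror of one `cluster.buckets` entry; field names follow the YAML above.
#[derive(Debug, Clone)]
struct BucketSpec {
    name: String,
    input_length_min: u32,
    input_length_max: u32,
    num_instances: u32,
}

/// Checks the listed constraints and returns the buckets sorted by lower bound.
fn validate_buckets(
    legacy_num_instances: Option<u32>,
    mut buckets: Vec<BucketSpec>,
) -> Result<Vec<BucketSpec>, String> {
    // Old mode and new mode are mutually exclusive.
    if legacy_num_instances.is_some() && !buckets.is_empty() {
        return Err("cluster.num_instances and cluster.buckets are mutually exclusive".into());
    }
    // Assumption: legacy mode becomes a single bucket covering every length.
    if let Some(n) = legacy_num_instances {
        buckets.push(BucketSpec {
            name: "default".into(),
            input_length_min: 0,
            input_length_max: u32::MAX,
            num_instances: n,
        });
    }
    if buckets.is_empty() {
        return Err("at least one bucket is required".into());
    }
    for b in &buckets {
        if b.num_instances == 0 {
            return Err(format!("bucket {}: num_instances must be > 0", b.name));
        }
        if b.input_length_min > b.input_length_max {
            return Err(format!("bucket {}: input_length_min > input_length_max", b.name));
        }
    }
    // Sort by lower bound, then reject any pair of neighbours whose ranges touch.
    buckets.sort_by_key(|b| b.input_length_min);
    for pair in buckets.windows(2) {
        if pair[1].input_length_min <= pair[0].input_length_max {
            return Err(format!("buckets {} and {} overlap", pair[0].name, pair[1].name));
        }
    }
    Ok(buckets)
}
```

Because the ranges are inclusive and non-overlapping after this check, any length can match at most one bucket; lengths that fall into a gap between buckets are left to the routing-time error path.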
## Runtime Architecture

The runtime is split into three layers:

### 1. BucketedService

A new service-layer object that holds multiple buckets. Each bucket contains:

- `bucket_id`
- The bucket configuration
- Its own `Cluster`

Responsibilities of `BucketedService`:

- Build the summary view of every bucket for a request
- Call the global router to choose a bucket
- Forward the request to the `Cluster` inside the chosen bucket
- Expose iteration over all buckets / instances so the driver can sample and schedule ticks

### 2. Cluster

The existing `Cluster` narrows to mean "the cluster inside a single bucket":

- Holds only that bucket's `instances`
- Holds only that bucket's `meta_store`
- Runs only that bucket's local router

`Cluster::route_and_admit` is no longer responsible for bucket selection. It only:

- Calls the local router to pick an instance inside the bucket
- Executes the L0/L1/remote/miss path inside that bucket
- Returns admission stats augmented with the bucket dimension

### 3. Router Layering

The router is explicitly split into:

- `GlobalRouter`: responsible for bucket selection
- `LocalRouter`: responsible for instance selection inside a bucket

Most algorithms in the existing `src/router/*` move over as `LocalRouter` implementations.

## Router Interface Design

### GlobalRouter

The global router only sees bucket summaries and does not operate on instance arrays directly.

Suggested interface:

```rust
trait GlobalRouter {
    fn name(&self) -> &'static str;
    fn route_bucket(
        &mut self,
        req: &RequestRecord,
        buckets: &[BucketView],
        now: f64,
    ) -> GlobalRouteDecision;
}
```

`BucketView` is a read-only summary containing at least:

- `bucket_id`
- `name`
- `input_length_min`
- `input_length_max`
- `num_instances`
- `queue_len_sum`
- `queue_len_max`
- `kv_blocks_used_sum`
- `kv_blocks_total_sum`
- `active_requests`
- `predicted_prefix`
- Optionally `estimated_drain_time`

`predicted_prefix` is the bucket-level prefix-hit prediction for the current request in that bucket; it lets the global router see bucket-level cache affinity.

### LocalRouter

The local router still takes the instance pool inside a bucket as input.

The suggested interface stays close to the existing semantics:

```rust
trait LocalRouter {
    fn name(&self) -> &'static str;
    fn route_instance(
        &mut self,
        req: &RequestRecord,
        instances: &[Instance],
        meta: &MetaStore,
        now: f64,
    ) -> LocalRouteDecision;
}
```

The existing `RouteDecision` needs to be split into two levels and then merged into a single externally visible log structure:

- `GlobalRouteDecision`
- `LocalRouteDecision`
- `RouteDecision`: contains `chosen_bucket + chosen_instance + candidates from both levels`

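## Bucket Isolation Semantics

A bucket is an explicit physical isolation boundary, not a logical label.

The following must hold:

- Once a request enters a bucket, it can only use that bucket's instance pool
- L0 / L1 are visible only inside the bucket
- `meta_store` only describes which instances inside the bucket hold which blocks
- Remote RDMA may only pull from other instances in the same bucket
- Owner information is not shared between buckets

This keeps buckets in the simulator in clear correspondence with the real service topology. It avoids the case where the global router "knows about buckets" while the underlying cache model is still quietly shared globally, which would contaminate the experimental conclusions.
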
## Initial Global Bucket Policies

Phase one implements only two global bucket policies.

### 1. strict_input_length

Semantics:

- Only the bucket whose range contains `req.input_len` may be chosen

Purpose:

- Serves as the baseline policy for strict length bucketing
- Corresponds to the original target architecture diagram

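The strict policy can be sketched as a range lookup that fails loudly when zero or several buckets match. This is an illustrative sketch only: `BucketRange` is a hypothetical stand-in for the bucket summary, reduced to the three fields this policy reads, and the error strings are invented for the example.

```rust
/// Minimal stand-in for the bucket summary; only the fields this policy reads.
struct BucketRange {
    bucket_id: u32,
    input_length_min: u32,
    input_length_max: u32,
}

/// Returns the id of the single bucket whose inclusive range contains `input_len`.
fn strict_input_length(input_len: u32, buckets: &[BucketRange]) -> Result<u32, String> {
    let mut hits = buckets
        .iter()
        .filter(|b| b.input_length_min <= input_len && input_len <= b.input_length_max);
    // Pull at most two matches: one is success, zero or two are errors.
    match (hits.next(), hits.next()) {
        (Some(b), None) => Ok(b.bucket_id),
        (None, _) => Err(format!("input_len {} matches no bucket", input_len)),
        (Some(_), Some(_)) => Err(format!("input_len {} matches more than one bucket", input_len)),
    }
}
```

With the three example buckets from the configuration section, a 32768-token request resolves to the first bucket and a 32769-token request to the second, since both bounds are inclusive.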
### 2. bucket_score

Semantics:

- Compute a score for every bucket and choose the best one

In phase one the score uses only a few strong signals:

- `length_penalty`: penalty for the request length falling outside the bucket's target range
- `load`: total or maximum load of the bucket
- `miss`: bucket-level predicted miss, derived from the current request's `predicted_prefix` in that bucket

Target form:

```text
score = a * length_penalty + b * load + c * miss
```

Design intent:

- Makes it possible to study the gain or loss of dispatching "not strictly by input length"
- Keeps a preference for length matching, so phase one does not degenerate into fully unconstrained scheduling

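One way to read the score formula is sketched below. The design does not pin down how each term is normalized, so everything concrete here is an assumption: `length_penalty` is the distance to the bucket range divided by the request length, `load` is the mean queue length per instance, `miss` is the fraction of request blocks not covered by `predicted_prefix`, the lowest score wins, and the weights `a`, `b`, `c` are mapped to `length_penalty_weight`, `load_weight`, and `cache_weight`. The type and function names are hypothetical.

```rust
/// Hypothetical subset of the bucket summary used by this sketch.
struct BucketSummary {
    bucket_id: u32,
    input_length_min: u32,
    input_length_max: u32,
    num_instances: u32,
    queue_len_sum: u32,
    /// Blocks of the current request predicted to hit in this bucket.
    predicted_prefix: u32,
}

/// Weights a, b, c from `score = a * length_penalty + b * load + c * miss`.
struct Weights {
    length_penalty: f64,
    load: f64,
    cache: f64,
}

fn bucket_score(input_len: u32, req_blocks: u32, b: &BucketSummary, w: &Weights) -> f64 {
    // Tokens by which the request falls outside the bucket range (0 if inside).
    let distance = if input_len < b.input_length_min {
        b.input_length_min - input_len
    } else if input_len > b.input_length_max {
        input_len - b.input_length_max
    } else {
        0
    };
    let length_penalty = distance as f64 / input_len.max(1) as f64;
    let load = b.queue_len_sum as f64 / b.num_instances.max(1) as f64;
    let hit = b.predicted_prefix.min(req_blocks) as f64 / req_blocks.max(1) as f64;
    let miss = 1.0 - hit;
    w.length_penalty * length_penalty + w.load * load + w.cache * miss
}

/// Picks the bucket with the lowest score (assumption: all terms are penalties).
fn pick_bucket(
    input_len: u32,
    req_blocks: u32,
    buckets: &[BucketSummary],
    w: &Weights,
) -> Option<u32> {
    buckets
        .iter()
        .map(|b| (b.bucket_id, bucket_score(input_len, req_blocks, b, w)))
        .min_by(|x, y| x.1.total_cmp(&y.1))
        .map(|(id, _)| id)
}
```

Under these assumptions, a request that fits the short bucket can still be sent to a neighbouring bucket when the short bucket's queues are long enough to outweigh the length penalty, which is the non-strict behaviour this policy is meant to expose.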
Not implemented for now:

- Fully flat global scoring of instances across buckets
- Per-bucket specialized global scoring
- Fallback-style cache sharing across buckets

## Event and Driver Design

The current `Event::BatchTick { instance }` assumes instances have a single global numbering. With multiple buckets it becomes:

```rust
Event::BatchTick { bucket: BucketId, instance: InstanceId }
```

The driver main loop becomes:

1. `Arrival`
2. Read the request
3. Call `BucketedService::route_and_admit`
4. Record the global + local routing decision
5. Schedule a `BatchTick` for `(bucket, instance)`

The `Sample` event can stay global, but sampling must iterate over all instances of all buckets.

The `inflight` structure stays indexed by `req_id`, but its value gains:

- `bucket`
- `bucket_policy`
- `length_bucket_match`
- Optionally `bucket_predicted_prefix`

## Metrics and Observability

One focus of this refactor is making bucket policies studyable, so metrics must clearly separate bucket selection from instance selection.

### routing_log.jsonl

Suggested new fields:

- `global_mode`
- `local_mode`
- `chosen_bucket`
- `chosen_instance`
- `bucket_candidates`
- `instance_candidates`
- `global_reason`
- `local_reason`

Where:

- `bucket_candidates` records the summary and score of every bucket
- `instance_candidates` records instance-level candidate information inside the chosen bucket

### per_request.csv

Suggested new fields:

- `bucket`
- `bucket_policy`
- `length_bucket_match`
- `bucket_predicted_prefix`

`length_bucket_match` directly measures whether the final bucket equals the bucket that strict length matching would have hit; it is the key field for analyzing the impact of non-strict dispatch.

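Since bucket ranges do not overlap, `length_bucket_match` reduces to "does the chosen bucket's range contain the request length". The sketch below computes that flag and the share of requests routed off their length-matched bucket; `LengthRange`, the function names, and indexing buckets by position are assumptions made for the example.

```rust
/// Inclusive token range of one bucket (hypothetical helper type).
struct LengthRange {
    min: u32,
    max: u32,
}

/// True when the chosen bucket is the one whose range contains the request length.
/// An out-of-range bucket index counts as a mismatch.
fn length_bucket_match(input_len: u32, chosen_bucket: usize, ranges: &[LengthRange]) -> bool {
    ranges
        .get(chosen_bucket)
        .map_or(false, |r| r.min <= input_len && input_len <= r.max)
}

/// Fraction of requests that were routed away from their length-matched bucket.
fn mismatch_rate(matches: &[bool]) -> f64 {
    if matches.is_empty() {
        return 0.0;
    }
    matches.iter().filter(|m| !**m).count() as f64 / matches.len() as f64
}
```

Under `strict_input_length` the mismatch rate should be zero by construction, so a non-zero value is only expected under `bucket_score`.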
### instances.csv

Suggested new field:

- `bucket`

### summary.json

Keep the global summary unchanged and add a per-bucket breakdown. Either:

- Add `per_bucket` inside `summary.json`
- Or add a separate `bucket_summary.json`

Phase one prioritizes readability and ease of analysis; heavy abstraction is not needed.

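A per-bucket breakdown can be as simple as grouping per-request rows by bucket name. The sketch below computes a request count and mean TTFT per bucket; `RequestRow` and its `ttft_s` field are hypothetical stand-ins for whatever columns `per_request.csv` actually carries, and the output shape is not the one the simulator writes.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-request row; `bucket` and `ttft_s` stand in for CSV columns.
struct RequestRow {
    bucket: String,
    ttft_s: f64,
}

/// Returns (request count, mean TTFT in seconds) per bucket, keyed by bucket name.
fn per_bucket_mean_ttft(rows: &[RequestRow]) -> BTreeMap<String, (usize, f64)> {
    let mut acc: BTreeMap<String, (usize, f64)> = BTreeMap::new();
    // First pass: accumulate count and TTFT sum per bucket.
    for r in rows {
        let entry = acc.entry(r.bucket.clone()).or_insert((0, 0.0));
        entry.0 += 1;
        entry.1 += r.ttft_s;
    }
    // Second pass: turn each sum into a mean.
    for v in acc.values_mut() {
        v.1 /= v.0 as f64;
    }
    acc
}
```

A `BTreeMap` keeps the buckets in a stable, sorted order, which makes the resulting breakdown easy to diff between runs.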
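## Error Handling

The following failure cases must be handled explicitly:

- A request matches 0 buckets
- A request matches multiple buckets
- The bucket configuration is invalid
- In multi-bucket mode, an event cannot find its `(bucket, instance)`
- The routing log / metrics are missing bucket information

Configuration errors should fail at startup wherever possible rather than surfacing midway through the trace replay.
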
## Testing Strategy

Tests are organized in three layers.

### 1. Configuration Tests

- The old `num_instances` mode still loads successfully
- The `buckets` mode loads successfully
- Overlapping bucket ranges are rejected
- Configuring both `num_instances` and `buckets` is rejected
- A request length not covered by any bucket is rejected, or fails explicitly at runtime

### 2. Service / Driver Tests

- Short requests go to the short bucket
- Long requests go to the long bucket
- `bucket_score` can choose a non-default bucket when the length does not match exactly
- Requests in the long bucket cannot see the short bucket's meta-store / remote owners
- `BatchTick` keyed by `(bucket, instance)` advances correctly

### 3. Integration Smoke Test

Build a mixed trace with several input-length ranges and shared-prefix patterns, and verify that:

- Under the strict bucket policy, requests land in the expected bucket
- `routing_log` records both bucket candidates and instance candidates
- `per_request` / `instances` carry the bucket field
- `bucket_score` and `strict_input_length` produce an observable difference on the mixed trace

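## Migration Strategy

To control risk, the refactor proceeds in this order:

1. Introduce the bucket structure and validation in the config layer while keeping the old single-bucket mode
2. Narrow the existing `Cluster` to single-bucket semantics
3. Add `BucketedService` and wire up the strict bucket policy first
4. Extract the `GlobalRouter` interface and add `strict_input_length`
5. Adapt the existing instance-level routers to `LocalRouter`
6. Extend driver / events / metrics to the `(bucket, instance)` dimension
7. Then implement `bucket_score` as the first non-strict bucket policy

This establishes the correct topology and logging first and adds experimental policies incrementally, avoiding a one-shot rewrite of too many core paths.

## Expected Outcome

Once complete, the simulator will be able to:

- Replay mixed-length traces with an explicit bucket topology
- Study whether strict length bucketing pays off
- Study the effect of the global router scheduling non-strictly across buckets
- Run ablations separately at the bucket-policy level and the local instance-policy level

This gives follow-up studies of "are buckets necessary", "where should bucket boundaries go", and "should the global router deviate from length-based dispatch" a clear, observable, and reusable simulator foundation.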
68
models/GLM-5-FP8/config.json
Normal file
@@ -0,0 +1,68 @@
{
  "architectures": [
    "GlmMoeDsaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "dtype": "bfloat16",
  "eos_token_id": [
    154820,
    154827,
    154829
  ],
  "ep_size": 1,
  "first_k_dense_replace": 3,
  "hidden_act": "silu",
  "head_dim": 64,
  "hidden_size": 6144,
  "index_head_dim": 128,
  "index_n_heads": 32,
  "index_topk": 2048,
  "indexer_rope_interleave": true,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "kv_lora_rank": 512,
  "max_position_embeddings": 202752,
  "moe_intermediate_size": 2048,
  "moe_layer_freq": 1,
  "model_type": "glm_moe_dsa",
  "n_group": 1,
  "n_routed_experts": 256,
  "n_shared_experts": 1,
  "norm_topk_prob": true,
  "num_attention_heads": 64,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 78,
  "num_key_value_heads": 64,
  "num_nextn_predict_layers": 1,
  "pad_token_id": 154820,
  "pretraining_tp": 1,
  "q_lora_rank": 2048,
  "qk_head_dim": 256,
  "qk_nope_head_dim": 192,
  "qk_rope_head_dim": 64,
  "quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [
      128,
      128
    ]
  },
  "rms_norm_eps": 1e-05,
  "rope_interleave": true,
  "rope_parameters": {
    "rope_theta": 1000000,
    "rope_type": "default"
  },
  "routed_scaling_factor": 2.5,
  "scoring_func": "sigmoid",
  "tie_word_embeddings": false,
  "topk_group": 1,
  "topk_method": "noaux_tc",
  "transformers_version": "5.0.2.dev0",
  "use_cache": true,
  "v_head_dim": 256,
  "vocab_size": 154880
}
216
src/cluster/bucketed_service.rs
Normal file
@@ -0,0 +1,216 @@
use anyhow::Result;

use super::cluster::{AdmissionStats, Cluster};
use crate::config::{BucketConfig, Config, ModelConfig};
use crate::instance::Instance;
use crate::router::{self, BucketId, GlobalRouter};
use crate::trace::RequestRecord;

pub struct ServiceBucket {
    pub id: BucketId,
    pub cfg: BucketConfig,
    pub cluster: Cluster,
}

impl ServiceBucket {
    pub fn instances(&self) -> &[Instance] {
        &self.cluster.instances
    }
}

pub struct BucketedService {
    pub buckets: Vec<ServiceBucket>,
    pub global_router: Box<dyn GlobalRouter>,
}

impl BucketedService {
    pub fn new(config: &Config, model: &ModelConfig) -> Self {
        let buckets = config
            .cluster
            .effective_buckets()
            .into_iter()
            .enumerate()
            .map(|(idx, cfg)| ServiceBucket {
                id: idx as BucketId,
                cluster: Cluster::new_for_bucket(config, model, idx as BucketId, cfg.num_instances)
                    .expect("bucket-local cluster construction should succeed"),
                cfg,
            })
            .collect();

        Self {
            buckets,
            global_router: router::build_global(config),
        }
    }

    pub fn bucket(&self, bucket_id: BucketId) -> &ServiceBucket {
        &self.buckets[bucket_id as usize]
    }

    pub fn route_and_admit(&mut self, req: &RequestRecord, now: f64) -> Result<AdmissionStats> {
        let bucket_views = self
            .buckets
            .iter()
            .map(|bucket| bucket.cluster.bucket_view(bucket.id, &bucket.cfg, req, now))
            .collect::<Vec<_>>();
        let global = self.global_router.route(req, &bucket_views, now)?;
        let bucket = &mut self.buckets[global.chosen_bucket as usize];
        Ok(bucket
            .cluster
            .route_and_admit_with_global(req, now, &global))
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::config::{
        BucketConfig, CalibrationConfig, ClusterConfig, Config, GlobalRouterConfig,
        GlobalRouterMode, HardwareConfig, MetaStoreConfig, ModelConfig, RouterConfig, RouterMode,
        SimConfig,
    };
    use crate::trace::RequestRecord;

    fn test_config() -> Config {
        Config {
            model: ModelConfig {
                name: "test".into(),
                num_layers: 4,
                num_kv_heads: 2,
                head_dim: 64,
                dtype_bytes: 2,
                block_size_tokens: 16,
                flops_per_token_prefill: Some(1.0e9),
                attn_quadratic_coeff: Some(64.0),
                ..Default::default()
            },
            hardware: HardwareConfig {
                gpu_flops: 1.0e14,
                gpu_fp8_flops: 0.0,
                gpu_fp4_flops: 0.0,
                gpu_mem_bw: 1.0e12,
                hbm_bytes: 1.0e9,
                dram_bytes: 4.0e9,
                host_dram_bw: 5.0e11,
                pcie_bw: 32.0e9,
                pcie_latency_us: 1.0,
                rdma_bw: 12.0e9,
                rdma_latency_us: 5.0,
                intra_node_tp_bw: 9.0e11,
                intra_node_tp_latency_us: 2.0,
                tp_degree: 1,
                max_batch_slots: 32,
                prefill_chunk_tokens: 1024,
            },
            calibration: CalibrationConfig::default(),
            cluster: ClusterConfig {
                num_instances: None,
                buckets: vec![
                    BucketConfig {
                        name: "short".into(),
                        input_length_min: 0,
                        input_length_max: 32,
                        num_instances: 2,
                    },
                    BucketConfig {
                        name: "long".into(),
                        input_length_min: 33,
                        input_length_max: 96,
                        num_instances: 1,
                    },
                ],
                global_router: GlobalRouterConfig {
                    mode: GlobalRouterMode::StrictInputLength,
                    length_penalty_weight: 1.0,
                    load_weight: 1.0,
                    cache_weight: 1.0,
                },
                meta_store: MetaStoreConfig {
                    ttl_seconds: 1000.0,
                },
                router: RouterConfig {
                    mode: RouterMode::LeastLoaded,
                    precise_probe_latency_us: 10.0,
                    precise_probe_topk: 2,
                    load_alpha: 0.0,
                    score_alpha: 1.0,
                    score_beta: 0.1,
                    prefix_k: 8,
                    affinity_fan_out: 2,
                },
            },
            sim: SimConfig {
                trace_path: String::new(),
                max_requests: None,
                output_dir: String::new(),
                sample_interval_s: 0.0,
                seed: 7,
                input_length_min: None,
                input_length_max: None,
            },
        }
    }

fn req(req_id: u64, input_len: u32, hashes: &[u64]) -> RequestRecord {
|
||||
RequestRecord {
|
||||
req_id,
|
||||
chat_id: req_id as i64,
|
||||
parent_chat_id: -1,
|
||||
turn: 0,
|
||||
arrival: 0.0,
|
||||
input_len,
|
||||
output_len: 16,
|
||||
hash_ids: hashes.to_vec(),
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn strict_input_length_routes_into_matching_bucket() {
|
||||
let cfg = test_config();
|
||||
let mut service = BucketedService::new(&cfg, &cfg.model);
|
||||
let stats = service
|
||||
.route_and_admit(&req(1, 24, &[10, 11]), 0.0)
|
||||
.unwrap();
|
||||
assert_eq!(stats.bucket, 0);
|
||||
assert_eq!(stats.decision.chosen_bucket, 0);
|
||||
assert_eq!(
|
||||
stats.decision.global_reason,
|
||||
"unique bucket range contains input_length"
|
||||
);
|
||||
assert_eq!(
|
||||
stats.decision.local_reason,
|
||||
"argmin(kv_used + alpha * queue_len)"
|
||||
);
|
||||
assert_eq!(service.bucket(0).instances().len(), 2);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn bucket_meta_store_is_isolated() {
|
||||
let cfg = test_config();
|
||||
let mut service = BucketedService::new(&cfg, &cfg.model);
|
||||
let _ = service
|
||||
.route_and_admit(&req(1, 24, &[10, 11]), 0.0)
|
||||
.unwrap();
|
||||
let long_stats = service
|
||||
.route_and_admit(&req(2, 64, &[10, 11, 12, 13]), 1.0)
|
||||
.unwrap();
|
||||
assert_eq!(long_stats.bucket, 1);
|
||||
assert_eq!(long_stats.remote_hit_blocks, 0);
|
||||
assert_eq!(long_stats.l1_hit_blocks, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn unmatched_input_length_returns_recoverable_error() {
|
||||
let mut cfg = test_config();
|
||||
cfg.cluster.buckets[1].input_length_min = 40;
|
||||
let mut service = BucketedService::new(&cfg, &cfg.model);
|
||||
|
||||
let err = service
|
||||
.route_and_admit(&req(3, 36, &[20, 21, 22]), 0.0)
|
||||
.unwrap_err();
|
||||
|
||||
assert!(err.to_string().contains("no bucket"));
|
||||
assert!(err.to_string().contains("input_length=36"));
|
||||
}
|
||||
}
|
||||
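The tests above pin down the observable behaviour of `GlobalRouterMode::StrictInputLength`: a request goes to the one bucket whose inclusive range contains its input length, and a gap between ranges is reported as an error. The router implementation is not shown in this diff, so the following is only a sketch under assumptions; `BucketRange`, `strict_bucket_for`, the error wording beyond the asserted substrings, and the handling of overlapping ranges are all hypothetical.

```rust
/// Hypothetical, reduced stand-in for `BucketConfig` (inclusive bounds).
#[derive(Debug, Clone, PartialEq)]
struct BucketRange {
    name: &'static str,
    min: u32,
    max: u32,
}

/// Return the index of the unique bucket whose range contains `input_len`.
/// Assumption: zero matches and multiple matches are both treated as errors.
fn strict_bucket_for(buckets: &[BucketRange], input_len: u32) -> Result<usize, String> {
    let mut matches = buckets
        .iter()
        .enumerate()
        .filter(|(_, b)| b.min <= input_len && input_len <= b.max)
        .map(|(idx, _)| idx);
    // Pull at most two matches: enough to tell "none", "one", and "many" apart.
    match (matches.next(), matches.next()) {
        (Some(idx), None) => Ok(idx),
        (None, _) => Err(format!("no bucket covers input_length={input_len}")),
        (Some(_), Some(_)) => Err(format!("multiple buckets cover input_length={input_len}")),
    }
}

/// Two buckets with a gap at 33..=39, mirroring the third test above.
fn example_buckets() -> Vec<BucketRange> {
    vec![
        BucketRange { name: "short", min: 0, max: 32 },
        BucketRange { name: "long", min: 40, max: 96 },
    ]
}
```

With these ranges, lengths 24 and 64 resolve to buckets 0 and 1, while 36 falls into the gap and yields an error rather than a panic.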
@@ -1,17 +1,21 @@
|
||||
//! Cluster: routes arrivals, performs the L0 / L1 / remote-RDMA fetch chain
|
||||
//! described in the design diagram, and bookkeeps the global meta store.
|
||||
|
||||
use anyhow::Result;
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::config::{Config, ModelConfig};
|
||||
use crate::config::{BucketConfig, Config, ModelConfig};
|
||||
use crate::instance::instance::AdmittedRequest;
|
||||
use crate::instance::kv_cache::L1Change;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{self, RouteDecision, Router};
|
||||
use crate::router::{self, BucketId, BucketView, GlobalRouteDecision, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
use crate::ttft::{classify_prefix_tiers, TtftModel};
|
||||
use crate::types::InstanceId;
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct AdmissionStats {
|
||||
pub bucket: BucketId,
|
||||
pub instance: InstanceId,
|
||||
pub l0_hit_blocks: u32,
|
||||
pub l1_hit_blocks: u32,
|
||||
@@ -31,52 +35,70 @@ pub struct Cluster {
|
||||
pub router: Box<dyn Router>,
|
||||
pub block_size_tokens: u32,
|
||||
pub kv_block_bytes: u64,
|
||||
pub ttft_model: TtftModel,
|
||||
}
|
||||
|
||||
impl Cluster {
|
||||
pub fn new(config: &Config, model: &ModelConfig) -> Self {
|
||||
let mut instances = Vec::with_capacity(config.cluster.num_instances as usize);
|
||||
for id in 0..config.cluster.num_instances {
|
||||
instances.push(Instance::new(id as InstanceId, model, &config.hardware));
|
||||
}
|
||||
let meta_store = MetaStore::new(config.cluster.meta_store.ttl_seconds);
|
||||
let router = router::build(config, config.sim.seed);
|
||||
Self {
|
||||
instances,
|
||||
meta_store,
|
||||
router,
|
||||
block_size_tokens: model.block_size_tokens,
|
||||
kv_block_bytes: model.kv_block_bytes(),
|
||||
pub fn new(config: &Config, model: &ModelConfig) -> Result<Self> {
|
||||
let total_instances = config.cluster.require_legacy_single_pool("Cluster::new")?;
|
||||
Self::build_local_cluster(config, model, total_instances)
|
||||
}
|
||||
|
||||
pub fn new_for_bucket(
|
||||
config: &Config,
|
||||
model: &ModelConfig,
|
||||
_bucket_id: BucketId,
|
||||
num_instances: u32,
|
||||
) -> Result<Self> {
|
||||
let mut local_config = config.clone();
|
||||
local_config.cluster.num_instances = Some(num_instances);
|
||||
local_config.cluster.buckets.clear();
|
||||
Self::build_local_cluster(&local_config, model, num_instances)
|
||||
}
|
||||
|
||||
/// Route + admit a request. Returns the chosen instance plus rich
|
||||
/// per-request stats for metrics. Does NOT schedule the BatchTick — the
|
||||
/// simulator driver does that based on the returned `ready_at`.
|
||||
pub fn route_and_admit(&mut self, req: &RequestRecord, now: f64) -> AdmissionStats {
|
||||
let global = GlobalRouteDecision::single_bucket(req.req_id, 0);
|
||||
self.route_and_admit_with_global(req, now, &global)
|
||||
}
|
||||
|
||||
pub fn route_and_admit_with_global(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
now: f64,
|
||||
global: &GlobalRouteDecision,
|
||||
) -> AdmissionStats {
|
||||
let decision = self
|
||||
.router
|
||||
.route(req, &self.instances, &self.meta_store, now);
|
||||
.route(req, &self.instances, &self.meta_store, now)
|
||||
.with_global(global);
|
||||
let inst_id = decision.chosen;
|
||||
let probe_overhead_s = decision.probe_overhead_s;
|
||||
let scheduler_overhead_s = self
|
||||
.ttft_model
|
||||
.scheduler_overhead_s(self.instances.len(), 3);
|
||||
|
||||
// The router probe overhead delays the request's effective start time.
|
||||
let effective_now = now + probe_overhead_s;
|
||||
// The router probe overhead and scheduler work delay the request's
|
||||
// effective start time.
|
||||
let effective_now = now + probe_overhead_s + scheduler_overhead_s;
|
||||
|
||||
let inst = &mut self.instances[inst_id as usize];
|
||||
let residency = classify_prefix_tiers(&req.hash_ids, inst, &self.meta_store, now);
|
||||
let total_blocks = req.hash_ids.len() as u32;
|
||||
let l0_hits = residency.l0_hit_blocks;
|
||||
let l1_hits = residency.l1_hit_blocks;
|
||||
let remote_hit_blocks = residency.remote_hit_blocks;
|
||||
|
||||
// 1. L0 lookup (touches matched blocks).
|
||||
let l0_hits = inst.cache.l0.longest_prefix(&req.hash_ids) as u32;
|
||||
|
||||
// 2. L1 lookup on the remaining suffix.
|
||||
// 1. L1 lookup on the remaining suffix.
|
||||
let suffix_after_l0 = &req.hash_ids[l0_hits as usize..];
|
||||
let l1_hits = inst.cache.l1.longest_prefix_peek(suffix_after_l0) as u32;
|
||||
// L1->L0 transfer cost
|
||||
let l1_bytes = (l1_hits as u64) * self.kv_block_bytes;
|
||||
let mut t = effective_now;
|
||||
let mut l1_changes = Vec::new();
|
||||
if l1_hits > 0 {
|
||||
t += self.ttft_model.local_l1_prepare_time_s(l1_hits) - inst.links.pcie.cost(l1_bytes);
|
||||
t = inst.links.pcie.reserve(t, l1_bytes);
|
||||
l1_changes = inst
|
||||
.cache
|
||||
@@ -86,23 +108,14 @@ impl Cluster {
|
||||
|
||||
// 3. Remote v6d lookup for the still-remaining suffix.
|
||||
let suffix_after_l1 = &suffix_after_l0[l1_hits as usize..];
|
||||
let mut remote_hit_blocks: u32 = 0;
|
||||
for &h in suffix_after_l1 {
|
||||
// A block is remotely available iff some instance other than
|
||||
// `inst_id` lists it (and not expired).
|
||||
let owners = self.meta_store.instances_for(h, now);
|
||||
let any_remote = owners.iter().any(|o| *o != inst_id);
|
||||
if any_remote {
|
||||
remote_hit_blocks += 1;
|
||||
} else {
|
||||
break; // contiguous prefix - stop on first miss
|
||||
}
|
||||
}
|
||||
let remote_bytes = (remote_hit_blocks as u64) * self.kv_block_bytes;
|
||||
if remote_hit_blocks > 0 {
|
||||
let pulled = &suffix_after_l1[..remote_hit_blocks as usize];
|
||||
let l1_changes = {
|
||||
let inst = &mut self.instances[inst_id as usize];
|
||||
t += self.ttft_model.remote_prepare_time_s(remote_hit_blocks)
|
||||
- inst.links.rdma.cost(remote_bytes)
|
||||
- inst.links.pcie.cost(remote_bytes);
|
||||
t = inst.links.rdma.reserve(t, remote_bytes);
|
||||
t = inst.links.pcie.reserve(t, remote_bytes);
|
||||
inst.cache.fetch_remote_blocks_to_l0(pulled)
|
||||
@@ -134,6 +147,7 @@ impl Cluster {
|
||||
ready_at: t,
|
||||
prefill_tokens_remaining: miss_tokens,
|
||||
reserved_blocks,
|
||||
completion_tail_s: self.ttft_model.first_token_tail_s(),
|
||||
};
|
||||
let inst = &mut self.instances[inst_id as usize];
|
||||
inst.admit(admitted);
|
||||
@@ -142,6 +156,7 @@ impl Cluster {
|
||||
let fetch_time_s = (t - effective_now).max(0.0);
|
||||
|
||||
AdmissionStats {
|
||||
bucket: decision.chosen_bucket,
|
||||
instance: inst_id,
|
||||
l0_hit_blocks: l0_hits,
|
||||
l1_hit_blocks: l1_hits,
|
||||
@@ -156,6 +171,30 @@ impl Cluster {
|
||||
}
|
||||
}
|
||||
|
||||
pub fn bucket_view(
|
||||
&self,
|
||||
bucket_id: BucketId,
|
||||
cfg: &BucketConfig,
|
||||
req: &RequestRecord,
|
||||
now: f64,
|
||||
) -> BucketView {
|
||||
let predicted_prefix = self
|
||||
.meta_store
|
||||
.score_prefix(&req.hash_ids, now, self.instances.len())
|
||||
.into_iter()
|
||||
.max()
|
||||
.unwrap_or(0);
|
||||
BucketView {
|
||||
id: bucket_id,
|
||||
input_length_min: cfg.input_length_min,
|
||||
input_length_max: cfg.input_length_max,
|
||||
num_instances: self.instances.len() as u32,
|
||||
total_queue_len: self.instances.iter().map(Instance::queue_len).sum(),
|
||||
total_load_blocks: self.instances.iter().map(|inst| inst.kv_blocks_used).sum(),
|
||||
predicted_prefix,
|
||||
}
|
||||
}
|
||||
|
||||
fn apply_l1_changes(
|
||||
meta_store: &mut MetaStore,
|
||||
inst_id: InstanceId,
|
||||
@@ -169,4 +208,158 @@ impl Cluster {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn build_local_cluster(
|
||||
config: &Config,
|
||||
model: &ModelConfig,
|
||||
num_instances: u32,
|
||||
) -> Result<Self> {
|
||||
let mut instances = Vec::with_capacity(num_instances as usize);
|
||||
for id in 0..num_instances {
|
||||
instances.push(Instance::new(
|
||||
id as InstanceId,
|
||||
model,
|
||||
&config.hardware,
|
||||
&config.calibration,
|
||||
));
|
||||
}
|
||||
let meta_store = MetaStore::new(config.cluster.meta_store.ttl_seconds);
|
||||
let router = router::build(config, config.sim.seed);
|
||||
Ok(Self {
|
||||
instances,
|
||||
meta_store,
|
||||
router,
|
||||
block_size_tokens: model.block_size_tokens,
|
||||
kv_block_bytes: model.kv_block_bytes(),
|
||||
ttft_model: TtftModel::new(
|
||||
&config.hardware,
|
||||
&config.calibration,
|
||||
model.kv_block_bytes(),
|
||||
),
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::config::{
|
||||
BucketConfig, CalibrationConfig, ClusterConfig, Config, HardwareConfig, MetaStoreConfig,
|
||||
ModelConfig, RouterConfig, RouterMode, SimConfig,
|
||||
};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
fn test_config(mode: RouterMode) -> Config {
|
||||
Config {
|
||||
model: ModelConfig {
|
||||
name: "test".into(),
|
||||
num_layers: 4,
|
||||
num_kv_heads: 2,
|
||||
head_dim: 64,
|
||||
dtype_bytes: 2,
|
||||
block_size_tokens: 16,
|
||||
flops_per_token_prefill: Some(1.0e9),
|
||||
attn_quadratic_coeff: Some(64.0),
|
||||
..Default::default()
|
||||
},
|
||||
hardware: HardwareConfig {
|
||||
gpu_flops: 1.0e14,
|
||||
gpu_fp8_flops: 0.0,
|
||||
gpu_fp4_flops: 0.0,
|
||||
gpu_mem_bw: 1.0e12,
|
||||
hbm_bytes: 1.0e9,
|
||||
dram_bytes: 4.0e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32.0e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12.0e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_us: 2.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
},
|
||||
calibration: CalibrationConfig {
|
||||
dram_access_latency_us: 25.0,
|
||||
layout_transform_fixed_us: 7.0,
|
||||
..CalibrationConfig::default()
|
||||
},
|
||||
cluster: ClusterConfig {
|
||||
num_instances: Some(1),
|
||||
buckets: Vec::new(),
|
||||
global_router: Default::default(),
|
||||
meta_store: MetaStoreConfig {
|
||||
ttl_seconds: 1000.0,
|
||||
},
|
||||
router: RouterConfig {
|
||||
mode,
|
||||
precise_probe_latency_us: 10.0,
|
||||
precise_probe_topk: 2,
|
||||
load_alpha: 0.0,
|
||||
score_alpha: 0.0,
|
||||
score_beta: 1.0,
|
||||
prefix_k: 8,
|
||||
affinity_fan_out: 2,
|
||||
},
|
||||
},
|
||||
sim: SimConfig {
|
||||
trace_path: String::new(),
|
||||
max_requests: None,
|
||||
output_dir: String::new(),
|
||||
sample_interval_s: 0.0,
|
||||
seed: 7,
|
||||
input_length_min: None,
|
||||
input_length_max: None,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn l1_ready_at_includes_dram_and_transform_overhead() {
|
||||
let cfg = test_config(RouterMode::EstimatedTtft);
|
||||
let mut cluster = Cluster::new(&cfg, &cfg.model).unwrap();
|
||||
let req = RequestRecord {
|
||||
req_id: 1,
|
||||
chat_id: 0,
|
||||
parent_chat_id: -1,
|
||||
turn: 1,
|
||||
arrival: 0.0,
|
||||
input_len: 32,
|
||||
output_len: 16,
|
||||
hash_ids: vec![10, 11],
|
||||
};
|
||||
|
||||
let mut evicted = Vec::new();
|
||||
cluster.instances[0]
|
||||
.cache
|
||||
.l1
|
||||
.insert_blocks(&req.hash_ids, &mut evicted);
|
||||
|
||||
let stats = cluster.route_and_admit(&req, 0.0);
|
||||
let pure_pcie = cluster.instances[0]
|
||||
.links
|
||||
.pcie
|
||||
.cost(cluster.kv_block_bytes * 2);
|
||||
|
||||
assert!(stats.ready_at > pure_pcie);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cluster_new_rejects_bucketed_configs() {
|
||||
let mut cfg = test_config(RouterMode::EstimatedTtft);
|
||||
cfg.cluster.num_instances = None;
|
||||
cfg.cluster.buckets = vec![BucketConfig {
|
||||
name: "short".into(),
|
||||
input_length_min: 0,
|
||||
input_length_max: 64,
|
||||
num_instances: 2,
|
||||
}];
|
||||
|
||||
let result = Cluster::new(&cfg, &cfg.model);
|
||||
assert!(result.is_err(), "bucketed Cluster::new should fail");
|
||||
let err = result.err().unwrap();
|
||||
assert!(err.to_string().contains("Cluster::new"));
|
||||
assert!(err.to_string().contains("cluster.buckets"));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,6 +1,8 @@
|
||||
pub mod meta_store;
|
||||
pub mod bucketed_service;
|
||||
#[allow(clippy::module_inception)]
|
||||
pub mod cluster;
|
||||
pub mod meta_store;
|
||||
|
||||
pub use bucketed_service::BucketedService;
|
||||
pub use cluster::Cluster;
|
||||
pub use meta_store::MetaStore;
|
||||
|
||||
1005
src/config.rs
File diff suppressed because it is too large
176
src/driver.rs
@@ -1,11 +1,12 @@
|
||||
//! Simulation driver: pulls trace records, advances the event queue, runs
|
||||
//! instance batch ticks, and emits metrics.
|
||||
|
||||
-use anyhow::Result;
-use std::collections::HashMap;
+use anyhow::{anyhow, Context, Result};
+use std::collections::{HashMap, VecDeque};
 use std::path::Path;
+use std::sync::{Arc, Mutex};

-use crate::cluster::Cluster;
+use crate::cluster::BucketedService;
||||
use crate::config::{Config, RouterMode};
|
||||
use crate::metrics::ablation::AblationRow;
|
||||
use crate::metrics::per_request::{PerRequestRow, PerRequestWriter};
|
||||
@@ -16,6 +17,18 @@ use crate::replay::ReplayEvictPolicy;
|
||||
use crate::sim::{Event, EventQueue};
|
||||
use crate::trace::{RequestRecord, TraceReader};
|
||||
|
||||
/// Drop records whose `input_len` falls outside `sim.input_length_{min,max}`.
/// Used to carve an ablation onto a specific input-length bucket (e.g. 0–40k)
/// without rewriting the trace file. No-op if both bounds are unset.
pub fn apply_input_length_filter(records: &mut Vec<RequestRecord>, cfg: &crate::config::SimConfig) {
    let lo = cfg.input_length_min.unwrap_or(0);
    let hi = cfg.input_length_max.unwrap_or(u32::MAX);
    if lo == 0 && hi == u32::MAX {
        return;
    }
    records.retain(|r| r.input_len >= lo && r.input_len <= hi);
}
|
||||
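A minimal sketch of how the filter above behaves on a small in-memory trace. `Rec` and `Bounds` are hypothetical stand-ins for `RequestRecord` and `SimConfig`, reduced to the fields the filter reads; both bounds are inclusive, as in `apply_input_length_filter`.

```rust
/// Hypothetical stand-in for `RequestRecord` (only the fields used here).
#[derive(Debug, Clone, PartialEq)]
struct Rec {
    req_id: u64,
    input_len: u32,
}

/// Hypothetical stand-in for the two optional `SimConfig` bounds.
struct Bounds {
    input_length_min: Option<u32>,
    input_length_max: Option<u32>,
}

/// Keep only records with `lo <= input_len <= hi`; an unset bound is open.
fn filter_by_input_length(records: &mut Vec<Rec>, cfg: &Bounds) {
    let lo = cfg.input_length_min.unwrap_or(0);
    let hi = cfg.input_length_max.unwrap_or(u32::MAX);
    records.retain(|r| r.input_len >= lo && r.input_len <= hi);
}

/// Three requests straddling a 40k upper bound.
fn sample_trace() -> Vec<Rec> {
    vec![
        Rec { req_id: 1, input_len: 16 },
        Rec { req_id: 2, input_len: 40_000 },
        Rec { req_id: 3, input_len: 40_001 },
    ]
}
```

With `input_length_max: Some(40_000)` and no lower bound, requests 1 and 2 survive and request 3 is dropped; with both bounds unset, the trace is left untouched.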
|
||||
pub struct RunOutputs {
|
||||
pub summary: Summary,
|
||||
pub rows: Vec<PerRequestRow>,
|
||||
@@ -24,7 +37,9 @@ pub struct RunOutputs {
|
||||
#[derive(Debug, Clone)]
|
||||
struct InflightInfo {
|
||||
arrival: f64,
|
||||
bucket: u32,
|
||||
instance: u32,
|
||||
length_bucket_match: bool,
|
||||
total_blocks: u32,
|
||||
l0_hit_blocks: u32,
|
||||
l1_hit_blocks: u32,
|
||||
@@ -36,7 +51,7 @@ struct InflightInfo {
|
||||
}
|
||||
|
||||
pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
let mut cluster = Cluster::new(config, &config.model);
|
||||
let mut service = BucketedService::new(config, &config.model);
|
||||
let mut q = EventQueue::new();
|
||||
|
||||
// Output directory
|
||||
@@ -55,7 +70,21 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
// Load all records (cheap for moderate traces) so we can index by req_id.
|
||||
// For very large traces a streaming approach with a peekable iterator
|
||||
// would be better; this keeps the driver simple.
|
||||
let records: Vec<RequestRecord> = (&mut trace).collect::<Result<Vec<_>, _>>()?;
|
||||
let mut records: Vec<RequestRecord> = (&mut trace).collect::<Result<Vec<_>, _>>()?;
|
||||
let raw_count = records.len();
|
||||
apply_input_length_filter(&mut records, &config.sim);
|
||||
if records.len() != raw_count {
|
||||
eprintln!(
|
||||
"[driver] input_length filter [{}, {}] kept {}/{} requests",
|
||||
config.sim.input_length_min.unwrap_or(0),
|
||||
config
|
||||
.sim
|
||||
.input_length_max
|
||||
.map_or("∞".to_string(), |v| v.to_string()),
|
||||
records.len(),
|
||||
raw_count,
|
||||
);
|
||||
}
|
||||
let mut by_id: HashMap<u64, RequestRecord> = HashMap::with_capacity(records.len());
|
||||
for r in &records {
|
||||
q.schedule(r.arrival, Event::Arrival { req_id: r.req_id });
|
||||
@@ -81,13 +110,16 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
Some(r) => r.clone(),
|
||||
None => continue,
|
||||
};
|
||||
let stats = cluster.route_and_admit(&req, now);
|
||||
let stats = service.route_and_admit(&req, now)?;
|
||||
rt_writer.write(&stats.decision)?;
|
||||
let strict_bucket = config.cluster.bucket_index_for_input_len(req.input_len)?;
|
||||
inflight.insert(
|
||||
req_id,
|
||||
InflightInfo {
|
||||
arrival: req.arrival,
|
||||
bucket: stats.bucket,
|
||||
instance: stats.instance,
|
||||
length_bucket_match: stats.bucket as usize == strict_bucket,
|
||||
total_blocks: req.hash_ids.len() as u32,
|
||||
l0_hit_blocks: stats.l0_hit_blocks,
|
||||
l1_hit_blocks: stats.l1_hit_blocks,
|
||||
@@ -98,20 +130,23 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
probe_overhead_s: stats.probe_overhead_s,
|
||||
},
|
||||
);
|
||||
let inst = &mut cluster.instances[stats.instance as usize];
|
||||
let inst = &mut service.buckets[stats.bucket as usize].cluster.instances
|
||||
[stats.instance as usize];
|
||||
if !inst.tick_scheduled {
|
||||
inst.tick_scheduled = true;
|
||||
let when = stats.ready_at.max(now);
|
||||
q.schedule(
|
||||
when,
|
||||
Event::BatchTick {
|
||||
bucket: stats.bucket,
|
||||
instance: stats.instance,
|
||||
},
|
||||
);
|
||||
}
|
||||
}
|
||||
Event::BatchTick { instance } => {
|
||||
let inst = &mut cluster.instances[instance as usize];
|
||||
Event::BatchTick { bucket, instance } => {
|
||||
let inst =
|
||||
&mut service.buckets[bucket as usize].cluster.instances[instance as usize];
|
||||
inst.tick_scheduled = false;
|
||||
let result = inst.step(now);
|
||||
for (rid, ttft, end) in result.completed {
|
||||
@@ -121,7 +156,9 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
arrival: info.arrival,
|
||||
ttft,
|
||||
e2e: end - info.arrival,
|
||||
bucket: info.bucket,
|
||||
instance: info.instance,
|
||||
length_bucket_match: info.length_bucket_match,
|
||||
total_blocks: info.total_blocks,
|
||||
l0_hit_blocks: info.l0_hit_blocks,
|
||||
l1_hit_blocks: info.l1_hit_blocks,
|
||||
@@ -136,18 +173,21 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
}
|
||||
}
|
||||
if let Some(next) = result.next_tick {
|
||||
let inst = &mut cluster.instances[instance as usize];
|
||||
let inst =
|
||||
&mut service.buckets[bucket as usize].cluster.instances[instance as usize];
|
||||
if !inst.tick_scheduled {
|
||||
inst.tick_scheduled = true;
|
||||
q.schedule(next.max(now), Event::BatchTick { instance });
|
||||
q.schedule(next.max(now), Event::BatchTick { bucket, instance });
|
||||
}
|
||||
}
|
||||
}
|
||||
Event::Sample => {
|
||||
for inst in &cluster.instances {
|
||||
for bucket in &service.buckets {
|
||||
for inst in &bucket.cluster.instances {
|
||||
let busy = if inst.queue_len() > 0 { 1 } else { 0 };
|
||||
ts_writer.write(&TimeseriesRow {
|
||||
t: now,
|
||||
bucket: bucket.id,
|
||||
instance: inst.id,
|
||||
queue_len: inst.queue_len(),
|
||||
kv_blocks_used: inst.kv_blocks_used,
|
||||
@@ -156,6 +196,7 @@ pub fn run(config: &Config, output_subdir: Option<&str>) -> Result<RunOutputs> {
|
||||
})?;
|
||||
}
|
||||
}
|
||||
}
|
||||
Event::Stop => break,
|
||||
}
|
||||
}
|
||||
@@ -181,6 +222,17 @@ pub fn ablate_fixed_placement(
|
||||
routers: &[RouterMode],
|
||||
evict_policies: &[ReplayEvictPolicy],
|
||||
) -> Result<Vec<AblationRow>> {
|
||||
ablate_fixed_placement_with_parallelism(base, routers, evict_policies, 1)
|
||||
}
|
||||
|
||||
pub fn ablate_fixed_placement_with_parallelism(
|
||||
base: &Config,
|
||||
routers: &[RouterMode],
|
||||
evict_policies: &[ReplayEvictPolicy],
|
||||
jobs: usize,
|
||||
) -> Result<Vec<AblationRow>> {
|
||||
base.cluster
|
||||
.require_legacy_single_pool("fixed-placement ablation")?;
|
||||
let mut out = Vec::new();
|
||||
for &policy in evict_policies {
|
||||
if policy != ReplayEvictPolicy::Lru {
|
||||
@@ -190,18 +242,112 @@ pub fn ablate_fixed_placement(
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
if routers.is_empty() {
|
||||
return Ok(out);
|
||||
}
|
||||
|
||||
let worker_count = resolve_ablation_parallelism(jobs, routers.len());
|
||||
if worker_count == 1 {
|
||||
for &mode in routers {
|
||||
out.extend(run_ablation_router(base, mode, evict_policies)?);
|
||||
}
|
||||
return Ok(out);
|
||||
}
|
||||
|
||||
eprintln!(
|
||||
"[ablate] running {} routers with {} workers",
|
||||
routers.len(),
|
||||
worker_count
|
||||
);
|
||||
|
||||
let queue = Arc::new(Mutex::new(VecDeque::from(
|
||||
routers
|
||||
.iter()
|
||||
.copied()
|
||||
.enumerate()
|
||||
.collect::<Vec<(usize, RouterMode)>>(),
|
||||
)));
|
||||
let mut ordered_results: Vec<(usize, Vec<AblationRow>)> = Vec::with_capacity(routers.len());
|
||||
|
||||
std::thread::scope(|scope| -> Result<()> {
|
||||
let mut handles = Vec::with_capacity(worker_count);
|
||||
for worker_idx in 0..worker_count {
|
||||
let queue = Arc::clone(&queue);
|
||||
let base = base.clone();
|
||||
let policies = evict_policies.to_vec();
|
||||
handles.push(
|
||||
scope.spawn(move || -> Result<Vec<(usize, Vec<AblationRow>)>> {
|
||||
let mut local = Vec::new();
|
||||
loop {
|
||||
let next = queue
|
||||
.lock()
|
||||
.map_err(|_| anyhow!("ablation work queue mutex poisoned"))?
|
||||
.pop_front();
|
||||
let Some((idx, mode)) = next else {
|
||||
break;
|
||||
};
|
||||
eprintln!(
|
||||
"[ablate] worker {}/{} router={} ...",
|
||||
worker_idx + 1,
|
||||
worker_count,
|
||||
mode.as_str()
|
||||
);
|
||||
let rows = run_ablation_router(&base, mode, &policies)
|
||||
.with_context(|| format!("ablation router={}", mode.as_str()))?;
|
||||
local.push((idx, rows));
|
||||
}
|
||||
Ok(local)
|
||||
}),
|
||||
);
|
||||
}
|
||||
|
||||
for handle in handles {
|
||||
let local = handle
|
||||
.join()
|
||||
.map_err(|_| anyhow!("ablation worker thread panicked"))??;
|
||||
ordered_results.extend(local);
|
||||
}
|
||||
Ok(())
|
||||
})?;
|
||||
|
||||
ordered_results.sort_by_key(|(idx, _)| *idx);
|
||||
for (_, rows) in ordered_results {
|
||||
out.extend(rows);
|
||||
}
|
||||
Ok(out)
|
||||
}
|
||||
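The parallel branch above hands routers to workers through a shared queue and restores input order by sorting on the original index. A minimal, generic sketch of that pattern follows. `run_indexed_jobs` is a hypothetical helper, not part of this repository, and it borrows a plain `Mutex` from the enclosing scope instead of cloning an `Arc` as the driver does. Error propagation and logging are omitted, so a poisoned mutex or a panicking worker simply panics here.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Run `f` over `jobs` on up to `workers` scoped threads and return the
/// results in the same order as the inputs.
fn run_indexed_jobs<T, R, F>(jobs: Vec<T>, workers: usize, f: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    // Tag each job with its position so the output order can be restored.
    let queue = Mutex::new(jobs.into_iter().enumerate().collect::<VecDeque<_>>());
    let mut results: Vec<(usize, R)> = Vec::new();

    thread::scope(|scope| {
        let mut handles = Vec::new();
        for _ in 0..workers.max(1) {
            handles.push(scope.spawn(|| {
                let mut local = Vec::new();
                loop {
                    // The lock guard is dropped at the end of this statement,
                    // so `f` runs without holding the queue lock.
                    let next = queue.lock().unwrap().pop_front();
                    let Some((idx, job)) = next else { break };
                    local.push((idx, f(job)));
                }
                local
            }));
        }
        for handle in handles {
            results.extend(handle.join().unwrap());
        }
    });

    // Workers finish in arbitrary order; sort by the original index.
    results.sort_by_key(|(idx, _)| *idx);
    results.into_iter().map(|(_, r)| r).collect()
}
```

Pulling work from a queue, rather than pre-splitting the input into fixed chunks, keeps workers busy when individual jobs take very different amounts of time, which is plausible when each job is a full simulation run.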
|
||||
fn resolve_ablation_parallelism(jobs: usize, num_routers: usize) -> usize {
    if num_routers == 0 {
        return 1;
    }
    let requested = if jobs == 0 {
        std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1)
    } else {
        jobs
    };
    requested.max(1).min(num_routers)
}
|
||||
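The clamping step in `resolve_ablation_parallelism` can be checked in isolation. The sketch below uses a hypothetical helper, `clamp_workers`, and assumes the requested count has already been resolved; the original first maps `jobs == 0` to `std::thread::available_parallelism()`, which is machine-dependent and therefore left out here.

```rust
/// Clamp an already-resolved worker count into `1..=num_routers`.
/// With no routers at all, a single worker is reported.
fn clamp_workers(requested: usize, num_routers: usize) -> usize {
    if num_routers == 0 {
        return 1;
    }
    // At least one worker, and never more workers than routers to run.
    requested.max(1).min(num_routers)
}
```

For example, 8 requested workers over 3 routers yields 3, while 2 requested workers over 5 routers stays at 2.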
|
||||
fn run_ablation_router(
|
||||
base: &Config,
|
||||
mode: RouterMode,
|
||||
evict_policies: &[ReplayEvictPolicy],
|
||||
) -> Result<Vec<AblationRow>> {
|
||||
let mut cfg = base.clone();
|
||||
cfg.cluster.router.mode = mode;
|
||||
let placement_run = run(&cfg, Some(&format!("{}__placement_lru", mode.as_str())))?;
|
||||
let mut rows = Vec::with_capacity(evict_policies.len());
|
||||
for &policy in evict_policies {
|
||||
out.push(AblationRow::from_summary(
|
||||
rows.push(AblationRow::from_summary(
|
||||
mode.as_str(),
|
||||
policy,
|
||||
"realized_lru",
|
||||
&placement_run.summary,
|
||||
));
|
||||
}
|
||||
}
|
||||
Ok(out)
|
||||
Ok(rows)
|
||||
}
|
||||
|
||||
@@ -17,6 +17,7 @@ pub const AVAILABLE: &[&str] = &[
|
||||
"h100",
|
||||
"h800",
|
||||
"h20",
|
||||
"h20-141g",
|
||||
"a100-80gb",
|
||||
"a100-40gb",
|
||||
"b200",
|
||||
@@ -30,6 +31,9 @@ pub const AVAILABLE: &[&str] = &[
|
||||
"2xh20",
|
||||
"4xh20",
|
||||
"8xh20",
|
||||
"2xh20-141g",
|
||||
"4xh20-141g",
|
||||
"8xh20-141g",
|
||||
"2xb200",
|
||||
"4xb200",
|
||||
"8xb200",
|
||||
@@ -49,6 +53,7 @@ pub fn resolve(name: &str) -> Option<HardwareConfig> {
|
||||
"h100" => Some(make_config(count, &H100)),
|
||||
"h800" => Some(make_config(count, &H800)),
|
||||
"h20" => Some(make_config(count, &H20)),
|
||||
"h20141g" | "h20141gb" => Some(make_config(count, &H20_141G)),
|
||||
"a10080gb" | "a100" => Some(make_config(count, &A100_80GB)),
|
||||
"a10040gb" => Some(make_config(count, &A100_40GB)),
|
||||
"b200" => Some(make_config(count, &B200)),
|
||||
@@ -113,6 +118,15 @@ const H20: GpuBase = GpuBase {
|
||||
pcie_gen: 5,
|
||||
};
|
||||
|
||||
const H20_141G: GpuBase = GpuBase {
|
||||
flops: 1.48e14, // modeled as the same H20 compute envelope
|
||||
fp8_flops: 2.96e14, // modeled as the same H20 FP8 throughput
|
||||
fp4_flops: 0.0, // not supported
|
||||
mem_bw: 4.8e12, // 141 GB HBM variant
|
||||
hbm: 141.0e9, // 141 GB
|
||||
pcie_gen: 5,
|
||||
};
|
||||
|
||||
const A100_80GB: GpuBase = GpuBase {
|
||||
flops: 3.12e14, // 312 TFLOPS BF16
|
||||
fp8_flops: 0.0, // A100 has no FP8 tensor cores
|
||||
@@ -188,10 +202,18 @@ fn make_config(n: u32, base: &GpuBase) -> HardwareConfig {
|
||||
gpu_mem_bw: base.mem_bw * f,
|
||||
hbm_bytes: base.hbm * f,
|
||||
dram_bytes: dram,
|
||||
host_dram_bw: if n >= 8 { 9.0e11 } else { 5.0e11 },
|
||||
pcie_bw: pcie_per_gpu * f,
|
||||
pcie_latency_us: pcie_lat,
|
||||
rdma_bw: rdma_base * rdma_scale,
|
||||
rdma_latency_us: rdma_lat,
|
||||
intra_node_tp_bw: if base.pcie_gen >= 6 {
|
||||
1.8e12 * f
|
||||
} else {
|
||||
9.0e11 * f
|
||||
},
|
||||
intra_node_tp_latency_us: if base.pcie_gen >= 6 { 1.0 } else { 2.0 },
|
||||
tp_degree: n,
|
||||
max_batch_slots: 256,
|
||||
prefill_chunk_tokens: if n >= 4 { 4096 } else { 2048 },
|
||||
}
|
||||
@@ -223,6 +245,7 @@ mod tests {
|
||||
assert!(resolve("H100").is_some());
|
||||
assert!(resolve("8xB200").is_some());
|
||||
assert!(resolve("8x-B200").is_some());
|
||||
assert!(resolve("8xH20-141G").is_some());
|
||||
assert!(resolve("a100-80gb").is_some());
|
||||
assert!(resolve("A100_80GB").is_some());
|
||||
assert!(resolve("a100_80gb").is_some());
|
||||
@@ -254,4 +277,13 @@ mod tests {
|
||||
assert!((s4.gpu_mem_bw - s1.gpu_mem_bw * 4.0).abs() < 1.0);
|
||||
assert!((s8.hbm_bytes - s1.hbm_bytes * 8.0).abs() < 1.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn h20_141g_variant_has_larger_hbm() {
|
||||
let h20 = resolve("8xh20").unwrap();
|
||||
let h20_141g = resolve("8xh20-141g").unwrap();
|
||||
assert!((h20_141g.gpu_flops - h20.gpu_flops).abs() < 1.0);
|
||||
assert!(h20_141g.hbm_bytes > h20.hbm_bytes);
|
||||
assert!(h20_141g.gpu_mem_bw > h20.gpu_mem_bw);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -7,7 +7,7 @@ use anyhow::{Context, Result};
|
||||
use serde_json::Value;
|
||||
use std::path::Path;
|
||||
|
||||
use crate::config::{AttentionConfig, MlaConfig, MoeConfig, ModelConfig};
|
||||
use crate::config::{AttentionConfig, MlaConfig, ModelConfig, MoeConfig};
|
||||
|
||||
/// Parse a HuggingFace config.json and return a partially-populated
|
||||
/// [`ModelConfig`]. The caller must still set `dtype_bytes` and
|
||||
@@ -34,8 +34,7 @@ fn parse_value(v: &Value) -> Result<ModelConfig> {
|
||||
let num_layers = u32_field(v, "num_hidden_layers");
|
||||
let hidden_size = u32_field(v, "hidden_size");
|
||||
let num_attention_heads = u32_field(v, "num_attention_heads");
|
||||
let num_kv_heads = u32_field(v, "num_key_value_heads")
|
||||
.or(num_attention_heads); // default to MHA
|
||||
let num_kv_heads = u32_field(v, "num_key_value_heads").or(num_attention_heads); // default to MHA
|
||||
let head_dim = u32_field(v, "head_dim").or_else(|| {
|
||||
// Infer: hidden_size / num_attention_heads
|
||||
match (hidden_size, num_attention_heads) {
|
||||
@@ -70,8 +69,7 @@ fn parse_value(v: &Value) -> Result<ModelConfig> {
|
||||
});
|
||||
|
||||
// --- Attention pattern ---
|
||||
let attention =
|
||||
if let Some(first_dense) = u32_field(v, "first_k_dense_replace") {
|
||||
let attention = if let Some(first_dense) = u32_field(v, "first_k_dense_replace") {
|
||||
// DSA-style model (GLM-5, DeepSeek-V3).
|
||||
// dense_window and sparse_stride are typically not in config.json;
|
||||
// use sensible defaults the user can override in YAML.
|
||||
@@ -188,6 +186,12 @@ mod tests {
|
||||
assert_eq!(moe.num_active_experts, 8);
|
||||
let mla = m.mla.as_ref().unwrap();
|
||||
assert_eq!(mla.kv_lora_rank, 512);
|
||||
assert!(matches!(m.attention, Some(AttentionConfig::Dsa { first_dense_layers: 3, .. })));
|
||||
assert!(matches!(
|
||||
m.attention,
|
||||
Some(AttentionConfig::Dsa {
|
||||
first_dense_layers: 3,
|
||||
..
|
||||
})
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -22,7 +22,7 @@
|
||||
//! `effective_ctx(N)` equals `N` for dense attention (→ O(N²) total) but
|
||||
//! is sub-linear for DSA / sliding-window.
|
||||
|
||||
use crate::config::{AttentionConfig, HardwareConfig, ModelConfig};
|
||||
use crate::config::{AttentionConfig, CalibrationConfig, HardwareConfig, ModelConfig};
|
||||
|
||||
/// Resolved attention pattern used at runtime.
|
||||
#[derive(Debug, Clone)]
|
||||
@@ -33,7 +33,10 @@ pub enum AttentionPattern {
|
||||
SlidingWindow { window: f64 },
|
||||
/// DeepSeek Sparse Attention: effective_ctx = min(N, dense_window) +
|
||||
/// max(0, N - dense_window) / sparse_stride.
|
||||
Dsa { dense_window: f64, sparse_stride: f64 },
|
||||
Dsa {
|
||||
dense_window: f64,
|
||||
sparse_stride: f64,
|
||||
},
|
||||
}
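The doc comment on `Dsa` above states the effective-context formula in words. A minimal sketch of that formula follows, assuming a standalone `Pattern` enum and a free function `effective_ctx` (both hypothetical stand-ins, not the crate's API). The sliding-window arm assumes the context is simply capped at `window`, and the sketch drops the per-layer dense/sparse flag that the real `effective_ctx(n, ..)` call takes in this diff.

```rust
/// Hypothetical stand-in for the simulator's `AttentionPattern`.
#[derive(Debug)]
enum Pattern {
    Dense,
    SlidingWindow { window: f64 },
    Dsa { dense_window: f64, sparse_stride: f64 },
}

/// Effective context length attended to at sequence length `n`.
fn effective_ctx(p: &Pattern, n: f64) -> f64 {
    match p {
        // Dense attention looks at every previous token.
        Pattern::Dense => n,
        // Assumption: a sliding window caps the context at `window`.
        Pattern::SlidingWindow { window } => n.min(*window),
        // min(N, dense_window) + max(0, N - dense_window) / sparse_stride
        Pattern::Dsa { dense_window, sparse_stride } => {
            n.min(*dense_window) + (n - *dense_window).max(0.0) / *sparse_stride
        }
    }
}

fn main() {
    let dsa = Pattern::Dsa { dense_window: 4096.0, sparse_stride: 8.0 };
    for n in [1024.0, 4096.0, 65536.0] {
        println!(
            "n={} dense={} dsa={}",
            n,
            effective_ctx(&Pattern::Dense, n),
            effective_ctx(&dsa, n)
        );
    }
}
```

Below `dense_window` the DSA pattern behaves like dense attention; beyond it, each extra token adds only `1 / sparse_stride` to the effective context, which is why long prompts grow sub-quadratically.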
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
@@ -52,24 +55,46 @@ pub struct ComputeModel {
|
||||
pub attn_pattern: AttentionPattern,
|
||||
/// Weight bytes read from HBM per layer (for memory-bound check).
|
||||
pub weight_bytes_per_layer: f64,
|
||||
/// Approximate bytes moved by each TP collective, per token per layer.
|
||||
pub tp_bytes_per_token: f64,
|
||||
/// Number of TP collectives per layer on the critical path.
|
||||
pub tp_collective_count_per_layer: f64,
|
||||
/// Peak GPU FLOPs (aggregate across TP group).
|
||||
pub gpu_flops: f64,
|
||||
/// Peak GPU memory bandwidth (aggregate across TP group).
|
||||
pub gpu_mem_bw: f64,
|
||||
/// Peak node-local TP bandwidth.
|
||||
pub intra_node_tp_bw: f64,
|
||||
/// Fixed latency per TP collective.
|
||||
pub intra_node_tp_latency_s: f64,
|
||||
/// Effective utilization for GEMM-like linear kernels.
|
||||
pub matmul_util: f64,
|
||||
/// Effective utilization for attention kernels.
|
||||
pub attention_util: f64,
|
||||
/// Effective utilization for HBM streaming.
|
||||
pub hbm_bw_util: f64,
|
||||
/// Effective utilization for TP bandwidth.
|
||||
pub tp_bw_util: f64,
|
||||
/// Fraction of TP communication that can overlap with compute.
|
||||
pub tp_overlap_ratio: f64,
|
||||
/// Fixed per-layer non-FLOP overhead.
|
||||
pub misc_layer_overhead_s: f64,
|
||||
/// Fixed launch overhead per prefill chunk.
|
||||
pub chunk_launch_overhead_s: f64,
|
||||
}
|
||||
|
||||
impl ComputeModel {
|
||||
pub fn new(model: &ModelConfig, hw: &HardwareConfig) -> Self {
|
||||
pub fn new(model: &ModelConfig, hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
|
||||
if model.is_arch_mode() {
|
||||
Self::from_arch(model, hw)
|
||||
Self::from_arch(model, hw, calib)
|
||||
} else {
|
||||
Self::from_manual(model, hw)
|
||||
Self::from_manual(model, hw, calib)
|
||||
}
|
||||
}
|
||||
|
||||
// ----- Architecture-derived construction --------------------------------
|
||||
|
||||
fn from_arch(model: &ModelConfig, hw: &HardwareConfig) -> Self {
|
||||
fn from_arch(model: &ModelConfig, hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
|
||||
let h = model.hidden_size.unwrap() as f64;
|
||||
let n_heads = model.num_attention_heads.unwrap_or(model.num_kv_heads) as f64;
|
||||
let n_kv = model.num_kv_heads as f64;
|
||||
@@ -101,7 +126,8 @@ impl ComputeModel {
|
||||
|
||||
// --- MLP FLOPs/token/layer (SwiGLU: gate + up + down = 3 matmuls) ---
|
||||
let mlp = if let Some(moe) = &model.moe {
|
||||
let expert_inter = moe.expert_intermediate_size
|
||||
let expert_inter =
|
||||
moe.expert_intermediate_size
|
||||
.unwrap_or(model.intermediate_size.unwrap_or(0)) as f64;
|
||||
let active = moe.num_active_experts as f64;
|
||||
let shared = moe.num_shared_experts as f64;
|
||||
@@ -111,6 +137,11 @@ impl ComputeModel {
|
||||
};
|
||||
|
||||
let linear_flops = attn_linear + mlp;
|
||||
let tp_bytes_per_token = if hw.tp_degree > 1 {
|
||||
h * model.dtype_bytes as f64
|
||||
} else {
|
||||
0.0
|
||||
};
|
||||
|
||||
// --- Attention quadratic coefficient ---
|
||||
// attn_flops_per_layer(N) = attn_coeff * N * effective_ctx(N)
|
||||
@@ -132,15 +163,13 @@ impl ComputeModel {
|
||||
let qk_hd = (mla.qk_nope_head_dim + mla.qk_rope_head_dim) as f64;
|
||||
let qk_rd = mla.qk_rope_head_dim as f64;
|
||||
let vhd = mla.v_head_dim as f64;
|
||||
(h * qlr + qlr * n_heads * qk_hd
|
||||
+ h * (kvlr + qk_rd)
|
||||
+ n_heads * vhd * h)
|
||||
* wdtype
|
||||
(h * qlr + qlr * n_heads * qk_hd + h * (kvlr + qk_rd) + n_heads * vhd * h) * wdtype
|
||||
} else {
|
||||
((n_heads + 2.0 * n_kv) * hd * h + n_heads * hd * h) * wdtype
|
||||
};
|
||||
let mlp_wt = if let Some(moe) = &model.moe {
|
||||
let expert_inter = moe.expert_intermediate_size
|
||||
let expert_inter =
|
||||
moe.expert_intermediate_size
|
||||
.unwrap_or(model.intermediate_size.unwrap_or(0)) as f64;
|
||||
let active = moe.num_active_experts as f64;
|
||||
let shared = moe.num_shared_experts as f64;
|
||||
@@ -169,10 +198,9 @@ impl ComputeModel {
|
||||
},
|
||||
0.0,
|
||||
),
|
||||
Some(AttentionConfig::Dense) | None => (
|
||||
AttentionPattern::Dense,
|
||||
model.num_layers as f64,
|
||||
),
|
||||
Some(AttentionConfig::Dense) | None => {
|
||||
(AttentionPattern::Dense, model.num_layers as f64)
|
||||
}
|
||||
};
|
||||
|
||||
Self {
|
||||
@@ -182,14 +210,29 @@ impl ComputeModel {
|
||||
attn_coeff,
|
||||
attn_pattern,
|
||||
weight_bytes_per_layer: weight_bytes,
|
||||
tp_bytes_per_token,
|
||||
tp_collective_count_per_layer: if hw.tp_degree > 1 {
|
||||
calib.tp_collective_count_per_layer
|
||||
} else {
|
||||
0.0
|
||||
},
|
||||
gpu_flops: hw.gpu_flops,
|
||||
gpu_mem_bw: hw.gpu_mem_bw,
|
||||
intra_node_tp_bw: hw.intra_node_tp_bw,
|
||||
intra_node_tp_latency_s: hw.intra_node_tp_latency_us * 1e-6,
|
||||
matmul_util: calib.matmul_util,
|
||||
attention_util: calib.attention_util,
|
||||
hbm_bw_util: calib.hbm_bw_util,
|
||||
tp_bw_util: calib.tp_bw_util,
|
||||
tp_overlap_ratio: calib.tp_overlap_ratio,
|
||||
misc_layer_overhead_s: calib.misc_layer_overhead_us * 1e-6,
|
||||
chunk_launch_overhead_s: calib.chunk_launch_overhead_us * 1e-6,
|
||||
}
|
||||
}
|
||||
|
||||
// ----- Legacy manual construction ---------------------------------------
|
||||
|
||||
fn from_manual(model: &ModelConfig, hw: &HardwareConfig) -> Self {
|
||||
fn from_manual(model: &ModelConfig, hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
|
||||
Self {
|
||||
num_layers: model.num_layers as f64,
|
||||
first_dense_layers: model.num_layers as f64,
|
||||
@@ -197,8 +240,19 @@ impl ComputeModel {
|
||||
attn_coeff: model.attn_quadratic_coeff.unwrap_or(0.0),
|
||||
attn_pattern: AttentionPattern::Dense,
|
||||
weight_bytes_per_layer: 0.0,
|
||||
tp_bytes_per_token: 0.0,
|
||||
tp_collective_count_per_layer: 0.0,
|
||||
gpu_flops: hw.gpu_flops,
|
||||
gpu_mem_bw: hw.gpu_mem_bw,
|
||||
intra_node_tp_bw: hw.intra_node_tp_bw,
|
||||
intra_node_tp_latency_s: hw.intra_node_tp_latency_us * 1e-6,
|
||||
matmul_util: calib.matmul_util,
|
||||
attention_util: calib.attention_util,
|
||||
hbm_bw_util: calib.hbm_bw_util,
|
||||
tp_bw_util: calib.tp_bw_util,
|
||||
tp_overlap_ratio: calib.tp_overlap_ratio,
|
||||
misc_layer_overhead_s: calib.misc_layer_overhead_us * 1e-6,
|
||||
chunk_launch_overhead_s: calib.chunk_launch_overhead_us * 1e-6,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -231,30 +285,47 @@ impl ComputeModel {
|
||||
return 0.0;
|
||||
}
|
||||
let n = n as f64;
|
||||
let linear = n * self.linear_flops_per_token;
|
||||
let linear_flops = n * self.linear_flops_per_token;
|
||||
|
||||
// Compute FLOPs across all layers (dense + sparse may differ).
|
||||
let dense_layers = self.first_dense_layers;
|
||||
let sparse_layers = self.num_layers - dense_layers;
|
||||
|
||||
let dense_flops = dense_layers
|
||||
* (linear + self.attn_coeff * n * self.effective_ctx(n, true));
|
||||
let sparse_flops = sparse_layers
|
||||
* (linear + self.attn_coeff * n * self.effective_ctx(n, false));
|
||||
let total_flops = dense_flops + sparse_flops;
|
||||
let linear_total_flops = self.num_layers * linear_flops;
|
||||
let dense_attn_flops = dense_layers * (self.attn_coeff * n * self.effective_ctx(n, true));
|
||||
let sparse_attn_flops =
|
||||
sparse_layers * (self.attn_coeff * n * self.effective_ctx(n, false));
|
||||
let attn_total_flops = dense_attn_flops + sparse_attn_flops;
|
||||
|
||||
let compute_time = total_flops / self.gpu_flops;
|
||||
let linear_time = linear_total_flops / (self.gpu_flops * self.matmul_util.max(1e-6));
|
||||
let attn_time = attn_total_flops / (self.gpu_flops * self.attention_util.max(1e-6));
|
||||
let compute_time = linear_time + attn_time + self.num_layers * self.misc_layer_overhead_s;
|
||||
// Weight stream: all layers' active weights read once from HBM.
|
||||
let mem_time = self.weight_bytes_per_layer * self.num_layers / self.gpu_mem_bw;
|
||||
let mem_time = self.weight_bytes_per_layer * self.num_layers
|
||||
/ (self.gpu_mem_bw * self.hbm_bw_util.max(1e-6));
|
||||
let tp_comm_time = if self.tp_collective_count_per_layer > 0.0
|
||||
&& self.tp_bytes_per_token > 0.0
|
||||
&& self.intra_node_tp_bw > 0.0
|
||||
{
|
||||
self.num_layers
|
||||
* (self.tp_collective_count_per_layer * self.intra_node_tp_latency_s
|
||||
+ self.tp_collective_count_per_layer * self.tp_bytes_per_token * n
|
||||
/ (self.intra_node_tp_bw * self.tp_bw_util.max(1e-6)))
|
||||
} else {
|
||||
0.0
|
||||
};
|
||||
let tp_tail = (tp_comm_time - self.tp_overlap_ratio * (linear_time + attn_time)).max(0.0);
|
||||
|
||||
compute_time.max(mem_time)
|
||||
self.chunk_launch_overhead_s + compute_time.max(mem_time) + tp_tail
|
||||
}
|
||||
|
||||
/// Print human-readable derived coefficients (for `validate` output).
|
||||
pub fn describe(&self) -> String {
|
||||
let pattern_str = match &self.attn_pattern {
|
||||
AttentionPattern::Dense => "dense".to_string(),
|
||||
AttentionPattern::SlidingWindow { window } => format!("sliding_window({})", *window as u64),
|
||||
AttentionPattern::SlidingWindow { window } => {
|
||||
format!("sliding_window({})", *window as u64)
|
||||
}
|
||||
AttentionPattern::Dsa {
|
||||
dense_window,
|
||||
sparse_stride,
|
||||
@@ -266,8 +337,7 @@ impl ComputeModel {
|
||||
format!(
|
||||
"linear_flops/tok/layer={:.3e}, attn_coeff={:.0}, pattern={}, \
|
||||
weight_bytes/layer={:.2e}",
|
||||
self.linear_flops_per_token, self.attn_coeff, pattern_str,
|
||||
self.weight_bytes_per_layer,
|
||||
self.linear_flops_per_token, self.attn_coeff, pattern_str, self.weight_bytes_per_layer,
|
||||
)
|
||||
}
|
||||
}
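The reworked `prefill_time` combines a utilization-scaled roofline with an exposed tensor-parallel tail. The sketch below restates that structure under simplifying assumptions: the `Roofline` struct and `base()` helper are hypothetical, attention is treated as dense on every layer (`effective_ctx(n) = n`), and the fixed per-layer and per-chunk overheads are left out.

```rust
/// Hypothetical, trimmed-down mirror of the fields `prefill_time` reads.
#[derive(Clone, Copy)]
struct Roofline {
    num_layers: f64,
    linear_flops_per_token: f64,
    attn_coeff: f64,
    weight_bytes_per_layer: f64,
    gpu_flops: f64,
    gpu_mem_bw: f64,
    matmul_util: f64,
    attention_util: f64,
    hbm_bw_util: f64,
    tp_collectives_per_layer: f64,
    tp_bytes_per_token: f64,
    tp_bw: f64,
    tp_bw_util: f64,
    tp_latency_s: f64,
    tp_overlap_ratio: f64,
}

impl Roofline {
    fn prefill_time(&self, n: f64) -> f64 {
        // Linear and attention FLOPs are derated by separate utilizations.
        let linear_time = self.num_layers * n * self.linear_flops_per_token
            / (self.gpu_flops * self.matmul_util.max(1e-6));
        let attn_time = self.num_layers * self.attn_coeff * n * n
            / (self.gpu_flops * self.attention_util.max(1e-6));
        let compute_time = linear_time + attn_time;
        // Weight streaming: every layer's weights are read once from HBM.
        let mem_time = self.weight_bytes_per_layer * self.num_layers
            / (self.gpu_mem_bw * self.hbm_bw_util.max(1e-6));
        // TP collectives: fixed latency plus a bandwidth term, per layer.
        let tp_active = self.tp_collectives_per_layer > 0.0
            && self.tp_bytes_per_token > 0.0
            && self.tp_bw > 0.0;
        let tp_comm = if tp_active {
            self.num_layers
                * self.tp_collectives_per_layer
                * (self.tp_latency_s
                    + self.tp_bytes_per_token * n / (self.tp_bw * self.tp_bw_util.max(1e-6)))
        } else {
            0.0
        };
        // Only the part of communication that cannot hide behind compute is paid.
        let tp_tail = (tp_comm - self.tp_overlap_ratio * compute_time).max(0.0);
        compute_time.max(mem_time) + tp_tail
    }
}

/// Illustrative numbers only; they do not describe any real GPU.
fn base() -> Roofline {
    Roofline {
        num_layers: 2.0,
        linear_flops_per_token: 1.0e6,
        attn_coeff: 1000.0,
        weight_bytes_per_layer: 1.0e8,
        gpu_flops: 1.0e12,
        gpu_mem_bw: 1.0e12,
        matmul_util: 1.0,
        attention_util: 1.0,
        hbm_bw_util: 1.0,
        tp_collectives_per_layer: 0.0,
        tp_bytes_per_token: 1000.0,
        tp_bw: 1.0e10,
        tp_bw_util: 1.0,
        tp_latency_s: 1.0e-4,
        tp_overlap_ratio: 1.0,
    }
}

fn main() {
    let r = base();
    let exposed = Roofline { tp_collectives_per_layer: 2.0, tp_overlap_ratio: 0.0, ..r };
    println!("no TP: {:.6} s", r.prefill_time(1000.0));
    println!("TP, no overlap: {:.6} s", exposed.prefill_time(1000.0));
}
```

With `tp_overlap_ratio` at 1.0 the collectives vanish whenever compute time exceeds communication time; at 0.0 the full communication time is added on top of the roofline maximum.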
|
||||
@@ -275,6 +345,7 @@ impl ComputeModel {
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::config::CalibrationConfig;
|
||||
|
||||
fn cm_legacy() -> ComputeModel {
|
||||
ComputeModel {
|
||||
@@ -284,8 +355,19 @@ mod tests {
|
||||
attn_coeff: 1024.0,
|
||||
attn_pattern: AttentionPattern::Dense,
|
||||
weight_bytes_per_layer: 0.0,
|
||||
tp_bytes_per_token: 0.0,
|
||||
tp_collective_count_per_layer: 0.0,
|
||||
gpu_flops: 9.89e14,
|
||||
gpu_mem_bw: 3.35e12,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_s: 2.0e-6,
|
||||
matmul_util: 1.0,
|
||||
attention_util: 1.0,
|
||||
hbm_bw_util: 1.0,
|
||||
tp_bw_util: 1.0,
|
||||
tp_overlap_ratio: 1.0,
|
||||
misc_layer_overhead_s: 0.0,
|
||||
chunk_launch_overhead_s: 0.0,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -325,8 +407,19 @@ mod tests {
|
||||
attn_coeff: 139264.0,
|
||||
attn_pattern: AttentionPattern::Dense,
|
||||
weight_bytes_per_layer: 0.0,
|
||||
tp_bytes_per_token: 0.0,
|
||||
tp_collective_count_per_layer: 0.0,
|
||||
gpu_flops: 1.8e16,
|
||||
gpu_mem_bw: 6.4e13,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_s: 2.0e-6,
|
||||
matmul_util: 1.0,
|
||||
attention_util: 1.0,
|
||||
hbm_bw_util: 1.0,
|
||||
tp_bw_util: 1.0,
|
||||
tp_overlap_ratio: 1.0,
|
||||
misc_layer_overhead_s: 0.0,
|
||||
chunk_launch_overhead_s: 0.0,
|
||||
};
|
||||
let dsa = ComputeModel {
|
||||
attn_pattern: AttentionPattern::Dsa {
|
||||
@@ -358,8 +451,19 @@ mod tests {
|
||||
attn_coeff: 1.0,
|
||||
attn_pattern: AttentionPattern::Dense,
|
||||
weight_bytes_per_layer: 1.0e12, // 1 TB per layer
|
||||
tp_bytes_per_token: 0.0,
|
||||
tp_collective_count_per_layer: 0.0,
|
||||
gpu_flops: 1.0e15,
|
||||
gpu_mem_bw: 1.0e12,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_s: 2.0e-6,
|
||||
matmul_util: 1.0,
|
||||
attention_util: 1.0,
|
||||
hbm_bw_util: 1.0,
|
||||
tp_bw_util: 1.0,
|
||||
tp_overlap_ratio: 1.0,
|
||||
misc_layer_overhead_s: 0.0,
|
||||
chunk_launch_overhead_s: 0.0,
|
||||
};
|
||||
let t1 = m.prefill_time(1);
|
||||
let t8 = m.prefill_time(8);
|
||||
@@ -391,18 +495,122 @@ mod tests {
|
||||
gpu_mem_bw: 1e12,
|
||||
hbm_bytes: 1e9,
|
||||
dram_bytes: 4e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_us: 2.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
};
|
||||
let cm = ComputeModel::new(&model, &hw);
|
||||
let cm = ComputeModel::new(&model, &hw, &CalibrationConfig::default());
|
||||
assert!(cm.linear_flops_per_token > 0.0);
|
||||
assert!(cm.attn_coeff > 0.0);
|
||||
assert!(cm.weight_bytes_per_layer > 0.0);
|
||||
let t = cm.prefill_time(1024);
|
||||
assert!(t > 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn lower_utilization_increases_prefill_time() {
|
||||
let model = ModelConfig {
|
||||
name: "test".into(),
|
||||
num_layers: 8,
|
||||
num_kv_heads: 4,
|
||||
head_dim: 128,
|
||||
dtype_bytes: 2,
|
||||
block_size_tokens: 16,
|
||||
hidden_size: Some(1024),
|
||||
num_attention_heads: Some(8),
|
||||
intermediate_size: Some(4096),
|
||||
..Default::default()
|
||||
};
|
||||
let hw = HardwareConfig {
|
||||
gpu_flops: 1e14,
|
||||
gpu_fp8_flops: 0.0,
|
||||
gpu_fp4_flops: 0.0,
|
||||
gpu_mem_bw: 1e12,
|
||||
hbm_bytes: 1e9,
|
||||
dram_bytes: 4e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_us: 2.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
};
|
||||
let fast = ComputeModel::new(&model, &hw, &CalibrationConfig::default());
|
||||
let slow = ComputeModel::new(
|
||||
&model,
|
||||
&hw,
|
||||
&CalibrationConfig {
|
||||
matmul_util: 0.2,
|
||||
attention_util: 0.15,
|
||||
..CalibrationConfig::default()
|
||||
},
|
||||
);
|
||||
|
||||
assert!(slow.prefill_time(4096) > fast.prefill_time(4096));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn tp_communication_adds_tail_when_overlap_is_limited() {
|
||||
let model = ModelConfig {
|
||||
name: "test".into(),
|
||||
num_layers: 8,
|
||||
num_kv_heads: 4,
|
||||
head_dim: 128,
|
||||
dtype_bytes: 2,
|
||||
block_size_tokens: 16,
|
||||
hidden_size: Some(2048),
|
||||
num_attention_heads: Some(16),
|
||||
intermediate_size: Some(8192),
|
||||
..Default::default()
|
||||
};
|
||||
let hw = HardwareConfig {
|
||||
gpu_flops: 1e14,
|
||||
gpu_fp8_flops: 0.0,
|
||||
gpu_fp4_flops: 0.0,
|
||||
gpu_mem_bw: 1e12,
|
||||
hbm_bytes: 1e9,
|
||||
dram_bytes: 4e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 1.0e10,
|
||||
intra_node_tp_latency_us: 20.0,
|
||||
tp_degree: 8,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
};
|
||||
let no_tp = ComputeModel::new(
|
||||
&model,
|
||||
&hw,
|
||||
&CalibrationConfig {
|
||||
tp_overlap_ratio: 1.0,
|
||||
tp_bw_util: 1.0,
|
||||
..CalibrationConfig::default()
|
||||
},
|
||||
);
|
||||
let tp_tail = ComputeModel::new(
|
||||
&model,
|
||||
&hw,
|
||||
&CalibrationConfig {
|
||||
tp_overlap_ratio: 0.0,
|
||||
tp_bw_util: 0.2,
|
||||
..CalibrationConfig::default()
|
||||
},
|
||||
);
|
||||
|
||||
assert!(tp_tail.prefill_time(2048) > no_tp.prefill_time(2048));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -19,7 +19,7 @@
|
||||
|
||||
use std::collections::VecDeque;
|
||||
|
||||
use crate::config::{HardwareConfig, ModelConfig};
|
||||
use crate::config::{CalibrationConfig, HardwareConfig, ModelConfig};
|
||||
use crate::instance::compute::ComputeModel;
|
||||
use crate::instance::kv_cache::TwoTierCache;
|
||||
use crate::network::InstanceLinks;
|
||||
@@ -37,6 +37,8 @@ pub struct AdmittedRequest {
|
||||
/// KV blocks reserved on this instance's HBM for the lifetime of this
|
||||
/// request's prefill (= number of input blocks).
|
||||
pub reserved_blocks: u32,
|
||||
/// Tail latency between prefill completion and first-token visibility.
|
||||
pub completion_tail_s: f64,
|
||||
}
|
||||
|
||||
#[derive(Debug)]
|
||||
@@ -68,15 +70,20 @@ pub struct Instance {
|
||||
}
|
||||
|
||||
impl Instance {
|
||||
pub fn new(id: InstanceId, model: &ModelConfig, hw: &HardwareConfig) -> Self {
|
||||
pub fn new(
|
||||
id: InstanceId,
|
||||
model: &ModelConfig,
|
||||
hw: &HardwareConfig,
|
||||
calib: &CalibrationConfig,
|
||||
) -> Self {
|
||||
let block_bytes = model.kv_block_bytes() as f64;
|
||||
let hbm_blocks = (hw.hbm_bytes / block_bytes).max(1.0) as u32;
|
||||
let dram_blocks = (hw.dram_bytes / block_bytes).max(1.0) as u32;
|
||||
Self {
|
||||
id,
|
||||
cache: TwoTierCache::new(hbm_blocks as usize, dram_blocks as usize),
|
||||
links: InstanceLinks::from_hw(hw),
|
||||
compute: ComputeModel::new(model, hw),
|
||||
links: InstanceLinks::from_hw(hw, calib),
|
||||
compute: ComputeModel::new(model, hw, calib),
|
||||
block_size_tokens: model.block_size_tokens,
|
||||
hbm_block_budget: hbm_blocks,
|
||||
dram_block_budget: dram_blocks,
|
||||
@@ -141,9 +148,10 @@ impl Instance {
|
||||
self.kv_blocks_used += r.reserved_blocks;
|
||||
if r.prefill_tokens_remaining == 0 {
|
||||
// Full cache hit: nothing to compute. TTFT == fetch time.
|
||||
let ttft = now - r.arrival;
|
||||
let t_done = now + r.completion_tail_s;
|
||||
let ttft = t_done - r.arrival;
|
||||
self.kv_blocks_used = self.kv_blocks_used.saturating_sub(r.reserved_blocks);
|
||||
completed.push((r.req_id, ttft, now));
|
||||
completed.push((r.req_id, ttft, t_done));
|
||||
} else {
|
||||
self.prefilling.push_back(r);
|
||||
}
|
||||
@@ -171,9 +179,10 @@ impl Instance {
|
||||
head.prefill_tokens_remaining -= chunk_tokens;
|
||||
if head.prefill_tokens_remaining == 0 {
|
||||
let done = self.prefilling.pop_front().unwrap();
|
||||
let ttft = t_end - done.arrival;
|
||||
let t_done = t_end + done.completion_tail_s;
|
||||
let ttft = t_done - done.arrival;
|
||||
self.kv_blocks_used = self.kv_blocks_used.saturating_sub(done.reserved_blocks);
|
||||
completed.push((done.req_id, ttft, t_end));
|
||||
completed.push((done.req_id, ttft, t_done));
|
||||
}
|
||||
|
||||
StepResult {
|
||||
|
||||
@@ -1,6 +1,6 @@
pub mod compute;
pub mod kv_cache;
#[allow(clippy::module_inception)]
pub mod instance;
pub mod kv_cache;

pub use instance::Instance;
|
||||
|
||||
@@ -11,4 +11,5 @@ pub mod replay;
pub mod router;
pub mod sim;
pub mod trace;
pub mod ttft;
pub mod types;
384
src/main.rs
@@ -38,12 +38,26 @@ struct ConfigOverrides {
|
||||
/// Override `cluster.meta_store.ttl_seconds`.
|
||||
#[arg(long)]
|
||||
ttl_seconds: Option<f64>,
|
||||
/// Override `sim.input_length_min` — requests with `input_length` below
|
||||
/// this value are dropped from the replay.
|
||||
#[arg(long)]
|
||||
input_length_min: Option<u32>,
|
||||
/// Override `sim.input_length_max` — requests with `input_length` above
|
||||
/// this value are dropped from the replay. Combine with `--input-length-min`
|
||||
/// to carve out a specific input-length bucket for ablation.
|
||||
#[arg(long)]
|
||||
input_length_max: Option<u32>,
|
||||
}
|
||||
|
||||
impl ConfigOverrides {
|
||||
fn apply(&self, cfg: &mut Config) {
|
||||
fn apply(&self, cfg: &mut Config) -> Result<()> {
|
||||
if let Some(n) = self.num_instances {
|
||||
cfg.cluster.num_instances = n;
|
||||
if !cfg.cluster.buckets.is_empty() {
|
||||
anyhow::bail!(
|
||||
"--num-instances does not support cluster.buckets until Task 2 lands"
|
||||
);
|
||||
}
|
||||
cfg.cluster.num_instances = Some(n);
|
||||
}
|
||||
if let Some(m) = self.max_requests {
|
||||
cfg.sim.max_requests = Some(m);
|
||||
@@ -63,6 +77,13 @@ impl ConfigOverrides {
|
||||
if let Some(ttl) = self.ttl_seconds {
|
||||
cfg.cluster.meta_store.ttl_seconds = ttl;
|
||||
}
|
||||
if let Some(lo) = self.input_length_min {
|
||||
cfg.sim.input_length_min = Some(lo);
|
||||
}
|
||||
if let Some(hi) = self.input_length_max {
|
||||
cfg.sim.input_length_max = Some(hi);
|
||||
}
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
|
||||
@@ -91,6 +112,27 @@ enum Cmd {
|
||||
/// Currently only `lru` is supported.
|
||||
#[arg(long, default_value = "lru")]
|
||||
evict_policies: String,
|
||||
/// Sweep `num_instances` from `--auto-candidates` with the
|
||||
/// `--auto-probe-router` and pick the smallest cluster size whose
|
||||
/// TTFT mean ≤ `--auto-target-ttft-mean`. Overrides the YAML
|
||||
/// `num_instances` for the ablation run.
|
||||
#[arg(long, default_value_t = false)]
|
||||
auto_instances: bool,
|
||||
/// Target TTFT mean (seconds) for auto-instances calibration.
|
||||
#[arg(long, default_value_t = 4.0)]
|
||||
auto_target_ttft_mean: f64,
|
||||
/// Comma-separated candidate cluster sizes (ascending).
|
||||
#[arg(long, default_value = "4,8,16,24,32,48,64,96,128")]
|
||||
auto_candidates: String,
|
||||
/// Router used as the calibration baseline. The smallest candidate
|
||||
/// where this router's TTFT mean ≤ target is picked — all ablation
|
||||
/// routers are then run at that cluster size.
|
||||
#[arg(long, default_value = "cache_score")]
|
||||
auto_probe_router: String,
|
||||
/// Maximum number of routers to simulate concurrently.
|
||||
/// `0` means auto-detect from available CPU parallelism.
|
||||
#[arg(long, default_value_t = 0)]
|
||||
jobs: usize,
|
||||
#[command(flatten)]
|
||||
overrides: ConfigOverrides,
|
||||
},
|
||||
@@ -132,8 +174,23 @@ fn main() -> Result<()> {
|
||||
config,
|
||||
routers,
|
||||
evict_policies,
|
||||
auto_instances,
|
||||
auto_target_ttft_mean,
|
||||
auto_candidates,
|
||||
auto_probe_router,
|
||||
jobs,
|
||||
overrides,
|
||||
} => cmd_ablate(&config, &routers, &evict_policies, &overrides),
|
||||
} => cmd_ablate(
|
||||
&config,
|
||||
&routers,
|
||||
&evict_policies,
|
||||
auto_instances,
|
||||
auto_target_ttft_mean,
|
||||
&auto_candidates,
|
||||
&auto_probe_router,
|
||||
jobs,
|
||||
&overrides,
|
||||
),
|
||||
Cmd::Validate { config, overrides } => cmd_validate(&config, &overrides),
|
||||
Cmd::Oracle {
|
||||
config,
|
||||
@@ -153,7 +210,8 @@ fn main() -> Result<()> {
|
||||
|
||||
fn load(config: &PathBuf, overrides: &ConfigOverrides) -> Result<Config> {
|
||||
let mut cfg = Config::from_yaml_path(config)?;
|
||||
overrides.apply(&mut cfg);
|
||||
overrides.apply(&mut cfg)?;
|
||||
cfg.cluster.validate()?;
|
||||
Ok(cfg)
|
||||
}
|
||||
|
||||
@@ -164,13 +222,19 @@ fn cmd_run(path: &PathBuf, overrides: &ConfigOverrides) -> Result<()> {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn cmd_ablate(
|
||||
path: &PathBuf,
|
||||
routers: &str,
|
||||
evict_policies: &str,
|
||||
auto_instances: bool,
|
||||
auto_target_ttft_mean: f64,
|
||||
auto_candidates: &str,
|
||||
auto_probe_router: &str,
|
||||
jobs: usize,
|
||||
overrides: &ConfigOverrides,
|
||||
) -> Result<()> {
|
||||
let base = load(path, overrides)?;
|
||||
let mut base = load(path, overrides)?;
|
||||
let modes: Vec<RouterMode> = routers
|
||||
.split(',')
|
||||
.map(|s| s.trim())
|
||||
@@ -185,8 +249,38 @@ fn cmd_ablate(
|
||||
.map(ReplayEvictPolicy::parse)
|
||||
.collect::<Result<Vec<_>>>()
|
||||
.with_context(|| format!("parsing --evict-policies='{evict_policies}'"))?;
|
||||
|
||||
if auto_instances {
|
||||
let candidates: Vec<u32> = auto_candidates
|
||||
.split(',')
|
||||
.map(|s| s.trim())
|
||||
.filter(|s| !s.is_empty())
|
||||
.map(|s| {
|
||||
s.parse::<u32>()
|
||||
.with_context(|| format!("parsing --auto-candidates entry '{s}'"))
|
||||
})
|
||||
.collect::<Result<Vec<_>>>()?;
|
||||
if candidates.is_empty() {
|
||||
return Err(anyhow::anyhow!("--auto-candidates is empty"));
|
||||
}
|
||||
// Ascending so the first hit is the smallest cluster meeting the target.
|
||||
let mut sorted = candidates.clone();
|
||||
sorted.sort_unstable();
|
||||
let probe_mode = RouterMode::parse(auto_probe_router)
|
||||
.with_context(|| format!("parsing --auto-probe-router='{auto_probe_router}'"))?;
|
||||
let chosen = auto_select_instances(&base, &sorted, probe_mode, auto_target_ttft_mean)?;
|
||||
eprintln!(
|
||||
"[ablate] routers={} evict_policies={}",
|
||||
"[ablate] auto-instances chose num_instances={} (target ttft_mean ≤ {:.3}s, probe_router={})",
|
||||
chosen,
|
||||
auto_target_ttft_mean,
|
||||
probe_mode.as_str()
|
||||
);
|
||||
base.cluster.num_instances = Some(chosen);
|
||||
base.cluster.buckets.clear();
|
||||
}
|
||||
|
||||
eprintln!(
|
||||
"[ablate] routers={} evict_policies={} num_instances={} jobs={}",
|
||||
modes
|
||||
.iter()
|
||||
.map(RouterMode::as_str)
|
||||
@@ -196,9 +290,15 @@ fn cmd_ablate(
|
||||
.iter()
|
||||
.map(ReplayEvictPolicy::as_str)
|
||||
.collect::<Vec<_>>()
|
||||
.join(",")
|
||||
.join(","),
|
||||
base.cluster.total_instances(),
|
||||
if jobs == 0 {
|
||||
"auto".to_string()
|
||||
} else {
|
||||
jobs.to_string()
|
||||
},
|
||||
);
|
||||
let all = driver::ablate_fixed_placement(&base, &modes, &policies)?;
|
||||
let all = driver::ablate_fixed_placement_with_parallelism(&base, &modes, &policies, jobs)?;
|
||||
let agg_path = std::path::Path::new(&base.sim.output_dir).join("ablation.json");
|
||||
std::fs::create_dir_all(&base.sim.output_dir)?;
|
||||
std::fs::write(&agg_path, serde_json::to_string_pretty(&all)?)?;
|
||||
@@ -207,6 +307,111 @@ fn cmd_ablate(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Sweep candidate cluster sizes ascending and return the smallest one whose
|
||||
/// TTFT mean under `probe` is ≤ `target_ttft_mean`. Per-candidate calibration
|
||||
/// summaries are written under `<output_dir>/auto_instances/` so the picked
|
||||
/// N is auditable. If no candidate meets the target, returns an error naming
|
||||
/// the best achievable TTFT.
|
||||
fn auto_select_instances(
|
||||
base: &Config,
|
||||
candidates: &[u32],
|
||||
probe: RouterMode,
|
||||
target_ttft_mean: f64,
|
||||
) -> Result<u32> {
|
||||
base.cluster
|
||||
.require_legacy_single_pool("auto-instances calibration")?;
|
||||
|
||||
#[derive(serde::Serialize)]
|
||||
struct CalibRow {
|
||||
num_instances: u32,
|
||||
router: String,
|
||||
ttft_mean: f64,
|
||||
ttft_p50: f64,
|
||||
ttft_p95: f64,
|
||||
ttft_p99: f64,
|
||||
num_requests: u64,
|
||||
hit_rate_l0: f64,
|
||||
passed: bool,
|
||||
}
|
||||
|
||||
let out_root = std::path::Path::new(&base.sim.output_dir).join("auto_instances");
|
||||
std::fs::create_dir_all(&out_root)?;
|
||||
let mut log: Vec<CalibRow> = Vec::new();
|
||||
let mut chosen: Option<u32> = None;
|
||||
|
||||
for &n in candidates {
|
||||
let mut cfg = base.clone();
|
||||
cfg.cluster.num_instances = Some(n);
|
||||
cfg.cluster.router.mode = probe;
|
||||
// Isolate calibration output so ablation runs don't overwrite it.
|
||||
cfg.sim.output_dir = out_root
|
||||
.join(format!("n{n}__{}", probe.as_str()))
|
||||
.to_string_lossy()
|
||||
.into_owned();
|
||||
eprintln!(
|
||||
"[auto-instances] probing num_instances={} router={} ...",
|
||||
n,
|
||||
probe.as_str()
|
||||
);
|
||||
let run = driver::run(&cfg, None)?;
|
||||
let passed = run.summary.ttft_mean <= target_ttft_mean;
|
||||
eprintln!(
|
||||
"[auto-instances] ttft_mean={:.3}s p95={:.3}s hit_l0={:.4} -> {}",
|
||||
run.summary.ttft_mean,
|
||||
run.summary.ttft_p95,
|
||||
run.summary.hit_rate_l0,
|
||||
if passed { "PASS" } else { "fail" },
|
||||
);
|
||||
log.push(CalibRow {
|
||||
num_instances: n,
|
||||
router: probe.as_str().to_string(),
|
||||
ttft_mean: run.summary.ttft_mean,
|
||||
ttft_p50: run.summary.ttft_p50,
|
||||
ttft_p95: run.summary.ttft_p95,
|
||||
ttft_p99: run.summary.ttft_p99,
|
||||
num_requests: run.summary.num_requests,
|
||||
hit_rate_l0: run.summary.hit_rate_l0,
|
||||
passed,
|
||||
});
|
||||
if passed && chosen.is_none() {
|
||||
chosen = Some(n);
|
||||
// Keep sweeping if you want a curve; here we stop at the smallest
|
||||
// passing N to satisfy the "not too small, but as small as it can
|
||||
// be while meeting the SLA" requirement.
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// Persist the calibration log either way so failures are debuggable.
|
||||
let log_path = out_root.join("calibration.json");
|
||||
std::fs::write(
|
||||
&log_path,
|
||||
serde_json::to_string_pretty(&serde_json::json!({
|
||||
"target_ttft_mean": target_ttft_mean,
|
||||
"probe_router": probe.as_str(),
|
||||
"candidates": candidates,
|
||||
"chosen": chosen,
|
||||
"runs": log,
|
||||
}))?,
|
||||
)?;
|
||||
eprintln!("[auto-instances] wrote {}", log_path.display());
|
||||
|
||||
chosen.ok_or_else(|| {
|
||||
let best = log
|
||||
.iter()
|
||||
.min_by(|a, b| a.ttft_mean.partial_cmp(&b.ttft_mean).unwrap())
|
||||
.map(|r| (r.num_instances, r.ttft_mean))
|
||||
.unwrap_or((0, f64::INFINITY));
|
||||
anyhow::anyhow!(
|
||||
"no candidate met target ttft_mean ≤ {:.3}s; best was n={} at {:.3}s — \
|
||||
widen --auto-candidates or raise --auto-target-ttft-mean",
|
||||
target_ttft_mean,
|
||||
best.0,
|
||||
best.1,
|
||||
)
|
||||
})
|
||||
}
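The selection rule in `auto_select_instances` can be isolated from the simulator. The sketch below is a simplified restatement, assuming a hypothetical `smallest_passing` helper whose `probe` closure stands in for one `driver::run` call and returns the TTFT mean in seconds; it does not write the calibration log.

```rust
/// Returns the smallest candidate whose probed TTFT mean is at or below
/// `target`. `probe` is a hypothetical stand-in for one simulation run.
fn smallest_passing<F>(candidates: &[u32], target: f64, mut probe: F) -> Result<u32, String>
where
    F: FnMut(u32) -> f64,
{
    // Sort ascending so the first pass is the smallest passing cluster size.
    let mut sorted = candidates.to_vec();
    sorted.sort_unstable();

    let mut best: Option<(u32, f64)> = None;
    for n in sorted {
        let ttft = probe(n);
        if ttft <= target {
            return Ok(n); // stop early: larger sizes are not probed
        }
        // Track the best failing candidate for the error message.
        if best.map_or(true, |(_, b)| ttft < b) {
            best = Some((n, ttft));
        }
    }
    match best {
        Some((n, t)) => Err(format!(
            "no candidate met target {target:.3}s; best was n={n} at {t:.3}s"
        )),
        None => Err("candidate list is empty".to_string()),
    }
}

fn main() {
    // Toy model, an assumption for illustration: TTFT mean is 64 / n seconds.
    let pick = smallest_passing(&[32, 4, 16, 8], 4.0, |n| 64.0 / n as f64);
    println!("{pick:?}");
}
```

Stopping at the first pass keeps calibration cheap, at the cost of not producing a full TTFT-versus-size curve, which matches the trade-off noted in the comment inside the loop above.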
|
||||
|
||||
fn cmd_validate(path: &PathBuf, overrides: &ConfigOverrides) -> Result<()> {
|
||||
use kvcache_simulator::instance::compute::ComputeModel;
|
||||
let cfg = load(path, overrides)?;
|
||||
@@ -219,7 +424,7 @@ fn cmd_validate(path: &PathBuf, overrides: &ConfigOverrides) -> Result<()> {
|
||||
"legacy manual"
|
||||
}
|
||||
);
|
||||
let cm = ComputeModel::new(&cfg.model, &cfg.hardware);
|
||||
let cm = ComputeModel::new(&cfg.model, &cfg.hardware, &cfg.calibration);
|
||||
eprintln!("compute: {}", cm.describe());
|
||||
eprintln!(
|
||||
"kv_block_bytes = {} ({:.2} MB{})",
|
||||
@@ -235,7 +440,7 @@ fn cmd_validate(path: &PathBuf, overrides: &ConfigOverrides) -> Result<()> {
|
||||
let hbm_blocks = (cfg.hardware.hbm_bytes / block_bytes) as u64;
|
||||
let dram_blocks = (cfg.hardware.dram_bytes / block_bytes) as u64;
|
||||
eprintln!("per-instance HBM blocks = {hbm_blocks}, DRAM blocks = {dram_blocks}");
|
||||
eprintln!("num_instances = {}", cfg.cluster.num_instances);
|
||||
eprintln!("num_instances = {}", cfg.cluster.total_instances());
|
||||
// Sample prefill times at a few prompt lengths.
|
||||
eprintln!("prefill_time samples:");
|
||||
for &n in &[256, 1024, 4096, 16384, 65536, 131072] {
|
||||
@@ -266,9 +471,10 @@ fn cmd_oracle(
|
||||
out_path: Option<&std::path::Path>,
|
||||
) -> Result<()> {
|
||||
let cfg = load(path, overrides)?;
|
||||
cfg.cluster.require_legacy_single_pool("oracle analysis")?;
|
||||
let block_bytes = cfg.model.kv_block_bytes() as f64;
|
||||
let per_instance_blocks = (cfg.hardware.hbm_bytes / block_bytes).max(1.0) as u64;
|
||||
let aggregate_blocks = per_instance_blocks * cfg.cluster.num_instances as u64;
|
||||
let aggregate_blocks = per_instance_blocks * cfg.cluster.total_instances() as u64;
|
||||
let capacity = match (capacity_blocks, per_instance) {
|
||||
(Some(_), true) => {
|
||||
return Err(anyhow::anyhow!(
|
||||
@@ -286,13 +492,45 @@ fn cmd_oracle(
|
||||
);
|
||||
let reader = TraceReader::open(&cfg.sim.trace_path, cfg.sim.max_requests)?;
|
||||
let records: Vec<_> = reader.collect::<Result<Vec<_>, _>>()?;
|
||||
|
||||
// Build a count-mask: all records feed the cache, but only records inside
|
||||
// the input-length range contribute to hit/miss accounting. This way a
|
||||
// 128k+ bucket still benefits from prefix blocks populated by shorter
|
||||
// requests, matching the real mixed-workload ceiling.
|
||||
let lo = cfg.sim.input_length_min.unwrap_or(0);
|
||||
let hi = cfg.sim.input_length_max.unwrap_or(u32::MAX);
|
||||
let has_filter = lo > 0 || hi < u32::MAX;
|
||||
let count_mask: Option<Vec<bool>> = if has_filter {
|
||||
let mask: Vec<bool> = records
|
||||
.iter()
|
||||
.map(|r| r.input_len >= lo && r.input_len <= hi)
|
||||
.collect();
|
||||
let counted = mask.iter().filter(|&&v| v).count();
|
||||
eprintln!(
|
||||
"[oracle] input_length filter [{}, {}] counting {}/{} requests \
|
||||
(all {} used for cache state)",
|
||||
lo,
|
||||
if hi == u32::MAX {
|
||||
"∞".to_string()
|
||||
} else {
|
||||
hi.to_string()
|
||||
},
|
||||
counted,
|
||||
records.len(),
|
||||
records.len(),
|
||||
);
|
||||
Some(mask)
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
eprintln!(
|
||||
"[oracle] loaded {} requests; analyzing with capacity = {} blocks \
|
||||
({} per-instance × {} instances{})",
|
||||
records.len(),
|
||||
capacity,
|
||||
per_instance_blocks,
|
||||
cfg.cluster.num_instances,
|
||||
cfg.cluster.total_instances(),
|
||||
if per_instance {
|
||||
", per-instance mode"
|
||||
} else {
|
||||
@@ -300,7 +538,7 @@ fn cmd_oracle(
|
||||
}
|
||||
);
|
||||
|
||||
let result = oracle::analyze(&records, capacity);
|
||||
let result = oracle::analyze(&records, capacity, count_mask.as_deref());
|
||||
let json = serde_json::to_string_pretty(&result)?;
|
||||
println!("{}", json);
|
||||
|
||||
@@ -315,3 +553,123 @@ fn cmd_oracle(
|
||||
eprintln!("[oracle] wrote {}", target.display());
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use kvcache_simulator::config::{
|
||||
BucketConfig, CalibrationConfig, ClusterConfig, GlobalRouterConfig, HardwareConfig,
|
||||
MetaStoreConfig, ModelConfig, RouterConfig, SimConfig,
|
||||
};
|
||||
|
||||
fn bucketed_config(out_dir: &str) -> Config {
|
||||
Config {
|
||||
model: ModelConfig {
|
||||
name: "test".into(),
|
||||
num_layers: 4,
|
||||
num_kv_heads: 2,
|
||||
head_dim: 64,
|
||||
dtype_bytes: 2,
|
||||
block_size_tokens: 16,
|
||||
flops_per_token_prefill: Some(1.0e9),
|
||||
attn_quadratic_coeff: Some(64.0),
|
||||
..Default::default()
|
||||
},
|
||||
hardware: HardwareConfig {
|
||||
gpu_flops: 1.0e14,
|
||||
gpu_fp8_flops: 0.0,
|
||||
gpu_fp4_flops: 0.0,
|
||||
gpu_mem_bw: 1.0e12,
|
||||
hbm_bytes: 1.0e9,
|
||||
dram_bytes: 4.0e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32.0e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12.0e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_us: 2.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
},
|
||||
calibration: CalibrationConfig::default(),
|
||||
cluster: ClusterConfig {
|
||||
num_instances: None,
|
||||
buckets: vec![
|
||||
BucketConfig {
|
||||
name: "short".into(),
|
||||
input_length_min: 0,
|
||||
input_length_max: 64,
|
||||
num_instances: 2,
|
||||
},
|
||||
BucketConfig {
|
||||
name: "long".into(),
|
||||
input_length_min: 65,
|
||||
input_length_max: 128,
|
||||
num_instances: 1,
|
||||
},
|
||||
],
|
||||
global_router: GlobalRouterConfig::default(),
|
||||
meta_store: MetaStoreConfig {
|
||||
ttl_seconds: 1000.0,
|
||||
},
|
||||
router: RouterConfig {
|
||||
mode: RouterMode::Random,
|
||||
precise_probe_latency_us: 10.0,
|
||||
precise_probe_topk: 4,
|
||||
load_alpha: 0.1,
|
||||
score_alpha: 1.0,
|
||||
score_beta: 0.1,
|
||||
prefix_k: 8,
|
||||
affinity_fan_out: 0,
|
||||
},
|
||||
},
|
||||
sim: SimConfig {
|
||||
trace_path: "unused.jsonl".into(),
|
||||
max_requests: None,
|
||||
output_dir: out_dir.into(),
|
||||
sample_interval_s: 0.0,
|
||||
seed: 7,
|
||||
input_length_min: None,
|
||||
input_length_max: None,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn num_instances_override_rejects_bucketed_configs() {
|
||||
let mut cfg = bucketed_config(std::env::temp_dir().to_str().unwrap());
|
||||
let overrides = ConfigOverrides {
|
||||
num_instances: Some(8),
|
||||
..ConfigOverrides::default()
|
||||
};
|
||||
|
||||
let err = overrides.apply(&mut cfg).unwrap_err();
|
||||
assert!(err.to_string().contains("--num-instances"));
|
||||
assert!(err.to_string().contains("cluster.buckets"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn auto_instances_rejects_bucketed_configs() {
|
||||
let cfg = bucketed_config(std::env::temp_dir().to_str().unwrap());
|
||||
|
||||
let err = auto_select_instances(&cfg, &[4, 8], RouterMode::Random, 1.0).unwrap_err();
|
||||
assert!(err.to_string().contains("auto-instances"));
|
||||
assert!(err.to_string().contains("cluster.buckets"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn oracle_rejects_bucketed_configs() {
|
||||
let tmp = std::env::temp_dir().join("kvcache_sim_main_tests");
|
||||
let _ = std::fs::remove_dir_all(&tmp);
|
||||
std::fs::create_dir_all(&tmp).unwrap();
|
||||
let path = tmp.join("bucketed.yaml");
|
||||
let cfg = bucketed_config(tmp.to_str().unwrap());
|
||||
std::fs::write(&path, serde_yaml::to_string(&cfg).unwrap()).unwrap();
|
||||
|
||||
let err = cmd_oracle(&path, &ConfigOverrides::default(), None, false, None).unwrap_err();
|
||||
assert!(err.to_string().contains("oracle analysis"));
|
||||
assert!(err.to_string().contains("cluster.buckets"));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -8,7 +8,9 @@ pub struct PerRequestRow {
|
||||
pub arrival: f64,
|
||||
pub ttft: f64,
|
||||
pub e2e: f64,
|
||||
pub bucket: u32,
|
||||
pub instance: u32,
|
||||
pub length_bucket_match: bool,
|
||||
pub total_blocks: u32,
|
||||
pub l0_hit_blocks: u32,
|
||||
pub l1_hit_blocks: u32,
|
||||
|
||||
@@ -12,7 +12,9 @@ pub struct RoutingLogWriter {
|
||||
impl RoutingLogWriter {
|
||||
pub fn create<P: AsRef<Path>>(path: P) -> Result<Self> {
|
||||
let f = File::create(path)?;
|
||||
Ok(Self { inner: BufWriter::new(f) })
|
||||
Ok(Self {
|
||||
inner: BufWriter::new(f),
|
||||
})
|
||||
}
|
||||
|
||||
pub fn write(&mut self, decision: &RouteDecision) -> Result<()> {
|
||||
|
||||
@@ -5,6 +5,7 @@ use std::path::Path;
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct TimeseriesRow {
|
||||
pub t: f64,
|
||||
pub bucket: u32,
|
||||
pub instance: u32,
|
||||
pub queue_len: u32,
|
||||
pub kv_blocks_used: u32,
|
||||
@@ -19,7 +20,9 @@ pub struct TimeseriesWriter {
|
||||
impl TimeseriesWriter {
|
||||
pub fn create<P: AsRef<Path>>(path: P) -> Result<Self> {
|
||||
let f = std::fs::File::create(path)?;
|
||||
Ok(Self { inner: csv::Writer::from_writer(f) })
|
||||
Ok(Self {
|
||||
inner: csv::Writer::from_writer(f),
|
||||
})
|
||||
}
|
||||
|
||||
pub fn write(&mut self, row: &TimeseriesRow) -> Result<()> {
|
||||
|
||||
@@ -5,7 +5,7 @@
|
||||
//! by `bytes / bw`. Latency is added on top of transfer time. This captures
|
||||
//! contention without simulating individual packets.
|
||||
|
||||
use crate::config::HardwareConfig;
|
||||
use crate::config::{CalibrationConfig, HardwareConfig};
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct LinkModel {
|
||||
@@ -53,10 +53,10 @@ pub struct InstanceLinks {
|
||||
}
|
||||
|
||||
impl InstanceLinks {
|
||||
pub fn from_hw(hw: &HardwareConfig) -> Self {
|
||||
pub fn from_hw(hw: &HardwareConfig, calib: &CalibrationConfig) -> Self {
|
||||
Self {
|
||||
pcie: LinkModel::new(hw.pcie_bw, hw.pcie_latency_us * 1e-6),
|
||||
rdma: LinkModel::new(hw.rdma_bw, hw.rdma_latency_us * 1e-6),
|
||||
pcie: LinkModel::new(hw.pcie_bw * calib.pcie_bw_util, hw.pcie_latency_us * 1e-6),
|
||||
rdma: LinkModel::new(hw.rdma_bw * calib.rdma_bw_util, hw.rdma_latency_us * 1e-6),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
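The `from_hw` change above derates each link's nominal bandwidth by a calibration utilization factor before building the link. A minimal sketch of the resulting transfer-time arithmetic, assuming a simplified `Link` type (hypothetical, not the repo's `LinkModel`) and ignoring the contention tracking the real model performs:

```rust
/// Hypothetical, simplified stand-in for the simulator's link model.
/// It only captures "latency + bytes / effective bandwidth".
#[derive(Debug, Clone, Copy)]
struct Link {
    /// Effective bandwidth in bytes per second (nominal * utilization).
    effective_bw: f64,
    /// One-way latency in seconds.
    latency_s: f64,
}

impl Link {
    /// `util` is assumed to be in (0, 1]; latency is given in microseconds,
    /// mirroring the `*_latency_us * 1e-6` conversion in the diff.
    fn new(nominal_bw: f64, util: f64, latency_us: f64) -> Self {
        Self {
            effective_bw: nominal_bw * util,
            latency_s: latency_us * 1e-6,
        }
    }

    /// Time to move `bytes` over an otherwise idle link.
    fn transfer_time_s(&self, bytes: f64) -> f64 {
        self.latency_s + bytes / self.effective_bw
    }
}
```

With a utilization of 0.5, a nominal 32 GB/s PCIe link behaves like a 16 GB/s link, so every transfer takes roughly twice as long as the uncalibrated estimate.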
135
src/oracle.rs
@@ -51,43 +51,75 @@ impl TierResult {
|
||||
capacity_blocks,
|
||||
hits,
|
||||
misses,
|
||||
hit_rate: if total == 0 { 0.0 } else { hits as f64 / total as f64 },
|
||||
hit_rate: if total == 0 {
|
||||
0.0
|
||||
} else {
|
||||
hits as f64 / total as f64
|
||||
},
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub fn analyze(records: &[RequestRecord], capacity_blocks: u64) -> OracleResult {
|
||||
// total / unique counters
|
||||
let total_blocks: u64 = records.iter().map(|r| r.hash_ids.len() as u64).sum();
|
||||
/// Run the oracle analysis over `records`.
|
||||
///
|
||||
/// When `count_mask` is `Some`, **all** records still feed the caches (so the
|
||||
/// cache state reflects the full workload), but only records where
|
||||
/// `count_mask[i]` is `true` contribute to hit / miss / total accounting.
|
||||
/// This is the correct way to answer "what is the theoretical hit-rate for a
|
||||
/// particular input-length bucket within a mixed workload?" — the cache sees
|
||||
/// every request, but the metric only measures the bucket of interest.
|
||||
///
|
||||
/// When `count_mask` is `None`, every record is counted (original behaviour).
|
||||
pub fn analyze(
|
||||
records: &[RequestRecord],
|
||||
capacity_blocks: u64,
|
||||
count_mask: Option<&[bool]>,
|
||||
) -> OracleResult {
|
||||
// Build a default all-true mask when none is supplied.
|
||||
let default_mask;
|
||||
let mask: &[bool] = match count_mask {
|
||||
Some(m) => {
|
||||
assert_eq!(m.len(), records.len());
|
||||
m
|
||||
}
|
||||
None => {
|
||||
default_mask = vec![true; records.len()];
|
||||
&default_mask
|
||||
}
|
||||
};
|
||||
|
||||
// total / unique counters — only for counted records
|
||||
let mut total_blocks: u64 = 0;
|
||||
let mut unique = AHashSet::new();
|
||||
for r in records {
|
||||
let mut num_requests: u64 = 0;
|
||||
for (i, r) in records.iter().enumerate() {
|
||||
if mask[i] {
|
||||
total_blocks += r.hash_ids.len() as u64;
|
||||
num_requests += 1;
|
||||
for &h in &r.hash_ids {
|
||||
unique.insert(h);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 1. Unlimited cache
|
||||
let unlimited_hits = run_unlimited(records);
|
||||
let unlimited = TierResult::from_counts(
|
||||
"unlimited",
|
||||
u64::MAX,
|
||||
unlimited_hits,
|
||||
total_blocks,
|
||||
);
|
||||
let unlimited_hits = run_unlimited(records, mask);
|
||||
let unlimited = TierResult::from_counts("unlimited", u64::MAX, unlimited_hits, total_blocks);
|
||||
|
||||
// 2. Precompute next-use index for Belady
|
||||
// 2. Precompute next-use index for Belady (over ALL records — eviction
|
||||
// decisions must consider future accesses from the full workload)
|
||||
let next_use = build_next_use(records);
|
||||
|
||||
// 3. Belady at the given capacity
|
||||
let belady_hits = run_belady(records, &next_use, capacity_blocks as usize);
|
||||
let belady_hits = run_belady(records, &next_use, capacity_blocks as usize, mask);
|
||||
let belady = TierResult::from_counts("belady", capacity_blocks, belady_hits, total_blocks);
|
||||
|
||||
// 4. LRU baseline at the same capacity
|
||||
let lru_hits = run_lru(records, capacity_blocks as usize);
|
||||
let lru_hits = run_lru(records, capacity_blocks as usize, mask);
|
||||
let lru = TierResult::from_counts("lru", capacity_blocks, lru_hits, total_blocks);
|
||||
|
||||
OracleResult {
|
||||
num_requests: records.len() as u64,
|
||||
num_requests,
|
||||
total_blocks,
|
||||
unique_blocks: unique.len() as u64,
|
||||
unlimited,
|
||||
@@ -96,11 +128,12 @@ pub fn analyze(records: &[RequestRecord], capacity_blocks: u64) -> OracleResult
|
||||
}
|
||||
}
|
||||
|
||||
fn run_unlimited(records: &[RequestRecord]) -> u64 {
|
||||
fn run_unlimited(records: &[RequestRecord], mask: &[bool]) -> u64 {
|
||||
let mut seen: AHashSet<u64> = AHashSet::with_capacity(1 << 18);
|
||||
let mut hits: u64 = 0;
|
||||
for r in records {
|
||||
// Longest prefix match against `seen`
|
||||
for (i, r) in records.iter().enumerate() {
|
||||
// Longest prefix match against `seen` — only count for masked records
|
||||
if mask[i] {
|
||||
for &h in &r.hash_ids {
|
||||
if seen.contains(&h) {
|
||||
hits += 1;
|
||||
@@ -108,6 +141,8 @@ fn run_unlimited(records: &[RequestRecord]) -> u64 {
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
// Always populate the cache so all requests contribute to cache state
|
||||
for &h in &r.hash_ids {
|
||||
seen.insert(h);
|
||||
}
|
||||
@@ -115,15 +150,20 @@ fn run_unlimited(records: &[RequestRecord]) -> u64 {
|
||||
hits
|
||||
}
|
||||
|
||||
fn run_lru(records: &[RequestRecord], capacity: usize) -> u64 {
|
||||
fn run_lru(records: &[RequestRecord], capacity: usize, mask: &[bool]) -> u64 {
|
||||
if capacity == 0 {
|
||||
return 0;
|
||||
}
|
||||
let mut cache = LruBlocks::new(capacity);
|
||||
let mut hits: u64 = 0;
|
||||
let mut evicted = Vec::new();
|
||||
for r in records {
|
||||
hits += cache.longest_prefix(&r.hash_ids) as u64;
|
||||
for (i, r) in records.iter().enumerate() {
|
||||
// Always touch the cache (longest_prefix updates LRU recency) so that
|
||||
// the eviction order reflects the real mixed workload.
|
||||
let prefix_len = cache.longest_prefix(&r.hash_ids) as u64;
|
||||
if mask[i] {
|
||||
hits += prefix_len;
|
||||
}
|
||||
evicted.clear();
|
||||
cache.insert_blocks(&r.hash_ids, &mut evicted);
|
||||
}
|
||||
@@ -156,7 +196,12 @@ fn build_next_use(records: &[RequestRecord]) -> Vec<Vec<u32>> {
|
||||
/// Implementation: lazy-deletion max-heap keyed by next-use index. Each
|
||||
/// cache entry has a version; the heap may contain stale entries from
|
||||
/// previous insertions, which we skip on pop.
|
||||
fn run_belady(records: &[RequestRecord], next_use: &[Vec<u32>], capacity: usize) -> u64 {
|
||||
fn run_belady(
|
||||
records: &[RequestRecord],
|
||||
next_use: &[Vec<u32>],
|
||||
capacity: usize,
|
||||
mask: &[bool],
|
||||
) -> u64 {
|
||||
if capacity == 0 {
|
||||
return 0;
|
||||
}
|
||||
@@ -169,7 +214,8 @@ fn run_belady(records: &[RequestRecord], next_use: &[Vec<u32>], capacity: usize)
|
||||
let mut hits: u64 = 0;
|
||||
|
||||
for (i, r) in records.iter().enumerate() {
|
||||
// 1. Longest-prefix hit accounting against current cache.
|
||||
// 1. Longest-prefix hit accounting — only count for masked records.
|
||||
if mask[i] {
|
||||
for &h in &r.hash_ids {
|
||||
if in_cache.contains_key(&h) {
|
||||
hits += 1;
|
||||
@@ -177,8 +223,10 @@ fn run_belady(records: &[RequestRecord], next_use: &[Vec<u32>], capacity: usize)
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Insert / update each block in the request with its new next-use.
|
||||
// Always executed so the cache reflects the full workload.
|
||||
for (j, &h) in r.hash_ids.iter().enumerate() {
|
||||
let nu = next_use[i][j];
|
||||
if let Some(slot) = in_cache.get_mut(&h) {
|
||||
@@ -222,6 +270,8 @@ mod tests {
|
||||
RequestRecord {
|
||||
req_id: id,
|
||||
chat_id: id as i64,
|
||||
parent_chat_id: -1,
|
||||
turn: 1,
|
||||
arrival: t,
|
||||
input_len: (hashes.len() as u32) * 16,
|
||||
output_len: 16,
|
||||
@@ -236,7 +286,7 @@ mod tests {
|
||||
req(1, 1.0, vec![1, 2, 3, 4]),
|
||||
req(2, 2.0, vec![1, 2, 3, 4, 5]),
|
||||
];
|
||||
let out = analyze(&recs, 100);
|
||||
let out = analyze(&recs, 100, None);
|
||||
// total = 3 + 4 + 5 = 12
|
||||
assert_eq!(out.total_blocks, 12);
|
||||
// unique = {1,2,3,4,5} = 5
|
||||
@@ -255,7 +305,7 @@ mod tests {
|
||||
for (i, &h) in pattern.iter().enumerate() {
|
||||
recs.push(req(i as u64, i as f64, vec![h]));
|
||||
}
|
||||
let out = analyze(&recs, 2);
|
||||
let out = analyze(&recs, 2, None);
|
||||
assert!(
|
||||
out.belady_finite.hits >= out.lru_finite.hits,
|
||||
"belady should be at least as good as lru: belady={} lru={}",
|
||||
@@ -272,8 +322,39 @@ mod tests {
|
||||
req(2, 2.0, vec![60]),
|
||||
req(3, 3.0, vec![10, 20, 30, 40, 50, 60]),
|
||||
];
|
||||
let out = analyze(&recs, 3);
|
||||
let out = analyze(&recs, 3, None);
|
||||
assert!(out.unlimited.hit_rate >= out.belady_finite.hit_rate);
|
||||
assert!(out.belady_finite.hit_rate >= out.lru_finite.hit_rate - 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn count_mask_filters_accounting_not_cache() {
|
||||
// req 0 populates blocks [1,2,3] but is not counted.
|
||||
// req 1 has prefix [1,2,3,4] — the first 3 blocks are cache hits
|
||||
// because req 0 populated them, even though req 0 is masked out.
|
||||
let recs = vec![req(0, 0.0, vec![1, 2, 3]), req(1, 1.0, vec![1, 2, 3, 4])];
|
||||
let mask = vec![false, true];
|
||||
let out = analyze(&recs, 100, Some(&mask));
|
||||
// Only req 1 is counted: total = 4, hits = 3 (prefix [1,2,3] hit)
|
||||
assert_eq!(out.num_requests, 1);
|
||||
assert_eq!(out.total_blocks, 4);
|
||||
assert_eq!(out.unlimited.hits, 3);
|
||||
assert!((out.unlimited.hit_rate - 3.0 / 4.0).abs() < 1e-9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn count_mask_none_matches_all_true() {
|
||||
let recs = vec![
|
||||
req(0, 0.0, vec![1, 2, 3]),
|
||||
req(1, 1.0, vec![1, 2, 3, 4]),
|
||||
req(2, 2.0, vec![1, 2, 3, 4, 5]),
|
||||
];
|
||||
let out_none = analyze(&recs, 10, None);
|
||||
let all_true = vec![true; recs.len()];
|
||||
let out_all = analyze(&recs, 10, Some(&all_true));
|
||||
assert_eq!(out_none.unlimited.hits, out_all.unlimited.hits);
|
||||
assert_eq!(out_none.belady_finite.hits, out_all.belady_finite.hits);
|
||||
assert_eq!(out_none.lru_finite.hits, out_all.lru_finite.hits);
|
||||
assert_eq!(out_none.total_blocks, out_all.total_blocks);
|
||||
}
|
||||
}
|
||||
|
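The count-mask idea in the `src/oracle.rs` changes above (every request populates the cache, but only masked-in requests contribute to hit and total counts) can be sketched for the unlimited-cache case as follows. This is an illustration under simplifying assumptions: requests are plain `Vec<u64>` block-hash lists rather than `RequestRecord`, and the function name `unlimited_hits` is hypothetical.

```rust
use std::collections::HashSet;

/// Returns `(hits, total_blocks)` for the counted requests only.
/// `mask[i] == true` means request `i` is included in the accounting;
/// all requests, counted or not, are inserted into the cache.
fn unlimited_hits(requests: &[Vec<u64>], mask: &[bool]) -> (u64, u64) {
    assert_eq!(requests.len(), mask.len());
    let mut seen: HashSet<u64> = HashSet::new();
    let mut hits = 0u64;
    let mut total = 0u64;
    for (blocks, &counted) in requests.iter().zip(mask) {
        if counted {
            total += blocks.len() as u64;
            // Longest-prefix match: stop at the first block not yet seen.
            hits += blocks.iter().take_while(|h| seen.contains(*h)).count() as u64;
        }
        // Always populate the cache so uncounted requests still warm it.
        seen.extend(blocks.iter().copied());
    }
    (hits, total)
}
```

Calling it with requests `[1, 2, 3]` and `[1, 2, 3, 4]` and the mask `[false, true]` reproduces the situation in `count_mask_filters_accounting_not_cache`: the second request gets three prefix hits out of four blocks even though the first request is excluded from the totals.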
||||
@@ -496,6 +496,8 @@ pub fn replay_fixed_placement(
|
||||
placements: &[PlacementEntry],
|
||||
policy: ReplayEvictPolicy,
|
||||
) -> Result<ReplaySummary> {
|
||||
cfg.cluster
|
||||
.require_legacy_single_pool("fixed-placement replay")?;
|
||||
if records.len() != placements.len() {
|
||||
return Err(anyhow!(
|
||||
"records/placements length mismatch: {} vs {}",
|
||||
@@ -519,7 +521,7 @@ pub fn replay_fixed_placement(
|
||||
let block_bytes = cfg.model.kv_block_bytes() as f64;
|
||||
let l0_cap = (cfg.hardware.hbm_bytes / block_bytes).max(1.0) as usize;
|
||||
let l1_cap = (cfg.hardware.dram_bytes / block_bytes).max(1.0) as usize;
|
||||
let num_instances = cfg.cluster.num_instances as usize;
|
||||
let num_instances = cfg.cluster.total_instances() as usize;
|
||||
let mut caches: Vec<ReplayInstanceCache> = (0..num_instances)
|
||||
.map(|_| ReplayInstanceCache::new(policy, l0_cap, l1_cap))
|
||||
.collect();
|
||||
|
||||
254
src/router/adaptive_affinity.rs
Normal file
@@ -0,0 +1,254 @@
|
//! Adaptive affinity routing for coding-agent workloads.
//!
//! The trace has two distinct regimes:
//!
//! 1. Short root prompts with a tiny shared stem. Their *current-request*
//!    miss cost is small, so pure TTFT minimization tends to scatter them.
//!    That destroys future cache locality for the dominant system prompts.
//! 2. Continuations or already-warm requests with a long local prefix. Their
//!    immediate reuse is already visible, so a first-principles TTFT estimate
//!    is the right objective.
//!
//! This router separates the two:
//!
//! - Keep a lightweight per-prefix heat map over the first few blocks.
//! - Only when a prefix family is both short and hot do we enforce affinity.
//! - The hot prefix is mapped to a deterministic rendezvous-ranked home set,
//!   and the set widens logarithmically as the family gets hotter.
//! - Within that home set we still minimize estimated TTFT, and we fall back
//!   to the global TTFT optimum if the affinity choice is clearly overloaded.
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::config::Config;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
use crate::ttft::{classify_prefix_tiers, TtftModel};
|
||||
|
||||
#[derive(Debug, Clone, Copy, Default)]
|
||||
struct PrefixStat {
|
||||
seen: u16,
|
||||
last_seen: f64,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Copy)]
|
||||
struct CostedCandidate {
|
||||
idx: usize,
|
||||
cost: f64,
|
||||
reusable_blocks: u32,
|
||||
queue_len: u32,
|
||||
rendezvous: u64,
|
||||
}
|
||||
|
||||
pub struct AdaptiveAffinityRouter {
|
||||
ttft_model: TtftModel,
|
||||
fingerprint_k: usize,
|
||||
short_request_blocks: usize,
|
||||
warm_prefix_blocks: u32,
|
||||
hot_threshold: u16,
|
||||
hot_ttl_s: f64,
|
||||
max_fan_out: usize,
|
||||
overload_ratio: f64,
|
||||
overload_abs_s: f64,
|
||||
prefix_stats: HashMap<u64, PrefixStat>,
|
||||
}
|
||||
|
||||
impl AdaptiveAffinityRouter {
|
||||
pub fn new(config: &Config) -> Self {
|
||||
let n = config.cluster.total_instances() as usize;
|
||||
let configured_fan_out = config.cluster.router.affinity_fan_out;
|
||||
let max_fan_out = if configured_fan_out > 0 {
|
||||
configured_fan_out.max(2).min(n)
|
||||
} else {
|
||||
(n / 8).max(8).min(n)
|
||||
};
|
||||
Self {
|
||||
ttft_model: TtftModel::new(
|
||||
&config.hardware,
|
||||
&config.calibration,
|
||||
config.model.kv_block_bytes(),
|
||||
),
|
||||
// Coding-trace reuse is dominated by the system prompt stem.
|
||||
fingerprint_k: config.cluster.router.prefix_k.clamp(1, 4),
|
||||
short_request_blocks: 12,
|
||||
warm_prefix_blocks: 8,
|
||||
hot_threshold: 4,
|
||||
hot_ttl_s: config.cluster.meta_store.ttl_seconds.max(1.0),
|
||||
max_fan_out,
|
||||
overload_ratio: 1.25,
|
||||
overload_abs_s: 0.25,
|
||||
prefix_stats: HashMap::new(),
|
||||
}
|
||||
}
|
||||
|
||||
fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
|
||||
let take = hash_ids.len().min(k).max(1);
|
||||
let mut fp: u64 = 0xcbf29ce484222325;
|
||||
for &h in &hash_ids[..hash_ids.len().min(take)] {
|
||||
fp ^= h;
|
||||
fp = fp.wrapping_mul(0x100000001b3);
|
||||
}
|
||||
fp
|
||||
}
|
||||
|
||||
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
|
||||
let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
|
||||
h = h.wrapping_add(0x9e3779b97f4a7c15);
|
||||
h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
|
||||
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
|
||||
h ^ (h >> 31)
|
||||
}
|
||||
|
||||
fn active_heat(&self, fp: u64, now: f64) -> u16 {
|
||||
self.prefix_stats
|
||||
.get(&fp)
|
||||
.filter(|stat| now - stat.last_seen <= self.hot_ttl_s)
|
||||
.map(|stat| stat.seen)
|
||||
.unwrap_or(0)
|
||||
}
|
||||
|
||||
fn observe(&mut self, fp: u64, now: f64) {
|
||||
let stat = self.prefix_stats.entry(fp).or_default();
|
||||
if now - stat.last_seen > self.hot_ttl_s {
|
||||
stat.seen = 0;
|
||||
}
|
||||
stat.last_seen = now;
|
||||
stat.seen = stat.seen.saturating_add(1);
|
||||
}
|
||||
|
||||
fn fan_out(&self, heat: u16, n: usize) -> usize {
|
||||
if heat < self.hot_threshold {
|
||||
return 2.min(n);
|
||||
}
|
||||
let multiples = (heat / self.hot_threshold).max(1) as u32;
|
||||
let extra = multiples.ilog2() as usize;
|
||||
(2 + extra).min(self.max_fan_out).min(n).max(2)
|
||||
}
|
||||
|
||||
fn candidate_cost(
|
||||
&self,
|
||||
req: &RequestRecord,
|
||||
inst: &Instance,
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
scheduler_s: f64,
|
||||
) -> (f64, u32) {
|
||||
let residency = classify_prefix_tiers(&req.hash_ids, inst, meta, now);
|
||||
let reusable =
|
||||
residency.l0_hit_blocks + residency.l1_hit_blocks + residency.remote_hit_blocks;
|
||||
let miss_tokens = residency.miss_blocks.saturating_mul(inst.block_size_tokens);
|
||||
let kv_prepare = self.ttft_model.kv_prepare_time_s(residency);
|
||||
let cost = inst.estimated_drain_time()
|
||||
+ scheduler_s
|
||||
+ kv_prepare
|
||||
+ inst.compute.prefill_time(miss_tokens)
|
||||
+ self.ttft_model.first_token_tail_s();
|
||||
(cost, reusable)
|
||||
}
|
||||
|
||||
fn better(a: CostedCandidate, b: CostedCandidate) -> bool {
|
||||
a.cost < b.cost
|
||||
|| (a.cost == b.cost && a.queue_len < b.queue_len)
|
||||
|| (a.cost == b.cost
|
||||
&& a.queue_len == b.queue_len
|
||||
&& a.reusable_blocks > b.reusable_blocks)
|
||||
}
|
||||
}
|
||||
|
||||
impl Router for AdaptiveAffinityRouter {
|
||||
fn name(&self) -> &'static str {
|
||||
"adaptive_affinity"
|
||||
}
|
||||
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let scheduler_s = self.ttft_model.scheduler_overhead_s(n, 3);
|
||||
let fp = Self::fingerprint(&req.hash_ids, self.fingerprint_k);
|
||||
let active_heat = self.active_heat(fp, now);
|
||||
|
||||
let mut best_local_l0 = 0u32;
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
let mut scored = Vec::with_capacity(n);
|
||||
|
||||
for (idx, inst) in instances.iter().enumerate() {
|
||||
let l0_hit = inst.cache.l0.longest_prefix_peek(&req.hash_ids) as u32;
|
||||
best_local_l0 = best_local_l0.max(l0_hit);
|
||||
|
||||
let (cost, reusable_blocks) = self.candidate_cost(req, inst, meta, now, scheduler_s);
|
||||
let rendezvous = Self::rendezvous(fp, inst.id);
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
predicted_prefix: reusable_blocks,
|
||||
load_blocks: inst.kv_blocks_used,
|
||||
queue_len: inst.queue_len(),
|
||||
});
|
||||
scored.push(CostedCandidate {
|
||||
idx,
|
||||
cost,
|
||||
reusable_blocks,
|
||||
queue_len: inst.queue_len(),
|
||||
rendezvous,
|
||||
});
|
||||
}
|
||||
|
||||
let mut global_best = scored[0];
|
||||
for cand in scored.iter().copied().skip(1) {
|
||||
if Self::better(cand, global_best) {
|
||||
global_best = cand;
|
||||
}
|
||||
}
|
||||
|
||||
let should_affinitize = req.hash_ids.len() <= self.short_request_blocks
|
||||
&& best_local_l0 <= self.warm_prefix_blocks
|
||||
&& active_heat.saturating_add(1) >= self.hot_threshold;
|
||||
|
||||
let (chosen_idx, reason) = if should_affinitize {
|
||||
let fan_out = self.fan_out(active_heat.saturating_add(1), n);
|
||||
scored.sort_unstable_by(|a, b| b.rendezvous.cmp(&a.rendezvous));
|
||||
|
||||
let mut home_best = scored[0];
|
||||
for cand in scored.iter().copied().take(fan_out).skip(1) {
|
||||
let better = Self::better(cand, home_best)
|
||||
|| (cand.cost == home_best.cost
|
||||
&& cand.queue_len == home_best.queue_len
|
||||
&& cand.reusable_blocks == home_best.reusable_blocks
|
||||
&& cand.rendezvous > home_best.rendezvous);
|
||||
if better {
|
||||
home_best = cand;
|
||||
}
|
||||
}
|
||||
|
||||
let home_cost_ok =
|
||||
home_best.cost <= global_best.cost * self.overload_ratio + self.overload_abs_s;
|
||||
if home_cost_ok {
|
||||
(home_best.idx, "adaptive affinity: hot short prefix homeset")
|
||||
} else {
|
||||
(
|
||||
global_best.idx,
|
||||
"adaptive affinity fallback: global min estimated ttft",
|
||||
)
|
||||
}
|
||||
} else {
|
||||
(global_best.idx, "global min estimated ttft")
|
||||
};
|
||||
|
||||
self.observe(fp, now);
|
||||
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"adaptive_affinity",
|
||||
instances[chosen_idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
reason,
|
||||
)
|
||||
}
|
||||
}
|
||||
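The rendezvous-ranked home set and its logarithmic widening, as used by `AdaptiveAffinityRouter` above, can be sketched in isolation. The mixing function and the fan-out arithmetic follow the diff; the free-function form, the `home_set` helper, and passing `hot_threshold` and `max_fan_out` as parameters are assumptions made for illustration. `hot_threshold` is assumed to be at least 1.

```rust
/// Splitmix64-style score for a (fingerprint, instance) pair; higher wins.
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
    let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
    h = h.wrapping_add(0x9e3779b97f4a7c15);
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// Home-set size: 2 while the prefix is cold, then one extra home per
/// doubling of `heat / hot_threshold`, capped by `max_fan_out` and `n`.
fn fan_out(heat: u16, hot_threshold: u16, max_fan_out: usize, n: usize) -> usize {
    if heat < hot_threshold {
        return n.min(2);
    }
    let multiples = (heat / hot_threshold).max(1) as u32;
    let extra = multiples.ilog2() as usize;
    (2 + extra).min(max_fan_out).min(n).max(2)
}

/// The `k` instances with the highest rendezvous score for this prefix.
/// (Hypothetical helper; the router sorts its scored candidates in place.)
fn home_set(fp: u64, instance_ids: &[u32], k: usize) -> Vec<u32> {
    let mut ranked = instance_ids.to_vec();
    ranked.sort_unstable_by(|a, b| rendezvous(fp, *b).cmp(&rendezvous(fp, *a)));
    ranked.truncate(k);
    ranked
}
```

Because each instance's score depends only on the fingerprint and its own id, a smaller home set is always a prefix of a larger one, and removing an instance that is not in the home set leaves the set unchanged. That stability is what lets repeat traffic for a hot prefix keep landing on the same few instances as the set widens.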
217
src/router/cache_affinity.rs
Normal file
@@ -0,0 +1,217 @@
|
//! Cache-affinity routing tuned for coding-agent workloads.
//!
//! Motivation — the coding trace has three dominant patterns:
//!
//! 1. **Short system-prompt-only requests** (≤10 blocks): novel per-chat but
//!    sharing a small set of system prompts across millions of invocations.
//! 2. **Long multi-turn chains**: parent→child prefixes share ~60+ blocks
//!    and grow by ~6 blocks per turn. Sticking the chain to one instance
//!    maximises L0 hits for every subsequent turn.
//! 3. **Completely novel one-shots**: no existing cache anywhere; should be
//!    placed to maximise *future* reuse, not just minimise current load.
//!
//! `cache_score` minimises `α·queue_len + β·miss_blocks`. With the shipping
//! defaults (α=1, β=0.1) a single extra queue position is worth ten extra
//! miss blocks, so short novel requests — the bulk of traffic — reduce to
//! pure least-loaded routing and scatter the same system prompt across
//! dozens of instances. Each scattered copy burns HBM that could have held a
//! different hot prefix, depressing the cluster-wide L0 hit-rate.
//!
//! `cache_affinity` fixes this with two changes:
//!
//! * **Strong cache weight** — cost is `α·queue_len − γ·l0_hit`, with
//!   γ ≫ α·input_blocks, so any real L0 hit beats load-balancing. A soft
//!   bonus (`δ·meta_only_hit`) still rewards instances that have the prefix
//!   in L1/DRAM even when L0 is empty.
//!
//! * **Deterministic rendezvous tiebreak** — among instances that tie on
//!   `(cost, hit, queue)`, we rank by `rendezvous(fingerprint, instance_id)`
//!   where `fingerprint` is an FNV hash of the first few block hashes. This
//!   turns cold routing from "first-found" (which piles on instance 0 until
//!   it fills, then spills sequentially) into a consistent hash that maps
//!   each distinct prefix to the *same* small set of homes. Repeat traffic
//!   for that prefix therefore concentrates on its home, building a strong
//!   L0 working set.
//!
//! Overload protection: if the rendezvous-chosen home already has
//! `queue_len > overload_threshold`, the load term dominates and the router
//! naturally spills to the next-best instance.
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub struct CacheAffinityRouter {
|
||||
/// Router display / trace name.
|
||||
name: &'static str,
|
||||
/// Weight on queue length (per queued request).
|
||||
load_alpha: f64,
|
||||
/// Reward per L0-hit block (real, locally cached).
|
||||
l0_gamma: f64,
|
||||
/// Reward per block present via meta-store but not in L0 (L1 / remote).
|
||||
meta_delta: f64,
|
||||
/// Number of leading block hashes folded into the prefix fingerprint.
|
||||
fingerprint_k: usize,
|
||||
/// Whether to break ties by rendezvous hash (sticky consistent placement)
|
||||
/// or by first-found order (matches cache_score behaviour).
|
||||
use_rendezvous: bool,
|
||||
}
|
||||
|
||||
impl CacheAffinityRouter {
|
||||
pub fn new(load_alpha: f64, fingerprint_k: usize) -> Self {
|
||||
Self {
|
||||
name: "cache_affinity",
|
||||
load_alpha,
|
||||
l0_gamma: 1.0,
|
||||
meta_delta: 0.25,
|
||||
fingerprint_k: fingerprint_k.max(1),
|
||||
use_rendezvous: true,
|
||||
}
|
||||
}
|
||||
|
||||
/// Ablation: cache_score-style weights (γ=0.1, δ=0) but keep rendezvous
|
||||
/// tiebreak. Isolates the contribution of deterministic sticky placement.
|
||||
pub fn weak_with_rendezvous(load_alpha: f64, fingerprint_k: usize) -> Self {
|
||||
Self {
|
||||
name: "cache_affinity_weak_rend",
|
||||
load_alpha,
|
||||
l0_gamma: 0.1,
|
||||
meta_delta: 0.0,
|
||||
fingerprint_k: fingerprint_k.max(1),
|
||||
use_rendezvous: true,
|
||||
}
|
||||
}
|
||||
|
||||
/// Ablation: strong cache weights (γ=1.0, δ=0.25) but first-found tiebreak
|
||||
/// instead of rendezvous. Isolates the contribution of reweighting alone.
|
||||
pub fn strong_no_rendezvous(load_alpha: f64, fingerprint_k: usize) -> Self {
|
||||
Self {
|
||||
name: "cache_affinity_strong_only",
|
||||
load_alpha,
|
||||
l0_gamma: 1.0,
|
||||
meta_delta: 0.25,
|
||||
fingerprint_k: fingerprint_k.max(1),
|
||||
use_rendezvous: false,
|
||||
}
|
||||
}
|
||||
|
||||
/// FNV-1a over the first `k` block hashes — identifies the prefix family
|
||||
/// (system-prompt + early agent context) that drives cache reuse.
|
||||
fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
|
||||
let n = hash_ids.len().min(k).max(1);
|
||||
let take = hash_ids.len().min(n);
|
||||
let mut fp: u64 = 0xcbf29ce484222325;
|
||||
for &h in &hash_ids[..take] {
|
||||
fp ^= h;
|
||||
fp = fp.wrapping_mul(0x100000001b3);
|
||||
}
|
||||
if take == 0 {
|
||||
// Empty request: no-op. The fingerprint stays at the FNV offset basis, which is still deterministic.
|
||||
fp ^= 0;
|
||||
}
|
||||
fp
|
||||
}
|
||||
|
||||
/// Splitmix64-style rendezvous score for (fingerprint, instance_id).
|
||||
/// Uniform over u64; higher = preferred home.
|
||||
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
|
||||
let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
|
||||
h = h.wrapping_add(0x9e3779b97f4a7c15);
|
||||
h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
|
||||
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
|
||||
h ^ (h >> 31)
|
||||
}
|
||||
}
|
||||
|
||||
impl Router for CacheAffinityRouter {
|
||||
fn name(&self) -> &'static str {
|
||||
self.name
|
||||
}
|
||||
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let l0 = local_l0_scores(req, instances);
|
||||
// Meta-store predicted prefix — includes L1/remote-reachable blocks.
|
||||
let meta_scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let fp = Self::fingerprint(&req.hash_ids, self.fingerprint_k);
|
||||
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
let mut best_idx: usize = 0;
|
||||
let mut best_cost = f64::INFINITY;
|
||||
let mut best_hit = 0u32;
|
||||
let mut best_queue = u32::MAX;
|
||||
let mut best_rend: u64 = 0;
|
||||
|
||||
for (i, inst) in instances.iter().enumerate() {
|
||||
let hit = l0[i];
|
||||
// meta_only = extra blocks reachable by RDMA/L1 beyond L0 hit.
|
||||
let meta_only = meta_scores[i].saturating_sub(hit);
|
||||
let q = inst.queue_len();
|
||||
|
||||
// Cost to minimise — lower is better.
|
||||
// load term: α · queue_len
|
||||
// cache term: − γ · l0_hit − δ · meta_only
|
||||
// Short novel prefixes yield hit=0 on every instance, so cost
|
||||
// reduces to α·q and the rendezvous tiebreak picks the home.
|
||||
let cost = self.load_alpha * q as f64
|
||||
- self.l0_gamma * hit as f64
|
||||
- self.meta_delta * meta_only as f64;
|
||||
let rend = Self::rendezvous(fp, inst.id);
|
||||
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
predicted_prefix: hit,
|
||||
load_blocks: inst.kv_blocks_used,
|
||||
queue_len: q,
|
||||
});
|
||||
|
||||
// Tiebreak chain (descending preference):
|
||||
// 1. lowest cost
|
||||
// 2. highest hit (break cost ties toward real L0 work)
|
||||
// 3. lowest queue
|
||||
// 4. highest rendezvous (deterministic sticky home), optional
|
||||
let better = if cost < best_cost {
|
||||
true
|
||||
} else if cost > best_cost {
|
||||
false
|
||||
} else if hit > best_hit {
|
||||
true
|
||||
} else if hit < best_hit {
|
||||
false
|
||||
} else if q < best_queue {
|
||||
true
|
||||
} else if q > best_queue {
|
||||
false
|
||||
} else if self.use_rendezvous {
|
||||
rend > best_rend
|
||||
} else {
|
||||
// First-found wins on full tie (matches cache_score behaviour).
|
||||
false
|
||||
};
|
||||
|
||||
if better {
|
||||
best_cost = cost;
|
||||
best_hit = hit;
|
||||
best_queue = q;
|
||||
best_rend = rend;
|
||||
best_idx = i;
|
||||
}
|
||||
}
|
||||
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
self.name, // report the ablation variant's own name, not always "cache_affinity"
|
||||
instances[best_idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
"argmin(α·q − γ·l0_hit − δ·meta_only) + rendezvous tiebreak",
|
||||
)
|
||||
}
|
||||
}
|
||||
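The fingerprint and rendezvous tiebreak above can be tried in isolation. The following is a minimal sketch that assumes only the standard library; the free functions `fingerprint`, `rendezvous`, and `sticky_home` are illustrative names rather than part of the crate's API, and the constants are copied from the diff.

```rust
/// FNV-1a-style fold over the first `k` block hashes. It mixes whole u64
/// words rather than bytes, mirroring the router code in the diff.
pub fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
    // Never read more than `k` (at least 1) hashes; 0 for an empty request.
    let take = hash_ids.len().min(k.max(1));
    let mut fp: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for &h in &hash_ids[..take] {
        fp ^= h;
        fp = fp.wrapping_mul(0x100000001b3); // FNV 64-bit prime
    }
    fp
}

/// Splitmix64-style mix of (fingerprint, instance_id); higher = preferred.
pub fn rendezvous(fp: u64, instance_id: u32) -> u64 {
    let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
    h = h.wrapping_add(0x9e3779b97f4a7c15);
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// Hypothetical helper: among instances that are otherwise tied, pick the
/// one with the highest rendezvous score for this prefix family.
pub fn sticky_home(fp: u64, instance_ids: &[u32]) -> Option<u32> {
    instance_ids
        .iter()
        .copied()
        .max_by_key(|&id| rendezvous(fp, id))
}
```

Because the score depends only on the fingerprint and the instance id, two requests that share their first `k` block hashes map to the same home regardless of the order in which instances are scanned.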
@@ -13,7 +13,7 @@
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub struct CacheLoadRouter;
|
||||
@@ -39,11 +39,11 @@ impl Router for CacheLoadRouter {
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
_meta: &MetaStore,
|
||||
_now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let scores = local_l0_scores(req, instances);
|
||||
|
||||
// Step 1: least-loaded 1/4 of instances (by queue_len).
|
||||
let pool_size = (n / 4).max(2).min(n);
|
||||
@@ -77,13 +77,13 @@ impl Router for CacheLoadRouter {
|
||||
});
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "cache_load",
|
||||
chosen: instances[best_idx].id,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"cache_load",
|
||||
instances[best_idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
reason: "least-loaded 1/4, then best prefix",
|
||||
}
|
||||
"least-loaded 1/4, then best local L0 prefix",
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -32,7 +32,7 @@
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub struct CacheScoreRouter {
|
||||
@@ -55,11 +55,11 @@ impl Router for CacheScoreRouter {
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
_meta: &MetaStore,
|
||||
_now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let scores = local_l0_scores(req, instances);
|
||||
let input_blocks = req.hash_ids.len() as f64;
|
||||
|
||||
let mut best_idx: usize = 0;
|
||||
@@ -99,13 +99,13 @@ impl Router for CacheScoreRouter {
|
||||
}
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "cache_score",
|
||||
chosen: instances[best_idx].id,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"cache_score",
|
||||
instances[best_idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
reason: "argmin 2^(α·load + β·miss)",
|
||||
}
|
||||
"argmin 2^(α·load + β·miss)",
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
86
src/router/cache_score_ttl.rs
Normal file
@@ -0,0 +1,86 @@
|
||||
//! Cache-score routing using TTL meta-store prefix predictions.
|
||||
//!
|
||||
//! This keeps the same scoring rule as `cache_score`:
|
||||
//!
|
||||
//! ```text
|
||||
//! exponent_i = alpha * queue_len_i + beta * miss_i
|
||||
//! ```
|
||||
//!
|
||||
//! The only difference is that `miss_i` is computed from the global TTL
|
||||
//! meta-store prefix score instead of the real local-L0 prefix.
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub struct CacheScoreTtlRouter {
|
||||
alpha: f64,
|
||||
beta: f64,
|
||||
}
|
||||
|
||||
impl CacheScoreTtlRouter {
|
||||
pub fn new(alpha: f64, beta: f64) -> Self {
|
||||
Self { alpha, beta }
|
||||
}
|
||||
}
|
||||
|
||||
impl Router for CacheScoreTtlRouter {
|
||||
fn name(&self) -> &'static str {
|
||||
"cache_score_ttl"
|
||||
}
|
||||
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let input_blocks = req.hash_ids.len() as f64;
|
||||
|
||||
let mut best_idx: usize = 0;
|
||||
let mut best_exp = f64::INFINITY;
|
||||
let mut best_queue = u32::MAX;
|
||||
let mut best_prefix = 0u32;
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
|
||||
for (i, inst) in instances.iter().enumerate() {
|
||||
let prefix = scores[i] as f64;
|
||||
let miss = (input_blocks - prefix).max(0.0);
|
||||
let q = inst.queue_len() as f64;
|
||||
let exponent = self.alpha * q + self.beta * miss;
|
||||
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
predicted_prefix: scores[i],
|
||||
load_blocks: inst.kv_blocks_used,
|
||||
queue_len: inst.queue_len(),
|
||||
});
|
||||
|
||||
let better = exponent < best_exp
|
||||
|| (exponent == best_exp && inst.queue_len() < best_queue)
|
||||
|| (exponent == best_exp
|
||||
&& inst.queue_len() == best_queue
|
||||
&& scores[i] > best_prefix);
|
||||
|
||||
if better {
|
||||
best_exp = exponent;
|
||||
best_idx = i;
|
||||
best_queue = inst.queue_len();
|
||||
best_prefix = scores[i];
|
||||
}
|
||||
}
|
||||
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"cache_score_ttl",
|
||||
instances[best_idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
"argmin 2^(alpha*load + beta*meta_store_miss)",
|
||||
)
|
||||
}
|
||||
}
|
||||
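The scoring rule and tie-break order of `cache_score_ttl` can be illustrated without the simulator types. This is a sketch under stated assumptions: `Candidate` and `pick` are hypothetical names, and `predicted_prefix` stands in for the per-instance meta-store score. Since `2^x` is monotonic, comparing the exponent directly selects the same instance as the `argmin 2^(...)` in the reason string.

```rust
/// Hypothetical stand-in for one instance's routing inputs.
#[derive(Debug, Clone, Copy)]
pub struct Candidate {
    pub queue_len: u32,
    pub predicted_prefix: u32,
}

/// Returns the index minimising alpha * queue_len + beta * miss, breaking
/// ties by lower queue length, then by longer predicted prefix.
pub fn pick(cands: &[Candidate], input_blocks: u32, alpha: f64, beta: f64) -> Option<usize> {
    // (index, exponent, queue_len, predicted_prefix) of the current best.
    let mut best: Option<(usize, f64, u32, u32)> = None;
    for (i, c) in cands.iter().enumerate() {
        let miss = input_blocks.saturating_sub(c.predicted_prefix) as f64;
        let exponent = alpha * c.queue_len as f64 + beta * miss;
        let better = match best {
            None => true,
            Some((_, e, q, p)) => {
                exponent < e
                    || (exponent == e && c.queue_len < q)
                    || (exponent == e && c.queue_len == q && c.predicted_prefix > p)
            }
        };
        if better {
            best = Some((i, exponent, c.queue_len, c.predicted_prefix));
        }
    }
    best.map(|(i, ..)| i)
}
```

With `alpha = beta = 1` and a 3-block request, an idle cold instance (exponent 3) and a fully warm instance with three queued requests (exponent 3) tie, and the queue-length tie-break favours the idle one; raising `beta` shifts the choice to the warm instance.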
@@ -1,57 +1,31 @@
|
||||
//! First-principles TTFT-optimal routing.
|
||||
//! First-principles TTFT-estimate routing with calibrated compute and
|
||||
//! tier-aware KV prepare costs.
|
||||
//!
|
||||
//! Estimates the actual time-to-first-token for each candidate instance:
|
||||
//!
|
||||
//! `TTFT(r,i) = drain(i) + fetch(r,i) + prefill(miss)`
|
||||
//!
|
||||
//! - **drain** — exact queue drain time: sum of per-request `prefill_time()`
|
||||
//! using the architecture-aware compute model (quadratic / DSA).
|
||||
//!
|
||||
//! - **fetch** — RDMA fetch time for blocks cached elsewhere in the cluster
|
||||
//! but not on instance `i` locally.
|
||||
//!
|
||||
//! - **prefill** — compute for cluster-wide cache-miss tokens (constant
|
||||
//! across instances, cancels in the argmin).
|
||||
//!
|
||||
//! The router minimises `drain(i) + fetch(r,i)`, with ties broken by
|
||||
//! lowest `queue_len` then most local cache. The fetch overlap with queue
|
||||
//! drain is handled by keeping the additive form: this gives double
|
||||
//! incentive to prefer instances with local cache, which empirically
|
||||
//! outperforms the `max(drain, fetch)` alternative because even small
|
||||
//! RDMA savings compound across thousands of routing decisions.
|
||||
//! `TTFT(r,i) = drain(i) + scheduler + kv_prepare(r,i) + prefill(miss_i) + first_token_tail`
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::config::Config;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
use crate::ttft::{classify_prefix_tiers, TtftModel};
|
||||
|
||||
pub struct EstimatedTtftRouter {
|
||||
/// Bytes per KV block (for RDMA cost estimation).
|
||||
kv_block_bytes: f64,
|
||||
/// RDMA bandwidth in bytes/s.
|
||||
rdma_bw: f64,
|
||||
/// RDMA per-transfer latency in seconds.
|
||||
rdma_latency_s: f64,
|
||||
ttft_model: TtftModel,
|
||||
}
|
||||
|
||||
impl EstimatedTtftRouter {
|
||||
pub fn new(config: &Config) -> Self {
|
||||
Self {
|
||||
kv_block_bytes: config.model.kv_block_bytes() as f64,
|
||||
rdma_bw: config.hardware.rdma_bw,
|
||||
rdma_latency_s: config.hardware.rdma_latency_us * 1e-6,
|
||||
ttft_model: TtftModel::new(
|
||||
&config.hardware,
|
||||
&config.calibration,
|
||||
config.model.kv_block_bytes(),
|
||||
),
|
||||
}
|
||||
}
|
||||
|
||||
/// Estimate RDMA fetch time for `remote_blocks` blocks.
|
||||
fn fetch_time(&self, remote_blocks: u32) -> f64 {
|
||||
if remote_blocks == 0 {
|
||||
return 0.0;
|
||||
}
|
||||
let bytes = remote_blocks as f64 * self.kv_block_bytes;
|
||||
bytes / self.rdma_bw + self.rdma_latency_s
|
||||
}
|
||||
}
|
||||
|
||||
impl Router for EstimatedTtftRouter {
|
||||
@@ -66,63 +40,62 @@ impl Router for EstimatedTtftRouter {
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
) -> RouteDecision {
|
||||
let scheduler = self.ttft_model.scheduler_overhead_s(instances.len(), 3);
|
||||
let n = instances.len();
|
||||
let scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
|
||||
// Cluster-wide max prefix: blocks reachable via RDMA from any peer.
|
||||
let cluster_prefix = scores.iter().copied().max().unwrap_or(0);
|
||||
|
||||
let mut best: u32 = 0;
|
||||
let mut best_cost = f64::INFINITY;
|
||||
let mut best_queue = u32::MAX;
|
||||
let mut best_local = 0u32;
|
||||
let mut best_reuse = 0u32;
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
|
||||
for inst in instances {
|
||||
let i = inst.id as usize;
|
||||
let local_prefix = scores[i];
|
||||
let residency = classify_prefix_tiers(&req.hash_ids, inst, meta, now);
|
||||
|
||||
// 1. Exact queue drain time (architecture-aware, per-request sum).
|
||||
let drain = inst.estimated_drain_time();
|
||||
|
||||
// 2. RDMA fetch cost for blocks not locally cached.
|
||||
let remote_blocks = cluster_prefix.saturating_sub(local_prefix);
|
||||
let fetch = self.fetch_time(remote_blocks);
|
||||
|
||||
// Additive cost: drain + fetch.
|
||||
// The additive form gives explicit incentive to prefer local cache
|
||||
// (lower fetch) even when the queue is non-empty, which reduces
|
||||
// total RDMA traffic and improves TTFT in aggregate.
|
||||
let cost = drain + fetch;
|
||||
let miss_tokens = residency.miss_blocks.saturating_mul(inst.block_size_tokens);
|
||||
let kv_prepare = self.ttft_model.kv_prepare_time_s(residency);
|
||||
let first_token_tail = self.ttft_model.first_token_tail_s();
|
||||
let cost = drain
|
||||
+ scheduler
|
||||
+ kv_prepare
|
||||
+ inst.compute.prefill_time(miss_tokens)
|
||||
+ first_token_tail;
|
||||
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
predicted_prefix: local_prefix,
|
||||
predicted_prefix: residency.l0_hit_blocks
|
||||
+ residency.l1_hit_blocks
|
||||
+ residency.remote_hit_blocks,
|
||||
load_blocks: inst.kv_blocks_used,
|
||||
queue_len: inst.queue_len(),
|
||||
});
|
||||
|
||||
// Minimise (cost, queue_len, -reusable blocks across L0/L1/remote).
|
||||
let ql = inst.queue_len();
|
||||
let reusable =
|
||||
residency.l0_hit_blocks + residency.l1_hit_blocks + residency.remote_hit_blocks;
|
||||
let better = cost < best_cost
|
||||
|| (cost == best_cost && ql < best_queue)
|
||||
|| (cost == best_cost && ql == best_queue && local_prefix > best_local);
|
||||
|| (cost == best_cost && ql == best_queue && reusable > best_reuse);
|
||||
|
||||
if better {
|
||||
best_cost = cost;
|
||||
best = inst.id;
|
||||
best_queue = ql;
|
||||
best_local = local_prefix;
|
||||
best_reuse = reusable;
|
||||
}
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "estimated_ttft",
|
||||
chosen: best,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"estimated_ttft",
|
||||
best,
|
||||
0.0,
|
||||
candidates,
|
||||
reason: "argmin(drain_time + fetch_time)",
|
||||
}
|
||||
"argmin(drain + scheduler + kv_prepare + prefill + first_token_tail)",
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
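The additive TTFT estimate in the doc comment above can be sketched with plain numbers. `TtftTerms` and `argmin_ttft` are hypothetical names introduced for illustration; the real router derives each term from `TtftModel` and the instance state, and breaks exact ties by queue length and reusable blocks, which this sketch omits.

```rust
/// Per-candidate TTFT terms in seconds. Field names follow the formula
/// `drain + scheduler + kv_prepare + prefill + first_token_tail`.
#[derive(Debug, Clone, Copy)]
pub struct TtftTerms {
    pub drain_s: f64,
    pub scheduler_s: f64,
    pub kv_prepare_s: f64,
    pub prefill_s: f64,
    pub first_token_tail_s: f64,
}

impl TtftTerms {
    /// Sum of all terms: the estimated time to first token.
    pub fn total_s(&self) -> f64 {
        self.drain_s
            + self.scheduler_s
            + self.kv_prepare_s
            + self.prefill_s
            + self.first_token_tail_s
    }
}

/// Index of the candidate with the smallest estimated TTFT.
/// On exact ties the first candidate wins (a simplification).
pub fn argmin_ttft(cands: &[TtftTerms]) -> Option<usize> {
    cands
        .iter()
        .enumerate()
        .min_by(|a, b| a.1.total_s().total_cmp(&b.1.total_s()))
        .map(|(i, _)| i)
}
```

Unlike the earlier `drain + fetch` cost, the prefill term here depends on the candidate's own miss count, so a busy instance with a warm cache can still beat an idle instance that has to recompute most of the prompt.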
371
src/router/global_bucket.rs
Normal file
@@ -0,0 +1,371 @@
|
||||
use anyhow::{anyhow, Result};
|
||||
use serde::Serialize;
|
||||
|
||||
use crate::config::{Config, GlobalRouterMode};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub type BucketId = u32;
|
||||
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct BucketView {
|
||||
pub id: BucketId,
|
||||
pub input_length_min: u32,
|
||||
pub input_length_max: u32,
|
||||
pub num_instances: u32,
|
||||
pub total_queue_len: u32,
|
||||
pub total_load_blocks: u32,
|
||||
pub predicted_prefix: u32,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct BucketCandidate {
|
||||
pub bucket: BucketId,
|
||||
pub input_length_min: u32,
|
||||
pub input_length_max: u32,
|
||||
pub num_instances: u32,
|
||||
pub total_queue_len: u32,
|
||||
pub total_load_blocks: u32,
|
||||
pub predicted_prefix: u32,
|
||||
pub matches_input_len: bool,
|
||||
pub score: f64,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct GlobalRouteDecision {
|
||||
pub req_id: u64,
|
||||
pub mode: &'static str,
|
||||
pub chosen_bucket: BucketId,
|
||||
pub candidates: Vec<BucketCandidate>,
|
||||
pub reason: &'static str,
|
||||
}
|
||||
|
||||
impl GlobalRouteDecision {
|
||||
pub fn single_bucket(req_id: u64, chosen_bucket: BucketId) -> Self {
|
||||
Self {
|
||||
req_id,
|
||||
mode: "single_pool",
|
||||
chosen_bucket,
|
||||
candidates: Vec::new(),
|
||||
reason: "single pool uses bucket 0",
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub trait GlobalRouter: Send {
|
||||
fn name(&self) -> &'static str;
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
buckets: &[BucketView],
|
||||
now: f64,
|
||||
) -> Result<GlobalRouteDecision>;
|
||||
}
|
||||
|
||||
struct StrictInputLengthRouter {
|
||||
reported_mode: &'static str,
|
||||
reason: &'static str,
|
||||
}
|
||||
|
||||
impl StrictInputLengthRouter {
|
||||
fn new(reported_mode: &'static str, reason: &'static str) -> Self {
|
||||
Self {
|
||||
reported_mode,
|
||||
reason,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl GlobalRouter for StrictInputLengthRouter {
|
||||
fn name(&self) -> &'static str {
|
||||
self.reported_mode
|
||||
}
|
||||
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
buckets: &[BucketView],
|
||||
_now: f64,
|
||||
) -> Result<GlobalRouteDecision> {
|
||||
let candidates = buckets
|
||||
.iter()
|
||||
.map(|view| BucketCandidate {
|
||||
bucket: view.id,
|
||||
input_length_min: view.input_length_min,
|
||||
input_length_max: view.input_length_max,
|
||||
num_instances: view.num_instances,
|
||||
total_queue_len: view.total_queue_len,
|
||||
total_load_blocks: view.total_load_blocks,
|
||||
predicted_prefix: view.predicted_prefix,
|
||||
matches_input_len: view.input_length_min <= req.input_len
|
||||
&& req.input_len <= view.input_length_max,
|
||||
score: if view.input_length_min <= req.input_len
|
||||
&& req.input_len <= view.input_length_max
|
||||
{
|
||||
0.0
|
||||
} else {
|
||||
f64::INFINITY
|
||||
},
|
||||
})
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
let matches = candidates
|
||||
.iter()
|
||||
.filter(|candidate| candidate.matches_input_len)
|
||||
.map(|candidate| candidate.bucket)
|
||||
.collect::<Vec<_>>();
|
||||
|
||||
let chosen_bucket = match matches.as_slice() {
|
||||
[bucket] => *bucket,
|
||||
[] => {
|
||||
return Err(anyhow!(
|
||||
"cluster.global_router.mode={} has no bucket for input_length={}",
|
||||
self.reported_mode,
|
||||
req.input_len
|
||||
));
|
||||
}
|
||||
_ => {
|
||||
return Err(anyhow!(
|
||||
"cluster.global_router.mode={} matched multiple buckets for input_length={}",
|
||||
self.reported_mode,
|
||||
req.input_len
|
||||
));
|
||||
}
|
||||
};
|
||||
|
||||
Ok(GlobalRouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: self.reported_mode,
|
||||
chosen_bucket,
|
||||
candidates,
|
||||
reason: self.reason,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
struct BucketScoreRouter {
|
||||
length_penalty_weight: f64,
|
||||
load_weight: f64,
|
||||
cache_weight: f64,
|
||||
}
|
||||
|
||||
impl BucketScoreRouter {
|
||||
fn new(full: &Config) -> Self {
|
||||
Self {
|
||||
length_penalty_weight: full.cluster.global_router.length_penalty_weight,
|
||||
load_weight: full.cluster.global_router.load_weight,
|
||||
cache_weight: full.cluster.global_router.cache_weight,
|
||||
}
|
||||
}
|
||||
|
||||
fn length_penalty(&self, req: &RequestRecord, bucket: &BucketView) -> f64 {
|
||||
if req.input_len < bucket.input_length_min {
|
||||
(bucket.input_length_min - req.input_len) as f64
|
||||
} else if req.input_len > bucket.input_length_max {
|
||||
(req.input_len - bucket.input_length_max) as f64
|
||||
} else {
|
||||
0.0
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl GlobalRouter for BucketScoreRouter {
|
||||
fn name(&self) -> &'static str {
|
||||
"bucket_score"
|
||||
}
|
||||
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
buckets: &[BucketView],
|
||||
_now: f64,
|
||||
) -> Result<GlobalRouteDecision> {
|
||||
let mut chosen_bucket = None;
|
||||
let mut best_score = f64::INFINITY;
|
||||
let mut candidates = Vec::with_capacity(buckets.len());
|
||||
|
||||
for bucket in buckets {
|
||||
let length_penalty = self.length_penalty(req, bucket);
|
||||
let miss = req
|
||||
.hash_ids
|
||||
.len()
|
||||
.saturating_sub(bucket.predicted_prefix as usize) as f64;
|
||||
let score = self.length_penalty_weight * length_penalty
|
||||
+ self.load_weight * bucket.total_queue_len as f64
|
||||
+ self.cache_weight * miss;
|
||||
|
||||
candidates.push(BucketCandidate {
|
||||
bucket: bucket.id,
|
||||
input_length_min: bucket.input_length_min,
|
||||
input_length_max: bucket.input_length_max,
|
||||
num_instances: bucket.num_instances,
|
||||
total_queue_len: bucket.total_queue_len,
|
||||
total_load_blocks: bucket.total_load_blocks,
|
||||
predicted_prefix: bucket.predicted_prefix,
|
||||
matches_input_len: bucket.input_length_min <= req.input_len
|
||||
&& req.input_len <= bucket.input_length_max,
|
||||
score,
|
||||
});
|
||||
|
||||
let better = score < best_score
|
||||
|| (score == best_score && chosen_bucket.is_none_or(|best| bucket.id < best));
|
||||
if better {
|
||||
best_score = score;
|
||||
chosen_bucket = Some(bucket.id);
|
||||
}
|
||||
}
|
||||
|
||||
Ok(GlobalRouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: self.name(),
|
||||
chosen_bucket: chosen_bucket.ok_or_else(|| anyhow!("no buckets available"))?,
|
||||
candidates,
|
||||
reason: "weighted length/load/cache bucket score",
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
pub fn build(full: &Config) -> Box<dyn GlobalRouter> {
|
||||
match full.cluster.global_router.mode {
|
||||
GlobalRouterMode::StrictInputLength => Box::new(StrictInputLengthRouter::new(
|
||||
"strict_input_length",
|
||||
"unique bucket range contains input_length",
|
||||
)) as Box<dyn GlobalRouter>,
|
||||
GlobalRouterMode::BucketScore => {
|
||||
Box::new(BucketScoreRouter::new(full)) as Box<dyn GlobalRouter>
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::config::{
|
||||
ClusterConfig, GlobalRouterConfig, MetaStoreConfig, RouterConfig, RouterMode,
|
||||
};
|
||||
|
||||
fn cfg() -> Config {
|
||||
Config {
|
||||
model: crate::config::ModelConfig::default(),
|
||||
hardware: crate::config::HardwareConfig {
|
||||
gpu_flops: 1.0,
|
||||
gpu_fp8_flops: 0.0,
|
||||
gpu_fp4_flops: 0.0,
|
||||
gpu_mem_bw: 1.0,
|
||||
hbm_bytes: 1.0,
|
||||
dram_bytes: 1.0,
|
||||
host_dram_bw: 1.0,
|
||||
pcie_bw: 1.0,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 1.0,
|
||||
rdma_latency_us: 1.0,
|
||||
intra_node_tp_bw: 1.0,
|
||||
intra_node_tp_latency_us: 1.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 1,
|
||||
prefill_chunk_tokens: 1,
|
||||
},
|
||||
calibration: crate::config::CalibrationConfig::default(),
|
||||
cluster: ClusterConfig {
|
||||
num_instances: None,
|
||||
buckets: Vec::new(),
|
||||
global_router: GlobalRouterConfig {
|
||||
mode: GlobalRouterMode::BucketScore,
|
||||
length_penalty_weight: 1.0,
|
||||
load_weight: 1.0,
|
||||
cache_weight: 1.0,
|
||||
},
|
||||
meta_store: MetaStoreConfig { ttl_seconds: 1.0 },
|
||||
router: RouterConfig {
|
||||
mode: RouterMode::LeastLoaded,
|
||||
precise_probe_latency_us: 1.0,
|
||||
precise_probe_topk: 1,
|
||||
load_alpha: 1.0,
|
||||
score_alpha: 1.0,
|
||||
score_beta: 1.0,
|
||||
prefix_k: 8,
|
||||
affinity_fan_out: 1,
|
||||
},
|
||||
},
|
||||
sim: crate::config::SimConfig {
|
||||
trace_path: String::new(),
|
||||
max_requests: None,
|
||||
output_dir: String::new(),
|
||||
sample_interval_s: 0.0,
|
||||
seed: 0,
|
||||
input_length_min: None,
|
||||
input_length_max: None,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
fn req(input_len: u32) -> RequestRecord {
|
||||
RequestRecord {
|
||||
req_id: 1,
|
||||
chat_id: 0,
|
||||
parent_chat_id: -1,
|
||||
turn: 0,
|
||||
arrival: 0.0,
|
||||
input_len,
|
||||
output_len: 16,
|
||||
hash_ids: vec![10, 11, 12],
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn bucket_score_prefers_matching_bucket_when_load_is_equal() {
|
||||
let mut router = BucketScoreRouter::new(&cfg());
|
||||
let buckets = vec![
|
||||
BucketView {
|
||||
id: 0,
|
||||
input_length_min: 0,
|
||||
input_length_max: 32,
|
||||
num_instances: 2,
|
||||
total_queue_len: 1,
|
||||
total_load_blocks: 0,
|
||||
predicted_prefix: 0,
|
||||
},
|
||||
BucketView {
|
||||
id: 1,
|
||||
input_length_min: 33,
|
||||
input_length_max: 96,
|
||||
num_instances: 2,
|
||||
total_queue_len: 1,
|
||||
total_load_blocks: 0,
|
||||
predicted_prefix: 0,
|
||||
},
|
||||
];
|
||||
let decision = router.route(&req(24), &buckets, 0.0).unwrap();
|
||||
assert_eq!(decision.chosen_bucket, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn bucket_score_can_override_length_match_when_load_gap_is_large() {
|
||||
let mut full = cfg();
|
||||
full.cluster.global_router.load_weight = 5.0;
|
||||
full.cluster.global_router.cache_weight = 1.0;
|
||||
full.cluster.global_router.length_penalty_weight = 1.0;
|
||||
let mut router = BucketScoreRouter::new(&full);
|
||||
let buckets = vec![
|
||||
BucketView {
|
||||
id: 0,
|
||||
input_length_min: 0,
|
||||
input_length_max: 32,
|
||||
num_instances: 2,
|
||||
total_queue_len: 20,
|
||||
total_load_blocks: 0,
|
||||
predicted_prefix: 0,
|
||||
},
|
||||
BucketView {
|
||||
id: 1,
|
||||
input_length_min: 33,
|
||||
input_length_max: 96,
|
||||
num_instances: 2,
|
||||
total_queue_len: 0,
|
||||
total_load_blocks: 0,
|
||||
predicted_prefix: 2,
|
||||
},
|
||||
];
|
||||
let decision = router.route(&req(24), &buckets, 0.0).unwrap();
|
||||
assert_eq!(decision.chosen_bucket, 1);
|
||||
}
|
||||
}
|
||||
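The two `bucket_score` tests above can be checked by hand. The sketch below recomputes the score with simplified, hypothetical types (`Bucket`, `Weights`, and the free functions are not the crate's API) using the same formula: `length_penalty_weight * length_penalty + load_weight * total_queue_len + cache_weight * miss`.

```rust
/// Hypothetical, trimmed-down view of a bucket.
pub struct Bucket {
    pub input_length_min: u32,
    pub input_length_max: u32,
    pub total_queue_len: u32,
    pub predicted_prefix: u32,
}

/// Hypothetical container for the three global-router weights.
pub struct Weights {
    pub length_penalty: f64,
    pub load: f64,
    pub cache: f64,
}

/// Distance in tokens from the request length to the bucket's range.
pub fn length_penalty(input_len: u32, b: &Bucket) -> f64 {
    if input_len < b.input_length_min {
        (b.input_length_min - input_len) as f64
    } else if input_len > b.input_length_max {
        (input_len - b.input_length_max) as f64
    } else {
        0.0
    }
}

/// Weighted length/load/cache score; lower is better.
pub fn bucket_score(input_len: u32, num_blocks: usize, b: &Bucket, w: &Weights) -> f64 {
    let miss = num_blocks.saturating_sub(b.predicted_prefix as usize) as f64;
    w.length_penalty * length_penalty(input_len, b)
        + w.load * b.total_queue_len as f64
        + w.cache * miss
}
```

For the second test (request length 24, three blocks, `load_weight = 5`), bucket 0 scores 0 + 5·20 + 3 = 103 while bucket 1 scores 9 + 0 + 1 = 10, which is why the non-matching bucket wins.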
@@ -29,8 +29,7 @@ impl Router for LeastLoadedRouter {
|
||||
let mut best_score = f64::INFINITY;
|
||||
let mut candidates = Vec::with_capacity(instances.len());
|
||||
for inst in instances {
|
||||
let load = inst.kv_blocks_used as f64
|
||||
+ self.alpha * inst.queue_len() as f64;
|
||||
let load = inst.kv_blocks_used as f64 + self.alpha * inst.queue_len() as f64;
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
predicted_prefix: 0,
|
||||
@@ -42,13 +41,13 @@ impl Router for LeastLoadedRouter {
|
||||
best = inst.id;
|
||||
}
|
||||
}
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "least_loaded",
|
||||
chosen: best,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"least_loaded",
|
||||
best,
|
||||
0.0,
|
||||
candidates,
|
||||
reason: "argmin(kv_used + alpha * queue_len)",
|
||||
}
|
||||
"argmin(kv_used + alpha * queue_len)",
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -61,13 +61,13 @@ impl Router for LeastTokensRouter {
|
||||
}
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "least_tokens",
|
||||
chosen: best,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"least_tokens",
|
||||
best,
|
||||
0.0,
|
||||
candidates,
|
||||
reason: "argmin(waiting_prefill_tokens)",
|
||||
}
|
||||
"argmin(waiting_prefill_tokens)",
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
243
src/router/lineage_affinity.rs
Normal file
@@ -0,0 +1,243 @@
|
||||
//! Lineage-aware reuse routing for agentic coding workloads.
|
||||
//!
|
||||
//! Workload hypothesis:
|
||||
//! - turn-1 requests are diverse but recur by prefix family and benefit from
|
||||
//! deterministic home placement instead of diffusion across the cluster;
|
||||
//! - child requests usually extend the immediately preceding request and
|
||||
//! should stay on the parent's instance whenever that instance is not
|
||||
//! clearly overloaded.
|
||||
//!
|
||||
//! The router therefore uses three modes:
|
||||
//! - strong local cache scoring for already-warm requests;
|
||||
//! - parent stickiness for continuations with a known parent placement;
|
||||
//! - family homesets (rendezvous-ranked top-K) for cold / weakly-warm
|
||||
//! requests, with a global fallback if the homeset is substantially worse.
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::config::Config;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
#[derive(Debug, Clone, Copy, Default)]
|
||||
struct FamilyStat {
|
||||
seen: u16,
|
||||
last_seen: f64,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Copy)]
|
||||
struct CandidateCost {
|
||||
idx: usize,
|
||||
cost: f64,
|
||||
l0_hit: u32,
|
||||
meta_only: u32,
|
||||
queue_len: u32,
|
||||
rendezvous: u64,
|
||||
}
|
||||
|
||||
pub struct LineageAffinityRouter {
|
||||
load_alpha: f64,
|
||||
l0_gamma: f64,
|
||||
meta_delta: f64,
|
||||
fingerprint_k: usize,
|
||||
warm_prefix_blocks: u32,
|
||||
hot_ttl_s: f64,
|
||||
max_fan_out: usize,
|
||||
parent_cost_slack: f64,
|
||||
homeset_cost_slack: f64,
|
||||
family_stats: HashMap<u64, FamilyStat>,
|
||||
request_home: HashMap<i64, u32>,
|
||||
}
|
||||
|
||||
impl LineageAffinityRouter {
|
||||
pub fn new(config: &Config) -> Self {
|
||||
let n = config.cluster.total_instances() as usize;
|
||||
let configured_fan_out = config.cluster.router.affinity_fan_out;
|
||||
let max_fan_out = if configured_fan_out > 0 {
|
||||
configured_fan_out.max(2).min(n)
|
||||
} else {
|
||||
(n / 8).max(8).min(n)
|
||||
};
|
||||
Self {
|
||||
load_alpha: config.cluster.router.load_alpha.max(1.0),
|
||||
l0_gamma: 1.0,
|
||||
meta_delta: 0.25,
|
||||
fingerprint_k: config.cluster.router.prefix_k.clamp(2, 8),
|
||||
warm_prefix_blocks: 12,
|
||||
hot_ttl_s: config.cluster.meta_store.ttl_seconds.max(1.0),
|
||||
max_fan_out,
|
||||
// Measured in the same score units as α·queue − γ·hit.
|
||||
parent_cost_slack: 6.0,
|
||||
homeset_cost_slack: 2.0,
|
||||
family_stats: HashMap::new(),
|
||||
request_home: HashMap::new(),
|
||||
}
|
||||
}
|
||||
|
||||
fn fingerprint(hash_ids: &[u64], k: usize) -> u64 {
|
||||
let take = hash_ids.len().min(k); // 0 for an empty request; forcing 1 would panic on the slice below
|
||||
let mut fp: u64 = 0xcbf29ce484222325;
|
||||
for &h in &hash_ids[..take] {
|
||||
fp ^= h;
|
||||
fp = fp.wrapping_mul(0x100000001b3);
|
||||
}
|
||||
fp
|
||||
}
|
||||
|
||||
fn rendezvous(fp: u64, instance_id: u32) -> u64 {
|
||||
let mut h = fp ^ (instance_id as u64).wrapping_mul(0x9e3779b97f4a7c15);
|
||||
h = h.wrapping_add(0x9e3779b97f4a7c15);
|
||||
h = (h ^ (h >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
|
||||
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
|
||||
h ^ (h >> 31)
|
||||
}
|
||||
|
||||
fn active_heat(&self, fp: u64, now: f64) -> u16 {
|
||||
self.family_stats
|
||||
.get(&fp)
|
||||
.filter(|stat| now - stat.last_seen <= self.hot_ttl_s)
|
||||
.map(|stat| stat.seen)
|
||||
.unwrap_or(0)
|
||||
}
|
||||
|
||||
fn observe(&mut self, fp: u64, now: f64) {
|
||||
let stat = self.family_stats.entry(fp).or_default();
|
||||
if now - stat.last_seen > self.hot_ttl_s {
|
||||
stat.seen = 0;
|
||||
}
|
||||
stat.last_seen = now;
|
||||
stat.seen = stat.seen.saturating_add(1);
|
||||
}
|
||||
|
||||
fn fan_out(&self, heat: u16, n: usize) -> usize {
|
||||
let base = 2usize;
|
||||
let extra = match heat {
|
||||
0..=1 => 0,
|
||||
2..=3 => 1,
|
||||
4..=7 => 2,
|
||||
8..=15 => 3,
|
||||
_ => 4,
|
||||
};
|
||||
(base + extra).min(self.max_fan_out).min(n).max(2)
|
||||
}
|
||||
|
||||
fn better(a: CandidateCost, b: CandidateCost) -> bool {
|
||||
a.cost < b.cost
|
||||
|| (a.cost == b.cost && a.l0_hit > b.l0_hit)
|
||||
|| (a.cost == b.cost && a.l0_hit == b.l0_hit && a.queue_len < b.queue_len)
|
||||
|| (a.cost == b.cost
|
||||
&& a.l0_hit == b.l0_hit
|
||||
&& a.queue_len == b.queue_len
|
||||
&& a.meta_only > b.meta_only)
|
||||
}
|
||||
}
|
||||
|
||||
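The family-heat bookkeeping and the heat-to-fan-out mapping above can be exercised on their own. This is a minimal sketch using only the standard library; `Heat` is a hypothetical wrapper around the `family_stats` map, and `fan_out` is written as a free function with `max_fan_out` passed explicitly.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, Default)]
struct FamilyStat {
    seen: u16,
    last_seen: f64,
}

/// Hypothetical wrapper: counts recent sightings per prefix family and
/// forgets a family once it has been idle for longer than `ttl_s`.
pub struct Heat {
    ttl_s: f64,
    stats: HashMap<u64, FamilyStat>,
}

impl Heat {
    pub fn new(ttl_s: f64) -> Self {
        Self { ttl_s, stats: HashMap::new() }
    }

    /// Record one request for family `fp` at time `now`.
    pub fn observe(&mut self, fp: u64, now: f64) {
        let stat = self.stats.entry(fp).or_default();
        if now - stat.last_seen > self.ttl_s {
            stat.seen = 0; // stale family: restart the count
        }
        stat.last_seen = now;
        stat.seen = stat.seen.saturating_add(1);
    }

    /// Sightings still considered hot at `now`; 0 if expired or unknown.
    pub fn active(&self, fp: u64, now: f64) -> u16 {
        self.stats
            .get(&fp)
            .filter(|s| now - s.last_seen <= self.ttl_s)
            .map(|s| s.seen)
            .unwrap_or(0)
    }
}

/// Homeset size: 2 plus a step function of heat, capped by the configured
/// maximum and the cluster size, with a floor of 2 (as in the diff).
pub fn fan_out(heat: u16, max_fan_out: usize, n: usize) -> usize {
    let extra = match heat {
        0..=1 => 0,
        2..=3 => 1,
        4..=7 => 2,
        8..=15 => 3,
        _ => 4,
    };
    (2 + extra).min(max_fan_out).min(n).max(2)
}
```

One property worth noting: because the final `.max(2)` is applied after `.min(n)`, a single-instance cluster still yields a fan-out of 2, so any caller that takes the top-K instances presumably has to tolerate K exceeding `n`.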
impl Router for LineageAffinityRouter {
|
||||
fn name(&self) -> &'static str {
|
||||
"lineage_affinity"
|
||||
}
|
||||
|
||||
fn route(
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let l0 = local_l0_scores(req, instances);
|
||||
let meta_scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let family_fp = Self::fingerprint(&req.hash_ids, self.fingerprint_k);
|
||||
let family_heat = self.active_heat(family_fp, now).saturating_add(1);
|
||||
let parent_home = if req.parent_chat_id >= 0 {
|
||||
self.request_home.get(&req.parent_chat_id).copied()
|
||||
} else {
|
||||
None
|
||||
};
|
||||
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
let mut scored = Vec::with_capacity(n);
|
||||
let mut best_local_l0 = 0u32;
|
||||
|
||||
for (idx, inst) in instances.iter().enumerate() {
|
||||
let l0_hit = l0[idx];
|
||||
best_local_l0 = best_local_l0.max(l0_hit);
|
||||
let meta_only = meta_scores[idx].saturating_sub(l0_hit);
|
||||
let queue_len = inst.queue_len();
|
||||
let cost = self.load_alpha * queue_len as f64
|
||||
- self.l0_gamma * l0_hit as f64
|
||||
- self.meta_delta * meta_only as f64;
|
||||
let rend = Self::rendezvous(family_fp, inst.id);
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
predicted_prefix: l0_hit,
|
||||
load_blocks: inst.kv_blocks_used,
|
||||
queue_len,
|
||||
});
|
||||
scored.push(CandidateCost {
|
||||
idx,
|
||||
cost,
|
||||
l0_hit,
|
||||
meta_only,
|
||||
queue_len,
|
||||
rendezvous: rend,
|
||||
});
|
||||
}
|
||||
|
||||
let mut global_best = scored[0];
|
||||
for cand in scored.iter().copied().skip(1) {
|
||||
if Self::better(cand, global_best) {
|
||||
global_best = cand;
|
||||
}
|
||||
}
|
||||
|
||||
let mut chosen = global_best;
|
||||
let reason = if let Some(parent_inst) = parent_home {
|
||||
let parent = scored[parent_inst as usize];
|
||||
if parent.cost <= global_best.cost + self.parent_cost_slack {
|
||||
chosen = parent;
|
||||
"lineage affinity: parent stickiness"
|
||||
} else {
|
||||
"lineage affinity: parent overloaded, global best"
|
||||
}
|
||||
} else if best_local_l0 < self.warm_prefix_blocks {
|
||||
let fan_out = self.fan_out(family_heat, n);
|
||||
scored.sort_unstable_by(|a, b| b.rendezvous.cmp(&a.rendezvous));
|
||||
let mut home_best = scored[0];
|
||||
for cand in scored.iter().copied().take(fan_out).skip(1) {
|
||||
let better = Self::better(cand, home_best)
|
||||
|| (cand.cost == home_best.cost
|
||||
&& cand.l0_hit == home_best.l0_hit
|
||||
&& cand.queue_len == home_best.queue_len
|
||||
&& cand.meta_only == home_best.meta_only
|
||||
&& cand.rendezvous > home_best.rendezvous);
|
||||
if better {
|
||||
home_best = cand;
|
||||
}
|
||||
}
|
||||
if home_best.cost <= global_best.cost + self.homeset_cost_slack {
|
||||
chosen = home_best;
|
||||
"lineage affinity: family homeset"
|
||||
} else {
|
||||
"lineage affinity: homeset overloaded, global best"
|
||||
}
|
||||
} else {
|
||||
"lineage affinity: warm request global locality"
|
||||
};
|
||||
|
||||
self.observe(family_fp, now);
|
||||
self.request_home
|
||||
.insert(req.chat_id, instances[chosen.idx].id);
|
||||
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"lineage_affinity",
|
||||
instances[chosen.idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
reason,
|
||||
)
|
||||
}
|
||||
}
|
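The routing logic above reduces to a linear cost per candidate, a lexicographic tie-break, and a slack rule that lets a preferred instance (the parent's home or the family home set) win when it is close enough to the global best. A minimal sketch of that decision rule follows. The `Candidate` struct and `pick` function are illustrative names, not the router's types, and `pick` assumes each candidate's `idx` equals its position in the slice.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Candidate {
    idx: usize,
    cost: f64,
    l0_hit: u32,
    meta_only: u32,
    queue_len: u32,
}

/// cost = alpha * queue - gamma * l0_hit - delta * meta_only (lower is better).
fn cost(alpha: f64, gamma: f64, delta: f64, queue_len: u32, l0_hit: u32, meta_only: u32) -> f64 {
    alpha * queue_len as f64 - gamma * l0_hit as f64 - delta * meta_only as f64
}

/// Same ordering as `better` in the diff: cost, then more L0 hits,
/// then shorter queue, then more meta-only blocks.
fn better(a: &Candidate, b: &Candidate) -> bool {
    if a.cost != b.cost {
        return a.cost < b.cost;
    }
    if a.l0_hit != b.l0_hit {
        return a.l0_hit > b.l0_hit;
    }
    if a.queue_len != b.queue_len {
        return a.queue_len < b.queue_len;
    }
    a.meta_only > b.meta_only
}

/// Pick `preferred` if its cost is within `slack` of the global best,
/// otherwise fall back to the global best. Panics on an empty slice.
fn pick(scored: &[Candidate], preferred: Option<usize>, slack: f64) -> usize {
    let mut best = scored[0];
    for c in &scored[1..] {
        if better(c, &best) {
            best = *c;
        }
    }
    match preferred {
        Some(p) if scored[p].cost <= best.cost + slack => p,
        _ => best.idx,
    }
}
```

A larger slack trades queueing delay for locality: the preferred instance is kept even when it is somewhat more loaded than the best alternative.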
@@ -1,4 +1,4 @@
-//! Minimum P*D routing.
+//! Minimum P*D routing using real local L0 hits only.
 //!
 //! For each instance compute:
 //! - `P` = real prefill tokens this request will do if routed there
@@ -7,30 +7,18 @@
 //!
 //! Score = `P * D`, pick the instance that minimizes it.
 //!
-//! `P` accounts for the **actual** prefill work after the cluster fetch
-//! chain runs: the fetch chain serves any block cached anywhere in the
-//! cluster (L0 → L1 → remote v6d via RDMA), so prefill compute only runs
-//! for blocks that are absent cluster-wide *and* for blocks past the
-//! instance-local prefix (the cluster only fetches a contiguous leading
-//! prefix — any gap ends the fetch chain and the rest must be prefilled).
+//! `P` accounts only for blocks that miss in the candidate instance's
+//! current L0 cache. L1 / remote reuse may still reduce execution-time
+//! work later in the cluster fetch chain, but they do not count as
+//! `kvcache hit` for routing.
 //!
 //! Concretely, for instance `i`:
 //!
 //! ```text
-//! local_prefix_i = meta_store.score_prefix(req, now)[i] // blocks
-//! cluster_prefix = max over all j of meta_store_score[j] // blocks
-//! effective_prefix_i = min(cluster_prefix, input_blocks)
-//! - if local_prefix_i == cluster_prefix the fetch chain stays local,
-//! - otherwise the prefill still skips cluster_prefix blocks because
-//! the missing tail is fetched via RDMA from a peer.
-//! P_i = (input_blocks - effective_prefix_i) * block_size_tokens
+//! local_prefix_i = longest L0 prefix on instance i // blocks
+//! P_i = (input_blocks - local_prefix_i) * block_size_tokens
 //! ```
 //!
-//! This makes `P` nearly instance-independent on well-populated clusters
-//! (so `min_pd` degenerates to balanced load with a cache-affinity
-//! tiebreak), which is exactly what you want when RDMA is cheap relative
-//! to prefill compute.
-//!
 //! Tiebreaks (essential on 128-instance clusters where many instances are
 //! idle and the raw product collapses to zero):
 //! 1. minimum `P*D`
||||
@@ -40,7 +28,7 @@
|
||||
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub struct MinPdRouter;
|
||||
@@ -66,36 +54,26 @@ impl Router for MinPdRouter {
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
_meta: &MetaStore,
|
||||
_now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let scores = local_l0_scores(req, instances);
|
||||
let block_size = instances[0].block_size_tokens as u64;
|
||||
let input_blocks = req.hash_ids.len() as u64;
|
||||
|
||||
// Cluster-wide max prefix: longest contiguous prefix that EXISTS
|
||||
// somewhere in the cluster (and will be fetched via remote RDMA if
|
||||
// not local). This determines the effective prefill work for every
|
||||
// candidate, not just the one that owns the blocks.
|
||||
let cluster_prefix_blocks = scores.iter().copied().max().unwrap_or(0) as u64;
|
||||
let effective_prefix_blocks = cluster_prefix_blocks.min(input_blocks);
|
||||
let miss_blocks = input_blocks.saturating_sub(effective_prefix_blocks);
|
||||
let p_base = miss_blocks.saturating_mul(block_size); // tokens to prefill
|
||||
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
let mut best: u32 = instances[0].id;
|
||||
// Minimize (P*D, D, -local_prefix).
|
||||
// P is nearly instance-independent; D is the real discriminator.
|
||||
// When tied on D, prefer the instance with the best local prefix
|
||||
// (avoids the RDMA fetch cost).
|
||||
let mut best_key: (u128, u64, i64) = (u128::MAX, u64::MAX, i64::MAX);
|
||||
|
||||
for inst in instances {
|
||||
let i = inst.id as usize;
|
||||
let d = inst.queue_len() as u64;
|
||||
let pd = p_base as u128 * d as u128;
|
||||
let local_prefix = scores[i] as i64;
|
||||
let miss_blocks = input_blocks.saturating_sub(scores[i] as u64);
|
||||
let p = miss_blocks.saturating_mul(block_size);
|
||||
let pd = p as u128 * d as u128;
|
||||
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
@@ -112,13 +90,13 @@ impl Router for MinPdRouter {
|
||||
}
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "min_pd",
|
||||
chosen: best,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"min_pd",
|
||||
best,
|
||||
0.0,
|
||||
candidates,
|
||||
reason: "argmin(P*D), P=cluster-wide miss tokens, D=ongoing reqs",
|
||||
}
|
||||
"argmin(P*D), P=local-L0 miss tokens, D=ongoing reqs",
|
||||
)
|
||||
}
|
||||
}
|
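After this change `P` depends only on each candidate's own L0 prefix, so the score differs per instance. The sketch below illustrates that scoring under stated assumptions: L0 is modelled as a plain `HashSet<u64>` (the real `longest_prefix_peek` implementation is not shown in the diff), `Inst` is a stand-in struct, and the tie-break key `(P*D, D, -local_prefix)` follows the comment in the hunk above.

```rust
use std::collections::HashSet;

/// Stand-in for an instance: the blocks resident in L0 and the ongoing request count.
struct Inst {
    l0: HashSet<u64>,
    queue_len: u64,
}

/// Number of leading blocks of `hash_ids` present in `l0` (stops at the first miss).
fn longest_prefix(l0: &HashSet<u64>, hash_ids: &[u64]) -> u64 {
    hash_ids.iter().take_while(|h| l0.contains(*h)).count() as u64
}

/// Index of the instance minimising (P*D, D, -local_prefix), where
/// P = tokens that miss in that instance's L0 and D = its queue length.
fn min_pd(hash_ids: &[u64], insts: &[Inst], block_size_tokens: u64) -> usize {
    let input_blocks = hash_ids.len() as u64;
    let mut best = 0usize;
    let mut best_key = (u128::MAX, u64::MAX, i64::MAX);
    for (i, inst) in insts.iter().enumerate() {
        let local = longest_prefix(&inst.l0, hash_ids);
        let p = input_blocks
            .saturating_sub(local)
            .saturating_mul(block_size_tokens);
        let d = inst.queue_len;
        let key = (p as u128 * d as u128, d, -(local as i64));
        if key < best_key {
            best_key = key;
            best = i;
        }
    }
    best
}
```

When every instance is idle the product is zero everywhere, so the third key component decides and the instance with the longest local prefix wins.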
||||
|
||||
@@ -1,10 +1,15 @@
|
||||
//! Cluster-level routing strategies.
|
||||
|
||||
pub mod adaptive_affinity;
|
||||
pub mod cache_affinity;
|
||||
pub mod cache_load;
|
||||
pub mod cache_score;
|
||||
pub mod cache_score_ttl;
|
||||
pub mod estimated_ttft;
|
||||
pub mod global_bucket;
|
||||
pub mod least_loaded;
|
||||
pub mod least_tokens;
|
||||
pub mod lineage_affinity;
|
||||
pub mod min_pd;
|
||||
pub mod precise_aware;
|
||||
pub mod prefix_affinity;
|
||||
@@ -19,6 +24,8 @@ use crate::instance::Instance;
|
||||
use crate::trace::RequestRecord;
|
||||
use crate::types::InstanceId;
|
||||
|
||||
pub use global_bucket::{BucketCandidate, BucketId, BucketView, GlobalRouteDecision, GlobalRouter};
|
||||
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct CandidateInfo {
|
||||
pub instance: InstanceId,
|
||||
@@ -30,11 +37,25 @@ pub struct CandidateInfo {
|
||||
#[derive(Debug, Clone, Serialize)]
|
||||
pub struct RouteDecision {
|
||||
pub req_id: u64,
|
||||
pub global_mode: &'static str,
|
||||
pub mode: &'static str,
|
||||
pub global_reason: &'static str,
|
||||
pub local_reason: &'static str,
|
||||
pub chosen_bucket: BucketId,
|
||||
pub chosen: InstanceId,
|
||||
pub probe_overhead_s: f64,
|
||||
pub bucket_candidates: Vec<BucketCandidate>,
|
||||
pub candidates: Vec<CandidateInfo>,
|
||||
pub reason: &'static str,
|
||||
}
|
||||
|
||||
impl RouteDecision {
|
||||
pub fn with_global(mut self, decision: &GlobalRouteDecision) -> Self {
|
||||
self.global_mode = decision.mode;
|
||||
self.global_reason = decision.reason;
|
||||
self.chosen_bucket = decision.chosen_bucket;
|
||||
self.bucket_candidates = decision.candidates.clone();
|
||||
self
|
||||
}
|
||||
}
|
||||
|
||||
pub trait Router: Send {
|
||||
@@ -48,6 +69,39 @@ pub trait Router: Send {
|
||||
) -> RouteDecision;
|
||||
}
|
||||
|
||||
pub(crate) fn local_l0_prefix(req: &RequestRecord, inst: &Instance) -> u32 {
|
||||
inst.cache.l0.longest_prefix_peek(&req.hash_ids) as u32
|
||||
}
|
||||
|
||||
pub(crate) fn local_l0_scores(req: &RequestRecord, instances: &[Instance]) -> Vec<u32> {
|
||||
instances
|
||||
.iter()
|
||||
.map(|inst| local_l0_prefix(req, inst))
|
||||
.collect()
|
||||
}
|
||||
|
||||
pub fn local_route_decision(
|
||||
req_id: u64,
|
||||
mode: &'static str,
|
||||
chosen: InstanceId,
|
||||
probe_overhead_s: f64,
|
||||
candidates: Vec<CandidateInfo>,
|
||||
reason: &'static str,
|
||||
) -> RouteDecision {
|
||||
RouteDecision {
|
||||
req_id,
|
||||
global_mode: "single_pool",
|
||||
mode,
|
||||
global_reason: "single pool uses bucket 0",
|
||||
local_reason: reason,
|
||||
chosen_bucket: 0,
|
||||
chosen,
|
||||
probe_overhead_s,
|
||||
bucket_candidates: Vec::new(),
|
||||
candidates,
|
||||
}
|
||||
}
|
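`local_route_decision` fills the bucket-level fields with single-pool defaults, and `with_global` overwrites them once a global router has picked a bucket. A reduced sketch of that two-step construction is shown here. The structs are trimmed to a few fields, `BucketId` is assumed to be a `u32` (its real definition is not in this hunk), and the `"length_bucket"` mode string in the usage note is hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Decision {
    global_mode: &'static str,
    mode: &'static str,
    global_reason: &'static str,
    local_reason: &'static str,
    chosen_bucket: u32, // assumption: bucket ids are small integers
    chosen: u32,
}

/// Trimmed stand-in for `GlobalRouteDecision`.
struct GlobalDecision {
    mode: &'static str,
    reason: &'static str,
    chosen_bucket: u32,
}

/// Local routers build a decision with single-pool defaults for the global fields.
fn local_decision(mode: &'static str, chosen: u32, reason: &'static str) -> Decision {
    Decision {
        global_mode: "single_pool",
        mode,
        global_reason: "single pool uses bucket 0",
        local_reason: reason,
        chosen_bucket: 0,
        chosen,
    }
}

impl Decision {
    /// Overlay the bucket-level choice; the local fields are left untouched.
    fn with_global(mut self, g: &GlobalDecision) -> Self {
        self.global_mode = g.mode;
        self.global_reason = g.reason;
        self.chosen_bucket = g.chosen_bucket;
        self
    }
}
```

A caller with a bucketed cluster would chain the two, for example `local_decision("min_pd", 3, "argmin(P*D)").with_global(&g)`, where `g` carries a mode such as `"length_bucket"`.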
||||
|
||||
pub fn build(full: &Config, seed: u64) -> Box<dyn Router> {
|
||||
use crate::config::RouterMode::*;
|
||||
let cfg = &full.cluster.router;
|
||||
@@ -66,15 +120,300 @@ pub fn build(full: &Config, seed: u64) -> Box<dyn Router> {
|
||||
MinPd => Box::new(min_pd::MinPdRouter::new()) as Box<dyn Router>,
|
||||
LeastTokens => Box::new(least_tokens::LeastTokensRouter::new()) as Box<dyn Router>,
|
||||
CacheLoad => Box::new(cache_load::CacheLoadRouter::new()) as Box<dyn Router>,
|
||||
CacheScore => {
|
||||
Box::new(cache_score::CacheScoreRouter::new(cfg.score_alpha, cfg.score_beta))
|
||||
as Box<dyn Router>
|
||||
CacheAffinity => Box::new(cache_affinity::CacheAffinityRouter::new(
|
||||
cfg.load_alpha,
|
||||
cfg.prefix_k,
|
||||
)) as Box<dyn Router>,
|
||||
CacheAffinityWeakRend => Box::new(
|
||||
cache_affinity::CacheAffinityRouter::weak_with_rendezvous(cfg.load_alpha, cfg.prefix_k),
|
||||
) as Box<dyn Router>,
|
||||
CacheAffinityStrongOnly => Box::new(
|
||||
cache_affinity::CacheAffinityRouter::strong_no_rendezvous(cfg.load_alpha, cfg.prefix_k),
|
||||
) as Box<dyn Router>,
|
||||
CacheScore => Box::new(cache_score::CacheScoreRouter::new(
|
||||
cfg.score_alpha,
|
||||
cfg.score_beta,
|
||||
)) as Box<dyn Router>,
|
||||
// Parity probe for the cache_affinity reweight claim: same scoring
|
||||
// framework as cache_score, but β=1.0 so a single L0-hit block fully
|
||||
// offsets one queue position. Demonstrates how much of cache_affinity's
|
||||
// gain is reproducible by just retuning β (no rendezvous, no meta-store
|
||||
// bonus).
|
||||
CacheScoreStrong => {
|
||||
Box::new(cache_score::CacheScoreRouter::new(1.0, 1.0)) as Box<dyn Router>
|
||||
}
|
||||
CacheScoreTtl => Box::new(cache_score_ttl::CacheScoreTtlRouter::new(
|
||||
cfg.score_alpha,
|
||||
cfg.score_beta,
|
||||
)) as Box<dyn Router>,
|
||||
EstimatedTtft => {
|
||||
Box::new(estimated_ttft::EstimatedTtftRouter::new(full)) as Box<dyn Router>
|
||||
}
|
||||
PrefixAffinity => {
|
||||
Box::new(prefix_affinity::PrefixAffinityRouter::new(full)) as Box<dyn Router>
|
||||
}
|
||||
AdaptiveAffinity => {
|
||||
Box::new(adaptive_affinity::AdaptiveAffinityRouter::new(full)) as Box<dyn Router>
|
||||
}
|
||||
LineageAffinity => {
|
||||
Box::new(lineage_affinity::LineageAffinityRouter::new(full)) as Box<dyn Router>
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub fn build_global(full: &Config) -> Box<dyn GlobalRouter> {
|
||||
global_bucket::build(full)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::config::{
|
||||
CalibrationConfig, ClusterConfig, HardwareConfig, MetaStoreConfig, ModelConfig,
|
||||
RouterConfig, RouterMode, SimConfig,
|
||||
};
|
||||
use crate::instance::instance::AdmittedRequest;
|
||||
use crate::router::cache_load::CacheLoadRouter;
|
||||
use crate::router::cache_score::CacheScoreRouter;
|
||||
use crate::router::cache_score_ttl::CacheScoreTtlRouter;
|
||||
use crate::router::estimated_ttft::EstimatedTtftRouter;
|
||||
use crate::router::min_pd::MinPdRouter;
|
||||
use crate::router::precise_aware::PreciseRouter;
|
||||
use crate::router::prefix_affinity::PrefixAffinityRouter;
|
||||
use crate::router::ttl_aware::TtlAwareRouter;
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
fn test_model() -> ModelConfig {
|
||||
ModelConfig {
|
||||
name: "test".into(),
|
||||
num_layers: 4,
|
||||
num_kv_heads: 2,
|
||||
head_dim: 64,
|
||||
dtype_bytes: 2,
|
||||
block_size_tokens: 16,
|
||||
flops_per_token_prefill: Some(1.0e9),
|
||||
attn_quadratic_coeff: Some(64.0),
|
||||
..Default::default()
|
||||
}
|
||||
}
|
||||
|
||||
fn test_hardware() -> HardwareConfig {
|
||||
HardwareConfig {
|
||||
gpu_flops: 1.0e14,
|
||||
gpu_fp8_flops: 0.0,
|
||||
gpu_fp4_flops: 0.0,
|
||||
gpu_mem_bw: 1.0e12,
|
||||
hbm_bytes: 1.0e9,
|
||||
dram_bytes: 4.0e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32.0e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12.0e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_us: 2.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
}
|
||||
}
|
||||
|
||||
fn test_config(mode: RouterMode) -> Config {
|
||||
Config {
|
||||
model: test_model(),
|
||||
hardware: test_hardware(),
|
||||
calibration: CalibrationConfig::default(),
|
||||
cluster: ClusterConfig {
|
||||
num_instances: Some(2),
|
||||
buckets: Vec::new(),
|
||||
global_router: Default::default(),
|
||||
meta_store: MetaStoreConfig {
|
||||
ttl_seconds: 1000.0,
|
||||
},
|
||||
router: RouterConfig {
|
||||
mode,
|
||||
precise_probe_latency_us: 10.0,
|
||||
precise_probe_topk: 2,
|
||||
load_alpha: 0.0,
|
||||
score_alpha: 0.0,
|
||||
score_beta: 1.0,
|
||||
prefix_k: 8,
|
||||
affinity_fan_out: 2,
|
||||
},
|
||||
},
|
||||
sim: SimConfig {
|
||||
trace_path: String::new(),
|
||||
max_requests: None,
|
||||
output_dir: String::new(),
|
||||
sample_interval_s: 0.0,
|
||||
seed: 7,
|
||||
input_length_min: None,
|
||||
input_length_max: None,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
fn make_instances(n: usize) -> Vec<Instance> {
|
||||
let model = test_model();
|
||||
let hw = test_hardware();
|
||||
let calib = CalibrationConfig::default();
|
||||
(0..n)
|
||||
.map(|id| Instance::new(id as u32, &model, &hw, &calib))
|
||||
.collect()
|
||||
}
|
||||
|
||||
fn make_request(hashes: &[u64]) -> RequestRecord {
|
||||
RequestRecord {
|
||||
req_id: 1,
|
||||
chat_id: 0,
|
||||
parent_chat_id: -1,
|
||||
turn: 1,
|
||||
arrival: 0.0,
|
||||
input_len: hashes.len() as u32 * 16,
|
||||
output_len: 16,
|
||||
hash_ids: hashes.to_vec(),
|
||||
}
|
||||
}
|
||||
|
||||
fn insert_l0(inst: &mut Instance, hashes: &[u64]) {
|
||||
let mut evicted = Vec::new();
|
||||
inst.cache.l0.insert_blocks(hashes, &mut evicted);
|
||||
}
|
||||
|
||||
fn insert_l1(inst: &mut Instance, hashes: &[u64]) {
|
||||
let mut evicted = Vec::new();
|
||||
inst.cache.l1.insert_blocks(hashes, &mut evicted);
|
||||
}
|
||||
|
||||
fn publish_meta(meta: &mut MetaStore, inst_id: u32, hashes: &[u64], now: f64) {
|
||||
for &h in hashes {
|
||||
meta.insert(h, inst_id, now);
|
||||
}
|
||||
}
|
||||
|
||||
fn enqueue_requests(inst: &mut Instance, count: u32, tokens: u32) {
|
||||
for req_id in 0..count {
|
||||
inst.admit(AdmittedRequest {
|
||||
req_id: req_id as u64,
|
||||
arrival: 0.0,
|
||||
ready_at: 0.0,
|
||||
prefill_tokens_remaining: tokens,
|
||||
reserved_blocks: 0,
|
||||
completion_tail_s: 0.0,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn precise_uses_real_l0_prefix_not_l1_prefix() {
|
||||
let req = make_request(&[10, 11, 12]);
|
||||
let mut instances = make_instances(2);
|
||||
let mut meta = MetaStore::new(1000.0);
|
||||
|
||||
insert_l1(&mut instances[0], &[10, 11, 12]);
|
||||
publish_meta(&mut meta, 0, &[10, 11, 12], 0.0);
|
||||
insert_l0(&mut instances[1], &[10, 11]);
|
||||
|
||||
let mut router = PreciseRouter::new(2, 10e-6, 0.0);
|
||||
let decision = router.route(&req, &instances, &meta, 0.0);
|
||||
|
||||
assert_eq!(decision.chosen, 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn ttl_aware_uses_meta_store_scores() {
|
||||
let req = make_request(&[20, 21, 22]);
|
||||
let mut instances = make_instances(2);
|
||||
let mut meta = MetaStore::new(1000.0);
|
||||
|
||||
// Instance 0 only looks hot in the meta store; its real L0 prefix is zero.
|
||||
publish_meta(&mut meta, 0, &[20, 21, 22], 0.0);
|
||||
// Instance 1 really holds the first two blocks in HBM.
|
||||
insert_l0(&mut instances[1], &[20, 21]);
|
||||
|
||||
let mut ttl = TtlAwareRouter::new(0.0);
|
||||
assert_eq!(ttl.route(&req, &instances, &meta, 0.0).chosen, 0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cache_aware_routers_compare_real_l0_not_meta_store_scores() {
|
||||
let req = make_request(&[20, 21, 22]);
|
||||
let mut instances = make_instances(2);
|
||||
let mut meta = MetaStore::new(1000.0);
|
||||
|
||||
// Instance 0 only looks hot in the meta store; its real L0 prefix is zero.
|
||||
publish_meta(&mut meta, 0, &[20, 21, 22], 0.0);
|
||||
// Instance 1 really holds the first two blocks in HBM.
|
||||
insert_l0(&mut instances[1], &[20, 21]);
|
||||
|
||||
let mut cache_load = CacheLoadRouter::new();
|
||||
assert_eq!(cache_load.route(&req, &instances, &meta, 0.0).chosen, 1);
|
||||
|
||||
let mut cache_score = CacheScoreRouter::new(0.0, 1.0);
|
||||
assert_eq!(cache_score.route(&req, &instances, &meta, 0.0).chosen, 1);
|
||||
|
||||
let mut min_pd = MinPdRouter::new();
|
||||
assert_eq!(min_pd.route(&req, &instances, &meta, 0.0).chosen, 1);
|
||||
|
||||
let cfg = test_config(RouterMode::EstimatedTtft);
|
||||
let mut est = EstimatedTtftRouter::new(&cfg);
|
||||
assert_eq!(est.route(&req, &instances, &meta, 0.0).chosen, 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn cache_score_ttl_uses_meta_store_prefix() {
|
||||
let req = make_request(&[50, 51, 52]);
|
||||
let mut instances = make_instances(2);
|
||||
let mut meta = MetaStore::new(1000.0);
|
||||
|
||||
// Instance 0 only looks hot in the meta store.
|
||||
publish_meta(&mut meta, 0, &[50, 51, 52], 0.0);
|
||||
// Instance 1 has the true local L0 prefix.
|
||||
insert_l0(&mut instances[1], &[50, 51]);
|
||||
|
||||
let mut cache_score = CacheScoreRouter::new(0.0, 1.0);
|
||||
assert_eq!(cache_score.route(&req, &instances, &meta, 0.0).chosen, 1);
|
||||
|
||||
let mut cache_score_ttl = CacheScoreTtlRouter::new(0.0, 1.0);
|
||||
assert_eq!(
|
||||
cache_score_ttl.route(&req, &instances, &meta, 0.0).chosen,
|
||||
0
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn prefix_affinity_fallback_uses_real_l0_not_meta_store_scores() {
|
||||
let req = make_request(&[30, 31, 32]);
|
||||
let mut instances = make_instances(2);
|
||||
let mut meta = MetaStore::new(1000.0);
|
||||
|
||||
publish_meta(&mut meta, 0, &[30, 31, 32], 0.0);
|
||||
insert_l0(&mut instances[1], &[30, 31]);
|
||||
enqueue_requests(&mut instances[0], 5, 64);
|
||||
enqueue_requests(&mut instances[1], 5, 64);
|
||||
|
||||
let cfg = test_config(RouterMode::PrefixAffinity);
|
||||
let mut router = PrefixAffinityRouter::new(&cfg);
|
||||
let decision = router.route(&req, &instances, &meta, 0.0);
|
||||
|
||||
assert_eq!(decision.local_reason, "affinity fallback: min(drain+fetch)");
|
||||
assert_eq!(decision.chosen, 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn estimated_ttft_can_prefer_full_l1_hit_over_shorter_queue() {
|
||||
let req = make_request(&[40, 41, 42, 43]);
|
||||
let mut instances = make_instances(2);
|
||||
let meta = MetaStore::new(1000.0);
|
||||
|
||||
// Instance 0 can satisfy the whole prefix from local DRAM/L1.
|
||||
insert_l1(&mut instances[0], &[40, 41, 42, 43]);
|
||||
// Instance 1 has a slightly shorter queue but no reusable prefix.
|
||||
enqueue_requests(&mut instances[0], 1, 16);
|
||||
|
||||
let cfg = test_config(RouterMode::EstimatedTtft);
|
||||
let mut est = EstimatedTtftRouter::new(&cfg);
|
||||
|
||||
assert_eq!(est.route(&req, &instances, &meta, 0.0).chosen, 0);
|
||||
}
|
||||
}
|
||||
|
@@ -1,28 +1,13 @@
-//! KV-aware routing via meta-store candidate selection + precise probing.
+//! L0-aware routing via exact per-instance probing.
 //!
-//! The global meta store is used as a *candidate pre-filter*: we score
-//! every instance's predicted prefix from the store, take the top-K by
-//! (predicted_prefix DESC, load ASC), and then exact-probe those K
-//! candidates' actual L0+L1 caches to get the true longest prefix. This
-//! catches two cases where the meta store is wrong:
-//!
-//! - the store is stale (block evicted from L0/L1 but TTL not yet up),
-//! - the store undercounts because some blocks' TTL expired individually.
-//!
-//! Because the candidate set is sourced from the meta store rather than
-//! from a load ranking, this router is a strict superset of `ttl_aware`:
-//! any instance the meta store would pick is a candidate here, and the
-//! exact probe can only move the decision toward a truthfully-better
-//! instance. Each probe adds `probe_latency_s` to the request's
-//! effective arrival time.
-//!
-//! If the meta store returns zero-prefix for every instance (e.g. cold
-//! start, or a request whose blocks have never been seen), we fall back
-//! to the top-K least-loaded instances so we still place the request.
+//! Every instance is compared using its *real current L0 prefix length*
+//! only. L1 / remote availability can still reduce execution-time misses
+//! later in the cluster fetch chain, but they do not count as `kvcache hit`
+//! for routing.

 use crate::cluster::meta_store::MetaStore;
 use crate::instance::Instance;
-use crate::router::{CandidateInfo, RouteDecision, Router};
+use crate::router::{local_l0_prefix, CandidateInfo, RouteDecision, Router};
 use crate::trace::RequestRecord;

 pub struct PreciseRouter {
||||
@@ -33,7 +18,11 @@ pub struct PreciseRouter {
|
||||
|
||||
impl PreciseRouter {
|
||||
pub fn new(topk: u32, probe_latency_s: f64, alpha: f64) -> Self {
|
||||
Self { topk, probe_latency_s, alpha }
|
||||
Self {
|
||||
topk,
|
||||
probe_latency_s,
|
||||
alpha,
|
||||
}
|
||||
}
|
||||
|
||||
fn load_of(&self, inst: &Instance) -> f64 {
|
||||
@@ -50,50 +39,15 @@ impl Router for PreciseRouter {
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
_meta: &MetaStore,
|
||||
_now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let k = (self.topk as usize).min(n).max(1);
|
||||
|
||||
// 1. Meta-store candidate set: rank all instances by
|
||||
// (predicted_prefix DESC, load ASC) and take the top-K.
|
||||
let meta_scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let any_meta_hit = meta_scores.iter().any(|&p| p > 0);
|
||||
|
||||
let mut ranked: Vec<usize> = (0..n).collect();
|
||||
if any_meta_hit {
|
||||
ranked.sort_by(|&a, &b| {
|
||||
let pa = meta_scores[a];
|
||||
let pb = meta_scores[b];
|
||||
// prefix desc, then load asc
|
||||
pb.cmp(&pa)
|
||||
.then_with(|| {
|
||||
self.load_of(&instances[a])
|
||||
.partial_cmp(&self.load_of(&instances[b]))
|
||||
.unwrap_or(std::cmp::Ordering::Equal)
|
||||
})
|
||||
});
|
||||
} else {
|
||||
// Cold start fallback: pure load order.
|
||||
ranked.sort_by(|&a, &b| {
|
||||
self.load_of(&instances[a])
|
||||
.partial_cmp(&self.load_of(&instances[b]))
|
||||
.unwrap_or(std::cmp::Ordering::Equal)
|
||||
});
|
||||
}
|
||||
let probed = &ranked[..k];
|
||||
|
||||
// 2. Exact probe each candidate and pick
|
||||
// argmax(exact_prefix, tiebreak: -load).
|
||||
let mut candidates = Vec::with_capacity(k);
|
||||
let mut best = probed[0] as u32;
|
||||
let mut candidates = Vec::with_capacity(n);
|
||||
let mut best = instances[0].id;
|
||||
let mut best_key: (i64, f64) = (i64::MIN, f64::INFINITY);
|
||||
for &i in probed {
|
||||
let inst = &instances[i];
|
||||
let l0 = inst.cache.l0.longest_prefix_peek(&req.hash_ids);
|
||||
let l1 = inst.cache.l1.longest_prefix_peek(&req.hash_ids[l0..]);
|
||||
let predicted = (l0 + l1) as u32;
|
||||
for inst in instances {
|
||||
let predicted = local_l0_prefix(req, inst);
|
||||
let load = self.load_of(inst);
|
||||
candidates.push(CandidateInfo {
|
||||
instance: inst.id,
|
||||
@@ -108,13 +62,13 @@ impl Router for PreciseRouter {
|
||||
}
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "precise",
|
||||
chosen: best,
|
||||
probe_overhead_s: k as f64 * self.probe_latency_s,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"precise",
|
||||
best,
|
||||
n as f64 * self.probe_latency_s,
|
||||
candidates,
|
||||
reason: "exact-probe top-K meta-store candidates",
|
||||
}
|
||||
"exact-probe all instances' L0 cache",
|
||||
)
|
||||
}
|
||||
}
|
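With the meta-store pre-filter removed, the precise router probes every instance's L0 and is charged one probe latency per instance. A sketch of that selection follows, under two assumptions: `Probe` is a stand-in for the per-instance data, and the load formula `kv_blocks_used + alpha * queue_len` is borrowed from the `ttl_aware` hunk, since the body of `load_of` is not visible in this diff.

```rust
/// Stand-in for what an exact probe of one instance returns.
struct Probe {
    l0_prefix: u32,
    kv_blocks_used: u32,
    queue_len: u32,
}

/// Returns (index of the chosen instance, total probe overhead in seconds).
/// Chooses the longest exact L0 prefix and breaks ties by lower load.
fn route_precise(probes: &[Probe], alpha: f64, probe_latency_s: f64) -> (usize, f64) {
    let mut best = 0usize;
    let mut best_key: (i64, f64) = (i64::MIN, f64::INFINITY);
    for (i, p) in probes.iter().enumerate() {
        // Assumed load model, taken from the ttl_aware router's formula.
        let load = p.kv_blocks_used as f64 + alpha * p.queue_len as f64;
        let prefix = p.l0_prefix as i64;
        if prefix > best_key.0 || (prefix == best_key.0 && load < best_key.1) {
            best_key = (prefix, load);
            best = i;
        }
    }
    // Every instance is probed, so the overhead scales with cluster size.
    (best, probes.len() as f64 * probe_latency_s)
}
```

The overhead term is the cost of this design: at 128 instances and a 10 µs probe, each request carries about 1.28 ms of probing overhead, compared with `k` probes under the old top-K scheme.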
||||
|
||||
@@ -36,7 +36,7 @@
|
||||
use crate::cluster::meta_store::MetaStore;
|
||||
use crate::config::Config;
|
||||
use crate::instance::Instance;
|
||||
use crate::router::{CandidateInfo, RouteDecision, Router};
|
||||
use crate::router::{local_l0_scores, CandidateInfo, RouteDecision, Router};
|
||||
use crate::trace::RequestRecord;
|
||||
|
||||
pub struct PrefixAffinityRouter {
|
||||
@@ -47,17 +47,11 @@ pub struct PrefixAffinityRouter {
|
||||
/// Queue-length threshold: if all top candidates exceed this, expand to
|
||||
/// the full instance set.
|
||||
overload_threshold: u32,
|
||||
/// Bytes per KV block (for RDMA cost estimation in fallback path).
|
||||
kv_block_bytes: f64,
|
||||
/// RDMA bandwidth in bytes/s.
|
||||
rdma_bw: f64,
|
||||
/// RDMA per-transfer latency in seconds.
|
||||
rdma_latency_s: f64,
|
||||
}
|
||||
|
||||
impl PrefixAffinityRouter {
|
||||
pub fn new(config: &Config) -> Self {
|
||||
let n = config.cluster.num_instances as usize;
|
||||
let n = config.cluster.total_instances() as usize;
|
||||
let cfg_fan = config.cluster.router.affinity_fan_out;
|
||||
// fan_out: if configured, use it; otherwise auto = max(2, n/8).
|
||||
let fan_out = if cfg_fan > 0 {
|
||||
@@ -69,9 +63,6 @@ impl PrefixAffinityRouter {
|
||||
prefix_k: config.cluster.router.prefix_k,
|
||||
fan_out,
|
||||
overload_threshold: 4,
|
||||
kv_block_bytes: config.model.kv_block_bytes() as f64,
|
||||
rdma_bw: config.hardware.rdma_bw,
|
||||
rdma_latency_s: config.hardware.rdma_latency_us * 1e-6,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -96,15 +87,6 @@ impl PrefixAffinityRouter {
|
||||
h = (h ^ (h >> 27)).wrapping_mul(0x94d049bb133111eb);
|
||||
h ^ (h >> 31)
|
||||
}
|
||||
|
||||
/// Estimate RDMA fetch time for `remote_blocks` blocks.
|
||||
fn fetch_time(&self, remote_blocks: u32) -> f64 {
|
||||
if remote_blocks == 0 {
|
||||
return 0.0;
|
||||
}
|
||||
let bytes = remote_blocks as f64 * self.kv_block_bytes;
|
||||
bytes / self.rdma_bw + self.rdma_latency_s
|
||||
}
|
||||
}
|
||||
|
||||
impl Router for PrefixAffinityRouter {
|
||||
@@ -116,8 +98,8 @@ impl Router for PrefixAffinityRouter {
|
||||
&mut self,
|
||||
req: &RequestRecord,
|
||||
instances: &[Instance],
|
||||
meta: &MetaStore,
|
||||
now: f64,
|
||||
_meta: &MetaStore,
|
||||
_now: f64,
|
||||
) -> RouteDecision {
|
||||
let n = instances.len();
|
||||
let fp = Self::fingerprint(&req.hash_ids, self.prefix_k);
|
||||
@@ -129,7 +111,7 @@ impl Router for PrefixAffinityRouter {
|
||||
ranked.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // descending score
|
||||
|
||||
// Collect candidate info for logging (also needed for fallback).
|
||||
let scores = meta.score_prefix(&req.hash_ids, now, n);
|
||||
let scores = local_l0_scores(req, instances);
|
||||
let candidates: Vec<CandidateInfo> = instances
|
||||
.iter()
|
||||
.map(|inst| CandidateInfo {
|
||||
@@ -165,14 +147,14 @@ impl Router for PrefixAffinityRouter {
|
||||
let reason;
|
||||
if all_overloaded {
|
||||
reason = "affinity fallback: min(drain+fetch)";
|
||||
let cluster_prefix = scores.iter().copied().max().unwrap_or(0);
|
||||
let mut best_cost = f64::INFINITY;
|
||||
for &(_, idx) in ranked.iter() {
|
||||
let inst = &instances[idx];
|
||||
let drain = inst.estimated_drain_time();
|
||||
let local_prefix = scores[idx];
|
||||
let remote_blocks = cluster_prefix.saturating_sub(local_prefix);
|
||||
let cost = drain + self.fetch_time(remote_blocks);
|
||||
let miss_tokens = (req.hash_ids.len() as u32)
|
||||
.saturating_sub(scores[idx])
|
||||
.saturating_mul(inst.block_size_tokens);
|
||||
let cost = drain + inst.compute.prefill_time(miss_tokens);
|
||||
let ql = inst.queue_len();
|
||||
if cost < best_cost || (cost == best_cost && ql < best_ql) {
|
||||
best_cost = cost;
|
||||
@@ -184,13 +166,13 @@ impl Router for PrefixAffinityRouter {
|
||||
reason = "prefix affinity: top-K min drain";
|
||||
}
|
||||
|
||||
RouteDecision {
|
||||
req_id: req.req_id,
|
||||
mode: "prefix_affinity",
|
||||
chosen: instances[best_idx].id,
|
||||
probe_overhead_s: 0.0,
|
||||
crate::router::local_route_decision(
|
||||
req.req_id,
|
||||
"prefix_affinity",
|
||||
instances[best_idx].id,
|
||||
0.0,
|
||||
candidates,
|
||||
reason,
|
||||
}
|
||||
)
|
||||
}
|
||||
}
|
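In the overloaded fallback, the cost estimate switches from `drain + RDMA fetch time` to `drain + prefill time of the tokens that miss in local L0`. The sketch below illustrates the new estimate. It assumes a linear prefill model (`tokens / tokens_per_s`) in place of `inst.compute.prefill_time`, whose real form is not shown in the diff, and `Cand` is a stand-in struct.

```rust
/// Stand-in for the per-instance inputs the fallback path reads.
struct Cand {
    drain_s: f64,   // estimated time to drain the current queue
    l0_prefix: u32, // blocks of the request already in local L0
    queue_len: u32,
}

/// Index of the candidate minimising drain + prefill(miss tokens),
/// with queue length as the tie-break.
fn fallback(cands: &[Cand], input_blocks: u32, block_size_tokens: u32, tokens_per_s: f64) -> usize {
    let mut best = 0usize;
    let mut best_cost = f64::INFINITY;
    let mut best_ql = u32::MAX;
    for (i, c) in cands.iter().enumerate() {
        let miss_tokens = input_blocks
            .saturating_sub(c.l0_prefix)
            .saturating_mul(block_size_tokens);
        // Assumed linear prefill model; the simulator's compute model may differ.
        let cost = c.drain_s + miss_tokens as f64 / tokens_per_s;
        if cost < best_cost || (cost == best_cost && c.queue_len < best_ql) {
            best_cost = cost;
            best_ql = c.queue_len;
            best = i;
        }
    }
    best
}
```

With equal drain times the instance holding more of the prefix in L0 wins, which matches what the `prefix_affinity_fallback_uses_real_l0_not_meta_store_scores` test asserts.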
||||
|
||||
@@ -13,7 +13,9 @@ pub struct RandomRouter {
|
||||
|
||||
impl RandomRouter {
|
||||
pub fn new(seed: u64) -> Self {
|
||||
Self { rng: ChaCha8Rng::seed_from_u64(seed) }
|
||||
Self {
|
||||
rng: ChaCha8Rng::seed_from_u64(seed),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -31,19 +33,19 @@ impl Router for RandomRouter {
     ) -> RouteDecision {
         let n = instances.len();
         let chosen = self.rng.gen_range(0..n) as InstanceId;
-        RouteDecision {
-            req_id: req.req_id,
-            mode: "random",
+        crate::router::local_route_decision(
+            req.req_id,
+            "random",
             chosen,
-            probe_overhead_s: 0.0,
-            candidates: vec![CandidateInfo {
+            0.0,
+            vec![CandidateInfo {
                 instance: chosen,
                 predicted_prefix: 0,
                 load_blocks: instances[chosen as usize].kv_blocks_used,
                 queue_len: instances[chosen as usize].queue_len(),
             }],
-            reason: "uniform random",
-        }
+            "uniform random",
+        )
     }
 }
 
@@ -73,18 +75,18 @@ impl Router for RoundRobinRouter {
         let n = instances.len() as u32;
         let chosen = self.next % n;
         self.next = self.next.wrapping_add(1);
-        RouteDecision {
-            req_id: req.req_id,
-            mode: "round_robin",
+        crate::router::local_route_decision(
+            req.req_id,
+            "round_robin",
             chosen,
-            probe_overhead_s: 0.0,
-            candidates: vec![CandidateInfo {
+            0.0,
+            vec![CandidateInfo {
                 instance: chosen,
                 predicted_prefix: 0,
                 load_blocks: instances[chosen as usize].kv_blocks_used,
                 queue_len: instances[chosen as usize].queue_len(),
             }],
-            reason: "round robin",
-        }
+            "round robin",
+        )
     }
 }
@@ -32,8 +32,7 @@ impl Router for TtlAwareRouter {
         let mut candidates = Vec::with_capacity(n);
         for inst in instances {
             let p = scores[inst.id as usize];
-            let load = inst.kv_blocks_used as f64
-                + self.alpha * inst.queue_len() as f64;
+            let load = inst.kv_blocks_used as f64 + self.alpha * inst.queue_len() as f64;
             candidates.push(CandidateInfo {
                 instance: inst.id,
                 predicted_prefix: p,
@@ -47,13 +46,13 @@ impl Router for TtlAwareRouter {
                 best = inst.id;
             }
         }
-        RouteDecision {
-            req_id: req.req_id,
-            mode: "ttl_aware",
-            chosen: best,
-            probe_overhead_s: 0.0,
+        crate::router::local_route_decision(
+            req.req_id,
+            "ttl_aware",
+            best,
+            0.0,
             candidates,
-            reason: "max meta_store prefix, tie -> least loaded",
-        }
+            "max meta_store prefix, tie -> least loaded",
+        )
     }
 }
@@ -56,7 +56,11 @@ impl EventQueue {
     pub fn schedule(&mut self, time: f64, event: Event) {
         let t = time.max(self.now);
         self.seq += 1;
-        self.heap.push(Slot { time: t, seq: self.seq, event });
+        self.heap.push(Slot {
+            time: t,
+            seq: self.seq,
+            event,
+        });
     }
 
     pub fn pop(&mut self) -> Option<(f64, Event)> {
@@ -84,9 +88,27 @@ mod tests {
     #[test]
     fn pops_in_time_order() {
         let mut q = EventQueue::new();
-        q.schedule(2.0, Event::BatchTick { instance: 0 as InstanceId });
-        q.schedule(1.0, Event::BatchTick { instance: 1 });
-        q.schedule(1.5, Event::BatchTick { instance: 2 });
+        q.schedule(
+            2.0,
+            Event::BatchTick {
+                bucket: 0,
+                instance: 0 as InstanceId,
+            },
+        );
+        q.schedule(
+            1.0,
+            Event::BatchTick {
+                bucket: 0,
+                instance: 1,
+            },
+        );
+        q.schedule(
+            1.5,
+            Event::BatchTick {
+                bucket: 0,
+                instance: 2,
+            },
+        );
         let (t1, _) = q.pop().unwrap();
         let (t2, _) = q.pop().unwrap();
         let (t3, _) = q.pop().unwrap();
@@ -98,12 +120,24 @@ mod tests {
     #[test]
     fn equal_time_fifo() {
         let mut q = EventQueue::new();
-        q.schedule(1.0, Event::BatchTick { instance: 7 });
-        q.schedule(1.0, Event::BatchTick { instance: 8 });
+        q.schedule(
+            1.0,
+            Event::BatchTick {
+                bucket: 0,
+                instance: 7,
+            },
+        );
+        q.schedule(
+            1.0,
+            Event::BatchTick {
+                bucket: 1,
+                instance: 8,
+            },
+        );
         let (_, e1) = q.pop().unwrap();
         let (_, e2) = q.pop().unwrap();
         match (e1, e2) {
-            (Event::BatchTick { instance: a }, Event::BatchTick { instance: b }) => {
+            (Event::BatchTick { instance: a, .. }, Event::BatchTick { instance: b, .. }) => {
                 assert_eq!(a, 7);
                 assert_eq!(b, 8);
             }
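The two tests above rely on the queue popping events in time order and, for equal times, in insertion order. A minimal sketch of one way to get that behavior from `std::collections::BinaryHeap` follows. It is not the crate's `EventQueue`: `SketchQueue` is a hypothetical name, a plain `u32` payload stands in for `Event`, and advancing `now` on `pop` is an assumption, since that part of the real queue is not shown in this diff.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// One scheduled entry, ordered by time and then by insertion sequence.
struct Slot {
    time: f64,
    seq: u64,
    payload: u32,
}

impl PartialEq for Slot {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Slot {}
impl PartialOrd for Slot {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Slot {
    // BinaryHeap is a max-heap, so the comparison is reversed: the earliest
    // time pops first and, on ties, the lowest sequence number.
    fn cmp(&self, other: &Self) -> Ordering {
        other
            .time
            .total_cmp(&self.time)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}

#[derive(Default)]
struct SketchQueue {
    now: f64,
    seq: u64,
    heap: BinaryHeap<Slot>,
}

impl SketchQueue {
    fn schedule(&mut self, time: f64, payload: u32) {
        // Clamp to the current clock so nothing is scheduled in the past.
        let time = time.max(self.now);
        self.seq += 1;
        self.heap.push(Slot { time, seq: self.seq, payload });
    }

    fn pop(&mut self) -> Option<(f64, u32)> {
        let slot = self.heap.pop()?;
        self.now = slot.time; // assumption: popping advances the clock
        Some((slot.time, slot.payload))
    }
}
```

The monotonically increasing `seq` is what makes equal-time events FIFO; without it, a binary heap gives no ordering guarantee among ties.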
@@ -1,5 +1,6 @@
 //! Event types for the discrete-event engine.
 
+use crate::router::BucketId;
 use crate::types::{InstanceId, ReqId};
 
 #[derive(Debug)]
@@ -7,7 +8,10 @@ pub enum Event {
     /// New trace request arrives at the cluster router.
     Arrival { req_id: ReqId },
     /// Per-instance scheduler tick (continuous batching).
-    BatchTick { instance: InstanceId },
+    BatchTick {
+        bucket: BucketId,
+        instance: InstanceId,
+    },
     /// Periodic time-series sample of all instances.
     Sample,
     /// Stop the simulation early (used internally).
11
src/trace.rs
@@ -21,12 +21,16 @@ struct RawRecord {
     #[serde(default)]
     chat_id: i64,
+    #[serde(default)]
+    parent_chat_id: i64,
     #[serde(default)]
     timestamp: f64,
     #[serde(default)]
     input_length: i64,
     #[serde(default)]
     output_length: i64,
+    #[serde(default)]
+    turn: i64,
     #[serde(default)]
     hash_ids: Vec<i64>,
 }
 
@@ -34,6 +38,8 @@ struct RawRecord {
 pub struct RequestRecord {
     pub req_id: u64,
     pub chat_id: i64,
+    pub parent_chat_id: i64,
+    pub turn: i64,
     pub arrival: f64,
     pub input_len: u32,
     pub output_len: u32,
@@ -50,8 +56,7 @@ pub struct TraceReader {
 impl TraceReader {
     pub fn open<P: AsRef<Path>>(path: P, max_requests: Option<u64>) -> Result<Self> {
         let path = path.as_ref();
-        let f = File::open(path)
-            .with_context(|| format!("opening trace {}", path.display()))?;
+        let f = File::open(path).with_context(|| format!("opening trace {}", path.display()))?;
         Ok(Self {
             inner: BufReader::with_capacity(1 << 20, f),
             next_id: 0,
@@ -89,6 +94,8 @@ impl Iterator for TraceReader {
                 return Some(Ok(RequestRecord {
                     req_id: id,
                     chat_id: raw.chat_id,
+                    parent_chat_id: raw.parent_chat_id,
+                    turn: raw.turn,
                     arrival: raw.timestamp,
                     input_len: raw.input_length.max(0) as u32,
                     output_len: raw.output_length.max(0) as u32,
180
src/ttft.rs
Normal file
@@ -0,0 +1,180 @@
+use crate::cluster::meta_store::MetaStore;
+use crate::config::{CalibrationConfig, HardwareConfig};
+use crate::instance::Instance;
+
+#[derive(Debug, Clone, Copy, Default)]
+pub struct PrefixResidency {
+    pub l0_hit_blocks: u32,
+    pub l1_hit_blocks: u32,
+    pub remote_hit_blocks: u32,
+    pub miss_blocks: u32,
+}
+
+#[derive(Debug, Clone)]
+pub struct TtftModel {
+    kv_block_bytes: u64,
+    host_dram_bw: f64,
+    pcie_bw: f64,
+    pcie_latency_s: f64,
+    rdma_bw: f64,
+    rdma_latency_s: f64,
+    scheduler_base_s: f64,
+    scheduler_per_candidate_s: f64,
+    cache_probe_per_tier_s: f64,
+    batch_pack_s: f64,
+    dram_access_s: f64,
+    remote_metadata_s: f64,
+    layout_transform_s: f64,
+    first_token_tail_s: f64,
+}
+
+impl TtftModel {
+    pub fn new(hw: &HardwareConfig, calib: &CalibrationConfig, kv_block_bytes: u64) -> Self {
+        Self {
+            kv_block_bytes,
+            host_dram_bw: hw.host_dram_bw * calib.dram_bw_util,
+            pcie_bw: hw.pcie_bw * calib.pcie_bw_util,
+            pcie_latency_s: hw.pcie_latency_us * 1e-6,
+            rdma_bw: hw.rdma_bw * calib.rdma_bw_util,
+            rdma_latency_s: hw.rdma_latency_us * 1e-6,
+            scheduler_base_s: calib.scheduler_base_overhead_us * 1e-6,
+            scheduler_per_candidate_s: calib.scheduler_per_candidate_us * 1e-6,
+            cache_probe_per_tier_s: calib.cache_probe_us_per_tier * 1e-6,
+            batch_pack_s: calib.batch_pack_overhead_us * 1e-6,
+            dram_access_s: calib.dram_access_latency_us * 1e-6,
+            remote_metadata_s: calib.remote_metadata_us * 1e-6,
+            layout_transform_s: calib.layout_transform_fixed_us * 1e-6,
+            first_token_tail_s: (calib.final_sync_us + calib.first_token_ready_us) * 1e-6,
+        }
+    }
+
+    pub fn scheduler_overhead_s(&self, num_candidates: usize, num_tiers: usize) -> f64 {
+        self.scheduler_base_s
+            + self.scheduler_per_candidate_s * num_candidates as f64
+            + self.cache_probe_per_tier_s * num_tiers as f64
+            + self.batch_pack_s
+    }
+
+    pub fn first_token_tail_s(&self) -> f64 {
+        self.first_token_tail_s
+    }
+
+    pub fn block_bytes(&self, blocks: u32) -> u64 {
+        self.kv_block_bytes * blocks as u64
+    }
+
+    pub fn local_l1_prepare_time_s(&self, blocks: u32) -> f64 {
+        if blocks == 0 {
+            return 0.0;
+        }
+        let bytes = self.block_bytes(blocks);
+        self.dram_access_s
+            + bytes as f64 / self.host_dram_bw.max(1.0)
+            + self.pcie_cost_s(bytes)
+            + self.layout_transform_s
+    }
+
+    pub fn remote_prepare_time_s(&self, blocks: u32) -> f64 {
+        if blocks == 0 {
+            return 0.0;
+        }
+        let bytes = self.block_bytes(blocks);
+        self.remote_metadata_s
+            + self.rdma_cost_s(bytes)
+            + self.pcie_cost_s(bytes)
+            + self.layout_transform_s
+    }
+
+    pub fn pcie_cost_s(&self, bytes: u64) -> f64 {
+        if bytes == 0 {
+            self.pcie_latency_s
+        } else {
+            self.pcie_latency_s + bytes as f64 / self.pcie_bw.max(1.0)
+        }
+    }
+
+    pub fn rdma_cost_s(&self, bytes: u64) -> f64 {
+        if bytes == 0 {
+            self.rdma_latency_s
+        } else {
+            self.rdma_latency_s + bytes as f64 / self.rdma_bw.max(1.0)
+        }
+    }
+
+    pub fn kv_prepare_time_s(&self, residency: PrefixResidency) -> f64 {
+        self.local_l1_prepare_time_s(residency.l1_hit_blocks)
+            + self.remote_prepare_time_s(residency.remote_hit_blocks)
+    }
+}
+
+pub fn classify_prefix_tiers(
+    req_hashes: &[u64],
+    inst: &Instance,
+    meta: &MetaStore,
+    now: f64,
+) -> PrefixResidency {
+    let total_blocks = req_hashes.len() as u32;
+    let l0_hit_blocks = inst.cache.l0.longest_prefix_peek(req_hashes) as u32;
+    let suffix_after_l0 = &req_hashes[l0_hit_blocks as usize..];
+
+    let l1_hit_blocks = inst.cache.l1.longest_prefix_peek(suffix_after_l0) as u32;
+    let suffix_after_l1 = &suffix_after_l0[l1_hit_blocks as usize..];
+
+    let mut remote_hit_blocks = 0;
+    for &h in suffix_after_l1 {
+        let owners = meta.instances_for(h, now);
+        if owners.iter().any(|o| *o != inst.id) {
+            remote_hit_blocks += 1;
+        } else {
+            break;
+        }
+    }
+
+    PrefixResidency {
+        l0_hit_blocks,
+        l1_hit_blocks,
+        remote_hit_blocks,
+        miss_blocks: total_blocks - l0_hit_blocks - l1_hit_blocks - remote_hit_blocks,
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::config::{CalibrationConfig, HardwareConfig};
+
+    #[test]
+    fn remote_prepare_includes_fixed_overheads() {
+        let hw = HardwareConfig {
+            gpu_flops: 1.0e14,
+            gpu_fp8_flops: 0.0,
+            gpu_fp4_flops: 0.0,
+            gpu_mem_bw: 1.0e12,
+            hbm_bytes: 1.0e9,
+            dram_bytes: 4.0e9,
+            host_dram_bw: 5.0e11,
+            pcie_bw: 32.0e9,
+            pcie_latency_us: 1.0,
+            rdma_bw: 12.0e9,
+            rdma_latency_us: 5.0,
+            intra_node_tp_bw: 9.0e11,
+            intra_node_tp_latency_us: 2.0,
+            tp_degree: 1,
+            max_batch_slots: 32,
+            prefill_chunk_tokens: 1024,
+        };
+        let model = TtftModel::new(
+            &hw,
+            &CalibrationConfig {
+                remote_metadata_us: 11.0,
+                layout_transform_fixed_us: 7.0,
+                ..CalibrationConfig::default()
+            },
+            4096,
+        );
+
+        let transport_only = model.rdma_cost_s(4096) + model.pcie_cost_s(4096);
+        let total = model.remote_prepare_time_s(1);
+        assert!(total > transport_only);
+    }
+}
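The transfer terms in `src/ttft.rs` above all follow the same shape: a fixed latency plus bytes divided by bandwidth, with the remote path adding metadata-lookup and layout-transform constants on top of the RDMA and PCIe hops. A minimal standalone sketch of that arithmetic follows. `Link` and the free function are hypothetical names, and the sketch leaves out the bandwidth utilization factors (`calib.*_bw_util`) because their defaults are not shown in this diff.

```rust
/// Hypothetical link description: fixed latency plus a bandwidth-limited term.
struct Link {
    latency_s: f64,
    bandwidth_bytes_per_s: f64,
}

impl Link {
    /// Cost of moving `bytes` over this link, in seconds.
    fn cost_s(&self, bytes: u64) -> f64 {
        self.latency_s + bytes as f64 / self.bandwidth_bytes_per_s.max(1.0)
    }
}

/// Time to make `blocks` remote KV blocks usable locally:
/// metadata lookup + RDMA hop + PCIe hop + layout transform.
fn remote_prepare_time_s(
    blocks: u32,
    block_bytes: u64,
    rdma: &Link,
    pcie: &Link,
    metadata_s: f64,
    layout_s: f64,
) -> f64 {
    if blocks == 0 {
        return 0.0; // nothing to fetch, so no fixed overheads either
    }
    let bytes = block_bytes * blocks as u64;
    metadata_s + rdma.cost_s(bytes) + pcie.cost_s(bytes) + layout_s
}
```

Plugging in the fixture values from the test above (12 GB/s RDMA with 5 µs latency, 32 GB/s PCIe with 1 µs latency, 11 µs metadata, 7 µs layout transform, one 4096-byte block), the fixed overheads add 18 µs on top of the transport cost, which is why the test can assert `total > transport_only`.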
254
tests/smoke.rs
@@ -6,7 +6,7 @@ use std::io::Write;
 
 use kvcache_simulator::config::*;
 use kvcache_simulator::driver;
-use kvcache_simulator::replay::ReplayEvictPolicy;
+use kvcache_simulator::replay::{self, PlacementEntry, ReplayEvictPolicy};
 
 fn base_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
     Config {
@@ -28,15 +28,22 @@ fn base_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
|
||||
gpu_mem_bw: 1.0e12,
|
||||
hbm_bytes: 1.0e9,
|
||||
dram_bytes: 4.0e9,
|
||||
host_dram_bw: 5.0e11,
|
||||
pcie_bw: 32.0e9,
|
||||
pcie_latency_us: 1.0,
|
||||
rdma_bw: 12.0e9,
|
||||
rdma_latency_us: 5.0,
|
||||
intra_node_tp_bw: 9.0e11,
|
||||
intra_node_tp_latency_us: 2.0,
|
||||
tp_degree: 1,
|
||||
max_batch_slots: 32,
|
||||
prefill_chunk_tokens: 1024,
|
||||
},
|
||||
calibration: CalibrationConfig::default(),
|
||||
cluster: ClusterConfig {
|
||||
num_instances: 4,
|
||||
num_instances: Some(4),
|
||||
buckets: Vec::new(),
|
||||
global_router: Default::default(),
|
||||
meta_store: MetaStoreConfig {
|
||||
ttl_seconds: 1000.0,
|
||||
},
|
||||
@@ -57,10 +64,32 @@ fn base_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
             output_dir: out_dir.into(),
             sample_interval_s: 0.0,
             seed: 7,
+            input_length_min: None,
+            input_length_max: None,
         },
     }
 }
 
+fn bucketed_config(trace_path: &str, out_dir: &str, mode: RouterMode) -> Config {
+    let mut cfg = base_config(trace_path, out_dir, mode);
+    cfg.cluster.num_instances = None;
+    cfg.cluster.buckets = vec![
+        BucketConfig {
+            name: "short".into(),
+            input_length_min: 0,
+            input_length_max: 64,
+            num_instances: 2,
+        },
+        BucketConfig {
+            name: "long".into(),
+            input_length_min: 65,
+            input_length_max: 128,
+            num_instances: 1,
+        },
+    ];
+    cfg
+}
+
 fn write_synthetic_trace(path: &std::path::Path) {
     // 5 distinct conversations, each with 8 turns. Within a conversation,
     // turn k+1 reuses the prefix of turn k (shared first ~10 blocks) and
@@ -172,12 +201,14 @@ fn ablation_lru_preserves_ttft_fields() {
         RouterMode::Random,
     );
     let online = driver::run(&cfg, Some("online_lru")).expect("online lru run");
-    let out = driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Lru])
+    let out =
+        driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Lru])
             .expect("ablate lru");
 
     assert_eq!(out.len(), 1);
     let row = &out[0];
-    let online_hit = online.summary.hit_rate_l0 + online.summary.hit_rate_l1 + online.summary.hit_rate_remote;
+    let online_hit =
+        online.summary.hit_rate_l0 + online.summary.hit_rate_l1 + online.summary.hit_rate_remote;
     let ablate_hit = row.hit_rate_l0 + row.hit_rate_l1 + row.hit_rate_remote;
 
     assert!(
@@ -204,14 +235,219 @@ fn ablate_rejects_belady_until_exact_algorithm_exists() {
         RouterMode::Random,
     );
 
-    let err = driver::ablate_fixed_placement(
-        &cfg,
-        &[RouterMode::Random],
-        &[ReplayEvictPolicy::Belady],
-    )
+    let err =
+        driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Belady])
             .expect_err("belady should be rejected");
     assert!(
         err.to_string().contains("exact belady"),
         "unexpected error: {err:#}"
     );
 }
+
+#[test]
+fn ablation_parallel_matches_serial() {
+    let tmp = std::env::temp_dir().join("kvcache_sim_ablate_parallel");
+    let _ = std::fs::remove_dir_all(&tmp);
+    std::fs::create_dir_all(&tmp).unwrap();
+    let trace_path = tmp.join("trace.jsonl");
+    write_synthetic_trace(&trace_path);
+
+    let cfg = base_config(
+        trace_path.to_str().unwrap(),
+        tmp.to_str().unwrap(),
+        RouterMode::Random,
+    );
+    let routers = [
+        RouterMode::Random,
+        RouterMode::LeastLoaded,
+        RouterMode::TtlAware,
+        RouterMode::Precise,
+    ];
+
+    let serial = driver::ablate_fixed_placement_with_parallelism(
+        &cfg,
+        &routers,
+        &[ReplayEvictPolicy::Lru],
+        1,
+    )
+    .expect("serial ablate");
+    let parallel = driver::ablate_fixed_placement_with_parallelism(
+        &cfg,
+        &routers,
+        &[ReplayEvictPolicy::Lru],
+        2,
+    )
+    .expect("parallel ablate");
+
+    assert_eq!(parallel.len(), serial.len());
+    for (lhs, rhs) in parallel.iter().zip(serial.iter()) {
+        assert_eq!(lhs.router, rhs.router);
+        assert_eq!(lhs.evict_policy, rhs.evict_policy);
+        assert_eq!(lhs.placement_source, rhs.placement_source);
+        assert!((lhs.ttft_mean - rhs.ttft_mean).abs() < 1e-9);
+        assert!((lhs.ttft_p50 - rhs.ttft_p50).abs() < 1e-9);
+        assert!((lhs.ttft_p95 - rhs.ttft_p95).abs() < 1e-9);
+        assert!((lhs.ttft_p99 - rhs.ttft_p99).abs() < 1e-9);
+        assert!((lhs.hit_rate_l0 - rhs.hit_rate_l0).abs() < 1e-12);
+        assert!((lhs.hit_rate_l1 - rhs.hit_rate_l1).abs() < 1e-12);
+        assert!((lhs.hit_rate_remote - rhs.hit_rate_remote).abs() < 1e-12);
+        assert!((lhs.miss_rate - rhs.miss_rate).abs() < 1e-12);
+    }
+}
+
+#[test]
+fn strict_bucket_run_emits_bucket_fields_in_outputs() {
+    let tmp = std::env::temp_dir().join("kvcache_sim_bucket_outputs");
+    let _ = std::fs::remove_dir_all(&tmp);
+    std::fs::create_dir_all(&tmp).unwrap();
+    let trace_path = tmp.join("trace.jsonl");
+
+    let mut f = std::fs::File::create(&trace_path).unwrap();
+    writeln!(
+        f,
+        "{}",
+        serde_json::json!({
+            "chat_id": 1,
+            "parent_chat_id": -1,
+            "timestamp": 0.0,
+            "input_length": 32,
+            "output_length": 16,
+            "type": "text",
+            "turn": 0,
+            "hash_ids": [1, 2]
+        })
+    )
+    .unwrap();
+    writeln!(
+        f,
+        "{}",
+        serde_json::json!({
+            "chat_id": 2,
+            "parent_chat_id": -1,
+            "timestamp": 0.1,
+            "input_length": 80,
+            "output_length": 16,
+            "type": "text",
+            "turn": 0,
+            "hash_ids": [3, 4, 5, 6, 7]
+        })
+    )
+    .unwrap();
+
+    let mut cfg = bucketed_config(
+        trace_path.to_str().unwrap(),
+        tmp.to_str().unwrap(),
+        RouterMode::LeastLoaded,
+    );
+    cfg.cluster.global_router.mode = GlobalRouterMode::StrictInputLength;
+    cfg.sim.sample_interval_s = 0.05;
+
+    let _ = driver::run(&cfg, Some("strict_bucket")).expect("bucketed run");
+
+    let per_request = std::fs::read_to_string(tmp.join("strict_bucket/per_request.csv")).unwrap();
+    assert!(per_request.contains("bucket"));
+    assert!(per_request.contains("length_bucket_match"));
+
+    let instances = std::fs::read_to_string(tmp.join("strict_bucket/instances.csv")).unwrap();
+    assert!(instances.contains("bucket"));
+
+    let routing_log = std::fs::read_to_string(tmp.join("strict_bucket/routing_log.jsonl")).unwrap();
+    assert!(routing_log.contains("\"chosen_bucket\""));
+    assert!(routing_log.contains("\"bucket_candidates\""));
+    assert!(routing_log.contains("\"global_reason\""));
+}
+
+#[test]
+fn bucketed_configs_are_rejected_by_legacy_fixed_placement_paths() {
+    let tmp = std::env::temp_dir().join("kvcache_sim_bucketed_reject");
+    let _ = std::fs::remove_dir_all(&tmp);
+    std::fs::create_dir_all(&tmp).unwrap();
+    let trace_path = tmp.join("trace.jsonl");
+    write_synthetic_trace(&trace_path);
+
+    let cfg = bucketed_config(
+        trace_path.to_str().unwrap(),
+        tmp.to_str().unwrap(),
+        RouterMode::Random,
+    );
+
+    let err =
+        driver::ablate_fixed_placement(&cfg, &[RouterMode::Random], &[ReplayEvictPolicy::Lru])
+            .expect_err("bucketed ablation should fail");
+    assert!(err.to_string().contains("cluster.buckets"));
+
+    let err = replay::replay_fixed_placement(
+        &cfg,
+        &[],
+        &Vec::<PlacementEntry>::new(),
+        ReplayEvictPolicy::Lru,
+    )
+    .expect_err("bucketed replay should fail");
+    assert!(err.to_string().contains("cluster.buckets"));
+}
+
+#[test]
+fn bucket_score_can_deviate_from_strict_length_bucket() {
+    let tmp = std::env::temp_dir().join("kvcache_sim_bucket_score");
+    let _ = std::fs::remove_dir_all(&tmp);
+    std::fs::create_dir_all(&tmp).unwrap();
+    let trace_path = tmp.join("trace.jsonl");
+
+    let mut f = std::fs::File::create(&trace_path).unwrap();
+    for req_id in 0..3 {
+        writeln!(
+            f,
+            "{}",
+            serde_json::json!({
+                "chat_id": req_id,
+                "parent_chat_id": -1,
+                "timestamp": 0.0,
+                "input_length": 24,
+                "output_length": 16,
+                "type": "text",
+                "turn": 0,
+                "hash_ids": [100 + req_id, 200 + req_id]
+            })
+        )
+        .unwrap();
+    }
+
+    let mut strict_cfg = bucketed_config(
+        trace_path.to_str().unwrap(),
+        tmp.to_str().unwrap(),
+        RouterMode::LeastLoaded,
+    );
+    strict_cfg.cluster.buckets = vec![
+        BucketConfig {
+            name: "short".into(),
+            input_length_min: 0,
+            input_length_max: 32,
+            num_instances: 1,
+        },
+        BucketConfig {
+            name: "long".into(),
+            input_length_min: 33,
+            input_length_max: 96,
+            num_instances: 1,
+        },
+    ];
+    strict_cfg.cluster.global_router.mode = GlobalRouterMode::StrictInputLength;
+
+    let mut score_cfg = strict_cfg.clone();
+    score_cfg.cluster.global_router.mode = GlobalRouterMode::BucketScore;
+    score_cfg.cluster.global_router.length_penalty_weight = 1.0;
+    score_cfg.cluster.global_router.load_weight = 10.0;
+    score_cfg.cluster.global_router.cache_weight = 0.0;
+
+    let _ = driver::run(&strict_cfg, Some("strict_score_cmp")).expect("strict run");
+    let _ = driver::run(&score_cfg, Some("bucket_score_cmp")).expect("bucket score run");
+
+    let strict_log =
+        std::fs::read_to_string(tmp.join("strict_score_cmp/routing_log.jsonl")).unwrap();
+    let score_log =
+        std::fs::read_to_string(tmp.join("bucket_score_cmp/routing_log.jsonl")).unwrap();
+
+    assert!(strict_log.contains("\"chosen_bucket\":0"));
+    assert!(score_log.contains("\"global_mode\":\"bucket_score\""));
+    assert!(score_log.contains("\"chosen_bucket\":1"));
+}
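The bucket tests above configure non-overlapping input-length ranges and expect `StrictInputLength` to send a request to the bucket whose range contains its `input_length`. A minimal sketch of that lookup follows. It assumes both bounds are inclusive, which the `0..64` / `65..128` fixture suggests but this diff does not confirm; `Bucket`, `example_buckets`, and `strict_bucket` are hypothetical names rather than the crate's API.

```rust
/// Hypothetical mirror of the `BucketConfig` fields relevant to routing.
struct Bucket {
    name: &'static str,
    input_length_min: u32,
    input_length_max: u32,
}

/// Fixture matching the "short"/"long" buckets used in `bucketed_config`.
fn example_buckets() -> Vec<Bucket> {
    vec![
        Bucket { name: "short", input_length_min: 0, input_length_max: 64 },
        Bucket { name: "long", input_length_min: 65, input_length_max: 128 },
    ]
}

/// Index of the first bucket whose inclusive range contains `input_len`,
/// or `None` when no bucket covers it.
fn strict_bucket(buckets: &[Bucket], input_len: u32) -> Option<usize> {
    buckets
        .iter()
        .position(|b| b.input_length_min <= input_len && input_len <= b.input_length_max)
}
```

Under these assumptions the two trace records in `strict_bucket_run_emits_bucket_fields_in_outputs` (lengths 32 and 80) map to buckets 0 and 1. How the real router treats a length that no bucket covers is not visible here, so the sketch just returns `None`.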