Commit Graph

97 Commits

Author SHA1 Message Date
5b1d36080a Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg)
The session migration path was calling _handle_cached_prefill_offload
with swapped c_inst/d_inst and missing cache_hit parameter, causing
TypeError on every migration attempt (13 of 41 errors in the test run).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 22:46:46 +08:00
5772149d36 Approach B v2: TTFT-based migration trigger
Replace num_requests threshold with recent TTFT median as migration
trigger. Track per-instance rolling TTFT (last 8 requests) and trigger
migration when median > 5s (configurable). Target is the instance with
lowest recent TTFT, requiring > 2x improvement to justify migration.

This is more responsive than the instantaneous num_requests signal
because TTFT directly measures the user-facing impact of contention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 21:54:06 +08:00
45b82272c3 Add migration policy design doc with A/B experiment results
Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 18:24:49 +08:00
e9919605af Approach B: session-level lazy migration trigger
When a request arrives for a session on an overloaded instance, force
migration if three conditions hold:
1. Instance busy: num_requests > avg * migration_request_factor (1.5x)
2. Session has cache value: cache_ratio > 50%
3. Request is HEAVY (>= heavy_threshold)
4. A meaningfully less-loaded target exists (num_requests gap > 2)

This bypasses the cost model for migration decisions — the cost model's
cache-inflated costs prevented migration even when instances had 150s
queue times with 99% cache hit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 17:34:06 +08:00
e06de5144b Approach A: contention-aware cost model with migration discount
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 17:24:27 +08:00
e13391eeab Evict migrated blocks from prefix cache after KV send completes
After a session migrates from C to D via offload, C's blocks were freed
to the LRU tail (most-recently-used position), making them the last to
be evicted. Since the session won't return to C, these blocks are dead
weight occupying cache capacity.

Now capture block IDs before _free_blocks and call evict_blocks to
remove them from the prefix cache hash table, so they can be reused
sooner for active sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 16:56:34 +08:00
4b50c5a08d Fix unified cost model: include decode load in queue + hard overload gate
Two bugs caused elastic to concentrate load on cached instances (10x token
imbalance vs 2.7x baseline):

1. _instance_cost queue only counted pending_prefill_tokens, missing
   ongoing_decode_tokens entirely — instances with 50 decoding requests
   appeared idle to the cost model.

2. Cache hits made overloaded instances look "cheap", creating a positive
   feedback loop: more sessions → more cache → lower cost → more routing.
   Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
   affinity before the cost model runs, matching linear policy behavior.

Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 16:25:02 +08:00
9cebdb6b9b Fix multi-turn replay fidelity: track realized output tokens across all components
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.

- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
  to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
  prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
  track failed_recving_block_ids for error recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 14:47:51 +08:00
cc4a9c91e7 Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash
The standalone hash computation in estimate_hit produced different hashes
than the hash_table (synced from scheduler). Root cause unclear (possibly
pickle serialization differences or hash chain state). Fix: delegate to
_lookup_by_tokens which is proven to work (push_blocks uses it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 12:41:53 +08:00
657812f8c4 Add deploy_vllm_patches.sh: sync third_party/vllm patches to site-packages
Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from
third_party/vllm to the pip-installed vllm's site-packages. C extensions
stay from the pip package; only Python files are overridden.

Usage: bash scripts/deploy_vllm_patches.sh [HOST]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:59:52 +08:00
bf76273778 Add --offload-mode switch for ablation (direct_read vs cached_prefill)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:24:15 +08:00
cdf83493ab Fix A+C: real cache sync + cached-prefill-on-C architecture
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
   probing. Proxy queries this before committing to PUSH, eliminating
   24% zero-match PUSH requests (shadow cache divergence).

C: Add _handle_cached_prefill_offload: C (cache source) does fast
   cached prefill → KV to Mooncake → D pulls and decodes.
   Replaces broken direct_read PUSH where D waited for RDMA transfer
   while occupying KV blocks without doing compute.

Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:22:38 +08:00
2b9eae0d54 Report §3.9: Unified routing final results — TTFT -25%, E2E -7%
850/850, 0 errors. Single argmin(latency) with soft affinity.
116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL.
TPOT p90 +15% tradeoff from kv_both overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 03:15:32 +08:00
97f4fe5164 Fix: rename inst->chosen in generate function (NameError crash)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:55:01 +08:00
5892739159 Add session affinity as soft preference in unified routing
Without affinity, all cached requests route to the same instance
(cache source always has lowest prefill cost), causing 149s queue.

Fix: if the session's last instance has cost <= 2x the global best,
use it (preserves cache locality). Only re-route when the affinity
instance is significantly more expensive (overloaded).

The 2x threshold is intentionally loose — it's not a hardcoded magic
number but a "prefer locality unless clearly worse" heuristic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:37:58 +08:00
6b255fad91 Unified routing: single argmin(expected_latency) over all instances
Replace two-phase routing (pick_instance → offload gate) with a single
cost function evaluated per instance:

  latency(D) = queue(D) + prefill_time(D) + transfer_cost(D)

  - If D has local cache: prefill = (input - local_hit) / throughput
  - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma
  - Otherwise: prefill = input / throughput (cold)

Choose argmin(latency). If the winner needs PUSH → trigger migration.

Removed:
- WARM/MEDIUM/HEAVY classification (no routing purpose)
- heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio
- Interference penalty magic number (0.3)
- Separate pick_instance + offload gate stages

Only 2 measured parameters remain:
- prefill_throughput = 7000 tokens/s (H20 measured)
- rdma_overhead_s = 0.1s (RDMA PUSH measured)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:21:34 +08:00
1cd0a18e2c Report §3.8: Document direct KV cache migration architecture + bugs fixed
Complete documentation of bootstrap-triggered PUSH implementation:
hash table sync, token-based lookup, RDMA WRITE path, cost model,
PYTHONHASHSEED requirement, and all 6 bugs fixed during development.

Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s
(vs local cache 0.338s, +0.03s overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:52:38 +08:00
8c267ec54e VERIFIED: Bootstrap-triggered PUSH works end-to-end!
Test results:
- 640/640 blocks matched and pushed (ret=0)
- External prefix cache hit rate: 80.0% on D
- Turn 2 TTFT: inst_0 (cached) = 0.338s, inst_1 (RDMA push) = 0.367s
- C's scheduler was NOT involved (0 GPU compute on C)

The complete direct KV cache migration pipeline is working:
D → /push_blocks → C bootstrap matches tokens → C RDMA WRITE → D GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:50:37 +08:00
e3a1d70cf2 Switch from RDMA READ to bootstrap-triggered PUSH
RDMA READ (batch_transfer_sync_read) fails on GPU memory because
batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE.

New approach: D sends /push_blocks to C's bootstrap with token_ids
+ D's GPU addresses. C's bootstrap:
1. Looks up matching blocks in synced hash table (640/640 verified)
2. Uses C's TransferEngine.batch_transfer_sync_write to PUSH blocks
   directly into D's GPU memory
3. Returns match count + push status

C's scheduler is still NOT involved (0 GPU compute on C).
The push uses C's worker thread + existing RDMA WRITE path (proven reliable).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:47:49 +08:00
6716a3401a Progress: hash matching FIXED (640/640), RDMA read returns -1
Hash mismatch root cause: sha256_cbor vs sha256 (default) + NONE_HASH
from-import value binding. Both fixed. Now 640/640 blocks matched.

RDMA read (batch_transfer_sync_read) fails with ret=-1.
Likely cause: Mooncake TransferEngine may not support RDMA READ
to arbitrary registered memory without explicit permission setup.
The PUSH path (batch_transfer_sync_write) works because the sender
initiates, but PULL may need additional RDMA MR access flags.

Next: investigate Mooncake's RDMA read permission model, or
fall back to a two-step approach: D sends query → C responds
with blocks via batch_transfer_sync_write (existing PUSH path),
but triggered by the bootstrap server instead of the scheduler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:40:52 +08:00
0bb6a67ed3 Fix: use sha256 (default) not sha256_cbor for block hash computation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:36:05 +08:00
08d5e12838 Fix NONE_HASH import: use module ref instead of from-import (value binding bug)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:32:19 +08:00
7e91b83d88 Set PYTHONHASHSEED=42 for elastic mode to ensure consistent block hashes
Root cause confirmed: NONE_HASH = os.urandom(32) differs between
scheduler and bootstrap server even in the same process (init_none_hash
called separately by each import path). PYTHONHASHSEED makes it
deterministic: NONE_HASH = hash_fn(seed), same across all code paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:27:52 +08:00
ee2301ae17 Fix: token lookup condition should check hash_table not block_pool
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:21:49 +08:00
0c88609caa Fix: use synced hash table + sha256_cbor for token-based lookup (same process NONE_HASH)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:18:47 +08:00
0500350849 Fix hash mismatch: token-based lookup instead of cross-instance hash matching
Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32))
when PYTHONHASHSEED is not set. All block hashes are chained from
NONE_HASH, so D's hashes never match C's hashes.

Fix: C's bootstrap server now accepts token_ids and does the prefix
cache lookup locally using C's own hash function and block pool.
No cross-instance hash matching needed.

New flow: D sends prompt token_ids → C computes hashes on C's side →
C looks up in C's own BlockPool → returns block_ids.

Also: module-level _shared_block_pool for scheduler→bootstrap bridge,
prompt_token_ids passed through PullReqMeta, test script added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:14:33 +08:00
a1f30e5fce Add hash_table_sync logging + gap analysis
Root cause of 0 cache hits on offloaded requests identified:
- Hash table sync IS working (scheduler→metadata→worker→bootstrap)
- But D's query_blocks returns no matches → hash format mismatch
  between D's request.block_hashes and C's synced hashes

The gap: offloaded TTFT (12.4s) ≈ co-located TTFT (12.0s) because
D does FULL cold prefill (cache_hit=0), not partial prefill with
RDMA-read cached blocks.

Next: debug hash format mismatch between D and C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 00:38:14 +08:00
1cf03c6e79 Cost model: add interference penalty for co-located heavy prefill
Old cost model: offload_cost = colocated_cost + RDMA_overhead, so offload
was always 0.1s more expensive. Result: only 19/117 HEAVY offloaded.

New: colocated_cost includes interference penalty when C_s has decode
requests: penalty = prefill_time × min(num_requests, 3) × 0.3.
Offload now wins when C_s has 1+ concurrent request.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 23:59:06 +08:00
29b901b145 Fix scheduler assertion crash on partial remote prefill finished_recving
The assertion `assert RequestStatus.is_finished(req.status)` at
scheduler.py:2109 fires when a partial-remote-prefill request
receives `finished_recving` while in RUNNING state (local prefill
already started before RDMA read completed).

This was the root cause of 67% error rate: EngineCore crashed with
"fatal error" assertion, killing the vLLM instance.

Fix: Replace assertion with debug log for non-WAITING, non-finished
requests. kv_both no-offload baseline confirmed 0 errors, proving
the crash was from our scheduler patch, not kv_both instability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 23:33:26 +08:00
4f93bb5b8a Report §3.8: Direct RDMA read results — HEAVY TTFT -70%, TPOT p90 -38%
D reads C's cached KV blocks via batch_transfer_sync_read, bypassing
C's scheduler entirely. 65/318 HEAVY requests offloaded.

HEAVY_OFFLOAD TTFT: 3.40s vs HEAVY_COLO 11.21s (-70%)
Overall TPOT p90: 0.100 vs baseline 0.162 (-38%)

kv_both mode has 67.5% error rate (Mooncake instability), but
276 successful requests show strong performance improvement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:56:16 +08:00
a7514fc3d5 Fix retry syntax: async generator can't use return, use break+try/finally
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:37:32 +08:00
daeb95eca0 RDMA overhead 2.0→0.1s (direct read is raw memory, not scheduler flow) +
retry on ConnectError to handle kv_both connection instability

With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens
pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:33:10 +08:00
5c66f500fc Fix offload gate: remove cache_gate for direct RDMA read, fix cost model
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%)
because they were cold (cache_ratio=0). But with direct RDMA read,
D reads C's cached blocks via RDMA regardless of cache ratio — the
gate was protecting against the OLD flow (C does prefill + push).

Also fixed cost model: offload_cost now reflects direct read reality:
  OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
  NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)

Offload wins when C_s queue > RDMA_overhead (~2s).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 22:01:43 +08:00
23788f7cd5 Fix: import field from dataclasses for PullReqMeta
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:29:24 +08:00
1dea82f2ff launch_phase1_ps: parameterise project + model paths (B6 followup) 2026-05-23 21:14:15 +08:00
52a54e44af proxy: split session_affinity per mode + vLLM patch self-check (M4, S2)
- Replace the global session_affinity dict with two namespace-isolated
  ones (combined / prefill) so a session_id never indexes the wrong
  instance list across mode switches. Keep `session_affinity` as a
  read-only alias to the combined dict for any existing tooling.
- Add a startup _verify_vllm_patch() that scans
  vllm.v1.core.sched.scheduler.Scheduler for the original
  `assert req_id in self.requests` line. If the patch was not
  re-applied after a vLLM upgrade we now print a loud warning at
  lifespan startup instead of dying mid-experiment on a KV-transfer
  abort race.
2026-05-23 21:12:56 +08:00
c843f2e3db proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5)
- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
  MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
  CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
  __main__ now mutates SETTINGS so CLI overrides survive even when the
  module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
  branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
  fall back to colocated. cache_ratio is no longer a write-only field
  (B4).
- P candidate selection penalises instances already running offloaded
  HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
  same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
  proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
  penalty.
2026-05-23 21:11:17 +08:00
0701f84c00 tests: add minimal coverage for percentile + proxy routing (S1)
- tests/test_metrics.py asserts the new linear-interp _percentile against
  hand-computed expected values (single value, two-value interpolation,
  endpoints, numpy-equivalent linear default, on-integer rank).
- tests/test_proxy_pick.py exercises InstanceState LRU eviction and
  move-to-end on hit, plus session-affinity stickiness, the overload
  fallback, the active_p_offloads penalty, and lmetric scoring. The
  proxy is loaded by file path with stub fastapi/uvicorn/httpx modules
  so the suite runs without the FastAPI server deps installed.
- pyproject.toml gets a hatchling wheel target and a [tool.pytest]
  section so `uv run --extra dev pytest` works out of the box.
2026-05-23 21:07:14 +08:00
7c7f8b951a replayer: wire --max-inflight-sessions cap into replay loop (B2)
Trace-driven dispatch is preserved by default (semaphore=None when the
flag is not set), but operators can now cap concurrent sessions to
reproduce session-admission scenarios from earlier sweeps without
artificial time compression.
2026-05-23 21:04:09 +08:00
2c7f7fdaae replayer: restore optional max_inflight_sessions for backwards compat
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:02:26 +08:00
a7df84bd3b Direct RDMA read: D reads cached KV from C's GPU without C's scheduler
Complete implementation of direct RDMA read for KV cache migration:

vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
  scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
  build_connector_meta() computes hash table deltas each step,
  update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
  implements D-side RDMA read via batch_transfer_sync_read, with
  HTTP query/unpin to C's bootstrap server

Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step

Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init

Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
  direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 21:02:13 +08:00
020be9f444 proxy: real LRU for cached_blocks + shadow-state reconcile loop (M1, M5)
M1: cached_blocks was a plain set with a "trim half via list slicing"
eviction. CPython does not guarantee set iteration order, so the trim
discarded an arbitrary half of the entries — completely unlike vLLM's
LRU and a known contributor to the router's cache_hit estimate
diverging from real APC. Replace with an OrderedDict-backed LRU:
move_to_end on hits, popitem(last=False) on overflow. Capacity exposed
as CACHE_CAPACITY_BLOCKS module constant (200000 by default).

M5: streamed responses decrement load counters in their generator's
finally block. If a client disconnects before consuming the body the
generator is never entered and the decrement is lost, causing
ongoing_tokens / num_requests / pending_prefill_tokens to drift
negative under load. Add a 60s background reconcile_loop that clamps
those counters at zero as a safety net. Started in lifespan, cancelled
on shutdown. Does not replace proper vLLM exact-state syncing.
2026-05-23 21:00:35 +08:00
0ed1ce200e metrics: replace round-based percentile with linear interpolation (B5)
The previous implementation used round((n-1) * pct), which under Python's
banker's rounding returned the upper-middle element on every even-length
array (e.g. p50 of [1,2,3,4] returned 3 instead of 2.5). All summary
JSONs were biased upward at p50 as a result. Match numpy.percentile's
default linear interpolation between the two adjacent sorted values.
2026-05-23 21:00:24 +08:00
0958823cdb REPORT: add §1.1 errata flagging superseded sections (S3)
Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU
cap) and the early elastic v3 warm-vs-fresh runs are no longer current,
and that the "--max-inflight-sessions 64+" next-step text refers to a
flag that was removed and must be restored per FIXES.md §B2 before those
numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.
2026-05-23 20:58:38 +08:00
ea5c3bfe6b compute_roofline: argparse --trace, fix stale default path (D4)
The hardcoded traces/sampled_1000req_seed42.jsonl no longer exists; switch
the default to the current sampled trace file w600_r0.0015_st30.jsonl and
let users override via --trace. Skip Part 4 cleanly when the file is
missing instead of relying on os.path.exists.
2026-05-23 20:58:09 +08:00
547611e022 scripts: archive obsolete one-off shell/python scripts to legacy/ (D2, D3)
D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and
--max-inflight-sessions to the replayer, but those flags were removed when
the project moved to trace-driven dispatch. The scripts cannot run as-is.

D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a
handful of single-experiment run_*.sh point at /home/admin/cpfs paths,
deleted output directories, or a sampled trace file that no longer exists.
Keep them in scripts/legacy/ for historical reference; the scripts that
remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit,
analyze_eviction, compare_results, compute_roofline, sample_trace,
analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh,
gpu_monitor.sh, bench.sh) cover the current workflow.

Adds scripts/legacy/README.md to document the archival policy.
2026-05-23 20:57:32 +08:00
c64b0b39c7 bench.sh: fix stale MODEL and TRACE defaults (B6)
The default MODEL pointed at /home/admin/cpfs/... which never existed on
the public dev machines (other launch_*.sh and TODO.md use $HOME/models),
and the default TRACE pointed at traces/sampled_1000req_seed42.jsonl
which was deleted when the sampler moved to window+thin output. Update
both to the values the rest of the repo already standardized on.
2026-05-23 20:56:40 +08:00
556f3011c6 proxy: remove dead state and broken fire-and-forget path (B1, D1)
B1: _inst_cumulative_tokens was written by pick_instance but never read
anywhere; delete the variable, global declaration, and per-call increment.
Load is already tracked via inst.ongoing_tokens.

D1: _send_prefill_async + the --fire-and-forget branch were unreachable
in practice (no launch/bench script enabled the flag) and broken even if
exercised: D-decode would fire before P registered the transfer_id,
guaranteeing a Mooncake 502. Collapse _handle_pd_sep to its synchronous
path and drop the CLI flag.
2026-05-23 20:56:11 +08:00
fc445df0ad Add FIXES.md with prioritized repo cleanup checklist
Captures the full review of bugs, fake/half-implemented features, dead
branches, and quality gaps found in cache_aware_proxy.py, replayer, and
the shell scripts. Each item has file:line, problem, fix, and verification
steps so any contributor can pick it up directly.
2026-05-23 20:35:56 +08:00
b2ede1da77 bench.sh: add trap for graceful cleanup on kill/interrupt
Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor
processes are cleaned up even when bench.sh is killed externally.
Also includes gpu_monitor in cleanup_gpu pattern matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 20:24:13 +08:00