Commit Graph

7 Commits

Author SHA1 Message Date
f42c715ec1 A4: open-loop session-causal SRR loadgen
New replayer/srr.py drives a Poisson session-arrival load against the
existing proxy, with strict per-session turn sequentiality, explicit
warmup/steady/drain windows, and per-arrival fresh session_id +
request_id so APC/session-affinity counters are not contaminated by
repeated draws from the trace pool. Writes window_summary.json with
attempted/completed/errored split by window so latency tails can be
read on the steady-state window only.

Required by Batch 4 SRR sweep; trace-timestamp dispatch in replay.py
cannot drive arrival rate independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:20 +08:00
5816aad731 A3: vLLM scheduler patch for step-level JSONL log
When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line
per scheduler step with t_unix, worker_id, prefill/decode token
counts, n_running/n_waiting, preempted ids, and per-request phase
labels. No-op when the env var is unset, so production engines are
not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to
each per-engine launch so step logs end up at engine_${i}.jsonl.

Required by Batch 2 (PD-colo interference index) and Batch 5
(same-worker overlap attribution); engine /metrics polling cannot
provide per-step granularity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:11 +08:00
fe556b5d98 A2: proxy worker-state snapshot and request-id passthrough
Honor incoming X-Request-Id so replayer metrics and proxy breakdown
share a join key. Each route decision now captures session_id, the
full per-worker candidate-score snapshot (ongoing/pending/num_requests
/cached_blocks plus both linear and lmetric scores), the chosen score,
and unix timestamps for first-token and done events. A separate
_worker_state_log records one row per decision and is exposed via
GET /worker_state; GET /worker_state/latest returns a live snapshot
without recording it.

Required by Batch 3 (session hot-spot proof) and Batch 5 (failure
attribution); existing breakdown.json had no per-worker state at
decision time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:01 +08:00
d57e338366 A1: replayer instrumentation for cross-process join
RequestMetrics gains absolute unix timestamps (t_dispatch_unix,
t_first_token_unix, t_finish_unix), the proxy_request_id, the chosen
endpoint URL, and the trace hash_ids. Replayer sends
X-Request-Id: <session_id>:<turn_id>:<chat_id>:<idx> so proxy
breakdown rows can be joined to metrics by exact key.

Required by Batch 0 (online sequentiality proof) and Batch 1 reuse
decomposition; existing metrics.jsonl couldn't establish either.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:18:52 +08:00
ac6534c3ff Cleanup: retire dead PUSH path + extract hybrid picker
- Delete unreachable best_needs_push block in _handle_combined and the
  four orphaned helpers (_handle_cached_prefill_offload,
  _handle_direct_read_offload, _query_bootstrap_hit,
  _get_bootstrap_client). Their only caller was the retired PUSH gate;
  see REPORT §3.9 errata for the rejected experiments (cc6e562, 4c583f2).

- Extract pick_instance_unified_hybrid as a pure function returning
  (chosen, idx, decision_dict). The decision dict carries the review #7
  breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio,
  avg_num_requests, fallback_score, tie_break_used).

- Add LMetric-fallback tie-breaker (primary score, then new_uncached,
  num_requests, round-robin) so new sessions don't all pin to inst 0
  when BS=0 across the board.

- Drop the lmetric-policy affinity write so --policy lmetric stays
  affinity-free per review #3.

- Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio /
  --decode-iteration-s as [DEPRECATED] in --help; flags remain accepted
  so scripts/bench.sh and legacy launchers don't break.

- Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already
  rejected this knob (within noise). Future sweeps should go via CLI.

Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering
affinity-hit, overload break, low-cache fallback, tie-break rotation,
lmetric purity, and breakdown field shape. 19/19 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:46:57 +08:00
c843f2e3db proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5)
- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
  MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
  CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
  __main__ now mutates SETTINGS so CLI overrides survive even when the
  module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
  branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
  fall back to colocated. cache_ratio is no longer a write-only field
  (B4).
- P candidate selection penalises instances already running offloaded
  HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
  same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
  proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
  penalty.
2026-05-23 21:11:17 +08:00
0701f84c00 tests: add minimal coverage for percentile + proxy routing (S1)
- tests/test_metrics.py asserts the new linear-interp _percentile against
  hand-computed expected values (single value, two-value interpolation,
  endpoints, numpy-equivalent linear default, on-integer rank).
- tests/test_proxy_pick.py exercises InstanceState LRU eviction and
  move-to-end on hit, plus session-affinity stickiness, the overload
  fallback, the active_p_offloads penalty, and lmetric scoring. The
  proxy is loaded by file path with stub fastapi/uvicorn/httpx modules
  so the suite runs without the FastAPI server deps installed.
- pyproject.toml gets a hatchling wheel target and a [tool.pytest]
  section so `uv run --extra dev pytest` works out of the box.
2026-05-23 21:07:14 +08:00