4 Commits

Author SHA1 Message Date
3fdcec9c0f Fix review P2s: lockfile, model path convention, trap robustness
- Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync
  --locked no longer fails
- B3 scripts: default MODEL to $HOME/models/... matching documented
  convention and other launch scripts (repo has no models/ directory)
- launch_elastic_p2p: append || true to each trap command so set -e
  doesn't abort cleanup when jobs -p is empty and EngineCore orphans
  remain
2026-05-26 16:05:43 +08:00
645b067dd4 Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps
Critical:
- cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never
  decremented) and never managed d_inst.num_requests; fix media_type
  from application/json to text/event-stream for SSE stream

High:
- b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded
  /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/..
- b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic
  generation from BASE_PORT and N_INSTANCES

Medium:
- analyze_breakdown: warn on stderr when records are skipped (was silent)
- deploy_vllm_patches: fail-fast on SSH/SCP errors instead of
  continuing with empty VENV_SITE
- pyproject.toml: declare fastapi and uvicorn as runtime dependencies
- launch_elastic_p2p: kill EngineCore and proxy in trap handler to
  prevent GPU memory leaks on exit
2026-05-26 15:54:55 +08:00
0e82612100 Fix B3 analysis bugs from subagent audit (median + percentile + sweep)
Three fixes from the B3 audit:

1) joined_analysis.hotspot_index used sorted[n//2] as median, which
   returns the ~60th percentile for n=8 (even-length). Systematically
   under-states the hotspot index. Recomputed values:
       lmetric   2.238 -> 2.253  (+0.7%)
       load_only 1.140 -> 1.294  (+13.5%)
       sticky    2.349 -> 2.728  (+16.1%)
       unified   3.350 -> 3.667  (+9.5%)
       capped    1.937 -> 2.020  (+4.3%)
   Qualitative ranking preserved; "capped only modestly reduces hotspot"
   story holds with ~10% drop instead of the previously reported 13%.
   Added test_hotspot_index_uses_true_median_for_even_n to lock in the
   fix.

2) b3_analyze.sh's pct() helper used floor-indexed percentile
   sorted[int(p*(n-1))], inconsistent with metrics._percentile and
   joined_analysis._percentile which both use linear interpolation.
   Now matches.

3) b3_sweep.sh's capped step called run_policy "capped", but the
   proxy's argparse has no "capped" choice, so the hot-sweep variant
   would have crashed on this step. The actual capped data was
   produced via b3_isolated_policy.sh with --policy lmetric. Replace
   the broken inline call with an explicit launch_proxy lmetric +
   inline replayer block so the sweep script matches the data path
   it documents.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 01:08:37 +08:00
c6b7c3471b B3: load_only + sticky policies, capped-trace builder, sweep driver
Three additions land together because B3's whole point is comparing
LMetric against meaningful controls.

- scripts/cache_aware_proxy.py: two new --policy values.
  - load_only: pure min(num_requests) routing, no cache or affinity.
    The B3 control that strips locality so the LMetric-vs-load gap is
    legible.
  - sticky: first turn goes to min-load, subsequent turns ALWAYS
    return to the same instance, even under saturation. The B3
    control that maxes out locality so the hot-spot cost is legible.
- scripts/build_capped_trace.py: per-session turn cap (default 8).
  Generates the session-mass-equalized variant the TODO calls for so
  that hot-spot index can be re-measured with the heavy-tail removed.
- scripts/b3_sweep.sh: orchestrates the 5-cell sweep.
  - GPU_INDICES makes it easy to skip a dead GPU.
  - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so
    usage.prompt_tokens_details.cached_tokens is populated. vLLM
    0.18.1 omits the field by default and breaks the reuse-decomp
    pipeline; the smoke run surfaced this.
  - Trap kills EngineCore by name in addition to "vllm serve" — the
    parent dies first but the child holds GPU memory. Was the root
    cause of the 89 GB ghost on GPU 0 earlier today.
  - Proxy readiness is a polling loop, not a fixed sleep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 17:54:24 +08:00