Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular
traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the
clean stack (e13391e gated off).
Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256
Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70%
Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256
Findings:
* APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination
fix validated.
* PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%.
* Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio
catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s).
* Concurrency: PD wins at N<=32 (1.23-1.29x) but tips to colo at N=64 -- pd2/pd4
crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly.
Infrastructure:
* replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix
(env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S,
REPLAY_NO_REALIZED_PREFIX).
* mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json +
instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest.
* fig_agg.py: per-arm GPU role split + producer-side APC; --json mode.
* gpu_util_report.py: companion per-GPU util report from gpu_util.csv.
* partial_summary.py: stats from in-flight replay_metrics.jsonl
(works before metrics.summary.json exists).
Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows).
Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.
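
The env-defaulted replayer flags listed under Infrastructure can be sketched roughly as below. This is a minimal illustration, not the shipped parser: the helpers `env_flag`, `env_number`, and `build_parser` are hypothetical names, and the falsey-string handling is an assumption added here. The entry point in this commit uses `bool(os.environ.get(...))` for REPLAY_NO_REALIZED_PREFIX, so any non-empty value (including "0") turns the flag on; the sketch shows a stricter variant in case that is not the intent.

```python
import argparse
import os

# Assumption: strings treated as "off". The committed code has no such list.
_FALSEY = {"", "0", "false", "no", "off"}


def env_flag(name, environ=None):
    """Return True only if the variable is set to a non-falsey string."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return False
    return raw.strip().lower() not in _FALSEY


def env_number(name, cast, environ=None):
    """Return cast(value) if the variable is set and non-empty, else None."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    return cast(raw) if raw else None


def build_parser(environ=None):
    """Build a parser whose defaults come from the environment.

    An explicit command-line value still overrides the env default.
    """
    p = argparse.ArgumentParser(description="env-defaulted flag sketch")
    p.add_argument("--max-inflight-sessions", type=int,
                   default=env_number("REPLAY_MAX_INFLIGHT", int, environ))
    p.add_argument("--inter-turn-think", type=float,
                   default=env_number("REPLAY_INTER_TURN_THINK_S", float, environ))
    p.add_argument("--no-realized-prefix", action="store_true",
                   default=env_flag("REPLAY_NO_REALIZED_PREFIX", environ))
    return p
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment; omit it to read `os.environ` as the real CLI does.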
84 lines
3.9 KiB
Python
"""CLI entry point: python -m replayer replay ..."""

from __future__ import annotations

import argparse
import asyncio
import logging
import os
from pathlib import Path

from .replay import ReplayConfig, replay_trace


def main() -> None:
    p = argparse.ArgumentParser(description="Trace replayer for vLLM benchmarking")
    p.add_argument("--trace", type=Path, required=True, help="Sampled trace JSONL")
    p.add_argument("--output", type=Path, required=True, help="Output metrics JSONL")
    p.add_argument("--endpoint", type=str, required=True,
                   help="vLLM server URL (e.g. http://localhost:8000)")
    p.add_argument("--model", type=str, default="default", help="Model name for API")
    p.add_argument("--concurrency-limit", type=int, default=2000,
                   help="Max concurrent HTTP requests (safety limit)")
    _env_inflight = os.environ.get("REPLAY_MAX_INFLIGHT")
    p.add_argument("--max-inflight-sessions", type=int,
                   default=int(_env_inflight) if _env_inflight else None,
                   help="Cap on concurrent sessions (None = unlimited; "
                        "trace-driven dispatch otherwise). Env: REPLAY_MAX_INFLIGHT")
    _env_think = os.environ.get("REPLAY_INTER_TURN_THINK_S")
    p.add_argument("--inter-turn-think", type=float,
                   default=float(_env_think) if _env_think else None,
                   help="Closed-loop think-time (s) after each turn completes; "
                        "ignore absolute trace schedule. Env: REPLAY_INTER_TURN_THINK_S")
    p.add_argument("--no-realized-prefix",
                   action="store_true",
                   default=bool(os.environ.get("REPLAY_NO_REALIZED_PREFIX")),
                   help="Controlled-reuse mode: prompt = hash-built tokens only "
                        "(reuse set by hash_ids). Env: REPLAY_NO_REALIZED_PREFIX")
    p.add_argument("--dispatch-mode", choices=["tracets", "thinktime"],
                   default=os.environ.get("REPLAY_DISPATCH_MODE", "tracets"),
                   help="tracets (Mode 1): absolute trace ts = max(prev_finished, ts). "
                        "thinktime (Mode 2): turn-k at prev_finished + "
                        "time_to_parent_chat. Env: REPLAY_DISPATCH_MODE")
    p.add_argument("--request-timeout", type=float, default=600.0)
    _env_maxdur = os.environ.get("REPLAY_MAX_DURATION")
    p.add_argument("--max-duration", type=float,
                   default=float(_env_maxdur) if _env_maxdur else None,
                   help="Overall wall-clock deadline (s): cancel in-flight + write "
                        "summary (un-run turns counted as failures) to bound a "
                        "collapsed config's drain. Env: REPLAY_MAX_DURATION")
    p.add_argument("--request-limit", type=int, default=None,
                   help="Limit number of requests to replay")
    p.add_argument("-v", "--verbose", action="store_true")
    args = p.parse_args()

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )

    config = ReplayConfig(
        trace_path=args.trace,
        output_path=args.output,
        endpoint_url=args.endpoint.rstrip("/"),
        model_name=args.model,
        concurrency_limit=args.concurrency_limit,
        request_timeout_s=args.request_timeout,
        request_limit=args.request_limit,
        max_inflight_sessions=args.max_inflight_sessions,
        inter_turn_think_s=args.inter_turn_think,
        no_realized_prefix=args.no_realized_prefix,
        dispatch_mode=args.dispatch_mode,
        max_duration_s=args.max_duration,
    )

    results = asyncio.run(replay_trace(config))
    succeeded = sum(1 for r in results if r.error is None)
    print(f"\nDone: {succeeded}/{len(results)} requests succeeded")
    print(f"Metrics: {args.output}")
    print(f"Summary: {args.output.with_suffix('.summary.json')}")


if __name__ == "__main__":
    main()
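
The two `--dispatch-mode` values are described only in the help text above; the actual scheduling lives in `replay.py`, which is not shown here. As a rough reading of that help text, the dispatch time of a follow-up turn could be computed as in the sketch below. The function name `dispatch_time` and its argument names are hypothetical, and first-turn handling is not covered.

```python
def dispatch_time(mode: str,
                  prev_finished: float,
                  trace_ts: float,
                  time_to_parent_chat: float) -> float:
    """Return when a follow-up turn is dispatched, per the --dispatch-mode help.

    All times are in seconds on the same clock. This mirrors the help text
    only; it is an assumption about, not a copy of, the replayer's logic.
    """
    if mode == "tracets":
        # Mode 1: follow the absolute trace timestamp, but never start a
        # turn before its predecessor has finished.
        return max(prev_finished, trace_ts)
    if mode == "thinktime":
        # Mode 2: ignore the absolute timestamp; wait the recorded gap
        # after the predecessor finishes.
        return prev_finished + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode!r}")
```

Under this reading, `tracets` keeps the trace's arrival pattern while the server keeps up and degrades to back-to-back dispatch when it falls behind, whereas `thinktime` always preserves the inter-turn gap regardless of server latency.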