Fix replay methodology: trace-driven dispatch, no artificial limits
The replayer was artificially limiting concurrency with
--max-inflight-sessions (semaphore) and --time-scale (time compression),
producing unrealistically low 1 req/GPU load that masked prefill-decode
interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If turn completes late, next turn fires immediately

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)

bench.sh: remove --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
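
The dispatch rule described in the replayer changes can be illustrated with a minimal asyncio sketch. It is not the code from this commit: `TraceRequest`, `replay_session`, `replay`, and the `send` callback are hypothetical names chosen for illustration. It assumes each trace entry carries a session ID, a turn index, and an arrival time in seconds relative to the start of the trace. The semaphore stands in for the remaining --concurrency-limit safety cap.

```python
import asyncio
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Awaitable, Callable, Iterable, List


@dataclass
class TraceRequest:
    # Hypothetical record layout; the real trace schema is not shown in the commit.
    session_id: str
    turn: int
    timestamp_s: float  # arrival time relative to the start of the trace


SendFn = Callable[[TraceRequest], Awaitable[None]]


async def replay_session(turns: List[TraceRequest], send: SendFn,
                         t0: float, limiter: asyncio.Semaphore) -> None:
    """Replay one session: turns run strictly in order."""
    for req in sorted(turns, key=lambda r: r.turn):
        # Wait until the request's own trace timestamp (no time scaling).
        delay = t0 + req.timestamp_s - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        # If the previous turn finished late, delay <= 0 and we fire immediately.
        async with limiter:  # safety cap only, not a load-shaping knob
            await send(req)  # turn N+1 starts only after turn N completes


async def replay(trace: Iterable[TraceRequest], send: SendFn,
                 concurrency_limit: int = 2000) -> None:
    """Replay all sessions concurrently, each on its own timeline."""
    sessions = defaultdict(list)
    for req in trace:
        sessions[req.session_id].append(req)
    limiter = asyncio.Semaphore(concurrency_limit)
    t0 = time.monotonic()
    await asyncio.gather(
        *(replay_session(turns, send, t0, limiter) for turns in sessions.values())
    )
```

In this sketch no semaphore bounds the number of sessions, so the offered load follows the trace's arrival pattern. Only the per-session ordering and the high safety cap constrain dispatch.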
@@ -17,10 +17,8 @@ def main() -> None:
     p.add_argument("--endpoint", type=str, required=True,
                    help="vLLM server URL (e.g. http://localhost:8000)")
     p.add_argument("--model", type=str, default="default", help="Model name for API")
-    p.add_argument("--time-scale", type=float, default=1.0,
-                   help="Time compression (>1 = faster)")
-    p.add_argument("--max-inflight-sessions", type=int, default=32)
-    p.add_argument("--concurrency-limit", type=int, default=256)
+    p.add_argument("--concurrency-limit", type=int, default=2000,
+                   help="Max concurrent HTTP requests (safety limit)")
     p.add_argument("--request-timeout", type=float, default=600.0)
     p.add_argument("--request-limit", type=int, default=None,
                    help="Limit number of requests to replay")
@@ -37,8 +35,6 @@ def main() -> None:
         output_path=args.output,
         endpoint_url=args.endpoint.rstrip("/"),
         model_name=args.model,
-        time_scale=args.time_scale,
-        max_inflight_sessions=args.max_inflight_sessions,
         concurrency_limit=args.concurrency_limit,
         request_timeout_s=args.request_timeout,
         request_limit=args.request_limit,