Fix replay methodology: trace-driven dispatch, no artificial limits

The replayer artificially limited concurrency with --max-inflight-sessions
(a semaphore) and --time-scale (time compression), producing an unrealistically
low load of 1 req/GPU that masked prefill-decode interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Each request dispatched at its trace timestamp exactly
- Sessions still sequential (turn N+1 waits for turn N completion)
- If a turn completes after the next turn's trace timestamp, the next turn fires immediately
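
The dispatch rules above can be sketched roughly as follows. This is a minimal illustration, not the replayer's actual code: `TraceRequest`, `send_request`, `replay_session`, and `replay` are hypothetical names, and the HTTP call is replaced by a short sleep.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TraceRequest:
    session_id: str
    turn: int
    timestamp_s: float  # arrival time relative to trace start (assumed field)


async def send_request(req: TraceRequest) -> None:
    """Hypothetical stand-in for the real HTTP call; it only sleeps briefly."""
    await asyncio.sleep(0.01)


async def replay_session(turns, t0, dispatched):
    loop = asyncio.get_running_loop()
    for req in sorted(turns, key=lambda r: r.turn):
        # Wait until the trace timestamp. If the previous turn finished late,
        # the delay is not positive and the request fires immediately.
        delay = (t0 + req.timestamp_s) - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        dispatched.append((req.session_id, req.turn, loop.time() - t0))
        # Awaiting here keeps the session sequential: turn N+1 waits for turn N.
        await send_request(req)


async def replay(trace):
    sessions = defaultdict(list)
    for req in trace:
        sessions[req.session_id].append(req)
    dispatched = []
    t0 = asyncio.get_running_loop().time()
    # No session semaphore and no time scaling: each session is its own task.
    await asyncio.gather(
        *(replay_session(turns, t0, dispatched) for turns in sessions.values())
    )
    return dispatched
```

Calling `asyncio.run(replay(trace))` returns the `(session_id, turn, dispatch_time)` tuples in dispatch order, which makes it easy to compare actual dispatch times against the trace.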

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backwards compat
- No time compression (preserve original arrival pattern)
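
One way session sampling by ratio could work is sketched below. The function names, the dict-based trace records, and the idea of deriving the ratio from a GPU count (for example, benchmark GPUs divided by the GPUs behind the original trace) are assumptions for illustration; the commit message only states that --sample-ratio samples sessions in proportion to GPUs.

```python
import random


def sample_sessions(session_ids, sample_ratio, seed=0):
    """Keep roughly sample_ratio of the distinct sessions, chosen uniformly."""
    if not 0.0 < sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be in (0, 1]")
    unique = sorted(set(session_ids))  # sorted so the seed gives a stable result
    k = max(1, round(len(unique) * sample_ratio))
    return set(random.Random(seed).sample(unique, k))


def filter_trace(trace, kept):
    # Whole sessions are kept or dropped; surviving requests keep their
    # original timestamps, so the arrival pattern is not compressed.
    return [r for r in trace if r["session_id"] in kept]
```

Sampling whole sessions rather than individual requests keeps multi-turn conversations intact, which matters because each turn waits on the previous one.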

bench.sh: remove --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:43:41 +08:00
parent c8ba666517
commit 4089ffd63f
4 changed files with 84 additions and 103 deletions


@@ -17,10 +17,8 @@ def main() -> None:
 p.add_argument("--endpoint", type=str, required=True,
 help="vLLM server URL (e.g. http://localhost:8000)")
 p.add_argument("--model", type=str, default="default", help="Model name for API")
-p.add_argument("--time-scale", type=float, default=1.0,
-help="Time compression (>1 = faster)")
-p.add_argument("--max-inflight-sessions", type=int, default=32)
-p.add_argument("--concurrency-limit", type=int, default=256)
+p.add_argument("--concurrency-limit", type=int, default=2000,
+help="Max concurrent HTTP requests (safety limit)")
 p.add_argument("--request-timeout", type=float, default=600.0)
 p.add_argument("--request-limit", type=int, default=None,
 help="Limit number of requests to replay")
@@ -37,8 +35,6 @@ def main() -> None:
 output_path=args.output,
 endpoint_url=args.endpoint.rstrip("/"),
 model_name=args.model,
-time_scale=args.time_scale,
-max_inflight_sessions=args.max_inflight_sessions,
 concurrency_limit=args.concurrency_limit,
 request_timeout_s=args.request_timeout,
 request_limit=args.request_limit,