Fix replay methodology: trace-driven dispatch, no artificial limits

The replayer was artificially limiting concurrency with
--max-inflight-sessions (a semaphore) and --time-scale (time compression).
This produced an unrealistically low load of 1 req/GPU, which masked
prefill-decode interference.

Replayer changes:
- Remove session_sem and time_scale entirely
- Dispatch each request exactly at its trace timestamp
- Sessions remain sequential (turn N+1 waits for turn N to complete)
- If a turn completes late, the next turn fires immediately

Sampler changes:
- Add --sample-ratio for GPU-proportional session sampling
- Keep --target-requests for backward compatibility
- No time compression (preserve the original arrival pattern)

bench.sh: remove the --time-scale and --max-inflight-sessions args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
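
The dispatch rules described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this commit: the names `replay_session`, `replay`, `Turn`, and the `send` callback are hypothetical, and it assumes an asyncio-based replayer. The semaphore stands in for the remaining `--concurrency-limit` safety cap (default 2000 in the diff below), not for the removed per-session limit.

```python
import asyncio
import time
from typing import Awaitable, Callable, Sequence

# One turn of a session: (seconds since trace start, request payload).
# Hypothetical shape; the real trace schema is not shown in this commit.
Turn = tuple[float, str]


async def replay_session(
    turns: Sequence[Turn],
    send: Callable[[str], Awaitable[None]],
    start: float,
    limit: asyncio.Semaphore,
    log: list[tuple[str, float]],
) -> None:
    """Replay one session: turns are sequential, each fired at its trace time."""
    for ts, payload in turns:
        # Sleep until the trace timestamp. If the previous turn finished late,
        # the remaining delay is <= 0 and the turn fires immediately.
        delay = start + ts - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        async with limit:  # safety cap only, not a load-shaping knob
            log.append((payload, time.monotonic() - start))
            await send(payload)  # turn N+1 waits for turn N to complete


async def replay(
    sessions: Sequence[Sequence[Turn]],
    send: Callable[[str], Awaitable[None]],
    concurrency_limit: int = 2000,
) -> list[tuple[str, float]]:
    """Run all sessions concurrently with no time compression.

    Returns (payload, dispatch offset in seconds) in dispatch order.
    """
    start = time.monotonic()
    limit = asyncio.Semaphore(concurrency_limit)
    log: list[tuple[str, float]] = []
    await asyncio.gather(
        *(replay_session(turns, send, start, limit, log) for turns in sessions)
    )
    return log
```

With this shape, sessions never wait on each other: only turns within the same session are serialized, so the offered load follows the trace's arrival pattern rather than a fixed in-flight budget.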
2026-05-23 12:43:41 +08:00
parent c8ba666517
commit 4089ffd63f
4 changed files with 84 additions and 103 deletions
--- a/replayer/main.py
+++ b/replayer/main.py
@@ -17,10 +17,8 @@ def main() -> None:
    p.add_argument("--endpoint", type=str, required=True,
                   help="vLLM server URL (e.g. http://localhost:8000)")
    p.add_argument("--model", type=str, default="default", help="Model name for API")
-    p.add_argument("--time-scale", type=float, default=1.0,
-                   help="Time compression (>1 = faster)")
-    p.add_argument("--max-inflight-sessions", type=int, default=32)
-    p.add_argument("--concurrency-limit", type=int, default=256)
+    p.add_argument("--concurrency-limit", type=int, default=2000,
+                   help="Max concurrent HTTP requests (safety limit)")
    p.add_argument("--request-timeout", type=float, default=600.0)
    p.add_argument("--request-limit", type=int, default=None,
                   help="Limit number of requests to replay")
@@ -37,8 +35,6 @@ def main() -> None:
        output_path=args.output,
        endpoint_url=args.endpoint.rstrip("/"),
        model_name=args.model,
-        time_scale=args.time_scale,
-        max_inflight_sessions=args.max_inflight_sessions,
        concurrency_limit=args.concurrency_limit,
        request_timeout_s=args.request_timeout,
        request_limit=args.request_limit,