Balanced session-sticky routing + agentic workload pattern analysis

Routing fix: new sessions are placed by cumulative token load (greedy bin
packing), with cache hit as the tiebreak. Turns 2+ keep session affinity.
The replayer now sends an X-Session-Id header for proper session tracking.
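
The placement policy described above can be sketched roughly as follows. This is a minimal illustration, not the router code from this commit: the `Worker` and `Router` names, the `cache_hits` mapping, and the choice to add every request's tokens to the worker's load are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    token_load: int = 0  # cumulative tokens routed to this worker (assumed metric)


@dataclass
class Router:
    workers: list[Worker]
    # session_id -> worker chosen on the session's first turn
    sticky: dict[str, Worker] = field(default_factory=dict)

    def route(
        self,
        session_id: str,
        n_tokens: int,
        cache_hits: dict[str, int] | None = None,
    ) -> Worker:
        """Return the worker for this request and update its load."""
        cache_hits = cache_hits or {}
        worker = self.sticky.get(session_id)
        if worker is None:
            # New session: greedy choice of the least-loaded worker.
            # Ties on load are broken in favour of the larger cache hit.
            worker = min(
                self.workers,
                key=lambda w: (w.token_load, -cache_hits.get(w.name, 0)),
            )
            self.sticky[session_id] = worker
        # Turn 2+ lands here with the worker already pinned.
        worker.token_load += n_tokens
        return worker


if __name__ == "__main__":
    router = Router([Worker("a"), Worker("b")])
    print(router.route("s1", 100, {"b": 10}).name)  # tie on load, "b" has the cache hit
    print(router.route("s2", 50).name)              # "a" is now the least loaded
    print(router.route("s1", 500).name)             # sticky: same worker as turn 1
```

Keeping the session map separate from the load counter means a pinned session never moves, even when its worker becomes the most heavily loaded; whether the real router rebalances in that case is not stated here.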

Agentic workload core patterns (GLM-5.1 trace):
  - 91% of reusable KV is intra-session (not cross-session)
  - Session-sticky routing is THE critical optimization
  - 36% warm requests (1.3k new tokens), 64% cold (17k+)
  - After cache: effective prefill/decode ratio drops from 61.5x to 28.7x
  - Cross-session sharing (system prompt) is only 4.8% of tokens
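
As a quick sanity check on the ratio figures above: if decode tokens are assumed to be unaffected by caching (an assumption, not something the trace analysis states), the drop from 61.5x to 28.7x implies that roughly 53% of prefill tokens are served from cache. The helper name below is hypothetical.

```python
def implied_prefill_savings(ratio_before: float, ratio_after: float) -> float:
    """Fraction of prefill tokens avoided, assuming decode volume is unchanged."""
    return 1.0 - ratio_after / ratio_before


if __name__ == "__main__":
    savings = implied_prefill_savings(61.5, 28.7)
    print(f"implied prefill savings: {savings:.1%}")
```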

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 01:50:27 +08:00
parent e45f00eb68
commit 32f09d32cd
3 changed files with 226 additions and 9 deletions


@@ -125,12 +125,15 @@ async def _dispatch_request(
    err = None
    token_times: list[float] = []
    req_headers = {"X-Session-Id": req.session_id}
    async with sem:
        try:
            async with client.stream(
                "POST",
                f"{endpoint}/v1/completions",
                json=payload,
                headers=req_headers,
                timeout=config.request_timeout_s,
            ) as resp:
                resp.raise_for_status()