Balanced session-sticky routing + agentic workload pattern analysis
Routing fix: new sessions placed by cumulative token load (greedy bin
packing) with cache-hit tiebreak. Session affinity for turn 2+.
Replayer now sends X-Session-Id header for proper session tracking.

Agentic workload core patterns (GLM-5.1 trace):
- 91% of reusable KV is intra-session (not cross-session)
- Session-sticky routing is THE critical optimization
- 36% warm requests (1.3k new tokens), 64% cold (17k+)
- After cache: effective prefill/decode ratio drops from 61.5x to 28.7x
- Cross-session sharing (system prompt) is only 4.8% of tokens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
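The routing policy described in the message can be sketched roughly as follows. This is a minimal illustration, not the router changed by this commit: `Worker`, `StickyRouter`, `prefix_key`, and the bookkeeping fields are hypothetical names, and cache state is approximated by a set of prefix keys per worker.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical backend with the bookkeeping the policy needs."""
    name: str
    token_load: int = 0  # cumulative tokens routed to this worker
    cached_prefixes: set = field(default_factory=set)


class StickyRouter:
    """Greedy least-loaded placement for new sessions, affinity afterwards."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.session_to_worker = {}

    def route(self, session_id, n_tokens, prefix_key=""):
        worker = self.session_to_worker.get(session_id)
        if worker is None:
            # Turn 1: pick the lowest cumulative token load; on equal load,
            # prefer a worker that already holds the prefix (False sorts first).
            worker = min(
                self.workers,
                key=lambda w: (w.token_load, prefix_key not in w.cached_prefixes),
            )
            self.session_to_worker[session_id] = worker
        # Turn 2+: the session stays on the worker that holds its KV cache.
        worker.token_load += n_tokens
        if prefix_key:
            worker.cached_prefixes.add(prefix_key)
        return worker
```

For example, two cold sessions of about 17k tokens each land on different workers, and a later warm turn of 1.3k tokens returns to the worker that served the first turn of its session.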
@@ -125,12 +125,15 @@ async def _dispatch_request(
    err = None
    token_times: list[float] = []

    req_headers = {"X-Session-Id": req.session_id}

    async with sem:
        try:
            async with client.stream(
                "POST",
                f"{endpoint}/v1/completions",
                json=payload,
                headers=req_headers,
                timeout=config.request_timeout_s,
            ) as resp:
                resp.raise_for_status()
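The hunk does not show the type of `req.session_id`. If it can be unset for some trace entries (an assumption, not something the diff confirms), a small guard keeps the replayer from sending an empty or non-string header value. The helper name `session_headers` is hypothetical.

```python
def session_headers(session_id):
    """Build request headers, adding X-Session-Id only when an ID is present."""
    headers = {}
    if session_id:
        # Header values are sent as strings, so coerce numeric IDs.
        headers["X-Session-Id"] = str(session_id)
    return headers
```

With this sketch, `req_headers = session_headers(req.session_id)` would replace the dictionary literal above. Requests without a session ID then reach the router with no affinity hint, and it can treat them as new sessions.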