Files

Gahow Wang 71b0747b3b 600s-truncated trace + LPWL 5-policy results

traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.

analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is best overall with zero
knobs: TTFT p90 7983 ms vs 11562 ms for tuned A+B (-31%), E2E p90 -16%, best
request balance; APC 0.648 (emergent affinity, far above LMetric's 0.507). Its
only loss is E2E p99, from heavy-class decode concentration. Demonstrates
anti-overfit: A+B was tuned on the full w600 trace yet is beaten by the
knob-free policy in this regime. Includes the run_5policy_600s.sh repro driver.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-29 16:08:35 +08:00

| file | last commit | date |
| --- | --- | --- |
| README.md | traces/README: clarify w600 is the session-start window, not span | 2026-05-29 12:04:14 +08:00 |
| w600_r0.0015_st30_first600s.jsonl | 600s-truncated trace + LPWL 5-policy results | 2026-05-29 16:08:35 +08:00 |
| w600_r0.0015_st30.jsonl | Ship anonymized benchmark trace w600_r0.0015_st30 + provenance | 2026-05-29 11:54:43 +08:00 |

README.md

Benchmark trace

`w600_r0.0015_st30.jsonl`

The primary replay trace for the routing / connector experiments (1214 requests, 274 sessions). One JSON object per request:

{"chat_id": 1237198, "parent_chat_id": -1, "timestamp": 0.0,
 "input_length": 8228, "output_length": 21, "type": "coder", "turn": 1,
 "hash_ids": [12292995, ...], "session_id": "1237198"}

| field | meaning |
| --- | --- |
| `input_length` / `output_length` | token counts only |
| `hash_ids` | opaque integer KV-block hashes; shared ids ⇒ shared prefix (drives prefix-cache reuse in replay) |
| `timestamp` | arrival offset (s) from trace start |
| `turn` / `parent_chat_id` / `session_id` | multi-turn session structure |
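To make the "shared ids ⇒ shared prefix" rule concrete, here is a minimal Python sketch that loads the JSONL records and estimates how much prefix reuse an unbounded cache could see. The helper names (`load_trace`, `summarize`, `prefix_reuse_ceiling`) are hypothetical and not part of this repo, and the block-weighted leading-run metric is an assumption; it is not necessarily how the APC ceiling quoted elsewhere was computed.

```python
import json
from typing import Iterable


def load_trace(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines into request dicts, ordered by arrival time."""
    reqs = [json.loads(line) for line in lines if line.strip()]
    return sorted(reqs, key=lambda r: r["timestamp"])


def summarize(reqs: list[dict]) -> dict:
    """Count requests and distinct sessions."""
    return {
        "requests": len(reqs),
        "sessions": len({r["session_id"] for r in reqs}),
    }


def prefix_reuse_ceiling(reqs: list[dict]) -> float:
    """Fraction of blocks served from cache, assuming an unbounded cache.

    Assumption: a request only reuses the leading run of blocks that
    were already seen, since prefix caching matches from the start.
    """
    seen: set[int] = set()
    hit = total = 0
    for r in reqs:
        ids = r["hash_ids"]
        for h in ids:
            if h not in seen:
                break  # the shared prefix ends at the first unseen block
            hit += 1
        total += len(ids)
        seen.update(ids)
    return hit / total if total else 0.0
```

Any iterable of lines works as input, so an open file handle on `traces/w600_r0.0015_st30.jsonl` can be passed to `load_trace` directly.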

No cleartext. There are no prompts, no model outputs, and no PII; the trace carries only token counts, opaque block hashes, timing, and session structure. The replayer synthesizes dummy token sequences consistent with `hash_ids` so prefix-cache hit rates match the original workload.
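One way such synthesis could work is sketched below; the shipped `replayer/` may do it differently. `BLOCK_SIZE`, `VOCAB_SIZE`, and both function names are hypothetical placeholders. The idea is only that each hash id deterministically expands to the same block of tokens, so two requests that share leading hash ids also share a token prefix.

```python
import random

BLOCK_SIZE = 16      # hypothetical: tokens per KV block
VOCAB_SIZE = 32000   # hypothetical: size of the dummy vocabulary


def tokens_for_block(hash_id: int, n: int = BLOCK_SIZE) -> list[int]:
    """Expand one opaque block hash into n deterministic dummy token ids."""
    rng = random.Random(hash_id)  # same hash id gives the same tokens
    return [rng.randrange(VOCAB_SIZE) for _ in range(n)]


def synthesize_prompt(hash_ids: list[int], input_length: int) -> list[int]:
    """Concatenate per-block tokens and trim to the recorded input length.

    Assumption: len(hash_ids) * BLOCK_SIZE >= input_length; if it is not,
    the result is simply shorter than input_length.
    """
    tokens: list[int] = []
    for h in hash_ids:
        tokens.extend(tokens_for_block(h))
    return tokens[:input_length]
```

Because the tokens depend only on the hash id, no original text is needed and none can be recovered from the output.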

Provenance

Sampled from the internal Alibaba GLM-5.1-formatted production trace (`~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` on dash0, ~2.1 M requests, 2 h). The source is not redistributable; only this anonymized sample is shipped. The filename encodes the sampling parameters: `w` = window-seconds (600), `r` = sample-ratio (0.0015), `st` = max-single-turn-ratio as a percentage (30 ⇒ 0.30).

`w600` is the 600 s window of session start times, not the trace duration. The sampler keeps every session whose first request falls in a 600 s window, then includes all of that session's turns. Because agentic sessions are long-lived and multi-turn (inter-turn gaps up to ~700 s), the actual trace spans ~2912 s (~48.5 min) even though all 274 sessions start within the first 598 s; 34% of requests are later turns that arrive after t=600 s.
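The window rule can be sketched as follows. This is an illustration only, not the code in `scripts/sample_trace.py`: it omits the sample ratio, the single-turn cap, and seeding, and all function names are hypothetical. Rebasing timestamps to the first kept request is also an assumption.

```python
def sessions_starting_in_window(reqs, window_start, window_seconds):
    """Return ids of sessions whose first request falls inside the window."""
    first_seen = {}
    for r in reqs:
        sid, t = r["session_id"], r["timestamp"]
        first_seen[sid] = min(first_seen.get(sid, t), t)
    end = window_start + window_seconds
    return {sid for sid, t0 in first_seen.items() if window_start <= t0 < end}


def select_window(reqs, window_start, window_seconds):
    """Keep every turn of the selected sessions, including late turns."""
    keep = sessions_starting_in_window(reqs, window_start, window_seconds)
    picked = sorted(
        (r for r in reqs if r["session_id"] in keep),
        key=lambda r: r["timestamp"],
    )
    if not picked:
        return []
    t0 = picked[0]["timestamp"]  # assumption: rebase to the first kept request
    return [{**r, "timestamp": r["timestamp"] - t0} for r in picked]


def fraction_after(reqs, cutoff):
    """Share of requests arriving at or after the cutoff (seconds)."""
    if not reqs:
        return 0.0
    return sum(r["timestamp"] >= cutoff for r in reqs) / len(reqs)
```

Calling `fraction_after(trace, 600)` on the loaded trace gives the share of requests beyond the start window, which is how a window of 600 s can coexist with a much longer trace span.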

Regenerate (requires the dash0 source):

python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42

Replay

python -m replayer --trace traces/w600_r0.0015_st30.jsonl ...

See replayer/ and scripts/cache_aware_proxy.py.