Commit Graph

4 Commits

Author SHA1 Message Date
075f5bbc22 trace: time_to_parent_chat annotation + thinktime trace variants
Adds `scripts/add_ttp_streaming.py`: one streaming pass over the 522 GB raw
glm5.1 trace to build {chat_id: (ready_ms, end_ms)} and join the real
inter-turn gap onto the COMPLETE formatted trace (no early-exit, low memory).
  time_to_parent_chat = (this.request_ready_time_ms - parent.request_end_time_ms)/1000
  = tool-exec + agent think-time; turn-1 -> null, negatives clamped to 0.

Ships the two ttp-annotated sampled traces (same anonymized data + one
timestamp-derived field; regenerated via sample_trace.py --seed 42 so they are
row-for-row identical to the non-ttp variants on all 9 shared fields):
  traces/w600_r0.0015_st30_ttp.jsonl          (1214 reqs)
  traces/w600_r0.0015_st30_first600s_ttp.jsonl (807 reqs)
They are needed to replay with --dispatch-mode thinktime without the
non-redistributable raw trace, so they are added to the .gitignore allowlist.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 20:58:49 +08:00
71b0747b3b 600s-truncated trace + LPWL 5-policy results
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.

analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:08:35 +08:00
8a876e90d1 traces/README: clarify w600 is the session-start window, not span
The trace actually spans ~2912 s (~48.5 min): all 274 sessions START within
the 600 s --window-seconds window, but their later multi-turn requests (34%
of rows, inter-turn gaps up to ~700 s) extend well past t=600 s. Remove the
misleading "~600 s span".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 12:04:14 +08:00
08c3cf48aa Ship anonymized benchmark trace w600_r0.0015_st30 + provenance
Whitelist the sampled replay trace (1214 reqs / 274 sessions / ~600 s) past
the traces/ ignore so the repo is runnable without dash0 access. Metadata
only (token counts, opaque KV-block hashes, timing, session structure) — no
prompts/outputs/PII. traces/README documents schema, provenance (sampled
from the internal GLM-5.1 production trace via scripts/sample_trace.py), and
the regeneration command.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:54:43 +08:00