Adds `scripts/add_ttp_streaming.py`: one streaming pass over the 522 GB raw
glm5.1 trace to build {chat_id: (ready_ms, end_ms)} and join the real
inter-turn gap onto the COMPLETE formatted trace (no early-exit, low memory).
time_to_parent_chat = (this.request_ready_time_ms - parent.request_end_time_ms)/1000
= tool-exec + agent think-time; turn-1 -> null, negatives clamped to 0.
Ships the two ttp-annotated sampled traces (same anonymized data + one
timestamp-derived field; regenerated via sample_trace.py --seed 42 so they are
row-for-row identical to the non-ttp variants on all 9 shared fields):
traces/w600_r0.0015_st30_ttp.jsonl (1214 reqs)
traces/w600_r0.0015_st30_first600s_ttp.jsonl (807 reqs)
They are needed to replay with --dispatch-mode thinktime without the
non-redistributable raw trace, so they are added to the .gitignore allowlist.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.
analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Whitelist the sampled replay trace (1214 reqs / 274 sessions / ~600 s) past
the traces/ ignore so the repo is runnable without dash0 access. Metadata
only (token counts, opaque KV-block hashes, timing, session structure) — no
prompts/outputs/PII. traces/README documents schema, provenance (sampled
from the internal GLM-5.1 production trace via scripts/sample_trace.py), and
the regeneration command.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:
vllm/v1/core/sched/scheduler.py:
Replace fatal assert with graceful skip when KV transfer callback
arrives for an already-aborted request during PD disaggregated serving.
Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).
Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation
Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>