986f351365462be59eb2c4204488102ae28d9825
Fix A from docs/E3_FINDINGS_ZH.md §3.

The existing streaming-session correction at the top of ScheduleBatch.prepare_for_extend zeroes req.extend_input_len when len(fill_ids) <= len(prefix_indices), but the per-req invariant later in the same function (assert seq_len - pre_len == req.extend_input_len) is computed from raw fill_ids/prefix_indices lengths and has no path to be satisfied when fill_len < prefix_len. The result is an AssertionError that crashes the entire decode worker.

Add a pre-filter pass at the start of prepare_for_extend that detects this state, marks the affected reqs with FINISH_ABORT (so the client gets an error response instead of the worker hanging), and drops them from the batch before the correction loop runs. If all reqs are filtered, populate empty tensor/list state and return early so downstream model.forward sees a valid no-op batch.

This treats fill_ids < prefix_indices as upstream state inconsistency that should be reported to the client rather than silently miscomputed. The narrower invariant after this filter: prepare_for_extend's body only ever sees streaming-session reqs where actual_extend_len > 0, which is the regime the existing correction logic was designed for.

Reproduced by E3 first run on 2026-05-12 02:51:21 UTC (rid 6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648, prefix_len=43459). It was masked in E1/E2 because the cap-out failure cascade prevented sessions from accumulating deep enough committed prefix to trigger the inconsistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
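The pre-filter described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual SGLang patch: the Req dataclass, the finished_reason field, and the prefilter_inconsistent helper are hypothetical stand-ins, and the abort marker is a plain string rather than SGLang's FINISH_ABORT object. The message does not say whether the equal-length case is also filtered, so the sketch only drops reqs where fill_ids is strictly shorter than prefix_indices.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request; not SGLang's Req class."""
    rid: str
    fill_ids: List[int]
    prefix_indices: List[int]
    finished_reason: Optional[str] = None  # stand-in for a FINISH_ABORT marker


def prefilter_inconsistent(reqs: List[Req]) -> Tuple[List[Req], List[Req]]:
    """Split reqs into (kept, aborted) before any extend-length correction runs.

    A req is aborted when its fill_ids are strictly shorter than its
    prefix_indices, the state the commit message describes as unsatisfiable.
    """
    kept: List[Req] = []
    aborted: List[Req] = []
    for req in reqs:
        if len(req.fill_ids) < len(req.prefix_indices):
            # Report the inconsistency to the client instead of asserting later.
            req.finished_reason = "abort: fill_ids shorter than prefix_indices"
            aborted.append(req)
        else:
            kept.append(req)
    return kept, aborted


reqs = [
    Req("ok", fill_ids=list(range(10)), prefix_indices=list(range(4))),
    Req("bad", fill_ids=list(range(3)), prefix_indices=list(range(8))),
]
kept, aborted = prefilter_inconsistent(reqs)
print([r.rid for r in kept], [r.rid for r in aborted])
```

When kept comes back empty, the caller would take the early-return path the message mentions and build an empty batch state instead of entering the correction loop.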
Agentic PD Hybrid
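This project is a minimal experimental framework built on SGLang xPyD to answer one question:
can session-aware / KV-cache-aware P/D routing for agentic coding workloads reduce end-to-end latency?
A more complete but still concise description is in docs/PROJECT_OVERVIEW.md.
What it currently does
- Launches a single-node SGLang P/D stack.
- Replays the Ali coding agent trace and records request-level metrics.
- Supports the default, sticky, and kv-aware routing policies.
- Supports comparing the pd-disaggregation, kvcache-centric, and pd-colo mechanisms.
- Supports micro-benchmark traces with small appends and multi-turn sessions.
- Maintains a local patch based on SGLang v0.5.10, kept in third_party/sglang.
Environment
Use uv throughout: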
uv sync
Default model path:
~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
The main test environment so far is a single node with 8 GPUs, subject to the constraint prefill + decode <= 8.
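As a small illustration of the constraint above, the check below rejects worker splits that exceed the GPU budget. It assumes that "prefill" and "decode" refer to worker counts and that each worker occupies one GPU; the function name check_worker_split is hypothetical and not part of the project's CLI.

```python
def check_worker_split(prefill_workers: int, decode_workers: int, total_gpus: int = 8) -> int:
    """Return the number of GPUs left over, or raise if the split does not fit.

    Assumption: one GPU per worker, so the budget is prefill + decode <= total_gpus.
    """
    if prefill_workers < 0 or decode_workers < 0:
        raise ValueError("worker counts must be non-negative")
    used = prefill_workers + decode_workers
    if used > total_gpus:
        raise ValueError(
            f"prefill + decode = {used} exceeds the {total_gpus}-GPU budget"
        )
    return total_gpus - used


print(check_worker_split(1, 1))  # the 1P1D split used in the live benchmark command
```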
Common commands
Generate a small-append trace:
uv run agentic-pd-hybrid make-small-append-trace \
--output outputs/smoke-hotcap-30k-1k-256.jsonl \
--session-count 4 \
--turns-per-session 3 \
--initial-input-length 30000 \
--append-input-length 1000 \
--output-length 256
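To get a feel for what these parameters imply, the sketch below computes the input length of each turn in one session. It is an assumption-laden illustration, not the generator's actual logic: the helper turn_input_lengths is hypothetical, and whether a turn's input carries the previous turn's output tokens is not stated in this README, so that choice is exposed as a flag.

```python
from typing import List


def turn_input_lengths(
    initial: int, append: int, output: int, turns: int, carry_output: bool = True
) -> List[int]:
    """Input length per turn for one session.

    Assumption: turn 0 has `initial` tokens, and each later turn appends
    `append` tokens to the previous input (plus the previous `output`
    tokens when carry_output is True).
    """
    if turns < 1:
        return []
    lengths = [initial]
    carried = output if carry_output else 0
    for _ in range(1, turns):
        lengths.append(lengths[-1] + carried + append)
    return lengths


# Parameters from the command above: 30000 initial, 1000 appended, 256 output, 3 turns.
print(turn_input_lengths(30000, 1000, 256, 3))
print(turn_input_lengths(30000, 1000, 256, 3, carry_output=False))
```

Under either assumption the appended portion is small relative to the shared prefix, which is the regime where prefix reuse across turns should matter most.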
Run a live benchmark:
uv run agentic-pd-hybrid benchmark-live \
--trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
--output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
--mechanism kvcache-centric \
--policy kv-aware \
--kvcache-admission-mode worker \
--prefill-workers 1 \
--decode-workers 1 \
--prefill-gpu-ids 0 \
--decode-gpu-ids 1 \
--transfer-backend mooncake \
--target-duration-s 2000 \
--session-sample-rate 1.0 \
--min-turns 2 \
--time-scale 1 \
--concurrency-limit 1000
Replay only and write metrics:
uv run agentic-pd-hybrid replay \
--trace path/to/trace.jsonl \
--policy kv-aware \
--mechanism pd-disaggregation \
--router-url http://127.0.0.1:8000 \
--output outputs/replay.jsonl
Output
Each replay/benchmark run writes:
- request metrics: request-metrics.jsonl
- summary: request-metrics.jsonl.summary.json
Key things to look at:
- E2E latency
- TTFT / TPOT
- execution mode
- cached tokens
- KV transfer blocks
- error
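For a quick look at the latency items listed above, a JSONL file of per-request records can be summarized with a few lines of Python. This is a sketch under assumptions: the field name e2e_latency_s is hypothetical (the actual schema of request-metrics.jsonl is not shown here), and the percentile uses the simple nearest-rank method.

```python
import json
import math
from typing import Iterable, List


def load_values(lines: Iterable[str], field: str) -> List[float]:
    """Collect numeric values of `field` from JSONL lines, skipping records without it."""
    values: List[float] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = json.loads(line).get(field)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(float(value))
    return values


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# In-memory sample; the field name is a placeholder, not the project's schema.
sample = [
    '{"e2e_latency_s": 1.0}',
    '{"e2e_latency_s": 4.0}',
    '{"error": "aborted"}',
    '{"e2e_latency_s": 2.0}',
    '{"e2e_latency_s": 3.0}',
]
latencies = load_values(sample, "e2e_latency_s")
print(percentile(latencies, 50), percentile(latencies, 95))
```

To run it against a real file, pass an open file handle instead of the sample list and substitute whatever field name the metrics records actually use.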
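Maintenance conventions
- Project code changes: feat: / fix: / docs:.
- SGLang changes: feat(sglang): ... / fix(sglang): ....
- The baseline for third_party/sglang is a clean SGLang v0.5.10 snapshot.
- Do not commit outputs/, logs, __pycache__, or virtual environments.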
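The commit-prefix conventions above could be checked mechanically, for example from a commit-msg hook. The sketch below is one possible shape, not something the project ships: the classify_subject helper is hypothetical, and requiring a space plus non-empty text after the colon is an added assumption.

```python
import re

# Prefixes listed in the conventions above; a space and some text must follow the colon.
PROJECT_RE = re.compile(r"^(feat|fix|docs): \S")
SGLANG_RE = re.compile(r"^(feat|fix)\(sglang\): \S")


def classify_subject(subject: str) -> str:
    """Return 'sglang', 'project', or 'invalid' for a commit subject line."""
    if SGLANG_RE.match(subject):
        return "sglang"
    if PROJECT_RE.match(subject):
        return "project"
    return "invalid"


for subject in (
    "fix(sglang): guard prepare_for_extend against short fill_ids",
    "docs: update project overview",
    "update stuff",
):
    print(f"{classify_subject(subject):8s} {subject}")
```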