Initial commit: obsidian to gitea

This commit is contained in:
2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions



@@ -0,0 +1,32 @@
[ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production](https://arxiv.org/pdf/2505.09999)
Evict from the M (main) queue first
![[projects/kvcachecache/Dev.figs/250414-000021.png]]

| Cache config | S3FIFO   |
| ------------ | -------- |
| 1kGPU1kCPU   | 0.095005 |
| 1kGPU2kCPU   | 0.136413 |
| 1kGPU4kCPU   | 0.213832 |
Evict from the S (small) queue first
![[projects/kvcachecache/Dev.figs/250414-000021-1.png]]

| Cache config | S3FIFO   |
| ------------ | -------- |
| 1kGPU1kCPU   | 0.095005 |
| 1kGPU2kCPU   | 0.136413 |
| 1kGPU4kCPU   | 0.213832 |
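
The two settings above differ only in which queue the simulator tries to drain first when space is needed. Below is a minimal sketch of that switch under a simplified S3FIFO (no ghost queue, no per-queue quotas); the class and parameter names are hypothetical and this is not the actual simulator.

```python
from collections import deque

class S3FIFOSketch:
    """Simplified S3FIFO: new keys enter the small (S) FIFO and are promoted
    to the main (M) FIFO on reuse.  Queue size quotas and the ghost queue are
    omitted; `evict_main_first` is the (hypothetical) knob compared above."""

    def __init__(self, capacity, evict_main_first=False):
        self.capacity = capacity
        self.small, self.main = deque(), deque()
        self.freq = {}                       # key -> hits since (re)insertion
        self.evict_main_first = evict_main_first

    def _evict_from_small(self):
        key = self.small.popleft()
        if self.freq[key] > 0:               # reused while in S: promote to M
            self.freq[key] = 0
            self.main.append(key)
        else:                                # never reused: drop
            del self.freq[key]

    def _evict_from_main(self):
        key = self.main.popleft()
        if self.freq[key] > 0:               # reused while in M: second chance
            self.freq[key] = 0
            self.main.append(key)
        else:
            del self.freq[key]

    def _make_room(self):
        while len(self.freq) >= self.capacity:
            order = [(self.main, self._evict_from_main),
                     (self.small, self._evict_from_small)]
            if not self.evict_main_first:
                order.reverse()              # default: drain S before M
            for queue, evict in order:
                if queue:
                    evict()
                    break

    def access(self, key):
        """Record one access; returns True on a hit."""
        if key in self.freq:
            self.freq[key] += 1
            return True
        self._make_room()
        self.freq[key] = 0
        self.small.append(key)
        return False
```

The tables above were produced by the project's own simulator; this sketch only illustrates the eviction-priority switch being compared.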



@@ -0,0 +1,33 @@
# Trace
## Current problems
1. traceA and traceB come from different logs with different formats, so each needs its own processing scripts
2. The durations used in earlier tests were 1h for traceA and 2h for traceB: inconsistent, and not long enough
3. The trace processing and plotting scripts are not efficient enough
    - They load the whole trace into RAM: workable at 1h/4h, clearly infeasible at 24h
    - Even the simulator takes roughly 1h to run on 4h of data, far too slow for 24h
## TBD
- Preprocess 24h of traceA and traceB once with a unified pipeline to obtain a single, easy-to-use trace
- Optimize the existing scripts, replacing load-all with streaming, so that 24h traces can be processed (a sketch follows this list)
- Re-run the existing trace analyses on the new data
- After moving to 24h, the characteristics will likely fluctuate over the day; to show that they are stable over a reasonably long continuous window (hour level), we may need to present both 24h and 2h data (peak/off-peak, 9-11 / 22-24)
- The characteristics are expected to match the earlier conclusions; the intra-day fluctuation at 24h may also bring some new findings
- Suggestion from 大爷: a table relating the assumptions/observations of existing papers to what we observe in the real industrial trace: same conclusions? contradictory?
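
A minimal sketch of the intended load-all to streaming change, assuming the raw trace is exported as newline-delimited JSON with a per-record timestamp; the file name and field names are placeholders, not the actual log schema.

```python
import json
from collections import Counter

def iter_records(path):
    """Yield one parsed record at a time instead of loading the file into RAM."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue                      # skip illegal records

def requests_per_hour(path):
    """Example streaming aggregation: request count per hour of day."""
    counts = Counter()
    for rec in iter_records(path):
        # "timestamp" is assumed to be seconds since epoch (placeholder name).
        hour = int(rec["timestamp"]) // 3600 % 24
        counts[hour] += 1
    return counts

if __name__ == "__main__":
    print(requests_per_hour("traceA_24h.jsonl"))
```

Any per-record statistic (arrival rate, token lengths, reuse gaps) can be accumulated the same way, keeping memory constant regardless of trace length.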
## From vLLM v1 meetup
1. The concepts of prefill and decode are removed
![[projects/kvcachecache/Refine roadmap.figs/250414-000021.png]]
2. Not doing PD disaggregation naturally places high demands on the scheduler (it must avoid becoming compute-bound/memory-bound as far as possible). The vLLM team also plans a workload-specific scheduler, but theirs is aware of the arriving request workload, not the KVCache workload-awareness we worked on before. When refining our work we could also think about what workload-awareness can offer the scheduler. [Need to get familiar with the latest vLLM v1 scheduler first]
![[projects/kvcachecache/Refine roadmap.figs/250414-000021-1.png]]
3. Optimizations target specific scenarios (Reasoning/Coding/...); can our workload-aware approach take this kind of information into account and provide useful input for system design/optimization?
![[projects/kvcachecache/Refine roadmap.figs/250812-140723.png]]
# Response
- [ ] What does the workload imply for a global scheduler under PD disaggregation?
- [ ] How does the workload affect MoE? Does the hardware ratio change under different workloads?
- [ ] DeepSeek: 320 experts on 320 GPUs, one expert per GPU; if run on 8 GPUs, how are the experts distributed? What are the activation characteristics?


@@ -0,0 +1,64 @@
## Trace format convention
Q1: current time Aug 7 15:39, balbal111
A1: xxx
Q2: current time Aug 7 15:40, balbal111, xxx, blabal222
A2: yy
Q3: current time Aug 7 15:40, blabal222, yy, blabla333 -> current time Aug 7 15:40, balbal111, xxx, blabal222, yy, blabla333
(Q3 arrives carrying only the sliding-window context; the arrow shows the reconstructed full context.)
| Field | Type [feather] | Description |
| ------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------- |
| request_id | str | Unique identifier of this request |
| chat_id | str [int] | Unique identifier, incrementing from 0 |
| session_id [not supported yet] | str | Unique identifier of a session |
| parent_chat_id | str [int] | chat_id of the previous chat turn in the session; -1 if there is no previous turn |
| uid [not supported yet] | str | uid of the user the request comes from |
| time | str | Request arrival time, e.g. `"2025-02-18 23:52:48.827000"` |
| end_time | str [datetime] | Request end time, e.g. `"2025-02-18 23:53:00.854000"` |
| timestamp | float [datetime] | Request arrival time as a timestamp (in s) |
| first_latency | int | Time to first token, TTFT (in ms) |
| duration | int | Total request time, E2E latency (in ms) |
| input_token_length | int | Total number of input tokens |
| output_token_length | int | Total number of output tokens |
| usage | dict | Resource usage of this request, e.g. `{'input_tokens': 1195, 'output_tokens': 246, 'plugins': {'wanx': {'count': 1}}, 'total_tokens': 1441}` |
| token_ids | list | Input token list (token ids in the qwen vocab range) |
| input_text | str | Input prompt |
| messages | list | Context of this request, e.g. `[("system", "You are an assistant"), ("user", "hi"), ("assistant", "hello"), ("user", "world")]` |
| turn | int | Turn index of this request within its session |
| type | str | Workload tag |
| no_sp_messages | list | messages with the prefix-cache impact of the timestamp in the system prompt removed |
| no_sp_input_text | str | input_text with the prefix-cache impact of the timestamp in the system prompt removed |
| no_sp_sw_messages | list | messages with, additionally, the sliding-window impact removed (on top of no_sp) |
| no_sp_sw_input_text | str | input_text with, additionally, the sliding-window impact removed (on top of no_sp) |
| no_sp_token_ids | list | token_ids with the prefix-cache impact of the timestamp in the system prompt removed |
| no_sp_sw_token_ids | list | token_ids with, additionally, the sliding-window impact removed (on top of no_sp) |
| no_sp_sw_output_token_ids | list | If there is a next turn, token_ids taken from the next turn's answer; otherwise a randomly generated sequence of length output_token_length |
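
A small usage sketch of this format, assuming the unified trace is saved as a single Feather file and that chat_id/parent_chat_id are stored as ints per the bracketed feather types; the file name is a placeholder.

```python
import pandas as pd

# Load the unified trace (field types as in the table above).
df = pd.read_feather("trace_unified.feather")

# Example: time gap between a request and its parent turn in the same session,
# i.e. how long cached context sits before the next turn reuses it.
parents = df[["chat_id", "timestamp"]].rename(
    columns={"chat_id": "parent_chat_id", "timestamp": "parent_timestamp"})
multi_turn = df[df["parent_chat_id"] != -1].merge(parents, on="parent_chat_id")
reuse_gap_s = multi_turn["timestamp"] - multi_turn["parent_timestamp"]
print(reuse_gap_s.describe())
```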
## Processing pipeline
- pass 1: extract every field that can be read directly from the raw trace; parent_chat_id, session_id, type, and uid (in traceA) remain unavailable at this stage. Drop all illegal records and sort by timestamp
- pass 2: reconstruct sessions in a streaming fashion: set parent_chat_id and session_id, and update the turn field (because of the sliding window, simply counting the user messages is biased); see the sketch after this list
- pass 3: set type from the plugins
    - traceA
        - zhiwen_doc_search, pdf_extracter: file
        - tongyi_nlp_web_search, tongyi_nlp_deep_search, search: search
        - wanx: image
        - other: text
    - traceB
        - same system prompt qps > 0.5: api
        - other: file
- pass 4: remove the prefix mismatches caused by the timestamp in the system prompt and by the sliding window, and add the no_sp / no_sp_sw fields
- pass 5: add output_token_ids: the next turn's answer if there is one, otherwise a randomly generated list of length output_token_length
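
For pass 2, a rough streaming sketch under the assumption that a follow-up request is recognized because an earlier chat's messages plus its answer form a prefix of the new request's messages; this is exactly the step the TBD below asks to verify, the sliding window is ignored here, and the `answer` field is a placeholder.

```python
def link_sessions(records):
    """records: iterable of request dicts sorted by timestamp, each carrying
    'chat_id', 'messages' ([(role, text), ...]) and an 'answer' text.
    Sets parent_chat_id, session_id and turn in one streaming pass."""
    open_chats = {}        # full context (messages + answer) -> (chat_id, session_id, turn)
    next_session_id = 0
    for rec in records:
        msgs = [tuple(m) for m in rec["messages"]]
        prefix = tuple(msgs[:-1])            # context before the new user message
        if prefix in open_chats:             # continuation of an existing session
            parent_id, session_id, parent_turn = open_chats.pop(prefix)
            rec["parent_chat_id"] = parent_id
            rec["session_id"] = session_id
            rec["turn"] = parent_turn + 1
        else:                                # first turn of a new session
            rec["parent_chat_id"] = -1
            rec["session_id"] = next_session_id
            rec["turn"] = 1
            next_session_id += 1
        # Register this chat's context + answer so its next turn can find it.
        full_context = tuple(msgs + [("assistant", rec.get("answer", ""))])
        open_chats[full_context] = (rec["chat_id"], rec["session_id"], rec["turn"])
        yield rec
```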
## Open issues
- uid cannot be obtained from the new traceA
    - fig6: KV cache reuse by same uid
    - fig7: hit by uid count
    - fig8: reqs count by uid
    - fig10: number of turns by uid
## TBD
- [ ] Verify that the linking of consecutive chat turns within a session (i.e. the parent_chat_id assignment) is correct



@@ -0,0 +1,85 @@
## Current Problems
- The latest trace lacks the basis for classifying the workload type (previously this was decided from plugin calls: whether pdf_extractor/wanx/tongyi_nlp_web_search was invoked, giving file/image/search)
- The definition of a turn has become uncertain
  Manus: in agent mode, when the context grows too long and the LLM forgets earlier content, it has the LLM restate some of it itself ("let me think, I saw this before: xxx"), which effectively brings context from far back up to the end again
```
## Old
+---------+
| Human |
+---------+
| /|\
| |
\|/ |
+---------+
| LLM |
+---------+
## New
+---------+
| Human |
+---------+
| /|\
| |
\|/ |
+---------+
| LLM |------+
+---------+ |
/|\ | ----> "<Web_search> Do you understand?" "Yes, sir!"
| |
+------------+
```
- The output token length is missing
  The earlier trace had the `usage` field
![[projects/kvcachecache/Trace-Qwen3.figs/250812-140723.png]]
## Fields
```
__source__
__tag__:__hostname__
__tag__:__pack_id__
__tag__:__path__
__tag__:__receive_time__
__tag__:__service_name__
__tag__:__user_defined_id__
__tag__:_container_ip_
__tag__:_container_name_
__tag__:_image_name_
__tag__:_namespace_
__tag__:_pod_name_
__tag__:_pod_uid_
__tag__:eci_id
__time__
__topic__
code
context
ds_service_id
ds_service_name
interval
message
model
request_id
service_id
service_name
span_id
step
task_id
time
trace_id
user_id
```
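
A small sketch for pulling the useful columns out of such an export, assuming it is dumped as CSV with the field names listed above; the file name is a placeholder.

```python
import pandas as pd

# Only a handful of the exported columns matter for rebuilding the trace.
cols = ["request_id", "user_id", "model", "service_name",
        "time", "message", "context", "trace_id"]
df = pd.read_csv("qwen3_raw_export.csv", usecols=cols)

# Parse arrival times, drop records that fail to parse, and sort.
df["time"] = pd.to_datetime(df["time"], errors="coerce")
df = df.dropna(subset=["time"]).sort_values("time")
print(df.head())
```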
qwen-chat: 345339
tongyi: 421551
tob: 740251 + 740393


@@ -0,0 +1,21 @@
arXiv Submission: Paper Password for Ownership Claim
Dear Co-Authors,
I'm writing to share the paper password for our recent arXiv submission titled "KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider" (arXiv ID: 2506.02634). This password is required for each co-author to claim ownership of the paper on arXiv.
**Paper ID:** 2506.02634
**Paper Password:** fzw95
**Next Steps:**
1. Please use the paper ID and password above to claim ownership via the password form at https://arxiv.org/auth/need-paper-password.
2. Confirm with me once you've completed this step so we can ensure all co-authors are properly credited.
If you encounter any issues or need further assistance, reply to this email directly.
Thank you for your prompt attention to this!
Best regards,
Jiahao Wang
IPADS Shanghai Jiao Tong University


@@ -0,0 +1,19 @@
**Subject:** Action Required: Sign Consent to Publish Form for _KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider_
Dear Co-Authors,
As one of the lead authors, I'm writing to remind you that all authors are required to sign the **consent to publish form** for our paper to be published at USENIX ATC'25. This form grants permission for the paper, as well as any accompanying slides, audio, and/or video of our presentation, to be freely shared as part of USENIX's open-access commitment.
**Action Items:**
1. Please review and e-sign the form here: https://app.hellosign.com/s/79G82eZT.
FYI:
- Our paper title is "KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider".
- You can find the e-sign link at https://www.usenix.org/conference/atc25/instructions-presenters.
2. If you encounter any issues or questions, reply to this email or contact me directly.
Thank you all for your prompt attention to this! Let me know if you've already submitted your form; I'll follow up individually if needed to ensure everyone is covered.
Best regards,
Jiahao Wang
IPADS Shanghai Jiao Tong University


@@ -0,0 +1,19 @@
Request for Minor Revision to Submitted Paper Due to Open-Source Feedback
Dear ATC'25 Chairs,
I hope this message finds you well.
We are writing regarding our paper titled "KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider" (Paper ID: 477), submitted to ATC'25. Following its open-sourcing on GitHub, a community member kindly pointed out a minor misstatement about the trace anonymization method in the paper. We have reviewed their feedback and agree with the concern.
To ensure accuracy and clarity, we have updated the relevant statement in the paper accordingly. The change does not affect the overall results or conclusions, but we believe it is important to address the issue for the sake of correctness.
We would like to kindly request your permission to update the submitted version with this corrected version. Please let us know the appropriate process, or whether such an update is permissible at this stage.
We have attached our current version (kvcache.pdf) as well as the diff against the currently submitted version (diff.pdf).
We appreciate your understanding and look forward to your guidance.
Best regards,
Jiahao Wang
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University


@@ -0,0 +1,6 @@
目前大语言模型LLM推理服务已成为云服务提供商的关键应用而跨请求缓存中间结果KVCache能够显著提升系统吞吐量并降低响应延迟然而现有研究多基于合成负载尚未充分揭示真实生产环境下 KVCache 的作用机制,例如缓存驱逐策略等系统决策高度依赖于工作负载。为此,本工作依托于阿里巴巴通义实验室,对通义千问在线服务的全量真实工作负载进行了脱敏采集与深入分析,发现单轮请求与多轮对话之间的缓存重用同等重要却表现各异,不同请求类型下的缓存重用时间窗口与概率虽差异显著,但对于某一固定类型的请求,其缓存重用模式高度可预测;且在 API 主导的场景中,容量有限的 GPU 本地缓存已足以满足需求。基于这些针对真实负载的观察,我们设计了一种基于工作负载感知的缓存驱逐策略,使缓存命中率由 14.5% 提升至 18.5%首词时延TTFT缩短约 25%,从而在真实业务场景下大幅提升了服务性能。
xingda version:
KVCache 缓存是当今大模型推理系统的关键组件,其系统设计与缓存特征密切相关。在本研究中,我们与阿里通义实验室合作,深入分析了千问线上脱敏的 KVCache 缓存特征。我们发现了几个全新的见解,包括:单轮对话场景也高度依赖 KVCache 缓存,不同负载的缓存分布在各时间段呈现规律性特征等。基于这些观察,我们设计了一种新型的负载感知的 KVCache 缓存替换策略,在真实数据集上将缓存命中率从 14.5% 提升至 18.5%同时将首词时延TTFT减少 25%。