Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions


@@ -0,0 +1,17 @@
Objectives
- Analysis of the Qwen trace
- Customize vLLM (Ali ver) with new features
- Port XPURemoting to PhOS
Key Results
- Improve workload separation in the Qwen trace
- Get vLLM KVCache hit rate for different open source workloads
- Build unified docker image for XPURemoting and PhOS
Last Week
- Build a unified workload taxonomy for the Qwen trace covering both the Web and App ends.
- Run vLLM (Ali ver) and start customizing it to add features (e.g., KVCache hit rate for different workloads).
- Build a new Docker image that satisfies PhOS's base-image requirement with the XPURemoting environment (statically linked PyTorch 1.13.1).
Next Week
- Customize vLLM to support new features such as KVCache scheduling-policy comparison.
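One way the per-workload KVCache hit-rate measurement could be tallied (a hypothetical helper, not vLLM's actual API):

```python
from collections import defaultdict

class HitRateTracker:
    """Tally KVCache block hits/misses per workload tag
    (hypothetical helper, not vLLM's actual API)."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, workload, hit):
        if hit:
            self.hits[workload] += 1
        else:
            self.misses[workload] += 1

    def hit_rate(self, workload):
        total = self.hits[workload] + self.misses[workload]
        return self.hits[workload] / total if total else 0.0
```

A tracker like this can be hooked into the cache manager's lookup path, keyed by whatever workload label the trace taxonomy assigns to each request.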


@@ -0,0 +1,16 @@
Objectives
- Analysis of the Qwen trace
- Customize vLLM (Ali ver) with new features
Key Results
- Tokenize Qwen trace with Qwen-agent and some other tools [60%]
- Modify vLLM to support different KV cache block number
- Profile open source dataset with different cache blocks
Last Week
- Use Qwen-agent to handle workloads that include files, getting a more precise token length for these workloads.
- Modify vLLM's cache manager to support a configurable number of KVCache blocks, then measure how the KV cache hit rate trends with block count across workloads.
Next Week
- Tokenize the entire Qwen trace, especially the multimodal (image) workloads, and run measurements on these traces.
- Profile the KVCache hit rate on the actual trace and compare it with other open-source traces to find differences.
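The hit-rate-vs-block-count measurement can also be approximated offline. A minimal sketch, assuming LRU eviction over a stream of block ids (a simplified stand-in for vLLM's cache manager with a capped block count):

```python
from collections import OrderedDict

def lru_hit_rate(block_stream, num_blocks):
    """Replay a stream of KV block ids through an LRU cache capped at
    `num_blocks` and return the hit rate (simplified stand-in for
    vLLM's cache manager with a specific block count)."""
    cache, hits = OrderedDict(), 0
    for block in block_stream:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= num_blocks:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return hits / len(block_stream)

# Sweep block counts to see the hit-rate trend for one workload's stream.
trace = [1, 2, 1, 3, 1, 2, 4, 1, 2, 3]
trend = {n: lru_hit_rate(trace, n) for n in (1, 2, 4, 8)}
```

The trend is monotonically non-decreasing in the block count, which matches the intuition behind sweeping block numbers per workload.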


@@ -0,0 +1,15 @@
Objectives
- Analysis of the Qwen trace
- Customize vLLM(Ali ver) with new features
Key Results
- Tokenize Qwen trace with Qwen-agent and some other tools
- Profile Qwen trace with different cache blocks
Last Week
- Use Qwen-agent to handle all workloads in the Qwen trace, producing a precise token stream that simulates the actual online environment.
- Measure the performance and KVCache hit rate for different cache block counts using one hour of the real Qwen trace.
Next Week
- Check the tokenization results from the Qwen trace; they may need modification.
- Measure KV cache performance with CPU memory.


@@ -0,0 +1,14 @@
Objective
- Customize vLLM(Ali ver) with new features
Key Results
- Test the modified vLLM that supports CPU KV cache
- Profile and break down the modified vLLM on synthetic data and the real Qwen trace
Last Week
- Merge the vLLM branch that supports CPU KV cache, then use synthetic data and the real Qwen trace to measure performance and find bugs.
- Add breakdown-measurement support on the vLLM server side to measure the time spent copying KV blocks.
Next Week
- Run more tests on the vLLM version that supports CPU KV cache.
- Try to optimize current implementation.
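The server-side breakdown measurement could be done with a simple accumulating timer; an illustrative sketch (`timed` and the phase names are hypothetical, not part of vLLM):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

breakdown = defaultdict(float)  # phase name -> accumulated seconds

@contextmanager
def timed(phase):
    """Accumulate wall-clock time per phase, e.g. wrapped around the
    CPU<->GPU KV-block copy path on the server side."""
    start = time.perf_counter()
    try:
        yield
    finally:
        breakdown[phase] += time.perf_counter() - start

# usage sketch:
#   with timed("kv_block_copy"):
#       swap_blocks(src, dst)  # swap_blocks is a hypothetical copy routine
```

Accumulating per phase (rather than logging each event) keeps the instrumentation cheap enough to leave on while replaying the full trace.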


@@ -0,0 +1,17 @@
Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS
Key Results
- Refactor vLLM benchmark tools to get more precise metrics
- Simulate different token lengths and hit rates to quantify the effect of hit rate
- Modify XPURemoting to support new architecture
Last Week
- Implement a unified vLLM benchmark tool to get more precise metrics and provide a unified request builder.
- Measure the effect of cache hit rate and try to define a good hit rate for real performance improvement.
- Merge XPURemoting with new features and support for PhOS.
Next Week
- Define a `good hit rate` for KV cache scheduling.
- Finish XPURemoting adaptation.
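A first-order model helps reason about what a `good hit rate` means. This sketch assumes prefill work shrinks linearly with the fraction of prompt tokens served from cache (a simplification for exploration, not a measured result):

```python
def est_prefill_tokens(prompt_tokens, hit_rate):
    """First-order model: prefix blocks served from KVCache skip prefill
    compute, so prefilled tokens shrink linearly with the hit rate."""
    return prompt_tokens * (1.0 - hit_rate)

# The same hit rate saves far more absolute work on long prompts, which is
# one reason a single threshold may not define a `good hit rate`:
saved_short = 512 - est_prefill_tokens(512, 0.5)    # prefill tokens saved, short prompt
saved_long = 4096 - est_prefill_tokens(4096, 0.5)   # prefill tokens saved, long prompt
```

Under this model, a `good hit rate` would need to be defined per token-length regime rather than as one global number.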


@@ -0,0 +1,16 @@
Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS
Key Results
- Define a good KVCache hit rate under different conditions [6/10]
- Demonstrate interference between different workloads in current vLLM
- Modify XPURemoting to support PhOS (v1)
Last Week
- Survey different KVCache scheduling algorithms and summarize their commonalities toward a definition of a good KVCache hit rate.
- Profile the Ali trace in vLLM and group workloads to demonstrate interference.
- Adapt XPURemoting to support the current PhOS API and fully test the implementation on PhOS's open-source examples: [MR](https://ipads.se.sjtu.edu.cn:1312/scaleaisys/xpuremoting/-/merge_requests/25) for XPURemoting and [e80bf94](https://github.com/Gahow/PhoenixOS/commit/e80bf94075fcd6f53c97406dadfbe7f13fc16092) for PhOS.
Next Week
- Finish the definition of a good KVCache hit rate.


@@ -0,0 +1,15 @@
Objectives
- Serverless KVCache cache
- PhOS profile
Key Results
- Implement a workload-aware KVCache scheduler. [3/10]
- Provide test apps for PhOS
Last Week
- Implement a simulator for the KVCache scheduler to quickly test different policies.
- Prepare and deliver a paper-sharing session at Ali.
- Provide scripts for StableDiffusion single-GPU training, Llama2-13b multi-GPU training, and Llama2-70b multi-GPU inference for PhOS profiling.
Next Week
- Implement a solution to reduce KVCache memory requirements.
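The quick-test simulator idea can be sketched as a replay loop with pluggable eviction policies; `belady` below is the textbook offline-optimal policy, useful as an upper bound (an illustration of the approach, not the actual simulator code):

```python
def simulate(policy, stream, capacity):
    """Replay a block-id stream through a fixed-capacity cache; `policy`
    picks the victim block on eviction. Lets different policies be
    compared on the same trace quickly."""
    cache, hits = set(), 0
    for i, block in enumerate(stream):
        if block in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.remove(policy(cache, stream, i))
            cache.add(block)
    return hits / len(stream)

def lru(cache, stream, i):
    # Evict the cached block whose most recent use is farthest in the past.
    return min(cache, key=lambda b: max(j for j in range(i) if stream[j] == b))

def belady(cache, stream, i):
    # Offline-optimal (Belady): evict the block reused farthest in the
    # future, or never reused at all.
    future = stream[i + 1:]
    return max(cache, key=lambda b: future.index(b) if b in future else len(future) + 1)
```

Running both policies on the same trace gives a quick gap measurement between online behavior and the offline optimum before touching vLLM itself.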


@@ -0,0 +1,13 @@
Objective
- Serverless KVCache cache
Key Results
- Test a workload-aware KVCache scheduler
- Implement the workload-aware policy in vLLM
Last Week
- Design a workload-aware scheduling policy in the simulator and profile the KVCache reuse rate.
- Implement the designed policy under vLLM.
Next Week
- Profile the real performance of the new policy in vLLM and make enhancements.


@@ -0,0 +1,16 @@
Objective
- Serverless KVCache cache
Key Results
- Implement the workload-aware policy in vLLM [8/10]
- Profile the workload-aware policy [3/10]
- Demonstrate workload differences in the Qwen trace
Last Week
- Add a new design point to the cache policy so that it considers cache memory size and predicted reuse distance together. To do this, add a monitor for each workload's reuse time interval and average token count.
- Set up an offline (i.e., optimal) scheduling policy, then profile the default policy, our workload-aware policy, and the offline policy to show the performance difference as a CDF of TTFT.
- Implement a cache-block source tracker in vLLM to show where KVCache reuse comes from; it shows that 90% of KVCache reuse comes from multi-turn chat.
Next Week
- Improve the performance of our policy.
- Plot some formal figures.
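The combined design point could score blocks from the monitored statistics. The formula below is a hypothetical illustration of weighing recompute cost (tokens) against predicted reuse (staleness relative to the workload's typical reuse interval), not the actual policy:

```python
def block_priority(avg_reuse_interval_s, avg_tokens, now_s, last_use_s):
    """Higher priority = keep longer. Blocks that are expensive to recompute
    (more tokens) score higher; blocks idle longer than their workload's
    typical reuse interval decay toward eviction. Purely illustrative."""
    staleness = (now_s - last_use_s) / max(avg_reuse_interval_s, 1e-6)
    return avg_tokens / (1.0 + staleness)
```

Any monotone combination of the two monitored signals would fit the same slot; the point is that eviction order depends on both memory cost and predicted reuse, not recency alone.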


@@ -0,0 +1,14 @@
Objective
- Serverless KVCache cache
Key Results
- Implement the workload-aware policy in vLLM
- Profile the workload-aware policy [3/10]
Last Week
- Implement a priority-based evictor (priority calculated by our policy) for both the GPU and CPU sides.
- Test our policy under relatively small cache memory, achieving a 30% cache hit ratio and a 10% performance improvement. This shows our policy suits limited cache memory; for larger cache memory it still needs fine-tuning.
Next Week
- Improve our policy for larger cache memory.
- Analyze the new trace.
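The priority-based evictor can be sketched as a min-heap keyed by the policy's priority, one instance per memory tier (GPU and CPU). A minimal illustration that ignores priority updates after insertion:

```python
import heapq

class PriorityEvictor:
    """Min-heap of (priority, block_id): evict() returns the
    lowest-priority block. One instance per memory tier."""
    def __init__(self):
        self._heap = []

    def add(self, block_id, priority):
        heapq.heappush(self._heap, (priority, block_id))

    def evict(self):
        _, block_id = heapq.heappop(self._heap)
        return block_id

gpu_evictor = PriorityEvictor()
cpu_evictor = PriorityEvictor()
```

A real evictor would also need stale-entry handling (e.g. lazy deletion) when a block's priority changes after it is pushed.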