Initial commit: obsidian to gitea

commit a57afa86b4, 2026-05-07 15:04:41 +08:00
323 changed files with 42569 additions and 0 deletions

Objectives
- Analysis of the QWen trace
- Customize vLLM (Ali ver.) with new features
- Port XPURemoting to PhOS
Key Results
- Enhance the QWen trace's workload separation
- Get vLLM KVCache hit rates for different open source workloads
- Build a unified docker image for XPURemoting and PhOS
Last Week
- Get a unified workload taxonomy for the QWen trace on both the Web and App ends.
- Run vLLM (Ali ver.) and start customizing it to add features (e.g., KVCache hit rate for different workloads).
- Build a new docker image satisfying PhOS's base image requirement with the XPURemoting environment (statically linked PyTorch 1.13.1).
Next Week
- Customize vLLM to support new features such as KVCache scheduling policy comparison.
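The per-workload KVCache hit rate mentioned above can be sketched as a small counter keyed by workload class. This is a minimal illustration with hypothetical names, not vLLM's actual API:

```python
from collections import defaultdict

class HitRateTracker:
    """Per-workload KVCache hit-rate counter (illustrative, not vLLM's API)."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, workload: str, matched_blocks: int, needed_blocks: int) -> None:
        # A request's "hit" portion is the prefix blocks already in the cache.
        self.hits[workload] += matched_blocks
        self.total[workload] += needed_blocks

    def hit_rate(self, workload: str) -> float:
        return self.hits[workload] / self.total[workload] if self.total[workload] else 0.0

tracker = HitRateTracker()
tracker.record("chat", matched_blocks=6, needed_blocks=8)
tracker.record("chat", matched_blocks=2, needed_blocks=8)
```

Each workload class then gets its own hit rate, which makes the cross-workload comparison in the key result straightforward.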

Objectives
- Analysis of the QWen trace
- Customize vLLM (Ali ver.) with new features
Key Results
- Tokenize Qwen trace with Qwen-agent and some other tools [60%]
- Modify vLLM to support different KV cache block counts
- Profile open source datasets with different cache block counts
Last Week
- Use Qwen-agent to handle workloads with files, getting more precise token lengths for these workloads.
- Modify vLLM's cache manager to support a configurable number of KVCache blocks, then measure the KV cache hit rate trend against block count across workloads.
Next Week
- Tokenize the full Qwen trace, especially multimodal (image) workloads, and run measurements with these traces.
- Profile the KVCache hit rate on the actual trace and compare with other open source traces to find differences.
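The "hit rate vs. block count" measurement above can be approximated offline by replaying block references through a fixed-capacity LRU cache. A minimal sketch (a simplified stand-in for vLLM's block-level cache manager, not its real code):

```python
from collections import OrderedDict

def lru_hit_rate(block_refs, num_blocks):
    """Replay a sequence of KV block ids through an LRU cache of a given
    capacity and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for blk in block_refs:
        if blk in cache:
            hits += 1
            cache.move_to_end(blk)          # mark as most recently used
        else:
            if len(cache) >= num_blocks:
                cache.popitem(last=False)   # evict least recently used
            cache[blk] = True
    return hits / len(block_refs)

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
rates = {n: lru_hit_rate(trace, n) for n in (1, 2, 4)}
```

Sweeping `num_blocks` over a real block-reference trace yields exactly the hit-rate-vs-capacity curve described in the report.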

Objectives
- Analysis of the QWen trace
- Customize vLLM (Ali ver.) with new features
Key Results
- Tokenize Qwen trace with Qwen-agent and some other tools
- Profile Qwen trace with different cache blocks
Last Week
- Use Qwen-agent to handle all workloads in the Qwen trace and get a precise token stream to simulate the actual online environment.
- Measure performance and KVCache hit rate for different cache block counts using the real Qwen trace, run for one hour.
Next Week
- Check the tokenization results from the Qwen trace; they may need revision.
- Measure KV cache performance with CPU memory.

Objective
- Customize vLLM (Ali ver.) with new features
Key Results
- Test modified vLLM which supports CPU KV cache
- Profile and break down the modified vLLM on synthetic data and the real Qwen trace
Last Week
- Merge the vLLM branch that supports CPU KV cache, and use synthetic data and the real Qwen trace to measure performance and find bugs.
- Add breakdown measurement support on the vLLM server side to measure the time spent copying KV blocks.
Next Week
- Run more tests on the vLLM build that supports CPU KV cache.
- Try to optimize the current implementation.

Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS
Key Results
- Refactor vLLM benchmark tools to get more precise metrics
- Simulate different token lengths and hit rates to quantify the hit rate's effect
- Modify XPURemoting to support the new architecture
Last Week
- Implement a unified vLLM benchmark tool that yields more precise metric results and provides a unified request builder.
- Measure the effect of cache hit rate and try to define a good hit rate for real performance improvement.
- Merge XPURemoting's new features and PhOS support.
Next Week
- Define a `good hit rate` for KV cache scheduling.
- Finish the XPURemoting adaptation.
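The hit rate's effect on latency, which the simulation above tries to quantify, can be captured by a first-order model: prefix-cache hits skip prefill compute for the matched portion of the prompt. A sketch with assumed (not measured) constants:

```python
def estimated_ttft(prompt_len, hit_rate, per_token_ms=0.5, overhead_ms=20.0):
    """First-order TTFT model (assumed constants, for illustration only):
    TTFT = fixed scheduling overhead + prefill cost of the unmatched tokens."""
    return overhead_ms + per_token_ms * prompt_len * (1.0 - hit_rate)

baseline = estimated_ttft(2000, hit_rate=0.0)
cached = estimated_ttft(2000, hit_rate=0.5)
```

Under this model, a "good" hit rate is one where the saved prefill term dominates the fixed overhead, which is why longer prompts benefit more from the same hit rate.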

Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS
Key Results
- Define the Good KVCache hit rate in different conditions [6/10]
- Prove the interference between different workloads in current vLLM
- Modify XPURemoting to support PhOS (v1)
Last Week
- Survey different KVCache scheduling algorithms and summarize common ground for defining a good KVCache hit rate.
- Profile the Ali trace in vLLM and group workloads to demonstrate interference.
- Adapt XPURemoting to support the current PhOS API, and fully test the implementation on PhOS's open source examples. [MR](https://ipads.se.sjtu.edu.cn:1312/scaleaisys/xpuremoting/-/merge_requests/25) for XPURemoting and [e80bf94](https://github.com/Gahow/PhoenixOS/commit/e80bf94075fcd6f53c97406dadfbe7f13fc16092) for PhOS.
Next Week
- Finish the definition of a good KVCache hit rate.

Objectives
- Serverless KVCache cache
- PhOS profile
Key Results
- Implement a workload aware KVCache scheduler. [3/10]
- Provide test apps for PhOS
Last Week
- Implement a simulator for the KVCache scheduler to quickly test different policies.
- Prepare and give a paper sharing at Ali.
- Provide StableDiffusion single-GPU training, Llama2-13b multi-GPU training, and Llama2-70b multi-GPU inference scripts for PhOS profiling.
Next Week
- Implement a solution to reduce KVCache memory demand.
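The quick-test simulator above boils down to one loop with a pluggable eviction function, so policies can be swapped without touching the replay logic. A minimal sketch (names and structure are illustrative, not the real simulator):

```python
def simulate(trace, capacity, evict):
    """Tiny KVCache simulator: `evict` picks a victim key from the cache,
    so different policies can be compared on the same trace."""
    cache, clock, last_used, hits = set(), 0, {}, 0
    for key in trace:
        clock += 1
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.discard(evict(cache, last_used))
            cache.add(key)
        last_used[key] = clock
    return hits / len(trace)

# LRU as one example policy: evict the key with the oldest last-use time.
lru = lambda cache, last_used: min(cache, key=lambda k: last_used.get(k, 0))
trace = [1, 2, 1, 3, 1, 2, 4, 1]
```

A new policy is then just another `evict` callable, which is what makes this kind of simulator fast to iterate on.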

Objective
- Serverless KVCache cache
Key Results
- Test a workload aware KVCache scheduler
- Implement the workload aware policy in vLLM
Last Week
- Design a workload-aware scheduling policy in the simulator and profile the KVCache reuse rate.
- Implement the designed policy in vLLM.
Next Week
- Profile the real performance of the new policy in vLLM and make some enhancements.

Objective
- Serverless KVCache cache
Key Results
- Implement the workload aware policy in vLLM [8/10]
- Profile the workload aware policy [3/10]
- Demonstrate workload differences in the Qwen trace
Last Week
- Add a new design point to the cache policy so that it considers cache memory size and predicted reuse distance together. To do this, add a monitor for each workload's reuse time interval and average token count.
- Set up an offline (i.e., optimal) scheduling policy, then profile the default policy, our workload-aware policy, and the offline policy to show the performance difference as a CDF of TTFT.
- Implement a cache block source tracker in vLLM to show where KVCache reuse comes from; this shows that 90% of KVCache reuse comes from multi-turn chat.
Next Week
- Improve the performance of our policy.
- Plot some formal figures.
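The design point above (weighing cache memory size against predicted reuse distance) can be expressed as one scoring function. This is our sketch of the idea, not the exact formula from the policy; all names and constants are assumptions:

```python
def keep_priority(pred_reuse_interval_s, avg_tokens, bytes_per_token=2 * 4096):
    """Illustrative keep-priority score: entries expected to be reused sooner
    rank higher, and larger entries are penalized by the memory they occupy
    while waiting for that reuse."""
    memory_cost = avg_tokens * bytes_per_token      # bytes held until reuse
    return 1.0 / (pred_reuse_interval_s * memory_cost)

# A short-interval, small entry outranks a long-interval, large one.
chat = keep_priority(pred_reuse_interval_s=30.0, avg_tokens=512)
batch = keep_priority(pred_reuse_interval_s=600.0, avg_tokens=4096)
```

The monitor described in the report supplies the two inputs: each workload's observed reuse interval and its average token count.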

Objective
- Serverless KVCache cache
Key Results
- Implement the workload aware policy in vLLM
- Profile the workload aware policy [3/10]
Last Week
- Implement a priority-based evictor (priority calculated by our policy) for both the GPU and CPU sides.
- Test our policy under relatively small cache memory, getting a 30% cache hit ratio and a 10% performance improvement. This shows our policy fits limited cache memory; for larger cache memory, it still needs fine-tuning.
Next Week
- Improve our policy for larger cache memory.
- Analyze the new trace.
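The priority-based evictor above is naturally a min-heap keyed by the policy's score, with one instance per tier (GPU or CPU). A minimal sketch, with hypothetical names:

```python
import heapq

class PriorityEvictor:
    """Min-heap evictor: evict the lowest-priority block first. Stale heap
    entries (from priority updates or removals) are skipped lazily on pop."""
    def __init__(self):
        self._heap = []          # (priority, block_id)
        self._alive = set()

    def push(self, block_id, priority):
        self._alive.add(block_id)
        heapq.heappush(self._heap, (priority, block_id))

    def evict(self):
        while self._heap:
            _, block_id = heapq.heappop(self._heap)
            if block_id in self._alive:      # skip stale entries
                self._alive.discard(block_id)
                return block_id
        return None

gpu = PriorityEvictor()
gpu.push("blk-a", priority=0.9)
gpu.push("blk-b", priority=0.1)
```

The lazy-deletion trick keeps priority updates O(log n) without needing a decrease-key operation.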

Objective
- Serverless KVCache cache
Key Results
- Analyze the new trace
Last Week
- Get the qwen-plus trace and analyze its features: 96% of requests come from scripts, and many long system prompts are reused heavily (more than 100k times in 4h).
- Confirm trace A and trace B for paper. Draw figures for them.
Next Week
- Profile our policy.

Objective
- Serverless KVCache cache
Key Results
- Run test and get number for paper
Last Week
- Spend the whole week on the paper.
Next Week
- Finish paper.

Objective
- Serverless KVCache cache
Key Results
- Finish paper for ATC
Last Week
- Spend the whole week on the paper.
Next Week
- Go for vacation

Objective
- Serverless KVCache cache
- MoE study
Key Results
- Check the trace from Ali and fix problems
- Define a formatted trace structure for incoming refine
- Study papers about MoE; run int4 DeepSeek-V3 671B on 8×A800
Last Week
- Communicate with a colleague at Ali to obtain the desired trace, check problems in the trace, and give feedback.
- Design a standard trace structure for better refining, then start formatting a 12h slice of the trace for testing.
- Study MoE and find an int4-quantized DeepSeek-V3 671B to run on 8×A800.
Next Week
- Format the whole trace into the desired structure.
- Study DeepSeek-V3 to see how expert parallelism works.

Objective
- Serverless KVCache cache
Key Results
- Format traceA and traceB to standard format and get the chat session
Last Week
- Update the process script to support streaming and format 24h data for traceA and traceB.
- Prepare a paper sharing.
- Go back to school for the intern defense.
Next Week
- Analyze the 24h data of traceA and traceB.
- Survey different DeepSeek deployment methods.

Objective
- Serverless KVCache cache
Key Results
- Test traceA and traceB and fix bugs
- Survey the hardware for MoE deployment in a medium-scale cluster
Last Week
- Test traceA and traceB, then fix bugs in the format pass to handle corner cases.
- Learn the calculation details of MLA and MoE to estimate memory and compute requirements, and compare across different hardware.
Next Week
- Re-plot all the figures about trace.
- Survey the MoE deployment.

Objective
- Serverless KVCache cache
- DeepSeek deployment study
Key Results
- Refine some trace figures in 24h trace
- Give a cache policy evaluation method (w/ Jinbo)
- Survey the hardware for MoE deployment in a medium-scale cluster
Last Week
- Finish all trace cleaning and preprocessing, and re-plot some figures for traceA and traceB on the new trace.
- Communicate with Jinbo to gain a better understanding of the gap between vLLM cache management and traditional cache policies. Work out an evaluation method to judge cache policies.
- Calculate the FLOPs requirement for DeepSeek.
Next Week
- Test and refine the cache policy.
- Try to summarize the challenges for medium-scale deployment.

Objectives
- Serverless KVCache cache
- DeepSeek deployment study
Key Results
- Write a KVCache simulator to speed up policy test
- Refine S3-FIFO to get some improvement
Last Week
- Write a *naive* KVCache simulator that aligns with vLLM's KVCache management; it shows very small bias compared to real vLLM.
- Refine S3-FIFO in vLLM and evaluate it; it yields a small improvement with relatively small cache space.
- Write the middle-stage report for graduation thesis.
Next Week
- Refine the cache policy.

Objective
- Serverless KVCache cache
Key Result
- Implement PDF-based workload-aware cache policy in simulator
- Test the policy with different refine methods
Last Week
- Implement the workload-aware cache policy via exponential distribution fitting and get a stable hit ratio improvement for the first time.
- Try monitoring with a sliding time window, warming up the fitting coefficients, using oracle fitting coefficients, etc. None of them yields a notable improvement.
Next Week
- Refine the cache policy.
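The exponential-fitting step above has a closed form: the MLE rate is the reciprocal of the mean observed reuse interval, and the fitted distribution gives a reuse probability over any horizon. A minimal sketch of that math (illustrative; the policy's actual scoring may differ):

```python
import math

def fit_rate(reuse_intervals):
    """MLE for an exponential distribution: lambda = 1 / mean interval."""
    return len(reuse_intervals) / sum(reuse_intervals)

def reuse_prob(lam, horizon):
    """P(next reuse arrives within `horizon`) under the fitted exponential:
    CDF(t) = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-lam * horizon)

lam = fit_rate([10.0, 20.0, 30.0])   # mean 20s -> lambda = 0.05
p = reuse_prob(lam, horizon=20.0)
```

Ranking cache entries by this probability (per workload) is what makes the policy "PDF-based": eviction prefers entries least likely to be reused soon.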

Objective
- Serverless KVCache cache
Key Result
- Analyze the differences among LRU/WA/oracle
Last Week
- Define the difference between cache policies via a reuse rank (for each cache hit, we record the current key's rank under a given cache policy). Evaluate different cache policies by reuse rank and draw the CDF.
- Prepare and give the mid-term graduation thesis defense.
Next Week
- Do the rebuttal for ATC.
- Implement the WA policy in vLLM and test it.
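The reuse-rank idea above can be made concrete for LRU: on every hit in an unbounded cache, record how many keys are ahead of the hit key in recency order; the CDF of these ranks shows what capacity LRU would have needed to keep the key. A minimal sketch (assumed formulation, not the report's exact code):

```python
def reuse_ranks(trace):
    """For every cache hit (unbounded cache), record the key's LRU rank:
    0 = most recently used. Low ranks mean LRU would have kept the key."""
    seen, order, ranks = set(), [], []
    for key in trace:
        if key in seen:
            ranks.append(order.index(key))   # position in recency order
            order.remove(key)
        order.insert(0, key)                 # move to front (most recent)
        seen.add(key)
    return ranks

ranks = reuse_ranks([1, 2, 3, 1, 3, 2])
```

Computing the analogous rank under WA or the oracle ordering, then overlaying the CDFs, is exactly the policy comparison described above.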

Objective
- Serverless KVCache cache
Key Result
- Rebuttal for ATC'25
- Refine cache policy implementation
Last Week
- Finish rebuttal for ATC'25 w/ Jinbo.
- Fix some bugs in our cache policy and test in the simulator, getting a slight hit ratio improvement.
Next Week
- Implement the WA policy in vLLM and test it.

Objective
- Serverless KVCache cache
Key Result
- Refine cache policy implementation
- Write graduation thesis
Last Week
- Fix some bugs in our cache policy and test in the simulator, getting a slight hit ratio improvement.
- Fix bugs in the cache policy and simulator, and refine the policy so it consistently beats LRU's cache hit ratio at 1x, 2x, and 4x cache sizes.
- Write 20 pages of the graduation thesis.
Next Week
- Refine cache policy to get better performance.

Objective
- Serverless KVCache cache
Key Result
- Refine cache policy implementation
- Implement and test our workload-aware cache policy in vLLM
- Write graduation thesis
Last Week
- Refine the cache policy to consider the _cost_ of keeping cache in memory, getting about a 1% to 2% hit rate improvement under 1k+1k cache blocks.
- Implement the PDF-based workload-aware cache policy in vLLM and profile LRU vs. WA under Qwen2-7B, getting a 25% QTTFT reduction.
- Finish the first draft of graduation thesis.
Next Week
- Do full test for different cache policies and under different models.

Objective
- Serverless KVCache cache
Key Result
- Null
Last Week
- Labor Day vacation
Next Week
- TBD

Objective
- Serverless KVCache cache
Key Result
- Preprocess 5 days of trace for comparison
- Draw policy results on Trace A under 7B, 13B, 70B models
- Write the policy algorithm for the paper
- Prepare a version of the trace for open source
Last Week
- Get 5 workdays of trace and preprocess it for future simulator tests.
- Write the paper's policy design part and finish the pseudocode for our cache policy.
- Process and produce an anonymized trace for open source.
Next Week
- Finish writing the final version of the paper.

Objective
- Serverless KVCache cache
Key Result
- Prepare a repo for open source Qwen trace
- Write paper policy part
- Draw policy test figs
Last Week
- Prepare a trace repo for Ali's open source process.
- Write the paper's policy design and eval parts.
- Rerun the policy tests multiple times to draw figures with shaded error bars.
Next Week
- Finish the final version of the paper.

Objective
- Serverless KVCache cache
Key Result
- Refine a final version of KV$ cache for ATC'25
- Prepare graduation defense slides
Last Week
- Finish the final version of KV$ cache and send it to the shepherd.
- Finish the slides and submit materials for the graduation defense.
- Learn from ChinaSys'25.
Next Week
- Go for graduation defense.
- Polish for the camera ready version of KV$ cache.

Objectives
- Serverless KVCache cache
- MoE autoscaling
Key Results
- [10/10] Refine a final version of KV$ cache for ATC'25
- [10/10] Graduation thesis defense
- [2/10] Run MoE model in Ali
- [0/10] Analyze the pattern of expert loading in the Ali trace
Last Week
- Prepare and finish graduation defense.
- Polish the final version of KV$ cache and send to the shepherd.
- Run Qwen3-32B on latest vLLM.
Next Week
- Modify vLLM to support tracing the expert load pattern.

Objectives
- Serverless KVCache cache
- MoE autoscaling
Key Results
- [10/10] Refine a final version of KV$ cache for ATC'25
- [8/10] Run MoE model in Ali
- [0/10] Analyze the pattern of expert loading in the Ali trace
- [0/10] Fully understand how EP influences performance
Last Week
- Modify vLLM to support tracing the activated experts and test on Ali trace with Qwen3-32B.
- Prepare and submit KV$ cache to arXiv.
Next Week
- Analyze the expert pattern.
- Test on more MoE models.
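Tracing activated experts, as above, amounts to counting which experts land in each token's router top-k. A minimal sketch of that counting logic (illustrative; vLLM's actual internals and hook points differ):

```python
from collections import Counter

def count_expert_hits(router_logits, top_k=2):
    """Given per-token router scores (one list of scores per token, one score
    per expert), count how often each expert appears in the token's top-k."""
    hits = Counter()
    for scores in router_logits:
        topk = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
        hits.update(topk)
    return hits

logits = [
    [0.1, 0.7, 0.2, 0.0],   # token 1 routes to experts 1, 2
    [0.6, 0.5, 0.1, 0.3],   # token 2 routes to experts 0, 1
]
hits = count_expert_hits(logits, top_k=2)
```

Aggregating these counters per layer and per time window yields the expert-load pattern that the Ali-trace analysis needs.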

Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance
Key Results
- [0/10] Prepare slides for ATC'25 presentation w/ Jinbo
- [8/10] Run MoE models in Ali
- [5/10] Analyze the pattern of expert loading in the Ali trace
- [3/10] Analyze the expert pattern in different models
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Extend vLLM to support tracing the expert pattern with PP, distributed via Ray, for DeepSeek-671B.
- Analyze the expert pattern's temporal locality.
Next Week
- Extend the vLLM support fully to all models.
- Analyze the expert pattern's correlations between layers.

Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance
Key Results
- [5/10] Prepare slides for ATC'25 presentation w/ Jinbo
- [1/10] Survey MoE works and their observations
- [9/10] Analyze expert load balance's temporal locality
- [0/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Trace the expert pattern on the Qwen trace under Qwen3-235B and DeepSeek-671B.
- Analyze the expert pattern's temporal locality in large models (Qwen3-235B and DeepSeek-671B).
- Prepare KVCache slides.
- Handle all graduation-related misc.
Next Week
- Analyze the expert pattern's correlations between layers.
- Survey current MoE works for more observations to check.

Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance
Key Results
- [9/10] Prepare slides for ATC'25 presentation w/ Jinbo
- [6/10] Survey MoE works and their observations
- [9/10] Analyze expert load balance's temporal locality
- [0/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Survey MoE works and summarize their key points.
- Refine KVCache slides w/ Jinbo.
- Nit: support Ali machine usage and give a landing doc.
Next Week
- Check the feasibility of EP combinatory method.

Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance
Key Results
- [10/10] Prepare slides for ATC'25 presentation w/ Jinbo
- [6/10] Survey MoE works and their observations
- [9/10] Analyze expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Survey the architecture of Bailian: read their docs and learn about their gateway, cluster setup, and some serverless services.
- Refine KVCache slides w/ Jinbo and Dingyan.
Next Week
- Skip for one week.

Objectives
- Serverless KVCache cache
- MoE pattern feature
- EP design for inference performance
Key Results
- [10/10] Prepare slides for ATC'25 presentation w/ Jinbo
- [6/10] Survey MoE works and their observations
- [9/10] Analyze expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Go for vacation.
Next Week
- Understand the infrastructure of Bailian further.
- Review and write comments for at least 3 papers as a shadow PC.
- Learn about current MoE network features under different parallelism modes.

Objectives
- MoE pattern feature
- EP design for inference performance
Key Results
- [6/10] Survey MoE works and their observations
- [9/10] Analyze expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Survey the infrastructure of Bailian, especially model serving and batching.
- Give a KVCache cache talk at Ali w/ Jinbo.
- Review 2 papers as a shadow PC.
- Survey the agent workflow for potential system problems.
Next Week
- Survey scheduling across different parallelism setups.
- Review and write comments for all assigned papers.

Objectives
- MoE pattern feature
- EP design for inference performance
Key Results
- [6/10] Survey MoE works and their observations
- [9/10] Analyze expert load balance's temporal locality
- [4/10] Analyze correlations between MoE layers
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
Last Week
- Survey heterogeneous parallelism config setups for different workloads and SLOs.
- Finish reviews for all papers as a shadow PC.
Next Week
- Survey the chances and challenges of EP reconfiguration.
- Survey agentic AI infra.

Objectives
- Heterogeneous parallelism in cluster
- EP design for inference performance
Key Results
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)
Last Week
- [For KR1] Run the latest vLLM with different parallelism configurations (TP, PP, DP, EP) on Qwen-30B with fixed input/output lengths to measure their differences.
- [Misc] Write the AIR project conclusion docs for the Ali collaboration w/ Jinbo.
Next Week
- Test different parallelism configurations with the latest Ali trace.
- Analyze the performance pattern across different workloads.
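The sweep over parallelism configurations above implies an enumeration step: listing the (TP, PP, DP) triples whose product matches the GPU count. A small illustrative helper (assumed names; the real sweep would also filter by model constraints such as head and layer divisibility):

```python
from itertools import product

def parallel_configs(num_gpus, max_dim=8):
    """Enumerate (tp, pp, dp) triples whose product uses exactly all GPUs."""
    return [
        (tp, pp, dp)
        for tp, pp, dp in product(range(1, max_dim + 1), repeat=3)
        if tp * pp * dp == num_gpus
    ]

configs = parallel_configs(8)
```

For 8 GPUs this yields the 10 ordered factorizations of 8, which bounds how many benchmark runs the profiling pass needs.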

Objectives
- Heterogeneous parallelism in cluster
- EP design for inference performance
Key Results
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)
Last Week
- [For KR1] Read the vLLM code and understand how vLLM TP/PP/DP works.
- [For KR1] Run profiling tests with different configs over a more complete search space.
- [Surveying] Understand the bottleneck of autoscaling in Ali.
- [Surveying] Explore the opportunity to profile kernels and derive a best compute graph to guide the parallelism config.
- [Misc] Prepare slides for the AIR project conclusion defense.
Next Week
- Survey the possibility of a universal kernel-based parallelism config search (starting from related work around NanoFlow).
- Check the possibility of using GPU bubbles to run small models.
- Check the challenges of switching parallelism configs while keeping context.

Objectives
- Heterogeneous parallelism in cluster
- EP design for inference performance
Key Results
- [6/10] Profile vLLM to get the compute graph
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)
Last Week
- [Surveying] Learn about compute graph arrangement in traditional streaming/batch systems, compared to LLM inference systems.
- [KR1] Profile vLLM to get each kernel's time consumption and overlapping status.
- [Misc] Review 3 papers as a shadow PC for Round 2.
- [Misc] Prepare and finish the AIR project conclusion defense with slides.
Next Week
- Summarize a table of the similarities and challenges of compute graph arrangement optimization between traditional streaming systems and LLM inference systems.

Objectives
- Heterogeneous parallelism in cluster
- EP design for inference performance [untracked]
Key Results
- [6/10] Profile vLLM to get the compute graph
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP (static/dynamic) influences performance
- [4/10] Analyze correlations between MoE layers [suspended]
Last Week
- [KR2] Learn about Triton (vLLM has many kernels implemented in Triton); run a demo that compiles a Python Triton kernel to PTX, then loads and calls it from Rust.
- [KR2] Try a demo that runs vLLM's flash-attention from Rust.
Next Week
- Find a way to capture the full compute flow and data flow in vLLM, then replay it in Rust.

Objectives
- Auto distributed LLM inference config optimization
Key Results
- [3/10] Implement a minimal Rust inference framework
- [0/10] Trace the vLLM compute graph and data flow
- [6/10] Profile vLLM to get the compute graph
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Learn candle and implement a simple LLM inference in it.
- [KR1] Debug a float precision problem in candle, trying to find the root cause: the kernel library or Rust float precision.
Next Week
- Think about the structure of the inference framework.
- Continue the Rust implementation.

Objectives
- Auto distributed LLM inference config optimization
Key Results
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [0/10] Trace the vLLM compute graph and data flow
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR2] Rethink the project target and the definition of the IR for automatic distribution optimization.
- [KR2] Learn some category theory for IR abstraction.
- [KR2] Survey TVM and MLC LLM to learn about their IR abstractions.
Next Week
- Profile the compute and communication time of kernels to show the bubbles in micro-batches under different models and input lengths.

Objectives
- Auto distributed LLM inference config optimization
Key Results
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [0/10] Trace the vLLM compute graph and data flow
- [2/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- Theoretically analyze the dual batch overlap (DBO) optimization to show that different models on different hardware should apply different execution flows.
- Survey DBO and hybrid KVCache management in vLLM.
- Write a bottom-up roadmap of things to do: https://ipads.se.sjtu.edu.cn:1312/wangjh/infer-framework/-/issues/3.
Next Week
- Go through the vLLM codebase to assess the feasibility and challenges of automatically applying an execution flow for different models.

Objectives
- Auto distributed LLM inference config optimization
Key Results
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Learn how vLLM implements DBO; check the feasibility of applying an execution flow automatically from a generated config.
- [misc] Write a paper commentary for SOSP.
Next Week
- Summarize the optimizations in Qwen.
- Profile the model's different stages (modules) and analyze the overlap status.

Objectives
- Auto distributed LLM inference config optimization
Key Results
- [5/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Summarize the optimizations for Qwen models around fused_moe kernels, attention optimization, and data copy reduction.
- [KR1] Survey Ali's workflow for parallelism config search.
- [misc] Finish 3 homework assignments for courses.
Next Week
- Explore the possibility of searching configs automatically with AI, in the style of AlphaEvolve.

Objectives
- Auto distributed LLM inference config optimization
Key Results
- [2/10] Build the first version of the auto tuner system
- [5/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Design the auto tuner system structure and draw a figure.
- [KR1] Write code for hardware prober and workload profiler.
Next Week
- Continue building the system's config generator and config tuner.

Objectives
- Auto LLM inference config tuner
Key Results
- [4/10] Build the first version of the auto tuner system
- [5/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Write code for a basic config generator and a benchmark to check performance.
- [KR1] Try to find a way to tune the config for better performance.
Next Week
- Benchmark the baseline and some human-tuned configs to demonstrate the necessity of config tuning.
- Continue designing an approach for auto tuning.

Objectives
- Auto LLM inference config tuner
Key Results
- [5/10] Build the first version of the auto tuner system
- [5/10] Check the current state of parallelism config optimization
- [4/10] Understand the possibilities/challenges of arranging the LLM inference compute graph automatically
- [0/10] Trace the vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with the real trace and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Struggle to prepare the running environment for Qwen3-Max-fp4; fix or bypass many dependency and code problems.
- [misc] Prepare the first version of the review agent w/ Yingyi. [0b288d64](https://ipads.se.sjtu.edu.cn:1312/shadowpc/deep-review/-/commit/0b288d643301edcb19be6baf394710ce35a2dd74) ~ [57093ff4](https://ipads.se.sjtu.edu.cn:1312/shadowpc/deep-review/-/commit/57093ff4a5782dbfa6e40456b9c0825df5576f8b)
Next Week
- Think about the insight behind our system's target.
- Continue implementing the tuner part of our system.

Objectives
- Auto LLM inference config tuner
Key Results
- [6/10] Build the first version auto tuner system
- [7/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [0/10] Trace vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR2] Benchmark different configs on different hardware, proving that different hardware and different workloads cause different performance trends. [5f2c1ec3](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/5f2c1ec3692586031f3ecd452709a034d8217113) ~ [65d05520](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/65d0552020041d5922e13172d9c40f8ef93a3985)
- [KR1] Build a precise workload generator from real workloads. Benchmark on _quite similar_ generated workloads and find that even similar workloads still yield different performance.
Next Week
- Find the root cause of performance gap under similar workloads.

View File

@@ -0,0 +1,19 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [7/10] Build the first version auto tuner system
- [7/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [0/10] Trace vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Update the workload generator from real workloads to support not only timestamps but also input_length, output_length, and KVCache hit ratio. Then benchmark to check whether an abstract spec can replay similar performance. [b0bcfa63](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/b0bcfa6326f69755aaaf859d89ad2def2409cd48)~[fb1f0848](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/fb1f084815342d6b8379f3b191ed152a3c1cda67)
- [KR1] Check the root cause of the performance gap under similar workloads; the difference mainly comes from differing inference load.
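The spec abstraction described above (arrival timestamps, input/output lengths, KVCache hit ratio) could be sketched roughly as follows; all names and fields here are illustrative guesses, not the actual generator code:

```python
import random
from dataclasses import dataclass

@dataclass
class RequestSpec:
    """One synthetic request derived from a real trace entry."""
    arrival_s: float        # arrival timestamp, seconds from trace start
    input_len: int          # prompt tokens
    output_len: int         # generated tokens
    cache_hit_ratio: float  # fraction of prompt tokens expected to hit KVCache

def generate_workload(trace, jitter=0.05, seed=0):
    """Replay a real trace as a list of RequestSpec, with small jitter so
    repeated runs are similar but not identical. `trace` is assumed to be a
    list of dicts with ts / in_len / out_len / hit keys."""
    rng = random.Random(seed)
    specs = []
    for req in trace:
        specs.append(RequestSpec(
            arrival_s=req["ts"] * (1 + rng.uniform(-jitter, jitter)),
            input_len=max(1, round(req["in_len"] * (1 + rng.uniform(-jitter, jitter)))),
            output_len=max(1, round(req["out_len"] * (1 + rng.uniform(-jitter, jitter)))),
            cache_hit_ratio=req["hit"],
        ))
    return specs
```

With `jitter=0.0` the generator degenerates into an exact replay, which is a useful sanity check before comparing against the raw trace.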
Next Week
- Update the workload abstraction spec for more precise replayed performance.

View File

@@ -0,0 +1,20 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [7/10] Build the first version auto tuner system
- [7/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [0/10] Trace vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Update the workload generator from real workloads to give a more precise spec abstraction. [c969f366](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/c969f366b05cad03447e1d7bdd9f30785dd792e4)~[7407149d](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/7407149d1052d3d610fd1fb3e51ce60068ba4981)
- [KR1] Benchmark and compare generated workloads against raw workloads. Find that generating input/output lengths makes performance vary a lot.
Next Week
- Find the root cause of workload performance variation.
- Summarize the intelligence needed for the auto-tuning path.

View File

@@ -0,0 +1,20 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [9/10] Build the first version auto tuner system
- [7/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [0/10] Trace vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- [KR1] Study and summarize the system intelligence; learn the basic way to implement an auto tuner.
- [KR1] Implement the naive auto tuner framework, which runs vLLM with sampled configs, then aggregates the benchmark results as context for the LLM to propose evolved configs. [ad0b0fc3](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/ad0b0fc3eb3dea5f91a2c75efc69894fac011301)~[420afa3c](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/420afa3c7a48d19e2d864f212db0efcd86b40ca8)
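The loop sketched above (run vLLM with sampled configs, aggregate results as LLM context, get proposals, evolve) might look like this minimal sketch; `benchmark` and `propose` are placeholders for the real vLLM runner and the LLM call, and the structure is an assumption, not the actual framework:

```python
import random

def tune(benchmark, propose, search_space, rounds=3, samples_per_round=4, seed=0):
    """Naive evolutionary tuning loop: sample configs, benchmark them, feed
    the accumulated (config, goodput) history to a proposal function (an LLM
    in the real system), and keep the best config seen so far.
    search_space: dict of flag name -> list of candidate values."""
    rng = random.Random(seed)
    history = []  # (config, goodput) pairs: the "context" for proposals
    # initial random population
    pool = [{k: rng.choice(v) for k, v in search_space.items()}
            for _ in range(samples_per_round)]
    for _ in range(rounds):
        for cfg in pool:
            history.append((cfg, benchmark(cfg)))
        pool = propose(history, samples_per_round)  # evolved candidates
    return max(history, key=lambda t: t[1])        # best (config, goodput)
```

A trivial `propose` that just clones the current best already makes the loop run end to end; the interesting part is swapping in an LLM that reads the history and mutates promising configs.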
Next Week
- Benchmark and summarize the performance of the auto tuner vs. an expert.
- Survey heterogeneous hardware utilization in Ali.

View File

@@ -0,0 +1,20 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [9/10] Build the first version auto tuner system
- [7/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [0/10] Trace vLLM compute graph and data flow
- [3/10] Implement a minimal Rust inference framework
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup [offtrack]
Last Week
- Refine the story. Currently heterogeneous workloads are classified by labels or input length, which is not enough; we should define a classification method based on grouping workloads with similar performance under the same config.
- Prepare slides to summarize the story and what to do next.
- Prepare slides for IPADS group meeting.
Next Week
- Run benchmarks for the current workload classification to prove that different classes need different configs to maximize goodput.
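One hedged reading of "grouping by similar performance under the same config": probe every workload with the same small set of configs, treat the resulting goodput vector as a signature, and group workloads whose signatures are close. A minimal illustrative sketch (greedy threshold clustering; names and tolerance are assumptions):

```python
def group_by_performance(signatures, rel_tol=0.1):
    """Greedily group workloads whose per-config goodput vectors differ by
    less than rel_tol (relative) on every probe config.
    signatures: dict of workload name -> list of goodputs, one per config."""
    groups = []  # each group: (representative vector, [workload names])
    for name, vec in signatures.items():
        for rep, members in groups:
            if all(abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-9)
                   for a, b in zip(rep, vec)):
                members.append(name)  # close to this group's representative
                break
        else:
            groups.append((vec, [name]))  # start a new group
    return [members for _, members in groups]
```

The point of the sketch is only that performance-based grouping is mechanically cheap once the probe benchmarks exist; the real classification space is what the Next Week item sets out to define.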

View File

@@ -0,0 +1,18 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [2/10] Workload grouping methods
- [9/10] Build the first version auto tuner system
- [8/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
Last Week
- [KR1] Run benchmarks for different workload classifications and prove that the classification method shifts the best config, and that different workload groups need different configs to maximize goodput.
- [misc] Prepare for IPADS group meeting presentation.
- [misc] Prepare for the ChinaSys presentation.
Next Week
- Define the workload classification space and find a method to group workloads.

View File

@@ -0,0 +1,17 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [2/10] Workload grouping methods
- [9/10] Build the first version auto tuner system
- [8/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
Last Week
- [KR2] Run the AI Tuner under the current Ali workload groups (input length / label), and try to find insights for building a better AI Tuner.
- [misc] Build system for EuroSys Shadow experiment.
Next Week
- Compare the AI Tuner results with Ali's current situation to find more insights for AI Tuner.

View File

@@ -0,0 +1,21 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [4/10] Build the agentic tuner system
- [10/10] Build the first version auto tuner system
- [2/10] Workload grouping methods
- [8/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
Last Week
- [KR1] Refactor the first version of the auto tuner system to make it more agentic. [4e3b15b6](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/4e3b15b60819fb61d04148302be68bb66e9dda7b) ~ [095c1edd](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/095c1edda49bfd8dad70bed20e81564c29ae3e8a)
- Support a tool library for our tuner system to call
- Speed up the tuning time
- Support early stop for bad configs
- Support the LLM to predict performance trends and reflect on results
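The early-stop item above could work roughly like this (a sketch with assumed thresholds, not the actual AITuner code): abort a config's benchmark once its partial goodput clearly trails the best completed run.

```python
def benchmark_with_early_stop(run_step, steps, best_so_far, cutoff=0.5, warmup=3):
    """Run a benchmark step by step; abort once the running-average goodput
    falls below cutoff * best_so_far (after a short warmup).
    run_step: callable returning the goodput observed in one step.
    Returns (avg_goodput, completed); completed is False if stopped early."""
    total = 0.0
    for i in range(1, steps + 1):
        total += run_step()
        avg = total / i
        if i >= warmup and best_so_far > 0 and avg < cutoff * best_so_far:
            return avg, False  # clearly a bad config: stop wasting GPU time
    return total / steps, True
```

The `cutoff` and `warmup` values are illustrative knobs; the trade-off is GPU time saved versus the risk of killing a config that warms up slowly (e.g. while its KVCache fills).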
Next Week
- Summarize the advantages of the agentic tuner system and continue to optimize it.

View File

@@ -0,0 +1,20 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [6/10] Build the agentic tuner system
- [10/10] Build the first version auto tuner system
- [2/10] Workload grouping methods
- [8/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
Last Week
- [KR1] Update the agentic AITuner to support new trace benchmarks / new vLLM flags / objective scores. [0a012bdd](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/0a012bdda53086cd24277962abb0cb559bd313bb) ~ [788da3d8](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/788da3d8bc546620e8c76800dfb7070372cb3540)
- [KR1] Survey related work. Some works build an agent for LLM training / storage systems / ..., and one work uses BO for LLM inference config tuning.
- [misc] Prepare an open-sourced version of the new traces (thinking and coder) and update the readme.
Next Week
- Optimize the agentic AITuner.
- Test SCOOT as one of the baselines.

View File

@@ -0,0 +1,21 @@
Objectives
- Auto LLM inference config tuner
Key Results
- [8/10] Build the agentic tuner system
- [2/10] Paper outline
- [10/10] Build the first version auto tuner system
- [2/10] Workload grouping methods
- [8/10] Check the current situation of parallelism config optimization
- [4/10] Understand the possibility/challenges in LLM inference compute graph arrangement automatically
- [1/10] Define the IR for automatic optimization
- [5/10] Profile different parallelism setups with real traces and analyze their differences
Last Week
- [KR1] Update the agentic AITuner to support DP vs replicas and early-stop error handling; fix problems/illegal constraints in the large search space. [6c0940e7](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/6c0940e7b0a234265290398fe0a7ca7b7f3d4178) ~ [0cbc1727](https://ipads.se.sjtu.edu.cn:1312/wangjh/auto-tuner/-/commit/0cbc1727c06589ea9b021b223883d0fd114fd4c7)
- [KR2] Prepare a draft paper outline; summarize the current story and what to do next.
- [misc] Prepare a [paper template](https://ipads.se.sjtu.edu.cn:1312/wangjh/paper-ai-tuner).
- [misc] Open source our new trace and trace-replayer at https://github.com/alibaba-edu/qwen-bailian-usagetraces-anon.
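The "illegal constraints in the large search space" item from this week suggests a cheap legality filter before benchmarking. A sketch under assumed hardware numbers (8 GPUs, 80 GB each, ~140 GB model); all thresholds and field names are illustrative:

```python
def is_legal(cfg, num_gpus=8, gpu_mem_gb=80, model_mem_gb=140):
    """Reject obviously illegal parallelism configs before spending GPU time.
    cfg: dict with optional tp / pp / dp degrees (default 1)."""
    tp, pp, dp = cfg.get("tp", 1), cfg.get("pp", 1), cfg.get("dp", 1)
    if tp * pp * dp > num_gpus:
        return False  # needs more GPUs than the node has
    if model_mem_gb / (tp * pp) > gpu_mem_gb:
        return False  # a model shard does not fit in one GPU's memory
    return True
```

Filtering the cross-product of candidate flags with a check like this shrinks the space the tuner actually benchmarks, and turns "illegal config" from a runtime crash into an explicit skip.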
Next Week
- Compare to Ali production environment's configs.