Add ReplayServe Frontier vLLM alignment report

2026-06-25 17:10:30 +08:00
commit a99bd00782
63 changed files with 17033 additions and 0 deletions
--- a/docs/assets/frontier_vllm_alignment/completion_prefix.png
+++ b/docs/assets/frontier_vllm_alignment/completion_prefix.png
--- a/docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.csv
+++ b/docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.csv
@@ -0,0 +1,10 @@
+run_id,label,tp,request_count,scale_label,scale_value,fixture,kv_blocks,frontier_completed,frontier_total,frontier_complete,vllm_completed,vllm_total,frontier_preemptions,vllm_preemptions,frontier_prefix_hit,vllm_prefix_hit,prefix_hit_delta,frontier_rps,vllm_rps,rps_ratio,frontier_total_tps,vllm_total_tps,total_tps_ratio,frontier_decode_tps,vllm_decode_tps,decode_tps_ratio,frontier_ttft_p50_s,vllm_ttft_p50_s,ttft_p50_ratio,frontier_ttft_p95_s,vllm_ttft_p95_s,ttft_p95_ratio,frontier_tpot_p50_s,vllm_tpot_p50_s,tpot_p50_ratio,frontier_tpot_p95_s,vllm_tpot_p95_s,tpot_p95_ratio,frontier_e2e_p50_s,vllm_e2e_p50_s,e2e_p50_ratio,frontier_e2e_p95_s,vllm_e2e_p95_s,e2e_p95_ratio,notes
+tp1_n100_scale1,TP1 N100 raw,1,100,raw,1,coder_100,15281,96,100,false,100,100,0,8,0.2487845616,0.2510820686,-0.002297507075,0.4048148795,0.6879880691,0.588403924,2348.908821,3832.320581,0.6129207541,347.7992338,567.4456795,0.6129207541,0.9087481136,4.503025495,0.201808343,12.76295815,29.06046906,0.4391862402,0.05688966428,0.06608134396,0.8609035603,0.1456880793,0.6211491471,0.2345460505,30.93928316,41.84076733,0.7394530534,119.6361376,97.36622969,1.228723121,Frontier incomplete before lifecycle fix; included as TP1 100-request baseline.
+tp1_n500_scale1,TP1 N500 raw,1,500,raw,1,coder_500,15281,439,500,false,500,500,0,63,0.1192374692,0.3868498695,-0.2676124002,0.660990472,0.8401719451,0.7867323776,4733.748762,5282.903731,0.896050544,656.2204998,732.3476384,0.896050544,136.7755789,185.6581683,0.7367064976,340.2371222,375.8950067,0.9051387119,0.05643274739,0.04975253624,1.134268756,0.08942839773,0.0918798539,0.9733188935,177.7998574,224.2697872,0.7927945162,397.29145,417.3562933,0.9519239469,Frontier incomplete; useful as high-pressure stress signal.
+tp1_n200_scale0667,TP1 N200 scale 0.667,1,200,0.667,0.6666666667,coder_200_ts0667,15281,176,200,false,200,200,0,26,0.170276008,0.2697549478,-0.09947893984,0.5830903706,0.8236788215,0.7079098737,3913.437526,4864.778909,0.8044430383,593.287826,737.51378,0.8044430383,20.58014532,34.56323652,0.595434554,96.71793818,120.8039818,0.800618794,0.05837096651,0.05145431897,1.13442307,0.235894569,0.2534757496,0.9306395954,73.20731169,83.6219905,0.875455263,189.2402903,183.726977,1.030008186,Dense-arrival run; Frontier incomplete before lifecycle fix.
+tp1_n200_scale2,TP1 N200 scale 2,1,200,2,2,coder_200_ts2,15281,200,200,true,200,200,33,43,0.23134169,0.2697549478,-0.03841325784,0.5936627655,0.8029813635,0.7393232178,3506.267279,4742.53641,0.7393232178,531.5597036,718.9814831,0.7393232178,9.595321274,9.216767096,1.041072338,77.50341053,69.21141595,1.119806747,0.05421362546,0.04970337519,1.09074334,0.06653162646,0.06863309532,0.9693811149,61.45769412,55.00248734,1.117362088,174.4840836,142.3375087,1.225847531,After Frontier decode-preemption lifecycle fix.
+tp1_n200_scale3,TP1 N200 scale 3,1,200,3,3,coder_200_ts3,15281,200,200,true,200,200,20,16,0.2176751278,0.2697549478,-0.05207982007,0.5739781652,0.7802265504,0.735655772,3390.00688,4608.142843,0.735655772,513.9343094,698.607051,0.735655772,1.001474116,1.166151478,0.8587856162,45.9466567,32.25842447,1.424330464,0.05339333437,0.04616159714,1.156661331,0.06861254671,0.0713836296,0.9611804148,44.76058145,33.21267588,1.34769573,154.5483135,122.7887113,1.258652459,After Frontier decode-preemption lifecycle fix.
+tp2_n200_scale2,TP2 N200 scale 2,2,200,2,2,coder_200_ts2,69055,200,200,true,200,200,0,0,0.2697549478,0.2697549478,0,0.7756823572,1.277818683,0.607036325,4581.304111,7547.001591,0.607036325,694.5382258,1144.14607,0.607036325,0.2690959621,0.225119116,1.195349231,6.744624223,0.715071776,9.432094022,0.04295527658,0.03004499679,1.429698158,0.05288764732,0.04340382318,1.218502046,26.05122482,16.44861007,1.583794905,106.7591651,72.5347179,1.471835394,Uses true-mixed TP2/TP4 attention profile.
+tp2_n200_scale3,TP2 N200 scale 3,2,200,3,3,coder_200_ts3,69055,200,200,true,200,200,0,0,0.2697549478,0.2697549478,0,0.6877705321,1.088050278,0.6321128225,4062.082806,6426.199028,0.6321128225,615.8228567,974.2293382,0.6321128225,0.1341535495,0.153530943,0.8737883511,0.5741378218,0.6270455511,0.9156237864,0.03937896849,0.01905767256,2.06630523,0.04670767225,0.02799082097,1.668678182,21.78596494,9.956003374,2.188223941,101.5918393,53.98348621,1.881905864,Uses true-mixed TP2/TP4 attention profile.
+tp4_n200_scale2,TP4 N200 scale 2,4,200,2,2,coder_200_ts2,177077,200,200,true,200,200,0,0,0.2697549478,0.2697549478,0,0.8525337931,1.536203537,0.5549614829,5035.200987,9073.063884,0.5549614829,763.350233,1375.501285,0.5549614829,0.09755515041,0.1704972619,0.5721801589,0.3856872342,1.419861408,0.2716372401,0.03366585047,0.01634437735,2.059781767,0.03838265621,0.02831690026,1.355468143,18.65216282,9.260885488,2.014079846,84.93775414,43.62188903,1.947136083,Uses true-mixed TP2/TP4 attention profile.
+tp4_n200_scale3,TP4 N200 scale 3,4,200,3,3,coder_200_ts3,177077,200,200,true,200,200,0,0,0.2697549478,0.2697549478,0,0.7373665172,1.253504493,0.5882440162,4355.004629,7403.398096,0.5882440162,660.2306059,1122.375388,0.5882440162,0.08859749135,0.100106278,0.885034317,0.3458954617,0.3184188101,1.086290919,0.03106778109,0.009410284212,3.301471071,0.03578285082,0.01279276668,2.79711588,16.90291941,5.54948732,3.045852424,83.00995365,27.86907583,2.978568581,Uses true-mixed TP2/TP4 attention profile.
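The ratio columns in the CSV above appear to be the Frontier value divided by the vLLM value, and `prefix_hit_delta` appears to be Frontier minus vLLM. The header does not state this, so it is an inference from the numbers. A minimal Python sketch that re-derives three columns of the `tp1_n100_scale1` row under that assumption (the helpers `ratio`, `delta`, and `matches` are hypothetical names, not part of the repository):

```python
import math

# Values copied from the tp1_n100_scale1 row of frontier_vllm_alignment.csv.
row = {
    "frontier_rps": 0.4048148795,
    "vllm_rps": 0.6879880691,
    "rps_ratio": 0.588403924,
    "frontier_ttft_p50_s": 0.9087481136,
    "vllm_ttft_p50_s": 4.503025495,
    "ttft_p50_ratio": 0.201808343,
    "frontier_prefix_hit": 0.2487845616,
    "vllm_prefix_hit": 0.2510820686,
    "prefix_hit_delta": -0.002297507075,
}


def ratio(row, metric):
    """Assumed definition: Frontier value divided by vLLM value."""
    return row[f"frontier_{metric}"] / row[f"vllm_{metric}"]


def delta(row, metric):
    """Assumed definition: Frontier value minus vLLM value."""
    return row[f"frontier_{metric}"] - row[f"vllm_{metric}"]


def matches(computed, recorded, rel_tol=1e-6):
    """Compare with a tolerance because the CSV stores rounded values."""
    return math.isclose(computed, recorded, rel_tol=rel_tol)


checks = {
    "rps_ratio": matches(ratio(row, "rps"), row["rps_ratio"]),
    "ttft_p50_ratio": matches(ratio(row, "ttft_p50_s"), row["ttft_p50_ratio"]),
    "prefix_hit_delta": matches(delta(row, "prefix_hit"), row["prefix_hit_delta"]),
}
for column, ok in checks.items():
    print(f"{column}: {'consistent' if ok else 'MISMATCH'}")
```

Under this reading, a ratio below 1 means the Frontier value is smaller than the vLLM value for that metric, whatever the metric's direction of "better" happens to be.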
--- a/docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.json
+++ b/docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.json
@@ -0,0 +1,434 @@
+[
+  {
+    "decode_tps_ratio": 0.6129207541251469,
+    "e2e_p50_ratio": 0.7394530533987747,
+    "e2e_p95_ratio": 1.2287231205931113,
+    "fixture": "coder_100",
+    "frontier_complete": false,
+    "frontier_completed": 96,
+    "frontier_decode_tps": 347.79923381681954,
+    "frontier_e2e_p50_s": 30.939283157873398,
+    "frontier_e2e_p95_s": 119.6361375789676,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.24878456156190046,
+    "frontier_rps": 0.4048148795016268,
+    "frontier_total": 100,
+    "frontier_total_tps": 2348.908820556559,
+    "frontier_tpot_p50_s": 0.056889664283438265,
+    "frontier_tpot_p95_s": 0.14568807925543142,
+    "frontier_ttft_p50_s": 0.9087481136376141,
+    "frontier_ttft_p95_s": 12.762958146117297,
+    "kv_blocks": 15281,
+    "label": "TP1 N100 raw",
+    "notes": "Frontier incomplete before lifecycle fix; included as TP1 100-request baseline.",
+    "prefix_hit_delta": -0.0022975070751777016,
+    "request_count": 100,
+    "rps_ratio": 0.5884039239601411,
+    "run_id": "tp1_n100_scale1",
+    "scale_label": "raw",
+    "scale_value": 1.0,
+    "total_tps_ratio": 0.6129207541251469,
+    "tp": 1,
+    "tpot_p50_ratio": 0.8609035602986401,
+    "tpot_p95_ratio": 0.23454605053898236,
+    "ttft_p50_ratio": 0.20180834300191677,
+    "ttft_p95_ratio": 0.439186240241972,
+    "vllm_completed": 100,
+    "vllm_decode_tps": 567.445679520595,
+    "vllm_e2e_p50_s": 41.84076732886024,
+    "vllm_e2e_p95_s": 97.36622968502343,
+    "vllm_preemptions": 8,
+    "vllm_prefix_hit": 0.25108206863707816,
+    "vllm_rps": 0.6879880691092217,
+    "vllm_total": 100,
+    "vllm_total_tps": 3832.3205810011714,
+    "vllm_tpot_p50_s": 0.06608134395878643,
+    "vllm_tpot_p95_s": 0.6211491471318447,
+    "vllm_ttft_p50_s": 4.503025494981557,
+    "vllm_ttft_p95_s": 29.060469059972093
+  },
+  {
+    "decode_tps_ratio": 0.8960505440100501,
+    "e2e_p50_ratio": 0.7927945162118318,
+    "e2e_p95_ratio": 0.951923946910999,
+    "fixture": "coder_500",
+    "frontier_complete": false,
+    "frontier_completed": 439,
+    "frontier_decode_tps": 656.2204997652797,
+    "frontier_e2e_p50_s": 177.7998574092898,
+    "frontier_e2e_p95_s": 397.29145000151055,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.11923746923408568,
+    "frontier_rps": 0.6609904720097601,
+    "frontier_total": 500,
+    "frontier_total_tps": 4733.748762075876,
+    "frontier_tpot_p50_s": 0.05643274739314083,
+    "frontier_tpot_p95_s": 0.08942839772817235,
+    "frontier_ttft_p50_s": 136.77557892500107,
+    "frontier_ttft_p95_s": 340.237122196321,
+    "kv_blocks": 15281,
+    "label": "TP1 N500 raw",
+    "notes": "Frontier incomplete; useful as high-pressure stress signal.",
+    "prefix_hit_delta": -0.2676124002320734,
+    "request_count": 500,
+    "rps_ratio": 0.786732377640824,
+    "run_id": "tp1_n500_scale1",
+    "scale_label": "raw",
+    "scale_value": 1.0,
+    "total_tps_ratio": 0.8960505440100502,
+    "tp": 1,
+    "tpot_p50_ratio": 1.134268756171416,
+    "tpot_p95_ratio": 0.9733188934802052,
+    "ttft_p50_ratio": 0.736706497600023,
+    "ttft_p95_ratio": 0.9051387119015346,
+    "vllm_completed": 500,
+    "vllm_decode_tps": 732.3476383692921,
+    "vllm_e2e_p50_s": 224.26978715602309,
+    "vllm_e2e_p95_s": 417.3562933159992,
+    "vllm_preemptions": 63,
+    "vllm_prefix_hit": 0.38684986946615907,
+    "vllm_rps": 0.8401719451179492,
+    "vllm_total": 500,
+    "vllm_total_tps": 5282.903730956031,
+    "vllm_tpot_p50_s": 0.049752536236317216,
+    "vllm_tpot_p95_s": 0.09187985389702198,
+    "vllm_ttft_p50_s": 185.6581683079712,
+    "vllm_ttft_p95_s": 375.8950067239348
+  },
+  {
+    "decode_tps_ratio": 0.8044430383408974,
+    "e2e_p50_ratio": 0.8754552629577944,
+    "e2e_p95_ratio": 1.030008185534932,
+    "fixture": "coder_200_ts0667",
+    "frontier_complete": false,
+    "frontier_completed": 176,
+    "frontier_decode_tps": 593.287826008356,
+    "frontier_e2e_p50_s": 73.20731168652793,
+    "frontier_e2e_p95_s": 189.24029025053343,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.17027600800712456,
+    "frontier_rps": 0.5830903705506575,
+    "frontier_total": 200,
+    "frontier_total_tps": 3913.43752605849,
+    "frontier_tpot_p50_s": 0.05837096651496554,
+    "frontier_tpot_p95_s": 0.23589456903741046,
+    "frontier_ttft_p50_s": 20.58014532403832,
+    "frontier_ttft_p95_s": 96.7179381828816,
+    "kv_blocks": 15281,
+    "label": "TP1 N200 scale 0.667",
+    "notes": "Dense-arrival run; Frontier incomplete before lifecycle fix.",
+    "prefix_hit_delta": -0.09947893983522305,
+    "request_count": 200,
+    "rps_ratio": 0.7079098737399896,
+    "run_id": "tp1_n200_scale0667",
+    "scale_label": "0.667",
+    "scale_value": 0.6666666666666666,
+    "total_tps_ratio": 0.8044430383408974,
+    "tp": 1,
+    "tpot_p50_ratio": 1.1344230703885074,
+    "tpot_p95_ratio": 0.930639595403931,
+    "ttft_p50_ratio": 0.5954345540217358,
+    "ttft_p95_ratio": 0.800618794003408,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 737.5137800085473,
+    "vllm_e2e_p50_s": 83.62199050490744,
+    "vllm_e2e_p95_s": 183.7269770358689,
+    "vllm_preemptions": 26,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 0.8236788215286605,
+    "vllm_total": 200,
+    "vllm_total_tps": 4864.778908559713,
+    "vllm_tpot_p50_s": 0.051454318973762715,
+    "vllm_tpot_p95_s": 0.2534757495838373,
+    "vllm_ttft_p50_s": 34.563236522022635,
+    "vllm_ttft_p95_s": 120.80398175423034
+  },
+  {
+    "decode_tps_ratio": 0.7393232177681209,
+    "e2e_p50_ratio": 1.1173620884379967,
+    "e2e_p95_ratio": 1.2258475306262637,
+    "fixture": "coder_200_ts2",
+    "frontier_complete": true,
+    "frontier_completed": 200,
+    "frontier_decode_tps": 531.5597035900641,
+    "frontier_e2e_p50_s": 61.45769412455945,
+    "frontier_e2e_p95_s": 174.48408358603848,
+    "frontier_preemptions": 33,
+    "frontier_prefix_hit": 0.23134168999974056,
+    "frontier_rps": 0.5936627654877362,
+    "frontier_total": 200,
+    "frontier_total_tps": 3506.267279013048,
+    "frontier_tpot_p50_s": 0.054213625462090735,
+    "frontier_tpot_p95_s": 0.06653162646338621,
+    "frontier_ttft_p50_s": 9.595321273711544,
+    "frontier_ttft_p95_s": 77.50341053197451,
+    "kv_blocks": 15281,
+    "label": "TP1 N200 scale 2",
+    "notes": "After Frontier decode-preemption lifecycle fix.",
+    "prefix_hit_delta": -0.038413257842607046,
+    "request_count": 200,
+    "rps_ratio": 0.7393232177681209,
+    "run_id": "tp1_n200_scale2",
+    "scale_label": "2",
+    "scale_value": 2.0,
+    "total_tps_ratio": 0.7393232177681209,
+    "tp": 1,
+    "tpot_p50_ratio": 1.0907433399899442,
+    "tpot_p95_ratio": 0.9693811149298648,
+    "ttft_p50_ratio": 1.0410723384685256,
+    "ttft_p95_ratio": 1.1198067467817787,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 718.9814830849542,
+    "vllm_e2e_p50_s": 55.002487340942025,
+    "vllm_e2e_p95_s": 142.3375087250024,
+    "vllm_preemptions": 43,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 0.8029813635231063,
+    "vllm_total": 200,
+    "vllm_total_tps": 4742.53640998563,
+    "vllm_tpot_p50_s": 0.049703375188695206,
+    "vllm_tpot_p95_s": 0.06863309532102842,
+    "vllm_ttft_p50_s": 9.216767095960677,
+    "vllm_ttft_p95_s": 69.2114159471821
+  },
+  {
+    "decode_tps_ratio": 0.7356557719569122,
+    "e2e_p50_ratio": 1.3476957295017153,
+    "e2e_p95_ratio": 1.258652459348984,
+    "fixture": "coder_200_ts3",
+    "frontier_complete": true,
+    "frontier_completed": 200,
+    "frontier_decode_tps": 513.9343093668691,
+    "frontier_e2e_p50_s": 44.76058145123308,
+    "frontier_e2e_p95_s": 154.54831351855702,
+    "frontier_preemptions": 20,
+    "frontier_prefix_hit": 0.21767512777477313,
+    "frontier_rps": 0.573978165231764,
+    "frontier_total": 200,
+    "frontier_total_tps": 3390.0068803652352,
+    "frontier_tpot_p50_s": 0.053393334371887605,
+    "frontier_tpot_p95_s": 0.06861254670772189,
+    "frontier_ttft_p50_s": 1.0014741156186515,
+    "frontier_ttft_p95_s": 45.94665669959886,
+    "kv_blocks": 15281,
+    "label": "TP1 N200 scale 3",
+    "notes": "After Frontier decode-preemption lifecycle fix.",
+    "prefix_hit_delta": -0.05207982006757447,
+    "request_count": 200,
+    "rps_ratio": 0.7356557719569123,
+    "run_id": "tp1_n200_scale3",
+    "scale_label": "3",
+    "scale_value": 3.0,
+    "total_tps_ratio": 0.7356557719569123,
+    "tp": 1,
+    "tpot_p50_ratio": 1.1566613307426805,
+    "tpot_p95_ratio": 0.9611804148017213,
+    "ttft_p50_ratio": 0.8587856162345445,
+    "ttft_p95_ratio": 1.4243304641052532,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 698.607050957755,
+    "vllm_e2e_p50_s": 33.2126758818049,
+    "vllm_e2e_p95_s": 122.78871134808287,
+    "vllm_preemptions": 16,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 0.7802265503945264,
+    "vllm_total": 200,
+    "vllm_total_tps": 4608.1428428781355,
+    "vllm_tpot_p50_s": 0.04616159713544178,
+    "vllm_tpot_p95_s": 0.07138362959869063,
+    "vllm_ttft_p50_s": 1.1661514779552817,
+    "vllm_ttft_p95_s": 32.25842447206378
+  },
+  {
+    "decode_tps_ratio": 0.6070363250137228,
+    "e2e_p50_ratio": 1.5837949050918096,
+    "e2e_p95_ratio": 1.4718353941122981,
+    "fixture": "coder_200_ts2",
+    "frontier_complete": true,
+    "frontier_completed": 200,
+    "frontier_decode_tps": 694.538225813865,
+    "frontier_e2e_p50_s": 26.05122481685102,
+    "frontier_e2e_p95_s": 106.75916510714146,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.2697549478423476,
+    "frontier_rps": 0.7756823572006221,
+    "frontier_total": 200,
+    "frontier_total_tps": 4581.304110804026,
+    "frontier_tpot_p50_s": 0.042955276577521156,
+    "frontier_tpot_p95_s": 0.05288764732371923,
+    "frontier_ttft_p50_s": 0.2690959621493789,
+    "frontier_ttft_p95_s": 6.744624223172184,
+    "kv_blocks": 69055,
+    "label": "TP2 N200 scale 2",
+    "notes": "Uses true-mixed TP2/TP4 attention profile.",
+    "prefix_hit_delta": 0.0,
+    "request_count": 200,
+    "rps_ratio": 0.6070363250137228,
+    "run_id": "tp2_n200_scale2",
+    "scale_label": "2",
+    "scale_value": 2.0,
+    "total_tps_ratio": 0.6070363250137228,
+    "tp": 2,
+    "tpot_p50_ratio": 1.4296981582601855,
+    "tpot_p95_ratio": 1.218502045500008,
+    "ttft_p50_ratio": 1.1953492307083635,
+    "ttft_p95_ratio": 9.432094021900193,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 1144.1460703330465,
+    "vllm_e2e_p50_s": 16.448610065039247,
+    "vllm_e2e_p95_s": 72.53471789998002,
+    "vllm_preemptions": 0,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 1.2778186827338327,
+    "vllm_total": 200,
+    "vllm_total_tps": 7547.001591215254,
+    "vllm_tpot_p50_s": 0.030044996791346416,
+    "vllm_tpot_p95_s": 0.043403823177019754,
+    "vllm_ttft_p50_s": 0.22511911601759493,
+    "vllm_ttft_p95_s": 0.7150717759504914
+  },
+  {
+    "decode_tps_ratio": 0.6321128225155744,
+    "e2e_p50_ratio": 2.1882239414176055,
+    "e2e_p95_ratio": 1.8819058641979227,
+    "fixture": "coder_200_ts3",
+    "frontier_complete": true,
+    "frontier_completed": 200,
+    "frontier_decode_tps": 615.822856748031,
+    "frontier_e2e_p50_s": 21.785964943721574,
+    "frontier_e2e_p95_s": 101.59183927019191,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.2697549478423476,
+    "frontier_rps": 0.6877705321122985,
+    "frontier_total": 200,
+    "frontier_total_tps": 4062.0828059403734,
+    "frontier_tpot_p50_s": 0.0393789684875167,
+    "frontier_tpot_p95_s": 0.04670767224504207,
+    "frontier_ttft_p50_s": 0.13415354950526392,
+    "frontier_ttft_p95_s": 0.574137821753455,
+    "kv_blocks": 69055,
+    "label": "TP2 N200 scale 3",
+    "notes": "Uses true-mixed TP2/TP4 attention profile.",
+    "prefix_hit_delta": 0.0,
+    "request_count": 200,
+    "rps_ratio": 0.6321128225155745,
+    "run_id": "tp2_n200_scale3",
+    "scale_label": "3",
+    "scale_value": 3.0,
+    "total_tps_ratio": 0.6321128225155745,
+    "tp": 2,
+    "tpot_p50_ratio": 2.066305230245682,
+    "tpot_p95_ratio": 1.668678182045304,
+    "ttft_p50_ratio": 0.8737883511042303,
+    "ttft_p95_ratio": 0.9156237864420547,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 974.229338201501,
+    "vllm_e2e_p50_s": 9.956003373954445,
+    "vllm_e2e_p95_s": 53.98348621092737,
+    "vllm_preemptions": 0,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 1.0880502777577379,
+    "vllm_total": 200,
+    "vllm_total_tps": 6426.199028481642,
+    "vllm_tpot_p50_s": 0.01905767256023186,
+    "vllm_tpot_p95_s": 0.02799082096692385,
+    "vllm_ttft_p50_s": 0.15353094297461212,
+    "vllm_ttft_p95_s": 0.6270455510821193
+  },
+  {
+    "decode_tps_ratio": 0.554961482872708,
+    "e2e_p50_ratio": 2.0140798462106178,
+    "e2e_p95_ratio": 1.9471360828275543,
+    "fixture": "coder_200_ts2",
+    "frontier_complete": true,
+    "frontier_completed": 200,
+    "frontier_decode_tps": 763.3502329676248,
+    "frontier_e2e_p50_s": 18.65216281946347,
+    "frontier_e2e_p95_s": 84.93775413567799,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.2697549478423476,
+    "frontier_rps": 0.8525337930595883,
+    "frontier_total": 200,
+    "frontier_total_tps": 5035.200987216818,
+    "frontier_tpot_p50_s": 0.03366585046876145,
+    "frontier_tpot_p95_s": 0.03838265621202119,
+    "frontier_ttft_p50_s": 0.09755515041058871,
+    "frontier_ttft_p95_s": 0.3856872342439675,
+    "kv_blocks": 177077,
+    "label": "TP4 N200 scale 2",
+    "notes": "Uses true-mixed TP2/TP4 attention profile.",
+    "prefix_hit_delta": 0.0,
+    "request_count": 200,
+    "rps_ratio": 0.5549614828727081,
+    "run_id": "tp4_n200_scale2",
+    "scale_label": "2",
+    "scale_value": 2.0,
+    "total_tps_ratio": 0.5549614828727081,
+    "tp": 4,
+    "tpot_p50_ratio": 2.0597817670263323,
+    "tpot_p95_ratio": 1.3554681431066735,
+    "ttft_p50_ratio": 0.5721801588631308,
+    "ttft_p95_ratio": 0.27163724014492546,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 1375.5012852715674,
+    "vllm_e2e_p50_s": 9.26088548800908,
+    "vllm_e2e_p95_s": 43.621889032190666,
+    "vllm_preemptions": 0,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 1.5362035373095158,
+    "vllm_total": 200,
+    "vllm_total_tps": 9073.06388391597,
+    "vllm_tpot_p50_s": 0.016344377354773947,
+    "vllm_tpot_p95_s": 0.02831690025857032,
+    "vllm_ttft_p50_s": 0.1704972619190812,
+    "vllm_ttft_p95_s": 1.4198614079505205
+  },
+  {
+    "decode_tps_ratio": 0.5882440161960838,
+    "e2e_p50_ratio": 3.045852424279607,
+    "e2e_p95_ratio": 2.9785685814353515,
+    "fixture": "coder_200_ts3",
+    "frontier_complete": true,
+    "frontier_completed": 200,
+    "frontier_decode_tps": 660.2306058712501,
+    "frontier_e2e_p50_s": 16.902919407154563,
+    "frontier_e2e_p95_s": 83.00995364867583,
+    "frontier_preemptions": 0,
+    "frontier_prefix_hit": 0.2697549478423476,
+    "frontier_rps": 0.7373665172396945,
+    "frontier_total": 200,
+    "frontier_total_tps": 4355.004629460394,
+    "frontier_tpot_p50_s": 0.031067781092248118,
+    "frontier_tpot_p95_s": 0.035782850818878296,
+    "frontier_ttft_p50_s": 0.08859749134958328,
+    "frontier_ttft_p95_s": 0.3458954617429286,
+    "kv_blocks": 177077,
+    "label": "TP4 N200 scale 3",
+    "notes": "Uses true-mixed TP2/TP4 attention profile.",
+    "prefix_hit_delta": 0.0,
+    "request_count": 200,
+    "rps_ratio": 0.5882440161960838,
+    "run_id": "tp4_n200_scale3",
+    "scale_label": "3",
+    "scale_value": 3.0,
+    "total_tps_ratio": 0.5882440161960839,
+    "tp": 4,
+    "tpot_p50_ratio": 3.301471070786272,
+    "tpot_p95_ratio": 2.7971158799197804,
+    "ttft_p50_ratio": 0.8850343170011207,
+    "ttft_p95_ratio": 1.086290918512101,
+    "vllm_completed": 200,
+    "vllm_decode_tps": 1122.3753879226379,
+    "vllm_e2e_p50_s": 5.549487320007756,
+    "vllm_e2e_p95_s": 27.869075825903565,
+    "vllm_preemptions": 0,
+    "vllm_prefix_hit": 0.2697549478423476,
+    "vllm_rps": 1.2535044929278167,
+    "vllm_total": 200,
+    "vllm_total_tps": 7403.398095950554,
+    "vllm_tpot_p50_s": 0.00941028421153152,
+    "vllm_tpot_p95_s": 0.01279276667647553,
+    "vllm_ttft_p50_s": 0.1001062779687345,
+    "vllm_ttft_p95_s": 0.3184188101440668
+  }
+]
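Three of the nine records above have `frontier_complete` set to `false`, so their ratios compare a partial Frontier run against a full vLLM run. A sketch of separating those runs before any aggregate is computed, assuming the JSON file loads as a list of dicts. `split_by_completion` is a hypothetical helper, and the inline sample keeps only four fields from three of the records:

```python
import json

# Trimmed copies of three records from frontier_vllm_alignment.json.
SAMPLE = """
[
  {"run_id": "tp1_n100_scale1", "frontier_completed": 96,
   "frontier_total": 100, "frontier_complete": false},
  {"run_id": "tp1_n200_scale2", "frontier_completed": 200,
   "frontier_total": 200, "frontier_complete": true},
  {"run_id": "tp4_n200_scale3", "frontier_completed": 200,
   "frontier_total": 200, "frontier_complete": true}
]
"""


def split_by_completion(records):
    """Return (complete_run_ids, incomplete_run_ids) in input order."""
    complete, incomplete = [], []
    for rec in records:
        finished = rec["frontier_completed"] == rec["frontier_total"]
        # Cross-check the stored flag against the counts it summarizes.
        if finished != rec["frontier_complete"]:
            raise ValueError(f"inconsistent completion flag in {rec['run_id']}")
        (complete if finished else incomplete).append(rec["run_id"])
    return complete, incomplete


records = json.loads(SAMPLE)
complete, incomplete = split_by_completion(records)
print("complete:", complete)
print("incomplete:", incomplete)
```

Keeping the incomplete runs in a separate list, rather than dropping them, preserves them as the stress signals that the `notes` field describes.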
--- a/docs/assets/frontier_vllm_alignment/latency_ratios.png
+++ b/docs/assets/frontier_vllm_alignment/latency_ratios.png
--- a/docs/assets/frontier_vllm_alignment/throughput_ratio.png
+++ b/docs/assets/frontier_vllm_alignment/throughput_ratio.png
--- a/docs/assets/frontier_vllm_alignment/tp_scaling_total_tps.png
+++ b/docs/assets/frontier_vllm_alignment/tp_scaling_total_tps.png
--- a/docs/comparison.md
+++ b/docs/comparison.md
@@ -0,0 +1,253 @@
+# RS2 Simulator Comparison
+
+Checked on 2026-06-24. RS2 compares simulator capabilities and first local
+ReplayServe results. It does not start the RS3 sweep and does not make
+performance-quality claims.
+
+## Sources
+
+| Source | Local path | Commit / HEAD | RS2 use |
+|---|---|---:|---|
+| ReplayServe | `/home/gahow/phd/replayserve` | local RS0/RS1/RS1B artifacts | Adapter, fixtures, runs, postprocess summaries |
+| Qwen trace | `/home/gahow/phd/qwen-bailian-usagetraces-anon` | `5f7439c51ec248a0c585f7d90a41a6f57773b912` | Source `qwen_coder_blksz_16.jsonl` |
+| Frontier canonical | `/tmp/toc-llm-sim-research/Frontier` | `d9cfeb6d8791fbf2f295dd9744c56a666171776e` | RS1 fixed config and source inspection |
+| Frontier patched scratch | `/tmp/replayserve-frontier-rs1b` | base `d9cfeb6...` plus `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` | RS1B unblock verification |
+| Vidur | `/tmp/toc-llm-sim-research/vidur` | `8383d2935bc62723a212090baa9f98ada206fc14` | Source inspection for baseline capability |
+| AIConfigurator | `/tmp/toc-llm-sim-research/aiconfigurator` | `e46ece7510e727fafefb8212e5846172145a30ea` | Source/docs inspection for config-estimator capability |
+
+Key local evidence:
+
+- Frontier trace replay: `/tmp/toc-llm-sim-research/Frontier/frontier/request_generator/trace_replay_request_generator.py`
+- Frontier prefix-cache validation: `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/cluster_scheduler/base_cluster_scheduler.py`
+- Frontier prefix-cache request metrics: `/tmp/toc-llm-sim-research/Frontier/frontier/metrics/metrics_store.py`
+- Vidur trace replay: `/tmp/toc-llm-sim-research/vidur/vidur/request_generator/trace_replay_request_generator.py`
+- Vidur request entity: `/tmp/toc-llm-sim-research/vidur/vidur/entities/request.py`
+- AIConfigurator CLI/docs: `/tmp/toc-llm-sim-research/aiconfigurator/README.md` and `/tmp/toc-llm-sim-research/aiconfigurator/src/aiconfigurator/cli/main.py`
+
+## Capability Matrix
+
+| Capability | Frontier | Vidur | AIConfigurator |
+|---|---|---|---|
+| Per-request timestamp replay | Yes. `trace_replay` consumes `arrived_at` and RS1 runs `simulation_mode=online`. | Yes. `TraceReplayRequestGenerator` consumes `arrived_at`. | No per-request replay. CLI consumes workload summaries such as `--isl`, `--osl`, and SLA targets. |
+| Input/output length replay | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`. Frontier can clip overflows internally, so ReplayServe adapter validates before run. | Yes. Consumes `num_prefill_tokens` and `num_decode_tokens`; current code clips prefill length if total exceeds max tokens. | Only summary lengths, not per-request traces. |
+| Explicit `block_hash_ids` / prefix KV reuse replay | Yes. Current Frontier parses `session_id` and `block_hash_ids`, validates that they are present when prefix caching is enabled, and applies prefix-cache accounting in vLLM v1. RS1B needs a patch for prefix cache plus chunked prefill under pressure. | Not in this checkout. Request objects carry arrival/prefill/decode lengths and processed-token state, but no `session_id`, block hashes, or explicit prefix-reuse replay. The README says prefix-caching work lives on a canary branch with sharp edges, not this main checkout. | No. `--prefix` is an aggregate prefix length/workload parameter, not an explicit hash/session replay model. |
+| Online arrival pattern | Yes. RS1 fixed config uses online mode and trace replay. | Yes for trace replay baseline. | No event-level online replay. It estimates candidate deployments from summary workload/SLA inputs. |
+| Prefix-cache hit-ratio output | Yes. Frontier emits request metrics including cached prefill tokens, query blocks, and hit blocks when present. ReplayServe postprocess adds token-weighted hit ratio using sidecar partial-block counts. | No native prefix-hit ratio in current main because no explicit prefix replay. | No prefix-hit replay metric. |
+| TTFT / TPOT / E2E / throughput output | Yes. Request and system metrics are emitted under Frontier metrics dirs. RS1 uses a dummy execution predictor, so values are plumbing-only. | Yes. Vidur metrics include request E2E, prefill/TTFT-style, decode-normalized, and system metrics. Fidelity depends on matching profiles. | Yes as estimates: best throughput, per-GPU throughput, per-user throughput, TTFT, TPOT, and request latency. These are configuration-search outputs, not replay observations. |
+| TP / EP / DP / config knobs | Yes. Frontier has model, device, network device, attention TP/DP, MoE TP/EP, PP, scheduler, block, batch, prefix-cache, chunked-prefill, and memory-planner knobs. | Partial. Vidur exposes model, device, network device, tensor parallel size, pipeline stages, scheduler and batch/KV knobs. The inspected checkout is not a faithful EP/DP prefix-replay candidate. | Strong for config search. Supports TP/PP/DP and expert TP/EP style search across supported backends/systems. |
+| Arbitrary model/hardware/config boundary | Not arbitrary. Model/device configs may exist, but reliable latency/throughput requires compute/network profiles, scheduler support, matching parallel semantics, and bug-free code paths. RS1 Qwen3-32B on A800 uses a dummy predictor because public A800 dense Qwen3-32B compute profiles are absent. | Not arbitrary. README lists supported model/device/profile combinations; docs say profiling on actual GPUs is needed for new model/hardware fidelity. Current public device/profile coverage does not match A800 Qwen3-32B. | Not arbitrary. It depends on supported model families, backend/system databases, and estimate modes. The support matrix includes H100/H200/B200/GB200/A100 variants; no A800 built-in silicon database was found. |
+| Needs profile/calibration | Yes for performance claims. Dummy predictor plus analytical comm is only a smoke test. | Yes for performance claims. New model/device requires compute, network, and CPU-overhead style profiling. | Yes for production-quality estimates. It relies on collected silicon/perf databases or rough estimate modes; README warns memory/results need validation. |
+
+## First Results
+
+### Frontier Canonical
+
+Fixed RS1 config:
+
+- `simulation_mode=online`
+- `sys_arch=co-location`
+- `replica_scheduler=vllm_v1`
+- `device=a800`
+- `network_device=a800_dgx`
+- `model_name=Qwen/Qwen3-32B`
+- `attn_tensor_parallel_size=2`
+- dummy execution predictor
+- analytical communication backend
+- `trace_request_generator_config_max_tokens=32768`
+- prefix caching enabled
+- block size 16
+- chunked prefill enabled
+- batch cap 128
+- max batch tokens 32768
+- KV capacity from Frontier memory planner with `gpu_memory_utilization=0.9` and `non_kv_cache_overhead_bytes=0`
+
+Results:
+
+| Run | Result | Evidence | Notes |
+|---|---|---|---|
+| `coder_100` | Pass | `runs/rs1/coder_100/` | Frontier block hit ratio `0.04948661841440835`; ReplayServe token-weighted hit ratio `0.04956232588915065`; no preemptions. |
+| `coder_2000` | Fail | `runs/rs1/coder_2000/` | Exit code 1 after 4 seconds with `ValueError: Request 194 already scheduled.` Traceback ends at Frontier vLLM v1 waiting scheduling calling `request.on_cache_hit(prefix_cached_tokens)`. |
+
+The canonical failure was minimized in `docs/rs1_frontier_blocker.md`: the
+first-N probe with `N=192` passes, `N=193` fails with `Request 192 already
+scheduled`, and larger fixed-config probes fail around the same preempted
+prefix-cache path. Disabling prefix caching, disabling chunked prefill, or
+setting a high long-prefill threshold each avoids the failure, so the cause
+is not a bad Qwen trace row.
+
+### Frontier Patched Scratch
+
+Patch:
+
+- File: `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
+- Documentation: `docs/rs1_frontier_patch.md`
+- Scratch checkout only: `/tmp/replayserve-frontier-rs1b`
+
+The patch resets a preempted request's scheduler and cache-hit admission state
+before the request re-enters the waiting path. As of 2026-06-25 it also replays
+decode-phase preemption by moving already-produced tokens into the next prefill
+segment, preserves user-facing lengths for metrics, and fails fast if
+sequential simulation drains before all generated requests complete. The patch
+is applied only in the scratch checkout, so the canonical Frontier checkout
+stays clean.
+
+| Run | Result | Evidence | Hit ratios | Preemption | Memory planner facts |
+|---|---|---|---|---|---|
+| `N=193` fixed config | Pass | `runs/rs1b/patched/n193_fixed_v2/` | Frontier block `0.12458971786194112`; ReplayServe token-weighted `0.12476981408429115` | 5 total events, 1 request | `num_blocks=36902`, `gpu_memory_utilization=0.9`, non-KV overhead `0`, weight shard estimate `26.953125 GiB` |
+| `coder_100` fixed config | Pass | `runs/rs1b/patched/coder_100/` | Frontier block `0.04948661841440835`; ReplayServe token-weighted `0.04956232588915065` | 0 | same derived memory planner point |
+| `coder_2000` fixed config | Pass | `runs/rs1b/patched/coder_2000/` | Frontier block `0.12318930248025924`; ReplayServe token-weighted `0.12332978217090633` | 35940 total events, 1061 requests | same derived memory planner point |
+
+Metrics caveats:
+
+- These are plumbing-smoke metrics. The runs use a dummy 1 ms execution time
+  and analytical communication, not calibrated Qwen3-32B A800 compute profiles.
+- `coder_2000` produced `request_metrics.csv` with 2000 rows, but 745 rows have
+  blank request-level prefix-cache fields. ReplayServe token-weighted hit ratio
+  therefore uses the 1255 rows with complete cache metrics. Frontier's aggregate
+  prefix-cache statistics in the same summary also report 1255 requests with
+  cache metrics. This is acceptable for blocker removal evidence, but it is not
+  a final metric-quality result.
+- No allocation/OOM pressure log lines were found in the postprocess summaries.
+
+### Vidur
+
+No Vidur baseline run was executed for RS2. Based on source inspection, Vidur is
+useful as an arrival-and-length baseline candidate, but it cannot faithfully
+compare ReplayServe prefix reuse without additional code:
+
+- `vidur/request_generator/trace_replay_request_generator.py` consumes
+  `arrived_at`, `num_prefill_tokens`, and `num_decode_tokens`.
+- `vidur/entities/request.py` stores arrival, prefill length, decode length,
+  processed tokens, schedule/completion timestamps, and preemption state.
+- The inspected request path does not carry `session_id`, `block_hash_ids`, or
+  sidecar block-token accounting.
+- Current Vidur trace replay clips prefill lengths when total tokens exceed
+  `max_tokens`; ReplayServe must keep its own hard-fail validation if Vidur is
+  used later as a length-only baseline.
+
+Conclusion for Vidur in RS2: it can likely replay `coder_100`/`coder_2000`
+arrival and length after a simple CSV compatibility conversion, but it would
+measure a different workload because prefix KV reuse is absent.
+
+### AIConfigurator
+
+No AIConfigurator run was executed for RS2 because it is not a per-request
+replay simulator. Source/docs show it is a deployment/config search estimator:
+
+- CLI examples take workload summaries such as `--isl`, `--osl`, `--prefix`,
+  `--ttft`, `--tpot`, `--total-gpus`, `--system`, and model path.
+- Outputs are best throughput, per-GPU throughput, per-user throughput, TTFT,
+  TPOT, request latency, concurrency, and parallel deployment choices.
+- It models operations and searches aggregated/disaggregated serving
+  configurations using collected or estimated performance data.
+
+Conclusion for AIConfigurator in RS2: it is useful for config candidates and
+reference sizing assumptions. It cannot directly compare faithful per-request
+prefix-hit replay on Qwen trace fixtures.
+
+## Metric Definitions
+
+`TTFT`:
+Time from request arrival to first generated token / prefill completion. Frontier
+and Vidur both have request-level prefill/first-token style timing fields, but
+RS1 Frontier values are not performance claims because the execution predictor is
+dummy.
+
+`TPOT`:
+Decode time per output token. Tools differ on whether they report total decode
+normalized by output tokens, inter-token latency, or a configured SLA target.
+Use each tool's native field only within that tool unless calibrated against the
+same serving definition.
+
+`E2E latency`:
+Completion time minus arrival time for one request.
+
+`Throughput`:
+Completed tokens or requests per unit time. AIConfigurator reports estimated
+tokens/s style capacity; Frontier/Vidur report simulated metrics. RS1 Frontier
+throughput is plumbing-only because compute is dummy.
+
+`KV-cache hit ratio`:
+
+- Frontier native block-level ratio:
+  `sum(request_prefix_cache_hit_blocks) / sum(request_prefix_cache_query_blocks)`.
+- ReplayServe token-weighted ratio:
+  use sidecar `block_token_counts` and count the first
+  `request_prefix_cache_hit_blocks` blocks by true token count, so a partial
+  final block contributes its actual token count instead of always 16.
+
+For `coder_2000` patched, both ratios are computed only for request rows with
+complete cache fields because 745 request metric rows have blank cache fields.
+This is a metrics completeness caveat, not evidence that the trace has invalid
+hashes.
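+
+The two ratios above can be sketched as follows. This is a minimal
+illustration, not ReplayServe's actual postprocess code: the row layout, the
+key names `hit_blocks`, `query_blocks`, and `block_token_counts`, and the use
+of all listed block tokens as the token-weighted denominator are assumptions.
+
+```python
+def with_cache_metrics(rows):
+    """Keep only rows whose cache fields are present (blank fields are None here)."""
+    return [
+        r for r in rows
+        if r["hit_blocks"] is not None and r["query_blocks"] is not None
+    ]
+
+
+def block_hit_ratio(rows):
+    """Block-level ratio: sum of hit blocks over sum of queried blocks."""
+    rows = with_cache_metrics(rows)
+    hit = sum(r["hit_blocks"] for r in rows)
+    query = sum(r["query_blocks"] for r in rows)
+    return hit / query if query else 0.0
+
+
+def token_weighted_hit_ratio(rows):
+    """Token-weighted ratio: the first `hit_blocks` blocks counted by true size."""
+    rows = with_cache_metrics(rows)
+    hit = sum(sum(r["block_token_counts"][: r["hit_blocks"]]) for r in rows)
+    total = sum(sum(r["block_token_counts"]) for r in rows)
+    return hit / total if total else 0.0
+
+
+# Hypothetical toy rows; the third row mimics a request with blank cache fields.
+TOY_ROWS = [
+    {"hit_blocks": 2, "query_blocks": 4, "block_token_counts": [16, 16, 16, 5]},
+    {"hit_blocks": 1, "query_blocks": 2, "block_token_counts": [16, 16]},
+    {"hit_blocks": None, "query_blocks": None, "block_token_counts": [16, 16]},
+]
+```
+
+For these toy rows the block-level ratio is `3/6` and the token-weighted ratio
+is `48/85`, because the partial final block counts as 5 tokens instead of 16.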
+
+## Non-Comparable Items
+
+- Frontier canonical and patched scratch are not equivalent artifacts. The
+  patched result demonstrates an RS1 unblock path; it is not an upstream
+  Frontier release.
+- Frontier/Vidur simulator timings and AIConfigurator estimator timings are not
+  directly comparable without shared profiles, calibration, and metric
+  definitions.
+- Prefix-reuse fidelity is not comparable across all three tools. Only Frontier
+  currently consumes explicit block hash traces in the inspected checkouts.
+- AIConfigurator's `prefix` workload parameter is not the same as ReplayServe
+  `block_hash_ids`; it cannot recover session-level sharing or partial-block
+  token accounting.
+
+## Conclusions
+
+None of the inspected open-source tools satisfies ReplayServe's target out of
+the box.
+
+Frontier is the closest because it supports online trace replay, prefix-cache
+metadata, vLLM v1 style scheduler controls, memory planning, and request/system
+metrics. It still does not work out of the box for ReplayServe: RS0 needed an
+adapter, RS1B needed a local patch or upstream fix for prefix cache plus chunked
+prefill, and performance-quality claims need Qwen3-32B A800
+profiles/calibration.
+
+Vidur can be a useful arrival-plus-length baseline, but not a faithful prefix KV
+reuse replay engine in the inspected checkout.
+
+AIConfigurator can guide candidate deployment/config choices, but it is a
+workload-summary estimator rather than a per-request simulator.
+
+Frontier also does not support arbitrary model plus arbitrary hardware plus
+arbitrary config in a performance-reliable sense. A model/device config may be
+accepted syntactically, but fidelity depends on compute profiles, network
+profiles, scheduler support, parallelism semantics, memory-planner assumptions,
+and bug surface. For RS1, A800 network profiles exist, but the public checkout
+does not provide dense Qwen3-32B A800 compute profiles, so latency/throughput
+remain plumbing smoke.
+
+## Next Steps
+
+RS3 sweep prerequisites:
+
+- Decide whether RS3 uses the local RS1B patch, waits for upstream Frontier, or
+  carries both canonical and patched modes explicitly.
+- Keep fixed-config smoke and any sweep configs separate from performance claims.
+- Add a small run manifest/check script that records Frontier commit, patch
+  status, fixture, command, and metric completeness.
+- Treat the `coder_2000` blank cache fields as a metrics issue to investigate
+  before using request-level hit ratios as a headline metric.
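+
+A metric-completeness check for such a manifest script could start from the
+sketch below. It assumes `request_metrics.csv` exposes the
+`request_prefix_cache_hit_blocks` and `request_prefix_cache_query_blocks`
+columns named under Metric Definitions; the function name, the `request_id`
+column, and the in-memory sample are hypothetical.
+
+```python
+import csv
+import io
+
+CACHE_FIELDS = (
+    "request_prefix_cache_hit_blocks",
+    "request_prefix_cache_query_blocks",
+)
+
+
+def cache_metric_completeness(csv_text):
+    """Count request rows with and without complete prefix-cache fields."""
+    total = complete = 0
+    for row in csv.DictReader(io.StringIO(csv_text)):
+        total += 1
+        # A field counts as present only if it is non-empty after stripping.
+        if all((row.get(field) or "").strip() for field in CACHE_FIELDS):
+            complete += 1
+    return {"total": total, "complete": complete, "blank": total - complete}
+
+
+# Hypothetical three-row sample; the middle row has blank cache fields.
+SAMPLE = (
+    "request_id,request_prefix_cache_hit_blocks,request_prefix_cache_query_blocks\n"
+    "0,2,4\n"
+    "1,,\n"
+    "2,1,2\n"
+)
+```
+
+Recording the returned counts next to the Frontier commit and patch status
+would make a gap like the 745 blank `coder_2000` rows visible per run.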
+
+RS4 calibration prerequisites:
+
+- Collect or obtain dense `Qwen/Qwen3-32B` A800 compute profiles for the Frontier
+  predictor path.
+- Verify the A800 network profile and node SKU semantics match the target
+  deployment.
+- Add non-KV memory overhead assumptions from a real serving stack instead of
+  using `0`.
+- Validate simulator TTFT/TPOT/E2E/throughput against measured vLLM runs before
+  making performance conclusions.
+
+Patch path recommendation:
+
+- Keep `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch` pinned in
+  ReplayServe until an upstream Frontier fix is available; it now covers both
+  the original prefix-cache/chunked-prefill preemption bug and the RS10
+  decode-phase preemption lifecycle bug.
+- Open an upstream issue or PR with the RS1B minimal repro (`N=193`) and
+  evidence from `docs/rs1_frontier_blocker.md`.
+- Re-run `coder_100`, `N=193`, and `coder_2000` when changing Frontier commit or
+  patch status.
--- a/docs/frontier_vllm_alignment_summary_20260625.md
+++ b/docs/frontier_vllm_alignment_summary_20260625.md
@@ -0,0 +1,177 @@
+# Frontier vs vLLM H20 Alignment Summary
+
+Date: 2026-06-25
+
+This document summarizes the current ReplayServe comparison between Frontier
+simulation and real vLLM runs on H20 for Qwen3-30B-A3B. It covers TP=1/2/4,
+different timestamp scales, and 100/200/500-request windows from
+`qwen_coder_blksz_16.jsonl`.
+
+The source data and plots are generated by:
+
+```bash
+~/.venv/plot/bin/python tools/build_frontier_vllm_alignment_report.py
+```
+
+Generated artifacts:
+
+- `docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.csv`
+- `docs/assets/frontier_vllm_alignment/frontier_vllm_alignment.json`
+- `docs/assets/frontier_vllm_alignment/throughput_ratio.png`
+- `docs/assets/frontier_vllm_alignment/latency_ratios.png`
+- `docs/assets/frontier_vllm_alignment/tp_scaling_total_tps.png`
+- `docs/assets/frontier_vllm_alignment/completion_prefix.png`
+
+## Bottom Line
+
+Functional replay is now usable for the clean 200-request runs:
+
+- TP1 scale 2/3 after the Frontier lifecycle fix: `200/200` completed.
+- TP2/TP4 scale 2/3: `200/200` completed, no preemption on either side, matched
+  vLLM KV block counts, and exact trace-side prefix reuse ratio.
+
+Performance is not fully calibrated:
+
+- TP1 scale 2/3 is the closest current operating point: Frontier throughput is
+  about `0.74x` vLLM and TPOT p50/p95 is close.
+- TP2/TP4 is functionally aligned but slower: Frontier throughput is only
+  `0.55-0.63x` vLLM, and TP4 TPOT is too pessimistic.
+- Frontier underestimates the TP2->TP4 speedup. vLLM improves total throughput
+  by `1.15-1.20x`; Frontier improves by only `1.07-1.10x`.
+
+Current use: acceptable for integration work and rough qualitative trends, not
+yet acceptable as a calibrated absolute performance predictor.
+
+## Run Matrix
+
+All vLLM runs use vLLM 0.11.1, H20, Qwen3-30B-A3B,
+`max_model_len=32768`, `max_num_seqs=64`,
+`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`, prefix caching,
+and chunked prefill.
+
+| run | Frontier rows | preempt F/V | prefix hit F/V | total tok/s F/V | ratio | TPOT p50 F/V | E2E p95 F/V |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| TP1 N100 raw | 96/100 | 0/8 | 0.249/0.251 | 2349/3832 | 0.61 | 0.0569/0.0661s | 119.6/97.4s |
+| TP1 N500 raw | 439/500 | 0/63 | 0.119/0.387 | 4734/5283 | 0.90 | 0.0564/0.0498s | 397.3/417.4s |
+| TP1 N200 scale 0.667 | 176/200 | 0/26 | 0.170/0.270 | 3913/4865 | 0.80 | 0.0584/0.0515s | 189.2/183.7s |
+| TP1 N200 scale 2 | 200/200 | 33/43 | 0.231/0.270 | 3506/4743 | 0.74 | 0.0542/0.0497s | 174.5/142.3s |
+| TP1 N200 scale 3 | 200/200 | 20/16 | 0.218/0.270 | 3390/4608 | 0.74 | 0.0534/0.0462s | 154.5/122.8s |
+| TP2 N200 scale 2 | 200/200 | 0/0 | 0.270/0.270 | 4581/7547 | 0.61 | 0.0430/0.0300s | 106.8/72.5s |
+| TP2 N200 scale 3 | 200/200 | 0/0 | 0.270/0.270 | 4062/6426 | 0.63 | 0.0394/0.0191s | 101.6/54.0s |
+| TP4 N200 scale 2 | 200/200 | 0/0 | 0.270/0.270 | 5035/9073 | 0.55 | 0.0337/0.0163s | 84.9/43.6s |
+| TP4 N200 scale 3 | 200/200 | 0/0 | 0.270/0.270 | 4355/7403 | 0.59 | 0.0311/0.0094s | 83.0/27.9s |
+
+Important prefix caveat: the vLLM prefix-hit column in this table is the
+trace-side synthetic estimate from the vLLM summaries. For TP1 runs with
+preemption and finite KV pressure, the observed vLLM scheduler `computed:`
+signal is the better comparator. Earlier analysis in
+`docs/rs4_frontier_h20_tp1_alignment.md` records those finite-cache comparisons.
+For TP2/TP4, no preemption occurs and the trace-side prefix ratio matches
+Frontier exactly.
+
+## Plots
+
+![Throughput ratio](assets/frontier_vllm_alignment/throughput_ratio.png)
+
+![Latency ratios](assets/frontier_vllm_alignment/latency_ratios.png)
+
+![TP scaling](assets/frontier_vllm_alignment/tp_scaling_total_tps.png)
+
+![Completion and prefix reuse](assets/frontier_vllm_alignment/completion_prefix.png)
+
+## Interpretation
+
+### TP1
+
+The early TP1 100/500/scale-0.667 runs are still useful as historical stress
+points, but they were run before the decode-preemption lifecycle fix. Frontier
+therefore missed rows in those runs:
+
+- `96/100` for N100 raw
+- `439/500` for N500 raw
+- `176/200` for N200 scale 0.667
+
+After the lifecycle fix, TP1 scale 2 and scale 3 both complete `200/200`.
+Preemption counts are now of the same order of magnitude as vLLM's:
+
+- scale 2: Frontier 33 vs vLLM 43
+- scale 3: Frontier 20 vs vLLM 16
+
+TP1 is currently the region where Frontier timing is closest to vLLM. Throughput
+is about `0.74x` vLLM, TPOT p50/p95 is close, and E2E p95 is about
+`1.23-1.26x` vLLM. This is not perfect, but it is usable for integration-level
+trend checks.
+
+### TP2 and TP4
+
+The TP2/TP4 runs are functionally cleaner than TP1:
+
+- `200/200` completed for all four runs.
+- Frontier and vLLM both report no preemption.
+- Frontier uses explicit vLLM KV capacities:
+  - TP2: 69,055 blocks
+  - TP4: 177,077 blocks
+- Prefix hit ratio matches exactly: `0.2697549478`.
+
+True-mixed attention for TP2/TP4 has been profiled. The active RS12 profile
+includes:
+
+- `attention_tp2_tp4_combined.csv`: 36,163 rows, including 1,260 true-mixed
+  prefill+decode rows for TP2/TP4.
+- `linear_op_tp2_tp4_full32k.csv`: covers up to 32,768 tokens.
+- `moe_tp2_tp4_full32k.csv`: covers up to 32,768 tokens.
+
+Without the true-mixed rows, Frontier fails with missing
+`attn_decode_in_mixed` predictions. With them, all RS12 runs complete.
+
+The remaining TP2/TP4 gap is therefore not a missing-profile blocker. It is a
+timing-model gap:
+
+- TP2 throughput is `0.61-0.63x` vLLM.
+- TP4 throughput is `0.55-0.59x` vLLM.
+- TP4 TPOT p50 is `2.06-3.30x` vLLM.
+
+## Scaling
+
+For the same first-200 request fixtures:
+
+| fixture | metric | Frontier TP4/TP2 | vLLM TP4/TP2 |
+|---|---|---:|---:|
+| scale 2 | total tok/s | 1.10 | 1.20 |
+| scale 2 | decode tok/s | 1.10 | 1.20 |
+| scale 2 | TPOT p50 | 0.78 | 0.54 |
+| scale 3 | total tok/s | 1.07 | 1.15 |
+| scale 3 | decode tok/s | 1.07 | 1.15 |
+| scale 3 | TPOT p50 | 0.79 | 0.49 |
+
+Frontier sees some TP4 improvement, but much less than real vLLM. This is the
+clearest current evidence that the simulator is not yet modeling vLLM's
+TP-dependent decode execution path well enough.
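+
+The total tok/s ratios in the table can be re-derived from the rounded Run
+Matrix values. The sketch below does that arithmetic; the dictionary layout and
+function name are illustrative assumptions, and the generated CSV holds the
+unrounded inputs.
+
+```python
+# Total tok/s copied from the Run Matrix table (rounded values).
+TOTAL_TPS = {
+    ("frontier", "scale 2"): {"tp2": 4581, "tp4": 5035},
+    ("vllm", "scale 2"): {"tp2": 7547, "tp4": 9073},
+    ("frontier", "scale 3"): {"tp2": 4062, "tp4": 4355},
+    ("vllm", "scale 3"): {"tp2": 6426, "tp4": 7403},
+}
+
+
+def tp4_over_tp2(total_tps):
+    """Return the TP4/TP2 throughput ratio per (system, fixture), to 2 decimals."""
+    return {
+        key: round(values["tp4"] / values["tp2"], 2)
+        for key, values in total_tps.items()
+    }
+
+
+RATIOS = tp4_over_tp2(TOTAL_TPS)
+```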
+
+## Likely Gap Sources
+
+The main unresolved issues are:
+
+- CPU/scheduler overhead is still skipped (`skip_cpu_overhead_modeling=true`).
+- Decode CUDA graph behavior is not modeled in the current Frontier runs
+  (`decode_cuda_graph_mode=none`).
+- Random-forest predictors interpolate over profile grids, while real online
+  mixed batches may concentrate on shapes not directly sampled.
+- Some TP4 predictor fit errors are nontrivial, for example
+  `attn_kv_cache_save` MAPE around 11% in the TP4 profile log.
+- Frontier's scheduler and preemption behavior is close but not identical for
+  TP1 under finite KV pressure.
+
+## ReplayServe TODO
+
+1. Rerun the 500-request TP1 stress after the decode-preemption lifecycle fix,
+   so the 500-row result is no longer mixed with the old incomplete behavior.
+2. Record vLLM observed scheduler prefix/preemption metrics in machine-readable
+   summaries, not only in docs, especially first-start and last-start
+   `computed:` ratios.
+3. Add a shape-ledger analysis: compare Frontier's actual online batch shapes
+   against the profile grid and identify hot shapes that are interpolated.
+4. Profile or import vLLM CPU overhead and test
+   `skip_cpu_overhead_modeling=false`.
+5. Collect kernel-only / decode-CUDA-graph timing profiles before enabling a
+   Frontier CUDA-graph decode mode.
+6. Calibrate TP2/TP4 timing only after the above, because current functional
+   replay is aligned but the TP scaling is not.
--- a/docs/rs1_frontier_blocker.md
+++ b/docs/rs1_frontier_blocker.md
@@ -0,0 +1,199 @@
+# RS1 Frontier Blocker: Prefix Cache + Chunked Prefill
+
+This note narrows the RS1 `coder_2000` failure into a small Frontier repro.
+It does not change the RS1 fixed config or make performance claims.
+
+## Status
+
+- Frontier repo: `/tmp/toc-llm-sim-research/Frontier`
+- Frontier HEAD: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
+- ReplayServe canonical fixtures were not changed.
+- Frontier source was not modified.
+- Diagnostic artifacts live under `runs/rs1/blocker_request_194/`.
+
+The original `coder_2000` run failed with:
+
+```text
+ValueError: Request 194 already scheduled.
+```
+
+First-N probing shows that the blocker is not a single malformed row 194.
+The smallest observed first-N failure is `N=193`, which fails as:
+
+```text
+ValueError: Request 192 already scheduled.
+```
+
+`N=192` passes under the same fixed config.
+
+## Repro Commands
+
+Generate diagnostic slices:
+
+```bash
+cd /home/gahow/phd/replayserve
+for n in 190 191 192 193 194 195 200; do
+  out="runs/rs1/blocker_request_194/fixtures/coder_${n}"
+  mkdir -p "$out"
+  python3 tools/qwen_to_frontier.py \
+    --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
+    --frontier-csv "$out/frontier.csv" \
+    --sidecar-jsonl "$out/sidecar.jsonl" \
+    --source-jsonl "$out/source.jsonl" \
+    --manifest-json "$out/manifest.json" \
+    --fixture-name "blocker_coder_${n}" \
+    --limit "$n" \
+    --max-tokens 32768 \
+    --block-size 16 \
+    --fail-on-overflow
+done
+```
+
+Minimal failing command:
+
+```bash
+cd /home/gahow/phd/replayserve
+scripts/run_frontier_blocker_probe.sh \
+  n193_default \
+  runs/rs1/blocker_request_194/fixtures/coder_193
+```
+
+The exact Frontier CLI for every probe is preserved in each
+`runs/rs1/blocker_request_194/probes/<name>/command.txt`.
+
+## First-N Matrix
+
+All default rows use the RS1 fixed config: prefix caching on, chunked prefill
+on, `long_prefill_token_threshold=64`, batch cap 128, max batch tokens 32768.
+
+| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
+|---|---:|---|---|---:|---:|---|
+| `n190_default` | 190 | on | on | 64 | 0 | pass |
+| `n191_default` | 191 | on | on | 64 | 0 | pass |
+| `n192_default` | 192 | on | on | 64 | 0 | pass |
+| `n193_default` | 193 | on | on | 64 | 1 | `Request 192 already scheduled` |
+| `n194_default` | 194 | on | on | 64 | 1 | `Request 192 already scheduled` |
+| `n195_default` | 195 | on | on | 64 | 1 | `Request 194 already scheduled` |
+| `n200_default` | 200 | on | on | 64 | 1 | `Request 194 already scheduled` |
+
+## Diagnostic Variants
+
+These are diagnosis only. They are not replacements for the RS1 fixed config.
+
+| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
+|---|---:|---|---|---:|---:|---|
+| `n193_prefix_off` | 193 | off | on | 64 | 0 | pass |
+| `n193_chunked_off` | 193 | on | off | 64 | 1 | Frontier config rejects this combination |
+| `n193_chunked_off_threshold_0` | 193 | on | off | 0 | 0 | pass |
+| `n193_threshold_32768` | 193 | on | on | 32768 | 0 | pass |
+| `n195_prefix_off` | 195 | off | on | 64 | 0 | pass |
+| `n195_chunked_off` | 195 | on | off | 64 | 1 | Frontier config rejects this combination |
+| `n195_chunked_off_threshold_0` | 195 | on | off | 0 | 0 | pass |
+| `n195_threshold_32768` | 195 | on | on | 32768 | 0 | pass |
+| `n200_prefix_off` | 200 | off | on | 64 | 0 | pass |
+| `n200_chunked_off_threshold_0` | 200 | on | off | 0 | 0 | pass |
+| `n200_threshold_32768` | 200 | on | on | 32768 | 0 | pass |
+
+Frontier enforces:
+
+```text
+VllmV1SchedulerConfig.long_prefill_token_threshold > 0 requires enable_chunked_prefill=True
+```
+
+So a valid chunked-off diagnostic also sets
+`LONG_PREFILL_TOKEN_THRESHOLD=0`.
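+
+The constraint can be mirrored in a toy config class to see which diagnostic
+combinations are admissible. `ToySchedulerConfig` and `is_valid` are
+hypothetical names for illustration; this is not Frontier's
+`VllmV1SchedulerConfig` implementation.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ToySchedulerConfig:
+    enable_chunked_prefill: bool = True
+    long_prefill_token_threshold: int = 64
+
+    def __post_init__(self):
+        # Same rule as the Frontier error message quoted above.
+        if self.long_prefill_token_threshold > 0 and not self.enable_chunked_prefill:
+            raise ValueError(
+                "long_prefill_token_threshold > 0 requires enable_chunked_prefill=True"
+            )
+
+
+def is_valid(enable_chunked_prefill, long_prefill_token_threshold):
+    """Return True if the toy config accepts this combination."""
+    try:
+        ToySchedulerConfig(enable_chunked_prefill, long_prefill_token_threshold)
+    except ValueError:
+        return False
+    return True
+```
+
+Under this rule, chunked-off with threshold 64 is rejected, while chunked-off
+with threshold 0 and chunked-on with threshold 32768 are both accepted, matching
+the diagnostic table.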
+
+## Local Trace Analysis
+
+Generated files:
+
+- `runs/rs1/blocker_request_194/analysis/request_192_analysis.json`
+- `runs/rs1/blocker_request_194/analysis/request_192_analysis.md`
+- `runs/rs1/blocker_request_194/analysis/request_194_analysis.json`
+- `runs/rs1/blocker_request_194/analysis/request_194_analysis.md`
+
+Request 192, the minimal first-N failure target:
+
+- `timestamp=43.406`
+- `chat_id=192`, `parent_chat_id=-1`, `turn=1`, `type=coder`
+- `input_length=13436`, `output_length=1425`, total `14861`
+- `hash_count=840`
+- partial final block: yes, final block token count `12`
+- top prior prefix overlap: 7 blocks, 112 tokens
+- no parent candidate in the sidecar
+
+Request 194, the original `coder_2000` failing request:
+
+- `timestamp=43.931`
+- `chat_id=194`, `parent_chat_id=-1`, `turn=1`, `type=coder`
+- `input_length=2064`, `output_length=2278`, total `4342`
+- `hash_count=129`
+- partial final block: no, final block token count `16`
+- top prior prefix overlap: 1 block, 16 tokens
+- no parent candidate in the sidecar
+
+Interpretation:
+
+- The failing requests are independent first turns, not child turns in a chat.
+- Request 192 has a partial final block, but its observed prior prefix overlap
+  is only the first 7 full blocks.
+- Request 194 has no partial final block and only a 1-block prefix overlap.
+- The failure is therefore not explained by a malformed partial final block,
+  deep shared-prefix trace structure, or a parent/child chat mismatch.
+- Fixture validation confirms monotonic timestamps, max-token compliance,
+  sidecar hash lengths, and block token counts.
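+
+The block arithmetic behind the hash-length and block-token-count checks can be
+reproduced for the two requests above. This sketch assumes the hash list covers
+input tokens only at block size 16, which is consistent with the reported
+values; `expected_block_layout` is a hypothetical helper, not part of
+`tools/qwen_to_frontier.py`.
+
+```python
+import math
+
+
+def expected_block_layout(input_length, block_size=16):
+    """Return (hash_count, final_block_token_count) for a prompt length."""
+    if input_length <= 0:
+        raise ValueError("input_length must be positive")
+    hash_count = math.ceil(input_length / block_size)
+    # Every block before the last one is full; the last holds the remainder.
+    final_block_tokens = input_length - (hash_count - 1) * block_size
+    return hash_count, final_block_tokens
+
+
+# Values reported above for requests 192 and 194.
+REQUEST_192 = expected_block_layout(13436)
+REQUEST_194 = expected_block_layout(2064)
+```
+
+Request 192 yields 840 blocks with a 12-token final block, and request 194
+yields 129 full blocks, matching the analysis files.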
+
+## Frontier Source Localization
+
+Relevant Frontier files:
+
+- `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/replica_scheduler/vllm_v1_engine_replica_scheduler.py`
+- `/tmp/toc-llm-sim-research/Frontier/frontier/entities/request.py`
+- `/tmp/toc-llm-sim-research/Frontier/frontier/config/config.py`
+
+Key path:
+
+- `VllmV1EngineReplicaScheduler._prepare_prefix_cache_admission`
+  at `vllm_v1_engine_replica_scheduler.py:1178` calls
+  `kv_cache_manager.get_computed_blocks(request)` and returns
+  `prefix_cached_tokens`.
+- `_schedule_waiting_requests` at
+  `vllm_v1_engine_replica_scheduler.py:3075` runs prefix-cache admission for
+  any waiting request with prefix caching enabled and incomplete prefill.
+- The same waiting path allocates KV and then calls
+  `request.on_cache_hit(prefix_cached_tokens)` at
+  `vllm_v1_engine_replica_scheduler.py:3179`.
+- `Request.on_cache_hit` at `request.py:503` raises if `_scheduled` is already
+  true.
+- `Request.on_batch_schedule` at `request.py:1058` sets `_scheduled=True`.
+- Chunked-prefill continuations run through `_schedule_running_requests`
+  around `vllm_v1_engine_replica_scheduler.py:2696`, with long-prefill
+  capping applied around `:2826`.
+- A valid chunked-off CLI invocation requires `long_prefill_token_threshold=0`;
+  otherwise `config.py:714` rejects the configuration.
+
+The evidence points to a Frontier scheduler state issue: with prefix caching
+enabled and chunked prefill active, a request that has already been scheduled
+can later reach waiting-admission prefix-cache handling and receive
+`on_cache_hit` again. That violates `Request.on_cache_hit`'s current invariant.
+
+This is more consistent with a repeated cache-hit application or scheduled
+request re-admission path than with bad ReplayServe trace/hash data.
+
+## Suggested Next Steps
+
+1. Add temporary Frontier instrumentation around `_schedule_waiting_requests`
+   before `request.on_cache_hit` to log `request.id`, `_scheduled`,
+   `_preempted`, `is_prefill_complete`, `num_processed_tokens`,
+   `prefix_cached_tokens`, and whether the request came from
+   `_preempted_requests` or `_request_queue`.
+2. Decide Frontier semantics for prefix-cache hits after a request has already
+   been scheduled once. A likely fix is to apply `on_cache_hit` only for a
+   first admission with `_scheduled=False` and `num_processed_tokens=0`, or to
+   reset the request's restart state before re-admission if that is the intended
+   vLLM parity behavior.
+3. Keep RS1 fixed config blocked for `coder_2000` until Frontier behavior is
+   patched or a documented upstream-compatible workaround is selected.
+4. Do not use the passing diagnosis variants as RS1 performance evidence:
+   prefix-off, chunked-off, and threshold-32768 change the fixed config.
+
--- a/docs/rs1_frontier_patch.md
+++ b/docs/rs1_frontier_patch.md
@@ -0,0 +1,150 @@
+# RS1B Frontier Patch
+
+This document records the scratch Frontier patch used to unblock RS1 fixed
+config replay. It is not applied to the canonical Frontier checkout.
+
+## Patch
+
+- Patch file:
+  `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
+- Canonical Frontier checkout:
+  `/tmp/toc-llm-sim-research/Frontier`
+- Scratch Frontier checkout:
+  `/tmp/replayserve-frontier-rs1b`
+- Frontier base HEAD:
+  `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
+
+Apply from a Frontier checkout at the same base commit:
+
+```bash
+cd /path/to/Frontier
+git apply /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
+```
+
+Check applicability without modifying a checkout:
+
+```bash
+cd /path/to/Frontier
+git apply --check /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
+```
+
+## Root Cause
+
+Instrumentation in the scratch checkout showed the minimal `N=193` failure
+has two admissions for request 192:
+
+```text
+req=192 source=request_queue scheduled=False preempted=False prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 num_new_tokens=64
+req=192 source=preempted_requests scheduled=True preempted=True prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 num_new_tokens=64
+```
+
+The second admission comes from `_preempted_requests`. Frontier preemption
+resets `victim._num_processed_tokens` and removes the explicit scheduler
+frontier, but it leaves `victim._scheduled=True`. The request then re-enters
+waiting admission, prefix-cache admission finds cached blocks, and
+`request.on_cache_hit(prefix_cached_tokens)` raises because `on_cache_hit`
+requires `_scheduled=False`.
+
+The failure is therefore a Frontier runtime-state reset issue for preempted
+chunked-prefill requests with prefix caching enabled, not bad ReplayServe
+trace data.
+
+## Patch Rationale
+
+The first patch reset two request runtime fields in
+`VllmV1EngineReplicaScheduler._preempt_request`:
+
+```python
+victim._num_prefill_tokens_cached = 0
+victim._scheduled = False
+```
+
+This matches the existing preemption intent in the same block: computed tokens
+are reset and the request is re-entered into a waiting queue for recomputation.
+After that reset, waiting admission can apply prefix-cache hit state through
+the existing `Request.on_cache_hit` path before the request is scheduled again.
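+
+The invariant and the effect of the reset can be reproduced with a toy model.
+This is a sketch under stated assumptions, not Frontier code: `ToyRequest`,
+`preempt`, and `admit_preempt_readmit` are hypothetical, and the way
+`on_cache_hit` advances processed tokens is simplified.
+
+```python
+class ToyRequest:
+    def __init__(self, request_id):
+        self.id = request_id
+        self._scheduled = False
+        self._num_processed_tokens = 0
+        self._num_prefill_tokens_cached = 0
+
+    def on_cache_hit(self, cached_tokens):
+        # Mirrors the invariant described above: only unscheduled requests.
+        if self._scheduled:
+            raise ValueError(f"Request {self.id} already scheduled.")
+        self._num_prefill_tokens_cached = cached_tokens
+        self._num_processed_tokens = cached_tokens
+
+    def on_batch_schedule(self):
+        self._scheduled = True
+
+
+def preempt(victim, apply_patch):
+    """Reset computed tokens; with the patch, also reset admission state."""
+    victim._num_processed_tokens = 0
+    if apply_patch:
+        victim._num_prefill_tokens_cached = 0
+        victim._scheduled = False
+
+
+def admit_preempt_readmit(apply_patch):
+    """Replay the two admissions logged for request 192."""
+    request = ToyRequest(192)
+    request.on_cache_hit(112)    # first admission from the request queue
+    request.on_batch_schedule()
+    preempt(request, apply_patch)
+    request.on_cache_hit(1232)   # second admission from preempted requests
+    request.on_batch_schedule()
+    return request
+```
+
+Calling `admit_preempt_readmit(apply_patch=False)` raises the same
+`already scheduled` error, while `apply_patch=True` lets the second admission
+apply its cache hit.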
+
+An earlier conservative experiment skipped `on_cache_hit` for already scheduled
+requests and advanced only the scheduler frontier. That avoided the immediate
+exception but left request 192 incomplete at simulation shutdown, because the
+request object's processed-token state never reflected the cached prefix.
+
+The 2026-06-25 RS10 debug runs exposed a second lifecycle bug. Missing request
+metrics for `coder_200_ts2` and `coder_200_ts3` were not postprocess artifacts:
+Frontier drained with `completed_requests < total_requests`. Missing requests
+had this state pattern:
+
+```text
+preempted=True
+is_prefill_complete=True
+num_processed_tokens=0
+scheduled=False
+completed=False
+```
+
+They had been preempted after entering decode. Frontier cleared processed
+tokens but kept the request in prefill-complete state. The next waiting
+admission therefore computed `num_new_tokens=0` and dropped the request from
+the waiting queue.
+
+The current patch now also:
+
+- replays decode-phase preemption by turning already-produced tokens into the
+  next prefill segment and leaving the remaining tokens as decode work;
+- preserves user-facing prompt/output lengths for metrics after runtime token
+  splitting;
+- preserves unfinished zero-token waiting requests instead of silently dropping
+  them;
+- makes sequential simulation fail fast if the event queue drains before all
+  generated requests complete, with per-request debug snapshots.
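+
+The token bookkeeping for decode-phase replay can be sketched as below. The
+interpretation that the next prefill segment is the prompt plus the
+already-produced tokens is an assumption drawn from the description above;
+`ToyTokenSplit` and `replay_after_decode_preemption` are hypothetical names,
+not the patch's code.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ToyTokenSplit:
+    user_prompt_tokens: int     # user-facing length, kept for metrics
+    user_output_tokens: int     # user-facing length, kept for metrics
+    runtime_prefill_tokens: int
+    runtime_decode_tokens: int
+
+
+def replay_after_decode_preemption(prompt_tokens, output_tokens, produced_tokens):
+    """Move already-produced tokens into the next prefill segment."""
+    if not 0 <= produced_tokens <= output_tokens:
+        raise ValueError("produced_tokens must be within [0, output_tokens]")
+    return ToyTokenSplit(
+        user_prompt_tokens=prompt_tokens,
+        user_output_tokens=output_tokens,
+        runtime_prefill_tokens=prompt_tokens + produced_tokens,
+        runtime_decode_tokens=output_tokens - produced_tokens,
+    )
+```
+
+The total token count is unchanged by the split, and the user-facing lengths
+stay available for metrics.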
+
+## Verification Matrix
+
+All patched runs used RS1 fixed config unless explicitly stated otherwise:
+online, co-location, vLLM v1, A800, Qwen/Qwen3-32B, TP2, dummy predictor,
+analytical communication backend, `max_tokens=32768`, prefix cache on, block
+size 16, chunked prefill on, batch cap 128, max batch tokens 32768, memory
+planner KV capacity.
+
+| run | Frontier root | result | runtime | notes |
+|---|---|---:|---:|---|
+| `runs/rs1b/instrumentation/n193_instrumented_print` | scratch instrumentation | fail | 4s | Proved request 192 re-entered from `_preempted_requests` with `_scheduled=True`. |
+| `runs/rs1b/patched/n193_fixed_v2` | patched scratch | pass | 11s | `N=193` fixed config passed. |
+| `runs/rs1b/patched/coder_100` | patched scratch | pass | 8s | Prefix hit ratios matched original RS1 `coder_100`. |
+| `runs/rs1b/patched/coder_2000` | patched scratch | pass | 87s | Full fixed config run completed. |
+| `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | patched scratch | pass | 462s | RS10 H20 TP1 full32K profile; completion `200/200`; 33 preemption events. |
+| `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | patched scratch | pass | 465s | RS10 H20 TP1 full32K profile; completion `200/200`; 20 preemption events. |
+
+Prefix cache summaries:
+
+| run | Frontier block hit ratio | ReplayServe token-weighted hit ratio | preemption events |
+|---|---:|---:|---:|
+| original `runs/rs1/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
+| patched `runs/rs1b/patched/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
+| patched `runs/rs1b/patched/n193_fixed_v2` | 0.1245897179 | 0.1247698141 | 5 |
+| patched `runs/rs1b/patched/coder_2000` | 0.1231893025 | 0.1233297822 | 35940 |
+| patched `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | 0.2310157359 | 0.2313416900 | 33 |
+| patched `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | 0.2173684294 | 0.2176751278 | 20 |
+
+For `coder_2000`, ReplayServe postprocess skipped 745 request rows whose
+Frontier request metrics had blank prefix-cache fields. The run still completed
+and produced `system_metrics.json` and `request_metrics.csv`.
+
+## Risks
+
+- The patch touches Frontier private `Request` fields from scheduler code,
+  matching existing local style but still relying on internal state layout.
+- Resetting `_scheduled` during preemption may affect request scheduling
+  accounting outside this RS1 path. It does not clear `_scheduled_at`, so
+  schedule history remains present, but downstream assumptions about the
+  boolean should be reviewed upstream.
+- Resetting `_num_prefill_tokens_cached` means request-level cached-prefill
+  metrics reflect the current post-preemption admission rather than stale
+  pre-preemption state. This is necessary for the existing `on_cache_hit` path
+  to model cached-prefix progress, but metrics semantics should be confirmed
+  with Frontier maintainers.
+- The decode-phase preemption replay mutates Frontier private request token
+  fields. Metrics are explicitly anchored to user-facing prompt/output lengths,
+  but upstream should review whether this should become a public Request method.
+- The patched `coder_2000` run has many preemptions. RS1 remains a plumbing
+  smoke; latency and throughput should not be treated as performance evidence.
--- a/docs/rs1_frontier_smoke.md
+++ b/docs/rs1_frontier_smoke.md
@@ -0,0 +1,163 @@
+# RS1 Frontier Smoke
+
+RS1 runs Frontier trace replay as a plumbing smoke for the Qwen coder fixtures
+generated in RS0. It checks that Frontier can consume ReplayServe's Frontier CSV,
+preserve online arrivals, run vLLM v1 prefix caching, and emit request/system
+metrics. It does not make latency or throughput claims.
+
+## Fixed Configuration
+
+- `simulation_mode=online`
+- `sys_arch=co-location`
+- `cluster_scheduler=sticky_round_robin`
+- `replica_scheduler=vllm_v1`
+- `device=a800`
+- `network_device=a800_dgx`
+- `model_name=Qwen/Qwen3-32B`
+- `attn_tensor_parallel_size=2`
+- dummy execution predictor, 1 ms per model execution
+- analytical communication backend
+- `trace_request_generator_config_max_tokens=32768`
+- prefix caching enabled
+- block size 16
+- chunked prefill enabled
+- batch cap 128
+- max batch tokens 32768
+- `num_blocks_mode=memory_planner`
+- `gpu_memory_utilization=0.9`
+- `non_kv_cache_overhead_bytes=0`
+
+The memory planner point uses Frontier's A800 device config
+(`total_memory_gb=80`) and analytical parameter memory. The non-KV overhead is
+set to 0 for this smoke, so the derived KV block count is a permissive plumbing
+budget, not a calibrated serving budget.
+
+Frontier also ships an `a800_pairwise_nvlink` network profile, but
+`replica_config_network_device` is used to construct a node SKU in the current
+co-location path. This checkout has `A800_DGX` as a node SKU and does not have an
+`A800_PAIRWISE_NVLINK` node SKU, so RS1 uses `a800_dgx`.
+
+## Reproduce
+
+From `/home/gahow/phd/replayserve`:
+
+```bash
+PIP_CACHE_DIR=/home/gahow/phd/replayserve/.cache/pip python3 -m pip install \
+  --target /home/gahow/phd/replayserve/.deps/python \
+  'ddsketch>=3.0,<4' 'fasteners>=0.19,<1' 'numpy>=1.23' 'pandas>=1.5' \
+  'plotly>=5.0' 'pyyaml>=6.0' 'scikit-learn>=1.1' 'scipy>=1.9' 'tqdm>=4.64'
+scripts/run_frontier_smoke.sh coder_100
+scripts/run_frontier_smoke.sh coder_2000
+```
+
+Each run writes:
+
+- `runs/rs1/<fixture>/command.txt`
+- `runs/rs1/<fixture>/stdout.log`
+- `runs/rs1/<fixture>/stderr.log`
+- `runs/rs1/<fixture>/exit_code.txt`
+- `runs/rs1/<fixture>/runtime_seconds.txt`
+- `runs/rs1/<fixture>/frontier_metrics/.../config.json`
+- `runs/rs1/<fixture>/frontier_metrics/.../system_metrics.json`
+- `runs/rs1/<fixture>/frontier_metrics/.../request_metrics.csv`
+- `runs/rs1/<fixture>/postprocess_summary.json`
+- `runs/rs1/<fixture>/postprocess_summary.md`
+
+## Current Results
+
+Initial local attempt with `network_device=a800_pairwise_nvlink` failed during
+config reconstruction:
+
+```text
+ValueError: [BaseNodeSKUConfig] Invalid type string: a800_pairwise_nvlink
+```
+
+The preserved failed run context is under
+`runs/rs1/coder_100_failed_a800_pairwise_nvlink/`.
+
+The first `a800_dgx` attempt failed because the base Python environment lacked
+Frontier runtime dependencies:
+
+```text
+ModuleNotFoundError: No module named 'plotly'
+```
+
+Dependencies were installed into ReplayServe-local `.deps/python` with pip
+`--target`; Frontier source was not installed or modified.
+
+### coder_100
+
+Status: passed.
+
+- Run dir: `runs/rs1/coder_100/`
+- Runtime: 7 seconds
+- Metrics dir:
+  `runs/rs1/coder_100/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_100/`
+- Frontier block-level prefix hit ratio: `0.04948661841440835`
+- ReplayServe token-weighted prefix hit ratio: `0.04956232588915065`
+- Frontier total query blocks: `29705`
+- Frontier total hit blocks: `1470`
+- ReplayServe total query tokens: `474554`
+- ReplayServe total hit tokens: `23520`
+- Memory planner mode: `memory_planner`
+- GPU memory utilization: `0.9`
+- A800 memory budget: `80 GiB * 0.9 = 77309411328 bytes`
+- Qwen3-32B TP2 analytical weight shard estimate:
+  `28940697600 bytes` (`26.953125 GiB`)
+- Non-KV overhead assumption: `0 bytes`
+- Available KV budget under this smoke assumption: `48368713728 bytes`
+- Derived KV blocks: `36902`
+- Preemption events: `0`
+- Allocation/preemption/OOM log lines: `0`
+
+The derived KV block count is recomputed by ReplayServe postprocess with the
+same formula as Frontier `MemoryPlanner.get_num_blocks` because this run did
+not emit Frontier's `[MEMORY_STATE]` line in stdout/stderr.
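+
+The byte-level budget in the `coder_100` list above can be re-derived with a
+few lines of arithmetic. This is an illustrative sketch only: the variable
+names are hypothetical, it is not Frontier's `MemoryPlanner` code, and it stops
+at the KV byte budget because the bytes-per-block term is not stated in this
+note.
+
+```python
+from fractions import Fraction
+
+GIB = 1024 ** 3
+
+# Inputs copied from the coder_100 bullet list.
+total_memory_bytes = 80 * GIB             # A800 total_memory_gb=80, read as GiB
+gpu_memory_utilization = Fraction(9, 10)  # 0.9, kept exact to avoid float rounding
+weight_shard_bytes = 28_940_697_600       # analytical TP2 weight shard estimate
+non_kv_overhead_bytes = 0                 # RS1 smoke assumption
+
+memory_budget_bytes = int(total_memory_bytes * gpu_memory_utilization)
+kv_budget_bytes = memory_budget_bytes - weight_shard_bytes - non_kv_overhead_bytes
+
+print(memory_budget_bytes)       # 77309411328
+print(weight_shard_bytes / GIB)  # 26.953125
+print(kv_budget_bytes)           # 48368713728
+```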
+
+### coder_2000
+
+Status: blocked by Frontier runtime error under the fixed RS1 configuration.
+
+- Run dir: `runs/rs1/coder_2000/`
+- Runtime: 4 seconds
+- Config:
+  `runs/rs1/coder_2000/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_2000/config.json`
+- Failure summary: `runs/rs1/coder_2000/failure_summary.md`
+
+Frontier failed during vLLM v1 prefix-cache scheduling:
+
+```text
+ValueError: Request 194 already scheduled.
+```
+
+The traceback reaches
+`vllm_v1_engine_replica_scheduler.py:3185`, where the scheduler calls
+`request.on_cache_hit(prefix_cached_tokens)`, and then
+`request.py:505`, where `Request.on_cache_hit()` rejects cache-hit updates after
+the request has already been scheduled.
+
+No Frontier source changes were made. RS1 stops here rather than changing
+scheduler knobs, because disabling prefix caching or chunked prefill would no
+longer match the fixed smoke point.
+
+## Metric Semantics
+
+Frontier reports prefix-cache hits at block granularity. ReplayServe postprocess
+uses `sidecar.jsonl` to weight each request's first `hit_blocks` by
+`block_token_counts`, so a hit on a partial final block contributes the true
+partial token count rather than 16 tokens.
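+
+A minimal sketch of that weighting, assuming each sidecar record exposes
+`hit_blocks` and `block_token_counts` under exactly those names. The record
+layout, the helper names, and the toy values are hypothetical; this is not the
+code in the ReplayServe postprocess tool.
+
+```python
+def hit_tokens(block_token_counts, hit_blocks):
+    """Tokens covered by the first `hit_blocks` blocks of one request."""
+    return sum(block_token_counts[:hit_blocks])
+
+
+def token_hit_ratio(records):
+    """Token-weighted hit ratio over an iterable of per-request records."""
+    hit = total = 0
+    for rec in records:
+        counts = rec["block_token_counts"]
+        hit += hit_tokens(counts, rec["hit_blocks"])
+        total += sum(counts)
+    return hit / total if total else 0.0
+
+
+# Toy records (not trace data): block size 16; the second request ends in a
+# 5-token partial block that is also counted as a hit.
+records = [
+    {"block_token_counts": [16, 16, 16, 16], "hit_blocks": 1},
+    {"block_token_counts": [16, 16, 5], "hit_blocks": 3},
+]
+print(hit_tokens([16, 16, 5], 3))  # 37, not 3 * 16 = 48
+print(token_hit_ratio(records))    # 53 hit tokens out of 101
+```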
+
+If Frontier omits `request_cached_prefill_tokens`,
+`request_prefix_cache_query_blocks`, or `request_prefix_cache_hit_blocks` from
+`request_metrics.csv`, ReplayServe cannot compute token-weighted hit ratio from
+that run without additional simulator instrumentation.
+
+## Limitations
+
+- Frontier's public A800 compute profiles in the checked source do not include a
+  dense `Qwen/Qwen3-32B` profile.
+- Dummy execution predictor is enabled, so TTFT, TPOT, E2E latency, and
+  throughput are only pipeline smoke outputs.
+- Memory planner uses analytical parameter memory and a 0-byte non-KV overhead
+  assumption. The derived KV capacity must be replaced by calibrated overhead or
+  runtime profiling before interpreting capacity pressure.
--- a/docs/rs3_sweep_harness.md
+++ b/docs/rs3_sweep_harness.md
@@ -0,0 +1,143 @@
+# RS3 Sweep Harness
+
+RS3 adds a reproducible Frontier sweep harness and a tiny smoke. This is not the
+full TP/EP/DP/config scan.
+
+## Files
+
+- Config: `configs/rs3_tiny_sweep.json`
+- Runner: `tools/run_frontier_sweep.py`
+- Aggregator: `tools/aggregate_runs.py`
+- Tiny smoke outputs: `runs/rs3_tiny_smoke_20260624/`
+
+The output layout is:
+
+```text
+runs/<suite>/<sim>/<fixture>/<config_id>/
+  command.txt
+  env.txt
+  run_manifest.json
+  run_status.json
+  stdout.log
+  stderr.log
+  exit_code.txt
+  runtime_seconds.txt
+  frontier_metrics/...
+  postprocess_summary.json
+  postprocess_summary.md
+runs/<suite>/summary.csv
+runs/<suite>/summary.md
+```
+
+## Config Scheme
+
+`configs/rs3_tiny_sweep.json` is intentionally small JSON:
+
+- `suite_id`: output suite under `runs/`.
+- `sim`: simulator/mode name used in the run path.
+- `frontier`: Frontier checkout metadata. The tiny smoke points at patched
+  scratch `/tmp/replayserve-frontier-rs1b`, not canonical Frontier.
+- `fixtures`: fixture names under `traces/fixtures/`.
+- `defaults`: fixed Frontier knobs shared by each config.
+- `configs`: named variants with optional `overrides`.
+
+The exposed Frontier knobs include:
+
+- parallelism: `attn_tensor_parallel_size`, `attn_data_parallel_size`,
+  `moe_tensor_parallel_size`, `moe_expert_parallel_size`,
+  `num_pipeline_stages`, `num_replicas`
+- scheduler: `batch_size_cap` / max-num-seqs equivalent,
+  `max_tokens_in_batch` / max-batch-tokens equivalent, `block_size`,
+  `enable_prefix_caching`, `enable_chunked_prefill`,
+  `long_prefill_token_threshold`
+- fixed smoke context: model, device, network device, trace max tokens,
+  memory-planner mode, GPU memory utilization, non-KV overhead, and dummy
+  execution time
+
+For dense `Qwen/Qwen3-32B`, the EP-like knobs stay at `1` in the tiny smoke.
+They are present so later MoE configs can be represented without changing the
+harness schema.
+
+## Run Commands
+
+From `/home/gahow/phd/replayserve`:
+
+```bash
+python3 tools/run_frontier_sweep.py \
+  --config configs/rs3_tiny_sweep.json \
+  --suite-id rs3_tiny_smoke_20260624
+
+python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
+```
+
+The runner refuses to replace an existing selected run directory unless
+`--force` is passed. Use `--dry-run` to emit commands/manifests without running
+Frontier, and `--only-config` / `--only-fixture` to narrow the selected matrix.
+
+## Frontier Mode
+
+The RS3 tiny smoke uses:
+
+- `frontier.root=/tmp/replayserve-frontier-rs1b`
+- `frontier.mode=patched_scratch`
+- patch file `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
+
+The canonical checkout `/tmp/toc-llm-sim-research/Frontier` remains clean and is
+not modified by the harness. `summary.csv` records `frontier_dirty=true` for the
+patched scratch because the local patch is applied there; that is expected.
+
+To run canonical mode for a safe config, copy the JSON config, set
+`frontier.root` to `/tmp/toc-llm-sim-research/Frontier`, change `sim`, and run a
+small selected config. Do not use canonical fixed `coder_2000` until the
+prefix-cache chunked-prefill bug is fixed upstream.
+
+## Tiny Smoke Results
+
+Command:
+
+```bash
+python3 tools/run_frontier_sweep.py \
+  --config configs/rs3_tiny_sweep.json \
+  --suite-id rs3_tiny_smoke_20260624
+python3 tools/aggregate_runs.py runs/rs3_tiny_smoke_20260624
+```
+
+Results:
+
+| config | status | runtime | prefix cache | chunked prefill | Frontier block hit ratio | ReplayServe token hit ratio | preemptions |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| `fixed_prefix_on` | pass | 8s | on | on | `0.049486618` | `0.049562326` | 0 |
+| `prefix_cache_off` | pass | 7s | off | on | n/a | n/a | 0 |
+
+Aggregated files:
+
+- `runs/rs3_tiny_smoke_20260624/summary.csv`
+- `runs/rs3_tiny_smoke_20260624/summary.md`
+
+The prefix-off run does not have Frontier cache columns in `request_metrics.csv`;
+`summary.csv` records `cache_metrics_available=false` and the missing-column
+reason.
+
+TTFT/TPOT/E2E/throughput fields are aggregated from Frontier `system_metrics.json`
+when present. In this tiny smoke they are dummy-predictor plumbing outputs, not
+performance results.
+
+## Not Yet Run
+
+- No `coder_2000` sweep was run in RS3.
+- No TP/DP/EP matrix was swept.
+- No batch cap, max batch tokens, block size, chunked-prefill, or threshold
+  matrix was swept beyond the two-config smoke.
+- No canonical Frontier patched-vs-unpatched comparison was rerun.
+- No Vidur or AIConfigurator run is part of this harness yet.
+
+## Next Harness Work
+
+- Add a small checked-in config for a real RS3 candidate grid only after deciding
+  the patch/upstream policy.
+- Add guardrails for invalid dense/MoE parallelism combinations before launching
+  larger matrices.
+- Investigate `coder_2000` missing request-level cache fields before using
+  request-level hit ratio as a headline sweep metric.
+- Keep latency/throughput result tables clearly separated by predictor/profile
+  mode: dummy smoke, profiled Frontier, or calibrated run.
--- a/docs/rs4_frontier_h20_tp1_alignment.md
+++ b/docs/rs4_frontier_h20_tp1_alignment.md
@@ -0,0 +1,740 @@
+# RS4 Frontier H20 TP1 Alignment
+
+This note compares Frontier H20 TP1 against the real vLLM TP1 run on dash2 for
+`coder_100`.
+
+## Setup
+
+Real vLLM:
+
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash2, NVIDIA H20
+- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
+- TP: 1
+- KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+- Run: `runs/vllm_gpu_smoke_20260624/tp1_coder100_uncapped`
+
+Frontier:
+
+- Frontier root: `/tmp/replayserve-frontier-rs1b`
+- Frontier commit: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
+- Model config name: `qwen3-a3b-30b-moe`
+- Device: `h20`
+- Network node SKU: `h20_dgx`
+- TP: `attn_tensor_parallel_size=1`, `moe_tensor_parallel_size=1`,
+  `moe_expert_parallel_size=1`
+- `max_tokens_in_batch=32768`, `batch_size_cap=64`, block size 16
+- Prefix cache on, chunked prefill on
+- `long_prefill_token_threshold=32768`
+- Config: `configs/rs4_frontier_h20_tp1.json`
+- Run: `runs/rs4_frontier_h20_tp1_20260624`
+
+The high long-prefill threshold is deliberate. Frontier's earlier threshold 64
+run under-counted prefix hits because long prompts were admitted in 64-token
+chunks, unlike the current real vLLM run.
+
+## KV Capacity
+
+| run | KV blocks | KV tokens | note |
+|---|---:|---:|---|
+| Frontier `planner_kv` | 17,385 | 278,160 | Frontier H20 memory planner, no non-KV overhead |
+| Frontier `vllm_kv_15281` | 15,281 | 244,496 | Explicitly matched to real vLLM TP1 |
+| vLLM TP1 | 15,281 | 244,496 | From vLLM memory profiling |
+
+So only `vllm_kv_15281` has the same KV block count as real vLLM TP1.
+
+## Results
+
+| run | completed | prefix hit tokens / ratio | preemptions | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | decode tok/s |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| Frontier `planner_kv` | 96/100 | 110,608 / 0.240691 | 0 | 0.986/128.991s | 0.582/0.582s | 279.092/1706.675s | 19.4 |
+| Frontier `vllm_kv_15281` | 92/100 | 103,168 / 0.242542 | 0 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 19.4 |
+| vLLM TP1 real | 100/100 | 119,152 / 0.251082 sidecar estimate | 8 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 567.4 |
+
+The latency/throughput columns are not calibrated. Frontier still uses dummy
+execution timing, so its constant TPOT is a simulator artifact.
+
+## Prefix Admission Check
+
+Real vLLM TP1 preempts requests, so the sidecar's theoretical prefix-hit
+estimate is not the right observed comparator for every request. The observed
+vLLM scheduler signal is the first `computed:` value logged in `stdout.log`
+when each request first starts.
+
+Using first-start `computed:` tokens:
+
+| Frontier run | compared rows | Frontier computed sum | vLLM first-start computed sum | mismatch |
+|---|---:|---:|---:|---:|
+| `planner_kv` | 96 | 110,608 | 108,208 | one request differs |
+| `vllm_kv_15281` | 92 | 103,168 | 103,168 | exact match |
+
+So with the KV block count explicitly matched, Frontier's prefix-cache admission
+matches real vLLM TP1 for every row where Frontier emits complete cache metrics.
+
+## Current Alignment Judgment
+
+Aligned:
+
+- H20 device and Qwen3-30B-A3B structural model config can run in Frontier.
+- TP1 scheduler knobs can be matched.
+- KV block count can be matched explicitly at 15,281 blocks.
+- First-admission prefix-cache hit tokens match real vLLM TP1 on completed rows
+  when KV blocks are explicit.
+
+Not aligned:
+
+- Frontier emits complete request/cache metrics for only 92/100 requests in the
+  explicit-KV run, while vLLM completes 100/100.
+- Frontier reports 0 preemptions; real vLLM TP1 reports 8 preemptions across 5
+  repeated-start requests.
+- Frontier timing is not comparable because it still uses dummy execution
+  prediction. The current latency/throughput gap is expected and not a
+  calibrated simulator error.
+
+Next work:
+
+- Treat RS6 as the current profiled baseline and investigate why it omits
+  complete latency/cache metrics for requests `70`, `77`, `88`, and `90`.
+- Instrument Frontier's vLLM V1 scheduler around KV block allocation, free-block
+  count, and preemption victim selection. Real vLLM TP1 has 8 preemptions, while
+  Frontier still reports 0 with the same explicit 15,281-block capacity.
+- Add a per-request Frontier/vLLM comparator that reports TTFT/TPOT/E2E ratios,
+  prefix hits, and completion/preemption status on the same request ids.
+- Calibrate CPU/scheduler/CUDA-graph effects separately from op profile timing;
+  RS6 removed the 4096-token linear/MoE extrapolation as the primary explanation
+  for the remaining gap.
+
+## Performance Gap
+
+Use Frontier `vllm_kv_15281` as the current aligned-KV simulator point. This
+matches the real vLLM TP1 KV block count, but it still uses Frontier dummy
+execution timing.
+
+| metric | Frontier H20 TP1 explicit KV | real vLLM H20 TP1 | gap |
+|---|---:|---:|---:|
+| completed requests | 92/100 | 100/100 | not aligned |
+| TTFT p50 | 0.964s | 4.503s | Frontier 0.21x real |
+| TTFT p95 | 182.639s | 29.060s | Frontier 6.28x real |
+| TPOT p50 | 0.582s | 0.066s | Frontier 8.81x real |
+| TPOT p95 | 0.582s | 0.621s | Frontier 0.94x real |
+| E2E p50 | 305.290s | 41.841s | Frontier 7.30x real |
+| E2E p95 | 1765.347s | 97.366s | Frontier 18.13x real |
+| RPS | 0.0217 | 0.6880 | vLLM 31.74x Frontier |
+| decode tok/s | 19.4 | 567.4 | vLLM 29.20x Frontier |
+
+Interpretation:
+
+- The prefix admission path is close after explicit KV matching, but performance
+  is not calibrated.
+- Frontier uses dummy execution timing; its TPOT is nearly constant at 582 ms,
+  while real vLLM TP1 has p50 TPOT 66 ms and p95 TPOT 621 ms.
+- Frontier does not reproduce real vLLM's TP1 preemption behavior: real vLLM had
+  8 preemptions, while Frontier reported 0.
+- Frontier emits complete request/cache metrics for only 92 rows in this run,
+  so p95 and throughput are not yet on the same request set.
+- The TTFT sign is mixed: Frontier p50 TTFT is too optimistic, but p95 TTFT is
+  far too pessimistic. This is consistent with uncalibrated execution timing plus
+  different queue/preemption dynamics.
+
+## RS5 Profiled Frontier Timing
+
+Frontier does support replacing dummy timing with real CSV profiles through the
+random-forest execution-time predictor. The required non-dummy flags are wired
+in `tools/run_frontier_sweep.py`, and the active profiled config is
+`configs/rs5_frontier_h20_tp1_profile.json`.
+
+Profile data collected on dash2 H20 TP1:
+
+- Linear ops: `linear_op.csv`, CUDA event, max tokens 4096.
+- Attention: `attention_combined.csv`, CUDA event, max sequence/chunk 18000,
+  with 15417 standard rows plus 612 true-mixed rows. Online replay needs the
+  true-mixed rows to train `attn_prefill_mixed` and `attn_decode_in_mixed`.
+- MoE: `moe_vllm_fused.csv`, CUDA event, max tokens 4096, vLLM fused MoE
+  backend.
+
+Frontier vLLM 0.11.1 profiling needed local compatibility patches in
+`patches/frontier-vllm-0.11.1-profiling-compat.patch`:
+
+- RoPE helper fallback when vLLM 0.11.1 `get_rope()` no longer accepts the
+  legacy `rotary_dim` keyword.
+- `_get_config_dtype_str` fallback for vLLM fused MoE config dtype.
+- `ReplicatedLinear(disable_tp=True)` fallback to torch `Linear` when vLLM TP
+  group is not initialized in standalone profiling.
+- `fused_topk()` variable-return handling.
+- `invoke_fused_moe_kernel()` 0.11.1 signature compatibility.
+
+The first profiled MoE attempt used Frontier's `frontier_loop` backend and was
+not faithful to vLLM serving. It predicted `moe_grouped_gemm` at about 16 ms for
+24 tokens and 19 ms for 1024 tokens, causing TPOT around 0.93 s. The vLLM fused
+MoE profile predicts about 0.32 ms for 24 tokens and 0.87 ms for 1024 tokens.
+
+| run | completed | prefix hit ratio | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 | total tok/s | decode tok/s |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| Frontier dummy `vllm_kv_15281` | 92/100 | 0.2422 | 0.964/182.639s | 0.582/0.582s | 305.290/1765.347s | 131.3 | 19.4 |
+| Frontier profiled `frontier_loop` MoE | 93/100 | 0.2492 | 3.320/310.235s | 0.930/1.767s | 492.097/2038.538s | 165.9 | 24.6 |
+| Frontier profiled vLLM fused MoE | 97/100 | 0.2376 | 0.355/13.695s | 0.056/0.098s | 27.032/119.019s | 2056.7 | 304.5 |
+| Frontier profiled vLLM fused MoE, linear/MoE 32K | 96/100 | 0.2484 | 0.909/12.763s | 0.057/0.146s | 30.939/119.636s | 2348.9 | 347.8 |
+| vLLM TP1 real | 100/100 | 0.2511 | 4.503/29.060s | 0.066/0.621s | 41.841/97.366s | 3832.3 | 567.4 |
+
+Current judgment:
+
+- The profiled vLLM fused MoE run is the first useful timing baseline. TPOT p50
+  is close to real vLLM, but throughput is still about 54% of real vLLM and
+  TTFT/E2E tails do not align.
+- After extending linear and MoE profiles to 32768 tokens and adding
+  `prefill_hot` MoE rows, the cache hit ratio is nearly aligned
+  (0.2484 vs vLLM 0.2511), throughput improves to about 61% of real vLLM, and
+  TTFT p50 moves from 0.08x to 0.20x of real vLLM. This confirms that the 4096
+  profile ceiling was a real source of error.
+- Prefix/cache accounting remains close but not exact: the profiled run emits
+  complete cache metrics for 96/100 requests in the 32K run, with token hit
+  ratio 0.2488 vs vLLM's sidecar estimate 0.2511.
+- Frontier still reports zero preemptions, while real vLLM TP1 had 8 preemption
+  events. This affects completion set, TTFT tail, and E2E tail.
+- The remaining gaps are no longer explained by the linear/MoE 4096-token
+  extrapolation alone. The 32K run still has TTFT p50 at 0.20x, TTFT p95 at
+  0.44x, TPOT p95 at 0.23x, and throughput at 0.61x of real vLLM. This points
+  to missing CPU/scheduler/CUDA-graph modeling plus Frontier's scheduler and
+  completion/preemption fidelity.
+- The 32K run still completes only 96/100 requests in latency/cache metrics
+  (`70`, `77`, `88`, `90` missing), while real vLLM completes 100/100. This is
+  a Frontier lifecycle/metrics or scheduler-fidelity issue to debug separately.
+
+## 2026-06-24 Follow-Up
+
+Handled in the ReplayServe harness:
+
+- `tools/run_frontier_sweep.py` now passes an absolute metrics output path into
+  Frontier. Frontier runs with `cwd=/tmp/replayserve-frontier-rs1b`; relative
+  metrics paths can otherwise be written under the Frontier scratch instead of
+  ReplayServe's run directory.
+- `tools/postprocess_frontier_smoke.py` now emits a `completion` block with
+  `completed_requests`, `total_requests`, and `missing_latency_request_ids`.
+- `tools/aggregate_runs.py` now marks a run as `incomplete` when postprocess
+  reports missing latency rows. The latest RS6 summary is therefore incomplete,
+  not a clean pass.
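+
+The `incomplete` rule can be pictured as follows. This is a sketch only: the
+`completion` keys are the ones named above, but the function name, the `pass`
+label, and the exact condition are assumptions rather than the code in
+`tools/aggregate_runs.py`.
+
+```python
+def classify_run(completion):
+    """Return 'incomplete' when any request lacks latency rows, else 'pass'."""
+    missing = completion.get("missing_latency_request_ids", [])
+    completed = completion["completed_requests"]
+    total = completion["total_requests"]
+    if missing or completed < total:
+        return "incomplete"
+    return "pass"
+
+
+# Shape of the RS6 completion block as described in this note.
+rs6_like = {
+    "completed_requests": 96,
+    "total_requests": 100,
+    "missing_latency_request_ids": [70, 77, 88, 90],
+}
+print(classify_run(rs6_like))  # incomplete
+```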
+
+Latest RS6 vs real vLLM TP1 after the 32K profile and harness fixes:
+
+| metric | Frontier RS6 32K profile | real vLLM TP1 | Frontier / vLLM |
+|---|---:|---:|---:|
+| completed requests | 96/100 | 100/100 | 0.96 |
+| prefix token hit ratio | 0.2488 | 0.2511 | 0.99 |
+| preemption events | 0 | 8 | 0.00 |
+| TTFT p50 | 0.909s | 4.503s | 0.20 |
+| TTFT p95 | 12.763s | 29.060s | 0.44 |
+| TPOT p50 | 0.0569s | 0.0661s | 0.86 |
+| TPOT p95 | 0.146s | 0.621s | 0.23 |
+| E2E p50 | 30.939s | 41.841s | 0.74 |
+| E2E p95 | 119.636s | 97.366s | 1.23 |
+| total tok/s | 2348.9 | 3832.3 | 0.61 |
+| decode tok/s | 347.8 | 567.4 | 0.61 |
+
+Preemption experiment:
+
+- A local trial enabled waiting-admission preemption in Frontier Phase 2. It did
+  produce preemption events, but it was not a valid alignment improvement:
+  Frontier completed only 79/100 requests and amplified the early-decode
+  disappearance pattern. That config was removed from `configs/`.
+- This means the remaining preemption gap is not just "turn on preemption in
+  Phase 2". Frontier's batch/runtime-epoch lifecycle needs a deeper fix before
+  its preemption behavior can be considered faithful to vLLM TP1.
+
+Current interpretation:
+
+- Prefix/cache replay is close: token-weighted prefix hit ratio is within about
+  1% relative of the vLLM synthetic replay estimate.
+- Completion/preemption is not aligned. Requests `70`, `77`, `88`, and `90`
+  begin decode in RS6 but never reach completion metrics; vLLM completes all
+  100 requests and logs 8 preemption events.
+- Timing is partially useful but not fully calibrated. Linear and MoE profiles
+  now cover the trace's long-prefill range up to 32768 tokens, so the old 4096
+  extrapolation is no longer the main explanation. The remaining TTFT/TPOT/E2E
+  gap likely comes from missing CPU/scheduler overhead, decode CUDA graph
+  modeling, and Frontier scheduler lifecycle differences.
+
+## 2026-06-25 500-Request Stress
+
+Generated `traces/fixtures/coder_500` from the first 500 rows of
+`qwen_coder_blksz_16.jsonl`:
+
+- `row_count=500`
+- `max_total_tokens=21318`
+- `overflow_count=0`
+- `partial_final_block_rows=466`
+
+Frontier RS8 used the same H20 TP1 Qwen3-30B-A3B full32K profile and explicit
+KV block count as RS6:
+
+- Config:
+  `configs/rs8_frontier_h20_tp1_profile_full32k_coder500.json`
+- Run:
+  `runs/rs8_frontier_h20_tp1_profile_full32k_coder500_20260625`
+- Runtime: 492 seconds
+- Status: incomplete
+
+| metric | Frontier RS6 100 reqs | Frontier RS8 500 reqs |
+|---|---:|---:|
+| completed requests | 96/100 | 439/500 |
+| missing latency/cache rows | 4 | 61 |
+| prefix token hit ratio | 0.2488 | 0.1192 |
+| preemption events | 0 | 0 |
+| TTFT p50/p95 | 0.909/12.763s | 136.776/340.237s |
+| TPOT p50/p95 | 0.0569/0.146s | 0.0564/0.0894s |
+| E2E p50/p95 | 30.939/119.636s | 177.800/397.291s |
+| total tok/s | 2348.9 | 4733.7 |
+| decode tok/s | 347.8 | 656.2 |
+
+Missing request ids in RS8:
+
+```text
+70,77,88,90,103,106,134,135,142,143,153,154,176,178,183,184,186,188,210,211,216,222,245,246,263,272,274,278,291,298,299,300,320,325,334,335,347,348,363,367,373,374,393,399,403,409,412,413,414,433,434,437,439,450,453,460,469,475,476,479,497
+```
+
+The incomplete-row issue grows with load: 4/100 missing (4.0%) in RS6 becomes
+61/500 missing (12.2%) in RS8. This makes RS8 invalid for final performance
+claims, but useful as a stress signal for Frontier lifecycle/metrics fidelity.
+
+The lower prefix hit ratio is not by itself proof of adapter failure. The
+unbounded trace-side trie estimate for `coder_500` is 0.3868 token hit ratio,
+but the H20 TP1 configuration has finite KV capacity (`num_blocks=15281`, about
+244K tokens). The 500-request window has 2.7M prompt tokens, so KV eviction can
+substantially reduce real prefix hits. The dash1 vLLM run below is the current
+finite-cache comparator for whether Frontier's behavior is faithful.
+
+Real vLLM TP1 500 was first attempted on dash2 with the same settings as
+`tp1_coder100_uncapped` (`max_num_seqs=64`, `max_num_batched_tokens=32768`,
+`gpu_memory_utilization=0.85`, `CUDA_VISIBLE_DEVICES=0`), but did not start
+because dash2 was already occupied by eight existing `agentic-kvc` vLLM serve
+processes on ports 8000-8007. Each H20 had about 89GB allocated, and vLLM failed
+with free memory below the required 0.85 utilization target. Those processes
+were not killed; the temporary ReplayServe GPU lock was released.
+
+A replacement vLLM TP1 500 run completed on dash1:
+
+- Run:
+  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder500_uncapped`
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
+- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
+- Command knobs: `TP=1`, `max_model_len=32768`, `max_num_seqs=64`,
+  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
+  prefix caching on, chunked prefill on
+- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+- Replay wall time after engine startup: 595.116 seconds
+- Process elapsed including model load/startup: 2026-06-25T03:08:18Z to
+  2026-06-25T03:19:41Z (683 seconds)
+
+| metric | Frontier RS8 500 reqs | vLLM TP1 500 reqs | vLLM / Frontier |
+|---|---:|---:|---:|
+| completed requests | 439/500 | 500/500 | not aligned |
+| preemption events | 0 | 63 | not aligned |
+| repeated/preempted request ids | 0 | 57 | not aligned |
+| TTFT p50 | 136.776s | 185.658s | 1.36 |
+| TTFT p95 | 340.237s | 375.895s | 1.10 |
+| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
+| TPOT p95 | 0.0894s | 0.0919s | 1.03 |
+| E2E p50 | 177.800s | 224.270s | 1.26 |
+| E2E p95 | 397.291s | 417.356s | 1.05 |
+| requests/s | 0.661 | 0.840 | 1.27 |
+| total tok/s | 4733.7 | 5282.9 | 1.12 |
+| decode tok/s | 656.2 | 732.3 | 1.12 |
+
+Because Frontier emits latency/cache rows for only 439 requests, the latency
+comparison above mixes Frontier's completed subset with vLLM's complete 500-row
+run. Restricting vLLM to the same 439 request ids gives:
+
+| metric | Frontier RS8 439 rows | vLLM same 439 ids | vLLM / Frontier |
+|---|---:|---:|---:|
+| TTFT p50 | 136.776s | 169.968s | 1.24 |
+| TTFT p95 | 340.237s | 375.760s | 1.10 |
+| TPOT p50 | 0.0564s | 0.0498s | 0.88 |
+| TPOT p95 | 0.0894s | 0.1071s | 1.20 |
+| E2E p50 | 177.800s | 218.606s | 1.23 |
+| E2E p95 | 397.291s | 416.110s | 1.05 |
+
+Prefix/cache comparison needs careful metric naming:
+
+- The unbounded ReplayServe trie estimate for all 500 rows is 1,047,632 hit
+  tokens / 2,708,110 prompt tokens = 0.3868 token hit ratio.
+- vLLM's finite-cache scheduler log is much lower under this pressure:
+  first-start `computed:` ratio is 0.0979, last-start ratio is 0.1643, and
+  max-per-request ratio is 0.1655.
+- On the same 439 request ids where Frontier emits complete metrics, vLLM's
+  first-start `computed:` ratio is 0.1050, last-start ratio is 0.1665, and
+  max-per-request ratio is 0.1679.
+- Frontier RS8 reports `replayserve_token_hit_ratio=0.1192` and
+  `frontier_block_hit_ratio=0.1191`, which falls between vLLM's first-start and
+  last-start finite-cache ratios, far below the unbounded trace-side estimate.
+
+Current 500-request judgment:
+
+- Frontier's timing profile is now in the right broad range for this stressed
+  H20 TP1 run: TPOT p50/p95 and E2E p95 are close to vLLM, and aggregate token
+  throughput is within about 12%.
+- The run is still not a faithful simulator result because completion and
+  preemption diverge: Frontier drops 61 latency/cache rows and reports zero
+  preemptions, while real vLLM completes all 500 requests and logs 63
+  preemption events across 57 request ids.
+- The 500-request trace invalidates the earlier use of the unbounded sidecar
+  prefix estimate as the primary comparator. Finite KV capacity, eviction, and
+  preemption must be part of the prefix-cache replay metric.
+
+ReplayServe TODO:
+
+- Treat incomplete Frontier runs as invalid for final performance claims unless
+  the comparison explicitly reports the missing request set.
+- Keep the focused Frontier debug guard in the local patch: sequential mode now
+  fails if `completed_requests < total_requests` at drain time and reports the
+  missing request state.
+- Add a comparator that reports both unbounded trace-side prefix reuse and
+  finite-cache observed reuse from vLLM scheduler logs; do not compare
+  Frontier's finite-cache hit ratio directly to the unbounded trie estimate.
+- Profile or import vLLM CPU overhead records for H20 TP1 before enabling
+  `skip_cpu_overhead_modeling=false`; without those records Frontier falls back
+  to zero CPU overhead.
+- Collect kernel-only/decode-CUDA-graph timing profiles before using
+  `decode_cuda_graph_mode=full_decode_only`; the current RS6 profile is CUDA
+  event/eager timing.
+
+## 2026-06-25 200-Request Timestamp Scale 2/3
+
+Generated `traces/fixtures/coder_200_ts0667` from the first 200 rows of
+`qwen_coder_blksz_16.jsonl`, with each timestamp multiplied by `2/3` in the
+fixture files:
+
+- `row_count=200`
+- `timestamp_scale=0.6666666666666666`
+- `last_timestamp=30.711333333333332`
+- `max_total_tokens=18985`
+- `partial_final_block_rows=182`
+
+Important: in the current replay semantics, a smaller timestamp scale makes
+arrivals denser. Scale `2/3` shrinks the arrival window from about 46.1s to
+30.7s for the first 200 requests. This does not reduce queue pressure relative
+to the same 200 requests at scale 1.0; it only reduces the request count
+relative to the 500-request stress.
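+
+The effect of the scale on the arrival window can be checked from the fixture
+numbers. In this sketch the helper name is hypothetical, the scale-1.0 window
+is back-computed from the `coder_200_ts0667` value rather than read from a
+fixture, and the mean offered rate assumes the first request arrives at
+roughly `t=0`.
+
+```python
+def scale_timestamps(timestamps, scale):
+    """Multiply every arrival timestamp by `scale` (scale < 1 means denser)."""
+    return [t * scale for t in timestamps]
+
+
+# last_timestamp reported for coder_200_ts0667 (scale 2/3).
+last_ts_scaled = 30.711333333333332
+last_ts_raw = last_ts_scaled / (2 / 3)  # back-computed scale-1.0 window
+
+request_count = 200
+for scale in (2 / 3, 1.0, 2.0, 3.0):
+    window = last_ts_raw * scale
+    rate = request_count / window  # mean offered arrival rate, not throughput
+    print(f"scale={scale:.3f} window={window:.3f}s offered={rate:.2f} req/s")
+```
+
+A scale below 1 raises the offered arrival rate for the same payloads, which is
+why only scales above 1 lower queue pressure.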
+
+Frontier RS9:
+
+- Config:
+  `configs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667.json`
+- Run:
+  `runs/rs9_frontier_h20_tp1_profile_full32k_coder200_ts0667`
+- Runtime: 460 seconds
+- Status: incomplete
+
+vLLM dash1 TP1:
+
+- Run:
+  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder200_ts0667_uncapped`
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
+- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+- Replay wall time after engine startup: 242.813 seconds
+
+| metric | Frontier RS9 200 ts=2/3 | vLLM TP1 200 ts=2/3 | vLLM / Frontier |
+|---|---:|---:|---:|
+| completed requests | 176/200 | 200/200 | not aligned |
+| preemption events | 0 | 26 | not aligned |
+| TTFT p50 | 20.580s | 34.563s | 1.68 |
+| TTFT p95 | 96.718s | 120.804s | 1.25 |
+| TPOT p50 | 0.0584s | 0.0515s | 0.88 |
+| TPOT p95 | 0.2359s | 0.2535s | 1.07 |
+| E2E p50 | 73.207s | 83.622s | 1.14 |
+| E2E p95 | 189.240s | 183.727s | 0.97 |
+| requests/s | 0.583 | 0.824 | 1.41 |
+| total tok/s | 3913.4 | 4864.8 | 1.24 |
+| decode tok/s | 593.3 | 737.5 | 1.24 |
+
+Restricting vLLM to the same 176 request ids where Frontier emits complete
+metrics gives:
+
+| metric | Frontier RS9 176 rows | vLLM same 176 ids | vLLM / Frontier |
+|---|---:|---:|---:|
+| TTFT p50 | 20.580s | 27.896s | 1.36 |
+| TTFT p95 | 96.718s | 120.804s | 1.25 |
+| TPOT p50 | 0.0584s | 0.0520s | 0.89 |
+| TPOT p95 | 0.2359s | 0.2539s | 1.08 |
+| E2E p50 | 73.207s | 82.645s | 1.13 |
+| E2E p95 | 189.240s | 183.727s | 0.97 |
+
+Prefix/cache comparison:
+
+- The unbounded ReplayServe trie estimate for all 200 rows is 270,336 hit
+  tokens / 1,002,154 prompt tokens = 0.2698 token hit ratio.
+- vLLM finite-cache scheduler signal for all 200 rows: first-start `computed:`
+  ratio 0.1392, last-start ratio 0.2126, max-per-request ratio 0.2129.
+- On the same 176 request ids where Frontier emits complete metrics, vLLM
+  first-start ratio is 0.1487, last-start ratio is 0.1926, and max-per-request
+  ratio is 0.1927.
+- Frontier RS9 reports `replayserve_token_hit_ratio=0.1703` and
+  `frontier_block_hit_ratio=0.1700`, again between vLLM first-start and
+  last/max finite-cache scheduler signals.
+
+Missing request ids in RS9:
+
+```text
+70,78,80,86,87,89,96,101,102,105,125,126,131,132,135,144,145,146,147,148,149,150,151,198
+```
+
+Current 200-request judgment:
+
+- Reducing the request count from 500 to 200 substantially reduces TTFT and E2E
+  tails, but `scale=2/3` is still a dense-arrival stress test. vLLM TTFT p95 is
+  still 120.8s.
+- Frontier timing is closer than the old 100-request dummy/profile baselines:
+  TPOT p50/p95 and E2E p50/p95 are broadly aligned.
+- Completion/preemption remains the blocking fidelity issue: Frontier drops 24
+  rows and reports zero preemptions; vLLM completes all 200 and logs 26
+  preemptions across 22 repeated-start request ids.
+- To actually reduce queue pressure for the same first 200 requests, use a
+  timestamp scale greater than 1. The follow-up scale 2 and 3 runs below do
+  this.
+
+## 2026-06-25 200-Request Timestamp Scale 2 and 3
+
+Generated two more first-200 fixtures from `qwen_coder_blksz_16.jsonl`:
+
+| fixture | timestamp scale | last timestamp | max total tokens |
+|---|---:|---:|---:|
+| `coder_200_ts2` | 2.0 | 92.134s | 18,985 |
+| `coder_200_ts3` | 3.0 | 138.201s | 18,985 |
+
+These are the intended lower-arrival-pressure runs. The request payloads are the
+same first 200 rows as `coder_200_ts0667`; only timestamps differ.
+
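A small sketch of the timestamp scaling, assuming the scale is a plain multiplier on trace-relative arrival times. That assumption is consistent with the table above, where the scale 3 last timestamp is 1.5 times the scale 2 one. The function name is illustrative and is not the fixture generator's API.

```python
def scale_timestamps(timestamps, scale):
    """Stretch (scale > 1) or compress (scale < 1) trace-relative arrival times."""
    return [round(t * scale, 3) for t in timestamps]


# Unscaled last arrival implied by the scale 2 row of the table above.
base_last = 92.134 / 2.0
for scale in (2 / 3, 2.0, 3.0):
    print(scale, scale_timestamps([0.0, base_last], scale)[-1])
```

Only arrival times change; prompt and output lengths are untouched, so a larger scale spreads the same work over a longer window.
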
+Frontier RS10:
+
+- Config:
+  `configs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3.json`
+- Run:
+  `runs/rs10_frontier_h20_tp1_profile_full32k_coder200_ts2_ts3`
+- Status: incomplete for both fixtures
+
+vLLM dash1 TP1:
+
+- Runs:
+  `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts2_uncapped`
+  and `runs/vllm_gpu_smoke_20260625_dash1/tp1_coder_200_ts3_uncapped`
+- Runtime: vLLM 0.11.1
+- Host/GPU: dash1, one NVIDIA H20 via `CUDA_VISIBLE_DEVICES=0`
+- vLLM profiled KV capacity: 244,496 tokens = 15,281 blocks at block size 16
+
+Run-level comparison:
+
+| metric | Frontier scale 2 | vLLM scale 2 | Frontier scale 3 | vLLM scale 3 |
+|---|---:|---:|---:|---:|
+| completed requests | 182/200 | 200/200 | 184/200 | 200/200 |
+| preemption events | 0 | 43 | 0 | 16 |
+| TTFT p50 | 8.118s | 9.217s | 0.779s | 1.166s |
+| TTFT p95 | 67.850s | 69.211s | 35.918s | 32.258s |
+| TPOT p50 | 0.0544s | 0.0497s | 0.0544s | 0.0462s |
+| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0714s |
+| E2E p50 | 51.118s | 55.002s | 40.641s | 33.213s |
+| E2E p95 | 162.607s | 142.338s | 158.434s | 122.789s |
+| requests/s | 0.593 | 0.803 | 0.544 | 0.780 |
+| total tok/s | 3846.1 | 4742.5 | 3490.6 | 4608.1 |
+| decode tok/s | 583.1 | 719.0 | 529.2 | 698.6 |
+
+Restricting vLLM to the same request ids where Frontier emits complete metrics:
+
+| metric | Frontier scale 2 182 rows | vLLM same 182 ids | Frontier scale 3 184 rows | vLLM same 184 ids |
+|---|---:|---:|---:|---:|
+| TTFT p50 | 8.118s | 8.574s | 0.779s | 0.945s |
+| TTFT p95 | 67.850s | 68.934s | 35.918s | 32.258s |
+| TPOT p50 | 0.0544s | 0.0501s | 0.0544s | 0.0461s |
+| TPOT p95 | 0.0747s | 0.0686s | 0.0773s | 0.0679s |
+| E2E p50 | 51.118s | 53.263s | 40.641s | 33.213s |
+| E2E p95 | 162.607s | 141.264s | 158.434s | 122.789s |
+
+Prefix/cache comparison:
+
+| metric | scale 2 | scale 3 |
+|---|---:|---:|
+| unbounded trace-side token hit ratio | 0.2698 | 0.2698 |
+| vLLM first-start `computed:` ratio | 0.1433 | 0.1471 |
+| vLLM last-start `computed:` ratio | 0.2382 | 0.1968 |
+| vLLM max-per-request `computed:` ratio | 0.2383 | 0.1998 |
+| Frontier `replayserve_token_hit_ratio` | 0.1448 | 0.1523 |
+| Frontier `frontier_block_hit_ratio` | 0.1446 | 0.1521 |
+
+Current scale 2 and 3 judgment:
+
+- The `scale=2` and `scale=3` runs reduce queueing as intended. vLLM TTFT p95
+  drops from 120.8s at `scale=2/3` to 69.2s at `scale=2` and 32.3s at
+  `scale=3`.
+- `scale=3` is the first run where vLLM p50 TTFT is near 1s. The p95 is still
+  high because long prompts and KV pressure remain, but the severe all-request
+  queueing seen in the 500-request run is much reduced.
+- Frontier timing is now close on TTFT and TPOT for the completed-row subset,
+  especially at `scale=2`. However, Frontier still misses completion/cache rows
+  and still reports zero preemptions.
+- Completion/preemption is therefore still the main Frontier fidelity blocker:
+  `scale=2` misses 18 rows and vLLM logs 43 preemptions; `scale=3` misses 16 rows
+  and vLLM logs 16 preemptions.
+
+## 2026-06-25 Frontier Lifecycle Fix For RS10
+
+The missing-row root cause was Frontier lifecycle handling after decode-phase
+preemption. Missing requests were preempted after prefill/decode had started,
+then left in this inconsistent state:
+
+```text
+preempted=True
+is_prefill_complete=True
+num_processed_tokens=0
+scheduled=False
+completed=False
+```
+
+On the next admission attempt from the waiting queue, the scheduler computed
+`num_new_tokens=0` and removed the request from the queue. The sequential
+simulation therefore drained with no remaining scheduler work even though some
+requests had never completed.
+
+The updated ReplayServe Frontier patch now:
+
+- replays decode-phase preemption by treating already-produced tokens as the
+  next prefill segment and the remaining tokens as decode work;
+- preserves unfinished zero-token waiting requests instead of silently dropping
+  them;
+- reports metrics against user-facing trace prompt/output lengths after runtime
+  token splitting;
+- fails fast if sequential mode drains before all generated requests complete.
+
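The first and last bullets above can be sketched as follows. This is not the Frontier patch itself: the `Request` fields and function names are hypothetical, and the sketch assumes the recomputed prefill segment covers the prompt plus the output tokens generated before the preemption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int      # user-facing prompt length from the trace
    output_tokens: int      # user-facing output length from the trace
    generated: int = 0      # output tokens produced before the preemption
    preempted: bool = False


def replay_split(req: Request) -> tuple[int, int]:
    """Return (prefill_tokens, decode_tokens) to schedule on (re-)admission."""
    if not req.preempted:
        return req.prompt_tokens, req.output_tokens
    # Already-produced tokens are recomputed as prefill; the rest stays decode.
    return req.prompt_tokens + req.generated, req.output_tokens - req.generated


def check_drained(completed: int, expected: int) -> None:
    """Fail fast instead of finishing silently with missing request rows."""
    if completed != expected:
        raise RuntimeError(
            f"simulation drained with {completed}/{expected} requests completed"
        )


print(replay_split(Request(1000, 200, generated=50, preempted=True)))  # (1050, 150)
```

Metrics would still be reported against `prompt_tokens` and `output_tokens`, so the runtime split does not leak into user-facing lengths.
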
+Verification runs:
+
+| run | old completion | fixed completion | Frontier preemptions | prefix token hit ratio | status |
+|---|---:|---:|---:|---:|---|
+| `coder_200_ts2` | 182/200 | 200/200 | 33 | 0.2313 | pass |
+| `coder_200_ts3` | 184/200 | 200/200 | 20 | 0.2177 | pass |
+
+Fixed-run paths:
+
+- `runs/rs10_preemption_replay_fix_ts2/frontier_h20_tp1_profile_full32k/coder_200_ts2/vllm_kv_15281_profile_full32k`
+- `runs/rs10_preemption_replay_fix_ts3/frontier_h20_tp1_profile_full32k/coder_200_ts3/vllm_kv_15281_profile_full32k`
+
+Updated run-level comparison:
+
+| metric | Frontier scale 2 fixed | vLLM scale 2 | Frontier scale 3 fixed | vLLM scale 3 |
+|---|---:|---:|---:|---:|
+| completed requests | 200/200 | 200/200 | 200/200 | 200/200 |
+| preemption events | 33 | 43 | 20 | 16 |
+| TTFT p50 | 9.595s | 9.217s | 1.001s | 1.166s |
+| TTFT p95 | 77.503s | 69.211s | 45.947s | 32.258s |
+| TPOT p50 | 0.0542s | 0.0497s | 0.0534s | 0.0462s |
+| TPOT p95 | 0.0665s | 0.0686s | 0.0686s | 0.0714s |
+| E2E p50 | 61.458s | 55.002s | 44.761s | 33.213s |
+| E2E p95 | 174.484s | 142.338s | 154.548s | 122.789s |
+| requests/s | 0.594 | 0.803 | 0.574 | 0.780 |
+| total tok/s | 3506.3 | 4742.5 | 3390.0 | 4608.1 |
+| decode tok/s | 531.6 | 719.0 | 513.9 | 698.6 |
+
+Current judgment after the fix:
+
+- The completion/preemption lifecycle blocker for RS10 is fixed: both scale 2
+  and scale 3 now emit 200 request rows and complete postprocessing.
+- Frontier preemption counts are now of the same order of magnitude as vLLM's,
+  but not identical: scale 2 is 33 vs 43 events, scale 3 is 20 vs 16 events.
+- Prefix hit ratio changed materially because preempted requests now replay and
+  re-enter prefix-cache admission instead of disappearing. It is no longer valid
+  to compare the old incomplete RS10 prefix ratios against vLLM.
+- TPOT remains close, but Frontier is still slower in aggregate throughput, at
+  about 0.74x of vLLM total/decode token throughput for both scale 2 and
+  scale 3. Frontier's TTFT and E2E tails also remain longer than vLLM's now
+  that all 200 requests complete.
+- Remaining gap is no longer "missing metrics rows"; it is scheduler/preemption
+  fidelity plus CPU/scheduler/CUDA-graph timing calibration.
+
+## 2026-06-25 H20 TP2/TP4 Comparison
+
+The TP2/TP4 comparison uses the same first-200 `coder_200_ts2` and
+`coder_200_ts3` fixtures. The vLLM runs are on dash1 with
+`/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`, vLLM 0.11.1,
+`max_model_len=32768`, `max_num_seqs=64`,
+`max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
+prefix caching on, and chunked prefill on.
+
+vLLM measured KV capacity:
+
+| TP | KV tokens | KV blocks |
+|---:|---:|---:|
+| 2 | 1,104,880 | 69,055 |
+| 4 | 2,833,232 | 177,077 |
+
+Frontier RS12 uses explicit matching KV blocks and fresh H20 TP2/TP4 profiles:
+
+- Config:
+  `configs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3.json`
+- Run:
+  `runs/rs12_frontier_h20_tp2_tp4_profile_full32k_coder200_ts2_ts3`
+- Profile source:
+  `dash1:/home/admin/cpfs/wjh/replayserve_frontier_profiles/h20_tp2_tp4_qwen3_30ba3b_full32k_20260625_true_mixed`
+- Linear/MoE profiles cover TP2/TP4 up to 32768 tokens.
+- Attention profile covers TP2/TP4 standard attention plus 1260 true-mixed
+  prefill+decode rows. The true-mixed rows are required; standard attention
+  alone fails with missing `attn_decode_in_mixed` predictions.
+
+All four Frontier runs completed 200/200 request rows. Neither Frontier nor the
+vLLM TP2/TP4 logs reported preemption events. Prefix token hit ratio is exactly
+the same in Frontier postprocess and vLLM's trace-side synthetic estimate:
+0.2697549478.
+
+Run-level comparison:
+
+| TP | fixture | metric | Frontier | vLLM | Frontier / vLLM |
+|---:|---|---|---:|---:|---:|
+| 2 | `coder_200_ts2` | requests/s | 0.776 | 1.278 | 0.61 |
+| 2 | `coder_200_ts2` | total tok/s | 4581 | 7547 | 0.61 |
+| 2 | `coder_200_ts2` | decode tok/s | 695 | 1144 | 0.61 |
+| 2 | `coder_200_ts2` | TTFT p50/p95 | 0.269/6.745s | 0.225/0.715s | 1.20/9.43 |
+| 2 | `coder_200_ts2` | TPOT p50/p95 | 0.0430/0.0529s | 0.0300/0.0434s | 1.43/1.22 |
+| 2 | `coder_200_ts2` | E2E p50/p95 | 26.05/106.76s | 16.45/72.53s | 1.58/1.47 |
+| 4 | `coder_200_ts2` | requests/s | 0.853 | 1.536 | 0.55 |
+| 4 | `coder_200_ts2` | total tok/s | 5035 | 9073 | 0.55 |
+| 4 | `coder_200_ts2` | decode tok/s | 763 | 1376 | 0.55 |
+| 4 | `coder_200_ts2` | TTFT p50/p95 | 0.098/0.386s | 0.170/1.420s | 0.57/0.27 |
+| 4 | `coder_200_ts2` | TPOT p50/p95 | 0.0337/0.0384s | 0.0163/0.0283s | 2.06/1.36 |
+| 4 | `coder_200_ts2` | E2E p50/p95 | 18.65/84.94s | 9.26/43.62s | 2.01/1.95 |
+| 2 | `coder_200_ts3` | requests/s | 0.688 | 1.088 | 0.63 |
+| 2 | `coder_200_ts3` | total tok/s | 4062 | 6426 | 0.63 |
+| 2 | `coder_200_ts3` | decode tok/s | 616 | 974 | 0.63 |
+| 2 | `coder_200_ts3` | TTFT p50/p95 | 0.134/0.574s | 0.154/0.627s | 0.87/0.92 |
+| 2 | `coder_200_ts3` | TPOT p50/p95 | 0.0394/0.0467s | 0.0191/0.0280s | 2.07/1.67 |
+| 2 | `coder_200_ts3` | E2E p50/p95 | 21.79/101.59s | 9.96/53.98s | 2.19/1.88 |
+| 4 | `coder_200_ts3` | requests/s | 0.737 | 1.254 | 0.59 |
+| 4 | `coder_200_ts3` | total tok/s | 4355 | 7403 | 0.59 |
+| 4 | `coder_200_ts3` | decode tok/s | 660 | 1122 | 0.59 |
+| 4 | `coder_200_ts3` | TTFT p50/p95 | 0.089/0.346s | 0.100/0.318s | 0.89/1.09 |
+| 4 | `coder_200_ts3` | TPOT p50/p95 | 0.0311/0.0358s | 0.0094/0.0128s | 3.30/2.80 |
+| 4 | `coder_200_ts3` | E2E p50/p95 | 16.90/83.01s | 5.55/27.87s | 3.05/2.98 |
+
+TP scaling comparison:
+
+| fixture | metric | Frontier TP4 / TP2 | vLLM TP4 / TP2 |
+|---|---|---:|---:|
+| `coder_200_ts2` | total tok/s speedup | 1.10 | 1.20 |
+| `coder_200_ts2` | decode tok/s speedup | 1.10 | 1.20 |
+| `coder_200_ts2` | TPOT p50 ratio (lower is better) | 0.78 | 0.54 |
+| `coder_200_ts3` | total tok/s speedup | 1.07 | 1.15 |
+| `coder_200_ts3` | decode tok/s speedup | 1.07 | 1.15 |
+| `coder_200_ts3` | TPOT p50 ratio (lower is better) | 0.79 | 0.49 |
+
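The TPOT p50 rows in the table above are TP4 / TP2 ratios, so a lower value means a larger improvement. A short sketch of the conversion from the run-level TPOT p50 values to a ratio and a percentage improvement (the function name is illustrative):

```python
def tpot_improvement(tp2_tpot_s: float, tp4_tpot_s: float) -> tuple[float, int]:
    """Return (TP4 / TP2 ratio, percent reduction in TPOT going from TP2 to TP4)."""
    ratio = tp4_tpot_s / tp2_tpot_s
    return round(ratio, 2), round((1 - ratio) * 100)


# TPOT p50 values for coder_200_ts2, taken from the run-level comparison table.
print("Frontier:", tpot_improvement(0.0430, 0.0337))  # (0.78, 22)
print("vLLM:    ", tpot_improvement(0.0300, 0.0163))  # (0.54, 46)
```
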
+Current TP2/TP4 judgment:
+
+- Functional replay is aligned for this setting: same request rows, same
+  trace-side prefix reuse ratio, matched vLLM KV block counts, and no
+  preemption on either side.
+- Absolute performance is not aligned. Frontier reports only 55-63% of vLLM
+  total/decode throughput across TP2/TP4, and TPOT is especially pessimistic at
+  TP4.
+- Relative TP scaling is also underestimated. vLLM's TP4 run lowers TPOT p50
+  by about 46-51% relative to TP2, while Frontier's lowers it by only about
+  21-22%.
+- The remaining gap is therefore not caused by missing rows, prefix-cache
+  mismatch, or KV capacity mismatch in these runs. It points to timing model
+  limitations: missing CPU/scheduler/CUDA-graph modeling, random-forest profile
+  interpolation error, and imperfect modeling of vLLM's TP-dependent decode
+  execution path.
+- These RS12 results are acceptable for continuing ReplayServe integration and
+  rough qualitative trends. They are not yet acceptable as calibrated absolute
+  performance predictions.
--- a/docs/rs4_vllm_gpu_smoke.md
+++ b/docs/rs4_vllm_gpu_smoke.md
@@ -0,0 +1,138 @@
+# RS4 vLLM GPU Smoke
+
+RS4 starts a real serving baseline for ReplayServe. This is separate from the
+Frontier dummy/patched simulator smoke: it checks whether the Qwen block-hash
+trace can drive a real vLLM engine with the intended arrival, prompt length,
+decode length, and prefix reuse patterns.
+
+## Setup
+
+- Host: `dash2`
+- GPU: NVIDIA H20
+- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
+- Runtime: Python 3.12.3, vLLM 0.11.1
+- Fixture: `traces/fixtures/coder_100`
+- Runner: `tools/vllm_synthetic_replay.py`
+- Replay mode: online, trace-relative timestamps preserved
+- Prompt mode: `prompt_token_ids`, generated synthetically from trace block
+  hashes
+- Common vLLM knobs: `max_model_len=32768`, `block_size=16`,
+  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
+  prefix caching on, chunked prefill on
+
+The Qwen trace does not expose original token IDs or text. The runner maps each
+block hash deterministically to one stable synthetic token block. Equal block
+hashes therefore produce equal token blocks, preserving arrival, length, and
+block-prefix sharing patterns, but not original text semantics.
+
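One way such a deterministic mapping could look is sketched below. This is not the code in `tools/vllm_synthetic_replay.py`: the SHA-256 derivation, the `VOCAB_SIZE` bound, and both function names are assumptions made for illustration, as is trimming the final block to the recorded input length.

```python
import hashlib

BLOCK_SIZE = 16       # tokens per block, matching the trace block size
VOCAB_SIZE = 32_000   # hypothetical bound; must stay below the model's vocabulary


def block_tokens(block_hash: int, n_tokens: int = BLOCK_SIZE) -> list[int]:
    """Derive a stable pseudo-random token block from a single block hash."""
    tokens = []
    for i in range(n_tokens):
        digest = hashlib.sha256(f"{block_hash}:{i}".encode()).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % VOCAB_SIZE)
    return tokens


def prompt_token_ids(hash_ids: list[int], input_length: int) -> list[int]:
    """Concatenate per-block tokens and trim a padded final block."""
    ids = [tok for h in hash_ids for tok in block_tokens(h)]
    return ids[:input_length]


a = prompt_token_ids([1, 2, 3], input_length=40)
b = prompt_token_ids([1, 2, 9], input_length=48)
print(a[:32] == b[:32])  # the two shared leading blocks give identical tokens
```

Because each block depends only on its own hash, two requests that share a hash prefix share the same token prefix, which is what the prefix cache keys on.
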
+## Runs
+
+The first smoke used single-request runs for engine bring-up, 32-request capped
+runs for prefix-cache validation, 32-request uncapped runs for a first
+real-output baseline, and full `coder_100` uncapped runs for the first useful
+TP=1/2 comparison.
+
+| run | TP | rows | prompt toks | gen toks | wall s | RPS | prompt tok/s | gen tok/s | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
+| `tp1_limit1` | 1 | 1 | 1008 | 4 | 1.861 | 0.537 | 541.5 | 2.1 | 1.255/1.255 | 0.007/0.007 | 1.274/1.274 |
+| `tp2_limit1` | 2 | 1 | 1008 | 4 | 2.269 | 0.441 | 444.3 | 1.8 | 1.317/1.317 | 0.008/0.008 | 1.340/1.340 |
+| `tp1_limit32_o8` | 1 | 32 | 120813 | 253 | 11.244 | 2.846 | 10744.4 | 22.5 | 3.974/5.051 | 0.387/1.081 | 7.157/9.817 |
+| `tp2_limit32_o8` | 2 | 32 | 120813 | 253 | 9.071 | 3.528 | 13318.2 | 27.9 | 1.881/3.324 | 0.285/0.727 | 4.368/7.043 |
+| `tp1_limit32_uncapped` | 1 | 32 | 120813 | 22209 | 41.874 | 0.764 | 2885.1 | 530.4 | 1.276/1.842 | 0.024/0.102 | 14.366/29.523 |
+| `tp2_limit32_uncapped` | 2 | 32 | 120813 | 22209 | 33.588 | 0.953 | 3596.9 | 661.2 | 0.961/1.700 | 0.017/0.071 | 10.786/21.570 |
+| `tp1_coder100_uncapped` | 1 | 100 | 474554 | 82479 | 145.351 | 0.688 | 3264.9 | 567.4 | 4.503/29.060 | 0.066/0.621 | 41.841/97.366 |
+| `tp2_coder100_uncapped` | 2 | 100 | 474554 | 82479 | 102.001 | 0.980 | 4652.5 | 808.6 | 1.951/10.355 | 0.049/0.262 | 25.678/61.971 |
+
+Artifacts were copied back from dash2 to:
+
+```text
+runs/vllm_gpu_smoke_20260624/
+```
+
+That directory is ignored by git. Each run contains `summary.json` and
+`request_metrics.csv`; the 32-request runs also keep `stdout.log`.
+
+## KV Capacity
+
+vLLM estimated KV capacity from actual H20 memory profiling:
+
+| TP | weights memory | available KV memory | GPU KV cache size | max concurrency at 32768 tokens/request |
+|---:|---:|---:|---:|---:|
+| 1 | 56.93 GiB | 22.39 GiB | 244,512 tokens | 7.46x |
+| 2 | 28.50 GiB/rank | 50.58 GiB/rank | 1,104,880 tokens | 33.72x |
+
+This satisfies the RS4 requirement that KV capacity comes from the real GPU
+memory planner rather than a manually fixed block count.
+
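The last column of the table follows directly from the reported cache size. A minimal sketch of that arithmetic, using the `block_size=16` and `max_model_len=32768` settings listed under Setup (the function name is illustrative and not part of the runner):

```python
def kv_summary(kv_tokens: int, block_size: int = 16, max_model_len: int = 32768):
    """Return (whole KV blocks, max concurrency at max_model_len tokens/request)."""
    blocks = kv_tokens // block_size
    concurrency = kv_tokens / max_model_len
    return blocks, round(concurrency, 2)


print(kv_summary(244_512))    # TP=1 -> (15282, 7.46)
print(kv_summary(1_104_880))  # TP=2 -> (69055, 33.72)
```
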
+## Prefix-Cache Check
+
+For the first 32 coder requests, ReplayServe estimated:
+
+- query blocks: 7,564
+- hit blocks: 1,786
+- block hit ratio: 0.236118456
+- query tokens: 120,813
+- hit tokens: 28,576
+- token hit ratio: 0.236530837
+
+The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request
+starts and `computed:` token sums of 28,576 in both capped and uncapped runs.
+The largest single hit was 11,552 tokens. Examples include:
+
+```text
+Request 16 started running, prompt: 12296, computed: 11552
+Request 26 started running, prompt: 5836, computed: 4336
+Request 30 started running, prompt: 11017, computed: 10768
+```
+
+So this smoke validates the core ReplayServe invariant: identical Qwen block
+hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache
+actually reuses them.
+
+For full `coder_100`, ReplayServe estimated:
+
+- query blocks: 29,705
+- hit blocks: 7,447
+- block hit ratio: 0.250698536
+- query tokens: 474,554
+- hit tokens: 119,152
+- token hit ratio: 0.251082069
+
+The TP=2 full `coder_100` run had no preemptions. Its vLLM `computed:` sum was
+119,152, matching the trace-side estimate exactly. The TP=1 run had 8
+preemptions across repeated starts for requests 70, 71, 72, 77, and 94. In that
+case, raw `computed:` sums are not a simple prefix-hit ratio, because each
+restart logs an additional `computed:` value for the same request:
+
+| run | starts | unique requests | preemptions | all-start computed | first-start computed | last-start computed | max/request computed | estimated hit tokens |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| `tp1_coder100_uncapped` | 108 | 100 | 8 | 180896 | 108560 | 141744 | 141984 | 119152 |
+| `tp2_coder100_uncapped` | 100 | 100 | 0 | 119152 | 119152 | 119152 | 119152 | 119152 |
+
+Use `tools/analyze_vllm_prefix_log.py` to reproduce this parsing.
+
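For orientation, the aggregation behind the table columns can be sketched as below. This is not the contents of `tools/analyze_vllm_prefix_log.py`: the regular expression is inferred from the three log lines quoted earlier, the function name is hypothetical, and the fourth sample line is a made-up restart added only to exercise the repeated-start path.

```python
import re
from collections import defaultdict

START_RE = re.compile(
    r"Request (?P<rid>\S+) started running, "
    r"prompt: (?P<prompt>\d+), computed: (?P<computed>\d+)"
)


def summarize_starts(log_lines):
    """Aggregate `computed:` tokens per request across repeated starts."""
    per_request = defaultdict(list)  # request id -> computed tokens, in start order
    for line in log_lines:
        m = START_RE.search(line)
        if m:
            per_request[m["rid"]].append(int(m["computed"]))
    starts = sum(len(v) for v in per_request.values())
    return {
        "starts": starts,
        "unique_requests": len(per_request),
        "repeated_starts": starts - len(per_request),
        "all_start_computed": sum(sum(v) for v in per_request.values()),
        "first_start_computed": sum(v[0] for v in per_request.values()),
        "last_start_computed": sum(v[-1] for v in per_request.values()),
        "max_per_request_computed": sum(max(v) for v in per_request.values()),
    }


lines = [
    "Request 16 started running, prompt: 12296, computed: 11552",
    "Request 26 started running, prompt: 5836, computed: 4336",
    "Request 30 started running, prompt: 11017, computed: 10768",
    "Request 26 started running, prompt: 5836, computed: 5824",  # hypothetical restart
]
print(summarize_starts(lines))
```

In the table above, the preemptions column equals starts minus unique requests, which this sketch reports as `repeated_starts`.
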
+## Reliability Boundary
+
+These numbers are useful for mechanism validation and for seeding simulator
+calibration. They are not final serving throughput claims because:
+
+- Some bring-up runs capped decode length to 4 or 8 tokens.
+- The largest real-output baseline so far is `coder_100`, not `coder_2000` or
+  the full coder trace.
+- Synthetic token IDs preserve block identity and length but not original text
+  distribution.
+- Prefix reuse in `request_metrics.csv` is a trace-side estimate. For real
+  scheduler hit/miss behavior, use vLLM `stdout.log` `computed:` fields and
+  account for preemption/re-admission.
+- This run uses H20 and `Qwen3-30B-A3B`, while the earlier Frontier smoke used
+  dummy A800/Qwen3-32B plumbing. They should be compared as calibration inputs,
+  not as one-to-one simulator accuracy evidence yet.
+
+## Next
+
+- Move to `coder_2000` once runtime and queueing cost are acceptable.
+- Add the vLLM log parser output into the run aggregation summary.
+- Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after
+  selecting a matched model/hardware/profile policy.
+
+See `docs/rs4_frontier_h20_tp1_alignment.md` for the first Frontier H20 TP1
+alignment run against real vLLM TP1.
--- a/docs/sources.md
+++ b/docs/sources.md
@@ -0,0 +1,45 @@
+# Sources
+
+Checked on 2026-06-24.
+
+## Local Repositories
+
+| Source | Local path | Commit / HEAD | Notes |
+|---|---|---|---|
+| Qwen Bailian usage traces | `/home/gahow/phd/qwen-bailian-usagetraces-anon` | `5f7439c51ec248a0c585f7d90a41a6f57773b912` | Primary RS0 input is `qwen_coder_blksz_16.jsonl`. |
+| Frontier | `/tmp/toc-llm-sim-research/Frontier` | `d9cfeb6d8791fbf2f295dd9744c56a666171776e` | Primary RS1 simulator candidate. |
+| Vidur | `/tmp/toc-llm-sim-research/vidur` | `8383d2935bc62723a212090baa9f98ada206fc14` | Baseline simulator candidate for arrival and length replay. |
+| AIConfigurator | `/tmp/toc-llm-sim-research/aiconfigurator` | `e46ece7510e727fafefb8212e5846172145a30ea` | Configuration search reference, not per-request faithful replay. |
+
+All four local repositories were present when RS0 was generated. No external
+repository was cloned for RS0.
+
+## Frontier Findings
+
+- Frontier trace replay reads CSV columns `arrived_at`, `num_prefill_tokens`,
+  and `num_decode_tokens`.
+- It also parses optional `session_id` and `block_hash_ids`; `block_hash_ids`
+  can be `|` separated, matching `examples/fixtures/prefix_cache_shared_session_trace.csv`.
+- Frontier's trace replay generator can clip prefill tokens when total tokens
+  exceed `trace_request_generator_config_max_tokens`. ReplayServe fixtures hard
+  fail before Frontier sees the trace, so the RS1 smoke cannot silently clip.
+- Frontier has a built-in `Qwen/Qwen3-32B` model config.
+- Frontier has A800 network profiles:
+  `data/profiling/network/a800_dgx/` and
+  `data/profiling/network/a800_pairwise_nvlink/`.
+- Current public A800 compute profiles in this checkout include Llama2-7B and
+  Qwen3 MoE / Qwen3-Next reduced variants, but no dense `Qwen/Qwen3-32B`
+  compute profile. RS1 Qwen3-32B A800 latency and throughput results are only
+  plumbing smoke until matching compute profiles or calibration data are added.
+
+## Qwen Trace Findings
+
+- The released JSONL rows contain `chat_id`, `parent_chat_id`, `timestamp`,
+  `input_length`, `output_length`, `type`, `turn`, and `hash_ids`.
+- The trace README documents `hash_ids` as salted SipHash blocks with 16 tokens
+  per block.
+- The released input lengths and hashes are already after the model-specific
+  chat template has been applied. ReplayServe does not apply chat templates.
+- The final input block can be padded. ReplayServe records per-block token
+  counts in the sidecar so partial final blocks can be accounted for by true
+  token count.
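
Putting the Frontier and Qwen findings together, a row conversion could look like the sketch below. It is not the ReplayServe fixture generator: the function name and the sample row values are hypothetical, it assumes `hash_ids` is a list of integers and that `timestamp` is already in the units Frontier expects for `arrived_at`, and it leaves the optional `session_id` column out because the mapping from `chat_id` / `parent_chat_id` is not described here.

```python
import csv
import io
import json

FRONTIER_COLUMNS = [
    "arrived_at",
    "num_prefill_tokens",
    "num_decode_tokens",
    "block_hash_ids",
]


def to_frontier_row(qwen_row: dict) -> dict:
    """Map one Qwen trace JSONL row onto the Frontier trace CSV columns."""
    return {
        "arrived_at": qwen_row["timestamp"],
        "num_prefill_tokens": qwen_row["input_length"],
        "num_decode_tokens": qwen_row["output_length"],
        "block_hash_ids": "|".join(str(h) for h in qwen_row["hash_ids"]),
    }


# Hypothetical sample row using the field names listed above.
sample = json.loads(
    '{"chat_id": 1, "parent_chat_id": -1, "timestamp": 1.5, "input_length": 40,'
    ' "output_length": 7, "type": "text", "turn": 1, "hash_ids": [11, 12, 13]}'
)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FRONTIER_COLUMNS)
writer.writeheader()
writer.writerow(to_frontier_row(sample))
print(buf.getvalue())
```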