SGLang PP and HiCache Timing Analysis

Posted on 2026-04-29 | Category: AI Infra

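Overview

In SGLang's Pipeline Parallelism (PP) mode, HiCache handles asynchronous prefetch (Host→GPU) and backup (GPU→Host) of the KV cache. Starting from the outer event loop of the PP scheduler, this post traces the full timing of Load and Write and explains the root cause of write_ack lagging further behind than load_ack.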

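1. PP Event Loop

Each iteration performs the following 4 steps:

iter=N:
  ① check_hicache_events()      ← poll HiCache async events (load_ack / write_ack)
  ② get_next_batch_to_run()     ← pick the next batch (prefix match → eviction → load)
  ③ _pp_launch_batch()          ← launch forward
  ④ _pp_process_batch_result()  ← process the result of a previously launched batch (insert → write)

Core design: process_batch_result handles mbs[next_mb_id], a batch launched in a previous round rather than the one just launched. Within a single iter, the scheduler is therefore working on the lifecycle stages of two different batches at once.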
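The four-step loop can be pictured with a toy model. This is a minimal sketch, not SGLang code: `run_pp_loop`, the `offset` parameter, and the batch names are all hypothetical, and `offset` simply stands for how many iters pass between launching a batch and post-processing it.

```python
from collections import deque


def run_pp_loop(num_iters, offset=1):
    """Toy model of the four-step loop; returns a log of (iter, step, batch).

    `offset` is an assumed parameter: the number of iters between launching
    a batch and handing it to process_batch_result.
    """
    log = []
    in_flight = deque()  # (launch_iter, batch) pairs awaiting post-processing
    for it in range(num_iters):
        # Step 1: poll asynchronous HiCache events (nothing to do in this toy).
        log.append((it, "check_hicache_events", None))
        # Step 2: pick the next batch.
        batch = f"batch{it}"
        log.append((it, "get_next_batch_to_run", batch))
        # Step 3: launch forward for the batch just picked.
        log.append((it, "launch", batch))
        in_flight.append((it, batch))
        # Step 4: post-process a batch launched `offset` iters earlier, if any.
        if in_flight and in_flight[0][0] <= it - offset:
            _, old_batch = in_flight.popleft()
            log.append((it, "process_batch_result", old_batch))
    return log


if __name__ == "__main__":
    for entry in run_pp_loop(3, offset=1):
        print(entry)
```

With `offset=1`, iter 1 both launches `batch1` and post-processes `batch0`, which is the "two batches per iter" behaviour described above.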

2. Load Timing (Host→GPU)

Trigger

get_next_batch_to_run in iter=N: during prefill, prefix match finds a host_hit and issues load_back.

Full sequence

Step	Occurs in	Description
1. load issued	iter=N `get_next_batch_to_run`	match_prefix → host_hit → load_back → start_loading
2. CUDA copy starts	iter=N `get_next_batch_to_run`	the GPU asynchronously pulls the KV cache from Host; a CUDA event is enqueued
3. forward waits per layer	iter=N `_pp_launch_batch`	forward uses consumer_index to wait, layer by layer, for that layer's load to finish
4. load_ack consumed	iter≥N+1 `check_hicache_events`	loading_check → event.query()=True → ack consumed

Key point: load is a prefetch. It is triggered before forward, so the CUDA copy and forward can overlap (wait per layer, execute per layer).
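The per-layer wait can be illustrated with a CPU-only sketch. It makes several assumptions: `threading.Event` stands in for per-layer CUDA events, and `LayerLoadTracker`, `mark_loaded`, `wait_layer`, and `simulate_overlap` are hypothetical names, not SGLang APIs. The point is only that layer k can start computing as soon as layer k has been loaded, without waiting for the remaining layers.

```python
import threading


class LayerLoadTracker:
    """One completion flag per layer (stand-in for per-layer CUDA events)."""

    def __init__(self, num_layers):
        self.events = [threading.Event() for _ in range(num_layers)]

    def mark_loaded(self, layer):
        self.events[layer].set()

    def wait_layer(self, layer, timeout=None):
        return self.events[layer].wait(timeout)


def simulate_overlap(num_layers=4):
    """Run a loader and a forward pass concurrently; return the event order."""
    tracker = LayerLoadTracker(num_layers)
    order = []
    lock = threading.Lock()

    def loader():
        for layer in range(num_layers):
            with lock:
                order.append(("load", layer))
            tracker.mark_loaded(layer)  # signal only after the load is recorded

    def forward():
        for layer in range(num_layers):
            tracker.wait_layer(layer)  # block until this layer's KV is present
            with lock:
                order.append(("compute", layer))

    threads = [threading.Thread(target=forward), threading.Thread(target=loader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


if __name__ == "__main__":
    print(simulate_overlap())
```

The interleaving of `load` and `compute` entries varies from run to run, but each layer's `load` always precedes its `compute`.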

Timing diagram

iter=N:
  ① check_hicache_events()
     └─ consume earlier load_acks
  ② get_next_batch_to_run()
     └─ match_prefix → host_hit → load_back() → start_loading()
     └─ CUDA copy starts, event enqueued
  ③ _pp_launch_batch()
     └─ forward waits per layer for the load to finish

iter=N+1:
  ① check_hicache_events()
     └─ loading_check() → event.query()=True → load_ack ✓

Latency: load_ack is consumed at least 1 iter after the load is issued (exactly 1 iter when the copy has finished by the next check).
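The polling step can be sketched as a non-blocking check over a FIFO of pending acks. This is a toy version under stated assumptions: `FakeClock` and `FakeEvent` are hypothetical stand-ins for CUDA events (in real code `torch.cuda.Event.query()` plays the role of `query()`), the queue layout is invented for illustration, and consuming strictly from the head of the queue is assumed rather than taken from the source.

```python
from collections import deque


class FakeClock:
    """Shared notion of time, advanced manually by the caller."""

    def __init__(self):
        self.now = 0


class FakeEvent:
    """Stand-in for a CUDA event that completes at time `ready_at`."""

    def __init__(self, clock, ready_at):
        self.clock = clock
        self.ready_at = ready_at

    def query(self):
        # Non-blocking: True once the (simulated) copy has finished.
        return self.clock.now >= self.ready_at


def loading_check(ack_queue):
    """Pop acks from the head while their events are done; never block."""
    finished = []
    while ack_queue and ack_queue[0][0].query():
        _, node_ids = ack_queue.popleft()
        finished.extend(node_ids)
    return finished


if __name__ == "__main__":
    clock = FakeClock()
    queue = deque([(FakeEvent(clock, 1), [10, 11]), (FakeEvent(clock, 3), [12])])
    for _ in range(4):
        print(clock.now, loading_check(queue))
        clock.now += 1
```

Because the check only runs once per iter, a copy that finishes right after a check still waits for the next iter before its ack is consumed.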

3. Write Timing (GPU→Host)

Trigger

process_batch_result in iter=N: it handles the result of a batch launched in the previous round, and insert triggers write_backup.

Full sequence

Step	Occurs in	Description
1. forward runs	iter=N-1 `_pp_launch_batch`	the Prefill batch is computed on the GPU, producing the KV cache
2. write issued	iter=N `process_batch_result`	handle the iter=N-1 batch → insert → write_backup → start_writing
3. CUDA copy starts	iter=N `process_batch_result`	the GPU asynchronously writes back to Host; a CUDA event is enqueued
4. write_ack consumed	iter≥N+1 `check_hicache_events`	writing_check → event.query()=True → ack consumed

Key point: write is a post-write. It can only be triggered after forward has finished and the complete KV cache is available.

Timing diagram

iter=N-1:
  ② get_next_batch_to_run() → picks Prefill A
  ③ _pp_launch_batch() → forward(Prefill A)

iter=N:
  ④ _pp_process_batch_result()
     └─ handles Prefill A from iter=N-1
     └─ insert() → write_backup() → start_writing()
     └─ CUDA copy starts, event enqueued

iter=N+1:
  ① check_hicache_events()
     └─ writing_check() → event.query()=True → write_ack ✓

Latency: write_ack is consumed at least 1 iter after the write is issued, and the write itself is issued later than forward because process_batch_result handles an earlier batch (1 iter as drawn here; Section 4 uses the PP2 case, where the offset is 2 iters).

4. Latency Comparison

Using a single Prefill batch as the reference

Tracing one Prefill batch through its full lifecycle, from launch to write_ack:

iter=1:
  ② get_next_batch_to_run()
     └─ picks Prefill A
     └─ match_prefix → host_hit → load_back() → start_loading()
  ③ _pp_launch_batch() → forward(Prefill A)

iter=2:
  ① check_hicache_events()
     └─ loading_check() → load_ack ✓
        ↑ load_ack: 1 iter delay

iter=3:
  ① check_hicache_events()
     └─ (no write_ack yet: the write has not been issued)
  ④ process_batch_result()
     └─ handles Prefill A from iter=1
     └─ insert() → write_backup() → start_writing()

iter=4:
  ① check_hicache_events()
     └─ writing_check() → write_ack ✓
        ↑ write_ack: 3 iter delay from Prefill launch

Comparison table

Event	Triggered in	ack enqueued	ack consumed	Delay relative to Prefill launch
load	iter=1 `get_next_batch_to_run`	iter=1	iter=2	1 iter
write	iter=3 `process_batch_result` (handling the iter=1 batch)	iter=3	iter=4	3 iter

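Why does write_ack lag further?

Two factors combine:

1. process_batch_result lags behind launch

It handles mbs[next_mb_id], a batch launched in an earlier round; in the PP2 example the offset is 2 iters. So the write is triggered 2 iters later than the load.

2. The CUDA event query has to wait for the next check

For both load and write, once the ack is enqueued it can be consumed no earlier than the next check_hicache_events. Each path pays 1 iter for this.

Combined:

Load:  iter=1 issued → iter=1 ack enqueued → iter=2 consumed
Write: iter=1 forward → iter=3 process_result issues write → iter=3 ack enqueued → iter=4 consumed

Load is a prefetch (before forward); Write is a post-write (after forward). This architectural difference means Write necessarily lags further than Load.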
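The arithmetic behind the two factors can be written down directly. This is a small sketch with hypothetical names (`ack_delays`, `process_offset`, `poll_delay`); it assumes the copy always finishes before the next check, so each ack pays exactly one polling iter.

```python
def ack_delays(process_offset, poll_delay=1):
    """Return (load_ack_delay, write_ack_delay) in iters after the launch iter.

    process_offset: iters between launching a batch and process_batch_result
                    handling it (2 in the PP2 trace above).
    poll_delay:     iters between an ack being enqueued and the next
                    check_hicache_events that can consume it.
    """
    load_issue = 0                # load is issued in the launch iter itself
    write_issue = process_offset  # write is issued when the result is processed
    return load_issue + poll_delay, write_issue + poll_delay


if __name__ == "__main__":
    load_delay, write_delay = ack_delays(process_offset=2)
    print(f"load_ack: +{load_delay} iter, write_ack: +{write_delay} iter")
```

With `process_offset=2` this reproduces the 1 iter versus 3 iter figures in the comparison table; the gap between the two acks always equals `process_offset`.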

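5. Synchronization Across PP Stages

PP0 and PP1 share the same HiCache instance (radix tree + host memory). Because of output relay latency, the iter progress of PP0 and PP1 is offset by 1-2 iters, so three layers of synchronization are needed to keep them consistent.

5.1 Logical clock for replay consistency

PP1's writing_check/loading_check no longer consume acks directly; instead they replay the events to PP0 over a Gloo P2P channel. PP0, as the sole event consumer, processes all acks in logical-clock order, which keeps the radix tree operations on PP0 and PP1 in the same order.

Problem: in an earlier implementation, PP1's writing_check() bypassed the check_hicache_events guard and consumed write_ack directly (ack theft), so the pending event on PP0 could never complete and the radix trees diverged.

Fix: move the PP rank branch inside writing_check/loading_check and replace the dict wrapper with a PPHiCacheEventsReq control request, forcing PP1 to replay events to PP0.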
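Processing events in logical-clock order can be sketched with a generic reorder buffer. The post does not describe how the clock is implemented, so everything here is an assumption: `OrderedReplayer`, `submit`, and the string event labels are hypothetical, and the clock is taken to be a gap-free sequence number starting at 0.

```python
import heapq


class OrderedReplayer:
    """Single consumer that applies events strictly in logical-clock order."""

    def __init__(self):
        self.next_clock = 0  # next sequence number allowed to be applied
        self.pending = []    # min-heap of (clock, event) that arrived early

    def submit(self, clock, event):
        """Buffer an event; return the events that became applicable, in order."""
        heapq.heappush(self.pending, (clock, event))
        applied = []
        # Release events only while the head of the heap is the expected clock.
        while self.pending and self.pending[0][0] == self.next_clock:
            _, ready = heapq.heappop(self.pending)
            applied.append(ready)
            self.next_clock += 1
        return applied


if __name__ == "__main__":
    replayer = OrderedReplayer()
    print(replayer.submit(1, "write_ack:B"))  # arrives early, held back
    print(replayer.submit(0, "load_ack:A"))   # unblocks both, in clock order
```

An event that arrives ahead of its turn is held until every earlier clock value has been applied, so the consumer sees the same sequence regardless of arrival order.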

5.2 Count sync for CP consistency

PP0 accumulates write acks 1-2 iters earlier than PP1 (output relay latency), so the contents of ack_write_queue differ between the two. PR #22878 synchronizes the PP ranks by piggybacking write-ack consumption counts:

iter=N:
  PP0: writing_check() → consumes 3 write_acks → count=3
  PP1: waits for PP0's count → count=3 → radix tree operations aligned

Count sync makes PP0 and PP1 apply the same checkpoint (CP) operations to each radix tree node, avoiding the case where one stage considers a node backed up while the other is still waiting.
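The count-sync idea can be sketched as a leader that decides how many acks to consume and a follower that consumes exactly that many. This is an illustrative model only: `leader_writing_check` and `follower_writing_check` are hypothetical names, the transport that carries the count between ranks is omitted, and the sketch does not reproduce the code in PR #22878.

```python
from collections import deque


def leader_writing_check(ack_queue, is_done):
    """PP0 side: consume every completed ack at the head; return how many."""
    count = 0
    while ack_queue and is_done(ack_queue[0]):
        ack_queue.popleft()
        count += 1
    return count


def follower_writing_check(ack_queue, count):
    """PP1 side: consume exactly `count` acks, as dictated by the leader."""
    if count > len(ack_queue):
        raise ValueError("follower has fewer queued acks than the leader consumed")
    return [ack_queue.popleft() for _ in range(count)]


if __name__ == "__main__":
    finished = {"a", "b", "c"}  # acks whose copies have completed on PP0
    pp0_queue = deque(["a", "b", "c", "d"])
    pp1_queue = deque(["a", "b", "c", "d"])
    n = leader_writing_check(pp0_queue, finished.__contains__)
    follower_writing_check(pp1_queue, n)
    print(n, list(pp0_queue), list(pp1_queue))
```

The follower never consults its own completion state to decide how many acks to pop, which is what keeps both queues at the same position after every check.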

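5.3 PP1 consumes in step with PP0's acks

PP1 no longer consumes acks independently. A synchronization step ensures that after PP0 consumes an ack, PP1's radix tree state is aligned with PP0's:

PP0: writing_check() → event.query()=True → consume write_ack
     → radix tree: insert() → finalize() → node state updated
     ↓ TP all_reduce sync
PP1: waits for PP0 to finish → radix tree: runs finalize() in sync
     → node state matches PP0

The three layers:

Layer	Mechanism	What it guarantees
Logical clock	Event Replay (Gloo P2P)	PP0/PP1 process events in the same order
Count Sync	PR #22878 piggyback counts	PP0/PP1 checkpoint state is consistent
Ack sync	PP1 replay → PP0 consumes → TP all_reduce	radix tree node state is consistent

Core goal: regardless of how many iters PP0 and PP1 are offset, the radix tree structure and node state stay identical on both stages.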

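6. Summary

Dimension	Load	Write
Trigger function	`get_next_batch_to_run`	`process_batch_result`
Trigger condition	prefix match finds host_hit	insert → _inc_hit_count → write_backup
Relation to forward	before forward (prefetch)	after forward (post-write)
CUDA copy vs. forward	can overlap (per-layer wait)	serial (triggered only after forward finishes)
ack consumption delay	+1 iter	+3 iter (including the process_result offset)

Key takeaways:

In every iter the PP scheduler does two things at once: get_next_batch_to_run picks the next batch, and process_batch_result handles the result of an earlier batch.
Load is a prefetch (triggered before forward); Write is a post-write (triggered after forward).
The iter offset of process_batch_result, which handles mbs[next_mb_id], is the root cause of the extra Write latency.
Count sync across PP stages is needed to compensate for the timing skew introduced by output relay.

This post is based on an analysis of the SGLang source code; files involved: hiradix_cache.py, cache_controller.py.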