See Benchmarking Text Generation Inference.
See SGLang issue 364.
See LLM inference server performances comparison llama.cpp / TGI / vLLM.
Related code:
sglang bench
vLLM bench prefix cache
vLLM bench serving
On TokenAttention vs. PagedAttention: TokenAttention feels like an odd design, and RadixAttention's granularity does not map cleanly onto PagedAttention's blocks.
vLLM's default block size is at most 32 tokens. The string length that corresponds to is not fixed, but a token averages roughly 4 characters, so an effective prefix of about 120 characters is a reasonable threshold.
A recent RFC in the vLLM production stack proposes a block-aligned trie, or simply SimHash, which may better match vLLM's own implementation.
ByteDance's recently released AIBrix also provides similar capabilities.
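To make the granularity point concrete, here is a minimal sketch (my own illustration, not vLLM's actual code) of block-level prefix keys: with one hash per full block, reuse only happens in whole 32-token chunks, whereas a radix tree can match a prefix at any token boundary.

```python
# Illustrative only: block-granularity prefix keys, assuming vLLM's default
# block size of 32 tokens. Partial trailing blocks cannot be reused.
import hashlib

BLOCK_SIZE = 32

def block_prefix_hashes(token_ids):
    """One key per full block, derived from all tokens up to that block."""
    hashes = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        prefix = tuple(token_ids[:end])
        hashes.append(hashlib.sha256(repr(prefix).encode()).hexdigest())
    return hashes

# Two prompts sharing 40 tokens only reuse the first 32-token block.
a = list(range(100))
b = list(range(40)) + list(range(1000, 1060))
shared = sum(x == y for x, y in zip(block_prefix_hashes(a), block_prefix_hashes(b)))
print(shared)  # -> 1
```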
Prefix repetition
To compare prefix repetition across datasets, we need a way to quantify how much conversations share prefixes; if repetition is low, the results will not really show the benefit of prefix caching.
Build a radix tree over all conversations, where each node keeps a counter of how many conversations pass through it.
Count how often each prefix repeats. A prefix such as 'W' repeats very often, because many English questions start with 'Wh-', whereas Chinese conversations open much more randomly.
When filtering by the counter, keep walking down until every child of a node falls below the threshold N. This avoids counting multiple overlapping common prefixes, since a shorter prefix is always contained in a longer one. In effect, the tree is pruned: every node whose counter is below the filter value is removed.
From the leaves that remain after pruning, keep only the prefixes whose length exceeds L.
Prefix repetition of a conversation dataset = (number of pruned-tree leaf prefixes longer than L, pruned with threshold N) / (total number of conversations)
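A minimal sketch of this measurement (my own version, not necessarily the script actually used; a character-level trie with illustrative thresholds N=30 and L=120):

```python
# Character-level trie with a pass counter per node; pruning keeps only nodes
# whose counter reaches the threshold N, and we report pruned-tree leaves
# whose prefix length exceeds L.
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def prefix_repetition(conversations, n_threshold=30, min_len=120):
    root = TrieNode()
    # Insert every conversation; each node on the path counts one more pass.
    for text in conversations:
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1

    # Walk down only while children still meet the threshold N. A node whose
    # children all fall below N is a leaf of the pruned tree, so shorter
    # prefixes contained in a longer common prefix are not double counted.
    repeated, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        hot = [(ch, c) for ch, c in node.children.items() if c.count >= n_threshold]
        if not hot:
            if len(prefix) >= min_len:
                repeated.append((prefix, node.count))
            continue
        for ch, child in hot:
            stack.append((child, prefix + ch))

    # Repetition = long pruned-tree leaf prefixes / total conversations.
    return len(repeated) / max(len(conversations), 1), repeated
```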
Stress-test datasets
databricks-dolly-15k: prefix repetition is low. Only two prefixes above the length threshold repeat more than once, because the dataset contains only single-turn conversations: ('Extract all of the dates mentioned in this paragraph and list them using bullets in the format {Date} - {Description}', 11) and ('Extract all of the names of people mentioned in this paragraph and list them using bullets in the format {Name}', 15).
LMSYS-CHAT-1M: a single parquet file holds about 160k conversations. The most-repeated prefixes occur 30 to 40 times; there are 9,483 such conversations, about 5% of the total, and the average repeated-prefix length is only around 300 characters.
ShareGPT: the stress-test dataset used officially by vLLM (the benchmark script is here). Its share is likewise only about 2%, with an average repeated-prefix length of 4K characters.
These datasets are therefore unlikely to show the advantage of prefix caching clearly.
Benchmark tool
sglang inference benchmark (a sketch of how the latency metrics are measured follows the parameter list)
Benchmark parameters
batch_size: 30
max_length: 4096
num_samples: 1000
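As a reference for the metrics reported below, this is a hedged sketch of how TTFT and inter-token latency can be measured against a streaming OpenAI-compatible endpoint; it is not the sglang benchmark's own code, and the URL, model name, and payload fields are placeholders.

```python
# Rough TTFT / ITL measurement against a streaming /v1/completions endpoint.
import time
import requests

def measure_request(prompt, url="http://127.0.0.1:8081/v1/completions", model="llama"):
    t0 = time.perf_counter()
    token_times = []
    payload = {"model": model, "prompt": prompt, "max_tokens": 4096, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            # Each SSE line carries one chunk; skip keep-alives and the [DONE] marker.
            if line and not line.endswith(b"[DONE]"):
                token_times.append(time.perf_counter())
    ttft = token_times[0] - t0                                    # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]   # inter-token latencies
    return ttft, itl
```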
Benchmark results
Constructing a dataset
Results on real datasets are not great: the differences between runs are small, because the prefix repetition rate in these datasets is low. There is no particularly good off-the-shelf dataset, so one has to be constructed artificially.
sglang's benchmark provides generated-shared-prefix dataset arguments. It works by randomly generating a system prompt and combining it with questions, but because the prompt is random text it is not very readable; this probably does not affect the benchmark results, though.
Ideally one would construct system prompts of chosen lengths by hand and combine them with a set of questions. That is more readable, but less flexible, since it is hard to generate prompts that hit a specific target context length on demand.
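A hand-rolled sketch of such a shared-prefix test set (my own construction, not sglang's generated-shared-prefix dataset; the question list, group counts, and the 4-characters-per-token heuristic are all assumptions):

```python
# Build groups of prompts that share a long, human-readable system prompt,
# so requests within a group exercise the prefix cache.
import random

QUESTIONS = [
    "Summarize the document above in three sentences.",
    "List every date mentioned above as bullet points.",
    "Extract the names of all people mentioned above.",
]

def build_dataset(num_groups=8, prompts_per_group=32, system_prompt_tokens=1024):
    dataset = []
    for g in range(num_groups):
        # Rough length control: ~4 characters per token on average, so pad the
        # system prompt with filler sentences until it reaches the target budget.
        system_prompt = f"You are assistant #{g}. Follow the style guide strictly. "
        while len(system_prompt) < system_prompt_tokens * 4:
            system_prompt += "Always answer concisely and cite the relevant section. "
        for _ in range(prompts_per_group):
            dataset.append(system_prompt + "\nUser question: " + random.choice(QUESTIONS))
    return dataset
```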
Results: with a larger batch size, TTFT grows dramatically, while TBT also increases somewhat but nowhere near as badly. After increasing the batch size, TTFT went from 300 s to 900 s, while ITL only went from 0.2 s to 0.3 s. This is consistent with the Mooncake paper.
To test prefill/decode (PD) disaggregation, we used vLLM's 1P1D setup. With PD disaggregation, TTFT drops by an order of magnitude, which is a very noticeable improvement.
```
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     47
Benchmark duration (s):                  127.03
Total input tokens:                      14545
Total generated tokens:                  2993
Total generated tokens (retokenized):    2992
Request throughput (req/s):              0.37
Input token throughput (tok/s):          114.50
Output token throughput (tok/s):         23.56
Total token throughput (tok/s):          138.06
Concurrency:                             24.49
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   66177.90
Median E2E Latency (ms):                 61336.75
---------------Time to First Token----------------
Mean TTFT (ms):                          39888.70
Median TTFT (ms):                        22421.85
P99 TTFT (ms):                           116090.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          491.86
Median TPOT (ms):                        394.97
P99 TPOT (ms):                           1917.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           419.69
Median ITL (ms):                         275.52
P99 ITL (ms):                            1766.40
==================================================
```
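For context, the 1P1D request flow above can be pictured with this rough sketch, loosely modeled on vLLM's disaggregated-prefill example proxies; the ports and payload shape are placeholders, and the KV-cache transfer between the two instances is handled by the engine, not by this code.

```python
# 1P1D proxy sketch: run the prompt on the prefill instance first (max_tokens=1),
# then let the decode instance produce the real completion from the transferred KV cache.
import requests

PREFILL_URL = "http://127.0.0.1:8100/v1/completions"
DECODE_URL = "http://127.0.0.1:8200/v1/completions"

def generate(payload: dict) -> dict:
    # Step 1: prefill-only pass; it produces the KV cache but almost no output.
    requests.post(PREFILL_URL, json=dict(payload, max_tokens=1)).raise_for_status()
    # Step 2: decode pass returns the actual generation.
    return requests.post(DECODE_URL, json=payload).json()
```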
Dual V100, LLAMA3.2:11b:
```
python -m sglang_router.launch_router --worker-urls http://127.0.0.1:8081 http://127.0.0.1:8082
```
```
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1000
Benchmark duration (s):                  1247.16
Total input tokens:                      289255
Total generated tokens:                  184429
Total generated tokens (retokenized):    184388
Request throughput (req/s):              0.80
Input token throughput (tok/s):          231.93
Output token throughput (tok/s):         147.88
Total token throughput (tok/s):          379.81
Concurrency:                             470.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   586218.50
Median E2E Latency (ms):                 596155.97
---------------Time to First Token----------------
Mean TTFT (ms):                          520113.99
Median TTFT (ms):                        526194.47
P99 TTFT (ms):                           1067230.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          363.05
Median TPOT (ms):                        356.14
P99 TPOT (ms):                           736.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           360.61
Median ITL (ms):                         273.54
P99 ITL (ms):                            1525.31
==================================================
```
With two cards serving in parallel behind the router, throughput scales roughly linearly, but compared with 1P1D, the prefill time does not improve.
Multi-node setup: DeepSeek R1 on two machines with 8x H20 each; for RDMA, each machine has 16 MT2910 Family [ConnectX-7] NICs configured as 8 bonds.
The deployment is TP=8 x PP=2; if EP (expert parallelism) is supported later, it may perform even better.
```
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1000
Benchmark duration (s):                  234.47
Total input tokens:                      303481
Total generated tokens:                  187870
Total generated tokens (retokenized):    186116
Request throughput (req/s):              4.26
Input token throughput (tok/s):          1294.33
Output token throughput (tok/s):         801.26
Total token throughput (tok/s):          2095.59
Concurrency:                             363.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   85122.29
Median E2E Latency (ms):                 82826.18
---------------Time to First Token----------------
Mean TTFT (ms):                          31789.26
Median TTFT (ms):                        17669.77
P99 TTFT (ms):                           100110.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          770.73
Median TPOT (ms):                        341.77
P99 TPOT (ms):                           9445.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           284.74
Median ITL (ms):                         214.68
P99 ITL (ms):                            745.14
==================================================
```
SGLang was configured with TP=16, since SGLang does not support PP. SGLang is clearly faster; the main reason is most likely that SGLang supports MTP (multi-token prediction), which vLLM does not yet.
```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1000
Benchmark duration (s):                  190.92
Total input tokens:                      306113
Total generated tokens:                  197108
Total generated tokens (retokenized):    195033
Request throughput (req/s):              5.24
Input token throughput (tok/s):          1603.38
Output token throughput (tok/s):         1032.43
Total token throughput (tok/s):          2635.81
Concurrency:                             488.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   93263.23
Median E2E Latency (ms):                 86230.17
---------------Time to First Token----------------
Mean TTFT (ms):                          39722.57
Median TTFT (ms):                        43590.80
P99 TTFT (ms):                           60010.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1529.43
Median TPOT (ms):                        270.69
P99 TPOT (ms):                           37619.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           276.88
Median ITL (ms):                         158.45
P99 ITL (ms):                            945.60
==================================================
```