See Benchmarking Text Generation Inference.
See SGLang issue 364.
See LLM inference server performances comparison llama.cpp / TGI / vLLM.
Related code:
sglang bench
vLLM bench prefix cache
vLLM bench serving
On TokenAttention vs. PagedAttention: TokenAttention feels like an odd design, and RadixAttention's granularity does not map cleanly onto PagedAttention's blocks.
vLLM's default block size is at most 32 tokens. The string length that corresponds to is not fixed, but a token averages roughly 4 characters, so an effective prefix of about 120 characters is a reasonable threshold.
A recent RFC in the vLLM production stack proposes a block-aligned trie, or simply SimHash, which may better match vLLM's own implementation.
ByteDance's recently released AIBrix also provides similar capabilities.
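To make the granularity point concrete, here is a minimal sketch (my own illustration, not vLLM's actual code) of block-level prefix keys: with one hash per full block, reuse only happens in whole 32-token chunks, whereas a radix tree can match a prefix at any token boundary.

```python
# Illustrative only: block-granularity prefix keys, assuming vLLM's default
# block size of 32 tokens. Partial trailing blocks cannot be reused.
import hashlib

BLOCK_SIZE = 32

def block_prefix_hashes(token_ids):
    """One key per full block, derived from all tokens up to that block."""
    hashes = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        prefix = tuple(token_ids[:end])
        hashes.append(hashlib.sha256(repr(prefix).encode()).hexdigest())
    return hashes

# Two prompts sharing 40 tokens only reuse the first 32-token block.
a = list(range(100))
b = list(range(40)) + list(range(1000, 1060))
shared = sum(x == y for x, y in zip(block_prefix_hashes(a), block_prefix_hashes(b)))
print(shared)  # -> 1
```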
Prefix repetition
To compare prefix repetition across datasets, we need a way to quantify how much conversations share prefixes; if repetition is low, the results will not really show the benefit of prefix caching.
Build a radix tree over all conversations, where each node keeps a counter of how many conversations pass through it.
Count how often each prefix repeats. A prefix such as 'W' repeats very often, because many English questions start with 'Wh-', whereas Chinese conversations open much more randomly.
When filtering by the counter, keep walking down until every child of a node falls below the threshold N. This avoids counting multiple overlapping common prefixes, since a shorter prefix is always contained in a longer one. In effect, the tree is pruned: every node whose counter is below the filter value is removed.
From the leaves that remain after pruning, keep only the prefixes whose length exceeds L.
Prefix repetition of a conversation dataset = (number of pruned-tree leaf prefixes longer than L, pruned with threshold N) / (total number of conversations)
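A minimal sketch of this measurement (my own version, not necessarily the script actually used; a character-level trie with illustrative thresholds N=30 and L=120):

```python
# Character-level trie with a pass counter per node; pruning keeps only nodes
# whose counter reaches the threshold N, and we report pruned-tree leaves
# whose prefix length exceeds L.
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def prefix_repetition(conversations, n_threshold=30, min_len=120):
    root = TrieNode()
    # Insert every conversation; each node on the path counts one more pass.
    for text in conversations:
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1

    # Walk down only while children still meet the threshold N. A node whose
    # children all fall below N is a leaf of the pruned tree, so shorter
    # prefixes contained in a longer common prefix are not double counted.
    repeated, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        hot = [(ch, c) for ch, c in node.children.items() if c.count >= n_threshold]
        if not hot:
            if len(prefix) >= min_len:
                repeated.append((prefix, node.count))
            continue
        for ch, child in hot:
            stack.append((child, prefix + ch))

    # Repetition = long pruned-tree leaf prefixes / total conversations.
    return len(repeated) / max(len(conversations), 1), repeated
```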
Stress-test datasets
databricks-dolly-15k: prefix repetition is low. Only two prefixes above the length threshold repeat more than once, because the dataset contains only single-turn conversations: ('Extract all of the dates mentioned in this paragraph and list them using bullets in the format {Date} - {Description}', 11) and ('Extract all of the names of people mentioned in this paragraph and list them using bullets in the format {Name}', 15).
LMSYS-CHAT-1M: a single parquet file holds about 160k conversations. The most-repeated prefixes occur 30 to 40 times; there are 9,483 such conversations, about 5% of the total, and the average repeated-prefix length is only around 300 characters.
ShareGPT: the stress-test dataset used officially by vLLM (the benchmark script is here). Its share is likewise only about 2%, with an average repeated-prefix length of 4K characters.
These datasets are therefore unlikely to show the advantage of prefix caching clearly.
Benchmark tool
sglang inference benchmark (a sketch of how the latency metrics are measured follows the parameter list)
Benchmark parameters
batch_size: 30
max_length: 4096
num_samples: 1000
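As a reference for the metrics reported below, this is a hedged sketch of how TTFT and inter-token latency can be measured against a streaming OpenAI-compatible endpoint; it is not the sglang benchmark's own code, and the URL, model name, and payload fields are placeholders.

```python
# Rough TTFT / ITL measurement against a streaming /v1/completions endpoint.
import time
import requests

def measure_request(prompt, url="http://127.0.0.1:8081/v1/completions", model="llama"):
    t0 = time.perf_counter()
    token_times = []
    payload = {"model": model, "prompt": prompt, "max_tokens": 4096, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            # Each SSE line carries one chunk; skip keep-alives and the [DONE] marker.
            if line and not line.endswith(b"[DONE]"):
                token_times.append(time.perf_counter())
    ttft = token_times[0] - t0                                    # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]   # inter-token latencies
    return ttft, itl
```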
Benchmark results
Constructing a dataset
Results on real datasets are not great: the differences between runs are small, because the prefix repetition rate in these datasets is low. There is no particularly good off-the-shelf dataset, so one has to be constructed artificially.
sglang's benchmark provides generated-shared-prefix dataset arguments. It works by randomly generating a system prompt and combining it with questions, but because the prompt is random text it is not very readable; this probably does not affect the benchmark results, though.
Ideally one would construct system prompts of chosen lengths by hand and combine them with a set of questions. That is more readable, but less flexible, since it is hard to generate prompts that hit a specific target context length on demand.
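A hand-rolled sketch of such a shared-prefix test set (my own construction, not sglang's generated-shared-prefix dataset; the question list, group counts, and the 4-characters-per-token heuristic are all assumptions):

```python
# Build groups of prompts that share a long, human-readable system prompt,
# so requests within a group exercise the prefix cache.
import random

QUESTIONS = [
    "Summarize the document above in three sentences.",
    "List every date mentioned above as bullet points.",
    "Extract the names of all people mentioned above.",
]

def build_dataset(num_groups=8, prompts_per_group=32, system_prompt_tokens=1024):
    dataset = []
    for g in range(num_groups):
        # Rough length control: ~4 characters per token on average, so pad the
        # system prompt with filler sentences until it reaches the target budget.
        system_prompt = f"You are assistant #{g}. Follow the style guide strictly. "
        while len(system_prompt) < system_prompt_tokens * 4:
            system_prompt += "Always answer concisely and cite the relevant section. "
        for _ in range(prompts_per_group):
            dataset.append(system_prompt + "\nUser question: " + random.choice(QUESTIONS))
    return dataset
```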
Results: with a larger batch size, TTFT grows dramatically, while TBT also increases somewhat but nowhere near as badly. After increasing the batch size, TTFT went from 300 s to 900 s, while ITL only went from 0.2 s to 0.3 s. This is consistent with the Mooncake paper.
To test prefill/decode (PD) disaggregation, we used vLLM's 1P1D setup. With PD disaggregation, TTFT drops by an order of magnitude, which is a very noticeable improvement.
```
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     47
Benchmark duration (s):                  127.03
Total input tokens:                      14545
Total generated tokens:                  2993
Total generated tokens (retokenized):    2992
Request throughput (req/s):              0.37
Input token throughput (tok/s):          114.50
Output token throughput (tok/s):         23.56
Total token throughput (tok/s):          138.06
Concurrency:                             24.49
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   66177.90
Median E2E Latency (ms):                 61336.75
---------------Time to First Token----------------
Mean TTFT (ms):                          39888.70
Median TTFT (ms):                        22421.85
P99 TTFT (ms):                           116090.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          491.86
Median TPOT (ms):                        394.97
P99 TPOT (ms):                           1917.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           419.69
Median ITL (ms):                         275.52
P99 ITL (ms):                            1766.40
==================================================
```
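For context, the 1P1D request flow above can be pictured with this rough sketch, loosely modeled on vLLM's disaggregated-prefill example proxies; the ports and payload shape are placeholders, and the KV-cache transfer between the two instances is handled by the engine, not by this code.

```python
# 1P1D proxy sketch: run the prompt on the prefill instance first (max_tokens=1),
# then let the decode instance produce the real completion from the transferred KV cache.
import requests

PREFILL_URL = "http://127.0.0.1:8100/v1/completions"
DECODE_URL = "http://127.0.0.1:8200/v1/completions"

def generate(payload: dict) -> dict:
    # Step 1: prefill-only pass; it produces the KV cache but almost no output.
    requests.post(PREFILL_URL, json=dict(payload, max_tokens=1)).raise_for_status()
    # Step 2: decode pass returns the actual generation.
    return requests.post(DECODE_URL, json=payload).json()
```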
Dual V100, LLAMA3.2:11b:
```
python -m sglang_router.launch_router --worker-urls http://127.0.0.1:8081 http://127.0.0.1:8082
```
```
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1000
Benchmark duration (s):                  1247.16
Total input tokens:                      289255
Total generated tokens:                  184429
Total generated tokens (retokenized):    184388
Request throughput (req/s):              0.80
Input token throughput (tok/s):          231.93
Output token throughput (tok/s):         147.88
Total token throughput (tok/s):          379.81
Concurrency:                             470.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   586218.50
Median E2E Latency (ms):                 596155.97
---------------Time to First Token----------------
Mean TTFT (ms):                          520113.99
Median TTFT (ms):                        526194.47
P99 TTFT (ms):                           1067230.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          363.05
Median TPOT (ms):                        356.14
P99 TPOT (ms):                           736.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           360.61
Median ITL (ms):                         273.54
P99 ITL (ms):                            1525.31
==================================================
```
With two cards serving in parallel behind the router, throughput scales roughly linearly, but compared with 1P1D, the prefill time does not improve.
Multi-node setup: DeepSeek R1 on two machines with 8x H20 each; for RDMA, each machine has 16 MT2910 Family [ConnectX-7] NICs configured as 8 bonds.
The deployment is TP=8 x PP=2; if EP (expert parallelism) is supported later, it may perform even better.
```
============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1000
Benchmark duration (s):                  234.47
Total input tokens:                      303481
Total generated tokens:                  187870
Total generated tokens (retokenized):    186116
Request throughput (req/s):              4.26
Input token throughput (tok/s):          1294.33
Output token throughput (tok/s):         801.26
Total token throughput (tok/s):          2095.59
Concurrency:                             363.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   85122.29
Median E2E Latency (ms):                 82826.18
---------------Time to First Token----------------
Mean TTFT (ms):                          31789.26
Median TTFT (ms):                        17669.77
P99 TTFT (ms):                           100110.92
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          770.73
Median TPOT (ms):                        341.77
P99 TPOT (ms):                           9445.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           284.74
Median ITL (ms):                         214.68
P99 ITL (ms):                            745.14
==================================================
```
SGLang was configured with TP=16, since SGLang does not support PP. SGLang is clearly faster; the main reason is most likely that SGLang supports MTP (multi-token prediction), which vLLM does not yet.
```
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1000
Benchmark duration (s):                  190.92
Total input tokens:                      306113
Total generated tokens:                  197108
Total generated tokens (retokenized):    195033
Request throughput (req/s):              5.24
Input token throughput (tok/s):          1603.38
Output token throughput (tok/s):         1032.43
Total token throughput (tok/s):          2635.81
Concurrency:                             488.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   93263.23
Median E2E Latency (ms):                 86230.17
---------------Time to First Token----------------
Mean TTFT (ms):                          39722.57
Median TTFT (ms):                        43590.80
P99 TTFT (ms):                           60010.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1529.43
Median TPOT (ms):                        270.69
P99 TPOT (ms):                           37619.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           276.88
Median ITL (ms):                         158.45
P99 ITL (ms):                            945.60
==================================================
```