vLLM 0.11.1 Seems to Be Bringing Massive Speedup on Turing GPUs
Posted by lly0571@reddit | LocalLLaMA | View on Reddit | 3 comments
vLLM v0.11.1 uses a new FLASHINFER backend and re-enables FP16 support on Turing GPUs, resulting in much better performance on Volta and Turing GPUs (close to lmdeploy: better in prefill, worse in decode).
Hoping someone with a V100, T4, 2080 Ti (22GB), or Titan RTX can run a similar test.
Here is a brief Qwen3-4B-Instruct-2507 throughput benchmark on my Tesla T10 16GB (a rare Tesla GPU close to an RTX 2080, but with 16GB).
I used these commands to serve the model for each backend:
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen3-4B-Instruct-2507 --gpu_memory_utilization 0.9 --port 8000 --max-model-len 16k
CUDA_VISIBLE_DEVICES=1 lmdeploy serve api_server Qwen3-4B-Instruct-2507 --server-port 8000 --session-len 16384
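If you want to make sure the new attention backend and FP16 weights are actually picked up (rather than a fallback), vLLM also lets you request them explicitly. The line below is only a sketch assuming the standard VLLM_ATTENTION_BACKEND environment variable and the --dtype flag, not the exact command I ran, so check it against your version:
# force the FlashInfer attention backend and FP16 weights (sketch)
VLLM_ATTENTION_BACKEND=FLASHINFER CUDA_VISIBLE_DEVICES=1 vllm serve Qwen3-4B-Instruct-2507 --dtype float16 --gpu_memory_utilization 0.9 --port 8000 --max-model-len 16k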
Prefill Heavy: PP8192/TG1 (Parallel 16)
vllm 0.11.0
vllm bench serve --dataset-name random --num-prompts 16 --backend vllm --host 10.249.42.202 --port 8000 --max-concurrency 16 --tokenizer Qwen3-0.6B --model Qwen3-4B-Instruct-2507 --random-input-len 8192 --random-output-len 1
INFO 11-19 14:58:30 [__init__.py:216] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7f020b929620>, seed=0, num_prompts=16, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, custom_output_len=256, custom_skip_chat_template=False, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=8192, random_output_len=1, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', endpoint_type=None, base_url=None, host='10.249.42.202', port=8000, endpoint='/v1/completions', header=None, max_concurrency=16, model='Qwen3-4B-Instruct-2507', tokenizer='Qwen3-0.6B', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 11-19 14:58:32 [datasets.py:507] Sampling input_len from [8192, 8192] and output_len from [1, 1]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
| | 01:21 elapsed, 31635:35:38 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [04:48<00:00, 18.02s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 16
Maximum request concurrency: 16
Benchmark duration (s): 288.39
Total input tokens: 130981
Total generated tokens: 16
Request throughput (req/s): 0.06
Output token throughput (tok/s): 0.06
Peak output token throughput (tok/s): 1.00
Peak concurrent requests: 16.00
Total Token throughput (tok/s): 454.23
---------------Time to First Token----------------
Mean TTFT (ms): 125794.42
Median TTFT (ms): 111166.06
P99 TTFT (ms): 283469.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 0.00
Median TPOT (ms): 0.00
P99 TPOT (ms): 0.00
---------------Inter-token Latency----------------
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P99 ITL (ms): 0.00
==================================================
vllm 0.11.1
vllm bench serve --dataset-name random --num-prompts 64 --backend vllm --host 10.249.42.202 --port 8000 --max-concurrency 16 --tokenizer Qwen3-0.6B --model Qwen3-4B-Instruct-2507 --random-input-len 8192 --random-output-len 1
INFO 11-19 14:47:01 [__init__.py:216] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7f2572149620>, seed=0, num_prompts=64, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, custom_output_len=256, custom_skip_chat_template=False, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=8192, random_output_len=1, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', endpoint_type=None, base_url=None, host='10.249.42.202', port=8000, endpoint='/v1/completions', header=None, max_concurrency=16, model='Qwen3-4B-Instruct-2507', tokenizer='Qwen3-0.6B', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 11-19 14:47:04 [datasets.py:507] Sampling input_len from [8192, 8192] and output_len from [1, 1]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
| | 00:01 elapsed, 642:35:16 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:50<00:00, 1.72s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 64
Maximum request concurrency: 16
Benchmark duration (s): 110.03
Total input tokens: 523886
Total generated tokens: 64
Request throughput (req/s): 0.58
Output token throughput (tok/s): 0.58
Peak output token throughput (tok/s): 1.00
Peak concurrent requests: 17.00
Total Token throughput (tok/s): 4761.83
---------------Time to First Token----------------
Mean TTFT (ms): 24172.28
Median TTFT (ms): 27210.15
P99 TTFT (ms): 28380.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 0.00
Median TPOT (ms): 0.00
P99 TPOT (ms): 0.00
---------------Inter-token Latency----------------
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P99 ITL (ms): 0.00
==================================================
lmdeploy
vllm bench serve --dataset-name random --num-prompts 64 --backend vllm --host 10.249.42.202 --port 8000 --max-concurrency 16 --tokenizer Qwen3-0.6B --model Qwen3-4B-Instruct-2507 --random-input-len 8192 --random-output-len 1
INFO 11-19 15:16:51 [__init__.py:216] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7fa4823b5620>, seed=0, num_prompts=64, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, custom_output_len=256, custom_skip_chat_template=False, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=8192, random_output_len=1, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', endpoint_type=None, base_url=None, host='10.249.42.202', port=8000, endpoint='/v1/completions', header=None, max_concurrency=16, model='Qwen3-4B-Instruct-2507', tokenizer='Qwen3-0.6B', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 11-19 15:16:53 [datasets.py:507] Sampling input_len from [8192, 8192] and output_len from [1, 1]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
| | 00:01 elapsed, 756:41:43 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:58<00:00, 1.85s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 64
Maximum request concurrency: 16
Benchmark duration (s): 118.10
Total input tokens: 523886
Total generated tokens: 124
Request throughput (req/s): 0.54
Output token throughput (tok/s): 1.05
Peak output token throughput (tok/s): 8.00
Peak concurrent requests: 18.00
Total Token throughput (tok/s): 4437.05
---------------Time to First Token----------------
Mean TTFT (ms): 24981.20
Median TTFT (ms): 28008.93
P99 TTFT (ms): 29259.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 1803.85
Median TPOT (ms): 1869.74
P99 TPOT (ms): 1937.03
---------------Inter-token Latency----------------
Mean ITL (ms): 895.75
Median ITL (ms): 0.33
P99 ITL (ms): 1936.55
==================================================
Decode Heavy: PP512/TG512 (Parallel 16)
vllm 0.11.0
vllm bench serve --dataset-name random --num-prompts 16 --backend vllm --host 10.249.42.202 --port 8000 --max-concurrency 16 --tokenizer Qwen3-0.6B --model Qwen3-4B-Instruct-2507 --random-input-len 512 --random-output-len 512
INFO 11-19 15:08:12 [__init__.py:216] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7fe684875620>, seed=0, num_prompts=16, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, custom_output_len=256, custom_skip_chat_template=False, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=512, random_output_len=512, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', endpoint_type=None, base_url=None, host='10.249.42.202', port=8000, endpoint='/v1/completions', header=None, max_concurrency=16, model='Qwen3-4B-Instruct-2507', tokenizer='Qwen3-0.6B', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 11-19 15:08:14 [datasets.py:507] Sampling input_len from [512, 512] and output_len from [512, 512]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
| | 00:40 elapsed, 15758:20:48 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [03:02<00:00, 11.43s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 16
Maximum request concurrency: 16
Benchmark duration (s): 182.80
Total input tokens: 8177
Total generated tokens: 7681
Request throughput (req/s): 0.09
Output token throughput (tok/s): 42.02
Peak output token throughput (tok/s): 75.00
Peak concurrent requests: 16.00
Total Token throughput (tok/s): 86.75
---------------Time to First Token----------------
Mean TTFT (ms): 18188.82
Median TTFT (ms): 16467.30
P99 TTFT (ms): 22968.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 322.22
Median TPOT (ms): 325.09
P99 TPOT (ms): 327.25
---------------Inter-token Latency----------------
Mean ITL (ms): 322.22
Median ITL (ms): 307.80
P99 ITL (ms): 389.45
==================================================
vllm 0.11.1
vllm bench serve --dataset-name random --num-prompts 64 --backend vllm --host 10.249.42.202 --port 8000 --max-concurrency 16 --tokenizer Qwen3-0.6B --model Qwen3-4B-Instruct-2507 --random-input-len 512 --random-output-len 512
INFO 11-19 14:54:10 [__init__.py:216] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7f76d6b1d580>, seed=0, num_prompts=64, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, custom_output_len=256, custom_skip_chat_template=False, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=512, random_output_len=512, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', endpoint_type=None, base_url=None, host='10.249.42.202', port=8000, endpoint='/v1/completions', header=None, max_concurrency=16, model='Qwen3-4B-Instruct-2507', tokenizer='Qwen3-0.6B', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 11-19 14:54:12 [datasets.py:507] Sampling input_len from [512, 512] and output_len from [512, 512]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
| | 00:12 elapsed, 4714:00:33 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:11<00:00, 1.11s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 64
Maximum request concurrency: 16
Benchmark duration (s): 71.04
Total input tokens: 32565
Total generated tokens: 31353
Request throughput (req/s): 0.90
Output token throughput (tok/s): 441.34
Peak output token throughput (tok/s): 512.00
Peak concurrent requests: 31.00
Total Token throughput (tok/s): 899.75
---------------Time to First Token----------------
Mean TTFT (ms): 591.82
Median TTFT (ms): 599.07
P99 TTFT (ms): 1251.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 33.70
Median TPOT (ms): 34.11
P99 TPOT (ms): 35.13
---------------Inter-token Latency----------------
Mean ITL (ms): 33.68
Median ITL (ms): 32.30
P99 ITL (ms): 35.16
==================================================
lmdeploy
vllm bench serve --dataset-name random --num-prompts 64 --backend vllm --host 10.249.42.202 --port 8000 --max-concurrency 16 --tokenizer Qwen3-0.6B --model Qwen3-4B-Instruct-2507 --random-input-len 512 --random-output-len 512
INFO 11-19 15:14:54 [__init__.py:216] Automatically detected platform cuda.
Namespace(subparser='bench', bench_type='serve', dispatch_function=<function BenchmarkServingSubcommand.cmd at 0x7f3146319580>, seed=0, num_prompts=64, dataset_name='random', no_stream=False, dataset_path=None, no_oversample=False, custom_output_len=256, custom_skip_chat_template=False, spec_bench_output_len=256, spec_bench_category=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, blazedit_min_distance=0.0, blazedit_max_distance=1.0, random_input_len=512, random_output_len=512, random_range_ratio=0.0, random_prefix_len=0, random_batch_size=1, random_mm_base_items_per_request=1, random_mm_num_mm_items_range_ratio=0.0, random_mm_limit_mm_per_prompt={'image': 255, 'video': 0}, random_mm_bucket_config={(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}, hf_subset=None, hf_split=None, hf_name=None, hf_output_len=None, prefix_repetition_prefix_len=256, prefix_repetition_suffix_len=256, prefix_repetition_num_prefixes=10, prefix_repetition_output_len=128, label=None, backend='vllm', endpoint_type=None, base_url=None, host='10.249.42.202', port=8000, endpoint='/v1/completions', header=None, max_concurrency=16, model='Qwen3-4B-Instruct-2507', tokenizer='Qwen3-0.6B', use_beam_search=False, logprobs=None, request_rate=inf, burstiness=1.0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, request_id_prefix='benchmark-serving', top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None, ready_check_timeout_sec=600)
INFO 11-19 15:14:57 [datasets.py:507] Sampling input_len from [512, 512] and output_len from [512, 512]
Starting initial single prompt test run...
Waiting for endpoint to become up in 600 seconds
| | 00:14 elapsed, 5459:10:19 remaining
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 16
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [01:05<00:00, 1.03s/it]
tip: install termplotlib and gnuplot to plot the metrics
============ Serving Benchmark Result ============
Successful requests: 64
Maximum request concurrency: 16
Benchmark duration (s): 65.94
Total input tokens: 32565
Total generated tokens: 30895
Request throughput (req/s): 0.97
Output token throughput (tok/s): 468.55
Peak output token throughput (tok/s): 560.00
Peak concurrent requests: 32.00
Total Token throughput (tok/s): 962.42
---------------Time to First Token----------------
Mean TTFT (ms): 1051.63
Median TTFT (ms): 1118.93
P99 TTFT (ms): 1370.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 30.14
Median TPOT (ms): 30.31
P99 TPOT (ms): 32.24
---------------Inter-token Latency----------------
Mean ITL (ms): 30.11
Median ITL (ms): 29.66
P99 ITL (ms): 31.83
==================================================
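To pull the key numbers from the logs above into one place (prefill = total token throughput on PP8192/TG1, decode = output token throughput on PP512/TG512, both at concurrency 16):
Prefill: vllm 0.11.0 ~454 tok/s | vllm 0.11.1 ~4762 tok/s | lmdeploy ~4437 tok/s
Decode: vllm 0.11.0 ~42 tok/s | vllm 0.11.1 ~441 tok/s | lmdeploy ~469 tok/s
So 0.11.1 is roughly a 10x jump over 0.11.0 on this card, slightly ahead of lmdeploy in prefill and slightly behind in decode.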
a_beautiful_rhind@reddit
I thought FlashInfer was Ampere-only.
lly0571@reddit (OP)
I also believed that, but vLLM's FlashInferBackend does seem to support Turing+ GPUs now.
DeltaSqueezer@reddit
Thanks for sharing. I recently re-pasted my 2080 Ti, so I look forward to putting it back into action!