vLLM on V100 for Qwen - Newer models
Posted by SectionCrazy5107@reddit | LocalLLaMA | View on Reddit | 15 comments
I am struggling to run vLLM on my V100 GPU with newer models like Qwen 9B. I've tried the vLLM nightly build plus the latest transformers etc., but they still don't work together and I can't get it running. Any advice would be much appreciated.
wasatthebeach@reddit
Did you try this one? https://github.com/1CatAI/1Cat-vLLM
SectionCrazy5107@reddit (OP)
Thanks, mate, this works nicely. I tried a 35B model on 2x V100 32GB; vLLM runs, but the t/s is slower than llama.cpp. Still, thanks for pointing it out, I can finally see vLLM running. Also, since I have 3x V100 32GB, is it possible to run on an odd number of GPUs at all?
wasatthebeach@reddit
AFAIK, that repo is from a company that makes SXM2 carrier boards, and AFAIK vLLM requires power-of-two device counts (2, 4, 8, 16).
Are both pp and tg slower? I used to think tg was the most important, but for agentic work pp matters a lot too for how fast a task completes.
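[Editor's note: on the power-of-two claim above, my understanding is that vLLM's hard requirement is actually that the model's attention head count divide evenly by the tensor-parallel size, which is why odd GPU counts like 3 usually fail for most models. A minimal sketch of that divisibility check, using a hypothetical head count of 40:]

```python
def tp_size_is_valid(num_attention_heads: int, tp_size: int) -> bool:
    # vLLM shards attention heads across GPUs, so the head count
    # must be divisible by the tensor-parallel size.
    return num_attention_heads % tp_size == 0

# Hypothetical model with 40 attention heads:
print([tp for tp in (1, 2, 3, 4) if tp_size_is_valid(40, tp)])  # [1, 2, 4]
```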
Substantial_Log_1707@reddit
You mean Qwen3.5 9B?
Don't try it until vLLM ships another release like 0.16.1; there are bugs in it.
I'm using the official GPTQ model Qwen/Qwen3.5-27b-GPTQ-Int4 on 2x V100, CUDA 12.8, with the vLLM nightly Docker image.
The code runs and the model loads, then it silently gets stuck after this line:
[gpu_model_runner.py:5259] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
This is not necessarily the cause, but the CPU and GPU are both at 100%, which looks like some kind of deadlock. Same for MoE models.
nightly + Qwen3: OK. So this specific combination of nightly + Qwen3.5 has a problem in it; I guess the vLLM team is working hard on it (maybe not for V100, LOL).
SectionCrazy5107@reddit (OP)
It's stuck at EXACTLY the same point for me too.
Substantial_Log_1707@reddit
You may also need to set this env var. The Triton kernel JIT for the language model is very slow, and the default 300s timeout is not enough on some machines.
jessechapman1@reddit
I was able to run Qwen3.5 on V100s with this extended timeout, thank you <3
RedBrick789@reddit
What's your setup, single GPU or NVLink? Could you share some benchmarks, please?
Substantial_Log_1707@reddit
--mm-encoder-attn-backend TORCH_SDPA
Try this; I got the 0.8B model to work WITH vision using the SDPA backend for the mm encoder.
Substantial_Log_1707@reddit
update:
I managed to run a Qwen3.5 0.8B model on a V100 with vLLM using the `--skip-mm-profiling`, `--enforce-eager`, and `--gpu-memory-utilization 0.8` arguments, but weird things happen:
The memory consumption is absurd. A very simple prompt like "who are you" eats up 6GB of free memory for pp, when it should actually take a few hundred megabytes.
The tg throughput is ridiculous, 1 tok/s at best.
If I deliberately choke the memory for the model with `--gpu-memory-utilization 0.9`, I can see a Triton OOM error inside this function:
vllm/model_executor/layers/fla/ops/chunk_scaled_dot_kkt.py", line 141, in chunk_scaled_dot_kkt_fwd
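[Editor's note: a rough way to sanity-check the 6GB figure above is to estimate KV-cache growth per token from the model shape. The dimensions below are hypothetical, not Qwen3.5's actual config; the point is that a short prompt should cost megabytes, not gigabytes:]

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V tensor per
    layer, each holding num_kv_heads * head_dim values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical small-model shape: 28 layers, 4 KV heads, head_dim 128, fp16
per_token = kv_cache_bytes_per_token(28, 4, 128)
print(per_token)                   # 57344 bytes = 56 KiB per token
print(per_token * 1000 / 2**20)    # ~54.7 MiB for a 1000-token context
```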
MelodicRecognition7@reddit
https://www.google.com/search?channel=entpr&q=how+to+ask+technical+questions+about+when+program+does+not+work
SectionCrazy5107@reddit (OP)
Thanks
nerdlord420@reddit
The last official vLLM version that supported the V100 was 0.8.6.post1, I believe.
MelodicRecognition7@reddit
v0.10 maybe? https://docs.vllm.ai/en/v0.10.0/getting_started/installation/gpu.html
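[Editor's note: a related V100 gotcha not spelled out in the thread: many newer model configs default to bfloat16, which Volta (compute capability 7.0, i.e. V100) lacks. vLLM refuses bf16 below compute capability 8.0 and suggests float16 instead, so `--dtype float16` is often needed on these cards. A minimal sketch of that capability check:]

```python
def bf16_supported(major: int, minor: int) -> bool:
    # bfloat16 needs Ampere (compute capability 8.0) or newer;
    # Volta V100 reports (7, 0) and only has fp16 tensor cores.
    return (major, minor) >= (8, 0)

print(bf16_supported(7, 0))  # False -> V100 needs --dtype float16
print(bf16_supported(8, 0))  # True  -> A100 and newer are fine
```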
SectionCrazy5107@reddit (OP)
Thanks