vLLM on V100 for Qwen - Newer models
Posted by SectionCrazy5107@reddit | LocalLLaMA | View on Reddit | 15 comments
I am struggling to run vLLM on my V100 GPU with newer models like Qwen 9B. I've tried the vLLM nightly build plus the latest transformers etc., but they still don't work together and I can't get it running. Any advice would be much appreciated.
wasatthebeach@reddit
Did you try this one? https://github.com/1CatAI/1Cat-vLLM
SectionCrazy5107@reddit (OP)
Thanks, mate, this works nicely. I tried a 35B model on 2x V100 32GB; vLLM runs, but the t/s is slower than llama.cpp. Still, thanks for pointing it out, I can finally see vLLM running. Also, since I have 3x V100 32GB, is it possible to run on an odd number of GPUs at all?
wasatthebeach@reddit
AFAIK, that repo is from a company that makes SXM2 carrier boards, and AFAIK vLLM requires power-of-two device counts (2, 4, 8, 16).
Are both pp and tg slower? I used to think tg was the most important, but for agentic work pp matters a lot too for how fast a task completes.
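[Editor's note: on the power-of-two claim above, my understanding is that vLLM's hard requirement is actually that the model's attention head count divide evenly by the tensor-parallel size, which is why odd GPU counts like 3 usually fail for most models. A minimal sketch of that divisibility check, using a hypothetical head count of 40:]

```python
def tp_size_is_valid(num_attention_heads: int, tp_size: int) -> bool:
    # vLLM shards attention heads across GPUs, so the head count
    # must be divisible by the tensor-parallel size.
    return num_attention_heads % tp_size == 0

# Hypothetical model with 40 attention heads:
print([tp for tp in (1, 2, 3, 4) if tp_size_is_valid(40, tp)])  # [1, 2, 4]
```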
Substantial_Log_1707@reddit
You mean Qwen3.5 9B?
Don't try it until vLLM ships another release like 0.16.1; there are bugs in it.
I'm using the official GPTQ model Qwen/Qwen3.5-27b-GPTQ-Int4 on 2x V100, CUDA 12.8, with the vLLM nightly Docker image.
The code runs and the model loads, then it silently gets stuck after this line:
[gpu_model_runner.py:5259] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
This is not necessarily the cause, but the CPU and GPU are both at 100%, which looks like some kind of deadlock. Same for MoE models.
nightly + Qwen3: OK. So this specific combination of nightly + Qwen3.5 has a problem in it; I guess the vLLM team is working hard on it (maybe not for V100, LOL).
SectionCrazy5107@reddit (OP)
It's stuck at EXACTLY the same point for me too.
Substantial_Log_1707@reddit
You may also need to set this env var. The Triton kernel JIT for the language model is very slow, and the default 300s timeout is not enough on some machines.
jessechapman1@reddit
I was able to run Qwen3.5 on V100s with this extended timeout, thank you <3
RedBrick789@reddit
What's your setup, single GPU or NVLink? Could you share some benchmarks, please?
Substantial_Log_1707@reddit
--mm-encoder-attn-backend TORCH_SDPA
Try this; I got the 0.8B model to work WITH vision using the SDPA backend for the mm encoder.
Substantial_Log_1707@reddit
update:
I managed to run a Qwen3.5 0.8B model on a V100 with vLLM using the `--skip-mm-profiling`, `--enforce-eager`, and `--gpu-memory-utilization 0.8` arguments, but weird things happen:
The memory consumption is absurd. A very simple prompt like "who are you" eats up 6GB of free memory for pp, when it should actually take a few hundred megabytes.
The tg throughput is ridiculous, 1 tok/s at best.
If I deliberately choke the memory for the model with `--gpu-memory-utilization 0.9`, I can see a Triton OOM error inside this function:
vllm/model_executor/layers/fla/ops/chunk_scaled_dot_kkt.py", line 141, in chunk_scaled_dot_kkt_fwd
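[Editor's note: a rough way to sanity-check the 6GB figure above is to estimate KV-cache growth per token from the model shape. The dimensions below are hypothetical, not Qwen3.5's actual config; the point is that a short prompt should cost megabytes, not gigabytes:]

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V tensor per
    layer, each holding num_kv_heads * head_dim values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical small-model shape: 28 layers, 4 KV heads, head_dim 128, fp16
per_token = kv_cache_bytes_per_token(28, 4, 128)
print(per_token)                   # 57344 bytes = 56 KiB per token
print(per_token * 1000 / 2**20)    # ~54.7 MiB for a 1000-token context
```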
MelodicRecognition7@reddit
https://www.google.com/search?channel=entpr&q=how+to+ask+technical+questions+about+when+program+does+not+work
SectionCrazy5107@reddit (OP)
Thanks
nerdlord420@reddit
The last official vLLM version that supported the V100 was 0.8.6.post1, I believe.
MelodicRecognition7@reddit
v0.10 maybe? https://docs.vllm.ai/en/v0.10.0/getting_started/installation/gpu.html
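[Editor's note: a related V100 gotcha not spelled out in the thread: many newer model configs default to bfloat16, which Volta (compute capability 7.0, i.e. V100) lacks. vLLM refuses bf16 below compute capability 8.0 and suggests float16 instead, so `--dtype float16` is often needed on these cards. A minimal sketch of that capability check:]

```python
def bf16_supported(major: int, minor: int) -> bool:
    # bfloat16 needs Ampere (compute capability 8.0) or newer;
    # Volta V100 reports (7, 0) and only has fp16 tensor cores.
    return (major, minor) >= (8, 0)

print(bf16_supported(7, 0))  # False -> V100 needs --dtype float16
print(bf16_supported(8, 0))  # True  -> A100 and newer are fine
```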
SectionCrazy5107@reddit (OP)
Thanks