Running Qwen3.5 27b dense with 170k context at 100+t/s decode and ~1500t/s prefill on 2x3090 (with 585t/s throughput for 8 simultaneous requests)

Posted by JohnTheNerd3@reddit | LocalLLaMA | 140 comments

Hi everyone!

I've been trying to run the new Qwen models as efficiently as possible on my setup - and I seem to be getting higher performance than I've seen reported elsewhere, so I wanted to share my scripts and metrics!

The video above shows ideal conditions - due to the nature of MTP (multi-token prediction), decoding does slow down once your response requires more intelligence and creativity, since fewer drafted tokens get accepted. Even in the worst case, though, I rarely see my decode speeds drop below 60t/s. And for multi-user throughput, I have seen as high as 585t/s across 8 requests.
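To see why acceptance rate matters so much, here is a back-of-envelope sketch (the acceptance rate below is an assumed illustrative number, not a measurement from my setup): if MTP drafts k tokens per step and each draft token is accepted independently with probability a, a verification step emits (1 - a^(k+1)) / (1 - a) tokens on average.

```shell
# Back-of-envelope sketch with assumed numbers (not measured):
# expected tokens emitted per verification step = (1 - a^(k+1)) / (1 - a)
k=5     # draft tokens per step (matches num_speculative_tokens below)
a=0.8   # assumed per-token acceptance rate
awk -v a="$a" -v k="$k" \
  'BEGIN { printf "expected tokens per step: %.2f\n", (1 - a^(k+1)) / (1 - a) }'
# prints "expected tokens per step: 3.69"
```

As the acceptance rate drops (harder, more creative text), the expected tokens per step approaches 1 and you fall back toward plain autoregressive decode speed.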

To achieve this, I had to patch vLLM.

The tool call parser for Qwen3 Coder (which vLLM also uses for Qwen3.5) seems to have a bug where tool calling is inaccurate when MTP is enabled, so I cherry-picked this pull request into the current main branch (along with another pull request that fixes reasoning content being lost when using LiteLLM). My fork with the cherry-picked fixes is available on my GitHub if you'd like to use it, but please keep in mind that I am unlikely to maintain it.
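If you'd rather cherry-pick the fixes yourself instead of using my fork, the workflow looks roughly like this. The branch name and `<PR_NUMBER>` placeholder are illustrative - substitute the actual pull requests linked above:

```shell
# Illustrative sketch: cherry-pick an open pull request onto vLLM main.
# <PR_NUMBER> is a placeholder for the actual fix PR - repeat for each one.
git clone https://github.com/vllm-project/vllm.git
cd vllm

# GitHub exposes every pull request at the refspec pull/<PR_NUMBER>/head
git fetch origin "pull/<PR_NUMBER>/head:pr-fix"

# replay the PR's commits on top of the current main branch
git checkout main
git cherry-pick main..pr-fix
```

Then rebuild with the build script below so the editable install picks up the patched sources.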

Prefill speeds appear to be really good too, at ~1500t/s.

My current build script is:

#!/bin/bash

# activate the vLLM virtualenv
. /mnt/no-backup/vllm-venv/bin/activate

# build against CUDA 12.4
export CUDACXX=/usr/local/cuda-12.4/bin/nvcc
# limit parallel compile jobs to keep memory usage down during the build
export MAX_JOBS=1
export PATH=/usr/local/cuda-12.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH

cd vllm

# editable install, so the cherry-picked changes are built in place
pip3 install -e .

And my current launch script is:

#!/bin/bash

# activate the vLLM virtualenv
. /mnt/no-backup/vllm-venv/bin/activate

# use both 3090s
export CUDA_VISIBLE_DEVICES=0,1
# disable Ray's memory monitor so it doesn't kill workers under memory pressure
export RAY_memory_monitor_refresh_ms=0
# work around NCCL cuMem allocator issues seen with some driver versions
export NCCL_CUMEM_ENABLE=0
# reduce GPU activity when no requests are in flight
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_ENABLE_CUDAGRAPH_GC=1
# use FlashInfer's sampling kernels
export VLLM_USE_FLASHINFER_SAMPLER=1

# key flags: FlashInfer attention backend, MTP speculative decoding with
# 5 draft tokens, tensor parallelism across both GPUs, and -O3 compilation
vllm serve /mnt/no-backup/models/Qwen3.5-27B-AWQ-BF16-INT4 --served-model-name=qwen3.5-27b \
--quantization compressed-tensors \
--max-model-len=170000 \
--max-num-seqs=8 \
--block-size 32 \
--max-num-batched-tokens=2048 \
--swap-space=0 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend FLASHINFER \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \
--tensor-parallel-size=2 \
-O3 \
--gpu-memory-utilization=0.9 \
--no-use-tqdm-on-load \
--host=0.0.0.0 --port=5000

deactivate
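Once the server is up, it speaks the OpenAI-compatible API, so a quick smoke test looks like the following (the model name matches --served-model-name above; adjust the host and port if you changed them):

```shell
# Smoke test against the OpenAI-compatible endpoint started above.
# Assumes the server is reachable at localhost:5000 (see --host/--port).
curl -s http://localhost:5000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen3.5-27b",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 32
      }'
```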

Hope this helps someone!