An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026
Posted by AmazingDrivers4u@reddit | LocalLLaMA | View on Reddit | 168 comments
Hey guys! I hope this helps everyone.
TheOnlyBen2@reddit
Thanks a lot, very useful !
Question: Given you have 2 RTX 3090s, why did you choose to optimize for one? Wouldn't you gain TPS with tensor parallelism?
AmazingDrivers4u@reddit (OP)
this is the first time i'm running 3090s, just got hold of them a couple of weeks ago. started from scratch. if one gpu isn't running optimally, the whole cluster will be impacted. I've since moved on to setting up two gpus. more on that incoming.
TheOnlyBen2@reddit
Great, thank you for your answer. Looking forward to your two-GPU optimizations.
Do you consider investing in an NVLink? Depending on your motherboard, PCIe performance can really bottleneck tensor parallelism. I had to buy one because of this.
AmazingDrivers4u@reddit (OP)
i rushed to buy two cards and ended up buying two different brands and now can't connect them with an nvlink. doh! i'll try to grab one in due course.
TheOnlyBen2@reddit
If it can make you feel any better, I did the exact same thing lol. I had to sell an MSI to buy another FE.
I suggest not waiting too long to start looking for a NVLINK, because they are really hard to come by nowadays
sudeposutemizligi@reddit
did you see much performance increase with nvlink? tps-wise, not serving capacity. i read somewhere that inference is compute heavy, not bandwidth heavy, and that's why nvlink doesn't make models twice as fast. i actually don't know how those things work though
TheOnlyBen2@reddit
As soon as you want to split a model between two cards, you need speed.
In some cases your PCIe layout makes it so that you don't need NVLink, because both cards run x16 and can do P2P without going through the CPU.
In other cases you end up with the two GPUs running x8 and having to go through the CPU, a huge bottleneck.
So there is no simple yes or no answer on whether you need NVLink for inference; it depends on your motherboard.
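If you want a quick sanity check before spending money, torch can tell you whether the two cards can see each other directly. This is just a capability probe (it says nothing about link speed, so still run a copy benchmark like the one further down the thread):
```
import torch

# Probe whether GPU 0 and GPU 1 can do direct peer-to-peer copies.
# True means P2P is possible over your PCIe layout; it does NOT tell
# you the bandwidth, only that the path exists without bouncing via RAM.
if torch.cuda.device_count() >= 2:
    ok = torch.cuda.can_device_access_peer(0, 1)
    print(f"P2P GPU0 <-> GPU1 possible: {ok}")
else:
    print("Fewer than two CUDA devices visible")
```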
sudeposutemizligi@reddit
i have an x99 server board where one of the rtx 3090s is x16 and the other is x8 electrically, and a p2p test gave 3.8 GB/s. too slow. but my regular speed for qwen3.6 27b is around 47 tps, without any hacks.. i don't think nvlink would magically make it 94 tps. consumer boards all have a PHB topology as far as i can see.. no PLX chip, that's why no good p2p. but i really don't know if i would get that much more speed with 112 GB/s between gpus, compared to 3.8 GB/s
TheOnlyBen2@reddit
I have yet to try Qwen 3.6 27b, if you give me the parameters you used, I can try and see how it compares. I can also just unplug the NVLINK I guess
sudeposutemizligi@reddit
i would really be very grateful if you can. because i really don't know exactly what i am doing. read->ask codex->apply then ask reddit 😊 that's my knowledge life cycle as an amateur.. I will try to paste all my versions, smi, p2p test output etc. not to mislead.. thank youu🤘🤘
TheOnlyBen2@reddit
No problem, I am curious as well :)
sudeposutemizligi@reddit
ok, starting 😄 i am sorry for the nonsense pastes from now on 😄
this is my nvidia-smi:
```
| NVIDIA-SMI 595.45.04              Driver Version: 595.45.04      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   37C    P8              4W /  220W |   23600MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
|  0%   36C    P8             11W /  220W |   23718MiB /  24576MiB |      0%      Default |
```
************************** these are my vllm / torch / transformers versions *********
```
$ pip show vllm | grep -i version
Version: 0.19.2rc1.dev45+g3461c8b02
$ pip show torch | grep -i version
Version: 2.11.0
$ pip show transformers | grep -i version
Version: 5.6.2
```
************** this is the vllm launch command **************
```
(vllmn) ozgur@X99-8D4-2-5G-Server:~/venvs$ vllm serve \
  /media/ozgur/463A5DFB3A5DE907/Users/AICM/Desktop/model/models--Lorbus--Qwen3.6-27B-int4-AutoRound/snapshots/c3aea2d531678621989e5e2db034e32b22536e79 \
  --served-model-name qwen3.6-27b-autoround \
  --quantization auto_round \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8_e5m2 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --host 0.0.0.0 \
  --port 8000
```
********** this is my python p2p testing command: **********
```
python3 -c "
import torch, time

# 256 MiB of float32 data (4 bytes per element) on GPU 0, copied to GPU 1 ten times
size = 256 * 1024 * 1024 // 4
x = torch.randn(size, device='cuda:0')
y = torch.empty(size, device='cuda:1')
torch.cuda.synchronize()

start = time.time()
for _ in range(10):
    y.copy_(x)          # device-to-device copy (P2P if the platform allows it)
torch.cuda.synchronize()
elapsed = time.time() - start

# 256 MiB per copy * 10 copies, converted to GB/s
bw = (256 * 10) / elapsed / 1024
print(f'P2P Bandwidth: {bw:.1f} GB/s')"
```
P2P Bandwidth: 3.8 GB/s
========================================
this is my topology:
```
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-5,12-17       0               N/A
GPU1    PHB      X      0-5,12-17       0               N/A

Legend:
  X    = Self
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
```
https://sharetext.io/duig39h8 (this is the vllm log of what happened through my chat and tool use in the Chatbox UI with 5 tools)
https://sharetext.io/rdjewm9h (this is the whole chat with the model, including model errors)
again, i am terribly sorry for any nonsense 😄
sudeposutemizligi@reddit
LAST RESULT AFTER THE PATCH ... (WITH CLAUDE OF COURSE)
Qwen3.6-27B-int4 on dual RTX 3090, no Docker, vLLM nightly + PR #40361 patch
Hardware: 2× RTX 3090 (PCIe-only, no NVLink), Driver 595.45.04, Ubuntu
Stack: vLLM 0.19.2rc1.dev45+g3461c8b02 (nightly), torch 2.11.0+cu130, Python 3.12, fp8_e5m2 KV cache, TP=2, model Lorbus/Qwen3.6-27B-int4-AutoRound
Patch applied: vllm#40361 (Marlin pad-sub-tile-n) ported to vLLM's refactored kernel layout. Patch loaded but didn't fire on this model — shards happened to be tile-aligned. Kept it in for safety.
Install method: clean uv venv + nightly wheel + patch -p1 against the two affected .py files in site-packages. Zero CUDA recompilation needed, patch is pure Python.
Power cap: stock 220W per card.
Settings: --max-model-len 131072 --max-num-seqs 8 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --disable-custom-all-reduce, NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0
Single-stream TPS results (1000 / 800 token outputs, 3 runs each, /no_think):
MTP acceptance rates:
Why n=1 underperformed: Lorbus's quant has mtp_num_hidden_layers=1. n=3 chains the same single layer 3 times, so position-2/3 acceptance collapses on prose, but on code the structure holds up well enough that 3-pass savings still win on aggregate.
Verdict: MTP n=3 is the right setting. Code gain is real (+30% over baseline). Narrative gain on this single-MTP-layer model is marginal at best — the repo's 71 TPS narrative claim assumes Genesis v7.14 + TurboQuant, not stock fp8.
Memory at 131K, fp8 KV, idle: ~22 GB per card. ~1.5 GB headroom each.
What didn't help / didn't try:
TL;DR: ~70 TPS code / ~53 TPS narrative single-stream on dual 3090 PCIe at 131K context, fp8 KV, MTP n=3. Bare-metal Python venv, no Docker. The Marlin patch is required in principle for AutoRound on TP=2 even though this specific shard didn't trigger it.
sudeposutemizligi@reddit
Final numbers locked in. Updated Reddit post — same template, real data:
Qwen3.6-27B-int4 on dual RTX 3090, no Docker, vLLM nightly + PR #40361 patch second test with mtp 3
Hardware: 2× RTX 3090 (PCIe-only, no NVLink), 220W stock cap, Driver 595.45.04
Stack: vLLM 0.19.2rc1.dev45+g3461c8b02 (nightly), torch 2.11.0+cu130, Python 3.12, fp8_e5m2 KV cache, TP=2, model Lorbus/Qwen3.6-27B-int4-AutoRound
Patch applied: vllm#40361 (Marlin pad-sub-tile-n) ported to vLLM's refactored kernel layout. Patch loaded but didn't fire on this model — shards happened to be tile-aligned. Kept it in for safety.
Install method: clean uv venv + nightly wheel + patch -p1 against the two affected .py files in site-packages. Zero CUDA recompilation needed, patch is pure Python.
Settings: --max-model-len 131072 --max-num-seqs 8 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --disable-custom-all-reduce, NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0
Single-stream TPS results (1000 / 800 token outputs, /no_think, warmed):
MTP acceptance (n=3):
Why n=1 underperformed: Lorbus's quant has mtp_num_hidden_layers=1. n=3 chains the same single layer 3 times, so position-2/3 acceptance collapses on prose, but on code the structure holds up well enough that 3-pass savings still win on aggregate. n=1 has 90%+ acceptance but only saves 1 pass per accept — net loss vs n=3.
Verdict: MTP n=3 is the right setting. Code gain is real (+30% over baseline). Narrative gain on this single-MTP-layer model is marginal — the repo's 71 TPS narrative claim assumes Genesis v7.14 + TurboQuant, not stock fp8.
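A rough back-of-envelope of that trade-off, with illustrative acceptance rates only (roughly in the ballpark of the SpecDecoding metrics vLLM logs; the engine's "Mean acceptance length" is computed the same way):
```
# Expected output tokens per target-model verify pass for different MTP depths.
# Each pass always yields 1 token from the target model; every accepted draft
# token adds one more, and acceptance stops at the first rejected position.
def tokens_per_pass(per_position_acceptance):
    expected, chain = 1.0, 1.0
    for p in per_position_acceptance:
        chain *= p
        expected += chain
    return expected

print("n=1, 92% acceptance :", tokens_per_pass([0.92]))              # ~1.9 tokens/pass
print("n=3, prose-ish rates:", tokens_per_pass([0.95, 0.80, 0.30]))  # ~2.9 tokens/pass
print("n=3, code-ish rates :", tokens_per_pass([0.97, 0.90, 0.70]))  # ~3.5 tokens/pass
```
So even with collapsing position-2/3 acceptance, n=3 still moves more tokens per expensive pass than n=1, which is why it wins on aggregate.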
Memory at 131K, fp8 KV, idle: 22.4 GB per card, ~2.2 GB headroom each.
What didn't help / didn't try:
TL;DR: ~68 TPS code / ~55 TPS narrative single-stream on dual 3090 PCIe at 131K context, fp8 KV, MTP n=3. Bare-metal Python venv, no Docker. The Marlin patch is required in principle for AutoRound on TP=2 even though this specific shard didn't trigger it.
TheOnlyBen2@reddit
Thanks a lot ! I will hopefully be able to give it a shot tomorrow night
sudeposutemizligi@reddit
🤘🤘
TheOnlyBen2@reddit
I believe you've already seen it, but we kinda got our response here : https://github.com/noonghunna/qwen36-dual-3090/pull/2#ref-issue-4330244211
sudeposutemizligi@reddit
oh, now i read this in the repo: "What this does NOT give you: Higher single-stream TPS. Single-stream narrative is ~68 TPS, vs single-card's ~66 TPS — basically flat. On Ampere PCIe-only (no NVLink), TP=2 allreduce overhead nearly cancels the memory-bandwidth doubling for batch=1 decode. If you only care about one-user-at-a-time chat, the single-card project is just as fast. TP=2's win is concurrent throughput, not per-request latency." https://github.com/danbedford/qwen36-dual-3090-nvlink
TheOnlyBen2@reddit
Yeah, I am kinda lost to be honest. I tried with both nvlink and without and saw no difference in terms of tokens per second. Sounds like it may be useful when serving multiple requests at the same time, but that's it.
sudeposutemizligi@reddit
strange.. they should have been more clear on such a very important difference.. we'll learn . see you bro🤘
TheOnlyBen2@reddit
I am still experimenting but short on time, I will let you know if I find something useful
sudeposutemizligi@reddit
will be grateful 🙏🤘 . there's no NVLink to buy in Türkiye. i think nvlink is an old-fashioned thing. but where are the people keeping them if unused.. if you can help me buy one somehow, i would really really appreciate it
TheOnlyBen2@reddit
Based on my last test, I really don't think you need one.
Just git clone this repo and follow indicated steps : https://github.com/noonghunna/club-3090
If launching the container fails with an error linked to the /opt/ai volume mappings, just rm -rf /opt/ai and do git clone -b marlin-pad-sub-tile-n https://github.com/noonghunna/vllm.git /opt/ai/vllm-src
I reach 120 tokens per second for coding and 100 tps narrative with the default dual compose file. NVLink made no difference.
sudeposutemizligi@reddit
ohh, great news then.. 120 is a great number. no need for an nvlink, you are right..🤘 thank you for your guidance 🙏 and, how are the long context runs with tool calling, have you tried? i am planning 132k ctx, and i am sure at 252k there will be problems remembering the first runs, and ctx pollution will cause false / empty tool callings
TheOnlyBen2@reddit
I am at the same point as you on this, wondering what the best context window would be.
I am also considering this project to get more out of the context window while keeping it small : https://github.com/juliusbrussee/caveman
Let me know if you find the sweet spot :)
sudeposutemizligi@reddit
exactly twice as fast. really strange. chatgpt was really sure it wouldn't be twice as fast 😄 could you also test with / without your nvlink?
AmazingDrivers4u@reddit (OP)
start from https://github.com/noonghunna/qwen36-27b-single-3090
or
https://github.com/noonghunna/qwen36-dual-3090
I'm keeping them up to date with help of community. they are separate for now but will eventually be merged in the coming days.
TheOnlyBen2@reddit
Awesome thank you
sudeposutemizligi@reddit
i gave these logs etc. prior to your patches. now I will tell codex or claude to implement it (through the vs code extension) and try to run your setup and send the logs again
AmazingDrivers4u@reddit (OP)
yeah, my PCIe 4.0 x16 bus gives only 64 GB/s whereas NVLink allows 112.5 GB/s bidirectional bandwidth between cards.
ttkciar@reddit
This post was reported for self-promotion, but upon review I am leaving it up.
Even though it is self-promotion and does link to an LLM-(re?)written article, it is also highly informative, novel, comprehensive, and on-topic for the sub.
That justifies keeping it around. We have our rules for good reasons, but it's also important to treat them with some flexibility.
PotaroMax@reddit
Good human.
Thanks we need this kind of experimentation
Visual_Acanthaceae32@reddit
That’s how moderation is supposed to be… You promoted yourself by excellently handling the situation! Thank you
gthing@reddit
I honestly don't even see how this is remotely self-promotion. Because it links to an article they wrote? They are not advertising anything obvious as far as I can see or asking you to even subscribe to their newsletter or something. It appears to be pure useful information sharing.
666666thats6sixes@reddit
I think the point is that medium pays the author based on traffic, so OP has financial gain from linking to it. They'll probably make several dozen cents.
marscarsrars@reddit
Gasps Several dozen cents
jazir55@reddit
I completely agree, if you link to any self-hosted website or even an external article, apparently even as reputable a blog as medium, you get lambasted for it as "self-promotion" because it isn't directly posted to reddit. It's honestly really weird.
marscarsrars@reddit
Never thought I'd live to see a mod like you in my life time.
AmazingDrivers4u@reddit (OP)
I apologize in case i broke any rules. Just wanted to share this with the community; I was too knackered to read up on the rules before posting. Thank you!
PermanentLiminality@reddit
Thank you for leaving this up. It's amazing.
sudeposutemizligi@reddit
not at all. all thanks to OP 🤘 and claude 😁
Fabulous_Fact_606@reddit
That is a long read. 85 TPS on a single 3090 is impressive.
whiteamphora@reddit
Unfortunately, as of the day you wrote that, it's not usable at all. For now, compromises must be made.
Fun-Marionberry-2540@reddit
Does this also help 4090 in the same way?
AmazingDrivers4u@reddit (OP)
theoretically it should but there is only one way to find out, test it.
Southern_Sun_2106@reddit
Alright, tbh, you knew that everyone would ask for that patch. Why not release it together with your piece? Otherwise, it reads as 'look what an awesome thing I've made, but it won't work without my patch, which I'll release later.' Without the patch, this makes it clickbait and self-promo. Also, whenever Medium is involved, it's a red flag for me.
AmazingDrivers4u@reddit (OP)
I have the file with me, but I realised it doesn't meet the repo's standards for a PR. I built it on the latest dev branch, whereas for a PR I need to prepare it against a stable release branch + test cases. I'd rather make it right and then share the link instead of pushing it out hastily.
Secondly, I've shared enough details about what exactly I've done in the patch, and if you feel confident you can always have a go at it yourself. I appreciate your patience.
i_wayyy_over_think@reddit
If you find yourself getting busy with other things and abandoning it, could you just push it to your own fork? I wouldn't mind working on a non stable branch.
Crafty-Confidence975@reddit
Was anyone able to get the cuda patch from them? Can’t duplicate without their patch_tolist_cudagraph.py which they say they’ll provide if requested.
AmazingDrivers4u@reddit (OP)
I need to prepare the patch on a clean branch in order to submit it as a PR. Please bear with me for a day and i'll post it.
andy2na@reddit
I got it running and seem to get 50-60 t/s on my 3090, but I have to enable --eager-mode for it to launch or it doesn't work
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
What will the patch you will be releasing do?
caetydid@reddit
python3: can't open file '/patches/patch_genesis_unified.py': [Errno 2] No such file or directory
I can't find this file in the provided sources. Could you kindly provide it or explain how you created it?
AmazingDrivers4u@reddit (OP)
check the git repo, link in the article.
Hodler-mane@reddit
thank you. if you can also give us links to everything we need (the exact nightly of the versions of things you are running) that would be appreciated!
AmazingDrivers4u@reddit (OP)
minus the patch, all details are already in the post. Look at the last line of the article.
ionizing@reddit
main thing that threw me off was anything docker related. any tips on doing your stack without docker?
AmazingDrivers4u@reddit (OP)
well, it's just an environment for your code. you can host it on bare metal, a vm, docker, a venv, anywhere. I've got like 15 inference engines that i keep segmented from each other via docker/venv. docker/venv is not mandatory, you should be able to set up your environment accordingly.
Crafty-Confidence975@reddit
Sounds good! Thanks for doing that.
sagiroth@reddit
Thank you. Let us know in the thread bud
andy2na@reddit
this is the patch that got it working for me, although not perfectly, can get up to 60t/s on my 3090:
patch_tolist_cudagraph.py
Crafty-Confidence975@reddit
Nice! With 125k context?
andy2na@reddit
125k got OOM errors during certain situations. Dropping it down to 100k works. Just FYI, this cudagraph patch is not official and there is a performance loss with it, so you'll have to wait for the official patch for the OP's claimed performance and possibly context window
Crafty-Confidence975@reddit
Got it thanks! Did you benchmark it on anything interesting?
andy2na@reddit
Certain tasks and prompts will hit 60 to 65 t/s, up from 30-35 with traditional methods on my 3090. So even without the full patch, there's a performance boost
Crafty-Confidence975@reddit
What about actual performance on tasks you want done? How capable do you find this setup?
andy2na@reddit
I mainly use my LLM for Frigate, Home Assistant voice assist, Karakeep, degoog, Sure finance, and n8n workflows. The gap versus qwen3.6-35b is definitely still noticeable, but it's noticeably faster than the standard qwen3.6-27B deployment.
Overall, if I'm not looking at prompt and generation speeds, it's much more usable now for general tasks
McSendo@reddit
Thanks for experimenting with this. I was going to, but I think I'm going to wait. One of our dev boxes has 2x3090 with p2p drivers. With the official FP8 model and MTP-3 no kv quant, it was already doing 80 to 100 t/s with 130k context filled for single thread workloads, multithreads (3) around 200.
It is tempting though for the larger kv cache.
AmazingDrivers4u@reddit (OP)
go grab it from git, it's now available there.
andy2na@reddit
seems that I still have to add - --enforce-eager under - qwen3_coder to fix the nonstop repeats, but that drops output to 60 t/s because eager disables cudagraphs. can you look into this? Thanks for your work!
Misio@reddit
I had the same problem
I updated 3bit to 4bit
didn't fix it
seems to relate to thinking mode on short prompts?
Misio@reddit
No, not related to thinking.
Misio@reddit
add --override-generation-config '{"temperature":0.7,"top_p":0.8,"top_k":20}' to the vLLM command and pass presence_penalty: 1.5 per-request from your client. With those params: 20/20 clean on bare "hello" with a system prompt. Without them: ~10-30% degenerate depending on prompt length.
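"Per-request" here means it goes in the request body rather than the launch command. With the OpenAI-compatible client that looks roughly like this (endpoint and model name are just my local setup, adjust to yours):
```
from openai import OpenAI

# vLLM exposes an OpenAI-compatible server, so presence_penalty can be sent
# per request instead of being baked into the server config.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b-autoround",       # whatever --served-model-name you launched with
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,                # the per-request part of the fix
    extra_body={"top_k": 20},            # top_k isn't in the OpenAI schema, vLLM reads it from extra_body
)
print(resp.choices[0].message.content)
```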
andy2na@reddit
i think it's related to what is on git: https://github.com/noonghunna/qwen36-27b-single-3090
https://github.com/noonghunna/qwen36-27b-single-3090#known-issue-tool-calling--mtp--turboquant-kv
Misio@reddit
Duh, I should learn to read. Thank you!
andy2na@reddit
yeah, I just found this git since the cudapatch was linked to it this morning. So unless there's a fix for MTP + TQ with tool calling, you'll have to either cut your performance a bit or not use tool calling. 50-60 t/s is pretty decent for 27B dense, will continue testing
Misio@reddit
with tool calling
andy2na@reddit
I switched TQ3 to fp8 caching and got tooling back with vision support, but max 65k context at up to 80t/s. If you disable vision, you can get up to 75k
What are your parameters for that?
andy2na@reddit
even with thinking mode
nbvehrfr@reddit
something with chat template
Misio@reddit
This is brilliant, thanks
```
$ ./scripts/bench.sh
=== Warmup (3x) ===
w1 comp=1000 wall=11.32s 88.34 TPS
w2 comp=1000 wall= 9.61s 104.06 TPS
w3 comp=1000 wall= 9.27s 107.87 TPS
=== Narrative (3x, 1000 tok) ===
narr1 comp=1000 wall= 9.62s 103.95 TPS
narr2 comp=1000 wall=11.89s 84.10 TPS
narr3 comp=1000 wall=10.94s 91.41 TPS
=== Code (2x, 800 tok) ===
code1 comp=800 wall= 7.45s 107.38 TPS
code2 comp=800 wall=12.63s 63.34 TPS
=== GPU state ===
0, 98 %, 22014 MiB, 24576 MiB, 388.35 W, 68
=== Last 3 SpecDecoding metrics (MTP accept) ===
(APIServer pid=1) INFO 04-24 17:09:25 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.04, Accepted throughput: 56.79 tokens/s, Drafted throughput: 83.69 tokens/s, Accepted: 568 tokens, Drafted: 837 tokens, Per-position acceptance rate: 0.975, 0.964, 0.097, Avg Draft acceptance rate: 67.9%
(APIServer pid=1) INFO 04-24 17:09:35 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.24, Accepted throughput: 62.40 tokens/s, Drafted throughput: 83.39 tokens/s, Accepted: 624 tokens, Drafted: 834 tokens, Per-position acceptance rate: 0.996, 0.989, 0.259, Avg Draft acceptance rate: 74.8%
(APIServer pid=1) INFO 04-24 17:09:45 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.42, Accepted throughput: 67.90 tokens/s, Drafted throughput: 84.00 tokens/s, Accepted: 679 tokens, Drafted: 840 tokens, Per-position acceptance rate: 0.954, 0.761, 0.711, Avg Draft acceptance rate: 80.8%
```
andy2na@reddit
Thanks!
I tried the new patch and I definitely do get over 80t/s gen but output seems to be bugged and repeating itself, any idea how to fix this?
AmazingDrivers4u@reddit (OP)
Patch is now available in git. links updated in the article.
ShengrenR@reddit
Somebody posted a comment on the medium article asking where it was; author replied
Webster2026@reddit
It can actually run with decent speed even on old Mac with M1 processor: https://youtu.be/NNOq3T26MIQ
Equivalent-Home-223@reddit
This is fantastic. I posted a question on the post: somehow when I set MTP to >1 it goes into an infinite loop without returning a response to the client side.
I created a custom docker using the vllm base image you use, applied the patches within the docker, and ran as per below
Splinter2121@reddit
Getting ~43 narrative / ~54 code TPS at 330W on a single RTX 3090 with fp8 KV + MTP n=3. Reference setup (identical config, same GPU model) claims 66/84 TPS. MTP acceptance rates are comparable or better (93/87/74% vs 92/81/64%), but base decode throughput is ~20 TPS lower. Looking for ideas on what's causing the gap.
Full Configuration
Docker Image
vLLM Launch Args
Environment Variables
Patches (applied before vLLM start)
.tolist() calls in turboquant_attn.py wrapped with torch.cuda.is_current_stream_capturing() guards so CUDA graph capture doesn't crash
Model
Runtime Details
Key Warning
Benchmark Results
At 330W Power Cap (after 3 warmup rounds)
At 230W Power Cap (stock, after warmup)
MTP SpecDecoding Metrics (330W, warm)
Reference Comparison
Things I've Already Checked
- Using MarlinLinearKernel for GPTQMarlinLinearMethod
- Detected MTP model. Sharing target model embedding/lm_head weights with the draft model.
- mtp.fc.weight present as BF16 (not quantized)
Potential Causes I'm Unsure About
- /root/.cache/vllm/torch_compile_cache/ is not mounted as a volume, so it rebuilds on restart. Could this affect warm-run performance?
- /root/.cache/huggingface/vllm-qwen36-27b-int4 rather than the HuggingFace repo name Lorbus/Qwen3.6-27B-int4-AutoRound. Could this affect any auto-configuration?
- max_num_batched_tokens=2048 — vLLM warns this is suboptimal with spec-decode. The reference uses the same value but could there be a better setting?
Docker Compose (Complete)
mgxts@reddit
This is really cool. I get around 65–70 tokens/sec on an RTX 5090 in LM Studio on a comparable GGUF model (Unsloth/Qwen3.6-27B-UD-Q4_K_XL). My llama.cpp build in WSL2 Ubuntu was still slower than LM Studio even though it was compiled for my setup + TurboQuant + community recommended configuration.
This is the first time I have tested vLLM. The base Qwen3.6-27B-int4-AutoRound gives me about 90 tokens/sec. With the patches enabled the max I have reached so far is around 135 tokens/sec. I have had to disable TurboQuant though as it does not work on the 5090 and the model gets stuck repeating the same token.
nbvehrfr@reddit
which cache quant did you use to fit it on the 5090?
mgxts@reddit
I should have mentioned I could not fit 125k without TurboQuant. The test setup I got working was FP8 KV cache with a 51.2k max context and MTP 3. Not sure if it is the max, was just the value I set when I lowered it. Hopefully there is some way to get TQ working.
gthing@reddit
I'm still waiting for everything to download from huggingface so I haven't tested this yet, but here is my effort to replicate the patch_tolist_cudagraph.py based on the description in the article:
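The gist, as I understand it, is just wrapping the offending .tolist() calls so they don't force a device-to-host copy while a CUDA graph is being captured. Something in this direction (the function name is my guess, not the real repo's; the actual patch in turboquant_attn.py presumably does this at each call site):
```
import torch

def graph_safe_tolist(tensor):
    # .tolist() forces a GPU -> CPU copy, which is exactly what raises
    # "Cannot copy between CPU and CUDA tensors during CUDA graph capture".
    # While capture is active, keep the data on-device and let the caller
    # work with the tensor instead of a Python list.
    if torch.cuda.is_current_stream_capturing():
        return tensor
    return tensor.tolist()
```
No idea yet whether the callers also need changes to accept a tensor instead of a list.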
andy2na@reddit
Did this patch get you the 85 t/s? The one I'm using is only getting me 60ish and I have to enable --enforce-eager, which disables cudagraphs
gthing@reddit
I don't know I kept getting bad checksums when downloading the model and gave up for today.
RealestNagaEver@reddit
I think the checksum command he provided in the article might be using the wrong algorithm; it didn't work for me either until I used sha256
gthing@reddit
Ah thank you!
jimmytoan@reddit
85 TPS on a single 3090 for 27B with 125K context would be well above what most people report - most single-3090 runs at 27B are in the 40-60 TPS range at shorter context. Is the 85 TPS measured on the decode (generation) phase or prefill? Prefill throughput on long sequences is always higher because it parallelizes across the input, but decode rate is what determines how fast the response feels interactively. Also curious how much quality degradation you see at the 125K context end vs 16-32K - long context coherence usually starts dropping before the max window.
AdamDhahabi@reddit
Waiting for MTP to land in llama.cpp so that I can run Q8_0 at high speed on a multi-GPU build with consumer mainboard.
Cold_Tree190@reddit
What is MTP? I only have a single 3090 so it doesn’t sound like it will be of use to me right now, but I have been thinking about building a dedicated multi gpu server at some point
FatheredPuma81@reddit
The TLDR of what it does is: Imagine a draft model but baked into the model and much more accurate.
TLDR for draft model / speculative decoding (for those who might not know): you have a tiny model that predicts the large model's output, and because of wizardry, confirming/denying the prediction is way faster than generating it.
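If you want the wizardry spelled out a bit: the check happens in one batched pass of the big model over all the drafted tokens at once, instead of one pass per token, and rejected guesses fall back to the big model's own token so the output doesn't change. A toy sketch (nothing like real vLLM code, the "models" here are fake functions):
```
import random

def target_next(seq):
    # toy stand-in for the big model: deterministic next token given the sequence
    return (sum(seq) * 31 + len(seq)) % 50

def draft_next(seq):
    # toy stand-in for the draft/MTP head: agrees with the big model ~80% of the time
    return target_next(seq) if random.random() < 0.8 else random.randrange(50)

def speculative_step(seq, k=3):
    # 1. cheap model guesses k tokens ahead
    draft = []
    for _ in range(k):
        draft.append(draft_next(seq + draft))
    # 2. big model checks all k positions; in a real engine this is ONE batched
    #    forward pass, which is where the speedup comes from
    verified = [target_next(seq + draft[:i]) for i in range(k)]
    # 3. keep drafted tokens up to the first disagreement, then append the big
    #    model's own token, so the result is identical to normal decoding
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(target_next(seq + accepted))
    return accepted

seq = [1, 2, 3]
for _ in range(5):
    step = speculative_step(seq)
    print(f"emitted {len(step)} token(s) in one verify pass: {step}")
    seq += step
```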
Glittering-Call8746@reddit
How does it "bake" into the model? Useful for Ampere only? Or can it be applied to a 5070 Ti?
HyperWinX@reddit
Iirc, it's not really a "small model baked into a big model". It just uses the output of the first layers of the big model.
FatheredPuma81@reddit
By there being tensors inside the model (that GGUFs automatically remove) dedicated specifically for token prediction. As the OP says in their crazy long post, it was like +200MB at 4-bit, which is tiny.
AFAIK it's usable on any card, but I wouldn't worry about it even if it wasn't. Flash Attention (as an example) isn't supported on old cards like the V100, yet the community has backported it in llama.cpp.
HyperWinX@reddit
Multi-Token Prediction.
Far-Low-4705@reddit
tensor parallelism is also gonna help you get more performance.
it just landed in llama.cpp a few weeks ago, but it is very unstable and only really works-ish on nvidia cards
FatheredPuma81@reddit
Oh, I gotta ask you then lol. The TLDR is a guy replied to a comment of mine and said that inference time would be horrible on MI50s and that it isn't the t/s that's the issue. Was curious what you make of it? Is inference really that slow on those cards? Cause I'm suddenly really interested, seeing 20-24 t/s on 27B (meaning 35B should be like 80 t/s?).
Oh, and have you tried using parallel slots and subagents in Opencode to improve performance? On an RTX 4090 I've found that usually gives me 50%+ more t/s.
AdamDhahabi@reddit
I tried and only got a speed regression, probably because of the PCIe 4.0 x4 interconnect on my consumer mainboard.
Far-Low-4705@reddit
it is still super early, i'd give it time. it is still super buggy, and they only show gains on much older dense models like qwen3 and gemma 3.
FatheredPuma81@reddit
I've seen ggerganov say they need to work out how they want to implement MTP like a dozen times since Qwen3.5 dropped... so I don't think it's coming any time soon.
YourNightmar31@reddit
I intend to follow this guide this weekend, but what's the prompt processing speed like at 100K context?
realmosai@reddit
I don't mean to sound rude. But I don't quite grasp the point of the article. You loaded and ran an AI Model - is that something to write a whole blog post for?
And you used AI to write the blog post. Of course it's unnecessarily long for what it describes. It could have easily been just four bash commands and a docker compose file, and that's the gist of it.
gthing@reddit
I don't mean to sound rude. But I don't quite grasp the point of this comment. You either didn't read it or didn't understand it - is that something to write a whole comment for?
realmosai@reddit
I shouldn't have written the comment; I see that sometimes ignoring is the better option. And you're right, I tried to put it forward softly, but in vain. Unlike OP, I like to get to the point.
The article's pure AI slop and engagement bait. There, that saves a lot of tokens trying to explain myself.
Crafty-Confidence975@reddit
But it’s not. Getting that model running at 85 TPS with 125k context on a single 3090 RTX is of great interest to many on here. It’s probably the absolute best coding model in this weight class right now and is almost as good as frontier stuff from last year.
The numbers out of the box are nowhere near 85 TPS so this is very valuable. No one cares if it’s written by AI if the results are duplicatable.
realmosai@reddit
"If the results are reproducible" is the key differentiator. AFAIK these are just claims.
Crafty-Confidence975@reddit
We’ll see about the patch but the rest of the article is correct so far. It’s already a 3x improvement without the patch at that context length.
It’s not a mere claim to give exact step by step instructions. Go try it yourself and you’ll see. It’s exactly the sort of content I think we need more of on here.
realmosai@reddit
A 3x improvement would mean that either you were doing something VERY VERY wrong when running the model, or OP deserves global recognition.
Is your measure of the article's correctness coming from Claude reading it and telling you it's correct? Because the results I saw on this post from others were 40 tk/s, a long way off from the claimed 85 t/s from the missing magical patch.
I'm sorry, but these claims, the post title, the fact that none of the post content was put on Reddit, and the rewritten and updated contents of the article do NOT pass my (and some other equally keen-eyed redditors') AI-slop marketing spam filters.
Crafty-Confidence975@reddit
Again did you run it yourself? I was being conservative at 3x. Naive implementations for the 27b model are at 10 TPS or so on a 3090 RTX. I just happen to have a bunch of nodes with those cards and am seeing good results from the model in general so I took interest in the post.
realmosai@reddit
What do you term as a naive implementation? You are speaking in very vague terms.
This article is for naive people. That's true.
As for running 27b q6kxl UD quants, yes, I do run them. My not-so-naive llama build gets me 45+ t/s on a Pro card. That is in line with the results others are seeing. What do I need this article for, again?
Crafty-Confidence975@reddit
Sure and I can push much higher numbers on a $1.45/hr H200 in the cloud… that’s not what the article is about. It’s a specific card which a lot of hobbyists bought a couple years ago. Why not engage with what he’s talking about instead of making it about yourself?
realmosai@reddit
Engage with what? A 20 minute article to get what? Exactly.
I ain't renting my pro card, it's in my computer, if that's what you meant by your cloud supposition.
What's the point of the article, can you share it? I don't think so, because there isn't any. Cut to the chase. You can't, there isn't one. That's why it's a 20-min read. There are better things to read for 20 minutes of my time.
And by the same logic, have a good day.
llitz@reddit
Even if that's the case, it describes exactly what he went through and how he fixed it.
That alone is enough of a contribution; it explains several internals of how things work and what people need to pay attention to. I see that as an actual contribution and, judging by the upvotes, others do too.
realmosai@reddit
The only contribution in the post was made by Claude. Including the debugging.
MR_-_501@reddit
They patched inference frameworks to properly make use of the functionality the model provides; how is that only "loaded and ran an AI Model"?
realmosai@reddit
where's the patch?
AmazingDrivers4u@reddit (OP)
I'll ask Claude to be mindful next time. :P
koljanos@reddit
Apparently hosting models isn't as easy as a simple ollama run
BitGreen1270@reddit
This is very interesting. I don't fully understand everything, but theoretically, can this be applicable to any GPU? I have a 780M iGPU and 32GB RAM and am getting about 20t/s with gemma4-26B-A4B and around the same with Qwen3.6-35B-A3B. Do you think I can replicate some of the steps you describe in your post to seriously boost my tps?
caetydid@reddit
hoooly shite! Why am I still running at 50 tps on an RTX 5090?
corpo_monkey@reddit
Sell that crap and buy a real one!
3090 masterrace
i_wish_i_was_perez@reddit
Yes, but sell it to me
MisticRain69@reddit
Nah don't sell it to him trade with me you save a step
caetydid@reddit
thanks for the offers but sadly my employer owns it
wowsers7@reddit
Has anyone run Qwen 3.6 27b on Intel Arc Pro B70? I’m curious about the performance.
robertpro01@reddit
Tldr?
FatheredPuma81@reddit
vroom vroom goes model
bgeneto@reddit
The issue with this approach is the process boundary:
python3 /patches/patch_tolist_cudagraph.py patches only that short-lived Python process. After it exits, exec vllm serve ... starts a different Python interpreter, so the monkey patch is gone. We have to use sitecustomize.py. That makes Python automatically import the monkey patch (patch_tolist_cudagraph.py) every time a Python interpreter starts, including the vllm serve process and its worker subprocesses.
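A minimal sketch of what that sitecustomize.py could look like (the /patches path and module name are just the ones used earlier in this thread; put the file anywhere on the interpreter's default sys.path, e.g. the venv's site-packages):
```
# sitecustomize.py
# Python's site module imports this automatically at interpreter startup,
# so the monkey patch gets applied in the `vllm serve` process and in every
# worker subprocess it spawns, not just in a one-shot `python3 patch.py` run.
import sys

sys.path.insert(0, "/patches")  # directory containing patch_tolist_cudagraph.py

try:
    import patch_tolist_cudagraph  # importing the module applies the monkey patch
except Exception as exc:
    # never block interpreter startup if the patch is missing or broken
    print(f"sitecustomize: patch not applied: {exc}", file=sys.stderr)
```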
I don’t know what any of those words in the article mean, but I felt like I did when I was reading it
anthonyg45157@reddit
RemindMe 3 days "check this out again"
edsonmedina@reddit
Meanwhile I'm getting 7 tok/s on Strix Halo 🥲
sabotage3d@reddit
Is it possible to get MTP (Multi-Token Prediction) working in llama.cpp yet? I’ve successfully managed to get TurboQuant running via an experimental branch, but I haven't seen an implementation for MTP. Are there any specific branches or PRs I should be looking at?
Important_Quote_1180@reddit
This is exactly what I needed! Thank you
No-Marionberry-772@reddit
I wish this wasn't beyond me. Experienced developer, but I'm weak with C++ , python, and getting into these tools. I don't currently have the hardware, but I'm really wanting to make the switch to local, getting tired of cloud providers. If I can make the switch and buy a 3090 instead of a 5090, that would be amazing. I know I just have to wait, but these numbers never seem to hit the main stream tooling it seems like.
zhileiz@reddit
Thx. This is probably the best piece of writing I've seen in a while. I wonder if you'd do a follow-up on the model performance (real-world experience) under that configuration, e.g. opencode / openclaw experience etc.
Ok-Measurement-1575@reddit
It's pure slop but given the outcome, I think we can look the other way :D
hainesk@reddit
Total slop, it’s hard to read…
AmazingDrivers4u@reddit (OP)
As for future posts, i hope i'm able to discover more cool optimisations in order to keep going. will try!
Ok-Measurement-1575@reddit
Post a cock teaser paragraph at least next time, ideally.
Great work.
AmazingDrivers4u@reddit (OP)
hahaha!
AmazingDrivers4u@reddit (OP)
Thank you! Glad you liked it.
cviperr33@reddit
Posting here so i can try later , thanks for the info
GregoryfromtheHood@reddit
What about prompt processing? More token generation speed is always nice, but prompt processing speed is in my opinion even more important for real world use.
edankwan@reddit
I have a 3090 + 3090 Ti running Q8 + Q8 k/v with 131072 context window. Only 26t/s
weird_ed@reddit
I have the same setup. With ik_llama, -sm graph and p2p I get >50t/s on Q5 and 500k context window on Q8 k/v.
AmazingDrivers4u@reddit (OP)
throw away the second one, only focus on the first one. =P
Ok-Measurement-1575@reddit
I didn't see the exact vllm version? I've abandoned vllm 0.19 for qwen 3.5/3.6 as the actual outputs are subpar compared to llama.cpp.
Maybe some things got fixed now.
EveningIncrease7579@reddit
This is really insane for me; i'm getting 30~40 tk/s (llama.cpp unsloth q4 or q5 depending). Could you share a docker image compiled with your own modifications? i really want to test it!
AmazingDrivers4u@reddit (OP)
bear with me for a day, its the patch i need to release. the rest of the stuff is already documented and shared.
EveningIncrease7579@reddit
Dont worry, you are awesome!
koljanos@reddit
Awesome read. can you please tell me if I can push everything further by utilizing two 3090s with NVLink? Will using a less quantized model help?
AmazingDrivers4u@reddit (OP)
having nvlink certainly helps, but there is only one way to find out, test it. inference engines are getting updated all the time and sometimes things do break.
koljanos@reddit
Thanks bro, you did a great job!
caetydid@reddit
Please post a gist or git repo! I think some sources are missing
AmazingDrivers4u@reddit (OP)
will look into it, I've never used gist before.
El-Dixon@reddit
Wow! The work, the writing, the results... chef kiss. Thank you!
AmazingDrivers4u@reddit (OP)
oh thank you! Claude is a good friend. =)
sagiroth@reddit
Please share the files and fix with us
AmazingDrivers4u@reddit (OP)
will do!
xrvz@reddit
I'm not reading some shitty medium post. Huge red flag. At least put it on github gist.
Gold-Debt-5957@reddit
I just read it, and how interesting! My question: how will it run on my 64 GB Mac M1 Max? Expanding the context window would have been great! The patch they implemented is only temporary as it stands, but a big improvement. We have to be patient and see how it evolves, and whether there's any news. Keep me updated.