An Overnight Stack for Qwen3.6–27B: 85 TPS, 125K Context, Vision — on One RTX 3090 | by Wasif Basharat | Apr, 2026
Posted by AmazingDrivers4u@reddit | LocalLLaMA | View on Reddit | 168 comments
Hey guys! I hope this helps everyone.
TheOnlyBen2@reddit
Thanks a lot, very useful !
Question: Given you have 2 RTX 3090s, why did you choose to optimize for one? Wouldn't you gain TPS with tensor parallelism?
AmazingDrivers4u@reddit (OP)
this is the first time i'm running 3090s, just got hold of them a couple of weeks ago. started from scratch. if one gpu isn't running optimally, the whole cluster will be impacted. I've since moved on to setting up two gpus. more on that incoming.
TheOnlyBen2@reddit
Great, thank you for your answer. Looking forward to your two-GPU optimizations.
Do you consider investing in an NVLink? Depending on your motherboard, PCIe performance can really bottleneck tensor parallelism. I had to buy one because of this.
AmazingDrivers4u@reddit (OP)
i rushed to buy two cards and ended up buying two different brands and now can't connect them with an nvlink. doh! i'll try to grab one in due course.
TheOnlyBen2@reddit
If it can make you feel any better, I did the exact same thing lol. I had to sell an MSI to buy another FE.
I suggest not waiting too long to start looking for a NVLINK, because they are really hard to come by nowadays
sudeposutemizligi@reddit
did you see much performance increase with nvlink? tps-wise, not serving capacity. i read somewhere that inference is compute heavy, not bandwidth heavy, and that's why nvlink doesn't make models twice as fast. i actually don't know how those things work though
TheOnlyBen2@reddit
As soon as you want to split a model between two cards, you need speed.
In some cases your PCIe layout makes it so that you don't need NVLink, because both cards run x16 and can do P2P without going through the CPU.
In other cases you end up with the two GPUs running x8 and having to go through the CPU, a huge bottleneck.
So there is no simple yes or no answer on whether you need NVLink for inference; it depends on your motherboard.
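If you want a quick sanity check before spending money, torch can tell you whether the two cards can see each other directly. This is just a capability probe (it says nothing about link speed, so still run a copy benchmark like the one further down the thread):
```
import torch

# Probe whether GPU 0 and GPU 1 can do direct peer-to-peer copies.
# True means P2P is possible over your PCIe layout; it does NOT tell
# you the bandwidth, only that the path exists without bouncing via RAM.
if torch.cuda.device_count() >= 2:
    ok = torch.cuda.can_device_access_peer(0, 1)
    print(f"P2P GPU0 <-> GPU1 possible: {ok}")
else:
    print("Fewer than two CUDA devices visible")
```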
sudeposutemizligi@reddit
i have an x99 server board where one of the rtx 3090s is x16 and the other is x8 electrically, and a p2p test gave 3.8 GB/s. too slow. but my regular speed for qwen3.6 27b is around 47 tps, without any hacks.. i don't think nvlink would magically make it 94 tps. consumer boards all have a PHB topology as far as i can see.. no PLX chip, that's why no good p2p. but i really don't know if i would get that much more speed with 112 GB/s between gpus, compared to 3.8 GB/s
TheOnlyBen2@reddit
I have yet to try Qwen 3.6 27b, if you give me the parameters you used, I can try and see how it compares. I can also just unplug the NVLINK I guess
sudeposutemizligi@reddit
i would really be very grateful if you can. because i really don't know exactly what i am doing. read->ask codex->apply then ask reddit 😊 that's my knowledge life cycle as an amateur.. I will try to paste all my versions, smi, p2p test output etc. not to mislead.. thank youu🤘🤘
TheOnlyBen2@reddit
No problem, I am curious as well :)
sudeposutemizligi@reddit
ok, starting 😄 i am sorry for the nonsense pastes from now on 😄
this is my nvidia-smi:
```
| NVIDIA-SMI 595.45.04              Driver Version: 595.45.04      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:02:00.0 Off |                  N/A |
|  0%   37C    P8              4W /  220W |   23600MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:03:00.0 Off |                  N/A |
|  0%   36C    P8             11W /  220W |   23718MiB /  24576MiB |      0%      Default |
```
************************** these are my vllm / torch / transformers versions *********
```
$ pip show vllm | grep -i version
Version: 0.19.2rc1.dev45+g3461c8b02
$ pip show torch | grep -i version
Version: 2.11.0
$ pip show transformers | grep -i version
Version: 5.6.2
```
************** this is the vllm launch command **************
```
(vllmn) ozgur@X99-8D4-2-5G-Server:~/venvs$ vllm serve \
  /media/ozgur/463A5DFB3A5DE907/Users/AICM/Desktop/model/models--Lorbus--Qwen3.6-27B-int4-AutoRound/snapshots/c3aea2d531678621989e5e2db034e32b22536e79 \
  --served-model-name qwen3.6-27b-autoround \
  --quantization auto_round \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --disable-custom-all-reduce \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8_e5m2 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --host 0.0.0.0 \
  --port 8000
```
********** this is my python p2p testing command: **********
```
python3 -c "
import torch, time

# 256 MiB of float32 data (4 bytes per element) on GPU 0, copied to GPU 1 ten times
size = 256 * 1024 * 1024 // 4
x = torch.randn(size, device='cuda:0')
y = torch.empty(size, device='cuda:1')
torch.cuda.synchronize()

start = time.time()
for _ in range(10):
    y.copy_(x)          # device-to-device copy (P2P if the platform allows it)
torch.cuda.synchronize()
elapsed = time.time() - start

# 256 MiB per copy * 10 copies, converted to GB/s
bw = (256 * 10) / elapsed / 1024
print(f'P2P Bandwidth: {bw:.1f} GB/s')"
```
P2P Bandwidth: 3.8 GB/s
========================================
this is my topology:
```
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-5,12-17       0               N/A
GPU1    PHB      X      0-5,12-17       0               N/A

Legend:
  X    = Self
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
```
https://sharetext.io/duig39h8 (this is the vllm log of what happened through my chat and tool use in the Chatbox UI with 5 tools)
https://sharetext.io/rdjewm9h (this is the whole chat with the model, including model errors)
again, i am terribly sorry for any nonsense 😄
sudeposutemizligi@reddit
LAST RESULT AFTER THE PATCH ... (WITH CLAUDE OF COURSE)
Qwen3.6-27B-int4 on dual RTX 3090, no Docker, vLLM nightly + PR #40361 patch
Hardware: 2× RTX 3090 (PCIe-only, no NVLink), Driver 595.45.04, Ubuntu
Stack: vLLM 0.19.2rc1.dev45+g3461c8b02 (nightly), torch 2.11.0+cu130, Python 3.12, fp8_e5m2 KV cache, TP=2, model Lorbus/Qwen3.6-27B-int4-AutoRound
Patch applied: vllm#40361 (Marlin pad-sub-tile-n) ported to vLLM's refactored kernel layout. Patch loaded but didn't fire on this model — shards happened to be tile-aligned. Kept it in for safety.
Install method: clean uv venv + nightly wheel + patch -p1 against the two affected .py files in site-packages. Zero CUDA recompilation needed, patch is pure Python.
Power cap: stock 220W per card.
Settings: --max-model-len 131072 --max-num-seqs 8 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --disable-custom-all-reduce, NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0
Single-stream TPS results (1000 / 800 token outputs, 3 runs each, /no_think):
MTP acceptance rates:
Why n=1 underperformed: Lorbus's quant has mtp_num_hidden_layers=1. n=3 chains the same single layer 3 times, so position-2/3 acceptance collapses on prose, but on code the structure holds up well enough that 3-pass savings still win on aggregate.
Verdict: MTP n=3 is the right setting. Code gain is real (+30% over baseline). Narrative gain on this single-MTP-layer model is marginal at best — the repo's 71 TPS narrative claim assumes Genesis v7.14 + TurboQuant, not stock fp8.
Memory at 131K, fp8 KV, idle: ~22 GB per card. ~1.5 GB headroom each.
What didn't help / didn't try:
TL;DR: ~70 TPS code / ~53 TPS narrative single-stream on dual 3090 PCIe at 131K context, fp8 KV, MTP n=3. Bare-metal Python venv, no Docker. The Marlin patch is required in principle for AutoRound on TP=2 even though this specific shard didn't trigger it.
sudeposutemizligi@reddit
Final numbers locked in. Updated Reddit post — same template, real data:
Qwen3.6-27B-int4 on dual RTX 3090, no Docker, vLLM nightly + PR #40361 patch second test with mtp 3
Hardware: 2× RTX 3090 (PCIe-only, no NVLink), 220W stock cap, Driver 595.45.04
Stack: vLLM 0.19.2rc1.dev45+g3461c8b02 (nightly), torch 2.11.0+cu130, Python 3.12, fp8_e5m2 KV cache, TP=2, model Lorbus/Qwen3.6-27B-int4-AutoRound
Patch applied: vllm#40361 (Marlin pad-sub-tile-n) ported to vLLM's refactored kernel layout. Patch loaded but didn't fire on this model — shards happened to be tile-aligned. Kept it in for safety.
Install method: clean uv venv + nightly wheel + patch -p1 against the two affected .py files in site-packages. Zero CUDA recompilation needed, patch is pure Python.
Settings: --max-model-len 131072 --max-num-seqs 8 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --disable-custom-all-reduce, NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0
Single-stream TPS results (1000 / 800 token outputs, /no_think, warmed):
MTP acceptance (n=3):
Why n=1 underperformed: Lorbus's quant has mtp_num_hidden_layers=1. n=3 chains the same single layer 3 times, so position-2/3 acceptance collapses on prose, but on code the structure holds up well enough that 3-pass savings still win on aggregate. n=1 has 90%+ acceptance but only saves 1 pass per accept — net loss vs n=3.
Verdict: MTP n=3 is the right setting. Code gain is real (+30% over baseline). Narrative gain on this single-MTP-layer model is marginal — the repo's 71 TPS narrative claim assumes Genesis v7.14 + TurboQuant, not stock fp8.
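A rough back-of-envelope of that trade-off, with illustrative acceptance rates only (roughly in the ballpark of the SpecDecoding metrics vLLM logs; the engine's "Mean acceptance length" is computed the same way):
```
# Expected output tokens per target-model verify pass for different MTP depths.
# Each pass always yields 1 token from the target model; every accepted draft
# token adds one more, and acceptance stops at the first rejected position.
def tokens_per_pass(per_position_acceptance):
    expected, chain = 1.0, 1.0
    for p in per_position_acceptance:
        chain *= p
        expected += chain
    return expected

print("n=1, 92% acceptance :", tokens_per_pass([0.92]))              # ~1.9 tokens/pass
print("n=3, prose-ish rates:", tokens_per_pass([0.95, 0.80, 0.30]))  # ~2.9 tokens/pass
print("n=3, code-ish rates :", tokens_per_pass([0.97, 0.90, 0.70]))  # ~3.5 tokens/pass
```
So even with collapsing position-2/3 acceptance, n=3 still moves more tokens per expensive pass than n=1, which is why it wins on aggregate.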
Memory at 131K, fp8 KV, idle: 22.4 GB per card, ~2.2 GB headroom each.
What didn't help / didn't try:
TL;DR: ~68 TPS code / ~55 TPS narrative single-stream on dual 3090 PCIe at 131K context, fp8 KV, MTP n=3. Bare-metal Python venv, no Docker. The Marlin patch is required in principle for AutoRound on TP=2 even though this specific shard didn't trigger it.
TheOnlyBen2@reddit
Thanks a lot ! I will hopefully be able to give it a shot tomorrow night
sudeposutemizligi@reddit
🤘🤘
TheOnlyBen2@reddit
I believe you've already seen it, but we kinda got our response here : https://github.com/noonghunna/qwen36-dual-3090/pull/2#ref-issue-4330244211
sudeposutemizligi@reddit
oh, now i read this in the repo: "What this does NOT give you: Higher single-stream TPS. Single-stream narrative is ~68 TPS, vs single-card's ~66 TPS — basically flat. On Ampere PCIe-only (no NVLink), TP=2 allreduce overhead nearly cancels the memory-bandwidth doubling for batch=1 decode. If you only care about one-user-at-a-time chat, the single-card project is just as fast. TP=2's win is concurrent throughput, not per-request latency." https://github.com/danbedford/qwen36-dual-3090-nvlink
TheOnlyBen2@reddit
Yeah, I am kinda lost to be honest. I tried with both nvlink and without and saw no difference in terms of tokens per second. Sounds like it may be useful when serving multiple requests at the same time, but that's it.
sudeposutemizligi@reddit
strange.. they should have been more clear on such a very important difference.. we'll learn . see you bro🤘
TheOnlyBen2@reddit
I am still experimenting but short on time, I will let you know if I find something useful
sudeposutemizligi@reddit
will be grateful 🙏🤘 . there's no NVLink to buy in Türkiye. i think nvlink is an old-fashioned thing. but where are the people keeping them if unused.. if you can help me buy one somehow, i would really really appreciate it
TheOnlyBen2@reddit
Based on my last test, I really don't think you need one.
Just git clone this repo and follow indicated steps : https://github.com/noonghunna/club-3090
If launching the container fails with an error linked to the /opt/ai volume mappings, just rm -rf /opt/ai and do git clone -b marlin-pad-sub-tile-n https://github.com/noonghunna/vllm.git /opt/ai/vllm-src
I reach 120 tokens per second for coding and 100 tps narrative with the default dual compose file. NVLink made no difference.
sudeposutemizligi@reddit
ohh, great news then.. 120 is a great number. no need for an nvlink, you are right..🤘 thank you for your guidance 🙏 and, how are the long context runs with tool calling, have you tried? i am planning 132k ctx, and i am sure at 252k there will be problems remembering the first runs, and ctx pollution will cause false / empty tool callings
TheOnlyBen2@reddit
I am at the same point as you on this, wondering what the best context window would be.
I am also considering this project to get more out of the context window while keeping it small : https://github.com/juliusbrussee/caveman
Let me know if you find the sweet spot :)
sudeposutemizligi@reddit
exactly twice as fast. really strange. chatgpt was really sure it wouldn't be twice as fast 😄 could you also test with / without your nvlink?
AmazingDrivers4u@reddit (OP)
start from https://github.com/noonghunna/qwen36-27b-single-3090
or
https://github.com/noonghunna/qwen36-dual-3090
I'm keeping them up to date with help of community. they are separate for now but will eventually be merged in the coming days.
TheOnlyBen2@reddit
Awesome thank you
sudeposutemizligi@reddit
i gave these logs etc. prior to your patches. now I will tell codex or claude to implement it (through the vs code extension) and try to run your setup and send the logs again
AmazingDrivers4u@reddit (OP)
yeah, my PCIe 4.0 x16 bus gives only 64 GB/s whereas NVLink allows 112.5 GB/s bidirectional bandwidth between cards.
ttkciar@reddit
This post was reported for self-promotion, but upon review I am leaving it up.
Even though it is self-promotion and does link to an LLM-(re?)written article, it is also highly informative, novel, comprehensive, and on-topic for the sub.
That justifies keeping it around. We have our rules for good reasons, but it's also important to treat them with some flexibility.
PotaroMax@reddit
Good human.
Thanks we need this kind of experimentation
Visual_Acanthaceae32@reddit
That’s how moderation is supposed to be… You promoted yourself by excellently handling the situation! Thank you
gthing@reddit
I honestly don't even see how this is remotely self-promotion. Because it links to an article they wrote? They are not advertising anything obvious as far as I can see or asking you to even subscribe to their newsletter or something. It appears to be pure useful information sharing.
666666thats6sixes@reddit
I think the point is that medium pays the author based on traffic, so OP has financial gain from linking to it. They'll probably make several dozen cents.
marscarsrars@reddit
Gasps Several dozen cents
jazir55@reddit
I completely agree, if you link to any self-hosted website or even an external article, apparently even as reputable a blog as medium, you get lambasted for it as "self-promotion" because it isn't directly posted to reddit. It's honestly really weird.
marscarsrars@reddit
Never thought I'd live to see a mod like you in my life time.
AmazingDrivers4u@reddit (OP)
I apologize in case i broke any rules. Just wanted to share this with the community; I was too knackered to read up on the rules before posting. Thank you!
PermanentLiminality@reddit
Thank you for leaving this up. It's amazing.
sudeposutemizligi@reddit
not at all. all thanks to OP 🤘 and claude 😁
Fabulous_Fact_606@reddit
That is a long read. 85 TPS on a single 3090 is impressive.
whiteamphora@reddit
Unfortunately, as of the day you wrote that, it's not usable at all. For now, compromises must be made.
Fun-Marionberry-2540@reddit
Does this also help 4090 in the same way?
AmazingDrivers4u@reddit (OP)
theoretically it should but there is only one way to find out, test it.
Southern_Sun_2106@reddit
Alright, tbh, you knew that everyone would ask for that patch. Why not release it together with your piece? Otherwise, it reads as 'look what an awesome thing I've made, but it won't work without my patch, which I'll release later.' Without the patch, this makes it clickbait and self-promo. Also, whenever Medium is involved, it's a red flag for me.
AmazingDrivers4u@reddit (OP)
I have the file with me, but I realised it doesn't meet the repo's standards for a PR. I built it on the latest dev branch, whereas for a PR I need to prepare it against a stable release branch + test cases. I'd rather make it right and then share the link instead of pushing it out hastily.
Secondly, I've shared enough details about what exactly I've done in the patch, and if you feel confident you can always have a go at it yourself. I appreciate your patience.
i_wayyy_over_think@reddit
If you find yourself getting busy with other things and abandoning it, could you just push it to your own fork? I wouldn't mind working on a non stable branch.
Crafty-Confidence975@reddit
Was anyone able to get the cuda patch from them? Can’t duplicate without their patch_tolist_cudagraph.py which they say they’ll provide if requested.
AmazingDrivers4u@reddit (OP)
I need to prepare the patch on a clean branch in order to submit it as a PR. Please bear with me for a day and i'll post it.
andy2na@reddit
I got it running and seem to get 50-60 t/s on my 3090, but I have to enable --eager-mode for it to launch or it doesn't work
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
What will the patch you will be releasing do?
caetydid@reddit
python3: can't open file '/patches/patch_genesis_unified.py': [Errno 2] No such file or directory
I can't find this file in the provided sources. Could you kindly provide it or explain how you created it?
AmazingDrivers4u@reddit (OP)
check the git repo, link in the article.
Hodler-mane@reddit
thank you. if you can also give us links to everything we need (the exact nightly of the versions of things you are running) that would be appreciated!
AmazingDrivers4u@reddit (OP)
minus the patch, all details are already in the post. Look at the last line of the article.
ionizing@reddit
main thing that threw me off was anything docker related. any tips on doing your stack without docker?
AmazingDrivers4u@reddit (OP)
well, it's just an environment for your code. you can host it on bare metal, a vm, docker, a venv, anywhere. I've got like 15 inference engines that i keep segmented from each other via docker/venv. docker/venv is not mandatory, you should be able to set up your environment accordingly.
Crafty-Confidence975@reddit
Sounds good! Thanks for doing that.
sagiroth@reddit
Thank you. Let us know in the thread bud
andy2na@reddit
this is the patch that got it working for me, although not perfectly, can get up to 60t/s on my 3090:
patch_tolist_cudagraph.py
Crafty-Confidence975@reddit
Nice! With 125k context?
andy2na@reddit
125k got OOM errors during certain situations. Dropping it down to 100k works. Just FYI, this cudagraph patch is not official and there is a performance loss with it, so you'll have to wait for the official patch for the OP's claimed performance and possibly context window
Crafty-Confidence975@reddit
Got it thanks! Did you benchmark it on anything interesting?
andy2na@reddit
Certain tasks and prompts will hit 60 to 65 t/s, up from 30-35 with traditional methods on my 3090. So even without the full patch, there's a performance boost
Crafty-Confidence975@reddit
What about actual performance on tasks you want done? How capable do you find this setup?
andy2na@reddit
I mainly use my LLM for Frigate, Home Assistant voice assist, Karakeep, degoog, Sure finance, and n8n workflows. The gap versus qwen3.6-35b is definitely still noticeable, but it's noticeably faster than the standard qwen3.6-27B deployment.
Overall, if I'm not looking at prompt and generation speeds, it's much more usable now for general tasks
McSendo@reddit
Thanks for experimenting with this. I was going to, but I think I'm going to wait. One of our dev boxes has 2x3090 with p2p drivers. With the official FP8 model and MTP-3 no kv quant, it was already doing 80 to 100 t/s with 130k context filled for single thread workloads, multithreads (3) around 200.
It is tempting though for the larger kv cache.
AmazingDrivers4u@reddit (OP)
go grab it from git, it's now available there.
andy2na@reddit
seems that I still have to add - --enforce-eager under - qwen3_coder to fix the nonstop repeats, but that drops output to 60 t/s because eager disables cudagraphs. can you look into this? Thanks for your work!
Misio@reddit
I had the same problem
I updated 3bit to 4bit
didn't fix it
seems to relate to thinking mode on short prompts?
Misio@reddit
No, not related to thinking.
Misio@reddit
add --override-generation-config '{"temperature":0.7,"top_p":0.8,"top_k":20}' to the vLLM command and pass presence_penalty: 1.5 per-request from your client. With those params: 20/20 clean on bare "hello" with a system prompt. Without them: ~10-30% degenerate depending on prompt length.
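"Per-request" here means it goes in the request body rather than the launch command. With the OpenAI-compatible client that looks roughly like this (endpoint and model name are just my local setup, adjust to yours):
```
from openai import OpenAI

# vLLM exposes an OpenAI-compatible server, so presence_penalty can be sent
# per request instead of being baked into the server config.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen3.6-27b-autoround",       # whatever --served-model-name you launched with
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,                # the per-request part of the fix
    extra_body={"top_k": 20},            # top_k isn't in the OpenAI schema, vLLM reads it from extra_body
)
print(resp.choices[0].message.content)
```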
andy2na@reddit
i think it's related to what is on git: https://github.com/noonghunna/qwen36-27b-single-3090
https://github.com/noonghunna/qwen36-27b-single-3090#known-issue-tool-calling--mtp--turboquant-kv
Misio@reddit
Duh, I should learn to read. Thank you!
andy2na@reddit
yeah, I just found this git since the cudapatch was linked to it this morning. So unless there's a fix for MTP + TQ with tool calling, you'll have to either cut your performance a bit or not use tool calling. 50-60 t/s is pretty decent for 27B dense, will continue testing
Misio@reddit
with tool calling
andy2na@reddit
I switched TQ3 to fp8 caching and got tooling back with vision support, but max 65k context at up to 80t/s. If you disable vision, you can get up to 75k
What are your parameters for that?
andy2na@reddit
even with thinking mode
nbvehrfr@reddit
something with chat template
Misio@reddit
This is brilliant, thanks
```
$ ./scripts/bench.sh
=== Warmup (3x) ===
w1 comp=1000 wall=11.32s 88.34 TPS
w2 comp=1000 wall= 9.61s 104.06 TPS
w3 comp=1000 wall= 9.27s 107.87 TPS
=== Narrative (3x, 1000 tok) ===
narr1 comp=1000 wall= 9.62s 103.95 TPS
narr2 comp=1000 wall=11.89s 84.10 TPS
narr3 comp=1000 wall=10.94s 91.41 TPS
=== Code (2x, 800 tok) ===
code1 comp=800 wall= 7.45s 107.38 TPS
code2 comp=800 wall=12.63s 63.34 TPS
=== GPU state ===
0, 98 %, 22014 MiB, 24576 MiB, 388.35 W, 68
=== Last 3 SpecDecoding metrics (MTP accept) ===
(APIServer pid=1) INFO 04-24 17:09:25 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.04, Accepted throughput: 56.79 tokens/s, Drafted throughput: 83.69 tokens/s, Accepted: 568 tokens, Drafted: 837 tokens, Per-position acceptance rate: 0.975, 0.964, 0.097, Avg Draft acceptance rate: 67.9%
(APIServer pid=1) INFO 04-24 17:09:35 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.24, Accepted throughput: 62.40 tokens/s, Drafted throughput: 83.39 tokens/s, Accepted: 624 tokens, Drafted: 834 tokens, Per-position acceptance rate: 0.996, 0.989, 0.259, Avg Draft acceptance rate: 74.8%
(APIServer pid=1) INFO 04-24 17:09:45 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.42, Accepted throughput: 67.90 tokens/s, Drafted throughput: 84.00 tokens/s, Accepted: 679 tokens, Drafted: 840 tokens, Per-position acceptance rate: 0.954, 0.761, 0.711, Avg Draft acceptance rate: 80.8%
```
andy2na@reddit
Thanks!
I tried the new patch and I definitely do get over 80t/s gen but output seems to be bugged and repeating itself, any idea how to fix this?
AmazingDrivers4u@reddit (OP)
Patch is now available in git. links updated in the article.
ShengrenR@reddit
Somebody posted a comment on the medium article asking where it was; author replied
Webster2026@reddit
It can actually run with decent speed even on old Mac with M1 processor: https://youtu.be/NNOq3T26MIQ
Equivalent-Home-223@reddit
This is fantastic. I posted a question on the post: somehow when I set MTP to >1 it goes into an infinite loop without returning a response to the client side.
I created a custom docker using the vllm base image you use, applied the patches within the docker, and ran as per below
Splinter2121@reddit
Getting ~43 narrative / ~54 code TPS at 330W on a single RTX 3090 with fp8 KV + MTP n=3. Reference setup (identical config, same GPU model) claims 66/84 TPS. MTP acceptance rates are comparable or better (93/87/74% vs 92/81/64%), but base decode throughput is ~20 TPS lower. Looking for ideas on what's causing the gap.
Full Configuration
Docker Image
vLLM Launch Args
Environment Variables
Patches (applied before vLLM start)
.tolist() calls in turboquant_attn.py wrapped with torch.cuda.is_current_stream_capturing() guards so CUDA graph capture doesn't crash
Model
Runtime Details
Key Warning
Benchmark Results
At 330W Power Cap (after 3 warmup rounds)
At 230W Power Cap (stock, after warmup)
MTP SpecDecoding Metrics (330W, warm)
Reference Comparison
Things I've Already Checked
- Using MarlinLinearKernel for GPTQMarlinLinearMethod
- Detected MTP model. Sharing target model embedding/lm_head weights with the draft model.
- mtp.fc.weight present as BF16 (not quantized)
Potential Causes I'm Unsure About
- /root/.cache/vllm/torch_compile_cache/ is not mounted as a volume, so it rebuilds on restart. Could this affect warm-run performance?
- /root/.cache/huggingface/vllm-qwen36-27b-int4 rather than the HuggingFace repo name Lorbus/Qwen3.6-27B-int4-AutoRound. Could this affect any auto-configuration?
- max_num_batched_tokens=2048 — vLLM warns this is suboptimal with spec-decode. The reference uses the same value but could there be a better setting?
Docker Compose (Complete)
mgxts@reddit
This is really cool. I get around 65–70 tokens/sec on an RTX 5090 in LM Studio on a comparable GGUF model (Unsloth/Qwen3.6-27B-UD-Q4_K_XL). My llama.cpp build in WSL2 Ubuntu was still slower than LM Studio even though it was compiled for my setup + TurboQuant + community recommended configuration.
This is the first time I have tested vLLM. The base Qwen3.6-27B-int4-AutoRound gives me about 90 tokens/sec. With the patches enabled the max I have reached so far is around 135 tokens/sec. I have had to disable TurboQuant though as it does not work on the 5090 and the model gets stuck repeating the same token.
nbvehrfr@reddit
which cache quant did you use to fit it on the 5090?
mgxts@reddit
I should have mentioned I could not fit 125k without TurboQuant. The test setup I got working was FP8 KV cache with a 51.2k max context and MTP 3. Not sure if it is the max, was just the value I set when I lowered it. Hopefully there is some way to get TQ working.
gthing@reddit
I'm still waiting for everything to download from huggingface so I haven't tested this yet, but here is my effort to replicate the patch_tolist_cudagraph.py based on the description in the article:
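The gist, as I understand it, is just wrapping the offending .tolist() calls so they don't force a device-to-host copy while a CUDA graph is being captured. Something in this direction (the function name is my guess, not the real repo's; the actual patch in turboquant_attn.py presumably does this at each call site):
```
import torch

def graph_safe_tolist(tensor):
    # .tolist() forces a GPU -> CPU copy, which is exactly what raises
    # "Cannot copy between CPU and CUDA tensors during CUDA graph capture".
    # While capture is active, keep the data on-device and let the caller
    # work with the tensor instead of a Python list.
    if torch.cuda.is_current_stream_capturing():
        return tensor
    return tensor.tolist()
```
No idea yet whether the callers also need changes to accept a tensor instead of a list.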
andy2na@reddit
Did this patch get you the 85 t/s? The one I'm using is only getting me 60ish and I have to enable --enforce-eager, which disables cudagraphs
gthing@reddit
I don't know I kept getting bad checksums when downloading the model and gave up for today.
RealestNagaEver@reddit
I think the checksum command he provided in the article might be using the wrong algorithm; it didn't work for me either until I used sha256
gthing@reddit
Ah thank you!
jimmytoan@reddit
85 TPS on a single 3090 for 27B with 125K context would be well above what most people report - most single-3090 runs at 27B are in the 40-60 TPS range at shorter context. Is the 85 TPS measured on the decode (generation) phase or prefill? Prefill throughput on long sequences is always higher because it parallelizes across the input, but decode rate is what determines how fast the response feels interactively. Also curious how much quality degradation you see at the 125K context end vs 16-32K - long context coherence usually starts dropping before the max window.
AdamDhahabi@reddit
Waiting for MTP to land in llama.cpp so that I can run Q8_0 at high speed on a multi-GPU build with consumer mainboard.
Cold_Tree190@reddit
What is MTP? I only have a single 3090 so it doesn’t sound like it will be of use to me right now, but I have been thinking about building a dedicated multi gpu server at some point
FatheredPuma81@reddit
The TLDR of what it does is: Imagine a draft model but baked into the model and much more accurate.
TLDR for draft model / speculative decoding (for those who might not know): you have a tiny model that predicts the large model's output, and because of wizardry, confirming/denying the prediction is way faster than generating it.
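If you want the wizardry spelled out a bit: the check happens in one batched pass of the big model over all the drafted tokens at once, instead of one pass per token, and rejected guesses fall back to the big model's own token so the output doesn't change. A toy sketch (nothing like real vLLM code, the "models" here are fake functions):
```
import random

def target_next(seq):
    # toy stand-in for the big model: deterministic next token given the sequence
    return (sum(seq) * 31 + len(seq)) % 50

def draft_next(seq):
    # toy stand-in for the draft/MTP head: agrees with the big model ~80% of the time
    return target_next(seq) if random.random() < 0.8 else random.randrange(50)

def speculative_step(seq, k=3):
    # 1. cheap model guesses k tokens ahead
    draft = []
    for _ in range(k):
        draft.append(draft_next(seq + draft))
    # 2. big model checks all k positions; in a real engine this is ONE batched
    #    forward pass, which is where the speedup comes from
    verified = [target_next(seq + draft[:i]) for i in range(k)]
    # 3. keep drafted tokens up to the first disagreement, then append the big
    #    model's own token, so the result is identical to normal decoding
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(target_next(seq + accepted))
    return accepted

seq = [1, 2, 3]
for _ in range(5):
    step = speculative_step(seq)
    print(f"emitted {len(step)} token(s) in one verify pass: {step}")
    seq += step
```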
Glittering-Call8746@reddit
How does it "bake" into the model? Useful for Ampere only? Or can it be applied to a 5070 Ti?
HyperWinX@reddit
Iirc, it's not really a "small model baked into a big model". It just uses the output of the first layers of the big model.
FatheredPuma81@reddit
By there being tensors inside the model (that GGUFs automatically remove) dedicated specifically for token prediction. As the OP says in their crazy long post, it was like +200MB at 4-bit, which is tiny.
AFAIK it's usable on any card, but I wouldn't worry about it even if it wasn't. Flash Attention (as an example) isn't supported on old cards like the V100, yet the community has backported it in llama.cpp.
HyperWinX@reddit
Multi-Token Prediction.
Far-Low-4705@reddit
tensor parallelism is also gonna help you get more performance.
it just landed in llama.cpp a few weeks ago, but it is very unstable and only really works-ish on nvidia cards
FatheredPuma81@reddit
Oh, I gotta ask you then lol. The TLDR is a guy replied to a comment of mine and said that inference time would be horrible on MI50s and that it isn't the t/s that's the issue. Was curious what you make of it? Is inference really that slow on those cards? Cause I'm suddenly really interested, seeing 20-24 t/s on 27B (meaning 35B should be like 80 t/s?).
Oh, and have you tried using parallel slots and subagents in Opencode to improve performance? On an RTX 4090 I've found that usually gives me 50%+ more t/s.
AdamDhahabi@reddit
I tried and only got a speed regression, probably because of the PCIe 4.0 x4 interconnect on my consumer mainboard.
Far-Low-4705@reddit
it is still super early, i'd give it time. it is still super buggy, and they only show gains on much older dense models like qwen3 and gemma 3.
FatheredPuma81@reddit
I've seen ggerganov say they need to work out how they want to implement MTP like a dozen times since Qwen3.5 dropped... so I don't think it's coming any time soon.
YourNightmar31@reddit
I intend to follow this guide this weekend, but what's the prompt processing speed like at 100K context?
realmosai@reddit
I don't mean to sound rude. But I don't quite grasp the point of the article. You loaded and ran an AI Model - is that something to write a whole blog post for?
And you used AI to write the blog post. Of course it's unnecessarily long for what it describes. It could have easily been just four bash commands and a docker compose file, and that's the gist of it.
gthing@reddit
I don't mean to sound rude. But I don't quite grasp the point of this comment. You either didn't read it or didn't understand it - is that something to write a whole comment for?
realmosai@reddit
I shouldn't have written the comment; I see that sometimes ignoring is the better option. And you're right, I tried to put it forward softly, but in vain. Unlike OP, I like to get to the point.
The article's pure AI slop and engagement bait. There, that saves a lot of tokens trying to explain myself.
Crafty-Confidence975@reddit
But it’s not. Getting that model running at 85 TPS with 125k context on a single 3090 RTX is of great interest to many on here. It’s probably the absolute best coding model in this weight class right now and is almost as good as frontier stuff from last year.
The numbers out of the box are nowhere near 85 TPS so this is very valuable. No one cares if it’s written by AI if the results are duplicatable.
realmosai@reddit
"If the results are reproducible" is the key differentiator. AFAIK these are just claims.
Crafty-Confidence975@reddit
We’ll see about the patch but the rest of the article is correct so far. It’s already a 3x improvement without the patch at that context length.
It’s not a mere claim to give exact step by step instructions. Go try it yourself and you’ll see. It’s exactly the sort of content I think we need more of on here.
realmosai@reddit
A 3x improvement would mean that either you were doing something VERY VERY wrong when running the model, or OP deserves global recognition.
Is your measure of the article's correctness coming from Claude reading it and telling you it's correct? Because the results I saw on this post from others were 40 tk/s, a long way off from the claimed 85 t/s from the missing magical patch.
I'm sorry, but these claims, the post title, the fact that none of the post content was put on Reddit, and the rewritten and updated contents of the article do NOT pass my (and some other equally keen-eyed redditors') AI-slop marketing spam filters.
Crafty-Confidence975@reddit
Again did you run it yourself? I was being conservative at 3x. Naive implementations for the 27b model are at 10 TPS or so on a 3090 RTX. I just happen to have a bunch of nodes with those cards and am seeing good results from the model in general so I took interest in the post.
realmosai@reddit
What do you term as a naive implementation? You are speaking in very vague terms.
This article is for naive people. That's true.
As for running 27b q6kxl UD quants, yes, I do run them. My not-so-naive llama build gets me 45+ t/s on a Pro card. That is in line with the results others are seeing. What do I need this article for, again?
Crafty-Confidence975@reddit
Sure and I can push much higher numbers on a $1.45/hr H200 in the cloud… that’s not what the article is about. It’s a specific card which a lot of hobbyists bought a couple years ago. Why not engage with what he’s talking about instead of making it about yourself?
realmosai@reddit
Engage with what? A 20 minute article to get what? Exactly.
I ain't renting my pro card, it's in my computer, if that's what you meant by your cloud supposition.
What's the point of the article, can you share it? I don't think so, because there isn't any. Cut to the chase. You can't, there isn't one. That's why it's a 20-min read. There are better things to read for 20 minutes of my time.
And by the same logic, have a good day.
llitz@reddit
Even if that's the case, it describes exactly what he went through and how he fixed it.
That alone is enough of a contribution; it explains several internals of how things work and what people need to pay attention to. I see that as an actual contribution and, judging by the upvotes, others do too.
realmosai@reddit
The only contribution in the post was made by Claude. Including the debugging.
MR_-_501@reddit
They patched inference frameworks to properly make use of the functionality the model provides; how is that only "loaded and ran an AI Model"?
realmosai@reddit
where's the patch?
AmazingDrivers4u@reddit (OP)
I'll ask Claude to be mindful next time. :P
koljanos@reddit
Apparently hosting models isn't as easy as a simple ollama run
BitGreen1270@reddit
This is very interesting. I don't fully understand everything, but theoretically, can this be applicable to any GPU? I have a 780M iGPU and 32GB RAM and am getting about 20t/s with gemma4-26B-A4B and around the same with Qwen3.6-35B-A3B. Do you think I can replicate some of the steps you describe in your post to seriously boost my tps?
caetydid@reddit
hoooly shite! Why am I still running at 50 tps on an RTX 5090?
corpo_monkey@reddit
Sell that crap and buy a real one!
3090 masterrace
i_wish_i_was_perez@reddit
Yes, but sell it to me
MisticRain69@reddit
Nah don't sell it to him trade with me you save a step
caetydid@reddit
thanks for the offers but sadly my employer owns it
wowsers7@reddit
Has anyone run Qwen 3.6 27b on Intel Arc Pro B70? I’m curious about the performance.
robertpro01@reddit
Tldr?
FatheredPuma81@reddit
vroom vroom goes model
bgeneto@reddit
The issue with this approach is the process boundary:
python3 /patches/patch_tolist_cudagraph.py patches only that short-lived Python process. After it exits, exec vllm serve ... starts a different Python interpreter, so the monkey patch is gone. We have to use sitecustomize.py. That makes Python automatically import the monkey patch (patch_tolist_cudagraph.py) every time a Python interpreter starts, including the vllm serve process and its worker subprocesses.
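A minimal sketch of what that sitecustomize.py could look like (the /patches path and module name are just the ones used earlier in this thread; put the file anywhere on the interpreter's default sys.path, e.g. the venv's site-packages):
```
# sitecustomize.py
# Python's site module imports this automatically at interpreter startup,
# so the monkey patch gets applied in the `vllm serve` process and in every
# worker subprocess it spawns, not just in a one-shot `python3 patch.py` run.
import sys

sys.path.insert(0, "/patches")  # directory containing patch_tolist_cudagraph.py

try:
    import patch_tolist_cudagraph  # importing the module applies the monkey patch
except Exception as exc:
    # never block interpreter startup if the patch is missing or broken
    print(f"sitecustomize: patch not applied: {exc}", file=sys.stderr)
```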
I don’t know what any of those words in the article mean, but I felt like I did when I was reading it
anthonyg45157@reddit
RemindMe 3 days "check this out again"
edsonmedina@reddit
Meanwhile I'm getting 7 tok/s on Strix Halo 🥲
sabotage3d@reddit
Is it possible to get MTP (Multi-Token Prediction) working in llama.cpp yet? I’ve successfully managed to get TurboQuant running via an experimental branch, but I haven't seen an implementation for MTP. Are there any specific branches or PRs I should be looking at?
Important_Quote_1180@reddit
This is exactly what I needed! Thank you
No-Marionberry-772@reddit
I wish this wasn't beyond me. Experienced developer, but I'm weak with C++ , python, and getting into these tools. I don't currently have the hardware, but I'm really wanting to make the switch to local, getting tired of cloud providers. If I can make the switch and buy a 3090 instead of a 5090, that would be amazing. I know I just have to wait, but these numbers never seem to hit the main stream tooling it seems like.
zhileiz@reddit
Thx. This is probably the best piece of writing I've seen in a while. I wonder if you'd do a follow-up on the model performance (real-world experience) under that configuration, e.g. opencode / openclaw experience etc.
Ok-Measurement-1575@reddit
It's pure slop but given the outcome, I think we can look the other way :D
hainesk@reddit
Total slop, it’s hard to read…
AmazingDrivers4u@reddit (OP)
As for future posts, i hope i'm able to discover more cool optimisations in order to keep going. will try!
Ok-Measurement-1575@reddit
Post a cock teaser paragraph at least next time, ideally.
Great work.
AmazingDrivers4u@reddit (OP)
hahaha!
AmazingDrivers4u@reddit (OP)
Thank you! Glad you liked it.
cviperr33@reddit
Posting here so i can try later , thanks for the info
GregoryfromtheHood@reddit
What about prompt processing? More token generation speed is always nice, but prompt processing speed is in my opinion even more important for real world use.
edankwan@reddit
I have a 3090 + 3090 Ti running Q8 + Q8 k/v with 131072 context window. Only 26t/s
weird_ed@reddit
I have the same setup. With ik_llama, -sm graph and p2p I get >50t/s on Q5 and 500k context window on Q8 k/v.
AmazingDrivers4u@reddit (OP)
throw away the second one, only focus on the first one. =P
Ok-Measurement-1575@reddit
I didn't see the exact vllm version? I've abandoned vllm 0.19 for qwen 3.5/3.6 as the actual outputs are subpar compared to llama.cpp.
Maybe some things got fixed now.
EveningIncrease7579@reddit
This is really insane for me; i'm getting 30~40 tk/s (llama.cpp unsloth q4 or q5 depending). Could you share a docker image compiled with your own modifications? i really want to test it!
AmazingDrivers4u@reddit (OP)
bear with me for a day, its the patch i need to release. the rest of the stuff is already documented and shared.
EveningIncrease7579@reddit
Dont worry, you are awesome!
koljanos@reddit
Awesome read. can you please tell me if I can push everything further by utilizing two 3090s with NVLink? Will using a less quantized model help?
AmazingDrivers4u@reddit (OP)
having nvlink certainly helps, but there is only one way to find out, test it. inference engines are getting updated all the time and sometimes things do break.
koljanos@reddit
Thanks bro, you did a great job!
caetydid@reddit
Please post a gist or git repo! I think some sources are missing
AmazingDrivers4u@reddit (OP)
will look into it, I've never used gist before.
El-Dixon@reddit
Wow! The work, the writing, the results... chef kiss. Thank you!
AmazingDrivers4u@reddit (OP)
oh thank you! Claude is a good friend. =)
sagiroth@reddit
Please share the files and fix with us
AmazingDrivers4u@reddit (OP)
will do!
xrvz@reddit
I'm not reading some shitty medium post. Huge red flag. At least put it on github gist.
Gold-Debt-5957@reddit
I just read it, and how interesting! My question: how will it run on my 64 GB Mac M1 Max? Expanding the context window would have been great! The patch they implemented is only temporary as it stands, but a big improvement. We have to be patient and see how it evolves, and whether there's any news. Keep me updated.