Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
Posted by Kindly-Cantaloupe978@reddit | LocalLLaMA | 126 comments
Qwen3.6-27B has been out for a few days, and an NVFP4 quant with MTP was dropped earlier on HF: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 with a 218k context window via the latest vLLM 0.19 builds (vLLM 0.19.1rc1):
https://www.reddit.com/r/LocalLLaMA/comments/1sr8gyf/qwen3527b_on_rtx_5090_served_via_vllm_77_tps/
Optimal-Bass-5246@reddit
Following this article:
https://www.reddit.com/r/LocalLLaMA/comments/1stjx29/an_overnight_stack_for_qwen3627b_85_tps_125k/
I was able to get 155 tps with a 258K context window on 1x RTX 5090.
=== Warmup (3x) ===
w1 comp=1000 wall=19.42s 51.49 TPS
w2 comp=1000 wall= 8.11s 123.30 TPS
w3 comp=1000 wall= 8.46s 118.20 TPS
=== Narrative (3x, 1000 tok) ===
narr1 comp=1000 wall= 8.38s 119.33 TPS
narr2 comp=1000 wall= 8.13s 123.00 TPS
narr3 comp=1000 wall= 8.06s 124.07 TPS
=== Code (2x, 800 tok) ===
code1 comp=692 wall= 4.44s 155.86 TPS
code2 comp=462 wall= 3.05s 151.48 TPS
=== GPU state ===
0, 92 %, 29997 MiB, 32607 MiB, 402.53 W, 63
=== Last 3 SpecDecoding metrics (MTP accept) ===
(APIServer pid=1) INFO 04-25 14:10:16 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.60, Accepted throughput: 72.50 tokens/s, Drafted throughput: 136.20 tokens/s, Accepted: 725 tokens, Drafted: 1362 tokens, Per-position acceptance rate: 0.782, 0.533, 0.282, Avg Draft acceptance rate: 53.2%
(APIServer pid=1) INFO 04-25 14:10:26 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 76.79 tokens/s, Drafted throughput: 134.99 tokens/s, Accepted: 768 tokens, Drafted: 1350 tokens, Per-position acceptance rate: 0.782, 0.564, 0.360, Avg Draft acceptance rate: 56.9%
(APIServer pid=1) INFO 04-25 14:10:36 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.97, Accepted throughput: 89.39 tokens/s, Drafted throughput: 135.89 tokens/s, Accepted: 894 tokens, Drafted: 1359 tokens, Per-position acceptance rate: 0.837, 0.647, 0.490, Avg Draft acceptance rate: 65.8%
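Fun sanity check: the "Mean acceptance length" and "Avg Draft acceptance rate" in those logs follow directly from the per-position acceptance rates. A quick illustrative calc (my own arithmetic, not vLLM code):

```python
# Illustrative arithmetic tying together the SpecDecoding log fields
# above (not vLLM code). With 3 draft positions per decode step:
per_position = [0.782, 0.533, 0.282]   # from the first log line

# Mean acceptance length = 1 verified target token per step
# + the sum of the per-position draft acceptance rates.
print(1 + sum(per_position))           # ~2.60, matches "Mean acceptance length"

# Avg draft acceptance rate = accepted draft tokens / drafted tokens.
print(sum(per_position) / 3)           # ~0.532, matches "53.2%"
```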
andy2na@reddit
Are you still using TQ for the cache? It's bugged: output is bad and tool calling doesn't work. You have to switch to fp8 caching.
Read the updated information in the actual repo: https://github.com/noonghunna/qwen36-27b-single-3090
Optimal-Bass-5246@reddit
Using fp8_e5m2. Haven't tried turboquant_k8v4 yet. That is next on the agenda. Should improve quality, but will lower context.
andy2na@reddit
Don't bother until it's fixed:
https://github.com/vllm-project/vllm/issues/40831
Optimal-Bass-5246@reddit
Looks like tool calling has been fixed:
https://www.reddit.com/r/LocalLLM/comments/1sv6cqk/follow_up_tested_tool_calling_fixes_for_qwen/
Optimal-Bass-5246@reddit
Tool calling is totally fixed now. Hitting 160+ tps with the change to the tool call parser and chat template.
andy2na@reddit
Can you provide your full config?
I've tried various combinations (updated Genesis patches, cudagraph PIECEWISE, etc.), and while I can get 125k context and normal chats working at up to 90 tps on my 3090, anything that requires large context, like hermes or opencode, either doesn't respond or goes OOM.
whiteamphora@reddit
Can't recommend the article. I tried it, it's bugged, and I'm not sure it even works. Spent all day on it, and honestly, going with the basic Qwen setup was more worthwhile.
Optimal-Bass-5246@reddit
Article obviously does work or people would not be posting their results.
TheQuantumPhysicist@reddit
Noob question: how would using a vLLM server differ from using the LM Studio server?
I use LM Studio and I like it, and I'm wondering whether there is a gain in using vLLM.
Beginning-Window-115@reddit
LM Studio uses llama.cpp or MLX depending on what device you're on. If you have a Blackwell or a really decent GPU (probably Blackwell), you should be using vLLM; otherwise you're wasting potential.
Fit_Split_9933@reddit
I have to use Windows. Is there a way to use VLLM for this on Windows?
Beginning-Window-115@reddit
Use Docker.
Usual-Carrot6352@reddit
llama.cpp now also supports NVFP4 on both GPU and CPU.
Usual-Carrot6352@reddit
You should use Abiray-Qwen3.6-27B-NVFP4, not that one. Check the NVFP4 conversion branch abiray used.
Usual-Carrot6352@reddit
A better version could be the one from Red Hat, with more accuracy recovery relative to the original Qwen release: https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4
TheQuantumPhysicist@reddit
What's the win in using NVFP4? Can you please elaborate for a noob?
BobbyL2k@reddit
If you have a 50-series GPU, it can compute NVFP4 at twice the speed of FP8. So, in theory, you get faster PP and higher batched TG.
2Norn@reddit
And if he's using Q4_K_M already?
BobbyL2k@reddit
I’m not sure of the current state of the CUDA kernels used by llama.cpp; it would need specialized ones to deliver the performance improvements I’ve mentioned.
As for your question: Q4_K_M already drastically reduces the model weights to ~4 bits per parameter, which brings most of the performance benefit you'd expect from NVFP4 being ~4-bit, especially single-user TG speed.
D2OQZG8l5BI1S06@reddit
In llama.cpp all compute is done in FP16 when the card supports it, even NVFP4. So you just skip the dequantization of Q4_K.
Ell2509@reddit
Dang. TIL. Gonna go push my 5070ti harder. MORE POWER. MORE INFERENCE. MORE PROFIT.
GPU screaming silently in the background like the God Emperor himself, light bleeding everywhere.
"Yes ImEll, everything you say is correct and very deep."
Beginning-Window-115@reddit
What the other guy said, but also NVFP4 is near FP8 in terms of quality.
Ell2509@reddit
Dang another TIL.
Hooks gpu up to local sub station and continues generating self affirming nonsense.
mxmumtuna@reddit
*sometimes. This particular one is not.
Usual-Carrot6352@reddit
The output quality is amazing alongside the speed. There are 1.7B NVFP4 quants on Hugging Face too; you can quickly run one and compare for your case.
Kindly-Cantaloupe978@reddit (OP)
Does it support MTP (which is fixed in the version I am using)? If it doesn't, then speed will be slower ...
Beginning-Window-115@reddit
Yes, but it'll be slower.
Usual-Carrot6352@reddit
Nope. Did you try it with ik_llama?
Usual-Carrot6352@reddit
ik_llama.cpp has also FP4 support, it's just called MXFP4 (type id 39) instead of NVFP4 (type id 40). In fact it has broader coverage — CPU (AVX2/NEON/Zen4), CUDA, and Metal all implemented, versus mainline's NVFP4 which is CUDA-only for now.
wolframko@reddit
llama.cpp and ik_llama.cpp do not support native FP4 compute right now; they dequantize to FP16 at runtime instead. Wait for this to be merged: https://github.com/ggml-org/llama.cpp/pull/22196
Until then, basic Q4 GGUF quants will be better in PP and ppl.
Usual-Carrot6352@reddit
Build b8925 (https://github.com/ggml-org/llama.cpp/releases) can load and run NVFP4 models, but use the latest build: b8929.
Dany0@reddit
The issue is that vLLM uses more VRAM. I can comfortably use 27B Q4 with Q8 KV cache in llama.cpp.
vLLM gets me at best 200k ctx max on Windows with WSL. I can get closer to the full ctx window by booting into Linux, but that has its own downsides.
Token generation without MTP is similar, and with self-speculative ngram decoding llama.cpp is competitive in some tasks.
So it's not a clear-cut "vLLM is always better" choice.
Beginning-Window-115@reddit
This is why you use NVFP4; it's literally 4-bit.
Dany0@reddit
NVFP4 keeps some layers in BF16; it takes up more VRAM than UD Q4...
TheQuantumPhysicist@reddit
I do use a 5090, actually. But why is vllm considered better for blackwell?
mxforest@reddit
The batching is superior. It doesn't allocate once and reuse a fixed block; it scales dynamically for each request, so you can have 1 request at 100k context or 10 at 10k. LM Studio recently introduced batching, but its throughput is way worse.
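The rough idea, as a toy sketch of paged KV allocation (this is the concept, not vLLM's actual internals):

```python
# Toy sketch of paged KV allocation: blocks are grabbed on demand per
# request instead of reserving the full max context up front.
BLOCK_TOKENS = 16

class PagedKV:
    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.tables = {}  # request id -> list of block ids

    def append_token(self, req: int, pos: int) -> None:
        table = self.tables.setdefault(req, [])
        if pos % BLOCK_TOKENS == 0:        # current block is full: grab a new one
            table.append(self.free.pop())

    def release(self, req: int) -> None:   # finished request returns its blocks
        self.free.extend(self.tables.pop(req))

# One request at 100k tokens or ten at 10k draw from the same pool;
# nothing sits reserved for contexts that never materialize.
pool = PagedKV(total_blocks=8192)
for pos in range(10_000):
    pool.append_token(req=0, pos=pos)
print(len(pool.tables[0]), "blocks used")  # 625 blocks for 10k tokens
pool.release(0)
```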
gusbags@reddit
If you have a good GPU / multiple GPUs, vLLM / SGLang are superior, but more fiddly to set up. llama.cpp / Ollama / LM Studio etc. are easier to work with and have higher compatibility, but still suck at batching and parallelism.
KallistiTMP@reddit
*Multiple identical GPU's.
For context, vLLM is mostly used for small to midscale commercial setups. It's heavily geared towards the GPU rich and squeezing performance out of large inference clusters. Nearly always deployed on a production Kubernetes GPU cluster.
Llama.cpp/Ollama/LM Studio are geared towards small-scale hobbyist use on consumer hardware. They're way better at running on CPU/RAM or in mixed-card situations. Those aren't very common in production clusters, because production clusters generally use GPU-rich uniform hardware.
Generally speaking, vLLM is designed to shine at larger scales. You can run itsy bitsy single user vLLM servers, and if you know how to set it up correctly you might be able to squeeze a little more performance out if you have Blackwell cards. But it is definitely going to be significantly harder to set up, and may not fare well long term because single-user setups are really not a big priority for vLLM. Same with funny quants like Q6_K_M, mismatched cards, CPU offload, etc. They're more focused on things like autoscaling, KV prefix routing, dynamic batching, RDMA networking, and all those sorts of things that matter a lot for large industrial-scale deployments, but don't matter at all for personal use.
I would honestly recommend sticking with the consumer stuff unless you have professional experience working with production GPU Kubernetes clusters. I actually do, and I mostly use Llama.cpp myself. I do use vLLM for some parts of my home automation setup, where the management benefits of Kubernetes outweigh the pain of setting up DCGM and nvidia container toolkit and all that.
gusbags@reddit
Yep, true on both identical GPU and setup complexity points. RE: complexity - this is less of an issue these days since you generally can find pre-baked dockerfiles or images custom made for your GPUs.
If your GPU has decent vllm support and goal is to extract maximum performance from a multi-GPU setup, vllm / sglang is probably worth investing time into.
DeepOrangeSky@reddit
What about SGLang? Is it also meant for multi-user use cases like vLLM, is it in its own separate niche, or is it in a more similar boat to llama.cpp?
ubrtnk@reddit
Can confirm. I run Qwen 3.6 35B on 2x 4080s in llama.cpp with a max of 131k and I get 100 t/s. Literally tested vLLM last night and got 160 t/s, but I could only get 8k context. The performance comes at a premium.
Kindly-Cantaloupe978@reddit (OP)
You need to apply the KV cache calc fix for vLLM. See my other post linked in the OP.
Puzzleheaded_Base302@reddit
Will this fix be upstreamed eventually?
Kindly-Cantaloupe978@reddit (OP)
IDK, but from the PR it looks like an NVIDIA upstream issue.
Deep90@reddit
Damn guess I should explore installing vllm again.
vr_fanboy@reddit
I nearly missed the Qwen3.6 27B greatness due to LM Studio slowness. I had a horrible experience with 3.5 27B (35B was fine though), with very slow PP and t/s, so I wasn't going to try 27B. Switched to vLLM, and after a day of fiddling (testing llama.cpp and many quants, with some issues with repetitions), I have 40 t/s at 128k context length on a single 3090 with turboquant. It's enough to replace Sonnet for many tasks.
1ncehost@reddit
vLLM is designed for maximum concurrent tok/s for multi-user use cases, llama.cpp is designed for maximum single stream tok/s for single-user use cases.
That is a simplification derived from where the projects originated, and it is mostly true today. They differ significantly in how they chop up work and in their underlying algorithms, which makes each inherently better for its intended use case.
DeepOrangeSky@reddit
What about SGLang? Where does it fit on that use-case spectrum, or what is its specialty supposed to be compared to those two?
Puzzleheaded_Base302@reddit
vLLM gives you prefix caching, much faster prompt processing, and arguably more stability (fewer crashes due to OOM?), but you might end up with a lower token generation rate. If you run openclaw, vLLM might work out better, since it requests long context all the time. The biggest advantage is concurrency: with more than 2 concurrent requests, vLLM produces dramatically more tokens when you run multiple jobs at the same time, with only a small penalty on single-query speed.
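If you want to see the concurrency effect yourself, here's a minimal sketch against a local vLLM endpoint (model name, port, and prompt are placeholders for whatever you serve):

```python
# Sketch: aggregate tokens/s at different concurrency levels against a
# local OpenAI-compatible vLLM server. Model name, port, and prompt
# are placeholders for whatever you are serving.
import asyncio
import time
from openai import AsyncOpenAI

async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model="Qwen3.6-27B",  # must match your served model name
        messages=[{"role": "user", "content": "Write a short story about a robot."}],
        max_tokens=500,
    )
    return resp.usage.completion_tokens

async def bench(concurrency: int) -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    wall = time.perf_counter() - start
    print(f"{concurrency:2d} parallel: {sum(counts) / wall:6.1f} aggregate tok/s")

for n in (1, 2, 4, 8):
    asyncio.run(bench(n))
```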
haenous-alistera@reddit
LM Studio is a great interface, but if you need to squeeze out as much performance as possible, llama.cpp is a much better bet. Ollama and LM Studio are easier to use, but at a cost. vLLM and SGLang are also better options, but IMHO for specific uses: we use vLLM in our multi-GPU multi-user setups and SGLang for production agentic swarms.
Jeidoz@reddit
Not sure about vLLM, but LM Studio, for some unknown reason, can only allocate 22GB of the 24GB available on my GPU, and it's a bit unintuitive how to use the "--fit" option to let MoE models offload optimally.
E.g., Qwen3.6 35B A3B Q6 is slow in LM Studio, but with compiled CUDA llama.cpp and this command it's much faster and uses all 24GB of VRAM:
Important_Quote_1180@reddit
Some features allow 3x the speed and/or compression with minimal loss. LM Studio is OK, but it’s the difference between a Segway and a BMW.
debackerl@reddit
The Segway is the better one in crowded city centers, right? 😂
mxmumtuna@reddit
Be careful with that quant. Its KLD isn’t great.
DistanceAlert5706@reddit
Are any KLD results available for NVFP4 quants? Also, are there any better ones under 20GB and with MTP?
Service-Kitchen@reddit
What does KLD stand for?
mxmumtuna@reddit
https://medium.com/@ncaraliceanews/transformer-fundamentals-understanding-the-kullback-leibler-divergence-kld-part-2-75f072534768
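If you want a number rather than a vibe, the usual recipe is per-token KL divergence between the full-precision and quantized models' next-token distributions. A minimal torch sketch, assuming you've already collected logits of shape [num_tokens, vocab_size] from both models over the same text:

```python
# Minimal sketch of per-token KLD between a reference (e.g. BF16)
# model and a quantized one, given logits from both on the same text.
import torch
import torch.nn.functional as F

def mean_token_kld(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    # KL(P_ref || P_quant), averaged over token positions.
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)
    kld = F.kl_div(quant_logprobs, ref_logprobs, log_target=True, reduction="none")
    return kld.sum(dim=-1).mean().item()
```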
Service-Kitchen@reddit
Thank you! :)
Internal-Shift-7931@reddit
The most interesting part here might not be the ~80 tok/s number itself, but what 218k usable context does to the local RAG tradeoff. For a lot of single-user local workflows, "just keep the whole working set in context" starts to become a real alternative to vector search. Not because it is always cheaper or more elegant, but because it avoids a whole class of chunking/retrieval failures. I would love to see a context-residency curve for this setup:
- prefill time at 32k / 64k / 128k / 218k
- decode speed after the cache is hot
- VRAM headroom at each context size
- answer quality on needle-in-haystack tests near the beginning/middle/end
- what happens with 2 concurrent users
If this holds up, the bigger story may be that local long-context serving changes app architecture, not just benchmark numbers.
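Something like this rough harness would produce the prefill leg of that curve (a sketch against a local OpenAI-compatible endpoint; the model name, port, and filler text are placeholders):

```python
# Sketch: the prefill leg of a context-residency curve, i.e. time to
# the first completion token at growing prompt sizes. The filler is a
# crude stand-in for a real working set; vary it (or disable prefix
# caching) so cache hits don't skew repeat runs.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
FILLER = "the quick brown fox jumps over the lazy dog. "  # ~10 tokens

for target in (32_000, 64_000, 128_000, 218_000):
    prompt = FILLER * (target // 10) + "\nSummarize the above in one line."
    start = time.perf_counter()
    client.chat.completions.create(
        model="Qwen3.6-27B",  # must match your served model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,         # stop right after prefill
    )
    print(f"~{target} tok prompt: {time.perf_counter() - start:.1f}s to first token")
```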
Sovex66@reddit
What's your use case? Coding? Chat?
Kindly-Cantaloupe978@reddit (OP)
coding primarily
benno_1237@reddit
A 218k context window is nice, but which prompt length did you use for testing? Speed doesn't really change with the context window setting, but with the actual context you use.
Tools like opencode etc. go up to ~30-40k context immediately, so that's the minimum prompt length you should benchmark against IMO (if you are coding with it; different story for creative writing etc.).
FortiTree@reddit
This. You need to compare apples to apples based on prefilled cache, and also warm KV cache (saving the prefill overhead) vs. cold KV cache (needing to prefill every time).
There are also two token speeds that are very different from each other: prompt processing (PP), bound by GPU compute, and token generation (TG), limited by memory bandwidth.
For example, for the same Qwen3.6-35B-A3B-Q4KM:
Generation (output) speed, decoding 1 token at a time, is similar for Spark and Halo:
- DGX Spark: mem bandwidth 273 GB/s -> TG 55 t/s
- Strix Halo: mem bandwidth 256 GB/s -> TG 50 t/s
- Mac M3 Ultra: mem bandwidth 800 GB/s -> TG 85 t/s (hitting a CPU bottleneck)
Prompt processing, on the other hand, is night and day:
- DGX Spark (CUDA): PP 1700+ t/s
- Strix Halo (Vulkan/ROCm): PP 300+ t/s
- Mac M3 Ultra (MLX): PP 1500 t/s
So if you need speed for agentic handoff, the DGX Spark is better, but at twice the price and 3x the power cost. Otherwise, Strix Halo is the sweet spot for some. A Mac M3U or M5U is actually the best of both worlds.
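Those TG numbers line up with a crude bandwidth-bound estimate: each decoded token streams every active weight byte through memory once. A back-of-envelope sketch (the sizes are rough assumptions for a ~3B-active MoE at ~4.5 bits/param, and real-world numbers land at a fraction of the ceiling):

```python
# Crude upper bound on decode speed from memory bandwidth alone: every
# generated token streams all *active* weight bytes once. Observed
# speeds sit well below this (KV reads, kernel overhead, CPU limits).
active_params = 3e9                      # rough assumption: ~3B active
active_bytes = active_params * 4.5 / 8   # ~1.7 GB touched per token at Q4KM

for name, bw_gbs in [("DGX Spark", 273), ("Strix Halo", 256), ("Mac M3 Ultra", 800)]:
    print(f"{name}: bandwidth ceiling ~{bw_gbs * 1e9 / active_bytes:.0f} t/s")
```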
benno_1237@reddit
However, for OP's setup, the RTX 5090 is most likely the best choice. You are not compute limited on any of them; you are memory limited. And the ~1.8TB/s of the 5090 is actually insane.
FortiTree@reddit
Oh ya 80 t/s TG on Qwen 27B is insanely fast. I can only get 12 t/s max on Strix Halo. Dedicated GPU is definitely better for dense model. Unified memory is only usable with MoE.
Maybe Mac M5 Ultra can break that barrier or something else next year.
benno_1237@reddit
Given that the M5 Max is 614 GB/s and the M5 Ultra is most likely (is there info on that yet?) just two Maxes slapped together, it's still about 1/3 off of the 5090.
FortiTree@reddit
Just speculation for now, but some say the M5U can get past 1TB/s easily, maybe closer to 1.5TB/s, so not far off from the 5090's 1.8TB/s.
Weird_Search_4723@reddit
30-40k sounds like your setup's problem more than opencode's, though opencode's system prompt is large. I'd suggest either pi or https://github.com/0xku/kon (mine) if you are looking for something lightweight.
benno_1237@reddit
You are right, I should have clarified a bit more. What I meant is that I rarely find myself below that in any multi-turn conversation in opencode. That can surely be improved using various software tools.
My point however was that it is useless to measure generation speed without specifying how you measure it.
Most likely in OP's case, context doesn't affect the performance that badly, but still.
vr_fanboy@reddit
I'm using https://github.com/yvgude/lean-ctx and https://github.com/JuliusBrussee/caveman with PI (they can be used in opencode too); they work really well to lower the context requirement. How you use the context also matters: separate planning from implementation, etc.
On-demand MCP/tool loading plus CLI-like tools are a must in local deployments. We are at a point where we can actually work with local LLMs.
benno_1237@reddit
I am not saying it doesn't work. I just think that most performance numbers people post on here are misleading.
You are completely correct, there are awesome tools for saving context available. Still, if you do some multi-turn edits, you will hit a context length that starts to matter.
But the Qwen3.5/3.6 series is a beast at context management, so most likely it's not as significant as with older models.
gatewaynode@reddit
"Tools like opencode etc go up to ~30-40k context immediately"
That's probably your custom setup, I don't see that at all. For me opencode is very lean on token use.
benno_1237@reddit
Yeah, that was a bit of an exaggeration on my end. Still, the default system prompt is like ~10k, so without any kind of optimization you hit 30-40k quickly.
Kindly-Cantaloupe978@reddit (OP)
I ran a session that generated 11k tokens, and the average was 78 tps. That is based on the metrics that vLLM provides via the /metrics endpoint.
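If anyone wants to reproduce the measurement, here's roughly how to sample it (a sketch; the vllm:generation_tokens_total counter name matches recent vLLM builds but may differ across versions):

```python
# Sketch: estimate average TG speed by sampling vLLM's Prometheus
# /metrics endpoint twice and dividing the counter delta by time.
import re
import time
import urllib.request

URL = "http://localhost:8000/metrics"  # placeholder port

def generation_tokens() -> float:
    text = urllib.request.urlopen(URL).read().decode()
    # e.g. 'vllm:generation_tokens_total{model_name="..."} 12345.0'
    match = re.search(
        r"^vllm:generation_tokens_total(?:\{[^}]*\})?\s+([0-9.eE+-]+)", text, re.M
    )
    return float(match.group(1))

t0, n0 = time.time(), generation_tokens()
time.sleep(30)  # run your session in the meantime
t1, n1 = time.time(), generation_tokens()
print(f"avg {(n1 - n0) / (t1 - t0):.1f} tok/s over {t1 - t0:.0f}s")
```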
car_lower_x@reddit
I have a 5090, but why would I use NVFP4? It’s just a heavily quantized modern format. Sure it’s fast, but because ...
BitterProfessional7p@reddit
Models only start to get stupid below 4-bit quantization. Down to 4 bits, they are practically unchanged.
Comparing BF16 to a 4-bit GGUF or NVFP4 usually shows a 1-3% decrease in benchmark performance and a 100-300% speedup depending on the context length. 4-bit quantization seems to be the sweet spot between quality and performance.
amemingfullife@reddit
Do you have any bench comparisons or KL divergence stats etc? I’ve heard really different and hand wavey opinions on the quality comparisons between NVFP4 and FP8 and above. I get the theory, just would be nice to see some sources.
Still-Notice8155@reddit
Maybe NV means optimized for NVIDIA, and it's FP4; IDK if it's the same as Q4...
Kindly-Cantaloupe978@reddit (OP)
It's 4-bit but very close to FP8 quality, per NVIDIA's post.
some_user_2021@reddit
But this is not a file provided by Nvidia, and the model was not trained in 4 bit
Kindly-Cantaloupe978@reddit (OP)
FWIW, this is the NVIDIA post that talks about NVFP4:
https://build.nvidia.com/spark/nvfp4-quantization
Kindly-Cantaloupe978@reddit (OP)
The trade-off is model size vs. KV cache headroom. You can go with higher quants, but at the expense of less room for the KV cache. Turboquant doesn't deliver much gain on vLLM for some reason with my setup. If there's a better setup with turboquant enabled, then even better.
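To put rough numbers on that headroom trade-off, a back-of-envelope KV sizing sketch (the layer/head dims below are illustrative assumptions, not the actual Qwen3.6-27B config):

```python
# Back-of-envelope KV cache sizing. Bytes per token =
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# The dims below are illustrative assumptions, NOT the real 27B config.
layers, kv_heads, head_dim = 60, 8, 128
budget_gb = 32 - 19  # VRAM minus ~19G of weights, ignoring activations/overhead

for name, elt_bytes in [("fp16 cache", 2), ("fp8 cache", 1)]:
    per_token = 2 * layers * kv_heads * head_dim * elt_bytes
    max_tokens = budget_gb * 1024**3 / per_token
    print(f"{name}: {per_token // 1024} KiB/token -> ~{max_tokens / 1000:.0f}k tokens of KV")
```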
SnooPaintings8639@reddit
I am getting ~57 tps with the same max context at FP8, using an old and tried setup of 2x RTX 3090. Not sure about the speed with 90%+ of the context used.
When I switch to AWQ INT4, I am getting ~65-70 tps.
Two 3090s are half the price of a single 5090, total twice the amount of VRAM, and are still very competitive when run in tensor parallel mode. I just wish I had NVLink on top of them to push them even further.
IrisColt@reddit
Thanks!!!
oxygen_addiction@reddit
Try with DFlash as well. You can also quantize the model to Q8 without acceptance rate changes.
No_Algae1753@reddit
What is dflash?
oxygen_addiction@reddit
What is google?
Artistic_Okra7288@reddit
Confused, Google took me right to this post.
No_Algae1753@reddit
I forgot that I'm on reddit where I'm not allowed to ask
meatmanek@reddit
it's a form of speculative decoding where you use a diffusion language model as your draft model.
With speculative decoding, you use a faster model to draft a few tokens ahead, and then use the main model to verify (accept/reject) those draft tokens. Verification with the main model is typically much faster than generation, so if your draft model is both fast and has a good acceptance rate, you can see decent speedups. If your draft model is slow or you have a low acceptance rate, then the added compute of the draft model + verification can slow you down.
Traditionally you'd use a smaller model in the same family, like Qwen3.5 2B as the draft model for Qwen3.5 27B, but MTP and DFlash are newer variations. With MTP, the main model ships with a few added layers trained to predict tokens from the model's internal state. Since it has access to the main model's internals, it can presumably be smaller (cheaper to run) than a separate draft model of the same accuracy.
DFlash uses a diffusion model, which is already supposed to be very fast relative to autoregressive (standard) models.
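The expected speedup is simple enough to sketch (illustrative numbers only; the draft-cost ratios are assumptions):

```python
# Rough speculative-decoding speedup model: each verify step costs one
# target-model forward pass plus k draft passes, and yields
# `mean_accept_len` tokens on average. Numbers are illustrative.
def spec_speedup(mean_accept_len: float, k: int, draft_cost: float) -> float:
    # draft_cost: cost of one draft pass relative to one target pass
    # (near zero for MTP heads, larger for a separate draft model).
    return mean_accept_len / (1.0 + k * draft_cost)

print(spec_speedup(2.6, k=3, draft_cost=0.05))  # ~2.3x for MTP-style heads
print(spec_speedup(2.6, k=3, draft_cost=0.3))   # ~1.4x for a heavier draft model
```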
No_Algae1753@reddit
Thank you very much! Very nice explanation. I was thinking about using spec decoding on Qwen 27B. Is there something you would recommend for that kind of setup?
meatmanek@reddit
idk I'm also looking for a good SD setup for qwen3.6-27b. MTP isn't supported yet on llama.cpp or mlx-lm, so I haven't gotten it working on apple silicon yet. There are some PRs to add support, but a lot of the quants drop the MTP layers so you need to find a quant with the MTP layers intact. One that supposedly has them didn't work with the PR I tested.
No_Algae1753@reddit
Guess we will have to be patient for a while
Kindly-Cantaloupe978@reddit (OP)
Is there a speed gain, or the same speed but a higher context window on the same GPU?
oxygen_addiction@reddit
Speed gain. I'm hitting 1.3x-2x more tokens per second depending on acceptance rate.
Kindly-Cantaloupe978@reddit (OP)
That's quite a bump. What's your setup? Do you have a recipe to share?
oxygen_addiction@reddit
I only just got it running last night on a forked llama.cpp. Haven't tested with vLLM yet.
RTX 5090 + RTX 5080. Haven't tried putting the draft model on the second GPU.
DTree should improve speed even more in the coming months.
Kindly-Cantaloupe978@reddit (OP)
I only have a single 5090RTX so it might not work for me ...
oxygen_addiction@reddit
Oh, something to note is that I was not testing with NVFP4.
Results may vary, though I doubt it.
specify_@reddit
I tried DFlash with Qwen 3.6 35B-A3B and was disappointed with the token throughput at long context (>50k). It seems DFlash is only good at low context, and draft acceptance worsens at longer contexts, making it slower than MTP.
oxygen_addiction@reddit
I had it fully on the 5090.
It eats up like 500MB for the draft and a few hundred for context (which I think could be capped harder).
ddog661@reddit
I’m getting around 80 tokens/sec on my 4090 at INT4 with speculative decoding on, but only 16k context.
Kindly-Cantaloupe978@reddit (OP)
A 4090 with 24G VRAM should do better than 16k context? I just saw a post saying that a 5090 laptop version with 24G can fit 75K.
ddog661@reddit
I am using vLLM and fp8 KV cache. It's pushing the limits of the 24GB VRAM buffer at that point. It's in line with the testing here: https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914
Kindly-Cantaloupe978@reddit (OP)
Check this out
https://www.reddit.com/r/Olares/s/saxSzYeXKU
ddog661@reddit
Thank you for that. This vLLM config is not too far off from what I'm using (except the context size of course, and I'm using gpu-memory-utilization 0.93). I might play around with it a bit more tonight. Looking more at those results, I notice that vLLM returned a KV pool size of 23,760 tokens, which is not far off from what my vLLM logs state. I don't know how 75,000 ctx is possible without turboquant.
ecompanda@reddit
The 218k context at 80 tps is the more impressive number here. Most setups start throttling hard past 64k because the KV cache hits memory bandwidth limits. NVFP4 with MTP is clearly doing a lot of heavy lifting to hold that flat. Have you tested the degradation past 150k, or does throughput stay consistent all the way out?
Kindly-Cantaloupe978@reddit (OP)
I ran a session with 11k *generated* tokens that averaged 78 tps.
Mother_Desk6385@reddit
Can GGUF run on vLLM?
some_user_2021@reddit
Nice pussy
mxmumtuna@reddit
Yes, but you would not want to.
Kindly-Cantaloupe978@reddit (OP)
It's experimental last I checked.
Barry_22@reddit
Impressive. Is it only possible with the NVFP4 quant? Because with AWQ it seems to not allow much context on 24GB, like, very little.
Kindly-Cantaloupe978@reddit (OP)
The model itself is ~19G, so on 24G you don't have much headroom for KV. I'm running it on 32G VRAM, which does leave a good amount of space for the KV cache.
Barry_22@reddit
Yep, you're right. Do you think TurboCache would help significantly here, for 27B on 24GB with vLLM?
Kindly-Cantaloupe978@reddit (OP)
It should... but I haven't had much success with turboquant on vLLM at the moment. It's not supported by the official branch, and the forks don't work very well for some reason.
Barry_22@reddit
Thanks! Eh, I guess for my 24GB build I shall wait for 3.6 9B lol
Kindly-Cantaloupe978@reddit (OP)
Do look into the KV cache calcs issue that was fixed in vllm. Check my other posts on 3.5-27B in the link above. This may get you a bit further.
Barry_22@reddit
Nice, appreciate it. Will try it out!
grizzlybear_jpeg@reddit
At what quantisation?
Kindly-Cantaloupe978@reddit (OP)
NVFP4 is 4bit?
grizzlybear_jpeg@reddit
Nice. I thought it was a GPU or something instead of the quant.
cell-on-a-plane@reddit
What is the command you are using? I'm having issues getting it to run on my 5090 with more than 80k context. I guess one of the 40,000 flags I have set is wrong.
fasti-au@reddit
Qwen 3.6 27B is better on llama.cpp atm, I think. I've been struggling to get the Genesis stuff running well here, but I'm on Ampere stacks, so I'm not the target market.
mxmumtuna@reddit
Why would it be better? It doesn’t even support tensor parallelism.