Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19
Posted by Kindly-Cantaloupe978@reddit | LocalLLaMA | 126 comments
Qwen3.6-27B has been out for a few days, and an NVFP4 quant with MTP was dropped earlier on HF: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 with a 218k context window via the latest vLLM 0.19 builds (vLLM 0.19.1rc1):
https://www.reddit.com/r/LocalLLaMA/comments/1sr8gyf/qwen3527b_on_rtx_5090_served_via_vllm_77_tps/
Optimal-Bass-5246@reddit
Following this article:
https://www.reddit.com/r/LocalLLaMA/comments/1stjx29/an_overnight_stack_for_qwen3627b_85_tps_125k/
I was able to get 155 tps with a 258K context window on 1x RTX 5090.
=== Warmup (3x) ===
w1 comp=1000 wall=19.42s 51.49 TPS
w2 comp=1000 wall= 8.11s 123.30 TPS
w3 comp=1000 wall= 8.46s 118.20 TPS
=== Narrative (3x, 1000 tok) ===
narr1 comp=1000 wall= 8.38s 119.33 TPS
narr2 comp=1000 wall= 8.13s 123.00 TPS
narr3 comp=1000 wall= 8.06s 124.07 TPS
=== Code (2x, 800 tok) ===
code1 comp=692 wall= 4.44s 155.86 TPS
code2 comp=462 wall= 3.05s 151.48 TPS
=== GPU state ===
0, 92 %, 29997 MiB, 32607 MiB, 402.53 W, 63
=== Last 3 SpecDecoding metrics (MTP accept) ===
(APIServer pid=1) INFO 04-25 14:10:16 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.60, Accepted throughput: 72.50 tokens/s, Drafted throughput: 136.20 tokens/s, Accepted: 725 tokens, Drafted: 1362 tokens, Per-position acceptance rate: 0.782, 0.533, 0.282, Avg Draft acceptance rate: 53.2%
(APIServer pid=1) INFO 04-25 14:10:26 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 76.79 tokens/s, Drafted throughput: 134.99 tokens/s, Accepted: 768 tokens, Drafted: 1350 tokens, Per-position acceptance rate: 0.782, 0.564, 0.360, Avg Draft acceptance rate: 56.9%
(APIServer pid=1) INFO 04-25 14:10:36 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.97, Accepted throughput: 89.39 tokens/s, Drafted throughput: 135.89 tokens/s, Accepted: 894 tokens, Drafted: 1359 tokens, Per-position acceptance rate: 0.837, 0.647, 0.490, Avg Draft acceptance rate: 65.8%
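Fun sanity check: the "Mean acceptance length" and "Avg Draft acceptance rate" in those logs follow directly from the per-position acceptance rates. A quick illustrative calc (my own arithmetic, not vLLM code):

```python
# Illustrative arithmetic tying together the SpecDecoding log fields
# above (not vLLM code). With 3 draft positions per decode step:
per_position = [0.782, 0.533, 0.282]   # from the first log line

# Mean acceptance length = 1 verified target token per step
# + the sum of the per-position draft acceptance rates.
print(1 + sum(per_position))           # ~2.60, matches "Mean acceptance length"

# Avg draft acceptance rate = accepted draft tokens / drafted tokens.
print(sum(per_position) / 3)           # ~0.532, matches "53.2%"
```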
andy2na@reddit
Are you still using TQ for the cache? It's bugged: output is bad and tool calling doesn't work. You have to switch to fp8 caching.
Read the updated information in the actual repo: https://github.com/noonghunna/qwen36-27b-single-3090
Optimal-Bass-5246@reddit
Using fp8_e5m2. Haven't tried turboquant_k8v4 yet. That is next on the agenda. Should improve quality, but will lower context.
andy2na@reddit
Don't bother until it's fixed:
https://github.com/vllm-project/vllm/issues/40831
Optimal-Bass-5246@reddit
Looks like tool calling has been fixed:
https://www.reddit.com/r/LocalLLM/comments/1sv6cqk/follow_up_tested_tool_calling_fixes_for_qwen/
Optimal-Bass-5246@reddit
Tool calling is totally fixed now. Hitting 160+ tps with the change to the tool call parser and chat template.
andy2na@reddit
Can you provide your full config?
I've tried various combinations (updated Genesis patches, cudagraph PIECEWISE, etc.), and while I can get 125k context and normal chats working at up to 90 tps on my 3090, anything that requires large context, like hermes or opencode, either doesn't respond or goes OOM.
whiteamphora@reddit
Can't recommend the article. I tried it, it's bugged, and I'm not sure it even works. Spent all day on it, and honestly, going with the basic Qwen setup was more worthwhile.
Optimal-Bass-5246@reddit
Article obviously does work or people would not be posting their results.
TheQuantumPhysicist@reddit
Noob question: how would using a vLLM server differ from using the LM Studio server?
I use LM Studio and I like it, and I'm wondering whether there is a gain in using vLLM.
Beginning-Window-115@reddit
LM Studio uses llama.cpp or MLX depending on what device you're on. If you have a Blackwell or a really decent GPU (probably Blackwell), you should be using vLLM; otherwise you're wasting potential.
Fit_Split_9933@reddit
I have to use Windows. Is there a way to use VLLM for this on Windows?
Beginning-Window-115@reddit
Use Docker.
Usual-Carrot6352@reddit
llama.cpp now also supports NVFP4 on both GPU and CPU.
Usual-Carrot6352@reddit
You should use Abiray-Qwen3.6-27B-NVFP4, not that one. Check the NVFP4 conversion branch abiray used.
Usual-Carrot6352@reddit
A better version could be the one from Red Hat, with more accuracy recovery relative to the original Qwen release: https://huggingface.co/RedHatAI/Qwen3.6-35B-A3B-NVFP4
TheQuantumPhysicist@reddit
What's the win in using NVFP4? Can you please elaborate for a noob?
BobbyL2k@reddit
If you have a 50-series GPU, it can compute NVFP4 at twice the speed of FP8. So, in theory, you get faster PP and higher batched TG.
2Norn@reddit
And if he's using Q4_K_M already?
BobbyL2k@reddit
I’m not sure of the current state of the CUDA kernels used by llama.cpp; it would need specialized ones to deliver the performance improvements I’ve mentioned.
As for your question: Q4_K_M already drastically reduces the model weights to ~4 bits per parameter, which brings most of the performance benefit you'd expect from NVFP4 being ~4-bit, especially single-user TG speed.
D2OQZG8l5BI1S06@reddit
In llama.cpp all compute is done in FP16 when the card supports it, even NVFP4. So you just skip the dequantization of Q4_K.
Ell2509@reddit
Dang. TIL. Gonna go push my 5070ti harder. MORE POWER. MORE INFERENCE. MORE PROFIT.
GPU screaming silently in the background like the God Emperor himself, light bleeding everywhere.
"Yes ImEll, everything you say is correct and very deep."
Beginning-Window-115@reddit
What the other guy said, but also NVFP4 is near FP8 in terms of quality.
Ell2509@reddit
Dang another TIL.
Hooks gpu up to local sub station and continues generating self affirming nonsense.
mxmumtuna@reddit
*sometimes. This particular one is not.
Usual-Carrot6352@reddit
The output quality is amazing alongside the speed. There are 1.7B NVFP4 quants on Hugging Face too; you can quickly run one and compare for your case.
Kindly-Cantaloupe978@reddit (OP)
Does it support MTP (which is fixed in the version I am using)? If it doesn't, then speed will be slower ...
Beginning-Window-115@reddit
Yes, but it'll be slower.
Usual-Carrot6352@reddit
Nope. Did you try it with ik_llama?
Usual-Carrot6352@reddit
ik_llama.cpp has also FP4 support, it's just called MXFP4 (type id 39) instead of NVFP4 (type id 40). In fact it has broader coverage — CPU (AVX2/NEON/Zen4), CUDA, and Metal all implemented, versus mainline's NVFP4 which is CUDA-only for now.
wolframko@reddit
llama.cpp and ik_llama.cpp do not support native FP4 compute right now; they dequantize to FP16 at runtime instead. Wait for this to be merged: https://github.com/ggml-org/llama.cpp/pull/22196
Until then, basic Q4 GGUF quants will be better in PP and ppl.
Usual-Carrot6352@reddit
Build b8925 (https://github.com/ggml-org/llama.cpp/releases) can load and run NVFP4 models, but use the latest build: b8929.
Dany0@reddit
The issue is that vLLM uses more VRAM. I can comfortably use 27B Q4 with Q8 KV cache in llama.cpp.
vLLM gets me at best 200k ctx max on Windows with WSL. I can get closer to the full ctx window by booting into Linux, but that has its own downsides.
Token generation without MTP is similar, and with self-speculative ngram decoding llama.cpp is competitive in some tasks.
So it's not a clear-cut "vLLM is always better" choice.
Beginning-Window-115@reddit
This is why you use NVFP4; it's literally 4-bit.
Dany0@reddit
NVFP4 keeps some layers in BF16; it takes up more VRAM than UD Q4...
TheQuantumPhysicist@reddit
I do use a 5090, actually. But why is vllm considered better for blackwell?
mxforest@reddit
The batching is superior. It doesn't allocate once and reuse a fixed block; it scales dynamically for each request, so you can have 1 request at 100k context or 10 at 10k. LM Studio recently introduced batching, but its throughput is way worse.
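The rough idea, as a toy sketch of paged KV allocation (this is the concept, not vLLM's actual internals):

```python
# Toy sketch of paged KV allocation: blocks are grabbed on demand per
# request instead of reserving the full max context up front.
BLOCK_TOKENS = 16

class PagedKV:
    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.tables = {}  # request id -> list of block ids

    def append_token(self, req: int, pos: int) -> None:
        table = self.tables.setdefault(req, [])
        if pos % BLOCK_TOKENS == 0:        # current block is full: grab a new one
            table.append(self.free.pop())

    def release(self, req: int) -> None:   # finished request returns its blocks
        self.free.extend(self.tables.pop(req))

# One request at 100k tokens or ten at 10k draw from the same pool;
# nothing sits reserved for contexts that never materialize.
pool = PagedKV(total_blocks=8192)
for pos in range(10_000):
    pool.append_token(req=0, pos=pos)
print(len(pool.tables[0]), "blocks used")  # 625 blocks for 10k tokens
pool.release(0)
```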
gusbags@reddit
If you have a good GPU / multiple GPUs, vLLM / SGLang are superior, but more fiddly to set up. llama.cpp / Ollama / LM Studio etc. are easier to work with and have higher compatibility, but still suck at batching and parallelism.
KallistiTMP@reddit
*Multiple identical GPU's.
For context, vLLM is mostly used for small to midscale commercial setups. It's heavily geared towards the GPU rich and squeezing performance out of large inference clusters. Nearly always deployed on a production Kubernetes GPU cluster.
Llama.cpp/Ollama/LM Studio are geared towards small-scale hobbyist use on consumer hardware. They're way better at running on CPU/RAM or in mixed-card situations. Those aren't very common in production clusters, because production clusters generally use GPU-rich uniform hardware.
Generally speaking, vLLM is designed to shine at larger scales. You can run itsy bitsy single user vLLM servers, and if you know how to set it up correctly you might be able to squeeze a little more performance out if you have Blackwell cards. But it is definitely going to be significantly harder to set up, and may not fare well long term because single-user setups are really not a big priority for vLLM. Same with funny quants like Q6_K_M, mismatched cards, CPU offload, etc. They're more focused on things like autoscaling, KV prefix routing, dynamic batching, RDMA networking, and all those sorts of things that matter a lot for large industrial-scale deployments, but don't matter at all for personal use.
I would honestly recommend sticking with the consumer stuff unless you have professional experience working with production GPU Kubernetes clusters. I actually do, and I mostly use Llama.cpp myself. I do use vLLM for some parts of my home automation setup, where the management benefits of Kubernetes outweigh the pain of setting up DCGM and nvidia container toolkit and all that.
gusbags@reddit
Yep, true on both identical GPU and setup complexity points. RE: complexity - this is less of an issue these days since you generally can find pre-baked dockerfiles or images custom made for your GPUs.
If your GPU has decent vllm support and goal is to extract maximum performance from a multi-GPU setup, vllm / sglang is probably worth investing time into.
DeepOrangeSky@reddit
What about SGLang? Is it also meant for multi-user use cases like vLLM, is it in its own separate niche, or is it in a more similar boat to llama.cpp?
ubrtnk@reddit
Can confirm. I run Qwen 3.6 35B on 2x 4080s in llama.cpp with a max of 131k and I get 100 t/s. Literally tested vLLM last night and got 160 t/s, but I could only get 8k context. The performance comes at a premium.
Kindly-Cantaloupe978@reddit (OP)
You need to apply the KV cache calc fix for vLLM. See my other post linked in the OP.
Puzzleheaded_Base302@reddit
Will this fix be upstreamed eventually?
Kindly-Cantaloupe978@reddit (OP)
IDK, but from the PR it looks like an NVIDIA upstream issue.
Deep90@reddit
Damn guess I should explore installing vllm again.
vr_fanboy@reddit
I nearly missed the Qwen3.6 27B greatness due to LM Studio slowness. I had a horrible experience with 3.5 27B (35B was fine though), with very slow PP and t/s, so I wasn't going to try 27B. Switched to vLLM, and after a day of fiddling (testing llama.cpp and many quants, with some issues with repetitions), I have 40 t/s at 128k context length on a single 3090 with turboquant. It's enough to replace Sonnet for many tasks.
1ncehost@reddit
vLLM is designed for maximum concurrent tok/s for multi-user use cases, llama.cpp is designed for maximum single stream tok/s for single-user use cases.
That is a simplification derived from where the projects originated, and it is mostly true today. They differ significantly in how they chop up work and in their underlying algorithms, which makes each inherently better for its intended use case.
DeepOrangeSky@reddit
What about SGLang? Where does it fit on that use-case spectrum, or what is its specialty supposed to be compared to those two?
Puzzleheaded_Base302@reddit
vLLM gives you prefix caching, much faster prompt processing, and arguably more stability (fewer crashes due to OOM?), but you might end up with a lower token generation rate. If you run openclaw, vLLM might work out better, since it requests long context all the time. The biggest advantage is concurrency: with more than 2 concurrent requests, vLLM produces dramatically more tokens when you run multiple jobs at the same time, with only a small penalty on single-query speed.
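If you want to see the concurrency effect yourself, here's a minimal sketch against a local vLLM endpoint (model name, port, and prompt are placeholders for whatever you serve):

```python
# Sketch: aggregate tokens/s at different concurrency levels against a
# local OpenAI-compatible vLLM server. Model name, port, and prompt
# are placeholders for whatever you are serving.
import asyncio
import time
from openai import AsyncOpenAI

async def one_request(client: AsyncOpenAI) -> int:
    resp = await client.chat.completions.create(
        model="Qwen3.6-27B",  # must match your served model name
        messages=[{"role": "user", "content": "Write a short story about a robot."}],
        max_tokens=500,
    )
    return resp.usage.completion_tokens

async def bench(concurrency: int) -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    wall = time.perf_counter() - start
    print(f"{concurrency:2d} parallel: {sum(counts) / wall:6.1f} aggregate tok/s")

for n in (1, 2, 4, 8):
    asyncio.run(bench(n))
```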
haenous-alistera@reddit
LM Studio is a great interface, but if you need to squeeze out as much performance as possible, llama.cpp is a much better bet. Ollama and LM Studio are easier to use, but at a cost. vLLM and SGLang are also better options, but IMHO for specific uses: we use vLLM in our multi-GPU multi-user setups and SGLang for production agentic swarms.
Jeidoz@reddit
Not sure about vLLM, but LM Studio, for some unknown reason, can only allocate 22GB of the 24GB available on my GPU, and it's a bit unintuitive how to use the "--fit" option to let MoE models offload optimally.
E.g., Qwen3.6 35B A3B Q6 is slow in LM Studio, but with compiled CUDA llama.cpp and this command it's much faster and uses all 24GB of VRAM:
Important_Quote_1180@reddit
Some features allow 3x the speed and/or compression with minimal loss. LM Studio is OK, but it’s the difference between a Segway and a BMW.
debackerl@reddit
The Segway is the better one in crowded city centers, right? 😂
mxmumtuna@reddit
Be careful with that quant. Its KLD isn’t great.
DistanceAlert5706@reddit
Are any KLD results available for NVFP4 quants? Also, are there any better ones under 20GB and with MTP?
Service-Kitchen@reddit
What does KLD stand for?
mxmumtuna@reddit
https://medium.com/@ncaraliceanews/transformer-fundamentals-understanding-the-kullback-leibler-divergence-kld-part-2-75f072534768
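If you want a number rather than a vibe, the usual recipe is per-token KL divergence between the full-precision and quantized models' next-token distributions. A minimal torch sketch, assuming you've already collected logits of shape [num_tokens, vocab_size] from both models over the same text:

```python
# Minimal sketch of per-token KLD between a reference (e.g. BF16)
# model and a quantized one, given logits from both on the same text.
import torch
import torch.nn.functional as F

def mean_token_kld(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    # KL(P_ref || P_quant), averaged over token positions.
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)
    kld = F.kl_div(quant_logprobs, ref_logprobs, log_target=True, reduction="none")
    return kld.sum(dim=-1).mean().item()
```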
Service-Kitchen@reddit
Thank you! :)
Internal-Shift-7931@reddit
The most interesting part here might not be the ~80 tok/s number itself, but what 218k usable context does to the local RAG tradeoff. For a lot of single-user local workflows, "just keep the whole working set in context" starts to become a real alternative to vector search. Not because it is always cheaper or more elegant, but because it avoids a whole class of chunking/retrieval failures. I would love to see a context-residency curve for this setup:
- prefill time at 32k / 64k / 128k / 218k
- decode speed after the cache is hot
- VRAM headroom at each context size
- answer quality on needle-in-haystack tests near the beginning/middle/end
- what happens with 2 concurrent users
If this holds up, the bigger story may be that local long-context serving changes app architecture, not just benchmark numbers.
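Something like this rough harness would produce the prefill leg of that curve (a sketch against a local OpenAI-compatible endpoint; the model name, port, and filler text are placeholders):

```python
# Sketch: the prefill leg of a context-residency curve, i.e. time to
# the first completion token at growing prompt sizes. The filler is a
# crude stand-in for a real working set; vary it (or disable prefix
# caching) so cache hits don't skew repeat runs.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
FILLER = "the quick brown fox jumps over the lazy dog. "  # ~10 tokens

for target in (32_000, 64_000, 128_000, 218_000):
    prompt = FILLER * (target // 10) + "\nSummarize the above in one line."
    start = time.perf_counter()
    client.chat.completions.create(
        model="Qwen3.6-27B",  # must match your served model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,         # stop right after prefill
    )
    print(f"~{target} tok prompt: {time.perf_counter() - start:.1f}s to first token")
```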
Sovex66@reddit
What's your use case? Coding? Chat?
Kindly-Cantaloupe978@reddit (OP)
coding primarily
benno_1237@reddit
A 218k context window is nice, but which prompt length did you use for testing? Speed doesn't really change with the context window setting, but with the actual context you use.
Tools like opencode etc. go up to ~30-40k context immediately, so that's the minimum prompt length you should benchmark against IMO (if you are coding with it; different story for creative writing etc.).
FortiTree@reddit
This. You need to compare apples to apples based on prefilled cache, and also warm KV cache (saving the prefill overhead) vs. cold KV cache (needing to prefill every time).
There are also two token speeds that are very different from each other: prompt processing (PP), bound by GPU compute, and token generation (TG), limited by memory bandwidth.
For example, for the same Qwen3.6-35B-A3B-Q4KM:
Generation (output) speed, decoding 1 token at a time, is similar for Spark and Halo:
- DGX Spark: mem bandwidth 273 GB/s -> TG 55 t/s
- Strix Halo: mem bandwidth 256 GB/s -> TG 50 t/s
- Mac M3 Ultra: mem bandwidth 800 GB/s -> TG 85 t/s (hitting a CPU bottleneck)
Prompt processing, on the other hand, is night and day:
- DGX Spark (CUDA): PP 1700+ t/s
- Strix Halo (Vulkan/ROCm): PP 300+ t/s
- Mac M3 Ultra (MLX): PP 1500 t/s
So if you need speed for agentic handoff, the DGX Spark is better, but at twice the price and 3x the power cost. Otherwise, Strix Halo is the sweet spot for some. A Mac M3U or M5U is actually the best of both worlds.
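Those TG numbers line up with a crude bandwidth-bound estimate: each decoded token streams every active weight byte through memory once. A back-of-envelope sketch (the sizes are rough assumptions for a ~3B-active MoE at ~4.5 bits/param, and real-world numbers land at a fraction of the ceiling):

```python
# Crude upper bound on decode speed from memory bandwidth alone: every
# generated token streams all *active* weight bytes once. Observed
# speeds sit well below this (KV reads, kernel overhead, CPU limits).
active_params = 3e9                      # rough assumption: ~3B active
active_bytes = active_params * 4.5 / 8   # ~1.7 GB touched per token at Q4KM

for name, bw_gbs in [("DGX Spark", 273), ("Strix Halo", 256), ("Mac M3 Ultra", 800)]:
    print(f"{name}: bandwidth ceiling ~{bw_gbs * 1e9 / active_bytes:.0f} t/s")
```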
benno_1237@reddit
However, for OP's setup, the RTX 5090 is most likely the best choice. You are not compute limited on any of them; you are memory limited. And the ~1.8TB/s of the 5090 is actually insane.
FortiTree@reddit
Oh ya 80 t/s TG on Qwen 27B is insanely fast. I can only get 12 t/s max on Strix Halo. Dedicated GPU is definitely better for dense model. Unified memory is only usable with MoE.
Maybe Mac M5 Ultra can break that barrier or something else next year.
benno_1237@reddit
Given that the M5 Max is 614 GB/s and the M5 Ultra is most likely (is there info on that yet?) just two Maxes slapped together, it's still about 1/3 off of the 5090.
FortiTree@reddit
Just speculation for now, but some say the M5U can get past 1TB/s easily, maybe closer to 1.5TB/s, so not far off from the 5090's 1.8TB/s.
Weird_Search_4723@reddit
30-40k sounds like your setup's problem more than opencode's, though opencode's system prompt is large. I'd suggest either pi or https://github.com/0xku/kon (mine) if you are looking for something lightweight.
benno_1237@reddit
You are right, I should have clarified a bit more. What I meant is that I rarely find myself below that in any multi-turn conversation in opencode. That can surely be improved using various software tools.
My point however was that it is useless to measure generation speed without specifying how you measure it.
Most likely in OP's case, context doesn't affect the performance that badly, but still.
vr_fanboy@reddit
I'm using https://github.com/yvgude/lean-ctx and https://github.com/JuliusBrussee/caveman with PI (they can be used in opencode too); they work really well to lower the context requirement. How you use the context also matters: separate planning from implementation, etc.
On-demand MCP/tool loading plus CLI-like tools are a must in local deployments. We are at a point where we can actually work with local LLMs.
benno_1237@reddit
I am not saying it doesn't work. I just think that most performance numbers people post on here are misleading.
You are completely correct, there are awesome tools for saving context available. Still, if you do some multi-turn edits, you will hit a context length that starts to matter.
But the Qwen3.5/3.6 series is a beast at context management, so most likely it's not as significant as with older models.
gatewaynode@reddit
"Tools like opencode etc go up to ~30-40k context immediately"
That's probably your custom setup, I don't see that at all. For me opencode is very lean on token use.
benno_1237@reddit
Yeah, that was a bit of an exaggeration on my end. Still, the default system prompt is like ~10k, so without any kind of optimization you hit 30-40k quickly.
Kindly-Cantaloupe978@reddit (OP)
I ran a session that generated 11k tokens, and the average was 78 tps. That is based on the metrics that vLLM provides via the /metrics endpoint.
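If anyone wants to reproduce the measurement, here's roughly how to sample it (a sketch; the vllm:generation_tokens_total counter name matches recent vLLM builds but may differ across versions):

```python
# Sketch: estimate average TG speed by sampling vLLM's Prometheus
# /metrics endpoint twice and dividing the counter delta by time.
import re
import time
import urllib.request

URL = "http://localhost:8000/metrics"  # placeholder port

def generation_tokens() -> float:
    text = urllib.request.urlopen(URL).read().decode()
    # e.g. 'vllm:generation_tokens_total{model_name="..."} 12345.0'
    match = re.search(
        r"^vllm:generation_tokens_total(?:\{[^}]*\})?\s+([0-9.eE+-]+)", text, re.M
    )
    return float(match.group(1))

t0, n0 = time.time(), generation_tokens()
time.sleep(30)  # run your session in the meantime
t1, n1 = time.time(), generation_tokens()
print(f"avg {(n1 - n0) / (t1 - t0):.1f} tok/s over {t1 - t0:.0f}s")
```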
car_lower_x@reddit
I have a 5090, but why would I use NVFP4? It’s just a heavily quantized modern format. Sure it’s fast, but because ...
BitterProfessional7p@reddit
Models only start to get stupid below 4-bit quantization. Down to 4 bits, they are practically unchanged.
Comparing BF16 to a 4-bit GGUF or NVFP4 usually shows a 1-3% decrease in benchmark performance and a 100-300% speedup depending on the context length. 4-bit quantization seems to be the sweet spot between quality and performance.
amemingfullife@reddit
Do you have any bench comparisons or KL divergence stats etc? I’ve heard really different and hand wavey opinions on the quality comparisons between NVFP4 and FP8 and above. I get the theory, just would be nice to see some sources.
Still-Notice8155@reddit
Maybe NV means optimized for NVIDIA, and it's FP4; IDK if it's the same as Q4...
Kindly-Cantaloupe978@reddit (OP)
It's 4-bit but very close to FP8 quality, per NVIDIA's post.
some_user_2021@reddit
But this is not a file provided by Nvidia, and the model was not trained in 4 bit
Kindly-Cantaloupe978@reddit (OP)
FWIW, this is the NVIDIA post that talks about NVFP4:
https://build.nvidia.com/spark/nvfp4-quantization
Kindly-Cantaloupe978@reddit (OP)
The trade-off is model size vs. KV cache headroom. You can go with higher quants, but at the expense of less room for the KV cache. Turboquant doesn't deliver much gain on vLLM for some reason with my setup. If there's a better setup with turboquant enabled, then even better.
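To put rough numbers on that headroom trade-off, a back-of-envelope KV sizing sketch (the layer/head dims below are illustrative assumptions, not the actual Qwen3.6-27B config):

```python
# Back-of-envelope KV cache sizing. Bytes per token =
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# The dims below are illustrative assumptions, NOT the real 27B config.
layers, kv_heads, head_dim = 60, 8, 128
budget_gb = 32 - 19  # VRAM minus ~19G of weights, ignoring activations/overhead

for name, elt_bytes in [("fp16 cache", 2), ("fp8 cache", 1)]:
    per_token = 2 * layers * kv_heads * head_dim * elt_bytes
    max_tokens = budget_gb * 1024**3 / per_token
    print(f"{name}: {per_token // 1024} KiB/token -> ~{max_tokens / 1000:.0f}k tokens of KV")
```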
SnooPaintings8639@reddit
I am getting ~57 tps with the same max context at FP8, using an old and tried setup of 2x RTX 3090. Not sure about the speed with 90%+ of the context used.
When I switch to AWQ INT4, I am getting ~65-70 tps.
Two 3090s are half the price of a single 5090, total twice the amount of VRAM, and are still very competitive when run in tensor parallel mode. I just wish I had NVLink on top of them to push them even further.
IrisColt@reddit
Thanks!!!
oxygen_addiction@reddit
Try with DFlash as well. You can also quantize the model to Q8 without acceptance rate changes.
No_Algae1753@reddit
What is dflash?
oxygen_addiction@reddit
What is google?
Artistic_Okra7288@reddit
Confused, Google took me right to this post.
No_Algae1753@reddit
I forgot that I'm on reddit where I'm not allowed to ask
meatmanek@reddit
it's a form of speculative decoding where you use a diffusion language model as your draft model.
With speculative decoding, you use a faster model to draft a few tokens ahead, and then use the main model to verify (accept/reject) those draft tokens. Verification with the main model is typically much faster than generation, so if your draft model is both fast and has a good acceptance rate, you can see decent speedups. If your draft model is slow or you have a low acceptance rate, then the added compute of the draft model + verification can slow you down.
Traditionally you'd use a smaller model in the same family, like Qwen3.5 2B as the draft model for Qwen3.5 27B, but MTP and DFlash are newer variations. With MTP, the main model ships with a few added layers trained to predict tokens from the model's internal state. Since it has access to the main model's internals, it can presumably be smaller (cheaper to run) than a separate draft model of the same accuracy.
DFlash uses a diffusion model, which is already supposed to be very fast relative to autoregressive (standard) models.
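The expected speedup is simple enough to sketch (illustrative numbers only; the draft-cost ratios are assumptions):

```python
# Rough speculative-decoding speedup model: each verify step costs one
# target-model forward pass plus k draft passes, and yields
# `mean_accept_len` tokens on average. Numbers are illustrative.
def spec_speedup(mean_accept_len: float, k: int, draft_cost: float) -> float:
    # draft_cost: cost of one draft pass relative to one target pass
    # (near zero for MTP heads, larger for a separate draft model).
    return mean_accept_len / (1.0 + k * draft_cost)

print(spec_speedup(2.6, k=3, draft_cost=0.05))  # ~2.3x for MTP-style heads
print(spec_speedup(2.6, k=3, draft_cost=0.3))   # ~1.4x for a heavier draft model
```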
No_Algae1753@reddit
Thank you very much! Very nice explanation. I was thinking about using spec decoding on Qwen 27B. Is there something you would recommend for that kind of setup?
meatmanek@reddit
idk I'm also looking for a good SD setup for qwen3.6-27b. MTP isn't supported yet on llama.cpp or mlx-lm, so I haven't gotten it working on apple silicon yet. There are some PRs to add support, but a lot of the quants drop the MTP layers so you need to find a quant with the MTP layers intact. One that supposedly has them didn't work with the PR I tested.
No_Algae1753@reddit
Guess we will have to be patient for a while
Kindly-Cantaloupe978@reddit (OP)
Is there a speed gain, or the same speed but a higher context window on the same GPU?
oxygen_addiction@reddit
Speed gain. I'm hitting 1.3x-2x more tokens per second depending on acceptance rate.
Kindly-Cantaloupe978@reddit (OP)
That's quite a bump. What's your setup? Do you have a recipe to share?
oxygen_addiction@reddit
I only just got it running last night on a forked llama.cpp. Haven't tested with vLLM yet.
RTX 5090 + RTX 5080. Haven't tried putting the draft model on the second GPU.
DTree should improve speed even more in the coming months.
Kindly-Cantaloupe978@reddit (OP)
I only have a single 5090RTX so it might not work for me ...
oxygen_addiction@reddit
Oh, something to note is that I was not testing with NVFP4.
Results may vary, though I doubt it.
specify_@reddit
I tried DFlash with Qwen 3.6 35B-A3B and was disappointed with the token throughput at long context (>50k). It seems DFlash is only good at low context, and draft acceptance worsens at longer contexts, making it slower than MTP.
oxygen_addiction@reddit
I had it fully on the 5090.
It eats up like 500MB for the draft and a few hundred for context (which I think could be capped harder).
ddog661@reddit
I’m getting around 80 tokens/sec on my 4090 at INT4 with speculative decoding on, but only 16k context.
Kindly-Cantaloupe978@reddit (OP)
A 4090 with 24G VRAM should do better than 16k context? I just saw a post saying that a 5090 laptop version with 24G can fit 75K.
ddog661@reddit
I am using vLLM and fp8 KV cache. It's pushing the limits of the 24GB VRAM buffer at that point. It's in line with the testing here: https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914
Kindly-Cantaloupe978@reddit (OP)
Check this out
https://www.reddit.com/r/Olares/s/saxSzYeXKU
ddog661@reddit
Thank you for that. This vLLM config is not too far off from what I'm using (except the context size of course, and I'm using gpu-memory-utilization 0.93). I might play around with it a bit more tonight. Looking more at those results, I notice that vLLM returned a KV pool size of 23,760 tokens, which is not far off from what my vLLM logs state. I don't know how 75,000 ctx is possible without turboquant.
ecompanda@reddit
The 218k context at 80 tps is the more impressive number here. Most setups start throttling hard past 64k because the KV cache hits memory bandwidth limits. NVFP4 with MTP is clearly doing a lot of heavy lifting to hold that flat. Have you tested the degradation past 150k, or does throughput stay consistent all the way out?
Kindly-Cantaloupe978@reddit (OP)
I ran a session with 11k *generated* tokens that averaged 78 tps.
Mother_Desk6385@reddit
Can GGUF run on vLLM?
some_user_2021@reddit
Nice pussy
mxmumtuna@reddit
Yes, but you would not want to.
Kindly-Cantaloupe978@reddit (OP)
It's experimental last I checked.
Barry_22@reddit
Impressive. Is it only possible with the NVFP4 quant? Because with AWQ it seems to not allow much context on 24GB, like, very little.
Kindly-Cantaloupe978@reddit (OP)
The model itself is ~19G, so on 24G you don't have much headroom for KV. I'm running it on 32G VRAM, which does leave a good amount of space for the KV cache.
Barry_22@reddit
Yep, you're right. Do you think TurboCache would help significantly here, for 27B on 24GB with vLLM?
Kindly-Cantaloupe978@reddit (OP)
It should... but I haven't had much success with turboquant on vLLM at the moment. It's not supported by the official branch, and the forks don't work very well for some reason.
Barry_22@reddit
Thanks! Eh, I guess for my 24GB build I shall wait for 3.6 9B lol
Kindly-Cantaloupe978@reddit (OP)
Do look into the KV cache calcs issue that was fixed in vllm. Check my other posts on 3.5-27B in the link above. This may get you a bit further.
Barry_22@reddit
Nice, appreciate it. Will try it out!
grizzlybear_jpeg@reddit
At what quantisation?
Kindly-Cantaloupe978@reddit (OP)
NVFP4 is 4bit?
grizzlybear_jpeg@reddit
Nice. I thought it was a GPU or something instead of the quant.
cell-on-a-plane@reddit
What is the command you are using? I'm having issues getting it to run on my 5090 with more than 80k context. I guess one of the 40,000 flags I have set is wrong.
fasti-au@reddit
Qwen 3.6 27B is better on llama.cpp atm, I think. I've been struggling to get the Genesis stuff running well here, but I'm on Ampere stacks, so I'm not the target market.
mxmumtuna@reddit
Why would it be better? It doesn’t even support tensor parallelism.