What tokens/sec do you get when running Qwen 3.5 27B?
Posted by thegr8anand@reddit | LocalLLaMA | View on Reddit | 194 comments
I have a 4090 and just 32GB of RAM. I wanted to get an idea of what speeds other users get with 27B. I see many posts where people say X tokens/sec but not the max context they use.
My setup is not optimal. I'm using LM Studio to run the models. I have tried Bartowski Q4_K_M and Unsloth Q4_K_XL and speeds are nearly identical for both; it mostly depends on the context size I use.
With a smaller context under 50k, I get between 32-38 tokens/sec. But the max I can run on my setup is around 110k, where the speed drops to 7-10 tokens/sec because I need to offload some of the layers (54-56 on GPU out of 64). Under 50k context, I can load all 64 layers on GPU.
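For anyone replicating this outside LM Studio, a roughly equivalent llama-server launch might look like the following. This is a sketch only: the model filename is an assumption, and the right --n-gpu-layers split depends on your VRAM.

```shell
# All 64 layers on GPU works up to ~50k context on 24GB (hypothetical paths):
llama-server -m Qwen3.5-27B-Q4_K_M.gguf -c 50000 --n-gpu-layers 64 --flash-attn on

# ~110k context forces a partial offload, roughly 54-56 layers on GPU:
llama-server -m Qwen3.5-27B-Q4_K_M.gguf -c 110000 --n-gpu-layers 55 --flash-attn on
```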
MushroomCharacter411@reddit
I know this is potato hardware, but maybe it will provide a data point.
I have an RTX 3060 in an i5-8500 system with 48 GB of RAM. I get about 1.8 to 2.0 tokens/sec out of the Q4_K_M quantization of 27B. However, one thing that mitigates this is that it doesn't seem to slow down much as the context window fills up, unlike 35B-A3B (also Q4_K_M) which starts much faster (6 to 8 tokens/sec) but by the time I hit 30k in the context window, it also is below 2 tokens/sec while the 27B would only have degraded to 1.5 or 1.6 tokens/sec. So the harder I push them, the less of a speed disadvantage 27B has.
dibu28@reddit
Try Qwen3.5 35B model with 2bit quant. I'm getting better speed with rtx 2060 12gb if model fits in vram.
MushroomCharacter411@reddit
I might if I hadn't already migrated to the new hotness, Gemma 4.
Try Gemma 4-26B-A4B at IQ4_XS. You'll have to split it partly onto the CPU, but the speed will actually be tolerable, and it's meaningfully smarter than Qwen 3.5 at i1-Q4_K_M.
[Path]\llama-server.exe -m "[Path and model filename].gguf" --mmproj "[Path and mmproj filename].gguf" --chat-template-file "[Path]\chat_template.jinja" -c 131072 --n-cpu-moe 12 --cache-type-k q5_1 --cache-type-v q5_1 --flash-attn on --reasoning-budget 1792
The reasoning budget is to prevent it from getting stuck in thought loops. If you find it's cutting off legitimate reasoning, increase it as necessary. The .jinja is mandatory with llama.cpp, otherwise it sometimes misinterprets the tags. If you want the full context window, increase --n-cpu-moe to 24 and -c to 262144, and it won't get *that* much slower.
dibu28@reddit
I will try Gemma 4-26B-A4B
Also, Qwen3.6 35B came out yesterday.
MushroomCharacter411@reddit
I have been playing with this for days now, and if you use Q5_1 for the K and V caches, you won't be able to use the entire 262144 context window because the model is going to go mad before you get there. You're better off cutting the context window in half and using the VRAM that frees up to push some more layers onto the GPU. I tried using Q8_0 for the K cache to potentially forestall the madness, but that was just murder on the performance.
The specific forms of madness I've found Gemma 4 26B-A4B to exhibit are:
• Hanging after one token of output. The 100% reliable workaround for this is to stop the current reply in progress, and edit my last prompt to include a completely blank image. The presence of the image means that the "hang" will clear itself after a few seconds, but it won't inject any unintended meaning to the conversation.
• Malformed <think> and/or </think> tags. When this happens, the usual result is that the actual reply gets left inside the Reasoning box, which I then have to expand to read the reply.
• Mangling words. It is generally obvious from context what the screwed up word was supposed to be, but it's still a sign that the model is going to start exhibiting odd technical glitches, if it hasn't already. I think this is actually the same problem as the previous point, just with a slightly different symptom.
And you really want to set a Reasoning budget. I've found 2048 to be a good balance. If it's still thinking at that point, it's almost certainly stuck in a thought loop. Legitimate reasoning does seem to require as much as 1800 or 1900 tokens at times so I leave a little bit of wiggle room.
MushroomCharacter411@reddit
I am finding that the MoE architecture means it's not so critical that everything fits on the CPU, and there isn't that much of a speed hit using the (noticeably smarter) i1-Q5_K_M quantization. On the other end, IQ3_M isn't much of a step down from IQ4_XS. Gemma 4 26B-A4B is remarkably adaptable to being run on potatoes of all grades.
I spent a whole day on trying to get Gemma 4 31B to run at an acceptable rate, and failed. I was never able to get more than 3 tokens per second. Nonetheless, it accurately diagnosed the problems I was having with 26B and gave me ideas on how to resolve the problems. At the end, it was telling me "your hardware is too potato for me, but I'm fine with you using my little brother instead, and your hardware is actually adequate for that". (Well, not literally in those words, but that was the meaning.)
coder543@reddit
You need to use --n-cpu-moe with llama-server... you should be getting substantially more speed than that.
MushroomCharacter411@reddit
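For MoE models, --n-cpu-moe keeps attention and shared weights on the GPU while pushing expert tensors to CPU RAM. A minimal sketch (model filename and the layer count of 12 are assumptions; raise the number until the model fits):

```shell
# Offload the expert tensors of the first 12 MoE layers to CPU,
# keep everything else on the GPU:
llama-server -m Gemma-4-26B-A4B-IQ4_XS.gguf -c 131072 \
  --n-gpu-layers 999 --n-cpu-moe 12 --flash-attn on
```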
Holy crap, you weren't kidding. It's now over 20 tokens/sec. Thank you!
Additional_Ad_7718@reddit
Even at 20T/s doesn't the reasoning take like minutes to finish? Or am I doing something wrong?
MushroomCharacter411@reddit
You need to use the recommended penalties and other settings as provided by the Qwen devs. Default settings lead to neurotic reasoning where the model constantly "But wait!"s itself into a spiral from which it sometimes doesn't escape.
(Quote blocks are disabled so I just have to paste it in as is. Everything below this line is their words, not mine.)
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
We suggest using the following sets of sampling parameters depending on the mode and task type:
Non-thinking mode for text tasks: temperature=1.0, top_p=1.00, top_k=20, min_p=0.0, presence_penalty=2.0, repetition_penalty=1.0
Non-thinking mode for VL tasks: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for text tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for VL or precise coding (e.g., WebDev) tasks: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
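With llama-server these recommendations map directly onto sampling flags; for example, thinking mode for text tasks from the table above (the model filename is an assumption):

```shell
# Thinking mode, text tasks: values taken from Qwen's recommended settings
llama-server -m Qwen3.5-27B-Q4_K_M.gguf \
  --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 1.5 --repeat-penalty 1.0
```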
Additional_Ad_7718@reddit
Hey thanks so much for sharing this.
LM studio doesn't have a presence penalty yet but I heard it was in the beta version of the software. Maybe I should just get my butt off the couch and install llama.cpp
snmnky9490@reddit
Yes it does have presence penalty. I just used it yesterday on lm studio
thegr8anand@reddit (OP)
Is it in beta? Because the stable version only shows Repeat penalty for me.
snmnky9490@reddit
It shows up as being added in the 0.4.7 build 1 release notes.
Right now I have 0.4.7 build 2
KURD_1_STAN@reddit
Q5_K_M from aesidai gets me 27 t/s at the start with a 3060 on LM Studio, so you should be getting more like ~30 than 20+. Not sure how llama.cpp works, but there might be a flag so you keep some layers on the GPU instead of offloading everything to CPU; I went from 17 to 27 t/s when going from 3.5GB to 11GB of VRAM used.
MushroomCharacter411@reddit
I think the bottleneck here is that the i5-8500 and DDR4 RAM run at extremely conservative speeds because it's an old office PC.
Icy_Butterscotch6661@reddit
You should vibe code something to automatically do this swapping for you maybe
Unlucky-Message8866@reddit
unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL runs at ~55tok/s native context size on my 5090 rtx
dibu28@reddit
RTX 5070, llama.cpp, Qwen3.5 35B A3B at 2-bit: I get 112 TPS (bartowski quant, 20k context, 9GB model size).
stormy1one@reddit
Same hardware setup. Decided to take a little tiny hit on performance and switch to vLLM for the NVFP4 support on Blackwell, and avoid the constant context reloading from llamacpp. Sooo smooth now with 78% KV cache hits doing Python/TypeScript dev.
_TeflonGr_@reddit
What model are you using and with what config? I tried running a couple of Qwen3.5 NVFP4 ones and vLLM was not working on my Blackwell cards; it wasn't recognizing the model's structure or NVFP4 support even on the nightly builds.
stormy1one@reddit
Understandable - vLLM is cranky beyond belief; it took me a while to find a NVFP4 variant that works. I'm using Kbenkhaled/Qwen3.5-27B-NVFP4, and the config someone posted here: https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4/discussions/1
_TeflonGr_@reddit
Thank you so much. Have you tried other NVFP4 models from that same publisher? I am especially interested in running dual 9B models. I'll try them later with the same config as I guess they're similar enough.
UmpireBorn3719@reddit
"Running with vllm on 5090 with about 60 T/s." <-- nvfp4
I use Unsloth Q5 UD + llama cpp around 5xT/s.
Unlucky-Message8866@reddit
Thanks for the tip! The constant cache invalidation is def annoying, gonna try vllm
stormy1one@reddit
Yeah it was driving me insane. I was going to say that tg performance is actually not that different from llamacpp - but cached pp performance is off the charts once you warm up and are close to full context. My peak pp/s was nearly 8000 tok/s with 59k in the KV cache, 78% prefix hit rate. Night and day difference. It’s my daily driver right now
DealingWithIt202s@reddit
Mind sharing your launch command?
stormy1one@reddit
Sure - I'm using this variant https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4 and the config located on one of the comments on that model: https://huggingface.co/Kbenkhaled/Qwen3.5-27B-NVFP4/discussions/1
Murinshin@reddit
Makes me seriously consider upgrading my 4090 😭
Own_Attention_3392@reddit
What is vllm doing differently to manage the cache? I'm curious because it seems like implementing whatever is going on there in llama.cpp would be a no-brainer.
Dany0@reddit
same setup but with latest llamacpp I get 50-60tok/s on the first ~4k context but just 8k context crashes down to 5tok/s and 30k context never responded
am I doing smth wrong? 9950x3D
llama-server --hf-repo Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF --hf-file Qwen3.5-27B.Q4_K_S.gguf --host 0.0.0.0 --port 8080 --threads 16 --cpu-strict 1 --n-gpu-layers 999 --ctx-size 65536 --batch-size 4096 --ubatch-size 1024 --flash-attn on --cache-type-k bf16 --cache-type-v bf16 --swa-full --cont-batching --no-mmap --jinja --chat-template-kwargs '{"reasoning_effort": "high"}' --prio 2
Sh1d0w_lol@reddit
Dude thanks for the heads up. I also have rtx 5090, just changed runtime to vllm and it flies! Running Qwen3.3-27B Q6_K_XL with no quantization to KV cache. Getting 50 tps but there is no prompt processing time now, answer is almost instant!
HallTraditional2356@reddit
Do you run qwen 3.5 27B Q6 K XL on vllm ? Single or dual gpu ?
HallTraditional2356@reddit
Nice man! Which model do you use there with vllm? And can you post your launch args? I run a 5090 as well. Thx in advance!
eaz135@reddit
I can confirm these numbers. I ran the same model on my 5090 with almost identical results.
Igot1forya@reddit
On my DGX Spark if I run Qwen 3.5 27B natively via llama.cpp on Linux with zero optimizations I get 10t/s and if I run the same model on my 3090 I get 34t/s. Running a hybrid between the 3090 and GB10 (Spark) I get 12t/s.
estrafire@reddit
How did you benchmark the 34 t/s? Did you build llama.cpp from main? I did a build with CUDA and I'm seeing 22-24 t/s on the metrics endpoint.
Igot1forya@reddit
I am new to llama.cpp, but I downloaded it (git clone) and compiled from source (took like 20 tries to get it to work with sm_86 + sm_121). I had to set up two concurrent instances of llama.cpp, one for GPU0 and one for GPU1. Tested with a large PDF and a 16K prefill that I then ran a query against. I then set one to advertise the RPC service on one port and the other on a neighboring port.
I then wrote a custom Python script to boot one as a listener and one as the master node, split some layers to use the RPC service, and left the KV cache on the 3090; then I did a test where the KV was on the GB10. Depending on who was master and who was listener, the t/s swung wildly, depending also on the model size.
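llama.cpp ships an RPC backend for exactly this kind of two-box split; a minimal sketch of the pattern (hostnames, ports, and the tensor-split ratio are assumptions):

```shell
# On the listener box (e.g. the Spark), expose its GPU over RPC:
rpc-server -H 0.0.0.0 -p 50052

# On the master box (the 3090), pull in the remote backend and split layers:
llama-server -m Qwen3.5-27B-Q4_K_M.gguf --rpc 192.168.1.20:50052 \
  --n-gpu-layers 999 --tensor-split 0.6,0.4
```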
Since this post I've made some additional tweaks and refinements, and I should mention that my GB10 is NOT stock any more, as I've modded it some. I've added some additional cooling: the GB10 never exceeds 74C and I regularly see 104W from the GPU (the 3090 never exceeds 61C at 350W load). I'm also working on a method to reduce the stress on the OCuLink bandwidth, and I've had some limited success compressing the communication stream with idle GPU compute (since it's waiting for the layer to pass anyway). But this is a work in progress, and I'm hoping in the next few weeks to have something tangible that could even be useful over a 10/25Gb Ethernet link.
estrafire@reddit
hey super thanks for the reply! I forgot I had an undervolt; I got 36 t/s after removing it on the single 3090, incredible for Q4_K_M at the full 262k context
Igot1forya@reddit
Ah yes, that would do it! Yeah, the 3090 is such a special GPU it's up there with the 1080TI as one of the greatest cards to ever come out.
estrafire@reddit
Agreed! It was such a great investment some years ago. I'm also happy that Vulkan is catching up to CUDA, got 32 t/s on it, what a year
YourVelourFog@reddit
Macbook M1 Pro w/ 32GB of memory running base Qwen3.5-27B on LMStudio with a 4096 context window and I'm getting 5 TPS.
ImagineWealth@reddit
m1 max, 8-10t/s at 16k context, not usable.
itch-@reddit
M5 macbook, 7 t/s. Not a lot of GPU cores in this one
YourVelourFog@reddit
How much ram do you have?
itch-@reddit
32GB. I had set more context but reran it with 4000 and it was still 7 t/s. Not an M5 pro, just M5, I didn't choose it
I had an M1 Pro with 16GB that I can't compare on the 27B, but it ran the 9B faster than this M5: 25 t/s on the M5 vs 29 t/s on the M1 Pro.
0.8B: 178 t/s on M5 and 98t/s on M1 pro, so there's that
IntelVEVO@reddit
28 tok/s q5_k_m 12k context on a 5090 laptop.
Minimum-Lie5435@reddit
65 TPS (Dual 3090's NVLink) on Cyankiwi 4 bit awq
Numerous_Mulberry514@reddit
At 0 context ~ 30 tokens with ik llama cpp and graph parallel using two rtx 3060. At around 20k context it slows down to 24-25tps
overand@reddit
Dual 3090s only has me at ~36, so don't hate on that dual 3060 setup too much!
Minimum-Lie5435@reddit
gotta try cyankiwi/Qwen3.5-27B-AWQ-4bit with vLLM - getting 60TPS-65TPS (Dual 3090 NVLink)
Numerous_Mulberry514@reddit
I'm very happy actually! I can run so much more at a very acceptable speed, which I hadn't thought would be possible!
getmevodka@reddit
F16 about 33tok/s with rtx 6000 pro blackwell
Makers7886@reddit
Man that's sweet in a single card - I am seeing 41t/s FP16 4x3090s w/196k context taking up all of 4x3090s using vLLM.
RomanticDepressive@reddit
Are you using NVLINK?
Makers7886@reddit
No I only have 1 nvlink and actually don't even use it right now. It's an epyc 7502 romed8-2t with the 7th pcie slot bifurcated into two gpus to give me the 8 total on that rig.
twack3r@reddit
That’s very impressive! What launch parameters do you use?
Makers7886@reddit
Was testing it vs gguf and exl3 versions so didn't "iron it out" but this is the gist:
getmevodka@reddit
That's great too. Yeah, I only have the Max-Q variant, so I'm happy with the speed and power use. Running full context though.
nunodonato@reddit
what? thats what I get with FP8!! i thought the performance would be much more different. vllm, right?
thegr8anand@reddit (OP)
How much context are you able to use with your setup?
twack3r@reddit
The native full ctx and there is a lot left, around 30GiB of VRAM remain available.
getmevodka@reddit
Full context length
coder543@reddit
Some people get strangely upset at the concept, but you can use q8 KV cache and fit about 200k context on that graphics card, with minimal quality loss. The people claiming you need bf16 kv cache don’t make any sense to me.
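A rough back-of-envelope shows why q8 roughly doubles usable context: KV memory grows linearly with context length and with bytes per element. The layer and head counts below are assumptions for a hypothetical 64-layer GQA model, and q8_0 actually carries a small block-scale overhead on top of 1 byte per element.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

# Hypothetical 64-layer GQA model with 8 KV heads of dim 128
f16 = kv_cache_bytes(110_000, 64, 8, 128, 2)  # 16-bit cache
q8 = kv_cache_bytes(110_000, 64, 8, 128, 1)   # ~8-bit cache, ignoring block-scale overhead
print(f"f16 KV at 110k ctx: {f16 / 2**30:.1f} GiB")
print(f"q8  KV at 110k ctx: {q8 / 2**30:.1f} GiB")
```

Halving the per-element size halves the cache, so the same VRAM budget holds roughly twice the context.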
grumd@reddit
With q8_0 kv cache quant, at 20-30k+ context my models start failing "edit" tools in opencode while coding - because for a successful edit tool they need to exactly output the current file contents with what specifically needs to be changed. And they often start failing with "file contents dont match exactly", which means they forget some rare tokens from the kv cache and make small mistakes. Never happens with unquantized kv cache.
Sad_Individual_8645@reddit
When I set the context limit to that in LM Studio, any prompt processing takes literally forever. If I set the context limit lower but run the same prompt, it's quick. I don't get how people are using over 200k context with it on 24GB if prompt processing takes that long.
coder543@reddit
It really seems like either an LM Studio issue or a Windows issue. On Linux with llama-server, I get the same speed with 200k as I would with a smaller context, because it is all in VRAM.
__JockY__@reddit
The reason it’s unpopular is that at long contexts the KV cache quality degrades enough that it impacts model performance significantly, often leading to loops and never ending output. This is compounded by any quantization applied to the model itself.
These issues don’t really show up with small context lengths of just a few thousand tokens. But at 100k+? Garbage.
coder543@reddit
Yes, this is what people say. It is not what I have experienced. I would like to see actual benchmarks. It is not logical for q8 to have that much impact.
input_a_new_name@reddit
Not much of a benchmark, but I've experienced this with Gemma 3 27B at Q5_K_M and q8 KV cache: in a 30k-long chat, it was unable to precisely quote a line from somewhere around 28k when asked to, writing analogous lines but not the actual line verbatim. When I loaded it with full-precision KV cache, it suddenly worked and wrote the exact line it was supposed to, 10 times out of 10. So there you go.
__JockY__@reddit
I agree that it’s minimal loss, but when you compound it with a quantized model and then further compound it with massive context, it just breaks down.
I tried it with MiniMax-M2.5 using Claude cli and ultimately concluded that it wasn’t suitable for daily use due to the way Claude would essentially stop working towards the end of its context buffer.
BF16 KV doesn’t suffer the same issue.
Now I understand that this is a sample size of 1 with no data to back it up, so if you want to rip me a new one that’s fair enough. But know this: I’m no noob to this and I’m doing these tests on a quad RTX 6000 PRO rig on which I’ve been experimenting for quite some time. I’m comfortable that I understand the issue and that I’ve tested it well enough to be sure in my conclusions for my use cases. YMMV.
ProfessionalSpend589@reddit
> The people claiming you need bf16 kv cache don’t make any sense to me.
Otherwise I would have plenty of VRAM left which would be a waste. :p
Klutzy-Snow8016@reddit
Someone needs to run long context benchmarks with and without KV cache quantization. Like, if I can get X context with 16-bit and 2X with 8-bit, I want to know that 8-bit isn't degraded at X tokens. The entire point of quantizing the KV cache is to unlock longer context, but people just run their normal low-context prompts and call it good.
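One way to run this comparison with stock llama.cpp tooling is llama-perplexity over a genuinely long document, once per cache type. The paths and the long-context test file are assumptions, and it's worth double-checking that the cache-type flags behave identically in llama-perplexity.

```shell
# Perplexity over a long document with full-precision KV cache...
llama-perplexity -m Qwen3.5-27B-Q4_K_M.gguf -f long_context_corpus.txt \
  -c 131072 --n-gpu-layers 999 --flash-attn on \
  --cache-type-k f16 --cache-type-v f16

# ...and again with quantized KV; compare the final PPL figures.
llama-perplexity -m Qwen3.5-27B-Q4_K_M.gguf -f long_context_corpus.txt \
  -c 131072 --n-gpu-layers 999 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0
```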
thegr8anand@reddit (OP)
It might be my noobness, but I never tried the KV quantization options even though they are just a toggle in LM Studio. Turning it on and using Q8_0 for KV, I can now get full speed of 34-40 tps at 110k context! Thanks for the suggestion; at least I tried it out. How much does Q8 quantization affect quality compared to not quantizing at all? 3-4x the speed, surely there must be some trade-off?
ttkciar@reddit
I've had good experiences with q8_0 K and V cache quantization, but I think some task types are more sensitive to it (like codegen).
itch-@reddit
7900XTX, 32 t/s using 27B-UD-Q4_K_XL using vulkan llama.cpp. I only put 40 000 tokens for ctx on vram, when I need more I'll try to see how much I can squeeze in.
I'm much more impressed with this than 35B-A3B. That fails to make Cline work, but 27B handles it just fine. And thanks to that quirk where it doesn't think much if there are tools, 30 t/s is a reasonable speed.
PaMRxR@reddit
RTX 3090, 32-34 t/s @ d40000 using 27B-UD-Q4_K_XL. Practically the same!
itch-@reddit
https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations
I saw this and got bartowski IQ4_NL, apparently a touch better and also a bit faster. 36 t/s. 12% increase, I feel it more than I thought I would
PaMRxR@reddit
Thanks for sharing, I see similar performance bump which correlates roughly with the ~10% smaller size. According to KLD benchmarks the IQ4_XS is even better (and actually really good for its size), but both are proportionally worse than bigger quants. Anyway I'll give it a try for a few days!
https://old.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/
Dirk__Gently@reddit
Y u no use roc?
itch-@reddit
I just get vulkan and hip builds off github releases, check the speed difference, make my choice. IIRC it was very close in recent builds
sine120@reddit
Don't split dense models into system RAM. With 24GB VRAM you can easily fit the model. When context spills into system RAM, your TG speed will drop dramatically. If you need more context space, try the IQ3.
Short_Ad_7685@reddit
which quants are u running?
sine120@reddit
I run the UD-IQ3-XXS in 16gb of vram
Short_Ad_7685@reddit
Thanks.
Haeppchen2010@reddit
Radeon RX 7800XT, 64k context Q8:
IQ3_XS: 390 t/s in, 16 t/s out. But it is slightly too „dumb".
IQ4_XS with CPU offload: 230 t/s in, 4.5-5.5 t/s out. But the quality improvement is worth the wait.
Opteron67@reddit
FP8 model 60-70 t/s with dual 5090 with vllm and p2p driver
Nepherpitu@reddit
Quad 3090 runs fp16 at ~90tps average with mtp. Looks like you can squeeze more.
Opteron67@reddit
I had cpu on powersave lol (it matters in fact)
now with cuda 13.2, vllm HEAD, linux 6.19 nvidia 595 p2p enabled driver on W790 and dual 5090 PNY watercooled
concurrency 1 with MTP 1 => avg 125 tps; concurrency max without MTP => avg 626 tps
RomanticDepressive@reddit
Wait can you tell me more
Like I have dual NVLINK but am waiting for Corsair cables to become back in stock
Have you tried NVLINK? ik_llama seems to support it with NCCL, and I can verify it but have yet to put all 4 cards in a single system
Nepherpitu@reddit
Not sure about llamacpp, I'm running it with VLLM.
Dexamph@reddit
30-35tk/s on 4090+3090Ti in Q5-Q6, with Bartowski's Q6KXL running a bit faster because of some layers at Q8. The 3090Ti allows for higher quants while keeping context maxed out. LM Studio just updated their llama.cpp runtime today so it's basically performing the same as what I get in OpenWebUI+llama-server.
Very impressed with this model: it doesn't buckle on complex prompts like 35B, nor forget things in a long chat like GLM 4.7 Flash, while still being much, much faster and more usable than bigger MoE models with partial offloading.
thegr8anand@reddit (OP)
Yes, the new runtimes in LM studio did give slight boost.
Radiant_Condition861@reddit
dual 3090 and llama-swap settings.
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0  Off  |                  N/A |
| 32%   38C    P8             18W /  225W |  23189MiB /  24576MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:02:00.0  Off  |                  N/A |
| 30%   38C    P8             26W /  225W |  22006MiB /  24576MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
Note: Power was reduced from 350W to 225W.
PassengerPigeon343@reddit
I’m so glad I clicked into this thread to find someone with my exact hardware, also using llama-swap, and posting their settings to achieve this speed. Thank you!
Radiant_Condition861@reddit
lmk if if you can replicate my results.
RomanticDepressive@reddit
Are you using nvlink? I can test two separate nvlink setups
Radiant_Condition861@reddit
no nvlink. I didn't think it would affect things that much. If I were training models, then I would consider offloading the PCIe bus traffic to NVLink to help with performance.
Fabulous_Fact_606@reddit
Was getting excited, then it's the 35B model
oxygen_addiction@reddit
Different model.
LMTLS5@reddit
~20tps tg and ~200tps pp on mi50 at q4km
EvilGuy@reddit
I get about 35 tokens a sec when I run Q_4_K_M 27B on a 3090 with 128k context. (Q8 kv quant) in LM studio.
Sad_Individual_8645@reddit
What settings? When I do that the prompt processing takes forever with my 3090, exact same setup as you too.
EvilGuy@reddit
Hmm nothing crazy in lm studio
Qwen 3.5 27b model at q4km
Context 128000
GPU offload 64 (all layers)
CPU thread pool size 7
Most things are default here
Unified KV cache on
Offload KV cache to GPU on
Keep model in memory off
Try mmap on
Flash attention on
K and V cache set to q8_0
Nothing crazy 35 tokens pretty steady with good prompt processing
No_Information9314@reddit
20 t/s on dual RTX 3060s using Unsloth IQ4_XS at 100k context, KV cache 8_0.
odikee@reddit
22-23tks on 5080+3080 UDQ5, 50k context, tp 8_0
ParamedicAble225@reddit
3090, and slow as fuck, but I use it anyways. Haven’t calculated TPS but it’s 1-7 minutes per request with around 2000-15000 tokens input. It’ll be slow for agentic work but works excellently. I just live in slow mo now.
Not running A3B, which would be a bit faster but more stupid, since only 3 billion parameters are active instead of almost all of them.
Sad_Individual_8645@reddit
I see people on here with a 3090 and the native 260k context selected in LM Studio for the Q4 version of the model; when I do the same, the prompt processing literally takes forever. I have 64GB DDR5 and have tried every setting under the sun, and I don't get it. If I go down to like 90k with Q8 KV cache quant it works, but I don't get how people are doing that with a 3090.
MushroomCharacter411@reddit
A3B has 3 billion *always* active parameters, but apparently the average prompt actually uses around 10B, the "always on" 3B, plus 7B chosen from the remaining 32B.
CreamPitiful4295@reddit
I’ve been sitting on my 3090 and then figured out I could run models. Did exactly what you did and got exactly the same results. I’m a bit poorer now. :)
serioustavern@reddit
Output speed around 35 tok/s.
This is using the same Q4_K_XL on a RTX 3090 with 32GB DDR3 system ram via llama.cpp (weights and context fully loaded on GPU).
Quantizing KV cache to Q8 rather than the default F16 got my max context length up to around 100K instead of around 50K with a negligible effect on speed. Lowering the 3090’s power limit from 350W to 280W decreased my speed by about 1 tok/s.
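The small cost of power-limiting makes sense because token generation is memory-bandwidth-bound rather than compute-bound. The cap can be set with nvidia-smi (the GPU index is an assumption, and the command needs root):

```shell
# Cap GPU 0 at 280 W; resets on reboot unless persistence mode is enabled
sudo nvidia-smi -i 0 -pl 280
```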
GhenB@reddit
FP8: 28 tok/s on 4 x rtx3060, vllm, tp4, 128k context, max-num-seqs 2
asfbrz96@reddit
7.5 on strix halo q8
octopus_limbs@reddit
Slow but still faster than if you type it yourself
simracerman@reddit
Yikes. Do you use it for non-coding?
asfbrz96@reddit
I'm using more minimax m2.5 or the 122b model, they are much faster
AdventurousSwim1312@reddit
On rtx 6000 pro, the awq version runs at roughly 80-90 tokens / seconds, with nearly same perf as the fp8 version
l1t3o@reddit
105 t/s gen speed with vllm and multi token Prediction activated.
nunodonato@reddit
how many speculative tokens? and which gpu?
l1t3o@reddit
5 multi-token-prediction tokens and 2x3090. I followed this guide: https://www.reddit.com/r/LocalLLaMA/s/WNeLqJyQoP It's the multi-token prediction built into Qwen3.5 that provides the most boost, but only vLLM supports it for now. Hope to be able to use it in llama.cpp soon!
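For reference, vLLM takes speculative/MTP settings as a JSON blob via --speculative-config. A sketch of the shape only; the method name and whether this model ID supports it are assumptions, so check the linked guide for the exact values:

```shell
vllm serve cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 \
  --tensor-parallel-size 2 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'
```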
nunodonato@reddit
Damn I'm also with vllm but at around 30tok/s. Using 3 for MTP because I read at higher values you can have issues with tool calling at long context.
l1t3o@reddit
Are you using this exact quant: https://huggingface.co/cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 ?
I believe the tool calling issue you're mentioning might have been solved recently.
nunodonato@reddit
No, I'm using the official Qwen FP8
l1t3o@reddit
What GPU do you have ?
rtx 30xx (ampere architecture) doesn't natively support fp8 (hence the funky format quant I gave you a link to).
rtx 40xx natively supports it.
nunodonato@reddit
I'm renting a RTX PRO 6000 (blackwell)
l1t3o@reddit
1/ vLLM nightly + Blackwell flag
Your RTX PRO 6000 is SM120 (compute capability 12.0). Stable vLLM doesn't fully recognize it and falls back to slower Marlin kernels instead of native FP4/FP8 ones. You need the nightly build compiled for CUDA 13.0:
That alone can make a big difference.
2/ NVFP4 — the next level
Blackwell has hardware FP4 tensor cores (SM120), which means you can run NVFP4 quantization natively — ~1.6x faster than FP8, with minimal quality loss. There's a ready-made checkpoint:
Caveat: SM120 NVFP4 support is still a bit rough on the edges (some kernel bugs were only fixed recently). Use it with the nightly image above and test — if you get NaN outputs or garbage, fall back to FP8 which is rock solid on your card.
TL;DR: nightly first (free perf gain), then try NVFP4 if you want to push further.
I know that RTX 5090 (32GB, 1.8 TB/s bandwidth) → ~80 tok/s using NVFP4, without MTP.
What I think your blackwell could reach : NVFP4 + MTP 5 tokens + SGLang → 100-130 tok/s
Good luck !
oxygen_addiction@reddit
What hardware
SharinganSiyam@reddit
Getting about 46 TPS on my RTX 5090 using this command:
llama-server -m "C:\Users\Pc\AppData\Local\llama.cpp\unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-UD-Q5_K_XL.gguf" --mmproj "C:\Users\Pc\AppData\Local\llama.cpp\unsloth_Qwen3.5-27B-GGUF_mmproj-BF16.gguf" --ctx-size 262144 --fit-ctx 262144 --n-predict -1 --parallel 1 --flash-attn on --fit on --threads 8 --threads-batch 16 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --cache-type-v q8_0 --cache-type-k q8_0
Try using llama cpp with kv cache quantization at q8_0 under 131k context. You might get optimal performance
stavenhylia@reddit
How are you running it that fast with a context that big? In LM Studio it predicts to take around 52GB of RAM, overflowing my 5090
SharinganSiyam@reddit
But in my case it stays within the 32GB of my VRAM. That's because I used the --flash-attn on and KV cache q8_0 flags.
stavenhylia@reddit
Hmm.. Maybe I'm underestimating the overhead LM Studio adds.
I'll try to run it like you and see what I can figure out.
Thank you for getting back to me :)
MerePotato@reddit
I really wouldn't quantise the kv cache on a reasoning model if I could possibly help it
SharinganSiyam@reddit
Didn't see any performance issue in my use case. But yeah avoiding kv cache quantization is better if you are satisfied with your context limit
EffectiveCeilingFan@reddit
There was a post on here asserting that the Qwen3.5 arch really didn’t like anything other than BF16 for the KV cache. If you’re getting good results though then I might have to experiment myself.
Shamp0oo@reddit
For what it's worth: I have an RTX 5060 Ti 16G with a memory OC of +3000 MHz. Using this quant I can fit around 20k of context. This gives me around 25 tok/s of tg (don't remember pp), which is way faster than 122B for me, where I have to offload to system RAM (DDR4) and only get around half the performance. I'm using self-compiled mainline llama.cpp on Linux.
HugoCortell@reddit
Same speeds as you, are you using LM Studio?
thegr8anand@reddit (OP)
Yes
HugoCortell@reddit
Yeah, that's normal then. I made a post recently and the consensus seems to be that somehow LM Studio gobbles performance.
If you go to your model tab, you can change the load settings for the specific model to improve performance, but it won't be as good as with other programs. For some reason the default setting in LM Studio is not to load the model onto VRAM, which obviously hurts performance a lot.
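If you'd rather skip LM Studio's defaults entirely, a minimal llama-server launch that forces every layer onto the GPU looks something like this (a sketch: the model path is a placeholder, and -ngl 99 just means "offload as many layers as exist"):

```shell
# -ngl 99: offload all layers to the GPU (a number larger than the layer count is fine)
# -c 32768: context size; --flash-attn on: enable flash attention
llama-server -m /path/to/Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -c 32768 --flash-attn on
```

If that overflows your VRAM, lower -ngl until it fits instead of letting the runtime silently fall back to CPU.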
mixman68@reddit
Same problems with latest lm studio
The program loads only 12 GB into VRAM; I had to test layer by layer to find the limit of my graphics card, and with 35B I get 37 t/s.
For 27B I don't get more than 7 t/s.
Config: 4070 Ti Super 16 GB, 32 GB RAM 6000 MHz, 7800X3D, Debian trixie XFCE
sammcj@reddit
https://smcleod.net/2026/02/patching-nvidias-driver-and-vllm-to-enable-p2p-on-consumer-gpus/#results
lemondrops9@reddit
Once anything gets offloaded to the CPU it's going to be a lot slower... it's why VRAM is so sought after.
Rasekov@reddit
exl3 at 4bpw and 3.1bpw: 25 t/s with context running at FP16 on a 3090 with a power limit of 270W. Speed is similar in both cases, so I'm guessing I'm fully compute limited. A higher power limit helps, but not much; for my use case and my card, 270-280W is the sweet spot, not worth another 100W just for a 5% gain or so.
Max context is set to 96K; speed was more or less stable up until 24K. I haven't filled the context yet. I tried batching first (~190 t/s with a batch of 16, but the context is too small for anything useful) and will test context in detail tomorrow or so.
It's not hard to set up, and if you aren't offloading to RAM and don't want llama.cpp's webui, it's very much worth the trouble.
La7ish@reddit
For 27b I get between 33-40 tok/sec on my 4090
dubesor86@reddit
same here. Q5_K_M@32k ctx
SillyLilBear@reddit
81t/sec on dual RTX 6000 Pro @ FP8
shrug_hellifino@reddit
Old AMD 1950X Threadripper with 64 GB DDR4 and 5x AMD Pro VIIs (16 GB) on ROCm 6.4.x and llama.cpp: BF16 unsloth, 148k ctx, FP16 KV, batch 4096, ubatch 1024, ~10 t/s
scrappyappl@reddit
MacBook Pro m3 max 48gb. I get around 12t/s using mrader 27b heretic
d4mations@reddit
What pp/s?
VickWildman@reddit
Like 4-5 tps, Q4_0. On my phone (which has 24 GB RAM).
pmttyji@reddit
What t/s are you getting for Qwen3.5-9B?
hyrulia@reddit
Q4_K_M runs at 5 t/s
UD-IQ3_XXS at 25 t/s
5060 TI 16Gb
input_a_new_name@reddit
Run at least IQ4_XS, IQ3 is not worth it, it should fully fit
hyrulia@reddit
I will try it, thx!
InternationalNebula7@reddit
RTX 5080 - pp 1673.83, tg 54.35, Qwen3.5-27B-UD-IQ3_XXS.gguf
HEAVYlight123@reddit
26 tok/sec on 5070 Ti 16GB and 3070 8GB. Runs faster than coder 80B on my DDR4 and 12700, but with less context.
Embarrassed-Boot5193@reddit
Dual RTX 5060 Ti (total 32GB VRAM), 130k context, KV cache f16, llama.cpp
Q5_K_XL - pp ~650 t/s, tg ~16 t/s
Q4_K_M - pp ~800 t/s, tg ~19 t/s
IQ3_XXS - pp ~1200 t/s, tg ~25 t/s
SLI_GUY@reddit
~45 tok/sec RTX 4090
Dundell@reddit
After some tests on aider I've been finding Q4 less successful than Q5, so I run Q5 27B a lot at around 14 t/s on 3x RTX 3060s. I know the recent updates to llama.cpp brought my 122B speed up 25%, and they could probably do the same for my 27B. I haven't tried anything different to speed it up, but I'm open to some ideas. I'm more interested in Q4 122B at 26 t/s.
putrasherni@reddit
All tk/s figures are averages as the context fills up.
Qwen 3.5 27B IQ3_XXS
34 tk/s at 32K context
32 tk/s at 65K context
29 tk/s at 131K context
Qwen 3.5 27B Q4_K_S
30 tk/s at 8K context
27 tk/s at 65K context
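Numbers like these can be reproduced with llama-bench, which in recent llama.cpp builds takes a depth flag to measure generation speed with the context already partly filled. A sketch (the model path is a placeholder, and the exact flag spellings may differ by build):

```shell
# -n 128: generate 128 tokens per measurement
# -d: prefill depth (context fill) before measuring tg; comma-separated list runs each depth
llama-bench -m /path/to/Qwen3.5-27B-IQ3_XXS.gguf -n 128 -d 0,32768,65536
```

That gives you a tg row per depth, so you can see the slowdown curve instead of a single average.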
Twirrim@reddit
If you can tolerate a little precision loss, try Qwen3.5-35B-A3B. I'm getting ~20 tok/s with 128k context on an 8GB VRAM RTX 3050. I've been finding it perfectly fine for my use cases.
Ok-Caregiver9383@reddit
What quant are you using to get it to fit in 8GB? It must be VERY low
Twirrim@reddit
I'm using Q3_K_S. I tried a Q2 one previously but the perf drop-off was noticeable.
coder543@reddit
You don’t have to fit a MoE entirely into VRAM to get good performance. Just the dense core of attention/KV/shared expert.
amejin@reddit
How? Using vLLM trying to run any of these thinking models requires all the vRAM I have. What am I missing?
coder543@reddit
vLLM is not optimized for these kinds of low memory use cases. You need to use llama-server.
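The llama.cpp mechanism for this is the --n-cpu-moe flag (the same one used in the Gemma command earlier in the thread): it pushes the routed experts of the first N layers to the CPU while the dense core (attention, KV cache, shared experts) stays on the GPU. A sketch with a placeholder path; N is something you'd tune down until the model fits:

```shell
# -ngl 99: all layers "on GPU" in principle
# --n-cpu-moe 12: but keep the routed expert weights of the first 12 layers in system RAM
llama-server -m /path/to/moe-model.gguf -ngl 99 --n-cpu-moe 12 -c 65536 --flash-attn on
```

Since only a few experts are active per token, the CPU-resident weights are touched sparsely, which is why this degrades speed far less than offloading whole dense layers.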
amejin@reddit
Understood... That's a bummer.
Ok-Caregiver9383@reddit
So what q is he using?
coder543@reddit
I’m not them, but they could be using q8 and still get good performance. Obviously the lower the quant, the faster it is.
Thunderstarer@reddit
What quant?
soyalemujica@reddit
I would not say "a little precision loss"; 35B loses to 27B by quite a noticeable margin in all benchmarks, which IS EXPECTED.
overand@reddit
That is genuinely bad-ass for that card! What's your prompt-processing speed?
thegr8anand@reddit (OP)
That sounds very good for 8GB VRAM. What quant are you using?
timhok@reddit
V100 32GB capped at 200W
llama.cpp on llama-swap
Qwen3.5-27B-UD-Q6_K_XL.gguf
k/v cache in Q8
vision enabled
fit ON = 182k context window w/ vision in F16
small requests under 10k tokens - 22 t/s gen 640 t/s pp
30k+ tokens - 14 t/s gen 400 t/s pp
I love it
ismaelgokufox@reddit
Q3 unsloth quant at around 20 T/s on a 6800. When the context gets bigger, 6-7 T/s.
Additional_Ad_7718@reddit
I think it was 20T/s on my 3060 64 gb ram.
Honestly it was too slow to use with reasoning on.
Front_Eagle739@reddit
2000 pp and 47 decode. Rtx5090
__JockY__@reddit
I tested the BF16 unquantized version in vLLM 0.17.1rc1.dev5+g8d98d7cd on 4x RTX 6000 PRO on EPYC with DDR5 6400 in tensor parallel mode with MTP and 2-token speculation.
“Write flappy bird” = avg generation throughput of 286.6 tokens/sec and accepted MTP output throughput of 185.49 tokens/sec with a 91.4% acceptance rate.
Of course flappy bird is benchmaxxed to death, which means MTP is crushing it and leading to a false sense of speed.
“Write an Objective-C program to recursively scan the home directory looking for .png files. Build an index of these and then use Mac OS frameworks to convert all of the PNG files into a .mp4 video that runs at 2 frames per second.” = avg gen throughout of 139.2 tokens/sec with accepted MTP throughput of 85.7 tokens/sec and 80.1% acceptance rate.
Not too shabby!
wizoneway@reddit
5090, Qwen3.5-27B Q4. Generation: 954 tokens in 13s, 69.86 t/s
MerePotato@reddit
33t/s Q6 on a 4090, could probably be a lot more if I bothered to use MTP
thegr8anand@reddit (OP)
Does only vLLM support MTP right now?
MerePotato@reddit
I believe so, transformers might but I haven't looked into it
SurprisinglyInformed@reddit
2x 3060 12GB (total 24GB) + 64GB RAM DDR4 in LMStudio
qwen3.5-27b unsloth Q4_K_XL 65k context
14.41 tk/s
Adventurous-Paper566@reddit
12 tps with 5060ti + 4060ti Q6_K_L 65k context
heikouseikai@reddit
.000000000009 t/s
south_paw01@reddit
Q4_K_M, 25 t/s, 9700, 32GB. Will test unsloth versions in the future.
miniocz@reddit
Bartowski Q4KM split between P40 + 3060 @ 30000 context - 9-11 t/s
Mir4can@reddit
2x 5060 Ti. With vLLM, cyankiwi AWQ, 115k context: without MTP a stable 20 t/s, with MTP it ranges between 30 and 45.
hp1337@reddit
IQ3_XXS with q8_0 cache on 9070XT vulkan I get 800 pp and 30 tg
Lorian0x7@reddit
RTX 4090, Zorin OS, Q4_K_M, 62k context, 40-42 t/s, no CPU offload
PotentialLawyer123@reddit
Just asked it a quick question in a new chat and achieved 33.41 tok/sec and 0.17s TTFT on my 7900 XTX. 28.18 tok/sec on a 67k context file I just dropped into a new chat but 109.83s TTFT (5755 token output). Hope this helps!
overand@reddit
Dual 3090, underclocked, Q4_K_M
Prompt: 1102
Gen: 36.2
(I think I'm only on one card with that particular model.)
OkDesk4532@reddit
27B is really slow. 35B-A3B is much faster.
StrikeOner@reddit
I posted a benchmark on this sub today about this particular model. I'm getting around 20 t/s with 50k context with various Q4 quants.
sleepingsysadmin@reddit
12 TPS fully offloaded. It's sad.
Worse yet, can't use speculative decoding because of the vision.