Qwen3.5 27B is a Match Made in Heaven for Size and Performance


Southern-Chain-6485@reddit

I'm hitting 25 tokens/sec with *a Q5 quant* on a single RTX 3090 (plus 64 GB of DDR5 RAM, but that's not relevant because the Q5 fits entirely in the 3090's VRAM). Of course, the Q8 runs at about 5 tokens per second. But I'm not sure if there's a case for the 35B-A3B, as it's not much faster.


Intrepid-Second6936@reddit

I think you might've had some issues with your 35b-a3b run. I'm getting 30 tokens/sec with the 27b at Q5_K_M on my 7900 XTX, and 102 tokens/sec on AesSedai's Q4_K_M quantization of the 35b-a3b model. It's not an order of magnitude, but I think it's very significant, depending on whether you want to rapid-fire questions at the MoE or prefer longer-form answers from the dense model.


Accomplished-Star-36@reddit

Hi, do you mind sharing your llama.cpp parameters? ty


Intrepid-Second6936@reddit

Sure, I'm honestly running pretty much out-of-the-box. That run used the default context size of 85744 set by llama.cpp, but I now set the context to 128k and get basically the same performance.

`./llama-server -m ~/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 72000 --mmproj ~/.cache/llama.cpp/mmproj-BF16.gguf --port 8888`

If it's helpful, this is my current command, using unsloth's Q4_K_XL quant of the 35B with multi-modal support. I get a consistent 116 tokens/second with this quant. I've also kept the KV cache at its default 16-bit, because quantizing it down to 8-bit to save memory led to a lot of looping on large requests, particularly with RAG.


Southern-Chain-6485@reddit

I was comparing the 27B at Q5 to the 35B-A3B at Q8. The 35B-A3B at Q4 does indeed hit over 100 tokens per second.


Technical-Earth-3254@reddit

Mind sharing your settings? I'm getting around 1/5th of your speed on my 3090 at Q5.


Southern-Chain-6485@reddit

`/home/juan/llama.cpp/build/bin/llama-server -m /mnt/shared/LM_Studio_Models/Qwen3.5-27B/Qwen3.5-27B_Q5_K_M.gguf -c 32000 --temp 1 --top-k 20 --top-p 0.95 --min-p 0 --host 0.0.0.0 --port 8502 --fit on`

For the MoE version, this thread adds other options I don't know about: https://www.reddit.com/r/LocalLLaMA/comments/1rdxfdu/qwen3535ba3b_is_a_gamechanger_for_agentic_coding/


IrisColt@reddit

And compile it from source...


BeautyxArt@reddit

u/IrisColt what are the benefits of compiling llama.cpp?


IrisColt@reddit

In my case, my compile command is very simple... and it only speeds up prompt fetching... so no real significant impact.


tomakorea@reddit

I get 31 tok/sec with an RTX 3090. Update your llama.cpp version; your performance is too low for your hardware.


Southern-Chain-6485@reddit

Yep, I just did, and 27b is around 30 t/s, 35b3a is about 45 t/s and 122b10a is about 23 t/s.


Ciffa_@reddit

you should easily reach around 70-80 t/s for the 35b3a


HugoCortell@reddit

That's odd, I'm getting 4 tok/s on a 3090... What the heck!


Southern-Chain-6485@reddit

Are you using a quant that fully fits in your VRAM? The 27b is a dense model, so all parameters are active at once.


HugoCortell@reddit

Yep, I am, 21.1GB per the file. It turns out LM Studio defaults to full CPU for some reason, so I've changed that, but my speeds are still ~12 tok/s at best. I made a post about it asking for advice, but all I got was zero comments and a few downvotes.


nuusain@reddit

I'm getting 101 t/s at 131k context with the 35b-a3b UD-Q4_K_XL quant. For anyone still on an older llama.cpp build: update. I was stuck at 28 t/s until I rebuilt from latest. The qwen35moe graph deduplication PRs (#19597, #19660, #19668) made a 3.6x difference. The model loaded fine on the old build but ran through an unoptimised code path.

```
llama-server -m ~/models/qwen3.5-35b-a3b-q4.gguf \
  -ngl 99 -c 131072 --threads 4 --batch-size 2048 \
  -np 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
```


DeZepTup@reddit

unsloth also uploaded the new meta: mxfp4_moe


YourNightmar31@reddit

131k? What are you running it on?


nuusain@reddit

3090 + 96 GB DDR4. To be clear, this is the 35b-a3b. I was saying that there is a case for it, as it seems much faster than the 27b. I haven't run the 27b myself yet.


_raydeStar@reddit

Yeah, I'm getting pretty fast runs too. MoE is always superior for speed and, looking at benchmarks, pretty good at coding. Dense models are better for creative writing though.


IrisColt@reddit

> But I'm not sure if there's a case for the 35B-A3B

Nice insight. I had the same dilemma with Qwen 3 VL 32B vs the MoE 30B, and the 32B won.


Poro579@reddit

If you run the 35B-A3B with `--n-cpu-moe`, I'd expect it to reach at least 30 t/s.


Conscious_Cut_6144@reddit

Since everyone seems to be getting distracted by your fancy GPU, here is another data point:

- Single RTX 3090
- Q4-XL quant
- 110k context (fully offloaded)
- Prefill at 900 t/s
- Generation at 15k context: 31 t/s


Fin5ki@reddit

Wow, 110k context? On my single 3090, at just 60k context Ollama already starts dumping the model/cache into CPU RAM and token generation drops to ~5 t/s. Ollama is garbage, I know, but this is quite the difference. Would you mind sharing more details on your software stack and settings, please?


Ke5han@reddit

Not the person you are asking, but I am using the same setup and getting about the same result. Using llama.cpp instead of Ollama is the key, I guess. I also load the mmproj, so the max context I can get is 80K, but if I don't load the mmproj I can set the context to 110K :).


grey-seagull@reddit

I'm using it at 260k context length:

```
llama-server \
  -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_M \
  -ngl 99 \
  -c 262144 \
  -fa on \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --alias unsloth/Qwen3.5-27B-GGUF \
  --reasoning-format deepseek \
  --host 127.0.0.1
```


Putrid-Engineering38@reddit

Won't your GPU memory blow up?


grey-seagull@reddit

No. The new Qwens don't use much VRAM for the KV cache.
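For a rough sense of why KV cache size depends so much on the architecture and on the cache type, here is a minimal sketch of the standard size estimate for attention layers. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.5's actual configuration; the bytes-per-element figures assume llama.cpp's usual block layouts for `f16`, `q8_0`, and `q4_0`.

```python
# Rough KV cache size estimate for a standard attention stack.
# All model dimensions used below are hypothetical, not Qwen3.5's real config.

# f16 is 2 bytes per value; q8_0 packs 32 values into 34 bytes and
# q4_0 packs 32 values into 18 bytes (quantized values plus an fp16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, cache_type: str = "f16") -> float:
    """Return the approximate combined K and V cache size in GiB."""
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * context  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical model: 16 attention layers, 4 KV heads, head dimension 256.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(16, 4, 256, 262144, cache_type)
    print(f"{cache_type}: {size:.2f} GiB at 262144 tokens")
```

Fewer attention layers or fewer KV heads shrink the first factor directly, and a quantized cache type shrinks the second, which is how a 262144-token context can fit next to the weights on one card.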


Lopsided_Dot_4557@reddit (OP)

thanks. that's a good one too


oxygen_addiction@reddit

Talk to the guy above you who is getting better results with Q5: https://www.reddit.com/r/LocalLLaMA/comments/1rdvq3s/comment/o78otrd/


sammcj@reddit

My 2x RTX 3090 setup:

- 27b UD-Q6_K_XL, 64k: 80-103 tk/s
- 30b-a3b UD-Q6_K_XL, 64k: 110 tk/s
- 30b-a3b 4-bit AWQ (vLLM), 128k: 172 tk/s


NecessaryKitchen4656@reddit

Did you use NVLink? May I ask for your full settings?


sammcj@reddit

No NVLink, here's a write up: https://smcleod.net/2026/02/patching-nvidias-driver-and-vllm-to-enable-p2p-on-consumer-gpus/


Simple_Library_2700@reddit

single user?


sammcj@reddit

Those tk/s figures are a single client's throughput (roughly, on average). The total combined throughput will go up with multiple clients, up to a point (I'm not sure where that point is at present).


AlternativeBoss8595@reddit

Running 27B at Q4 AWQ in vLLM on my 5090 32 GB (need that KV cache / context) lol, but yeah… the capabilities for its size are INSANE!🔥💯


Educational-Agent-32@reddit

T/s ?


AlternativeBoss8595@reddit

Like 70-110; it varies depending on how heavy the context is.


big_bad_wolf@reddit

Me running two 27b at bf16 because i can 🙂


tecneeq@reddit

50 T/s on a 5090 with Debian 13. Unfortunately I can't run the full context, because I have to leave room for my desktop and so on.

```
/root/llama.cpp/build/bin/llama-server \
  --hf-repo mradermacher/Qwen3.5-27B-GGUF:Q6_K \
  --ctx-size 131072 \
  --host 0.0.0.0 \
  --port 11337 \
  --parallel 1 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
```


chris_0611@reddit

I don't know. Is it better than the 122B-A10B MoE? That runs at Q5 on my RTX 3090 + 96 GB DDR5-6800 at 400 T/s PP and 20 T/s TG with 256K context...


KaMaFour@reddit

Considering the performance is supposedly comparable to the 122b MoE, and the 122b MoE will likely not be pushing 20 T/s on systems that can't even fit it in RAM, I'd say it's a fair model.


dampflokfreund@reddit

Exactly. Most systems have 32 GB and you can fit the 27B. With llama.cpp's CPU offloading you can get bearable speeds on low-end hardware like a 2060.


rerri@reddit

With a dense 27B you can get bearable speeds on a RTX 2060 + CPU offloading? I have doubts... What kind of speeds do you get with that kind of a setup?


24gasd@reddit

I get around 24tk/s on a 3090 without CPU offloading. Which is fine for Chat use but barely bearable for agentic use. So 2060 + cpu offloading would be SLOW....


Xp_12@reddit

I'm hitting 20tok/s on the 122b MoE with dual 5060ti 16gb and 64gb DDR5.


iBog@reddit

Can you share your launch params?


chris_0611@reddit

`./llama-server -m ./Qwen3.5-122B-A10B/Qwen3.5-122B-A10B-IQ4_NL-00001-of-00003.gguf --mmproj ./Qwen3.5-122B-A10B/mmproj-F16.gguf --n-cpu-moe 42 --n-gpu-layers 99 --threads 16 -c 0 -fa 1 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --jinja -ub 2048 -b 2048 --host 0.0.0.0 --port 8502 --api-key "dummy"`


iBog@reddit

9.6 tps on my i7-11700K + 96 GB RAM + RTX 3090 with 262k context. I'm surprised how well it runs!


chris_0611@reddit

Totally makes sense. TG should scale nearly linearly with memory bandwidth, and my DDR5-6800 would have approximately double the bandwidth of your DDR4.
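A quick sanity check of that "approximately double" estimate, as a sketch: theoretical peak bandwidth is the transfer rate times 8 bytes per 64-bit channel times the channel count. Dual-channel operation and the DDR4-3200 speed are assumptions here; the thread only states DDR5-6800 on one side and plain "DDR4" on the other.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int = 2,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


ddr5_gbs = peak_bandwidth_gbs(6800)  # DDR5-6800, dual channel (channel count assumed)
ddr4_gbs = peak_bandwidth_gbs(3200)  # DDR4-3200 is an assumed speed, not stated in the thread

bandwidth_ratio = ddr5_gbs / ddr4_gbs
observed_ratio = 20 / 9.6  # token generation speeds reported in the two comments above

print(f"DDR5-6800: {ddr5_gbs:.1f} GB/s, DDR4-3200: {ddr4_gbs:.1f} GB/s")
print(f"bandwidth ratio {bandwidth_ratio:.2f} vs observed TG ratio {observed_ratio:.2f}")
```

Under those assumptions the bandwidth ratio and the reported speed ratio land close to each other, which is consistent with token generation being bandwidth-bound when expert weights sit in system RAM.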


FusionCow@reddit

I mean, RAM kinda shakes things up a bit for MoEs, but at least comparing the 27b to the 35b in my own testing, the 27b actually seems a lot better. MoE models tend to have better knowledge, but their choices from token to token are overall "dumber". It's why you don't see 1T-param models with 1M experts.


chris_0611@reddit

Oh, certainly, I would expect the 27b dense to be much better than the 35b-a3b. But the 122b still has 10b active. But yeah, I agree. It would be really interesting to see a good comparison between 27B and 122B-A10B in different quants. Still, he has a $10k GPU to run only a 27B model with only 32K context at only 20 T/s, and that just seems really sad and completely the *opposite* of "Match Made In Heaven"... (sorry, no better words for that).


OuchieMaker@reddit

Oh, I totally agree. I got a Strix Halo machine and have been comparing the 35B and 122B myself (also getting roughly ~20-30 tokens/s generation on both, and still finagling with the optimization), and OP getting 20 tokens/s at that price point for that sort of model is kind of nuts. You'd think GPUs would benefit more from the faster bandwidth.


Ptifiela@reddit

What do you think of the two models? And what are your PP speeds on the Strix for each model (and at which context size)?


Conscious_Cut_6144@reddit

Calling an A6000 a 10k GPU is silly. You can get a Pro 6000 new for 8k. A 27B model can fit fully in VRAM on a 3090 with a normal 4-bit quant and decent context.


MR_-_501@reddit

There is a place for dense models: it's called vLLM with a large VRAM pool (if your workload is, for example, formatting/extracting a large dataset). Dense models scale better with concurrency than MoEs do.


Lopsided_Dot_4557@reddit (OP)

I tried it in a pre-prod RAG pipeline and it works flawlessly.


dampflokfreund@reddit

But you have 96 GB. You realize not everybody has that much RAM right? I just have 32 GB like the majority of gaming systems. That MoE won't run on such a system, unless you wait minutes for a token. 27B would run nicely.


chris_0611@reddit

This dude has a $10k professional GPU. I'm like 99.999% sure he has 96GB or more of RAM.


dampflokfreund@reddit

"That's the whole point. This 27B model requires a much more expensive system to run (fast) than the 122B MOE." Depends what you define as fast. But the 27B can be run on much cheaper hardware compared to the MoE. I have a system with 32 GB RAM and RTX 2060 in a laptop, so far weaker than your hardware. And a 27B dense runs at like 2.5 token/s, which is not fast but bearable. Meanwhile that MoE would run far slower, perhaps at 0.3 token/s because I simply do not have enough RAM for it and it would swap to disk.


chris_0611@reddit

> But the 27B can be run on much cheaper hardware compared to the MoE.

No, it can't. It will crawl. The 27B will be slow as \*\*\*\* if you can't fit it fully in VRAM. An MoE model will be SO much faster, especially on low-VRAM hardware. That's seriously the whole point of MoE, which you just don't seem to understand. He is running a dense model (even if it's 'only' a small 27B dense model), so he needs tons of VRAM, hence the $10,000 (!!!) GPU. For an MoE you just need a small amount of VRAM and a somewhat decent amount of DDR5. Much cheaper.


dampflokfreund@reddit

I told you the dense model will run much faster if the system in question has just 32 GB of RAM, because the MoE will NOT fit in 32 GB while the dense model does. Many more people have PCs with 32 GB of RAM than with more than 64 GB (and btw, RAM is crazy expensive now), so more people can run the dense model, and therefore it is cheaper to run. What's not to understand about that? Yes, the MoE will be faster if you have enough RAM, I don't doubt that. But again, try getting 96 GB in this day and age. And on systems like my laptop you can't upgrade beyond 32 GB of RAM, so I'd have to buy a new system to run the MoE, which would be much more expensive. Think outside your bubble for a moment, please.


chris_0611@reddit

Seriously, you are wrong. He runs the 27B dense on 1 TB/s of memory bandwidth and gets 20 T/s. If your laptop has about 50 GB/s of memory bandwidth with its 32 GB of DDR5, the model will thus run at a whopping... **1 T/s** (it's really that simple). We all ran dense models 4 years ago. We all tried it on CPU/DDR. And we all figured out it just doesn't work. 4 years ago. Stop trying to explain things to **me**.
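The arithmetic in that comment, written out as a sketch. The first function rescales a measured speed linearly with bandwidth, which is the simplification the comment uses. The second gives a crude upper bound that assumes every active weight is read once per token and ignores KV cache reads and compute; the 5.5 bits per weight is a hypothetical figure, and the active parameter counts are taken from the model names.

```python
def scale_by_bandwidth(measured_tps: float, measured_bw_gbs: float,
                       target_bw_gbs: float) -> float:
    """Scale a measured token generation rate linearly with memory bandwidth."""
    return measured_tps * target_bw_gbs / measured_bw_gbs


def bandwidth_bound_tps(bw_gbs: float, active_params_billion: float,
                        bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bw_gbs / gb_read_per_token


# The comment's numbers: 20 T/s at roughly 1 TB/s, rescaled to roughly 50 GB/s.
laptop_tps = scale_by_bandwidth(20, 1000, 50)
print(f"27B dense, rescaled: {laptop_tps:.1f} T/s")

# Hypothetical 5.5 bits per weight; only the active parameters are read per token.
for name, active_b in (("27B dense", 27), ("122B-A10B", 10), ("35B-A3B", 3)):
    bound = bandwidth_bound_tps(50, active_b, 5.5)
    print(f"{name}: at most about {bound:.1f} T/s at 50 GB/s")
```

The bound for the MoE models only holds when the whole model fits in RAM. Once weights have to be paged from disk, the effective bandwidth collapses, which is the other side of this argument.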


dampflokfreund@reddit

You are right about the speed; I was confusing it with another model. Still, 1 token/s is enough for some, and barely acceptable for me. It still runs. The 122B MoE will be orders of magnitude slower because it has to swap to disk. And sure, Qwen 3.5 35B-A3B will run much faster, but it will also be less capable. The 27B dense is comparable to the 122B MoE in quality while it can also run on RAM-constrained hardware. I think that's a neat option to have.


chris_0611@reddit

I'm also very, very much not talking about **YOU**. I was talking to THIS GUY (OP), who has a $10,000 GPU. **HE** (absolutely) will also have the RAM to run the 122B MoE model. This guy has the money for a $10,000 GPU. He should NOT be satisfied with running 27B at 20 T/s. He spent **that** much money and thought he'd found a "match made in heaven", but bro, if I spent $10,000 I would **not** be satisfied with that. Even my $2,000 setup outperforms his $10,000 one, which was kind of my point.


dampflokfreund@reddit

Yeah, but that guy is a bot. Look at the em-dashes, the structure of the text. It is clearly AI written. He doesn't exist, which is exactly why this doesn't make any sense.


Lopsided_Dot_4557@reddit (OP)

I am not a bot, and it's a genuine post. The entitlement and ignorance of some people is astounding.


Opteron67@reddit

frustration


Opteron67@reddit

3090 is 400 not 2000


Thrumpwart@reddit

I run my chonky boi w7900 on an am4 pc with 64GB of ram. I didn’t choose the thug life…


Adventurous-Paper566@reddit

I can fit a lot more context with 27B Q6 than with 35B Q5 lol


Maximum@reddit

I tried to see how much context I can fit into 35B and my patience ran out faster than my RAM because after 150k (all on RAM, 15GB VRAM is full with model weights, plus some on RAM) the gen speed slowed down to 12t/s.


piggledy@reddit

I've tried it in LMStudio with the recommended settings and it seems like it's thinking way too much and being quite indecisive. Even for "Hi" it's considering 4 possible replies before getting back. Is that normal?


-_Apollo-_@reddit

for me, it thought of 23 possible replies to just, "Hi" and then errored out.


JohnTheNerd3@reddit

Here's another data point:

- 2x RTX 3090
- 150k context
- vLLM with [this quant](https://huggingface.co/cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4) (AWQ with BF16 activations and INT4 weights; this matters since 3090s have hardware INT4 support)
- ~1500 t/s prefill
- ~40 t/s decode at 100k context


Downtown-Figure6434@reddit

What are the monthly costs of the setup and, if you have experience with it, how do they compare to simply using OpenAI's APIs?


Sherry141@reddit

Thanks for sharing the info. Is there a reason you're running 32K context? Do you feel it'd work as well with a greater context (like its 256K native) at Q4?


Lopsided_Dot_4557@reddit (OP)

Otherwise I get OOM errors


TotallyToxicToast@reddit

"no reason to go lower if your VRAM allows it." well if you want more context you can go lower on the quant


TechySpecky@reddit

How do you think it compares to qwen3 VL 8B FP8 in terms of inference speed? I need to run 1.5 million queries with 2000 input tokens and 128 output tokens soon. I have access to 6000 Pro, H200 and B200.
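A back-of-the-envelope sketch for sizing a batch job like this one. The query count and token lengths come from the comment; the aggregate prefill and decode throughput figures are hypothetical placeholders that would need to be measured on the actual GPU and serving stack.

```python
def batch_job_hours(n_queries: int, input_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimate wall-clock hours given aggregate prefill and decode throughput."""
    prefill_seconds = n_queries * input_tokens / prefill_tps
    decode_seconds = n_queries * output_tokens / decode_tps
    return (prefill_seconds + decode_seconds) / 3600


total_input = 1_500_000 * 2000   # prompt tokens across all queries
total_output = 1_500_000 * 128   # generated tokens across all queries

# Both throughput figures are hypothetical; measure your own setup.
hours = batch_job_hours(1_500_000, 2000, 128, prefill_tps=20_000, decode_tps=2_000)
print(f"{total_input:,} input tokens, {total_output:,} output tokens, about {hours:.1f} hours")
```

With roughly 15 times more input than output tokens, prefill throughput under batching is likely to matter as much as single-stream generation speed for this kind of workload.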


xoovs@reddit

Did anyone manage to run it on an Intel Arc Pro B-series card? I'm only getting errors so far.


thecodeassassin@reddit

I am actually quite impressed by the quality of this model. I'm going to test drive it for a few days, comparing it to Claude Sonnet 4.6. So far it's really impressive.

My stats on a single 5090:

- Prompt eval: ~1,037–1,095 tok/s
- Generation: ~37.7–37.9 tok/s


SlechteConcentratie@reddit

A stupid question: do you guys set up powerful machines for fun or for work? I want to know how to set up my learning. So far my strongest machine is a laptop with 16 GB of RAM and an Intel Core Ultra 5 (3600 MHz, 14 cores, 18 logical processors).


LegacyRemaster@reddit

Try Qwen Next 80B on the latest llama.cpp version:

Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\llm\llama.cpp\build\bin\Release\ggml-cuda.dll
load_backend: loaded RPC backend from C:\llm\llama.cpp\build\bin\Release\ggml-rpc.dll
load_backend: loaded CPU backend from C:\llm\llama.cpp\build\bin\Release\ggml-cpu.dll

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | CUDA | 999 | 1024 | 1024 | 1 | pp512 | 3188.64 ± 82.29 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | CUDA | 999 | 1024 | 1024 | 1 | pp16384 | 3865.69 ± 41.54 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | CUDA | 999 | 1024 | 1024 | 1 | pp32768 | 3826.49 ± 34.39 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | CUDA | 999 | 1024 | 1024 | 1 | pp81920 | 3410.23 ± 13.52 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | CUDA | 999 | 1024 | 1024 | 1 | tg128 | 114.85 ± 0.21 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | CUDA | 999 | 1024 | 1024 | 1 | tg8192 | 117.35 ± 1.16 |


Ok_Helicopter_2294@reddit

Honestly, while it’s true that a 27B dense model is smarter than a 35B A3B model, it’s just too slow for agentic use cases to be practical. I’m running it on dual RTX 3090s, and the 35B A3B model delivers about 2.5× higher token generation speed, which makes it significantly more usable in real-world scenarios.


ScythSergal@reddit

I have always been confused by statements like this. Is it not beneficial to have a slower model that gets the job done properly, rather than a fast one that constantly misses the mark? I know for a lot of the more important things I do, I would rather wait 5 minutes for a response than have 100 responses with issues in all of them. Granted, I haven't done much agentic stuff, so I could very well be missing benefits over just "fast", so please let me know.


Ok_Helicopter_2294@reddit

And honestly, your argument feels too extreme, and it seems to be addressing a different point than the model size I was referring to.


Ok_Helicopter_2294@reddit

If prompt one-shot output quality is the most important factor for a task, then a dense model would naturally be the better choice. However, if you frequently use agents like I do, or require multi-turn workflows, the perspective can be quite different. And honestly, if you simply want to use the best model as quickly as possible, you can always upgrade your local environment or pay to use models hosted in the cloud. But in my case, I’m working with limited computational resources. :)


Ok_Helicopter_2294@reddit

When using agents extensively like I do, the context inevitably grows larger, which in turn slows down inference speed. To address this, what's needed is an MoE model that offers reasonably good quality while also being fast.


Ok_Helicopter_2294@reddit

In real-world use, the dense model is genuinely smarter and produces higher quality outputs overall — that's not in question. But that doesn't mean the Qwen 35B A3B MoE model falls dramatically behind in quality compared to a dense model. The gap isn't that wide — it holds up well in practice. The bigger point is this: in agentic workflows, you're not waiting for one perfect answer — you're chaining together dozens or even hundreds of small decisions in a loop. That 2.5x speed difference doesn't just feel nice, it multiplies across every single step. What might be a 10-minute task on the faster model becomes a 25-minute task on the slower one. So while I'd absolutely agree that quality matters more than speed when you're asking one important question, agentic use is a fundamentally different context — and in that context, speed *is* practicality.
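The multiplication argument above can be sketched in a few lines. The step count, tokens per step, and the two generation speeds are hypothetical values chosen to reproduce the 10-minute versus 25-minute example; the sketch counts generation time only and ignores prompt processing and tool latency.

```python
def agent_run_minutes(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    """Generation time for a sequential agent loop, ignoring prefill and tool latency."""
    return steps * tokens_per_step / tokens_per_s / 60


# Hypothetical workload: 100 sequential steps of 600 generated tokens each.
fast = agent_run_minutes(100, 600, tokens_per_s=100)  # assumed MoE-like speed
slow = agent_run_minutes(100, 600, tokens_per_s=40)   # assumed dense-like speed, 2.5x slower
print(f"fast: {fast:.0f} min, slow: {slow:.0f} min, ratio {slow / fast:.1f}x")
```

Because the steps run one after another, the per-token speed ratio carries straight through to the total run time. A slower model only wins overall if its higher quality cuts the number of steps or retries by more than that ratio.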


jacek2023@reddit

I have a serious problem using the 27B on 3x3090 because of all the thinking; it takes a long time to get an answer. Maybe it will be better with valid prompts (and, for example, opencode), but I need to test it more. In the beginning it was crashing, but there was a quick fix in llama.cpp in the evening.


tomakorea@reddit

I asked it to generate a 4-line poem in French. Its thinking process used 5000 tokens and it still wasn't done when I had to stop it manually. It just wasn't able to decide whether its answer was correct and got stuck in a doubt loop. For my usage this model sucks.


layer4down@reddit

Yeah I’m loving this little qwen3.5-27b-q8 for coding and debugging. I was concerned that adding vision might water down the model but it doesn’t appear to have lost a hint of intelligence so far compared to the old 32B. I don’t mind 19 tps or less for a smart, dense model like this. I’ve been able to give it simple prompts and let it figure out the rest for most of the night. I’ve had it fixing code from the 35b-a3b model and it’s working great. Really impressive little model. Best in class IMHO.



LinkSea8324@reddit

Keep in mind MoE isn't the answer to everything, and dense models (like this one) might be much better on your specific problem. Now go try MoE vs dense on benchmarks that require multiple kinds of expertise on the same task. MoE underperforms on multi-expertise tasks (in other words, pretty much everything related to real-world usage).


gpt872323@reddit

Care to share example use cases for both? How do you decide which to use?


gpt872323@reddit

Impressive stats at the very least plus vision support.


wrk79@reddit

I have been testing Qwen3.5-35B-A3B-GGUF on a Radeon 780M with 56 GB of shared memory allocated to the GPU and got a solid 17.2 t/s.


kidflashonnikes@reddit

I just ran it - flawless. Incredible. I have 4 RTX PRO 6000s, with 1 TB of RAM and a 96 Core Threadripper Pro and 14 TB of Nvme. I tried it out on the CPU RAM first to lower my expectations - then I ran it on the GPUs. Insane speed and quality. They cooked. They cooked indeed.


kidflashonnikes@reddit

Please don't ask for t/s etc. 1) I am not allowed to report these things, as I work at a very large AI lab, and 2) I have different settings for my model tests, so I am not going to explain everything. This model cooks hard. China is absolutely tied with the US.


IamaLlamaAma@reddit

Yeah. My girlfriend also goes to another school.


Pro-editor-1105@reddit

Can i have one please 😢


GestureArtist@reddit

Should I ditch ollama since it can’t use sharded ggufs? It seems like nothing works with Ollama now due to lack of support for it


arcanemachined@reddit

The sooner you bite the bullet and get off the Ollama train, the better IMO.


Ok_Helicopter_2294@reddit

I honestly recommend using llama.cpp, as it receives updates very quickly.


Lopsided_Dot_4557@reddit (OP)

totally agreed about llama.cpp


Kornelius20@reddit

I have the same gpu and I'm downloading the ggufs now! How's real world use for you? Benchmarks seem a little iffy to me these days


Lopsided_Dot_4557@reddit (OP)

It's pretty good. I just tried it in an existing RAG pipeline in pre-prod and it's not bad at all.


jiegec@reddit

Some data points for an NVIDIA RTX 4090 24GB:

`+ CUDA_VISIBLE_DEVICES=1 ../llama.cpp/llama-bench -p 1024 -n 64 -d 0,16384,32768,49152 --model unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf`

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | pp1024 | 2448.99 ± 1.48 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | tg64 | 38.29 ± 0.19 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | pp1024 @ d16384 | 1698.83 ± 0.83 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | tg64 @ d16384 | 36.11 ± 0.26 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | pp1024 @ d32768 | 1297.37 ± 2.91 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | tg64 @ d32768 | 33.21 ± 0.22 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | pp1024 @ d49152 | 1040.60 ± 1.81 |
| qwen35 ?B Q4_K - Medium | 15.57 GiB | 26.90 B | CUDA | 99 | tg64 @ d49152 | 30.63 ± 0.18 |

build: 244641955 (8148)

`+ CUDA_VISIBLE_DEVICES=1 ../llama.cpp/llama-bench -p 1024 -n 64 -d 0,16384,32768,49152 --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf`

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | pp1024 | 5189.48 ± 12.92 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | tg64 | 115.79 ± 1.80 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | pp1024 @ d16384 | 3703.44 ± 10.14 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | tg64 @ d16384 | 109.06 ± 2.10 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | pp1024 @ d32768 | 2867.74 ± 4.48 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | tg64 @ d32768 | 97.30 ± 1.64 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | pp1024 @ d49152 | 2326.84 ± 2.83 |
| qwen35moe ?B Q3_K - Medium | 14.66 GiB | 34.66 B | CUDA | 99 | tg64 @ d49152 | 88.42 ± 1.18 |

build: 244641955 (8148)

`+ CUDA_VISIBLE_DEVICES=1 ../llama.cpp/llama-bench -p 1024 -n 64 -d 0,16384,32768,49152 --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q2_K_XL.gguf`

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | pp1024 | 5152.83 ± 11.29 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | tg64 | 136.07 ± 1.95 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | pp1024 @ d16384 | 1396.21 ± 1.71 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | tg64 @ d16384 | 34.86 ± 0.16 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | pp1024 @ d32768 | 806.41 ± 0.85 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | tg64 @ d32768 | 19.00 ± 0.04 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | pp1024 @ d49152 | 492.17 ± 0.32 |
| deepseek2 30B.A3B Q2_K - Medium | 11.06 GiB | 29.94 B | CUDA | 99 | tg64 @ d49152 | 12.68 ± 0.03 |

build: 244641955 (8148)

`+ CUDA_VISIBLE_DEVICES=1 ../llama.cpp/llama-bench -p 1024 -n 64 -d 0,16384,32768,49152 --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf`

ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | pp1024 | 5982.52 ± 15.24 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | tg64 | 132.23 ± 2.02 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | pp1024 @ d16384 | 1455.00 ± 1.20 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | tg64 @ d16384 | 34.52 ± 0.12 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | pp1024 @ d32768 | 824.62 ± 0.42 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | tg64 @ d32768 | 18.94 ± 0.04 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | pp1024 @ d49152 | 497.81 ± 0.11 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | CUDA | 99 | tg64 @ d49152 | 12.64 ± 0.01 |

build: 244641955 (8148)
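One way to read the llama-bench output above is to ask how much of the empty-context generation speed each model keeps at a depth of 49152 tokens. The sketch below only reuses the tg64 numbers from those runs; the dictionary labels and the function name are illustrative.

```python
# tg64 results in tokens/s from the llama-bench runs above: (depth 0, depth 49152)
TG_RESULTS = {
    "Qwen3.5-27B UD-Q4_K_XL": (38.29, 30.63),
    "Qwen3.5-35B-A3B UD-Q3_K_XL": (115.79, 88.42),
    "GLM-4.7-Flash UD-Q2_K_XL": (136.07, 12.68),
    "GLM-4.7-Flash UD-Q4_K_XL": (132.23, 12.64),
}


def retention_percent(tps_at_zero: float, tps_at_depth: float) -> float:
    """Share of the empty-context generation speed that remains at depth."""
    return 100 * tps_at_depth / tps_at_zero


for model, (tps0, tps_deep) in TG_RESULTS.items():
    print(f"{model}: {retention_percent(tps0, tps_deep):.1f}% of tg64 speed kept at d49152")
```

On these numbers both Qwen3.5 models keep roughly three quarters or more of their generation speed at depth, while GLM-4.7-Flash drops to under a tenth, which fits the comments elsewhere in the thread about long context being cheap on the new Qwens.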


jakegh@reddit

Honestly, even if it matched Opus 4.6, I wouldn't find it usable at 32k context.


khronyk@reddit

Speed: ~19.7 tokens/sec... is it really worth it for a 27b, then? The vision support and the processing speed for long contexts sound really nice, so it has sparked my interest. But I would really love to know how it compares outside of benchmarks, especially for agentic tasks, with some of the larger models at Q4/MXFP4. I know this is a dense model and I'm comparing it against MoE models, but I have 2x RTX 3090 (also 48 GB VRAM) in my Zen 2 based EPYC Rome server with 256 GB of DDR4-2666, and I can run much larger models at Q4_K_M/MXFP4_MOE to UD-Q5_K_XL levels at better speeds. In particular, I can run MiniMax M2.5 Q4_K_M at 17.3 tok/sec for generation, Qwen3-Coder-Next MXFP4 at about 36 tok/sec, and GPT-OSS-120B MXFP4 at 54 tok/sec. If I drop down to the Qwen3-Coder-Next REAP model, it fits completely in VRAM.


EndlessZone123@reddit

There are significant reliability improvements even at Q8 when you need to do simple tasks with maximum reliability. Comparing it to a bigger model at a lower quant is not really the point. If you want to trade quality for speed, use the A3B model. I like running my models at Q6 minimum, and these new small-to-medium sized models are nice.


114 Comments