Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding?
Posted by hedsht@reddit | LocalLLaMA | View on Reddit | 98 comments
I have a 5090, so my VRAM is limited to 32GB, but I find that Qwen3.5-27B-UD-Q5_K_XL with opencode (and mmproj) does a pretty good job for my use case (mainly web development).
I use Claude and Codex here and there since usage limits got nerfed hard, really only when Qwen gets stuck or repeats itself over and over again, which happens, or when I'm too lazy to be more specific and just spin up Claude or Codex.
Is there any other model I should try? Or is there something coming out I should have on my radar?
guiopen@reddit
Yes, it's the best for 32GB.
chicky-poo-pee-paw@reddit
Do you have reasoning on for your coding? Do you think it makes a difference in quality vs speed?
guiopen@reddit
Yes, it makes a difference, but I'm forced to leave it disabled because there's a bug in llama.cpp where tool calls leak into the reasoning output.
StardockEngineer@reddit
Are you sure? I've been using it daily and haven't experienced this since all the issues were fixed about a month ago.
guiopen@reddit
https://github.com/ggml-org/llama.cpp/issues/20837
StardockEngineer@reddit
Says it's a tooling issue, mostly.
"Ultimately, this is a tooling issue"
Seems your client doesn't properly handle reasoning_content in multi-turn agentic conversations
ytklx@reddit
No, this is not a tooling issue. I have the same problem with OpenAI's Go library. It only happens with various Qwen 3.5 models.
guiopen@reddit
Yes, I am sure, but it depends on the harness you use. opencode Go users didn't report problems, but in simpler harnesses like the Zed agent it happens constantly.
StardockEngineer@reddit
Sorry I didn't see this second comment when I replied to your other comment.
ayylmaonade@reddit
Not the person you asked, but generally for coding you want to leave reasoning on. You should get significantly better results in more complex workflows.
Specter_Origin@reddit
I still can't get the 27B and 35B Qwen models to stop overthinking or looping; I've tried so many harnesses, etc.
sonicnerd14@reddit
Pro tip for Qwen3.5: just turn off thinking altogether and you're likely to see much better results overall in most responses.
Gemma4 has a nice balance of knowing when and how much to think, and you can actually prompt it to think harder and it will. That's such an underrated capability for a model to have.
Specter_Origin@reddit
Output quality significantly degrades with that, I would just use Gemma at that point
sonicnerd14@reddit
I've tested it in many scenarios, and from what I've experienced, Qwen3.5 overthinks so much that it's actually the opposite. Turn it off, especially if you're using it agentically, because you're just burning tokens.
NewtMurky@reddit
Have you set the repetition_penalty?
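For reference, a hedged sketch of how that's set with llama.cpp (the model filename and the 1.05 value are just placeholders; 1.0 disables the penalty):

```shell
# Server-side default sampler setting (model path and value are placeholders)
llama-server -m Qwen3.5-27B-UD-Q5_K_XL.gguf --repeat-penalty 1.05

# Or per request, against the server's native /completion endpoint
curl http://localhost:8080/completion -d '{
  "prompt": "Write a fizzbuzz in JavaScript.",
  "n_predict": 256,
  "repeat_penalty": 1.05
}'
```

Most OpenAI-compatible harnesses also let you pass the penalty through in the request body instead.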
Maximum-Wishbone5616@reddit
Loops are due to bad KV. It HAS to be F16.
Yeelyy@reddit
Please try this system prompt from Fernflower: https://pastebin.com/pU25DVnB
It fixed it for me, and it's so efficient now.
reddoca@reddit
This looks quite nice, if long. I guess it would work on any model?
rpkarma@reddit
If it makes you feel any better 3.6-plus also way overthinks and loops lol
ModelScoutAI@reddit
Honestly this really depends on what kind of coding you're doing. I've been testing a bunch of models on real infra work (K8s manifests, Terraform modules, Dockerfiles) and the SWE-bench rankings are pretty misleading for that stuff.
GLM-5 is ranked way below Claude on SWE-bench but in my testing they performed basically the same on infra tasks. Same pass rate, quality scores within a hair of each other.
lehoang318@reddit
I would suggest keeping Qwen3.5 27B as your main LLM and preparing another (maybe Gemma4) as a backup. When Qwen gets stuck, you can switch models (llama-server in router mode). This approach works pretty well in my setup (which is much weaker than yours).
AnonLlamaThrowaway@reddit
If you have enough system RAM (most likely 48GB or 64GB), try gpt-oss-120b. I haven't been able to find anything better when its reasoning is set to high. Qwen will make basic mistakes that it won't.
You can use an option that offloads the expert layers into system RAM so the more speed-critical layers stay on the GPU. Some GUIs like LM Studio let you fine-tune this so you can still keep some experts in VRAM.
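In llama.cpp terms that's the MoE CPU-offload option; a minimal sketch, assuming a recent build (the model filename and the layer count are placeholders you'd tune to your VRAM):

```shell
# Offload all layers to the GPU, but keep the expert tensors of the
# first 30 MoE layers in system RAM; raise or lower 30 until VRAM fits.
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 30 -c 32768

# --cpu-moe (no number) keeps every expert tensor in system RAM instead.
```

The attention and shared tensors are small but hit on every token, so keeping them on the GPU is what preserves most of the speed.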
hurdurdur7@reddit
Compared to qwen3.5 the gpt-oss-120b is terrible at writing real life code.
unjustifiably_angry@reddit
Can't speak from personal experience, but that recent article claiming local models are capable of doing things similar to Mythos found that 120B is still unexpectedly competent at finding bugs.
hurdurdur7@reddit
Finding bugs, perhaps indeed, but for writing code I've had a terrible time with it; I've just switched over to Qwen3.5 (and now also some Gemma4) instead. Too many logic errors slip in.
luckynummer13@reddit
Would you say it’s better than Qwen3-Coder-Next?
LikeSaw@reddit
I think it's the best model for coding at that size. Coder Next is not even close because it's an instruct model with no reasoning and a low active parameter count. It's good for autofill and fast, but not very smart.
I tried all the Qwen 3.5 models; only the 397b is better imo. For coding, the 27b is even better than the 122b. It feels overall smarter, like a bigger model. And I don't know about Gemma 4: even after all the fixes and redownloads of 31b, it feels way more confused at coding with tool calls etc., with weird reasoning and not really solution-focused, but that's just me.
Somehow I'm also disappointed by MiniMax M2.7. I can only run the IQ3 quants from Unsloth. I tried both versions, IQ3_XXS and IQ3_S, which run on my GPU only, and it was very fast but tbh feels very dumb. I also tried the Q6 with RAM offload, but it's really slow, and I only tested it on my codebase to find bugs; really, really disappointed there too, it misses so much stuff, makes assumptions, and overall just doesn't feel smart enough.
For me the 27b feels special because it "understands" the problems, making perfect tool calls, but the lack of knowledge and size is the downside.
unjustifiably_angry@reddit
Early MiniMax 2.7 quants had some issues, might be worth trying again at some point. That said - and I'm basing this on nothing - I would expect that since they distribute the model pre-quantized to FP8, further quantization is more lossy than models distributed at full F16.
Front-Relief473@reddit
So if we give the 27b enough web search ability, equivalent to an external knowledge base, will it be able to perform coding tasks better?
LikeSaw@reddit
Yes and no. It will need a lot of guidance when the task is outside of its knowledge, because it will assume its own knowledge before searching for solutions, and may contradict itself and produce bad code. 27b parameters will still be 27b and won't match the knowledge and reasoning of top-tier models. It also depends on how complex the project itself is.
unjustifiably_angry@reddit
Q3CN was an enormous letdown for me. I was very excited about finally being able to use it after upgrading my hardware, and it turned out to be extremely mid, at least for my purposes. Fast but unreliable.
Maximum-Wishbone5616@reddit
Day and night. Q8/KV F16 was killing Opus4.6 even before big dumbing.
itsmetherealloki@reddit
Try the 26b4a. Seems counterintuitive, but in my case it was actually better at tool calling. Might be my setup though. And please use the llama.cpp fork with turboquants. It's amazing!
Blues520@reddit
I'd say coder-next is better imo.
Kodix@reddit
You should check out gemma-4, both the 26B and 31B versions. You may or may not like it more, it may or may not fit your usecase better.
Other than that, I'm not aware of anything *specifically* worth paying attention to at the moment, not in this VRAM bracket.
viperx7@reddit
What's your hardware and what flags are you using for qwen3.5 27b to get 62t/s? Vllm/llama.cpp?
unjustifiably_angry@reddit
A 5090 or 6000 Pro will do that.
fulgencio_batista@reddit
Gemma won't necessarily be faster. Qwen has MTP. I'm running qwen3.5-27b and 35b-a3b at 62 and 125 t/s tg512 respectively with MTP on budget hardware. For gemma4-31b and 26b-a4b I'd be somewhere in the ballpark of 30 and 80 t/s tg.
As for OP: try NVFP4, it might offer better performance than UD-Q5 and reduce looping.
feckdespez@reddit
What exactly qualifies as "budget hardware" in your view? (Just for a better reference point)
fulgencio_batista@reddit
Dual RTX5060Ti, think anything under $1k is fairly budget here when a lot of people have multiple 3090s or rtx6000pros lol
feckdespez@reddit
Thanks!
ttkciar@reddit
For what it's worth, you can use Gemma-4-E2B as a draft model under llama.cpp to get MTP for Gemma 4.
Tilted_reality@reddit
Speculative decoding is not the same thing as MTP. MTP has the speculation built into the model itself
ttkciar@reddit
The difference is slight. MTP with built-in weights can be more effective because it's trained specifically for speculative decoding with the model and shares the same hidden states, but the logic of processing the weights, speculating N tokens, and comparing against the parent model's probability distribution is literally the same.
MTP with a model's own built-in weights is indistinguishable from using an external draft model trained explicitly for speculative decoding with that model. That said, Gemma-4-E2B was not trained explicitly for speculative decoding, so it will be somewhat less effective.
Clean_Initial_9618@reddit
Can you please guide me a little on how I can achieve that?
ttkciar@reddit
You achieve it by passing the pathname of your draft model to `llama-server` (or whichever llama.cpp utility you are using) via its `--model-draft` command line option. Complete documentation here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md
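For example, the Gemma-4-E2B-as-draft setup mentioned above might look something like this (the quant filenames are placeholders; `--draft-max` caps how many tokens are speculated per step):

```shell
# Main model plus a small draft model for speculative decoding.
llama-server \
  --model gemma-4-31b-Q5_K_M.gguf \
  --model-draft gemma-4-E2B-Q8_0.gguf \
  --draft-max 8
```

Both models must share a tokenizer/vocabulary for the drafted tokens to be verifiable, which is why a small model from the same family is the usual pick.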
fulgencio_batista@reddit
Not a bad idea. I'm hoping z-lab makes a dflash SD model for Gemma. I found dflash to outperform MTP, but I haven't stuck with it for Qwen due to the extra overhead. It would be sick for 26b-a4b though, for a speedy little one.
Kodix@reddit
Oo, that's good to know. Didn't tinker with Qwen3.5 too much, unfortunately it just doesn't fit in my VRAM well enough.
No-Manufacturer-3315@reddit
Other than the crashes every other prompt.
Kodix@reddit
Absolutely has not been an issue for me. Occasional crashing, but very rare, and getting rarer with llama.cpp updates. May be your hardware, or your setup.
No-Manufacturer-3315@reddit
Hmm, I keep seeing my system RAM grow and grow until it crashes with the latest llama.cpp.
grumd@reddit
thank me later
No-Manufacturer-3315@reddit
Let me try thanks!
Kodix@reddit
Sounds like a K/V cache/attention issue. Is flash attention on? Are you quantizing your cache? What backend are you using?
Oddly enough, for gemma4 *specifically*, not quantizing attention in llama.cpp has *significantly* better results for me than doing so. It's really weird, but I haven't looked into this more deeply yet.
This is on the ROCM backend though, which affects things severely. Vulkan acted very differently in previous tests.
No-Manufacturer-3315@reddit
I sized my context to my GPUs; about 180k of context fits.
4090 + 7900 XT using the Vulkan backend.
I'm trying to use 31b Q6 with a Q8 KV cache, as I heard it's important not to shrink it for Gemma.
Flash attention is on; should it be off?
Kodix@reddit
That's.. huh. I don't have specific suggestions, but I see the same behaviour (with the VRAM usage ballooning) when I do *any* quantization of the K/V cache with Gemma versus leaving it at the default. Performance drops, and I start seeing the ballooning you mention, when I change it to any value at all (including Q8_0 for both K/V). So try running the attention unquantized to start with.
And yeah, flash attention should be on. It's a huge benefit when it works.
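Concretely, a hedged llama-server sketch of both setups (the model filename and context size are placeholders; newer builds take `-fa on|off|auto`, older ones a bare `-fa`):

```shell
# Flash attention on, KV cache left at the default f16 (the setup that worked here)
llama-server -m gemma-4-31b-Q6_K.gguf -fa on -c 65536

# Quantized K/V cache for comparison (-ctk/-ctv set the cache types;
# this is the configuration that ballooned in the reports above)
llama-server -m gemma-4-31b-Q6_K.gguf -fa on -ctk q8_0 -ctv q8_0 -c 65536
```

Note that V-cache quantization in particular requires flash attention to be enabled in llama.cpp.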
FinalCap2680@reddit
For me Qwen 3.5 122B-A10B (Q8 with RAM offloading) looks best from what I have tried.
Blues520@reddit
What hardware do you run this on?
FinalCap2680@reddit
Don't laugh ;) It is slow, but works: a Dell T7910, single Xeon 2680v3 with 192GB RAM + RTX 3090. The model version is UD-Q8_K_XL and I'm running it with 262k context and the recommended settings for coding.
PS: Compared to other ~120B models, also at UD-Q8_K_XL (Mistral-Small-4-119B-2603 and NVIDIA-Nemotron-3-Super-120B-A12B), I liked Qwen 122 more.
Also liked it more than Qwen3 Coder Next 80B-A3B UD-Q8_K_XL, Qwen3 30B-A3B-Instruct-2507 Q8_0, and the dense Qwen3.5 27B UD-Q8_K_XL.
CloudEquivalent7296@reddit
Cool, how many t/s do you get? Could you share your llama command?
FinalCap2680@reddit
I'm not interested in speed, so my setup is not optimised at all. I'm running LM Studio (I know, it is slow) and the 3090 is power capped at 170W.
The 120B models start at ~3 t/s generation for Qwen and Nemotron and ~6 for Mistral, and go down with growing context. Qwen 3 Coder Next (80B) also starts at ~3 t/s.
XtremeBee1970@reddit
Luv qwen3.5! Haven’t seen a better model than that. Not seeing much diff between the 9b and the 27b models personally, but I’m only on 5070ti w/ 16gb ram… 27b looks like almost same exact outputs but is much slower on my machine due to lower vram…. Thnx for posting! Interesting thread. Always in the lookout for better models! What will come after qwen3.5!? Hmmmm!
jwpbe@reddit
Really late to this thread, but I would give the RYS variant a try. It duplicates some of the blocks of the model where its reasoning is strongest:
https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-XL
The blog post explaining it:
https://dnhkng.github.io/posts/rys-ii/
Tormeister@reddit
Looks interesting, but this doesn't fit in a 5090
jwpbe@reddit
?
Use a GGUF quant; Q5 or Q6 will fit easily.
Tormeister@reddit
I read the blog post and checked the linked FP8 models, not realizing there are smaller quants as well, my bad.
ArugulaAnnual1765@reddit
The Opus distill v3 works well for me. Also using IQ4_XS, as it's just as good as Q6 but I can get the full 256k context on my 5090.
Shronx_@reddit
What did you do to make it work? If I hook up any of the Qwen3.5 models (27b, 35ba3b, 9b, Qwopus, ...) via llama.cpp server with Claude, generation stops after the first token. Qwen3 Coder Next works fine, on the other hand.
RnRau@reddit
Are your models and llama.cpp up to date? Are you following the unsloth guide on the recommended settings? https://unsloth.ai/docs/models/qwen3.5
Shronx_@reddit
unless I missed something, yes.
AlreadyBannedLOL@reddit
I can't find any 27b Opus distill v3 on Hugging Face. There's a 4b, but... why...
Account-67@reddit
I believe they are referring to: Jackrong/Qwopus3.5-27B-v3-GGUF
Maximum-Wishbone5616@reddit
Q8 KV F16
LegacyRemaster@reddit
To be honest... if you have DDR5 (128GB or 96GB) + your 32GB of VRAM, try MiniMax 2.7. It's MoE.
Free-Combination-773@reddit
With enough RAM you can try the 122b variant.
cosmicr@reddit
I'm curious what kind of coding people do with models like these, because half the time I can't even get models like Sonnet or Opus to do what I want. What languages? I'm assuming Python, because that's what most models always suggest.
Leading-Month5590@reddit
I'm running qwen3.5:122B-A10B at IQ4_XS precision on 40GB of VRAM and 64GB of system RAM and get about 20 tok/s on an R9 9950X3D machine. It's actually quite good, so if you have enough system RAM and a good processor I would advise you to try it. Maybe get an RTX 5060 Ti 16GB as a secondary GPU for that.
Doct0r0710@reddit
For agentic stuff it's good, but if you still use it as an "old school" chatbot, I found Qwen 3 Coder 30B and Nemotron Cascade 2 to be more consistent. Might be specific to our codebase though.
iThunderclap@reddit
Why not run a cloud open-source model at full inference levels? No need to run locally if what you do isn't absofuckinlutly secret.
Creepy-Bell-4527@reddit
Have you considered RAM-maxing and using krasis with MiniMax M2.7 Q2 or Q3?
Because if anything will actually rival Claude or Codex, it's that.
Front-Relief473@reddit
In my use, UD-Q3_K_XL already shows an obvious decline in ability, so the rule of thumb has some basis: quantization should not go below Q4.
Thunderstarer@reddit
Gemma 4 31b is less "anxious" for me, and I qualitatively feel like I've had more reliable results. On the other hand, that 3.6GB SWA context window really hurts.
annodomini@reddit
Bruh. OOM'd my 128 GiB system a few times before I figured out what was going on.
Other than that, Gemma 4 31b does feel pretty nice.
grumd@reddit
Limiting ctx-checkpoints, parallel, and cache-ram works well with Gemma.
qubridInc@reddit
Not really, Qwen3.5-27B is still one of the best for that VRAM; you can try Qwen3 Coder, but it’s more of a sidegrade than an upgrade.
povedaaqui@reddit
Have you tried an MoE model?
Reggitor360@reddit
Tried Devstral Small 2 2512 in Q8 yet?
ayylmaonade@reddit
In all of my testing, Qwen 3.5 27B (even at Q4) is superior to Devstral Small 2. It's not even a competition, really. Devstral is a good model, no doubt, but it's no Qwen.
Eyelbee@reddit
Why UD-Q5 on a 5090, can't you fit a larger quant? You'd get 30-40k context even with q8
Unlucky-Message8866@reddit
I haven't found anything better: llama.cpp + unsloth UD quants + recommended hparams. Minimal system prompt with the pi coding agent. I use both the dense and the MoE. Automated linting and type checking mandatory.
noctrex@reddit
Well, you can try the larger 122B model, offloading some tensors to RAM. Or even MiniMax, if you have 128GB of RAM.
starkruzr@reddit
Wait, isn't 3.5 supposed to be natively multimodal? Why do you need mmproj?
Skystunt@reddit
For .gguf files, multimodality is stored separately in a .mmproj file; the .gguf has just the text inference capabilities. At least, that's how llama.cpp works.
dinerburgeryum@reddit
Nope, for GPU-mid folks in the 32-48GB range it still comes out on top.
denoflore_ai_guy@reddit
Nemotron Cascade 2 has impressed the shit out of me. Getting 140 tok/s at Q8 with unquantized KV and 16 experts; it's solid for me.
Just_Maintenance@reddit
I’ve tried Gemma 31b but qwen 3.5 27b is more reliable for me
putrasherni@reddit
Gemma 31B, if its tool calling works.
jacek2023@reddit
I'm currently running opencode with Gemma 31B. I told her to look at my project generated with Codex (GPT/gemma) and write a better one :)