Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding?
Posted by hedsht@reddit | LocalLLaMA | View on Reddit | 98 comments
I have a 5090, so my VRAM is limited to 32GB, but I find that Qwen3.5-27B-UD-Q5_K_XL with opencode (and mmproj) does a pretty good job for my use case (mainly web development).
I use Claude and Codex here and there since usage limits got nerfed hard, really only when Qwen gets stuck or repeats itself over and over again, which happens, or when I'm too lazy to be more specific and just spin up Claude or Codex.
Is there any other model I should try? Or is there something coming out I should have on my radar?
guiopen@reddit
Yes, it's the best for 32GB.
chicky-poo-pee-paw@reddit
Do you have reasoning on for your coding? Do you think it makes a difference in quality vs speed?
guiopen@reddit
Yes, it makes a difference, but I'm forced to leave it disabled because there's a bug in llama.cpp where tool calls leak into the reasoning output.
StardockEngineer@reddit
Are you sure? I've been using it daily and haven't experienced this since all the issues were fixed about a month ago.
guiopen@reddit
https://github.com/ggml-org/llama.cpp/issues/20837
StardockEngineer@reddit
Says it's a tooling issue, mostly.
"Ultimately, this is a tooling issue"
Seems your client doesn't properly handle reasoning_content in multi-turn agentic conversations
ytklx@reddit
No, this is not a tooling issue. I have the same problem with OpenAI's Go library. It only happens with various Qwen 3.5 models.
guiopen@reddit
Yes, I am sure, but it depends on the harness you use. opencode Go users didn't report problems, but in simpler harnesses like the Zed agent it happens constantly.
StardockEngineer@reddit
Sorry I didn't see this second comment when I replied to your other comment.
ayylmaonade@reddit
Not the person you asked, but generally for coding you want to leave reasoning on. You should get significantly better results in more complex workflows.
Specter_Origin@reddit
I still can't get the 27B and 35B Qwen models to stop overthinking or looping; I've tried so many harnesses, etc.
sonicnerd14@reddit
Pro tip for Qwen3.5: just turn off thinking altogether and you're likely to see much better results overall in most responses.
Gemma4 has a nice balance of knowing when and how much to think, and you can actually prompt it to think harder and it will. That's such an underrated capability for a model to have.
Specter_Origin@reddit
Output quality significantly degrades with that, I would just use Gemma at that point
sonicnerd14@reddit
I've tested it in many scenarios, and from what I've experienced, Qwen3.5 overthinks so much that it's actually the opposite. Turn it off, especially if you're using it agentically, because you're just burning tokens.
NewtMurky@reddit
Have you set the repetition_penalty?
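For reference, a hedged sketch of how that's set with llama.cpp (the model filename and the 1.05 value are just placeholders; 1.0 disables the penalty):

```shell
# Server-side default sampler setting (model path and value are placeholders)
llama-server -m Qwen3.5-27B-UD-Q5_K_XL.gguf --repeat-penalty 1.05

# Or per request, against the server's native /completion endpoint
curl http://localhost:8080/completion -d '{
  "prompt": "Write a fizzbuzz in JavaScript.",
  "n_predict": 256,
  "repeat_penalty": 1.05
}'
```

Most OpenAI-compatible harnesses also let you pass the penalty through in the request body instead.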
Maximum-Wishbone5616@reddit
Loops are due to bad KV. It HAS to be F16.
Yeelyy@reddit
Please try this system prompt from Fernflower: https://pastebin.com/pU25DVnB
It fixed it for me, and it's so efficient now.
reddoca@reddit
This looks quite nice, if long. I guess it would work on any model?
rpkarma@reddit
If it makes you feel any better 3.6-plus also way overthinks and loops lol
ModelScoutAI@reddit
Honestly this really depends on what kind of coding you're doing. I've been testing a bunch of models on real infra work (K8s manifests, Terraform modules, Dockerfiles) and the SWE-bench rankings are pretty misleading for that stuff.
GLM-5 is ranked way below Claude on SWE-bench but in my testing they performed basically the same on infra tasks. Same pass rate, quality scores within a hair of each other.
lehoang318@reddit
I would suggest keeping Qwen3.5 27B as your main LLM and preparing another (maybe Gemma4) as a backup. When Qwen gets stuck, you can switch models (llama-server in router mode). This approach works pretty well in my setup (which is much weaker than yours).
AnonLlamaThrowaway@reddit
If you have enough system RAM (most likely 48GB or 64GB), try gpt-oss-120b. I haven't been able to find anything better when its reasoning is set to high. Qwen will make basic mistakes that it won't.
You can use an option that offloads the expert layers into system RAM so the more speed-critical layers stay on the GPU. Some GUIs like LM Studio let you fine-tune this so you can still keep some experts in VRAM.
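In llama.cpp terms that's the MoE CPU-offload option; a minimal sketch, assuming a recent build (the model filename and the layer count are placeholders you'd tune to your VRAM):

```shell
# Offload all layers to the GPU, but keep the expert tensors of the
# first 30 MoE layers in system RAM; raise or lower 30 until VRAM fits.
llama-server -m gpt-oss-120b.gguf -ngl 99 --n-cpu-moe 30 -c 32768

# --cpu-moe (no number) keeps every expert tensor in system RAM instead.
```

The attention and shared tensors are small but hit on every token, so keeping them on the GPU is what preserves most of the speed.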
hurdurdur7@reddit
Compared to qwen3.5 the gpt-oss-120b is terrible at writing real life code.
unjustifiably_angry@reddit
Can't speak from personal experience, but that recent article claiming local models are capable of doing things similar to Mythos found that 120B is still unexpectedly competent at finding bugs.
hurdurdur7@reddit
Finding bugs, perhaps indeed, but for writing code I've had a terrible time with it; I've just switched over to Qwen3.5 (and now also some Gemma4) instead. Too many logic errors slip in.
luckynummer13@reddit
Would you say it’s better than Qwen3-Coder-Next?
LikeSaw@reddit
I think it's the best model for coding at that size. Coder Next is not even close because it's an instruct model with no reasoning and a low active parameter count. It's good for autofill and fast, but not very smart.
I tried all the Qwen 3.5 models; only the 397b is better imo. For coding, the 27b is even better than the 122b. It feels overall smarter, like a bigger model. And I don't know about Gemma 4: even after all the fixes and redownloads of 31b, it feels way more confused at coding with tool calls etc., with weird reasoning and not really solution-focused, but that's just me.
Somehow I'm also disappointed by MiniMax M2.7. I can only run the IQ3 quants from Unsloth. I tried both versions, IQ3_XXS and IQ3_S, which run on my GPU only, and it was very fast but tbh feels very dumb. I also tried the Q6 with RAM offload, but it's really slow, and I only tested it on my codebase to find bugs; really, really disappointed there too, it misses so much stuff, makes assumptions, and overall just doesn't feel smart enough.
For me the 27b feels special because it "understands" the problems, making perfect tool calls, but the lack of knowledge and size is the downside.
unjustifiably_angry@reddit
Early MiniMax 2.7 quants had some issues, might be worth trying again at some point. That said - and I'm basing this on nothing - I would expect that since they distribute the model pre-quantized to FP8, further quantization is more lossy than models distributed at full F16.
Front-Relief473@reddit
So if we give the 27b enough web search ability, equivalent to an external knowledge base, will it be able to perform coding tasks better?
LikeSaw@reddit
Yes and no. It will need a lot of guidance when the task is outside of its knowledge, because it will assume its own knowledge before searching for solutions, and may contradict itself and produce bad code. 27b parameters will still be 27b and won't match the knowledge and reasoning of top-tier models. It also depends on how complex the project itself is.
unjustifiably_angry@reddit
Q3CN was an enormous letdown for me. I was very excited about finally being able to use it after upgrading my hardware, and it turned out to be extremely mid, at least for my purposes. Fast but unreliable.
Maximum-Wishbone5616@reddit
Day and night. Q8/KV F16 was killing Opus4.6 even before big dumbing.
itsmetherealloki@reddit
Try the 26b4a. Seems counterintuitive, but in my case it was actually better at tool calling. Might be my setup though. And please use the llama.cpp fork with turboquants. It's amazing!
Blues520@reddit
I'd say coder-next is better imo.
Kodix@reddit
You should check out gemma-4, both the 26B and 31B versions. You may or may not like it more, it may or may not fit your usecase better.
Other than that, I'm not aware of anything *specifically* worth paying attention to at the moment, not in this VRAM bracket.
viperx7@reddit
What's your hardware and what flags are you using for qwen3.5 27b to get 62t/s? Vllm/llama.cpp?
unjustifiably_angry@reddit
A 5090 or 6000 Pro will do that.
fulgencio_batista@reddit
Gemma won't necessarily be faster. Qwen has MTP. I'm running qwen3.5-27b and 35b-a3b at 62 and 125 t/s tg512 respectively with MTP on budget hardware. For gemma4-31b and 26b-a4b I'd be somewhere in the ballpark of 30 and 80 t/s tg.
As for OP: try NVFP4, it might offer better performance than UD-Q5 and reduce looping.
feckdespez@reddit
What exactly qualifies as "budget hardware" in your view? (Just for a better reference point)
fulgencio_batista@reddit
Dual RTX5060Ti, think anything under $1k is fairly budget here when a lot of people have multiple 3090s or rtx6000pros lol
feckdespez@reddit
Thanks!
ttkciar@reddit
For what it's worth, you can use Gemma-4-E2B as a draft model under llama.cpp to get MTP for Gemma 4.
Tilted_reality@reddit
Speculative decoding is not the same thing as MTP. MTP has the speculation built into the model itself
ttkciar@reddit
The difference is slight. MTP with built-in weights can be more effective because it's trained specifically for speculative decoding with the model and shares the same hidden states, but the logic of processing the weights, speculating N tokens, and comparing against the parent model's probability distribution is literally the same.
MTP with a model's own built-in weights is indistinguishable from using an external draft model trained explicitly for speculative decoding with that model. That said, Gemma-4-E2B was not trained explicitly for speculative decoding, so it will be somewhat less effective.
Clean_Initial_9618@reddit
Can you please guide me a little on how I can achieve that?
ttkciar@reddit
You achieve it by passing the pathname of your draft model to `llama-server` (or whichever llama.cpp utility you are using) via its `--model-draft` command line option. Complete documentation here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md
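For example, the Gemma-4-E2B-as-draft setup mentioned above might look something like this (the quant filenames are placeholders; `--draft-max` caps how many tokens are speculated per step):

```shell
# Main model plus a small draft model for speculative decoding.
llama-server \
  --model gemma-4-31b-Q5_K_M.gguf \
  --model-draft gemma-4-E2B-Q8_0.gguf \
  --draft-max 8
```

Both models must share a tokenizer/vocabulary for the drafted tokens to be verifiable, which is why a small model from the same family is the usual pick.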
fulgencio_batista@reddit
Not a bad idea. I'm hoping z-lab makes a dflash SD model for Gemma. I found dflash to outperform MTP, but I haven't stuck with it for Qwen due to the extra overhead. It would be sick for 26b-a4b though, for a speedy little one.
Kodix@reddit
Oo, that's good to know. Didn't tinker with Qwen3.5 too much, unfortunately it just doesn't fit in my VRAM well enough.
No-Manufacturer-3315@reddit
Other than the crashes every other prompt.
Kodix@reddit
Absolutely has not been an issue for me. Occasional crashing, but very rare, and getting rarer with llama.cpp updates. May be your hardware, or your setup.
No-Manufacturer-3315@reddit
Hmm, I keep seeing my system RAM grow and grow until it crashes with the latest llama.cpp.
grumd@reddit
thank me later
No-Manufacturer-3315@reddit
Let me try thanks!
Kodix@reddit
Sounds like a K/V cache/attention issue. Is flash attention on? Are you quantizing your cache? What backend are you using?
Oddly enough, for gemma4 *specifically*, not quantizing attention in llama.cpp has *significantly* better results for me than doing so. It's really weird, but I haven't looked into this more deeply yet.
This is on the ROCM backend though, which affects things severely. Vulkan acted very differently in previous tests.
No-Manufacturer-3315@reddit
I sized my context to my GPUs; about 180k of context fits.
4090 + 7900 XT using the Vulkan backend.
I'm trying to use 31b Q6 with a Q8 KV cache, as I heard it's important not to shrink it for Gemma.
Flash attention is on; should it be off?
Kodix@reddit
That's.. huh. I don't have specific suggestions, but I see the same behaviour (with the VRAM usage ballooning) when I do *any* quantization of the K/V cache with Gemma versus leaving it at the default. Performance drops, and I start seeing the ballooning you mention, when I change it to any value at all (including Q8_0 for both K/V). So try running the attention unquantized to start with.
And yeah, flash attention should be on. It's a huge benefit when it works.
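Concretely, a hedged llama-server sketch of both setups (the model filename and context size are placeholders; newer builds take `-fa on|off|auto`, older ones a bare `-fa`):

```shell
# Flash attention on, KV cache left at the default f16 (the setup that worked here)
llama-server -m gemma-4-31b-Q6_K.gguf -fa on -c 65536

# Quantized K/V cache for comparison (-ctk/-ctv set the cache types;
# this is the configuration that ballooned in the reports above)
llama-server -m gemma-4-31b-Q6_K.gguf -fa on -ctk q8_0 -ctv q8_0 -c 65536
```

Note that V-cache quantization in particular requires flash attention to be enabled in llama.cpp.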
FinalCap2680@reddit
For me Qwen 3.5 122B-A10B (Q8 with RAM offloading) looks best from what I have tried.
Blues520@reddit
What hardware do you run this on?
FinalCap2680@reddit
Don't laugh ;) It is slow, but works: a Dell T7910, single Xeon 2680v3 with 192GB RAM + RTX 3090. The model version is UD-Q8_K_XL and I'm running it with 262k context and the recommended settings for coding.
PS: Compared to other ~120B models, also at UD-Q8_K_XL (Mistral-Small-4-119B-2603 and NVIDIA-Nemotron-3-Super-120B-A12B), I liked Qwen 122 more.
Also liked it more than Qwen3 Coder Next 80B-A3B UD-Q8_K_XL, Qwen3 30B-A3B-Instruct-2507 Q8_0, and the dense Qwen3.5 27B UD-Q8_K_XL.
CloudEquivalent7296@reddit
Cool, how many t/s do you get? Could you share your llama command?
FinalCap2680@reddit
I'm not interested in speed, so my setup is not optimised at all. I'm running LM Studio (I know, it is slow) and the 3090 is power capped at 170W.
The 120B models start at ~3 t/s generation for Qwen and Nemotron and ~6 for Mistral, and go down with growing context. Qwen 3 Coder Next (80B) also starts at ~3 t/s.
XtremeBee1970@reddit
Luv qwen3.5! Haven’t seen a better model than that. Not seeing much diff between the 9b and the 27b models personally, but I’m only on 5070ti w/ 16gb ram… 27b looks like almost same exact outputs but is much slower on my machine due to lower vram…. Thnx for posting! Interesting thread. Always in the lookout for better models! What will come after qwen3.5!? Hmmmm!
jwpbe@reddit
Really late to this thread, but I would give the RYS variant a try. It duplicates some of the blocks of the model where its reasoning is strongest:
https://huggingface.co/dnhkng/RYS-Qwen3.5-27B-FP8-XL
The blog post explaining it:
https://dnhkng.github.io/posts/rys-ii/
Tormeister@reddit
Looks interesting, but this doesn't fit in a 5090
jwpbe@reddit
?
Use a GGUF quant; Q5 or Q6 will fit easily.
Tormeister@reddit
I read the blog post and checked the linked FP8 models, not realizing there are smaller quants as well, my bad.
ArugulaAnnual1765@reddit
The Opus distill v3 works well for me. Also using IQ4_XS, as it's just as good as Q6 but I can get the full 256k context on my 5090.
Shronx_@reddit
What did you do to make it work? If I hook up any of the Qwen3.5 models (27b, 35ba3b, 9b, Qwopus, ...) via llama.cpp server with Claude, generation stops after the first token. Qwen3 Coder Next works fine, on the other hand.
RnRau@reddit
Are your models and llama.cpp up to date? Are you following the unsloth guide on the recommended settings? https://unsloth.ai/docs/models/qwen3.5
Shronx_@reddit
unless I missed something, yes.
AlreadyBannedLOL@reddit
I can't find any 27b Opus distill v3 on Hugging Face. There's a 4b, but... why...
Account-67@reddit
I believe they are referring to: Jackrong/Qwopus3.5-27B-v3-GGUF
Maximum-Wishbone5616@reddit
Q8 KV F16
LegacyRemaster@reddit
To be honest... if you have DDR5 (128GB or 96GB) + your 32GB of VRAM, try MiniMax 2.7. It's MoE.
Free-Combination-773@reddit
With enough RAM you can try the 122b variant.
cosmicr@reddit
I'm curious what kind of coding people do with models like these, because half the time I can't even get models like Sonnet or Opus to do what I want. What languages? I'm assuming Python, because that's what most models always suggest.
Leading-Month5590@reddit
I'm running qwen3.5:122B-A10B at IQ4_XS precision on 40GB of VRAM and 64GB of system RAM and get about 20 tok/s on an R9 9950X3D machine. It's actually quite good, so if you have enough system RAM and a good processor I would advise you to try it. Maybe get an RTX 5060 Ti 16GB as a secondary GPU for that.
Doct0r0710@reddit
For agentic stuff it's good, but if you still use it as an "old school" chatbot, I found Qwen 3 Coder 30B and Nemotron Cascade 2 to be more consistent. Might be specific to our codebase though.
iThunderclap@reddit
Why not run a cloud open-source model at full inference levels? No need to run locally if what you do isn't absofuckinlutly secret.
Creepy-Bell-4527@reddit
Have you considered RAM-maxing and using krasis with MiniMax M2.7 Q2 or Q3?
Because if anything will actually rival Claude or Codex, it's that.
Front-Relief473@reddit
In my use, UD-Q3_K_XL already shows an obvious decline in ability, so the rule of thumb has some basis: quantization should not go below Q4.
Thunderstarer@reddit
Gemma 4 31b is less "anxious" for me, and I qualitatively feel like I've had more reliable results. On the other hand, that 3.6GB SWA context window really hurts.
annodomini@reddit
Bruh. OOM'd my 128 GiB system a few times before I figured out what was going on.
Other than that, Gemma 4 31b does feel pretty nice.
grumd@reddit
Limiting ctx-checkpoints, parallel, and cache-ram works well with Gemma.
qubridInc@reddit
Not really, Qwen3.5-27B is still one of the best for that VRAM; you can try Qwen3 Coder, but it’s more of a sidegrade than an upgrade.
povedaaqui@reddit
Have you tried an MoE model?
Reggitor360@reddit
Tried Devstral Small 2 2512 in Q8 yet?
ayylmaonade@reddit
In all of my testing, Qwen 3.5 27B (even at Q4) is superior to Devstral Small 2. It's not even a competition, really. Devstral is a good model, no doubt, but it's no Qwen.
Eyelbee@reddit
Why UD-Q5 on a 5090, can't you fit a larger quant? You'd get 30-40k context even with q8
Unlucky-Message8866@reddit
I haven't found anything better: llama.cpp + unsloth UD quants + recommended hparams. Minimal system prompt with the pi coding agent. I use both the dense and the MoE. Automated linting and type checking mandatory.
noctrex@reddit
Well, you can try the larger 122B model, offloading some tensors to RAM. Or even MiniMax, if you have 128GB of RAM.
starkruzr@reddit
Wait, isn't 3.5 supposed to be natively multimodal? Why do you need mmproj?
Skystunt@reddit
For .gguf files, multimodality is stored separately in a .mmproj file; the .gguf has just the text inference capabilities. At least, that's how llama.cpp works.
dinerburgeryum@reddit
Nope, for GPU-mid folks in the 32-48GB range it still comes out on top.
denoflore_ai_guy@reddit
Nemotron Cascade 2 has impressed the shit out of me. Getting 140 tok/s at Q8 with unquantized KV and 16 experts; it's solid for me.
Just_Maintenance@reddit
I’ve tried Gemma 31b but qwen 3.5 27b is more reliable for me
putrasherni@reddit
Gemma 31B, if its tool calling works.
jacek2023@reddit
I'm currently running opencode with Gemma 31B. I told her to look at my project generated with Codex (GPT/gemma) and write a better one :)