Is Qwen 27B dense really the best local agentic coding model for 32GB VRAM?
Posted by soyalemujica@reddit | LocalLLaMA | View on Reddit | 134 comments
I haven't seen benchmarks or tests for it (for example, the "growing tree with branches and leaves" HTML prompt), so I'm curious whether there's really anything better than it for coding.
CoolestSlave@reddit
For me it's not even close: Qwen3.5 27B is the best in the 24GB ~ 32GB VRAM range.
Even though I've barely tried Gemma 4 31B, I've read strong positive sentiment about it. A user managed to run it on a single RTX 5090: https://www.reddit.com/r/LocalLLaMA/comments/1sbdihw/gemma_4_31b_at_256k_full_context_on_a_single_rtx/?tl=fr&utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Without TurboQuant, though, the model is unusable on a single GPU; it will eat your whole system's memory at 5K token context.
gearcontrol@reddit
I had a group of local LLMs rewrite two bash scripts that I hacked together years ago (I'm not a coder), then had Claude grade them. Notably, the top three were all correct.
Combined Overall Ranking (weighting both threads, models appearing in both get averaged):
Poha_Best_Breakfast@reddit
I’ve been running Gemma 4 on a 3090 and it works fine. The IQ4_XS quant works really well, and you can fit a decent KV cache at Q8.
DertekAn@reddit
Ohhh, how much context are you running with IQ4_XS and the KV cache at Q8?
Also, I always thought you should never go below Q4_K_M.
Poha_Best_Breakfast@reddit
IQ4 and Q4_K_M have almost the same divergence. Basically any quant at Q4 is fine.
DertekAn@reddit
Oh wow, 64k is a lot. And thank you!
erazortt@reddit
Some days back I read a post here where somebody showed that Gemma 4 is sensitive to KV quantization. So it would perhaps be best to use at least build b8699, since attention rotation has been enabled for Gemma 4 there as well.
siegfried2p@reddit
Attention rotation isn't working as expected. Am I doing something wrong?
.\llama-server.exe -m .\models\gguf\gemma-4-26B-A4B-it-UD-IQ4_XS.gguf -ctv q4_0 -ctk q4_0 -fa on
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes, VRAM: 12281 MiB
load_backend: loaded CUDA backend from C:\LLM\ggml-cuda.dll
load_backend: loaded RPC backend from C:\LLM\ggml-rpc.dll
load_backend: loaded CPU backend from C:\LLM\ggml-cpu-zen4.dll
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b8708-ae65fbdf3
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache: CUDA0 KV buffer size = 22.50 MiB
llama_kv_cache: size = 22.50 MiB ( 4096 cells, 5 layers, 4/1 seqs), K (q4_0): 11.25 MiB, V (q4_0): 11.25 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
llama_kv_cache_iswa: creating SWA KV cache, size = 4096 cells
llama_kv_cache: CUDA0 KV buffer size = 800.00 MiB
llama_kv_cache: size = 800.00 MiB ( 4096 cells, 25 layers, 4/1 seqs), K (f16): 400.00 MiB, V (f16): 400.00 MiB
llama_kv_cache: attn_rot_k = 0
llama_kv_cache: attn_rot_v = 0
sched_reserve: reserving ...
Healthy-Nebula-3603@reddit
Cache Q4 ?
You're serious? Even with rotation it's useless.
Poha_Best_Breakfast@reddit
Unfortunately attn_rot doesn't work with Gemma 4 as of yesterday, since it has SWA, but even then Q8 is generally pretty good. At least in the coding tasks I've given it, it's performing very well.
Okay, just checked out your linked release and it's brilliant; it seems this will fix attn_rot and make Q8 lossless. I'll try it out today when I get back from work.
Either way, in the next few weeks we should get full TurboQuant working in llama.cpp and all the Gemma 4 gremlins sorted out.
But even now I feel Gemma 4 31B is better than Qwen 27B, as Qwen just thinks too much and takes too long to code. In the time Qwen 27B codes, I can have Gemma 26B code -> Gemma 31B review -> Gemma 26B fix the implementation.
Healthy-Nebula-3603@reddit
Is rotation already working for Gemma 4?
erazortt@reddit
Are you sure you tried the build I linked? It was built only 6 hours ago. It wasn't working until that build, but with it the logs now say attn_rot is enabled.
Poha_Best_Breakfast@reddit
No no, read my comment again: I haven't tried your linked release. I said I'll try it once I get home from work in the evening. That release should fix it, as SWA was what was causing it not to work.
I'm saying that even without attn_rot it was working fine at Q8; now it should work even better.
Healthy-Nebula-3603@reddit
Gemma 4 already got rotation cache for Q8 so it should be very close to FP16 now
IrisColt@reddit
I'm using Q4_K_M from unsloth and KV at Q4, and 65535 context (I have 64GB of RAM). I see no reasoning degradation.
DertekAn@reddit
Ohhhhhhh, with a mac?
Classroom-Impressive@reddit
Nvidia's 4-bit ModelOpt quant + vLLM also works on a single 5090 with no noticeable loss in quality, and it has PagedAttention, which is why I'm more of a fan of it compared to GGUF.
EffectiveCeilingFan@reddit
Don’t write it off quite yet. llama.cpp implementation is still incomplete as far as I know. Gemma 4 should use roughly half the KV cache of the equivalent Qwen3.5 since it has a unified KV, it just hasn’t been implemented yet.
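For a rough sense of the sizes being argued about here, a back-of-envelope KV-cache calculator can help. The layer/head/context numbers below are purely hypothetical, not the actual Gemma 4 or Qwen3.5 configs:

```python
# Rough KV-cache size estimate:
#   bytes = 2 (K and V) * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
# Halving the number of cached layers (e.g. via a shared/unified KV) or the
# bytes per element (f16 -> q8) halves the cache.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Total K+V cache size for a full context at the given precision."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical GQA config: 48 layers, 8 KV heads of dim 128, 32k context.
full = kv_cache_bytes(48, 8, 128, 32768)      # f16: 2 bytes/elem
q8 = kv_cache_bytes(48, 8, 128, 32768, 1)     # q8: ~1 byte/elem
print(f"f16: {full / 2**30:.2f} GiB, q8: {q8 / 2**30:.2f} GiB")
```

With these made-up dimensions the f16 cache works out to 6 GiB and q8 to half that, which is why both cache quantization and a unified KV matter so much at long context.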
Hot-Employ-3399@reddit
I tried it twice: once a couple of days after release, and last time probably yesterday.
So far no luck. Either the model sucks, or llama.cpp handles it badly, or the unsloth quants broke it.
The first time, it failed to call tools (it repeatedly passed a string rather than an array; the dense model did the same twice, but then realized what to pass).
The second time it was better, but like the initial DeepSeek V3: it went "but wait" over and over and over.
And it didn't call tools to edit files. It pasted the patch right into the chat for review instead of using the tool, then went "but wait" and started rewriting it. I stopped it after 16K tokens and no code had been edited.
Maybe it'll get better, but compared to how adaptable Qwen is, so far it's very bad.
SocialDinamo@reddit
I am using a 4-bit quant of Gemma 4 26B on a single 3090 with vLLM. It's powering a couple of agents and I am really impressed! Super quick, and it orchestrates clearly defined tasks very well!
nakedspirax@reddit
Using q8 on a strix halo and it's done building and SEO for me. Good shit.
SocialDinamo@reddit
If you don’t mind me asking, what t/s are you getting on the Strix Halo? I have a Framework 395 128GB as well that I should try it on. I just knew the 3090 would get me a ton of speed.
nakedspirax@reddit
Simple header for website.
Prompt: 1002.64 t/s, eval: 35.80 t/s.
No_Afternoon_4260@reddit
On strix halo? What quant?
nakedspirax@reddit
Strix halo Q8
Hot-Employ-3399@reddit
What quants do you use? (Type like gguf/awq and source like Unsloth/Bartowski)
_bones__@reddit
What kind of performance do you get? Meanwhile I can't even load E2B in vLLM with 12GB of VRAM...
-dysangel-@reddit
I tried it on Google AI Studio and it seemed just as bad
Kitchen-Year-8434@reddit
I’ve had big issues with unsloth on Gemma. Bartowski at comparable bpw is almost twice as fast.
And commits improving it are landing daily. You have to build llama.cpp from HEAD.
trusty20@reddit
Man, the Gemma glazing posts in this sub over the past week are just absurd. People simultaneously say it's the best model, literally competing with Claude Opus, and also "please give it a chance, it's got a broken tokenizer / isn't properly implemented right now." Or "check this benchmark out, it shows it's the best model ever!" but then "ehhh, benchmarks don't matter, it's so much more creative."
CoolestSlave@reddit
hopefully, it really seems to be a monstrous model for its size
gr8citizen@reddit
I'm running llama.cpp compiled from the TurboQuant branch with Gemma 4 31B Q5 heretic on a 4090. The KV cache spills to RAM, but it's very fast. I don't have figures on hand, but it feels faster than GPT-5.4.
Polite_Jello_377@reddit
Have you tried Qwen3-coder?
soyalemujica@reddit (OP)
It's what I use every day, at UD-Q4_K_M; it's amazing tbh.
cyberdork@reddit
You mean Qwen3-coder? or Qwen3-coder-next?
soyalemujica@reddit (OP)
Qwen3-Coder-Next!
cyberdork@reddit
Ah, I see Qwen3-Coder-Next-UD-Q4_K_M.gguf is 49.3GB :-(
Ill-Chart-1486@reddit
Did you try comparing it to models like Haiku? I'm trying to use local models, but they're not even close to budget external models.
Ok-Idea2943@reddit
Heavy advocate of Glm 5.1 here
soyalemujica@reddit (OP)
If I had the money, I would consider it too 😭
FatheredPuma81@reddit
Pretty sure it's the best open-source model, period, under 397B parameters right now.
Trademarkd@reddit
qwen3.5 blew me away, then i tried gemma4. Wildly better.
Ok-Idea2943@reddit
I’m a super fan of glm 5.1 myself
FatheredPuma81@reddit
Good luck running that on 32GB of VRAM...
GregoryfromtheHood@reddit
What about the 122B? I've been running it instead of the 27B dense and it has been really excellent
sagiroth@reddit
Ehhh, it only has A10B, so it works out similarly to Qwen3.5 35B-A3B. I think some benchmarks showed the 27B not being that far off from the A10B one.
GregoryfromtheHood@reddit
Wild. I think I can run the 27B faster and with more context than the 122B, but in the tests I was doing, especially on longer context agentic stuff, 122B seemed to be getting into more detail and getting more stuff right. I might have to switch back to 27B and try again.
EbbNorth7735@reddit
The geometric mean of 122 and 10 is roughly 35, so the 122B is roughly equal to a 35B dense model from the same series. The benchmarks show it's slightly better than the 27B. The basic idea is: if you can't fit or use the 122B, the 27B needs way less RAM and is just about as good, but it will be slower. The 27B also fits into a 24GB or 32GB consumer-grade card, while the 122B fits in an RTX Pro 6000.
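That rule of thumb is just the geometric mean of total and active parameter counts; as a quick sketch (parameter counts in billions):

```python
import math

# Rule-of-thumb "effective dense size" of an MoE model:
# the geometric mean of its total and active parameter counts.
def effective_dense_b(total_b, active_b):
    return math.sqrt(total_b * active_b)

print(round(effective_dense_b(122, 10)))  # 35, i.e. roughly a 35B-dense equivalent
```

Whether the heuristic itself still holds for modern sparse models is exactly what the replies below dispute; the arithmetic is the uncontroversial part.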
EstarriolOfTheEast@reddit
That geometric rule of thumb was never empirically validated and is generally thought to no longer hold for sparser models trained on more data with better router balancing. The 122B holds many more precomputed/partially computed heuristics, and more knowledge, than can be described in just ~15GB. I too find the 122B much better suited for detailed or complex work. The 27B is great when you're working on things that are well covered on the internet.
That the 122B is better at tasks requiring specialist expertise seems counter-intuitive with only 10B active. But that 10B is dynamically arranged per token, selected out of tens to possibly hundreds of realized expert combinations (depending on model size and expert configuration), each specialized at predicting that one token. It's like the wizard archetype that doesn't have much mana output but has mastered countless spells across various scenarios, and so makes up the difference versus a powerhouse.
EbbNorth7735@reddit
Okay, well, the benchmarks are pretty cut and dried and show it slightly ahead, roughly what the geometric mean indicates. It's not an exact science, and your anecdotes require data.
EstarriolOfTheEast@reddit
As you well know, benchmarks are not reliable, and benchmaxing* is enough of a thing that collective anecdotes (i.e. vibes) carry more weight than benchmark numbers. We know the Qwen models do benchmax while also being legitimately good. Now, if we look at the benchmark numbers, we'll see the 35B is also very close to the 27B and 122B, even beating or matching the 27B in a few places. The 122B roughly matches the 27B except in the agentic section, where it performed notably better. The 35B also stays very close to both, the 27B in particular (note the graphs do not start at zero, and accounting for testing noise, the diffs are often not significant). Those same benchmarks thus counter the geometric-mean rule of thumb. What is actually happening is that smaller models of whatever architecture do well on benchmarks they've prepared for but fail to generalize beyond them, compared to larger models (in terms of total parameters).
*Someone on X/Twitter recently ran a test based on Nvidia's https://arxiv.org/abs/2510.27055; it found stronger evidence that the Qwen models studied harder for the benchmarks than the Gemma models did. Curiously, they incorrectly concluded that this means the Gemma models will generalize better; all it really means is that the Gemma models will leave you with better-calibrated expectations of model quality. They're probably very close in real-world use, except that the Qwen models may be unexpectedly strong if what you're doing has a similar enough shape to what they "over-trained" for.
sagiroth@reddit
It's all about trade-offs. I think what's important for agentic work is speed: how quickly you can iterate on the plan, findings, and debugging. Personally, I find large models that are slow (since you can't fit them entirely in VRAM) are only good for planning. You offload most of the model to RAM, give it a few hours to plan out the work, then fire up a smaller model (that fits in VRAM) to execute the plan. Even this feels kind of redundant, though, because wouldn't it be better to run the 27B model 3-4 times over the plan in much, much less time to find gaps in it, and then execute? It's all about working out what's best for your use case. For my day job I use the Claude provided by work, but if that were ever taken away from me, or I needed privacy, I would be happy carrying on with the 27B model.
ikeo1@reddit
If you don’t mind my asking, what do you use to coordinate this? Do you have an orchestration tool, or do you set it up via cron? I’m wondering what the best method for this is these days.
sagiroth@reddit
I don't at the moment, but I've tried: I have two separate scripts for llama.cpp to load either model. I think there is also llama-swap, or something similar, to hot-swap models, but I haven't tried it.
ikeo1@reddit
Thanks for the reply; I'm currently working on a tool that would do that. What would your ideal setup be? To give you the high level: I have a supervisor that does the planning and distributes work to worker nodes, and the supervisor selects the model to distribute work to. I'm mainly looking to build pluggable tools for devs working in multi-agent systems, and asking devs what they would want.
nakedspirax@reddit
Slower than qwen3.5 A3B due to parameter size but smarter.
sagiroth@reddit
Yep, if you can fit it, sure, but I personally value speed with decent knowledge over being smart and slow AF. I only have a 3090 and 32GB of RAM, so I can't really speak for the 122B. 180K context with the 27B is good enough for me, though.
my_name_isnt_clever@reddit
The story is the opposite on unified-memory hardware. The 122B is doing great on my Strix Halo; I haven't even tried the 27B because it would crawl.
FatheredPuma81@reddit
According to benchmarks it's a four-way tie alongside other models like Minimax M2.5 and Deepseek V3.2 ("LLM Leaderboard - Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others"). I can't vouch for the validity of these benchmarks, but I don't see why they'd be wrong. If they're right, then the 27B is the best model to run on a 5090 right now.
DeepOrangeSky@reddit
Better than even GLM 4.7 355B-A32B? That would be pretty crazy for a 27B model. I assume GLM 4.7 is the next strongest after Qwen 397B and Qwen 27B? And then, like, a close call between Qwen 235B, Minimax 230B, Mimo, and Qwen Coder Next 80B?
I'm curious how strong Devstral 123B is at coding, btw. I know probably almost nobody uses it since it must be ridiculously slow, but I mean compared to those models in raw smarts.
I'm not a coder or anything, so I have just been using local models casually so far for writing, chatting, etc., and so far Mistral 123B and its fine-tunes have still been the strongest I've used by far (although I haven't had enough memory to run any of the 200B+ models yet). But I would say it's actually close between Qwen 27B and the Llama 70B fine-tunes for second strongest at writing/chatting after the 123B model, despite Qwen being so much smaller. It seems very strong for its size. Gemma 4 31B has been very good too, although it has this thing where memory usage keeps shooting way up as you use it, and in LM Studio you have to reload the model after each reply or the memory use goes totally crazy after not even that long an interaction. It has something to do with the stuff they were discussing in this thread, but I don't know much about computers, so I don't really understand it or how to apply the --cache-ram 0 --ctx-checkpoints 1 fix mentioned in the GitHub discussion about it. If I use LM Studio, where do I type that?
unjustifiably_angry@reddit
I only tried Q3CN briefly (though with great optimism) and quickly went back to Q3.5-122B. I might've just been using it wrong, but it was clearly, obviously inferior at working on my Python script.
FatheredPuma81@reddit
As I said in my other reply, I'm going off benchmarks from a single website. Apparently it's roughly a five-way tie with GLM 4.7, Minimax M2.5, Deepseek V3.2, Qwen3.5 122B, and Qwen3.5 27B.
DeepOrangeSky@reddit
Ah, yeah, its benchmarks are pretty crazy. A lot of people here feel benchmarks can be a bit misleading, since models can be very bench-maxxed if they train on data that has the benchmark tests in it or whatever. And also just in basic terms: at 27 billion total parameters it doesn't have room for nearly as much world knowledge as some of those really big models. But I wouldn't be surprised if it can actually hang with them, or beat them in some specific tasks. In my casual use it does seem very strong for its size.
PinkySwearNotABot@reddit
why 27B > 32B?
FatheredPuma81@reddit
The 35B variant is a MoE model. MoE models trade quality for speed.
Dany0@reddit
It's a multidimensional tradeoff. At any given point, assume more params => more breadth of knowledge. A 31B MoE might suck at coding but know a little about everything. A well optimised 27B dense model can outperform Claude Mythos at coding, but have the EQ of a rock
grumd@reddit
It's called 35B-A3B. Yes, the total parameter count is 32B, but "A3B" means "active 3B", so for each token only 3B params are used at a time.
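That active/total split is why the MoE is so much more responsive at a similar memory footprint: decode compute scales with active params, while memory scales with total params. A rough illustration (the 2-FLOPs-per-param figure is the usual approximation for a forward pass, not a measured number):

```python
# Approximate decode cost: ~2 FLOPs per *active* parameter per generated token.
# With params given in billions, the result is in GFLOPs per token.
def decode_gflops_per_token(active_params_b):
    return 2 * active_params_b

dense_27b = decode_gflops_per_token(27)  # dense: all 27B params fire every token
moe_a3b = decode_gflops_per_token(3)     # MoE: only the ~3B routed params fire
print(f"MoE decode is ~{dense_27b / moe_a3b:.0f}x cheaper per token")
```

In practice decode is usually memory-bandwidth-bound rather than FLOP-bound, but the same active-parameter ratio drives the bandwidth needed per token, so the speed gap points the same way.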
misha1350@reddit
You need to test Qwen3.5 27B and Gemma 4 31B yourself. Gemma 4 31B is supposed to be better for agentic coding. Hopefully Alibaba will release a Qwen3.6-27B soon and it will get even better.
IrisColt@reddit
I can run Qwen3.5 27B at 256k on 24GB (Q4 KV) no sweat.
misha1350@reddit
Well, that's Q4. I do hope TurboQuant support rolls out soon so I can run something like UD-Q5_K_XL or UD-Q6_K_XL, or run Gemma 4 31B whenever necessary (although Qwen3.5 27B is still extremely good, and sometimes beats Gemma 4 31B).
IrisColt@reddit
You nailed it!
Nyghtbynger@reddit
Holding my download until 3.6 drops
deenspaces@reddit
Currently Gemma 4 31B even fails tool calls (Q8 with LM Studio). The context cache is heavy, and trying to use quantization (an option in LM Studio) sends the model into loops.
Otherwise the model is very similar to Qwen3.5 27B.
m94301@reddit
Same with the MoE. Bad tool calls; it can often recognize and fix them, but then goes back to the original bad tool call every single time.
Danmoreng@reddit
Well, if you want to try the latest models, you shouldn't rely on the outdated version of llama.cpp inside LM Studio; use llama.cpp directly, ideally built from source for your hardware: https://github.com/Danmoreng/llama.cpp-installer
OfficialXstasy@reddit
Don't use LM Studio. I can do tool calls fine with Gemma 4 on llama.cpp.
Danfhoto@reddit
Have you updated the engine? It was on an old llama.cpp until yesterday. Working quite well with OpenClaw and OpenCode for me now.
Healthy-Nebula-3603@reddit
Yes .... currently
raketenkater@reddit
Try https://github.com/raketenkater/llm-server; it recommends the best model for your system and tunes the shit out of it for your hardware.
But yes, 27B Qwen works well, especially the Opus 4.6 distill, though so does Gemma 4.
InstaMatic80@reddit
I’m using it in my own agent and it works pretty well. However, some are saying Gemma 4 is performing great too, so I need to give it a try. Has anyone tried Gemma? I only have 24GB (3090), though.
my_name_isnt_clever@reddit
I've been messing with it, but tool calls are still really unreliable. I need to hold my horses with new model releases, because they never work right for the first couple of weeks.
InstaMatic80@reddit
Yes, I tried it with my agent and the tool calls are wrong: they add a tool-colon-tool prefix, for example system:notify or notify:notify instead of just notify… did you find the same issue?
ReentryVehicle@reddit
If you have 128GB of RAM or more, Qwen 397B might be an option at some IQ2 or Q3 quant (just remember to set a high ubatch in llama.cpp for faster prompt processing).
It will of course be much slower, but depending on your setup it can be usable.
erazortt@reddit
What ubatch size are you using?
GCoderDCoder@reddit
Usually just the 512 default. Have you found tweaking it to help with these? I haven't seen much value in it for many models in the past, so I don't even test anymore, lol. I'm usually only doing one or two requests at a time, so I'm not sure it will make much difference for personal inference. I test for usability, i.e. how well I can actually use the model, versus theoretical numbers.
unjustifiably_angry@reddit
see here https://old.reddit.com/r/LocalLLaMA/comments/1sflepf/is_qwen27b_dense_really_the_best_local_agentic/of2y9xo/
Thrumpwart@reddit
Helps with pp speed if you have the memory to support it.
ReentryVehicle@reddit
4096, but I have more VRAM (RTX Pro 6000); I could probably go higher. Without it I get some 100 t/s PP; with it, ~550 t/s, which makes Q3_K_XL usable.
It could be that llama.cpp is doing something inefficient, or that my CPU somehow struggles to talk to the card, because at no point did I see the PCIe throughput maxed out (20GB/s during prompt processing), so maybe (much?) higher PP is possible.
unjustifiably_angry@reddit
Did you actually benchmark that? I'm talking about 3.5-122B here, but I found that at short depth/prompt much of the benefit trails off above 2048, and at greater depth there's little benefit over 2048 either. You might be able to run a higher quant instead. Different architecture, though; it might not apply.
Here's the script; it takes a couple of hours or so. Adjust as needed (probably just the model location, and maybe Q8 rather than F16):
@echo off
set CUDA_VISIBLE_DEVICES=0
set MODEL_PATH="A:\AI\Llama\Models\Qwen3.5-122B-A10B\Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf"
set OUTPUT_FILE=prefill_results.txt
echo ----------------------------------------------------------- >> %OUTPUT_FILE%
for %%D in (8192 16384 32768 65536 131072) do (
    for %%P in (2048 4096 8192 16384) do (
        for %%U in (256 512 768 1024 1536 2048 2560 3072 4096) do (
            echo Running D=%%D P=%%P UB=%%U
            echo [DEPTH: %%D] [PROMPT: %%P] [UBATCH: %%U] >> %OUTPUT_FILE%
            llama-bench.exe ^
                -m %MODEL_PATH% ^
                -ngl 99 ^
                -p %%P ^
                -b 32768 ^
                -ub %%U ^
                -d %%D ^
                -n 0 ^
                -dio 1 ^
                -fa 1 ^
                -r 3 ^
                -ctk f16 ^
                -ctv f16 >> %OUTPUT_FILE%
            echo ----------------------------------------------------------- >> %OUTPUT_FILE%
        )
    )
)
SkyFeistyLlama8@reddit
Qwen Next Coder 80B at Q4 barely fits into 64GB of RAM, but it's worth the effort if you can run it. No matter how good Gemma 4 26B or Qwen3.5 35B are, the big 80B model still feels a lot smarter.
Hopefully Alibaba or Google come up with a 70B MoE next, instead of these 120B or 400B monsters that few people can run.
Thrumpwart@reddit
The Apex quant of Qwen 3.5 122b is very, very good. Apex is kind of flying under the radar but it’s a quality quant technique and the Apex 122B is my new fave model to run on the Mac Studio.
ReentryVehicle@reddit
I did not try the Qwen coder, but Qwen3.5 122B and 27B are similarly smart in my experience. The 35B is much weaker than both.
The 27B overall feels the most "focused" somehow: if it faces an issue when coding, especially a failing test, it goes into a sort of "make it pass or run out of context trying" mode where it writes pages of thinking, but usually it gets there one way or another.
EmPips@reddit
Can confirm: UD_IQ2_M 397B beats Q8 27B, but good luck with the prompt processing if most of it is in system memory and on the CPU.
ReentryVehicle@reddit
That's why you need the ubatch to be sufficiently high (I use ubatch=batch=4096; increase it until you stop seeing gains at long context or run out of VRAM).
In llama.cpp, the entire prompt processing runs on the GPU. I think it essentially sends layers to the GPU one by one, computes that layer's outputs for n=ubatch tokens, then sends the next layer. With a high enough ubatch, transferring layers to the GPU becomes insignificant and you are limited purely by compute.
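That layer-streaming behavior can be sketched with a toy cost model; every constant below (layer size, layer count, PCIe bandwidth, per-layer compute rate) is made up for illustration, not measured from llama.cpp:

```python
# Toy model of offloaded prompt processing with layer streaming:
# each layer's weights cross PCIe once per ubatch-sized chunk of the prompt,
# so a larger ubatch amortizes the transfer cost over more tokens.

def pp_tokens_per_sec(prompt_tokens, ubatch, layer_gb=1.5, n_layers=60,
                      pcie_gb_s=20.0, gpu_tok_layers_per_s=2.0e5):
    chunks = -(-prompt_tokens // ubatch)                     # ceil division
    transfer_s = chunks * n_layers * (layer_gb / pcie_gb_s)  # weight streaming
    compute_s = prompt_tokens * n_layers / gpu_tok_layers_per_s
    return prompt_tokens / (transfer_s + compute_s)

for ub in (512, 1024, 4096):
    print(ub, round(pp_tokens_per_sec(32768, ub), 1))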
GCoderDCoder@reddit
I would still use a higher quant of the 27B over those lower-quant 397B versions. I have Q4-level quants of the 397B, and I use it to come up with ideas for features; then the 27B codes them.
At that level it's like the character Sister Sage from The Boys: she's burdened by too much intelligence, so she lobotomizes herself to feel normal. That's Qwen3.5 397B at Q4. It's not obviously handicapped, but you def walk into the room realizing this isn't the normal 397B.
Sorry, the season is just starting, so that's all I can think about... that and the impending AI apocalypse... and the general imminent nuclear apocalypse... I might actually need a lobotomy now that I think about it :)
erazortt@reddit
Perhaps you can show something to back up your point?
Even medium-size MoE models quantize well at Q4. So with just hearsay and no proof, I have a hard time believing that a model this big has issues at Q4.
GCoderDCoder@reddit
The best way to describe this: I've started enjoying using game creation as a fun way to test models. You can visually see the level of complexity the models code. Bigger models tend to think of more artistic elements, like clouds, mountains, and enemies; higher quants of smaller models tend to give better details, like 3D shading design elements, or the main character will look like a person at Q8 versus a box at Q4.
Xcreates does videos comparing quant performance, and it's not always linear, especially with quants like unsloth dynamic quants (he does MLX, so I'm talking about unsloth, not him). Anecdotally, I have repeatedly had a Q5_K_XL of certain models beat the Q6_K_XL of the same model at doing a task error-free, so I think the relative bit allocation across different weights either removes or introduces more noise at certain quant levels. That's me guessing, though, but the observation is still the observation...
erazortt@reddit
Oh, that sounds like a fun way of testing models. What programming language do you have the models do that in?
uk-youngprofessional@reddit
What sort of harness / agent wrapper are people using for local models?
Are people rolling their own, or using something like Claude Code pointed at Ollama?
DistanceAlert5706@reddit
Opencode, and my own wrapper
cunasmoker69420@reddit
I started by using Claude Code pointed to my local qwen and then eventually settled on using Qwen Code. I find Qwen Code has slightly better integration with its native models (vision support, among other things). You'll just have to try and see what works for you
my_name_isnt_clever@reddit
I'm using https://pi.dev/; the ability to point Pi at its own folder and ask it to create its own extensions has been amazing. I'm running it with Qwen3.5 122B via llama.cpp.
Ok-Internal9317@reddit
Check out cognithor
DistanceAlert5706@reddit
Yes. A bit slow in llama.cpp, and sadly it doesn't really work in vLLM; I've had no luck with ik_llama either. Maybe some day they'll support it.
Soft_Match5737@reddit
Dense vs MoE matters more for agentic coding than people realize. With MoE, different tool-use calls can activate different expert sets, which means the model's behavior is less consistent across a multi-step agent loop — one step might route through strong coding experts while the next routes through weaker ones. Dense models give you predictable quality per token across the entire chain, which is why Qwen 27B dense punches above its weight in agentic tasks even though MoE models score higher on single-turn benchmarks.
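As a toy illustration of that routing behavior, here's a minimal top-k router sketch. The expert count, top-k value, and random logits are all made up; real routers are learned linear layers over the token's hidden state:

```python
import random

# Minimal top-k MoE routing sketch: each token's router scores select a
# (possibly different) subset of experts, so the active sub-network can
# change from one step of an agent loop to the next.
N_EXPERTS, TOP_K = 64, 8

def route(token_logits):
    """Return the indices of the top-k scoring experts for one token."""
    ranked = sorted(range(N_EXPERTS), key=lambda e: token_logits[e], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
tok_a = [random.random() for _ in range(N_EXPERTS)]  # stand-in router scores
tok_b = [random.random() for _ in range(N_EXPERTS)]

overlap = set(route(tok_a)) & set(route(tok_b))
print(f"experts shared between the two tokens: {len(overlap)} of {TOP_K}")
```

Two unrelated tokens will usually share only part of their expert set, which is the per-token variability the comment above is describing; a dense model, by contrast, runs every weight for every token.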
Born-Caterpillar-814@reddit
It depends on how much RAM you have and what architecture your GPU is. I am running Q3CN @ Q8 with https://github.com/brontoguana/krasis for local coding.
OmarBessa@reddit
how is this krasis thing not talked about more? it looks amazing
Born-Caterpillar-814@reddit
My thoughts exactly. I am in no way affiliated with the dev, but I find it amazing, although it is still in early development. For example, I had to create a proxy script to make krasis understand OpenAI API requests in array format (krasis atm only understands string format).
Born-Caterpillar-814@reddit
I don't know why this got downvoted, but running Qwen3 Coder Next on 24GB VRAM + enough RAM really is a thing with krasis. And a fast one, too. I have a 5090 + 128GB DDR5 RAM and I get around 4.1k t/s pp and 40 tok/s decode.
ormandj@reddit
I don't think the 3090 vs. 5090 difference should lead to over an order of magnitude faster pp; are you sure the config etc. didn't change?
Born-Caterpillar-814@reddit
Well, all I did was remove the RTX 3090 and A4000 from my rig and install the RTX 5090 in the PCIe 5.0 slot the RTX 3090 previously occupied. Then I reinstalled krasis with cu130, which supports the 5090. I haven't seen anyone else run benchmarks on an RTX 3090, so I can't be sure why the difference is so big.
fermuch@reddit
I'm curious: how much VRAM and RAM, and what speeds does it get you?
HopePupal@reddit
It's the reason I got an R9700. It worked well enough on my Strix Halo that I wanted to throw hardware at it for the speedup.
It still fucks up sometimes even at Q8, but for real, I think it's smarter than any other Qwen 3.5, except maybe the 397B-A17B. And are there other coding models that fit in 32GB? The only ones I can think of are GLM Flash 4.6v and 4.7, which are strictly worse IME, and Gemma 4…
Gemma 4 31B is about the only other thing remotely in its class right now, but the runtimes still seem a little buggy. That'll probably be better in a matter of days, and we'll be able to compare coding performance more fairly. Qwen's instruction following isn't always perfect, and the previous Gemma had a good rep for that, so maybe it'll be worth looking at.
soyalemujica@reddit (OP)
You're running Q8 27b in a single R9700?
HopePupal@reddit
absolutely not lol. sorry about the ambiguity. i have run Q8 on the Strix Halo for quality comparison with the Q6 i run on the R9700.
soyalemujica@reddit (OP)
Would you mind sharing the results of the "tree growing with branches and leaves" prompt on the 27B? I haven't seen this model's results with it.
HopePupal@reddit
i'm not familiar with it. is that the whole prompt?
soyalemujica@reddit (OP)
https://www.instagram.com/reel/DTVaBJSjTvM/?igsh=MWR6cDJlaDNxN29nZA==
mp3m4k3r@reddit
From a prompt of:
Qwen3.5-27B unsloth Q6_K_XL gave something that does render, but isn't exactly the same as any of the ones in that Insta post. The code was too long for a reddit post.
Stats:
soyalemujica@reddit (OP)
Did the tree grow, along with the branches and leaves?
mp3m4k3r@reddit
Yep, it grew from a seed to part of the screen in height, and the leaves grew in green and changed to red. The sun rose and the color of the sky changed with it.
It'd probably take a couple more prompts to get it more realistic. I also ran it on the 9B, 35B-A3B, and 2B at the same time; they basically ranked best to worst as 27B -> 9B -> 35B-A3B -> 2B.
soyalemujica@reddit (OP)
It's great to know it managed it. I wanted to know if the 27B dense could handle this, because Qwen3-Coder does not, even at Q8, and even the GLM Flash models fail at it, so this is good! Thank you for taking the time to test this and write it up; I wanted to confirm that the 27B would be the better model to use locally as a coding agent for C++.
mp3m4k3r@reddit
It's definitely worth a try; Qwen3 (or 2.5)-Coder should be worth a try as well.
I've found that 3.5-9B is actually really competent at a lot of things too. I use them both for Arduino projects as well as loads of Python, JS, Docker, shell scripts, hell, even just project chat back and forth. They've worked well with pi-coding-agent as a harness, as well as RooCode. I use llama-cpp-server for hosting and OpenWebUI as the front end for harnesses to connect via.
Tried Omnicoder-9B and it's alright (but aggressive in making changes; hoping v2 is improved). I may give the latest Nemotron models another run soonish as well.
mp3m4k3r@reddit
Had Qwen3.5-35-A3B transcribe a screenshot I took of the Instagram post the op got this from:
Interesting-Print366@reddit
Depends. I don't feel a significant difference between the 27B and 35B-A3B. The 35B-A3B might be better if you're handling famous libraries.
silentus8378@reddit
so much recency bias.
WetSound@reddit
No, it's Gemma 4 31B, and will be even better soon
Pleasant-Shallot-707@reddit
31B is the dense model. 27B is the MoE.
kaisurniwurer@reddit
SWA and constant context re-processing will make it very sluggish compared to Mistral for example. But quality is likely the best in this size.
Maleficent-Low-7485@reddit
qwen3.5 27b on 32gb is genuinely hard to beat right now for agentic stuff.
SharinganSiyam@reddit
Yes
g_rich@reddit
The Qwen3.5 models are right now some of the best available; I personally prefer 35B-A3B over 27B because it is much more responsive with only a small hit to quality. Gemma 4 seems promising, but I've been getting better results from Qwen, so I'm sticking with it for agentic and coding work. Qwen3 Coder Next at a 4-bit quant is also very good and might work for you, but it would need to be offloaded to RAM, so performance might be worse than Qwen3.5.
soyalemujica@reddit (OP)
I have used both, Coder is much better than the 35B by a big margin tbh
Technical-Earth-3254@reddit
Tbh, it's kinda the only choice rn if you want decent speeds without offloading.