Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m?
Posted by StandardLovers@reddit | LocalLLaMA | 87 comments
I don't have good experience running q4_k_m; the difference from q6 is going from "a few errors an hour" to "a few errors every couple of days".
MapSensitive9894@reddit
Yes! I use IQ4_XS from unsloth with q4 kv with OpenCode and general chat. It works very well for my use case for feature level coding and larger scale prototyping with python+html
Its planning definitely needs to get checked, as it will make some architectural decisions that are not ideal (like legacy libs). I’ll point it in the right direction and it will implement most milestones flawlessly. Occasionally it will create a weird bug or forget an edge case that I have to nudge it on, but that’s like a 5-10 min detour.
I also was testing it against gpt 5.5 nano on a specific task involving long running agents and mcp tooling and it surprisingly (and annoyingly) outperformed gpt 5.5 nano.
The only time I encounter looping issues is when the context size for the workload far outpaces the pinned context set. But that’s why I have it running at 230k context
13henday@reddit
Q4kxl at 128k is no worse than q6 in my usage.
tired514@reddit
Most of the time I either use 3.6-35B-A3B-MTP at MXFP4 for stuff that doesn't matter (scanning codebases to update READMEs, summarizing, brainstorming, etc) or 3.6-27B-MTP at UD_Q6_K for stuff that does. I haven't had good luck with Q4 when it comes to turning out code.
Having said all that, I'll take 3.5-122B-A10B-MTP at Q6 over both of them any day. It feels so much less frenetic, and much more confident, like a more senior dev. More pleasant to work with. The changes are more carefully considered and though it does make programming mistakes, 90% of the time I end up with a better implementation.
If you look at the chain of thought the smaller models tend to be "oh wait I see it! Wait. Maybe this! Maybe that! I need to just try something! Ok, fixed! Try that!" and 122B tends to be more "Let's figure this out."
I'm so looking forward to a mid-sized 3.7 model. crosses fingers.
Embarrassed-Rich3397@reddit
When it's being used agentically, always go for the higher quant. If this is just for short chats, the errors would be less noticeable.
ixdx@reddit
If I need a Q4_K model, I usually use the Q4_K_L by Bartowski. It uses Q8_0 for some weight tensors. I compared it when Qwen3-Next-Coder was released, and the error rate relative to the M variant was lower. With Q6_K, of course, the error rate is even lower.
cleversmoke@reddit
I'm patiently waiting for Bartowski to release an MTP version of Qwen3.6-27B 🙇🏻
jopereira@reddit
The current versions are already MTP enabled.
ixdx@reddit
He released the MTP layers as a separate file, similar to mmproj. https://huggingface.co/bartowski/Qwen_Qwen3.6-27B-GGUF/blob/main/mtp-Qwen_Qwen3.6-27B-Q8_0.gguf
cleversmoke@reddit
Oo!! Then it could just be a grafting on top, I'll try over the weekend. Thank you!
TheTerrasque@reddit
I've been using unsloth's q4_k_xl for programming (pi) and haven't had any issues with it. What do you mean by "errors" here?
qudat@reddit
I’m constantly hitting edit errors. It’s the tool that fails so consistently. I’m thinking about making a pi extension to tell the agent to generate a git diff then apply it because pi’s batch edit tool sucks atm
TheTerrasque@reddit
I've seen edit fails happen occasionally, maybe one time in 10 sessions. But then it usually just re-reads the file, tries again, and continues. Tool call failures bad enough to stop the agent have happened 3-4 times in about as many weeks. And I've been using pi a lot.
DarkAndrei@reddit
I’ve been using unsloth's q3_k_m, also for coding with pi and a 147k context window. It does not get stuck… but it does make bugs… quite a few 🤔
soyalemujica@reddit
The bugs are due to Q3; some stuff was lobotomized, so it makes those mistakes. I personally have never experienced it make a mistake in Q4 XL or Q5KM.
jopereira@reddit
I have used IQ3 XXS with turbo3 and it works flawlessly. 160K context on 16GB. 20-45 tg. I'm now using MTP with turbo4 and 100k context with tg from 50-80t/s.
StandardLovers@reddit (OP)
Q3 quant works flawlessly for what? Creative writing?
jopereira@reddit
For creative writing Qwen is not your best choice. And 27B 3.5 is better than 3.6 in that regard. I use it for coding (python, web UI, C++), with reasoning off (it thinks deep enough for coding tasks).
Turbulent_War4067@reddit
Newbie here, what do you mean by "reasoning off"? Turning off its CoT processing? How did you do this!?
jopereira@reddit
On llama-server it's '--reasoning off'.
jopereira@reddit
Let me add: I'm using VS Code with GitHub Copilot mainly.
johnzadok@reddit
Which 16GB GPU do you have with 20-45 tg? What about prefill speed?
hovo1990@reddit
Can you please share llama.cpp command?
jopereira@reddit
https://www.reddit.com/r/AIToolsPerformance/s/llpICA0A0s
hovo1990@reddit
Thanks a lot
jopereira@reddit
Just a note: the current versions of bartowski's GGUFs are already MTP enabled. To download the non-MTP version, you have to check the file upload history.
DarkAndrei@reddit
Same here on rtx 5080. I use Q3KM with turbo3 @147k context, 30-45tg…. I didn't try MTP yet. What llama.cpp build are you using that has MTP and Turbo support?
jopereira@reddit
TheTom version. It already has MTP (but, at least my llama-server build, doesn't have chat UI working - I don't use it anyway, but nevertheless is a negative point)
Septerium@reddit
It fails a lot for me, since my projects contain a lot of content in Portuguese. Most q4/q5 ggufs seem to be kind of broken for non-English languages.
codeanish@reddit
I’ve been using it at q4 with MTP recently on a 3090. It works decently well, but I can echo the thoughts about errors every now and again. Would love to run this at FP8, but with a 256k context, what sort of hardware are we actually talking about here? Anything affordable to mere mortals while actually providing decent enough speed? For context, I’m currently getting >70 tok/s on the 3090 with MTP and a q4 kv cache until the context gets big.
AdIllustrious436@reddit
I'm getting 40–50 tok/s on an RTX 3090 under a 290w powercap, using llama.cpp (upstream) with Q4_K_M quantization, Q4 KV cache, and a 256k context length. MTP hit rate averages ~70%. What configurations or optimizations do you use to push past 70 tok/s?
DeSibyl@reddit
Use MTP. You’d go from 30-40 to 70-80 t/s
AdIllustrious436@reddit
I already do ....
DeSibyl@reddit
Dang and you only get 30 t/s? Are you loading it entirely on vram? I get a boost from 30 t/s to 85 t/s on Q8
AdIllustrious436@reddit
Yep full VRAM, I'll try different powercap value maybe. With MTP it's closer to 40 tok/s but still far from 70 :(
DeSibyl@reddit
Shouldn’t be the power cap, I cap my 3090’s at 250w… which quant are you using? I was using one that said I can put the MTP tokens at 3 instead of 1 or 2…
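A rough way to see how the MTP hit rate and the number of draft tokens discussed in this exchange interact: under the simplifying assumption that each drafted token is accepted independently with the same probability, the expected number of tokens produced per verification pass is a geometric sum. The function name and the independence assumption are illustrative, not taken from llama.cpp, and the sketch ignores the cost of producing the draft tokens, so real speedups will be lower.

```python
def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    accept_prob and drafting stops at the first rejection. The verifier
    always contributes one token itself, hence draft_tokens + 1 terms.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if accept_prob == 1.0:
        return float(draft_tokens + 1)
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - accept_prob ** (draft_tokens + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # 0.7 stands in for a ~70% hit rate; draft lengths 1 to 3 are examples.
    for k in (1, 2, 3):
        print(k, round(expected_tokens_per_step(0.7, k), 2))
```

Under these assumptions, going from 1 to 3 draft tokens helps less and less with each extra token, because later drafts are only used if every earlier one was accepted.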
AdIllustrious436@reddit
I'm not in the lab rn but as far as I remember, I run the Q4_K_M, need to check my MTP args tho. Thanks for the input.
Sofakingwetoddead@reddit
That was my experience, as well. I could live with the errors, though. They'd be looping issues or tool call issues, or occasionally the coder would stop mid work. q6 reduced the occurrence pretty dramatically. fp8 kv16 stopped it entirely. If I had to run q4 or q6, I'd still be happy doing it, but requires a bit more babysitting than with fp8
It also depends on what you're doing. Having a play? q4 is fine. Building professional-use software for a high-stakes industry, then fp8 :D
alexkey@reddit
Can fp8 run on cpu only? My setup is (separate machine) ryzen 5900X with 128GB ram, no GPU. Right now running q4_k_m which works but super slow - code gen goes at 5-6 tokens rate.
fasti-au@reddit
You should be moe and mtp chasing I think. CPU you need to grind too much to be happy dense so api architect like deepseek for the cheapness and you use yours inside for privacy and control.
35b you can run in 8gb vram so in your situation you're doing the same thing but 100% cpu so dflash and mtp are the way you get the math to be in cpu as the lowest matrix. By tuning 5x5 math before tuning the context to everything you will get usable cpu inference.
This is sorta like LLMs finally being treated like a math issue not a search issue which is Lou these and just money farming
alexkey@reddit
Sorry friend you just threw a lot of smart words at me here that all are gibberish to me at this stage. I am very new to running this locally and there are not many good guides on all the different terminology and techniques.
Sofakingwetoddead@reddit
fp8 could run on cpu only, the full RAM requirement is around 48gb. It would be slower, though, than the 5-6 tokens you're currently getting.
alexkey@reddit
Thanks! Guess I’ll need to address the speed from HW angle then, the newbie level information on this is really sparse (was trying to make a post just earlier asking for suggestions on how to improve my setup and it got removed because “karma too low” 🤷🏼‍♂️)
Sofakingwetoddead@reddit
Aw, that sucks. There are some p inconsiderate people in this sub, for sure. Have you tried having a look at r/LocalLLM ? Some good info there, too. If I were you, I'd try to find a way to get 32gb and run q6 kv 8 qwen 27b. Nvidia def better performance per dollar, but Radeon more accessible.
alexkey@reddit
Thanks, I’ll try that sub as well.
By “32GB” do you mean a GPU? Yes that was one of the Qs in my removed post. I’ve seen that people can do dual Nvidia, so 2x 16GB is doable. But I don’t like dealing with Nvidia drivers on Linux, and I did some basic ML with rocm before so wondering if I can do dual 9070/9060 instead.
zampson@reddit
I have a twin 9060 XT 16GB setup on Ubuntu, ROCM+MTP with the unsloth 27B MTP Q4 model over 20 Tok/sec
alexkey@reddit
Thanks for sharing this, would you mind sharing some details of your setup? Are you running native or docker? llama-server? What arguments do you use to run it? Thanks in advance
Sofakingwetoddead@reddit
You can. I am on Linux, as well. No trouble here but I have a single Nvidia card. I tried the r9700 before this Nvidia card and it was too slow for me to use for work. I then ran some calculations on 'performance per dollar spent' - Like, how many tokens p sec generation am I getting per dollar? How many prompt processing tokens am I getting per dollar? The Nvidia option was dramatically better than the Radeon even though the Radeon was far cheaper. Per dollar, it was like half the value. So, I talked myself into going with Nvidia and it was a great choice. Everything running amazingly well, far beyond my expectations.
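The "performance per dollar" comparison described above can be sketched in a few lines. The card names, speeds, and prices below are placeholders chosen only to show the arithmetic; they are not measurements from this thread.

```python
def tokens_per_second_per_dollar(tokens_per_second: float, price_usd: float) -> float:
    """Throughput bought per dollar of hardware cost."""
    if price_usd <= 0:
        raise ValueError("price_usd must be positive")
    return tokens_per_second / price_usd


# Hypothetical cards: (generation tok/s, prompt-processing tok/s, price in USD)
cards = {
    "card_a": (60.0, 2000.0, 2000.0),
    "card_b": (20.0, 600.0, 1300.0),
}

for name, (gen_tps, pp_tps, price) in cards.items():
    gen_value = tokens_per_second_per_dollar(gen_tps, price)
    pp_value = tokens_per_second_per_dollar(pp_tps, price)
    print(f"{name}: {gen_value:.4f} gen tok/s per $, {pp_value:.3f} pp tok/s per $")
```

With placeholder numbers like these, the cheaper card can still come out at roughly half the value per dollar, which is the shape of the result described in the comment.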
32gb vram should be the minimum target if you're coding so you can run q6 27b qwen w/ 8bit KV ... you can squeeze it into 32gb with ~200k context, and it performs very well, in my use case.
If you build a harness (behavioral instructions) for Qwen to iron out the annoying tendencies, in my experience it will perform about as well as Opus. That sounds nuts but that has literally been my experience. I code for ~12 hours a day using him. Not a coder. Quality and speed (on blackwell) are Opus-esque. Faster than opus, maybe nearly or same quality.
I don't have any experience setting up dual GPUs but I'm sure you can find some good threads on it. If you have bifurcation on your PCIe lanes to where you can run 8x8, then the dual setup is probably a good option but I can't say for certain.
alexkey@reddit
> The Nvidia option was dramatically better than the Radeon even though the Radeon was far cheaper
That's interesting, I guess it is all the result of CUDA prevalence. I've got both Nvidia and AMD used for other things right now so I will test with them I guess to see.
> If you build a harness(behavioral instructions) for Qwen to iron out the annoying tendencies, in my experience it will perform about as good as Opus
This is something I am very interested in. Would you recommend any specific place/doc to read on to understand this better?
Sorry for so many Qs, I am a complete newb to the self-hosting llms (even with 20 odd years of doing programming and systems management). Appreciate all the information you have provided!
Sofakingwetoddead@reddit
> This is something I am very interested in. Would you recommend any specific place/doc to read on to understand this better?
I don't know of any resource but there was a thread posted on this subject in localllama. I just intuitively developed a method to direct the coding agent to a markdown which I used to record the directions that I was constantly repeating into each prompt. I have a header that directs the agent to a folder which has, in it, an orientation. It tells the coder to read some mandatory stuff, then tells it to read other things based on its task. The net effect is the agent produces predictable and desirable behavior on each session.
One of its instructions is to propose 'tips and tricks' should it overcome a hurdle through trial and error. This way, next time, it can refer to the tip and save itself from repeated test failures and solving it all over again.
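A minimal sketch of the orientation-folder idea described above, assuming a hypothetical layout: a mandatory `ORIENTATION.md`, optional task-specific notes, and a `TIPS.md` that the agent appends to. The file names and both function names are made up for illustration and are not the commenter's actual setup.

```python
from pathlib import Path


def build_orientation(folder: Path, task: str, task_docs: dict[str, list[str]]) -> str:
    """Concatenate the mandatory orientation, task-specific notes, and tips.

    task_docs maps a task type (e.g. "bugfix") to the extra files the agent
    should read for that kind of work.
    """
    names = ["ORIENTATION.md"] + task_docs.get(task, []) + ["TIPS.md"]
    parts = []
    for name in names:
        path = folder / name
        if path.exists():  # skip notes that have not been written yet
            text = path.read_text(encoding="utf-8").strip()
            parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)


def record_tip(folder: Path, tip: str) -> None:
    """Append a 'tips and tricks' entry so later sessions can reuse it."""
    with (folder / "TIPS.md").open("a", encoding="utf-8") as fh:
        fh.write(f"- {tip.strip()}\n")
```

The returned string would be placed at the top of each session's prompt, so the agent starts from the same instructions every time instead of relying on the user to repeat them.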
I'm not a coder, so I need to rely on these types of systems. It's really really easy to build this out, especially since you actually know what you're doing and I don't LOL
I would talk with gpt about it, come up with an implementation gameplan and start playing around. It's p simple really. Sorta like if you were to get a job at a fast food restaurant or walmart. It's orientation and protocols.
Hipponomics@reddit
It can but it would be even slower.
alexkey@reddit
Which variant would be more suitable for CPU then? Or is the q4_k_m it then and I need to look at better hardware?
Hipponomics@reddit
Sure :) Think of it like this. Let's assume you have enough RAM on whatever device you're using. You then have two resources, memory bandwidth and compute. You are always constrained by either one of these. For each inference step, you need to read the whole model from memory and the speed you can do that with is limited by your bandwidth. You then have to perform computations on all the weights in the LLM and the speed of that is constrained by compute.
A smaller Q4_K_M needs ~40% less bandwidth than fp8, but the same amount of compute. GPUs are practically always limited by bandwidth, so a smaller quant means more speed. I don't know if the same is true for CPUs. If you're heavily compute constrained, fp8 might be similarly fast but just higher quality. If you're bandwidth constrained, it's going to be slower. I would recommend an MoE for CPU use, as they use much less compute and bandwidth at the cost of higher RAM use. Qwen3.6 35B-A3B is a good alternative to 27B.
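The bandwidth argument above can be put into a back-of-the-envelope sketch. It assumes decoding is purely bandwidth bound and that every weight is read once per generated token; the average bits per weight for Q4_K_M (4.85) and the 100 GB/s memory bandwidth are assumed values for illustration, not measurements of any setup in this thread.

```python
# Upper bound on decode speed when memory bandwidth is the bottleneck.
# All constants below are illustrative assumptions.


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def bandwidth_bound_tps(size_gb: float, bandwidth_gb_s: float) -> float:
    """Tokens/s ceiling if all weights are read once per token."""
    return bandwidth_gb_s / size_gb


PARAMS_B = 27.0                       # dense 27B model
BITS = {"Q4_K_M": 4.85, "FP8": 8.0}   # assumed average bits per weight
BANDWIDTH_GB_S = 100.0                # hypothetical memory bandwidth

for name, bits in BITS.items():
    size = weights_gb(PARAMS_B, bits)
    tps = bandwidth_bound_tps(size, BANDWIDTH_GB_S)
    print(f"{name}: {size:.1f} GB of weights, at most {tps:.1f} tok/s")
```

With these assumed bit widths, Q4_K_M moves about 39% fewer bytes per token than fp8, in line with the ~40% figure. For an MoE, only the active parameters would be read per token, which is why the same bound comes out much higher there.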
Celestial_aki@reddit
Three weeks on Qwopus3.6-27B-v2-MTP at Q4_K_M as the workhorse for my own coding-agent harness (was on vanilla Qwen 3.6 and 3.5 before that): the failure mode that bit hardest wasn't bad code, it was tool-use drift. At Q6_K the model honours "write to file X" almost always; at Q4_K_M I started seeing it confidently invent file paths every so often, then loop trying to read its own hallucinated file. DifficultDog8435 and FullstackSensei describe the same shape.
The thing nobody documents: Q4_K_M + MTP/spec-decoding is uniquely bad for agents, worse than either knob alone. A Q4 draft produces tokens the verifier rejects right on tool-call JSON boundaries (commas, closing braces, quote escapes), so you pay full quant tax AND lose half the speedup. Equal_Television_894 above is right — NVFP4 cleared it up for me on the 5090.
Genuine ask for the agents-first crowd: anyone got clean IFEval / tool-use bench numbers across Q4 → Q5 → Q6 → FP8 → NVFP4 on Qwen 3.6 27B? I keep meaning to run it properly and daily-driver work eats the slot.
segmond@reddit
I have always encouraged folks to go as high in quality as they can, at the expense of speed, for serious work. For small/medium models, Q8 or nothing for me. Even at 5-10tk/sec.
relmny@reddit
I mentioned this yesterday, there are ppl here that claim that q4 is almost useless (loops, errors, etc) and when going to q6 almost all goes away (or happens rarely).
There is a big difference between q4 and q6. If you can do q6, go with it without even thinking about it.
nastywoodelfxo@reddit
yeah i run q6 minimum for anything agentic. q4 works fine for chat but tool calling gets weird fast, especially function arguments. you'll get structurally valid json with wrong parameter names or swapped values and the orchestrator wont catch it.
the quant degradation shows up in the boring parts, not the creative ones. q4 can still reason through a problem but itll mess up the handoff format between stages which breaks everything downstream.
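One way an orchestrator could catch the "structurally valid JSON with wrong parameter names" case is to check every tool call against a declared schema before executing it. This is a minimal sketch; the `write_file` tool, its parameters, and the call format with `name` and `arguments` keys are hypothetical and not tied to any particular agent framework.

```python
import json

# Hypothetical tool registry: tool name -> {parameter name: expected type}
TOOL_SCHEMAS = {
    "write_file": {"path": str, "content": str},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the call matches."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    schema = TOOL_SCHEMAS[name]
    problems = [f"unexpected parameter: {key}" for key in args if key not in schema]
    for key, expected_type in schema.items():
        if key not in args:
            problems.append(f"missing parameter: {key}")
        elif not isinstance(args[key], expected_type):
            problems.append(f"wrong type for parameter: {key}")
    return problems
```

Feeding the returned problems back to the model gives it a chance to retry instead of silently breaking the next stage. A check like this catches invented or misspelled parameter names, but not swapped values of the same type, which still need tests or a reviewing step.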
Equal_Television_894@reddit
I am using the Native MTP preserved NVFP4 version. Don't know, it never gets stuck like Q4.
russianguy@reddit
Equal_Television_894@reddit
Sure https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF
ThePixelHunter@reddit
This is the dedicated NVFP4 repo:
https://huggingface.co/llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF/tree/main
kant12@reddit
Same. Q6 really is the minimum and most of the time I'm using BF16.
curious_ab0ut_stuff@reddit
How much VRAM do you need for Bf16?
kant12@reddit
Yeah 96 is realistically what you'd want.
jonydevidson@reddit
4x the q4 and 2x the q8.
BlackBeardAI@reddit
I am running bf16 with full context (260k) on 4x3090's atm (96gb vram)
fasti-au@reddit
Well it depends. You see it’s a funnel really. For most people the 27b on one card is now very much viable. It’s like a 16 gb q4 and that’s fine for internal stuff but it does mean you are the brain and it is the follow where if you have a api build the arg the 27b do the internal wiring and the 35 b do the one shots you get everything from 2 3090 and then it’s scaling.
For me I’m in way too deep for a home lab with 20 cards in play but I’m trying to be science and mathing stuff. For others a 4b mtp task manager can work well for just openclaw mcp driving no brains just better or more adaptive replacement to say email rules and templates.
The Q is more about targeting smarts and the reality is moe we can cut 35b apart and remove Lithuanian party tricks from coding and force more but the baseline 27b q4 qwen was built q4 in many ways so if you talk about smarts vs size they hold up better at q4 than say mistral devstral which can’t even make 5 calls in a row work at iq4 is last I looked.
So q4 with tools is great but you're leaning on linters not in good code syntax in some cases. The whole why guess or generate when it exists.
acerackham@reddit
What is the best version of 3.6 27b for coding then? I am a web and app developer and have a 5080 inside a pc with 96gb ram if that context helps.
Awwtifishal@reddit
I use q4_k_m but with q8 linear attention tensors (like unsloth's). You can use llmfan46's quants, which also put more bits into those tensors.
Napster3301@reddit
the agentic failure mode at q4 isnt random noise, its concentrated in rare tokens. tool call json structure, function names, and structured arguments are low-frequency in training data, so quantization error hits them harder than it hits prose tokens. for chat this looks like "slightly worse writing." for agents it looks like "forgot the closing brace on the third tool call" or "invented a parameter name."
thats also why imatrix quants help here. they calibrate against representative data so the rare-but-critical tokens dont lose precision relative to common ones. for agentic work specifically id reach for q4_k_xl with imatrix over plain q4_k_m. the size delta is small and the tool call reliability difference is measurable.
FortiTree@reddit
Did you specifically run into those problems? I heard similar comments but not sure how true it is.
With TDD and a separate agent to review the code, I thought these would be caught pretty quickly?
StandardLovers@reddit (OP)
I have to test q4_k_xl, it might be the solution to run q4 with agentic work. Will test it. Thanks.
My_Unbiased_Opinion@reddit
I've used 27B down to IQ3XXS on my hermes agent. The only issue I had is sometimes it would fail tool calls, but it would self recover. Never had an unrecoverable fail. The biggest issue is that sometimes you have to remind it to stay on the specific task exactly. Not a big issue. I've had Q8 35B straight up delete codebases.
I'm using IQ4XS 27B now with KV cache at Q4. It's more focused and has fewer tool call errors.
cibernox@reddit
I use it, because it allows me to have 200k context. Or more importantly, two agents with 100k context each.
Endurance_Beast@reddit
Yeah, general system administration tasks are fine. But not coding.
ResponsibleTruck4717@reddit
Yes, and I was quite surprised how good a kv cache of q4_0 was. I managed to get around 110k context size on 24gb.
soyalemujica@reddit
24gb I fit 120k context with q5_1/q4_1 with Q5KM dense with MTP
ResponsibleTruck4717@reddit
Really? can you share your settings please?
soyalemujica@reddit
I just did in that comment. Ofc using Linux
Ok-Measurement-1575@reddit
Q4KM? No.
Q4KXL? Yes.
I even run 35b Q2KXL for some tasks.
llama-impersonator@reddit
no, however q5km is fine.
Pristine-Woodpecker@reddit
Sure, works fine, gets you large context. Main issue is model getting into loops or breakdown at large context, but at least you can get to that point, eh.
Mammoth-Pass9658@reddit
For agentic stuff, q4_k_m failures are awful coz they look correct until the agent drifts 20 steps later. q6 is way more stable over long runs.
DifficultDog8435@reddit
For normal chat it’s usually fine. The problem is agents fail in annoying little ways. Not always “the answer is totally wrong,” more like it forgets one instruction, picks the wrong file, misses an error message, or confidently goes down the wrong path.
StandardLovers@reddit (OP)
This is the same thing I was experiencing. Failing in annoying little ways, like it was dumbed down a little.
DifficultDog8435@reddit
yea ive been working on an app where you can train the models, been fun to see them go crazy or get smart
FullstackSensei@reddit
If you're using it for code, there's a lot more to it than a few errors. There's quite a bit of good code that you never see coming out. Things like better error handling, better edge case handling, more thorough unit tests, and a lot of other little things like that.
cleversmoke@reddit
Yea, I think Qwen3.6-27B Q4_K_M is quite good for Python development. I used it for some time when I only had one RTX 3090 24G. I paired it with q8_0 KV cache and it did well with 128k context. It created minor bugs where a second or third pass cleared it up quickly.
Even Q5_K_M (what I'm using now) creates just as many bugs on its first pass, but I'm at a larger context now, so it's expected (both quants seem to degrade after ~128k context).