Tried Qwen3.6-27B-UD-Q6_K_XL.gguf with Claude Code; well, I can't believe it, but it is usable
Posted by Clasyc@reddit | LocalLLaMA | View on Reddit | 19 comments
So I tried running Qwen3-27B-UD-Q6_K_XL.gguf with 200K context on my RTX 5090 using llama.cpp. I'm getting around 50 tok/s, which is fine I guess; I don't really know this stuff, so it might be improvable. But what I want to say is: I haven't tried local models for coding in quite a long time, and hell, I can't believe we're at the point where it's actually usable. Of course it's not the same first-class experience as Opus 4.7, but damn, we are getting closer and closer.

I tried quite a difficult task, not casual CRUD stuff, to see if it could even produce a plan that makes some sense, and it did very well on the first try.
Of course, that's just a general first impression and I haven't done real day-to-day coding with it, but I like what I see, and it looks much more promising than my earlier experience with other models, which could start producing total nonsense at some point.
shifty21@reddit
What does your llama-server config look like?
Clasyc@reddit (OP)
Here it is:
Also, I built llama, like so:
relmny@reddit
Instead of using --cache-type-k q4_0 --cache-type-v q4_0,
wouldn't it be better to use the TheTom Turboquant fork with:
--cache-type-k turbo3 --cache-type-v turbo3
(or even --cache-type-k q8_0 if you want to further reduce K)
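For a rough sense of what's at stake in the cache-type choice, here's a back-of-the-envelope KV-cache size estimate. The model dimensions below are illustrative guesses, not the real Qwen3-27B config, and the bytes-per-element figures approximate llama.cpp's f16/q8_0/q4_0 block formats (34 and 18 bytes per 32-element block for q8_0 and q4_0, including the scale):

```python
# Approximate bytes per stored element, including block scales.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, k_type, v_type):
    """Estimate total KV-cache size for a given context length and cache types."""
    per_token = n_kv_heads * head_dim  # elements per layer, per K (or V)
    k = n_layers * n_ctx * per_token * BYTES_PER_ELEM[k_type]
    v = n_layers * n_ctx * per_token * BYTES_PER_ELEM[v_type]
    return k + v

# Illustrative 27B-ish dims: 60 layers, 8 KV heads, head_dim 128, 200K context.
for kt, vt in [("f16", "f16"), ("q8_0", "q8_0"), ("q4_0", "q4_0")]:
    gib = kv_cache_bytes(60, 8, 128, 200_000, kt, vt) / 2**30
    print(f"{kt}/{vt}: {gib:.1f} GiB")
```

At a 200K context the cache type dominates VRAM use, which is why the q4_0-vs-q8_0 choice matters so much more here than at short contexts. (No idea what turbo3's block size is, so it's left out.)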
Clasyc@reddit (OP)
Haven't tried it. I originally wanted to use Turboquant, but since everyone told me it isn't supported, I didn't even research the possible options. Will try this fork, thanks.
shifty21@reddit
Thank you for this!
Looking at Unsloth's docs for running Qwen3.6-27B, they note different parameters for temp, top-p, min-p, and top-k for coding tasks compared to yours, which look more like general-task settings. https://unsloth.ai/docs/models/qwen3.6
Coding tasks:
Curious if using those values makes a difference in output quality for coding.
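For anyone unsure what those knobs actually do, here's a toy sketch of the temp/top-k/top-p/min-p filtering chain. This is an illustration, not llama.cpp's actual sampler implementation (its chain order is configurable), and the parameter values are arbitrary:

```python
import math

def sample_filter(logits, temperature=0.7, top_k=20, top_p=0.8, min_p=0.0):
    """Toy sampler chain: temperature-scaled softmax, then top-k -> top-p -> min-p.
    Returns the surviving (token_index, prob) pairs, renormalized."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    # Sort candidates by probability, highest first.
    probs = sorted(((p / z, i) for i, p in enumerate(exps)), reverse=True)
    probs = probs[:top_k]                # top-k: keep only the k most likely tokens
    kept, cum = [], 0.0
    for p, i in probs:                   # top-p: smallest set covering top_p mass
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    floor = min_p * kept[0][1]           # min-p: cutoff relative to the best token
    kept = [(i, p) for i, p in kept if p >= floor]
    z = sum(p for _, p in kept)
    return [(i, p / z) for i, p in kept]
```

Lower temperature and tighter top-p/top-k shrink the candidate pool, which is the usual rationale for stricter settings on coding tasks than on general chat.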
Clasyc@reddit (OP)
Thanks, that was quick "dumb" testing on my side; will try more variations this week.
ibbobud@reddit
Yep let’s see it….
AnihcamE@reddit
Are there any cons to using Mistral Vibe instead of Claude Code?
SharinganSiyam@reddit
But isn't it too damn slow in Claude Code?
Clasyc@reddit (OP)
Yes, it feels slower compared with Anthropic's native API models, but for my use cases it's still usable, as I like to re-read and fully understand everything myself. I have high hopes that we'll get even more optimizations in the future, so model speed might increase in general.
SharinganSiyam@reddit
Same here. I tried following Unsloth's guide to fix the ~90% slower inference in Claude Code ("How to Run Local LLMs with Claude Code" in the Unsloth documentation), but it still doesn't work. Also, I'm using UD-Q5_K_XL with q8_0 KV cache quantization on my RTX 5090. Should I switch to UD-Q6_K_XL with q4_0? Do you see any performance degradation?
Clasyc@reddit (OP)
Honestly, I can't tell about performance degradation yet, as I haven't done long coding sessions. But I'm planning to do a proper comparison this week with different parameters to see the impact.
qubridInc@reddit
We’ve crossed the “actually usable” line, not Opus-level yet, but good enough to seriously get work done locally.
fredandlunchbox@reddit
Benchmarking at 4.5 levels, which is very usable.
I think the trick might be a combination of both: Use frontier models for planning or really hard bugs, but qwen for execution.
I think it could probably be done with skills, one way or the other. Either a skill in claude to execute the plan which spins off a qwen agent or a skill in qwen that uses claude to plan.
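The plan-with-frontier, execute-with-local split could be sketched as a simple router. Everything here is hypothetical: the model labels, the keyword heuristic, and the idea that a skill would dispatch this way are assumptions, not any real Claude Code or skill API:

```python
# Hypothetical task router for the "frontier plans, local executes" split.
PLANNING_KEYWORDS = ("plan", "architecture", "design", "debug", "root cause")

def choose_model(task_description: str) -> str:
    """Route planning and hard-debugging work to a frontier model,
    and routine execution to the local Qwen endpoint."""
    text = task_description.lower()
    if any(kw in text for kw in PLANNING_KEYWORDS):
        return "frontier"   # e.g. Opus via the hosted API
    return "local-qwen"     # e.g. llama.cpp server on localhost
```

In practice the routing signal would more likely come from the agent's own phase (planning turn vs. edit turn) than from keywords, but the shape of the dispatch is the same.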
Exciting-Camera3226@reddit
Yeah, I ran some benchmarks the other day; definitely usable.
ionizing@reddit
I'm just getting to testing, but it's looking promising. I'm used to using the MoE 35B and 122B. First impressions: the 27B seems to understand the system prompts better. It's using the parallel tool execution system instead of sending out one tool call at a time. The MoEs tend to send single tool calls and use parallel calls much less often, for some reason. The 27B thinks for a bit longer, but will then call several at once (which my backend executes and then groups together back to the model in the tool call response). I'll have a look at that part of the system prompt and think about how I can simplify it for the MoEs. Anyhow, I just thought that behavior was interesting. So far it looks like a solid performer on the basics. Looking forward to putting it to real work. Here is a screenshot; you can see the parallel tool calls going out in groups at certain timestamps. If this were one of the MoEs, each tool call would typically have its own timestamp.
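The grouped execution described here (run a batch of tool calls concurrently, then hand all results back to the model as one combined response) can be sketched like this. The tool names and backend shape are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool_batch(tools, calls):
    """tools: {name: fn}; calls: [(name, arg), ...] from one model turn.
    Executes all calls concurrently, returns results in call order so the
    model sees one grouped reply instead of one reply per call."""
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(tools[name], arg) for name, arg in calls]
        return [f.result() for f in futures]

# Stub tools standing in for real file/search tools.
tools = {"read_file": lambda p: f"contents of {p}",
         "grep": lambda pat: f"matches for {pat}"}
print(run_tool_batch(tools, [("read_file", "a.py"), ("grep", "TODO")]))
# → ['contents of a.py', 'matches for TODO']
```

This is why parallel calls show up as one shared timestamp in the screenshot: the whole batch starts together and returns together.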
oxygen_addiction@reddit
They fixed a bug in llama.cpp today regarding parallel tool execution. Maybe that is related?
bytefactory@reddit
No, I observed the same thing on 3.6 35-3 vs 27B, and I'm a week behind on llama.cpp
ionizing@reddit
Ohhh... I did just refresh to the latest build before this screenshot. I'll run the MoEs for a bit and see if they get better at parallel. Thx for the heads up.