Tried Qwen3.6-27B-UD-Q6_K_XL.gguf with CloudeCode, well I can't believe but it is usable

Posted by Clasyc@reddit | LocalLLaMA | View on Reddit | 19 comments

So I tried to run Qwen3-27B-UD-Q6_K_XL.gguf with 200K context on my RTX 5090 using llama.cpp. I'm getting around 50 tok/s, which is fine I guess, I don't really know this stuff so it might be improvable. But what I want to say is, I haven't tried local models for coding for quite a long time, and hell, I can't believe we're at the point where it's actually usable? Of course not the same first class experience as Opus 4.7, but damn, we are getting closer and closer.

Tried quite a difficult task, not casual CRUD stuff, to see if it can even try to prepare a plan that is somewhat making sense, and it did very well on the first try.

Of course that's just a general first impression and I haven't done real day to day coding with it, but at least I like what I see and it looks much more promising than my earlier experience with other models, which could start doing total nonsense at some points.

[-]

shifty21@reddit

What does your llama-server config look like?

[-]

Clasyc@reddit (OP)

There it is:

./build/bin/llama-server \
    -m ~/models/qwen3.6-27b/Qwen3.6-27B-UD-Q6_K_XL.gguf \
    --alias "qwen3.6-27b" \
    --host 0.0.0.0 \
    --port 8080 \
    -c 196608 \
    -fa on \
    --kv-unified \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --batch-size 4096 \
    --ubatch-size 1024 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.00 \
    --top-k 20 \
    --repeat-penalty 1.0 \
    --jinja

Also, I built llama, like so:

cmake . -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_CUDA_FORCE_CUBLAS=OFF

[-]

relmny@reddit

Instead of using --cache-type-k q4_0 --cache-type-v q4_0

wouldn't it be better to use TheTom Turboquant fork with?:
-cache-type-k turbo3 --cache-type-v turbo3

(or even --cache-type-k q8_0 if you want to further reduce K )

[-]

Clasyc@reddit (OP)

haven't tried. I originally wanted to use Turboquant, but as everyone told me it isn't supported, I didn't even research the possible options. Will try this fork, thanks.

[-]

shifty21@reddit

Thank you for this!

Looking at unsloth's docs to run Qwen2.6-27B, they note different parameters for temp, top-p, min-p, top-k for coding tasks compared to yours which looks more like general tasks. https://unsloth.ai/docs/models/qwen3.6

Coding tasks:

    --temp 0.6 \
    --min-p 0.0 \
    --top-p 0.95 \
    --top-k 20 \
    --presence_penalty 0.0 \
    --repeat-penalty 1.0 \

Curious of using those values makes a difference in quality output for coding.

[-]

Clasyc@reddit (OP)

thanks, that was quick "dumb" testing from my side, will try more variations this week.

[-]

ibbobud@reddit

Yep let’s see it….

[-]

AnihcamE@reddit

Is there any cons to using Mistral Vibe instead of Claude Code ?

[-]

SharinganSiyam@reddit

But isn't it too damn slow in claude code?

[-]

Clasyc@reddit (OP)

Yes it feels slower compared with Anthropic native API models, but for my use cases it is still usable as I like to re-read and fully understand everything myself. I have high hopes that in the future we would get even more optimizations, so model speed might increase in general.

[-]

SharinganSiyam@reddit

Same here. I tried following the unsloth's guide to fix 90% slower inference in claude code How to Run Local LLMs with Claude Code | Unsloth Documentation but it still doesn't work. Also I am using ud-q5kxl with q8_0 kv cache quantization in my rtx 5090. Should I switch to ud-q6kxl with q4_0? Do you see any performance degradation?

[-]

Clasyc@reddit (OP)

Sincerely can't tell about performance degradation yet, as I haven't done long coding sessions. But I'm planing to do proper comparison this week with different parameters to see the impact.

[-]

qubridInc@reddit

We’ve crossed the “actually usable” line, not Opus-level yet, but good enough to seriously get work done locally.

[-]

fredandlunchbox@reddit

Benchmarking at 4.5 levels, which is very usable.

I think the trick might be a combination of both: Use frontier models for planning or really hard bugs, but qwen for execution.

I think it could probably be done with skills, one way or the other. Either a skill in claude to execute the plan which spins off a qwen agent or a skill in qwen that uses claude to plan.

[-]

Exciting-Camera3226@reddit

Yeah, I ran some benchmark the other day, significantly usable

[-]

ionizing@reddit

I'm just getting to testing. but it is looking promising. I am used to using the moe 35B and 122B. First impressions: 27B understands the system prompts better it seems. It is using the parallel tool execution system instead of sending out one tool call at a time. The moe tend to send single tool calls, and use parallel calls much much less often for some reason. The 27B thinks for a bit longer, but will then call several at once (which my backend executes then groups together back to the model in the tool call response). I will have a look at that part of the system prompt and think about how I can simplify it for the moe. Anyhow I just thought that behavior is interesting. So far it looks like a solid performer on the basics. Looking forward to putting it to real work. here is screenshot, you can see the parallel tool calls going out in groups at certain timestamps. If this were one of the moe each tool call would have its own timestamp typically.

[-]