Qwen3.6 27B and llama.cpp appreciation post

Posted by ABLPHA@reddit | LocalLLaMA | 55 comments

To preface, here's my config:

llama-server \
   --host 0.0.0.0 \
   --port 1235 \
   --models-preset %h/Software/models.ini \
   --models-max 1 \
   --sleep-idle-seconds 3600 \
   --timeout 3600 \
   --parallel 1 \
   --device ROCm0,ROCm1

[*]
flash-attn = on
jinja = true
fit = true
ctxcp = 5
offline = true
mmproj-offload = false
mmap = false



; ... many other models here ...



[tp-go-brrr-WORK-CODE]
hf = unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q5_K_XL

ctx-size = 131072
temp = 0.6
top-p = 0.95
top-k = 20
presence-penalty = 0.0
min-p = 0.00

fitt = 1024,1024,0

spec-type = draft-mtp
spec-draft-n-max = 2

chat-template-kwargs = {"preserve_thinking": true}

sm = tensor

And it's been a blast with a minimal Pi config.

I've been running it on two RX 9070 XTs (PCIe 5.0 x8/x8), both power-limited to ~235W, and using it for actual work. The quant is a bit lower than I'd like, but I feel the speed, smarts and steerability I get are the best my current setup can offer for my use cases.

I've been doing a long debugging session where I needed the model to analyze interactions between a couple of backend services deployed on 3 separate instances with different configs and avoid a networking complication while doing so.

And yet, despite some roughness showing up at 5 bit, it did all I asked it to without much issue. Given enough control over the situation, its agentic capabilities are crazy. It successfully pinpointed many vague issues down to specific lines of code by adding logging, spinning up services locally, running requests (both local and to remote instances), iterating, and mocking non-important parts so that the actually important code stays untouched for reproducibility, all while maintaining insane responsiveness and speed for a dense model. Some examples:

prompt eval time =     845.93 ms /   337 tokens (    2.51 ms per token,   398.38 tokens per second)
eval time =    5863.80 ms /   275 tokens (   21.32 ms per token,    46.90 tokens per second)
total time =    6709.73 ms /   612 tokens
draft acceptance rate = 0.83981 (  173 accepted /   206 generated)

prompt eval time =    1429.61 ms /   618 tokens (    2.31 ms per token,   432.29 tokens per second)
eval time =    3862.16 ms /   175 tokens (   22.07 ms per token,    45.31 tokens per second)
total time =    5291.77 ms /   793 tokens
draft acceptance rate = 0.80597 (  108 accepted /   134 generated)

prompt eval time =    1275.30 ms /   543 tokens (    2.35 ms per token,   425.78 tokens per second)
eval time =    3287.57 ms /   151 tokens (   21.77 ms per token,    45.93 tokens per second)
total time =    4562.87 ms /   694 tokens
draft acceptance rate = 0.82456 (   94 accepted /   114 generated)

prompt eval time =     318.94 ms /    45 tokens (    7.09 ms per token,   141.09 tokens per second)
eval time =   15105.91 ms /   784 tokens (   19.27 ms per token,    51.90 tokens per second)
total time =   15424.84 ms /   829 tokens
draft acceptance rate = 0.98859 (  520 accepted /   526 generated)

prompt eval time =    2151.53 ms /   960 tokens (    2.24 ms per token,   446.19 tokens per second)
eval time =    2084.82 ms /   104 tokens (   20.05 ms per token,    49.88 tokens per second)
total time =    4236.35 ms /  1064 tokens
draft acceptance rate = 0.94444 (   68 accepted /    72 generated)
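For anyone who wants to compare runs like the ones above without eyeballing them, here is a minimal Python sketch that pulls the throughput and draft acceptance figures out of such a log. The regexes are written against the line shapes shown in this post only; the function name `summarize` is made up for illustration, and other llama.cpp builds may format these lines differently.

```python
import re

# Matches timing lines such as:
#   "eval time =    5863.80 ms /   275 tokens ( ... )"
TIMING_RE = re.compile(r"^(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens")
# Matches the counts at the end of the "draft acceptance rate" line.
ACCEPT_RE = re.compile(r"\(\s*(\d+) accepted /\s*(\d+) generated\)")


def summarize(log: str) -> dict:
    """Return prompt/generation tokens per second and draft acceptance."""
    out = {}
    for raw in log.splitlines():
        line = raw.strip()
        timing = TIMING_RE.match(line)
        if timing:
            kind = timing.group(1)
            seconds = float(timing.group(2)) / 1000.0
            tokens = int(timing.group(3))
            key = "prompt_tps" if kind == "prompt eval" else "gen_tps"
            out[key] = tokens / seconds
        accepted = ACCEPT_RE.search(line)
        if accepted:
            out["acceptance"] = int(accepted.group(1)) / int(accepted.group(2))
    return out


# First run from the post, copied verbatim.
SAMPLE = """\
prompt eval time =     845.93 ms /   337 tokens (    2.51 ms per token,   398.38 tokens per second)
eval time =    5863.80 ms /   275 tokens (   21.32 ms per token,    46.90 tokens per second)
total time =    6709.73 ms /   612 tokens
draft acceptance rate = 0.83981 (  173 accepted /   206 generated)
"""

stats = summarize(SAMPLE)
print(f"prompt: {stats['prompt_tps']:.2f} t/s")
print(f"gen:    {stats['gen_tps']:.2f} t/s")
print(f"accept: {stats['acceptance']:.5f}")
```

Recomputing from the raw milliseconds and token counts, rather than trusting the pre-formatted numbers, also makes it easy to average several runs.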

What's especially important to me is privacy here. I can safely navigate private environments with it without worrying that I'm leaking something to Gemini or alike.

It might not be perfect, but thanks to the high speeds, it's very easy to guide the model in the right direction if it ever starts drifting away.

Can't wait to get my hands on an R9700, or even a couple of them. A higher quant and bigger context are both gonna make it even more usable. Just need to get a new UPS first because my current one already tripped once due to tensor parallelism while I was away, hence the power limits 😅

[-]

CodeDominator@reddit

What I have sadly realized after testing it with my 24GB VRAM is that for Qwen 3.6 27B to work efficiently the bar for VRAM is 32GB.

[-]

LightBroom@reddit

It depends on the quant.

UD-IQ4_XS-MTP and 128k context fits just fine in 24GB
Q5_K_S-MTP fits with about 100k context.

Q8_0 for K, Q4_0 for V.
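To see why the cache types matter so much for fitting long context, here is a rough Python sketch of KV cache sizing. The bytes-per-element values follow from the block layouts (32 values plus a 2-byte scale for `q8_0` and `q4_0`). The layer count, KV head count, and head dimension below are placeholder values chosen for round numbers; they are NOT the real Qwen3.6 27B shape, so only the ratio between the two results is meaningful.

```python
# Bytes per cached element for a few llama.cpp cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + 2-byte scale per block
}


def kv_cache_gib(n_ctx, n_kv_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate KV cache size in GiB for one sequence of n_ctx tokens."""
    elements_per_token = n_kv_layers * n_kv_heads * head_dim
    bytes_per_token = elements_per_token * (
        BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    )
    return n_ctx * bytes_per_token / 2**30


# Hypothetical model shape, for illustration only.
SHAPE = dict(n_kv_layers=16, n_kv_heads=4, head_dim=256)

full = kv_cache_gib(131072, type_k="f16", type_v="f16", **SHAPE)
mixed = kv_cache_gib(131072, type_k="q8_0", type_v="q4_0", **SHAPE)

print(f"f16/f16:   {full:.2f} GiB")
print(f"q8_0/q4_0: {mixed:.2f} GiB ({mixed / full:.0%} of f16)")
```

Whatever the real shape is, Q8_0 for K plus Q4_0 for V comes to about 41% of an f16 cache, which is where the extra context headroom on 24GB comes from.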

[-]

unjustifiably_angry@reddit

Are you using a fork that allows mixed KV cache quantization types? When I use different types for K and V it falls back to CPU.

[-]

hopbel@reddit

It's in vanilla llama.cpp. You might need to update

[-]

CodeDominator@reddit

KV cache quantization crashed my performance to sub 3 t/s generation, so I won't be trying that again. MTP at least on my setup works even slower than non-MTP, pretty disappointing so far. Also everybody keeps repeating the same mantra all over again - Q6 and up for coding and there's no way in hell you can do Q6 with a meaningful amount of context with 24GB VRAM.

[-]

hopbel@reddit

I don't think MTP works well if you have to offload. Maybe that's what you're seeing?

[-]

unjustifiably_angry@reddit

When I use mixed quantization on the KV cache it falls back to CPU. Sounds like you're having the same issue. Dunno how these other guys are getting it to work.

[-]

LightBroom@reddit

That's a bit strange.

I see maybe a 5t/s loss with going to the values I listed. MTP also bumps up the speed from about 30t/s to 70+

This is with llama.cpp compiled from master, usually at least once per day. 7900XTX btw, Vulkan slightly faster than ROCm.
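A jump from about 30 t/s to 70+ t/s is in the range a simple model of speculative decoding would predict. The Python sketch below assumes each drafted token is accepted independently with probability `p` and that drafting costs nothing, both of which are idealizations. The draft depth of 2 is taken from the OP's `spec-draft-n-max = 2`; the depth used for the 7900XTX numbers is not stated in the thread.

```python
def expected_tokens_per_pass(p: float, n_draft: int) -> float:
    """Expected tokens produced per verification pass of the target model.

    Every pass yields one token from the target model itself. The k-th
    drafted token additionally counts only if all k drafts up to and
    including it were accepted, which happens with probability p**k.
    """
    return sum(p**k for k in range(n_draft + 1))


ideal = expected_tokens_per_pass(0.85, 2)
observed = 70 / 30

print(f"idealized speedup at p=0.85, depth 2: {ideal:.2f}x")
print(f"observed speedup (30 -> 70 t/s):      {observed:.2f}x")
```

The gap between the idealized figure and the observed one is what you would expect once drafting and verification overhead are counted.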

[-]

taking_bullet@reddit

It's about time to switch to Vulkan.

[-]

kant12@reddit

Not really. Rocm 7.13 is significantly better

[-]

hopbel@reddit

I'll take a simple Vulkan build over rocm's clusterfuck, speedup be damned

[-]

soyalemujica@reddit

You're wrong, Vulkan is better by about 30% in prompt processing and token generation.

[-]

kant12@reddit

maybe 6 months ago

[-]

LightBroom@reddit

Not really. As of today in llama.cpp, Vulkan seems to be faster and slightly better with VRAM.

[-]

am17an@reddit

What do you use for managing your llama-server? Does it pick up the models automatically now?

[-]

ABLPHA@reddit (OP)

I'm simply running llama-server as a systemd service in router mode.

As far as I know, Pi doesn't automatically pick up models, I just maintain its own `models.json` as per the docs

[-]

am17an@reddit

I created an extension called pi-llama-server which does this automatically. I do think it should just work though

[-]

hopbel@reddit

Neat. Could it potentially adjust Pi's context size to match the selected model? I don't know if llama-server makes that information available

[-]

ggerganov@reddit

I would highly recommend trying to add `--spec-default` to your existing config. Currently it enables the `ngram-mod` speculative type (in addition to MTP) with reasonable parameters. In my workflows, adding this option makes file edits instantaneous.

At the moment this spec type is still optional, but I think in the future this will become the default. If you notice any issues, please report back. Thanks.

[-]

ABLPHA@reddit (OP)

Thank you! Did a basic test with initial prompt "Create a simple diffchecker in a single HTML file directly in your response" and a followup "Add light mode to it" with seed 42 and found that the base tg t/s dropped from 53.5 down to 34, with followup being the same I assume thanks to ngram-mod:

no-spec-default initial:
prompt eval time =     812.42 ms /   519 tokens (    1.57 ms per token,   638.83 tokens per second)
eval time =   70057.23 ms /  3749 tokens (   18.69 ms per token,    53.51 tokens per second)
total time =   70869.64 ms /  4268 tokens
draft acceptance rate = 0.89784 ( 2408 accepted /  2682 generated)
statistics draft-mtp: #calls(b,g,a) = 11 4434 4434, #gen drafts = 4434, #acc drafts = 4094, #gen tokens = 8868, #acc tokens = 7737, dur(b,g,a) = 0.008, 19426.017, 2.185 ms

no-spec-default followup:
prompt eval time =    4840.54 ms /  3648 tokens (    1.33 ms per token,   753.63 tokens per second)
eval time =   99271.16 ms /  5410 tokens (   18.35 ms per token,    54.50 tokens per second)
total time =  104111.70 ms /  9058 tokens
draft acceptance rate = 0.94498 ( 3538 accepted /  3744 generated)
statistics draft-mtp: #calls(b,g,a) = 12 6306 6306, #gen drafts = 6306, #acc drafts = 5897, #gen tokens = 12612, #acc tokens = 11275, dur(b,g,a) = 0.008, 27657.295, 3.099 ms

spec-default initial:
prompt eval time =    5649.62 ms /  4697 tokens (    1.20 ms per token,   831.38 tokens per second)
eval time =  123741.98 ms /  4209 tokens (   29.40 ms per token,    34.01 tokens per second)
total time =  129391.60 ms /  8906 tokens
draft acceptance rate = 0.88133 ( 2785 accepted /  3160 generated)
statistics ngram-mod: #calls(b,g,a) = 1 1425 5, #gen drafts = 5, #acc drafts = 5, #gen tokens = 320, #acc tokens = 32, dur(b,g,a) = 0.256, 1.792, 0.523 ms
statistics draft-mtp: #calls(b,g,a) = 1 1420 1420, #gen drafts = 1420, #acc drafts = 1420, #gen tokens = 2840, #acc tokens = 2753, dur(b,g,a) = 0.000, 6083.560, 0.955 ms

spec-default followup:
prompt eval time =    5830.56 ms /  4116 tokens (    1.42 ms per token,   705.94 tokens per second)
eval time =  102372.59 ms /  5599 tokens (   18.28 ms per token,    54.69 tokens per second)
total time =  108203.16 ms /  9715 tokens
draft acceptance rate = 0.74209 ( 4549 accepted /  6130 generated)
statistics ngram-mod: #calls(b,g,a) = 2 2475 70, #gen drafts = 70, #acc drafts = 70, #gen tokens = 4480, #acc tokens = 2661, dur(b,g,a) = 0.727, 5.880, 2.995 ms
statistics draft-mtp: #calls(b,g,a) = 2 2405 2405, #gen drafts = 2405, #acc drafts = 2405, #gen tokens = 4810, #acc tokens = 4673, dur(b,g,a) = 0.001, 10374.743, 1.685 ms

I figured maybe ngram needs more space, but decreasing context to 32K didn't change the initial prompt speed. Thought order matters too, but placing spec-default = true before or after MTP didn't change the speed either.

I am running llama.cpp-hip b9204-1 though, so perhaps I'm missing some important updates. Would have to wait a few hours until I get home and update the whole system proper.
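The per-type `statistics` lines make it possible to compare how many drafted tokens each speculative type actually gets accepted. Below is a small Python sketch that does this for the "spec-default initial" run; the regex is written against the line format shown in this comment only, and `token_acceptance` is a name made up for illustration. It does not by itself explain the slowdown, it only shows where drafted tokens were thrown away.

```python
import re

# Captures the spec type name plus its generated and accepted token counts.
STATS_RE = re.compile(
    r"statistics (\S+):.*?#gen tokens = (\d+), #acc tokens = (\d+)"
)


def token_acceptance(log: str) -> dict:
    """Map each speculative type to accepted tokens / generated tokens."""
    return {
        name: int(accepted) / int(generated)
        for name, generated, accepted in STATS_RE.findall(log)
    }


# "spec-default initial" statistics lines, copied verbatim.
SAMPLE = """\
statistics ngram-mod: #calls(b,g,a) = 1 1425 5, #gen drafts = 5, #acc drafts = 5, #gen tokens = 320, #acc tokens = 32, dur(b,g,a) = 0.256, 1.792, 0.523 ms
statistics draft-mtp: #calls(b,g,a) = 1 1420 1420, #gen drafts = 1420, #acc drafts = 1420, #gen tokens = 2840, #acc tokens = 2753, dur(b,g,a) = 0.000, 6083.560, 0.955 ms
"""

rates = token_acceptance(SAMPLE)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} of drafted tokens accepted")
```

In this run ngram-mod drafted 320 tokens and had 32 accepted, while draft-mtp had 2753 of 2840 accepted.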

[-]

L0stInHe11@reddit

You saw some performance degradation after enabling ngram-mod, and I noticed the same.

Maybe our workflows are different from Georgi's. So I suggest giving spec-type = ngram-simple,draft-mtp a try.

ngram-simple has the minimum overhead, and it worked very well on my potato laptop (~10% increase of TGS).

[-]

ggerganov@reddit

Yes, you need the latest version.

[-]

unjustifiably_angry@reddit

I saw a warning in a relevant PR suggesting that ngram isn't ready for use yet, has that been fixed?

[-]

ggerganov@reddit

Yes, it was fixed with https://github.com/ggml-org/llama.cpp/pull/23269

[-]

ionizing@reddit

I added `--spec-default` as soon as I saw you talk about it and noticed improvement in tool call speed as well. Thank you for your work.

[-]

Then-Topic8766@reddit

I am a big fan of ngram-mod. Good job.

[-]

Death-_-Row@reddit

Could you test out Vulkan? I have been getting better performance on Vulkan than with ROCM even in prompt processing speeds.

[-]

ABLPHA@reddit (OP)

Interesting! Last time I tested Vulkan, prompt processing was way worse, but that was a while ago. Gonna check later today hopefully. If performance is similar or better, might as well just jump ship because ROCm still can't handle KV cache quantization without a memory leak for some reason, and from time to time drops an ungoogleable crash for seemingly no reason lol.

[-]

LightBroom@reddit

Vulkan is about 10% faster on my 7900XTX cards today. A bit better on VRAM too but only marginally, I can't really give good figures.

Definitely worth a try. If you compile your own llama.cpp you can enable both Vulkan and ROCm and just pick the one you want via --device ROCmX or --device VulkanX.

#!/usr/bin/bash
# Build llama.cpp with both the HIP (ROCm) and Vulkan backends enabled.
# GPU_TARGETS=gfx1100 targets the 7900 XTX; change it for other GPUs.
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
-DGGML_HIP=ON \
-DGPU_TARGETS=gfx1100 \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTING=OFF \
-DGGML_CUDA_FA_ALL_QUANTS=true \
-DGGML_VULKAN=1 \
-DLLAMA_BUILD_UI=1 \
&& cmake --build build --config Release -- -j 16

[-]

DrBearJ3w@reddit

I just tested it yesterday. With all my fixes ROCm outmatches Vulkan. It's not even close.

[-]

soyalemujica@reddit

You don't need the CUDA_FA_ALL_QUANTS when compiling for the 7900XTX

[-]

LightBroom@reddit

If defined it unconditionally enables additional quants for the KV cache, like q4_0, q4_1, q5_0, iq4_nl, etc

I'll probably test at some point to see if all those quants are enabled elsewhere or not, not sure.

[-]

kant12@reddit

A lot of people are just hung up on Vulkan being better 6 months ago. If you're not already doing it try rocm 7.13. Set ROCBLAS_USE_HIPBLASLT=1 at runtime and build with: -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_HIP_MMQ_MFMA=OFF -DGGML_CUDA_FORCE_CUBLAS=ON

[-]

geek_at@reddit

from another thread

[-]

unjustifiably_angry@reddit

u_u

AMD why are you like this

[-]

MrBIMC@reddit

It changes week to week with how fast everything is moving lol.

Yesterday's llama.cpp really buffed up the pp performance on Strix Halo in Vulkan.

On q8 qwen3.6-35b-a3b with full length unquantized kv cache I’m now getting 700tps on sub-10k context.

It still falls off to 100tps by the 200k mark though.

[-]

sylverCode@reddit

Vulkan also prevents OOM from FA spikes for me. Sadly I haven't seen any forks with turbo quants working on Vulkan

[-]

JapanFreak7@reddit

Same here, Vulkan>ROCm, but then again I have an AMD MI50 (2018). Still worth trying.

[-]

jfufufj@reddit

Compared to the Claude models, is its capability closer to Sonnet or Haiku? Or somewhere in between?

[-]

ABLPHA@reddit (OP)

I don't think I can answer this for a couple of reasons, first being that I never used Haiku specifically, second being that I never used any cloud model for the kind of tasks I've been giving to Qwen3.6 27B, those mostly being pair-debugging sessions so far. But its agentic capabilities justify every saved percentage of cloud models quota for me personally.

[-]

jfufufj@reddit

Thanks for the answer!

[-]

Rikers88@reddit

I regularly use Claude for work and Qwen3.6 27B for personal use, and I can tell that Qwen is way better than Haiku. To me it feels like we are at the level of Sonnet 4.5/4.6.

The harness makes a big difference. MCP servers like Perplexica and Context7 boost model intellect by a lot.

Quantization strategy matters for both the model weights and the KV cache. I run UD Q4 K XL with Turboquant4 on K and Turboquant3_tcq on V. Some would call my setup model lobotomization, but in reality it's working quite well for me. If I could afford no quantization I would definitely go for that.

[-]

jfufufj@reddit

Wow, that’s amazing. I have trash hardware so I never truly used any local model. Thanks for sharing.

[-]

Kagemand@reddit

What’s the prompt processing speed on a long context, eg. 50-100k tokens? Thanks!

[-]

ABLPHA@reddit (OP)

Having a ~110k session right now, prompt processing varies widely from ~252 t/s all the way down to ~51 t/s.

I assume this is the price of powerlimiting the GPUs as aggressively as I did. However, looking at another thread linked here in the comments, someone claims that Vulkan is over twice as fast at long context prompt processing. I'd have to test this out later

[-]

Kagemand@reddit

So 7-15 minutes for a 100k prompt? Damn, was hoping it was faster on two GPUs.

[-]

ABLPHA@reddit (OP)

Ah, I think I misunderstood your question. I was talking about prompt processing speeds in an interactive chat with prefix caching at ~110k context.

Processing a 100k prompt from zero would be faster than that because prompt processing starts at around 700-800 and then towards 100k drops to the numbers I initially gave.
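To put rough numbers on that, here is a Python sketch of two estimates for a 100k-token prompt: one at a constant rate, and one where the rate falls linearly with the number of tokens already processed. The linear shape is purely an assumption for illustration, and the 750 and 250 t/s endpoints are picked from the ranges quoted in this thread, not measured as a curve.

```python
import math


def prefill_seconds_constant(n_tokens: int, tps: float) -> float:
    """Time to process n_tokens at a fixed tokens-per-second rate."""
    return n_tokens / tps


def prefill_seconds_linear(n_tokens: int, tps_start: float, tps_end: float) -> float:
    """Time to process n_tokens when speed falls linearly with position.

    Integrates dn / v(n) with v(n) going from tps_start at n=0 to
    tps_end at n=n_tokens, which has the closed form below.
    """
    if tps_start == tps_end:
        return n_tokens / tps_start
    return n_tokens * math.log(tps_start / tps_end) / (tps_start - tps_end)


N = 100_000
worst = prefill_seconds_constant(N, 51)    # slowest rate quoted
flat = prefill_seconds_constant(N, 252)    # fastest long-context rate quoted
decay = prefill_seconds_linear(N, 750, 250)  # assumed linear falloff

print(f"constant  51 t/s: {worst / 60:.1f} min")
print(f"constant 252 t/s: {flat / 60:.1f} min")
print(f"750 -> 250 t/s:   {decay / 60:.1f} min")
```

Under the linear assumption a cold 100k prompt lands well below either constant-rate estimate, because most tokens are processed while the rate is still high.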

[-]

Kagemand@reddit

I was wondering if that prompt processing speed is any different between 1 or 2 GPUs at all, but it is hard to compare when the model doesn’t fit on 1 GPU.

I suppose the closest comparison would be to try and load a lower quant model that fits in 16gb and compare.

But my impression so far is that 2 GPUs don't speed up anything, they only allow you to load a bigger model. Is that correct?

[-]

techlatest_net@reddit

hell yeah, this is exactly the kind of real-world writeup i love to see. dual 9070 xts + rocm running qwen3.6 for actual debugging work? chef's kiss

that draft accept rate hitting ~99% on some prompts is wild and totally get the privacy angle—nothing beats knowing your code/logs aren't leaving your box

powerlimiting to keep the ups happy is a mood. hope the r9700 upgrade treats you well

[-]

sagiroth@reddit

These numbers are amazing if true on 16GB VRAM

[-]

ABLPHA@reddit (OP)

That's not actually 16GB of VRAM! In the post I specifically mentioned that I'm using two RX 9070 XTs, so the total pool is 32GB.

And the model is just unsloth's UD-Q5_K_XL MTP quant.

For fitting it all into VRAM, I used these settings, as shown in my config:

- Reduced parallel slots to just 1 (I don't need parallel requests)
- Reduced the number of checkpoints down to 5 (I don't usually roll back chats, so a few checkpoints is just enough for me)
- Disabled mmproj offload to GPU (in the rare cases where I do need vision it just runs on the CPU instead)
131K context just fits! I'll hopefully see if I can use q8 KV for 262k context later today when I switch to Vulkan

[-]

sagiroth@reddit

Ohhhh my bad for missing that. It all makes sense now

[-]

pmttyji@reddit

Hope you tried latest llama.cpp version. One more MTP related PR got merged 13 hours ago

https://github.com/ggml-org/llama.cpp/pull/23287

[-]

ABLPHA@reddit (OP)

Haven't updated to that yet, hoping to do so later today along with trying out Vulkan, thanks!

[-]

trialbuterror@reddit

Do you use VS Code, and what extension do you use for coding?