Qwen3.6 27B and llama.cpp appreciation post
Posted by ABLPHA@reddit | LocalLLaMA | 55 comments
To preface, here's my config:
llama-server \
--host 0.0.0.0 \
--port 1235 \
--models-preset %h/Software/models.ini \
--models-max 1 \
--sleep-idle-seconds 3600 \
--timeout 3600 \
--parallel 1 \
--device ROCm0,ROCm1
[*]
flash-attn = on
jinja = true
fit = true
ctxcp = 5
offline = true
mmproj-offload = false
mmap = false
; ... many other models here ...
[tp-go-brrr-WORK-CODE]
hf = unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q5_K_XL
ctx-size = 131072
temp = 0.6
top-p = 0.95
top-k = 20
presence-penalty = 0.0
min-p = 0.00
fitt = 1024,1024,0
spec-type = draft-mtp
spec-draft-n-max = 2
chat-template-kwargs = {"preserve_thinking": true}
sm = tensor
And it's been a blast with a minimal Pi config.
I've been running it on two RX 9070 XTs (PCIe 5.0 x8/x8), both power-limited to ~235W, and using it for actual work. The quant is a bit too low for my liking, but I feel the speed, smarts and steerability are the best my current setup can offer for my use cases.
I've been doing a long debugging session where I needed the model to analyze interactions between a couple of backend services deployed on 3 separate instances with different configs and avoid a networking complication while doing so.
And yet, despite some roughness showing up at 5 bit, it did all I asked it to without much issue. Given enough control over the situation, its agentic capabilities are crazy. It successfully pinpointed many vague issues down to specific lines of code by adding logging, spinning up services locally, running requests (both local and to remote instances), iterating, and mocking non-important parts to make sure the actually important code stays untouched for reproducibility, all while maintaining insane responsiveness and speed for a dense model. Some examples:
prompt eval time = 845.93 ms / 337 tokens ( 2.51 ms per token, 398.38 tokens per second)
eval time = 5863.80 ms / 275 tokens ( 21.32 ms per token, 46.90 tokens per second)
total time = 6709.73 ms / 612 tokens
draft acceptance rate = 0.83981 ( 173 accepted / 206 generated)
prompt eval time = 1429.61 ms / 618 tokens ( 2.31 ms per token, 432.29 tokens per second)
eval time = 3862.16 ms / 175 tokens ( 22.07 ms per token, 45.31 tokens per second)
total time = 5291.77 ms / 793 tokens
draft acceptance rate = 0.80597 ( 108 accepted / 134 generated)
prompt eval time = 1275.30 ms / 543 tokens ( 2.35 ms per token, 425.78 tokens per second)
eval time = 3287.57 ms / 151 tokens ( 21.77 ms per token, 45.93 tokens per second)
total time = 4562.87 ms / 694 tokens
draft acceptance rate = 0.82456 ( 94 accepted / 114 generated)
prompt eval time = 318.94 ms / 45 tokens ( 7.09 ms per token, 141.09 tokens per second)
eval time = 15105.91 ms / 784 tokens ( 19.27 ms per token, 51.90 tokens per second)
total time = 15424.84 ms / 829 tokens
draft acceptance rate = 0.98859 ( 520 accepted / 526 generated)
prompt eval time = 2151.53 ms / 960 tokens ( 2.24 ms per token, 446.19 tokens per second)
eval time = 2084.82 ms / 104 tokens ( 20.05 ms per token, 49.88 tokens per second)
total time = 4236.35 ms / 1064 tokens
draft acceptance rate = 0.94444 ( 68 accepted / 72 generated)
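For anyone who wants a single set of numbers out of logs like these, here is a minimal Python sketch that totals them up. The function name `summarize` and the regular expressions are my own assumptions based on the log lines pasted above, not anything shipped with llama.cpp.

```python
import re

# The five request logs from the post, pasted verbatim.
LOG = """\
prompt eval time = 845.93 ms / 337 tokens ( 2.51 ms per token, 398.38 tokens per second)
eval time = 5863.80 ms / 275 tokens ( 21.32 ms per token, 46.90 tokens per second)
total time = 6709.73 ms / 612 tokens
draft acceptance rate = 0.83981 ( 173 accepted / 206 generated)
prompt eval time = 1429.61 ms / 618 tokens ( 2.31 ms per token, 432.29 tokens per second)
eval time = 3862.16 ms / 175 tokens ( 22.07 ms per token, 45.31 tokens per second)
total time = 5291.77 ms / 793 tokens
draft acceptance rate = 0.80597 ( 108 accepted / 134 generated)
prompt eval time = 1275.30 ms / 543 tokens ( 2.35 ms per token, 425.78 tokens per second)
eval time = 3287.57 ms / 151 tokens ( 21.77 ms per token, 45.93 tokens per second)
total time = 4562.87 ms / 694 tokens
draft acceptance rate = 0.82456 ( 94 accepted / 114 generated)
prompt eval time = 318.94 ms / 45 tokens ( 7.09 ms per token, 141.09 tokens per second)
eval time = 15105.91 ms / 784 tokens ( 19.27 ms per token, 51.90 tokens per second)
total time = 15424.84 ms / 829 tokens
draft acceptance rate = 0.98859 ( 520 accepted / 526 generated)
prompt eval time = 2151.53 ms / 960 tokens ( 2.24 ms per token, 446.19 tokens per second)
eval time = 2084.82 ms / 104 tokens ( 20.05 ms per token, 49.88 tokens per second)
total time = 4236.35 ms / 1064 tokens
draft acceptance rate = 0.94444 ( 68 accepted / 72 generated)
"""

# Assumed patterns, written against the lines above only.
TIMING = re.compile(r"^(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")
DRAFT = re.compile(r"\(\s*(\d+) accepted\s*/\s*(\d+) generated\)")


def summarize(log: str) -> dict:
    """Aggregate prompt/generation throughput and draft acceptance over all requests."""
    ms = {"prompt eval": 0.0, "eval": 0.0}
    tokens = {"prompt eval": 0, "eval": 0}
    accepted = generated = 0
    for line in log.splitlines():
        line = line.strip()
        timing = TIMING.match(line)
        if timing:
            ms[timing.group(1)] += float(timing.group(2))
            tokens[timing.group(1)] += int(timing.group(3))
            continue
        draft = DRAFT.search(line)
        if draft:
            accepted += int(draft.group(1))
            generated += int(draft.group(2))
    return {
        "pp_tps": tokens["prompt eval"] / (ms["prompt eval"] / 1000),
        "tg_tps": tokens["eval"] / (ms["eval"] / 1000),
        "accept_rate": accepted / generated,
    }


if __name__ == "__main__":
    for name, value in summarize(LOG).items():
        print(f"{name}: {value:.3f}")
```

Weighted by tokens rather than averaged per request, the five samples come out to roughly 416 t/s prompt processing, 49 t/s generation, and a 91.5% draft acceptance rate (963 of 1052 drafted tokens).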
What's especially important to me is privacy here. I can safely navigate private environments with it without worrying that I'm leaking something to Gemini or the like.
It might not be perfect, but thanks to the high speeds, it's very easy to guide the model in the right direction if it ever starts drifting away.
Can't wait to get my hands on an R9700, or even a couple of them. A higher quant and bigger context are both gonna make it even more usable. Just need to get a new UPS first because my current one already tripped once due to tensor parallelism while I was away, hence the power limits 😅
CodeDominator@reddit
What I have sadly realized after testing it with my 24GB VRAM is that for Qwen 3.6 27B to work efficiently the bar for VRAM is 32GB.
LightBroom@reddit
It depends on the quant.
UD-IQ4_XS-MTP and 128k context fits just fine in 24GB
Q5_K_S-MTP fits with about 100k context.
Both with a quantized KV cache: Q8_0 for K, Q4_0 for V.
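To get a feel for why the K/V cache types matter so much for fitting long context, here is a rough back-of-the-envelope sketch in Python. The bytes-per-element figures follow the ggml block layouts (32 values plus one fp16 scale per block). The layer count, KV head count and head size are placeholders I picked to make the arithmetic easy; they are an assumption and NOT Qwen3.6 27B's real architecture, so plug in the values from your GGUF metadata.

```python
# Storage cost per cached element for a few ggml cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one 2-byte fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values (16 bytes) + one 2-byte fp16 scale per block
}


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, type_k, type_v):
    """Approximate KV cache size in GiB for one sequence at full context."""
    elements = n_ctx * n_attn_layers * n_kv_heads * head_dim  # per K, and again per V
    total_bytes = elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])
    return total_bytes / 2**30


# HYPOTHETICAL shape, chosen for illustration only (not the real model config).
SHAPE = dict(n_ctx=131072, n_attn_layers=16, n_kv_heads=4, head_dim=256)

if __name__ == "__main__":
    print("f16 / f16  :", kv_cache_gib(**SHAPE, type_k="f16", type_v="f16"), "GiB")
    print("q8_0 / q4_0:", kv_cache_gib(**SHAPE, type_k="q8_0", type_v="q4_0"), "GiB")
```

With these placeholder numbers the mixed Q8_0/Q4_0 cache is about 40% of the f16 size. That ratio holds for any model shape, since it depends only on the bytes per element; the absolute sizes do not carry over.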
unjustifiably_angry@reddit
Are you using a fork that allows mixed KV cache quantization types? When I use different types for K and V it falls back to CPU.
hopbel@reddit
It's in vanilla llama.cpp. You might need to update
CodeDominator@reddit
KV cache quantization crashed my performance to sub 3 t/s generation, so I won't be trying that again. MTP at least on my setup works even slower than non-MTP, pretty disappointing so far. Also everybody keeps repeating the same mantra all over again - Q6 and up for coding and there's no way in hell you can do Q6 with a meaningful amount of context with 24GB VRAM.
hopbel@reddit
I don't think MTP works well if you have to offload. Maybe that's what you're seeing?
unjustifiably_angry@reddit
When I use mixed quantization on the KV cache it falls back to CPU. Sounds like you're having the same issue. Dunno how these other guys are getting it to work.
LightBroom@reddit
That's a bit strange.
I see maybe a 5t/s loss with going to the values I listed. MTP also bumps up the speed from about 30t/s to 70+
This is with llama.cpp compiled from master, usually at least once per day. 7900XTX btw, Vulkan slightly faster than ROCm.
taking_bullet@reddit
It's about time to switch to Vulkan.
kant12@reddit
Not really. Rocm 7.13 is significantly better
hopbel@reddit
I'll take a simple Vulkan build over rocm's clusterfuck, speedup be damned
soyalemujica@reddit
You're wrong, Vulkan is better by about 30% in prompt processing and token generation.
kant12@reddit
maybe 6 months ago
LightBroom@reddit
Not really. As of today in llama.cpp, Vulkan seems to be faster and slightly better with VRAM.
am17an@reddit
What do you use for managing your llama-server? Does it pick up the models automatically now?
ABLPHA@reddit (OP)
I'm simply running llama-server as a systemd service in router mode.
As far as I know, Pi doesn't automatically pick up models, I just maintain its own `models.json` as per the docs
am17an@reddit
I created an extension called pi-llama-server which does this automatically. I do think it should just work though
hopbel@reddit
Neat. Could it potentially adjust Pi's context size to match the selected model? I don't know if llama-server makes that information available
ggerganov@reddit
I would highly recommend trying to add `--spec-default` to your existing config. Currently it enables the `ngram-mod` speculative type (in addition to MTP) with reasonable parameters. In my workflows, adding this option makes file edits instantaneous.
At the moment this spec type is still optional, but I think in the future this will become the default. If you notice any issues, please report back. Thanks.
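For readers wondering why n-gram speculation helps so much with file edits: an edit mostly repeats text that is already in the context, so the next tokens can often be guessed by looking them up. Below is a minimal Python sketch of that general idea. It is a generic n-gram lookup, not llama.cpp's `ngram-mod` implementation, and the function names `ngram_draft` and `accepted_prefix` are made up for illustration.

```python
def ngram_draft(tokens, n=3, max_draft=8):
    """Propose draft tokens by copying what followed the most recent earlier
    occurrence of the last `n` tokens. Returns [] when there is no match."""
    if len(tokens) <= n:
        return []
    key = tokens[-n:]
    # Scan backwards over earlier start positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == key:
            return tokens[start + n:start + n + max_draft]
    return []


def accepted_prefix(draft, verified):
    """Count how many leading draft tokens match what the model actually produced."""
    count = 0
    for drafted, actual in zip(draft, verified):
        if drafted != actual:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Toy "tokens": the model is rewriting a line that already exists in context.
    context = ["let", "x", "=", "foo", "(", "a", ")", ";",
               "let", "y", "=", "foo", "("]
    draft = ngram_draft(context, n=2, max_draft=3)
    print(draft)
    print(accepted_prefix(draft, ["a", ")", "+"]))
```

The drafted tokens are only kept as far as the main model agrees with them, so a bad guess costs some wasted verification work but does not change the output.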
ABLPHA@reddit (OP)
Thank you! Did a basic test with initial prompt "Create a simple diffchecker in a single HTML file directly in your response" and a followup "Add light mode to it" with seed 42 and found that the base tg t/s dropped from 53.5 down to 34, with followup being the same I assume thanks to ngram-mod:
I figured maybe ngram needs more space, but decreasing context to 32K didn't change the initial prompt speed. Thought order matters too, but placing `spec-default = true` before or after MTP didn't change the speed either.
I am running llama.cpp-hip b9204-1 though, so perhaps I'm missing some important updates. Would have to wait a few hours until I get home and update the whole system proper.
L0stInHe11@reddit
You saw some performance degradation after enabling `ngram-mod`, as I noticed too. Maybe our workflows are different from Georgi's. So I suggest you give `spec-type = ngram-simple,draft-mtp` a try. `ngram-simple` has the minimum overhead, and it worked very well on my potato laptop (~10% increase in TGS).
ggerganov@reddit
Yes, you need the latest version.
unjustifiably_angry@reddit
I saw a warning in a relevant PR suggesting that ngram isn't ready for use yet, has that been fixed?
ggerganov@reddit
Yes, it was fixed with https://github.com/ggml-org/llama.cpp/pull/23269
ionizing@reddit
I added `--spec-default` as soon as I saw you talk about it and noticed improvement in tool call speed as well. Thank you for your work.
Then-Topic8766@reddit
I am a big fan of ngram-mod. Good job.
Death-_-Row@reddit
Could you test out Vulkan? I have been getting better performance on Vulkan than with ROCM even in prompt processing speeds.
ABLPHA@reddit (OP)
Interesting! Last time I tested Vulkan, prompt processing was way worse, but that was a while ago. Gonna check later today hopefully. If performance is similar or better, might as well just jump ship because ROCm still can't handle KV cache quantization without a memory leak for some reason, and from time to time drops an ungoogleable crash for seemingly no reason lol.
LightBroom@reddit
Vulkan is about 10% faster on my 7900XTX cards today. A bit better on VRAM too but only marginally, I can't really give good figures.
Definitely worth a try. If you compile your own llama.cpp you can enable both Vulkan and ROCm and just use the one you want via --device ROCmX or --device VulkanX.
#!/usr/bin/bash
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
-DGGML_HIP=ON \
-DGPU_TARGETS=gfx1100 \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_TESTING=OFF \
-DGGML_CUDA_FA_ALL_QUANTS=true \
-DGGML_VULKAN=1 \
-DLLAMA_BUILD_UI=1 \
&& cmake --build build --config Release -- -j 16
DrBearJ3w@reddit
I just tested it yesterday. With all my fixes ROCm outmatches Vulkan. It's not even close.
soyalemujica@reddit
You don't need the CUDA_FA_ALL_QUANTS when compiling for the 7900XTX
LightBroom@reddit
If defined it unconditionally enables additional quants for the KV cache, like q4_0, q4_1, q5_0, iq4_nl, etc
I'll probably test at some point to see if all those quants are enabled elsewhere or not, not sure.
kant12@reddit
A lot of people are just hung up on Vulkan being better 6 months ago. If you're not already doing it try rocm 7.13. Set ROCBLAS_USE_HIPBLASLT=1 at runtime and build with: -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=OFF -DGGML_HIP_MMQ_MFMA=OFF -DGGML_CUDA_FORCE_CUBLAS=ON
geek_at@reddit
from another thread
unjustifiably_angry@reddit
u_u
AMD why are you like this
MrBIMC@reddit
It changes week to week with how fast everything is moving lol.
Yesterday's llama.cpp really buffed up the pp performance on Strix Halo in Vulkan.
On q8 qwen3.6-35b-a3b with full length unquantized kv cache I’m now getting 700tps on sub-10k context.
It still falls off to 100tps by the 200k mark though.
sylverCode@reddit
Vulkan also prevents OOM from FA spikes for me. Sadly I haven't seen any forks with turbo quants working on Vulkan
JapanFreak7@reddit
Same here, Vulkan > ROCm, but then again I have an AMD MI50 (2018). Still worth trying.
jfufufj@reddit
Compared to Claude models, is its capability closer to Sonnet or Haiku? Or somewhere in between?
ABLPHA@reddit (OP)
I don't think I can answer this for a couple of reasons: first, I never used Haiku specifically, and second, I never used any cloud model for the kind of tasks I've been giving to Qwen3.6 27B, those mostly being pair-debugging sessions so far. But its agentic capabilities justify every saved percentage of cloud model quota for me personally.
jfufufj@reddit
Thanks for the answer!
Rikers88@reddit
I regularly use Claude for work and Qwen3.6 27B for personal usage, and I can tell that Qwen is way better than Haiku. The way I feel it, we are at the level of Sonnet 4.5/4.6.
The harness makes a big difference. MCP servers like Perplexica and Context7 boost model intellect by a lot.
Quantization strategy matters for both the model weights and the KV cache. I run UD Q4 K XL with Turboquant4 on K and Turboquant3_tcq on V. Some would judge my setup as model lobotomization, but in reality it's working quite well for me. If I could afford no quantization I would definitely go for that.
jfufufj@reddit
Wow, that’s amazing. I have trash hardware so I never truly used any local model. Thanks for sharing.
Kagemand@reddit
What’s the prompt processing speed on a long context, e.g. 50-100k tokens? Thanks!
ABLPHA@reddit (OP)
Having a ~110k session right now, prompt processing varies widely from ~252 t/s all the way down to ~51 t/s.
I assume this is the price of power-limiting the GPUs as aggressively as I did. However, looking at another thread linked here in the comments, someone claims that Vulkan is over twice as fast at long context prompt processing. I'd have to test this out later.
Kagemand@reddit
So 7-15 minutes for a 100k prompt? Damn, was hoping it was faster on two GPUs.
ABLPHA@reddit (OP)
Ah, I think I misunderstood your question. I was talking about prompt processing speeds in an interactive chat with prefix caching at ~110k context.
Processing a 100k prompt from zero would be faster than that because prompt processing starts at around 700-800 t/s and then towards 100k drops to the numbers I initially gave.
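To put rough numbers on that, here is a small Python sketch that estimates cold-start prefill time when throughput decays with depth. The linear decay is purely an assumption made for illustration (the real curve was not measured here), and `prefill_seconds` is a made-up helper name.

```python
def prefill_seconds(n_tokens, start_tps, end_tps, chunk=1024):
    """Estimate the time to process n_tokens from an empty cache, ASSUMING
    throughput falls linearly from start_tps (depth 0) to end_tps (depth n_tokens)."""
    seconds = 0.0
    done = 0
    while done < n_tokens:
        step = min(chunk, n_tokens - done)
        midpoint = done + step / 2  # evaluate the speed in the middle of this chunk
        tps = start_tps + (end_tps - start_tps) * midpoint / n_tokens
        seconds += step / tps
        done += step
    return seconds


if __name__ == "__main__":
    # 750 t/s is the middle of the "700-800" quoted above; 252 and 51 t/s are the
    # best and worst speeds quoted for the ~110k session.
    for end_tps in (252, 51):
        minutes = prefill_seconds(100_000, 750, end_tps) / 60
        print(f"end speed {end_tps} t/s: about {minutes:.1f} min")
```

Under that linear assumption a 100k prompt from scratch lands at roughly 3.5 to 6.5 minutes, depending on which end speed you believe. A flat 51 t/s, by comparison, would take over half an hour.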
Kagemand@reddit
I was wondering if that prompt processing speed is any different between 1 and 2 GPUs at all, but it is hard to compare when the model doesn’t fit on 1 GPU.
I suppose the closest comparison would be to try and load a lower quant model that fits in 16GB and compare.
But my impression so far is that 2 GPUs don’t speed up anything, they only allow you to load a bigger model. Is that correct?
techlatest_net@reddit
hell yeah, this is exactly the kind of real-world writeup i love to see. dual 9070 xts + rocm running qwen3.6 for actual debugging work? chef's kiss
that draft accept rate hitting ~99% on some prompts is wild and totally get the privacy angle—nothing beats knowing your code/logs aren't leaving your box
power-limiting to keep the ups happy is a mood. hope the r9700 upgrade treats you well
sagiroth@reddit
These numbers are amazing if true on 16GB VRAM
ABLPHA@reddit (OP)
That's not actually 16GB of VRAM! In the post I specifically mentioned that I'm using two RX 9070 XTs, so the total pool is 32GB.
And the model is just unsloth's UD-Q5_K_XL MTP quant.
For fitting it all into VRAM, I used these settings, as shown in my config:
sagiroth@reddit
Ohhhh, my bad for missing that. It all makes sense now
pmttyji@reddit
Hope you tried latest llama.cpp version. One more MTP related PR got merged 13 hours ago
https://github.com/ggml-org/llama.cpp/pull/23287
ABLPHA@reddit (OP)
Haven't updated to that yet, hoping to do so later today along with trying out Vulkan, thanks!
trialbuterror@reddit
Do you use VS Code, and what extension for coding?