Qwen3.6 GGUF is so good for debugging.
Posted by _BigBackClock@reddit | LocalLLaMA | View on Reddit | 19 comments
Using an Unsloth dynamic quant on 16 GB VRAM + 32 GB DRAM, with a 200k q8_0 KV cache (context window).
_BigBackClock@reddit (OP)
UPDATE: configured ik_llama with proper CPU offloading and CPU KV cache; now getting 38.98 tok/s (249.71 tok/s in, 18.83 tok/s out).
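For context, a minimal sketch of the kind of launch command this setup implies. The model filename, layer count, and tensor-override pattern are assumptions, not OP's exact flags; `-ctk`/`-ctv` (KV-cache quant type) and `-ot` (tensor override) exist in mainline llama.cpp, and ik_llama's fork inherits equivalents:

```shell
# Hypothetical llama-server invocation; paths and patterns are assumptions.
# -ctk/-ctv q8_0 quantizes the KV cache, -ngl offloads layers to the GPU,
# -ot pins the large MoE expert tensors to system RAM (CPU offloading).
./llama-server \
  -m Qwen-UD-Q4_K_M.gguf \
  -c 200000 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU"
```

The trade-off is the one OP measured: keeping expert tensors in DRAM frees VRAM for the KV cache at the cost of per-token CPU work, which is why a recent CPU matters.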
spvn@reddit
Which quant? Are you using ik_llama with such low VRAM?
sagiroth@reddit
What's the difference versus llama.cpp?
volleyneo@reddit
ik_llama is better for mixed GPU/CPU setups, since 16 GB of VRAM is not enough for that model.
R_Duncan@reddit
The advantage over using iso4 KV cache in llama-cpp-turboquant shrinks if you don't have a recent CPU.
_BigBackClock@reddit (OP)
UD-Q4_K_M, and no, I'm not using ik_llama.
_BigBackClock@reddit (OP)
will try it tonight
DarthLoki79@reddit
What are you using to serve this? llama.cpp?
_BigBackClock@reddit (OP)
yes
metover@reddit
Which OS is this? I like your top status bar.
_BigBackClock@reddit (OP)
Fedora with Hyprland. The top bar is Waybar.
CardinalRedwood@reddit
Context? Tokens per second? Cool!
_BigBackClock@reddit (OP)
200k q8_0 context window; it's pretty slow though, 10-12 tok/s.
9r4n4y@reddit
3.6 is really damn good :)
viperx7@reddit
Just waiting for the 27B
KAPMODA@reddit
Need more pixels please
_BigBackClock@reddit (OP)
Zoom in, it does have all the pixels
SM8085@reddit
Opencode 1.4.10 is out now, by the way. In another thread we worked out that it only auto-updates patch releases, i.e. the third number.
_BigBackClock@reddit (OP)
will update, thanks