[Qwen3.6 35B A3B] Used the top config for my setup (8 GB VRAM, 32 GB RAM) and found that the Q4_K_XL quant from Unsloth somehow runs slightly faster and uses fewer output tokens than Q4_K_M, despite higher memory usage
Posted by EggDroppedSoup@reddit | LocalLLaMA | 12 comments
Config
- CtxSize: 131,072
- GpuLayers: 99
- CpuMoeLayers: 38
- Threads: 16
- BatchSize/UBatchSize: 4096/4096
- CacheType K/V: q8_0
- Tool Context: file mode (tools.kilocode.official.md)
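Roughly what those settings map to as a llama-server launch (a sketch, not the exact command: the model path is a placeholder, and `--n-cpu-moe` is assumed to be the flag behind CpuMoeLayers, so double-check against your build's `--help`):

```python
# Sketch: launch llama-server with the settings from the post.
# Assumes llama-server is on PATH; the model path is a placeholder, and
# --n-cpu-moe is assumed to correspond to "CpuMoeLayers" (verify with --help).
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",  # placeholder path
    "-c", "131072",        # CtxSize
    "-ngl", "99",          # GpuLayers
    "--n-cpu-moe", "38",   # CpuMoeLayers (assumed flag name)
    "-t", "16",            # Threads
    "-b", "4096",          # BatchSize
    "-ub", "4096",         # UBatchSize
    "-ctk", "q8_0",        # CacheType K
    "-ctv", "q8_0",        # CacheType V
]
subprocess.run(cmd, check=True)
```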
| Metric | Q4_K_M | Q4_K_XL | Difference |
|---|---|---|---|
| Avg Tokens/sec | 28.92 | 29.78 | +0.86 (+3.0%) |
| Median Tokens/sec | 30.96 | 32.08 | +1.12 (+3.6%) |
| Avg Wall Seconds | 108.03s | 99.93s | -8.10s (-7.5%) |
| Avg Output Tokens | 3,031.8 | 2,895.8 | -136 (-4.5%) |
| Avg Input Tokens/sec | 50.20 | 55.96 | +5.76 (+11.5%) |
| Avg Decode Tokens/sec | 75.89 | 76.44 | +0.55 (+0.7%) |
The first run is \~33% slower because my benchmark code has a bug that includes the initialization time, and as you know, for an MoE model you have to load it from storage into RAM first. Each test is run 5 times to try to cancel that out, but I still included the first run because that's how I'd realistically use it (turn it on, use it once, turn it off to run something else, etc.).
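Roughly how that timing loop works (a simplified sketch, not the actual benchmark script; the prompt and port are placeholders, and it hits llama-server's OpenAI-compatible endpoint):

```python
# Rough timing loop: 5 runs against a local llama-server, averaging tokens/sec.
# The slower first run is kept in the average, matching the post's methodology.
import time
import requests  # assumes the requests package is installed

URL = "http://localhost:8080/v1/chat/completions"  # default llama-server port
PROMPT = "Refactor this function..."  # placeholder benchmark prompt

rates = []
for i in range(5):
    start = time.time()
    resp = requests.post(URL, json={
        "model": "local",
        "messages": [{"role": "user", "content": PROMPT}],
    })
    resp.raise_for_status()
    wall = time.time() - start
    out_tokens = resp.json()["usage"]["completion_tokens"]
    rates.append(out_tokens / wall)
    print(f"run {i + 1}: {out_tokens} tokens in {wall:.1f}s = {rates[-1]:.2f} t/s")

print(f"avg: {sum(rates) / len(rates):.2f} t/s")
```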
EggDroppedSoup@reddit (OP)
Similar size, small comparison between two popular GGUF providers
Alan_Silva_TI@reddit
Are you using CUDA or Vulkan?
I might try your settings.
I have an RTX 3060 with 12 GB of VRAM, and for some reason CUDA performance is awful for me. I get 20% to 30% more TPS with the latest Vulkan release of llama.cpp.
EggDroppedSoup@reddit (OP)
CUDA w/ latest llama.cpp
Alan_Silva_TI@reddit
thanks.
EggDroppedSoup@reddit (OP)
I'll try out Vulkan again later though. I originally had both, and Vulkan was slightly faster in one aspect but slower in another, but I forgot which, so I'll experiment again.
Uncle___Marty@reddit
8 gigs of VRAM and 48 gigs of RAM here, and when 3.6 27B dropped I tried a Q4 and almost cried when I saw the tok/sec with 100k context. When the 36B A3B came out I figured it would only be slightly faster and didn't try it for a bit. When I did? OMFG. The speed of this thing is insane for our cards. I'm actually looking forward to 3.6 9B, as it might well be the first small model that can do simple coding tasks and stuff.
Happy its running so well for you bud!
PaceZealousideal6091@reddit
Hi! Thanks for sharing this. Can you explain your choice of setting batch and ubatch to 4096?
EggDroppedSoup@reddit (OP)
2048 also works for both. I tested a bunch of different configs like 1024/512, 2048/512, etc. until I landed on this one. I set up a script, added a bunch of configs to test, and it records the t/s and all the other performance metrics for me to view (something like the sketch below). I assume it's because larger batches make the handoff between RAM and GPU more efficient; I'd see roughly a 10 t/s drop when running at low values.
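A sketch of that kind of sweep using llama-bench, which ships with llama.cpp and accepts comma-separated values so it tests every combination in one run (not the actual script; the model path is a placeholder, and the JSON field names should be checked against what your build emits):

```python
# Sketch of a batch/ubatch sweep using llama-bench (bundled with llama.cpp).
# The model path is a placeholder; field names below are read defensively
# since they may differ between llama.cpp versions.
import json
import subprocess

result = subprocess.run(
    [
        "llama-bench",
        "-m", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",  # placeholder path
        "-b", "512,1024,2048,4096",   # batch sizes to sweep
        "-ub", "512,1024,2048,4096",  # ubatch sizes to sweep
        "-o", "json",                 # machine-readable output
    ],
    capture_output=True, text=True, check=True,
)

for row in json.loads(result.stdout):
    print(row.get("n_batch"), row.get("n_ubatch"), row.get("avg_ts"))
```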
Saegifu@reddit
What do you use it for?
EggDroppedSoup@reddit (OP)
A replacement for GitHub Copilot's Raptor mini preview.
Research and tedious tasks like:
- Renaming a sh*t ton of files (refactoring)
- Finding bugs and getting the line numbers they affect
- Quick questions on ps commands and libraries I don't usually use
TangledEarphones@reddit
Thank you for posting. I have a similar setup, and it's refreshing to see someone with this kind of hardware posting their stats (and not the AI-maxxers with multi-GPU setups). This helps me benchmark how my setup is doing, and it seems pretty comparable.
Pristine-Woodpecker@reddit
The XL model has some tensors that aren't quantized, so it will run a bit faster. As for the output tokens: while it's been observed that less-quantized models loop less and thus produce shorter outputs, a difference this small is almost certainly still well within the error margin of 5 samples.
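You can check which tensors differ yourself with the gguf-py reader that ships with llama.cpp; a minimal sketch (the file names are placeholders, not the exact quants from the post):

```python
# Compare per-tensor quantization types between two GGUF files
# using the gguf-py package from llama.cpp (pip install gguf).
from gguf import GGUFReader

def quant_map(path: str) -> dict[str, str]:
    """Map tensor name -> quantization type name for one GGUF file."""
    reader = GGUFReader(path)
    return {t.name: t.tensor_type.name for t in reader.tensors}

m = quant_map("Qwen3.6-35B-A3B-Q4_K_M.gguf")        # placeholder path
xl = quant_map("Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf")   # placeholder path

# Print only the tensors whose quantization differs between the two quants.
for name in sorted(m.keys() & xl.keys()):
    if m[name] != xl[name]:
        print(f"{name}: {m[name]} -> {xl[name]}")
```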