[Qwen3.6 35B A3B] Used the top config for my setup (8 GB VRAM, 32 GB RAM) and found that the Q4_K_XL quant from Unsloth somehow runs slightly faster and uses fewer output tokens than Q4_K_M, despite higher memory usage
Posted by EggDroppedSoup@reddit | LocalLLaMA | 12 comments
Config
- CtxSize: 131,072
- GpuLayers: 99
- CpuMoeLayers: 38
- Threads: 16
- BatchSize/UBatchSize: 4096/4096
- CacheType K/V: q8_0
- Tool Context: file mode (tools.kilocode.official.md)
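Roughly what those settings map to as a llama-server launch (a sketch, not the exact command: the model path is a placeholder, and `--n-cpu-moe` is assumed to be the flag behind CpuMoeLayers, so double-check against your build's `--help`):

```python
# Sketch: launch llama-server with the settings from the post.
# Assumes llama-server is on PATH; the model path is a placeholder, and
# --n-cpu-moe is assumed to correspond to "CpuMoeLayers" (verify with --help).
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",  # placeholder path
    "-c", "131072",        # CtxSize
    "-ngl", "99",          # GpuLayers
    "--n-cpu-moe", "38",   # CpuMoeLayers (assumed flag name)
    "-t", "16",            # Threads
    "-b", "4096",          # BatchSize
    "-ub", "4096",         # UBatchSize
    "-ctk", "q8_0",        # CacheType K
    "-ctv", "q8_0",        # CacheType V
]
subprocess.run(cmd, check=True)
```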
| Metric | Q4_K_M | Q4_K_XL | Difference |
|---|---|---|---|
| Avg Tokens/sec | 28.92 | 29.78 | +0.86 (+3.0%) |
| Median Tokens/sec | 30.96 | 32.08 | +1.12 (+3.6%) |
| Avg Wall Seconds | 108.03s | 99.93s | -8.10s (-7.5%) |
| Avg Output Tokens | 3,031.8 | 2,895.8 | -136 (-4.5%) |
| Avg Input Tokens/sec | 50.20 | 55.96 | +5.76 (+11.5%) |
| Avg Decode Tokens/sec | 75.89 | 76.44 | +0.55 (+0.7%) |
The first run is \~33% slower because my benchmark code has a bug that includes the initialization time, and as you know, for an MoE model you have to load it from storage into RAM first. Each test is run 5 times to try to cancel that out, but I still included the first run because that's how I'd realistically use it (turn it on, use it once, turn it off to run something else, etc.).
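Roughly how that timing loop works (a simplified sketch, not the actual benchmark script; the prompt and port are placeholders, and it hits llama-server's OpenAI-compatible endpoint):

```python
# Rough timing loop: 5 runs against a local llama-server, averaging tokens/sec.
# The slower first run is kept in the average, matching the post's methodology.
import time
import requests  # assumes the requests package is installed

URL = "http://localhost:8080/v1/chat/completions"  # default llama-server port
PROMPT = "Refactor this function..."  # placeholder benchmark prompt

rates = []
for i in range(5):
    start = time.time()
    resp = requests.post(URL, json={
        "model": "local",
        "messages": [{"role": "user", "content": PROMPT}],
    })
    resp.raise_for_status()
    wall = time.time() - start
    out_tokens = resp.json()["usage"]["completion_tokens"]
    rates.append(out_tokens / wall)
    print(f"run {i + 1}: {out_tokens} tokens in {wall:.1f}s = {rates[-1]:.2f} t/s")

print(f"avg: {sum(rates) / len(rates):.2f} t/s")
```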
EggDroppedSoup@reddit (OP)
Similar size, small comparison between two popular GGUF providers
Alan_Silva_TI@reddit
Are you using CUDA or Vulkan?
I might try your settings.
I have an RTX 3060 with 12 GB of VRAM, and for some reason CUDA performance is awful for me. I get 20% to 30% more TPS with the latest Vulkan release of llama.cpp.
EggDroppedSoup@reddit (OP)
CUDA w/ latest llama.cpp
Alan_Silva_TI@reddit
thanks.
EggDroppedSoup@reddit (OP)
I'll try out Vulkan again later though. I originally had both, and Vulkan was slightly faster in one aspect but slower in another, but I forgot which, so I'll experiment again.
Uncle___Marty@reddit
8 gigs of VRAM and 48 gigs of RAM here, and when 3.6 27B dropped I tried a Q4 and almost cried when I saw the tok/sec with 100k context. When the 36B A3B came out I figured it would only be slightly faster and didn't try it for a bit. When I did? OMFG. The speed of this thing is insane for our cards. I'm actually looking forward to 3.6 9B, as it might well be the first small model that can do simple coding tasks and stuff.
Happy its running so well for you bud!
PaceZealousideal6091@reddit
Hi! Thanks for sharing this. Can you explain your choice of setting batch and ubatch to 4096?
EggDroppedSoup@reddit (OP)
2048 also works for both. I tested a bunch of different configs like 1024/512, 2048/512, etc. until I landed on this one. I set up a script, added a bunch of configs to test, and it records the t/s and all the other performance metrics for me to view (something like the sketch below). I assume it's because larger batches make the handoff between RAM and GPU more efficient; I'd see roughly a 10 t/s drop when running at low values.
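A sketch of that kind of sweep using llama-bench, which ships with llama.cpp and accepts comma-separated values so it tests every combination in one run (not the actual script; the model path is a placeholder, and the JSON field names should be checked against what your build emits):

```python
# Sketch of a batch/ubatch sweep using llama-bench (bundled with llama.cpp).
# The model path is a placeholder; field names below are read defensively
# since they may differ between llama.cpp versions.
import json
import subprocess

result = subprocess.run(
    [
        "llama-bench",
        "-m", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf",  # placeholder path
        "-b", "512,1024,2048,4096",   # batch sizes to sweep
        "-ub", "512,1024,2048,4096",  # ubatch sizes to sweep
        "-o", "json",                 # machine-readable output
    ],
    capture_output=True, text=True, check=True,
)

for row in json.loads(result.stdout):
    print(row.get("n_batch"), row.get("n_ubatch"), row.get("avg_ts"))
```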
Saegifu@reddit
What do you use it for?
EggDroppedSoup@reddit (OP)
A replacement for GitHub Copilot's Raptor mini preview.
Research and tedious tasks like:
- Renaming a sh*t ton of files (refactoring)
- Finding bugs and getting the line numbers they affect
- Quick questions on ps commands and libraries I don't usually use
TangledEarphones@reddit
Thank you for posting. I have a similar setup, and it's refreshing to see someone with this kind of hardware posting their stats (and not the AI-maxxers with multi-GPU setups). This helps me benchmark how my setup is doing, and it seems pretty comparable.
Pristine-Woodpecker@reddit
The XL model has some tensors that aren't quantized, so it will run a bit faster. As for the output tokens: while it's been observed that less-quantized models loop less and thus produce shorter outputs, a difference this small is almost certainly still well within the error margin of 5 samples.
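You can check which tensors differ yourself with the gguf-py reader that ships with llama.cpp; a minimal sketch (the file names are placeholders, not the exact quants from the post):

```python
# Compare per-tensor quantization types between two GGUF files
# using the gguf-py package from llama.cpp (pip install gguf).
from gguf import GGUFReader

def quant_map(path: str) -> dict[str, str]:
    """Map tensor name -> quantization type name for one GGUF file."""
    reader = GGUFReader(path)
    return {t.name: t.tensor_type.name for t in reader.tensors}

m = quant_map("Qwen3.6-35B-A3B-Q4_K_M.gguf")        # placeholder path
xl = quant_map("Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf")   # placeholder path

# Print only the tensors whose quantization differs between the two quants.
for name in sorted(m.keys() & xl.keys()):
    if m[name] != xl[name]:
        print(f"{name}: {m[name]} -> {xl[name]}")
```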