Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model
Posted by old-mike@reddit | LocalLLaMA | 18 comments
I'm posting this because it may help others squeeze the most out of the 3060's 12 GB of VRAM.
All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler). Spiritbuun's CUDA optimizations for NVIDIA GPUs — fused MMA fix, TurboQuant, fattn improvements — are what make offloading a 17.3 GB model on a 12 GB card at these speeds possible. Mudler's APEX I-Compact quantization gave me the best perplexity/speed trade-off of any variant I tested.
Hardware:
- GPU: 1× RTX 3060 12GB (110W power limit)
- CPU: Xeon E5-2678 v3
- RAM: 128 GB DDR4-2133
- PCIe 3.0 x16
- Container: Incus (LXC)
Command (optimal for me):
./build/bin/llama-server \
-m /models/mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf \
--no-warmup -c 131072 -np 1 --no-mmap --mlock \
-ctk turbo4 -ctv turbo4 \
--jinja --reasoning-budget 1536 \
--flash-attn on \
--host 0.0.0.0 --port 8000 \
-fitt 1500 \
--mmproj /models/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf
Note on -fitt 1500: the mmproj takes ~900 MB. Without a fitting limit, llama-server tries to load it on the GPU and OOMs. Setting -fitt 1500 leaves enough room for the mmproj. It isn't needed when running without an mmproj.
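A rough back-of-the-envelope version of that note, as a sketch only. It assumes -fitt is the amount of VRAM (in MiB) that the automatic fit leaves free on the GPU, and it treats the ~900 MB mmproj figure as MiB; neither is measured here.

```python
# Rough VRAM budget for the -fitt note. All values are approximate, in MiB.
TOTAL_VRAM_MIB = 12 * 1024  # nominal 12 GB card; usable VRAM is a bit lower
FIT_TARGET_MIB = 1500       # value passed to -fitt in the command above
MMPROJ_MIB = 900            # approximate mmproj size quoted in the note

# Assumption: -fitt is the free-memory margin the automatic fit keeps on the GPU.
budget_for_model_and_kv = TOTAL_VRAM_MIB - FIT_TARGET_MIB
headroom_after_mmproj = FIT_TARGET_MIB - MMPROJ_MIB

print(f"fit budget for weights + KV cache: ~{budget_for_model_and_kv} MiB")
print(f"margin left once the mmproj is loaded: ~{headroom_after_mmproj} MiB")
```

Under those assumptions the margin still has a few hundred MiB left after the mmproj loads, which would explain why 1500 works where the default does not.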
Models tested (72K prompt + 100 gen):
| Model | Prompt (t/s) | Gen (t/s) | Notes |
|---|---|---|---|
| mudler/...APEX-MTP-I-Compact + genesis mmproj, MTP off | 475 | 37.17 | 🏆 |
| mudler/...APEX-MTP-I-Compact, no mmproj, MTP off | 487 | 36.74 | |
| mudler/...APEX-I-Compact, no mmproj | 461 | 34.04 | No MTP heads in VRAM |
| unsloth/...UD-IQ3_S, no mmproj | 488 | 26.21 | |
| unsloth/...UD-IQ4_NL, no mmproj | 462 | 22.65 | |
| mudler/...APEX-MTP-I-Compact, MTP on | 412 | 21.74 | |
Full model names: mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf, mudler/Qwen3.6-35B-A3B-APEX-I-Compact.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ3_S.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
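To put the table in relative terms, here is a small sketch that ranks the generation speeds against the best run. The numbers are copied from the table; the dictionary keys and the helper name `pct_slower` are just illustrative.

```python
# Generation speeds (t/s) copied from the table above (72K prompt + 100 gen).
gen_tps = {
    "APEX-MTP-I-Compact + mmproj, MTP off": 37.17,
    "APEX-MTP-I-Compact, MTP off": 36.74,
    "APEX-I-Compact": 34.04,
    "UD-IQ3_S": 26.21,
    "UD-IQ4_NL": 22.65,
    "APEX-MTP-I-Compact, MTP on": 21.74,
}


def pct_slower(value: float, reference: float) -> float:
    """Percentage by which `value` falls short of `reference`."""
    return (reference - value) / reference * 100.0


best = max(gen_tps.values())
for name, tps in sorted(gen_tps.items(), key=lambda item: -item[1]):
    print(f"{name:40s} {tps:6.2f} t/s  ({pct_slower(tps, best):4.1f}% below best)")

# Same file, no mmproj, MTP on vs. MTP off.
mtp_drop = pct_slower(
    gen_tps["APEX-MTP-I-Compact, MTP on"],
    gen_tps["APEX-MTP-I-Compact, MTP off"],
)
print(f"MTP on vs. off: {mtp_drop:.1f}% slower")
```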
Context degradation (optimal config):
- Fresh: ~45 t/s gen
- @72K filled: 37.17 t/s gen · 475 t/s prompt
- @129K filled: 28.08 t/s gen · 420 t/s prompt
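The slowdown between those points can be worked out directly. This is only arithmetic on the figures above; note that the "fresh" value is itself approximate (~45 t/s), so the first percentage is a rough one.

```python
# Speeds (t/s) from the context degradation figures; "fresh" is approximate.
gen = {"fresh": 45.0, "72K": 37.17, "129K": 28.08}
prompt = {"72K": 475.0, "129K": 420.0}


def pct_drop(before: float, after: float) -> float:
    """Relative slowdown from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


gen_fresh_to_72k = pct_drop(gen["fresh"], gen["72K"])
gen_72k_to_129k = pct_drop(gen["72K"], gen["129K"])
gen_total = pct_drop(gen["fresh"], gen["129K"])
prompt_72k_to_129k = pct_drop(prompt["72K"], prompt["129K"])

print(f"gen, fresh -> 72K:    {gen_fresh_to_72k:.1f}%")
print(f"gen, 72K -> 129K:     {gen_72k_to_129k:.1f}%")
print(f"gen, fresh -> 129K:   {gen_total:.1f}%")
print(f"prompt, 72K -> 129K:  {prompt_72k_to_129k:.1f}%")
```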
llama-perplexity (enwik8 subset, 64K ctx, turbo4, flash-attn):
PPL = 3.2529 +/- 0.01852 across 4 chunks
I think it's pretty good for this model and quantization. I'm happy with it.
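For anyone unsure how to read that number: perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch of the definition follows; the log-probabilities in it are made up for illustration and are not from the run above.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over all tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up per-token probabilities, for illustration only.
example_logprobs = [math.log(p) for p in (0.5, 0.25, 0.25, 0.5)]
print(f"example PPL: {perplexity(example_logprobs):.3f}")

# Read the other way: a PPL of 3.2529 means the geometric-mean probability
# the model assigned to each correct token was 1 / 3.2529.
mean_token_prob = 1.0 / 3.2529
print(f"geometric-mean token probability at PPL 3.2529: {mean_token_prob:.3f}")
```

Keep in mind that PPL values are only comparable across runs that use the same text, context size, and tokenizer.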
Needle-in-a-haystack (manual, web UI): 5 trials with hidden codes (e.g. secret=6301) planted in 150K–200K token texts at varying depths. 100% retrieval: the model found every hidden code on every trial. I used academic texts in Markdown for this.
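The test above was done by hand in the web UI. If you want to reproduce something similar, a sketch like this can build the prompts. The function names, the filler text, and the question wording are hypothetical, and sending the prompt to the server is left out.

```python
def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert `needle` between paragraphs of `filler` at a relative depth.

    depth = 0.0 puts the needle at the start, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    paragraphs = filler.split("\n\n")
    index = round(depth * len(paragraphs))
    return "\n\n".join(paragraphs[:index] + [needle] + paragraphs[index:])


def retrieved(answer: str, code: str) -> bool:
    """A trial passes if the hidden code appears in the model's answer."""
    return code in answer


# Stand-in filler; the post used long academic texts instead.
filler = "\n\n".join(f"Paragraph {i} of filler text." for i in range(100))
needle = "Remember this: secret=6301."

for depth in (0.1, 0.5, 0.9):
    prompt = build_haystack(filler, needle, depth) + "\n\nWhat is the secret code?"
    position = prompt.index(needle) / len(prompt)
    print(f"depth {depth:.1f}: needle starts at {position:.0%} of the prompt")
```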
Key findings:
- Spiritbuun's fork + mudler's models are the key. Without spiritbuun's CUDA work these numbers wouldn't be possible on a 3060 with a 17 GB model, but as the figures show, the mudler model was just as essential.
- MTP hurts on my setup (3060 12GB with heavy offloading): enabling it drops gen by 41% (36.74 → 21.74 t/s). On cards with enough VRAM to fit the whole model, MTP works well; there are posts in this sub about it, and about cards with the same VRAM but more compute power doing well. On a 3060 with offloading, leave it off.
- Mudler's APEX quantizations clearly beat the other options I tried. I tested several APEX I-Compact variants from other users and they topped out at 32-34 t/s; mudler's consistently gave the best numbers. The gap vs bartowski or unsloth is substantial.
- The MTP-I file (with MTP heads included) performs better than the APEX-I even with MTP disabled (36.74 vs 34.04 t/s). I'm not sure why; maybe the extra tensors sitting in VRAM happen to help the memory layout. No good explanation, just empirical.
- Context degradation: ~18% from fresh to 72K, another ~24% from 72K to 129K. Prompt speed also suffers as context grows.
For a single RTX 3060 12GB, spiritbuun's fork + mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf with MTP off is the best combo I've found for long sessions with large context: 37 t/s gen, PPL 3.25, offloading a 17.3 GB model on a 12 GB card. Again, all credit to spiritbuun and mudler.
comanderxv@reddit
You can set --no-mmproj-offload to keep the mmproj in RAM and save some VRAM.
old-mike@reddit (OP)
One curious thing is that putting it in GPU doesn't seem to affect speed. Anyway, I have to test the vision thing again, maybe with bigger images I will get OOM? Let's play!
Training_Visual6159@reddit
12GB 4070 UD-MTP-Q4_K_XL 128K@Q8kv - 850pp/70-80tg easy, no fork. you're doing it wrong.
-fit off --n-gpu-layers 42 --n-cpu-moe 27 --ctx-size 128000 --cache-type-k q8_0 --cache-type-v q8_0 --spec-type draft-mtp --spec-draft-n-max 3 --no-mmproj --no-mmap
pro-tip: connect the display to motherboard/iGPU. saves you 1-2GB of VRAM.
old-mike@reddit (OP)
Hello, I downloaded the same model and compiled the latest llama.cpp main branch: OOM. A smaller model loads, but it OOMs once the context starts to fill. No graphical server is loaded and no display is attached.
Training_Visual6159@reddit
use nvitop to monitor your VRAM usage, and adjust the context/--n-cpu-moe to make it stay under 97%.
you can get about 10K extra context by going with k q8_0 / v q5_1 cache.
old-mike@reddit (OP)
Hi! Because I'm not using -ngl or -ncmoe, llama-server manages the VRAM by itself. I can set the context to 256K and it runs without OOM. Of course the speed degradation is greater once you fill that context (and I don't know what happens to the quality). As for the Linux thing, it's working OK here. Ubuntu Server 24.
zz-kz@reddit
OP said their card is power-limited. Might that be why your tg is so much better?
Training_Visual6159@reddit
with 12gb cards part of inference is offloaded to cpu -> 4070 with 200W TDP only goes to ca. 110W during prefill, with occasional 170W spike once in a while. no need to power-limit it.
also turbo4 is dumb, check https://www.reddit.com/r/LocalLLaMA/comments/1tp9d1w/kv_cache_quant_benchmarks_q5_q6_are_underrated/
old-mike@reddit (OP)
Great resource. Thank you.
comanderxv@reddit
That speed also depends on the graphics card. The OP has a 3060 and I have a 2060, so I'll never reach such speeds.
ps5cfw@reddit
I've never ever seen APEX work well in big context scenarios (> 100K) and/or complex tasks. It just starts to make wrong tool calls, hallucinate, and loop its thinking.
10F1@reddit
Same sadly, I really wanted it to work well.
old-mike@reddit (OP)
Mmmm, I only tested what I wrote: perplexity and NIAH (the perplexity test I ran several times, changing the context size and the options I can change, ctk, ctv and so on). After the NIAH test, with fewer texts loaded, I've been playing with the model: asking questions about the test, asking for summaries, adding things, making it think in general. I got no bad answers and no strange behavior. Maybe it fails in an agentic loop.
Ready_Performance_35@reddit
you're saying apex quant >>> unsloth?
Icy-Degree6161@reddit
I also like the APEX approach more. Not saying it's universally better, but what I use it for, there is a difference.
old-mike@reddit (OP)
in my setup, yes. As I said, I really don't know why.
Humble_Rabbt@reddit
Very nice. MTP only works if the entire model is in VRAM, because for MTP the GPU basically needs access to all 256 experts, not just the usual 8.
Also, you might be able to squeeze out more with -ngl 99 -ncmoe, starting at 30 and then lowering it to find the most experts you can fit on the GPU.
Radiant-Giraffe5159@reddit
I get a 5%-20% speedup with MTP set at 1 or 2 draft tokens with pretty much no expert layers on GPU. I go from around 22-25 tps to 27-31 tps. So overall not super large, but everything helps.