Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

Posted by old-mike@reddit | LocalLLaMA | 18 comments

I'm posting this because it may help others squeeze the most out of the 12 GB of VRAM on a 3060.

All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler). Spiritbuun's CUDA optimizations for NVIDIA GPUs — fused MMA fix, TurboQuant, fattn improvements — are what make offloading a 17.3 GB model on a 12 GB card at these speeds possible. Mudler's APEX I-Compact quantization gave me the best perplexity/speed trade-off of any variant I tested.

Hardware:
- GPU: 1× RTX 3060 12GB (110W power limit)
- CPU: Xeon E5-2678 v3
- RAM: 128 GB DDR4-2133
- PCIe 3.0 x16
- Container: Incus (LXC)

Command (optimal for me):

./build/bin/llama-server \
    -m /models/mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf \
    --no-warmup -c 131072 -np 1 --no-mmap --mlock \
    -ctk turbo4 -ctv turbo4 \
    --jinja --reasoning-budget 1536 \
    --flash-attn on \
    --host 0.0.0.0 --port 8000 \
    -fitt 1500 \
    --mmproj /models/mmproj-Qwen3.6-35B-A3B-Uncensored-Genesis-f16.gguf

Note on -fitt 1500: the mmproj takes ~900 MB. Without a fitting limit, llama-server tries to load it on the GPU and runs out of memory. Setting -fitt 1500 leaves enough room for the mmproj. It isn't needed when running without an mmproj.
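As a rough sanity check on the numbers in this note, the sketch below compares the mmproj size with the margin requested via -fitt. It assumes the value passed to -fitt is a free-VRAM margin in MiB, which is how the note describes its effect. The helper name is made up for illustration.

```python
# Rough headroom check for the -fitt note.
# Assumption: -fitt N asks the loader to keep about N MiB of VRAM free.

MIB = 1024 * 1024

def mmproj_fits(margin_mib: float, mmproj_mb: float) -> bool:
    """Return True if an mmproj of `mmproj_mb` megabytes fits in `margin_mib` MiB."""
    mmproj_mib = mmproj_mb * 1_000_000 / MIB  # convert decimal MB to MiB
    return mmproj_mib <= margin_mib

# Values from the post: -fitt 1500 and an mmproj of roughly 900 MB.
print(mmproj_fits(1500, 900))
```

This only checks the mmproj against the margin. It says nothing about compute buffers or other allocations that may also need that space.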

Models tested (72K prompt + 100 gen):

Model	Prompt (t/s)	Gen (t/s)	Notes
mudler/...APEX-MTP-I-Compact + genesis mmproj, MTP off	475	37.17	🏆
mudler/...APEX-MTP-I-Compact, no mmproj, MTP off	487	36.74
mudler/...APEX-I-Compact, no mmproj	461	34.04	No MTP heads in VRAM
unsloth/...UD-IQ3_S, no mmproj	488	26.21
unsloth/...UD-IQ4_NL, no mmproj	462	22.65
mudler/...APEX-MTP-I-Compact, MTP on	412	21.74

Full model names: mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf, mudler/Qwen3.6-35B-A3B-APEX-I-Compact.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ3_S.gguf, unsloth/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf

Context degradation (optimal config):
- Fresh: ~45 t/s gen
- @72K filled: 37.17 gen · 475 prompt
- @129K filled: 28.08 gen · 420 prompt
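The post doesn't say how these speeds were collected. One way to get comparable figures, sketched here as an assumption, is to read the `timings` object that llama-server returns from its `/completion` endpoint. The port matches the command above. The function names are illustrative, and the field names should be checked against the build in use.

```python
import json
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens/second."""
    return n_tokens / (elapsed_ms / 1000.0)

def summarize_timings(timings: dict) -> dict:
    """Reduce a llama-server `timings` object to prompt and generation speed."""
    return {
        "prompt_tps": tokens_per_second(timings["prompt_n"], timings["prompt_ms"]),
        "gen_tps": tokens_per_second(timings["predicted_n"], timings["predicted_ms"]),
    }

def measure(prompt: str, n_predict: int = 100,
            url: str = "http://localhost:8000/completion") -> dict:
    """Send one completion request and return the measured speeds."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return summarize_timings(json.load(resp)["timings"])
```

Calling `measure(long_prompt)` with prompts of different lengths would give one row per context fill level.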

llama-perplexity (enwik8 subset, 64K ctx, turbo4, flash-attn):

PPL = 3.2529 +/- 0.01852 across 4 chunks
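For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition on a list of per-token log-probabilities. It is not how llama-perplexity is implemented, just the formula behind the number.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy input: a model that gives every token probability 0.5.
print(perplexity([math.log(0.5)] * 4))

# Average nats per token implied by the PPL reported above.
print(math.log(3.2529))
```

Lower is better: a perplexity of 1 would mean the model assigns probability 1 to every token in the text.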

I think it's pretty good for this model and quantization. I'm happy with it.

Needle-in-a-haystack (manual, web UI): 5 trials with hidden codes (e.g. secret=6301) planted in 150K–200K token texts at varying depths. Retrieval was 100%: the model found every hidden code on every trial. I used academic texts in Markdown as the haystack.
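The test above was done by hand in the web UI. If someone wants to script the same idea, a minimal sketch could look like this. The helper names and the paragraph-based placement are assumptions for illustration, not what was used here.

```python
def plant_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` as its own paragraph at roughly `depth` (0.0 to 1.0) of the text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    paragraphs = haystack.split("\n\n")
    index = round(depth * len(paragraphs))
    return "\n\n".join(paragraphs[:index] + [needle] + paragraphs[index:])

def retrieved(response: str, code: str) -> bool:
    """Count a trial as a hit if the model's answer contains the hidden code."""
    return code in response

# Tiny stand-in for a long document; a real run would use a 100K+ token text.
text = "\n\n".join(f"Paragraph {i} of filler text." for i in range(10))
prompt = plant_needle(text, "secret=6301", depth=0.5)
```

The resulting `prompt` plus a question such as "What is the secret code?" would be sent to the model, and `retrieved(answer, "6301")` scores the trial.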

Key findings:

- Spiritbuun's fork + mudler models are the key. Without spiritbuun's CUDA work these numbers wouldn't be possible on a 3060 with a 17 GB model, but as the figures show, the mudler model was just as important.
- MTP hurts on my setup (3060 12GB with heavy offloading): it drops gen by 41% when enabled. On cards with enough VRAM to fit the whole model, MTP works well. There are posts in this sub about it, and about cards with the same VRAM but more compute power doing well. On a 3060 with offloading, leave it off.
- Mudler's APEX quantizations clearly beat the other options. I tried several APEX I-Compact variants from other users and they topped out at 32-34 t/s, while mudler's consistently gives the best numbers. The gap vs bartowski or unsloth is substantial.
- The MTP-I file (with MTP heads included) performs better than the plain APEX-I file even with MTP disabled (36.74 vs 34.04 t/s). My guess, and I'm not sure about it, is that the extra tensors sitting in VRAM somehow improve the memory layout. I have no good explanation; this is purely empirical.
- Context degradation: ~18% from fresh to 72K, another ~24% from 72K to 129K. Prompt speed also suffers as context grows.
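The percentages above can be reproduced from the measured speeds. A short sketch, assuming the MTP comparison uses the 36.74 t/s run (same file, no mmproj, MTP off) as its baseline:

```python
def percent_drop(before: float, after: float) -> float:
    """Relative slowdown from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

# Generation speeds in t/s, taken from the tables in this post.
comparisons = [
    ("fresh -> 72K", 45.0, 37.17),     # "fresh" is only given as ~45 t/s
    ("72K -> 129K", 37.17, 28.08),
    ("MTP off -> MTP on", 36.74, 21.74),  # baseline choice is an assumption
]

for label, before, after in comparisons:
    print(f"{label}: {percent_drop(before, after):.1f}% slower")
```

Since the fresh speed is approximate, the first figure can land slightly off the quoted ~18%.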

For a single RTX 3060 12GB, spiritbuun's fork + mudler/Qwen3.6-35B-A3B-APEX-MTP-I-Compact.gguf with MTP off is the best combo I've found for long sessions with large context: 37 t/s gen, PPL 3.25, offloading a 17.3 GB model on a 12 GB card. Again, all credit to spiritbuun and mudler.

Training_Visual6159@reddit

12GB 4070 UD-MTP-Q4_K_XL 128K@Q8kv - 850pp/70-80tg easy, no fork. you're doing it wrong.

-fit off --n-gpu-layers 42 --n-cpu-moe 27 --ctx-size 128000 --cache-type-k q8_0 --cache-type-v q8_0 --spec-type draft-mtp --spec-draft-n-max 3 --no-mmproj --no-mmap

pro-tip: connect the display to motherboard/iGPU. saves you 1-2GB of VRAM.

old-mike@reddit (OP)

Hello, I downloaded the same model and compiled the latest llama.cpp main branch: OOM. A smaller model loads, but it OOMs once the context starts to fill. No graphical server is running and no display is attached.

Training_Visual6159@reddit

use nvitop to monitor your VRAM usage, and adjust the context/--n-cpu-moe to make it stay under 97%.

you can get about 10K extra context by going with k q8_0 / v q5_1 cache.
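A scripted alternative to watching nvitop, as a sketch: poll `nvidia-smi` and flag when usage crosses the 97% mark mentioned above. The function names are illustrative, and the query assumes `nvidia-smi` is on the PATH.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_usage(csv_text: str) -> list[float]:
    """Return the percentage of VRAM in use for each GPU line."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        usage.append(used / total * 100.0)
    return usage

def over_threshold(csv_text: str, limit_pct: float = 97.0) -> bool:
    """True if any GPU is above `limit_pct` percent VRAM usage."""
    return any(p > limit_pct for p in parse_usage(csv_text))

def read_gpu_usage() -> str:
    """Run nvidia-smi once and return its raw CSV output."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

Running `over_threshold(read_gpu_usage())` in a loop while the context fills shows when to lower the context size or raise --n-cpu-moe.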

old-mike@reddit (OP)

Hi! Because I'm not using ngl or ncmoe, llama-server manages the VRAM itself. I can set the context to 256K and it runs without OOM. Of course, the speed degradation is greater once that context fills up (and I don't know what happens to the quality). As for the Linux thing, it's working fine here on Ubuntu Server 24.

zz-kz@reddit

OP said their card is power-limited. Might that be why your tg is so much better?

Training_Visual6159@reddit

with 12gb cards part of inference is offloaded to cpu, so a 4070 with 200W TDP only reaches ca. 110W during prefill, with an occasional 170W spike. no need to power-limit it.

also turbo4 is dumb, check https://www.reddit.com/r/LocalLLaMA/comments/1tp9d1w/kv_cache_quant_benchmarks_q5_q6_are_underrated/

old-mike@reddit (OP)

Great resource. Thank you.

comanderxv@reddit

That speed also depends on the graphics card. The OP has a 3060; I have a 2060 and will never reach such speeds.