Best settings for gemma-4 on a 3090?
Posted by Deadhookersandblow@reddit | LocalLLaMA | 15 comments
3090 (24G) + 32G DDR4
Currently running
--mmproj mmproj-BF16.gguf \
--chat-template-kwargs '{"enable_thinking":true}' \
--flash-attn on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
-c 160000 \
--jinja
at 26B-A4B-it-UD-Q5_K_XL and I'm generally quite happy with it, but it does OOM and die occasionally (usually when I'm doing something quite convoluted, figuring out a workflow, etc.).
I get around 90-95 tok/s. What can I improve? I'm completely OK with trading speed for quality (by about half, so let's say 40 tok/s would be OK).
Thanks
texasdude11@reddit
Don't quantize the KV cache, it significantly degrades model performance
Due-Function-4877@reddit
With only 24gb of vram, what choice does OP have?
texasdude11@reddit
Instead of the Q5 quant OP is using, trying a Q4 quant with a Q8 KV cache will help too.
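A minimal sketch of that swap against OP's current command, assuming a Q4 GGUF of the same model (the Q4_K_XL filename below is an assumption, not a file named in the thread):
# sketch only: Q4 weights + q8_0 KV cache instead of Q5 weights + q4_0 KV cache
llama-server \
-m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj mmproj-BF16.gguf \
--chat-template-kwargs '{"enable_thinking":true}' \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 -c 160000 --jinja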
Anbeeld@reddit
Q4 cache is bad, but you can't get high context without quantizing it... which is why you download Tom's llama.cpp fork with TurboQuant and use turbo4 or turbo4+3 or even turbo3, which is still not 100% accurate but much better than raw Q4.
erazortt@reddit
You really should not quantize the KV cache with Gemma 4, not even to Q8, let alone Q4! The KLD numbers for that are really bad. There was a post about this here in the last few days.
You can do this with Qwen3.5, though.
Important_Quote_1180@reddit
Here's what we have for both models:
Gemma 4 26B-A4B MoE (Q5_K_M)
The 26B MoE runs as two distinct profiles on the 3090:
Profile 1 — Context King (256K, cpu-moe): 256K native context with all expert FFN weights offloaded to DDR5 via --cpu-moe while attention layers and router stay on GPU. Uses 13.8 GB VRAM with 10.3 GB free at q8_0 KV. Benchmarked at 34.2 t/s generation, 81.3 t/s prefill at 128K. At 128K context it drops to 10.0 GB VRAM (14 GB free) and 33.4 t/s generation. 128 experts total, only 8+1 active per token (~4B active params).
Profile 2 — Batch Workhorse (64K, all-GPU): Drops --cpu-moe, pins all 128 experts on GPU, shrinks context to 64K, parallel 4. Uses 22.9 GB VRAM (1.2 GB free). This is the big one — measured 96% GPU utilization vs only 24% with cpu-moe, and wiki ingest went from ~24 hours to ~3 hours (8x speedup). Generation speed in this config is reported at 90-128 t/s.
Architecture: 25.2B total params, 3.8B active (8 experts + 1 shared), 30 layers (24 sliding window + 6 global), 1024-token sliding window, 256K native context, GQA with 8 KV heads for sliding and 2 for global.
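For reference, a rough sketch of how those two profiles might be launched with llama-server (the model filename and exact flag values are my assumptions, not settings quoted from the benchmarks above):
# Profile 1 sketch: experts kept in system RAM via --cpu-moe, 256K context, q8_0 KV
llama-server -m gemma-4-26B-A4B-it-Q5_K_M.gguf \
--cpu-moe -c 262144 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on -ngl 999
# Profile 2 sketch: all experts on GPU, 64K context, 4 parallel slots
llama-server -m gemma-4-26B-A4B-it-Q5_K_M.gguf \
-c 65536 -np 4 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on -ngl 999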
Gemma 4 19B REAP-Heretic MoE (Q6_K)
This is a Router-weighted Expert Activation Pruned version of the 26B — 25 of 128 experts removed via calibration-weighted scoring, leaving 103 experts with the same 4B active compute. The pruning costs 2-4% accuracy on benchmarks. Also includes heretic ablation (refusal removal) on layers 10-30.
Key advantage: all-GPU fit at 128K. The 16 GB Q6_K quant (higher quality than the 26B's Q5_K_M despite being smaller thanks to expert pruning) fits entirely on the 3090 with lossless q8_0/q8_0 KV at 128K context. Uses 18.3 GB VRAM with 5.8 GB free. Benchmarked at 120+ t/s generation — that's about 3.5x faster than the 31B dense. Max predict set to 30,000 tokens. Uses the same vision mmproj as the 26B Opus distill (dimension-compatible since REAP only prunes FFN experts) — verified at 1.9s/image.
Comparison at a glance:
• 26B MoE with cpu-moe at 256K: 34.2 t/s, 13.8 GB VRAM, experts in RAM
• 26B MoE all-GPU at 64K: 90-128 t/s, 22.9 GB VRAM, 96% GPU utilization
• 19B REAP all-GPU at 128K: 120+ t/s, 18.3 GB VRAM, 5.8 GB free
The REAP-19B sits in the sweet spot — all-GPU speed, higher quant, longer context, more headroom. The 26B with cpu-moe is the long-context workhorse when you need 256K. The 26B all-GPU batch profile is the throughput king for small-prompt heavy workloads where you can afford the 64K context limit.
BigYoSpeck@reddit
I'd hazard a guess you're getting out-of-memory errors because Gemma 4 absolutely devours RAM for context checkpoints. With the default of 32 it will cripple even 64 GB of RAM.
Add -ctxcp 4 to start with and see if that stops the OOM, then increase the number of checkpoints to a level your system has capacity for.
Powerful_Evening5495@reddit
Use opencode and use the model's default context size
thirteen-bit@reddit
If you do not use it for image captioning workflows (every request contains images) and only need image input sometimes, move mmproj to RAM:
--no-mmproj-offload
Set a higher fit target (--fit-target 1536 or --fit-target 2048, default is 1024) to leave more VRAM free.
Maybe look into --fit-ctx 160000 instead of --ctx-size 160000?
Docs here: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
For coding workflows:
Add --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
Docs here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md#n-gram-mod-ngram-mod
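Folded into OP's current command, those suggestions would look roughly like this (model filename assumed from OP's post; the fit and speculative flag values are copied from above and untested here):
# sketch only: OP's flags plus --no-mmproj-offload, fit targets, and n-gram speculation
llama-server \
-m gemma-4-26B-A4B-it-UD-Q5_K_XL.gguf \
--mmproj mmproj-BF16.gguf --no-mmproj-offload \
--chat-template-kwargs '{"enable_thinking":true}' \
--flash-attn on \
--cache-type-k q4_0 --cache-type-v q4_0 \
--fit-ctx 160000 --fit-target 1536 \
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 \
-np 1 --jinja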
I'm using Q8_0 for better but slower results at 128K context, and IQ4_XS for a fast variant that fits fully in VRAM.
Q8_0 command line
Prompt to rewrite the bash script adding some new functions:
thirteen-bit@reddit
And IQ4_XS at 256K context fitting into 20GB VRAM:
caetydid@reddit
I get ~85 t/s with this setup on the first prompt. Rare crashes, hence the loop.
cat ~/gemma4-llmserver.sh
#!/bin/bash
while true; do
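# relaunch llama-server whenever it exits, to recover from the rare crashes mentioned above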
~holu/llama.cpp/llama.cpp-b8779/build/bin/llama-server \
--slots -np 1 \
-m ~holu/llama.cpp/models/gemma/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
--mmproj ~/llama.cpp/models/gemma/mmproj-F16.gguf \
--host 0.0.0.0 \
--port 8888 \
--ctx-size 262000 \
-ngl 9999 \
--temp 0.3 \
--reasoning auto \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads $(nproc) \
--batch-size 64 \
--repeat-penalty 1.1 \
--top-p 0.95 \
--flash-attn on
sleep 5
done;
tmvr@reddit
Setting the KV cache to q4_0 apparently kills that model; try to stay at q8_0 there:
https://www.reddit.com/r/LocalLLaMA/comments/1suh3sz/gemma_4_and_qwen_36_with_q8_0_and_q4_0_kv_cache/
Sadman782@reddit
In real-world usage, I don't see much degradation (it's far from being killed) after attn rot was introduced for iSWA. Maybe the models recover somehow through reasoning? Also, the PPL difference isn't as large as the KLD would suggest:
https://github.com/ggml-org/llama.cpp/pull/21513
BitGreen1270@reddit
Your context seems quite high; I get about 130 t/s on the same model with just --fit on and -c 65536.
But I'm running it on a rental on vast.ai.
Your cache type also seems low. Any reason you're using q4 instead of q8, or nothing at all? My understanding is that quantizing the KV cache can lead to model confusion.
Also, not exactly what you asked for, but I built a one-page hangman HTML game using opencode with both gemma4-26B and qwen3.5-35B, and for some reason I felt Qwen worked better. Gemma also got there eventually but needed more handholding and refactoring.
Deadhookersandblow@reddit (OP)
I occasionally do need high context, and a high context wasn't fitting with the q8 cache. I'll try reducing the context and not quantizing the cache, I suppose.