Best settings for gemma-4 on a 3090?
Posted by Deadhookersandblow@reddit | LocalLLaMA | 15 comments
3090 (24G) + 32G DDR4
Currently running
--mmproj mmproj-BF16.gguf \
--chat-template-kwargs '{"enable_thinking":true}' \
--flash-attn on \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-np 1 \
-c 160000 \
--jinja
at 26B-A4B-it-UD-Q5_K_XL and I'm generally quite happy with it, but it does OOM and die occasionally (usually when I'm doing something quite convoluted, figuring out a workflow, etc.).
I get around 90-95 tok/s. What can I improve? I'm completely OK with trading speed for quality (by about half, so let's say 40 tok/s would be OK).
Thanks
texasdude11@reddit
Don't quantize the KV cache, it significantly degrades model performance
Due-Function-4877@reddit
With only 24gb of vram, what choice does OP have?
texasdude11@reddit
Instead of the Q5 quant OP is using, trying a Q4 quant with a Q8 KV cache will help too.
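A minimal sketch of that swap against OP's current command, assuming a Q4 GGUF of the same model (the Q4_K_XL filename below is an assumption, not a file named in the thread):
# sketch only: Q4 weights + q8_0 KV cache instead of Q5 weights + q4_0 KV cache
llama-server \
-m gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--mmproj mmproj-BF16.gguf \
--chat-template-kwargs '{"enable_thinking":true}' \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 -c 160000 --jinja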
Anbeeld@reddit
Q4 cache is bad, but you can't get high context without quantizing it... which is why you download Tom's llama.cpp fork with TurboQuant and use turbo4 or turbo4+3 or even turbo3, which is still not 100% accurate but much better than raw Q4.
erazortt@reddit
You really should not quantize the KV cache with Gemma 4, not even to Q8, let alone Q4! The KLD numbers for that are really bad. There was a post about this here in the last few days.
You can do this with Qwen3.5, though.
Important_Quote_1180@reddit
Here's what we have for both models:
Gemma 4 26B-A4B MoE (Q5_K_M)
The 26B MoE runs as two distinct profiles on the 3090:
Profile 1 — Context King (256K, cpu-moe): 256K native context with all expert FFN weights offloaded to DDR5 via --cpu-moe while attention layers and router stay on GPU. Uses 13.8 GB VRAM with 10.3 GB free at q8_0 KV. Benchmarked at 34.2 t/s generation, 81.3 t/s prefill at 128K. At 128K context it drops to 10.0 GB VRAM (14 GB free) and 33.4 t/s generation. 128 experts total, only 8+1 active per token (~4B active params).
Profile 2 — Batch Workhorse (64K, all-GPU): Drops --cpu-moe, pins all 128 experts on GPU, shrinks context to 64K, parallel 4. Uses 22.9 GB VRAM (1.2 GB free). This is the big one — measured 96% GPU utilization vs only 24% with cpu-moe, and wiki ingest went from ~24 hours to ~3 hours (8x speedup). Generation speed in this config is reported at 90-128 t/s.
Architecture: 25.2B total params, 3.8B active (8 experts + 1 shared), 30 layers (24 sliding window + 6 global), 1024-token sliding window, 256K native context, GQA with 8 KV heads for sliding and 2 for global.
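For reference, a rough sketch of how those two profiles might be launched with llama-server (the model filename and exact flag values are my assumptions, not settings quoted from the benchmarks above):
# Profile 1 sketch: experts kept in system RAM via --cpu-moe, 256K context, q8_0 KV
llama-server -m gemma-4-26B-A4B-it-Q5_K_M.gguf \
--cpu-moe -c 262144 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on -ngl 999
# Profile 2 sketch: all experts on GPU, 64K context, 4 parallel slots
llama-server -m gemma-4-26B-A4B-it-Q5_K_M.gguf \
-c 65536 -np 4 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on -ngl 999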
Gemma 4 19B REAP-Heretic MoE (Q6_K)
This is a Router-weighted Expert Activation Pruned version of the 26B — 25 of 128 experts removed via calibration-weighted scoring, leaving 103 experts with the same 4B active compute. The pruning costs 2-4% accuracy on benchmarks. Also includes heretic ablation (refusal removal) on layers 10-30.
Key advantage: all-GPU fit at 128K. The 16 GB Q6_K quant (higher quality than the 26B's Q5_K_M despite being smaller thanks to expert pruning) fits entirely on the 3090 with lossless q8_0/q8_0 KV at 128K context. Uses 18.3 GB VRAM with 5.8 GB free. Benchmarked at 120+ t/s generation — that's about 3.5x faster than the 31B dense. Max predict set to 30,000 tokens. Uses the same vision mmproj as the 26B Opus distill (dimension-compatible since REAP only prunes FFN experts) — verified at 1.9s/image.
Comparison at a glance:
• 26B MoE with cpu-moe at 256K: 34.2 t/s, 13.8 GB VRAM, experts in RAM
• 26B MoE all-GPU at 64K: 90-128 t/s, 22.9 GB VRAM, 96% GPU utilization
• 19B REAP all-GPU at 128K: 120+ t/s, 18.3 GB VRAM, 5.8 GB free
The REAP-19B sits in the sweet spot — all-GPU speed, higher quant, longer context, more headroom. The 26B with cpu-moe is the long-context workhorse when you need 256K. The 26B all-GPU batch profile is the throughput king for small-prompt heavy workloads where you can afford the 64K context limit.
BigYoSpeck@reddit
I'd hazard a guess you're getting out-of-memory errors because Gemma 4 absolutely devours RAM for context checkpoints. With the default of 32 it will cripple even 64 GB of RAM.
Add -ctxcp 4 to start with and see if that stops the OOM, then increase the number of checkpoints to a level your system has capacity for.
Powerful_Evening5495@reddit
Use opencode and use the model's default context size
thirteen-bit@reddit
If you do not use it for image captioning workflows (every request contains images) and only need image input sometimes, move mmproj to RAM:
--no-mmproj-offload
Set a higher fit target (--fit-target 1536 or --fit-target 2048, default is 1024) to leave more VRAM free.
Maybe look into --fit-ctx 160000 instead of --ctx-size 160000?
Docs here: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
For coding workflows:
Add --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
Docs here: https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md#n-gram-mod-ngram-mod
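Folded into OP's current command, those suggestions would look roughly like this (model filename assumed from OP's post; the fit and speculative flag values are copied from above and untested here):
# sketch only: OP's flags plus --no-mmproj-offload, fit targets, and n-gram speculation
llama-server \
-m gemma-4-26B-A4B-it-UD-Q5_K_XL.gguf \
--mmproj mmproj-BF16.gguf --no-mmproj-offload \
--chat-template-kwargs '{"enable_thinking":true}' \
--flash-attn on \
--cache-type-k q4_0 --cache-type-v q4_0 \
--fit-ctx 160000 --fit-target 1536 \
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 \
-np 1 --jinja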
I'm using Q8_0 for better but slower results at 128K context, and IQ4_XS for a fast variant that fits fully in VRAM.
Q8_0 command line
Prompt to rewrite the bash script adding some new functions:
thirteen-bit@reddit
And IQ4_XS at 256K context fitting into 20GB VRAM:
caetydid@reddit
I get ~85 t/s with this setup on the first prompt. Rare crashes, hence the loop.
cat ~/gemma4-llmserver.sh
#!/bin/bash
while true; do
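# relaunch llama-server whenever it exits, to recover from the rare crashes mentioned above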
~holu/llama.cpp/llama.cpp-b8779/build/bin/llama-server \
--slots -np 1 \
-m ~holu/llama.cpp/models/gemma/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
--mmproj ~/llama.cpp/models/gemma/mmproj-F16.gguf \
--host 0.0.0.0 \
--port 8888 \
--ctx-size 262000 \
-ngl 9999 \
--temp 0.3 \
--reasoning auto \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--threads $(nproc) \
--batch-size 64 \
--repeat-penalty 1.1 \
--top-p 0.95 \
--flash-attn on
sleep 5
done;
tmvr@reddit
Setting the KV cache to q4_0 apparently kills that model; try to stay at q8_0 there:
https://www.reddit.com/r/LocalLLaMA/comments/1suh3sz/gemma_4_and_qwen_36_with_q8_0_and_q4_0_kv_cache/
Sadman782@reddit
In real-world usage, I don't see much degradation (it's far from being killed) after attn rot was introduced for iSWA. Maybe the models recover somehow through reasoning? Also, the PPL difference isn't as large as the KLD would suggest:
https://github.com/ggml-org/llama.cpp/pull/21513
BitGreen1270@reddit
Your context seems quite high; I get about 130 t/s on the same model with just --fit on and -c 65536.
But I'm running it on a rental on vast.ai.
Your cache type also seems low. Any reason you're using q4 instead of q8, or nothing at all? My understanding is that quantizing the KV cache can lead to model confusion.
Also, not exactly what you asked for, but I built a one-page hangman HTML game using opencode with both gemma4-26B and qwen3.5-35B, and for some reason I felt Qwen worked better. Gemma also got there eventually but needed more handholding and refactoring.
Deadhookersandblow@reddit (OP)
I occasionally do need high context, and a high context wasn't fitting with the q8 cache. I'll try reducing the context and not quantizing the cache, I suppose.