Gemma 4 for 16 GB VRAM
Posted by Sadman782@reddit | LocalLLaMA | 58 comments
I think the 26B A4B MoE model is superior for 16 GB. I tested many quantizations, but if you want to keep the vision, I think the best one currently is:
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf
(I tested bartowski variants too, but unsloth has better reasoning for the size)
But you need some parameter tweaking for the best performance, especially for coding:
--temp 0.3 --top-p 0.9 --min-p 0.1 --top-k 20
With temp and top-k kept low and min-p a little high, it performs very well. So far I've had no issues, and it stays very close to the AI Studio hosted model.
For vision use the mmproj-F16.gguf. FP32 gives no benefit at all, and very importantly:
--image-min-tokens 300 --image-max-tokens 1024
Use a minimum of 300 tokens for images; it increases vision performance a lot.
With this setup I can fit 30K+ tokens of fp16 KV cache with np -1. If you need more context, I think it is better to drop vision than to go to KV Q8, as that makes output noticeably worse.
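For planning how much context fits, the fp16 KV cache size can be estimated from the attention geometry. A minimal sketch; the layer/head numbers below are made-up placeholders, not Gemma 4's actual config:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV cache size: one K and one V tensor per layer.

    bytes_per_elem: 2 for fp16, roughly 1 for q8_0.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical GQA config: 32 layers, 4 KV heads, head_dim 128.
gib = kv_cache_bytes(30_000, 32, 4, 128) / 2**30
print(f"~{gib:.1f} GiB for a 30K-token fp16 KV cache")
```

With those placeholder numbers a 30K fp16 cache comes out under 2 GiB, which is the kind of budget that leaves room for ~13-14 GB of IQ4 weights on a 16 GB card.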
With this setup, I feel this model is an absolute beast for 16 GB VRAM.
Make sure to use the latest llama.cpp builds, or if you are using other UI wrappers, update their runtime versions.
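Putting the flags above together, a llama-server invocation along these lines should work. The paths and context size are placeholders; check `llama-server --help` on your build, since flag availability can vary:

```shell
# Sketch only: combine the sampler and vision flags from this post.
./llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --mmproj mmproj-F16.gguf \
  --temp 0.3 --top-p 0.9 --min-p 0.1 --top-k 20 \
  --image-min-tokens 300 --image-max-tokens 1024 \
  -c 32768 -fa on
```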
In my testing compared to my previous daily driver (Qwen 3.5 27B):
- runs at 80+ tps vs 20 tps
- with --image-min-tokens 300, its vision is >= the Qwen 3 27B variant I run locally
- much better multilingual support
- for real-world coding that needs up-to-date libraries it is much better, because Qwen more often reaches for outdated modules
- for long context Qwen is still slightly better, but that is expected since this is an MoE
andy2na@reddit
How are you doing --image-min-tokens 300? If I try that, it fails to load.
Doing 280 works.
Sadman782@reddit (OP)
The max tokens should also increase; set --image-max-tokens to 512 to make it work.
andy2na@reddit
thanks, forgot the image-max-tokens
RandomTrollface@reddit
I am getting pp512 = 893.46 t/s and tg512 = 36.86 t/s in llama-bench with Unsloth's Gemma 4 31B UD_IQ3_XXS on my Radeon RX 9070 (non-XT, 16GB VRAM) on radv. With ~32k context coding it's closer to 30-32 tok/s, but that is still an acceptable speed for me. I am pretty surprised by how well the 31B performs even at IQ3_XXS; it's perfectly usable.
Although I still prefer qwen 3.5 27b UD_IQ3_XXS for agentic coding with opencode from my short testing so far. I can also have longer context with Qwen vs Gemma because Gemma doesn't scale that efficiently over longer context. I prefer gemma for chatting in my native language and non coding related requests though.
Mosari@reddit
5060 Ti, 16GB. Gemma 4 26B fits with IQ3_S at 80,000 ctx (15.4GB VRAM used), but speed was 20-25 response tokens/s. I suspect it's limited by memory bandwidth. Running via Ollama with flash attention enabled and context quantized to q8_0.
gpt-oss-20b mxfp4 was 85-90 response tokens/s. So... I may need to pay for an RTX 5070 Ti or a higher class card.
I'd like to see others' response tokens/s results. Maybe the 5060 Ti has fallen behind... Ahhhhhhhhh
ansibleloop@reddit
Will try this vs GLM 4.7 and qwen3 coder 30b a3b
Seems like it could be the best in theory
BingpotStudio@reddit
I'm new to this and would be interested to hear your take vs GLM 4.7.
Mister_bruhmoment@reddit
How did you get 27B running on 16GB?? You'd have to have all the context in system ram
clickrush@reddit
The "A4B" stands for actual 4b I think. Meaning while it has 26b in total, it will only use 4b at a given time. It's constructed this way specifically to run on consumer hardware.
chadlost1@reddit
That’s true for the increased speed in tok/s, but it’ll still need to be entirely loaded in vram; if the model or the kv cache gets swapped to system ram, performance takes a huge dip
AnonLlamaThrowaway@reddit
What I've found (from gpt-oss-120b, at least) is that you can use an option to shove most experts onto RAM.
For example, in LM Studio, I can see that model has 36 layers. I'll set GPU offload (layers loaded onto GPU) to the full 36.
But then I'll adjust "number of layers for which to force MoE weights onto CPU" down from 36 until my VRAM fills up. Having the number set to 30, for example, means I keep 6 expert layers inside VRAM.
That way, I know I have the "routing layers" loaded, because THOSE are covered under the 36 loaded layers under "GPU offload"
It's a decent speedup over simply tuning the "GPU offload" slider down until your VRAM fills up, because that slider doesn't make the distinction between expert layers (fine to have in RAM) and routing layers (shouldn't be in RAM) by itself.
At least, that's my understanding of the situation.
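In llama.cpp terms, the same idea is exposed via expert-offload flags; a sketch with placeholder values (exact flag support depends on your build):

```shell
# Load all layers on GPU, but push the expert (MoE FFN) weights of the
# first 30 layers into system RAM; attention and routing stay in VRAM.
./llama-server -m model.gguf -ngl 99 --n-cpu-moe 30

# Equivalent via a tensor-override regex: map the expert tensors of
# layers 0-29 to the CPU buffer explicitly.
./llama-server -m model.gguf -ngl 99 \
  -ot "blk\.([0-9]|1[0-9]|2[0-9])\.ffn_.*_exps\.weight=CPU"
```

Lowering `--n-cpu-moe` keeps more expert layers in VRAM, which matches the LM Studio slider behaviour described above.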
nickthatworks@reddit
Can you provide some guidance on those switches and adjustments? I'm starting to play more with these MoE models and would love to be able to tweak them like this.
AnonLlamaThrowaway@reddit
cosmicr@reddit
On mine it only shows 30 layers; offloading all puts memory at 17.93GB.
The number of layers for which to force MoE weights onto CPU starts at 0, not 36. Changing its value doesn't seem to affect the estimated memory usage.
i-eat-kittens@reddit
*active 4B
Mister_bruhmoment@reddit
Hmm, gonna have to give it a go then. Thanks!
Hell_L0rd@reddit
CPU: AMD Ryzen 9 9955HX3D 16-Core Processor
RAM: 64GB
GPU: NVIDIA GeForce RTX 5080 Laptop GPU 16GB
Type: Lenovo Legion Pro 7 LAPTOP
Modelfile:
FROM C:\....\gemma-4-26B-A4B-it-UD-Q3_K_M.gguf
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER min_p 0.1
PARAMETER top_k 20
PARAMETER num_ctx 32768
> ollama ps
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma4-iq4-coder:latest 98d2016bd766 15 GB 100% GPU 32768 9 minutes from now
In opencode, ran a prompt: "what is current directory we are at? create a test file "test.txt" and write todays date and time"
Too slow: it took 2.5 min. Can't work like this. :(
When running directly in the terminal using ollama run gemma4-q3-coder-x1 and asking simple things, it processes fast without using the CPU, all on GPU. But in opencode it goes to the CPU for prompts, even simple ones.
I tried qwen3.5:9b; it works fast, but it's not that great a coding experience. I believe a model between 15-20B parameters would be nicer for 16GB VRAM.
Are there any tweaks we can do to make it perform better?
mr_Owner@reddit
You need to test manually which ubatch size works best with --cpu-moe to keep as much as possible in VRAM. Your pp speed should be blazing fast with that GPU; mine runs at least 800 tokens/s prompt processing with an RTX 4070S 12GB...
I am considering posting a table of benchmarks, speeds, and perplexity for my setup as a reference.
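One way to find that sweet spot is a small llama-bench sweep over ubatch sizes; a sketch with placeholder model path and offload count (flag support varies by build):

```shell
# Try a few ubatch sizes with a fixed MoE-offload setting and compare
# the pp512/tg128 numbers that llama-bench prints.
for ub in 256 512 1024 2048; do
  ./llama-bench -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
    --n-cpu-moe 20 -ub "$ub" -p 512 -n 128
done
```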
Monad_Maya@reddit
Please benchmark and share your results via llama-bench, here's a guide - https://np.reddit.com/r/LocalLLaMA/comments/1qp8sov/how_to_easily_benchmark_your_models_with/
Motive being to determine if it's a prompt processing issue and to quantify it with supporting evidence.
Cool-Chemical-5629@reddit
I wish Unsloth made a Q8_0 vision module to save even more space. There's a heretic variant that has one; depending on your hardware and how much you need vision while saving as much space as possible, Q8_0 for vision may just be your savior.
Sadman782@reddit (OP)
Use it: https://huggingface.co/prithivMLmods/gemma-4-26B-A4B-it-F32-GGUF/blob/main/GGUF/gemma-4-26B-A4B-it.mmproj-q8_0.gguf
Wow! I hadn't even considered Q8. It is even slightly better on some of my tests, with no quality loss at minimum, and now you can fit ~30K more context in the space saved. Amazing.
qnixsynapse@reddit
I use my own quantization: mxfp4 for the experts and the rest at bf16. Works great. It is the best local model I have used so far!
arbv@reddit
Would you mind sharing the GGUF (or conversion instructions)?
qnixsynapse@reddit
I will try to upload it to HF tonight if possible, since llama.cpp has a bug when trying to override a tensor's dtype to something that is not a quantization dtype like Q8_0 or Q4_K.
arbv@reddit
Thanks! Please educate us on how you made it.
farkinga@reddit
I really like this balance of quants. Would you mind sharing your recipe for producing this gguf?
Western-Cod-3486@reddit
Hey, nice! How much context do you fit, and in what amount of VRAM?
IrisColt@reddit
THANKS!!!
JumpingJack79@reddit
NVFP4 please. Thank you!
ivdda@reddit
What hardware are you running that on?
qnixsynapse@reddit
Intel Arc + i3 CPU.
MoffKalast@reddit
Which gen? Does it actually do bf16 without a speed drop?
qnixsynapse@reddit
It’s alchemist. Using the vulkan backend. It’s better than SYCL right now.
MoffKalast@reddit
Damn that's weird, I'm using a Xe-LPG in a system that's supposedly also alchemist and it absolutely sucks on Vulkan. I guess the discrete ones really are built different.
Hytht@reddit
Xe-LPG doesn't have XMX cores, if we're talking about meteor lake.
andy2na@reddit
nice, what llama.cpp commands are you using to offload ?
FeiX7@reddit
how you quantized? and why you picked such architecture? can you please share more details about it?
qnixsynapse@reddit
Used llama.cpp with my patch. Tried to follow gpt-oss.
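For reference, stock llama-quantize can express a similar mix through per-tensor type overrides. This is a hedged sketch, not the commenter's patched workflow, and it assumes your build accepts mxfp4 as a `--tensor-type` target (per the note above, keeping non-expert tensors at bf16 may require a patch):

```shell
# Quantize only the expert tensors to MXFP4, keeping everything else
# at bf16 (the input GGUF is assumed to already be bf16).
./llama-quantize \
  --tensor-type "ffn_up_exps=mxfp4" \
  --tensor-type "ffn_down_exps=mxfp4" \
  --tensor-type "ffn_gate_exps=mxfp4" \
  model-bf16.gguf model-mxfp4-experts.gguf bf16
```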
FeiX7@reddit
So you quantized the 35B MoE you mentioned? How did you benchmark it, and on what?
yehyakar@reddit
Quick Test using unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q3_K_M.gguf
Token generation around 150 t/s and prompt processing around 5900 t/s on a 16GB 5080.
nvidia-smi VRAM usage showing (15582MiB / 16303MiB)
I'm dropping the vision layers altogether to fit more context, and using the latest llama.cpp CUDA 13 binaries with this command:
./build/bin/llama-server -m /home/yk/Data/lmstudio/models/unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q3_K_M.gguf \
--ctx-size 228000 \
--alias "gemma-4-26b-A4B" \
--parallel 1 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
-fa on \
--host 0.0.0.0 \
--port 8888 --fit on --fit-target 256 --no-mmap --jinja
Still have to do some real testing with Claude Code using this model, with tool calling and long context, to actually see if it's better than the Qwen 3.5 models.
Plus, I think when TurboQuant arrives we should be able to squeeze in more context, less VRAM, and more accuracy and efficiency, hopefully.
i-eat-kittens@reddit
Is Q3_K_M really good enough for high-accuracy work like coding and tool calls?
I know it works for huge models, but in the range around 30B-A3B I've been defaulting to Q6_K after some early frustrations.
FenderMoon@reddit
I tried 3 bit quants and frankly I don’t recommend them on Gemma. For whatever reason these models are much more sensitive to quantization than a lot of other models.
It runs well on IQ4. However I did notice improvements when I tested at 5 bits, particularly in world knowledge. 5 bits is too heavy for my system to run well though, so I have to stick with 4.
i-eat-kittens@reddit
I'm gpu-poor, so nothing runs well. In that case I might as well go large. ;)
FenderMoon@reddit
For what it’s worth, I did notice better outputs on 5 bits.
One of my benchmark prompts is "tell me more about the Apple A6" (a pass is if the LLM correctly identifies the most important piece of information the A6 is known for: introducing the Swift microarchitecture rather than using off-the-shelf designs. A fail is if the model just throws a bunch of information out and doesn't recognize what is most significant.)
26B at IQ4: fail. 31B at IQ3: fails badly. 31B on AI Studio: fails. 26B at Q5_K_S: passes.
It’s just one prompt. Both models do well on all of my other benchmark prompts. This surprised me though.
iq200brain@reddit
I ran the "Swift" test on the 26B from Q4_K_S up to Q6_K, and also on the OpenRouter and nano-gpt hosted models. Not once did it mention Swift, not even when I outright asked "what does swift mean to you in that context" right after. Result: it talked about the Swift programming language, swiftness, etc.
OfficialXstasy@reddit
If you're using a recent llama.cpp build, it already has attention rotation for quantized KV; it's in use for Q8_0/Q5_0/Q4_0.
Thistlemanizzle@reddit
FYI, I have a 5070 with 12GB VRAM and 96GB RAM. It's a painful experience; I wish I had bought a 5070 Ti instead.
No-Educator-249@reddit
Thanks a lot for sharing the image min and max tokens setting! It really improved the model's vision quality. It now recognizes anime characters better and more reliably for me.
CodeCatto@reddit
What's a good fit for a 12GB RTX 5070Ti laptop GPU?
Sevealin_@reddit
I am trying to use 26b MoE for Home Assistant with llama.cpp, HA has pretty huge prompts with tool definitions up to like 25k tokens, it takes 26b sometimes like 40 seconds for time to first token with thinking disabled. Anyone else notice this? Single 3090 with any set context (8k-128k).
drallcom3@reddit
What if I don't? Is there a premade model with vision removed?
LostDrengr@reddit
I took the same one from Unsloth; I'll take another look at it today. I had an earlier build, 8661, and hit context chat issues. I have pulled 8665, which may have ironed out some of the behaviour. I have 16GB VRAM, so this is almost the sweet spot; hoping some more compression techniques can cement this size of model!
Confident-Ad-3465@reddit
Does the problem still exist in llama.cpp and the Unsloth (UD) quants?
steadeepanda@reddit
Thank you mate for sharing that, this saves a ton of time
InitiateIt@reddit
I just did this as close as possible in Lm Studio and got roughly 80tps too. Running a 5060ti 16GB.
TheWiseTom@reddit
Did you run benchmarks on how KV quantization works with Gemma 4? Especially with the Hadamard transformation (ik_llama.cpp has had it since November), many models don't mind at all.
It's in mainline llama.cpp as of a few days ago, but I'm unsure whether it is automatically enabled there or must be enabled manually like in ik. I also don't know if the implementations are the same.
If they are the same and are now automatically always on (merged 3-4 days ago), and you still saw worse results even with Q8 KV, that would mean Gemma 4 is highly allergic to KV quantization. Which would surprise me, since Google launched TurboQuant a week ago and then launched a new Gemma that wants the opposite; that would be a strange / funny coincidence.
VickWildman@reddit
Yet to try it, but I hope it will fit on my OnePlus 13 24 GB, either Q4_0, IQ4_NL or MXFP4 using the OpenCL backend.
jtonl@reddit
Worth a shot. Will test this out while running over a Tailscale network.