Why is gemma4 using so much RAM?
Posted by BestSeaworthiness283@reddit | LocalLLaMA | View on Reddit | 12 comments
I'm sorry if this is a really beginner question, but I'm trying to get into how LLMs work under the hood.
From my testing I have observed that when running gemma4:e4b I see usage of about 4 GB of VRAM and 8 GB of RAM. For context, I have an RTX 4060 with 8 GB of VRAM. From my understanding, the chunks can't all load into VRAM, so they get offloaded to RAM.
What do you think the problem is?
Double_Cause4609@reddit
The problem is basically as follows:
When Google was designing Gemma 4 E2B and E4B series, they wanted them to be accessible on smartphones.
Problem: smartphones don't have a lot of RAM.
Solution: What if they carefully design the architecture so that the phone only needs to load the 2B or 4B parameters that matter at any one point into RAM, and can store the rest on flash storage (the long-term storage)?
That's basically how the E4B etc models work.
So, new problem: How do we run it on LlamaCPP?
LlamaCPP has "backends", like CPU, CUDA, Vulkan, etc.
It doesn't really understand the idea of mapping a file to disk (flash) storage. So there's not really an easy way to say "hey, these parameters are quick to load, so you can leave them on SSD until the GPU needs them".
Instead, the best solution they had that didn't require 5k extra lines of code was to say "okay, we'll load the 4B effective parameters to VRAM, and we'll leave the rest in RAM, because CPU is already a valid GGML device".
So long story short: LlamaCPP (and LM Studio and Ollama which inherit from it) just aren't built well to take advantage of the way Gemma 4 E4B works.
If it helps: to the LlamaCPP ecosystem (again, like LM Studio, Ollama, etc.), Gemma 4 E4B looks more like a 12B A4B MoE model (kind of; it's weird because the sparsity is actually in the per-layer embeddings IIRC, but work with me). So if you look at something like IBM Granite 3B A1.2B, or any of the 19B A3B or 30B A3B MoEs, they'll behave the same way: LlamaCPP will load the full 19B into *some* type of memory, and can't easily load only the active parameters.
What makes the Gemma 4 models special, though, is that the way they work lets you cleanly separate just the active parameters onto VRAM.
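As a rough back-of-the-envelope for why "total vs. active parameters" matters for memory (a sketch; the ~4.5-bit quantization figure is an assumption here, and real usage also includes KV cache and activations on top):

```python
def gb(params_billion, bytes_per_weight=0.5625):
    """Rough size in GB of a weight set at ~4.5 bits (~0.5625 bytes) per weight."""
    return params_billion * bytes_per_weight

# Gemma 4 E4B as described above: ~8B weights total, ~4B "effective"/active.
total, active = gb(8), gb(4)
# A 30B A3B MoE for comparison: all 30B must sit in *some* memory,
# even though only ~3B are touched per token.
moe_total, moe_active = gb(30), gb(3)
print(f"E4B: ~{total:.1f} GB resident, ~{active:.1f} GB active")
print(f"30B A3B: ~{moe_total:.1f} GB resident, ~{moe_active:.1f} GB active")
```

Active parameters set the compute per token, but total parameters set how much memory has to hold weights somewhere, which is why the full model shows up across VRAM + RAM.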
BestSeaworthiness283@reddit (OP)
So gemma4 e4b doesn't have just 4B parameters, it's more like it loads the 4B needed at once, right?
Double_Cause4609@reddit
Yeah, all you really need to understand is basically "it loads the 4B parameters it needs to VRAM", and "it needs somewhere to put the other model weights not being used".
The issue is that LlamaCPP, LM Studio, Ollama, etc, all need to put the weights on any of:
A) CPU + System RAM
B) A CUDA device (VRAM)
C) A HIP / ROCm device (VRAM)
D) An Intel accelerator (VRAM)
etc.
So if your VRAM is full, the only other place to put those weights is CPU + System RAM.
The way the model was designed, the weights could honestly go on an SSD instead of system RAM, but none of the LLM backends on PC are designed for that.
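The placement logic being described is roughly a greedy fill: put layers on the GPU until VRAM runs out, and spill everything else to CPU + system RAM. A toy sketch of that idea (the names and numbers are illustrative, not LlamaCPP's actual code):

```python
def place_layers(layer_sizes_gb, vram_budget_gb):
    """Greedily assign layers to the GPU until VRAM is full; the rest go to CPU."""
    placement, used = {}, 0.0
    for i, size in enumerate(layer_sizes_gb):
        if used + size <= vram_budget_gb:
            placement[i] = "CUDA"   # fits in VRAM
            used += size
        else:
            placement[i] = "CPU"    # spills into system RAM; SSD isn't an option
    return placement

# e.g. ten 1 GB layers against a 4 GB VRAM budget:
print(place_layers([1.0] * 10, 4.0))
```

With those made-up numbers, 4 layers land on the GPU and 6 end up in system RAM, which is the same "both memories get used" pattern the OP is seeing.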
BestSeaworthiness283@reddit (OP)
Thank you very much!
LionStrange493@reddit
yeah this is pretty common with that kind of setup
part of it is just stuff spilling over into system ram when vram is tight, so you end up seeing both used
does it go up more when your prompts get longer or stay roughly the same?
BestSeaworthiness283@reddit (OP)
Thank you very much!
LionStrange493@reddit
ah ok that’s interesting
i’ve seen similar where it just sits around the same usage but starts getting slower or weirder instead of actually using more memory
are you running it through llama.cpp directly or some wrapper?
BestSeaworthiness283@reddit (OP)
Well, it's an agent for small context windows, be it locally run (like Ollama or LlamaCPP) or free API tiers.
Here is the link to one of my most recent posts if it interests you: https://www.reddit.com/r/ollama/comments/1srvl4v/i_built_a_free_opensource_cli_coding_agent/
LionStrange493@reddit
oh this is cool
8k constraint stuff is exactly where things start getting messy tbh
how are you handling like consistency across steps? does it ever drift or lose track mid-run?
BestSeaworthiness283@reddit (OP)
Well, if you mean when I ask it to do something, it works like this: it analyses the request and writes a plan broken into steps, and each step gets its own LLM call, to put it simply. It is guided by a map of the codebase, so it doesn't need to know everything. If you mean between prompts, the newest version does have memory and works pretty well in my testing, but it could only fit like the last 2 actions it took.
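To put that plan-then-execute flow in rough pseudo-Python (a sketch of the description above; `call_llm`, the prompt wording, and the map format are made up here, not the actual tool's API):

```python
def run_task(request, codebase_map, call_llm):
    """Plan once, then give each step its own LLM call."""
    plan = call_llm(f"Break this task into steps:\n{request}").splitlines()
    results = []
    for step in plan:
        # Each step only sees the codebase map, not the whole repo,
        # so each call stays within a small (e.g. 8k) context window.
        results.append(call_llm(f"Map:\n{codebase_map}\nDo: {step}"))
    return results

# Stub LLM just for illustration:
fake = lambda prompt: "step 1\nstep 2" if "Break" in prompt else "done"
print(run_task("add a test", "src/ ...", fake))  # → ['done', 'done']
```

The design point is that no single call ever needs the full history, only the plan step plus the map.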
LionStrange493@reddit
ah yeah that makes sense
that “only last 2 actions” thing feels like where stuff can quietly drift without you noticing
have you had cases where earlier context mattered but it just kinda lost it mid flow?
BestSeaworthiness283@reddit (OP)
I hadn't thought about that; you gave me an idea. I want to add a command to add an action you want to a separate pool, alongside the ring-buffer one. That separate pool would be controlled by the user.
Also, when making the tool I thought about adding an option to select the size of the ring buffer, because when connecting the model you want to use you can actually select the context size; 8k is just what it was built to use.
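That "fixed-size recent memory plus a user-pinned pool" idea could look something like this (a sketch of the concept only, not the tool's actual code; class and method names are made up):

```python
from collections import deque

class ActionMemory:
    """Keeps the last N actions in a ring buffer, plus user-pinned ones."""
    def __init__(self, size=2):
        self.recent = deque(maxlen=size)  # old actions fall off automatically
        self.pinned = []                  # survives until the user removes it

    def record(self, action):
        self.recent.append(action)

    def pin(self, action):
        self.pinned.append(action)

    def context(self):
        # Pinned actions come first, so they never get pushed out by new ones.
        return list(self.pinned) + list(self.recent)

m = ActionMemory(size=2)
for a in ["read file", "edit file", "run tests"]:
    m.record(a)
m.pin("project uses pytest")
print(m.context())  # → ['project uses pytest', 'edit file', 'run tests']
```

`deque(maxlen=N)` gives the ring-buffer behavior for free, and making `size` a constructor argument is exactly the "let the user pick the buffer size" option mentioned above.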