Is 200k context realistic on Gemma 31B locally? LM Studio keeps crashing
Posted by Open_Gur_4733@reddit | LocalLLaMA | 9 comments
Hi everyone,
I’m currently running Gemma 4 31B locally on my machine, and I’m running into stability issues when increasing the context size.
My setup:
- LM Studio 0.4.9
- llama.cpp 2.12.0
- Ryzen AI 395+ Max
- 128 GB total memory (≈92 GB VRAM + 32 GB RAM)
I’m mainly using it with OpenCode for development.
Issue:
When I push the context window to around 200k tokens, LM Studio eventually crashes. From what I can tell, Gemma gradually consumes all available VRAM until nothing is left.
Has anyone experienced similar issues with large context sizes on Gemma (or other large models)?
Is this expected behavior, or am I missing some configuration/optimization?
Any tips or feedback would be really appreciated
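For a sense of scale, here is a back-of-envelope KV-cache estimate in Python. The layer/head/dim numbers below are illustrative placeholders, not the model's actual config; the point is just that an unquantized f16 KV cache at 200k tokens can easily reach tens of GB on its own:

```python
# Back-of-envelope KV-cache size at long context.
# The layer/head/dim values are placeholders, NOT the real Gemma config --
# substitute the numbers from your model's metadata.
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for K and V; one entry per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

gib = kv_cache_bytes(200_000, n_layers=48, n_kv_heads=8, head_dim=128) / 2**30
print(f"~{gib:.1f} GiB for a 200k-token f16 KV cache")
```

With these assumed dimensions it comes out to roughly 37 GiB of KV cache alone, before the model weights or any checkpoint copies.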
Zestyclose_Yak_3174@reddit
It seems to use much more memory than normal. I believe llama.cpp has a fix for it, or is working on one.
Thomasedv@reddit
One of the more annoying things that took me a long time to learn about llama.cpp was that it automatically saves context checkpoints to RAM. Useful for multiple users, but I ran a single agent.
I assume LM Studio has something like it? At least check for it. llama.cpp defaulted to 32 checkpoints at 1-2 GB each, which ate through my 64 GB of RAM rather fast, despite the model being entirely in VRAM.
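The arithmetic in that comment checks out. A minimal sketch of the scenario (default checkpoint count and the 1-2 GB per-checkpoint range are taken from the comment above, not measured here):

```python
# Rough math for the checkpoint scenario: 32 checkpoints at 1-2 GB
# apiece versus a 64 GB system-RAM budget.
n_checkpoints = 32
low, high = n_checkpoints * 1, n_checkpoints * 2   # GB
print(f"{low}-{high} GB of RAM just for checkpoints")
```

At the high end of that range, checkpoints alone can consume the entire 64 GB, even with the model fully offloaded to VRAM.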
sgmv@reddit
I think the problem is related to this https://www.reddit.com/r/LocalLLaMA/comments/1sdqvbd/llamacpp_gemma_4_using_up_all_system_ram_on/
I encountered a freeze as well (all RAM was used up): 92 GB VRAM, 128 GB RAM, same with llama.cpp. Now experimenting with --checkpoint-every-n-tokens 32768 --ctx-checkpoints
Open_Gur_4733@reddit (OP)
Unfortunately, this feature is not yet available in LM Studio. We are still waiting for them to implement options to override llama.cpp :(
sgmv@reddit
llama.cpp is not hard to install; you could give it a shot. The lag between llama.cpp and LM Studio will always be a problem.
Ethrillo@reddit
I'm not aware of any issue with long context. On that machine, 200k context should be easily possible. What quant of Gemma 31b are you running?
You can always quantize the KV cache to Q8 to save some memory. Setting max concurrent predictions to 1 also saves memory if you don't need more than one agent.
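As a rough sketch of what Q8 KV-cache quantization buys, assuming ggml's q8_0 layout (32 one-byte quants plus a 2-byte f16 scale per block) against an f16 baseline; treat this as an estimate, not an exact measurement:

```python
# Estimated savings from quantizing the KV cache from f16 to q8_0.
# q8_0 stores ~1.0625 bytes/element (32 int8 values + one f16 scale
# per 32-element block); the f16 baseline is 2 bytes/element.
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32   # bytes per element in a q8_0 block

def savings_ratio(quant_bytes, base_bytes=F16_BYTES):
    return 1 - quant_bytes / base_bytes

print(f"q8_0 KV cache saves ~{savings_ratio(Q8_0_BYTES):.0%} vs f16")
```

So Q8 roughly halves the KV-cache footprint, which at 200k tokens can amount to tens of GB.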
Open_Gur_4733@reddit (OP)
I use Q4_K_M.
I am up to date on all stacks (as I described in my message).
getmevodka@reddit
Try deactivating mmap and "keep in memory" in the model options. Also turn off the safety guardrails in the menu.