Experience of using OpenClaude and Gemma4 26b
Posted by nonekanone@reddit | LocalLLaMA | View on Reddit | 12 comments
Hi Guys,
I am relatively new to the local LLM scene, and today I downloaded my first local model, Gemma 4 26b. I am using Ollama on an M1 Max with 32GB of RAM. When I just use Gemma 4 inside Ollama, it works like a charm. It takes up a good amount of memory, but that is to be expected with my limited hardware. But as soon as I start something like Open Claude, it completely breaks down: a simple Hello World C++ program took 5 minutes to write (in a new folder, so it didn't have to interpret any files). Does anyone know why that's happening, and is there maybe a fix to make it run better on my hardware? Thanks a lot.
Sadman782@reddit
Don't use Ollama; use llama.cpp. Ollama doesn't let you tune things the way you need to, while llama.cpp gives you complete control and ships fixes practically every day.
With 16 GB VRAM, I am running Gemma 4 26B with 150K+ context (4-bit KV), and it's working pretty well for agentic coding.
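A `llama-server` invocation along those lines might look like the sketch below. The model path, context size, and GPU layer count are placeholders, and flag spellings vary somewhat between llama.cpp builds:

```shell
# Sketch: serve a Q4 GGUF with a long context and 4-bit quantized KV cache.
# Model path and -ngl value are placeholders; adjust for your hardware.
llama-server \
  -m ./gemma-26b-Q4_K_M.gguf \
  -c 131072 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --flash-attn \
  -ngl 99 \
  --port 8080
```

The quantized V cache generally requires flash attention to be enabled, which is why `--flash-attn` is included alongside the cache-type flags.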
_hephaestus@reddit
They’re on a Mac, so use neither. llama.cpp might have MLX support by now, but something MLX-first like oMLX will have better support.
praqueviver@reddit
Can you point me to materials that show how to setup local agentic coding with llama.cpp? I've tried setting up llama.vscode but any agentic model requires an openrouter api key.
FamousWorth@reddit
Lm studio?
chibop1@reddit
OpenClaw has a huge system prompt with a bunch of memory, instructions, tools, etc. that the model has to read before it even gets to your prompt. Sometimes it pushes 40K tokens even if you just say Hi.
Ill_Barber8709@reddit
Gemma4 models currently suck at coding in Xcode too. I tried everything: Ollama, LMStudio, vLLM, llama.cpp - Q4_K_M GGUF and 4Bit MLX.
I'm currently using Qwen3.5 35B MoE Q4_K_M GGUF from LMStudio, with a 64K context size on my 32GB M2 Max MBP, and it works great.
Gemma4 support still needs a lot of polishing. I'll come back to it in a few weeks because it is still promising despite all the errors.
Pleasant-Regular6169@reddit
People more experienced than me recommend using https://omlx.ai/ which has SSD caching.
AVX_Instructor@reddit
Watch your memory consumption; OpenCode itself (its LSP servers in particular) usually takes the lion's share of RAM, and I think that's the problem.
jacek2023@reddit
I tried using codex with Gemma 4 26B and it worked correctly. I think I also had a good experience with opencode; I don't know open claude. It's also worth considering llama.cpp instead of Ollama.
Konamicoder@reddit
It sounds like you’re hitting a memory bottleneck that is forcing your Mac to use "swapping."
A 26b model is very large for a 32GB machine. When you use the basic Ollama terminal, it’s very lightweight. But when you use a WebUI (like Open Claude), you are adding a web server, a browser, and—most importantly—a much larger "context window." The context window is the amount of text the model can "remember" during a chat, and a large window requires a massive amount of extra RAM.
If the total memory needed for the model + the context window + your browser exceeds your 32GB, macOS starts using your SSD as temporary RAM. Since even the fastest SSD is much slower than your M1's actual memory, the whole system grinds to a halt.
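As a rough sanity check of that budget, the arithmetic below sketches the RAM a ~26B model plus its KV cache would need. The layer/head numbers are illustrative assumptions, not Gemma's actual architecture:

```python
# Rough RAM budget for a ~26B-parameter model at 4-bit quantization.
params = 26e9
weight_bytes = params * 0.5          # ~0.5 bytes/param at 4-bit -> ~13 GB

# KV cache grows linearly with context length:
# 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens
layers, kv_heads, head_dim = 48, 8, 128   # hypothetical architecture
ctx = 32_768
kv_bytes = 2 * layers * kv_heads * head_dim * 2 * ctx  # fp16 cache

total_gb = (weight_bytes + kv_bytes) / 1e9
print(f"weights ~{weight_bytes/1e9:.0f} GB, "
      f"KV cache ~{kv_bytes/1e9:.1f} GB, total ~{total_gb:.0f} GB")
```

Even under these assumptions you land near 20 GB before counting macOS, the browser, and the WebUI itself, which is why 32 GB gets tight fast.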
Try these steps to fix it:
Lower the Context Window: Look in your WebUI settings and find the "Context Length" or "num_ctx" setting. Lower it to 4096 or 8192. This is the most likely culprit.
Check Activity Monitor: Open the "Activity Monitor" app on your Mac, click the "Memory" tab, and watch the "Memory Pressure" graph while you use the UI. If that graph turns yellow or red, you have confirmed that you are running out of physical RAM.
Check Docker: If you are running the UI through Docker, make sure you haven't assigned a strict memory limit to the Docker engine that is too low for the model to function.
Close heavy apps: Close any extra Chrome tabs or heavy apps (like Xcode or Adobe) while running the model to free up as much of that 32GB as possible.
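For step 1 with Ollama specifically, the context length can be set interactively or baked into a derived model with a Modelfile. The model tag below is a placeholder for whatever you actually pulled:

```shell
# Interactively, inside `ollama run <model>`:
#   /set parameter num_ctx 8192

# Or bake it into a derived model via a Modelfile:
cat > Modelfile <<'EOF'
FROM your-model-tag
PARAMETER num_ctx 8192
EOF
ollama create my-model-small-ctx -f Modelfile
```

Note that some WebUIs override `num_ctx` per request, so check the UI's own model settings as well.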
xeeff@reddit
if they wanted an AI response, they'd have used AI
Several-Tax31@reddit
Agentic frameworks like opencode and openclaude use huge system prompts (for claude, something like 20,000 tokens) that basically tell the model about the system and how it should behave. Running from RAM, processing that will take 5 minutes or more. (We call this prompt processing, or pp.) You can shorten the prompt or get rid of it completely if you're not happy, and you can also try optimizations to make pp faster. Keep in mind this is the "reading speed" of the model, and it happens any time the model reads a file, the result of a tool call, etc. In summary, yeah, running models from RAM is slow as hell.
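A quick back-of-the-envelope shows why a big system prompt alone can eat minutes; the token count and prompt-processing speed below are assumptions for illustration, not measurements:

```python
# Time to prompt-process a large agentic system prompt (illustrative numbers).
prompt_tokens = 20_000   # assumed system prompt size, tokens
pp_speed = 70            # assumed prompt-processing speed, tokens/s
seconds = prompt_tokens / pp_speed
print(f"~{seconds/60:.1f} minutes just to read the system prompt")
```

At those assumed numbers you get nearly 5 minutes before the model generates a single token of your answer, which matches the OP's Hello World experience.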