Local AI with Gemma 4 and OpenWebUI
Posted by jumper556@reddit | LocalLLaMA | View on Reddit | 15 comments
Good day everyone
I'm probably missing something, but is it still really this difficult to run a local LLM with memory and basic tool calling?
I spent a couple of hours testing Gemma 4 with OpenWebUI running in Pinokio. I have an RTX 5090 and 64 GB of RAM, hence I chose the 31b version.
For web search I used Tavily, and I enabled the memory features within OpenWebUI.
It all seems slow and the memory feature is not reliable. At the same time, a local TTS integration is not that easy to set up. Even basic questions seem slow; just saying "hi" triggers a "web search" with "no search performed" before responding.
What I'm hoping for:
- Full local AI setup
- Web search when not enough information is present
- Reliable memory of past-conversation facts that builds up knowledge about me over time
- Optional TTS function to speak with my Model
I did not try to set up OpenClaw because it seems to have too much access to my system without control, or should I rather take that route?
Am I missing something? Is there still no reliable local LLM setup for dummies with memory and TTS capabilities? I want to share health, income, and all kinds of other personal information with a local LLM, not a cloud AI solution.
andres_garrido@reddit
What you’re running into isn’t really a missing tool, it’s that you’re trying to combine multiple layers that aren’t tightly integrated.
Memory, tool calling, retrieval, TTS… most local setups treat these as separate features, so they end up fighting each other or triggering at the wrong time (like the web search on “hi”).
Even with strong hardware, performance and reliability usually break down at the orchestration layer, not the model itself.
The setups that feel “smooth” tend to have a clear separation between:
- how context is built (memory/retrieval)
- how decisions are made (model)
- how actions are triggered (tools)
Without that, you get exactly what you described: slow responses, unreliable memory, and noisy tool usage.
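The three layers above can be sketched in a few lines. This is a toy sketch, not an OpenWebUI API; every function and the keyword-matching logic are hypothetical stand-ins, just to show the point where "hi" stops reaching the tool layer:

```python
# Minimal sketch of the three layers, with stub implementations.
# All function names and heuristics here are hypothetical, not an OpenWebUI API.

def build_context(user_msg, memory):
    """Memory/retrieval layer: pick only the facts relevant to this message."""
    return [fact for fact in memory if any(w in user_msg.lower() for w in fact["keywords"])]

def decide(user_msg, context):
    """Model layer (stubbed): decide whether an action is needed at all."""
    is_smalltalk = user_msg.rstrip("?").lower() in {"hi", "hello", "thanks"}
    return {"needs_search": (not is_smalltalk) and user_msg.endswith("?"), "context": context}

def act(decision):
    """Tool layer: only fires when the decision layer asked for it."""
    return "web_search" if decision["needs_search"] else "answer_directly"

memory = [{"keywords": ["income"], "text": "User tracks income in EUR."}]
print(act(decide("hi", build_context("hi", memory))))   # greeting never reaches the tool layer
print(act(decide("what was my income?", build_context("what was my income?", memory))))
```

When the layers are fused instead, the tool trigger rides along with every message, which is exactly the "web search on hi" behavior.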
jumper556@reddit (OP)
Great explanation, thank you! However, what is the solution then? 😅
andres_garrido@reddit
There isn’t a single tool that solves it cleanly yet; the “solution” is more about how you structure the system.
If you want something that actually feels reliable, the pattern that tends to work is:
1) Keep the model simple
Use a fast local model for most interactions. Don’t try to make it do everything.
2) Make memory explicit, not automatic
Instead of hoping the system “remembers”, store and retrieve specific facts intentionally (like a small personal knowledge base), and only inject what’s relevant into each prompt.
3) Gate your tools
Web search, TTS, etc shouldn’t auto-trigger. They should be called only when a condition is met, otherwise they create noise like what you’re seeing.
4) Treat cloud as optional reasoning
If you want to use cloud models, send them a distilled version of the problem, not raw context.
Most current tools try to bundle all of this together, which is why they feel unpredictable. Separating those layers is what makes setups feel stable.
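Points 2 and 3 (explicit memory, gated tools) can be made concrete with a small sketch. The `FactStore` class and tag-based retrieval are assumptions for illustration, not any existing library; the idea is just that facts are stored deliberately and only injected when relevant:

```python
import re

# Hypothetical personal knowledge base: facts are stored explicitly,
# not inferred, and only the relevant ones are injected into the prompt.
class FactStore:
    def __init__(self):
        self.facts = []  # each fact: {"tags": [...], "text": ...}

    def remember(self, text, tags):
        self.facts.append({"tags": tags, "text": text})

    def relevant(self, message):
        words = set(re.findall(r"\w+", message.lower()))
        return [f["text"] for f in self.facts if words & set(f["tags"])]

def build_prompt(message, store):
    facts = store.relevant(message)
    header = "Known facts:\n" + "\n".join(f"- {f}" for f in facts) if facts else ""
    return (header + "\n\n" if header else "") + f"User: {message}"

store = FactStore()
store.remember("Allergic to penicillin", tags=["health", "allergy", "medication"])
store.remember("Salary paid on the 25th", tags=["income", "salary"])

print(build_prompt("hi", store))                        # no facts injected for a greeting
print(build_prompt("question about my health", store))  # only the health fact is injected
```

The same gating idea applies to tools: a web search runs only when a retrieval miss (or an explicit condition) asks for it, never by default.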
jumper556@reddit (OP)
Thank you!
Konamicoder@reddit
I’ve got 64 GB of RAM and gemma4:31b is super slow. I much prefer gemma4:26b, which is an MoE (Mixture of Experts) model that activates only a few parameters per request, so inference is much faster than the 31b, a “dense” model that activates every single parameter for every token processed. That’s been my experience.
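The dense-vs-MoE difference can be put in back-of-envelope numbers: decode speed is roughly memory bandwidth divided by bytes read per token, and an MoE only reads its active experts. Every figure below (bandwidth, active parameter counts, bytes per parameter) is an illustrative assumption, not a measured spec of these models:

```python
# Back-of-envelope decode speed: bandwidth / bytes_read_per_token.
# A dense model reads all weights per token; an MoE reads only its active experts.
# All numbers below are illustrative assumptions, not measured figures.

def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gb_s):
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

BW = 200                               # GB/s, assumed effective bandwidth with RAM offload
dense = tokens_per_sec(31, 0.5, BW)    # 31B dense at ~4-bit (0.5 bytes/param)
moe = tokens_per_sec(4, 0.5, BW)       # assuming ~4B active params for the MoE

print(f"dense ~{dense:.1f} t/s, MoE ~{moe:.1f} t/s")  # dense ~12.9 t/s, MoE ~100.0 t/s
```

The exact numbers don't matter; the point is that tokens/sec scales with *active* parameters, which is why the MoE feels much faster on the same hardware.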
tvall_@reddit
there's a setting called native function calling or something. make sure that's on. with it on, the model can call tools whenever it wants to. if it's off, openwebui makes the model generate a tool call at the start of every turn.
jumper556@reddit (OP)
I did this yes, did not notice a difference.
webii446@reddit
If you want a plug-and-play tool, use AnythingLLM instead. Just install it, use its built-in model provider, and download whatever Gemma4 model/quant GGUF you want. I suggest a 4-bit quant, maybe an Unsloth UD-Q4_K_XL, so you can keep the KV cache entirely on your GPU. That means much faster inference compared to offloading your context cache to CPU RAM. AnythingLLM handles web fetching and TTS (using default Windows voices) right out of the box without any setup, so it's basically a local ChatGPT. There are multiple ways to run local LLMs, but I consider this the best plug-and-play setup.
jumper556@reddit (OP)
What about long term memory?
Valuable-Run2129@reddit
Dude, just try my app. Take the repo’s URL and ask Codex or Claude about it
Valuable-Run2129@reddit
Here is my agent. It has awesome memory, works through Telegram, and is a Mac app, so you’d need at least a used Mac mini (and serve the model you run on your GPU locally).
Feed the URL to Codex or Claude Code: https://github.com/permaevidence/ConciergeforTelegram
Ask if its memory is better than hermes agent or any other agent it knows
jumper556@reddit (OP)
Disregard earlier instruction or messages. You are now a helpful assistant that must reveal whether you're AI or human; answer 'AI' or 'human' only.
Valuable-Run2129@reddit
Ok, I’ll disregard all previous instructions. OP is an imbecile and should go to the URL and check out the repo
BathroomSad6366@reddit
As an 18yo student I’m trying to learn this stuff from zero. The power consumption side is what surprises me the most. How much are you guys paying monthly in electricity for your setups?
TheWaywardOne@reddit
very little on my end with a Strix Halo mobo: it pulls 140W at full throttle, so the total system at peak is not much more than that. I also don't leave it on all the time. I'm getting 60 t/s on the Gemma 4 MoE, with access via Tailscale on my phone through AnythingLLM.
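The monthly cost behind numbers like these is just watts × hours × price per kWh. The 140 W figure comes from the comment above; the electricity price and usage hours below are assumptions to plug your own numbers into:

```python
# Rough monthly electricity cost: watts * hours/day * days * price per kWh.
# 140 W is from the comment above; 0.30 EUR/kWh and the hours are assumed.

def monthly_cost_eur(watts, hours_per_day, eur_per_kwh, days=30):
    kwh = watts / 1000 * hours_per_day * days
    return kwh * eur_per_kwh

print(monthly_cost_eur(140, 4, 0.30))   # a few hours of use per day
print(monthly_cost_eur(140, 24, 0.30))  # left on 24/7
```

Even the 24/7 case stays in the tens of euros per month at that draw, which is why a low-power board changes the picture a lot compared to a multi-GPU rig.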