Tried hermes agent with local gemma4 on ollama. free tokens are nice but the agent quality gap vs cloud is still huge
Posted by RepulsivePurchase257@reddit | LocalLLaMA | 13 comments
Saw a post about running hermes agent locally with gemma4 through ollama. zero api costs, unlimited tokens, full privacy. spent a weekend setting it up.
Install is straightforward. brew install ollama, pull gemma4:4eb (9.6gb, took about 2 hours), configure hermes to use local endpoint instead of deepseek api. it works, model responds, does basic tasks.
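For reference, the core change is just pointing the agent at ollama's OpenAI-compatible endpoint. The key names below are guesses at what a hermes config might look like (I don't have its real schema in front of me); only the endpoint URL, which is ollama's default, and the model tag come from the setup above.

```yaml
# Hypothetical hermes config sketch -- key names are illustrative,
# only base_url (ollama's default) and the model tag are from the post.
model:
  provider: openai-compatible
  base_url: http://localhost:11434/v1   # ollama's built-in OpenAI-style API
  api_key: ollama                       # ollama ignores the key, but most clients require one
  name: gemma4:4eb
```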
But the quality gap between local and cloud frontier models for agentic tasks is massive. not 10-20% worse, more like a different category.
Tested three things:
Simple file organization script: gemma4 handled it fine. 40 seconds vs 5 on cloud claude. acceptable.
Refactoring a react component with complex state: local model got the structure right but missed two edge cases cloud models catch consistently.
Multi step task planning: asked it to break down a feature with dependencies. output was generic, missed project context entirely. same task in verdent with cloud models gives me clarifying questions about my codebase and catches dependency conflicts. night and day.
Speed compounds too. 15-20 tps on an m2 pro. for chat that's fine. for agentic loops where the model iterates 5-6 times, the latency adds up fast.
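The compounding is easy to see with back-of-envelope math. The throughput numbers below are assumptions (18 tps is the midpoint of the 15-20 tps reported; 90 tps for cloud and 600 output tokens per step are illustrative guesses, not measurements):

```python
# Back-of-envelope agent-loop latency. Ignores prompt-eval time and
# network overhead; all numbers are illustrative, not benchmarks.

def loop_latency_s(tps: float, tokens_per_step: int, steps: int) -> float:
    """Total generation time for an agentic loop of `steps` iterations."""
    return steps * tokens_per_step / tps

local = loop_latency_s(tps=18, tokens_per_step=600, steps=6)   # local gemma4
cloud = loop_latency_s(tps=90, tokens_per_step=600, steps=6)   # assumed cloud tps

print(f"local: {local:.0f}s, cloud: {cloud:.0f}s")  # local: 200s, cloud: 40s
```

A per-response gap of seconds turns into minutes once the agent loops, which is why chat feels fine locally but agent runs drag.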
Where local actually shines: privacy sensitive review, offline dev, cheap first pass before sending complex stuff to cloud. my deepseek bill dropped from $30/month to $8 by offloading simple queries locally.
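The "cheap first pass" workflow can be automated with a simple router. The heuristic below (prompt length plus a few hard-task keywords) is entirely my own guess at how to split traffic, not anything hermes or deepseek ship:

```python
# Hypothetical local-vs-cloud router: short, simple prompts stay on the
# local model; anything long or planning-flavored escalates to the cloud.
# The keyword list and the 40-word cutoff are arbitrary illustrations.

HARD_HINTS = ("refactor", "plan", "dependencies", "architecture")

def pick_backend(prompt: str) -> str:
    """Return which backend should handle this prompt."""
    words = prompt.lower().split()
    if len(words) < 40 and not any(hint in words for hint in HARD_HINTS):
        return "ollama/gemma4"
    return "cloud"

print(pick_backend("rename these files by date"))  # ollama/gemma4
```

Even a crude split like this captures the pattern behind the $30 to $8 drop: most queries are simple, and simple queries are exactly where the local model is good enough.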
Worth setting up as a complement, not a replacement. The "token freedom" pitch is technically true, but the quality tradeoff is significant for anything beyond the basics.
_ObsessiveCoder@reddit
mine pretends to use tools but won't actually call them, and I can't get any ai tool to explain a fix that actually works and doesn't nuke my install. It's all surface-level theory that isn't based on hermes, just general AI suggestions. Totally lost
Cultural-Broccoli-41@reddit
For complex, agent-style tasks, Gemma4-26B-A4B seems to be the minimum acceptable model in terms of performance.
If getting that running is difficult, it's more constructive to simply rely on the Web API.
If you absolutely need to run fully offline, you should try to build an environment where Gemma4-26B-A4B can run. I mean in terms of physical hardware.
inphaser@reddit
the problem is that hermes agent seems to eat context like peanuts. idk how it manages it, but one tool call can burn 10-20k tokens of context. which is odd, since my system only does some 20 t/s and the call doesn't seem to run long enough to generate that much
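One plausible explanation (an assumption on my part, I haven't traced hermes' actual requests): the context isn't generated at 20 t/s at all. Tool *results* get appended to the transcript, and the whole growing transcript is re-sent on every turn, so one big scraped page costs its full token count immediately and again on every later call. The numbers below are illustrative guesses:

```python
# Sketch of why a single tool call can cost 10-20k context tokens:
# the tool RESULT (e.g. a scraped web page) is appended wholesale,
# not generated token-by-token by the model. Numbers are made up.

def context_after(calls: int, base: int = 2000, result_tokens: int = 8000) -> int:
    """Approximate context size after N tool calls, each appending one large result."""
    return base + calls * result_tokens

print(context_after(1))  # 10000
print(context_after(3))  # 26000
```

That would explain seeing 10-20k of context from a call that only took a few seconds of generation.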
inphaser@reddit
I still haven't managed to fix its web search..
also, reconfiguring doesn't seem to change the options
So it always tries to use camofox, which it downloads but then fails to use, and for web scraping I specified a local instance of firecrawl that works for me but not for it
Idk what the issue is.. And ofc with a local gemma4 it can't do that much debugging
AcrobaticChain1846@reddit
Bro out of all local models I feel like Gemma4 is the laziest model of all.
Qwen3.5 9b is far better at the same quantization, or you could try basically any other model.
All Gemma4 models are just lazy AF.
NormalNature6969@reddit
Ollama is shit, let's get a grip.
Olive64@reddit
what do you suggest instead?
NormalNature6969@reddit
Best of luck. Purposes that.
Independent_Plum_489@reddit
The planning quality gap is the biggest issue imo. i use verdent for task decomposition with cloud models and the difference is night and day compared to local. the plan mode actually asks clarifying questions about your codebase, which requires deep reasoning that local models can't do yet. maybe when 70b+ runs efficiently locally it'll close the gap
carrot_gg@reddit
Are you really comparing a 9.6GB model against the cloud alternatives?
the_omicron@reddit
Bro literally comparing 4.5B model to >1T models lmao
CryptoLamboMoon@reddit
The fair comparison is actually model class, not size. gemma4:4b vs claude haiku, not claude sonnet. at that tier the gap shrinks a lot. the real issue is ollama’s throughput under agentic loops — swap to llama.cpp or vllm and you’ll see different results. local makes sense for privacy-first or high-volume tasks, not as a drop-in frontier replacement.
mlhher@reddit
You are comparing gemma4 (and not even like 26b or 31b, you are comparing e4b) to a cloud model on a harness that is made for cloud models.
I am not sure what you really expected.