gemma 4 running at 40 tokens/sec on iphone is impressive but it completely falls apart as a coding agent
Posted by Fun-Newspaper-83@reddit | LocalLLaMA | View on Reddit | 11 comments
Been testing gemma 4 since google dropped it. the small variants E2B and E4B are genuinely impressive on device. 40+ tps on iphone with mlx optimization, 128k context window, handles image and audio natively. feels like magic for basic chat and quick questions.
Ran it on my m5 pro macbook too. the 26B MoE version is fast for direct conversation. text generation, code explanation, summarization all smooth.
Then i tried using it as an actual coding agent and everything fell apart.
The problem isn't raw intelligence, it's tool calling and structured output. agent workflows need the model to reliably call functions, parse results, and chain multiple steps together. gemma 4 keeps choking on this: it outputs malformed json, misses required fields, and gets confused mid-chain. tried it with aider and it would stall, throw errors, or produce structurally broken responses.
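To make concrete what "malformed json" breaks: before executing a tool call, an agent framework has to parse and validate the model's output. Here's a minimal sketch of that check; the field names and tool names are hypothetical, not aider's actual internals:

```python
import json

# Assumed minimal schema for a tool call: the exact fields vary by
# framework, but some required-field check like this always exists.
REQUIRED_FIELDS = {"tool", "arguments"}

def parse_tool_call(raw):
    """Return the parsed call dict, or None if the output is unusable."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed json: one of the failure modes above
    if not isinstance(call, dict) or not REQUIRED_FIELDS <= call.keys():
        return None  # missing required fields: the other failure mode
    return call

# A well-formed call parses; a response truncated mid-output does not.
good = parse_tool_call('{"tool": "read_file", "arguments": {"path": "main.py"}}')
bad = parse_tool_call('{"tool": "read_file"')
assert good is not None and bad is None
```

Every `None` here is a step the agent has to retry or abort, which is where the chains stall.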
Switched to qwen3-coder in the same setup. same framework, same tasks: file creation, command execution, multi-step refactoring. all worked fine. the difference isn't general capability, it's whether the model was specifically trained for agentic tool-use patterns.
This is the gap nobody talks about when they get excited about on-device models. running a model locally is one thing. running it as a reliable agent that can plan, execute, verify, and iterate is completely different. the agent loop requires consistent structured output across dozens of tool calls. one malformed response breaks the whole chain.
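A rough sketch of that agent loop, assuming a framework-agnostic setup where `model` is any callable returning a parsed step result, or `None` for a malformed response (all names are illustrative, not any real framework's API):

```python
# Minimal agent loop: every step must yield valid structured output,
# or the chain retries a couple of times and then breaks entirely.
def run_agent(model, steps, max_retries=2):
    results = []
    for step in steps:
        for _attempt in range(max_retries + 1):
            out = model(step, history=results)
            if out is not None:      # None stands in for a malformed response
                results.append(out)
                break
        else:
            # One step exhausting its retries kills the whole chain,
            # which is exactly the failure mode described above.
            raise RuntimeError(f"step {step!r} failed after retries")
    return results
```

With dozens of steps, even an occasional malformed response means the `RuntimeError` path fires often enough to make the agent useless.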
For simple stuff gemma 4 on device is genuinely useful. quick code explanations, reviewing a function, answering questions about syntax. zero latency, zero cost, works offline. great for that.
But for actual development work where you need the model to autonomously write code, run tests, fix failures, and iterate? cloud models are still way ahead. the reliability gap for agentic workflows is massive.
The business model implications are interesting though. if on-device models keep improving, they eat into the high-frequency simple query market. cloud providers will have to justify their pricing with capabilities local models cant match. complex multi-agent orchestration, massive context windows, reliable tool calling chains.
Tools like verdent and cursor that run multi-agent workflows with verification loops are exactly the kind of thing that needs cloud-grade models. you can't have an agent that fails 1 in 5 tool calls when it's running a 20-step automated pipeline. the compound failure rate kills you.
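The compound-failure arithmetic is easy to check: at a 1-in-5 per-call failure rate, a 20-step pipeline almost never finishes cleanly.

```python
# Per-call reliability compounds multiplicatively across the pipeline.
per_call_success = 0.80       # fails 1 in 5 tool calls
steps = 20
pipeline_success = per_call_success ** steps
print(f"{pipeline_success:.1%}")  # prints "1.2%": ~99 of 100 runs break somewhere
```

Even at 95% per-call reliability, a 20-step chain only completes about 36% of the time, which is why agentic tool calling has to be near-perfect to be usable.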
Short term: local models for quick stuff, cloud for serious agent work. long term: depends on how fast on-device tool-calling reliability improves. but we're not close yet.
Thunderstarer@reddit
I mean yeah but like. It's on a phone.
tenebreoscure@reddit
Every backend had bugs with tool calling and other issues up until yesterday; the official template was updated hours ago. You have essentially wasted your time.
Sadman782@reddit
template issue
use the updated jinja (updated a few hours ago): https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
or a slightly modified version (better): https://pastebin.com/raw/hnPGq0ht
ttkciar@reddit
Thanks for this. It makes me glad to be using my own templates, but am also glad to have these as references.
Crampappydime@reddit
Well it's an E4B model meant for a phone, idk what you expected
_raydeStar@reddit
4B should be like sonnet 4.6, at least Haiku 4.6 😤😤
UndecidedLee@reddit
"My moped fails to handle loads that my neighbor's semi-truck manages without problem. Do you guys have the same problem? What are the long-term implications?"
Dropitse@reddit
This matches what I've seen. use local models for quick lookups and explanations. but for anything involving plan-execute-verify loops I need cloud models through verdent or similar tools. the reliability difference for multi-step agent work is night and day
ea_man@reddit
Yeah, I'd bet local LMs as distilled SOTA are about a year behind on functionality like tooling.
If you stay in plan mode, have the model (QWEN3.5 27B) eat some 10K of context worth of docs, then plan, propose, and EDIT (which you apply manually); as long as you stay below ~120K context, all goes well.
I do hope that we get a QWEN3.6 that fixes tooling, then we need to have a conversation about "vertical integration" in clients like OpenCode.
iMrParker@reddit
Why would you use those models as a coding agent? And then compare it to a 30b coder model that was, at the time, best in its class?
kvothe5688@reddit
they don't even work as normal information models. op using it for coding lmao