Comparing Qwen3.5 vs Gemma4 for Local Agentic Coding
Posted by garg-aayush@reddit | LocalLLaMA | View on Reddit | 95 comments
Gemma4 was released by Google on April 2nd earlier this week, and I wanted to see how it performs against Qwen3.5 for local agentic coding. This post is my notes on benchmarking the two model families. I ran two types of tests:
- Standard llama-bench benchmarks for raw prefill and generation speed
- Single-shot agentic coding tasks using Open Code to see how these models actually perform on real multi-step coding workflows
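For anyone reproducing this, a typical llama-bench invocation looks roughly like the sketch below (the model filenames are placeholders for whatever GGUF quants you have; `-p`/`-n` set prompt and generation lengths):

```shell
# Measure prefill (-p) and generation (-n) throughput;
# -ngl 99 offloads all layers to the GPU.
llama-bench -m qwen3.5-27b-q4_k_m.gguf -p 512 -n 128 -ngl 99
llama-bench -m gemma4-26b-a4b-q4_k_m.gguf -p 512 -n 128 -ngl 99
```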
My pick is Qwen3.5-27B, which is still the best model for local agentic coding on a 24GB card (RTX 3090/4090). It is reliable, efficient, produces the cleanest code, and fits comfortably on a 4090.
Generation speeds based on llama-bench
| Model | Architecture | Generation (tokens/s) |
|---|---|---|
| Qwen3.5-35B-A3B | MoE | 165.84 |
| Gemma4-26B-A4B | MoE | 164.38 |
| Qwen3.5-27B | Dense | 45.88 |
| Gemma4-31B | Dense | 44.42 |
Single-shot agentic coding tasks
I tested two prompts (a simple httpx script and a more complex Gemini image generation workflow with TDD) where the model has to figure everything out on its own.
Speed in llama.cpp + OpenCode setup
| Model | Prefill tok/s (P1) | Prefill tok/s (P2) | Gen tok/s (P1) | Gen tok/s (P2) |
|---|---|---|---|---|
| Gemma4-26B-A4B | 4,338 | 4,560 | 135.5 | 134.4 |
| Qwen3.5-35B-A3B | 3,179 | 3,056 | 136.7 | 132.3 |
| Gemma4-31B | 1,466 | 1,357 | 37.7 | 35.2 |
| Qwen3.5-27B | 2,474 | 2,188 | 44.9 | 44.6 |
Generated Code Quality on complex prompt
| Aspect | Gemma4-26B-A4B | Gemma4-31B | Qwen3.5-35B-A3B | Qwen3.5-27B |
|---|---|---|---|---|
| Structure | 2 files, basic separation | 3 files, clean separation | Class-based with helpers, cleanest design | 3 files + dead main.py stub |
| Error handling | Minimal, no API error handling | Poor, no try/except around API | Adequate but no batch error recovery | Weak, silent failures |
| TDD | Placeholder test, no real TDD | One integration test, superficial | Integration tests only, claimed but not real | Integration tests only, claimed but not real |
| Cleanliness | Acceptable, concise | Good, readable, concise | Good structure but unused base64 import | Good docstrings, type hints, pathlib usage |
| Critical issues | Broken summary, no uv run setup | New client per API call | Hardcoded API key in tests, wrong model | Dead main.py, new client per call |
Key Takeaways
- MoE models are ~3x faster at generation (~135 tok/s vs ~45 tok/s), but both dense models got the complex task right on the first try; both MoE models needed retries.
- Qwen3.5-35B-A3B seems to be the most verbose (32K tokens on the complex task).
- Gemma4-31B dense is context-limited compared to the others on a 4090; I had to drop to 65K context to maintain acceptable generation speed.
- None of the models actually followed TDD despite being asked to. All claimed red-green methodology but wrote integration tests hitting the real API.
- Qwen3.5-27B produced the cleanest code (correct API model name, type hints, docstrings, pathlib). Qwen3.5-35B-A3B had the best structure but hardcoded an API key in tests and used the wrong model name.
You can find the detailed analysis notes here: https://aayushgarg.dev/posts/2026-04-05-qwen35-vs-gemma4/index.html
Happy to discuss and hear about other folks' experiences too.
donhardman88@reddit
Model choice is important, but for agentic coding, the retrieval layer is actually the bigger variable. You can have the best model in the world, but if the agent is just using flat semantic search to find code, it'll still struggle with complex cross-file dependencies.
I've found that the most successful local agent setups are the ones using a structural knowledge graph via MCP. It allows the agent to actually 'navigate' the project architecture rather than just guessing based on embeddings. It makes a huge difference in how the agent handles refactoring across multiple files.
Potential-Leg-639@reddit
Tell us more about "using a structural knowledge graph via MCP", please?
donhardman88@reddit
Sure! The core idea is that instead of treating your code as a collection of text chunks (which is what standard RAG does), you use AST parsing (via tree-sitter) to map out the actual symbols, function definitions, and cross-file dependencies. This creates a structural knowledge graph.
By exposing this graph through an MCP (Model Context Protocol) server, the AI agent doesn't just 'search' for a similar string—it can actually 'navigate' the codebase. For example, it can find a function definition and then immediately see every other file that imports or calls that specific function, regardless of whether the keywords match.
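To make the idea concrete, here is a minimal sketch of the structural side, using Python's stdlib `ast` module as a stand-in for tree-sitter (a real indexer parses many languages and persists the graph; the module names and sources below are invented for illustration):

```python
import ast
from collections import defaultdict

def build_symbol_graph(sources: dict) -> dict:
    """Map each module to the functions it defines and the names it calls.

    `sources` maps module names to source text. A tree-sitter-based indexer
    would do this per language and persist the graph; this is a toy version.
    """
    defs = {}                  # function name -> defining module
    calls = defaultdict(set)   # module -> set of called names

    for module, src in sources.items():
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef):
                defs[node.name] = module
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                calls[module].add(node.func.id)

    # Edge (a -> b): module a calls a function defined in module b
    edges = {
        (caller, defs[name])
        for caller, names in calls.items()
        for name in names
        if name in defs and defs[name] != caller
    }
    return {"defs": defs, "edges": edges}

graph = build_symbol_graph({
    "auth": "def login(user):\n    return user\n",
    "app": "from auth import login\n\ndef main():\n    login('bob')\n",
})
```

An agent querying this graph can go straight from `login` to every module that calls it, which is exactly the cross-file navigation flat embeddings miss.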
I actually built an open-source tool called Octocode to handle this. It's basically my attempt to bring the kind of deep codebase indexing that Cursor does, but in a fully open-source, local-first way. I use it daily in my own workflow and it's been a game-changer for how my agents handle complex refactoring across multiple files.
It's written in Rust for speed and includes a built-in MCP server so you can plug it directly into Claude or Cursor.
You can check it out here: https://github.com/Muvon/octocode
Potential-Leg-639@reddit
Great stuff, thanks for that! Crazy times we are living in right now :) Is it also working with Opencode?
donhardman88@reddit
Thanks! And yeah, it should absolutely work with OpenCode. The goal was to make it compatible with any client that can handle the MCP protocol, so it should be plug-and-play.
To be honest, I'm not much of a marketing person – I just keep building. I've been a dev for 20+ years and have seen the industry shift a dozen times, but right now I'm obsessed with the RAG/agentic side of things. My main focus is just figuring out how to get these agents to actually work reliably, faster, and cheaper.
It's a wild time to be building, but getting the retrieval precision right is where the real magic happens. Happy to hear if you try it out with OpenCode!
P.S. Just as a side note – I actually wrote zero lines of code for the CLI tool itself. I built the entire thing with AI while acting purely as the architect and key decision maker. It's a pretty wild feeling to move from writing every character to just directing the logic, but it's exactly why I'm so focused on the retrieval layer now – the AI is only as good as the context you give it.
ThankYouOle@reddit
question: this sounds interesting, I am gonna try it this afternoon.
But how do I connect it with OpenCode?
Octocode will open the MCP server, and I will register it in OpenCode as an MCP server to look for.
But how will OpenCode know that certain actions need octocode? Is there any specific command to make it use octocode?
donhardman88@reddit
To get it working properly, there are a few things you need to do before the MCP server becomes useful:
1. The 'Engine' Setup (API Keys & Local Models)
MCP just connects the AI to the tool, but Octocode needs its own environment. You can either set your `VOYAGE_API_KEY` in your environment variables for high-performance cloud embeddings, or you can run embeddings locally (Octocode supports local-first options for privacy and speed). Just make sure your preferred embedding provider is configured before starting the server.
2. Pre-Indexing (Don't skip this)
While the MCP server can trigger indexing, it's much slower and can timeout the LLM's request. Run `octocode index` manually in your project root first. This builds the AST and knowledge graph locally. Once that's done, the MCP tools (`semantic_search`, `graphrag`) will respond instantly.
3. Configuring the MCP Connection
If you're using OpenCode or Claude Desktop, the best way to ensure the AI is looking at the right codebase is to pass the path directly in the command: `octocode mcp --path /absolute/path/to/your/project` (If you're already running the agent inside the project workdir, you can omit the path, but explicitly defining it in the config is the safest bet.)
Once those are set, you don't need a special command—just ask the AI to 'find the logic for X' or 'map the dependencies of Y,' and it'll trigger the Octocode tools automatically.
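For the config side, Claude Desktop registers MCP servers under an `mcpServers` block shaped like the sketch below; OpenCode's config file uses its own schema, so check its docs for the exact key names (the path is obviously a placeholder):

```json
{
  "mcpServers": {
    "octocode": {
      "command": "octocode",
      "args": ["mcp", "--path", "/absolute/path/to/your/project"]
    }
  }
}
```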
Feel free to ping me if any issues.
ThankYouOle@reddit
I see, so it only works with Voyage, OpenRouter, and a local server? Or did I miss something?
for local, do you think https://ollama.com/library/gemma4:e2b will work fine?
donhardman88@reddit
Ollama works great. Just use the `ollama:model_name` prefix in your config for both the LLM and embeddings. It'll route everything locally. If you run locally, the prefix can be `local:..`. Do you run it locally or via the ollama.com API?
Potential-Leg-639@reddit
No worries, I'm also 20+ years in dev and also stopped coding myself; agents can do it faster and better in the end most of the time, let's be honest :) It still takes some time for a lot of people to realize that, but hey - it's just a question of time. And no - not all devs will be replaced by AI, hehe.
donhardman88@reddit
Exactly. It's a massive shift in the industry. I think the real realization is that we aren't being replaced, but our roles are evolving.
The 'coding' part is becoming a commodity, but the ability to architect, validate, and steer the AI is where the actual value is now. We're still required at the critical points – the 'last mile' of logic and the high-level decision making – but the skill set is just shifting. It's less about knowing the syntax and more about knowing how to structure the problem so the agent can actually solve it.
It's a wild transition to be part of after 20 years in the game, but it's definitely the most exciting time to be an architect.
Potential-Leg-639@reddit
Installation in windows failed, in WSL it was working at the end.
donhardman88@reddit
Thanks for the heads-up. I'm on OSX/Linux only, so I haven't tested Windows native much. I'll look into this deeply to make sure we get a smoother experience going for Windows.
IrisColt@reddit
Thanks for the link!
onlymagik@reddit
Would you mind giving some pros/cons of your repo vs https://github.com/DeusData/codebase-memory-mcp?
donhardman88@reddit
I just did a quick dive into that repo. It's a really impressive piece of engineering—the indexing speed and the structural mapping are top-tier.
In terms of a comparison: they're both using tree-sitter for AST parsing to build a knowledge graph, but they're solving for different things. Codebase-Memory-MCP is essentially a deterministic structural index. It's perfect for 'Where is this called?' or 'Show me the call chain.' It's basically a super-powered LSP.
Octocode is a hybrid. We do the structural mapping, but we layer semantic search (embeddings) on top of it.
The difference is in the query. If you know the exact function name, both tools win. But if you're asking 'How is the auth flow handled?' or 'Where is the logic for X?', a deterministic graph alone struggles because it doesn't understand the meaning of the question. Octocode uses semantic search to find the right starting point in the graph and then uses the structural relationships to provide the full context.
So, if you just need a fast, deterministic map of your symbols, that repo is awesome. If you want an agent that can actually 'reason' through the codebase using natural language, that's where Octocode fits in.
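The "semantic seed, structural expansion" pattern is easy to sketch: find the best-matching symbol with embeddings, then walk the graph edges outward from it. A toy version (the call-graph symbols below are invented; real edges would come from the indexer):

```python
from collections import deque

def expand_context(seed: str, edges: dict, hops: int = 2) -> list:
    """From a semantically matched symbol, collect structurally related
    symbols within `hops` edges (calls/imports), breadth-first."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    order = [seed]
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # stop expanding past the hop budget
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return order

# Hypothetical call graph: handle_login calls verify_token, which calls decode_jwt
call_graph = {
    "handle_login": ["verify_token", "session_store"],
    "verify_token": ["decode_jwt"],
}
context = expand_context("handle_login", call_graph)
```

The semantic match only has to land near the right symbol; the graph walk then pulls in the rest of the relevant context deterministically.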
IrisColt@reddit
Thanks!
snugglezone@reddit
Is this supposed to compete with Serena?
donhardman88@reddit
I haven't used Serena extensively, but from a quick look, the architectures are quite different. Serena seems to leverage LSP/JetBrains for symbolic understanding, whereas Octocode focuses on semantic retrieval via embeddings and hybrid search.
It's the difference between "finding a specific symbol" and "finding code that does X." Both are valuable, just different ways of getting context to the agent.
stormy1one@reddit
The probe guys compare their solution to yours here - https://github.com/probelabs/probe — seems like either they got it wrong or you changed your architecture to be more like theirs ?
donhardman88@reddit
I can't tell if this is even Octocode they're talking about (no link), but it feels like a case of mistaken identity.
Their feature list is inaccurate—we've got hybrid search, custom configs, and a much wider range of embeddings. I'll admit the bootstrap is slow, but the "10 tools vs 1" logic is flawed. Semantic search is for discovery, ASTs are for precision. Depending on the task, a simple AST grep can be more powerful, but for large-scale discovery, you need the semantic side.
swfsql@reddit
This kind of thing should be massive if models were trained for it
donhardman88@reddit
The problem is that training is static, but knowledge is dynamic. Even if models were perfectly trained for this, they'd be outdated the moment a new framework version drops or a project's architecture shifts. In a world where knowledge evolves faster than training cycles, you can't rely on weights for 'ground truth.' You need a high-precision RAG layer to provide the current state of the world in real-time. That's why the focus should be on the retrieval quality – if the RAG is precise, the model doesn't need to 'know' everything; it just needs to know how to use the truth you're giving it.
swfsql@reddit
I intended to say that a model trained with that tool, say, on the entirety of code-forces or something, may learn to more efficiently use the tool itself. It may also more efficiently learn that frameworks can evolve and so on. Training with tool use is standard practice for model post-training, I imagine.
LaCipe@reddit
I know EXACTLY what you mean...I wonder if hooks for claude can make it force to use this stuff...hmm
teh_spazz@reddit
How does it compare to Context7?
donhardman88@reddit
They're actually solving two different sides of the same problem. Context7 is an MCP server for official documentation—it's amazing for when the agent needs to know the latest API specs or how a specific library is supposed to work so it doesn't hallucinate outdated code.
Octocode is for the actual codebase you're building. It uses AST parsing to map your specific project's internal structure, dependencies, and logic.
Basically, Context7 gives the agent the 'official manual' for the tools you're using, and Octocode gives it the 'blueprint' of how you've actually put those tools together in your app. You'd actually want to use both: one for the external docs and one for the internal code. That's how you get an agent that actually understands both the library and your implementation.
teh_spazz@reddit
So far so good.
Now explain to my smooth brain why octocode wouldn’t work as RAG? Something tells me it would with the right data pumped in, but I’m not sure I’m smart enough to verbalize how it would work.
donhardman88@reddit
Haha, no 'smooth brain' here—you're actually hitting on the exact reason why I built this.
The short answer is: Octocode is a form of RAG, but it's way more than just a vector store. Standard RAG is 'flat'—it just finds similar-sounding text. Octocode is a hybrid system.
First, we don't just chunk text; we use AST parsing to extract proper code blocks and then describe them, so the retrieval is actually context-aware. Then we layer in Hybrid Search (combining semantic and keyword) and GraphRAG to capture the actual relationships between symbols.
The best part is that it's completely local-first. You can use any embedding or LLM model you want, which is critical when you're dealing with private data and can't just upload your whole repo to the cloud. And while I focus on code, it actually works on any files (like .md or docs), so you can build a structural RAG over pretty much any knowledge base.
It's basically RAG on steroids—using multiple tuning layers to make sure the agent gets the actual ground truth, not just a 'similar' chunk of code.
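The hybrid-search blend described above is roughly "keyword hit-rate plus embedding similarity". A toy sketch, with bag-of-words cosine standing in for real embeddings (the documents and the 50/50 weighting are invented; production systems use BM25 and learned vectors):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over bag-of-words vectors (toy embedding stand-in)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.5) -> float:
    """Blend exact keyword overlap with 'semantic' similarity."""
    q, d = query.lower().split(), doc.lower().split()
    keyword = len(set(q) & set(d)) / len(set(q))  # fraction of query terms hit
    semantic = cosine(Counter(q), Counter(d))
    return alpha * keyword + (1 - alpha) * semantic

docs = {
    "auth.py": "def verify token jwt decode signature",
    "ui.py": "render button click handler css",
}
best = max(docs, key=lambda name: hybrid_score("jwt token verify", docs[name]))
```

The keyword term rewards exact identifier matches, while the semantic term catches paraphrased queries; blending the two is what keeps "How is the auth flow handled?" from missing files that never mention the word "flow".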
teh_spazz@reddit
That's what it felt like when I was reading the repo. Thanks, man. I'm gonna give this a go.
donhardman88@reddit
Glad to hear it! I'm still polishing the README, so bear with me on the docs. Let me know how it goes or if you find any bugs!
cleverusernametry@reddit
No. Boris Cherny himself says "agentic search" - simple grep and glob outperform RAG for coding. That's all that Claude Code uses.
Unless you're a poor engineer or vibe coder, your codebase will follow good/standard folder structures for your language + have good docs. That's all the model needs to get the right context.
IrisColt@reddit
Today I learned that my Diogenes syndrome-esque folders mean that I'm a poor engineer, whatever that means, heh
donhardman88@reddit
The problem with the 'just grep it' argument is that it assumes a human is steering the ship. If you're the one telling the agent exactly which symbols to grep on every single request, then you're the one doing the work, not the AI.
I've spent a lot of time with Claude Code, and the 'grep loop' is exactly where it falls apart. The agent greps, gets 20 results, reads 5 of them, gets overwhelmed by the noise, and then starts hallucinating or ignoring the original intent just to finish the task. It's a classic case of 'context collapse.'
Grep doesn't avoid the context window problem; it actually makes it worse by filling the prompt with irrelevant boilerplate.
If you have a small project and you're guiding the AI manually, sure, grep is fine. But if you want an agent that can actually handle a complex refactor autonomously, it needs to know the structural relationships before it starts reading files. Otherwise, you're just paying for a very expensive loop of 'grep -> read -> hallucinate -> repeat.'
EugeneSpaceman@reddit
Very interested in your project - I hope to try it out.
But I haven’t experienced what you’re suggesting with Claude Code (at least using Anthropic models). They get the job done quite quickly using grep. Are you suggesting they would perform even better with a codebase search tool such as Octocode?
Or does this hallucination problem in Claude Code only affect smaller models?
I guess it’s just an MCP server so would be easy to install in CC and compare!
donhardman88@reddit
It's definitely a scale and complexity thing. For a lot of tasks, Claude is smart enough to 'brute force' its way through grep results, especially if the project is well-structured. If it's working for you right now, that's awesome.
The 'hallucination' or 'context collapse' I mentioned usually kicks in when you hit a certain level of complexity—like a massive refactor where you need to trace a symbol through 10+ files. That's when the agent starts getting overwhelmed by the noise of 20 different grep matches and starts taking shortcuts or missing a critical edge case.
To answer your question: Yes, I'd argue they perform significantly better with Octocode, but the difference is most obvious in those 'hard' cases. Instead of the agent guessing which files to read, it has a structural map. It doesn't just find the word; it finds the actual relationship.
It's not about the model size (though bigger models handle the noise better), it's about the quality of the context. Grep gives the model a pile of snippets; Octocode gives it a blueprint.
Since it's just an MCP server, it's a quick install. I'd love to hear if you notice a difference in how it handles the more complex parts of your repo!
EugeneSpaceman@reddit
That makes sense, thanks. I guess most of the time I’m performing small, easier tasks - but I can imagine running into the limits of grep.
gyzerok@reddit
Can you elaborate your specific setup?
donhardman88@reddit
My setup is really focused on minimizing the 'noise' that usually kills agentic coding. I've tried almost everything—Claude, Codex, and various other tools with open-source models—and the biggest realization was that most clients aren't actually focused on retrieval quality. They just give you a basic search and expect the LLM to figure it out, but you really have to tune the retrieval layer to get a professional outcome.
Currently, I'm mostly using GLM-5 with MiniMax (I actually have a sub on ollama.com for this) and I'm really happy with the performance. The core driver for my workflow is a semantic code indexer called Octocode—it's the engine that allows me to find the root cause of a bug almost instantly and precisely follow the dependency chain to fix it.
I've integrated this into Octomind, which is where I've spent a lot of time heavily tuning the retrieval to make sure the agent doesn't get lost in irrelevant files.
If you're building your own setup, my advice is to focus less on the model and more on the 'precision' of the context you're feeding it. Once you get the retrieval right (using AST graphs), even mid-sized local models start performing like giants.
No_Hedgehog_7563@reddit
Very interesting and what you say makes sense. I’ll take a look later on your repo.
donhardman88@reddit
Awesome, hope you find it useful! Feel free to ping me in the issues if you have any feedback. I use it daily for my own dev work, so I'm always looking for ways to make it better.
No_Hedgehog_7563@reddit
And this tool is fully local I suppose? As in it doesn't relay any of the (indexed) code to your company(?). I'm asking because I see the repo is from an organization (your company I guess) and I suppose at some point you'd want to make money from it?
Sorry if I worded this out in a weird way, I think the idea is amazing either way, just want to have some assurance before trying it in some corporate repo :D
donhardman88@reddit
No worries at all – it's a totally fair question, especially when you're dealing with corporate repos.
To be 100% clear: Yes, it's fully local. Octocode does not send your code, your index, or your queries back to us. Everything stays on your machine.
Regarding the 'company' part – Muvon is really just two of us. We aren't some big corporate entity; we're just a couple of devs who are obsessed with AI agents and performance. We're using these tools to run our own products, and we're sharing Octocode because it's the piece of the puzzle we wish existed for us.
That's why it's open-source under Apache 2.0. We're not looking to build a 'data-collection' business. Our focus is on the tech and the quality of the output. If we ever build a paid 'pro' version or a hosted service, that'll be a separate, optional thing, but the core engine will always be local and transparent.
You're safe to run it in your corporate repo. We're just builders building tools for other builders like us. 🐙
No_Hedgehog_7563@reddit
Hey, many thanks for the detailed responses. Wanted to try octocode right now and I see it must be configured with a Voyage API key. Is it not possible to use a local encoder/reranker? Maybe I'm missing something from the docs
donhardman88@reddit
No problem at all! You're not missing anything—the docs just highlight Voyage because it's the easiest 'zero-config' way to get started, and their 200M free tokens usually cover most people's needs.
But yes, Octocode is built to be fully local. You can absolutely use local encoders and rerankers. For example, you can swap in:
- `fastembed:all-MiniLM-L6-v2` (super fast, low overhead)
- `huggingface:sentence-transformers/all-mpnet-base-v2`
- `huggingface:microsoft/codebert-base` (great for code-specific semantics)

If you're dealing with sensitive data or just want to keep everything on your own hardware, just switch the provider in your config to one of those. I personally stick with Voyage for the convenience, but the local-first path is fully supported.
Let me know if you hit any issues getting the local models spun up!
twanz18@reddit
Both are solid for agentic coding. In my experience Qwen3.5 handles longer context tasks better when you are running multi-step workflows. Gemma4 is faster on shorter prompts though. If you are running these locally and want to connect them to something like Claude Code or other agents remotely, check out OpenACP. It lets you bridge any coding agent to Telegram or Discord so you can trigger tasks from your phone. Open source, self-hosted. Full disclosure: I work on it.
aldegr@reddit
Assuming you're using the latest llama.cpp, try testing Gemma 4 with https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja.
garg-aayush@reddit (OP)
No, I ran these tests on Friday (April 3rd) morning. Thanks, I will try to run them again with this chat template.
ResearchCrafty1804@reddit
Please update us with your findings, if the latest llama.cpp and chat template make a difference to Gemma4 in local agentic coding
IrisColt@reddit
RemindMe! 7 days
RemindMeBot@reddit
I will be messaging you in 7 days on 2026-04-13 08:57:43 UTC to remind you of this link
garg-aayush@reddit (OP)
Weekend is over and the weekly grind starts again :|
Though I will try to run them some time tonight or tomorrow, and I will make sure to update here.
Intelligent_Lab1491@reddit
How do I use this?
xanduonc@reddit
`--chat-template-file "/models/google-gemma-4-31B-it-interleaved.jinja"`
IrisColt@reddit
Thanks!!!
garg-aayush@reddit (OP)
You need to provide the chat template as one of the flags during the server launch.
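Assuming a recent llama.cpp build, the launch looks roughly like this (the model filename is a placeholder; `--jinja` enables Jinja chat-template handling in llama-server):

```shell
llama-server -m gemma4-31b-q4_k_m.gguf --jinja \
  --chat-template-file google-gemma-4-31B-it-interleaved.jinja
```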
riceinmybelly@reddit
Are llama.cpp templates also working in LM Studio?
aldegr@reddit
This requires a particular implementation in llama.cpp to build the model turn, so I highly doubt it will work with LM Studio.
riceinmybelly@reddit
Thx
grumd@reddit
What does this template do?
aldegr@reddit
The template preserves reasoning between tool calls, in adherence to the [Gemma 4 prompt formatting guide](https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#managing-thought-context). The original templates strip this reasoning, contradicting the guide. It does require a recent llama.cpp and an agent harness that sends back reasoning traces. Pi and OpenCode should work out of the box.
grumd@reddit
Oh that sounds interesting. Should the same template be used for 26B-A4B?
aldegr@reddit
Yes, all the models use the same template.
mapsbymax@reddit
Great comparison. One thing I've noticed running agentic tasks locally - the MoE speed advantage is deceptive for agent loops. The 3x faster generation looks great on paper, but when the model needs 2-3 retry cycles because it got something wrong, you end up slower than the dense model that nailed it first try.
The TDD observation is really interesting too. I've tried multiple local models and none of them actually do proper red-green-refactor even when explicitly asked. They all write the implementation and tests together. Would love to see someone crack that with better system prompts or fine-tuning.
For anyone on the fence - if your agentic workflow has good error recovery (automatic test runs, lint feedback loops), the MoE models become more competitive since each retry is cheap. If you're doing fire-and-forget single shots, dense Qwen 27B is hard to beat.
garg-aayush@reddit (OP)
I also have a hunch that if we can build good feedback loops and automatic test runs, MoEs will start performing on par with dense models. I remember reading a researcher's tweet where they got local models, MoE or dense alike, to work better with an automatic feedback loop that allowed a maximum of 2 tries per problem, with the error becoming, in some smart way, the feedback signal for the 2nd try.
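A minimal sketch of that bounded feedback loop, with `generate` and `run_tests` as placeholders for the model call and the test runner (everything here is invented for illustration, not any specific harness's API):

```python
def run_with_feedback(generate, run_tests, prompt: str, max_tries: int = 2):
    """Call the model, run the tests, and on failure feed the error back
    as extra context for one more attempt (capped at max_tries)."""
    feedback = ""
    for attempt in range(1, max_tries + 1):
        code = generate(prompt + feedback)
        ok, error = run_tests(code)
        if ok:
            return code, attempt
        feedback = f"\n\nPrevious attempt failed with:\n{error}\nFix and retry."
    return None, max_tries

# Toy stand-ins: the first attempt 'fails'; the retry sees the error and succeeds.
def fake_generate(prompt):
    return "fixed" if "failed" in prompt else "buggy"

def fake_tests(code):
    return (True, "") if code == "fixed" else (False, "AssertionError: expected 3, got 2")

result, attempts = run_with_feedback(fake_generate, fake_tests, "write add()")
```

Capping retries keeps a fast-but-sloppy MoE from burning its speed advantage in an infinite loop, while still giving it one cheap shot at self-correction.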
Pjbiii@reddit
I tried a coding variant of Qwen3.5-31b and it was not quite as fast as Gemma4-26b-Q4_K_M, but they were both just as accurate for small coding/agent tasks and tool use.
kiwibonga@reddit
This post made me realize I dreamed about local model benchmarks last night. I don't remember any specifics but I was so excited about this graph with red and green balls.
gyzerok@reddit
Maybe you were dreaming about balls, not benchmarks?
2muchnet42day@reddit
Why not both?
gyzerok@reddit
Both balls?
Voxandr@reddit
These are my findings too. Qwen is still better than Gemma at agentic instruction following.
Constant-Bonus-7168@reddit
Qwen3.5-27B handles multi-turn corrections cleanly — it can accept feedback and adjust without hallucinating. That's more valuable for agentic work than raw single-shot accuracy.
D2OQZG8l5BI1S06@reddit
Gemma 4 prompt processing is almost twice as fast as Qwen for me (75 vs 40 t/s); that helps a lot too.
garg-aayush@reddit (OP)
That is interesting. What is your setup that you are able to get 75t/s for Gemma but 40 for Qwen?
D2OQZG8l5BI1S06@reddit
Latest llama.cpp with -cmoe, so all MoE weights on CPU
defervenkat@reddit
I’m actually very impressed with Gemma4 and I’m running Q6 XL from unsloth. Upgraded from qwen 3.5.
Barry_22@reddit
So Qwen-3-27B is still a champ?
garg-aayush@reddit (OP)
Yes, I still feel Qwen3.5 is better for coding. Not just the results; the size and speed are also way better suited for 24GB cards.
danf0rth@reddit
I have a similar feeling. I found Qwen really more intelligent when writing code, e.g. it wrote ipynb files correctly, while Gemma 4 created a non-working notebook.
But on the other side, I use rtk to compress context a bit, and I found that with the same agents, Qwen ignores prefixing commands with rtk, while Gemma does it more consistently; it feels like Gemma follows instructions better in this case.
Also, I use HIP ROCm on Windows (7900XTX) and observed that Gemma has performance degradation: it prints the response, and every n seconds it slows down again and again until generation is really slow, like 4-5 tok/s. Don't know why. Latest versions of llama.cpp and the model.
garg-aayush@reddit (OP)
I also use rtk to compress the context. This is a very interesting observation. I never paid attention to comparing rtk usage b/w qwen and gemma.
How do you compare the rtk usage between them?
danf0rth@reddit
I just check what command it runs, you will see like `rtk ls ...`, `rtk grep ...`, etc. I thought that I put my rule in wrong directory while working with Qwen, forgot about it, and recently switched to Gemma 4, and noticed that before every command there is `rtk` prefix. You can also check `rtk gain` to observe how much you saved tokens.
ixdx@reddit
For 2x16GB GPUs, the Qwen3.5-27B-Q4_K_M/L is also better, as it fits within a 128K context. The Gemma-4-31B context takes up more VRAM.
mmontes11@reddit
Similar findings with an RTX PRO 4000 SFF, also 24GB:
https://github.com/mmontes11/llm-bench
vasimv@reddit
Did they actually test and debug the code they wrote? In my experiments writing simple android apps with local models, the most challenging part was debugging the code to ensure it met technical requirements.
MrMisterShin@reddit
I took a look at your link. Thanks for including the actual duration time in your analysis.
Tokens per second is not the full story, when you have models that “think extensively” or require more tool calls etc than other models to complete a task with good quality.
garg-aayush@reddit (OP)
Yup, I observed the same thing, especially with MoEs: they are blazing fast compared to dense ones, but they are more likely to use more API and tool calls, along with the occasional infinite-loop issue. They seem less precise for coding. Maybe there is a way to get around this with better prompting and mid-run feedback.
Rich_Artist_8327@reddit
Reddit is full of Qwen marketing posts.
garg-aayush@reddit (OP)
This is more of an appreciation post based on my experience over the last few days. :)
Rich_Artist_8327@reddit
I was supposed to write "Reddit is full of Qwen marketing department people posting marketing posts"
Eyelbee@reddit
How do you get 130k context?
garg-aayush@reddit (OP)
I used q8 quantization for the KV cache. This fits well on a 4090. Actually, I have also seen folks use q8 for K and turbo3 for V, which should help you fit even more context.
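For reference, in llama.cpp the KV-cache types are set per tensor at launch; a q8 setup looks roughly like this (the model filename is a placeholder):

```shell
# q8_0 KV cache roughly halves cache VRAM vs the default f16,
# which is what lets 128K+ context fit alongside the weights on a 24GB card.
llama-server -m qwen3.5-27b-q4_k_m.gguf -c 131072 -ctk q8_0 -ctv q8_0 -ngl 99
```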
DaLyon92x@reddit
Interesting comparison. I run agentic coding workflows daily with Claude through CLI tooling and the retrieval layer comment is spot on - the model matters less than how much context you feed it and how well your agent recovers from bad outputs.
For local models specifically, the thing I'd watch for isn't just single-shot accuracy but how they handle multi-turn correction loops. A model that produces slightly worse code but accepts corrections cleanly is more useful in an agent than one that nails it first try but hallucinates when you push back. anyone tested that dimension?
Public-Thanks7567@reddit
For programming tasks, the Q5, Q6, or Q8 models are preferable—especially when the number of parameters is limited, i.e., specifically for machine learning applications. I've read the article. Thanks! If it's possible to repeat the same test—but with higher quantization—that would be great!