gemma-4-26B-A4B with my coding agent Kon
Posted by Weird_Search_4723@reddit | LocalLLaMA | 37 comments
Wanted to share my coding agent, which has been working great with these local models for simple tasks. https://github.com/0xku/kon
It takes lots of inspiration from pi (simple harness), opencode (sparing little UI real estate for tool calls - mostly), amp code (/handoff), and claude code of course.
I hope the community finds it useful. It should check a lot of boxes:
- small system prompt, under 270 tokens; you can change this as well
- no telemetry
- works without any hassle with all the best local models, tested with zai-org/glm-4.7-flash, unsloth/Qwen3.5-27B-GGUF and unsloth/gemma-4-26B-A4B-it-GGUF
- works with most popular providers like openai, anthropic, copilot, azure, zai etc (anything that's compatible with openai/anthropic apis)
- simple codebase (<150 files)
It's not just a toy implementation but a full-fledged coding agent now (almost). All the common options are supported: @ attachments, / commands, AGENTS.md, skills, compaction, forking (/handoff), exports, resuming sessions, model switching ...
Take a look at the https://github.com/0xku/kon/blob/main/README.md for all the features.
All the local models were tested with llama-server build b8740 on my 3090 - see https://github.com/0xku/kon/blob/main/docs/local-models.md for more details.
Nyghtbynger@reddit
Love the Polish horse name by the way
Weird_Search_4723@reddit (OP)
I'll gather my thoughts and do want to capture it in a blog post, but I'm not sure exactly where. So yes, dunno when.
openSourcerer9000@reddit
In the age of fatberg codebases, this is a gift. If minimax/qwen 397 don't get hung up on these tool calls and it can do small things reliably, I think I just found my coding harness.
A simple guide on plugging in additional mcps would be helpful
Weird_Search_4723@reddit (OP)
I've never felt the need for MCPs; I've always created CLIs and either added them directly to my system prompt (which you can completely customise) or exposed them via skills
I don't mind adding support for MCP though, please create an issue
Nyghtbynger@reddit
I agree with you. I never understood the use of MCPs. Like a CLI returns a value after execution, that should be enough right ? Since the model waits for tool completion anyway
Do you have an example of integrating a new tool? I'd be really interested since I'm a bit of a noob, and my first hunch would be to directly integrate the class into the program
bigh-aus@reddit
I’m a big fan of CLIs for tools too. I’m experimenting with having the CLI self-describe too. E.g. `cli skill list` and `cli skill describe` spit out JSON - heck, even include the md etc.
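A minimal sketch of that self-describing idea (all names here are hypothetical, not this commenter's actual CLI): the tool exposes `skill list` and `skill describe` subcommands that emit JSON an agent can read to discover the tool's own capabilities.

```python
# Hypothetical self-describing CLI sketch: `mytool skill list` and
# `mytool skill describe <name>` print JSON so an agent can discover
# the tool's capabilities at runtime.
import json
import sys

SKILLS = {
    "word_count": {"description": "Count words in stdin", "args": []},
    "grep_todo": {"description": "List TODO comments", "args": ["path"]},
}

def main(argv):
    if argv[:2] == ["skill", "list"]:
        return json.dumps(sorted(SKILLS))
    if len(argv) == 3 and argv[:2] == ["skill", "describe"] and argv[2] in SKILLS:
        return json.dumps(SKILLS[argv[2]])
    return json.dumps({"error": "usage: mytool skill (list | describe NAME)"})

if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

Since the descriptions are plain JSON on stdout, they can be pasted into a system prompt or consumed by a skill without any MCP layer.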
Weird_Search_4723@reddit (OP)
Look in the tools directory - all tool implementations are standalone files that implement an interface. Then just register it in the tool registry. Quite straightforward.
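That pattern can be sketched roughly like this (a Python stand-in; `Tool`, `register`, and `dispatch` are illustrative names, not kon's actual interface):

```python
# Illustrative tool-registry pattern: each tool is a standalone unit
# implementing one interface, then registered centrally so the agent
# loop can dispatch model-emitted tool calls by name.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[dict], str]  # takes parsed args, returns plain text

REGISTRY: dict[str, Tool] = {}

def register(tool: Tool) -> None:
    REGISTRY[tool.name] = tool

# A new tool is just another file that calls register() at import time.
register(Tool(
    name="word_count",
    description="Count words in a text argument",
    run=lambda args: str(len(args["text"].split())),
))

def dispatch(name: str, args: dict) -> str:
    """What the agent loop does when the model emits a tool call."""
    return REGISTRY[name].run(args)
```

Adding a tool then means one new file plus one `register()` call, with no changes to the core loop.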
Nyghtbynger@reddit
Okay, I will add them to the list then! It's assumed we modify the code ourselves then
openSourcerer9000@reddit
Mm That's interesting, yeah, everything's been pointing to CLI lately, since models are already used to that in their training data. Context7 at least would probably be a good one though
Weird_Search_4723@reddit (OP)
The readme for https://github.com/upstash/context7 says:
Works in two modes:
`ctx7` CLI commands (no MCP required)
Have you tried the cli route?
openSourcerer9000@reddit
Aight I'm sold. I've always been skeptical about mcp, anthropics "standards" are always just throwing a wrench in the system of an already solved problem to split the community. Speaking of which, does this use xml or json tool calling?
Weird_Search_4723@reddit (OP)
Openai and anthropic SDKs expect the final tool calling payload in json format but the actual result is pure text. Not sure if this is what you wanted to know.
If this is about the format in which you should return results from a cli then it can be both - you'll have to serialise it anyway. Try to design it like other unix tools so that piping works nicely.
pardeike@reddit
It would be a dream if you could build in what I just built for opencode: a second session that oversees the main session and keeps it running until it fulfills a 'done' criterion. Let me know if you're interested and I'll give you details or send a minimal PR
tarruda@reddit
With pi I could never get gemma 4 26 to think
Weird_Search_4723@reddit (OP)
Posts in this channel are what helped me figure out why. You can raise a PR in pi - this should help:
https://github.com/0xku/kon/commit/baf1d65bb35d8d1b7de68b90460c6b8229a9d36e
tarruda@reddit
If I understood correctly, `<|think|>You are helpful` must be prepended to the system prompt. This seems like something that should be handled by the chat template whenever reasoning is enabled
aldegr@reddit
The new template uses `<|think|>\n` and maybe I’m crazy but it seems 26b can now think…
Weird_Search_4723@reddit (OP)
yup
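For readers skimming the thread: the workaround discussed above boils down to prepending the thinking tag to the system prompt when reasoning is enabled. A simplified, illustrative sketch (the real change is in the linked commit):

```python
# Simplified sketch of the gemma "make it think" workaround discussed
# above: prepend the <|think|> tag to the system prompt so the model
# opens a reasoning block. See the linked commit for the real change.
THINK_TAG = "<|think|>"

def build_system_prompt(base: str, reasoning: bool = True) -> str:
    return THINK_TAG + base if reasoning else base
```

Ideally this lives in the chat template rather than the harness, which is the point tarruda raises.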
wiltors42@reddit
270 tokens system prompt is extremely low. Opencode is regularly doing 10k just on the first message.
Weird_Search_4723@reddit (OP)
Yup, and you don't need that much. These models have been reinforced on pretty much all the info that went into the system prompts of previous generations of models. You don't need to keep telling them "you are a ... and you can ... and you should ..." - they know this now.
Just try it for some time and you'll see.
sine120@reddit
I'm looking for agents to replace OpenCode's massive 10k system prompt. My PP speeds are slow so every time context is modified I'm waiting 20 seconds for the next response to even start. Pi is interesting to me, and I'll have to give this a shot.
bjodah@reddit
Thank you for sharing this (and thank you for working on it multiple weeks before doing so, which cannot be said about some other redditors).
I'm curious about how your process has been building it. Did you use kon itself? What models have been your favorites working on it?
Weird_Search_4723@reddit (OP)
- implement some core tools, like read and grep
- get an llm provider working - I used openai (5.3-codex)
- implement the core loop
- start with simple append-only output in the ui, not a fully functioning TUI
This is enough to bootstrap kon using kon
I think when I reached this stage I started using kon to build kon. Now I only use kon, so it's been adding/modifying itself for a while now.
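The bootstrap steps above amount to a loop like this (a minimal Python sketch with invented names - real harnesses add streaming, UI events, and error handling):

```python
# Minimal agent core loop sketch (hypothetical message shapes, not
# kon's code): call the model, execute any tool calls it emits, feed
# results back, repeat until the model answers with plain text.
def agent_loop(llm, tools, messages):
    while True:
        reply = llm(messages)            # provider call (openai-style dict)
        messages.append(reply)
        if not reply.get("tool_calls"):  # no tool calls -> final answer
            return reply["content"]
        for call in reply["tool_calls"]:
            result = tools[call["name"]](call["args"])
            messages.append({"role": "tool",
                             "name": call["name"],
                             "content": result})
```

With read/grep tools and one working provider, this loop is already enough to let the agent work on its own codebase.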
I've implemented coding agents in the past, but only the core loop and tools, never all the way up to the UI, so there are insights that dawn on you only if you go all the way, like:
- why storing the conversation in json format is easy but eventually a bad idea
- storing ui events along with tool calls, thinking and text blocks, so they can be reloaded when opening an older conversation (it's obvious to me now, but this bit was a surprise when I was implementing it)
- the fact that all the openai api compatible implementations could have little quirks that need to be addressed before you truly get them working (took some time to figure out how to get thinking working in gemma 4)
gpt-5.3-codex and gpt-5.4 with the plus plan have been my workhorses. I have a glm lite coding plan as well but it's been very unreliable these days.
i'm going to implement subagents next (don't need them for paid models, in fact i hate the overuse of subagents by opus), mainly to use local llms for all the grepping/reading to locate code, aka the "codebase-search" subagent to help save tokens burnt by the larger model and help with weekly limits.
bjodah@reddit
Thank you for sharing. Interesting to see that you're also looking into a hybrid cloud/local approach (to save some ingestion costs?). I have tried using Nemotron-3-Nano-30B for the initial code-grepping phase (since it's so vRAM-efficient with kv-cache), but I think I'm leaning towards trying Qwen3.5 instead (though I might have to use 9B to get a large kv-cache window), simply because I find the Qwen3.5 models to be more reliable tool callers (constructing more intelligent grep arguments etc.). (I too only have a single 3090.)
Weird_Search_4723@reddit (OP)
I actually want to fine tune the 9b model just like windsurf folks: https://cognition.ai/blog/swe-grep
I have used windsurf quite a lot and I've found the latest version of their swe-grep models quite capable, so we should be able to push the 9b model quite a lot (that's the hope). Will report once I have tried it 🤞
k_means_clusterfuck@reddit
👌kon
jacek2023@reddit
koń means horse in Polish ;)
srigi@reddit
Your harness should be called Bober and there should be a slash command /kurwa
Weird_Search_4723@reddit (OP)
Quite the polish crowd here 😏
iasad12@reddit
"Kon" in Urdu/Hindi is an interrogative pronoun which directly translates to "Who".
Weird_Search_4723@reddit (OP)
Didn't know that
Don't know but this image came to my mind based on this fact :)
Unlucky-Message8866@reddit
is there a reason you didn't use pi? looks like you reinvented the same thing
Weird_Search_4723@reddit (OP)
Many reasons:
- Reinventing is fun
- Just like Mario (author of Pi) has strong opinions about how to build a coding agent, so do I; I mostly agree with him, but there are some fundamental differences (that will eventually start showing up in kon)
- Pi is not a small project anymore, i know it supports customisations with extensions but there are some things you can't customise as far as i can tell, mainly how the tool calling and its results look (i have very strong opinions here as well)
Honestly i wanted something that is mine and something that i understand completely (codebase is small enough atm that i can still retain most of it in my head)
Unlucky-Message8866@reddit
fair points but you can pretty much do anything with pi, including tool overrides/custom rendering (i built half an ide on top of it)
Weird_Search_4723@reddit (OP)
Nice!
I'm a big fan of pi as well. I keep sending folks there – especially those who like to tinker with their agents rather than rely on an out-of-the-box curated experience like claude code's
jacek2023@reddit
I was testing various agents this week (like goose or llm) and this looks promising (small prompt is a big plus), will check it soon, thanks for sharing
Weird_Search_4723@reddit (OP)
can't wait to get some feedback :)