I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how
Posted by Glittering_Focus1538@reddit | LocalLLaMA | View on Reddit | 321 comments
I was frustrated that every coding agent (OpenCode, Cursor, Claude Code) assumes you're running GPT-5.4 or Claude Opus. If you try them with a local model like Gemma or Qwen they fall apart. I find that often tool calls fail, context overflows, multi-step tasks collapse.
So I built SmallCode. It's designed from the ground up for small local models.
The result: 87/100 benchmark tasks pass with a Gemma 4 model that only activates 4B parameters per token. OpenCode scores \~75% with 14B models. The harness does the heavy lifting, not the model size.
How it works (the tricks that make small models reliable):
- Compound tools: Instead of making the model chain 4 tool calls (find file → read file → edit file → verify), SmallCode gives it one tool that does all 4. Small models lose coherence after 3+ sequential calls. This cuts failures in half.
- Improvement loop: Every time the model writes code, SmallCode instantly compiles/lints it. If it fails, it feeds the errors back automatically. The model doesn't need to be smart enough to get it right first try — it just needs to fix errors when shown them.
- Decompose on failure: If the model fails the same thing twice, SmallCode stops retrying and instead breaks the problem into smaller pieces. "Fix this 200-line file" becomes "fix line 45 only."
- Escalation: If even decompose fails and you have a Claude/OpenAI key configured, it auto-escalates to the bigger model for just that one task. You stay local 95% of the time, cloud 5%.
- Token budgeting: Small models have 32k-256k context. SmallCode never dumps a whole file in. It summarizes, truncates, and manages every token so the model never sees "..." truncation in the middle of important code.
- Code graph: Instead of grep-searching your codebase, SmallCode indexes your code into a symbol graph (functions, classes, who-calls-what). When you ask "how does auth work," it walks the graph and returns just the relevant connected code — not 15 random file snippets.
What it looks like:
Full-screen terminal UI (like OpenCode/vim), scrollable chat, command palette with /, plugin system, persistent memory across sessions.
What it doesn't do:
- No LSP integration (yet)
- No multi-session (yet)
- No desktop app
- Doesn't compete with Claude Code for frontier model users
Install:
npm install -g smallcode
cd your-project
smallcode
Point it at LM Studio, Ollama, or any OpenAI-compatible endpoint.
MIT licensed, everything's on GitHub: https://github.com/Doorman11991/smallcode
Happy to answer questions about the architecture or benchmark methodology.
No-Entrepreneur-5099@reddit
Finally an actually interesting project posted here...
dtdisapointingresult@reddit
Doesn't work with Qwen 3.
I recommend adding a 'compatibilityMode: true/false' setting. When enabled, you must do this:
Every single model and backend supports this classic setup. In fact I recommend this should be your default behavior.
cosimoiaia@reddit
I HATE a terminal UI written in node.js, it should be outright impossible to even run it. For fuck sake if you're gonna slop out some half baked ideas at least have the decency to use something that works well for the environment it's intended.
I think everyone should start slopping WebApps written in c++ just to balance out the landscape.
jonas-reddit@reddit
Benchmaxxed or outright cheating.
apoptosist@reddit
It sounds like this is quite different in goals from little-coder, but it'd be awesome to have a pi-based alternative to this with the same goals. I don't trust OpenCode.
Wittica@reddit
i would of tried this if you spent the time to actually write this reddit post yourself
almbfsek@reddit
it's great. a simple pet peeve I have with all new harnesses are why build a custom UI that will be subpar instead of making it ACP-first (https://zed.dev/acp)
Glittering_Focus1538@reddit (OP)
You can now just use the ACP adapter if you'd like. --acp
dinerburgeryum@reddit
Failed for me in VSCode
Glittering_Focus1538@reddit (OP)
dm me with the error or open an issue on the git please
dinerburgeryum@reddit
Ha yeah I apologize that was completely unhelpful feedback. Yeah you got it I’ll spin up an issue on GitHub later.
almbfsek@reddit
nice
krzyk@reddit
Why use agents from IDE, it beats the purpose of them.
Your_Friendly_Nerd@reddit
What part of "Integrated Development Environment" says it agents shouldn't be built into them? It's where all the development happens anyways, it has git and diff integrations, and facilitates easier code review. And with agent integrations, it allows you to easily reference whatever you're currently working on, or send code snippets to the agent just by highlighting it.
StardockEngineer@reddit
That’s your environment. Automated headless, autonomous agents have no need for an IDE and thus should be disconnected. Almost none of my coding these days has any need for an ide.
I have nothing against ACP I quite like it. But “that’s where all the development happens” is a stretch these days.
Your_Friendly_Nerd@reddit
And that's your environment.
Also, are you using small models for development? Or mainly big cloud models? Because sure, when I use claude code I don't have to have it running in an IDE (though I still prefer that), but with small models my workflow is a lot more hands-on, which I also prefer in many situations since it allows me to still understand what every part of the code does.
StardockEngineer@reddit
I run Qwen 27b, 35b, Gemma 4 for small, MiniMax 2.7 and Deepseek Flash 4 in the mid range.
I just use Pi Coding Agent with my models. You can see what the code is right in the terminal if you want, or look at it during the PR on Github.
If you have Qwen reviewing Gemma and Gemma reviewing Qwen, have solid AGENTS.md and skills, the small models can do great and you be more free in turning them loose.
Your_Friendly_Nerd@reddit
And is your workflow hindered by Pi having ACP support?
StardockEngineer@reddit
No why would it be
Your_Friendly_Nerd@reddit
That‘s what started this discussion, someone asking for ACP support, someone else saying using agents in IDE’s is dumb anyways.
StardockEngineer@reddit
I wasn’t commenting on the acp part. Did you miss the part where I said I like ACP?
easyEggplant@reddit
> It has git and diff
If only there were some way to get access to git and diff on the terminal, that would be so powerful!
t4a8945@reddit
GENIUS IDEA! I'm getting Qwen 0.8B working on that A S A P.
Pleasant-Shallot-707@reddit
Mmkay
almbfsek@reddit
there is no consensus on such thing. you're just making itnip. for me it's better to be in an IDE, for other they might prefer otherwise
Glittering_Focus1538@reddit (OP)
for most users it's easier to tell them to just download vs code or their preferred ide and open a terminal inside it and start up SmallCode.
Androix777@reddit
Looks interesting, I hadn't heard of it before. If I'd known about it, I might have tried using it instead of a TUI. Though I'm not sure how well it fits complex multi-agent systems where you want to display a lot of information - at first glance it seems suitable only for the most basic cases, but I might be wrong.
Glittering_Focus1538@reddit (OP)
Fair point. The terminal UI isn't the end goal. I'll be honest, it's just the fastest way to get the agent loop working and testable without depending on any specific editor.
and honestly ACP is interesting and we're watching it. The reason we didn't start there: SmallCode's target audience is people running local models on consumer hardware who probably aren't using or better yet have never heard of zed. The terminal works everywhere.
almbfsek@reddit
fair enough. just to make it clear, it's originally developed by zed but there are mature plugins for jetbrains, vscode and many cli tools too.
migsperez@reddit
I tried. It didn't work well for me at all. I used gemma4:e4b via ollama. The responses weren't related to the prompt. I couldn't paste into the prompt area. I couldn't write me the one line, shift return didn't work it just sends the message. Lots of issues.
I wanted it to do well. Let me know when the next version is out.
Glittering_Focus1538@reddit (OP)
I fixed the prompting issue, can u tell me more about the messages not being related? I never had that issue during testing.
migsperez@reddit
I'll give it a go.
Feature request, could you get it working with OpenRouter. I'd like to try it with their cheapest models, get value and best results to the max.
Glittering_Focus1538@reddit (OP)
Should be fully working with openrouter? should work with any openapi key, can u send your logs over?
migsperez@reddit
Can you get your smallcode --version to work, it shows 0.1.0 but I see your making massive version progress in your commits.
migsperez@reddit
I've sent you a DM.
Adi4x4@reddit
what's the benchmark? 87/100 doesn't mean much without knowing if it's SWE-bench, your own eval set, or something else. the harness tricks sound reasonable but a custom benchmark + abliterated gemma is a red flag for cherry-picking.
Great and awesome project though, love seeing this!
Existing_Bet_350@reddit
impressive benchmark results, especially the efficiency gain over larger models. The architectural focus on small model constraints is the right approach, most agent frameworks waste context on verbose tool schemas that bloat token usage unnecessarily.
Curious about your approach to multi-step task state management. We've been tackling similar efficiency problems at Yellow Network for AI agent-to-agent transactions, state channels let agents settle micro-interactions without burning resources on full chain commits every step. Your context window optimization thinking maps well to how we handle settlement batching.
If you're thinking about adding economic primitives (pay-per-use APIs, agent commerce), Yellow SDK abstracts that layer so you can focus on the agent logic. Check out yellow.org (Yellow SDK)would be interesting to see SmallCode agents transacting autonomously. cheers
darkbit1001@reddit
Is there really a model behind small code? I like it! Thats exactly what we need. just let another model, preferably just a pile of deterministic neurons detect when code can be compacted, summarized, and therefore digested. I was going to dive into NLP . but seeing this give me hope to run 'taller' Models (4B+) on 3x Rk3588's with 16gb RAM ( it could fit a full 8b model, but get 3tk/s so lol ).
Thanks!
Glittering_Focus1538@reddit (OP)
Hey! There is only the LLM's you setup, the actual architecture around the LLM is mostly deterministic scaffolding orchestrating bounded stochastic calls. The model is only invoked at specific, declared points with typed inputs/outputs, retry budgets, and timeout guarantees. Everything else is compiled code.
OsmanthusBloom@reddit
Interesting tricks. Though I wish these could be integrated with existing tools like Pi or OpenCode instead of creating Yet Another Coding Agent. See for example little-coder which is nowadays a set of Pi extensions.
A standard benchmark instead of "87% of my self selected tasks" would be more convincing.
The README in the GitHub repo looks heavily AI generated. All the "Supported Models" are basically obsolete. This makes me wonder if this is a serious project or just AI slop.
KontoOficjalneMR@reddit
I mean all the improvements seem to come from improved tooling. But "the best way to improve LLM coding is to actually give them symbolic code and make them less english" s a tough sell
LetsGoBrandon4256@reddit
That's probably the biggest sign that the repo is AI slop.
_TheWolfOfWalmart_@reddit
r/LocalLLaMA of all places doesn't like that someone used AI to code?
Huh?
a_beautiful_rhind@reddit
The real "huh" is how everyone got oneshotted by an obvious self promotion post and an hours old project.
LetsGoBrandon4256@reddit
Sometimes it just happens.
ihppxng62020@reddit
jazir55@reddit
Always baffling to see AI related sub users bagging on people for AI related code, just pure irony.
HFRleto@reddit
ahaha, this section is now removed
LetsGoBrandon4256@reddit
Now OP just need to fix their prompt so they don't constantly switch between "we" and "I".
I gave them the benefit of doubt that they are using "we" because they are a small team or something. Nope, just a random bloke with a six-day-old GitHub account.
Icy-Pay7479@reddit
This is what i was looking for - an opinionated toolkit i can modify using an existing platform and ecosystem.
Glittering_Focus1538@reddit (OP)
I have a partial solution for you, https://github.com/Doorman11991/opencode-bonescript-backend
gives LLM's the tools they need to use my TS/NodeJS backend compiler to make more complete and secure backend code all from one .bone file.
yoomiii@reddit
boy, what a busy little bee you are!
a_beautiful_rhind@reddit
Hey man, opus works fast.
IronColumn@reddit
this is such a weird way to say it's a moe model while implying it's e4b
faldore@reddit
OpenCode is really unfriendly towards community contributions. They put up enough friction that I would just fork it instead of contributing.
Glittering_Focus1538@reddit (OP)
the reason for the fork is honestly because I had to re-arrange the guts, not just put on a shinier coat of paint.
gh0stwriter1234@reddit
Why not just make this as an extension for Pi.dev
Glittering_Focus1538@reddit (OP)
The governor doesn't exist elsewhere. The Bayesian tool scoring, hard fail detection, automatic decompose -> escalate pipeline. that's what makes small models actually usable for agentic work. Pi doesn't have this because it doesn't need it (it targets models that don't fail as often). You can't "add" this as an extension without reimplementing the entire agent loop.
Compound tools and 2-stage routing. SmallCode halves context overhead by routing tools in two stages and offering compound operations (read_and_patch, find_and_read). This matters when your model has 32k-128k context.
Pi can afford to dump all tool schemas because it assumes 128k+.
That being said, many others have already proved if you're already in Pi's ecosystem and using frontier models, SmallCode probably isn't for you.
capsid@reddit
Daily pi user. I love the idea of optimizing for small models. I'm also a believer in the stuff Mario Zechner said about extra features like LSPs and MCPs affecting context observability depending on how it's implemented. Does SmallCode have a minimalist mode in line with Pi's philosophy? Is extensibility/self-modification planned? Looking forward to trying it out.
gh0stwriter1234@reddit
The whole point of pi dev is you CAN easily implement a whole different agent loop...
rinaldo23@reddit
Interesting. I think there is a trend towards using smaller, more focused models for specific tasks. Extraordinary claims require extraordinary evidence though.
SnooEagles1027@reddit
Yup
Glittering_Focus1538@reddit (OP)
Feel free to try it out with gemma 4 e4b, you'll be amazed how well it codes.
someonesmall@reddit
Did you also test with Qwen?
Glittering_Focus1538@reddit (OP)
Yes qwen runs smoothly. Try it out. should work with any model made after qwen 2.5 14b
opimentoso@reddit
Even with Q4 and Q6 models?
Glittering_Focus1538@reddit (OP)
yes
geek_404@reddit
Does it work with all languages or is it better with specific ones? I have been toying with the idea of doing a fine tune of a model to make it an expert in rust programming. I hadn’t thought about making the harness better. Kudos and look forward to trying it out.
Glittering_Focus1538@reddit (OP)
It's only as good as the model you're working with and the code you've indexed in your working dir. if you have a bunch of documentation on rust in your working directory it should help.
addiktion@reddit
Do you plan to scale this up to the larger gemma 4 models to see if you can get them to do well too?
Glittering_Focus1538@reddit (OP)
It already scales perfectly fine. Especially for mid-tier models. No huge plans other than sub agents.
ismail_the_whale@reddit
waht. extremely intrigued
1_4_1_5_9_2_6_5@reddit
I'm gg to try this out. I've been building something very similar, which works well for the same reasons, so I fully believe yore onto something here.
LippyBumblebutt@reddit
Oh you were talking about E4B in the initial Post? Because A4B can code reasonably well with opencode, but still needs a 16GB+ GPU. E4B should fly even on 6GB GPUs (llama.cpp supports not offloading the embeddings, does it?)
Anyways, I'm happy to try this. I always felt like there could be big improvements over opencode for smaller models.
Glittering_Focus1538@reddit (OP)
No worries, you aren't the first to accidentally make that mistake today lol.
hishazelglance@reddit
There are, Nvidia has a great paper about Small Language Models (SLMs) and their use cases: https://research.nvidia.com/labs/lpr/slm-agents/
JollyJoker3@reddit
Pricing will make smaller models good eventually. This project does stuff I've been meaning to try myself so I'll have to give it a look.
duy0699cat@reddit
title: 4B parameter model
post: only activates 4B parameters
the wording is dishonest tbh
Finorix079@reddit
The harness-over-model thesis is right, and the benchmark gap backs it up. Two things worth pushing on though.
Compound tools cut failures but they also cut your visibility when something breaks. If find_read_edit_verify fails, you don't know which of the 4 steps regressed. Worth logging the sub-step that failed even if the model only sees the unified tool. You'll want it the first time someone reports "edits started failing on Tuesday."
The decompose-on-failure trigger is interesting. Two attempts feels low for a 4B model. Have you looked at whether the second attempt is materially different from the first, or is it the same failure mode? If it's the same failure, decompose makes sense. If it's drifting randomly, more retries with temperature variation might be cheaper than decomposing.
Code graph approach beats grep, agreed. Curious how you handle stale graphs during active editing. Rebuild on every save or lazy invalidate?
Escalation policy is the part I'd think hardest about. "Failed twice locally" is one signal, but the more useful one is "this kind of task has a 40% local success rate historically, just escalate immediately." Otherwise you burn tokens and wall clock on tasks the small model was never going to land.
Will try it on a Qwen 2.5 Coder 7B setup this week.
Glittering_Focus1538@reddit (OP)
For decompose, its 2 failed attempts, then 2 more failed broken up attempts(post-decompose). Then there's the optional escalation you mentioned.
Economy-Register97@reddit
This is what I've been doing with the exception of this: "Compound tools: Instead of making the model chain 4 tool calls (find file → read file → edit file → verify), SmallCode gives it one tool that does all 4. Small models lose coherence after 3+ sequential calls. This cuts failures in half."
How was this done? Did you just have it combine the process in a single script? I've seen failures on my side due to too many tool calls. Just like memes, I'm stealing this 😂
As far as the continuous learning, I've included Obsidian and some scripts to provide counts, analysis, thresholds, and promotion criteria. It's been working really well.
Glittering_Focus1538@reddit (OP)
We use a db based memory layer that's searched the same way we'd search a codebase, so you get efficient memory recall across sessions.
But honestly, there's no magic orchestration framework, each compound tool is a single function definition that internally runs the steps sequentially in plain JS. The model makes one tool call, and the handler does the multi-step work server-side.
Here are the 5 compound tools:
read_and_patchcreate_and_runfind_and_readsearch_and_readrunThe key design choices:
read_and_patchreads the file, validates the old_str exists exactly once, replaces it, writes it back, and returns the line number where the edit happened. Ifold_strisn't found, it returns the file content so the model can see what's actually there and self-correct — no second call needed.read_and_patchfails, it shows the file contents in the error response. Whencreate_and_runhas a command failure, it still reports the file was created plus the error output. The model gets everything it needs to course-correct in one round trip.Steal away. It's the simplest possible approach: just bundle the steps in a single handler function. The insight isn't architectural, it's that small models fundamentally can't maintain intent across 3-4 sequential decisions, so you move the sequencing out of the model's responsibility and into deterministic code.
Your Obsidian-based learning loop sounds solid. SmallCode does something similar with
tool_scores.jsontracking success/failure rates per tool to adjust which tools get offered.Economy-Register97@reddit
Kudos and appreciate the sharing of info. Very cool and I'm jealous of how well it's working for you at 4B! That's very impressive. I chose obsidian and smart connections to monitor and document my entire pipeline. Mostly out of ease and simplicity for a single user. Seems to do ok although I probably should address house keeping efforts on the data 😅. Appreciate you taking the time to reply.
Glittering_Focus1538@reddit (OP)
It's honestly not fair, I made an entirely new compiler for LLM agentic frameworks. I'm still debating releasing it.
DiscipleofDeceit666@reddit
I tried doing something similar lol. I tried building my own harness to try to manage context bc that’s local ai biggest enemy. I learned that by harnesses are just fancy text formatters. How do you format the chatter, inconsistent spacing, and just random “if you need anything else, let me know” statements in the output.
Props to you for not only building that, but handling it in a way that even truly regarded models can use.
I wanted to use my GPU for coding development too, but the limited context was hard to work with directly so I built a loop that sends tasks written down in a toml file to an AI harness (like yours) and then runs unit tests after each task is done. Feeds failures back into the LLM for a fix.
If I could use tiny models instead of the qwen3.6 35B, I could probably run multiple agents in parallel each reaching 100 tok/s nearly tripling my current output. I might give this a go.
Do you support the -p syntax other CLI tools use? Where you send it a prompt to send it off?
Glittering_Focus1538@reddit (OP)
try running gemma 4 26b, you will probably get much better speeds allowing you to code much faster, should work in my harness without crapping it's pants.
Not yet as
-p(that's currently used for--provider), but you can pass a prompt as a positional arg and it runs non-interactively:Or pipe it:
I'll add
--prompt/-Pas an explicit flag in the next release.DiscipleofDeceit666@reddit
Ayo, whatever works. When you feel like the surface area won’t change much, I’ll drop the script here. I’ll work around whatever you have setup, the non interactive mode comes in clutch.
https://github.com/Minerest/leanloop/tree/master/leaners
NotARedditUser3@reddit
Qwen3.6 35b-a3b works perfectly fine on opencode. I use it for 95-99% of my daily software development.
Glittering_Focus1538@reddit (OP)
good for you. this is built for smaller models 8b-26b
NotARedditUser3@reddit
Your original post explicitly calls out qwen as if it doesn't work well on opencode.
"If you try them with a local model like Gemma or Qwen they fall apart."
Glittering_Focus1538@reddit (OP)
yes for smaller models that most people can run. Not models that require 24 gb's of vram.
nickl@reddit
This is interesting. I've been working on a custom agent for small models too, and I've been tempted to go down the "many tool" route. One problem I've found is that including the instructions bloats the context more than these small models can deal with easily.
Progressive disclosure ala skills helps, but it remains a problem.
How are you handling this?
Glittering_Focus1538@reddit (OP)
We keep tool descriptions to one sentence each, the system prompt is conditional (BoneScript guide only injects for backend tasks, skills show as one-line summaries until activated), and the instructions are baked into the tools themselves rather than the prompt,
read_and_patchencodes the "read then edit" workflow so the model doesn't need to learn it. The base system prompt is \~400 tokens, tool schemas add \~750 tokens fixed cost. We haven't found a way to compress schemas further without degrading tool selection accuracy.nickl@reddit
Does something like `read_and_patch` get its own context?
Glittering_Focus1538@reddit (OP)
tools get a persistent context while files get normal context, instructions and code are saved to the graph/db so not hard to recall efficiently
Somtimesitbelikethat@reddit
im sorry if this is a dumb question, but i'm not sure how you link my ollama downloaded Gemma 4 model to this. do I have an API key?
Glittering_Focus1538@reddit (OP)
you dont need one locally but its a smart idea to use one regardless
dtdisapointingresult@reddit
I can't install this.
'npx smallcode' gives:
Instead change the require of store.js in /home/user/.npm/_npx/2a34a6c93b3d02a2/node_modules/smallcode/bin/smallcode.js to a dynamic import() which is available in all CommonJS modules. at Object. (/home/user/.npm/_npx/2a34a6c93b3d02a2/node_modules/smallcode/bin/smallcode.js:48:41) {
code: 'ERR_REQUIRE_ESM'
}
Node.js v18.20.4
Glittering_Focus1538@reddit (OP)
kk
yoomiii@reddit
Why force this upon users of your harnees?
Glittering_Focus1538@reddit (OP)
because small models suck at coding even if you walk with them the whole way, so I made a language for backend coding that's easier for them to use and debug.
MetricZero@reddit
Bro, you cooked. This is great for mid-tier models too.
call_me_venom@reddit
!remindme 72 hours
Snoo_81913@reddit
Getting pretty spicy in here. I'm not opposed to this, but without actual benchmarking that can be compared there's really no way to tell unless we actually download it and use it and to be honest I've got enough stuff going on. I am curious though and I'll keep an eye out for the person who gets so invested in this to try it out. cheers mate.
AppealSame4367@reddit
How would you compare yourself to pi.dev or little-coder (based on pi.dev, good swe bench 2 scores)?
Glittering_Focus1538@reddit (OP)
It's simple honestly, we have different goals.
Pi/Little-Coder is a minimal harness. 4 tools (read, write, edit, bash), tiny system prompt, and it relies on the model being smart enough to figure things out. Their SWE-bench scores come from running frontier models (Claude, GPT-5) through a lightweight wrapper. The harness gets out of the way and lets the model do the heavy lifting.
SmallCode is the opposite bet, the harness does the heavy lifting so the model doesn't have to be smart. Compound tools, improvement loops, auto-validation, decompose-on-failure, token budgeting. We get 87% on our stress test with a 4B-active model. Pi would get maybe 40-50% with that same model because it doesn't compensate for the model's limitations.
AppealSame4367@reddit
Your assessment of little-coder seems wrong to me. The swe bench benchmarks I referred to were made with Qwen3.6 27B and 35B and little-coder is an extension to pi that _adds_ a lot of rules and tools.
But I get your point: You try to encase a 4B in a way that it can't go wrong when editing and still produce very usable results. I will definitely try it, cause I have low vram anyway and rely on a lot of new optimizations in ik_llama to run 35B at around 10tps output in little-coder.
rpkarma@reddit
Not seems. Is entirely wrong.
AppealSame4367@reddit
If you wanna speak in scientific terms, you cannot rush to conclusions. Also not if you wanna be diplomatic.
OP put in honest, serious work and tried to provide a benchmark. There might be doubts, but there is no reason to punch him in the face with a "all wrong!"
rpkarma@reddit
It’s objectively true: his assessment of what Little Coder is is entirely wrong; it’s literally backwards.
SummarizedAnu@reddit
I mean even if he used Ai to make it. As long as it works better than it's alternatives I would put all my chip on it.
Pleasant-Shallot-707@reddit
Pi’s philosophy isn’t that the harness is minimal and the llm does the hard work. It’s more nuanced. Yes, you can yolo like that in Pi, but the actual philosophy is that the harness is minimal and unopinionated and easily extended so the developer can craft a harness they want to use without burning a bunch of tokens on bloated system prompts.
gh0stwriter1234@reddit
Yes building this kind of thing is exactly what pi.dev is supposed to be able to do.
brakx@reddit
Can you explain your own words?
mister2d@reddit
Pi is small with intention. It's fully extensible so you could have applied your harness concepts easily to it.
rpkarma@reddit
That is not at all what Little Coder is.
fittyscan@reddit
Have you tried https://github.com/swival/swival ?
itsyourboiAxl@reddit
I want to try local but i wonder how precise you must be? Claude code is good at understanding, like i can explain a feature like i am 5 and he will do it right on first try. Do local models also perform this good or do you need more careful prompting and planning? Maybe its a bit naive but i’ve never tried local ai for serious dev. I will give your agent a try
Imjustmisunderstood@reddit
Wow I tried it and I’m pleasantly surprised. Please keep building this up, apply the same philosophy to MCPs, and you will have a more accessible, cheaper claudecode! Good job!
Orolol@reddit
Which Model ? Which Benchmark ?
If you want to be taken seriously,you have to be precise enough so people are able to reproduce your results.
sillib@reddit
That’s the real question that needs to be answered, what fucking benchmark test was used
Rustybot@reddit
TrustMeBro-2.1-hard
eightshone@reddit
Interesting project. I’m working on a small tool that cleans up uncommitted files in my local branches by composing grouped commits in one command using 3b model (qwen 2.5 coder). What I learned was that with small models we have to be clever how to make the prompts and how pass context, while with frontier models you can just throw everything at it and they’ll work as expected. I think building agents that are focused on small models is a great way to come up with cleaver tricks that overcome challenging situations.
FancyNet9095@reddit
I understand the "how", but why?
takuarc@reddit
I love Gemma, it’s pretty powerful as it is. I wanted to do something similar but even more focused. Will definitely check out your repo.
sagiroth@reddit
Fuck off bot account
overand@reddit
Be really careful with accusations of being a bot. From someone who's been on Reddit for 18 years to someone who has been there 13 years - even I got accused of being an LLM because of my writing style.
Nah, bro: it's not that I write like an LLM, it's that LLMs write like us, like chronically internet-soaked aging millennials. (Even that thing right there, the "it's not X, it's Y" is the kind of shit they're known for. And you'll see my hyphen in the first line? LLMs use the more technically accurate emdash.)
Nearly 2 decades of Reddit posts contributing to their corpus of knowledge, plus grocery store romance novels. 🙃
a_beautiful_rhind@reddit
This is definitely some kind of marketing or promotion. Op is answering everyone and providing computer help, bullshit username, fanciful claims, post title in the clickbait style.
He writes:
Well.. anyone who wants to establish a "rep" and parlay into some revenue down the line.
The entire repo was made like 9 hours ago. The other projects on it, exactly the same. Op is selling himself HARD and everyone is lapping it up.
Just like the automated upvotes to the post, I expect automated downvotes to my comment like any other dissenter in this thread. That's right op.. you are a phony.
overand@reddit
I don't disagree that it's promotion, or even spam. (I'm not saying it is, or isn't) - I'm just saying that the person I replied to (later) accuses OP of being an LLM, and it's not great.
Glittering_Focus1538@reddit (OP)
Not a bot, js sharing my new project, thanks for being a dick tho.
sagiroth@reddit
Well your profile history tells other story
Glittering_Focus1538@reddit (OP)
how dare I advertise my projects, all opensource. What idiot would waste money advertising projects that don't generate revenue??
SaveSpend@reddit
Back in the day, tech people just wanted to be tech people, not revenue chasing assholes. Thanks for self selecting!
sagiroth@reddit
Have a read of this sub rules once again. Clearly you have not read them. At least state AI generated content MR Em Dash —
Prize_Negotiation66@reddit
qwen code works good even with 27b model
josuf107@reddit
In https://github.com/Doorman11991/smallcode/blob/ec6df2dc205f25c48b8799ab54c118a7a213bf35/src/plugins/skills.js#L79 there's a parsing bug in that inline comments in the frontmatter are parsed as values. E. g.
keywords: manual # manual is bestwill setkeywordstomanual # manual is bestsince the regex is(.+)$. Probably should use a library for parsing YAML, but if you are wanting to roll your own that's one thing that doesn't currently work correctly.LippyBumblebutt@reddit
I gave this a quick look. I first tried on CPU with E4B and it timed out all the time. Partially because of llama.cpp, but even with --timeout 99999 it failed in smallcode.
Running on GPU, I let it code a 2048 game in html/js. After it worked ok, I asked for animations. Then this happened.
It repeated all it's answers twice all the time. Didn't read the file a couple of times and then crashed.
I asked for no external dependencies. Maybe that's why it created a single 13kb file with embedded js/css. Might have been part of the problem?
Also "Created index.html (104 lines) 2ms" wow, you can write a file in 2ms! Amazing. That's not how long it took though...
Warsel77@reddit
Does it get "87% on benchmarks with a 4B parameter model" with Opus in the loop or without?
CircularSeasoning@reddit
It feels like OP has been watching how I work with my models over my shoulder. This is how I harness my models manually to do great things they initially didn't seem capable of doing.
Whether you use this harness or not, pay attention to the techniques listed in bullet points here. They seem simple and even almost old/unoriginal as the ideas have been around already, but they are very powerful.
ThiccStorms@reddit
how do you manually apply things like symbol graph and linting stuff to existing code agents? where do we find the entrypoint to the pipeline where i start customising the system and making hard guardrails for the generation?
CircularSeasoning@reddit
For the code graph thing, give it as much of your code base as can fit at once, with every relevant project file, but obviously without all the boilerplate that just powers your framework behind the scenes.
Then you... wait for it... ask it to make a code graph. Or a code map. Or use some words that make it clear that you want a small, compact document that accurately describes the code in terms of its architecture, UI, etc, and how they all currently interact and interdepend.
Once you've checked that the code graph/map checks out, isn't missing any important things, etc., you can just provide that along with your normal system prompt or within the prompt, for prompts where such a high-level (but not too fuzzy) view is needed. It can help tremendously. It can also hurt, depending on whether the prompt doesn't actually need that info, for small easy changes and so on where it actually gets in the way or it makes it subbornly enforce "current architecture" vs. "architecture I actually want in this new change".
This may sound direct, and maybe doesn't apply to you in particular but... "Learn to code" a little is my advice for anyone. :)
As for linting... I mean, the code editor or IDE (e.g., VS Code, Zed) does that automatically when you paste your code into it and/or when you try run the program. So if it throws an error, I just throw the code back at the LLM with the error and voila.
It seems like a lot of work at first, and the whole copy paste thing gets tiring, but that's where you start coding your own little tools to make all that much easier and faster. My personal workflow with Qwen 35B A3B is insanely good right now, but I have been working on all this for at least a year, maybe two. It's worth the grind. I have literally no fear of working with high context anymore, because the eventual context I give it is really dense, relevant, human (me) curated, and high quality.
ThiccStorms@reddit
umm this might sound very very weird but im a developer, and it's just that i've been living under a rock when it comes to AI assisted coding and the tools around it. I am familiar with python and a lot of other things. So at the end you mean that the code graph thing is really a prompt in text out thing, and MCPs are also just that? instead of a tool call like framework?
CircularSeasoning@reddit
Ha, you and me both under this rock. I do heavy work with LLMs and coding but I have only tried agentic coding very briefly so far. I think agentic coding is really cool and everything and I will be trying it more but I strongly dislike how many tools seem to consider a lot of what the model is actually thinking and doing as "what the user doesn't need to see". I get that, because it's often too much too read, anyway, but I simply must be able to quickly scan how things are going, which parts of my original context are being "compacted" and how, etc.
As for the code graph, I just treat this as extra context I put in my prompt/sys prompt when needed. This is because when I provide my prompt to do a code task, and I don't have that in there, I notice in its reasoning that it's essentially trying to build such a mental map from scratch each time. So the code graph helps it skip past that little process on every generation, and improves the likelihood of a good response.
I'm specifically not doing anything fancy on this point. Just helping it along in terms of context. It doesn't fit into any agentic process or anything for me. Sorry if that was unclear.
Confession: I've never tried to use MCP, and would you believe, I've never asked my LLMs to perform a web search yet. In fact, I often perform manual "fact-check" web searches about my code or a code change given by the LLM simply because it's so much quicker than waiting for the model to load, process the prompt, do the tool thing, etc.
ThiccStorms@reddit
aha i can relate. i have never used MCP too and i dont even know what it does or how does it work. and apart from that, i still use LLMs the old fashioned way, browser tab and prompting + copy paste. i really try not to rely on ai tools when learning, and im a student so i have been avoiding llms since the last 2 years or so, so that i dont get dependant or develop a habit.
CircularSeasoning@reddit
It's very good how you're approaching things. The hype and marketing can easily blind people to the basics. And where the hype is real, like it's become and becoming with these latest local models, it is just as easy to just skip the hard part of learning a thing or two or adjusting things by hand. But, I believe it is still the superior way, else we may all lose our sense of what's good, correct, and so on, to begin with, and start to accept subpar results from AI as well.
I am confident in my extensive and very vibe-codey use of LLMs largely because I have spent many years learning fundamentals, whether in writing, code or other subjects. Many tough years. I am the kind of person who has been obsessively looking words up in a dictionary since early high school and never stopped. I'm well past student stage in my field/s and yet I am still learning. Always. The joy of being able to quickly and effortlessly correct that one syntax error from an LLM generation which is otherwise perfect, and not have to re-roll... Great feeling, and only one way to get there, to put in some work!
Glittering_Focus1538@reddit (OP)
Thank you?
CircularSeasoning@reddit
You're... welcome?
Wait. Let's start again. You say "Thanks." and I say, "You're welcome."
Glittering_Focus1538@reddit (OP)
qwelcome
CircularSeasoning@reddit
This one's gonna need a bigger harness. Or a smaller one. I'm not sure.
an0maly33@reddit
*The user is saying the harness needs to be bigger.
Wait. They also said it might need to be smaller.
Wait. The user is not sure if the harness needs to be bigger or smaller. I should acknowledge this and wait for user input.*
You were right to call him out. He should have had a differently-sized harness from the start.
Konamicoder@reddit
Hey there, I am trying to install smallcode on my Linux Mint rig. My models are being served up by oMLX running on an M1 Max Mac on my Tailscale network. So per your instructions on github, i pointed smallcode to my project folder and created a .env file to tell smallcode the model name and base url where it should connect to find my models. smallcode gave me this error message:
" Cannot reach LM Studio at https://MY-OPENAI-ENDPOINT-TAILNET-URL/v1
Make sure LM Studio server is running."
The TailNet URL is verified correct and working because it's what I am using to drive nanocoder and pi on the same rig. So i know the issue is not an incorrect URL. My suspicion is that my omlx server requires an API key. So I tried to add a SMALLCODE_API_KEY line in the .env file, but the error message persists. Thoughts on how I can get smallcode to connect to my model backend over the TailScale network? I would really like to try it out. Also your github states to "See
.env.examplefor all options. " but I can't seem to find that example file?Help!
Glittering_Focus1538@reddit (OP)
also, use this format.
SMALLCODE_MODEL=your-model-name
SMALLCODE_BASE_URL=https://your-tailnet-url/v1
OPENAI_API_KEY=your-omlx-key
Konamicoder@reddit
Thanks for the quick reply! I edited .env per the above format but it's still failing to connect with the same error message. sad_face.gif
Glittering_Focus1538@reddit (OP)
Fixed the issue, should work now. https://github.com/Doorman11991/smallcode/blob/master/.env.example
_underlines_@reddit
did you look into little-coder and how it differs? maybe even combine forces, if goals align well enough between these two projects?
and a relevant add on that might interest you: semble for faster, less token heavy and more accurate code search, compared to glob and grep
elijahebanks@reddit
Good work! I'm working on an engine specifically for the raspberry Pi!
nuclearbananana@reddit
Which benchmark btw?
Also I see "patch first editing" in the readme. Can you explain what that is and how it helps?
Glittering_Focus1538@reddit (OP)
it was a custom 100-task stress test that was across multiple different coding languages.
as for the patch first editing.
When a small model needs to edit an existing file, there are two approaches:
Rewrite: model outputs the entire file with changes (what most agents do)
Patch: model specifies just the old text and new text to swap
Small models (4B-14B) are terrible at rewriting. They truncate files at 60-120 lines, hallucinate middle sections, change indentation randomly, or forget imports they didn't modify. A 200-line file comes back as 140 lines with subtle bugs.
Patch is safer because:
The model only needs to produce the changed lines, not reproduce 180 unchanged lines from memory. It's cheaper on tokens (10 lines of context vs 200 lines of full file) It's verifiable, if old_str doesn't match exactly one location, the tool rejects it and asks the model to be more specific.
It can't accidentally delete code it didn't intend to touch.
The tradeoff is the model needs to get the old_str exactly right (whitespace and all). That's why SmallCode also has read_and_patch: a new compound tool that reads the file AND patches it in one call, so the model sees the exact content right before editing. Eliminates the "remembered it wrong" failure mode.
nuclearbananana@reddit
Ah, that's a standard replace tool. Patch had me thinking it was a codex style patch, which is hard for non-gpt models.
How does read_and_patch work? Like it takes until the next round for the model to get the benefit of the read, so how does it help the patch?
aegismuzuz@reddit
The point is that you’re cutting out a useless round-trip to the API. Usually, an agent asks to read a file, waits for the response, then asks to replace text. On smaller models (sub-14B), by the time that second tool is called, the model has already forgotten why it even read the file in the first place. Wrapping that into a single call is a quick and dirty fix for the attention problem
Glittering_Focus1538@reddit (OP)
When
read_and_patchfails (old_str not found), it returns the actual file content in the error so the model can correct itself in one retry. instead of needing a separate read_file call first. Cuts 3 round trips to 2. Less turns = less coherence loss for small models.nuclearbananana@reddit
That's actually pretty smart. Do you give it the whole file or try to guess which part oldText was trying to point to
Glittering_Focus1538@reddit (OP)
First 50 lines. Small models can't process a full 200-line file in the error response without losing track of what they were trying to do. 50 lines is usually enough to see the actual content around where they thought their
old_strwas.If the file is short (under 50 lines), it gets the whole thing. Could be smarter, Fuzz matching the failed
old_stragainst the file to find the closest section. But honestly the simple approach works well enough. The model usually just got whitespace or a variable name slightly wrong, and seeing the real first 50 lines is enough to self-correct.ysustistixitxtkxkycy@reddit
Loving the intent and what you describe. For rewriting, I'd suggest looking into stream rewriting: don't treat the rewrite task as a here's 100k lines, send me back 100k, but instead send it as "task setup" 100x "task continuation, context and next section" "task cleanup"
Glittering_Focus1538@reddit (OP)
That's a good pattern, chunk-based rewriting. We avoid the problem differently (patch instead of rewrite), but for cases where you genuinely need to transform an entire file (like migrating from one API to another), stream rewriting would be the right approach.
Thanks for the input, adding to roadmap!
gffftgdft455@reddit
Custom benchmarks is like marking your own homework. Sure you could mark it with honesty, but not everyone will believe you.
Glittering_Focus1538@reddit (OP)
I agree, im working on adapting opencode bench to work with smallcode so I can give concrete results.
zoomaaron@reddit
I think the idea is very much oversold. 4B active parameters is not the same as 4B parameter model. That’s misleading. You also made your own benchmark without telling us where it is so we can verify your claim. If you are using bench/stress_test in your repo, I’m afraid that’s making a completely wrong claim, because it didn’t even check for the success of any of the test. As long as it produced 20 characters of output it passes. What kind of benchmark is this?
Some of the ideas you introduced is neat in demo but unclear to me how well they work in real world. For example, different models have different abilities to compose multiple tool calls. I’ve tested this extensively with my own harness and got mixed results because some models are just not well trained to chain tool calls; it’s out of distribution for them and caused more round trips than before. There are also models like deepseek which is trained to launch large batch of tool calls at the same time, asking it to compose calls actually reduced its token efficiency by a factor of a few.
The error decomposition is also unconvincing. The most challenging part is often to figure out which is the one line that needs to change. I don’t see how a harness alone can pin point that precisely without relying on a large model.
aegismuzuz@reddit
Totally agree on the decomposition bit. If the bug is a messed-up state mapping in Redux, narrowing the focus to a single line in a component won’t do squat. A harness can't replace the model's ability to hold the abstraction of an entire feature in its context
Glittering_Focus1538@reddit (OP)
Fair points. The compound tools work because the composition happens server-side as a single function call, the model never knows it's chaining, so it doesn't go out of distribution. Models that prefer parallel tool calls can still use the standard individual tools since both are offered. On decompose, you're right that it doesn't solve semantic bugs, it specifically handles the class of failures small models actually produce (compilation errors, missing imports, type mismatches) where the fix is mechanical, not reasoning-heavy. For logic errors that need understanding intent, that's where escalation to a stronger model takes over. The honest take is that SmallCode's value scales inversely with model capability at 4B the harness is critical, at 155B you barely need it.
Glittering_Focus1538@reddit (OP)
Fair points. The compound tools work because the composition happens server-side as a single function call, the model never knows it's chaining, so it doesn't go out of distribution. Models that prefer parallel tool calls can still use the standard individual tools since both are offered. On decompose, you're right that it doesn't solve semantic bugs, it specifically handles the class of failures small models actually produce (compilation errors, missing imports, type mismatches) where the fix is mechanical, not reasoning-heavy. For logic errors that need understanding intent, that's where escalation to a stronger model takes over. The honest take is that SmallCode's value scales inversely with model capability at 4B the harness is critical, at 155B you barely need it.
aegismuzuz@reddit
I love these posts. First, we loudly claim an 87% benchmark score, and then in the comments, we quietly mention it’s a custom stress test where the model just has to spit out 20 characters of code. The architecture itself - compound tools and compiler validation - is legit, but trying to sell it with cooked metrics is a total shot in the foot. Run it through a proper SWE-bench, take your honest 30-40% for a 4B model and be proud of it, because for that size, it would already be a massive breakthrough
bronekkk@reddit
I am very interested in an alternative to Augment which I could point to locally hosted LLM. Although you do not mention vscode integration that's not the most important point about Augment, it's a secondary thing. Indexing is important and you seem to have done it. Thank you and I am definitely going to follow this project!
Glittering_Focus1538@reddit (OP)
I could technically get that working, give me a day or two to make a viable plugin.
gh0stwriter1234@reddit
I could technically get that working, give me a day or two to make a [vibeable] plugin.
AbjectBug5885@reddit
The 4B parameter claim is what caught my eye tbh. If you're actually hitting 87% on something like SWE-bench or HumanEval with that small of a model, that's genuinely impressive and worth writing up properly. But without knowing which benchmark or seeing the eval methodology, this just reads like another demo project that worked on cherry-picked examples.
gh0stwriter1234@reddit
He's not ... its his own bench which he hasn't even released.
TechnoByte_@reddit
They mean Gemma 4 26B-A4B...
Very misleading post
Arrowstar@reddit
How can I get this pointing at a llama-server instance I have running on another machine?
Glittering_Focus1538@reddit (OP)
Create a .env file in your project directory:
SMALLCODE_MODEL=your-model-name
SMALLCODE_BASE_URL=http://OTHER_MACHINE_IP:8080/v1
That's it. llama-server exposes an OpenAI-compatible endpoint at /v1 by default. SmallCode talks to anything that speaks that protocol.
If you started llama-server with --api-key, also add:
OPENAI_API_KEY=your-key
nostriluu@reddit
I would like to integrated a coding agent into an existing Typescript framework. I would ideally pass it a prompt, context, and test condition(s), and let it do its thing. Would smallcode be a good choice for this? Thanks!
Glittering_Focus1538@reddit (OP)
SmallCode can literally deterministically generate type Script files with BoneScript, it has no issues editing and refining TypeScript
nostriluu@reddit
I specifically mean using smallcode as a library, rather than a self contained external thing. Maybe the answer is obvious, but it's difficult to navigate all the options.
Glittering_Focus1538@reddit (OP)
It's not too obvious but definitely use
smallcode --mcpto run it as a JSON-RPC subprocess from your TypeScript framework. Send prompts + context over stdio, get structured tool calls back. It already auto-validates edits (runs your test command, feeds errors back to the model, retries). Set your test condition in[hooks] post_edit = "npm test"in config. Best fit if you're using local models (8B-26B) the architecture compensates for their limitations.Specialist_Major_976@reddit
The compound-tools bit is the interesting part to me. Small models aren't really failing at coding first, they're failing at staying coordinated across tool hops, so moving the boring orchestration into the harness makes sense.
MerePotato@reddit
This kind of elegant is missing from so many projects in this space, props to you
Ashraf_mahdy@reddit
Sent you a Message! Check your DMs
Imaginary_String_954@reddit
How can this be explained to a newbie? If this isn't an LLM, how do I make the best use of it
Glittering_Focus1538@reddit (OP)
imagine your average harness as a open template, you give it a small constraint and let it roam free, and some lock it down a little with permissions but still keep it free, MY harness is a workflow loop engine that works without any LLM directing it, so no wasting tokens or context on driving the bus. The template is much more constrained and allows small models to code effectively by combining tool calls and robust error debugging and guards to stop early/incomplete outputs.
Imaginary_String_954@reddit
I appreciate it. I think I get the idea but actual implementation I'm not sure of. I really am new to all of this. I have Gemma 4 8B installed on LM Studio. How can I pair your coding agent with it?
If you have a YT channel or tutorial series you recommend, I would love to know. I'm not sure where to begin to learn these things! Thank you
HiddenPingouin@reddit
This looks interesting but I'd be curious to know what you think about the bitter lesson.
If you're not familiar with it: The bitter lesson is the observation in artificial intelligence that, in the long run, general approaches that scale with available computational power tend to outperform ones based on domain-specific understanding .
Glittering_Focus1538@reddit (OP)
The bitter lesson is correct for models, don't hand-engineer intelligence, scale wins. But SmallCode isn't competing with scale. It's infrastructure: error handling, retry logic, validation. No amount of scaling makes "check if the code compiles before delivering it" obsolete. The bitter lesson applies to reasoning. It doesn't apply to plumbing.
HiddenPingouin@reddit
Interesting. You make some good points. I think Smallcode could be really useful. A lot of people, including myself, are always going to be running smaller models.
HiddenMushroom11@reddit
Anyway to use a global config file instead of having to setup smallcode every time in my projects? Something like \~/.config/smallcode.conf
Glittering_Focus1538@reddit (OP)
Already supported since 0.4.5. Put your
.envat any of these paths:SmallCode checks in this order, first found wins:
<project>/.env(project-level override)<project>/.smallcode/.env~/.config/smallcode/.env(global)~/.smallcode/.env(global alt)So set up your model endpoint once in
~/.config/smallcode/.envand it works everywhere without per-project config:Then just
cd any-project && smallcode,no setup needed.HiddenMushroom11@reddit
Nice! I must have missed that in the documentation.
Infamous_Jaguar_2151@reddit
This is awesome, will it also work well with models 27b+ sized? How would it compare to open code for those larger local models?
Glittering_Focus1538@reddit (OP)
It's much more token efficient, so outputs don't take as long, built in project guards so you don't have to say "continue" a million times, it will continue till it's done with the task it's been given. Ships with a memory and codebase MCP which massively cuts down on context usage leaving more for your coding tasks, try it out, there's much more I haven't even mentioned yet.
Infamous_Jaguar_2151@reddit
I’ll try it, thanks so much for doing this, been waiting for something to use easy with llama.cpp!
Hezy@reddit
It sounds like this could be useful with non local models as well. Any reason not to try it?
Glittering_Focus1538@reddit (OP)
works with openAPI, should work with Qwen but needs to be tested, no reason it wouldn't.
cafedude@reddit
claude code seems to have recently added an /advisor option. It allows you to designate an advisor model that is consulted if the model you're using gets stuck. It would be interesting to have something like that where you designate a frontier model that would be consulted when needed. That would help reduce token usage of the frontier model.
DigThatData@reddit
Some clever work from a few years ago demonstrated a competitive solution w/o involving agentic components at all: https://github.com/OpenAutoCoder/Agentless
Hot-Employ-3399@reddit
First impressions weren't good. It didn't work well with qwen3.6-27B (task was to make custom "autotype" in bevy so holding Ctrl-Z triggered undo N times per second; other agents complete from \~30 minutes(pi) to \~12 hours(hermes))
And stopped talking with llm (total time: \~30 minutes).
Also it didn't print much info, at least by default. (Like what is it thinking, or what bash is doing, or what file is being read)
Also it ignored config at .config/smallcode/config.toml which I saw in source code, and ignored env variables (I've used /endpoint to setup model; didn't test .env file)
Also created too much extra dirs into project dir ( `.smallcode/ .code-graph/ .memory/ ` )
Glittering_Focus1538@reddit (OP)
those "extra dirs" are features, the codegraph and memory are how the harness lets the LLM effeciently index code in it's workspace and remember key details across conversations, Will work on the bugs you experienced.
Substantial-Cicada-4@reddit
Good idea, but not going to install an npm based anything from reddit with "too good to be true" signals left and right. I don't want to be on the hews.
Hot-Employ-3399@reddit
That's why podman is generally good for evaluation. You can mount directories in
:Overlay mode so all changes will be discarded when you quit. (And no access to cookies/wallet.dat/anything extra valuable)Glittering_Focus1538@reddit (OP)
then compile it from source, its right there.
Substantial-Cicada-4@reddit
Let ne rephrase. At this point I have severe allergies towards dependencies and trust issues with anything in the chain being compromised. Don't get me wrong, what you did looks good. "It's not you, it's me".
ThiccStorms@reddit
oh wow, i really like the symbol graph thing, is it commonly used elsewhere? i wanna read more about it, and i've recently been trying the small gemma models with opencode and gave up because my passive cooling mac just struggles sometimes, ill try this out.
Glittering_Focus1538@reddit (OP)
the idea is loosely used in other projects, I based my custom memory/codebase https://github.com/Doorman11991/budget-aware-mcp
off of https://github.com/CodeGraphContext/CodeGraphContext
just repiped in some improvements and added memory support, changed the db it uses yada yada yada
ThiccStorms@reddit
oh great! hard to explain but i've never ever used MCP servers and stuff. (i still use LLMs the old way, browser copy and paste, for reasons i cannot explain) so yeah i will explore more about MCP servers, and yeah also about the linting part, is it also an MCP thing?
OsmanthusBloom@reddit
How does this compare to Dirac? It seems that you have similar ideas e.g. about precise file edits and keeping contexts short to better support small/local models.
LegacyRemaster@reddit
I'll try it with Qwen3 4b
overand@reddit
Why not push for Qwen3.5-9b?
kevinlch@reddit
Please update to us how it feels
Glittering_Focus1538@reddit (OP)
please do!
overand@reddit
I was going to suggest "You might not want to use the face of YouTube star 'Markiplier' as your profile photo on GitHub if you want to be taken seriously," but in further consideration, I think maybe it's a good thing.
Glittering_Focus1538@reddit (OP)
I've heard that before, but I'm not a very serious person. I just do fun opensource projects so..
overand@reddit
What are these
.msfiles? (GitHub unhelpfully thinks they're "MAXScript" files, but I'm pretty sure this isn't a framework for 3D Studio MAX!)They look like JS to me, but that's not how they're named - and this is the first time I've seen that extension used. Granted I'm not a JS developer, but this smells odd to me.
Glittering_Focus1538@reddit (OP)
If you're interested, it's a heavily modified fork of my open-source BoneScript, which I call MarrowScript (or .ms).
MarrowScript is a declarative system language I built for designing and generating complete AI agent architectures. You describe what you want in a .marrow file at a high level, and the toolchain turns that into a full working runtime with all the orchestration, reliability, and control systems baked in.
BoneScript is the public-facing, lighter subset focused mainly on backend generation. MarrowScript is the full internal version that lets me declare and generate entire agent systems end-to-end.
It’s what powers the architecture behind SmallCode. Pretty fun project.
R_Duncan@reddit
I was just noticing last opencode version has toolcalling issues even with Qwen3.6-35B-A3B (issues that weren't there a month ago)
Glittering_Focus1538@reddit (OP)
cool? lol
sillib@reddit
What benchmarking did you use? Mbpp?
WebOsmotic_official@reddit
this is the right direction imo. small models don’t need “better vibes,” they need fewer chances to wander off. compound tools + instant lint/compile feedback is basically putting guardrails where the model is weakest.
elusznik@reddit
it’s not a 4B model but a 26B-A4B model which changes the situation significantly
Glittering_Focus1538@reddit (OP)
gemma 4 e4b not a4b
elusznik@reddit
sorry about that then, far more impressive
cj886@reddit
so how do you get it to work? ive set the env file, and even updated the toml file as per your instructions, all i get is this "" ✓ ✗ Failed to parse URL from undefined/v1/chat/completions"""
Glittering_Focus1538@reddit (OP)
UHHHH... do an npm -g install again, sorry.
cj886@reddit
thanks! look forward to seeing what it can do :)
Desther@reddit
Where do I set the ip for lm studio? Im getting "Cannot reach LM Studio at http://10.0.0.20:1234/v1"
Glittering_Focus1538@reddit (OP)
you need to edit these files.
Create a
.envfile in your project directory:Or
smallcode.toml:If you're running LM Studio on a different machine (like
10.0.0.20), change the IP accordingly:Make sure you're on
smallcode@0.2.7or later (npm install -g smallcode@latest).Earlier versions had a hardcoded IP that's since been removed.
Desther@reddit
Thanks I had already made both edits it wasnt working. @latest worked.
Had another error after starting at line 167 of smallcode.js TypeError: Cannot read properties of null (reading 'stdout'). My clanker added a null check and its working now
Glittering_Focus1538@reddit (OP)
xD thanks for the info.
dinerburgeryum@reddit
“4B (active) parameter model.” Not even saying the claims are untrue, but the title leaves out some pretty important detail.
Glittering_Focus1538@reddit (OP)
Sorry about that, I was and still am pretty tired and wasn't thinking. Rushed a bit. Gemma 4 e4b
dinerburgeryum@reddit
Oh! Wow I assumed it was the 27B-A4B model not E4B! Well then that’s pretty impressive. Yeah I’ll be taking this for a drive for sure.
shuwatto@reddit
Is it the only way to setup a local llama.cpp server to put
.envin each projects?Glittering_Focus1538@reddit (OP)
yes?
shuwatto@reddit
I see, thanks.
professormunchies@reddit
These models have likely been bench Maxxed for swebench. A better dataset now is rebench v2 by nebius. I was finding for gemma4-31 just because it has a 100% patch rate doesn’t mean it was 100% pass rate, it was usually in the 70-80s for swebench and <10 for rebench.
jfowers_amd@reddit
So you’re saying it could run on AMD NPU, interesting!
Glittering_Focus1538@reddit (OP)
If your NPU can serve an OpenAI-compatible endpoint (llama.cpp, Ollama, or any server that exposes
/v1/chat/completions), then yes. SmallCode doesn't care what hardware runs the model. It just talks HTTP to whatever endpoint you point it at.jfowers_amd@reddit
Yes, it does speak OpenAI. This will be an interesting experiment :)
geekynerd44@reddit
Will definitely take a look, most of my local model for coding experiment success has come from using Aider, but that is not agentic, every agentic tool felt too heavy (16k system prompt to start with) and small models struggled with those.
Are you planning to add web search? Being able research latest documentation makes a huge difference.
Glittering_Focus1538@reddit (OP)
bookmarking this, can definitely get that working.
geekynerd44@reddit
Another issue I have faced is find and replace for small models (it could be a model specific issue), either the model fails to generate the correct payload for find and replace tool or find fails.
One approach is to always to rewrite the entire file with changes at the cost of slower task completion.
How does your single tool handle find and replace?
Glittering_Focus1538@reddit (OP)
When the find fails,
read_and_patchreturns the actual file content in the error so the model can self-correct in one retry. Ifold_strmatches multiple locations, it rejects with "include more context" to force specificity. Full file rewrite exists as a fallback but small models truncate long files, so patch is safer. Either way, the improvement loop catches compilation errors from whatever the model produces and feeds them back automaticallyehiz88@reddit
if my opencode works well with small models through llama cpp server why should i use this instead?
Glittering_Focus1538@reddit (OP)
You'd get better results with the same model. SmallCode auto-validates every edit (catches broken code before you see it), uses less context via code graph retrieval instead of full file dumps, and recovers from failures automatically instead of just showing you the error. You keep the same llama.cpp server, same model and get a more reliable output with fewer wasted tokens.
Spectrum1523@reddit
I am so tired of cliche LLM reddit posts.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
LittleCelebration412@reddit
What benchmarks did you run?
South_Hat6094@reddit
the tool call reliability is what kills small models in agent loops more than raw intelligence. breaking tasks into tighter steps instead of expecting multi-hop reasoning is the real fix.
Majestic_Tailor8036@reddit
Impressive results with a 4B model. The trend toward smaller specialized models makes a lot of sense for local deployment. What benchmark did you use for the 87% score?
celsowm@reddit
Amazing job! Congrats! How does it deals with context window?
Glittering_Focus1538@reddit (OP)
Three layers: Auto-compacts history when it gets long, code graph returns only relevant symbols instead of full files, and patch edits send just the changed lines (\~20 tokens vs \~800 for a full file). Most sessions stay under 32k even with heavy tool use.
celsowm@reddit
Does it detect the ctx using models route?
Glittering_Focus1538@reddit (OP)
Not auto-detected yet it budgets to 70% of whatever context window you configure in
smallcode.tomlauto-compacts history, and summarizes large files. The model router can route to different models by task complexity but doesn't auto-read context limits from the endpoint yet. That's on the roadmap.marscarsrars@reddit
Is it open source?
Is It privacy focused ?
Glittering_Focus1538@reddit (OP)
yes, it's open source and privacy focused. All files stored locally forever.(unless you use the optional escalation feature in which case ur fucked mi amigo)
Future_Manager3217@reddit
The architecture choices make sense for small models: fewer sequential tool calls, immediate compile/lint feedback, patch-first edits, graph search.
The part I’d want before trusting the 87% is a reproducible harness: frozen repos, published task set, pass/fail criteria, raw transcripts, and the same tasks run through OpenCode/Pi with the same model. Otherwise it’s hard to separate a real harness win from a well-shaped private benchmark.
Pleasant-Shallot-707@reddit
This is pretty cool. I like how you can augment with cloud AI automatically when the model determines its in over its head. That’s a great way to sip expensive tokens to ensure you’re still being productive. I also like the idea of compound tools.
Glittering_Focus1538@reddit (OP)
It's more like, model falls, trips and then trips again, we give it an easier task and if it still trips we then escalate to the cloouuuuud
ps I thought that in the toy story aliens "the clawwwww" meme voice.
reginakinhi@reddit
You mention escalation as a feature. In itself, it's of course a workable idea, but I would rather hope that it wasn't enabled for benchmarks, was it?
Glittering_Focus1538@reddit (OP)
no it wasn't, that wouldn't be fair lol.
vitordeas@reddit
Very smart approach, it seems exactly what we need for local models, well done!
I really want to see the results on some comparable benchmarks
solarmass@reddit
Github reports that you built this using 53% MAXScript. I am curious of that choice.
Glittering_Focus1538@reddit (OP)
GitHub is misidentifying the
.msfiles — they're not 3ds Max MAXScript. It's a typed module format for internal declarations. The actual runtime is plain JavaScript. GitHub just doesn't recognize the extension and picks the closest match.those infinity stones are mine sir, no looksies
solarmass@reddit
That's good to hear. Opening a new wave of Max Viruses. Not to mention the code bending you would have to have done to complete this.
Glittering_Focus1538@reddit (OP)
I locked myself in a cave with a box of scraps... and- is that a poptart?
solarmass@reddit
LOL!
But in some seriousness you might add some content to the security and policy section. It looks like this was a personal project and checking all vulnerability boxes can be a burden, but I would just state that. I could throw it in one of my VMs without too much worry. But I could never even try to pull this onto anything at work. I think the inspection tools would just reject it because they do not identify the filetype.
Glittering_Focus1538@reddit (OP)
Thanks for the suggestion tho fr
Glittering_Focus1538@reddit (OP)
-Unknown- _error_ _error_ TOO COMPLEX MUST SELF DISTRUCT!!!
Glittering_Focus1538@reddit (OP)
If you're interested, it's a heavily modified fork of my open-source BoneScript, which I call MarrowScript (or .ms).
MarrowScript is a declarative system language I built for designing and generating complete AI agent architectures. You describe what you want in a .marrow file at a high level, and the toolchain turns that into a full working runtime with all the orchestration, reliability, and control systems baked in.
BoneScript is the public-facing, lighter subset focused mainly on backend generation. MarrowScript is the full internal version that lets me declare and generate entire agent systems end-to-end.
It’s what powers the architecture behind SmallCode. Pretty fun project.
JsThiago5@reddit
How Decompose on failure work? Do you call the model to decompose the problem into TODOs?
Glittering_Focus1538@reddit (OP)
It's all system logic. No extra model call. The governor picks a strategy based on what failed (file size, error count, retry history) and injects a system message telling the model how to break it apart "fix one error at a time," "extract working code first," or "rewrite from scratch." Cheap and fast, no separate planning inference.
ab2377@reddit
opencode does NOT fall apart with 4b models, i don't know what you talking about.
hidden2u@reddit
Rule 3: your post writing is ai slop Rule 4: you basically only self promote across subs
ziphnor@reddit
Sounds a bit like https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent?triedRedirect=true ?
This new trend is super exciting for local llms.
ConfusedLisitsa@reddit
If the idea is combining many tools into just one I understand you will end up with a very large numbers of final tools
The model itself has then to choose the correct one between all of the available ones
So I wonder how does this not actually make the model fail at just choosing the correct tool to call
Glittering_Focus1538@reddit (OP)
I actually found the solution to that here: https://github.com/atripati/ark
The model never sees all tools at once, 2-stage routing shows categories first 5 choices, then only the tools in that category. Compound tools replace sequences, they don't add to the count. Total stays around 15 which small models handle fine.
minus_28_and_falling@reddit
What benchmarks?
xeeff@reddit
dishonest advertising/comparison.
just no
Fast-Satisfaction482@reddit
Gemma 26B-A4B is not a 4B model.
Glittering_Focus1538@reddit (OP)
gemma 4 e4b, not a4b my friend. 8b total parameters.
Fast-Satisfaction482@reddit
Why don't you just write what model you actually want instead of being super cryptic? "Gemma 4 model that only activates 4B parameters per token. OpenCode scores ~75% with 14B models." Does not clear things up.
Skyne98@reddit
It's pretty clear on the screenshot?
Fast-Satisfaction482@reddit
Can you read the text on the screenshot? I can't. It's displayed too mushy for any text to be visible.
Skyne98@reddit
Yes, totally readable for me, especially if you open the image
Look_0ver_There@reddit
While that may be true, he does still have a valid point. If you're presenting a new utility, why be wishy-washy about what models are being used in the opening summary? Why should people need to "read the fine print" with regards to which model was used, especially for a project whose very claim is "Males small models work better!".
I get what you're saying, but this really feels like information that should be clear, front, and center since it's pivotal to the whole point of the project
Skyne98@reddit
Based on the screenshot it's not Gemma 26B it's Gemma E4B, which IS and "effective" 4B model.
dyeusyt@reddit
I recently finetuned a Qwen3-4B for Next.js & shadcn generation, and was prompting it using a Gemma4 e2b (like it understands user needs and drafts a blueprint of the page)
This problem of SLM-generated code having syntax issues is real. I'm wondering if I could delegate code gen to another model using your arch; if yes, then I could probably swap out LangGraph in my Electron app with it.
https://github.com/iamDyeus/qwendean
Glittering_Focus1538@reddit (OP)
Do whatever you want, it's open src and MIT. If you need my help let me know!
NandaVegg@reddit
Very interesting work. Though the repo looks fairly vibecoded at a glance, explanations made by the OP and intention for each feature (with nice fallbacks) makes very much sense.
Smaller models are typically post-trained much more with synthetic data from larger models than direct RL optimization (it's generally harder to make a 4B-sized model converge on diverse tasks that way), but those synthetic data generation pipeline tends to result in a single or two-turn instruction rather than a long multi-turn actions. So I can see how "tool call packing" helps.
Even in general chatbot-type use, copy pasting the list of current variables in every user instruction (and subsequently removing it from past instruction) helps a lot. It's like always assuming the model only has attention span of 2-3 turns, and was the case for most models before agentic RL.
Sad_Initiative133@reddit
!Remindme in 3days
RemindMeBot@reddit
I will be messaging you in 3 days on 2026-05-21 09:10:29 UTC to remind you of this link
CLICK THIS LINK to send a PM to also be reminded and to reduce spam.
^(Parent commenter can ) ^(delete this message to hide from others.)
ganonfirehouse420@reddit
I wonder wonder if it good enough for coding in python and go.
Glittering_Focus1538@reddit (OP)
Go has mixed results for gemma 4 e4b, but pythron results were almost perfect. Give it a shot and let me know how it goes!
Kodrackyas@reddit
This is definitelly a trend, nice to see, also i agree on letting the ai test and proceed so you can measure buils but you cant measure quality on the finished builded code, do you have any process for that?
Glittering_Focus1538@reddit (OP)
I'm glad you asked, we actually track that with several metrics.
Structural checks, the verifier catches placeholders (// TODO, // ...), truncated output, unbalanced braces, and incomplete implementations before delivering code. If the model wrote a lazy skeleton, it gets rejected and re-prompted.
Governor scoring, every tool call is tracked with Bayesian confidence. If the model consistently produces bad patches or writes code that fails validation, it learns to avoid those patterns for that task type. Over time it routes toward approaches that produce working code.
BoneScript for backends, instead of letting the model freestyle Express routes (which leads to inconsistent quality), it writes a declarative spec and the compiler generates the implementation deterministically. The quality is guaranteed by the compiler, not the model.
Memory, conventions and decisions persist across sessions. Once you tell it "always use error boundaries" or "follow this naming pattern", it loads those as efficient context on every future task.
What we don't have yet: automated style/complexity scoring (cyclomatic complexity, duplication detection) or human-eval-style "does this actually solve the problem correctly" verification beyond compile+run. That's on the roadmap, likely as a plugin that runs static analysis post-generation.
Honest take: for a 4B model, "it compiles, runs, and passes basic tests" is already a high bar. Quality beyond that (clean architecture, proper error handling, idiomatic patterns) is where escalation to a stronger model helps. The local model handles the 80-90% of straightforward tasks, and Claude/GPT handles the 10-20% that need more sophistication.
BillDStrong@reddit
Does this work better using something like Qwen3.6-35B-A3B or Qwen3.6-27B?
I would think the same tools would make working with the larger models more efficient as well, reducing complexity they can use elsewhere.
Guess I need to try it.
Glittering_Focus1538@reddit (OP)
The better the model, the better it will run. It should be excellent for Qwen3.6 a3b.
dark-light92@reddit
Where did you find Qwen 3.6 9b to benchmark?
Glittering_Focus1538@reddit (OP)
sorry that was a typo, it was https://lmstudio.ai/models/qwen/qwen3-vl-8b
No_Field3913@reddit
Im doing something related but havnt touched the tool side yet, first getting my flow working work OC as harness then bench various custom stuff :)
Will try this one out :) too
Glittering_Focus1538@reddit (OP)
It's opensrc, feel free to make a fork and improve on the design!
robberviet@reddit
Sounds overfit. However yes if it's fit your needs then it's the best!
Glittering_Focus1538@reddit (OP)
overfit? maybe, not everyone has 2-10k to drop on a hobbie, and other solutions werent working as well as I wanted.
_mayuk@reddit
How is the work flow to per example if I have a small repo … how I would initialize the graph db of it ?
Glittering_Focus1538@reddit (OP)
It's automatic, you don't need to do anything.
When SmallCode boots, it:
package.json,Cargo.toml,go.mod,src/dirs)Takes around 100ms for a small repo, a few seconds for larger ones. After that, when you ask "how does auth work?" it searches the symbol graph instead of grep-reading every file. If you want to re-index (say you pulled new code), just restart SmallCode or use
graph_search,it checks staleness automatically. The graph DB stays in your project folder (gitignored by default). Nothing leaves your machine._mayuk@reddit
Another maybe dumb questions … I guess small code handles the updates of the graph.db after any major changes right ? C:
Glittering_Focus1538@reddit (OP)
It does update as you work yes, it doesn't only store code, it also stores things you tell it to remember with the same system. So you get almost perfect multi session memory.
_mayuk@reddit
Bro I think I love you … you basically solve all my problems … I think I would try to test it today … I was about to got to sleep but … I doubt I can go to bed without testing this c:
Glittering_Focus1538@reddit (OP)
Please test it and give me your feedback!
_mayuk@reddit
🥹 beautiful … i was so annoyed with megamem … even using Gemini api it felt so annoying …to set up …this can be huge !!!
Great work c:
MattOnePointO@reddit
Well done. Thanks for sharing this with the community!
Distinct_Lion7157@reddit
can you use several real benchmarks and not one you created please
trajo123@reddit
Ah, the good old "trust me bro" benchmark! I know it's good to jump into building an idea that you had, but investing some time into more standard benchmarks will pay off either by making you realize that the problem is harder than you initially thought or properly quantifying the improvement your solution provides, giving much more credibility/popularity to your project.
_mayuk@reddit
Look quite cool c: