Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B
Posted by sdfgeoff@reddit | LocalLLaMA | View on Reddit | 85 comments
So I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. Here are some images to allow you to make subjective opinions. Still working on getting automated evaluation.
Things I noticed not present in the images:
- Opencode can search the internet by default. This made it's results way better on some tasks. Eg the 3d printer explainer page it listed specific filament temperatures etc.
- On webdev, opencode delivered really good results. You can't interact with them from here, but it made cool interactive widgets that worked really well.
- The model really struggles with Github Copilot. It generally takes half a dozen tries to write a file. It keeps mucking up copilots file editing tools. Doesn't have this issue with other harnesses. Claude code, pi and opencode all take 4 LLM requests to create the pelican.svg. Github copilot takes 13! It tries the edit tool, it tries bash, it tries the edit tool again. Whatever tool schema they use, in my tests the LLM really struggles. This makes it really slow as it has to regenerate the same diffs again and again.
- Qwen3-vl-4 looped endlessly in OpenCode, couldn't even write a the pelican.svg file to disk.
gthing@reddit
So did opencode just find a better example online and copy it?
bnightstars@reddit
ะขhe Harnes makes an insane difference I have my Qwen3.6-35 connected in Copilot and Claude Code and the difference in output between the 2 is night and day. I hate with a blind passion any cli written on nodejs especially after the GitHub incident but Claude Code is not to be denied. Sadly it's probably the most token heavy Harnes on the planet. Who the fuck has a 40000 tokens system prompt !
Late_Film_1901@reddit
Maybe it's just me but I don't get which harness is better. Do you mean Claude code is much better than copilot?
my_name_isnt_clever@reddit
IMO none of the projects made by the major players are the best for local models, we have very different contraints than API services.
Pi is becoming the standard since it's so minimal, though there are a few other projects focused on smaller models. Even OpenCode targets the frontier.
bnightstars@reddit
Same task with Qwen3.6-35B Claude Code delivered while Copilot entered a loop that couldn't escape. Overall Claude Code has more tools and better prompts that work well even with an open source model.
StereoWings7@reddit
What do you mean GitHub incident in this context? Sorry for being ignorant but Iโm not as tech saggy as other guys in this sub.
nymical23@reddit
Most of us are tech-saggy in this sub, only a few are tech-savvy.
StereoWings7@reddit
Ah English is not my first language I just have accidentally picked an incorrect word perhaps because I watched a Family Guyโs saggy-naggy clip before posting it but it seems it somehow makes sense lol.
nymical23@reddit
Yeah, I get it. It isn't my first language either, but the contrast between saggy and savvy was too funny to let go. :)
bnightstars@reddit
Github got hacked yesterday because of npm package been hijacked !
LightBroom@reddit
Try https://github.com/tontinton/maki
jacek2023@reddit
Try multiple times, it can be just a normal variancy
sammybeta@reddit
"hey look at my one shot it's wonderful/terrible"
some_user_2021@reddit
I hate being bipolar, it's awesome!
FrostTactics@reddit
Agreed. Especially with the svg drawing, I can't imagine the harness actually matters much.
sdfgeoff@reddit (OP)
Yep, the pelican is mostly representative of the model not the harness, but it's the one that makes the prettiest picture to post on the internet. If you have other suggestions that are both easy/pretty to visualize and test the harness more than the model, I'm keen to hear them!
sdfgeoff@reddit (OP)
I did run the tests multiple times, but I didn't present all the pictures here. There's only so many you can visually compare! (I am also currently adding evals that aren't subjective so they can be compared across multiple runs),
For what it's worth, the conclusions I mention (opencode's internet access providing more detailed outputs, and opencode producing better interactive content, github copilot being nearly unusable for the model) appear to be very consistent across runs.
pulse77@reddit
\^\^\^ This is the only right answer! There is so much randomness involved that one can not judge by single-shot picture! At least 10 attempts should be made and shown for each harness, then we can judge...
Jeidoz@reddit
I believe "randomness" can be removed if would be used the right value of temperature or/and seed value. ๐
pulse77@reddit
This is not possible, because each harness sends different prompts to the LLM - before/after the user-given prompt. If each harness would sent exactly the same prompts, and in the same order, and if temperature would be set to 0, and if seed would be set to the same value, and if GPU task reordering would be disabled - then we would get exactly the same result with each AI assistant...
MerePotato@reddit
Yup, and you'd see poor performance then anyway since reasoning chains rely on the noise introduced by an elevated temperature to perform optimally
challis88ocarina@reddit
Locking in the seed is the first step. Diffusion models become deterministic at this stage. LLMs rarely come with the harness to lock everything in. I know where my money is.
rditorx@reddit
But then you'd still not know whether it's the harness or the exact prompt and the token suggestions controlled by the probability distribution. With a different seed, you might get better results with some prompts than others.
Lissanro@reddit
No, different agent frameworks mean different prompt, so results will be different no mater what. Also, there is a chance to pick a seed that produces bad results for one agent but good result for the other agent. The only way to test this, is to do it multiple times with different seeds per agent. At very least 3x3 grid for each, or even 5x5 (depending on how much variation there is).
LosEagle@reddit
Pi did this without extensions?
Silver-Champion-4846@reddit
Can someone describe the image please? Blind guy here
UmpireBorn3719@reddit
could you share your prompt please
shanehiltonward@reddit
OpenCode rocks. Terminal 1 = llama.cpp Terminal 2 = OpenCode.
Icy-Marzipan-2605@reddit
so they all were using same LLM under the hood right?
sdfgeoff@reddit (OP)
Yep, all with the same model. All Qwen3.6
My aim was to determine what difference the harness makes.
Glittering_Focus1538@reddit
No wonder I liked using copilot, too bad they perma ban fast for alts. Also not surprised pi agent is doing so well, can you test https://github.com/Doorman11991/smallcode ?
sdfgeoff@reddit (OP)
You get one of the best pelicans yet.
Clocking in at 2:9s with 6 requests and 3386 output tokens.
Glittering_Focus1538@reddit
Not bad! not bad at all!
Glittering_Focus1538@reddit
Thats half the output tokens with only 2 more prompts. Not bad.
sdfgeoff@reddit (OP)
Sure! I've actually had that one on my list since your post a couple days back. One thing I haven't figured out (I haven't looked very hard yet) is how to set an API key for the local model. I couldn't see an easy place in the `.env.example` file
Glittering_Focus1538@reddit
# โโโ API Keys โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# Required when using a cloud provider (OpenAI, OpenRouter, DeepSeek, Anthropic)
# Also enables auto-escalation on hard fail when using a local model
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# DEEPSEEK_API_KEY=sk-...
#
# Override default escalation model:
# SMALLCODE_ESCALATION_MODEL=claude-sonnet-4-5
and
SMALLCODE_MODEL=your-model-name
SMALLCODE_BASE_URL=http://localhost:1234/v1
SMALLCODE_PROVIDER=openai
put these anywhere in your .env and you should be alright.
my .env is just this.
SMALLCODE_MODEL=huihui-gemma-4-e4b-it-abliterated
SMALLCODE_BASE_URL=http://10.0.0.20:1234/v1 (this is local lmstudio)
SMALLCODE_PROVIDER=openai
sdfgeoff@reddit (OP)
As far as I can tell that allows setting keys for providers/fallback models, but not for the primary model? Or am I misunderstanding something?
Glittering_Focus1538@reddit
this is the main setup
SMALLCODE_MODEL=huihui-gemma-4-e4b-it-abliterated
SMALLCODE_BASE_URL=http://10.0.0.20:1234/v1ย (this is local lmstudio)
SMALLCODE_PROVIDER=openai
and ย OPENAI_API_KEY=sk-...
that should be all you need
Demonicated@reddit
Also interested in the results
artisticMink@reddit
So - what's your samplers? Did you make n generations and somehow these pictures are the average?
If not, you just hit the slot machine four times and are now presenting four different outcomes.
sdfgeoff@reddit (OP)
I did run the tests multiple times, but I didn't present all the pictures here. There's only so many you can visually compare! (I am also currently adding evals that aren't subjective so they can be compared across multiple runs),
For what it's worth, the conclusions I mention (opencode's internet access providing more detailed outputs, and opencode producing better interactive content, github copilot being nearly unusable for the model) appear to be very consistent across runs.
artisticMink@reddit
To evaluate the capability of a model you cannot use the suggested samplers. A temperature of > 0 will by necessity always be random. So you're comparing a coin flip against a coin flip.
Now, you can do that - if you run *a lot* of samples and evaluate them. For example trough automation by a vision model. You need an average and compare these averages. And even that's very simplified because i'm not a statistics person.
What i want to say is: It's neat that you've done it and i suggest you keep doing it, everyone has to start somewhere, but it's by no way conclusive and your conclusion is a result of your own bias.
techlatest_net@reddit
yeah mcp auth is a mess rn. we just wrap servers with a simple proxy for api keys + rate limits. not perfect, but stops accidental disasters. per-dev scopes help tooโonly give access to what folks actually need. anyone using something better than homegrown middleware? would love to steal a setup.
MomentJolly3535@reddit
would have been cool to include time per task aswell, basic tasks that takes me 2-3min on PI code, takes me easily 10-12 minutes on Claude Code
sdfgeoff@reddit (OP)
I added some stats to the first post
Future_Manager3217@reddit
Cool experiment. The useful split here is not just "which harness produced nicer screenshots", but where the harness spent work.
If you rerun it, Iโd log per run: total requests/tokens, invalid or failed tool calls, file edit retries, wall time, and whether an acceptance check passed. Then run 5โ10 seeds/sessions per harness on the same task.
Copilot taking 13 calls vs 4 elsewhere is already a harness signal; it just needs variance around it so people donโt dismiss it as a one-shot screenshot.
sdfgeoff@reddit (OP)
Added some stats to the first post. But yep, should run it more times to see variance.
MaCl0wSt@reddit
OpenCode's is having the time of his life
sdfgeoff@reddit (OP)
He sure is a happy one!
soyalemujica@reddit
I'd test this with overriding the current seed to a static one in all runs, because seed variance and random is what brings different results each time.
indicava@reddit
The harness makes a HUGE difference, butโฆ I would argue your tests better gauge the modelโs adaptability and generalization with a harness rather than a โharness benchmarkโ. Models work best with the harness they were RLโd with.
Also, it would be interesting to see qwen-code harness output in your benchmark being at its (probably) closest to the harness your test model (Qwen3.6) was trained on.
sdfgeoff@reddit (OP)
At this scale I think you're right - with only 4 harnesses and one model and a handful of test cases, it's mostly a how-well-does-this-model-adapt-to-these-harnesses. However, if I run it with 20 different models and 20 different harnesses and a bunch of different tests, then it may start to show trends like "these models generally do better at agentic coding" and "these harnesses generally produce good results even with weak models"
Adding qwen-code is a good idea. I'll add it to my list.
Mickenfox@reddit
I use Copilot with Claude and it still constantly tries to read and write files using the terminal rather than the tools it has.
sdfgeoff@reddit (OP)
I did some digging on why: https://www.reddit.com/r/GithubCopilot/comments/1tji9uy/many_llms_struggle_with_copilots_apply_patch_tool/
__Maximum__@reddit
I think without harness it will do better than with copilot. Copilot should be the new baseline.
sdfgeoff@reddit (OP)
I did some digging: https://www.reddit.com/r/GithubCopilot/comments/1tji9uy/many_llms_struggle_with_copilots_apply_patch_tool/
JollyJoker3@reddit
I wonder what's wrong with Github Copilot? Even with Sonnet 4.6 I've seen it fail to edit a file and resort to using Powershell to make it work. Which requires user acceptance.
sdfgeoff@reddit (OP)
I just had a quick look at the logs, and it looks like the "applyPatch" tool doesn't operate on JSON, which is different to like every other tool.
Most tools you provide the input as JSON:
{"arg1": "val1", "arg2", "val2"}but the patch tool it just expects the raw diff. Rather unhelpfully (to the model), it errors withThe patch must start with "*** Begin Patch, which, from the models perspective, it already does. It's sending:{"input":"*** Begin Patch\n....Admittedly, in the tool description is states "This is a FREEFORM toll, so do not wrap the patch in JSON", but apparently that isn't enough.
sdfgeoff@reddit (OP)
Just posted those results on the copilot reddit: https://www.reddit.com/r/GithubCopilot/comments/1tji9uy/many_llms_struggle_with_copilots_apply_patch_tool/
Heinz2001@reddit
I think that when evaluating agents, you need to focus on efficiency rather than results, since the latter depend heavily on the cost of the large language model.
So count the Context Usage, Iterations to pass, Tool calls and possibly Quality like Test counts.
Here's a quick, simple test. Just say โdo plan_v4.mdโ and you're done.
https://github.com/fischerf/aar/blob/develop/docs/testplans/testplan_v4.zip
Here are my results comparing (Claude Sonnet 4.6):
- VSCode Agent
- ClaudeCode VSCode
- ZED Agent
- and my own AAR Agent.
https://github.com/fischerf/aar/blob/develop/docs/testplans/Agent_Benchmark_Comparison.md
uti24@reddit
Qwen3.6 does that, too. At least GGUF variant.
If you want to test harness for real you need multi turn task, like 10 turns over 100k context. That's where Qwen3.6 start failing for me (well it start failing at 50k, but for purpose of benchmarking...)
TripleSecretSquirrel@reddit
For what itโs worth, it may be helpful to break things into smaller, atomic tasks and refresh context each time.
Iโm using a gguf of Qwen 3.6:27b run big locally. Iโm using it for coding, though nothing enormous or terribly complex. After very thorough planning, I have an orchestrator agent that knows the full context of the project who then spins up a sub-agent to tackle the first task. Once the first task is done, that sub-agent spins down and a new one is spun up to replace it. Rinse and repeat.
Itโs a little tricky at first because I only have enough gpu to run one agent at a time, so the two agents share the same model weights so they stay loaded into vram the whole time, but with each agent having their own kv caches.
My problem was mostly speed. My gpu gets real slow once the agent has more than \~50k of context, so I just have a system where they donโt ever hit 50k and voila, itโs much faster and way more automatable.
ortegaalfredo@reddit
Its' basically the same SVG. Agents are just a tiny layer over the LLM, particularly those coding agents that are just glorified 20-line ralph loops plus spyware.
Fast-Satisfaction482@reddit
GPT-5.4 regularly has to retry file edits in co-pilot. Really stunning in my opinion. They seem to have the policy that they give a strict schema for interaction and then the model has to exactly comply, with zero error recovery on the side of the harness.ย
nuclearbananana@reddit
every harness does this. That's how tools work. It's not usually a problem, especially with constrained decoding
sdfgeoff@reddit (OP)
With Qwen3.6-27B claude-code, opencode and pi all succeed in the tool call first time. Copilot fails like 7 times in a row before figuring out how to edit a file. This points at the harness being the problem. No idea if the issue is bad prompting or bad harness design, but there's clearly something fishy going on.
Fast-Satisfaction482@reddit
Opencode uses the jsonrepair library to fix schema errors, so your statement is false.
kvothe5688@reddit
have you tried running test multiple times in same environment in different sessions?
sdfgeoff@reddit (OP)
Yep, the conclusions I post in the original post (about github copilot causing issues and opencode giving nice detailed interactive content) are pretty consistent across runs.
moahmo88@reddit
Good job!Thanks!
Interesting_Key3421@reddit
what about the token used? with local models in my tests, Pi is very fast and use less tokens because of the minimal system prompt
sdfgeoff@reddit (OP)
Yeah, there are big differences in the number of tokens used. I'm still building out the metrics and will make another post when I have more data.
Separate-Forever-447@reddit
please try with frogs, ducks, cats and maybe a cow, so that we can tell what's going on
leo-k7v@reddit
Hmm. I tried QwenPaw 9B Q4 dense w/o any agents at all and spits out pelican.svg with exactly same picture as a text file triple backticked with svg type. I thought itโs standard picture from svg training set and most of the models know it by heart. I might be wrong (I often am)โฆ
bonobomaster@reddit
Did you set temperature to zero and locked a specific seed?
For my understanding, you have to set a fixed seed and temperature=0 to make this test meaningful.
the-username-is-here@reddit
One-shots. Means nothing.
zoyer2@reddit
I need images of 10 attempt each harness
somerussianbear@reddit
Pi shows what it means when it says its proposal is to be simple.
R_Duncan@reddit
Wait, is not 100% clear.... did all the harness used Qwen3.6 27B as model? Quantization/inference engine used?
Yes_but_I_think@reddit
I would really like to see reliability tests. Tried 10 times the same thing. This harness gave usable results 8 / 10 times, etc.
kfl@reddit
Have you seen https://github.com/cartazio/benchkit_for_harnesses?
It also try to assess the the effect of the harness/coding agent.
szansky@reddit
omg I love Qwen so much, this is so amazing model and incredible we can run its on 1x 3090 ๐ฎ
vanbukin@reddit
Try setting up https://github.com/ogx-ai/ogx in front of your vLLM instance. You can disable embeddings, reranking, and vector search - keeping only the main model enabled. PostgreSQL works well as the database backend.
Protopia@reddit
The harness is IMO probably way more important than the LLM.
What about Goose, Hermes, BMAD, Superpowers, GSD, etc.?
burdzi@reddit
this is a really nice test! thanks! It seems to matter a lot what one uses. Fascinating...