Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B | TheaterFire

Posted by sdfgeoff@reddit | LocalLLaMA | View on Reddit | 85 comments

So I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. Here are some images to allow you to make subjective opinions. Still working on getting automated evaluation.

Things I noticed not present in the images:

- Opencode can search the internet by default. This made its results way better on some tasks. E.g. on the 3D printer explainer page, it listed specific filament temperatures etc.
- On webdev, opencode delivered really good results. You can't interact with them from here, but it made cool interactive widgets that worked really well.
- The model really struggles with Github Copilot. It generally takes half a dozen tries to write a file. It keeps mucking up Copilot's file editing tools. It doesn't have this issue with the other harnesses. Claude Code, pi and opencode all take 4 LLM requests to create the pelican.svg. Github Copilot takes 13! It tries the edit tool, it tries bash, it tries the edit tool again. Whatever tool schema they use, in my tests the LLM really struggles with it. This makes it really slow, as it has to regenerate the same diffs again and again.
- Qwen3-vl-4 looped endlessly in OpenCode, couldn't even write the pelican.svg file to disk.

[-]

gthing@reddit

So did opencode just find a better example online and copy it?

[-]

bnightstars@reddit

The harness makes an insane difference. I have my Qwen3.6-35 connected in Copilot and Claude Code, and the difference in output between the two is night and day. I hate with a blind passion any CLI written in Node.js, especially after the GitHub incident, but Claude Code is not to be denied. Sadly it's probably the most token-heavy harness on the planet. Who the fuck has a 40000-token system prompt!

[-]

Late_Film_1901@reddit

Maybe it's just me but I don't get which harness is better. Do you mean Claude code is much better than copilot?

[-]

my_name_isnt_clever@reddit

IMO none of the projects made by the major players are the best for local models; we have very different constraints than API services.

Pi is becoming the standard since it's so minimal, though there are a few other projects focused on smaller models. Even OpenCode targets the frontier.

[-]

bnightstars@reddit

On the same task with Qwen3.6-35B, Claude Code delivered while Copilot entered a loop that it couldn't escape. Overall, Claude Code has more tools and better prompts that work well even with an open source model.

[-]

StereoWings7@reddit

What do you mean GitHub incident in this context? Sorry for being ignorant but I’m not as tech saggy as other guys in this sub.

[-]

nymical23@reddit

Most of us are tech-saggy in this sub, only a few are tech-savvy.

[-]

StereoWings7@reddit

Ah English is not my first language I just have accidentally picked an incorrect word perhaps because I watched a Family Guy’s saggy-naggy clip before posting it but it seems it somehow makes sense lol.

[-]

nymical23@reddit

Yeah, I get it. It isn't my first language either, but the contrast between saggy and savvy was too funny to let go. :)

[-]

bnightstars@reddit

Github got hacked yesterday because an npm package was hijacked!

[-]

LightBroom@reddit

Try https://github.com/tontinton/maki

[-]

jacek2023@reddit

Try multiple times, it could just be normal variance.

[-]

sammybeta@reddit

"hey look at my one shot it's wonderful/terrible"

[-]

some_user_2021@reddit

I hate being bipolar, it's awesome!

[-]

FrostTactics@reddit

Agreed. Especially with the svg drawing, I can't imagine the harness actually matters much.

[-]

sdfgeoff@reddit (OP)

Yep, the pelican is mostly representative of the model not the harness, but it's the one that makes the prettiest picture to post on the internet. If you have other suggestions that are both easy/pretty to visualize and test the harness more than the model, I'm keen to hear them!

[-]

sdfgeoff@reddit (OP)

I did run the tests multiple times, but I didn't present all the pictures here. There's only so many you can visually compare! (I am also currently adding evals that aren't subjective so they can be compared across multiple runs.)

For what it's worth, the conclusions I mention (opencode's internet access providing more detailed outputs, and opencode producing better interactive content, github copilot being nearly unusable for the model) appear to be very consistent across runs.

[-]

pulse77@reddit

^^^ This is the only right answer! There is so much randomness involved that one cannot judge by a single-shot picture! At least 10 attempts should be made and shown for each harness, then we can judge...

[-]

Jeidoz@reddit

I believe the "randomness" can be removed by using the right temperature and/or a fixed seed value. 😅

[-]

pulse77@reddit

This is not possible, because each harness sends different prompts to the LLM, before and after the user-given prompt. If each harness sent exactly the same prompts, in the same order, and if temperature were set to 0, and if the seed were set to the same value, and if GPU task reordering were disabled, then we would get exactly the same result with each AI assistant...
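A toy sketch of that point, not a real inference engine: the `sample_tokens` function and the two hand-written logit tables are hypothetical stand-ins for a model being fed two harnesses' different prompts. It shows that a fixed seed (or temperature 0) makes one harness repeatable, while a different prompt still changes the output.

```python
import math
import random


def sample_tokens(logits_per_step, temperature, seed):
    """Pick one token index per step from toy logits.

    temperature == 0 means greedy decoding (argmax); otherwise sample from
    the temperature-scaled softmax using a seeded RNG.
    """
    rng = random.Random(seed)
    chosen = []
    for logits in logits_per_step:
        if temperature == 0:
            # Greedy: the seed is irrelevant, the largest logit always wins.
            chosen.append(max(range(len(logits)), key=lambda i: logits[i]))
            continue
        scaled = [value / temperature for value in logits]
        peak = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scaled]
        chosen.append(rng.choices(range(len(logits)), weights=weights, k=1)[0])
    return chosen


# Hypothetical logits: the first step differs because the prompts differ.
harness_a = [[2.0, 1.0, 0.5], [0.1, 0.2, 3.0], [1.0, 1.1, 0.9]]
harness_b = [[0.5, 2.5, 0.5], [0.1, 0.2, 3.0], [1.0, 1.1, 0.9]]

same_seed_1 = sample_tokens(harness_a, temperature=0.8, seed=42)
same_seed_2 = sample_tokens(harness_a, temperature=0.8, seed=42)
greedy_a = sample_tokens(harness_a, temperature=0, seed=1)
greedy_b = sample_tokens(harness_b, temperature=0, seed=2)
```

Here `same_seed_1` and `same_seed_2` match, and `greedy_a` is repeatable regardless of seed, but `greedy_a` and `greedy_b` still differ because the inputs differ. Real GPU inference can add further nondeterminism that this sketch ignores.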

[-]

MerePotato@reddit

Yup, and you'd see poor performance then anyway since reasoning chains rely on the noise introduced by an elevated temperature to perform optimally

[-]

challis88ocarina@reddit

Locking in the seed is the first step. Diffusion models become deterministic at this stage. LLMs rarely come with the harness to lock everything in. I know where my money is.

[-]

rditorx@reddit

But then you'd still not know whether it's the harness or the exact prompt and the token suggestions controlled by the probability distribution. With a different seed, you might get better results with some prompts than others.

[-]

Lissanro@reddit

No, different agent frameworks mean different prompts, so results will be different no matter what. Also, there is a chance of picking a seed that produces bad results for one agent but good results for the other. The only way to test this is to do it multiple times with different seeds per agent. At the very least a 3x3 grid for each, or even 5x5 (depending on how much variation there is).
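A rough sketch of what such a grid could look like in a test script, assuming nine seeds per agent (one 3x3 grid of outputs each). `run_grid` and `fake_run_once` are hypothetical names; `fake_run_once` only returns a made-up score and would need to be replaced by something that actually launches the harness and grades the result.

```python
import random
import statistics


def run_grid(harnesses, seeds, run_once):
    """Score every harness once per seed and summarize the spread."""
    summary = {}
    for harness in harnesses:
        scores = [run_once(harness, seed) for seed in seeds]
        summary[harness] = {
            "runs": len(scores),
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
        }
    return summary


def fake_run_once(harness, seed):
    """Stand-in for a real harness run; returns a made-up score in [0, 1)."""
    rng = random.Random(f"{harness}-{seed}")
    return rng.random()


# Nine seeds per agent, i.e. one 3x3 grid of outputs for each harness.
grid = run_grid(["pi", "opencode"], seeds=range(9), run_once=fake_run_once)
```

Comparing the means only makes sense alongside the standard deviations: if the spread within one harness is as large as the gap between harnesses, the grid has not separated them.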

[-]

LosEagle@reddit

Pi did this without extensions?

[-]

Silver-Champion-4846@reddit

Can someone describe the image please? Blind guy here

[-]

UmpireBorn3719@reddit

could you share your prompt please

[-]

shanehiltonward@reddit

OpenCode rocks. Terminal 1 = llama.cpp Terminal 2 = OpenCode.

[-]

Icy-Marzipan-2605@reddit

so they all were using same LLM under the hood right?

[-]

sdfgeoff@reddit (OP)

Yep, all with the same model. All Qwen3.6

My aim was to determine what difference the harness makes.

[-]

Glittering_Focus1538@reddit

No wonder I liked using copilot, too bad they perma ban fast for alts. Also not surprised pi agent is doing so well, can you test https://github.com/Doorman11991/smallcode ?

[-]

sdfgeoff@reddit (OP)

You get one of the best pelicans yet.

Clocking in at 2:9s with 6 requests and 3386 output tokens.

[-]

Glittering_Focus1538@reddit

Not bad! not bad at all!

[-]

Glittering_Focus1538@reddit

That's half the output tokens with only 2 more prompts. Not bad.

[-]

sdfgeoff@reddit (OP)

Sure! I've actually had that one on my list since your post a couple days back. One thing I haven't figured out (I haven't looked very hard yet) is how to set an API key for the local model. I couldn't see an easy place in the `.env.example` file

[-]

Glittering_Focus1538@reddit

# ─── API Keys ────────────────────────────────────────────────────────────────
# Required when using a cloud provider (OpenAI, OpenRouter, DeepSeek, Anthropic)
# Also enables auto-escalation on hard fail when using a local model
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-ant-...
# DEEPSEEK_API_KEY=sk-...
#
# Override default escalation model:
# SMALLCODE_ESCALATION_MODEL=claude-sonnet-4-5

and

SMALLCODE_MODEL=your-model-name
SMALLCODE_BASE_URL=http://localhost:1234/v1
SMALLCODE_PROVIDER=openai

put these anywhere in your .env and you should be alright.

my .env is just this.

SMALLCODE_MODEL=huihui-gemma-4-e4b-it-abliterated
SMALLCODE_BASE_URL=http://10.0.0.20:1234/v1 (this is local lmstudio)
SMALLCODE_PROVIDER=openai

[-]

sdfgeoff@reddit (OP)

As far as I can tell that allows setting keys for providers/fallback models, but not for the primary model? Or am I misunderstanding something?

[-]

Glittering_Focus1538@reddit

this is the main setup
SMALLCODE_MODEL=huihui-gemma-4-e4b-it-abliterated
SMALLCODE_BASE_URL=http://10.0.0.20:1234/v1 (this is local lmstudio)
SMALLCODE_PROVIDER=openai
and OPENAI_API_KEY=sk-...
that should be all you need

[-]

Demonicated@reddit

Also interested in the results

[-]

artisticMink@reddit

So - what are your samplers? Did you make n generations and somehow these pictures are the average?

If not, you just hit the slot machine four times and are now presenting four different outcomes.

[-]

sdfgeoff@reddit (OP)

I did run the tests multiple times, but I didn't present all the pictures here. There's only so many you can visually compare! (I am also currently adding evals that aren't subjective so they can be compared across multiple runs.)

For what it's worth, the conclusions I mention (opencode's internet access providing more detailed outputs, and opencode producing better interactive content, github copilot being nearly unusable for the model) appear to be very consistent across runs.

[-]

artisticMink@reddit

To evaluate the capability of a model you cannot use the suggested samplers. A temperature > 0 will by necessity always be random. So you're comparing a coin flip against a coin flip.

Now, you can do that - if you run *a lot* of samples and evaluate them, for example through automation by a vision model. You need an average per harness, and then you compare these averages. And even that's very simplified because i'm not a statistics person.

What i want to say is: It's neat that you've done it and i suggest you keep doing it, everyone has to start somewhere, but it's by no means conclusive and your conclusion is a result of your own bias.

[-]

techlatest_net@reddit

yeah mcp auth is a mess rn. we just wrap servers with a simple proxy for api keys + rate limits. not perfect, but stops accidental disasters. per-dev scopes help too—only give access to what folks actually need. anyone using something better than homegrown middleware? would love to steal a setup.

[-]

MomentJolly3535@reddit

would have been cool to include time per task as well. basic tasks that take me 2-3 min on Pi easily take me 10-12 minutes on Claude Code

[-]

sdfgeoff@reddit (OP)

I added some stats to the first post

[-]

Future_Manager3217@reddit

Cool experiment. The useful split here is not just "which harness produced nicer screenshots", but where the harness spent work.

If you rerun it, I’d log per run: total requests/tokens, invalid or failed tool calls, file edit retries, wall time, and whether an acceptance check passed. Then run 5–10 seeds/sessions per harness on the same task.
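A minimal sketch of what that per-run log could look like. The field names, the `summarize` helper and all the numbers in `records` are made up for illustration; they are not measurements from this thread.

```python
from dataclasses import dataclass
import statistics


@dataclass
class RunRecord:
    harness: str
    requests: int           # total LLM requests in the run
    output_tokens: int
    failed_tool_calls: int  # invalid or rejected tool calls
    edit_retries: int       # repeated attempts at the same file edit
    wall_seconds: float
    passed: bool            # did the acceptance check pass?


def summarize(records):
    """Group runs by harness and report pass rate plus typical cost."""
    by_harness = {}
    for record in records:
        by_harness.setdefault(record.harness, []).append(record)
    summary = {}
    for harness, runs in by_harness.items():
        summary[harness] = {
            "runs": len(runs),
            "pass_rate": sum(r.passed for r in runs) / len(runs),
            "median_requests": statistics.median([r.requests for r in runs]),
            "mean_failed_tool_calls": statistics.mean(
                [r.failed_tool_calls for r in runs]
            ),
        }
    return summary


# Illustrative, invented numbers only.
records = [
    RunRecord("copilot", 13, 7000, 7, 5, 300.0, True),
    RunRecord("copilot", 11, 6500, 5, 4, 280.0, False),
    RunRecord("copilot", 15, 8000, 9, 6, 340.0, True),
    RunRecord("pi", 4, 3000, 0, 0, 90.0, True),
    RunRecord("pi", 4, 3100, 0, 0, 95.0, True),
    RunRecord("pi", 5, 3300, 1, 0, 110.0, True),
]
summary = summarize(records)
```

Reporting a pass rate and a median per harness, rather than one screenshot, is what lets readers see whether a request-count gap survives across runs.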

Copilot taking 13 calls vs 4 elsewhere is already a harness signal; it just needs variance around it so people don’t dismiss it as a one-shot screenshot.

[-]

sdfgeoff@reddit (OP)

Added some stats to the first post. But yep, should run it more times to see variance.

[-]

MaCl0wSt@reddit

OpenCode's is having the time of his life

[-]

sdfgeoff@reddit (OP)

He sure is a happy one!

[-]

soyalemujica@reddit

I'd test this by overriding the current seed with a static one in all runs, because seed variance and randomness are what bring different results each time.

[-]

indicava@reddit

The harness makes a HUGE difference, but… I would argue your tests better gauge the model’s adaptability and generalization with a harness rather than a “harness benchmark”. Models work best with the harness they were RL’d with.

Also, it would be interesting to see qwen-code harness output in your benchmark being at its (probably) closest to the harness your test model (Qwen3.6) was trained on.

[-]

sdfgeoff@reddit (OP)

At this scale I think you're right - with only 4 harnesses and one model and a handful of test cases, it's mostly a how-well-does-this-model-adapt-to-these-harnesses. However, if I run it with 20 different models and 20 different harnesses and a bunch of different tests, then it may start to show trends like "these models generally do better at agentic coding" and "these harnesses generally produce good results even with weak models"

Adding qwen-code is a good idea. I'll add it to my list.

[-]

Mickenfox@reddit

The model really struggles with Github Copilot. It generally takes half a dozen tries to write a file

I use Copilot with Claude and it still constantly tries to read and write files using the terminal rather than the tools it has.

[-]

sdfgeoff@reddit (OP)

I did some digging on why: https://www.reddit.com/r/GithubCopilot/comments/1tji9uy/many_llms_struggle_with_copilots_apply_patch_tool/

[-]

Maximum@reddit

I think without harness it will do better than with copilot. Copilot should be the new baseline.

[-]

sdfgeoff@reddit (OP)

I did some digging: https://www.reddit.com/r/GithubCopilot/comments/1tji9uy/many_llms_struggle_with_copilots_apply_patch_tool/

[-]

JollyJoker3@reddit

I wonder what's wrong with Github Copilot? Even with Sonnet 4.6 I've seen it fail to edit a file and resort to using Powershell to make it work. Which requires user acceptance.

[-]

sdfgeoff@reddit (OP)

I just had a quick look at the logs, and it looks like the "applyPatch" tool doesn't operate on JSON, which is different to like every other tool.

For most tools you provide the input as JSON: {"arg1": "val1", "arg2": "val2"}, but the patch tool just expects the raw diff. Rather unhelpfully (to the model), it errors with The patch must start with "*** Begin Patch", which, from the model's perspective, it already does. It's sending: {"input":"*** Begin Patch\n....

Admittedly, the tool description states "This is a FREEFORM tool, so do not wrap the patch in JSON", but apparently that isn't enough.
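A minimal sketch of how a harness could tolerate this mistake instead of rejecting it. This is not Copilot's actual implementation: `extract_patch`, the envelope handling and the placeholder patch body (including its closing line) are assumptions for illustration only.

```python
import json

PATCH_HEADER = "*** Begin Patch"


def extract_patch(raw_input):
    """Return the raw patch text, unwrapping a JSON envelope if present."""
    text = raw_input.strip()
    if text.startswith(PATCH_HEADER):
        return text  # already a freeform patch
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None
    if isinstance(payload, dict):
        # Accept {"input": "*** Begin Patch..."} or any similar single wrapper.
        for value in payload.values():
            if isinstance(value, str) and value.lstrip().startswith(PATCH_HEADER):
                return value.strip()
    raise ValueError(
        f'The patch must start with "{PATCH_HEADER}"; '
        "send the patch as plain text, not wrapped in a JSON object"
    )


# Placeholder patch body; the real patch format is not reproduced here.
raw_patch = "*** Begin Patch\n(patch body)\n*** End Patch"
wrapped = json.dumps({"input": raw_patch})

from_raw = extract_patch(raw_patch)
from_wrapped = extract_patch(wrapped)
```

Both calls return the same patch text. Even without the unwrapping, an error message that names the actual problem (JSON wrapping) would give the model something to correct on the next attempt.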

[-]

sdfgeoff@reddit (OP)

Just posted those results on the copilot reddit: https://www.reddit.com/r/GithubCopilot/comments/1tji9uy/many_llms_struggle_with_copilots_apply_patch_tool/

[-]

Heinz2001@reddit

I think that when evaluating agents, you need to focus on efficiency rather than results, since the latter depend heavily on the cost of the large language model.

So count the Context Usage, Iterations to pass, Tool calls and possibly Quality like Test counts.

Here's a quick, simple test. Just say “do plan_v4.md” and you're done.
https://github.com/fischerf/aar/blob/develop/docs/testplans/testplan_v4.zip

Here are my results comparing (Claude Sonnet 4.6):
- VSCode Agent
- ClaudeCode VSCode
- ZED Agent
- and my own AAR Agent.
https://github.com/fischerf/aar/blob/develop/docs/testplans/Agent_Benchmark_Comparison.md

[-]

uti24@reddit

Qwen3-vl-4 looped endlessly in OpenCode, couldn't even write the pelican.svg file to disk.

Qwen3.6 does that, too. At least GGUF variant.

If you want to test a harness for real you need a multi-turn task, like 10 turns over 100k context. That's where Qwen3.6 starts failing for me (well, it starts failing at 50k, but for the purpose of benchmarking...)

[-]

TripleSecretSquirrel@reddit

For what it’s worth, it may be helpful to break things into smaller, atomic tasks and refresh context each time.

I’m using a gguf of Qwen 3.6:27b run big locally. I’m using it for coding, though nothing enormous or terribly complex. After very thorough planning, I have an orchestrator agent that knows the full context of the project who then spins up a sub-agent to tackle the first task. Once the first task is done, that sub-agent spins down and a new one is spun up to replace it. Rinse and repeat.

It’s a little tricky at first because I only have enough gpu to run one agent at a time, so the two agents share the same model weights so they stay loaded into vram the whole time, but with each agent having their own kv caches.

My problem was mostly speed. My gpu gets real slow once the agent has more than ~50k of context, so I just have a system where they don’t ever hit 50k and voila, it’s much faster and way more automatable.
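Roughly, the loop looks like the sketch below. The names `run_plan` and `fake_subagent`, the context layout and the token accounting are all hypothetical; a real sub-agent would call the local model rather than return canned values.

```python
def run_plan(tasks, run_subagent, token_budget=50_000):
    """Run each task in a fresh sub-agent context, keeping only summaries."""
    summaries = []
    for task in tasks:
        # Fresh context per task: the new sub-agent sees the task plus short
        # notes from earlier tasks, never the earlier full transcripts.
        context = {"task": task, "notes": list(summaries)}
        result = run_subagent(context, token_budget)
        if result["tokens_used"] > token_budget:
            raise RuntimeError(
                f"task {task!r} exceeded the {token_budget}-token budget"
            )
        summaries.append(result["summary"])
    return summaries


def fake_subagent(context, token_budget):
    """Stand-in for a real sub-agent; reports made-up token usage."""
    tokens = 1_000 + 500 * len(context["notes"])
    return {"summary": f"done: {context['task']}", "tokens_used": tokens}


summaries = run_plan(["write parser", "add tests"], fake_subagent)
```

Because each sub-agent starts from the task and a handful of summaries, context size depends on the task, not on how long the whole project has been running.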

[-]

ortegaalfredo@reddit

It's basically the same SVG. Agents are just a tiny layer over the LLM, particularly those coding agents that are just glorified 20-line ralph loops plus spyware.

[-]

Fast-Satisfaction482@reddit

GPT-5.4 regularly has to retry file edits in co-pilot. Really stunning in my opinion. They seem to have the policy that they give a strict schema for interaction and then the model has to exactly comply, with zero error recovery on the side of the harness.

[-]

nuclearbananana@reddit

every harness does this. That's how tools work. It's not usually a problem, especially with constrained decoding

[-]

sdfgeoff@reddit (OP)

With Qwen3.6-27B claude-code, opencode and pi all succeed in the tool call first time. Copilot fails like 7 times in a row before figuring out how to edit a file. This points at the harness being the problem. No idea if the issue is bad prompting or bad harness design, but there's clearly something fishy going on.

[-]

Fast-Satisfaction482@reddit

Opencode uses the jsonrepair library to fix schema errors, so your statement is false.

[-]

kvothe5688@reddit

have you tried running test multiple times in same environment in different sessions?

[-]

sdfgeoff@reddit (OP)

Yep, the conclusions I post in the original post (about github copilot causing issues and opencode giving nice detailed interactive content) are pretty consistent across runs.

[-]

moahmo88@reddit

Good job! Thanks!

[-]

Interesting_Key3421@reddit

what about the tokens used? with local models in my tests, Pi is very fast and uses fewer tokens because of the minimal system prompt

[-]

sdfgeoff@reddit (OP)

Yeah, there are big differences in the number of tokens used. I'm still building out the metrics and will make another post when I have more data.

[-]

Separate-Forever-447@reddit

please try with frogs, ducks, cats and maybe a cow, so that we can tell what's going on

[-]

leo-k7v@reddit

Hmm. I tried QwenPaw 9B Q4 dense w/o any agents at all and spits out pelican.svg with exactly same picture as a text file triple backticked with svg type. I thought it’s standard picture from svg training set and most of the models know it by heart. I might be wrong (I often am)…

[-]

bonobomaster@reddit

Did you set temperature to zero and lock a specific seed?

For my understanding, you have to set a fixed seed and temperature=0 to make this test meaningful.

[-]

the-username-is-here@reddit

One-shots. Means nothing.

[-]

zoyer2@reddit

I need images of 10 attempts for each harness

[-]

somerussianbear@reddit

Pi shows what it means when it says its proposal is to be simple.

[-]

R_Duncan@reddit

Wait, it's not 100% clear... did all the harnesses use Qwen3.6 27B as the model? Which quantization/inference engine was used?

[-]

Yes_but_I_think@reddit

I would really like to see reliability tests. Tried 10 times the same thing. This harness gave usable results 8 / 10 times, etc.

[-]

kfl@reddit

Have you seen https://github.com/cartazio/benchkit_for_harnesses?

It also tries to assess the effect of the harness/coding agent.

[-]

szansky@reddit

omg I love Qwen so much, this is such an amazing model and it's incredible we can run it on 1x 3090 😮

[-]

vanbukin@reddit

Try setting up https://github.com/ogx-ai/ogx in front of your vLLM instance. You can disable embeddings, reranking, and vector search - keeping only the main model enabled. PostgreSQL works well as the database backend.

[-]

Protopia@reddit

The harness is IMO probably way more important than the LLM.

What about Goose, Hermes, BMAD, Superpowers, GSD, etc.?

[-]

burdzi@reddit

this is a really nice test! thanks! It seems to matter a lot what one uses. Fascinating...