Qwen3.6-35B becomes competitive with cloud models when paired with the right agent
Posted by Creative-Regular6799@reddit | LocalLLaMA | View on Reddit | 154 comments
A short follow-up to my previous post, where I showed that changing the scaffold around the same 9B Qwen model moved benchmark performance from 19.11% to 45.56%:
https://www.reddit.com/r/LocalLLaMA/s/JMHuAGj1LV
After feedback from people here, I tried little-coder with Qwen3.6 35B.
It now lands in the public Polyglot top 10 with a success rate of 78.7%, making it actually competitive with the best models out there for this benchmark!
At this point I’m increasingly convinced that part of the performance gap to cloud models is harness mismatch: we may have been testing local coding models inside scaffolds built for a different class of model.
Next up is Terminal Bench, then likely GAIA for research capabilities. Would love to hear your feedback here!
Full write up: https://open.substack.com/pub/itayinbarr/p/honey-i-shrunk-the-coding-agent
GitHub: https://github.com/itayinbarr/little-coder
Full benchmark results: https://github.com/itayinbarr/little-coder/blob/main/docs/benchmark-qwen3.6-35b-a3b.md
Willing-Toe1942@reddit
I can confirm the same thing. Qwen3.6 in pi-coding agents is almost twice as good as opencode; the comparison was based on modifying a specific web page (HTML code) and doing some online resource searching for documentation
Deep90@reddit
What makes pi so much better?
PinkySwearNotABot@reddit
the # of shills promoting it
JamesEvoAI@reddit
Clearly if you like something you're a shill, you can't just think the thing is good and want to share it with others.
To answer your question u/Deep90, Pi has a lot more thought put behind its design than some of the other open source harnesses. Mario and team are deliberate in what they add and more importantly what they don't. You don't have to believe me, it only takes a minute to install and test yourself, the quality difference is pretty apparent.
PinkySwearNotABot@reddit
that's a completely different claim than the one i made. in fact, this logic is so fallacious that they have a name for it -- strawman argument.
everyone and their mother are building custom harnesses these days, thanks to AI. and while AI has definitely been helpful, it's also opened up the floodgates to influencers 2.0. not to say influencers aren't capable of making a competing product, it's just that it's going to take a whole lot more convincing than just hearing the echo chamber of, "pi is so good".
and btw, it's been on my radar for a while now and i admit i am curious to see how it's different from any of the other 10 harnesses i already have on my computer
JamesEvoAI@reddit
The fault's on you then for basing your opinion of something on what influencers think. There's plenty of us who are just regular people singing the praises of this thing.
PinkySwearNotABot@reddit
when I say influencers, i mean anyone who just sings praises of something without knowing the technical specifics of why. that's all i've been seeing. pi, pi, pi -- and not a single why.
JamesEvoAI@reddit
Considering that "vibes" are just as valid a measurement as the benchmarks (sometimes even more valid), and the large number of folks who are only technical enough to follow a tutorial but not enough to understand why they're doing what they're doing, it comes as no surprise that a decent number of people are seeing and feeling the improvement without being able to clearly articulate it.
Hell, aside from using the creator's own writing as an example of why it's better, I'd be hard pressed to give you a solid reason. It just "feels" better to use than something like OpenCode.
sonderman@reddit
motte and bailey
Finanzamt_kommt@reddit
It's just better though? Tested it myself without many expectations or much idea how to configure it, and for local models that lightweight, barebones structure is simply better. It doesn't make you wait on a 15k+ token system prompt; ~3k or so is enough. That alone makes it better than most other CLIs for local models, since prompt processing on local models is often lacking. And you get more usable context.
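To put rough numbers on why system-prompt size dominates the local experience, here's a back-of-envelope sketch. The 200 tok/s prompt-processing speed is an assumed figure for modest local hardware, not a measurement from this thread:

```python
# Back-of-envelope illustration (assumed numbers, not measured):
# how long you wait on prompt prefill before the first generated token,
# at a prompt-processing speed plausible for modest local hardware.

def prefill_seconds(prompt_tokens: int, pp_tokens_per_sec: float) -> float:
    """Time spent processing the prompt before generation starts."""
    return prompt_tokens / pp_tokens_per_sec

PP_SPEED = 200.0  # tokens/sec prompt processing (hypothetical local setup)

heavy = prefill_seconds(15_000, PP_SPEED)  # 15k-token system prompt
light = prefill_seconds(3_000, PP_SPEED)   # ~3k-token system prompt

print(f"15k sysprompt: {heavy:.0f}s wait; 3k sysprompt: {light:.0f}s wait")
# at 200 tok/s, 15k tokens is 75s of waiting vs 15s for 3k
```

The same prompt size is barely noticeable on cloud hardware with fast prefill, which is one reason frontier-oriented harnesses can afford huge system prompts.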
Polite_Jello_377@reddit
Small context probably
stuckinmotion@reddit
Interesting, I might have to try pi. I'm constantly surprised by how useless opencode is whenever I try it with a local model. Like it takes a second prompt to even get it to actually write to the file instead of just printing code to the screen.
Deep90@reddit
That has been my experience with pretty much every harness. Excited to see if Pi changes things for me.
Pleasant-Shallot-707@reddit
Out of the box?
Willing-Toe1942@reddit
Yes, I used unsloth UD-Q4_XL (llama.cpp - Strix Halo with the Vulkan backend)
Give the same question to pi-coding and opencode, and immediately you will notice how opencode is slower (longer default prompts), and even slower in all types of actions like reading files, writing, searching the web, etc.
The pi agent is insanely fast and more efficient, and it completed the task much, much faster
Safe-Buffalo-4408@reddit
I prefer quality over speed. It would be an interesting comparison over time with regards to code and tool-calling quality.
Caffdy@reddit
can you help a lost soul set up pi for agentic coding? where does one start? do you recommend any tutorial/video guide?
0h_yes_i_did@reddit
install:
to run: go to your project directory and simply run 'pi'.
JamaiKen@reddit
I’m seeing this as well, Qwen3.6 + Pi is where it’s at
Pleasant-Shallot-707@reddit
I moved my tool chain over to pi last night actually. I was just curious if you saw benefits without any extra harnesses. That’s exciting.
wrdit@reddit
This is amazing work. Well done
DependentBat5432@reddit
going from 19% to 45 to 78 just by changing the scaffold is kind of terrifying. makes you question every benchmark comparison that doesn't control for this
itsmetherealloki@reddit
Notice how all the models only show the benchmark for a single run? Why not an average of like 10 runs? That’s because it’s the best run they could get. LLMs are all really smart but inconsistent as hell. The right harness can help immensely with consistency making lesser models seem much closer to the bigger models.
jarail@reddit
I think it's because it's expensive to run.
itsmetherealloki@reddit
Good points, but I disagree to an extent. You might be right on the per-run costs, but that seems a bit high. I seriously doubt the benchmark can truly account for in-model variance, because we should be seeing more real-world consistency if that were true. The problem is it's hard to know what that would actually be without benchmarking them multiple times, which, to your point, could be somewhat costly. Either way, I think models' inconsistencies are a bigger issue at the moment compared to raw IQ. They are almost all smart enough to run my harness, but they still fail a certain amount of the time.
Georgefakelastname@reddit
According to Artificial Analysis, they quite literally did spend almost $5k on their test for the last two versions of Claude Opus. However, there's then a steep price cliff down to like $500-$1k for most other mainstream cloud models. Still super expensive, going through over 100M tokens for each evaluation of most top-end models.
So yeah, given the size and cost, it’s no surprise they don’t want to run multiple tests. Not to mention, why exactly would 3rd parties be trying to get the “best” results for certain models? Corruption? It’s possible, but I haven’t seen any evidence for that.
aparamonov@reddit
If you look closer he added knowledge injection as well, giving models cheat sheets to some extent guiding model to correct implementation for benchmarks.
ItilityMSP@reddit
But that is the point: harness plus domain knowledge is a huge improvement. ... the right domain knowledge and preventing self-sabotage can lead to success.
PhilippeEiffel@reddit
Not really:
- 19% to 45% is for Qwen3.5 (9B, dense model)
- 78% is for Qwen3.6 (35B, MoE)
relmny@reddit
Why is there only one reply that corrects a comment with more than 100 upvotes?
Pleasant-Shallot-707@reddit
Because why do you need more corrections?
lioffproxy1233@reddit
You know why
Pleasant-Shallot-707@reddit
3.6 27B dense was just released. Interested to see how well it does
Blaze344@reddit
Which is why I've always been a strong defender of the idea that even if LLMs stalled literally as they are today, it would have a generalized impact on a lot of roles and tasks. A dedicated developer interacting with an expert in the given task or role can very feasibly implement an agentic harness that solves a lot of things with enough accuracy to be tolerable, and that's without finetuning to the task in particular. Some combination of reliable data and synthetic data acquired from the "tolerable" agent using the harness can easily have a pretty big impact.
metigue@reddit
This is why Terminal Bench and sanity harness are the best: they both show how different harnesses perform with different models.
The harness has made more of a difference than the model for a while now.
Worried_Drama151@reddit
No it doesn't. If you listened to everything on this sub, Kimi is better than gpt onion, claude Mythos, and quantum engineering
candraa6@reddit
quantum engineering has nothing to do with this. and it shows you don't know what you're talking about
Y0uCanTellItsAnAspen@reddit
I think it was a joke....
sorweel@reddit
Humor has nothing to do with this. and it shows you don't know what you're talking about
PhilippeEiffel@reddit
For reference, Qwen 3.6's official scores are available on the Qwen model card: https://huggingface.co/Qwen/Qwen3.6-27B
Terminal bench 2:
Qwen3.6 35B A3B: 51.5
Qwen3.6 27B: 59.3
Of course these values are obtained with BF16.
I'm very interested in your results: how much can you improve with a better-adapted harness?
HockeyDadNinja@reddit
Great work! I have some questions.
1) Why did you choose Aider and the Aider Polyglot benchmarks? Not hating on Aider, I personally hard forked aider-ce as the basis of my AI assistant. Aider is not really maintained and the benchmark leaderboard is looking dated.
2) You've run the polyglot benchmarks on your own agent. I suppose we could take the benchmarks and run them on any agent harness / LLM combo. I now want to try this with various combinations such as my Qwen3.6 setup with opencode and also with claude code / opus 4.7. Have you run the benchmarks using little-coder and frontier models?
WRT agent harness and LLM matching I've had similar thoughts with development frameworks such as GSD, spec kit, and open spec. I was thinking of building a GSD-light for example, something better suited for local models.
What you've done here could actually be used as a benchmark for the coding harnesses themselves (vs any particular model). Claude, codex, opencode, pi, etc could be ranked against each other given a common LLM configuration (I know, not always possible).
Creative-Regular6799@reddit (OP)
That is exactly the direction I advocate for here! It's now running on Terminal Bench (I will send it to the leaderboard when finished and report here). That benchmark shows the combined performance of agents and models
PhilippeEiffel@reddit
Just curious: how much time to run terminal bench?
Your work is interesting: model providers put a mass of knowledge, energy, time... to build great models they give to the community. The community has to optimize the harness to leverage the models' usage.
Creative-Regular6799@reddit (OP)
Just pushed the result, Terminal Bench 1 (0.1.1) finished with 40% success rate! Now running TB 2
PhilippeEiffel@reddit
Great!
I've read your full article; it's very interesting. I noticed you were running the 9B in Q4_K_M. Maybe I missed this information, but I don't know the quant you are using for Qwen3.6 35B.
Traditional benchmarks use BF16 weights to show the highest possible score a model can reach.
Coding activities are known to be more sensitive to quantization than tasks like working on texts or generating texts. It could be very interesting to see if your harness is able to mitigate this quantization effect. So, running terminal bench with different quants will be very interesting.
PS: when you submit your results to the leaderboard, mention the quantization used.
fredandlunchbox@reddit
Just put it in Claude Code for a 1-to-1 comparison.
Healthy-Nebula-3603@reddit
Bro, qwen 3.6 35b is obsolete. We have 3.6 27b which is much better :)
Kahvana@reddit
Nah, 35B-A3B's speed is remarkable. Detailed planning with 27B, implementation with 35B-A3B. Best of both worlds!
Cupakov@reddit
Seems like an ideal use case for pi.dev, that’s gotta be the most extensible harness out there
Creative-Regular6799@reddit (OP)
pi.dev adaptation is up!
gilliancarps@reddit
Sorry, but where can I get pi adaptation from?
mtomas7@reddit
Just to clarify: what little coder does extra vs vanilla pi? Do you need this wrapper or it is better to do just a pi extension/package?
Creative-Regular6799@reddit (OP)
In the works, will support pi dev soon!
rorowhat@reddit
What are you using to benchmark them?
kaeptnphlop@reddit
I have your repo open since your last post and wanted to test with Qwen 3.6 myself. Thanks for the write up!
I found Qwen-Coder-Next is pretty strong with GitHub Copilot in VS Code. Now I’m curious how well it would do with little-coder.
Maybe I find some time today
iamapizza@reddit
I didn't know that Copilot could work with local models. This could be interesting...
Gargle-Loaf-Spunk@reddit
Codex can run with local models too
sdfgeoff@reddit
I keep running into a 400 error when it tries to make a tool call (using llama-server). Any tips?
kaeptnphlop@reddit
Not sure if it is in main yet, but the Insiders version can.
iamapizza@reddit
I see an add models dialog, under that which option did you pick?
kaeptnphlop@reddit
It should have an "OpenAI Compatible" option
iamapizza@reddit
Ah you're right, so it'll probably be in insiders. Thanks!
MuzafferMahi@reddit
What did you actually change about the harness?
Creative-Regular6799@reddit (OP)
I wrote extensively about it here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
meca23@reddit
Any chance you could start with pi as your harness and apply changes around that to achieve the same result? I think this path is more likely to reach a wider audience than yet another tool.
Creative-Regular6799@reddit (OP)
pi integration is up! Thanks for the tip
StardockEngineer@reddit
You could just ask Pi to implement the article upon itself. It’ll know what to do.
arcanemachined@reddit
Pi is so cybernetic. I love it for that.
Creative-Regular6799@reddit (OP)
Doing it right now. Thank you for the tip!
AdOk3759@reddit
That was really, really interesting to read!!!
Low88M@reddit
Isn't qwen3.5-27B still better for performance (in opencode for example), even if not for speed, on broke consumer GPUs?
Nindaleth@reddit
If you look at the benchmarks in the 3.6-27B announcement, 3.6-35B-A3B is pretty much equivalent to 3.5-27B in performance (at least based on those benchmarks), but in another league in speed.
Of course, I'll agree that point is moot now that 3.6-27B is out... :)
New_Comfortable7240@reddit
Yeah, I suppose the key is "GPU-poor" setups. On my computer qwen3.5 27B runs at around 12 tps, but qwen3.6 35B at around 35 tps, so I use the 35B more for agentic cases, but someone with more patience can try the 27B and get better results
SourceCodeplz@reddit
I had GLM5 clone and analyze it, here is what it does:
aparamonov@reddit
so the bottom line is it only improves tool use by injecting brief instructions and preventing destructive write ops, is that all?
josuf107@reddit
If you want to know all you can read the write up from OP. There are some other odds and ends including keeping the context small and short-circuiting reasoning. The interesting thing is that some simple accommodations in the harness drastically improve results for smaller models.
aparamonov@reddit
About reasoning limits: llama.cpp has that as a built-in parameter, including the final reasoning message. I didn't get why it was needed to reinvent the wheel there
SpicyWolf@reddit
I've been trying small local models to learn coding, especially qwen 3.5:9b, and using little-coder it's the first time it nailed a space-shooter HTML test on the first run; usually it gives me a buggy mess I have to fix manually, even with decent tools available for it. Crazy work, thanks!
Creative-Regular6799@reddit (OP)
So exciting to hear!! Please continue experimenting and sharing. Non-trivial tasks tend to be more interesting test cases
ikkiho@reddit
scaffold gap is mostly training-distribution mismatch. cloud coding models get RL'd against their own harness so prompt format, tool-call syntax, turn boundaries all match training. local models dropped into that same harness are out of distribution by default. a short SFT pass on little-coder traces would probably close more of the remaining gap than scaling params.
FusionX@reddit
OP, are you the dev of little-coder or affiliated with it in some form?
Creative-Regular6799@reddit (OP)
Yeah i created it!
FusionX@reddit
Gotcha. I hadn't gone through your previous post, and it wasn't as apparent in this post. Thanks for clarifying.
vex_humanssucks@reddit
The scaffold-makes-more-difference-than-model point is one of those things that sounds obvious until you've actually watched a 35B model with a good agent loop beat a 70B with a naive one. I'd be curious what your retry strategy looks like — do you let the model self-correct on failures or hard-reset the context?
PassengerPigeon343@reddit
I agree with this concept. The tools and environment are starting to become almost as important to performance as the model itself.
And for local models, I think it comes down to that being the difference between an okay experience, and one that starts to compete with frontier models.
StardockEngineer@reddit
Hard disagree. They are becoming less important. Only bad or older models benefit from tighter harnesses. The best models just don’t need them nearly as much as they used to. I would say some don’t need them at all.
arcanemachined@reddit
Hard disagree with you. The harness has full control over how the context is passed to the model, and context is king. If you mess with the context badly enough, it doesn't matter how good the model is, because its context (handled by the harness) can drive performance right off a cliff.
StardockEngineer@reddit
The only useful thing is how they compact and if they send an agents.md and when. Nothing else matters on large models.
Try pi with sonnet or Opus. Four basic tools and it works great.
Minimal harness is the best. Almost everyone I've gotten on Pi hasn't gone back to opencode, Claude Code, or Codex.
And with Pi, you can create an “extension” you want, yourself, as a feature
En-tro-py@reddit
I must be out of the loop, which harness does this?
Everything I've seen is still AGENTS.md, read_file, grep, etc. - tooling pulling in context... but not managing it live while the session develops.
arcanemachined@reddit
Implicitly, all of them. They're the middleman between you and the model.
In practice, none of them really do this much in my understanding, except Claude Code, which has some context-related issues lately, including this one and this one.
I guess my initial comment is a little misleading: Ideally, the harness is just a neutral mediator, but it certainly has the ability to improve the context (with a good system prompt, which of course should be overridable), and as Anthropic has demonstrated, has the ability to screw it up.
En-tro-py@reddit
Anthropic screwing up their own system prompts is 90% of the claude-code experience...
I just thought I'd missed some new harness that was feeding context intelligently beyond the 'injection' of simple messages and was actually stuffing code it determined was needed into the agent's feed.
arcanemachined@reddit
No, sorry, I definitely gave the wrong impression... but I do believe that something like what you're describing is likely to happen sooner than later.
En-tro-py@reddit
😉 - It is...
sonicnerd14@reddit
Honestly, it's probably always been this way; we just didn't realize it until we arrived at this point where more of us can actually run these models on our own machines.
mjuevos@reddit
using claude code with qwen3.6 35b with guardrails and it does ok. wonder why no one uses claude code with qwen locally?
FeiX7@reddit
what about pi?
Creative-Regular6799@reddit (OP)
pi.dev integration is up!
Creative-Regular6799@reddit (OP)
In the works, will fully support pi dev soon!
freme@reddit
What's wrong with Qwen Code? I'm not experienced so it's just a question.
ForbidReality@reddit
Qwen with the right harness vs closed source with any harness is not apples to apples
rm-rf-rm@reddit
I've been saying it literally since GPT-4: the models are already smart enough. It's just that they need to be treated like the component that they are and embedded in a good system.
Think of the LLM as the wheel: sure, you can improve the tread, the circularity, the strength-to-weight ratio, etc., but you'll get much more gain from moving it from a unicycle (chatbot) to a full-fledged car (AI-native IDE, etc.)
boutell@reddit
Maybe this is a silly question, but why not Qwen Code itself? Did they get it wrong for their own models?
DeliciousGorilla@reddit
little-coder wasn't working well for me (repeating/looping with qwen3.6), so I ported over your techniques to pi as 2 extensions and 2 skills: https://github.com/alisorcorp/pi-small-model-addons
arcanemachined@reddit
Good stuff. I believe this is some important work that you're doing.
hernejj@reddit
I've been researching and writing tooling for automated codebase documentation generation. I'm finding that the results returned by Cline (llama.cpp backend using Qwen3.6-35B) are lackluster compared to what comes back from proprietary models (Claude's Sonnet is my current baseline). And I've been wondering how much of the difference is attributable to the model versus the agent itself.
I'm going to wire up my automation to your agent and see if things improve for the local case, when I get a few minutes :)
Thanks for sharing!!
DeliciousGorilla@reddit
I tried out little-coder, but I'm getting repeating loops on nearly every reply using unsloth's Qwen3.6-35B-A3B-MXFP4_MOE. Same model works like a charm with pi-mono.
feckdespez@reddit
I've looked at your summary information and maybe I missed it.
For the MoE model, did you run Aider and the little coder agent?
swanny101@reddit
I believe that's what he's testing: changing out the scaffolding (Aider, little-coder agent, etc.) using the same base model, showing that the scaffolding is critical to pair with the model for optimal performance.
Worried-Squirrel2023@reddit
going from 19% to 45% to 78% on the same model just by changing the scaffold is exactly why benchmark scores need to come with harness disclosure. half the models we think are mid are probably running in bad harnesses. the other half of the gap is in the eval setup itself, not the weights.
PhilippeEiffel@reddit
Not really:
- 19% to 45% is for Qwen3.5 (9B, dense model)
- 78% is for Qwen3.6 (35B, MoE)
valcore93@reddit
How does that compare with running claude code harness with the same qwen model ?
akavel@reddit
There's already a popular small harness called pi.dev. What are the advantages little-coder has over it, why would I use it over Pi? What are the disadvantages, what would I lose? Did you do a comparison, does the same Qwen work better with little-coder than with Pi?
Then there's the Terminal Bench leaderboard, which compares agents. Did you submit yours to that benchmark? The leaderboard is currently topped by ForgeCode, and it seems open-source - did you compare little-coder to ForgeCode with the same model? Is your agent better?
Creative-Regular6799@reddit (OP)
I am currently running Terminal Bench BTW, will send to the leaderboard when done
Creative-Regular6799@reddit (OP)
Hey, thanks for your comment! I became aware of pi.dev just an hour ago, and this didn't really start as a production-ready tool, but more as a serious wake-up call that we as a community need to invest time in adapting the scaffold to the models we are testing. I am thinking about rewriting the scaffold in pi dev to make it more accessible and contribute to unified tooling and community support
akavel@reddit
Cool, good luck then! :)
Real_Ebb_7417@reddit
I will definitely try this. I wanted to spend some time in the coming days setting up a well-working agentic workflow for smaller-local models and if this harness works well, maybe it will save me lots of work.
But to ask you (or someone who already checked the repo content), what does it do differently than "bigger" tools (like Codex, ClaudeCode, OpenCode etc.) to work better with smaller, local models?
New_Comfortable7240@reddit
So basically he controlled settings like context and temperature, plus limited the context passed.
The code passes the skills and history conditionally, like "if this model needs optimizing, we cut the context to 300 tokens, for example"
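A minimal sketch of what that conditional trimming could look like. This is not little-coder's actual code; the function names and the 4-characters-per-token heuristic are assumptions for illustration:

```python
# Hypothetical sketch of budget-capped history trimming for small models.
# Not little-coder's actual implementation; names and the rough
# 4-chars-per-token estimate are assumptions for illustration.

def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the system prompt, plus the most recent messages that fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(approx_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = approx_tokens(m["content"])
        if used + cost > budget_tokens:
            break  # older messages get dropped once the budget is spent
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

The point is just that the harness, not the model, decides how much history survives each turn, which matters a lot more when usable context is small.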
Creative-Regular6799@reddit (OP)
Thank you! I wrote extensively about it here: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent
lolwutdo@reddit
Harness definitely makes a huge difference; I know people hate on openclaw and similar projects but damn, Hermes Agent feels way smarter and productive despite using the same model (qwen 3.6)
DefNattyBoii@reddit
Looks like forgecode also would be ripping just like your little-coder harness. I'm personally using opencode with omo and it works fine but there are a lot of tokens wasted
asraniel@reddit
can this be adapted to opencode?
Creative-Regular6799@reddit (OP)
So it is a suggested replacement for opencode that is adapted to the behavioral profile of smaller models. It tries to bridge the gap: these tools were built around frontier models and aren't necessarily the best fit as scaffolds for the small ones
asraniel@reddit
i wonder if it would not make more sense to improve opencode. there are too many tools already... also there is a whole ecosystem around opencode already that one would lose
CuriouslyCultured@reddit
This should be super easy to port to Pi, that's probably the way.
Creative-Regular6799@reddit (OP)
So instead of opencode, I started from a replica of claude code and adapted from there, assuming claude code is the best coding agent currently written in general and can serve as a good baseline to start from
asraniel@reddit
i just fear that it's a one-person project that won't be maintained over time... but i'll check it out as i want to use local models more
Y0uCanTellItsAnAspen@reddit
Is there good documentation of how to link it up to agents locally? I am using llama-cpp, and have qwen3.6-35B running, but I'm a little new to this, and would like to know what agents people are using, and how you configure them.
CountlessFlies@reddit
Once you have llama cpp server running, you get an OpenAI compatible API. Most agents and harnesses just need you to put this API url in config and you’re set. You might have to tweak the temperature and similar settings to the recommended values depending on how the harness handles it.
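To make "OpenAI-compatible API" concrete, here's the shape of the request a harness sends to a local llama-server (default port 8080). The model name and sampling values below are illustrative, not requirements:

```python
# Sketch of the request a coding harness sends to llama-server's
# OpenAI-compatible endpoint. URL assumes llama-server's default port;
# model name and sampling values are illustrative placeholders.
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # llama-server default port

payload = {
    "model": "qwen3.6-35b-a3b",  # llama-server serves whatever is loaded
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write hello-world in Python."},
    ],
    "temperature": 0.7,   # tune to the model card's recommended values
    "max_tokens": 512,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running: resp = urllib.request.urlopen(req)
print(req.full_url)
```

Most harnesses only ask you for the base URL and (optionally) an API key; the rest of this plumbing is what they do under the hood.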
Y0uCanTellItsAnAspen@reddit
Thanks! I assume people somewhere have lists of which APIs work well with llama-cpp?
CountlessFlies@reddit
You mean harnesses that work well with llama.cpp right, not APIs? llama.cpp server is what gives you the OpenAI compat API.
You can try pi.dev or opencode, both are great harnesses.
Y0uCanTellItsAnAspen@reddit
Thanks!
autisticit@reddit
I will definitely try this.
One question: how hard do you think it would be to create a little-coder VS Code extension, to make it usable through the UI?
aijoe@reddit
Could opus answer this or simply create this for you by feeding it enough context?
autisticit@reddit
I'm pretty sure it could, but I preferred to ask OP first :)
autisticit@reddit
Another question if you don't mind: the readme specifies supported models; does that mean any other model/quant will fail?
Creative-Regular6799@reddit (OP)
Hey thank you for the comment! You can definitely try, I just haven’t myself
autisticit@reddit
Thanks!
Pleasant-Shallot-707@reddit
It's well understood that harnessing improves the quality of AI output
ZSizeD@reddit
Great work! Will be trying this out
systems2software_eng@reddit
Please forgive my ignorance, I am an amateur here: can I use this with Hermes or OC to boost the performance from my local models? Or is its own standalone agent harness?
Creative-Regular6799@reddit (OP)
It currently allows running inference via llama.cpp and Ollama. Is that sufficient for your optimization pipeline?
drumyum@reddit
The Polyglot benchmark shows how good an LLM is at following Aider-specific instructions to solve Exercism tasks. If you remove Aider from the equation, it makes no sense to compare against the rest of the leaderboard.
If Qwen can solve some task with your instructions, but not with Aider's, it could mean that yours are closer to what it was trained on, and probably that Qwen is bad at generalizing. Yet your results are still interesting, good job!
po_stulate@reddit
I mean, that's kinda exactly what OP said though: that qwen performs as well as cloud models "under certain conditions".
drumyum@reddit
Cloud models are not being tested here; what if they perform much better than Qwen?
bonobomaster@reddit
Then a box with all the GDDR7-RAM and compute you ever could wish for, will magically appear at your front door.
po_stulate@reddit
That still doesn't change what this post wants to convey. I don't think OP is speaking literally, that qwen and cloud models are a strict tie; it's more that if the correct environment is used, the performance boost could take you from what you see on the original benchmark to the cloud model's benchmark score.
Creative-Regular6799@reddit (OP)
Exactly this. Thank you for helping clarify
Limp_Classroom_2645@reddit
Unsurprisingly tbh, this model is scary good
Lowkey_LokiSN@reddit
Dope work and direction! Fully agree with how everything is designed around frontier-model assumptions and how we can extract a lot more out of the smaller models with tailor-made harnesses.
OkFly3388@reddit
Is there any coding agent that can be used not only as a standalone agent, but as part of a workflow?
For example: the agent finishes a task, the code gets automatically pushed to my cluster, autotests run, for failed tests we collect traces, then a different agent filters the traces to keep only the interesting parts, and then this goes back to the coding agent.
Because hooking this up as a tool doesn't have much success; the agent often forgets about it and tries to test manually, or just doesn't test at all.
Cupakov@reddit
pi.dev can be used like that, just ask it to write the extension
ljubobratovicrelja@reddit
I was also pondering this topic myself in the past couple of weeks, and you've done the majority of what I wanted to do, so a big thanks to you. Amazing findings, cheers!
Best-Theory-2201@reddit
Nice, thank you for sharing!
So, in your write-up you state "redesigning the scaffold around the behavioral profile of a small local model moves the pass rate from 19.11% to 45.56%"; what does that actually mean?
What have you actually redesigned? Is that taking a smaller context into account? Creating smaller sub-tasks? I'm really curious to hear how you got that success rate; what did you actually do to accomplish this?
I'm intrigued by the idea of running more smaller models in parallel instead of one large flagship model, but not quite sure how to address this.
SourceCodeplz@reddit
You can always look for yourself in the actual harness: https://github.com/itayinbarr/little-coder
This is what I am currently doing as I have read his last post and am trying to do something similar.