Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B

Posted by sdfgeoff@reddit | LocalLLaMA

So I vibed a setup that lets me test multiple agentic harness/model combinations on the same task. Here are some images so you can form your own subjective opinions. I'm still working on getting automated evaluation.

Things I noticed that aren't visible in the images:

  1. Opencode can search the internet by default. This made its results way better on some tasks. E.g. on the 3D printer explainer page it listed specific filament temperatures etc.
  2. On webdev, opencode delivered really good results. You can't interact with them from here, but it made cool interactive widgets that worked really well.
  3. The model really struggles with GitHub Copilot. It generally takes half a dozen tries to write a file, because it keeps mucking up Copilot's file editing tools. It doesn't have this issue with the other harnesses. Claude Code, pi and opencode all take 4 LLM requests to create the pelican.svg. GitHub Copilot takes 13! It tries the edit tool, it tries bash, it tries the edit tool again. Whatever tool schema they use, in my tests the LLM really struggles with it. This makes it really slow, as it has to regenerate the same diffs again and again.
  4. Qwen3-vl-4 looped endlessly in OpenCode and couldn't even write the pelican.svg file to disk.
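
To put the request counts from point 3 side by side, here is a minimal sketch that expresses each harness's count relative to the lowest one. The counts are the ones quoted above; the dict and the `relative_overhead` helper are hypothetical names for illustration and are not taken from the actual test setup.

```python
# LLM requests needed to create pelican.svg, as quoted in point 3.
requests_per_harness = {
    "claude-code": 4,
    "pi": 4,
    "opencode": 4,
    "github-copilot": 13,
}


def relative_overhead(counts):
    """Return each harness's request count divided by the smallest count."""
    baseline = min(counts.values())
    return {name: n / baseline for name, n in counts.items()}


overhead = relative_overhead(requests_per_harness)

# Print harnesses from least to most overhead.
for name, ratio in sorted(overhead.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x")
```

With these numbers, GitHub Copilot comes out at 3.25x the requests of the other three harnesses for the same file. Request count is only a proxy for wall-clock time, since retries that regenerate full diffs also cost more tokens per request.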