I've got a feeling that llama.cpp is not the biggest performance bottleneck; it might be OpenCode instead.
Posted by ThingRexCom@reddit | LocalLLaMA | 32 comments
It looks as if OpenCode introduces an artificial delay in agentic coding. Have you noticed similar issues?
Could you suggest other solutions that provide better results with the local Llama server?
Comfortable-Rock-498@reddit
Try https://github.com/dirac-run/dirac (npm install -g dirac-cli)
I built it with performance and efficiency as the main goals.
sn2006gy@reddit
I'll be giving this a try. I've noticed OpenCode isn't just context-heavy (though not as bad as Claude Code); it has some very weird quirks with OpenAI endpoints: its tool calling isn't up to spec, it passes camelCase to the APIs when they expect snake_case, and it seems to break thinking/reasoning, cost tracking, and such. If your harness doesn't do that, I'll be jumping over 😄
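To make the casing point concrete: OpenAI-style chat completion endpoints use snake_case request fields, and strict servers (including many local OpenAI-compatible ones) will ignore or reject camelCase keys. A rough sketch; the snake_case names are from the public API, while the camelCase variants just illustrate the kind of keys a non-compliant harness might send:

```python
# Sketch: OpenAI-style endpoints expect snake_case request fields.
# A harness that sends camelCase keys will see them silently dropped
# or get a validation error from stricter servers.
correct = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 256,        # snake_case, per the spec
    "tool_choice": "auto",
}

broken = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "hi"}],
    "maxTokens": 256,         # camelCase: not part of the spec
    "toolChoice": "auto",     # silently ignored or a 400 error
}
```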
Comfortable-Rock-498@reddit
The side effects of pure vibe coding. Please note that Dirac only supports models with native tool calling. Pretty much every major model released in the last year does support native tool calls, just mentioning
koljanos@reddit
Try pi.dev but you can easily shoot yourself in the foot with it.
ThingRexCom@reddit (OP)
I've tried Pi, but it feels very raw. I encountered various file-editing issues (very similar to the early days of opencode, but fixed now). Is it worth investing time in Pi?
koljanos@reddit
Some say Pi is the Linux of LLM harnesses: you can ask it to modify itself, create plugins, and stuff. I like it more, but my colleagues can't or don't want to adopt it, so I'm kinda stuck.
ThingRexCom@reddit (OP)
Have you managed to configure Pi to orchestrate several specialized agents to work on a development task (so they can share tasks and cooperate)?
koljanos@reddit
There are extensions that enable that; you should try them. I can't do parallel requests since my setup isn't that good (vLLM crashes with more than 3 scheduled requests), but people have had good experiences with handing tasks off between agents.
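If the crashes are tied to how many requests get scheduled at once, it may help to cap concurrency on the server side; vLLM's OpenAI-compatible server accepts a --max-num-seqs flag that limits concurrently scheduled sequences (the model name below is a placeholder):

```bash
# Cap vLLM at 3 concurrently scheduled sequences (placeholder model name)
vllm serve my-org/my-model --max-num-seqs 3
```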
Glittering-Call8746@reddit
Any repo that has Pi fully set up?
Interesting_Key3421@reddit
It depends on whether you benefit from customizations and a minimal initial prompt.
Unlucky-Message8866@reddit
yeah i moved to pi a few months ago, opencode turned into dogshit
ThingRexCom@reddit (OP)
I've tried Pi, but it feels very raw. I encountered various file-editing issues (very similar to the early days of opencode, but fixed now). Is it worth investing time in Pi?
Unlucky-Message8866@reddit
Yeah, Pi is more like a framework/library: you need to spend time setting it up, but the return on investment is high. That's the whole point of it, to make it your own.
sarcasmguy1@reddit
Yes, there's a bit of time you need to invest in tweaking your setup; it feels a lot like starting Emacs or Vim with an empty config.
The beauty is, if you find any bit of pain, you can just get an agent to write an extension that makes it better for you.
EatTFM@reddit
I started with opencode and recently discovered pi. I notice the following:
First run on OpenCode: it crams 20-30k tokens into the context just for tooling, but the context hardly grows during file reads/tool calls.
First run on Pi: the context starts at almost zero and it's incredibly responsive and fast, but the context grows heavily. I can't force it to use grep over file_read, which spams my context like nothing. I guess if I can fix this basic issue, it will supersede OpenCode for me!
patricious@reddit
I had a similar issue with Opencode + my harness and was able to dramatically reduce the grep calls by using this: https://github.com/oraios/serena
audioen@reddit
I don't think anybody can figure out what is wrong based on this. If I am parsing this correctly, you have a 1000-second pause, which is not plausible given the numbers I see: you'd have to have a very glacial prompt-processing speed, which you evidently can't have when even generation reaches 1000 tok/s. Maybe you had a tool call that took 1000 seconds; who can tell? It's up to you to debug what is wrong.
Randommaggy@reddit
Why is the screenshot soaked in piss?
patricious@reddit
Mexico filter
ThingRexCom@reddit (OP)
It looks to be an OpenCode issue. When I switched from a multi-agent to a single-agent setup, the server load became way more consistent.
Pleasant-Shallot-707@reddit
Yeah. The reason pi was created was specifically due to these types of issues in opencode
ThingRexCom@reddit (OP)
When I tried Pi, it had issues modifying huge files reliably.
pantalooniedoon@reddit
Cool UI, what is it?
ThingRexCom@reddit (OP)
Thx, that is a custom tool I created to finetune my local setup.
rorowhat@reddit
What frameworks are you using for this?
ThingRexCom@reddit (OP)
This app is a single-file Python web app; I do not use any external frontend framework.
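Not OP's actual tool, but for anyone curious about the pattern: a minimal sketch of a single-file Python web app with no external frontend framework, using only the standard library (the /stats payload, endpoint, and port are invented for illustration):

```python
# Minimal sketch of a single-file Python web app, stdlib only.
# The /stats payload is a made-up placeholder; a real monitoring
# tool would fill it from the local LLM server.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<!doctype html>
<title>Local LLM dashboard</title>
<pre id="out">loading...</pre>
<script>
setInterval(async () => {
  const r = await fetch('/stats');
  document.getElementById('out').textContent =
      JSON.stringify(await r.json(), null, 2);
}, 1000);
</script>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/stats":
            # Placeholder metrics; wire these up to real data.
            body = json.dumps({"tokens_per_s": 0.0, "requests": 0}).encode()
            ctype = "application/json"
        else:
            body, ctype = PAGE, "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```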
__JockY__@reddit
Is it recalculating kv each time? I seem to recall llama.cpp won’t do prefix caching unless told to by the client.
Try vLLM :)
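For anyone who wants to check the caching point: llama.cpp's server exposes a cache_prompt flag on its native /completion endpoint (depending on the build, the client may need to set it explicitly). A minimal sketch, assuming llama-server is running on localhost:8080:

```python
# Minimal sketch: opt into llama.cpp prompt caching from the client.
# Assumes llama-server on localhost:8080; the prompt text is just an
# example.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "You are a coding assistant.\n\nUser: hello",
        "n_predict": 64,
        # Reuse the KV cache for the shared prompt prefix across calls,
        # so repeated agentic turns don't re-evaluate the whole prompt.
        "cache_prompt": True,
    },
    timeout=120,
)
print(resp.json()["content"])
```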
ThingRexCom@reddit (OP)
I have a hard time making vLLM run on my Strix Halo, it starts to load a model but never finishes :/
Makers7886@reddit
the juice is worth the squeeze
FrostyCup1094@reddit
Worth a test:
Spin up llama.cpp and watch GPU usage while OpenCode processes prompts and responds.
Then try this one: https://github.com/mlhher/late
And watch what happens ...
kataryna91@reddit
OpenCode creates project directory snapshots, which can take some time if there are many files in the project directory, and they can also take up terabytes on your SSD. You should disable that behavior with "snapshot": false in the config file.
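For reference, that key goes at the top level of OpenCode's JSON config; a minimal sketch (the file name and location, e.g. a project-level opencode.json, may vary by version):

```json
{
  "snapshot": false
}
```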
But even then, there are sometimes still delays where the server isn't doing anything. I haven't yet figured out what OpenCode is doing in that time (or rather, what it's not doing, and why).
_p00@reddit
I feel the same. I compared it to Crush and Goose; speed-wise it is far better. I didn't time it, but it's quite obvious.