I've got a feeling that llama.cpp is not the biggest performance bottleneck; it might be OpenCode instead.
Posted by ThingRexCom@reddit | LocalLLaMA | 32 comments
It looks as if OpenCode introduces an artificial delay in agentic coding. Have you noticed similar issues?
Could you suggest other solutions that provide better results with the local Llama server?
Comfortable-Rock-498@reddit
Try https://github.com/dirac-run/dirac (npm install -g dirac-cli)
I built it with performance and efficiency as the main goals.
sn2006gy@reddit
I'll be giving this a try. I've noticed OpenCode isn't just context-heavy (though not as bad as Claude Code); it has some very weird quirks with OpenAI endpoints: its tool calling isn't up to spec, it passes camelCase to the APIs when they expect snake_case, and it seems to break thinking/reasoning, cost tracking, and such. If your harness doesn't do that, I'll be jumping over 😄
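To make the casing point concrete: OpenAI-style chat completion endpoints use snake_case request fields, and strict servers (including many local OpenAI-compatible ones) will ignore or reject camelCase keys. A rough sketch; the snake_case names are from the public API, while the camelCase variants just illustrate the kind of keys a non-compliant harness might send:

```python
# Sketch: OpenAI-style endpoints expect snake_case request fields.
# A harness that sends camelCase keys will see them silently dropped
# or get a validation error from stricter servers.
correct = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 256,        # snake_case, per the spec
    "tool_choice": "auto",
}

broken = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "hi"}],
    "maxTokens": 256,         # camelCase: not part of the spec
    "toolChoice": "auto",     # silently ignored or a 400 error
}
```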
Comfortable-Rock-498@reddit
The side effects of pure vibe coding. Please note that Dirac only supports models with native tool calling. Pretty much every major model released in the last year does support native tool calls, just mentioning
koljanos@reddit
Try pi.dev but you can easily shoot yourself in the foot with it.
ThingRexCom@reddit (OP)
I've tried Pi, but it feels very raw. I encountered various file-editing issues (very similar to the early days of opencode, but fixed now). Is it worth investing time in Pi?
koljanos@reddit
Some say Pi is the Linux of LLM harnesses: you can ask it to modify itself, create plugins, and stuff. I like it more, but my colleagues can't or don't want to adopt it, so I'm kinda stuck.
ThingRexCom@reddit (OP)
Have you managed to configure Pi to orchestrate several specialized agents to work on a development task (so they can share tasks and cooperate)?
koljanos@reddit
There are extensions that enable that; you should try them. I can't do parallel requests since my setup isn't that good (vLLM crashes with more than 3 scheduled requests), but people have had good experiences with handing tasks off between agents.
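If the crashes are tied to how many requests get scheduled at once, it may help to cap concurrency on the server side; vLLM's OpenAI-compatible server accepts a --max-num-seqs flag that limits concurrently scheduled sequences (the model name below is a placeholder):

```bash
# Cap vLLM at 3 concurrently scheduled sequences (placeholder model name)
vllm serve my-org/my-model --max-num-seqs 3
```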
Glittering-Call8746@reddit
Any repo that has Pi fully set up?
Interesting_Key3421@reddit
It depends on whether you benefit from customizations and a minimal initial prompt.
Unlucky-Message8866@reddit
yeah i moved to pi a few months ago, opencode turned into dogshit
ThingRexCom@reddit (OP)
I've tried Pi, but it feels very raw. I encountered various file-editing issues (very similar to the early days of opencode, but fixed now). Is it worth investing time in Pi?
Unlucky-Message8866@reddit
Yeah, Pi is more like a framework/library: you need to spend time setting it up, but the return on investment is high. That's the whole point of it, to make it your own.
sarcasmguy1@reddit
Yes, there's a bit of time you need to invest in tweaking your setup; it feels a lot like starting Emacs or Vim with an empty config.
The beauty is, if you find any bit of pain, you can just get an agent to write an extension that makes it better for you.
EatTFM@reddit
I started with opencode and recently discovered pi. I notice the following:
First run on OpenCode: it crams 20-30k tokens into the context just for tooling, but the context hardly grows during file reads/tool calls.
First run on Pi: the context starts at almost zero and it's incredibly responsive and fast, but the context grows heavily. I can't force it to use grep over file_read, which spams my context like nothing. I guess if I can fix this basic issue, it will supersede OpenCode for me!
patricious@reddit
I had a similar issue with Opencode + my harness and was able to dramatically reduce the grep calls by using this: https://github.com/oraios/serena
audioen@reddit
I don't think anybody can figure out what is wrong based on this. If I am parsing this correctly, you have a 1000-second pause, which is not plausible given the numbers I see: you'd have to have a very glacial prompt-processing speed, which you evidently can't have when even generation reaches 1000 tok/s. Maybe you had a tool call that took 1000 seconds; who can tell? It's up to you to debug what is wrong.
Randommaggy@reddit
Why is the screenshot soaked in piss?
patricious@reddit
Mexico filter
ThingRexCom@reddit (OP)
It looks to be an OpenCode issue. When I switched from a multi-agent to a single-agent setup, the server load became way more consistent.
Pleasant-Shallot-707@reddit
Yeah. The reason pi was created was specifically due to these types of issues in opencode
ThingRexCom@reddit (OP)
When I tried Pi, it had issues modifying huge files reliably.
pantalooniedoon@reddit
Cool UI, what is it?
ThingRexCom@reddit (OP)
Thx, that is a custom tool I created to finetune my local setup.
rorowhat@reddit
What frameworks are you using for this?
ThingRexCom@reddit (OP)
This app is a single-file Python web app; I do not use any external frontend framework.
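Not OP's actual tool, but for anyone curious about the pattern: a minimal sketch of a single-file Python web app with no external frontend framework, using only the standard library (the /stats payload, endpoint, and port are invented for illustration):

```python
# Minimal sketch of a single-file Python web app, stdlib only.
# The /stats payload is a made-up placeholder; a real monitoring
# tool would fill it from the local LLM server.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<!doctype html>
<title>Local LLM dashboard</title>
<pre id="out">loading...</pre>
<script>
setInterval(async () => {
  const r = await fetch('/stats');
  document.getElementById('out').textContent =
      JSON.stringify(await r.json(), null, 2);
}, 1000);
</script>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/stats":
            # Placeholder metrics; wire these up to real data.
            body = json.dumps({"tokens_per_s": 0.0, "requests": 0}).encode()
            ctype = "application/json"
        else:
            body, ctype = PAGE, "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```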
__JockY__@reddit
Is it recalculating kv each time? I seem to recall llama.cpp won’t do prefix caching unless told to by the client.
Try vLLM :)
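For anyone who wants to check the caching point: llama.cpp's server exposes a cache_prompt flag on its native /completion endpoint (depending on the build, the client may need to set it explicitly). A minimal sketch, assuming llama-server is running on localhost:8080:

```python
# Minimal sketch: opt into llama.cpp prompt caching from the client.
# Assumes llama-server on localhost:8080; the prompt text is just an
# example.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "You are a coding assistant.\n\nUser: hello",
        "n_predict": 64,
        # Reuse the KV cache for the shared prompt prefix across calls,
        # so repeated agentic turns don't re-evaluate the whole prompt.
        "cache_prompt": True,
    },
    timeout=120,
)
print(resp.json()["content"])
```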
ThingRexCom@reddit (OP)
I have a hard time making vLLM run on my Strix Halo, it starts to load a model but never finishes :/
Makers7886@reddit
the juice is worth the squeeze
FrostyCup1094@reddit
Worth a test:
Spin up llama.cpp and watch GPU usage while OpenCode processes prompts and responds.
Then try this one: https://github.com/mlhher/late
And watch what happens ...
kataryna91@reddit
OpenCode creates project directory snapshots, which can take some time if there are many files in the project directory, and they can also take up terabytes on your SSD. You should disable that behavior with "snapshot": false in the config file.
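For reference, that key goes at the top level of OpenCode's JSON config; a minimal sketch (the file name and location, e.g. a project-level opencode.json, may vary by version):

```json
{
  "snapshot": false
}
```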
But even then, there are sometimes still delays where the server isn't doing anything. I haven't yet figured out what OpenCode is doing in that time (or rather, what it's not doing, and why).
_p00@reddit
I feel the same. I compared it to Crush and Goose; speed-wise it is far better. I didn't time it, but it's quite obvious.