What agentic CLI do you use for local models?

Posted by siegevjorn@reddit | LocalLLaMA | View on Reddit | 12 comments

Title says it all: are there any notable differences among them? I know Claude Code is the industry standard, OpenCode is probably the most popular open-source project, and there's also Crush from Charm. Can gemini-cli and Claude Code run against local models? My plan is to spin up a llama.cpp server and point the CLI at its endpoint.
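For the llama.cpp route, here's a minimal sketch of what I have in mind. The model file, port, and context size are placeholders, and whether a given CLI honors the standard OpenAI env vars is an assumption that probably varies per tool:

```shell
# Minimal sketch, assuming a llama.cpp build that includes llama-server.
# The model path and port are placeholders.
llama-server -m ./models/my-model.gguf --port 8080 --ctx-size 32768

# llama-server exposes an OpenAI-compatible API under /v1, so a CLI that
# reads the usual OpenAI environment variables could, in principle, use it:
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="none"   # llama-server doesn't require a real key by default
```

Some CLIs want this in a provider config file instead of env vars, so the exact wiring depends on the tool.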

Also, has anyone had luck with open-weight models for agentic tasks? How do Qwen3.5 / Gemma4 compare to Sonnet? Is gpt-oss-120b still the balance king, or has it been overtaken by Qwen3.5 / Gemma4? And I wonder whether 10-20 tok/s is enough for running agents.

Finally, for those of you who use both Claude and local models: what sorts of tasks do you hand off to the local models?