What is the best coding agent (CLI) like Claude Code for Local Development
Posted by exaknight21@reddit | LocalLLaMA | View on Reddit | 143 comments
Hey all:
I am trying to set up Claude Code to work with llama.cpp; I am using Qwen3.6-35B-A3B.
I usually use Claude Code + a ZLM subscription (I got lucky with $30 yearly). The setup there is very simple with their automated script, but for the life of me I cannot figure out how to get Claude Code to work against llama.cpp.
Am I hyper-focusing on Claude Code, or should I try things like pi.dev?
Any help/pointers/guides would be appreciated.
robogame_dev@reddit
This is the benchmark for agentic coding harnesses:
https://www.tbench.ai/leaderboard/terminal-bench/2.0
They test harness and model separately so you can, for example, compare 10 harnesses all using Opus 4.6 and know that you're really seeing harness impact, not model differences.
Spoiler: Claude Code is in last place, 10th out of 10, with Claude Opus 4.6… Make of that what you will (and probably choose a higher-performing harness).
splice42@reddit
This is LocalLLaMA though, we're looking for local solutions only.
ij123zhasd@reddit
This is just for terminal tasks though right?
Also there seem to be a lot of model + agent coding harness combinations missing.
FortiTree@reddit
Thanks, no Pi?
postitnote@reddit
You could just use Claude Code, ask it how the automated script works, and adapt it for your local LLM, if you don't want to figure it out yourself.
Positive-Raccoon-616@reddit
Opencode
tulsadune@reddit
opencode has nice built-in defaults that will let you use a local model. I use llama.cpp to run the model locally, then fire up opencode and use `local` in the /model selector. You don't even have to edit a config file.
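If it helps anyone picture it, the whole flow is roughly this (model path, context size, and port are placeholders for whatever you actually run):

```bash
# serve the model with llama.cpp (placeholder path/context/port)
llama-server -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf --ctx-size 65536 --port 8080

# then start opencode in your project and pick the local entry in the /model selector
opencode
```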
mycall@reddit
Is it better than RooCode?
tulsadune@reddit
https://roocode.com/blog/sunsetting-roo-code-extension-cloud-and-router
rorowhat@reddit
People say opencode tends to use too many tokens compared to others... can anyone confirm/deny this?
VartKat@reddit
I’m missing something… Why do you care about tokens if it's local and therefore not limited?
SubjectCellist609@reddit
Local: you are limited by the context that fits in VRAM.
Cloud: you are limited by the tokens you can pay for.
So even if the tokens are "free" locally, you have a finite amount until your RAM is full or you blow past the native context of the model (e.g. 128k for newer models).
perhaps_too_emphatic@reddit
All this and:
More tokens = more heat = more fans = more power draw. If you’re unplugged (on a plane, at the mechanic, in a cafe), that can be annoying.
I burned through >30% of my MBP battery in bed last night just using Aider to clean up some accessibility things on a tiny site. It took a half hour at most.
besmin@reddit
Also, the more instructions you put in, the less likely they are to be followed correctly.
FortiTree@reddit
Nicely put. It's also much slower in prompt processing (PP) speed, especially if they fck up the prefill cache. That would make it crawl.
So local is bound by both VRAM space and speed.
NVIDIA should release bigger-VRAM models for consumers, or AMD and Apple can step up to fill this need.
nabagaca@reddit
As others have mentioned, the initial prompt is huge and will instantly consume like 14% of my context (I run at 40k context), but after that it seems reasonable.
One-Replacement-37@reddit
It literally takes one md file to fix this.
FortiTree@reddit
I thought the system prompt was built in and you can't change it? Otherwise there would be no debate.
One-Replacement-37@reddit
Of course, just create a plan/build mode either through the JSON config or through the .opencode/modes/{plan,build}.md files.
You can literally leave the files empty after the frontmatter (configuring model/temperature), and it’ll remove system prompts.
Hint here: https://github.com/anomalyco/opencode/issues/5005#issuecomment-3609831805 but any LLM can confirm for you by having it read the code.
I’m not sure why this sub keeps on spreading false information.
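A minimal sketch of what that looks like (the frontmatter keys and model id here are placeholders; check your opencode version's docs for the exact names):

```bash
# a bare "plan" mode: frontmatter only, empty body, so no extra system prompt is injected
mkdir -p .opencode/modes
cat > .opencode/modes/plan.md <<'EOF'
---
model: local/qwen3.6-35b-a3b
temperature: 0.2
---
EOF
```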
FortiTree@reddit
Thanks, this makes more sense to me now. I meant that if we have access to the source code, we may as well turn it into our pet. Picking the harness is just a matter of preference then.
9gPgEpW82IUTRbCzC5qr@reddit
Opencode explicitly makes the system prompt configurable so it's in a better spot than other harnesses
idnvotewaifucontent@reddit
Cline's base prompt is like 12 or 13k tokens.
lloyd08@reddit
opencode had a historically weird compaction/pruning strategy which fucked up cached reads. This got generalized to "uses more tokens", when it's really that it just didn't reuse the cache properly. I have no idea if they fixed it, but this is pretty much explicitly why Anthropic did their "no 3rd party harnesses w/ Claude Pro" policy a year ago, even though their own system prompt is double opencode's. This may or may not have an impact on local models depending on your caching and context size, and it's entirely possible they've fixed it since the last time I used it. But because pi's system prompt is 20% of the size, the boogeyman will stick around. If you want the type of guardrails Claude gives you, opencode is a good alternative from the interface perspective, but like anything, you should test it for your use case.
Karyo_Ten@reddit
The base prompt is big (10k). I don't know about actual token usage as I only compared to Zed and Zed is way way worse. LLMs keep rewriting whole files in Zed.
What's interesting though is the oh-my-py hashlines approach for edit which seems more accurate AND saving tokens.
skredditt@reddit
I don’t know about that, but I have been test driving a locally-run service called vexp and I might be a believer. I’ve got both GitHub Copilot and Claude’s desktop app using it and it’s reporting a savings of around 70% of tokens. It seems to be extending my subscriptions pretty significantly.
ripter@reddit
I do this with Qwen 3.6 and it’s replaced Claude Code with sonnet for me. Sure it’s not as fast, but the code it generates has worked out well for me so far.
SFsports87@reddit
I'm looking for something similar. Which specific version of Qwen 3.6 are you using and what quant and parameters?
choudoufu@reddit
I tried out 35b-a3b and 27b-ud (unsloth). I like the 27b-ud a bit more but both are solid at coding.
idnvotewaifucontent@reddit
I find 27b dense makes fewer errors and I spend less time debugging than with 25b-a3b.
jadbox@reddit
27b is a lot slower (10x) if you're like me and only have 16 GB of VRAM and have to split layers between CPU and GPU.
ripter@reddit
Qwen3.6 35B A3B UD Q6_K. Make sure your `--ctx-size` is large; Qwen supports over 200k. Exact values can be found in the model card.
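Rough example of the launch command (quant filename and numbers are illustrative; take the real context limit from the model card):

```bash
# big context, all layers on GPU if they fit (placeholder values)
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf \
  --ctx-size 131072 \
  -ngl 99 \
  --port 8080
```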
idnvotewaifucontent@reddit
Alternatively, use a resource monitor and keep pushing your context until you use ~1 GB less than your total available VRAM. You can get pretty exact.
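On NVIDIA that's as simple as watching nvidia-smi while you bump `--ctx-size` (rocm-smi is the AMD equivalent):

```bash
# refresh GPU memory usage every second while the server loads and runs
watch -n 1 nvidia-smi
```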
SirGreenDragon@reddit
same, i use opencode with ollama on a headless linux box
Purple-Programmer-7@reddit
Upvote for opencode. Currently, I still like it more than pi, but pi also works.
rorykoehler@reddit
I like the design of pi (coding harness behind openclaw) but it's much less plug and play
ZubZero@reddit
Once you realize Pi can make its own extensions, I feel it's more plug and play. Less opinionated, in a weird way.
rorykoehler@reddit
It's great but you still have to put the work in and then there is a much larger and more diverse surface area of stuff that could go wrong in strange and interesting ways.
havnar-@reddit
It’s not like this doesn’t / can’t happen with frontier tools
HiddenoO@reddit
That's not the point.
Pi is basically like developing software from scratch instead of using existing software. Sure, you have more freedom, but you'll also run into all the issues that others already ran into and solved for that existing software.
That doesn't mean you won't run into issues with other harnesses, just that it's generally going to be much rarer and you won't have to spend all the time optimizing and fixing everything yourself.
FortiTree@reddit
Isn't that the beauty of a DIY local model? The more I read about Pi, the better it gets. At first I thought people were talking about running a local model on an actual Raspberry Pi, and I was like wtf, how did it get so powerful. And then ohh, it's not the Pi I remember.
Evening_Ad6637@reddit
It's more like: a lot of plugs and then play.
walden42@reddit
And if you want to get started with useful functionality out of the box, you can add orchestrators, planners, and anything else you need like https://github.com/ruizrica/agent-pi
rorykoehler@reddit
This look genuinely useful
colin_colout@reddit
for small models, this is the way. even the default system prompt is tiny. it's literally just this:
cmdr-William-Riker@reddit
Wait openclaw is based on pi? It gets less and less impressive, the more I learn
fdrch@reddit
So if we forget about openclaw, what's wrong about pi?
illforgetsoonenough@reddit
I'm reading it like they're saying openclaw is less impressive because it's using pi as a tool rather than their own harness
cmdr-William-Riker@reddit
Correct, I should have worded that better
fdrch@reddit
Thanks. Now I see :)
Speedping@reddit
qwen code (gemini cli fork) wired up to a decent qwen model is great
besmin@reddit
I am using it along with qwen3.6 35b through llama.cpp locally. I preferred it over opencode as it followed my requests better. Although the start was pretty good, I am having difficulties creating agent orchestration or making it follow instructions I put in the main system prompt. The main system prompt is composed of many instructions and becomes huge, which I cannot see and which is not very customisable. Sometimes it selects the wrong tool or agent because it has way too many instructions, I think. Overall good, but not very customisable, and the docs are not human-readable.
rorowhat@reddit
Link?
gurilagarden@reddit
Pi. Just from context overhead alone it's the clear winner. The amount of unnecessary shit that gets packed into system prompts for every other local harness adds up fast when running on consumer hardware. If you're serious about local AI-assisted coding, spending a day or two getting pi right where you want it gets paid back 10-fold. One-size-fits-all doesn't work on consumer hardware; specialized agents for specialized tasks meaningfully improve reliability and productivity.
jimmytoan@reddit
Aider is worth trying if you haven't - it has an architect mode that uses a stronger model to plan and a cheaper/local model to actually write the code, which works well for local setups where you're bottlenecked on the generation step. The `--model` and `--editor-model` flags let you split the reasoning vs. implementation load. Works cleanly with ollama.
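A sketch of that split, assuming an Ollama backend (the model tags are placeholders; use whatever your server actually exposes):

```bash
# --model does the planning/reasoning, --editor-model writes the actual edits
aider --architect \
  --model ollama_chat/qwen3.6:35b-a3b \
  --editor-model ollama_chat/qwen3.6:15b-a3b
```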
SadExcitement8893@reddit
Why not just use Claude locally? You all do know this is possible, right?
Express_Quail_1493@reddit
opencode is nice, but for small models it's brutal. If you want to make the most of your context window, use pi-coding-agent. Pi's system prompt is literally 1k tokens, giving the LLM more room to think and solve instead of suffering from system-prompt token diabetes.
Pretend_Engineer5951@reddit
I wonder why people use CLI agents for coding? Isn't it more comfortable with an IDE extension?
suprjami@reddit
With OpenCode you can have both.
Run a web instance with `opencode serve`, then a terminal with `opencode attach https://localhost:4096`. You have the same session in both. Often table rendering is better in the terminal.
Some OC features are only available in the terminal like loading an archived session and exporting sessions to files.
I spend most of my time in the terminal and Vim so it's more in line with the rest of my workflow.
ayylmaonade@reddit
For me personally, I already spend most of my time working in the terminal anyway since I prefer it to GUIs for the speed aspect. So having a terminal-based harness is perfect in my case.
Pretend_Engineer5951@reddit
Did you manage to handle tasks where you needed to produce a report in markdown?
ayylmaonade@reddit
Yep. I've got cronjobs running via Hermes delivering daily markdown reports. And the new TUI has full markdown support, it's really nice.
toptier4093@reddit
Yea I don't get it either. I absolutely love my cli agent for navigating my way around my system, but for coding I absolutely always use vs code with the claude extension.
Long_War8748@reddit
Hell no. Monitor 1: terminals (one of them running the CLI agent); Monitor 2: VS Code (editor of choice).
That is pure comfort 👘
loudsound-org@reddit
Wondering the same. I made a post earlier today asking about VSCode extensions for this sort of thing, but for some reason it's only had one view and doesn't even show up when I browse the sub.
TheIncarnated@reddit
There is no description. Hard to answer anything there
loudsound-org@reddit
For some reason my thread got filtered by reddit. No idea why. This is what I posted.
VSCode and agent integration
I've been using VSCode with Github Copilot for a bit (free tier) and am looking to try running locally due to running into all of the limits with GHCP. I'd like to have as close an experience as possible with both code autocomplete and chat integration. I know that GHCP can use local models, but I think I'll still run into session limits and such. If there's a way around that then maybe sticking with it would be best.
A few things about my setup that may make a difference. I'm running the model (primarily Qwen 3.6 35B, but would like the ability to switch to 27B and other models on the fly) on my Windows PC with llama.cpp. My local Linux server hosts all of my code and dev environments, and I primarily use my Windows laptop with VSCode on an SSH workspace into my server (which works fine with GHCP and any agentic tooling). I plan to also set up Hermes for non-coding use (on the Linux server), also using the Windows PC's models (the server only has a 1060 6GB GPU... looking at doing embeddings and such on it once I figure that out!).
So with that setup, what is the best integration with VSCode? The Hermes extension and use Hermes for coding as well? Continue pointed directly to my llama.cpp? Cline pointed to either Hermes (is that even possible?) or llama? Run pi.dev alongside Hermes and somehow integrate that (tho it seems pi is mostly for cli dev?). Some other option? Appreciate any advice!
Pretend_Engineer5951@reddit
Personally I settled on Roo Code for now. Check it out.
loudsound-org@reddit
They're stopping development on it though. Sounds like another team is going to pick it up but seems like something not worth diving into at this point.
TheIncarnated@reddit
Ahhh, I've found that GHCP does per-line/function submit and retrieval, whereas continuum does it all in context.
I would love better alternatives but my company also pays for GHCP at the highest level enterprise plan
loudsound-org@reddit
Interesting. I've had pretty good luck with it, except the fact that I'm cheap and don't want to pay a sub!
TheIncarnated@reddit
Well... I am trying to brush up on Nanocoder, since I don't care for locked in products like Claude Code and I've enjoyed OpenCode but want to see alternatives.
I found this little nugget on Nanocoder's GitHub
Maybe it will help?
ripter@reddit
It’s a different way of using the LLM. When I use VSCode with it, I tend to focus on the code, write some myself, and ask the LLM questions about what I’m doing.
With ClaudeCode or OpenCode, I’m thinking about the problems, telling it what I want it to do, then reviewing the result.
The IDE is more like writing it yourself, the CLI is more like telling another engineer the requirements and reviewing their code.
my_name_isnt_clever@reddit
Exactly this. There are some projects where I have an IDE open, but these days most of the time I just have a terminal with a few tabs. I spend a lot more time in agent TUIs than editing files in nvim.
ea_man@reddit
They work well in the shell; you can use them over ssh and tmux.
responds-with-tealc@reddit
for me i like the terminal, and already feel like im constantly fighting the bloat in most IDEs. i prefer things to be separate most of the time. I want my IDE to edit/navigate/run/debug code. thats it, no git, nothing else.
i use the terminal for reviewing changes and git management already.
2Norn@reddit
you can open your project in vscode and use the built in terminal from there
exaknight21@reddit (OP)
Code audits, generating small landing pages, looking over Dokploy Docker configurations, small bug squashes and pushing to git for easy Dokploy publishing. And finally, small experiments.
I am not a coder; I create small proofs of concept, then hand them off to my team to recreate and stabilize for production.
I guess it really depends but this is definitely helpful.
AltoidNerd@reddit
Depends on if you care to inspect the code. If you want to, the IDE is the way to go of course.
rpkarma@reddit
Eh even in that case, just start the harness in your integrated terminal!
SatoshiNotMe@reddit
The Qwen3.6 MOE you mentioned works very well with Claude Code. I’ve gathered the exact llama.cpp/server instructions here for this and other models:
https://pchalasani.github.io/claude-code-tools/integrations/local-llms/#qwen36-35b-a3b--fast-qwen-moe
Among recent models, this one gives the best TG (token gen) speed at nearly 40 tok/s and PP (prompt processing) nearly 500 tok/s on my 5 year old M1 Max 64 GB MacBook
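The general shape, for anyone who doesn't want to click through (values are illustrative; the linked page has the exact per-model llama-server commands):

```bash
# serve the model locally (placeholder path/context/port)
llama-server -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf --ctx-size 131072 --port 8080

# point Claude Code at the local endpoint instead of Anthropic's API
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=local
claude
```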
Ok_Chef_5858@reddit
opencode and aider both work well with llama.cpp if you stay CLI. If you're open to the editor route instead, Kilo Code in VS Code points at any local endpoint and runs Qwen through it the same way: agent modes, plus you can see the prompts and context. Either way, Claude Code itself is hard to wire to a local backend cleanly.
Mobile_Marsupial_619@reddit
Use Qwen Code; it gives all the features of Gemini CLI with third-party API support. It also supports both Gemini CLI and Claude Code extensions. It is working great for me now.
gcaussade@reddit
LOL, you can now use Claude Code with OpenRouter. It basically supports every LLM out there.
Far-Chest-8821@reddit
Without the "Lol" and with a link on how to, that could have been a proper answer.
Varterove_muke@reddit
Quick search will tell you the secret of this dark art.
https://medium.com/@luongnv89/run-claude-code-on-local-cloud-models-in-5-minutes-ollama-openrouter-llama-cpp-6dfeaee03cda
FortiTree@reddit
Sure. The sun rises and dies one day just how you may remember this is a local model sub.
lucaiuli@reddit
I am using VSCode with the KiloCode and Cline extensions and an LM Studio server, on a MacBook Pro M4 Max with 32 GB RAM and a Mac Mini M4 Pro with 48 GB RAM.
qwen/qwen3.5-27b on Macbook Pro
and
qwen/qwen3.6-27b on the MacMini
I am quite happy with Cline on Macbook for my coding needs, it does the job.
On the Mac Mini I'm using KiloCode, and it does split the task across many agents.
For now, that's my stable setup and does not require subscription.
hust921@reddit
Personally I was a bit reluctant to try pi, because it's so customizable and bare-bones. I felt that I needed to understand everything before using it. But it works perfectly fine out of the box! And with qwen3.6-35B, it has been working significantly better for me, than CC and opencode. Without ANY modification or plugins.
A lot of people become emotional about tools, operating systems, models. You are only punishing yourself by sticking to the one and only solution. If CC is really that much better, it should survive a round of comparison with other tools. And nobody is saying that you can't use both.
mrdevlar@reddit
I quite like Roo Code.
I've had more success with Roo than OpenCode. I found the scaffolding was smarter; it produced cleaner, better-encapsulated code. Though I haven't used opencode extensively, so it's possible that comes down to how I set it up.
That said, they are going through their monetization phase so not sure how long it'll still be good.
HumanDrone8721@reddit
opencode + a curated selection of oh-my-opencode plugins, Sisyphus is my favorite.
meow_goes_woof@reddit
What’s the hard part about getting Claude Code to work, or am I mistaken? You just need to add the z.ai models in ~/.claude/settings.json according to the docs and that's it.
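Roughly this shape, if memory serves (sketch only; take the exact base URL and key from z.ai's docs rather than from here):

```bash
# note: this overwrites the file; merge by hand if you already have settings in it
cat > ~/.claude/settings.json <<'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "your-zai-api-key"
  }
}
EOF
```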
Covert-Agenda@reddit
I use opencode with MLX on Apple, seems to do pretty well for agentic loops.
leinadsey@reddit
Claude Code!
Dry-Tune430@reddit
Pi and OpenCode are good enough
idnvotewaifucontent@reddit
I use Qwen3.6-27b Q4_K_M and Cline in VS Code, it's been way better than any other setup I've tried, including Qwen3.6-15B-A3B, the 3.5s, and Gemma 4. I have a 24 GB RTX 3090 and get about 23 tk/s.
Human_Information561@reddit
https://open.substack.com/pub/itayinbarr/p/honey-i-shrunk-the-coding-agent
This has been working amazingly for me. I figure for interview design-review prep, instead of studying "design Uber", I'll just build it. So far so good: it was able to ingest OSM and OSM routing, and it was able to simulate and render the data. I'm having it implement the APIs now, so I can update on how it did there. But so far really good, and I'm confident!
gregorskii@reddit
I’ve had the best luck with opencode and Claude. But with Claude I like to leave it alone and use it with Opus.
uspace@reddit
With subscription or API? How do you handle it?
gregorskii@reddit
I assumed this was about local AI since it's in local llama. I have had the best luck with opencode + qwen 3.6 27B and 35B these days.
exaknight21@reddit (OP)
OpenCode is working really well. Qwen Code is kind of painful: it's not generating full responses. It stops around 400 tokens per response even though max gen is set at 16k.
gregorskii@reddit
The model stops sometimes for me; I just tell it to resume. But I’m on an M5 Max with 128.
exaknight21@reddit (OP)
Yeah, OpenCode (WebGUI) worked great and was easiest. Pi I am having trouble with in terms of setup, and I get the same continuation issue. I’m trying nanocoder now and I’ll update my post with how it all goes.
kidousenshigundam@reddit
How did you get ZLM for $30/year?
Exciting_Garden2535@reddit
New Year's sale. The Lite plan was $3/month minus 10% extra, so if you bought a year, it was $32.
exaknight21@reddit (OP)
This past Christmas they had a sale for 32 dollars per year. I snatched it then.
Budget_Assignment457@reddit
Asking the important questions
ea_man@reddit
Opencode is "like" Claude Code; Qwen Code is built around Qwen LLMs.
bradwmorris@reddit
It's important to note that there is an interesting tension at the moment because model capabilities and coding harnesses are changing and leap-frogging each other every 35 seconds.
argument 1:
double down on a single model/harness and learn its nuances etc
argument 2:
switch between models/agents often
it's too early to decide, so I would advise building your harness/setup (like a pi) so you can easily switch out models and try them all
ea_man@reddit
But you can't easily switch out models; there's vertical optimization:
LLM - tools - prompts
2Norn@reddit
pi or opencode
exaknight21@reddit (OP)
OpenCode is definitely good. But I have not tried pi.dev yet, and Qwen Coder CLI is annoying. It was easier to set up in a way, but sadly the responses are not generating in full. I am about to move on after 2 hours of troubleshooting. My max gen tokens are set to 16k, but it stops generating around 498 tokens. It's weird.
Obvious_Equivalent_1@reddit
You need Ollama; that will work with Claude Code.
I just happened to set this up for someone else, which might come in useful: https://github.com/pcvelz/qwen-claude-code-getting-started/
For the setup of the local model with Claude Code you just need to put some extra parameters before the `claude` command to route the requests through Qwen.
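i.e. something along these lines (the repo has the exact values; this is just the shape, assuming Ollama on its default port):

```bash
# route Claude Code's requests to the local Qwen model served by Ollama
ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude
```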
SFsports87@reddit
Isn't 16 bit kv bigger and slower than 8 bit?
Obvious_Equivalent_1@reddit
Heads up: f16 being faster than q8_0 is Apple Silicon-specific. Metal's q8_0 dequant kernels are slow, so some of the memory savings get eaten per token. On NVIDIA, q8_0 is near-free.
On a 48 GB M4 Pro with qwen3.6:35b-a3b (~22 GB resident):
- f16 @ 65K ≈ 13.6 GiB
- f16 @ 131K ≈ 27 GiB (too tight)
- q8_0 @ 131K ≈ 13.6 GiB, but slower per token
Evening_Ad6637@reddit
But it should be mentioned that this is only true for M1 and M2
M3/M4/M5 should use q8 or bf16
Obvious_Equivalent_1@reddit
Measured this specifically on an M4 Pro (48 GB) - f16 was faster. The result may vary by Ollama version and model type (MoE vs dense behaves differently). I just now learnt that llama.cpp allows prompt caching, so I need to redo the benchmarking.
bf16 is genuinely interesting on M3+ though - native BF16 in Apple's Metal, same 2-byte size as f16, possibly zero-cost dequant. If you're on M3/M4 and not using llama.cpp, it’s worth trying OLLAMA_KV_CACHE_TYPE=bf16 and timing a real session.
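For example, something like this (model tag is a placeholder for whatever you have pulled; `--verbose` prints the eval-rate numbers):

```bash
# start the server with the bf16 KV cache
OLLAMA_KV_CACHE_TYPE=bf16 ollama serve

# in another shell, run a realistic prompt and compare the reported eval rates
ollama run qwen3.6:35b-a3b --verbose "explain this function step by step"
```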
Evening_Ad6637@reddit
I would never use ollama xD But yeah.. no I am on M1 Max. M5 Max would be a nice upgrade, but I already sold my two kidneys ^^
Latt@reddit
You don’t need Ollama to use a local LLM with Claude Code. For your specific description, sure, but it works just fine with MLX or llama.cpp.
Obvious_Equivalent_1@reddit
The model has only been out for a few days and this is the best I’ve got out of it so far. I did get llama.cpp to work, but not with Claude Code, and I’d be very much interested!
Do you have any reference or any working example?
Pleasant-Shallot-707@reddit
Pi
soulhacker@reddit
Swival
SupaBrunch@reddit
I’m using that model with vs code right now. Need to use the beta “insiders” version of vs code but it’s been working well.
DenizOkcu@reddit
NanoCoder is built with your use case in mind:
https://github.com/Nano-Collective/nanocoder
Steve_Streza@reddit
Been using it with Qwen3.6-27B, it's pretty usable
kexibis@reddit
CLine, great experience with Qwen3.6 27B
SirGreenDragon@reddit
i am using opencode with success. sometimes directly, sometimes through the ACP connection from openclaw
DANGERCAT9000@reddit
Personally I like crush from charm - because it feels like a good compromise between pi and opencode. The maintainers have a long track record of building great TUI apps, and they've been adding more features but doing so in a way that I think is really measured and reasonable. When they add new stuff it feels like they've actually thought about it rather than just taking in every single feature request. The pace of development feels sustainable which is something I worry about with other tools.
Krillian58@reddit
Claude code, hands down.
It's built in now. Claude Code supports local models natively through the Anthropic Messages API that Ollama and llama.cpp both expose.
Fastest path, Ollama:
`export ANTHROPIC_AUTH_TOKEN=ollama`
`export ANTHROPIC_BASE_URL=http://localhost:11434`
`claude`
...and it just works.
llama.cpp path:
Same idea; llama-server already exposes a compatible API. Point Claude Code at it:
`export ANTHROPIC_AUTH_TOKEN=local`
`export ANTHROPIC_BASE_URL=http://localhost:8080`
Claude Code prepends an attribution header that invalidates the KV cache on every request, making inference ~90% slower. To fix it, add to `~/.claude/settings.json`:
`{ "env": { "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" } }`
2Norn@reddit
CC is too bloated and streamlined for general use at this point.
If you want a project-specific, performant harness, nothing will beat pi.
Zeeplankton@reddit
Why is CC better? I thought in harness benchmarks it's not great.
Krillian58@reddit
Good question, it made me think for a second. And it might not be better for local. It's intended to be used with Claude, and I doubt they optimized it for local much. I was working off of having used many harnesses and liking how Claude Code managed its context the most. I've also seen many other similar posts that contributed to this.
Personally I wrap it up and use it like openclaw, so my use case is probably much different than OP's.
OneSlash137@reddit
Unless you’re loading that 100% into vram you won’t be able to do anything with it.
You can chat with it but as soon as you try to code, have it analyze a code base, pretty much do anything you’d actually WANT to do with it, it will fail.
exaknight21@reddit (OP)
131K context isn’t enough? Sorry, I’m a little new to this.
I’ve got 32GB MI50 - 64 GB DDR3, no good?
OneSlash137@reddit
It isn’t the context length that’s the problem. I can have a 256k context length with some vram to spare. The problem is that some of the experts are offloaded to CPU, that makes tokens per second god awful.
Your first agentic prompt is going to be huge depending on your stack. After that things start to pick up steam, because the first prompt is bloated but subsequent ones are smaller and don't include the entire prompt; it's more of a delta. Inevitably the deltas stop and the full prompt needs to be sent again. By that time the entire prompt is usually well over 40k, which takes minutes and minutes to process; harnesses and tools start to time out waiting for API responses, the tools start asking for responses again, and that's that. At that point it all falls apart.
So it isn't a matter of context length, it's token processing speed.
I used to think I wouldn't care how long it took for an LLM to respond as long as it was a quality response. Now I see just how long these tasks are going to take, and they will rarely if ever be one-shot unless you have insanely careful planning.
x8code@reddit
You are correct. A bunch of poor people without GPUs are coping by down voting you.
OkBase5453@reddit
He wants to use the Qwen3.6-35B-A3B model, which should work.
Try ik_llama.cpp; it might work with your card. From my experience (RTX 3090), it is faster than llama.cpp.
I am using Qwen Code, and it is great in terms of tool calling and speed.
Good luck
OleCuvee@reddit
Spot on! Same as you, I thought sod it, worst case the task will run overnight. It timed out, never finished, or broke apart after a step or two. I tried to split the tasks into small parts, but as soon as I wanted it to process the complex detailed brief, nah!
dreamai87@reddit
My choices:
1. mistral vibe - moderate instruction prompt size (8k). Simple and good features.
2. pi - smallest instruction prompt - only a code mode, but it's good 👍
3. qwen cli - 14k instruction prompt - good and rich features.
4. then whatever
see_spot_ruminate@reddit
Mistral vibe is just so easy to work with.
It’s nice to add to as well. I’ve used it to make its own MCP web search, agents, skills, etc. It’s not flashy, but tool calls work well.
rorowhat@reddit
Can you select different models?
see_spot_ruminate@reddit
It comes with some built-in models, but you have to pay (I think) for the stuff from Mistral; you don't have to use any of that, though. Just delete it from the config, then add your llama.cpp endpoint with your models. It's very easy in the config file.
Curious-Function7490@reddit
I'm currently setting up OpenCode. One of the nice things about it is that you can specify multiple agents for different purposes. You need at least two: one for planning and one for building.
I use my Claude Pro subscription for planning.
Over the weekend I configured Qwen2.5-Coder-32B on my gaming rig with an RTX 4090, using llama.cpp on WSL. It's running as my build agent. I'm getting 30 tokens a second.
It isn't a flawless experience yet but I'm getting some results. Still experimenting.
OGScottingham@reddit
Cline works pretty well with it.