What is the best coding agent (CLI) like Claude Code for Local Development
Posted by exaknight21@reddit | LocalLLaMA | View on Reddit | 143 comments
Hey all:
I am trying to set up Claude Code to work with llama.cpp; I am using Qwen3.6-35B-A3B.
I usually use Claude Code + a ZLM subscription (I got lucky with $30 yearly). The setup there is very simple with their automated script, but for the life of me I cannot figure out how to get Claude Code to work against llama.cpp.
Am I hyper-focusing on Claude Code, or should I try things like pi.dev?
Any help/pointers/guides would be appreciated.
robogame_dev@reddit
This is the benchmark for agentic coding harnesses:
https://www.tbench.ai/leaderboard/terminal-bench/2.0
They test harness and model separately so you can, for example, compare 10 harnesses all using Opus 4.6 and know that you're really seeing harness impact, not model differences.
Spoiler: Claude Code is in last place, 10th out of 10, with Claude Opus 4.6… Make of that what you will (and probably choose a higher-performing harness).
splice42@reddit
This is LocalLLaMA though, we're looking for local solutions only.
ij123zhasd@reddit
This is just for terminal tasks though right?
Also there seem to be a lot of model + agent coding harness combinations missing.
FortiTree@reddit
Thanks, no Pi?
postitnote@reddit
You could just use Claude Code, ask it how the automated script works, and adapt it for your local LLM, if you don't want to figure it out yourself.
Positive-Raccoon-616@reddit
Opencode
tulsadune@reddit
opencode has nice built-in defaults that will let you use a local model. I use llama.cpp to run the model locally, then fire up opencode and use `local` in the /model selector. You don't even have to edit a config file.
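If it helps anyone picture it, the whole flow is roughly this (model path, context size, and port are placeholders for whatever you actually run):

```bash
# serve the model with llama.cpp (placeholder path/context/port)
llama-server -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf --ctx-size 65536 --port 8080

# then start opencode in your project and pick the local entry in the /model selector
opencode
```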
mycall@reddit
Is it better than RooCode?
tulsadune@reddit
https://roocode.com/blog/sunsetting-roo-code-extension-cloud-and-router
rorowhat@reddit
People say opencode tends to use too many tokens compared to others... can anyone confirm/deny this?
VartKat@reddit
I’m missing something… Why do you care about tokens if it's local and therefore not limited?
SubjectCellist609@reddit
Local: you are limited by the context that fits in VRAM.
Cloud: you are limited by the tokens you can pay for.
So even if the tokens are "free" locally, you have a finite amount until your RAM is full or you blow past the native context of the model (e.g. 128k for newer models).
perhaps_too_emphatic@reddit
All this and:
More tokens = more heat = more fans = more power draw. If you’re unplugged (on a plane, at the mechanic, in a cafe), that can be annoying.
I burned through >30% of my MBP battery in bed last night just using Aider to clean up some accessibility things on a tiny site. It took a half hour at most.
besmin@reddit
Also, the more instructions you put in, the less likely they are to be followed correctly.
FortiTree@reddit
Nicely put. It's also much slower in prompt processing (PP) speed, especially if they fck up the prefill cache. That would make it crawl.
So local is bound by both VRAM space and speed.
NVIDIA should release bigger-VRAM models for consumers, or AMD and Apple can step up to fill this need.
nabagaca@reddit
As others have mentioned, the initial prompt is huge and will instantly consume like 14% of my context (I run at 40k context), but after that it seems reasonable.
One-Replacement-37@reddit
It literally takes one md file to fix this.
FortiTree@reddit
I thought the system prompt was built in and you can't change it? Otherwise there would be no debate.
One-Replacement-37@reddit
Of course, just create a plan/build mode either through the JSON config or through the .opencode/modes/{plan,build}.md files.
You can literally leave the files empty after the frontmatter (configuring model/temperature), and it’ll remove system prompts.
Hint here: https://github.com/anomalyco/opencode/issues/5005#issuecomment-3609831805 but any LLM can confirm for you by having it read the code.
I’m not sure why this sub keeps on spreading false information.
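A minimal sketch of what that looks like (the frontmatter keys and model id here are placeholders; check your opencode version's docs for the exact names):

```bash
# a bare "plan" mode: frontmatter only, empty body, so no extra system prompt is injected
mkdir -p .opencode/modes
cat > .opencode/modes/plan.md <<'EOF'
---
model: local/qwen3.6-35b-a3b
temperature: 0.2
---
EOF
```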
FortiTree@reddit
Thanks, this makes more sense to me now. I meant that if we have access to the source code, we may as well turn it into our pet. Picking the harness is just a matter of preference then.
9gPgEpW82IUTRbCzC5qr@reddit
Opencode explicitly makes the system prompt configurable so it's in a better spot than other harnesses
idnvotewaifucontent@reddit
Cline's base prompt is like 12 or 13k tokens.
lloyd08@reddit
opencode had a historically weird compaction/pruning strategy which fucked up cached reads. This got generalized to "uses more tokens", when it's really that it just didn't reuse the cache properly. I have no idea if they fixed it, but this is pretty much explicitly why Anthropic did their "no 3rd party harnesses w/ Claude Pro" policy a year ago, even though their own system prompt is double opencode's. This may or may not have an impact on local models depending on your caching and context size, and it's entirely possible they've fixed it since the last time I used it. But because pi's system prompt is 20% of the size, the boogeyman will stick around. If you want the type of guardrails Claude gives you, opencode is a good alternative from the interface perspective, but like anything, you should test it for your use case.
Karyo_Ten@reddit
The base prompt is big (10k). I don't know about actual token usage as I only compared to Zed and Zed is way way worse. LLMs keep rewriting whole files in Zed.
What's interesting though is the oh-my-py hashlines approach for edit which seems more accurate AND saving tokens.
skredditt@reddit
I don’t know about that, but I have been test driving a locally-run service called vexp and I might be a believer. I’ve got both GitHub Copilot and Claude’s desktop app using it and it’s reporting a savings of around 70% of tokens. It seems to be extending my subscriptions pretty significantly.
ripter@reddit
I do this with Qwen 3.6 and it’s replaced Claude Code with sonnet for me. Sure it’s not as fast, but the code it generates has worked out well for me so far.
SFsports87@reddit
I'm looking for something similar. Which specific version of Qwen 3.6 are you using and what quant and parameters?
choudoufu@reddit
I tried out 35b-a3b and 27b-ud (unsloth). I like the 27b-ud a bit more but both are solid at coding.
idnvotewaifucontent@reddit
I find 27b dense makes fewer errors and I spend less time debugging than with 25b-a3b.
jadbox@reddit
27b is a lot slower (10x) if you're like me and only have 16 GB of VRAM and have to split layers between CPU and GPU.
ripter@reddit
Qwen3.6 35B A3B UD Q6_K. Make sure your `--ctx-size` is large; Qwen supports over 200k. Exact values can be found in the model card.
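Rough example of the launch command (quant filename and numbers are illustrative; take the real context limit from the model card):

```bash
# big context, all layers on GPU if they fit (placeholder values)
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf \
  --ctx-size 131072 \
  -ngl 99 \
  --port 8080
```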
idnvotewaifucontent@reddit
Alternatively, use a resource monitor and keep pushing your context until you use ~1 GB less than your total available VRAM. You can get pretty exact.
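On NVIDIA that's as simple as watching nvidia-smi while you bump `--ctx-size` (rocm-smi is the AMD equivalent):

```bash
# refresh GPU memory usage every second while the server loads and runs
watch -n 1 nvidia-smi
```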
SirGreenDragon@reddit
same, i use opencode with ollama on a headless linux box
Purple-Programmer-7@reddit
Upvote for opencode. Currently, I still like it more than pi, but pi also works.
rorykoehler@reddit
I like the design of pi (coding harness behind openclaw) but it's much less plug and play
ZubZero@reddit
Once you realize Pi can make its own extensions, I feel it's more plug and play. Less opinionated, in a weird way.
rorykoehler@reddit
It's great but you still have to put the work in and then there is a much larger and more diverse surface area of stuff that could go wrong in strange and interesting ways.
havnar-@reddit
It’s not like this doesn’t / can’t happen with frontier tools
HiddenoO@reddit
That's not the point.
Pi is basically like developing software from scratch instead of using existing software. Sure, you have more freedom, but you'll also run into all the issues that others already ran into and solved for that existing software.
That doesn't mean you won't run into issues with other harnesses, just that it's generally going to be much rarer and you won't have to spend all the time optimizing and fixing everything yourself.
FortiTree@reddit
Isn't that the beauty of a DIY local model? The more I read about Pi, the better it gets. At first I thought people were talking about running a local model on an actual Raspberry Pi, and I was like wtf, how did it get so powerful. And then ohh, it's not the Pi I remember.
Evening_Ad6637@reddit
It's more like: a lot of plugs and then play.
walden42@reddit
And if you want to get started with useful functionality out of the box, you can add orchestrators, planners, and anything else you need like https://github.com/ruizrica/agent-pi
rorykoehler@reddit
This look genuinely useful
colin_colout@reddit
for small models, this is the way. even the default system prompt is tiny. it's literally just this:
cmdr-William-Riker@reddit
Wait openclaw is based on pi? It gets less and less impressive, the more I learn
fdrch@reddit
So if we forget about openclaw, what's wrong about pi?
illforgetsoonenough@reddit
I'm reading it like they're saying openclaw is less impressive because it's using pi as a tool rather than their own harness
cmdr-William-Riker@reddit
Correct, I should have worded that better
fdrch@reddit
Thanks. Now I see :)
Speedping@reddit
qwen code (gemini cli fork) wired up to a decent qwen model is great
besmin@reddit
I am using it along with qwen3.6 35b through llama.cpp locally. I preferred it over opencode as it followed my requests better. Although the start was pretty good, I am having difficulties creating agent orchestration or making it follow instructions I put in the main system prompt. The main system prompt is composed of many instructions and becomes huge, which I cannot see and which is not very customisable. Sometimes it selects the wrong tool or agent because it has way too many instructions, I think. Overall good, but not very customisable, and the docs are not human-readable.
rorowhat@reddit
Link?
gurilagarden@reddit
Pi. Just from context overhead alone it's the clear winner. The amount of unnecessary shit that gets packed into system prompts for every other local harness adds up fast when running on consumer hardware. If you're serious about local AI-assisted coding, spending a day or two getting pi right where you want it gets paid back 10-fold. One-size-fits-all doesn't work on consumer hardware; specialized agents for specialized tasks meaningfully improve reliability and productivity.
jimmytoan@reddit
Aider is worth trying if you haven't - it has an architect mode that uses a stronger model to plan and a cheaper/local model to actually write the code, which works well for local setups where you're bottlenecked on the generation step. The `--model` and `--editor-model` flags let you split the reasoning vs. implementation load. Works cleanly with ollama.
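A sketch of that split, assuming an Ollama backend (the model tags are placeholders; use whatever your server actually exposes):

```bash
# --model does the planning/reasoning, --editor-model writes the actual edits
aider --architect \
  --model ollama_chat/qwen3.6:35b-a3b \
  --editor-model ollama_chat/qwen3.6:15b-a3b
```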
SadExcitement8893@reddit
Why not just use Claude locally? You all do know this is possible, right?
Express_Quail_1493@reddit
opencode is nice, but for small models it's brutal. If you want to make the most of your context window, use pi-coding-agent. Pi's system prompt is literally 1k tokens, giving the LLM more room to think and solve instead of suffering from system-prompt token diabetes.
Pretend_Engineer5951@reddit
I wonder why people use CLI agents for coding? Isn't it more comfortable with an IDE extension?
suprjami@reddit
With OpenCode you can have both.
Run a web instance with `opencode serve`, then a terminal with `opencode attach https://localhost:4096`. You have the same session in both. Often table rendering is better in the terminal.
Some OC features are only available in the terminal like loading an archived session and exporting sessions to files.
I spend most of my time in the terminal and Vim so it's more in line with the rest of my workflow.
ayylmaonade@reddit
For me personally, I already spend most of my time working in the terminal anyway since I prefer it to GUIs for the speed aspect. So having a terminal-based harness is perfect in my case.
Pretend_Engineer5951@reddit
Did you manage to handle tasks where you needed to produce a report in markdown?
ayylmaonade@reddit
Yep. I've got cronjobs running via Hermes delivering daily markdown reports. And the new TUI has full markdown support, it's really nice.
toptier4093@reddit
Yea I don't get it either. I absolutely love my cli agent for navigating my way around my system, but for coding I absolutely always use vs code with the claude extension.
Long_War8748@reddit
Hell no. Monitor 1: terminals (one of them running the CLI agent); Monitor 2: VS Code (editor of choice).
That is pure comfort 👘
loudsound-org@reddit
Wondering the same. I made a post earlier today asking about VSCode extensions for this sort of thing, but for some reason it's only had one view and doesn't even show up when I browse the sub.
TheIncarnated@reddit
There is no description. Hard to answer anything there
loudsound-org@reddit
For some reason my thread got filtered by reddit. No idea why. This is what I posted.
VSCode and agent integration
I've been using VSCode with Github Copilot for a bit (free tier) and am looking to try running locally due to running into all of the limits with GHCP. I'd like to have as close an experience as possible with both code autocomplete and chat integration. I know that GHCP can use local models, but I think I'll still run into session limits and such. If there's a way around that then maybe sticking with it would be best.
A few things about my setup that may make a difference. I'm running the model (primarily Qwen 3.6 35B, but would like the ability to switch to 27B and other models on the fly) on my Windows PC with llama.cpp. My local Linux server hosts all of my code and dev environments, and I primarily use my Windows laptop with VSCode on an SSH workspace into my server (which works fine with GHCP and any agentic tooling). I plan to also set up Hermes for non-coding use (on the Linux server), also using the Windows PC's models (the server only has a 1060 6GB GPU... looking at doing embeddings and such on it once I figure that out!).
So with that setup, what is the best integration with VSCode? The Hermes extension and use Hermes for coding as well? Continue pointed directly to my llama.cpp? Cline pointed to either Hermes (is that even possible?) or llama? Run pi.dev alongside Hermes and somehow integrate that (tho it seems pi is mostly for cli dev?). Some other option? Appreciate any advice!
Pretend_Engineer5951@reddit
Personally I settled on Roo Code for now. Check it out.
loudsound-org@reddit
They're stopping development on it though. Sounds like another team is going to pick it up but seems like something not worth diving into at this point.
TheIncarnated@reddit
Ahhh, I've found that GHCP does per-line/function submit and retrieval, whereas continuum does it all in context.
I would love better alternatives but my company also pays for GHCP at the highest level enterprise plan
loudsound-org@reddit
Interesting. I've had pretty good luck with it, except the fact that I'm cheap and don't want to pay a sub!
TheIncarnated@reddit
Well... I am trying to brush up on Nanocoder, since I don't care for locked in products like Claude Code and I've enjoyed OpenCode but want to see alternatives.
I found this little nugget on Nanocoder's GitHub
Maybe it will help?
ripter@reddit
It’s a different way of using the LLM. When I use VSCode with it, I tend to focus on the code, write some myself, and ask the LLM questions about what I’m doing.
With ClaudeCode or OpenCode, I’m thinking about the problems, telling it what I want it to do, then reviewing the result.
The IDE is more like writing it yourself, the CLI is more like telling another engineer the requirements and reviewing their code.
my_name_isnt_clever@reddit
Exactly this. There are some projects where I have an IDE open, but these days most of the time I just have a terminal with a few tabs. I spend a lot more time in agent TUIs than editing files in nvim.
ea_man@reddit
They work well in the shell; you can use them over ssh and tmux.
responds-with-tealc@reddit
for me i like the terminal, and already feel like im constantly fighting the bloat in most IDEs. i prefer things to be separate most of the time. I want my IDE to edit/navigate/run/debug code. thats it, no git, nothing else.
i use the terminal for reviewing changes and git management already.
2Norn@reddit
you can open your project in vscode and use the built in terminal from there
exaknight21@reddit (OP)
Code audits, generating small landing pages, looking over Dokploy Docker configurations, small bug squashes and pushing to git for easy Dokploy publishing. And finally, small experiments.
I am not a coder; I create small proofs of concept, then hand them off to my team to recreate and stabilize for production.
I guess it really depends but this is definitely helpful.
AltoidNerd@reddit
Depends on if you care to inspect the code. If you want to, the IDE is the way to go of course.
rpkarma@reddit
Eh even in that case, just start the harness in your integrated terminal!
SatoshiNotMe@reddit
The Qwen3.6 MOE you mentioned works very well with Claude Code. I’ve gathered the exact llama.cpp/server instructions here for this and other models:
https://pchalasani.github.io/claude-code-tools/integrations/local-llms/#qwen36-35b-a3b--fast-qwen-moe
Among recent models, this one gives the best TG (token gen) speed at nearly 40 tok/s and PP (prompt processing) nearly 500 tok/s on my 5 year old M1 Max 64 GB MacBook
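The general shape, for anyone who doesn't want to click through (values are illustrative; the linked page has the exact per-model llama-server commands):

```bash
# serve the model locally (placeholder path/context/port)
llama-server -m ./Qwen3.6-35B-A3B-UD-Q6_K.gguf --ctx-size 131072 --port 8080

# point Claude Code at the local endpoint instead of Anthropic's API
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=local
claude
```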
Ok_Chef_5858@reddit
opencode and aider both work well with llama.cpp if you stay CLI. If you're open to the editor route instead, Kilo Code in VS Code points at any local endpoint and runs Qwen through it the same way: agent modes, plus you can see the prompts and context. Either way, Claude Code itself is hard to wire to a local backend cleanly.
Mobile_Marsupial_619@reddit
Use Qwen Code; it gives all the features of Gemini CLI with third-party API support. It also supports both Gemini CLI and Claude Code extensions. It is working great for me now.
gcaussade@reddit
LOL, you can now use Claude Code with OpenRouter. It basically supports every LLM out there.
Far-Chest-8821@reddit
Without the "Lol" and with a link on how to, that could have been a proper answer.
Varterove_muke@reddit
Quick search will tell you the secret of this dark art.
https://medium.com/@luongnv89/run-claude-code-on-local-cloud-models-in-5-minutes-ollama-openrouter-llama-cpp-6dfeaee03cda
FortiTree@reddit
Sure. The sun rises and dies one day just how you may remember this is a local model sub.
lucaiuli@reddit
I am using VSCode with the KiloCode and Cline extensions and an LM Studio server, on a MacBook Pro M4 Max with 32 GB RAM and a Mac Mini M4 Pro with 48 GB RAM.
qwen/qwen3.5-27b on Macbook Pro
and
qwen/qwen3.6-27b on the MacMini
I am quite happy with Cline on Macbook for my coding needs, it does the job.
On the Mac Mini I'm using KiloCode, and it does split the task across many agents.
For now, that's my stable setup and does not require subscription.
hust921@reddit
Personally I was a bit reluctant to try pi, because it's so customizable and bare-bones. I felt that I needed to understand everything before using it. But it works perfectly fine out of the box! And with qwen3.6-35B, it has been working significantly better for me, than CC and opencode. Without ANY modification or plugins.
A lot of people become emotional about tools, operating systems, models. You are only punishing yourself by sticking to the one and only solution. If CC is really that much better, it should survive a round of comparison with other tools. And nobody is saying that you can't use both.
mrdevlar@reddit
I quite like Roo Code.
I've had more success with Roo than OpenCode. I found the scaffolding was smarter; it produced cleaner, better-encapsulated code. Though I haven't used opencode extensively, so it's possible that comes down to how I set it up.
That said, they are going through their monetization phase so not sure how long it'll still be good.
HumanDrone8721@reddit
opencode + a curated selection of oh-my-opencode plugins, Sisyphus is my favorite.
meow_goes_woof@reddit
What’s the hard part about getting Claude Code to work, or am I mistaken? You just need to add the z.ai models in ~/.claude/settings.json according to the docs and that's it.
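Roughly this shape, if memory serves (sketch only; take the exact base URL and key from z.ai's docs rather than from here):

```bash
# note: this overwrites the file; merge by hand if you already have settings in it
cat > ~/.claude/settings.json <<'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "your-zai-api-key"
  }
}
EOF
```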
Covert-Agenda@reddit
I use opencode with MLX on Apple, seems to do pretty well for agentic loops.
leinadsey@reddit
Claude Code!
Dry-Tune430@reddit
Pi and OpenCode are good enough
idnvotewaifucontent@reddit
I use Qwen3.6-27b Q4_K_M and Cline in VS Code, it's been way better than any other setup I've tried, including Qwen3.6-15B-A3B, the 3.5s, and Gemma 4. I have a 24 GB RTX 3090 and get about 23 tk/s.
Human_Information561@reddit
https://open.substack.com/pub/itayinbarr/p/honey-i-shrunk-the-coding-agent
This has been working amazingly for me. I figure for interview design-review prep, instead of studying "design Uber", I'll just build it. So far so good: it was able to ingest OSM and OSM routing, and it was able to simulate and render the data. I'm having it implement the APIs now, so I can update on how it did there. But so far really good, and I'm confident!
gregorskii@reddit
I’ve had the best luck with opencode and Claude. But with Claude I like to leave it alone and use it with Opus.
uspace@reddit
With subscription or API? How do you handle it?
gregorskii@reddit
I assumed this was about local AI since it's in local llama. I have had the best luck with opencode + qwen 3.6 27B and 35B these days.
exaknight21@reddit (OP)
OpenCode is working really well. Qwen Code is kind of painful: it's not generating full responses. It stops around 400 tokens per response even though max gen is set at 16k.
gregorskii@reddit
The model stops sometimes for me; I just tell it to resume. But I’m on an M5 Max with 128.
exaknight21@reddit (OP)
Yeah, OpenCode (WebGUI) worked great and was easiest. Pi I am having trouble with in terms of setup, and I get the same continuation issue. I’m trying nanocoder now and I’ll update my post with how it all goes.
kidousenshigundam@reddit
How did you get ZLM for $30/year?
Exciting_Garden2535@reddit
New Year's sale. The Lite plan was $3/month minus 10% extra, so if you bought a year, it was $32.
exaknight21@reddit (OP)
This past Christmas they had a sale for 32 dollars per year. I snatched it then.
Budget_Assignment457@reddit
Asking the important questions
ea_man@reddit
Opencode is "like" Claude Code; Qwen Code is built around Qwen LLMs.
bradwmorris@reddit
It's important to note that there is an interesting tension at the moment because model capabilities and coding harnesses are changing and leap-frogging each other every 35 seconds.
argument 1:
double down on a single model/harness and learn its nuances etc
argument 2:
switch between models/agents often
it's too early to decide, so I would advise building your harness/setup (like a pi) so you can easily switch out models and try them all
ea_man@reddit
But you can't easily switch out models; there's vertical optimization:
LLM - tools - prompts
2Norn@reddit
pi or opencode
exaknight21@reddit (OP)
OpenCode is definitely good. But I have not tried pi.dev yet, and Qwen Coder CLI is annoying. It was easier to set up in a way, but sadly the responses are not generating in full. I am about to move on after 2 hours of troubleshooting. My max gen tokens are set to 16k, but it stops generating around 498 tokens. It's weird.
Obvious_Equivalent_1@reddit
You need Ollama; that will work with Claude Code.
I just happened to set this up for someone else, which might come in useful: https://github.com/pcvelz/qwen-claude-code-getting-started/
For the setup of the local model with Claude Code you just need to put some extra parameters before the `claude` command to route the requests through Qwen.
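i.e. something along these lines (the repo has the exact values; this is just the shape, assuming Ollama on its default port):

```bash
# route Claude Code's requests to the local Qwen model served by Ollama
ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_AUTH_TOKEN=ollama claude
```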
SFsports87@reddit
Isn't 16 bit kv bigger and slower than 8 bit?
Obvious_Equivalent_1@reddit
Heads up: f16 being faster than q8_0 is Apple Silicon-specific. Metal's q8_0 dequant kernels are slow, so some of the memory savings get eaten per token. On NVIDIA, q8_0 is near-free.
On a 48 GB M4 Pro with qwen3.6:35b-a3b (~22 GB resident):
- f16 @ 65K ≈ 13.6 GiB
- f16 @ 131K ≈ 27 GiB (too tight)
- q8_0 @ 131K ≈ 13.6 GiB, but slower per token
Evening_Ad6637@reddit
But it should be mentioned that this is only true for M1 and M2
M3/M4/M5 should use q8 or bf16
Obvious_Equivalent_1@reddit
Measured this specifically on an M4 Pro (48 GB) - f16 was faster. The result may vary by Ollama version and model type (MoE vs dense behaves differently). I just now learnt that llama.cpp allows prompt caching, so I need to redo the benchmarking.
bf16 is genuinely interesting on M3+ though - native BF16 in Apple's Metal, same 2-byte size as f16, possibly zero-cost dequant. If you're on M3/M4 and not using llama.cpp, it’s worth trying OLLAMA_KV_CACHE_TYPE=bf16 and timing a real session.
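For example, something like this (model tag is a placeholder for whatever you have pulled; `--verbose` prints the eval-rate numbers):

```bash
# start the server with the bf16 KV cache
OLLAMA_KV_CACHE_TYPE=bf16 ollama serve

# in another shell, run a realistic prompt and compare the reported eval rates
ollama run qwen3.6:35b-a3b --verbose "explain this function step by step"
```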
Evening_Ad6637@reddit
I would never use ollama xD But yeah.. no I am on M1 Max. M5 Max would be a nice upgrade, but I already sold my two kidneys ^^
Latt@reddit
You don’t need Ollama to use a local LLM with Claude Code. For your specific description, sure, but it works just fine with MLX or llama.cpp.
Obvious_Equivalent_1@reddit
The model has only been out for a few days and this is the best I’ve got out of it so far. I did get llama.cpp to work, but not with Claude Code, and I’d be very much interested!
Do you have any reference or any working example?
Pleasant-Shallot-707@reddit
Pi
soulhacker@reddit
Swival
SupaBrunch@reddit
I’m using that model with vs code right now. Need to use the beta “insiders” version of vs code but it’s been working well.
DenizOkcu@reddit
NanoCoder is built with your use case in mind:
https://github.com/Nano-Collective/nanocoder
Steve_Streza@reddit
Been using it with Qwen3.6-27B, it's pretty usable
kexibis@reddit
CLine, great experience with Qwen3.6 27B
SirGreenDragon@reddit
i am using opencode with success. sometimes directly, sometimes through the ACP connection from openclaw
DANGERCAT9000@reddit
Personally I like crush from charm - because it feels like a good compromise between pi and opencode. The maintainers have a long track record of building great TUI apps, and they've been adding more features but doing so in a way that I think is really measured and reasonable. When they add new stuff it feels like they've actually thought about it rather than just taking in every single feature request. The pace of development feels sustainable which is something I worry about with other tools.
Krillian58@reddit
Claude code, hands down.
It's built in now. Claude Code supports local models natively through the Anthropic Messages API that Ollama and llama.cpp both expose.
Fastest path, Ollama:
`export ANTHROPIC_AUTH_TOKEN=ollama`
`export ANTHROPIC_BASE_URL=http://localhost:11434`
`claude`
...and it just works.
llama.cpp path:
Same idea; llama-server already exposes a compatible API. Point Claude Code at it:
`export ANTHROPIC_AUTH_TOKEN=local`
`export ANTHROPIC_BASE_URL=http://localhost:8080`
Claude Code prepends an attribution header that invalidates the KV cache on every request, making inference ~90% slower. To fix it, add to `~/.claude/settings.json`:
`{ "env": { "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" } }`
2Norn@reddit
CC is too bloated and streamlined for general use at this point.
If you want a project-specific, performant harness, nothing will beat pi.
Zeeplankton@reddit
Why is CC better? I thought in harness benchmarks it's not great.
Krillian58@reddit
Good question, it made me think for a second. And it might not be better for local. It's intended to be used with Claude, and I doubt they optimized it for local much. I was working off of having used many harnesses and liking how Claude Code managed its context the most. I've also seen many other similar posts that contributed to this.
Personally I wrap it up and use it like openclaw, so my use case is probably much different than OP's.
OneSlash137@reddit
Unless you’re loading that 100% into vram you won’t be able to do anything with it.
You can chat with it but as soon as you try to code, have it analyze a code base, pretty much do anything you’d actually WANT to do with it, it will fail.
exaknight21@reddit (OP)
131K context isn’t enough? Sorry, I’m a little new to this.
I’ve got 32GB MI50 - 64 GB DDR3, no good?
OneSlash137@reddit
It isn’t the context length that’s the problem. I can have a 256k context length with some vram to spare. The problem is that some of the experts are offloaded to CPU, that makes tokens per second god awful.
Your first agentic prompt is going to be huge depending on your stack. After that things start to pick up steam, because the first prompt is bloated but subsequent ones are smaller and don't include the entire prompt; it's more of a delta. Inevitably the deltas stop and the full prompt needs to be sent again. By that time the entire prompt is usually well over 40k, which takes minutes and minutes to process; harnesses and tools start to time out waiting for API responses, the tools start asking for responses again, and that's that. At that point it all falls apart.
So it isn't a matter of context length, it's token processing speed.
I used to think I wouldn't care how long it took for an LLM to respond as long as it was a quality response. Now I see just how long these tasks are going to take, and they will rarely if ever be one-shot unless you have insanely careful planning.
x8code@reddit
You are correct. A bunch of poor people without GPUs are coping by down voting you.
OkBase5453@reddit
He wants to use the Qwen3.6-35B-A3B model, which should work.
Try ik_llama.cpp; it might work with your card. From my experience (RTX 3090), it is faster than llama.cpp.
I am using Qwen Code, and it is great in terms of tool calling and speed.
Good luck
OleCuvee@reddit
Spot on! Same as you, I thought sod it, worst case the task will run overnight. It timed out, never finished, or broke apart after a step or two. I tried to split the tasks into small parts, but as soon as I wanted it to process the complex detailed brief, nah!
dreamai87@reddit
My choices:
1. mistral vibe - moderate instruction prompt size (8k). Simple and good features.
2. pi - smallest instruction prompt - only a code mode, but it's good 👍
3. qwen cli - 14k instruction prompt - good and rich features.
4. then whatever
see_spot_ruminate@reddit
Mistral vibe is just so easy to work with.
It’s nice to add to as well. I’ve used it to make its own MCP web search, agents, skills, etc. It’s not flashy, but tool calls work well.
rorowhat@reddit
Can you select different models?
see_spot_ruminate@reddit
It comes with some built-in models, but you have to pay (I think) for the stuff from Mistral; you don't have to use any of that, though. Just delete it from the config, then add your llama.cpp endpoint with your models. It's very easy in the config file.
Curious-Function7490@reddit
I'm currently setting up OpenCode. One of the nice things about it is that you can specify multiple agents for different purposes. You need at least two: one for planning and one for building.
I use my Claude Pro subscription for planning.
Over the weekend I configured Qwen2.5-Coder-32B on my gaming rig with an RTX 4090, using llama.cpp on WSL. It's running as my build agent. I'm getting 30 tokens a second.
It isn't a flawless experience yet but I'm getting some results. Still experimenting.
OGScottingham@reddit
Cline works pretty well with it.