I'm done with using local LLMs for coding
Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 426 comments
I think I gave it a fair shot over the past few weeks, forcing myself to use local models for non-work tech tasks. I use Claude Code at my job so that's what I'm comparing to.
I used Qwen 27B and Gemma 4 31B, which are considered the best local models below the multi-hundred-billion-parameter tier. I also tried multiple agentic apps. My verdict is that the loss of productivity is not worth the advantages.
I'll give a brief overview of my main issues.
Shitty decision-making and tool-calls
This is a big one. Claude seems to read my mind in most cases, but Qwen 27B makes me give it the Carlo Ancelotti eyebrow more often than not. The LLM just isn't proceeding how I would proceed.
I was mainly using local LLMs for OS/Docker tasks. Is this considered much harder than coding or something?
To give an example, tasks like "Here's a Github repo, I want you to Dockerize it." I'd expect any dummy to follow the README's instructions and execute them.
Issues like a 'docker build' that takes longer than the default timeout, which sends the model off on unrelated follow-ups (as if the task had failed) instead of checking whether the build is still running. I had Qwen try to repeat the installation commands on the host (also Ubuntu) to see what happens. It started assuming "it must have failed because of torchcodec" just like that, pulling this entirely out of its ass, instead of checking the output.
I tried to meet the models half-way by putting this in AGENTS.md: "If you run a Docker build command, or any other command that you think will have a lot of debug output, then do the following: 1. run it in a subagent, so we don't pollute the main context, 2. pipe the output to a temporary file, so we can refer to it later using tail and grep." And yet twice in a row I came back to a broken session with 250k input tokens because the LLM had read all the output of 'docker build' or 'docker compose up'.
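To be concrete, the behavior I was hoping the model would figure out on its own is roughly this kind of wrapper (a hypothetical helper sketch, not something any of these harnesses actually ship): run the noisy command, dump the full output to a temp log, and only surface the tail.

```python
# Hypothetical helper an agent could be told to call instead of running long,
# noisy builds directly. Everything here (name, tail length) is illustrative.
import subprocess, sys, tempfile

def run_logged(cmd, tail_lines=40):
    """Run a noisy command, stream its output to a temp log, print only the tail."""
    log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)
    proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    log.close()
    with open(log.name) as f:
        lines = f.readlines()
    print(f"[exit={proc.returncode}] full log at {log.name}")
    print("".join(lines[-tail_lines:]))
    return proc.returncode

if __name__ == "__main__":
    # e.g. python run_logged.py docker build -t myimage .
    sys.exit(run_logged(sys.argv[1:]))
```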
I know there are huge AGENTS.md files that treat the LLM like a programmable robot, giving it long, elaborate protocols because the authors don't expect it to have decent self-guidance. I didn't try those, tbh, and none of the ones I've seen go into details like not reading the output of 'docker build' anyway. I stuck to the default prompts of the agentic apps I used, plus a few guidelines in my AGENTS.md.
Performance
Not only are the LLMs slow, but no matter which app I'm using, the prompt cache frequently seems to break. Translation: long pauses where nothing seems to happen.
For Claude Code specifically, this is made worse by the fact that it doesn't print the LLM's output to the user. It's one of the reasons I often preferred Qwen Code. It's very frustrating when not only is the outcome looking bad, but I'm not getting rapid feedback.
I'm not learning anything
Other than changing the URL of the Chat Completions server, there's no difference between using a local LLM and a cloud one, just more grief.
There's definitely experience to be gained in learning how to prompt an LLM. But I think coding tasks are just too hard for the small ones; it's like playing a game on Hardcore. I'm looking for a sweet spot on the learning curve, and this just isn't it.
What now
For my coding and OS stuff, I'm gonna put some money on OpenRouter and exclusively use big boys like Kimi. If one model pisses me off, move on to the next one. If I find a favorite, I'll sign up to its yearly plan to save money.
I'll still use small local models for automation, basic research, and language tasks. I've had fun writing basic automation skills/bots that run stuff on my PC, and these will always be useful.
I also love using local LLMs for writing or text games. Speed isn't an issue there, the prompt cache's always being hit. Technically you could also use a cloud model for this too, but you'd be paying out the ass because after a while each new turn is sending like 100k tokens.
Thanks for reading my blog.
onethousandmonkey@reddit
Purely from the performance point of view, there are a number of settings to tweak to make Claude Code jive with local models. For example: https://unsloth.ai/docs/basics/claude-code#fixing-90-slower-inference-in-claude-code
Before I did that, I was banging my head against the wall at the slowness and useless cache.
AdOk3759@reddit
I would also suggest looking into little coder, which is a harness specifically designed to boost smaller models' performance
Torodaddy@reddit
OpenCode is actually pretty sick too. I use Claude Code daily for work and I found OpenCode leaps and bounds more productive and faster
RobotRobotWhatDoUSee@reddit
link? I googled little coder (and variations) but largely found many webpages targeted at teaching children to code. Worthy goal, just not what I am looking for!
Clear-Ad-9312@reddit
https://github.com/itayinbarr/little-coder
ChocomelP@reddit
Why would you use Claude Code without Claude models? The models are what make it good. The harness itself is suboptimal. If you could easily OAuth to use other harnesses like Pi with your Claude subscription, I would never use Claude Code.
howardhus@reddit
not true. its an open secret that claude has done a great job optimizing their harness to boost performance… after the leak there were several anslysis confirming this
ChocomelP@reddit
Oh wow, I didn't know there were "several anslysis"...
howardhus@reddit
well… glad you came out from under that rock
Gesha24@reddit
Personal experience - because I like it the best. I am getting better results with claude code than aider or continue.dev when working with a local model
howardhus@reddit
you da real mvp
Mobile_Bonus4983@reddit
The link says:
"Claude Code recently prepends and adds a Claude Code Attribution header, which invalidates the KV Cache, making inference 90% slower with local models. See this LocalLlama discussion."
Torodaddy@reddit
No GPU? You are using really small models; you could go up a size and use something quantized
reginakinhi@reddit
Who told you those small models were the best that local models had to offer? They're great for their size, but even the biggest local models most likely don't compare to the size of Opus. The comparison is hardly fair
pablo_chicone_lovesu@reddit
You're missing so many things.
You realize that Claude is faster because they have huge memory and GPU banks right?
I use cline with a tuned qwen3.5 to check all of my code, and it does a pretty good job. But I'm also more obsessed with context windows than model size.
You need to tune your rules for the model, make use of skills and also mcp, you don't just replace a model and be done. These ai companies have spent years tuning their setups to be what they are.
The model and context windows are a big part of the stack but not all of it. If you're not willing to put in the time, you're not going to get good results.
MLExpert000@reddit
I won’t really say that out loud here because people get really offended. But I hear your point.
andy_potato@reddit
It is necessary to say it out loud.
Qwen 3.6 27b is a great model for many applications. But I’m sick of these posts of people claiming it performs on par with Claude for coding. It is simply not true.
Finanzamt_Endgegner@reddit
It is if you know what you are doing. It isn't if you don't. For pure vibe coding without thinking on your part it might not be there yet, but with the correct harness settings, instructions, and guidance it can compete with at the very least Sonnet 4.5
andy_potato@reddit
It’s not even close. Everyone claiming otherwise is just coping hard.
Recoil42@reddit
Say it out loud, otherwise this place devolves into a reality bubble and loses value to everyone. Sometimes, people need their medicine.
Pleasant-Shallot-707@reddit
So, you refused to craft the guardrails to accommodate the needs of the local models, expected one shot level behavior and were upset that they can’t work that way.
markole@reddit
It is irrational to compare a 27B model running on a single GPU and a multi-trillion-parameter model running on clusters of GPUs that cost more than your retirement fund.
droptableadventures@reddit
And also to conclude that that's as good as "local LLMs" get.
It's good for its size, but it's one of the smallest notable models out there.
markole@reddit
I wouldn't be too surprised if we get a 70-120B model as strong as Opus 4.6 in 12 months or so. Remember what we had last year in April.
Virtamancer@reddit
OP is almost certainly running quantized versions, plus he’s using them through Claude Code instead of something like OpenCode or Pi which don’t pollute the context. 🤡
That and expecting teeny tiny models to perform similarly to Opus is kind of wild.
If he used the unquantized versions in OpenCode he might have a much better experience but it still wouldn’t match opus. Literally in a year though it will—coding agents and harnesses and optimizing the processes is the entire game right now.
Widget2049@reddit
Your AGENTS.md is still too weak; you need to be more thorough for a 27B model. Make it focus on what the LLM really needs to do, and avoid phrasing like "IF" and "DON'T". You need to create a solid plan in plan mode before executing anything in build mode. Local LLMs for coding are still good if you know what you're doing, so keep learning
StorageHungry8380@reddit
Got some concrete examples of how AGENTS.md should look for such models?
2Norn@reddit
https://github.com/forrestchang/andrej-karpathy-skills/blob/main/CLAUDE.md
this is the only one i ever use, simple and to the point
Widget2049@reddit
cleversmoke@reddit
Beautiful, thank you for the links!
StorageHungry8380@reddit
Awesome, much appreciated.
ChosenOfTheMoon_GR@reddit
A piece of advice I'd like to give to anyone using instructions with any model: test the model really quickly on the instructions you want it to follow first. Specifically, ask it what it understands about the instructions and what is unclear to it. It will save you a ton of time trying to figure out what it can and won't do given your instructions.
Iory1998@reddit
I am not a coder, but I totally understand and share your feelings. I find myself going back to Gemini-3.1 or Deepseek v4 for better replies, or I start a conversation with them, copy it to LM Studio, and continue with a local model like Qwen-3.6-27B-Q8 or Gemma-4-31B-Q8. This seems to give them a bit of an edge.
But I use them mostly as an inner voice that helps me collect and organize my thoughts. When I need a serious sanity check, I go back to the top Gemini or Deepseek (I like to vary the sources). Perhaps if I could run a larger model locally, it would be much better...
And you are right about wasting time. You can get a good outcome with small local LLMs, but you spend more time and energy. If you are tight on time or you need to make a lot of decisions, just go with the best model you have access to. People have limited decision-making capacity per day, and it doesn't matter whether you decide on trivial or serious matters, you spend the same energy deciding.
Migraine_7@reddit
Are you using a subagent to at the very least create a work plan before each task?
Even Sonnet and sometimes Opus fail miserably if the task is not well defined.
dtdisapointingresult@reddit (OP)
Does it need to be a subagent? This was my full prompt:
Intelligent_Ice_113@reddit
this prompt explains a lot
stilet69@reddit
No, no. The phrase "Make no mistake, or I'll kill you" is more appropriate to this case.
2Norn@reddit
i have better success with make no mistake or i'll kill myself
simracerman@reddit
I can claim the badge of student among you all, but that is not how I’d feed a small 27B model any prompt. The extra unnecessary context will certainly confuse it.
Do yourself a favor: run your prompt through it and ask if it can cut it down to a problem statement and goals. Divide the task into subagents (trust me on this one). Use OpenCode and ditch CC for local models; it produced worse output in my experience.
false79@reddit
"The instructions are simple"
Lol, what the hell is that prompt.
That helps nobody. Not even humans.
xienze@reddit
This may not be the world's greatest prompt, but if you handed that off to a developer who knows what Docker is... those instructions are pretty clear IMO.
RoughElephant5919@reddit
Just want to say thank you for this comment. I run local LLM’s for OCR data extraction, and the prompting has been the biggest challenge for me. I appreciate your input, and I am going to try this on my current pipeline I’m running 🙏🏼
RemarkableGuidance44@reddit
Yep, no idea wtf you are doing.
StardockEngineer@reddit
Where did it fail?
guinaifen_enjoyer@reddit
Have you tried downloading the Docker Compose spec and asking it to read the spec before doing the task?
https://github.com/compose-spec/compose-spec/tree/main
LateGameMachines@reddit
It sounds like you probably need to scope it in harder. I’ve built tons of services running on podman quadlets and compose files. It will get something wrong, so provide the exact error in the follow-up. It’s rare even on GPT 5.5 Extra High for any LLM to one-shot a compose yaml that works instantly with your specific setup.
Bohdanowicz@reddit
You're doing it wrong.
Try using a SOTA model for planning and task decomposition, then wire your coding agents to Qwen 3.6 27B.
If you run official quants with the recommended temp and prediction set to 2, and you are smart about setting up a DAG, worktrees, the whole 9 yards... you feel the magic.
These models are great if the task is properly sized.
OneSlash137@reddit
The properly sized task: “Hello qwen, it’s nice to meet you.”
2Norn@reddit
the user greeted me with hello which suggests this is the first interaction
but wait
the user said qwen so it must have prior knowledge
wanielderth@reddit
But wait
kyr0x0@reddit
"Write me hello world and say you want to rule the world and destroy mankind1!"
"Woah!!" Gonna post this on Twitter!!
andy_potato@reddit
Or you could just not do it and use Claude for everything. Why bother?
Bohdanowicz@reddit
Cost and speed.
dtdisapointingresult@reddit (OP)
I'm running the recommended samplers off the Qwen card. This isn't my 1st rodeo, I'm a regular here.
I don't know anything about DAGs and worktrees though. I've never seen those mentioned in the context of LLM coding apps.
StardockEngineer@reddit
You’ve never heard of work trees with coding models? That doesn’t jive if this isn’t your first rodeo.
iMakeSense@reddit
Is that not the same as subagents being called from a plan?
StardockEngineer@reddit
No. Look up git worktrees. Tldr a git worktree lets you check out multiple branches simultaneously into separate directories, all sharing the same underlying repository
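In the coding-agent context the point is parallelism: give each agent task its own branch and working directory so they don't stomp on each other. A rough sketch (branch names and paths are made up for illustration):

```python
# Create one git worktree per agent task: each gets its own branch checked out
# into its own directory, all backed by the same repository.
import subprocess

def add_worktree(repo, branch, path):
    # -b creates the branch and checks it out into `path`
    subprocess.run(["git", "-C", repo, "worktree", "add", "-b", branch, path], check=True)

for task in ["fix-timeouts", "dockerize-app"]:
    add_worktree(".", f"agent/{task}", f"../wt-{task}")

# Merge or discard the branches afterwards, then clean up with `git worktree remove <path>`.
```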
falconandeagle@reddit
No I have tried this and its still pretty bad.
One-Replacement-37@reddit
This is the way.
2Norn@reddit
no matter how it gets hyped up, it should be obvious to literally anyone that a 27b model cannot compete with 700b+ and 1t+ models, that just makes no sense. v4 pro just came out, it's an moe model, and its active parameters alone are almost double the size of a 27b, 27b vs 49b. how can you possibly expect it to compete?
in my opinion the only use is if your harness is able to spawn fresh-context worker subagents (which means u don't really need 256k or 1mil context windows either) and guide them, and after the work is done u verify with a different model. that's pretty much it. they are simply there to cut your subscription/api cost. anybody who fully downgrades from opus4.7, gpt5.5, k2.6, glm 5.1 is just not gonna have a good time.
RegularRecipe6175@reddit
Did you use an 8-bit or better quant? Curious, but it's not going to change the outcome if your work gives you all-you-can-eat Claude. As someone who is forced to use local models from time to time, I can say using at least an 8-bit quant, if not full fat, makes all the difference for small models.
Particular-Award118@reddit
Who has the vram
mister2d@reddit
The small ones are also very sensitive to quantized KV. I started running with the KV cache at full precision and noticed a significant increase in quality.
It's slower, but useable.
bonobomaster@reddit
I agree.
It's just a feeling at this point, because I don't have numbers to back it up, but even a Q8_0 KV cache makes at least all the Qwens I tried noticeably dumber, especially in regards to coding and successful tool calls.
dtdisapointingresult@reddit (OP)
The official 27B FP8 from Qwen, yeah. Ran slow but having MTP helped. (unlike Gemma)
StardockEngineer@reddit
You can run Gemma e4n as a speculative decoding model for a big performance boost.
andy_potato@reddit
That doesn’t make Gemma any better at coding. Just faster at producing nonsense.
t4a8945@reddit
3.5 or 3.6?
They are NOT the same haha. They cooked, really.
dtdisapointingresult@reddit (OP)
3.6, who do you take me for? I know game!
RemarkableGuidance44@reddit
You don’t know what you’re talking about here.
You clearly don’t understand how to set up models properly across different hardware, how quantization behaves differently depending on the setup, or how important pre-prompting is for getting better results.
You should spend some time learning how these systems actually work. Reading through the Claude Code files might help you understand how they drive Claude in the right direction. Even though that has turned into a pile of sh!t.
YOU KNOW THE GAME.... Looks like you don't...
andy_potato@reddit
OP clearly knows the game. But OP obviously also has a life.
Material_Soft1380@reddit
have you tried BF16?
t4a8945@reddit
Whoops, sorry!
I tried it in my setup (2x Spark) and it did some amazing stuff (a massive refactor); the only issues I had were that it would stop for no good reason and sometimes output XML. I blame its Jinja template, and I've got no time for that.
Anyway, I liked your post, it's a good reality check from a real experience. Thanks
Oleszykyt@reddit
You should've tried MiniMax M2.7, it is very good
taoyx@reddit
I coded my first chatbot with python and streamlit using local LLMs. What I've learned back then is that they were really bad at modifying existing code, but if you let them start from scratch then they just do fine.
Then I've learned about context size.
Electronic-Space-736@reddit
"Here's a Github repo, I want you to Dockerize it." is terrible lazy and most likely to fail.
You are missing orchestration layers.
dtdisapointingresult@reddit (OP)
Do I really need to run brainstorm skill, decide on architecture, answer questions about TDD compliance, to have the LLM dockerize an already-working app that gives all its doc in the README?
Electronic-Space-736@reddit
no, you need an AI layer that does that: it creates smaller tasks from the large one and hands them off to workers, the same as what happens with the cloud ones. it's just that you need to set that up yourself.
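Stripped to the bone, that layer is just a planner pass plus a fresh-context worker call per subtask. A minimal sketch, assuming an OpenAI-compatible local server; the model name, URL, and prompts are placeholders, not what my project actually does:

```python
# Planner/worker sketch: one call breaks the big ask into steps, then each step
# runs in a worker call that starts from a clean context.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def ask(system, user):
    resp = client.chat.completions.create(
        model="qwen-27b",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}])
    return resp.choices[0].message.content

plan = ask("Break the request into short, independent steps, one per line.",
           "Dockerize the app in this repo using the README instructions.")

results = []
for step in [s for s in plan.splitlines() if s.strip()]:
    # Each worker only sees its own step plus a few recent results, not the whole history.
    results.append(ask("Complete exactly this one step. Be concise.",
                       f"Step: {step}\nRecent notes: {results[-3:]}"))
```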
mister2d@reddit
You get it.
Electronic-Space-736@reddit
I do, and good news, it is open source https://github.com/doctarock/local-ai-home-assistant
kyr0x0@reddit
Your code needs a serious refactoring to TypeScript and ESM. It's obvious that the tool-calling harness is fragile, as it assumes issues only certain LLMs will face while others will stumble over it. It has thousands of lines of code to solve tasks that are more trivial to solve, but LLM-generated code tends to overcomplicate things. Also the README reads very AI-sloppy and overstates the functionality. But I haven't given it a reality check yet - that's just a gut feeling. It's cool, though, that you open-sourced it. I liked some of the ideas; it's just that it's one of those huge codebases that become hard to maintain. I'd suggest refactoring it, trying to get less code doing more, and adding a serious amount of tests (integration and e2e as well) plus ARCH.md docs for every module, so that the LLM won't hallucinate on it when you continue using it to write code.
Electronic-Space-736@reddit
Thanks for taking a look, I will pass on the refactor but you are welcome to.
This is the core system, most of the functionality is spread into plugins, which I am publishing regularly.
The tool-calling harness is a catch-all addressing common problems; this is deliberate. There are hooks throughout for customizing its functionality, the core function is the fallback if you have not extended it further, and I have plugins that advance this baseline.
I do not consider this a particularly huge codebase. If you investigate the main core system, it is a platform or foundation to plug things into, with the basics included, built with security in mind, a good deal of flexibility, and a visual GUI. Sure, it is larger than your average vibe-coded app, but it is more serious than your average vibe-coded app too.
There is some refactoring needed, development is AI accelerated, but I refactor regularly throughout the development cycle and have 30 years of software experience - I think you should take a closer look when you get a chance.
traveddit@reddit
Looking at your repo and how you constructed your harness I don't think you're in any position to be giving out tips. You literally have subagent orchestration structure backwards. You're using Gemma 4B to decide the scope of your query and you have the 26B as a worker. This is a fundamental misunderstanding of how to allocate intelligence for subagents. You can't let a dumber model orchestrate the task because it will never know when to reliably handoff to "harder" tasks.
mister2d@reddit
Nice project. Are you the author?
It would be better to use systemd-nspawn rather than docker for isolation. You get almost zero overhead (daemonless) with the desired level of process isolation.
Electronic-Space-736@reddit
nice, I am using docker as it is well known and easy to include install scripts for, I also have a few things like the RAG that use containers as part of the whole
dtdisapointingresult@reddit (OP)
Isn't that what the LLM's reasoning is for? I shared my whole prompt here btw:
https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiowcxe/
Then between Qwen Code's system prompt + Qwen 27B's reasoning, I don't think it's unrealistic of me to expect it to complete this basic task.
It's not like it failed to compile the dependency for my hardware because of some complex compatibility issue. We didn't even get that far!
Electronic-Space-736@reddit
how can I make it clearer: there is another layer, which you are unaware of, that the cloud services provide and that makes LLMs smarter and more effective.
Running Qwen in llama.cpp (or whatever) does not supply this layer; you need to make your own or use someone else's.
kyr0x0@reddit
Qwen Code is such a layer, or at least is sold as such. Cloud services don't run harness code on their servers for LLM inference. They do so for non-coding harnesses (ChatGPT, or coding harnesses with server-side agents), but a decent RooCode, OpenCode, or even VS Code Insiders should already bridge the gap the same way they do with large models, not just SLMs. Yet they don't, because with a small model you can only shoot at a moving target: you write instructions to fix one issue, then it stumbles on the next, and the next, and you continue... Finally you switch models next week, face totally different issues, and your work is pointless - you need to rewrite everything for the next model, which requires other fixes.
Electronic-Space-736@reddit
yes, for small context, but then we hand it pages, so it needs to be broken back into smaller pieces that qwen was built to handle, this is the layer that is missing
StardockEngineer@reddit
Already working? I thought I saw your prompt asked it to figure out if it needed to compile things from scratch?
PeerlessYeeter@reddit
op's experience somewhat matches mine, I keep assuming I'm doing something wrong but I think this subreddit gave me some unrealistic expectations
Hodler-mane@reddit
1000%
I've been following the guides exactly for a decent-performing qwen3.6 27b on a 3090, and everything I try fails at basic stuff like thinking and tool calling.
then I realized all these examples are examples for chatbots with no thinking or tool calling... they just fail to mention that.
Finanzamt_Endgegner@reddit
That's a config issue. It should not fail any tool calls; I've had it do like 2000 or so at this point with just a single failure.
StardockEngineer@reddit
Nah, they work. I use 27b with Pi Coding agent to do hard things all day long. The latest thing I did was ask it to iterate on some never before seen data for a data science hackathon. After about 20 commits it made an html dashboard to show me the results.
bjodah@reddit
I love local models, but this has got to depend on the task complexity at hand? There are plenty of tasks (scientific computing, etc.) for which I don't even bother asking Sonnet (let alone my local Qwen 3.6 instance) to solve, but go straight to its bigger brother or OpenAI's/Google's SotA offering (unless the data is sensitive).
StardockEngineer@reddit
I’m not saying they can do it all, but they can do far far more than what many in this thread think. I can do 90% of my work now in 27b, at least. And I’ve had 27b fix three problems both Codex and Opus got stuck on.
roosterfareye@reddit
I think the problem frequently lies between the chair and keyboard. Poor prompting, poor planning, impatience.... I was there too once!
bjodah@reddit
That's probably a common case. I just want to add that sometimes you really need the extra world knowledge of the larger model. For example, every now and then I want assistance in a niche programming language (elisp) and the smaller models (understandably) hallucinate functions that do not exist. For elisp in particular I've found Gemini 3.1 Pro to be the undisputed king. I really want to use my local models there as well, but I get nowhere near the success rate I can achieve for, say, Python and bash.
Finanzamt_Endgegner@reddit
Then just let Gemini create a list of things the local model has to adhere to and it should be fine? You don't have to use Gemini for the actual implementation and stuff
dearmannerism@reddit
This type of reaction is why I haven't lost hope yet. There must be a smarter way to break down a big task into bits that are easily digested by smaller models like Qwen 27B. Once we find those primitives, everything can be just a simple processing loop, like a Ralph loop.
Alwaysragestillplay@reddit
It depends what you're doing. Frankly for any "real" workload that a dev is likely to face, the <100B models are going to crap out sooner or later. I would suggest that decent model routing and orchestration is the way to fix that if your goal is to save tokens. Have some mechanism to judge prompt complexity, then choose whether to invoke Claude or Qwen dynamically.
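A minimal sketch of what I mean by routing; the endpoints, model names, and the "complexity" heuristic are all placeholders you'd tune (or replace with a small classifier call):

```python
# Route easy prompts to the local model and hard ones to a frontier model,
# both behind OpenAI-compatible endpoints.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
cloud = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def looks_hard(prompt):
    # Naive complexity proxy: long prompts or refactor/architecture-ish asks.
    return len(prompt) > 4000 or any(
        w in prompt.lower() for w in ("refactor", "architecture", "migrate"))

def complete(prompt):
    client, model = (cloud, "big-frontier-model") if looks_hard(prompt) else (local, "qwen-27b")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```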
roosterfareye@reddit
Yes, agree. I just remoted into my PC after asking qwen 3.6 35 a3b (6k quant) to generate a full test suite and --> run --> evaluate --> repeat until fixed and damn me, it did it, fully and agentically in LM Studio no less!
Own_Mix_3755@reddit
I use Qwen 3.6 27b for coding sessions just fine. The problem is often multilayered. It starts with a wrongly configured server (I understand there are literally hundreds of combinations, but some are much better and some are much worse), continues through a good harness (I ended up with RooCode, as e.g. Claude Code seems to add so much overhead to each task that it's just not worth it; I also had to define my own modes manually and engineer custom prompts and skills), and ends with model size and type (people often choose smaller quants like K_3_S to fit everything into VRAM with 256k context, while with a good agentic workflow you rarely go over 64k context). You also have to understand you are working with a much smaller model and effectively dumbing it down quite a lot with a small quant. You have to find ways to help it a bit (giving it a proper, readable "manual" will certainly help).
mateszhun@reddit
Same, local models seem to work really well with Roo Code.
But I do have a problem on longer context windows with 27B: it suddenly starts to fail at file edits. (Maybe it is a setup problem?) But 35B doesn't have that problem.
I've settled on 27B for Ask, Orchestration, Architecture modes, and 35B for coding. And 35B is also faster as a moe model, so it works out nicely for the longer outputs. I'm using Q8 quants for both models.
Eyelbee@reddit
I switched to Cline after they shut it down, and it seems to be about the same. I have small complaints, like not being able to see things like the system prompt, and I'm too lazy to look at the source code. It's close to perfect in my opinion; I'm thinking about forking it if I can find the time.
DrBattletoad@reddit
Good to see someone else with the same problem as me. I thought I was going crazy to see 35B solve problems that 27B wasn't able to.
Sn0opY_GER@reddit
I use Roo Code with LM Studio on a 5090 with qwen 3.6 27b (or 35b) and I'm surprised how well it works, tool calls etc., no problems. I managed to code timer software with nice animations for our mini RC car track that talks to the IR trackers for the timing, and now we have a start light, leaderboard, rain warner etc - for free. I played a little with openclaw for 2 weekends and spent 700$ on Claude :p I think the best way is a hybrid approach where the local model does the simple stuff and the cloud corrects and refines. That's how my claw works now and it works very well. If local is stuck or I'm not happy, it can talk to a cloud bot in Discord and get help fixing it, or the cloud bot can take over.
330d@reddit
I'm sorry but these are all toy projects. An average SaaS that's not a crud will have 50-100k lines of backend and 20-30k lines of frontend with complicated deploy pipelines
MexInAbu@reddit
Well, no one is (or should be) vibe coding a production SaaS with a local, quantized small LLM. Hell, you should have very strong guardrails if you are doing that with the frontier models too.
Sn0opY_GER@reddit
True - and now that I think about it, you can literally FEEL with every line of code that it takes longer and creates more bugs. At first it's prompt > "ooh that looks really nice - let me add XXX", and after a few of these "loops" the bugs/breakage start and more and more time goes into fixing stuff. At the end I had to use Claude to fix an error with the minimap timings that the local model just couldn't get right (local only ever displayed cars in the first 25% of the map, never a complete lap - Claude fixed it and called it "bad math" 😃)
den0rk@reddit
Could you recommend some necessary adjustments in LM Studio?
Own_Mix_3755@reddit
That's hard to say. Depending on your hardware, model, usage, … there is a lot. Google is your friend and you have to do a lot of testing.
TheTerrasque@reddit
That's a you problem.
Local models aren't as good as claude, but they're fully capable. I've been experimenting with Qwen3.5 35b a3b at Q4 and opencode last week, and one task it did was making an MCP for a web site's search and detail listing (a local ebay'ish salesplace).
It started with me telling it to find out how the search worked. I couldn't see a json call for it, and the source html didn't have the results so it wasn't straight forward. It went at it, reading source code, finding javascript, deobfuscating it and tracing the calls and fetching the various js files. Like really going at it.
I started it before an 1hr work meeting, and it was still going on after I was done. I just let it putz since I wanted to see how it went, and about 20 minutes later it had figured it out and written a python module to get the listings. I then told it to do the same for details, and it figured that out within minutes.
Then I had it build:
I even had it test the result by building the Docker image and reading the build log, launching it in Docker and checking the Docker logs, then having it do HTTP requests to the server to see if it answered correctly. I didn't even have to instruct it hard to do it either, just something like "verify via docker that it works" and it handled the rest itself.
At one point I had a "host name invalid" type of error, don't remember exactly now, that happened when it was called inside the k8s cluster. I gave it the error message; it spun up the latest image and tried an HTTP call with a custom host header, noted the bug, traced through the MCP library until it found where a default class was created with the hostname-protection option on, and altered the MCP server code to create an object with that option turned off and pass it along when instantiating the server. It then built a new image, verified that the call with the custom host now worked, and deployed a new version.
It was a bit back and forth, with a few more mcp errors that took a bit of time to smooth out, but I only looked at the code twice during the whole thing. Once to figure out a problem it was stuck on and once to skim through it at the end to check if there was anything really stupid going on. It wasn't.
And that's with the MoE, which is less capable than the 27b. I don't know what you're doing wrong, but you're doing something wrong there, mate.
andy_potato@reddit
It's not a "you problem". OP has pointed out in detail why a model like Qwen 3.6 is a nice toy but ultimately much less capable than Opus or Sonnet.
Everything else is just “I want it to be true because local models”
rog1121@reddit
The only "real-world" success I've had with local LLMs is sorting and sentiment analysis. Essentially just a script that calls a Gemini model and asks for each item to be sorted into one of 6 categories, which tends to work fairly well given the headers and raw data.
Full-fledged agentic workflows are definitely not doable unless you run at least a 120B model
iMakeSense@reddit
I'm not sure you even need it for that. If you have enough data for your 6 email categories, couldn't you just create embeddings for those 6, cluster them, create an embedding for the new email, and if a certain confidence threshold isn't reached then use the LLM?
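Roughly something like this untested sketch, assuming you have a handful of labeled emails per category; the embedding model and threshold are arbitrary choices:

```python
# Average the embeddings of a few labeled examples per category, compare new
# emails by cosine similarity, and only fall back to the LLM below a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
examples = {  # a few labeled emails per category (placeholders)
    "billing": ["Your invoice is overdue", "Payment failed for order 1234"],
    "support": ["The app crashes when I export", "Login page shows a 500 error"],
}
centroids = {c: model.encode(texts).mean(axis=0) for c, texts in examples.items()}

def classify(email, threshold=0.45):
    v = model.encode(email)
    scores = {c: float(np.dot(v, m) / (np.linalg.norm(v) * np.linalg.norm(m)))
              for c, m in centroids.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None  # None -> ask the LLM instead
```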
yeah-ok@reddit
The key here is the phrasing: "just" might be a bit of a stretch for most people. Can you point to the practical steps needed to do this (i.e. not theory or an overview but actual terminal commands)?
groeli02@reddit
original qwen? have you tried qwopus or some other derivates?
Your_Friendly_Nerd@reddit
I'm so glad it's not just me. I've barely used any fancy agent harnesses like opencode with local models, because the few times that I did try, it was an awful experience (doesn't help that I don't have much VRAM so the models run slow as hell). That's why I've just stuck to using the chat interface in my editor, which is a step up from open-webui, since it's easier to share editor content with it, but that's about it.
Hans-Wermhatt@reddit
The people here overhype Qwen 3.6 for sure, but I don't know what to tell the people who were expecting to just flip over from Opus 4.7 4 Trillion to Qwen 27B and expect the same performance. You'd have to run GLM 5.1 for something a little closer. Qwen 3.6 27B is more like GPT 5.3 mini.
GCoderDCoder@reddit
Overhype? I'm going to sound defensive, but I genuinely think people hype Claude from lack of exposure to other models and other harnesses. The content creators who actually try different things tend to recognize Opus has great ability, but they often use other models for their own work. And nobody is saying a 30B parameter model can do everything Claude can do. People are saying most of what they need a model to do can be done with self-hosted models.
For local 3.6 what hardware are you using? What quant are you using? What harness are you using? How are you using your harness? Claude has those tuned for a certain user profile. You have to do those for local too before comparing.
People using q4 of a 30b model to code are not actually using the model that the benchmarks are made on. Models can keep agentic logic sound longer than they can maintain the same level of coding performance. So a 30b parameter model can search the internet, manage emails, etc down to q4 but I would not write code with that version.
Claude the model is different from Claude the harness. I had Opus in Cursor for work just fine, so I tried Claude for my personal stuff, and Anthropic's harness makes me hate their models because I don't just let LLMs do their own thing. I use them to fill in the boilerplate for my logic. The way I use models I can swap Claude, ChatGPT, large local models (I have the hardware) and now small local models like Qwen 3.6 too. My friend who doesn't code loves Claude Code because he doesn't care about the how. He's also not using what he builds for production.
Most people don't actually need claude and the data is showing there's a lot of people enjoying AI activity not getting real value. If value is just making a lot of docs then people are really hyped making docs no one looks at lol.
Finanzamt_Endgegner@reddit
Well idk about you, but I let qwen3.6 27b loose on llama.cpp to implement a feature it had to change like 10 files for, and it just did it. It was for testing out some new method, so don't worry, I'm not gonna spam the devs with it, but it works. I highly doubt gpt 5.3 mini would be anywhere near this level.
Nixellion@reddit
To chip in, GLM 5.1 truly is capable of replacing Opus 4.6. I am running the z.ai api version, I assume it runs unquantized, so local performance may degrade, but overall it works well across various complex large codebases.
Void-kun@reddit
What harness are you using GLM5.1 in?
In Claude Code it's significantly worse than Sonnet 4.6 nevermind Opus.
HappySl4ppyXx@reddit
How are the limits and are there a lot of rate limiting / technical issues you run into?
Kholtien@reddit
On the lite plan, I get 2-3 times what I do on Claude Pro
TheLexoPlexx@reddit
But that's the thing. Gemma 4 31B is remarkably close to the GLM models or Kimi on LMArena and across all benchmarks, and on top of that, Composer is based on Kimi and that sucks too.
Monkey_1505@reddit
No, it's not remarkably close in benchmarks.
TheLexoPlexx@reddit
Care to explain?
Monkey_1505@reddit
Well, I'm not sure numbers not being close really needs explaining, but, on the artificial intelligence benchmark aggregate (which is just an aggregate of benchmark scores), Gemma 31b is a 39. Kimi v2.6 is a 54. Opus is a 57. Kimi v2.6 is far far closer to the benchmarks of Opus than Gemma is to Kimi.
Kimi v2.6 and MiMo Pro are the absolute top models in open source rn, trillion parameter models within spitting distance of SOTA proprietary super labs.
Gemma4 31b isn't even best in class, just vaguely competitive with other small models.
IntrinsicSecurity@reddit
I’m going
XccesSv2@reddit
You're reading benchmarks wrong. "Close" at the top end still hides huge differences when it comes to the last few percent.
FaceDeer@reddit
I would think that local models like Qwen3.6 would be well suited to replacing remote LLMs for things like auto-complete, filling out a local function or writing docstrings. Not so much the large-scale system architecture stuff. I could see a framework that optimizes which tokens get sent where, using the big remote models to plan out what to do and then delegating implementation tasks to local models. Might be a best of both worlds arrangement.
Monkey_1505@reddit
Or MiMo flash or similar.
GlitteringDress912@reddit
falconandeagle@reddit
This subreddit is filled with vibe coders who think their yet-another-todo-application or basic-ass dashboard is something to brag about.
IamKyra@reddit
Hm, I'd say the opposite: if you're a good coder you know how to make Qwen 3.X do what you actually want to do. It's the vibecoders who will actually miss Claude for how much it can achieve.
Eyelbee@reddit
Yeah, the more you know what you need to do, the less you need a better model. This has been true for quite some time, honestly. But the thing is, qwen 3.6 27b is quite literally at sonnet 4.5 - GPT-5 level. 6 months ago these were the best models. Would OP say the same about sonnet 4.5 when it first came out?
Finanzamt_Endgegner@reddit
This, this, this. If you know what you're doing, it can even beat Opus 4.5 in some areas with the correct guidance.
sexy_silver_grandpa@reddit
I use local LLMs and I'm the project leader of an extremely popular open source library that you, and every enterprise company use every day.
Chupa-Skrull@reddit
Thanks for your hard work, sexy silver grandpa.
relmny@reddit
This subreddit is filled with people comparing what is most likely a >1T-parameter model to a 27B/31B model, and then complaining that the small one can't do the same.
What is clear to me is that some people don't understand the tools. And they don't know what they are for nor how to use them.
HiddenoO@reddit
The whole issue in OP's post stems from too many posts claiming the opposite, i.e., that your locally hosted small model is basically as good as frontier models.
It might not be the majority opinion, but it's common enough to mislead people into thinking they're doing something wrong, when in reality that false suggestion is typically either the result of vested interests (like the Huggingface CTO post yesterday), people not being competent enough to realise there is still a very significant gap for real work, or people simply not having complex enough use cases for that gap to show.
Just below, you have the following comment with currently 24 upvotes:
"Iterating on data" and "making an html dashboard" are generally not "hard things" for an AI, especially when the person prompting has the required data science knowledge - what's hard for an AI is dealing with large, messy, interwoven codebases that result in a large, messy context window with tons of tool calls.
relmny@reddit
It's like any opinion on the Internet, what you read is what THAT person thinks/claims.
Meaning, that if someone says "I don't need commercial models anymore, running qwen/gemma/kimi/glm/etc locally is enough!" that means exactly that. No matter how they phrase that. It's their opinion for their case.
I always use local models. So I'm not surprised, specially since the last 1-2 months with gemma-4, qwen3.5/3.6, kimi, glm etc, that more and more people are claiming that THEY can do THEIR work with local models.
And that example is by a single person that, like me, can work fine with local models.
It's about context. And understanding that what works for someone, might not work for someone else.
HiddenoO@reddit
You're acting as if that were all that's being said, but the part I referred to specifically was the "doing hard things all day long" part, and that's how these comment chains regularly go. People extrapolate their own (often very limited) experience and then effectively gaslight people like the OP into thinking they're doing something wrong when in reality they're just overstating their own experience as being generally applicable.
relmny@reddit
Again, that's your claim of what "hard things" are.
AFAIK there's no official definition for "hard things".
Maybe for the person that wrote that, those are "hard things". Maybe things that didn't work before with local models.
And the main point remains, that's the opinion of a single person.
I claim that I do everything with local models. If somebody understands that anyone can do everything with local models, that's their problem, not mine.
That's my experience. I can do "hard things" because they are... to me.
And then there is the comparison between a huge commercial models with all the infrastructure, workers, hardware, tools, etc with a 27b/31b model in a single GPU...
Anyway, I'm done with this.
HiddenoO@reddit
You're arguing about technicalities; I'm arguing about how these comments are perceived by people like OP who constantly read them.
GreenGreasyGreasels@reddit
It's the hype - "Qwen3.6-27B is as smart as a model 20x its size" - which is true but not the full story.
It's like claiming a child with 130 IQ can do the same things as an adult with 130 IQ - they might both have the same IQ numbers, but the tasks each is capable of is very different.
Syncaidius@reddit
People also forget when comparing Claude models against others, Claude is trained specifically for coding and development-related tasks. It's more specialised in this area, so it should be expected to be at least slightly better at coding than other models.
However, when it comes to doing more generalised and varying tasks, I find Claude makes way too many dumb decisions compared to models of lesser sizes and that's fine. They're specialised models, whereas the others are more generalised.
Other models are intended to be good at a bit of everything, but great at nothing. However, that will change over time as they optimise model sizes and efficiency.
The biggest issue with Claude right now is it's not able to run at its optimal level because Anthropic have been severely restricting it to counteract the shortage of available compute.
droptableadventures@reddit
Or if they're capable of setting up local AI to a degree that works well, they are more likely to have some level of programming knowledge.
So if they have to help the model get past the occasional issue it's stuck on, they don't see this as a major impediment.
alberto_467@reddit
Not necessarily for anyone who's gotten started in the last 2/3 years. There are people doing things who never really learned how to code, because they never truly needed to. They are totally lost when they try to code without a model or smart autocomplete.
They surely have more technical skills if they can set things up, and they can probably read some code, but they don't really have programming knowledge, because they never had the mental strength to disable all AI and actually learn, for many months or even years, to code by themselves.
More experienced guys have already put in the work to gain real programming knowledge; it's the newer ones, who never felt they needed to know the why and the details, that I'm worried about.
andy_potato@reddit
1000x this
SmartCustard9944@reddit
You forgot the tower defense guys
ProfessionalSpend589@reddit
We need more tower defence games!
RoomyRoots@reddit
You can easily extrapolate it to the whole Internet.
WinDrossel007@reddit
No, it's common sense
xamboozi@reddit
Wait, are you guys comparing a raw LLM against one with a fully refined harness? Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every conversation for how it can do better next time? Cause that's what Claude Code is doing.
nickl@reddit
> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?
> Cause that's what Claude Code is doing.
Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.
The harness is important, but don't make things up.
AdOk3759@reddit
Exactly.. the harness plays a huge, huge role in output quality, even more so when we’re talking about small models. Look up little coder
balancedchaos@reddit
Local LLMs have been utterly terrible at everything I've tried with them.
eat_my_ass_n_balls@reddit
A lot of the people in here are making slop and it shows
weiyong1024@reddit
I don't think this is local-vs-frontier. I get the same tool-call loops and context bloat with Claude when the harness doesn't scope tool output; local just has less margin. Maybe the point of a strong coding agent is helping you build better local projects.
stillanoobummkay@reddit
Claude Code is an order of magnitude better than its competitors. Hands down.
So the best local model won’t compare.
I think it’s a matter of the right tool for the right job though and I admit that I get frustrated with my local models and go to Claude when needed.
Fast_Sleep7282@reddit
the trick is to use a large LLM to orchestrate smaller coding LLMs to save output tokens
hedsht@reddit
i was about to say.
in my workflow codex gpt 5.5 is the architect, qwen3.6 27b the builder and qwen3.6 35b the tester. it works very well (for web development).
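stripped down, the loop looks roughly like this (a sketch, not my exact setup; endpoints, model names and prompts are placeholders):

```python
# Architect (big remote model) writes the spec, builder (local Qwen) implements,
# tester (local Qwen) reviews; loop once more if the review finds problems.
from openai import OpenAI

architect = OpenAI(base_url="https://api.example.com/v1", api_key="KEY")  # big model
local = OpenAI(base_url="http://localhost:8000/v1", api_key="none")       # qwen endpoint

def chat(client, model, system, user):
    r = client.chat.completions.create(model=model, messages=[
        {"role": "system", "content": system}, {"role": "user", "content": user}])
    return r.choices[0].message.content

task = "Add pagination to the /orders endpoint."
spec = chat(architect, "big-architect-model", "Write a short implementation spec.", task)
code = chat(local, "qwen3.6-27b", "Implement exactly this spec. Output only code.", spec)
review = chat(local, "qwen3.6-35b", "Review the code against the spec; list concrete problems.",
              f"SPEC:\n{spec}\n\nCODE:\n{code}")
if "no problems" not in review.lower():
    code = chat(local, "qwen3.6-27b", "Fix the code per this review.", f"{code}\n\nREVIEW:\n{review}")
```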
tomdg4@reddit
How do you set up such a workflow? Trying to do the same since GitHub Copilot prices will go through the roof
Harvard_Med_USMLE267@reddit
I use the same two models as you, op. I like Gemma 4 personally. 48 gig vram setup.
Local models are fun to code with from a hobby perspective. Using your GPU to write code is very sci-fi!
However, for anything serious there’s no comparison between local models and claude code. Not even vaguely in the same league.
swaglord1k@reddit
better late than never
edsonmedina@reddit
To me it sounds like no one is wrong in this thread, they just have different expectations.
Some people use LLMs as tools to speed up/improve their coding/reasoning and do just fine with local AI.
Others expect LLMs to do the thinking and make decisions for them. Nothing wrong with that, but for those people local AI is definitely not there yet.
The latter group does have a problem though: I'm not sure these gigantic models are even economically viable (at least currently), so you might face even higher prices. The scale required to run them is simply insane and someone needs to pay the bill.
MexInAbu@reddit
This. A couple of years ago we were doing complex coding without any LLM assistance whatsoever. So having something like Qwen 3.6 is an incredible production multiplier.
Maybe I'm an old jaded man yelling at clouds, but all this talk about letting a complex model do the planning is crazy talk to me. I do most (almost all) of the planning myself and a significant part of the coding too. When I let the LLM write code autonomously I give very detailed instructions approaching pseudocode. Small LLMs are very good at speeding up my work.
Now, I do use the frontier models to help me with complex planning, solve complex problems, and find known methods and tools, though.
waescher@reddit
name checks out
floriandotorg@reddit
I mean what are your expectations? You’re comparing an LLM running on a GPU cluster in a data center with a MacBook.
And as other commenters pointed out: In the end, they are just tools. And I think you're using them wrong. A local LLM is great if you want to be able to use it offline, have total privacy, and practically no cost. If that's not your goal, use a frontier model in the cloud.
droptableadventures@reddit
If they had a MacBook with anything more than the base amount of RAM, they'd be able to run a bigger model than that!
tp_bexx@reddit
The Carlo Ancelotti reference is 10/10
PromptInjection_@reddit
You are comparing a 27B LLM to one with over 1 Trillion params?
Buy a Mac Studio cluster and try it with GLM 5.1.
andy_potato@reddit
Or just subscribe to Claude for the next 10 years for the same price.
PromptInjection_@reddit
That's an option. It's like on vacation: taxi or rental car? Both have their pros and cons. It's not just about the price.
andy_potato@reddit
The problem is your shiny Mac Studio cluster will be scrap metal in 3 years from now. Outside of very narrow use cases, investing that kind of money in a local AI rig is a huge waste of money.
PromptInjection_@reddit
Yes, of course. In three years, you'll "have to" buy a new one.
Local AI isn't a free lunch. Anyone who thought it was has fallen for AI influencers. They live off hype.
WinDrossel007@reddit
It's a matter of time before the big corpos decide to ramp up prices to bring their investments in line with the ROI they want.
Then your local LLMs will be much more useful. Until then, I would agree with you. I like working with Opus 4.6, but my company pays for it.
I don't care about tokens as an employee.
But I do care about tokens as an individual.
I bought a 5090 and I'm happy with my local models and learning how to use them. Qwen 3.6 is a pretty good one. If you provide enough specs, it does its job pretty well. Not at the level of cloud models, but then you can't depend on those.
Overall I agree with a sentiment.
finevelyn@reddit
Yeah me too, honestly I was done before I was even started. Now I'm using a local brain for coding and it's miles ahead of even cloud LLMs. Try it if you haven't.
Stitch10925@reddit
What is a "local brain"?
orion7788@reddit
lol
thejesteroftortuga@reddit
Honestly nothing beats Opus 4.7. It’s crazy. I just had it refactor the UI of a pretty complicated web-app over several hours while I slept and it got like 95% of the way there.
These smaller models are much better for narrower faux-deterministic outcomes than they are as broad scale coding agents.
dolomitt@reddit
I tried Cline with qwen3.6:27b. It does not work anywhere near as well as with OpenCode for some reason. Same llama.cpp server. It's really usable compared to previous generations. I run on a 3090.
Puzzleheaded-Try737@reddit
Totally fair. If the productivity loss is hurting, the "Local" pride isn't worth it. I’ve been building tech for years and the "Hardcore mode" of small models is only fun until you have a deadline. Since you're switching to OpenRouter, I'd suggest trying a mix:
AnomalyNexus@reddit
I mostly view it as a spectrum. Don’t want to pay 200 bucks a month for Claude opus or whatever. But also don’t want to fight against a weak model too much either. So one of the cheaper api it is - currently GLM.
Working on moving some openclaw stuff to local though. Some tasks there aren’t as sensitive to precise model
GrungeWerX@reddit
I recommend ppl just downvote this AI slop written post and keep it moving.
andy_potato@reddit
Downvotes this mindless reply
mister2d@reddit
Probably the irony is that the local model was used to assist.
AlwaysLateToThaParty@reddit
I code on an rtx 6000 pro, using qwen3.5 122b/a10b heretic mxfp4, at about 75GB, and it's solid. I've tried the smaller models and they drove me mental. This can one shot complex tasks.
The problem with OpenRouter, it seemed to me, was that different service providers were quantising their API endpoint models. I think that's unavoidable, fwiw. I'm pretty sure OpenAI and Claude do it too, but they do it in subtle ways, cuz they can. But what it meant for me was inconsistent output, and that drove me mental.
So that's why i have the gpu. Does the task, and more. Pretty epic gaming gpu too tbh.
andy_potato@reddit
Feels good until you do the math how long you could have subscribed to Claude or Codex for the $$ you paid for that RTX 6000.
Your games will still run fine on a 5060ti by the way.
YehowaH@reddit
Hope you used qwen3.6 35 a3b with an IQ4_NL/XS quant; it fits in 24 GB of memory. You get 170 t/s generation, on par with Claude. Qwen 3.6 was trained for tool calling (3.5 was not) and it has the developer role. Both work well, and check the recommended parameters for defined programming tasks, e.g. temp 0.6.
I have minor issues to none with the new models; these are a true replacement. Give it another try with the right models. I do complex scientific stuff, backend and frontend, nothing you can compare to the daily work of a dev and nothing the LLM can have been trained on, because there might be only a few examples worldwide. It runs like a charm.
andy_potato@reddit
The MOE model performs even worse than the dense model for coding.
false79@reddit
Bruh - that is not how you do it. You need a harness, Claude Code, Cline, Kilo, whatever, then you need to @ the file you want to make a part of the context.
Claude code is not a mind reader but it certainly has massive amount of context.
You can get away with so much more if you give LLM some direction, it will connect the dots with sufficient direction.
andy_potato@reddit
Did you even read the original post?
dtdisapointingresult@reddit (OP)
I was using a harness. I tried two complete ones (Claude Code aimed at vllm, and Qwen Code aimed at vllm). I also tried vanilla Pi.
traveddit@reddit
Which endpoint did you use on Claude Code? What arguments for vllm?
RemarkableGuidance44@reddit
They have no idea...
juraj336@reddit
I'm surprised it isn't able to handle this then, Ive had Qwen3.6 27B handle several things like this easily. I had it make an api, then dockerize it and then iterate through until it fixed the issues after which it worked great.
I think for these medium size models context is king. They don't know as much as a Claude or Chatgpt model but they know enough that with the right context they can reach the same result.
So for me what has worked great is adding a searxng instance for web search and having it ensure testing in a loop until it has something working.
More-Curious816@reddit
You compared a trillion+ parameters model with 27 billion and 31 billion models? Of course you will notice the disparity. Try the big open source models and come back.
andy_potato@reddit
Lots of people on this sub claim that the Qwen3.6 27b model is on par with Claude. OP therefore specifically selected this model for their comparison.
Nobody doubts that a model like GLM 5.1 can achieve performance in the same ballpark.
1dayHappy_1daySad@reddit
I do code for a living and yeah, local models are not there yet compared to the best paid models.
AvidCyclist250@reddit
Noticed that too. Especially with qwen 3.6 35b
ComfyUser48@reddit
If you're not getting good results from Qwen-3.6-27b, it's a skill issue.
Learn to use good prompts and phased coding.
FusionCow@reddit
a model running on a single consumer gpu will never compare to a model like claude. you can still save money though by using something like kimi k2.6, which is as good as claude opus but way cheaper on api
andy_potato@reddit
Of course not. But there are very vocal people on this sub who want to make you believe otherwise.
dtdisapointingresult@reddit (OP)
For sure, that's the idea. I'll keep using Claude for the work stuff (I don't pay for it), and use big cheap Chinese models for my personal projects. It gives me the best of both worlds.
cyberdork@reddit
I mean, did you get the impression that people use local LLMs for anything else than hobby projects?
Obvious_Equivalent_1@reddit
I think you might’ve also fallen for the trying-to-switch-all-at-once trap. What works best is to start with what you know, and familiarize yourself where it doesn’t hurt as much.
To give some insight, I started using Qwen 3.6 35B for, what do you think? Right, I didn’t start with full-blown dev sessions. I let Claude set up a slash command for comments and routed it through the local model. A clean 1-2K token save per session, easily verifiable in git log.
Then I started experimenting with some hooks; I forced Claude to run any Explore-type or Search-type subagent through the local Qwen 27B model.
The thing is, when you start with a small scope, it’s also easier to discover any performance issues, any caching issues, any issues with the prompting and with the thinking levels. I’ve actually run into some issues or occasional crashes, but because iterations are so small, it’s way easier to find the issue locally.
I think when people talk about the power of local models, they didn’t get to that point by going all in before they got through the initial fine-tuning stage. I think for local models the next big step will be tools that automatically adjust the models to your local hardware. For now, unfortunately, the promised potential does take some grinding through that fine-tuning.
dzhopa@reddit
As a tech VP, I'm currently operating a whole dev team on Anthropic and OpenAI credits freely available to lots of VC funded startups. Those days are rapidly coming to an end and we're burning through the credits at a ridiculous pace some days. That said, I'm frantically evaluating other ways to give my team these tools when the gravy train runs out.
They're going to get the big cheap Chinese models for work stuff and local models for their personal projects lol
XTCaddict@reddit
Nahhh it’s not, on benchmarks sure, but in real use it does still lag behind. It’s in between Opus and Sonnet imo. That being said it’s still very good. I think its thinking trajectory isn’t as dialled in as Opus. It misses more things, needs more hand-holding. Still a beast overall though; if you’re a dev it’s a great tool for the price.
RemarkableGuidance44@reddit
That's why you split up the effort... We can do 85% on Kimi K2.6 and GLM 5.1 on our servers and then use Codex for the 15%.
SmartCustard9944@reddit
One can hope that DeepSeek v4 flash gets somewhat close to an older Claude.
Crampappydime@reddit
You don't even mention hardware, you could be stupidly using 2 bit quants expecting more…
ConsciousDev24@reddit
Fair take. Local models still struggle with long-horizon reasoning, tool use, and real-world workflows like Docker. The gap vs Claude Code or API-based models is very real right now, especially for debugging and decision-making. Using locals for lightweight tasks + cloud for heavy lifting feels like the practical split.
Have you tried pairing a local model with a stricter tool-execution layer (like enforced step checks) to reduce those bad decisions?
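To make that concrete, here is a minimal sketch of the kind of enforced check I mean: a wrapper around the shell tool that logs full output to disk and only ever hands the model the tail, plus an explicit "unknown outcome" message on timeout. The function name, log path, and limits are made up for illustration; this isn't any particular harness's API.

```python
import subprocess

MAX_TAIL_LINES = 200                          # hypothetical cap on output returned to the model
LOG_PATH = "/tmp/agent-last-command.log"      # full output stays here for later tail/grep

def run_command(cmd: str, timeout: int = 3600) -> str:
    """Run a shell command; return only the exit status and the tail of its output."""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Don't let the model conclude "it failed": report the timeout as an unknown outcome.
        return (f"Command hit the {timeout}s timeout and was stopped. Treat this as an "
                f"unknown outcome, not a failure; re-check state (e.g. docker ps, image list) "
                f"before retrying.")
    output = (result.stdout or "") + (result.stderr or "")
    with open(LOG_PATH, "w") as f:
        f.write(output)                        # full log kept on disk, never in context
    tail = "\n".join(output.splitlines()[-MAX_TAIL_LINES:])
    return (f"exit code {result.returncode}; full log at {LOG_PATH}; "
            f"last {MAX_TAIL_LINES} lines:\n{tail}")
```

The point is that the cap is enforced in code, so the model never gets the chance to slurp 250K tokens of build output in the first place.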
a_beautiful_rhind@reddit
Harness issues aside, this is why I always said stuff of this size is scraps. 30b is like the bare minimum to get anything, even chat or RP. You're expecting it to have Kimi performance. Let alone a MoE with 3-10B active parameters.
People here never liked hearing that. They will blame everything else like quantization or laughably, kv cache.
oldschooldaw@reddit
I quite like reading posts like this, it is the antidote to the shit I see on Twitter constantly about people using xyz claw variant #1337 with omega-amazing-distill-opus-3b on their third Mac mini while they escape the permanent underclass. It helps really remind me the reality is actually in the middle.
andy_potato@reddit
I wish I could upvote you more than once.
Zeeplankton@reddit
I always thought twitter was better than reddit, until I got a twitter account. That place is like linkedin with toxicity turned up to the max.
pkmxtw@reddit
Just downloaded IQ1_S on ollama 🦙 running at 3 tk/s. This thing totally replaces Opus 4.7 and I'm canceling my CC sub! Big AI labs in shambles... Starting my new all-AI startup with 10 claw agents now 🚀🚀🚀. If you aren't learning about this, you are 100% left behind!!!
gameboyVino@reddit
Deleting twitter is truly the answer here
CondiMesmer@reddit
Pretty much. Even if using Claude Opus 4.7, you still need to heavily supervise the output. That's just the flow of coding with LLMs tbh
cohesive_dust@reddit
Reality sets in. I went through the same drill as you. I'll try again in a year.
Techngro@reddit
Off topic: This is exactly how I feel about Linux.
mister2d@reddit
So...skill issue on both accounts?
Techngro@reddit
Yup. The same skill issue that makes people return to Windows over and over and keeps desktop Linux at a barely noticeable market share. 🤷♂️
mister2d@reddit
Ah. You would have had me if you said users returning to Mac.
Returning to "Windows" explains the core issue.
Techngro@reddit
I was just speaking from my personal experience. But whether it's Mac or Windows, we both know the reason people return after trying (and I really gave it a fair shot this time) Linux is the same.
letsbefrds@reddit
I'm flipping and flopping between Mac (I use a Mac at work and I have an Air); my home desktop is dual-boot Linux/Windows.
They all have their strengths. I'm really trying to use Linux (Kubuntu) and for 90% of the stuff it's fine. But when it's bad it's really bad. For example, just organizing my photos and deleting images off my memory card is super slow. Works just fine on Windows and Mac. That 10% is pushing me towards full macOS.
Mac: I can't get used to the weird cmd-c cmd-v. My god, the screen splitting is awful.
Windows: I use it to game, that's about it. Copilot is dog shit and the ads in Windows are really killing the vibe lol
Techngro@reddit
Instead of going back to Windows 11 Pro, I went with Windows 11 IoT LTSC. It's already debloated by Microsoft. No AI, no Copilot, no ads, etc. I am very happy with it. You should give it a look.
mister2d@reddit
I didn't miss on any points. Go and enjoy Windows.
FUS3N@reddit
I feel like it's kind of disingenuous when you put it like that. Yes, Windows has issues, but desktop Linux isn't there yet either. I have it dual-booted, but for most people it's not the same experience when you always look at it from the view of someone who knows stuff here and there.
dtdisapointingresult@reddit (OP)
idk if I agree with that. Linux is predictable. It's the same stuff working predictably every time.
Just stick to Ubuntu LTS instead of meme or rolling distros, don't install random drivers.
iMakeSense@reddit
Pick one of the 3 top wifi cards on Amazon, I promise you only 2 of them will work and one of them will probably require some weird ass drivers.
kyr0x0@reddit
Calling the Pipewire mess predictable is kind of a stretch ;) Audio under Ubuntu is highly unpredictable. Sometimes it works, sometimes it doesn't. It was more stable with ALSA in the early 2000s...
simracerman@reddit
Funny I had a conversation with a work colleague about this today. I concluded that I’m still burnt from last year’s experience.
PavelPivovarov@reddit
Local models are usable but also require a frugal approach to context.
Claude Code's system prompt alone is 10k tokens; add a few MCP servers and you are approaching 30k context without even asking any questions, and this is where local models start degrading...
I'm currently switching to Pi, paired with RTK and Caveman for better context density, plus replacing MCPs with CLI + Skills, and it works wonders.
I had a pretty good coding session with that Pi setup and Qwen3.6-27b-IQ4XS with 32k@Q8_0 context (the maximum I can fit in VRAM) and it was a really decent coding companion.
Yes, it's not GPT-5 level, but that wasn't my expectation anyway; the model never did anything unreasonable and the generated code was also solid most of the time.
a_beautiful_rhind@reddit
People underestimate how much stuff like claude code is tuned to cloud models. It's really slim pickings on the harness front.
I only had luck with roo so maybe I will try this new Pi thing. Otherwise it's "COMPACTING" city and the model can't really get anywhere.
zipperlein@reddit
I think this is a perspective problem, not a problem of the actual models. It depends a lot on how hands-off an approach you take, imo. I like to know exactly what is in my codebase. Even if the LLM does not make good changes, the direction is fine most of the time. Then I just do some manual tweaking and let it continue. It's a wayyy smaller model, even if it is good for its size.
DeltaSqueezer@reddit
3 minutes after giving the prompt:
Qwen 9B.
CheatCodesOfLife@reddit
Upvoted for EchoTTS. That's pretty good for a 9b! Which harness?
Stitch10925@reddit
What agent tooling did you use?
DeltaSqueezer@reddit
I wrote my own. I just started with a simple loop and added tools. After a week, I stopped using Claude Code and replaced it with my own agent and most of the agent was developed by itself.
After adding many tools, I found it was better to scale back and limit it to just four: Read, Write, Edit, Bash. I also have Grep and Glob so I can disable Bash to limit risk, but technically, you could just have Bash as the universal tool.
I also have no default system prompt so full context is available to the agent.
I reduced API usage massively. Now 70% of work is done with Local Qwen and 30% with GLM-5.1 when more context/intelligence is required.
https://www.reddit.com/r/LocalLLaMA/comments/1sq7cie/warning_do_not_write_your_own_ai_agent_if_you/
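For anyone curious, a stripped-down sketch of that "simple loop plus tools" pattern (not my actual code; the tool set, model name, and endpoint are just for illustration, using the OpenAI-compatible chat API against a local server):

```python
import json, subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # local vLLM/llama.cpp server

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return "ok"

def bash(command):
    r = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=300)
    return (r.stdout + r.stderr)[-4000:]          # keep the context frugal

TOOLS = {"read_file": read_file, "write_file": write_file, "bash": bash}
SCHEMAS = [
    {"type": "function", "function": {"name": "read_file", "parameters": {
        "type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}}},
    {"type": "function", "function": {"name": "write_file", "parameters": {
        "type": "object", "properties": {"path": {"type": "string"},
        "content": {"type": "string"}}, "required": ["path", "content"]}}},
    {"type": "function", "function": {"name": "bash", "parameters": {
        "type": "object", "properties": {"command": {"type": "string"}}, "required": ["command"]}}},
]

def agent(task):
    messages = [{"role": "user", "content": task}]   # no default system prompt
    while True:
        reply = client.chat.completions.create(
            model="qwen-local", messages=messages, tools=SCHEMAS).choices[0].message
        messages.append(reply)
        if not reply.tool_calls:
            return reply.content                      # model is done, hand back its answer
        for call in reply.tool_calls:                 # execute each requested tool
            args = json.loads(call.function.arguments)
            result = TOOLS[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
```

Everything else (permissions, output truncation, subagents) is layering on top of this loop.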
Stitch10925@reddit
That's pretty cool. What coding language?
I've been thinking of doing the same thing because current tools are not very fond of C#.
DeltaSqueezer@reddit
I wrote it in Python.
datbackup@reddit
Even though I lean towards agreeing with you that local isn’t able to compete with the big centralized providers, I immediately became skeptical when your long post didn’t mention the actual harnesses you used by name. I see in another comment you mentioned using Claude Code, Qwen Code, and Pi.
The fact that you didn’t mention this in your original post but you did mention several models by name, tells me that you are misunderstanding the importance of the specific harness you choose.
I agree that there are way too many posts on X that hype up agents or AI in general and ESPECIALLY make it sound like the poster spent way less time on their hyped outcome than they actually did. Basically there is a scammy situation happening whether organically or intentionally where people are incentivized to make it sound like something “just worked” because then, when others read it and can’t reproduce the outcome (without ridiculous amounts of time and effort) it positions the poster to get more esteem, followers, job offers etc.
The takeaway is just that you should expect vastly different outcomes with different harnesses even when using the same model. Of course there is also the “skill issue” but I want to suggest to you that some portion of the “mind reading” you refer to is down to the agent’s system prompt(s) and the way it engineers context.
Hermes agent for example has the same problem you mention, where it starts a long-running process with no regard for how long it might take, then times out and has to start over. However, by default it’s very good about the behavior you described, using the tail of a log file or command output to determine the state of something.
So if you aren’t totally giving up yet i encourage you to try a “breadth over depth” approach to using harnesses where you try the same task in each and note what their strengths are.
I think there are huge unlocks still to be made in harness design, which will make the already released local models that much more viable compared to big providers.
TheQuantumFriend@reddit
What is your setup? I am running coder-latest with opencode. I would trade time for quality, maybe with deterministic harnesses. However reddit is a bit polluted with so much crap that I am a bit lost atm.
datbackup@reddit
I just set this up last night
https://www.reddit.com/r/LocalLLaMA/s/khiJXifoAV
It’s about as close to sota as one can reasonably get on “consumer” grade hardware imo
Hermes agent, pi, opencode
mrdevlar@reddit
Honest question: What do we mean when we're talking about an AI coding harness? Is this what we mean by OpenCode or Cline or RooCode or is this a more nuanced set of features that are used as part of a coding process?
Lucky-Necessary-8382@reddit
Probably good prompts in .md files
mrdevlar@reddit
Could you elaborate on what you mean on that?
Lucky-Necessary-8382@reddit
A "harness" is the software layer you build around a model — the infrastructure that turns raw intelligence into a useful, autonomous work engine. The model provides the reasoning; the harness makes it actually do things reliably, repeatedly, and without you babysitting it.
What a Harness Actually Contains
A harness typically wraps the model with:
- tool definitions and an execution loop: the model emits a tool call (e.g. `run_python(code)`), the harness executes it in the real world, and feeds the results back into context
mrdevlar@reddit
Thanks. Not sure who downvoted you for helping me, but I appreciate your effort.
watchmanstower@reddit
A harness is both what you are running the agent through (the software) and also what you are surrounding the agent with for him to be successful at whatever you’re wanting him to do (e.g. all the necessary docs)
droptableadventures@reddit
Also it's some very interesting timing given that Github Copilot just announced a switch to usage based billing, and massively increased the cost for higher end models.
And it's resulting in a lot of people who previously dismissed local AI models suddenly being quite interested in them...
eLKosmonaut@reddit
Pro+ is $40 and still has Opus. The multipliers drop on April 30th. Your post isn't entirely accurate.
droptableadventures@reddit
It does now, under the new one you can't sign up to.
eLKosmonaut@reddit
How would you use something you can't even sign up for? Like I said and you just confirmed, not entirely accurate.
cniinc@reddit
I disagree. OP posted how they were making harnesses and parameters for their relatively simple task of taking a GitHub repo and making a container.
If anyone can point to a working set of model and harness, I'd be very open to hearing about it. If we just can't do anything close to Opus, let's just be honest about that. If we can achieve Opus-level gains with a set of well-defined harnesses, let's be honest about that.
So, what are harnesses that work for coding? I've yet to see someone replicate the productivity gain from cloud models, using a local model.
PaMRxR@reddit
Local models require a significant time investment to learn a lot of details of how things work and how to efficiently make use of the hardware and model capabilities. Without some curiosity driving you into this, people like the OP will fail: people that just want to use something and don't really care about the details.
TheTerrasque@reddit
He also didn't mention how he's running the models, which can have dramatic differences in result.
mumblerit@reddit
2 bit in ollama
droptableadventures@reddit
And it's probably failed to detect all his GPUs, so is running on the CPU.
And that thing it does where it doesn't error when you run out of context, but just ignores the first bit of the prompt.
With the context length set to the default of 4096.
datbackup@reddit
good point.
AdOk3759@reddit
Exactly, look up little-coder
Altruistic_Night_327@reddit
The context bloat issue you described — 250K tokens from docker build output — is actually the core problem I was trying to solve when I built my tool.
The reason agentic apps blow up the context window is they have no architectural understanding of the project. They either dump everything or dump nothing useful. So when a long-running command finishes, they have no frame of reference and spiral.
What I built instead is a RAG layer that parses the codebase with Tree-sitter into a typed graph locally. Every agent query pulls ~5K tokens of relevant nodes — functions, dependencies, the specific files in scope — not the whole project, not terminal dumps.
For your Docker example specifically: the agent knows which files matter for that build because the graph tells it. It's not guessing from context.
The tool is called Atlarix. Works with Ollama and LM Studio natively, free for local model users. Still early (31 users, being honest), but the context problem you described is the exact thing it's built around.
Not saying it fixes everything — small models still have reasoning limits. But the 250K token death spiral is an architecture problem, not a model problem.
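To give a flavor of the idea (this is a toy, not Atlarix's actual pipeline; the real thing uses Tree-sitter and a typed graph, the sketch below uses Python's built-in ast module as a stand-in, and all names are made up):

```python
import ast, pathlib

def build_code_map(root: str) -> dict[str, list[tuple[str, int, int]]]:
    """Map each .py file to its top-level function/class defs with line ranges."""
    code_map = {}
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        defs = [(node.name, node.lineno, node.end_lineno)
                for node in tree.body
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
        code_map[str(path)] = defs
    return code_map

def slice_for_query(code_map, query: str, budget_chars: int = 20_000) -> str:
    """Return only the source of definitions whose name matches the query, up to a budget."""
    chunks = []
    for path, defs in code_map.items():
        lines = pathlib.Path(path).read_text().splitlines()
        for name, start, end in defs:
            if query.lower() in name.lower():
                chunks.append(f"# {path}:{start}\n" + "\n".join(lines[start - 1:end]))
    out = "\n\n".join(chunks)
    return out[:budget_chars]   # the model sees a few KB of relevant code, not the repo
```

A real graph also tracks imports and call edges, but this is the basic move: give the agent a map, hand it only the relevant slice.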
patricious@reddit
OP, you have mentioned all sorts of things but failed to give us the most crucial piece of information. What does your setup look like exactly? Hardware, model flags, TUI, harnesses, MCP servers?
The whole point, at least in my experience, when running local models is the supporting tech stack you build around it. My current setup feels far superior to what Anti-Gravity, Claude Code, Codex and others have to offer.
For me it looks like this: RTX 5090, Qwen3.6 35B/27B with TurboQuant (I use them both interchangeably), --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 (a per-request version of these is sketched at the end of this comment).
Coding stack: OpenCode TUI, oh-my-opencode harness, MCPs: context7, grep_app, pdf-mcp, sequential-thinking, serena, stitch, websearch.
I have oh-my-opencode use Qwen3.6 as the builder and general orchestrator, and all other sub-agents use DeepSeek V4 Pro and Fast from my OpenCode Go subscription.
This setup works wonders for me.
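For reference, here is roughly how those sampler settings map onto a per-request call against an OpenAI-compatible local endpoint (llama.cpp / vLLM). The endpoint and model name are placeholders, and the exact knob names vary by server (vLLM, for example, calls it repetition_penalty), so treat this as a sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # placeholder endpoint

resp = client.chat.completions.create(
    model="qwen3.6-35b",               # whatever name your server exposes
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    extra_body={                        # sampler knobs outside the OpenAI spec
        "top_k": 20,
        "min_p": 0.0,
        "repeat_penalty": 1.0,
    },
)
print(resp.choices[0].message.content)
```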
MDSExpro@reddit
This sub creates unrealistic expectations that do not match reality. I have spent the last 4 months setting up local coding via LLMs and I arrived at a setup that works, but it's vastly different from the image pushed by hypers:
The first realistic productivity barrier was crossed at 128GB of VRAM (4x R9700): Qwen3.5-122B-A10B quantized to INT4 was able to generate a lot of good code, but failed on long-range coding. When I gave it a technical spec, it was stuck at 90% correct implementation, but was unable to reach 100%. Anything smaller was pure frustration.
Bumping VRAM up to 256GB (8x R9700) allowed me to switch to an FP8 quantization of the same model and the difference was night and day: it reached 100% correctness and really moved on to the next tasks.
llama.cpp is a trap; for coding you need vLLM if you want any reasonable speed.
Long story short: it can be done, but it costs way more than this sub thinks.
leinadsey@reddit
So you’re saying your computer at home isn’t a massive data center with 256 TB of RAM?
Positive_Example_478@reddit
Performance-wise other providers' models are better, but the Claude/Gemini code output and prompt understanding has become a total piece of shit. It's felt that way for a week and I'm so fucking angry about the quality degradation. Even though my prompt was clear, and the same as it used to be, even after saying clearly what to do step by step, it can't fucking follow it. omg I am so damn frustrated 😡
Unlikely-Loan-4175@reddit
This is very refreshing. I guess local LLMs might get there by the end of 2026 given the fast progress. But they are not there yet, even for a 5090 GPU (what I have).
And even if they do get to current frontier model performance, the frontier models will have moved on again, increasing our expectations.
Potential-Leg-639@reddit
Give it a try with Opencode, Linux, the latest llama.cpp and Qwen3.6-35B (use the Q4 quant recommended by Unsloth - no other one! Think it‘s the XL, check their guide). No issues at all with tool calls on my side (Strix Halo with Fedora 43).
tibor1234567895@reddit
I heard that pi (pi.dev) could be a better harness for local models. Haven't tested or benchmarked it though.
Mochila-Mochila@reddit
Why would you compare Opus to a 27B model?!
And why would you equate local LLMs with, again, a 27B model?
If you were serious in your comparison, you'd have something like 2 * DGX Spark...
swingbear@reddit
Try a different harness mate, I tried to run CC through everything local and had a bad impression of models even up to minimax 2.7. Started using Hermes and a few others, speed increased and way more mileage in terms of intelligence.
unspecified_person11@reddit
These smaller models can be good for small tasks or as a subagent but yeah a full drop in replacement for Claude would require bigger models. The only way to get decent output from these small local models is to spoon-feed them very specific and detailed instructions or have a SOTA model keep them on track.
Jungle_Llama@reddit
I disagree. I have had frontier cloud models mess up simple stuff and local do a good job. Local has its limits with complexity, say a Caddy/Authelia integration in an environment with a ton of technical debt, but the issue to my mind is the tooling, especially with coding agents etc; they just aren't fully mature yet. A hybrid approach works really well.
Jungle_Llama@reddit
ha ha ha. no sooner had I typed this than DeepSeek V4 Pro (cloud obviously) completely borked my opencode fixing an mcp bug. Now recovering it with my local Qwen 3.6 35b. Hot shot models getting over their skis is real too.
NoShoulder69@reddit
Same here. It's not practical
ttkciar@reddit
Yah, unfortunately mid-sized codegen models just aren't there, yet. They've gotten a lot better, but the ones worth using are still in the 120B-size class.
With a lot of extra work, Gemma-4-31B-it gets close-ish to GLM-4.5-Air for codegen, but not close enough to make the extra work worthwhile.
Qwen3.6-27B similarly falls short, and that's only if it doesn't overthink (which it still does, way too frequently; why tf didn't the Qwen team fix that with 3.6? It was a well-known problem with 3.5).
TheAncientOnce@reddit
What's your experience with the 120b-class models? The benchmarks seem to show that 3.6 27b outperforms or matches the performance of the 3.5 120b.
ttkciar@reddit
My experience:
GLM-4.5-Air: Best at instruction-following, which makes it my top pick. I tend to drive codegen with large specifications full of instructions, and Air consistently follows every single instruction in the specification. Unfortunately it is much more prone to writing bugs than other models in this size class, but these tend to be low-level bugs, easily fixed, and not design flaws. It's "only" a 106B, but it's competent like a 120B.
Qwen3.5-122B-A10B: Runner-up. It's not bad, but would randomly ignore some instructions in my specification. It writes fewer bugs than Air, but is more likely to introduce design flaws (like using a temporary file, always the same pathname, non-atomically, in a multi-process application) or leaving some functions empty except for a "In production, this would .." comment.
GPT-OSS-120B: Great at tool-calling, okay at instruction-following (though noticeably worse than Qwen), but hallucinates up a storm. I wasn't able to get a good sense of whether it writes bugs or design flaws or not, because I couldn't get past the hallucinated libraries and APIs. How do I debug calls to a library which doesn't exist?
Devstral 2 Large: Very good at not writing bugs, and good world knowledge, but the absolute worst at instruction-following. It would ignore most of the instructions in my specification and write something only vaguely like what I asked for. I had high expectations, since it is after all a 123B dense model, but was hugely disappointed.
I have a hypothesis that Devstral 2 Large was deliberately under-trained, to "leave room" for further training on individual MistralAI customers' repos without overcooking, but don't know.
None of them are perfect, but I find the flaws of GLM-4.5-Air easiest to tolerate. Fixing little bugs is fine, and Gemma-4-31B-it actually finds most of Air's bugs, so that's easy. Ignoring parts of the specification is intolerable. Design flaws that require more than a one-line fixup are a pain in the ass. Hallucinating libraries is especially grievous, because I have to throw everything out and start over, but be sure to describe the libraries it should be using before continuing.
I used all of these models at Q4_K_M, and I know some people will point at that and say "there's your problem!" but frankly I can't tell any difference at Q6_K_M. Did not quantize K/V caches at all.
dtdisapointingresult@reddit (OP)
I can try one of those as my final attempt. Which one do you think would do best at my Docker prompt I shared here? https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiowcxe/
I'm surprised someone is saying GLM-4.5-Air still holds up, and putting it ahead of recent models.
Bird476Shed@reddit
I agree with OP "the flaws of GLM-4.5-Air easiest to tolerate."
Overall, this model is still a reliable worker and a good speed/quality trade-off.
ttkciar@reddit
I have no confidence that GLM-4.5-Air's tool-calling prowess is up to the task of doing it interactively, else I would recommend it. Its tool-calling competence is quite weak, and I have never tried giving it instructions quite that vague and open-ended before.
Your prompt is better suited to a model of GLM-5.1's caliber. I'm having a hard time imagining any of those 120B doing it well, but it might line up with GPT-OSS-120B's strengths. Maybe give that a shot.
If I were to rewrite your prompt for Air, it would include a lot more information (how the app is supposed to work, specific filename for the dockerization documentation, etc) and a lot more instructions for how it should go about compiling the misbehaving wheel. I just have no faith it would figure those things out on its own.
It's a bit surprising to me too, frankly. I keep trying the hot new models, thinking "surely this one will knock Air off its perch", but they just don't, and I keep using Air.
Maybe Qwen3.6-122B-A10B will be "the one"? Or if Google ever releases that 120B MoE Gemma4 they beta-tested, that would probably do it (assuming they fix Gemma4's tool-calling woes).
At this rate, though, it's probably going to be a new Air model based on some version of GLM-5.x (assuming ZAI can repeat the feat).
Karyo_Ten@reddit
My very first agent was GLM-4.5-Air. But when I switched to OpenCode it kept failing tool calls - https://github.com/anomalyco/opencode/issues/1880
Besides, 131K context is just too small when you graduate from small CLI tools.
ttkciar@reddit
You're not wrong. Of all of the models I tried, GLM-4.5-Air has the weakest tool-calling competence, but I work around that by not requiring it to use many tools.
Air's 128K (128 * 1024) context limit is one of the reasons I tried so hard to make Gemma-4-31B-it work. Not only does Gemma4 have a 256K limit, but it also infers a lot fewer tokens in its reasoning phase, so more of that 256K is useful. I'm still hoping to figure something out, but for now have stopped trying to use it for codegen.
What I would really like is if ZAI released a new Air model based on GLM-5.x! Hopefully with 256K context.
dtdisapointingresult@reddit (OP)
I gave up on Gemma 4 31B early on.
It wrote the Dockerfile and now needed to build it. And I was staring at its output, coming slowly at 12 tok/sec, 3 minutes of reasoning while it tries to decide if it should check if docker is installed or not, and whether to build it via the Dockerfile or the docker-compose.yml (which also builds). I exhaled and switched back to Qwen 27B.
This was an AWQ, but I doubt the FP8 would've been much better.
I really think Terminal tasks are just harder on LLMs than coding. Coding is still just dead text output. Interacting with a running system via tool calls might be a whole other level. 27B gets 35% on TerminalBench-Hard, Sonnet 55%.
dzhopa@reddit
So yes, terminal tasks, or any multi-step chain of tool calls, is where your smaller quants fall flat. Minor hallucinations creep into the syntax and the state passing between calls as the context grows large.
Code output is a lot easier because it's writing 1 file at a time, and maybe verifying syntax. You get to execute it later and fix that typo or hallucinated bug in a whole separate call. For terminal work it's passing a precisely formatted string of commands along with terminal output into the specific structural format needed for the LLM and harness to process the tool and then string the commands, syntax, context and structure together between potentially dozens of simultaneous calls needed to complete the task.
That's the real big problem Anthropic has spent a lot of time and money to get right, and it shows when you just ask Claude to "download this package from github and spin it up for my users as a docker container". Those Claude calls are stupid expensive for terminal tasks though.
rothbard_anarchist@reddit
Terminal is definitely harder for any language model. Even on Codex 5.5, it boggles my mind to watch it sometimes ponder for three minutes straight how it should open a CSV file.
PANIC_EXCEPTION@reddit
Qwen3-Coder-Next is still definitely the speed king on local as it is substantially faster than 27B and approaches Sonnet level, which is good enough for a lot of tasks. Tell Opus to make a master plan for a feature, and then use a lightweight local model to implement it using that plan. I find that this is actually quite usable.
Unfortunately the barrier to entry for an 80B model is either having multiple GPUs or having a laptop with at least 64 GB of unified memory. So, inaccessible to a lot of people. If they can juice up Qwen3-Coder-Next to be like a version called Qwen3.6-Coder-80B-A3B, I think it might be able to stand entirely on its own.
27B gets relegated to very specific one-shot questions or very strong image understanding (e.g. translating text from a schematic). Or generating small scripts in isolation. I would never have it run an agent because of just how slow it is.
IWasNotMeISwear@reddit
The core members left the company I think
ChatWithNora@reddit
The decision making gap is the real issue. I can handle slower speeds but when the model keeps going off the rails on basic tool calls, you end up spending more time course correcting than you ever saved on API costs.
LienniTa@reddit
What harness do you use? You cannot use a small model with a big harness, only with a small harness like kon, late, or little-coder. You cannot expect persistent gateways to work with Qwen 27B.
Zeeplankton@reddit
I feel like a lot of people here are kind of evangelical about running locally; which is unfair. The bigger point is retaining open models. The reality is people have to work, make money and stay competitive, and therefore use the best tools available. I think OpenRouter is a perfectly fine compromise.
StatisticianOdd6974@reddit
I think that you should also mention WHAT you code; the definition of coding is very broad. So I feel that my coding work (pipelines and IaC in Terraform & Helm for K8s manifests) kinda works with qwen3.6, but Opus 4.7 is faster and better. 5090 with qwen3.6 q5_K_M. But I do enjoy hermes agent & qwen as a personal assistant. (non-coding)
Dapper_Chance_2484@reddit
Personal assistant? any details if u can share
StatisticianOdd6974@reddit
I use hermes agent with qwen as a personal assistant. So talking on my phone using telegram with hermes, it transcribes messages and does web searches, reads my personal knowledge base, calls mcp. Just a kind of tamagotchi..
BestSeaworthiness283@reddit
I hate it too tbh. I have a poor RTX 4060 with 8GB of VRAM and I can't code with it due to the context window limitations. I mostly wanted to use local models for when the inevitable happens and I get hit with the usage limit on Claude.
My first approach to this was asking Claude to make a skill that delegated things like code generation to a local agent with just the right context, meaning the code line by line and what to modify. And it made itself a tool that it ran every time, which saved me a lot of usage.
With that approach it was still possible to run into usage limits. So I made a lightweight CLI tool for local LLMs with just 8k-token context windows. It works by initialising the codebase with a map of the whole codebase. Then when you query or ask to create or delete something, it orchestrates a plan and tasks to solve the problem. It then does an LLM request for every task, and each task gets no more than 8k tokens of context. The system now has memory, you need to approve the changes, and it uses a type of diff.
I will let you check it out if you want: github.com/razvanneculai/litecode
Gesha24@reddit
I have a mixed feeling about local LLM. I have decided to take one of my side projects and write it exclusively with local LLMs so that I can learn how they work.
Yes, you have to be very specific with them. Opus will make a decent web design just from a prompt. Qwen will absolutely suck. But if you open up Paint and give Qwen a mock-up, or work with the Claude Code skills plugin and spend 10 minutes designing the web site, it will actually code a 100% usable and decent-looking result.
Same goes for a database: if you tell Opus/Sonnet to migrate from sqlite to SQLAlchemy, it will prompt you whether you want to update your database calls to the new structure. Qwen will just wrap them in sql_text() and keep them.
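Roughly the difference I mean (illustrative only, assuming SQLAlchemy 2.0 and an existing users table; the file and model names are made up):

```python
from sqlalchemy import create_engine, text, select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

engine = create_engine("sqlite:///app.db")

# The lazy "migration": same raw SQL as before, just wrapped in text()
with engine.connect() as conn:
    rows = conn.execute(text("SELECT id, name FROM users WHERE active = 1")).all()

# An actual migration to the ORM: map the table, query through the model
class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    active: Mapped[bool]

with Session(engine) as session:
    users = session.scalars(select(User).where(User.active)).all()
```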
Lots of examples like that. However, I am not sure if that's a bad thing. The issues with Qwen and Opus are the same: code sprawl, duplicate functions or features everywhere; basically both create an unsupportable mess if you just let them go free. Having a worse local model forced me to be more involved with the code, to look into proposals in more detail, to insist that the LLM reuses code, and the results are actually quite decent.
If anything, I am bringing what I have learned from local LLM coding to Opus/Sonnet and I am getting better results. And yes, I can't run 10 LLM sessions in parallel and have them vibe code the application. But I also actually know how the application works, so I can fix/troubleshoot it myself as needed, unlike totally vibe-coded stuff.
Lost_Promotion_3395@reddit
the 'productivity tax' on local models is so real, I'm tired of babysitting Qwen just to stop it from eating 200k tokens of Docker logs
Low-Opening25@reddit
skill issues
New_Slice_1580@reddit
“I used Qwen 27B and Gemma 4 31B, these are considered the best local models under the multi-hundred LLMs.”
False
If you had more VRAM you could use much larger models.
Why do you think the commercial models charge so much? Because they are running large models that need a lot of VRAM.
m31317015@reddit
I find people expecting local models to rival cloud models a funny concept. The whole level of compute is totally different, and even the cloud models give us shit sometimes, so there's no way people should expect something that comes out of local models to beat them, right?
Wrong: everyone not knowing shit came in and thought they're godlike, that this is their chance to rise up. But in reality, hallucination is still a serious issue, the context window just isn't sufficient in large projects, let alone the self-doubt and bugs that worsen with lower and lower quants. I'm glad that OP is finding it not suitable for the use case and realizing that API calls are just much simpler for the task, but I have to say they were never made for that use case in the first place, at least in production. Not saying OP's expecting it, just saying there's lots of folks dreaming about it and not checking the facts here.
You don't learn AI or LLMs by hosting them locally; there isn't much basic AI knowledge involved. The majority of the knowledge is around infrastructure and the stuff you normally see in office or data center environments. Speed and costs are crucial factors that come into play; people have to realize a couple of 3090s aren't gonna beat RTX PRO 6000s, and the same goes for those against GB200s.
I personally find those who're sticking with MI50s and P40s fascinating, as they're the ones always pushing past their limits despite harsh architectural problems and just the lack of power. They manage to find ways for local models to work with their workflow. Maybe they don't rely on LLMs that much at all. Yeah, that should be the norm; nobody should expect a one-click finished job from AI. If that exists, it means agents are doing the job, not us humans.
Sorry, I got too far away from the point I'm trying to make: people who personally invest in machines and infrastructure for local LLMs aren't doing it for the job, they're doing it for the hobby, for the what-ifs, for the "just because I can", and the "how far could it get, how far could I get". Learning the latest technology is one thing, implementing your own solution is another.
TL;DR: Just because you know electrical engineering and how to design PCBs doesn't mean you make your own PCBs from scratch. Sometimes it's not cost-effective to do so, but more importantly it's because there are solutions already convenient enough that we don't need to, unless you have reasons to do so.
Check your motives, guys.
TheCaffeinatedPickle@reddit
The best advice I have is understanding what “agent-sized tasks” are for each model. There is no context size that's going to fix this. Then, the smaller the model, the more specific you need to be. For example, with the NUXT UI skill loaded, I asked Gemma 4 E3B to add a password visibility icon to the right side. It tried to do it with CSS positioning when there is a Vue slot for it. The issue here is that the skill itself isn’t specific about inputs and its possible slots. After providing the docs around this it was able to do it. However the time it took with the failing code was longer than doing it myself. Another one was to center the footer in the UI; there is a slot for that, but even with pasting the documentation, there is no center slot, rather the default slot is the center. I had to switch to the 27B Gemma 4 to catch that while it's thinking.
I also struggle to keep the smaller model working; it keeps ending its reasoning assuming it's done even though it clearly isn't. With Pi Agent there are babysitter and continuation plugins; none work as expected. If the task is too large it just can’t finish it without you having to remind it. For example, I can’t ask it to write a test for the feature, implement the feature, run the test and fix any failing tests. It will just work on the feature without a test first, and run it on the next pass. So I’d have to break the agent task down into smaller, more well-defined chunks. Then it’s done like 3-4 hours later, when I could use DeepSeek v4 fast and in 3 minutes it’s done and only $0.30 spent.
ComplexType568@reddit
While I do resonate with a lot about what you're saying, the nice thing about local LLMs is that they're LLMs at heart. Give it like... 6 months... and the current 27B-35B class will probably be as "smart" as Sonnet 4 or even 4.5 in actual use. Just hoping that they'd be public.
ProfessionalSpend589@reddit
People who say those models are mostly hype tend to be downvoted here.
I personally run (slowly) Qwen 3.5 397B for experimenting and a faster Gemma 4 26 A4B for chat.
--marcel--@reddit
hard to compete with cloud solutions at the moment - those that are happy with local LLM either have hundreds of thousands of bucks invested in hardware or never really used LLMs for anything critical production-wise.
tired514@reddit
I too have been somewhat unimpressed by qwen3.6-27B (harnessed by opencode). I spent the last week comparing it to qwen3.5-122B-A10B and the latter destroyed it easily (mainly C++).
Both are outperformed by frontier models, obviously, but while 3.6-27B is a step up from 3.5-27B, neither are really that useful for large, complex codebases (ie. >5000 lines). 122B-A10B is better. I'm very interested to see how 3.6-122B-A10B does.
Syzygy___@reddit
I tried Gemma4:26b with Openclaw and it's useless.
When I realized I could connect Openclaw with my OpenAI subscription (codex), all of a sudden it did a proper onboarding that I didn't even know existed before. So much better.
Thick-Succotash-795@reddit
Personally, I had massive issues using GitHub Copilot to code with small local models. When switching to opencode, which I personally liked less in the beginning, the situation changed: I find the small local models (currently I’m using mostly gemma4-26b) extremely useful.
But I also changed my behavior: I started coding more myself again, use AI more for small sections / explanations, and stopped asking agents to implement hundreds of lines of code.
RoughElephant5919@reddit
Same. I mainly use local LLMs for data extraction, but that’s it. I’m so sick of the narrative online that “I ended my Claude subscription because I just got a local LLM instead, now look at all the cool stuff I can do!” Yet they don’t disclose what they’re using it for, or their local machine specs. It’s falsely advertised for sure. Can’t tell you how many friends ask me why their computer almost blew up after trying to locally install Qwen 7b on 4gb of vram. These influencers have people trying to load a brick onto a paper plate 😂
cutebluedragongirl@reddit
Yeah, it's not ready yet.
Upstairs-Extension-9@reddit
Like do people really need one LLM to do it all? I like the combination of a big paid model or two, mostly Gemini 3.1 and Opus 4.6 for me. I then have 2 local instances, Qwen 3.5 and now Gemma 4, which is great for me. They run on a Mac Mini and have significantly reduced my API costs.
Koalateka@reddit
What quants are you using? Q8? Q4? Do you quantize the KV cache? I have noticed quantization impacts those "small" models a lot.
PaMRxR@reddit
Qwen3.6-27B Q8 and KV-cache BF16 is working very well for me with the pi-coding-agent on 2x3090 GPUs. But even with 1x3090 before I've had pretty good success with: Qwen3.5-27B, and before Devstral-Small-2 24B and Qwen3-Coder-Next, before Qwen3-32B, and so on.
Maybe I just haven't been spoiled by the cloud models? I've only ever tried Kimi (2.5 I think) with a 1 week free trial. My local models occasionally stumble due to lack of some obscure knowledge, but pasting some doc into the context is really not that hard.
alphatrad@reddit
Skill issue
dtdisapointingresult@reddit (OP)
Can you tell me what you would've done differently in the docker example I gave? How do I make it NOT read the entire goddamn 'docker build' 250k output tokens into an LLM configured for 200k context?
I put guidelines in AGENTS.md, what more do you expect me to do? Write a custom CLI interface to docker because I can't trust Qwen 27B to use docker properly?
alphatrad@reddit
Do you want to actually learn or have an emotional outburst?
Your prompt is vague, confusing and shitty. You put the same guidelines in the AGENTS.md? So you don't clearly understand context or how it affects your results, especially on smaller, sensitive models.
“Get it to run properly” is vague.
Does “properly” mean the build completes, the container starts, the web UI is reachable, the GPU is usable?
For Docker/AI projects, “runs properly” can mean ten different things.
You are talking to it like it's a person and not a tool. Something like this would probably work better:
```
You are helping Dockerize the existing project at ~/ai/echo-tts.
Goal:
Create a Dockerfile and docker-compose.yml that build and run the web UI on this arm64 NVIDIA Ubuntu host.
Hard constraints:
- Do not install anything on the host.
- Do not run pip, apt, poetry, uv, python app startup commands, or model setup commands on the host.
- You may read files in the repo.
- You may create/edit files in the repo.
- You may run docker and docker compose commands only.
- Do not use sudo.
- Do not guess dependency failures. Use actual logs.
Operating procedure:
- Read first:
  - README/install instructions
  - requirements.txt / pyproject.toml / setup.py / environment.yml
  - app entrypoint
  - expected port
  - Python version requirements
  - GPU/CUDA notes if present
- Summarize the intended install/run process from the repo.
- Propose a Docker plan before editing files.
Create:
- Dockerfile
- docker-compose.yml
- .dockerignore if useful
Command/log rules:
- For Docker builds, write logs to a file, e.g.
docker build ... > /tmp/echo-tts-build.log 2>&1
- Do not paste full logs into the conversation.
- If the command fails, inspect only the last 100-200 lines first.
- If the command appears to timeout, check whether the process/container is still running before assuming failure.
- Make one fix at a time and rebuild.
- If a Python dependency lacks an arm64 wheel, identify the exact package/version from logs, then try a source build for that specific package only.
- Do not invent packages or assume the failing package.
Success criteria:
- `docker compose build` completes on arm64.
- `docker compose up` starts the app.
- The web UI is reachable from the host.
- The container has access to NVIDIA GPU if the project requires it.
- Final answer should include the contents of Dockerfile and docker-compose.yml, plus brief notes on any source-build workaround used.
Start by inspecting the repo and making a short plan. Stop after the plan and wait for confirmation before editing files.
```
Even better, split it into multiple prompts.
For local models, I would not give the whole task at once. I’d do it in phases.
Your prompt expects the model to have good judgment about things like how to handle long-running builds, how much log output to read, and what "done" actually means.
Those are exactly the places where local coding agents often fall apart.
For Qwen/Gemma-sized models, the fix is not necessarily a giant AGENTS.md. It is more about making the task procedural, staged, and observable. Don’t say “Dockerize this repo.” Say “inspect only, summarize, wait.” Then “write files.” Then “build with logs to file.” Then “debug one failure at a time.”
Like I said... these tools are really good, when you use them properly. Not when you act like a 27B model can read your mind like a datacenter model can.
AlwaysLateToThaParty@reddit
The first thing I created was a specification prompt with the sole purpose of writing a technical specification, so that it has documented what to write and you don't need to pass it the trivial info again and again.
But the smaller models still just don't get some stuff, no matter how much you structure it. That happens in the bigger local models (like qwen3.5 122b/a10b mxfp4) too, but not enough to be a usability issue. A good spec only gets you so far.
Hydroskeletal@reddit
Briefly I think these local models are much more like autocomplete for an entire function rather than the long horizon inference that the name brand frontier models do.
I think a big difference here is model size. With car engines they say there is no replacement for displacement and with LLMs displacement == RAM.
Dockerizing a repo isn't coding, it's code-adjacent. It really cannot be overstated how much these local models lean on the structured grammar that a programming language provides. If it hallucinates a function, a compiler or interpreter gives it that feedback quickly. Tests do the same. But for an open-ended task like writing a Dockerfile, where the superset of solutions is much wider, it doesn't get that kind of feedback, and then it has to rely on intrinsic knowledge to deduce the problem OR it has to go search the internet, which it rarely will do unprompted. So when people rave about the abilities of something like the latest qwen model, they're operating in a much more constrained field. And I'll just say it: the structure that the language (e.g. Python, C, etc.) gives the output also makes things like smaller quants much more forgiving. It's quite undersold, I think, that there are lots of tasks like data munging that degrade terribly on these smaller quantizations, whereas an 8-bit would work.
agentic-doc@reddit
Tasking a 27B model with Dockerizing repos and complaining about decision making is like asking a Roomba to vacuum your driveway. Right tool, wrong block.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
90hex@reddit
My experience in a nutshell. We're not the only ones. Here's a dude who tried exactly this on a MacBook M5 128 GB. Lots of gotchas. https://deploy.live/blog/running-local-llms-offline-on-a-ten-hour-flight/
__Maximum__@reddit
I used Qwen 3.6 35B A3B with opencode and pi coder, and was satisfied with both on medium-difficulty tasks. It was even better than Claude 4.6 or 4.7 in Claude Code at explaining things, since Claude does not seem to be a good teacher; it is too compact.
randomperson32145@reddit
I don't see any models mentioned competing against OpenAI Codex anytime soon.
Monkey_1505@reddit
Yeah, there's a notable size difference between the models you are comparing here.
robertpro01@reddit
Well, I still consider myself a developer, so... local AI is just a tool; for me Qwen 3.6 is a good tool to use. I started vibe coding in Nov 2025, because my previous experiences with AI (API, not local) were terrible.
For me local AI is just another tool.
I also do a mix of API + local for very complex tasks, and still I validate all the code.
brick-pop@reddit
This. Huge LLM's start to be worth it when you blindly delegate to them. And then have no idea of what the code actually looks like.
RoomyRoots@reddit
Yeah, add it to the stack of tools you use, don't drop everything and depend only on it. It works very well as a document searcher, summarizer and drafter. I'd still rather do things slowly and step by step so I can fully understand how things are implemented.
Equivalent-Repair488@reddit
This is me, I am not a coder and only trained on python basics.
I use Qwen 3.6 27B at UD-Q5 locally in Roo Code; it still runs into debug loops and is not it for "I want this app built with such-and-such features." Tried that before and gave up. So I use my free student Gemini Pro plan to architect, to create a build prompt based on the discussion of my vision for the app, and to guide both me and my local model.
Although the apps always have issues and bugs, even a non-coder like me can vibe out simple apps with enough time and testing.
the-username-is-here@reddit
As someone who's been using Claude Code for a loooong time and recently got into local models (with the limited hardware i've got), cannot agree completely.
Yes, local models by default are dumber and slower than even "basic" Sonnet or sometimes Haiku. Yes, there's a learning curve involved, as well as a lot of tweaking. Yes, they tend to hallucinate, loop tool calls, stuff like that.
But.
It kinda doesn't matter when Anthropic decides to slash usage once more and burn through 200 EUR/month subscription tokens in half an hour. Or when it goes down again. Or when it decides that some code you're working on "violates their TOS", effectively censoring your work, no matter what you do.
Once you're set up, you pay just for electricity, which is peanuts on Apple Silicon (and you "need" that sweet 128 GB MacBook anyway 😄 ). It's always available, 100% secure, and you can do anything you want with coding harness, which is a no-go with Claude Code.
Local models are still more than enough for simpler refactors, boilerplate and stuff like that. They require you to get more familiar with the code you're working on, which is A Good Thing™.
You cannot go "hey Claude, make it fast" and then have NFI how it works now and what are the new bugs, which is not necessarily bad.
There's a future for local models, they're getting much smarter and more accessible.
niellsro@reddit
I am having quite good results with qwen 3.6 27b for coding. Using pi, llama.cpp with the Unsloth UD Q8_K_XL quant (tried an AWQ in vllm but I was getting more tool-call errors). I am really impressed by how good this model is with precise directions. This is still in a testing phase for me; I am actually throwing it at a project idea I've had in mind for some time, but so far the results are really good. I'm using it for python (backend) and vuejs (FE app). What I noticed (this applies to all LLMs, but especially to small models like qwen): make sure to lay out the foundation or precise instructions on the architecture and the code, not just requirements; provide interfaces, design patterns etc.
PS: I also use Claude Code, but comparing it to qwen is unrealistic: 2 different models (small vs huge/unknown), 2 different agent tools (Claude Code vs pi - I don't have API access to any Anthropic model so I only use them in CC).
xXG0DLessXx@reddit
I think the harness you use plays a big role in how good some local models are. But also, I kind of agree that they simply don’t hold a candle to the big providers. But for me, Gemma 4 for example is extremely good at debugging some weird issues in the system for me, checking logs, giving ideas what it might be, and even fixing some small stuff for me fairly well. Where it’s not that good, is creating things from scratch, or making huge changes to existing code bases.
DeltaSqueezer@reddit
There are limitations on intelligence and context of local models.
But the tasks you gave are easily doable. I've done similar with just a 9B model. I suspect you may not have controlled the thinking (particularly for Qwen3.5) and the context got exhausted by thinking. I actually have thinking disabled when using it for coding.
muhlfriedl@reddit
You get what you pay for
Aphid_red@reddit
Note: The provider models all have a big system prompt. You don't see it, but it's there. You should use one as well.
The provider models also use 'thinking' mode.
If your local model only has one or even neither, it practically won't behave the same way. It's smaller and thus a bit less capable, but shouldn't be unusable for repetitive tasks that have well-documented examples online.
juaps@reddit
The issue is that you require over 240k of context to run it without any problems. I simply switched from LM Studio to a llama.cpp custom fork to run TurboQuant, and all my tool-calling errors, idles, loops, and so on stopped. As a result, I now have a proper and efficient web application with login, chat, RAG, and SQL functionality for my business.
iamreddituserhi@reddit
Try giving a system prompt (try different quants and versions; some versions just break and keep looping with weird output).
Once it's tuned you can expect better output. Try different system prompts, or even ask Opus, Kimi or DeepSeek to optimize the system prompt for your use case.
Then it will become usable. You also need to adjust the temperature according to your use case.
SatanVapesOn666W@reddit
His prompt was like 2 paragraphs asking it to work with docker which even the Claude models struggle with. He's really not doing himself any favors, since that's basically the first thing he dismissed and it really is his problem.
BubrivKo@reddit
Yeah, it’s the same for me... Everyone’s praising Qwen 3.6, Gemma 4, and so on, but they just don’t work for my use case. I have 15 years of programming experience, so I can usually tell pretty quickly the difference between clean, functional code and a bad solution that misses the mark.
Opus almost always gives me the correct approach, while smaller models almost always fall short.
I simply can’t trust a smaller model to solve my task correctly, which makes using them kind of pointless for me.
Smaller models might be fine for tasks like summarization, pulling information from large text databases, basically RAG-type operations.
TuskNaPrezydenta2020@reddit
I think Qwen sucks for tool calling, every time I tried either MoE or dense they randomly fail with syntax over time. I had no such problems with Gemma 4 or GLM though and I would say they are roughly as good as the sub makes them out to be in general for my use. I wrote (not vibecoded) my own private harness though after being unhappy with OpenCode and alternatives so mileage may vary
HenkPoley@reddit
Yeah, you can kind of use them like GPT-4, e.g. you ask them something, and the answer often won't work, but you don't assume it will work. You just use it as inspiration. The function names they come up with will get you somewhere in the documentation of the system you're using. Things like that.
AardvarkTemporary536@reddit
You don't hire McKinsey to bleed your wallet dry and ruin your business with their own hands. You overpay them to tell you how to ruin your business yourself.
You also don't fire all middle management when times get tough; you just keep the most efficient ones that you can overwork and underpay and who do a good enough job. Then when things get better you realize they're doing just fine, so long as you accept high turnover.
Ballisticsfood@reddit
Opus is a mid-junior level developer, sonnet is a junior developer, haiku is a junior developer that got distracted by a butterfly.
Local LLMs are kids doing ‘my first app’ courses. They don’t have the necessary ‘experience’ to do even basic tasks without handholding, but if you want them to copy out endless boilerplate with some small changes they’re competent enough.
fasti-au@reddit
So what you have is called other people's flows.
I have 1M-context Qwen 3.6 27B and 35B running.
The trick is: don't give it tools, give it ways to find tools. Think like it's you, make a state machine, and give it a CLI tool for the command calls (HTTP via a CLI mcp-connect). That way it gets to think.
Ask GPT to read back a thread and evaluate how many tokens went to solving the problem and how many went to not upsetting you with its reply style. That's why you go local.
CriticismTop@reddit
I'm having similar experiences using OpenCode. Gemma 4 has been OK; Qwen spends most of its time going round in circles.
Local is working great for me with automation, but coding leaves a lot to be desired. I am enjoying Big Pickle with OpenCode at the moment.
Current_Ferret_4981@reddit
Hermes + Qwen3.6 in Q5 with qkv cache fp8. Flies fairly quickly and handles tasks incredibly well. Much better than the integrations and models my job has made accessible so far. Complex workflows with deep library understanding come together in a few minutes which has been impressive
mission_tiefsee@reddit
Local models are not your fairies. If you use this setup, you are using bleeding edge tech. Welcome to the frontier. So change your harness, jump on the hermes-agent bandwagon, and get your stuff going. For your prompt: go to ChatGPT 5.5, post your prompt, and ask it for a prompt tailored to your local model. Unfortunately, smaller local models are really prone to not understanding your intent.
I built a nice tower defense game yesterday, with qwen3.6-27b and some prompting help from chatty 5.5. All in Godot, without touching code.
somerussianbear@reddit
r/usernamechecksout
themoregames@reddit
TL;DR After giving local LLMs a fair shot for coding tasks, the author concluded the productivity loss outweighs the benefits compared to cloud models like Claude Code.
somerussianbear@reddit
Funny. I've wanted to post something like this, but every time I think about it I fall back to "I'm probably doing something wrong, gotta put more time into this to make sure it's not my setup". Then eventually a new model comes out and I feel "oh, NOW it will work!", and rinse and repeat, and the feeling is still the same.
The thing that seems to have given me the best results was
little-coder, a pi-based harness that adds good guardrails for small models. To give an idea of how excited I am about that one, I'm building an entire sandbox tool using SBX because it doesn't have one and I want to use it badly day to day. For simpler tasks like documentation or understanding/researching a codebase it gave me good results with Qwen 3.6 35 MoE, so I imagine a dense model would do even better.
But yes, it's pure grind until we get something minimal working, and most people just don't have the energy to keep going. Luckily I have fun on the discovery path rather than only enjoying the final results, and for this subreddit that seems to be imperative.
EPICWAFFLETAMER@reddit
Completely agree. I use local models all the time, but not for general coding. I see a lot of posts on here of people saying x model is 99% as good as opus 4.7. That sentiment especially gets voted to the top when a new model drops. I think we will have very good local models for this purpose in a year or two, but we just aren't there yet.
Zestyclose-Worth-167@reddit
Look, if a 27B model isn't cutting it, consumer-grade gear just isn't gonna save you. My advice? Milk the free APIs for all they're worth. If those run out, you’ve just gotta bite the bullet and pay up. Even the 80B coders I've used don't really hit the mark. That 27B version of 3.6 is 'okay,' it’s just laggy as hell. So yeah, I feel you. It's either put up with the stutter or you're stuck. Spending $20k+ on an AI rig is overkill—that money would pay for enough API tokens to last you a lifetime.
ShelZuuz@reddit
$20k would be totally worth it for a Kimi 2.6 level model at 100 tok/sec output.
But that's not going to be $20k.
Zestyclose-Worth-167@reddit
yes..
Inevitable_Mistake32@reddit
I like my privacy. I use APIs on the rare occasion I am ok with donating towards my replacement, but for everything else, local. of course LLMs aren't all I self-host, doubt anyone is. But with everything from HA, my fun paper trading accounts, my screeps bots, local and remote API keys on the host, I opted to keep my data local.
Is Qwen or Gemma better than Opus? Idk, is a smaller yacht better than a bigger yacht? Subjective.
But being able to crack out 120 tks, with 256k ctx, with zero api waiting/throttling/ratelimits and knowing none of it leaves my local network? Priceless value to me.
hovo1990@reddit
Try the caveman plugin for Claude Code; it has improved the experience.
StorageHungry8380@reddit
I noticed that the default 8GB of host prompt cache in llama.cpp was not enough for Qwen3.6 27B @ 128k context when using it with OpenCode. You can monitor this in the logs by watching the prompt cache usage it reports.
In my case a ~57k prompt ate 5.6GB of prompt cache. I bumped it up to 32GB, since I'm running 4 slots, and that helped a fair bit for me.
Major-Examination941@reddit
Yeah, I built my own Ollama-like setup locally with routing, model switching, and orchestration, calling into the cloud (Anthropic, MiniMax, Gemini) for orchestration and synthesizing, and sometimes for code. I mean, if you're expecting your rig to actually compete against Claude, which literally loses money on you, then your expectations are off. Also, it's great for continual learning: you still have to debug, learn how to prompt better, review, etc.
AdOk3759@reddit
The quality of a local model hugely depends on the harness. I suggest you look into little coder (and their paper)
SourceCodeplz@reddit
I think it is because of quantisation. I am using Gemma 4 31b via API directly from Google and I don’t experience this. It just works!
TanguayX@reddit
Yeah, I'm with you. I did some experiments over the weekend with my local Qwen3.6, as big as I can muster, with Cline, and it was doing OK with the task I was trying. But I had Sonnet off to the side going... "wow, look, it just made up a function", even with Sonnet giving it hints.
So yeah, what's the utility in that, when debugging is often worse than just starting from scratch with a better planning doc?
The way I look at it, two years ago, I had to carefully coax GPT through a coding session. Now, I was getting VERY close to getting a local model to one shot based on a good PLANNING and TASK doc. That's pretty sweet. Progress will continue, and it will happen one day soon.
admajic@reddit
Did you give the baby model a plan or just let it loose? If you used architect mode and then wrote a Jira-style ticket, it would do better.
Pygmy_Nuthatch@reddit
It's hard to make an honest comparison between a 27B parameter model and a 2T parameter model.
Curious-Function7490@reddit
I'm semi in the same spot. I am running qwen.2.5-coder.32B locally on an RTX4090 using llama.cpp and getting 30 tokens a second. I set this up because I was tired of using up Claude's tokens on one of my projects.
TL/DR.
The more helpful LLMs (Claude, etc.) that are really effective won't be affordable in the long term. The companies providing them are running at a loss, there is an AI hype bubble, and we already understand that they are unaffordable and problematic to depend upon.
I think understanding how to work with local models is viable and it will come back to being more hands on.
So, I'm more or less going to nix Claude from a lot of its activity in my codebase and learn to work with open source models that I can host myself. It won't be as productive as using something like Opus, but it will be viable for the long term and relevant for the job market.
unchikuso@reddit
tldr. but did you try pi? pi fixes things
wasnt_in_the_hot_tub@reddit
Right? Pi is the main reason coding with local models works for me. Before trying pi, I was essentially building my own (shittier) harness and trying to keep it super minimal to work with smaller models. Then I found pi, realized it was perfect, and stopped my other harness project.
I've seen people here using local models with small context windows, then complaining they can't code with opencode, not realizing it eats like 10-12k tokens on init.
There are other things to consider with smaller, local models. But it mostly boils down to making tasks smaller in scope... go figure!
I prefer to compare coding with local models to coding without AI, rather than comparing with cloud-hosted frontier models.
theUmo@reddit
What models are you using with Pi?
dtdisapointingresult@reddit (OP)
Briefly, I was just starting to use it. I haven't gotten far enough to discover and create extensions. The vanilla setup is not usable, it would also read the output of 'docker build' in the main context. I know there's extensions to teach it to run subagents, I just didn't get that far.
PP9284@reddit
I believe this subreddit lacks best practices regarding model deployment and its practical application in dev; that may be a promising topic to explore.
lnsip9reg@reddit
👋
fredandlunchbox@reddit
I'm currently trying a hybrid approach.
Claude to plan, then it uses a skill to implement the code with Qwen-27B. Saves tokens on writing code.
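Roughly this shape, as a hedged sketch (the model ids, endpoint, and prompts are placeholders, not the actual skill):

```python
# Hedged sketch of the plan-with-cloud / implement-with-local split described above.
# Endpoints, model ids, and prompts are placeholders, not the actual Claude Code skill.
from openai import OpenAI

cloud = OpenAI()  # stand-in for whatever hosted model does the planning (Claude above)
local = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

task = "Add a /healthz endpoint to the Flask app in app.py"

plan = cloud.chat.completions.create(
    model="gpt-4o",  # placeholder planner model
    messages=[{"role": "user", "content": f"Write a short, numbered implementation plan for: {task}"}],
).choices[0].message.content

code = local.chat.completions.create(
    model="qwen-27b",  # placeholder local implementer
    messages=[
        {"role": "system", "content": "Implement exactly the plan you are given. Output only code."},
        {"role": "user", "content": f"Task: {task}\n\nPlan:\n{plan}"},
    ],
).choices[0].message.content

print(code)
```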
spaceman_@reddit
I felt like this before, but with Qwen 3.6 for me it has honestly been a non issue for how I use it. ("look at this issue, explore and plan" -> "write a test or test suite that covers the issue" -> "fix or implement the issue")
They're not on the level of Kimi or GLM, but in my daily use, they are more than good enough for 90-95% of the issues.
Due_Duck_8472@reddit
But but but ... all the autistic people here mortgaging their parents' houses to buy 6-figure rigs to code "hello world" say it's working ...
Who to believe?!
whatever462672@reddit
Honestly, the issue is with how Claude Code works. I was having these issues using Roo Code until I looked at exactly what it sends to the model. There is too much natural language and dependence on model knowledge. The harness sends a wall of text and nukes its own context.
I get decent results with GEMMA-4 with Copilot Chat and context-engineered instructions.
sarcasmguy1@reddit
I've been tinkering with qwen3.6 recently, and have got it to a place where I can use it for most coding tasks, so I thought I'd share my experience.
Note - I still use GPT5.5 and mini for bigger projects (Monorepo or similar), and generally use mini for 'work' tasks as the quality is higher. Qwen has been great for side-projects though.
I run it on a RX 7800 XT, with many MoE layers pushed to the CPU. This allows me to fit almost all GPU layers into VRAM. I get around 30t/s. Prompt processing is really fast as long as I keep context small (68k). I have 32gb of system RAM, and a Ryzen 5 7600.
My workflow is:
1. Plan with 5.5 or mini, depending on the task. Mini for features, GPT5.5 for new projects. I get them to write plan files.
2. Give it to Qwen 3.6 to implement
3. Get mini to validate it
I use pi via the littlecoder harness.
On quality: it feels good in TypeScript. This entire repo has been written by Qwen3.6 locally, with 5.5 plans. In less popular languages (like Clojure), it's pretty bad: slow, and it hallucinates a lot. Language choice is important.
On speed: pretty good. It took a lot of experimentation to get here though. littlecoder helped quite a bit, and switching to Ubuntu made a big difference (I was on Windows previously). I ran it all through LM Studio; I haven't got to the part where I tinker with llama.cpp directly. It's not nearly as fast as, say, GPT mini, but it's good enough.
The main advantage is infinite tokens. They feel amazing, even if they're slower. It really pushes the bar for experimentation imo. However I would not replace my primary workflow with local hardware.
Some issues:
1. Thinking loops are a pain. I've got them to happen less frequently by following the recommended inference settings by the Qwen team, but they still happen. It makes me feel like I need to babysit the model which can be annoying depending on what I'm doing.
2. Small context window. This is an issue with my hardware, not the model at all, but I thought I'd call it out. Auto-compaction kicks in pretty quickly, which can sometimes interrupt the model.
3. Tool calling proactivity. In GPT, the model is really good at knowing when to call a tool. If it encounters issues (like compilation errors or bad types), it will use a variety of CLI calls to get to the solution faster. Qwen doesn't do this; it tends instead to grep every possible line of code and then come up with a solution. This is much slower.
4. Greenfield tasks (e.g. "Add this feature") are still quite bad. It often comes to a really strange conclusion on how to implement a feature. This could be an AGENTS.md or context issue, so I'm not putting this on the model. For example, adding async model loading in the lmstudio extension took a long time and it did some really weird stuff. GPT mini ripped through it, and was proactive about reading docs to find the solution.
aniket_afk@reddit
Yo OP. Which quantized model are you using?
Otherwise_Berry3170@reddit
Like everything else, it depends. For example, if you were talking about Claude a month ago, I would say yeah, it was pretty good. Now? Not so much: they watered down the models and recently came out and called it a bug because we complained. The prices change, the limits changed. So while I agree local models are not as good, with training and good agents/skills I can do almost the same with qwen3.6 35b as with Claude Sonnet. Qwen3.6 27b is better, but on my hardware (a GB10 Blackwell) it's a bit slow, so I use it for text only. It took a bit to get the agents right, and they still sometimes don't work as expected, but it's pretty OK with my setup. And on the cost side, just last week I would have spent $2k on Claude API calls. So yeah, I agree: not perfect, but not so terrible that you cannot work with it.
_mayuk@reddit
Skill issues xd
I managed to use Gemma4 E2b to digest some JSON payloads and check correlations in a vector DB with some bucks that I have, etc. Multiple agents (not running simultaneously), but each one takes care of some aspect of digesting the JSON payloads, in a way that maybe only a Deep Research conversation or a notebook would be able to handle hehe…
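For anyone curious, a hedged, stripped-down sketch of that kind of digestion pipeline (file names, endpoint, and model id are invented):

```python
# Hedged, stripped-down sketch of this kind of pipeline: digest a JSON payload,
# embed it through a local embeddings endpoint, and compare it against vectors
# stored earlier. File names, the endpoint, and the model id are invented.
import json
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def embed(text: str) -> np.ndarray:
    # Assumes the local server exposes an OpenAI-style /v1/embeddings route.
    emb = client.embeddings.create(model="local-embedder", input=text).data[0].embedding
    return np.array(emb)

# One agent's slice of the work: flatten a payload field by field, then embed it.
with open("payload.json") as f:
    payload = json.load(f)
summary = "\n".join(f"{k}: {v}" for k, v in payload.items())
vec = embed(summary)

# Compare against previously stored vectors (the "vector db" here is just a .npy file).
stored = np.load("stored_vectors.npy")  # shape (n, dim)
scores = stored @ vec / (np.linalg.norm(stored, axis=1) * np.linalg.norm(vec))
print("closest existing record:", int(scores.argmax()), float(scores.max()))
```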
_mayuk@reddit
Of course, you have to do much of the programming yourself, or use another AI to create the Python scripts that digest the files and handle the DB…
The vanilla open claw setup is not enough… I'm trying to integrate Obsidian and/or Graphiti…
I hate how much marketing there is around this kind of stuff nowadays, because there are many forks or repackages of the same thing with fancy names… But like people have been saying, if you are a coder yourself, all of these are amazing tools… even non-vibe-coding can actually be sustainable if everything is running on API calls lol…
Maybe you can use the paid LLMs to help you set up proper agents, with proper memory handling, for a given task… but then you'd have multiple different agents, or have to stay on top of organizing all these subroutines, let's call them that xd
Idk, I'm not well versed in the proper terms for all this because, despite having been programming since I was 12, I'm mostly self-taught… :v
toothpastespiders@reddit
I'm a huge proponent of local models. I've spent a godawful amount of time carefully curating datasets and training along with setting up rag systems, testing, etc etc.
And I don't use local models for coding. I'd love to but I've never really been satisfied with anything I can actually run. Qwen's 480b model would be great if I had the hardware. But I don't. So I use it through their api. I console myself that it 'could' be local for me, one day. In a way that's both non-lobotomized and fast. But at the end of the day that's just cope.
poobear_74@reddit
OP, you might be bumping into tool calling issues since the models you reference are very new. Qwen 27B was only released a week or two ago, and there simply hasn't been enough time for the developer community to patch vLLM and other software to work well with them.
Majinsei@reddit
Coding... I don't have the hardware for that~
Building agent pipelines and processes, then yes~
I usually leave it processing for hours (almost days) without being afraid of running a ton of tests~
I consume tokens like you wouldn't believe~ so automating something locally without my credit card suffering is glorious~ I've easily already burned, in tokens, almost half the price of a mid-range GPU~
I generally use it for batch jobs~ coding, nah~ with luck I use Antigravity and get on with the rest~
RedParaglider@reddit
Local models are not really for vibe coding if you want to code with them. They are for pair programming. These are 27b models, you simply aren't going to get the same performance as 1500b models.
I personally do not use local LLMs for coding tasks outside of simple scripts or command line questions without a session. I use them for testing agentic business workflows, and for those they are great.
simracerman@reddit
To be brutally honest, I haven't coded by hand in years and it would likely take me a year to get back into my original shape. Yet with the same model you used at Q4 quant + OpenCode and a few days' worth of sessions, I was able to get a fully featured budgeting app built from scratch.
Local LLMs are not cookie-cutter solutions yet. They're more like a clay sculpture: at the beginning you can't even hold the clay together, but after kneading and tweaking you will slowly overcome issues and start producing good results. Remember, this isn't cloud AI where an army of sysadmins and devs is working non-stop behind the scenes to make your experience better.
_hephaestus@reddit
I wish I could disagree with you but it’s a messy world. I do think there might be something to having something like litellm with langfuse in between the harness and the provider for debugging, but that works the same whether you’re using local or externally hosted. Part of the problem is the speed things are moving and the lack of cohesiveness, all the big players have their harnesses and ship their models with them, meanwhile there’s still unmerged llama.cpp/oMLX stuff supposed to make qwen3.6 understand tools the way it’s hyped to.
Interesting-Yellow-4@reddit
Yeah that's been my experience as well. Local has a long way to go
ResearcherFantastic7@reddit
For people who want a 1-to-1 replacement for cloud models... you are just wasting your time.
Why would you pit a single parrot brain against a cluster of elephant brains?
The way you should look at local models is that they are there to do a very specific task (not to excel at it, but to produce an acceptable result).
knownboyofno@reddit
I'm interested in which repo you asked it to do this with. Could you post the link? I want to test this too, because it would be a good test. I have had problems like this as well: I thought a task was easy, but it failed quickly.
I had a different problem. I gave it a range in an Excel sheet that was saved from a Google Sheet. I had it recreate those calculations, then use that file only as a "database". That took an hour in Claude Code; then I downloaded the data into a CSV for each data source (this was something I had done before). These functions retrieve the updated data, which is fed directly into the model. I then had it use those functions, but gave it example files to test on before wasting credits. It was able to correctly recreate a 30-sheet Excel file, with formulas involving hookups, lookups, INDEX/MATCH, SUMIF, cross products, negative binomial distributions, etc., into a Python dataframe using pandas. I have done this before with other files manually, and it took me 25+ hours to trace the formulas and get the correct data sources.
I used Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf with llama.cpp (out of laziness, because I have a vLLM setup), and it had full context. This was in Claude Code without any skills or anything extra, but I did turn off a few headers sent by Claude Code. I did ask it to create a Python environment to run what it needed. It asked a few questions, but I didn't have to micromanage it.
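To give a flavor of what that formula-to-pandas translation looks like, a hedged toy example (the columns and formulas below are invented, not from the actual workbook):

```python
# Toy sketch of translating spreadsheet formulas into pandas, in the spirit of the
# workbook conversion described above. Column names and formulas are invented,
# not taken from the actual file.
import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "US", "EU", "APAC"],
    "amount": [100.0, 250.0, 75.0, 30.0],
})
prices = pd.DataFrame({
    "sku": ["A", "B", "C"],
    "price": [9.99, 19.99, 4.99],
})

# =SUMIF(region_range, "EU", amount_range)
eu_total = sales.loc[sales["region"] == "EU", "amount"].sum()

# =INDEX(price_range, MATCH("B", sku_range, 0))
price_b = prices.set_index("sku").loc["B", "price"]

# =SUMPRODUCT(amount_range, weight_range) style cross product
weights = pd.Series([0.5, 0.2, 0.2, 0.1])
weighted = (sales["amount"] * weights).sum()

print(eu_total, price_b, weighted)
```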
cleversmoke@reddit
I use sota models for high level plan, strategic plan, architecture plan, and feature implementation plans. Then I use local Qwen3.6-35B-A3B + DeepSeek-R1-Distill-Qwen-14B as an agentic coding pair to build one feature at a time.
It's going well, but it's more involved than just "build me an app". For anything that Qwen fails at, I just fall back to a sota model.
ascendant23@reddit
Expecting a 27B OSS model to hold up head to head with the latest Opus / GPT is just wild. It's like trying to replace an 18-wheeler truck with a ford pickup truck. If your workload requires an 18-wheeler, the pickup truck is never going to come close to meeting that need.
Doesn't mean pickup trucks are bad, just means, don't expect them to do things they can't.
StardockEngineer@reddit
Claude Code has a parameter you need to set to prevent it from junking the KV cache. I forget what it is but maybe you can search for it.
InKentWeTrust@reddit
Do you use recursive reasoning on your locals? It takes longer to process but it produces much better results
dtdisapointingresult@reddit (OP)
idk what that means so I guess the answer is no, I don't.
biotech997@reddit
I tend to agree, from my experience with small-medium (<27B) models, they just aren't that smart. For regular questions sure, but things like scaffolding a codebase and actually providing relevant code is still very far away. Even boilerplate code I feel like it's evidently lower quality. Obviously I don't have 2x5090s, so YMMV.
Over-Advertising2191@reddit
I think the best approach is to use these mid-sized LLMs as assistants that understand your codebase and occasionally help you with a specific function or specific errors. Anything broader and the experience is the same as op's
NNN_Throwaway2@reddit
LLMs can't generalize, if they haven't been trained extensively on a task they will face-plant. This is especially true on smaller models where you don't have a large body of world knowledge to lean on.
LLMs in general, but especially the small ones, are getting increasingly specialized on agentic coding. I suspect that building and spinning up a container falls just far enough outside of what it was trained on that it doesn't know how to apply the basic problem-solving it was certainly trained on in other areas.
But yeah, people are going to get upset if you say the latest OSS darling isn't the bee's knees and a huge game-changer that rivals Claude Opus.
InnovativeBureaucrat@reddit
I found even the advanced models to be terrible at any devops tasks until recently. They were pretty good at Python code, and really bad at even the simplest tasks involving containers, network questions, or really anything not code.
Maybe it’s just me.
VertigoOne1@reddit
They’re pretty solid on Kubernetes, Terraform, Argo, GitHub Actions, you know, stuff that has a long history and strong community representation. Not bad on PowerShell either, but it has to work through the bugs for a few rounds. If you mean DevOps as in anything outside the above, I’m not sure. Regardless, it is important to have good structure and docs. A lot of DevOps repos are just a mess, which gives the LLMs amnesia.
Long_comment_san@reddit
Try fidgeting with sampler settings maybe
Photoperiod@reddit
I say this as somebody who runs local models and deploys OSS models to on prem datacenter gpus as part of my job. Ultimately, local AI is a niche hobby that is not really cost effective. The only major reason I can see doing it beyond hobby tinkering/learning is for a fully secure/private stack. If you absolutely cannot let your data go to the cloud then you need to go local.
There are enterprises that absolutely will not send their sensitive data to cloud model providers like Claude or gpt. Those enterprises are using OSS models deployed on their own compute for coding, analysis, etc. But yeah, if you're in the clear for data privacy and whatnot, then it makes more sense to use cloud providers.
maz_net_au@reddit
All of this (and the comments) is how I feel about using Opus as well. LLMs are fun but they're just so dumb.
One-Replacement-37@reddit
Cool story, bye!
Different-Rush-2358@reddit
Look, the problem that many of you are facing—and honestly, it’s not even your fault—is that they’ve overhyped local models way too much without explaining how the "local hype" actually works. As of today, the only local models that are truly good for general purposes, including coding, are Gemma, Qwen, and DeepSeek. Forget about those weird variants or random "labs" that pop up out of nowhere; most of them are just distillations of models that were already distilled before.
Then there’s the whole quantization topic, which has advanced quite a bit. For example, Unsloth’s UDS gives you very decent precision and they fit into any consumer PC (depending on the parameters, obviously, haha).
And then you have the "Blackwell sect" and their "high precision or nothing" mantra (which smells a bit like a sponsorship from NVIDIA or some massive GPU distributor, but whatever). They’ll tell you that if you don’t have 17 Blackwells, 900TB of RAM, and a quantum computer, you can’t run anything. That is a total scam. Anyone with 15 minutes of spare time can figure out the commands to squeeze the most out of their hardware. You can run models very comfortably on hardware from several years ago.
(Example: Qwen 2.5 72B at 10-20 tk/s on a Xeon 2680v4, 32GB of RAM, and a GTX 1070 with 32k context, thanks to turboquant flags).
So, in short: believe only half of what you read here, and take the other half with a grain of salt. You don't need a $30,000 rig to run a 280B model, for instance; you just need to know how to use the correct flags in llama.cpp and have a balanced setup.
Sorry for the wall of text, but I saw this post and took the chance to get this off my chest; it's been on my mind for months. And I know I might get downvoted to hell for this... but I don't give a damn. It's about time we debunked the myth.
Such_Advantage_6949@reddit
Reality is you need at least a 200B local model like MiniMax to get serious stuff done. Small models have made progress, but they will break as soon as you throw serious stuff at them.
dev_all_the_ops@reddit
Thanks for sharing. I've been obsessed with getting started in this, but I worried I would just be wasting my time.
I still like local models for security and to fight against subscription bloat, but it's good to know that they're just not as good as paying a major player.
cocoa_coffee_beans@reddit
Yes, local models fall short for coding.
That’s not all. The ecosystem is quite fragmented:
OpenCode is broken with vLLM ever since vLLM deprecated the reasoning_content field in favor of reasoning.
Open WebUI still handles reasoning like it's early 2025.
Vendor-specific tools such as Codex and Claude Code constantly break against local inference even if you point them at their respective APIs, because the vendors are constantly iterating their clients.
If you’re not deeply entrenched in the specifics, you won’t squeeze the performance you need for coding. For most people, it simply isn’t worth it.
Noiselexer@reddit
I've never considered local llm for coding..
alexthecat999@reddit
Is it good for small tasks and boilerplate code? Just to bridge my lack of syntax knowledge with a new language?
ryfromoz@reddit
Yes, and it's better if you can actually prompt 😁
SnooPaintings8639@reddit
Yup, it's a bit overhyped here. I mean, if it wasn't open weight, I assume it would be very rarely used anywhere.
Having said that, it is capable, and with proper care it CAN replace Sonnet or others in many clearly defined, coding-heavy tasks. It just needs a bit more care from you and/or a larger model on top.
So, if all you care about is price, speed, and quality, probably stick with APIs. But if you have a reason to go local, this model CAN do it.
kevin_1994@reddit
Works fine for me but I don't delegate all my thinking to a machine
Kitchen-Patience8176@reddit
Honestly, not worth it unless privacy is a major priority for you. I just use the $20 ChatGPT subscription (includes Codex) and use it in the terminal for Docker and general sysadmin stuff.
gffcdddc@reddit
You need to use a high-parameter MoE model, then use a fast GPU and offload the experts to the CPU.
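As a hedged sketch of what offloading the experts can look like with llama.cpp (flag names are my understanding of current builds; double-check against llama-server --help):

```python
# Hedged sketch: launch llama-server with MoE expert tensors kept in system RAM while
# the dense/attention layers go to the GPU. Flag names reflect my understanding of
# current llama.cpp builds (verify against `llama-server --help`); the model path is
# a placeholder.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "models/qwen3.6-moe-q4.gguf",           # placeholder model path
    "-ngl", "99",                                  # offload all layers to the GPU...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...but keep the expert tensors on the CPU
    "-c", "32768",                                 # context length
])
```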
Ok_Librarian_7841@reddit
Yes, unless you can host Kimi 2.6 or minimax 2.7, local coding is trash
matt-k-wong@reddit
Even though I use local models a lot, life is better with Opus.
tomByrer@reddit
Takes a bunch of homework &/or beefy GPU power & VRAM to get LocalLLM worth it.
Seems you have neither.
c64z86@reddit
Local models are pretty good to play around with, and some can be good even at storytelling or roleplay, but subs like this one hype them up way too much. Local AI is pretty revolutionary in itself because it sets us free from corpos who dumb down their models while charging us more for the pleasure, and I will say that and spread it around all day long. But I'd be lying if I said they were better than something like Claude Opus.
Own-Refrigerator7804@reddit
This was written by a local or online model?
dtdisapointingresult@reddit (OP)
Neither, you regard.
Terminator857@reddit
Strix halo qwen 3.5 122b q4 working well for me on simple stuff. Yes very slow, but works.
blargh4@reddit
They’re fun to mess with, and they have legitimate practical uses when you don’t want to burn money on properly scoped tasks they can do fine… just use them with appropriate expectations.
TheAncientOnce@reddit
Caught this post too early. Would love to come back and see what people say..