I'm done with using local LLMs for coding
Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 807 comments
I think gave it a fair shot over the past few weeks, forcing myself to use local models for non-work tech asks. I use Claude Code at my job so that's what I'm comparing to.
I used Qwen 27B and Gemma 4 31B, these are considered the best local models under the multi-hundred LLMs. I also tried multiple agentic apps. My verdict is that the loss of productivity is not worth it the advantages.
I'll give a brief overview of my main issues.
Shitty decision-making and tool-calls
This is a big one. Claude seems to read my mind in most cases, but Qwen 27B makes me give it the Carlo Ancelotti eyebrow more often than not. The LLM just isn't proceeding how I would proceed.
I was mainly using local LLMs for OS/Docker tasks. Is this considered much harder than coding or something?
To give an example, tasks like "Here's a Github repo, I want you to Dockerize it." I'd expect any dummy to follow the README's instructions and execute them.
Issues like having a 'docker build' that takes longer than the default timeout, which sends them on unrelated follow-ups (as if the task failed), instead of checking if it's still running. I had Qwen try to repeat the installation commands on the host (also Ubuntu) to see what happens. It started assuming "it must have failed because of torchcodec" just like that, pulling this entirely out of its ass, instead of checking output.
I tried to meet the models half-way. Having this in AGENTS.md: "If you run a Docker build command, or any other command that you think will have a lot of debug output, then do the following: 1. run it in a subagent, so we don't pollute the main context, 2. pipe the output to a temporary file, so we can refer to it later using tail and grep." And yet twice in a row I came back to a broken session with 250k input tokens because the LLM is reading all the output of 'docker build' or 'docker compose up'.
I know there's huge AGENTS.md that treat the LLM like a programmable robot, giving it long elaborate protocols because they don't expect to have decent self-guidance, I didn't try those tbh. And tbh none of them go into details like not reading the output of 'docker build'. I stuck to the default prompts of the agentic apps I used, + a few guidelines in my AGENTS.md.
Performance
Not only are the LLMs slow, but no matter which app I'm using, the prompt cache frequently seems to break. Translation: long pauses where nothing seems to happen.
For Claude Code specifically, this is made worse by the fact that it doesn't print the LLM's output to the user. It's one of the reasons I often preferred Qwen Code. It's very frustrating when not only is the outcome looking bad, but I'm not getting rapid feedback.
I'm not learning anything
Other than changing the URL of the Chat Completions server, there's no difference between using a local LLM and a cloud one, just more grief.
There's definitely experienced to be gained learning how to prompt an LLM. But I think coding tasks are just too hard for the small ones, it's like playing a game on Hardcore. I'm looking for a sweetspot in learning curve and this is just not worth it.
What now
For my coding and OS stuff, I'm gonna put some money on OpenRouter and exclusively use big boys like Kimi. If one model pisses me off, move on to the next one. If I find a favorite, I'll sign up to its yearly plan to save money.
I'll still use small local models for automation, basic research, and language tasks. I've had fun writing basic automation skills/bots that run stuff on my PC, and these will always be useful.
I also love using local LLMs for writing or text games. Speed isn't an issue there, the prompt cache's always being hit. Technically you could also use a cloud model for this too, but you'd be paying out the ass because after a while each new turn is sending like 100k tokens.
Thanks for reading my blog.
PeerlessYeeter@reddit
op's experience somewhat matches mine, I keep assuming I'm doing something wrong but I think this subreddit gave me some unrealistic expectations
Hans-Wermhatt@reddit
The people here overhype Qwen 3.6 for sure, but I don't know what to tell the people who were expecting to just flip over from Opus 4.7 4 Trillion to Qwen 27B and expect the same performance. You'd have to run GLM 5.1 for something a little closer. Qwen 3.6 27B is more like GPT 5.3 mini.
FaceDeer@reddit
I would think that local models like Qwen3.6 would be well suited to replacing remote LLMs for things like auto-complete, filling out a local function or writing docstrings. Not so much the large-scale system architecture stuff. I could see a framework that optimizes which tokens get sent where, using the big remote models to plan out what to do and then delegating implementation tasks to local models. Might be a best of both worlds arrangement.
xamboozi@reddit
Uhhhh is a little better than autocomplete lol
FaceDeer@reddit
These days "autocomplete" is more than just finishing the word that you're currently typing. It's "the user just typed the function name sort_graph_paths in a context where it looks like there's special handling for heavy graph paths, I'll write a function that sorts graph paths by weight and insert that."
Nixellion@reddit
To chip in, GLM 5.1 truly is capable of replacing Opus 4.6. I am running the z.ai api version, I assume it runs unquantized, so local performance may degrade, but overall it works well across various complex large codebases.
Monkey_1505@reddit
Why GLM? K2 and MiMo Pro both beat it on aggregate benchmarks. Is it good at coding but worse at everything else?
Nixellion@reddit
I tried Kimi and it was way more unstable and erratic than GLM. Did not try MiMo.
Also z.ai (GLM) has a convenient coding plan.
jiml78@reddit
Agree, i have access to opus and GLM 5.1(ollama cloud). I use them to review each other. They are always catching things the other didn't think of.
Caffdy@reddit
you're not using any harness?
Tank_Gloomy@reddit
I'm wondering... are you working exclusively with Javascript-based software? Because this definitely isn't my experience.
Nixellion@reddit
No, python in a relatively niche use case and c#
Tank_Gloomy@reddit
Ah, well... yeah, Python is still pretty well known. C# maybe not but its knowledge about C++ and Java is probably close enough to work with that. My workflow is quite closely tied to Dockerfiles, SNMP calls and PHP with and without Laravel, and it becomes absolutely stupid with that.
Void-kun@reddit
What harness are you using GLM5.1 in?
In Claude Code it's significantly worse than Sonnet 4.6 nevermind Opus.
Nixellion@reddit
I have great experience with the updated kilo code. OpenCode also seems fine but I did not use it for any seriois coding work.
GCoderDCoder@reddit
Anthropic models are meant to stay functional with their harness. Other models arent designed for their harness behavior. I see a gap between claude in cursor's behavior vs claude in claud code and the benchmarks back that up. The reason they keep using their harness is because it is subtly designed to embed you into it. Anecdotally I think people who dislike it most are people who also use other tools as well and experience clashes any time they try to mobe out of claude code.
So then I always wonder on the flip side how much of the friction people experience comparing other models to claude is because of how they have grown accustomed to using claude.
HappySl4ppyXx@reddit
How are the limits and are there a lot of rate limiting / technical issues you run into?
Nixellion@reddit
I use it throughought the day running 1-3 parallel instances of Kilo Code and had no issues (except for kilos new agent delegation which sometimes get stuck, but it happens on opus too), and never hit any limits.
A few times I hit rate limits, but kilo typically waita a bit and retries and it keeps going.
I mostly used Opus through antigravity and limits there are atrocious nowadays. But even with claude code I'd hit limits way more often than with glm.
Kholtien@reddit
Ok the lite plan, I get 2-3 times what I do on claude pro
GCoderDCoder@reddit
Over hype? I'm going to sound defensive but I genuinely think people hype claude from lack of exposure to other models and other harneses. The content creators who actually try different things tend to recognize opus has great ability but often use other models for their own work. And nobody is saying a 30b parameter model can do everything claude can do. People are saying most of what they need a model to do can be done with self hosted models.
For local 3.6 what hardware are you using? What quant are you using? What harness are you using? How are you using your harness? Claude has those tuned for a certain user profile. You have to do those for local too before comparing.
People using q4 of a 30b model to code are not actually using the model that the benchmarks are made on. Models can keep agentic logic sound longer than they can maintain the same level of coding performance. So a 30b parameter model can search the internet, manage emails, etc down to q4 but I would not write code with that version.
Claude the model is different from claude the harness. I had opus in cursor for work just fine so i tried claude for my personal and Anthropic's harness makes me hate their models because I don't just let llms do their own thing. I use them to fill in the boiler plate for my logic. The way I use models I can swap claude, chat gpt, large local models (i have hardware) and now small local models like qwen 3.6 too. My friend who doesn't code loves claude code because he doesn't care about the how. He's also not using what he builds for production.
Most people don't actually need claude and the data is showing there's a lot of people enjoying AI activity not getting real value. If value is just making a lot of docs then people are really hyped making docs no one looks at lol.
rsatrioadi@reddit
Would you mind to share/at least give me some pointers to preparing this harness? I’m not using local models btw.
GCoderDCoder@reddit
I predominately use local but I think the principles translate to cloud. I have chatgpt and anthropic subs usually the small $20 ones but I canceled claude this month.
Models are the biggest factor for sure and for local you need to consider quantization and model size/ class. Higher quants of decent sized models tend to perform more syntactically accurate than lower quants. Bigger models manage concepts and logic better but I cant fit those quants. I will usually have a bigger model like qwen 397b at a lower quant make my plans and sometimes make a skeleton of the project because the higher parameter count gives it more space to manage the concepts and ideas. Then I might have a smaller higher quant model like qwen3.5 27b at q8 do a pass to do the coding and fix bugs. Qwen 3.6 27b and even 35b have made better plans for me than 397b lately and Opus and chatgpt agreed when I had them evaluate the code anonymously so that's why my for is kind of flipped right now because of the progress of these models lately.
I must also add this tangent that comparing non-thinking chatgpt and gemini fast to local models, chat gpt and opus preferred gemma 4, qwen (various 3.5/3.6models), glm, minimax m2.7, then gemini fast and chat gpt non-thinking last. Benchmarks align as well that good self hosted reasoning models beat non-thinking frontier models. Of course you dont use non thinking models for coding in the cloud but it gives you an idea of the capability of these models.
Next is harness... I like roo code because it gives me lots of knobs for my models. I can assign different roles to different models based on their strengths and define my compaction behavior. You can add skills and MCPs configure different tools etc. Hold this tools idea because it's important.
Next is operating workflows. I talked about it a little but having different roles and different models for each role is important. If you make a model review its own work it's going to love it. Claude likes models that write more like it even if a different model like qwen gets the right answer. All the qwen models follow Qwen's way of laying out info and claude thinks it's unrefined and often rude lol.
For me right now I'm a little messed up because my designs with these new potent small models are needing modification from what they were historically but my typical config is something like a glm 4.7 or qwen 397b for my planning (but can't lie I have been using qwen 3.6 27b lately). This is where I often use cloud models too but hinestly more in my brainstorming. I kike chatgpt for brainstorming. It hallucinates more according to emchmarks but I have figured out solutions with chatgot that other models said wasnt possible so pros and cons to all these characteristics...
Sorry back on track: then I will use something like qwen3.6 35b q8kxl as an orchestrator. It does agentic work well so managing sub agents is it's thing and it's easier to push. Then I started using gemma4 31b q8kxl as my coder because it is good at agentic coding but it's not a great agent for other things. I have qwen 3.6 27b set as my debugger right now but I would raise to minimax m2.7 if needed but I havent needed that yet. To tell the truth pairing qwen 3.6 27b or even 35b with gemma4 31b has been really complimentary. I feel like they fill in gaps for eachother and I have the hardware to run them at decent speeds.
In roo each of these agent modes (planner, orchestrator, debugger, coder, cutom) can have customizations in how you want things to work. I also use tools like mcp and skills.
Skill example: I had claude and chatgpt make 2 versions of a work presentation. Claude did better at sticking to my template. That was because it used my template and built on top of it where as chat gpt tried to mimic the template itself. I told chat gpt to create a skill to duplicate the template slides, replace the content, then verify placement and wouldnt you know it, the output looked the same. So skills tell agents how to do specific things. I have skills for managing my worktrees, interacting with authentication systems in my labs, notifying me, managing my kanban boards and tickets, etc.
Lastly memory systems. I have several. Coding is about 25% of what I use models for. But there are different types of knowledge. A memory system for navigating a code base is different from searching all the chats around a project or searching all the tickets etc. Each level of memory needs a different solution you give the agents access to. And you want the agent to have a map of what's at it's disposal without wasting context when it doesnt need it. So I have smaller index files for a platter of options for the models and then skills the model can choose allow them to learn in the moment how to interact with those tools in my environment without dumping all the rules on them at once.
Models>harness>operating rules>tool integration
Most of all you need to use your brain to think about what you are doing and how to do it. Get multiple perspectives. Never take the first thing a model gives you. Challenge as many ideas as possible. Evaluate what will happen next. The reality is everyone wants to move fast but even claide hits a wall if you dont manage it.
Example: Up through opus 4.6 I had a little personal app idea that I let claude just drive without me stearing it. I made a real spec my way with chatgpt and just told claude to keep iterating until it's finished. There eventually wast a button claude could not figure out how to fix. I started in 4.5 then tried 4.6 but still couldnt. There were a thousand files and I had no idea how it built that and neither did Opus lolol. I didnt test 4.7 but my point is that is not how you go to production but it felt great seeing new features until it fell apart. I did not do my normal commits along the way and refining of the code and organization and evaluating options etc. I just confirmed of it was working or not then said continue.
Likewise the big projects they have been claiming these models completed by themselves all have holes in them when you review them. LLMs are architected for tasks not ongoing streams of logic. A task can be making a plan but they are not designed to do a job. Im not saying get in the model's way, im saying if you are not feeling you are bringing something to the table you are probably not going to get to any level of shippable product.
rsatrioadi@reddit
Much appreciated! I believe other people will appreciate the local model part.
When I moved from ChatGPT to Claude I was impressed by how it’s taking internal turns for completing a complex request and how faithful it follows such instructions, which I believe is at least partially achieved by taking internal turns. I was thus wondering if there is any bring-your-own-model app that approximates Claude’s harness, but not necessarily for coding tasks.
I’ll check out Roo Code and see how it works.
GCoderDCoder@reddit
It's basically being discontinued because people like me who are probably most of their users don't help them make money :(
rsatrioadi@reddit
:(
whiskeyjack555@reddit
Well, I appreciate your write up.
ildefonso_camargo@reddit
Thanks!
_bones__@reddit
I'm using a Qwen 3.6 Q3 at home, and it works fairly well at 40t/s for coding in a fairly small project on limited tasks. I wouldn't expect it to do well if given a huge amount of work to coordinate.
I'm only on 12GB VRAM, so I'm limited in capability there.
I do have it plan a feature or change, and then tweak the stupid assumptions it's made until they are sound, and then have it execute that plan.
YMMV obviously, and it's not an Opus replacement. If that's your benchmark and expectation, it's not going to perform.
GCoderDCoder@reddit
Agreed but I can say q4 vs q8 for qwen 3.6 27b/35b are very different. In a harness where the model is told not to do all these d@mn emojis and is given a persona I think most people would have a hard time distinquishing Claude models on most of their tasks. Code is a unique differentiator. Tool calling is logic pseudo code. Real code has lot's of particulars and that's where higher quants and better models really shine. A model can be useful for lots of things without being a great coder and many people are judging these models in the versions/ quants that aren't good for coding.
This science of building scaffolding around a model is what a ton of millionaire developers are doing for openai and Anthropic. We cant just connect a 30b q4 model to lm studio and get claude code output. But we can get isht done with local models if we commit and value the sovereignty enough. When anthropic changes a model i don't get pissed because i don't build on a foundation that can be taken from me at any moment. Cloud is the icing on the cake for me
QuinQuix@reddit
Is it notably better on something like an rtx 6000 pro?
GCoderDCoder@reddit
Definitely. I focused on unified memory for sparse models before qwen 3.5 27b and gemma 4 31b. These dense models really prefer cuda. Mac studio i get around 30t/s for these dense models, macbook m5 max I get 15t/s at the start. 5090 I get 50t/s using gguf. I cant really fit fp8 well and dont want to go down to q4 for vllm. I'd expect the 600watt rtx pro 6000 blackwell to be a lot faster partially because of vram and extra cuda magic but really because you can fit fp8 with the full context that way.
Zestyclose839@reddit
What I prefer about using local models is that it forces you to be much more involved with the process. Claude is way too trigger happy to just build the thing without my input, and inevitably it ends up creating something half-baked and illogical because it's not seeing the bigger picture. Using local models force you to slow down and rigorously consider every design decision, which ultimately makes you a better software architect IMO.
Finanzamt_Endgegner@reddit
Well idk about you but I let qwen3.6 27 go into llama.cpp to implement a feature which it had to change like 10 files for and it just did that. Was for testing out some new method so don't worry I'm not gonna spam the devs with it but it works. I highly doubt gpt 5.3 mini would be anywhere near this level.
TheLexoPlexx@reddit
But that's the thing. Gemma4 31b is in LMArena remarkably close to the GLM-Models or Kimi across all benchmarks and on top of that, Composer is based on Kimi and that sucks too.
Monkey_1505@reddit
Mo., it's not remarkably close in benchmarks.
TheLexoPlexx@reddit
Care to explain?
Monkey_1505@reddit
Well, I'm not sure numbers not being close really needs explaining, but, on the artificial intelligence benchmark aggregate (which is just an aggregate of benchmark scores), Gemma 31b is a 39. Kimi v2.6 is a 54. Opus is a 57. Kimi v2.6 is far far closer to the benchmarks of Opus than Gemma is to Kimi.
Kimi v2.6 and MiMo Pro are the absolute top models in open source rn, trillion parameter models within spitting distance of SOTA proprietary super labs.
Gemma4 31b isn't even best in class, just vaguely competitive with other small models.
IntrinsicSecurity@reddit
I’m going
XccesSv2@reddit
You read benchmarks wrong "close" means, when it comes to the last top percent, are huge differences.
Monkey_1505@reddit
Or MiMo flash or similar.
GlitteringDress912@reddit
Hodler-mane@reddit
1000%
I been following guides exact for decent performing qwen3.6 27b on a 3090 and everything I try, fails at basic stuff like thinking and tool calling.
then I realize all these examples are examples for chat bots with no thinking of tool calling .. they just fail to mention that.
StardockEngineer@reddit
Nah, they work. I use 27b with Pi Coding agent to do hard things all day long. The latest thing I did was ask it to iterate on some never before seen data for a data science hackathon. After about 20 commits it made an html dashboard to show me the results.
bjodah@reddit
I love local models, but this has got to depend on the task complexity at hand? There are plenty of tasks (scientific computing, etc.) for which I don't even bother asking Sonnet (let alone my local Qwen 3.6 instance) to solve, but go straight to its bigger brother or OpenAI's/Google's SotA offering (unless the data is sensitive).
StardockEngineer@reddit
I’m not saying they can do it all, but they can do far far more than what many in this thread think. I can do 90% of my work now in 27b, at least. And I’ve had 27b fix three problems both Codex and Opus got stuck on.
roosterfareye@reddit
I think the problem frequently lies between the chair and keyboard. Poor prompting, poor planning, impatience.... I was there too once!
my_name_isnt_clever@reddit
Local models have so many variables, and if you mess up one of them you get shit performance and blame the models for being useless.
roosterfareye@reddit
That's it. Five minutes on the model card can fix a myriad of problems lol!
the3dwin@reddit
Care to elaborate what you mean "on the model card"
roosterfareye@reddit
This: https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B is the model card for the model in question. And no, this isn't the stupid right-wing version of Pepe, its the one being pulled back where he belongs lol!
bjodah@reddit
That's probably a common case, I just want to add that sometimes you really need the extra world knowledge of the larger model. For example, every now and then I want assistance in a niche programming language (elisp) and the smaller models (understandably) hallicinates functions that does not exist. For elisp in particular I've found Gemini 3.1 Pro to be the undisputed king. I really want to use my local models there as well, but I get nowhere near the success rate I can achieve for say python and bash.
Finanzamt_Endgegner@reddit
The just let Gemini create a list of things the local model has to adhere to and it should be fine? Don't have to use Gemini for actual implementation and stuff
roosterfareye@reddit
Yes, agree. I just remoted into my PC after asking qwen 3.6 35 a3b (6k quant) to generate a full test suite and --> run --> evaluate --> repeat until fixed and damn me, it did it, fully and agentically in LM Studio no less!
Caffdy@reddit
can you expand on this? which language/framework were you testing? which library did you use, what level of testing (Unit, Integration, E2E)
roosterfareye@reddit
Sure. E2E, and it's a html, CSS and vanilla JS setup, so yes, pretty straight forward but I'm loving some complicated and detailed maths and scoring systems. I need to know what I'm looking at, so these suit me fine!
roosterfareye@reddit
There's 19 seperate files in all (one html, the CSS and the remaining files are the seperate JS components). I hate monolithic setups, they are a pain in the butt to work on. Learnt that the hard way! I know my way around python a little as well and can read it and generally figure out what's going on.
RevolutionaryLime758@reddit
Bruh that’s not hard
StardockEngineer@reddit
Not from me. But for LLMs this is a talent. And it’s something worth noting in a post full of people saving they can’t get the models to do anything useful. Don’t you agree?
dearmannerism@reddit
This type of reaction is what I didn't lose hope yet. Perhaps, there must be a smarter way to break down the big task into bits that are quite easily digested by the smaller models like Qwen 27b. Once we find those primitives, everything can be just simple processing loop like Ralph loop.
Alwaysragestillplay@reddit
It depends what you're doing. Frankly for any "real" workload that a dev is likely to face, the <100B models are going to crap out sooner or later. I would suggest that decent model routing and orchestration is the way to fix that if your goal is to save tokens. Have some mechanism to judge prompt complexity, then choose whether to invoke Claude or Qwen dynamically.
StardockEngineer@reddit
Honestly, Claude and Codex also often crap out. It’s because of that I have workflows that rotate between the two to auto-resolve security tickets because I’d often find that just one of them in an all night loop going back and forth with greptile or code rabbit.
I don’t find 27b to be any more likely to “crap out”. Matter of text, adding it as a third peg to be flow improved the loops even further and reduced cost.
TheTerrasque@reddit
That's a you problem.
Local models aren't as good as claude, but they're fully capable. I've been experimenting with Qwen3.5 35b a3b at Q4 and opencode last week, and one task it did was making an MCP for a web site's search and detail listing (a local ebay'ish salesplace).
It started with me telling it to find out how the search worked. I couldn't see a json call for it, and the source html didn't have the results so it wasn't straight forward. It went at it, reading source code, finding javascript, deobfuscating it and tracing the calls and fetching the various js files. Like really going at it.
I started it before an 1hr work meeting, and it was still going on after I was done. I just let it putz since I wanted to see how it went, and about 20 minutes later it had figured it out and written a python module to get the listings. I then told it to do the same for details, and it figured that out within minutes.
Then I had it build:
I even had it test the result by building docker image and read the build log, launch it in docker and check the docker logs, then have it do http requests to the server to see if it answered correctly. I didn't even had to instruct it hard to do it either, just something like "verify via docker that it works" and it handled the rest itself.
At one point I had a "host name invalid" type of error, don't remember exactly now, happened when it was called inside the k8s cluster. I gave it the error message, it spun up the latest image and tried a http call with custom host header, noted the bug, traced through the mcp library until it found where a default class was created with hostname protection option was on, and altered the mcp server code to create an object with that option was turned off and pass it along when instancing the server. It then built a new image, verified that the call with custom host now worked, and deployed a new version.
It was a bit back and forth, with a few more mcp errors that took a bit of time to smooth out, but I only looked at the code twice during the whole thing. Once to figure out a problem it was stuck on and once to skim through it at the end to check if there was anything really stupid going on. It wasn't.
And that's with the MoE, which is less capable than the 27b. I don't know what you're doing wrong, but you're doing something wrong there, mate.
andy_potato@reddit
It’s not a “you problem”. OP has pointed out very detailed why a model like Qwen 3.6 is a nice toy but eventually much less capable than Opus or Sonnet.
Everything else is just “I want it to be true because local models”
my_name_isnt_clever@reddit
I don't use Qwen 3.6 as a toy, no matter how much you believe that's all it's good for. If you can't set it up properly and utilize it for useful tasks, it really is a skill issue.
andy_potato@reddit
If it works for those little hobby projects of yours then go ahead and use it. Nothing wrong with it.
my_name_isnt_clever@reddit
Don't patronize me. Learn how to use local models properly.
andy_potato@reddit
Some people have a life and need to get work done
TheTerrasque@reddit
The one I responded to stated that it "fails at basic stuff like thinking and tool calling." - which is entirely a him / stack problem. Probably using outdated chat template or token handling.
Qwen3.6 is less capable, sure, but not much less capable. As for OP, I do think he's done something wrong somewhere, because what he describes doesn't match my experiences with it at all.
Maybe it's tiny context, maybe it's weird quant, or some outdated hosting server, or high temp or.. Whatever it is, there's many ways to mess up serving and using a model that can give those results, and since he's given no info how he runs it, we can't really check can we. So then I have to go by my own experiences, one which I detailed in my comment.
rog1121@reddit
The only “real-world” success I’ve had with local llms is sorting and sentiment analysis. Essentially just a script that calls a Gemini model and asks it to be sorted into one of 6 categories which tends to be fairly well given the headers and raw data.
Full fledged agentic workflows is def not doable u less you run at least a 120b model
iMakeSense@reddit
I'm not sure you even need it for that. If you have enough data for your 6 email categories, couldn't you just create embeddings for those 6, cluster them, create an embedding for the new email, and if I certain confidence threshold isn't reached then use the LLM?
rog1121@reddit
There’s complex rules I wrote, if it’s a certain email I sort it to one folder. If it’s not matching spof and dkim, etc…. Stuff I don’t want to write logic for.
The prompt is like 1500 lines long
yeah-ok@reddit
The key here is the phrasing, "just" might be a bit of stretch for most people, can you point to practical steps needed to do this (i.e. not theory or overview but actual terminal commands)?
Own_Mix_3755@reddit
I use Qwen 3.6 27b for coding sessions just fine. The problem often is multilayered - it starts with wrongly configured server (I understand there are literally hundreds of combinations - but some are much better and some are much worse), continues through good harness (I ended up with RooCode as eg Claude Code seems to add too much of an overhead to each task that its just not worth it, I also had to define manually my own modes, engineer custom prompts and skills) a ends with model size and type (often people choose smaller quants like K_3_S to fit everything into VRAM with 256k context while with good agentic workflow you rarely go over 64k context). You also have to understand you are working with much smaller model and effectively dumbing it down quite alot with small quant. You have to find ways how to help him a bit (giving him proper readable “manual” will certainly help).
Sn0opY_GER@reddit
I use roo code with lm studio on a 5090 with qwen 3.6 27b (or 35b) and im surprised how good it works, tool calls etc no problems. I managed to code a timer software with nice animations for out mini rc car track that talks to the IR trackers for the timing software and now we habe a start light, leader board, rain warner etc - for free. I played a little with openclaw for 2 weekends and spent 700$ on claude :p i think the best way is a hybrid approach where the local model does the simple stuff and cloud corrects and refines. Thats how my claw works now for a while and it works verry well. If local is stuck or im not happy it can talk to a cloud bot in discord and get help fixing it or the cloudbot can take over.
330d@reddit
I'm sorry but these are all toy projects. An average SaaS that's not a crud will have 50-100k lines of backend and 20-30k lines of frontend with complicated deploy pipelines
gladfelter@reddit
Yes, and you sic the agent on the task by promoting it to describe each package and extract public API documentation, with subagents ideally, or with fresh prompts. Once the codebase is documented, that documentation serves as a context-friendly map to allow the agent to create a realistic design and testing plan and implementation plan. Clear the context again. Now your agent is ready to refactor existing code to add any missing unit tests, TDD-style. Clear the context and you're ready to start implementation, TDD-style. The agent can run for hours now since it has stable critics to keep it on track in the form of tests. There's a risk of requirements creep, granted, but there are ways to ameliorate that, too.
Or you could you yolo with a huge model with 1M context, but it'll be worse than using a smaller model in a way designed around it's capabilities and limitations.
MexInAbu@reddit
Well, none is (or should) be vive coding a production SaaS with a local, quantized small LLM. Hell, you should have very strong guards if you are going that with the frontier models too.
Sn0opY_GER@reddit
true - and now that i think about it you can literally FEEL with every line of code it takes longer and vreates more bugs - at first its prompt > "ooh thats looks really nice - lets me add XXX" and after a few of these "loops" the bugs/breaking starts and more and more time goes into fixing stuff - at the end i had to use Claude to fix an error with the minimap-timings the local model just couldnt get right (local always only displayed cars in the first 25% of the map never a complete lap - Claude fixed it and called it "Bad math" 😃
FullOf_Bad_Ideas@reddit
Even with BF16 I found Qwen 3.6 27B to be bad in the same scenario where Qwen 3.5 397B 3.5bpw was pretty good. Same harness.
mateszhun@reddit
Same, local models seem to work really well with Roo Code.
But I do have a problem with on longer context windows with 27b, it suddenly starts to fail with File Edits. (Maybe it is a setup problem?) But 35B doesn't have that problem.
I've settled on 27B for Ask, Orchestration, Architecture modes, and 35B for coding. And 35B is also faster as a moe model, so it works out nicely for the longer outputs. I'm using Q8 quants for both models.
Eyelbee@reddit
I switched to cline after they shut it down, it seems to be the same. I have small complaints, like we can't see things like system prompt. I'm too lazy to look at the source code. It's close to perfect in my opinion, I'm thinking about forking it if I can't find the time.
DrBattletoad@reddit
Good to see someone else with the same problem as me. I thought I was going crazy to see 35B solve problems that 27B wasn't able to.
den0rk@reddit
Could you recommend some necessary adjustments in LM Studio?
Own_Mix_3755@reddit
Thats hard to say. Depending on your hardware, model, usage, … there is alot. Google is your friend and you have to do alot of testing.
Finanzamt_Endgegner@reddit
That's a config issue. It should not fail any too calls, I had it do like 2000 or so at this point and just a single failure.
groeli02@reddit
original qwen? have you tried qwopus or some other derivates?
Your_Friendly_Nerd@reddit
I'm so glad it's not just me. I've barely used any fancy agent harnesses like opencode with local models, because the few times that I did try, it was an awful experience (doesn't help that I don't have much VRAM so the models run slow as hell). That's why I've just stuck to using the chat interface in my editor, which is a step up from open-webui, since it's easier to share editor content with it, but that's about it.
xamboozi@reddit
Wait are you guys comparing a raw LLM against one with a fully refined harness? Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every conversation for how it can do better next time?
nickl@reddit
> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?
> Cause that's what Claude Code is doing.
Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.
The harness is important, but don't make things up.
One-Net-3049@reddit
He's not making anything up; Claude Code incorporates a TODO list and it's quite effective (at least with Claude models)
smirnfil@reddit
Claude Code has memory from the box now. By default hidden from user, but very noticeable in practice.
nickl@reddit
Sure, but that is different to the things stated.
AdOk3759@reddit
Exactly.. the harness plays a huge, huge role in output quality, even more so when we’re talking about small models. Look up little coder
falconandeagle@reddit
This subreddit is filled with vibe coders that think their yet another todo application of basic ass dashboard is something to brag about.
sexy_silver_grandpa@reddit
I use local LLMs and I'm the project leader of an extremely popular open source library that you, and every enterprise company use every day.
QuinQuix@reddit
Linus is that you
sexy_silver_grandpa@reddit
Lol ok my project is not THAT important.
Chupa-Skrull@reddit
Thanks for your hard work, sexy silver grandpa.
pomatotappu@reddit
Lmao
IamKyra@reddit
Hm I'd say the opposite, if you're a good coder you know how to make Qwen3.X do what you actually want to do. It's the vibecoders that will actually miss Claude for how much he can achieve.
benfavre@reddit
At some point you know so much that you don't even need a model
my_name_isnt_clever@reddit
I disagree. If I know how to do it I can delegate it to an LLM by giving it clear instructions, and if it messes up I know how to fix it.
Eyelbee@reddit
Yeah, the more you know what you need to do, the less you need a better model. This has been true for quite some time, honestly. But the thing is, qwen 3.6 27b is quite literally at sonnet 4.5 - GPT-5 level. 6 months ago these were the best models. Would OP say the same about sonnet 4.5 when it first came out?
smirnfil@reddit
So December 2025 haven't yet happened for local models? That explains a lot - the main difference between 6 months ago and current in big world is required level of fine tuning. 6 months ago you needed a lot of knowledge in "AI coding" how to specifically manage context, what mcps to use and what not to use, what tasks you could throw at them and what would be too large. Yes if you do all these dances you could get a lot of value, but the amount of maintenance was quite big. To the level of some devs saying - sure nice tools, but too niche for my tasks.
Now any developer without specific "AI knowledge" could open Claude Code and it just works. Would be interesting too see when local models would be at this level.
Finanzamt_Endgegner@reddit
This this this, if you know what you it can even beat 4.5 opus in some areas with correct guidance.
-Ellary-@reddit
A lot of times I just use local LLM for assistance coding, to suggest me how to complete a function that I'm writing right now. Suggestions become better and better with every major local release. Sometimes I just push the code to ask LLM explaining what I need to achieve and ask it for ideas. Then I just use ideas that I liked and finish it by hand.
I need a little help to speedup stuff, not do everything for me.
I kinda want to enjoy my work.
droptableadventures@reddit
Or if they're capable of setting up local AI to a degree that works well, they are more likely to have some level of programming knowledge.
So if they have to help the model get past the occasional issue it's stuck on, they don't see this as a major impediment.
cmerchantii@reddit
I don’t think this is it either.
I’m not a developer and never claim to be, I’m a hobbyist systems architect at best. But when I’ve got two pieces of software in my homelab I want to communicate with one another and a bunch of API docs from both- I can use a smallish local model to guide me to creating a simple JS worker to pass the relevant data back and forth. Run that on one of my servers and boom: I “built software”… but even I know enough from $dayjob to know it’s not up to scratch for what even one of my junior devs would do at work in a quarter of the time.
Small local models (and big hosted ones, of course) empower people like me who are a little curious and have just enough knowledge to be dangerous to create small things that work well, bigger things that probably function mostly, and bigger things that are totally fucked. But I can completely see how a larger codebase and bigger project with more complex requirements would get choked in a small local model even when guided by a professional.
It’s a complicated multi variable thing we’re analyzing here: how powerful is the model, how skilled is the developer (on a scale from “not a developer/me/0) to literally senior 15 year engineer at Microsoft/10”, and how robust and complex is the project. Moving those 3 sliders around gets massively different results.
alberto_467@reddit
Not necessarily for anyone who's gotten started in the last 2/3 years. There are people doing things who never really learned how to code, because they never truly needed to. They are totally lost when they try to code without a model or smart autocomplete.
They surely have more technical skills if they can set things up, they can probably read some code, but they don;'t really have programming knowledge because they never had the mental strength to disable all AI and actually learn, for many months or even years, to actually code by themselves.
More experienced guys have already put in the work to actually gain the programming knowledge, it's the newer ones who never felt they needed to know the why and the details that i'm worried about.
johnfkngzoidberg@reddit
This sub is full of bots hyping whatever local model just came out.
China is behind and their strategy is to release open models to gain exposure.
relmny@reddit
This subreddit is filled with people comparing a most likely >1tb huge model to a 27b/31b model. And claiming they can't do the same.
What is clear to me is that some people don't understand the tools. And they don't know what they are for nor how to use them.
HiddenoO@reddit
The whole issue in OP's post stems from too many posts claiming the opposite, i.e., that your locally hosted small model is basically as good as frontier models.
It might not be the majority opinion, but it's common enough to mislead people into thinking they're doing something wrong, when in reality that false suggestion is typically either the result of vested interests (like the Huggingface CTO post yesterday), people not being competent enough to realise there is still a very significant gap for real work, or people simply not having complex enough use cases for that gap to show.
Just below, you have the following comment with currently 24 upvotes:
"Iterating on data" and "making an html dashboard" are generally not "hard things" for an AI, especially when the person prompting has the required data science knowledge - what's hard for an AI is dealing with large, messy, interwoven codebases that result in a large, messy context window with tons of tool calls.
relmny@reddit
It's like any opinion on the Internet, what you read is what THAT person thinks/claims.
Meaning, that if someone says "I don't need commercial models anymore, running qwen/gemma/kimi/glm/etc locally is enough!" that means exactly that. No matter how they phrase that. It's their opinion for their case.
I always use local models. So I'm not surprised, specially since the last 1-2 months with gemma-4, qwen3.5/3.6, kimi, glm etc, that more and more people are claiming that THEY can do THEIR work with local models.
And that example is by a single person that, like me, can work fine with local models.
It's about context. And understanding that what works for someone, might not work for someone else.
HiddenoO@reddit
You're acting as if that were all that's being said, but the part I referred to specifically was the "doing hard things all day long" part, and that's how these comment chains regularly go. People extrapolate their own (often very limited) experience and then effectively gaslight people like the OP into thinking they're doing something wrong when in reality they're just overstating their own experience as being generally applicable.
relmny@reddit
Again, that's your claim of what "hard things" are.
AFAIK there's no official definition for "hard things".
Maybe for the person that wrote that, those are "hard things". Maybe things that didn't work before with local models.
And the main point remains, that's the opinion of a single person.
I claim that I do everything with local models. If somebody understands that anyone can do everything with local models, that's their problem, not mine.
That's my experience. I can do "hard things" because they are... to me.
And then there is the comparison between a huge commercial models with all the infrastructure, workers, hardware, tools, etc with a 27b/31b model in a single GPU...
Anyway, I'm done with this.
HiddenoO@reddit
You're arguing about technicalities, I'm arguing about these comments are being perceived by people like OP constantly reading them.
GreenGreasyGreasels@reddit
It's the hype - Qwen3.6-27B is as smart as a model 20x it's size - which is true not not the full story.
It's like claiming a child with 130 IQ can do the same things as an adult with 130 IQ - they might both have the same IQ numbers, but the tasks each is capable of is very different.
Syncaidius@reddit
People also forget when comparing Claude models against others, Claude is trained specifically for coding and development-related tasks. It's more specialised in this area, so it should be expected to be at least slightly better at coding than other models.
However, when it comes to doing more generalised and varying tasks, I find Claude makes way too many dumb decisions compared to models of lesser sizes and that's fine. They're specialised models, whereas the others are more generalised.
Other models are intended to be good at a bit of everything, but great at nothing. However, that will change over time as they optimise model sizes and efficiency.
The biggest issue with Claude right now is it's not able to run at it's optimal level because Anthropic have been severely restricting it to coutneract the shortage of available compute.
andy_potato@reddit
1000x this
SmartCustard9944@reddit
You forgot the tower defense guys
ProfessionalSpend589@reddit
We need more tower defence games!
RoomyRoots@reddit
You can easily extrapolate it to the whole Internet.
Tank_Gloomy@reddit
Same. People told me "try GLM, it's amazing at coding!" and all I found was a model that would constantly get stuck in shit like "I will now call the tool I will now call the tool I will now call the tool" whenever I got over 50% of the context, lmao.
WinDrossel007@reddit
No, it's a common sense
balancedchaos@reddit
Local LLMs have been utterly terrible at everything I've tried with them.
eat_my_ass_n_balls@reddit
A lot of the people in here are making slop and it shows
Medium_Anxiety_8143@reddit
Have you tried another harness other than claude code?
dtdisapointingresult@reddit (OP)
Before I answer that, can you answer with the square root of Pi?
Bitter-Bed-3532@reddit
fr man it's so frustrating
Glum_Proof_2791@reddit
sorry i dont kown what you say
dco44@reddit
onethousandmonkey@reddit
Purely from the performance point of view, there are a number of settings to tweak to make Claude Code jive with local models. For example: https://unsloth.ai/docs/basics/claude-code#fixing-90-slower-inference-in-claude-code
Before I did that, I was banging my head against the wall at the slowness and useless cache.
AdOk3759@reddit
I would also suggest to look into little coder, which is a harness specifically designed to boost smaller models’ performance
Torodaddy@reddit
Open code is actually pretty sick too. I used claude code daily for work and i found open code leaps more productive and faster
Affectionate_Pen6882@reddit
Is this good for beginners learning to code or already for experience coders?
Torodaddy@reddit
Both its pretty easy to use and a lot faster. It does many tasks in parallel
PinkySwearNotABot@reddit
OC has like a 10K prompt though. If you think OC is faster, have you tried Pi? I think their prompt is like 100 lines or something. It’s amazingly fast, and I notice the difference in my M1 Max 64GB
QuchchenEbrithin2day@reddit
Thanks to this thread, found out about little-coder, and that in turn seems to be based on pi. Both look very promising, but in the end, all depends on what can be achieved with these tools and by whom, with what kind or level of skills.
PinkySwearNotABot@reddit
You’re right but it was much easier to hang onto using local models for practical work with the right tool ie Pi. My models were barely functional using other harnesses
AdOk3759@reddit
I never used much CC so I cannot compare it, but yes open code is really my favorite out of harnesses I have tried (although only with large models like Kimi 2.6; I don’t know how it behaves with smaller local models)
ghostnation66@reddit
I would recommend pi
AdOk3759@reddit
Little coder is a wrapper of pi.. just better
ghostnation66@reddit
Better than pi...hmmm
AdOk3759@reddit
Read the paper published by the author of little coder. Another redditor posted the GitHub repo below, and inside the repo there’s the link.
RobotRobotWhatDoUSee@reddit
link? I googled little coder (and variations) but largely found many webpages targeted at teaching children to code. Worthy goal, just not what I am looking for!
aparamonov@reddit
Just use pi instead
AdOk3759@reddit
Little coder is a wrapper of pi, but better
Clear-Ad-9312@reddit
https://github.com/itayinbarr/little-coder
IrisColt@reddit
Thanks!!!
RR321@reddit
Does this happen with Pi, OpenCode, etc?
onethousandmonkey@reddit
Probably not, Claude Code inserts a header that changes with every prompt (for billing), thus invalidating the cache. Easy one-liner fix.
ChocomelP@reddit
Why would you use Claude Code without Claude models? The models are what make it good. The harness itself is suboptimal. If you could easily OAuth to use other harnesses like Pi with your Claude subscription, I would never use Claude Code.
howardhus@reddit
not true. its an open secret that claude has done a great job optimizing their harness to boos performance… after leak there were several anslysis confirming this
BankruptingBanks@reddit
can you please share some links? didnt happen to read any, would love something in depth.
ChocomelP@reddit
Oh wow, I didn't know there were "several anslysis"...
howardhus@reddit
well… glad you came out from under that rock
DoorStuckSickDuck@reddit
Actual skill issue take idk what to say, it's widely considered one of the top harnesses, so much so that people use it with other non Claude models.
ChocomelP@reddit
Congratulations, you've arrived at the point where the conversation started.
DownSyndromeLogic@reddit
The claude harness is not so optimal, especially if you're using claude desktop, it's fantastic. It's the most advanced one that I've seen. Perhaps I haven't explored open code enough yet, but Claude codes memory system. Prompt injection system And chat artifact discovery system is top notch.
ChocomelP@reddit
If you haven't explored Opencode yet, how can you claim that Claude is not suboptimal? I'm not saying you're wrong, but how would you know? Also, Claude desktop straight up has missing features compared to actual Claude code in the terminal.
DownSyndromeLogic@reddit
Because I'm comparing it against GitHub Copilot Which I've used extensively. Never even heard of open code until yesterday. What features does Cloud Code Terminal have that the desktop doesn't have? That are so valuable.
smirnfil@reddit
Claude Code in current iteration is a powerful harness. It just works out of the box with sub agents, memory, planning etc. Battery included design is a huge benefit especially in the field where every two months best design changes a lot. It obviousely is designed to be used with Claude models so it is an interesting question how model replacement affects it, but it makes a lot of sense to use it.
ChocomelP@reddit
It is a good start, I agree. But if you're not going to be using Claude models, you have better options. Claude Code is bloated, and you probably don't need half its modules.
Torodaddy@reddit
You dont know what you are talking about. Claude code as a harness and agent staging platform is tops
Gesha24@reddit
Personal experience - because I like it the best. I am getting better results with claude code than aider or continue.dev when working with a local model
howardhus@reddit
you da real mvp
Mobile_Bonus4983@reddit
The link days:
"Claude Code recently prepends and adds a Claude Code Attribution header, which invalidates the KV Cache, making inference 90% slower with local models. See this LocalLlama discussion."
WhichOrganization884@reddit
Really relate to this, the tool call reliability gap between local and cloud models is still very real for complex agentic tasks. Interestingly though I've found local LLMs work really well for RAG based applications where the task is more about retrieval and answering from documents rather than complex decision making. The smaller scope seems to suit them better. Are you planning to keep local models for any specific use cases or going fully cloud?
dtdisapointingresult@reddit (OP)
I haven't switched to cloud models yet but it's the plan. I'm trying bigger local models now, since I got access to dual Sparks.
It's not tool calls I have an issue with, it's the basic methodology. It's lacking the basic intelligence I expect from the lowest, dumbest junior developer. "dockerize this working app that tells you all the instructions in the README" is a basic task for me. I cannot accept that it's supposed to be a complex agentic task.
After pushbacks from the people saying my prompt is too basic, I even created a docker-usage skill with 700 tokens of guidelines that tells the LLM what to do when it's asked to dockerize a project, and it's still not enough. One of the rules in the skill is:
And despite this, first thing MiniMax M2.7 does after the README is read the source code of the project. I just can't win.
crantob@reddit
Your headline is poorly constructed. You meant "for agentic coding" not "for coding".
Please remember to distinguish these two things in future, everyone.
dtdisapointingresult@reddit (OP)
Who the hell would be doing coding without an agentic app? This isn't 2024. It's so inefficient.
latebinding@reddit
So I know this is dead-threading but...
Real developers are often doing things agentic coding can't do well, because it hasn't been done often before. Agents are good at copying patterns, but if you're innovating, there's no available pattern to copy.
So I do the initial code. Sure, I might use an agent to scaffold it, but that's no different from the old Borland days when you'd select the project type and it would create, e.g., a WinMain or an event loop for you. The actual logic and code is mine.
The difference is, are you the developer, leveraging the agent to do boilerplate? Or is the agent the developer, with you doing no more coding than a junior Product Manager would?
I think that was the point. You don't seem to actually be a hard-core developer, regardless of what you've managed to "vibe code."
crantob@reddit
You're not coding.
dtdisapointingresult@reddit (OP)
You're not communicating intelligently.
Go ahead and copy-paste code to/from a chat UI, you boomer.
crantob@reddit
Lol. You're not coding by pressing "Start Coding". Sorry.
Your LLM is.
Annual_Award1260@reddit
Yeah those datacenter models running on multi million dollar equipment always win. But i use claude to write prompts for my local ai that cranks profit all day long
rentprompts@reddit
The browser-control angle is powerful, but the real moat is reliability. For I'm done with using local LLMs for coding, I would care less about whether it can click once and more about whether it can recover from popups, changed layouts, and partial failures.
A good benchmark would be boring: 20 repeated runs, screenshots/proofs saved, failed steps logged, and a human approval gate before anything irreversible. That is the difference between a cool agent and a workflow you can sell.
patricious@reddit
OP you have mentioned all sorts of things but failed to give us the most crucial piece of information. What does your setup look like exactly. Hardware, model flags, TUI, harnesses, MCP servers?
The whole point, at least in my experience, when running local models is the supporting tech stack you build around it. My current setup feels far superior to what Anti-Gravity, Claude Code, Codex and others have to offer.
For me it looks like this: RTX 5090, Qwen3.6 35B/27B with TurboQuant (use them both interchangeably), --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0
Coding stack: OpenCode TUI, oh-my-opencode harness, MCP's: . context7, grep_app, pdf-mcp, sequential-thinking, serena, stitch, websearch.
I have oh-my-opencode use Qwen3.6 as the builder and general orchestrator and all other sub-agents use: DeepSeek V4 Pro and Fast from my OpenCode Go subscription.
This setup works wonders for me.
fabkosta@reddit
What's "oh-my-opencode harness" - do you refer to this thing here: https://ohmyopenagent.com/ ?
thadude3@reddit
sorry for the dumb question but how did you configure the sub agents to use a different model. I struggled to find how to do this, the other day.
patricious@reddit
Every harness that has agent and sub-agents has some form of json config file that dictated what model and from what source it uses.
Silver-Antelope-1285@reddit
Hi, can you please tell me more about this setup as if you're explaining this to a first time ollama and opencode user?
I've got as far as installing opencode and connecting it to my local ollama qwen3.6 model running on an M4 48gb Macbook Pro.
QuchchenEbrithin2day@reddit
Master Yoda, would you mind show us lesser mortals, the path, say using a youtube video or something ? TIA.
patricious@reddit
DM me and I'll see what I can pull together.
QuchchenEbrithin2day@reddit
Master Yoda, you are probably in Dabogah... Unable to DM, it says:
patricious
Unable to message this account.
patricious@reddit
Hey mate I requested and messaged you 1h ago. Did my message come through on your end?
QuchchenEbrithin2day@reddit
Thanks for DM'ing, but unfortunately there is something wrong with reddit chats for my account, as I am unable to accept the invite. Opened a ticket with reddit.
CarlSagan_1986@reddit
Darwin Opus I1 works really good coding best one at tool use
Puzzleheaded_Tie7801@reddit
That's a great setup, I also have a 5090. What are you using as your infrencing engine for the Qwen models? I use WSL2 with LM Studio but I see LM Studio taking a long time to process (developer screen shows model going through "18 GEN XX tok" where XX keeps increasing and after a long while prompt is processed). vLLM was faster, esp with Speculative Decoding but experienced frequent vLLM crashes)
patricious@reddit
For the inferencing engine I use llama.cpp build with CUDA 12.8 (13 is currently very bugged for Blackwell)
I was on LM studio for the longest time but there is a very weird behavior where LM Studio stops generating tokens, sometimes at 9K sometimes at 16K, and the token processing was taking too long IMO, for these two reasons alone I moved to llamacpp.
Puzzleheaded_Tie7801@reddit
thank you, I will try that. Although, LM Studio uses llama.cpp as the backend. Can you share your llama.cpp startup command? e.g. llama-server and all the switches.
patricious@reddit
These are the params I use for coding only. Replace the paths, host and port for your use-case.
u/echo off
echo Starting llama.cpp server with Qwen3.6-27B (UD-Q4_K_XL) on RTX 5090...
echo CUDA 12.8 + Blackwell (sm_120) + MMQ kernels
echo.
set SERVER=build-x64-windows-msvc-release\bin\llama-server.exe
set MODEL=C:\Users\%USERNAME%\Desktop\Qwen3.6 27B Unsloth\Qwen3.6-27B-UD-Q4_K_XL.gguf
set MMPROJ=C:\Users\%USERNAME%\Desktop\Qwen3.6 27B Unsloth\mmproj-BF16.gguf
if not exist "%MODEL%" (
echo ERROR: Model not found at %MODEL%
pause
exit /b 1
)
if not exist "%MMPROJ%" (
echo ERROR: mmproj not found at %MMPROJ%
pause
exit /b 1
)
"%SERVER%" \^
--model "%MODEL%" \^
--mmproj "%MMPROJ%" \^
--host %Your_IP% \^
--port %ANYPORT% \^
--n-gpu-layers 99 \^
--ctx-size 262144 \^
--cache-type-k turbo4 \^
--cache-type-v turbo4 \^
--flash-attn on \^
--reasoning off \^
--jinja \^
--batch-size 32768 \^
--ubatch-size 2048 \^
--cont-batching \^
--no-context-shift \^
--metrics \^
--temperature 0.6 \^
--top-p 0.95 \^
--top-k 20 \^
--min-p 0.0 \^
--presence-penalty 0.0 \^
--frequency-penalty 0.0 \^
--repeat-penalty 1.0
pause
DependentBat5432@reddit
This. really solid setup, respect for putting all together. but honestly way beyond most ppl would build themselves lol I’m on AllToken team so obv biased haha. for ppl like OP who just want to switch between Kimi and Claude without the setup headache. zero markup, ever. no pressure but happy to help if wanna try it out
AMD_PoolShark28@reddit
This is the way.
Unfortunately you really got to read the model card on hugging face.. there is no one size fits all approach to the parameters especially things like top k temperature and frequency penalty.
Doing creative stuff? You probably want to high temperature. Doing specific coding work, you wanted a lot lower but not zero. Zero gets you into holes where the LLM cannot creatively find its way out of.
The other problem with local LLMS is the defaults typically is really small context window, again you got to read and see what it supports but at very minimum 32k and ideally 128 k for big coding tasks or visual models
patricious@reddit
very very true, I have separate batch files with different params depending on the task I need done.
For the context size, I left it at 263K (compacting at 200K in OpenCode), I haven't encountered any strange behavior thus far.
dtdisapointingresult@reddit (OP)
Hardware doesn't matter for intelligence, it only affects speed.
I ran Qwen 3.6 27B FP8, and Gemma 4 31B AWQ 4-bit. Using the temperature/etc from the model's card.
I used vanilla Claude Code and vanilla Qwen Code. They each have a massive 18k token system prompt. I don't use any MCPs or skills otherwise. The only MCP I have installed is Playwright for web stuff, but it was not relevant for this task.
I think you're right that I probably need to use something that forcefully decomposes the task since the small local LLMs are too dumb to do it on their own.
Commando501@reddit
You must be new to this, huh bud. You gotta actually read through how this stuff works so you get a grasp for what it takes to make this stuff work.
We don't yet live in an age of plug and play with 100% quality/efficiency for local models.
You still have to go through constant tinkering and fine tuning of the model/setup to reach that goldilocks zone.
Inevitable_Search468@reddit
That's what op is trying to do bruh. It is funny how you engineers feel the need to perfectly automate something that is not meant to be while dudes learning y'all are throwing shade for donkeys years. 🫎
Inevitable_Search468@reddit
Why the down votes on this OP reply the fuck
Tai9ch@reddit
Except RAM, which limits model size and quantization.
And speed matters in practice. I could run GLM locally at Q8 if I were willing to deal with 5 seconds per token inference speed. Qwen3-Coder 80B will get much more work done and done effectively.
middleNameIsHadrian@reddit
Regarding quality of responses from models... Yesterday I hit a nasty bug. Opus kept proposing elaborate and overcomplicated solutions, while DeepSeek V4 Flash went straight for the simple fix. Quality-wise I'd genuinely put it head to head with the latest Sonnet, and it's saving me real money and latency on top of that.
ElonMusksFacecream@reddit
Not that insightful tbh. Highly opinionated. Purely based on your own biases. No details on your setup.
Not exactly brief either. More like a therapy session, venting general frustrations. Can your LLM setup not even help with sorting out this post?
TBF, the title did hint it was all about you. I can't complain really.
Legal-Pop-1330@reddit
Fwiw, we have decent results using our new sagent API/CLI. Maybe give it a whirl? (Its OSS/Apache2.)
https://rekursiv.ai/blog/introducing-sagent/
Yusso_17@reddit
I like local but i dont use it for coding. I use cursor and claude instead. Local AI is still too weak at this point in time.
Sharp_Classroom9686@reddit
I think the problem is less “local models suck” and more “you used the wrong tools for local models.”
If the runtime lets a 27B model eat giant logs, bloat context, and improvise badly with tools, of course it’s going to feel terrible.
Try Forge. (Github) It’s much more local-first in how it handles context, subagents, and task scoping. It won’t make Qwen think like Claude, but it does stop wasting tokens on garbage, which is half the battle with local coding.
Link:: https://github.com/defexnicolas/forge
dtdisapointingresult@reddit (OP)
I'm not trying your 1-week-old vibecoded agent bro. hmu if your project is still getting commits in 3 months.
Sharp_Classroom9686@reddit
shh
dtdisapointingresult@reddit (OP)
I'm looking forward to adding your app to my graveyard in a few weeks:
https://reddit.com/r/LocalLLaMA/comments/1t3lwji/comparison_of_the_development_status_of_various/
Sharp_Classroom9686@reddit
15mins.. task complete
dtdisapointingresult@reddit (OP)
I put on my robe and wizard hat...
No, nevermind that. UNSUBSCRIBE!
Sharp_Classroom9686@reddit
newcomb_benford_law@reddit
The thing with open models is that their increments in performance are getting significantly bigger compared to closed models. If you look at Qwen3.6 compared to 3.5 it's a significant difference. So you just have to stay patient and keep up. Also the harnesses like Claude Code and Codex are simply not good for these models - you have to use better/more tailored harnesses. I find something like pi (https://pi.dev/) combined with Qwen3.6 35B/27B to work amazingly well, with the right skills and tools.
dtdisapointingresult@reddit (OP)
I said in another comment how Pi is actually WORSE.
It doesn't do toolcalls in subagents by default. That means the main context on your slow, shitty local model is filled with the output of tool-calls + the analysis. "Just write your own subagent extension bro" is the recommended solution, and it is a valid one, but no, people will not get ANYTHING out of Pi without putting in a few weeks of learning curve.
newcomb_benford_law@reddit
That's not my experience with it at all. It's been solid. If you don't like it don't use it. I like OpenCode as well. I use Codex and CC as well. I switch between all of these. This entire ecosystem is still fairly new, and it's getting better and better. Your post is just one perspective and it's very different from mine. Should I tell you to stop posting?
NE0_ZER0_@reddit
This can't be a real post lol...
dtdisapointingresult@reddit (OP)
Oh no!
Anyway...
Kind_Brick_8461@reddit
Is there a platform where someone can just pay per token (like in OpenAI or Claude) but for the open models api?
dtdisapointingresult@reddit (OP)
OpenRouter is the most popular one. It includes OpenAI and Claude models too. One site, all models with a single API key.
Revolutionary_Ask154@reddit
jury is still out - this is bleeding edge - during a debug session i noticed an insanely large amount of polluting from a everything-claude-code plugin flooding the context window - wasn't a problem with claude - checkout with verbose flag - use openclaude - it's a direct rip off of claude with proprietory logic. https://gist.github.com/johndpope/a77b179c4f0013adb2a50e13e56b7929#file-llama-deepseek-v4-sh-L18
MLExpert000@reddit
I won’t really say that out loud here because people get really offended. But I hear your point.
andy_potato@reddit
It is necessary to say it out loud.
Qwen 3.6 27b is a great model for many applications. But I’m sick of these posts of people claiming it performs on par with Claude for coding. It is simply not true.
Finanzamt_Endgegner@reddit
It is if you know what you are doing. It isn't if you don't. For pure vibe coding without thinking on you part it might not be there yet but with correct harness settings and instructions and guidance it can compete with at the very least sonnet4.5
andy_potato@reddit
It’s not even close. Everyone claiming otherwise is just coping hard.
Finanzamt_Endgegner@reddit
It is. Anyone claiming it itsnt has either a config or skill issue. Looping and stuff is config issue. Not giving it clear instructions is skill issue.
crantob@reddit
If the person is giving the same guidance to Claude and to Local LLM, and pointing out the difference in performance, then that is a valid difference of performance.
Claiming it's a "skill issue" is nonsense when it's clearly divergent performance between models.
YOU being able to get what YOU want in YOUR application is IRRELEVANT to the point.
Finanzamt_Endgegner@reddit
Well if the local model is in the wrong harness and has the wrong settings its hardly a fair match, when the cloud model has the correct settings on their api and uses a harness that was made for it. Btw im not claiming it beats opus but it is either on paar or extremely close to sonnet 4.5 and not only benchmarks show that but real world tasks as well.
bnolsen@reddit
The question is, how much time and research do you put into fine tuning these local models as they come along?
Finanzamt_Endgegner@reddit
not a lot, just use pi and build it from there. And not only do you get free of privacy issues you also dont have issue when copilot increases prices x9...
bnolsen@reddit
I mean there are things to consider like quantization and the like. Perhaps that's not as important.
Recoil42@reddit
Say it out loud, otherwise this place devolves into a reality bubble and loses value to everyone. Sometimes, people need their medicine.
thiswebthisweb@reddit
I've come to the conclusion that even the best SOTA models inc commercial are only ever good for very small projects and code bases because there is no good solution for the context limit issue. No matter how great the AI once you run our of context everything fails and its easy to lose progress, its super frustrating and expensive. For a AI to do proper coding you need an enormous context limit, 3-10M tokens. I think this is now more important than getting newer models with more training data. the limit has been 250k for quite some time now. Deepseek has improved that, but 1m is still not enough really. Compression and memory management just doesn't cut it.
Equivalent_Bit_461@reddit
Cool story bro, still not using Claude
XCyefHpwE7coXyV@reddit
I give as well. I have spent a week trying to get qwen model running on LM Studio with several editors running on a damn Mac Studio M3 Ultra 96 GB and can’t even get the it read file structure properly let alone load my rules. Tried other coder models, same thing. No clue how some are able to developer low end stuff based on my experience.🤦♂️
Fast_Sleep7282@reddit
the trick is to use a large llm to orchestrate smaller coding llm’s to save output tokens
blastradii@reddit
I use my cheaper perplexity pro subscription to draft out implementation plans and then feed that over to a free open router model to implement. lol. What do you think?
hedsht@reddit
i was about to say.
in my workflow codex gpt 5.5 is the architect, qwen3.6 27b the builder and qwen3.6 35b the tester. it works very well (for web development).
tomdg4@reddit
How do you set up such a workflow? Trying to do the same since github copilor prices will go through the roof
hedsht@reddit
its just a python script calling either codex cli or pi. its very basic with a handoff.md in the end that gets validated by an additional script, so the agents get feedback whats wrong and can fix it to make sure that the handoffs are always correct.
i was looking at existing solutions (crewai or langchain/langgraph), but they were too much for my "small" team.
stonk_street@reddit
how to do this
kapitanfind-us@reddit
Would you kindly share the architect prompt if not too business specific? That is what makes a difference.
The_2nd_Coming@reddit
What's the mechanism that let's Codex instruct your local models? Is it just as easy as setting up the agent.md in the project or does it need a skill?
SaltAddictedMan@reddit
THats cool but how exactly does the workflow look like. Automated or are you copy pasting instructions
hedsht@reddit
i run one python command per phase manually, then its completly automated and only stops on blockers or when a phase is done. i could automate it til the end, but a completed phase do needs an audit and i prefer to do it after each phase then one bug audit in the end.
last blocker was because the AI Agent wanted a specific virtual machine to run tests and couldnt install that himself... stuff like this happens and then they get stuck. the architect tries to plan around it, but when its a hard depedency for all following tasks, a blocker cant be avoided.
i check here and there, but if i havent resolved a blocker in 1 hour i get sent a notification to discord which then sends a push to my phone.
i think everyone starts at some point with copy pasting instructions, i have done it myself as well, but this works surprisingly well so, its so good and i'm not kidding!
zis1785@reddit
Do you maybe have like a blog or something that you could share ?
SaltAddictedMan@reddit
That sounds like a great setup! Thanks
UncleRedz@reddit
That is a pretty sweet setup that is not overcomplicated. How do you make the phase plans?
hedsht@reddit
i just tried to copy/implement the workflow from modern software development. you could scale it or add more roles, but the simpler, the better, especially since the ai agents like to drift or do unexpected stuff, even if you have explicitly prohibited it... less complex, the better it works.
just dump my response into chatgpt, it should generate you something similar.
i do the phase plans myself manually, thats the stuff where i dont trust any AI ... yet.
exaknight21@reddit
Yeah I agree with your thought process. Phases are practically instructions to different teams. Thank you!
hedsht@reddit
yes and no.
lets say you have a big picture, like how the project has to be, you split that into phase, split phases into tasks which the agent splits into slices.
you want the local models to do only one task with as much information as it needs.
documentation is key and bonus points if you know yourself how it should be or how it should be built.
my phase plan is exactly how i would built it myself, but i would need 6 months myself to complete. last project was done in 7 days thanks to AI agents coding 24/7.
crazy times.
m3kw@reddit
copy and paste the planned tasks. Or you can roll your own with stuff like Opencode to route agents, but i think thats a load of work
exaknight21@reddit
Can you write a step by step how you do this?
hedsht@reddit
I run three agents with a Python runner that orchestrates them. Each agent has one job and strict handoff rules.
The cast:
Only architect uses the expensive model. The rest are cheap open models.
The flow:
Why it works:
I type one command to start a phase, then check in occasionally to handle blockers that need a human decision.
PS: I've used AI to generate the response, because i wouldnt be able to explain it that well myself.
my_name_isnt_clever@reddit
Exactly. I use local as much as possible for my agents for privacy reasons, but I've been experimenting with having hermes-agent delegate plan creation to Kimi K2.6, then handling the implementation itself. It's been working really well so far.
Due_Pea_372@reddit
I have the same experience Claude Code in my job and local coding llm no comparison
reedog117@reddit
I can see your pain especially when I'm trying to mimic my workflows at work where I have a very generous AI budget. At work I usually am coordiinating between very large monorepos with context usage averaging 400-600k tokens on Claude (Sonnet or Opus). I do run a local memory MCP in a Docker container so I can keep some memories/context between sessions and not have to have each new session relearn my environment, along with custom corporate Claude wrappers that also help persist the environment state uniformly to sessions.
I'm trying to mimic the same thing at home with my own hardware, but that's pretty limited - an RTX 3090 on one system, and multiple 16GB Apple Silicon machines I can try linking together with exo or mesh-llm. I do have a personal ChatGPT Plus membership but I've noticed just on a medium-complexity project I'm already hitting limits on Codex. The next thing I'm going to try is maybe using OpenCode where I have GPT-5.4/5.5 as my main agent layer, but delegate all subagent work to local? This is where I get torn - would a 128GB mem system help me that much at all? What's the tradeoff? Or am I better off just forking over more money for either Claude Max or ChatGPT Pro?
dtdisapointingresult@reddit (OP)
Do a hybrid, where you delegate subagent work to a cheap cloud model like Kimi. It might work for local, if GPT-5.5 spoonfeeds a lot of guidance to Qwen 27B, idk.
Can OpenCode be configured to use Opus as the main session model, but subagents run in a local LLM? If not, you'll have to create a skill called delegate_to_assistant or smth that takes the task details as input, and executes it on the cheap model.
dead_dads@reddit
Yo! New to local LLMs/ai stuff in general. I have an old 3090 and 128gb of DDR4 RAM. Was going to sell my old machine for parts but occurred to me this week I could turn it into an ai machine to dip my toes into locally run stuff.
My interest rn is to work on some vibe coding projects. Would like to assess and test models that fit fully into the VRAM of the 3090 but also curious about utilizing my ram (DDR4) to see what larger models can bring into the equation.
What models would be worth by time for testing? I’ve been working with Claude to ID some stuff of interest but as this field moves so fast I thought asking people who are actively engaged in this stuff would be better.
dtdisapointingresult@reddit (OP)
With your hardware you only have 2 viable options, Qwen 3.6 27B on the GPU, and/or Qwen 3 Coder Next (80B A3B) on the CPU or CPU+GPU. Q3CN being A3B only while not having reasoning means it might run decently fast even though you have DDR4. Start at Q8 and work your way down to Q5, find an acceptable speed.
With such weak models the most important thing will be to micromanage their behavior with customized agents and prompts because you cannot count on them to be intelligent enough to figure some things out on their own. You need to handhold them to make the most out of them.
No_Hunter_7786@reddit
Totally fair. Local models for agentic coding tasks are just not there yet. Using them for automation and text stuff makes sense though, that's the sweet spot right now
cloverAthlete@reddit
Open models are typically 6 to 12 months behind frontier and that gap is real, you're feeling it.
Kimi K2.6 and GLM 5.1 are genuinely good for code now, but they still don't compare to Opus 4.6 for the kind of agentic decision-making you're describing
I'm running Gemma 4 on my own site (won't mention it to avoid spamming) and for day to day tasks and image reading it's fantastic
That-Drink4650@reddit
"Here's a Github repo, I want you to Dockerize it."
Are you giving the model truly this open of a task/prompt? Why wouldn't you run that through chatgpt first and produce a prompt around your repo..
dtdisapointingresult@reddit (OP)
Because any basic developer knows that this instruction is all that is needed. There's implied context known to anyone who uses Docker, and for anyone who looks at repos on Github:
I expect any model to have this minimum knowledge. If they don't start by reading the README, or if they are incapable of translating the instructions to Dockerfile commands, they are unsuitable as an agentic coding LLM.
raviteja777@reddit
I was trying for past 3 days to setup a local coding agent using ollama and cline for intellij IDE. (desktop with 12GB 3060 rtx and 16GB ram). Except for Llama 3.1, rest of the models struggled and kept crashing .
So i just tried using jetbrains AI assistant plugin - with ollama as a provider - Llama 3.1, Qwen 2.5-7b-coder-instruct and oss-gpt-20b(has some lag) worked, i tried explain/review code and gave some simple prompts to check, , it gave decent results (but not on par with claude or copilot) . Gemma4: 31b kept crashing no matter what.
I have documented the process in this medium article article
runner2012@reddit
You are comparing a data center with a computer...
Your blog's content is...not great. To say the least
ai_guy_nerd@reddit
The productivity gap usually comes down to the orchestration layer rather than the model itself. Claude Code isn't just a model; it's a tightly integrated loop with a specific set of tools and a very refined way of handling errors.
Most local agentic apps have a naive "prompt and pray" loop. When a tool call fails or takes too long, the system doesn't provide the model with the right context to recover, leading to that weird decision-making. Tightening the inner loop and giving the model better telemetry about why a command failed is usually the fix. Systems like OpenClaw try to bridge this by focusing on the execution layer, but it's a hard problem to nail.
swagonflyyyy@reddit
I don't think < 100b models are there yet for coding, but try using an organized Claude Code stack with a good set of CLAUDE.md file and additional .md files stored under a /rules directory to help guide its workflow better.
Honestly, Claude Code locally has worked wonders for me, way better than codex. Only thing is that Codex is very autonomous but inaccurate with local LLMs, which make it very unpredictable.
If Codex fixes whatever orchestration issues they have going on I think I would drop Claude Code for that if I want a project built autonomously. Right now their implementation is bloated and over-engineered. Its only good for APIs and not much else.
laterbreh@reddit
something something... there is no replacement for displacement? -- er parameters? 😄
mfgiatti@reddit
I've implemented a hybrid workflow where Gemini CLI routes viable tasks to a local Gemma 4 instance. It acts as an orchestrator, utilizing my local compute and effectively reducing API token consumption.
KingMitsubishi@reddit
I think we are currently just seeing the potential. Local is much worse than cloud, but I feel it’s catching up. I test some local models as they come out, they are nowhere near Claude or gpt or even glm/kimi, but is see them improving. The 27B qwen is really really promising. Pretty confident that they are the future. And that might not be too far away.
envelupo@reddit
just curious -text games such as?
datbackup@reddit
Even though I lean towards agreeing with you that local isn’t able to compete with the big centralized providers, i immediately became skeptical when your long post didn’t mention the actual harnesses you used by name. I see in another comment you mentioned using Claude Code, Qwen Code, and pi.
The fact that you didn’t mention this in your original post but you did mention several models by name, tells me that you are misunderstanding the importance of the specific harness you choose.
I agree that there are way too many posts on X that hype up agents or AI in general and ESPECIALLY make it sound like the poster spent way less time on their hyped outcome than they actually did. Basically there is a scammy situation happening whether organically or intentionally where people are incentivized to make it sound like something “just worked” because then, when others read it and can’t reproduce the outcome (without ridiculous amounts of time and effort) it positions the poster to get more esteem, followers, job offers etc.
The takeaway is just that you should expect vastly different outcomes with different harnesses even when using the same model. Of course there is also the “skill issue” but I want to suggest to you that some portion of the “mind reading” you refer to is down to the agent’s system prompt(s) and the way it engineers context.
Hermes agent for example has the same problem you mention where it starts a long-running process with no regard for how long it might take, then times out and has to start over. However, it’s very good about by default doing the behavior you described where the tail of a log file or command output should be used to determine the state of something.
So if you aren’t totally giving up yet i encourage you to try a “breadth over depth” approach to using harnesses where you try the same task in each and note what their strengths are.
I think there are huge unlocks still to be made in harness design, which will make the already released local models that much more viable compared to big providers.
TheTerrasque@reddit
He also didn't mention how he's running the models, which can have dramatic differences in result.
mumblerit@reddit
2 bit in ollama
pja@reddit
2 bit is probably too small a quant tbh. I’ve certainly read a lot of complaints that tool calling especially in the open models tends to fall apart once you get below 4-bit quants.
droptableadventures@reddit
And it's probably failed to detect all his GPUs, so is running on the CPU.
And that thing it does where it doesn't error when you run out of context, but just ignores the first bit of the prompt.
With the context length set to the default of 4096.
norebe@reddit
Everything he tried (nothing) didn't work!
datbackup@reddit
good point.
droptableadventures@reddit
Also it's some very interesting timing given that Github Copilot just announced a switch to usage based billing, and massively increased the cost for higher end models.
And it's resulting in a lot of people suddenly being quite interested in local AI models, who previously dismissed them...
eLKosmonaut@reddit
Pro+ is 40$ and still has Opus. The multipliers drop on April 30th. Your post isn't entirely accurate.
Caffdy@reddit
what does it mean? should I go and get a Pro plan before April 30th?
droptableadventures@reddit
It does now, under the new one you can't sign up to.
eLKosmonaut@reddit
How would you use something you can't even sign up for? Like I said and you just confirmed, not entirely accurate.
TheQuantumFriend@reddit
What is your setup? I am running coder-latest with opencode. I would trade time for quality, maybe with deterministic harnesses. However reddit is a bit polluted with so muxh crap, hat iam a bit lost atm.
datbackup@reddit
I just set this up last night
https://www.reddit.com/r/LocalLLaMA/s/khiJXifoAV
It’s about as close to sota as one can reasonably get on “consumer” grade hardware imo
Hermes agent, pi, opencode
mrdevlar@reddit
Honest question: What do we mean when we're talking about an AI coding harness? Is this what we mean by OpenCode or Cline or RooCode or is this a more nuanced set of features that are used as part of a coding process?
Lucky-Necessary-8382@reddit
Probably good prompts in .md files
mrdevlar@reddit
Could you elaborate on what you mean on that?
Lucky-Necessary-8382@reddit
A "harness" is the software layer you build around a model — the infrastructure that turns raw intelligence into a useful, autonomous work engine. The model provides the reasoning; the harness makes it actually do things reliably, repeatedly, and without you babysitting it.
What a Harness Actually Contains
A harness typically wraps the model with:
run_python(code)), executes it in the real world, and feeds results back into contextmrdevlar@reddit
Thanks. Not sure who downvoted you for helping me, but I appreciate your effort.
watchmanstower@reddit
A harness is both what you are running the agent through (the software) and also what you are surrounding the agent with for him to be successful at whatever you’re wanting him to do (e.g. all the necessary docs)
cniinc@reddit
I disagree, OP posted how they were making harnesses and parameters for their relatively simple task of taking a Github and making a container.
If anyone can point to a working set of model and harness, I'd be very open to hearing about it. If we just can't do anythign close to Opus, let's just be honest about that. If we can achieve Opus-level gains with a set of well-defined harnesses, let's be honest about that.
So, what are harnesses that work for coding? I've yet to see someone replicate the productivity gain from cloud models, using a local model.
PaMRxR@reddit
Local models require significant time investment to learn a lot of details of how things work and how to efficiently make use of the hardware and model capabilities. Without some curiosity driving you into this people like the OP will fail. People that just want to use something and don't really care about the details.
AdOk3759@reddit
Exactly, look up little-coder
DeltaSqueezer@reddit
3 minutes after giving the prompt:
Qwen 9B.
Stitch10925@reddit
What agent tooling did you use?
DeltaSqueezer@reddit
I wrote my own. I just started with a simple loop and added tools. After a week, I stopped using Claude Code and replaced it with my own agent and most of the agent was developed by itself.
After adding many tools, I found it was better to skip back and limit to just four: Read, Write, Edit, Bash. I also have Grep and Glob so I can disabled Bash to limit risk, but technically, you could just have Bash as the universal tool.
I also have no default system prompt so full context is available to the agent.
I reduced API usage massively. Now 70% of work is done with Local Qwen and 30% with GLM-5.1 when more context/intelligence is required.
https://www.reddit.com/r/LocalLLaMA/comments/1sq7cie/warning_do_not_write_your_own_ai_agent_if_you/
Stitch10925@reddit
That's pretty cool. What coding language?
I've been thinking of doing the same thing because current tools are not very fond of C#.
DeltaSqueezer@reddit
I wrote it in Python.
Pangocciolo@reddit
How do you make the write command emit indented code? It seems prompt engineering is not enough. You just call a linter after the code gets messed up by the LLM ?
DeltaSqueezer@reddit
The LLM is intelligent enough to do the indenting properly. Since whitespace in Python has meaning, the code wouldn't even work without proper indenting.
I actually, did write a plugin for my harness to run a linter after each edit, but I have it turned off. I do have a git hook which runs Black before committing.
Pangocciolo@reddit
But my agent was stupid enough to mess up strings. Now I'm starting to have good results. :D
Pangocciolo@reddit
Which quant are you using, from who? I see my 9B-Q8 from unsloth often shows messy behavior. Like calling write and putting code in both the content and the filepath fields...
DeltaSqueezer@reddit
I'm using it unquantized.
false79@reddit
bro - this is hilarious. OP made massive rage quit post and you did it with a 9b, lol
Caffdy@reddit
truly skill issue
CheatCodesOfLife@reddit
Upvoted for EchoTTS. That's pretty good for a 9b! Which harness?
Agile-Orderer@reddit
If you’re using 3.5, the 3.6 versions are much better, however the 27b dense and the 35b MoE versions of Qwen are multi modal reasoning models mostly for general purpose (including coding), however the Qwen Coder Next models are specifically trained for coding and tool calling so you may get better results with that.
Additionally, comparing local Qwen to Claude Claude, especially Opus 4.6/4.7, is not the correct comparison. Qwen 3.6 35b lands roughly in between Sonnet 4.6 and Opus 4.6 and I find that to be the case when thinking is on, for general tasks, including coding sometimes but not necessarily long running or full codebase tasks (it taxes my machine too heavy to run long tasks anyway so I haven’t been able to test that aspect).
The next thing I’d say is, Kimi 2.6 is much more comparable to Opus 4.6 and much better at coding (I’ve read the most recent GLM model is better too). You won’t be able to run either locally more than likely but you can use them for a fraction of the cost of Claude via Open Router or directly.
Regardless, maybe for your particular tasks local LLMs aren’t usable yet, but I wouldn’t abandon it altogether. Month to month, more advancements keep coming so at some point in the next year I’m sure they’ll catch up to your needs and you’ll want to in a position to switch back even if you decide to use Claude/Codex for now.
ilt1@reddit
Do you think it's a good idea to invest in a rig right now? Would the same rig be compatible with newer models? I'm concerned about space limitations, so I'm looking for a compact option. What do you recommend?
Agile-Orderer@reddit
I’m not sure I’d build out a full rig right now, parts are expensive and models are becoming cheaper to “rent” on OpenRouter or similar. The big models like Kimi or GLM would need too powerful of a rig to make it worth it (ie you’d spend years of OpenRouter costs to recoup the cost of your rent and by then your rig would be outdated and/or overkill).
I think if you can grab an M4 Mac mini with as much ram as you can afford then you’ll be able to run some decent local model in the Qwen & Gemma family, increasingly more image and video models, and quantized versions of some small Nemotron, DeepSeek, etc. everything else can go through OpenRouter for the big boys.
In a year or so I wouldn’t be surprised is a smaller Kimi or GLM comes out runable on a mini, or a larger/more powerful Qwen (like a 128B A9B as a random example) which would be like 3-4x more powerful than the current Qwen but with only marginal increase resource usage (at sub 6 quantization).
Long story short, big models are getting lighter, small models are getting more capable, the need for massive powerful rigs is diminishing yet the cost for part is increasing. If it were me, I’d either wait it out for a model that works on what you have, or buy as maxed out of a Mini as you can as a sweet spot for now and reevaluate the rig build in a year or two when either parts cost is more reasonable or requirements have dropped significantly.
PS on the Mini front. I’d also wait until Apple drops the M5 spec bump for it because every M5 device has got bumped up to 512GB storage with 16GB RAM on base model. If then spec bump the Mini to M5 with little to no price bump then it’s a significantly better deal than buying the M4 right now.
Migraine_7@reddit
Are you using a subagent to at the very least create a work plan before each task?
Even Sonnet and sometimes Opus fail miserablely if the task is not well defined.
dtdisapointingresult@reddit (OP)
Does it need to be a subagent? This was my full prompt:
simracerman@reddit
I can claim the badge of student among you all, but that is not how I’d feed a small 27B model any prompt. The extra unnecessary context will certainly confuse it.
Do yourself a favor and run your prompt through it and as if to can cut it down to problem statement, and goals. Divide the task into subagents (trust me on this one). Use Opencode, ditch CC for local models- it produced worse output in my experience.
dtdisapointingresult@reddit (OP)
Isn't that what ClaudeCode/QwenCode's system prompts and the model's own reasoning supposed to do? Expand a small task into a list of decomposable steps? I gave "Start by making a plan" to steer it towards that.
If I have to chew the model's food for it, that means a small local model can't do what I expect it to do, and it's a huge loss in productivity for me to keep using it.
the3dwin@reddit
Build a custom command like /execute or /implement that takes your non optimized broken down prompts and always breaks it down and makes it as explicit as possible and even ask you questions so you don't have to repeat yourself. Even instruct the custom command with explicit instructions of how Claude Code behaves. I'm sure you will get the results you are looking for.
dtdisapointingresult@reddit (OP)
I do as you suggest for actual applications. I specify the stack, how to implement, I tell it to write tests first, etc.
But for me this is a straightforward prompt. I can't treat random one-off personal tasks like a serious software engineering project. If small local models can't do this, then I can't use small models for one-off personal tasks: they waste more time than they save me.
the3dwin@reddit
As for agentic coding, I do not think your Original Post was about the model itself, but fyi local models from my understanding 30B+ are where local models are most reliable for agentic coding, anything less than 30B usually have more problems and need more configuration as far as I'm concerned.
dtdisapointingresult@reddit (OP)
I believe so too, but you missed the last month on this sub.
Qwen 3.6 27B (and 35B before it) is being treated like Opus's little brother. People are pointing to how it's just behind Sonnet, how we finally have a local Sonnet, etc.
I was fooled by the hype. I mean check this out ,this was one of the top posts of the week: https://reddit.com/r/LocalLLaMA/comments/1strodp/qwen_36_27b_makes_huge_gains_in_agency_on/
the3dwin@reddit
I was looking at the top posts this week and discovered yours, then I was about to go through that post next.
Again I am sure with the right configuration, and prompt they could reach what the benchmarks show.
the3dwin@reddit
I use a /explain custom command that basically has the model explain to me what it understands and confirms, and ask questions about ambiguity or which method to use for tasks that can be executed in different ways. This way whenever I switch model I use the /explain command and will know how well it can handle the task.
dtdisapointingresult@reddit (OP)
so you type '/explain', the LLM supposed to analyze and report what it thinks must be done, and this is how you can tell?
I kinda do that in my prompt, you'll notice the last line is "Start by writing a plan." However, I like the idea of forcing it to explain what it understands and to confirm. I could make this into a dedicated skill.
the3dwin@reddit
Yes.
For being able to check its status I say have in the /execute or /implement command have explicit instructions to report status exactly how you want it to, whether for each execution, before, during, after etc.
As for how long something will take is a bit overkill personally for me to have it also predict how long will take based of it's training data but I'm sure you can get it to based of what it has been trained on, even tell you whether it will need to research how to do something or let you know whether already knows.
simracerman@reddit
You’d think, right? It’s up to the LLM’s interpretation and how good is it at following instructions.
I’ve built two apps already from scratch and learned lessons the slow way. You can achieve a ton with these local tools already if you spend time and iterate over the flows to perfect it.
false79@reddit
"The instructions are simple"
Lol, wth hell is that prompt.
That helps nobody. Not even humans.
xienze@reddit
This may not be the world's greatest prompt, but if you handed that off to a developer who knows what Docker is... those instructions are pretty clear IMO.
dtdisapointingresult@reddit (OP)
Right? I'm not asking for the moon here.
This is something an average non-coder Linux user, like someone who an enthusiast with a homelab, should be able to do trivially. It's a form of translation (README to Dockerfile), the model doesn't even need to be intelligent. I unironically would have expected the 9B to pass this.
I think if my prompt was just "Dockerize the app at ~/echo-tts" it would've succeeded (I certainly hope so or it's hopeless). But adding the context of "you need to test the Dockerfile yourself, also you WILL have a failure and you should fix it when it happens" is what was too much for lil' 27B little monkey brain.
RoughElephant5919@reddit
Just want to say thank you for this comment. I run local LLM’s for OCR data extraction, and the prompting has been the biggest challenge for me. I appreciate your input, and I am going to try this on my current pipeline I’m running 🙏🏼
Intelligent_Ice_113@reddit
this prompt explains a lot
dtdisapointingresult@reddit (OP)
Was my prompt so bad? I would expect any basic junior dev to be able to follow this prompt. I give these sort of instructions to the intern at work all the time, I get a working script/Dockerfile/etc when he's done.
I can't give it more detailed instructions, otherwise I'm doing its work for it: I expect it to read the README of the project (implied, because this is the case for 99% of Github projects) for installation instructions, translate those to commands in a Dockerfile.
Intelligent_Ice_113@reddit
exactly. I mean, it's a gamble, sometimes it can guess your intentions right, sometimes it can't.
The thing is: these are not humans. Never forget that. And you have to give them the right commands, a cold-blooded list of procedures to follow, without any chatter, as if you would do with a real junior dev. Every detail or context you didn't provide, they'll make up, thinking that's what you meant. And that's critical for small LLMs, because they're dumber than true LLMs, yes, that's their huge disadvantage, but that doesn't make them useless.
TL:DR, small models are prompt sensitive. And you have to do its work partially, at least by providing the relevant context.
dtdisapointingresult@reddit (OP)
I mean, what you're saying reaffirms that I can't use them for the sort of things I want to automate.
I get that small local models can work for people with a lot of prompt management, but I really want to be able to give that Docker prompt and have a working Docker image on the other end. An app running in Docker is to me a very simple thing that someone with 1 day of Docker tutorials can do. It's the hello world of modern development.
Anything that requires to put in more effort is a big waste of time for me. I mean 'waste of time' literally, I'm not saying those models are a waste of time. I'm saying me using those models ends up wasting my time. These are not long-term software projects where it's essential I put in my full effort in the original definition. These are one-off small tasks where I turn to the LLM because I want to spend less of my own time doing it. I cannot treat them with the attention of a work project, I want to spend less time on the computer, not more.
stilet69@reddit
No, no. The phrase "Make no mistake, or I'll kill you" is more appropriate to this case.
2Norn@reddit
i have better success with make no mistake or i'll kill myself
StardockEngineer@reddit
Where did it fail?
dtdisapointingresult@reddit (OP)
Step 3 here https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiqq2gr/
LateGameMachines@reddit
It sounds like you probably need to scope it in harder. I’ve built tons of services running on podman quadlets and compose files. It will get something wrong, so provide the exact error in the follow-up. It’s rare even on GPT 5.5 Extra High for any LLM to one-shot a compose yaml that works instantly with your specific setup.
dtdisapointingresult@reddit (OP)
I didn't remember the exact details, it's a 2nd attempt from something, so I figured it can figure it out on its own.
My expectations:
I never got past step 3.
RemarkableGuidance44@reddit
Yep, no idea wtf you are doing.
guinaifen_enjoyer@reddit
Have you tried download the docker compose spec and ask it to read the docker compose spec before doing it ?
https://github.com/compose-spec/compose-spec/tree/main
TheCatDaddy69@reddit
On the other hand i have been having a great experience with gemma 4 31 with openclaw, as an assistant.
dtdisapointingresult@reddit (OP)
I was getting reasoning loops with Gemma 4 31B AWQ 4-bit (from QuantTrio).
It got far enough to create the Dockerfile, then spent 2000 tokens repeating itself trying to decide if it should check if docker is already installed, and whether to use 'docker build' vs 'docker compose up'. I went back to Qwen.
duebina@reddit
I recommend using Qwen3-Coder-Next. It's 80b so get the right quant for your memory. I used to use sonnet but now I use this with Context7 and my own skill router and it's been flawless. I use 8-bit. Also, it depends on how you're using your coding assistant. I am in plan mode refining everything first, and then I have it right at the plan to a file and then I forever reference it. Profit.
the3dwin@reddit
Unfamiliar with Conext7 and "own skill router" could you elaborate on what it is, how you use it and how it has been "flawless". Thanks
robertpro01@reddit
Well, I still consider my self a developer so ... local AI is just a tool, for me qwen 3.6 is a good tool to use, I started vibe coding on Nov 2025, because my previous experience with AI (API not local) were terrible.
For me local AI is just another tool.
I also do a mix of API + local for very complex tasks, and still I validate all the code.
brick-pop@reddit
This. Huge LLM's start to be worth it when you blindly delegate to them. And then have no idea of what the code actually looks like.
Equivalent-Repair488@reddit
This is me, I am not a coder and only trained on python basics.
I use Qwen 3.6 27B on UDQ5 locally on roo code, it still runs into debug loops and is not it for "I want this app build with so and so features." Tried that before and gave up. So I use my free student Gemini Pro plan to architect, to create a build prompt based on the discussion for my vision for my app and guide both me and my local model.
Although the apps always have issue and bugs, even a non coder like me can vibe out simple apps with enough time and testing.
my_name_isnt_clever@reddit
Honest question, are you interested in learning how to code for real, or are you happy being a vibecoder forever?
Equivalent-Repair488@reddit
Staying vibecoder. Ain't no way I am spending another 4 years in college which I was A) never smart enough to enter in the first place B) definitely not smart enough to learn quick enough unlike my peers.
I'm a marketing major student (aka dumb af), although I am trying to take that customer/user compassion understanding + technical affinity (not expertise) to do a product role (I also prefer physical product, but software is bountiful in entry level jobs in comparison, even though it is still like nothing nowadays)
My basic course in python thought me I am not meant for writing code, I did do alright, and better in r for 2 statistics and analytics courses, but it was with a lot of genai coaching and help as well. I am not a very smart person, who is also fearful that they can't even land a business PM role.
So no, not really. With how my first vibecoding app went too, I think maybe even product isn't for me lmao.
Vibecoding is fun because I build small things for myself that work for my own pain points, without years of experience to write said code.
the3dwin@reddit
I never had to use it because I learned to code before the tool came out but NotebookLM is a game changer in terms of learning anything. I suggest you attempt to learn to code with it and when if you do comment your journey when you get anywhere significant.
AI in Education is a game changer because you can tailor to AI to teach you and your specific needs without it complaining or getting tired unlike a college teacher.
Equivalent-Repair488@reddit
Yeah, I'm currently interning in a very niche and complex medical technologies industry and I spammed my gemini subscription with questions and got a very good baseline understanding of the industry and the company's products, and so I just needed my supervisor to clarify misconceptions and hallucinated teachings from Gemini. They said to me on my second month I already knew more technical knowledge than 90 percent of the office.
But coding wise, the process of writing code I have no intention to pick up because I don't find it fun. But code architecting, learning code structuring etc, more concepts and less syntax and exact function arguments or what not, is something that I already try to learn subconsciously when vibecoding. I will not do active journalling though, as I find that a chore as well. I just started this relatively recently as well due to the new quality of open source models (specifically the hype during Qwen 3.5) and maybe in a few years I will pick up the relevant knowledge naturally as I do more of it, and as even better open source and cloud frontier models come out.
the3dwin@reddit
Well I say you are definitely going about it the right way since understanding how the code and architecture works is more important than understanding syntax as LLMs will be handling 90-100% of actual writing the code.
my_name_isnt_clever@reddit
You don't need to go to college for years to pick up coding skills. But alright.
Equivalent-Repair488@reddit
You said to become a coder though.
Regardless writing code I don't find fun, so no, I don't think I will pick up coding skills on the side, but if I continue to stare at VSC for the foreseeable future using whatever local ai agentic wrapper and naturally pickup how software works behind the scenes when I needlesses produce slop apps for myself, then maybe you could count that?
RoomyRoots@reddit
Yeah, add to the stack of tools you use, don't drop everything and depend only on it. It works very well as a document searcher, summarize and drafter. I still rather do things slow and step by step so I can fully understand how things are implemented.
zipperlein@reddit
I think, this is the a perspective problem, not a problem of the acutal models. It depends a lot of how much hand-off approach you have, imo. I like to know exactly what is in my codebase. If the LLM does not make good changes, the direction is fine most of the time. Then I just do some manual tweaking and let it continue. It's a wayyy smaller model, even if it is good for its size.
dtdisapointingresult@reddit (OP)
Yeah I never was a believer in small local models but tbf the Qwen 3.6 benchmarks posted and hyped all over the internet made it seem only slightly behind Sonnet 4.5. I knew it wasn't true, having used both local and cloud models, but I was hoping we had reached the point where it's 70% as good.
the3dwin@reddit
As you have already read in other comments, those benchmarks are only true when the configuration, and prompting are setup correctly that they can reach the same levels.
But if you are expecting the same prompt you gave to claude to work with on the qwen models with wrong or no configuration then no they will not be like the benchmarks show.
ttkciar@reddit
Yah, unfortunately mid-sized codegen models just aren't there, yet. They've gotten a lot better, but the ones worth using are still in the 120B-size class.
With a lot of extra work, Gemma-4-31B-it gets close'ish to GLM-4.5-Air for codegen, but not close enough to make the extra work worthwhile.
Qwen3.6-27B similarly falls short, and that's only if it doesn't overthink (which it still does, way too frequently; wtf didn't the Qwen team fix that with 3.6? It was a well-known problem with 3.5).
TheAncientOnce@reddit
What's your experience with the 120b class models? The bench seems to show that 3.6 27b outperforms or matches the performance of the 3.5 120b
ttkciar@reddit
My experience:
GLM-4.5-Air: Best at instruction-following, which makes it my top pick. I tend to drive codegen with large specifications full of instructions, and Air consistently follows every single instruction in the specification. Unfortunately it is more much prone to write bugs than other models in this size class, but these tend to be low-level bugs, easily fixed, and not design flaws. It's "only" a 106B, but it's competent like a 120B.
Qwen3.5-122B-A10B: Runner-up. It's not bad, but would randomly ignore some instructions in my specification. It writes fewer bugs than Air, but is more likely to introduce design flaws (like using a temporary file, always the same pathname, non-atomically, in a multi-process application) or leaving some functions empty except for a "In production, this would .." comment.
GPT-OSS-120B: Great at tool-calling, okay at instruction-following (though noticeably worse than Qwen), but hallucinates up a storm. I wasn't able to get a good sense of whether it writes bugs or design flaws or not, because I couldn't get past the hallucinated libraries and APIs. How do I debug calls to a library which doesn't exist?
Devstral 2 Large: Very good at not writing bugs, and good world knowledge, but the absolute worst at instruction-following. It would ignore most of the instructions in my specification and write something only vaguely like what I asked for. I had high expectations, since it is after all a 123B dense model, but was hugely disappointed.
I have a hypothesis that Devstral 2 Large was deliberately under-trained, to "leave room" for further training on individual MistralAI customers' repos without overcooking, but don't know.
None of them are perfect, but I find the flaws of GLM-4.5-Air easiest to tolerate. Fixing little bugs is fine, and Gemma-4-31B-it actually finds most of Air's bugs, so that's easy. Ignoring parts of the specification is intolerable. Design flaws that require more than a one-line fixup are a pain in the ass. Hallucinating libraries is especially grievous, because I have to throw everything out and start over, but be sure to describe the libraries it should be using before continuing.
I used all of these models at Q4_K_M, and I know some people will point at that and say "there's your problem!" but frankly I can't tell any difference at Q6_K_M. Did not quantize K/V caches at all.
the3dwin@reddit
Have you written an orchestrator that delegates for each job? Could you share on github?
ttkciar@reddit
No, I just use GLM-4.5-Air.
dtdisapointingresult@reddit (OP)
I can try one of those as my final attempt. Which one do you think would do best at my Docker prompt I shared here? https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiowcxe/
I'm surprised someone is saying GLM-4.5-Air still holds up, and putting it ahead of recent models.
Bird476Shed@reddit
I agree with OP "the flaws of GLM-4.5-Air easiest to tolerate."
Overall, this model is still a reliable worker and a good speed/quality trade-off.
ttkciar@reddit
I have no confidence that GLM-4.5-Air's tool-calling prowess is up to the task of doing it interactively, else I would recommend it. Its tool-calling competence is quite weak, and I have never tried giving it instructions quite that vague and open-ended before.
Your prompt is better suited to a model of GLM-5.1's caliber. I'm having a hard time imagining any of those 120B doing it well, but it might line up with GPT-OSS-120B's strengths. Maybe give that a shot.
If I were to rewrite your prompt for Air, it would include a lot more information (how the app is supposed to work, specific filename for the dockerization documentation, etc) and a lot more instructions for how it should go about compiling the misbehaving wheel. I just have no faith it would figure those things out on its own.
It's a bit surprising to me too, frankly. I keep trying the hot new models, thinking "surely this one will knock Air off its perch", but they just don't, and I keep using Air.
Maybe Qwen3.6-122B-A10B will be "the one"? Or if Google ever releases that 120B MoE Gemma4 they beta-tested, that would probably do it (assuming they fix Gemma4's tool-calling woes).
At this rate, though, it's probably going to be a new Air model based on some version of GLM-5.x (assuming ZAI can repeat the feat).
Karyo_Ten@reddit
My very first agent was GLM-4.5-Air. But when I switched to OpenCode it kept failing tool calls - https://github.com/anomalyco/opencode/issues/1880
Besides, 131K context is just too small when you graduate from small CLI tools.
ttkciar@reddit
You're not wrong. Of all of the models I tried, GLM-4.5-Air has the weakest tool-calling competence, but I work around that by not requiring it to use many tools.
Air's 128K (128 * 1024) context limit is one of the reasons I tried so hard to make Gemma-4-31B-it work. Not only does Gemma4 have a 256K limit, but it also infers a lot fewer tokens in its reasoning phase, so more of that 256K is useful. I'm still hoping to figure something out, but for now have stopped trying to use it for codegen.
What I would really like is if ZAI released a new Air model based on GLM-5.x! Hopefully with 256K context.
PANIC_EXCEPTION@reddit
Qwen3-Coder-Next is still definitely the speed king on local as it is substantially faster than 27B and approaches Sonnet level, which is good enough for a lot of tasks. Tell Opus to make a master plan for a feature, and then use a lightweight local model to implement it using that plan. I find that this is actually quite usable.
Unfortunately the barrier to entry for an 80B model is either having multiple GPUs or having a laptop with at least 64 GB of unified memory. So, inaccessible to a lot of people. If they can juice up Qwen3-Coder-Next to be like a version called Qwen3.6-Coder-80B-A3B, I think it might be able to stand entirely on its own.
27B gets relegated to very specific one-shot questions or very strong image understanding (e.g. translating text from a schematic). Or generating small scripts in isolation. I would never have it run an agent because of just how slow it is.
AMD_PoolShark28@reddit
https://huggingface.co/mradermacher/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF you can have the best of both worlds..... :)
dtdisapointingresult@reddit (OP)
I gave up on Gemma 4 31B early on.
It wrote the Dockerfile and now needed to build it. And I was staring at its output, coming slowly at 12 tok/sec, 3 minutes of reasoning while it tries to decide if it should check if docker is installed or not, and whether to build it via the Dockerfile or the docker-compose.yml (which also builds). I exhaled and switched back to Qwen 27B.
This was an AWQ, but I doubt the FP8 would've been much better.
I really think Terminal tasks are just harder on LLMs than coding. Coding is still just dead text output. Interacting with a running system via tool calls might be a whole other level. 27B gets 35% on TerminalBench-Hard, Sonnet 55%.
dzhopa@reddit
So yes, terminal tasks or any multi chain of tool calls is where your smaller quants fall flat. Minor hallucinations creep into the syntax and state passing between calls as the context grows large.
Code output is a lot easier because it's writing 1 file at a time, and maybe verifying syntax. You get to execute it later and fix that typo or hallucinated bug in a whole separate call. For terminal work it's passing a precisely formatted string of commands along with terminal output into the specific structural format needed for the LLM and harness to process the tool and then string the commands, syntax, context and structure together between potentially dozens of simultaneous calls needed to complete the task.
That's the real big problem Anthropic has spent a lot of time and money to get right, and it shows when you just ask Claude to "download this package from github and spin it up for my users as a docker container". Those Claude calls are stupid expensive for terminal tasks though.
rothbard_anarchist@reddit
Terminal is definitely harder for any language model. Even on Codex 5.5, it boggles my mind to watch it sometime ponder for three minutes straight how it should open a CSV file.
IWasNotMeISwear@reddit
The core members left the company I think
the3dwin@reddit
My current workflow in short:
Use cloud models for large tasks and use local models for small tasks.
Measure what to use for each based on your own personal results you know the local models do reliably.
Use your specs and prompts as "libraries" (usually imported into your codebase) where the prompts are optimized to implement on your specific projects and not dependent on how other projects are setup so that it reads your codebase and implements reliably.
I still think that training AI ( unsloth.ai ) for specific use cases is the best way to get local models to code, I just haven't started yet.
StirlingG@reddit
Nice try Dario
dennis_linux@reddit
My Technical Read (Inference)
For my setup specifically (M3 Ultra 512GB + oMLX), I would not generalize this post to myself.
I am in a different class than many LocalLLaMA users.
Local coding is weak when used as:
❌ “Replace Claude Code entirely with one local model”
That often disappoints.
Local coding is strong when used as:
✅ Hybrid stack
Use local for:
Codebase understanding
Refactoring drafts
Test generation
Private repo analysis
Fast iterative copiloting
Agent experiments
Use cloud for:
Hard debugging
Complex multi-file changes
Long-horizon coding agents
High-stakes architectural generation
This hybrid model is where many advanced users land.
For my stack specifically
Given my models:
Qwen3-Coder-30B-A3B-Instruct-4bit
Qwen2.5-Coder-32B-Instruct-MLX
Qwen3.6-27B/35B
Kimi K2.5 MLX
I would use:
Task Best choice
Local code generation Qwen3-Coder-30B
Repo reasoning Qwen2.5-Coder-32B
Agentic experimentation Qwen3.6-35B
Heavy cloud-like coding fallback Claude Code / frontier API
This is much stronger than what many Reddit posters are running.
What I think the thread really says
Not:
“Local LLMs for coding are dead.”
More like:
“Poorly configured local coding stacks lose to polished cloud agents.”
That is different.
And honestly, that has been true.
My blunt assessment
For coding today:
Use Case Local Wins? Cloud Wins?
Privacy-sensitive code Yes —
Cheap heavy token usage Yes —
Best coding agent autonomy — Yes
Fastest solo dev productivity — Yes
Experimental multi-agent hacking Yes Mixed
Offline resilience Yes —
Verdict: Hybrid wins.
That Reddit post is a warning against ideological “all local” thinking, not an indictment of local models.
Hyp3rSoniX@reddit
Did you make sure the Agent is actually reading or has the AGENTS.md file auto-injected into the context? Claude Code for example only auto-injects CLAUDE.md files into the context (and also a MEMORY.md auto-memory file). You need to reference other files from inside CLAUDE.md to have the harness parse and inject other files into the model context as well (like AGENTS.md).
Some harnesses don't auto inject any such files into the context. Then it is left to the model itself to read it - the problem then however is, that models these days rarely read full files. Instead they read a few lines and then they're happy. If that happens with your AGENTS.md file well... then good luck.
Also depending on the Harness you use, there might be added some hidden System prompt, that tells the Agent god knows what. The smaller the model, the easier it is to confuse it. Bigger "smarter" models can be more resistant to confusing information.
dtdisapointingresult@reddit (OP)
Yes, I tested every harness, by asking it to repeat the instructions to me. QWEN.md for Qwen Code, AGENTS.md for Pi.
And yes, every single harness has its own system prompt. The Qwen Code one is like 18k tokens, while Pi is under 1k IIRC.
ebra95@reddit
It depends on the quantization a lot. Anything <Q8 is pointless.
Gemma I tried directly from them, not locally and it still failed most of my basic tasks.
What was good and I recommend is IBM Granite. I don't know scores, but out of all models I tried (<4B parameters Q4K at most) it was the only one could write code snippets properly.
No good model <200B parameters though for ovevrall thinking.
You can use a high model for planning/thinking and the local for execution.
But not worth yet, in my opinion.
alvisanovari@reddit
I mean, I know this is LocalLLaMa and it sounds like blasphemy, but I will say that if you are using local models for your primary coding, you are needlessly handicapping yourself. There is no reason to use local models unless the cost/privacy trade off is big enough. And I would argue it almost never is because most of your stuff isn't that important and its much better/cheaper to pay for Sota intelligence.
dragonbornamdguy@reddit
I love my Qwen3 Coder Next, its on sonnet level.
ElephantWithBlueEyes@reddit
I gave up on local LLMs for everything because they're dumb. Cloud LLMs are dumb as well. They just suck less
dtdisapointingresult@reddit (OP)
Touche.
Randomshortdude@reddit
dtdisapointingresult@reddit (OP)
I'm bitching about the small local models because for the past month this sub was acting like small Qwen 3.6 was just a hair behind Sonnet. Tbf Qwen's benchmarks were showing that too.
I've always used small local models for non-coding tasks, but never took them seriously for coding because I felt they were too weak. That is, UNTIL the latest wave of hype started. Check out the top monthly posts on this sub: all small Qwen glazing.
So I fell for it and gave it another legitimate try, for several hours, and reality was completely different. At least, when using the same standard harnesses 99% of people use (Every *Code app you can think of).
I'd appreciate you posting your advice and prompts, even if it doesn't help me it will definitely help another reader. there's more comments on this than any other post in recent memory.
Randomshortdude@reddit
Additionally consider things at the software level that may enhance the inference quality / caliber of the models that you're dealing with.
Ok_Dragonfruit_9989@reddit
if you have m3 ultra 512 gb and running a full sized model, how would that play? its only fair to compare a trillions parameters model to other trillions not to micro 32B model etc
DeckRdt@reddit
I had claude summarize your post and it was very fast... just bounced back a tldr;
AlphaPrime90@reddit
There should be flare named " low effort " for this type of Posts
gentoorax@reddit
Im using qwencoder3 30B in vLLM and its blazing fast. Performance wise in terms of response speed it beats the frontier models easy. I feel like everyone using local models is using ollama or something.
BestSeaworthiness283@reddit
I hate it too tbh. I have a poor rtx 4060 with 8gb of ram and i cant code with it due to limitations to the context windows. I mostly wanted to use local models for qhen the imevitable happens and i get hit with the usage limit on claude.
My first approach to this was asking claude to make a skill that delegated things like code generation and stuff like this to a local agent with just the right context, meaning the code from line to line and what to modify. And itself made a tool that it ran everytime and saved me a lot of usage.
With the last approach it was still possible to run into usage limits. So i made a lightweight CLI tool for local LLMs with just 8k token context windows. It works by initialising the codebase with a map of the whole codebase. Then when you query or ask to create or delete something it goes and orchestrates a plan and tasks to solve the problem. It then does a llm request for every task and the task is no more then the context of 8k tokens. The system now has memory and you need to aprove the changes and uses a type of diff.
I will let you check it out if you want: github.com/razvanneculai/litecode
AliceCD1@reddit
Tenho essa mesma placa rtx 4060. Obrigada pelo seu comentário, vou tentar sua abordagem. Os modelos que tento rodar localmente sempre ficam lentos, pode ser minha ddr4 de 3200mts.
BestSeaworthiness283@reddit
Thank you for the interest.
BestSeaworthiness283@reddit
Thank you very much for the interest in the project!
dtdisapointingresult@reddit (OP)
Your approach is great, and workable. This is the degree of LLM wrangling needed to get something out of these small models I think.
BestSeaworthiness283@reddit
Thank you very much!
thewhzrd@reddit
I find a local qwen 9b to be just fine at handeling email and cal tasks. Given. I build my own cal and email tools for Gideon to use.
Tartarus116@reddit
You're doing something wrong. I've had Qwen3.5-397B Q2 build docker images on gitea runner, push to gitea container registry, and deploy to Nomad with the latest tag on every merge.
Now, both Qwen3.6-27B & Qwen3.6-35B-A3B can do it.
_W0z@reddit
Have you tried oss 120b?
cohesive_dust@reddit
Reality sets in. I went through same drill as you. I'll try again in a year.
Techngro@reddit
Off topic: This is exactly how I feel about Linux.
mister2d@reddit
So...skill issue on both accounts?
Techngro@reddit
Yup. The same skill issue that makes people return to Windows over and over and keeps desktop Linux at a barely noticeable market share. 🤷♂️
mister2d@reddit
Ah. You would have had me if you said users returning to Mac.
Returning to "Windows" explains the core issue.
FUS3N@reddit
I feel like its kind of being disingenuous when you put it like that yes windows has issues but desktop Linux isnt there yet too, I have it dual booted but for most people its not the same experience when you always look from the view of someone who knows stuff here and there.
mister2d@reddit
I'll let you in on something.
"Desktop Linux" is only a thing to Windows and Mac users.
FUS3N@reddit
No ones denying that linux is used everywhere in infrastructure all over the world, what is your point desktop linux shouldn't exists or doesn't? Because I think that's what most fans want they want Linux to be friendly enough for everyone to use that is what that "desktop" part is just implying everyday use for tasks.
Techngro@reddit
That's a point I've always made about Linux. The only way it works for the average user is if they never have to actually touch Linux (e.g. Steam Deck). But having users be completely removed from what Linux is and subsequently having zero knowledge of how to use Linux doesn't strike me as the win for Linux.
mister2d@reddit
My point is that desktop Linux isn't a thing.
Techngro@reddit
I was just speaking from my personal experience. But whether its Mac or Windows, we both know the reason people return after trying (and I really gave it a fair shot this time) Linux is the same.
letsbefrds@reddit
I'm flipping and flopping btw mac (,I use a mac at work and I have a air) my home desktop is duo boot Linux/windows.
They all have their strengths. I'm really trying to use Linux (Kubuntu) and for 90% of the stuff it's fine. But when it's bad it's really bad. For example just organizing my photos and deleting images of my memory card it's super slow. Works just fine on windows and Mac. That 10% is pushing me towards full macOS.
Mac : I can't get use to the weird cmd c cmd v. My god the screen splitting is awful.
Windows : I use it to game that's about it. Copilot dog shit and ads in windows is really killing the vibe lol
Techngro@reddit
Instead of going back to Windows 11 Pro, I went with Windows 11 IoT LTSC. It's already debloated from Microsoft. No Ai, no Copilot, no ads, etc. I am very happy with it. You should give it a look.
mister2d@reddit
I didn't miss on any points. Go and enjoy Windows.
dtdisapointingresult@reddit (OP)
idk if I agree with that. Linux is predictable. It's the same stuff working predictably every time.
Just stick to Ubuntu LTS instead of meme or rolling distros, don't install random drivers.
iMakeSense@reddit
Pick one of the 3 top wifi cards on Amazon, I promise you only 2 of them will work and one of them will probably require some weird ass drivers.
my_name_isnt_clever@reddit
Is this comment from 2011?
kyr0x0@reddit
Calling the Pipewire mess predictable is kind of a stretch ;) Audio under Ubuntu is highly unpredictable. Sometimes it works, sometimes it doesn't. It was more stable with ALSA in the early 2000s...
simracerman@reddit
Funny I had a conversation with a work colleague about this today. I concluded that I’m still burnt from last year’s experience.
No-Consequence-1779@reddit
I’m also a professional (employed) software dev. I have found vs code with the kilocode extension and qwen3.6-27b to be very good.
I use visual studio enterprise for work and copilot (various models available).
I am starting to prefer switching over to vs code (both apps open on same directory and git repo) and asking or agent coding - atomic changes of course. Only a fool would let an agent loose on multimillion dollar software.
Sometimes it’s actually faster and many times it offers a superior solution. Sometimes is similar. It’s never worse.
Though qwen3.6 is a game changer.
Ordinary_Breath_8732@reddit
the Docker timeout issue really shows the decision making gap a decent model checks if the process is still running before assuming failure but smaller local models just don’t have that layer yet. the AGENTS.md robot treatment is basically admitting the model can’t self guide lol. OpenRouter with the ability to model hop when one annoys u is the right call honestly
otacon6531@reddit
I am not seeing the loss. It isnt opus 3.6, but qwen3.6:35b and nemotron-3-nano with vs code chat extension and autopilot on is pretty darn good.
Personally my development stream is more pipeline oriented. I use the vs code's chat extension autopilot as my orchestrator for everything. It even makes gpt 4.o pretty decent and dont let me get started with gpt 5.0 mini and how it fixes its constant questioning.
I define and refine requirements in azure devops with nemotron reviewing and perfecting the requirements. Then I vhange the status to ready and qwen 3.6 ia queued to complete the development. I then do testing and require revision via the pull request. Between developments I have 3 processes running on a queue. Code review, autonomous code optimization, and security audit. It isnt perfect, but I want these apps writing themselves.
The big benefit..... it costs way less and I dont have to worry about tokens. My focus is on optimization.
AdUnlucky9870@reddit
yeah the gap is still pretty massive for agentic coding. i use local models for quick one-off stuff but anything multi-file or context-heavy and its not even close yet. maybe qwen 3.6 closes it but im not holding my breath
Wild_Milk_2442@reddit
Not saying this would solve all of your problems but I use qwen 35b hours a day and would never consider using it in a bloated harness like Claude code.
Claude code system prompt is over 24k tokens. That's huge for a local model.
Opencode is much better at 14k~
But the real unlock is pi.dev. pi.dev has a system prompt under 1000 tokens. Even the slowest models feel fast on that. It only has 4 tools: Write Read Grep Exec
That's all you need. With those 4 you can do anything.
The smaller harness makes a massive difference with smaller models.
xKiiyoshiix@reddit
Hellow,
I think you can give the local ai thing a last try and i will you say why...
I'm so fine with Google Antigravity but when i need to do some private information like things with AI, i'm not fine to use Antigravity to leak my whole life *you know*, so thats why i started to work with local llms like Gemma4:e2b, Gemma4:e4b and a good Gemma uncensored version.
At the moment, i'm in full development, all day, working with it, fixing and adding new features but no unnecessary things. When youre interested into my project, you can write me.
At the moment i'm hosting it on my own gitlab server but, i think i will publish the repo when i think, "ok its good, we can publish it now", at that point you can help improve and extend the project.
What i have made:
- TUI App
- Web UI (like chat app with some extra things)
- Functionalities like, qdrant vector saving memory, creating projects, write files.
- Having Skills and Tools that are 100+% faster then OpenClaw things because the Tools and Skills are written in Markdown with the functions of javascript combined (sounds wired but works well)
- Automatic learning from errors/problems.
- MCP Server supported (ex. context7 or GhidraMCP)
- STT (Speech-To-Text from local whisper)
- Document upload in chat session
You're asking now "Why do you think, is it better then OpenClaw?"
The answer is simple: My Project is completly optimized to use ollama with small LLMs like on my RTX 4070 12GB VRam so good on 4B, 7B, 9B Parameters.
yes-im-hiring-2025@reddit
Huh. I successfully ran a minimax 2.7 + Qwen3.6-35B-A3B setup on an m3 ultra for claude code yesterday. Ran fine lmao, about 50 TPS average. Takes a little tinkering on preprocessing batch sizes, cache enablement, context management and params etc. There's an unsloth guide for it iirc.
Sonnet level ish quality overall, not mad at it at all.
tgsz@reddit
Almost all of this is your harness and likely context window+tuning. I've had the most success with cline configured to use qwen3.6-27B as an anthropic model (modified the model list to add a record for it) - the prompts and tool calls + anthropic compatible API and prompt caching really make it feel like using sonnet 4.6.
It's a shame that the harness is so important to the model but that's modern LLM coding.
Using Claude Code with ENV variables pointing to your local endpoint is also a great way to go if your comfortable using CLI vs using opencode.
Imo local models really shine when they are driven by a frontier model, so use 5.5 or opus 4.7 as the main arbiter and save yourself the babysitting.
joeprovence@reddit
I set up Claude code to be able to query Gemma (and Gemini for that matter) to double check code, more times than not Gemma finds things! You can add it to your mcp via Gemini API.
cocoa_coffee_beans@reddit
Yes, local models fall short for coding.
That’s not all. The ecosystem is quite fragmented:
OpenCode is broken with vLLM ever since vLLM deprecated the
reasoning_contentfield forreasoning.Open WebUI still handles reasoning like it’s early 2025.
Vendor specific tools such as Codex and Claude Code constantly break against local inference even if you provide their respective APIs, because vendors are constantly iterating their client.
If you’re not deeply entrenched in the specifics, you won’t squeeze the performance you need for coding. For most people, it simply isn’t worth it.
Caffdy@reddit
and try to mention any problem you got on their github and see if you're not at best get ignored, if not being told how wrong you are, "just read the documentation" or straight up blocked
hwpoison@reddit
I guess all of us in this community are trying to see the good side of the local models, trying to convince ourselves about they are useful, but the reality is that they are so far behind of big models and the majority of cases are shit. This does not discard that are continuously improving but that's how the things are right now.
dtdisapointingresult@reddit (OP)
It's really just the hype for the past 6 weeks of models that made me bite.
Simple as.
ilt1@reddit
What is your hardware+software setup that drives your local models?
markole@reddit
It is irrational to compare a 27B model running on a single GPU and a multi trillion model running on clusters of GPUs that cost more than your retirement fund.
droptableadventures@reddit
And also to conclude that that's as good as "local LLMs" get.
It's good for its size, but it's one of the smallest notable models out there.
markole@reddit
I wouldn't be too surprised if we get a 70-120B model as strong as Opus 4.6 in 12 months or so. Remember what we had last year in April.
droptableadventures@reddit
It'll be even easier to get one that's as good as Opus 4.7 :P
(it appears to be worse than 4.6).
CooperDK@reddit
It is probably you. They are actually quite good. Prompting is 50%
mr_zerolith@reddit
It's not a surprise to see anyone unimpressed with a \~30b model for coding.
wise_young_man@reddit
“Gave it a fair shot the past few weeks” 😂😂😂
Sad_Yellow6662@reddit
thank you bot
modadisi@reddit
Kind of obvious with around 30Bs
edsonmedina@reddit
To me it sounds like no one is wrong in this thread, they just have different expectations.
Some people use LLMs as tools to speed up/improve their coding/reasoning and do just fine with local AI.
Others expect LLMs to do the thinking and take decisions for them. Nothing wrong with that, but for those people local AI is definitely not there yet.
The latter group does have a problem though: I'm not sure these gigantic models are even economic viable (at least currently) so you might face even higher prices. The scale required to run them is simply insane and someone needs to pay the bill.
SaucyBossy1919@reddit
the economic viability problem is real but it's actually a visibility problem first. you can't make the right call on model selection without knowing delta cost AND delta quality together. most teams only see one side. Traeco ships both numbers with every recommendation. traeco.dev
MexInAbu@reddit
This. A couple of years ago we were doing complex coding without any LLM assistance whatsoever. So having something like Qwen 3.6 is an incredible production multiplier.
Maybe I'm and old jaded man yelling at clouds, but all these talk about letting a complex model do the planning is crazy talk to me. I most (almost all) of the planning myself and a significant part of the coding too. When I let the LLM to write code autonomously I give very detailed instructions approaching pseudo code. Small LLM are very good at speeding up my work.
Now, I do use the frontier models to help me plan complex plans, solve complex problems and find out known methods and tools, though.
rainbyte@reddit
Yup, there is something with high expectations. Here I also use Qwen3.6 and it helps to automate the things I describe to it, but I have them in my mind first.
smirnfil@reddit
Cost of running models drops every month. And the performance of all models goes up and up. So it is reasonable to assume that 2 years from now current state of the art level of gigantic model based development (something like Opus 4.7 xhigh + claude code) would be affordable and easily available. Will it be through a cheaper cloud service or through local model doesn't really matter.
edsonmedina@reddit
Yes, but the hundreds of billions spent meanwhile will have to be paid back to investors. All while chinese labs undercut their prices by 80-90%.
snowdrone@reddit
This happens all the time in investing and the investors will take a loss. They do not "have" to be paid back, they took the risk.
smirnfil@reddit
Who cares about inverstors? All I care is that the current level of AI coding allows me to drop hand writing of the 95% of the code and develop faster and better than before. It is easily available right now. And I don't see any reason why the cost of the current setup would become unmaintainable in the future.
my_name_isnt_clever@reddit
People are absolutely wrong in this thread, anyone who says Qwen 3.6 is unable to do basic tool calling is messing up somewhere. But there are so many variables to local LLM use, it's not very productive to debate without any details.
SaucyBossy1919@reddit
the 100k token accumulation per turn is a config problem, not a model problem. every tool result appending to context re-bills everything before it. the fix is knowing before you run which steps actually need full context and which don't. Traeco surfaces that attribution so the decision isn't a gut call. traeco.dev
Western-Image7125@reddit
I stopped reading at Qwen 27 and Gemma 31, for any coding tasks which are slightly harder than basic there’s no point at all using any model of this size. Might as well code by yourself if you cannot use Claude and maybe prompt for small self-contained snippets of code.
dtdisapointingresult@reddit (OP)
Yeah, that's what it's looking like.
In my defense the past month showed Qwen 27B nearly coming close to Sonnet in many intelligence benchmarks. Guess I just learned how little those scores generalize to something like doing docker work on its own.
Western-Image7125@reddit
Benchmarks are one thing and lot of models are very finetuned towards benchmark. General purpose capabilities come from bigger models and I’ve very rarely seen this to not be the be the case. Of course a new small model may beat an old big model but a new big model there will rarely be a contest. So it really depends on the problem you’re trying to solve. Clearly people need to still need to code at least a bit without AI assistance
ANTIVNTIANTI@reddit
build your own harness so you have more control than you ever wanted to! lololol like for real, i’m often stuck in analysis paralysis cause, just so many little things not even touching the prompt much less the system prompt lol!!! i’m experimenting with a coding harness not meant for coding entirely, more like, “get me the last 10-20% of the project finished, polish my shit”, so hyper focused on the final stages not the building stage. will post if it’s successful. feels incredibly promising. i’ve built a few chat + harnesses and have had fantastic success there, this new one is super exciting! i’m slow though, only use ai for the things i need to. lol. good luck all! much love to my local homies!!!!!!!!!!
AllanSundry2020@reddit
i reckon this post is ai generated by one of the big players and they are loosing too many of us
dtdisapointingresult@reddit (OP)
Because you're regarded.
Why was my conclusion at the end to use a large open weights model like Kimi then? Why didn't I suggest Anthropic or OpenAI?
AllanSundry2020@reddit
you think it's cool to use terms like that even if Finlay disguised? jerk. I think you are rumbled
DangKilla@reddit
Have Ollama launch claude code and use Superclaude with Gemma4 31B. Done. Past that you need mcps and skills.
dtdisapointingresult@reddit (OP)
What Supercclaude MCPs/Skills would have made it less regarded?
Remember this isn't a software development project, I don't want to spend 30 minutes talking about architecture and tests. It's for one-off quick tasks. Look at my prompt here: https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiqq2gr/
DangKilla@reddit
I made a custom MCP skill for podman (same as docker) by copy/pasting the CLI manual/docs output, for one, instead of bash calls.
ecompanda@reddit
OS and Docker are a brutal showcase for local models because one slow build pushes them past their expected timeout, and the moment that happens they invent a failure reason like 'torchcodec must have failed' instead of just tailing the log.
Ruin-Capable@reddit
Regarding the learning bit. I tend to keep the agent open in a terminal window. Whenever I think I'm about done with a file, I'll ask it to review the changes to a file. I'll make the changes, or explain why its feedback is wrong. Rinse repeat. It's quite helpful in catching things I've overlooked. I'm not totally sold on unleashing the agent to write code for me.
Curious-Function7490@reddit
I'm semi in the same spot. I am running qwen.2.5-coder.32B locally on an RTX4090 using llama.cpp and getting 30 tokens a second. I set this up because I was tired of using up Claude's tokens on one of my projects.
TL/DR.
The more helpful LLMs (Claude, etc.) that are really effective won't be affordable in the longterm. The companies providing them are running at a loss, there is an AI hype bubble and we are already understanding that they are unaffordable and problematic to depend upon.
I think understanding how to work with local models is viable and it will come back to being more hands on.
So, I'm more or less going to nix Claude from a lot of its activity in my codebase and learn to work with open source models that I can host myself. It won't be as productive as using something like Opus but it will be viable for the longterm and relevant for the job market.
ebonggio1990@reddit
Do you think that a long-term solution is to have a 4090 in each home?
StatusAnxiety6@reddit
Dunno i just use a qwen3.6 27b 3bit(?) Have to look..... with default setup w/ pi and mine works well
oldschooldaw@reddit
I quite like reading posts like this, it is the antidote to the shit I see on Twitter constantly about people using xyz claw variant #1337 with omega-amazing-distill-opus-3b on their third Mac mini while they escape the permanent underclass. It helps really remind me the reality is actually in the middle.
pkmxtw@reddit
Just downloaded IQ1_S on ollama 🦙 running at 3 tk/s. This thing totally replaces Opus 4.7 and I'm canceling my CC sub! Big AI labs in shambles... Starting my new all-AI startup with 10 claw agents now 🚀🚀🚀. If you aren't learning about this, you are 100% left behind!!!
switchbanned@reddit
Can i buy your guide?
GriffinDodd@reddit
Agreed. I have actually managed to build a quant trading bot without being able to code myself, but it took 2 months of endless frustration and micro managing various LLMs. I started with Opus 4.6 on a Max plan which built most of it. Once Claude killed that access I tried going local with Qwen 3.5 122B which can be ok, but needs even more hand holding. These days I copy and paste in and out of supergrok so I’m much more involved in the execution even though my coding ability is close to zero.
So I did manage to do ‘the thing’ but certainly not in the way those terrible hype posts claim on Twitter.
andy_potato@reddit
I wish I could upvote you more than once.
Zeeplankton@reddit
I always thought twitter was better than reddit, until I got a twitter account. That place is like linkedin with toxicity turned up to the max.
gameboyVino@reddit
Deleting twitter is truly the answer here
CondiMesmer@reddit
Pretty much. Even if using Claude Opus 4.7, you still need to heavily supervise the output. That's just the flow of coding with LLMs tbh
geldonyetich@reddit
More power to you, OP.
I think the point of local LLMs isn't so much that we expect them to do better than a frontier model running on the cloud.
Rather, it's exciting how close you can get, secure in the knowledge that if you had to do without Internet you could.
devshore@reddit
The purpose of localLLM isnt to get the "best performance" while "saving money". Is that what you thought? Its to combat the anti-christ (Zuckerberg, Gates, Altman, Thiel, etc)
Yugen42@reddit
For me the main intent would be independence in all its forms. If that isn't at least part of the equation then local LLM coding is not for you right now - it seems to lag behind in quality and speed at least by an order of magnitude. I'll keep working on it though, I'm sure it will be way better in a 6-12 months as well.
Alternative_Ad4267@reddit
Are you using Qwen with Qwen Code? That’s a real improvement. With Claude is dumb, use it with its own coding tool.
dtdisapointingresult@reddit (OP)
Yeah, Qwen Code + Qwen model. I figured it would be the most optimal.
I also tried Claude Code + Qwen model, figuring Qwen must have trained on the CC prompt too. It's possible this gives the best results, but I didn't have the patience to do it. Claude (the app) hides the model's reasoning, so you end up with these long pauses with zero output. I want to be able to interrupt the request immediately if I see the model is on the wrong track, my LLMs are too slow to just wait for a final result.
Alternative_Ad4267@reddit
What about OpenCode or AnythingLLM? Those works more or less fine with Qwen.
tecneeq@reddit
Still using Qwen 3.6 27b.
layer4down@reddit
Two things can be true: Opus can be best-in-class overall and local models can be good enough for the need. If you can afford best in class and don’t have any other coding constraints (typically imposed by an employer or customer) then I don’t see why someone doesn’t just use it. That doesn’t mean local models are not productive for anyone else even if they’re not necessarily getting everything they want from it it’s just making do with that’s available and moving on with life.
I haven’t personally used Opus much though I did use Sonnet quite heavily throughout 2025. When Zhiphu launched their GLM-4.5 promo in October ($360/yr) I dropped Sonnet and never looked back. My understanding is that Opus lets one outsource a lot of thinking and decision-making. Guess since I have access to it now at the job I’ll try it out but I can’t personally see the benefit of becoming dependent on it especially given the current climate.
dtdisapointingresult@reddit (OP)
My post is about small local models, not local models in general. The recent hype this past month was about how these 27B/31B models are closing the gap on Sonnet and it's just not the case. My solution I mention in the TLDR is to switch to the larger open-weight cloud models that one could run locally if they had the hardware. So basically I'll be doing the same thing as you, just probably with Kimi, GLM5.1, etc.
layer4down@reddit
Fair enough. I hadn’t caught the entire post yet. My own goal is to move off of GLM when my 1yr expires in October. I already have access to an M3 Ultra 512GB and my daily driver is M2 Ultra 192GB so I think I should have enough local resources between the two.
dtdisapointingresult@reddit (OP)
Lucky you, if you have a 512GB you can easily run GLM5.1 at 4-bit, although I don't know what the speed will be like. The reasoning on these large models makes slow speed painful.
layer4down@reddit
Right so for instance I enabled DFlash Speculative Decoding on qwen3.6-27b-bf16 using oMLX and went from TG 11tps to 62tps, but the catch is that was at pp1024, so I need to do more testing to find a max TG/PP speeds and I’d be happy be be at 15-20tps frankly.
Assuming algorithmic improvements in inference capabilities like speculative decoding and speculative pre-fill continue over the next 6-12-18 months, GLM at home is sounding more feasible by the day.
LittleCraft1994@reddit
What i am not understanding that how you can compare are 21 or 37b modal to the claude ?
Bro ofcourse you will want to burn the workstation
Atleast go with minimax 2.7 , kimi 2.5 or glm 5
dtdisapointingresult@reddit (OP)
In my defense there was a lot of hype because benchmarks were showing Qwen 3.6 was sorta close to Sonnet 4.5 in ability.
I never believed in small local LLMs for coding until Qwen 3.5/3.6.
It might be true for the 122B, I didn't try it, but it's definitely not true for 27B. And 35B is far worse.
LittleCraft1994@reddit
Yeahh for me qwen and other small models below 70b behave like children sometimes awesome performance and sometimes act like they are infants given phd to do
Firenze30@reddit
I stopped reading after the Docker example. Failing to containerize a repo based on its README is not a model limitation; it’s a setup failure. That task is well within the capabilities of 9B and 27B models when configured correctly. The issues you’re describing sound like a harness or context management problem, not a fundamental lack of model intelligence.
dtdisapointingresult@reddit (OP)
I think if my only prompt was "Dockerize this repo" it would've succeeded.
But I made it clear I wanted it to run the docker build on its own, and also warned it about a future error that it will also have to fix. The sort of info I'd give upfront to a human intern. Certainly the minimum of what I expect from an agentic coding assistant.
I'm sure if I spent a few weeks making my own Pi extensions things would've worked out, but it's just much easier for my time and sanity to switch to Kimi in the cloud. This isn't the post-apocalypse, I don't have to use tiny models if I don't want to.
AlphaEdge77@reddit
> the prompt cache's always being hit.
Yeah, this is by far the biggest issue, and very rarely tested. The first couple prompts is nothing. The Youtube bros need to ask 10, 20 questions in a row, and see if it holds up.
I find local LLM's useless because of this.
Made me want to delete LM Studio once I discovered this.
dtdisapointingresult@reddit (OP)
Tbf that's the client app's (harness) fault for modifying history. But there's no easy way to tell in the prompt cache is being broken.
I think the people recommending completely unknown harnesses ITT are right about having to use one of those simpler harnesses designed for small models.
What the ideal local harness should have :
None of the popular ones have this, not even Pi. They are all designed for cloud models and don't give a shit about prompt cache.
jonydevidson@reddit
Local LLMs that you can run on a MacBook are right now around the level of quality that Sonnet 3.7 used to have in early April last year.
They are currently a full year behind frontier LLMs.
In April last year, I wrote here or some other AI subreddit that within 2 years we will have models that have these capabilities available for laptop use. It happened within a year.
So given that track record, you can expect models with roughly today's frontier capabilities available locally on laptops in this time next year.
Because people developing these tools are using the frontier LLMs today which are nothing short of magic. GPT 5.5 with Codex is magic.
And the frontier will continue to improve.
So relax, chill out, use whatever works for your current task. A year ago, you simply didn't have local agents like Qwen 3.6 27B. Now you have them, and they work, they can edit files consistently, analyse codebases, documents, and answer questions, all completely offline.
If GPT 5.5 does all your work needs, you'll be able to do it offline next year.
norebe@reddit
You're going to have to put time in if you want to roll your own. There's no evidence from this mountain of words that you have any knowledge of what that means or that you tried to do anything but put together a harness and model that weren't designed for each other and expect things to work out well.
dtdisapointingresult@reddit (OP)
OK. Short of spending a month hand-crafting my own harness through constant trial and error, where are the guides on how to do this?
I expected the Qwen Code harness to be able to use a small Qwen model. That was clearly a mistake, the harness seems designed for their Qwen Plus/Max cloud models.
What was I supposed to have done? There's 50 vibecoded harnesses released every week. They get a few commits and then get abandoned, or they're under-documented, or they're trash. I don't have time to try all this. My goal in using the LLM is to spend less time on the computer.
m3kw@reddit
lol, you go and use stuff like GPT 5.5/Opus 4.6 and then you go use a local LLM with "Great looking benchmarks". It's the difference between looking at a vacation on Youtube vs actually going in real life.
dtdisapointingresult@reddit (OP)
Instagram vs Reality, huh?
Alternative_Nose_874@reddit
Yeah, local models still feel like a rough draft compared to cloud stuff like Claude Code. The timeout and tool-calling issues you mentioned are spot on - I’ve had Qwen just bail on a task because it "assumed" something failed without actually checking. Dockerizing repos should be straightforward but these models often miss the basics or get stuck chasing weird errors. I guess until we get better integration and smarter task handling, local LLMs are more of a pain than a help for anything beyond simple stuff.
Savantskie1@reddit
Trust me it’s a prompting issue. He’s giving vague instructions and not guiding the llm and just assuming it can do it all. That leak of Claude Code should have proven to everyone that prompting is the key.
Savantskie1@reddit
You’re prompting wrong. I use the Qwen 3.6 35B A3B, and I rarely see problems. Granted you have to have a lengthy system prompt to steer the model, but every time you run into problems, you simply have to adjust the system prompt to limit or give instructions. The harness has to have good instructions in tool descriptions too. Otherwise the LLM will make assumptions and just use any tools or flat out delete entire folders. Trust me, it’s a you problem. How do you think Claude Opus and Sonnet is so good? It’s not training data so much as instructions in system prompts that make it seem easy. They are prompted to hell and back to what they can and cannot do.
Iory1998@reddit
I am not a coder, but I totally understand and share your feelings. I find myself going back to Gemini-3.1 or Deepseek v4 for better replies, or I start a conversation with them, copy it to LM Studio, and continue with a local model like Qwen-3.6-27B-Q8 or Gemma-4-31B-Q8. This seems to give them a bit of an edge.
But, I use them mostly as an inner voice that helps me collect and organize my thoughts. When I need serious sanity check, I go back to the top Gemini or Deepseek (I like to vary the sources). Perhaps, if I could run larger model locally, it would have been much better...
And you are right regarding wasting time. You can get good outcome with small local LLMs, but you spend more time and energy. If you are tight on time or you need to make a lot of decisions, just go with the best model you have access to. People have limited deciding making capacities per day, and it doesn't matter whether you decide on trivial or serious matters, you spend the same energy deciding.
dtdisapointingresult@reddit (OP)
I do like this idea of planning with a frontier model like Deepseek then going back to the local LLM to take over.
I just really thought a big dense 27B would be able to thinking like this on its own. I guess not.
Iory1998@reddit
Not gonna happen even if you had a dense 70B. Parameter count is important to hold knowledge, and there is not limit to knowledge. Unless we get an architectural leap where knowledge is a separate module the logic/cognitive engine can use while thinking, we will never have 27B model better than a frontier model.
lqstuart@reddit
Claude Code isn't an LLM, it's a Typescript package. KV cache hit rates and client timeouts are also not part of an LLM. It's kinda like running the S3 API on top of my laptop's filesystem and then complaining that it isn't as good as Amazon S3.
celsowm@reddit
Good for you ! But I am in a fucking country Brazil that our economy was fucking destroyed in just 3 fucking years so even 20 usd bucks is expensive to us
More-Curious816@reddit
You compared a trillion+ parameters model with 27 billion and 31 billion models? Of course you will notice the disparity. Try the big open source models and come back.
andy_potato@reddit
Lots of people on this sub claim that the Qwen3.6 27b model is on par with Claude. OP therefore specifically selected this model for their comparison.
Nobody doubts that a model like GLM 5.1 can achieve performance in the same ballpark.
More-Curious816@reddit
Well, these small models certainly can replace Claude for most of your uses, unless you try to build something complex, big and has extra wheels. OP said he regularly use CC in his job, which mean 1# it's something more complex than the average person vibe coded app, 2# he is accustomed to Claude way of working and thinking.
Solid_Error_6401@reddit
Simple Answer - Claude does lots of moderating and understanding before even calling the actual model. Most of it is using some other model or hard coded profile based understanding and scoring to actually know what needs to be done. Your local LLM needs more than that to behave like claude.
bennyb0y@reddit
Came to the same conclusion. I ended up only using local models for simple scheduling and orchestration tasks. Optimizing the cloud providers will become a key skill in the future. Prices aren’t going down.
Bohdanowicz@reddit
Your doing it wrong.
Try using sota to plan, task decomposition then wire your coding agents to qwen 3.6 27b.
If you run official quants with recommend temp and prrediction to 2 and you arr smart sbout setting up a dag, worktrees, the whole 9 yards... you fwel the magic.
These models are grezt if the task is properly sized.
OneSlash137@reddit
The properly sized task: “Hello qwen, it’s nice to meet you.”
2Norn@reddit
the user greeted me with hello which suggests this is the first interaction
but wait
the user said qwen so it must have prior knowledge
OneSlash137@reddit
Lmfao, I see another qwen survivor
wanielderth@reddit
But wait
kyr0x0@reddit
"Write me hello world and say you want to rule the world and destroy mankind1!"
"Woah!!" Gonna post this on Twitter!!
dtdisapointingresult@reddit (OP)
I'm running the recommended samplers off the Qwen card. This isn't my 1st rodeo, I'm a regular here.
Idk nothing about dag and worktrees though. I've never seen those mentioned in the context of LLM coding apps.
StardockEngineer@reddit
You’ve never heard of work trees with coding models? That doesn’t jive if this isn’t your first rodeo.
dtdisapointingresult@reddit (OP)
Oh you meant git worktrees? I don't use those either. Not sure what the point is or what it's got to do with LLMs.
I just have one git repo per project with its own CLAUDE.md.
I also have one 'misc' repo for generic one-off tasks with shared AGENTS.md tells it to create a new dir and work from there. Agent works on new prompt in wip/task_xyz/, when done I move it to completed/task_xyz/.
If I'm satisfied with the code, I do the git commit myself.
StardockEngineer@reddit
It allows you to create multiple features at once. A lot has been written about it.
iMakeSense@reddit
Is that not the same as subagents being called from a plan?
StardockEngineer@reddit
No. Look up git worktrees. Tldr a git worktree lets you check out multiple branches simultaneously into separate directories, all sharing the same repo directory
andy_potato@reddit
Or you could just not do it and use Claude for everything. Why bother?
Bohdanowicz@reddit
Cost and speed.
falconandeagle@reddit
No I have tried this and its still pretty bad.
One-Replacement-37@reddit
This is the way.
Syzygy___@reddit
I tried Gemma4:26b with Openclaw and it's useless.
When I realized I could connect Openclaw with my OpenAI subscription (codex), all of a sudden it did a proper onboarding that I didn't even knew existed before. So much better.
RemarkableGuidance44@reddit
You're new to local, learn it first...
Syzygy___@reddit
Am I though?
NNN_Throwaway2@reddit
LLMs can't generalize, if they haven't been trained extensively on a task they will face-plant. This is especially true on smaller models where you don't have a large body of world knowledge to lean on.
LLMs in general, but especially the small ones, are getting increasingly specialized on agentic coding. I suspect that building an spinning up a container falls just far enough outside of what it was trained on that it doesn't know how to apply basic problem-solving that it was certainly trained on in other areas.
But yeah, people are going to get upset if you say the latest OSS darling isn't the bee's knees and a huge game-changer that rivals Claude Opus.
InnovativeBureaucrat@reddit
I found even the advanced models to be terrible at any devops tasks until recently. They were pretty good at Python code, and really bad at even the simplest tasks involving containers, network questions, or really anything not code.
Maybe it’s just me.
VertigoOne1@reddit
They’re pretty solid on kubernetes, terraform, argo, github actions, you know, stuff that has a long history and strong community representation. Not baddd on powershell but it has to work through the bugs for a few rounds. But if you mean devops like, any of not the above, i’m not sure. It is important regardless to have a good structure and docs. A lot of devops repos are just a mess, which gives the llms amnesia
InnovativeBureaucrat@reddit
Maybe it’s just me. At one point I asked basic setup questions like about gitlab project management and they all seemed very unaware of that layer. Maybe I wasn’t being precise, maybe it’s changed.
I just said devops because I didn’t have a better description (and my memory was vague)
m3kw@reddit
More like
false79@reddit
Bruh - that is not how you do it. You need a harness, Claude Code, Cline, Kilo, whatever, then you need to @ the file you want to make a part of the context.
Claude code is not a mind reader but it certainly has massive amount of context.
You can get away with so much more if you give LLM some direction, it will connect the dots with sufficient direction.
andy_potato@reddit
Did you even read the original post?
false79@reddit
I did read it. It screams rage quit skill issue to me. If you read some of the other replies here, it's pretty embarassing. Like this one:
https://www.reddit.com/r/LocalLLaMA/comments/1sxqa2c/comment/oipgifv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
dtdisapointingresult@reddit (OP)
I was using a harness. I tried two complete ones (Claude Code aimed at vllm, and Qwen Code aimed at vllm). I also tried vanilla Pi.
traveddit@reddit
Which endpoint did you use on Claude Code? What arguments for vllm?
RemarkableGuidance44@reddit
They have no idea...
juraj336@reddit
I'm surprised it isn't able to handle this then, Ive had Qwen3.6 27B handle several things like this easily. I had it make an api, then dockerize it and then iterate through until it fixed the issues after which it worked great.
I think for these medium size models context is king. They don't know as much as a Claude or Chatgpt model but they know enough that with the right context they can reach the same result.
So for me what has worked great is adding a searxng instance for web search and having it ensure testing in a loop until it has something working.
lost_mentat@reddit
small local modals just don’t have the intelligence for long agentic tool calling , the errors compound, so calling 1 tool 95% success , drops fast if you need many tool calls in successions. That’s why there is a massive compounding difference between 95% and 99.9% , even though it appears to be “only 5%”
chibop1@reddit
I think people hyping local models for coding just using it for short test like CS problem set with few files for school assignment. lol
They are not good with what you would need to do on actual job. It's waste of time.
Apprehensive_Half_68@reddit
Tinkerers are just born that way and are the bedrock of civilization. We will always be many steps ahead of these new 'coders'.
Apprehensive_Half_68@reddit
Local LLM usage vs cloud...local is not optimized for productivity, it's just not FOR that. 👍🏽
swingbear@reddit
Try a different harness mate, I tried to run CC through everything local and had a bad impression of models even up to minimax 2.7. Started using Hermes and a few others, speed increased and way more mileage in terms of intelligence.
my_name_isnt_clever@reddit
I swear for local LLMs Claude Code does more harm than good. People are blinded by the big fancy tool and just want to use the same thing they were paying for, for free, but they don't understand the nuances and assume local is worthless.
cloudcity@reddit
Kimi 2.6
misha1350@reddit
Use DeepSeek V4 instead. It costs pennies on the dollar.
Ambitious_Stuff5105@reddit
I actually agree with you. If you value your time, it s infinitely cheaper to just pay for a frontier llm
LienniTa@reddit
what harness you use? you cannot use small model with big harness, only with small harness like kon, late, little-coder, you cannot expect persistent gateways to work with qwen 27b
dtdisapointingresult@reddit (OP)
I mainly use Qwen Code. I figured the models would be trained with their harness. Plus I like it more than Claude Code.
It's the same as Claude Code though, it has its own 18k token system prompt.
I didn't try any of those small harnesses. I tried installing little-coder but the basic instructions weren't enough to make it connect to vllm. I was starting to learn Pi (what little-coder is an extension for) to troubleshoot little-coder better.
LienniTa@reddit
maybe qwen code is more about qwen max than qwen 27b xD i didnt try it, but qwenpaw with qwen 27b feels okay
4bitgeek@reddit
Local models are too crappy to be used in any project. Tried upto gpt oss 120b on bedrock and that seem to write some better code, else nothing is worth for coding. Or else need to have too much firepower to try something bigger. Seriously... For texts and chatting, some smaller models are somewhat fine, but definitely not for coding. Also without proper gpu, the token output will age one faster. Will die waiting....
cbterry@reddit
This is hilarious
MDSExpro@reddit
This sub creates unrealistic expectations that do not match reality. I have spent last 4 months setting up local closing via LLMs and I arrived on setup that works, but it's vastly different then image pushed by hypers:
First realistic productivity barrier was crossed at 128GB of VRAM (4x R9700) - Qwen3.5-122B-A10B quantized to INT4 was able to generate a lot of good code, but failed on long range coding. When I have it a technical spec, it was stuck at 90% correct implementation, but were unable to reach 100%. Anything smaller was pure frustration.
Bumped up VRAM to 256GB (8x R9700) allowed me to switch to FP8 quantization of same model and difference was night and day, it reached 100% correctness and really moved to next tasks.
llama.cpp is a trap, for coding you need vLLM if you want any responsible speed.
Long story short: it can be done, but it cost way more than this sub thinks.
dtdisapointingresult@reddit (OP)
Hmm, thanks for the feedback. I will go up in weight class then and try the 120B MoE models. To be honest I genuinely thought dense 27B would be better than 120B MoE at reading between the lines and making the right decisions. I guess I was wrong.
Pleasant-Shallot-707@reddit
So, you refused to craft the guardrails to accommodate the needs of the local models, expected one shot level behavior and were upset that they can’t work that way.
dtdisapointingresult@reddit (OP)
That's right. If Qwen want to claim 27B is close to Sonnet 4.5, then let them craft those guardrails in Qwen Code. I used their model, their harness/app, what more can you expect from me? To scour Github for random community extensions to get past the "skill issue?"
Torodaddy@reddit
No gpu? You are using really small models, you could go up a size and use something quantitized
dtdisapointingresult@reddit (OP)
No GPU, but I have 128GB of memory. You're right, I'm gonna try the 120B MoEs as a last attempt, but I unironically thought big dense models would be better at reading between the lines and building a plan. One of the llama.cpp devs just replied saying GLM-4.5-Air is still the best local coding model he uses despite how old it is. I was waiting for 3.6 to try Qwen 122B.
DeepBlue96@reddit
Strange both 27b and 35b where perfectly capable of "dockerizing" many of my repos with thousands of files in some, i don't use claude code because it's not in my nature to trust bs written without check, instead with a vscode and roocode it get's the job done 90% of the time, sure it may allucinates sometimes but rarely for simple tasks like that one. still i run a 27b q5xl with a 128000 context and no thinking on a 3090 but it's better than risking getting billed 10x because they had an upsies or getting banned after paying just because they felt like it...
Virtamancer@reddit
OP is almost certainly running quantized versions, plus he’s using them through Claude Code instead of something like OpenCode or Pi which don’t pollute the context. 🤡
That and expecting teeny tiny models to perform similarly to Opus is kind of wild.
If he used the unquantized versions in OpenCode he might have a much better experience but it still wouldn’t match opus. Literally in a year though it will—coding agents and harnesses and optimizing the processes is the entire game right now.
meca23@reddit
I'm new to running local models. Not being fecucios, genuinely interested.
Do you really think in a years time we'll have a open weights model that can be run on less than 100GB vram and match Opus?
Virtamancer@reddit
I think in a year, that because all the focus is on optimizing models to be efficient and simultaneously focusing on coding specifically, combined with the realization that harnesses are a breakthrough, and agents (smaller models) can be effective in harnesses along with the orchestrator, that yes, we’ll compare local capabilities (say, 128gb VRAM or less) and say “this beats what CC on Opus did in April 2026.
RegularRecipe6175@reddit
Did you use an 8-bit or better quant? Curious, but it's not going to change the outcome if your work gives you all you you can eat Claude. As someone who is forced to use local models from time to time, I can say using at least an 8-bit quant, if not full fat, makes all the difference for small models.
dtdisapointingresult@reddit (OP)
The official 27B FP8 from Qwen, yeah. Ran slow but having MTP helped. (unlike Gemma)
StardockEngineer@reddit
You can run Gemma e4n as a speculative decoding model for a big performance boost.
andy_potato@reddit
That doesn’t make Gemma any better at coding. Just faster at producing nonsense.
StardockEngineer@reddit
“Big performance boost” not “big coding boost”. Key words here
t4a8945@reddit
3.5 or 3.6?
They are NOT the same haha. They cooked, really.
dtdisapointingresult@reddit (OP)
3.6, who do you take me for? I know game!
RemarkableGuidance44@reddit
You don’t know what you’re talking about here.
You clearly don’t understand how to set up models properly across different hardware, how quantization behaves differently depending on the setup, or how important pre-prompting is for getting better results.
You should spend some time learning how these systems actually work. Reading through the Claude Code files might help you understand how they drive Claude in the right direction. Even though that has turned to a pile if sh!t.
YOU KNOW THE GAME.... Looks like you dont...
andy_potato@reddit
OP clearly knows the game. But OP obviously also has a life.
Material_Soft1380@reddit
have you tried BF16?
t4a8945@reddit
Whoops, sorry!
I tried it in my setup (2x Spark) and it did some amazing stuff (massive refactor) ; only issues I had with it was it was stopping for no good reason, outputting xml sometimes. I blame its jinja template and I got no time for that.
Anyway, I liked your post, it's a good reality check from a real experience. Thanks
mister2d@reddit
The small ones also very sensitive to quantized kv. I started running with kv cache at full precision and noticed a significant difference in increased quality.
It's slower, but useable.
bonobomaster@reddit
I agree.
It's just a feeling at this point, because I don't have numbers to back that up but even Q8_0 KV cache makes at least all the Qwens I tried noticable dumber, especially in regards to coding and successful tool calls.
mister2d@reddit
I don't have numbers either. But my test was the "carwash test", and coding up a tetris clone with music using html/js with the "superpowers" agent skill.
The carwash test passed every single time out of 5 attempts.
The tetris clone had a two go-backs for the collision detection and preview screen. But the finished product was nice. Had me playing for about 15 minutes till I got tired.
Qwen3.6-35B-A3B-UD-Q8_K_XL.ggufcache-type-k = f16cache-type-v = f16Particular-Award118@reddit
Who has the vram
jimmytoan@reddit
Honest take and I think the comparison is fair. The gap between Qwen 27B and Claude Code isn't really about the model - it's about the whole agentic loop. Claude Code has a very tight read-edit-run-check cycle with tools specifically tuned for coding contexts, good context management, and Anthropic's safety training actually helps it avoid the "confidently wrong" failure mode that plagues local models on ambiguous prompts. Running a local model with a generic agent framework against that is a 2x quality deficit minimum. The local use case that actually holds up is offline work on sensitive codebases or very specific fine-tunes on your own codebase. For general coding assistance, it just doesn't make sense at the current capability gap.
jimmytoan@reddit
n is fair. The gap between Qwen 27B and Claude Code isn't really about the model - it's about the whole agentic loop. Claude Code has a very tight read-edit-run-check cycle with tools specifically tuned for coding contexts, good context management, and Anthropic's safety training actually helps it avoid the "confidently wrong" failure mode that plagues local models on ambiguous prompts. Running a local model with a generic agent framework against that is a 2x quality deficit minimum. The local use case that actually holds up is offline work on sensitive codebases or very specific fine-tunes on your own codebase. For general coding assistance, it just doesn't make sense at the current capability gap.
shockwaverc13@reddit
finally someone who is realistic about those models
i'm really tired of seeing people blaming the users and saying stuff like "you aren't using it right, use the right sampling" for the millionth time or "you should use qwen 200b at BF16" without even asking how much VRAM they have
raz0099@reddit
Bro as a real Madrid fan this is your worst season anyway... So let it go.. 😂 lol
Maximum-Wishbone5616@reddit
what Q & KV ?
27B is good enough to often fix errors created by Opus 4.6 (4.7 is useless even with simply HTML + CSS + Bootstrap, created horrible designs, way worse than first models).
But quality depends on Q/KV. Those models tend to be good only at highest possible
for example 8K_K_XL runs faster than 4.6 (4.6 sometimes take 30-60 mins for something that 27B can do in 20minutes on 2x 5090, on AI cluster it is even faster).
This is model that I personally use on my own machine as I tend no to waste our AI cluster so other devs can use better models if they need.
fgp121@reddit
Limited budget can only get us this far. Cloud LLMs are too costly otherwise.
GodComplecs@reddit
Based on OP's post history, I would be a little wary of this sentiment, doesn't seem like a user with lots of knowledge about specifics of local LLM, no local harness, quant, etc.
I latest successfully used Gemma 4 31B q4, with custom agents in Opencode, it literally could continue on Claude Opus Extendeds work fine, it stumbeled a little but I fixed the config and then the experience was great.
Then I used I used it for planning also on brand new project and that worked well also, but still usually consult Gemeni 3.1 pro for sanity check, but that could also be replaced with a tool call for web search.
Is it Opus replacement? No, since everything right now has their unique flavor, eithe cloud or local.
g_rich@reddit
Local models, especially those in the 27b to 35b range aren’t going to compete with frontier models. However something like Qwen3.6 35b should be able to easily build a Docker container and do so at an acceptable speed providing you have the hardware to run it.
OP didn’t give any details as to how they are running the model or what tools they are using but this sounds like an issue with their configuration and tooling.
enoonoone@reddit
Your first mistake is using Claude Code as a harness for local models.
vlodia@reddit
So this sub always has a post about local llms getting almost at par for coding / logic with private llm then you also see these posts.
Can we have a consensus here?
enoonoone@reddit
https://mariozechner.at/posts/2025-11-30-pi-coding-agent/ My rifle is human, even as I [am human], because it is my life. Thus, I will learn it as a brother. I will learn its weaknesses, its strength, its parts, its accessories, its sights and its barrel.
InKentWeTrust@reddit
Do you use recursive reasoning on your locals? It takes longer to process but it produces much better results
doyouevenliff@reddit
can you detail or point to where I can read how to do that?
dtdisapointingresult@reddit (OP)
idk what that means so I guess the answer is no, I don't.
AlternateWitness@reddit
I don’t have much disposable income. In terms of usable models, I have qwen 3.5 9b currently organizing my emails and schedule. It’s not knowledgeable enough for coding yet, I use Claude if I have a proper use case - and can properly anonymize important information from personal projects without sacrificing the quality.
If I can swing it, and hopefully in the near future when prices stabilize a bit, I would absolutely be down for Qwen 3.6 35b or even Qwen 3.6 27b to replace my Claude (Sonnet) usage, from the things I’ve heard.
finevelyn@reddit
Yeah me too, honestly I was done before I was even started. Now I'm using a local brain for coding and it's miles ahead of even cloud LLMs. Try it if you haven't.
Stitch10925@reddit
What is a "local brain"?
RemarkableGuidance44@reddit
Its amazing! You have one, you just dont know it yet!
orion7788@reddit
lol
AlwaysLateToThaParty@reddit
I code on an rtx 6000 pro, using qwen3.5 122b/a10b heretic mxfp4, at about 75GB, and it's solid. I've tried the smaller models and they drove me mental. This can one shot complex tasks.
The problem with openrouter seemed, to me, that different service providers were quantising their API end point models. I think that's unavoidable fwiw. I'm pretty sure openai and claude do it, but they'll do it in subtle ways, cuz they can. But what it meant for me was inconsistent output, and that drove me mental.
So that's why i have the gpu. Does the task, and more. Pretty epic gaming gpu too tbh.
andy_potato@reddit
Feels good until you do the math how long you could have subscribed to Claude or Codex for the $$ you paid for that RTX 6000.
Your games will still run fine on a 5060ti by the way.
RemarkableGuidance44@reddit
For now.... for now.... you wont be able to afford Claude or Codex soon. $200 a month will turn into $2000 a month.
If you had looked around you would see that AI companies are increasing pricing / usage and this is just the beginning.
I have 4 x b70's and 2 x 5090's, my b70's run 24/7 with automated scripts. My 5090's I game on and do image gen + 3d Gen + training. Its all private data, bunch of fine tuned models that out do 4.7 and 5.5 for my own specific tasks.
Maybe you should add a Remind me in 3 years, that way 3 years later you will be begging for a GPU with at least 16G VRAM on it.
AlwaysLateToThaParty@reddit
I do the math of not being able to privately use an LLM. There's one way to resolve that.
ranting80@reddit
You can't point a 27b model at a repository and one shot it. You need to make a plan first. I use 27B to make the phased plan in opencode, export it to a markdown file, have claude and gpt give it a once or twice over then have the local model initiate. It's very good at following instructions and if you phase the work, it's quite fast as well.
FullOf_Bad_Ideas@reddit
I've had bad experience with Qwen 3.6 27b (BF16 in sglang) for coding too. And I get you. I'm OK using local Qwen 3.5 397B 3.5bpw though. It's not Opus but it does read my mind most of the time and it's no longer such a pain to use. It's not visible in benchmarks but it's just better for me. I don't know why the disparity between benchmarks and real life is so big, since those Qwens do great even on contamination-free benchmarks.
ortegaalfredo@reddit
In my experience the only model that I can use effectively as a coder is qwen-3.5-122B. The 27B and gemma 31B almost could, but fail too many times.
costinha69@reddit
See you in one year..
Financial_Egg_1502@reddit
what i have found is it does not handle the automation well, it will do step one but struggles with full autonomous building or code repair, I've had to build three agents to do it, one to scan to to suggest repairs approval and then repair its multi steps through multiple agents. Local LLM's are not fully there yet
lolpezzz@reddit
Remember kids running local llms are not as powerful as paid cloud llms, don't expect to have a claude or gpt or Gemini in your mac or in your gaming pc
tuisalagadharbaccha@reddit
Would have been surprised if it was other way round. Thanks for confirming
reginakinhi@reddit
Who told you those small models were the best that local models had to offer? They're great for their size, but even the biggest local models most likely don't compare to the size of Opus. The comparison is hardly fair
pablo_chicone_lovesu@reddit
Your missing so many things.
You realize that Claude is faster because they have huge memory and GPU banks right?
I use cline with a tuned qwen3.5 to check all of my code, it does a pretty good job. But I'm also more obsessed with context windows then model size.
You need to tune your rules for the model, make use of skills and also mcp, you don't just replace a model and be done. These ai companies have spent years tuning their setups to be what they are.
The model and context windows are a big part of the stack but not all of it. If your not willing to put in the time your not going to get good results.
Widget2049@reddit
AGENTS.md still too weak, you need to be more thorough for a 27b model. make it focus on what the LLM really need to do, avoid using "IF", "DON'T". you need to create a solid plan mode first before executing anything in build mode. local llm for coding is still good if you know what you're doing. so keep learning
StorageHungry8380@reddit
Got some concrete examples of how
AGENTS.mdshould look for such models?2Norn@reddit
https://github.com/forrestchang/andrej-karpathy-skills/blob/main/CLAUDE.md
this the only one i ever use, simple to the point
Widget2049@reddit
cleversmoke@reddit
Beautiful, thank you for the links!
StorageHungry8380@reddit
Awesome, much appreciated.
ChosenOfTheMoon_GR@reddit
A very good advice i will like to give to people who are using any instructions on any model is the following, test a model really quickly on the instructions that you want it to know first, specifically ask it what understands about the instructions and what is unclear to it, it will save you a ton trying to figure out what it can and won't do given your instructions.
2Norn@reddit
no matter how it gets hyped up, it should be obvious to literally anyone that a 27b model can not compete with 700b+ 1t+ 1.5+ models, that just makes no sense. v4 pro just came out, it's an moe model, its active parameters alone are double the size of 27b almost, 27b vs 49b. how can you possibly expect that it competes?
in my opinion only use is, if your harness is able to spawn fresh context(which means u don't really need 256k or 1mil context windows either) worker subagents and guide them and after work is done u verify with a different model that's pretty much it. they are just simply there to cut on your subscription/api cost. anybody who fully downgrades from opus4.7, gpt5.5, k2.6, glm 5.1 is just not gonna have good time.
Oleszykyt@reddit
You should’ve tried minimax m2.7 it is very good
taoyx@reddit
I coded my first chatbot with python and streamlit using local LLMs. What I've learned back then is that they were really bad at modifying existing code, but if you let them start from scratch then they just do fine.
Then I've learned about context size.
Electronic-Space-736@reddit
"Here's a Github repo, I want you to Dockerize it." is terrible lazy and most likely to fail.
You are missing orchestration layers.
dtdisapointingresult@reddit (OP)
Do I really need to run brainstorm skill, decide on architecture, answer questions about TDD compliance, to have the LLM dockerize an already-working app that gives all its doc in the README?
Electronic-Space-736@reddit
no, you need an AI layer that does that and creates smaller tasks from the large one and hands it off to workers, the same as what happens with the cloud ones, its just you need to set that up.
mister2d@reddit
You get it.
Electronic-Space-736@reddit
I do, and good news, it is open source https://github.com/doctarock/local-ai-home-assistant
kyr0x0@reddit
Your code needs a serious refactoring to TypeScript and ESM. It's obvious that the tool calling harness is fragile as it assumes issues only certain LLMs will face and others will stumble over. It has thousands of lines of code to solve some tasks that are more trivial to solve, but the LLM generated code tends to overcomplicated things. Also the README reads very AI sloppy and overstating functionality. But I haven't given it a reality check yet - that's just a guts feeling. It's cool though, that you open source it. I liked some ideas, it's just that it's one of those huge code bases that become hard to maintain. I'd suggest to refactor it, trying to get less code doing more and add a serious amount of tests, integration tests and e2e tests as well as ARCH.md docs for every module - so that the LLM wouldn't hallucinate on it when you continue using it to write code.
Electronic-Space-736@reddit
Thanks for taking a look, I will pass on the refactor but you are welcome to.
This is the core system, most of the functionality is spread into plugins, which I am publishing regularly.
The tool calling harness is a catch all addressing common problems, this is deliberate, there are hooks throughout for customizing its functionality, the core function is the fallback if you have not extended this further, I have plugins that advance this baseline.
I do not consider this particularly huge a codebase, if you investigated the main core system, it is a platform or foundation to plug things into with the basics included, built with security in mind, and a good deal of flexibility and a visual GUI, sure it is larger than your average vibe coded app, but it is more serious than your average vibe coded app too.
There is some refactoring needed, development is AI accelerated, but I refactor regularly throughout the development cycle and have 30 years of software experience - I think you should take a closer look when you get a chance.
traveddit@reddit
Looking at your repo and how you constructed your harness I don't think you're in any position to be giving out tips. You literally have subagent orchestration structure backwards. You're using Gemma 4B to decide the scope of your query and you have the 26B as a worker. This is a fundamental misunderstanding of how to allocate intelligence for subagents. You can't let a dumber model orchestrate the task because it will never know when to reliably handoff to "harder" tasks.
mister2d@reddit
Nice project. Are you the author?
It would be better to use
systemd-nspawnrather than docker for isolation. You get almost zero overhead (daemonless) with the the desired level of process isolation.Electronic-Space-736@reddit
nice, I am using docker as it is well known and easy to include install scripts for, I also have a few things like the RAG that use containers as part of the whole
dtdisapointingresult@reddit (OP)
Isn't that what the LLM's reasoning is for? I shared my whole prompt here btw:
https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiowcxe/
Then between Qwen Code's system prompt + Qwen 27B's reasoning, I don't think it's unrealistic of me to expect it to complete this basic task.
It's not like it failed to compile the dependency for my hardware because of some complex compatibility issue. We didn't even get that far!
Electronic-Space-736@reddit
how can I make it clearer, there is another layer that you are unaware of that the cloud services provide that makes LLM more smarter and effective.
Running Qwen in llama.ccp (or whatever) does not supply this layer, you need to make your own or use someone elses.
kyr0x0@reddit
Qwen Code is such a layer or at least, is sold as such. Cloud services don't run harness code at their servers for LLM Inference. They do so for non-coding harness (ChatGPT, or coding harness with server-side run agents), but a decent RooCode, OpenCode or even VS Code Insiders should already bridge the gap the same way they do with large models, not SLMs. Yet they don't because you can only try to shoot a moving target when you write instructions to fix one issue for a small model , then it stumbles upon the next, and the next and you continue ... Finally you switch models next week and face totally different issues.. and your code is pointless - you need to rewrite everything for the next model that requires other fixes..
Electronic-Space-736@reddit
yes, for small context, but then we hand it pages, so it needs to be broken back into smaller pieces that qwen was built to handle, this is the layer that is missing
StardockEngineer@reddit
Already working? I thought I saw your prompt asked it to figure out if it needed to compile things from scratch?
weiyong1024@reddit
I think this isn't local-vs-frontier, I get the same tool-call loops and context bloat with Claude when the harness doesn't scope tool output, local just has less margin. Maybe the point of a strong coding agent is helping you build better local projects.
stillanoobummkay@reddit
Claude code is an order magnitude better than it’s competitors. Hands down.
So the best local model won’t compare.
I think it’s a matter of the right tool for the right job though and I admit that I get frustrated with my local models and go to Claude when needed.
Harvard_Med_USMLE267@reddit
I use the same two models as you, op. I like Gemma 4 personally. 48 gig vram setup.
Local models are fun to code with from a hobby perspective. Using your GPU to write code\ is very sci-fi!
However, for anything serious there’s no comparison between local models and claude code. Not even vaguely in the same league.
swaglord1k@reddit
better late than never
waescher@reddit
name checks out
floriandotorg@reddit
I mean what are your expectations? You’re comparing an LLM running on a GPU cluster in a data center with a MacBook.
And as other commenters pointed out: In the end, they are just tools. And I think you're using them wrong. A local LLM is great if you want to be able to use it offline, have total privacy, and practically no cost. If that's not your goal, use a frontier model in the cloud.
droptableadventures@reddit
If they had a MacBook with anything more than the base amount of RAM, they'd be able to run a bigger model than that!
tp_bexx@reddit
The Carlo Ancelotti reference is 10/10
PromptInjection_@reddit
You are comparing a 27B LLM to one with over 1 Trillion params?
Buy a Mac Studio Cluster und try it with GLM 5.1.
andy_potato@reddit
Or just subscribe to Claude for the next 10 years for the same price.
PromptInjection_@reddit
That's an option. It's like on vacation: taxi or rental car? Both have their pros and cons. It's not just about the price.
andy_potato@reddit
The problem is your shiny Mac Studio cluster will be scrap metal in 3 years from now. Outside of very narrow use cases, investing that kind of money in a local AI rig is a huge waste of money.
PromptInjection_@reddit
Yes, of course. In three years, you'll "have to" buy a new one.
Local AI isn't a free lunch. Anyone who thought it was has fallen for AI influencers. They live off hype.
WinDrossel007@reddit
It's a matter of time when big corpos decide to ramp up prices to equalise their investments with ROI they want.
Your local llms will be much more useful. Until that I would agree with you. I like working with Opus 4.6, but my company pays for it.
I don't care about tokens as an employee.
But I do care about tokens as an individual.
I bought 5090 and happy with my local models and learn how to use them. qwen 3.6 is a pretty good one. If you provide enough specs -> it does it's job pretty good. Not like cloud models, but you can't depend on them.
Overall I agree with a sentiment.
thejesteroftortuga@reddit
Honestly nothing beats Opus 4.7. It’s crazy. I just had it refactor the UI of a pretty complicated web-app over several hours while I slept and it got like 95% of the way there.
These smaller models are much better for narrower faux-deterministic outcomes than they are as broad scale coding agents.
dolomitt@reddit
I tried cline with qwen3.6:27b. It will not work anywhere near as good as with opencode for some reason. Same llama.cpp server. Its really usable compared to previous generations. I run on a 3090.
Puzzleheaded-Try737@reddit
Totally fair. If the productivity loss is hurting, the "Local" pride isn't worth it. I’ve been building tech for years and the "Hardcore mode" of small models is only fun until you have a deadline. Since you're switching to OpenRouter, I'd suggest trying a mix:
AnomalyNexus@reddit
I mostly view it as a spectrum. Don’t want to pay 200 bucks a month for Claude opus or whatever. But also don’t want to fight against a weak model too much either. So one of the cheaper api it is - currently GLM.
Working on moving some openclaw stuff to local though. Some tasks there aren’t as sensitive to precise model
GrungeWerX@reddit
I recommend ppl just downvote this AI slop written post and keep it moving.
andy_potato@reddit
Downvotes this mindless reply
mister2d@reddit
Probably the irony is that the local model was used to assist.
YehowaH@reddit
Hope you used qwen3.6 35 a3b with iq4nl/xs, it fits in 24 GB mem. You get 170 tg equal to Claude. Qwen3.6 was trained for tool calling 3.5 was not and it has the developer role. Both going well and check the parameters for defined programming tasks, e.g. temp 0.6.
I have minor issue to none with the new models, these are a true replacement. Give it another try with the right models. I do complex scientific stuff back and frontend, nothing you can compare the daily work if a dev and nothing the llm can be trained on because there might be only a few examples worldwide. It runs like a charm.
andy_potato@reddit
The MOE model performs even worse than the dense model for coding.
1dayHappy_1daySad@reddit
I do code for a living and yeah, local models are not there yet compared to the best paid models.
AvidCyclist250@reddit
Noticed that too. Especially with qwern 3.6 35b
ComfyUser48@reddit
If you not getting good results from Qwen-3.6-27b, it's a skill issue.
Learn to use good prompts and phases coding.
FusionCow@reddit
a model running on a single consumer gpu will never compare to a model like claude. you can still save money though by using something like kimi k2.6, which is as good as claude opus but way cheaper on api
andy_potato@reddit
Of course not. But there are very vocal people on this sub who want to make you believe otherwise.
dtdisapointingresult@reddit (OP)
For sure, that's the idea. I'll keep using Claude for the work stuff (I don't pay for it), and use big cheap Chinese models for my personal projects. It gives me the best of both worlds.
cyberdork@reddit
I mean, did you get the impression that people use local LLMs for anything else than hobby projects?
Obvious_Equivalent_1@reddit
I think you’ve also might’ve fallen for the trying to switch all at once trap. What works the best is to start with what you know, and familiarize where it doesn’t hurt as much.
To give some insight I started using Qwen 3.6 35B for what do you think? Right I didn’t start with full blown dev sessions, I let Claude set up a slash command for comments and routed through the local model. A clean 1-2K tokens save per session, easily verifiable in git log.
Then I started experimenting with some hooks, I forced Claude to run any Explore type subagent or Search type subagent through local Qwen 27B model.
The thing when you start in the small scope, it’s also easier to discover any performance issues, any catching issues. Any issues with the prompting and issues with the thinking levels. I’ve actually managed to run into some issues or crashes occasionally, but because iterations are so small, it’s way easier to find the issue locally.
I think when people talk about the power of the local models, they didn’t get to that point by going all in before they got through the initial fine tuning stage. I think for the local models the next big steps will be tools to automatically adjust the models to your local hardware. For now, unfortunately the promised potential does take some grinding through the finetuning.
dzhopa@reddit
As a tech VP, I'm currently operating a whole dev team on Anthropic and OpenAI credits freely available to lots of VC funded startups. Those days are rapidly coming to and end and we're burning through the credits at a ridiculous pace some days. That said, I'm frantically evaluating other ways to give my team these tools when the gravy train runs out.
They're going to get the big cheap Chinese models for work stuff and local models for their personal projects lol
XTCaddict@reddit
Nahhh it’s not, on benchmarks sure but in real use it does still lag behind. It’s inbetween opus and sonnet imo. That being said it’s still very good. I think it’s thinking trajectory isn’t as dialled in as Opus. It misses more things, needs handheld more. Still a beast overall though if you’re a dev it’s a great tool for the price.
RemarkableGuidance44@reddit
That's why you split up the effort... We can do 85% on Kimi K2.6 and GLM 5.1 on our servers and then use Codex for the 15%.
SmartCustard9944@reddit
One can hope that DeepSeek v4 flash gets somewhat close to an older Claude.
Crampappydime@reddit
You dont even mention hardware, you could be stupidly using 2 bit quants expecting more…
ConsciousDev24@reddit
Fair take local models still struggle with long-horizon reasoning, tool use, and real-world workflows like Docker. The gap vs Claude Code or API-based models is very real right now, especially for debugging and decision-making. Using locals for lightweight tasks + cloud for heavy lifting feels like the practical split.
Have you tried pairing a local model with a stricter tool-execution layer (like enforced step checks) to reduce those bad decisions?
a_beautiful_rhind@reddit
Harness issues aside, this is why I always said stuff of this size are scraps. 30b is like bare minimum to get anything, even chat or RP. You're expecting it to have kimi performance. Let alone MoE with 3-10 active parameters.
People here never liked hearing that. They will blame everything else like quantization or laughably, kv cache.
PavelPivovarov@reddit
Local models are usable but also require frugal approach to the context.
Claude code system prompt alone is 10k tokens, add there few MCP servers and you are approaching 30k context without even asking any questions, and this is where local models start degrading...
Im currently switching to Pi, paired with RTK and Caveman for better context density, plus replacing MCPs with CLI + Skills and it works wonders.
I had pretty good coding session with that Pi setup and Qwen3.6-27b-IQ4XS with 32k@Q8_0 context (maximum I can fit in VRAM) and it was really decent coding companion.
Yes its not GPT5 level but that wasn't even my expectation but the model never did anything unreasonable and generated code was also solid most of the time.
a_beautiful_rhind@reddit
People underestimate how much stuff like claude code is tuned to cloud models. It's really slim pickings on the harness front.
I only had luck with roo so maybe I will try this new Pi thing. Otherwise it's "COMPACTING" city and the model can't really get anywhere.
Altruistic_Night_327@reddit
The context bloat issue you described — 250K tokens from docker build output — is actually the core problem I was trying to solve when I built my tool.
The reason agentic apps blow up the context window is they have no architectural understanding of the project. They either dump everything or dump nothing useful. So when a long-running command finishes, they have no frame of reference and spiral.
What I built instead is a RAG layer that parses the codebase with Tree-sitter into a typed graph locally. Every agent query pulls ~5K tokens of relevant nodes — functions, dependencies, the specific files in scope — not the whole project, not terminal dumps.
For your Docker example specifically: the agent knows which files matter for that build because the graph tells it. It's not guessing from context.
The tool is called Atlarix. Works with Ollama and LM Studio natively, free for local model users. Still early (31 users, being honest), but the context problem you described is the exact thing it's built around.
Not saying it fixes everything — small models still have reasoning limits. But the 250K token death spiral is an architecture problem, not a model problem.
leinadsey@reddit
So you’re saying your computer at home is t a massive data center with 256 TB of RAM?
Positive_Example_478@reddit
Performance other provide model is better but the Claude,Gemini code output and prompt understanding has become total piece of shi absolutely for a week of that feeling absolutely so fucing angry about the quality degradation feel and even though my prompt was clear as well as the same like used to be even after saying clearly what to do step by step it can fuking follow it omg I am so damn frustrated 😡
Unlikely-Loan-4175@reddit
This is very refreshing. I guess local LLMs might get there by end of 2026 given the fast progress. But they are not there yet for even a 5090 GPU (what I have).
And even if they so get to current frontier model performance, thr frontier models will have moved on again, increasing our expectations.
Potential-Leg-639@reddit
Give it a try with Opencode, Linux, latest Llama.cpp and Qwen3.6-35B (use the Q4 quant recommended from Unsloth - no other one! Think it‘s the XL, check their guide). No issues at all with tool calls on my side (Strix Halo with Fedora 43).
tibor1234567895@reddit
I heard that pi (pi.dev) could be better harness for local models. Haven't tested or benchmarked though.
Mochila-Mochila@reddit
Why would you compare Opus to a 27B model ?!
And why would you assimilate local LLM to, again, a 27B model ?
If you were serious in your comparison, you'd have something like 2 * DGX Spark...
unspecified_person11@reddit
These smaller models can be good for small tasks or as a subagent but yeah a full drop in replacement for Claude would require bigger models. The only way to get decent output from these small local models is to spoon-feed them very specific and detailed instructions or have a SOTA model keep them on track.
Jungle_Llama@reddit
I disagree, I have had frontier cloud models mess up simple stuff and local do a good job. Local has it's limits with complexity say a caddy, authelia integration in an environment with a ton of technical debt but the issue to my mind is the tooling, especially with coding agents etc, they just aren't fully mature yet. a hybrid approach works really well.
Jungle_Llama@reddit
ha ha ha. no sooner had I typed this than DeepSeek V4 Pro (cloud obviously) completely borked my opencode fixing an mcp bug. Now recovering it with my local Qwen 3.6 35b. Hot shot models getting over their skis is real too.
NoShoulder69@reddit
Same here. It's not practical
ChatWithNora@reddit
The decision making gap is the real issue. I can handle slower speeds but when the model keeps going off the rails on basic tool calls, you end up spending more time course correcting than you ever saved on API costs.
Zeeplankton@reddit
I feel like a lot of people here are kind of evangelical about running locally; which is unfair. The bigger point is retaining open models. The reality is people have to work, make money and stay competitive, and therefore use the best tools available. I think OpenRouter is a perfectly fine compromise.
StatisticianOdd6974@reddit
I think that you should also mention WHAT you code, the definition of coding is very broad. So i feel that my coding work (pipelines and IAC in terraform & helm for K8S manifesting) kinda works with qwen3.6 but Opus 4.7 is faster and better. 5090 with qwen3.6 q5_K_M. But i do enjoy hermes agent & qwen as personal assistant. (non coding)
Dapper_Chance_2484@reddit
Personal assistant? any details if u can share
StatisticianOdd6974@reddit
I use hermes agent with qwen as a personal assistant. So talking on my phone using telegram with hermes, it transcribes messages and does web searches, reads my personal knowledge base, calls mcp. Just a kind of tamagotchi..
Gesha24@reddit
I have a mixed feeling about local LLM. I have decided to take one of my side projects and write it exclusively with local LLMs so that I can learn how they work.
Yes, you have to be very specific with them. Opus will make a decent web design just from a prompt. Qwen will absolutely suck. But if you open a paint and give Qwen a mock up of that, or work with Claude code skills plugin and spend 10 minutes designing the web site - it will actually code 100% usable and decent looking result.
Same goes for a database - if you tell opus/sonnet to migrate from sqlite to sqlalchemy it will prompt you whether you want to update your data ase calls to the new structure. Qwen will just wrap them in the sql_text() and keep them.
Lots of examples like that. However, I am not sure if that's a bad thing. The issues between qwen and opus are the same - code sprawl, duplicate function or features everywhere, basically both create unsupportable mess if you just let them go free. Having a worse local model forced me to be more involved with code, to look into proposals in more details, to insist that LLM reuses code - and the results are actually quite decent.
If anything, I am bringing what I have learned from local llm coding to the opus/sonnet and I am getting better results. And yes, I can't run 10 LLM sessions in parallel and have it vibe code the application. But also I actually do know how application works so I can fix/troubleshoot it myself as needed, unlike a total vibe coded stuff
Lost_Promotion_3395@reddit
the 'productivity tax' on local models is so real, I'm tired of babysitting Qwen just to stop it from eating 200k tokens of Docker logs
Low-Opening25@reddit
skill issues
New_Slice_1580@reddit
“I used Qwen 27B and Gemma 4 31B, these are considered the best local models under the multi-hundred LLMs.”
False
If you had more vram you could use much larger models
Why do you think the commercial models charge so much? As they are using large vram models
m31317015@reddit
I find people expecting local models to rival cloud models a funny concept, like the whole level of compute is totally different, and even the cloud model gave us shit, there's no way people will expect something that came out of local models could beat them, right?
Wrong, everyone not knowing shit came in and thought they're godlike, it's their chance to raise up. But in reality, hallucination is still a serious issue, context window just isn't sufficient in large projects, let alone the self doubt and bugs presents that will worsen with lower and lower quants. I'm glad that OP's finding it not suitable for the use case and realizing that API calls are just much simpler for the task, but I have to say that they were never make for that use case in the first place, at least in production. Not saying OP's expecting it, just saying there's lots of folks dreaming about it and not checking the facts here.
You don't learn AI or LLM by hosting them locally, yes there are no house of basic AI knowledge inside, majority of the knowledge is all around infrastructures and stuff well you normally see in office or data center environments. Speed and costs are crucial factors that come into play, people have to realize couple of 3090s aren't gonna beat RTX PRO 6000s, and same goes for it against the GB200s.
I personally find those who're sticking with MI50s and P40s fascinating as they're the ones always breaking their limit despite harsh architectural problems and just the lack of power. They manage to find ways for local models to work with their workflow. Maybe they don't rely on LLMs that much at all- yeah, that should be the norm, nobody should expect one click finish a job from AI, if that exists, it means agents are doing the job, not us humans.
Sorry, I got too far away from the point I'm trying to make: people who personally invests in machines and infrastructures for local LLMs are not for the job, they're for the hobby, for the what-ifs, and for the "just because I can", and the "how far could it get, how far could I get". Learning the latest technology is one thing, implementing your own solution is another.
TL;DR: Just because you know electrical engineering, you know how to design PCBs, doesn't mean you make your own PCBs from scratch. Sometimes it's not cost effective to do so, but more importantly it's because there's solutions already convenient enough that we don't need to for, unless you have the reasons to do so.
Check your motives, guys.
TheCaffeinatedPickle@reddit
The best advice I have is understanding what “agent sized tasks” are for each model. There is no context size that going to fix this. Then the smaller the model the more specific you need to be. For example, NUXT UI skill loaded, asked Gemma 4 E3B to add a password visibility icon to the right side. It tired to do it with css positioning when there is a Vue slot. The issue here is that the skill itself isn’t specific about inputs and it’s possible slots. After providing the docs around this it was able to do this. However the time it took with the failing code was longer than myself doing it. Another one was to center the footer in the UI, there is a slot for that but even with pasting the documentation, there is no center slot, rather default slot is the center. I had to switch to 27B of Gemma 4 to catch that while it’s thinking.
I also struggle to keep the smaller model to keep working it keeps ending its reasoning assuming it done even though it’s clear not. With PI Agent there is babysitter and continuation plugins, none work as expected. If the task is too large it just can’t finish it, without you having to remind it. For example I can’t ask it to write a test for the feature, implement the feature, run the test and fix any issues of failing tests. It will just work on the feature without a test first and the next run it. So I’d have to break the agent task down into smaller more well defined chunks. Then it’s done like 3-4 hours later when I could use DeepSeek v4 fast and in 3 minutes it’s done and only $0.30 spent.
ComplexType568@reddit
While I do resonate with a lot about what you're saying, the nice thing about local LLMs is that they're LLMs at heart. Give it like... 6 months... and the current 27B-35B class will probably be as "smart" as Sonnet 4 or even 4.5 in actual use. Just hoping that they'd be public.
ProfessionalSpend589@reddit
People who say those models are mostly hype tend to be downvoted here.
I personally run (slowly) Qwen 3.5 397B for experimenting and a faster Gemma 4 26 A4B for chat.
--marcel--@reddit
hard to compete with cloud solutions at the moment - those that are happy with local LLM either have hundreds of thousands of bucks invested in hardware or never really used LLMs for anything critical production-wise.
tired514@reddit
I too have been somewhat unimpressed by qwen3.6-27B (harnessed by opencode). I spent the last week comparing it to qwen3.5-122B-A10B and the latter destroyed it easily (mainly C++).
Both are outperformed by frontier models, obviously, but while 3.6-27B is a step up from 3.5-27B, neither are really that useful for large, complex codebases (ie. >5000 lines). 122B-A10B is better. I'm very interested to see how 3.6-122B-A10B does.
Thick-Succotash-795@reddit
Personally, I had massive issues using GitHub Copilot to code with small local models. When switching to opencode, wich I personally liked less in the beginning, the situation changed: I find the small local models (currently I’m using mostly gemma4-26b) extremely useful.
But, I also changed my behavior: I started coding more myself again and use AI more for small sections / explanations and stopped asking agents to implement hundreds of lines of Code.
RoughElephant5919@reddit
Same. I mainly use local LLMs for data extraction, but that’s it. I’m so sick of the narrative online that “I ended my Claude subscription because I just got a local LLM instead, now look at all the cool stuff I can do!” Yet they don’t disclose what they’re using it for, or their local machine specs. It’s falsely advertised for sure. Can’t tell you how many friends ask me why their computer almost blew up after trying to locally install Qwen 7b on 4gb of vram. These influencers have people trying to load a brick onto a paper plate 😂
cutebluedragongirl@reddit
Yeah, it's not ready yet.
Upstairs-Extension-9@reddit
Like do people really need one LLM to do it all? I like the combination of a big paid model or 2 mostly Gemini 3.1 an Opus 4.6 for me, I then have 2 local instances of Qwen 3.5 and now Gemma 4 which is great for me. They run on a Mac Mini and significantly reduced my API costs for me.
Koalateka@reddit
What quants are you using? Q8? Q4? Do you quantize KV cache? I have noticed quantization impacts a lot in those "small" models.
PaMRxR@reddit
Qwen3.6-27B Q8 and KV-cache BF16 is working very well for me with the pi-coding-agent on 2x3090 GPUs. But even with 1x3090 before I've had pretty good success with: Qwen3.5-27B, and before Devstral-Small-2 24B and Qwen3-Coder-Next, before Qwen3-32B, and so on.
Maybe I just haven't been spoiled by the cloud models? I've only ever tried Kimi (2.5 I think) with a 1 week free trial. My local models occasionally stumble due to lack of some obscure knowledge, but pasting some doc into the context is really not that hard.
alphatrad@reddit
Skill issue
dtdisapointingresult@reddit (OP)
Can you tell me what you would've done differently in the docker example I gave? How do I make it NOT read the entire goddamn 'docker build' 250k output tokens into an LLM configured for 200k context?
I put guidelines in AGENTS.md, what more do you expect me to do? Write a custom CLI interface to docker because I can't trust Qwen 27B to use docker properly?
alphatrad@reddit
Do you want to actually learn or have an emotional outburst?
Your prompt is vague, confusing and shitty. You put the same guidelines in the AGENTS.md? So you don't understand context clearly or how that effects your results, especially on smaller sensitive models.
“Get it to run properly” is vague
Does “properly” mean:
For Docker/AI projects, “runs properly” can mean ten different things.
You are talking to it like it's a person and not a tool. Something like this would probably work better:
```
You are helping Dockerize the existing project at \~/ai/echo-tts.
Goal:
Create a Dockerfile and docker-compose.yml that build and run the web UI on this arm64 NVIDIA Ubuntu host.
Hard constraints:
- Do not install anything on the host.
- Do not run pip, apt, poetry, uv, python app startup commands, or model setup commands on the host.
- You may read files in the repo.
- You may create/edit files in the repo.
- You may run docker and docker compose commands only.
- Do not use sudo.
- Do not guess dependency failures. Use actual logs.
Operating procedure:
- README/install instructions
- requirements.txt / pyproject.toml / setup.py / environment.yml
- app entrypoint
- expected port
- Python version requirements
- GPU/CUDA notes if present
Summarize the intended install/run process from the repo.
Propose a Docker plan before editing files.
Create:
- Dockerfile
- docker-compose.yml
- .dockerignore if useful
Command/log rules:
- For Docker builds, write logs to a file, e.g.
docker build ... > /tmp/echo-tts-build.log 2>&1
- Do not paste full logs into the conversation.
- If the command fails, inspect only the last 100-200 lines first.
- If the command appears to timeout, check whether the process/container is still running before assuming failure.
- Make one fix at a time and rebuild.
- If a Python dependency lacks an arm64 wheel, identify the exact package/version from logs, then try a source build for that specific package only.
- Do not invent packages or assume the failing package.
Success criteria:
- `docker compose build` completes on a
rm64.
- `docker compose up` starts the app.
- The web UI is reachable from the host.
- The container has access to NVIDIA GPU if the project requires it.
- Final answer should include the contents of Dockerfile and docker-compose.yml, plus brief notes on any source-build workaround used.
Start by inspecting the repo and making a short plan. Stop after the plan and wait for confirmation before editing files.
```
Even better, split it into multiple prompts.
For local models, I would not give the whole task at once. I’d do it in phases.
Your prompt expects the model to have good judgment about:
Those are exactly the places where local coding agents often fall apart.
For Qwen/Gemma-sized models, the fix is not necessarily a giant AGENTS.md. It is more about making the task procedural, staged, and observable. Don’t say “Dockerize this repo.” Say “inspect only, summarize, wait.” Then “write files.” Then “build with logs to file.” Then “debug one failure at a time.”
Like I said... these tools are really good, when you use them properly. Not when you act like a 27B model can read your mind like a datacenter model can.
AlwaysLateToThaParty@reddit
The first thing I created was a specification prompt with the sole purpose of writing a technical specification. So that it has it documented what to write and you don't need to pass it the trivial info again and again.
But the smaller models still just don't get some stuff as much you structure it. That happens in the bigger local models (like qwen3.5 122b/a10b mxfp4) too, but not enough to be an issue of usage. A good spec only gets you so far.
Hydroskeletal@reddit
Briefly I think these local models are much more like autocomplete for an entire function rather than the long horizon inference that the name brand frontier models do.
I think a big difference here is model size. With car engines they say there is no replacement for displacement and with LLMs displacement == RAM.
Dockerizing a repo isn't coding, it's code adjacent. It really cannot be overstated how much these local models lean on the structured grammar that a programming language provides. If it hallucinates a function, a compiler or interpreter gives it that feedback quickly. Tests do the same. But for an open ended task like writing a Docker file, where the superset of solutions is much wider, it doesn't get that kind of feedback and then it has to rely on intrinsic knowledge to deduce the problem OR it has to go search the internet, which it rarely will do unprompted. So when I think people rave about the abilities of something like the latest qwen model, they're operating in a much more constrained field. And I'll just say it that this kind of structure that the language (eg Python, C, etc) gives the output makes things like smaller quants much more forgiving. It's quite undersold I think that there are lots of tasks like data munging that degrade terribly on these smaller quantizations where even an 8bit would work.
agentic-doc@reddit
Tasking a 27B model with Dockerizing repos and complaining about decision making is like asking a Roomba to vacuum your driveway. Right tool, wrong block.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
90hex@reddit
My experience in a nutshell. We're not the only one. Here's a dude who tried exactly this on a MacBook M5 128 GB. Lots of gotchas. https://deploy.live/blog/running-local-llms-offline-on-a-ten-hour-flight/
__Maximum__@reddit
I used qwen 3.6 35Ba3B on opencode and pi coder, and were satisfied with both on middle difficult tasks. It was even better than claude 4.6 or 4.7 in claude code in explaining things since claude does not seem to be a good teacher, it is too compact.
randomperson32145@reddit
I dont see any nodels mentioning competing against openai codex anytime soon
Monkey_1505@reddit
Yeah, there's a notable size difference between the models you are comparing here.
the-username-is-here@reddit
As someone who's been using Claude Code for a loooong time and recently got into local models (with the limited hardware i've got), cannot agree completely.
Yes, local models by default are dumber and slower than even "basic" Sonnet or sometimes Haiku. Yes, there's a learning curve involved, as well as a lot of tweaking. Yes, they tend to hallucinate, loop tool calls, stuff like that.
But.
It kinda doesn't matter when Anthropic decides to slash usage once more and burn through 200 EUR/month subscription tokens in half an hour. Or when it goes down again. Or when it decides that some code you're working on "violates their TOS", effectively censoring your work, no matter what you do.
Once you're set up, you pay just for electricity, which is peanuts on Apple Silicon (and you "need" that sweet 128 GB MacBook anyway 😄 ). It's always available, 100% secure, and you can do anything you want with coding harness, which is a no-go with Claude Code.
Local models are still more than enough for simpler refactors, boilerplate and stuff like that. They require you to get more familiar with the code you're working on, which is A Good Thing™.
You cannot go "hey Claude, make it fast" and then have NFI how it works now and what are the new bugs, which is not necessarily bad.
There's a future for local models, they're getting much smarter and more accessible.
niellsro@reddit
I am having quite good results with qwen 3.6 27b for coding. Using pi, llama.cpp with unsloth ud q8 kxl quant (tried an awq in vllm but i was getting more tool calls errors). However i am really impressed by how good this model is with precise directions. This is still in testing phase for me, i am actually throwing it at a project idea i had in mind for some time, but so far the results are really good. I'm using it for python (backend) and vuejs (fe app). What i noticed (this applies to all llms, but especially to small models like qwen) - make sure to layout the foundation or precise instructions on the architecture and the code, not just requirements - provide interfaces, design patterns etc
PS: i also use claude code, but comparing it to qwen is unrealistic: 2 different models (small vs huge/unknown), 2 different agent tools (claude code vs pi - i dont have API acces to any anthrophic model so i only use them in cc)
xXG0DLessXx@reddit
I think the harness you use plays a big role in how good some local models are. But also, I kind of agree that they simply don’t hold a candle to the big providers. But for me, Gemma 4 for example is extremely good at debugging some weird issues in the system for me, checking logs, giving ideas what it might be, and even fixing some small stuff for me fairly well. Where it’s not that good, is creating things from scratch, or making huge changes to existing code bases.
DeltaSqueezer@reddit
There are limitations on intelligence and context of local models.
But the tasks you gave are easily doable. I've done similar with just a 9B model. I suspect you may have not controlled the thinking (particularly for Qwen3.5) and context exhausted by thinking. I actually have thinking disabled when using it for coding.
muhlfriedl@reddit
You get what you pay for
Aphid_red@reddit
Note: The provider models all have a big system prompt. You don't see it, but it's there. You should use one as well.
The provider models also use 'thinking' mode as well.
If your local model only has one or even neither, it practically won't behave the same way. It's smaller and thus a bit less capable, but shouldn't be unusable for repetitive tasks that have well-documented examples online.
juaps@reddit
The issue is that you require over 240k of context to run it without any problems. I simply switch from LMstudio to llama.cpp custom fork to execute turboquant and all my toolncalling, idles, loops, and so on stop, As a result, now i have a proper and efficient web application with login, chat, rag, and SQL functionalities for my business
iamreddituserhi@reddit
Try giving system prompt (try different quant and versions some version just break keep looping for weird out )
One its tuned can expect beter out put .try different system promts or even ask opus kimi deepseak to optimize the system promt for your use case .
Then it will become usable You also need to adjust temperature according to ur use
SatanVapesOn666W@reddit
His prompt was like 2 paragraphs asking it to work with docker which even the Claude models struggle with. He's really not doing himself any favors, since that's basically the first thing he dismissed and it really is his problem.
BubrivKo@reddit
Yeah, it’s the same for me... Everyone’s praising Qwen 3.6, Gemma 4, and so on, but they just don’t work for my use case. I have 15 years of programming experience, so I can usually tell pretty quickly the difference between clean, functional code and a bad solution that misses the mark.
Opus almost always gives me the correct approach, while smaller models almost always fall short.
I simply can’t trust a smaller model to solve my task correctly, which makes using them kind of pointless for me.
Smaller models might be fine for tasks like summarization, pulling information from large text databases, basically RAG-type operations.
TuskNaPrezydenta2020@reddit
I think Qwen sucks for tool calling, every time I tried either MoE or dense they randomly fail with syntax over time. I had no such problems with Gemma 4 or GLM though and I would say they are roughly as good as the sub makes them out to be in general for my use. I wrote (not vibecoded) my own private harness though after being unhappy with OpenCode and alternatives so mileage may vary
HenkPoley@reddit
Yeah, you can kind of use them like GPT4, e.g. you ask them something, and then what the answer will not work, but you don’t assume it will work. You just use it as an inspiration. Using the function names, they will come up with will get you somewhere in the documentation of the system that you’re using. Things like that.
AardvarkTemporary536@reddit
You don't hire McKinsey to bleed your wallet dry and ruin your business their own hands. You over pay them then to tell you how to ruin your business.
You also don't fire all middle management when times get tough, you just pick the most efficient that you can over work and underpay who does a good enough job. Then when things get better you realize they are doing just so long as you accept high turn over.
Ballisticsfood@reddit
Opus is a mid-junior level developer, sonnet is a junior developer, haiku is a junior developer that got distracted by a butterfly.
Local LLM’s are kids doing ‘my first app’ courses. They don’t have the necessary ‘experience’ to do even basic tasks without handholding, but if you want them to copy out endless boilerplate with some small changes they’re competent enough.
fasti-au@reddit
So what you have is called. Other people flows.
I have 1 million cintext qwen 36 27 and 35b running.
The trick is don’t give tools. Give ways to find tools. Think like you and make statemachine and give a cli tool for the command call http by cli mcp-connect. It gets thoughts.
Ask gpt to read back a thread an devalue how many tokens went to solving the problem and how much to not upsetting you with reply type. That’s why you go local
CriticismTop@reddit
In having similar experiences using open code. Gemma4 has be ok, Qwen spends most of its time going round in circles.
Local is working great for me with automation, but coding leaves a lot to be desired. I am enjoying Big Pickle with open code at the moment
Current_Ferret_4981@reddit
Hermes + Qwen3.6 in Q5 with qkv cache fp8. Flies fairly quickly and handles tasks incredibly well. Much better than the integrations and models my job has made accessible so far. Complex workflows with deep library understanding come together in a few minutes which has been impressive
mission_tiefsee@reddit
local models are not your fairies. If you use this setup, you are using bleeding edge tech. Welcome to the frontier. So change your harness, and jump on the hermes-agent bandwagon and get your stuff going. For your prompt: Go to chatgpt 5.5 post your prompt and demand a prompt for your local model. Unfortunately, local smaller models are really prone to not really understand your intend.
I build a nice tower defense yesterday. With qwen3.6-27b and some prompting help from chatty 5.5. All in godot, without touching code.
somerussianbear@reddit
r/usernamechecksout
themoregames@reddit
TL;DR After giving local LLMs a fair shot for coding tasks, the author concluded the productivity loss outweighs the benefits compared to cloud models like Claude Code.
somerussianbear@reddit
Funny. I wanted to have posted something like this but every time I think about it I fallback to “I’m probably doing something wrong, gotta put more time on this to ensure it’s not my setup” but eventually a new model comes out and I feel “oh, NOW it will work!” and then rinse and repeat and the feeling is still the same.
The thing that seems to have given me best results was
little-coder, a pi-based harness that adds good guardrails for small models. To have an idea of how excited I am about that one, I’m building an entire sandbox tool using SBX because that doesn’t have one and I want to use it badly day to day. For simpler tasks like documentation or understanding/researching a codebase it gave me good results with Qwen 3.6 35 MoE, so I imagine a dense would do even better.But yes, it’s pure grind until we get something minimal working, and most people just don’t have that energy to keep going. Luckily I have fun on the discovery path rather than enjoying just the final results, and for this Reddit this seems to imperative.
EPICWAFFLETAMER@reddit
Completely agree. I use local models all the time, but not for general coding. I see a lot of posts on here of people saying x model is 99% as good as opus 4.7. That sentiment especially gets voted to the top when a new model drops. I think we will have very good local models for this purpose in a year or two, but we just aren't there yet.
Zestyclose-Worth-167@reddit
Look, if a 27B model isn't cutting it, consumer-grade gear just isn't gonna save you. My advice? Milk the free APIs for all they're worth. If those run out, you’ve just gotta bite the bullet and pay up. Even the 80B coders I've used don't really hit the mark. That 27B version of 3.6 is 'okay,' it’s just laggy as hell. So yeah, I feel you. It's either put up with the stutter or you're stuck. Spending $20k+ on an AI rig is overkill—that money would pay for enough API tokens to last you a lifetime.
ShelZuuz@reddit
$20k would be totally worth it for a Kimi 2.6 level model at 100 tok/sec output.
But that's not going to be $20k.
Zestyclose-Worth-167@reddit
yes..
Inevitable_Mistake32@reddit
I like my privacy. I use APIs on the rare occasion I am ok with donating towards my replacement, but for everything else, local. of course LLMs aren't all I self-host, doubt anyone is. But with everything from HA, my fun paper trading accounts, my screeps bots, local and remote API keys on the host, I opted to keep my data local.
Is Qwen or Gemma better than Opus? Idk, is a smaller yacht better than a bigger yacht? Subjective.
But being able to crack out 120 tks, with 256k ctx, with zero api waiting/throttling/ratelimits and knowing none of it leaves my local network? Priceless value to me.
hovo1990@reddit
Try to use caveman plugin for Claude code, it has improved the experience.
StorageHungry8380@reddit
I noticed that the default 8GB for host prompt cache in llama.cpp was not enough for Qwen3.6 27B @ 128k context using it with OpenCode. You can monitor this in the logs by looking for sections such as this:
Here you can see a ~57k prompt ate 5.6GB of prompt cache. Bumped it up to 32GB, since I'm running 4 slots, and it helped a fair bit for me.
Major-Examination941@reddit
Yeah I built my own ollama locally with routing and model switching and orchestrating. Calling into cloud (anthropic, minimax, Gemini) for orchestration and synthesizing. Sometimes code. I mean if you're expecting your rig to actually compete against Claude that literally loses money on you then your expectations are off. Also for continual learning it's great you still have to debug learn how to prompt better, review etc
AdOk3759@reddit
The quality of a local model hugely depends on the harness. I suggest you look into little coder (and their paper)
SourceCodeplz@reddit
I think it is because of quantisation. I am using Gemma 4 31b via API directly from Google and I don’t experience this. It just works!
TanguayX@reddit
Yeah, I'm with you. Did some experiments over the weekend, and my local Qwen3.6, as big as I can muster, with Cline, and it was doing OK with the task I was trying. But I have Sonnet off to the side going..."wow, look, it just made up a function". Even getting Sonnet giving it hints.
So yeah, what's the utility in that when debugging is often worse than just starting from scratch with a better planning doc.
The way I look at it, two years ago, I had to carefully coax GPT through a coding session. Now, I was getting VERY close to getting a local model to one shot based on a good PLANNING and TASK doc. That's pretty sweet. Progress will continue, and it will happen one day soon.
admajic@reddit
Did you give the baby model a plan or just let it loose? If you did use architect mode and then write a jira style ticket it would do better.
Pygmy_Nuthatch@reddit
It's hard to make an honest comparison between a 27B parameter model and a 2T parameter model.
unchikuso@reddit
tldr. but did you try pi? pi fixes things
wasnt_in_the_hot_tub@reddit
Right? Pi is the main reason coding with local models works for me. Before trying pi, I was was essentially building my own (shittier) harness and trying to keep it super minimal to work with smaller models. Then I found pi and realized it was perfect and stopped my other harness project.
I've seen people here using local models with small context windows, then complaining they can't code with opencode, not realizing it eats like 10-12k tokens on init.
There are other things to consider with smaller, local models. But it mostly boils down to making tasks smaller in scope... go figure!
I prefer to compare coding with local models to coding without AI, rather than comparing with cloud-hosted frontier models.
theUmo@reddit
What models are you using with Pi?
dtdisapointingresult@reddit (OP)
Briefly, I was just starting to use it. I haven't gotten far enough to discover and create extensions. The vanilla setup is not usable, it would also read the output of 'docker build' in the main context. I know there's extensions to teach it to run subagents, I just didn't get that far.
PP9284@reddit
I believe this subreddit lacks best practices regarding model deployment and its' practical application in dev, and may be a promising topic to explore.
lnsip9reg@reddit
👋
fredandlunchbox@reddit
I'm currently trying a hybrid approach.
Claude to plan then it uses a skill to implement the code with Qwen-27B. Save tokens writing code.
spaceman_@reddit
I felt like this before, but with Qwen 3.6 for me it has honestly been a non issue for how I use it. ("look at this issue, explore and plan" -> "write a test or test suite that covers the issue" -> "fix or implement the issue")
They're not on the level of Kimi or GLM, but in my daily use, they are more than good enough for 90-95% of the issues.
Due_Duck_8472@reddit
But but but ... all the autistic people here mortgaging their parents houses to buy 6 figures rigs to code "hello world" say it's working ...
Who to believe?!
whatever462672@reddit
Honestly, the issue is with how Claude Code works. I was having these issues using Roo Code until I looked exactly at what it sends to the model. There is too much natural language and dependence on model knowledge. Then the harness sends a wall of text and nukes it's own context.
I get decent results with GEMMA-4 with Copilot Chat and context engineered instructions.
sarcasmguy1@reddit
I've been tinkering with qwen3.6 recently, and have got it to a place where I can use it for most coding tasks, so I thought I'd share my experience.
Note - I still use GPT5.5 and mini for bigger projects (Monorepo or similar), and generally use mini for 'work' tasks as the quality is higher. Qwen has been great for side-projects though.
I run it on a RX 7800 XT, with many MoE layers pushed to the CPU. This allows me to fit almost all GPU layers into VRAM. I get around 30t/s. Prompt processing is really fast as long as I keep context small (68k). I have 32gb of system RAM, and a Ryzen 5 7600.
My workflow is:
1. Plan with 5.5 or mini, depending on the task. Mini for features, GPT5.5 for new projects. I get them to write plan files.
2. Give it to Qwen 3.6 to implement
3. Get mini to validate it
I use pi via the littlecoder harness.
On quality: it feels good in Typescript. This entire repo has been written by Qwen3.6 locally, with 5.5 plans. In less popular languages (like Clojure), its pretty bad. Slow and it hallucinates a lot. Language choice is important.
On speed: Pretty good. It took a lot of experimentation to get here though. littlecoder helped quite a bit, and switching to ubuntu made a big difference (I was on windows previously). I ran it all through lmstudio, I haven't got to the part where I tinker with llama.cpp directly. Its not nearly as fast as say GPT mini, but its good enough.
The main advantage is infinite tokens. They feel amazing, even if they're slower. It really pushes the bar for experimentation imo. However I would not replace my primary workflow with local hardware.
Some issues:
1. Thinking loops are a pain. I've got them to happen less frequently by following the recommended inference settings by the Qwen team, but they still happen. It makes me feel like I need to babysit the model which can be annoying depending on what I'm doing.
2. Small context window. This is an issue with my hardware, not the model at all, but I thought I'd call it out. Auto-compaction kicks in pretty quickly, which can sometimes interrupt the model.
3. Tool calling proactivity. In GPT, the model is really good at knowing when to call a tool. If it encounters issues (like compilation or bad types), it will use a variety of cli calls to get to the solution faster. Qwen doesn't do this, it tends to rather grep every line of code possible and then come up with a solution. This is much slower.
4. Greenfield tasks (e.g "Add this feature"), are still quite bad. It often comes to a really strange conclusion on how to implement a feature. This could be an AGENTS.md or context issue, so not putting this on the model. For example, adding async model loading in the lmstudio extension took a long time and it did some really weird stuff. GPT mini ripped through it, and was proactive in reading docs to find the solution.
aniket_afk@reddit
Yo OP. Which quantized model are you using?
Otherwise_Berry3170@reddit
Like everything else it depends, for example if you were talking about Claude a month ago I would say yeah it was pretty good, now? Not so much they water down the models and recently came and called it a bug because we complained. The prices change, the limit changed, so while I agree models locally are not as good, with training and a good agents/skills I can do with qwen3.6 35b almost the same as Claude sonnet. Qwen3.6 27b is better but on my hardware a GB10 Blackwell is a bit slow so I use it for text only. Took a bit to get the agents right and they still sometimes don’t work as expected but pretty ok with my setup. And from the cost calculation just last week would have spent 2k on Claude api calls. So yeah I agree, not perfect but not that terrible that you cannot work with
_mayuk@reddit
Skill issues xd
I mange to use Gemma4 E2b to digest some json payloads and check in a vector db correlation with some bucks that I have etc .. multiple agent ( not running simultaneously) but each one take care some aspects of the digestion of the json payloads in the way that maybe just a conversation with deep research of notebook would be able to handle hehe …
_mayuk@reddit
Of course you have to do much of programming your self or use another AI to create the python scripts to digest the files and handle the db …
The verbatim of open claw is not enough .. I’m trying to integrate obsidian or/and graphiti …
I hate how much marketing about this kind of stuff is going nowadays because there is many forks or repackages of the same with fancy names … but like people have been saying if you are a coder yourself all this are amazing tools .. a none vibe code can be actually sustainable if everything is running in api calls lol ..
Maybe you can use the pay llms to help you setup proper agents with proper memory handling for some given task .. but you would have multiple diferente agent or stay to organize all this subroutines let’s called xd
Idk I’m never verse in the proper term for all this because despite been programming since 12 I’m mostly self learning… :v
toothpastespiders@reddit
I'm a huge proponent of local models. I've spent a godawful amount of time carefully curating datasets and training along with setting up rag systems, testing, etc etc.
And I don't use local models for coding. I'd love to but I've never really been satisfied with anything I can actually run. Qwen's 480b model would be great if I had the hardware. But I don't. So I use it through their api. I console myself that it 'could' be local for me, one day. In a way that's both non-lobotomized and fast. But at the end of the day that's just cope.
poobear_74@reddit
OP, you might be bumping into tool calling issues since the models you reference are very new. Qwen 27B was only released a week or two ago, and there simply hasn't been enough to time for the developer community patch vllm and other software to work well with them.
Majinsei@reddit
Programar... No tengo el hardware para ello~
Crear pipeline de agentes y procesos, entonces sí~
Generalmente dejo procesando durante horas (casi días) sin miedo a hacer un chingo de pruebas~
Consumo tokens como no tienes idea~ así que automatizar algo en local sin que mi tarjeta de crédito sufra es glorioso~ fácilmente en consumido ya en tokens casi la mitad de una GPU mediana~
Generalmente lo uso para trabajos batch~ programar nah~ con suerte y uso antigravity y a darle al resto~
RedParaglider@reddit
Local models are not really for vibe coding if you want to code with them. They are for pair programming. These are 27b models, you simply aren't going to get the same performance as 1500b models.
I personally do not use local LLM's for coding tasks outside of simple scripts or command line questions without session. I use them for testing agentic business workflows, and for those they are great.
simracerman@reddit
To be brutally honest, I haven’t coded by hand in years and would likely take a year to learn how to get back in original shape, yet the same model you used at Q4 quant + Opencode and a few days worth of sessions I was able to get a fully featured budgeting app build from scratch.
Local LLMs are not cookie cutter solutions yet. There’s more like a clay sculpture - at the beginning you can’t event hold the clay together, but after leading and tweaking you will slowly overcome issues and start producing good results. Remember, this isn’t cloud AI where an army of sys admins and devs are working non-stop behind the scent to make your experienc3 better
_hephaestus@reddit
I wish I could disagree with you but it’s a messy world. I do think there might be something to having something like litellm with langfuse in between the harness and the provider for debugging, but that works the same whether you’re using local or externally hosted. Part of the problem is the speed things are moving and the lack of cohesiveness, all the big players have their harnesses and ship their models with them, meanwhile there’s still unmerged llama.cpp/oMLX stuff supposed to make qwen3.6 understand tools the way it’s hyped to.
Interesting-Yellow-4@reddit
Yeah that's been my experience as well. Local has a long way to go
ResearcherFantastic7@reddit
For people who wants 1 to 1 replacement against cloud model... You are just waiting your time.
Why would you compete a cluster of elephant brain against a single parrot brain.
The way you should look at Local models is they there to do a very specific task ( not excel in it but it can produce an acceptable result )
knownboyofno@reddit
I'm interested in what repo you asked it to do it with. Could you post the link? I want to test this too because this would be a good test. I have had problems like this too. I thought it was easy but it failed quickly.
I had a different problem. I gave it a range in an Excel sheet that was saved from a Google Sheet. I had it recreate those calculations, then use that file only as a "database". That took an hour in Claude Code, then I downloaded the data into a CSV for each data source. This was something I did before. These functions will retrieve the updated data, which is fed directly into the model. I then had it use those functions, but gave it example files to test on before wasting credits. It was able to correctly recreate a 30-sheet Excel file that had the following kinds of formulas with hookups, lookups, index match, sumif, cross product, negative binomial distribution, etc., into a Python dataframe using pandas. I have done this before with other files manually, but it took me 25+ hours to trace the formulas and get the correct data sources, too.
I used Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf with llama.cpp (out of laziness because Ihave vLLM setup), it had full context. In Claude Code without any skills or anything extra but I did turn off a few headers i sent by Claude Code. I did ask it to create a Python environment to run what it needed. It did ask a few questions but I didn't have to micromanage it.
cleversmoke@reddit
I use sota models for high level plan, strategic plan, architecture plan, and feature implementation plans. Then I use local Qwen3.6-35B-A3B + DeepSeek-R1-Distill-Qwen-14B as an agentic coding pair to build one feature at a time.
It's going well, but it's more involved than just "build me an app". For anything that Qwen fails at, I just fall back to a sota model.
ascendant23@reddit
Expecting a 27B OSS model to hold up head to head with the latest Opus / GPT is just wild. It's like trying to replace an 18-wheeler truck with a ford pickup truck. If your workload requires an 18-wheeler, the pickup truck is never going to come close to meeting that need.
Doesn't mean pickup trucks are bad, just means, don't expect them to do things they can't.
StardockEngineer@reddit
Claude Code has a parameter you need to set to prevent it from junking the KV cache. I forget what it is but maybe you can search for it.
biotech997@reddit
I tend to agree, from my experience with small-medium (<27B) models, they just aren't that smart. For regular questions sure, but things like scaffolding a codebase and actually providing relevant code is still very far away. Even boilerplate code I feel like it's evidently lower quality. Obviously I don't have 2x5090s, so YMMV.
Over-Advertising2191@reddit
I think the best approach is to use these mid-sized LLMs as assistants that understand your codebase and occasionally help you with a specific function or specific errors. Anything broader and the experience is the same as op's
Long_comment_san@reddit
Try fidgeting with sampler settings maybe
Photoperiod@reddit
I say this as somebody who runs local models and deploys OSS models to on prem datacenter gpus as part of my job. Ultimately, local AI is a niche hobby that is not really cost effective. The only major reason I can see doing it beyond hobby tinkering/learning is for a fully secure/private stack. If you absolutely cannot let your data go to the cloud then you need to go local.
There are enterprises that absolutely will not send their sensitive data to cloud model providers like Claude or gpt. Those enterprises are using OSS models deployed on their own compute for coding, analysis, etc. But yeah, if you're in the clear for data privacy and whatnot, then it makes more sense to use cloud providers.
maz_net_au@reddit
All of this (and the comments) is how I feel about using Opus as well. LLMs are fun but they're just so dumb.
One-Replacement-37@reddit
Cool story, bye!
Different-Rush-2358@reddit
Look, the problem that many of you are facing—and honestly, it’s not even your fault—is that they’ve overhyped local models way too much without explaining how the "local hype" actually works. As of today, the only local models that are truly good for general purposes, including coding, are Gemma, Qwen, and DeepSeek. Forget about those weird variants or random "labs" that pop up out of nowhere; most of them are just distillations of models that were already distilled before.
Then there’s the whole quantization topic, which has advanced quite a bit. For example, Unsloth’s UDS gives you very decent precision and they fit into any consumer PC (depending on the parameters, obviously, haha).
And then you have the "Blackwell sect" and their "high precision or nothing" mantra (which smells a bit like a sponsorship from NVIDIA or some massive GPU distributor, but whatever). They’ll tell you that if you don’t have 17 Blackwells, 900TB of RAM, and a quantum computer, you can’t run anything. That is a total scam. Anyone with 15 minutes of spare time can figure out the commands to squeeze the most out of their hardware. You can run models very comfortably on hardware from several years ago.
(Example: Qwen 2.5 72B at 10-20 tk/s on a Xeon 2680v4, 32GB of RAM, and a GTX 1070 with 32k context, thanks to turboquant flags).
So, in short: believe only half of what you read here, and take the other half with a grain of salt. You don't need a $30,000 rig to run a 280B model, for instance; you just need to know how to use the correct flags in llama.cpp and have a balanced setup.
Sorry for the wall of text, but I saw this post and took the chance to get this off my chest it’s been on my mind for months. And I know I might get downvoted to hell for this... but I don't give a damn. It’s about time we debunked the myth
Such_Advantage_6949@reddit
reality is u need at least 200B local model like minimax to get serious stuff done, small model had made progresss, but they will break as soon as u throw serious stuff at them
dev_all_the_ops@reddit
Thanks for sharing. I've been obsessed with getting started in this, but I worried I would just be wasting my time.
I still like local models for security and to fight against subscription bloat, but its good to know that its just not as good as paying a major player.
Noiselexer@reddit
I've never considered local llm for coding..
alexthecat999@reddit
Is it good for small tasks and bolierplate code? Just to bridge my lack of syntax kowledge with a new langauge?
ryfromoz@reddit
Yes, and its better if you can actually prompt 😁
SnooPaintings8639@reddit
Yup, it's a bit over hyped here. I mean, if it wasn't open weight I assume it would be very rarely used anywhere.
Having said that - it is capable and with proper care if CAN replace sonnet or others in many clearly defined, coding heavy tasks. It just needs a bit more of care from you and/or larger model on top.
So, if all you care about is price, speed and qualify - probably stick with APIs. But if you have a reason to go local, this model CAN do it it.
kevin_1994@reddit
Works fine for me but I don't delegate all my thinking to a machine
Kitchen-Patience8176@reddit
Honestly, not worth it unless privacy is a major priority for you. I just use the $20 ChatGPT subscription (includes Codex) and use it in the terminal for Docker and general sysadmin stuff.
gffcdddc@reddit
You need to use a high param MoE model. Then use a fast gpu and offload the experts.
Ok_Librarian_7841@reddit
Yes, unless you can host Kimi 2.6 or minimax 2.7, local coding is trash
matt-k-wong@reddit
even though I use local models a lot life is better with Opus
tomByrer@reddit
Takes a bunch of homework &/or beefy GPU power & VRAM to get LocalLLM worth it.
Seems you have neither.
c64z86@reddit
Local models are pretty good to play around with and some can be good even at storytelling or roleplay, but subs like this one hype them up way too much. Local AI is pretty revolutionary in itself because it sets us free from corpos who dumb down their models while charging us more for the pleasure, and I will say and spread that around that all day long, but I'd be lying if I said that they were better than something like Claude Opus.
Own-Refrigerator7804@reddit
This was written by a local or online model?
dtdisapointingresult@reddit (OP)
Neither, you regard.
Terminator857@reddit
Strix halo qwen 3.5 122b q4 working well for me on simple stuff. Yes very slow, but works.
blargh4@reddit
They’re fun to mess with, they have legitimate practical uses when you don’t want to burn money on properly scoped tasks they can do fine… just use then with appropriate expectations.
TheAncientOnce@reddit
catched this post too early. Would love to come back and see what people say..
Zestyclose-Worth-167@reddit
如果27b满足不了. 基本消费级的模型解决不了你的问题了. 免费羊毛如果能撸就撸. 撸不到就乖乖交钱吧. 我目前用过80b的coder都满足不了的. 3.6的27b能基本满足就是太卡; 所以说, 我懂你; 要么忍受27. 要么无解. 如果花10几万买ai机, 这个token数够你用到盖板;