I'm done with using local LLMs for coding
Posted by dtdisapointingresult@reddit | LocalLLaMA | View on Reddit | 426 comments
I think I gave it a fair shot over the past few weeks, forcing myself to use local models for non-work tech tasks. I use Claude Code at my job so that's what I'm comparing to.
I used Qwen 27B and Gemma 4 31B, which are considered the best local models below the multi-hundred-billion-parameter tier. I also tried multiple agentic apps. My verdict is that the loss of productivity is not worth the advantages.
I'll give a brief overview of my main issues.
Shitty decision-making and tool-calls
This is a big one. Claude seems to read my mind in most cases, but Qwen 27B makes me give it the Carlo Ancelotti eyebrow more often than not. The LLM just isn't proceeding how I would proceed.
I was mainly using local LLMs for OS/Docker tasks. Is this considered much harder than coding or something?
To give an example, tasks like "Here's a Github repo, I want you to Dockerize it." I'd expect any dummy to follow the README's instructions and execute them.
Issues like a 'docker build' that takes longer than the default timeout, which sends the model off on unrelated follow-ups (as if the task had failed) instead of checking whether the build is still running. I had Qwen try to repeat the installation commands on the host (also Ubuntu) to see what happens. It started assuming "it must have failed because of torchcodec" just like that, pulling this entirely out of its ass, instead of checking the output.
I tried to meet the models half-way by putting this in AGENTS.md: "If you run a Docker build command, or any other command that you think will have a lot of debug output, then do the following: 1. run it in a subagent, so we don't pollute the main context, 2. pipe the output to a temporary file, so we can refer to it later using tail and grep." And yet twice in a row I came back to a broken session with 250k input tokens because the LLM had read all the output of 'docker build' or 'docker compose up'.
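To be concrete, the behavior I was hoping the model would figure out on its own is roughly this kind of wrapper (a hypothetical helper sketch, not something any of these harnesses actually ship): run the noisy command, dump the full output to a temp log, and only surface the tail.

```python
# Hypothetical helper an agent could be told to call instead of running long,
# noisy builds directly. Everything here (name, tail length) is illustrative.
import subprocess, sys, tempfile

def run_logged(cmd, tail_lines=40):
    """Run a noisy command, stream its output to a temp log, print only the tail."""
    log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)
    proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    log.close()
    with open(log.name) as f:
        lines = f.readlines()
    print(f"[exit={proc.returncode}] full log at {log.name}")
    print("".join(lines[-tail_lines:]))
    return proc.returncode

if __name__ == "__main__":
    # e.g. python run_logged.py docker build -t myimage .
    sys.exit(run_logged(sys.argv[1:]))
```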
I know there are huge AGENTS.md files that treat the LLM like a programmable robot, giving it long, elaborate protocols because the authors don't expect it to have decent self-guidance. I didn't try those, tbh, and none of the ones I've seen go into details like not reading the output of 'docker build' anyway. I stuck to the default prompts of the agentic apps I used, plus a few guidelines in my AGENTS.md.
Performance
Not only are the LLMs slow, but no matter which app I'm using, the prompt cache frequently seems to break. Translation: long pauses where nothing seems to happen.
For Claude Code specifically, this is made worse by the fact that it doesn't print the LLM's output to the user. It's one of the reasons I often preferred Qwen Code. It's very frustrating when not only is the outcome looking bad, but I'm not getting rapid feedback.
I'm not learning anything
Other than changing the URL of the Chat Completions server, there's no difference between using a local LLM and a cloud one, just more grief.
There's definitely experience to be gained in learning how to prompt an LLM. But I think coding tasks are just too hard for the small ones; it's like playing a game on Hardcore. I'm looking for a sweet spot on the learning curve, and this just isn't it.
What now
For my coding and OS stuff, I'm gonna put some money on OpenRouter and exclusively use big boys like Kimi. If one model pisses me off, move on to the next one. If I find a favorite, I'll sign up to its yearly plan to save money.
I'll still use small local models for automation, basic research, and language tasks. I've had fun writing basic automation skills/bots that run stuff on my PC, and these will always be useful.
I also love using local LLMs for writing or text games. Speed isn't an issue there, the prompt cache's always being hit. Technically you could also use a cloud model for this too, but you'd be paying out the ass because after a while each new turn is sending like 100k tokens.
Thanks for reading my blog.
onethousandmonkey@reddit
Purely from the performance point of view, there are a number of settings to tweak to make Claude Code jive with local models. For example: https://unsloth.ai/docs/basics/claude-code#fixing-90-slower-inference-in-claude-code
Before I did that, I was banging my head against the wall at the slowness and useless cache.
AdOk3759@reddit
I would also suggest looking into little coder, which is a harness specifically designed to boost smaller models' performance
Torodaddy@reddit
OpenCode is actually pretty sick too. I use Claude Code daily for work and I found OpenCode leaps and bounds more productive and faster
RobotRobotWhatDoUSee@reddit
link? I googled little coder (and variations) but largely found many webpages targeted at teaching children to code. Worthy goal, just not what I am looking for!
Clear-Ad-9312@reddit
https://github.com/itayinbarr/little-coder
ChocomelP@reddit
Why would you use Claude Code without Claude models? The models are what make it good. The harness itself is suboptimal. If you could easily OAuth to use other harnesses like Pi with your Claude subscription, I would never use Claude Code.
howardhus@reddit
not true. its an open secret that claude has done a great job optimizing their harness to boost performance… after the leak there were several anslysis confirming this
ChocomelP@reddit
Oh wow, I didn't know there were "several anslysis"...
howardhus@reddit
well… glad you came out from under that rock
Gesha24@reddit
Personal experience - because I like it the best. I am getting better results with claude code than aider or continue.dev when working with a local model
howardhus@reddit
you da real mvp
Mobile_Bonus4983@reddit
The link says:
"Claude Code recently prepends and adds a Claude Code Attribution header, which invalidates the KV Cache, making inference 90% slower with local models. See this LocalLlama discussion."
Torodaddy@reddit
No GPU? You are using really small models; you could go up a size and use something quantized
reginakinhi@reddit
Who told you those small models were the best that local models had to offer? They're great for their size, but even the biggest local models most likely don't compare to the size of Opus. The comparison is hardly fair
pablo_chicone_lovesu@reddit
You're missing so many things.
You realize that Claude is faster because they have huge memory and GPU banks right?
I use cline with a tuned qwen3.5 to check all of my code, and it does a pretty good job. But I'm also more obsessed with context windows than model size.
You need to tune your rules for the model, make use of skills and also mcp, you don't just replace a model and be done. These ai companies have spent years tuning their setups to be what they are.
The model and context windows are a big part of the stack but not all of it. If you're not willing to put in the time, you're not going to get good results.
MLExpert000@reddit
I won’t really say that out loud here because people get really offended. But I hear your point.
andy_potato@reddit
It is necessary to say it out loud.
Qwen 3.6 27b is a great model for many applications. But I’m sick of these posts of people claiming it performs on par with Claude for coding. It is simply not true.
Finanzamt_Endgegner@reddit
It is if you know what you are doing. It isn't if you don't. For pure vibe coding without thinking on your part it might not be there yet, but with the correct harness settings, instructions, and guidance it can compete with at the very least Sonnet 4.5
andy_potato@reddit
It’s not even close. Everyone claiming otherwise is just coping hard.
Recoil42@reddit
Say it out loud, otherwise this place devolves into a reality bubble and loses value to everyone. Sometimes, people need their medicine.
Pleasant-Shallot-707@reddit
So, you refused to craft the guardrails to accommodate the needs of the local models, expected one shot level behavior and were upset that they can’t work that way.
markole@reddit
It is irrational to compare a 27B model running on a single GPU and a multi-trillion-parameter model running on clusters of GPUs that cost more than your retirement fund.
droptableadventures@reddit
And also to conclude that that's as good as "local LLMs" get.
It's good for its size, but it's one of the smallest notable models out there.
markole@reddit
I wouldn't be too surprised if we get a 70-120B model as strong as Opus 4.6 in 12 months or so. Remember what we had last year in April.
Virtamancer@reddit
OP is almost certainly running quantized versions, plus he’s using them through Claude Code instead of something like OpenCode or Pi which don’t pollute the context. 🤡
That and expecting teeny tiny models to perform similarly to Opus is kind of wild.
If he used the unquantized versions in OpenCode he might have a much better experience but it still wouldn’t match opus. Literally in a year though it will—coding agents and harnesses and optimizing the processes is the entire game right now.
Widget2049@reddit
Your AGENTS.md is still too weak; you need to be more thorough for a 27B model. Make it focus on what the LLM really needs to do, and avoid phrasing like "IF" and "DON'T". You need to create a solid plan in plan mode before executing anything in build mode. Local LLMs for coding are still good if you know what you're doing, so keep learning
StorageHungry8380@reddit
Got some concrete examples of how AGENTS.md should look for such models?
2Norn@reddit
https://github.com/forrestchang/andrej-karpathy-skills/blob/main/CLAUDE.md
this is the only one i ever use, simple and to the point
Widget2049@reddit
cleversmoke@reddit
Beautiful, thank you for the links!
StorageHungry8380@reddit
Awesome, much appreciated.
ChosenOfTheMoon_GR@reddit
A piece of advice I'd like to give to anyone using instructions with any model: test the model really quickly on the instructions you want it to follow first. Specifically, ask it what it understands about the instructions and what is unclear to it. It will save you a ton of time trying to figure out what it can and won't do given your instructions.
Iory1998@reddit
I am not a coder, but I totally understand and share your feelings. I find myself going back to Gemini-3.1 or Deepseek v4 for better replies, or I start a conversation with them, copy it to LM Studio, and continue with a local model like Qwen-3.6-27B-Q8 or Gemma-4-31B-Q8. This seems to give them a bit of an edge.
But I use them mostly as an inner voice that helps me collect and organize my thoughts. When I need a serious sanity check, I go back to the top Gemini or Deepseek (I like to vary the sources). Perhaps if I could run a larger model locally, it would be much better...
And you are right about wasting time. You can get a good outcome with small local LLMs, but you spend more time and energy. If you are tight on time or you need to make a lot of decisions, just go with the best model you have access to. People have limited decision-making capacity per day, and it doesn't matter whether you decide on trivial or serious matters, you spend the same energy deciding.
Migraine_7@reddit
Are you using a subagent to at the very least create a work plan before each task?
Even Sonnet and sometimes Opus fail miserably if the task is not well defined.
dtdisapointingresult@reddit (OP)
Does it need to be a subagent? This was my full prompt:
Intelligent_Ice_113@reddit
this prompt explains a lot
stilet69@reddit
No, no. The phrase "Make no mistake, or I'll kill you" is more appropriate to this case.
2Norn@reddit
i have better success with make no mistake or i'll kill myself
simracerman@reddit
I can claim the badge of student among you all, but that is not how I’d feed a small 27B model any prompt. The extra unnecessary context will certainly confuse it.
Do yourself a favor: run your prompt through it and ask if it can cut it down to a problem statement and goals. Divide the task into subagents (trust me on this one). Use OpenCode and ditch CC for local models; it produced worse output in my experience.
false79@reddit
"The instructions are simple"
Lol, what the hell is that prompt.
That helps nobody. Not even humans.
xienze@reddit
This may not be the world's greatest prompt, but if you handed that off to a developer who knows what Docker is... those instructions are pretty clear IMO.
RoughElephant5919@reddit
Just want to say thank you for this comment. I run local LLM’s for OCR data extraction, and the prompting has been the biggest challenge for me. I appreciate your input, and I am going to try this on my current pipeline I’m running 🙏🏼
RemarkableGuidance44@reddit
Yep, no idea wtf you are doing.
StardockEngineer@reddit
Where did it fail?
guinaifen_enjoyer@reddit
Have you tried downloading the Docker Compose spec and asking it to read the spec before doing the task?
https://github.com/compose-spec/compose-spec/tree/main
LateGameMachines@reddit
It sounds like you probably need to scope it in harder. I’ve built tons of services running on podman quadlets and compose files. It will get something wrong, so provide the exact error in the follow-up. It’s rare even on GPT 5.5 Extra High for any LLM to one-shot a compose yaml that works instantly with your specific setup.
Bohdanowicz@reddit
You're doing it wrong.
Try using a SOTA model for planning and task decomposition, then wire your coding agents to Qwen 3.6 27B.
If you run official quants with the recommended temp and prediction set to 2, and you are smart about setting up a DAG, worktrees, the whole 9 yards... you feel the magic.
These models are great if the task is properly sized.
OneSlash137@reddit
The properly sized task: “Hello qwen, it’s nice to meet you.”
2Norn@reddit
the user greeted me with hello which suggests this is the first interaction
but wait
the user said qwen so it must have prior knowledge
wanielderth@reddit
But wait
kyr0x0@reddit
"Write me hello world and say you want to rule the world and destroy mankind1!"
"Woah!!" Gonna post this on Twitter!!
andy_potato@reddit
Or you could just not do it and use Claude for everything. Why bother?
Bohdanowicz@reddit
Cost and speed.
dtdisapointingresult@reddit (OP)
I'm running the recommended samplers off the Qwen card. This isn't my 1st rodeo, I'm a regular here.
I don't know anything about DAGs and worktrees though. I've never seen those mentioned in the context of LLM coding apps.
StardockEngineer@reddit
You’ve never heard of work trees with coding models? That doesn’t jive if this isn’t your first rodeo.
iMakeSense@reddit
Is that not the same as subagents being called from a plan?
StardockEngineer@reddit
No. Look up git worktrees. Tldr a git worktree lets you check out multiple branches simultaneously into separate directories, all sharing the same underlying repository
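In the coding-agent context the point is parallelism: give each agent task its own branch and working directory so they don't stomp on each other. A rough sketch (branch names and paths are made up for illustration):

```python
# Create one git worktree per agent task: each gets its own branch checked out
# into its own directory, all backed by the same repository.
import subprocess

def add_worktree(repo, branch, path):
    # -b creates the branch and checks it out into `path`
    subprocess.run(["git", "-C", repo, "worktree", "add", "-b", branch, path], check=True)

for task in ["fix-timeouts", "dockerize-app"]:
    add_worktree(".", f"agent/{task}", f"../wt-{task}")

# Merge or discard the branches afterwards, then clean up with `git worktree remove <path>`.
```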
falconandeagle@reddit
No I have tried this and its still pretty bad.
One-Replacement-37@reddit
This is the way.
2Norn@reddit
no matter how it gets hyped up, it should be obvious to literally anyone that a 27b model cannot compete with 700b+ and 1t+ models, that just makes no sense. v4 pro just came out, it's an moe model, and its active parameters alone are almost double the size of a 27b, 27b vs 49b. how can you possibly expect it to compete?
in my opinion the only use is if your harness is able to spawn fresh-context worker subagents (which means u don't really need 256k or 1mil context windows either) and guide them, and after the work is done u verify with a different model. that's pretty much it. they are simply there to cut your subscription/api cost. anybody who fully downgrades from opus4.7, gpt5.5, k2.6, glm 5.1 is just not gonna have a good time.
RegularRecipe6175@reddit
Did you use an 8-bit or better quant? Curious, but it's not going to change the outcome if your work gives you all-you-can-eat Claude. As someone who is forced to use local models from time to time, I can say using at least an 8-bit quant, if not full fat, makes all the difference for small models.
Particular-Award118@reddit
Who has the vram
mister2d@reddit
The small ones are also very sensitive to quantized KV. I started running with the KV cache at full precision and noticed a significant increase in quality.
It's slower, but useable.
bonobomaster@reddit
I agree.
It's just a feeling at this point, because I don't have numbers to back it up, but even a Q8_0 KV cache makes at least all the Qwens I tried noticeably dumber, especially in regards to coding and successful tool calls.
dtdisapointingresult@reddit (OP)
The official 27B FP8 from Qwen, yeah. Ran slow but having MTP helped. (unlike Gemma)
StardockEngineer@reddit
You can run Gemma e4n as a speculative decoding model for a big performance boost.
andy_potato@reddit
That doesn’t make Gemma any better at coding. Just faster at producing nonsense.
t4a8945@reddit
3.5 or 3.6?
They are NOT the same haha. They cooked, really.
dtdisapointingresult@reddit (OP)
3.6, who do you take me for? I know game!
RemarkableGuidance44@reddit
You don’t know what you’re talking about here.
You clearly don’t understand how to set up models properly across different hardware, how quantization behaves differently depending on the setup, or how important pre-prompting is for getting better results.
You should spend some time learning how these systems actually work. Reading through the Claude Code files might help you understand how they drive Claude in the right direction. Even though that has turned into a pile of sh!t.
YOU KNOW THE GAME.... Looks like you don't...
andy_potato@reddit
OP clearly knows the game. But OP obviously also has a life.
Material_Soft1380@reddit
have you tried BF16?
t4a8945@reddit
Whoops, sorry!
I tried it in my setup (2x Spark) and it did some amazing stuff (a massive refactor); the only issues I had were that it would stop for no good reason and sometimes output XML. I blame its Jinja template, and I've got no time for that.
Anyway, I liked your post, it's a good reality check from a real experience. Thanks
Oleszykyt@reddit
You should've tried MiniMax M2.7, it is very good
taoyx@reddit
I coded my first chatbot with python and streamlit using local LLMs. What I've learned back then is that they were really bad at modifying existing code, but if you let them start from scratch then they just do fine.
Then I've learned about context size.
Electronic-Space-736@reddit
"Here's a Github repo, I want you to Dockerize it." is terrible lazy and most likely to fail.
You are missing orchestration layers.
dtdisapointingresult@reddit (OP)
Do I really need to run brainstorm skill, decide on architecture, answer questions about TDD compliance, to have the LLM dockerize an already-working app that gives all its doc in the README?
Electronic-Space-736@reddit
no, you need an AI layer that does that: it creates smaller tasks from the large one and hands them off to workers, the same as what happens with the cloud ones. it's just that you need to set that up yourself.
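Stripped to the bone, that layer is just a planner pass plus a fresh-context worker call per subtask. A minimal sketch, assuming an OpenAI-compatible local server; the model name, URL, and prompts are placeholders, not what my project actually does:

```python
# Planner/worker sketch: one call breaks the big ask into steps, then each step
# runs in a worker call that starts from a clean context.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def ask(system, user):
    resp = client.chat.completions.create(
        model="qwen-27b",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}])
    return resp.choices[0].message.content

plan = ask("Break the request into short, independent steps, one per line.",
           "Dockerize the app in this repo using the README instructions.")

results = []
for step in [s for s in plan.splitlines() if s.strip()]:
    # Each worker only sees its own step plus a few recent results, not the whole history.
    results.append(ask("Complete exactly this one step. Be concise.",
                       f"Step: {step}\nRecent notes: {results[-3:]}"))
```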
mister2d@reddit
You get it.
Electronic-Space-736@reddit
I do, and good news, it is open source https://github.com/doctarock/local-ai-home-assistant
kyr0x0@reddit
Your code needs a serious refactoring to TypeScript and ESM. It's obvious that the tool-calling harness is fragile, as it assumes issues only certain LLMs will face while others will stumble over it. It has thousands of lines of code to solve tasks that are more trivial to solve, but LLM-generated code tends to overcomplicate things. Also the README reads very AI-sloppy and overstates the functionality. But I haven't given it a reality check yet - that's just a gut feeling. It's cool, though, that you open-sourced it. I liked some of the ideas; it's just that it's one of those huge codebases that become hard to maintain. I'd suggest refactoring it, trying to get less code doing more, and adding a serious amount of tests (integration and e2e as well) plus ARCH.md docs for every module, so that the LLM won't hallucinate on it when you continue using it to write code.
Electronic-Space-736@reddit
Thanks for taking a look, I will pass on the refactor but you are welcome to.
This is the core system, most of the functionality is spread into plugins, which I am publishing regularly.
The tool-calling harness is a catch-all addressing common problems; this is deliberate. There are hooks throughout for customizing its functionality, the core function is the fallback if you have not extended it further, and I have plugins that advance this baseline.
I do not consider this a particularly huge codebase. If you investigate the main core system, it is a platform or foundation to plug things into, with the basics included, built with security in mind, a good deal of flexibility, and a visual GUI. Sure, it is larger than your average vibe-coded app, but it is more serious than your average vibe-coded app too.
There is some refactoring needed, development is AI accelerated, but I refactor regularly throughout the development cycle and have 30 years of software experience - I think you should take a closer look when you get a chance.
traveddit@reddit
Looking at your repo and how you constructed your harness I don't think you're in any position to be giving out tips. You literally have subagent orchestration structure backwards. You're using Gemma 4B to decide the scope of your query and you have the 26B as a worker. This is a fundamental misunderstanding of how to allocate intelligence for subagents. You can't let a dumber model orchestrate the task because it will never know when to reliably handoff to "harder" tasks.
mister2d@reddit
Nice project. Are you the author?
It would be better to use systemd-nspawn rather than docker for isolation. You get almost zero overhead (daemonless) with the desired level of process isolation.
Electronic-Space-736@reddit
nice, I am using docker as it is well known and easy to include install scripts for, I also have a few things like the RAG that use containers as part of the whole
dtdisapointingresult@reddit (OP)
Isn't that what the LLM's reasoning is for? I shared my whole prompt here btw:
https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiowcxe/
Then between Qwen Code's system prompt + Qwen 27B's reasoning, I don't think it's unrealistic of me to expect it to complete this basic task.
It's not like it failed to compile the dependency for my hardware because of some complex compatibility issue. We didn't even get that far!
Electronic-Space-736@reddit
how can I make it clearer: there is another layer, which you are unaware of, that the cloud services provide and that makes LLMs smarter and more effective.
Running Qwen in llama.cpp (or whatever) does not supply this layer; you need to make your own or use someone else's.
kyr0x0@reddit
Qwen Code is such a layer, or at least is sold as such. Cloud services don't run harness code on their servers for LLM inference. They do so for non-coding harnesses (ChatGPT, or coding harnesses with server-side agents), but a decent RooCode, OpenCode, or even VS Code Insiders should already bridge the gap the same way they do with large models, not just SLMs. Yet they don't, because with a small model you can only shoot at a moving target: you write instructions to fix one issue, then it stumbles on the next, and the next, and you continue... Finally you switch models next week, face totally different issues, and your work is pointless - you need to rewrite everything for the next model, which requires other fixes.
Electronic-Space-736@reddit
yes, for small context, but then we hand it pages, so it needs to be broken back into smaller pieces that qwen was built to handle, this is the layer that is missing
StardockEngineer@reddit
Already working? I thought I saw your prompt asked it to figure out if it needed to compile things from scratch?
PeerlessYeeter@reddit
op's experience somewhat matches mine, I keep assuming I'm doing something wrong but I think this subreddit gave me some unrealistic expectations
Hodler-mane@reddit
1000%
I've been following the guides exactly for a decent-performing qwen3.6 27b on a 3090, and everything I try fails at basic stuff like thinking and tool calling.
then I realized all these examples are examples for chatbots with no thinking or tool calling... they just fail to mention that.
Finanzamt_Endgegner@reddit
That's a config issue. It should not fail any tool calls; I've had it do like 2000 or so at this point with just a single failure.
StardockEngineer@reddit
Nah, they work. I use 27b with Pi Coding agent to do hard things all day long. The latest thing I did was ask it to iterate on some never before seen data for a data science hackathon. After about 20 commits it made an html dashboard to show me the results.
bjodah@reddit
I love local models, but this has got to depend on the task complexity at hand? There are plenty of tasks (scientific computing, etc.) for which I don't even bother asking Sonnet (let alone my local Qwen 3.6 instance) to solve, but go straight to its bigger brother or OpenAI's/Google's SotA offering (unless the data is sensitive).
StardockEngineer@reddit
I’m not saying they can do it all, but they can do far far more than what many in this thread think. I can do 90% of my work now in 27b, at least. And I’ve had 27b fix three problems both Codex and Opus got stuck on.
roosterfareye@reddit
I think the problem frequently lies between the chair and keyboard. Poor prompting, poor planning, impatience.... I was there too once!
bjodah@reddit
That's probably a common case. I just want to add that sometimes you really need the extra world knowledge of the larger model. For example, every now and then I want assistance in a niche programming language (elisp) and the smaller models (understandably) hallucinate functions that do not exist. For elisp in particular I've found Gemini 3.1 Pro to be the undisputed king. I really want to use my local models there as well, but I get nowhere near the success rate I can achieve for, say, Python and bash.
Finanzamt_Endgegner@reddit
Then just let Gemini create a list of things the local model has to adhere to and it should be fine? You don't have to use Gemini for the actual implementation and stuff
dearmannerism@reddit
This type of reaction is why I haven't lost hope yet. There must be a smarter way to break down a big task into bits that are easily digested by smaller models like Qwen 27B. Once we find those primitives, everything can be just a simple processing loop, like a Ralph loop.
Alwaysragestillplay@reddit
It depends what you're doing. Frankly for any "real" workload that a dev is likely to face, the <100B models are going to crap out sooner or later. I would suggest that decent model routing and orchestration is the way to fix that if your goal is to save tokens. Have some mechanism to judge prompt complexity, then choose whether to invoke Claude or Qwen dynamically.
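A minimal sketch of what I mean by routing; the endpoints, model names, and the "complexity" heuristic are all placeholders you'd tune (or replace with a small classifier call):

```python
# Route easy prompts to the local model and hard ones to a frontier model,
# both behind OpenAI-compatible endpoints.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
cloud = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def looks_hard(prompt):
    # Naive complexity proxy: long prompts or refactor/architecture-ish asks.
    return len(prompt) > 4000 or any(
        w in prompt.lower() for w in ("refactor", "architecture", "migrate"))

def complete(prompt):
    client, model = (cloud, "big-frontier-model") if looks_hard(prompt) else (local, "qwen-27b")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```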
roosterfareye@reddit
Yes, agree. I just remoted into my PC after asking qwen 3.6 35 a3b (6k quant) to generate a full test suite and --> run --> evaluate --> repeat until fixed and damn me, it did it, fully and agentically in LM Studio no less!
Own_Mix_3755@reddit
I use Qwen 3.6 27b for coding sessions just fine. The problem is often multilayered. It starts with a wrongly configured server (I understand there are literally hundreds of combinations, but some are much better and some are much worse), continues through a good harness (I ended up with RooCode, as e.g. Claude Code seems to add so much overhead to each task that it's just not worth it; I also had to define my own modes manually and engineer custom prompts and skills), and ends with model size and type (people often choose smaller quants like K_3_S to fit everything into VRAM with 256k context, while with a good agentic workflow you rarely go over 64k context). You also have to understand you are working with a much smaller model and effectively dumbing it down quite a lot with a small quant. You have to find ways to help it a bit (giving it a proper, readable "manual" will certainly help).
mateszhun@reddit
Same, local models seem to work really well with Roo Code.
But I do have a problem on longer context windows with 27B: it suddenly starts to fail at file edits. (Maybe it is a setup problem?) But 35B doesn't have that problem.
I've settled on 27B for Ask, Orchestration, Architecture modes, and 35B for coding. And 35B is also faster as a moe model, so it works out nicely for the longer outputs. I'm using Q8 quants for both models.
Eyelbee@reddit
I switched to Cline after they shut it down, and it seems to be about the same. I have small complaints, like not being able to see things like the system prompt, and I'm too lazy to look at the source code. It's close to perfect in my opinion; I'm thinking about forking it if I can find the time.
DrBattletoad@reddit
Good to see someone else with the same problem as me. I thought I was going crazy to see 35B solve problems that 27B wasn't able to.
Sn0opY_GER@reddit
I use Roo Code with LM Studio on a 5090 with qwen 3.6 27b (or 35b) and I'm surprised how well it works, tool calls etc., no problems. I managed to code timer software with nice animations for our mini RC car track that talks to the IR trackers for the timing, and now we have a start light, leaderboard, rain warner etc - for free. I played a little with openclaw for 2 weekends and spent 700$ on Claude :p I think the best way is a hybrid approach where the local model does the simple stuff and the cloud corrects and refines. That's how my claw works now and it works very well. If local is stuck or I'm not happy, it can talk to a cloud bot in Discord and get help fixing it, or the cloud bot can take over.
330d@reddit
I'm sorry but these are all toy projects. An average SaaS that's not a crud will have 50-100k lines of backend and 20-30k lines of frontend with complicated deploy pipelines
MexInAbu@reddit
Well, no one is (or should be) vibe coding a production SaaS with a local, quantized small LLM. Hell, you should have very strong guardrails if you are doing that with the frontier models too.
Sn0opY_GER@reddit
True - and now that I think about it, you can literally FEEL with every line of code that it takes longer and creates more bugs. At first it's prompt > "ooh that looks really nice - let me add XXX", and after a few of these "loops" the bugs/breakage start and more and more time goes into fixing stuff. At the end I had to use Claude to fix an error with the minimap timings that the local model just couldn't get right (local only ever displayed cars in the first 25% of the map, never a complete lap - Claude fixed it and called it "bad math" 😃)
den0rk@reddit
Could you recommend some necessary adjustments in LM Studio?
Own_Mix_3755@reddit
That's hard to say. Depending on your hardware, model, usage, … there is a lot. Google is your friend and you have to do a lot of testing.
TheTerrasque@reddit
That's a you problem.
Local models aren't as good as claude, but they're fully capable. I've been experimenting with Qwen3.5 35b a3b at Q4 and opencode last week, and one task it did was making an MCP for a web site's search and detail listing (a local ebay'ish salesplace).
It started with me telling it to find out how the search worked. I couldn't see a json call for it, and the source html didn't have the results so it wasn't straight forward. It went at it, reading source code, finding javascript, deobfuscating it and tracing the calls and fetching the various js files. Like really going at it.
I started it before an 1hr work meeting, and it was still going on after I was done. I just let it putz since I wanted to see how it went, and about 20 minutes later it had figured it out and written a python module to get the listings. I then told it to do the same for details, and it figured that out within minutes.
Then I had it build:
I even had it test the result by building the Docker image and reading the build log, launching it in Docker and checking the Docker logs, then having it do HTTP requests to the server to see if it answered correctly. I didn't even have to instruct it hard to do it either, just something like "verify via docker that it works" and it handled the rest itself.
At one point I had a "host name invalid" type of error, don't remember exactly now, that happened when it was called inside the k8s cluster. I gave it the error message; it spun up the latest image and tried an HTTP call with a custom host header, noted the bug, traced through the MCP library until it found where a default class was created with the hostname-protection option on, and altered the MCP server code to create an object with that option turned off and pass it along when instantiating the server. It then built a new image, verified that the call with the custom host now worked, and deployed a new version.
It was a bit back and forth, with a few more mcp errors that took a bit of time to smooth out, but I only looked at the code twice during the whole thing. Once to figure out a problem it was stuck on and once to skim through it at the end to check if there was anything really stupid going on. It wasn't.
And that's with the MoE, which is less capable than the 27b. I don't know what you're doing wrong, but you're doing something wrong there, mate.
andy_potato@reddit
It's not a "you problem". OP has pointed out in detail why a model like Qwen 3.6 is a nice toy but ultimately much less capable than Opus or Sonnet.
Everything else is just “I want it to be true because local models”
rog1121@reddit
The only "real-world" success I've had with local LLMs is sorting and sentiment analysis. Essentially just a script that calls a Gemini model and asks for each item to be sorted into one of 6 categories, which tends to work fairly well given the headers and raw data.
Full-fledged agentic workflows are definitely not doable unless you run at least a 120B model
iMakeSense@reddit
I'm not sure you even need it for that. If you have enough data for your 6 email categories, couldn't you just create embeddings for those 6, cluster them, create an embedding for the new email, and if a certain confidence threshold isn't reached then use the LLM?
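Roughly something like this untested sketch, assuming you have a handful of labeled emails per category; the embedding model and threshold are arbitrary choices:

```python
# Average the embeddings of a few labeled examples per category, compare new
# emails by cosine similarity, and only fall back to the LLM below a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
examples = {  # a few labeled emails per category (placeholders)
    "billing": ["Your invoice is overdue", "Payment failed for order 1234"],
    "support": ["The app crashes when I export", "Login page shows a 500 error"],
}
centroids = {c: model.encode(texts).mean(axis=0) for c, texts in examples.items()}

def classify(email, threshold=0.45):
    v = model.encode(email)
    scores = {c: float(np.dot(v, m) / (np.linalg.norm(v) * np.linalg.norm(m)))
              for c, m in centroids.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None  # None -> ask the LLM instead
```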
yeah-ok@reddit
The key here is the phrasing: "just" might be a bit of a stretch for most people. Can you point to the practical steps needed to do this (i.e. not theory or an overview but actual terminal commands)?
groeli02@reddit
original qwen? have you tried qwopus or some other derivates?
Your_Friendly_Nerd@reddit
I'm so glad it's not just me. I've barely used any fancy agent harnesses like opencode with local models, because the few times that I did try, it was an awful experience (doesn't help that I don't have much VRAM so the models run slow as hell). That's why I've just stuck to using the chat interface in my editor, which is a step up from open-webui, since it's easier to share editor content with it, but that's about it.
Hans-Wermhatt@reddit
The people here overhype Qwen 3.6 for sure, but I don't know what to tell the people who were expecting to just flip over from Opus 4.7 4 Trillion to Qwen 27B and expect the same performance. You'd have to run GLM 5.1 for something a little closer. Qwen 3.6 27B is more like GPT 5.3 mini.
GCoderDCoder@reddit
Overhype? I'm going to sound defensive, but I genuinely think people hype Claude from lack of exposure to other models and other harnesses. The content creators who actually try different things tend to recognize Opus has great ability, but they often use other models for their own work. And nobody is saying a 30B parameter model can do everything Claude can do. People are saying most of what they need a model to do can be done with self-hosted models.
For local 3.6 what hardware are you using? What quant are you using? What harness are you using? How are you using your harness? Claude has those tuned for a certain user profile. You have to do those for local too before comparing.
People using q4 of a 30b model to code are not actually using the model that the benchmarks are made on. Models can keep agentic logic sound longer than they can maintain the same level of coding performance. So a 30b parameter model can search the internet, manage emails, etc down to q4 but I would not write code with that version.
Claude the model is different from Claude the harness. I had Opus in Cursor for work just fine, so I tried Claude for my personal stuff, and Anthropic's harness makes me hate their models because I don't just let LLMs do their own thing. I use them to fill in the boilerplate for my logic. The way I use models I can swap Claude, ChatGPT, large local models (I have the hardware) and now small local models like Qwen 3.6 too. My friend who doesn't code loves Claude Code because he doesn't care about the how. He's also not using what he builds for production.
Most people don't actually need claude and the data is showing there's a lot of people enjoying AI activity not getting real value. If value is just making a lot of docs then people are really hyped making docs no one looks at lol.
Finanzamt_Endgegner@reddit
Well idk about you, but I let qwen3.6 27b loose on llama.cpp to implement a feature it had to change like 10 files for, and it just did it. It was for testing out some new method, so don't worry, I'm not gonna spam the devs with it, but it works. I highly doubt gpt 5.3 mini would be anywhere near this level.
Nixellion@reddit
To chip in, GLM 5.1 truly is capable of replacing Opus 4.6. I am running the z.ai api version, I assume it runs unquantized, so local performance may degrade, but overall it works well across various complex large codebases.
Void-kun@reddit
What harness are you using GLM5.1 in?
In Claude Code it's significantly worse than Sonnet 4.6 nevermind Opus.
HappySl4ppyXx@reddit
How are the limits and are there a lot of rate limiting / technical issues you run into?
Kholtien@reddit
On the lite plan, I get 2-3 times what I do on Claude Pro
TheLexoPlexx@reddit
But that's the thing. Gemma 4 31B is remarkably close to the GLM models or Kimi on LMArena and across all benchmarks, and on top of that, Composer is based on Kimi and that sucks too.
Monkey_1505@reddit
No, it's not remarkably close in benchmarks.
TheLexoPlexx@reddit
Care to explain?
Monkey_1505@reddit
Well, I'm not sure numbers not being close really needs explaining, but, on the artificial intelligence benchmark aggregate (which is just an aggregate of benchmark scores), Gemma 31b is a 39. Kimi v2.6 is a 54. Opus is a 57. Kimi v2.6 is far far closer to the benchmarks of Opus than Gemma is to Kimi.
Kimi v2.6 and MiMo Pro are the absolute top models in open source rn, trillion parameter models within spitting distance of SOTA proprietary super labs.
Gemma4 31b isn't even best in class, just vaguely competitive with other small models.
IntrinsicSecurity@reddit
I’m going
XccesSv2@reddit
You're reading benchmarks wrong. "Close" at the top end still hides huge differences when it comes to the last few percent.
FaceDeer@reddit
I would think that local models like Qwen3.6 would be well suited to replacing remote LLMs for things like auto-complete, filling out a local function or writing docstrings. Not so much the large-scale system architecture stuff. I could see a framework that optimizes which tokens get sent where, using the big remote models to plan out what to do and then delegating implementation tasks to local models. Might be a best of both worlds arrangement.
Monkey_1505@reddit
Or MiMo flash or similar.
GlitteringDress912@reddit
falconandeagle@reddit
This subreddit is filled with vibe coders who think their yet-another-todo-application or basic-ass dashboard is something to brag about.
IamKyra@reddit
Hm, I'd say the opposite: if you're a good coder you know how to make Qwen 3.X do what you actually want to do. It's the vibecoders who will actually miss Claude for how much it can achieve.
Eyelbee@reddit
Yeah, the more you know what you need to do, the less you need a better model. This has been true for quite some time, honestly. But the thing is, qwen 3.6 27b is quite literally at sonnet 4.5 - GPT-5 level. 6 months ago these were the best models. Would OP say the same about sonnet 4.5 when it first came out?
Finanzamt_Endgegner@reddit
This, this, this. If you know what you're doing, it can even beat Opus 4.5 in some areas with the correct guidance.
sexy_silver_grandpa@reddit
I use local LLMs and I'm the project leader of an extremely popular open source library that you, and every enterprise company use every day.
Chupa-Skrull@reddit
Thanks for your hard work, sexy silver grandpa.
relmny@reddit
This subreddit is filled with people comparing what is most likely a >1T-parameter model to a 27B/31B model, and then complaining that the small one can't do the same.
What is clear to me is that some people don't understand the tools. And they don't know what they are for nor how to use them.
HiddenoO@reddit
The whole issue in OP's post stems from too many posts claiming the opposite, i.e., that your locally hosted small model is basically as good as frontier models.
It might not be the majority opinion, but it's common enough to mislead people into thinking they're doing something wrong, when in reality that false suggestion is typically either the result of vested interests (like the Huggingface CTO post yesterday), people not being competent enough to realise there is still a very significant gap for real work, or people simply not having complex enough use cases for that gap to show.
Just below, you have the following comment with currently 24 upvotes:
"Iterating on data" and "making an html dashboard" are generally not "hard things" for an AI, especially when the person prompting has the required data science knowledge - what's hard for an AI is dealing with large, messy, interwoven codebases that result in a large, messy context window with tons of tool calls.
relmny@reddit
It's like any opinion on the Internet, what you read is what THAT person thinks/claims.
Meaning, that if someone says "I don't need commercial models anymore, running qwen/gemma/kimi/glm/etc locally is enough!" that means exactly that. No matter how they phrase that. It's their opinion for their case.
I always use local models. So I'm not surprised, specially since the last 1-2 months with gemma-4, qwen3.5/3.6, kimi, glm etc, that more and more people are claiming that THEY can do THEIR work with local models.
And that example is by a single person that, like me, can work fine with local models.
It's about context. And understanding that what works for someone, might not work for someone else.
HiddenoO@reddit
You're acting as if that were all that's being said, but the part I referred to specifically was the "doing hard things all day long" part, and that's how these comment chains regularly go. People extrapolate their own (often very limited) experience and then effectively gaslight people like the OP into thinking they're doing something wrong when in reality they're just overstating their own experience as being generally applicable.
relmny@reddit
Again, that's your claim of what "hard things" are.
AFAIK there's no official definition for "hard things".
Maybe for the person that wrote that, those are "hard things". Maybe things that didn't work before with local models.
And the main point remains, that's the opinion of a single person.
I claim that I do everything with local models. If somebody understands that anyone can do everything with local models, that's their problem, not mine.
That's my experience. I can do "hard things" because they are... to me.
And then there is the comparison between a huge commercial models with all the infrastructure, workers, hardware, tools, etc with a 27b/31b model in a single GPU...
Anyway, I'm done with this.
HiddenoO@reddit
You're arguing about technicalities; I'm arguing about how these comments are perceived by people like OP who constantly read them.
GreenGreasyGreasels@reddit
It's the hype - "Qwen3.6-27B is as smart as a model 20x its size" - which is true but not the full story.
It's like claiming a child with 130 IQ can do the same things as an adult with 130 IQ - they might both have the same IQ numbers, but the tasks each is capable of is very different.
Syncaidius@reddit
People also forget when comparing Claude models against others, Claude is trained specifically for coding and development-related tasks. It's more specialised in this area, so it should be expected to be at least slightly better at coding than other models.
However, when it comes to doing more generalised and varying tasks, I find Claude makes way too many dumb decisions compared to models of lesser sizes and that's fine. They're specialised models, whereas the others are more generalised.
Other models are intended to be good at a bit of everything, but great at nothing. However, that will change over time as they optimise model sizes and efficiency.
The biggest issue with Claude right now is it's not able to run at its optimal level because Anthropic have been severely restricting it to counteract the shortage of available compute.
droptableadventures@reddit
Or if they're capable of setting up local AI to a degree that works well, they are more likely to have some level of programming knowledge.
So if they have to help the model get past the occasional issue it's stuck on, they don't see this as a major impediment.
alberto_467@reddit
Not necessarily for anyone who's gotten started in the last 2/3 years. There are people doing things who never really learned how to code, because they never truly needed to. They are totally lost when they try to code without a model or smart autocomplete.
They surely have more technical skills if they can set things up, and they can probably read some code, but they don't really have programming knowledge, because they never had the mental strength to disable all AI and actually learn, for many months or even years, to code by themselves.
More experienced guys have already put in the work to gain real programming knowledge; it's the newer ones, who never felt they needed to know the why and the details, that I'm worried about.
andy_potato@reddit
1000x this
SmartCustard9944@reddit
You forgot the tower defense guys
ProfessionalSpend589@reddit
We need more tower defence games!
RoomyRoots@reddit
You can easily extrapolate it to the whole Internet.
WinDrossel007@reddit
No, it's common sense
xamboozi@reddit
Wait, are you guys comparing a raw LLM against one with a fully refined harness? Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every conversation for how it can do better next time? Cause that's what Claude Code is doing.
nickl@reddit
> Is your local AI decomposing every ask to reason through it? Is it learning and self improving as you work? Is it evaluating every past conversation for how it can do better next time?
> Cause that's what Claude Code is doing.
Other than the system prompt telling it to reason through things step by step, no, Claude Code does not do these things.
The harness is important, but don't make things up.
AdOk3759@reddit
Exactly.. the harness plays a huge, huge role in output quality, even more so when we’re talking about small models. Look up little coder
balancedchaos@reddit
Local LLMs have been utterly terrible at everything I've tried with them.
eat_my_ass_n_balls@reddit
A lot of the people in here are making slop and it shows
weiyong1024@reddit
I don't think this is local-vs-frontier. I get the same tool-call loops and context bloat with Claude when the harness doesn't scope tool output; local just has less margin. Maybe the point of a strong coding agent is helping you build better local projects.
stillanoobummkay@reddit
Claude Code is an order of magnitude better than its competitors. Hands down.
So the best local model won’t compare.
I think it’s a matter of the right tool for the right job though and I admit that I get frustrated with my local models and go to Claude when needed.
Fast_Sleep7282@reddit
the trick is to use a large LLM to orchestrate smaller coding LLMs to save output tokens
hedsht@reddit
i was about to say.
in my workflow codex gpt 5.5 is the architect, qwen3.6 27b the builder and qwen3.6 35b the tester. it works very well (for web development).
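stripped down, the loop looks roughly like this (a sketch, not my exact setup; endpoints, model names and prompts are placeholders):

```python
# Architect (big remote model) writes the spec, builder (local Qwen) implements,
# tester (local Qwen) reviews; loop once more if the review finds problems.
from openai import OpenAI

architect = OpenAI(base_url="https://api.example.com/v1", api_key="KEY")  # big model
local = OpenAI(base_url="http://localhost:8000/v1", api_key="none")       # qwen endpoint

def chat(client, model, system, user):
    r = client.chat.completions.create(model=model, messages=[
        {"role": "system", "content": system}, {"role": "user", "content": user}])
    return r.choices[0].message.content

task = "Add pagination to the /orders endpoint."
spec = chat(architect, "big-architect-model", "Write a short implementation spec.", task)
code = chat(local, "qwen3.6-27b", "Implement exactly this spec. Output only code.", spec)
review = chat(local, "qwen3.6-35b", "Review the code against the spec; list concrete problems.",
              f"SPEC:\n{spec}\n\nCODE:\n{code}")
if "no problems" not in review.lower():
    code = chat(local, "qwen3.6-27b", "Fix the code per this review.", f"{code}\n\nREVIEW:\n{review}")
```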
tomdg4@reddit
How do you set up such a workflow? Trying to do the same since GitHub Copilot prices will go through the roof
Harvard_Med_USMLE267@reddit
I use the same two models as you, op. I like Gemma 4 personally. 48 gig vram setup.
Local models are fun to code with from a hobby perspective. Using your GPU to write code is very sci-fi!
However, for anything serious there’s no comparison between local models and claude code. Not even vaguely in the same league.
swaglord1k@reddit
better late than never
edsonmedina@reddit
To me it sounds like no one is wrong in this thread, they just have different expectations.
Some people use LLMs as tools to speed up/improve their coding/reasoning and do just fine with local AI.
Others expect LLMs to do the thinking and make decisions for them. Nothing wrong with that, but for those people local AI is definitely not there yet.
The latter group does have a problem though: I'm not sure these gigantic models are even economically viable (at least currently), so you might face even higher prices. The scale required to run them is simply insane and someone needs to pay the bill.
MexInAbu@reddit
This. A couple of years ago we were doing complex coding without any LLM assistance whatsoever. So having something like Qwen 3.6 is an incredible production multiplier.
Maybe I'm an old jaded man yelling at clouds, but all this talk about letting a complex model do the planning is crazy talk to me. I do most (almost all) of the planning myself and a significant part of the coding too. When I let the LLM write code autonomously I give very detailed instructions approaching pseudocode. Small LLMs are very good at speeding up my work.
Now, I do use the frontier models to help me with complex planning, solve complex problems, and find known methods and tools, though.
waescher@reddit
name checks out
floriandotorg@reddit
I mean what are your expectations? You’re comparing an LLM running on a GPU cluster in a data center with a MacBook.
And as other commenters pointed out: In the end, they are just tools. And I think you're using them wrong. A local LLM is great if you want to be able to use it offline, have total privacy, and practically no cost. If that's not your goal, use a frontier model in the cloud.
droptableadventures@reddit
If they had a MacBook with anything more than the base amount of RAM, they'd be able to run a bigger model than that!
tp_bexx@reddit
The Carlo Ancelotti reference is 10/10
PromptInjection_@reddit
You are comparing a 27B LLM to one with over 1 Trillion params?
Buy a Mac Studio cluster and try it with GLM 5.1.
andy_potato@reddit
Or just subscribe to Claude for the next 10 years for the same price.
PromptInjection_@reddit
That's an option. It's like on vacation: taxi or rental car? Both have their pros and cons. It's not just about the price.
andy_potato@reddit
The problem is your shiny Mac Studio cluster will be scrap metal in 3 years from now. Outside of very narrow use cases, investing that kind of money in a local AI rig is a huge waste of money.
PromptInjection_@reddit
Yes, of course. In three years, you'll "have to" buy a new one.
Local AI isn't a free lunch. Anyone who thought it was has fallen for AI influencers. They live off hype.
WinDrossel007@reddit
It's a matter of time before the big corpos decide to ramp up prices to bring their investments in line with the ROI they want.
Then your local LLMs will be much more useful. Until then, I would agree with you. I like working with Opus 4.6, but my company pays for it.
I don't care about tokens as an employee.
But I do care about tokens as an individual.
I bought a 5090 and I'm happy with my local models and learning how to use them. Qwen 3.6 is a pretty good one. If you provide enough specs, it does its job pretty well. Not at the level of cloud models, but then you can't depend on those.
Overall I agree with a sentiment.
finevelyn@reddit
Yeah me too, honestly I was done before I was even started. Now I'm using a local brain for coding and it's miles ahead of even cloud LLMs. Try it if you haven't.
Stitch10925@reddit
What is a "local brain"?
orion7788@reddit
lol
thejesteroftortuga@reddit
Honestly nothing beats Opus 4.7. It’s crazy. I just had it refactor the UI of a pretty complicated web-app over several hours while I slept and it got like 95% of the way there.
These smaller models are much better for narrower faux-deterministic outcomes than they are as broad scale coding agents.
dolomitt@reddit
I tried Cline with qwen3.6:27b. It does not work anywhere near as well as with OpenCode for some reason. Same llama.cpp server. It's really usable compared to previous generations. I run on a 3090.
Puzzleheaded-Try737@reddit
Totally fair. If the productivity loss is hurting, the "Local" pride isn't worth it. I’ve been building tech for years and the "Hardcore mode" of small models is only fun until you have a deadline. Since you're switching to OpenRouter, I'd suggest trying a mix:
AnomalyNexus@reddit
I mostly view it as a spectrum. Don’t want to pay 200 bucks a month for Claude opus or whatever. But also don’t want to fight against a weak model too much either. So one of the cheaper api it is - currently GLM.
Working on moving some openclaw stuff to local though. Some tasks there aren’t as sensitive to precise model
GrungeWerX@reddit
I recommend ppl just downvote this AI slop written post and keep it moving.
andy_potato@reddit
Downvotes this mindless reply
mister2d@reddit
Probably the irony is that the local model was used to assist.
AlwaysLateToThaParty@reddit
I code on an rtx 6000 pro, using qwen3.5 122b/a10b heretic mxfp4, at about 75GB, and it's solid. I've tried the smaller models and they drove me mental. This can one shot complex tasks.
The problem with OpenRouter, it seemed to me, was that different service providers were quantising their API endpoint models. I think that's unavoidable, fwiw. I'm pretty sure OpenAI and Claude do it too, but they do it in subtle ways, cuz they can. But what it meant for me was inconsistent output, and that drove me mental.
So that's why i have the gpu. Does the task, and more. Pretty epic gaming gpu too tbh.
andy_potato@reddit
Feels good until you do the math how long you could have subscribed to Claude or Codex for the $$ you paid for that RTX 6000.
Your games will still run fine on a 5060ti by the way.
YehowaH@reddit
Hope you used qwen3.6 35 a3b with an IQ4_NL/XS quant; it fits in 24 GB of memory. You get 170 t/s generation, on par with Claude. Qwen 3.6 was trained for tool calling (3.5 was not) and it has the developer role. Both work well, and check the recommended parameters for defined programming tasks, e.g. temp 0.6.
I have minor issues to none with the new models; these are a true replacement. Give it another try with the right models. I do complex scientific stuff, backend and frontend, nothing you can compare to the daily work of a dev and nothing the LLM can have been trained on, because there might be only a few examples worldwide. It runs like a charm.
andy_potato@reddit
The MOE model performs even worse than the dense model for coding.
false79@reddit
Bruh - that is not how you do it. You need a harness, Claude Code, Cline, Kilo, whatever, then you need to @ the file you want to make a part of the context.
Claude code is not a mind reader but it certainly has massive amount of context.
You can get away with so much more if you give LLM some direction, it will connect the dots with sufficient direction.
andy_potato@reddit
Did you even read the original post?
dtdisapointingresult@reddit (OP)
I was using a harness. I tried two complete ones (Claude Code aimed at vllm, and Qwen Code aimed at vllm). I also tried vanilla Pi.
traveddit@reddit
Which endpoint did you use on Claude Code? What arguments for vllm?
RemarkableGuidance44@reddit
They have no idea...
juraj336@reddit
I'm surprised it isn't able to handle this then, Ive had Qwen3.6 27B handle several things like this easily. I had it make an api, then dockerize it and then iterate through until it fixed the issues after which it worked great.
I think for these medium size models context is king. They don't know as much as a Claude or Chatgpt model but they know enough that with the right context they can reach the same result.
So for me what has worked great is adding a searxng instance for web search and having it ensure testing in a loop until it has something working.
More-Curious816@reddit
You compared a trillion+ parameters model with 27 billion and 31 billion models? Of course you will notice the disparity. Try the big open source models and come back.
andy_potato@reddit
Lots of people on this sub claim that the Qwen3.6 27b model is on par with Claude. OP therefore specifically selected this model for their comparison.
Nobody doubts that a model like GLM 5.1 can achieve performance in the same ballpark.
1dayHappy_1daySad@reddit
I do code for a living and yeah, local models are not there yet compared to the best paid models.
AvidCyclist250@reddit
Noticed that too. Especially with qwen 3.6 35b
ComfyUser48@reddit
If you're not getting good results from Qwen-3.6-27b, it's a skill issue.
Learn to use good prompts and phased coding.
FusionCow@reddit
a model running on a single consumer gpu will never compare to a model like claude. you can still save money though by using something like kimi k2.6, which is as good as claude opus but way cheaper on api
andy_potato@reddit
Of course not. But there are very vocal people on this sub who want to make you believe otherwise.
dtdisapointingresult@reddit (OP)
For sure, that's the idea. I'll keep using Claude for the work stuff (I don't pay for it), and use big cheap Chinese models for my personal projects. It gives me the best of both worlds.
cyberdork@reddit
I mean, did you get the impression that people use local LLMs for anything else than hobby projects?
Obvious_Equivalent_1@reddit
I think you might’ve also fallen for the trying-to-switch-all-at-once trap. What works best is to start with what you know, and familiarize yourself where it doesn’t hurt as much.
To give some insight, I started using Qwen 3.6 35B for, what do you think? Right, I didn’t start with full-blown dev sessions. I let Claude set up a slash command for comments and routed it through the local model. A clean 1-2K token save per session, easily verifiable in git log.
Then I started experimenting with some hooks; I forced Claude to run any Explore-type or Search-type subagent through the local Qwen 27B model.
The thing is, when you start with a small scope, it’s also easier to discover any performance issues, any caching issues, any issues with the prompting and with the thinking levels. I’ve actually run into some issues or occasional crashes, but because iterations are so small, it’s way easier to find the issue locally.
I think when people talk about the power of local models, they didn’t get to that point by going all in before they got through the initial fine-tuning stage. I think for local models the next big step will be tools that automatically adjust the models to your local hardware. For now, unfortunately, the promised potential does take some grinding through that fine-tuning.
dzhopa@reddit
As a tech VP, I'm currently operating a whole dev team on Anthropic and OpenAI credits freely available to lots of VC funded startups. Those days are rapidly coming to an end and we're burning through the credits at a ridiculous pace some days. That said, I'm frantically evaluating other ways to give my team these tools when the gravy train runs out.
They're going to get the big cheap Chinese models for work stuff and local models for their personal projects lol
XTCaddict@reddit
Nahhh it’s not, on benchmarks sure, but in real use it does still lag behind. It’s in between Opus and Sonnet imo. That being said it’s still very good. I think its thinking trajectory isn’t as dialled in as Opus. It misses more things, needs more hand-holding. Still a beast overall though; if you’re a dev it’s a great tool for the price.
RemarkableGuidance44@reddit
That's why you split up the effort... We can do 85% on Kimi K2.6 and GLM 5.1 on our servers and then use Codex for the 15%.
SmartCustard9944@reddit
One can hope that DeepSeek v4 flash gets somewhat close to an older Claude.
Crampappydime@reddit
You don't even mention hardware, you could be stupidly using 2 bit quants expecting more…
ConsciousDev24@reddit
Fair take. Local models still struggle with long-horizon reasoning, tool use, and real-world workflows like Docker. The gap vs Claude Code or API-based models is very real right now, especially for debugging and decision-making. Using locals for lightweight tasks + cloud for heavy lifting feels like the practical split.
Have you tried pairing a local model with a stricter tool-execution layer (like enforced step checks) to reduce those bad decisions?
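To make that concrete, here is a minimal sketch of the kind of enforced check I mean: a wrapper around the shell tool that logs full output to disk and only ever hands the model the tail, plus an explicit "unknown outcome" message on timeout. The function name, log path, and limits are made up for illustration; this isn't any particular harness's API.

```python
import subprocess

MAX_TAIL_LINES = 200                          # hypothetical cap on output returned to the model
LOG_PATH = "/tmp/agent-last-command.log"      # full output stays here for later tail/grep

def run_command(cmd: str, timeout: int = 3600) -> str:
    """Run a shell command; return only the exit status and the tail of its output."""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Don't let the model conclude "it failed": report the timeout as an unknown outcome.
        return (f"Command hit the {timeout}s timeout and was stopped. Treat this as an "
                f"unknown outcome, not a failure; re-check state (e.g. docker ps, image list) "
                f"before retrying.")
    output = (result.stdout or "") + (result.stderr or "")
    with open(LOG_PATH, "w") as f:
        f.write(output)                        # full log kept on disk, never in context
    tail = "\n".join(output.splitlines()[-MAX_TAIL_LINES:])
    return (f"exit code {result.returncode}; full log at {LOG_PATH}; "
            f"last {MAX_TAIL_LINES} lines:\n{tail}")
```

The point is that the cap is enforced in code, so the model never gets the chance to slurp 250K tokens of build output in the first place.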
a_beautiful_rhind@reddit
Harness issues aside, this is why I always said stuff of this size is scraps. 30b is like the bare minimum to get anything, even chat or RP. You're expecting it to have Kimi performance. Let alone a MoE with 3-10B active parameters.
People here never liked hearing that. They will blame everything else like quantization or laughably, kv cache.
oldschooldaw@reddit
I quite like reading posts like this, it is the antidote to the shit I see on Twitter constantly about people using xyz claw variant #1337 with omega-amazing-distill-opus-3b on their third Mac mini while they escape the permanent underclass. It helps really remind me the reality is actually in the middle.
andy_potato@reddit
I wish I could upvote you more than once.
Zeeplankton@reddit
I always thought twitter was better than reddit, until I got a twitter account. That place is like linkedin with toxicity turned up to the max.
pkmxtw@reddit
Just downloaded IQ1_S on ollama 🦙 running at 3 tk/s. This thing totally replaces Opus 4.7 and I'm canceling my CC sub! Big AI labs in shambles... Starting my new all-AI startup with 10 claw agents now 🚀🚀🚀. If you aren't learning about this, you are 100% left behind!!!
gameboyVino@reddit
Deleting twitter is truly the answer here
CondiMesmer@reddit
Pretty much. Even if using Claude Opus 4.7, you still need to heavily supervise the output. That's just the flow of coding with LLMs tbh
cohesive_dust@reddit
Reality sets in. I went through the same drill as you. I'll try again in a year.
Techngro@reddit
Off topic: This is exactly how I feel about Linux.
mister2d@reddit
So...skill issue on both accounts?
Techngro@reddit
Yup. The same skill issue that makes people return to Windows over and over and keeps desktop Linux at a barely noticeable market share. 🤷♂️
mister2d@reddit
Ah. You would have had me if you said users returning to Mac.
Returning to "Windows" explains the core issue.
Techngro@reddit
I was just speaking from my personal experience. But whether it's Mac or Windows, we both know the reason people return after trying (and I really gave it a fair shot this time) Linux is the same.
letsbefrds@reddit
I'm flipping and flopping between Mac (I use a Mac at work and I have an Air); my home desktop is dual-boot Linux/Windows.
They all have their strengths. I'm really trying to use Linux (Kubuntu) and for 90% of the stuff it's fine. But when it's bad it's really bad. For example, just organizing my photos and deleting images off my memory card is super slow. Works just fine on Windows and Mac. That 10% is pushing me towards full macOS.
Mac: I can't get used to the weird cmd-c cmd-v. My god, the screen splitting is awful.
Windows: I use it to game, that's about it. Copilot is dog shit and the ads in Windows are really killing the vibe lol
Techngro@reddit
Instead of going back to Windows 11 Pro, I went with Windows 11 IoT LTSC. It's already debloated by Microsoft. No AI, no Copilot, no ads, etc. I am very happy with it. You should give it a look.
mister2d@reddit
I didn't miss on any points. Go and enjoy Windows.
FUS3N@reddit
I feel like it's kind of disingenuous when you put it like that. Yes, Windows has issues, but desktop Linux isn't there yet either. I have it dual-booted, but for most people it's not the same experience when you always look at it from the view of someone who knows stuff here and there.
dtdisapointingresult@reddit (OP)
idk if I agree with that. Linux is predictable. It's the same stuff working predictably every time.
Just stick to Ubuntu LTS instead of meme or rolling distros, don't install random drivers.
iMakeSense@reddit
Pick one of the 3 top wifi cards on Amazon, I promise you only 2 of them will work and one of them will probably require some weird ass drivers.
kyr0x0@reddit
Calling the Pipewire mess predictable is kind of a stretch ;) Audio under Ubuntu is highly unpredictable. Sometimes it works, sometimes it doesn't. It was more stable with ALSA in the early 2000s...
simracerman@reddit
Funny I had a conversation with a work colleague about this today. I concluded that I’m still burnt from last year’s experience.
PavelPivovarov@reddit
Local models are usable but also require a frugal approach to context.
Claude Code's system prompt alone is 10k tokens; add a few MCP servers and you are approaching 30k context without even asking any questions, and this is where local models start degrading...
I'm currently switching to Pi, paired with RTK and Caveman for better context density, plus replacing MCPs with CLI + Skills, and it works wonders.
I had a pretty good coding session with that Pi setup and Qwen3.6-27b-IQ4XS with 32k@Q8_0 context (the maximum I can fit in VRAM) and it was a really decent coding companion.
Yes, it's not GPT-5 level, but that wasn't my expectation anyway; the model never did anything unreasonable and the generated code was also solid most of the time.
a_beautiful_rhind@reddit
People underestimate how much stuff like claude code is tuned to cloud models. It's really slim pickings on the harness front.
I only had luck with roo so maybe I will try this new Pi thing. Otherwise it's "COMPACTING" city and the model can't really get anywhere.
zipperlein@reddit
I think this is a perspective problem, not a problem of the actual models. It depends a lot on how hands-off an approach you take, imo. I like to know exactly what is in my codebase. Even if the LLM does not make good changes, the direction is fine most of the time. Then I just do some manual tweaking and let it continue. It's a wayyy smaller model, even if it is good for its size.
DeltaSqueezer@reddit
3 minutes after giving the prompt:
Qwen 9B.
CheatCodesOfLife@reddit
Upvoted for EchoTTS. That's pretty good for a 9b! Which harness?
Stitch10925@reddit
What agent tooling did you use?
DeltaSqueezer@reddit
I wrote my own. I just started with a simple loop and added tools. After a week, I stopped using Claude Code and replaced it with my own agent and most of the agent was developed by itself.
After adding many tools, I found it was better to scale back and limit it to just four: Read, Write, Edit, Bash. I also have Grep and Glob so I can disable Bash to limit risk, but technically, you could just have Bash as the universal tool.
I also have no default system prompt so full context is available to the agent.
I reduced API usage massively. Now 70% of work is done with Local Qwen and 30% with GLM-5.1 when more context/intelligence is required.
https://www.reddit.com/r/LocalLLaMA/comments/1sq7cie/warning_do_not_write_your_own_ai_agent_if_you/
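For anyone curious, a stripped-down sketch of that "simple loop plus tools" pattern (not my actual code; the tool set, model name, and endpoint are just for illustration, using the OpenAI-compatible chat API against a local server):

```python
import json, subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # local vLLM/llama.cpp server

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return "ok"

def bash(command):
    r = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=300)
    return (r.stdout + r.stderr)[-4000:]          # keep the context frugal

TOOLS = {"read_file": read_file, "write_file": write_file, "bash": bash}
SCHEMAS = [
    {"type": "function", "function": {"name": "read_file", "parameters": {
        "type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}}},
    {"type": "function", "function": {"name": "write_file", "parameters": {
        "type": "object", "properties": {"path": {"type": "string"},
        "content": {"type": "string"}}, "required": ["path", "content"]}}},
    {"type": "function", "function": {"name": "bash", "parameters": {
        "type": "object", "properties": {"command": {"type": "string"}}, "required": ["command"]}}},
]

def agent(task):
    messages = [{"role": "user", "content": task}]   # no default system prompt
    while True:
        reply = client.chat.completions.create(
            model="qwen-local", messages=messages, tools=SCHEMAS).choices[0].message
        messages.append(reply)
        if not reply.tool_calls:
            return reply.content                      # model is done, hand back its answer
        for call in reply.tool_calls:                 # execute each requested tool
            args = json.loads(call.function.arguments)
            result = TOOLS[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
```

Everything else (permissions, output truncation, subagents) is layering on top of this loop.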
Stitch10925@reddit
That's pretty cool. What coding language?
I've been thinking of doing the same thing because current tools are not very fond of C#.
DeltaSqueezer@reddit
I wrote it in Python.
datbackup@reddit
Even though I lean towards agreeing with you that local isn’t able to compete with the big centralized providers, I immediately became skeptical when your long post didn’t mention the actual harnesses you used by name. I see in another comment you mentioned using Claude Code, Qwen Code, and Pi.
The fact that you didn’t mention this in your original post but you did mention several models by name, tells me that you are misunderstanding the importance of the specific harness you choose.
I agree that there are way too many posts on X that hype up agents or AI in general and ESPECIALLY make it sound like the poster spent way less time on their hyped outcome than they actually did. Basically there is a scammy situation happening whether organically or intentionally where people are incentivized to make it sound like something “just worked” because then, when others read it and can’t reproduce the outcome (without ridiculous amounts of time and effort) it positions the poster to get more esteem, followers, job offers etc.
The takeaway is just that you should expect vastly different outcomes with different harnesses even when using the same model. Of course there is also the “skill issue” but I want to suggest to you that some portion of the “mind reading” you refer to is down to the agent’s system prompt(s) and the way it engineers context.
Hermes agent for example has the same problem you mention, where it starts a long-running process with no regard for how long it might take, then times out and has to start over. However, by default it’s very good about the behavior you described, using the tail of a log file or command output to determine the state of something.
So if you aren’t totally giving up yet i encourage you to try a “breadth over depth” approach to using harnesses where you try the same task in each and note what their strengths are.
I think there are huge unlocks still to be made in harness design, which will make the already released local models that much more viable compared to big providers.
TheQuantumFriend@reddit
What is your setup? I am running coder-latest with opencode. I would trade time for quality, maybe with deterministic harnesses. However reddit is a bit polluted with so much crap that I am a bit lost atm.
datbackup@reddit
I just set this up last night
https://www.reddit.com/r/LocalLLaMA/s/khiJXifoAV
It’s about as close to sota as one can reasonably get on “consumer” grade hardware imo
Hermes agent, pi, opencode
mrdevlar@reddit
Honest question: What do we mean when we're talking about an AI coding harness? Is this what we mean by OpenCode or Cline or RooCode or is this a more nuanced set of features that are used as part of a coding process?
Lucky-Necessary-8382@reddit
Probably good prompts in .md files
mrdevlar@reddit
Could you elaborate on what you mean on that?
Lucky-Necessary-8382@reddit
A "harness" is the software layer you build around a model — the infrastructure that turns raw intelligence into a useful, autonomous work engine. The model provides the reasoning; the harness makes it actually do things reliably, repeatedly, and without you babysitting it.
What a Harness Actually Contains
A harness typically wraps the model with:
- tool definitions and an execution loop: the model emits a tool call (e.g. `run_python(code)`), the harness executes it in the real world, and feeds the results back into context
mrdevlar@reddit
Thanks. Not sure who downvoted you for helping me, but I appreciate your effort.
watchmanstower@reddit
A harness is both what you are running the agent through (the software) and also what you are surrounding the agent with for him to be successful at whatever you’re wanting him to do (e.g. all the necessary docs)
droptableadventures@reddit
Also it's some very interesting timing given that Github Copilot just announced a switch to usage based billing, and massively increased the cost for higher end models.
And it's resulting in a lot of people who previously dismissed local AI models suddenly being quite interested in them...
eLKosmonaut@reddit
Pro+ is $40 and still has Opus. The multipliers drop on April 30th. Your post isn't entirely accurate.
droptableadventures@reddit
It does now, under the new one you can't sign up to.
eLKosmonaut@reddit
How would you use something you can't even sign up for? Like I said and you just confirmed, not entirely accurate.
cniinc@reddit
I disagree. OP posted how they were making harnesses and parameters for their relatively simple task of taking a GitHub repo and making a container.
If anyone can point to a working set of model and harness, I'd be very open to hearing about it. If we just can't do anything close to Opus, let's just be honest about that. If we can achieve Opus-level gains with a set of well-defined harnesses, let's be honest about that.
So, what are harnesses that work for coding? I've yet to see someone replicate the productivity gain from cloud models, using a local model.
PaMRxR@reddit
Local models require a significant time investment to learn a lot of details of how things work and how to efficiently make use of the hardware and model capabilities. Without some curiosity driving you into this, people like the OP will fail: people that just want to use something and don't really care about the details.
TheTerrasque@reddit
He also didn't mention how he's running the models, which can have dramatic differences in result.
mumblerit@reddit
2 bit in ollama
droptableadventures@reddit
And it's probably failed to detect all his GPUs, so is running on the CPU.
And that thing it does where it doesn't error when you run out of context, but just ignores the first bit of the prompt.
With the context length set to the default of 4096.
datbackup@reddit
good point.
AdOk3759@reddit
Exactly, look up little-coder
Altruistic_Night_327@reddit
The context bloat issue you described — 250K tokens from docker build output — is actually the core problem I was trying to solve when I built my tool.
The reason agentic apps blow up the context window is they have no architectural understanding of the project. They either dump everything or dump nothing useful. So when a long-running command finishes, they have no frame of reference and spiral.
What I built instead is a RAG layer that parses the codebase with Tree-sitter into a typed graph locally. Every agent query pulls ~5K tokens of relevant nodes — functions, dependencies, the specific files in scope — not the whole project, not terminal dumps.
For your Docker example specifically: the agent knows which files matter for that build because the graph tells it. It's not guessing from context.
The tool is called Atlarix. Works with Ollama and LM Studio natively, free for local model users. Still early (31 users, being honest), but the context problem you described is the exact thing it's built around.
Not saying it fixes everything — small models still have reasoning limits. But the 250K token death spiral is an architecture problem, not a model problem.
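To give a flavor of the idea (this is a toy, not Atlarix's actual pipeline; the real thing uses Tree-sitter and a typed graph, the sketch below uses Python's built-in ast module as a stand-in, and all names are made up):

```python
import ast, pathlib

def build_code_map(root: str) -> dict[str, list[tuple[str, int, int]]]:
    """Map each .py file to its top-level function/class defs with line ranges."""
    code_map = {}
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        defs = [(node.name, node.lineno, node.end_lineno)
                for node in tree.body
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
        code_map[str(path)] = defs
    return code_map

def slice_for_query(code_map, query: str, budget_chars: int = 20_000) -> str:
    """Return only the source of definitions whose name matches the query, up to a budget."""
    chunks = []
    for path, defs in code_map.items():
        lines = pathlib.Path(path).read_text().splitlines()
        for name, start, end in defs:
            if query.lower() in name.lower():
                chunks.append(f"# {path}:{start}\n" + "\n".join(lines[start - 1:end]))
    out = "\n\n".join(chunks)
    return out[:budget_chars]   # the model sees a few KB of relevant code, not the repo
```

A real graph also tracks imports and call edges, but this is the basic move: give the agent a map, hand it only the relevant slice.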
patricious@reddit
OP, you have mentioned all sorts of things but failed to give us the most crucial piece of information. What does your setup look like exactly? Hardware, model flags, TUI, harnesses, MCP servers?
The whole point, at least in my experience, when running local models is the supporting tech stack you build around it. My current setup feels far superior to what Anti-Gravity, Claude Code, Codex and others have to offer.
For me it looks like this: RTX 5090, Qwen3.6 35B/27B with TurboQuant (I use them both interchangeably), --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --frequency-penalty 0.0 --repeat-penalty 1.0 (a per-request version of these is sketched at the end of this comment).
Coding stack: OpenCode TUI, oh-my-opencode harness, MCPs: context7, grep_app, pdf-mcp, sequential-thinking, serena, stitch, websearch.
I have oh-my-opencode use Qwen3.6 as the builder and general orchestrator, and all other sub-agents use DeepSeek V4 Pro and Fast from my OpenCode Go subscription.
This setup works wonders for me.
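For reference, here is roughly how those sampler settings map onto a per-request call against an OpenAI-compatible local endpoint (llama.cpp / vLLM). The endpoint and model name are placeholders, and the exact knob names vary by server (vLLM, for example, calls it repetition_penalty), so treat this as a sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # placeholder endpoint

resp = client.chat.completions.create(
    model="qwen3.6-35b",               # whatever name your server exposes
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    extra_body={                        # sampler knobs outside the OpenAI spec
        "top_k": 20,
        "min_p": 0.0,
        "repeat_penalty": 1.0,
    },
)
print(resp.choices[0].message.content)
```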
MDSExpro@reddit
This sub creates unrealistic expectations that do not match reality. I have spent the last 4 months setting up local coding via LLMs and I arrived at a setup that works, but it's vastly different from the image pushed by hypers:
The first realistic productivity barrier was crossed at 128GB of VRAM (4x R9700): Qwen3.5-122B-A10B quantized to INT4 was able to generate a lot of good code, but failed on long-range coding. When I gave it a technical spec, it was stuck at 90% correct implementation, but was unable to reach 100%. Anything smaller was pure frustration.
Bumping VRAM up to 256GB (8x R9700) allowed me to switch to an FP8 quantization of the same model and the difference was night and day: it reached 100% correctness and really moved on to the next tasks.
llama.cpp is a trap; for coding you need vLLM if you want any reasonable speed.
Long story short: it can be done, but it costs way more than this sub thinks.
leinadsey@reddit
So you’re saying your computer at home isn’t a massive data center with 256 TB of RAM?
Positive_Example_478@reddit
Performance-wise other providers' models are better, but the Claude/Gemini code output and prompt understanding has become a total piece of shit. It's felt that way for a week and I'm so fucking angry about the quality degradation. Even though my prompt was clear, and the same as it used to be, even after saying clearly what to do step by step, it can't fucking follow it. omg I am so damn frustrated 😡
Unlikely-Loan-4175@reddit
This is very refreshing. I guess local LLMs might get there by the end of 2026 given the fast progress. But they are not there yet, even for a 5090 GPU (what I have).
And even if they do get to current frontier model performance, the frontier models will have moved on again, increasing our expectations.
Potential-Leg-639@reddit
Give it a try with Opencode, Linux, the latest llama.cpp and Qwen3.6-35B (use the Q4 quant recommended by Unsloth - no other one! Think it‘s the XL, check their guide). No issues at all with tool calls on my side (Strix Halo with Fedora 43).
tibor1234567895@reddit
I heard that pi (pi.dev) could be a better harness for local models. Haven't tested or benchmarked it though.
Mochila-Mochila@reddit
Why would you compare Opus to a 27B model?!
And why would you equate local LLMs with, again, a 27B model?
If you were serious in your comparison, you'd have something like 2 * DGX Spark...
swingbear@reddit
Try a different harness mate, I tried to run CC through everything local and had a bad impression of models even up to minimax 2.7. Started using Hermes and a few others, speed increased and way more mileage in terms of intelligence.
unspecified_person11@reddit
These smaller models can be good for small tasks or as a subagent but yeah a full drop in replacement for Claude would require bigger models. The only way to get decent output from these small local models is to spoon-feed them very specific and detailed instructions or have a SOTA model keep them on track.
Jungle_Llama@reddit
I disagree. I have had frontier cloud models mess up simple stuff and local do a good job. Local has its limits with complexity, say a Caddy/Authelia integration in an environment with a ton of technical debt, but the issue to my mind is the tooling, especially with coding agents etc; they just aren't fully mature yet. A hybrid approach works really well.
Jungle_Llama@reddit
ha ha ha. no sooner had I typed this than DeepSeek V4 Pro (cloud obviously) completely borked my opencode fixing an mcp bug. Now recovering it with my local Qwen 3.6 35b. Hot shot models getting over their skis is real too.
NoShoulder69@reddit
Same here. It's not practical
ttkciar@reddit
Yah, unfortunately mid-sized codegen models just aren't there, yet. They've gotten a lot better, but the ones worth using are still in the 120B-size class.
With a lot of extra work, Gemma-4-31B-it gets close-ish to GLM-4.5-Air for codegen, but not close enough to make the extra work worthwhile.
Qwen3.6-27B similarly falls short, and that's only if it doesn't overthink (which it still does, way too frequently; why tf didn't the Qwen team fix that with 3.6? It was a well-known problem with 3.5).
TheAncientOnce@reddit
What's your experience with the 120b-class models? The benchmarks seem to show that 3.6 27b outperforms or matches the performance of the 3.5 120b.
ttkciar@reddit
My experience:
GLM-4.5-Air: Best at instruction-following, which makes it my top pick. I tend to drive codegen with large specifications full of instructions, and Air consistently follows every single instruction in the specification. Unfortunately it is much more prone to writing bugs than other models in this size class, but these tend to be low-level bugs, easily fixed, and not design flaws. It's "only" a 106B, but it's competent like a 120B.
Qwen3.5-122B-A10B: Runner-up. It's not bad, but would randomly ignore some instructions in my specification. It writes fewer bugs than Air, but is more likely to introduce design flaws (like using a temporary file, always the same pathname, non-atomically, in a multi-process application) or leaving some functions empty except for a "In production, this would .." comment.
GPT-OSS-120B: Great at tool-calling, okay at instruction-following (though noticeably worse than Qwen), but hallucinates up a storm. I wasn't able to get a good sense of whether it writes bugs or design flaws or not, because I couldn't get past the hallucinated libraries and APIs. How do I debug calls to a library which doesn't exist?
Devstral 2 Large: Very good at not writing bugs, and good world knowledge, but the absolute worst at instruction-following. It would ignore most of the instructions in my specification and write something only vaguely like what I asked for. I had high expectations, since it is after all a 123B dense model, but was hugely disappointed.
I have a hypothesis that Devstral 2 Large was deliberately under-trained, to "leave room" for further training on individual MistralAI customers' repos without overcooking, but don't know.
None of them are perfect, but I find the flaws of GLM-4.5-Air easiest to tolerate. Fixing little bugs is fine, and Gemma-4-31B-it actually finds most of Air's bugs, so that's easy. Ignoring parts of the specification is intolerable. Design flaws that require more than a one-line fixup are a pain in the ass. Hallucinating libraries is especially grievous, because I have to throw everything out and start over, but be sure to describe the libraries it should be using before continuing.
I used all of these models at Q4_K_M, and I know some people will point at that and say "there's your problem!" but frankly I can't tell any difference at Q6_K_M. Did not quantize K/V caches at all.
dtdisapointingresult@reddit (OP)
I can try one of those as my final attempt. Which one do you think would do best at my Docker prompt I shared here? https://reddit.com/r/LocalLLaMA/comments/1sxqa2c/im_done_with_using_local_llms_for_coding/oiowcxe/
I'm surprised someone is saying GLM-4.5-Air still holds up, and putting it ahead of recent models.
Bird476Shed@reddit
I agree with OP "the flaws of GLM-4.5-Air easiest to tolerate."
Overall, this model is still a reliable worker and a good speed/quality trade-off.
ttkciar@reddit
I have no confidence that GLM-4.5-Air's tool-calling prowess is up to the task of doing it interactively, else I would recommend it. Its tool-calling competence is quite weak, and I have never tried giving it instructions quite that vague and open-ended before.
Your prompt is better suited to a model of GLM-5.1's caliber. I'm having a hard time imagining any of those 120B doing it well, but it might line up with GPT-OSS-120B's strengths. Maybe give that a shot.
If I were to rewrite your prompt for Air, it would include a lot more information (how the app is supposed to work, specific filename for the dockerization documentation, etc) and a lot more instructions for how it should go about compiling the misbehaving wheel. I just have no faith it would figure those things out on its own.
It's a bit surprising to me too, frankly. I keep trying the hot new models, thinking "surely this one will knock Air off its perch", but they just don't, and I keep using Air.
Maybe Qwen3.6-122B-A10B will be "the one"? Or if Google ever releases that 120B MoE Gemma4 they beta-tested, that would probably do it (assuming they fix Gemma4's tool-calling woes).
At this rate, though, it's probably going to be a new Air model based on some version of GLM-5.x (assuming ZAI can repeat the feat).
Karyo_Ten@reddit
My very first agent was GLM-4.5-Air. But when I switched to OpenCode it kept failing tool calls - https://github.com/anomalyco/opencode/issues/1880
Besides, 131K context is just too small when you graduate from small CLI tools.
ttkciar@reddit
You're not wrong. Of all of the models I tried, GLM-4.5-Air has the weakest tool-calling competence, but I work around that by not requiring it to use many tools.
Air's 128K (128 * 1024) context limit is one of the reasons I tried so hard to make Gemma-4-31B-it work. Not only does Gemma4 have a 256K limit, but it also infers a lot fewer tokens in its reasoning phase, so more of that 256K is useful. I'm still hoping to figure something out, but for now have stopped trying to use it for codegen.
What I would really like is if ZAI released a new Air model based on GLM-5.x! Hopefully with 256K context.
dtdisapointingresult@reddit (OP)
I gave up on Gemma 4 31B early on.
It wrote the Dockerfile and now needed to build it. And I was staring at its output, coming slowly at 12 tok/sec, 3 minutes of reasoning while it tries to decide if it should check if docker is installed or not, and whether to build it via the Dockerfile or the docker-compose.yml (which also builds). I exhaled and switched back to Qwen 27B.
This was an AWQ, but I doubt the FP8 would've been much better.
I really think Terminal tasks are just harder on LLMs than coding. Coding is still just dead text output. Interacting with a running system via tool calls might be a whole other level. 27B gets 35% on TerminalBench-Hard, Sonnet 55%.
dzhopa@reddit
So yes, terminal tasks, or any multi-step chain of tool calls, is where your smaller quants fall flat. Minor hallucinations creep into the syntax and the state passing between calls as the context grows large.
Code output is a lot easier because it's writing 1 file at a time, and maybe verifying syntax. You get to execute it later and fix that typo or hallucinated bug in a whole separate call. For terminal work it's passing a precisely formatted string of commands along with terminal output into the specific structural format needed for the LLM and harness to process the tool and then string the commands, syntax, context and structure together between potentially dozens of simultaneous calls needed to complete the task.
That's the real big problem Anthropic has spent a lot of time and money to get right, and it shows when you just ask Claude to "download this package from github and spin it up for my users as a docker container". Those Claude calls are stupid expensive for terminal tasks though.
rothbard_anarchist@reddit
Terminal is definitely harder for any language model. Even on Codex 5.5, it boggles my mind to watch it sometimes ponder for three minutes straight how it should open a CSV file.
PANIC_EXCEPTION@reddit
Qwen3-Coder-Next is still definitely the speed king on local as it is substantially faster than 27B and approaches Sonnet level, which is good enough for a lot of tasks. Tell Opus to make a master plan for a feature, and then use a lightweight local model to implement it using that plan. I find that this is actually quite usable.
Unfortunately the barrier to entry for an 80B model is either having multiple GPUs or having a laptop with at least 64 GB of unified memory. So, inaccessible to a lot of people. If they can juice up Qwen3-Coder-Next to be like a version called Qwen3.6-Coder-80B-A3B, I think it might be able to stand entirely on its own.
27B gets relegated to very specific one-shot questions or very strong image understanding (e.g. translating text from a schematic). Or generating small scripts in isolation. I would never have it run an agent because of just how slow it is.
IWasNotMeISwear@reddit
The core members left the company I think
ChatWithNora@reddit
The decision making gap is the real issue. I can handle slower speeds but when the model keeps going off the rails on basic tool calls, you end up spending more time course correcting than you ever saved on API costs.
LienniTa@reddit
What harness do you use? You cannot use a small model with a big harness, only with a small harness like kon, late, or little-coder. You cannot expect persistent gateways to work with Qwen 27B.
Zeeplankton@reddit
I feel like a lot of people here are kind of evangelical about running locally; which is unfair. The bigger point is retaining open models. The reality is people have to work, make money and stay competitive, and therefore use the best tools available. I think OpenRouter is a perfectly fine compromise.
StatisticianOdd6974@reddit
I think that you should also mention WHAT you code; the definition of coding is very broad. So I feel that my coding work (pipelines and IaC in Terraform & Helm for K8s manifests) kinda works with qwen3.6, but Opus 4.7 is faster and better. 5090 with qwen3.6 q5_K_M. But I do enjoy hermes agent & qwen as a personal assistant. (non-coding)
Dapper_Chance_2484@reddit
Personal assistant? any details if u can share
StatisticianOdd6974@reddit
I use hermes agent with qwen as a personal assistant. So talking on my phone using telegram with hermes, it transcribes messages and does web searches, reads my personal knowledge base, calls mcp. Just a kind of tamagotchi..
BestSeaworthiness283@reddit
I hate it too tbh. I have a poor RTX 4060 with 8GB of VRAM and I can't code with it due to the context window limitations. I mostly wanted to use local models for when the inevitable happens and I get hit with the usage limit on Claude.
My first approach to this was asking Claude to make a skill that delegated things like code generation to a local agent with just the right context, meaning the code line by line and what to modify. And it made itself a tool that it ran every time, which saved me a lot of usage.
With that approach it was still possible to run into usage limits. So I made a lightweight CLI tool for local LLMs with just 8k-token context windows. It works by initialising the codebase with a map of the whole codebase. Then when you query or ask to create or delete something, it orchestrates a plan and tasks to solve the problem. It then does an LLM request for every task, and each task gets no more than 8k tokens of context. The system now has memory, you need to approve the changes, and it uses a type of diff.
I will let you check it out if you want: github.com/razvanneculai/litecode
Gesha24@reddit
I have a mixed feeling about local LLM. I have decided to take one of my side projects and write it exclusively with local LLMs so that I can learn how they work.
Yes, you have to be very specific with them. Opus will make a decent web design just from a prompt. Qwen will absolutely suck. But if you open up Paint and give Qwen a mock-up, or work with the Claude Code skills plugin and spend 10 minutes designing the web site, it will actually code a 100% usable and decent-looking result.
Same goes for a database: if you tell Opus/Sonnet to migrate from sqlite to SQLAlchemy, it will prompt you whether you want to update your database calls to the new structure. Qwen will just wrap them in sql_text() and keep them.
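Roughly the difference I mean (illustrative only, assuming SQLAlchemy 2.0 and an existing users table; the file and model names are made up):

```python
from sqlalchemy import create_engine, text, select
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

engine = create_engine("sqlite:///app.db")

# The lazy "migration": same raw SQL as before, just wrapped in text()
with engine.connect() as conn:
    rows = conn.execute(text("SELECT id, name FROM users WHERE active = 1")).all()

# An actual migration to the ORM: map the table, query through the model
class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    active: Mapped[bool]

with Session(engine) as session:
    users = session.scalars(select(User).where(User.active)).all()
```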
Lots of examples like that. However, I am not sure if that's a bad thing. The issues with Qwen and Opus are the same: code sprawl, duplicate functions or features everywhere; basically both create an unsupportable mess if you just let them go free. Having a worse local model forced me to be more involved with the code, to look into proposals in more detail, to insist that the LLM reuses code, and the results are actually quite decent.
If anything, I am bringing what I have learned from local LLM coding to Opus/Sonnet and I am getting better results. And yes, I can't run 10 LLM sessions in parallel and have them vibe code the application. But I also actually know how the application works, so I can fix/troubleshoot it myself as needed, unlike totally vibe-coded stuff.
Lost_Promotion_3395@reddit
the 'productivity tax' on local models is so real, I'm tired of babysitting Qwen just to stop it from eating 200k tokens of Docker logs
Low-Opening25@reddit
skill issues
New_Slice_1580@reddit
“I used Qwen 27B and Gemma 4 31B, these are considered the best local models under the multi-hundred LLMs.”
False
If you had more VRAM you could use much larger models.
Why do you think the commercial models charge so much? Because they are running large models that need a lot of VRAM.
m31317015@reddit
I find people expecting local models to rival cloud models a funny concept. The whole level of compute is totally different, and even the cloud models give us shit sometimes, so there's no way people should expect something that comes out of local models to beat them, right?
Wrong: everyone not knowing shit came in and thought they're godlike, that this is their chance to rise up. But in reality, hallucination is still a serious issue, the context window just isn't sufficient in large projects, let alone the self-doubt and bugs that worsen with lower and lower quants. I'm glad that OP is finding it not suitable for the use case and realizing that API calls are just much simpler for the task, but I have to say they were never made for that use case in the first place, at least in production. Not saying OP's expecting it, just saying there's lots of folks dreaming about it and not checking the facts here.
You don't learn AI or LLMs by hosting them locally; there isn't much basic AI knowledge involved. The majority of the knowledge is around infrastructure and the stuff you normally see in office or data center environments. Speed and costs are crucial factors that come into play; people have to realize a couple of 3090s aren't gonna beat RTX PRO 6000s, and the same goes for those against GB200s.
I personally find those who're sticking with MI50s and P40s fascinating, as they're the ones always pushing past their limits despite harsh architectural problems and just the lack of power. They manage to find ways for local models to work with their workflow. Maybe they don't rely on LLMs that much at all. Yeah, that should be the norm; nobody should expect a one-click finished job from AI. If that exists, it means agents are doing the job, not us humans.
Sorry, I got too far away from the point I'm trying to make: people who personally invest in machines and infrastructure for local LLMs aren't doing it for the job, they're doing it for the hobby, for the what-ifs, for the "just because I can", and the "how far could it get, how far could I get". Learning the latest technology is one thing, implementing your own solution is another.
TL;DR: Just because you know electrical engineering and how to design PCBs doesn't mean you make your own PCBs from scratch. Sometimes it's not cost-effective to do so, but more importantly it's because there are solutions already convenient enough that we don't need to, unless you have reasons to do so.
Check your motives, guys.
TheCaffeinatedPickle@reddit
The best advice I have is understanding what “agent-sized tasks” are for each model. There is no context size that's going to fix this. Then, the smaller the model, the more specific you need to be. For example, with the NUXT UI skill loaded, I asked Gemma 4 E3B to add a password visibility icon to the right side. It tried to do it with CSS positioning when there is a Vue slot for it. The issue here is that the skill itself isn’t specific about inputs and its possible slots. After providing the docs around this it was able to do it. However the time it took with the failing code was longer than doing it myself. Another one was to center the footer in the UI; there is a slot for that, but even with pasting the documentation, there is no center slot, rather the default slot is the center. I had to switch to the 27B Gemma 4 to catch that while it's thinking.
I also struggle to keep the smaller model working; it keeps ending its reasoning assuming it's done even though it clearly isn't. With Pi Agent there are babysitter and continuation plugins; none work as expected. If the task is too large it just can’t finish it without you having to remind it. For example, I can’t ask it to write a test for the feature, implement the feature, run the test and fix any failing tests. It will just work on the feature without a test first, and run it on the next pass. So I’d have to break the agent task down into smaller, more well-defined chunks. Then it’s done like 3-4 hours later, when I could use DeepSeek v4 fast and in 3 minutes it’s done and only $0.30 spent.
ComplexType568@reddit
While I do resonate with a lot about what you're saying, the nice thing about local LLMs is that they're LLMs at heart. Give it like... 6 months... and the current 27B-35B class will probably be as "smart" as Sonnet 4 or even 4.5 in actual use. Just hoping that they'd be public.
ProfessionalSpend589@reddit
People who say those models are mostly hype tend to be downvoted here.
I personally run (slowly) Qwen 3.5 397B for experimenting and a faster Gemma 4 26 A4B for chat.
--marcel--@reddit
hard to compete with cloud solutions at the moment - those that are happy with local LLM either have hundreds of thousands of bucks invested in hardware or never really used LLMs for anything critical production-wise.
tired514@reddit
I too have been somewhat unimpressed by qwen3.6-27B (harnessed by opencode). I spent the last week comparing it to qwen3.5-122B-A10B and the latter destroyed it easily (mainly C++).
Both are outperformed by frontier models, obviously, but while 3.6-27B is a step up from 3.5-27B, neither are really that useful for large, complex codebases (ie. >5000 lines). 122B-A10B is better. I'm very interested to see how 3.6-122B-A10B does.
Syzygy___@reddit
I tried Gemma4:26b with Openclaw and it's useless.
When I realized I could connect Openclaw with my OpenAI subscription (codex), all of a sudden it did a proper onboarding that I didn't even know existed before. So much better.
Thick-Succotash-795@reddit
Personally, I had massive issues using GitHub Copilot to code with small local models. When switching to opencode, which I personally liked less in the beginning, the situation changed: I find the small local models (currently I’m using mostly gemma4-26b) extremely useful.
But I also changed my behavior: I started coding more myself again, use AI more for small sections / explanations, and stopped asking agents to implement hundreds of lines of code.
RoughElephant5919@reddit
Same. I mainly use local LLMs for data extraction, but that’s it. I’m so sick of the narrative online that “I ended my Claude subscription because I just got a local LLM instead, now look at all the cool stuff I can do!” Yet they don’t disclose what they’re using it for, or their local machine specs. It’s falsely advertised for sure. Can’t tell you how many friends ask me why their computer almost blew up after trying to locally install Qwen 7b on 4gb of vram. These influencers have people trying to load a brick onto a paper plate 😂
cutebluedragongirl@reddit
Yeah, it's not ready yet.
Upstairs-Extension-9@reddit
Like do people really need one LLM to do it all? I like the combination of a big paid model or two, mostly Gemini 3.1 and Opus 4.6 for me. I then have 2 local instances, Qwen 3.5 and now Gemma 4, which is great for me. They run on a Mac Mini and have significantly reduced my API costs.
Koalateka@reddit
What quants are you using? Q8? Q4? Do you quantize the KV cache? I have noticed quantization impacts those "small" models a lot.
PaMRxR@reddit
Qwen3.6-27B Q8 and KV-cache BF16 is working very well for me with the pi-coding-agent on 2x3090 GPUs. But even with 1x3090 before I've had pretty good success with: Qwen3.5-27B, and before Devstral-Small-2 24B and Qwen3-Coder-Next, before Qwen3-32B, and so on.
Maybe I just haven't been spoiled by the cloud models? I've only ever tried Kimi (2.5 I think) with a 1 week free trial. My local models occasionally stumble due to lack of some obscure knowledge, but pasting some doc into the context is really not that hard.
alphatrad@reddit
Skill issue
dtdisapointingresult@reddit (OP)
Can you tell me what you would've done differently in the docker example I gave? How do I make it NOT read the entire goddamn 'docker build' 250k output tokens into an LLM configured for 200k context?
I put guidelines in AGENTS.md, what more do you expect me to do? Write a custom CLI interface to docker because I can't trust Qwen 27B to use docker properly?
alphatrad@reddit
Do you want to actually learn or have an emotional outburst?
Your prompt is vague, confusing and shitty. You put the same guidelines in the AGENTS.md? So you don't clearly understand context or how it affects your results, especially on smaller, sensitive models.
“Get it to run properly” is vague.
Does “properly” mean the build completes, the container starts, the web UI is reachable, the GPU is usable?
For Docker/AI projects, “runs properly” can mean ten different things.
You are talking to it like it's a person and not a tool. Something like this would probably work better:
```
You are helping Dockerize the existing project at ~/ai/echo-tts.
Goal:
Create a Dockerfile and docker-compose.yml that build and run the web UI on this arm64 NVIDIA Ubuntu host.
Hard constraints:
- Do not install anything on the host.
- Do not run pip, apt, poetry, uv, python app startup commands, or model setup commands on the host.
- You may read files in the repo.
- You may create/edit files in the repo.
- You may run docker and docker compose commands only.
- Do not use sudo.
- Do not guess dependency failures. Use actual logs.
Operating procedure:
- Read first:
  - README/install instructions
  - requirements.txt / pyproject.toml / setup.py / environment.yml
  - app entrypoint
  - expected port
  - Python version requirements
  - GPU/CUDA notes if present
- Summarize the intended install/run process from the repo.
- Propose a Docker plan before editing files.
Create:
- Dockerfile
- docker-compose.yml
- .dockerignore if useful
Command/log rules:
- For Docker builds, write logs to a file, e.g.
docker build ... > /tmp/echo-tts-build.log 2>&1
- Do not paste full logs into the conversation.
- If the command fails, inspect only the last 100-200 lines first.
- If the command appears to timeout, check whether the process/container is still running before assuming failure.
- Make one fix at a time and rebuild.
- If a Python dependency lacks an arm64 wheel, identify the exact package/version from logs, then try a source build for that specific package only.
- Do not invent packages or assume the failing package.
Success criteria:
- `docker compose build` completes on arm64.
- `docker compose up` starts the app.
- The web UI is reachable from the host.
- The container has access to NVIDIA GPU if the project requires it.
- Final answer should include the contents of Dockerfile and docker-compose.yml, plus brief notes on any source-build workaround used.
Start by inspecting the repo and making a short plan. Stop after the plan and wait for confirmation before editing files.
```
Even better, split it into multiple prompts.
For local models, I would not give the whole task at once. I’d do it in phases.
Your prompt expects the model to have good judgment about things like how to handle long-running builds, how much log output to read, and what "done" actually means.
Those are exactly the places where local coding agents often fall apart.
For Qwen/Gemma-sized models, the fix is not necessarily a giant AGENTS.md. It is more about making the task procedural, staged, and observable. Don’t say “Dockerize this repo.” Say “inspect only, summarize, wait.” Then “write files.” Then “build with logs to file.” Then “debug one failure at a time.”
Like I said... these tools are really good, when you use them properly. Not when you act like a 27B model can read your mind like a datacenter model can.
AlwaysLateToThaParty@reddit
The first thing I created was a specification prompt with the sole purpose of writing a technical specification, so that it has documented what to write and you don't need to pass it the trivial info again and again.
But the smaller models still just don't get some stuff, no matter how much you structure it. That happens in the bigger local models (like qwen3.5 122b/a10b mxfp4) too, but not enough to be a usability issue. A good spec only gets you so far.
Hydroskeletal@reddit
Briefly I think these local models are much more like autocomplete for an entire function rather than the long horizon inference that the name brand frontier models do.
I think a big difference here is model size. With car engines they say there is no replacement for displacement and with LLMs displacement == RAM.
Dockerizing a repo isn't coding, it's code-adjacent. It really cannot be overstated how much these local models lean on the structured grammar that a programming language provides. If it hallucinates a function, a compiler or interpreter gives it that feedback quickly. Tests do the same. But for an open-ended task like writing a Dockerfile, where the superset of solutions is much wider, it doesn't get that kind of feedback, and then it has to rely on intrinsic knowledge to deduce the problem OR it has to go search the internet, which it rarely will do unprompted. So when people rave about the abilities of something like the latest qwen model, they're operating in a much more constrained field. And I'll just say it: the structure that the language (e.g. Python, C, etc.) gives the output also makes things like smaller quants much more forgiving. It's quite undersold, I think, that there are lots of tasks like data munging that degrade terribly on these smaller quantizations, whereas an 8-bit would work.
agentic-doc@reddit
Tasking a 27B model with Dockerizing repos and complaining about decision making is like asking a Roomba to vacuum your driveway. Right tool, wrong block.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
90hex@reddit
My experience in a nutshell. We're not the only ones. Here's a dude who tried exactly this on a MacBook M5 128 GB. Lots of gotchas. https://deploy.live/blog/running-local-llms-offline-on-a-ten-hour-flight/
__Maximum__@reddit
I used Qwen 3.6 35B A3B with opencode and pi coder, and was satisfied with both on medium-difficulty tasks. It was even better than Claude 4.6 or 4.7 in Claude Code at explaining things, since Claude does not seem to be a good teacher; it is too compact.
randomperson32145@reddit
I don't see any models mentioned competing against OpenAI Codex anytime soon.
Monkey_1505@reddit
Yeah, there's a notable size difference between the models you are comparing here.
robertpro01@reddit
Well, I still consider myself a developer, so... local AI is just a tool; for me Qwen 3.6 is a good tool to use. I started vibe coding in Nov 2025, because my previous experiences with AI (API, not local) were terrible.
For me local AI is just another tool.
I also do a mix of API + local for very complex tasks, and still I validate all the code.
brick-pop@reddit
This. Huge LLM's start to be worth it when you blindly delegate to them. And then have no idea of what the code actually looks like.
RoomyRoots@reddit
Yeah, add it to the stack of tools you use, don't drop everything and depend only on it. It works very well as a document searcher, summarizer and drafter. I'd still rather do things slowly and step by step so I can fully understand how things are implemented.
Equivalent-Repair488@reddit
This is me, I am not a coder and only trained on python basics.
I use Qwen 3.6 27B at UD-Q5 locally in Roo Code; it still runs into debug loops and is not it for "I want this app built with such-and-such features." Tried that before and gave up. So I use my free student Gemini Pro plan to architect, to create a build prompt based on the discussion of my vision for the app, and to guide both me and my local model.
Although the apps always have issues and bugs, even a non-coder like me can vibe out simple apps with enough time and testing.
the-username-is-here@reddit
As someone who's been using Claude Code for a loooong time and recently got into local models (with the limited hardware i've got), cannot agree completely.
Yes, local models by default are dumber and slower than even "basic" Sonnet or sometimes Haiku. Yes, there's a learning curve involved, as well as a lot of tweaking. Yes, they tend to hallucinate, loop tool calls, stuff like that.
But.
It kinda doesn't matter when Anthropic decides to slash usage once more and burn through 200 EUR/month subscription tokens in half an hour. Or when it goes down again. Or when it decides that some code you're working on "violates their TOS", effectively censoring your work, no matter what you do.
Once you're set up, you pay just for electricity, which is peanuts on Apple Silicon (and you "need" that sweet 128 GB MacBook anyway 😄 ). It's always available, 100% secure, and you can do anything you want with coding harness, which is a no-go with Claude Code.
Local models are still more than enough for simpler refactors, boilerplate and stuff like that. They require you to get more familiar with the code you're working on, which is A Good Thing™.
You cannot go "hey Claude, make it fast" and then have NFI how it works now and what are the new bugs, which is not necessarily bad.
There's a future for local models, they're getting much smarter and more accessible.
niellsro@reddit
I am having quite good results with qwen 3.6 27b for coding. Using pi, llama.cpp with the Unsloth UD Q8_K_XL quant (tried an AWQ in vllm but I was getting more tool-call errors). I am really impressed by how good this model is with precise directions. This is still in a testing phase for me; I am actually throwing it at a project idea I've had in mind for some time, but so far the results are really good. I'm using it for python (backend) and vuejs (FE app). What I noticed (this applies to all LLMs, but especially to small models like qwen): make sure to lay out the foundation or precise instructions on the architecture and the code, not just requirements; provide interfaces, design patterns etc.
PS: I also use Claude Code, but comparing it to qwen is unrealistic: 2 different models (small vs huge/unknown), 2 different agent tools (Claude Code vs pi - I don't have API access to any Anthropic model so I only use them in CC).
xXG0DLessXx@reddit
I think the harness you use plays a big role in how good some local models are. But also, I kind of agree that they simply don’t hold a candle to the big providers. But for me, Gemma 4 for example is extremely good at debugging some weird issues in the system for me, checking logs, giving ideas what it might be, and even fixing some small stuff for me fairly well. Where it’s not that good, is creating things from scratch, or making huge changes to existing code bases.
DeltaSqueezer@reddit
There are limitations on intelligence and context of local models.
But the tasks you gave are easily doable. I've done similar with just a 9B model. I suspect you may not have controlled the thinking (particularly for Qwen3.5) and the context got exhausted by thinking. I actually have thinking disabled when using it for coding.
muhlfriedl@reddit
You get what you pay for
Aphid_red@reddit
Note: The provider models all have a big system prompt. You don't see it, but it's there. You should use one as well.
The provider models also use 'thinking' mode.
If your local model only has one or even neither, it practically won't behave the same way. It's smaller and thus a bit less capable, but shouldn't be unusable for repetitive tasks that have well-documented examples online.
juaps@reddit
The issue is that you require over 240k of context to run it without any problems. I simply switched from LM Studio to a llama.cpp custom fork to run TurboQuant, and all my tool-calling errors, idles, loops, and so on stopped. As a result, I now have a proper and efficient web application with login, chat, RAG, and SQL functionality for my business.
iamreddituserhi@reddit
Try giving a system prompt (try different quants and versions; some versions just break and keep looping with weird output).
Once it's tuned you can expect better output. Try different system prompts, or even ask Opus, Kimi or DeepSeek to optimize the system prompt for your use case.
Then it will become usable. You also need to adjust the temperature according to your use case.
SatanVapesOn666W@reddit
His prompt was like 2 paragraphs asking it to work with docker which even the Claude models struggle with. He's really not doing himself any favors, since that's basically the first thing he dismissed and it really is his problem.
BubrivKo@reddit
Yeah, it’s the same for me... Everyone’s praising Qwen 3.6, Gemma 4, and so on, but they just don’t work for my use case. I have 15 years of programming experience, so I can usually tell pretty quickly the difference between clean, functional code and a bad solution that misses the mark.
Opus almost always gives me the correct approach, while smaller models almost always fall short.
I simply can’t trust a smaller model to solve my task correctly, which makes using them kind of pointless for me.
Smaller models might be fine for tasks like summarization, pulling information from large text databases, basically RAG-type operations.
TuskNaPrezydenta2020@reddit
I think Qwen sucks for tool calling, every time I tried either MoE or dense they randomly fail with syntax over time. I had no such problems with Gemma 4 or GLM though and I would say they are roughly as good as the sub makes them out to be in general for my use. I wrote (not vibecoded) my own private harness though after being unhappy with OpenCode and alternatives so mileage may vary
HenkPoley@reddit
Yeah, you can kind of use them like GPT-4, e.g. you ask them something, and the answer often won't work, but you don't assume it will work. You just use it as inspiration. The function names they come up with will get you somewhere in the documentation of the system you're using. Things like that.
AardvarkTemporary536@reddit
You don't hire McKinsey to bleed your wallet dry and ruin your business with their own hands. You overpay them to tell you how to ruin your business yourself.
You also don't fire all middle management when times get tough; you just keep the most efficient ones that you can overwork and underpay and who do a good enough job. Then when things get better you realize they're doing just fine, so long as you accept high turnover.
Ballisticsfood@reddit
Opus is a mid-junior level developer, sonnet is a junior developer, haiku is a junior developer that got distracted by a butterfly.
Local LLMs are kids doing ‘my first app’ courses. They don’t have the necessary ‘experience’ to do even basic tasks without handholding, but if you want them to copy out endless boilerplate with some small changes they’re competent enough.
fasti-au@reddit
So what you have is called other people's flows.
I have 1M-context Qwen 3.6 27B and 35B running.
The trick is: don't give it tools, give it ways to find tools. Think like it's you, make a state machine, and give it a CLI tool for the command calls (HTTP via a CLI mcp-connect). That way it gets to think.
Ask GPT to read back a thread and evaluate how many tokens went to solving the problem and how many went to not upsetting you with its reply style. That's why you go local.
CriticismTop@reddit
I'm having similar experiences using OpenCode. Gemma 4 has been OK; Qwen spends most of its time going round in circles.
Local is working great for me with automation, but coding leaves a lot to be desired. I am enjoying Big Pickle with OpenCode at the moment.
Current_Ferret_4981@reddit
Hermes + Qwen3.6 in Q5 with qkv cache fp8. Flies fairly quickly and handles tasks incredibly well. Much better than the integrations and models my job has made accessible so far. Complex workflows with deep library understanding come together in a few minutes which has been impressive
mission_tiefsee@reddit
Local models are not your fairies. If you use this setup, you are using bleeding edge tech. Welcome to the frontier. So change your harness, jump on the hermes-agent bandwagon, and get your stuff going. For your prompt: go to ChatGPT 5.5, post your prompt, and ask it for a prompt tailored to your local model. Unfortunately, smaller local models are really prone to not understanding your intent.
I built a nice tower defense game yesterday, with qwen3.6-27b and some prompting help from chatty 5.5. All in Godot, without touching code.
somerussianbear@reddit
r/usernamechecksout
themoregames@reddit
TL;DR After giving local LLMs a fair shot for coding tasks, the author concluded the productivity loss outweighs the benefits compared to cloud models like Claude Code.
somerussianbear@reddit
Funny. I've wanted to post something like this, but every time I think about it I fall back to "I'm probably doing something wrong, gotta put more time into this to make sure it's not my setup". Then eventually a new model comes out and I feel "oh, NOW it will work!", and rinse and repeat, and the feeling is still the same.
The thing that seems to have given me the best results was
little-coder, a pi-based harness that adds good guardrails for small models. To give an idea of how excited I am about that one, I'm building an entire sandbox tool using SBX because it doesn't have one and I want to use it badly day to day. For simpler tasks like documentation or understanding/researching a codebase it gave me good results with Qwen 3.6 35 MoE, so I imagine a dense model would do even better.
But yes, it's pure grind until we get something minimal working, and most people just don't have the energy to keep going. Luckily I have fun on the discovery path rather than only enjoying the final results, and for this subreddit that seems to be imperative.
EPICWAFFLETAMER@reddit
Completely agree. I use local models all the time, but not for general coding. I see a lot of posts on here of people saying x model is 99% as good as opus 4.7. That sentiment especially gets voted to the top when a new model drops. I think we will have very good local models for this purpose in a year or two, but we just aren't there yet.
Zestyclose-Worth-167@reddit
Look, if a 27B model isn't cutting it, consumer-grade gear just isn't gonna save you. My advice? Milk the free APIs for all they're worth. If those run out, you’ve just gotta bite the bullet and pay up. Even the 80B coders I've used don't really hit the mark. That 27B version of 3.6 is 'okay,' it’s just laggy as hell. So yeah, I feel you. It's either put up with the stutter or you're stuck. Spending $20k+ on an AI rig is overkill—that money would pay for enough API tokens to last you a lifetime.
ShelZuuz@reddit
$20k would be totally worth it for a Kimi 2.6 level model at 100 tok/sec output.
But that's not going to be $20k.
Zestyclose-Worth-167@reddit
yes..
Inevitable_Mistake32@reddit
I like my privacy. I use APIs on the rare occasion I am ok with donating towards my replacement, but for everything else, local. of course LLMs aren't all I self-host, doubt anyone is. But with everything from HA, my fun paper trading accounts, my screeps bots, local and remote API keys on the host, I opted to keep my data local.
Is Qwen or Gemma better than Opus? Idk, is a smaller yacht better than a bigger yacht? Subjective.
But being able to crack out 120 tks, with 256k ctx, with zero api waiting/throttling/ratelimits and knowing none of it leaves my local network? Priceless value to me.
hovo1990@reddit
Try the caveman plugin for Claude Code; it has improved the experience.
StorageHungry8380@reddit
I noticed that the default 8GB of host prompt cache in llama.cpp was not enough for Qwen3.6 27B @ 128k context when using it with OpenCode. You can monitor this in the logs by watching the prompt cache usage it reports.
In my case a ~57k prompt ate 5.6GB of prompt cache. I bumped it up to 32GB, since I'm running 4 slots, and that helped a fair bit for me.
Major-Examination941@reddit
Yeah, I built my own Ollama-like setup locally with routing, model switching, and orchestration, calling into the cloud (Anthropic, MiniMax, Gemini) for orchestration and synthesizing, and sometimes for code. I mean, if you're expecting your rig to actually compete against Claude, which literally loses money on you, then your expectations are off. Also, it's great for continual learning: you still have to debug, learn how to prompt better, review, etc.
AdOk3759@reddit
The quality of a local model hugely depends on the harness. I suggest you look into little coder (and their paper)
SourceCodeplz@reddit
I think it is because of quantisation. I am using Gemma 4 31b via API directly from Google and I don’t experience this. It just works!
TanguayX@reddit
Yeah, I'm with you. I did some experiments over the weekend with my local Qwen3.6, as big as I can muster, with Cline, and it was doing OK with the task I was trying. But I had Sonnet off to the side going... "wow, look, it just made up a function", even with Sonnet giving it hints.
So yeah, what's the utility in that, when debugging is often worse than just starting from scratch with a better planning doc?
The way I look at it, two years ago, I had to carefully coax GPT through a coding session. Now, I was getting VERY close to getting a local model to one shot based on a good PLANNING and TASK doc. That's pretty sweet. Progress will continue, and it will happen one day soon.
admajic@reddit
Did you give the baby model a plan or just let it loose? If you used architect mode and then wrote a Jira-style ticket, it would do better.
Pygmy_Nuthatch@reddit
It's hard to make an honest comparison between a 27B parameter model and a 2T parameter model.
Curious-Function7490@reddit
I'm semi in the same spot. I am running qwen.2.5-coder.32B locally on an RTX4090 using llama.cpp and getting 30 tokens a second. I set this up because I was tired of using up Claude's tokens on one of my projects.
TL/DR.
The more helpful LLMs (Claude, etc.) that are really effective won't be affordable in the long term. The companies providing them are running at a loss, there is an AI hype bubble, and we already understand that they are unaffordable and problematic to depend upon.
I think understanding how to work with local models is viable and it will come back to being more hands on.
So, I'm more or less going to nix Claude from a lot of its activity in my codebase and learn to work with open source models that I can host myself. It won't be as productive as using something like Opus, but it will be viable for the long term and relevant for the job market.
unchikuso@reddit
tldr. but did you try pi? pi fixes things
wasnt_in_the_hot_tub@reddit
Right? Pi is the main reason coding with local models works for me. Before trying pi, I was essentially building my own (shittier) harness and trying to keep it super minimal to work with smaller models. Then I found pi, realized it was perfect, and stopped my other harness project.
I've seen people here using local models with small context windows, then complaining they can't code with opencode, not realizing it eats like 10-12k tokens on init.
There are other things to consider with smaller, local models. But it mostly boils down to making tasks smaller in scope... go figure!
I prefer to compare coding with local models to coding without AI, rather than comparing with cloud-hosted frontier models.
theUmo@reddit
What models are you using with Pi?
dtdisapointingresult@reddit (OP)
Briefly, I was just starting to use it. I haven't gotten far enough to discover and create extensions. The vanilla setup is not usable, it would also read the output of 'docker build' in the main context. I know there's extensions to teach it to run subagents, I just didn't get that far.
PP9284@reddit
I believe this subreddit lacks best practices regarding model deployment and its practical application in dev; that may be a promising topic to explore.
lnsip9reg@reddit
👋
fredandlunchbox@reddit
I'm currently trying a hybrid approach.
Claude to plan, then it uses a skill to implement the code with Qwen-27B. Saves tokens on writing code.
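Roughly this shape, as a hedged sketch (the model ids, endpoint, and prompts are placeholders, not the actual skill):

```python
# Hedged sketch of the plan-with-cloud / implement-with-local split described above.
# Endpoints, model ids, and prompts are placeholders, not the actual Claude Code skill.
from openai import OpenAI

cloud = OpenAI()  # stand-in for whatever hosted model does the planning (Claude above)
local = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

task = "Add a /healthz endpoint to the Flask app in app.py"

plan = cloud.chat.completions.create(
    model="gpt-4o",  # placeholder planner model
    messages=[{"role": "user", "content": f"Write a short, numbered implementation plan for: {task}"}],
).choices[0].message.content

code = local.chat.completions.create(
    model="qwen-27b",  # placeholder local implementer
    messages=[
        {"role": "system", "content": "Implement exactly the plan you are given. Output only code."},
        {"role": "user", "content": f"Task: {task}\n\nPlan:\n{plan}"},
    ],
).choices[0].message.content

print(code)
```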
spaceman_@reddit
I felt like this before, but with Qwen 3.6 for me it has honestly been a non issue for how I use it. ("look at this issue, explore and plan" -> "write a test or test suite that covers the issue" -> "fix or implement the issue")
They're not on the level of Kimi or GLM, but in my daily use, they are more than good enough for 90-95% of the issues.
Due_Duck_8472@reddit
But but but ... all the autistic people here mortgaging their parents' houses to buy 6-figure rigs to code "hello world" say it's working ...
Who to believe?!
whatever462672@reddit
Honestly, the issue is with how Claude Code works. I was having these issues using Roo Code until I looked at exactly what it sends to the model. There is too much natural language and dependence on model knowledge. The harness sends a wall of text and nukes its own context.
I get decent results with GEMMA-4 with Copilot Chat and context-engineered instructions.
sarcasmguy1@reddit
I've been tinkering with qwen3.6 recently, and have got it to a place where I can use it for most coding tasks, so I thought I'd share my experience.
Note - I still use GPT5.5 and mini for bigger projects (Monorepo or similar), and generally use mini for 'work' tasks as the quality is higher. Qwen has been great for side-projects though.
I run it on a RX 7800 XT, with many MoE layers pushed to the CPU. This allows me to fit almost all GPU layers into VRAM. I get around 30t/s. Prompt processing is really fast as long as I keep context small (68k). I have 32gb of system RAM, and a Ryzen 5 7600.
My workflow is:
1. Plan with 5.5 or mini, depending on the task. Mini for features, GPT5.5 for new projects. I get them to write plan files.
2. Give it to Qwen 3.6 to implement
3. Get mini to validate it
I use pi via the littlecoder harness.
On quality: it feels good in TypeScript. This entire repo has been written by Qwen3.6 locally, with 5.5 plans. In less popular languages (like Clojure), it's pretty bad: slow, and it hallucinates a lot. Language choice is important.
On speed: pretty good. It took a lot of experimentation to get here though. littlecoder helped quite a bit, and switching to Ubuntu made a big difference (I was on Windows previously). I ran it all through LM Studio; I haven't got to the part where I tinker with llama.cpp directly. It's not nearly as fast as, say, GPT mini, but it's good enough.
The main advantage is infinite tokens. They feel amazing, even if they're slower. It really pushes the bar for experimentation imo. However I would not replace my primary workflow with local hardware.
Some issues:
1. Thinking loops are a pain. I've got them to happen less frequently by following the recommended inference settings by the Qwen team, but they still happen. It makes me feel like I need to babysit the model which can be annoying depending on what I'm doing.
2. Small context window. This is an issue with my hardware, not the model at all, but I thought I'd call it out. Auto-compaction kicks in pretty quickly, which can sometimes interrupt the model.
3. Tool calling proactivity. In GPT, the model is really good at knowing when to call a tool. If it encounters issues (like compilation errors or bad types), it will use a variety of CLI calls to get to the solution faster. Qwen doesn't do this; it tends instead to grep every possible line of code and then come up with a solution. This is much slower.
4. Greenfield tasks (e.g. "Add this feature") are still quite bad. It often comes to a really strange conclusion on how to implement a feature. This could be an AGENTS.md or context issue, so I'm not putting this on the model. For example, adding async model loading in the lmstudio extension took a long time and it did some really weird stuff. GPT mini ripped through it, and was proactive about reading docs to find the solution.
aniket_afk@reddit
Yo OP. Which quantized model are you using?
Otherwise_Berry3170@reddit
Like everything else, it depends. For example, if you were talking about Claude a month ago, I would say yeah, it was pretty good. Now? Not so much: they watered down the models and recently came out and called it a bug because we complained. The prices change, the limits changed. So while I agree local models are not as good, with training and good agents/skills I can do almost the same with qwen3.6 35b as with Claude Sonnet. Qwen3.6 27b is better, but on my hardware (a GB10 Blackwell) it's a bit slow, so I use it for text only. It took a bit to get the agents right, and they still sometimes don't work as expected, but it's pretty OK with my setup. And on the cost side, just last week I would have spent $2k on Claude API calls. So yeah, I agree: not perfect, but not so terrible that you cannot work with it.
_mayuk@reddit
Skill issues xd
I managed to use Gemma4 E2b to digest some JSON payloads and check correlations in a vector DB with some bucks that I have, etc. Multiple agents (not running simultaneously), but each one takes care of some aspect of digesting the JSON payloads, in a way that maybe only a Deep Research conversation or a notebook would be able to handle hehe…
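For anyone curious, a hedged, stripped-down sketch of that kind of digestion pipeline (file names, endpoint, and model id are invented):

```python
# Hedged, stripped-down sketch of this kind of pipeline: digest a JSON payload,
# embed it through a local embeddings endpoint, and compare it against vectors
# stored earlier. File names, the endpoint, and the model id are invented.
import json
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def embed(text: str) -> np.ndarray:
    # Assumes the local server exposes an OpenAI-style /v1/embeddings route.
    emb = client.embeddings.create(model="local-embedder", input=text).data[0].embedding
    return np.array(emb)

# One agent's slice of the work: flatten a payload field by field, then embed it.
with open("payload.json") as f:
    payload = json.load(f)
summary = "\n".join(f"{k}: {v}" for k, v in payload.items())
vec = embed(summary)

# Compare against previously stored vectors (the "vector db" here is just a .npy file).
stored = np.load("stored_vectors.npy")  # shape (n, dim)
scores = stored @ vec / (np.linalg.norm(stored, axis=1) * np.linalg.norm(vec))
print("closest existing record:", int(scores.argmax()), float(scores.max()))
```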
_mayuk@reddit
Of course, you have to do much of the programming yourself, or use another AI to create the Python scripts that digest the files and handle the DB…
The vanilla open claw setup is not enough… I'm trying to integrate Obsidian and/or Graphiti…
I hate how much marketing there is around this kind of stuff nowadays, because there are many forks or repackages of the same thing with fancy names… But like people have been saying, if you are a coder yourself, all of these are amazing tools… even non-vibe-coding can actually be sustainable if everything is running on API calls lol…
Maybe you can use the paid LLMs to help you set up proper agents, with proper memory handling, for a given task… but then you'd have multiple different agents, or have to stay on top of organizing all these subroutines, let's call them that xd
Idk, I'm not well versed in the proper terms for all this because, despite having been programming since I was 12, I'm mostly self-taught… :v
toothpastespiders@reddit
I'm a huge proponent of local models. I've spent a godawful amount of time carefully curating datasets and training along with setting up rag systems, testing, etc etc.
And I don't use local models for coding. I'd love to but I've never really been satisfied with anything I can actually run. Qwen's 480b model would be great if I had the hardware. But I don't. So I use it through their api. I console myself that it 'could' be local for me, one day. In a way that's both non-lobotomized and fast. But at the end of the day that's just cope.
poobear_74@reddit
OP, you might be bumping into tool calling issues since the models you reference are very new. Qwen 27B was only released a week or two ago, and there simply hasn't been enough time for the developer community to patch vLLM and other software to work well with them.
Majinsei@reddit
Coding... I don't have the hardware for that~
Building agent pipelines and processes, then yes~
I usually leave it processing for hours (almost days) without being afraid of running a ton of tests~
I consume tokens like you wouldn't believe~ so automating something locally without my credit card suffering is glorious~ I've easily already burned, in tokens, almost half the price of a mid-range GPU~
I generally use it for batch jobs~ coding, nah~ with luck I use Antigravity and get on with the rest~
RedParaglider@reddit
Local models are not really for vibe coding if you want to code with them. They are for pair programming. These are 27b models, you simply aren't going to get the same performance as 1500b models.
I personally do not use local LLMs for coding tasks outside of simple scripts or command line questions without a session. I use them for testing agentic business workflows, and for those they are great.
simracerman@reddit
To be brutally honest, I haven't coded by hand in years and it would likely take me a year to get back into my original shape. Yet with the same model you used at Q4 quant + OpenCode and a few days' worth of sessions, I was able to get a fully featured budgeting app built from scratch.
Local LLMs are not cookie-cutter solutions yet. They're more like a clay sculpture: at the beginning you can't even hold the clay together, but after kneading and tweaking you will slowly overcome issues and start producing good results. Remember, this isn't cloud AI where an army of sysadmins and devs is working non-stop behind the scenes to make your experience better.
_hephaestus@reddit
I wish I could disagree with you but it’s a messy world. I do think there might be something to having something like litellm with langfuse in between the harness and the provider for debugging, but that works the same whether you’re using local or externally hosted. Part of the problem is the speed things are moving and the lack of cohesiveness, all the big players have their harnesses and ship their models with them, meanwhile there’s still unmerged llama.cpp/oMLX stuff supposed to make qwen3.6 understand tools the way it’s hyped to.
Interesting-Yellow-4@reddit
Yeah that's been my experience as well. Local has a long way to go
ResearcherFantastic7@reddit
For people who want a 1-to-1 replacement for cloud models... you are just wasting your time.
Why would you pit a single parrot brain against a cluster of elephant brains?
The way you should look at local models is that they are there to do a very specific task (not to excel at it, but to produce an acceptable result).
knownboyofno@reddit
I'm interested in which repo you asked it to do this with. Could you post the link? I want to test this too, because it would be a good test. I have had problems like this as well: I thought a task was easy, but it failed quickly.
I had a different problem. I gave it a range in an Excel sheet that was saved from a Google Sheet. I had it recreate those calculations, then use that file only as a "database". That took an hour in Claude Code; then I downloaded the data into a CSV for each data source (this was something I had done before). These functions retrieve the updated data, which is fed directly into the model. I then had it use those functions, but gave it example files to test on before wasting credits. It was able to correctly recreate a 30-sheet Excel file, with formulas involving hookups, lookups, INDEX/MATCH, SUMIF, cross products, negative binomial distributions, etc., into a Python dataframe using pandas. I have done this before with other files manually, and it took me 25+ hours to trace the formulas and get the correct data sources.
I used Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf with llama.cpp (out of laziness, because I have a vLLM setup), and it had full context. This was in Claude Code without any skills or anything extra, but I did turn off a few headers sent by Claude Code. I did ask it to create a Python environment to run what it needed. It asked a few questions, but I didn't have to micromanage it.
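To give a flavor of what that formula-to-pandas translation looks like, a hedged toy example (the columns and formulas below are invented, not from the actual workbook):

```python
# Toy sketch of translating spreadsheet formulas into pandas, in the spirit of the
# workbook conversion described above. Column names and formulas are invented,
# not taken from the actual file.
import pandas as pd

sales = pd.DataFrame({
    "region": ["EU", "US", "EU", "APAC"],
    "amount": [100.0, 250.0, 75.0, 30.0],
})
prices = pd.DataFrame({
    "sku": ["A", "B", "C"],
    "price": [9.99, 19.99, 4.99],
})

# =SUMIF(region_range, "EU", amount_range)
eu_total = sales.loc[sales["region"] == "EU", "amount"].sum()

# =INDEX(price_range, MATCH("B", sku_range, 0))
price_b = prices.set_index("sku").loc["B", "price"]

# =SUMPRODUCT(amount_range, weight_range) style cross product
weights = pd.Series([0.5, 0.2, 0.2, 0.1])
weighted = (sales["amount"] * weights).sum()

print(eu_total, price_b, weighted)
```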
cleversmoke@reddit
I use sota models for high level plan, strategic plan, architecture plan, and feature implementation plans. Then I use local Qwen3.6-35B-A3B + DeepSeek-R1-Distill-Qwen-14B as an agentic coding pair to build one feature at a time.
It's going well, but it's more involved than just "build me an app". For anything that Qwen fails at, I just fall back to a sota model.
ascendant23@reddit
Expecting a 27B OSS model to hold up head to head with the latest Opus / GPT is just wild. It's like trying to replace an 18-wheeler truck with a ford pickup truck. If your workload requires an 18-wheeler, the pickup truck is never going to come close to meeting that need.
Doesn't mean pickup trucks are bad, just means, don't expect them to do things they can't.
StardockEngineer@reddit
Claude Code has a parameter you need to set to prevent it from junking the KV cache. I forget what it is but maybe you can search for it.
InKentWeTrust@reddit
Do you use recursive reasoning on your locals? It takes longer to process but it produces much better results
dtdisapointingresult@reddit (OP)
idk what that means so I guess the answer is no, I don't.
biotech997@reddit
I tend to agree, from my experience with small-medium (<27B) models, they just aren't that smart. For regular questions sure, but things like scaffolding a codebase and actually providing relevant code is still very far away. Even boilerplate code I feel like it's evidently lower quality. Obviously I don't have 2x5090s, so YMMV.
Over-Advertising2191@reddit
I think the best approach is to use these mid-sized LLMs as assistants that understand your codebase and occasionally help you with a specific function or specific errors. Anything broader and the experience is the same as op's
NNN_Throwaway2@reddit
LLMs can't generalize, if they haven't been trained extensively on a task they will face-plant. This is especially true on smaller models where you don't have a large body of world knowledge to lean on.
LLMs in general, but especially the small ones, are getting increasingly specialized on agentic coding. I suspect that building and spinning up a container falls just far enough outside of what it was trained on that it doesn't know how to apply the basic problem-solving it was certainly trained on in other areas.
But yeah, people are going to get upset if you say the latest OSS darling isn't the bee's knees and a huge game-changer that rivals Claude Opus.
InnovativeBureaucrat@reddit
I found even the advanced models to be terrible at any devops tasks until recently. They were pretty good at Python code, and really bad at even the simplest tasks involving containers, network questions, or really anything not code.
Maybe it’s just me.
VertigoOne1@reddit
They’re pretty solid on Kubernetes, Terraform, Argo, GitHub Actions, you know, stuff that has a long history and strong community representation. Not bad on PowerShell either, but it has to work through the bugs for a few rounds. If you mean DevOps as in anything outside the above, I’m not sure. Regardless, it is important to have good structure and docs. A lot of DevOps repos are just a mess, which gives the LLMs amnesia.
Long_comment_san@reddit
Try fidgeting with sampler settings maybe
Photoperiod@reddit
I say this as somebody who runs local models and deploys OSS models to on prem datacenter gpus as part of my job. Ultimately, local AI is a niche hobby that is not really cost effective. The only major reason I can see doing it beyond hobby tinkering/learning is for a fully secure/private stack. If you absolutely cannot let your data go to the cloud then you need to go local.
There are enterprises that absolutely will not send their sensitive data to cloud model providers like Claude or gpt. Those enterprises are using OSS models deployed on their own compute for coding, analysis, etc. But yeah, if you're in the clear for data privacy and whatnot, then it makes more sense to use cloud providers.
maz_net_au@reddit
All of this (and the comments) is how I feel about using Opus as well. LLMs are fun but they're just so dumb.
One-Replacement-37@reddit
Cool story, bye!
Different-Rush-2358@reddit
Look, the problem that many of you are facing—and honestly, it’s not even your fault—is that they’ve overhyped local models way too much without explaining how the "local hype" actually works. As of today, the only local models that are truly good for general purposes, including coding, are Gemma, Qwen, and DeepSeek. Forget about those weird variants or random "labs" that pop up out of nowhere; most of them are just distillations of models that were already distilled before.
Then there’s the whole quantization topic, which has advanced quite a bit. For example, Unsloth’s UDS gives you very decent precision and they fit into any consumer PC (depending on the parameters, obviously, haha).
And then you have the "Blackwell sect" and their "high precision or nothing" mantra (which smells a bit like a sponsorship from NVIDIA or some massive GPU distributor, but whatever). They’ll tell you that if you don’t have 17 Blackwells, 900TB of RAM, and a quantum computer, you can’t run anything. That is a total scam. Anyone with 15 minutes of spare time can figure out the commands to squeeze the most out of their hardware. You can run models very comfortably on hardware from several years ago.
(Example: Qwen 2.5 72B at 10-20 tk/s on a Xeon 2680v4, 32GB of RAM, and a GTX 1070 with 32k context, thanks to turboquant flags).
So, in short: believe only half of what you read here, and take the other half with a grain of salt. You don't need a $30,000 rig to run a 280B model, for instance; you just need to know how to use the correct flags in llama.cpp and have a balanced setup.
Sorry for the wall of text, but I saw this post and took the chance to get this off my chest; it's been on my mind for months. And I know I might get downvoted to hell for this... but I don't give a damn. It's about time we debunked the myth.
Such_Advantage_6949@reddit
Reality is you need at least a 200B local model like MiniMax to get serious stuff done. Small models have made progress, but they will break as soon as you throw serious stuff at them.
dev_all_the_ops@reddit
Thanks for sharing. I've been obsessed with getting started in this, but I worried I would just be wasting my time.
I still like local models for security and to fight against subscription bloat, but it's good to know that they're just not as good as paying a major player.
cocoa_coffee_beans@reddit
Yes, local models fall short for coding.
That’s not all. The ecosystem is quite fragmented:
OpenCode is broken with vLLM ever since vLLM deprecated the reasoning_content field in favor of reasoning.
Open WebUI still handles reasoning like it's early 2025.
Vendor-specific tools such as Codex and Claude Code constantly break against local inference even if you point them at their respective APIs, because the vendors are constantly iterating their clients.
If you’re not deeply entrenched in the specifics, you won’t squeeze the performance you need for coding. For most people, it simply isn’t worth it.
Noiselexer@reddit
I've never considered local llm for coding..
alexthecat999@reddit
Is it good for small tasks and boilerplate code? Just to bridge my lack of syntax knowledge with a new language?
ryfromoz@reddit
Yes, and it's better if you can actually prompt 😁
SnooPaintings8639@reddit
Yup, it's a bit overhyped here. I mean, if it wasn't open weight, I assume it would be very rarely used anywhere.
Having said that, it is capable, and with proper care it CAN replace Sonnet or others in many clearly defined, coding-heavy tasks. It just needs a bit more care from you and/or a larger model on top.
So, if all you care about is price, speed, and quality, probably stick with APIs. But if you have a reason to go local, this model CAN do it.
kevin_1994@reddit
Works fine for me but I don't delegate all my thinking to a machine
Kitchen-Patience8176@reddit
Honestly, not worth it unless privacy is a major priority for you. I just use the $20 ChatGPT subscription (includes Codex) and use it in the terminal for Docker and general sysadmin stuff.
gffcdddc@reddit
You need to use a high-parameter MoE model, then use a fast GPU and offload the experts to the CPU.
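As a hedged sketch of what offloading the experts can look like with llama.cpp (flag names are my understanding of current builds; double-check against llama-server --help):

```python
# Hedged sketch: launch llama-server with MoE expert tensors kept in system RAM while
# the dense/attention layers go to the GPU. Flag names reflect my understanding of
# current llama.cpp builds (verify against `llama-server --help`); the model path is
# a placeholder.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "models/qwen3.6-moe-q4.gguf",           # placeholder model path
    "-ngl", "99",                                  # offload all layers to the GPU...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...but keep the expert tensors on the CPU
    "-c", "32768",                                 # context length
])
```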
Ok_Librarian_7841@reddit
Yes, unless you can host Kimi 2.6 or minimax 2.7, local coding is trash
matt-k-wong@reddit
Even though I use local models a lot, life is better with Opus.
tomByrer@reddit
Takes a bunch of homework &/or beefy GPU power & VRAM to get LocalLLM worth it.
Seems you have neither.
c64z86@reddit
Local models are pretty good to play around with, and some can be good even at storytelling or roleplay, but subs like this one hype them up way too much. Local AI is pretty revolutionary in itself because it sets us free from corpos who dumb down their models while charging us more for the pleasure, and I will say that and spread it around all day long. But I'd be lying if I said they were better than something like Claude Opus.
Own-Refrigerator7804@reddit
This was written by a local or online model?
dtdisapointingresult@reddit (OP)
Neither, you regard.
Terminator857@reddit
Strix halo qwen 3.5 122b q4 working well for me on simple stuff. Yes very slow, but works.
blargh4@reddit
They’re fun to mess with, and they have legitimate practical uses when you don’t want to burn money on properly scoped tasks they can do fine… just use them with appropriate expectations.
TheAncientOnce@reddit
Caught this post too early. Would love to come back and see what people say..