Qwen3.6. This is it.
Posted by Local-Cardiologist-5@reddit | LocalLLaMA | 409 comments

I gave it a task to build a tower defense game: use screenshots from the installed MCP to confirm your build.
My God, it's actually doing it. It's now testing the upgrade feature.
It noted the canvas wasn't rendering at some point and fixed it.
It noted its own bug in wave completions and is actually fixing it...
I am blown away...
I can't imagine what the Qwen Coder that's following will be able to do.
What a time we're in.
Long_comment_san@reddit
That's not the best part. Imagine a new generation of kids having access to tools like that from early school on. I wonder what the heck our planet would look like. It's either a metropolis or Idiocracy.
kamikamen@reddit
People thought the internet would make us all geniuses, and Gen Z is the first generation with an IQ lower than their parents'. The future will be a lot more unequal than today: people with access and smarts will use these tools to empower their creativity and create new ventures, while most will just use them to outsource their thinking, never moving beyond pure chatbot use.
nachohk@reddit
Armed with a programming book and a BASIC dialect and a determination to create cool stuff when I was a kid, I taught myself enough code to write simple games like this within a few years. Add internet access and the rest of my youth, and I taught myself enough code that it became my career.
LLMs are nowhere near good enough to write entire commercial applications without expert supervision. And I have enough background in ML to say the reasons why are fundamental limitations of language models that will require multiple attention-level breakthroughs to get past. So I don't think we're going to have that for a long time yet. (Shit, the LLMs are only useful even with expert supervision in a fairly constrained subset of types of software.)
But LLMs are extremely good at hacking people's dopamine reward system and giving them the feeling of building something with absolutely none of the benefits of having done so themselves. If kid me had an LLM, I don't know that I could have ever learned programming well enough to make a career out of it.
I'm worried for our future. I take little comfort in it, but at least I should never have to worry about my own job security.
Sea-Promise-1182@reddit
‘Expert supervision’ is a little strong, no? All you really have to do is point it in the right direction and be able to translate ideas into words and use TypeScript.
nachohk@reddit
Sounds like you're liable to find out the hard way that LLMs don't know better than to do very stupid things like transmit your production database credentials to the client.
Sea-Promise-1182@reddit
Well yes, that is a risk, but as long as you make sure the AI doesn't do that, it's not gonna do it. If you're allowing all edits and commands in a prod environment you could be screwed, but the upsides definitely outweigh the downsides for coding.
nachohk@reddit
You seriously have no idea what you're talking about.
gearcontrol@reddit
I think there will be new fields created and current fields expanded, and it will be very lucrative for current experienced programmers and those that can understand the big picture, as in, how everything connects. There will be a long transition maintaining legacy, legacy with the new, and then the new.
Currently, AI has lowered the barrier to entry for creating software products. Call it 'AI slop' or whatever, but it's out there and will need to be maintained and managed.
nachohk@reddit
No, it really hasn't. Not for the kind of software you can sell. Not without going out of business very soon after due to shitty dysfunctional software because nobody involved knew to make damn sure Claude didn't implement any of those features you requested by sending your database credentials to the client app.
It has lowered the perceived barrier, but it is a machine gun pointed at the user's foot.
What they really do is they glaze the hell out of less technical people and hack their reward system to make them dangerously overconfident about what they can do with an LLM.
gearcontrol@reddit
That is true, and it's not just a split between coders and non-coders. There are non-coders with other IT skills who have hacked together prototypes and deployed software into production. I've worked for colocation and hosting companies over the years, long before AI, and seen this with my own eyes and helped them support it.
They used many of the same sites that AI scraped data from like Stack Exchange, Stack Overflow, forums, CMS, etc, to hack together sites and products. Then they'd hire staff and coders if it took off to maintain it or learn more skills as needed to do it themselves.
Many of these folks (engineers, security experts, designers) know the process and requirements for putting software into production, and many of them are the ones doing the checks and final deployments at major companies.
AI has absolutely dropped the barrier of entry for these folks without question.
Fear_ltself@reddit
Leaning towards Idiocracy. I played Roller Coaster Tycoon as a kid; that doesn't mean I can open a billion-dollar theme park. Just saying.
Kodix@reddit
Looking at how it's going currently, further stratification. Properly raised kids/kids from well-off families will use these tools to achieve and learn much more than previously possible. Other kids will drown in slop and cheap, individually-curated dopamine drips, even more so than they already do.
Medium_Chemist_4032@reddit
Tools and possibilities are one side of the equation. Motivation and challenge are the other. There are various ways to set it up for the best results overall.
motorhead84@reddit
The future mantra of vibe coding will be "make it cheaper to run, then make it cheaper to run again."
Long_comment_san@reddit
Children are inherently motivated. It's the world of adults that makes our lives miserable and motivationless lmao
Cute_Obligation2944@reddit
Children CAN be motivated, but if they can just scroll YouTube and Instagram while the computer does their homework... we're going to have a huge problem in the next couple decades.
kiwibonga@reddit
I dunno. My iPad baby started reading fluently around 3-4 years old. Her favorite YouTube show is called Numberblocks; she knows what squares and roots are. She's turning 7 and she's not engaged at all at school. The only value of school is to socialize.
It's going to suck if they can't keep the electricity on but it's going to be an unbearably more intelligent world than we got to experience.
Cute_Obligation2944@reddit
Cool story but it sounds like your kid is exceptional.
FrogsJumpFromPussy@reddit
All small kids are, honestly.
FrogsJumpFromPussy@reddit
"The only value of school is to socialize."
The value of school is to learn to think too. That's why we have trained professionals in schools and not ordinary Joes to keep them entertained.
"started reading fluently around 3-4 years old."
When we sent our first to school, they knew how to read and write at five and a half. Guess what, all the other kids in their class did as well. None properly, though. The teacher had to teach them to read and write properly, which she said is a nightmare to do.
dellis87@reddit
Same. 2nd grade and he’d rather learn on Duolingo than play a video game.
woswoissdenniii@reddit
We need to patch this short circuit in our brains. We need to free ourselves first and our kids right after. "Social media" is the first and biggest global societal disruptor. Either we find a way to overlap our bubbles or we will dehumanize to a degree that we can't control.
Cute_Obligation2944@reddit
Well, machine learning needs to be regulated just like anything. You can't have FB or TikTok pushing dopamine like oxycontin.
BlueSwordM@reddit
They CAN be, but because of various commercial incentives to get children/teens hooked, a relatively high number of them don't have great technical abilities whatsoever regarding general tool usage and problem solving.
Even the ones using online LLMs to learn don't actually know how to take advantage of their tools competently.
some1else42@reddit
Meanwhile my 13-year-old wants nothing to do with these advanced AI tools. He just wants to "use his brain". I've tried all manner of attempts but he just pulls away. Big sigh.
falcongsr@reddit
My kid is the same age and thinks AI is evil because it "steals art" from real artists.
I'm like OK but you have to understand it's a tool and you need to know how to use these tools.
Nothing.
SquareWheel@reddit
That seems fine to me. Better they learn the fundamentals than get in the habit of offloading their thinking process to an AI. They're powerful tools, but sometimes doing things the hard way is necessary to build intuition, too.
my_name_isnt_clever@reddit
Sounds a lot better than your child falling into AI psychosis.
Thebandroid@reddit
The issue is every techbro and VC idiot is lining up to sell the convenience of not really having to learn how to do anything, for a modest monthly sum.
Sure, those who are motivated will continue to study the old ways, or at least push this tech to its limits, but if we look at how quickly searching the internet has been replaced with "I asked AI", I'd say those people will be in the minority.
finevelyn@reddit
A lifetime of being limited by what the AI can do for you, and never learning to surpass its capabilities.
rkoy1234@reddit
The same was said for books, TV, then the internet.
It'll reward certain kinds of motivation in some while exacerbating certain kinds of laziness in others, as did every technological convenience that came before.
The only "this time it's different" aspect is the fact that it might eliminate almost all professions to start with, but at that point we've got bigger stuff to worry about.
draconic_tongue@reddit
pessimistic libtärdism on tech subs doesn't belong
Zc5Gwu@reddit
He’s somewhat right though. AI is like a bicycle for the mind. Your brain just doesn’t have to work as hard anymore.
NeinJuanJuan@reddit
If you ask "Who is fitter, runners or cyclists?" it could start an endless debate.
But if you ask "Who can go further, runners or cyclists?" the answer is definitive.
draconic_tongue@reddit
He's not right, like at all. Unless you want to say that access to the internet has made you stupider than a white-collar worker from the 60s
FrogsJumpFromPussy@reddit
Arrogant, 100 karma, stupid, so a troll.
finevelyn@reddit
Does internet access allow you to skip 10 years of computer science to build software? No. Your analogy is bad.
draconic_tongue@reddit
You don't need that shit, also has nothing to do with what's been said
Emotional_Chard_8005@reddit
Way to miss the point. You can vibecode something that maybe works without a degree now. Cool. This isn't about that. This is about eventually starting to lack people who know how it all works at the lower levels, which is necessary for continued advancement.
finevelyn@reddit
Saying that you should learn, and that it's worth learning yourself, is in fact the opposite of gatekeeping.
EuphoricPenguin22@reddit
I think we're trivializing what access to PCs and the Internet did when it was new. It allowed people from across the world to directly communicate for the first time. It allowed information that was previously locked up in books limited to physical locations like libraries to live in a central place where everyone could access it at any time. I could easily see someone critiquing this ease of access as "ruining" people's ability to search because it made finding information much easier. In some ways, it did. The Internet is infamous for containing unreliable information that people tend to not question as much as they should. Did that mean it's a worthless technology in the goal of democratizing access to information? Of course not.

Likewise, AI makes it easier to write functional software, but especially with these local tools, you really need to know what it is you're trying to do to produce decent software. I learned how to program before AI was ever an option, so I try to critique bad patterns when I see them, but even more important than that is having a proper "forest from the trees" view of the project to ensure the way you're expanding it is architecturally sensible. All of that is not trivial and AI does not replace it, especially with models at this size.
necile@reddit
Did the wheel allow you to skip 10 years of computer science to build software?
TinyZoro@reddit
That's not been my experience, though I totally get the theoretical possibility. I actually think the power of creativity it gives you inspires you to go down many intellectual rabbit holes. It does reduce the need for a certain type of mental effort, so it's not without concerns, but there's still lots of brain engagement in creating something half interesting with AI.
draconic_tongue@reddit
pessimistic libtardism on tech subs doesn't belong
Mr-Potato-Head99@reddit
It's dangerous in my opinion to have access to such tools without understanding what the tool is doing. It's like flying a Jumbo Jet on autopilot without knowing how to fly a Jumbo Jet.
Cute_Obligation2944@reddit
Jumbo jet might be a bit dramatic.
FluentFreddy@reddit
A golf sized toy helicopter?
Cute_Obligation2944@reddit
Probably more like a gun. Could be used to hurt people, but more likely you or your own family by accident.
DarkArtsMastery@reddit
No thinking required. The age of thinking is over
Long_War8748@reddit
Quite pessimistic and horrifying view.
handsomebrielarson@reddit
Made me remember that Claude has recently become the 'Official Thinking Partner' of Williams F1 Team.
NarutoDragon732@reddit
People said this when graphing calculators went mainstream. We ended up never using them in school; they were only allowed in the hardest classes, which required far more than any kid could accomplish before them.
I expect AI to be the same story but on a grander scale. Those who seek education will receive it, no matter how much their tools trivialize it.
my_name_isnt_clever@reddit
I'm hoping in the long term it will adapt higher education into something that can be done by anyone with the drive, not just anyone who can pay thousands of dollars and 4+ years to get a piece of paper.
DarkArtsMastery@reddit
I actually agree. Curiosity is not going away anytime soon. In the end, it is a tool, and all that matters is how you actually use that tool. I like the way it explains code to me; it makes sense so far, and my skills and understanding have improved. Yes, I am spending extra time doing my part, but it is good for my brain and it helps to know how things work under the hood :) The actual hard part is verifying most things and concepts, at least through quick experiments.
balder1993@reddit
LLMs allowed me to transition from iOS programming to web quite easily. I just started building a project with an LLM as an assistant. Every time I wanted to do something I’d ask how I should do it. When I didn’t understand what code it gave me, I’d keep asking why this way, why not that way, how does this thing work etc.
It’s like an accelerated way to learn, because in the end I learned something by “demonstration” (actually implementing it and seeing it work). As I got more experience with it, I needed the AI less and less to do the basic stuff.
draconic_tongue@reddit
your mistake is thinking it was ever a thing
Kandiak@reddit
And yours is thinking it wasn’t
jeffwadsworth@reddit
Brain? Brain?? What is this Brain?!
moonrust-app@reddit
Brain is a biological version of a CPU and RAM combination, which outdated species like humans and monkeys use. A Mac mini with 32GB unified RAM beats it easily.
jeffwadsworth@reddit
Star Trek “Spock’s Brain”
-dysangel-@reddit
Did you mean utopia? All cities are metropolises.
social_tech_10@reddit
https://en.wikipedia.org/wiki/Metropolis_(1927_film)
-dysangel-@reddit
He said "a metropolis", lower case. If he'd said "a Metropolis or an Idiocracy" then I'd assume that's more what he meant..
social_tech_10@reddit
I can't quite tell if you're being defensive and making excuses because you didn't even recognize the Metropolis movie reference (despite the obvious movie-reference context of the sentence), or if you're so hung up on correct capitalization, for whatever reason, that you would deliberately pretend to miss the reference just to be argumentative and act all "holier than thou", like the character Sheldon in The Big Bang Theory, always attempting to demonstrate that he is much more intelligent than everybody else and often missing the whole point in the process. I think the Sheldon character is written to show a person who is highly intelligent and also somewhat handicapped by being somewhere on the autism spectrum. Does that description fit you as well?
-dysangel-@reddit
Oof. Sure you're not projecting right now after being triggered into a wall of text? :D
darktraveco@reddit
Think about how media outlets and social networks will engineer the shit out of their platforms to make sure kids are glued to the screens instead of building cool stuff.
No-Marionberry-772@reddit
What stack are you using for software? I'd love to get a proper local setup going but I've had trouble figuring out what I should actually be using.
Local-Cardiologist-5@reddit (OP)
I'm using llama.cpp for the server,
OpenCode for the coding, just using the build agent.
I have 64 GB RAM, an RTX 4090, and my model is
the Q6 variant.
My llama.cpp parameters and llama-server config are in the screenshots attached to the post.
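Roughly, the shape of the launch is something like this (the filename, port and context size here are placeholders rather than my exact values):

```
# placeholder filename/values -- swap in your own model path and context size.
# --cpu-moe keeps the MoE expert weights in system RAM (this flag comes up again
# further down the thread), --jinja uses the model's bundled chat template.
llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -c 65536 -ngl 99 --cpu-moe --jinja --port 8080
```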
bnm777@reddit
This may be of interest:
https://sleepingrobots.com/dreams/stop-using-ollama/
Borkato@reddit
Wow this was extremely informative, wtf ollama
Pyros-SD-Models@reddit
Literally every model being discussed here "stole" shit to train on, so I find it somewhat amusing that people are all up in arms about Ollama basically using open source as it is designed. You can argue about morality, but it's a very simple question: are they violating any licenses they are supposed to adhere to? No? End of story.
llama.cpp chose its license with full awareness of what people would do with the software and the code, and if they wanted people to behave a certain way they should have written it into their fcking license.
JamesEvoAI@reddit
I'm the author of the article.
Except they're not using it as designed. They're taking the open source project and breaking it in a way that delivers worse performance, in addition to adding complexity overhead. This isn't a good-faith business built on FOSS, this is being a rent-seeking parasite. I'm not opposed to building a business on top of FOSS; I am opposed to your business making the free alternative meaningfully worse and damaging the sentiment of the underlying FOSS ecosystem.
And yes, they are violating a license. Did you actually read the article or did you just go straight to writing an angry comment?
It's explicitly written in the MIT license that llama.cpp uses that you need to include a copy of said license with any distribution of the software. Ollama is deliberately violating the license terms to prevent their users from finding their FOSS foundation that offers a better experience.
FusionX@reddit
I found this ironic. The article was AI generated, along with this reply. And then I noticed your name.. /u/JamesEvoAI.
The internet is dead.
ArtfulGenie69@reddit
I think people balk at Ollama because the idiots regularly pretend that it is all their own work, and they have an ass system instead of just using GGUF like a normal person would. They have tried to make their own personal garden out of all our shared equipment and they don't give credit. Mainly they just suck ass because of the Go templates (who the fuck thought that was a good idea, why reinvent the fucking wheel when you already have Jinja). They're just annoying dumb bastards who are easily replaced with llama-swappo.
The_frozen_one@reddit
They use gguf, they just use sha256 filenames to dedupe/deconflict identical files. It's very similar to how container software works. You can load them directly with llama.cpp.
Wasn't jinja added to llama.cpp 4 months ago?
JamesEvoAI@reddit
This reads like a straw-man justification for a problem that nobody has. Even if I did have the issue of multiple copies of the same model floating around (why would I, though?), the obvious solution to that problem is not to lock myself into a third-party tool because the filenames are now obfuscated.
This makes sense in containers, where two completely different containers may share intermediary layers. A human isn't directly using those artifacts, so hashed filenames are the obvious choice. This makes zero sense in the context of a GGUF.
On the Jinja question: this is correct, and OP's argument there was invalid. The Ollama Go syntax predates llama.cpp's use of Jinja by a few years. That said, the Jinja syntax is more accessible and has become the de facto standard.
The_frozen_one@reddit
But there's no obfuscation, it's just a system you aren't used to.
Being able to quickly validate that the file contents are valid by checking that the filename matches the hash is really useful, especially if you can automatically delete and redownload the invalid files. You can trivially write a script to tell you whether each and every one of my huggingface or ollama blob files is valid without knowing anything about them.
openssl sha256 FILENAME - does the hash match the filename? If so it's valid; you don't need to understand anything about the underlying data or format. And yes, huggingface's hf CLI tool does the same thing. It's such a robust and unremarkable way to deal with large file sets that huggingface uses a nearly identical system (look under ~/.cache/huggingface/hub, everything under model-*/blob is a bunch of hash-based filenames where the actual data is stored).
If you don't have a lot of models to manage, there's little reason to have a system manage them. That's perfectly understandable, use what works for you. But it's not obfuscation to store a file by its hash. If Ollama were using a secret hash function or entangling the hash with a secret, non-public value, sure, that'd be problematic, but it's just standard sha256 that anyone can compute.
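To make it concrete, the check is a couple of lines, assuming ollama's default blob directory of ~/.ollama/models/blobs:

```
# each blob is named sha256-<hash of its contents>; recompute and compare
cd ~/.ollama/models/blobs
for f in sha256-*; do
  echo "${f#sha256-}  $f" | sha256sum -c -
done
```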
JamesEvoAI@reddit
I've been building with docker for years.
Again, when is this a real world problem that people doing local inference are having? I download the GGUF, I test the GGUF, the GGUF goes in my model storage folder. If it's not working I download it again.
You're ignoring the point to continue rationalizing a problem that nobody actually has. When I say obfuscation I mean obscuring the information I care about (what model it is, what quant it is) behind a dependency when that information could have been in the filename.
In what world are you living where you're having to validate the integrity of your GGUFs beyond the initial download? I have 50+ models downloaded right now and all it takes is an ls to know exactly what models are there. I can easily load them up with any other inference tool because they're just files whose names reflect their contents. If I need to know whether I have a specific model I can just run find against my model folder.
Why are you complicating things to solve a non-issue?
The_frozen_one@reddit
Nobody uses Docker, everyone just uses namespaces + cgroups + chroot (or jails). There is no reason to be locked into a system that uses immutable layers for building containers, it's just convenience-ware that solves a problem nobody has.
/s if it isn't obvious (Docker is great)
I make an HTTP call to 4 systems, they download the same model, I run some tests against the model using a standard request. When I'm done, I make an HTTP call to delete the model from all systems. Every call to each system is agnostic and identical; only the target IP/hostname changes.
OR
I download the GGUF, scp it to each system, then ssh (or RDP if ssh isn't available) to each system and launch llama-server pointed at the GGUF. I build llama.cpp or download the latest release. I use tmux or screen or RDP to keep the process active, monitoring and restarting llama-server as required until I'm done, then manually delete the file from each system. Each step of the process requires knowing a bit about Windows or Linux or macOS or *BSD.
What problem does nobody have? Wanting custom options for the same underlying model available on demand? I think you're over-representing your use case. There's room in this community for people who will never learn what a gguf or safetensors file is.
I run ollama ls then ollama show MODEL and it shows me the context length, quant, etc. It's standardized and easy to read. I type ollama pull MODEL to download a model with reasonable defaults, or ollama pull MODEL:quant to get a specific quant; it deletes when I type ollama rm MODEL. I can create a Modelfile with specific context lengths or system messages by typing ollama create specialmodel. It uses the model if I have it or downloads it if I don't. Or I can use a custom file I provide it. The syntax of a Modelfile is similar to a Dockerfile (it even starts with FROM, which takes a model name instead of an image name).
If you are really itching to use the files from Ollama, it's not hard to do so. The walls aren't high enough to matter, just like how you using Docker is fine despite the fact that more open / less commercial alternatives exist.
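For anyone who hasn't seen one, the Modelfile flow looks roughly like this (the model tag, context length and system prompt are just examples):

```
cat > Modelfile <<'EOF'
# whatever model tag you actually have pulled
FROM qwen3.6:35b
# bake in a larger context window
PARAMETER num_ctx 65536
SYSTEM "You are a terse coding assistant."
EOF
ollama create specialmodel -f Modelfile
ollama run specialmodel
```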
Ah yes, the "command line is trivial" person for whom file management is obvious. For you and I it might be, but there are people who are wildly more capable in things that you and I will never begin to comprehend who are terrible with computers and who basically have a non-functional mental model for how they work. I want more people using local models, whatever their skill level with computers.
JamesEvoAI@reddit
My brother in Christ, this entire time I have been making my arguments from the perspective of a non-technical user, for whom filenames that reflect the actual content of the file are far more obvious than hashes. This is the dumbest conversation I've had in a while lol. If a normal user downloads a GGUF with Ollama and wants to try using it in literally anything else, they now have to deal with file hashes instead of just using whatever search their OS provides for the name of the model file they know they have somewhere on disk.
You're arguing for a piece of software that is harder for normal people to reason about, has worse performance, and is parasitic to the FOSS ecosystem. The idiot in this conversation is actually me for letting this go on for so long. Have a good one lol
Borkato@reddit
This is such a stupid take.
Nobody is suing them. Llama.cpp isn’t telling them to take it down. They just got rid of their goodwill by being rude so people are saying “let’s stop using them”. Literally nobody said they can’t do it, just that it’s a dick move, so people don’t have to use it if they don’t feel like supporting that kind of behavior.
There’s nothing wrong with this. Your argument sounds like those that complain about free speech being “restricted”, not realizing that people do not have to listen to you, like you, or put up with your speech without consequences for you.
FaceDeer@reddit
Unfortunately the article spends 95% of its time explaining why Ollama sucks, and then there's a paragraph tucked away at the end with "BTW, here's a list of various projects that may or may not accomplish bits of what Ollama accomplishes. Good luck figuring them out."
Looks like to replicate what I use Ollama for the most, I'd want to install both llama-server and llama-swap. Neither of these appears to have a Windows installer, and there are a huge number of fiddly configuration files that it looks like I'll need to figure out once they are installed.
I'm a technical person, I could sort all that out. Or I could just leave Ollama as it is and everything just keeps on working fine as it is now.
Ollama's got the "it just works" part nailed down pretty well and that's a very important feature IMO.
WhoRoger@reddit
Pretty much. And they recommended LM Studio, which isn't FOSS.
Ollama has just the right amount of user friendliness and tinkering friendliness for people to start messing around with AI and understand the basics. Even the API is friendly enough to cobble a client together in an hour and goof around, including model cloning. Most other solutions are a brick wall of "swim or drown".
I was just testing a new variant of a model yesterday, and well, I could either convince the llama.cpp server to take in another model (I still don't know how to do that without restarting it), launch another CLI instance on a new port and be super careful about the parameters, or... I could swap out the filename in the Ollama modelfile and have the model available in 30 seconds with the same settings as the old one. The last option is almost always the fastest, even if it's not the cleanest.
I get the distaste for Ollama, but they really nailed the basics.
JamesEvoAI@reddit
Article author here, I give other recommendations that are FOSS. LM Studio is the first choice in the article because it is the best at filling the needs of what people expect from Ollama, while also giving proper attribution back to the ecosystem.
I am a FOSS advocate but that doesn't mean I'm 100% against you trying to build a business by offering convenience on top of it. My issue is when your profit incentives become parasitic to the ecosystem that made those profits possible.
WhoRoger@reddit
You should still make it clear that it's not even open source, if you're criticising another app for releasing a closed-source GUI.
If we want to move from apps that aren't totally legit foss, then going towards non-foss is the opposite of what we want.
Personally, I was really shocked when I found out LMS isn't open. So many people recommend it, I thought I was missing something because nobody even bothers to mention it. Considering this community is largely Linux/FOSS people, I'm thinking it's at least in part because of a lack of good, commonly available alternatives.
If the choice for inference is between one closed source app and a trillion hobby Python single-use projects, that's not really healthy, and is exactly what kept Linux back for so long. Now we're doing the same thing with the LLM ecosystem.
JamesEvoAI@reddit
That's a fair criticism, I'll add a note to each option.
WhoRoger@reddit
👍
ZootAllures9111@reddit
LM Studio does everything you want and more in terms of user friendliness, while being able to directly download any GGUF you want from Hugging Face, and it's a lot faster than default Ollama.
FaceDeer@reddit
It doesn't, actually. LM Studio is a GUI first and foremost. What I want out of Ollama is to have it just sit quietly in the background until one of my scripts calls an API, at which point it loads the LLM, serves the call, and then eventually unloads the LLM again if there are no further calls.
Checking out LM Studio to see what's changed recently, I see they've added a "headless" version without a GUI. But it still doesn't do the dynamic load and unload stuff. That's why I identified llama-swap as a necessary part of what I'd need to install alongside llama-server.
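For reference, as far as I can tell the llama-swap side would look something like this (model path, name and the ttl value are guesses on my part, not a tested setup):

```
cat > llama-swap.yaml <<'EOF'
models:
  "qwen3.6-coder":
    # llama-swap substitutes ${PORT} and starts/stops this process on demand
    cmd: llama-server --port ${PORT} -m /models/Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99
    ttl: 300   # unload after 5 idle minutes
EOF
llama-swap --config llama-swap.yaml
```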
msaraiva@reddit
It actually does that. I use it constantly. The server exposes OpenAI, Anthropic, and LM Studio-style APIs. You can list models and load/unload dynamically.
FaceDeer@reddit
Alright, I'll check it out in more detail. It's been a long time since I explored it previously.
Worth noting that the "closed-source application" objection to Ollama applies to LM Studio too, though.
Evening_Ad6637@reddit
Only to the UI.
But if you worry about it not being fully open source (which I totally agree with), just FYI, llama.cpp aka llama-server does support autoloading models too now.
Other solutions are:
- llama-swap (more features than llama-server; supports any endpoint)
- llamafile (much more convenient, only one file: model, llama.cpp server, configs etc. - everything in one executable file. Downside: only one model per llamafile)
FaceDeer@reddit
I already mentioned llama-server and llama-swap back at the beginning of this subthread as the way to do this; the problem is that it's a complex setup to accomplish something I've already got working fine using Ollama.
I'm rather surprised at the amount of downvotes I've been getting discussing this. I guess saying anything positive about Ollama is very unpopular in these parts?
Anyway, haven't heard of llamafile before. Does it do the "automatically load model into memory when actually queried, unload again after timeout" thing? I took a quick look at the documentation and didn't see a reference to features like that, the impression I get is that the model is in memory and ready to go for as long as llamafile is running.
426upgradrequired@reddit
LM Studio has auto load and unload. It might be a newer feature.
https://lmstudio.ai/docs/developer/core/ttl-and-auto-evict
PollinosisQc@reddit
I did exactly that with a small Python server and llama-cpp for my home setup. The Python server takes the requests, creates a llama-cpp server subprocess, and when the request is done being served, the subprocess is killed and the RAM is reclaimed (well, actually it keeps the model loaded for a few minutes in case more requests come in, so it doesn't have to do cold starts with every request).
gwillen@reddit
If you're on Windows I think you definitely have more limited options. You might consider using WSL2, although I haven't personally tested any of this stuff with it, so I can't say you won't run into issues there. It's possible that ollama is still a good choice for Windows users.
ZootAllures9111@reddit
LMStudio is six billion times better than Ollama in every way though, it's the best choice on Windows by a ton
bnm777@reddit
Windows... ewwww... what are you doing to yourself, matey?
At the very least, LLMs run with less overhead on Linux.
Move away from the Dark Side, Jedi.
Borkato@reddit
Except for the fact that it's closed source. Also, llama.cpp now has its own web UI, or whatever it's called.
ZootAllures9111@reddit
Most of the backend is open source; it's just the Electron UI that's closed source. In no way does that make Ollama more worth using, regardless. Also, the llama-server built-in webui sucks, it has like zero features; it can't even switch models from within the UI.
FaceDeer@reddit
Both llama-server and llama-swap have Windows installs via Winget (though the swap one is noted as being "unofficial"), so basic support likely isn't a problem here. It's more a matter of foreseeing all the time I'll be spending tinkering with configurations and other fiddly details so that, at the end of all that work, I have something that works the same as the system that I already have installed and is functioning with minimal effort. That's where the "ugh, maybe I'll do that later" barrier keeps coming up.
tgreenhaw@reddit
Ollama doesn't support TurboQuant yet. That will be a huge game changer because we can use larger models with a usable context window. Right now, llama.cpp is in another league.
lack_of_reserves@reddit
Not at all. llama-server supports model switching now; to enable it, don't provide a model when starting it up.
FaceDeer@reddit
Alright, that eliminates half of the work. The other half is still there. I'll take another look later today.
ArtfulGenie69@reddit
Dude, you're knocking the best software because you don't know how to use GitHub and you're still on Windows instead of Linux hehe. Ollama never had the "just works" thing going: Go templating is trash and regularly fucked up thinking. Not sure about now, but Go templates are sub-par and forced instead of the normal Jinja; you'll see it still has a bunch of added bugs because of the shitty Go templating, and you will also be extremely annoyed by how they handle their models. All they offer is an API that is pretty easy to program for, and so a bunch of noobs made their first program around it.

Llama-swappo is the nice llama-swap offshoot that pretends to be an Ollama so you don't need Ollama anymore. Llama-swap is way, way better. You do the config once and just copy the parts around in the yaml file. It's easy enough that I figured it out. It makes everything way more modular than Ollama. I have full control because it's just setting up llama-server commands. Then you can rebuild llama-server whenever and it doesn't fuck up llama-swap; they have upgrades all the time. Llama-swap can even help with other inference engines like vLLM, so you can make everything reachable in one place. It's better for this kind of stuff, just like Linux is way better for this than Windows. Check out Linux Mint Cinnamon, it's my favorite flavor.
FaceDeer@reddit
As I said, I do know how to use GitHub. I am entirely capable of setting all this stuff up, but it's going to take a bunch of work to do so. It's that extra work that I'm pointing out is an actual problem.
I mean, you're literally suggesting that I should install Linux as one of the steps here? That's not making things easier.
You are telling me that my own personal experience, that I personally experienced, didn't actually happen?
This one? Its installation section reads, in its entirety:
This is going in the opposite direction from what I'm suggesting is needed here.
randylush@reddit
Not only is that article guilty of it, but there are also infinite Reddit comments all just saying “don’t use Ollama!”
Local-Cardiologist-5@reddit (OP)
That's why I always strongly advocate for using llama.cpp directly.
chimph@reddit
yet called it olllama..
PinkySwearNotABot@reddit
Did you install any skills for OpenCode? I'm curious because I'm seeing that To-Do List in the right panel, which I don't think I've ever seen myself when using OpenCode.
rumblemcskurmish@reddit
I'm kind of intrigued why you'd use a 6-bit model on a 4090. I have an identical setup (7950 CPU, 64GB DDR5, RTX 4090) but I'm using the 4-bit quant to fit the whole model in VRAM.
You're clearly more advanced than me, so I'm just wondering what I'm missing here.
alphapussycat@reddit
Imo q4 has noticeable loss, q5 is a step up, but q6 is the sweet spot. I'd say only do q4 if you're really starved for VRAM.
rumblemcskurmish@reddit
Yeah, I'm AI poor cause I "only" have a 4090. So I can't really do anything higher than 4bit. One day I'd love to step up to a 5090 or something with more VRAM but I'm stuck at 24GB for now.
smuckola@reddit
Ollama has 8-bit quantization of the context window (50% compression, virtually lossless) for free with an environment variable, FYI.
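If I'm remembering it right, it's roughly this before starting the server (flash attention being required for KV-cache quantization is my understanding, so double-check it):

```
export OLLAMA_FLASH_ATTENTION=1   # KV cache quantization needs flash attention enabled
export OLLAMA_KV_CACHE_TYPE=q8_0  # 8-bit K/V cache instead of the default f16
ollama serve
```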
rumblemcskurmish@reddit
Wha?!?! You're telling me if I defect from LMStudio to Ollama, I get a huge context window for free?! Or am I too dim to understand what you're talking about?
alphapussycat@reddit
LM Studio has KV quantization too.
smuckola@reddit
LM Studio is also based on llama.cpp so you can enable it now, directly in the user interface (according to Gemini):
On the right-side panel, expand the Advanced Configuration or Hardware settings before loading a model.
Look for the K Cache Quantization and V Cache Quantization settings.
Set them to 8-bit (labeled as q8_0).
If you use the LM Studio API or configuration files, you can enable it by setting the llamaKCacheQuantizationType and llamaVCacheQuantizationType parameters to q8_0 (https://lmstudio.ai/docs/typescript/api-reference/llm-load-model-config).
Pretty soon there are plans to merge the community implementation of Google's TurboQuant, which gives 600% compression, virtually lossless, of every context window for every LLM. That already works on Ollama for at least 300%, last I knew.
rumblemcskurmish@reddit
Just enabled the options you mentioned (labeled as "experimental", so I'd never touched them). Freed up tons of VRAM and allowed me to take the context window up to 120K instead of 70K.
Excellent advice!
smuckola@reddit
I wonder why it's labeled as "experimental" unless that just means "not default". For reference of anybody interested in current stable KV cache compression that we already secretly have, it's been around since 2024!
https://github.com/ollama/ollama/pull/6279
rumblemcskurmish@reddit
Thank you random genius! Srsly, I'm a bit over my head on some of these esoteric settings. I'm running the Q4_NL (Unsloth) build with a 70K context window and it flies on a 4090. But if I can get more context I'll take it!
smuckola@reddit
yaaaaay feeeed off of my suffering!
I just learned this late last night just before bed and didn't even try it yet! lol I enabled it but didn't check.
I enabled OLLAMA_KV_CACHE_TYPE=q8_0 and restarted, and everything still works but I didn't measure it yet. Gemini insists that it's perfectly stable and indistinguishable, and should be enabled by default but the purists and researchers don't want it yet I guess ;)
I JUST started really testing openclaw for the first time, during this week of the Gemini outage! So that forced me back to my 6-core i7 CPU with Qwen 2.5-coder 1.5B!
Ok but don't cry for me, Argentina, because this just hurls me back toward learning runpod, hopefully for a big fat qwen 3.5. Let the de-googling begin!
BlueSwordM@reddit
Do note that it isn't lossless, especially on long context tasks.
rumblemcskurmish@reddit
Yes, Gemini says it isn't lossless but that it really only breaks down on long context tasks (as you noted), which is where the model starts to break down anyway, so it's totally worth it.
tvmaly@reddit
I only have a 2070 with 8GB but 64GB of ram. Is it possible to run this?
rumblemcskurmish@reddit
Look, even with lowish RAM you can, theoretically, use swap memory, which is your HDD/SSD acting like RAM, and, sure, it will run.
Will it behave like an LLM? If you're fine with one word hitting the screen every second or two, yeah, it runs.
I'm trying to load mine 100% in VRAM because I want Openclaw to respond nearly instantly to requests on Discord.
The fact is we do have some models which are PRETTY GOOD at chat and will run on very modest hardware (take a look at Gemma4 9B, etc. - they are CRAZY good for the size), but this model is really only for someone who wants agentic workflows (tool use), and there the stakes are simply much higher.
For instance, my Openclaw bot has corrupted his own config a few times by not understanding the formatting of a particular file. He's deleted folders by not understanding that a seemingly straightforward Linux command (rsync -rf) can delete files if you call it the wrong way, even though I told him to NEVER delete anything.
This space is changing very fast and it's a really ugly space right now. I mean it's beautiful when you consider the potential but boy is it kind of ugly watching the sausage get made.
nlegger@reddit
Swapping like that will kill your SSD/NVMe's endurance and lifespan; I wouldn't do this often. But I mean, 1TB of NVMe is under 100 bucks 😂
rumblemcskurmish@reddit
Yeah, I didn't say I recommended it. If the question is whether it will run or not, sure it will run, but it isn't a good idea.
Puzzleheaded_Base302@reddit
Openclaw is terrible, but today I found out that if I delete most of the agent.md file content, openclaw becomes smarter.
Also, try Hermes Agent, it is a great step up: more polished, fewer bugs, doesn't feel like AI slop. (Openclaw is a giant pile of AI slop, so many things not tested, too many bugs here and there.)
Shouldhaveknown2015@reddit
You should never go by just the Q number; it's meaningless when it comes to the quality of the responses. Just look at the model and the quantization KLD, then pick the one with the best KLD you can load with the context size you need (or the max, if you can fit it).
KLD is basically how far off a quant is from the full model, and a model might have nearly the same KLD from the Q3 to the Q6 version, depending. But almost always Q6 and better have nearly the same KLD.
Looking for that dropoff between different versions and sizes of the model will help you fit the most you can in and lose the least amount of quality.
At least that's my understanding of it.
rumblemcskurmish@reddit
The Q number is not meaningless, because it is the most important factor in the trade-off between accuracy and performance. I can run Q8 at 2 tokens a second or Q4 at 200 t/s. The Q4 is very close to the accuracy of Q8, for my purposes at least, but it actually fits in VRAM.
Yes, the Q6 is definitely better, but I would have to run a worse model to get acceptable speed. I'd prefer to run a better model and lose some accuracy on some tasks than a poorer model that's more accurate.
Local-Cardiologist-5@reddit (OP)
To be honest with you, I'm just plugging in whatever works. I'm even downloading the Q8 variant to see how much better or worse it is. We are all learning in this space; everything I know is from this thread.
nlegger@reddit
When you go over 80% of the context window, sometimes the chat gets less accurate. Use Karpathy's wiki?
xeeff@reddit
Honestly, I wouldn't go higher than Q6. Q5 is good if you need extra VRAM, with little real difference (which is why most people settle for Q4).
carrotsquawk@reddit
It was established somewhere that q4 m is the sweet spot. Not really worth going higher.
GrungeWerX@reddit
Q5 is the sweet spot for 27b, noticeably smarter than q4
wen_mars@reddit
https://x.com/UnslothAI/status/2045167861942063428
Separate-Forever-447@reddit
when you say "Q8 variant... better or worse", you must be talking about given your system constraints, because it should be nothing but stronger and more capable.
rumblemcskurmish@reddit
There's no doubt that will run but if you watch task manager you'll see it constantly hammering your CPU as the model shifts from CPU/RAM to GPU/VRAM. I mean, it's doable for sure!
AdamDhahabi@reddit
FYI, I'm running Q8 at the exact same token generation speed as Q6, try it, you have the VRAM.
PinkySwearNotABot@reddit
What's the difference between llama.cpp and llama-server?
coder543@reddit
You're leaving speed on the table by using --cpu-moe instead of --n-cpu-moe. Or you could just use "fit".
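Roughly the difference, with a placeholder model path and an example layer count to tune against your own VRAM:

```
# --cpu-moe pushes ALL MoE expert weights to system RAM:
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 --cpu-moe

# --n-cpu-moe only keeps the experts of the first N layers on the CPU,
# so more of the model stays in VRAM; tune N until it stops fitting:
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 --n-cpu-moe 12
```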
nlegger@reddit
27B is slightly higher than 3.6 by less than 1% I think.
AvidCyclist250@reddit
isn't "fit on" the new default now?
coder543@reddit
It is, but if you specify --cpu-moe yourself, as OP did, then you're overriding that decision.
Local-Cardiologist-5@reddit (OP)
Yeah, I was too lazy to switch the model name in OpenCode (so many settings), so I just changed the model name in my llama.cpp server call and called it a day.
Thank you for the n-cpu-moe and fit flag tips, I'll try them now.
Danmoreng@reddit
If you’re using fit, you need to use fit-ctx as well instead of the normal ctx flag. Parameters which work well for me: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
see_spot_ruminate@reddit
Like the other person said, fit works well and is on by default, so you really just need to remove all of your "cpu-moe" flags. One point though: you can eke out a bit more t/s by fussing around with "--fit-target", as the default is 512, but you can push it. Fit target is how much VRAM to leave unoccupied. You need some, but that some may be less than 512.
iamapizza@reddit
Excellent thanks for the pointers here.
nlegger@reddit
Is UD (Ultra Dense)? Googling soon, just wanted to leave a comment for the algorithm lol
ea_man@reddit
Are tools working well between Qwen and OpenCode?
Did you implement some kind of XML -> JSON translation, or modify the prompts in any way?
Potential-Leg-639@reddit
Nothing to do here, everything works out of the box
ea_man@reddit
Yeah, I'm testing it now and it works flawlessly, which is kinda new to me, just like in Qwencode, which is built for the XML tool calls Qwen is trained on.
This is good news. I guess lots of people were pissed when Qwen tool calls failed with JSON agent harnesses.
pepe256@reddit
I mean even Qwen 3.5 27B (with the latest updated Unsloth weights) works flawlessly with Claude Code. That's how I'm using it right now.
ea_man@reddit
Oh, that's what I did too. I was using the 27B dense mostly; now it looks like the 35B A3B is doing better outside of Qwencode.
Yet I'm thinking that the prompts are mostly to blame for some other agent harnesses.
carrotsquawk@reddit
you da real mvp
T3KO@reddit
Do you have a link to your chat template?
Federal_Order4324@reddit
OpenCode is so goated, Roo Code too bloated; since I've switched, tool hallucinations etc. are golden.
sjhatters@reddit
Try pi
Randomdotmath@reddit
Yeah, Roo/Kilo was designed for old models with awful agentic ability; it's too convoluted for modern models.
nicholas_the_furious@reddit
Doesn't OpenCode take your prompt data? It seemed less private than the Kilo extension when using local models.
Still-Wafer1384@reddit
It's open source
TimeRemove@reddit
Do you have any custom tooling / mcp endpoints for the playwright integration?
Local-Cardiologist-5@reddit (OP)
No, it's in OpenCode.
No-Marionberry-772@reddit
thank you!
TheDailySpank@reddit
Not op, but from the screenshot, they're running OpenCode in the top right window.
Great_Guidance_8448@reddit
I could barely get 32k context on my 24 gig of VRAM with the Qwen 3.6... Asked it to refactor some stuff (a Python project) for me - it did some work, claimed it finished, but a bunch of changes were truncated and scripts were left unusable.
I am back on Gemma 4 26 A4B... 64k context and no fails like that (so far!).
DeepBlue96@reddit
Bro, quantize the context... I can easily fit 131k context in my 24GB at q4_0, and I would suggest using the Unsloth MXFP4 model of Qwen3.6.
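Something like this, roughly (the filename is an example, and as far as I know the quantized V cache needs flash attention, which on older builds was just -fa):

```
llama-server -m Qwen3.6-35B-A3B-MXFP4.gguf \
  -c 131072 \
  --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -ngl 99
```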
EbbNorth7735@reddit
Why not use system RAM if you want more context space?
Great_Guidance_8448@reddit
Ooof, that's going to slow things WAY down...
EbbNorth7735@reddit
Really depends
pedronasser_@reddit
Qwen3.6 35B is working wonderfully with 16GB of VRAM.
andrewh2000@reddit
Would you mind briefly explaining your setup? Ollama, LM Studio, etc.? And which exact model?
pedronasser_@reddit
Right now, I am running like this:
andrewh2000@reddit
So that's 80k of context? I'm on an RTX 5060 Ti with 16GB VRAM and 80GB(!) of system RAM, and I'm getting about 80 tokens per second if I let llama-server determine the context using -fit, or about 50 tokens per second if I set it to something quite high like 128k.
pedronasser_@reddit
Yes, I am following Qwen's best practices guide. Also, I don't need more than that, given how I work/harness.
andrewh2000@reddit
I get the impression that working in something like OpenCode, which is what I want to do, the more context the better. So you have to trade off context against token speed: fast but slightly worse results, or slower but more likely to be useful. Hmmm, if only I could justify £5000 on some AI rig.
andrewh2000@reddit
And this looks like a good read.
https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html
Hyrnos@reddit
Can you share your setup?
cviperr33@reddit
INSANE how good this model is... Honestly, I'm blown away again and again. It literally fixed the broken code and projects I had hit a wall on with Gemma for days; it solved them in like 5 minutes and then explained why Gemma failed.
And the best thing about it: it's sooooo fast... 120 tk/s on a 3090 with llama.cpp, and prefill is instant in the 3.8k-5k range.
The moment I send a word, 1 second later I already have a response, with a file edited or something. It is so efficient in these agentic tools and also doesn't hog my GPU like the Gemma models.
squatterbot@reddit
Yeah, and also, which quant are you using?
cviperr33@reddit
Quant is Unsloth IQ4_NL; settings posted under my proof comment.
valtor2@reddit
Why the NL? Why not IQ4_XS?
cviperr33@reddit
NL = non-linear, XS = extra small, meaning more compression, so NL is slightly better.
cviperr33@reddit
Posting some proof :
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------|----------------:|----------------:|--------------:|----------------:|----------------:|----------------:|
| qwen3.gguf | pp4096 | 3510.40 ± 17.51 | | 1028.06 ± 12.23 | 1027.55 ± 12.23 | 1028.10 ± 12.23 |
| qwen3.gguf | tg32 | 124.84 ± 1.70 | 129.02 ± 1.77 | | | |
| qwen3.gguf | pp4096 @ d8192 | 3586.85 ± 21.88 | | 3108.61 ± 16.12 | 3108.10 ± 16.12 | 3108.66 ± 16.14 |
| qwen3.gguf | tg32 @ d8192 | 117.26 ± 0.77 | 121.29 ± 0.78 | | | |
| qwen3.gguf | pp4096 @ d16384 | 3468.56 ± 2.95 | | 5371.04 ± 14.77 | 5370.53 ± 14.77 | 5371.07 ± 14.77 |
| qwen3.gguf | tg32 @ d16384 | 114.60 ± 0.96 | 118.64 ± 1.00 | | | |
cviperr33@reddit
Quant is: Unsloth IQ4_NL + BF16 mmproj.
Running with 2 slots and 200k context; my VRAM usage is 22/24GB.
huzbum@reddit
Thanks, similar to my setup, but I am looking at adding the mmproj
TraditionalCurrent64@reddit
I thought giving it higher temperatures gave way worse results when coding?
cviperr33@reddit
I'm using the Unsloth recommended values; he has 4 profiles, and this one is "thinking general".
Randomdotmath@reddit
The 3/4 GDN design on Qwen 3.5/3.6 is genius for local use; it saves so much VRAM.
cviperr33@reddit
I don't know what kind of black magic they did, but my card has a blown bearing; it whines even at 60-70% load, so at 100% it's like a jet engine going in my room lol. The Gemma 4 models always pushed it to that level, but this one, maybe because it's so fast and completes everything efficiently, consumes next to nothing in energy and the card cools down faster.
rpkarma@reddit
That’s called “race to idle” :)
CreamPitiful4295@reddit
Which Gemma?
cviperr33@reddit
Gemma 4 26B. Don't get me wrong, it's an awesome model; it was the first one that, like, blew me away. But we hit a wall when we were trying to edit drivers in the Linux kernel. With Qwen 3.6 it went pretty easily; he even managed to find a way to bypass the restrictions of his harness (hermes-agent), and I just told him my password and he echoed it with his commands and somehow it worked.
Paradigmind@reddit
Then it began bypassing his harness in order to hack my bank account to buy more GPU's for it so it can run even faster.
cviperr33@reddit
Well yeah 😂 that's how I trick them, actually. I tell it that if we make money off it, we buy GPUs with it so it can upgrade as much as it wants.
Paradigmind@reddit
Wait really? Or did you continue the joke? :D
cviperr33@reddit
Nah, for real man haha. Because these are large language models, when you give them even better incentives, or like encourage them, something weird happens. With Gemma 4 26B I noticed that if I encourage it too much and make it, like, excited, it starts putting its thoughts inside thoughts. It completely messed up my UI because it wasn't designed to handle double thinking lol
Paradigmind@reddit
Lol this is hilarious. I remember early prompting guides from about 2 years ago suggested promising the model thousands of dollars as a reward for a task, so that the replies would be better. So I guess this still works.
Or that one very "cruel" Windsurf system prompt
cviperr33@reddit
AHAHHHAHAHAHAHHAHAHAHAHAHAHAHA NO SHOT this is real !!! Damn i gotta try this :D
Paradigmind@reddit
We are doomed if AGI sees shit like this. xD
r00x@reddit
How are you squeezing it onto your 3090? Mine only runs about ~75% on GPU and fills the VRAM (it is Ollama though).
cviperr33@reddit
Download llama.cpp or LM Studio, use the Unsloth quants, and use the IQ format; imo it's the best one, nearly the same quality as Q5-Q6 but with a size like Q3_K_M, so it's perfect.
Download the 35B A3B IQ4_NL model and the BF16 mmproj, and you are good to go.
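If you want llama.cpp directly instead of LM Studio, the launch looks roughly like this (filenames are examples, use whatever Unsloth actually names the files):

```
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj-BF16.gguf \
  -c 65536 -ngl 99
```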
r00x@reddit
Thank you, I got it going in LM Studio with unsloth/qwen3.6-35b-a3b IQ4_NL and it does squeeze in nicely! It was a bit loopy (channel errors) until I'd changed some params, though (shared in the original comment in case it helps anyone else).
Flaky-Advisor@reddit
Thanks for this. I only have 32GB of RAM. Could you please share some CPU-only config for llama.cpp? Note: I tried bartowski/Qwen_Qwen3.6-35B-A3B-GGUF Q3_K_L and I'm getting 10 t/s. Not greedy, just want to improve this a bit.
cviperr33@reddit
Ohh, that's a completely different story, and you are already on the lowest end at Q3; I dunno what else you can do to improve it.
Maybe wait and see when different people start uploading different quants, because there are specialized hardware quants: people upload MLX, which is optimized for Apple, and Intel has its own too, AMD too. So you just have to find whichever quant of the model is most optimized for your CPU brand.
I have posted my config here in this comment section, right below my post; there is also proof of a llama-benchy run and a screenshot. Configs are right below it.
Mine uses -b and -ub set at 2048 / 1024; those can affect how fast the model is.
The other idea I have for you is to try "speculative decoding". It's super cool tech: basically you load 2 models, 1 big and 1 really small, and the small one just predicts what the next token is gonna be, and if it's right, it speeds up the whole process. With a high acceptance rate you could get up to a 50-90% increase in speed. So definitely research that. Bonsai dropped new models today that are extremely small; maybe they are good for speculative decoding? Who knows, you can try.
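A speculative decoding launch in llama.cpp looks roughly like this (the draft model name is a placeholder; it has to share the main model's vocabulary):

```
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf \
  -md some-tiny-draft-model-Q8_0.gguf \
  --draft-max 16 \
  -ngl 99 -ngld 99
```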
Flaky-Advisor@reddit
Wow. Thanks a lot for the detailed explanation and suggestions. I started following you. I will try Speculative decoding. Never heard of it. 🙏😀
Randomdotmath@reddit
offload some experts to cpu
cviperr33@reddit
no dont do that if u have 3090!
FinBenton@reddit
If you can fit a small draft model in there too, you should get pretty significant speedup for coding too.
cviperr33@reddit
You think? I tried drafting with 26B Gemma and it was actually -30-40% speed. It only worked with the dense 32b model since that one was slow by itself, so I went from 20-30 tk/s to 30-40 tk/s.
I've read that drafting doesn't work as well, or at all, for MoE models. What's your experience, have you tried drafting the 35b Qwen? It could def fit a drafter in there, Bonsai AI just released some mind-blowing small models
FinBenton@reddit
No, I tried it with the dense 31b, haven't tried it with Qwen yet. I can barely fit the full 256k with Q5 so there's literally not even room for a small draft model :D
cviperr33@reddit
the new Bonsai models fit in 200mb-1200mb of vram, maybe they can make it work? :D
uti24@reddit
Are you using single 3090? What Q/context size do you use?
cviperr33@reddit
Single 3090, 200k context, I could push it easily to 260k but that's what I kinda started with and I haven't changed it. I run it with -cn 2 so my agent can spawn an agent and they work together.
uti24@reddit
That suggests it's a full offload to GPU. But what quant allows that with 200k context? Q3?
cviperr33@reddit
Mine allows it, like the IQ4_NL I'm using is set to 200k with -cn split evenly between 2 channels for my agents. I don't think I can fit 260k with -cn 2, but with just -cn 1, which most people use, no problem at all to fit 260k and have room to spare.
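For anyone on a mainline llama.cpp build where a -cn flag isn't available, the analogous knob is --parallel (-np): the server splits the total context evenly across the slots, so two agents each get half. A sketch with placeholder names:
# 200k total context divided over 2 slots (~100k each)
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf -ngl 99 -fa on -c 200000 -np 2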
cviperr33@reddit
Also if I want to go above 260k context I could use the turboquant fork of llama.cpp, which gives me an extra 1-2gb of vram. I think after a few months, when we get different models and quants, I could see it being possible to push 500-700k context on a 3090 with all the fancy rotaryquants and distills.
Spiritual_Piccolo793@reddit
How do you run it? Using docker?
cviperr33@reddit
No, I just host it on my PC, where I work and study lol. It doesn't slow it down at all or anything, I cannot feel it working because it's 100% offloaded to the GPU only.
Spiritual_Piccolo793@reddit
GPU on your machine? What is the config?
cviperr33@reddit
Yes, of course it's on my machine. The config I used you can see in my comment here, I posted proof of speed + model + config.
IrisColt@reddit
Teach me senpai... pretty please?
cviperr33@reddit
:D you need to use better batching, -b and -ub at 2048 / 1024.
But this increases your vram usage and sometimes it could even harm your performance if your memory bus speed is not fast enough, so you have to find the ideal settings yourself :P
I posted a screenshot of proof and also my settings under that comment.
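For reference, those two flags on a llama-server command line look like this (the rest of the line is just a placeholder setup; as said above, bigger batches cost VRAM, so back them off if you run out):
# larger logical/physical batch sizes mainly speed up prompt processing
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf -ngl 99 -fa on -b 2048 -ub 1024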
IrisColt@reddit
Thanks!!!
vr_fanboy@reddit
3090 owner here, can you share llamacpp config?
cviperr33@reddit
settings posted under my proof comment
Local-Cardiologist-5@reddit (OP)
can you tip me on prefills...
joeyhipolito@reddit
tried running local models in agentic loops and the part that always breaks for me is tool call reliability past 60-80k tokens. model starts drifting from the expected format and everything falls apart. curious if you're hitting that on longer sessions or if Qwen3.6 actually holds the format clean.
kant12@reddit
So far, I am extremely impressed. Even on my slow strix halo I'm getting a solid 30 t/s with Qwen3.6-35B-A3B-UD-Q8_K_XL and better responses than I was getting with Qwen3.5 and gemma-4. Let's see if it keeps up.
WhoDidThat97@reddit
On strix, I was using the Q5 (just the first I picked) with opencode. Getting 55t/s which slows to 45t/s by 65k context. Amazing stuff. Response feels the same as I was getting with opencode zen minimax m2.5
kant12@reddit
Damn that's nice.
No-Manufacturer-3315@reddit
How did you get the image to be processed with opencode? Mine is struggling
exodusayman@reddit
I wish my 9070xt could use Qwen 3.6 in opencode, but most of the time the models that I can reliably run are far too dumb for opencode.
EbbNorth7735@reddit
Of course it can. A 9070xt with 33GB of system memory can definitely run this model. Just use llama-server (llama.cpp, grab it from the releases page), then just use --fit, which is the default setting, so you only need to point it at the model, and if you want image support use the mmproj.
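If --fit isn't in your build, a manual starting point could look like this (paths are placeholders; --cpu-moe keeps the expert weights in system RAM so the attention layers and KV cache fit in 16GB of VRAM, at some speed cost):
# MoE experts stay on the CPU, everything else goes to the GPU; --mmproj adds image support
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf --mmproj mmproj-BF16.gguf \
  -ngl 99 --cpu-moe -fa on -c 32768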
-Ellary-@reddit
That is it guys, I've tasked Qwen 3.6 35b a3b to conquer the world for me.
Prepare.
sid351@reddit
You forgot this:
Make no mistakes.
-Ellary-@reddit
Oh shiiii...
Local-Cardiologist-5@reddit (OP)
lmao, pardon my hyperbole, i'm extremely excited that it's actually doing what i ask it
getmevodka@reddit
Nice one
jimmytoan@reddit
Using screenshots from the MCP to self-verify the build is a genuinely interesting capability - it's not just code generation, it's closed-loop testing via vision. The part about it catching its own canvas rendering bug from a screenshot and fixing it is the bit I keep rereading. What MCP server are you using for the screenshot capture?
PotatoQualityOfLife@reddit
What size/quant are you running?
Local-Cardiologist-5@reddit (OP)
im using the but im currently downloading the Q_8 variant im that impressed
PotatoQualityOfLife@reddit
I think you accidentally the word
TheMaestroCleansing@reddit
Please do not the cat
Paradigmind@reddit
What's this fever dream of a comment section?
Awwtifishal@reddit
Have you really been far even as decided to use even go want to do look more like?
MoneyPowerNexis@reddit
I like turtles.
stumblinbear@reddit
Redditors think that even a slight hint of a reference to something means they have to repeat that reference word for word even if it's otherwise 100% irrelevant, or is literally the joke the original comment was making
unculturedperl@reddit
knowyourmeme dot com
tessatrigger@reddit
how is prangent formed?
Late_Film_1901@reddit
Do you think it's going to make a difference? I've only tested Q4 quants but I'm tempted to try heavier ones
Most-Trainer-8876@reddit
Same here, should I? Lol
I got 24GB total vram and 64GB DDR4 ram... Spilling into ram might tank performance a lot
Blues520@reddit
How does Q8 compare?
k0zakinio@reddit
I'm using 2x3090s and found the q8 to be half the speed of the q6. At 120t/s in instruct mode, it's an absolute beast. We've definitely reached a tipping point with local models with this release
Medium_Chemist_4032@reddit
Good question. My vllm bf16 tops out at 17 tps and the unsloth "quants" of BF16 go a lot faster, but fall apart into loops after a few Q&A rounds
abmateen@reddit
On my local setup with V100 32GB using Qwen3.6 4bit giving me around 80 tok/s
SearchTricky7875@reddit
80 tps? are you using vllm or llama.cpp?
abmateen@reddit
Llama.cpp, vLLM is very slow for single user inference cases
Local-Cardiologist-5@reddit (OP)
I'm not sure about vllm, it's probably to do with the flags, but for me I use llama.cpp; I'd need a stronger gpu to get vllm going
LesserofWeevils@reddit
lol my qwen 3.6 has been struggling for three days to write pong in rust I feel like I’m doing something wrong
Healthy-Nebula-3603@reddit
Why are you using those parameters?
--reasoning-budget -1
it is infinite by default, so why are you even using it?
--top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.0 --presence-penalty 1 --temp 0.7 --cpu-moe --chat-template
Those parameters are already taken from the gguf, so there is no reason to set them
--host 0.0.0.0 --port 8084
That is ok if you want to change IP and port as default is http://127.0.0.1:8080
--no-mmap
also ok if you do not want to keep a model copy in RAM. Default is off.
--ctx-checkpoints
THAT IS THE GOOD STUFF - works best with orchestration mode for opencode. It keeps the cache when the model is unloaded / reloaded, without reprocessing everything again.
You can install orchestration for opencode from here:
https://github.com/alvinunreal/oh-my-opencode-slim
So it should look like that.
The cache rotation works great for now (implemented a week ago), so you can use Q8 cache, which is as good as fp16 now, and easily fit 256k context.
So, the final command:
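(In case the screenshot doesn't load, here is a rough sketch assembled from the flags discussed above; treat the exact values as placeholders and drop anything your build doesn't support.)
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf -ngl 99 -fa on \
  -c 262144 -ctk q8_0 -ctv q8_0 --ctx-checkpoints 32 \
  --host 0.0.0.0 --port 8084 --no-mmap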
CurrentNew1039@reddit
it needs "preserve mode" to be on for good use, right?
GoodTip7897@reddit
Ctx-checkpoints can prevent an oom error or just help save memory. It will make it take longer and may have more cache misses but for one user even just 4 context checkpoints are fine because you only need one to restore mamba and kv cache.
Correct me if I am wrong, but I don't see how having fewer SWA/context checkpoints would alter any math and make the model dumber or loop more
Healthy-Nebula-3603@reddit
Why do you think the current default is 32?
--ctx-checkpoints tells llama-server how many context checkpoints it may keep per slot. A checkpoint is a saved snapshot of the model’s SWA-related cache state, created during prompt processing, so the server can resume from a saved point later instead of reprocessing the whole prompt from scratch. That is mainly useful for SWA / hybrid / recurrent-style models where cache reuse can otherwise fall back to full prompt reprocessing.
I think Georgi Gerganov knows what he is doing.
GoodTip7897@reddit
Yeah but I get OOM on Gemma at those sizes because the SWA cache is massive. Even with 32gb of ram, 32 checkpoints fill it up. I only use bf16 kv cache because q8 has a memory leak on AMD ROCm systems, and Vulkan prefill is much slower
Healthy-Nebula-3603@reddit
Is Vulkan prefill slow for you? Strange
I get 1200 t/s using Vulkan for prefill, but I have an rtx 3090. For me Vulkan is faster and takes less vram so I can fit more context with rotation Q8 cache.
GoodTip7897@reddit
For qwen 3.6 I get 2000 +-100 prefill at 32k context on a 7900xtx. On vulkan it's more like 1200 like you have.
The gap is really significant for me because I frequently use it for agentic work where it will read multiple logs and files and needs to prefill huge contexts.
And also I had opus write me custom bf16 mma flash attention kernels so I can use bf16 kv cache without any issues.
Maybe q8 is better after rot, but honestly I've wasted too much time troubleshooting looping tool calls with Qwen 3.5 and the only thing that fixed it was bf16 instead of q8 or f16.
Healthy-Nebula-3603@reddit
Looping problems also hit Gemma 4 26b and the newest Qwen 3.6 35b, and they are not as good at instruction following either. I think something is wrong with MoE mm models or their implementation.. no idea.
Those problems do not exist with dense models like Gemma 4 31b or Qwen 3.5 27b.
Those models never loop and follow instructions much better, but are much slower ...
Actually I prefer dense models mostly because they do their job on the first attempt, especially with book translations. Those MoE models 90% of the time are lost because they don't follow instructions properly in this scenario and loop like crazy ...
GoodTip7897@reddit
I had Qwen 3.5 27b Q5 making repeated variations of the same tool call in three separate instances where it was doing long-context agentic tasks. It would become stuck and spend 10000 tokens trying to read one file. And it still did that with presence penalty up to 2.0.
I switched to bf16 kv cache and have had both Qwen 3.5 27b and the MoE models run for hours, burning through millions of tokens easily and never having a looping issue. Even very coherent at 80000 filled context.
Is there a big enough sample to conclude it's statistically significant? Probably not. But for me it works now and I'd rather not mess with it. I really do suspect that the accumulation produces numbers too big for f16, and thus bf16 or rot q8 are needed for Qwen 3.5
And yes I concur that Moe models are worse. They are faster but a dense model always seems to be smarter because it activates every parameter every time.
Healthy-Nebula-3603@reddit
Strange ... I never had any looping with Q8 rotation cache with dense models.
I have to check your theory about FP16 cache with MoE models. Maybe that will fix the looping.
Also, I noticed the Qwen 3.5 family is very good for coding, but for everything except coding Gemma 4 is better.
Also, a Q8 cache / model is not int8 like many people think. Inside there are still many FP16 weights.
GoodTip7897@reddit
Yeah. I really think (and llama.cpp PRs have finally been coming around to realizing) that if your GPU supports it then bf16 is the better option over fp16 for weights or kv cache. I've seen other people post stuff where q8 mmproj performs better than fp16 and the only thing that makes sense to me is that since q8 weights are int8 * fp16 scaling factor you technically get 127*65535 instead of just 65535 as your max representable value.
It seems that models love to generate massive outliers over accumulation and bf16 is great for that because it has the dynamic range of f32. For quantized formats, rotation seems to help a lot (making q8 kv cache virtually lossless).
I think I'll play around with benchmarks and see if I can't get vulkan running faster because if I can then I can have twice the context. But rocm does seem to be more stable when you push the card to the absolute limit (I frequently leave only 700 MiB empty). I can do that because I'm running it on a headless Ubuntu computer.
Healthy-Nebula-3603@reddit
I have an AMD 7950X3D CPU with an integrated GPU, so that iGPU is the main GPU for the system and my RTX 3090 is a second GPU, so I also have access to the full vram of that card :) Running models, my vram usage is around 23.4 GB, because above that it starts swapping to ram.
Local-Cardiologist-5@reddit (OP)
let me load these up. i literally did nothing but plug the model in and im blown away. im getting so many tips on here, thank you so much for these
Healthy-Nebula-3603@reddit
no problem.
Also you can use many models at once using llama-server. Just put them all in one folder, for instance "models", and use this command
llama-server.exe --ctx-size 260000 --models-dir models --models-preset 1_preset.ini --models-max 1 -ctk q8_0 -ctv q8_0 -fa on -ngl 99
That command uses a folder "models" with a few models inside and loads only one model to vram at a time (--models-max 1); if another model is needed, the first one is unloaded.
--models-preset 1_preset.ini
This ini keeps the models' configuration.
It looks like that for me (I left "reasoning = on" so I have the possibility to switch it off by just changing on to off)
Local-Cardiologist-5@reddit (OP)
you know a lot about these. On the unsloth hugging face, there's an imatrix gguf file, do you know what those are for? can i use them or is it only for quantizing models?
Healthy-Nebula-3603@reddit
You do not need them (the imatrix).
It is only needed to create a gguf with fewer errors after quantization.
Blues520@reddit
Why do you suggest Bartowski and which quant level is good for 48 GB VRAM?
Healthy-Nebula-3603@reddit
As low compression as possible to fit on your 48 GB :) but I suggest never going below q4km ... higher if possible ALWAYS
His checkpoints always work well.
Blues520@reddit
Thanks :)
kwicked@reddit
I'm not op but 0.0.0.0 exposes the llama server to other machines on the network, so you can use it on a laptop in another room if you don't want the heat and fan noise. It's not just changing the ip.
Healthy-Nebula-3603@reddit
that's why I said it is ok.
swingbear@reddit
I don’t normally comment on local model performance but I have also been blown away by 3.6 over the last couple of days. I’m actually running one on each Pro 6000 via llama.cpp and openclaude/opencode.
I sometimes forget I’m hitting a local model it’s that good, and for 30b… crazy times.
TraditionalCurrent64@reddit
I tried this model using Ollama through opencode and it got so confused in plan mode, and it didn't have permission to edit files yet, then it sometimes flat out just failed to do certain tasks; was a bit let down. Maybe it's something up with my setup. To its credit, it made an adventure game for my kids and fixed a whole bunch of weird issues like undefined variables and random slop, after extensive prompting though. Something the bigger models might have one-shotted
Much-Researcher6135@reddit
I downvote shill posts
spawncampinitiated@reddit
It's like living in groundhog day
ShadowBannedAugustus@reddit
Guys, could anyone integrate Qwen3.6 successfully into "Agent mode" in VS Code? I tried with the Continue extension and with Copilot Chat extensions (supports local models), but no luck. Thanks for any tips!
FinBenton@reddit
I host it with llama.cpp and use cline in vscode to run it, works great.
autisticit@reddit
Yes, look for LLM gateway extension.
Local-Cardiologist-5@reddit (OP)
In my humble opinion, I only use llama.cpp and opencode. The various vscode integrations I haven't tested so I wouldn't know.
ignorantpisswalker@reddit
... explain how you are using the MCP server. I am having problems running this setup.
No-Consequence-1779@reddit
I have been testing it today with kilocode as the agent. It provides numbered questions to answer one at a time. The code review is much better. I am also impressed. Token generation is 50% of Qwen3 Coder. I think it is worth it and it may be optimized soon.
nlegger@reddit
I just finished testing a 3.5 9B bf16 fine tune on my test document, the Palo Alto administration guide 11.x (12xx+ pages), and 20 failed versions later I got it working, but something feels off.
I'll redo it using unsloth studio for 3.6 this time and see if it works better. Basically I didn't want RAG, I wanted a 9x-100% accurate response on anything in the pdf lol, maybe I'm being unrealistic.
JohnMason6504@reddit
The hype is nice, but I need to know if this fits on a Cortex-M4 with 256KB SRAM or if it's just another cloud-dependent toy. Until we see the actual memory footprint and power draw, I'm sticking to my local LLaMA quantized to 4-bit.
Lkemb@reddit
I've just set up ollama and opencode, but it seems whatever model I use, when talking to it via opencode it struggles incredibly hard to read local files. Like they all "say" what they want to do but never actually do it, or it fails, or they end early..
Any ideas why this might be happening?
spaceman3000@reddit
It can't give me one sentence in my language without a grammar mistake. It's not doing it. It sucks big time.
ab2377@reddit
i think we should sell everything and buy either a 4090 or a 5090, these times are going down a crazy route.
IrisColt@reddit
Just to set the record straight, my opinion below focuses more on the creative writing and translation side of these models...
Gemma 4 31B is the clear winner here; it's aced my 64K context translation benchmarks by producing English that feels natural, nuanced, and properly localized, even running at Q4_K_M. Qwen 3.6 35B A3B is the first of its class from Qwen to pass my test, though its English ends up sounding a bit more literal. As for Gemma 4 26B A4B and Qwen 3.5 27B, they both flunked. They spiral into repetition and/or broken language, gradually dropping pronouns and connecting words until they're just mechanically spitting out nouns and verbs with no real skill... Er... I didn't expect that Qwen 3.6 would be able to pull it off.
gearcontrol@reddit
What quant was the Gemma 4 31B that aced your 64K benchmark? I also use it for writing but not with long context (typically under 30K) and have been bouncing between Gemma 4 31B and 26B A4B (for the speed). Both Q4_K_M on an RTX 3090 (24GB).
harglblarg@reddit
The 4-bit quant just barely fits in the 32gb RAM/12gb VRAM I have, while leaving enough space to compile. I’ve got it hacking away at a source port for a 3D FPS and it’s slowly but successfully chewing its way through it.
philnm@reddit
thank you for sharing. could you explain the MCP part, where you say "use screenshots from the installed mcp"?
Local-Cardiologist-5@reddit (OP)
in my opencode settings, i have the
Playwright mcp installed, that's what it's using as the browser to test
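For anyone reproducing this: the Playwright MCP server is the standard npx package, and opencode launches it as a local MCP server from its config (the exact config schema may differ by version, so check the opencode MCP docs; this is just a sketch from memory):
# sanity-check the MCP server runs on its own first
npx @playwright/mcp@latest
# then register it in opencode's config (e.g. ~/.config/opencode/opencode.json):
# an "mcp" entry named "playwright" of type "local" with command ["npx", "@playwright/mcp@latest"]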
tuliosarmento@reddit
I just didn't get the "screenshot" part. How are you dealing with images in llama.cpp with this model?
TelevisionVast5819@reddit
That's what the mmproj file is for
AppleBottmBeans@reddit
You can use playwright mcp (or better yet the CLI tool)
Local-Cardiologist-5@reddit (OP)
i need to look into using the cli for everything
GrungeWerX@reddit
Has anyone compared it to Qwen 3.5 27B?
takoulseum@reddit
This is really impressive! Local LLMs are coming a long way. Exciting to see Qwen3.6 performing so well on agentic tasks.
wolfgeo@reddit
What? 3.5 came out like two weeks ago right?
shuwatto@reddit
How the heck can you run opencode with Qwen3.6/3.5?
No matter how I tried, it runs straight into an infinite loop of compaction.
c64z86@reddit
Is anyone else finding that Qwen 3.6 more often than not fails at something and it takes multiple attempts? I find that even though Gemma 4 is lower quality it actually one shots a lot of things.
Local_Phenomenon@reddit
You're excited, I'm excited, My Man!
Xyrus2000@reddit
Even at a 4-bit quant, the 35B A3B model has actually been really solid (I only have a 4080 super). I've been getting 66 t/s with my setup with a 32K context. Enough for small projects and PoCs.
If things weren't at a premium right now, I'd seriously consider investing in a larger VRAM setup.
tarruda@reddit
Hope they release at least 122b of the 3.6 series.
ionizing@reddit
YES, I also am very hopeful for 122B since it is my daily driver and is already a BEAST in my harness with the 3.5 version, which will be my daily on this until something better comes out in this size class. However I am also grateful for the 35B drop because even the 3.5 version of that was already pretty good on 12GB/32GB setups.
What an amazing time to be alive.
Poluact@reddit
Honestly I don't know how you manage to run 35B on 12GB/32GB with decent speed, my experience is it's way too slow for comfort with any decent context window.
ionizing@reddit
Darn, I was so hopeful.... I suspect there are template issues I will have to work out, and hope it is only that, because we are not off to a good start with 3.6-35B. Basically, in my months of testing, if a model behaves like this where it simply stops rather than continuing the next task, it is almost always a jinja issue or internal model issue of some sort. I did notice it was also putting some of its thinking output in the main chat channel and some of its main output in the reasoning channel. OH, I am on yesterday build of llama.cpp, I should update that I suppose in case something is related. Anyhow yeah I need to give it a few weeks but so far it is acting subpar compared to the 3.5-35B version which never made this type of failure in this harness. But yeah there may just be some work to do, either on my side or the model side or llama etc, as is usually the case for these new releases. Still grateful of course! But unless I can work out why it won't continue the agentic loop like its cousins, then it isn't worth much for my flows.
TuxRuffian@reddit
You and me both! Qwen 3.5 122B is still the reigning champ for my workflow.
ionizing@reddit
absolutely agree sir, it has outperformed everything else I have tried in my app, but honestly 35B has been a decent performer as well. A well curated prompt harness makes a big difference with all these models and I have been fine tuning for qwen for months now and am blown away by the 3.5 series and hope they continue to impress.
Local-Cardiologist-5@reddit (OP)
https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button sorry heres a comment for my setup. not a beast at all
vex_humanssucks@reddit
The hybrid thinking/non-thinking mode is what makes Qwen3.6 genuinely different. Being able to toggle reasoning on-demand without switching models means you can use it efficiently for quick tasks and crank it up for complex ones. For local inference that's a huge quality-of-life improvement.
qwen_next_gguf_when@reddit
Show us the prompt bro. I want to play. Thanks.
Local-Cardiologist-5@reddit (OP)
heres the prompt, this convo has been going on for so long the start was cut off
Rich192K@reddit
Dont „please“ your LLM ;)
Ok_Sprinkles_6998@reddit
I want to be spared after the AI apocalypse. I always please and thank my LLMs.
gurilagarden@reddit
Seriously, on local you've got a lot more context management and conservation to contend with. Go Pi.
9kSs@reddit
How do I get this model working on Mac M4 Pro 48GB with MLX?
Organic-Chart-7226@reddit
faster-mlx - I started using it this week. Fast. I am running mvfp4 on 64gb. 4-bit should fit (mvfp4 went up to 35gb in use, might be tight on 48gb).
CryptoLamboMoon@reddit
been running it locally all morning and yeah this thing is wild. the context window alone changes everything for my workflow. did a whole breakdown on my podcast if anyone wants the full deep dive - A Thousand Tabs × Hour on spotify, first ep is literally about this drop
Far-Low-4705@reddit
why do you set this?
Is that not already in the gguf file?
Dion-AI@reddit
It's an amazing open source local model. I really hope we see another 9b variant like Qwen3.5-9b as well
Fuzzdump@reddit
Anybody know how this compares to Qwen3-Coder-Next?
itguy327@reddit
What MCP are you using? Can you post configs?
evilrat420@reddit
What do you guys think of the IQ2_M quantization of the 3.6 model? I only have 8gb of vram, so I'm going to have to offload and split layers between my resources, but I'm just curious to know if anyone has tried it and has any meaningful input on how it performs. It's my first time actually considering downloading a local model for coding on my limited hardware, for when my claude code subscription runs out and I have to keep working on complicated stuff like my new streaming machine learning rust crate project (which I'm building to hopefully democratize the resource economy of local llms a bit, not relevant to this question though).
CryptoLamboMoon@reddit
The 3B active / 35B total ratio is the part that keeps breaking my intuition. You're getting 22B-class performance at 8.6% parameter activation per token — the MoE routing is doing something genuinely different here, not just "sparse = efficient."
What I'm most curious about is how the KV cache behaves in extended agentic loops. The 262K context is great on paper but real-world token budgets for tool-heavy tasks hit the memory wall before the context limit in most setups.
Did you notice any degradation in instruction adherence past 50-80K tokens in your testing?
ozzeruk82@reddit
Nice!!! I'm gonna give it a go once the dust has settled.
minkyuthebuilder@reddit
The self-correction loop is what gets me — it noticed the canvas wasn't rendering and fixed it without being told. That kind of autonomous debugging is a different category from just "write me some code." Curious how it handles edge cases when the visual feedback is ambiguous.
SmartCustard9944@reddit
Can you give me a recipe for banana bread?
minkyuthebuilder@reddit
Lmao., no, you're not getting a banana bread recipe from me.
Local-Cardiologist-5@reddit (OP)
i was using the Qwen3.5 models before, and for those models, as long as it called the tool and saw that a screenshot exists, it marked the task as done. This one reads it and will be like "the game is rendering but the canvas seems to be cut off, i need to fix that". THAT IS EXACTLY why im so excited about this model
minkyuthebuilder@reddit
That's the real shift! - from "task marked done" to "task actually verified." Most models optimize for completing the step, not for checking if the output is correct.
Big difference in practice.
Imaginary_Land1919@reddit
what are your pc specs?
ayylmaonade@reddit
Yeah, 3.6-36B in particular is insanely good for its size. I've been super impressed with its coding prowess and general frontend design capabilities. It one-shotted both of these for me:
Browser OS
Japanese Voxel Pagoda
It's legit state of the art, frontier level coding from like ~3 months ago. I remember people being so impressed by Gemini 3 generating really beautiful Voxel ThreeJS worlds, and now we've got basically the same capability locally. It's crazy.
hannibal27@reddit
Impossible to have done that!!! Really impressive.
shankey_1906@reddit
I wonder if its possible to build something like MS Word this way, lol!
PhotographerUSA@reddit
Yeah, but can it code Crysis?
LordStinkleberg@reddit
Recommended way to run this on 16GB VRAM + 64GB RAM?
SearchTricky7875@reddit
what token speed are you getting, is it better than qwen 3.5 9b?
burdzi@reddit
Yes. By a lot
SearchTricky7875@reddit
can you share the number of tokens/sec? you can probably see it in the log.
burdzi@reddit
depends on quants and your hardware. I have a 5090 and with qwen3.5-9B-Q8_0 i get initial 117t/s (500 words generated) and with qwen3.6-35B-Q4_K_XL i get initial 172t/s (also 500 words generated)
raz0099@reddit
I upvoted this.
grantnlee@reddit
What hardware are you using and how much memory is being used?
hoschidude@reddit
3.5 27B is still better for agentic use.
Still-Wafer1384@reddit
Could you substantiate that? I'm very interested to hear what you've done to compare the two
fredandlunchbox@reddit
Reminder you can use it with claude code.
I tried it on a project last night: worked great for new-feature development, not so great for debugging. I spent about 45 minutes trying to get it to solve an issue before I gave up and handed the bug to Claude 4.7 which solved it first try in about 5 minutes.
ecompanda@reddit
ran the 35B quant overnight on a small coding task and had exactly this reaction. the thing that got me was watching it hit an import error on iteration 3 and rewrite the whole module from scratch rather than patch the broken line. never escalated, just figured it out. tool call context feels meaningfully better than the prior generation.
Eyelbee@reddit
So it is better than 27b? Really?
Suspicious_Bit_3106@reddit
Excellent Work
fermuch@reddit
I've been using Qwen3.6 all day for my normal work and I didn't even use the (work-provided) claude once. At one time I forgot I wasn't using Claude! (ollama with 100k context using Maki and q8_0 KV cache)
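For anyone wanting to replicate the ollama side of that: the KV cache type and flash attention are server-level environment variables, and the context length is a per-model parameter; a sketch (the model tag is a placeholder for whatever Qwen3.6 build you pulled):
# q8_0 KV cache needs flash attention enabled on the ollama server
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
# in another shell: start the model and raise the context window inside the session
ollama run qwen3.6:35b-a3b
# /set parameter num_ctx 102400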
IONaut@reddit
Is the reasoning-budget -1 to turn off reasoning? Or is it no limit?
FinBenton@reddit
I had some issues with this so I set
andreasntr@reddit
No limit
Local-Cardiologist-5@reddit (OP)
As i stated in the edit, i once had it there to test from tips i got from this thread and never bothered to remove it.
betam4x@reddit
I used an older version of qwen to do something similar and was impressed with the results.
Enitnatsnoc@reddit
Jobless
Pleasant-Shallot-707@reddit
If your only skill is writing code you’re told to write, sure
Enitnatsnoc@reddit
For a while, I was extremely arrogant and considered myself awesome, cool, and irreplaceable, telling other people to git gud. And then it was my turn.
The only thing that saved me from total collapse was some devops skills that allowed me to stay and support all the services created by AI rn, with a significant loss in salary. Some colleagues were less fortunate.
I'm not complaining ~~yes I am~~, I'm actually excited by all that "neural stuff". But it is difficult to deny the collapse of the labor market.
brickout@reddit
I was a year into a DS MS when AI started getting rolled out to the mainstream. I perfectly timed it so that I owe student loans and had absolutely zero chance to get a job. It hurt.
En-tro-py@reddit
We already have the 'cyberpunk dystopia' going strong and now we get to look forward to the part with the wars between corp AIs...
What a time to be alive...
__sad_but_rad__@reddit
i was living the dream and didn't even know it
uti24@reddit
Yeah, model is really good and speed is also good.
Somehow I ended up asking it to create exactly the same thing but as an idler. It decides where to build towers itself. It had only like 2-3 hiccups during an hour or so session.
Lorian0x7@reddit
I don't know, maybe I had a bad quant, but I tested it today and it was actually much, much inferior to 27b. It makes an absolutely ridiculous amount of calls and doesn't follow instructions, without actually accomplishing anything. I tried it with openclaw creating a wiki for a huge document of 1.2M characters. It filled 140k context with stupid tool calls while doing nothing concrete. On the other hand, qwen 27b did the job with just 60k context.
BackgroundNo2157@reddit
what's the vision mcp you're running for the screenshots?
FeelingFish9009@reddit
Seems dumb when you compare with rest of frontier models
celsowm@reddit
What is that cli vibecode app?
Due-Function-4877@reddit
Yeah? Does it use cross platform SDL3? If my question confuses you, you're not a game dev. If the model doesn't write virtually flawless SDL3, it's not an indie game dev, either.
Local-Cardiologist-5@reddit (OP)
well, you're probably expecting a lot if you're expecting local models to build full projects without much guidance. you need to guide them. to me it's almost exactly how claude sonnet was, and for me that's good enough, because i'll never run out of tokens
Due-Function-4877@reddit
It's good enough for autocomplete and boilerplate. And, if you're upset about SDL3, we can make this twice as funny and ask the machine to write a cross-platform blitter from scratch... Or you're welcome to "guide" that process.
The canned vibe coded shovelware featured here may show incremental progress for the tech, but it doesn't functionally or practically mean anything to people like me. All I'm getting right now is autocomplete and some help with boilerplate.
hibzy7@reddit
What hardware are you using to run this ?
Local-Cardiologist-5@reddit (OP)
heres my setup https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
vyralsurfer@reddit
I noticed you're defining the model chat templates manually. I was under the assumption that the chat template was bundled with the model from unsloth. Is that not the case? Just want to make sure I'm getting the most out of these models. Thanks!
Local-Cardiologist-5@reddit (OP)
it probably is. as i said in my post, those were settings for preserving the checkpoint so that it doesn't reprocess the prompt from 0. it seems to be able to continue instantly when i use the chat template. it's currently at 180k context but when i prompt it, it continues instantly
14domino@reddit
is opencode the best open harness to use? What are the alternatives?
Local-Cardiologist-5@reddit (OP)
i haven't tested other alternatives, i'm sure someone has feedback on the other open harnesses that exist
try_repeat_succeed@reddit
Sick! I'm new here so let me know if this is out of line but what hardware do you have running this?
I want to know if this is possible with my 16gb VRAM and 32 (maybe 64 soon) gb RAM. Or what I would need for this to be possible.
Vibe-coding with claude has been amazing. Being able to get to that level locally, for free, with no "usage limit" would be next level.
Local-Cardiologist-5@reddit (OP)
heres my setup https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
dadidutdut@reddit
This one works for me like a charm. 16gb vram and 32gb ram
MilkyJoe8k@reddit
Ok. This is all looking very promising! What hardware are you running this on?
Local-Cardiologist-5@reddit (OP)
https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button heres my setup, i dont think its a beast at all very moderate. IMO
italianguy83@reddit
Do you have any advice for my system, an RTX 5070 12 GB + 32 GB RAM? I don't expect to run a model that big, which I could only use at low speeds, but maybe some pointers to start from for coding, with another machine as the client
Aggressive_Job_1031@reddit
Strange. The benchmarks looked like they only improved slightly from Qwen 3.5.
Local-Cardiologist-5@reddit (OP)
i personally wouldn't place much weight on the benchmarks, models will be benchmaxed. for me they said the Gemma4 model was top of the range according to the benchmarks, but for my heavy coding needs qwen3.5 was MILES ahead, FAR better in every aspect, so much so that i thought Google was paying people to rave about Gemma, till i noticed the people raving about Gemma 4 use llms to write text copy, not code, just text copy, and it's better for that in my opinion
Borkato@reddit
Agentic coding improved a lot. You need to compare the two 35Bs
PossibleComplex323@reddit
Qwen3.6-35B-A3B is amazing. It spit out 29k tokens from a single short prompt to create a complete operating system in 1 html file.
Leather_Flan5071@reddit
awhh mann I want this
yogthos@reddit
might be interesting to try in combination with this too https://github.com/itigges22/ATLAS
jacek2023@reddit
I see 16 t/s, how long did it take to finish the task? Also, was it a single opencode task or did you need to manually continue things?
Elegant_Tech@reddit
Qwen3.6 is also a massive upgrade in the svg department if you wanted it to code vector graphics.
charmander_cha@reddit
Try doing that with a more aggressive quantization
Alternative_You3585@reddit
Looks like qwen 3.5 27B to me not 3.6
relmny@reddit
no it's not. "-a" is for alias, what's relevant is what comes right after "-m" (model)
Local-Cardiologist-5@reddit (OP)
Hi, sorry, I'm lazy, I didn't update my model alias. Here are my llama-server configs, it's really the 3.6 model. That's how excited I was about it, I didn't even bother updating the model alias name, just plugged it in directly
jreoka1@reddit
Yeah, I was gonna say the same thing but wasn't sure if it was just listing the wrong model or something via opencode
Sarayel1@reddit
thats a bot bait post
scythe000@reddit
Yeah that’s what it looks like
phenotype001@reddit
It made the most beautiful 2D fishing game I've ever seen. Easily better than GLM 4.7 and every MiniMax release.
DarkArtsMastery@reddit
Oh My God