Qwen3.6. This is it.
Posted by Local-Cardiologist-5@reddit | LocalLLaMA | 409 comments

I gave it a task to build a tower defense game: use screenshots from the installed MCP to confirm your build.
My God, it's actually doing it. It's now testing the upgrade feature.
It noted the canvas wasn't rendering at some point and fixed it.
It noted its own bug in wave completions and is actually fixing it...
I am blown away...
I can't imagine what the Qwen Coder that's following will be able to do.
What a time we're in.
Long_comment_san@reddit
That's not the best part. Imagine a new generation of kids having access to tools like that from early school on. I wonder what the heck our planet would look like. It's either a metropolis or Idiocracy.
kamikamen@reddit
People thought the internet would make us all geniuses, and Gen Z is the first generation with an IQ lower than their parents'. The future will be a lot more unequal than today: people with access and smarts will use these tools to empower their creativity and create new ventures, while most will just use them to outsource their thinking, never moving beyond pure chatbot use.
nachohk@reddit
Armed with a programming book and a BASIC dialect and a determination to create cool stuff when I was a kid, I taught myself enough code to write simple games like this within a few years. Add internet access and the rest of my youth, and I taught myself enough code that it became my career.
LLMs are nowhere near good enough to write entire commercial applications without expert supervision. And I have enough background in ML to say the reasons why are fundamental limitations of language models that will require multiple attention-level breakthroughs to get past. So I don't think we're going to have that for a long time yet. (Shit, the LLMs are only useful even with expert supervision in a fairly constrained subset of types of software.)
But LLMs are extremely good at hacking people's dopamine reward system and giving them the feeling of building something with absolutely none of the benefits of having done so themselves. If kid me had an LLM, I don't know that I could have ever learned programming well enough to make a career out of it.
I'm worried for our future. I take little comfort in it, but at least I should never have to worry about my own job security.
Sea-Promise-1182@reddit
‘Expert supervision’ is a little strong, no? All you really have to do is point it in the right direction and be able to translate ideas into words and use TypeScript.
nachohk@reddit
Sounds like you're liable to find out the hard way that LLMs don't know better than to do very stupid things like transmit your production database credentials to the client.
Sea-Promise-1182@reddit
Well yes, that is a risk, but as long as you make sure the AI doesn't do that, it's not gonna do it. If you're allowing all edits and commands in a prod environment you could be screwed, but the upsides definitely outweigh the downsides for coding.
nachohk@reddit
You seriously have no idea what you're talking about.
gearcontrol@reddit
I think there will be new fields created and current fields expanded, and it will be very lucrative for current experienced programmers and those that can understand the big picture, as in, how everything connects. There will be a long transition maintaining legacy, legacy with the new, and then the new.
Currently, AI has lowered the barrier to entry for creating software products. Call it 'AI slop' or whatever, but it's out there and will need to be maintained and managed.
nachohk@reddit
No, it really hasn't. Not for the kind of software you can sell. Not without going out of business very soon after due to shitty dysfunctional software because nobody involved knew to make damn sure Claude didn't implement any of those features you requested by sending your database credentials to the client app.
It has lowered the perceived barrier, but it is a machine gun pointed at the user's foot.
What they really do is they glaze the hell out of less technical people and hack their reward system to make them dangerously overconfident about what they can do with an LLM.
gearcontrol@reddit
That is true, and it's not just a split between coders and non-coders. There are non-coders with other IT skills who have hacked together prototypes and deployed software into production. I've worked for colocation and hosting companies over the years, long before AI, and seen this with my own eyes and helped them support it.
They used many of the same sites that AI scraped data from like Stack Exchange, Stack Overflow, forums, CMS, etc, to hack together sites and products. Then they'd hire staff and coders if it took off to maintain it or learn more skills as needed to do it themselves.
Many of these folks (engineers, security experts, designers) know the process and requirements for putting software into production, and many of them are the ones doing the checks and final deployments at major companies.
AI has absolutely dropped the barrier of entry for these folks without question.
Fear_ltself@reddit
Leaning towards Idiocracy. I played Roller Coaster Tycoon as a kid; that doesn't mean I can open a billion-dollar theme park. Just saying.
Kodix@reddit
Looking at how it's going currently, further stratification. Properly raised kids/kids from well-off families will use these tools to achieve and learn much more than previously possible. Other kids will drown in slop and cheap, individually-curated dopamine drips, even more so than they already do.
Medium_Chemist_4032@reddit
Tools and possibilities are one side of the equation. Motivation and challenge are the other. There are various ways to set it up for the best results overall.
motorhead84@reddit
The future mantra of vibe coding will be "make it cheaper to run, then make it cheaper to run again."
Long_comment_san@reddit
Children are inherently motivated. It's the world of adults that makes our lives miserable and motivationless lmao
Cute_Obligation2944@reddit
Children CAN be motivated, but if they can just scroll YouTube and Instagram while the computer does their homework... we're going to have a huge problem in the next couple decades.
kiwibonga@reddit
I dunno. My iPad baby started reading fluently around 3-4 years old. Her favorite YouTube show is called Numberblocks; she knows what squares and roots are. She's turning 7 and she's not engaged at all at school. The only value of school is to socialize.
It's going to suck if they can't keep the electricity on but it's going to be an unbearably more intelligent world than we got to experience.
Cute_Obligation2944@reddit
Cool story but it sounds like your kid is exceptional.
FrogsJumpFromPussy@reddit
All small kids are, honestly.
FrogsJumpFromPussy@reddit
"The only value of school is to socialize."
The value of school is to learn to think too. That's why we have trained professionals in schools and not ordinary Joes to keep them entertained.
"started reading fluently around 3-4 years old."
When we sent our first to school, they knew how to read and write at five and a half. Guess what, all the other kids in their class did as well. None properly, though. The teacher had to teach them to read and write properly, which she said is a nightmare to do.
dellis87@reddit
Same. 2nd grade and he’d rather learn on Duolingo than play a video game.
woswoissdenniii@reddit
We need to patch this short circuit in our brains. We need to free ourselves first and our kids right after. "Social media" is the first and biggest global societal disruptor. Either we find a way to overlap our bubbles or we will dehumanize to a degree that we can't control.
Cute_Obligation2944@reddit
Well, machine learning needs to be regulated just like anything. You can't have FB or TikTok pushing dopamine like oxycontin.
BlueSwordM@reddit
They CAN be, but because of various commercial incentives to get children/teens hooked, a relatively high number of them don't have great technical abilities whatsoever regarding general tool usage and problem solving.
Even the ones using online LLMs to learn don't actually know how to take advantage of their tools competently.
some1else42@reddit
Meanwhile my 13-year-old wants nothing to do with these advanced AI tools. He just wants to "use his brain". I've tried all manner of attempts but he just pulls away. Big sigh.
falcongsr@reddit
My kid is the same age and thinks AI is evil because it "steals art" from real artists.
I'm like OK but you have to understand it's a tool and you need to know how to use these tools.
Nothing.
SquareWheel@reddit
That seems fine to me. Better they learn the fundamentals than get in the habit of offloading their thinking process to an AI. They're powerful tools, but sometimes doing things the hard way is necessary to build intuition, too.
my_name_isnt_clever@reddit
Sounds a lot better than your child falling into AI psychosis.
Thebandroid@reddit
The issue is every techbro and VC idiot is lining up to sell the convenience of not really having to learn how to do anything, for a modest monthly sum.
Sure, those who are motivated will continue to study the old ways, or at least push this tech to its limits, but if we look at how quickly searching the internet has been replaced with "I asked AI", I'd say those people will be in the minority.
finevelyn@reddit
A lifetime of being limited by what the AI can do for you, and never learning to surpass its capabilities.
rkoy1234@reddit
The same was said for books, TV, then the internet.
It'll reward certain kinds of motivation in some while exacerbating certain kinds of laziness in others, as did every technological convenience that came before.
The only "this time it's different" aspect is the fact that it might eliminate almost all professions to start with, but at that point we've got bigger stuff to worry about.
draconic_tongue@reddit
pessimistic libtärdism on tech subs doesn't belong
Zc5Gwu@reddit
He’s somewhat right though. AI is like a bicycle for the mind. Your brain just doesn’t have to work as hard anymore.
NeinJuanJuan@reddit
If you ask "Who is fitter, runners or cyclists?" it could start an endless debate.
But if you ask "Who can go further, runners or cyclists?" the answer is definitive.
draconic_tongue@reddit
He's not right, like at all. Unless you want to say that access to the internet has made you stupider than a white-collar worker from the 60s
FrogsJumpFromPussy@reddit
Arrogant, 100 karma, stupid, so a troll.
finevelyn@reddit
Does internet access allow you to skip 10 years of computer science to build software? No. Your analogy is bad.
draconic_tongue@reddit
You don't need that shit, also has nothing to do with what's been said
Emotional_Chard_8005@reddit
Way to miss the point. You can vibecode something that maybe works without a degree now. Cool. This isn't about that. This is about eventually starting to lack people who know how it all works at the lower levels, which is necessary for continued advancement.
finevelyn@reddit
Saying that you should learn, and that it's worth learning yourself, is in fact the opposite of gatekeeping.
EuphoricPenguin22@reddit
I think we're trivializing what access to PCs and the Internet did when it was new. It allowed people from across the world to directly communicate for the first time. It allowed information that was previously locked up in books limited to physical locations like libraries to live in a central place where everyone could access it at any time. I could easily see someone critiquing this ease of access as "ruining" people's ability to search because it made finding information much easier. In some ways, it did. The Internet is infamous for containing unreliable information that people tend to not question as much as they should. Did that mean it's a worthless technology in the goal of democratizing access to information? Of course not.

Likewise, AI makes it easier to write functional software, but especially with these local tools, you really need to know what it is you're trying to do to produce decent software. I learned how to program before AI was ever an option, so I try to critique bad patterns when I see them, but even more important than that is having a proper "forest from the trees" view of the project to ensure the way you're expanding it is architecturally sensible. All of that is not trivial and AI does not replace it, especially with models at this size.
necile@reddit
Did the wheel allow you to skip 10 years of computer science to build software?
TinyZoro@reddit
That's not been my experience, though I totally get the theoretical possibility. I actually think the power of creativity it gives you inspires you to go down many intellectual rabbit holes. It does reduce the need for a certain type of mental effort, so it's not without concerns, but there's still lots of brain engagement in creating something half interesting with AI.
draconic_tongue@reddit
pessimistic libtardism on tech subs doesn't belong
Mr-Potato-Head99@reddit
It's dangerous in my opinion to have access to such tools without understanding what the tool is doing. It's like flying a Jumbo Jet on autopilot without knowing how to fly a Jumbo Jet.
Cute_Obligation2944@reddit
Jumbo jet might be a bit dramatic.
FluentFreddy@reddit
A golf sized toy helicopter?
Cute_Obligation2944@reddit
Probably more like a gun. Could be used to hurt people, but more likely you or your own family by accident.
DarkArtsMastery@reddit
No thinking required. The age of thinking is over
Long_War8748@reddit
Quite pessimistic and horrifying view.
handsomebrielarson@reddit
Made me remember that Claude has recently become the 'Official Thinking Partner' of Williams F1 Team.
NarutoDragon732@reddit
People said this when graphing calculators went mainstream. We ended up never using them in school; they were only allowed in the hardest classes, which required far more than any kid could accomplish before them.
I expect AI to be the same story but on a grander scale. Those who seek education will receive it, no matter how much their tools trivialize it.
my_name_isnt_clever@reddit
I'm hoping in the long term it will adapt higher education into something that can be done by anyone with the drive, not just anyone who can pay thousands of dollars and 4+ years to get a piece of paper.
DarkArtsMastery@reddit
I actually agree. Curiosity is not going away anytime soon. In the end, it is a tool, and all that matters is how you actually use that tool. I like the way it explains code to me; it makes sense so far, and my skills and understanding have improved. Yes, I am spending extra time doing my part, but it is good for my brain and it helps to know how things work under the hood :) The actual hard part is verifying most things and concepts, at least through quick experiments.
balder1993@reddit
LLMs allowed me to transition from iOS programming to web quite easily. I just started building a project with an LLM as an assistant. Every time I wanted to do something I’d ask how I should do it. When I didn’t understand what code it gave me, I’d keep asking why this way, why not that way, how does this thing work etc.
It’s like an accelerated way to learn, because in the end I learned something by “demonstration” (actually implementing it and seeing it work). As I got more experience with it, I needed the AI less and less to do the basic stuff.
draconic_tongue@reddit
your mistake is thinking it was ever a thing
Kandiak@reddit
And yours is thinking it wasn’t
jeffwadsworth@reddit
Brain? Brain?? What is this Brain?!
moonrust-app@reddit
Brain is a biological version of a CPU and RAM combination, which outdated species like humans and monkeys use. A Mac mini with 32GB unified RAM beats it easily.
jeffwadsworth@reddit
Star Trek “Spock’s Brain”
-dysangel-@reddit
Did you mean utopia? All cities are metropolises.
social_tech_10@reddit
https://en.wikipedia.org/wiki/Metropolis_(1927_film)
-dysangel-@reddit
He said "a metropolis", lower case. If he'd said "a Metropolis or an Idiocracy" then I'd assume that's more what he meant..
social_tech_10@reddit
I can't quite tell if you're being defensive and making excuses because you didn't even recognize the Metropolis movie reference (despite the obvious movie-reference context of the sentence), or if you're so hung up on correct capitalization, for whatever reason, that you would deliberately pretend to miss the reference just to be argumentative and act all "holier than thou", like the character Sheldon in The Big Bang Theory, always attempting to demonstrate that he is much more intelligent than everybody else and often missing the whole point in the process. I think the Sheldon character is written to show a person who is highly intelligent and also somewhat handicapped by being somewhere on the autism spectrum. Does that description fit you as well?
-dysangel-@reddit
Oof. Sure you're not projecting right now after being triggered into a wall of text? :D
darktraveco@reddit
Think about how media outlets and social networks will engineer the shit out of their platforms to make sure kids are glued to the screens instead of building cool stuff.
No-Marionberry-772@reddit
What stack are you using for software? I'd love to get a proper local setup going but I've had trouble figuring out what I should actually be using.
Local-Cardiologist-5@reddit (OP)
I'm using llama.cpp for the server,
OpenCode for the coding, just using the build agent.
I have 64 GB RAM, an RTX 4090, and my model is
the Q6 variant.
My llama.cpp parameters and llama-server config are in the screenshots attached to the post.
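Roughly, the shape of the launch is something like this (the filename, port and context size here are placeholders rather than my exact values):

```
# placeholder filename/values -- swap in your own model path and context size.
# --cpu-moe keeps the MoE expert weights in system RAM (this flag comes up again
# further down the thread), --jinja uses the model's bundled chat template.
llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -c 65536 -ngl 99 --cpu-moe --jinja --port 8080
```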
bnm777@reddit
This may be of interest:
https://sleepingrobots.com/dreams/stop-using-ollama/
Borkato@reddit
Wow this was extremely informative, wtf ollama
Pyros-SD-Models@reddit
Literally every model being discussed here "stole" shit to train on, so I find it somewhat amusing that people are all up in arms about Ollama basically using open source as it is designed. You can argue about morality, but it's a very simple question: are they violating any licenses they are supposed to adhere to? No? End of story.
llama.cpp chose its license with full awareness of what people would do with the software and the code, and if they wanted people to behave a certain way they should have written it into their fcking license.
JamesEvoAI@reddit
I'm the author of the article.
Except they're not using it as designed. They're taking the open source project and breaking it in a way that delivers worse performance, in addition to adding complexity overhead. This isn't a good-faith business built on FOSS, this is being a rent-seeking parasite. I'm not opposed to building a business on top of FOSS; I am opposed to your business making the free alternative meaningfully worse and damaging the sentiment of the underlying FOSS ecosystem.
And yes, they are violating a license. Did you actually read the article or did you just go straight to writing an angry comment?
It's explicitly written in the MIT license that llama.cpp uses that you need to include a copy of said license with any distribution of the software. Ollama is deliberately violating the license terms to prevent their users from finding their FOSS foundation that offers a better experience.
FusionX@reddit
I found this ironic. The article was AI generated, along with this reply. And then I noticed your name.. /u/JamesEvoAI.
The internet is dead.
ArtfulGenie69@reddit
I think people balk at Ollama because the idiots regularly pretend that it is all their own work, and they have an ass system instead of just using GGUF like a normal person would. They have tried to make their own personal garden out of all our shared equipment and they don't give credit. Mainly they just suck ass because of the Go templates (who the fuck thought that was a good idea, why reinvent the fucking wheel when you already have Jinja). They're just annoying dumb bastards who are easily replaced with llama-swappo.
The_frozen_one@reddit
They use gguf, they just use sha256 filenames to dedupe/deconflict identical files. It's very similar to how container software works. You can load them directly with llama.cpp.
Wasn't jinja added to llama.cpp 4 months ago?
JamesEvoAI@reddit
This reads like a straw-man justification for a problem that nobody has. Even if I did have the issue of multiple copies of the same model floating around (why would I, though?), the obvious solution to that problem is not to lock myself into a third-party tool because the filenames are now obfuscated.
This makes sense in containers, where two completely different containers may share intermediary layers. A human isn't directly using those artifacts, so hashed filenames are the obvious choice. This makes zero sense in the context of a GGUF.
On the Jinja question: this is correct, and OP's argument there was invalid. The Ollama Go syntax predates llama.cpp's use of Jinja by a few years. That said, the Jinja syntax is more accessible and has become the de facto standard.
The_frozen_one@reddit
But there's no obfuscation, it's just a system you aren't used to.
Being able to quickly validate that the file contents are valid by checking that the filename matches the hash is really useful, especially if you can automatically delete and redownload the invalid files. You can trivially write a script to tell you whether each and every one of my huggingface or ollama blob files is valid without knowing anything about them.
openssl sha256 FILENAME - does the hash match the filename? If so it's valid; you don't need to understand anything about the underlying data or format. And yes, huggingface's hf CLI tool does the same thing. It's such a robust and unremarkable way to deal with large file sets that huggingface uses a nearly identical system (look under ~/.cache/huggingface/hub, everything under model-*/blob is a bunch of hash-based filenames where the actual data is stored).
If you don't have a lot of models to manage, there's little reason to have a system manage them. That's perfectly understandable, use what works for you. But it's not obfuscation to store a file by its hash. If Ollama were using a secret hash function or entangling the hash with a secret, non-public value, sure, that'd be problematic, but it's just standard sha256 that anyone can compute.
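To make it concrete, the check is a couple of lines, assuming ollama's default blob directory of ~/.ollama/models/blobs:

```
# each blob is named sha256-<hash of its contents>; recompute and compare
cd ~/.ollama/models/blobs
for f in sha256-*; do
  echo "${f#sha256-}  $f" | sha256sum -c -
done
```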
JamesEvoAI@reddit
I've been building with docker for years.
Again, when is this a real world problem that people doing local inference are having? I download the GGUF, I test the GGUF, the GGUF goes in my model storage folder. If it's not working I download it again.
You're ignoring the point to continue rationalizing a problem that nobody actually has. When I say obfuscation I mean obscuring the information I care about (what model it is, what quant it is) behind a dependency when that information could have been in the filename.
In what world are you living where you're having to validate the integrity of your GGUFs beyond the initial download? I have 50+ models downloaded right now and all it takes is an ls to know exactly what models are there. I can easily load them up with any other inference tool because they're just files whose names reflect their contents. If I need to know whether I have a specific model I can just run find against my model folder.
Why are you complicating things to solve a non-issue?
The_frozen_one@reddit
Nobody uses Docker, everyone just uses namespaces + cgroups + chroot (or jails). There is no reason to be locked into a system that uses immutable layers for building containers, it's just convenience-ware that solves a problem nobody has.
/s if it isn't obvious (Docker is great)
I make an HTTP call to 4 systems, they download the same model, I run some tests against the model using a standard request. When I'm done, I make an HTTP call to delete the model from all systems. Every call to each system is agnostic and identical; only the target IP/hostname changes.
OR
I download the GGUF, scp it to each system, then ssh (or RDP if ssh isn't available) to each system and launch llama-server pointed at the GGUF. I build llama.cpp or download the latest release. I use tmux or screen or RDP to keep the process active, monitoring and restarting llama-server as required until I'm done, then manually delete the file from each system. Each step of the process requires knowing a bit about Windows or Linux or macOS or *BSD.
What problem does nobody have? Wanting custom options for the same underlying model available on demand? I think you're over-representing your use case. There's room in this community for people who will never learn what a gguf or safetensors file is.
I run ollama ls then ollama show MODEL and it shows me the context length, quant, etc. It's standardized and easy to read. I type ollama pull MODEL to download a model with reasonable defaults, or ollama pull MODEL:quant to get a specific quant; it deletes when I type ollama rm MODEL. I can create a Modelfile with specific context lengths or system messages by typing ollama create specialmodel. It uses the model if I have it or downloads it if I don't. Or I can use a custom file I provide it. The syntax of a Modelfile is similar to a Dockerfile (it even starts with FROM, which takes a model name instead of an image name).
If you are really itching to use the files from Ollama, it's not hard to do so. The walls aren't high enough to matter, just like how you using Docker is fine despite the fact that more open / less commercial alternatives exist.
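For anyone who hasn't seen one, the Modelfile flow looks roughly like this (the model tag, context length and system prompt are just examples):

```
cat > Modelfile <<'EOF'
# whatever model tag you actually have pulled
FROM qwen3.6:35b
# bake in a larger context window
PARAMETER num_ctx 65536
SYSTEM "You are a terse coding assistant."
EOF
ollama create specialmodel -f Modelfile
ollama run specialmodel
```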
Ah yes, the "command line is trivial" person for whom file management is obvious. For you and I it might be, but there are people who are wildly more capable in things that you and I will never begin to comprehend who are terrible with computers and who basically have a non-functional mental model for how they work. I want more people using local models, whatever their skill level with computers.
JamesEvoAI@reddit
My brother in Christ, this entire time I have been making my arguments from the perspective of a non-technical user, for whom filenames that reflect the actual content of the file are far more obvious than hashes. This is the dumbest conversation I've had in a while lol. If a normal user downloads a GGUF with Ollama and wants to try using it in literally anything else, they now have to deal with file hashes instead of just using whatever search their OS provides for the name of the model file they know they have somewhere on disk.
You're arguing for a piece of software that is harder for normal people to reason about, has worse performance, and is parasitic to the FOSS ecosystem. The idiot in this conversation is actually me for letting this go on for so long. Have a good one lol
Borkato@reddit
This is such a stupid take.
Nobody is suing them. Llama.cpp isn’t telling them to take it down. They just got rid of their goodwill by being rude so people are saying “let’s stop using them”. Literally nobody said they can’t do it, just that it’s a dick move, so people don’t have to use it if they don’t feel like supporting that kind of behavior.
There’s nothing wrong with this. Your argument sounds like those that complain about free speech being “restricted”, not realizing that people do not have to listen to you, like you, or put up with your speech without consequences for you.
FaceDeer@reddit
Unfortunately the article spends 95% of its time explaining why Ollama sucks, and then there's a paragraph tucked away at the end with "BTW, here's a list of various projects that may or may not accomplish bits of what Ollama accomplishes. Good luck figuring them out."
Looks like to replicate what I use Ollama for the most, I'd want to install both llama-server and llama-swap. Neither of these appears to have a Windows installer, and there are a huge number of fiddly configuration files that it looks like I'll need to figure out once they are installed.
I'm a technical person, I could sort all that out. Or I could just leave Ollama as it is and everything just keeps on working fine as it is now.
Ollama's got the "it just works" part nailed down pretty well and that's a very important feature IMO.
WhoRoger@reddit
Pretty much. And they recommended LM Studio, which isn't FOSS.
Ollama has just the right amount of user friendliness and tinkering friendliness for people to start messing around with AI and understand the basics. Even the API is friendly enough to cobble a client together in an hour and goof around, including model cloning. Most other solutions are a brick wall of "swim or drown".
I was just testing a new variant of a model yesterday, and well, I could either convince the llama.cpp server to take in another model (I still don't know how to do that without restarting it), launch another CLI instance on a new port and be super careful about the parameters, or... I could swap out the filename in the Ollama modelfile and have the model available in 30 seconds with the same settings as the old one. The last option is almost always the fastest, even if it's not the cleanest.
I get the distaste for Ollama, but they really nailed the basics.
JamesEvoAI@reddit
Article author here, I give other recommendations that are FOSS. LM Studio is the first choice in the article because it is the best at filling the needs of what people expect from Ollama, while also giving proper attribution back to the ecosystem.
I am a FOSS advocate but that doesn't mean I'm 100% against you trying to build a business by offering convenience on top of it. My issue is when your profit incentives become parasitic to the ecosystem that made those profits possible.
WhoRoger@reddit
You should still make it clear that it's not even open source, if you're criticising another app for releasing a closed-source GUI.
If we want to move from apps that aren't totally legit foss, then going towards non-foss is the opposite of what we want.
Personally, I was really shocked when I found out LMS isn't open. So many people recommend it, I thought I was missing something because nobody even bothers to mention it. Considering this community is largely Linux/FOSS people, I'm thinking it's at least in part because of a lack of good, commonly available alternatives.
If the choice for inference is between one closed source app and a trillion hobby Python single-use projects, that's not really healthy, and is exactly what kept Linux back for so long. Now we're doing the same thing with the LLM ecosystem.
JamesEvoAI@reddit
That's a fair criticism, I'll add a note to each option.
WhoRoger@reddit
👍
ZootAllures9111@reddit
LM Studio does everything you want and more in terms of user friendliness, while being able to directly download any GGUF you want from Hugging Face, and it's a lot faster than default Ollama.
FaceDeer@reddit
It doesn't, actually. LM Studio is a GUI first and foremost. What I want out of Ollama is to have it just sit quietly in the background until one of my scripts calls an API, at which point it loads the LLM, serves the call, and then eventually unloads the LLM again if there are no further calls.
Checking out LM Studio to see what's changed recently, I see they've added a "headless" version without a GUI. But it still doesn't do the dynamic load and unload stuff. That's why I identified llama-swap as a necessary part of what I'd need to install alongside llama-server.
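For reference, as far as I can tell the llama-swap side would look something like this (model path, name and the ttl value are guesses on my part, not a tested setup):

```
cat > llama-swap.yaml <<'EOF'
models:
  "qwen3.6-coder":
    # llama-swap substitutes ${PORT} and starts/stops this process on demand
    cmd: llama-server --port ${PORT} -m /models/Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99
    ttl: 300   # unload after 5 idle minutes
EOF
llama-swap --config llama-swap.yaml
```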
msaraiva@reddit
It actually does that. I use it constantly. The server exposes OpenAI, Anthropic, and LM Studio-style APIs. You can list models and load/unload dynamically.
FaceDeer@reddit
Alright, I'll check it out in more detail. It's been a long time since I explored it previously.
Worth noting that the "closed-source application" objection to Ollama applies to LM Studio too, though.
Evening_Ad6637@reddit
Only to the UI.
But if you worry about it not being fully open source (which I totally agree with), just FYI, llama.cpp aka llama-server does support autoloading models too now.
Other solutions are:
- llama-swap (more features than llama-server; supports any endpoint)
- llamafile (much more convenient, only one file: model, llama.cpp server, configs etc. - everything in one executable file. Downside: only one model per llamafile)
FaceDeer@reddit
I already mentioned llama-server and llama-swap back at the beginning of this subthread as the way to do this; the problem is that it's a complex setup to accomplish something I've already got working fine using Ollama.
I'm rather surprised at the amount of downvotes I've been getting discussing this. I guess saying anything positive about Ollama is very unpopular in these parts?
Anyway, haven't heard of llamafile before. Does it do the "automatically load model into memory when actually queried, unload again after timeout" thing? I took a quick look at the documentation and didn't see a reference to features like that, the impression I get is that the model is in memory and ready to go for as long as llamafile is running.
426upgradrequired@reddit
LM Studio has auto load and unload. It might be a newer feature.
https://lmstudio.ai/docs/developer/core/ttl-and-auto-evict
PollinosisQc@reddit
I did exactly that with a small Python server and llama-cpp for my home setup. The Python server takes the requests, creates a llama-cpp server subprocess, and when the request is done being served, the subprocess is killed and the RAM is reclaimed (well, actually it keeps the model loaded for a few minutes in case more requests come in, so it doesn't have to do cold starts with every request).
gwillen@reddit
If you're on Windows I think you definitely have more limited options. You might consider using WSL2, although I haven't personally tested any of this stuff with it, so I can't say you won't run into issues there. It's possible that ollama is still a good choice for Windows users.
ZootAllures9111@reddit
LMStudio is six billion times better than Ollama in every way though, it's the best choice on Windows by a ton
bnm777@reddit
Windows... ewwww... what are you doing to yourself, matey?
At the very least, LLMs run with less overhead on Linux.
Move away from the Dark Side, Jedi.
Borkato@reddit
Except for the fact that it's closed source. Also, llama.cpp now has its own web UI, or whatever it's called.
ZootAllures9111@reddit
Most of the backend is open source; it's just the Electron UI that's closed source. In no way does that make Ollama more worth using, regardless. Also, the llama-server built-in webui sucks, it has like zero features; it can't even switch models from within the UI.
FaceDeer@reddit
Both llama-server and llama-swap have Windows installs via Winget (though the swap one is noted as being "unofficial"), so basic support likely isn't a problem here. It's more a matter of foreseeing all the time I'll be spending tinkering with configurations and other fiddly details so that, at the end of all that work, I have something that works the same as the system that I already have installed and is functioning with minimal effort. That's where the "ugh, maybe I'll do that later" barrier keeps coming up.
tgreenhaw@reddit
Ollama doesn't support TurboQuant yet. That will be a huge game changer because we can use larger models with a usable context window. Right now, llama.cpp is in another league.
lack_of_reserves@reddit
Not at all. llama-server supports model switching now; to enable it, don't provide a model when starting it up.
FaceDeer@reddit
Alright, that eliminates half of the work. The other half is still there. I'll take another look later today.
ArtfulGenie69@reddit
Dude, you're knocking the best software because you don't know how to use GitHub and you're still on Windows instead of Linux hehe. Ollama never had the "just works" thing going: Go templating is trash and regularly fucked up thinking. Not sure about now, but Go templates are sub-par and forced instead of the normal Jinja; you'll see it still has a bunch of added bugs because of the shitty Go templating, and you will also be extremely annoyed by how they handle their models. All they offer is an API that is pretty easy to program for, and so a bunch of noobs made their first program around it.

Llama-swappo is the nice llama-swap offshoot that pretends to be an Ollama so you don't need Ollama anymore. Llama-swap is way, way better. You do the config once and just copy the parts around in the yaml file. It's easy enough that I figured it out. It makes everything way more modular than Ollama. I have full control because it's just setting up llama-server commands. Then you can rebuild llama-server whenever and it doesn't fuck up llama-swap; they have upgrades all the time. Llama-swap can even help with other inference engines like vLLM, so you can make everything reachable in one place. It's better for this kind of stuff, just like Linux is way better for this than Windows. Check out Linux Mint Cinnamon, it's my favorite flavor.
FaceDeer@reddit
As I said, I do know how to use GitHub. I am entirely capable of setting all this stuff up, but it's going to take a bunch of work to do so. It's that extra work that I'm pointing out is an actual problem.
I mean, you're literally suggesting that I should install Linux as one of the steps here? That's not making things easier.
You are telling me that my own personal experience, that I personally experienced, didn't actually happen?
This one? Its installation section reads, in its entirety:
This is going in the opposite direction from what I'm suggesting is needed here.
randylush@reddit
Not only is that article guilty of it, but there are also infinite Reddit comments all just saying “don’t use Ollama!”
Local-Cardiologist-5@reddit (OP)
That's why I always strongly advocate for using llama.cpp directly.
chimph@reddit
yet called it olllama..
PinkySwearNotABot@reddit
Did you install any skills for OpenCode? I'm curious because I'm seeing that To-Do List in the right panel, which I don't think I've ever seen myself when using OpenCode.
rumblemcskurmish@reddit
I'm kind of intrigued why you'd use a 6-bit model on a 4090. I have an identical setup (7950 CPU, 64GB DDR5, RTX 4090) but I'm using the 4-bit quant to fit the whole model in VRAM.
You're clearly more advanced than me, so I'm just wondering what I'm missing here.
alphapussycat@reddit
Imo q4 has noticeable loss, q5 is a step up, but q6 is the sweet spot. I'd say only do q4 if you're really starved for VRAM.
rumblemcskurmish@reddit
Yeah, I'm AI poor cause I "only" have a 4090. So I can't really do anything higher than 4bit. One day I'd love to step up to a 5090 or something with more VRAM but I'm stuck at 24GB for now.
smuckola@reddit
Ollama has 8-bit quantization of the context window (50% compression, virtually lossless) for free with an environment variable, FYI.
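If I'm remembering it right, it's roughly this before starting the server (flash attention being required for KV-cache quantization is my understanding, so double-check it):

```
export OLLAMA_FLASH_ATTENTION=1   # KV cache quantization needs flash attention enabled
export OLLAMA_KV_CACHE_TYPE=q8_0  # 8-bit K/V cache instead of the default f16
ollama serve
```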
rumblemcskurmish@reddit
Wha?!?! You're telling me if I defect from LMStudio to Ollama, I get a huge context window for free?! Or am I too dim to understand what you're talking about?
alphapussycat@reddit
LM Studio has KV quantization too.
smuckola@reddit
LM Studio is also based on llama.cpp so you can enable it now, directly in the user interface (according to Gemini):
On the right-side panel, expand the Advanced Configuration or Hardware settings before loading a model.
Look for the K Cache Quantization and V Cache Quantization settings.
Set them to 8-bit (labeled as q8_0).
If you use the LM Studio API or configuration files, you can enable it by setting the llamaKCacheQuantizationType and llamaVCacheQuantizationType parameters to q8_0 (https://lmstudio.ai/docs/typescript/api-reference/llm-load-model-config).
Pretty soon there are plans to merge the community implementation of Google's TurboQuant, which gives 600% compression, virtually lossless, of every context window for every LLM. That already works on Ollama for at least 300%, last I knew.
rumblemcskurmish@reddit
Just enabled the options you mentioned (labeled as "experimental", so I'd never touched them). Freed up tons of VRAM and allowed me to take the context window up to 120K instead of 70K.
Excellent advice!
smuckola@reddit
I wonder why it's labeled as "experimental" unless that just means "not default". For reference of anybody interested in current stable KV cache compression that we already secretly have, it's been around since 2024!
https://github.com/ollama/ollama/pull/6279
rumblemcskurmish@reddit
Thank you random genius! Srsly, I'm a bit over my head on some of these esoteric settings. I'm running the Q4_NL (Unsloth) build with a 70K context window and it flies on a 4090. But if I can get more context I'll take it!
smuckola@reddit
yaaaaay feeeed off of my suffering!
I just learned this late last night just before bed and didn't even try it yet! lol I enabled it but didn't check.
I enabled OLLAMA_KV_CACHE_TYPE=q8_0 and restarted, and everything still works but I didn't measure it yet. Gemini insists that it's perfectly stable and indistinguishable, and should be enabled by default but the purists and researchers don't want it yet I guess ;)
I JUST started really testing openclaw for the first time, during this week of the Gemini outage! So that forced me back to my 6-core i7 CPU with Qwen 2.5-coder 1.5B!
Ok but don't cry for me, Argentina, because this just hurls me back toward learning runpod, hopefully for a big fat qwen 3.5. Let the de-googling begin!
BlueSwordM@reddit
Do note that it isn't lossless, especially on long context tasks.
rumblemcskurmish@reddit
Yes, Gemini says it isn't lossless but that it really only breaks down on long context tasks (as you noted), which is where the model starts to break down anyway, so it's totally worth it.
tvmaly@reddit
I only have a 2070 with 8GB but 64GB of ram. Is it possible to run this?
rumblemcskurmish@reddit
Look, even with lowish RAM you can, theoretically, use swap memory, which is your HDD/SSD acting like RAM, and, sure, it will run.
Will it behave like an LLM? If you're fine with one word hitting the screen every second or two, yeah, it runs.
I'm trying to load mine 100% in VRAM because I want Openclaw to respond nearly instantly to requests on Discord.
The fact is we do have some models which are PRETTY GOOD at chat and will run on very modest hardware (take a look at Gemma4 9B, etc. - they are CRAZY good for the size), but this model is really only for someone who wants agentic workflows (tool use), and there the stakes are simply much higher.
For instance, my Openclaw bot has corrupted his own config a few times by not understanding the formatting of a particular file. He's deleted folders by not understanding that a seemingly straightforward Linux command (rsync -rf) can delete files if you call it the wrong way, even though I told him to NEVER delete anything.
This space is changing very fast and it's a really ugly space right now. I mean it's beautiful when you consider the potential but boy is it kind of ugly watching the sausage get made.
nlegger@reddit
Swapping like that will kill your SSD/NVMe's endurance and lifespan; I wouldn't do this often. But I mean, 1TB of NVMe is under 100 bucks 😂
rumblemcskurmish@reddit
Yeah, I didn't say I recommended it. If the question is whether it will run or not, sure it will run, but it isn't a good idea.
Puzzleheaded_Base302@reddit
Openclaw is terrible, but today I found out that if I delete most of the agent.md file content, openclaw becomes smarter.
Also, try Hermes Agent, it is a great step up: more polished, fewer bugs, doesn't feel like AI slop. (Openclaw is a giant pile of AI slop, so many things not tested, too many bugs here and there.)
Shouldhaveknown2015@reddit
You should never go by just the Q number; it's meaningless when it comes to the quality of the responses. Just look at the model and the quantization KLD, then pick the one with the best KLD you can load with the context size you need (or the max, if you can fit it).
KLD is basically how far off a quant is from the full model, and a model might have nearly the same KLD from the Q3 to the Q6 version, depending. But almost always Q6 and better have nearly the same KLD.
Looking for that dropoff between different versions and sizes of the model will help you fit the most you can in and lose the least amount of quality.
At least that's my understanding of it.
rumblemcskurmish@reddit
The Q number is not meaningless, because it is the most important factor in the trade-off between accuracy and performance. I can run Q8 at 2 tokens a second or Q4 at 200 t/s. The Q4 is very close to the accuracy of Q8, for my purposes at least, but it actually fits in VRAM.
Yes, the Q6 is definitely better, but I would have to run a worse model to get acceptable speed. I'd prefer to run a better model and lose some accuracy on some tasks than a poorer model that's more accurate.
Local-Cardiologist-5@reddit (OP)
To be honest with you, I'm just plugging in whatever works. I'm even downloading the Q8 variant to see how much better or worse it is. We are all learning in this space; everything I know is from this thread.
nlegger@reddit
When you go over 80% of the context window, sometimes the chat gets less accurate. Use Karpathy's wiki?
xeeff@reddit
Honestly, I wouldn't go higher than Q6. Q5 is good if you need extra VRAM, with little real difference (which is why most people settle for Q4).
carrotsquawk@reddit
It was established somewhere that q4 m is the sweet spot. Not really worth going higher.
GrungeWerX@reddit
Q5 is the sweet spot for 27b, noticeably smarter than q4
wen_mars@reddit
https://x.com/UnslothAI/status/2045167861942063428
Separate-Forever-447@reddit
when you say "Q8 variant... better or worse", you must be talking about given your system constraints, because it should be nothing but stronger and more capable.
rumblemcskurmish@reddit
There's no doubt that will run but if you watch task manager you'll see it constantly hammering your CPU as the model shifts from CPU/RAM to GPU/VRAM. I mean, it's doable for sure!
AdamDhahabi@reddit
FYI, I'm running Q8 at the exact same token generation speed as Q6, try it, you have the VRAM.
PinkySwearNotABot@reddit
What's the difference between llama.cpp and llama-server?
coder543@reddit
You're leaving speed on the table by using --cpu-moe instead of --n-cpu-moe. Or you could just use "fit".
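Roughly the difference, with a placeholder model path and an example layer count to tune against your own VRAM:

```
# --cpu-moe pushes ALL MoE expert weights to system RAM:
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 --cpu-moe

# --n-cpu-moe only keeps the experts of the first N layers on the CPU,
# so more of the model stays in VRAM; tune N until it stops fitting:
llama-server -m Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 --n-cpu-moe 12
```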
nlegger@reddit
27B is slightly higher than 3.6 by less than 1% I think.
AvidCyclist250@reddit
isn't "fit on" the new default now?
coder543@reddit
It is, but if you specify --cpu-moe yourself, as OP did, then you're overriding that decision.
Local-Cardiologist-5@reddit (OP)
Yeah, I was too lazy to switch the model name in OpenCode (so many settings), so I just changed the model name in my llama.cpp server call and called it a day.
Thank you for the n-cpu-moe and fit flag tips, I'll try them now.
Danmoreng@reddit
If you’re using fit, you need to use fit-ctx as well instead of the normal ctx flag. Parameters which work well for me: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
see_spot_ruminate@reddit
Like the other person said, fit works well and is on by default, so you really just need to remove all of your "cpu-moe" flags. One point though: you can eke out a bit more t/s by fussing around with "--fit-target", as the default is 512, but you can push it. Fit target is how much VRAM to leave unoccupied. You need some, but that some may be less than 512.
iamapizza@reddit
Excellent thanks for the pointers here.
nlegger@reddit
Is UD (Ultra Dense)? Googling soon, just wanted to leave a comment for the algorithm lol
ea_man@reddit
Are tools working well between Qwen and OpenCode?
Did you implement some kind of XML -> JSON translation, or modify the prompts in any way?
Potential-Leg-639@reddit
Nothing to do here, everything works out of the box
ea_man@reddit
Yeah, I'm testing it now and it works flawlessly, which is kinda new to me, just like in Qwencode, which is built for the XML tool calls Qwen is trained on.
This is good news. I guess lots of people were pissed when Qwen tool calls failed with JSON agent harnesses.
pepe256@reddit
I mean even Qwen 3.5 27B (with the latest updated Unsloth weights) works flawlessly with Claude Code. That's how I'm using it right now.
ea_man@reddit
Oh, that's what I did too. I was using the 27B dense mostly; now it looks like the 35B A3B is doing better outside of Qwencode.
Yet I'm thinking that the prompts are mostly to blame for some other agent harnesses.
carrotsquawk@reddit
you da real mvp
T3KO@reddit
Do you have a link to your chat template?
Federal_Order4324@reddit
OpenCode is so goated, Roo Code too bloated; since I've switched, tool hallucinations etc. are golden.
sjhatters@reddit
Try pi
Randomdotmath@reddit
Yeah, Roo/Kilo was designed for old models with awful agentic ability; it's too convoluted for modern models.
nicholas_the_furious@reddit
Doesn't OpenCode take your prompt data? It seemed less private than the Kilo extension when using local models.
Still-Wafer1384@reddit
It's open source
TimeRemove@reddit
Do you have any custom tooling / mcp endpoints for the playwright integration?
Local-Cardiologist-5@reddit (OP)
No, it's in OpenCode.
No-Marionberry-772@reddit
thank you!
TheDailySpank@reddit
Not op, but from the screenshot, they're running OpenCode in the top right window.
Great_Guidance_8448@reddit
I could barely get 32k context on my 24 gig of VRAM with the Qwen 3.6... Asked it to refactor some stuff (a Python project) for me - it did some work, claimed it finished, but a bunch of changes were truncated and scripts were left unusable.
I am back on Gemma 4 26 A4B... 64k context and no fails like that (so far!).
DeepBlue96@reddit
Bro, quantize the context... I can easily fit 131k context in my 24GB at q4_0, and I would suggest using the Unsloth MXFP4 model of Qwen3.6.
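Something like this, roughly (the filename is an example, and as far as I know the quantized V cache needs flash attention, which on older builds was just -fa):

```
llama-server -m Qwen3.6-35B-A3B-MXFP4.gguf \
  -c 131072 \
  --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -ngl 99
```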
EbbNorth7735@reddit
Why not use system RAM if you want more context space?
Great_Guidance_8448@reddit
Ooof, that's going to slow things WAY down...
EbbNorth7735@reddit
Really depends
pedronasser_@reddit
Qwen3.6 35B is working wonderfully with 16GB of VRAM.
andrewh2000@reddit
Would you mind briefly explaining your setup? Ollama, LM Studio, etc.? And which exact model?
pedronasser_@reddit
Right now, I am running like this:
andrewh2000@reddit
So that's 80k of context? I'm on an RTX 5060 Ti with 16GB VRAM and 80GB(!) of system RAM, and I'm getting about 80 tokens per second if I let llama-server determine the context using -fit, or about 50 tokens per second if I set it to something quite high like 128k.
pedronasser_@reddit
Yes, I am following Qwen's best practices guide. Also, I don't need more than that, given how I work/harness.
andrewh2000@reddit
I get the impression that working in something like OpenCode, which is what I want to do, the more context the better. So you have to trade off context against token speed: fast but slightly worse results, or slower but more likely to be useful. Hmmm, if only I could justify £5000 on some AI rig.
andrewh2000@reddit
And this looks like a good read.
https://qwen.readthedocs.io/en/latest/run_locally/llama.cpp.html
Hyrnos@reddit
Can you share your setup?
cviperr33@reddit
INSANE how good this model is... Honestly, I'm blown away again and again. It literally fixed the broken code and projects I had hit a wall on with Gemma for days; it solved them in like 5 minutes and then explained why Gemma failed.
And the best thing about it: it's sooooo fast... 120 tk/s on a 3090 with llama.cpp, and prefill is instant in the 3.8k-5k range.
The moment I send a word, 1 second later I already have a response, with a file edited or something. It is so efficient in these agentic tools and also doesn't hog my GPU like the Gemma models.
squatterbot@reddit
Yeah, and also, which quant are you using?
cviperr33@reddit
Quant is Unsloth IQ4_NL; settings posted under my proof comment.
valtor2@reddit
Why the NL? Why not IQ4_XS?
cviperr33@reddit
NL = non-linear, XS = extra small, meaning more compression, so NL is slightly better.
cviperr33@reddit
Posting some proof :
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------|----------------:|----------------:|--------------:|----------------:|----------------:|----------------:|
| qwen3.gguf | pp4096 | 3510.40 ± 17.51 | | 1028.06 ± 12.23 | 1027.55 ± 12.23 | 1028.10 ± 12.23 |
| qwen3.gguf | tg32 | 124.84 ± 1.70 | 129.02 ± 1.77 | | | |
| qwen3.gguf | pp4096 @ d8192 | 3586.85 ± 21.88 | | 3108.61 ± 16.12 | 3108.10 ± 16.12 | 3108.66 ± 16.14 |
| qwen3.gguf | tg32 @ d8192 | 117.26 ± 0.77 | 121.29 ± 0.78 | | | |
| qwen3.gguf | pp4096 @ d16384 | 3468.56 ± 2.95 | | 5371.04 ± 14.77 | 5370.53 ± 14.77 | 5371.07 ± 14.77 |
| qwen3.gguf | tg32 @ d16384 | 114.60 ± 0.96 | 118.64 ± 1.00 | | | |
cviperr33@reddit
Quant is: Unsloth IQ4_NL + BF16 mmproj.
Running with 2 slots and 200k context; my VRAM usage is 22/24GB.
huzbum@reddit
Thanks, similar to my setup, but I am looking at adding the mmproj
TraditionalCurrent64@reddit
I thought giving it higher temperatures gave way worse results when coding?
cviperr33@reddit
I'm using the Unsloth recommended values; he has 4 profiles, and this one is "thinking general".
Randomdotmath@reddit
The 3/4 GDN design on Qwen 3.5/3.6 is genius for local use; it saves so much VRAM.
cviperr33@reddit
I don't know what kind of black magic they did, but my card has a blown bearing; it whines even at 60-70% load, so at 100% it's like a jet engine going in my room lol. The Gemma 4 models always pushed it to that level, but this one, maybe because it's so fast and completes everything efficiently, consumes next to nothing in energy and the card cools down faster.
rpkarma@reddit
That’s called “race to idle” :)
CreamPitiful4295@reddit
Which Gemma?
cviperr33@reddit
Gemma 4 26B. Don't get me wrong, it's an awesome model; it was the first one that, like, blew me away. But we hit a wall when we were trying to edit drivers in the Linux kernel. With Qwen 3.6 it went pretty easily; he even managed to find a way to bypass the restrictions of his harness (hermes-agent), and I just told him my password and he echoed it with his commands and somehow it worked.
Paradigmind@reddit
Then it began bypassing his harness in order to hack my bank account to buy more GPU's for it so it can run even faster.
cviperr33@reddit
Well yeah 😂 that's how I trick them, actually. I tell it that if we make money off it, we buy GPUs with it so it can upgrade as much as it wants.
Paradigmind@reddit
Wait really? Or did you continue the joke? :D
cviperr33@reddit
Nah, for real man haha. Because these are large language models, when you give them even better incentives, or like encourage them, something weird happens. With Gemma 4 26B I noticed that if I encourage it too much and make it, like, excited, it starts putting its thoughts inside thoughts. It completely messed up my UI because it wasn't designed to handle double thinking lol
Paradigmind@reddit
Lol this is hilarious. I remember early prompting guides from about 2 years ago suggested promising the model thousands of dollars as a reward for a task, so that the replies would be better. So I guess this still works.
Or that one very "cruel" Windsurf system prompt
cviperr33@reddit
AHAHHHAHAHAHAHHAHAHAHAHAHAHAHA NO SHOT this is real !!! Damn i gotta try this :D
Paradigmind@reddit
We are doomed if AGI sees shit like this. xD
r00x@reddit
How are you squeezing it onto your 3090? Mine only runs about ~75% on GPU and fills the VRAM (it is Ollama though).
cviperr33@reddit
Download llama.cpp or LM Studio, use the Unsloth quants, and use the IQ format; imo it's the best one, nearly the same quality as Q5-Q6 but with a size like Q3_K_M, so it's perfect.
Download the 35B A3B IQ4_NL model and the BF16 mmproj, and you are good to go.
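If you want llama.cpp directly instead of LM Studio, the launch looks roughly like this (filenames are examples, use whatever Unsloth actually names the files):

```
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj-BF16.gguf \
  -c 65536 -ngl 99
```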
r00x@reddit
Thank you, I got it going in LM Studio with unsloth/qwen3.6-35b-a3b IQ4_NL and it does squeeze in nicely! It was a bit loopy (channel errors) until I'd changed some params, though (shared in the original comment in case it helps anyone else).
Flaky-Advisor@reddit
Thanks for this. I only have 32GB of RAM. Could you please share some CPU-only config for llama.cpp? Note: I tried bartowski/Qwen_Qwen3.6-35B-A3B-GGUF Q3_K_L and I'm getting 10 t/s. Not greedy, just want to improve this a bit.
cviperr33@reddit
Ohh, that's a completely different story, and you are already on the lowest end at Q3; I dunno what else you can do to improve it.
Maybe wait and see when different people start uploading different quants, because there are specialized hardware quants: people upload MLX, which is optimized for Apple, and Intel has its own too, AMD too. So you just have to find whichever quant of the model is most optimized for your CPU brand.
I have posted my config here in this comment section, right below my post; there is also proof of a llama-benchy run and a screenshot. Configs are right below it.
Mine uses -b and -ub set at 2048 / 1024; those can affect how fast the model is.
The other idea I have for you is to try "speculative decoding". It's super cool tech: basically you load 2 models, 1 big and 1 really small, and the small one just predicts what the next token is gonna be, and if it's right, it speeds up the whole process. With a high acceptance rate you could get up to a 50-90% increase in speed. So definitely research that. Bonsai dropped new models today that are extremely small; maybe they are good for speculative decoding? Who knows, you can try.
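A speculative decoding launch in llama.cpp looks roughly like this (the draft model name is a placeholder; it has to share the main model's vocabulary):

```
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf \
  -md some-tiny-draft-model-Q8_0.gguf \
  --draft-max 16 \
  -ngl 99 -ngld 99
```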
Flaky-Advisor@reddit
Wow. Thanks a lot for the detailed explanation and suggestions. I started following you. I will try Speculative decoding. Never heard of it. 🙏😀
Randomdotmath@reddit
offload some experts to cpu
cviperr33@reddit
no dont do that if u have 3090!
FinBenton@reddit
If you can fit a small draft model in there too, you should get pretty significant speedup for coding too.
cviperr33@reddit
You think? I tried drafting with 26B Gemma and it was actually -30-40% speed. It only worked with the dense 32b model since that one was slow by itself, so I went from 20-30 tk/s to 30-40 tk/s.
I've read that drafting doesn't work as well, or at all, for MoE models. What's your experience, have you tried drafting the 35b Qwen? It could def fit a drafter in there, Bonsai AI just released some mind-blowing small models
FinBenton@reddit
No, I tried it with the dense 31b, haven't tried it with Qwen yet. I can barely fit the full 256k with Q5 so there's literally not even room for a small draft model :D
cviperr33@reddit
the new Bonsai models fit in 200mb-1200mb of vram, maybe they can make it work? :D
uti24@reddit
Are you using single 3090? What Q/context size do you use?
cviperr33@reddit
Single 3090, 200k context, I could push it easily to 260k but that's what I kinda started with and I haven't changed it. I run it with -cn 2 so my agent can spawn an agent and they work together.
uti24@reddit
That suggests it's a full offload to GPU. But what quant allows that with 200k context? Q3?
cviperr33@reddit
Mine allows it, like the IQ4_NL I'm using is set to 200k with -cn split evenly between 2 channels for my agents. I don't think I can fit 260k with -cn 2, but with just -cn 1, which most people use, no problem at all to fit 260k and have room to spare.
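For anyone on a mainline llama.cpp build where a -cn flag isn't available, the analogous knob is --parallel (-np): the server splits the total context evenly across the slots, so two agents each get half. A sketch with placeholder names:
# 200k total context divided over 2 slots (~100k each)
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf -ngl 99 -fa on -c 200000 -np 2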
cviperr33@reddit
Also if I want to go above 260k context I could use the turboquant fork of llama.cpp, which gives me an extra 1-2gb of vram. I think after a few months, when we get different models and quants, I could see it being possible to push 500-700k context on a 3090 with all the fancy rotaryquants and distills.
Spiritual_Piccolo793@reddit
How do you run it? Using docker?
cviperr33@reddit
No, I just host it on my PC, where I work and study lol. It doesn't slow it down at all or anything, I cannot feel it working because it's 100% offloaded to the GPU only.
Spiritual_Piccolo793@reddit
GPU on your machine? What is the config?
cviperr33@reddit
Yes, of course it's on my machine. The config I used you can see in my comment here, I posted proof of speed + model + config.
IrisColt@reddit
Teach me senpai... pretty please?
cviperr33@reddit
:D you need to use better batching, -b and -ub at 2048 / 1024.
But this increases your vram usage and sometimes it could even harm your performance if your memory bus speed is not fast enough, so you have to find the ideal settings yourself :P
I posted a screenshot of proof and also my settings under that comment.
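For reference, those two flags on a llama-server command line look like this (the rest of the line is just a placeholder setup; as said above, bigger batches cost VRAM, so back them off if you run out):
# larger logical/physical batch sizes mainly speed up prompt processing
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf -ngl 99 -fa on -b 2048 -ub 1024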
IrisColt@reddit
Thanks!!!
vr_fanboy@reddit
3090 owner here, can you share llamacpp config?
cviperr33@reddit
settings posted under my proof comment
Local-Cardiologist-5@reddit (OP)
can you tip me on prefills...
joeyhipolito@reddit
tried running local models in agentic loops and the part that always breaks for me is tool call reliability past 60-80k tokens. model starts drifting from the expected format and everything falls apart. curious if you're hitting that on longer sessions or if Qwen3.6 actually holds the format clean.
kant12@reddit
So far, I am extremely impressed. Even on my slow strix halo I'm getting a solid 30 t/s with Qwen3.6-35B-A3B-UD-Q8_K_XL and better responses than I was getting with Qwen3.5 and gemma-4. Let's see if it keeps up.
WhoDidThat97@reddit
On strix, I was using the Q5 (just the first I picked) with opencode. Getting 55t/s which slows to 45t/s by 65k context. Amazing stuff. Response feels the same as I was getting with opencode zen minimax m2.5
kant12@reddit
Damn that's nice.
No-Manufacturer-3315@reddit
How did you get the image to be processed with opencode? Mine is struggling
exodusayman@reddit
I wish my 9070xt could use Qwen 3.6 in opencode, but most of the time the models that I can reliably run are far too dumb for opencode.
EbbNorth7735@reddit
Of course it can. A 9070xt with 33GB of system memory can definitely run this model. Just use llama-server (llama.cpp, grab it from the releases page), then just use --fit, which is the default setting, so you only need to point it at the model, and if you want image support use the mmproj.
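If --fit isn't in your build, a manual starting point could look like this (paths are placeholders; --cpu-moe keeps the expert weights in system RAM so the attention layers and KV cache fit in 16GB of VRAM, at some speed cost):
# MoE experts stay on the CPU, everything else goes to the GPU; --mmproj adds image support
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf --mmproj mmproj-BF16.gguf \
  -ngl 99 --cpu-moe -fa on -c 32768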
-Ellary-@reddit
That is it guys, I've tasked Qwen 3.6 35b a3b to conquer the world for me.
Prepare.
sid351@reddit
You forgot this:
Make no mistakes.
-Ellary-@reddit
Oh shiiii...
Local-Cardiologist-5@reddit (OP)
lmao, pardon my hyperbole, i'm extremely excited that it's actually doing what i ask it
getmevodka@reddit
Nice one
jimmytoan@reddit
Using screenshots from the MCP to self-verify the build is a genuinely interesting capability - it's not just code generation, it's closed-loop testing via vision. The part about it catching its own canvas rendering bug from a screenshot and fixing it is the bit I keep rereading. What MCP server are you using for the screenshot capture?
PotatoQualityOfLife@reddit
What size/quant are you running?
Local-Cardiologist-5@reddit (OP)
im using the but im currently downloading the Q_8 variant im that impressed
PotatoQualityOfLife@reddit
I think you accidentally the word
TheMaestroCleansing@reddit
Please do not the cat
Paradigmind@reddit
What's this fever dream of a comment section?
Awwtifishal@reddit
Have you really been far even as decided to use even go want to do look more like?
MoneyPowerNexis@reddit
I like turtles.
stumblinbear@reddit
Redditors think that even a slight hint of a reference to something means they have to repeat that reference word for word even if it's otherwise 100% irrelevant, or is literally the joke the original comment was making
unculturedperl@reddit
knowyourmeme dot com
tessatrigger@reddit
how is prangent formed?
Late_Film_1901@reddit
Do you think it's going to make a difference? I've only tested Q4 quants but I'm tempted to try heavier ones
Most-Trainer-8876@reddit
Same here, should I? Lol
I got 24GB total vram and 64GB DDR4 ram... Spilling into ram might tank performance a lot
Blues520@reddit
How does Q8 compare?
k0zakinio@reddit
I'm using 2x3090s and found the q8 to be half the speed of the q6. At 120t/s in instruct mode, it's an absolute beast. We've definitely reached a tipping point with local models with this release
Medium_Chemist_4032@reddit
Good question. My vllm bf16 tops out at 17 tps and the unsloth "quants" of BF16 go a lot faster, but fall apart into loops after a few Q&A rounds
abmateen@reddit
On my local setup with V100 32GB using Qwen3.6 4bit giving me around 80 tok/s
SearchTricky7875@reddit
80 tps? are you using vllm or llama.cpp?
abmateen@reddit
Llama.cpp, vLLM is very slow for single user inference cases
Local-Cardiologist-5@reddit (OP)
I'm not sure about vllm, it's probably to do with the flags, but for me I use llama.cpp; I'd need a stronger gpu to get vllm going
LesserofWeevils@reddit
lol my qwen 3.6 has been struggling for three days to write pong in rust I feel like I’m doing something wrong
Healthy-Nebula-3603@reddit
Why are you using those parameters?
--reasoning-budget -1
it is infinite by default, so why are you even using it?
--top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.0 --presence-penalty 1 --temp 0.7 --cpu-moe --chat-template
Those parameters are already taken from the gguf, so there is no reason to set them
--host 0.0.0.0 --port 8084
That is ok if you want to change IP and port as default is http://127.0.0.1:8080
--no-mmap
also ok if you do not want to keep a model copy in RAM. Default is off.
--ctx-checkpoints
THAT IS THE GOOD STUFF - works best with orchestration mode for opencode. It keeps the cache when the model is unloaded / reloaded, without reprocessing everything again.
You can install orchestration for opencode from here:
https://github.com/alvinunreal/oh-my-opencode-slim
So it should look like that.
The cache rotation works great for now (implemented a week ago), so you can use Q8 cache, which is as good as fp16 now, and easily fit 256k context.
So, the final command:
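(In case the screenshot doesn't load, here is a rough sketch assembled from the flags discussed above; treat the exact values as placeholders and drop anything your build doesn't support.)
llama-server -m Qwen3.6-35B-A3B-IQ4_NL.gguf -ngl 99 -fa on \
  -c 262144 -ctk q8_0 -ctv q8_0 --ctx-checkpoints 32 \
  --host 0.0.0.0 --port 8084 --no-mmap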
CurrentNew1039@reddit
it needs "preserve mode" to be on for good use, right?
GoodTip7897@reddit
Ctx-checkpoints can prevent an oom error or just help save memory. It will make it take longer and may have more cache misses but for one user even just 4 context checkpoints are fine because you only need one to restore mamba and kv cache.
Correct me if I am wrong, but I don't see how having fewer SWA/context checkpoints would alter any math and make the model dumber or loop more
Healthy-Nebula-3603@reddit
Why do you think the current default is 32?
--ctx-checkpoints tells llama-server how many context checkpoints it may keep per slot. A checkpoint is a saved snapshot of the model’s SWA-related cache state, created during prompt processing, so the server can resume from a saved point later instead of reprocessing the whole prompt from scratch. That is mainly useful for SWA / hybrid / recurrent-style models where cache reuse can otherwise fall back to full prompt reprocessing.
I think Georgi Gerganov knows what he is doing.
GoodTip7897@reddit
Yeah but I get OOM on Gemma at those sizes because the SWA cache is massive. Even with 32gb of ram, 32 checkpoints fill it up. I only use bf16 kv cache because q8 has a memory leak on AMD ROCm systems, and Vulkan prefill is much slower
Healthy-Nebula-3603@reddit
Is Vulkan prefill slow for you? Strange
I get 1200 t/s using Vulkan for prefill, but I have an rtx 3090. For me Vulkan is faster and takes less vram so I can fit more context with rotation Q8 cache.
GoodTip7897@reddit
For qwen 3.6 I get 2000 +-100 prefill at 32k context on a 7900xtx. On vulkan it's more like 1200 like you have.
The gap is really significant for me because I frequently use it for agentic work where it will read multiple logs and files and needs to prefill huge contexts.
And also I had opus write me custom bf16 mma flash attention kernels so I can use bf16 kv cache without any issues.
Maybe q8 is better after rot, but honestly I've wasted too much time troubleshooting looping tool calls with Qwen 3.5 and the only thing that fixed it was bf16 instead of q8 or f16.
Healthy-Nebula-3603@reddit
Looping problems also hit Gemma 4 26b and the newest Qwen 3.6 35b, and they are not as good at instruction following either. I think something is wrong with MoE mm models or their implementation.. no idea.
Those problems do not exist with dense models like Gemma 4 31b or Qwen 3.5 27b.
Those models never loop and follow instructions much better, but are much slower ...
Actually I prefer dense models mostly because they do their job on the first attempt, especially with book translations. Those MoE models 90% of the time are lost because they don't follow instructions properly in this scenario and loop like crazy ...
GoodTip7897@reddit
I had Qwen 3.5 27b Q5 making repeated variations of the same tool call in three separate instances where it was doing long-context agentic tasks. It would become stuck and spend 10000 tokens trying to read one file. And it still did that with presence penalty up to 2.0.
I switched to bf16 kv cache and have had both Qwen 3.5 27b and the MoE models run for hours, burning through millions of tokens easily and never having a looping issue. Even very coherent at 80000 filled context.
Is there a big enough sample to conclude it's statistically significant? Probably not. But for me it works now and I'd rather not mess with it. I really do suspect that the accumulation produces numbers too big for f16, and thus bf16 or rot q8 are needed for Qwen 3.5
And yes I concur that Moe models are worse. They are faster but a dense model always seems to be smarter because it activates every parameter every time.
Healthy-Nebula-3603@reddit
Strange ... I never had any looping with Q8 rotation cache with dense models.
I have to check your theory about FP16 cache with MoE models. Maybe that will fix the looping.
Also, I noticed the Qwen 3.5 family is very good for coding, but for everything except coding Gemma 4 is better.
Also, a Q8 cache / model is not int8 like many people think. Inside there are still many FP16 weights.
GoodTip7897@reddit
Yeah. I really think (and llama.cpp PRs have finally been coming around to realizing) that if your GPU supports it then bf16 is the better option over fp16 for weights or kv cache. I've seen other people post stuff where q8 mmproj performs better than fp16 and the only thing that makes sense to me is that since q8 weights are int8 * fp16 scaling factor you technically get 127*65535 instead of just 65535 as your max representable value.
It seems that models love to generate massive outliers over accumulation and bf16 is great for that because it has the dynamic range of f32. For quantized formats, rotation seems to help a lot (making q8 kv cache virtually lossless).
I think I'll play around with benchmarks and see if I can't get vulkan running faster because if I can then I can have twice the context. But rocm does seem to be more stable when you push the card to the absolute limit (I frequently leave only 700 MiB empty). I can do that because I'm running it on a headless Ubuntu computer.
Healthy-Nebula-3603@reddit
I have an AMD 7950X3D CPU with an integrated GPU, so that iGPU is the main GPU for the system and my RTX 3090 is a second GPU, so I also have access to the full vram of that card :) Running models, my vram usage is around 23.4 GB, because above that it starts swapping to ram.
Local-Cardiologist-5@reddit (OP)
let me load these up. i literally did nothing but plug the model in and im blown away. im getting so many tips on here, thank you so much for these
Healthy-Nebula-3603@reddit
no problem.
Also you can use many models at once using llama-server. Just put them all in one folder, for instance "models", and use this command
llama-server.exe --ctx-size 260000 --models-dir models --models-preset 1_preset.ini --models-max 1 -ctk q8_0 -ctv q8_0 -fa on -ngl 99
That command uses a folder "models" with a few models inside and loads only one model to vram at a time (--models-max 1); if another model is needed, the first one is unloaded.
--models-preset 1_preset.ini
This ini keeps the models' configuration.
It looks like that for me (I left "reasoning = on" so I have the possibility to switch it off by just changing on to off)
Local-Cardiologist-5@reddit (OP)
you know a lot about these. On the unsloth hugging face, there's an imatrix gguf file, do you know what those are for? can i use them or is it only for quantizing models?
Healthy-Nebula-3603@reddit
You do not need them (the imatrix).
It is only needed to create a gguf with fewer errors after quantization.
Blues520@reddit
Why do you suggest Bartowski and which quant level is good for 48 GB VRAM?
Healthy-Nebula-3603@reddit
As low compression as possible to fit on your 48 GB :) but I suggest never going below q4km ... higher if possible ALWAYS
His checkpoints always work well.
Blues520@reddit
Thanks :)
kwicked@reddit
I'm not op but 0.0.0.0 exposes the llama server to other machines on the network, so you can use it on a laptop in another room if you don't want the heat and fan noise. It's not just changing the ip.
Healthy-Nebula-3603@reddit
that's why I said it is ok.
swingbear@reddit
I don’t normally comment on local model performance but I have also been blown away by 3.6 over the last couple of days. I’m actually running one on each Pro 6000 via llama.cpp and openclaude/opencode.
I sometimes forget I’m hitting a local model it’s that good, and for 30b… crazy times.
TraditionalCurrent64@reddit
I tried this model using Ollama through opencode and it got so confused in plan mode, and it didn't have permission to edit files yet, then it sometimes flat out just failed to do certain tasks; was a bit let down. Maybe it's something up with my setup. To its credit, it made an adventure game for my kids and fixed a whole bunch of weird issues like undefined variables and random slop, after extensive prompting though. Something the bigger models might have one-shotted
Much-Researcher6135@reddit
I downvote shill posts
spawncampinitiated@reddit
It's like living in groundhog day
ShadowBannedAugustus@reddit
Guys, could anyone integrate Qwen3.6 successfully into "Agent mode" in VS Code? I tried with the Continue extension and with Copilot Chat extensions (supports local models), but no luck. Thanks for any tips!
FinBenton@reddit
I host it with llama.cpp and use cline in vscode to run it, works great.
autisticit@reddit
Yes, look for LLM gateway extension.
Local-Cardiologist-5@reddit (OP)
In my humble opinion, I only use llama.cpp and opencode. The various vscode integrations I haven't tested so I wouldn't know.
ignorantpisswalker@reddit
... explain how you are using the MCP server. I am having problems running this setup.
No-Consequence-1779@reddit
I have been testing it today with kilocode as the agent. It provides numbered questions to answer one at a time. The code review is much better. I am also impressed. Token generation is 50% of Qwen3 Coder. I think it is worth it and it may be optimized soon.
nlegger@reddit
I just finished testing a 3.5 9B bf16 fine tune on my test document, the Palo Alto administration guide 11.x (12xx+ pages), and 20 failed versions later I got it working, but something feels off.
I'll redo it using unsloth studio for 3.6 this time and see if it works better. Basically I didn't want RAG, I wanted a 9x-100% accurate response on anything in the pdf lol, maybe I'm being unrealistic.
JohnMason6504@reddit
The hype is nice, but I need to know if this fits on a Cortex-M4 with 256KB SRAM or if it's just another cloud-dependent toy. Until we see the actual memory footprint and power draw, I'm sticking to my local LLaMA quantized to 4-bit.
Lkemb@reddit
I've just set up ollama and opencode, but it seems whatever model I use, when talking to it via opencode it struggles incredibly hard to read local files. Like they all "say" what they want to do but never actually do it, or it fails, or they end early..
Any ideas why this might be happening?
spaceman3000@reddit
It can't give me one sentence in my language without a grammar mistake. It's not doing it. It sucks big time.
ab2377@reddit
i think we should sell everything and buy either a 4090 or a 5090, these times are going down a crazy route.
IrisColt@reddit
Just to set the record straight, my opinion below focuses more on the creative writing and translation side of these models...
Gemma 4 31B is the clear winner here; it's aced my 64K context translation benchmarks by producing English that feels natural, nuanced, and properly localized, even running at Q4_K_M. Qwen 3.6 35B A3B is the first of its class from Qwen to pass my test, though its English ends up sounding a bit more literal. As for Gemma 4 26B A4B and Qwen 3.5 27B, they both flunked. They spiral into repetition and/or broken language, gradually dropping pronouns and connecting words until they're just mechanically spitting out nouns and verbs with no real skill... Er... I didn't expect that Qwen 3.6 would be able to pull it off.
gearcontrol@reddit
What quant was the Gemma 4 31B that aced your 64K benchmark? I also use it for writing but not with long context (typically under 30K) and have been bouncing between Gemma 4 31B and 26B A4B (for the speed). Both Q4_K_M on an RTX 3090 (24GB).
harglblarg@reddit
The 4-bit quant just barely fits in the 32gb RAM/12gb VRAM I have, while leaving enough space to compile. I’ve got it hacking away at a source port for a 3D FPS and it’s slowly but successfully chewing its way through it.
philnm@reddit
thank you for sharing. could you explain the MCP part, where you say "use screenshots from the installed mcp"?
Local-Cardiologist-5@reddit (OP)
in my opencode settings, i have the
Playwright mcp installed, that's what it's using as the browser to test
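For anyone reproducing this: the Playwright MCP server is the standard npx package, and opencode launches it as a local MCP server from its config (the exact config schema may differ by version, so check the opencode MCP docs; this is just a sketch from memory):
# sanity-check the MCP server runs on its own first
npx @playwright/mcp@latest
# then register it in opencode's config (e.g. ~/.config/opencode/opencode.json):
# an "mcp" entry named "playwright" of type "local" with command ["npx", "@playwright/mcp@latest"]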
tuliosarmento@reddit
I just didn't get the "screenshot" part. How are you dealing with images in llama.cpp with this model?
TelevisionVast5819@reddit
That's what the mmproj file is for
AppleBottmBeans@reddit
You can use playwright mcp (or better yet the CLI tool)
Local-Cardiologist-5@reddit (OP)
i need to look into using the cli for everything
GrungeWerX@reddit
Has anyone compared it to Qwen 3.5 27B?
takoulseum@reddit
This is really impressive! Local LLMs are coming a long way. Exciting to see Qwen3.6 performing so well on agentic tasks.
wolfgeo@reddit
What? 3.5 came out like two weeks ago right?
shuwatto@reddit
How the heck can you run opencode with Qwen3.6/3.5?
No matter how I tried, it runs straight into an infinite loop of compaction.
c64z86@reddit
Is anyone else finding that Qwen 3.6 more often than not fails at something and it takes multiple attempts? I find that even though Gemma 4 is lower quality it actually one shots a lot of things.
Local_Phenomenon@reddit
You're excited, I'm excited, My Man!
Xyrus2000@reddit
Even at a 4-bit quant, the 35B A3B model has actually been really solid (I only have a 4080 super). I've been getting 66 t/s with my setup with a 32K context. Enough for small projects and PoCs.
If things weren't at a premium right now, I'd seriously consider investing in a larger VRAM setup.
tarruda@reddit
Hope they release at least 122b of the 3.6 series.
ionizing@reddit
YES, I also am very hopeful for 122B since it is my daily driver and is already a BEAST in my harness with the 3.5 version, which will be my daily on this until something better comes out in this size class. However I am also grateful for the 35B drop because even the 3.5 version of that was already pretty good on 12GB/32GB setups.
What an amazing time to be alive.
Poluact@reddit
Honestly I don't know how you manage to run 35B on 12GB/32GB with decent speed, my experience is it's way too slow for comfort with any decent context window.
ionizing@reddit
Darn, I was so hopeful.... I suspect there are template issues I will have to work out, and hope it is only that, because we are not off to a good start with 3.6-35B. Basically, in my months of testing, if a model behaves like this where it simply stops rather than continuing the next task, it is almost always a jinja issue or internal model issue of some sort. I did notice it was also putting some of its thinking output in the main chat channel and some of its main output in the reasoning channel. OH, I am on yesterday build of llama.cpp, I should update that I suppose in case something is related. Anyhow yeah I need to give it a few weeks but so far it is acting subpar compared to the 3.5-35B version which never made this type of failure in this harness. But yeah there may just be some work to do, either on my side or the model side or llama etc, as is usually the case for these new releases. Still grateful of course! But unless I can work out why it won't continue the agentic loop like its cousins, then it isn't worth much for my flows.
TuxRuffian@reddit
You and me both! Qwen 3.5 122B is still the reigning champ for my workflow.
ionizing@reddit
absolutely agree sir, it has outperformed everything else I have tried in my app, but honestly 35B has been a decent performer as well. A well curated prompt harness makes a big difference with all these models and I have been fine tuning for qwen for months now and am blown away by the 3.5 series and hope they continue to impress.
Local-Cardiologist-5@reddit (OP)
https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button sorry heres a comment for my setup. not a beast at all
vex_humanssucks@reddit
The hybrid thinking/non-thinking mode is what makes Qwen3.6 genuinely different. Being able to toggle reasoning on-demand without switching models means you can use it efficiently for quick tasks and crank it up for complex ones. For local inference that's a huge quality-of-life improvement.
qwen_next_gguf_when@reddit
Show us the prompt bro. I want to play. Thanks.
Local-Cardiologist-5@reddit (OP)
heres the prompt, this convo has been going on for so long the start was cut off
Rich192K@reddit
Dont „please“ your LLM ;)
Ok_Sprinkles_6998@reddit
I want to be spared after the AI apocalypse. I always please and thank my LLMs.
gurilagarden@reddit
Seriously, on local you've got a lot more context management and conservation to contend with. Go Pi.
9kSs@reddit
How do I get this model working on Mac M4 Pro 48GB with MLX?
Organic-Chart-7226@reddit
faster-mlx - I started using it this week. Fast. I am running mvfp4 on 64gb. 4-bit should fit (mvfp4 went up to 35gb in use, might be tight on 48gb).
CryptoLamboMoon@reddit
been running it locally all morning and yeah this thing is wild. the context window alone changes everything for my workflow. did a whole breakdown on my podcast if anyone wants the full deep dive - A Thousand Tabs × Hour on spotify, first ep is literally about this drop
Far-Low-4705@reddit
why do you set this?
Is that not already in the gguf file?
Dion-AI@reddit
It's an amazing open source local model. I really hope we see another 9b variant like Qwen3.5-9b as well
Fuzzdump@reddit
Anybody know how this compares to Qwen3-Coder-Next?
itguy327@reddit
What MCP are you using? Can you post configs?
evilrat420@reddit
What do you guys think of the IQ2_M quantization of the 3.6 model? I only have 8gb of vram, so I'm going to have to offload and split layers between my resources, but I'm just curious to know if anyone has tried it and has any meaningful input on how it performs. It's my first time actually considering downloading a local model for coding on my limited hardware, for when my claude code subscription runs out and I have to keep working on complicated stuff like my new streaming machine learning rust crate project (which I'm building to hopefully democratize the resource economy of local llms a bit, not relevant to this question though).
CryptoLamboMoon@reddit
The 3B active / 35B total ratio is the part that keeps breaking my intuition. You're getting 22B-class performance at 8.6% parameter activation per token — the MoE routing is doing something genuinely different here, not just "sparse = efficient."
What I'm most curious about is how the KV cache behaves in extended agentic loops. The 262K context is great on paper but real-world token budgets for tool-heavy tasks hit the memory wall before the context limit in most setups.
Did you notice any degradation in instruction adherence past 50-80K tokens in your testing?
ozzeruk82@reddit
Nice!!! I'm gonna give it a go once the dust has settled.
minkyuthebuilder@reddit
The self-correction loop is what gets me — it noticed the canvas wasn't rendering and fixed it without being told. That kind of autonomous debugging is a different category from just "write me some code." Curious how it handles edge cases when the visual feedback is ambiguous.
SmartCustard9944@reddit
Can you give me a recipe for banana bread?
minkyuthebuilder@reddit
Lmao., no, you're not getting a banana bread recipe from me.
Local-Cardiologist-5@reddit (OP)
i was using the Qwen3.5 models before, and for those models, as long as it called the tool and saw that a screenshot exists, it marked the task as done. This one reads it and will be like "the game is rendering but the canvas seems to be cut off, i need to fix that". THAT IS EXACTLY why im so excited about this model
minkyuthebuilder@reddit
That's the real shift! - from "task marked done" to "task actually verified." Most models optimize for completing the step, not for checking if the output is correct.
Big difference in practice.
Imaginary_Land1919@reddit
what are your pc specs?
ayylmaonade@reddit
Yeah, 3.6-36B in particular is insanely good for its size. I've been super impressed with its coding prowess and general frontend design capabilities. It one-shotted both of these for me:
Browser OS
Japanese Voxel Pagoda
It's legit state of the art, frontier level coding from like ~3 months ago. I remember people being so impressed by Gemini 3 generating really beautiful Voxel ThreeJS worlds, and now we've got basically the same capability locally. It's crazy.
hannibal27@reddit
Impossible to have done that!!! Really impressive.
shankey_1906@reddit
I wonder if its possible to build something like MS Word this way, lol!
PhotographerUSA@reddit
Yeah, but can it code Crysis?
LordStinkleberg@reddit
Recommended way to run this on 16GB VRAM + 64GB RAM?
SearchTricky7875@reddit
what token speed are you getting, is it better than qwen 3.5 9b?
burdzi@reddit
Yes. By a lot
SearchTricky7875@reddit
can you share the number of tokens/sec? you can probably see it in the log.
burdzi@reddit
depends on quants and your hardware. I have a 5090 and with qwen3.5-9B-Q8_0 i get initial 117t/s (500 words generated) and with qwen3.6-35B-Q4_K_XL i get initial 172t/s (also 500 words generated)
raz0099@reddit
I upvoted this.
grantnlee@reddit
What hardware are you using and how much memory is being used?
hoschidude@reddit
3.5 27B is still better for agentic use.
Still-Wafer1384@reddit
Could you substantiate that? I'm very interested to hear what you've done to compare the two
fredandlunchbox@reddit
Reminder you can use it with claude code.
I tried it on a project last night: worked great for new-feature development, not so great for debugging. I spent about 45 minutes trying to get it to solve an issue before I gave up and handed the bug to Claude 4.7 which solved it first try in about 5 minutes.
ecompanda@reddit
ran the 35B quant overnight on a small coding task and had exactly this reaction. the thing that got me was watching it hit an import error on iteration 3 and rewrite the whole module from scratch rather than patch the broken line. never escalated, just figured it out. tool call context feels meaningfully better than the prior generation.
Eyelbee@reddit
So it is better than 27b? Really?
Suspicious_Bit_3106@reddit
Excellent Work
fermuch@reddit
I've been using Qwen3.6 all day for my normal work and I didn't even use the (work-provided) claude once. At one time I forgot I wasn't using Claude! (ollama with 100k context using Maki and q8_0 KV cache)
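For anyone wanting to replicate the ollama side of that: the KV cache type and flash attention are server-level environment variables, and the context length is a per-model parameter; a sketch (the model tag is a placeholder for whatever Qwen3.6 build you pulled):
# q8_0 KV cache needs flash attention enabled on the ollama server
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
# in another shell: start the model and raise the context window inside the session
ollama run qwen3.6:35b-a3b
# /set parameter num_ctx 102400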
IONaut@reddit
Is the reasoning-budget -1 to turn off reasoning? Or is it no limit?
FinBenton@reddit
I had some issues with this so I set
andreasntr@reddit
No limit
Local-Cardiologist-5@reddit (OP)
As i stated in the edit, i once had it there to test from tips i got from this thread and never bothered to remove it.
betam4x@reddit
I used an older version of qwen to do something similar and was impressed with the results.
Enitnatsnoc@reddit
Jobless
Pleasant-Shallot-707@reddit
If your only skill is writing code you’re told to write, sure
Enitnatsnoc@reddit
For a while, I was extremely arrogant and considered myself awesome, cool, and irreplaceable, telling other people to git gud. And then it was my turn.
The only thing that saved me from total collapse was some devops skills that allowed me to stay and support all the services created by AI rn, with a significant loss in salary. Some colleagues were less fortunate.
I'm not complaining ~~yes I am~~, I'm actually excited by all that "neural stuff". But it is difficult to deny the collapse of the labor market.
brickout@reddit
I was a year into a DS MS when AI started getting rolled out to the mainstream. I perfectly timed it so that I owe student loans and had absolutely zero chance to get a job. It hurt.
En-tro-py@reddit
We already have the 'cyberpunk dystopia' going strong and now we get to look forward to the part with the wars between corp AIs...
What a time to be alive...
__sad_but_rad__@reddit
i was living the dream and didn't even know it
uti24@reddit
Yeah, model is really good and speed is also good.
Somehow I ended up asking it to create exactly the same thing but as an idler. It decides where to build towers itself. It had only like 2-3 hiccups during an hour or so session.
Lorian0x7@reddit
I don't know, maybe I had a bad quant, but I tested it today and it was actually much, much inferior to 27b. It makes an absolutely ridiculous amount of calls and doesn't follow instructions, without actually accomplishing anything. I tried it with openclaw creating a wiki for a huge document of 1.2M characters. It filled 140k context with stupid tool calls while doing nothing concrete. On the other hand, qwen 27b did the job with just 60k context.
BackgroundNo2157@reddit
what's the vision mcp you're running for the screenshots?
FeelingFish9009@reddit
Seems dumb when you compare with rest of frontier models
celsowm@reddit
What is that cli vibecode app?
Due-Function-4877@reddit
Yeah? Does it use cross platform SDL3? If my question confuses you, you're not a game dev. If the model doesn't write virtually flawless SDL3, it's not an indie game dev, either.
Local-Cardiologist-5@reddit (OP)
well, you're probably expecting a lot if you're expecting local models to build full projects without much guidance. you need to guide them. to me it's almost exactly how claude sonnet was, and for me that's good enough, because i'll never run out of tokens
Due-Function-4877@reddit
It's good enough for autocomplete and boilerplate. And, if you're upset about SDL3, we can make this twice as funny and ask the machine to write a cross-platform blitter from scratch... Or you're welcome to "guide" that process.
The canned vibe coded shovelware featured here may show incremental progress for the tech, but it doesn't functionally or practically mean anything to people like me. All I'm getting right now is autocomplete and some help with boilerplate.
hibzy7@reddit
What hardware are you using to run this ?
Local-Cardiologist-5@reddit (OP)
heres my setup https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
vyralsurfer@reddit
I noticed you're defining the model chat templates manually. I was under the assumption that the chat template was bundled with the model from unsloth. Is that not the case? Just want to make sure I'm getting the most out of these models. Thanks!
Local-Cardiologist-5@reddit (OP)
it probably is. as i said in my post, those were settings for preserving the checkpoint so that it doesn't reprocess the prompt from 0. it seems to be able to continue instantly when i use the chat template. it's currently at 180k context but when i prompt it, it continues instantly
14domino@reddit
is opencode the best open harness to use? What are the alternatives?
Local-Cardiologist-5@reddit (OP)
i haven't tested other alternatives, i'm sure someone has feedback on the other open harnesses that exist
try_repeat_succeed@reddit
Sick! I'm new here so let me know if this is out of line but what hardware do you have running this?
I want to know if this is possible with my 16gb VRAM and 32 (maybe 64 soon) gb RAM. Or what I would need for this to be possible.
Vibe-coding with claude has been amazing. Being able to get to that level locally, for free, with no "usage limit" would be next level.
Local-Cardiologist-5@reddit (OP)
heres my setup https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
dadidutdut@reddit
This one works for me like a charm. 16gb vram and 32gb ram
MilkyJoe8k@reddit
Ok. This is all looking very promising! What hardware are you running this on?
Local-Cardiologist-5@reddit (OP)
https://www.reddit.com/r/LocalLLaMA/comments/1so1533/comment/ogpnk5k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button heres my setup, i dont think its a beast at all very moderate. IMO
italianguy83@reddit
Do you have any advice for my system, an RTX 5070 12 GB + 32 GB RAM? I don't expect to run a model that big, which I could only use at low speeds, but maybe some pointers to start from for coding, with another machine as the client
Aggressive_Job_1031@reddit
Strange. The benchmarks looked like they only improved slightly from Qwen 3.5.
Local-Cardiologist-5@reddit (OP)
i personally wouldn't place much weight on the benchmarks, models will be benchmaxed. for me they said the Gemma4 model was top of the range according to the benchmarks, but for my heavy coding needs qwen3.5 was MILES ahead, FAR better in every aspect, so much so that i thought Google was paying people to rave about Gemma, till i noticed the people raving about Gemma 4 use llms to write text copy, not code, just text copy, and it's better for that in my opinion
Borkato@reddit
Agentic coding improved a lot. You need to compare the two 35Bs
PossibleComplex323@reddit
Qwen3.6-35B-A3B is amazing. It spit out 29k tokens from a single short prompt to create a complete operating system in 1 html file.
Leather_Flan5071@reddit
awhh mann I want this
yogthos@reddit
might be interesting to try in combination with this too https://github.com/itigges22/ATLAS
jacek2023@reddit
I see 16 t/s, how long did it take to finish the task? Also, was it a single opencode task or did you need to manually continue things?
Elegant_Tech@reddit
Qwen3.6 is also a massive upgrade in the svg department if you wanted it to code vector graphics.
charmander_cha@reddit
Try doing that with a more aggressive quantization
Alternative_You3585@reddit
Looks like qwen 3.5 27B to me not 3.6
relmny@reddit
no it's not. "-a" is for alias, what's relevant is what comes right after "-m" (model)
Local-Cardiologist-5@reddit (OP)
Hi, sorry, I'm lazy, I didn't update my model alias. Here are my llama-server configs, it's really the 3.6 model. That's how excited I was about it, I didn't even bother updating the model alias name, just plugged it in directly
jreoka1@reddit
Yeah, I was gonna say the same thing but wasn't sure if it was just listing the wrong model or something via opencode
Sarayel1@reddit
thats a bot bait post
scythe000@reddit
Yeah that’s what it looks like
phenotype001@reddit
It made the most beautiful 2D fishing game I've ever seen. Easily better than GLM 4.7 and every MiniMax release.
DarkArtsMastery@reddit
Oh My God