Qwen 3.6 is the first local model that actually feels worth the effort for me
Posted by Epicguru@reddit | LocalLLaMA | View on Reddit | 165 comments
I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth.
I've been using LLMs in my personal/throwaway projects for a few months, for the kind of code that I don't feel any passion for writing (mostly UI XML in Avalonia, embedded systems C++). I used to have Sonnet and Opus for free thanks to GitHub's student program, but they cancelled that. I've been trying out local models for quite a while too, but up until this point it mostly felt like they were either too dumb to get the job done, or they could complete it but I would spend so much time fixing/tweaking/formatting/refactoring the code that I might as well have just done it myself.
Qwen3.6 seems to have finally changed that, at least on my system and projects. Running on a 5090 + 4090 I can load the Q8 model with the full 260k context, and at around 170 tokens per second it's also one of the fastest models I've tried. And unlike all the other models I've tried recently, including Gemma 4, it can actually complete tasks and only needs minor guidance or corrections at the end. 9 times out of 10, simply asking it to review its own changes once it is 'done' is enough for it to catch and correct anything that was wrong.
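If you want to script against a setup like this rather than chat, LM Studio and llama-server both expose an OpenAI-compatible endpoint; here's a minimal sketch with the openai Python client (the port and the model id are placeholders for whatever your local server reports, not fixed values):

# Minimal sketch: querying a local OpenAI-compatible server (LM Studio / llama-server).
# The base_url port and the model id below are placeholders, not fixed values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")
resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # hypothetical local model id
    messages=[{"role": "user", "content": "Review the diff you just produced and list anything wrong."}],
    temperature=0.6,
)
print(resp.choices[0].message.content)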
I'm pretty impressed and it's really cool to see local models finally start to get to this point. It gives me hope for a future where this technology is not limited to massive data centers and subscription services, but rather being optimized to the point where even mid-range computers can take advantage of it.
Curious-Function7490@reddit
This is interesting. I keep running out of tokens with Sonnet 4.6, and I have a gaming rig with a 4090 sitting across the room doing nothing right now.
Lancelotz7@reddit
Running Qwen 3.6 locally on a Mac. Impressive model, but want to sanity check my experience against the hype. My setup: Sonnet has been producing short-form videos for me in production. The skill file, the workflow, the project folder structure, all battle-tested. It ships finished output consistently. I handed the exact same folder to Qwen. Told it to follow the skill and continue the workflow. It reads everything, acknowledges the steps, then produces output that misses the brief. Structure drifts, tone drifts, skill steps get skipped. Not usable without heavy manual cleanup. Genuine question to the Qwen power users here: am I doing something wrong? Do you prompt it differently than you would Sonnet? Different system prompt structure, different way of referencing skill files, smaller context windows, specific sampler settings? Happy to be told I’m holding it wrong. Because on paper it should handle this. In practice, on my machine, it’s not there yet.
Epicguru@reddit (OP)
The Qwen 3.6 open source model is a 35B MoE model. It's really good at coding/agentic tasks but far less impressive at something like generating content; it wasn't designed or tuned for that.
It's not really comparable to the 700B+ Sonnet model which was made with broader capabilities in mind.
Better-Struggle9958@reddit
Every release I see the same posts.
EuphoricPenguin22@reddit
Yeah, but if this thing is better than the dense Gemma 4 31B, like the benchmarks I've seen suggest, this is killer. Gemma 4 is the first model for me to pass this threshold, so doing that but way faster seems like a dream come true.
ayylmaonade@reddit
Yeah, I don't know why people are complaining. It's exciting as hell. I've been running it at Q4 doing agentic work for the past day and a half, and I'd legitimately consider it to be a Gemini 3 or Claude Sonnet 4.6 competitor. Some of the things it's able to one-shot for me are impressive as hell. It has outstanding frontend capabilities too.
Todasa@reddit
What kind of computer does one need to run this?
letterboxmind@reddit
What's your computer's hardware? Tell us and we can point you in the right direction
Todasa@reddit
I wanna know what type of machine is ideal. I’m thinking of buying something.
I do have a MacBook Air M4 with 24 gigs of RAM already.
Real_Ebb_7417@reddit
If you want to stay on Apple hardware, then the M5 Max will give you the best experience (fastest generation), but if you get a worse CPU it should still be fine. RAM is more crucial here. I'd say at least 64GB for the best experience and 48GB for a satisfying one (are there even 48GB versions? 😅)
If you want Nvidia though, go for a GPU with 16GB VRAM (32GB would be perfect, but it's overpaying imo if it's just for running the model a bit faster) + 64GB RAM (32GB should be enough, but I suggest going for at least 48GB to have some breathing space)
BeatmakerSit@reddit
What can I sensibly run with a 4070 Super with 12GB VRAM and 32GB of RAM alongside my Ryzen CPU?
beltemps@reddit
I have the same combo, but with an i5 CPU. Gemma 4 31B runs, and so does Qwen 3.6 35B, both as Q4_K_M. My current favorite is the Qwen 3.6 35B Apex I Quality model, since it's smaller and runs noticeably faster than the dense model with nearly the same performance. With all of these models, though, the load time on our combo takes forever.
BeatmakerSit@reddit
I tried Qwen 3.6 35B with the Hermes agent yesterday and it wasn't really much fun, so I went back to Codex. Used directly in Ollama, though, I found the output quality quite good...
beltemps@reddit
Try the respective MoE models from Qwen and Gemma. They're noticeably faster.
Todasa@reddit
Does that mean people custom build a PC with their Nvidia chip of choice, or can you get that from conventional manufacturers like Dell, etc?
Real_Ebb_7417@reddit
You surely can get it from conventional manufacturers, but I personally prefer a custom build.
letterboxmind@reddit
Depends if you want to run a local 8B or 70B model. VRAM is the crucial factor.
For Apple Silicon, which uses unified memory, the system RAM is shared with the GPU.
Are you looking to stay on Apple or open to Intel/AMD?
Todasa@reddit
Open to Intel/AMD or anything. Can I run Ubuntu? Kinda dislike Windows.
grempire@reddit
Q4 is already 24GB; you'd better get more than that.
OmarBessa@reddit
WTF, qwen ONE SHOTTED THAT????
Better-Struggle9958@reddit
Benchmarks never show the real picture, but tons of identical posts about how wonderful the new Qwen is, without real examples, looks like a scam.
audioen@reddit
It literally is good. I'm just deleting 3.5-122B to run 3.6-35B instead. No doubt the 35B will be replaced by a 122B in due course, but right now it seems to be the best thing available to me. It doesn't even need half the resources of the previous model.
michaelsoft__binbows@reddit
This is the advancement that was always gonna come. We're all lusting after those 256+GB systems and soon a lot of those systems are going to be looking at us like "am I a joke to you" as we run these A3B class models because they are such speed demons and the sensible choice of model for many tasks.
Kodix@reddit
Just spin it up and test it yourself.
I *am* seeing better, more consistent results than Gemma 4 I was using before. It's not night-and-day, like some people are claiming, but on the same tasks side-by-side Gemma's output was routinely filled with errors and didn't run (required debugging), whereas Qwen3.6's just wasn't.
This isn't an *extensive* test or anything, take it with a huge grain of salt, but it does seem to be an improvement *for these specific agentic coding tasks*.
Better-Struggle9958@reddit
I literally did, for C++/QML tasks, and Qwen 3.6 generated code that doesn't compile. Gemma 4 is fine but very slow. All models at Q8. So...?
Kodix@reddit
So for your usecase gemma4 is better. Seems pretty clear.
Combinatorilliance@reddit
For what it's worth, I was using a local model back in 2024 for real coding assistance on a 7900 XTX. I'm a software engineer with no Android/Kotlin experience, and I was able to use it as a Stack Overflow/Google substitute to help me with syntax, guide me in the right direction, and translate questions and analogies from my own programming background into Kotlin terms.
Here's my review from two years ago: https://www.reddit.com/r/LocalLLaMA/comments/1ds9ogn/my_experience_with_using_codestral_22b_for/
I made the app I wanted to make, it worked.
This was all way before agentic coding, ancient ancient history.
Models have advanced extremely significantly since then.
If you expect them to do what Claude Opus can do? Then you've got the wrong expectations. But if you want a capable model that can answer small, pointed questions for you? They can.
And these models can also search and use tools quite reliably. As well as opus or sonnet? No. As well as Haiku? Plausibly. Yes. Is Haiku useful? Undeniably.
Misio@reddit
Nah, it's great. I just built a 3090 inference rig and I've been dipping in and experimenting. This can actually do things rather than flapping around being a moron.
GrungeWerX@reddit
More interested in the 27B, but I can't find any info on it. How does it compare against Qwen 3.5 27B?
EuphoricPenguin22@reddit
The dense or sparse? I tried both IIRC and this one is definitely better.
GrungeWerX@reddit
There is only dense 27B.
EuphoricPenguin22@reddit
Oh, the 35B is the sparse for 3.5. Honestly, with so many model releases flying around, I think some slack can be cut.
miversen33@reddit
I mean, I should hope this is faster than Gemma 4 31B, you're comparing an MOE and a Dense model lol. These are not the same. You should compare it to Gemma 4 26B-A4B, that's a more "apples to apples" comparison
EuphoricPenguin22@reddit
I know. I would expect the dense to outperform it, but the sparse model is so damn good that it's definitively better at code generation.
Better-Struggle9958@reddit
Not better. I compared C++ and QML code generated by both, and Qwen 3.6 was worse than Gemma 4; the Qwen code wasn't even compilable.
EuphoricPenguin22@reddit
With such small models, programming performance will be fairly domain-specific. I tend to notice all LLMs struggle more with C/C++ as compared with Rust or webdev.
miversen33@reddit
How are you testing? I'm running a homemade benchmark tool to compare them but I've always wondered how others validate "X is better than Y at Z"
EuphoricPenguin22@reddit
I use it with Cline to develop a test application. My go-to test these days is to build a 3D snake game with ThreeJS. It's not a super great test of the logic, but a lot of small models struggle (or used to struggle) with the 3D aspect of it. Heck, even some cloud models take a while to get the design implemented properly. I'm mostly looking for issues with tool calls and the number of corrections it takes to get a working final product. I would say this model is pretty decisive in both code competence and tool calling.
Potential-Leg-639@reddit
I tried it today and the results were not really that good. All those hype posts ("as good as Claude", ...) about a small MoE model with 3B active parameters were already a sign of massive overhype.
I'll give it a try on pure coding with a bigger plan soon, and I'm quite sure it won't be on Qwen3 Coder Next's level.
Yes-Scale-9723@reddit
Yes, then paid APIs get much better. Then open models get much better.
Then, paid APIs get much better. Then, open models get much better.
Then, paid APIs get much better. Then, open models get much better.
[NaN repeated lines hidden] /s
ProfessionalJackals@reddit
I feel like this has been stalling ... If we compare Opus 4.5 > 4.6 > 4.7 the jumps have gotten way smaller. Same with Sonnet etc ...
Whereas the open models have been closing that gap, step by step. Don't forget we're comparing a 35B MoE model against what may be 200B+ models, and it's still delivering good results.
Even the paid-API mini models have been closing the gap with their paid-API big brothers. So it makes sense that the lower-parameter open models are enabling better home coding experiences.
There is a point at which the closed models will stall.
c64z86@reddit
Yeah, I don't think it's as amazing as many make out, after trying it for longer. When Qwen 3.6 one-shots something it does a really wonderful job, but more often than not it gets things wrong and doesn't one-shot them. I find Gemma 4 more consistent in its quality and in getting everything working on the first go, even if that quality is lower.
Pawderr@reddit
My thoughts exactly
FinBenton@reddit
Idk, this is the first time I've had a model that I can run on a gaming GPU and actually feel like I can make it do some coding without pulling out the big boys for smaller changes.
Better-Struggle9958@reddit
Example? Task and result?
Borkato@reddit
Literally anything you’d do in code.
“Refactor the code so it uses a while loop instead of recursion.” Boom it does it. “Nah I don’t want comments” it gets rid of them. Etc. it’s really not that complex
FinBenton@reddit
Idk, small stuff. I asked it to go into my web app and add a new theme; it went in, found the right place to add it, went to a different place to give it a custom thumbnail with the theme's colors in the theme picker, ran some tests, and came back with a working version.
Corporate_Drone31@reddit
Because everyone has different expectations. For me, the line was somewhere around R1-ish.
IrisColt@reddit
heh
Epicguru@reddit (OP)
I guess that's what happens when every new release is better than the last...
NeedleworkerHairy837@reddit
Even on my low-end PC (by AI standards), with only 8GB VRAM + 90GB RAM, this feels like the best model I've ever tried, and it can actually follow instructions quite well. For me, it's already more than enough. It's even better than GLM 4.7 Flash by far in my tests today, and I was basically already satisfied with that GLM lol.
-Ellary-@reddit
GLM 4.7 Flash was not better than Qwen 3 30b a3b.
NeedleworkerHairy837@reddit
Not in my use case or in my experience. I've tried so many times, and GLM 4.7 Flash always does better. << Again, for my use case.
Hmm... wait. Did you really say Qwen 3 30B A3B? Honestly, in my mind you said Qwen 3.5 35B A3B lol.
At least Qwen Coder Next was great, but still not better than GLM 4.7 Flash for me. And now 3.6 is just so much better than anything else I've tried.
I can't really say anything about the 27B and Gemma 31B because they're way too slow on my PC. I'm quite tired of testing and can only run them Q3 quantized. I tried a couple of times and they failed, so I'm done testing them for my usage.
MrPanache52@reddit
Isn’t that what they said?
spawncampinitiated@reddit
Or when newcomers try an LLM for the first time
SnooPaintings8639@reddit
It's gonna be my turn next release!
CountlessFlies@reddit
No but this time it might actually be true... I've tried previous models and none of them felt good enough
eesnimi@reddit
With my over-eight-year-old PC with a 2080 Ti (11 GB VRAM) and 64 GB of system RAM, I can get 29 t/s with Q6_K_XL and full context. That's quite something, considering how complex the technical tasks it is able to handle are.
It complements Gemma well, as Gemma has the edge in creative writing, which makes it better as a general conversationalist. That's good for brainstorming or just reflecting.
2025 was the year of local LLMs, with noticeable quality jumps every quarter. Good to see that it doesn't seem to be slowing down yet. We're already at a point where lower-mid-tier local models can handle some things better than SOTA models because of the greater control you have over them. A wide selection of different models, each one configured for its special task, sitting on an NVMe drive, and you can already replace SOTA models with very little compromise.
Pineapple_King@reddit
Can you please share your llama.cpp parameters? I don't get past 22 tps.
eesnimi@reddit
-ngl 99 \
--n-cpu-moe 38 \
-c 262144 \
-b 1024 \
-ub 512 \
--flash-attn on \
-t 5 \
--no-webui \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
--mlock \
--split-mode none \
--main-gpu 0 \
--fit off \
--parallel 1 \
--cache-type-k bf16 \
--cache-type-v bf16 \
--sleep-idle-seconds 900 \
--jinja \
--chat-template-kwargs '{"preserve_thinking": true}'
Pineapple_King@reddit
I can't believe it, it's running at 80 tps (previously I got 22 tps with Qwen 3.5 and my old parameters) AND a 256k context window instead of the 65k I had before. Incredible! The only difference is that I use --fit on instead of --n-cpu-moe 38.
Conscious_Chef_3233@reddit
I used to use --n-cpu-moe too, but switching to --fit on is not only easier to set up but also faster.
Pineapple_King@reddit
thank you!
qfox337@reddit
It's the first local model that I've used as default over Deepseek for a week or so. I'm not sure I'll stay, but unlike all previous local models I'm not terribly unhappy with it. I usually have it split over my 3060 and 3090, where it is a bit slower at 85 tps (to leave space for my actual research model on the 3090). On just the 3090 it's 125 tps. It feels slow because it does a bunch of "thinking" by default; I haven't bothered tweaking the effort parameter, and assume quality would fall if I did.
DonkeyBonked@reddit
Well, even though I feel like I've read this post a thousand times, I have to say, this is the first time I've felt a real agreement with it. I've got Qwen 3.6 35B running in Cline and I'm putting it to the test right now. The shift between that and my GitHub Pro+ using upper tier models is literally the smallest it's ever been for me in a coding workflow.
Now to be clear, it's not a "I'm unsubscribing and never paying for these models again" level change, but for example, GitHub just refunded me a ton of my Premium Requests (I think they've been broken), and so I'm currently at 1,184 of 1,500 included with 12 days left. However, a few days ago I was at 1,408 of 1,500 with 15 days left which was even more grim. I expect to go over, but that doesn't mean I'm not trying to make the most of it.
I've been brutally pushing and testing Qwen 3.6 on my local AI server, where I'm running it with the highest quality settings I can handle locally, and honestly, it doesn't feel any worse than using Sonnet 4.6 on my Claude Pro sub. It does make some mistakes, but I think with some LoRAs, skills, and MCP love, this thing can actually be a part of my workflow to keep my AI costs down.
While I'm using it in Cline now, I'm going to set it up in Hermes today, and I'm also working on my own custom agent for it to see if I can maximize its potential. I've been working on some self-improving data for it while also having to adjust for the differences prompting it vs. something like Claude or ChatGPT.
I've had it test and make a few apps with agents to see how it can handle them and basically, it boils down to this:
- Python: A tier.
- JavaScript/TypeScript/HTML: Solid B tier with the potential to be A tier.
- Go, Shell, Rust: C tier with the potential to be B tier with some help, maybe.
- Niche languages like GDScript: D tier at best. It might get some specific asks right, needs to be combined with web search to even be hopeful, and is completely unusable in a professional workflow. Even with plugins and MCP servers, it's a train wreck, incapable of producing error-free code, debugging, or corrections at any scale worth mentioning.
- Elsewhere: Bag of Cats! I haven't gotten to fully test it, but it can do some C++ or C#; just don't think that means it can pull them off in a custom environment like Unreal or Unity.
- It's worth noting that there hasn't been a task where it didn't fail at least one tool call in Cline/VS Code, but most of the time it figures them out, and I'm using the failures and successes to build a database that I hope to turn into a tool for it soon.
These are my own opinions based on my own testing, which has been for my own workflow, so no, I don't have data or charts for any of it. Call this my personal feelings based on testing and experience, which is continuing and rapidly evolving.
miloman_23@reddit
I feel it's the MoE variants that are the real innovation here.
Machines with > 24GB memory and average GPU specs can start to generate tokens fast enough to use for real-life applications, such as openclaw etc.
RoomyRoots@reddit
I've only read the posts, and it's probably one of the most divisive releases I've followed this soon after launch. People are either loving it or hating it.
pedronasser_@reddit
I think the hating might be due to bugs that may still exist. As soon as the backends get fixed, the overall consensus will get more positive.
Cupakov@reddit
It’s the same architecture, I don’t think there’s too many bugs this time around
pedronasser_@reddit
That's exactly the point. Qwen3.5 still has open issues in llama.cpp. For example, prompt caching doesn't work.
toothpastespiders@reddit
I haven't tried it yet. But the weirdest part of the narrative to me is that a 0.1 bump from a model that wasn't released 'that' long ago is provoking such a heated response among users. I'd get it if issues with overthinking had been tweaked but from what I've seen in discussions that doesn't seem to be the case.
computehungry@reddit
I noticed 3.6 started writing unit tests by itself and testing the code. 3.5 didn't do this, when no instructions were given. I obviously didn't test 3.6 a lot yet but this might be one of the reasons why it could suddenly be a lot better for some setups. For many it could also be very incremental.
tengo_harambe@reddit
Some people are just hating on Qwen now because of the recent leadership mixup.
ambient_temp_xeno@reddit
It must depend on what you use it for. For code I'll never know if it's good or not. I set up minimax 2.7 Q8 the other day and haven't bothered with it since.
abcdef0eed@reddit
is there going to be a 9b version?
ImSamhel@reddit
Man, I can't afford to run these anymore 😭 at least the 26B Gemma fits into my 16GB of VRAM, I'm jealous.
dark-light92@reddit
I run this model with 12GB vram. Overflowing into RAM isn't really an issue with 3B parameters active.
ImSamhel@reddit
Which quant if I may ask and what's the speed diff?
dark-light92@reddit
IQ4_XS. ~500 t/s prefill, ~22 t/s generation average. 128k context.
ImSamhel@reddit
I just tried IQ3_S with 128k context and turboquant (yes, I know there are many questionable things in this sentence, BUT it's working and fast and so far I haven't regretted it), and it fits into like 13-14GB of VRAM.
paulmmluap@reddit
It can't do simple datasheet scraping; Claude can with ease.
Epicguru@reddit (OP)
Can you run Claude on your own computer for free? Look at the name of the sub.
paulmmluap@reddit
You don't know, okay. I saw the name of the sub. I benchmarked Qwen and it can't do simple things well, for me. Maybe another approach could do it. But the cost of Claude was well worth the time and money trade-off.
cmpxchg8b@reddit
Cool story
Liquidlino1978@reddit
It's pretty good so far. However, it can *really* get stuck in a loop when thinking. To the point of filling up the entire context and failing to respond. Try these prompts in a row:
What's brown and sticky?
Very good. What are some similar pun based simple jokes like this?
Are there any that are bit less kiddy, and more risque/adult?
This sends Qwen into a tailspin, endlessly iterating on the same three or four rubbish jokes and deciding they're not funny and not adult. It even recognises it's in a loop multiple times, but fails to climb out of it.
Epicguru@reddit (OP)
Liquidlino1978@reddit
Hmm, interesting. I'm using default settings for the model in Ollama on Linux using AMD ROCm. It's been spot on except for this. I will retry in LM Studio on Windows with just the CPU and see if it changes.
obp5599@reddit
I haven't had this problem with 128k context and llama.cpp. I'm using Claude Code and pointing it at my local llama.cpp server. It's been pretty good so far, just slow (compared to frontier models in the cloud). 5080 + 64GB RAM, getting ~20 tps.
Mayion@reddit
I always find myself in thinking loops with Qwen since 3.5. My parameters are the same as Unsloth's, but it keeps looping and I honestly don't know how to fix it. Meanwhile, Gemma 4 answers almost instantly and does tool calling well.
ChrisK_au@reddit
I tested the same prompt on maybe 10 models last night using llama-server (I'm new to this). Qwen 3.5 & 3.6 (35B A3B Q4_K_M) spent minutes reasoning before outputting anything. Gemma 4 (26B A3B Q4_K_M) took just a few seconds. Generation time was similar.
If you find a config setting that fixes it, please share.
Mayion@reddit
I am using LM Studio. I disabled thinking for Qwen3.6 and it seems to have solved the issue. Still at ~12 t/s, but it's fine. It couldn't do anything I asked of it in Open WebUI's terminal though, for some reason. It would create the file and "run" the tasks, but in the end it's an empty file.
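If you're calling it through transformers rather than LM Studio, Qwen 3's chat template exposed an enable_thinking switch; assuming 3.6 keeps it, a sketch like this should skip the think block entirely (the repo id is a guess):

# Hedged sketch: disable thinking via the chat template, assuming Qwen 3.6 keeps
# Qwen 3's enable_thinking kwarg. The repo id below is hypothetical.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")  # hypothetical repo id
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What's brown and sticky?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # template-level switch: no <think> block is emitted
)
print(prompt)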
ChrisK_au@reddit
Whose version are you using? In the lmstudio-community one the thinking is turned on in the template; even if I turn it off (with llama-server), it gets turned back on.
Mayion@reddit
Unsloth
ChrisK_au@reddit
That took me to this page that talks about recommended settings "to reduce repetitions". Might be relevant to you.
https://unsloth.ai/docs/models/qwen3.6
PairOfRussels@reddit
People who say this, did you use 3.5 beforehand or what? Is it significantly better than 3.5?
Karlthagain@reddit
I am struggling with 3.6. I was working with **qwen3.5:35b-a3b-mxfp8** and it was working almost perfectly (not for coding, but for various complex tasks using different skills). I tested **qwen3.6:35b-a3b-mxfp8**, but it doesn't follow the limits, procedures, and formats as well as the previous model.
wtfihavetonamemyself@reddit
Has anybody tried using a draft model with this, like a Qwen 2B or 0.8B? Has it worked in llama.cpp? Noticeable gains?
No_Cake8366@reddit
The MoE architecture is doing a lot of heavy lifting here. 35B total params but only 3B active per forward pass means you're getting specialist routing without the full compute cost. That's why it feels so different from running a dense 7B or 13B locally.
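A toy sketch of what that routing looks like (numbers and shapes are purely illustrative, not Qwen's actual config): a router scores every expert for each token, only the top-k experts run, and their outputs are blended by the router weights.

# Illustrative top-k MoE routing; all sizes are made up for the example.
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    # x: (hidden,) token activation; router_w: (n_experts, hidden) router matrix
    scores = router_w @ x                                       # one score per expert
    top = np.argsort(scores)[-k:]                               # indices of the k best experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the winners only
    # Only the selected experts do any work -- that's why "3B active" is cheap at inference.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
hidden, n_experts = 64, 8
experts = [lambda x, W=rng.standard_normal((hidden, hidden)) * 0.01: W @ x
           for _ in range(n_experts)]
router_w = rng.standard_normal((n_experts, hidden)) * 0.01
out = moe_forward(rng.standard_normal(hidden), experts, router_w)
print(out.shape)  # (64,)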
Curious what hardware people are running this on. I've been testing on an M-series Mac and the inference speed is surprisingly usable for agentic coding workflows where you need fast back-and-forth. The Gemma 4 26B comparison is what sold me on trying it, but the real test for me was multi-turn conversations where previous local models always fell apart by turn 4-5.
Anyone benchmarked it against the uncensored fine-tune that dropped yesterday? Wondering if the preserve_thinking flag makes as big a difference as people are saying.
megid0105@reddit
Idk, sounds like whenever a new release lands, people are happy about it until they aren't. But great input anyway.
evilbarron2@reddit
That may be because of the type of people who post immediately about a new release. People doing real work tend to wait on releases and then do extensive testing, so you don't hear their findings until around a month after release.
Simplest thing is to discount the immediate reviews - they just can’t provide anything beyond surface impressions
evilbarron2@reddit
Did you (or anyone else) use 3.5 moe as well? I’ve been using 3.5 extensively served locally, have been quite happy with it, and am wondering how 3.6 compares. I’m downloading it now to start testing it in my setup, would be useful to know what to look for.
Electronic-Metal2391@reddit
Yeah? Does it yap and loop thinking with you too?
TheItalianDonkey@reddit
Does for me. 80k tokens so far, then I gave up.
iamapizza@reddit
You said hi didn't you
perkia@reddit
Prompt skill issue /s
TheItalianDonkey@reddit
Worse; I also added "how's it going?" It's probably thinking I want to skin its family now…
Qwoctopussy@reddit
llama-server running Qwen3.6 35b a3b bartowski Q6 quant, pi agent harness
I see the user is greeting me. I should respond in a friendly, casual manner as they've addressed me as "buddy".
Hey! What's up? How can I help you today?
TheItalianDonkey@reddit
Not always, but on occasion / every 50 messages or so… Hermes agent.
OniCr0w@reddit
You need to tune the settings correctly. 3.6 has been nothing but an improvement for me.
TheItalianDonkey@reddit
Here you go, I can't see what I'm supposedly doing wrong. In my head, I tuned these settings to that text.
Qwen3.6 35B A3B UD
Model
- Format: GGUF
- Quantization: Q8_K_XL
- Architecture: qwen35moe
- Size on disk: 40.24 GB
- API Model Identifier: qwen3.6-35b-a3b@q8_k_xl
Load / Context and Offload
- Context Length: 262144
- GPU Offload: 40
Advanced
- CPU Thread Pool Size: 15
- Evaluation Batch Size: 512
- Max Concurrency: 1
- Unified KV Cache: ON
- RoPE Frequency Base: Auto
- RoPE Frequency Scale: Auto
- Offload KV Cache to GPU Memory: ON
- Keep Model in Memory: ON
- Try mmap(): ON
- Seed: Random Seed
- Number of Experts: 8
- Experimental “Number of I...” setting: 0
- Flash Attention: ON
- K Cache Quantization: OFF
- V Cache Quantization: OFF
Inference / Settings
- Temperature: 0.6
- Limit Response Length: OFF
- Context Overflow: Truncate Middle
- Stop Strings: none
- CPU Threads: 15
Sampling
- Top K Sampling: 20
- Repeat Penalty: 1.0
- Presence Penalty: OFF / 0.0
- Top P Sampling: 0.95
- Min P Sampling: 0.0
Structured Output
- Structured Output: OFF
Beginning-Window-115@reddit
You know you need to use a harness, right? That's the entire point of these models.
TheItalianDonkey@reddit
Yup. Currently using hermes… any other reasons as to why it’s looping like that then?
Beginning-Window-115@reddit
Try OpenCode and let me know if that works. Also, by chance, are you using Unsloth? The template might be broken.
fredandlunchbox@reddit
Why not Claude Code? You can run it with a local model and get all of their tool prompts. It's great.
alchninja@reddit
The biggest tradeoff is that CC's toolset and system prompt eats up a very large amount of context, around 15-20k tokens depending on how many additional things you have set up. It's very good, but you definitely feel the pinch even with a 100k context length.
fredandlunchbox@reddit
I have 200k context length. Burning 5-10% seems like a good trade off for the performance I get.
covertpirates@reddit
OpenCode has been amazing. It wasn't until I got the context sorted, though, that it became useful. I've been using it with Qwen 27B. GLM was also good, but it got stuck in a loop, blaming me every time an error popped up! lol. A bit concerned about the privacy issues though.
TheItalianDonkey@reddit
No on OpenCode, but yes on Unsloth (the Q8 quant). Where did you see that the template might be broken?
Beginning-Window-115@reddit
Idk, usually they have broken templates that get fixed a couple of days later. You could try Bartowski, if there is one, and just get the basic Q8.
AD7GD@reddit
Whenever I hear "looping thinking" my first thought is that someone is using ollama with a low token limit. It will slide the window and the model will lose track of whether it's thinking.
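If that's the cause, raising the context in the request usually fixes it; with the ollama Python client that's the num_ctx option (the model tag below is a guess at whatever you pulled locally):

# Hedged sketch: raise Ollama's per-request context so long thinking isn't truncated
# by the small default num_ctx. The model tag is a guess at a local install.
import ollama

resp = ollama.chat(
    model="qwen3.6:35b-a3b",  # hypothetical local tag
    messages=[{"role": "user", "content": "What's brown and sticky?"}],
    options={"num_ctx": 32768},  # default is far smaller; raise it so the think block fits
)
print(resp["message"]["content"])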
Epicguru@reddit (OP)
Hasn't done so far. Been using LM Studio and OpenCode with pretty much default settings.
RelicDerelict@reddit
Is someone running this on a 4GB VRAM and 32GB system ram? Just asking for a friend (you don't need to remind me that I am poor).
Pineapple_King@reddit
Yes, and it's usable but slow, 4 tps.
RelicDerelict@reddit
🤔 That is not as bad as I thought, hm, thanks!
computehungry@reddit
It runs on my 3050 4GB laptop. If you want vision on the GPU, you barely have any room for context. Without vision (either not loaded or on the CPU), you can run a Q4 quant (reference: Qwen's official quant) with 131k context at Q8 KV cache and batch size 512, and it will barely fit. If you want some overhead/stability you can go with less context. ~26 tk/s.
RelicDerelict@reddit
Thanks man!
MediocreLeek9343@reddit
I have to agree that it is definitely one of the best local models. Very impressive.
niellsro@reddit
The model is handling tool calls really nicely, but please make sure you're always in the loop to review it (for coding tasks, I mean). It seems to rush to implementations/wrong conclusions without assessing the whole picture. At least that's what I've noticed; I'm using an AWQ quant. I threw a code review request at it for a PR I made in an actual project I work on. It flagged so many "problems" by assessing class method code in isolation, without "understanding" the full flow. However, when questioned about it, without actually mentioning the business flow, it reanalyzed its conclusions and corrected itself. This might be an instruction problem or just "rush to solve" behaviour.
Zyj@reddit
In my first tests, Qwen 3.6 35B A3B didn't work so well.
kmp11@reddit
Watching Hermes-Agent work through an unlimited amount of tokens at >100 tk/s with this model is kinda scary...
Physical_Gold_1485@reddit
Shit like chatjimmy at 16k tokens per second is scary af too. Having that local would be insane
blueredscreen@reddit
AI slop has infected everything.
Skelshy@reddit
I switched to this from Qwen 3.5 122B (Q6) and it's faster with similar results. So far so good.
crazyCalamari@reddit
Wow that sounds interesting if true. Are you using it for coding or other use cases?
Skelshy@reddit
I have a coding framework that can run the local LLM 24/7 that does a lot of long running coding tasks.
jedsk@reddit
Function calling in OpenCode has not failed once yet. Editing HTML pages has given me surprisingly decent results. Gemma struggled for me. Q8_K_XL.
AsyncAura@reddit
Is your experience good with C++ projects ? Would you recommend running it on a 3080 24GB?
chocofoxy@reddit
I am hyped for a 14B or 9B release. I can't use this model, I don't have enough VRAM, but I will try it (I can offload it).
Blackdragon1400@reddit
I’m glad folks with smaller cards are getting to experience this now, I think we’ve been there for about 6 months now but with the larger model sizes. We’re going to be eating good from here on out!
suoko@reddit
Minimax?
GrungeWerX@reddit
Did they only release the 35B? I thought the 27b won the vote? Not interested in the 35b…
Ok_Mammoth589@reddit
If you're running a 5090 and a 4090 and some 35B model is literally the best model you can set up, then it's not the models.
Epicguru@reddit (OP)
What do you mean by that?
If you know of better models for coding that can run on my hardware with anywhere near 260k context and 150+ TPS, please do let me know.
Leo_hofstadter@reddit
Is the qwen3.6-9B model released too ?
donk8r@reddit
Interesting. GLM 5.1 has been my favorite from open source so far — how would you say this compares on coding tasks? Better instruction following or about the same?
-Ellary-@reddit
For me Qwen 3.5 27b is way better at executing tasks and solving problems.
If you have enough RAM and a 5090 + 4090, why not run the full GLM 4.7 358B A32B at IQ4_XS or IQ3_XXS?
Difference between big GLM 4.7 358B A32B and Qwen 3.6 35b A3B will be insanely big.
For me Qwen 3.6 35b A3b and Gemma 4 26b a4b are really light models, close to 9-12b dense.
ea_man@reddit
For the peasants with less than ~16GB (I've got 12GB), even the 27B at IQ3 knows better than Qwen3.6-35B-A3B.
Before you ask:
Yet it's slower, guess what.
Dany0@reddit
Q3.5 27B RYS for planning/hard problems, Q3.6 35b for everything else
guiopen@reddit
Qwen 3.5 35B is far better than Qwen 3.5 9B.
Mount_Gamer@reddit
I think this new Qwen 3.6 35B is better than the 3.5 9B; it's solving problems better than the 27B (granted, I use Q4_XS) and every other local model I have, and I think the 27B is very good. It's very impressive, but the way to get the best results for me, with both Qwen 3.5 and this new 3.6, is to have thinking on and use the recommended params for thinking.
I have not used it for agentic work yet, but over the web UI it's doing very well.
Epicguru@reddit (OP)
I tried GLM 4.7 (I forget which quant but probably Q3) and ran into lots of issues with it stopping suddenly mid-task, or ignoring large parts of the task.
-Ellary-@reddit
Are you sure you're talking about regular GLM 4.7 and not the Flash 30B variant?
I mean this one - https://huggingface.co/bartowski/zai-org_GLM-4.7-GGUF
Epicguru@reddit (OP)
Yes that's the one. But I could only use Q2s, I have 64GB of RAM.
Neighbor_@reddit
Is it better than the new Gemma?
Simon-RedditAccount@reddit
That's true. I'm testing all new models with a tricky task that relies on some knowledge that is obvious to a human but not specified in the prompt. So far, Qwen3.6-35B-A3B-UD-Unsloth was the only local model that fully solved my task.
Interesting_Key3421@reddit
I agree, it also works very well on fast CPU
Zealousideal_Fill285@reddit
I agree that Qwen 3.6 35B is great, but you have an RTX 5090 and a 4090 and can't afford any $20 AI subscription?
Epicguru@reddit (OP)
It's not that I can't afford it, it's that I don't want it.
Epicguru@reddit (OP)
And I bought these GPUs well before the shortages, both for under MSRP, no way I could buy them again today.
OverclockingUnicorn@reddit
Did the 5090 exist before the gpu shortage?
Epicguru@reddit (OP)
Just barely, yes. In the UK, Palit 5090s were available for under MSRP for a few weeks after launch.
Party-Special-5177@reddit
Brilliant retort.
It’s crazy how many people are in localllama arguing against locality lol
Zealousideal_Fill285@reddit
I'm not against it, as I use it myself, but there definitely were capable models for coding too back when he had that student sub, especially with his setup, so that was just weird.
RoomyRoots@reddit
He says, in the Local AI sub.
ComfyUser48@reddit
Same!