Okay 27B made me a believer
Posted by Forward_Jackfruit813@reddit | LocalLLaMA | 74 comments
I previously hated on this model, but I have just been impressed by it, and I understand the hype now.
I have been working on an HTML5 game console and I decided to see if Qwen3.6 27B can handle making some quick games in it to showcase functionality (save games, console API handling for stat tracking and heartbeat management, metadata for the game, etc.).
I gave it 3 files explaining how the API works, the gamepad controls, and a TypeScript shader for it to apply. Then I just gave it a very simple prompt: "make a breakout game for this console, in the working directory are reference files on how to make it".
The first result was immediately playable: controls made sense, the graphics style was unique and appropriate, sound worked, the console API all worked, and it felt good and was actually fun. It added flair that made it not feel like the vibecoded breakout clone it was. It went way above and beyond the minimum that I've seen so many LLMs do. It was not lazy in the slightest.
It's a simple test, but it's one that only something like Opus could handle. No single part was done exceptionally well; it's just that the whole game was nearly complete in a single shot and it felt like thought was put into the entire game. All I needed was one follow-up for customization and a single glitch fix, and it was already what I would consider complete. And this was on a 27B model with OpenCode.
The best way I can describe it is that it was congruent. Now I just wish I had gone the Nvidia card route instead of Strix Halo, because the speed isn't great. Maybe 3.7 35B A3B can have some of this magic.
Weekly_Comfort240@reddit
I've been working closely with 27B for the last two weeks, maybe three weeks. Some observations:
1) <64K context is best for intelligence. It will _still_ muddle through tasks when approaching max context on long-horizon agentic workloads, but I find its IQ drops alarmingly past 64K context, and really drops off after 128K. Telling an agent "Summarize everything you learned into such-and-such.md", closing the harness, reopening it, and saying "Read such-and-such.md" is a big key to retaining the intelligence of this model.
2) Its one-shot ability on web apps is truly amazing. For a lot of long-horizon tasks where it cannot find a solution, or delivers something that does not work, you're going to have to lead it by the reins and "vibe code" it. For tricky web browser problems, I've even asked it "Open a browser with API access and watch what I do step by step" to good effect. But every time context creeps past 64K or 128K, I have to reset the session as it starts to fall into loops and stupidity.
3) It's simply absurdly fun and addictive to have a near-Sonnet class model on our local resources. I _started_ with 35B A3B, but the thing is I found it simply did not have enough intelligence compared to full-fat 27B. I feel like I've hardly scratched the surface of what's possible with this model, and I'm honestly impressed with and thankful to the engineers who created it.
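The summarize-and-restart routine from point 1 can be sketched as a simple budget check. This is only a minimal sketch: it assumes a hypothetical harness that reports the running token count, and `should_reset`, `handoff_prompts`, and `notes.md` are illustrative names, not part of any real tool. The 64K threshold is the one reported above.

```python
# Hypothetical helper: decide when to summarize the session and start fresh.
SOFT_LIMIT = 64_000  # tokens; the point past which quality reportedly drops


def should_reset(tokens_used: int, soft_limit: int = SOFT_LIMIT) -> bool:
    """Return True once the session has grown past the soft context limit."""
    return tokens_used >= soft_limit


def handoff_prompts(notes_file: str) -> tuple[str, str]:
    """Build the prompt that ends the old session and the one that starts the new one."""
    save = f"Summarize everything you learned into {notes_file}"
    load = f"Read {notes_file}"
    return save, load


if __name__ == "__main__":
    for used in (20_000, 70_000):
        if should_reset(used):
            save, load = handoff_prompts("notes.md")
            print(f"{used} tokens: reset -> '{save}', then '{load}'")
        else:
            print(f"{used} tokens: keep going")
```

Keeping the limit as a parameter makes it easy to try a tighter budget on setups that only fit a smaller context anyway.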
dadangemonfarid@reddit
I've only done it for about 1 week and have the same experience/conclusion. It's also just nice that my setup only allows ~60K context (RTX 3090, llama.cpp, unsloth MTP Q5 K_XL, draft-n-max 4, ~60 t/s tg) and it's been working very well with fresh sessions.
Regarding your "tricky web browser problem", do you mind sharing more details about it? Like what kind of problem, for example, and how did you ask it (which coding harness?) to open a browser and "watch you"? (Were you debugging yourself, or showing it the steps to reproduce the error/problem?)
73td@reddit
what do you run it on? i used the 35b mainly because I found a quant that runs up to 90k context.
Spare-Leadership-895@reddit
yeah, i think this is the part people miss. once the session gets bloated, it feels less like the model got worse and more like it's fighting its own old assumptions.
MaxKruse96@reddit
a 5090 gets 2.5-3k t/s prefill, and 80-110 t/s on 27b Q4 with MTP. definitely crazy speeds for a dense model like this, but i fear the extra memory you got enables you to run way better quants
Gold_Coconut9777@reddit
With 32GB of VRAM you definitely can run better quants than Q4
PatricioDonald@reddit
I am running Q8 with 16 gb vram (probably doing something wrong I am sure, since I am a newbie when it comes to local models and llama.cpp)
alchninja@reddit
Q8 of the 35B MoE or the 27B dense? The 35B will be fine with partial CPU offload and enough RAM, but the 27B will be very very slow since it simply can't fit into your GPU.
You can look at my comment history for a decent starting config for the 35B MoE on 16GB VRAM, then tune it based on your requirements.
PatricioDonald@reddit
35B MoE
I think it does 0 offloading to CPU. I would have to double check the latest command I ran, I am not near my computer
alchninja@reddit
For the 35B MoE on 16GB VRAM you're forced to do at least some amount of offload on basically anything larger than IQ3. The smaller Q8 model is over 35GB, plus a few more GB for the KV cache and buffer. That leaves at least 20GB+ unable to fit into VRAM, which is where your CPU comes in.
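The arithmetic above can be written out as a quick check. A minimal sketch: the 35 GB model size and 16 GB of VRAM come from the comment, while the 3 GB for KV cache and buffers is an assumed stand-in for "a few more GB", and `offload_gb` is a made-up helper name.

```python
def offload_gb(model_gb: float, kv_and_buffers_gb: float, vram_gb: float) -> float:
    """GB that cannot fit in VRAM and must live in system RAM (never negative)."""
    return max(0.0, model_gb + kv_and_buffers_gb - vram_gb)


if __name__ == "__main__":
    # 35 GB of Q8 weights, an ASSUMED 3 GB for KV cache and buffers, 16 GB of VRAM.
    print(offload_gb(35.0, 3.0, 16.0))  # 22.0
```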
MaxKruse96@reddit
MTP takes memory. KV cache of 128k+ for agentic coding takes memory, to the point where q4km + mtp + 128k BF16 cache is fully saturating my vram.
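For a feel of how fast the cache grows with context, here is a back-of-the-envelope estimate. It is a sketch that assumes plain attention layers storing keys and values for every token; the layer count, KV head count, and head size are placeholder values, not the real Qwen numbers, and `kv_cache_bytes` is a hypothetical helper. The q8_0 width of about 1.06 bytes per element reflects its block layout of 32 int8 values plus one fp16 scale.

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    """Keys + values for every attention layer, KV head, and token in the context."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


BF16 = 2.0          # bytes per element
Q8_0 = 34.0 / 32.0  # 32 int8 values plus one fp16 scale per block

if __name__ == "__main__":
    # Placeholder architecture values, NOT taken from any real model card.
    layers, kv_heads, head_dim = 16, 4, 256
    for name, width in (("BF16", BF16), ("Q8_0", Q8_0)):
        for ctx in (65_536, 131_072, 196_608):
            gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx, width) / 1024**3
            print(f"{name} cache @ {ctx} tokens: {gib:.2f} GiB")
```

The cache scales linearly with context length and element width, which is why dropping from BF16 to Q8 roughly buys back half of it.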
Gold_Coconut9777@reddit
How crucial is it to stick with BF16 KV instead of Q8 in your experience?
MaxKruse96@reddit
am testing Q8 192k right now. BF16 up to 128k was no issue.
Evgeny_19@reddit
MTP does take more memory, but Q4 is a noticeable compromise on the quality of the model's output. With 32 GB of VRAM it is possible to run Q6_k_xl with 128k of q8 cache.
Fabulous_Fact_606@reddit
speed, accuracy or price?
5090, 32GB VRAM, Q4: 80-110 t/s. Blazing fast but not as accurate. $3.5K
3090 x2, 48GB VRAM, Q8: 50-70 t/s. Slower, but slightly more accurate. $2K
RTX Pro, 96GB VRAM, BF16: 80-110 t/s ??? Lossless... $10K
Is it worth it to get the 5090?
socialjusticeinme@reddit
Or do two Radeon Pros at $1300 apiece for 64GB of VRAM. I may actually try this out myself and if I do, I’ll report back here.
BlackBeardAI@reddit
Do 4x3090's next.
dreamer_2142@reddit
I don't see many people talk about the settings (temp, top_K, top_P, min_P, repeat_penalty, Presence_penalty etc). These settings are important: compare 0.3 temp vs 1.0 temp, for example. Your model will act like a different one once you change any of these settings.
Forward_Jackfruit813@reddit (OP)
I'm using exactly what they recommend on the model card
MrMisterShin@reddit
For more speed, use MTP (speculative decoding), a value of 2 or 3 should be good enough.
RevMen@reddit
You can overshoot on MTP very easily. Sometimes 1 is much faster than 2 or 3.
Pristine-Woodpecker@reddit
Set a minimum draft probability. Even something like 0.6 is good enough to make 2 or 3 basically always a win.
indicava@reddit
It really depends. I find that when generating code, even 5 can produce a >=95% acceptance rate, but that can drop dramatically when it reasons or when generating “free form” prose.
BTW: this finding has led me to think there might be additional optimization to squeeze out of MTP, but I haven’t been down that rabbit hole yet.
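The trade-off between draft length and acceptance rate can be sketched with the usual speculative-decoding estimate. This is a simplified model, not how llama.cpp measures anything: it assumes each drafted token is accepted independently with the same probability, that drafting stops at the first rejection, and that verifying the whole draft costs about one normal forward pass. `draft_cost` (cost of one drafted token relative to a forward pass) is a hypothetical knob, and 0.15 is a made-up value.

```python
def expected_tokens(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass: 1 + a + a^2 + ... + a^n."""
    return sum(accept_rate ** k for k in range(n_draft + 1))


def est_speedup(accept_rate: float, n_draft: int, draft_cost: float) -> float:
    """Tokens per pass divided by the relative cost of that pass."""
    return expected_tokens(accept_rate, n_draft) / (1.0 + n_draft * draft_cost)


if __name__ == "__main__":
    # draft_cost=0.15 is an ASSUMED value for illustration only.
    for a in (0.95, 0.6, 0.3):
        row = ", ".join(
            f"n={n}: {est_speedup(a, n, draft_cost=0.15):.2f}x" for n in (1, 2, 3, 5)
        )
        print(f"accept={a:.2f} -> {row}")
```

Under these assumptions a long draft pays off when acceptance is high (code) and a short one wins when acceptance is low (reasoning or prose), which matches the observations in this thread.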
RevMen@reddit
MTP shouldn't affect accuracy at all. Only speed.
Civil_Response3127@reddit
Indeed, but nobody claimed otherwise.
Fastpas123@reddit
Do you have a link to a guide on how to set up MTP? Haven't been able to make it work on my setup unfortunately.
Evanisnotmyname@reddit
How does MTP do on small-VRAM systems? I’m literally running 8GB VRAM (6600 XT) with 32GB DDR4. It’s pushing it… but kind of working. Not sure if it’ll help for super edge cases like this?
Forward_Jackfruit813@reddit (OP)
I'm using Q8 (Q8 Cache as well) with MTP. Getting around 15TPS dropping down to 11TPS at ~100k context.
lendo93@reddit
Qwen 27B is such an outlier in our benchmark that we had to re-examine our whole methodology (we have it roughly on par with GPT 5.2 or Sonnet 4.5). It punches way above its weight, although it struggles with larger context sizes. That's true of any model in this size class though and probably an inherent limitation of param counts.
Data at https://gertlabs.com/rankings
eidrag@reddit
Good news! Try loading a version with MTP, and also, because you're on Strix Halo, you should try using a bigger quant or no quant for better quality.
-dysangel-@reddit
bigger quant = slower
eidrag@reddit
Yes, but MoE
BrewHog@reddit
Uh.... What? This is a dense model they are talking about
aeroG1@reddit
dense as opposed to MOE (Mixture of Experts), in this case the MOE configuration is A3B, so only 3B active params at any given time, as opposed to "dense" which would be all of the params.
Upstairs_Tie_7855@reddit
Isn't 27b dense?
YoelFievelBenAvram@reddit
Is that really always the case? I haven't really noticed that drastic of a difference on the Strix Halo. When I tested the Q6 vs the Q8, they were within the margin of error if I recall.
-dysangel-@reddit
yes, 27B is fully dense, so you'd get half the decode speed unless you're somehow compute bound
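The reasoning behind "bigger quant = slower" is that decode is usually limited by memory bandwidth, since every generated token needs the active weights read once. A rough ceiling, assuming decode is purely bandwidth bound and ignoring KV cache reads and other overhead; the 250 GB/s bandwidth and the round 8 and 4 bits per weight are illustrative assumptions, and the function names are hypothetical.

```python
def weights_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size in GB of the weights read for each generated token."""
    return active_params_billion * bits_per_weight / 8.0


def decode_ceiling_tps(active_params_billion: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token requires one full read of the active weights."""
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)


if __name__ == "__main__":
    BANDWIDTH = 250.0  # GB/s, an ASSUMED figure for illustration
    for label, active, bits in (
        ("27B dense, 8-bit", 27.0, 8.0),
        ("27B dense, 4-bit", 27.0, 4.0),
        ("35B A3B (3B active), 8-bit", 3.0, 8.0),
    ):
        print(f"{label}: ~{decode_ceiling_tps(active, bits, BANDWIDTH):.1f} t/s ceiling")
```

Halving the bits per weight doubles the ceiling for a dense model, while an MoE only reads its active parameters per token, which is why the A3B model is less sensitive to quant size.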
DeSibyl@reddit
If you’re coding, don’t drop quant below Q8 lol
Randommaggy@reddit
I prototyped a fairly advanced and dynamic data-driven Flutter app over the weekend using 27B with an 80K context window as the only thing touching the code, with fresh sessions for every new action.
The worst/best part: I tested the major cloud-hosted models on the same problem and they all got off to such a wrong start that it would have taken much, much longer to arrive at a working solution.
It has me looking into expanding my local hardware collection.
ethereal_intellect@reddit
Possibly controversial, but you can try turning thinking off for more speed; it should feel 2.5x faster. After that there's dflash and pflash, which should be slightly faster than MTP, but results still seem to vary since people are still working on them. And of course maximum speed would be the A3B with thinking off, but by then you're dropping a lot of capability.
_TheWolfOfWalmart_@reddit
I don't mind turning thinking off for chatbot type stuff, but it gets a bit risky for coding. Especially with smaller models.
_TheWolfOfWalmart_@reddit
Yeah, I've found 3.6 27B to be the best overall model for coding that fits in 24 GB VRAM with a decent context size. It's better at reasoning and planning than 35B A3B.
For more complex stuff, I use 3.5 122B A10B. It has to swap from system RAM so I only get 25-30 tok/s but it feels like using a frontier model for all but the most complex tasks.
Having 27B and 122B in my toolbox, I don't find myself reaching for Codex/Claude nearly as often and have been able to save my usage allowance for the really complicated stuff.
Force88@reddit
Same, I asked it to make an HTML tower defense game and it works quite well. It can't draw for shiet but functionally the game is passable.
It has me spending this month saving up to grab a 3rd 5060 Ti 16GB, so I can try Q8 with 262k context.
iMrParker@reddit
Has anyone noticed that this model is what made local llm more mainstream? It's so popular that people are claiming it's the best local llm on the planet. Probably newbies not knowing that larger models exist?
epicfilemcnulty@reddit
To run larger models locally you need to have a very expensive rig, while qwen3.6-27B can fit on a consumer card (quantized, of course). And it is really a great model, so the excitement about it is quite understandable.
Thalesian@reddit
In my tests 27b outperforms much larger models. I’m sure they’ll catch up, but it is pretty formidable despite being a fraction of the size of frontier models.
SmartCustard9944@reddit
I would argue that with agentic harnesses I would rather have a smart model that doesn’t have too much world knowledge but is able to gather it when necessary, rather than a bloated model that believes more in its own internal knowledge than in the real data provided (like when you claim something exists today and it rejects the notion because of the training date cutoff).
iMrParker@reddit
I absolutely understand why it's so popular. It's just funny that the byproduct of that is that people are unaware of what local LLM has to offer. Seeing new people get into the space is good tho. And people will learn more about different models
It's just a funny observation
Blaze344@reddit
On a size vs results ratio, these recent Qwen models score quite high. They might not be as good as the current big boy models, but I openly accept them as at the very least on par with the big boy models from 2024 like GPT 4o and maybe o1. To have this running locally is truly amazing, and they work wonders with opencode. (I'm a total convert from Codex + GPT-OSS-20B, which was already pretty decent and a sleeper hit many didn't care for as much as I believe they should have, but 35B A3B or 27B are both so amazing in opencode it's frankly unbelievable).
ImplementCreative106@reddit
Like I mean it's so popular and good that he didn't even mention QWEN but I am thinking about it so I guess that's a fact to consider
jacek2023@reddit
make sure to enable both ngram and mtp:
Evgeny_19@reddit
In my experience combining them does more harm than good. At least on llama.cpp builds from the official containers from ggml-org (full-rocm/full-vulkan).
jacek2023@reddit
do you see better speed without them?
Evgeny_19@reddit
Yes, it's usually the same or marginally faster (around 1 tps better) with regular MTP for me, without ngram. However, recent ggml builds have become completely unstable for me. They frequently crash with GGML_ASSERT on a dual GPU configuration with sm-tensor, and the performance boost from MTP has dropped by about 50%. So I mostly use an ancient build from havenoammo. Well, the build is really only about two weeks old; it's ancient relative to the speed at which everything changes in this space. I do pull the updates from ggml daily (sometimes 2-3 times a day), but they consistently perform poorly on my dual R9700 setup. No combination of mtp/ngram, nor switching from ROCm to Vulkan, managed to achieve anything stable for me.
Evgeny_19@reddit
Just to be clear: I'm not even trying to use them both on an old build from havenoammo. All my tests to combine the mtp and ngram options were done on the recent builds from ggml.
jacek2023@reddit
what are your t/s?
Evgeny_19@reddit
On the build from havenoammo it starts about 50-55 on Q8. This is on ROCm with MTP and -sm tensor. ggml builds are all over the place now, they usually start at 40 and drop really fast to 30 and below.
ProfessionalSpend589@reddit
Same for me when I tested them a couple of days ago. Then I read the comment here for MI50 cards: https://github.com/ggml-org/llama.cpp/pull/23269
I think it hasn’t been optimised yet.
hidden2u@reddit
Now use something like /goal or a Ralph loop and give it access to a browser. Let it iterate and bug test itself; you can let it slowly chug away.
thejacer@reddit
What do you mean /goal? I’ll google Ralph loop lol
hidden2u@reddit
https://code.claude.com/docs/en/goal
most harnesses have something similar now
Then-Topic8766@reddit
I do not believe it. Is there some free code as proof? :)
TheDailySpank@reddit
Load it up in Cline and ask it to do the same.
Then-Topic8766@reddit
I know, I was just kidding. I like 27B a lot.
ProfessionalSpend589@reddit
I like it for the MTP, which at the moment seems to be supported only for the Qwen models.
Mine runs on 2x R9700 (on PCIe 3) and I threw an Anki clone at it to develop some features. Before that several models were working on it already. Mostly works, supports MathJax, but I think it’s unoptimised.
Then to pass time I took some hand notes from a lecture yesterday (xournal++ is Ok-ish), exported them to pdf and pngs to try different formats. Qwen recognized my ugly handwriting, made a summary of the notes and then I asked it to make me flash cards for my Anki clone - with questions from the exercises.
Today I completed the review of 14 cards and I’m waiting to test it tomorrow. :)
I briefly tested the PDF export on Gemma 4 26B A4B today, but haven’t read the full summary (had to restart the model a few times for maintenance).
So, I’m also testing if the models can produce graphical proof for some easy geometry theorems in a SVG file. At the moment I’m not very successful, but probably it’s something I’m doing wrong
Top_Training5738@reddit
That “not lazy” part is honestly becoming the biggest difference now. A lot of models can technically code, but very few actually commit to the bit and build something coherent end to end without randomly giving up halfway through.
Also funny how local AI has reached the point where people benchmark models by “could it make a fun Breakout clone in one shot” instead of just solving Python fizzbuzz now.
BankjaPrameth@reddit
May I ask what harness you used to test it? OpenCode, Pi, CC or something else?
1998marcom@reddit
OP says Opencode in their post.
BankjaPrameth@reddit
Thank you, I must have skipped that part on first read.
Automatic-Arm8153@reddit
I would recommend pi though if you're looking for a good harness.
Use the plain/stock pi. Get used to it, then upgrade it.
Works amazing with local models
BankjaPrameth@reddit
I’m still going back and forth between Pi and OpenCode. But recently I lean toward OpenCode since it has a free model to use when my local one is not up to the task.
pdycnbl@reddit
I regularly check by giving my prompts to multiple models. I keep using frontier models but also check responses from local models. In my case Qwen 3.6 35B A3B or Gemma 4 26B A4B simply don't work at a level that is acceptable. Qwen 27B and Gemma 4 31B give some semblance of intelligence and are slightly better but still miss corner cases. Unfortunately speed is also not good: for Qwen, MTP does not give me any better tg, and raw tg on my hardware (iGPU) is 2 t/s.
I don't get the hype about the 27B model. People have rave reviews about it on this forum, which gets me excited, but the moment I test something from actual work I just get a feeling that it's not there.
At least 35B A3B is good at execution when given a solution at a high level, due to its relatively better t/s speed.
Ok-Internal9317@reddit
Throw a few billion tokens at a project, you’ll enjoy it 😍😍
Medium_Chemist_4032@reddit
I used it to develop an "order me a chicken breast" skill... and it worked. I'm still shocked and sleepy after that 1-hour sprint that lasted 4.