Qwen3.6 27B really good?
Posted by Popular-Factor3553@reddit | LocalLLaMA | View on Reddit | 80 comments
hi, I'm new to this but I've seen many people say it's even better than some 300B models, which shocked me a bit.
is it really that good? what models can I compare it to and at what quant? I tried searching myself but I can't run it right now and I just don't know what to think about others saying it's better than Claude.
Formal_Scarcity_7861@reddit
Every time I see a post saying how good Qwen is and how it can replace Claude, I think: if a 27B/35B model can replace a T-sized model, then the Qwen team should be able to make a 1T model to rule the world. Anyway, in terms of translating Japanese, Gemma 4 31B is better than any Qwen model for now.
666666thats6sixes@reddit
Training a 1T model might be out of reach for them (for now), the compute requirements are silly compared to a sub-100B model. Doing small dense and MoE allows them to iterate quicker and develop really good datasets before committing to training something larger.
fivecanal@reddit
DeepSeek, Moonshot, Zhipu and other, smaller Chinese companies have trained large models, so Alibaba should have even more resources to train bigger ones.
666666thats6sixes@reddit
Those are all MoE, though. Qwen also trained a fairly large (0.4T) MoE.
Popular-Factor3553@reddit (OP)
Kimi has a 1T model, ig Qwen just wants to make smaller models better, which is great tbh.
Mental_Object_9929@reddit
Although it's 1T, all parameters are 4-bit.
ComplexType568@reddit
You are forgetting Qwen3.6 Max, which is probably twice the size of the 419B model AT LEAST.
Popular-Factor3553@reddit (OP)
Oh, but Qwen is owned by Alibaba, right? Aren't they really big, like 325 billion dollars big?
666666thats6sixes@reddit
All Chinese firms are compute-bound, at least compared to American ones. Availability of Nvidia hardware is limited, and local competitors need more time to become viable. That's why we see practical small models coming mainly from China; they have no other option at the moment.
Popular-Factor3553@reddit (OP)
Ah right, the RAM shortage
asenna987@reddit
No, not really the RAM shortage - being compute-bound is mostly about Nvidia chips. There are strict export controls in place to make sure these Chinese firms don't get the top-end GPUs and TPUs and catch up to American tech.
Popular-Factor3553@reddit (OP)
I knew that was possible but didn't expect Nvidia to miss out on the Chinese market.
Kodix@reddit
Yeah, Gemma has better language skills. Qwen is specialized for coding/agentic stuff.
As far as the model performance-to-size discrepancy goes, Qwen isn't actually that much of an outlier here.
Out of the recently released frontier-adjacent models, they all have significantly different parameter counts but are still competitive with each other.
Deepseek V4 Pro - 1.6T parameters
GLM-5.1 - 750B parameters
MiniMax 2.7 - 230B parameters
So all minimax has to do is quadruple their parameters and they'll take over the world, yeah?
ComplexType568@reddit
They'd probably also have to quadruple their training data to make it worth it. Parameters are not everything. Llama 405B shows it.
Popular-Factor3553@reddit (OP)
Fr, that's so random, why would I translate Japanese lol
ProfessionalSpend589@reddit
To read Casio’s official site.
Popular-Factor3553@reddit (OP)
Nah not really
BitGreen1270@reddit
Not much into manga I assume 😄
Popular-Factor3553@reddit (OP)
Yea not really, thanks tho. I've heard Gemma 4 is good too, tho I'm more into Kimi and Qwen.
seppe0815@reddit
Gemma 4 rolls them all up like sushi, except for coding stuff
JuniorDeveloper73@reddit
4090, 256k context, no hallucinations
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL ^
  --host 0.0.0.0 ^
  --port 8082 ^
  -t 16 ^
  -ngl 99 ^
  -b 1024 ^
  -ub 256 ^
  --ctx-size 262144 ^
  --cache-type-k turbo3 ^
  --cache-type-v turbo2 ^
  --flash-attn on ^
  --mlock ^
  --jinja ^
  --reasoning-budget -1 ^
  --temp 0.5 ^
  --top-k 20 ^
  --top-p 0.95 ^
  --min-p 0.1 ^
  --webui-mcp-proxy
Prize_Negotiation66@reddit
these arguments are so stupid
JuniorDeveloper73@reddit
why?
jdchmiel@reddit
You should set threads to like 1-4 since all layers are on the GPU. Set ubatch and batch higher unless you are only generating 100 tokens at a time from 500-token inputs.
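Rough sketch of the changes I mean (just the flags I'd touch; the exact batch numbers are guesses, tune them to whatever VRAM you have left at 256k context):

-t 4 (1-4 threads is plenty when all layers are offloaded)
-b 2048 (bigger logical batch for prompt processing)
-ub 1024 (bigger physical batch; costs some extra VRAM)

Keep the rest of the command as you had it.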
JuniorDeveloper73@reddit
thanks, is the rest ok? I arrived at this config after Claude and Gemini kept bullshitting each other about the best cfg
jdchmiel@reddit
Run llama-bench to test the permutations of batch sizes. Also look at the Hugging Face model card to see what temp and such you should use for your specific use case. I do not recall a 0.5 temp in the recommendations, but not sure that it matters THAT much unless you are doing specific tasks.
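Something like this (the model filename is just a placeholder, point -m at wherever your GGUF actually lives):

llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 -fa 1 -p 2048 -n 256 -b 512,1024,2048 -ub 256,512,1024

-b and -ub take comma-separated lists, so one run gives you a table of prompt-processing and generation speeds for every combination.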
jreoka1@reddit
Yes, it's super good
kevin_1994@reddit
It feels about Sonnet 4 level imo. I tried vibecoding an internal tool (about 10k LOC) and it struggled but did okay.
Popular-Factor3553@reddit (OP)
Apparently the full original model with quant is better and close to Gemini 2.5 Pro
paryska99@reddit
You mean without quant? I'm not sure I understand this statement
Popular-Factor3553@reddit (OP)
Yea mb, *without
iportnov@reddit
I asked it to write tests for a method that, given a point and a curve, finds the nearest point on the curve to the given one. It found an implementation of Bézier and NURBS curve mathematics in the project, generated some example curves, calculated the nearest points analytically (literally - it knows what Bernstein polynomials are, it took a derivative analytically and solved the quadratic equation by formula) and used the calculated values in the test.
This was far from one-shot (several iterations of "write tests", "review tests", "fix tests" and so on), but still.
That's in Opencode.
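For anyone curious about the derivative trick: I'm paraphrasing, but the textbook version of what it reproduced is that the nearest point on a curve B(t) to a point P is where the squared distance ‖B(t) − P‖² is stationary, i.e. where (B(t) − P) · B′(t) = 0. For low-degree Bézier segments that condition is just a low-degree polynomial in t, which is why it could solve it by formula.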
jablokojuyagroko@reddit
It's insane, it's the first time that I think: ok, I can use this as my daily driver
g_rich@reddit
Qwen3.6 is really good but it’s not a replacement for Claude and anyone that thinks this has either never used Claude or is delusional.
However, it is one of the most powerful models that can be practically run on even the most modest local setups to get real work done.
I am running it at FP8 with a 256k context and have been extremely impressed with the output. Being a dense model it's on the slower side, but it gave me the best output for my standard test:
- Create a Tetris clone in HTML with levels and music.
- Create a leaderboard backend API with endpoints to host the HTML game, post a high score and retrieve sorted leaderboards, using Python, Flask and an SQLite database.
- Integrate the leaderboards into the HTML game.
- Create a Dockerfile using Alpine Linux to host the game and leaderboard API.

Other locally hosted LLMs have been able to complete this task, but Qwen3.6 27B has given me better game design and music than any other model and has required the least amount of back and forth to complete the subsequent tasks.
WhyNoAccessibility@reddit
I would say it's been quite solid honestly, but I have also been liking Qwen 2.5 Coder 7B. It has hit 88.4 on HumanEval. If you have constrained hardware, the 1.5B is still solid.
teachersecret@reddit
Yeah, it's extremely good. Shockingly good for the size.
Paradigmind@reddit
That's what girls keep telling me.
Best_Control_2573@reddit
27b, but it's thick.
florinandrei@reddit
Soda can, basically.
SingleProgress8224@reddit
It's dense
ShelZuuz@reddit
Some prefer a Mixture of Experts.
Mart-McUH@reddit
You mean fingers?
Paradigmind@reddit
Made me chuckle
ken107@reddit
Wow, that's a lot of girth
Popular-Factor3553@reddit (OP)
Wonder what Qwen 4 models will be like, can't wait
florinandrei@reddit
You may not be aware of this, but social media is full of something called "hype".
Adventurous-Paper566@reddit
For my use case Gemma 4 31B is better; even the 26B A4B is better.
Qwen models are dumb in French.
Popular-Factor3553@reddit (OP)
I think llama models are good in multiple languages.
ttkciar@reddit
It is quite good, using Q4_K_M.
It is not better than Claude, not by a long shot.
I'd compare it to early GPT-4, but can't narrow it down much more than that yet, because I've only just started using it.
Weary_Long3409@reddit
Yes, fair enough. Don't compare it to the latest frontier commercial models. Years ago we were amazed by the 175B GPT-3.5-Turbo and the 1.8T GPT-4. Now that level of intelligence and more is available for free.
ttkciar@reddit
Very true! The performance we get out of these mid-sized models is quite astounding.
It wasn't my intention to be minimizing or dismissive, just objective.
Popular-Factor3553@reddit (OP)
Oh alr thanks!
Ell2509@reddit
In some testing I did today around timetabling and budgeting (a fairly large multi-step task across a range of domains), it actually performed worse than 3.6 35B A3B, which was a huge surprise, as the 35B MoE was also blisteringly fast compared to the dense model.
Popular-Factor3553@reddit (OP)
Is it good in agentic tasks?
Elegant_Tech@reddit
I've been running it all day in VS Code and have been blown away by it. Been testing out stuff I normally toss to Opus. Each prompt used between 100-140k of context by the time it was done, burning an average of 1.5 million tokens. The results have been great, without a single error after testing.
gobi_1@reddit
If I may, what speed on what hardware?
Popular-Factor3553@reddit (OP)
Just gets me more excited to try it, it sounds impressive
woepaul@reddit
First model where agentic coding workflows work well for things more complex than just bash scripts (at least for me).
Used it to resurrect an old C-based world simulation prototype. It ported it to a new graphics API, found bugs, and implemented load and save for world parameters.
All done in llama.cpp's built-in web interface plus an MCP server for command execution that I vibecoded with Claude a while ago.
erazortt@reddit
Contrary to the benchmarks, from the testing I did, this is a very clear no. And by testing I mean taking the models to real work. I would argue that the very clear tiering of the initial 3.5 series, namely 0.8B < 2B < 4B < 9B < 35B < 27B ~ 122B < 397B < Claude, has not changed materially with the 3.6 release of the medium-sized models. The difference between 35B and 27B was so huge that the 3.6 release of the 35B was not able to bridge that gap. Now with the 3.6 release of 27B, yes, it is now probably slightly better than 122B, but only because the initial difference there was so small. And the gap from 122B to 397B is so clear that I have a hard time believing a 3.6 release of 122B will change anything here.
WetSound@reddit
Yes. I initially dismissed it for failing to one-shot my tests. But then I tried hooking it up with Pi and letting it keep working, to see if that helped.
And boy, did it! It just solves the stuff!
My test is very mathy, complex programming and it just has deep insight.
SthMax@reddit
No, it's really good. I would say it's near 4.5 Sonnet / Gemini 3.1 Flash level, not quite but close. Note that many people here ran it at ~4-bit quant, not its original BF16, and quantization of small models (<70B) absolutely hurts their performance.
Popular-Factor3553@reddit (OP)
I'm talking about the 27B model tho?
SthMax@reddit
Yes, 3.6 27B is close to 4.5 Sonnet / 3.1 Flash level; I would even say it's at a similar level to mid-2025 SOTA models like Gemini 2.5 Pro. But quantizing it and the KV cache will hurt its performance a lot. If you run it at 4-bit quants it will definitely be worse, around original DS-R1 level I would say.
Popular-Factor3553@reddit (OP)
Is it good for coding and agentic tasks? It's really hard to believe a 27B can compare with Gemini 2.5 Pro, which is at least a 1T model. ig AI does grow exponentially.
kiwibonga@reddit
It's actually for a really simple reason -- you don't need to hoard all the knowledge in the world, you just need to know how to google stuff.
Popular-Factor3553@reddit (OP)
You mean it uses the web?
kiwibonga@reddit
Sure, if you give it the means. It doesn't have to look in its massive parameter count for info that's publicly available.
Popular-Factor3553@reddit (OP)
But how will it even code properly if it's checking the web? Kinda doesn't make sense.
kiwibonga@reddit
It can research best practices and apply them, get inspiration from existing codebases, search for cryptic error messages, etc... Doesn't need to have all that information + the mating habits of giraffes to code. It just needs to know how to code and infer.
Popular-Factor3553@reddit (OP)
Didn't know that was a real thing, but wouldn't it be too slow?
kiwibonga@reddit
No, even Opus and ChatGPT need to search online to code competently, otherwise they hallucinate APIs that don't exist or have changed.
You have a finite token context window that you always have to front-load with domain-specific knowledge, and the local LLMs are getting excellent at finding a correct solution through that process, making use of skills and creating new skills from past conversations to avoid repetition.
Popular-Factor3553@reddit (OP)
Oh, didn't know that, thanks
SthMax@reddit
You can check out some leaderboards to verify it, such as Artificial Analysis etc. For coding / agentic tasks, YMMV; for non-frontier models, a better harness and context control matter more than the model's raw ability. Still, I think it's good enough for most daily coding / agentic work.
Popular-Factor3553@reddit (OP)
Alright thanks
jakegh@reddit
It's freaking excellent. First model I can actually run myself that I could see myself actually using for work, if I had to.
Dr_Me_123@reddit
I don't really notice a big improvement with the 27B model. But the 35B is faster and more practical for everyday tasks, though its intelligence ceiling is pretty obvious.
Popular-Factor3553@reddit (OP)
I haven't really looked into MoE models yet. ik about them, just never tested one.
Charming_Support726@reddit
The small Qwen 3.6 models are good. Really good.
But not that good.
Queasy-Contract9753@reddit
I think both the 27B and the 35B A3B are. This generation of Qwen is a game changer.
If you have ten minutes, I'd say go to Qwen chat and talk to them. Test them out, it's free.
Popular-Factor3553@reddit (OP)
Yea, just found out about Qwen studios, I'll try it, thanks!