High VRAM local coding model — still Qwen 3.6 27B?
Posted by Generic_Name_Here@reddit | LocalLLaMA | View on Reddit | 87 comments
I’ve been using Qwen 3.6 27B and it’s amazing. Not exactly your Opus replacement, but great for small tasks and checking work. But if you had 224GB of VRAM, would it still be your choice? Or is there something you consider better in the 100+B range (GPT-OSS, Deepseek, etc) that’s just not talked about as much because fewer people can run it? I care more about intelligence than t/s.
KalonLabs@reddit
Fortunately and unfortunately, when the Qwen team decided to make Qwen3.6 27B they said “hold my beer and watch this”, and no one else has yet managed to catch up to the unicorn of an LLM they made. I've been looking for a couple of days now for something other than Qwen3.6 27B that's good for agents and coding and that I can run on 2 DGX Sparks, but there aren't many options realistically without going off into the 1T models. We'll probably have to wait a month or two before anyone else starts to catch up.
viperx7@reddit
So true, they really cooked with the 27B. If intelligence per parameter was some metric this one would be at the top
The only limitation was that it was slow, but now even that seems to be going away; I can run this beast of a model at 100 t/s.
Hekel1989@reddit
How? That's my biggest issue with it, on my rtx 4090 it's too slow to realistically use.
Dany0@reddit
Shout-out to the worse brother - qwen3.6 35B. If you have lots of vram, you can probably figure out tasks where you can use both
Or better yet, run a herd of 27B agents. A parallel setup. A minute data centre at your home
GrungeWerX@reddit
Well put, and so true. It’s still blowing my mind and I haven’t even turned on thinking yet.
llama-impersonator@reddit
dsv4 flash > qwen 397b > minimax > step
none of these are actually big upgrades from 27b other than dsv4 flash, which has mega context that works alright at 300-400k. they know a little more, but the qwen team really put some magic reasoning sauce in their 27b.
Nepherpitu@reddit
Were you able to run ds4 flash? I have 192gb of vram on 3090s, but ds4 is not supported by vllm or sglang on this architecture, as well as some other architectures. And where it is supported, it's slow.
llama-impersonator@reddit
not with sglang or vllm, i got antirez's lcpp fork to run and it seemed okay but i don't actually trust any of these unofficial ports.
markole@reddit
Antirez made Redis, he really isn't someone you should be suspicious of.
llama-impersonator@reddit
i mean i am fine with running the software, but it's not official support. these models have a lot of working parts that all need to be pretty much perfect for a model to work well in an agentic context. especially for model architectures that diverge quite a bit from regular transformers. look at how long it took qwen3 next to work in llama.cpp.
cosmicnag@reddit
so what's actually the next upgrade from 27b?
llama-impersonator@reddit
glm 5.1 maybe? idk, dsv4 flash is about where i stop being able to run things at reasonable quants
Dany0@reddit
Yes, it's not a toss-up between glm 5.1, minimax, kimi, and dsv4; each shines in its own niche. Hopefully the next checkpoint/full release of dsv4 will come out soon (prolly in the next few weeks) & will decimate all the open-weights competition
Mistral medium btw is surprisingly good! Very wide, knows a lot of niches. Struggles abound though...
cantgetthistowork@reddit
16x3090s desperately waiting for DS4F support
alex_pro777@reddit
Try Gemma4-31B full precision.
exaknight21@reddit
I just got Qwen3.6-36B-A3B from unsloth, Q4_K_XL - MTP with TurboQuant at k_q8 and v_q8 on my Mi50 32 GB @ 70K context. Notably, I wanted all compute on the GPU and it fit.
Let me tell you something my friend. Holy shit. Not only is this thing blazing fast, its tool calling is robust, and it's a helluva upgrade.
I’m about to try the Qwen3.6-27B.
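For reference, roughly how I launch it, written against mainline llama.cpp flag names (the MI50 fork's flags for the MTP/TurboQuant bits may differ, and the model path is a placeholder):

```python
import subprocess

# Sketch of the launch settings described above; flag names are mainline llama.cpp,
# the MI50 fork may differ. Model filename is a placeholder.
cmd = [
    "llama-server",
    "-m", "Qwen3.6-36B-A3B-Q4_K_XL.gguf",  # unsloth quant (placeholder path)
    "-c", "70000",                          # ~70K context
    "-ngl", "999",                          # offload all layers: keep all compute on the GPU
    "--cache-type-k", "q8_0",               # KV cache keys at q8
    "--cache-type-v", "q8_0",               # KV cache values at q8
]
subprocess.run(cmd, check=True)
```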
HlddenDreck@reddit
Sounds interesting, however I would go for Q8. What fork do you use for the MI50?
Corpo_@reddit
Where did you find the fork for mtp+turbo?
fyv8@reddit
Good luck. Personally, I haven't been able to get better quality out of 27B than 35B but I've been super happy with the latter and its speed is amazing on a 4090 or 5090.
rmhubbert@reddit
Minimax M2.5 (or M2.7 if you can stomach the license) & Qwen3-Coder-Next are also worth a look on that amount of VRAM. I've seen great results from both on 192GB of VRAM.
soyalemujica@reddit
QwenCoderNext is behind even the MoE model of 3.6. I have no idea why people would suggest QwenCoder aside from 27B; it makes literally no sense
rmhubbert@reddit
It makes perfectly sound sense. I suggest it because it works well in my workflow. That's why I said that I had seen great results, and not that OP would see great results.
It clearly doesn't work well for your use cases, but all that means is that it doesn't work well in your workflow, not that it makes no sense to suggest it at all.
soyalemujica@reddit
Qwen 3.6 35B A3B is even better than QwenCoderNext while being smaller and faster. Please switch.
Codex_Pax@reddit
This is simply not true. I am running a Strix Halo with 128GB RAM and QwenCoderNext always performs better in coding tasks. And it doesn't waste a billion years thinking...
soyalemujica@reddit
You should try using a fixed template for it, and the right quantization. I have not once gotten into a thinking loop, and I do a lot of C++ coding and planning, plus game server analysis across hundreds of files, and it does an amazing job as well
Codex_Pax@reddit
I tried both with the template fix and quants at Q8. QwenCoderNext is simply better from what I have seen so far for my use case, which has been mostly Python.
rmhubbert@reddit
Believe it or not, I do actually try other models, including all of the Qwen3.5 and 3.6 family. Please stop assuming your experience is universal to all developers. We all have different workflows, use cases, and resources.
I have the capability to run Qwen3-Coder-Next at full precision, with full context, at around 110tps. Within my harnesses and workflows, it consistently performs better for the tasks I want LLMs to do than either of the Qwen3.6 models. That is why I made the suggestion, as with every other opinion on here, YMMV.
PracticlySpeaking@reddit
license?
mp3m4k3r@reddit
From https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE
429_TooManyRequests@reddit
Lol, how do they even monitor a team using this. It’s local AI
QuchchenEbrithin2day@reddit
That's what is often referred to as a "paper license". The way it works: someone worth "squeezing" (read: with money) ignores the license, i.e. uses it for some kind of commercial purpose, somehow this fact is discovered by the licensor, who first issues a notice with an offer of out-of-court settlement, else takes them to court. This is a very simplified, short view. How you monitor is indeed a good question; unlike standard binary software use, usage of weights might be harder to track, especially if the inference engine, harness, or agentic framework (or any such foundational software) has no instrumentation to support such tracking. However, expect that to change, if it isn't changing already.
falconandeagle@reddit
Not harder to track, impossible. How the fuck are you going to track if I used the AI for coding an APP. Does the AI sign the code somehow? Truth is all these licences are worthless EULA type docs that can never be enforced.
mateszhun@reddit
You underestimate ego/stupidity/forgetfulness. One slip up in an interview is enough.
"We used minimax locally for our super successful project" -> lawsuit. Some code traces in the frontend, or some weird config settings, same.
I'm not saying this happens everyday, but things like this happen sometimes. And with this you can be ready to pounce and harvest some extra revenue.
falconandeagle@reddit
This is just fear mongering, where is the proof? The evidence? What will they present to the court? The only chance of this actually applying is a small business using it and then being audited. For a single person coding at home there is zero chance anything comes of it.
mateszhun@reddit
Did this already happen with AI? No.
Did this already happen with other technologies? Yes.
GitHub repo that can detect some code-related licensing issues:
osssanitizer / osspolice
A court case that decided that licensing fees have to be paid for a paper licensee
There is the case where Anthropic settled to pay for the books they have used on training AI
moncallikta@reddit
No need to track it. It's enough that someone somewhere mentions the usage to trigger legal action. Don't assume discovery needs a technical solution.
mp3m4k3r@reddit
Good question. It's more of a legal tactic; technically, license auditing for software does exist, but yeah, it's not really something that can be 'monitored' for, so much as: if you're somehow caught in breach, lawyers, something something?
florinandrei@reddit
If you don't get caught, you're fine.
oldschooldaw@reddit
I don’t really understand what the license is trying to convey. What part of it is an issue?
leinadsey@reddit
I’ve run qwen3-coder-next on M4 Max 128gb. It actually runs pretty well, but has a tendency to overheat the MacBook pretty quickly (hardly the model's fault though) and slows down significantly after a bit. I’ve had better luck running it with LM Studio rather than Ollama.
FullOf_Bad_Ideas@reddit
I have 192GB of VRAM and I use Qwen 3.5 397B. I tried Qwen 3.6 27B very briefly and just didn't like it.
florinandrei@reddit
Well, could you say why?
FullOf_Bad_Ideas@reddit
It was getting confused by context, did edits that broke the code and didn't make sense, so I threw it out quickly.
PrysmX@reddit
Interesting. I ran 397B but preferred Qwen3-Coder-Next over it, and now 27B over that.
FullOf_Bad_Ideas@reddit
Which 397B quant were you running? I'm running one of my 3.5bpw EXL3 quants, I've been able to get quality pretty high for the size.
PrysmX@reddit
Not a quant, FP8.
florinandrei@reddit
I'm pretty sure Qwen is not a native FP8 model.
FullOf_Bad_Ideas@reddit
FP8 is a quant lol, but it should be good.
tracagnotto@reddit
Lmao I'm running that shit on a 16gb vram machine
florinandrei@reddit
One token a day keeps the doctor away.
john0201@reddit
37B is sonnet, DSV4 flash is sonnet with 1M context. First one will run on a 5090 (or 2 if you want 8 bit), DS needs a pair of rp6ks
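Rough weight-only sizing behind that claim (ignores KV cache and runtime overhead, so treat it as a floor):

```python
# back-of-the-envelope VRAM for the ~37B model mentioned above
params_b = 37
for bits in (4, 8, 16):
    weights_gb = params_b * bits / 8   # weights only, no KV cache or overhead
    print(f"{bits}-bit: ~{weights_gb:.0f} GB of weights")
# 4-bit (~18 GB) fits a single 32 GB 5090; 8-bit (37 GB) needs a second card
```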
SangerGRBY@reddit
Runs on MBP 128?
john0201@reddit
Qwen3.6 27B will, but it’s a little slow.
SangerGRBY@reddit
Damn, slow but usable? Or is local LLM just not there yet?
I am hesitant about pulling the trigger. A lot of mixed reviews out there..
Use case is most likely to have some planning/coding agents carry out long-running tasks overnight - code + research.
john0201@reddit
Long running stuff like that is more about debugging your harness setup. I’d say this is the first model that is “there”. It’s basically sonnet.
GrungeWerX@reddit
Yeah, I’m passing its outputs to sonnet literally every day to compare, and it’s corrected sonnet on more than one occasion. It really does feel very close
wren6991@reddit
I'm using 27B Q8_K_XL on M4 Max. TG is bearable, PP is awful. I understand PP is around 3x better on M5 series.
You really feel the slow PP when you interrupt the model to re-prompt it and it has to fully reprocess the context because the harness pruned something.
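Rough illustration of why that hurts so much: prompt caching can only reuse the longest common prefix with the previous request, so when the harness prunes something early in the context, nearly the whole prompt gets re-prefilled (toy sketch, not any particular engine's implementation):

```python
def reusable_prefix(cached_tokens, new_tokens):
    """Count how many leading tokens match the previously cached request."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# toy example: the harness prunes 200 tokens near the start of a 40k-token context
cached = list(range(40_000))
pruned = cached[:1_000] + cached[1_200:]
reuse = reusable_prefix(cached, pruned)
print(f"reusable: {reuse} tok, must re-prefill: {len(pruned) - reuse} tok")
# -> only the first 1,000 tokens survive; ~38,800 have to be processed again
```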
PreparationTrue9138@reddit
How many tokens per second do you get for prompt processing?
astronut_13@reddit
Honestly, I’m also in the same boat but have yet to really find something better. It also heavily depends on how you harness it. I use Claude Code locally and have yet to find anything better than Qwen 3.6 27b. I run fp16 (important for long context and tool use so errors don’t propagate). For those recommending 37b, I disagree: that’s a MoE model intended for speed, which only activates 3B parameters at a time, vs 27b which is dense with all parameters activated at once, so it’s deff more “intelligent”. Just holding my breath for a bigger-parameter 3.6 dense model…
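Back-of-the-envelope version of that active-parameter argument, using the sizes quoted in this thread rather than official spec sheets:

```python
# per-token compute scales roughly with *active* parameters, not total
dense_total = 27e9   # 27B dense: every weight participates for every token
moe_total   = 35e9   # 35B-A3B MoE: total weights held in memory
moe_active  = 3e9    # ...but only ~3B are routed/active per token

print(f"dense : ~{dense_total / 1e9:.0f}B params touched per token")
print(f"MoE   : ~{moe_active / 1e9:.0f}B params touched per token "
      f"({moe_active / moe_total:.0%} of its {moe_total / 1e9:.0f}B weights)")
```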
florinandrei@reddit
That's gonna require some juice to run fast.
PrysmX@reddit
27B really is that good. Qwen3-Coder-Next (80B) was my go-to for coding and agents until 27B dropped. I swapped to it and it's crazily enough even better. They have some secret sauce in 27B. There is also something to be said for having speed and still being on a dense model.
florinandrei@reddit
Perhaps it's highly optimized for coding.
Gemma 4 is better at language, to quote an example from the same size class.
jacek2023@reddit
Unfortunately, the problem is that you will receive comments from people who “don’t use them locally, but recommend them”. This is a problem I’ve had with the Internet forever 😄
florinandrei@reddit
I don't use Opus locally, but I most definitely recommend it. /s
QuchchenEbrithin2day@reddit
Might have something to do with the number of people who have or can afford 16x rtx3090s hanging out on reddit 😄? Advice is cheap.
MK_L@reddit
I just picked up a 256GB VRAM machine. Just started testing out different models, with qwen3.5 397b being the first. It wasn't super impressive. Minimax and a deepseek quant are on my list to test against.
Winners so far are actually 3.6 27b and 3.6 35b. If you have something you would like me to test, let me know.
Generic_Name_Here@reddit (OP)
This is exactly what I was hoping to hear when I posted this. Sure, huge models might be amazing, but is 27B really competing with the 300B models?
MK_L@reddit
Really the only test where the 397b model out-"shines" the smaller models is "write me a 4000 word story". The smaller models write, but it's lackluster and just hits the minimum requirements of the prompt. The 397b is out of the gate writing a book, very complete. In one of the iterations it gave 4000 words of a very complete outline of a book: story bible, start of each chapter, etc. Basically the cliff notes to a full book that could have been used to write one.
But that's it. Everything else has just been on par with 35b and 27b.
I took a break from testing the different models to write a harness to load the different models and log tests... got lazy and didn't like typing out a short story just to test a model. Command lines be long like that
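The harness is basically shaping up like this (untested sketch; model paths, the prompt file, and the generation length are placeholders, and it assumes a llama.cpp-style llama-cli on PATH):

```python
import json
import pathlib
import subprocess
import time

MODELS = [
    "models/Qwen3.6-27B-Q8.gguf",      # placeholder paths
    "models/Qwen3.6-35B-A3B-Q8.gguf",
    "models/Qwen3.5-397B-Q4.gguf",
]
PROMPT = pathlib.Path("prompts/short_story.txt").read_text()  # e.g. the 4000-word-story prompt

results = []
for model in MODELS:
    start = time.time()
    out = subprocess.run(
        ["llama-cli", "-m", model, "-p", PROMPT, "-n", "4096"],  # mainline llama.cpp flags
        capture_output=True, text=True,
    )
    results.append({
        "model": model,
        "seconds": round(time.time() - start, 1),
        "words": len(out.stdout.split()),   # crude "did it actually write something" check
    })

pathlib.Path("results.json").write_text(json.dumps(results, indent=2))
```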
Generic_Name_Here@reddit (OP)
> (a) personal use, including self-hosted deployment for coding, development of applications, agents, tools, integrations
To me, this seems pretty permissive.
MK_L@reddit
Sorry, this time I'm not tracking. What do you mean?
fractalcrust@reddit
DSv4-flash on the API actually felt really good to use, and has me window-shopping for 2x 6000s. minimax 2.7 is retarded, i couldn't do anything with it.
twack3r@reddit
It works in TP2 but it really shows that it’s meant to be DP4 across 4x6000.
And this is what I’m getting at FP8. Prefill is borderline for anything beyond 64k but generation holds up surprisingly well. So yay for that memory bandwidth and nay for the compute-equivalent to a B200 basically tapping out before ctx is large enough to do proper coding work.
Primary client-streaming results
64K context
- Prefill probe: prompt 37,882 tok, TTFT-content 18.738 s, prefill equivalent 2,021.7 tok/s
- Generation/decode workload: prompt 37,871 tok, completion 512 tok, TTFT-content 18.435 s, decode TKS 71.2 tok/s, TPOT 14.1 ms, E2E 25.616 s

128K context
- Prefill probe: prompt 75,710 tok, TTFT-content 52.833 s, prefill equivalent 1,433.0 tok/s
- Generation/decode workload: prompt 75,696 tok, completion 512 tok, TTFT-content 47.506 s, decode TKS 43.4 tok/s, TPOT 23.0 ms, E2E 59.269 s

256K context
- Prefill probe: prompt 151,365 tok, TTFT-content 142.151 s, prefill equivalent 1,064.8 tok/s
- Generation/decode workload: prompt 151,356 tok, completion 512 tok, TTFT-content 139.261 s, decode TKS 29.4 tok/s, TPOT 34.0 ms, E2E 156.653 s
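For anyone wanting to sanity-check, the derived figures above line up (within rounding) with simple arithmetic on the raw timings, e.g. for the 64K run:

```python
# 64K-context run from the table above
prompt_tok, ttft_prefill = 37_882, 18.738          # prefill probe
completion_tok, ttft, e2e = 512, 18.435, 25.616    # generation/decode workload

prefill_tok_s = prompt_tok / ttft_prefill          # ~2,021.7 tok/s prefill equivalent
decode_s = e2e - ttft                              # seconds spent generating
decode_tok_s = completion_tok / decode_s           # ~71 tok/s decode TKS
tpot_ms = decode_s / completion_tok * 1000         # ~14 ms per output token

print(f"{prefill_tok_s:.1f} tok/s prefill, {decode_tok_s:.1f} tok/s decode, {tpot_ms:.1f} ms TPOT")
```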
Server log-window sanity
KalonLabs@reddit
Well if it makes you feel any better, I have 2 DGX Sparks I set MiniMax M2.7 up on and it's also retarded for me. If I talk to it with no system prompt it kinda works, but if I give it a system prompt, then it starts generating gibberish in 7 languages at once 😭. I'm gonna be switching it over to DeepSeek V4 Flash. Should do 20-25 tps.
jon23d@reddit
I get great work out of it! I run q6 on a Mac Studio 512 via opencode with a hefty system prompt.
Turbulent_Ad7096@reddit
Are you using vllm 0.20 or 0.19 with MiniMax? It has a known issue with 0.19 and has worked fine for me since updating.
KalonLabs@reddit
V0.20.2
StardockEngineer@reddit
What model are you running? I mean specifically. I have not come across this problem.
Technical-Earth-3254@reddit
So I'm not the only one who can't really get something appropriate out of M2.7. Imo Step 3.5 Flash was/is way superior and was the ~200GB champion until DS V4 Flash arrived.
zdy1995@reddit
Mistral-Medium-3.5
segmond@reddit
You got options
81G /home/seg/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat.gguf
117G /home/seg/models/GLM4.6V
122G /home/seg/models/Qwen3.5-122B-Q8
137G /home/seg/models/Devstral2-123B
140G /home/seg/models/MistralMedium3.5-128B
151G /home/seg/models/Step3.5-Flash
153G /llmzoo/models/DeepSeek-V4-Flash-Q4_X.gguf
184G /home/seg/models/MiniMax-M2.7-Q6
205G /home/seg/models/Qwen3.5-397B-Q4
227G /home/seg/models/MiniMax-M2.7-Q8
jon23d@reddit
I use Minimax m2.7 and love it
Technical-Earth-3254@reddit
Personally, I would go for DS V4 Flash. Didn't try it locally due to being GPU poor, but via API it's great. And native precision is around 200GB.
Ariquitaun@reddit
Seconded, not a great thinking model, but it's very effective if given a plan that's detailed enough.
Yorn2@reddit
Using lukealonso/MiniMax-M2.7-NVFP4 here with two RTX PROs and running it around 160 GB VRAM. I have plenty enough headroom to fit in a comfy instance and TTS this way, though I often find I prefer running another LLM (Qwen or Gemma) in the available space for testing/benchmarking.
annodomini@reddit
MiniMax M2.7 works out pretty nicely, it even works reasonably on my Strix Halo system at UD-IQ3_XXS, I'm sure it would be even better at a much less aggressive quant.
Other options might be Deepseek V4 Flash and Qwen 3.5 397B A17B.
DataGOGO@reddit
Minimax 2.5 / 2.7
Professional-Bear857@reddit
I'm using deepseek V4 flash with the 35b qwen model as an alternative, using around 200gb of vram. Otherwise a quant of qwen 397b or 122b or the older qwen 235b is pretty good.