High VRAM local coding model — still Qwen 3.6 27B?

[-]

KalonLabs@reddit

Fortunately and unfortunately when the qwen team decided to make qwen3.6 27B they said “hold my beer and watch this” and no one else has yet managed to catch up to the unicorn of an llm they made. Ive been looking for a couple of days now for something other than qwen3.6 27B thats good for agents and coding i can run run in 2 DGX Sparks, but theres not many option realistically without going off into the 1T models. Well probably have to wait a month or two before anyone else starts to catch up.

[-]

viperx7@reddit

So true, they really cooked with the 27B. If intelligence per parameter was some metric this one would be at the top

The only limitation was that it was slow but now even that seems to be going away , can run this beast of a model at 100t/s

[-]

Hekel1989@reddit

How? That's my biggest issue with it, on my rtx 4090 it's too slow to realistically use.

[-]

Dany0@reddit

Shout-out to the worse brother - qwen3.6 35B. If you have lots of vram, you can probably figure out tasks where you can use both

Or better yet, run a herd of 27B agents. A parallel setup. A minute data centre at your home

[-]

GrungeWerX@reddit

Well put, and so true. It’s still blowing my mind and I haven’t even turned on thinking yet.

[-]

llama-impersonator@reddit

dsv4 flash > qwen 397b > minimax > step none of these are actually big upgrades from 27b other than dsv4 flash which has mega context that works alright at 300-400k. they know a little more, but the qwen team really put some magic reasoning sauce in their 27b.

[-]

Nepherpitu@reddit

Were you able to run ds4 flash? I have 192gb of vram on 3090s, but ds4 is not supported by vllm or sglang on this architecture. Add well as some others architectures. And where it is supported, it's slow.

[-]

llama-impersonator@reddit

not with sglang or vllm, i got antirez's lcpp fork to run and it seemed okay but i don't actually trust it any of these unofficial ports.

[-]

markole@reddit

Antirez made Redis, he really isn't someone you should be suspicious of.

[-]

llama-impersonator@reddit

i mean i am fine with running the software, but it's not official support. these models have a lot of working parts that all need to be pretty much perfect for a model to work well in an agentic context. especially for model architectures that diverge quite a bit from regular transformers. look at how long it took qwen3 next to work in llama.cpp.

[-]

cosmicnag@reddit

so whats actually the next upgrade from 27b?

[-]

llama-impersonator@reddit

glm 5.1 maybe? idk, dsv4 flash is about where i stop being able to run things at reasonable quants

[-]

Dany0@reddit

Yes, it's not a toss-up between glm 5.1, minimax, kimi, dsv4 but each shine through in their own niche. Hopefully next checkpoint/full release of dsv4 will come out soon (prolly in the range of next few weeks) & will decimate all the open weights competition

Mistral medium btw is surprisingly good! Very wide, knows a lot of niches. Struggles abound though...

[-]

cantgetthistowork@reddit

16x3090s desperately waiting for DS4F support

[-]

alex_pro777@reddit

Try Gemma4-31B full precision.

[-]

exaknight21@reddit

I just got Qwen3.6-36B-A3B from unsloth, Q4_K_XL - MTP with TurboQuant at k_q8 and v_q8 on my Mi50 32 GB @ 70K context. Notable mention, I wanted all compute on GPU and it fit.

Let me tell you something my friend. Holy shit. Not only is this thing blazing fast, it’s tool calling is robust, and is helluva upgrade.

I’m about to try the Qwen3.6-27B.

[-]

HlddenDreck@reddit

Sounds interesting, however I would go for Q8. What fork do you use for the MI50?

[-]

Corpo_@reddit

Where did you find the fork for mtp+turbo?

[-]

fyv8@reddit

Good luck. Personally, I haven't been able to get better quality out of 27B than 35B but I've been super happy with the latter and its speed is amazing on a 4090 or 5090.

[-]

rmhubbert@reddit

Minimax M2.5 (or M2.7 if you can stomach the license) & Qwen3-Coder-Next are also worth a look on that amount of VRAM. I've seen great results from both on 192GB of VRAM.

[-]

soyalemujica@reddit

QwenCoderNext is behind even the More model of 3.6. I have no idea why would people suggest QwenCoder asides 27B that makes literally no sense

[-]

rmhubbert@reddit

It makes perfectly sound sense. I suggest it because it works well in my workflow. That's why I said that I had seen great results, and not that OP would see great results.

It clearly doesn't work well for your use cases, but all that means is that it doesn't work well in your workflow, not that it makes no sense to suggest it all.

[-]

soyalemujica@reddit

Qwen 3.6 35B A3B is even better than QwenCoderNext while being smaller and faster. Please switch.

[-]

Codex_Pax@reddit

This is simply not true I am running a strix Halo 128gb ram and QwenCodeNext Always performs better in coding tasks. And it doesn't waste a billion years thinking..

[-]

soyalemujica@reddit

You should try using fixed template for it, also a right quantification. I have not once gotten into a thinking loop and I do a lot of C++ coding and planning, also game server analysis for hundred files and it does an amazing job as well

[-]

Codex_Pax@reddit

I tried both with template fix and Quants at Q8. QwenCoderNext is simply better from what i have seen so far for my use case. which has been mostly python.

[-]

rmhubbert@reddit

Believe it or not, I do actually try other models, including all of the Qwen3.5 and 3.6 family. Please stop assuming your experience is universal to all developers. We all have different workflows, use cases, and resources.

I have the capability to run Qwen3-Coder-Next at full precision, with full context, at around 110tps. Within my harnesses and workflows, it consistently performs better for the tasks I want LLMs to do than either of the Qwen3.6 models. That is why I made the suggestion, as with every other opinion on here, YMMV.

[-]

PracticlySpeaking@reddit

license?

[-]

mp3m4k3r@reddit

From https://github.com/MiniMax-AI/MiniMax-M2.7/blob/main/LICENSE

NON-COMMERCIAL LICENSE
Non-commercial use permitted based on MIT-style terms; commercial use requires prior written authorization.
Copyright (c) 2026 MiniMax
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software for non-commercial purposes, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or provide copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
1. The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
2. If the Software (or any derivative works thereof) is used for any Commercial Use, you shall prominently display "Built with MiniMax M2.7" on a related website, user interface, blogpost, about page or product documentation.
3. Any Commercial Use of the Software or any derivative work thereof is prohibited without obtaining a separate, prior written authorization from MiniMax.  To request such authorization, please contact api@minimax.io with the subject line "M2.7 licensing".
4. "Commercial Use" means any use of the Software or any derivative work thereof that is primarily intended for commercial advantage or monetary compensation, which includes, without limitation: (i) offering products or services to third parties for a fee, which utilize, incorporate, or rely on the Software or its derivatives, (ii) the commercial use of APIs provided by or for the Software or its derivatives, including to support or enable commercial products, services, or operations, whether in a cloud-based, hosted, or other similar environment, and (iii) the deployment or provision of the Software or its derivatives that have been subjected to post-training, fine-tuning, instruction-tuning, or any other form of modification, for any commercial purpose.
5. Permitted Free Uses. The following uses are expressly permitted free of charge: (a) personal use, including self-hosted deployment for coding, development of applications, agents, tools, integrations, research, experimentation, or other personal purposes; (b) use by non-profit organizations, academic institutions, and researchers for non-commercial research or educational purposes; (c) modification of the Software solely for the uses described in (a) or (b) above.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Appendix: Prohibited Uses
You agree you will not use, or allow others to use, the Software or any derivatives of the Software to:
1. Generate or disseminate content prohibited by applicable laws or regulations.
2. Assist with, engage in or otherwise support any military purpose.
3. Exploit, harm, or attempt to exploit or harm minors.
4. Generate or disseminate false or misleading information with the intent to cause harm.
5. Promote discrimination, hate speech, or harmful behavior against individuals or groups based on race or ethnic origin, religion, disability, age, nationality and national origin, veteran status, sexual orientation, gender or gender identity, caste, immigration status, or any other characteristic that is associated with systemic discrimination or marginalization.

[-]

429_TooManyRequests@reddit

Lol, how do they even monitor a team using this. It’s local AI

[-]

QuchchenEbrithin2day@reddit

That's what is often referred to as "paper license". The way it works is someone worth "squeezing" (read "with money") ignores the license i.e. uses it for some kind of commercial purpose, and somehow this fact is found by the licensor, who first issues notice with offer for out-of-court settlement, else taken to court. This is a very simplified, short view. How you monitor, is indeed a good question, that unlike standard binary software use, usage of weights might be harder to track, especially if the inference-engine or harness or agentic-framework (or any such foundational software) has no instrumentation to support such tracking. However, expect that to change, if not already changing.

[-]

falconandeagle@reddit

Not harder to track, impossible. How the fuck are you going to track if I used the AI for coding an APP. Does the AI sign the code somehow? Truth is all these licences are worthless EULA type docs that can never be enforced.

[-]

mateszhun@reddit

You underestimate ego/stupidity/forgetfulness. One slip up in an interview is enough.

"We used minimax locally for our super successful project" -> lawsuit. Some code traces in the frontend, or some weird config settings, same.

I'm not saying this happens everyday, but things like this happen sometimes. And with this you can be ready to pounce and harvest some extra revenue.

[-]

falconandeagle@reddit

This is just fear mongering, where is the proof? The evidence? What will they present to the court. The only chance of this actually applying is for small business using it and then being audited. For a single person coding at home there is zero chance you are doing anything.

[-]

mateszhun@reddit

Did this already happen with AI? No.

Did this already happen with other technologies? Yes.

Github repo that can detect some code related licencsing issues:
osssanitizer / osspolice

A court case that decided that licensing fees have to be paid for a paper licensee

There is the case where Anthropic settled to pay for the books they have used on training AI

[-]

moncallikta@reddit

No need to track it. It’s enough that someone somewhere mention the usage, to trigger legal action. Don’t assume discovery needs a technical solution.

[-]

mp3m4k3r@reddit

Good question, more of a legal tactic technically license auditing for software does exist, but yeah not likely something able to 'monitored' for as much as if somehow caught in breach lawyer something something?

[-]

florinandrei@reddit

If you don't get caught, you're fine.

[-]

oldschooldaw@reddit

I don’t really understand what the license is trying to convey. What part of it is an issue?

[-]

leinadsey@reddit

I’ve run qwen3-coder-next on M4 Max 128gb. It actually runs pretty well, but has a tendency to overheat the MacBook pretty quickly (hardly the models fault though) and slows down significantly after a bit. I’ve had better luck with it and LM Studio rather than Ollama.

[-]

FullOf_Bad_Ideas@reddit

I have 192GB of VRAM and I use Qwen 3.5 397B. I tried Qwen 3.6 27B very briefly and just didn't like it.

[-]

florinandrei@reddit

I tried Qwen 3.6 27B very briefly and just didn't like it.

Well, could you say why?

[-]

FullOf_Bad_Ideas@reddit

It was getting confused by context, did edits that broke the code and didn't make sense, so I threw it out quickly.

[-]

PrysmX@reddit

Interesting. I ran 397B but preferred Qwen3-Coder-Next over it, and now 27B over that.

[-]

FullOf_Bad_Ideas@reddit

Which 397B quant were you running? I'm running one of my 3.5bpw EXL3 quants, I've been able to get quality pretty high for the size.

[-]

PrysmX@reddit

Not a quant, FP8.

[-]

florinandrei@reddit

I'm pretty sure Qwen is not a native FP8 model.

[-]

FullOf_Bad_Ideas@reddit

FP8 is a quant lol, but it should be good.

[-]

tracagnotto@reddit

Lmao I'm running that shit on a 16gb vram machine

[-]

florinandrei@reddit

One token a day keeps the doctor away.

[-]

john0201@reddit

37B is sonnet, DSV4 flash is sonnet with 1M context. First one will run on a 5090 (or 2 if you want 8 bit), DS needs a pair of rp6ks

[-]

SangerGRBY@reddit

Runs on MBP 128?

[-]

john0201@reddit

Qwen3.6 27B will, but it’s a little slow.

[-]

SangerGRBY@reddit

Damn, slow but usable ? Or local llm just isnt there yet.

I am hesistant about pulling the trigger. Alot of mixed reviews out there..

Use case is to most likely has some planning/coding agents carry out long running task over night - code + research.

[-]

john0201@reddit

Long running stuff like that is more about debugging your harness setup. I’d say this is the first model that is “there”. It’s basically sonnet.

[-]

GrungeWerX@reddit

Yeah, I’m passing its outputs to sonnet literally every day to compare and its corrected sonnet on more than one occasion. It really does feel very close

[-]

wren6991@reddit

I'm using 27B Q8_K_XL on M4 Max. TG is bearable, PP is awful. I understand PP is around 3x better on M5 series.

You really feel the slow PP when you interrupt the model to re-prompt it and it has to fully reprocess the context because the harness pruned something.

[-]

PreparationTrue9138@reddit

How many tokens per second do you get for prompt processing?

[-]

astronut_13@reddit

Honestly, I’m also in the same boat but have yet to really find something better. It also heavily depends how you harness it. I use Claude code locally and have yet to find anything better than Qwen 3.6 27b. I run fp16 (important for long context and tool use so errors don’t propagate). For those recommending 37b, I disagree. That’s a MoE model intended for speed and only activates 3B parameters at a time vs 27b which is dense and all parameters are activated at once so it’s deff more “intelligent”. Just holding my breath for a bigger parameter 3.6 dense model…

[-]

florinandrei@reddit

a bigger parameter 3.6 dense model…

That's gonna require some juice to run fast.

[-]

PrysmX@reddit

27B really is that good. Qwen3-Coder-Next (80B) was my go-to for coding and agents until 27B dropped. I swapped to it and it's crazily enough even better. They have some secret sauce in 27B. There is also something to be said for having speed and still being on a dense model.

[-]

florinandrei@reddit

Perhaps it's highly optimized for coding.

Gemma 4 is better at language, to quote an example from the same size class.

[-]

jacek2023@reddit

Unfortunately, the problem is that you will receive comments from people who “don’t use them locally, but recommend them” This is a problem I’ve had with the Internet forever 😄

[-]

florinandrei@reddit

I don't use Opus locally, but I most definitely recommend it. /s

[-]

QuchchenEbrithin2day@reddit

Might have something to do with the number of people who have or can afford 16x rtx3090s hangout on reddit 😄 ? Advice is cheap.

[-]

MK_L@reddit

I just picked up 256 vram machine. Just started testing out different models with qwen3.5 397b being the first. It wasn't super impressive. Mini max and a deep seek quant is on my list to test against.

Winner so far is actually 3.6 27b and 3.6 35b. If you have something you would like me to test let me know

[-]

Generic_Name_Here@reddit (OP)

This is exactly what I posted this to hear. Sure, huge models might be amazing, but is 27B really competing with the 300B models?

[-]

MK_L@reddit

Really the only test where the 397b model out "shines" the smaller models is "write me a 4000 word story" the smaller models write but its lack luster and hits the minimum requirements of the prompt. The 397b is out of the gate writting a book, very complete. One of the iterations it did give 4000 words of a very complete outline of a book. Story Bible, start of each chapter ect. Basically gave the cliff notes to a full book that could have been use to write a book.

But thats it. Everything else has just been on pare with 35b and 27b.

I took a break from testing the different models to write a harness to load the different models and log tests... got lazy and didn't like typing out a short story just to test a model. Command lines be long like that

[-]

Generic_Name_Here@reddit (OP)

> (a) personal use, including self-hosted deployment for coding, development of applications, agents, tools, integrations

To me, this seems pretty permissive.

[-]

MK_L@reddit

Sorry, this time im not tracking. What do you mean?

[-]

fractalcrust@reddit

DSv4-flash on the api actually felt really good to use, and has me windowshopping for 2x6000s. minimax 2.7 is retarded i couldn't do anything with it.

[-]

twack3r@reddit

It works in TP2 but it really shows that it’s meant to be DP4 across 4x6000.

And this is what I’m getting at FP8. Prefill is borderline for anything beyond 64k but generation holds up surprisingly well. So yay for that memory bandwidth and nay for the compute-equivalent to a B200 basically tapping out before ctx is large enough to do proper coding work.

Primary client-streaming results

64K context - Prefill probe: - Prompt: 37,882 tok - TTFT-content: 18.738 s - Prefill equivalent: 2,021.7 tok/s - Generation/decode workload: - Prompt: 37,871 tok - Completion: 512 tok - TTFT-content: 18.435 s - Decode TKS: 71.2 tok/s - TPOT: 14.1 ms - E2E: 25.616 s

128K context - Prefill probe: - Prompt: 75,710 tok - TTFT-content: 52.833 s - Prefill equivalent: 1,433.0 tok/s - Generation/decode workload: - Prompt: 75,696 tok - Completion: 512 tok - TTFT-content: 47.506 s - Decode TKS: 43.4 tok/s - TPOT: 23.0 ms - E2E: 59.269 s

256K context - Prefill probe: - Prompt: 151,365 tok - TTFT-content: 142.151 s - Prefill equivalent: 1,064.8 tok/s - Generation/decode workload: - Prompt: 151,356 tok - Completion: 512 tok - TTFT-content: 139.261 s - Decode TKS: 29.4 tok/s - TPOT: 34.0 ms - E2E: 156.653 s

Server log-window sanity

64K server prompt windows: ~3,787–3,788 tok/s
128K server prompt windows: ~7,569–7,570 tok/s
256K server prompt windows: ~15,135 tok/s
Decode log windows were noisier due vLLM 10s logging windows:
64K: up to 50.5 tok/s
128K: 20.4–30.8 tok/s
256K: 21.0–29.3 tok/s

[-]

KalonLabs@reddit

Well if it makes you feel any better, i have 2dgx sparks i set minimax m.7 up on and its also retarded for me. If i talk to it with no system prompt it kinda works, but if i give it a system prompt the. It starts generating gibberish in 7 languages at once 😭. Im gonna be switching it over to deepseek v4 flash. Should do 20-25tps.

[-]

jon23d@reddit

I get great work out of it! I run q6 on a Mac Studio 512 via opencode with a hefty system prompt.

[-]

Turbulent_Ad7096@reddit

Are you using vllm 0.20 or 0.19 with MiniMax? It has a known issue with 0.19 and has worked fine for me since updating.

[-]

KalonLabs@reddit

V0.20.2

[-]

StardockEngineer@reddit

What model are you running? I mean specifically. I have not come across this problem.

[-]

Technical-Earth-3254@reddit

So I'm not the only one who can't really get something appropriate out of M2.7. Imo Step 3.5 Flash was/is way superior and was the ~200GB champion until DS V4 Flash arrived.

[-]

zdy1995@reddit

Mistral-Medium-3.5

[-]

segmond@reddit

You got options

81G /home/seg/models/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat.gguf

117G /home/seg/models/GLM4.6V

122G /home/seg/models/Qwen3.5-122B-Q8

137G /home/seg/models/Devstral2-123B

140G /home/seg/models/MistralMedium3.5-128B

151G /home/seg/models/Step3.5-Flash

153G /llmzoo/models/DeepSeek-V4-Flash-Q4_X.gguf

184G /home/seg/models/MiniMax-M2.7-Q6

205G /home/seg/models/Qwen3.5-397B-Q4

227G /home/seg/models/MiniMax-M2.7-Q8

[-]

jon23d@reddit

I use Minimax m2.7 and love it

[-]

Technical-Earth-3254@reddit

Personally, I would go for DS V4 Flash. Didn't try it locally due to being GPU poor, but via API it's great. And native precision is around 200GB.

[-]

Ariquitaun@reddit

Seconded, not a great thinking model, but it's very effective if given a plan that's detailed enough.

[-]

Yorn2@reddit

Using lukealonso/MiniMax-M2.7-NVFP4 here with two RTX PROs and running it around 160 GB VRAM. I have plenty enough headroom to fit in a comfy instance and TTS this way, though I often find I prefer running another LLM (Qwen or Gemma) in the available space for testing/benchmarking.

[-]

annodomini@reddit

MiniMax M2.7 works out pretty nicely, it even works reasonably on my Strix Halo system at UD-IQ3_XXS, I'm sure it would be even better at a much less aggressive quant.

Other options might be Deepseek V4 Flash and Qwen 3.5 397B A17B.

[-]

DataGOGO@reddit

Minimax 2.5 / 2.7

[-]

Professional-Bear857@reddit

I'm using deepseek V4 flash with the 35b qwen model as an alternative, using around 200gb of vram. Otherwise a quant of qwen 397b or 122b or the older qwen 235b is pretty good.