Real-world open source alternatives to the now defunct Opus 4.6?
Posted by MoistRecognition69@reddit | LocalLLaMA | 66 comments
I've had enough of Anthropic's shit. I'm paying for product A and every day it shifts from A to A-but-worse, or B dressed up as A, etc.
If hardware is not an issue, which open source model would you recommend I host as an alternative? (Please don't just quote benchmarks, they mean nothing. I'm talking about people who've had hands-on experience with model X and Opus and can compare the two. Anyone can train on the test set or infer similar samples in order to benchmax.)
Expensive-Paint-9490@reddit
Among the models I have tested that fit on my hardware (512GB RAM and 24GB VRAM), the best one is GLM-5.1 at Q4_K_M. Runner-up is Qwen3.5-397B-A17B at Q4_K_M; it's less smart but more than twice as fast, so I use one or the other depending on needs.
DeepSeek-V3.2 is not as smart, but it has a distinct personality that makes me use it regularly.
Kimi-2.5 and Trinity are less smart and very boring, so I have deleted them. I deleted Minimax-M2.7 as well because it is so censored it is ridiculous.
I have not tried Kimi-2.6.
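For reference, this is roughly how I fit them on 512GB RAM + 24GB VRAM with llama-cpp-python. The path, layer count, and context size are illustrative placeholders, tune them to your own setup:

```python
from llama_cpp import Llama

# Rough sketch: keep a few layers on the 24GB GPU, the rest in system RAM.
# Model path, n_gpu_layers, and n_ctx are placeholders for your own setup.
llm = Llama(
    model_path="/models/GLM-5.1-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,    # layers offloaded to VRAM; 0 = pure CPU
    n_ctx=32768,        # context window (the KV cache also eats memory)
    n_threads=16,       # CPU threads for the layers left in RAM
)
out = llm("Q: Explain MoE offloading in one paragraph. A:", max_tokens=128)
print(out["choices"][0]["text"])
```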
lukistellar@reddit
What performance do you get running models like these heavily offloaded to RAM?
ttkciar@reddit
Thank you for your testimony on this. Benchmarks indicate GLM-5.1 codegen competence is somewhere in between Opus and Sonnet. Does your experience bear that out, or is your impression that it lies above/below this range?
Expensive-Paint-9490@reddit
I am not currently using models for coding.
YehowaH@reddit
If hardware is not an issue, use Kimi K2.6; it's roughly equal to Opus 4.6. But this model needs around 600 GB of VRAM alone, without context. Yes, it's capable, but it's a big gamble because you probably need 800 GB to 1 TB of VRAM on current-gen B100s to run it fast.
At today's prices you can end up with $500k+ of hardware that might be obsolete in 2-3 years. It will be capable of running future big open-source LLMs, but at what speed? Given Nvidia's current support phases, I think it stays viable until 2030, plus maybe 5 years.
It's doable, but how much could you spend on subscriptions before you reach that $500k+?
Altruistic_Tension41@reddit
You can get last-gen DGXs for 50-100k that can run Kimi K2.5/2.6, so definitely not 500k+. Also, you can build an 8x96GB RTX 6000 server for a little less than 100k.
YehowaH@reddit
I haven't seen any DGX with that much VRAM priced that low.
Altruistic_Tension41@reddit
An 8x80GB A100 DGX goes for between 50-100k, homie
YehowaH@reddit
For performance you need NVFP4, which is exclusive to B100+. A single GPU is about $40k+, homie.
Altruistic_Tension41@reddit
You'll get 20-40+ tokens per second on 8x A100s, which is more than usable; batched, you can easily get 400+ tokens per second
YehowaH@reddit
Which quant? Q4? How much context?
Altruistic_Tension41@reddit
Kimi is natively INT4, and 64k context fits in VRAM. With all the new KV cache compression techniques, or just using system RAM as a cache of last resort, you could easily do the full 256k context, although obviously that would cut into the tg/s.
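Rough sketch of what that looks like with vLLM, if you're curious: fp8 KV cache to shrink the cache, CPU offload as the spill of last resort. The model ID and the numbers are placeholders, not a tested config:

```python
from vllm import LLM, SamplingParams

# Placeholder model ID and sizes - adjust for your actual checkpoint/cluster.
llm = LLM(
    model="moonshotai/Kimi-K2.6-INT4",  # hypothetical INT4 checkpoint
    tensor_parallel_size=8,             # 8x A100 80GB
    kv_cache_dtype="fp8",               # roughly halves KV cache memory vs fp16
    cpu_offload_gb=64,                  # spill part of the weights to system RAM
    max_model_len=262144,               # the full 256k context
)
out = llm.generate(["def quicksort(arr):"], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```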
MoistRecognition69@reddit (OP)
The hardware is already there, so there isn't anything 'new' to acquire.
I'll check out 2.6; if it's good I'll try to do the math and see how much we'd pay vs. lose.
Enterprise CC accounts are also hyper-expensive, but unlike hardware you pay for them per month and per seat (And last I checked the GPU gremlins are still asleep, and have yet to sneak into anyone's servers and silently downgrade the model they are running. Yet.)
AykutSek@reddit
qwen 3.6 27b for routine agent loops, kimi k2.6 if hardware allows. but the 80/20 here is context engineering, not model choice. capped mine at 30k and chunked tasks into smaller subloops. that did more for reliability than any model swap. every model degrades at long context, and at least with local you don't get the silent quality regression on top. chunking + a decent local model gets you most of the way.
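since people keep asking how: a bare-bones sketch of the capped subloop idea. call_llm and count_tokens are placeholders for whatever you actually host (vLLM, llama.cpp server, etc.), not a real library:

```python
MAX_TOKENS = 30_000  # hard context cap - past this, quality falls off a cliff

def run_task(goal, subtasks, call_llm, count_tokens):
    """run each subtask in a fresh, capped context instead of one giant session.
    call_llm and count_tokens are placeholders for your own stack."""
    notes = []  # short summaries carried between subloops, not raw transcripts
    for sub in subtasks:
        messages = [
            {"role": "system", "content": "You are a coding agent. Be terse."},
            {"role": "user", "content": f"Goal: {goal}\nNotes so far: {notes}\nSubtask: {sub}"},
        ]
        while count_tokens(messages) < MAX_TOKENS:
            reply = call_llm(messages)
            messages.append({"role": "assistant", "content": reply})
            if "SUBTASK DONE" in reply:
                break
            messages.append({"role": "user", "content": "Continue."})
        # keep only a compact summary so the next subloop starts near-empty
        notes.append(call_llm(messages + [
            {"role": "user", "content": "Summarize progress in 3 lines."}]))
    return notes
```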
ExpensiveKey4483@reddit
How did you do that? (If you don't mind me asking).
Do you run your own Ollama or did you host it on something?
Thanks in advance!
qudat@reddit
This is really interesting! How do you cap the model and make it not crash? I’m not familiar with the techniques here but I’ll definitely try it out
Ok_Warning2146@reddit
Minimax 2.5. It has a free API, so that's better than running it locally
RepulsiveRaisin7@reddit
Nothing actually compares to Opus. If you rephrase this question as "best open-weight coding model", I'd say GLM 5.1
MoistRecognition69@reddit (OP)
That was my first attempt, actually! However, I used it via their coding plan to test the waters before spending time setting up vLLM and a tunnel, and it was hallucinating everything.
Might have to give it a shot on Q8/Q4 instead of whatever it is they are serving under their API.
RepulsiveRaisin7@reddit
It used to be pretty bad on z.ai but it did get much better for me. Ollama Cloud is also good. But even at its best, GLM is below Sonnet in coding tasks.
Cute_Dragonfruit4738@reddit
Disagree on that; I use 5.1 regularly and it outperforms Sonnet. Definitely not outperforming Opus, but as of now this is the best open source coding model in my experience. This is after having tried the following:
Qwen3.6-Plus
Qwen3.6-27B
GLM5.1
GLM5-Turbo
Kimi K2.5
Kimi K2.6
Google Gemma 4 31b
Minimax2.7
DeepseekV3.2 (Yet to Try 4 as of 04/26)
Mistral-Small (horrendous)
Strong-Strike2001@reddit
So, in your experience, Qwen3.6 Plus is better than both Kimis, and Kimi K2.5 is better than K2.6? This is kind of weird (no offense) in my opinion
Cute_Dragonfruit4738@reddit
I didn't say that anywhere? I never indicated this as an in-order list.
Strong-Strike2001@reddit
Sorry, that was my assumption, my bad!
The downvote is kind of rude btw
RepulsiveRaisin7@reddit
That's wild to me, GLM is slow as fuck and messes up basic edits all the time.
Cute_Dragonfruit4738@reddit
Agree on the speed and even the reliability, but given that you can serve it locally, it's great for now (at least anecdotally). Lots of mixed reviews, and Z.AI's business practices don't help.
Lordaizen639@reddit
What about Kimi K2.6, Mimo V2.5 Pro, and DeepSeek V4 Pro?
portmanteaudition@reddit
Do you have $100k of hardware?
Strong-Strike2001@reddit
The question is about which model is the best, not which model you can run, because the comparison is to Opus
Lordaizen639@reddit
Nope, I said this based on the Artificial Intelligence Index website graph
Gab1159@reddit
Honest question: does it make sense to try GPT 5.5 with Codex? Is it as good as Opus, or are they benchmaxxing? I'm sick and tired of Anthropic as well, but jumping into Codex requires changing flows, skills, etc.
I know this is a local model sub...but since this is the topic.
Comfortable-Winter00@reddit
DeepSeek v4 Pro, if hardware is really not an issue.
Once you put together the costs for a system able to run it, you might decide that in fact hardware, or specifically the cost of purchasing and running the hardware, is in fact an issue.
Spoiler: You'll need at least $300k to run it at an acceptable speed for a coding agent.
kyr0x0@reddit
We will see; 2xH200 and NVLink could be enough with decent RAM.
Technical-Earth-3254@reddit
Kimi K2.6, Mimo V2.5 Pro (will be open weight soon), Deepseek V4 Pro
Its_Powerful_Bonus@reddit
If you're considering going on-prem, it would be wise to let us know whether the instance is for you only or for you and 50 other devs. What is the use case - programming, something else? How much context is required, and so on. For a one-man army on-prem it might be reasonable to go with 2x RTX 6000 Pro to run Minimax-M2.7 or Mimo V2.5 Pro. It is also possible to go with a much cheaper option - an RTX 5090 and Qwen 3.6 27B / Gemma 4 31B (with turbo quant), but the latter will have much slower TG.
CluePsychological937@reddit
"Now defunct opus 4.6..."
Tell me you haven't compared Opus 4.6 to other models in real world conditions without telling me.
TapAggressive9530@reddit
Nothing in the open source world comes close to Opus 4.6, 4.7, and GPT 5.4 (5.5). Not even in the ballpark for real-world, professional-quality programs. Yes for writing simple test apps, utilities (and that's a maybe), and prototypes - that's about it. I've tried every open source model on OpenRouter and they all score grades of D's and C's, and on occasion, depending on the test, a B-. Don't misunderstand - I'm a huge fan of open-weight LLMs and use them locally, but for knowledge. For real work, unfortunately, I have to use the big boys.
Disposable110@reddit
Wait for the new Deepseek to become widely available.
Otherwise GLM 5.1, Kimi, the largest Qwen model or the latest Gemma.
All of these models can do what Opus can do, but they need a lot more handholding and iteration: they don't get there in one prompt and will introduce many more bugs that you have to spot and tell them to fix.
ComplexType568@reddit
Gemma is definitely not in Opus's league; I'd say the closest is Kimi K2.6 or GLM 5.1
Disposable110@reddit
Qwen 27b / Gemma can produce the same results for the more complex tasks. (in my case I've got a benchmark to rebuild an old java game to modern C#/HTML/JS and build a modern UI for it). All of them can get there eventually, as can Gemini 3.1 pro. The shittier models just need WAY more supervision and back and forth to fix the bugs.
Linkpharm2@reddit
Qwen3.6 27B. 35B if you want to trade quality for speed.
VoiceApprehensive893@reddit
You're not getting Opus with these btw, but they're still very usable if you're not pushing your models to the limit
Unlucky-Message8866@reddit
frankly i'm pushing 3.6 27b quite far; with a proper harness+setup it does the job just fine, i don't miss opus anymore. i've been refactoring 25k+ loc of slop opus created over the past days and i'm super impressed; it replaces 95% of my use cases (task execution, retrieval/research).
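rough sketch of what i mean by harness - a loop where the model can only act through tools you control. call_llm is a placeholder for your local endpoint and the tools are deliberately dumb:

```python
import json
import subprocess

# Minimal tool-loop sketch. Real tools should sandbox and validate paths.
TOOLS = {
    "read_file": lambda path: open(path).read()[:4000],
    "run_tests": lambda _: subprocess.run(
        ["pytest", "-x", "-q"], capture_output=True, text=True
    ).stdout[-4000:],
}

def harness(goal, call_llm, max_steps=20):
    messages = [{"role": "user", "content":
                 goal + '\nReply with JSON: {"tool": ..., "arg": ...} or {"done": "summary"}'}]
    for _ in range(max_steps):
        raw = call_llm(messages)
        messages.append({"role": "assistant", "content": raw})
        action = json.loads(raw)
        if "done" in action:
            return action["done"]                         # model says it's finished
        result = TOOLS[action["tool"]](action["arg"])     # run the requested tool
        messages.append({"role": "user", "content": "Tool output:\n" + result})
    return "step budget exhausted"
```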
redmctrashface@reddit
What is a harness in this case?
Linkpharm2@reddit
But of course, ~2-10b is always going to be better than 27b
Linkpharm2@reddit
Trillion. Not billion.
andy_potato@reddit
I wonder why people keep repeating this nonsense
Kofeb@reddit
Really? It's like drugs: if you can get it "free" rather than from the cartel…
scythe000@reddit
Why is 27B higher quality than 35B?
NebulaBetter@reddit
27B is a dense model. 35B is MoE.
scythe000@reddit
I thought I had heard that! That’s a little confusing. What kind of quality difference are we talking about?
Gueleric@reddit
At a high level, an MoE model has only a fraction of its weights active at a time, specialized for certain tasks. So the model can be just as capable on specific tasks, but may be worse at tasks that require multiple types of knowledge at the same time.
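If it helps, here is a toy PyTorch sketch of the routing idea (sizes are made up): only top_k of the experts run for each token, which is why an MoE activates far fewer weights per token than a dense model of similar size:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy MoE layer: each token is routed to top_k of num_experts experts."""
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)   # scores experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():                       # only those tokens run expert e
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(10, 64)).shape)          # torch.Size([10, 64])
```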
Medium_Chemist_4032@reddit
I'm very happy with the 27
boutell@reddit
Sticking to what I have hands-on experience with, as you requested...
On an M2 MacBook Pro with 32GB RAM, Qwen 3.6 35B A3B (4-bit XS quant, capped at 128K context) can do genuinely useful coding work. However, it is definitely not "Opus smart." It can struggle to trace an issue through a large codebase; it is more successful with a smaller one. It was unable to solve a sneaky bug relating to geometry and CSS, but it did make good progress on a "MongoDB API implemented on top of a SQLite backend" adapter, meeting a lot of my requirements before I moved on to evaluating other options. (Opus nailed both of these previously, so they made good test cases.)
I'm now moving on to trying out Qwen 3.6 27B. I expect this to be a failure, either a straight-up failure due to RAM issues or a practical failure due to speed. But some suggest it is so much smarter than 35B A3B that it makes up for the slow speed, so I'm going to see if my RAM is sufficient.
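In case it's useful, my setup is roughly the following (mlx-lm; the repo ID is a placeholder, substitute whatever 4-bit MLX quant you actually download):

```python
# Rough sketch of the Mac setup with mlx-lm. The model repo is a placeholder -
# substitute the actual 4-bit MLX quant you use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.6-35B-A3B-4bit")  # hypothetical ID
prompt = "Implement a find_one(filter) method for a MongoDB-on-SQLite adapter."
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```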
So what does this mean for you...
* You could do what I'm doing, but with better performance and much more headroom for other activities on the machine, using an M5 Mac with 48GB RAM or more, or delving into graphics cards on Linux. With the right hardware you could run these models unquantized and with 256K context.
* You can evaluate what that would give you, before purchasing hardware, via cloud-hosted offerings of the models or by renting GPUs. I plan to do that myself.
* Based on the strength of what Qwen can do on my limited hardware, you could try their much larger models in the 3.5 series, again via the cloud before purchasing the expensive hardware needed to run them locally.
* You could wait for Qwen to release larger open models in the 3.6 series. It's significantly better than the 3 and 3.5 series so far, so I would expect any larger models they open-source in 3.6 to also be a big leap over their predecessors.
* You could try other options of course, my experience so far is almost entirely Qwen.
FusionCow@reddit
kimi k2.6
sanchita_1607@reddit
Tbh, for open source at Opus level, DeepSeek R1 is the closest I've used for reasoning-heavy stuff, and Qwen3 235B is greatly impressive for coding tasks... but the real win, if hardware isn't an issue, is running them through kilocode. BYOK means you're not locked into Anthropic ever again; you route to whatever model fits the task, and if one gets worse you just swap. No subscription drama.
ComplexType568@reddit
Both are super old models. V4 Pro is probably close once it gets the updates, and Qwen3.6 is coming soon.
tecneeq@reddit
If that is the case you might as well just pick a random one. I say Mistral 7b.
MoistRecognition69@reddit (OP)
lol.
Benchmarks have been poisoned by benchmaxxing for quite some time now - it isn't new. The only reliable benchmarks out there are independent ones that aren't released to the public, but that involves... well, creating a benchmark on your own w/ your own use case, which isn't something I have the time to do :(
That's why I asked for references from people who have used Opus AND x model, and are able to compare the two.
tmvr@reddit
You are going to need to put some numbers here, because hardware is always an issue. None of the models you can easily run locally are Opus quality, not even Sonnet 4.5 quality. Running the really big ones requires significant hardware investment, especially if you don't have any available already, thanks to the price of RAM.
No_Communication7072@reddit
Well, if you go by benchmarks or arena rankings, Gemma 4 and Sonnet 4.5 are neck and neck
tmvr@reddit
If you are deciding based on those then good luck to you and also, I have a bridge for sale if you are interested :D
No_Communication7072@reddit
Just recommend an independent benchmark or ranking that proves Sonnet 4.5 is that far ahead of the other models
mc_nu1ll@reddit
Opus is still good, and 4.6 is still available under the "More models" tab.
But if you wanna switch and don't mind the API cost: Kimi K2.6 (top-4 on AA, but won't call you out on your BS as often), or Qwen3.5-397B-A17B (weaker overall, but scores higher on bullshitbench; I personally don't like its tone though).
Local: seriously, qwen3.6-27b
YehowaH@reddit
That's a great answer. But until Qwen3.6 397B is out, I would go for Kimi K2.6.
sandykt@reddit
I am hosting Qwen 3.6 27B and it has very quickly become my daily driver in opencode. I would call this model the very manifestation of Chinese grit and perseverance. It simply doesn't give up, even when you give it a hard task way above its weight class.
If hardware is no issue, I would go with Kimi K2.6 or even the latest Deepseek V4 pro.