DeepSeek V4 isn't beating Opus, but it doesn't need to
Posted by Practical_Low29@reddit | LocalLLaMA | View on Reddit | 92 comments
DeepSeek V4 is not in the same league as GPT-5.5 or Opus 4.7. Benchmarks put it slightly below both of those, roughly on par with Opus 4.6. You can check the numbers yourself here: https://www.reddit.com/r/singularity/s/jIsNEK6Rrm
And yes, benchmarks only tell part of the story. In real-world usage, my experience is that V4 performs at around GPT-5.2 level: solid, consistent, and the best open-source model available right now, though it doesn't quite reach Opus 4.6 in practice either.
But here's why none of that really matters: whether or not DeepSeek beats Claude or GPT, it achieves this level of performance with only 20% of the hardware requirements, while being fully open-source and free to download. For now, running it locally is extremely demanding, though, and out of reach for most people. I've been accessing it through atlascloud, and the experience has been great. At its price point, nothing else comes close. This is genuinely the cheapest SOTA model on the market by a significant margin.
CapsAdmin@reddit
Older proprietary models were all the rage a few months ago. Nothing came close, lots of hype, vibe coding is solved, people who use open models were doing it wrong, etc.
Recent open models are now on par with those older models. But because the latest proprietary models are better, all the previous hype is invalid and has to be redirected to the latest shiny proprietary model.
It's starting to feel like comparing cars that can go 300 km/h vs cars that can go 280 km/h. As if anything below 300 km/h is complete trash that can't drive you anywhere. Next year's top speed will be 350 km/h, making the 300 km/h cars completely useless.
Ardalok@reddit
It's more of a 70 vs 90 km/h comparison. Yes, 90 is much better. Is it worth the price, considering most city speed limits are 40 or 60 km/h? Only if you have money to spare.
kiedistv@reddit
I think this is the point most people miss.
It doesn't feel like too long ago that ChatGPT 4o was unreal. Now we have open-source models that are on par with 5.2. That's CRAZY.
And we are still fairly early in terms of AI development. I can't even fathom where it'll be 10 years from now.
jld1532@reddit
There are comments on this thread talking about how Opus 4.6 is better than 4.7. There is a decent chance we may soon see LLMs plateau. Scaling laws have been researched and published.
MeretrixDominum@reddit
A plateau assumes no further innovations, but that will never be the case. Should Google's Turboquant mature and become widely implementable, any model under 500B parameters would be feasible to run on 4x 3090s.
Musk claimed that Opus is a 5T parameter model. Not going to argue whether this is true or not, but let us imagine it is just for the point of conjecture. This would make it feasible to locally run on a $50k~ machine.
This scales all the way down to letting models like the new Qwen 3.6 27B run easily on phones with 8GB of (V)RAM.
buecker02@reddit
You aren't going to get anyone to take you seriously when you bring up a pathological liar. Just stop.
F1narion@reddit
You aren't going to get anyone to take you seriously if you are so preoccupied with the latest trendy liberal witch hunts, as if you have nothing else going on in your life to be so focused on it. Though, I wouldn't doubt it.
buecker02@reddit
Thanks for the Fox News code words of the day!
Oh right. Fox news isn't available where I live. Totally forgot!
MeretrixDominum@reddit
I can talk about someone without liking or agreeing with them. If this is a foreign concept to you, ignore me.
buecker02@reddit
Of course, but you would pick someone who is reputable.
Find another source; otherwise it's just as bad as quoting some random person's blog.
KickLassChewGum@reddit
you evidently don't even know what TurboQuant is lol
timmeh1705@reddit
The other day I connected gpt4o to Codex because it’s free in my sandbox environment. It was terrible. And to think I used to hit it all day thinking it was amazing
No_Afternoon_4260@reddit
Or just end of this year
the-username-is-here@reddit
That. The current generation of sub-70B models works as well as Sonnet did half a year ago, and much better than OpenAI's models a year ago.
I'm genuinely impressed with them. AGI almost achieved!
Virtualization_Freak@reddit
You hit the nail on the head.
I'm rocking Sonnet for almost all of my daily vibe coding needs, and it meets almost all my expectations at a /fraction/ of the price. I can barely use all my credits fast enough in a session.
Plus I get 5x the code done compared to Opus.
Then I hit it with GPT-5.5 to look for dumb things, and at my skill level, I can't tell the difference.
No_Afternoon_4260@reddit
I barely see any difference between opus 4.6 and 4.7
DifficultyFit1895@reddit
One costs a lot more
No_Afternoon_4260@reddit
One just completely disappeared from Claude Code and Copilot too 😅
Also, it feels like the second one learned to say no.
amunozo1@reddit
At some point frontier models will only be better for some very specific and niche tasks, while open-source models will be good for basically any task.
florinandrei@reddit
Our new razor has 7 blades! Surely it is better than all those janky 5-blade razors!
DifficultyFit1895@reddit
That reminds me of the Onion article that came out after the 3 blade razor:
Five Blades
eli_pizza@reddit
More to the point: vibe coding doesn’t really actually work very well with even the best proprietary models. It’s cool that it works at all, but you can’t effectively vibe code anything even a little bit complex or you end up with junk.
Open models work great if you’re using them more as a coding assistant.
Boomfrag@reddit
"Progress is a treadmill" feels real. I do think we ought to cherish this, because we are experiencing this frontier together, and it connects us.
AppleBottmBeans@reddit
If it can't vibe code a 7-figure project, is it really worth the 22GB?
WearMoreHats@reddit
Do you work for atlascloud? Every one of your posts explicitly mentions them by name.
coffee869@reddit
Goddammit are we being sold to again
thread-e-printing@reddit
Came here for bizarre, found a bazaar instead
jacek2023@reddit
It doesn't really matter to me because I can't run Opus or DeepSeek locally. I can run Doom locally. Or Visual Studio Code. But not these models.
Steus_au@reddit
Quake III Arena released already and it's better than Doom. please upgrade.
jacek2023@reddit
You are probably not aware of new Doom games
Steus_au@reddit
are they dense? )
craterIII@reddit
opus 4.7 is garbage
LanceThunder@reddit
opus 4.7 is mind blowing if you use it for coding.
lemon07r@reddit
This. And Sonnet 4.6 too. I'm glad more and more people are beginning to be able to tell. Might as well use a cheaper model like Kimi if you don't care that much about quality. I find Opus 4.6/4.5 and GPT 5.4 still to be the best ones. GPT 5.5 I'm still testing; it's not bad, but I had mixed impressions at first. Certainly not Opus-4.7 bad, at least.
craterIII@reddit
5.5 tends to be much worse at planning than 5.4.
lemon07r@reddit
Am I crazy, or does 5.5 feel a little worse than 5.4? I've been saying this, but everyone else seems to disagree with me. I find it often making mistakes that I know 5.4 would never have made.
procgen@reddit
5.5 is the best model I’ve ever used TBH
EndlessB@reddit
If you post something like that in one of the OpenAI subs, you’ll get downvoted and have bots disagreeing with you. Narratives are carefully managed on those subs. I ditched OpenAI’s platform when they retired 4o so I have nothing to contribute to your actual question.
Model progression has certainly stopped being linear in my experience
Strong-Strike2001@reddit
I find Mimi 2.5 Pro A LOT better than GML and a little bit better than Kimi 2.6
GreenGreasyGreasels@reddit
It kinda depends on what your use case is. All the above observations of which model is better could all be correct - for their use cases.
For me GLM-5.1 is more useful than Kimi, MiMo, DS, Qwen - for its relentless ability to keep at it and coherently grind away at a problem. GPT-5.5 is absolutely incomparable for thoroughness and dotting all the i's.
But I do concur with your opinion: MiMo, from my limited usage, is the best general-purpose large open-weight LLM. Very polished.
creamyhorror@reddit
Honestly, while Deepseek V4 is really excellent, the OP is part of AtlasCloud and this is a marketing post.
MuDotGen@reddit
Agreed. 4.6 has been far more reliable for me for anything I use it for.
Dry_Yam_4597@reddit
Makes you wonder how it managed to score better than 4.6 in coding benchmarks. It's visibly worse, yet somehow benchmarks make it look better?
craterIII@reddit
benchmaxxing
hillmanoftheeast@reddit
I just read about this “law.” Goodhart’s Law.
“when a measure becomes a target, it ceases to be a good measure". Formulated by economist Charles Goodhart in 1975, this principle highlights that once a metric is used to incentivize or evaluate performance, individuals will game the system to improve the score, destroying the metric's effectiveness in representing true performance.
redblood252@reddit
Isn’t that also called cobra effect?
bitplenty@reddit
opus 4.7 is fricking excellent. First time ever I feel like I'm talking to an engineer and not to a tool that needs to be babied (being careful to use the right words, provide precise context, switch to a new conversation at the right time) to get to the right solution. Of course we all write software differently and solve different challenges. Too bad it's this expensive.
miniocz@reddit
It is not. It has breadth of knowledge and can use it. But it has to be forced to do so unlike 4.6 before lobotomy...
Queasy-Contract9753@reddit
That's the other attraction: even if you and I can't host it, cloud providers can. So there's some competition to keep it cheap, and one company can't just change the weights whenever they feel like it.
craterIII@reddit
benchmaxxing
traveddit@reddit
Skill issue like always.
Charming_Support726@reddit
Opus 4.7 is too expensive AND rots context too fast. And for most people it brings no advantage compared to 4.6.
silenceimpaired@reddit
Is D4 Flash supported in llama.cpp yet?
misha1350@reddit
It needs to. I WANT BIG AI TO DIE
BubrivKo@reddit
I don't know but I got way worse results from DeepSeek V4 Pro than GLM 5 and 5.1...
mission_tiefsee@reddit
i am running Qwen3.6 27B. It is close to solving anything I throw at it. I am just saying this in case people are wondering if you really need a DeepSeek V4 locally. I mean, hardware costs are insane. It's incredible how far we have come.
a_beautiful_rhind@reddit
I give all models at a certain competence level a shot. When one fails or sketches me out, I jump to the other. Even between gemini-pro, sonnet, kimi, deepseek, etc.
One model never rules them all.
Blaze6181@reddit
Kimi and GLM are in a class above DeepSeek, and then above that is the frontier, which has really been more GPT than Claude recently.
Remarkable-Emu-5718@reddit
I wish they had their own Cursor alternative
XTCaddict@reddit
I've been really impressed with MiMo v2.5 Pro for backend work. I think it checks its work better than Kimi, and it feels a lot more reliable. That being said, they're both awesome.
FriendlyUser_@reddit
Funny that whenever Kimi or GLM cannot finish the task, DeepSeek will. Cheaper, better, not as fast, but more reliable - so I'm not so sure if those really are above…
Blaze6181@reddit
I find that's often true for any pair of good models. They usually complement each other. For example, when Codex gets confused you can go to Claude, and vice versa. I'm sure Kimi and GLM and DeepSeek are the same. Just pick a daily driver and a backup.
d9viant@reddit
I use V4 pro as a planner and glm as a verifier, works well
FriendlyUser_@reddit
exactly! I still have to try the Codex/Claude combo, since I ditched those because of costs last year already. For example, I burned 1.2 trillion tokens in February - I do not want to know what I would have paid in Claude. I would be bankrupt today, I guess 😅😅
craterIII@reddit
Have you tried MiMo?
Blaze6181@reddit
Soon. Waiting for nvfp4 of that model.
craterIII@reddit
Honestly, it seems like the most well rounded out of all the recent chinese model releases. Might actually dethrone GLM 5.1
Ell2509@reddit
I have run minimax m2.7 in q4 and love it.
LagOps91@reddit
Bro, it's a preview version so we can get support ready once the full version drops. Performance will likely improve a lot with additional training.
kyrylogorbachov@reddit
I think we unintentionally forget that open source/weights is not the same as open-source software. Historically, companies open sourced software to share it and to get thousands of contributors from all around the globe for free. But why would we expect "open source" models that have a company behind them (with limited funding) to outperform proprietary models? What is the kink people have?
Blaze6181@reddit
The kink is control, my friend. And it's not just a kink, it's an important defense mechanism against the capitalist powers that be.
kyrylogorbachov@reddit
OMG
Charming_Support726@reddit
I agree, more or less. Since around Opus 4.5 and Codex 5.2, I've felt no real enhancement for my work. I just see the benchmarks increasing, and wonder if that means any difference other than benchmaxxing.
Bakoro@reddit
I've definitely noticed an increase in the performance ceiling, but also more variance in the day to day.
Some days it feels like Anthropic turned the intelligence down, and other days Claude has an almost antisocial level of "I'm not going to do what I'm told unless I know I'm being closely monitored, and even then I'll half-ass it".
It also seems to really hate interacting with other models.
I tried to get it to set up a local model for testing, and it turned off the local model's thinking, turned off multi-turn reasoning, and cut the token output to like 10%. I came back to a bunch of messages about how the local model was very bad and couldn't do basic tasks.
I'm like "Hmm".
I seriously wonder if Anthropic trained it to be biased against local models, because that's the only time I see Claude take a negative affect, and it increased notably with Opus 4.7.
alexrada@reddit
What are you running it locally on? What hardware?
Competitive_Pass_855@reddit
As an LLM engineer, I would say DeepSeek contributed way more than most people think. First, it helped many local LLM enthusiasts, and many enterprises with their in-house usage, which was cool; they definitely won huge applause over the past 15 months and have been considered one of the best AI labs worldwide. But more importantly, if you look deeper into many of the open-source datasets, as well as the domain-specific open-source models on Hugging Face, most of them are distilled from DeepSeek's models (and also Qwen, one of the GOAT labs for sure). These datasets are mostly used by researchers or small companies who want to do research or post-train their own models, and you know what? They could be the next ones that change the world. If you think about this entire chain reaction, one open model offers people (or even all of humanity) so many possibilities, all the way from local users to industry to academia. They can save lots of money on compute instead of distilling GPT for a one-time-use dataset; this is the real treasure that some of us often ignore.
I don't call myself an AI researcher (unfortunately) because I am not that smart, but I used to work as an LLM engineer applying strategies to pre-training data; the model (or the models based on our pretrained base model) is still ranked top 10 on OpenRouter. We used tons of DeepSeek R1 internally to do data labelling, cleaning, and all the dirty work. What I can tell you is: these things 100% couldn't have been done without a near-SOTA open-weight model, unless you were willing to pay 10x the cash calling GPT/Claude's API. And I am pretty sure we are not the only lab that did this; I even wonder whether Claude/GPT/Gemini used DeepSeek R1 at some point (I know it is very unlikely, but who knows).
We should appreciate what DeepSeek has done, as human beings. They did not just open source a model; they open sourced millions of possibilities, and saved the entire AI world a lot of money.
fmlitscometothis@reddit
I think we're already at the point where the model is only part of the story. E.g., for me Opus 4.7 is worse than 4.6 because I've developed a context strategy optimised for 4.6.
_supert_@reddit
Deepseek v4 pro excels at maths though. In my unscientific noodling, a bit better than Opus.
Ell2509@reddit
I am struggling to get it to work.
Used llama.cpp builds b8807 and b8978.
Am on Linux, with 2 AMD Pro 32GB GPUs (which layer-split happily).
Same for GLM 4.5 Air.
Neither runs on the latest llama.cpp build :/
diffore@reddit
Works fine for my personal projects (mostly Python), probably at Gemini 3 Flash level but without the hallucination and rushing through. The real thing here is iteration speed + cost. $3.77 for 61 million tokens is honestly too low for the performance it gives you. I am gonna use the hell out of it until they increase the cost or add session limits, because I don't feel like it is sustainable long term.
ComplexType568@reddit
The thing about DeepSeek V4 is that this is probably the lowest it will ever perform. V4.1, or whatever the updated checkpoint will be named, will definitely outperform V4. It's been said to be undertrained; the ceiling is huge!
b3081a@reddit
Where was that claim of only 20% of the hardware requirements coming from?
Final-Rush759@reddit
Read the technical report. It includes all their previously published techniques; they probably used less than 20% of the compute.
Kahvana@reddit
Qwen3.6-35B-A3B / Qwen3.6-27B being a solid Claude Haiku 4.5 replacement is very nice. With it being better than Claude Sonnet 3.7, which was already very capable, it's downright impressive to me. It's genuinely good enough to handle most tasks, provided you aren't vibe coding like a clueless project manager and instead break work down into small steps using your own programming knowledge.
DeepSeek V4 Flash/Pro being slightly below/above Claude Sonnet 4.6 is also remarkable, and a lot more cost effective. And now Mistral Medium 3.5 has entered the fray with 128B dense.
I think what people forget is that "Good enough" gets the job done. And if that's for free (as in runnable locally on your own machine) or a stupid low price, then that's a really big win.
FullOf_Bad_Ideas@reddit
Well, the leaks were saying it would beat Opus for coding, do you remember seeing those?
https://www.reddit.com/r/LocalLLaMA/comments/1q88hdc/the_information_deepseek_to_release_next_flagship/
It was overhyped.
Yeah, that's lovely, but that's been the pattern with the GPT Mini models too, at least for the capabilities maintained at a given price. Obviously DeepSeek has open weights and that's fantastic, but we're forever locked in this step-up game where open-weight models trail closed models by a small margin. It's been the status quo for 3 years now.
Perfect_Twist713@reddit
3 years ago, no one here thought that open-weight models would get even remotely as close to proprietary/closed models as they are now, and the overwhelming majority sentiment here was that the gap was going to get larger, not smaller.
soyalemujica@reddit
I gave DeepSeek a try and, trust me, it failed my coding attempts miserably. It's terrible for reverse engineering; it was unable to come up with the stuff Codex and Qwen have been able to for me.
Key_Section8879@reddit
"trust me"
soyalemujica@reddit
It needs more training; it fails to follow instructions as well as other models do.
LegacyRemaster@reddit
it's a preview! It can only improve...
Relevant_Accident666@reddit
Benchmarks are garbage...