Deepseek is the 4th most intelligent AI in the world.
Posted by Rare-Programmer-1747@reddit | LocalLLaMA | View on Reddit | 133 comments

And yes, that's Claude-4 all the way at the bottom.
I love DeepSeek.
I mean, look at the price to performance.
Arion_May@reddit
I tried a lot of them, but the actual intelligence, honesty, creativity and deduction skills of DeepSeek are unmatched by any other.
I think it has to do with the Chinese language; their way of communication is way more complex than ours, and that translates into a better AI on all fronts.
BackgroundResult@reddit
If you say so, DeepSeek changed the world more than anybody can imagine already: https://www.ai-supremacy.com/p/was-deepseek-such-a-big-deal-open-source-ai
Ecstatic-Bend-7183@reddit
The reason why Microsoft Copilot isn't on here: AI is smart, but Copilot kinda isn't! I told her to make a fun game by hiding the 🗿 emoji somewhere, and she did not put it there! So I asked what row; she said like 87 or something, so I looked in whatever row she said. None of the rows had 🗿 in it.
Live-Expression-3083@reddit
I'm going to write in Spanish because I'm from Hispanic America. That ranking seems absurd to me. I've been using ChatGPT Plus and its o3 model for two months and it's really garbage; instead of speeding up my work it slows it down. What gives me better results is Gemini 2.5 Pro, that one is a beast; it helps me a lot and genuinely does better work than o3. DeepSeek may be free, but it has many limitations for handling documentation. Now I've been trying Claude in the free version and it's really much better than ChatGPT's 4o version. I'm honestly disappointed with ChatGPT; the good part is its memory, but in everything else it's terrible. For now I'm trying Claude and Gemini 2.5 Pro and it's going better. I haven't completely ruled out ChatGPT, but for hard, heavy work it's very limited.
dreamingwell@reddit
This benchmark is garbage. Comparing models is hard, but this is boiled down to meaninglessness.
TheRealGentlefox@reddit
The general placements aren't the worst, but the Sonnet 4 placement makes it a joke. There is no world in which o3-mini, Qwen 3, and 2.5 Flash are significantly better than Sonnet 4.
obvithrowaway34434@reddit
Yes, there is. Sonnet 4 is only good at coding. It's pretty bad at math and general reasoning tasks. This is an aggregate of different kinds of benchmarks, not just one.
Unable-Piece-8216@reddit
It is not bad at math, bro's smoking crack.
TechExpert2910@reddit
you're wrong. Claude 4 Opus (thinking) has been better than o3 and 2.5 Pro in many of my non-coding tasks.
Lost_Effort_550@reddit
And worse in some. Even their own report card shows this.
Maleficent_Age1577@reddit
Can you give examples?
dubesor86@reddit
Just because coding was emphasized, doesn't mean it's "only good at coding". I run a personal benchmark and 85% of the tasks are completely unrelated to coding, and it performed very well - top 4 and similar level as Gemini 2.5 Pro Preview, GPT-4.5 Preview, Claude Opus 4.
Sometimes I wonder if people just write comments without having used the models at all.
Lost_Effort_550@reddit
Their own benchmark showed it doing worse than 3.7 by some metrics.
obvithrowaway34434@reddit
Your cherry picked questions are completely irrelevant when measuring general performance of models. Maybe learn how benchmarking works.
TheRealGentlefox@reddit
I like that they also said it's bad at "general reasoning" when it's #1 for reasoning on Livebench and #2 on yours xD
obvithrowaway34434@reddit
Lmao, do you not understand irony? They were talking about people not having used the actual model, and you're here quoting Livebench.
Rare-Programmer-1747@reddit (OP)
Yep, Claude 4 is made for coding and agentic tasks, just like OpenAI's Codex. If you haven't gotten it yet: you can give o3-pro or Gemini 2.5 a freaking X-ray result and they'll tell you what's wrong and what's fine in it. I mean, you can take pictures of a broken car, send them over, and they'll guide you like a professional mechanic. At the end of the day, Claude 4 is the best at coding and agentic tasks, not OVERALL.
Dead_Internet_Theory@reddit
Sonnet 4 is not that good, and crazy overpriced. Opus 4 is actually pretty good! But then again, even crazier overpriced.
Onnissiah@reddit
It also contains factually incorrect info. It states that Grok 3 has 1m context, while the official info is 0.1m.
Dead_Internet_Theory@reddit
What official info? https://x.ai/news/grok-3 says "With a context window of 1 million tokens"
Onnissiah@reddit
https://x.ai/api
martinerous@reddit
Right, I tested the latest DeepSeek R1 yesterday in my weird test case, and it was noticeably worse than Gemma3 27B. So, as always, we cannot rely on benchmarks alone; it depends on specific use cases.
SirRece@reddit
What benchmark though? There's no link or a title.
throwawayacc201711@reddit
Also being missed is the scaling. Look at how much slower Deepseek is relative to the others. It’s about 5x slower. The cost comes from scaling to many users.
Entubulated@reddit
So, right, economies of scale are either inverted or simply don't apply here?
Your insight will echo down through the ages.
Unable-Piece-8216@reddit
Sonnet 4 at bottom makes this dumb to look at and dumber to post.
Thin-Counter6786@reddit
How about qwen?
bucolucas@reddit
Cheaper than 2.5 Flash is insane
holchansg@reddit
That's all I care about. 2.5 Flash, DeepSeek, both are good enough for me. The models a year ago were already good; I rocked Sonnet 3.5 for months... Now I'm concerned about $/token.
Ok-Kaleidoscope5627@reddit
This. They've all reached the point where they can be decent coding assistants/rubber ducks. They can all also do a good job at general stuff like helping me write my emails, answer basic queries etc.
The only "value" the cutting edge models provide is if you're looking to hand off and trust the models to complete full tasks for you or implement entire features. In that sense some models are better than others. Some will give you a working solution on the first try. Others might take a few tries. The problem is that none of them are to the point where you can actually trust their outputs. One model being 10% or even 2x more trustworthy with its outputs isn't meaningful, because we need orders-of-magnitude improvements before we can begin trusting any of these models.
And anyone that thinks any of these models are reaching that point right now is likely ignorant of whatever subject they're having the LLM generate code for. I haven't gone a single coding session with any of the top models without spotting subtle but serious issues in their outputs. Stuff that if I caught once or twice in a code review, I wouldn't think twice, but if it was daily? I'd be looking at replacing that developer.
ctbanks@reddit
Have you interacted with the modern workforce?
Dead_Internet_Theory@reddit
What if DEI was a ploy to make LLMs seem really smart by comparison? 🤣
dubesor86@reddit
You can't really go purely by $/Mtok. This model uses a ton of tokens, so the real cost is slightly higher than Sonnet 4 or 4o.
boringcynicism@reddit
I don't know how you got there, the API is really cheap and even more so during off hours. Claude is like 10 times more expensive.
dubesor86@reddit
Because I record the cost of benchmarks, and on identical queries DeepSeek was more expensive. You cannot infer how cheap or expensive something is from $/Mtok if you don't also account for token verbosity.
E.g. Sonnet uses ~92k tokens, and for identical tasks DeepSeek-R1 0528 used ~730k tokens; the sheer token amount made it slightly more expensive. If it used the same number of tokens, yes, it would be much cheaper. But it does not.
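The arithmetic behind that point is easy to check. A minimal sketch using the token counts quoted above; the per-Mtok prices are illustrative assumptions, not figures from the thread:

```python
def run_cost(total_tokens: int, price_per_mtok: float) -> float:
    """Dollar cost of a benchmark run: tokens used times price per million tokens."""
    return total_tokens / 1_000_000 * price_per_mtok

# Token counts from the comment above; prices are hypothetical list prices.
sonnet_cost = run_cost(92_000, 15.00)    # ~92k tokens at an assumed $15/Mtok
deepseek_cost = run_cost(730_000, 2.19)  # ~730k tokens at an assumed $2.19/Mtok

print(f"Sonnet:   ${sonnet_cost:.2f}")   # Sonnet:   $1.38
print(f"DeepSeek: ${deepseek_cost:.2f}") # DeepSeek: $1.60
```

Under these assumed prices, a roughly 7x lower per-token price is wiped out by roughly 8x token verbosity, which is the commenter's point.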
boringcynicism@reddit
I think that just confirms my suspicion: your task must be light on input context to get those numbers. (As already said, I'm also looking at actual cost.)
TheRealGentlefox@reddit
It's like computing QWQ's costs. "Wow it's sooo cheap for the performance!" Yeah but...it's burning 20k tokens on the average coding question lol
GreenTreeAndBlueSky@reddit
In my experience that price only holds on their own servers. If you want your data to be more private, with providers outside of China (like DeepInfra), the price basically doubles. o4-mini and 2.5 Flash remain the best performance/price ratio outside of China. Sadly they are closed source, which means you can't run or distill them.
Bloated_Plaid@reddit
Why lie at all? It’s still cheap with openrouter that doesn’t route to China.
GreenTreeAndBlueSky@reddit
OpenRouter is a wrapper of API providers. I was choosing DeepInfra through OpenRouter, as it was the cheapest one I used at the time that wasn't provided by DeepSeek. I'd be very happy if you found some other provider that's cheaper, cause I'm looking for one.
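For what it's worth, OpenRouter lets you pin a specific upstream provider per request via the `provider` routing field, so requests never fall through to DeepSeek's own servers. A sketch of the request body; the model slug and provider name here are assumptions, so check OpenRouter's current docs before relying on them:

```python
import json

# Request body for OpenRouter's chat completions endpoint, pinning DeepInfra.
# Model slug and provider name are assumed values; verify against the docs.
payload = {
    "model": "deepseek/deepseek-r1-0528",
    "provider": {
        "order": ["DeepInfra"],    # try DeepInfra first
        "allow_fallbacks": False,  # fail rather than route to another provider
    },
    "messages": [{"role": "user", "content": "Hello"}],
}

# POST this to https://openrouter.ai/api/v1/chat/completions with an
# "Authorization: Bearer <OPENROUTER_API_KEY>" header.
print(json.dumps(payload, indent=2))
```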
Finanzamt_kommt@reddit
Chutes is free, though of course you pay with your prompts. Others are cheap as well though.
FunConversation7257@reddit
It’s free up to 200 prompts iirc though. How would anyone use that in prod?
Finanzamt_kommt@reddit
If you just use OpenRouter, you can set your own Chutes API key; then it's virtually unlimited, as far as I know.
FunConversation7257@reddit
Didn't know that the Chutes API is unlimited! Don't know how that's sustainable, but cool, learn something new every day. Though I presume they log inputs and outputs as well; not much of an issue depending on the type of device though.
RMCPhoto@reddit
I would also validate that the quality is just as good. Chutes may be running heavily quantized versions. Might be inconsistent.
kremlinhelpdesk@reddit
"In prod" could mean analyzing millions of chat messages per hour individually, or it could mean summarizing some documents on a weekly schedule. It says nothing about what volume you're going to need.
FunConversation7257@reddit
That's just pedantic, man. People know what I mean.
kremlinhelpdesk@reddit
So what you mean is, you can't get by with 50 prompts if your use case requires more than 50 prompts, which it might or might not do. That's very insightful.
GreenTreeAndBlueSky@reddit
Free doesn't really count though does it? Many models on this leaderboard are available for free provided you give their data to them.
Trollolo80@reddit
You think you're not giving data to subscription models or paid APIs?
GreenTreeAndBlueSky@reddit
It always depends on the terms of service of the provider. Usually most paid APIs are alright, but free ones save your data for training, even the very throttled ones.
Finanzamt_kommt@reddit
But not via the API.
Alone_Ad_6011@reddit
Is it really cheaper than 2.5 Flash? I heard they will increase the API price.
Fun_Cockroach9020@reddit
Lol no...
Tommonen@reddit
China bullshit
anshulsingh8326@reddit
It doesn't matter what is best on score board, people use what they love.
My friends always use ChatGPT, no matter how good Google and Claude are for their use cases. And it works for them.
RedditPolluter@reddit
You can't assess which model is best just by looking at one benchmark. If a model is consistently better across multiple benchmarks, that's a better indication, but even then a few points' difference isn't significant and doesn't necessarily translate into better everyday real-world usage, because some things are harder to benchmark than others.
PeanutButtaSoldier@reddit
Until you can ask deepseek about tiananmen square and get a straight answer I won't be using it.
Nekasus@reddit
You do get a straight answer of "I can't talk about that". No different to any other models "alignment" training.
TipApprehensive1050@reddit
This list is bullshit. WTF is "Artificial Analysis Intelligence Index"??
ThiccStorms@reddit
At this point this is all astroturfing.
mspaintshoops@reddit
This is a shitpost. Clickbait title, ragebait caption, zero methodology or explanation of the chart. Just a screenshot of a chart.
SirRece@reddit
It's a good way to find bots though tbh.
shokuninstudio@reddit
I'm the most intelligent AI in the world. Also the most efficient. I can run on RTX Coffee 9000. No VRAM needed.
Pleasant_Tree_1727@reddit
LOL, but after 21 years of parent fine-tuning (PFT) that cost 100k?
shokuninstudio@reddit
I fine-tuned them, and I paid for my own education.
CommunityTough1@reddit
Claude 4 is slightly underwhelming, but this list is trash. No way it belongs where they put it here. Worse than Qwen and Flash? Lol.
Robert__Sinclair@reddit
Gemini is way better than o3 and o4 overall. Used correctly, its million-token context is a superpower. I recently used prompts with around 800K tokens of context, and the results are mind-blowing and impossible to achieve with any other AI.
Joshsp87@reddit
Random question. But does anyone even use Grok?
sunshinecheung@reddit
I use it, but I don't even use Llama 4, lol.
cant-find-user-name@reddit
There is no way in hell claude 4 sonnet thinking is dumber than gemini 2.5 flash reasoning
ninadpathak@reddit
This. 100%
Claude 4 dumber than 2.5 is going too far lol
Daniel_H212@reddit
Probably dumber than 2.5 pro. Not dumber than 2.5 flash though
ninadpathak@reddit
Yep, I can't speak to the Pro since I haven't used it. But comparing Claude 4 with Flash 2.5 is way over the top.
Daniel_H212@reddit
2.5 pro is genuinely good. It's just annoying as all fuck and I hate using it.
nobody5050@reddit
Any tips on getting Gemini 2.5 pro to not hallucinate on larger, more complex tasks? All I use these days is anthropic models since they seem capable of actually checking their assumptions against the context
Daniel_H212@reddit
No clue, that's honestly just what I hate about it, it's so damn sure of itself that it never questions its own assumptions. Its initial judgements are usually more correct than any other model, but when it actually is wrong it will legit argue with you over it instead of questioning its own judgement.
jazir5@reddit
Try mocking it and see what happens
teaisprettydelicious@reddit
I just tell it i'm the ceo of google and don't want a bunch of nonsense
teaisprettydelicious@reddit
Ah, the classic love-hate relationship with a tool! It sounds like 2.5 Pro is a powerful beast, but one that occasionally bites the hand that feeds it. You're getting the job done, but at what cost to your peace of mind?
It's a common dilemma: when something delivers on performance but frustrates you at every turn, it makes you wonder if the results are worth the struggle. What parts of 2.5 Pro specifically grate on you? Is it the interface, certain workflows, or something else entirely?
a_beautiful_rhind@reddit
Honestly, pro, sonnet and deepseek are all pretty similar in abilities. Who gets edged out depends on what particular knowledge you need and if they trained on it. Deepseek is missing images tho.
Tim_Apple_938@reddit
Why?
cant-find-user-name@reddit
Because I use both of them regularly and I can clearly see the difference in their capabilities in day to day activities.
Tim_Apple_938@reddit
Care to provide some examples?
cosmicr@reddit
Nah
EffectiveLong@reddit
The chart is cooked lol
Tman1677@reddit
Any "intelligence" chart putting Claude at the bottom is genuinely just not a useful chart IMO. I haven't had the time to experiment with the latest version of R1 yet and I'm sure it's great, more a comment on whatever benchmark this is.
hashtagcakeboss@reddit
Claude should be lower than where it is.
Sad_Rub2074@reddit
Too many kinds of benchmarks and use cases to post anything like this. You have no idea what you're talking about.
angry_queef_master@reddit
These ratings never mean anything. After all this time, there still is no good way to rate these LLMs that reflects my actual usage.
yonsy_s_p@reddit
Why isn't Claude 4 Opus present?
deepsky88@reddit
How do they calculate "intelligence"?
Historical-Camera972@reddit
If you offer it a dime or a nickel, it doesn't take the nickel, because it's bigger.
deepsky88@reddit
Understood!
Cheesejaguar@reddit
32 tokens per second? Woof.
squareOfTwo@reddit
"intelligent"
VarioResearchx@reddit
0528 is free through chutes.
Let’s fucking go China! Force google, open ai, Claude to race to the bottom in costs!!
Cool_Abbreviations_9@reddit
people throwing words like intelligence so flippantly..
Icy-Yard6083@reddit
o4-mini is displayed at the top, while in my experience it's way worse than o3-mini and Claude 4.0. And Claude 4 is better than DeepSeek R1. Again, my experience, and I'm using different models daily, both online and local.
brucebay@reddit
Probably they couldn't afford Claude 4 Opus, and I don't blame them.
DrBearJ3w@reddit
Look son! A sparkling chart of LLM's. Someday we will see beyond the sparks!
DreamingInfraviolet@reddit
That doesn't match my experience at all. DeepSeek has a fun personality and is good at literature, but where facts and logic are concerned it makes frequent mistakes.
EliasMikon@reddit
I'm quite sure I'm way dumber than any of these. How do they compare to the most intelligent humans on this planet?
bluenote73@reddit
Do you know how many R's are in strawberry?
Shockbum@reddit
Deepseek R1 $0.96
Grok 3 mini $0.35
Llama Nemotron $0.90
Gemini 2.5 Flash $0.99
All Based
Historical-Camera972@reddit
Super based.
Rare-Programmer-1747@reddit (OP)
If you are wondering, Claude 4 Opus is even lower than Claude 4 Sonnet.
lorddumpy@reddit
well that invalidates this benchmark imo
starfries@reddit
Where is this chart?
DistributionOk2434@reddit
No way, it's worse than QwQ-32b
hotroaches4liferz@reddit
This is what I don't understand, like these benchmarks HAVE to be lying
das_war_ein_Befehl@reddit
Yeah these are bullshit. Qwq-32b is a good workhorse but they are not in the same class
Rare-Programmer-1747@reddit (OP)
Dam💀
RedZero76@reddit
Some of these benchmarks directly conflict with my experience in using them. They become more and more meaningless every month.
Yougetwhat@reddit
The DeepSeek community is like a sect. DeepSeek is not bad, but nothing close to Gemini, ChatGPT, Claude.
Tim_Apple_938@reddit
2.5 flash roughly same price / intelligence
But significantly faster, and the context window is roughly 10x
GOOG is unstoppable on all fronts
Shockbum@reddit
$0.96
Based
aitookmyj0b@reddit
If Claude 4 is lower than Gemini, this benchmark is useless to me.
My use case is primarily agentic code generation.
I don't know what kind of bullshit gemini has been doing lately, but the amount of spaghetti code it creates is simply embarrassing.
Is this the future of AI generated code -- very ugly but functional code?
Tman1677@reddit
Agreed. Most "emotional intelligence" benchmarks I've seen have ended up just being sycophancy tests. I'm not an Anthropic shill, but Claude should clearly be toward the top of the list.
Rare-Programmer-1747@reddit (OP)
It's an intelligence test (even emotional intelligence), not a coding test.
ianbryte@reddit
I understand that this is not purely a coding test, but one with several factors considered to measure intelligence. But can you link the page it's from in your post so we can explore it further? TY.
zxcshiro@reddit
Intelligence for what? What does it test?
The_GSingh@reddit
So specify that in your description? 🙄
jaxchang@reddit
What chart is that? Grok 3 mini is weirdly highly ranked.
FunConversation7257@reddit
I've had pretty good results with Grok 3 mini (high) when solving math and physics questions.
DistributionOk2434@reddit
Obviously, it's an intelligence test 🙄
bunkbail@reddit
which website is this from?
DeathToTheInternet@reddit
Guys, Claude 4 is at the bottom of every benchmark. DON'T USE IT.
Maybe that way I won't get so many rate-limit errors.
WormholeLife@reddit
I’ve found I only like models where I can access relatively recent information online.
Charuru@reddit
It's actually third because that's the old 2.5 Pro, which no longer exists. The May one is below it.
CodigoTrueno@reddit
What strikes me as sad is that Llama, save Nemotron, isn't on the list. Llama 4 sure has been a disappointment.
VegaKH@reddit
I really hate Grok 3 Mini and have never had good results with that model. Meanwhile Claude 4 (both Sonnet and Opus) are top tier. So the methodology they use is suspect to me.
But I still love the old R1 so I hope this update is as good as they say.
Embarrassed-Check802@reddit
The emergence of DeepSeek has also propelled the development of artificial intelligence as a whole. Consider this: if OpenAI could effortlessly dominate all competitors using GPT-4, would they still have motivation to develop better models? Moreover, it has ended the monopoly of closed-source models in the field.