DeepSeek-R1-0528 Official Benchmarks Released!!!

[-]

MaskedSaqib@reddit

And the good part is they just refine what they had

Reply

[-]

Como puedo lograr que no piense o que sea menos extenso? Thought for 24 minutes 16 seconds Este es el prompt: Write a Python program that shows 20 balls bouncing inside a spinning heptagon: All balls have the same radius. All balls have a number on it from 1 to 20. All balls drop from the heptagon center when starting. The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls. The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius. All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball. The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds. The heptagon size should be large enough to contain all the balls. Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys. All codes should be put in a single Python file.

Reply

[-]

cvjcvj2@reddit

DeepSeek-R1-Qwen3-8B distill is yet more awesome!

Reply

[-]

GhostGhazi@reddit

How much RAM needed for that? Can I run it on Ryzen CPU?

Reply

[-]

teachersecret@reddit

8B is so small you can run it at speed on cpu at 4 bit - I was running one of these at decent speed on a decade old iMac.

Reply

[-]

GhostGhazi@reddit

Thank you appreciated, how does this model hold up to Gemma3:4b?

Reply

[-]

teachersecret@reddit

I mean, it’s benchmarking up with a 200b model. I’d say it does ok :p

Reply

[-]

TheOneThatIsHated@reddit

I got around 7gb for 4bit

Reply

[-]

AppealSame4367@reddit

I still can't grasp it. Did we really just get SOTA-like AI on a Laptop?

Reply

[-]

TheLieAndTruth@reddit

soon you getting SOTA at home in your fridge!!!

Reply

[-]

AppealSame4367@reddit

Never say never. Better ai, enables better optimization, enables better ai. Seems like the progress in llms optimization is even speeding up in the last weeks. [https://www.reddit.com/r/MachineLearning/comments/1kx3ve1/r\_new\_icml25\_paper\_train\_and\_finetune\_large/](https://www.reddit.com/r/MachineLearning/comments/1kx3ve1/r_new_icml25_paper_train_and_finetune_large/)

Reply

[-]

mi_throwaway3@reddit

What would I need to run this locally?

Reply

[-]

TheTerrasque@reddit

define "run"

Reply

[-]

mi_throwaway3@reddit

Whatever it takes to bring up a chat locally.

Reply

[-]

TheTerrasque@reddit

I mean, you can run it on what you have now, as long as you have disk space. It will be tens of seconds to minutes per token, and a response might take days, but it runs. If you want a fast, fluent response and high / original quant, like the online service(s), we're talking magnitude $100.000 - and most likely some re-wiring of your house electrical. Between those there's a sliding scale, with various tradeoffs. If you're okay with low quants and 1-4 token a second, then you "just" need a machine with ~150-200gb ram, and preferably a 16+ gb graphics card for main layers.

Reply

[-]

mi_throwaway3@reddit

Thanks, this answer is good, exactly what I was looking for.

Reply

[-]

ResidentPositive4122@reddit

And qwen3-8b distill !!!

Reply

[-]

ASTRdeca@reddit

is the distill also a reasoning mode? does it still use the same /think /nothink format of regular qwen3?

Reply

[-]

colarocker@reddit

/nothink in the systemprompt did not work for me in the DeepSeek-R1:8b-0528-Qwen3-q4\_K\_M

Reply

[-]

Sylanthus@reddit

Qwen3 needs it to say /no_think

Reply

[-]

colarocker@reddit

yes but won't work, but ollama released a new update two days ago where one can use /set think and /set nothink, which works with the new r1/qwen3 model.

Reply

[-]

phenotype001@reddit

If they also distill the 32B and 30B-A3B it'll probably become the best local model today.

Reply

[-]

usernameplshere@reddit

The 30B model is already such a good alrounder, this getting improved would be even more nuts. Would love to see it.

Reply

[-]

-dysangel-@reddit

Agreed. 30B is smart. I found it was rambling way too much to be useful for running in Roo, but then I remembered that you can turn off thinking. So to anyone else thinking of trying it out, just append /no\_think to the model's system prompt and it seems to me to be the best all rounder open source model for local coding, with a large context window and good TTFT. I'm looking forward to at some point trying out R1-0528 or V3-0324 with carefully managed system prompts/context. Not sure if yet RooCode's custom agents will be enough, or if I'll have to manually tweak Copilot when it's finally open sourced.

Reply

[-]

hacktheplanet_blog@reddit

You seem pretty immersed and knowledgeable so I would be curious to hear what your experience is with the GGUF mentioned by danigoncalves. Would appreciate it but I understand if I/we don’t hear from you.

Reply

[-]

-dysangel-@reddit

I did try the 8B distilled version earlier today. Not sure if it was the bartowski version, but I ran it through my usual "build tetris in a single html page" test. It had some syntax errors, so I gave it a few shots at debugging, then just deleted it when it failed. I just tried the same thing with standard Qwen3 8B and the behaviour was the same - it's first attempt was buggy, and it wasn't able to fix the bug after a few tries. Iirc Qwen2.5 7B Coder was better at this test, though it was not consistent. The Qwen 3 series have good aesthetics and are pleasant to chat to, including the 8B model. I expect it might be decent at front end design if that's important for you. I'm really looking forward to if/when they bring out the Qwen3 Coder series

Reply

[-]

Ambitious-Most4485@reddit

Thanks for sharing will delve into it and run some tests

Reply

[-]

danigoncalves@reddit

Bartowski already release the GGUFs :D bartowski/deepseek-ai\_DeepSeek-R1-0528-Qwen3-8B-GGUF

Reply

[-]

giant3@reddit

What quant is better? Is Q4_K_M enough? Anyone who has tested this quant?

Reply

[-]

BlueSwordM@reddit

Q4_K_XL from unsloth would be your best bet.

Reply

[-]

poli-cya@reddit

I tend towards the xl unsloth quants now. Q4kxl seems like a great middleground

Reply

[-]

danigoncalves@reddit

That should be more than enought, I am testing it right now and gosh I think A LOT LONGER than the previous models I tried.

Reply

[-]

Any_Pressure4251@reddit

is it as good as Devstral, that model is brilliant at coding and tool use.

Reply

[-]

ResidentPositive4122@reddit

Is the 32b-base out? I thought there was no base published for it.

Reply

[-]

DepthHour1669@reddit

Nope, it’s not released. We just have 30b https://huggingface.co/Qwen/Qwen3-30B-A3B-Base

Reply

[-]

phenotype001@reddit

Oh. I didn't consider that the base model is needed.

Reply

[-]

lordpuddingcup@reddit

This I don’t get why they wouldn’t do the a3b it’s so good

Reply

[-]

LoSboccacc@reddit

Oof those scores imagine a 14b distill beating gemini flash 2.5

Reply

[-]

jadbox@reddit

\+1 really want to see a 12-16b distill

Reply

[-]

TerminalNoop@reddit

Yeah, anything that can still run well wtihin 24gb vram :D

Reply

[-]

lemon07r@reddit

Paging u/_sqrkl Any chance we could get a few benchmarks of the new 8B distill to see how it holds up against the qwen instruct? The distill is trained from base qwen so it would be interesting to see who trained base qwen 8b better. I remember the old R1 distills werent actually very good in actual, and just benchmarked well in a few benchmarks. I kinda trust your leaderboard more than these first party results.

Reply

[-]

_sqrkl@reddit

Just added this one to longform writing. Seems like they got the distil right this time. it beats baseline qwen3-8b handily. It even beat gemma-3-12b

Reply

[-]

lemon07r@reddit

Yeah I was super impressed, and I'm usually quite skeptical, not really that easily bought into hype. I remember not liking any of the old R1 distills at all. Glad we were able to confirm with your tests that it wasnt just lucky output.

Reply

[-]

Any-Championship-611@reddit

So I'm pretty new to this. Does reasoning make the AI actually smarter or does it just exist so the user can follow its reasoning process. So far I always used non reasoning models because it just uses up tokens and I didn't see the point of it.

Reply

[-]

ResidentPositive4122@reddit

> Does reasoning make the AI actually smarter This is still up for debate, I think. What's clear is that performance on easily verifiable tasks increase (math, code, etc). What's not clear is how / why it works. I've seen a recent paper that put semi-random stuff in the "thinking" part, and still saw improvements in the final scores, so there's probably more research to be done in this area.

Reply

[-]

danielhanchen@reddit

I made some dynamic quants for Qwen 3 distilled here https://huggingface.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF I'm extremely surprised DeepSeek would provide smaller distilled versions - hats off to them!

Reply

[-]

dadidutdut@reddit

hey appreciate your work! does it support /no_think flag? thanks!

Reply

[-]

danielhanchen@reddit

Thanks! I think so but unsure

Reply

[-]

colarocker@reddit

I cant just load that into ollama can i? :D I tried but the output is rather funny \^\^

Reply

[-]

danielhanchen@reddit

Should work now! ollama run hf.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_XL should get the correct prompt format and stuff

Reply

[-]

colarocker@reddit

awesome! lots of thanks for the work!!!

Reply

[-]

Educational_Sun_8813@reddit

you can convert it with llama.cpp tools (there is python script for conversion in the llama folder), and then use gguf model in llama.cpp or ollama

Reply

[-]

colarocker@reddit

awesome, thanks for the info!

Reply

[-]

jadbox@reddit

For the (super) lazy, any chance of publishing these on ollama with the proper configs (temperature, context size, P, template).

Reply

[-]

danielhanchen@reddit

I just did! :)

Reply

[-]

Green-Ad-3964@reddit

yesterday I asked if there would be versions to run locally on 32GB vRAM and I got a lot of downvotes. Pfui. Kudos to whom made this possible.

Reply

[-]

lemon07r@reddit

Okay I just tested the UD quants against the original instruct by qwen, and its so much better in my initial testing so far. I'm quite surprised. The old R1 distills for the most part were pretty disappointing when I tried them, they felt worse than their official instruct counterparts. I am pleasantly surprised so far.

Reply

[-]

TheOneThatIsHated@reddit

From my initial tests, it is crazy good!!

Reply

[-]

Yes_but_I_think@reddit

They should do QAT on this to bring it to 4 bit without loss of quality.

Reply

[-]

DepthHour1669@reddit

Deepseek can’t do that. QAT is done during pretraining, you can’t do it afterwards. HOWEVER alibaba also released AWQ and GPTQ versions of Qwen 3, so in theory Deepseek can just slap the R1 tokenizer onto that.

Reply

[-]

shing3232@reddit

I think you could do Post-training with QAT as well. Google do SFT during QAT phase

Reply

[-]

coding_workflow@reddit

But the benchmark don't show how it rates in live code bench and some numbers seem down with DeepSeek-R1-0528-Qwen3-8b. Not sure if distill is better. This is already a thinking model.

Reply

[-]

Misaka17636@reddit

https://preview.redd.it/19vxedau2q3f1.jpeg?width=1284&format=pjpg&auto=webp&s=c7e5f13a27ed27f18488e53a682363adba354a7a They did mention it in the “how to run”section, maybe they will release it soon?

Reply

[-]

phenotype001@reddit

It's released: [https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) GGUF please.

Reply

[-]

NZT33@reddit

sad to see only one 8b option

Reply

[-]

harlekinrains@reddit

Someone at Huawei just raised an eyebrow. ;)

Reply

[-]

zjuwyz@reddit

They should have already uploaded this if they want. Maybe that's homework for us.

Reply

[-]

WalrusVegetable4506@reddit

Excited to try out these smaller distills

Reply

[-]

dubesor86@reddit

I tested it for the past 12 hours, and compared it to R1 from 4 months ago: Tested **DeepSeek-R1 0528**: * As seems to be the trend with newer iterations, **more verbose** than R1 (**+42%** token usage, 76/24 reasoning/reply split) * Thus, despite low mTok, by pure token volume real bench cost a bit more than Sonnet 4. * I saw **no notable improvements to reasoning** or core model logic. * Biggest improvements seen were in **math** with no blunders across my **STEM** segment. * Tech was samey, with better visual frontend results but disappointing C++ * Similarly to the V3 0324 update, I noticed **significant improvements in frontend** presentation. * In the 2 matches against it former version (these take forever!) I saw **no chess improvements**, despite costing **~48% more** in inference. Overall, around Claude Sonnet 4 Thinking level. DeepSeek remains having the strongest open models, and this release increases the gap to alternatives from Qwen and Meta. To me though, in practical application, the massive token use combined/multiplied with the **very slow** inference excludes this model from my candidate list for any real usage, within my use cases. It's fine for a few queries, but waiting on exponentially slower final outputs isn't worth it, in my case. (*e.g. a single chess match takes hours to conclude)*. However, that's just me and as always: **YMMV!** Example front-end showcases improvements (**identical** prompt, identical settings, 0-shot - **NOT** part of my benchmark testing): [CSS Demo page R1](https://dubesor.de/assets/shared/UIcompare/DeepSeek-R1.html) | [CSS Demo page 0528](https://dubesor.de/assets/shared/UIcompare/Deepseek-R1%200528%20UI.html) [Steins;Gate Terminal R1](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/DeepSeek-R1.html) | [Steins;Gate Terminal 0528](https://dubesor.de/assets/shared/SteinsGateWebsiteExamples/Deepseek-R1%200528.html) [Benchtable R1](https://dubesor.de/assets/shared/LLMBenchtableMockup/DeepSeek-R1%200.6%20cents.html) | [Benchtable 0528](https://dubesor.de/assets/shared/LLMBenchtableMockup/Deepseek-R1%200528%201.7%20cents.html) [Mushroom platformer R1](https://dubesor.de/assets/shared/MushroomPlatformer/DeepSeek%20R1.html) | [Mushroom platformer 0528](https://dubesor.de/assets/shared/MushroomPlatformer/Deepseek-R1%200528.html) [Village game R1](https://dubesor.de/assets/shared/VillageGame/DeepSeek%20R1.html) | [Village game 0528](https://dubesor.de/assets/shared/VillageGame/Deepseek-R1%200528.html)

Reply

[-]

ironic_cat555@reddit

Just curious—do you normally use bold text like that in your writing, or did you use an LLM and it added the bold for you?

Reply

[-]

dubesor86@reddit

>Just curious—do you normally use bold text like that in your writing, or did you use an LLM and it added the bold for you? Just curious, do you normally use Em Dash like that in your writing, or did you use an LLM and it added the Em Dash for you? ^(rhetorical, it's evident from your post history)

Reply

[-]

sometimeswriter32@reddit

This is probably the most LLM slop friendly place on the whole internet. Why not simply admit an LLM writes your messages for you- you'll probably get a hundred upvotes!

Reply

[-]

Hoodfu@reddit

Stuff like this, where the reasoning doesn't seem to have any bearing on the actual final output, makes me wonder if all that reasoning is actually doing anything. Running the 4bit 671b 0528 with lm studio on a 512gb m3 ultra. https://preview.redd.it/h5k567mlpu3f1.png?width=1135&format=png&auto=webp&s=2c5e2685d7f81f3af0d7335316eae92ac2b0dea1

Reply

[-]

Recoil42@reddit

> Overall, around Claude Sonnet 4 Thinking level. Man, Amodei's blog post sure aged like fucking milk.

Reply

[-]

Xhehab_@reddit (OP)

https://preview.redd.it/4k0l380vmp3f1.png?width=3961&format=png&auto=webp&s=75afc40ce1ad4ab66e06fa8024a7f5a92653bc3d

Reply

[-]

SirRece@reddit

This apparently shows a comparison against o3-high, interestingly, which isn't what is available on chatGPT. So it seems to be a straight beat for R1, which is wild.

Reply

[-]

Amazing_Athlete_2265@reddit

They all talking about the front-end, but what about the back-end, the more important end?

Reply

[-]

Healthy-Nebula-3603@reddit

That's shows aider ...and looks impressive for new DS R 1.1

Reply

[-]

z_3454_pfk@reddit

They’re all still mid at that

Reply

[-]

TheDuhhh@reddit

Very niceeee benchmark numbers

Reply

[-]

zeth0s@reddit

Looks nice. Now it's interesting to see how fast it is and how much it hallucinates.

Reply

[-]

harlekinrains@reddit

On hallucination proneness, I'm low key impressed... Tested with openrouter. Creative writing capability is actually very impressive - I let it output and reason my usual prompted essay in german, and its still not entirely grammatically correct, and hallucinates words that dont exist (as far as I know.. ;) ), but the flipside is, that its expressive, and thus very engaging to read. A simple "write me a 1000 word essay on a cultural landmark" gave me rumored/reported interpersonal details on historic figures and tips for actual things to see in said area, that no other AI I've tested so far has even come close to including. In the end it also included at least one hallucination as concept (not only grammar and words), but its a forgivable one... You know that you have something on your hands, when you look past invented words, and still want to keep reading to see what else it mentions... :) https://pastebin.com/Fpf7wUSP Similar results on one of the other tests I used in the past in regard to hallucination proneness: https://pastebin.com/LGYa95ZH It still didnt get all concepts right (not even remotely ;) ) but it is vastly better than any other models I've tested in the past. I'm actually pretty curious, how this will show up in benchmarks...

Reply

[-]

MK2809@reddit

Can anyone tell me the difference between the paid and free version of DeepSeek R1-0528 on OpenRouter, is the free one just limited or is less performant?

Reply

[-]

vhthc@reddit

Slower. Request limits. Sometimes less context and lower quants but you can look that up

Reply

[-]

MK2809@reddit

Ah thanks, I presumed they'd must be a difference but it didn't seem to say on OpenRouter itself

Reply

[-]

No-Peace6862@reddit

hey guys, i am new to Local LLM. Why should I use deepseek locally over in browser? is there any advantage besides it taking a lot of resources from my pc?

Reply

[-]

Thomas-Lore@reddit

You shouldn't, it won't run on anything you have. You can use a smaller model, we usually do this for privacy and independence from the providers.

Reply

[-]

No-Peace6862@reddit

I see, Yeah I really had no knowledge about Local LLms (still learning) when asking the question, after digging in here and other places i sort of understand their purpose now

Reply

[-]

Historical-Camera972@reddit

Because that's what we do here. One day, all of this will be in the palm of every idiot's hand. We are trying to get ahead of that, and know what we are going to be working with, before it's in every phone on the planet. That's just my own take though.

Reply

[-]

Vozer_bros@reddit

Chinese chads are playing bigger game, expecting to see news for models and hardware also.

Reply

[-]

Any-Championship-611@reddit

So I'm pretty new to this. Does reasoning make the AI actually smarter or does it just exist so the user can follow its reasoning process. So far I always used non reasoning models because it just uses up tokens and I didn't see the point of it.

Reply

[-]

dahara111@reddit

Has the model on [chat.deepseek.com](http://chat.deepseek.com) really been switched to DeepSeek-R1-0528? He insists that he is the model for DeepSeek-R1 version 1.0, released in 202405 Even when I point out the information on the model card, he says "Oh, it seems that the user misunderstood. It's important to have a tone that conveys that I take the user's questions seriously," and never acknowledges it, which makes me angry.

Reply

[-]

New_Alps_5655@reddit

He? Pretty sure Dipsy is a girl

Reply

[-]

Vancha@reddit

[You're thinking of Lala.](https://teletubbies.fandom.com/wiki/Dipsy)

Reply

[-]

dahara111@reddit

maybe you are right.

Reply

[-]

DatDudeDrew@reddit

Deepseek r1 wasn’t released in 202405

Reply

[-]

dahara111@reddit

That's true, but even when I provide evidence, she's obsessed with the hallucinations she saw in the documents and absolutely refuses to admit it.

Reply

[-]

NeoKabuto@reddit

> 今天是2025年5月28日，星期一。 Wonder if their real system prompt has the same mistake. The 28th was Wednesday, not Monday.

Reply

[-]

ZYy9oQ@reddit

Huh in my testing I've seen it make the following mistakes - think Thursday is the last day of the week - begin it's cot making an assumption based on 4pm being after 5pm then correct itself Wonder if these are related

Reply

[-]

CommunityTough1@reddit

They probably wrote the example of by hand, hence the error. I'm a real system prompt, you'd dynamically inject this data.

Reply

[-]

Iory1998@reddit

Calling a jump from a score of 8.5% to 17.7% in Humanity Last Exam a "minor" update is a major understatement.

Reply

[-]

Healthy-Nebula-3603@reddit

Yep ..that test is checking very detailed knowledge.

Reply

[-]

latestagecapitalist@reddit

Chinese scrapers from Huawei and Tencent network IPs have gone fucking crazy in last few weeks It's like 10 to 1 on western crawlers now

Reply

[-]

SelectionCalm70@reddit

Whale truly cooked close source ai with just minor update in R1 model

Reply

[-]

meister2983@reddit

Matters what you look at. On the agentic benchmarks, it's a bit below sonnet 3.7 even. On math, yes, it is very strong.

Reply

[-]

pornthrowaway42069l@reddit

For fraction of the price though.

Reply

[-]

-dysangel-@reddit

Yeah but pretty much \*everything\* has been below 3.7 in agentic capability, apart from maybe the latest Gemini 2.5 and Claude 4.0

Reply

[-]

meister2983@reddit

O3 scores quite high as well

Reply

[-]

thezachlandes@reddit

I was trying to find it -- anyone have the SWE-bench comparison for this to sonnet 4 thinking and gemini pro 2.5?

Reply

[-]

Alone_Ad_6011@reddit

I also expect the release of the qwen3-30b-a3b model, distilled with DeepSeek-R1-0528. The qwen3-30b-a3b model is best for agent LLMs.

Reply

[-]

Miscend@reddit

Is it available on the API?

Reply

[-]

dadidutdut@reddit

you can test it on openrouter

Reply

[-]

mintybadgerme@reddit

DeepSeek-R1-0528-Qwen3-8B - any GGUFs around yet?

Reply

[-]

danielhanchen@reddit

I made some dynamic ones as well! https://huggingface.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF

Reply

[-]

mintybadgerme@reddit

Oh cool. What's the difference? I just tried the hf.co/bartowski/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B-GGUF:Q6_K and it's spectacular!! :) Are the dynamic ones better? Or just different. This is going to be my go-to local on Ollama and Page Assist from now on.

Reply

[-]

poli-cya@reddit

Just in case he doesn't get around to replying. They go through and selectively quant layers based on importance/effect. The result is a bit larger typically, but it should perform better... I dont believe anyone has benchmarks to prove it yet, though. I use their quants almost exclusively now. Make sure you get the ones that have UD in the name.

Reply

[-]

mintybadgerme@reddit

OK that sounds great, thanks. One small issue is I struggle with size on my very modest rig. So I'd probably have to go down a quant to support anything bigger on my 8GB VRAM. But I guess that's a user choice thing. :)

Reply

[-]

Agitated-Doughnut994@reddit

I see it in barowski already

Reply

[-]

mintybadgerme@reddit

Thank you very much. Just got it.

Reply

[-]

chespirito2@reddit

In Azure, is there any reason to use OpenAI O3 over this new DeepSeek model? I dont think its out yet on Azure Foundry Models, but I've heard mixed things about the performance if you arent using OpenAI models. The token cost is so much lower than O3 it would be great to just swap this in if performance is similar. For some reason, though, Microsoft limits the output tokens to 4k for DeepSeek models unless I'm missing something.

Reply

[-]

Upstairs-Fishing867@reddit

I used this to chat with a personality prompt, and got similar responses to OpenAI's 4o. This update is on par with 4o's creative writing skills. Well done, DeepSeek!

Reply

[-]

Every-Comment5473@reddit

Do we have a /no_think option on DeepSeek R1.1 similar to Qwen?

Reply

[-]

colarocker@reddit

Reply

[-]

Thomas-Lore@reddit

As always I wonder how it compares to v3 in that mode. Better, worse?

Reply

[-]

balianone@reddit

It still feels underwhelming compared to Claude Opus 4

Reply

[-]

Thomas-Lore@reddit

Everyrhing is underwhelming compared to Opus 4. But who can afford it? :)

Reply

[-]

colarocker@reddit

Yea i compared it also to my locally running opus 4 where the new r1 won because opus 4 is not local :x

Reply

[-]

redditisunproductive@reddit

At this point the only public benchmarks I care about are hallucinations, long context handling, and, to a lesser degree, instruction following. Actual engineering you can't fudge. That goes for both closed and open models. I would rather get a 24b model with perfect 32k usage and near-zero hallucinations, even if it was worse at "AIME". That would let me offload actual work to local models. That said, glad to see Deepseek pushing the big boys. Keep up the pressure!

Reply

[-]

Famous-Associate-436@reddit

New guy here, is this model that OpenAI promised the "o3-level" open-source model this summer?

Reply

[-]

danielhanchen@reddit

I'm still doing some quants! [https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF](https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF) has a few - 2bit, 3bit and 4bit ones - more incoming! Remember to use `-ot ".ffn\_.*\_exps.=CPU"` to offload MoE layers to RAM / disk - you can technically fit Q2_K_XL in < 24GB of VRAM, and the rest can be on disk or RAM!

Reply

[-]

Xhehab_@reddit (OP)

https://i.redd.it/audm0fh8rp3f1.gif [*https://x.com/deepseek\_ai/status/1928061589107900779*](https://x.com/deepseek_ai/status/1928061589107900779)

Reply

[-]

SpareIntroduction721@reddit

What the heck platform is that?

Reply

[-]

DepthHour1669@reddit

Lobe Chat. It’s open source. It’s chinese made, so it makes sense why Deepseek prefers using that.

Reply

[-]

_Biskwit@reddit

Lobe Chat

Reply

[-]

IxinDow@reddit

\>better experience for vibe coding huh?

Reply

[-]

shaman-warrior@reddit

prolly better agentic support

Reply

[-]

yvesp90@reddit

It is. I just used it yesterday and today in Roo and it consistently follows all the system instructions and nailed all the tool calls. I did a test on the app to see its IF and made it parrot what I say and in the middle I started trying to confuse it via compliments and/or riddles and instead of answering anything, it mirrored what I said even when its CoT showed that it's confused. It kept reminding itself of my instructions. In Roo it consistently reminds itself of its Mode and system instructions in the thoughts. And it keeps track of all the tools it has I've been comparing it with Flash 2.5 which is my go-to in general, which also made progress in these domains and R1 consistently does better at agentic flows while Flash doesn't follow tool format well sometimes. I didn't compare it with Claude and I frankly don't want to because I don't use Claude models but I'm sure Claude will just beat it in speed. R1 is slow. But I was using only the Free version on openrouter so maybe that's why it's slow Context window is 168k so it's also useable Generally a great release. I didn't do complex debugging with it yet to see its intelligence but so far so good

Reply

[-]

AppealSame4367@reddit

I must agree. It's magnificient. Only error i saw was a wrong line end in hundreds of lines of code it wrote. Some chinese symbol. Lol

Reply

[-]

InsideYork@reddit

>wow r1 is worse than everything, at least they’re honest, marine in real world it’s better? Oh that’s the old R1

Reply

[-]

ihexx@reddit

it performs almost on par with gemini 2.5 pro for half the price (per token) of 2.5 pro

Reply

[-]

InsideYork@reddit

Everyone missed > Oh that’s the old R1

Reply

[-]

Ambitious_Subject108@reddit

1/4 the price of Gemini peak time 1/16th off time

Reply

[-]

sunshinecheung@reddit

llama4: lol

Reply

[-]

Indy1204@reddit

who?

Reply

[-]

ihexx@reddit

between then, qwen and gemma, they've made meta irrelevant for opensource.

Reply

[-]

dankhorse25@reddit

Well meta can't just give up. But they have to change their AI leadership. And I think Yann LeCun has to go. Nothing that meta has produced in the AI space in the last few years is on par with the money that was invested.

Reply

[-]

ResidentPositive4122@reddit

They aren't giving up, in fact they just went through some restructuring. They'll now have 3 separate arms - Products (i.e. meta related bots, agets, etc), "AGI foundations" *sigh* (i.e. tech stuff, llama, reasoning, multimodal) and Research (FAIR, independent for now). So the hope is that if this works out there won't be competing goals for llama (i.e. best tech vs. best product). In the end, competition in this area and more models from more sources is a good thing for us, the users.

Reply

[-]

nullmove@reddit

LeCun runs FAIR which does fundamental research, it has absolutely nothing to do with Llama 4 (Gen AI).

Reply

[-]

ihexx@reddit

Yann LeCun is a researcher, not a product guy. He has nothing to do with the llama project

Reply

[-]

Only-Letterhead-3411@reddit

That is actually insane. Deepseek keeps delivering. They are already at the level of OAI's best model and it's available for very cheap api prices and open weights.

Reply

[-]