DeepSeek-r1-0528 in top 5 on new SciArena benchmark, the ONLY open-source model
Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 42 comments

Post: https://allenai.org/blog/sciarena
Allen AI puts out good work and contributes heavily to open source; I am a big fan of Nathan Lambert.
They just released this scientific literature research benchmark, and DeepSeek-r1-0528 is the only open-source model in the top 5, sharing the pie with the likes of OpenAI's o3, Claude 4 Opus, and Gemini 2.5 Pro.
I like to trash DeepSeek here, but not anymore. This level of performance is just insane.
artisticMink@reddit
You have to appreciate the arbitrary score. And community voting is in there, though. Totally didn't pull that one out of their ass to make the news.
SomeOddCodeGuy@reddit
It really is a killer model. I managed to squish the q5_K_M on my M3 Ultra, and it's a bit slow but the responses leave anything else I can run locally in the dust when it comes to almost every task I've tried.
I pretty much had to rebuild all of my workflows because it's now the center of everything I do lol
texasdude11@reddit
When you say slow, how slow is it? Can you please quantify it?
What prompt eval time are you seeing, and what generation time?
SomeOddCodeGuy@reddit
Ask and you shall receive.
texasdude11@reddit
That's q4km, right? How bad is it on q4km? Have you played with the -b and -ub parameters in llama.cpp to increase the pp tk/s speed?
SomeOddCodeGuy@reddit
Yea, after I played with MLA a bit I realized I could bump up to Q5_K_M after this test, so I did. It barely fits, but it does fit lol.
Is that the batch size? I haven't played with it in llama.cpp, but I did in KoboldCpp, and on the Mac it made no discernible speed difference at 512+, though it did get worse below 512. That said, Kobold didn't have ubatch, so I haven't tried that one, and it may implement batching differently from llama.cpp, so I'll definitely give it a try.
texasdude11@reddit
Yes, -b is the batch size. Increasing it will help with prompt processing, but it increases the memory requirements, at least on Nvidia GPUs.
Have you been able to cross 15 tk/s generation speed?
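If you'd rather experiment with the same knobs from Python than from raw CLI flags, here's a minimal sketch using llama-cpp-python, which exposes -b and -ub as n_batch and n_ubatch (the model path is a placeholder, and n_ubatch needs a reasonably recent build):
```python
from llama_cpp import Llama

# Placeholder GGUF path; point this at your actual DeepSeek R1-0528 quant.
llm = Llama(
    model_path="DeepSeek-R1-0528-Q4_K_M.gguf",
    n_ctx=16384,
    n_gpu_layers=-1,   # offload as many layers as will fit
    n_batch=2048,      # logical batch (-b): larger tends to speed up prompt processing
    n_ubatch=512,      # physical micro-batch (-ub): raising it also raises memory use
)

out = llm("Summarize multi-head latent attention in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```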
SomeOddCodeGuy@reddit
I'm afraid not. 12t/s generation is as fast as I've gotten with this model and a regular-sized prompt in MLX. Llama.cpp is far slower, at around 3-5t/s generation.
Right now I'm in llama.cpp for q5, but someone told me earlier today that MLX now has a 5-bit version, so I may try it out. The only thing I'm not sure about is whether MLX supports MLA; without it, I can't fit 5bpw, because the kv cache gets huge with this model.
json12@reddit
What’s MLA?
Turbulent_Pin7635@reddit
Can you link this version, pls?
Affectionate-Cap-600@reddit
the difference between R1 and r1-0528 is impressive.
looking at the whole leaderboard... llama 4 Maverick is quite embarrassing. o4-mini has a really good score for the price, and gpt-4.1 mini has an interesting score for a small non-reasoning model (still, idk how small it actually is).
I'm really disappointed by gemini-2.5-flash... I would have expected it above QwQ and Qwen3 32B.
happy to see MiniMax M1 on the leaderboard; it's the only 'hybrid transformer' listed.
entsnack@reddit (OP)
> the difference between R1 and r1-0528 is impressive.
Yeah clearly 0528 wasn't just an incremental update.
SilentLennie@reddit
It really is though: it's just an update with improved tool handling, etc.
Have a look at the scores:
R1 is based on V3.
V3 got a big boost with the newer V3 checkpoint, and R1 got a very similar boost with the new R1.
To me that means they hugely improved V3, then ran a similar process on top of it for the new R1, and thus it got the same boost.
pigeon57434@reddit
incremental in AI land actually means like 10 years worth of progress and absolutely groundbreaking in anything else
llmentry@reddit
Based on inference costs, it's likely about 5x smaller than GPT-4.1.
I can also vouch for 4.1-mini's abilities. It punches so far above its weight that I initially wondered if it was simply a quantised version of 4.1.
Other than R1-0528, the other interesting performer on that table is o4-mini. I should probably use that one more, by the looks of it. Someone needs to do the output cost / ELO point comparison of these data.
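In case anyone wants to run that comparison, here's a back-of-the-envelope sketch; the prices and scores below are placeholders, not numbers from the SciArena leaderboard or any provider's price sheet:
```python
# Placeholder entries: model name -> (USD per 1M output tokens, arena score).
models = {
    "model-a": (8.00, 1150),
    "model-b": (2.40, 1120),
    "model-c": (0.60, 1100),
}

baseline = min(score for _, score in models.values())
for name, (price, score) in models.items():
    points = max(score - baseline, 1)  # Elo points above the weakest entry
    print(f"{name}: ${price / points:.3f} per 1M output tokens per Elo point gained")
```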
I'm also surprised by Gemini 2.5 Flash's poor performance. I've not experienced this, and I've been using the 2.5 Flash preview a fair bit (as it's cheap as chips); it seems way better than the Qwen models IME.
It would be useful if Ai2 collected and weighted results by academic qualifications / position. I do wonder who is assessing these battles, as you need assessors with expert knowledge for these scores to count. I just tried out a rating battle on the site, and it was completely open to anyone. I'd have thought at the very least they'd require users with an academic institutional email address to log in prior to testing. And weighting results even by self-reported qualifications would be sensible. There is a danger that model confidence and vibe could bias outcomes otherwise.
pier4r@reddit
I see a problem with this if they let the community pose questions. As with lmarena (which is good, if one takes care of its limitations), people may become the bottleneck by asking simple or silly questions, or they may judge things differently.
It would be good (I didn't see this mentioned in the article) for questions to get a screening pass that only lets valid scientific questions through. Otherwise the arena would likely inherit the same problems as lmarena.
Kamimashita@reddit
I find that R1 spends too much time thinking, and the thinking is too verbose and often rambling. Tbf other reasoning models might be the same, but judging by the time to first non-thinking token, they think a lot less.
Turbulent_Pin7635@reddit
The brat is angry for sure! The kid really is fierce!
Even though I use o3 regularly, whenever I need a polished version, only my local R1 does it well. =)
Sorry_Ad191@reddit
It's been a while since V3-0324. I wonder if the chefs are cooking.
robberviet@reddit
As always. Just curious, does anyone really use the DeepSeek model? For me it seems too slow to be practical.
IrisColt@reddit
I don't use it. Can't.
robberviet@reddit
I can't either! Us peasants barely manage to run 30B models; I'm sticking with Qwen3 30B at the moment.
For the true R1 (not a distill), my only way to try it is still via the OpenRouter free API, so no real usage given the limit.
createthiscom@reddit
I've been testing V3-0324 vs R1-0528 for agentic purposes pretty intensely for the past couple of weeks. I've come to the conclusion that R1-0528 is the clever nerd who does what he wants. V3-0324 is a soldier who follows orders, but isn't particularly clever.
I still prefer V3-0324 when I just want the model to do what I tell it to do, faithfully. However, I've started giving harder problems to R1-0528 when I don't particularly care how the problem is solved and I just need a solution.
I've tried giving orders to R1-0528 and it will do some of the things I ask, but just ignore some of them too. I think of it like a particularly clever software engineer. You have to pique its curiosity.
amranu@reddit
I find getting Deepseek v3 to use file write tools is like pulling teeth, but maybe I'm just having bad luck
ThisWillPass@reddit
What quants?
amranu@reddit
The full model through the deepseek api, I don't have the hardware to run a local version really.
createthiscom@reddit
Works pretty well for me with open hands. I run Q4. I even use an MCP server for file editing: https://github.com/createthis/diffcalculia_mcp
SomeOddCodeGuy@reddit
I've noticed that, especially for reviewing things, both of them, and Qwen3, have very different perspectives, though R1 does a great job consistently.
When I ask them to code review something:
When I give them Qwen3's panic attac- er, I mean, code review from the first test to peer review:
IrisColt@reddit
Thanks for the insight!
createthiscom@reddit
mmmm. I need to start using R1 for code reviews. Smart.
ThePixelHunter@reddit
R1 can become distracted by its own thinking chains. I bet if you prefilled
<think></think>
to skip its reasoning phase, you'd get performance better than V3 without going off track as often.
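A rough sketch of that prefill trick, assuming a local OpenAI-compatible server that supports continuing a prefilled assistant turn (vLLM's continue_final_message is used here; the URL and model id are placeholders, and other backends may need a different mechanism):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder endpoint

resp = client.chat.completions.create(
    model="deepseek-r1-0528",  # placeholder model id
    messages=[
        {"role": "user", "content": "Refactor this function to avoid the extra copy."},
        # Prefill an empty think block so the model skips its reasoning phase.
        {"role": "assistant", "content": "<think>\n\n</think>\n\n"},
    ],
    extra_body={"add_generation_prompt": False, "continue_final_message": True},
    max_tokens=512,
)
print(resp.choices[0].message.content)
```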
Affectionate-Cap-600@reddit
there is a merge of v3 and R1 that, according to the release paper, seems to make the reasoning more concise and less chaotic without hurting performance too much
Lissanro@reddit
If you mean R1T, it used to be my daily driver for a while; even though it was weaker at reasoning than the old R1, it was much better than raw V3.
However, I find the new R1 much better, and it has been my favorite local model since its release (I run the IQ4_K_M quant).
Yes_but_I_think@reddit
You have RAM (512GB) but no disk space?!
createthiscom@reddit
It's not really that I don't have disk space, it's more that I want to see what the cheaper offshore models can do with it for an hourly rate.
LienniTa@reddit
yeeeeh, I'm waiting for the next V3 checkpoint. It's not like it's hard to run R1 without thinking though.
redditisunproductive@reddit
I am starting to think that the most useful benchmark for me would be an ML/AI benchmark. The most useful model would be a Qwen coder style model that is focused solely on ML related tasks. Pytorch, Unsloth, everything you need to fine-tune and run models, even train them from scratch. Trained on all the most recent githubs, old ones too like BERT models. Newer application stuff like MCPs, etc. All that documentation already ingrained and performing near perfection. I want to see a benchmark for that.
Because benchmaxxing that will lead to local self-improvement (or branching specialization), which is more useful than waiting for random finetunes or corporate open source releases.
NinjaK3ys@reddit
I've found it to be incredibly useful too. Dumb question maybe?
Since this is an open-source model, how are the closed-source models different in terms of training and architecture?
Artistic_Okra7288@reddit
Again, open source means something completely different. This is the only open-weight model in the top 5. Well done, DeepSeek!
Repulsive-Memory-298@reddit
Tables are turning; we will continue to see some serious open-source innovation this year. I'm betting on some cool continuous-training flows and an emphasis on specialist models for edge AI.
bahwi@reddit
It's my preferred coding model
SashaUsesReddit@reddit
The people at Ai2 are solid. Love to see this data from them!