DeepSeek-r1-0528 in top 5 on new SciArena benchmark, the ONLY open-source model
Posted by entsnack@reddit | LocalLLaMA | View on Reddit | 42 comments

Post: https://allenai.org/blog/sciarena
Allen AI puts out good work and contributes heavily to open source; I am a big fan of Nathan Lambert.
They just released this scientific literature research benchmark, and DeepSeek-r1-0528 is the only open-source model in the top 5, sharing the pie with the likes of OpenAI's o3, Claude 4 Opus, and Gemini 2.5 Pro.
I like to trash DeepSeek here, but not anymore. This level of performance is just insane.
artisticMink@reddit
You have to appreciate the arbitrary score. And community voting is in there, though. Totally didn't pull that one out of their ass to make the news.
SomeOddCodeGuy@reddit
It really is a killer model. I managed to squish the q5_K_M on my M3 Ultra, and it's a bit slow but the responses leave anything else I can run locally in the dust when it comes to almost every task I've tried.
I pretty much had to rebuild all of my workflows because it's now the center of everything I do lol
texasdude11@reddit
When you say slow, how slow is it? Can you please quantify it?
What prompt eval time are you seeing, and what generation time?
SomeOddCodeGuy@reddit
Ask and you shall receive.
texasdude11@reddit
That's q4km, right? How bad is it on q4km? Have you played with the -b and -ub parameters in llama.cpp to increase the pp tk/s speed?
SomeOddCodeGuy@reddit
Yea, after I played with MLA a bit I realized I could bump up to Q5_K_M after this test, so I did. It barely fits, but it does fit lol.
Is that the batch size? I haven't played with it in llama.cpp, but I did in KoboldCpp, and on the Mac it made no discernible speed difference at 512+, though it did get worse below 512. That said, Kobold didn't have ubatch, so I haven't tried that one, and it may implement batching differently from llama.cpp, so I'll definitely give it a try.
texasdude11@reddit
Yes, -b is the batch size. Increasing it will help with prompt processing, but it increases the memory requirements, at least on Nvidia GPUs.
Have you been able to cross 15 tk/s generation speed?
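If you'd rather experiment with the same knobs from Python than from raw CLI flags, here's a minimal sketch using llama-cpp-python, which exposes -b and -ub as n_batch and n_ubatch (the model path is a placeholder, and n_ubatch needs a reasonably recent build):
```python
from llama_cpp import Llama

# Placeholder GGUF path; point this at your actual DeepSeek R1-0528 quant.
llm = Llama(
    model_path="DeepSeek-R1-0528-Q4_K_M.gguf",
    n_ctx=16384,
    n_gpu_layers=-1,   # offload as many layers as will fit
    n_batch=2048,      # logical batch (-b): larger tends to speed up prompt processing
    n_ubatch=512,      # physical micro-batch (-ub): raising it also raises memory use
)

out = llm("Summarize multi-head latent attention in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```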
SomeOddCodeGuy@reddit
I'm afraid not. 12t/s generation is as fast as I've gotten with this model and a regular-sized prompt in MLX. Llama.cpp is far slower, at around 3-5t/s generation.
Right now I'm in llama.cpp for q5, but someone told me earlier today that MLX now has a 5-bit version, so I may try it out. The only thing I'm not sure about is whether MLX supports MLA; without it, I can't fit 5bpw, because the kv cache gets huge with this model.
json12@reddit
What’s MLA?
Turbulent_Pin7635@reddit
Can you link this version, pls?
Affectionate-Cap-600@reddit
the difference between R1 and r1-0528 is impressive.
looking at the whole leaderboard... llama 4 Maverick is quite embarrassing. o4-mini has a really good score for the price, and gpt-4.1 mini has an interesting score for a small non-reasoning model (still, idk how small it actually is).
I'm really disappointed by gemini-2.5-flash... I would have expected it above QwQ and Qwen3 32B.
happy to see MiniMax M1 on the leaderboard; it's the only 'hybrid transformer' listed.
entsnack@reddit (OP)
> the difference between R1 and r1-0528 is impressive.
Yeah clearly 0528 wasn't just an incremental update.
SilentLennie@reddit
It really is though: it's just an update with improved tool handling, etc.
Have a look at the scores:
R1 is based on V3.
V3 got a big boost with the newer V3 checkpoint, and R1 got a very similar boost with the new R1.
To me that means they hugely improved V3, then ran a similar process on top of it for the new R1, and thus it got the same boost.
pigeon57434@reddit
incremental in AI land actually means like 10 years worth of progress and absolutely groundbreaking in anything else
llmentry@reddit
Based on inference costs, it's likely about 5x smaller than GPT-4.1.
I can also vouch for 4.1-mini's abilities. It punches so far above its weight that I initially wondered if it was simply a quantised version of 4.1.
Other than R1-0528, the other interesting performer on that table is o4-mini. I should probably use that one more, by the looks of it. Someone needs to do the output cost / ELO point comparison of these data.
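In case anyone wants to run that comparison, here's a back-of-the-envelope sketch; the prices and scores below are placeholders, not numbers from the SciArena leaderboard or any provider's price sheet:
```python
# Placeholder entries: model name -> (USD per 1M output tokens, arena score).
models = {
    "model-a": (8.00, 1150),
    "model-b": (2.40, 1120),
    "model-c": (0.60, 1100),
}

baseline = min(score for _, score in models.values())
for name, (price, score) in models.items():
    points = max(score - baseline, 1)  # Elo points above the weakest entry
    print(f"{name}: ${price / points:.3f} per 1M output tokens per Elo point gained")
```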
I'm also surprised by Gemini 2.5 Flash's poor performance. I've not experienced this, and I've been using the 2.5 Flash preview a fair bit (as it's cheap as chips); it seems way better than the Qwen models IME.
It would be useful if Ai2 collected and weighted results by academic qualifications / position. I do wonder who is assessing these battles, as you need assessors with expert knowledge for these scores to count. I just tried out a rating battle on the site, and it was completely open to anyone. I'd have thought at the very least they'd require users with an academic institutional email address to log in prior to testing. And weighting results even by self-reported qualifications would be sensible. There is a danger that model confidence and vibe could bias outcomes otherwise.
pier4r@reddit
I see a problem with this if they let the community pose questions. As with lmarena (which is good, if one takes care of its limitations), people may become the bottleneck by asking simple or silly questions, or they may judge things differently.
It would be good (I didn't see this mentioned in the article) for questions to get a screening pass that only lets valid scientific questions through. Otherwise the arena would likely inherit the same problems as lmarena.
Kamimashita@reddit
I find that R1 spends too much time thinking, and the thinking is too verbose and often rambling. Tbf other reasoning models might be the same, but judging by the time to first non-thinking token, they think a lot less.
Turbulent_Pin7635@reddit
The brat is angry for sure! The kid really is fierce!
Even though I use o3 regularly, whenever I need a polished version, only my local R1 does it well. =)
Sorry_Ad191@reddit
It's been a while since V3-0324. I wonder if the chefs are cooking.
robberviet@reddit
As always. Just curious, does anyone really use the DeepSeek model? For me it seems too slow to be practical.
IrisColt@reddit
I don't use it. Can't.
robberviet@reddit
I can't either! Us peasants barely manage to run 30B models; I'm sticking with Qwen3 30B at the moment.
For the true R1 (not a distill), my only way to try it is still via the OpenRouter free API, so no real usage given the limit.
createthiscom@reddit
I've been testing V3-0324 vs R1-0528 for agentic purposes pretty intensely for the past couple of weeks. I've come to the conclusion that R1-0528 is the clever nerd who does what he wants. V3-0324 is a soldier who follows orders, but isn't particularly clever.
I still prefer V3-0324 when I just want the model to do what I tell it to do, faithfully. However, I've started giving harder problems to R1-0528 when I don't particularly care how the problem is solved and I just need a solution.
I've tried giving orders to R1-0528 and it will do some of the things I ask, but just ignore some of them too. I think of it like a particularly clever software engineer. You have to pique its curiosity.
amranu@reddit
I find getting Deepseek v3 to use file write tools is like pulling teeth, but maybe I'm just having bad luck
ThisWillPass@reddit
What quants?
amranu@reddit
The full model through the deepseek api, I don't have the hardware to run a local version really.
createthiscom@reddit
Works pretty well for me with open hands. I run Q4. I even use an MCP server for file editing: https://github.com/createthis/diffcalculia_mcp
SomeOddCodeGuy@reddit
I've noticed that, especially for reviewing things, both of them, and Qwen3, have very different perspectives, though R1 does a great job consistently.
When I ask them to code review something:
When I give them Qwen3's panic attac- er, I mean, code review from the first test to peer review:
IrisColt@reddit
Thanks for the insight!
createthiscom@reddit
mmmm. I need to start using R1 for code reviews. Smart.
ThePixelHunter@reddit
R1 can become distracted by its own thinking chains. I bet if you prefilled
<think></think>
to skip its reasoning phase, you'd get performance better than V3 without going off track as often.
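A rough sketch of that prefill trick, assuming a local OpenAI-compatible server that supports continuing a prefilled assistant turn (vLLM's continue_final_message is used here; the URL and model id are placeholders, and other backends may need a different mechanism):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder endpoint

resp = client.chat.completions.create(
    model="deepseek-r1-0528",  # placeholder model id
    messages=[
        {"role": "user", "content": "Refactor this function to avoid the extra copy."},
        # Prefill an empty think block so the model skips its reasoning phase.
        {"role": "assistant", "content": "<think>\n\n</think>\n\n"},
    ],
    extra_body={"add_generation_prompt": False, "continue_final_message": True},
    max_tokens=512,
)
print(resp.choices[0].message.content)
```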
Affectionate-Cap-600@reddit
there is a merge of v3 and R1 that, according to the release paper, seems to make the reasoning more concise and less chaotic without hurting performance too much
Lissanro@reddit
If you mean R1T, it used to be my daily driver for a while; even though it was weaker at reasoning than the old R1, it was much better than raw V3.
However, I find the new R1 much better, and it has been my favorite local model since its release (I run the IQ4_K_M quant).
Yes_but_I_think@reddit
You have RAM (512GB) but no disk space?!
createthiscom@reddit
It's not really that I don't have disk space, it's more that I want to see what the cheaper offshore models can do with it for an hourly rate.
LienniTa@reddit
yeeeeh, I'm waiting for the next V3 checkpoint. It's not like it's hard to run R1 without thinking though.
redditisunproductive@reddit
I am starting to think that the most useful benchmark for me would be an ML/AI benchmark. The most useful model would be a Qwen coder style model that is focused solely on ML related tasks. Pytorch, Unsloth, everything you need to fine-tune and run models, even train them from scratch. Trained on all the most recent githubs, old ones too like BERT models. Newer application stuff like MCPs, etc. All that documentation already ingrained and performing near perfection. I want to see a benchmark for that.
Because benchmaxxing that will lead to local self-improvement (or branching specialization), which is more useful than waiting for random finetunes or corporate open source releases.
NinjaK3ys@reddit
I've found it to be incredibly useful too. Dumb question maybe?
Since this is an open-source model, how are the closed-source models different in terms of training and architecture?
Artistic_Okra7288@reddit
Again, open source means something completely different. This is the only open-weight model in the top 5. Well done, DeepSeek!
Repulsive-Memory-298@reddit
Tables are turning; we will continue to see some serious open-source innovation this year. I'm betting on some cool continuous-training flows and an emphasis on specialist models for edge AI.
bahwi@reddit
It's my preferred coding model
SashaUsesReddit@reddit
The people at Ai2 are solid. Love to see this data from them!