DeepSeek-R1 (Preview) Benchmarked on LiveCodeBench
Posted by Charuru@reddit | LocalLLaMA | View on Reddit | 46 comments
vlodia@reddit
How good vs o1 pro?
Mother_Soraka@reddit
isnt O1 Pro same as O1 High?
Charuru@reddit (OP)
No, I think pro might use less reasoning effort than high, but it has search.
lvvy@reddit
R1 is the not-yet-released one?
easyrider99@reddit
do we have any sort of indication on the size of the model?
_yustaguy_@reddit
Not really. People are saying it's based on V3, but I have my doubts about that, since V3 only just got released. I'd say it's more likely to still be based on V2.
JuicedFuck@reddit
It is all but guaranteed to be a reasoning finetune of their previous release.
AmericanNewt8@reddit
Probably inflated benchmark results, as DeepSeek tends to report, but even if it's vaguely in the same class it's still huge.
Salty-Garage7777@reddit
I assume it's not the model accessible at DeepSeek.com, when pressing the "deep think" button? Or is it? 😊
TechnoByte_@reddit
I'm pretty sure it's still the lite model, not the full version.
I asked it, and it replied:
BoJackHorseMan53@reddit
You should know by now that models don't know their own name
Mother_Soraka@reddit
its in their system prompt
BoJackHorseMan53@reddit
Maybe they missed changing the system prompt. I noticed AI companies are not much into web development.
Mother_Soraka@reddit
you are not wrong there.
Ironic, knowing their own models could improve their WebUI by a lot in a single day.
nsdjoe@reddit
would be a pretty remarkable hallucination
adityaguru149@reddit
LiveBench says it might have inflated numbers for new models, and that scores might go down as new problems get added.
samelden@reddit
I don't see it on the website.
cyanogen9@reddit
Lol, o1-mini is better than Sonnet in this benchmark, which means the benchmark is not accurate at all.
Charuru@reddit (OP)
Sonnet is really good (fitted) on React and Python, whereas this benchmark tests tough reasoning and compsci problems. It's not quite the same thing.
frivolousfidget@reddit
Meaning sonnet is still the SOTA for real life coding.
Charuru@reddit (OP)
No, o1-pro is clearly better than Sonnet, but o1-mini isn't.
frivolousfidget@reddit
Not for real life agentic use… but I see your point and accept it. I do use both daily while coding.
Charuru@reddit (OP)
Yeah, tbh I'm very excited about R1 for real-world use, since its base is DSv3, which is Sonnet-tier (very slightly worse) in React/Python. Both are much, much better than 4o, which is the base for o1. So adding strong reasoning on top of that should be crazy.
frivolousfidget@reddit
I had somewhat bad experiences with DSv3 (not terrible, but Sonnet is much better for me), but it is certainly, by far, the best model that I could run myself, much better than 405B. I use Sonnet in many more languages and it performs super well.
Syzeon@reddit
Exactly. The only advantages DSv3 has are its price and the uncapped rate limit. Performance-wise, though, it's nowhere near Sonnet, by miles. I often find myself only assigning simple, self-contained functions to DSv3; anything slightly complex and it just falls apart completely. Recently I've also found myself ditching DSv3 and embracing Gemini 1206, since it can do everything DSv3 does but is completely free. The 10 RPM limit is a little annoying, but for coding I find it no concern at all.
frivolousfidget@reddit
Sonnet is cheaper than dsv3 on fireworks for my usecase because of input caching.
tommitytom_@reddit
I also find sonnet to be much better than DSv3 for real world coding tasks
rorowhat@reddit
SOTA?
Arcuru@reddit
State Of The Art
uwilllovethis@reddit
More specifically, it consists of questions from LeetCode and Codeforces.
Charuru@reddit (OP)
Are they actually FROM those? I think they're similar but not those exact questions, or else their claims about preventing contamination wouldn't make sense.
uwilllovethis@reddit
Yes, they are from those, but there are some anti-contamination measures in place (like only testing on problems created after a model's cutoff date). Nevertheless, since they're LeetCode-style questions, contamination will always remain somewhat of a problem; some novel problems are almost identical to older ones.
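The cutoff-date filter described above is simple to sketch. A minimal, hypothetical version (the problem records and field names here are made up for illustration; LiveCodeBench's actual data format differs):

```python
from datetime import date

# Hypothetical problem records, each tagged with its public release date
problems = [
    {"id": "lc-3001", "released": date(2024, 9, 1)},
    {"id": "lc-2500", "released": date(2023, 12, 1)},
    {"id": "cf-1900A", "released": date(2024, 11, 15)},
]

def uncontaminated(problems, model_cutoff):
    """Keep only problems published after the model's training cutoff,
    so the model cannot have seen them during training."""
    return [p for p in problems if p["released"] > model_cutoff]

# e.g. evaluate a model with a May 2024 knowledge cutoff
subset = uncontaminated(problems, date(2024, 5, 31))
print([p["id"] for p in subset])  # only post-cutoff problems remain
```

As the comment notes, this only rules out verbatim memorization; a post-cutoff problem that closely mirrors an older one can still be "contaminated" in spirit.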
vincentz42@reddit
This benchmark tests LLMs' reasoning capabilities on recent competitive programming problems, such as those from LeetCode and Codeforces. o1 mini and o1 are designed specifically for this use case, so they will do much better.
pigeon57434@reddit
stop glazing anthropic and just accept for christ sake that o1 is good
Orolol@reddit
o1 and o1-mini are different.
pigeon57434@reddit
ya, and o1-mini is just as good at coding
OfficialHashPanda@reddit
For LeetCode/Codeforces-style questions, yeah. I think he's mostly referring to real-world usage, where o1-mini isn't as good as o1 and Sonnet.
Orolol@reddit
Not really. o1 is great, even better than Sonnet. Mini is good, but worse than Sonnet.
Ayman__donia@reddit
R1 preview release?
IxinDow@reddit
I checked in devtools and at least name of the model in EventStream is "deepseek-r1-preview". I don't know what name they used for Lite version though.
```
{
  "choices": [{"index": 0, "delta": {"content": "", "type": "text"}}],
  "created": 1737173468,
  "model": "deepseek-r1-preview",
  "prompt_token_usage": 33,
  "chunk_token_usage": 0,
  "message_id": 2,
  "parent_id": 1,
  "ban_regenerate": true,
  "ban_edit": true,
  "remaining_thinking_quota": 49
}
```
henryclw@reddit
Can’t wait to see the open source weights
BetEvening@reddit
China numba won!!!!!💪💪💪💪🇨🇳🇨🇳🇨🇳🇨🇳
jorbl@reddit
🫡 let's go
KillerX629@reddit
Can this be used anywhere?
Charuru@reddit (OP)
The insane thing is that just 4 months ago sonnet was SOTA and now we're doubling it... WTF. The progress is INSANE. https://imgur.com/a/vsje7yr
o1-preview released on Sep 12, 2024, shot up so high when it was released... now it looks downright decrepit.