DeepSeek-R1 (Preview) Benchmarked on LiveCodeBench
Posted by Charuru@reddit | LocalLLaMA | View on Reddit | 46 comments
vlodia@reddit
How good vs o1 pro?
Mother_Soraka@reddit
isnt O1 Pro same as O1 High?
Charuru@reddit (OP)
No, I think pro might use less reasoning effort than high, but it has search.
lvvy@reddit
R1 is the not-yet-released one?
easyrider99@reddit
do we have any sort of indication on the size of the model?
_yustaguy_@reddit
Not really. People are saying it's based on V3, but I have my doubts about that, since V3 only just got released. I'd say it's more likely to still be based on V2.
JuicedFuck@reddit
It is all but guaranteed to be a reasoning finetune of their previous release.
AmericanNewt8@reddit
Probably inflated benchmark results, as DeepSeek tends to report, but even if it's vaguely in the same class it's still huge.
Salty-Garage7777@reddit
I assume it's not the model accessible at DeepSeek.com, when pressing the "deep think" button? Or is it? 😊
TechnoByte_@reddit
I'm pretty sure it's still the lite model, not the full version.
I asked it, and it replied:
BoJackHorseMan53@reddit
You should know by now that models don't know their own name
Mother_Soraka@reddit
its in their system prompt
BoJackHorseMan53@reddit
Maybe they missed changing the system prompt. I noticed AI companies are not much into web development.
Mother_Soraka@reddit
you are not wrong there.
Ironic, knowing their own models could improve their WebUI by a lot in a single day.
nsdjoe@reddit
would be a pretty remarkable hallucination
adityaguru149@reddit
LiveBench says it might have inflated numbers for new models, and that scores might go down as new problems get added.
samelden@reddit
I don't see it on the website.
cyanogen9@reddit
Lol, o1-mini is better than Sonnet in this benchmark, which means the benchmark is not accurate at all.
Charuru@reddit (OP)
Sonnet is really good (fitted) on React and Python, whereas this benchmark tests tough reasoning and compsci problems. It's not quite the same thing.
frivolousfidget@reddit
Meaning sonnet is still the SOTA for real life coding.
Charuru@reddit (OP)
No, o1-pro is clearly better than Sonnet, but o1-mini isn't.
frivolousfidget@reddit
Not for real life agentic use… but I see your point and accept it. I do use both daily while coding.
Charuru@reddit (OP)
Yeah, tbh I'm very excited about R1 for real-world use, since its base is DSv3, which is Sonnet-tier (very slightly worse) in React/Python. Both are much, much better than 4o, which is the base for o1. So adding strong reasoning on top of that should be crazy.
frivolousfidget@reddit
I had somewhat bad experiences with DSv3 (not terrible, but Sonnet is much better for me), but it is certainly, by far, the best model that I could run myself, much better than 405B. I use Sonnet in many more languages and it performs super well.
Syzeon@reddit
Exactly. The only advantages DSv3 has are its price and the uncapped rate limit. Performance-wise, though, it's nowhere near Sonnet, by miles. I often find myself only assigning simple, self-contained functions to DSv3; anything slightly complex and it just falls apart completely. Recently I've also found myself ditching DSv3 and embracing Gemini 1206, since it can do everything DSv3 does but is completely free. The 10 RPM limit is a little annoying, but for coding I find it no concern at all.
frivolousfidget@reddit
Sonnet is cheaper than dsv3 on fireworks for my usecase because of input caching.
tommitytom_@reddit
I also find sonnet to be much better than DSv3 for real world coding tasks
rorowhat@reddit
SOTA?
Arcuru@reddit
State Of The Art
uwilllovethis@reddit
More specifically, it consists of questions from LeetCode and Codeforces.
Charuru@reddit (OP)
Are they actually FROM those? I think they're similar but not those exact questions, or else their claims about preventing contamination wouldn't make sense.
uwilllovethis@reddit
Yes, they are from those, but there are some anti-contamination measures in place (like only testing on problems created after a model's cutoff date). Nevertheless, since they're LeetCode-style questions, contamination will always remain somewhat of a problem; some novel problems are almost identical to older ones.
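The cutoff-date filter described above is simple to sketch. A minimal, hypothetical version (the problem records and field names here are made up for illustration; LiveCodeBench's actual data format differs):

```python
from datetime import date

# Hypothetical problem records, each tagged with its public release date
problems = [
    {"id": "lc-3001", "released": date(2024, 9, 1)},
    {"id": "lc-2500", "released": date(2023, 12, 1)},
    {"id": "cf-1900A", "released": date(2024, 11, 15)},
]

def uncontaminated(problems, model_cutoff):
    """Keep only problems published after the model's training cutoff,
    so the model cannot have seen them during training."""
    return [p for p in problems if p["released"] > model_cutoff]

# e.g. evaluate a model with a May 2024 knowledge cutoff
subset = uncontaminated(problems, date(2024, 5, 31))
print([p["id"] for p in subset])  # only post-cutoff problems remain
```

As the comment notes, this only rules out verbatim memorization; a post-cutoff problem that closely mirrors an older one can still be "contaminated" in spirit.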
vincentz42@reddit
This benchmark tests LLMs' reasoning capabilities on recent competitive programming problems, such as those from LeetCode and Codeforces. o1 mini and o1 are designed specifically for this use case, so they will do much better.
pigeon57434@reddit
stop glazing anthropic and just accept for christ sake that o1 is good
Orolol@reddit
o1 and o1-mini are different.
pigeon57434@reddit
ya, and o1-mini is just as good at coding
OfficialHashPanda@reddit
For LeetCode/Codeforces-style questions, yeah. I think he's mostly referring to real-world usage, where o1-mini isn't as good as o1 and Sonnet.
Orolol@reddit
Not really. o1 is great, even better than Sonnet. Mini is good, but worse than Sonnet.
Ayman__donia@reddit
R1 preview release?
IxinDow@reddit
I checked in devtools and at least name of the model in EventStream is "deepseek-r1-preview". I don't know what name they used for Lite version though.
```
{
  "choices": [{"index": 0, "delta": {"content": "", "type": "text"}}],
  "created": 1737173468,
  "model": "deepseek-r1-preview",
  "prompt_token_usage": 33,
  "chunk_token_usage": 0,
  "message_id": 2,
  "parent_id": 1,
  "ban_regenerate": true,
  "ban_edit": true,
  "remaining_thinking_quota": 49
}
```
henryclw@reddit
Can’t wait to see the open source weights
BetEvening@reddit
China numba won!!!!!💪💪💪💪🇨🇳🇨🇳🇨🇳🇨🇳
jorbl@reddit
🫡 let's go
KillerX629@reddit
Can this be used anywhere?
Charuru@reddit (OP)
The insane thing is that just 4 months ago sonnet was SOTA and now we're doubling it... WTF. The progress is INSANE. https://imgur.com/a/vsje7yr
o1-preview released on Sep 12, 2024, shot up so high when it was released... now it looks downright decrepit.