DeepSeek announces R1-Lite, beats o1-preview on some benchmarks
Posted by TheLogiqueViper@reddit | LocalLLaMA | View on Reddit | 62 comments
SadWolverine24@reddit
I need Qwen 2.5 with reasoning.
TheLogiqueViper@reddit (OP)
Test-time training is also cool. I can't imagine what happens if these models get both test-time compute and test-time training.
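To make the combination concrete, here's a minimal sketch of test-time training, assuming a HuggingFace causal LM and a few gradient steps on the prompt's own next-token loss before generating. The model name, learning rate, and step count are placeholders, not anything DeepSeek has described:

```python
# Toy sketch of test-time training (TTT): take a few gradient steps on
# the prompt's own next-token loss before answering. Model, lr, and
# step count are placeholders, not anything DeepSeek has published.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Which is heavier, a kilo of steel or a kilo of feathers?\nA:"
inputs = tok(prompt, return_tensors="pt")

# Test-time *training*: briefly adapt the weights to this specific input.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for _ in range(3):                                   # just a few steps
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Test-time *compute*: generate as usual from the adapted model.
model.eval()
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```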
junior600@reddit
I really hope they are going to release a version that also fits in my RTX 3060 12GB lol
__Maximum__@reddit
This one probably wouldn't fit even if you quantize it to the max, but I can see smaller models getting this extended inference ability in the near future. I don't see any reason why this can't apply to 3B, 7B, or 32B models. With those, even long context wouldn't be a big problem.
Outrageous-Ad5641@reddit
R1-Lite is based on the V2-Lite MoE with 16B total params and 2.4B active params
Fusseldieb@reddit
If we get 4o-level models on 8GB I'll already be jumping for joy. But this will certainly take a few more years at least, I guess.
saraba2weeds@reddit
I've tried DeepSeek 33B once, then went back to Llama 70B. It seemed that DeepSeek had little idea what I wanted it to do.
jeffwadsworth@reddit
Haha. Deepseek is amazingly good.
opi098514@reddit
Good thing this isn’t deepseek 33B
Redoer_7@reddit
Also, they promise to release the full version of the model
fiery_prometheus@reddit
lemme just dig out my 8x4090 rig from my shed
In all seriousness, I really hope the Q2 and Q3 quants will be runnable without too much loss, with whatever state-of-the-art quantization method is available at the time.
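For what it's worth, running a low-bit quant is mostly a one-liner these days; a minimal llama-cpp-python sketch (the GGUF file name is hypothetical, since no R1 weights are out yet):

```python
# Minimal sketch: running a low-bit GGUF quant with llama-cpp-python.
# The model path is hypothetical -- R1 weights haven't been released yet.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1-lite.Q3_K_M.gguf",  # hypothetical Q3 quant
    n_ctx=8192,        # reasoning traces are long; budget context for them
    n_gpu_layers=-1,   # offload everything that fits onto the GPU
)

out = llm("Solve step by step: what is 17 * 23?", max_tokens=512)
print(out["choices"][0]["text"])
```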
acc_agg@reddit
Just wait for the 5090 version. You can save a whole two cards.
mycall@reddit
I would be fine with slow inference if it means it works on anything.
Inspireyd@reddit
I didn't know that. That's amazing. This is just the Lite version.. there will be another full version. This is awesome
DarkArtsMastery@reddit
Impressive. Makes me even more curious about the upcoming Qwen 3, which will hopefully have the same thinking feature; given how awesome Qwen 2.5 was, I think we will be very happy with what we have in just a few short weeks.
Compare this to the situation just a year ago, when we did not have anything remotely competing with GPT-4 locally; now these models appear so fast that it is ridiculous. I would never pay for AI these days, as open source just keeps proving to be the way forward.
XForceForbidden@reddit
I pay DeepSeek to use their API, but my total spend doesn't even reach $3.
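For anyone curious, DeepSeek's API is OpenAI-compatible, so calling it looks roughly like this (base URL and model name are taken from their docs at the time; treat them as assumptions):

```python
# Sketch of calling the DeepSeek API. It is OpenAI-compatible, so the
# standard openai client works; base URL and model name are assumptions
# based on DeepSeek's docs at the time.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                      # your DeepSeek key
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Explain MoE routing briefly."}],
)
print(resp.choices[0].message.content)
```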
Inspireyd@reddit
you don't pay for any?
DarkArtsMastery@reddit
Zero. Not even Sonnet.
Inspireyd@reddit
That's amazing. I used to pay, but I won't anymore. Open-source models are getting more and more capable. And as for Qwen 3, it seems to me they are already developing a reasoning version as well.
jeffwadsworth@reddit
Wow. It solves the Aunt Agatha riddle. Only o1 could do this before. Let's analyze the situation step by step:

Let's denote:
- A for Agatha
- B for the Butler
- C for Charles

Step-by-Step Analysis:

1. Agatha hates everyone except the butler: A hates A and C, but A does not hate B.
2. The butler hates everyone not richer than Aunt Agatha: if C is not richer than A, then B hates C.
3. The butler hates everyone whom Agatha hates: since A hates C, B hates C.
4. Charles hates no one that Agatha hates: since A hates A, C does not hate A.
5. No one hates everyone.

Conclusion:

Given the conditions:
- The killer must hate the victim and be no richer than the victim.
- A hates C and A, but not B.
- B hates C, and possibly A if B is not richer than A.
- C does not hate A.

Since C does not hate A, C cannot be the killer.
If B is not richer than A, then B hates A, and B could be the killer.
However, if B is richer than A, then B does not hate A, and B cannot be the killer.
Since A hates A and is no richer than herself, A could be the killer of herself.

Final Answer:
Agatha killed herself.
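The conclusion also checks out mechanically: a small brute force over every hate relation and wealth ordering under the puzzle's constraints leaves Agatha as the only possible killer. The constraint encoding below is mine, and wealth is simplified to a strict total order:

```python
# Brute-force check of the Aunt Agatha puzzle. Axiom encoding is mine;
# wealth is simplified to a strict total order over the three people.
from itertools import product, permutations

people = ["agatha", "butler", "charles"]
pairs = list(product(people, repeat=2))
possible_killers = set()

for bits in product([False, True], repeat=len(pairs)):
    hates = dict(zip(pairs, bits))
    for order in permutations(people):
        wealth = {p: rank for rank, p in enumerate(order)}  # higher = richer

        # Agatha hates everyone except the butler.
        if any(hates[("agatha", p)] != (p != "butler") for p in people):
            continue
        # The butler hates everyone not richer than Agatha.
        if any(not hates[("butler", p)]
               for p in people if wealth[p] <= wealth["agatha"]):
            continue
        # The butler hates everyone whom Agatha hates.
        if any(hates[("agatha", p)] and not hates[("butler", p)]
               for p in people):
            continue
        # Charles hates no one that Agatha hates.
        if any(hates[("agatha", p)] and hates[("charles", p)]
               for p in people):
            continue
        # No one hates everyone.
        if any(all(hates[(h, p)] for p in people) for h in people):
            continue

        # The killer hates the victim and is no richer than the victim.
        for k in people:
            if hates[(k, "agatha")] and wealth[k] <= wealth["agatha"]:
                possible_killers.add(k)

print(possible_killers)  # {'agatha'}
```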
lolwutdo@reddit
How many parameters is this model?
zjuwyz@reddit
If they follow the DeepSeek-V2 and DeepSeek-V2-Lite pattern, that's a 2.4B/16B MoE.
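For anyone puzzled by the 2.4B/16B notation: in an MoE, the router activates only a few experts per token, so the "active" parameter count is a fraction of the total. A toy top-k routing sketch (expert count, k, and sizes are illustrative, not DeepSeek's actual config):

```python
# Toy sketch of top-k MoE routing: total parameters span all experts,
# but each token only runs k of them (the "active params").
# Expert count, k, and dimensions are illustrative, not DeepSeek's config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = ToyMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```

That's why a 16B MoE can run with roughly the per-token FLOPs of a ~2.4B dense model while still needing the full 16B in memory.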
RevoDS@reddit
Looks like there is still no moat in LLMs after all
No_Step3864@reddit
there will only be a chip-moat.
Healthy-Nebula-3603@reddit
models are getting better and better and smaller
yuicebox@reddit
Yes but there are still a lot of rumors of a scaling wall at the major LLM companies
Healthy-Nebula-3603@reddit
So far we don't see it
I think such rumors are kind of a coping mechanism.
yuicebox@reddit
Small models are the area where the most improvement happens. Why is Opus 3.5 not out? OpenAI has made statements about not being happy with the performance of their latest models considering the data volume and training time.
When was the last major leap in large SoTA model performance? Not small model performance or open source model performance, top tier model performance. Was it o1 preview?
Healthy-Nebula-3603@reddit
"about not being happy with performance of their latest models considering the data volume and training time. "
Can you link me that official statement?
Opus is/was just too heavy and expensive..
Do you remember how expensive 1M tokens from it was?
I think they are currently preparing something similar to o1, like a Sonnet 4.
Opus is too heavy and too expensive currently.
Maybe in a few years with better hardware...
Zephandrypus@reddit
And we'll beat that wall through new architectures and techniques. Mixture of experts and reasoning LLMs are still hot off the press with drastic improvements, and we'll just keep exploring.
yuicebox@reddit
I completely agree that new architectures can fix the problem. That said, more of the same is not new.
Nyghtbynger@reddit
When your cutting edge flagship model gets beaten by "lite"
RabbitEater2@reddit
Isn't o1 preview a 'lite' version of o1 though?
HMikeeU@reddit
Isn't that o1-mini?
eposnix@reddit
From what I understand, o1-preview is their first test of the o1 architecture. Then they created o1, and o1-mini is a distilled version of that.
Single_Ring4886@reddit
I think you are right, and the reason they did not release the full o1 is, well, that it "thinks", and when you start thinking you sometimes stray off without completing the task...
opi098514@reddit
This statements makes me feel heard, and attacked at the same time.
Healthy-Nebula-3603@reddit
They will release o1 soon .... they are literally forced to now ... lol
my_name_isnt_clever@reddit
Why are these companies so bad at names?? I hate that we have to compare R1 with o1.
estebansaa@reddit
o1 is already feeling outdated! Gemini is kicking their ass now, without even using CoT.
Caladan23@reddit
Try it with 3000 lines of code as input and multiple iterations in a real-world scenario, instead of a one-message test riddle, before you judge.
CheatCodesOfLife@reddit
+1
I fine-tuned some models to do this, which gave outputs similar to the screenshot and solved some riddles. But for anything serious they were useless.
If I've been stuck on a problem for a while with Sonnet/Qwen, I often toggle over to o1-mini (in open-webui) and tell it: "Claude and I have been stuck on this for a while. What are we doing wrong?"
A lot of the time, it finds the issue with 1 reply.
Inspireyd@reddit
This is something I noticed... the current Gemini Exp 1114 is able to do reasoning exercises that previously only o1 could do. Now, as if that weren't enough, DeepSeek is launching R1 to further reduce the importance of o1.
estebansaa@reddit
yes, that last Gemini model is great. One issue: the context window is much smaller than their usual 2M context.
LienniTa@reddit
Managed to send it into an infinite loop with the very first question, just asking it to come up with a nickname idea.
fairydreaming@reddit
I checked it on 25 quizzes from the farel-bench benchmark (the aunt-or-uncle relationship) and it got 92% right (two answers were wrong). That's better than all LLMs I've benchmarked so far except o1-mini and o1-preview (they got 100% on the full set of 50 quizzes for this relationship). Definitely a step in the right direction. This is going to put some serious pressure on OpenAI. Can't wait to see it applied to the large DeepSeek model!
Aymanfhad@reddit
It's astonishing that free, open-source AI may soon surpass o1-preview, which is closed and paid. And then they claim that training AI is expensive. They just want your damn money.
Healthy-Nebula-3603@reddit
It is expensive ....
SquashFront1303@reddit
Finally, an open-source o1. We'll see more high-quality reasoning models trained on its CoT. What a time to be alive 🎉
Enough-Meringue4745@reddit
Once it's open source, we can make that distinction.
Dorkits@reddit
Where is the link to test it?
TheLogiqueViper@reddit (OP)
chat.deepseek.com
Dorkits@reddit
Thanks
Rei1003@reddit
I hate that people can't name their products independently.
No_Afternoon_4260@reddit
Didn't Nous release a similar API? Was it in beta?
hapliniste@reddit
I think it wasn't actually the same thing. It was a standard LLM with test-time compute using known techniques.
No_Afternoon_4260@reddit
What do you call test-time compute?
hapliniste@reddit
Doing more computation at inference time. In the case of Nous Research it's things like mixture-of-agents, while for o1 and DeepSeek it's chain of thought (but with a special training method).
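The cheapest version of that idea is self-consistency: sample several chain-of-thought completions and majority-vote the final answers. A minimal sketch, assuming a `generate` placeholder for whatever model endpoint you use and an "Answer:" output format (both are illustrative):

```python
# Sketch of the simplest test-time-compute recipe: self-consistency.
# Sample several chain-of-thought completions, majority-vote the answers.
# `generate` is a placeholder for whatever model call you actually use.
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: call your LLM here and return the full CoT output."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    # Assumes the model ends with "Answer: <x>"; adapt to your format.
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency(prompt: str, n_samples: int = 8) -> str:
    cot_prompt = prompt + "\nLet's think step by step."
    answers = [extract_answer(generate(cot_prompt)) for _ in range(n_samples)]
    # More samples = more inference-time compute = (usually) better accuracy.
    return Counter(answers).most_common(1)[0][0]
```

o1 and R1 go further by training the chain of thought itself, but the lever is the same: spend more tokens at inference to buy accuracy.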
tucnak@reddit
Think: there's a reason why not a single lab in the West has released an o1 of their own. It's because they're not convinced this particular approach is worthwhile. Since the o1-preview release, Sonnet has outperformed it where it matters, using traditional autoregression. In the measures where it didn't, that could be attributed to the datasets. In the mind of /r/LocalLLaMA stans, RL is fucking pixie dust. The industry is not convinced. Everybody experiments with RL; it's just that OpenAI are the only ones for whom it made financial sense to release an "RL wonder-model."
Just the other day, the so-called llava-o1 was discussed. If you had actually read the paper, you would know that the o1 connection is made through "Evaluation of OpenAI o1: Opportunities and Challenges of AGI", yet another paper-mill product with 50 or so authors. They created that 280-page monstrosity less than two weeks after the o1 release. We don't know what o1 is doing, but apparently the Chinese figured it out in a matter of days... They say their model performs well on visual benchmarks, but that's probably owing to the fact that they're overfitting those benchmarks in the first place.
Educational_Gap5867@reddit
The best/worst part is that 4o has started beating it too on the less gameable benchmarks. It does perform consistently better on strictly reasoning tasks, though, and that part IS still true.
tucnak@reddit
I suspect that 4o and o1 fine-tuning is handled by two non-overlapping teams, with a lot of compartmentalization between them, and they're probably not getting the same data. It seems like they initially starved 4o of the most recent data for months before release, and only trickled it in fully weeks after the release, so that o1 would appear stronger momentarily. I stand by my initial assessment that o1 is a revolution in accounting, not capability; the industry agrees, it seems.