Chinese AI startup StepFun up near the top on livebench with their new 1 trillion param MOE model
Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 83 comments
SomeOddCodeGuy@reddit
Good lord that instruction following score. That's going to be insane for RAG, summarization, etc.
Maybe if I string some Mac Studios together, and send it a prompt today, I'll get my response next week.
I'm going to be jealous of whoever can use that model.
DinoAmino@reddit
Well, in the meantime, Llama 3.1 70B beats it (87.5) - and yes, using an INT8 quant with RAG is really good.
Pedalnomica@reddit
Yeah, that's really pulling up the average. If you click through to the subcategories, it seems like "story_generation" is where they are really pulling ahead. No doubt that's exciting for many folks around here, but I suspect it means the model will feel a little underwhelming relative to the overall score for more "practical" use cases.
Impressive nonetheless!
Expensive-Paint-9490@reddit
I could use it at a 3-bit quant but at, well, one token per three seconds.
I_am_unique6435@reddit
So size doesn't always matter.
rishiarora@reddit
How much overfitting? YES!!
yiyecek@reddit
Why stop at 1T when you can do 10T?
KurisuAteMyPudding@reddit
One trillion params -> gets beat by o1 mini
Account1893242379482@reddit
What are the estimates for o1 mini's size?
KurisuAteMyPudding@reddit
If I could take a rough guess that's not based on any facts at all, I'd say somewhere between 8-16 billion params.
jastorgally@reddit
o1 mini is $12 per million output tokens; I doubt it's 8-16 billion.
OfficialHashPanda@reddit
Could very well be OpenAI just charging a premium for its whole new class of models 😊😊
Whotea@reddit
It also produces tons of CoT tokens, so that probably raises the price.
learn-deeply@reddit
No, the CoT tokens are included as part of the output tokens, even if they're not visible.
Whotea@reddit
The CoT tokens themselves, or the summary you see on ChatGPT?
Affectionate-Cap-600@reddit
Yep, exactly those tokens...
I made some calls to o1 mini that required just a simple answer of a small paragraph, and I was billed for something like 10k tokens... It's a bit of an overthinker.
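For a rough sense of the math being described: a minimal sketch of the bill when hidden reasoning tokens are counted as output, using the $12/1M output price and ~10k-token total mentioned in this thread (the visible/hidden split is an illustrative assumption):

```python
# Sketch: why hidden chain-of-thought tokens inflate the bill.
# The $12/1M output price and ~10k-token total come from the thread;
# the visible/hidden split below is an illustrative assumption.
PRICE_PER_MILLION_OUTPUT = 12.00   # USD per 1M output tokens (o1 mini, per the thread)

visible_answer_tokens = 300        # a "small paragraph" answer (assumed)
hidden_reasoning_tokens = 9_700    # unseen CoT tokens, still billed as output (assumed)
billed_tokens = visible_answer_tokens + hidden_reasoning_tokens

cost = billed_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT
print(f"{billed_tokens} billed output tokens -> ${cost:.2f}")  # ~$0.12 for one short answer
```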
TheDreamWoken@reddit
No.
Just no. https://llm.extractum.io/static/llm-leaderboards/
Healthy-Nebula-3603@reddit
Looking at how fast o1 mini is, I'm confident it's less than 50B parameters. It literally spits out 5k tokens within seconds.
Account1893242379482@reddit
Yeah, but there are other providers who are faster even with 70B Llama models, and those aren't even MoE.
Healthy-Nebula-3603@reddit
Is OpenAI using those specialized cards?
adityaguru149@reddit
I read somewhere that models >70B have substantially higher self-consistency accuracy than smaller ones like 32B or lower. So, I would guess 70B with test-time compute.
o1 can be 120B or higher.
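For context, "self-consistency" here refers to sampling several answers and majority-voting the final result. A toy sketch of that idea, where `sample_answer` is a hypothetical stand-in for a stochastic model call rather than any real API:

```python
# Toy sketch of self-consistency: sample several independent answers and
# keep the most common one. sample_answer is a hypothetical stand-in for
# a real (temperature > 0) model call, not any specific API.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Pretend model: usually right, sometimes wrong.
    return random.choice(["42", "42", "42", "41"])

def self_consistency(question: str, n_samples: int = 8) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # majority vote over 8 samples
```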
agent00F@reddit
"a million apples -> beaten by one orange"
MoffKalast@reddit
Vitamin C bench be like
Affectionate-Cap-600@reddit
Ok that made me laugh too much
adityaguru149@reddit
I'm more interested in the fact that a <2-year-old company beats Google in probably its 1st/2nd release. Can it beat OpenAI/Anthropic in probably the next release? Why not?
Any major release from a non-US company is also a big deal for AI democratisation, since no single government would have all the control. Think of how this would spoil ClosedAI's plans of pushing AI regulation as a moat against new entrants so they can command astronomical valuations.
Any_Pressure4251@reddit
I'm more interested that you can come to such a conclusion before waiting till we do some tests.
martinerous@reddit
But can it run ~Crysis~ ARC-AGI?
Pro-editor-1105@reddit
I was excited until I read "one trillion parameters."
No-Refrigerator-1672@reddit
With 1T parameters, I won't be surprised if they just overfitted on all the test data and it produces garbage for literally anything but tests.
UserXtheUnknown@reddit
OpenAI models are believed to be over 1 trillion parameters by now, so there is no reason to think this one is more overfitted than an OpenAI one.
Icy_Accident_3847@reddit
I guess you don't know what livebench is.
PlantFlat4056@reddit
You mean the place filled with wumaodangs and bots
wavinghandco@reddit
Why make trillions, when you can make... Billions?
ab2377@reddit
less is the new more! .. less is the new more!
Admirable-Star7088@reddit
I was still excited, until I double-checked how much VRAM my consumer GPU had again.
robertotomas@reddit
NVIDIA published a paper, at a time when they could only reasonably be assumed to be talking about ChatGPT 4o, about training a 1.2 trillion parameter model for OpenAI… so 1 trillion or more is really not so bad.
Apprehensive_Rub2@reddit
I'm wondering if it's better in Chinese.
HairyAd9854@reddit
Does anybody know what they are using to train a 1T model? I'm not sure any American company could train such a large model without NVIDIA hardware. I guess a large share of the parameters are actually 8-bit.
Khaosyne@reddit
I tried it and it seems it was mostly trained on a Chinese dataset, but it's somewhat good in English.
SadWolverine24@reddit
Why is the performance so shitty for 1T parameters?
Few_Professional6859@reddit
I have read quite a few news articles about scaling laws being limited by bottlenecks.
Whotea@reddit
Not test time compute scaling
robertotomas@reddit
I don't know that it was accurate, but the first such leak was about a disappointing Orion (o1, non-preview). I know Altman came back and commented on it later, in a way that implies the interpretation people had of the leak was incorrect, but still.
Whotea@reddit
The benchmarks they provided and even o1 preview seem pretty good
robertotomas@reddit
I’m not saying I am disappointed. Someone who worked in the project said they weren’t able to release on time because the results were disappointing
Whotea@reddit
Beating PhDs on GPQA and placing in the 93rd percentile on Codeforces is anything but disappointing. Are you seriously relying on rumors instead of actual evidence lol
robertotomas@reddit
I guess they expected those results to generalize more easily than they actually do is all. Rumors from outside of the company I don’t care about, even from Microsoft. “Rumors” from the team lead of the project, I take more seriously
Whotea@reddit
What did the team lead say? Any real sources?
jd_3d@reddit (OP)
If you take out the test-time compute models (o1 and o1 mini), it's literally above everything except Sonnet 3.5.
Perfect_Twist713@reddit
Something else to note is that there are basically no proper benchmarks that test the breadth of knowledge (and the possible/unknown emergent properties) that the massive models might have. Comparing small models to very large ones on the existing benchmarks is almost like measuring intelligence by seeing if a person can open a pickle jar and saying "My 5 year old is as smart as Einstein because Einstein got it open too".
notsoluckycharm@reddit
Signal to noise ratio, really. Not all content is worth being in the set, but it’s there. You took your F150 to the office, your boss their Ferrari. You both did the same thing, but one’s sleeker and probably cost a bit more to make.
clex55@reddit
sparse architecture?
RudzinskiMaciej@reddit
MoE
Aggressive-Physics17@reddit
Heavily, astronomically undertrained.
SadWolverine24@reddit
I can send them my GTX 980 since they clearly need more compute.
Whotea@reddit
Especially since there's a GPU embargo on them.
NEEDMOREVRAM@reddit
I fucking love the Chinese!!! And yes, I'm 100% certain that got my name put on yet another watchlist. Get fucked, American NKVD.
I have very high hopes that the Chinese will eventually release a model that will wipe its ass with both ChatGPT and Claude.
C'mon you Chinese guys, surely you see the piss-poor state of America. Do us a solid and give us the power to use an LLM tool that's more powerful than the censored WrongThink correctors that ChatGPT and Claude are.
This is an EASY win for China and an even bigger win for LLM enthusiasts.
IJCAI2023@reddit
Which leaderboard is this? It doesn't look familiar.
ihexx@reddit
livebench.ai
It's one of the best leaderboards because they update the questions every few months, so LLMs can't just memorize leaks off the internet. That's a problem with others like MMLU: because the questions are public, some people just train on the benchmark to inflate their scores.
IJCAI2023@reddit
Thank you.
Plums_Raider@reddit
1 trillion and it sucks lol ok
robertotomas@reddit
I don't understand how mini scores that high. I feel like it has become much worse since they made it produce longer and longer answers. It seems to always repeat itself 2-3 times, and it clearly lost some of its resolution power in the process.
CeFurkan@reddit
Yep. Mini sucks so bad in my usage as well
CeFurkan@reddit
China is leading in many, many AI fields. Look at video generation and image upscaling, and very possibly text-to-image soon as well.
And they also open-source so many amazing models.
EfficiencyOk2936@reddit
So, we would need a full server just to run it on a 1-bit quant.
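Back-of-the-envelope weight memory for a 1T-parameter model at a few quantization levels (weights only; KV cache and runtime overhead ignored):

```python
# Rough weight-memory estimate for a 1T-parameter model at different
# quantization levels. Weights only; KV cache, activations and runtime
# overhead are ignored, so real requirements are higher.
params = 1_000_000_000_000  # 1 trillion parameters

for bits in (1, 3, 4, 8, 16):
    gib = params * bits / 8 / 1024**3
    print(f"{bits:2d}-bit: ~{gib:,.0f} GiB of weights")
# Even at 1 bit per weight that's ~116 GiB, i.e. multi-GPU / server territory.
```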
DinoAmino@reddit
And a 72B beats it at math lol
x2network@reddit
1000B on what? 👍🤣
Ekkobelli@reddit
not math.
Tanvir1337@reddit
only 1 trillion
Enough-Meringue4745@reddit
No local no care
Downtown-Case-1755@reddit
This actually makes sense!
In big cloud deployments for thousands of users, you can stick one (or a few) experts on each GPU for "expert level parallelism" with very little overhead compared to running tiny models on each one. Why copy the same model across each server when you can make each one an expert with similar throughput?
This is not true of dense models, as the communication overhead between GPUs kinda kills the efficiency.
I dunno about training the darn thing, but they must have a frugal scheme for that too.
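Roughly, "expert-level parallelism" means each GPU owns a slice of the experts and routed tokens are shipped to whichever GPU holds their expert, instead of replicating the whole model on every GPU. A toy sketch of just the placement/dispatch idea (expert and GPU counts are made up, and this is not any particular serving framework):

```python
# Toy illustration of expert-level parallelism: each GPU hosts a slice of
# the experts, and routed tokens are grouped by the GPU that owns their
# expert. Conceptual sketch only, not a real serving framework.
NUM_EXPERTS = 64
NUM_GPUS = 8
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

def gpu_for_expert(expert_id: int) -> int:
    # Static placement: experts 0-7 on GPU 0, 8-15 on GPU 1, ...
    return expert_id // EXPERTS_PER_GPU

def dispatch(token_expert_ids: list[int]) -> dict[int, list[int]]:
    """Group token indices by the GPU that owns their routed expert."""
    per_gpu: dict[int, list[int]] = {g: [] for g in range(NUM_GPUS)}
    for token_idx, expert_id in enumerate(token_expert_ids):
        per_gpu[gpu_for_expert(expert_id)].append(token_idx)
    return per_gpu

# Example: 6 tokens whose gating network picked various experts.
print(dispatch([3, 12, 12, 45, 63, 7]))  # tokens land on GPUs 0, 1, 1, 5, 7, 0
```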
greying_panda@reddit
Considering that MoE models (at least last time I checked the implementation) have a different set of experts in each transformer layer, this would still require very substantial GPU to GPU communication.
I don't see why it would be more overhead than a standard tensor parallel setup so it still enables much larger models, but a data parallel setup with smaller models would still be preferable in basically every case.
Downtown-Case-1755@reddit
Is it? I thought the gate and a few layers were "dense" (and these would presumably be pipelined?) while the actual MoE layers are completely independent.
greying_panda@reddit
I used the term "transformer layer" too loosely, I was referring to the full "decoder block" including the MoE transformation.
Mixtral implementation
My knowledge comes from the above, from when it was released, so there may be more modern implementations. In this implementation, each block has its own set of "experts". Inside the block, the token's feature vectors undergo the standard self-attention operation, then the output vector is run through the MoE transformation (determining expert weights and performing the weighted projection).
So hypothetically, all expert indices could be required throughout a single inference step for one input. Furthermore, in the prefill step, every expert in every block could be required, since this is done per token.
I'm sure there are efficient implementations here, but if the total model is too large to fit on one GPU, I can't think of a distribution scheme that doesn't require some inter-GPU communication.
Apologies if this is misunderstanding your point, or explaining something you already understand.
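To make that structure concrete, here is a minimal PyTorch-style sketch of the decoder block described above: self-attention, then per-token top-k routing with a softmax-weighted sum of expert MLP outputs. Dimensions, expert count, and top-k are illustrative rather than Mixtral's or Step-2's actual config, and the causal attention mask is omitted for brevity.

```python
# Minimal sketch of a Mixtral-style MoE decoder block: self-attention,
# then each token is routed to its top-k experts and the expert MLP
# outputs are combined with the gate's softmax weights.
# Sizes are illustrative, not any real model's config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, dim=64, n_heads=4, n_experts=8, top_k=2, hidden=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):  # x: (batch, seq, dim); causal mask omitted for brevity
        # Standard self-attention sublayer.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out

        # MoE sublayer: pick top-k experts per token, weight their outputs.
        h = self.norm2(x)                     # (B, S, D)
        logits = self.gate(h)                 # (B, S, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts

        out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(h[mask])
        return x + out

block = MoEBlock()
tokens = torch.randn(2, 5, 64)   # (batch=2, seq=5, dim=64)
print(block(tokens).shape)       # torch.Size([2, 5, 64])
```

Even in this toy version you can see the point made above: which experts get touched depends on the routed token indices, so during prefill (many tokens at once) essentially every expert in every block can be needed.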
TitoxDboss@reddit
lmao what a ridiculous model
x2network@reddit
Lol 1 trillion 🤣🤣🤣
Financial-Aspect-826@reddit
Its dumb as fuck
masterlafontaine@reddit
It seems to be just beginning the training
ArmoredBattalion@reddit
I wonder if "step 2" means the second step in training.
SadWolverine24@reddit
Hopefully, there is a "step 3" then.
celsowm@reddit
Any place to test it?
Any-Conference1005@reddit
16K...
balianone@reddit
repost?