Chinese AI startup StepFun up near the top on livebench with their new 1 trillion param MOE model
Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 83 comments
SomeOddCodeGuy@reddit
Good lord that instruction following score. That's going to be insane for RAG, summarization, etc.
Maybe if I string some Mac Studios together, and send it a prompt today, I'll get my response next week.
I'm going to be jealous of whoever can use that model.
DinoAmino@reddit
Well, in the meantime, Llama 3.1 70B beats it (87.5) - and yes, using an INT8 quant with RAG is really good.
Pedalnomica@reddit
Yeah, that's really pulling up the average. If you click through to the subcategories, it seems like "story_generation" is where they are really pulling ahead. No doubt that's exciting for many folks around here, but I suspect it means the model will feel a little underwhelming relative to the overall score for more "practical" use cases.
Impressive nonetheless!
Expensive-Paint-9490@reddit
I could use it at a 3-bit quant but at, well, one token per three seconds.
I_am_unique6435@reddit
So size doesn't always matter.
rishiarora@reddit
How much overfitting? YES!!
yiyecek@reddit
Why stop at 1T when you can do 10T?
KurisuAteMyPudding@reddit
One trillion params -> gets beat by o1 mini
Account1893242379482@reddit
What are the estimates for o1 mini's size?
KurisuAteMyPudding@reddit
If I could take a rough guess that's not based on any facts at all, I'd say somewhere between 8-16 billion params.
jastorgally@reddit
o1 mini is $12 per million output tokens; I doubt it's 8-16 billion.
OfficialHashPanda@reddit
Could very well be OpenAI just charging a premium for its whole new class of models 😊😊
Whotea@reddit
It also produces tons of CoT tokens, so that probably raises the price.
learn-deeply@reddit
No, the CoT tokens are included as part of the output tokens, even if they're not visible.
Whotea@reddit
The CoT tokens themselves, or the summary you see on ChatGPT?
Affectionate-Cap-600@reddit
Yep, exactly those tokens...
I made some calls to o1 mini that required just a simple answer of a small paragraph, and I was billed for something like 10k tokens... It's a bit of an overthinker.
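For a rough sense of the math being described: a minimal sketch of the bill when hidden reasoning tokens are counted as output, using the $12/1M output price and ~10k-token total mentioned in this thread (the visible/hidden split is an illustrative assumption):

```python
# Sketch: why hidden chain-of-thought tokens inflate the bill.
# The $12/1M output price and ~10k-token total come from the thread;
# the visible/hidden split below is an illustrative assumption.
PRICE_PER_MILLION_OUTPUT = 12.00   # USD per 1M output tokens (o1 mini, per the thread)

visible_answer_tokens = 300        # a "small paragraph" answer (assumed)
hidden_reasoning_tokens = 9_700    # unseen CoT tokens, still billed as output (assumed)
billed_tokens = visible_answer_tokens + hidden_reasoning_tokens

cost = billed_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT
print(f"{billed_tokens} billed output tokens -> ${cost:.2f}")  # ~$0.12 for one short answer
```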
TheDreamWoken@reddit
No.
Just no. https://llm.extractum.io/static/llm-leaderboards/
Healthy-Nebula-3603@reddit
Looking at how fast o1 mini is, I'm confident it's less than 50B parameters. It literally spits out 5k tokens within seconds.
Account1893242379482@reddit
Yeah, but there are other providers who are faster even with 70B Llama models, and those aren't even MoE.
Healthy-Nebula-3603@reddit
Is OpenAI using those specialized cards?
adityaguru149@reddit
I read somewhere that models >70B have substantially higher self-consistency accuracy than smaller ones like 32B or lower. So, I would guess 70B with test-time compute.
o1 can be 120B or higher.
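For context, "self-consistency" here refers to sampling several answers and majority-voting the final result. A toy sketch of that idea, where `sample_answer` is a hypothetical stand-in for a stochastic model call rather than any real API:

```python
# Toy sketch of self-consistency: sample several independent answers and
# keep the most common one. sample_answer is a hypothetical stand-in for
# a real (temperature > 0) model call, not any specific API.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Pretend model: usually right, sometimes wrong.
    return random.choice(["42", "42", "42", "41"])

def self_consistency(question: str, n_samples: int = 8) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # majority vote over 8 samples
```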
agent00F@reddit
"a million apples -> beaten by one orange"
MoffKalast@reddit
Vitamin C bench be like
Affectionate-Cap-600@reddit
Ok that made me laugh too much
adityaguru149@reddit
I'm more interested in the fact that a <2-year-old company beats Google in probably its 1st/2nd release. Can it beat OpenAI/Anthropic in probably the next release? Why not?
Any major release from a non-US company is also a big deal for AI democratisation, since no single government would have all the control. Think of how this would spoil ClosedAI's plans of pushing AI regulation as a moat against new entrants so they can command astronomical valuations.
Any_Pressure4251@reddit
I'm more interested that you can come to such a conclusion before waiting till we do some tests.
martinerous@reddit
But can it run ~Crysis~ ARC-AGI?
Pro-editor-1105@reddit
I was excited until I read "one trillion parameters."
No-Refrigerator-1672@reddit
With 1T parameters, I won't be surprised if they just overfitted on all the test data and it produces garbage for literally anything but tests.
UserXtheUnknown@reddit
OpenAI models are believed to be over 1 trillion parameters by now, so there is no reason to think this one is more overfitted than an OpenAI one.
Icy_Accident_3847@reddit
I guess you don't know what livebench is.
PlantFlat4056@reddit
You mean the place filled with wumaodangs and bots
wavinghandco@reddit
Why make trillions, when you can make... Billions?
ab2377@reddit
less is the new more! .. less is the new more!
Admirable-Star7088@reddit
I was still excited, until I double-checked how much VRAM my consumer GPU had again.
robertotomas@reddit
NVIDIA published a paper, at a time when they could only reasonably be assumed to be talking about ChatGPT 4o, about training a 1.2 trillion parameter model for OpenAI… so 1 trillion or more is really not so bad.
Apprehensive_Rub2@reddit
I'm wondering if it's better in Chinese.
HairyAd9854@reddit
Does anybody know what they are using to train a 1T model? I'm not sure any American company could train such a large model without NVIDIA hardware. I guess a large share of the parameters are actually 8-bit.
Khaosyne@reddit
I tried it and it seems it was mostly trained on a Chinese dataset, but it's somewhat good in English.
SadWolverine24@reddit
Why is the performance so shitty for 1T parameters?
Few_Professional6859@reddit
I have read quite a few news articles about scaling laws being limited by bottlenecks.
Whotea@reddit
Not test time compute scaling
robertotomas@reddit
I don't know that it was accurate, but the first such leak was about a disappointing Orion (o1, non-preview). I know Altman came back and commented on it later, in a way that implies the interpretation people had of the leak was incorrect, but still.
Whotea@reddit
The benchmarks they provided and even o1 preview seem pretty good
robertotomas@reddit
I’m not saying I am disappointed. Someone who worked in the project said they weren’t able to release on time because the results were disappointing
Whotea@reddit
Beating PhDs on GPQA and placing in the 93rd percentile on Codeforces is anything but disappointing. Are you seriously relying on rumors instead of actual evidence lol
robertotomas@reddit
I guess they expected those results to generalize more easily than they actually do is all. Rumors from outside of the company I don’t care about, even from Microsoft. “Rumors” from the team lead of the project, I take more seriously
Whotea@reddit
What did the team lead say? Any real sources?
jd_3d@reddit (OP)
If you take out the test-time compute models (o1 and o1 mini), it's literally above everything except Sonnet 3.5.
Perfect_Twist713@reddit
Something else to note is that there are basically no proper benchmarks that test the breadth of knowledge (and the possible/unknown emergent properties) that the massive models might have. Comparing small models to very large ones on the existing benchmarks is almost like measuring intelligence by seeing if a person can open a pickle jar and saying "My 5 year old is as smart as Einstein because Einstein got it open too".
notsoluckycharm@reddit
Signal to noise ratio, really. Not all content is worth being in the set, but it’s there. You took your F150 to the office, your boss their Ferrari. You both did the same thing, but one’s sleeker and probably cost a bit more to make.
clex55@reddit
sparse architecture?
RudzinskiMaciej@reddit
MoE
Aggressive-Physics17@reddit
Heavily, astronomically undertrained.
SadWolverine24@reddit
I can send them my GTX 980 since they clearly need more compute.
Whotea@reddit
Especially since there's a GPU embargo on them.
NEEDMOREVRAM@reddit
I fucking love the Chinese!!! And yes, I'm 100% certain that got my name put on yet another watchlist. Get fucked, American NKVD.
I have very high hopes that the Chinese will eventually release a model that will wipe its ass with both ChatGPT and Claude.
C'mon you Chinese guys, surely you see the piss-poor state of America. Do us a solid and give us the power to use an LLM tool that's more powerful than the censored WrongThink correctors that ChatGPT and Claude are.
This is an EASY win for China and an even bigger win for LLM enthusiasts.
IJCAI2023@reddit
Which leaderboard is this? It doesn't look familiar.
ihexx@reddit
livebench.ai
It's one of the best leaderboards because they update the questions every few months, so LLMs can't just memorize leaks off the internet. That's a problem with others like MMLU: because the questions are public, some people just train on the benchmark to inflate their scores.
IJCAI2023@reddit
Thank you.
Plums_Raider@reddit
1 trillion and it sucks lol ok
robertotomas@reddit
I don't understand how mini scores that high. I feel like it has become much worse since they made it produce longer and longer answers. It seems to always repeat itself 2-3 times, and it clearly lost some of its resolution power in the process.
CeFurkan@reddit
Yep. Mini sucks so bad in my usage as well
CeFurkan@reddit
China is leading in many, many AI fields. Look at video generation and image upscaling, and very possibly text-to-image soon as well.
And they also open-source so many amazing models.
EfficiencyOk2936@reddit
So, we would need a full server just to run it on a 1-bit quant.
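Back-of-the-envelope weight memory for a 1T-parameter model at a few quantization levels (weights only; KV cache and runtime overhead ignored):

```python
# Rough weight-memory estimate for a 1T-parameter model at different
# quantization levels. Weights only; KV cache, activations and runtime
# overhead are ignored, so real requirements are higher.
params = 1_000_000_000_000  # 1 trillion parameters

for bits in (1, 3, 4, 8, 16):
    gib = params * bits / 8 / 1024**3
    print(f"{bits:2d}-bit: ~{gib:,.0f} GiB of weights")
# Even at 1 bit per weight that's ~116 GiB, i.e. multi-GPU / server territory.
```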
DinoAmino@reddit
And a 72B beats it at math lol
x2network@reddit
1000B on what? 👍🤣
Ekkobelli@reddit
not math.
Tanvir1337@reddit
only 1 trillion
Enough-Meringue4745@reddit
No local no care
Downtown-Case-1755@reddit
This actually makes sense!
In big cloud deployments for thousands of users, you can stick one (or a few) experts on each GPU for "expert level parallelism" with very little overhead compared to running tiny models on each one. Why copy the same model across each server when you can make each one an expert with similar throughput?
This is not true of dense models, as the communication overhead between GPUs kinda kills the efficiency.
I dunno about training the darn thing, but they must have a frugal scheme for that too.
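Roughly, "expert-level parallelism" means each GPU owns a slice of the experts and routed tokens are shipped to whichever GPU holds their expert, instead of replicating the whole model on every GPU. A toy sketch of just the placement/dispatch idea (expert and GPU counts are made up, and this is not any particular serving framework):

```python
# Toy illustration of expert-level parallelism: each GPU hosts a slice of
# the experts, and routed tokens are grouped by the GPU that owns their
# expert. Conceptual sketch only, not a real serving framework.
NUM_EXPERTS = 64
NUM_GPUS = 8
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

def gpu_for_expert(expert_id: int) -> int:
    # Static placement: experts 0-7 on GPU 0, 8-15 on GPU 1, ...
    return expert_id // EXPERTS_PER_GPU

def dispatch(token_expert_ids: list[int]) -> dict[int, list[int]]:
    """Group token indices by the GPU that owns their routed expert."""
    per_gpu: dict[int, list[int]] = {g: [] for g in range(NUM_GPUS)}
    for token_idx, expert_id in enumerate(token_expert_ids):
        per_gpu[gpu_for_expert(expert_id)].append(token_idx)
    return per_gpu

# Example: 6 tokens whose gating network picked various experts.
print(dispatch([3, 12, 12, 45, 63, 7]))  # tokens land on GPUs 0, 1, 1, 5, 7, 0
```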
greying_panda@reddit
Considering that MoE models (at least last time I checked the implementation) have a different set of experts in each transformer layer, this would still require very substantial GPU to GPU communication.
I don't see why it would be more overhead than a standard tensor parallel setup so it still enables much larger models, but a data parallel setup with smaller models would still be preferable in basically every case.
Downtown-Case-1755@reddit
Is it? I thought the gate and a few layers were "dense" (and these would presumably be pipelined?) while the actual MoE layers are completely independent.
greying_panda@reddit
I used the term "transformer layer" too loosely, I was referring to the full "decoder block" including the MoE transformation.
Mixtral implementation
My knowledge comes from the above, from when it was released, so there may be more modern implementations. In this implementation, each block has its own set of "experts". Inside the block, the token's feature vectors undergo the standard self-attention operation, then the output vector is run through the MoE transformation (determining expert weights and performing the weighted projection).
So hypothetically, all expert indices could be required throughout a single inference step for one input. Furthermore, in the prefill step, every expert in every block could be required, since this is done per token.
I'm sure there are efficient implementations here, but if the total model is too large to fit on one GPU, I can't think of a distribution scheme that doesn't require some inter-GPU communication.
Apologies if this is misunderstanding your point, or explaining something you already understand.
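To make that structure concrete, here is a minimal PyTorch-style sketch of the decoder block described above: self-attention, then per-token top-k routing with a softmax-weighted sum of expert MLP outputs. Dimensions, expert count, and top-k are illustrative rather than Mixtral's or Step-2's actual config, and the causal attention mask is omitted for brevity.

```python
# Minimal sketch of a Mixtral-style MoE decoder block: self-attention,
# then each token is routed to its top-k experts and the expert MLP
# outputs are combined with the gate's softmax weights.
# Sizes are illustrative, not any real model's config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, dim=64, n_heads=4, n_experts=8, top_k=2, hidden=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):  # x: (batch, seq, dim); causal mask omitted for brevity
        # Standard self-attention sublayer.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out

        # MoE sublayer: pick top-k experts per token, weight their outputs.
        h = self.norm2(x)                     # (B, S, D)
        logits = self.gate(h)                 # (B, S, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts

        out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(h[mask])
        return x + out

block = MoEBlock()
tokens = torch.randn(2, 5, 64)   # (batch=2, seq=5, dim=64)
print(block(tokens).shape)       # torch.Size([2, 5, 64])
```

Even in this toy version you can see the point made above: which experts get touched depends on the routed token indices, so during prefill (many tokens at once) essentially every expert in every block can be needed.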
TitoxDboss@reddit
lmao what a ridiculous model
x2network@reddit
Lol 1 trillion 🤣🤣🤣
Financial-Aspect-826@reddit
Its dumb as fuck
masterlafontaine@reddit
It seems to be just beginning the training
ArmoredBattalion@reddit
I wonder if "step 2" means the second step in training.
SadWolverine24@reddit
Hopefully, there is a "step 3" then.
celsowm@reddit
Any place to test it?
Any-Conference1005@reddit
16K...
balianone@reddit
repost?