Opus = 0.5T × 10 = ~5T parameters ?
Posted by Wonderful-Ad-5952@reddit | LocalLLaMA | View on Reddit | 260 comments
ErokOverflow@reddit
Fake answer by the way, not a real screenshot of a real conversation, everything is made up. Just a tragic post of how the world is lost.
EffectiveCeilingFan@reddit
People still listen to this guy? He just lies. Constantly. About everything.
Defiant-Lettuce-9156@reddit
I don’t even trust him to tell us the size of his own models accurately, let alone for him to know the size of the competition’s models
aprx4@reddit
Some of his employees would tell him what they know about competitors' products. It's a pretty small circle of AI researchers in SF. Some info is always spilled at the hangouts.
QuackerEnte@reddit
not to mention that they probably have moles in each others labs anyway lol
baseketball@reddit
That could be true but he could still be lying and making up numbers to make his models look better.
YairHairNow@reddit
I can picture a scene out of Silicon Valley or a Hollywood tech movie where people are freaking out over 5 trillion parameters like the iPhone just got announced.
Bakoro@reddit
That absolutely would have been a scene from 2~3 years ago.
These days, people are expecting super huge models.
Very soon, industry will be freaking out over a 30B model that performs like the current trillion parameter models, and that will cause the market correction on a bunch of AI hyperscalers.
Eden1506@reddit
The Shannon limit describes the theoretical maximum to which information can be compressed without loss, and current LLMs are already approaching it.
As an example, if a sentence starts with "The 44th President of the United States was...", a model with zero history knowledge finds the next word high-entropy (hard to predict), while a model with factual knowledge finds it at near-zero entropy.
As such there is still headroom for logic, but when it comes to world knowledge there is a hard limit that makes small models on their own (without web search or an additional databank) unable to ever compete with much larger models.
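To make the entropy point concrete, here's a minimal sketch (illustrative numbers only, not from any real model) of how the two cases differ:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions for "The 44th President ... was"
clueless_model = [1 / 1000] * 1000            # 1000 equally likely names
factual_model = [0.99] + [0.01 / 999] * 999   # almost all mass on "Obama"

print(entropy(clueless_model))  # ~9.97 bits: hard to predict
print(entropy(factual_model))   # ~0.18 bits: near-zero entropy
```

The gap between those two numbers is roughly the "world knowledge" the larger model is carrying in its weights.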
Bakoro@reddit
The Shannon Limit defines the maximum theoretical rate at which error-free data can be transmitted over a communication channel with a specific bandwidth and signal-to-noise ratio (SNR).
That is not directly applicable here, except for how much information can be transmitted in a single embedding.
The limitations are closer to the entropy of information and Kolmogorov complexity.
The models have to learn some specific facts, but in general, facts are a specific case of a combination of general underlying patterns and principles. Basically, specific details are noise that we assign meaning to.
Finding the generative functions of information means being able to compose and interpolate that information.
This is the whole thing about "generalization".
If you learn the rules of logic, the rules of a million billion things become dramatically easier to understand, because you don't have to memorize everything, you memorize key points, and apply logic.
If you know the sine function, then you can generate all kinds of sine waves and find points to an arbitrary level of precision; you don't have to memorize infinite points.
A single 16-bit vector with 4096 dimensions can represent (2^16)^4096 states. That is approximately 10^19728 distinct values. You could give an address to every atom in the observable universe with that number. Then you project that vector up 4x and store values in that space. The FFN typically works in a (2^16)^16384 space, which means that embeddings map into a very high dimensional volume.
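For anyone who wants to check the napkin math, a quick sketch (assuming 16-bit values, 4096 embedding dims, and a 4x up-projection):

```python
import math

bits_per_value, dims = 16, 4096
# log10 of (2^16)^4096 = 2^(16*4096)
emb_log10 = bits_per_value * dims * math.log10(2)
print(f"embedding space: ~10^{emb_log10:.0f} states")   # ~10^19728

ffn_dims = 4 * dims
ffn_log10 = bits_per_value * ffn_dims * math.log10(2)
print(f"FFN space:       ~10^{ffn_log10:.0f} states")   # ~10^78914

# Atoms in the observable universe are usually quoted around 10^80,
# so either space dwarfs that count by an absurd margin.
```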
The capacity of AI models is wildly underutilized, and there's insufficient pressure to make the models use all that space.
The models can often just memorize everything. Under-parameterization forces the model to generalize in order to make efficient use of the space, while over-parameterization lets the model get away with memorizing, because there is enough capacity to drive the loss down without the neurons having to encode anything general.
The underutilization/undertrained observation is what led to the "super massive data" shift, where training went from the low hundreds of billions of tokens to 10+ trillion tokens.
The models also have to learn a whole lot indirectly, via frequency and adjacency, which is a big reason why their latent spaces can be a mess.
The cross-entropy loss function is useful for training early generation models, but ultimately it's insufficient for any kind of data efficiency on complex data where there is no single correct answer.
We have Kullback–Leibler divergence, but don't usually know what the actual distribution should be.
The models eventually learn a distribution, and it's probably a decent one. So you can use KL to distill knowledge from one model to another.
If you've got multiple expert models, you can distill the experts into a single student, which then potentially has a better structured latent space.
This is at least partially why we can have 2~7B models today that are better than the 100B models from a few years ago.
Then you have the quantization issue: if we can consistently quantize a 16-bit model to 4-bit, that means the model was significantly over-precise.
The model could have held ~4x the information.
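A rough sketch of the idea (simple symmetric round-to-nearest quantization, not any production scheme):

```python
import numpy as np

def quantize_4bit(w):
    """Map fp16 weights onto 15 symmetric integer levels (-7..7)."""
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float16) * scale

w = np.random.randn(8).astype(np.float16)
q, scale = quantize_4bit(w)
print(w)
print(dequantize(q, scale))  # close to w, using only ~16 distinct levels
```

If a round-trip through 16 levels barely moves benchmark scores, the original 16 bits per weight were carrying far less information than they could.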
So, yeah, we have at least one huge parameter efficiency breakthrough that's going to happen. I'm thinking at least an order of magnitude in terms of weights, and another order of magnitude, in just having a model that is a domain expert, and which doesn't have every digital thing ever in its parameters, but instead is properly trained on the distributions and generative functions of the data.
Eden1506@reddit
fair point
ebra95@reddit
A 30B parameter model will never come close to a 1T parameter model. Chillax, Gemma 4 was just a marketing stunt, it has little to no value in it (it's lower than Qwen3.5, and Qwen3.6 is already better)
Bakoro@reddit
I wholly disagree. Current systems are very storage and compute inefficient, because it is dramatically easier to train a grossly over-parameterized model, and the currently dominant architecture works well for processing batches for millions of people.
The entire industry is tuned for a very particular way of doing things, and they are making fairly reasonable engineering trade-offs for the sake of scale.
There are already several architectures which are superior to "series of transformer blocks" in basically every way, except for "scales to data center size".
Things with recurrence, iterative refinement, or dynamic per-token computation all beat the typical architecture, and are also infeasible at scale.
For local models and robots, where you only have one user, the entire operating environment and the engineering trade-offs you can make are radically different.
The problem is that it's a very difficult sell to go to a VC and say "I've got an architecture that doesn't scale well, and I want to hand it out to everyone for free: please give me $50 million."
So, you need to productize it in a different way, which essentially means physical goods, which ends up being its own scaling problem, and tends to attract different money people.
You just watch, though. Someone is going to come out with the killer local model that's good enough to make people think "do I actually need that subscription?" And businesses will start thinking that the cost of tokens justifies looking into local.
ResidentPositive4122@reddit
Reddit was adamant gpt4 wasn't an MoE at ~1.8B 220A even after hotz spilled the beans. It's like they haven't worked anywhere in real life and have no idea that people move around and casually talk about past projects. The basic info about sizes, arches, main stuff isn't a state secret ffs. People talk around the watercooler.
MMAgeezer@reddit
Hotz's claim was that GPT-4 is a 1.8T A280B model. You're a bit off.
Serprotease@reddit
If true, it’s wild to see that recent models in the 400-700b range are better than it. But it would confirm that openAI/Anthropic moat is just compute.
ResidentPositive4122@reddit
Thanks, typo :)
SodaBurns@reddit
He probably just pinged Dario like bro tell me your model sizes or I'm a say you are gay like Sam.
quantgorithm@reddit
So you are just making things up? What is the lie here and source it.
Defiant-Lettuce-9156@reddit
Lmao. I said I don’t trust him, I.e. I’m not going to pay any attention to his tweet even though I would like to know the model sizes
Relax bro
quantgorithm@reddit
So nothing then.
GOT IT.
Defiant-Lettuce-9156@reddit
Exactly, because I never said he was lying. I said I don’t trust him. Your reading comprehension is terrible.
Stick to rocket league and unreal engine, the grown ups are talking
quantgorithm@reddit
Thanks for adding nothing to the actual story but neat, you have opinions, again, based on nothing. STILL GOT IT!
Cool-Chemical-5629@reddit
"I don't even trust him to tell us the size of his own..." You got my pulse pumping in suspense for a moment there, well played... 🤣
CondiMesmer@reddit
He does, but Grok has at least been a decent and cost effective model. It's not really leading but it's barely keeping up.
chitown160@reddit
There is no use case or price point where Grok is more decent or more effective than the others.
Virtamancer@reddit
Insane take.
I pay for every major service (except grok, because it’s not great for coding which is my primary use case). Grok is easily the best for queries that require an internet search—even the free grok 4.20 fast. Maybe not for coding documentation/planning searches, but for general info that must be gathered online and especially if it’s from trending current events or online discourse.
If you pay and use the multi-agents mode, nothing even comes close for search use cases.
chitown160@reddit
It is really not insane. There are failure modes in Grok that are a result of how it has been biased. It will go into a defensive mode that prevents it from effective operation. Even in multi-agent operations it will fail to verify whether a tweet is real before going into a dissertation about the tweet to satisfy the bias. There are topics where it will inject opinion or willfully misconstrue information outside of factual reality. This happens via the web interface and the API. This failure mode diminishes its ability to make use of feedback signals as ICL in iterative operations. This really hamstrings using Grok in agentic operations, in addition to the questionable data returned from searches.
Virtamancer@reddit
Yeah ok bro
chitown160@reddit
Do you want to see some evidence?
Western_Objective209@reddit
claude code with a browser automation tool seems to be the best thing to me, it legit writes JS scraping scripts to extract info and has really good vision for images. I haven't tried grok but like, I have trouble picturing a smaller less intelligent model doing as well?
Virtamancer@reddit
I mean do what works for you, but I'm not loyal to any brand so this has just been my experience. It's also commonly recognized, it's not a weird opinion that just some random guy on reddit has. You can probably search for benchmarks and online commentary about it.
Some supporting facts are its hallucination rates being unparalleled, and its instruction following being the highest.
Western_Objective209@reddit
well Opus 4.6 being last place for enterprise model in instruction following while being the most capable multi-turn agent is interesting. kind of seems like the benchmark isn't very good? Similar with hallucinations; can see haiku is the closest.
Maybe it is really good at search, idk, I just haven't had an issue with search at all with either chatgpt or claude so I haven't felt the need to try something else. does it have a better X index or something?
Virtamancer@reddit
I mean nobody’s trying to make you use it lol, you’re just being skeptical and I’m responding.
You admittedly haven’t tried or compared them, so the conversation (there wasn’t really one?) ends there.
Anyways I don’t think some random Reddit commenter’s impossibly obvious surface level observation is particularly insightful. Are you suggesting nobody has noticed that smart, smaller models hallucinate less or follow instructions well?
Like…what’s your point?
It was #1 on the arena.ai search leaderboard until a couple weeks ago. It’s constantly either #1 or in the top few. I don’t know what to tell you.
Western_Objective209@reddit
https://arena.ai/leaderboard/search opus #1, grok below the other real models. I'm just judging what you're saying and it's not particularly adding up
Virtamancer@reddit
You don’t have to judge because that’s literally what I said. It changes from day to day and grok was at the top a couple weeks ago.
I don’t know what your point is in this entire exchange other than to hear yourself talk 🤷♂️
Western_Objective209@reddit
sorry I'm not an LLM that just takes everything you say at face value and tries to engage with it from my own experiences
Virtamancer@reddit
It seems like you're just trying to argue about nothing for no reason, not getting to any point except a long way of saying that you're skeptical despite never using it, which neither I nor anyone else cares about.
Western_Objective209@reddit
you're very defensive and your only real evidence for grok being the state of the art for search is a benchmark where it's #6. why would I not be skeptical?
Virtamancer@reddit
Yeah ok weirdo. You can leave now.
Western_Objective209@reddit
naw I'm good. come on, make up more stuff about how grok is actually the king of search with zero evidence, low key dick-riding musk while pretending you're just being objective
HeavenBeach777@reddit
i think it has to do with how they design the system to handle twitter related searches, and that works well to figure shit out from stuff on the internet too. Not surprised that Grok does that well. Even from the replies i see on Twitter where ppl @grok for some super weird or niche stuff, it does a good job figuring things out then giving a decent reply.
n8mo@reddit
Forreal.
Remember when he was interviewed after buying twitter and said they had to “rewrite the whole stack”? And, when pressed on the matter, could not describe what “the stack” referred to?
I already wasn’t taking him seriously by that point, but it was the last nail in the coffin.
He’s a rich guy LARPing as an engineer.
Ikinoki@reddit
He's from the times when "rewriting the whole stack" meant changing five lines in a CGI file.
Times have changed drastically. By 2006 when I wrote my first fully fledged CMF system I had a virtual OS with virtual FS in it to make it more secure and easier to work with. By 2010 frontend required a much more advanced stack than just a few js selectors or even basic jquery so for next CMF I had a complete rewrite of backend in Python (from php) and full UI support for mongodb relations and virtual models loaded from database. Nowadays you need whole pipelines and systems of networks to make SPA gui and versatile easy to maintain backend so no wonder he couldn't rewrite it straight away. Heck even authentication and authorization needs proper separate subsystem to handle. Previously it was one or 2 functions which checked password compared to hash.
MrPecunius@reddit
Jesus, wash your mouth out with soap.
Episode 1 - Mongo DB Is Web Scale
Ikinoki@reddit
That was in 2010. It was all the rage, I did not look at it from the point of scaling but from the point of ease of transition of Model -> Object with all the references (hence mongoengine use, but it had issues with references being constantly lost so I had to monkey patch it too)
Looking back the JS inside python was a hell let loose. I hated it. I hate JS and hated it inside of python even more
MrPecunius@reddit
You should have used /dev/null
Nope.
Ikinoki@reddit
You saying you could make a spa gui without a single include in 1 js file or with 1 js include which will be maintainable without templates with inline php or with jinja2? :D
Or that auth doesn't need oauth and you are still in 2015 using logins and passwords? And thus have no centralized auth management and probably no adequate maintainable and portable authorization graph for all apps and services?
Maybe for a small project sure, but not for a tech startup/company for sure.
MrPecunius@reddit
I made my first money writing code in 1980, kid. If you learn to spell, you might move up the food chain.
Enjoy your corporate fads.
Ikinoki@reddit
I didn't mean coding, I mean on reddit. You're very rude and jerky.
MrPecunius@reddit
At least I never used Mongo. 🤡
Ikinoki@reddit
Good boy
pydry@reddit
Imagine being in a meeting with this guy and needing to correct him, knowing that it could get you fired.
MikeFromTheVineyard@reddit
People actively avoid meetings that involve him.
I have a friend at space-x and they usually have a post-Elon meeting to correct all the plans he derailed and sometimes a pre-meeting plan to strategize how to ensure he doesn’t get involved in things they can’t change.
MrPecunius@reddit
I'll take "people you don't actually know" for $200, Alex.
MrPecunius@reddit
Rich guy with a physics degree (double major: also econ degree) from an Ivy League school who got into a Stanford graduate program for materials science.
Whatchu got, Mr. Fresno State dropout?
wolframko@reddit
But it was rewritten to Rust from Java, wasn't it? They've really rewritten a lot of repos in their GitHub. So that may be true.
das_war_ein_Befehl@reddit
If you leave engineers alone for too long they’ll inevitably start a migration to rust
Late-Assignment8482@reddit
It's like how everything becomes crab if you leave natural selection unsupervised.
overand@reddit
Woah. And the Rust mascot is a crab- https://rustacean.net/
Silver-Champion-4846@reddit
Is it so that their engineering skills don't get rusty? To avoid the Rust, you go to the Rust? Lol
ThreeKiloZero@reddit
But did it have to be, or were engineers just trying to keep their jobs and be relevant? The problem is Elon wouldn't know what was or wasn't true unless someone else told him. But he likes to play like he's a genius.
wolframko@reddit
I believe this can be true. It seems that many high-level staff members are deceiving him with confidence and false claims, and then he tries to demonstrate that confidence and those claims in public speeches. I’m not sure why this works for him and why it’s the case in each of Elon Musk’s companies.
ThreeKiloZero@reddit
The smart people figure out how to exploit it. The meek suffer as long as they can under him until they can get away or burn out.
Many of the top tech CEOs are the same and wouldn't be able to build something on their own. They got lucky somewhere down the line and have just been exploiting that using their money to cover lies and play games. Literally all of them do it. That and they collude bigtime to stay in power. So much of what keeps them in their positions happens well beyond the actual companies.
n8mo@reddit
Yeah, that.
Listening to the interview, he sounded like a non-technical manager who just learned a new word, and was overexcited about using it to feel smart.
The convo essentially went:
“I was looking into the code yesterday. We have to rewrite the whole stack.”
“Oh, wow. What problems did you find with the stack?”
“Uhhh… Just all of it really. The whole stack is bad and needs a rewrite.”
zipperlein@reddit
He was also LARPing as a pro gamer...
iongion@reddit
Wasn't there indeed a famous rewrite to Scala of some things? Or am I mixing things up? I do know Twitter went through a tech stack & scaling rewrite once they could afford it, just like Facebook had their PHP thing first
eetsu@reddit
Yes, they did a rewrite to Rust and Python for the recommendation algorithm and a couple other things I thought. People were talking about it pretty recently IIRC, mostly Java guys who couldn't believe Rust was replacing JVM code. Before it was Scala with a lot of other languages tossed into the codebase.
iongion@reddit
My project managers are informed when we do tech stack changes; it is usually massive and incomprehensible for management. They pay us, they trust us, we deliver, otherwise we wouldn't be there. I don't think that him not knowing what "tech stack" was involved is something to shame someone for; our PMs in the real world of small companies (with tens of employees) deal with multiple projects, not only one, so I don't find that relevant to calling him a LARPing engineer, that's bullshit! Though, I don't like at all what he has become, he used to be a dude, but he went to the ubermensch side, power took control over him, that's sad
Citadel_Employee@reddit
But I bet your PM doesn’t act like they’re on the ground floor getting their hands dirty. Elon not knowing what a tech stack is in isolation isn’t larping. It’s when you include everything else he’s said, then it becomes larping.
iongion@reddit
I admit, I don't know what else he said, I just commented on initial remark
Alex_1729@reddit
It's the thing that 'stacks'. duh! The thing had to be rewritten to stack better. What's not to get?
throwawayacc201711@reddit
Hey where is my L5 autonomous driving car. It was every year for years
thread-e-printing@reddit
People keep getting in the way
ei23fxg@reddit
noo, he is not a liar, he just says things that aren't true yet. some may become true aaaand some not... but maybe some day... but yadi dadi dogecoin. its called, wishfull speaking xD
dark_bits@reddit
Yes he does, but on the other hand him actually knowing the correct size of Claude models wouldn’t surprise me. They definitely have insider information on what’s going on over their neighbor’s fence.
TheLexoPlexx@reddit
I believe in self driving cars by the end of 2019 and we're definitely going to colonize mars by 2025, trust me bro.
_WaterBear@reddit
Per Musk we were supposed to have launched TWO crewed missions to Mars 2 years ago.
Budget-Juggernaut-68@reddit
His timelines are absolutely meaningless
_relativity@reddit
This is one of my favorite Wikipedia articles: https://en.wikipedia.org/wiki/List_of_predictions_for_autonomous_Tesla_vehicles_by_Elon_Musk
Upset_Page_494@reddit
To be fair, predicting AI has been an issue for experts as well. I think most predicted around 2017 that we would have self-driving by 2024; it just turns out it was a harder problem than people thought.
-p-e-w-@reddit
Wait till you find out that NASA was planning to launch manned missions to Mars by the 1980s. That’s right, 40 years ago.
In fact, they were making serious plans for unmanned interstellar missions by the early 2000s.
Spaceflight and ridiculous timelines, name a more iconic duo.
VampiroMedicado@reddit
NASA had way more funding back then, I dunno when the US govt turned off the tap.
Irythros@reddit
We should have had full self driving with zero interaction from the driver every year since 2017.
aprx4@reddit
The Falcon family of rockets also suffered severe delays and technical problems during development. Now it launches about 90% of global mass to orbit.
austhrowaway91919@reddit
Sure, but he lied constantly about Falcon. In this context, why would we trust him on vague model sizes of his and his competitors' AI?
_WaterBear@reddit
Yeah. I’m not implying anything about it is easy, but even taking the rockets out of the equation, there is so much more to develop and test before people can safely land and return that such statements in 2017 were just downright irresponsible.
Crim91@reddit
That's how people get to the C-suite. Every C-level executive is a liar, about anything and everything.
Lucky_Yam_1581@reddit
Yeah the name grok 4.20 itself is a give away!
NigaTroubles@reddit
How did you know ?
flatfisher@reddit
Is it really a lie if he has no clue what he is talking about?
quantgorithm@reddit
What is the lie here and can you source it?
DojkaDev@reddit
he has to help his friends at Polymarket, and he just tweets a lot.
Background-Ad-1352@reddit
Grok yourself then
hay-yo@reddit
Release the weights!!
RevolutionaryGold325@reddit
Release the files
GamerBoi1338@reddit
that too
FrameXX@reddit
Didn't he promise to Release Grok 3 weights by the start of 2026?
Shockbum@reddit
Elon Musk is a hero, especially here in Latin America, for defending freedom of expression. You can cry, downvote, and shout insults like angry monkeys, but this is the truth.
bene_42069@reddit
Twitter is down the hall to the right
Shockbum@reddit
the jungle is down the hall to the left.
Shockbum@reddit
The zoo monkey cage is down the hall to the left.
stealthybutthole@reddit
lol what
ambient_temp_xeno@reddit
Has 5T Opus been confirmed by Grimes?
OmarBessa@reddit
Well, it's elon being elon.
He should have very good access to info though.
We can kind of estimate from tps. My previous estimation for Opus 4.5 was 1.1T params with 30 to 40B active.
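For the curious, the usual napkin math behind tps-based estimates looks roughly like this (every number below is an assumption for illustration, not a known Anthropic figure):

```python
def active_params_estimate(tokens_per_sec, bandwidth_tb_s, bytes_per_weight=2.0):
    """Bandwidth-bound decoding: tokens/sec ~ bandwidth / bytes read per token."""
    bytes_per_token = (bandwidth_tb_s * 1e12) / tokens_per_sec
    return bytes_per_token / bytes_per_weight

# e.g. ~60 tok/s observed on hardware with ~4 TB/s effective bandwidth, fp16 weights
print(f"~{active_params_estimate(60, 4) / 1e9:.0f}B active")  # ~33B active
```

It only constrains the *active* parameter count, and batching, speculative decoding, and quantization all blur the picture, so the error bars are large.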
TBT_TBT@reddit
Nobody knows the size of Sonnet or Opus. There are some rumors saying Opus would be 2T, and some guesses of 3-5T. Then again, some say that it is a Mixture of Experts, which makes the distinction between total size and active size more relevant.
The only thing we can say for sure: only Anthropic knows.
ddavidovic@reddit
Opus is surely MoE
ilintar@reddit
I would be shocked if any of the current top models wasn't MoE. Running a dense 3T model would eat insane amounts of compute.
ddavidovic@reddit
Yes exactly, but there seems to be this mythology I come across quite often that somehow Anthropic is running dense models in 2026 for some inexplicable reasons
yolomoonie@reddit
Haiku is probably a dense one.
ilintar@reddit
Judging from their reasoning traces I'd say they're running a novel proprietary architecture with an internal "scratchpad model", some variation of MTP or cross attention. So likely even more fragmented than MoE.
ddavidovic@reddit
MTP is a decode optimization and cross-attention is a seq2seq thing, don't see how it could be related.
Party-Special-5177@reddit
Not quite, ilintar’s response is plausible:
It was a training optimization first, as it teaches models to ‘plan ahead’. It is proven to increase both sample efficiency and zero-shot performance on downstream tasks. Idk if you missed it, but it seems even Gemma 4 was trained with MTP, which was then removed after the fact for release.
Cite: https://arxiv.org/abs/2404.19737
As to cross attention, that is how the scratchpad model’s outputs would be linked back in to the main model.
ddavidovic@reddit
Thanks, this is useful info.
FullOf_Bad_Ideas@reddit
What reasoning traces have you seen? They output only reasoning summary, you can't access reasoning content outside of rare moments when it spills over.
DeepOrangeSky@reddit
Well... not nobody. The people who made it would know. And some of those employees bounce around from one company to another (including to xAI), so, seems like decent odds he could actually know the info, from people who worked on it directly.
Also could be that he is just lying or exaggerating. But, just saying it's not like some totally insane 1 in a million scenario of how he could know.
If anything, probably better than 50/50 odds that he'd know some insider info about the other main frontier models, if he has a bunch of employees he poached, many of whom worked on those other models.
I mean, I get if people don't like him or whatever, but, seems a little weird that so many people in here are acting like it would be insane/borderline impossible for him to know about something like this.
I'd guess that him, Zuck, Dario, Demis, etc probably know a fair bit of insider info about each other's models.
ieatrox@reddit
what's crazy is that the obviously reasonable response you've got here is this far down the thread.
local llama has been infected with the same groupthink as the main subs. :/
You can dislike musk, but to claim the owner of the latest cluster, one of the most used models, and employer of a lot of the talent pool has zero knowledge is the most Dunning Kruger take ever.
adsci@reddit
i would agree with you if i hadn't followed the things musk has said about software development over the last 10 years. he clearly showed that he does not have deep tech knowledge and only parrots what he hears from devs in meetings. also it's the guy who said he's a top video gamer and then got caught paying others to play in his name.
he is not trustworthy and he has minimal tech knowledge. he is a salesman. sure, the salesman could have heard about another company's model sizes, but he would also not hesitate to fake it, and if it's true he doesn't understand it.
ieatrox@reddit
There's a wide gulf between "I don't trust the claims he makes" and "he knows nothing".
That's the point. It's totally reasonable to distrust his claims. It's totally unreasonable however to be so self assured of your impression of him that you claim to know how much knowledge he has without evidence. on reddit.
MMAgeezer@reddit
I don't think most people are claiming he has zero knowledge. But what most people are pointing out is that he is a serial liar, especially when it comes to hyping up his own products and lying about his competitors.
This might be true, it might not be, but his track record gives us no reason to believe him.
Hector_Rvkp@reddit
it can't NOT be MoE. The bigger it is, the more nonsensical it becomes to be a dense model. If you ask Gemini SOTA, it will admit to being an MoE model, i don't think it's a secret. They also all re-route aggressively to smaller models, as most people don't need to be served Godzilla when asking Pikachu questions.
a_beautiful_rhind@reddit
Gargantuan model sizes don't completely make sense. You have to fill them with data or you end up like bloom. Sonnet tracks being kimi sized with simply more active parameters.
It has to be servable to people at a profit. Why do you think grok is that small?
my_name_isnt_clever@reddit
I used to work for Apple, what I learned is nobody knows fucking anything about what goes on internally at companies like this. I just know there's some Anthropic devs chuckling in their office about how utterly wrong this is.
yolomoonie@reddit
Okay, so the lore goes like this: Claude 3.0 Opus was a big foundation model, so they distilled it down to the smaller Claude 3.5 Sonnet. But then people were confused about whether they should use the newer, cheaper model or the Opus foundation model.
So Anthropic decided on a different strategy with their 4th generation models. At first they launched Opus 4.0 like they did with Opus 3.0. This model was presumably 5-6T parameters, and the token price was $75 per million output tokens. But to avoid confusion this time, they created both Opus 4.5 and Sonnet 4.5 from this 4th gen foundation model. It's assumed that Opus 4.5+ has at least three times fewer active parameters than Opus 4.0, because they were able to lower the token price to $25. Also, the tps of Opus 4.5 roughly aligns with the tps of GLM, Kimi etc. when you assume something around 100B active parameters and a total MoE model size of 1.2-2.5T. In this theory, Claude Sonnet 4.5+ is considered to be about 30% smaller than Opus 4.5+.
This is just a brief overview, but when you go into detail the theory really makes sense, because it aligns well with the hardware that was broadly available at the time.
For the next generation it seems that Mythos is their foundation model, and Claude Opus 5.0 will be a distilled-down version from the start. This is because you can train massive 20T+ models on a SOTA NVL72 with GB300s, but when it comes to inference on, for example, Google TPUs, which Anthropic uses heavily, you need to scale the model down.
So Musk is not wrong, he's just bullshitting...
Happy-Register3367@reddit
This whole "0.5T × 10 = 5T" thing feels like guesswork more than anything.
Without knowing active params or routing, total params alone doesn't really tell us much anyway.
tvmaly@reddit
Could be possible that a former Anthropic employee went to XAi and brought that internal detail of model size with them. It is probably not secret knowledge among those that work at these top AI companies.
Pwc9Z@reddit
Isn't this the bloke who'd fire developers based on the number of lines of code they wrote? Yeah, definitely listen to him on the topic of LLMs
ethereal_intellect@reddit
It's what stood out to me too, I wonder if he's just ~~talking out of his ass~~ estimating or has some insider knowledge
_raydeStar@reddit
He might have insider knowledge
He might not.
You never can tell for sure.
GarboMcStevens@reddit
5T seems directionally correct
ShadyShroomz@reddit
I would be surprised if he didn't know (given how often people switch companies); I'm sure he's poached people from Anthropic.
But who knows if he's telling the truth... might just be lying to make grok look better, who knows
AdamEgrate@reddit
How sad would it be to go from Anthropic to xAI. I doubt anyone would make that choice willingly
Virtamancer@reddit
Source: some r*dditor.
casualcoder47@reddit
Company switches are often accompanied by signing bonuses and pay raises. And it's not like big company is any better in terms of sadness they give you. I'm sure they're doing fine
TheRealMasonMac@reddit
I'm pretty sure Elon measures productivity by LoC changed per week, which means employees are making worthless changes to keep their job.
_raydeStar@reddit
Yeah, if I were with anthropic and got an offer for a huge salary increase for basically the same work, I'd be thinking about it.
_raydeStar@reddit
He could prove it to us
by open-sourcing Grok 4.20.
see-these-bones@reddit
That's what's hard to get a handle on. Most people in positions of power are psychologically dysfunctional in some way. This makes them liars, not because they have a compulsion to tell lies, but because they have no need or desire for the truth. They don't lie in a way where you can simply believe the opposite of what they're saying to derive the truth; it might be true. They just say whatever feels the most appropriate in the current context to get what they want, or at least to tell the narrative they want to tell. No wonder they think LLMs are already conscious, it's so close to how they are.
Singularity-42@reddit
He might have insider knowledge and still lie to hype up Grok
BumblebeeParty6389@reddit
Well the chance of him knowing information about opus and sonnet is much higher than redditors or twitter ai bros
MMAgeezer@reddit
And the chances of him lying about his competitors to hype his own products (as he does every day on his account) is also extremely high. He's a serial liar. He's not serious.
relmny@reddit
Why will people assume that he will tell the truth, whether he knows it or not?
SpiritualWindow3855@reddit
He's definitely talking out of his ass, and even the number for his own model is misleading since Grok 4.20 is 4 models running concurrently
DeepOrangeSky@reddit
Are you sure? (genuinely curious, since I've seen different people have opposing stances on it in the time since it came out).
Back when it came out, it seemed like even some fairly technical people that discuss LLMs a lot were saying it works the other way (as in, one single 500b model, running 4 aspects of thinking mode within itself or something like that, rather than 4 actual separate 500b models running concurrently).
Are you saying this just from using it and seeing the 4 agents stuff happen on the screen while using it, or was there some actual technical reason or things you read or strong sources or something that made you feel it works the other way? (and if so, what were they)?
Thomas-Lore@reddit
It is not. Grok 4.20 has an option to run 4-8 agents (it is called multi agent on the api) but the model is also available in single version.
SpiritualWindow3855@reddit
Grok 4.20 in their app is the multi agent variant.
Elon is also on the record saying 3 and 4 are 3T parameters and claims 5 will be 6T parameters
But sure, your hero figured out how to get 500B parameter models to beat 3T parameter models in the 2 months since he said that.
dtdisapointingresult@reddit
Can you post a link to his tweet saying Grok 3/4 are 3T params? I can't find it myself. It would help your case.
adt@reddit
https://youtu.be/q_mMV5OpRd4?t=1387
dtdisapointingresult@reddit
Cheers. (To anyone wondering: it's Elon in an interview saying Grok 3/4 are based on a 3T model)
Looks like that other nerd was right. I'm skeptical they got it down to 500B while doing better at benchmarks, while still calling it 4.x.
I hope he gets Community Noted.
SpiritualWindow3855@reddit
Well you're slow enough to ask me to do basic research for you and try to insult me in one stroke, so I won't do the leg work for you... but I will throw you a bone.
The most obvious search "elon musk grok model parameter count" has it on the very first page.
And in the future, please don't try to police how other people talk when you're this much of a jackass:
Good grief.
dtdisapointingresult@reddit
I did google it. I googled it before I even asked you. All the top google search results for your keywords are about the future Grok 5. If I specify grok 4, I can find random websites saying it's 1.7T, other random websites saying it's 3T, but none sourced by Elon Musk himself.
I'm trying to help you convince people here!
As for my tone, it's hard not to want to "clap back" at someone who is so typically REDDIT. You might even be right about this Grok thing and I'd still want to shove you in a locker, y'nah'mean?
SpiritualWindow3855@reddit
This is a ton of words to say you don't know and have no reasons, but disagree with the majority opinion.
Either way, Grok 4.20 is not a simple 500B parameter MoE. Elon's already stated 3 and 4 are 3T parameters, and claimed 5 will double that. As usual he's talking out of his ass.
DeepOrangeSky@reddit
Alright, well, I'm not so sure that's the majority opinion about it, but I guess I can see why it looks potentially suspicious. It is pretty impressive, if it is legit.
Personally I hope it is legit, since that would be cool if AI is rapidly improving and we get stronger models for cheaper, and less resources per amount of strength and speed and so on.
Anyway, if anyone lurking in here saw anything particularly interesting or solid about it either which way, I would definitely be curious (even if it shows that I'm wrong, I don't mind, I still would like to know about it, since it is an interesting topic, imo).
Thomas-Lore@reddit
Grok 4.20 is one model.
Grok 4.20 Multi-Agent is 4-8 models. It is a separate version.
SpiritualWindow3855@reddit
I guess you like to repeat comments so I'll say it here too: the version they offer users is the multi-agent version, and Elon has already said 3 and 4 are 3T parameters and claimed 5 would be 6T
His post doesn't even pass the smell test except for people who are really far up this guy's backside.
maschayana@reddit
He sniffing too much
KaMaFour@reddit
Minimax mogging grok while being at 200B is still funny.
porkyminch@reddit
Should see the usage stats I saw from my company recently. We’re on GitHub Copilot and Grok was completely free through them on a promotional basis until recently. Most popular model was Opus with like 80 million requests. Grok was in dead last. Like 1500 requests. They can’t even give that shit away.
Hector_Rvkp@reddit
i use it via the GUI for glorified online searches, which you can call research if you want to be nice. Grok is on par with Claude & Gemini in terms of outputting smart sounding stuff. Claude dominates coding, apparently, but for more general stuff, there's no gap i can discern. Anecdotally.
power97992@reddit
0.5T is pretty good for its benchmarks. But Opus is not 5T.
hp1337@reddit
If this is true then Opus is wildly inefficient!
Singularity-42@reddit
This is probably the best analysis I've found and it estimates Opus 4.6 at 1.5T to 2T range in terms of size.
https://unexcitedneurons.substack.com/p/estimating-the-size-of-claude-opus
power97992@reddit
He forgot about batching and MoE inefficiencies (Ironwood has 7.37 TB/s, but when serving MoEs the effective bandwidth is about 4.5 TB/s), and all API providers serve models concurrently. Once you factor in batching and MoE inefficiencies, it will be slightly smaller than that…
Klutzy-Snow8016@reddit
That was written a while ago, and didn't age well in at least one area. They estimate the number of active parameters, then multiply to get the number of total parameters. To get the total : active ratio, they looked at the open weights models GLM 4.7, DeepSeek V3, and Kimi K2. Good so far.
But then they said that we can probably disregard any higher sparsity than Kimi's 1:384 because any higher and you'll get "the Llama 4 problem, where the model is brain damaged". But since they wrote that, Qwen3.5 397B-A17B came out, which has the same level of sparsity as Llama 4 Maverick and performs very well. So if Anthropic was just a couple months ahead of Qwen in research, they could have a model just as sparse and have it work well.
So Opus might be larger than this article's estimate based on knowledge we now have that the author didn't have then.
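The estimation method in question is roughly this kind of napkin math (all numbers below are hypothetical placeholders, not figures from the article or from Anthropic):

```python
# Estimate active params first (e.g. from tps), then scale by an assumed
# total:active ratio borrowed from open-weights MoE models. The sparser the
# ratio you're willing to believe, the larger the total estimate gets.
active_params_b = 35  # hypothetical active-parameter estimate, in billions

for ratio in (10, 20, 40):  # assumed total:active ratios
    print(f"ratio {ratio}:1 -> ~{active_params_b * ratio / 1000:.1f}T total")
```

So the whole estimate hinges on which sparsity you think frontier labs can make work, which is exactly the assumption that Qwen3.5 undermined.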
Singularity-42@reddit
Great points!
Daemontatox@reddit
Its fucking elon musk talking about tech , do we really need anymore proof to not care?
Healthy-Nebula-3603@reddit
he was a programmer anyway
adsci@reddit
when he talked about the tech behind twitter after he bought it he clearly showed, that he has no basic knowledge. at least not anymore. he speaks about tech like they talk about hacking in movies.
Healthy-Nebula-3603@reddit
I did not say he is good at it now, but he WAS a programmer, so he has some kind of clue.
porkyminch@reddit
Being a programmer in the 90s has next to nothing to do with modern machine learning.
Healthy-Nebula-3603@reddit
In the 90s? Still better than a CEO who has no idea how to code managing such a company.
jackoftrade777@reddit
He clearly never coded a single line.
AlmoschFamous@reddit
20 years ago. I doubt he could even pass a verbal technical screen now.
Dordidog@reddit
Before u knew his political stance u cared
TldrDev@reddit
Never cared.
Elon was a notorious huckster at PayPal and was a well known fraud.
The only people who thought this guy was anything but a moron with money were the people who drive jacked up trucks to Sam's Club.
throwaway2676@reddit
Lmao, this is a view you only see on reddit from morons who have never done anything in their lives. Major investors put billions behind every venture Elon puts out, including from other big tech companies like Google. Elon created SpaceX at a time when the idea of reusable rockets was fantasy. Now it has arguably one of the most impressive inventions ever made. All the other companies in Elon's space speak highly of him, as do his past and present engineers.
It is actually nuts how detached from reality the average reddit mind is.
BlipOnNobodysRadar@reddit
The problem with stupid people is that they all think they're smart. And they move in herds.
TldrDev@reddit
The irony is unbelievable.
Money doesnt impress me. I've worked my entire career in venture capital and private equity, and in the alternative investment space.
SpaceX isn't as innovative as you think, but more and probably most importantly, Elon doesn't know fucking anything about software. Just objectively. Every time he talks about engineering or software he speaks in CSI Miami levels of technobabble. I mean he just objectively doesn't know fucking anything about the words he is saying.
throwaway2676@reddit
I mean, this is just delusional. Starship and Starlink are objectively two of the most innovative creations on the planet. Objectively, he knows way more about LLM research and tech development than you ever will. He works directly with his R&D teams far more than most CEOs, as his past employees will tell you. Other tech leaders like Demis Hassabis respect Elon's technical knowledge and skills, as do most people in this space who matter.
anotheruser323@reddit
I know a bit about programming and I can tell you he doesn't.
Ok maybe he knows some stuff about stuff, that he heard here and there. But I am 100% sure he doesn't understand anything about anything.
throwaway2676@reddit
lmao, I don't get it, is this just how losers feel better about themselves? you haven't contributed anything to the conversation here. Another no name redditor shouting into the wind baseless claims that no person with firsthand knowledge believes.
anotheruser323@reddit
Why are you defending him so hard? There are plenty of examples of how ignorant he is. It's not some conspiracy or anything. It's not worth losing your nerves arguing on anonymous online forums. Even if the nazi knew a lot and had a brain, it is just not worth it. So why?
throwaway2676@reddit
Because this could be an actually interesting thread about model sizes at the cutting edge, but instead it's just an obnoxious circlejerk of drooling redditors. Just like every thread involving Elon, over and over again since he entered politics. None of this has any basis in reality. No one ever posts any actual evidence or objective standards. The most that happens is people like you just saying "there is plenty of examples, just trust me bro." Meanwhile, in the real world, smart and successful people respect Elon Musk on every level, and his companies are extremely innovative and interesting.
Since we can't actually discuss them on here in any meaningful sense, all I have left to do is try to break some small segment of the reddit cult out of their deluded circlejerk.
anotheruser323@reddit
Hey. He lies. All the time he lies. His word is not evidence. He is a known liar.
"Trust me bro", sure. Here's a trust me bro: When x.com and confinity were merging he forced them to use microsoft server and MsSQL. Confinity was using Solaris and Oracledb. Go find an old dope smoking unix guru (aka one of those competent old sysadmins, since you don't believe anonymous me bro) and ask if at the time of merger (1999) what was the obvious choice. MS server was way worse then solaris (solaris was the best written OS for network stuff at the time, maybe some BSD was better), and oracledb was the gold standard for databases. He pushed for MS...
That's just one example. Others are his genius solar shingles (worse then roof + normal solar panels), his genius vacuum tube transportation (physically impossible to achieve with failures being beyond catastrophic), not to mention his completely hilarious incompetence to even appear to be a pro gamer. That's just off the top of my head, and I stopped following his "career" because it's not even funny anymore. The guy can't even say "sorry"...
throwaway2676@reddit
When someone makes thousands of major decisions across multiple companies (especially startups), some of them are bound to be bad. It is 100% inevitable. Google has over 30 failed products. Facebook wasted $80 billion on the metaverse. Microsoft has had so many god awful releases. Cherrypicked failures are simply not a representative or intellectually honest way to evaluate someone's work. Google is not defined by Google Glass.
At least you were able to produce a couple interesting examples, unlike most people on here. They just don't prove what you think they do. Even Elon's worst ventures are more impressive than anything armchair redditors have produced.
anotheruser323@reddit
A rich moron is a moron...
AlmoschFamous@reddit
As an engineer I can tell you with certainty that Elon is not an engineer. He says the words but they aren't in the correct context. Any time he speaks it makes everyone in the industry laugh.
This thread and day was very funny: https://www.reddit.com/r/programming/comments/yzcodt/elon_musk_just_tweeted_a_photo_of_twitters/
throwaway2676@reddit
lmao, I don't get it, is saying shit like this just how losers feel better about themselves? you haven't contributed anything to the conversation here. Another faceless redditor shouting into the wind things with no evidence that no serious person believes.
TldrDev@reddit
The rocket that has launched 11 times and failed to make orbit, and the most basic rocket one can build with an open book, publicly funded NASA spec: super innovative.
He doesn't know shit about LLMs. He doesn't know shit about computers.
Literally bud, I know you're a throwaway Grok stan, but nobody who knows anything about engineering or software thinks Elon is good at, or understands, either of them.
The fucking guy was trying to throw shade about SQL, said the government doesn't use SQL, described GraphQL as thousands of sequential RPC calls, and has repeatedly embarrassed himself talking about computers.
His whole tenure at eBay was essentially punctuated by him trying to get the engineering team to use Windows for their servers. Truly a fucking moron.
Plabbi@reddit
That sounds improbable, as his first money-making ventures were Zip2 and then X.com (first iteration), which later merged to become PayPal. He was the lead developer at both of these companies.
TldrDev@reddit
Zip2 was a single website to list businesses; it's nothing notable from a technology perspective. It was acquired by AltaVista during the dotcom bubble.
X.com (first iteration, lol) was a failure, and it didn't merge to become PayPal, it was acquired by Confinity, which later launched PayPal.
Elon's tenure at PayPal was a disaster, and he was promptly fired after being Elon.
PrettyBaker2891@reddit
you do know people hated elon before he became political right? lmfao
stop making everything about politics you npc
mrclamjam@reddit
Political stance of throwing up a Nazi salute?
And just like others have said, he’s always been a known fraudster. lol the man even had to lie about being an “expert” at a video game just to try to fit in like the dweeb he is.
And I mean 95% of his fanbase are just bots on the internet stroking his ego to try to convince the “common man” that Elon is a genius. So can you really claim that everyone cared about his opinion, when that “everyone” is just Elon hyping himself up on his alt accounts?
AlmoschFamous@reddit
The second he opened his mouth about software engineering it was clear he had no idea what he was doing. Truly smart people make advanced concepts palatable. Musk made basic concepts sound like your grandmother was explaining them second hand.
Daemontatox@reddit
Not really , never cared , and never will tbh
Mthatnio@reddit
True, he clearly understands the field less than the average redditor.
throwaway2676@reddit
The sad thing is that the average redditor will think you're being sincere here and never realize they're the butt of the joke. Ugh, I really need to look for a place that doesn't filter every discussion through the unhinged reddit lens
sleepy_roger@reddit
There's another platform with a ton of smart local AI guys and girls 😜
nomorebuttsplz@reddit
go on
CondiMesmer@reddit
You should care, he's not a top 100 player in Path of Exile 2 for nothing!
camracks@reddit
Grok is very obviously not 500b active in an MoE lmao, it would likely be farrrr more intelligent, 500B total sounds about right, it’s not a horrible model, but it isn’t quite at the same level as Claude or ChatGPT or Gemini
NandaVegg@reddit
I think the main issue is that Grok does not have proper or extensive multi-turn training to this date. It still fairly quickly repeats itself or loses consistency, like the good old R1 or Qwen 2.5 models.
Practical-Collar3063@reddit
My uncle worked at ~~Nintendo~~ Anthropic and he says it is true
adt@reddit
Uh...
(15/Nov/2025): ‘[Grok-5] is a 6 trillion parameter model, whereas Grok-3 and -4 are based on a 3 trillion parameter model.’
But now Grok-4 is a 500B parameter model?
popiazaza@reddit
Grok-4.2 Beta is 500b, not Grok-4.
JorgitoEstrella@reddit
So they shrank it, or it was never 3T?
popiazaza@reddit
It's a different model. Grok-4 is 3T. Grok-4.20 is (probably) a distill from Grok-4, then heavy RL on top.
jackoftrade777@reddit
Are we talking about this same guy? He has influence, yes. But he doesn't understand shit.
Clean_Hyena7172@reddit
Honestly wouldn't be surprised if these numbers were accurate.
Defiant-Lettuce-9156@reddit
Given that it’s Elon, I wouldn’t be surprised if none of these numbers are accurate
Due-Memory-6957@reddit
Given that he has for sure poached people from Anthropic, I wouldn't be surprised if he knew exactly what the numbers are.
ZiddyBlud@reddit
You can't poach the work culture
Due-Memory-6957@reddit
Just like you can't make my soda tastes better by adding pee to it, such is life.
j0j0n4th4n@reddit
I wouldn't be surprised if Grok was just Deepseek abliterated tbh.
urekmazino_0@reddit
Opus 4.6 is 3.6T params
VoiceApprehensive893@reddit
if this is true its diabolical asf
TldrDev@reddit
Its as true as everything else he says.
throwaway2676@reddit
He says plenty of true things though. It's even dumber to disregard everything he says than it is to believe everything he says
Existing-Wallaby-444@reddit
"Man sometimes says something true" doesn't sound like someone I'd trust tbh
BusRevolutionary9893@reddit
I agree with you but I can't believe you have any upvotes from such a reasonable comment about Elon on reddit.
my_name_isnt_clever@reddit
If you can't trust him, everything he says is worthless. This take makes no sense.
Skid_gates_99@reddit
Even if the numbers are right I love that we're all just doing napkin math off an Elon tweet like it's a reliable source. The man said full self driving was ready in 2018.
9r4n4y@reddit
Ngl grok 4.20 multi agent is soooooo good
aresdoc@reddit
He's honest
Ska82@reddit
if it isnt open weights, this is irrelevant
ForsookComparison@reddit
Idk if this is true or not but whenever this guy's name is mentioned everyone in the comments drops 50 IQ points
cryyingboy@reddit
5 trillion parameters would explain why my api bill looks like a mortgage.
Billthegifter@reddit
Sure. It's strong. As long as you don't need It to be accurate.
Also. Shut up Elon
Cool-Chemical-5629@reddit
I love Opus and I think it's awesome, but I don't think it's 5T kind of awesome.
doudoudad@reddit
Elon had to use T as in 0.5T instead of 500B.
Wonderful-Ad-5952@reddit (OP)
lol
Long_comment_san@reddit
I wasn't impressed with Opus's knowledge. I still think it's a dense model in the 120B range.
idiotiesystemique@reddit
He's basing this off pricing, because Opus costs 5x what Sonnet costs, but it does not scale linearly
siegevjorn@reddit
The troll king trolls again? I'm sure grok's twice the size of sonnet but they couldn't make it better so he's just trolling.
-Ellary-@reddit
- "What you can tell about new Gemma 4 release?"
- "It is a decent model, close to Kepler 452B level."
Uriziel01@reddit
Let's be real, remembering the "history" that Elon has with lies, there is a good probability this info is taken straight from his ass.
my_name_isnt_clever@reddit
Pretty pathetic considering their resources if Grok 4 is 500b and Qwen 3.5 122b is almost on par with it.
Also this man has zero insider knowledge, how many times does the Nazi have to lie through his teeth before people stop taking him seriously?
CorpusculantCortex@reddit
Why the f would dumb dumb musk have any idea what the size of opus is?
eat_my_ass_n_balls@reddit
Why is anyone talking Elon Musk seriously about anything?
Tank_Gloomy@reddit
Calling Grok a 'strong' model is doing some truly heavy lifting on the meaning of the word.
Sound_and_the_fury@reddit
All Nazis have puddle deep knowledge
spky-dev@reddit
You realize NASA was founded almost exclusively from Nazi defectors, yeah?
DeliciousGorilla@reddit
How does one even obtain 5T parameters...
misha1350@reddit
Through lots of slop and little distillation. After all, you don't have to be a genius to come up with a huge model that can barely run on a DGX B200. Whereas you do have to be one to come up with something like Qwen3.5 35B A3B, which despite its size is punching way above its weight.
spky-dev@reddit
Lmfao. God this is just comically wrong.
TBT_TBT@reddit
Probably with an unknown number of petabytes of training data and tens of thousands of GPUs, $30,000-60,000 each, in Amazon's, Microsoft's, and Google's datacenters.
Budget-Juggernaut-68@reddit
How would he even know?
denoflore_ai_guy@reddit
If only it weren’t a digital Nazi.
Unlucky_Milk_4323@reddit
Friendly reminder: Elon Musk is a traitor and a piece of shit. X should be abandoned completely, as should Grok. If you use any part of that horrible human's ecosystem, you're supporting him.
Neither-Phone-7264@reddit
500b active lol
oxygen_addiction@reddit
I still think this discussion on HN from a few weeks ago paints a clearer picture and seems quite reasonable. Probably somewhere between 100B and 1-2T overall.
https://news.ycombinator.com/item?id=47319205
Tough_Frame4022@reddit
He just said in an interview he played golf on the moon. What else do you need him to say?
Easy_Werewolf7903@reddit
X doubt
qwen_next_gguf_when@reddit
He doesn't even have to know this information; he could easily be confusing it with some numbers a non-technical executive told him.
LatentSpacer@reddit
How much do you trust him? To me, he’s not a man of his word.
Global_Persimmon_469@reddit
He doesn't know shit
catfrogbigdog@reddit
*Very benchmaxxed