Closed source model size speculation
Posted by redjojovic@reddit | LocalLLaMA | 22 comments
My Prediction Based on API Pricing and Personal Opinion (a rough sketch of the pricing heuristic follows the list):
- GPT-4o Mini: Likely around 6.6B–8B active MoE (Mixture of Experts) parameters, potentially similar to the Grin MoE architecture described in this Microsoft paper. This is supported by:
  - Qwen 14B appears to deliver performance comparable to GPT-4o Mini.
  - The Grin MoE architecture is designed to achieve 14B-level performance, which aligns with the capabilities of GPT-4o Mini.
  - Microsoft's close partnership with OpenAI likely gives them deep insight into OpenAI's model structures, making it plausible that they developed a MoE model similar to GPT-4o Mini.
- Gemini Flash (May): 32B dense
- Gemini Flash (September): 16B dense (appears to outperform Qwen 14B)
- Gemini Pro (September): 32B active MoE
- GPT-4 Original (March): 280B active parameters, 1.8T overall (based on leaked details)
- GPT-4 Turbo: ~93B active (for text-only)
- GPT-4o (May): ~47B active (for text-only), possibly similar to the Hunyuan Large architecture
- GPT-4o (August/Latest): ~28–30B active (for text-only), potentially similar to Yi Lightning, Hunyuan Turbo, or Stepfun Step-2 architecture (around 1T+ total parameters, relatively low active parameters)
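A minimal sketch of the kind of pricing heuristic these predictions lean on, assuming serving cost per token scales roughly with active parameters. All numbers below are hypothetical placeholders, not real prices or confirmed sizes:

```python
# Back-of-the-envelope heuristic (an assumption, not any provider's disclosed method):
# if price per token roughly tracks active parameters, the price ratio against a
# model of known size gives a crude order-of-magnitude estimate.

def estimate_active_params_b(price_per_mtok: float,
                             ref_price_per_mtok: float,
                             ref_active_params_b: float) -> float:
    """Scale a reference model's active-parameter count by the price ratio."""
    return ref_active_params_b * (price_per_mtok / ref_price_per_mtok)

# Hypothetical numbers purely for illustration; provider margins, batching, and
# hardware differences can easily swamp this.
print(estimate_active_params_b(price_per_mtok=0.15,       # a "mini"-class price (made up)
                               ref_price_per_mtok=0.60,    # reference model price (made up)
                               ref_active_params_b=30.0))  # reference active params (assumed)
```

With these made-up inputs the sketch returns 7.5B, which is only meant to show how a price ratio maps onto an active-parameter guess.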
az226@reddit
GPT-4 original was 1.3T total and 221B active. 16 experts total, 2 active.
redjojovic@reddit (OP)
Apparently those are the original GPT-4 leak details:
www.kdnuggets.com/2023/07/gpt4-details-leaked.html
az226@reddit
Some of the details there are wrong. It says 8 experts when it was 16.
It says it was trained on 8k context. It was actually 4k. The rest was scaled in post-training.
Source code had 5 epochs, not 4. Text did have 2.
It also tried to say 25 thousand A100s, but just says 25. Interesting error.
Affectionate-Cap-600@reddit
This article also says that experts in a MoE are "trained on different datasets", while we now know that MoE training doesn't necessarily work this way (since experts are chosen on a per-token basis).
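To make the per-token routing point concrete, here is a minimal self-contained sketch (toy dimensions, random weights; not any production router) showing that each token picks its own top-k experts rather than experts being tied to separate datasets:

```python
import numpy as np

# Toy per-token top-k routing: every token scores all experts and keeps its own top-k.
rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, top_k = 4, 8, 16, 2

tokens = rng.standard_normal((num_tokens, d_model))
router_w = rng.standard_normal((d_model, num_experts))

logits = tokens @ router_w                       # (num_tokens, num_experts) routing scores
topk = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen experts per token

for t, experts in enumerate(topk):
    print(f"token {t} -> experts {sorted(experts.tolist())}")
```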
LoadingALIAS@reddit
Any guesses on Claude models? I’ve been wondering the same. This is a great post.
Apart_Boat9666@reddit
I think GPT-4o Mini is definitely bigger. It has all-around performance in every field, more like a 27B quantized model. Low-parameter models can't really do coding well or understand context.
DFructonucleotide@reddit
Agree with many of your guesses, but I believe neither the new Gemini Flash nor the new GPT-4o has changed its base model architecture from the original version. Training from scratch is too expensive and they shouldn't do it that frequently.
Gemini Flash could be 20-30B dense. The GPT-4 series could have undergone a roughly 50% reduction twice, meaning GPT-4T is ~1T with 100B active and GPT-4o is ~500B with 50B active, and then they increase it 10-fold to make a ~5T Orion/GPT-4.5/GPT-5, which agrees with previous reports. These numbers are just my personal guess, of course.
For the Chinese models I would like to point out that Yi-Lightning is likely to be smaller, based on its extremely low price (even lower than DeepSeek-V2) and subpar performance in complex reasoning. Step-2, on the other hand, is quite expensive (~$6/M input and ~$20/M output iirc), so it probably has many more active parameters.
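A quick arithmetic pass over the halving chain in the first paragraph, treating every figure as speculative; note that the commenter's ~100B/50B active guesses round lower than a strict halving of the leaked 280B:

```python
# Speculative arithmetic only: two ~50% reductions from the leaked GPT-4 figures,
# then a 10x scale-up for the rumored next model.
gpt4_total, gpt4_active = 1.8e12, 280e9
gpt4t = (gpt4_total / 2, gpt4_active / 2)   # ~0.9T total, ~140B active (comment rounds to ~1T / 100B)
gpt4o = (gpt4t[0] / 2, gpt4t[1] / 2)        # ~0.45T total, ~70B active (comment rounds to ~500B / 50B)
nxt = (gpt4o[0] * 10, gpt4o[1] * 10)        # ~4.5T total, in line with the "~5T" guess
for name, (t, a) in [("gpt4t", gpt4t), ("gpt4o", gpt4o), ("next", nxt)]:
    print(f"{name}: ~{t/1e12:.2f}T total, ~{a/1e9:.0f}B active")
```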
Aggravating_Carry804@reddit
Yes, on a Y Combinator podcast Sonnet and 4o were attributed 500B parameters, and I would trust them since they must know many insiders.
Affectionate-Cap-600@reddit
Also, no one here is taking into account the possibility of hybrid models (MoE with some dense portion), and Snowflake showed us that this is an efficient way to train models (I'm referring to the Snowflake Arctic paper; the model should be something like 11B dense + 128x3.6B experts, for about ~20B active parameters). Their model was (imo) still undertrained and had just a 4k token context, but it was trained with a relatively low budget compared to competitors.
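For readers unfamiliar with the hybrid layout, here is a minimal illustrative sketch (not the Arctic implementation; toy shapes and random weights) of a dense FFN running alongside a routed top-k MoE, so active parameters per token are the dense path plus k experts rather than all experts:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dense_ff, d_expert_ff = 32, 64, 48
num_experts, top_k = 8, 2

def ffn(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out    # simple ReLU feed-forward block

w_dense = (rng.standard_normal((d_model, d_dense_ff)) * 0.02,
           rng.standard_normal((d_dense_ff, d_model)) * 0.02)
w_experts = [(rng.standard_normal((d_model, d_expert_ff)) * 0.02,
              rng.standard_normal((d_expert_ff, d_model)) * 0.02)
             for _ in range(num_experts)]
w_router = rng.standard_normal((d_model, num_experts)) * 0.02

def hybrid_layer(x):                             # x: (d_model,)
    dense_out = ffn(x, *w_dense)                 # always-on dense path
    chosen = np.argsort(x @ w_router)[-top_k:]   # per-token top-k expert choice
    moe_out = sum(ffn(x, *w_experts[e]) for e in chosen)
    return x + dense_out + moe_out               # residual combine of both paths

print(hybrid_layer(rng.standard_normal(d_model)).shape)
```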
redjojovic@reddit (OP)
You're right that Yi Lightning probably has fewer active parameters than DeepSeek, and maybe the opposite is true for StepFun.
Big companies like OpenAI need efficiency because they serve millions, so resource use matters a lot. They also don't train models from scratch; they can trim, retrain parts, add or remove experts, or train from older models.
GPT-4 Turbo wasn't just better than GPT-4; it was faster and smaller. Leaks suggest GPT-4 was inefficient (280B active per query isn't something you can sustain), so they likely shrank it to serve users better.
As for GPT-4 Turbo to 4o: 4o is now multimodal, which sounds like a fresh start, but it's probably smaller, faster, and more efficient. By the time we get to the latest 4o updates a few months later, it makes sense that they'd reduce its size even further.
Also, the new August 4o is cheaper, which strengthens my point.
Affectionate-Cap-600@reddit
What about Claude Opus? Its price is even higher than (the original) GPT-4 32K.
kiselsa@reddit
I will never believe 4o is smaller than 70B Llama 3.1.
People like to drastically underestimate the size and performance of multilingual OpenAI models, which support dozens of languages, compared to monolingual DeepSeeks, Qwens, etc.
And you can't really compare models only on benchmarks.
Il_Signor_Luigi@reddit
Lower than I expected, tbh. What are your estimates on the Claude models?
redjojovic@reddit (OP)
I would say the Qwen 2.5 series' performance-to-size ratio convinced me it's very possible, especially with MoE architecture plus the more advanced research of closed-source labs.
I believe models today are much smaller than we initially thought, at least for the active parameters part.
Il_Signor_Luigi@reddit
It's more about the density of real-world knowledge, I guess. As parameters increase, if the model is developed and trained correctly, more knowledge is retained. And anecdotally it seems to me that Gemini Flash and small proprietary models "know more stuff" compared to open-source alternatives of apparently the same size.
isr_431@reddit
Please correct me if I'm wrong, but the 8B parameter count of Gemini Flash would include the vision model. This would bring the 'true' parameter size to around 7B, which is very impressive for its performance.
redjojovic@reddit (OP)
You're right, I wrote (text only) because the multimodality would add 1-?B parameters to the size.
That means the text-only size is even smaller, about 7B or less.
Il_Signor_Luigi@reddit
Aren't these guesses on the low end compared to open source performance, even given it naturally lags behind? And what are your guesses on the Claude models?
redjojovic@reddit (OP)
Tell me what you think
Any_Pressure4251@reddit
Dr Alan has a table that shows a few.
https://lifearchitect.ai/models-table/
One-Thanks-9740@reddit
If it's okay, I have a few questions.
Does it simply share history?
For example, when calling the LLM, does it attach the history (like a list of conversations) as an argument?
If so, how does it handle things when the history is big? Does it use an LRU-cache kind of thing to only use the latest conversations?
Or does it use some sort of compression using embeddings?
Or does it use both?
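As a hedged illustration of the options the question lists (passing history as a message list, keeping only the latest turns, or compressing older turns into a summary), a minimal sketch follows; the helper function and the 20-turn cutoff are hypothetical, not from any particular framework:

```python
from typing import Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}

def build_prompt(history: List[Message], max_turns: int = 20) -> List[Message]:
    """Hypothetical helper: keep the latest turns, crudely summarize the rest."""
    if len(history) <= max_turns:
        return history                       # small history: just attach it all
    recent = history[-max_turns:]            # "LRU"-style: keep only the latest turns
    older = " ".join(m["content"] for m in history[:-max_turns])
    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + older[:500]}
    return [summary] + recent                # crude stand-in for embedding-based compression
```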
SuperChewbacca@reddit
You might be able to estimate sizes based on speed (tokens/second) for the different models over time, combined with general news and knowledge of the hardware being used.
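One hedged way to turn tokens/second into a size guess assumes single-stream decoding is roughly memory-bandwidth bound; the bandwidth figure and throughput below are placeholders, and real serving (batching, tensor parallelism, speculative decoding) breaks the bound badly:

```python
# Rough bound, not a measurement: tokens/sec ≈ bandwidth / (bytes_per_param * active_params),
# inverted here to guess active parameters from observed streaming speed.
def rough_active_params_b(tokens_per_sec: float,
                          bandwidth_gb_per_s: float = 3350.0,   # assumed H100-class HBM bandwidth
                          bytes_per_param: float = 2.0) -> float:  # fp16/bf16 weights
    return bandwidth_gb_per_s / (tokens_per_sec * bytes_per_param)

# Hypothetical: an API streaming ~100 tokens/sec from a single GPU's bandwidth budget
print(f"~{rough_active_params_b(100.0):.0f}B active parameters (very rough, ignores batching and parallelism)")
```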