Tencent just put out an open-weights 389B MoE model
Posted by girishkumama@reddit | LocalLLaMA | View on Reddit | 178 comments
Enough-Meringue4745@reddit
We’re gonna need a bigger gpu
More-Acadia2355@reddit
Seems like there's going to be a big push in not only getting more VRAM on chip, but more importantly, getting the bandwidth between chips up.
Dry_Parfait2606@reddit
PCIe Gen 6 will be pretty solid... Compute will improve steadily... it's all planned out... The giants are delivering... The environmental awareness is here, so just in case there is a breakthrough that makes computing exponentially useful, we won't boil the whole planet with all the compute... Lol
I think that the only thing holding everything back is that it's all still an unwritten book...
Big tech, finance, etc. are already running MW of compute...
JFHermes@reddit
Thanks magic.
throwaway_ghast@reddit
NVIDIA: Best I can do is 24GB.
FullOf_Bad_Ideas@reddit
It's banned in EU lol. Definitely didn't expect Tencent to follow Meta here.
Arcosim@reddit
AGI is a race between America and China and no one else. The EU shot itself in the foot.
moarmagic@reddit
AGI isn't even on the roadmap without some significant new breakthroughs. We're building the most sophisticated-looking autocompletes; AGI as most people picture it is going to require a lot more.
liquiddandruff@reddit
Prove that agi ISN'T somehow "sophisticated looking auto complete", and then you might have an argument.
We don't actually know yet what intelligence really is. Until we do, definitive claims of what is or isn't possible from even LLMs are pure speculation.
qrios@reddit
AGI wouldn't be prone to hallucinations. Autoregressive auto-complete is prone to hallucinations, and (without some tweak to the architecture or inference procedure) will always be prone to hallucinations. This is because autocomplete has no ability to reflectively consider its own internal state. It can't know that it doesn't know something, because it doesn't even know it is there as a thing that can know or not know things.
None of this is to say the necessary tweaks will end up being hard or drastic. Just that they would at least additionally be doing something that is nothing at all like autocompletion.
liquiddandruff@reddit
You'd be surprised to know that most of the statements in your first paragraph are conjecture and some are in dispute.
This is a topic of open research for transformers. The theory goes that in order to best predict the next token, it's possible for the model to create higher order representations that do in fact model "a reality of some sort" in some way. Its own internal state may well be one of these higher order representations.
Secondly, it is known that NNs (and thus autoregressive models) are universal function approximators, so from a computability point of view, there is as yet nothing in principle that rules out even simple AR models from being able to "brute force" find the function that approximates "AGI". It will likely be computationally inefficient compared to more refined methods, but a degree of "AGI" would have been achieved all the same.
I do generally agree with you though. It's just that these remain to be open questions that the fields of cogsci, philosophy, and ML are grappling with.
That leaves the possibility that AGI might in fact be really fancy autocomplete. We just don't know yet.
moarmagic@reddit
Proving a negative is impossible.
I'd say, first define AGI. This is a term thrown around to generate hype and investment, and I don't think it has a universally agreed-on definition. People seem to treat it like some sort of fictional, sentient program.
This only makes the definition more difficult. Measuring intelligence in general is very difficult. Even in humans, the history of things like the IQ test is interesting, and shows how meaningless these tend to be.
Then we don't have a test for sentience at all. So as near as I can tell, "AGI" is a vibes-based label, and it will be impossible to determine what is or isn't one... kinda like "metaverse".
This is why I find it more useful to focus on what technology we actually have, especially when talking about laws and regulations, instead of jumping to purely hypotheticals
liquiddandruff@reddit
All that I can agree with. It's exactly that definitions are really amorphous.
Sentience is another can of worms, and I'd argue is independent of intelligence.
The term AGI as used today is def vibes--we'll know when we see it sort of thing.
For the sort of crazy AGI we see in sci-fi (Iain M. Banks' Culture series, say), we'll come up with a new term, like "Minds" with an M :p.
Arcosim@reddit
Yes, and how does your post contradict what I said? Do you believe that breakthrough is going to come from Europe? I don't.
moarmagic@reddit
My point is that it's something that doesn't exist, so it's weird that you jump to that. You could talk about how LLMs have the potential to make existing industries more efficient, or about how laws like the EU's are difficult to enforce, but instead you jumped to a vague term that may be entirely impossible with the technology the EU is regulating in the first place.
Eisenstein@reddit
The commenter is envisioning the 'end-game' of the AI race -- the one who gets it wins. This is not 'more efficient industry with LLMs', it is an AGI. It may not be possible, but if it is, then whoever gets it will have won the race. Seems logical to me.
Severin_Suveren@reddit
Agreed! Though I don't really agree with him, since it's a matter of software innovation, which may either require mathematical/logical breakthroughs to make big quick jumps, or may require less innovation but instead the painstaking task of defining tools for every single action an agent takes. If the latter, then sure, it's a race between China and the US. But looking at the past two years, it seems innovation is the road we're on, in which case it requires human ingenuity and could therefore be achieved by any nation, firm, or even (though unlikely) a lone individual.
treverflume@reddit
The average Joe has no clue what the difference is between an LLM and machine learning. To most people, AlphaGo and ChatGPT might as well be the same thing, if they even know what either is. But you are correct 😉
Lilissi@reddit
When it comes to technology, the EU shot itself in the head, a very long time ago.
Billy462@reddit
The EU need to realise they can’t take 15 years to provide clarity on this stuff.
PikaPikaDude@reddit
The current EU commission is very proud of how they shut AI down. And of how they shut EU industry down, forcing it into recession.
OpenAI and consorts don't need to lobby in the EU to kill competition, the commission does that for them for free.
HatZinn@reddit
...Did you mean 'cohorts'?
PikaPikaDude@reddit
The word has more than one meaning.
rollebob@reddit
They lost the internet race, the smartphone race and now will lose the AI race.
Severin_Suveren@reddit
I agree with you guys, but it's still understandable why they're going down this road. Essentially they are making a safe bet to ensure they won't be the first to have a rogue AI system on their hands, limiting potential gain from the tech but making said gain more likely.
It's a good strategy in many instances, but with AI we're going down this road no matter what, so imo it's better to become knowledgeable about the tech instead of limiting it, as that knowledge would be invaluable in dealing with a rogue AI.
rollebob@reddit
Technology will move ahead no matter what; if you are not the one pushing it forward, you will be the one bearing the consequences.
ZorbaTHut@reddit
They'll realize that in 20 years or so.
Bicycle_Real@reddit
Wonder how Mistral is navigating EU overregulation.
Dry_Rabbit_1123@reddit
Where did you see that?
cr0wburn@reddit
He's talking out of his ass
FullOf_Bad_Ideas@reddit
It's in the license.
https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/LICENSE.txt
cr0wburn@reddit
If you read the license it says it does not apply in the European Union, not that it is forbidden in the EU.
FullOf_Bad_Ideas@reddit
License file, third line and also mentioned later.
https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/LICENSE.txt
AaronFeng47@reddit
Abstract
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.
Code: https://github.com/Tencent/Tencent-Hunyuan-Large
Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
duboispourlhiver@reddit
Why is 405B "significantly larger" than 389B ? Or is it not ?
ortegaalfredo@reddit
It's a MoE, meaning the speed is effectively that of a 52B model, not a 389B one. Meaning it's very fast.
ForsookComparison@reddit
Still gotta load it all though :(
ortegaalfredo@reddit
Yes, fortunately they work very well by offloading some of the weights to CPU RAM.
drosmi@reddit
How much ram is needed to run this?
Atora@reddit
1 billion bytes happens to be 1GB (in base 10). So in general an fp16 model takes B*2 GB of RAM, a Q8 quant takes B GB, and a Q4 takes B/2 GB. These aren't exact, because context and other overhead get added on top, but they're fairly close approximations.
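To make that rule of thumb concrete for this model's 389B parameters, here is a minimal sketch (illustrative only; it ignores KV cache and runtime overhead, so real usage will be somewhat higher):

```python
# Rough memory estimate following the rule of thumb above:
# fp16 ~ 2 bytes/param, Q8 ~ 1 byte/param, Q4 ~ 0.5 bytes/param.
def approx_mem_gb(params_billions: float, bytes_per_param: float) -> float:
    # params (in billions) * bytes per param = decimal gigabytes
    return params_billions * bytes_per_param

for label, bpp in [("fp16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    print(f"{label}: ~{approx_mem_gb(389, bpp):.0f} GB")
# fp16: ~778 GB, Q8: ~389 GB, Q4: ~195 GB
```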
proprotoncash@reddit
Sitting here with 512gb ram wondering the same thing...
No_Afternoon_4260@reddit
Did you try it?
ortegaalfredo@reddit
256GB of RAM plus a couple of 3090s should be enough.
Drited@reddit
Interesting! What about vram requirements? Closer to that of a 52b or 389b model?
_Erilaz@reddit
Because 405B is a dense model; the 389B one has much, much less active weight.
IamKyra@reddit
When you say "dense model", is that a kind of architecture for LLMs?
Mean-Force267@reddit
dense: a single MLP that processes all tokens
MoE: x MLPs (experts), of which only y are selected for each token via a gate
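A minimal PyTorch-style sketch of that difference (purely illustrative; the layer sizes, expert count, and top-k routing here are made-up toy values, not Hunyuan-Large's actual configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMLP(nn.Module):
    """Dense: a single MLP that every token passes through."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class MoEMLP(nn.Module):
    """MoE: several expert MLPs; a gate picks the top-k of them per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([DenseMLP(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                      # naive loop for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])  # only chosen experts run
        return out

# Only top_k of n_experts run per token, which is why active parameters
# (and compute) are much smaller than the total parameter count.
```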
IamKyra@reddit
Thanks!
involviert@reddit
I think that line compares Llama 405B to Llama 70B? Anyway, since this is a MoE and Llama is not, the point could be made that it's sort of a 52B anyway.
duboispourlhiver@reddit
Oh you're right, I didn't read it that way.
rajwanur@reddit
Correct GitHub URL: https://github.com/Tencent/Tencent-Hunyuan-Large
AaronFeng47@reddit
thanks, edited
dgamr@reddit
Not successfully
Aymanfhad@reddit
It only answers in Chinese.
involviert@reddit
I love MoE stuff, it's just asking to be run on CPU. I mean, I run up to 30B mostly on CPU, and that is on crappy dual-channel DDR4. So I would get basically the same performance running this on some random new PC. It just has to be DDR5, and I think there are even 128GB banks by now? Then you don't even have to consider four banks (often slower) to get 256GB of RAM. Plus some low-end Nvidia card for prompt processing and such, and it's on.
Affectionate-Cap-600@reddit
Could you expand that aspect?
involviert@reddit
It's about "time to first token" and also what happens when the context needs to be scrolled (like when your conversation exceeds the context size and the context management would throw out the oldest messages to make room). So it's about ingesting new input, which is different from generating tokens. The calculations for that are much better suited to a GPU than a CPU, very much unlike the computations for generating tokens, which usually run rather fine on CPU. That stuff also doesn't have the high RAM/VRAM requirements, unlike token generation. So it really pays to have a GPU-enabled build with a reasonable graphics card, without the usual insane VRAM requirements. For example, my GTX 1080 does that job just fine for what I do.
Zyj@reddit
So I have a Threadripper Pro 5xxx with 8x 16GB and an RTX 3090; I just need a Q4 now, I reckon? What's good software to run this GPU/CPU mix?
involviert@reddit
Anything based on llama.cpp should be able to do this splitting thing just fine. You just configure how many layers to have on the gpu and the rest is on cpu by default. The gpu acceleration for prompt processing should be there even with 0 layers on gpu, as long as it's a GPU enabled build of llama.cpp at all.
No idea about support for this specific model though; often new models have some architecture that needs to be supported first. But you could start by just running some regular 70B. Running a MoE would be no different. You'll probably be surprised how well the 70B runs if you've never tried it, because 8x DDR4 sounds like 200GB/s or something.
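As a concrete illustration of that split, a sketch using the llama-cpp-python bindings (the GGUF filename is a placeholder, and this assumes llama.cpp has added support for the architecture; `n_gpu_layers` is the knob that decides how much lives in VRAM):

```python
from llama_cpp import Llama

# n_gpu_layers controls the GPU/CPU split: that many transformer layers are
# kept in VRAM, everything else stays in system RAM. As noted above, a
# GPU-enabled build can still accelerate prompt processing even with few
# (or zero) layers offloaded.
llm = Llama(
    model_path="hunyuan-large-instruct-Q4_K_M.gguf",  # hypothetical file name
    n_gpu_layers=8,   # tune to whatever fits your card's VRAM
    n_ctx=8192,       # context window to allocate
)

out = llm("Explain mixture-of-experts models in one paragraph.", max_tokens=200)
print(out["choices"][0]["text"])
```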
Zyj@reddit
The RTX 3090 manages 936GB/s. That's 4.6 times more.
involviert@reddit
Sure. And a 3060 apparently has 360.
rini17@reddit
CPU is okay until you want long contexts. At tens of thousands of tokens it grinds almost to a halt.
Caffdy@reddit
That's why he mentioned the added GPU for prompt_eval
rini17@reddit
Sure that helps but only if the kv cache fits in the GPU memory. "Low end nvidia card" won't do long contexts either.
ambient_temp_xeno@reddit
I have a Xeon, so I could get 256GB of quad-channel DDR4, but it depends on 1) llama.cpp adding support for the model and 2) it actually being a good model.
steitcher@reddit
If it's a MoE model, doesn't it mean that it can be organized as a set of smaller specialized models and drastically reduce VRAM requirements?
ProposalOrganic1043@reddit
I see only one player winning here: Nvidia
Unfair_Trash_7280@reddit
From their HF repo, FP8 is 400GB in size and BF16 is 800GB. Oh well, maybe Q4 will be around 200GB. We'd need at least 9x 3090s to run it. Let's fire up the nuclear plant, boys!
Delicious-Ad-3552@reddit
Will it run on a raspberry pi? /s
yami_no_ko@reddit
Should run great on almost any esp32.
Educational_Gap5867@reddit
Try all the esp32s manufactured in 2024. It might need all of them.
The_GSingh@reddit
Yea. For even better computational efficiency we can get a 0.05 quant and stack a few esp32s together. Ex
iamn0@reddit
How many tokens/year?
YearZero@reddit
In about 30 years it will!
MoffKalast@reddit
On 100 Raspberry Pi 5s on a network, possibly.
pasjojo@reddit
Not with that attitude
norsurfit@reddit
Meat's back on the menu, boys!
DeltaSqueezer@reddit
With a ktransformers approach you could do it with less.
mlon_eusk-_-@reddit
These numbers are giving my laptop ass anxiety.
nail_nail@reddit
Angry upvote.
CoUsT@reddit
Damn, that abstract scratches the nerdy part of me.
Not only do they implement and test a bunch of techniques, they also investigate scaling and learning rates and, in the end, provide the model to everyone. A model that appears to be better than similarly sized or larger ones.
That's such an awesome thing. They did a great job.
ambient_temp_xeno@reddit
The instruct version is 128k but it might be that it's mostly all usable (optimism).
HatZinn@reddit
The Wizard guys *really* have to come back.
Caffdy@reddit
What techniques seem to be the most relevant?
gabe_dos_santos@reddit
Large models aren't feasible for the common person; it's better to use the API. I think the future leans towards smaller and better models. But that's just an opinion.
DigThatData@reddit
I wonder what it says in response to prompts referencing the Tiananmen Square Massacre.
AbaGuy17@reddit
It's worse in every category compared to Mistral Large? Am I missing something?
Lissanro@reddit
Yeah, I have yet to see a model that actually beats Mistral Large 2 123B for general use cases, not only in some benchmarks. Otherwise I just end up continuing to use Mistral Large 2 daily, and all the other shiny new models just clutter my disk after some tests and a few attempts to use them on real-world tasks. Sometimes I give tasks that are too hard for Mistral Large 2 to other, newer models, and they usually fail them as well, often in a worse way.
I have no doubt we will eventually have models that are better and more capable than Large 2, especially in the higher parameter count categories, but I don't think that day has come yet.
martinerous@reddit
Yeah, Mistrals seem almost like magic. I'm now using Mistral Small as my daily driver, and while it can get into repetitive patterns and get confused by some complex scenarios, it still feels the least annoying of everything I can run on my machine. Waiting for Strix Halo desktop (if such things will exist at all).
AbaGuy17@reddit
Thx, thought I was going insane
Healthy-Nebula-3603@reddit
What? Have you seen the benchmark table? Almost everything is over 90%... benchmarks are saturated.
AbaGuy17@reddit
One example:
GPQA_diamond:
Hunyuan-Large Inst.: 42.4%
Mistral Large: 52%
Qwen 2.5 72B: 49%
Healthy-Nebula-3603@reddit
Source ?
ihaag@reddit
Gguf?
martinerous@reddit
I'm afraid we should start asking for bitnet... and even that one would be too large for "an average guy".
punkpeye@reddit
any benchmarks?
visionsmemories@reddit
Healthy-Nebula-3603@reddit
Almost all benchmarks are fully saturated... We really need new ones.
YearZero@reddit
Seriously, when they trade blows of 95% vs 96% it is no longer meaningful, especially in tests that have errors, like MMLU. It should be trivial to come up with updated benchmarks: you can expand the complexity of most problems without having to come up with uniquely challenging ones.
Say you have a "1, 3, 2, 4, x, complete the pattern" problem. Just create increasingly complicated patterns to see where the limit of the model is for each "type" of problem. You can do that with most reasoning problems: just add more variables, more terms, more "stuff" until the models can't handle it. Then add like 50 more on top of that to create a nice buffer for the future.
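A toy sketch of that idea: generate the same family of pattern-completion problems at increasing difficulty and probe where a model breaks (the pattern family here, interleaved arithmetic sequences, is just an illustrative assumption):

```python
import random

def make_pattern_problem(difficulty: int, length: int = 8, seed: int = 0):
    """Interleave `difficulty` arithmetic sub-sequences; more lanes = harder rule."""
    rng = random.Random(seed)
    starts = [rng.randint(1, 9) for _ in range(difficulty)]
    steps = [rng.randint(1, 5) for _ in range(difficulty)]
    seq = []
    for i in range(length + 1):
        lane = i % difficulty                      # which sub-sequence this slot belongs to
        seq.append(starts[lane] + (i // difficulty) * steps[lane])
    return seq[:-1], seq[-1]                       # (shown sequence, expected next value)

# difficulty 2 gives "1, 3, 2, 4, ..."-style problems; keep raising it until the model fails
for d in range(1, 6):
    shown, answer = make_pattern_problem(d)
    print(f"difficulty {d}: {shown} -> next is {answer}")
```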
Eisenstein@reddit
The difference between 95% and 96% is much bigger than it seems.
At first glance it looks like it is only a 1% improvement, but that isn't the whole story.
When looking at errors (wrong answers), the difference is between getting 5 answers wrong and getting 4 answers wrong. That is a 20% difference in error rate.
If you are looking at this in production, then having 20% fewer wrong answers is huge deal.
ovnf@reddit
Why does that table always look like lab results from your doctor... the UGLIEST fonts are always for nerds...
Thomas-Lore@reddit
256k context, nice.
xadiant@reddit
Looks like an overall upgrade to Llama-3 405B while being cheaper.
punkpeye@reddit
Damn
metalman123@reddit
90 MMLU, 90 HumanEval, almost 90 BBH
Caffdy@reddit
Still in the 60s in MMLU-pro
duboispourlhiver@reddit
Humaneval is 71
metalman123@reddit
Look at instruct...
medi6@reddit
GPUGE
ErikThiart@reddit
What PC specs would be needed to run this if I were to build a new PC?
My old one is due for an upgrade.
helgur@reddit
With each parameter requiring 2 bytes at 16-bit precision, you'd need to fork out about $580,000 on video cards alone for your PC upgrade. But you can halve that price if you use 8-bit precision, or go even lower with quantization.
Good luck 👍
ErikThiart@reddit
would it be fair to say that hardware is behind software currently?
Small-Fall-6500@reddit
Considering the massive demand for the best datacenter GPUs, that is a fair statement.
Because the software allows for making use of the hardware, companies want more hardware. If software couldn't make use of high-end hardware, I would imagine 80GB GPUs could be under $2k, not $10k or more.
Of course, there's a bit of nuance to this - higher demand leads to economy of scale which can lead to lower prices, but making new and/or larger chip fabs is very expensive and takes a lot of time. Maybe in a few years supply will start to reach demand, but we may only see significant price drops if we see an "AI Winter," in which case GPU prices will likely plummet due to massive over supply. Ironically, in such a future we'd have cheap GPUs able to run more models but there would be practically no new models to run them with.
Lissanro@reddit
12-16 24GB GPUs (depending on the context size you need), or at least 256GB of RAM for CPU inference, preferably with at least 8-12 channels, ideally dual CPUs with 12 channels each. 256GB of dual-channel RAM will work as well, but will be relatively slow, especially with larger context sizes.
How much it will take depends on whether the model gets supported in VRAM-efficient backends like ExllamaV2, which allow Q4 or Q6 cache. Llama.cpp supports 4-bit cache but no 6-bit cache, so if a GGUF comes out, it could be an alternative. However, sometimes cache quantization in Llama.cpp just does not work; for example, that was the case with DeepSeek Chat 2.5 (also a MoE): it lacked EXL2 support, and in Llama.cpp, cache quantization refused to work last time I checked.
My guess, running Mistral Large 2 with speculative decoding will be more practical, may be comparable in cost and speed too but will need much less VRAM, and most likely produce better results (since Mistral Large 123B is a dense model, and not MoE).
StraightChemistry629@reddit
MoEs are simply better.
Llama-405B kinda sucks, as it has more params, worse benchmarks and all of that with over twice as many training tokens ...
Intelligent_Jello344@reddit
What a beast. The largest MoE model so far!
Small-Fall-6500@reddit
It's not quite the largest, but it is certainly one of the largest.
The *actual* largest MoE (that was trained and can be downloaded) is Google's Switch Transformer. It's 1.6T parameters big. It's ancient and mostly useless.
The next largest MoE model is a 480b MoE with 17b active named Arctic, but it's not very good. It scores poorly on most benchmarks and also very badly on the lmsys arena leaderboard (rank 99 for Overall and rank 100 for Hard Prompts (English) right now...) While technically Arctic is a dense-MoE hybrid, the dense part is basically the same as the shared expert the Tencent Large model uses.
Also, Jamba Large is another larger MoE model (398b MoE with 98b active). It is a mamba-transformer hybrid. It scores much better than Arctic on the lmsys leaderboard, at rank 34 Overall and rank 29 Hard Prompts (English).
UpperParamedicDude@reddit
Well...
Some_Endian_FP17@reddit
How the hell do they even run this? China already can't buy sanctioned GPUs.
fallingdowndizzyvr@reddit
A Mac 192GB should be able to run a decent quant.
Unfair_Trash_7280@reddit
From the info, it's trained on the H20, which is designed for China; weaker than the H100, but it gets things done once you have enough of them.
CheatCodesOfLife@reddit
One of these would be cheaper and faster than 4x4090's
Cuplike@reddit
There are also the modded 3090s with 4090 cores and 48GB of VRAM.
FullOf_Bad_Ideas@reddit
What is left of 3090 if you replace the main chip and memory? I am guessing the whole PCB gets changed too to accommodate 4090 chip interface on the PCB.
fallingdowndizzyvr@reddit
That's exactly what happens. Contrary to what people think, they don't just piggyback more RAM. They harvest the GPU and possibly the VRAM and put them onto another PCB. That's why you can find "for parts" 3090s/4090s for sale missing the GPU and VRAM.
vincentz42@reddit
Not sure they actually trained this on H20. The info only says you can infer the model on H20. H20 has a ton of memory bandwidth so it's matching H100 in inference, but it is not even close to A100 in training. They are probably using a combination of grey market H100 and home-grown accelerators for training.
Some_Endian_FP17@reddit
5 to 10 cards linked together for inference then?
Tomr750@reddit
They buy them through Singapore.
Made in Taiwan...
shokuninstudio@reddit
They buy up tons on grey and used markets internationally. There are agents abroad who scalp everything from baby formula to GPUs.
shing3232@reddit
They're making their own Ascend 910 and its inference variant.
fallingdowndizzyvr@reddit
Mac Ultra 192GB. Easy peasy. Also, since it's only 50B active then it should be pretty speedy as well.
CheatCodesOfLife@reddit
Starting to regret buying the threadripper mobo with only 5 PCI-E slots (one of them stuck at 4x) :(
hp1337@reddit
Get a splitter
visionsmemories@reddit
this is some fat ass model holy shit. that thing is massive. it is huge. it is very very big massive model
JohnnyLovesData@reddit
A "Yo Mamma" model
pauljdavis@reddit
Yo llama 😃
ouroboroutous@reddit
Awesome benchmarks. Great size. Look thick. Solid. Tight. Keep us all posted on your continued progress with any new Arxiv reports or VLM clips. Show us what you got man. Wanna see how freakin' huge, solid, thick and KV cache compressed you can get. Thanks for the motivation
shing3232@reddit
Not bigger than Llama 3.1 400B, and it's a MoE.
balianone@reddit
need live bench
Kep0a@reddit
Jesus
my_name_isnt_clever@reddit
I wonder if this model will have the same knowledge gaps as Qwen. Chinese models can be lacking on Western topics, and vice versa for Western models. Not to mention the censoring.
cgs019283@reddit
Any info for license?
a_slay_nub@reddit
https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/LICENSE.txt
Looks pretty similar to Llama.
ResidentPositive4122@reddit
Except for the part where the EU gets shafted :) Man, our dum-dums did a terrible job with this mess of legislation.
Prior-Reserve8976@reddit
u/TheLocalDrummer
🙏🥺
CheatCodesOfLife@reddit
Yes, I just want to see the name lol
GeneralRieekan@reddit
MoE =/= Moe
Unfair_Trash_7280@reddit
Things to note here: Tencent's 389B has benchmark results similar to Llama 3.1 405B, so there may not be much incentive to use it except for Chinese (much higher score).
metalman123@reddit
It's a MoE with only ~50B active at inference. It's much, much cheaper to serve.
Unfair_Trash_7280@reddit
I see. But to run it, we still need the full memory of 200 - 800 GB right? MoE is for faster inferencing, isn’t it?
_yustaguy_@reddit
Yeah, definitely a model to use through an API; it could potentially be sub one dollar. And it crushes the benchmarks.
CheatCodesOfLife@reddit
Probably ~210GB for Q4. And yes, MoE is faster.
I get 2.8 t/s running Llama 3 405B at Q3 with 96GB of VRAM + CPU. I should be able to run this monstrosity at at least 7 t/s if it gets GGUF support.
shing3232@reddit
Ktransformer should be even better
Ill_Yam_9994@reddit
Yep.
The other advantage is that MoEs work better partially offloaded. So if you had, say, an 80GB GPU and 256GB of RAM, you could possibly run the 4-bit version at a decent speed, since all the active layers would fit in the VRAM.
At least normally, I'm not sure how it scales with a model this big.
Small-Fall-6500@reddit
No, not really. MoE chooses different experts at each layer, and if those experts are not stored on VRAM, you don't get the speed of using a GPU. (Prompt processing may see a significant boost, but not inference without at least most of the model on VRAM / GPUs)
kmouratidis@reddit
You can offload the shared part to the GPU and the experts to the CPU. My rough calculations are 22.5B per expert and 29B for the shared part.
Small-Fall-6500@reddit
I had not looked at this model's specific architecture, so thanks for the clarification.
Looks like there is 1 shared expert, plus another 16 'specialized' experts, of which 1 is chosen per layer. So just by moving the shared expert to VRAM, half of the active parameters can be offloaded to GPU(s), but with rest on CPU, it's still going to be slow compared to full GPU inference. Though 20b on CPU (with quad or octo channel RAM) is probably fast enough to be useful, at least for single batch inference.
fatihmtlm@reddit
Since it's a MoE, it should be faster than a 405B.
a_slay_nub@reddit
If anyone wants to try it, it's on lmarena.
a_beautiful_rhind@reddit
Hope there is a Hunyuan-Medium.
Expensive-Paint-9490@reddit
It's going to be more censored than ChatGPT and there is no base model. But I'm generous and magnanimously appreciate Tencent's contribution.
FuckSides@reddit
The base model is included. It is in the "Pretrain" folder of the huggingface repo. Interestingly the base model has a 256k context window as opposed to the 128k of the instruct model.
charmander_cha@reddit
Where are my 1.58-bit models??
jerryouyang@reddit
The model performs so badly that Tencent decided to open-source it. Come on, open source is not a trash bin.
Healthy-Nebula-3603@reddit
What? Have you seen the table?
thezachlandes@reddit
I wish they’d distill this to something that fits in 128GB RAM MacBook Pro!
lgx@reddit
Shall we develop a single distributed model running on every GPU on earth for all of humanity?
visionsmemories@reddit
This would be perfect for running on a 256GB M4 Max, wouldn't it? Since it's a MoE with only ~50B active params.
Content-Ad7867@reddit
It is a 389B MoE model; to fit the whole model in FP8, at least 400GB of memory is needed. The 50B active params only make inference faster; the other parameters still need to be in memory.
shing3232@reddit
Just settle for Q4. We can do it with hybrid KTransformers, a 24GB GPU, and 192GB of DDR5.
Unfair_Trash_7280@reddit
The M4 Max maxes out at 128GB. You'd need an M4 Ultra with 256GB to run a Q4 of around 210GB. With ~50B active (MoE) and an expected bandwidth of 1TB/s, token generation speed would maybe be about 20 TPS.
Maybe some expert should consider splitting a MoE to run on different machines, so each machine can host 1-2 experts and they connect over the network, since a MoE may not need all 8 routes at once.
visionsmemories@reddit
yup
and pretty sure https://github.com/exo-explore/exo can split moe models too
DFructonucleotide@reddit
They also have a Hunyuan-Standard model up in lmarena recently (which I assume is a different model). We will see its quality in human preference soon.
Wooden-Potential2226@reddit
MoEs FTW!
adt@reddit
https://lifearchitect.ai/models-table/
visionsmemories@reddit
I like that there's GPT-5 and GPT-6.
cantgetthistowork@reddit
Bookmarked for reviews