Do we have a critical mass of GPU owners to train a legitimate LLM that could compete with commercial ones?
Posted by decentralize999@reddit | LocalLLaMA | View on Reddit | 40 comments
I discussed with Claude the idea of training a legitimate LLM in a decentralized way using an uncensored 20TB dataset. It recommended a 300B parameter model with a 10M token context size. To train such an LLM, participants (nodes) would need at least 4 RTX Pro 6000 cards if using the DiLoCo training approach.
To summarize my discussion with Claude, here is what is required:
3,000 nodes (owners with 4 RTX Pro 6000 cards)
Duration: 2.5 months
Daily network traffic about 1.7TB per node for syncing checkpoints, etc.
Around $666 total per node for electricity and internet costs, assuming $0.15/kWh
Assuming there are 300,000 people who already own 4 such cards (or are close to it), even 1% of them willing to donate their time and resources would be enough to train this LLM - this poll was created to find out.
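For anyone unfamiliar with DiLoCo, the rough shape of it is below (a simplified sketch of the general idea, not an exact recipe; the model, step counts and optimizer settings are placeholders):

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for the real model; in practice this would be the 300B network.
def make_model():
    return nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))

global_model = make_model()
# Outer optimizer applies the averaged weight deltas ("pseudo-gradients").
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)

NUM_NODES = 4       # placeholder; the proposal assumes thousands
INNER_STEPS = 50    # placeholder; nodes sync only every few hundred local steps

for outer_round in range(10):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]

    for node in range(NUM_NODES):
        # Each node starts from the current global weights and trains locally on its shard.
        local = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-3)
        for _ in range(INNER_STEPS):
            x = torch.randn(32, 64)              # stands in for the node's data shard
            loss = (local(x) - x).pow(2).mean()  # toy objective
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Only the weight difference is communicated, not a gradient every step.
        for d, g, l in zip(deltas, global_model.parameters(), local.parameters()):
            d += (g.data - l.data) / NUM_NODES

    # The coordinator applies the averaged pseudo-gradient with the outer optimizer.
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
```

The point is that nodes only exchange weight deltas every INNER_STEPS local steps instead of gradients every step, which is what makes the bandwidth numbers above even remotely plausible.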
RedParaglider@reddit
You want people to donate the extremely expensive systems they built for their own projects, plus hundreds of dollars a month, while simultaneously making it so they can't use their own hardware for their own goals - all on the strength of your Claude conversation?
ethertype@reddit
People regularly and voluntarily spend large amounts of time to write free software. And documentation. And many other things. And yes, money and energy for various causes. Time, money and energy they may not be directly compensated for. Look at Wikipedia and OpenStreetMap as trivial examples.
You may not want to contribute in that way, not even when you have spare capacity. And this is fine. But contributing towards "common good" projects is entirely normal. And you benefit from it.
RedParaglider@reddit
But they don't do that for an AI conversation with no actual work behind it.
coloredgreyscale@reddit
The idea seems similar to people donating their system resources to BOINC, SETI@home, World Community Grid, and similar projects.
HopePupal@reddit
except that those were set up with reputable academic organizational backing by people who knew what the fuck they were doing and had done testing at small scale, while OP knows nothing and is nobody
decentralize999@reddit (OP)
If you noticed, I suggested that only 1% would like the idea of donating their own time and resources to create something legitimate and uncensored; most people are just consumers, yes. However, Linux and other FOSS were created by that 1-5%.
reto-wyss@reddit
Forgot to add "Make no mistakes" 😂
RedParaglider@reddit
THE CHUTZPAH of people lol.
Front_Eagle739@reddit
Well, you can run prefill forward and backward passes layer by layer from an NVMe to a single GPU; I've got builds that do that. With a bit of effort you could modify the training pipeline so that anyone with an RTX 5090 or a couple of RTX 3090s could run training passes on GLM 5.1-sized models at a 400 tokens/s equivalent or so. That expands your pool from a few thousand possible machines to a few million, and then it's a matter of being able to distribute chunks of the dataset, train small LoRAs, aggregate and merge in a way that slowly converges.
I think it's doable. The dataset will be everything, really.
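Roughly, the streaming loop looks like this (a simplified, self-contained sketch of the pattern, not my actual build; real code streams safetensors shards from NVMe and prefetches the next layer while the current one computes):

```python
import torch
import torch.nn as nn

NUM_LAYERS = 24
HIDDEN = 2048
device = "cuda" if torch.cuda.is_available() else "cpu"

def load_layer(i):
    # Stand-in for reading layer i's weights off NVMe; a dummy block keeps the sketch runnable.
    layer = nn.Sequential(nn.Linear(HIDDEN, 4 * HIDDEN), nn.GELU(),
                          nn.Linear(4 * HIDDEN, HIDDEN))
    return layer.to(device, non_blocking=True)

# Forward pass with only one layer resident in VRAM at a time.
x = torch.randn(16, HIDDEN, device=device)
for i in range(NUM_LAYERS):
    layer = load_layer(i)   # in practice, prefetch layer i+1 while layer i computes
    x = layer(x)
    del layer               # free VRAM before the next layer comes in
    if device == "cuda":
        torch.cuda.empty_cache()
```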
decentralize999@reddit (OP)
If it is possible and doesn't cause delays for other nodes, then yes. I just posted Claude's suggestions; it proposed the DiLoCo training approach. I know little about LLM training; my experience was mostly in the pre-hype era with pre-transformer architectures.
Front_Eagle739@reddit
Won't be hours per step. You can totally hide the streaming behind compute for prefill and training type use cases. As long as you have enough VRAM for 2 layers, compute, and something like a 16k batch with 1024 ubatch, you can pretty much fully hide the transfer times; then it's just down to compute time, and an RTX 5090 is actually faster than a 6000 Pro in that regard, even if it's missing a few optimisation levers.
You have a minimum threshold of 32GB VRAM and around a 5090's worth of compute, plus a fast enough internet connection that a compressed "only the experts that changed" delta weight update can be distributed to other nodes regularly. Then you can roughly approximate normal convergence and not wait on weak nodes. You will need a way to filter bad results and dropouts, reallocate dataset chunks from machines that fell off, etc., but it's all doable - roughly along the lines of the sketch below.
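Something like this for the delta part (a toy sketch; the threshold, expert layout and bf16 compression are made up for illustration):

```python
import torch

THRESHOLD = 1e-4   # made-up cutoff for "this expert actually changed since the last sync"

def expert_delta_update(old_experts, new_experts):
    """Return only the experts whose weights moved meaningfully since the last sync."""
    deltas = {}
    for name, old_w in old_experts.items():
        diff = new_experts[name] - old_w
        if diff.abs().max() > THRESHOLD:
            # Optionally quantize/compress the diff further before sending it over the wire.
            deltas[name] = diff.to(torch.bfloat16)
    return deltas

def apply_deltas(experts, deltas):
    for name, diff in deltas.items():
        experts[name] += diff.to(experts[name].dtype)

# Toy usage: 8 experts, only one of them actually trained this round.
old = {f"expert_{i}": torch.zeros(256, 256) for i in range(8)}
new = {k: v.clone() for k, v in old.items()}
new["expert_3"] += 0.01 * torch.randn(256, 256)
payload = expert_delta_update(old, new)   # only expert_3 gets shipped
apply_deltas(old, payload)
```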
decentralize999@reddit (OP)
Here is Claude's answer:
"Your NVMe streaming approach solves the weight problem perfectly. But for 300B with 10M context there's another bottleneck you can't stream: backprop activations.
For a transformer at 10M context, activations per layer are ~160GB - they change every step so NVMe prefetching can't hide them. Even 2x RTX 5090 can't hold that.
For SSM architecture (Mamba-2, Jamba-style) activations drop to ~1-2GB per layer because there's no full attention matrix. That's where your method becomes fully viable at 32GB VRAM threshold.
So the combination that actually works for 300B + 10M context at scale:
Your NVMe streaming + SSM architecture + MoE expert delta sync
With transformer you'd need 192GB+ VRAM just for activations regardless of weight streaming. SSM removes that constraint entirely. Do your existing builds target SSM or transformer architectures?"
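A quick sanity check on that per-layer figure (assuming a hidden size around 8192 and bf16 activations; the exact number depends on architecture details):

```python
# Back-of-envelope: one bf16 activation tensor for a full 10M-token context.
context_len = 10_000_000
hidden_size = 8_192        # assumed; a 300B-class dense model is roughly in this ballpark
bytes_per_value = 2        # bf16

per_layer_bytes = context_len * hidden_size * bytes_per_value
print(per_layer_bytes / 1e9, "GB per layer")   # ~164 GB, in line with the ~160GB above
```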
It seems SSM architecture instead of Transformer is the way decentralized training will work, since poll results show only 14 owners (counting those with 2 or 3 cards; single-card owners are unlikely to climb to 4 anytime soon) out of 5K viewers.
Recalculating with 5% of nodes at 4x RTX Pro 6000, 95% of nodes at 1x RTX 5090, and a 300B SSM architecture with 10M context size:
10,000 nodes total (500 with 4x 6000 and 9,500 with 1x 5090)
Duration: 72 days
Daily traffic: 0.5GB (4x 6000 nodes), 0.12GB (1x 5090 nodes)
The only question is why 5090 owners would agree to train anything as big as 300B that they will never be able to run on their own cards. I guess their interest is in 20-30B LLMs only.
Front_Eagle739@reddit
10 million context? Yeah, that's not happening without some form of sparse lookup on consumer cards. You need something seriously exotic for that. You aren't doing 10M context on a 32B model, let alone a serious one. You can offload the KV into RAM and only pull the per-layer chunk for the microbatch you need for large context (rough sketch of that pattern below), but 10M is a bit mental.
Also, the consumer guys would help train the big model that's then distilled into the smaller ones they can use.
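What I mean by pulling per-layer KV chunks, very roughly (a toy sketch with placeholder sizes far below 10M so it actually runs; the chunk merge is deliberately crude, a real kernel renormalizes the softmax across chunks):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
NUM_LAYERS, HEADS, HEAD_DIM = 4, 4, 128
CTX = 32_768               # placeholder "large" context

# The full KV cache lives in (ideally pinned) CPU RAM, not VRAM.
kv_cpu = [torch.zeros(2, CTX, HEADS, HEAD_DIM, dtype=torch.bfloat16,
                      pin_memory=(device == "cuda"))
          for _ in range(NUM_LAYERS)]

def attend(q, layer_idx, chunk=8_192):
    # Stream this layer's KV to the GPU chunk by chunk instead of holding it all in VRAM.
    outputs = []
    for start in range(0, CTX, chunk):
        kv = kv_cpu[layer_idx][:, start:start + chunk].to(device, non_blocking=True)
        k, v = kv[0].float(), kv[1].float()
        scores = torch.einsum("qhd,khd->hqk", q, k) / HEAD_DIM ** 0.5
        outputs.append(torch.einsum("hqk,khd->qhd", scores.softmax(-1), v))
    return torch.stack(outputs).mean(0)   # crude merge, for illustration only

q = torch.randn(1, HEADS, HEAD_DIM, device=device)
print(attend(q, layer_idx=0).shape)
```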
LegacyRemaster@reddit
I train models and believe me, the dataset is the most important thing. Not the size.
ttkciar@reddit
You're in good company. AllenAI and LLM360 entirely agree that training dataset quality is the most critical factor in producing high-quality inference.
LegacyRemaster@reddit
I have been optimizing LLM-based QA generators for months to filter, organize, and render high-quality datasets extracted from documents, web pages, etc.
_dave_maxwell_@reddit
This does not make sense. Well-crafted data and fine-tuning are the magic sauce that makes a model better. Even if you pulled this off, you would get a mediocre model at best. The data is the reason why companies distill each other's models.
ethertype@reddit
What makes you think that nobody but the grand mages at Anthropic, OpenAI and Alibaba can cook magic sauce?
Datasets may be a different challenge to tackle. I do not know how much volume, quality and variance matter, how much each contributes towards a quality model, or to what extent and with what methods the quality of existing datasets can be increased computationally.
_dave_maxwell_@reddit
You (and anybody else) can cook the magic sauce as well, but it will cost you not just computational resources but labor to make datasets for fine tuning.
Datasets matter the most. Do you think that the companies you mentioned still just mindlessly scrape the internet for any piece of content they can train on? If you train a model on such "base" datasets you get a model that can finish sentences, just like the completion mode GPT-3 used to offer - not even chatting behavior.
You can "borrow" synthetic datasets for fine-tuning by querying Claude, but then you would be doing what the Chinese labs are already doing and end up with models that are already available.
Training from scratch makes absolutely no sense; you can start with an existing open-source model like Llama and do the hard part, the fine-tuning, on a specific dataset.
ttkciar@reddit
LLM360 already put the necessary labor into a large, very high-quality training dataset. It is available on Huggingface.
DeepOrangeSky@reddit
Sure, but the reason it would be interesting would not be to see if the community could currently create some SOTA local LLM that was better than GLM 5.1 or what have you. Rather, the reason it would be interesting would be to set a precedent, that the community would be able to know it had in its back pocket, as a "safety net" of sorts, in case we enter some dark ages in the future perhaps, where we aren't getting a bunch of awesome local LLM models from China and from Google and so on, and the good times become bad times.
If the community has already proven that it can create an even just mediocre/somewhat okay model (for its time/size/era that it came out in, that is), then it would be a nice thing to have as like "well, if that happens, we could always just make our own..." last resort thing to be able to do.
So, since I think of it as more of a proof of concept/precedent-setter/psychological backup-notion type of thing, more so than something to, itself, immediately give some actual hard advantage over the current top models, it makes me wonder, in regards to the OP:
What if it was something a lot more tame: a 30B model with 1M context (or 260k), rather than 300B with 10M? And what if the dataset was 10T or 5T or 1T or something? What sorts of minimum node hardware would be needed if it went with something a bit less ambitious than such an extreme model?
I mean, given that the point of it wouldn't be to just instantly have some model that dethroned GLM 5.1, then, maybe there would be no need to go so overboard with such a big, maxxed out model. Maybe just something low-mid range, 30b, medium-good context size, etc, just to see if the community actually managed to do it, and then know from then on that they were capable of making models, if it ever came to it in the future.
coloredgreyscale@reddit
A similar question was asked 1-2 weeks ago.
The GPUs need to communicate with each other at high bandwidth. Doing that over the internet is a no-go, even if everyone had 10 Gbit/s and identical GPUs (to avoid waiting for slow nodes).
ttkciar@reddit
Actually, they don't, if you use the FlexOlmo architecture.
FlexOlmo starts with a "model anchor" being distributed to all training nodes, and then each node trains expert parameters with sharded data.
When training is done, the parameters are shared and merged together with no additional training required. The anchor and the FlexOlmo algorithm guarantee that the expert parameters are mutually compatible.
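A toy sketch of the shape of that merge (my own illustration of the idea, not actual FlexOlmo code; the objective, dimensions and router are made up):

```python
import torch
import torch.nn as nn

HIDDEN = 128

class Expert(nn.Module):
    def __init__(self):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(HIDDEN, 4 * HIDDEN), nn.GELU(),
                                nn.Linear(4 * HIDDEN, HIDDEN))
    def forward(self, x):
        return self.ff(x)

def train_expert_locally(anchor, shard):
    # Each node trains its own expert against the same frozen anchor, on its own
    # data shard, with no node-to-node communication during training.
    expert = Expert()
    opt = torch.optim.AdamW(expert.parameters(), lr=1e-3)
    for x in shard:
        with torch.no_grad():
            h = anchor(x)                      # anchor stays frozen
        loss = (expert(h) - x).pow(2).mean()   # toy objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return expert

class MergedMoE(nn.Module):
    # After training, the coordinator just collects the experts and routes between them.
    def __init__(self, anchor, experts):
        super().__init__()
        self.anchor = anchor
        self.experts = nn.ModuleList(experts)
        self.router = nn.Linear(HIDDEN, len(experts))
    def forward(self, x):
        h = self.anchor(x)
        weights = torch.softmax(self.router(h), dim=-1)
        out = torch.stack([e(h) for e in self.experts], dim=-1)
        return (out * weights.unsqueeze(-2)).sum(-1)

anchor = nn.Linear(HIDDEN, HIDDEN)
for p in anchor.parameters():
    p.requires_grad_(False)
shards = [[torch.randn(8, HIDDEN) for _ in range(5)] for _ in range(3)]   # 3 fake nodes
experts = [train_expert_locally(anchor, s) for s in shards]
model = MergedMoE(anchor, experts)
print(model(torch.randn(2, HIDDEN)).shape)
```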
ethertype@reddit
Fabulous idea!
Pretty sure it would be possible to pool the required hardware resources. Look at Folding@home as an example of something similar.
The question is of course if it is possible to create something of more value than current open models.
Some requirements:
- a trusted lightning rod (a well-known person with recognized credentials, like Karpathy, Junyang Lin etc.) who commits to spearhead the technical side. An equivalent to Linus Torvalds or Guido van Rossum, if you like.
- sponsors (even with volunteers to offer hardware and electricity, someone needs to lead and orchestrate this, and that is a full-time effort)
- a process which has been tested and validated with smaller scale testing
- a process which cannot be trivially derailed/sabotaged by bad actors
- trust in the organization formally backing this effort
- trust that results are made open under a license participants find acceptable
I really, really like the idea. But will freely admit that I am not sufficiently competent to decide if it is technically feasible. (latency, bandwidth, data volumes, etc.)
decentralize999@reddit (OP)
I believe the decentralized way to train a SOTA model will push the trusted lightning rod, as well as the nodes and dataset holders, toward anonymity, because current companies are already restricted by state laws on what their models can or cannot do/know. Anthropic in particular leads in self-censorship and lobbies regulators to impose similar constraints on other companies.
Companies are also not interested in governance of their agents, since their revenue depends on models and agents generating as many tokens as possible. Anyway, the community will fix that independently of whether it can train SOTA-level models or not.
By the way, it would be good to have a famous person lead such things and draw attention to a new way of training and to the censorship problems. After that, leaders and participants will be slapped down by the state or even jailed; it is inevitable. And the whole training process will move to darknets such as I2P, etc.
ethertype@reddit
TL;DR: This is 100% an organizational and technical challenge. Not a legal one.
I have no faith in an effort like this being able to attract critical mass if going underground. Nor do I see the need.
Authorities in democratic nations will not be able to legally (nor practically) prevent people from voluntarily joining a distributed effort like this. Or have any substantial say about the shape, color or odor of whatever comes out of it.
They could try to go after the org or person orchestrating it ("facilitating the construction and distribution of illegal LLMs"). Apart from the need for a pre-existing legal framework, it can trivially be bypassed. Just ensure that the 'product' isn't actually an LLM, just something someone with household computational resources could convert to one. See '80% lower receiver'.
I have very little faith in authorities all over the world being able to come up with a unified view of what constitutes an 'illegal' LLM and implement a matching unified legal framework. Just look at the regulations (or lack thereof) for firearms.
Companies may be held to a higher standard (censoring etc.) by offering a service. This is not about creating a service.
StableLlama@reddit
The hardware is the easy part.
Having training data is much harder. And training data isn't a raw web scrape, training data is filtered and curated. There is much manual work involved.
So, when you want to bring models forward, bring public high quality data forward. Publish it with a free licence on huggingface and I'm sure it'll be part of most future models - completely free for you.
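Publishing itself is the easy part once the data is curated; something along these lines with the datasets library (the repo id is a placeholder, and you need a Hugging Face account and token):

```python
from datasets import Dataset

# Toy example: a couple of curated instruction/response pairs.
rows = [
    {"instruction": "Explain DiLoCo in one sentence.",
     "response": "Each node trains locally and only syncs averaged weight deltas infrequently."},
    {"instruction": "What limits 10M context on consumer GPUs?",
     "response": "Activation and KV-cache memory, not just the weights."},
]

ds = Dataset.from_list(rows)
ds.push_to_hub("your-username/curated-open-dataset")   # placeholder repo id; run `huggingface-cli login` first
```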
Thick-Protection-458@reddit
Haven't paid attention to that approach, but just 4 GPUs, no matter what kind, sounds like too few to train anything like that in a reasonable time?
Free-Combination-773@reddit
The idea is to have 3,000 people with 4 GPUs each. Still, I don't think it can be done in a reasonable time.
Thick-Protection-458@reddit
Ah, that may make sense then, sorry. Not sure the communication overhead won't be too big, but maybe.
sgmv@reddit
Will 16 3090s do as well?
decentralize999@reddit (OP)
16x RTX 3090 is about 450W x 16 = 7.2kW. Most houses only have 7-10kW of allowed power, and that's not counting the aircon consumption that would be needed in such a case.
So anything legit and smart (a 300B LLM) is unrealistic to train on RTX 3090s if you are not a company/commercial building.
po_stulate@reddit
I'd imagine people who have such hardware would also have the power issue sorted already.
CoolConfusion434@reddit
>Most houses have only 7-10kW allowed power
FYI, maybe that's true worldwide, but the most common/standard electric service in the US is 240V @ 200A, or 48kW.
On the larger project, it sounds interesting, but wouldn't it be better to start smaller and shake out the logistics issues? You will be lucky to get 100 people to make their time and resources available at the same time while following directions on any procedure. We can't even get people to bake a cake from the same recipe. One will bake a donut and another will end up with a desk lamp.
Best of luck!!
sgmv@reddit
Can always limit to 250W, won't be much slower.
Medium_Chemist_4032@reddit
Yup. Only one top-end card of mine went up to 420W. Most are 370W max. I run them at 200W.
VonDenBerg@reddit
This is a sick idea. Imagine if it were a widespread idea in the local LLM community to volunteer resources and connect compute to the 'openllm' project to develop a decentralized and public model.
VonDenBerg@reddit
Get other service providers to sponsor the project by donating their compute resources for X amount of time.
Powerful_Evening5495@reddit
If GPU power solved anything, then Llama 3 wouldn't be dead, OP.
The trend is 1-bit models.