If my understanding is correct, it uses full sharding, so each matrix is split across GPUs and it needs to communicate a lot to combine the results of all the matrices at each layer? Or perhaps it is a different kind of parallelism that uses less communication? Perhaps dividing up attention heads or something?
I usually use o3 mini or Claude, but on occasion I run a 14b model locally lol. I get like 23 t/s… I can't imagine running Llama 405b on my machine, it would crash my system and shorten the lifespan of my SSD.
2 rigs with the inference distributed across the network; my slower rig is a 3060 and 3 P40s. If it were 4 3090's I'd probably see 5 tk/s. I'm also using llama.cpp, which is not as fast as vLLM.
I would like to build a rig like this for myself, but I don't know where to start. I considered ordering a cryptocurrency mining rig (like yours, it uses a set of RTX 3090s), but I am not sure whether it would work for AI, or whether it would be any good.
Do you have a step-by-step tutorial that I can follow?
Most server-type motherboards allow bifurcation on just about every PCIe slot, but for normal consumer motherboards it is really up to the maker at that point. For the splitter cards you can just google 'bifurcation card' and you'll get tons of results, from postings on Amazon to eBay.
M3 Ultra is probably going to pair really well with R1 or DeepSeek V3,
Could see it doing close to 20T/s
due to having decent memory bandwidth and no overhead hopping from GPU to GPU.
But it doesn't have the memory bandwidth for a huge non-MoE model like 405B,
Would do something like 3.5T/s.
I've been working on this for ages,
But if I was starting over today I would probably wait to see if the top Llama 4.0 model is MoE or dense.
With what the 3090's are going for today (~$1000) you could make a nice profit... ;)
What would the advantage of running 405b over 671b be in output (quality)? Or is this just a long-running project you wanted to finish? AI/LLM development is going so darned fast that by the time you buy/build X, Y is already doing it faster, cheaper, and better...
I'm more curious about the M4 studio. The rig OP has should be able to fit Q4 deepseek R1, unless my math is wrong. Would be interesting to see how it performs
Nice build. I highly recommend you upgrade your fan to a box fan that you can set behind the rig (give it an inch of clearance for some air intake) so that you can push air out across all the cards.
You can get 4x4 x16 switches. It might not help with average bandwidth per card, but if you configure them in a mix of tensor and pipeline parallelism, you'll have enough request throughput to compete with (non-A100/H100/H200) enterprise servers.
When you run the math, large fans like that move enormous amounts of cubic feet of air compared to desktop fans. Blade size is a major factor in the amount of air that is moved.
I'm in my 3rd month of planning, gathering all the parts, reading, saving money... for my 4x3090 build. Then there's this guy :D Congratulations, amazing build, one of the GOATs here and it goes into my bookmarks folder.
Rig looks amazing ngl. Since you mentioned 405b, are you actually running it? Kinda wonder what performance in a multiagent setup would be, with something like 32b QwQ, smaller models for parsing, maybe some long-context Qwen 14B-Instruct-1M (120/320gb vram for 1m context per their repo) etc running at the same time :D
Mountain_Chicken7644@reddit
you are one lucky motherf-
Conscious_Cut_6144@reddit (OP)
Got a beta BIOS from Asrock today and finally have all 16 GPUs detected and working!
Getting 24.5T/s on Llama 405B 4bit (Try that on an M3 Ultra :D )
Specs:
16x RTX 3090 FE's
AsrockRack Romed8-2T
Epyc 7663
512GB DDR4 2933
Currently running the cards at Gen3 with 4 lanes each,
Doesn't actually appear to be a bottleneck based on:
nvidia-smi dmon -s t
showing under 2GB/s during inference.
I may still upgrade my risers to get Gen4 working.
Will be moving it into the garage once I finish with the hardware,
Ran a temporary 30A 240V circuit to power it.
Pulls about 5kw from the wall when running 405b. (I don't want to hear it, M3 Ultra... lol)
Purpose here is actually just learning and having some fun,
At work I'm in an industry that requires local LLMs.
Company will likely be acquiring a couple DGX or similar systems in the next year or so.
That and I miss the good old days having a garage full of GPUs, FPGAs and ASICs mining.
Got the GPUs from an old mining contact for $650 a pop.
$10,400 - GPUs (650x16)
$1,707 - MB + CPU + RAM (691+637+379)
$600 - PSUs, Heatsink, Frames
---------
$12,707
+$1,600 - If I decide to upgrade to gen4 Risers
Will be playing with R1/V3 this weekend,
Unfortunately, even with 384GB, fitting R1 with a standard 4-bit quant will be tricky.
And the lovely Dynamic R1 GGUFs still have limited support.
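Rough math on why 384GB is tight (the bits-per-weight figure is my ballpark for a Q4_K_M-class quant, not a measurement):

```python
# Back-of-the-envelope VRAM math for R1 at a "standard" ~4-bit GGUF quant.
params = 671e9                      # DeepSeek R1 stores all experts, ~671B params
bits_per_weight = 4.8               # roughly what Q4_K_M averages out to (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
vram_gb = 16 * 24                   # 16x 3090
print(f"{weights_gb:.0f} GB of weights vs {vram_gb} GB of VRAM")  # ~403 GB vs 384 GB, before KV cache
```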
NeverLookBothWays@reddit
Man that rig is going to rock once diffusion based LLMs catch on.
Sure_Journalist_3207@reddit
Dear gentleman would you please elaborate on Diffusion Based LLM
Freonr2@reddit
TLDR: instead of iteratively predicting the next token from left to right, it guesses across the entire output context, more like editing/inserting tokens anywhere in the output on each iteration.
Ndvorsky@reddit
That’s pretty cool. How does it decide the response length? An image has a predefined pixel count but the answer of a particular text prompt could just be “yes”.
Freonr2@reddit
I think same as any other model: it puts an EOT token somewhere, and I think for a diffusion LLM it just pads the rest of the output with EOT. I suppose it means your context size needs to be sufficient though, and you end up with a lot of EOT padding at the end?
330d@reddit
https://x.com/karpathy/status/1894923254864978091
Thesleepingjay@reddit
Wow, it's so fast it looks like magic. Thanks for sharing.
NeverLookBothWays@reddit
This is a good overview of the breakthrough: https://youtu.be/X1rD3NhlIcE
https://aimresearch.co/ai-startups/diffusion-models-enter-the-large-language-arena-as-inception-labs-unveils-mercury
Magnus919@reddit
Let me ask my LLM about that for you.
NihilisticAssHat@reddit
I haven't seen anything about that context window. I feel like that would be the most significant limitation.
NeverLookBothWays@reddit
Here’s a brief overview of it I think explains it well: https://youtu.be/X1rD3NhlIcE (Mercury Coder)
I haven’t seen anything yet for local, but pretty excited to see where it goes. Context might not be too big of an issue depending on how it’s implemented.
NihilisticAssHat@reddit
I just watched the video. I didn't get anything about context length, mostly just hype. I'm not against diffusion for text, mind you, but I am concerned that the context window will not be very large. I only understand diffusion through its use in imagery, and as such realize the effective resolution is a challenge. The fact that these hype videos are not talking about the context window is of great concern to me. Mind you, I'm the sort of person who uses Gemini instead of ChatGPT or Claude for the most part simply because of the context window.
Locally, that means preferring Llama over Qwen in most cases, unless I run into a censorship or logic issue.
NeverLookBothWays@reddit
True, although with the compute savings there may be opportunities to use context window scaling techniques like LongRoPE without massively impacting the speed advantage of diffusion LLMs. I am certain if it is a limitation now with Mercury it is something that can be overcome.
It’s currently 128k tokens for Mercury Coder
rog-uk@reddit
Will be interesting to see how long it takes for an opensource D-LLM to come out, and how much VRAM/GPU they need for inference. Nvidia won't thank them!
xor_2@reddit
Do diffusion LLMs scale better than auto-regressive LLMs?
From what I read I cannot parallelize stupid flux.1-dev on two GPUs so I have my doubts.
Optifnolinalgebdirec@reddit
When will we get it? anthropic
MetricVoidLX@reddit
Are you sure about not being bandwidth bottlenecked...?
The theoretical bandwidth of PCIe 3.0 x4 is 3.938 GB/s bidirectional, which is around 2GB/s in a single direction. vLLM uses tensor parallelism, which should demand pretty high bandwidth between cards.
I had a similar setup with older Nvidia GPUs in a server. Both ran on PCIe 3.0 x16, but the training performance took a severe hit, even compared to a single-card setup.
Conscious_Cut_6144@reddit (OP)
Training would for sure be bottlenecked with my setup.
It loads models onto a single card at 3.6GB/s, but inference never goes above 2.
Possible that I don't have the resolution to see the bottleneck. For example, it could be doing 3.6GB/s half the time and idle the other half, but switching faster than nvidia-smi can pick up on.
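If I wanted to rule out the sampling-resolution issue, something like this (a rough sketch using the nvidia-ml-py/pynvml bindings, not what I actually ran) would catch short bursts better than dmon's 1-second readout, since NVML's PCIe counter is a ~20ms snapshot:

```python
# Quick-and-dirty higher-resolution PCIe TX sampler across all GPUs.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

peak_tx_kb = 0
t_end = time.time() + 10            # sample for 10 seconds while a prompt runs
while time.time() < t_end:
    for h in handles:
        kb_s = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
        peak_tx_kb = max(peak_tx_kb, kb_s)
print(f"peak TX seen: {peak_tx_kb / 1e6:.2f} GB/s")
pynvml.nvmlShutdown()
```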
hotdogwallpaper@reddit
what line of work are you in?
Stunning_Mast2001@reddit
What motherboard has so many pcie ports??
Conscious_Cut_6144@reddit (OP)
Asrock Romed8-2T
7 x16 slots,
Have to use 4x4 bifurcation risers that let you plug 4 GPUs into each slot.
CheatCodesOfLife@reddit
Could you link the bifurcation card you bought? I've been shit out of luck with the ones I've tried (either signal issues or the GPUs just kind of dying with no errors)
Conscious_Cut_6144@reddit (OP)
If you have one now that isn't working, try dropping your PCIe link speed down in the BIOS.
A lot of the stuff on Amazon is junk,
This one works fine for 1.0 / 2.0 / 3.0
https://riser.maxcloudon.com/en/bifurcated-risers/22-bifurcated-riser-x16-to-4x4-set.html
Haven't tried it yet, but this is supposedly good for 4.0
https://c-payne.com/products/slimsas-pcie-gen4-host-adapter-x16-redriver
https://c-payne.com/products/slimsas-pcie-gen4-device-adapter-x4
https://c-payne.com/products/slimsas-sff-8654-8i-to-2x-4i-y-cable-pcie-gen4
CheatCodesOfLife@reddit
Cool, you were right. My ones must be junk. I bought an nvme -> pcie 4x adapter, plugged a riser into that, then added my 6th 3090 and it works!
I'll try some others, but could settle for x4 for the last 2 cards if I can't get x8 working.
fightwaterwithwater@reddit
Just bought this and, to my great surprise, it's working fine for x4/x4/x4/x4: https://www.aliexpress.us/item/3256807906206268.html?spm=a2g0o.order_list.order_list_main.11.5c441802qYYDRZ&gatewayAdapt=glo2usa
Just need some cheapo oculink connectors.
cantgetthistowork@reddit
Cpayne is decent, but I've had a bunch of them arrive defective and only register as x2.0. The ones that work are great though. Only problem is there's no 4x4.0 riser, so I could only fit 13 on my Romed8-2T.
Conscious_Cut_6144@reddit (OP)
The 3 links I posted were 4x4.0 no? Poor QC is a shame, especially on stuff coming overseas.
Radiant_Dog1937@reddit
Oh, those work? I've had 48gb worth of AMD I could have been using the whole time.
cbnyc0@reddit
You use risers, which split the PCIe interface out to many cards. It’s a type of daughterboard. Look up GPU risers.
misteick@reddit
yes, but how much does the fan cost? I think it's the MVP
Aphid_red@reddit
So you're seeing 24.5T/s out of a theoretical maximum of 63 T/s, getting about 38.9% of the theoretical performance.
I'm assuming though, that since there are only 8 key-value heads, that what your inference software is doing is first a layer-split in two, then tensor parallel 8-way. With that setup, you're really getting 77.8% of the true value, which looks much more realistic in terms of usable memory bandwidth.
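Rough arithmetic behind those percentages (the ~800 GB/s effective bandwidth per 3090 and ~203 GB of 4-bit weights are my assumptions, not OP's numbers):

```python
# Where the ~63 T/s ceiling and the two ratios come from, roughly.
weights_gb = 405e9 * 0.5 / 1e9            # ~202.5 GB of 4-bit weights
eff_bw_per_gpu = 800                      # GB/s, assumed effective (of 936 GB/s peak)
n_gpus = 16

# Ideal case: all 16 GPUs stream their shard of the weights in parallel each token.
ceiling_tp16 = eff_bw_per_gpu / (weights_gb / n_gpus)   # ~63 tokens/s
measured = 24.5
print(measured / ceiling_tp16)            # ~0.39 of the ideal ceiling

# If the software really does 2-way layer split x 8-way tensor parallel, only 8
# GPUs stream weights at any instant, which halves the ceiling.
ceiling_pp2_tp8 = ceiling_tp16 / 2        # ~31.5 tokens/s
print(measured / ceiling_pp2_tp8)         # ~0.78 of that ceiling
```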
RevolutionaryLime758@reddit
You spend $12k for fun!?
330d@reddit
People have motorcycles that are parked most of the time, yet cost more and come with a real risk of dying on the road. I can totally see how spending $12k this way makes a lot of sense! If he wants, he can resell the parts and recoup the cost; it's not all money gone. In the end the fun may even turn out to be free.
alphaQ314@reddit
I'm okay with spending 12k for fun haha. But can someone explain why people are building these rigs? Just to host their own models?
What's the advantage, other than privacy and lack of censorship?
For an actual business case, wouldn't it be easier to just spend the 12k on one of the paid models?
mintybadgerme@reddit
I think you're missing the point completely. It's the difference between somebody else owning your AI, and you having your own AI in the basement. Night and day.
alphaQ314@reddit
I am. I don't get it. That's why I'm trying to understand from you guys to join in on the fun.
mintybadgerme@reddit
Fair enough. :)
Blizado@reddit
Are privacy and censorship not already enough? Also, you can experiment a lot more locally on the software side and adjust it however you want. With the paid models you are much more bound to the provider.
anthonycarbine@reddit
This too. It's any AI model you want on demand. No annoying sign ups, paywalls, queues, etc etc.
hwertz10@reddit
Nice! I mean it's costly, but it's not like there's any INexpensive way to get 384GB VRAM and all that. And it's nice to know that LLM work doesn't push the PCIe bus, since if I ever added additional GPUs to my system it'd most likely be via the Thunderbolt ports on it (which I'm sure aren't going to match the speed of my internal PCIe slots.)
polandtown@reddit
Lovely, would LOVE a video walkthrough of the setup, giving as much detail as possible on the config and everything you considered during the build.
Could you expand on your riser situation? I'm currently using a Veddha frame (in my case with old mining GPUs), but they're all running on x1 PCIe lanes. It's my understanding that those risers cannot run above that. Care to comment?
Conscious_Cut_6144@reddit (OP)
This one works fine for 1.0 / 2.0 / 3.0
https://riser.maxcloudon.com/en/bifurcated-risers/22-bifurcated-riser-x16-to-4x4-set.html
Haven't tried it yet, but this guy sells stuff for 4.0 and even 5.0
https://c-payne.com/products/slimsas-pcie-gen4-host-adapter-x16-redriver
https://c-payne.com/products/slimsas-pcie-gen4-device-adapter-x4
https://c-payne.com/products/slimsas-sff-8654-8i-to-2x-4i-y-cable-pcie-gen4
Both of these stores offer 4x and 8x lane options, assuming your board supports bifurcation.
Pedalnomica@reddit
The maxcloudon ones are gen 3, and the redriver is expensive. I needed the redriver on slot two of that board to avoid PCIe errors, but I'm finding the much cheaper https://www.sfpcables.com/pcie-to-sff-8654-adapter-for-u-2-nvme-ssd-pcie4-0-x16-2x-8i-sff-8654 works fine for the other PCIe slots.
Conscious_Cut_6144@reddit (OP)
Interesting, slot 2 has some extra logic for swapping between m.2, oculinks and the slot so that one being weaker would make sense.
I’ll have to try not using it…
polandtown@reddit
this is fantastic, thank you!
David202023@reddit
Very impressive!
What are you going to do with it? If training from scratch, what model size could this build support?
alluringBlaster@reddit
If you don't mind me asking, how did you break into a career that lets you afford/play with all this tech? Working at a company focused on LLM sounds amazing. Did you go to college or just have incredibly fleshed out leetcode page? Really hope to be in those shoes one day.
azaeldrm@reddit
Thank you for the "why" lmao this is insane. I just bought a second 3090 for my server rig, so looking forward to play with that. This looks beautiful!
mp3m4k3r@reddit
Temp 240VAC@30A sounds fun. I'll raise you a custom PSU that uses forklift power cables to serve up to 3600W of used HPE power into a 1U server too wide for a normal rack.
Clean_Cauliflower_62@reddit
Gee, I've got a similar setup, but yours is definitely way better put together than mine.
mp3m4k3r@reddit
Highly recommend these awesome breakout boards from Alkly Designs; they work a treat for the 1200W ones I have. The only caveat is that the outputs are 6 individually fused terminals, so I ended up doing kind of a cascade to get them onto the larger gauge wire going out. Probably way overkill, but it works pretty well overall. Plus with the monitoring boards I can pick up telemetry from them in Home Assistant.
Clean_Cauliflower_62@reddit
Wow, I might look into it, very decently priced. I was gonna use a breakout board but I bought the wrong one from eBay. Was not fun soldering the thick wire onto the PSU 😂
mp3m4k3r@reddit
I can imagine. There are others out there, but this designer is super responsive and they have pretty great features overall. I definitely chatted with them a ton about this while I was building it out, and it's been very solid for me, other than one of the PSUs being from a slightly different manufacturer, so the power profile on that one is a little funky; not a fault of the breakout board at all.
Clean_Cauliflower_62@reddit
What GPUs are you running? I've got 4 V100 16GB cards running.
mp3m4k3r@reddit
4xA100 Drive sxm2 modules (32gb)
Clean_Cauliflower_62@reddit
Oh boy, it actually works😂. How much vram do you have? 32*4?
mp3m4k3r@reddit
Definitely aren't working with nvlink in this gigabyte server, and they can definitely overheat lol
Clean_Cauliflower_62@reddit
I would be surprised if NVLink works. I had an idea earlier to connect a second server's SXM board directly into the first one. There are some empty PCIe slots on there. Maybe we can get 8 GPUs working 😂😂.
mp3m4k3r@reddit
Ha, maybe. I think someone got them to do NVLink with the PCIe slot adapter, but at like $300/card that's a tough experiment lol
Oh, and they also do not thermal throttle. I dunno what they did to the BIOS in these, but they're definitely intended for one purpose lol
Clean_Cauliflower_62@reddit
Yeah, 300 is actually a pretty good deal. Are you talking about the card or the adapter? The card is going for like 600 on eBay rn. I think SXM2 is the only option if you wanna try out SXM. Other generations are just so expensive.
mp3m4k3r@reddit
Just the adapter, still seeing the cards around $1200, so $1500 total.
It's fine overall, they can link via PCIe anyway. Having some pain getting them to perform better due to the tuning parameters for each hosting container. I threw up some benchmark data I've gathered so far, but I'm trying to also add in tensorrt-llm before I start tuning each a bit further to see what helps.
Probably use the adapter with one of the V100s to toss it in my other server for stuff
Clean_Cauliflower_62@reddit
Too much power for 1u server haha. It already sounds like a jet engine, can’t imagine what yours sounds like😂
mp3m4k3r@reddit
Thankfully it's in the garage. I have the fans tuned down a bit, but tbh I am likely going to take it apart and throw it in a custom immersion tank to have as a wall piece on top of hosting models.
Clean_Cauliflower_62@reddit
Wow, good luck to you. I wanted to do that a while ago, but it sounds like a big project; it will definitely make it quiet though. Are you gonna run mineral oil?
mp3m4k3r@reddit
Yeah, it does sound like fun though! Nah looking at ElectroCool from Engineered Fluids instead, more expensive but also nontoxic and designed for the purpose.
mp3m4k3r@reddit
It does but still more tuning to be done, trying out tensorrt-llm/trtllm-serve if I can get Nvidia containers to behave lol
davew111@reddit
No no no, has Nvidia taught you nothing? All 3600w should be going through a single 12VHPWR connector. A micro usb connector would also be appropriate.
Conscious_Cut_6144@reddit (OP)
Nice, love repurposing server gear.
Cheap and high quality.
makhno@reddit
Dope!! :D
jrdnmdhl@reddit
I was wondering why it was starting to get warmer…
WeedFinderGeneral@reddit
OP needs to figure out how to get his rig to double as a steam turbine to help offset to power costs
Dry_Parfait2606@reddit
Climate change probably be solved by AGI
marc5255@reddit
It'll be eye-opening when AGI says: "There's no possible solution, just damage control at this point. Earth will return to pre-Industrial Revolution climate in 60,000 years if human activity is reduced to 0 today."
jrdnmdhl@reddit
Might even be a few people still left when it does.
akerro@reddit
climate change is now the goal
Take-My-Gold@reddit
I thought about climate change but then I saw this dude’s setup 🤔
jrdnmdhl@reddit
Summer, climate change, heat wave...
These are all just words to describe this guy generating copypastai.
EFspartan@reddit
Jesus, here I am trying to get 4 3090's working and it's been a pain just setting it up. Although I did convert all of mine into water cooled loops...because I didn't want to hear it running.
SanDiegoDude@reddit
Goddamn, I salute your dedication to "I just want something local to fuck around with"
No-Upstairs-194@reddit
Llama 405B on an M3 Ultra 512GB: does it give 15t/s? I wonder about that. If so, I'd prefer the M3 Ultra (with an estimated 450W). Don't you think it would make more sense?
ortegaalfredo@reddit
I think you can get way more than 24 T/s; that is single-prompt. If you do continuous batching, you will get perhaps >100 tok/s.
Conscious_Cut_6144@reddit (OP)
I should probably add 24T/s is with spec decoding.
17T/s standard
Have had it up to 76T/s with a lot of threads.
sunole123@reddit
How do you do continuous batching??
AD7GD@reddit
Either use a programmatic API that supports batching, or use a good batching server like vLLM. But it's 100 t/s aggregate (I'd think more, actually, but I don't have 16x 3090 to test)
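A minimal sketch of the programmatic route with vLLM's offline API (the checkpoint name and tensor_parallel_size are placeholders, not a tested config):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="<your-405B-AWQ-checkpoint>", tensor_parallel_size=16)
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize document {i} in one sentence." for i in range(64)]
# One generate() call lets the engine batch all 64 requests together, so the
# aggregate tokens/s is far higher than the single-stream number.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])
```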
Wheynelau@reddit
vLLM is good for high throughput, but seems to struggle a lot with quantized models. Have tried them with gguf models before for testing.
Conscious_Cut_6144@reddit (OP)
GGUF can still be slow in VLLM but try an AWQ quantized model.
cantgetthistowork@reddit
Does that compromise on single client performance?
Normal-Context6877@reddit
That isn't even that bad for that many 3090s.
Dry_Parfait2606@reddit
THAT'S what I'm talking about!!!
Massive-Question-550@reddit
Curious what the point of 512 GB of system ram is if it's all run off the GPU's vram anyway? Also what program do you use for the tensor parallelism?
Conscious_Cut_6144@reddit (OP)
Vllm. Some tools like to load the model into ram and then transfer it to the gpus from ram. There is usually a workaround, but percentage wise it wasn’t that much more.
Phaelon74@reddit
Not really a workaround, you can just flat out disable this. I was in the same camp as you until I found out how to disable it. And now my 8, 16, 24 and 32 GPU AI rigs have only 64GB of mem.
Also, please tell me you are using SGLang or Aphrodite with this many GPUs.
segmond@reddit
what kind of performance are you getting with llama.cpp on the R1s?
Conscious_Cut_6144@reddit (OP)
18T/s on Q2_K_XL at first,
However unlike 405b w/ vllm, the speed drops off pretty quickly as your context gets longer.
(amplified by the fact that it's a thinker.)
AD7GD@reddit
Did you run with -fa? Flash attention defaults to off.
Conscious_Cut_6144@reddit (OP)
As of a couple weeks ago flash attention still hadn’t been merged into llama.cpp, I’ll check tomorrow, maybe I just need to update my build.
segmond@reddit
It has been implemented months ago, since last year. I have been using it. I can even use it across old GPUs like the P40s and even when running inference across 2 machines on my local network.
Conscious_Cut_6144@reddit (OP)
It’s specifically missing for Deepseek MOE: https://github.com/ggml-org/llama.cpp/issues/7343
segmond@reddit
Oh ok, I thought you were talking about FA in general, didn't realize you were talking about DeepSeek specifically. Yeah, but it's not just DeepSeek: if the key and value embedding head dimensions are not equal, FA will not work. I believe it's 128/192 for DeepSeek.
bullerwins@reddit
Have you tried ktransformers? I get a more consistent 8-9t/s with 4x3090, even at higher ctx.
AD7GD@reddit
Which model types need system ram for vLLM? I'm running a 8B model in FP16 right now and the vllm process isn't using close to 16G.
chespirito2@reddit
Damn near charging an electric car at that power
Lance_ward@reddit
What’s the watt/token you are getting with this bad boy?
I-cant_even@reddit
https://www.pugetsystems.com/labs/hpc/quad-rtx3090-gpu-power-limiting-with-systemd-and-nvidia-smi-1983/
I run 4x 3090s off a single 1600W PSU. I followed the above guide to prevent high power draws with minimal negative effect.
(Also, you know if you rotated the rig relative to the fan the fan would work better right?.... Sorry, I had to.)
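Same idea as the nvidia-smi power-limit trick in that guide, but from Python via pynvml; a rough sketch (275 W is just an example cap, it needs root, and it resets on reboot unless you wrap it in a systemd unit):

```python
import pynvml

pynvml.nvmlInit()
cap_mw = 275_000  # NVML works in milliwatts
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, max(lo, min(cap_mw, hi)))
    print(f"GPU {i}: limit now {pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000:.0f} W")
pynvml.nvmlShutdown()
```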
laterral@reddit
What’s the use case?
cantgetthistowork@reddit
Do you happen to have a listing for the frame? I'm maxed out on a 12 GPU rack and it's annoying me
Conscious_Cut_6144@reddit (OP)
Most frames are designed for stacking; that's what I did here, only I assembled the top one without the motherboard tray so the GPUs could sit lower.
ExploringBanuk@reddit
No need to try R1/V3, QwQ 32B is better now.
Papabear3339@reddit
QwQ is better than the distills, but not the actual R1.
Most people can't run the actual R1 because an insane rig like this is needed.
teachersecret@reddit
It's remarkably close to the actual r1 in performance, which is impressive. I've been playing with a 4.25 quant of qwq and it has r1 "feels".
AdventurousSwim1312@reddit
With that rig, you'd be better off with an AWQ version and vLLM with tp=16. I wouldn't be surprised if you could get into the 100 t/s range that way (never tried with that many GPUs, but with an aggregate bandwidth of ~16TB/s, that's huge).
tilted21@reddit
Fuck yeah dude. I'm rocking a 4090 +3090, so basically 70b models quanted at 4.5. And its still night and day compared to a 7b. I can't imagine the difference that beast makes. Cool!
mrtransisteur@reddit
what you really need is 16x of those 96 GB Chinese modded 4090s.. you could actually fit full og deepseek r1 on that bro ;_;
roydotai@reddit
How much power does that draw?
IdealDesperate3687@reddit
Llama.cpp is your friend for R1. Love your rig!
fairydreaming@reddit
Congratulations on getting it working, impressive build!
But 5kw... Next project - mini fusion reactor.
gtxktm@reddit
Which PSU do you use?
Also, have you tried exllamav2?
Blizado@reddit
Crazy, so many cards and you still can't run very large models in 4-bit. But I guess you can't get that much VRAM at that speed on such a budget any other way, so a good investment anyway.
MD_Yoro@reddit
Sick spec, but can it run Crysis?
sassydodo@reddit
Isn't newest qwq better than r1?
goodtimtim@reddit
heck yeah! congrats on getting this up. If you've got any more of those 650$ 3090s let me know :)
tindalos@reddit
But how many FPS are you getting on Crysis now?
Fresh-Letterhead986@reddit
what did you use for x4 risers?
something i'm really concerned about is isolation of CEM slot power when using multiple PSU.
back in the old mining days, more than a few people fried equipment by powering a card (inadvertently) from 2 separate power domains -- 1st PSU via the PCIe slot; 2nd PSU via the 12V 8-pin molex connectors
x1 risers is the easy answer, but that's a terrible choice (for non-inference). Was considering modifying a x16 ribbon cable like this: https://www.amazon.com/Express-Riser-Extender-Molex-Ribbon/dp/B00OTGJQ10
AD7GD@reddit
I'll watch for you on vast.ai ;-)
ShadowbanRevival@reddit
You got all 16 running on one board?? I remember my ethereum mining days and it was such a pain in the ass to get anything over cards on one board to run smoothly
Such_Advantage_6949@reddit
Do update us on how many tokens/s you manage to get for whichever version of DeepSeek R1 you manage to fit fully in VRAM.
1BlueSpork@reddit
Awesome!!!
MatterMean5176@reddit
Can you expand on "the lovely Dynamic R1 GGUF's still have limited support" please?
I asked the amazing Unsloth people when they were going to release the dynamic 3 and 4 bit quants. They said "probably". Help me gently remind them... They are available for 1776 but not the original, oddly.
CheatCodesOfLife@reddit
FWIW, I loaded up that 1776 model and hit regenerate on some of my chat history, the response was basically identical to the original
MatterMean5176@reddit
Thanks for that. I've been wondering how they compare. I might need to give in and download the "remix".
You're running them at home?
Conscious_Cut_6144@reddit (OP)
I can run them in llama.cpp, but llama.cpp is way slower than vLLM. vLLM is just rolling out support for R1 GGUFs.
MatterMean5176@reddit
Got it. Thank you.
Lissanro@reddit
Quite a good rig! I am looking at migrating to the EPYC platform myself, so it is of interest to me to read about how others build their rigs based on it.
Currently I have just 4 GPUs, but enough power to potentially run 8; however, I ran out of PCI-E lanes and need more RAM too, hence looking into EPYC platforms. And from what I've seen so far, a DDR4-based platform seems to be the best choice at the moment in terms of performance/memory capacity/price.
segmond@reddit
You can go cheap, if you are on team llama.cpp you can distribute inference across your rigs.
chemist_slime@reddit
What beta bios did you need? Doesn’t this board do x4x4x4x4 per slot? So 4 slots -> 16 x4? Or was it for something else?
Conscious_Cut_6144@reddit (OP)
With stock bios the system can’t boot with more than 14 gpus. Gets a pci resource error. They sent me 3.93A
CheatCodesOfLife@reddit
You could run the unsloth Q2_K_XL fully offloaded to the GPUs with llama.cpp.
I get this with 6 3090's + CPU offload:
prompt eval time = 7320.06 ms / 399 tokens ( 18.35 ms per token, 54.51 tokens per second)
eval time = 196068.21 ms / 1970 tokens ( 99.53 ms per token, 10.05 tokens per second)
total time = 203388.27 ms / 2369 tokens
srv update_slots: all slots are idle
You'd probably get > 100t/s prompt eval + ~20t/s generation.
100thousandcats@reddit
Wow, llama 405B. That’s insane!!
jakubdev12@reddit
Why is everybody using AMD CPUs? Isn't it better to get a 3rd/4th gen Xeon with USM, or 4th gen with CXL, and get less VRAM but better bandwidth between GPU/RAM/CPU to offload stuff?
Like having 8x RTX 3090 with 1TB of RAM to load the biggest current models, and to avoid bottlenecking too much on lanes, speed it up with USM or CXL? What am I missing?
bordobbereli@reddit
This guy is responsible ALONE for global warming
the__simian@reddit
I have a question: I've noticed folks in this community really frequently do things like buy 16 3090s rather than fewer cards that are admittedly more expensive but have much more VRAM and perform well in other ways. Why is this? Are 3090s the best price-to-performance at this time, or is there some other reason?
FlowThrower@reddit
nvlink, I imagine. unless I'm mistaken the 40 and 50 series removed the ability
Just-Requirement-391@reddit
How did you connect 16 GPUs to a motherboard with 7 PCIe slots?
el_koha@reddit
bifurcation
GreedyAdeptness7133@reddit
Maybe it’s not
Dhervius@reddit
When ETH Classic pays you this much per day :v
Difficult-Slip6249@reddit
Glad to see the open air "crypto mining rig" pictures back on Reddit :)
TinyTank800@reddit
Went from mining for fake coins to simulating anime waifus. What a time to be alive.
nexusprime2015@reddit
throw nfts in there as well
wingsinvoid@reddit
How many hashes do you get? What are you using, Claymore? :)
GroundbreakingFile18@reddit
Damn, do the lights in your neighborhood dim and flicker when this spins up?
Proof-Examination574@reddit
Yeah but can it run CRYSIS???
Distinct_Benefit_194@reddit
What is your motherboard, if I may ask?
xephadoodle@reddit
wow, that is quite pretty :D
adulfkittler@reddit
Well this answered my question from yesterday 😂
MannyManMoin@reddit
I was about to say M3 Ultra with 512gb ram for 10k USD. (It will be interesting to see M3 Ultra R1 speeds when reviewers are getting the 512gb version).
Fun setup !
-JamesBond@reddit
Why wouldn’t you buy a new Mac Studio M4/M3 Ultra with 512 GB of RAM for $10k instead? It can use all the memory for the task here and costs less.
ybdave@reddit
Let me know if you get AWQ under SGLang/vLLM running! We have the same build with 16x3090. We should compare notes! Currently running R1 with https://github.com/ikawrakow/ik_llama.cpp. Check out the pull requests, lots of development happening!
rorowhat@reddit
Are you finding the cure for cancer?
Boring-Test5522@reddit
the setup is at least $25000. It is better curing fucking cancer with that price tag.
Ready_Season7489@reddit
"It is better curing fucking cancer with that price tag."
Great return on invest. Gonna be very rich.
Conscious_Cut_6144@reddit (OP)
Prices are in my post a few down, got the 3090's for $650 each.
Haiku-575@reddit
Maybe. 3090s are something like $800 USD used, especially from a miner, bought in bulk. "At least $15,000" is much more realistic, here.
shroddy@reddit
It is probably to finally find out how many r are in strawberry
HelpfulJump@reddit
Last I heard they were using all of Italy's energy to figure that out; I don't think this will cut it.
sourceholder@reddit
With all that EMF? Could be creating it.
Massive-Question-550@reddit
Realistically you would have signal degradation in the Pcie cables long before the EMF actually hurts you.
sourceholder@reddit
The signal degradation (leakage) is the source of EMF propagation. If the connectors and cables were perfectly shielded, there wouldn't be any propagation.
The effect is negligible either way. I wasn't being serious.
Massive-Question-550@reddit
I figured. I don't think the tinfoil hat people are into llm's anyway.
YordanTU@reddit
Maybe the tinfoils from the past. Nowadays "tinfoil" is used to discredit many critical or non-mainstream voices, so be sure that many of today's tinfoils are using LLMs.
Anka098@reddit
You need lab samples to test the cure I guess
cultish_alibi@reddit
Oh! You're unbelievable!
orinoco_w@reddit
Whoa
Vivarevo@reddit
This or its for corn
YouAreRight007@reddit
Very neat.
I wonder what the cost would be per hour to have the equivalent resources in the cloud.
miaumiauboombigjan@reddit
the single fan hahahahahahahh
SungamCorben@reddit
Amazing, pull some benchmarks please!
JunketLess@reddit
can someone eli5 what's going on ? it looks cool though
FinnGamePass@reddit
Mining ELon Coin
SomeOddCodeGuy@reddit
That fan really pulls the build together.
BangkokPadang@reddit
Well that's just, like, your opinion, man.
Xylber@reddit
"I'm tired boss"
impaque@reddit
Hahahah I literally thought the same thing, almost posted it, too :D Look at the angle at which it blows, too :D Gold
DangKilla@reddit
Fans don't cool air. He should be blowing the hot air away.... I worked in a data center that used industrial fans.....
Financial_Recording5@reddit
Ahhh…The 20” High Velocity Floor Fan. That’s a great fan.
Then_Conversation_19@reddit
The true MVP
random-tomato@reddit
the fan is the best part XD
needCUDA@reddit
when I used to mine I had a fan too. super effective.
gjallerhorns_only@reddit
I was just thinking that this would have been an insane mining rig like 3yrs ago
davew111@reddit
But... no RGB
tmvr@reddit
Meh, I think it blows!
xendelaar@reddit
That's like..just your opinion, man..
shaolinmaru@reddit
I have one of those and it produces hella wind.
Theio666@reddit
I have a fan that's pretty much like on photo, and I bet the fan is louder than all cards combined xD
OmarDaily@reddit
Damn, might just pick up a 512gb Mac Studio instead.. The power draw must be wild at load.
edude03@reddit
I just 5 minutes ago got my 4 founders working in a single box (I have 8 but power/space/risers are stopping me) then I see this
ForsookComparison@reddit
Host Llama 405b with some funky prompts and call yourself an AI startup.
WeedFinderGeneral@reddit
"We'll just ask the AI how to make money"
OneSmallStepForLambo@reddit
I see some AI ads and I can’t help to think something like this is running in a garage supporting the app
(No disrespect to OP - love the rig)
Jucks@reddit
Is this your heater setup for the winter? (seriously wtf is this for=D)
Alavastar@reddit
Yep that's how skynet starts
Electronic-Site8038@reddit
skynet is in china, already alive for the past 4 years
Thireus@reddit
What’s the electricity bill like?
Conscious_Cut_6144@reddit (OP)
$0.42/hour when inferencing,
$0.04/hour when idle.
I haven't tweaked power limits yet,
Can probably drop that a bit.
MizantropaMiskretulo@reddit
So, you're at about $5/Mtok, a bit higher than o3-mini...
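Rough math, using OP's own numbers ($0.42/hour while inferencing, ~24.5 tok/s single-stream):

```python
# Cost per million generated tokens, before any batching.
tok_per_hour = 24.5 * 3600                      # ~88,200 tokens/hour
cost_per_mtok = 0.42 / (tok_per_hour / 1e6)
print(round(cost_per_mtok, 2))                  # ~4.76 $/Mtok
```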
AmanDL@reddit
Probably 3; nothing beats running locally. Running big models in the cloud, you never know if you're having model parallelization issues, RAM issues, and whatnot. At least locally it's all quite transparent.
smallfried@reddit
You said you have solar. Can you run the whole thing for free when it's sunny?
Conscious_Cut_6144@reddit (OP)
Depends on how you look at it. I still pull a little power from the grid every month, more with this guy running.
MINIMAN10001@reddit
Yep power limits are on my mind with numbers like that lol
Thireus@reddit
Nice! I wish I also lived in a place with cheap electricity 😭 I pay triple.
TerryC_IndieGameDev@reddit
This is so beautiful. Man... what I would not give to even have 2 3090's. LOL. I am lucky tho, I have a single 3060 with 12 gigs vram. It is usable for basic stuff. Someday maybe Ill get to have more. Awesome setup I LOVE it!!
Ok-Investment-8941@reddit
The 6 foot folding plastic table is the unsung hero of nerds everywhere IMO
init__27@reddit
I mean...to OP's credit: Are you even a localLLaMA member if you cant train llama at home? :D
GreedyAdeptness7133@reddit
What kind of crazy workstation mobo supports 16 gpus and how are they connected to it?
Lantan57ua@reddit
I wanted to start with 1 3090 to learn and have fun (also for gaming). I see some $500-$600 used cards around me, and now I know why the price is so low. Is it safe to buy them after mining from a random person?
misterravlik@reddit
buddy, can you post a beta version of the bios for asrock romed8-2t?
M000lie@reddit
How the hell did you connect all 16x GPUs to your asrock motherboard with 7x pcie4 x16?
letonai@reddit
1.21 Gigawatts?
dr_manhattan_br@reddit
Considering each 3090 can draw 400W, you should hit 6.4kW just with the GPUs. Adding CPU and peripherals, it should draw more than 7kW from the wall at 100%. Maybe your PCIe 3.0 is keeping your GPUs from being fully utilized.
DrDisintegrator@reddit
Every time I see a rig like this, I just look at my cat and say, "We can't have nice things."
VrN00b74@reddit
I totally get you on this.
Public-Subject2939@reddit
This generation is so obsessed with fans😂🤣 its just fans its JuST only FANS😭
Ok_Parsnip_5428@reddit
Those 3090s are working overtime 😅
geoffwolf98@reddit
And yet Crysis still stutters at 4K.
Feisty_Ad_4554@reddit
Nice setup in the winter :)
mini-hypersphere@reddit
The things people do to simulate their waifu
fairydreaming@reddit
with 5kw of power to dissipate she's going to be a real hottie!
-TV-Stand-@reddit
You can turn off your house's heating with this simple trick!
RazzmatazzReal4129@reddit
Still cheaper and less effort than real wife.
h1pp0star@reddit
Are you training the new llama model in your garage?
BoulderDeadHead420@reddit
I'm just trying to find one or two at that price, damn.
TigerRobocop_@reddit
OMG
Active-Ad3578@reddit
Now buy 10 Mac Studio Ultras, then it will be like 5TB of VRAM.
kumits-u@reddit
What's your PCIe speed on each of the cards? Wouldn't this limit your speed if it's lower than x16 per card?
Comcast-user-WA@reddit
awesome
Ok-Anxiety8313@reddit
Really surprising you are not memory bandwidth-bound. What implementation/software are you using?
MINIMAN10001@reddit
I mean, once the model is loaded, the communication between cards is extremely limited during inference.
Wheynelau@reddit
Could you elaborate? You mean once the layers are loaded there's no communication between devices?
Sm0oth_kriminal@reddit
Yes, only the actual "data", i.e. the token activations, is passed at inference time. This is on the order of a few MB, whereas the weights are 100s of GB. It's basically nothing, to the point where communication latency matters much more than bandwidth.
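To put rough numbers on that, here is a quick scale comparison, assuming Llama-405B-ish dimensions (hidden size 16384, bf16 activations; these are assumed values for illustration):

```python
# Rough scale of what crosses a GPU boundary per token vs. total weight size.
# Dimensions are assumed (Llama-405B-ish) and only meant for illustration.
hidden_size = 16384                 # model dimension (assumed)
bytes_per_value = 2                 # bf16
params = 405e9

activation_per_token = hidden_size * bytes_per_value    # bytes per token per hop
weights_total = params * bytes_per_value                # bytes of weights

print(f"activations per token per hop: {activation_per_token / 1024:.0f} KiB")  # ~32 KiB
print(f"weights (bf16):                {weights_total / 1e9:.0f} GB")            # ~810 GB
```

Even with a batch of requests in flight, the per-hop traffic stays in the KB-to-MB range, which is why latency dominates over link bandwidth here.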
Wheynelau@reddit
Aren't the activations (hidden states) of intermediate layers passed between devices in the case of pipeline parallelism, while with tensor parallelism much of the communication happens at the layer norm layers, requiring quite a lot of traffic? I could be wrong about the inference frameworks, my specialty is only in training 😅
Sm0oth_kriminal@reddit
Yes those are basically the “token data” but after the Nth layer has processed them.
I’m not sure what OP would use (for MoE it gets slightly more complicated), but tensor parallelism especially on consumer GPUs can be problematic due to collective communication (such as layer norm)
I think the default in many tools is essentially pipeline parallelism (for example, llama.cpp will offload however many layers to the GPU, and run the rest on the CPU). So the activations just behave like an assembly line, they start on the CPU as token+positional vectors, and must be communicated to the first device with the first few layers of the model, then after that is done to the next device with the next layers, and so on
This also has the benefit of being able to handle large request volumes. For example, at any given time for a single request, only 1 device is active (* mostly). So, giving another request when the current request is on device 4/8 means both can be going at full speed — in fact theoretically you can have N concurrent requests each getting effectively 100% of a single GPUs performance in an N GPU machine
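A toy sketch of that assembly-line idea, assuming PyTorch and a stack of stand-in layers (not llama.cpp's actual implementation, just an illustration of splitting layers across devices and passing only the hidden states between them):

```python
import torch
import torch.nn as nn

# Toy "assembly line" (pipeline parallel) forward pass: each device owns a
# contiguous slice of layers, and only the hidden states move between devices.
n_layers, d_model = 8, 1024
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]

layers = [nn.Linear(d_model, d_model) for _ in range(n_layers)]
for i, layer in enumerate(layers):
    # First half of the layers on device 0, second half on device 1.
    layer.to(devices[i * len(devices) // n_layers])

def forward(x: torch.Tensor) -> torch.Tensor:
    h = x
    for i, layer in enumerate(layers):
        dev = devices[i * len(devices) // n_layers]
        h = h.to(dev)   # only the activations (KBs per token) cross the link
        h = layer(h)
    return h

out = forward(torch.randn(1, d_model))
print(out.shape)  # torch.Size([1, 1024])
```

At any moment only one device is doing work for a given request, which is exactly why concurrent requests can keep all stages busy.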
Wheynelau@reddit
Got it! Yeah, consumer GPUs are really not made for collective communication; the bandwidth and compute are usually good enough, but they really struggle when they need to communicate. I tried experimenting on cards without an interconnect: 2 GPUs with TP=2 were apparently slower than one (assuming each card can fit the whole model).
Thanks for sharing on llama.cpp, my work is usually on vllm so I am not too familiar with how llama.cpp shards their model.
The pain point of pipeline is having to wait on the other devices for one token, so yes you are absolutely right, the theoretical limit is N concurrent requests for N gpus.
Ok-Anxiety8313@reddit
If my understanding is correct, it uses full sharding, so each matrix is split across GPUs and they need to communicate a lot to combine the results of all the matrices at each layer? But perhaps it's a different kind of parallelism that uses less communication, maybe dividing up attention heads or something?
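For illustration, that "split each matrix across GPUs" idea (tensor parallelism) can be sketched as a column-parallel matmul: each GPU holds a slice of the weight matrix, computes its part, and the parts are gathered, and that gather is the per-layer communication being discussed. A toy numpy sketch, not any particular framework's implementation:

```python
import numpy as np

# Column-parallel matmul: W is split by columns across two "GPUs";
# each computes x @ W_shard, and the shards are concatenated (the all-gather).
d_in, d_out = 1024, 4096
x = np.random.randn(1, d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

W0, W1 = np.split(W, 2, axis=1)       # shard the weight matrix by columns
y0 = x @ W0                            # computed on "GPU 0"
y1 = x @ W1                            # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)   # the cross-GPU traffic lives here

assert np.allclose(y, x @ W, atol=1e-3)
```

Splitting by columns maps neatly onto attention heads, which is why head-parallel attention is the common case.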
m4hi2@reddit
repurposed your crypto mining rig? 😅
slippery@reddit
Applause for the tight cabling. I wish I could afford a rig like that.
nanobot_1000@reddit
This is awesome, bravo 👏
5 kW lol... since you are the type to run 240V and build this beast, I foresee some solar panels in your future.
I also heard MSFT might have 🤏 spare capacity from re-opening Three Mile Island, perhaps you could negotiate a co-hosting rate with them
Conscious_Cut_6144@reddit (OP)
Haha you have me all figured out.
I have about 15kw worth of panels in my back yard.
power97992@reddit
What do you do at night, pull from the power grid plus batteries?
Conscious_Cut_6144@reddit (OP)
I have batteries, but I still pull some power from the grid like 50 weeks of the year.
I drive an EV too, so I use a lot of electricity.
power97992@reddit
If you have that much sun and pay that much for electricity, I suspect you live in California.
nanobot_1000@reddit
Ahaha you are ahead of the game! That's great you are bringing second life to these cards with those 😊
segmond@reddit
Very nice. I'm super duper envious. I'm getting 1.60 tok/sec on Llama 405B Q3_K_M.
power97992@reddit
That is so slow, you might as well rent an H200 cluster.
segmond@reddit
sure, and what performance are you getting when you run it on your own machine?
power97992@reddit
I usually use o3-mini or Claude, but on occasion I run a 14B locally lol. I get like 23 t/s… I can't imagine running Llama 405B on my machine, it would crash my system and shorten the lifespan of my SSD.
330d@reddit
on what hardware m8?
segmond@reddit
2 rigs with the inference distributed across the network; my slower rig is a 3060 and 3 P40s. If it were 4 3090s, I'd probably see 5 tok/s. I'm also using llama.cpp, which is not as fast as vLLM.
-lq_pl-@reddit
Damn you, leave some for the rest of us.
andreclaudino@reddit
I would like to build something like this for myself, but I don't know where to start. I considered ordering a cryptocurrency mining rig (like yours, it uses a set of RTX 3090s), but I'm not sure it would work for AI, or whether it would be any good.
Do you have a step-by-step tutorial I can follow?
andreclaudino@reddit
Next week, this guy will have trained a new DeepSeek-like model for $25k.
Intrepid_Traffic9100@reddit
The combination of probably $15k+ in cards plus a $5 fan on a shitty desk is just pure gold.
miscellaneous_robot@reddit
DANG!
not_wall03@reddit
So you're the reason 3090s are so expensive
Business-Weekend-537@reddit
Might be a dumb question but how many pcie ports on the motherboard and how do you hook up that many at once?
moofunk@reddit
Put this thing or similar in a slot and bifurcate the slot in BIOS.
laexpat@reddit
But what connects from that to the gpu?
fizzy1242@reddit
A riser cable
Business-Weekend-537@reddit
Where do you get one of those splitter cards? Also was bifurcating in the bios an option or did you have to custom code it?
That splitter card is sexy AF ngl
LockoutNex@reddit
Most server-type motherboards allow bifurcation on about every PCIe slot, but for normal consumer motherboards it's really up to the maker. For the splitter cards you can just google 'bifurcation card' and you'll get tons of results, from Amazon listings to eBay.
Conscious_Cut_6144@reddit (OP)
It's a setting on most boards nowadays.
Bystander-8@reddit
I can see where all the budget went.
Business-Ad-2449@reddit
How rich are you ?
cbnyc0@reddit
Work-related expense, put it on your Schedule C.
DesperateAdvantage76@reddit
So the rig is only $10k instead of $12k after you get back those deductions lol.
rapsoid616@reddit
That's the way I purchase all my electronics! In Turkey it saves me about 20%.
sourceholder@reddit
Not anymore.
RMCPhoto@reddit
I hope it's winter wherever you are.
Tasty_Ticket8806@reddit
Power consumption?? Like 2 and a half nuclear reactors or so...?
Ok_Combination_6881@reddit
Is it more economical to buy a $10k M3 Ultra with 512GB or to buy this rig? I actually want to know.
Conscious_Cut_6144@reddit (OP)
The M3 Ultra is probably going to pair really well with R1 or DeepSeek V3,
could see it doing close to 20T/s
due to having decent memory bandwidth and no overhead hopping from GPU to GPU.
But it doesn't have the memory bandwidth for a huge non-MoE model like 405B,
would do something like 3.5T/s.
I've been working on this for ages,
but if I was starting over today I would probably wait to see if the top Llama 4.0 model is MoE or dense.
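The reasoning behind those numbers is that single-stream decode speed is roughly memory-bandwidth-bound: tokens/s ≈ bandwidth ÷ bytes read per token, and an MoE model only reads its active experts. A rough sketch, assuming ~800 GB/s for the M3 Ultra, ~37B active parameters for R1, and 4-bit weights (all assumed figures for illustration):

```python
# Back-of-the-envelope decode speed: tokens/s ~= memory_bandwidth / bytes_per_token.
# All figures below are assumptions for illustration.
bandwidth_gbs = 800            # ~M3 Ultra memory bandwidth, GB/s (assumed)
bytes_per_param = 0.5          # 4-bit quant

def ceiling_tps(active_params_b: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

print(f"R1 (MoE, ~37B active): <= {ceiling_tps(37):.0f} tok/s theoretical")
print(f"405B dense:            <= {ceiling_tps(405):.0f} tok/s theoretical")
# Real-world numbers land well below these ceilings, which is consistent
# with the ~20 T/s and ~3.5 T/s estimates above.
```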
Cergorach@reddit
With what 3090s are going for today (~$1000) you could make a nice profit... ;)
What would the advantage of running 405B over 671B be in output quality? Or is this just a long-running project you wanted to finish? AI/LLM development is going so darned fast that by the time you buy/build X, Y is already doing it faster, cheaper, and better...
Wheynelau@reddit
I'm more curious about the M4 studio. The rig OP has should be able to fit Q4 deepseek R1, unless my math is wrong. Would be interesting to see how it performs
lolwutdo@reddit
Definitely less of a headache and eyesore
Noiselexer@reddit
And energy
keepawayb@reddit
You have my respect and tears of envy.
xor_2@reddit
What is the performance difference between splitting large model by rows and by layers?
power97992@reddit
5600 watts while running and 7200W at peak usage, your house must be a furnace.
SadWolverine24@reddit
Why do you have 512GB of RAM?
Tourus@reddit
The most popular inference engines all load the entire model into RAM first.
jack-in-the-sack@reddit
A single motherboard??? How???
a_beautiful_rhind@reddit
What's it idle at?
Alice-Xandra@reddit
Sell the flops & you've got free heating! Some pipe fckery & you've got warm water.
freeenergy
Pretend-Umpire-3448@reddit
A noob question: how do you connect all the GPUs? PCIe, or...?
2TierKeir@reddit
What do you do with these bro
Greedy_Reality_2539@reddit
Looks like you have your domestic heating sorted
-Ellary-@reddit
Wow, this rig is almost smirking at me, and it gives me shivers down my spine.
It is a great build for enterprise resource planning.
Future_Might_8194@reddit
I can hear this picture
AppearanceHeavy6724@reddit
It is so hot I had to open my window.
Endless7777@reddit
Cool, what are you doing with it? Im new to this whole llm thing.
sunshinecheung@reddit
A 48GB VRAM 4090 can be much better
Such_Advantage_6949@reddit
Much more expensive as well
ThenExtension9196@reddit
How so? My modded 4090 was 4k and a couple of 3090s cost that much?
HelpfulFriendlyOne@reddit
He paid 650 each for the 3090s
Such_Advantage_6949@reddit
No, I meant a 48GB 4090 is gonna be much more expensive compared to 3090s.
Mass2018@reddit
Nice build. I highly recommend you upgrade your fan to a box fan that you can set behind the rig (give it an inch of clearance for some air intake) so that you can push air out across all the cards.
dizvyz@reddit
I wouldn't have thought that what was keeping us from attaining true AGI was a desk fan.
Wheynelau@reddit
How does it compare to the 3.3 70B? I heard the 70B is supposedly comparable to the 405B; I can imagine the throughput you would get from that.
NobleKale@reddit
So much money on such a shitty little frame.
illusionst@reddit
Can you ask it 'what is the meaning of life?'
gaspoweredcat@reddit
Yikes and I thought my 10x CMP 100-210 (160gb total) was overkill
vogelvogelvogelvogel@reddit
Dude spent like 20 grand on 3090s and mounts them in a 10-buck shelf.
Godess_Ilias@reddit
why?
TheManicProgrammer@reddit
What's the fan cooling
Blizado@reddit
Phew, that is insane. I could never afford this. I'm just happy to at least have a 4090. I hate that I'm so poor. :D
random-tomato@reddit
New r/LocalLLaMA home server final boss!
/u/XMasterrrr
Conscious_Cut_6144@reddit (OP)
He has 8x risers; it's a trade-off getting 16 cards for tensor parallel vs. extra bandwidth to 14 cards.
kmouratidis@reddit
You can get 4x4 x16 switches. It might not help with average bandwidth per card, but if you configure them in a mix of tensor and pipeline parallelism, you'll have enough request throughput to compete with (non-A100/H100/H200) enterprise servers.
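For reference, vLLM exposes both knobs, so a 16-GPU box can be carved up as, say, TP=8 within a group and PP=2 across groups. A sketch only: whether the offline LLM API supports pipeline parallelism depends on the vLLM version and backend, and the model id and quantization here are placeholder assumptions:

```python
# Sketch: splitting 16 GPUs into tensor-parallel groups of 8, pipelined 2 deep.
# Offline pipeline-parallel support varies by vLLM version/backend; the model
# name and quantization below are illustrative assumptions, not OP's setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model id
    tensor_parallel_size=8,        # 8-way TP inside each group
    pipeline_parallel_size=2,      # 2 groups chained as pipeline stages
    quantization="fp8",            # example; pick whatever fits the VRAM budget
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```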
HipHopPolka@reddit
Does... the floor fan actually work?
MINIMAN10001@reddit
When you run the math, large fans like that move an enormous volume of air (in cubic feet per minute) compared to desktop fans. Blade size is a major factor in how much air gets moved.
ParaboloidalCrest@reddit
10x better than your 12 teeny-tiny case fans.
Gullible-Fox2380@reddit
May I ask what you use it for? Just curious! thats a lot of cloud time
rf97a@reddit
Optimising bitcoin mining?
These_Growth9876@reddit
Is the build similar to ones ppl used to build for mining? Can u tell me the motherboard used?
0RGASMIK@reddit
Full circle back to crypto days.
AriyaSavaka@reddit
This can fully offload a 72B-123B model at 16-bit and with 128k context right?
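Roughly yes for the weights; the KV cache for 128k context is the other big chunk. A back-of-the-envelope sketch, assuming Mistral-Large-2-like dimensions (88 layers, 8 KV heads, head dim 128 — assumed values) and an fp16 cache:

```python
# Does a 123B model at 16-bit plus a 128k-token KV cache fit in 384GB?
# Model dimensions below are assumed (Mistral-Large-2-like), for illustration.
params = 123e9
n_layers, n_kv_heads, head_dim = 88, 8, 128
bytes_fp16 = 2
context = 128 * 1024

weights_gb = params * bytes_fp16 / 1e9
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16  # K and V
kv_cache_gb = kv_per_token * context / 1e9

print(f"weights:  {weights_gb:.0f} GB")   # ~246 GB
print(f"KV cache: {kv_cache_gb:.0f} GB")  # ~47 GB at 128k tokens
print(f"total:    {weights_gb + kv_cache_gb:.0f} GB vs 384 GB of VRAM")
```

Under those assumptions it fits with headroom; a 72B model at 16-bit is comfortable either way.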
Willing_Landscape_61@reddit
Building an open rig myself. How do you prevent dust from accumulating in your rig?
Dangerous_Fix_5526@reddit
F...ing Madness - I love it.
TerrryBuckhart@reddit
Hopefully whatever you are training the model for will pay for your power bill
GTHell@reddit
Please show us your electricity bill
ReMoGged@reddit
Nice hair dryer
_wOvAN_@reddit
Unfortunately, large contexts are destroying multi-GPU builds.
Odd_Reality_6603@reddit
Bro that's nasty
kaisear@reddit
MadMax build.
330d@reddit
I'm 3rd month into planning, gathering all the parts, reading, saving money... for my 4x3090 build. Then there's this guy :D Congratulations, amazing build, one of the GOAT's here and goes into my bookmarks folder.
lukewhale@reddit
Holy shit. I expect a full write-up and a YouTube video.
You need to share your experience.
beedunc@reddit
Would love to see a ‘-ps’ of that.
Top-Salamander-2525@reddit
You’re going to need a bigger fan…
Conscious_Cut_6144@reddit (OP)
I have a 48in fan mounted in my garage ceiling, exhausting into my attic.
Top-Salamander-2525@reddit
https://bigassfans.com
sunole123@reddit
How many TOPS would you say this setup is?
MixtureOfAmateurs@reddit
Founders 💀. There aren't 16 3090 FEs in my city lol
Conscious_Cut_6144@reddit (OP)
Not anymore 🤣
GmanMe7@reddit
I hope you run Linux
ExceptionOccurred@reddit
What do you use this for?
olmoscd@reddit
you know, talking to it!
robonxt@reddit
I love how the rig is nice, and the cooling solution is just a fan 😂
CheatCodesOfLife@reddit
It's the most effective way though! Even with my vramlet rig of 5x3090's, adding a fan like that knocked the temps down from ~79C to the 60's
Theio666@reddit
Rig looks amazing, ngl. Since you mentioned 405B, are you actually running it? Kinda wonder what performance in a multi-agent setup would be, with something like 32B QwQ, smaller models for parsing, maybe some long-context Qwen 14B-Instruct-1M (120/320GB VRAM for 1M context per their repo), etc. running at the same time :D
legatinho@reddit
384gb of VRAM. What model and what context size can you run with that?
The_GSingh@reddit
ATP it is alive. What are you building agi or something?
Really cool build btw.
Ok-Anxiety8313@reddit
Can I get the mining contact? Do they have more 3090?
TheDailySpank@reddit
For the love of god, hit it from the front
Conscious_Cut_6144@reddit (OP)
Absolutely, that's just for the pics!