If my understanding is correct, it uses full sharding, so each matrix is split across GPUs and it needs to communicate a lot to combine the results of all the matrices at each layer? Or perhaps it is a different kind of parallelism that uses less communication? Perhaps dividing up attention heads or something?
I usually use o3 mini or Claude, but on occasion I run a 14b model locally lol. I get like 23 t/s… I can't imagine running Llama 405b on my machine, it would crash my system and shorten the lifespan of my SSD.
2 rigs with the inference distributed across the network; my slower rig is a 3060 and 3 P40s. If it were 4 3090's I'd probably see 5 tk/s. I'm also using llama.cpp, which is not as fast as vLLM.
I would like to build a rig like this for myself, but I don't know where to start. I considered ordering a cryptocurrency mining rig (like yours, it uses a set of RTX 3090s), but I am not sure whether it would work for AI, or whether it would be any good.
Do you have a step-by-step tutorial that I can follow?
Most server-type motherboards allow bifurcation on just about every PCIe slot, but for normal consumer motherboards it is really up to the maker at that point. For the splitter cards you can just google 'bifurcation card' and you'll get tons of results, from postings on Amazon to eBay.
M3 Ultra is probably going to pair really well with R1 or DeepSeek V3,
Could see it doing close to 20T/s
due to having decent memory bandwidth and no overhead hopping from GPU to GPU.
But it doesn't have the memory bandwidth for a huge non-MoE model like 405B,
Would do something like 3.5T/s.
I've been working on this for ages,
But if I was starting over today I would probably wait to see if the top Llama 4.0 model is MoE or dense.
With what the 3090's are going for today (~$1000) you could make a nice profit... ;)
What would the advantage of running 405b over 671b be in output (quality)? Or is this just a long-running project you wanted to finish? AI/LLM development is going so darned fast that by the time you buy/build X, Y is already doing it faster, cheaper, and better...
I'm more curious about the M4 studio. The rig OP has should be able to fit Q4 deepseek R1, unless my math is wrong. Would be interesting to see how it performs
Nice build. I highly recommend you upgrade your fan to a box fan that you can set behind the rig (give it an inch of clearance for some air intake) so that you can push air out across all the cards.
You can get 4x4 x16 switches. It might not help with average bandwidth per card, but if you configure them in a mix of tensor and pipeline parallelism, you'll have enough request throughput to compete with (non-A100/H100/H200) enterprise servers.
When you run the math, large fans like that move enormous amounts of cubic feet of air compared to desktop fans. Blade size is a major factor in the amount of air that is moved.
I'm in my 3rd month of planning, gathering all the parts, reading, saving money... for my 4x3090 build. Then there's this guy :D Congratulations, amazing build, one of the GOATs here and it goes into my bookmarks folder.
Rig looks amazing ngl. Since you mentioned 405b, are you actually running it? Kinda wonder what performance in a multiagent setup would be, with something like 32b QwQ, smaller models for parsing, maybe some long-context Qwen 14B-Instruct-1M (120/320gb vram for 1m context per their repo) etc running at the same time :D
Mountain_Chicken7644@reddit
you are one lucky motherf-
Conscious_Cut_6144@reddit (OP)
Got a beta BIOS from Asrock today and finally have all 16 GPUs detected and working!
Getting 24.5T/s on Llama 405B 4bit (Try that on an M3 Ultra :D )
Specs:
16x RTX 3090 FE's
AsrockRack Romed8-2T
Epyc 7663
512GB DDR4 2933
Currently running the cards at Gen3 with 4 lanes each,
Doesn't actually appear to be a bottleneck based on:
nvidia-smi dmon -s t
showing under 2GB/s during inference.
I may still upgrade my risers to get Gen4 working.
Will be moving it into the garage once I finish with the hardware,
Ran a temporary 30A 240V circuit to power it.
Pulls about 5kw from the wall when running 405b. (I don't want to hear it, M3 Ultra... lol)
Purpose here is actually just learning and having some fun,
At work I'm in an industry that requires local LLMs.
Company will likely be acquiring a couple DGX or similar systems in the next year or so.
That and I miss the good old days having a garage full of GPUs, FPGAs and ASICs mining.
Got the GPUs from an old mining contact for $650 a pop.
$10,400 - GPUs (650x16)
$1,707 - MB + CPU + RAM (691+637+379)
$600 - PSUs, Heatsink, Frames
---------
$12,707
+$1,600 - If I decide to upgrade to gen4 Risers
Will be playing with R1/V3 this weekend,
Unfortunately, even with 384GB, fitting R1 with a standard 4-bit quant will be tricky.
And the lovely Dynamic R1 GGUFs still have limited support.
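Rough math on why 384GB is tight (the bits-per-weight figure is my ballpark for a Q4_K_M-class quant, not a measurement):

```python
# Back-of-the-envelope VRAM math for R1 at a "standard" ~4-bit GGUF quant.
params = 671e9                      # DeepSeek R1 stores all experts, ~671B params
bits_per_weight = 4.8               # roughly what Q4_K_M averages out to (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
vram_gb = 16 * 24                   # 16x 3090
print(f"{weights_gb:.0f} GB of weights vs {vram_gb} GB of VRAM")  # ~403 GB vs 384 GB, before KV cache
```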
NeverLookBothWays@reddit
Man that rig is going to rock once diffusion based LLMs catch on.
Sure_Journalist_3207@reddit
Dear gentleman would you please elaborate on Diffusion Based LLM
Freonr2@reddit
TLDR: instead of iteratively predicting the next token from left to right, it guesses across the entire output context, more like editing/inserting tokens anywhere in the output on each iteration.
Ndvorsky@reddit
That’s pretty cool. How does it decide the response length? An image has a predefined pixel count but the answer of a particular text prompt could just be “yes”.
Freonr2@reddit
I think same as any other model: it puts an EOT token somewhere, and I think for a diffusion LLM it just pads the rest of the output with EOT. I suppose it means your context size needs to be sufficient though, and you end up with a lot of EOT padding at the end?
330d@reddit
https://x.com/karpathy/status/1894923254864978091
Thesleepingjay@reddit
Wow, it's so fast it looks like magic. Thanks for sharing.
NeverLookBothWays@reddit
This is a good overview of the breakthrough: https://youtu.be/X1rD3NhlIcE
https://aimresearch.co/ai-startups/diffusion-models-enter-the-large-language-arena-as-inception-labs-unveils-mercury
Magnus919@reddit
Let me ask my LLM about that for you.
NihilisticAssHat@reddit
I haven't seen anything about that context window. I feel like that would be the most significant limitation.
NeverLookBothWays@reddit
Here’s a brief overview of it I think explains it well: https://youtu.be/X1rD3NhlIcE (Mercury Coder)
I haven’t seen anything yet for local, but pretty excited to see where it goes. Context might not be too big of an issue depending on how it’s implemented.
NihilisticAssHat@reddit
I just watched the video. I didn't get anything about context length, mostly just hype. I'm not against diffusion for text, mind you, but I am concerned that the context window will not be very large. I only understand diffusion through its use in imagery, and as such realize the effective resolution is a challenge. The fact that these hype videos are not talking about the context window is of great concern to me. Mind you, I'm the sort of person who uses Gemini instead of ChatGPT or Claude for the most part simply because of the context window.
Locally, that means preferring Llama over Qwen in most cases, unless I run into a censorship or logic issue.
NeverLookBothWays@reddit
True, although with the compute savings there may be opportunities to use context window scaling techniques like LongRoPE without massively impacting the speed advantage of diffusion LLMs. I am certain if it is a limitation now with Mercury it is something that can be overcome.
It’s currently 128k tokens for Mercury Coder
rog-uk@reddit
Will be interesting to see how long it takes for an opensource D-LLM to come out, and how much VRAM/GPU they need for inference. Nvidia won't thank them!
xor_2@reddit
Do diffusion LLMs scale better than auto-regressive LLMs?
From what I read I cannot parallelize stupid flux.1-dev on two GPUs so I have my doubts.
Optifnolinalgebdirec@reddit
When will we get it? anthropic
MetricVoidLX@reddit
Are you sure about not being bandwidth bottlenecked...?
The theoretical bandwidth of PCIe 3.0 x4 is 3.938 GB/s bidirectional, which is around 2GB/s in a single direction. vLLM uses tensor parallelism, which should demand pretty high bandwidth between cards.
I had a similar setup with older Nvidia GPUs in a server. Both ran on PCIe 3.0 x16, but the training performance took a severe hit, even compared to a single-card setup.
Conscious_Cut_6144@reddit (OP)
Training would for sure be bottlenecked with my setup.
It loads models onto a single card at 3.6GB/s, but inference never goes above 2.
Possible that I don't have the resolution to see the bottleneck. For example, it could be doing 3.6GB/s half the time and idle the other half, but switching faster than nvidia-smi can pick up on.
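If I wanted to rule out the sampling-resolution issue, something like this (a rough sketch using the nvidia-ml-py/pynvml bindings, not what I actually ran) would catch short bursts better than dmon's 1-second readout, since NVML's PCIe counter is a ~20ms snapshot:

```python
# Quick-and-dirty higher-resolution PCIe TX sampler across all GPUs.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

peak_tx_kb = 0
t_end = time.time() + 10            # sample for 10 seconds while a prompt runs
while time.time() < t_end:
    for h in handles:
        kb_s = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
        peak_tx_kb = max(peak_tx_kb, kb_s)
print(f"peak TX seen: {peak_tx_kb / 1e6:.2f} GB/s")
pynvml.nvmlShutdown()
```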
hotdogwallpaper@reddit
what line of work are you in?
Stunning_Mast2001@reddit
What motherboard has so many pcie ports??
Conscious_Cut_6144@reddit (OP)
Asrock Romed8-2T
7 x16 slots,
Have to use 4x4 bifurcation risers that let you plug 4 GPUs into each slot.
CheatCodesOfLife@reddit
Could you link the bifurcation card you bought? I've been shit out of luck with the ones I've tried (either signal issues or the GPUs just kind of dying with no errors)
Conscious_Cut_6144@reddit (OP)
If you have one now that isn't working, try dropping your PCIe link speed down in the BIOS.
A lot of the stuff on Amazon is junk,
This one works fine for 1.0 / 2.0 / 3.0
https://riser.maxcloudon.com/en/bifurcated-risers/22-bifurcated-riser-x16-to-4x4-set.html
Haven't tried it yet, but this is supposedly good for 4.0
https://c-payne.com/products/slimsas-pcie-gen4-host-adapter-x16-redriver
https://c-payne.com/products/slimsas-pcie-gen4-device-adapter-x4
https://c-payne.com/products/slimsas-sff-8654-8i-to-2x-4i-y-cable-pcie-gen4
CheatCodesOfLife@reddit
Cool, you were right. My ones must be junk. I bought an nvme -> pcie 4x adapter, plugged a riser into that, then added my 6th 3090 and it works!
I'll try some others, but could settle for x4 for the last 2 cards if I can't get x8 working.
fightwaterwithwater@reddit
Just bought this and, to my great surprise, it's working fine for x4/x4/x4/x4: https://www.aliexpress.us/item/3256807906206268.html?spm=a2g0o.order_list.order_list_main.11.5c441802qYYDRZ&gatewayAdapt=glo2usa
Just need some cheapo oculink connectors.
cantgetthistowork@reddit
Cpayne is decent, but I've had a bunch of them arrive defective and only register as x2.0. The ones that work are great though. Only problem is there's no 4x4.0 riser, so I could only fit 13 on my Romed8-2T.
Conscious_Cut_6144@reddit (OP)
The 3 links I posted were 4x4.0 no? Poor QC is a shame, especially on stuff coming overseas.
Radiant_Dog1937@reddit
Oh, those work? I've had 48gb worth of AMD I could have been using the whole time.
cbnyc0@reddit
You use risers, which split the PCIe interface out to many cards. It’s a type of daughterboard. Look up GPU risers.
misteick@reddit
yes, but how much does the fan cost? I think it's the MVP
Aphid_red@reddit
So you're seeing 24.5T/s out of a theoretical maximum of 63 T/s, getting about 38.9% of the theoretical performance.
I'm assuming though, that since there are only 8 key-value heads, that what your inference software is doing is first a layer-split in two, then tensor parallel 8-way. With that setup, you're really getting 77.8% of the true value, which looks much more realistic in terms of usable memory bandwidth.
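Rough arithmetic behind those percentages (the ~800 GB/s effective bandwidth per 3090 and ~203 GB of 4-bit weights are my assumptions, not OP's numbers):

```python
# Where the ~63 T/s ceiling and the two ratios come from, roughly.
weights_gb = 405e9 * 0.5 / 1e9            # ~202.5 GB of 4-bit weights
eff_bw_per_gpu = 800                      # GB/s, assumed effective (of 936 GB/s peak)
n_gpus = 16

# Ideal case: all 16 GPUs stream their shard of the weights in parallel each token.
ceiling_tp16 = eff_bw_per_gpu / (weights_gb / n_gpus)   # ~63 tokens/s
measured = 24.5
print(measured / ceiling_tp16)            # ~0.39 of the ideal ceiling

# If the software really does 2-way layer split x 8-way tensor parallel, only 8
# GPUs stream weights at any instant, which halves the ceiling.
ceiling_pp2_tp8 = ceiling_tp16 / 2        # ~31.5 tokens/s
print(measured / ceiling_pp2_tp8)         # ~0.78 of that ceiling
```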
RevolutionaryLime758@reddit
You spend $12k for fun!?
330d@reddit
People have motorcycles that are parked most of the time, yet cost more and come with a real risk of dying on the road. I can totally see how spending $12k this way makes a lot of sense! If he wants, he can resell the parts and recoup the cost; it's not all money gone. In the end the fun may even turn out to be free.
alphaQ314@reddit
I'm okay with spending 12k for fun haha. But can someone explain why people are building these rigs? Just to host their own models?
What's the advantage, other than privacy and lack of censorship?
For an actual business case, wouldn't it be easier to just spend the 12k on one of the paid models?
mintybadgerme@reddit
I think you're missing the point completely. It's the difference between somebody else owning your AI, and you having your own AI in the basement. Night and day.
alphaQ314@reddit
I am. I don't get it. That's why I'm trying to understand from you guys to join in on the fun.
mintybadgerme@reddit
Fair enough. :)
Blizado@reddit
Are privacy and censorship not already enough? Also, you can experiment a lot more locally on the software side and adjust it however you want. With the paid models you are much more bound to the provider.
anthonycarbine@reddit
This too. It's any AI model you want on demand. No annoying sign ups, paywalls, queues, etc etc.
hwertz10@reddit
Nice! I mean it's costly, but it's not like there's any INexpensive way to get 384GB VRAM and all that. And it's nice to know that LLM work doesn't push the PCIe bus, since if I ever added additional GPUs to my system it'd most likely be via the Thunderbolt ports on it (which I'm sure aren't going to match the speed of my internal PCIe slots.)
polandtown@reddit
Lovely, would LOVE a video walkthrough of the setup, giving as much detail as possible on the config and everything you considered during the build.
Could you expand on your riser situation? I'm currently using a Veddha frame (in my case with old mining GPUs), but they're all running on x1 PCIe lanes. It's my understanding that those risers cannot run above that. Care to comment?
Conscious_Cut_6144@reddit (OP)
This one works fine for 1.0 / 2.0 / 3.0
https://riser.maxcloudon.com/en/bifurcated-risers/22-bifurcated-riser-x16-to-4x4-set.html
Haven't tried it yet, but this guy sells stuff for 4.0 and even 5.0
https://c-payne.com/products/slimsas-pcie-gen4-host-adapter-x16-redriver
https://c-payne.com/products/slimsas-pcie-gen4-device-adapter-x4
https://c-payne.com/products/slimsas-sff-8654-8i-to-2x-4i-y-cable-pcie-gen4
Both of these stores offer 4x and 8x lane options, assuming your board supports bifurcation.
Pedalnomica@reddit
The maxcloudon ones are gen 3, and the redriver is expensive. I needed the redriver on slot two of that board to avoid PCIe errors, but I'm finding the much cheaper https://www.sfpcables.com/pcie-to-sff-8654-adapter-for-u-2-nvme-ssd-pcie4-0-x16-2x-8i-sff-8654 works fine for the other PCIe slots.
Conscious_Cut_6144@reddit (OP)
Interesting, slot 2 has some extra logic for swapping between m.2, oculinks and the slot so that one being weaker would make sense.
I’ll have to try not using it…
polandtown@reddit
this is fantastic, thank you!
David202023@reddit
Very impressive!
What are you going to do with it? If training from scratch, what model size could this build support?
alluringBlaster@reddit
If you don't mind me asking, how did you break into a career that lets you afford/play with all this tech? Working at a company focused on LLM sounds amazing. Did you go to college or just have incredibly fleshed out leetcode page? Really hope to be in those shoes one day.
azaeldrm@reddit
Thank you for the "why" lmao this is insane. I just bought a second 3090 for my server rig, so looking forward to play with that. This looks beautiful!
mp3m4k3r@reddit
Temp 240VAC@30A sounds fun. I'll raise you a custom PSU that uses forklift power cables to serve up to 3600W of used HPE power into a 1U server too wide for a normal rack.
Clean_Cauliflower_62@reddit
Gee, I've got a similar setup, but yours is definitely way better put together than mine.
mp3m4k3r@reddit
Highly recommend these awesome breakout boards from Alkly Designs; they work a treat for the 1200W ones I have. The only caveat is that the outputs are 6 individually fused terminals, so I ended up doing kind of a cascade to get them onto the larger gauge wire going out. Probably way overkill, but it works pretty well overall. Plus with the monitoring boards I can pick up telemetry from them in Home Assistant.
Clean_Cauliflower_62@reddit
Wow, I might look into it, very decently priced. I was gonna use a breakout board but I bought the wrong one from eBay. Was not fun soldering the thick wire onto the PSU 😂
mp3m4k3r@reddit
I can imagine. There are others out there, but this designer is super responsive and they have pretty great features overall. I definitely chatted with them a ton about this while I was building it out, and it's been very solid for me, other than one of the PSUs being from a slightly different manufacturer, so the power profile on that one is a little funky; not a fault of the breakout board at all.
Clean_Cauliflower_62@reddit
What GPUs are you running? I've got 4 V100 16GB cards running.
mp3m4k3r@reddit
4xA100 Drive sxm2 modules (32gb)
Clean_Cauliflower_62@reddit
Oh boy, it actually works😂. How much vram do you have? 32*4?
mp3m4k3r@reddit
Definitely aren't working with nvlink in this gigabyte server, and they can definitely overheat lol
Clean_Cauliflower_62@reddit
I would be surprised if NVLink works. I had an idea earlier to connect a second server's SXM board directly into the first one. There are some empty PCIe slots on there. Maybe we can get 8 GPUs working 😂😂.
mp3m4k3r@reddit
Ha, maybe. I think someone got them to do NVLink with the PCIe slot adapter, but at like $300/card that's a tough experiment lol
Oh, and they also do not thermal throttle. I dunno what they did to the BIOS in these, but they're definitely intended for one purpose lol
Clean_Cauliflower_62@reddit
Yeah, 300 is actually a pretty good deal. Are you talking about the card or the adapter? The card is going for like 600 on eBay rn. I think SXM2 is the only option if you wanna try out SXM. Other generations are just so expensive.
mp3m4k3r@reddit
Just the adapter, still seeing the cards around $1200, so $1500 total.
It's fine overall, they can link via PCIe anyway. Having some pain getting them to perform better due to the tuning parameters for each hosting container. I threw up some benchmark data I've gathered so far, but I'm trying to also add in tensorrt-llm before I start tuning each a bit further to see what helps.
Probably use the adapter with one of the V100s to toss it in my other server for stuff
Clean_Cauliflower_62@reddit
Too much power for 1u server haha. It already sounds like a jet engine, can’t imagine what yours sounds like😂
mp3m4k3r@reddit
Thankfully it's in the garage. I have the fans tuned down a bit, but tbh I am likely going to take it apart and throw it in a custom immersion tank to have as a wall piece on top of hosting models.
Clean_Cauliflower_62@reddit
Wow, good luck to you. I wanted to do that a while ago, but it sounds like a big project; it will definitely make it quiet though. Are you gonna run mineral oil?
mp3m4k3r@reddit
Yeah, it does sound like fun though! Nah looking at ElectroCool from Engineered Fluids instead, more expensive but also nontoxic and designed for the purpose.
mp3m4k3r@reddit
It does but still more tuning to be done, trying out tensorrt-llm/trtllm-serve if I can get Nvidia containers to behave lol
davew111@reddit
No no no, has Nvidia taught you nothing? All 3600w should be going through a single 12VHPWR connector. A micro usb connector would also be appropriate.
Conscious_Cut_6144@reddit (OP)
Nice, love repurposing server gear.
Cheap and high quality.
makhno@reddit
Dope!! :D
jrdnmdhl@reddit
I was wondering why it was starting to get warmer…
WeedFinderGeneral@reddit
OP needs to figure out how to get his rig to double as a steam turbine to help offset to power costs
Dry_Parfait2606@reddit
Climate change probably be solved by AGI
marc5255@reddit
It'll be eye-opening when AGI says: "There's no possible solution, just damage control at this point. Earth will return to pre-Industrial Revolution climate in 60,000 years if human activity is reduced to 0 today."
jrdnmdhl@reddit
Might even be a few people still left when it does.
akerro@reddit
climate change is now the goal
Take-My-Gold@reddit
I thought about climate change but then I saw this dude’s setup 🤔
jrdnmdhl@reddit
Summer, climate change, heat wave...
These are all just words to describe this guy generating copypastai.
EFspartan@reddit
Jesus, here I am trying to get 4 3090's working and it's been a pain just setting it up. Although I did convert all of mine into water cooled loops...because I didn't want to hear it running.
SanDiegoDude@reddit
Goddamn, I salute your dedication to "I just want something local to fuck around with"
No-Upstairs-194@reddit
Llama 405B on an M3 Ultra 512GB: does it give 15t/s? I wonder about that. If so, I'd prefer the M3 Ultra (with an estimated 450W). Don't you think it would make more sense?
ortegaalfredo@reddit
I think you can get way more than 24 T/s; that is single-prompt. If you do continuous batching, you will get perhaps >100 tok/s.
Conscious_Cut_6144@reddit (OP)
I should probably add 24T/s is with spec decoding.
17T/s standard
Have had it up to 76T/s with a lot of threads.
sunole123@reddit
How do you do continuous batching??
AD7GD@reddit
Either use a programmatic API that supports batching, or use a good batching server like vLLM. But it's 100 t/s aggregate (I'd think more, actually, but I don't have 16x 3090 to test)
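A minimal sketch of the programmatic route with vLLM's offline API (the checkpoint name and tensor_parallel_size are placeholders, not a tested config):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="<your-405B-AWQ-checkpoint>", tensor_parallel_size=16)
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize document {i} in one sentence." for i in range(64)]
# One generate() call lets the engine batch all 64 requests together, so the
# aggregate tokens/s is far higher than the single-stream number.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])
```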
Wheynelau@reddit
vLLM is good for high throughput, but seems to struggle a lot with quantized models. Have tried them with gguf models before for testing.
Conscious_Cut_6144@reddit (OP)
GGUF can still be slow in VLLM but try an AWQ quantized model.
cantgetthistowork@reddit
Does that compromise on single client performance?
Normal-Context6877@reddit
That isn't even that bad for that many 3090s.
Dry_Parfait2606@reddit
THAT'S what I'm talking about!!!
Massive-Question-550@reddit
Curious what the point of 512 GB of system ram is if it's all run off the GPU's vram anyway? Also what program do you use for the tensor parallelism?
Conscious_Cut_6144@reddit (OP)
Vllm. Some tools like to load the model into ram and then transfer it to the gpus from ram. There is usually a workaround, but percentage wise it wasn’t that much more.
Phaelon74@reddit
Not really a workaround, you can just flat out disable this. I was in the same camp as you until I found out how to disable it. And now my 8, 16, 24 and 32 GPU AI rigs have only 64GB of mem.
Also, please tell me you are using SGLang or Aphrodite with this many GPUs.
segmond@reddit
what kind of performance are you getting with llama.cpp on the R1s?
Conscious_Cut_6144@reddit (OP)
18T/s on Q2_K_XL at first,
However unlike 405b w/ vllm, the speed drops off pretty quickly as your context gets longer.
(amplified by the fact that it's a thinker.)
AD7GD@reddit
Did you run with -fa? Flash attention defaults to off.
Conscious_Cut_6144@reddit (OP)
As of a couple weeks ago flash attention still hadn’t been merged into llama.cpp, I’ll check tomorrow, maybe I just need to update my build.
segmond@reddit
It has been implemented months ago, since last year. I have been using it. I can even use it across old GPUs like the P40s and even when running inference across 2 machines on my local network.
Conscious_Cut_6144@reddit (OP)
It’s specifically missing for Deepseek MOE: https://github.com/ggml-org/llama.cpp/issues/7343
segmond@reddit
Oh ok, I thought you were talking about FA in general, didn't realize you were talking about DeepSeek specifically. Yeah, but it's not just DeepSeek: if the key and value embedding head dimensions are not equal, FA will not work. I believe it's 128/192 for DeepSeek.
bullerwins@reddit
Have you tried ktransformers? I get a more consistent 8-9t/s with 4x3090, even at higher ctx.
AD7GD@reddit
Which model types need system ram for vLLM? I'm running a 8B model in FP16 right now and the vllm process isn't using close to 16G.
chespirito2@reddit
Damn near charging an electric car at that power
Lance_ward@reddit
What’s the watt/token you are getting with this bad boy?
I-cant_even@reddit
https://www.pugetsystems.com/labs/hpc/quad-rtx3090-gpu-power-limiting-with-systemd-and-nvidia-smi-1983/
I run 4x 3090s off a single 1600W PSU. I followed the above guide to prevent high power draws with minimal negative effect.
(Also, you know if you rotated the rig relative to the fan the fan would work better right?.... Sorry, I had to.)
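Same idea as the nvidia-smi power-limit trick in that guide, but from Python via pynvml; a rough sketch (275 W is just an example cap, it needs root, and it resets on reboot unless you wrap it in a systemd unit):

```python
import pynvml

pynvml.nvmlInit()
cap_mw = 275_000  # NVML works in milliwatts
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, max(lo, min(cap_mw, hi)))
    print(f"GPU {i}: limit now {pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000:.0f} W")
pynvml.nvmlShutdown()
```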
laterral@reddit
What’s the use case?
cantgetthistowork@reddit
Do you happen to have a listing for the frame? I'm maxed out on a 12 GPU rack and it's annoying me
Conscious_Cut_6144@reddit (OP)
Most frames are designed for stacking; that's what I did here, only I assembled the top one without the motherboard tray so the GPUs could sit lower.
ExploringBanuk@reddit
No need to try R1/V3, QwQ 32B is better now.
Papabear3339@reddit
QwQ is better than the distills, but not the actual R1.
Most people can't run the actual R1 because an insane rig like this is needed.
teachersecret@reddit
It's remarkably close to the actual r1 in performance, which is impressive. I've been playing with a 4.25 quant of qwq and it has r1 "feels".
AdventurousSwim1312@reddit
With that rig, you'd be better off with an AWQ version and vLLM with tp=16. I wouldn't be surprised if you could get into the 100 t/s range that way (never tried with that many GPUs, but with an aggregate bandwidth of ~16TB/s, that's huge).
tilted21@reddit
Fuck yeah dude. I'm rocking a 4090 +3090, so basically 70b models quanted at 4.5. And its still night and day compared to a 7b. I can't imagine the difference that beast makes. Cool!
mrtransisteur@reddit
what you really need is 16x of those 96 GB Chinese modded 4090s.. you could actually fit full og deepseek r1 on that bro ;_;
roydotai@reddit
How much power does that draw?
IdealDesperate3687@reddit
Llama.cpp is your friend for R1. Love your rig!
fairydreaming@reddit
Congratulations on getting it working, impressive build!
But 5kw... Next project - mini fusion reactor.
gtxktm@reddit
Which PSU do you use?
Also, have you tried exllamav2?
Blizado@reddit
Crazy, so many cards and you still can't run very large models in 4-bit. But I guess you can't get that much VRAM at that speed on such a budget any other way, so a good investment anyway.
MD_Yoro@reddit
Sick spec, but can it run Crysis?
sassydodo@reddit
Isn't newest qwq better than r1?
goodtimtim@reddit
heck yeah! congrats on getting this up. If you've got any more of those 650$ 3090s let me know :)
tindalos@reddit
But how many FPS are you getting on Crysis now?
Fresh-Letterhead986@reddit
what did you use for x4 risers?
something i'm really concerned about is isolation of CEM slot power when using multiple PSU.
back in the old mining days, more than a few people fried equipment by powering a card (inadvertently) from 2 separate power domains -- 1st PSU via the PCIe slot; 2nd PSU via the 12V 8-pin molex connectors
x1 risers is the easy answer, but that's a terrible choice (for non-inference). Was considering modifying a x16 ribbon cable like this: https://www.amazon.com/Express-Riser-Extender-Molex-Ribbon/dp/B00OTGJQ10
AD7GD@reddit
I'll watch for you on vast.ai ;-)
ShadowbanRevival@reddit
You got all 16 running on one board?? I remember my ethereum mining days and it was such a pain in the ass to get anything over cards on one board to run smoothly
Such_Advantage_6949@reddit
Do update us on how many tokens/s you manage to get for whichever version of DeepSeek R1 you manage to fit fully in VRAM.
1BlueSpork@reddit
Awesome!!!
MatterMean5176@reddit
Can you expand on "the lovely Dynamic R1 GGUF's still have limited support" please?
I asked the amazing Unsloth people when they were going to release the dynamic 3 and 4 bit quants. They said "probably". Help me gently remind them... They are available for 1776 but not the original, oddly.
CheatCodesOfLife@reddit
FWIW, I loaded up that 1776 model and hit regenerate on some of my chat history, the response was basically identical to the original
MatterMean5176@reddit
Thanks for that. I've been wondering how they compare. I might need to give in and download the "remix".
You're running them at home?
Conscious_Cut_6144@reddit (OP)
I can run them in llama.cpp, but llama.cpp is way slower than vLLM. vLLM is just rolling out support for R1 GGUFs.
MatterMean5176@reddit
Got it. Thank you.
Lissanro@reddit
Quite a good rig! I am looking at migrating to the EPYC platform myself, so it is of interest to me to read about how others build their rigs based on it.
Currently I have just 4 GPUs, but enough power to potentially run 8; however, I ran out of PCI-E lanes and need more RAM too, hence looking into EPYC platforms. And from what I've seen so far, a DDR4-based platform seems to be the best choice at the moment in terms of performance/memory capacity/price.
segmond@reddit
You can go cheap, if you are on team llama.cpp you can distribute inference across your rigs.
chemist_slime@reddit
What beta bios did you need? Doesn’t this board do x4x4x4x4 per slot? So 4 slots -> 16 x4? Or was it for something else?
Conscious_Cut_6144@reddit (OP)
With stock bios the system can’t boot with more than 14 gpus. Gets a pci resource error. They sent me 3.93A
CheatCodesOfLife@reddit
You could run the unsloth Q2_K_XL fully offloaded to the GPUs with llama.cpp.
I get this with 6 3090's + CPU offload:
prompt eval time = 7320.06 ms / 399 tokens ( 18.35 ms per token, 54.51 tokens per second)
eval time = 196068.21 ms / 1970 tokens ( 99.53 ms per token, 10.05 tokens per second)
total time = 203388.27 ms / 2369 tokens
srv update_slots: all slots are idle
You'd probably get > 100t/s prompt eval + ~20t/s generation.
100thousandcats@reddit
Wow, llama 405B. That’s insane!!
jakubdev12@reddit
Why is everybody using AMD CPUs? Isn't it better to get a 3rd/4th gen Xeon with USM, or 4th gen with CXL, and get less VRAM but better bandwidth between GPU/RAM/CPU to offload stuff?
Like having 8x RTX 3090 with 1TB of RAM to load the biggest current models, and to avoid bottlenecking too much on lanes, speed it up with USM or CXL? What am I missing?
bordobbereli@reddit
This guy is responsible ALONE for global warming
the__simian@reddit
I have a question: I've noticed folks in this community really frequently do things like buy 16 3090s rather than fewer cards that are admittedly more expensive but have much more VRAM and perform well in other ways. Why is this? Are 3090s the best price-to-performance at this time, or is there some other reason?
FlowThrower@reddit
nvlink, I imagine. unless I'm mistaken the 40 and 50 series removed the ability
Just-Requirement-391@reddit
How did you connect 16 GPUs to a motherboard with 7 PCIe slots?
el_koha@reddit
bifurcation
GreedyAdeptness7133@reddit
Maybe it’s not
Dhervius@reddit
When ETH Classic pays you this much per day :v
Difficult-Slip6249@reddit
Glad to see the open air "crypto mining rig" pictures back on Reddit :)
TinyTank800@reddit
Went from mining for fake coins to simulating anime waifus. What a time to be alive.
nexusprime2015@reddit
throw nfts in there as well
wingsinvoid@reddit
How many hashes do you get? What are you using, Claymore? :)
GroundbreakingFile18@reddit
Damn, do the lights in your neighborhood dim and flicker when this spins up?
Proof-Examination574@reddit
Yeah but can it run CRYSIS???
Distinct_Benefit_194@reddit
What is your motherboard, if I may ask?
xephadoodle@reddit
wow, that is quite pretty :D
adulfkittler@reddit
Well this answered my question from yesterday 😂
MannyManMoin@reddit
I was about to say M3 Ultra with 512gb ram for 10k USD. (It will be interesting to see M3 Ultra R1 speeds when reviewers are getting the 512gb version).
Fun setup !
-JamesBond@reddit
Why wouldn’t you buy a new Mac Studio M4/M3 Ultra with 512 GB of RAM for $10k instead? It can use all the memory for the task here and costs less.
ybdave@reddit
Let me know if you get AWQ under SGLang/vLLM running! We have the same build with 16x3090. We should compare notes! Currently running R1 with https://github.com/ikawrakow/ik_llama.cpp. Check out the pull requests, lots of development happening!
rorowhat@reddit
Are you finding the cure for cancer?
Boring-Test5522@reddit
the setup is at least $25000. It is better curing fucking cancer with that price tag.
Ready_Season7489@reddit
"It is better curing fucking cancer with that price tag."
Great return on invest. Gonna be very rich.
Conscious_Cut_6144@reddit (OP)
Prices are in my post a few down, got the 3090's for $650 each.
Haiku-575@reddit
Maybe. 3090s are something like $800 USD used, especially from a miner, bought in bulk. "At least $15,000" is much more realistic, here.
shroddy@reddit
It is probably to finally find out how many r are in strawberry
HelpfulJump@reddit
Last I heard they were using all of Italy's energy to figure that out; I don't think this will cut it.
sourceholder@reddit
With all that EMF? Could be creating it.
Massive-Question-550@reddit
Realistically you would have signal degradation in the Pcie cables long before the EMF actually hurts you.
sourceholder@reddit
The signal degradation (leakage) is the source of EMF propagation. If the connectors and cables were perfectly shielded, there wouldn't be any propagation.
The effect is negligible either way. I wasn't being serious.
Massive-Question-550@reddit
I figured. I don't think the tinfoil hat people are into llm's anyway.
YordanTU@reddit
Maybe the tinfoils from the past. Nowadays "tinfoil" is used to discredit many critical or non-mainstream voices, so be sure that many of today's tinfoils are using LLMs.
Anka098@reddit
You need lab samples to test the cure I guess
cultish_alibi@reddit
Oh! You're unbelievable!
orinoco_w@reddit
Whoa
Vivarevo@reddit
This or its for corn
YouAreRight007@reddit
Very neat.
I wonder what the cost would be per hour to have the equivalent resources in the cloud.
miaumiauboombigjan@reddit
the single fan hahahahahahahh
SungamCorben@reddit
Amazing, pull some benchmarks please!
JunketLess@reddit
can someone eli5 what's going on ? it looks cool though
FinnGamePass@reddit
Mining ELon Coin
SomeOddCodeGuy@reddit
That fan really pulls the build together.
BangkokPadang@reddit
Well that's just, like, your opinion, man.
Xylber@reddit
"I'm tired boss"
impaque@reddit
Hahahah I literally thought the same thing, almost posted it, too :D Look at the angle at which it blows, too :D Gold
DangKilla@reddit
Fans don't cool air. He should be blowing the hot air away.... I worked in a data center that used industrial fans.....
Financial_Recording5@reddit
Ahhh…The 20” High Velocity Floor Fan. That’s a great fan.
Then_Conversation_19@reddit
The true MVP
random-tomato@reddit
the fan is the best part XD
needCUDA@reddit
when I used to mine I had a fan too. super effective.
gjallerhorns_only@reddit
I was just thinking that this would have been an insane mining rig like 3yrs ago
davew111@reddit
But... no RGB
tmvr@reddit
Meh, I think it blows!
xendelaar@reddit
That's like..just your opinion, man..
shaolinmaru@reddit
I have one of those and it produces hella wind.
Theio666@reddit
I have a fan that's pretty much like on photo, and I bet the fan is louder than all cards combined xD
OmarDaily@reddit
Damn, might just pick up a 512gb Mac Studio instead.. The power draw must be wild at load.
edude03@reddit
I just 5 minutes ago got my 4 founders working in a single box (I have 8 but power/space/risers are stopping me) then I see this
ForsookComparison@reddit
Host Llama 405b with some funky prompts and call yourself an AI startup.
WeedFinderGeneral@reddit
"We'll just ask the AI how to make money"
OneSmallStepForLambo@reddit
I see some AI ads and I can’t help to think something like this is running in a garage supporting the app
(No disrespect to OP - love the rig)
Jucks@reddit
Is this your heater setup for the winter? (seriously wtf is this for=D)
Alavastar@reddit
Yep that's how skynet starts
Electronic-Site8038@reddit
skynet is in china, already alive for the past 4 years
Thireus@reddit
What’s the electricity bill like?
Conscious_Cut_6144@reddit (OP)
$0.42/hour when inferencing,
$0.04/hour when idle.
I haven't tweaked power limits yet,
Can probably drop that a bit.
MizantropaMiskretulo@reddit
So, you're at about $5/Mtok, a bit higher than o3-mini...
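Rough math, using OP's own numbers ($0.42/hour while inferencing, ~24.5 tok/s single-stream):

```python
# Cost per million generated tokens, before any batching.
tok_per_hour = 24.5 * 3600                      # ~88,200 tokens/hour
cost_per_mtok = 0.42 / (tok_per_hour / 1e6)
print(round(cost_per_mtok, 2))                  # ~4.76 $/Mtok
```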
AmanDL@reddit
Probably 3; nothing beats running locally. Running big models in the cloud, you never know if you're having model parallelization issues, RAM issues, and whatnot. At least locally it's all quite transparent.
smallfried@reddit
You said you have solar. Can you run the whole thing for free when it's sunny?
Conscious_Cut_6144@reddit (OP)
Depends on how you look at it. I still pull a little power from the grid every month, more with this guy running.
MINIMAN10001@reddit
Yep power limits are on my mind with numbers like that lol
Thireus@reddit
Nice! I wish I also lived in a place with cheap electricity 😭 I pay triple.
TerryC_IndieGameDev@reddit
This is so beautiful. Man... what I would not give to even have 2 3090's. LOL. I am lucky tho, I have a single 3060 with 12 gigs vram. It is usable for basic stuff. Someday maybe Ill get to have more. Awesome setup I LOVE it!!
Ok-Investment-8941@reddit
The 6 foot folding plastic table is the unsung hero of nerds everywhere IMO
init__27@reddit
I mean...to OP's credit: Are you even a localLLaMA member if you cant train llama at home? :D
GreedyAdeptness7133@reddit
What kind of crazy workstation mobo supports 16 gpus and how are they connected to it?
Lantan57ua@reddit
I wanted to start with 1 3090 to learn and have fun (also for gaming). I see some $500-$600 used cards around me, and now I know why the price is so low. Is it safe to buy them after mining from a random person?
misterravlik@reddit
buddy, can you post a beta version of the bios for asrock romed8-2t?
M000lie@reddit
How the hell did you connect all 16x GPUs to your asrock motherboard with 7x pcie4 x16?
letonai@reddit
1.21 Gigawatts?
dr_manhattan_br@reddit
Considering each 3090 can draw 400W, you should hit 6.4kW just with the GPUs. Adding CPU and peripherals, it should draw more than 7kW from the wall at 100%. Maybe your PCIe 3.0 is keeping your GPUs from being fully utilized.
DrDisintegrator@reddit
Every time I see a rig like this, I just look at my cat and say, "We can't have nice things."
VrN00b74@reddit
I totally get you on this.
Public-Subject2939@reddit
This generation is so obsessed with fans😂🤣 its just fans its JuST only FANS😭
Ok_Parsnip_5428@reddit
Those 3090s are working overtime 😅
geoffwolf98@reddit
And yet Crysis still stutters at 4K.
Feisty_Ad_4554@reddit
Nice setup in the winter :)
mini-hypersphere@reddit
The things people do to simulate their waifu
fairydreaming@reddit
with 5kw of power to dissipate she's going to be a real hottie!
-TV-Stand-@reddit
You can turn off your house's heating with this simple trick!
RazzmatazzReal4129@reddit
Still cheaper and less effort than real wife.
h1pp0star@reddit
Are you training the new llama model in your garage?
BoulderDeadHead420@reddit
I'm just trying to find one or two at that price, damn.
TigerRobocop_@reddit
OMG
Active-Ad3578@reddit
Now buy 10 Mac Studio Ultras, then it will be like 5TB of VRAM.
kumits-u@reddit
What's your PCIe speed on each of the cards? Wouldn't this limit your speed if it's lower than x16 per card?
Comcast-user-WA@reddit
awesome
Ok-Anxiety8313@reddit
Really surprising you are not memory bandwidth-bound. What implementation/software are you using?
MINIMAN10001@reddit
I mean, once the model is loaded, the communication between cards is extremely limited during inference.
Wheynelau@reddit
Could you elaborate? You mean once the layers are loaded there's no communication between devices?
Sm0oth_kriminal@reddit
Yes, only the actual "data", i.e. the token activations, is passed at inference time. This is on the order of a few MB, whereas the weights are 100s of GB. It's basically nothing, to the point where communication latency matters much more than bandwidth.
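To put rough numbers on that, here is a quick scale comparison, assuming Llama-405B-ish dimensions (hidden size 16384, bf16 activations; these are assumed values for illustration):

```python
# Rough scale of what crosses a GPU boundary per token vs. total weight size.
# Dimensions are assumed (Llama-405B-ish) and only meant for illustration.
hidden_size = 16384                 # model dimension (assumed)
bytes_per_value = 2                 # bf16
params = 405e9

activation_per_token = hidden_size * bytes_per_value    # bytes per token per hop
weights_total = params * bytes_per_value                # bytes of weights

print(f"activations per token per hop: {activation_per_token / 1024:.0f} KiB")  # ~32 KiB
print(f"weights (bf16):                {weights_total / 1e9:.0f} GB")            # ~810 GB
```

Even with a batch of requests in flight, the per-hop traffic stays in the KB-to-MB range, which is why latency dominates over link bandwidth here.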
Wheynelau@reddit
Aren't the activations (hidden states) of intermediate layers passed between devices in the case of pipeline parallelism, while with tensor parallelism much of the communication happens at the layer norm layers, requiring quite a lot of traffic? I could be wrong about the inference frameworks, my specialty is only in training 😅
Sm0oth_kriminal@reddit
Yes those are basically the “token data” but after the Nth layer has processed them.
I’m not sure what OP would use (for MoE it gets slightly more complicated), but tensor parallelism especially on consumer GPUs can be problematic due to collective communication (such as layer norm)
I think the default in many tools is essentially pipeline parallelism (for example, llama.cpp will offload however many layers to the GPU, and run the rest on the CPU). So the activations just behave like an assembly line, they start on the CPU as token+positional vectors, and must be communicated to the first device with the first few layers of the model, then after that is done to the next device with the next layers, and so on
This also has the benefit of being able to handle large request volumes. For example, at any given time for a single request, only 1 device is active (* mostly). So, giving another request when the current request is on device 4/8 means both can be going at full speed — in fact theoretically you can have N concurrent requests each getting effectively 100% of a single GPUs performance in an N GPU machine
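A toy sketch of that assembly-line idea, assuming PyTorch and a stack of stand-in layers (not llama.cpp's actual implementation, just an illustration of splitting layers across devices and passing only the hidden states between them):

```python
import torch
import torch.nn as nn

# Toy "assembly line" (pipeline parallel) forward pass: each device owns a
# contiguous slice of layers, and only the hidden states move between devices.
n_layers, d_model = 8, 1024
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]

layers = [nn.Linear(d_model, d_model) for _ in range(n_layers)]
for i, layer in enumerate(layers):
    # First half of the layers on device 0, second half on device 1.
    layer.to(devices[i * len(devices) // n_layers])

def forward(x: torch.Tensor) -> torch.Tensor:
    h = x
    for i, layer in enumerate(layers):
        dev = devices[i * len(devices) // n_layers]
        h = h.to(dev)   # only the activations (KBs per token) cross the link
        h = layer(h)
    return h

out = forward(torch.randn(1, d_model))
print(out.shape)  # torch.Size([1, 1024])
```

At any moment only one device is doing work for a given request, which is exactly why concurrent requests can keep all stages busy.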
Wheynelau@reddit
Got it! Yeah, consumer GPUs are really not made for collective communication; the bandwidth and compute are usually good enough, but they really struggle when they need to communicate. I tried experimenting on cards without an interconnect: 2 GPUs with TP=2 were apparently slower than one (assuming each card can fit the whole model).
Thanks for sharing on llama.cpp, my work is usually on vllm so I am not too familiar with how llama.cpp shards their model.
The pain point of pipeline is having to wait on the other devices for one token, so yes you are absolutely right, the theoretical limit is N concurrent requests for N gpus.
Ok-Anxiety8313@reddit
If my understanding is correct, it uses full sharding, so each matrix is split across GPUs and they need to communicate a lot to combine the results of all the matrices at each layer? But perhaps it's a different kind of parallelism that uses less communication, maybe dividing up attention heads or something?
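For illustration, that "split each matrix across GPUs" idea (tensor parallelism) can be sketched as a column-parallel matmul: each GPU holds a slice of the weight matrix, computes its part, and the parts are gathered, and that gather is the per-layer communication being discussed. A toy numpy sketch, not any particular framework's implementation:

```python
import numpy as np

# Column-parallel matmul: W is split by columns across two "GPUs";
# each computes x @ W_shard, and the shards are concatenated (the all-gather).
d_in, d_out = 1024, 4096
x = np.random.randn(1, d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

W0, W1 = np.split(W, 2, axis=1)       # shard the weight matrix by columns
y0 = x @ W0                            # computed on "GPU 0"
y1 = x @ W1                            # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)   # the cross-GPU traffic lives here

assert np.allclose(y, x @ W, atol=1e-3)
```

Splitting by columns maps neatly onto attention heads, which is why head-parallel attention is the common case.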
m4hi2@reddit
repurposed your crypto mining rig? 😅
slippery@reddit
Applause for the tight cabling. I wish I could afford a rig like that.
nanobot_1000@reddit
This is awesome, bravo 👏
5 kW lol... since you are the type to run 240V and build this beast, I foresee some solar panels in your future.
I also heard MSFT might have 🤏 spare capacity from re-opening Three Mile Island, perhaps you could negotiate a co-hosting rate with them
Conscious_Cut_6144@reddit (OP)
Haha you have me all figured out.
I have about 15kw worth of panels in my back yard.
power97992@reddit
What do you do at night, pull from the power grid plus batteries?
Conscious_Cut_6144@reddit (OP)
I have batteries, but I still pull some power from the grid like 50 weeks of the year.
I drive an EV too, so I use a lot of electricity.
power97992@reddit
If you have that much sun and pay that much for electricity, I suspect you live in California.
nanobot_1000@reddit
Ahaha you are ahead of the game! That's great you are bringing second life to these cards with those 😊
segmond@reddit
Very nice. I'm super duper envious. I'm getting 1.60 tok/sec on Llama 405B Q3_K_M.
power97992@reddit
That is so slow, you might as well rent an H200 cluster.
segmond@reddit
sure, and what performance are you getting when you run it on your own machine?
power97992@reddit
I usually use o3-mini or Claude, but on occasion I run a 14B locally lol. I get like 23 t/s… I can't imagine running Llama 405B on my machine, it would crash my system and shorten the lifespan of my SSD.
330d@reddit
on what hardware m8?
segmond@reddit
2 rigs with the inference distributed across the network; my slower rig is a 3060 and 3 P40s. If it were 4 3090s, I'd probably see 5 tok/s. I'm also using llama.cpp, which is not as fast as vLLM.
-lq_pl-@reddit
Damn you, leave some for the rest of us.
andreclaudino@reddit
I would like to build something like this for myself, but I don't know where to start. I considered ordering a cryptocurrency mining rig (like yours, it uses a set of RTX 3090s), but I'm not sure it would work for AI, or whether it would be any good.
Do you have a step-by-step tutorial I can follow?
andreclaudino@reddit
Next week, this guy will have trained a new DeepSeek-like model for $25k.
Intrepid_Traffic9100@reddit
The combination of probably $15k+ in cards plus a $5 fan on a shitty desk is just pure gold.
miscellaneous_robot@reddit
DANG!
not_wall03@reddit
So you're the reason 3090s are so expensive
Business-Weekend-537@reddit
Might be a dumb question but how many pcie ports on the motherboard and how do you hook up that many at once?
moofunk@reddit
Put this thing or similar in a slot and bifurcate the slot in BIOS.
laexpat@reddit
But what connects from that to the gpu?
fizzy1242@reddit
A riser cable
Business-Weekend-537@reddit
Where do you get one of those splitter cards? Also was bifurcating in the bios an option or did you have to custom code it?
That splitter card is sexy AF ngl
LockoutNex@reddit
Most server-type motherboards allow bifurcation on about every PCIe slot, but for normal consumer motherboards it's really up to the maker. For the splitter cards you can just google 'bifurcation card' and you'll get tons of results, from Amazon listings to eBay.
Conscious_Cut_6144@reddit (OP)
It's a setting on most boards nowadays.
Bystander-8@reddit
I can see where all the budget went.
Business-Ad-2449@reddit
How rich are you ?
cbnyc0@reddit
Work-related expense, put it on your Schedule C.
DesperateAdvantage76@reddit
So the rig is only $10k instead of $12k after you get back those deductions lol.
rapsoid616@reddit
That's the way I purchase all my electronics! In Turkey it saves me about 20%.
sourceholder@reddit
Not anymore.
RMCPhoto@reddit
I hope it's winter wherever you are.
Tasty_Ticket8806@reddit
Power consumption?? Like 2 and a half nuclear reactors or so...?
Ok_Combination_6881@reddit
Is it more economical to buy a $10k M3 Ultra with 512GB or to buy this rig? I actually want to know.
Conscious_Cut_6144@reddit (OP)
The M3 Ultra is probably going to pair really well with R1 or DeepSeek V3,
could see it doing close to 20T/s
due to having decent memory bandwidth and no overhead hopping from GPU to GPU.
But it doesn't have the memory bandwidth for a huge non-MoE model like 405B,
would do something like 3.5T/s.
I've been working on this for ages,
but if I was starting over today I would probably wait to see if the top Llama 4.0 model is MoE or dense.
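The reasoning behind those numbers is that single-stream decode speed is roughly memory-bandwidth-bound: tokens/s ≈ bandwidth ÷ bytes read per token, and an MoE model only reads its active experts. A rough sketch, assuming ~800 GB/s for the M3 Ultra, ~37B active parameters for R1, and 4-bit weights (all assumed figures for illustration):

```python
# Back-of-the-envelope decode speed: tokens/s ~= memory_bandwidth / bytes_per_token.
# All figures below are assumptions for illustration.
bandwidth_gbs = 800            # ~M3 Ultra memory bandwidth, GB/s (assumed)
bytes_per_param = 0.5          # 4-bit quant

def ceiling_tps(active_params_b: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

print(f"R1 (MoE, ~37B active): <= {ceiling_tps(37):.0f} tok/s theoretical")
print(f"405B dense:            <= {ceiling_tps(405):.0f} tok/s theoretical")
# Real-world numbers land well below these ceilings, which is consistent
# with the ~20 T/s and ~3.5 T/s estimates above.
```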
Cergorach@reddit
With what 3090s are going for today (~$1000) you could make a nice profit... ;)
What would the advantage of running 405B over 671B be in output quality? Or is this just a long-running project you wanted to finish? AI/LLM development is going so darned fast that by the time you buy/build X, Y is already doing it faster, cheaper, and better...
Wheynelau@reddit
I'm more curious about the M4 studio. The rig OP has should be able to fit Q4 deepseek R1, unless my math is wrong. Would be interesting to see how it performs
lolwutdo@reddit
Definitely less of a headache and eyesore
Noiselexer@reddit
And energy
keepawayb@reddit
You have my respect and tears of envy.
xor_2@reddit
What is the performance difference between splitting large model by rows and by layers?
power97992@reddit
5600 watts while running and 7200W at peak usage, your house must be a furnace.
SadWolverine24@reddit
Why do you have 512GB of RAM?
Tourus@reddit
The most popular inference engines all load the entire model into RAM first.
jack-in-the-sack@reddit
A single motherboard??? How???
a_beautiful_rhind@reddit
What's it idle at?
Alice-Xandra@reddit
Sell the flops & you've got free heating! Some pipe fckery & you've got warm water.
freeenergy
Pretend-Umpire-3448@reddit
A noob question: how do you connect all the GPUs? PCIe, or...?
2TierKeir@reddit
What do you do with these bro
Greedy_Reality_2539@reddit
Looks like you have your domestic heating sorted
-Ellary-@reddit
Wow, this rig is almost smirking at me, and it gives me shivers down my spine.
It is a great build for enterprise resource planning.
Future_Might_8194@reddit
I can hear this picture
AppearanceHeavy6724@reddit
It is so hot I had to open my window.
Endless7777@reddit
Cool, what are you doing with it? Im new to this whole llm thing.
sunshinecheung@reddit
A 48GB VRAM 4090 can be much better
Such_Advantage_6949@reddit
Much more expensive as well
ThenExtension9196@reddit
How so? My modded 4090 was 4k and a couple of 3090s cost that much?
HelpfulFriendlyOne@reddit
He paid 650 each for the 3090s
Such_Advantage_6949@reddit
No, I meant a 48GB 4090 is gonna be much more expensive compared to 3090s.
Mass2018@reddit
Nice build. I highly recommend you upgrade your fan to a box fan that you can set behind the rig (give it an inch of clearance for some air intake) so that you can push air out across all the cards.
dizvyz@reddit
I wouldn't have thought that what was keeping us from attaining true AGI was a desk fan.
Wheynelau@reddit
How does it compare to the 3.3 70B? I heard the 70B is supposedly comparable to the 405B; I can imagine the throughput you would get from that.
NobleKale@reddit
So much money on such a shitty little frame.
illusionst@reddit
Can you ask it 'what is the meaning of life?'
gaspoweredcat@reddit
Yikes and I thought my 10x CMP 100-210 (160gb total) was overkill
vogelvogelvogelvogel@reddit
Dude spent like 20 grand on 3090s and mounts them in a 10-buck shelf.
Godess_Ilias@reddit
why?
TheManicProgrammer@reddit
What's the fan cooling
Blizado@reddit
Phew, that is insane. I could never afford this. I'm just happy to at least have a 4090. I hate that I'm so poor. :D
random-tomato@reddit
New r/LocalLLaMA home server final boss!
/u/XMasterrrr
Conscious_Cut_6144@reddit (OP)
He has 8x risers; it's a trade-off getting 16 cards for tensor parallel vs. extra bandwidth to 14 cards.
kmouratidis@reddit
You can get 4x4 x16 switches. It might not help with average bandwidth per card, but if you configure them in a mix of tensor and pipeline parallelism, you'll have enough request throughput to compete with (non-A100/H100/H200) enterprise servers.
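For reference, vLLM exposes both knobs, so a 16-GPU box can be carved up as, say, TP=8 within a group and PP=2 across groups. A sketch only: whether the offline LLM API supports pipeline parallelism depends on the vLLM version and backend, and the model id and quantization here are placeholder assumptions:

```python
# Sketch: splitting 16 GPUs into tensor-parallel groups of 8, pipelined 2 deep.
# Offline pipeline-parallel support varies by vLLM version/backend; the model
# name and quantization below are illustrative assumptions, not OP's setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # placeholder model id
    tensor_parallel_size=8,        # 8-way TP inside each group
    pipeline_parallel_size=2,      # 2 groups chained as pipeline stages
    quantization="fp8",            # example; pick whatever fits the VRAM budget
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```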
HipHopPolka@reddit
Does... the floor fan actually work?
MINIMAN10001@reddit
When you run the math, large fans like that move an enormous volume of air (in cubic feet per minute) compared to desktop fans. Blade size is a major factor in how much air gets moved.
ParaboloidalCrest@reddit
10x better than your 12 teeny-tiny case fans.
Gullible-Fox2380@reddit
May I ask what you use it for? Just curious! thats a lot of cloud time
rf97a@reddit
Optimising bitcoin mining?
These_Growth9876@reddit
Is the build similar to ones ppl used to build for mining? Can u tell me the motherboard used?
0RGASMIK@reddit
Full circle back to crypto days.
AriyaSavaka@reddit
This can fully offload a 72B-123B model at 16-bit and with 128k context right?
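Roughly yes for the weights; the KV cache for 128k context is the other big chunk. A back-of-the-envelope sketch, assuming Mistral-Large-2-like dimensions (88 layers, 8 KV heads, head dim 128 — assumed values) and an fp16 cache:

```python
# Does a 123B model at 16-bit plus a 128k-token KV cache fit in 384GB?
# Model dimensions below are assumed (Mistral-Large-2-like), for illustration.
params = 123e9
n_layers, n_kv_heads, head_dim = 88, 8, 128
bytes_fp16 = 2
context = 128 * 1024

weights_gb = params * bytes_fp16 / 1e9
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16  # K and V
kv_cache_gb = kv_per_token * context / 1e9

print(f"weights:  {weights_gb:.0f} GB")   # ~246 GB
print(f"KV cache: {kv_cache_gb:.0f} GB")  # ~47 GB at 128k tokens
print(f"total:    {weights_gb + kv_cache_gb:.0f} GB vs 384 GB of VRAM")
```

Under those assumptions it fits with headroom; a 72B model at 16-bit is comfortable either way.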
Willing_Landscape_61@reddit
Building an open rig myself. How do you prevent dust from accumulating in your rig?
Dangerous_Fix_5526@reddit
F...ing Madness - I love it.
TerrryBuckhart@reddit
Hopefully whatever you are training the model for will pay for your power bill
GTHell@reddit
Please show us your electricity bill
ReMoGged@reddit
Nice hair dryer
_wOvAN_@reddit
Unfortunately, large contexts are destroying multi-GPU builds.
Odd_Reality_6603@reddit
Bro that's nasty
kaisear@reddit
MadMax build.
330d@reddit
I'm 3rd month into planning, gathering all the parts, reading, saving money... for my 4x3090 build. Then there's this guy :D Congratulations, amazing build, one of the GOAT's here and goes into my bookmarks folder.
lukewhale@reddit
Holy shit. I expect a full write-up and a YouTube video.
You need to share your experience.
beedunc@reddit
Would love to see a ‘-ps’ of that.
Top-Salamander-2525@reddit
You’re going to need a bigger fan…
Conscious_Cut_6144@reddit (OP)
I have a 48in fan mounted in my garage ceiling, exhausting into my attic.
Top-Salamander-2525@reddit
https://bigassfans.com
sunole123@reddit
How many TOPS would you say this setup is?
MixtureOfAmateurs@reddit
Founders 💀. There aren't 16 3090 FEs in my city lol
Conscious_Cut_6144@reddit (OP)
Not anymore 🤣
GmanMe7@reddit
I hope you run Linux
ExceptionOccurred@reddit
What do you use this for?
olmoscd@reddit
you know, talking to it!
robonxt@reddit
I love how the rig is nice, and the cooling solution is just a fan 😂
CheatCodesOfLife@reddit
It's the most effective way though! Even with my vramlet rig of 5x3090's, adding a fan like that knocked the temps down from ~79C to the 60's
Theio666@reddit
Rig looks amazing, ngl. Since you mentioned 405B, are you actually running it? Kinda wonder what performance in a multi-agent setup would be, with something like 32B QwQ, smaller models for parsing, maybe some long-context Qwen 14B-Instruct-1M (120/320GB VRAM for 1M context per their repo), etc. running at the same time :D
legatinho@reddit
384gb of VRAM. What model and what context size can you run with that?
The_GSingh@reddit
ATP it is alive. What are you building agi or something?
Really cool build btw.
Ok-Anxiety8313@reddit
Can I get the mining contact? Do they have more 3090?
TheDailySpank@reddit
For the love of god, hit it from the front
Conscious_Cut_6144@reddit (OP)
Absolutely, that's just for the pics!