From 4090 to 5090 to RTX PRO 6000… in record time
Posted by Fabix84@reddit | LocalLLaMA | 258 comments

Started with a 4090, then jumped to a 5090… and just a few weeks later I went all in on an RTX PRO 6000 with 96 GB of VRAM. I spent a lot of time debating between the full power and the Max-Q version, and ended up going with Max-Q.
It’s about 12–15% slower at peak than the full power model, but it runs cooler, pulls only 300W instead of 600W, and that means I can add a second one later without melting my power supply or my room. Given how fast I went from 4090 → 5090 → RTX PRO 6000, there’s a real chance I’ll give in to the upgrade itch again sooner than I should.
I almost pre-ordered the Framework board with the AMD AI Max+ 395 and 128 GB unified RAM, but with bandwidth limited to 256 GB/s it’s more of a fun concept than a serious AI workhorse. With the RTX PRO 6000, I think I’ve got the best prosumer AI hardware you can get right now.
The end goal is to turn this into a personal supercomputer. Multiple local AI agents working 24/7 on small projects (or small chunks of big projects) without me babysitting them. I just give detailed instructions to a “project manager” agent, and the system handles everything from building to testing to optimizing, then pings me when it’s all done.
skizatch@reddit
You should be able to configure the non-MaxQ version to run at 300W, right? Then you’d have the full 600W for whenever you wanted it. I’m not completely sure about this though. The 5090 won’t go below 400W, probably to hinder people from using it as a server card.
DAlmighty@reddit
Running the Pro 6000 Workstation at 300w actually performs worse than a MaxQ at 300w.
skizatch@reddit
do the fans eat into the power budget or something?
mxmumtuna@reddit
Yep big discussion about this on L1T
Fabix84@reddit (OP)
You can limit the power on the full-power version, but the Max-Q isn’t just about wattage. It’s designed with a cooling layout that allows for better airflow when you run multiple GPUs side by side. That makes it a lot easier to stack them without cooking everything inside the case. For my end goal of running two cards 24/7, that’s a huge advantage over simply downclocking a full-power card.
MengerianMango@reddit
Specifically, the hot air exits outside the case, flowing from the front/turbo side out the opposite end.
sr1729@reddit
Yes, you can use nvidia-smi as root (administrator privileges) to reduce the power usage of an RTX PRO 6000 (Workstation).
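For example (illustrative values; check your card's supported range first):
sudo nvidia-smi -q -d POWER    # show current, default and min/max power limits
sudo nvidia-smi -i 0 -pl 300   # cap GPU 0 at 300 W (requires root)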
It's useful in long-running batches: half the power, but only 15% or 20% more time.
This setting is not persisted across reboots.
The Max-Q card is better suited for multi-GPU systems because of its cooler design.
bullerwins@reddit
This is probably the best decision, especially if you want to run video and image models. A slightly more cost-effective route would be 3x 5090 since those are cheaper, or 4x 5090 (128GB VRAM vs 96GB in the RTX 6000) for about the same price. But you'd still need more power, so you probably made the right decision.
Fabix84@reddit (OP)
Yeah, that’s exactly why I went this route. Power draw was a big deciding factor. 300W for the Max-Q means I can eventually add a second card without completely rebuilding my PSU and cooling setup. The 4x 5090 option is tempting for the extra VRAM, but the total power, heat, PCIe lane limitations, and space requirements would be a nightmare. Plus, I don’t have a nuclear power plant in my backyard to run all those 5090s.
fallingdowndizzyvr@reddit
Well then, wouldn't the Max+ have been a better choice? It's 130-140W at the wall running all out. And since you aren't using this in real time, you are batching jobs and waiting for results, peak token gen speed isn't such a requirement. In fact, you could get multiple Maxi and then spread jobs across those machines instead of them all sharing that one 6000. Which would do a lot to mitigate the lower compute and memory bandwidth of a single Max+.
Fabix84@reddit (OP)
Unfortunately it's not that simple, otherwise I would have gone that route. The memory bandwidth is limited to 256 GB/s, which is far too low for what I typically need. And if you consider using multiple Max+, keep in mind they can only be linked via USB4 at a theoretical 20 Gb/s. In reality, you're never going to run anything across them that wouldn't already fit on a single Max+.
fallingdowndizzyvr@reddit
But how can that be if you aren't running in realtime? You are running batched. Also, in the future when tensor parallel is possible across multiple boxes, that not only effectively boosts the compute but also the memory bandwidth.
Even for running a single large model, if you split the model up and run a section on each Max+, that's more than enough. In the future that might be an issue for tensor parallel, though. But in your case, where you only have 96GB with the 6000, you are running models that don't need a lot of RAM. You are just running multiple jobs. There's no reason you need a fast interconnect between the machines to run different jobs on different machines. So for your current use case, that doesn't matter.
I don't know why you say that, since it already makes sense to do that today, for example with a large MoE.
Fabix84@reddit (OP)
The fact that I don't need it in real time shouldn't be confused with "it doesn't matter how long it takes." The faster and more powerful the system, the more tasks it can complete in the same amount of time. That said, the AMD MAX+ 395 is definitely a very capable machine. As I mentioned, I was actually about to pre-order it (two units, in fact), but in the end I analyzed the many limitations I'd run into. In any case, it's not out of the question that I might get one in the future for other purposes.
Anyway, here are some interesting benchmarks:
https://www.youtube.com/watch?v=N5xhOqlvRh4
fallingdowndizzyvr@reddit
But since you are running multiple tasks, having those tasks spread out across multiple machines instead of running them on one machine would mitigate much of that performance gap.
That is interesting. I've been running multiple boxes for over a year so I'm pretty familiar with the pros and cons.
Mundane_Progress_898@reddit
It seems AMD may adapt MAX+ for traditional desktop -
https://www.crn.com/news/components-peripherals/2025/amd-we-re-exploring-discrete-gpu-alternative-for-pcs
or maybe this solution will look something like this - https://en100.enchargeai.com/
Very interesting, gentlemen. What do you think about this?
fallingdowndizzyvr@reddit
That's not Max+. That's expanding on the NPUs they have had for a while. Unfortunately the big complaint about that is that they haven't provided the software to actually use those NPUs.
"32GB On-Board Memory (up to 68 GB/s bandwidth)"
Too little, too slow.
Mundane_Progress_898@reddit
You wrote 3 months ago that the video gen does not tolerate multiple cards well - https://www.reddit.com/r/IntelArc/comments/1klgixq/intel_arc_b580_rumored_to_get_custom_dualgpu/
Perhaps you have more detailed links?
fallingdowndizzyvr@reddit
You want me to provide links to something that doesn't work?
Try this.
https://www.reddit.com/r/StableDiffusion/comments/1mfogci/can_i_run_wan_22_on_multigpu/
Fabix84@reddit (OP)
Since I’m not running an industrial setup, I can’t really keep multiple machines powered on and operational 24/7. The total power target I’m aiming for is around 1,000–1,200 watts. If I split that across multiple machines, there’s a lot of wasted energy.
fallingdowndizzyvr@reddit
5 Maxi would top out at around 750 watts at the wall. That's going all out.
That's only if they stay busy. When they aren't going all out, they idle at 6-7 watts. Just the idle power of the 6000 alone is about what 3 Max+ would draw. Add in the PC that hosts that 6000, and a PC + 6000 idling draws more than 5 Max+ doing the same.
killver@reddit
Question is how does a heavily undervolted (300-400W) 5090 compare to the 300W Pro 6000?
Salty-Garage7777@reddit
And you'd probably need to learn to live WITHOUT ever removing the earplugs. 😜
iamthewhatt@reddit
doesn't the RTX Pro use the same general layout as a 5090 PCB? A waterblock would work just fine on there
vebb@reddit
I knew being born deaf was good for something!
Bohdanowicz@reddit
Same boat but ada6000 and 7900xtx.
My only hesitation is that competition is so fierce we could see 200+ GB cards at the 6000 level in 6-12 months. Anyone buying an inference card now will choose VRAM over raw speed. We will see 1M+ context soon, and that's a lot of RAM in itself.
I also see cards in the pipe that look to combine vram with expandable memory slots.
dasnihil@reddit
why don't you build a nuclear plant first
Financial_Memory5183@reddit
quick question - could i setup these agents to write news article daily and post to a wordpress site?
Fabix84@reddit (OP)
sure, but it would be a very limiting use case
Forgot_Password_Dude@reddit
It's my first time hearing about Max-Q versions??? Are they new server versions?
vibjelo@reddit
No, just a different air flow, goes out the back of the case instead of circulating inside the case. Useful when you wanna have two right next to each other
panchovix@reddit
Besides power and offloading, image/video models can't be split across multiple GPUs the way LLMs can, and that is a heavy downside.
I.e. with the 6000 PRO you can run wan 2.2 28B fully at FP16 and use about 56-60GB VRAM.
On 2x5090 you can't run it at FP16 (so just FP8 for less quality). The advantage would be you can run 2 instances to generate video on each, but still no FP16 quality.
tofuchrispy@reddit
You can run it at FP16, just use block swapping so the model sits in RAM and the latents are in VRAM
KindlyAnything1996@reddit
Is the performance of 96GB VRAM on one single card the same as 96GB split across 4 different cards? From what I have heard, a model can only use the VRAM on a single card, not across multiple cards. Please correct me if I'm wrong.
subspectral@reddit
Most models are split across multiple cards just fine.
Fabix84@reddit (OP)
You’re right that, by default, a single model can only use the VRAM on one card... it can’t just “merge” the VRAM from multiple GPUs unless you’re using specific techniques like model parallelism or tensor/parameter sharding. And even then, the effective bandwidth between cards is much lower than within a single card’s VRAM, so performance usually takes a hit. How much of a hit depends on your other components (CPU, PCIe lanes, memory speed, etc.) and the workload. That’s why for very large models that don’t fit in a single GPU, a high-bandwidth interconnect becomes essential, otherwise you spend more time moving data between cards than actually computing. That said, even if you’re not running huge models, having multiple GPUs is still very useful. You can run different models in parallel for different tasks without them competing for the same resources.
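As a rough illustration of what that sharding looks like in practice (a sketch assuming vLLM; the model name and GPU count are just placeholders):
vllm serve Qwen/Qwen3-32B --tensor-parallel-size 2   # shard the weights across 2 GPUs
With tensor parallelism, every layer's activations have to be exchanged between the cards, which is exactly where the interconnect bandwidth starts to matter.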
RenlyHoekster@reddit
Inferencing doesn't really need much PCIe bandwidth; you can run multiple GPUs just fine on one node, for example bifurcate an x16 slot into four x4 slots and run your four 3090s or 4090s or 5090s, and with tensor parallelism split the model weights or layers among all your cards.
KindlyAnything1996@reddit
wow so beautifully explained. thanks for the clarity brother 👍👍
Bohdanowicz@reddit
Depends on the model. You would then be limited by the pcie bus speed when the cards communicate. Less fragmentation is best.
hamiltop@reddit
MoE architectures should be decent at partitioning parameters across multiple cards.
KindlyAnything1996@reddit
is there any proof that it is?
Karyo_Ten@reddit
video/image models have issues being split across cards, and the interconnect needs to be fast as there a lot of data to shuffle. Below 500GB/s doesn't cut it and PCIe gen5 x16 is only 128GB/s.
traderjay_toronto@reddit
If the model cannot fit into the VRAM the performance tanks big time
some_user_2021@reddit
He was asking if that VRAM is split in multiple cards
ExplanationDeep7468@reddit
I heard the same
VPNbypassOSA@reddit
I wonder what the lowest viable power draw is for a 5090. Probably not anywhere near even 300W.
If not for that then maybe it would be a good choice.
Herr_Drosselmeyer@reddit
I encountered instability when power limiting my 5090s below 500W. Could probably do 400 with some tweaks but 300 is probably too low.
VPNbypassOSA@reddit
Wow. That’s so high. So best case 1500W for 96GB.
bullerwins@reddit
400W. I did some tests, and other people did too
traderjay_toronto@reddit
But if you run out of VRAM on the 5090 your performance tanks
bullerwins@reddit
Breathes in EU
randomanoni@reddit
Overvolted by default.
Maleficent_Age1577@reddit
It's not cost-effective. 3x 5090 running 24h/365 means roughly 1600W vs. 300W.
_VirtualCosmos_@reddit
Ok rich boy
Raidicus@reddit
It's probably his business and he's probably making enough off of it to justify the expense.
9897969594938281@reddit
The poor Reddit 20 year old brigade downvoting everyone
Raidicus@reddit
classic reddit butthurt
_VirtualCosmos_@reddit
idk, I have looked at investing over 10k euros into one of those RTX cards to run pods on RunPod. It would take many years to recover the investment.
spaceman_@reddit
Dude spent more on hardware in a year than I spent on cars in my lifetime.
JFHermes@reddit
Cars are a famously bad investment while a machine learning GPU is tax deductible and probably won't lose value for like 5 years.
It's a great purchase if your livelihood can justify it.
Leading_Life9014@reddit
tax deductible how?
JFHermes@reddit
If you utilise the GPU in order to produce work you can claim the cost of the GPU against your taxable income. Where you live depends on how you do this.
ryocoon@reddit
I'd like to know that too.
Wild guess is self-employed/owner-operator "legitimate" business expense.
Fun possibilities could be: Capital Investment with heavy schedule depreciation? Alternative Heating sources credit?
JFHermes@reddit
Yeah you got it. Here in Europe you can claim it as a tax expense on a depreciation schedule.
No_Afternoon_4260@reddit
Alternative heating source
stephan_grzw@reddit
🤣
spaceman_@reddit
Depends on what you buy and when. As soon as Nvidia decides "actually, we would like to increase memory sizes on our new cards", current cards will instantly be worth cents on the dollar.
Also, I think my comment implied I buy cars (and cards for that matter) used, not getting a new 7 series every 4 years or some shit like that.
JFHermes@reddit
I think the only time this will happen is with a true competitor to CUDA. Even then, people still buy 3090's to run inference on. I bought my 3090 new 3 years ago and I can only buy second hand for.. the same price.
techno156@reddit
Wouldn't the GPU lose value much more quickly, with how computing technology advances? It's doubtful that someone who's basically upgrading every generation of GPU would stay there long-term.
Especially if things are starting to go in the direction of TPUs instead of GPUs for AI.
Anidamo@reddit
Well, you'd think so, but the GPU market has been weird for a while.
I bought two PNY RTX A6000 Ampere cards about 18 months ago from a used server supply site for $3450 each, and according to ebay one of them sold a few days ago for... $3850.
I figure it has to pop at some point, but who knows when.
MaapuSeeSore@reddit
You only buy used cars that are 15 years old, and have owned only 1 car?
spaceman_@reddit
I've bought four cars, and the youngest was 12 years old.
I don't think that's that exceptional. Most people can't or don't want to spend 60k on a new car (or go 85k in debt for a 60k car).
iamthewhatt@reddit
dude those cards are like $8000 each. You can get a used car from 2020 for that price.
techno156@reddit
15 years old isn't that bad for a used car either, and you can get a fairly solid one for that money these days. It's not a bad proposition if you want to save up for an 8 grand GPU.
Fabix84@reddit (OP)
It’s not about being rich. This is both my profession and my passion, and I just reinvest a small part of what I earn back into hardware.
GreatBigJerk@reddit
If that's a small part of your earnings, you are rich.
Fabix84@reddit (OP)
Elon Musk is rich...
GreatBigJerk@reddit
Yes he is, so are you. You're just different levels of rich. The average person can't take a "small portion" of their earnings to buy a $2500+ graphics card and be okay financially. They are either saving for a long time or are taking on debt.
That goes double these days where just living is ridiculously expensive.
Fabix84@reddit (OP)
What I mean is that, being in the business sphere, this is an investment in my work, not just a luxury purchase. When I started working 20 years ago, I couldn’t even afford my first computer. Today I can buy what I want because I’ve invested in myself and my career. Buying high-end hardware also lets you work on more complex projects and gain experience with technologies that aren’t accessible to everyone, which increases your value to companies. They’re willing to pay more because you actually know what you’re doing.
GreatBigJerk@reddit
Cool. I'm glad that working for 20 years has worked out for you. Most people can't make that kind of investment without taking on debt, regardless of how much they care about their career.
9897969594938281@reddit
Jesus Debbie Downer, follow another hobby if this shit triggers you
moofunk@reddit
It's a weird kind of hubbub over a niche GPU that can be used to earn money. It's not a toy, though you can treat it like one, if you want.
Some industries have much harder paywalls than working in AI.
teachersecret@reddit
Almost everyone has a hobby or something they enjoy doing that they've spent ten grand on at almost any income bracket you care to look at.
I know a dirt poor couple who spends damn near that much every year on cigarettes. Normal not-rich Americans take vacations (poof, there goes ten grand), buy 4-Wheelers and travel trailers (good luck if that's 10k, it's usually more), buy expensive instruments (I've got a $1500 ukulele laying around). You don't have to be rich to spend 10k on something you enjoy, you just have to want to spend 10k on something, and be willing to live without the things you'll be unable to buy if you spend it on a GPU.
I've got a teenager with no bills living at home who earned more than ten grand over the summer. He could have bought one of those things, if he wanted, and he's damn sure not a rich man.
skymasster@reddit
Many people commenting might not fully understand the distinction you're making. There's a difference between being well-off compared to the general population and being truly wealthy compared to the top 1%.
_VirtualCosmos_@reddit
Elon Musk is a fucking billionaire lol
IrisColt@reddit
I am going to agree with you. I once met a rich woman... the only thing that set her apart from me was that she never had to work again... and that hit me like a ton of bricks.
ElementNumber6@reddit
The AI Slop industry is surprisingly lucrative right now.
projectradar@reddit
Get a load of Mr money bags over here
YearnMar10@reddit
I guess we’re all jealous
Neither-Phone-7264@reddit
course we are, i want one
nostriluu@reddit
Just wait until you hear about next year's hardware.
kmouratidis@reddit
Should be 2 years, no?
nostriluu@reddit
The RAM bumped a lot more this year than it has in other years. That's in response to the industry seeing demand. By next year, there should be more solid larger VRAM (or fast shared RAM/hybrid) options. Of course it depends a lot on what happens with the US political situation.
txgsync@reddit
> the best prosumer AI hardware you can get right now
I think that's the fastest at that size that you can buy in a single card.
For about 2/3 of the cost I got myself an M4 Max MacBook Pro with 40 GPU cores and 128GB RAM, and 4TB of storage plus a nice screen, trackpad, keyboard, and sound and stuff attached to it. And it runs about 1/3 to 1/2 the inference speed.
Training is a bit more challenging; the toolchain is different when everyone else is using PyTorch.
It's hard to believe that if I buy Apple gear I'm slumming it for cheap inference of 60GB+ models ;)
redditorialy_retard@reddit
hey you got 3 beasts of a GPU now. Also, do you know where I should start learning to create these AI agent networks?
I might save up to buy a 3090 cuz no way in hell my 2050 laptop is running any AI
getgoingfast@reddit
About your decision to choose the Max-Q over the regular 600W version: what's your line of thinking? Why not pick the 600W and cap the power?
mxmumtuna@reddit
600w version performs worse at 300w compared to the Max Q. Coil whine is also god awful on the 600w.
__JockY__@reddit
We run a pair of Workstation 6000s and suffer no coil whine.
mxmumtuna@reddit
🍀
MelodicRecognition7@reddit
could you share a link to the benchmarks please?
MelodicRecognition7@reddit
https://forum.level1techs.com/t/wip-blackwell-rtx-6000-pro-max-q-quickie-setup-guide-on-ubuntu-24-04-lts-25-04/230521/168
ok I guess my second 6000 will be a 300W version.
mxmumtuna@reddit
There was a lot of discussion on it on L1T. I also have both.
vibjelo@reddit
Heat is a bit harder to manage with the Workstation edition, compared to the Max-Q, since they push hot air differently. If you're planning to run multiple in the same chassis, you pretty much have to use the Max-Q, or do your own bespoke cooling solution.
AstroZombie138@reddit
Why not just use cloud, or a Mac Studio? I know everyone hates on MLX, but you can buy a 512GB Studio for close to the price of the GPU, and while it's slower, it seems like a great platform for a single user.
__JockY__@reddit
You’re in /r/LocalLLaMa!
Fabix84@reddit (OP)
The bandwidth difference between a Mac Studio and an RTX PRO is huge, and that’s before even getting into CUDA and the broader NVIDIA ecosystem. I actually did consider the Mac Studio, it’s a nice machine for certain workflows but for my use case the raw bandwidth, CUDA acceleration, and software compatibility of the RTX PRO make a massive difference. Explaining all the trade-offs would turn this into a pretty long discussion, but trust me, I ran the numbers before deciding.
bregmadaddy@reddit
How is the coil whine on the Max-Q version?
epSos-DE@reddit
What are your AI prompts to do all of that????
epSos-DE@reddit
So , which prompts to make it run for a day autonomously and behave well ???
HilLiedTroopsDied@reddit
Curious what setup you run, LangGraph? What else has an orchestrator framework that runs for hours?
Emotional_Thanks_22@reddit
went dual 5090 but realized that for self-supervised learning my CPU kinda sucks and is the current bottleneck.
Fabix84@reddit (OP)
For a dual-GPU setup you should go for a CPU with 128 PCIe lanes and a matching motherboard, for example, the AMD Ryzen Threadripper PRO 9000 series.
Emotional_Thanks_22@reddit
too expensive i fear, but eyeing something like 32 core 9000 ryzen soon (still at 5950x atm)
HilLiedTroopsDied@reddit
Go used EPYC Gen 4, with 12-channel DDR5.
ggone20@reddit
You’re chasing the wrong dragon.
The money you spent on hardware could net you BILLIONS and BILLIONS of tokens using hosted providers. API usage from anywhere ‘doesn’t get trained on’.
Local inference is a waste of money… we went down this rabbit hole (with math, not just throwing money away lol) early on at work. There are compliant providers also. You’re spending years (literally) worth of hosted inference just to moonlight as a ‘personal supercomputer’ lol.
Fabix84@reddit (OP)
For many people, hosted inference is the smarter move. But in my case, local inference isn’t just about cost per token. I want full control over the models, the data, and the runtime environment, with no dependency on external providers and no usage limits. Some of the stuff I’m building involves chaining multiple agents and running them 24/7, so latency, bandwidth, and privacy matter a lot. Sure, I could rent that power for years, but then I’d also be renting the constraints. For me, owning the hardware means I can experiment freely, test edge cases, and keep everything in-house. And yeah… part of it is just because it’s fun.
Different-Toe-955@reddit
Yeah I can't really agree with using cloud based AI either. How much does it cost to not have a company analyzing your usage of their services?
subspectral@reddit
I did it because I already had a decent VR gaming rig, & so the only additional ‘cost’ was to keep the 4090 when I upgraded to a 5090, instead of selling it.
ggone20@reddit
It's OK if it's fun. Just don't want people to drink the Kool-Aid unless they understand what it's about. Those 6000s are hawt. 😁
CritStarrHD@reddit
I'm curious, what do you use these rigs for? I'm sure that investment is probably justified in some way, shape or form.
Different-Toe-955@reddit
Very economical post.
ICanSeeYourPixels0_0@reddit
Would love to hear what you inferred from your results. I’m currently trying to figure out what the cheapest provider would be to run 480B open source models for day to day use. Privacy would be a huge concern.
ggone20@reddit
I suggest Together for scale. Cerebras for speed. Both are $2/M in/out.
ICanSeeYourPixels0_0@reddit
It’s less about me caring about my own personal privacy and more about ensuring that my clients have confidence that their code isn’t being sent to servers out of their control. Granted that’s still a possibility when using cloud compute, but thanks for the suggestions. I’ll have a look into those two options.
ggone20@reddit
I understand. Having dealt with HIPAA compliance it’s more than just a ‘want’.
Cheers
sp4_dayz@reddit
Dunno, it's like telling a 911 RS3 owner that he could rent almost any car he wants for years (a new car per week) for that price.
ggone20@reddit
Not at all. A car is a car is a car. They all [mostly] get a->b.
Hosted intelligence is just fundamentally better. Be it cost, energy, intelligence, speed, whatever - you CAN’T EVEN rent that car.
For some use cases an AI is an AI and the a->b thing holds up. But not mostly.
It’s fun and OP responded to my comment with all the reasons why. My comments weren’t criticisms so much as making sure new people don’t think they need tens of thousands of hardware to do good stuff 🤓. Unless you want to.
Marksta@reddit
He's chasing the right dragon if he has any non LLM use case, or LLM use cases that aren't coding. The hosted ones going into an instant refusal shutdown freak out over anything remotely human makes them incapable of nearly every creative task in my experience.
Lots of scenarios of 'Whoops sorry, forgot this characters backstory is about the time they got betrayed and left for dead by their-then masters.' then you get a lesson from the LLM that beating up people is mean and you can't be a person's master. 😑 Not sure any genre of anything won't set nanny LLMs off.
ggone20@reddit
unpopularopinion - fiction writing is a waste of life, paper, and time. Lol get a real job. Maybe the LLM is trying to make you better. Thoughts?
Lmao this is totally in jest and mostly because I couldn’t help myself and I’m a terrible person. Fiction is silly tho… unless it’s SCIENCE fiction - that’s just prophecy 🤭👽🤷🏽♂️
MaycombBlume@reddit
This is /r/LocalLLaMA. If you don't understand the value of local models, or the anti-value of cloud services, then you're in the wrong place.
ggone20@reddit
Lol just conversation. I understand just fine - the value in anything is a cost benefit analysis. There isn’t much value in local models for MOST people. Maximum intelligence for minimum cost is the name of the game ‘in the real world’. Thanks for the comment tho!
Trotskyist@reddit
It makes zero rational sense, but I do think there's some value in the cheaper (but still high end) cards e.g. a couple of 3090/4090 because for stupid human brain psychology reasons I'm much more willing to play around with some stuff if I already have the sunk cost of the hardware vs. paying per minute on a hosted provider.
There's definitely a limit here though. For me it was 2x 3090s. I use hosted for anything where I need more power than that.
ggone20@reddit
This is the struggle. Local inference sounds so nice at first blush… then you start to understand the inherent limitations, headaches, power bills.
Hosted intelligence scales so much better. Until we’re all running things 24/7 in the background (at which point inference/intelligence will be close to free) setting up for local is just a hobby/game. It’s fun. It feels good when those cards start spitting out tokens. Always ends in disappointment and going back to APIs lol
Lividmusic1@reddit
It’s a good decision. I went with the 600w version and it runs FUCCKKn hot, I had to under volt it to get acceptable thermals. However I see almost no performance drop with a 30% drop in power draw, which is pretty insane imo
Different-Toe-955@reddit
GPUs love being undervolted a little bit. -80mV drops 220W GPUs to 160W average, and that's with a 200MHz core overclock.
Fabix84@reddit (OP)
Thanks for sharing your experience!
Shadow-Amulet-Ambush@reddit
Do you have a system in mind for your “project manager” workflow? I’ve been looking to do the same but with API use models. Haven’t found anything that can run on its own after being given a task like that
AI-On-A-Dime@reddit
Lol, I like how this is tagged ”discussion”, what’s there to discuss?
daisysmcguire@reddit
My wallet hurt when I saw the prices
hibagus@reddit
I bought the 600W version since you can limit the 600W version to run at 300W and have better cooling :)
Correctsmorons69@reddit
With the spend on cards, why are you worried about buying more PSUs? Could watercool them and/or undervolt them.
Fabix84@reddit (OP)
I don't know why so many of you talk as if you had hydroelectric power plants in your garage. In your country, electricity supply is probably unlimited. Unfortunately, where I live, it's not. It's not a PSU problem.
Correctsmorons69@reddit
I just responded to what you wrote, and you primarily framed it as a PSU problem. 🤷♂️
GTHell@reddit
Leaving it running to do its agentic thing is going to be crazy for the electric bill. Just from using my 5080 for gaming over a week's holiday, my electricity jumped to $150+ for that month. Crazy!!
JR2502@reddit
Assuming an avg cost of $0.17/kwh, and this card running at its peak 300W consumption over 24h, that'll be:
0.3 kW x 24 h = 7.2 kWh per day; 7.2 kWh x $0.17/kWh = $1.22 per day, or ~$37/month.
GTHell@reddit
Well, I fell into the trap of that simple calculation too. The calculation itself doesn't factor in a lot of things, like heat. A server room requires some sort of cooling solution, and in my case my 5080 turns my man cave into a furnace, which forced me to add a fan and increase my AC usage. So many other factors as well.
ThenExtension9196@reddit
Corrected title:
I swiped my credit card in record time.
Fabix84@reddit (OP)
naa... wire transfer.
ThenExtension9196@reddit
If ya got the money ya got the money. Talking crap but tbh if you can afford it I tip my hat to you sir.
Financial_Memory5183@reddit
How much better are the models that you can fit into 96GB than into 32GB? I'm running the 5090 MSI Liquid.
Fabix84@reddit (OP)
For example, you can run all the medium models at full precision, and I assure you that there is a big difference between a model running at full precision and one quantized to, for example, 4 bits.
subspectral@reddit
I have that same GPU, paired with my older 4090, for 56GB of VRAM. I can run some really impressive iMatrix MoE quants from mrademacher on HuggingFace.
MengerianMango@reddit
I got the non Max Q and kinda regret it. Sorta want to get a second so that I can run the big models, but hard to do that now with the fan version. The thermals are gonna be dooky
eidrag@reddit
Good sir, prithee, might thou bestow upon me an RTX 6000 pro 600w?
Fabix84@reddit (OP)
Yeah, that’s exactly why I went with the Max-Q. It’s not just the lower wattage, the cooling design is made for running multiple cards side by side without choking airflow. For my plan to eventually add a second one, that thermal headroom is a big deal.
jaMMint@reddit
you could just use a riser cable to give them more space in between, if that's a problem..
if420sixtynined420@reddit
The r6p responds really well to undervolting & will open up a lot of thermal headroom
MengerianMango@reddit
Is that possible on Linux?
vibjelo@reddit
Sure, should be doable with nvidia-smi. If in doubt, check the "nvidia overclocking" page on Arch Wiki, just go the other way with the values :)
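A rough sketch of the nvidia-smi part (illustrative values; a true voltage-offset undervolt still needs nvidia-settings or an NVML-based tool on top of this):
sudo nvidia-smi -i 0 --lock-gpu-clocks 210,2400   # clamp the GPU clock range
sudo nvidia-smi -i 0 --reset-gpu-clocks           # revert to defaults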
if420sixtynined420@reddit
https://discuss.cachyos.org/t/undervolting-nvidia-gpu-under-wayland-and-x11/3293
MengerianMango@reddit
Thanks!
YashP97@reddit
Crazy money
Fabix84@reddit (OP)
Money spent on hardware that allows you to make more money is never crazy money.
Front-Republic1441@reddit
5 rigs running 3090s would be cheaper
Fabix84@reddit (OP)
Yes, but then you need a hydroelectric plant to power everything and for the cooling system.
Front-Republic1441@reddit
if you have the money for it and can monetize it properly go all out
Front-Republic1441@reddit
true
Butefluko@reddit
Could you, for the sake of it, provide a benchmark on a video game?
Fabix84@reddit (OP)
I'll do it as soon as I receive it. I've just ordered it for now.
thinkscience@reddit
Build vs buy! Doesn't the AI cloud serve you better than building it yourself?
Fabix84@reddit (OP)
These are completely different needs. For many of the projects I work on, I use cloud AI exclusively. For others, exclusively local AI. For still others, a mix of the two. There's no single solution.
TerrryBuckhart@reddit
You using this for image gen workflows? That’s the clearest use case I can envision.
Fabix84@reddit (OP)
Image gen, video gen, LLMs
Gridhub@reddit
daddys money...
Fabix84@reddit (OP)
no, my job money...
Last_Mastod0n@reddit
I'm currently a 4090 owner. I bought it at first for gaming nearly 3 years ago but justified the purchase to my wife by saying I would use it for AI. Now, here I am using it every day to process data on my Llama and Gemma models 😂
I don't feel like the jump from 4090 to 5090 is worth it. But I was wondering what your take is on the performance increases? It sounds like you weren't happy with the 5090 and thus got the RTX PRO 6000 super quickly. I'm thinking if I did get an AI accelerator card, I would buy a used one on eBay.
Fabix84@reddit (OP)
I actually recycled my 4090 into my racing simulator, so I'm still happy with it, and that's probably the same fate my 5090 will meet when the RTX PRO 6000 arrives. The gap between the 4090 and 5090 isn't huge, but the extra 8 GB of VRAM can be nice, especially if you're running AI models for video generation like WAN 2.2. That said, I honestly wouldn't recommend upgrading from a 4090 to a 5090.
subspectral@reddit
You can run them together for 56GB of VRAM with Ollama.
Last_Mastod0n@reddit
I honestly need to look into that. I would have 48GB of VRAM with 2 4090s. The problem is my second PCIe slot is only x8 instead of x16. I wonder how much that would hinder performance.
subspectral@reddit
In theory, significantly. In practice, for local casual use, I don’t think it will matter that much.
Try it and see.
Last_Mastod0n@reddit
Yeah, that pretty much confirms my thoughts. 8GB more VRAM is just not enough of a jump to justify the purchase. I don't use video generation, just image and text analysis. Thanks for the advice!
Btw that racing simulator looks absolutely sick!! 🔥
Bitter-Good-2540@reddit
With how fast AI hardware is devaluing, I wouldn't spend that money, but you do you
Freonr2@reddit
4090s and RTX 6000 Ada are still worth solid money.
Fabix84@reddit (OP)
I get what you’re saying, but from a business perspective that logic doesn’t really work. Clients pay me to deliver solutions now, not to wait for future hardware to get cheaper. Hardware prices and performance have been on that curve for decades. If you just keep waiting for the “next drop” you’ll never actually build anything.
MustafaMahat@reddit
What kind of rig do you use to put these video cards in?
thinkscience@reddit
how much did you spend on it !
AfraidScheme433@reddit
here i’m running 4 x 3090s..i feel unworthy
Heterosethual@reddit
"The end goal -" damn did you not plan any of this? Try and max out the 4090 before getting the 5090? You might be on an endless "goal" here where no one wins but Nvidia lol
Demonicated@reddit
I did almost the same thing, but I went with the Workstation edition. With nvidia-smi you can set it to 300W if you want, essentially the same thing as having the Max-Q. I usually set it to 400W; I find that keeps it below 80 degrees C.
stacksmasher@reddit
$8K? Good luck if you need to RMA LOL!!
iamgladiator@reddit
Appreciate you sharing, ignore the jealous comments. Would love to hear what you're using it for and how it's going too
Donnybonny22@reddit
What would be the optimal build to pair with the RTX PRO 6000?
gwestr@reddit
I went with the liquid cooled 5090 and run it at 402W (70%) to get a similar effect to Max-Q. Then I let the Nvidia app AI thing set the overclock automatically. Something like +120 cores and +210 memory, currently. Fans stay 1000 rpm when in use and off otherwise. Draws about 30W at idle. 70C max temps for a new game and cools down to 40C in a few minutes.
subspectral@reddit
I have an MSI Liquid Suprim 5090, runs quite well, no tweaks beyond factory overclocking. My 4090 is air-cooled. Even so, they don't make a lot of noise when running together.
subspectral@reddit
I have a dual 5090/4090 setup on an i9-13900K with 128GB of DRAM. It gives me 56GB of pooled VRAM, & Ollama load-balances models across it.
This is actually my VR gaming rig, but I dual-boot it into Ubuntu Server Pro. I may replace the 4090 with the Max-Q 6000 at some point, but this is enough to run some decent iMatrix quants, for now.
ik-when-that-hotline@reddit
> Multiple local AI agents working 24/7 on small projects (or small chunks of big projects) without me babysitting them. I just give detailed instructions to a “project manager” agent, and the system handles everything from building to testing to optimizing, then pings me when it’s all done.
FurrySkeleton@reddit
Nice! I went with a pile of cheap cards and kinda wish I had gone this route instead. Oh well, maybe I'll catch it next generation when I'm unsatisfied with what I have.
Elvarien2@reddit
are you just here to brag about your wealth ?
KlutzyWay7692@reddit
Curious what the overall ROI for a project like this would be. Very curious to see what kinds of projects you're running on your setup. Do you have a GitHub or somewhere else where you're documenting your process? I would be interested in learning about the specific workflows you're running and what the overall cost of inference is vs. going with online APIs.
Rich_Artist_8327@reddit
If using vLLM with tensor parallelism, a 4-card setup (if you have the space and lanes) could be the best option.
I now have 2x 7900 XTX and 1x 5090, and may add 2 more 7900 XTXs.
It pretty much scales in performance. (With Ollama and others, adding more cards won't scale in performance.)
Anyway, I have also been thinking about going with 4x RTX PRO 4000 for 96GB of VRAM total under 5K. In total they would have far more CUDA cores than one RTX PRO 6000, and the combined memory bandwidth of those PRO 4000s would be more than a single card's. Let's see.
VectorD@reddit
I wouldn't say that it runs cooler, it has worse cooling than the 600W version. You can just buy the 600W version and power limit it to 300W and it will run cooler than the maxq.
mxmumtuna@reddit
They (Max-Q) do run cooler, and do have a better thermal design especially for multiples. Power limited 600w versions performs a good bit worse than the Max-Q versions and generate more heat.
The downside is about a 10% perf penalty, but better thermals and easy expansion to multiple cards.
The coil whine on the 600w versions is horrible too.
Trilogix@reddit
Pff ads.
Impossible-Glass-487@reddit
"4090 → 5090 → RTX PRO 6000"
Has $10K to drop on a new GPU (+$3200 for the 5090 + $2000 for the 4090 = $15K on GPU's in the last two years), hasn't completed any projects with $10K GPU. Perfect weather, has nothing to do on a Sunday at 2PM. Takes the time to make this post and uses arrow emojis, instead of spending your time starting projects. Posts an article and not a picture of their own PC build.
Fabix84@reddit (OP)
If I hadn’t completed any projects, I probably wouldn’t have the budget for more GPUs in the first place. I’ve only ordered the new card and haven’t received it yet, so this post was just about sharing the upgrade path, not a full build log. I’ll post proper pics once the setup is complete.
turbotunnelsyndrome@reddit
Would love to hear what kinds of projects OP is working on
Forgot_Password_Dude@reddit
Where do you buy one? I don't see the Max-Q on eBay, only the regular PRO 6000
Fabix84@reddit (OP)
In Europe it can be found quite easily in larger computer stores.
kissgeri96@reddit
Yeah, let us know how it performs vs the 4090/5090. I think a lot of us are looking at it right now, trying to figure out whether it's worth selling our kidneys for one of these PRO 6000s. 😄
mxmumtuna@reddit
6000 is very much a 5090 with 96gb. Performance characteristics are the same. As is the hassle with vllm and sglang.
Impossible-Glass-487@reddit
Fair enough, looking forward to it.
ZealousidealShoe7998@reddit
That's fun, I plan to buy a few in the near future. I was looking into building a workstation with 3 Max-Qs: that way I can run a full LLM with a knowledge graph, have one for video/photo gen which I can use to train on my photos, and one for smaller models/a TTS assistant that can control everything.
Also, when I'm not using AI I can just spin up a Windows machine and play some games on it remotely or through KVM.
I believe in the near future consumer GPUs will come with 96GB of VRAM once more and more people move from these cloud solutions to local LLMs. Look at these new Chinese models, they can almost be used for agentic programming just as well as Claude Code. Assuming you have the knowledge, you could train one to become even more proficient in your language of choice, tasks, etc.
So having multiple cards would be ideal for people who want to be productive and have bandwidth to explore training
townofsalemfangay@reddit
Not at all jelly right now... please do share some of the performance you get!
jklre@reddit
I was looking at getting 2 of these. I can't seem to find them under $9500. I was wondering about power consumption as well. I'm basically training ML and LLM models, fine-tuning, LoRAs, etc. What is the power cabling like?
moar1176@reddit
I have 2 coming Tuesday. Check exxact corp, great prices; but business to business only so don't bother with a Gmail inquiry.
elchurnerista@reddit
anyone can get a custom domain email
mxmumtuna@reddit
Just get an EIN and they don’t care.
jklre@reddit
Thanks for the tip. I used my company email. It should get their attention.
parrot42@reddit
With a "be quiet! Pure Power 13 M 650W ATX 3.1" and one card, it would be one cable from PSU to GPU (with the Max-Q version).
Ok_Doughnut5075@reddit
Why would you buy one card instead of building a cluster?
Fabix84@reddit (OP)
300W vs 500W+500W+500W for, say, 3x5090. Easier on power, cooling, and noise.
Ok_Doughnut5075@reddit
You're not running the big boy models on 3x5090.
futilehabit@reddit
Clusters can have a ton of overhead though, can't they?
fiery_prometheus@reddit
IMO people care too much about power draw. When you are at this level, just get multiple PSUs and buy whatever combination gives the best performance/VRAM combo. As long as you can fit everything onto a single board, like the WRX mobos, you are still within the ease of usability of a single-board system. And the cards can be power limited; it's not like 300W-limited cards are magically going to NOT produce power spikes anyway.
But if physical space is a requirement, then you have more limited options. And if you buy a card like an RTX 6000, it's not a stretch of the imagination to get a high-wattage PSU either; if you want to dual-wield on a single PSU, multiple PSUs won't even be necessary, even for the high-wattage version of that card.
Fabix84@reddit (OP)
It’s not really about the PSU, it’s about total power availability. Especially for someone like me running on a standard household electrical setup in a European country, total draw is a real factor. Sure, if you have 18 kW in solar panels you can do whatever you want, but even then I’d rather have more machines that each consume less than fewer machines that each consume more.
fiery_prometheus@reddit
You have a totally legit point. But if you live in a house or similar here in the EU, it's pretty normal to have either 25A on most buildings, or a 32A/35A 3-phase connection to the power grid. That's a lot of power, so unless you are maxing that out, your problem is more likely a tripped circuit breaker or incorrectly load-balanced phases. Think of all the homeowners who have a workshop with large machines: they can draw a lot of power, which would easily trip a phase where other home stuff is competing.
Fabix84@reddit (OP)
In Italy I have a single-phase supply with 3 kW.
fiery_prometheus@reddit
I pay my respects :D
JacketHistorical2321@reddit
Gotta love having this much expendable income
Western_Objective209@reddit
Why wouldn't you build this with API's first as a proof of concept? There are dozens of people who have tried this project out already, the state of the art agentic models like opus 4 aren't quite there yet, so it's really unlikely you'll be able to build it with open source models
Like if you just like hacking on hardware and touching the real stuff, more power to you, but you aren't going to beat companies with tens to hundreds of thousands of Nvidia data center cards with a few thousand in hardware
Legitimate-Dog5690@reddit
I'm not sure if I have some sort of underlying jealousy going on here, certainly not ruling it out.
This feels like such a bizarre waste though, for a sliver of that cost you could have had a few years of top notch server access and done exactly the same. By which point models have improved, hardware has improved, A6000s have plummeted in value.
nicksterling@reddit
There are use cases that preclude using a hosted solution like being in an air-gapped environment or if you have strict privacy policies. There’s also something to be said for having a reliable and consistent solution that won’t change. You have complete control over the model, the quants, the drivers, etc.
Legitimate-Dog5690@reddit
I guess it depends very heavily on the use case and future pricing. It seems an odd subreddit for me to argue against local, I know, it's just an argument I've had in my head multiple times when I consider chucking lots at this sort of thing.
I'd worry about tech moving along rapidly and being left behind. I've worked in big tech with AI and there was so much focus on building solutions that could be neatly bundled, spin up a container, receive context, run the agent, share output, release.
The initial pitch isn't anything new, it just feels like a limited way of doing it. A single card isn't going to be run in stacks of agents simultaneously, on any sort of big model.
Fabix84@reddit (OP)
For a lot of people, cloud access makes more sense. For me, though, it's not just about raw cost per token. I need full control over the hardware, the models, and the data, without worrying about API limits, provider policies, or external dependencies. Some of my projects involve multiple AI agents working together continuously, so having everything local means lower latency, no bandwidth bottlenecks, and complete privacy.
Legitimate-Dog5690@reddit
Thanks, apologies for a very similar reply to what was posted already a few times.
There are a lot of places literally renting the hardware on a fairly empty Linux box, I think you'd end up with a much more powerful setup if everything was containerized up and ready to deploy anywhere, be that local or futuristic servers running hardware that's not currently invented.
I think my other issue personally would be the quality of current local models at the top end currently. I'm not sure what sort of work you do, but taking coding for example, something fairly meaty like qwen3, you're either going to run a really quantized 235b or a limited 30b, neither would compete with Claude/Gpt/Gemini. A dockerized cloud GPU could happily run the 235b, spin up when needed, hammer out the request, release.
Fenix04@reddit
I'm looking to build something similar! What CPU and memory did you use?
Fabix84@reddit (OP)
Right now it’s still running in the same setup I used for the 5090: AMD Ryzen 9 9950X3D, 96 GB DDR5 6800 (2×48), and 12 TB PCIe5 storage. When I eventually move to dual RTX 6000 PROs, I’ll also switch to an AMD Ryzen Threadripper PRO 9000 series and a new motherboard.
Fancy-Restaurant-885@reddit
People who have more money than sense
Fabix84@reddit (OP)
Probably because it makes me more than it costs me?
Commercial-Celery769@reddit
Feels like a more boujee version of what I did: started with a prebuilt with a 4070 Super, upgraded from 32GB of RAM to 128GB, got a 3090, upgraded to a 1200W PSU, swapped to a super-tower case (the Antec C8 Wood: low price, giant, and looks really nice), then got a 3090 Ti to get 48GB VRAM total. On a side note, you can buy a used 8x V100 32GB SXM2 server with a total of 256GB of VRAM for around $6k. Expensive, but a better deal than trying to get 256GB of VRAM with 3090s, plus they are NVLinked by default, so it should be quite good for training.
alew3@reddit
Did you think about getting the 600w version and dialing the power consumption down for normal day usage? So you can run it at full speed when needed? I set my rtx 5090 to 400w max power draw.
Fabix84@reddit (OP)
That works fine if you don’t plan to go multi-GPU. In that case, airflow design becomes important. It’s not just about wattage.
hyouko@reddit
I don't wholly understand going with the max-Q version. You can always run
nvidia-smi -pl 300
and run the full fat version at lower power. I am sure they have done some optimizations for low wattage performance, but still, having the option to go full power if you want seems worthwhile. (Source: I have the Workstation edition and do this regularly. Oddly, this would NOT have worked with the 5090, which I think refused to go lower than 400W?)
Fabix84@reddit (OP)
The Max-Q makes multi-GPU setups a lot more practical.
DAlmighty@reddit
Run some benchmarks on various power levels and you’ll see the light. 450w is a sweet spot when looking at performance per watt. Then you’ll realize that it doesn’t matter and run it at 600w.
vibjelo@reddit
The major difference is the cooling design. Workstation = blow hot air around in the case, Max Q = blows hot air out the back of the case
More important when you have more than one unit in the same case
power97992@reddit
Just buy 3 B200s already... Tell us how well it will run Qwen3 Coder…
And-Bee@reddit
Your agents will forget what you asked them to do and proceed to vandalise your directory.
j4ys0nj@reddit
Nice! I picked up the SE recently. made a custom duct for it - running nice and cool!
developers_hutt@reddit
Dammm. New way to flex money 💰
Junior-Childhood-404@reddit
How much money do you have Jfc
sp3kter@reddit
The ultra wealthy will use their own local uncensored models to cheat and steal.
DIXOUT_4_WHORAMBE@reddit
RTX 6000 is an incredible waste of money
some_user_2021@reddit
For you
teachersecret@reddit
So’s a jet ski. Shrug!
chisleu@reddit
You are going to have to back that up with something brother. You don't get 96GB of VRAM on one bus for cheaper. The tokens per second are not worth it, but being able to load models into a single card will give great performance.
tat_tvam_asshole@reddit
Strix Halo