RTX 5090 or Mac Studio?
Posted by Excellent_Koala769@reddit | LocalLLaMA | View on Reddit | 38 comments
Hey Guys,
I run a small business where I use many agents to handle sensitive client work. Everything has to stay 100% on-prem for compliance reasons.
Right now I'm running the full Gemma 4 31B dense model (4-bit) on my M5 Max laptop with 128 GB of memory. The main agent does long reasoning tasks and I'm only able to run about 2 agents at the same time. I get around 28 tokens per second when it's just one, but it drops to 22 when two are going. The whole thing feels slow and I'm already hitting the limit.
In the upcoming months I need to scale up to handle way more agents at once (around 40-80 concurrently).
I'm trying to decide between building a simple RTX 5090 desktop node (and using vLLM) or buying a high-RAM Mac Studio. The GPU side seems a lot stronger for running multiple agents, but the Mac would be quieter and simpler.
What would you guys do?
Kofeb@reddit
FYI you don’t need to stay on prem for “compliance reasons”
Even for HIPAA you can do a BAA with AWS and still be compliant. AWS Bedrock has access to Anthropic models and can do agentic pipelines and workflows. They have zero data retention and no training on your data too. Same with Azure OpenAI; just make sure you’re looking at the data residency, because the region doesn’t equal data residency, it’s the model’s SKU.
croninsiglos@reddit
Bingo! Anyone claiming “compliance reasons” hasn’t done their homework. It’s the old fear, uncertainty, and doubt around cloud.
HopePupal@reddit
my lawyer is also a 5090
AlwaysLateToThaParty@reddit
If concurrency is required, the Nvidia card will always perform better. As more concurrency is required, the difference will become greater. If you have many running full-time, the Mac will be considerably slower.
Small business? RTX 6000 Pro. That's what I have, and I run it on a six-year-old 10th-generation Intel with DDR4 RAM. It handles this type of workload fine.
Excellent_Koala769@reddit (OP)
RTX 6000 Pro? That has 96 GB of VRAM?
The thing is, Gemma 4 31B fits into a 5090 almost perfectly. Could I load multiple Gemma instances into a 6000? Also, what would the concurrency look like between 3 5090s vs 1 6000 Pro?
AlwaysLateToThaParty@reddit
Why would you need to? Just use one and hit it through the API endpoint. But you could run multiple different models at the same time, which I sometimes do. Pretty hard to go past qwen3.5 122b/10a heretic mxfp4_MOE though. 75GB of VRAM, and it does pretty much anything I ask it to do.
One card will always be better, because there doesn't need to be communication across the bus.
Excellent_Koala769@reddit (OP)
I am probably wrong on this, please correct me if so. My reasoning behind this statement was that Gemma 4 31B needs roughly 25 GB of VRAM (somewhere around that), so couldn't you just load it into the RTX 6000 Pro 4 times and inference from 4 different instances?
AlwaysLateToThaParty@reddit
Dude, you are definitely wrong on this. You don't know how inference works. You run one model, and just send all of the requests to that one model through its API endpoint. It handles the concurrency. If you want to run Gemma, qwen3.5, and gpt-oss, sure, you can run three different models. But there's no benefit whatsoever to running multiple copies of the same model.
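A minimal sketch of that pattern, assuming an OpenAI-compatible chat endpoint like the one vLLM exposes. The URL and model id below are placeholders, and the network call is stubbed out so the snippet is self-contained; in a real deployment `send` would POST the JSON payload to the server, which batches all the in-flight requests internally:

```python
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder URL
MODEL = "gemma"  # placeholder model id

def build_request(prompt: str) -> dict:
    """Build one OpenAI-style chat-completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def send(payload: dict) -> str:
    # Stubbed so the sketch runs offline; in practice this would be an
    # HTTP POST of the JSON payload to ENDPOINT.
    return f"response to: {payload['messages'][0]['content']}"

prompts = [f"agent task {i}" for i in range(8)]
# Every agent hits the SAME server instance; the serving engine
# (not the client) handles batching and scheduling.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(send, map(build_request, prompts)))
print(len(results))
```

The point being: concurrency lives in the serving engine, so one loaded copy of the weights serves every client.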
Excellent_Koala769@reddit (OP)
Okay, so let's say you fit Gemma 4 into the RTX 6000 Pro and you have 70 GB of vram left. What is that extra vram used for? KV Cache?
SpaceTraveler2084@reddit
Not trying to be a dick, but go do some research and learn the basics. It looks like you have no clue at all, and for someone looking to invest in a 10k GPU, it's the least you can do.
AlwaysLateToThaParty@reddit
Using a better model.
Excellent_Koala769@reddit (OP)
For my use case, Gemma 4 31b does very well.
My agents use pretty long contexts (15-40k tokens each). Would the extra VRAM on the RTX 6000 mostly go toward holding more KV cache, so I can run more of those long-context agents in parallel?
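For rough intuition: per-token KV-cache size is 2 (one K and one V tensor) × layers × KV heads × head dim × bytes per element. The config numbers below are illustrative placeholders for a ~30B dense model, not actual Gemma figures:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # 2 = one K and one V tensor per layer; fp16 = 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative config (NOT real Gemma numbers): 48 layers, 8 KV heads,
# head dim 128, fp16 cache.
per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128)
for ctx in (15_000, 40_000):
    gib = per_token * ctx / 1024**3
    print(f"{ctx:>6}-token context -> {gib:.1f} GiB of KV cache per agent")
```

With these made-up numbers, each long-context agent would pin down a few GiB of cache on top of the weights, which is why the extra VRAM matters for parallelism. (Quantized KV cache shrinks this, grouped-query attention already helps.)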
I was originally looking at the 5090 because it’s a lot cheaper and already seems to fit the 31B model comfortably. Just trying to understand the real difference for my workload.
AlwaysLateToThaParty@reddit
I don't think you're the listening type. I'm done.
cointegration@reddit
Just use vLLM, it eats all available VRAM and handles concurrency and KV cache for you. It's not as fast for single-user mode compared to llama.cpp, but it can't be beat for concurrency.
ShelZuuz@reddit
The model fits into RAM, but you're only going to be able to fit the KV cache for maybe 2 context windows in that memory, and then perf will tank to the single digits.
On the RTX 6000 PRO you can run that model in vLLM and easily get dozens of concurrent users.
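Back-of-the-envelope math on that comparison. All the inputs here are assumptions for illustration (a ~20 GiB weight footprint for a 4-bit ~30B model, ~3 GiB of KV cache per long-context agent, ~4 GiB runtime overhead), not measured figures:

```python
def max_concurrent_agents(vram_gib: float, weights_gib: float,
                          kv_per_agent_gib: float,
                          overhead_gib: float = 4) -> int:
    """How many agents' KV caches fit after weights and runtime overhead."""
    free = vram_gib - weights_gib - overhead_gib
    return max(0, int(free // kv_per_agent_gib))

for name, vram in [("RTX 5090 (32 GiB)", 32), ("RTX 6000 Pro (96 GiB)", 96)]:
    n = max_concurrent_agents(vram, weights_gib=20, kv_per_agent_gib=3)
    print(f"{name}: roughly {n} long-context agents")
```

Under those assumptions the 32 GiB card holds only a couple of long-context agents while the 96 GiB card holds a couple dozen, which matches the "maybe 2 context windows" vs "dozens of concurrent users" experience above. Renting both and measuring, as suggested, is still the right move.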
Rent one of each on vast.ai and run your perf tests to see based on your desired context window size and load etc.
You should be able to figure this out within two hours so it will cost you less than $5 between the two GPUs to figure out which suits your needs better. Just have Claude or something stand by to install vLLM and the model so you don't waste time. And make sure you stand ready with a huggingface auth token. Personally I'd just give it to Claude to go do the install and tests, then revoke it again after.
And specifically run vLLM, don't try this in Ollama or Llama.cpp. Multi-user is a different beast.
cakemates@reddit
gemma 4 fits in a 5090 with 1 user... you need some extra space for each extra user.
fuchelio@reddit
5090 32GB is not enough for serious local coding agent. Running Qwen3.6-35B-A3B Q6_K_XL with 256k context, mmproj, unlimited thinking budget takes ~42GB VRAM (29GB weights + 3GB KV cache + 8GB compute). Agent workflow burns context like crazy (tool calls, file reads, workflow docs), so 64k is basically zero. Q6_K_XL vs Q5_K_XL is ~44% less KLD for only 4GB more (Qwen3.6-35B-A3B bench), worth it. 48GB minimum, 64GB+ comfortable.
segmond@reddit
22 t/s at 2 means 44 tok/s total. You will not get the same per-request speed with 40 requests that you get with 1; it drops off, doesn't matter if it's Mac or Nvidia. But you do get more aggregate tokens per second when you batch multiple requests. Plan accordingly.
bad_detectiv3@reddit
bro, what is your line of work and business?
I work in finance, and our compliance is literally working with a Microsoft team to get us a compliant GitHub Copilot.
GamerHaste@reddit
What small business needs 40-80 “agents” running in parallel?
throwawayacc201711@reddit
Seriously what the heck are these agents even doing
GamerHaste@reddit
I have no clue; I really can’t think of a great use case apart from just pushing spam. And OP, for something like running 40-80 agents, I’d just recommend renting GPUs through a cloud provider, or literally hosting models for inference through something like AWS Bedrock. I can’t imagine a small company needing to optimize inference hosting to the point of buying your own hardware. It would be way more cost effective to just host instances on a cloud provider.
panthereal@reddit
If you can find an MSRP 5090, it's a smarter investment because you can resell it for a similar cost if you decide it doesn't work out for you, assuming that's allowed with your compliance. Buying anything Mac takes around 20% off the resale almost immediately.
However, I would consider jumping straight to a pro Nvidia GPU if you really have a need to scale and have a budget supporting that. Scaling with 5090s was so challenging that tinygrad stopped trying and only offers pro GPUs now.
Excellent_Koala769@reddit (OP)
Scaling 5090s is challenging? What do you mean by this?
I imagine starting with one and then growing to a small cluster with distributed inference. Is that harder than it sounds?
cakemates@reddit
You need a server platform to scale with multiple GPUs, mate; the PCIe lanes to run multi-GPU and big RAM are an important part of the equation.
panthereal@reddit
5090s are split between a dozen or so models with completely different form factors and prices, but most of them have massive coolers.
Going from one 5090 to two on a regular ATX mobo is difficult if you have a 4-slot model, though if you're starting with a server-grade mobo it wouldn't be as bad. Finding the FE model, which is 2-slot, limits you to fewer vendors overall.
With the pros you're always getting a 2-slot GPU, so adding more is easy.
OneSlash137@reddit
I’d personally go with the Mac; I feel like it’s more scalable.
Now, with that out of the way, why are you running that many agents at once?
Excellent_Koala769@reddit (OP)
The thing is, I don't need a large pool of unified memory. I just need to be able to run Gemma 4 31B, and that fits into the 32 GB of VRAM in one 5090. Not sure if getting a Mac Studio with 512 or 256 GB of unified memory would give me the same throughput/bandwidth for fast inference as multiple 5090s.
Another thing to consider is the price for the studios. M3 ultra has soared in price recently due to the RAM shortage. I can't even find a 512 GB unit on the market. I think in order to get a 256 GB unit I would have to pay around 10k. For a 5090 it is a third of that price at a little over 3k new.
I'm curious on your full take on why a Mac would be more scalable for the sole use case of running Gemma 4 31b for agentic inference.
To answer your question, I run a business that uses autonomous agents. I don't want to get into the details of the business for privacy reasons, but basically customers send in requests and each requests spins up an agent that might work anywhere from a few days to a few months.
No_Algae1753@reddit
Well, you also gotta take care of the KV cache. With 32 GB of VRAM and a 31B model, you'd be forced to use a quant of Gemma. Tbh I don't think either of these is an option.
Excellent_Koala769@reddit (OP)
Damn, you're right! I didn't think about the KV cache. I assumed it would fit fine in the 5090. Would you possit that the KV cache needs more vram?
No_Algae1753@reddit
Dunno what possit means, but the KV cache itself grows over time, at least that was my experience. When you take into consideration that you have 40-80 concurrent tasks, then yes, you definitely need more VRAM.
No_Algae1753@reddit
I don't think a Mac will be able to handle 40-80 concurrent tasks at reasonable speeds.
No_Algae1753@reddit
I'd say a Mac won't be able to handle 40–80 concurrent tasks effectively. If concurrency is your main priority, I would definitely recommend getting a high-end workstation GPU or even two of them instead of the 5090 (keep in mind, the 5090 is designed primarily for gaming). There is also a way to connect multiple Macs together, though I'm not sure how well that scales.
Excellent_Koala769@reddit (OP)
What do you consider a "high-end workstation GPU"? And why would that be better than a 5090 for this use case?
No_Algae1753@reddit
High-end would be the RTX 6000 Pro, but there are cheaper alternatives from Intel and AMD as well.
Roy3838@reddit
Connecting Macs together through a tool like exolabs is very inconsistent.
When it works it feels like magic, but if you sneeze everything breaks.
No_Algae1753@reddit
Yeah, and I've heard that the performance also doesn't scale up that well, but hey, it's still an option.
CharlesCowan@reddit
2x RTX 4500s?