Best Hardware Setup to Run DeepSeek-V3 670B Locally on $40K–$80K?
Posted by PrevelantInsanity@reddit | LocalLLaMA | 65 comments
We’re looking to build a local compute cluster to run DeepSeek-V3 670B (or similar top-tier open-weight LLMs) for inference only, supporting ~100 simultaneous chatbot users with large context windows (ideally up to 128K tokens).
Our preferred direction is an Apple Silicon cluster — likely Mac minis or studios with M-series chips — but we’re open to alternative architectures (e.g. GPU servers) if they offer significantly better performance or scalability.
Looking for advice on:
• Is it feasible to run 670B locally in that budget?
• What’s the largest model realistically deployable with decent latency at 100-user scale?
• Can Apple Silicon handle this effectively — and if so, which exact machines should we buy within $40K–$80K?
• How would a setup like this handle long-context windows (e.g. 128K) in practice?
• Are there alternative model/infra combos we should be considering?
Would love to hear from anyone who’s attempted something like this or has strong opinions on maximizing local LLM performance per dollar. Specifics about things to investigate, recommendations on what to run it on, or where to look for a quote are greatly appreciated!
GradatimRecovery@reddit
AS only makes sense if your budget is $10k. You can afford 8x RTX Pro 6000 Blackwells; you get a lot more performance per dollar with those than with a cluster of AS.
DepthHour1669@reddit
On the flip side, Apple Silicon isn't the best value at $5-7k either. Just the $10k tier.
However, at the $5k-7k tier, there's a better option: 12-channel DDR5-6400 is 614GB/sec. The $10k Mac Studio 512gb has 819GB/sec memory bandwidth.
https://www.amazon.com/NEMIX-RAM-12X64GB-PC5-51200-Registered/dp/B0F7J2WZ8J
You can buy 768GB (12x64GB) of DDR5-6400 on Amazon for $4,585.
Buy a case, an AMD EPYC 9005 CPU, and a 12-slot server motherboard that supports that much RAM, and you're at about $6,500 total... which gives you 50% more RAM than the Mac Studio 512GB at 75% of the memory bandwidth.
With 768GB of RAM, you can run DeepSeek R1 without quantizing.
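Quick sanity check on those bandwidth numbers, a minimal sketch of the napkin math (theoretical peak per 64-bit DDR5 channel, not sustained throughput):

```python
# Peak DDR bandwidth = channels x transfer rate (MT/s) x 8 bytes per 64-bit channel.
def ddr_bandwidth_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000

epyc = ddr_bandwidth_gbs(channels=12, mt_per_s=6400)
print(f"12-channel DDR5-6400: {epyc:.0f} GB/s")      # ~614 GB/s theoretical peak
print(f"vs. Mac Studio 819 GB/s: {epyc / 819:.0%}")  # ~75%
```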
MKU64@reddit
Do you know how many TFLOPS the EPYC 9005 would give? Memory bandwidth is one thing, of course, but time to first token also matters if you want the server to start responding as fast as possible.
DepthHour1669@reddit
Depends on which 9005 series CPU. Obviously the cheapest one will be slower than the most expensive one.
I think this is a moot point though. I think the 3090 is 285 TFLOPS and the cheapest 9005 is 10 TFLOPS. Just buy a $600 3090, throw it in the machine, and you can process 128k tokens in 28 seconds; 32 seconds if you factor in the 3090's PCIe bus bandwidth.
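For what it's worth, the napkin math behind that kind of estimate looks roughly like this; it's a best-case, compute-bound sketch that assumes the quoted 285 TFLOPS is fully usable for prompt processing and ignores offload overhead:

```python
# Best-case prompt processing time if the GPU's quoted throughput were the only limit.
ACTIVE_PARAMS = 37e9      # DeepSeek's active parameters per token (MoE)
GPU_FLOPS     = 285e12    # the 3090 figure quoted above (tensor-core peak)
PROMPT_TOKENS = 128_000

tokens_per_s = GPU_FLOPS / (2 * ACTIVE_PARAMS)     # ~2 FLOPs per active parameter per token
print(f"~{PROMPT_TOKENS / tokens_per_s:.0f} s")    # ~33 s in this idealized case
```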
Aphid_red@reddit
28 seconds? You've got to tell me how you'd do that.
The best public software reports on a CPU platform, even with GPU support, seem to be about 50 tps with ik_llama.cpp. Just napkin math, that's over 2,500 seconds to get through 128k, roughly two orders of magnitude slower.
DepthHour1669@reddit
He’s talking about TTFT which is prompt processing speed
Aphid_red@reddit
Yes, which is also what I'm talking about. 50 tps prompt processing. Maybe 7-8 tps inference.
You certainly can't get 50 tps DeepSeek inference on CPU. The model is effectively 37B active, and your memory is effectively maybe 400 GB/s at best, which limits you to 10-11 tps @ q8 or 20 tps @ q4.
Real speeds will be substantially lower due to overhead and imperfections. And that's likely the fastest of all the quants, since smaller quants mean more work converting weights back into working units (fp16).
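That bandwidth ceiling works out roughly as follows, a sketch that ignores KV-cache reads and other overhead:

```python
# Decode-speed ceiling when each token must stream all active weights from memory.
def max_decode_tps(bandwidth_gbs: float, active_params_billion: float, bytes_per_param: float) -> float:
    return bandwidth_gbs / (active_params_billion * bytes_per_param)

print(f"q8: {max_decode_tps(400, 37, 1.0):.1f} tps")   # ~10.8 tps
print(f"q4: {max_decode_tps(400, 37, 0.5):.1f} tps")   # ~21.6 tps
```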
No_Afternoon_4260@reddit
So you found a mobo and a CPU for $2k? You've got to explain that to me 🫣
Far-Item-1202@reddit
Motherboard $730 (look for revision 2.00 or newer): https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-series-processors/p/N82E16813183820
CPU $660: https://www.newegg.com/amd-epyc-9115-socket-sp5/p/N82E16819113865
SP5 CPU cooler ~$100
No_Afternoon_4260@reddit
This CPU has only 2 CCDs, so you'll never saturate the theoretical RAM bandwidth you're aiming for. Anecdotally, a 9175F had poor results even though it has 16 CCDs and a higher clock. You need cores, clocks, and CCDs on the AMD platform, and the CCD count seems to matter most for Turin.
You need to understand that server CPUs have NUMA memory domains shared between cores and memory controllers. All to say: to really use a lot of RAM slots you need enough memory controllers attached to cores. Cores communicate with each other through a fabric, and that introduces a lot of challenges.
The sweet spot for our community seems to be something with at least 8 CCDs, to hope for 80% (Genoa) to 90% of the theoretical max RAM bandwidth. Then take into account that our inference engines aren't really optimized for the challenges this introduces.
Give it some headroom with, IMHO, at least a fast 32 cores; that's where I'd draw the sweet spot for that platform. But IMO a Threadripper Pro is a good alternative if a 9375F is too expensive.
Aphid_red@reddit
I came across another solution at this price point: Get an actual big boy server; specifically the NVidia GH200 (144GB variant).
I'm still looking into what part of its CPU memory bandwidth is available to the GPU (how fast is the interconnect?), but if that were 100% speed, you have:
480GB LPDDR Memory. (500GB/s)
144GB HBM (4000 GB/s)
Chip: H100.
TDP: 1KW
Watercooled (hotplug) versions available (hook up an aquarium pump and a big radiator and it's relatively quiet too).
Looking at the spec sheet: https://resources.nvidia.com/en-us-data-center-overview-mc/en-us-data-center-overview/grace-hopper-superchip-datasheet-partner this is indeed the case!
I predict that this chip will do in the neighbourhood of 500-1000 tps prompt processing and 80 tps generation on DeepSeek-V3, because NVLink beats the pants off PCI Express (wish they'd let it trickle down to normal server motherboards with upgradeable RAM...).
Price: ~$45K from Gigabyte or Supermicro via a server vendor, with warranty and everything. This is actually looking like a decent deal compared to stacking multiple 6000 Pros. It's a similar price to a 4x RTX 6000 Pro machine, but it's more power efficient and will actually perform better for models in the 384-624GB range.
k_means_clusterfuck@reddit
Get them blackwells
MachineZer0@reddit
The worst hardware setup is a Gen 9 server with 600GB of RAM and six Volta-based GPUs for $2k. I was getting about 1 tok/s at q4.
ortegaalfredo@reddit
100 users or 100 concurrent users? That's a big difference. 100 users means 2 or 3 concurrent users at most, and that's something llama.cpp can do.
For 100 concurrent users you need a computer the size of a car.
PrevelantInsanity@reddit (OP)
Reached that conclusion. I was working with specifications provided to me. Adapting to that, I'd welcome thoughts on how to manage that number of concurrent users with decent output quality and context window size, whether through quantization or changes to the context window.
ortegaalfredo@reddit
To serve DeepSeek to 100 concurrent users (which implies a total user base of around 10,000), we're talking DGX-level hardware, from $200k and up.
Equivalent-Bet-8771@reddit
The new Nvidia cards can do NVFP4 which should help reduce model size without much quantization loss.
AlgorithmicMuse@reddit
For the amount of money you're talking about, you should be talking to Nvidia about a dedicated system, not messing around with garage-shop solutions.
eloquentemu@reddit
To be up front, I haven't really done much with this, especially when it comes to managing multiple long contexts, so maybe there's something I'm overlooking.
Without knowing the quantization level and expected performance it's hard to say. For low enough expectations, yes. Let's say you want to run the FP8 model at 10t/s per user so 1000t/s (though you probably want more like 2000t/s peak to get each user ~10t/s on the mid-size context lengths). That might not be possible.
Note that while 1000t/s might look crazy you can batch inference, meaning process multiple tokens at once, for each user. Because inference is mostly memory bound, if you have extra compute you can access the weights once and use them for multiple calculations. Running Qwen3-30B as an example:
My 4090 'only' gets ~170t/s when processing one context, but gets 1335t/s processing 64 contexts simultaneously. That's only 20t/s per user, and it's dramatically slower than the 170t/s because this is an MoE like DeepSeek: for a single context only ~3B parameters are active, but across 64 contexts nearly all 30B get used. For reference, Qwen3-32B also gets about 10t/s @ batch=64 but only 40t/s @ batch=1.
I think the only real option would be the Mac Studio 512GB. It runs the Q4 model at ~22t/s, but that's the peak for one context (i.e. not batched). A bit of Googling didn't turn up performance tests for batched execution of 671B on the Mac Studio, but they seem pretty compute bound, and coupled with the MoE scaling problems mentioned above I suspect they'll cap out around 60t/s at maybe batch=4. So if you buy 8 for $80k you'll come up pretty short, unless you're okay running at Q4 and ~5 t/s per user.
If someone has batched benchmarks though I'd love to see.
Due to DeepSeek's MLA, 128k context is actually pretty cheap relative to its size: 128k needs 16.6GB, so times 100 users that's a lot of VRAM. But 128k of context is also a lot; it's a full novel, if not more. You should consider how important that is and/or how simultaneous your users actually are.
It's hard to say without understanding your application, but the 70B range of dense models might be worth a look, or Qwen3. Definitely watch the context size for those, though - I think Llama3.3-70B needs 41GB for 128k!
Qwen3-32B might be a decent option. If you quantize the context cache to q8 and limit the length to 32k, you only need 4.3GB per user, which would let you serve ~74 users with the model at q8 at very roughly 20t/s from 4x RTX Pro 6000 Blackwell, for a total cost of ~$50k. Maybe that's ok?
Just to guess about DeepSeek... If you get 8x Pro 6000 and run the model at q4, that leaves 484GB for context, so 30 users at 128k. Speed? Hard to even speculate... The theoretical max (based on bandwidth vs. the size of the weights, supposing all were active) would be 35t/s, though, so >10t/s seems reasonable at moderate context sizes. Of course, 8x Pro 6000 is just a touch under $80k already, so you likely won't be able to build a decent system around them without going over budget.
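The VRAM budgeting in those last two paragraphs is roughly this, a sketch using the figures quoted above; the weight footprints and the overhead allowance (activations, buffers, fragmentation) are assumed round numbers:

```python
# Rough "how many concurrent users fit" estimate from leftover VRAM and KV cache per user.
def users_supported(total_vram_gb: float, weights_gb: float, kv_per_user_gb: float,
                    overhead_gb: float = 10) -> int:
    return int((total_vram_gb - weights_gb - overhead_gb) // kv_per_user_gb)

# Qwen3-32B @ q8 on 4x RTX Pro 6000 (384GB), 32k q8 context cache (~4.3GB/user); ~35GB weights assumed.
print(users_supported(384, weights_gb=35, kv_per_user_gb=4.3))    # ~78, ballpark of the ~74 above

# DeepSeek @ q4 on 8x Pro 6000 (768GB), 128k MLA context (~16.6GB/user);
# ~284GB of weights is what the "leaves 484GB" figure above implies.
print(users_supported(768, weights_gb=284, kv_per_user_gb=16.6))  # ~28, close to the ~30 above
```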
P.S. This got long enough, but you could also look into speculative decoding. It's good for a moderate speed boost but I wouldn't count on it being more than a nice-to-have. Like it might go from 10->14 but not 10->20 t/s.
No_Afternoon_4260@reddit
Oh, the expert thing: because it's an MoE, batching also "increases the number of active experts", so the batching benefit is lessened. Interesting, thanks.
Mabuse00@reddit
Just an idea you reminded me of - I've been using DeepSeek R1 0528 on the Nvidia NIM API. (Which, if you don't know it, has a ton of AI models free up to 40 requests per minute.) The way they pull it off is with a request queue that limits how many generations run at the same time. I think Nvidia's queue is 30, and I rarely wait even a couple of seconds in line, and that's with them serving it for free. I don't know what context length they serve and I assume it's capped fairly low, but their in-website chat uses a 4K max token return, and the only thing longer than a CVS receipt is a DeepSeek think block.
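That kind of admission queue is easy to bolt onto any local OpenAI-compatible server too; here's a minimal sketch with an asyncio semaphore (the URL and model name are placeholders for whatever backend you run, not NIM's actual setup):

```python
import asyncio
import httpx

MAX_IN_FLIGHT = 30                       # cap concurrent generations, like the queue described above
gate = asyncio.Semaphore(MAX_IN_FLIGHT)

async def generate(client: httpx.AsyncClient, prompt: str) -> str:
    async with gate:                     # extra requests wait here instead of swamping the backend
        r = await client.post(
            "http://localhost:8000/v1/chat/completions",   # placeholder local endpoint
            json={"model": "local-model",
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=300,
        )
        return r.json()["choices"][0]["message"]["content"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        replies = await asyncio.gather(*(generate(client, f"question {i}") for i in range(100)))
        print(len(replies), "responses")

if __name__ == "__main__":
    asyncio.run(main())
```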
snowowlshopscotch@reddit
I am not an expert on this at all! Just genuinely interested and confused about why Jolla Mind2 never gets mentioned here?
As I understand it, it is exactly what people here are looking for, or am I missing something?
MelodicRecognition7@reddit
you will not get any usable performance with a single-board computer. What people here are looking for begins with 1kW power draw.
Willing_Landscape_61@reddit
" 128K tokens" "Apple Silicon handle this effectively?" No.
Ok_Warning2146@reddit
An 8x H20 box. Each H20 sells for around $10k in China.
Alpine_Privacy@reddit
Mac mini, noooo - did you watch a YouTube video? I think you'll need 6x A100s to even run it at Q4 quant; try to get them used. $10k x 6 = $60k in GPUs, with the rest going to CPU, RAM, and everything else. You should also look up Kimi K2: 500GB of RAM plus even one A100 will do for it.
PrevelantInsanity@reddit (OP)
Perhaps I’ve misunderstood what I’ve been looking at, but I’ve seen people running these large models on clusters of Apple silicon devices given their MoE nature requiring less raw compute and more VRAM (unified memory!) for just storing the massive amounts of parameters in any fashion that won’t slow things to a halt or near it.
If I’m mistaken I admit that. Will look more.
photodesignch@reddit
More or less... keep in mind the Mac uses shared memory. If it's 128GB, you need to reserve at least 8GB for the OS.
On the other hand, a PC maps memory directly: you need 128GB of main memory to load the LLM on the CPU side first, then another 128GB of VRAM allocated on the GPU so it can mirror over.
Mac is obviously simpler, but dedicated gpu on pc should perform better.
Mabuse00@reddit
Think he also needs to keep in mind that DeepSeek R1 0528 in full precision / HF Transformers is roughly 750GB. Even the most aggressive quants aren't likely to fit in 128GB of RAM/VRAM.
PrevelantInsanity@reddit (OP)
We were looking at a cluster of Mac minis/studios if that was the route we took, not just one. I admit a lack of insight here, but I am trying to consider what I can find info on. For context, I’m an undergraduate researcher trying to figure this out who has hit a bit of a wall.
LA_rent_Aficionado@reddit
A cluster of Mac Minis will be so much slower than, say, buying 8x RTX 6000s, not to mention that clusters add a whole other layer of complication. It's a matter of money, comparatively: sure, you'll have more VRAM, but it won't compare to a dedicated GPU setup, even with partial CPU offload.
Mabuse00@reddit
But the money is no small matter. To run DeepSeek, you need 8x RTX 6000 Pro 96GB at $10k each.
LA_rent_Aficionado@reddit
I’ve seen them in the 8k range, for 8 units he could maybe get a bulk discount and maybe an educational discount. It’s a far better option if they ever want to pivot to other workflows as well be it image gen or training. But yes, even if you get it for $70k that’s still absurd lol
Mabuse00@reddit
By the way, I don't want to forget to mention, there are apparently already manufacturer's samples of the M4 Ultra being sent out here and there for review and they're looking like a decent speed boost over the M3 Ultra.
Mabuse00@reddit
No worries. Getting creative with LLM's and the hardware I load it on is like... about all I ever want to do with my free time. So far one of my best wins has been running Qwen 3 235B on my 4090-based PC.
The important thing to know is that these Apple M chips have amazing neural cores, but you need to use CoreML, which is its own learning curve, though there are tools to convert TensorFlow or PyTorch models to CoreML.
https://github.com/apple/coremltools
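To give a feel for the conversion flow, a toy coremltools sketch (a stand-in model, not an LLM; converting a real transformer takes a lot more work):

```python
import torch
import coremltools as ct

# Toy PyTorch module standing in for a real model: trace it, then convert to Core ML.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
example = torch.randn(1, 512)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    compute_units=ct.ComputeUnit.ALL,   # let Core ML schedule across CPU, GPU, and Neural Engine
)
mlmodel.save("toy.mlpackage")
```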
rorowhat@reddit
As a general rule avoid Apple
Alpine_Privacy@reddit
Hey, I totally get you. I saw that same video and was misled too! It's super hard for organisations to deploy LLMs securely and privately; been there, done that 😅 Best of luck on your build!
Alpine_Privacy@reddit
Your best bet would be to rent a cluster, deploy your LLM (exposed via, say, Open WebUI or LibreChat), do a small pilot, and then finalise your compute. RunPod is a great place to run this experiment. We use this approach and it works well for us.
createthiscom@reddit
I run Kimi K2 at 20 tok/s on a single Blackwell 6000 Pro with a dual EPYC 9355 rig and 768GB of 5600 MT/s RAM. I mean, maybe that's “abysmal” for multi-user purposes, but it's plenty for single-user agentic purposes.
Verticaltranspire@reddit
You can use smaller models to answer smaller questions. Route short questions to the same model at once in the same prompt, just separated so the model knows which is which, then route the answers back to each user in their own pieces. Lots of ways to achieve this.
AdventurousSwim1312@reddit
With 100 users you will need more compute; the low-compute, large-VRAM, medium-bandwidth approach holds true for a single user / low-activation-count model, but will quickly break down otherwise.
Given your budget, I'd suggest looking either at a rack of 6x RTX Pro 6000 Blackwell (576GB of VRAM will let you host models up to ~1000B parameters) or a server with two or three AMD MI350X (around 576GB), which will be even faster (slightly inferior compute, but much faster VRAM), though the software might be a bit messier to get working.
Mabuse00@reddit
Yeah, that's the trade-off you really have to take into consideration: speed vs. size. We're talking about clusters here, so the numbers at FP32: an M3 Ultra Mac Studio gets 28.4 TFLOPS and 128GB of unified RAM for $4,000, while an RTX Pro 6000 96GB gets 118.11 TFLOPS and 96GB of VRAM for $10k. So if you took the RTX money and bought Mac Studios with it, you'd be getting 320GB of RAM and 85.2 TFLOPS of FP32 compute. Sure, it's a bit less, but the extra RAM is a big deal when you're getting into the realm of 750GB models like DeepSeek R1 0528.
To get enough RAM to hold a model that size you can buy 6 M3 Ultra Studios for $24K or 8 RTX Pro 6000s for $80K. And once you're at that point, the Macs still add up to 170 TFLOPS at FP32, double that in FP16. For a hundred users who won't all be sending a completion request at the same moment anyway, that's more than plenty.
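Putting those per-unit figures side by side (just the numbers quoted above, list prices and FP32 peaks):

```python
# Cluster totals from the per-unit figures above (list price, unified RAM/VRAM, FP32 TFLOPS).
options = {
    "6x M3 Ultra Mac Studio": dict(units=6, price=4_000,  mem_gb=128, fp32_tflops=28.4),
    "8x RTX Pro 6000":        dict(units=8, price=10_000, mem_gb=96,  fp32_tflops=118.11),
}
for name, o in options.items():
    n = o["units"]
    print(f'{name}: ${n * o["price"]:,} | {n * o["mem_gb"]}GB | {n * o["fp32_tflops"]:.0f} TFLOPS FP32')
# 6x M3 Ultra Mac Studio: $24,000 | 768GB | 170 TFLOPS FP32
# 8x RTX Pro 6000: $80,000 | 768GB | 945 TFLOPS FP32
```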
SteveRD1@reddit
There is some good discussion on running high quants of DeepSeek over on the Level1Techs forums (there are even people building quants there).
You could ask over there, seriously doubt anyone would recommend Apple!
Mabuse00@reddit
I would absolutely recommend an Apple M3 Ultra over any other consumer-grade hardware. That thing has 32 CPU cores, 80 graphics cores, 32 neural cores, and 128GB of unified RAM at 800GB/s.
Nothing3561@reddit
Yeah, but why limit yourself to consumer-grade hardware? The RTX Pro 6000 96GB card has a little less memory, but 1.79 TB/s.
Mabuse00@reddit
Hardware-wise, totally agree with you. That card is a beast. Software compatibility is just still catching up with Blackwell. I tried out a B200 last week and installed the current release of Pytorch and it was just like "nope."
Conscious_Cut_6144@reddit
Apple Silicon can’t really handle 100 users like an Nvidia system can. Great memory size and bandwidth, but lacking in compute.
PrevelantInsanity@reddit (OP)
Good idea. I’ll go check those forums out. Thanks!
spookperson@reddit
There are two main issues with building inference servers on Apple in my experience. One is the prompt processing speed - it will be much much lower than you want for large context (even if you're using Apple-optimized MLX). The other is concurrency/throughput software - so far I think even if you compile vllm on Apple you just get CPU-only support.
So in my mind if you spend $10k on a Mac Studio to run very large models like Deepseek, at the moment it is only so-so at production workloads for a single person at a time (so-so because of slow prompt processing, single person because the Apple-compatible inference server software isn't great at throughput in continuous batching). So you could think of that level of budget supporting 4-8 people using the cluster at once but still dealing with very slow prompt processing.
On the other hand, with that $40k-$80k budget you can get Intel/AMD server hardware that supports a bunch of PCIe lanes and a bunch of RTX 6000 Pro Blackwells. Four of the 96GB cards would be enough to load 4-bit DeepSeek and have room for context. You'll need more cards to support higher-bit quants and more simultaneous users (and the associated size of the aggregate KV cache). Just be aware of your power/cooling requirements.
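For a rough idea of what serving on a 4x RTX 6000 Pro box could look like, a vLLM sketch; the model path, context length, and memory setting are illustrative assumptions, not a tested recipe for DeepSeek specifically:

```python
from vllm import LLM, SamplingParams

# Hypothetical 4-GPU setup for a pre-quantized ~4-bit checkpoint stored locally.
llm = LLM(
    model="/models/deepseek-v3-4bit",   # placeholder path to whatever quant you actually use
    tensor_parallel_size=4,             # shard the weights across the four cards
    max_model_len=32_768,               # shorter than 128k to leave KV-cache room for more users
    gpu_memory_utilization=0.92,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
out = llm.generate(["Summarize the tradeoffs of 4-bit quantization."], params)
print(out[0].outputs[0].text)
```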
PrevelantInsanity@reddit (OP)
The RTX 6000 Pro Blackwell route seems interesting to me. I don't mind dropping to a 4-bit quant; I don't think that will harm output in a way that matters to me.
Context does concern me a bit, as in my research it seems to get really big really fast. We'd only be at like 384GB of VRAM from four Blackwells, which seems tight at 4-bit quant? Not sure though.
Conscious_Cut_6144@reddit
I just responded to the original post, but I would say 8 pro 6000’s would be ideal. 6 may be doable.
Source: I have 8 of them on back order.
Conscious_Cut_6144@reddit
At $80k a Pro 6000 system is doable if you are willing to deal with “some assembly required”.
8 would be ideal (TP=8). 6 would run the FP4 version with TP=2, PP=3.
PrevelantInsanity@reddit (OP)
The specification given to me was, more or less, “$40-80k to spend, largest model we can run with a peak of 100 concurrent users.” While researching this myself at the same time as posting around, I've found that that number of concurrent users increases the spec requirements hugely. I'm not sure how best to handle that: quantization of the same model, a different model, shrunken context windows, or what.
MosaicCantab@reddit
Without knowing what the users will be doing, it's kind of hard to give guidance. But I frankly don't believe you'll be able to serve a model the size of DeepSeek to 100 concurrent users on $80k.
Even with quantization, you'll need far more compute if the users are doing any reasoning.
HiddenoO@reddit
This. People still underestimate how different individual users' behavior can be from one another. Asking short questions with straight answers from knowledge is 1/100th to 1/10000th the compute per interaction compared to e.g. filling the context and generating code with reasoning.
claythearc@reddit
Apple Silicon kinda works, but the tok/s is very low, and you actually need substantially more of them than you'd think due to overhead from sharing memory with the system. You also hard-lock yourself out of some models that require things like FlashAttention 2 (that one specifically may have support now, I haven't checked, but it's one example of a couple of big ones).
These models are MoE, so it's better than it would be with dense models, but that's offset to a large degree by the thinking tokens they output.
The best way you can host it is still A100s which is probably like 60k in GPUs, realistically closer to 120-140k total for the system because you want a usable quant like q8, and need to hold the KV cache in memory.
Realistically these models are just nonexistent for local hosting - the cost-benefit just isn't there for anything beyond like the Qwen 200B / Behemoth scale imo.
GPTshop_ai@reddit
1.) GH200 624GB for 39k
2.) DGX Station (GB300) 784GB for approx. 80k
GPTrack_ai@reddit
Available as server or desktop, BTW.
Fgfg1rox@reddit
Why not wait for the new Intel Pro GPUs and their Project Matrix? That complete system should only cost $10-15k and can run the full LLM, but if you can't wait then I think you are on the right track.
PrevelantInsanity@reddit (OP)
Time constraint on the funding. Good to know that’s on the horizon though. Thanks!
spookperson@reddit
Time constraints on funding make me wonder if you have education/nonprofit grants. If so, you may want to look at vendors with education/nonprofit discounts. I've heard people talk about getting workstations/GPUs from ExxactCorp with a discount on the components or the build.
PrevelantInsanity@reddit (OP)
Bingo. ExxactCorp is a good tip. Thanks.
Aaaaaaaaaeeeee@reddit
This report was pretty interesting, using llama.cpp on 4x H100; sharing it to keep your expectations low. Maybe you need to get a bargain, idk.
Maybe you can also get 8x A100 and then run a throughput-oriented engine at 4 bits.
If you get 10 users on one 512GB machine and it's fast enough for them, then great, but it's less likely to be usable for other research projects.
Calm_List3479@reddit
You could look into Nvidia Digits/Spark. Maybe three of them could run this? The throughput wouldn't be great.
A $300k 8xH200 running this FP8 w/ 128k tokens will only support 4 or so concurrent users.
GradatimRecovery@reddit
AS only makes sense if your budget is $10k. Beyond that, performance per $ is poor