What hardware to buy if I want to run a 70 B model locally?

[-]

unjustifiably_angry@reddit

There are no competitive 50B-70B models anymore, that class is obsolete. The new small models (<35B) are better than any 70B model ever was, and the medium sized models (80B-120B) have gotten exceptionally good as well as very fast.

Apple:

I can't advise on Apple hardware, I'm not educated on that ecosystem. You'll want something with 128GB or more ideally... 80GB absolute minimum to get access to today's best "attainable-sized" local models. If you can go higher, go higher, but be aware that they do get slower and slower unless you also get faster and faster hardware.

Nvidia:

DGX Spark or other GB10 derivatives: Gives you 128GB and if you're tech-literate you can group two together to roughly double performance and get them into much more usable territory. I would not advise buying one individual DGX Spark, they're basically designed to be networked. 2 is the sweet spot. The non-Nvidia models are much cheaper and there's no downside. There is no reason to get one with 4GB of storage, 1GB is plenty to hold a selection of the best models. THESE ARE LINUX-BASED. DO NOT BUY IF YOU'RE UNCOMFORTABLE WITH LINUX. DGX Spark was extremely disappointing out of the gate due to a ton of software compatibility issues but these are gradually being ironed out. Running the same model, two months ago I getting 13 tokens per second and now I'm getting over 40, and there's still a ton of optimization work to be done. It's a frustrating situation and you could very reasonably say Nvidia did a lot of false advertising.
RTX 6000 Pro Blackwell: Gives you 96GB, which is exactly the current sweet spot. These are expensive (and moreso every day) but won't be truly obsolete for probably 5+ years because they're so far ahead of the curve. Probably high resale value in the future. There's a server variant (a bit cheaper, more efficient, louder) and a workstation variant (faster but double the power draw, still quieter, coil whine from hell).
RTX 5090 32GB or RTX 4090 24GB: I won't go into too much depth on these but in my opinion, for AI use they're a trap. They appear to be a thrifty solution, a way to bypass the Nvidia tax on pro-tier cards, but the headaches involved in getting multiple connected to a single system and the performance overhead... I've been through it. Not recommended. Also, if you plan on ever doing anything non-LLM with your card (image generation, video generation, asset generation, etc), you'll be limited to the VRAM capacity of your largest card, not the combined VRAM pool.
Older/used workstation cards: these are available in 40GB and 80GB capacity I think, and some of them can be joined together with NVLink. A solid choice if you can get a deal.

AMD:

Radeon 7900 XTX 24GB: AMD's best offering for AI and a couple months ago you could still buy them reasonably cheap (worth investigating still) but it's in the same category as 4090, IMO. Lots of compatibility issues but these are getting better over time because they're a cheap way into local AI. I actually bought a bunch of these and ended up returning them. They're fine but AMD is already abandoning support for them. Even today they have poor resale value and by the time RDNA5 comes out they'll be paperweights.
Strix Halo 128GB: A reasonable deal if you can get one at a good price, the cheapest way to get 128GB of sorta-fast memory (same speed as DGX Spark) right now. Depending on the type you get and how brave you are they can be networked similar to DGX Spark with similar performance gains. However unlike DGX Spark it's unlikely they're going to get dramatically faster over time. AMD is basically cutting ties with anything older than RDNA4 and AMD hardware isn't where most developers are focused.
RX 9070 XT 16G: Anything with less than 24GB isn't worth considering for local AI purposes. Perfectly fine for gaming but the wrong thing to get for AI.
Radeon Pro cards: Top out at 24GB and cost way more than 7900 XTX. Nah.

[-]

HopePupal@reddit

you missed one relevant AMD option: the Radeon AI R9700 Pro is a 32 GB RDNA 4 card. got mine literally an hour ago for $1350. i think you'd be right to classify them as being essentially in the same category as the 4090, 5090, and 7900XTX (and any RTX PRO up to the 4500, and maybe the B70 someday) in that they have (relatively) small amounts of VRAM and can only be linked thru PCIe.

[-]

unjustifiably_angry@reddit

I looked up that card and on AMD's site I only saw 24GB models. My mistake.

32GB is at least in the realm of sanity, you could run Q3.5-27B and have something useful, if possible a little slow. I use my RTX 6000 Pro for work so speed is critical and I can get 95-140 tokens/s with 122b.

[-]

HopePupal@reddit

yeah the RTX PRO 6000 is nuts! costs $10k for a reason.

The R9700 has a 256-bit memory path and consequently doesn't have as much memory bandwidth (640 GB/s) as the 384-bit 7900 XTX (960 GB/s), but the extra VRAM lets you have some actual context, and it's RDNA4 (gfx1201 family in ROCm). if anything actually used RDNA4 FP8/BF8 support, that'd be so cool, but alas. no 4-bit for MXFP4 either, that's for Instinct chips.

speedwise, it runs Q3.5-27B @ Q6_K with up to about 132k context (fp16 KV cache). it's not "we have Claude at home" but it's well over the useful threshold for me.

test	depth	t/s
pp2048	0	617.81
tg32	0	16.48
pp2048	16000	795.51
tg32	16000	15.99
pp2048	32000	722.20
tg32	32000	15.50
pp2048	48000	687.94
tg32	48000	15.15
pp2048	64000	650.67
tg32	64000	14.72
pp2048	80000	615.03
tg32	80000	14.35
pp2048	96000	587.24
tg32	96000	14.07
pp2048	112000	559.72
tg32	112000	13.69

[-]

unjustifiably_angry@reddit

A recent update to llama (nightly) somewhat improved the quality of Q8_O kv-cache, might be worth trying out. Should be enough to get you to 256K.

[-]

HopePupal@reddit

nice, i'll give it a try this weekend

[-]

hobel_@reddit

AMD AI PRO R9700 has 32GB

[-]

MrFahrenheit_451@reddit

And if you can put two in the box, it’s 64GB. And still under the cost of a single 5090

[-]

unjustifiably_angry@reddit

What's the performance like though? Don't mean to shit on you but it's what it comes down to if you're using these for work. At the pace I burn through tokens my RTX 6000 Pro will pay for itself in less than a month so upfront cost is almost irrelevant.

[-]

MrFahrenheit_451@reddit

I could run a test if you want. Tell me what you want me to use and how you want it tested and I’ll give you the tokens /sec rate of them.

[-]

pirateadventurespice@reddit

I was going to say that the spark derivatives make a lot of sense here. Even at one, it could probably handle what OP wants and two will give them room to grow. I have one right now and a second supposed to arrive shortly, no complaints, but I wasn’t and early adopter.

[-]

angry_baberly@reddit (OP)

you have a single of the asus units? I’m interested in this. What are you running on it right now?

[-]

pirateadventurespice@reddit

I just got it a few weeks ago as I’d seen a lot of the earlier kinks seem to be being ironed out and nvidia also seems to be committing to the nvfp4 approach.

So, I’m still fiddling with models, but I’ve liked both deepseek-r1-distill-llama-70b and mistral large 2. I’m sure there are newer ones and I’ll probably play with Gemma this weekend.

It still takes some fiddling (you have to quant into nvfp4 with nvidia model optimizer or find a version that’s already in it, etc. ), but it does more than everything I want it to. I’m only getting the second as it’s a work box and I have some “use it or lose it” funds.

[-]

unjustifiably_angry@reddit

Look into running things in FP8 if you can fit them, you'll find they perform extremely well and there's no compromise on quality.

[-]

pirateadventurespice@reddit

I haven’t heard that term before, but am familiar with the Ozaki scheme (and also its disappointments). My professional life is somewhere incredibly closely partnered with nvidia, so I get a lot of support there.

I’ll check out what you’re suggesting. Are there particular models you like on the dual set up?

[-]

Gringe8@reddit

Thats not really true if you dont like thinking on. Even 24b finetunes are still better than qwen 27b.

[-]

unjustifiably_angry@reddit

I agree for a chatbot but for my use-case on my hardware typically only takes a few seconds and noticeably improves the results.

[-]

Gringe8@reddit

Ive been testing gemma4 31b with thinking on. The replies are similar to having thinking off, but feels like it has a bit better understanding of context and my meanings when i dont explicitly say things.

When i tried thinking previously with other models it was difficult control response length, but with gemma4 the thinking is consistently between 300-500 tokens so its not too bad.

[-]

Specialist_Sun_7819@reddit

128gb mac studio is honestly the move if you can stomach the wait. the unified memory means 70b q4_k_m runs smooth with room to spare for context. i know people running qwen 72b and llama 70b on them no issues

if you cant wait tho, check the used market for m2 ultra mac studios, they pop up way more often than the m4. or if youre open to the server route, used p40s are dirt cheap and you can stack vram that way but yeah its a whole project lol

[-]

Dejotaa@reddit

Where do you check for used m2 ultra?

[-]

spky-dev@reddit

You’ll get a whopping 10 tok/s gen and atrocious prompt processing on a Mac Studio. Dense models are not its forte.

[-]

_derpiii_@reddit

I don’t have much experience with local models, but reading your other comments, I don’t think you’re going to be happy with local setup.

Correct me if I’m wrong but you’re looking for a thought partner to manage ADHD train of thought brainstorming. More open ended thinking and research than deterministic outcomes.

IMHO that is solidly in the domain of frontier models. And your $2500 budget would be better spent on on-going cloud subscriptions.

People run local models for raw coding performance, reliability, and data compliance/privacy.

[-]

angry_baberly@reddit (OP)

I’ve had some serious issues across models that I don’t really wanna get into the specifics on that is driving this. I’m also willing to up that budget it was just my initial thought about what something like that could cost, and a very quickly figured out it would definitely be more than that-I probably didn’t even need to share but I guess it does kind of indicate I’m not quite ready for like a 15 or $20,000 unit.

[-]

_derpiii_@reddit

issues like… censorship?

[-]

angry_baberly@reddit (OP)

Worse. But I don’t wanna spell it out. Thank you

[-]

_derpiii_@reddit

Well, I guess that's enough value out of me then, I'll see myself out

[-]

pineapplekiwipen@reddit

96gb M3 ultra base is way better than 128gb M4 max for 70b dense not just because of the increased bandwidth but because of the improved cooling design. wait time is also considerably better (back when i got one there was practically no wait time, unsure now)

but large dense models have been falling out of favor for a while now. currently the biggest popular dense model people run is qwen 3.5 27b, which honestly is pretty comparable to older 70b models in terms of performance

[-]

_derpiii_@reddit

what was improved in the cooling? I’m surprised it never came up in my comparison prompts 😱

[-]

angry_baberly@reddit (OP)

OK thank you. I do a lot of reasoning and thinking out loud and I have found that using AI to do that helps. When I’ve played around with the DuckDuckGo options the 7B free tier mistral/llama models available on there cannot stay with me in the session they drop context get confused, everything etc.

So I got on somewhere else and asked what level I would need to interact with for it to you know stay on track and help me build like outlines for papers and developing products stuff – I’m pretty non-linear so following my multiple simultaneous trains of thoughts takes more effort. I find that the cloud based consumer whatever ChatGPt worked well for this last year back when it was GPT – 40, but ever since they updated back in August, I have not been able to do the same thing and every update actually seems to make it worse. I’m trying to replace that experience and even make it better.

If I wanna run a model locally and do the best one that I possibly can at home for this type of usage, what are your suggestions?

[-]

HopePupal@reddit

what's your use case?

whatever chatbot you're asking for advice is outdated. 70B open weight models are not really a thing any more. modern useful dense models are smaller (Qwen 3.5 27B being the one everyone's hot for right now) and useful MoEs are getting bigger and bigger (think 100B+).

[-]

angry_baberly@reddit (OP)

OK thank you. I do a lot of reasoning and thinking out loud and I have found that using AI to do that helps. When I’ve played around with the DuckDuckGo options the 7B free tier mistral/llama models available on there cannot stay with me in the session they drop context get confused, everything etc.

So I got on somewhere else and asked what level I would need to interact with for it to you know stay on track and help me build like outlines for papers and developing products stuff – I’m pretty non-linear so following my multiple simultaneous trains of thoughts takes more effort. I find that the cloud based consumer whatever ChatGPt worked well for this last year back when it was GPT – 40, but ever since they updated back in August, I have not been able to do the same thing and every update actually seems to make it worse. I’m trying to replace that experience and even make it better.

If I wanna run a model locally and do the best one that I possibly can at home for this type of usage, what are your suggestions?

[-]

HopePupal@reddit

so mostly chatbot stuff, not coding? it's not going to be as fast or smart as ChatGPT at that budget. period. but not needing super high context does broaden your options a bit.

128 GB Mac Studio is not a terrible option but a new one is $1k more than your budget. whoever said "look for a used M2 Ultra" is like half right, but don't look on eBay because they're still selling for $3500 or more. ~~ideally you want stolen…~~ just kidding. ~~but only because of activation locks and MDM.~~

the Strix Halo aka Ryzen 395 mini PCs are slower but might come in under budget: Bosgame has the only 128 GB one that's under $2500. (i can't speak to Bosgame quality because mine's a GMKtec.) https://www.bosgame.com/products/bosgame-m5-ai-mini-desktop-ryzen-ai-max-395-96gb-128gb-2tb

model-wise, Qwen3.5 122B-A10B is an option, GPT-OSS 120B is an option, maybe GLM 4.5 Air. again, none of these are going to be as fast or as smart as ChatGPT, so my advice is to spend like $10 testing these models on OpenRouter so you know what you're getting into, and then understand that once you are running them locally, they will be slower.

if that's not a hard budget, you have some more options, but most of them are going to be a lot more work.

[-]

angry_baberly@reddit (OP)

I mean that was my initial budget but I understand now I’ll have to spend more….

[-]

HopePupal@reddit

figure out your actual budget before doing anything else. $3k is a lot different than $12k. either one will buy you a lot of OpenRouter credit (or ChatGPT or Claude or Minimax or GLM or…) right now, but half of the people in here are betting on the cloud vednors getting much more expensive, and the other half are too rich to care and just want privacy and reliability.

[-]

angry_baberly@reddit (OP)

I mean the MacBook Pro I was looking at is like 5500 after tax. 12 K is probably out of my range right now, but I could start with something else and then get that later maybe

I’m not a particularly tech forward person so everything will be a learning curve for me, but I do learn quick. I’m looking for insight from people who are already knowledgeable about this. I just know that I liked the way I was interacting with ChatGPT before they changed a bunch of stuff, and I have some issues with what I’ve learned about their research process that makes me want a completely local model as well

[-]

HopePupal@reddit

a laptop is rarely the correct answer for AI stuff because they're not designed for sustained continuous workloads and will "thermal throttle" by slowing down when they get too hot. if you're not experienced building your own PC, it might make sense to wait for the M5 Mac Studios, because competitive PC builds in that price range are going to be possible but some work.

[-]

_derpiii_@reddit

You bring up a good point about thermals. I was planning on getting an M5 128 GB MacBook Pro (I travel a lot), but after your comment, reconsidering.

[-]

angry_baberly@reddit (OP)

I was looking at some of the things labeled “AI workstation” like the ASUS GX 10, I understand those can be connected but maybe one would be plenty to start with? Price point it seems comparable to a Mac Studio, maybe a smidge less.

There would be a learning curve with linux but I wonder if once it was set up if it wouldn’t be too complicated? I would be looking to Screen share or somehow use it on an iPad so I can pace while I’m talking to it.

I tried asking ChatGPT and it’s all over the place. Reddit has been much more helpful so far. Thanks again

[-]

dkeiz@reddit

ryzen ai 128gb + usb4 40gbps port for 4090/5090 or whatever you can buy.
mac good if its 5 generation, otherways prefill speed is problematic.

[-]

drunnells@reddit

I have two used Nvidia Tesla P40s from eBay and run them with 70B gguf models with llama.cpp an reasonable inference speeds for conversation. They have gone up in price, but I think the pair would be about $600 at today's prices.

[-]

angry_baberly@reddit (OP)

OK I’m interested. I have not heard of these

[-]

journalofassociation@reddit

I run an 80B MoE (Qwen 3 Next) on 36 GM VRAM (3080 to + 3090) at a lower quant.

That model actually performs quite well at a 3-bit quant, and I get 48k of context.

For a 70B dense model you'd need more VRAM than that most likely.

[-]

angry_baberly@reddit (OP)

So maybe I’m looking at it the wrong way?

I said this earlier to a different comment but I’m gonna paste it here anyway maybe it will help – it describes how I’ve been using it and what I wanna do:

I do a lot of reasoning and thinking out loud (transcribed) and I have found that using AI to do that helps. When I’ve played around with the DuckDuckGo options the 7B free tier mistral/llama models available on there cannot stay with me in the session they drop context get confused, everything etc. So I got on somewhere else and asked what level I would need to interact with for it to you know stay on track and help me build like outlines for papers and developing products stuff – I’m pretty non-linear so following my multiple simultaneous trains of thoughts takes more effort. I find that the cloud based consumer whatever ChatGPt worked well for this last year back when it was GPT – 40, but ever since they updated back in August, I have not been able to do the same thing and every update actually seems to make it worse. I’m trying to replace that experience and even make it better.

If I wanna run a model locally and do the best one that I possibly can at home for this type of usage, what are your suggestions?

[-]

journalofassociation@reddit

It's hard to say without knowing exactly how much context you're using. But I used the 80B model I mentioned above to summarize my 45-minute therapy appointment and it did fantastic job and only filled about 20k of context.

The larger-parameter models are good because they're better at figuring out what's going on even when words are mistranscribed or sentences are incomplete (newer ChatGPT models are fantastic at it but I keep my medical info local)

I did have some good summarizing results with Qwen 3.5 27B as well, with reasoning on. I can get over 100k context with my setup. With a decent quant you could run it on a single 3090 which go for maybe $1500 used or lower if you find a deal.

If your transcripts are highly accurate, you could possibly use some of the lower parameter models, maybe Gemma 3 12B or Qwen 3.5 9B. with a higher context.

[-]

angry_baberly@reddit (OP)

Like multiturn reasoning, building context in session type stuff. So it’s not like one prompt –> one response interaction.

[-]

journalofassociation@reddit

Some frequent advice I see people give here is just rent a cloud GPU and try those models out to see if that config works for your use case... Then buy the equivalent if it does.

[-]

ArtfulGenie69@reddit

Dual 3090's lets you run 70b in 4bit with either llamacpp or exllamav3.

[-]

synn89@reddit

Dual 3090 setup can run that pretty well for chat and that build used to be the gold standard for 70B's back in the day. A Mac M1 Ultra 128GB system runs that at a higher quant for a lot less power just a tad slower. But it has way more flexibility in regards to running larger MOE's.

Framework style AMD desktops will be too slow for this. The M1 Ultra is the slowest I'd want to go with memory bandwidth for a dense model of this size. And while the M1 Ultra can run larger dense models, it's a tad too slow of an experience.

With a dual 3090 setup, at around 4bit quant you'll probably be limited to 32k context or so. Honestly, I haven't ran an EXL quant in awhile(my 3090 builds are offline and I just prefer the Mac), so I don't know what the state of the art is in quantizing the context. But 32k on a 4bit quant was comfortable on my dual 3090 setups and quite usable.

You might consider renting a dual 3090 setup and testing with that first. Vast.ai probably has a lot of them for cheap and you can test out first hand what the experience will be like. Maybe 2 isn't enough and you want 3 or 4 for the context. Maybe 3090 isn't fast enough and you'll want 4x or 5x generation cards. Renting will let you experiment.

[-]

ZK_Zinode@reddit

I just recently purchased 2x 3090 FE and planning on testing the capability limit of this setup. Have you tried any models using TurboQuant? I’m interested in seeing what the bandwidth difference has been from your experience

[-]

spky-dev@reddit

For $2500? Nothing that runs at a decent speed. 70b dense are very demanding.

[-]

angry_baberly@reddit (OP)

I did mention that I realized this already… which is why I’m asking for guidance here

[-]

ImportancePitiful795@reddit

Do you have PC already? If yes, 2xB70 and use vLLM. They will set you back around $2000.

[-]

Ok_Mammoth589@reddit

Absolutely not, this is the worse of everything

[-]

SnooSprouts3872@reddit

What is a use case for 70b?

[-]

angry_baberly@reddit (OP)

I do a lot of reasoning and thinking out loud and I have found that using AI to do that helps. When I’ve played around with the DuckDuckGo options the 7B free tier mistral/llama models available on there cannot stay with me in the session they drop context get confused, everything etc.

So I got on somewhere else and asked what level I would need to interact with for it to you know stay on track and help me build like outlines for papers and developing products stuff – I’m pretty non-linear so following my multiple simultaneous trains of thoughts takes more effort. I find that the cloud based consumer whatever ChatGPt worked well for this last year back when it was GPT – 40, but ever since they updated back in August, I have not been able to do the same thing and every update actually seems to make it worse. I’m trying to replace that experience and even make it better.

If I wanna run a model locally and do the best one that I possibly can at home for this type of usage, what are your suggestions?

[-]

Savantskie1@reddit

Smarter model for coding or chatting

[-]

SnooSprouts3872@reddit

For coding? I doubt that

[-]

qwen_next_gguf_when@reddit

2x3090

[-]

GestureArtist@reddit

RTX Pro 6000 workstation. It will be significantly faster than the Mac.

[-]

mindwip@reddit

Only a 64gb or 96gb gpu

I would not run a 70b on my strix halo. It could but too slow.

A 70b dense needs better then 1000gb memory bandwidth

[-]

pineapplekiwipen@reddit

96gb M3 ultra base is way better than 128gb M4 max for 70b dense not just because of the increased bandwidth but because of the improved cooling design. wait time is also considerably better (back when i got one there was practically no wait time, unsure now)

[-]

Dry-Paper-2262@reddit

I use dual 3090s for 70B using EXL3 4.25-4.5 bpw quants with context windows of 64k~ with fast inference speeds. Depends on what kind of context windows you're looking for but this setup hosted using tabbyAPI has been so good for me that I've started quantizing models myself if there isn't a 4.25 bpw of a 70b merge I'm interested I'm trying.

[-]

Primary-Wear-2460@reddit

Depends what kind of speed are you looking for?

[-]

def_not_jose@reddit

Appliance box (strix halo?) speed would be pretty much unusable for 70b dense, wouldn't it?

[-]

HopePupal@reddit

can confirm that running LLaMA 3.3 70B on a Strix is too slow to be useful. think TG @ 2 tok/s. also, i don't know if it's the attention architecture or the training data, but load up even a small document and prepare to be amazed at how many mistakes it makes when asked basic questions about the contents. not remotely worth it.

[-]

Primary-Wear-2460@reddit

Depends what you are doing. Drafting documents or something it might be fine.

[-]

Mindless_Selection34@reddit

Spent 1800€: 18gb vram, 64gb RAM. I think i can do it

[-]

Equivalent_Bit_461@reddit

Dense model? Hahahhahahah no way Moe, maybe, before the ram price spikes, you could get yourself 128gb under 300 bucks Right now it's difficult In short, you should've bought it yesterday My advice would be to wait a couple more months ram prices might drop a bit

[-]

mayo551@reddit

3x3090 for 72GB VRAM.

It's outside of your budget.

You can try using 2x3090, but you'll be using small quants without any decent context.

[-]

a_beautiful_rhind@reddit

Takes about 48gb of ram for a decent dense 70b quant. However you want to do that.