What's the cheapest setup for running full Deepseek R1
Posted by Wooden_Yam1924@reddit | LocalLLaMA | View on Reddit | 96 comments
Looking how DeepSeek is performing I'm thinking of setting it up locally.
What's the cheapest way for setting it up locally so it will have reasonable performance?(10-15t/s?)
I was thinking about 2x Epyc with DDR4 3200, because prices seem reasonable right now for 1TB of RAM - but I'm not sure about the performance.
What do you think?
Southern_Sun_2106@reddit
Cheapest? I am not sure this is it, but, I am running Q4_K_M with 32K context on LM Studio, on the M3 Ultra ($9K USD), at 10 - 12 t/s. Not my hardware.
Off topic, but I want to note here that it's ironic that the Chinese model is helping sell American hardware (I am tempted to get an M3 Ultra now). DS is such a lovely model, and in light of the recent closedAI court orders, plus unexplained 'quality' fluctuations of Claude, open routers, and the like, having a consistently performing high-quality local model is very, very nice.
Spanky2k@reddit
I so wish they'd managed to make an M4 Ultra instead of the M3. Apple developed themselves into a corner because they likely didn't see this kind of AI usage coming when they were developing the M4, so they dropped the UltraFusion interconnect. I'm still tempted to get one for our business, but I think the performance is just a tad too slow for the kind of stuff we'd want to use it for.
Have you played around with Qwen3-235B at all? I've been wondering if using the 30B model for speculative decoding with the 235B model might work. The speed of the 30B model on my M1 Ultra is perfect (50-60 tok/sec) but it's just not as good as the 32B model in terms of output, and that one feels a little too slow (15-20 tok/sec). But I can't use speculative decoding on M1 to eke anything more out. Although I have a feeling speculative decoding might not work on the really dense models anyway, as no one seems to talk about it.
Southern_Sun_2106@reddit
I downloaded and played around with the 235B model. It actually has 22B active parameters when it outputs, so it is as fast as a 22B model, and as far as I understand, it won't benefit much from using a 30B speculative decoding model (22B < 30B?). I downloaded the 8-bit MLX version in LM Studio, and it runs at 20 t/s with a max context of 40K. 4-bit would probably be faster and take less memory. It is a good model. Not as good as R1 Q4_K_M by far, but still pretty good.
The 235B 8bit mlx is using 233.53GB of unified memory with the 40K context.
I am going to play with it some more, but so far so good :-)
Spanky2k@reddit
20 t/s is pretty decent. The 30b is a 30b-a3b so it has 3b active parameters, hence why it might still give a speed up with speculative decoding. Something you might like to try too are the DWQ versions, e.g. Qwen3-235B-A22B-4bit-DWQ as the DWQ 4bit versions reportedly have the perplexity of 6bit.
As an aside, 30B absolutely screams on my M3 Max MacBook Pro compared to my M1 Ultra - 85 tok/s vs 55 tok/s. My guess is the smaller the active model, the less memory bandwidth becomes the bottleneck. Whereas 32B runs a little slower on my M3 Max (although can be brought up to roughly the same speed as the M1 Ultra if I use speculative decoding, which isn't an option on M1).
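Back-of-envelope, decode speed on these machines is roughly memory bandwidth divided by the bytes of active weights streamed per token, which is why active parameter count matters so much more than total size. A rough sketch (the bandwidth and bits-per-weight numbers below are assumptions, and these are ceilings, not predictions):

```python
# Decode t/s ceiling ~= memory bandwidth / bytes of active weights read per token.
def tps_ceiling(bandwidth_gbs: float, active_params_b: float, bits_per_weight: float) -> float:
    gb_per_token = active_params_b * bits_per_weight / 8  # GB streamed per generated token
    return bandwidth_gbs / gb_per_token

# Assumed: ~800 GB/s for an M1 Ultra, ~4.5 bits/weight for 4-bit-class quants.
print(tps_ceiling(800, 3, 4.5))    # Qwen3-30B-A3B (~3B active)    -> ~470 t/s ceiling
print(tps_ceiling(800, 22, 4.5))   # Qwen3-235B-A22B (~22B active) -> ~65 t/s ceiling
print(tps_ceiling(800, 32, 4.5))   # dense 32B                     -> ~44 t/s ceiling
```

Real numbers land well below those ceilings (especially at tiny active sizes, where compute and overhead dominate), but the ratios line up with the 50-85 t/s vs 15-20 t/s observations above.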
Southern_Sun_2106@reddit
Hi there! Just a quick update - the 30B-A3B doesn't come up as a drafting model option in LM Studio. But there is a specific 0.5B R1 drafting model that's available. I am able to select that one in the LM Studio options dropdown for R1. However, there's another 'but' - R1 crashes with the drafting model option on. It could be an LM Studio-specific issue that needs to be addressed. I googled around, no solution so far.
Southern_Sun_2106@reddit
You are correct! I forgot that it has 3b active parameters. I am going to give it a try.
After some more testing of the 235B model, I have to say, I like it more and more. It is super-fast (compared to DeepSeek) and perfect for my little assistant (integrated with the project management app, notes, internet). I project it to the M3 Max MacBook Pro. It is completely uncensored, so no lecturing, lol. I think it would be perfect for serving AI in an office environment from an M3 Ultra (but maybe not tell people that the model is uncensored lol).
Again, thanks for pointing out the 3b aspect!
Southern_Sun_2106@reddit
I literally got it home two days ago, and between work and family, haven't had a chance to play with it much. Barely managed to get R1 0528 q4_k_m to run (for some reason Ollama would not do it, so I had to use LM Studio).
I am tempted to try Qwen3 235B and will most likely do so soon - will keep you posted. Downloading these humongous models is a pain.
I have a MacBook M3 with 128GB of unified memory, and Gemma 3, QwQ 32B, and Mistral Small 3.1 are my go-to models for the notebook/memory-enabled personal assistant, RAG, and document processing/writing applications. I agree with you - the M3 Ultra is not fast enough to run those big models (like R1) for serious work. It works great for drafting blog articles/essays/general document creation, but RAG/multi-turn conversation is too slow to be practical. However, overall, R1 has been brilliant so far. To have such a strong model running locally is such a sweet flex :-)
Going back to the M3 with 128GB and those models that I listed - considering the portability and the speed, I think that laptop is Apple's best offering for local AI at the moment, whether intentional or not. Based on some news (from several months ago) about the state of Siri and AI in general at Apple, my expectations for them are pretty low at the moment, unfortunately.
QuantumSavant@reddit
Cheapest way would probably be to get a Mac Studio Ultra with 512GB of unified RAM and run it at 4-bit. Getting 1TB of server RAM will cost you at least $5k. Add everything else and you're close to what the Mac costs, and nowhere near its performance.
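For sizing, the weight footprint is easy to estimate: total parameters times bits per weight. A minimal sketch, assuming ~671B total parameters and typical bits-per-weight for the common quants:

```python
# Approximate in-memory size of the quantized weights alone.
def quant_size_gb(total_params_b: float, bits_per_weight: float) -> float:
    return total_params_b * bits_per_weight / 8

print(quant_size_gb(671, 4.8))   # ~403 GB for a Q4_K_M-class quant -> fits in 512GB unified memory with room for context
print(quant_size_gb(671, 8.5))   # ~713 GB for a Q8-class quant     -> needs a 768GB-1TB RAM box
```

That is why the 512GB Mac and the 768GB-1TB EPYC builds keep coming up in this thread.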
sascharobi@reddit
4-bit 🙅‍♀️
EducatorThin6006@reddit
4-bit QAT. If some enthusiast shows up and applies a QAT (quantization-aware training) technique, it will be much closer to the original.
woahdudee2a@reddit
it can only be done by deepseek, during the training process
FailingUpAllDay@reddit
"Cheapest setup"
"1TB of RAM"
My brother, you just casually suggested more RAM than my entire neighborhood has combined. This is like asking for the most fuel-efficient private jet.
But I respect the hustle. Next you'll be asking if you can run it on your smart fridge if you just add a few more DIMMs.
Bakoro@reddit
It's a fair question. The cap on compute is sky high, you could spend anywhere from $15k to $3+ million.
A software developer on the right side of the income distribution might be able to afford a $15k or even a $30k computer; very few can afford a $3 million computer.
Wooden_Yam1924@reddit (OP)
I've currently got a Threadripper Pro 7955WX (yeah, I read about the CCD bandwidth problem after I bought it and was surprised by the low performance), 512GB DDR5, and 2x A6000, but I use it for training and development purposes only. Running DeepSeek R1 Q4 gets me around 4 t/s (LM Studio out of the box with partial offloading, I didn't try any optimizations). I'm thinking about getting some reasonably priced machine that could do around 15 t/s, because reasoning produces a lot of tokens.
DifficultyFit1895@reddit
My Mac Studio (M3U, 512GB RAM) is currently getting 19 tokens/sec with the latest R1 Q4 (at small context). This is a basic out-of-the-box setup in LM Studio running the MLX version, no interesting customizations.
humanoid64@reddit
Very cool, is it able to max out the context size? How does it perform when the context starts filling up?
DifficultyFit1895@reddit
I haven’t tried really big context but have seen a few tests around, seems like the biggest hit is on prompt processing speed (time to first token). I just now asked it to summarize the first half of Animal Farm (about 16,000 tokens). The response speed dropped to 5.5 tokens/sec and time to first token was 590 seconds.
tassa-yoniso-manasi@reddit
Don't you have more PCIe available?
The ASUS Pro WS WRX90E-SAGE has 6 PCIe 5.0 x16 + 1 x8.
Astrophilorama@reddit
So if it's any help as a comparison, I've got a 7965WX and 512GB of 5600MHz DDR5. Using a 4090 with that, I get about 10 t/s on R1 Q4 on Linux with ik_llama.cpp. I'm sure there are ways I could sneak that a bit higher, but if that's the lower bound of what you're looking for, it may be slightly more in reach than it seems with your hardware.
I'd certainly recommend playing with optimising things first before spending big, just to see where that takes you.
SteveRD1@reddit
Whats the RAM/VRAM split when you run that?
Papabear3339@reddit
Full R1, not the distill, is massive. 1TB of RAM is still going to be a stretch. Plus CPU-only will be dirt slow, barely running.
Unless you have big money for a whole rack of CUDA cards, stick with smaller models.
Willing_Landscape_61@reddit
1TB of ECC DDR4 at 3200 cost me $1600.
thrownawaymane@reddit
When did you buy?
Willing_Landscape_61@reddit
Mar 24, 2025
FullstackSensei@reddit
DDR4 ECC RDIMM/LRDIMM memory is a lot cheaper than you'd think. I got a dual 48-core Epyc system with 512GB of 2933 memory for under $1k. About $1.2k if you factor in the coolers, PSU, fans and case. 1TB would have taken things to ~$1.6k (64GB DIMMs are a bit more expensive).
If OP is willing to use 2666 memory, 64GB LRDIMMs can be had for ~$0.55-0.60/GB. The performance difference isn't that big, but the price difference is substantial.
Wooden-Potential2226@reddit
This
un_passant@reddit
You got a good price! Which CPU models and mobo are these?
FullstackSensei@reddit
H11DSI with two 7642s. And yes, got a very good price by hunting deals and not clicking on the first buy it now item on ebay.
TheRealMasonMac@reddit
On a related note, private jets are surprisingly affordable! They can be cheaper than houses. The problem, of course, is maintenance, storage, and fuel.
FailingUpAllDay@reddit
Excuse me my good sir, I believe you dropped your monocle and top hat.
TheRealMasonMac@reddit
Private jets can cost under $1 million.
DorphinPack@reddit
Right and the point here is that some of us consider five figures and up a pipe dream for something we only use a few times a year.
Less than a million vs more than million doesn’t change the math.
It’s just a difference of perspectives, I def don’t mean anything nasty by it.
TheRealMasonMac@reddit
Yeah, it's obviously not something anyone could afford unless you were proper rich. Just saying that the private jet itself is affordable -- by standards in the States I guess.
DorphinPack@reddit
Friend, the point is that when you insist upon YOUR definition of "proper rich" you are not only revealing a lot about yourself but also asserting that someone else's definition is less correct.
Nobody who isn't trying to help is going to call you out on this. I'm sure I could put it more tactfully, but it's advice I had to get the hard way myself and I haven't found a better way yet.
I do get your point though -- comfortable personal finances is less than "my money makes money" is less than "I own whole towns" is less than etc.
TheRealMasonMac@reddit
My guy, you are taking away something that I did not say.
DorphinPack@reddit
Oh I’m not claiming to know what you meant.
I can tell what you meant isn’t what it seems like you mean. And I thought you should know.
Cheers
redragtop99@reddit
I bought my jet to save money. I only need to fly 3-4 times a day and it pays for itself after 20 years, Assuming zero maintenance of course.
TheRealMasonMac@reddit
If you don't do maintenance, why not just live in the private jet? Stonks.
westsunset@reddit
Tbf he's asking for a low cost set up with specific requirements, not something cheap for random people.
Faux_Grey@reddit
Everything is awarded to the lowest bidder.
The cheapest way of getting a human to the moon is to spend billions, it can't be done with what you have in your back pocket.
"What's the cheapest way of doing XYZ"
Answer: Doing XYZ at the lowest cost possible.
Ok_Warning2146@reddit
1 x Intel Xeon 8461V 48C/96T 1.5GHz 90MB LGA4677 Sapphire Rapids $100
8 x Samsung DDR5-4800 RDIMM 96GB $4728
This is the cheapest setup with AMX instruction and 768GB RAM.
Wooden-Potential2226@reddit
You’re forgetting the ~$1k mobo…
LGA3647 mobos are much cheaper and the 61xx/63xx CPUs also have AVX-512 instructions, albeit fewer cores.
Ok_Warning2146@reddit
You can also get $200 mobo if you trust aliexpress. ;)
tameka777@reddit
The heck is mobo?
thrownawaymane@reddit
Motherboard
GoldCompetition7722@reddit
I got a server with ~1.5TB of RAM and one A100 80GB. Tried running the 671B today with Ollama. No results were produced in 10 minutes and I aborted.
createthiscom@reddit
I have a dual 9355 EPYC with 768GB RAM. It didn’t reach usable performance with V3-0324 until I added a 3090 and ran ktransformers. That said, ktransformers is buggy and can be a little frustrating. The most usable long context I get out of my setup is 40-50k. It crashes a lot.
The ktransformers team has recently moved away from their old ktransformers backend and is only actively working on their balance_serve backend, which isn’t as good for agentic purposes IMO because the prefill cache isn’t instantaneous. I can run R1, but I usually don’t because it’s too slow and the original didn’t follow instructions well. I plan to try R1-0528 this weekend though.
1TB of RAM would buy you the ability to enable NUMA at compile time and perhaps a little extra performance. Personally, I’d rather add a 6000 Pro instead of a 3090 though. The ktransformers team is also focusing a lot on Intel AMX instruction set optimizations lately, so that’s something to consider. Take a look at their example rig throughputs if you’re interested. I would personally buy as much machine as you can afford. One thing about a dual-socket machine is you get more CCDs, which theoretically raises the memory bandwidth, but GPUs are way, way faster than CPU, so you really want at least a 3090 for usable speeds.
Lissanro@reddit
Ktransformers never worked well for me, so I run ik_llama.cpp instead, it is just as fast or even slightly faster than Ktransformers.
You are right about using GPUs, having context cache and common tensors fully on GPUs makes huge difference.
createthiscom@reddit
If it never worked for you, how do you know it's just as fast?
fasti-au@reddit
Locally you're buying like 8-16 3090s, maybe more if you want context etc., so you're better off renting GPUs online and tunneling to them.
Lissanro@reddit
With just four 3090 GPUs I can fit 100K context cache at Q8 along with all common expert tensors and even 4 full layers, with Q4_K_M quant of DeepSeek 671B. For most people, getting a pair of 3090 probably will be enough, if looking for low budget solution.
Renting GPUs is surprisingly expensive, especially if you run them a lot, not to mention the privacy concerns, so for me it is not an option to consider. API is cheaper, but it has privacy concerns as well and limits what settings you can use; sampler options are usually very limited too. But it could be OK, I guess, if you only need it occasionally or just want to try before considering buying your own rig.
valdev@reddit
Cheapest...
There is a way to run full DeepSeek off pure swap backed by an NVMe drive on essentially any CPU.
It might be 1 token per hour. But it will run.
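For an MoE the order of magnitude is actually less grim than that, at least on paper. A hedged sketch, assuming ~37B active parameters per token, a ~4.5-bit quant, and ~5 GB/s sequential reads from a single NVMe drive (random access patterns and swap overhead will make reality much worse):

```python
# Worst case: every generated token pages its active expert weights in from NVMe.
active_params_b = 37        # DeepSeek R1 active parameters per token (approx.)
bits_per_weight = 4.5       # Q4-class quant (assumption)
nvme_gbs = 5.0              # sequential read speed (assumption)

gb_per_token = active_params_b * bits_per_weight / 8
print(gb_per_token)              # ~21 GB read per token
print(nvme_gbs / gb_per_token)   # ~0.24 t/s ceiling, i.e. seconds per token rather than tokens per hour
```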
GTHell@reddit
OpenRouter. I topped up $80 over the last 3 months and still have $69 left. Don’t waste money on hardware. It’s a poor & stupid decision.
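Whether that math wins depends entirely on volume. A very rough break-even sketch, where every number is an assumption (check current OpenRouter pricing and hardware prices yourself):

```python
# Hypothetical break-even between paying per token and buying hardware.
hardware_cost_usd = 10_000        # e.g. a 512GB Mac Studio class machine (assumption)
api_usd_per_m_tokens = 2.50       # blended input+output price per 1M tokens (assumption)

breakeven_tokens = hardware_cost_usd / api_usd_per_m_tokens * 1_000_000
print(f"{breakeven_tokens:,.0f}")  # 4,000,000,000 tokens before the hardware pays for itself (ignoring power)
```

At $80 over three months, the API side of that comparison is hard to beat unless privacy or sheer volume forces your hand.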
abc142857@reddit
I can run DeepSeek R1 0528 UD-Q5_K_XL at 9-11 t/s (depending on context size) on a dual EPYC 7532 with 16x64GB DDR4-2666 and a single 5090; ktransformers is a must for the second socket to be useful. It keeps two copies of the model, so it uses 8xx GB of RAM in total.
q-admin007@reddit
I run a 1.78-bit quant (unsloth) on an i7-14700K and 196GB of DDR5 RAM and get less than 3 t/s.
The same on dual EPYC 9174F 16-core processors and 512GB DDR5 gets 6 t/s.
Conscious_Cut_6144@reddit
Dual DDR4 Epyc is a bad plan.
A second CPU gives pretty marginal gains if any.
Go DDR5 EPYC; you also get the benefit of 12 memory channels.
Just make sure you get one with 8 CCDs so you can utilize that memory bandwidth.
DDR5 Xeon is also an option; I tend to think 12 channels beats AMX, but either is better than dual DDR4.
And throwing a 3090 in there and running Ktransformers is an option too.
smflx@reddit
100W at idle... I was going to get one, but that makes me go hmm. Thanks for sharing.
silenceimpaired@reddit
They’re amazing in winter. Heat your house and think for you. During summer you can set them up to call 911 when you die of heat stroke.
Natural_Precision@reddit
A 1kw PC has been proven to provide the same amount of heat as a 1kw heater.
Commercial-Celery769@reddit
Yep, in winter it's great: you don't need any heating in whatever room the rig is in, so you can run the central heat less. But during summer, if you don't have AC running 24/7, that room turns 90°F in no time.
moofunk@reddit
Extracting heat from PCs for house heating ought to be an industry soon.
Historical-Camera972@reddit
Industry is ahead of you. They have been selling crypto miner/heaters for years.
moofunk@reddit
Those are just fancy single-room air heaters.
I'm talking about systems that capture the heat and transfer it into the house heating system or the hot water loop, so you can get hot tap water using compute.
That could be done with a heat pump directly mounted on the PC that connects to the water loop.
a_beautiful_rhind@reddit
They never tell you the details on the ES chips. Mine don't support VNNI and they idle all crazy.
NCG031@reddit
Dual EPYC 9135 with 24-channel memory and do not look back. 884 GB/s with DDR5-6000 memory, $1200 per CPU. Beats all other options for price. Dual EPYC 9355 for 970 GB/s is the next step.
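For reference, theoretical peak bandwidth is just channels × transfer rate × 8 bytes; measured numbers like the 884 GB/s above always come in below it. A quick sketch:

```python
# Theoretical peak memory bandwidth = channels * MT/s * 8 bytes per transfer.
def peak_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000

print(peak_gbs(24, 6000))  # dual socket, 12 channels each, DDR5-6000 -> 1152 GB/s peak
print(peak_gbs(12, 4800))  # single socket DDR5-4800                  ->  ~461 GB/s peak
print(peak_gbs(16, 3200))  # dual socket DDR4-3200 (8 channels each)  ->  ~410 GB/s peak
```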
Lumpy_Net_5199@reddit
How much does that run vs something like 4-6x 3090s with some DDR4? I’m able to get something like 13-15 t/s with Qwen3 235B at Q3.
That would probably fall somewhat (proportionally), given the experts are larger in DeepSeek.
National_Meeting_749@reddit
I'm really curious about the quant; it's probably the usual 'best results between 4- and 8-bit'. But with the 1-bit quant still being like... 110+ GB, I'm super curious whether it's still a useful model.
Conscious_Cut_6144@reddit
I use UD-Q3 and UD-Q2 depending on context and whatnot; both still seem pretty good.
a_beautiful_rhind@reddit
Sure it does... you get more memory channels and they kinda work with NUMA. It's how my Xeons can push 200GB/s. I tried NUMA isolation and using one proc; t/s is cut in half. Not ideal even with llama.cpp's crappy NUMA support.
SteveRD1@reddit
Which DDR5 EPYC with 8 CCD is the best value for money do you know? Are there any good 12 channel motherboards available yet?
Conscious_Cut_6144@reddit
I think it’s all of them other than the bottom 3 or 4 SKUs. Wikipedia has CCD counts.
The H13SSL is probably what I would go with.
Donnybonny22@reddit
If I got like 4x RTX 3090, can I combine that with a DDR5 EPYC?
Conscious_Cut_6144@reddit
Yep, it’s not a huge speed boost, but basically if you offload half the model onto 3090s it’s going to be about 2x faster.
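The intuition: per-token time is dominated by whatever still has to stream out of system RAM, so shrinking that portion is what buys you speed. A rough sketch (all bandwidth numbers are assumptions; the factor approaches 2x as the GPU side gets comparatively cheaper):

```python
# Per-token time = GPU-resident bytes / GPU bandwidth + CPU-resident bytes / RAM bandwidth.
def hybrid_tps(active_gb: float, gpu_fraction: float, gpu_bw_gbs: float, cpu_bw_gbs: float) -> float:
    t_gpu = active_gb * gpu_fraction / gpu_bw_gbs
    t_cpu = active_gb * (1 - gpu_fraction) / cpu_bw_gbs
    return 1 / (t_gpu + t_cpu)

# Assumed: ~20 GB of active weights per token, 3090s at ~900 GB/s, DDR5 EPYC at ~400 GB/s.
cpu_only = hybrid_tps(20, 0.0, 900, 400)
half_gpu = hybrid_tps(20, 0.5, 900, 400)
print(cpu_only, half_gpu, half_gpu / cpu_only)   # ~20 -> ~28 t/s here; closer to 2x the bigger/faster the GPU share
```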
hurrdurrmeh@reddit
Would it be worth getting a modded 48GB 4090 instead of a 3090 for KTransformers?
No_Afternoon_4260@reddit
Sure
Faux_Grey@reddit
100% echo everything said here.
Single socket, 12x dimms, fastest you can go.
mitchins-au@reddit
Probably a Mac Studio, if we are being honest - it’s not cheap, but compared to other high-speed setups it may be relatively cheaper? Or DIGITS.
mxforest@reddit
I don't think any CPU-only setup will give you that much throughput. You will have to have a combo of as much GPU as you can fit and then cover the rest with RAM. Possibly 4x RTX Pro 6000, which would cover 384GB of VRAM, plus 512GB DDR5?
Historical-Camera972@reddit
Posts like yours make my tummy rumble.
Is the internet really collectively designing computer systems for the 8 people who can actually afford them? LMAO. Like, your suggestion is for a system that 0.0001% of computer owners will have.
Feels kinda weird to think about, but we're acting as information butlers for a 1%er if someone actually uses your advice.
para2para@reddit
I’m not a 1%er and I just built a rig with Threadripper pro, 512gb ddr4 and an RTXA6000 48gb, thinking of adding another soon to get to 96gb vram
Historical-Camera972@reddit
Yeah, but what are you doing with the AI, and is it working fast?
harrro@reddit
OP is asking to run at home what is literally the largest available LLM ever released.
The collective 99.9999% of people don't plan to do that, but he does, so the person you're responding to is giving a realistic setup.
datbackup@reddit
Your criticism is just noise. At least parent comment is on topic. Go start a thread about “inequality in AI compute” and post your trash there.
teachersecret@reddit
Realistically… I’d say go with the maxed-out 512GB M3 Ultra Mac.
It’s ten grand, but you’ll be sipping watts instead of lighting your breaker box on fire and you’ll still get good speed.
Axotic69@reddit
How about a Dell Precision 7910 Tower - 2x Intel Xeon E5-2695 v4 18-core 2.1GHz - 512GB DDR4 REG? I wanted to get an older server to play with and run some tests, but I have to go abroad for a year and don’t feel like taking it with me. Running on CPU, I understand 512GB of RAM is not enough to load DeepSeek into memory, so maybe add some more?
bick_nyers@reddit
If you're going to go dual socket, I heard Wendell from Level1Techs recommend getting enough RAM that you can keep two copies of DeepSeek in RAM, one on each socket. You might be able to dig up more info on their forums: https://forum.level1techs.com/
I'm pretty sure DDR4-generation Intel CPUs don't have AMX, but it would be worth confirming, as KTransformers has support for AMX.
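A quick sizing check for the copy-per-socket idea (the quant size is an assumption based on Q4_K_M-class R1 quants):

```python
# RAM needed to keep one full copy of the model per NUMA node (socket).
copies = 2                 # one per socket, to avoid cross-socket memory traffic
quant_size_gb = 404        # ~Q4_K_M DeepSeek R1 (assumption)
context_and_os_gb = 60     # KV cache, buffers, OS headroom (assumption)

print(copies * quant_size_gb + context_and_os_gb)   # ~868 GB -> 1TB (e.g. 16x 64GB DIMMs) is the practical target
```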
BumbleSlob@reddit
Only 808GB of RAM!
HugoCortell@reddit
I would say that the cheapest setup is waiting for the new dual-GPU 48GB Intel cards to come out.
retroturtle1984@reddit
If you are not looking to run the “full precision” model, Ollama’s quantized versions running on llama.cpp work quite well. Depending on your needs, the distilled versions can also be a good option to increase your token throughput. The smaller models, 32B and below, can run with reasonable realtime performance on CPU.
JacketHistorical2321@reddit
You'll get 2-3 t/s with that setup. Search the forum. Plenty of info 👍
sascharobi@reddit
I wouldn't want to use it, way too slow.
extopico@reddit
It depends on two things: context window and how fast you need it to work. If you don’t care about speed but want the full 128k-token context, you’ll need around 400GB of RAM without quantising it. The weights will be read off the SSD if you use llama-server. Regarding speed, CPUs will work, so GPUs are not necessary.
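On the context side specifically, the KV cache is surprisingly small for this model because of MLA. A hedged sketch (the per-layer numbers are assumptions based on DeepSeek's published config; actual allocations by your backend will differ):

```python
# Rough KV-cache size: cached elements per token per layer, times layers, times tokens.
def kv_cache_gb(n_layers: int, elems_per_token_per_layer: int, bytes_per_elem: int, n_tokens: int) -> float:
    return n_layers * elems_per_token_per_layer * bytes_per_elem * n_tokens / 1e9

# DeepSeek R1's MLA caches a compressed latent (~512 + 64 elements per layer per token), 61 layers.
print(kv_cache_gb(61, 576, 2, 128_000))   # ~9 GB at fp16 for the full 128k window
print(kv_cache_gb(61, 576, 1, 128_000))   # ~4.5 GB with a q8 cache
```

So most of the memory quoted above goes to the weights themselves, not the context.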
Lissanro@reddit
I can run it on an EPYC 7763 64-core with 1TB of 3200MHz DDR4 RAM at 8 tokens/s, and well over 100 tokens/s for prompt processing, using 3090 GPUs. I have four, so I can fit 100K+ context at Q8, but I recommend having at least two; then you can fit 48K-64K context, along with some common tensors.
Based on that, it is possible that two socket EPYC with DDR4 can push beyond 10 tokens/s, but do not expect double the performance - dual socket system will not provide that much boost. Also, I do not recommend going below EPYC 7763 - since during token generation, it is fully saturated, so less powerful CPU will reduce your token generation performance (will not affect prompt processing speed though, since it is mostly done by GPUs).
If you are limited in budget, then 3090 cards + single socket DDR4 EPYC platform probably will be the best choice for the money.
Also, backend choice and picking the right quant are important if you are looking for performance. For example, llama.cpp, ollama and most other backends will not provide good performance for CPU+GPU inference with DeepSeek R1. I recommend using ik_llama.cpp instead. I shared how to set up and run ik_llama.cpp, in case you would like to give it a try. I also documented how to create a good-quality GGUF quant from scratch from the original FP8 safetensors that works best with ik_llama.cpp, covering everything including converting FP8 to BF16 and calibration datasets (new llama.cpp quants may work too, but at reduced performance, unless they specifically mention ik_llama.cpp compatibility for CPU+GPU inference).
Willing_Landscape_61@reddit
I don't think that you can get 10t/s on DDR4 Epyc as the second socket won't help that much because of NUMA. Disclaimer: I have such a dual Epyc Gen 2 server with a 4090 and I don't get much more than 5 t/s with smallish context.
FullstackSensei@reddit
10-15tk/s is far above reasonable performance for such a large model.
I get about 4 tk/s with any decent context (~2k or above) on a single 3090 and a 48-core Epyc 7648 with 2666 memory using ik_llama.cpp. I also have a dual Epyc system with 2933 memory, and that gets under 2 tk/s without a GPU.
The main issue is the software stack. There's no open-source option that's both easy to set up and well optimized for NUMA systems. Ktransformers doesn't want to build on anything less than Ampere. ik_llama.cpp and llama.cpp don't handle NUMA well.
kmouratidis@reddit
Search is your friend: - Deepseek R1 671b minimum hardware to get 20TPS running only in RAM - DeepSeek-R1 CPU-only performances (671B , Unsloth 2.51bit, UD-Q2_K_XL) - Relatively budget 671B R1 CPU inference workstation setup, 2-3T/s - PC Build: Run Deepseek-V3-0324:671b-Q8 Locally 6-8 tok/s - ktransformers docs