DeepSeek V4 PRO on how many 3090s?
Posted by szansky@reddit | LocalLLaMA | 35 comments
Hi guys, I only have 3090 GPUs, so... how many would I need to run DeepSeek V4 PRO and get great results? Thanks!
Tormeister@reddit
It would probably crawl at 1 token per minute even if you managed to split it across 40x 3090s. An LLM that size is not for consumer hardware.
FusionCow@reddit
too many
MachineZer0@reddit
99% of localllama stops around 384-512GB of VRAM/RAM. Most are probably at 16GB. I'd venture to say fewer than 5 people will ever run DeepSeek V4 Pro locally.
I stopped at GLM 4.7. Diminishing returns to have that much capital tied up for a single user.
Rethinking everything after Qwen3.6 27b.
szansky@reddit (OP)
That's the point
mzzmuaa@reddit
Run Qwen 3.6 27B Q8 with 256K context on 2 3090s, or DeepSeek V4 Flash Q4 on 6 3090s. Those are the best local coders. Cloud DeepSeek V4 Flash is so cheap right now that it's financially irresponsible to run it anywhere but the cloud, unless you already have all the hardware and electricity is free.
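If you want to sanity-check that cloud-vs-local claim, here's a back-of-envelope comparison. Every number in it (wall power, electricity price, local throughput, API price) is an assumption to swap for your own values, not a measurement:

```python
# Rough cloud-vs-local cost check for the claim above.
# All constants are assumptions -- plug in your own figures.

WALL_POWER_KW = 2.5        # assumed draw of a 6x 3090 box under load, incl. CPU/fans
ELECTRICITY_USD_KWH = 0.15 # assumed electricity price
LOCAL_TOK_PER_SEC = 25     # assumed aggregate local generation speed
CLOUD_USD_PER_MTOK = 0.50  # hypothetical API price per million output tokens

hours_per_mtok = 1_000_000 / (LOCAL_TOK_PER_SEC * 3600)
local_usd_per_mtok = WALL_POWER_KW * hours_per_mtok * ELECTRICITY_USD_KWH

print(f"local electricity: ${local_usd_per_mtok:.2f} per 1M tokens")
print(f"cloud API:         ${CLOUD_USD_PER_MTOK:.2f} per 1M tokens")
# With these assumptions, local power alone costs ~$4.17 per 1M tokens,
# before counting any hardware cost -- which is the point being made.
```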
segmond@reddit
0 or more.
With llama.cpp it's GPU RAM + system RAM, so you can run it with 1 3090 and 1TB of system RAM. If the weights are quantized right, they should come out to about 800GB+.
ranting80@reddit
That's the wrong tool for the job on this one. You'd probably want a stack of 2x Mac Studio 512GB models or 4x 256GB models, and it won't be very fast.
I wish I could recommend a dual-CPU server with 1TB of RAM, but it's still extremely expensive right now.
aigemie@reddit
It's simple math: Q4 is around 800GB and a 3090 has 24GB, so 800/24 ≈ 34, and you need more for context and other overhead, so let's add 2 more 3090s, which makes 36.
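The same arithmetic as a snippet, using this thread's assumed ~800 GB Q4 size and a two-card margin for context and overhead (both assumptions, not specs):

```python
import math

MODEL_Q4_GB = 800      # assumed Q4 weight size from this thread
VRAM_PER_3090_GB = 24
OVERHEAD_GPUS = 2      # extra cards for KV cache / activations / buffers

weights_gpus = math.ceil(MODEL_Q4_GB / VRAM_PER_3090_GB)  # 800/24 -> 34
total_gpus = weights_gpus + OVERHEAD_GPUS                 # -> 36
print(f"{weights_gpus} cards for weights, {total_gpus} with overhead")
```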
szansky@reddit (OP)
Thanks a lot bro. Did you use DeepSeek locally? Can you say something about this model? Worth it or not?
aigemie@reddit
No, I don't use it locally because I don't have that many GPUs. And no, I can't say anything about it beyond their own benchmarks.
szansky@reddit (OP)
Okay, thanks a lot. So what do you think is currently the best model for programmers with a couple of 3090s?
gaspoweredcat@reddit
Chances are it's something along the lines of Gemma 4 or Qwen 3.6, which would run very easily on a pair of 3090s. If you have a larger number of cards you could try DeepSeek V4 Flash (still incredibly capable) or the big Qwen 3.6, but those are still probably going to need 10 or so cards to run.
szansky@reddit (OP)
Even better than Qwen 3.6 at code, then?
redditorialy_retard@reddit
If it's a single user, yes. 4-8x 3090 can run big Qwen with a moderate quant.
szansky@reddit (OP)
So will I feel a big difference between that and a single 3090 running 3.6 27B?
redditorialy_retard@reddit
Bigger models, better results.
aigemie@reddit
Just like others said, Qwen3.6 27B is probably good for you to run on one 3090. You can also try Qwen3.5 122B with a few 3090s.
MengerianMango@reddit
Qwen 3.6 is probably your best bet.
gaspoweredcat@reddit
Even at that, your results may vary unless you have a very specialized setup. Hosting that many cards at full 16x PCIe lanes is hard enough, and you're going to be bottlenecked by PCIe bandwidth; you'd need something running over NVLink or similar to do it properly.
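A rough illustration of why the interconnect matters: under tensor parallelism every layer all-reduces the hidden state across cards. The sizes and bandwidths below are ballpark assumptions, not measured figures:

```python
# Ballpark per-token communication time under tensor parallelism.
# All model dimensions and bus bandwidths here are assumptions.

HIDDEN_DIM = 7168          # assumed hidden size
N_LAYERS = 61              # assumed layer count
BYTES_PER_VALUE = 2        # fp16 activations
ALLREDUCES_PER_LAYER = 2   # typical: one after attention, one after MLP
PCIE4_X16_GBPS = 32        # ~32 GB/s per direction, PCIe 4.0 x16
NVLINK_3090_GBPS = 112     # ~112 GB/s over the 3090 NVLink bridge

per_token_bytes = HIDDEN_DIM * BYTES_PER_VALUE * ALLREDUCES_PER_LAYER * N_LAYERS

for name, gbps in [("PCIe 4.0 x16", PCIE4_X16_GBPS), ("NVLink", NVLINK_3090_GBPS)]:
    us = per_token_bytes / (gbps * 1e9) * 1e6
    print(f"{name}: ~{us:.0f} us of transfer time per token, before latency")
# The bigger killer in practice is per-transfer latency multiplied by
# 2 all-reduces x 61 layers x N hops -- which NVLink also shrinks.
```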
Present-Aardvark-299@reddit
Just a thought: this tech is quite new. In the future it will probably require far less VRAM and RAM, so rather than buying tons of GPUs today to run AI locally, it might be better to wait a decade or so; then maybe it could run on one good GPU.
jikilan_@reddit
Like buying an 8800 GTS now?
szansky@reddit (OP)
Yes that's the point
Ceneka@reddit
Yes
Blues520@reddit
💀
ImportancePitiful795@reddit
🤣🤣🤣🤣
MaxKruse96@reddit
we really out here doing simple math for ppl now huh
OneSlash137@reddit
This is what AI has done to a lot of people.
Makers7886@reddit
The "Google it for me" crowd is evolving - wait, devolving.
szansky@reddit (OP)
Sorry for my lack of education.
gaspoweredcat@reddit
Well, let's be real, we all know LLMs suck at maths, so it's not like he can ask an AI (I'm joking of course, even somewhat crap models can handle maths that simple).
ImportancePitiful795@reddit
Considering you need around 34-36 RTX 3090s, and given their cost not only to buy and set up but also to run, consider buying a GH200 server instead; it will be much cheaper to buy and to power. 😁
Lissanro@reddit
It depends on whether you plan to offload to RAM or not. For better performance, you need at least enough VRAM to hold the context cache and the common expert tensors, and if you still have VRAM left, then as much of the rest as fits. Modern llama.cpp can do this automatically, but V4 Pro isn't supported yet; work on it seems to be in progress, so it will likely be runnable with llama.cpp soon.
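A toy sketch of that placement rule: pin the context cache and the common tensors in VRAM first, then greedily fill what's left with layers and spill the rest to system RAM. The sizes below are made-up illustrative numbers, not real V4 Pro tensor sizes:

```python
# Toy greedy offload budget: cache + shared tensors get VRAM first,
# remaining VRAM holds whole layers, everything else goes to system RAM.
# All sizes are hypothetical placeholders.

VRAM_GB = 4 * 24          # four 3090s
CONTEXT_CACHE_GB = 20     # assumed KV cache budget for the desired context
SHARED_TENSORS_GB = 18    # assumed "common" (non-routed-expert) tensors
LAYER_GB = 12.5           # assumed size of one routed-expert layer block
N_LAYERS = 61             # assumed layer count

vram_left = VRAM_GB - CONTEXT_CACHE_GB - SHARED_TENSORS_GB
gpu_layers = min(N_LAYERS, int(vram_left // LAYER_GB))
cpu_layers = N_LAYERS - gpu_layers

print(f"VRAM: cache + shared tensors pinned, {gpu_layers} layers on GPU")
print(f"RAM:  {cpu_layers} layers offloaded ({cpu_layers * LAYER_GB:.0f} GB)")
```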
I plan to run it as a Q4 quant (once it's available and supported in mainline llama.cpp) with four 3090 GPUs + 1TB RAM.
If you want the absolute best performance and want to load it into VRAM only, you will need better GPUs, maybe 12 to 16 RTX PRO 6000s (depending on what quant and context size you plan to run, and with what backend).
MelodicRecognition7@reddit
lol yet another "recommend a LLM for coding" thread disguised as DS4 discussion
Herr_Drosselmeyer@reddit
About 20.
szansky@reddit (OP)
Thanks man