DeepSeek V4 PRO on how many 3090s?
Posted by szansky@reddit | LocalLLaMA | 35 comments
Hi guys, I only have 3090 GPUs, so... how many would I need to run DeepSeek V4 PRO and get great results? Thanks!
Tormeister@reddit
It would probably crawl at 1 token per minute even if you managed to split it across 40x 3090s. An LLM that size is not for consumer hardware.
FusionCow@reddit
too many
MachineZer0@reddit
99% of localllama stops around 384-512GB of VRAM/RAM. Most are probably at 16GB. I'd venture to say fewer than 5 people will ever run DeepSeek V4 Pro locally.
I stopped at GLM 4.7. Diminishing returns to have that much capital tied up for a single user.
Rethinking everything after Qwen3.6 27b.
szansky@reddit (OP)
That's the point
mzzmuaa@reddit
Run Qwen 3.6 27B Q8 with 256K context on 2 3090s, or DeepSeek V4 Flash Q4 on 6 3090s. Those are the best local coders. Cloud DeepSeek V4 Flash is so cheap right now that it's financially irresponsible to run it anywhere but the cloud, unless you already have all the hardware and electricity is free.
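If you want to sanity-check that cloud-vs-local claim, here's a back-of-envelope comparison. Every number in it (wall power, electricity price, local throughput, API price) is an assumption to swap for your own values, not a measurement:

```python
# Rough cloud-vs-local cost check for the claim above.
# All constants are assumptions -- plug in your own figures.

WALL_POWER_KW = 2.5        # assumed draw of a 6x 3090 box under load, incl. CPU/fans
ELECTRICITY_USD_KWH = 0.15 # assumed electricity price
LOCAL_TOK_PER_SEC = 25     # assumed aggregate local generation speed
CLOUD_USD_PER_MTOK = 0.50  # hypothetical API price per million output tokens

hours_per_mtok = 1_000_000 / (LOCAL_TOK_PER_SEC * 3600)
local_usd_per_mtok = WALL_POWER_KW * hours_per_mtok * ELECTRICITY_USD_KWH

print(f"local electricity: ${local_usd_per_mtok:.2f} per 1M tokens")
print(f"cloud API:         ${CLOUD_USD_PER_MTOK:.2f} per 1M tokens")
# With these assumptions, local power alone costs ~$4.17 per 1M tokens,
# before counting any hardware cost -- which is the point being made.
```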
segmond@reddit
0 or more.
With llama.cpp it's GPU RAM + system RAM, so you can run it with 1 3090 and 1TB of system RAM. If the weights are quantized right, they should come out to about 800GB+.
ranting80@reddit
That's the wrong tool for the job on this one. You'd probably want a stack of 2x Mac Studio 512GB models or 4x 256GB models, and it won't be very fast.
I wish I could recommend a dual-CPU server with 1TB of RAM, but it's still extremely expensive right now.
aigemie@reddit
It's simple math: Q4 is around 800GB and a 3090 has 24GB, so 800/24 ≈ 34, and you need more for context and other overhead, so let's add 2 more 3090s, which makes 36.
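The same arithmetic as a snippet, using this thread's assumed ~800 GB Q4 size and a two-card margin for context and overhead (both assumptions, not specs):

```python
import math

MODEL_Q4_GB = 800      # assumed Q4 weight size from this thread
VRAM_PER_3090_GB = 24
OVERHEAD_GPUS = 2      # extra cards for KV cache / activations / buffers

weights_gpus = math.ceil(MODEL_Q4_GB / VRAM_PER_3090_GB)  # 800/24 -> 34
total_gpus = weights_gpus + OVERHEAD_GPUS                 # -> 36
print(f"{weights_gpus} cards for weights, {total_gpus} with overhead")
```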
szansky@reddit (OP)
Thanks a lot bro. Did you use DeepSeek locally? Can you say something about this model? Worth it or not?
aigemie@reddit
No, I don't use it locally because I don't have that many GPUs. And no, I can't say anything about it beyond their own benchmarks.
szansky@reddit (OP)
Okay, thanks a lot. So what do you think is currently the best model for programmers with a couple of 3090s?
gaspoweredcat@reddit
Chances are it's something along the lines of Gemma 4 or Qwen 3.6, which would run very easily on a pair of 3090s. If you have a larger number of cards you could try DeepSeek V4 Flash (still incredibly capable) or the big Qwen 3.6, but those are still probably going to need 10 or so cards to run.
szansky@reddit (OP)
Even better than Qwen 3.6 at code, then?
redditorialy_retard@reddit
If it's a single user, yes. 4-8x 3090 can run big Qwen with a moderate quant.
szansky@reddit (OP)
So will I feel a big difference between that and a single 3090 running 3.6 27B?
redditorialy_retard@reddit
Bigger models, better results.
aigemie@reddit
Just like others said, Qwen3.6 27B is probably good for you to run on one 3090. You can also try Qwen3.5 122B with a few 3090s.
MengerianMango@reddit
Qwen 3.6 is probably your best bet.
gaspoweredcat@reddit
Even at that, your results may vary unless you have a very specialized setup. Hosting that many cards at full 16x PCIe lanes is hard enough, and you're going to be bottlenecked by PCIe bandwidth; you'd need something running over NVLink or similar to do it properly.
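A rough illustration of why the interconnect matters: under tensor parallelism every layer all-reduces the hidden state across cards. The sizes and bandwidths below are ballpark assumptions, not measured figures:

```python
# Ballpark per-token communication time under tensor parallelism.
# All model dimensions and bus bandwidths here are assumptions.

HIDDEN_DIM = 7168          # assumed hidden size
N_LAYERS = 61              # assumed layer count
BYTES_PER_VALUE = 2        # fp16 activations
ALLREDUCES_PER_LAYER = 2   # typical: one after attention, one after MLP
PCIE4_X16_GBPS = 32        # ~32 GB/s per direction, PCIe 4.0 x16
NVLINK_3090_GBPS = 112     # ~112 GB/s over the 3090 NVLink bridge

per_token_bytes = HIDDEN_DIM * BYTES_PER_VALUE * ALLREDUCES_PER_LAYER * N_LAYERS

for name, gbps in [("PCIe 4.0 x16", PCIE4_X16_GBPS), ("NVLink", NVLINK_3090_GBPS)]:
    us = per_token_bytes / (gbps * 1e9) * 1e6
    print(f"{name}: ~{us:.0f} us of transfer time per token, before latency")
# The bigger killer in practice is per-transfer latency multiplied by
# 2 all-reduces x 61 layers x N hops -- which NVLink also shrinks.
```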
Present-Aardvark-299@reddit
Just a thought: this tech is quite new. In the future it will probably require far less VRAM and RAM, so rather than buying tons of GPUs today to run AI locally, it might be better to wait a decade or so; then maybe it could run on one good GPU.
jikilan_@reddit
Like buying an 8800 GTS now?
szansky@reddit (OP)
Yes that's the point
Ceneka@reddit
Yes
Blues520@reddit
💀
ImportancePitiful795@reddit
🤣🤣🤣🤣
MaxKruse96@reddit
we really out here doing simple math for ppl now huh
OneSlash137@reddit
This is what AI has done to a lot of people.
Makers7886@reddit
The "Google it for me" crowd is evolving - wait, devolving.
szansky@reddit (OP)
Sorry for my lack of education.
gaspoweredcat@reddit
Well, let's be real, we all know LLMs suck at maths, so it's not like he can ask an AI (I'm joking of course, even somewhat crap models can handle maths that simple).
ImportancePitiful795@reddit
Considering you need around 34-36 RTX 3090s, and given their cost not only to buy and set up but also to run, consider buying a GH200 server instead; it will be much cheaper to buy and to power. 😁
Lissanro@reddit
It depends on whether you plan to offload to RAM or not. For better performance, you need at least enough VRAM to hold the context cache and the common expert tensors, and if you still have VRAM left, then as much of the rest as fits. Modern llama.cpp can do this automatically, but V4 Pro isn't supported yet; work on it seems to be in progress, so it will likely be runnable with llama.cpp soon.
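A toy sketch of that placement rule: pin the context cache and the common tensors in VRAM first, then greedily fill what's left with layers and spill the rest to system RAM. The sizes below are made-up illustrative numbers, not real V4 Pro tensor sizes:

```python
# Toy greedy offload budget: cache + shared tensors get VRAM first,
# remaining VRAM holds whole layers, everything else goes to system RAM.
# All sizes are hypothetical placeholders.

VRAM_GB = 4 * 24          # four 3090s
CONTEXT_CACHE_GB = 20     # assumed KV cache budget for the desired context
SHARED_TENSORS_GB = 18    # assumed "common" (non-routed-expert) tensors
LAYER_GB = 12.5           # assumed size of one routed-expert layer block
N_LAYERS = 61             # assumed layer count

vram_left = VRAM_GB - CONTEXT_CACHE_GB - SHARED_TENSORS_GB
gpu_layers = min(N_LAYERS, int(vram_left // LAYER_GB))
cpu_layers = N_LAYERS - gpu_layers

print(f"VRAM: cache + shared tensors pinned, {gpu_layers} layers on GPU")
print(f"RAM:  {cpu_layers} layers offloaded ({cpu_layers * LAYER_GB:.0f} GB)")
```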
I plan to run it as a Q4 quant (once it's available and supported in mainline llama.cpp) with four 3090 GPUs + 1TB RAM.
If you want the absolute best performance and want to load it into VRAM only, you will need better GPUs, maybe 12 to 16 RTX PRO 6000s (depending on what quant and context size you plan to run, and with what backend).
MelodicRecognition7@reddit
lol yet another "recommend a LLM for coding" thread disguised as DS4 discussion
Herr_Drosselmeyer@reddit
About 20.
szansky@reddit (OP)
Thanks man