Budget to run Deepseek V4 locally at FP4 precision
Posted by DanielusGamer26@reddit | LocalLLaMA | 28 comments
Just a question for fun/curiosity: in your opinion, if I had enough money, how much would it cost and what configuration would be required to run DeepSeek V4? Not necessarily everything in VRAM; maybe something hybrid. Let's discuss :)
Sorry for the low-effort post, but it's pure curiosity; I'm not here to farm karma or anything like that.
This_Maintenance_834@reddit
$25K for the flash one?
Blaze6181@reddit
But beware, DeepGEMM does not support sm 12.0 and apparently has no plans to, so you will have to lean on the community for whatever they come up with. I'm literally trying to build and optimize a new kernel for this exact use case as we speak.
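For anyone checking their own card, here's a small PyTorch sketch that prints the SM version each GPU reports; per the point above, upstream DeepGEMM doesn't target sm_120 (consumer/workstation Blackwell), so those cards need a different kernel path:

```python
import torch

# Print the compute capability (SM version) of each visible GPU.
# DeepGEMM reportedly doesn't target sm_120, so those cards would
# need a community kernel or a fallback path.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
    if (major, minor) == (12, 0):
        print("   -> unsupported by upstream DeepGEMM; watch for community kernels")
```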
This_Maintenance_834@reddit
I believe that's eventually going to happen. It took less than four weeks for DFlash to propagate to many LLM engines.
AppealSame4367@reddit
My guess, looking at Qwen3.6 27B and the like: just wait 3-6 months and you'll have that power on a gaming PC. Why invest $60k in something that will be dirt cheap in a few months?
What I mean by that: open models will keep evolving. I have a usable Qwen3.6 35B running on my 6 GB VRAM old gaming laptop in pi cli, and it's currently analyzing and fixing a whole Rust client-server game in the background while I do other things. It's crazy, and I'll probably have DeepSeek V4 intelligence on that same old laptop in a few months. So why bother?
Ill_Initiative_8793@reddit
I upgraded my stock 4090 to a 48 GB version and am quite happy with it so far with all these new models. The only problem is turbine noise under full load, but for LLMs it's not a problem most of the time.
DanielusGamer26@reddit (OP)
So you got that modded GPU from Alibaba? What about driver compatibility and long-term resilience (like GPU failures, since they're modded)?
These are the main concerns stopping me from buying these cards.
Ill_Initiative_8793@reddit
I got it from some guys in Moscow, as I live in Russia. They do it themselves: they transfer the GPU chip and VRAM from the stock card to a new board and add 24 GB of new VRAM on the other side. Then they add a turbine cooler made for this board (noisier than what I had, especially under load). There is also a liquid-cooling option, more suitable for a home workstation. It works with the same driver under Linux (I didn't reinstall anything). I haven't tried Windows yet, but they told me it works with the stock driver too. I've run it under full load for hours; it seems fine and has some thermal headroom.
DanielusGamer26@reddit (OP)
I'd use it on Linux as well. I also own an RTX 5060Ti and would like to run a 48GB 4090 alongside it. Did you happen to check the VRAM temperature while inference is running? Do you know if it's compatible with multiple GPUs to leverage parallelism? Do you know if it might cause issues with two GPUs from different generations?
Thanks in advance for any replies. I'm afraid of spending money on something that might end up causing problems, so I'm being a bit paranoid with all these questions XD.
Ill_Initiative_8793@reddit
I didn't try it alongside other cards, but I don't think it's different from a stock 4090 in that regard. It also now takes only two slots, while previously it was a three-slot version. I didn't measure VRAM temps, but the turbine cools the VRAM too, AFAIK.
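For reference, llama.cpp can split a model across mismatched cards by ratio. A minimal llama-cpp-python sketch, assuming a hypothetical GGUF file and the 48 GB + 16 GB pairing discussed above:

```python
from llama_cpp import Llama

# Hedged sketch: llama.cpp doesn't care that the cards are different
# generations; you just weight the split by available VRAM.
# 48 GB (modded 4090) : 16 GB (5060 Ti) = 0.75 : 0.25.
llm = Llama(
    model_path="model-q4_k_m.gguf",  # hypothetical filename
    n_gpu_layers=-1,                 # offload all layers to the GPUs
    tensor_split=[0.75, 0.25],       # fraction of the model per CUDA device
)
```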
FusionCow@reddit
It depends on the inference speed you want. A 512 GB M3 Ultra Mac would work, but if you truly don't care, you could get like 384 GB of DDR3 RAM, y'know. If inference speed is a huge deal: 8x B200.
Expensive-Paint-9490@reddit
FP4 isn't yet working properly in workstation-class Blackwell GPUs. If you want to exploit the dedicated hardware, you need datacenter-class Blackwell. So the cheapest (ha ha) option would be an Nvidia HGX B200. I think it can be bought for 300,000 USD.
CalligrapherFar7833@reddit
It will keep not working properly as long as you call both sm120 and sm100 "Blackwell." One is fake Blackwell; the other is not.
evil0sheep@reddit
I have an rtx 6000 pro and the fp4 matmul instructions work just fine. Are you saying that it’s not supported in a specific piece of inference software?
Badger-Purple@reddit
Two DGX Sparks or other GB10 machines: ~$6k.
Electrical_Name_5434@reddit
This guy wrote an article about running it at BF16. He got it done on 2x 4090s but recommends 4. So roughly a quarter of that should suit FP4; a single 4090 would get it done, but you'd lose accuracy.
https://wavespeed.ai/blog/posts/deepseek-v4-gpu-vram-requirements/
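The precision scaling itself is easy to sanity-check with back-of-envelope math. A sketch in Python; the parameter count is an assumption (V4's size isn't public in this thread; 1.7T roughly matches the ~865 GB FP4 figure quoted further down):

```python
# Weights-only memory at different precisions; KV cache and activations extra.
N_PARAMS = 1.7e12  # assumed parameter count, not an official figure

for label, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4)]:
    gb = N_PARAMS * bits / 8 / 1e9
    print(f"{label}: ~{gb:,.0f} GB")
# BF16: ~3,400 GB / FP8: ~1,700 GB / FP4: ~850 GB
```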
pixelterpy@reddit
Without further quantization I would assume >865 GB RAM+VRAM; you could probably get away with 768 GB main memory + 112+ GB VRAM, depending on the KV cache. The cheapest not-completely-garbage solution I can think of (used parts) would be an EPYC (up to 3rd gen) or 3rd-gen Xeon, 768 GB DDR4, and 10-12x 3060 12 GB or 5-6x 3090 24 GB. Maybe Intel B60 32 GB or AMD R9700 AI 32 GB if 3090 prices are too wild.
Board + CPU ~$1k; RAM ~$3k; GPUs ~$4k.
You will also need a PSU, proper (bifurcation) risers + cables for the 3060s/3090s, and at least a 1 TB SSD.
My verdict: ~$10k if you live in a country where you have access to the usual used-parts market.
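A minimal sketch of that hybrid setup with llama-cpp-python, keeping most of the model in system RAM and offloading only a slice of layers to the GPUs (the file name and layer count are assumptions, not tested values):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-v4-q4_k_m-00001-of-00018.gguf",  # hypothetical split GGUF
    n_gpu_layers=12,   # offload only what fits in the ~112 GB of combined VRAM
    n_ctx=8192,        # KV cache size drives how much VRAM headroom you need
)
out = llm("Explain MoE offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```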
DanielusGamer26@reddit (OP)
It seems reasonable, but I'm concerned about two points: managing 12 GPUs seems like it would be quite painful; also, with the model sitting mostly in RAM, wouldn't it be very slow, with really low PP? I'm not familiar with the EPYC and Xeon lines; do they have wider memory bandwidth?
pixelterpy@reddit
You will be in the ballpark of ~30 t/s PP and ~4 t/s TG.
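That TG figure is roughly what the bandwidth math predicts. A quick estimate, with every input hedged: ~200 GB/s effective for 8-channel DDR4-3200, and ~37B active parameters per token (DeepSeek V3's MoE figure, reused as a guess for V4):

```python
# Token generation is memory-bandwidth bound: each token reads every
# active weight once. All three inputs below are assumptions.
bandwidth = 200e9        # effective bytes/s for 8-channel DDR4-3200
active_params = 37e9     # active experts per token (V3 figure, reused)
bits = 4                 # FP4 weights

tps_ceiling = bandwidth / (active_params * bits / 8)
print(f"Theoretical ceiling: {tps_ceiling:.1f} t/s")
# ~10.8 t/s ceiling; 30-50% real-world efficiency lands at 3-5 t/s.
```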
Quadrapoole@reddit
4t/s generation is useless
ComplexType568@reddit
Overnight runs exist, I guess. With all this talk about how reliable it is, it could probably knock out a massive task overnight or during a day when you're out, very, very slowly.
Orolol@reddit
AKA completely useless outside of very basic chat.
Conscious_Cut_6144@reddit
A complex riser setup for 10% offload makes no sense.
Go with one or two high-end GPUs (3090s/4090s/5090s). You're looking at something like 2 T/s that way, versus 2.2 T/s with a bunch of GPUs.
opoot_@reddit
What about modded cards like the 2080 Ti 22 GB and 3080 20 GB?
CatalyticDragon@reddit
I'd think Threadripper and 4x R9700.
Technical-Earth-3254@reddit
Flash or Pro?
segmond@reddit
The cheapest and best way is just a pure CPU system: EPYC Milan with the fastest CPU and maxed-out RAM. Board and CPU = $1,500; 1 TB of 3200 MHz DDR4 RAM = $12,000; plus a fast NVMe drive. So about $14,000.
Long_comment_san@reddit
It's simple enough to be answered by any chatbot, with a higher degree of accuracy than the people here.
DanielusGamer26@reddit (OP)
Yeah, I've thought about it, but chatbots have certain limitations when it comes to LLM knowledge and still don't provide reliable information that reflects real-world use cases. I've been on LocalLLaMA for a long time and have seen so many people with all kinds of configurations, some of them massive. I'm talking about people like them, with that level of expertise, who know exactly how much they've spent on hardware.