How do you plan to run DeepSeekV4 Pro locally?
Posted by segmond@reddit | LocalLLaMA | View on Reddit | 29 comments
For those of us crazy enough to try this, what's your plan? Save the Q0.5 and Q1 jokes. I'm currently stressed because I can't run it.
imbilbobaggins@reddit
For a less worthless set of answers, take a look here:
https://www.reddit.com/r/LocalLLaMA/comments/1sua2rr/budget_to_run_deepseek_v4_locally_at_fp4_precision/
LeyLineDisturbances@reddit
lol there's no chance i can run this locally on my m1 max.
amitbahree@reddit
I tried but vLLM has a bug. More details here - https://www.reddit.com/r/LocalLLaMA/comments/1su3tfb/comment/oi5defe/?context=3&utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
"Technically" not local, but details. 🙃
-dysangel-@reddit
It's not a joke, I do plan to try at Q2, or even Q1 if necessary.
I've just tweaked mlx-lm to allow cache snapshots, since the built-in KV caching doesn't work with linear/sliding-window caches. A lot of new models use these, so it's kind of an essential feature that I'm surprised isn't in there yet.
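For reference, this is roughly what a cache snapshot looks like with mlx-lm's existing prompt-cache helpers. A minimal sketch: the model path and prompts are placeholders, and the exact helper names may differ slightly between mlx-lm versions.

```python
# Minimal sketch of KV-cache snapshotting with mlx-lm's prompt-cache helpers.
# Model path and prompts are placeholders; helper names may vary by version.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache, save_prompt_cache, load_prompt_cache

model, tokenizer = load("mlx-community/some-quantized-model")  # placeholder path

# Run the long shared prefix once against a fresh cache, then snapshot it to disk.
cache = make_prompt_cache(model)
generate(model, tokenizer, prompt="<long shared system prompt>",
         prompt_cache=cache, max_tokens=1)
save_prompt_cache("prefix.safetensors", cache)

# Later (or in another process): restore the snapshot and keep generating from it,
# skipping the prefill of the shared prefix.
cache = load_prompt_cache("prefix.safetensors")
print(generate(model, tokenizer, prompt="actual question here",
               prompt_cache=cache, max_tokens=256))
```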
Expensive-Paint-9490@reddit
If llama.cpp ends up supporting the model, which at this point is not a given, I guess I'll resort to a 2-bit quant. That can all fit in 512GB RAM + 24GB VRAM.
sine120@reddit
I paid for a 4TB gen 5 SSD. Swap disk is free real estate.
ProfessionalSpend589@reddit
The so-called "disk offloading" doesn't write to a swap file on disk; it dynamically reads weights from disk into RAM while evicting old weights from RAM.
That happens for every token, as far as I know, which is why it's so slow.
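A toy sketch of that read pattern, just to illustrate the idea (this is not llama.cpp's actual implementation, and the file path, layer count, and sizes are made-up numbers):

```python
# Toy illustration of per-token weight streaming (NOT llama.cpp's real code).
# File path, layer count, and sizes are invented for the example.
import mmap
import time

WEIGHT_FILE = "model.gguf"          # placeholder path to a big weights file
LAYER_BYTES = 512 * 1024 * 1024     # pretend each layer is 512 MB on disk
NUM_LAYERS = 8

def decode_one_token(weights: mmap.mmap) -> None:
    # Each token needs every layer's weights. Pages not already in the OS page
    # cache get read from disk; older pages are evicted to make room.
    for layer in range(NUM_LAYERS):
        start = layer * LAYER_BYTES
        _ = weights[start:start + LAYER_BYTES]  # forces those pages into RAM

with open(WEIGHT_FILE, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    t0 = time.time()
    decode_one_token(weights)
    print(f"one 'token' took {time.time() - t0:.2f}s of pure disk I/O")
```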
No_Conversation9561@reddit
A Mac Studio with 512 GB can probably run a 3-bit quant.
Front_Eagle739@reddit
2.5-bit only, unless you have two Macs, and you won't get anything else running on it.
a9udn9u@reddit
I plan to rob a bank this weekend then I'll buy a ton of GPUs to run it.
cafedude@reddit
Why not skip the bank and just go directly to a datacenter?
ProfessionalSpend589@reddit
Yeah, lead times for GPUs and CPUs are stretching into months. That money would just sit there doing nothing.
imp_12189@reddit
That seems like too much work. Isn't it easier to rob one of Nvidia's warehouses or data centers directly? Less security, more money per kg.
DarthFluttershy_@reddit
That's what I'm going to do, too, just as soon as I get deepseek v4 pro running locally to help me plan the bank robbery.
Clean_Hyena7172@reddit
Might need to do it more than once with these prices.
speedb0at@reddit
On about 60 thousand GT 1030s.
RandoReddit72@reddit
How many DGX Sparks are needed?
Bob_Fancy@reddit
What a silly thing to stress about
Kahvana@reddit
Not.
Gemma 4 (26B-A4B, 31B) and Qwen3.6 (35B-A3B and 27B) are really good models and cover 99% of the cases I need them for.
Not sure if DeepSeek V4 Pro would run fast enough on a pure 1TB DDR5 EPYC server with no GPU.
The jankiest and dumbest way I can come up with would be two PCIe 5.0 x8 dual-slot NVMe carrier cards filled with Samsung 9100 Pro 8TB drives (for their on-board DRAM and resilience), on a motherboard that can do x8/x8 PCIe 5.0 bifurcation (like the Gigabyte B850 AI TOP), maxed out with 256GB of RAM.
That would set you back about 11,510 EUR and you'd be measuring speed in tokens per minute, but it would run!
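Back-of-the-envelope math for what a setup like that could do; every number below is a guess you can swap for your own, since nobody knows V4 Pro's real size or active parameter count yet:

```python
# Rough throughput estimate for streaming MoE weights off NVMe every token.
# All numbers are assumptions for illustration, not published specs.
pcie5_x8_gbps = 32.0     # ~32 GB/s theoretical per PCIe 5.0 x8 link
links = 2                # two carrier cards, x8 each
efficiency = 0.5         # assume sustained reads reach ~half of theoretical
read_bandwidth = pcie5_x8_gbps * links * efficiency   # usable GB/s

active_params_b = 40.0   # hypothetical active params per token, in billions
bits_per_weight = 3.0    # hypothetical quant level
bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8 / 1e9  # GB per token

# Real MoE access patterns are far more random than a sequential read, so
# actual throughput will land well below this ceiling.
tokens_per_second = read_bandwidth / bytes_per_token
print(f"~{tokens_per_second:.1f} tok/s upper bound "
      f"({tokens_per_second * 60:.0f} tok/min)")
```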
qwen_next_gguf_when@reddit
flash is reachable
segmond@reddit (OP)
I can run flash, no problem.
Dr_Me_123@reddit
My local agent searched online and said it's still a long way from being implemented in llama.cpp. I don't know if that's true.
segmond@reddit (OP)
Implementation time is as fast as someone is motivated to get it done. The challenge is that once we have the GGUF, running it is not going to be easy at all.
GradatimRecovery@reddit
Turin 24 * 128
StardockEngineer@reddit
You’re stressed because you can’t run an LLM? What a great life you must have.
grim-432@reddit
If you aren’t spending more on your AI than you do on your car, are you even doing AI?
laterbreh@reddit
Full precision, just waiting on SM120 support to get baked into vLLM.
FoxiPanda@reddit
Yeah I’m not going to try for pro even with 1TB of vram. I’m going to run flash. Once all the quirks are fixed, it’ll be a great model.
sasquatch3277@reddit
Not stressed at all because I won't be able to.