How do you plan to run DeepSeekV4 Pro locally?
Posted by segmond@reddit | LocalLLaMA | View on Reddit | 29 comments
For those of us crazy enough to try this, what's your plan? Save the Q0.5 and Q1 jokes. I'm currently stressed because I can't run it.
imbilbobaggins@reddit
For a less worthless set of answers, take a look here:
https://www.reddit.com/r/LocalLLaMA/comments/1sua2rr/budget_to_run_deepseek_v4_locally_at_fp4_precision/
LeyLineDisturbances@reddit
lol there's no chance i can run this locally on my m1 max.
amitbahree@reddit
I tried but vLLM has a bug. More details here - https://www.reddit.com/r/LocalLLaMA/comments/1su3tfb/comment/oi5defe/?context=3&utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
"Technically" not local, but details. 🙃
-dysangel-@reddit
It's not a joke, I do plan to try at Q2, or even Q1 if necessary.
I've just tweaked mlx-lm to allow cache snapshots, since the built-in KV caching doesn't work with linear/sliding-window caches. A lot of new models use these, so it's kind of an essential feature that I'm surprised isn't in there yet.
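For reference, this is roughly what a cache snapshot looks like with mlx-lm's existing prompt-cache helpers. A minimal sketch: the model path and prompts are placeholders, and the exact helper names may differ slightly between mlx-lm versions.

```python
# Minimal sketch of KV-cache snapshotting with mlx-lm's prompt-cache helpers.
# Model path and prompts are placeholders; helper names may vary by version.
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache, save_prompt_cache, load_prompt_cache

model, tokenizer = load("mlx-community/some-quantized-model")  # placeholder path

# Run the long shared prefix once against a fresh cache, then snapshot it to disk.
cache = make_prompt_cache(model)
generate(model, tokenizer, prompt="<long shared system prompt>",
         prompt_cache=cache, max_tokens=1)
save_prompt_cache("prefix.safetensors", cache)

# Later (or in another process): restore the snapshot and keep generating from it,
# skipping the prefill of the shared prefix.
cache = load_prompt_cache("prefix.safetensors")
print(generate(model, tokenizer, prompt="actual question here",
               prompt_cache=cache, max_tokens=256))
```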
Expensive-Paint-9490@reddit
If llama.cpp ends up supporting the model, which at this point is not a given, I guess I'll resort to a 2-bit quant. That can all fit in 512GB RAM + 24GB VRAM.
sine120@reddit
I paid for a 4TB gen 5 SSD. Swap disk is free real estate.
ProfessionalSpend589@reddit
The so-called "disk offloading" doesn't write to a swap file on disk; it dynamically reads weights from disk into RAM while evicting old weights from RAM.
That happens for every token, as far as I know, which is why it's so slow.
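A toy sketch of that read pattern, just to illustrate the idea (this is not llama.cpp's actual implementation, and the file path, layer count, and sizes are made-up numbers):

```python
# Toy illustration of per-token weight streaming (NOT llama.cpp's real code).
# File path, layer count, and sizes are invented for the example.
import mmap
import time

WEIGHT_FILE = "model.gguf"          # placeholder path to a big weights file
LAYER_BYTES = 512 * 1024 * 1024     # pretend each layer is 512 MB on disk
NUM_LAYERS = 8

def decode_one_token(weights: mmap.mmap) -> None:
    # Each token needs every layer's weights. Pages not already in the OS page
    # cache get read from disk; older pages are evicted to make room.
    for layer in range(NUM_LAYERS):
        start = layer * LAYER_BYTES
        _ = weights[start:start + LAYER_BYTES]  # forces those pages into RAM

with open(WEIGHT_FILE, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    t0 = time.time()
    decode_one_token(weights)
    print(f"one 'token' took {time.time() - t0:.2f}s of pure disk I/O")
```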
No_Conversation9561@reddit
A Mac Studio with 512 GB can probably run a 3-bit quant.
Front_Eagle739@reddit
2.5-bit only, unless you have two Macs, and you won't get anything else running on it.
a9udn9u@reddit
I plan to rob a bank this weekend then I'll buy a ton of GPUs to run it.
cafedude@reddit
Why not skip the bank and just go directly to a datacenter?
ProfessionalSpend589@reddit
Yeah, lead times for GPUs and CPUs are stretching into months. That money would just sit there doing nothing.
imp_12189@reddit
That seems like too much work. Isn't it easier to rob one of Nvidia's warehouses or data centers directly? Less security, more money per kg.
DarthFluttershy_@reddit
That's what I'm going to do, too, just as soon as I get deepseek v4 pro running locally to help me plan the bank robbery.
Clean_Hyena7172@reddit
Might need to do it more than once with these prices.
speedb0at@reddit
On about 60 thousand GT 1030s.
RandoReddit72@reddit
How many DGX Sparks are needed?
Bob_Fancy@reddit
What a silly thing to stress about
Kahvana@reddit
Not.
Gemma 4 (26B-A4B, 31B) and Qwen3.6 (35B-A3B and 27B) are really good models and cover 99% of the cases I need them for.
Not sure if DeepSeek V4 Pro would run fast enough on a pure 1TB DDR5 EPYC server with no GPU.
The jankiest and dumbest way I can come up with would be two PCIe 5.0 x8 dual-slot NVMe carrier cards filled with Samsung 9100 Pro 8TB drives (for their on-board DRAM and resilience), on a motherboard that can do x8/x8 PCIe 5.0 bifurcation (like the Gigabyte B850 AI TOP), maxed out with 256GB of RAM.
That would set you back about 11,510 EUR and you'd be measuring speed in tokens per minute, but it would run!
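Back-of-the-envelope math for what a setup like that could do; every number below is a guess you can swap for your own, since nobody knows V4 Pro's real size or active parameter count yet:

```python
# Rough throughput estimate for streaming MoE weights off NVMe every token.
# All numbers are assumptions for illustration, not published specs.
pcie5_x8_gbps = 32.0     # ~32 GB/s theoretical per PCIe 5.0 x8 link
links = 2                # two carrier cards, x8 each
efficiency = 0.5         # assume sustained reads reach ~half of theoretical
read_bandwidth = pcie5_x8_gbps * links * efficiency   # usable GB/s

active_params_b = 40.0   # hypothetical active params per token, in billions
bits_per_weight = 3.0    # hypothetical quant level
bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8 / 1e9  # GB per token

# Real MoE access patterns are far more random than a sequential read, so
# actual throughput will land well below this ceiling.
tokens_per_second = read_bandwidth / bytes_per_token
print(f"~{tokens_per_second:.1f} tok/s upper bound "
      f"({tokens_per_second * 60:.0f} tok/min)")
```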
qwen_next_gguf_when@reddit
flash is reachable
segmond@reddit (OP)
I can run flash, no problem.
Dr_Me_123@reddit
My local agent searched online and said it's still a long way from being implemented in llama.cpp. I don't know if that's true.
segmond@reddit (OP)
Implementation time is as fast as someone is motivated to get it done. The challenge is that once we have the GGUF, running it is not going to be easy at all.
GradatimRecovery@reddit
Turin 24 * 128
StardockEngineer@reddit
You’re stressed because you can’t run an LLM? What a great life you must have.
grim-432@reddit
If you aren’t spending more on your AI than you do on your car, are you even doing AI?
laterbreh@reddit
Full precision, just waiting on SM120 support to get baked into vLLM.
FoxiPanda@reddit
Yeah I’m not going to try for pro even with 1TB of vram. I’m going to run flash. Once all the quirks are fixed, it’ll be a great model.
sasquatch3277@reddit
Not stressed at all because I won't be able to.