Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Posted by Known_Ice9380@reddit | LocalLLaMA | 14 comments

Hey r/DeepSeek,

Who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally!

Surprisingly, we managed to hit around 255 prefill tokens/s with a very tight memory budget.

Here is a quick breakdown of how we achieved this "legacy donkey pulling a massive MoE chariot" feat via hardware-software co-optimization:

⚡️ The Technical Breakthroughs

Custom Turing CUDA Kernels: The 2080 Ti Tensor Cores are still capable, but PCIe Gen3 and VRAM bandwidth are huge bottlenecks. We wrote custom CUDA kernels tailored to the Turing architecture to accelerate W8A8 (INT8) matrix multiplication, which relieves much of the bandwidth choke.
Heterogeneous Inference: Optimized static memory splitting and dynamic offloading between the 4x 11/22GB VRAM and 1TB system RAM. 100% of the hardware capacity is utilized.
Computation-Communication Overlap: Implemented a pipelined execution strategy to hide the massive multi-GPU communication overhead caused by MoE routing.
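For readers unfamiliar with W8A8, here is a minimal pure-Python sketch of the arithmetic such a kernel performs: both weights and activations are quantized to INT8, the dot products accumulate in integers, and the result is rescaled to float. This is only an illustration and not the repository's kernel; the function names (`quantize_rows`, `w8a8_matmul`) and the symmetric per-row scaling scheme are assumptions made for the example.

```python
def quantize_rows(matrix):
    """Symmetric per-row INT8 quantization.

    Returns (quantized rows with values in [-127, 127], per-row scales).
    Hypothetical helper for illustration only.
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(activations, weights_t):
    """Approximate activations @ weights with INT8 operands.

    activations: rows of shape (tokens x hidden)
    weights_t:   transposed weights, one row per output channel (out x hidden)
    """
    qa, sa = quantize_rows(activations)  # one scale per token
    qw, sw = quantize_rows(weights_t)    # one scale per output channel
    out = []
    for a_row, a_scale in zip(qa, sa):
        out_row = []
        for w_row, w_scale in zip(qw, sw):
            acc = sum(a * w for a, w in zip(a_row, w_row))  # integer accumulation
            out_row.append(acc * a_scale * w_scale)         # dequantize to float
        out.append(out_row)
    return out


def float_matmul(activations, weights_t):
    """Reference result in full precision, used to measure quantization error."""
    return [[sum(a * w for a, w in zip(a_row, w_row)) for w_row in weights_t]
            for a_row in activations]


x = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 0.1]]
w_t = [[0.2, -0.4, 0.1, 0.3], [-1.0, 0.5, 0.25, 0.0], [0.05, 0.05, -0.6, 0.9]]

approx = w8a8_matmul(x, w_t)
exact = float_matmul(x, w_t)
max_err = max(abs(a - e) for ar, er in zip(approx, exact) for a, e in zip(ar, er))
print(f"max abs error vs float: {max_err:.4f}")
```

The bandwidth benefit comes from moving one byte per operand instead of two or four; a real GPU kernel would do the integer accumulation on the Tensor Cores rather than in a Python loop.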

🖥️ Budget Hardware Specs

CPU: Intel Xeon E5-2696 v4 (The classic budget king for multi-core)
GPU: 4x RTX 2080 Ti (11/22GB each)
RAM: 1TB DDR4 ECC
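A rough back-of-the-envelope check shows why the 1TB of system RAM matters. The sketch below assumes every weight is stored as one byte (INT8) and ignores KV cache, activations, and quantization scales, so treat the numbers as a lower bound; the helper names are hypothetical.

```python
TOTAL_PARAMS = 284e9    # total parameters, from the post
ACTIVE_PARAMS = 13e9    # active parameters per token, from the post
BYTES_PER_PARAM = 1     # assumption: all weights stored as INT8
GB = 1e9


def weight_footprint_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight storage in GB for a given parameter count."""
    return params * bytes_per_param / GB


def vram_total_gb(num_gpus, gb_per_gpu):
    """Combined VRAM across identical GPUs."""
    return num_gpus * gb_per_gpu


total_gb = weight_footprint_gb(TOTAL_PARAMS)
active_gb = weight_footprint_gb(ACTIVE_PARAMS)

for per_gpu in (11, 22):  # stock and modified 2080 Ti variants
    vram = vram_total_gb(4, per_gpu)
    spill = max(0.0, total_gb - vram)  # weights that must live in system RAM
    print(f"4x {per_gpu}GB: {vram:.0f} GB VRAM, "
          f"about {spill:.0f} GB of weights offloaded to system RAM")

print(f"all weights: {total_gb:.0f} GB, active per token: {active_gb:.0f} GB")
```

Under these assumptions the full model cannot fit in VRAM with either card variant, which is consistent with the post's reliance on splitting weights between VRAM and system RAM.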

The entire implementation, deployment script, and preliminary tech report are 100% open-sourced. I'd love to hear your thoughts, benchmarks, or feedback from fellow system/compiler hackers here!

🔗 GitHub Repository: https://github.com/lvyufeng/deepseek-v4-2080ti

(Note: I submitted the detailed report to arXiv a few days ago, but it’s currently caught in the manual moderation queue—likely because a rookie author throwing a 2080 Ti at DeepSeek-V4 triggered their review boundaries lol. Will update with the arXiv link once it's cleared!)

https://reddit.com/link/1ti5sxu/video/uu9ea2l0v62h1/player

https://reddit.com/link/1ti5sxu/video/if6alov1v62h1/player

Business_Average1303@reddit

Can’t post here yet because of low karma, but I’m thinking of changing computers and need advice on which coding models are recommended right now, so I can see how much VRAM I need.

Is Minimax M2.7 the new hype?

Known_Ice9380@reddit (OP)

If you want to buy a server, you can try the recommendation of ktransformers. Or just buy a MacBook/Mac Studio!

I need a personal computer, ideally portable, but I want one that can run a good coding model. Not sure if laptops have enough VRAM for that.

A MacBook Pro with 128GB is your best choice.

Zomboe1@reddit

RAM: 1TB DDR4 ECC

Kinda seems like you're burying the lede, surely that costs more than $2k?

No-Comfortable-2284@reddit

that decode speed

FullstackSensei@reddit

Saves on memory bandwidth. But it really depends on how efficient the kernels are at dequantizing to fp16/fp32. Still, my hunch is regular Q4 will be faster.

Yes, we wrote a kernel for Turing.

DinoAmino@reddit

Oh bot, you failed to address this subreddit directly. We are not DeepSeek. Not gonna look but I assume you failed in all the other subs where you shotgunned this post.

CummingDownFromSpace@reddit

Super cool.

I can't get over how cheap Flash is at the moment. With its current pricing of $0.28/million output tokens, at 250 output tokens/second, $2,500 would buy roughly 400 days straight of output tokens, not counting the electricity costs of running it locally.

That drops to ~100 days once the 75% discount runs out though.
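The arithmetic behind those two figures can be checked with a short script. It assumes the $0.28 price already includes the 75% discount, so the undiscounted price is taken to be four times higher; the function name is hypothetical, and electricity is left out as in the comment.

```python
SECONDS_PER_DAY = 86_400


def days_of_api_output(budget_usd, price_per_million_usd, tokens_per_second):
    """Days of continuous API output that a budget buys at a fixed rate."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY
    cost_per_day = tokens_per_day / 1e6 * price_per_million_usd
    return budget_usd / cost_per_day


BUDGET = 2500            # hardware budget quoted in the thread
DISCOUNTED_PRICE = 0.28  # USD per million output tokens, from the comment
FULL_PRICE = DISCOUNTED_PRICE / (1 - 0.75)  # assumption: 75% off means 4x later
RATE = 250               # output tokens per second, from the comment

discounted_days = days_of_api_output(BUDGET, DISCOUNTED_PRICE, RATE)
full_price_days = days_of_api_output(BUDGET, FULL_PRICE, RATE)

print(f"at discounted price: {discounted_days:.0f} days")
print(f"at full price:       {full_price_days:.0f} days")
```

Note that the post reports 255 tok/s for prefill, not decode, so the 250 output tokens/second used here is the commenter's assumption rather than a measured figure.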

Privacy is more expensive

slavik-dev@reddit

What's 11/22GB VRAM? Is it half a gigabyte?

There are two versions; the 22GB one is modified.