I have (even faster) DeepSeek V4 Pro at home
Posted by fairydreaming@reddit | LocalLLaMA | View on Reddit | 39 comments
A few days ago I posted about my DeepSeek V4 Pro at home - now it's time for an update. Yesterday I finally managed to run this model in ktransformers (sglang + kt-kernel). I followed the tutorial for DeepSeek V4 Flash and tweaked some options (NUMA, cores) for my hardware (Epyc 9374F + RTX PRO 6000 Max-Q). Then I ran llama-benchy with increasing context depth to check the performance. Results:
Depth 0:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|-------:|-------------:|------------:|----------------:|----------------:|----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 | 39.76 ± 0.00 | | 12878.44 ± 0.00 | 12877.59 ± 0.00 | 12878.44 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 | 7.54 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 2048:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|--------------:|-------------:|------------:|----------------:|----------------:|----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d2048 | 45.13 ± 0.00 | | 56726.85 ± 0.00 | 56725.93 ± 0.00 | 56726.85 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d2048 | 7.32 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 4096:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|--------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d4096 | 45.75 ± 0.00 | | 100729.28 ± 0.00 | 100728.46 ± 0.00 | 100729.28 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d4096 | 7.29 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 8192:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|--------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d8192 | 45.97 ± 0.00 | | 189354.94 ± 0.00 | 189354.03 ± 0.00 | 189354.94 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d8192 | 7.25 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 16384:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|---------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d16384 | 46.16 ± 0.00 | | 365997.22 ± 0.00 | 365996.26 ± 0.00 | 365997.22 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d16384 | 7.17 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 32768:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|---------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d32768 | 46.18 ± 0.00 | | 720687.13 ± 0.00 | 720685.67 ± 0.00 | 720687.13 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d32768 | 7.07 ± 0.00 | 8.00 ± 0.00 | | | |
During the 64k test (which took over 20 min) llama-benchy did not report a result even though sglang finished processing the request, so I aborted the test. I don't know, maybe some kind of timeout is happening.
This is all running the original model files, no need for conversion.
- GPU VRAM usage: 90815MiB / 97887MiB
- GPU power usage: ~100W during PP, ~150W during TG
- RAM usage: 907.5GB / 1152GB
- CPU+MB power usage: ~400W
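For anyone unfamiliar with the metrics in the tables above, this is roughly the kind of depth sweep being reported: time to first token plus prompt-processing and generation throughput at each context depth. Below is a minimal sketch of such a probe against an OpenAI-compatible endpoint (which sglang exposes). It is not llama-benchy itself; the URL, model name, and crude filler-based prompt padding are assumptions to adapt for your own setup.

```python
# Rough sketch of a depth-sweep probe against an OpenAI-compatible server
# (sglang exposes /v1/chat/completions). Not llama-benchy itself - just an
# illustration of the reported metrics: prompt-processing rate ~ prompt tokens
# divided by time-to-first-token, generation rate ~ streamed chunks per second.
import time
import requests

BASE_URL = "http://localhost:30000/v1/chat/completions"  # assumed sglang port
MODEL = "deepseek-ai/DeepSeek-V4-Pro"

def probe(depth_tokens: int, gen_tokens: int = 32) -> None:
    # Pad the prompt to roughly `depth_tokens` tokens (very crude heuristic:
    # "lorem ipsum " is on the order of 3-4 tokens).
    filler = "lorem ipsum " * (depth_tokens // 3)
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": filler + "\nSummarize the above."}],
        "max_tokens": gen_tokens,
        "stream": True,
    }
    start = time.time()
    first_chunk_at = None
    chunks = 0
    with requests.post(BASE_URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:") or b"[DONE]" in line:
                continue
            if first_chunk_at is None:
                first_chunk_at = time.time()
            chunks += 1
    if first_chunk_at is None:
        print(f"depth~{depth_tokens}: no streamed chunks (request failed?)")
        return
    ttft = first_chunk_at - start
    decode = max(time.time() - first_chunk_at, 1e-6)
    print(f"depth~{depth_tokens:6d}  ttft={ttft:8.1f}s  "
          f"pp~{depth_tokens / max(ttft, 1e-6):5.1f} t/s  tg~{chunks / decode:4.1f} t/s")

for depth in (0, 2048, 4096, 8192, 16384, 32768):
    probe(depth)
```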
Potential-Leg-639@reddit
At this point just use the DeepSeek API, it's a fraction of the cost
Ferilox@reddit
Deepseek API stores and trains on your data
sn2006gy@reddit
I always wonder why people expect models to be trained on everyone else's data but not their own.
Ferilox@reddit
What do you mean?
sn2006gy@reddit
I mean, we're all here using the models and we want them to be better, right? How do they get better if they don't train them? It's a weird paradox that so many people run around saying "they train on your data" as if that's purely bad, yet they have to train on something to get better, and we're all like "it's ok if it's someone else's data, just not mine".
SundererKing@reddit
Since you said you wonder:
Maybe I'm working on a product that is proprietary and I don't want someone stealing it. And even if some corporation like Anthropic doesn't have a direct interest in taking my thing (with their copyright theft machine), if it's in their training data, it'll hand the idea off as a suggestion to other people and suddenly 10 other people have the same code as me.
Or maybe I don't want my personal data sold to marketers or whoever. And based on your previous response, I'm sure you'll say something insightful and original like "it already is". Sure buddy. And if someone has one single STD, there is no reason for them to ever wear a condom again, right? If you fall off a ladder and break a leg, there's no point in limiting risk of future damage, because you already got hurt.
Ferilox@reddit
There are different ways to go about it. With an inference API I want to be a customer and a customer only, not their product as well. Especially in an enterprise setting. Their subsidized cost is not worth what I want to feed into it.
sn2006gy@reddit
I'm only talking in the LocalLLaMA context
Potential-Leg-639@reddit
No problem with that for most tasks. For sensitive stuff I use local AI, but of course not for everything; that makes no sense nowadays.
fairydreaming@reddit (OP)
Yes, I probably won't use it all the time. But it's good to know that it's there and I can use it any time I want.
WeUsedToBeACountry@reddit
"i know you got it working but at this point just using the very thing you got it working to avoid is a fraction of the cost"
xienze@reddit
40ish t/s prefill, 7 t/s generation. Unusable for anything but simple chat.
"I'm running DeepSeek v4 Pro at home, you jelly?"
fairydreaming@reddit (OP)
I'm sorry if I hurt your feelings *casting a healing spell*
Just wanted to show that it's possible.
SundererKing@reddit
I appreciate it. Testing out possibilities and reporting on them lets other people see and know what to expect, instead of having to go through it all themselves if it doesn't fit their needs. It's possible, and also useful. It may help someone as a starting point to build off of, maybe using a different model, or changing some element of it, or if they have some very targeted use case where it's just the right set of circumstances for it to be worthwhile.
toptier4093@reddit
Gonna take 6 months for that healing spell to finish casting.
fairydreaming@reddit (OP)
Top tier!
blakeman8192@reddit
Hold up. OP is running an absolutely massive state-of-the-art model, developed at the bleeding edge of human ingenuity, released only weeks ago. This thing probably has local access to more general knowledge than any one of us.
Successfully, on his fucking desk.
And here we sit giving him shit for not running it fast enough.
2026 is so surreal lmfao
fairydreaming@reddit (OP)
Same benchmark results for DeepSeek V4 Flash (30 experts in VRAM, VRAM usage: 90471MiB / 97887MiB, RAM usage 166.1GB):
Depths tested: 0, 2048, 4096, 8192, 16384, 32768, 65536, 131072 (per-depth tables not reproduced here).
Lorian0x7@reddit
That's cool, but there's not much you can really do with it in practice. It's just too slow, especially the PP. It's 7 minutes just for it to start replying after only 20k of context.
coder543@reddit
Well, presumably there is prompt caching... the time to first token doesn't grow as the conversation gets longer; it's just whatever new context was provided that needs to be processed. 20k tokens in a single turn would be a very unusual turn.
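(For what it's worth, both points check out against the OP's tables: at ~46 t/s prefill, a cold 20k-token prompt needs roughly 20,000 / 46 ≈ 435 s, a bit over 7 minutes, while with prefix caching a fresh turn of, say, 500 new tokens on top of that cached context would only need about 500 / 46 ≈ 11 s of prompt processing. The 500-token turn size is just illustrative.)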
Lorian0x7@reddit
Not unusual for agentic use cases.
coder543@reddit
I am talking about agentic use cases, yes. Models usually read parts of files so they don’t blow the whole context on a single file. Reading 20k tokens at once is not normally wise.
Lorian0x7@reddit
It depends what you do. While working with LLM wiki/memory systems, 20k+ context ingestion is the norm. Plus new data if you're passing new documents to the AI to ingest, plus new skill calls.
fairydreaming@reddit (OP)
Horrible, you could live your whole life like 7 times during that time!
Lorian0x7@reddit
Sorry if I slapped you with my dose of realism, but here's another slap… I did the math and yesterday I consumed 5M tokens on Deepseek 4 Pro over the course of a day (8 hours) across several message exchanges. To do the same, it would take you about 110 hours (realistically much more), that's like 2 weeks if you work 8 hours a day, just to have the same level of interaction. I don’t know if you’re immortal or what… but my time has value
fairydreaming@reddit (OP)
I have no words, you won.
coder543@reddit
Does ktransformers let you adjust the batch/ubatch size?
fairydreaming@reddit (OP)
Yes, it has `--chunked-prefill-size`, which is set to 2048. Tried increasing it to 4096 but the PP rate only went up to 47-48 t/s, so it had little effect.
AndreVallestero@reddit
This is already a decent starting point. If you can get MTP or dflash working with DSv4 flash, you might even hit 20+ tps.
ai-infos@reddit
Thanks for sharing your benchmark results! It's always interesting to see how different kinds of hardware perform with SOTA giant LLMs.
techlatest_net@reddit
Wild setup. Stable ~7 t/s decode out to 32k depth is impressive, but the real story is the memory footprint: 900GB RAM and 90GB VRAM is basically a mini data center.
a_beautiful_rhind@reddit
This is the full model, no quants?
fairydreaming@reddit (OP)
Full model but it's already quantized by model creators (experts are FP4).
a_beautiful_rhind@reddit
ktransformers is no ik_llama I guess. Even with the numa support.
2Norn@reddit
I was thinking about this as well, but this is a bit on the extreme side: you basically have 90% of the model offloaded to RAM. Although I don't know if that impacts the speed of an MoE model, or by how much if so.
sautdepage@reddit
It's actually optimal this way.
Having 50% or 90% of the MoE in RAM doesn't actually make a big difference. It's going to be "RAM slow" anyway.
So even if you spend a ton of money on GPUs to reach 50 or 80% in VRAM, it's just not worth the cost. You need to reach 100% VRAM offload to make it fly.
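A back-of-the-envelope roofline makes this concrete. The sketch below assumes the OP's 12-channel DDR5-4800 RAM (confirmed further down in the thread), an illustrative ~1.8 TB/s of GPU memory bandwidth, and a made-up 20 GB of active per-token weights; the real figures for DeepSeek V4 Pro will differ, so treat the outputs as shape, not truth.

```python
# Decode on a MoE model is roughly memory-bandwidth bound: every generated
# token has to stream all of the active weights from wherever they live.
# The constants below are assumptions for illustration only.
RAM_BW  = 12 * 4.8e9 * 8   # 12 DDR5-4800 channels x 8 bytes/transfer ~= 460 GB/s peak
VRAM_BW = 1.8e12           # assumed GPU memory bandwidth (~1.8 TB/s class card)
ACTIVE  = 20e9             # assumed bytes of active per-token weights at FP4

def decode_upper_bound(frac_in_ram: float) -> float:
    """Idealized tokens/s when `frac_in_ram` of the active weights sit in system RAM."""
    time_per_token = (1 - frac_in_ram) * ACTIVE / VRAM_BW + frac_in_ram * ACTIVE / RAM_BW
    return 1.0 / time_per_token

for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:.0%} of active weights in RAM -> ~{decode_upper_bound(frac):.0f} t/s ceiling")
```

With these assumed numbers the decode ceiling only moves from about 23 t/s (everything in RAM) to about 37 t/s with half of the active weights in VRAM, but jumps to about 90 t/s once nothing has to be streamed from system RAM, which is the point being made above.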
2Norn@reddit
I got the idea after running 35B-A3B on a 16GB VRAM system at about 70 tk/s, and I didn't even try stuff like dflash or MTP when I did that.
So naturally I was wondering if I could scale it up to 300ish B models and still have decent tk/s.
__JockY__@reddit
Are you using all 12 memory channels on that EPYC?
fairydreaming@reddit (OP)
Yes, 12 x 96GB (4800 MT/s).
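(For reference: 12 channels × 4800 MT/s × 8 bytes per channel works out to about 460 GB/s of theoretical bandwidth feeding the CPU-side experts.)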