I have (even faster) DeepSeek V4 Pro at home
Posted by fairydreaming@reddit | LocalLLaMA | View on Reddit | 39 comments
A few days ago I posted about my DeepSeek V4 Pro at home - now it's time for an update. Yesterday I finally managed to run this model in ktransformers (sglang + kt-kernel). I followed the tutorial for DeepSeek V4 Flash and tweaked some options (NUMA, cores) for my hardware (Epyc 9374F + RTX PRO 6000 Max-Q). Then I ran llama-benchy with increasing context depth to check the performance. Results:
Depth 0:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|-------:|-------------:|------------:|----------------:|----------------:|----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 | 39.76 ± 0.00 | | 12878.44 ± 0.00 | 12877.59 ± 0.00 | 12878.44 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 | 7.54 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 2048:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|--------------:|-------------:|------------:|----------------:|----------------:|----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d2048 | 45.13 ± 0.00 | | 56726.85 ± 0.00 | 56725.93 ± 0.00 | 56726.85 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d2048 | 7.32 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 4096:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|--------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d4096 | 45.75 ± 0.00 | | 100729.28 ± 0.00 | 100728.46 ± 0.00 | 100729.28 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d4096 | 7.29 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 8192:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|--------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d8192 | 45.97 ± 0.00 | | 189354.94 ± 0.00 | 189354.03 ± 0.00 | 189354.94 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d8192 | 7.25 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 16384:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|---------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d16384 | 46.16 ± 0.00 | | 365997.22 ± 0.00 | 365996.26 ± 0.00 | 365997.22 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d16384 | 7.17 ± 0.00 | 8.00 ± 0.00 | | | |
Depth 32768:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|---------------:|-------------:|------------:|-----------------:|-----------------:|-----------------:|
| deepseek-ai/DeepSeek-V4-Pro | pp512 @ d32768 | 46.18 ± 0.00 | | 720687.13 ± 0.00 | 720685.67 ± 0.00 | 720687.13 ± 0.00 |
| deepseek-ai/DeepSeek-V4-Pro | tg32 @ d32768 | 7.07 ± 0.00 | 8.00 ± 0.00 | | | |
During the 64k test (which took over 20 min) llama-benchy did not report a result even though sglang finished processing the request, so I aborted the test. I don't know, maybe some kind of timeout is happening.
This is all running the original model files, no need for conversion.
- GPU VRAM usage: 90815MiB / 97887MiB
- GPU power usage: ~100W during PP, ~150W during TG
- RAM usage: 907.5GB / 1152GB
- CPU+MB power usage: ~400W
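For anyone unfamiliar with the metrics in the tables above, this is roughly the kind of depth sweep being reported: time to first token plus prompt-processing and generation throughput at each context depth. Below is a minimal sketch of such a probe against an OpenAI-compatible endpoint (which sglang exposes). It is not llama-benchy itself; the URL, model name, and crude filler-based prompt padding are assumptions to adapt for your own setup.

```python
# Rough sketch of a depth-sweep probe against an OpenAI-compatible server
# (sglang exposes /v1/chat/completions). Not llama-benchy itself - just an
# illustration of the reported metrics: prompt-processing rate ~ prompt tokens
# divided by time-to-first-token, generation rate ~ streamed chunks per second.
import time
import requests

BASE_URL = "http://localhost:30000/v1/chat/completions"  # assumed sglang port
MODEL = "deepseek-ai/DeepSeek-V4-Pro"

def probe(depth_tokens: int, gen_tokens: int = 32) -> None:
    # Pad the prompt to roughly `depth_tokens` tokens (very crude heuristic:
    # "lorem ipsum " is on the order of 3-4 tokens).
    filler = "lorem ipsum " * (depth_tokens // 3)
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": filler + "\nSummarize the above."}],
        "max_tokens": gen_tokens,
        "stream": True,
    }
    start = time.time()
    first_chunk_at = None
    chunks = 0
    with requests.post(BASE_URL, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:") or b"[DONE]" in line:
                continue
            if first_chunk_at is None:
                first_chunk_at = time.time()
            chunks += 1
    if first_chunk_at is None:
        print(f"depth~{depth_tokens}: no streamed chunks (request failed?)")
        return
    ttft = first_chunk_at - start
    decode = max(time.time() - first_chunk_at, 1e-6)
    print(f"depth~{depth_tokens:6d}  ttft={ttft:8.1f}s  "
          f"pp~{depth_tokens / max(ttft, 1e-6):5.1f} t/s  tg~{chunks / decode:4.1f} t/s")

for depth in (0, 2048, 4096, 8192, 16384, 32768):
    probe(depth)
```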
Potential-Leg-639@reddit
At this point just use the DeepSeek API, it's a fraction of the cost
Ferilox@reddit
Deepseek API stores and trains on your data
sn2006gy@reddit
I always wonder why people expect models to be trained on everyone else's data but not their own.
Ferilox@reddit
What do you mean?
sn2006gy@reddit
I mean, we're all here using the models and we want them to be better, right? How do they get better if they don't train them? It's a weird paradox that so many people run around saying "they train on your data" as if that's purely bad, yet they have to train on something to get better, and we're all like "it's ok if it's someone else's data, just not mine".
SundererKing@reddit
Since you said you wonder:
Maybe I'm working on a product that is proprietary and I don't want someone stealing it. And even if some corporation like Anthropic doesn't have a direct interest in taking my thing (with their copyright theft machine), if it's in their training data, it'll hand the idea off as a suggestion to other people and suddenly 10 other people have the same code as me.
Or maybe I don't want my personal data sold to marketers or whoever. And based on your previous response, I'm sure you'll say something insightful and original like "it already is". Sure buddy. And if someone has one single STD, there is no reason for them to ever wear a condom again, right? If you fall off a ladder and break a leg, there's no point in limiting risk of future damage, because you already got hurt.
Ferilox@reddit
There are different ways to go about it. With an inference API I want to be a customer and a customer only, not their product as well. Especially in an enterprise setting. Their subsidized cost is not worth what I want to feed into it.
sn2006gy@reddit
I'm only talking in the LocalLLaMA context
Potential-Leg-639@reddit
No problem with that for most tasks. For sensitive stuff I use local AI, but of course not for everything; that makes no sense nowadays.
fairydreaming@reddit (OP)
Yes, I probably won't use it all the time. But it's good to know that it's there and I can use it any time I want.
WeUsedToBeACountry@reddit
"i know you got it working but at this point just using the very thing you got it working to avoid is a fraction of the cost"
xienze@reddit
40ish t/s prefill, 7 t/s generation. Unusable for anything but simple chat.
"I'm running DeepSeek v4 Pro at home, you jelly?"
fairydreaming@reddit (OP)
I'm sorry if I hurt your feelings *casting a healing spell*
Just wanted to show that it's possible.
SundererKing@reddit
I appreciate it. Testing out possibilities and reporting on them lets other people see and know what to expect, instead of having to go through it all themselves if it doesn't fit their needs. It's possible, and also useful. It may help someone as a starting point to build off of, maybe using a different model, or changing some element of it, or if they have some very targeted use case where it's just the right set of circumstances for it to be worthwhile.
toptier4093@reddit
Gonna take 6 months for that healing spell to finish casting.
fairydreaming@reddit (OP)
Top tier!
blakeman8192@reddit
Hold up. OP is running an absolutely massive state-of-the-art model, developed at the bleeding edge of human ingenuity, released only weeks ago. This thing probably has local access to more general knowledge than any one of us.
Successfully, on his fucking desk.
And here we sit giving him shit for not running it fast enough.
2026 is so surreal lmfao
fairydreaming@reddit (OP)
Same benchmark results for DeepSeek V4 Flash (30 experts in VRAM, VRAM usage: 90471MiB / 97887MiB, RAM usage 166.1GB):
Depths tested: 0, 2048, 4096, 8192, 16384, 32768, 65536, 131072 (per-depth tables not reproduced here).
Lorian0x7@reddit
That's cool, but there's not much you can really do with it in practice. It's just too slow, especially the PP. It's 7 minutes just for it to start replying after only 20k of context.
coder543@reddit
Well, presumably there is prompt caching... the time to first token doesn't grow as the conversation gets longer; it's just whatever new context was provided that needs to be processed. 20k tokens in a single turn would be a very unusual turn.
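(For what it's worth, both points check out against the OP's tables: at ~46 t/s prefill, a cold 20k-token prompt needs roughly 20,000 / 46 ≈ 435 s, a bit over 7 minutes, while with prefix caching a fresh turn of, say, 500 new tokens on top of that cached context would only need about 500 / 46 ≈ 11 s of prompt processing. The 500-token turn size is just illustrative.)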
Lorian0x7@reddit
Not unusual for agentic use cases.
coder543@reddit
I am talking about agentic use cases, yes. Models usually read parts of files so they don’t blow the whole context on a single file. Reading 20k tokens at once is not normally wise.
Lorian0x7@reddit
It depends what you do. While working with LLM wiki/memory systems, 20k+ context ingestion is the norm. Plus new data if you're passing new documents to the AI to ingest, plus new skill calls.
fairydreaming@reddit (OP)
Horrible, you could live your whole life like 7 times during that time!
Lorian0x7@reddit
Sorry if I slapped you with my dose of realism, but here's another slap… I did the math and yesterday I consumed 5M tokens on Deepseek 4 Pro over the course of a day (8 hours) across several message exchanges. To do the same, it would take you about 110 hours (realistically much more), that's like 2 weeks if you work 8 hours a day, just to have the same level of interaction. I don’t know if you’re immortal or what… but my time has value
fairydreaming@reddit (OP)
I have no words, you won.
coder543@reddit
Does ktransformers let you adjust the batch/ubatch size?
fairydreaming@reddit (OP)
Yes, it has `--chunked-prefill-size`, which is set to 2048. Tried increasing it to 4096 but the PP rate only went up to 47-48 t/s, so it had little effect.
AndreVallestero@reddit
This is already a decent starting point. If you can get MTP or dflash working with DSv4 flash, you might even hit 20+ tps.
ai-infos@reddit
Thanks for sharing your benchmark results! It's always interesting to see how different kinds of hardware perform with SOTA giant LLMs.
techlatest_net@reddit
Wild setup. Stable ~7 t/s decode out to 32k depth is impressive, but the real story is the memory footprint: 900GB RAM and 90GB VRAM is basically a mini data center.
a_beautiful_rhind@reddit
This is the full model, no quants?
fairydreaming@reddit (OP)
Full model but it's already quantized by model creators (experts are FP4).
a_beautiful_rhind@reddit
ktransformers is no ik_llama I guess. Even with the numa support.
2Norn@reddit
I was thinking about this as well, but this is a bit on the extreme side: you basically have 90% of the model offloaded to RAM. Although I don't know if that impacts the speed of an MoE model, or by how much if so.
sautdepage@reddit
It's actually optimal this way.
Having 50% or 90% of the MoE in RAM doesn't actually make a big difference. It's going to be "RAM slow" anyway.
So even if you spend a ton of money on GPUs to reach 50 or 80% in VRAM, it's just not worth the cost. You need to reach 100% VRAM offload to make it fly.
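A back-of-the-envelope roofline makes this concrete. The sketch below assumes the OP's 12-channel DDR5-4800 RAM (confirmed further down in the thread), an illustrative ~1.8 TB/s of GPU memory bandwidth, and a made-up 20 GB of active per-token weights; the real figures for DeepSeek V4 Pro will differ, so treat the outputs as shape, not truth.

```python
# Decode on a MoE model is roughly memory-bandwidth bound: every generated
# token has to stream all of the active weights from wherever they live.
# The constants below are assumptions for illustration only.
RAM_BW  = 12 * 4.8e9 * 8   # 12 DDR5-4800 channels x 8 bytes/transfer ~= 460 GB/s peak
VRAM_BW = 1.8e12           # assumed GPU memory bandwidth (~1.8 TB/s class card)
ACTIVE  = 20e9             # assumed bytes of active per-token weights at FP4

def decode_upper_bound(frac_in_ram: float) -> float:
    """Idealized tokens/s when `frac_in_ram` of the active weights sit in system RAM."""
    time_per_token = (1 - frac_in_ram) * ACTIVE / VRAM_BW + frac_in_ram * ACTIVE / RAM_BW
    return 1.0 / time_per_token

for frac in (1.0, 0.9, 0.5, 0.0):
    print(f"{frac:.0%} of active weights in RAM -> ~{decode_upper_bound(frac):.0f} t/s ceiling")
```

With these assumed numbers the decode ceiling only moves from about 23 t/s (everything in RAM) to about 37 t/s with half of the active weights in VRAM, but jumps to about 90 t/s once nothing has to be streamed from system RAM, which is the point being made above.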
2Norn@reddit
I got the idea after running 35B-A3B on a 16GB VRAM system at about 70 tk/s, and I didn't even try stuff like dflash or MTP when I did that.
So naturally I was wondering if I could scale it up to 300ish B models and still have decent tk/s.
__JockY__@reddit
Are you using all 12 memory channels on that EPYC?
fairydreaming@reddit (OP)
Yes, 12 x 96GB (4800 MT/s).
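(For reference: 12 channels × 4800 MT/s × 8 bytes per channel works out to about 460 GB/s of theoretical bandwidth feeding the CPU-side experts.)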