PP speed on dual RTX 6000 12c EPYC setup
Posted by iVoider@reddit | LocalLLaMA | 18 comments
I want to run big models like GLM 5.1 or Kimi K2.6.
I could buy a Mac Studio M3 Ultra with 512GB RAM, but prompt processing (PP) speed would of course be bad.
Then I researched benchmarks of hybrid setups with a single GPU (RTX 6000 or 5090) and an EPYC 9xxx system with 12-channel DDR5-6400 RAM planks.
On such setups PP is also abysmal past 96k context, only a little higher than the M3 Ultra.
Would a second RTX 6000 boost these numbers by parallelising the tensors of the dense part of the model, and by how much?
Maharrem@reddit
Dual RTX 6000 Turing (24GB each) will bottleneck prompt processing because llama.cpp/LM Studio split layers sequentially across GPUs, not in parallel. You'll get roughly the speed of the bigger half of the layer split plus PCIe overhead, nowhere near 2x. I get 700-900 t/s on a single 3090 for a 7B Q4_K_M; expect 300-600 t/s on this setup for similar-sized models, maybe worse if you go Q8 or large context. Use -ngl 99 to offload everything. For quick VRAM checks, canitrun.dev does the job.
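For what it's worth, llama.cpp also lets you choose how work is divided across GPUs; whether it helps depends heavily on your PCIe links. A minimal sketch, with the model path and context size as placeholders:

    # default: whole layers are split sequentially across the GPUs
    ./llama-server -m model-q4_k_m.gguf -ngl 99 -c 16384 -sm layer -ts 1,1

    # row mode splits individual tensors across GPUs instead of whole layers;
    # it can help or hurt prompt processing, so benchmark both on your box
    ./llama-server -m model-q4_k_m.gguf -ngl 99 -c 16384 -sm row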
iVoider@reddit (OP)
I meant the 96GB Blackwell card, and for much bigger models.
CalligrapherFar7833@reddit
Then you should've written that in your post. RTX 6000 is not RTX 6000 Pro.
suicidaleggroll@reddit
With recent updates in ik_llama, prompt processing is very fast on my dual Pro 6000 EPYC system. In the last two weeks, PP speeds on Kimi K2.6 have gone from 240 to 1800 t/s. Generation is still the same at about 24 t/s.
I'm not sure what the numbers are for a single Pro 6000, but a recent post I read said they were seeing around 700-800.
iVoider@reddit (OP)
Thanks, do you happen to remember at what context size you got 1800 PP?
Farmadupe@reddit
FYI Kimi K2.6 is quite big. You might need more than two RTX 6000 cards :)
ComplexType568@reddit
I think they plan to do CPU+GPU inference, so maybe a lot more RAM sticks.
Farmadupe@reddit
The maths is insane here though. Even at FP8, Kimi K2.6 is over a terabyte. Yes, 192GB of VRAM is better than 96GB, but it's not going to move the needle in any way at all.
And there's just no way it's worth spending $20-50k on what's basically a CPU inferencing setup for the other 800GB of the model. Yes, it would probably "work" with llama.cpp, but nobody drops that much money to use llama.cpp for inference at less than one token per second.
I know this is r/LocalLLaMA, but $50k gets you quite a lot of tokens on OpenRouter.
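Back-of-the-envelope, assuming Kimi K2.6 is in the same roughly 1T-total-parameter class as earlier Kimi K2 releases (my assumption):

    1e12 params x 1.0 byte/param (FP8) ~ 1.0 TB of weights
    1e12 params x 0.5 byte/param (FP4) ~ 500-600 GB once you add overhead

and that's before the KV cache, which also has to live somewhere.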
iVoider@reddit (OP)
FP8 is too much, I would be pretty happy with 4-bit quants. And an API is unfortunately unacceptable for my tasks. Also, locally I can build such a setup for ~$25,000. Btw, the Mac Studio is less than half the price, so it's a difficult choice.
MelodicRecognition7@reddit
Why do you want to inflate the 600 GB of Kimi, originally released in FP4, to a terabyte in FP8?
a_beautiful_rhind@reddit
They're not aware it's native FP4.
Kyuiki@reddit
One thing I always like to remind people is that these big expensive cards can fail. If they do outside of warranty, that's $10,000 a failure! That's a lot of money to plan for!
ComplexType568@reddit
I think 50K going into hardware running a model that:
- you know won't have data/telemetry collection
- won't randomly be downgraded (cough cough CLAUDE)
- won't randomly disappear out of nowhere
is a good deal for some. YMMV though!
MelodicRecognition7@reddit
It weighs a bit less than 600 GB.
a_beautiful_rhind@reddit
When your context fits on the GPUs and you use the CPU for textgen, prompt processing isn't so bad. You have to use ik_llama.cpp though; regular llama.cpp sucks for this.
A second card will obviously help you, but it only goes so far. There's no way to reach fully offloaded PP/TG speeds without actually offloading the whole model.
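A rough sketch of that kind of launch with ik_llama.cpp; the model filename is a placeholder, and the tensor-name regex is the commonly used pattern for MoE expert tensors, so verify it against your GGUF's actual tensor names first:

    # attention, KV cache and shared weights go to the GPUs (-ngl 99),
    # while -ot pushes the MoE expert tensors to system RAM for the CPU
    ./llama-server -m kimi-k2.6-q4.gguf \
        -ngl 99 -c 98304 -fa -fmoe \
        -ot "ffn_.*_exps=CPU"

-fmoe is ik_llama.cpp's fused-MoE option; -ot/--override-tensor also exists in mainline llama.cpp these days, but the exact expert-tensor names vary by model.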
MelodicRecognition7@reddit
Use search; I've read somewhere on this sub that 2x 6000 gives just about 25 tokens per second TG, dunno about PP though.
Such_Advantage_6949@reddit
Maybe 50 tok/s. CPU offload won't work well for TG and prompt processing.
CalligrapherFar7833@reddit
Planks :D