96GB VRAM plus 256GB/512GB Fast RAM
Posted by SteveRD1@reddit | LocalLLaMA | 25 comments
I'm thinking of combining the 96GB (1800GB/s) of VRAM from the RTX PRO 6000 (which I already have) with 256GB or 512GB (410GB/s) of RAM in the upcoming Threadripper.
Do you all think this could run any largish versions of DeepSeek with useful throughput?
a_beautiful_rhind@reddit
Since I get about 100 t/s prompt processing and ~10 t/s generation on 4x3090s and a Skylake Xeon, I think you'll do OK. With DDR5 and 512GB of RAM you can run a Q4_K_M and hit around that ballpark.
No_Afternoon_4260@reddit
What backend are you using?
a_beautiful_rhind@reddit
ik_llama
No_Afternoon_4260@reddit
How many layers are on CPU?
a_beautiful_rhind@reddit
Experts from layers 32-60. About 88GB on CPU out of 168GB for the IQ1 quant, which is what I ran last.
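(For reference, that kind of split is normally expressed with the tensor-override flag in llama.cpp/ik_llama.cpp. A rough, illustrative sketch only, with a placeholder model path and the layer range from the comment above:)

  ./llama-server -m DeepSeek-R1-IQ1_S.gguf -ngl 99 -fa \
    -ot "blk\.(3[2-9]|[45][0-9]|60)\.ffn_.*_exps=CPU"   # experts for layers 32-60 stay in system RAM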
No_Afternoon_4260@reddit
I see, thanks. Are you happy with the results of that Q1?
a_beautiful_rhind@reddit
Yeah, it's surprisingly good and similar to the API. I have V3 in IQ2_XXS as well (slightly lower PP). Wish they'd support MiniMax; I don't see that one hosted for free, but there's plenty of R1/V3.
No_Afternoon_4260@reddit
That's pretty cool
No_Afternoon_4260@reddit
That's impressive for 3/4 of the model on a Skylake.
Thireus@reddit
Go with 512GB of RAM. You should be able to run a Q4 quant. See https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4
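(A rough back-of-envelope on why 512GB is the safer choice, assuming roughly 4.8 bits per weight effective for a Q4_K-class quant of the 671B-parameter model:)

  python3 -c 'print(671e9 * 4.8 / 8 / 1e9)'   # ~403GB of weights alone, before KV cache

That doesn't fit in 96GB VRAM plus 256GB RAM, but fits comfortably with 512GB.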
Educational_Rent1059@reddit
This. And waiting for the new Threadripper series is just wasting time and the value of the GPU. By the time the new series releases, your GPU will be old gen.
SteveRD1@reddit (OP)
The new Threadripper is expected to hit system integrators (SIs) in July.
Educational_Rent1059@reddit
ah nice
LA_rent_Aficionado@reddit
Good call. When I bought 384GB I thought that was a lot; I have come to learn otherwise.
SteveRD1@reddit (OP)
Thanks, will do!
LA_rent_Aficionado@reddit
Which quant? I can test it for you (although I'm running 3x 5090s, so the results may be skewed).
bullerwins@reddit
What use cases do you have for 3x 5090s? I ask because most engines will use 1, 2, 4, or 8 GPUs for tensor parallelism. Are you only using GGUF or EXL for inference?
LA_rent_Aficionado@reddit
FYI:
prompt eval time = 302.82 ms / 3 tokens ( 100.94 ms per token, 9.91 tokens per second)
eval time = 76712.53 ms / 797 tokens ( 96.25 ms per token, 10.39 tokens per second)
total time = 77015.35 ms / 800 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
with DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL
20 layers offloaded
panchovix@reddit
Make sure to use -ot and not just -ngl and -ts. I get about 25-35x your PP speed by doing that on that specific model (I get less PP at Q4).
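(A minimal sketch of the difference, using standard llama.cpp flags; the model path is a placeholder:)

  # -ngl/-ts alone splits whole layers between GPU and CPU:
  ./llama-server -m DeepSeek-R1-0528-UD-Q2_K_XL.gguf -ngl 20
  # -ot offloads every layer, then pushes only the MoE expert tensors back to CPU:
  ./llama-server -m DeepSeek-R1-0528-UD-Q2_K_XL.gguf -ngl 99 -ot "exps=CPU"

Keeping the attention and shared tensors on the GPU is what recovers prompt-processing speed.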
LA_rent_Aficionado@reddit
Which tensors do you generally offload with -ot, any recommendations? Thank you.
a_beautiful_rhind@reddit
Mainline llama.cpp is robbing your prompt processing for sure.
bullerwins@reddit
Maybe you can try ubergarm's quants with ik_llama.cpp. Instead of offloading 20 layers, offload everything and manually assign the layers to the GPUs:
https://github.com/ikawrakow/ik_llama.cpp/discussions/532
https://github.com/ikawrakow/ik_llama.cpp/discussions/477
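(A rough sketch of the "offload everything, pin tensors per GPU" idea with -ot overrides, assuming three GPUs visible as CUDA0-CUDA2; the model file, layer ranges, and context size are placeholders that would need tuning to actual VRAM. Overrides are applied first-match-wins, so the per-GPU patterns come before the catch-all exps=CPU:)

  ./llama-server -m DeepSeek-R1-0528-IQ4_KS.gguf -ngl 99 -fa -c 32768 \
    -ot "blk\.([3-9]|1[0-4])\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(1[5-9]|2[0-6])\.ffn_.*_exps=CUDA1" \
    -ot "blk\.(2[7-9]|3[0-8])\.ffn_.*_exps=CUDA2" \
    -ot "exps=CPU"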
LA_rent_Aficionado@reddit
Mostly GGUF, to max out model quants, context, and size.
I have a 4th in the works; I just need to swap out cases and haven't gotten around to it yet. I'm at capacity.
createthiscom@reddit
I'm hitting 20 tok/s with 5600MHz RDIMMs on 24 channels at 716GB/s throughput with a Blackwell 6000 Pro and DeepSeek V3 0324 671B:Q4_K_XL. If my throughput drops to 600GB/s I get 18 tok/s. A simple ratio calculation predicts you'll see about 10 tok/s at 380GB/s.
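(The ratio spelled out, assuming decode speed scales roughly linearly with memory bandwidth:)

  python3 -c 'print(20 * 380 / 716)'   # ~10.6 tok/s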
PraxisOG@reddit
It would work, but if your primary goal is inference then you might want to consider server hardware. Threadripper has 4 memory channels, but the newer EPYC CPUs support 12. A used EPYC 9334 is about $1k USD, so it's not too pricey either. If you're doing anything that needs single-core performance it's not great, with single-core boost around 3.6GHz.
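(For scale, the theoretical peak for 12 channels of DDR5-4800 on that EPYC generation, at 38.4GB/s per channel:)

  python3 -c 'print(12 * 38.4)'   # 460.8 GB/s theoretical peak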