96GB VRAM plus 256GB/512GB Fast RAM
Posted by SteveRD1@reddit | LocalLLaMA | 25 comments
I'm thinking of combining the 96GB (1800GB/s) of VRAM from the RTX PRO 6000 (which I already have) with 256GB or 512GB (410GB/s) of RAM in the upcoming Threadripper.
Do you all think this could run any largish versions of DeepSeek with useful throughput?
a_beautiful_rhind@reddit
Since I get about 100 t/s prompt processing and ~10 t/s generation on 4x3090s and a Skylake Xeon, I think you'll do OK. With DDR5 and 512GB of RAM you can run a Q4_K_M and hit around that ballpark.
No_Afternoon_4260@reddit
What backend are you using?
a_beautiful_rhind@reddit
ik_llama
No_Afternoon_4260@reddit
How many layers are on CPU?
a_beautiful_rhind@reddit
Experts from layers 32-60. About 88GB on CPU out of 168GB for the IQ1 quant, which is what I ran last.
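(For reference, that kind of split is normally expressed with the tensor-override flag in llama.cpp/ik_llama.cpp. A rough, illustrative sketch only, with a placeholder model path and the layer range from the comment above:)

  ./llama-server -m DeepSeek-R1-IQ1_S.gguf -ngl 99 -fa \
    -ot "blk\.(3[2-9]|[45][0-9]|60)\.ffn_.*_exps=CPU"   # experts for layers 32-60 stay in system RAM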
No_Afternoon_4260@reddit
I see, thanks. Are you happy with the results of that Q1?
a_beautiful_rhind@reddit
Yeah, it's surprisingly good and similar to the API. I have V3 in IQ2_XXS as well (slightly lower PP). Wish they'd support MiniMax; I don't see that one hosted for free, but there's plenty of R1/V3.
No_Afternoon_4260@reddit
That's pretty cool
No_Afternoon_4260@reddit
That's impressive for 3/4 of the model on a Skylake.
Thireus@reddit
Go with 512GB of RAM. You should be able to run a Q4 quant. See https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4
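(A rough back-of-envelope on why 512GB is the safer choice, assuming roughly 4.8 bits per weight effective for a Q4_K-class quant of the 671B-parameter model:)

  python3 -c 'print(671e9 * 4.8 / 8 / 1e9)'   # ~403GB of weights alone, before KV cache

That doesn't fit in 96GB VRAM plus 256GB RAM, but fits comfortably with 512GB.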
Educational_Rent1059@reddit
This. And waiting for the new Threadripper series is just wasting time and the value of the GPU. By the time the new series releases, your GPU will be old gen.
SteveRD1@reddit (OP)
The new Threadripper is expected to hit system integrators (SIs) in July.
Educational_Rent1059@reddit
ah nice
LA_rent_Aficionado@reddit
Good call. When I bought 384GB I thought that was a lot; I have come to learn otherwise.
SteveRD1@reddit (OP)
Thanks, will do!
LA_rent_Aficionado@reddit
Which quant? I can test it for you (although I'm running 3x 5090s, so the results may be skewed).
bullerwins@reddit
What use cases do you have for 3x 5090s? I ask because most engines will use 1, 2, 4, or 8 GPUs for tensor parallelism. Are you only using GGUF or EXL for inference?
LA_rent_Aficionado@reddit
FYI:
prompt eval time = 302.82 ms / 3 tokens ( 100.94 ms per token, 9.91 tokens per second)
eval time = 76712.53 ms / 797 tokens ( 96.25 ms per token, 10.39 tokens per second)
total time = 77015.35 ms / 800 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
with DeepSeek-R1-0528-GGUF/UD-Q2_K_XL/DeepSeek-R1-0528-UD-Q2_K_XL
20 layers offloaded
panchovix@reddit
Make sure to use -ot and not just -ngl and -ts. I get about 25-35x your PP speed by doing that on that specific model (I get less PP at Q4).
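(A minimal sketch of the difference, using standard llama.cpp flags; the model path is a placeholder:)

  # -ngl/-ts alone splits whole layers between GPU and CPU:
  ./llama-server -m DeepSeek-R1-0528-UD-Q2_K_XL.gguf -ngl 20
  # -ot offloads every layer, then pushes only the MoE expert tensors back to CPU:
  ./llama-server -m DeepSeek-R1-0528-UD-Q2_K_XL.gguf -ngl 99 -ot "exps=CPU"

Keeping the attention and shared tensors on the GPU is what recovers prompt-processing speed.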
LA_rent_Aficionado@reddit
Which tensors do you generally offload with -ot, any recommendations? Thank you.
a_beautiful_rhind@reddit
Mainline llama.cpp is robbing your prompt processing for sure.
bullerwins@reddit
Maybe you can try ubergarm's quants with ik_llama.cpp. Instead of offloading 20 layers, offload everything and manually assign the layers to the GPUs:
https://github.com/ikawrakow/ik_llama.cpp/discussions/532
https://github.com/ikawrakow/ik_llama.cpp/discussions/477
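(A rough sketch of the "offload everything, pin tensors per GPU" idea with -ot overrides, assuming three GPUs visible as CUDA0-CUDA2; the model file, layer ranges, and context size are placeholders that would need tuning to actual VRAM. Overrides are applied first-match-wins, so the per-GPU patterns come before the catch-all exps=CPU:)

  ./llama-server -m DeepSeek-R1-0528-IQ4_KS.gguf -ngl 99 -fa -c 32768 \
    -ot "blk\.([3-9]|1[0-4])\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(1[5-9]|2[0-6])\.ffn_.*_exps=CUDA1" \
    -ot "blk\.(2[7-9]|3[0-8])\.ffn_.*_exps=CUDA2" \
    -ot "exps=CPU"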
LA_rent_Aficionado@reddit
Mostly GGUF, to max out model quants, context, and size.
I have a 4th in the works; I just need to swap out cases and haven't gotten around to it yet. I'm at capacity.
createthiscom@reddit
I'm hitting 20 tok/s with 5600MHz RDIMMs on 24 channels at 716GB/s throughput with a Blackwell 6000 Pro and DeepSeek V3 0324 671B:Q4_K_XL. If my throughput drops to 600GB/s I get 18 tok/s. A simple ratio calculation predicts you'll see about 10 tok/s at 380GB/s.
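(The ratio spelled out, assuming decode speed scales roughly linearly with memory bandwidth:)

  python3 -c 'print(20 * 380 / 716)'   # ~10.6 tok/s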
PraxisOG@reddit
It would work, but if your primary goal is inference then you might want to consider server hardware. Threadripper has 4 memory channels, but the newer EPYC CPUs support 12. A used EPYC 9334 is about $1k USD, so it's not too pricey either. If you're doing anything that needs single-core performance it's not great, with single-core boost around 3.6GHz.
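(For scale, the theoretical peak for 12 channels of DDR5-4800 on that EPYC generation, at 38.4GB/s per channel:)

  python3 -c 'print(12 * 38.4)'   # 460.8 GB/s theoretical peak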