How much total VRAM (or shared RAM for Mac/Halo/etc) do you have on your local server/PC?
Posted by panchovix@reddit | LocalLLaMA | View on Reddit | 62 comments
DataPhreak@reddit
Wonder what percent of >=64GB & <128GB is 96GB.
DigitalguyCH@reddit
64 out of 128 in my Halo, I realized I need at least as much system memory as VRAM
DataPhreak@reddit
So the GPU shared memory is almost as fast as the GPU assigned memory on these (which isn't very fast, but w/e). Technically this is 96.
DigitalguyCH@reddit
I am not sure what you mean by this. That it doesn't matter how much memory is shared vs allocated to GPU? What I see is that if I give it 96, the system will just not use it, as it will first overload the system RAM. Ideally I would want to allocate 48, but it won't let me.
Are you implying that allocating 32 is better, as the shared memory is fine?
By the way, memory is fast, I get results almost as fast as my Radeon 7900xt when everything fits in the 20GB vRAM
DataPhreak@reddit
On the Strix Halo, the bus is the same whether the ram is allocated to the CPU or GPU, so it doesn't matter if your model offloads that much. The only time you lose any speed is when memory allocation queues get backed up. So if you overflow into the shared GPU memory, you lose maybe 10% speed, as opposed to potentially 90% speed if you were on a discrete GPU.
As for comparing it to the 7900xt, it has a memory bandwidth of 800GB/s. Your Strix Halo has 250GB/s. If you are getting the same tok/s, you've got something configured wrong.
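The comparison above leans on the usual rule of thumb that token generation is memory-bandwidth bound: every generated token has to read the active weights once, so tok/s tops out near bandwidth divided by bytes read per token. A rough sketch, assuming the two bandwidth figures quoted above and a hypothetical 18GB of weights read per token (that size is an illustration, not a number from this thread):

```python
def max_decode_tps(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if decoding is purely memory-bandwidth bound."""
    return bandwidth_gb_s / gb_read_per_token


WEIGHTS_GB = 18.0  # hypothetical dense model that fits in 20GB of VRAM

xt_7900 = max_decode_tps(800.0, WEIGHTS_GB)     # discrete GPU figure from the comment
strix_halo = max_decode_tps(250.0, WEIGHTS_GB)  # Strix Halo figure from the comment
speedup = xt_7900 / strix_halo

print(f"7900 XT ceiling:    {xt_7900:.1f} tok/s")
print(f"Strix Halo ceiling: {strix_halo:.1f} tok/s")
print(f"Expected gap:       {speedup:.1f}x")
```

Real numbers land below these ceilings, but the ratio between the two machines should stay in the same ballpark if both are configured correctly.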
No-Refrigerator-1672@reddit
Mine is a unique one: 72GB. That's 2x20GB cards and 1x32GB. Pretty sure there won't be many people with this exact setup.
Endlesscrysis@reddit
3080TI Custom Extra VRAM and 5090? Or R9700?
No-Refrigerator-1672@reddit
Good guess! Correct!
Nope. Mi50 32GB. I'm evaluating whether it can supplement my AI with image gen and running embedding services, or if I should buy another beefy Nvidia.
EarlMarshal@reddit
I got a 96gb laptop
Big_Valuable31@reddit
8gb and say thank you
twack3r@reddit
2 6000 Pros, 1 5090, 6 3090s in 3 nvlinked pairs so 368GiB VRAM plus 256GiB DDR5 6400 ecc octo channel.
farkinga@reddit
44gb. Non-base-2 gang.
Middle_of_Infinity@reddit
4GB 1050ti (and 32GB DDR3) - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf @ \~5t/s
CLI args:
--fit on --fit-ctx 128000 --reasoning-budget -1 --flash-attn 1 --no-mmap
---
12GB AMD 6700xt (and 64GB DDR4) - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf @ \~10t/s
CLI args:
HSA_OVERRIDE_GFX_VERSION="10.3.0" [...] --fit on --fit-ctx 128000 --reasoning-budget -1 --flash-attn 1
---
I use them for coding and am actually quite pleased with the code quality. Any tips for speeding up output would be appreciated.
pmttyji@reddit
Middle_of_Infinity@reddit
Hey thanks! I'll check em out when I have a moment.
hust921@reddit
Question for the `>= 48GB < 128GB` people:
What models are you running? Seems like \~30B models are outperforming 100B models in many areas. Are you just running faster and with larger context, or is there something I'm missing? Seems like there's a gap between 30B and \~300B with "good" models.
idumlupinar@reddit
128 gb ddr4 system ram and 24gb vram (single gpu - rtx 3090)
Kerem-6030@reddit
8GB VRAM, 16GB RAM. I think I'm gonna upgrade to 16GB VRAM or 64GB RAM. What do you think?
GrokiniGPT@reddit
Halo??? I love halo mcc
FullstackSensei@reddit
There's no option for >512GB
Technically speaking, 576GB (192GB VRAM + 384GB RAM), but if I can be arsed to pull it from under my desk and unmount the top two cards, it could become 960GB (192+768). The stupidest part is that this cost under 2k, including said RAM upgrade, 9 months ago...
phido3000@reddit
Looks like half of my server..
I like your layout..
FullstackSensei@reddit
Thanks! It's all possible thanks to the gargantuan X11DPG-QT
phido3000@reddit
Arhh you went xeons..
I have lenovo dual xeons, but for my big setup I went epyc.
How do you find CPU inferencing? I have 6100s I want to upgrade to 6200s.
FullstackSensei@reddit
I have ES 8260 (QQ89) and find it not far behind Epyc Rome (7642), which has twice the number of cores.
FullOf_Bad_Ideas@reddit
This is still >128GB of VRAM and <512GB. The title suggests that OP is not asking about classical system RAM. It's about VRAM and if I get you right, you have 192GB of that.
FullstackSensei@reddit
Why? Its system RAM has about double the bandwidth of whatever DDR5 desktop you can buy, and it can run 400B Q4 models on one CPU and half the GPUs at greater than 10t/s. That's bigger and faster than what GB10 or Strix Halo can do.
FullOf_Bad_Ideas@reddit
Think about bandwidth and memory size addressable by GPU. DDR5 desktop wouldn't qualify into the poll either so I don't think it should be in this picture at all. GB10/Strix Halo can feed GPUs at ~250GB/s. With 192GB of VRAM and very fast 768GB of DDR5 you still have PCI-E in the way and you won't be able to read it faster than ~32GB/s (assuming PCI-E 4.0 x16, I don't know what GPUs you showed and what PCI-e they're using), so about 8x slower than GB10/Strix Halo.
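The "about 8x slower" figure can be checked with a quick back-of-the-envelope calculation. A minimal sketch, assuming PCIe 4.0 x16 as the comment does (16 GT/s per lane with 128b/130b encoding, one direction, ignoring protocol overhead) and the ~250GB/s unified-memory figure quoted above:

```python
def pcie_gb_s(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s, before protocol overhead."""
    return gt_per_s * lanes * encoding / 8  # 8 bits per byte


pcie4_x16 = pcie_gb_s(16.0, 16)  # PCIe 4.0 runs 16 GT/s per lane
unified_gb_s = 250.0             # GB10/Strix Halo figure from the comment
ratio = unified_gb_s / pcie4_x16

print(f"PCIe 4.0 x16: {pcie4_x16:.1f} GB/s")
print(f"Unified memory is about {ratio:.1f}x faster")
```

This only describes reading host RAM from the GPU across the bus; it says nothing about setups where the CPU computes on its own RAM.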
FullstackSensei@reddit
Sorry, but your math is wrong.
First, in hybrid inference you don't stream the model to the GPU. Nobody does that. That's absolutely stupid.
Second, the model is split between GPUs and CPU. Attention, router, and context stay on GPU; FF layers go to CPU. So traffic over the PCIe bus during inference is in the 100s of MB, not GBs. You absolutely don't need fast PCIe in hybrid inference.
Third, my system is six channel DDR4-2933, that's 142GB/s. Your desktop DDR5 system does 90GB/s.
Fourth, Strix Halo is absolutely slower than this setup even for models that fit. SH gets ~50t/s on gpt-oss-120b. I get over 60t/s.
I regularly run Qwen 3.5 397B Q4_K_XL, a 245GB GGUF file, at 13t/s on one CPU + 3 GPUs.
It's utterly stupid to say it doesn't count just because of your total lack of understanding of how things work, when in reality the thing runs faster and can run much bigger models.
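The bandwidth figures in the "Third" point can be reproduced from channel count and transfer rate. A small sketch, assuming a 64-bit (8-byte) bus per channel and, for the desktop side, dual-channel DDR5-5600 (the desktop speed is an assumption; the comment only gives the ~90GB/s result):

```python
def dram_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mt_per_s * bus_bytes / 1000


server = dram_gb_s(6, 2933)   # six-channel DDR4-2933, as described in the comment
desktop = dram_gb_s(2, 5600)  # assumed dual-channel DDR5-5600 desktop

print(f"6ch DDR4-2933: {server:.1f} GB/s")
print(f"2ch DDR5-5600: {desktop:.1f} GB/s")
```

The server result comes out slightly under the 142GB/s quoted in the comment, and both are theoretical peaks rather than measured throughput.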
FullOf_Bad_Ideas@reddit
Did I mention hybrid inference? I'm not talking about hybrid inference.
It's just not VRAM. You can't do some things that you can do on real VRAM, like batch inference or training. For those things, RAM just doesn't really cut it, so I assume that OP, asking this question, did actually mean what they said, because there are absolutely places where what you want is to have a lot of VRAM and that's the thing that matters. They didn't ask about how much fast CPU RAM people have.
FullstackSensei@reddit
Your assumptions are flat out wrong.
You can absolutely do batching with offloading to CPU. Again, just showing your ignorance.
And where did OP mention training? Strix Halo, GB10 or Mac absolutely suck at training, more so than this setup. OP explicitly said RAM.
You should at the very least Google something before making such erroneous assumptions.
FullOf_Bad_Ideas@reddit
How'd that work if you need to read 100GB of KV cache 50 times per second to do decoding? It'd be slow. I'm talking about vllm-style serving to multiple people. I don't think you can really use CPU RAM for this. More experts would need to be run on CPU and it would quickly stop scaling.
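To put numbers on that objection: a rough sketch, assuming the whole KV cache is read once per decode step across the batch, and using the 142GB/s system RAM figure quoted earlier in the thread:

```python
def required_bandwidth_gb_s(kv_cache_gb: float, steps_per_s: float) -> float:
    """Bandwidth needed to read the full KV cache once per decode step."""
    return kv_cache_gb * steps_per_s


def achievable_steps_per_s(bandwidth_gb_s: float, kv_cache_gb: float) -> float:
    """Decode steps per second a given bandwidth can sustain for that cache size."""
    return bandwidth_gb_s / kv_cache_gb


needed = required_bandwidth_gb_s(100.0, 50.0)      # 100GB cache, 50 steps/s
from_ram = achievable_steps_per_s(142.0, 100.0)    # six-channel DDR4 figure from the thread

print(f"Needed:  {needed:.0f} GB/s")
print(f"RAM can sustain about {from_ram:.2f} decode steps/s")
```

This ignores weight reads and any split of the cache between VRAM and RAM, so treat it as an order-of-magnitude check only.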
No, I brought this up as a case where it'd make sense to focus on VRAM.
192GB of VRAM in your setup is what makes training work, not 384GB/768GB of RAM. GB10 can do some training. CPU + 384GB/768GB of RAM without VRAM would suck more than GB10/Strix Halo, so I think everything revolves around the VRAM.
Kinda. We're working off assumptions.
/u/panchovix please clarify if fast DDR5 counts as VRAM in your poll.
FullstackSensei@reddit
Do a freaking Google search instead of making erroneous assumptions. You're so ignorant and arrogant it's not even funny.
FullOf_Bad_Ideas@reddit
I did and I don't really see any people experimenting with inference for high concurrency where a large chunk of the model and KV cache is in RAM. Care to share some sources? I still don't think it's a thing that runs at good speeds unless you have very fast connection to CPU RAM. Maybe it could work on GH200 SXM where GPU is connected to CPU by 450GB/s NVLINK-C2C.
our vibes don't match but let's stay friendly.
seamonn@reddit
What backend do you use and what's PP and TG like?
FullstackSensei@reddit
Llama.cpp. Minimax Q4_K_XL ~28t/s TG, 120 or 150 PP, don't remember exactly.
It's a dual CPU system, so I usually run two 200-400B Q4 models side by side on each CPU + 3 GPUs. TG drops to 12-14 and PP drops to ~90, but I don't mind. It's my planning machine, where I rubber duck ideas and convert them to concrete plans. Even when I need to add documentation, it's rarely more than 2k a time. I like to rubber duck and plan with more than one model. Each gives a slightly different perspective.
And before someone brainlessly says "but power costs a fortune", running in VRAM only consumes less than 400W at the wall, and running two models in parallel is like 600W at the wall. I shut it down when not in use, so consumption is like 1Wh.
bigh-aus@reddit
The thing that got me over the power consumption thing was this (applying your situation).
600W is 0.6kW. Running that for an hour costs 0.6 x your price per kWh. When you think of it like that, it really doesn't feel too bad at all.
FullstackSensei@reddit
I'd still use it if it used 3kW, and I pay €0.35/kWh. That's like €1.05/hr.
What on earth are you doing if your time is worth less than €1/hr????
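The arithmetic in this exchange is just kilowatts times price per kWh. A minimal sketch using the 600W and 3kW draws and the €0.35/kWh price mentioned above:

```python
def cost_per_hour(watts: float, price_per_kwh: float) -> float:
    """Hourly electricity cost for a constant power draw."""
    return watts / 1000 * price_per_kwh


PRICE_EUR_PER_KWH = 0.35  # price quoted in the comment above

two_models = cost_per_hour(600, PRICE_EUR_PER_KWH)     # two models in parallel
hypothetical = cost_per_hour(3000, PRICE_EUR_PER_KWH)  # the 3kW what-if

print(f"600W: €{two_models:.2f}/hr")
print(f"3kW:  €{hypothetical:.2f}/hr")
```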
bigh-aus@reddit
Exactly!
bigh-aus@reddit
Did you try kimi 2.6? Your total vram + system ram looks like it would be right on the border of fitting it
FullstackSensei@reddit
No, and TBH I doubt I will. When DS4 gets merged into mainline llama.cpp, I think I'll swap the RAM sticks to run that.
For now, I don't think the difference vs something like 3.5 397B is worth the significant slow down, at least for my use cases.
seamonn@reddit
What speed do you get with Gemma 4:31b?
Also, did you have to do any workarounds to get llama.cpp working, or does it work out of the box on the latest build?
FullstackSensei@reddit
Never run such small models there. I have another machine with 3090s for small dense models.
I pull llama.cpp main and build. Almost the same script I have for my 3090s, just replacing the CUDA parts with HIP. Llama-server commands are also the same, replacing device names from CUDA0,CUDA1,... with ROCM0,ROCM1,...
Only two things worth noting when running models also on CPU are setting --numa to numactl and prefixing llama-server with numactl to pin all threads to the cores of one CPU.
The unfortunate part is that Mi50 prices are ridiculous now.
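The NUMA tip above can be sketched as a small helper that assembles the command line. This is a hedged illustration, not the poster's actual script: the model path is hypothetical, and `--membind` is an added assumption (the comment only mentions pinning threads to one CPU's cores).

```python
import shlex


def pinned_llama_server_cmd(numa_node: int, model_path: str) -> list[str]:
    """Build a llama-server command pinned to one NUMA node via numactl."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",  # run threads only on this node's cores
        f"--membind={numa_node}",      # assumption: also keep allocations node-local
        "llama-server",
        "--numa", "numactl",           # tell llama.cpp that numactl sets the placement
        "-m", model_path,
    ]


cmd = pinned_llama_server_cmd(0, "/models/example.gguf")  # hypothetical path
print(shlex.join(cmd))
```

On a dual-CPU box, calling the helper once per node (0 and 1) matches the "one model per CPU" layout described earlier in the thread.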
grabber4321@reddit
now thats what the fuck im talking about - thats a server!
Due_Duck_8472@reddit
8192GB to be exact, deployed in a high security environment - once we've gotten the enriched uranium dug up we'll put it to good use
Tanto63@reddit
Zero!
128GB DDR3. Yes, it's very slow...
bigh-aus@reddit
What are you running at what speed? I was looking at dell R930s today :p
Tanto63@reddit
Qwen 3.6 35B, 2.69 tokens/second
1x E5-2680v2
philmarcracken@reddit
wow. now I feel like having my orchestrator be qwen 27b(4tk/s) and the subagents the faster moe...
pmttyji@reddit
1-bit version models need your love. Bonsai for example.
geldonyetich@reddit
64 and 128GB options are pretty standard in last year's dedicated AI boxes.
Tough sell following Rampocolypse though.
fivetoedslothbear@reddit
No kidding. I was waiting for Apple to come out with a 512GB M5 Ultra Mac Studio, but everything is memory constrained and even the M3 models aren't available with that much. The most memory they sell now is 128GB in a MacBook Pro (and I have an M4 Max MBP with 128GB).
Velocita84@reddit
Sucks to be a 6gb sucker, all i can run is <10B models and MoEs
pmttyji@reddit
Hopefully by the end of this year, this stuff (both threads & comments) could help the Poor GPU Club run medium & larger models better
MalabaristaEnFuego@reddit
Depending on your 6GB, you should be able to get 12-15 tokens/s with GPT-OSS:20b, Gemma 4:26b, and Qwen 3 Coder:30b. That's what I'm currently getting with them on an RTX 4050.
JackStrawWitchita@reddit
You need an option for 'no VRAM'. Some of us run LLMs without a GPU, on CPU only.
mzzmuaa@reddit
2 rtx6000 + 5090 + 4090 they warm my feet. i'm gonna reinforce the zipties holding the rtx 6000 and 4090
Newtonip@reddit
32GB VRAM
192GB DDR5 RAM
AHHHH_AHHHHHHHH@reddit
I have 2 16gb m4s I exo together, so 32gb technically haha and a separate build with a 5060ti with 128gb ddr4
DiscipleofDeceit666@reddit
Gamer pc turned home lab gang, holler at me
bigh-aus@reddit
Single rtx6000pro with 256gb of system ram in my epyc rack server. Sadly the r7515 only supports one GPU.
grabber4321@reddit
2x 5070tis and about to buy 3080 20G and set up a proper rig.
FullOf_Bad_Ideas@reddit
I remember the previous poll from a while ago showed that most people were <=24GB. I am curious to see how this has changed and what kinds of people have churned out from the community altogether.