How much total VRAM (or shared RAM for Mac/Halo/etc) do you have on your local server/PC?

[-]

DataPhreak@reddit

Wonder what percent of =>64GB & < 128GB is 96gb.

[-]

DigitalguyCH@reddit

64 out of 128 in my Halo, I realized I need at least as much system memory as VRAM

[-]

DataPhreak@reddit

So the GPU shared memory is almost as fast as the GPU assigned memory on these (which isn't very fast, but whatever). Technically this is 96.

[-]

I am not sure what you mean by this. That it doesn't matter how much memory is shared vs allocated to GPU? What I see is that if I give 96 the system will just not use it as it will first overload the system RAM. Ideally I would want to allocate 48, but it won't let me.
Are you implying that allocating 32 is better, as the shared memory is fine?
By the way, memory is fast, I get results almost as fast as my Radeon 7900xt when everything fits in the 20GB vRAM

[-]

DataPhreak@reddit

On the Strix Halo, the bus is the same whether the ram is allocated to the CPU or GPU, so it doesn't matter if your model offloads that much. The only time you lose any speed is when memory allocation queues get backed up. So if you overflow into the shared GPU memory, you lose maybe 10% speed, as opposed to potentially 90% speed if you were on a discrete GPU.

As for comparing it to the 7900xt, it has a memory bandwidth of 800GB/s. Your Strix Halo has 250GB/s. If you are getting the same tok/s, you've got something configured wrong.
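
A rough way to see why the two shouldn't match: token generation is usually limited by how fast the weights can be read from memory, so a back-of-envelope ceiling is bandwidth divided by the data read per token. A minimal sketch, assuming a dense model whose ~20GB of weights are all read once per token (the 20GB figure is borrowed from the comment above; the function name is made up for illustration, and real throughput will be lower than this ceiling):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, read_per_token_gb: float) -> float:
    """Upper bound on tokens/s if each token requires reading the given
    number of GB from memory and nothing else limits throughput."""
    return bandwidth_gb_s / read_per_token_gb


# Assumption: a dense model with ~20 GB of weights, all read once per token.
WEIGHTS_GB = 20.0

for name, bandwidth in [("7900 XT", 800.0), ("Strix Halo", 250.0)]:
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: ~{ceiling:.1f} tok/s ceiling")
```

With an MoE model only the active experts are read per token, so both ceilings rise, but the ratio between the two machines stays the same.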

[-]

No-Refrigerator-1672@reddit

Mine is a unique one: 72GB. That's 2x20GB cards and 1x32GB. Pretty much sure there won't be many people with this exact setup.

[-]

Endlesscrysis@reddit

3080TI Custom Extra VRAM and 5090? Or R9700?

[-]

No-Refrigerator-1672@reddit

2x 3080TI Custom Extra VRAM

Good guess! Correct!

and 5090? Or R9700?

Nope. Mi50 32GB. I'm evaluating if it can supplement my AI with image gen and running embedding services, or if I should buy another beefy Nvidia.

[-]

EarlMarshal@reddit

I got a 96gb laptop

[-]

Big_Valuable31@reddit

8gb and say thank you

[-]

twack3r@reddit

2 6000 Pros, 1 5090, 6 3090s in 3 nvlinked pairs so 368GiB VRAM plus 256GiB DDR5 6400 ecc octo channel.

[-]

farkinga@reddit

44gb. Non-base-2 gang.

[-]

Middle_of_Infinity@reddit

4GB 1050ti (and 32GB DDR3) - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf @ \~5t/s

CLI args:
--fit on --fit-ctx 128000 --reasoning-budget -1 --flash-attn 1 --no-mmap

---

12GB AMD 6700xt (and 64GB DDR4) - Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf @ \~10t/s

CLI args:
HSA_OVERRIDE_GFX_VERSION="10.3.0" [...] --fit on --fit-ctx 128000 --reasoning-budget -1 --flash-attn 1

---

I use them for coding and am actually quite pleased with the code quality. Any tips for speeding up output would be appreciated.

[-]

pmttyji@reddit

Any tips for speeding up output would be appreciated.

Go with the IQ4_XS quant (5GB smaller than Q4_K_XL)
Q8 KV cache (this PR merge brought Q8's quality closer to F16, so you can use Q8 to save VRAM; the Poor GPU Club was using Q8 even before that PR)
Update llama.cpp at least once a week (I spot at least one optimization-related PR getting merged per week, so there's a small boost to pp and/or tg)

[-]

Middle_of_Infinity@reddit

Hey thanks! I'll check em out when I have a moment.

[-]

hust921@reddit

Question for the: `=> 48GB < 128GB` people?

What models are you running? It seems like \~30B models are outperforming 100B models in many areas. Are you just running faster and with larger context, or is there something I'm missing? It seems like there's a gap between 30B and \~300B when it comes to "good" models.

[-]

idumlupinar@reddit

128 gb ddr4 system ram and 24gb vram (single gpu - rtx 3090)

[-]

Kerem-6030@reddit

8gb vram, 16gb ram. I think I'm gonna upgrade to 16gb vram or 64gb ram. What do you think?

[-]

GrokiniGPT@reddit

Halo??? I love halo mcc

[-]

FullstackSensei@reddit

There's no option for >512GB 💔

Technically speaking, 576GB (192GB VRAM + 384 GB RAM), but if I can be arsed to pull it from under my desk and unmount the top two cards, it could become 960GB (192+768). The stupidest part is that this cost under 2k, including said RAM upgrade, 9 months ago...

[-]

phido3000@reddit

Looks like half of my server..

I like your layout..

[-]

FullstackSensei@reddit

Thanks! It's all possible thanks to the gargantuan X11DPG-QT

[-]

phido3000@reddit

Arhh you went xeons..

I have lenovo dual xeons, but for my big setup I went epyc.

How do you find cpu inferencing?.. I have 6100's I want to upgrade to 6200s

[-]

FullstackSensei@reddit

I have ES 8260 (QQ89) and find it not far behind Epyc Rome (7642), which has twice the number of cores.

[-]

FullOf_Bad_Ideas@reddit

This is still >128GB of VRAM and <512GB. The title suggests that OP is not asking about classical system RAM. It's about VRAM and if I get you right, you have 192GB of that.

[-]

FullstackSensei@reddit

Why? Its system RAM has about double the bandwidth of whatever DDR5 desktop you can buy, and it can run 400B Q4 models on one CPU and half the GPUs at more than 10t/s. That's bigger and faster than what GB10 or Strix Halo can do.

[-]

FullOf_Bad_Ideas@reddit

Think about bandwidth and memory size addressable by GPU. DDR5 desktop wouldn't qualify into the poll either so I don't think it should be in this picture at all. GB10/Strix Halo can feed GPUs at ~250GB/s. With 192GB of VRAM and very fast 768GB of DDR5 you still have PCI-E in the way and you won't be able to read it faster than ~32GB/s (assuming PCI-E 4.0 x16, I don't know what GPUs you showed and what PCI-e they're using), so about 8x slower than GB10/Strix Halo.
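
The ~32GB/s and "about 8x" figures can be reproduced from the link parameters. A minimal sketch, assuming a PCIe 4.0 x16 link as in the comment (16 GT/s per lane with 128b/130b encoding) and ignoring protocol overhead; the function name is made up for illustration:

```python
def pcie_gb_s(gt_per_s: float, lanes: int) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s for Gen3 and later,
    which use 128b/130b encoding (128 payload bits per 130 bits sent)."""
    return gt_per_s * lanes * (128 / 130) / 8


pcie4_x16 = pcie_gb_s(16.0, 16)  # PCIe 4.0 runs at 16 GT/s per lane
unified = 250.0                  # GB/s, the GB10/Strix Halo figure quoted above

print(f"PCIe 4.0 x16: {pcie4_x16:.1f} GB/s")
print(f"Unified memory is ~{unified / pcie4_x16:.1f}x faster")
```

This only describes streaming data across the bus; it says nothing about setups where the data stays where it is and is computed on in place.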

[-]

FullstackSensei@reddit

Sorry, but your math is wrong.

First, in hybrid inference you don't stream the model to the GPU. Nobody does that. That's absolutely stupid.

Second, the model is split between GPUs and CPU. Attention, router, and context stay on GPU; FF layers go to CPU. So traffic over the PCIe bus during inference is in the 100s of MB, not GBs. You absolutely don't need fast PCIe in hybrid inference.

Third, my system is six channel DDR4-2933, that's 142GB/s. Your desktop DDR5 system does 90GB/s.

Fourth, Strix Halo is absolutely slower than this setup even for models that fit. SH gets ~50t/s on gpt-oss-120b. I get over 60t/s.

I regularly run Qwen 3.5 397B Q4_K_XL, a 245GB GGUF file, at 13t/s on one CPU + 3 GPUs.

It's utterly stupid to say it doesn't count just because of your total lack of understanding of how things work, when in reality the thing runs faster and can run much bigger models.
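
For reference, the memory bandwidth figures in the third point follow from transfer rate x bus width x channel count. A minimal sketch of that arithmetic; the DDR5-5600 dual-channel configuration is an assumption picked to match the ~90GB/s desktop figure, and the function name is made up for illustration:

```python
def dram_peak_gb_s(mt_per_s: int, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth: transfers/s x bytes per transfer x channels.
    A conventional memory channel is 64 bits (8 bytes) wide."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


# Six-channel DDR4-2933, as described in the comment.
print(f"DDR4-2933 x6: {dram_peak_gb_s(2933, 6):.1f} GB/s")

# Assumption: a typical desktop running dual-channel DDR5-5600.
print(f"DDR5-5600 x2: {dram_peak_gb_s(5600, 2):.1f} GB/s")
```

These are theoretical peaks; measured bandwidth is lower on both platforms.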

[-]

FullOf_Bad_Ideas@reddit

First, in hybrid inference you don't stream the model to the GPU. Nobody does that. That's absolutely stupid.

Did I mention hybrid inference? I'm not talking about hybrid inference.

It's just not VRAM. You can't do some things that you can do on real VRAM, like batch inference or training. For those things, RAM just doesn't really cut it, so I assume that OP, asking this question, did actually mean what they said, because there are absolutely places where what you want is to have a lot of VRAM and that's the thing that matters. They didn't ask about how much fast CPU RAM people have.

[-]

FullstackSensei@reddit

Your assumptions are flat out wrong.

You can absolutely do batching offloading on CPU. Again, just showing your ignorance.

And where did OP mention training? Strix Halo, GB10 or Mac absolutely suck at training, more so than this setup. OP explicitly said RAM.

You should at the very least Google something before making such erroneous assumptions.

[-]

FullOf_Bad_Ideas@reddit

You can absolutely do batching offloading on CPU.

How'd that work if you need to read 100GB of KV cache 50 times per second to do decoding? It'd be slow. I'm talking about vllm-style serving to multiple people. I don't think you can really use CPU RAM for this. More experts would need to be run on CPU and it would quickly stop scaling.
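
Put as arithmetic, taking the numbers in this thread at face value (100GB read per decode step, 50 steps per second, and the 142GB/s six-channel DDR4 figure quoted earlier). A minimal sketch with made-up function names:

```python
def required_bandwidth_gb_s(read_per_step_gb: float, steps_per_s: float) -> float:
    """Bandwidth needed to re-read a buffer of this size on every decode step."""
    return read_per_step_gb * steps_per_s


def max_steps_per_s(bandwidth_gb_s: float, read_per_step_gb: float) -> float:
    """Decode steps per second a given bandwidth can sustain for that buffer."""
    return bandwidth_gb_s / read_per_step_gb


KV_CACHE_GB = 100.0  # figure from the comment
TARGET_STEPS = 50.0  # figure from the comment
CPU_RAM_GB_S = 142.0 # six-channel DDR4-2933 figure quoted earlier in the thread

print(f"Needed: {required_bandwidth_gb_s(KV_CACHE_GB, TARGET_STEPS):.0f} GB/s")
print(f"Achievable: {max_steps_per_s(CPU_RAM_GB_S, KV_CACHE_GB):.2f} steps/s")
```

This assumes the whole cache sits in CPU RAM and is read in full every step, which is the worst case being argued here.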

And where did OP mention training?

No, I brought this up as a case where it'd make sense to focus on VRAM.

Strix Halo, GB10 or Mac absolutely suck at training, more so than this setup.

192GB of VRAM in your setup is what makes training work, not 384GB/768GB of RAM. GB10 can do some training. CPU + 384GB/768GB of RAM without VRAM would suck more than GB10/Strix Halo, so I think everything revolves around the VRAM.

OP explicitly said RAM.

Kinda. We're working off assumptions.

/u/panchovix please clarify if fast DDR5 counts as VRAM in your poll.

[-]

FullstackSensei@reddit

Do a freaking Google search instead of making erroneous assumptions. You're so ignorant and arrogant it's not even funny.

[-]

FullOf_Bad_Ideas@reddit

Do a freaking Google search instead of making erroneous assumptions.

I did and I don't really see any people experimenting with inference for high concurrency where a large chunk of the model and KV cache is in RAM. Care to share some sources? I still don't think it's a thing that runs at good speeds unless you have very fast connection to CPU RAM. Maybe it could work on GH200 SXM where GPU is connected to CPU by 450GB/s NVLINK-C2C.

You're so ignorant and arrogant it's not even funny.

our vibes don't match but let's stay friendly.

[-]

seamonn@reddit

What backend do you use and what's PP and TG like?

[-]

FullstackSensei@reddit

Llama.cpp. Minimax Q4_K_XL ~28 t/s TG, 120 or 150 PP, I don't remember exactly.

It's a dual CPU system, so I usually run two 200-400B Q4 models side by side on each CPU + 3 GPUs. TG drops to 12-14 and PP drops to ~90, but I don't mind. It's my planning machine, where I rubber duck ideas and convert them to concrete plans. Even when I need to add documentation, it's rarely more than 2k a time. I like to rubber duck and plan with more than one model. Each gives a slightly different perspective.

And before someone brainlessly says "but power costs a fortune", running in VRAM only consumes less than 400W at the wall, and running two models in parallel is like 600W at the wall. I shut it down when not in use, so consumption is like 1Wh.

[-]

bigh-aus@reddit

The thing that got me over the power consumption thing was this (applying your situation).
600W is 0.6kW. Running that for an hour costs 0.6 x your price per kWh. When you think of it like that, it really doesn't feel too bad at all.

[-]

FullstackSensei@reddit

I'd still use it if it drew 3kW and I paid €0.35/kWh. That's like €1.05/hr.

What on earth are you doing if your time is worth less than €1/hr????
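
The running-cost arithmetic from this exchange, as a minimal sketch. The wattages and the €0.35/kWh price are the figures mentioned in the thread; the function name is made up for illustration:

```python
def cost_per_hour(power_watts: float, price_per_kwh: float) -> float:
    """Cost of running a load for one hour: kW drawn x price per kWh."""
    return power_watts / 1000 * price_per_kwh


PRICE_EUR_PER_KWH = 0.35

# 400 W: one model, 600 W: two models in parallel, 3000 W: the hypothetical.
for watts in (400, 600, 3000):
    print(f"{watts} W -> EUR {cost_per_hour(watts, PRICE_EUR_PER_KWH):.2f}/hr")
```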

[-]

bigh-aus@reddit

Exactly!

[-]

bigh-aus@reddit

Did you try kimi 2.6? Your total vram + system ram looks like it would be right on the border of fitting it😄

[-]

FullstackSensei@reddit

No, and TBH I doubt I will. When DS4 gets merged into mainline llama.cpp, I think I'll swap the RAM sticks to run that.

For now, I don't think the difference vs something like 3.5 397B is worth the significant slow down, at least for my use cases.

[-]

seamonn@reddit

What speed do you get with Gemma 4:31b?
Also, did you have to do any workarounds to get llama.cpp working, or does it work out of the box on the latest build?

[-]

FullstackSensei@reddit

Never run such small models there. I have another machine with 3090s for small dense models.

I pull llama.cpp main and build. Almost the same script I have for my 3090s, just replacing the CUDA parts with HIP. Llama-server commands are also the same, replacing device names from CUDA0,CUDA1,... with ROCM0,ROCM1,...

Only two things worth noting when running models also on CPU are setting --numa to numactl and prefixing llama-server with numactl to pin all threads to the cores of one CPU.
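
A minimal sketch of what that launch line could look like, written as a Python helper that assembles the command. The model path, the NUMA node number, and the helper name are hypothetical, and the device names simply follow the spelling used above; check what your own build reports before relying on them:

```python
import shlex


def numa_pinned_cmd(node: int, model_path: str, devices: list[str]) -> list[str]:
    """Build a llama-server command prefixed with numactl so that all
    threads are bound to the cores of a single NUMA node (one CPU socket)."""
    return [
        "numactl", f"--cpunodebind={node}",  # pin threads to this node's cores
        "llama-server",
        "--numa", "numactl",                 # tell llama.cpp to respect numactl
        "-m", model_path,
        "--device", ",".join(devices),
    ]


# Hypothetical values for illustration only.
cmd = numa_pinned_cmd(0, "/models/example.gguf", ["ROCM0", "ROCM1", "ROCM2"])
print(shlex.join(cmd))
```

Adding `--membind` with the same node number to the numactl prefix would also keep allocations on that socket's memory; that part is an assumption and not something stated in the comment.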

The unfortunate part is that Mi50 prices are ridiculous now.

[-]

grabber4321@reddit

now thats what the fuck im talking about - thats a server!

[-]

Due_Duck_8472@reddit

8192GB to be exact deployed in a high security environment - once we've gotten the enriched uranium dug up we'll put it to good use 🇹🇯

[-]

Tanto63@reddit

Zero!

128GB DDR3. Yes, it's very slow...

[-]

bigh-aus@reddit

What are you running at what speed? I was looking at dell R930s today :p

[-]

Tanto63@reddit

Qwen 3.6 35B, 2.69 tokens/second

1x E5-2680v2

[-]

philmarcracken@reddit

wow. now I feel like having my orchestrator be qwen 27b(4tk/s) and the subagents the faster moe...

[-]

pmttyji@reddit

1-bit version models need your love. Bonsai for example.

[-]

geldonyetich@reddit

64 and 128gb options are pretty standard in last year's dedicated AI boxes.

Tough sell following Rampocolypse though.

[-]

fivetoedslothbear@reddit

No kidding. I was waiting for Apple to come out with a 512GB M5 Ultra Mac Studio, but everything is memory constrained and even the M3 models aren't available with that much. The most memory they sell now is 128GB in a MacBook Pro (and I have an M4 Max MBP with 128GB).

[-]

Velocita84@reddit

Sucks to be a 6gb sucker, all i can run is <10B models and MoEs

[-]

pmttyji@reddit

Hopefully by the end of this year, this stuff (both threads & comments) could help the Poor GPU Club run medium and larger models better

[-]

MalabaristaEnFuego@reddit

Depending on your 6GB, you should be able to get 12-15 tokens/s with GPT-OSS:20b, Gemma 4:26b, and Qwen 3 Coder:30b. That's what I'm currently getting with them on an RTX 4050.

[-]

JackStrawWitchita@reddit

You need an option for 'no VRAM'. Some of us run LLMs without a GPU, on CPU only.

[-]

mzzmuaa@reddit

2 rtx6000 + 5090 + 4090 they warm my feet. i'm gonna reinforce the zipties holding the rtx 6000 and 4090

[-]

Newtonip@reddit

32GB VRAM

192GB DDR5 RAM

[-]