Are the rich RAM / poor GPU people wrong here?
Posted by crowtain@reddit | LocalLLaMA | 49 comments
Hello Guys,
I know everyone has their own definition of local models, but I see two "reasonable" types of frontier local model.
A dense one that barely fits in 24GB or 32GB of VRAM for the more "reasonable" GPU-wealthy folks, and an MoE around 100B params. The 100ish-billion-param models can run with hybrid offload at decent speed on 128GB of RAM, since 128GB is the max a standard motherboard can support. Again, it's not cheap, but common people can still afford it; it's still cheaper than a car 😄 .
We see a lot of dense models at that limit, like Qwen 27B, but for the ~100B MoE type there was only Qwen 3.5 122B, and they didn't even release a 3.6. The best MoE models sit in the 30-35B range.
Does that mean that for rich-RAM / poor-GPU people there isn't much choice, and big GPUs were the only good road?
Of course you can cram something MiniMax-like at Q3, or DeepSeek V3 at Q1, but for tool calling, speed and real usage it's barely usable.
I bought a Strix Halo before the RAM-pocalypse, but I see very few use cases for the 128GB except being able to keep multiple models loaded, which can already be done with llama-swap.
Double_Cause4609@reddit
It doesn't matter what your hardware considerations or affordances are. There will *always* be a model that's just outside the range you can run.
Also, 256GB is the largest RAM size technically supported on consumer motherboards, not 128GB. I myself have 192GB, for example.
As for other models that fit into 128GB...
Minimax M2.5/2.7 may just barely squeeze in at a low quant, Mistral Small 4, Nemotron Super 120B, Qwen 3.5 122B as you noted, Qwen 3 Next...
I don't even think I covered all of them, either. There's tons of models out there, and a huge amount of them fit a large RAM - low GPU profile. I'm actually personally really excited to try Deepseek V4 Flash once LCPP support lands, for example.
As for what to do in your case? Honestly, Qwen 27B and Gemma 4 31B may be dense and not quite what you were planning to run on your hardware... but you know what? You can do some fun things with them. You can experiment with concurrent inference using your spare memory.
Do a vLLM build for your hardware, and run multiple concurrent context windows. You can get a pretty huge total T/s, and actually possibly get more total T/s than a comparable GPU would have gotten you. Learn to use things like subagents in CLI harnesses, and you'll have a great time.
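A minimal sketch of that idea (model ID and numbers are placeholders, not a tested config):
```
# Minimal sketch, not a tested config: serve one dense model and let vLLM
# batch concurrent requests; total throughput scales with --max-num-seqs.
vllm serve your-org/your-27b-model \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90
# Then point subagents / CLI harnesses at the OpenAI-compatible endpoint,
# e.g. http://localhost:8000/v1
```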
_TheWolfOfWalmart_@reddit
I'm able to run this just fine even on my little 64 GB RAM + 24 GB 4090 system using the UD-Q3_K_XL quant. 128K context (bf16 cache). No disk swapping.
I'm really impressed with this model. Even Q3 is great; quantization doesn't hurt nearly as much with these larger models.
IndigoEtherea@reddit
Two years ago I dreamed about having what I have now, but now I dream about even more, and what I have now doesn't even come close. I blame APIs, tbh. I got so used to using them that switching back to local felt so lacking. lol
It's like making more money: your cost of living goes up along with your wage more often than not, so you'll always want more.
_TheWolfOfWalmart_@reddit
I'm running Qwen3.5-122B-A10B with 128K BF16 context. Unsloth's UD-Q3_K_XL quant.
The system is the same 64 GB RAM + 24 GB 4090 box as above.
It fits in system RAM + VRAM just barely, so no swapping needed. It's no speed demon, but I'm getting a pretty reasonable 25 tok/s, and I don't even have my RAM/CPU overclocked right now.
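For the curious, the command looks roughly like this (filename and the expert-offload count are placeholders; assumes a recent llama.cpp build that has --n-cpu-moe):
```
# Rough sketch: offload all layers to the GPU, then push the MoE expert
# tensors of the first N layers back to system RAM until it fits in 24GB.
llama-server \
  -m Qwen3.5-122B-A10B-UD-Q3_K_XL.gguf \
  -c 131072 \
  -ngl 99 \
  --n-cpu-moe 60
# Tune --n-cpu-moe (or use -ot/--override-tensor) until VRAM sits just under 24GB.
```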
I've compared it to models that fit fully in VRAM like 3.6 27B dense and 35B A3B. I've used them all for various tasks. Coding, deep research on the web, and some other stuff.
The 122B might be half or a third the speed, but I absolutely will be using this over the smaller ones most of the time. Its reasoning capability clearly blows 27B/35B out of the water. It feels like a frontier cloud model from at most 2 years ago: it's smart and makes very few mistakes.
27B/35B are virtually unusable for coding outside of the most basic shit.
122B obviously can't touch Codex or Claude, but it does a fine job even with medium-sized code bases as long as you've got 128K+ context. It's saved me a ton of my cloud usage allowance. I only go to Codex when I truly need frontier-level reasoning.
_TheWolfOfWalmart_@reddit
Also, I have a pair of Poweredge R740 rack servers with dual Xeon 6148s (40 cores) and 512 GB RAM each.
I'm so incredibly annoyed at the massive memory bandwidth bottleneck on these when running LLMs lol. I've got 512 GB just sitting right there! Could fit such awesome models, but the inferencing is just glacial. Even if I isolate llama.cpp to a single NUMA node and use numactl to allocate only memory local to that node for the model. It's just brutal. :(
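For anyone fighting the same thing, the pinning looks roughly like this (node number, binary path and thread count are just placeholders):
```
# Bind both the threads and the model's memory allocations to NUMA node 0
# so inference never has to cross the inter-socket link.
numactl --cpunodebind=0 --membind=0 \
  ./llama-server -m model.gguf -t 20   # one socket's worth of cores on a 6148
# llama.cpp also has its own --numa option (e.g. --numa isolate) worth testing,
# but nothing gets around the per-socket memory bandwidth ceiling.
```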
LagOps91@reddit
I have 128GB RAM and 24GB VRAM. You can run M2.7 (230B) at Q4 with no problems, and if you don't mind dropping to Q2 (not as bad as you think), the largest you can fit is Trinity at 400B parameters.
You certainly get better performance than 30B-class models.
ttkciar@reddit
Can relate to this rather a lot. Ever since I got my 32GB MI60, my preferred models have been something dense in the 24B to 32B range for in-VRAM inference, and something much larger for pure-CPU inference.
Nowadays that's Gemma-4-31B-it and GLM-4.5-Air (106B-A12B). At max context Air consumes 127GB of memory, so it would just barely fit in two MI210 if I had them. Some day!
I keep testing new 120B-class models to see if any are better than GLM-4.5-Air, most recently Mistral Medium 3.5 128B, but so far they've all fallen short in some way or another.
My other "big" model is K2-V2-Instruct, which is "only" 72B dense but its context maxes out at 512K tokens. Near that limit it will consume 250GB of memory, which is as much as my crufty old Xeon servers have.
_TheWolfOfWalmart_@reddit
Same. As far as models that fit totally in VRAM, I've gotta say Qwen3.6-27B dense is up there. I prefer it to the 35B MoE.
super1701@reddit
How does GLM compare to qwen 3.5 122b?
LizardViceroy@reddit
I have 512GB worth of 128GB devices, and I've been feeling worse about my choices since Qwen3.6 27B and Gemma4 dropped... In the GPT-OSS-120B days we looked like the smart ones. These things come and go in waves, though. The advantages of VRAM in times like these are still numerous: plenty of room for context and high-bit quants. The 122B version of Qwen3.6 should put the ball back in our court soon.
I'm currently coping by sharding 200B+ models between two nodes with tensor parallelism, but before you go down that road, realize that the itch you're feeling... it doesn't stop.
NNN_Throwaway2@reddit
There isn’t going to be a qwen 3.6 122b, though.
No_Lingonberry1201@reddit
They mentioned there'd be one, didn't they?
NNN_Throwaway2@reddit
They did not.
No_Lingonberry1201@reddit
Damn.
NNN_Throwaway2@reddit
People don't seem to realize that the Qwen we knew and loved is dead; they still believe it's going to keep releasing a bunch of models like it used to.
No_Lingonberry1201@reddit
Double damn, well, at least they released two bangers before that?
VodkaHaze@reddit
We're seeing a couple of inflection points in model sizes right now IMO.
The main theme is that some model sizes are "good enough" for a subset of tasks you're doing, so there's no point in going bigger.
<7B: Models this size have to be task-specific to be useful (eg. tool calling, classification, translation, ..) They can't handle much context and reasoning will often end up breaking at runtime.
7-9B: Around the minimal size to have a somewhat generalist model. With reasoning enabled, models in this size can be pretty damn strong (ex: Qwen3, nvidia nemotron 3 series). Still generally useful for a few tasks the model is good at. Will go off the rails on context longer than a few thousand tokens.
Note that "coding" is not a specific task, it's a set of tasks (tool calling, reasoning, structured outputs, etc.) that the model needs to perform reliably over medium / long context.
24-35B: This category has gotten incredibly strong in the last 18 months. Can be used as a generalist model. With reasoning enabled, the modern ones (gemma4, qwen 3.6) will compete with models 10x their size. Without reasoning, or at aggressive quantization, they fall back closer to their weight class.
100-140B: This category of models has become fairly irrelevant. Most of the tasks they can do, the 30B models can also do. For the very long context multi-skill tasks you'll likely still need a huge model.
>300B: The best models here can handle very long context and juggle multiple tasks more reliably (still dependent on the model, however!). It's useful if you don't want to spend time optimizing your workflow and just want a one-stop shop.
At this point in time, most of the time, using the 300B+ models is overkill. They're still unreliable when they have 15+ tools exposed, and will still go off the rails on very long context. They're also very expensive! But you'll still see a much higher tendency for, say, Opus to one-shot a coding problem that most other models will need multiple rounds to do.
a_beautiful_rhind@reddit
I think 24-35b dense is the minimum for a general purpose model. Nothing below that has ever been good-good.
Jorlen@reddit
This has been my experience as well. I keep trying them and I keep being disappointed.
soshulmedia@reddit
--
The above is generated by good old Qwen3 4B 2507, prompted with the whole thread as pasted data and:
"Attached is a reddit discussion thread. You are Qwen3 4B 2507 Instruct. So, with 4B, you fall under the 24-35B border for which a_beautiful_rhind says "Nothing below that has ever been good-good."
Please formulate an appropriate, defensive and brief reply for yourself as an answer to the last post in the thread."
a_beautiful_rhind@reddit
It's not really refuting me. But we are talking about a generalist. Text encoders, document processors, etc can probably get away with something that small.
soshulmedia@reddit
Fair enough. I guess I'd argue it is "good-good" in a relative sense, not compared to the behemoths. Fits on less than a 4.7GB DVD and you can have some meaningful (short) conversations with it and use it for field/text extraction, sentiment analysis, even short python snippets ...
gingerbeer987654321@reddit
what are you using to shard such large models?
can999999999@reddit
Well, they both have their pros and cons. I can run Qwen 3.6 27B on my B70 at a good quant and context size comfortably, but I kind of wonder how a big model on slower RAM would do, just like you wonder how a dense model on a GPU would do. People always want what they don't have, so stop the gear acquisition syndrome, as we call it in the guitar world, and actually use and enjoy what you have.
lacerating_aura@reddit
If I may ask: how's the performance? How are you running it? What kind of context and quant?
can999999999@reddit
My daily driver is Qwen 3.6 27B UD_Q6_K_XL with 131K Q8 kvcache; the step up to Q8 isn't worth it from a precision standpoint, and you get next to zero room for context anyway. If I need more context on some coding tasks, I drop down to Q5, rarely ever Q4; at that point I would rather just tell the model to write a summary and to-do list for its next instance.
Honestly, the Intel cards are slow, even with my llama.cpp-sycl build, but really, it's usable. If you want 32GB AND fast compute then go ahead and drop 3x the price on an NVIDIA card, that's just the trade-off. I'd buy it again.
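For reference, that kind of setup maps to roughly these llama.cpp flags (filename is a placeholder, not an exact command):
```
# Rough sketch of a Q6 model + q8_0 KV cache + 131K context setup.
llama-server \
  -m Qwen3.6-27B-UD-Q6_K_XL.gguf \
  -c 131072 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
# Depending on the build, the quantized V-cache may require flash attention
# to be enabled; drop -c if you run short on VRAM.
```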
5dtriangles201376@reddit
I'm curious how many tokens/s you get for pp and tg (without mtp) so I can compare with my setup
Plastic-Stress-6468@reddit
Hey, hope you can share your experience a bit. I have a 5090 and want to get a second one for more VRAM, but the price just went up by another 1k USD.
Do you feel like the B70 serves you well? I want to get two B70s.
can999999999@reddit
Similar to what I said in another comment on this post: the B70 is slow compared to the NVIDIA options, even with my llama.cpp-sycl build. But still, you'd have to pay 3x as much to get the same VRAM at NVIDIA speed, so yeah, I love my B70 and can only recommend it.
Plastic-Stress-6468@reddit
Thanks for the insight.
So if you don't mind me asking: in your current position, would you consider buying another B70 for a potential 50-70% speedup and more VRAM? How many t/s are you getting right now, and would you even want such a speedup?
can999999999@reddit
I don't really have a desire for more than one B70; 32GB is enough to run a near-lossless model with enough near-lossless kvcache for 95% of my use cases, and the extra speed wouldn't make me want to spend another 1250€. I don't have exact numbers on hand right now, but with my regular ~30GB setup I get about (I'm really not sure, don't quote me on this) 700-800 pp and about 20 tg. Yeah, it's slow, but definitely usable. And if I'm impatient, there's always the 35B.
crowtain@reddit (OP)
So true. I'm always waiting for the next model or the next hardware, and I spend more time tweaking and testing than actually using the model itself :D
can999999999@reddit
Yup, that's exactly the trap people often fall into: chasing guitar after guitar, switching strings, pedals and so on, but never actually sitting down to play the damn thing lmao
ProfessionalSpend589@reddit
You could also test smaller models in BF16 and then compare with quantised versions.
And you can consider yourself lucky, because you have only one Strix Halo and not two. ;)
colin_colout@reddit
I'm quite happy with minimax m2.7 in the q3 range on my framework desktop. Speed and quality are just fine for my architecture and planning agent.
Can even run UD-IQ4_NL with quantized kv cache at 8_0 but it nerfs long context coding (I'm waiting for turbo quant or similar to merge).
Also, qwen3-next-coder is still quite magical at Q8_K_XL, though it has severe ADHD when left to its own devices.
...I kinda wish Qwen3.5 122B wasn't so "meh" as a coding agent model. On paper it should blow qwen3-next away. It feels like it's almost there, so maybe a 3.6 release would help?
Subject_Mix_8339@reddit
I too have been waiting for the next big MoE. Have been very happy with Qwen 3.6 35b-a3b but I get jealous seeing the dense models.
UncleRedz@reddit
I came to a similar conclusion: for the budget I had, I could upgrade either RAM or the GPU, and I was initially leaning towards RAM for running larger MoE. But, same conclusion: if you look at open models released around the 100B size, the selection is quite limited, especially if you're looking for advancements like hybrid attention, etc. Going bigger would scratch the itch of "what happens when I run bigger? How much smarter does it get?" I instead went for the GPU upgrade, from a 5060 Ti 16GB to an RTX Pro 4500 Blackwell 32GB; the thinking was to run faster with better quality and larger context, without spilling into system RAM.
While it doesn't scratch the itch of bigger models, it's way more practical and I can get more things done, faster and better. I have to say I'm very happy with the upgrade. I can't run much larger models than before, but I can stick with fp16 KV cache, Q6 instead of Q4 on the model, 128K+ context, etc., and I'm getting about 2x on token generation and up to 6x on prompt processing, which makes a huge difference. I'd call it a "quality of life" improvement that I don't think I would have gotten from a larger model.
CreamPitiful4295@reddit
What does an MOR give you?
Expensive-Paint-9490@reddit
We just had a 100B model from DeepSeek.
wombweed@reddit
How do you run it on a non-Mac with CPU non-active expert offloading? I tried setting it up and had a lot of trouble getting usable speeds on my 2x3090 256GB RAM system.
a_beautiful_rhind@reddit
I haven't downloaded it because no ik support and only random ass forks. Hybrid on mainline was never great to begin with so about 2 more weeks?
Expensive-Paint-9490@reddit
I don't know if it has been properly ported to llama.cpp yet. Usually, with the default --fit on, it should optimize the loading automatically.
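For reference, the usual manual way to keep the non-active experts in system RAM with llama.cpp looks roughly like this (filename and regex are illustrative, not a verified command for this model):
```
# Keep attention/dense weights on the GPU(s), push MoE expert tensors to system RAM.
llama-server \
  -m deepseek-v4-flash-Q4.gguf \
  -c 32768 \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
# -ot / --override-tensor maps tensors matching the regex to a backend;
# newer builds also expose --n-cpu-moe as a simpler knob, and with two GPUs
# you'd add a --tensor-split on top.
```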
ComplexType568@reddit
Around ~240B params, not 100, haha. If you saw something like 188B params, that's an HF issue for the time being.
Expensive-Paint-9490@reddit
I stand corrected. I got confused by the size; with the experts in FP4 it's only 160 GB.
a_beautiful_rhind@reddit
Hybrid inference is OK, but nothing beats full offload. These 100B MoEs aren't really 100B-strength models. You should at least be able to run those 30B densies on GPU if you crave general-purpose LLMs.
Ledeste@reddit
Don't want to make you sad, but as someone with both 192GB of RAM (isn't the new max 256 since DDR5?) and a 5090, I only use the RAM to test new models and otherwise avoid spilling out of VRAM as much as possible.
The speed gain is just too important compared to the small gain in accuracy.
ambient_temp_xeno@reddit
I bought 256gb ram and an old xeon before the Ramdemic. 24gb vram makes sense there.
Nobody cared about tool calling back then so it's not really about being right or wrong - we're all just screwed.
TokenRingAI@reddit
You are making a lot of assumptions about the car I drive
crowtain@reddit (OP)
:D