Guide on building a system for 30B dense models.

Posted by Kahvana@reddit | LocalLLaMA

Hey everyone, not a native speaker so please correct me if I make mistakes!

With the current trend of API models generating lower-quality results over time, price hikes and whatnot, and now very strong ~30B models being released, I see interest increasing in running these models locally. The thing is, I don't see many guides on the decision-making that goes into building your own system to run them.

In this post I will highlight decisions I made during building my own PC back in January 2026 ( https://www.reddit.com/r/LocalLLaMA/comments/1qdtvgs/not_as_impressive_as_most_here_but_really_happy_i ).

I will be using current (2026-04-26) Dutch prices (megekko.nl for new, markplaats.nl for used) as reference.

Goals

Why this target?

With MoE models we can get away with a single weaker GPU (like a Strix Halo, or by offloading experts to system RAM), but dense models would be really slow that way.

From my practical experience, the difference between Q4 and Q5 is quite noticeable. Whether Q5 to Q6 and higher matters depends more on non-Latin use, however ( https://localbench.substack.com/p/gemma-4-31b-gguf-kl-divergence ).
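To get a feel for what those quant levels mean in file size, here's a rough sketch. The bits-per-weight figures are ballpark approximations for llama.cpp-style quants (they vary per model and tensor mix), so treat the results as estimates only:

```python
# Rough GGUF file-size estimate: params * bits-per-weight / 8.
# The bpw values below are approximate figures for llama.cpp quants,
# NOT exact numbers; they differ per model and per tensor mix.
BPW = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def est_size_gb(params_b: float, quant: str) -> float:
    """Estimated file size in GB for a model with params_b billion weights."""
    return params_b * 1e9 * BPW[quant] / 8 / 1e9

for q in BPW:
    print(f"31B @ {q}: ~{est_size_gb(31, q):.1f} GB")
```

With two 16GB cards (32GB VRAM total), this is why Q5/Q6 is roughly the ceiling for a 31B dense model once you add room for context.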

While I understand Q8_0 for context isn't lossless for Gemma4 ( https://localbench.substack.com/p/kv-cache-quantization-benchmark ), at half the model's context (128k of 256k) I have yet to experience issues with it in practical use.
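As a sanity check on why KV-cache quantization matters at 128k context, here's a back-of-the-envelope estimate. The layer/head/dim numbers are made-up placeholders (I don't have the real model config at hand); the point is only the F16-versus-Q8_0 ratio, since Q8_0 stores roughly 8.5 bits per element versus 16 for F16:

```python
# KV cache size ≈ 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# The architecture numbers below are PLACEHOLDERS, not a real model config.
def kv_cache_gb(ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bits_per_elem: float) -> float:
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elems * bits_per_elem / 8 / 1e9

cfg = dict(n_layers=48, n_kv_heads=8, head_dim=128)  # hypothetical values
for name, bits in [("f16", 16), ("q8_0", 8.5)]:
    print(f"128k ctx, {name}: ~{kv_cache_gb(131072, bits_per_elem=bits, **cfg):.1f} GB")
```

Whatever the exact architecture, Q8_0 roughly halves the cache, which is what makes 128k context feasible on 32GB of VRAM.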

System parts

Buy used?

If you're willing to bear the risk, it is a really good option (and can be much cheaper!)

Personally, due to the uncertain times and not being able to free up that money again quickly in case anything went wrong or broke, I did not. So my own choices revolved around new hardware.

GPU

Most important part(s) of the system. You have a few options:

The R9700 Pro is the best value for money here. The only downsides are how loud it is (blower-style fan) and the lack of CUDA (in case you need it; for inference you can use Vulkan on llama.cpp).

Personally I went for two ASUS PRIME RTX 5060 Ti 16GB cards, so I could buy one first and the other later. That specific model is very silent under load and draws very little power. MXFP4 / NVFP4 hardware support is a nice bonus, and CUDA makes setting up anything AI-related easy.

What about Intel?

While their prices are really good, the performance isn't (slow hardware and unstable drivers). Look up B70 and B60 reviews on this subreddit for more info so you know what you're getting into.

What about datacenter GPUs? (P40, V100, MI25, MI50, etc)

No comment as I have too little experience with them. From what I've read here they can be really good, so look them up!

Anything to be careful of?

When buying RTX 3000 series cards: they might have been used for mining, which significantly reduces their lifespan if so. Repaste them!

For the RTX 5090, be very careful, as the 12VHPWR connectors it requires may be faulty ( https://gamersnexus.net/gpus/12vhpwr-dumpster-fire-investigation-contradicting-specs-corner-cutting ). Undervolting is a good idea!

Motherboard

If you choose the RTX 5090 or R9700 Pro, any used PCIE 4/5 x16 motherboard is fine.

Otherwise, you really want a motherboard that supports PCIE 5.0 x8x8 mode. Not doing so results in a performance penalty, which is especially bad for the RTX 5060 Ti. Options I know supporting x8x8 include:

I went with the PROART X870E as it has the best chipset available for a good price and good PCIE x16 slot placement for the cards I want to use. Most 2/3-slot GPUs are actually 3/4-slot due to their cooler's size.

It also supports display routing: connect the monitor to the motherboard's display output (HDMI or DP) and the iGPU handles the display, so during inference the GPUs can use their full 16GB each. When playing games, the render load still goes to the GPUs rather than the iGPU, without having to swap cables around.
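A quick sketch of why x8x8 matters for the RTX 5060 Ti specifically: the card only has a x8 link to begin with, so a board that drops to Gen4 (or routes the second slot through the chipset at x4) halves or quarters the host bandwidth again. The per-lane rates below are the usual theoretical figures after encoding overhead:

```python
# Theoretical PCIe bandwidth per lane (GB/s, after encoding overhead).
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def link_gbps(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link."""
    return GBPS_PER_LANE[gen] * lanes

# The RTX 5060 Ti is a PCIe 5.0 x8 card: it can never use more than 8 lanes.
print(f"Gen5 x8 (x8x8 board):   ~{link_gbps(5, 8):.1f} GB/s")
print(f"Gen4 x8 (older board):  ~{link_gbps(4, 8):.1f} GB/s")
print(f"Gen4 x4 (chipset slot): ~{link_gbps(4, 4):.1f} GB/s")
```

During multi-GPU inference the split model shuffles activations over these links every token, which is where the penalty shows up.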

What about Intel?

Didn't research! I knew I wanted an AMD Ryzen 9000 CPU.

CPU

It kinda depends.

If you choose the RTX 5090 or R9700 Pro, you can get away with the Ryzen 5 5600 or better.

Otherwise, an AMD Ryzen 5 7600 or better will do.

I went with the AMD Ryzen 5 9600X as I wanted the AVX-512 improvements from the Ryzen 9000 series for my work.

Why not 8+ cores?

You won't get much benefit from having more than 6 cores; you end up RAM-bandwidth starved ( https://www.reddit.com/r/LocalLLaMA/comments/1qdtvgs/comment/nztj6g7 ).
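The bandwidth-starvation point can be made concrete with a rough upper bound: for CPU inference, every generated token has to stream (roughly) the whole active model through RAM, so memory bandwidth, not core count, sets the ceiling. The numbers below are illustrative:

```python
# Upper-bound tokens/sec for CPU inference ≈ RAM bandwidth / model bytes read
# per token. Real speeds are lower; this only shows where the ceiling sits.
def max_tok_per_s(bandwidth_gbs: float, model_gb: float) -> float:
    return bandwidth_gbs / model_gb

# DDR5-6000, dual channel: 6000 MT/s * 2 channels * 8 bytes ≈ 96 GB/s.
ddr5_6000_dual = 96.0
q4_27b = 16.0  # ~27B dense model at ~4.8 bits per weight, in GB

print(f"ceiling: ~{max_tok_per_s(ddr5_6000_dual, q4_27b):.1f} tok/s")
```

Adding cores past the point where they saturate that bandwidth doesn't raise the ceiling, which is why 6 cores are enough.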

Why not Ryzen 5500 or Ryzen 8000 series?

The AMD Ryzen 5 5500 and older don't support PCIE 4.0, and the Ryzen 8000 series on AM5 is limited to PCIE 4.0.

What about Intel?

Didn't research! I knew I wanted an AMD Ryzen 9000 CPU.

RAM

You want at least 32GB of RAM, preferably as 2x 16GB (for dual-channel bandwidth). More capacity is always useful, but a luxury.

I personally have 96GB (2x 48GB) DDR5-6000 CL30 which I bought before the RAM demand increase (September 2025).

Having at least 96GB is needed when running 120B MoE models, but you don't need it to run Qwen3.6 27B or Gemma4 31B.
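Rough math on why 96GB is the threshold for 120B-class MoE models (assuming roughly 4.25 bits per weight, i.e. MXFP4/Q4-class quantization; the bpw figures are approximations):

```python
# Weight memory in GB at a given (approximate) bits-per-weight.
def weights_gb(params_b: float, bpw: float) -> float:
    return params_b * bpw / 8

print(f"120B MoE @ ~4.25 bpw: ~{weights_gb(120, 4.25):.0f} GB")
print(f"27B dense @ ~4.8 bpw: ~{weights_gb(27, 4.8):.0f} GB")
```

~64GB of weights plus OS, context, and everything else clearly doesn't fit in 32GB, but fits comfortably in 96GB; a ~16GB 27B model fits on the GPUs without touching system RAM at all.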

Other hardware

Make sure there is at least one slot of space between the graphics cards inside your case, and that a fan is blowing the heat away from the GPU's backplate.

If you have an iGPU, attach the display to it to free up a little more VRAM. Every byte counts!

The software side

You really want to use llama.cpp directly for the least overhead.

Make sure to specify when using two GPUs:

--device CUDA0,CUDA1 (or Vulkan0,Vulkan1 when using AMD)
--tensor-split 16,16 (or 24,24 when using two RTX 3090s)

That way llama.cpp knows how to handle the dual GPU setup.
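As a concrete sketch, here's how those flags might be assembled into a llama-server invocation (the model path is a placeholder, and while these flag names match current llama.cpp builds, double-check `llama-server --help` on yours):

```python
# Hypothetical llama-server launch for a dual-GPU split; paths are placeholders.
model = "models/model-Q5_K_M.gguf"  # placeholder path

cmd = [
    "llama-server",
    "-m", model,
    "--device", "CUDA0,CUDA1",      # or "Vulkan0,Vulkan1" on AMD
    "--tensor-split", "16,16",      # split by VRAM; 24,24 for two 3090s
    "--ctx-size", "131072",         # 128k context
    "--cache-type-k", "q8_0",       # Q8_0 KV cache, as discussed earlier
    "--cache-type-v", "q8_0",
]
print(" ".join(cmd))
# Launch with e.g. subprocess.run(cmd) once the paths are real.
```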

Performance

Metrics for my build (the highlighted parts).

I am using the HTML source of https://en.wikipedia.org/wiki/Comfort_women as a measure (up to the references section), a little over 100k tokens.

Qwen3.6 27B:

Gemma4 31B:

That's it!

Hopefully this infodump was helpful to you! Let me know your questions or thoughts down below, I'll be happy to help where I can.