Qwen3.5-122B at 198 tok/s on 2x RTX PRO 6000 Blackwell — Budget build, verified results
Posted by Visual_Synthesizer@reddit | LocalLLaMA | View on Reddit | 253 comments
I've been optimizing a 2-GPU inference server for the past week and wanted to share the results. Full data is public with raw JSONs, launch commands, and methodology.
**Hardware:**
- 2x RTX PRO 6000 Blackwell (96GB GDDR7 each)
- EPYC 4564P
- 128GB DDR5 ECC
- c-payne PM50100 Gen5 PCIe switch
- AsRock Rack B650D4U server board
**Results (C=1, single-user decode, tok/s):**
| Model | tok/s | Engine | Config |
|---|---|---|---|
| Qwen3.5-122B NVFP4 | 198 | SGLang b12x+NEXTN | modelopt_fp4, speculative decode |
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
| MiniMax M2.5 NVFP4 | 148 | vLLM b12x Docker | modelopt_fp4 |
| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |
**Before you ask:**
*"198 tok/s on 122B? No way."*
3-run verified: 197, 200, 198. Also confirmed with curl: 2000 tokens in 12.7s. Raw JSONs linked below.
*"That's just ctx=0 cherry-picking."*
Tested context scaling today at C=1: 4K=1.8s, 16K=2.3s, 57K=7.1s, 150K=23.3s TTFT. No crashes at any length. Decode speed stays \~198 regardless of context — TTFT increases, decode doesn't.
*"85% VRAM utilization leaves no headroom."*
VRAM breakdown per GPU from server logs: weights 39.75GB + KV cache 13.9GB + Mamba state 26.4GB + 13.5GB free. KV budget is 2.4M tokens — model only supports 131K max context. Headroom is fine.
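If you want to sanity-check a breakdown like this yourself, the arithmetic is quick. These are the numbers from the post; the per-token cache figure is derived from them, not separately measured:

```python
# Per-GPU VRAM budget from the server logs above (96 GB cards).
weights_gb = 39.75
kv_cache_gb = 13.9
mamba_state_gb = 26.4
free_gb = 13.5

used_gb = weights_gb + kv_cache_gb + mamba_state_gb
total_gb = used_gb + free_gb
print(f"accounted for: {total_gb:.2f} GB of 96 GB")  # remainder is runtime overhead

# Derived: cache bytes per token, given the quoted 2.4M-token KV budget.
kv_bytes_per_token = kv_cache_gb * 1e9 / 2.4e6
print(f"~{kv_bytes_per_token / 1024:.1f} KiB of cache per token")
```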
*"Why not just buy a Threadripper?"*
I have one too. This build is 18% faster (198 vs 168 tok/s) because the PCIe switch routes P2P through silicon at sub-microsecond latency instead of through the CPU root complex. For MoE TP decode, every forward pass blocks on dozens of small allreduces. The messages are tiny (10B active params), so bandwidth doesn't matter. Latency per sync does. PIX topology wins on latency, not bandwidth.
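To make the latency-vs-bandwidth point concrete, here's a toy cost model. Every constant in it (sync count, message size, latencies, bandwidth) is an illustrative placeholder, not a measurement from this build:

```python
# Toy model: comms time per decode step = n_syncs * latency + bytes / bandwidth.
# All constants are illustrative placeholders, not measurements from this build.
def comms_us_per_step(n_syncs: int, msg_bytes: int,
                      latency_us: float, bandwidth_gb_s: float) -> float:
    """Allreduce overhead per decode step, in microseconds."""
    transfer_us = n_syncs * msg_bytes / (bandwidth_gb_s * 1e3)  # GB/s -> bytes/us
    return n_syncs * latency_us + transfer_us

MSG = 16 * 1024  # ~16 KB per allreduce for one fp16 hidden state (order of magnitude)
for name, lat_us in [("switch (PIX)", 0.7), ("CPU root complex (PHB)", 2.5)]:
    t = comms_us_per_step(n_syncs=60, msg_bytes=MSG, latency_us=lat_us,
                          bandwidth_gb_s=50)
    print(f"{name}: {t:.0f} us/step")
```

With those placeholder numbers the latency term is already about double the transfer term through the switch, and it dominates completely through the root complex. That's the effect the PIX topology is buying.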
**The secret sauce:**
- PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not CPU
- SGLang with b12x MoE kernels — 26% faster than FlashInfer CUTLASS
- NEXTN speculative decoding — +65% over no speculation
- PCIe oneshot allreduce + fusion — optimized multi-GPU communication
- modelopt_fp4 checkpoint (txn545) — required for b12x kernels; compressed-tensors checkpoints don't work with b12x
- Performance governor + `pci=noacs` + `uvm_disable_hmm=1` — without these, P2P hangs and GPUs wedge
**All data is public:**
- Results & methodology:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/results.md](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/results.md)
- Raw benchmark JSONs:
[github.com/Visual-Synthesizer/rtx6kpro/benchmarks/inference-throughput](https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput)
- 3-run verification data:
[run1](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run1.json),
[run2](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run2.json),
[run3](https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/benchmarks/inference-throughput/sglang_122b_verify_run3.json)
Happy to answer questions. If you think the numbers are wrong, the launch commands are in the repo — reproduce it yourself.
EuphoricAnimator@reddit
That's a seriously impressive speed on the Blackwells, and good catch on the corrections; details matter a lot with these builds. I've been running models locally on a Mac Studio M4 Max with 128GB of RAM for a while now, and it's a different world from even a year ago. People often underestimate how much RAM really helps, especially when swapping between models.
I mostly play with Qwen 3.5, Gemma 4, and a bunch of stuff through Ollama. I can usually get Qwen 3.5-7B running comfortably at around 60-70 tokens per second with a decent context window. The 14B version is still usable, but it slows down to maybe 30-40. Going much higher than that gets pretty painful, even with quantization. It's not just about VRAM either, the unified memory architecture on the Apple Silicon is a big part of what keeps things from grinding to a halt.
One thing I've noticed is the overhead when switching models isn't zero. It takes a bit for a model to load and initialize, so constantly flipping between different 7B models isn't ideal. I tend to pick one and stick with it for a longer session. Also, be careful about system memory usage. Even with 128GB, a really large context window can cause issues, and macOS will start swapping to disk, killing performance.
It’s cool to see builds like this pushing the limits on the PC side. It gives us Mac users a benchmark to compare against and shows what's possible when you have dedicated GPUs. I'm curious if Metal acceleration will continue to improve enough to close the gap, but for now, the Blackwells definitely have an edge.
eliko613@reddit
Good thread. If you're tackling this in production, this is the pattern that usually works:
1) Start with a weekly top-10 token spend report by endpoint/use-case.
2) A/B routing policies (cheap-first vs quality-first) and compare quality + cost together.
3) Cap max tokens and require explicit override for outliers.
We started evaluating zenllm.io to identify multi-endpoint waste in production and it's been decent so far.
tecneeq@reddit
Wow, what a wealth of information. Thanks!
I have an unoptimized build at work, 2x 6000 Blackwell QMax (with slots for two more). I get 50 t/s for qwen 3.5 27b and 100 t/s for 122b with llama.cpp out of the box.
qwen3.5 doesn't work with speculative decoding for llama.cpp yet. I need to look into your stuff.
AlwaysLateToThaParty@reddit
Realistically, how many people do you think that system could serve in a professional environment?
Visual_Synthesizer@reddit (OP)
TBH im not sure. would depend a lot on the duty cycle and actual use case. are we talking devs with longer context or agentic chat usage? generally speaking 8 concurrent users would be 100 tok/s. 16 would be 65 tok/s at 100% duty cycle.
AlwaysLateToThaParty@reddit
Would that require a more comprehensive cooling solution? I've got an RTX 6000 pro. It gets hot. just wondering if that's something you've considered.
Visual_Synthesizer@reddit (OP)
Pic of the fan rail. this works way better than when i had them in the white case behind the rack
AlwaysLateToThaParty@reddit
My point is that if the room they're in isn't conditioned as well, all it will do is recycle increasingly warm air.
Visual_Synthesizer@reddit (OP)
cooling is definitely important. on my old trx40 in a case with blower cards i had to downclock them from 350w to 275 ish. I tested down-clocking the 6000s to 300w only drops performance a small amount based on my testing (3-5% IIRC). the mining rack does have fans behind the GPUs. i wrote a small script that ramps them up.
couple this with some nice PWM 3000 RPM fans and you have an effective cooling solution. mining racks are pretty standard for AI rigs. you can stack 8x in them and put them on top of each other.
https://github.com/Visual-Synthesizer/asrock-rack-fan-control
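The real script lives in the repo above; as a rough sketch of the idea, with made-up thresholds and the hardware I/O (nvidia-smi reads, BMC writes) left out:

```python
# Sketch of a temperature-driven fan ramp (thresholds invented for illustration;
# the real script reads GPU temps via nvidia-smi and writes fan duty via the BMC).
def fan_duty_percent(gpu_temp_c: float) -> int:
    """Map the hottest GPU temperature to a PWM duty cycle."""
    if gpu_temp_c < 50:
        return 30  # idle floor: keep some airflow moving
    if gpu_temp_c < 70:
        # ramp linearly from 30% at 50C to 80% at 70C
        return int(30 + (gpu_temp_c - 50) / 20 * 50)
    return 100     # full blast above 70C

for t in (45, 60, 75):
    print(f"{t}C -> {fan_duty_percent(t)}%")
```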
PassengerPigeon343@reddit
Please drop a link to the budget RTX PRO 6000s, been waiting for these bad boys to drop into budget GPU range
Visual_Synthesizer@reddit (OP)
central computer and get nvidia inception pricing!
NoahFect@reddit
Wow, they put up a lot of hoops to jump through. How much of a discount do you get on these puppies?
ipcoffeepot@reddit
Interesting! I'm seeing around 100 tok/s on the same cards. I suspect its the wrong kernel (gonna need to try the b12x!) and NCCL. Thanks for posting this!
Visual_Synthesizer@reddit (OP)
good luck!
Do this for 7-9% speed unlock on PCIE direct connect (multiGPU): https://github.com/vllm-project/vllm/pull/39040
this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042
rtx6kpro discord: https://discord.gg/AGxz5eYf
ipcoffeepot@reddit
Amazing. Thank you so much! Im new to this and appreciate the info
FullOf_Bad_Ideas@reddit
tbh this does sound like a buggy reading
i see 206 t/s reading in the repo too.
Visual_Synthesizer@reddit (OP)
did some analysis with my claude test harness:
Good catch — you were right. Those numbers were buggy.
I re-tested properly and wanted to share what actually happened, because it is a useful benchmarking lesson.
**Re-test methodology**
Corrected cold-start TTFT on 2× RTX PRO 6000 + SGLang b12x+NEXTN, 122B. The corrected curve is roughly linear up to around 32K, then super-linear above that as attention's O(n²) term starts to dominate. The shape matches what you would expect from a 122B transformer.
**What went wrong in my original numbers**
Two separate methodology errors stacked on top of each other. (The original numbers are the ones quoted in the post: 4K=1.8s, 16K=2.3s, 57K=7.1s, 150K=23.3s TTFT.)
**1) 4K was too high: 1.8s measured vs 0.67s real**
That measurement was my first request after server startup, so it paid the JIT / cudagraph warmup tax. The re-test showed the same pattern: the first post-startup request is slow, then repeats settle at 0.67s. That is a huge difference. I was partially measuring compile/warmup overhead and calling it prefill.
**2) 57K and 150K were too low: 7.1s → 14.7s real, 23.3s → \~60s extrapolated**
SGLang's radix prefix cache was hitting. My original test sent sequential prompts that shared a common prefix: same base prompt, then extended versions of that same prompt at larger context sizes. So each later measurement was not a true cold prefill; it was mostly measuring the incremental delta on top of already-cached work. That means those numbers were artificially low for true cold-start TTFT.
**The giveaway**
The biggest clue was that in my original numbers, going from 4K to 16K only added about 0.5s.
That would imply a prefill rate of around 24k tok/s for the delta, which is much faster than this rig's actual sustained prefill. That should have been an immediate red flag.
For a real cold 16K prefill, the delta should have been closer to 2s, and that is exactly what the re-test shows.
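Spelling out that red-flag arithmetic (approximating 4K/16K as 4096/16384 tokens, and using the \~2s cold delta from the re-test):

```python
# Implied prefill rate from the delta between two TTFT measurements.
def implied_prefill_rate(tokens_lo, ttft_lo_s, tokens_hi, ttft_hi_s):
    return (tokens_hi - tokens_lo) / (ttft_hi_s - ttft_lo_s)

# Contaminated numbers: 4K=1.8s, 16K=2.3s -> impossibly fast incremental prefill.
print(implied_prefill_rate(4096, 1.8, 16384, 2.3))    # ~24.6k tok/s: red flag
# Cold re-test: 0.67s at 4K plus a ~2s delta implies a plausible sustained rate.
print(implied_prefill_rate(4096, 0.67, 16384, 2.67))  # ~6.1k tok/s
```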
**What stays the same**
Decode speed still holds at \~198 tok/s regardless of context length. That part of the original claim was correct. It was only the specific TTFT values that were contaminated.
Classic LLM benchmarking gotcha: prefix cache + lack of warmup isolation.
Thanks for pushing on it.
FullOf_Bad_Ideas@reddit
I wasn't even pushing on prefill lol, I was just doubting that decoding speed would hold.
so what's the decode speed at 128k/150k? what is "roughly constant"?
Visual_Synthesizer@reddit (OP)
Decode drops about 9% at 128K and 13% at 240K prompt tokens (measured 241 tok/s at 120K and 230 tok/s at 240K with a high-acceptance NEXTN task); call it \~1% per 15K additional context, not flat but not catastrophic either.
FullOf_Bad_Ideas@reddit
so it's 265 t/s at 1 token depth?
that's way higher than 198 t/s claimed earlier.
Visual_Synthesizer@reddit (OP)
yes, NEXTN makes a huge difference.
Visual_Synthesizer@reddit (OP)
thanks! good catch. i updated the post with more testing I did last night.
Frizzy-MacDrizzle@reddit
Budget Builds? Not fancy but runs what I want.
Visual_Synthesizer@reddit (OP)
thats a couple sweet AI systems! Kudos.
Frizzy-MacDrizzle@reddit
Thanks. I’m happy they work. Lot of knowledge learned. No more than 1 k each!
Visual_Synthesizer@reddit (OP)
that knowledge is priceless!
ofan@reddit
So it's the model + speculative decoding speed, not the model speed
Visual_Synthesizer@reddit (OP)
P2P makes a big difference
Do this for 7-9% speed unlock on PCIE direct connect (no switch): https://github.com/vllm-project/vllm/pull/39040
this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042
jiria@reddit
Would appreciate your advice on the following: currently I've got my 2x RTX PRO 6000 Blackwell Max-Q running on a very very budget host: ASUS ProArt X870E Creator WiFi + Ryzen 5 9600X CPU + Teamgroup 32GB (2x16) CL30 DDR5 RAM. The motherboard does x8/x8 PCIe 5.0 bifurcation (`nvidia-smi topo -m` returns PHB). I'm trying hard to avoid buying DDR5 RAM at current prices. I only use this server for inference and can load models just fine as long as the safetensors are split (after models sit in VRAM, vllm/sglang RAM usage stays at a constant 10GB). What I would like to understand is: if I buy a c-payne PM50100 Gen5 PCIe switch (and use it at full x16 with my mobo), do I get (most of) the benefits you describe above, or would there be something preventing me from achieving those speeds in my particular setup? By the way, your github repo is a great resource for people like myself, keep it going, and if you'd like me to contribute some measurements please don't hesitate to ask.
Visual_Synthesizer@reddit (OP)
Thanks, but I just added trx40 and B650 2x GPU test info to my fork. The repo is maintained by others.
you probably dont need a c-payne unless you want more GPUs or are locked out of P2P by your chipset. im not totally sure without testing the topology myself. two x8 links will mostly only slow down model loading, maybe a bit of prompt processing/prefill. you could combine them into one x16 and run a switch like i am. my direct-connect gen4 trx40 was only 10% slower. you dont need more ram for inference.
Do this for 7-9% speed unlock on PCIE direct connect: https://github.com/vllm-project/vllm/pull/39040
this b12x kernel gives +25%: https://github.com/vllm-project/vllm/pull/39042
rtx6kpro discord: https://discord.gg/AGxz5eYf
jiria@reddit
Thanks so much, your info is gold and confirms my (limited) understanding. I ran some tests after posting my comment and I'm getting 0.43-0.50us latency and ~52GB/s bidirectional (P2P enabled). My Qwen3.5-27B benchmarks at ~130tok/s (tp=2, long prompts). Good to know that the upgrade path to 4x RTX 6000 in my case is via a PCIe switch on this mobo rather than having to buy a whole set of workstation hardware. In my brief usage of my setup, whenever I run NVFP4 models I find that quality of the answers to be perceptively worse than the FP16 versions of the same models, that's why I'm still using Qwen3.5-27B. Is that your experience as well, or perhaps I misconfigured something for NVFP4? Thanks for the discord invite, I intend to join when I'm back from holidays.
Visual_Synthesizer@reddit (OP)
nice! yeah save that money for more GPUs! Higher precision will always be better, but I am surprised you notice it that much. Generally, dense models perform a lot better than MoE. Perhaps that's the quality you are noticing? Have you seen much difference with the 27B at FP4?
fragment_me@reddit
20k just to run 122b what did you blow the budget on some hookers and blow?
Visual_Synthesizer@reddit (OP)
it also runs M2.5 and Qwen 3.5 397b :-)
Also can scale to 5 GPUs on this board.
fragment_me@reddit
It’s a solid build I’m just jelly.
laterbreh@reddit
You're not running 397B on 2 cards unless you're in sub-Q4 territory with llama.cpp, which is a waste of two RTX Pros.
FullOf_Bad_Ideas@reddit
you can run ~3.5bpw exl3 Qwen 3.5 397B quant with full 262k ctx. That's roughly the performance of IQ_4XS.
Hector_Rvkp@reddit
His post reads "| Qwen3.5-397B GGUF | 79 | llama.cpp | UD-Q3_K_XL, fully in VRAM |" File is 179gb as per HF https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF. Limited context window for sure, but the math does say Q3 XL fits.
sautdepage@reddit
Actually it does fit the full 256K context at F16 + vision using ik_llama graph split mode. The Qwen3.5 architecture is awesome.
Very usable, but quite a bit slower (especially prompt processing) than models that can run in vllm/sglang.
--Rotten-By-Design--@reddit
You lost me at budget build...
FatheredPuma81@reddit
Man probably has a "Budget private jet" too.
Visual_Synthesizer@reddit (OP)
it flies 18% faster than the expensive one though
FatheredPuma81@reddit
Sorry man, it's just a combination of factors. Qwen3.5 122B is a model that those with high-end gaming PCs can run at 20-30 t/s. So you're comparing a "budget $3,000 Qwen3.5 122B build" to a "budget $20,000+ Qwen3.5 122B build", which, uh, yeah, you're not winning that one.
Should have titled it "Budget Qwen3.5 397B build" and switched to llama.cpp or SGLang and did Expert Offloading.
Visual_Synthesizer@reddit (OP)
20-30 tok/s on a gaming PC is like saying you can tow a boat with a Honda Civic. Technically true. Not the same experience. The post is about optimized throughput on purpose-built hardware, not "can it run." And I did post the 397B result too: 79 tok/s on 2 GPUs. Show me the $3K gaming PC running the full 397B at any speed.
RoutineSundaes@reddit
I don't think you understand how real users are putting these things to use.
FatheredPuma81@reddit
No the post is about 2 RTX 6000 Pro Blackwells in a build being considered a "Budget build" :). Just read the title of the post you made.
nihnuhname@reddit
Budget-friendly for business, not for hobbies.
AlwaysLateToThaParty@reddit
That was my take, yeah.
Far-Low-4705@reddit
honestly thats not that difficult.
the 400b model only has 17b active params, so as long as you have a higher-end GPU to run the computation-bound layers, and offload the memory-heavy but compute-efficient experts to the CPU, you'd only need like 200GB of RAM, which before the RAM crisis wasn't unheard of for less than $3k.
i have an AI server with 64GB VRAM + 16GB RAM that i built for $80 net cost (scavenged old parts, collected them, got some free parts, sold them, upgraded, bought only extreme budget deals).
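A back-of-envelope way to see why this works, and what it costs: decode is memory-bandwidth bound, so time per token is roughly bytes touched per tier divided by that tier's bandwidth. All numbers below are illustrative guesses, not measurements:

```python
# Toy decode-speed model for MoE CPU offload. Decode is memory-bandwidth bound,
# so seconds per token ~= (GB read from VRAM)/VRAM_bw + (GB read from RAM)/RAM_bw.
# Bandwidths and byte counts below are rough illustrative guesses.
def moe_decode_tok_s(gpu_gb_per_tok: float, cpu_gb_per_tok: float,
                     gpu_bw: float = 900.0, cpu_bw: float = 60.0) -> float:
    return 1.0 / (gpu_gb_per_tok / gpu_bw + cpu_gb_per_tok / cpu_bw)

# ~17B active params at ~4.5 bits -> call it ~9.5 GB touched per token.
print(moe_decode_tok_s(9.5, 0.0))  # everything resident in VRAM
print(moe_decode_tok_s(2.5, 7.0))  # attention on GPU, most experts in system RAM
```

The point isn't the absolute numbers; it's that the RAM term dominates as soon as a few GB per token come from system memory, which is why "fully in VRAM" is worth chasing.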
DeepOrangeSky@reddit
Sorry for the beginner question (and also, not sure if this was a typo or not) but, isn't there some rule of thumb where the minimum amount of system RAM you can have has to be at least the same amount (or more) as the amount of VRAM you have, in order for it to work? Or are there ways around that?
Also, as far as how MoEs work in regards to GPU + a bunch of system RAM setups (as opposed to, say, a mac), is the basic idea that as long as you have enough VRAM to fit all the active parameter size of the MoE model + context into VRAM, you lose almost no speed compared to fitting the whole entire MoE model into VRAM?
Like, let's say someone has an RTX 3090 with 24GB of VRAM, enough to fit the 10 billion active parameters of MiniMax 230b a10b (or even the 17 billion active of Qwen 397b a17b, as you mentioned) plus a decent amount of context. Would that run at close to the speed of a setup that fits the entire model, total parameters and all, into VRAM (say, a dozen 3090s)? In other words, when only the active parameters + context live in VRAM and the rest of the MoE sits in system RAM, does it run at like 90% of full-fit speed, or 50%, or 10%?
I realize there are all sorts of complicating factors (single vs. multi-card, ROCm vs. Vulkan, and so on), but I'm just curious at a rough ballpark level. I'm fairly new to this, so I don't have even a rough idea. I assume it still goes pretty fast with just the active params + context in VRAM, but is that like 30% of full-fit speed, or 80%?
Thanks-Suitable@reddit
I personally really value the perspective. Yes, it's not consumer-grade hardware, but everything that doesn't sit in a datacentre is a win in my book. Great job on the post!
_bones__@reddit
The word "budget" in this context means "restricted budget". Not "as much budget as one can spend".
psychicsword@reddit
That is a value concept, not a budget one.
Visual_Synthesizer@reddit (OP)
r/LocalLLaMA: "we need cheaper local inference" me: here's how to save $10K and go faster on any platform
r/LocalLLaMA: "too expensive"
Visual_Synthesizer@reddit (OP)
its 18% faster and roughly 10k cheaper than a threadripper on the exact same stack! this can trickle down to gen4 parts also! gen4 plx switches are really cheap ;-)
--Rotten-By-Design--@reddit
I get that, and it is a very nice build, but there is just nothing "budget" about it, unless we are talking a Nasa budget, or a yearly rent budget perhaps
Visual_Synthesizer@reddit (OP)
Think of it as a blueprint. i could do this with 3090s hacked to 48gb or 4090s hacked to 48gb and make a really nice inference engine
triptickon@reddit
If you taught people to do this with 3090s you could sell a course 😂
Visual_Synthesizer@reddit (OP)
Love the 3090s. My first AI server I built in 2020 had them. FYI, my old TRX40 system could scale to 8x 3090s GPUs with P2P low latency. The latency is what makes a ton of difference for TG. Scratch out threadripper pro and insert any CPU into this chart, and swap out the PM50100 for Gen4 PLX parts :-)
tat_tvam_asshole@reddit
Correct me if I'm wrong, but I'm not sure just 'any' CPU/mobo has enough PCIe lanes to support 8 3090s. And even if you bifurcate PCIe 5.0 slots to fit more cards, you're introducing latency, and I'm not sure it's even spatially possible.
Visual_Synthesizer@reddit (OP)
That's exactly what the PLX switch solves. You don't need 8 x16 slots from the CPU. The switch takes one x16 upstream from the CPU and fans it out to multiple downstream x16 ports. My PM50100 has 2 downstream ports (2 GPUs), and can scale to 5. The 8+ GPU setups in the rtx6kpro community use 2-3 switches, all hanging off a single CPU with limited lanes. It's not bifurcation, it's switching. The latency through the switch is sub-microsecond. That's the whole point of this build.
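The lane math, assuming an x16 upstream link and x16 per GPU on a 100-lane part:

```python
# How many x16 GPUs a 100-lane switch can fan out from a single x16 upstream link.
total_lanes, upstream, per_gpu = 100, 16, 16
print((total_lanes - upstream) // per_gpu)  # -> 5
```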
Hedede@reddit
That works only if your cards support P2P (which rtx6kpros support). 3090s don't support P2P without a hacked driver, they need to go `gpu -> switch -> host -> switch -> gpu`.
Visual_Synthesizer@reddit (OP)
very true, the hacked driver is used on gen4 switches.
Hedede@reddit
No, you need the tinygrad kernel module: https://github.com/tinygrad/open-gpu-kernel-modules
There are also newer drivers: https://github.com/aikitoria/open-gpu-kernel-modules
But with those I'm getting MCEs when 3090s try to communicate with each other
Hedede@reddit
There's no such thing as "3090s hacked to 48gb".
Georgefakelastname@reddit
They exist but they’re custom done and expensive as shit.
Hedede@reddit
I keep hearing that but I haven't seen any proof of their existence. And don't confuse them with 4090s.
banecroft@reddit
They ship the hacked 3090 with custom bios, huge business in China.
Hedede@reddit
Where? Haven't seen a single one.
nihnuhname@reddit
You can watch reviews on YouTube where people unbox, run, and test these GPUs. There are even modified versions of the RTX 4090. The downside of these GPUs is that they run very hot. If they use air cooling, the fans are also modified and roar like a jet engine.
Hedede@reddit
Where??? I'm talking about **3090**.
CryptoUsher@reddit
so you're getting 198 tok/s on 2x RTX PRO 6000 Blackwell, thats a pretty solid setup, how do you think the results would change if you swapped out the EPYC 4564P for something like an AMD 7742 or an intel xeon, would the difference in cpu architecture have a noticeable impact on performance?
CheatCodesOfLife@reddit
Good idea! I just tried swapping to an AMD 7742 and now I'm getting 67 tok/s... Why is that??
CryptoUsher@reddit
so the 7742 is a pretty old cpu at this point, iirc it's zen 2 architecture which is a lot different from the zen 3 in the 4564p, that might be why you're seeing such a huge drop in performance
NoahFect@reddit
Can someone explain why this guy is getting dragged?
ffpeanut15@reddit
Check the top post in the sub
BardlySerious@reddit
What is your name? What is your quest? What is your favorite color?
kwinz@reddit
Give me a solution to the Navier-Stokes equations
ikkiyikki@reddit
Huge difference. I have a 2x 6000 setup on a regular gaming pc w/ a shitty AMD 7950x3d cpu and get \~100tk/s on Qwen3.5 122B (Q5)
CryptoUsher@reddit
yeah the cpu bottleneck is real with those big models, 7950x3d isn't helping much despite the cache. wonder if pcie overhead or memory bandwidth is dragging it down even more on your end
Visual_Synthesizer@reddit (OP)
No difference on inference. CPU sits at \~3% during decode. It's 100% GPU memory-bandwidth bound at C=1. The CPU only matters for FlashInfer JIT compilation and server startup. An older Xeon or EPYC 7742 would give identical tok/s, just slower boot times.
CryptoUsher@reddit
so that's pretty much what i expected, gpu memory bandwidth is a huge bottleneck for these kinds of workloads, tbh i'm surprised the cpu usage is that low even during decode.
FinalCap2680@reddit
It is a quality budget - running q3, q4 quants with two RTX PRO 6000?!?!?!
If that is not a waste....
PS With that budget, OP can probably afford some more RAM. If it was me, I would have 1TB (or more) RAM and run GLM 5.1 Q8.... :)
unjustifiably_angry@reddit
If this is replacing an online model and you have a lot of token usage then it's a budget build within a matter of months.
hesperaux@reddit
Came here to say this
Nick-Sanchez@reddit
Well it's a budget. Humongous budget, but a budget indeed.
--Rotten-By-Design--@reddit
I'll say, if I were to pick up 2 of those cards alone where I live, they would cost 26.5K
Visual_Synthesizer@reddit (OP)
sounds like the real budget hack is moving
Mysterious_Bison_907@reddit
$20,000+ in GPUs alone is not a "budget" build.
Interesting-Town-433@reddit
Are you doing speculative decoding?
Visual_Synthesizer@reddit (OP)
Yes. The 198 tok/s is with NEXTN speculative decoding (5 steps, 6 draft tokens) on SGLang. Without speculation the same setup does \~120 tok/s. The 122B has built-in MTP heads that NEXTN uses as the draft model, so no separate drafter needed. Full launch command with all the flags is in the repo.
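For intuition on where a gain like that comes from, the standard model: with k draft tokens and per-token acceptance probability a, the expected accepted tokens per verify step is a geometric series. The acceptance values below are free parameters, not measurements from this rig:

```python
# Expected accepted tokens per verify step with k draft tokens and per-token
# acceptance probability a: 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
def expected_tokens_per_step(k: int, a: float) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

# Measured net gain here: 198/120 ~= 1.65 accepted tokens per step.
print(round(198 / 120, 2))  # -> 1.65
for a in (0.4, 0.6, 0.8):
    print(a, round(expected_tokens_per_step(6, a), 2))
```

Reading the measured 1.65x back through the series suggests a fairly low effective per-token acceptance once draft/verify overhead is folded in, which is why high-acceptance tasks speed up so much more.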
Interesting-Town-433@reddit
Did you evaluate other speculative decoders?
Visual_Synthesizer@reddit (OP)
yeah a bit buggy and all over the place. will post more results when i get the time
Interesting-Town-433@reddit
Yeah lmk I'm optimizing these things atm so lmk happy to help if you want to test some of my builds, dm me
JayPSec@reddit
Please provide links for the models used.
libbyt91@reddit
20,000+ budget build, lol
Visual_Synthesizer@reddit (OP)
The good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap! Think of it as a blueprint. i could do this with 3090s hacked to 48gb or 4090s hacked to 48gb and make a really nice inference engine
NoahFect@reddit
Very cool. Will it scale to 4x RTX6000 boards, or does the PCIe switch method only support 2x?
You haven't run into problems with DRAM << total VRAM? Seems like the conventional wisdom is that DRAM should be at least 2x VRAM.
Visual_Synthesizer@reddit (OP)
yes! the point is the scaling. the switch has 100 lanes and i think that would support 5 gpus on this board with a single x16 root. for 2 gpus, its probably not worth bothering. but if you ever want to scale above that its ideal.
NoahFect@reddit
Thanks much. I'm getting increasingly interested in putting a 4x setup together.
This unit, right? With one of these and two of these?
What sort of support frame are you using to hold the GPUs?
JayPSec@reddit
correct pci lane switch
for 4xGPUs you'd need double the adapters and cables, plus a host board and 2 more mcio cables to connect your main pcie to the switch. The host board can be a retimer but that may be overkill.
In my case I have everything in case and the retimer host board was not needed. Bought [this](https://c-payne.com/products/mcio-pcie-gen5-host-adapter-x16-passive) instead.
JayPSec@reddit
c-payne?
JayPSec@reddit
I run 5 Max-Qs with the same board on a 9950X and 128GB of DDR5. No issues here.
aabelr@reddit
All data is public
Where are the links and Discord?
Look_0ver_There@reddit
Posting about $30K of equipment and calling it a "Budget Build" in the same breath is certainly something.
Also "Secret Sauce" automatically gives this away as AI-produced drivel.
Visual_Synthesizer@reddit (OP)
im dyslexic so i use ai to help me write. i also use calculators. The good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap! Think of it as a blueprint. i could do this with 3090s hacked to 48gb or 4090s hacked to 48gb and make a really nice inference engine
Zidrewndacht@reddit
Tell us more about 3090s hacked to 48GB.
Intraluminal@reddit
Now THAT you could sell for small businesses
Look_0ver_There@reddit
I'd actually appreciate the human aspect of your natural writing, even if the words were a bit jumbled. At least then it would feel authentic.
I do have a question though. The 198t/s, is that parallel generation, or single user generation? I'm going to guess parallel, right?
a_beautiful_rhind@reddit
Dumb models keep saying this and it's still wrong. In what world will your utilized and active GPU put the link to sleep? Plus you are disabling ASPM for all your other devices with that kernel parameter. Enjoy your pointlessly high idle.
Visual_Synthesizer@reddit (OP)
haha, facts. they never know when they dont know. ill do some tests for fun. thanks!
StopwatchGod@reddit
Budget build... 8-10 grand per GPU...
We are not playing the same game here lol
Visual_Synthesizer@reddit (OP)
i believe in you!
Such_Advantage_6949@reddit
Thanks for sharing, i have a threadripper too. So basically the switch helps make the inter-GPU connection faster than plain PCIe? Did you encounter any issues or difficulty with the setup? Plain PCIe pretty much doesn't need anything extra other than installing the nvidia driver
Visual_Synthesizer@reddit (OP)
updated the post. looks like switch is mostly enabling scaling and unlocking more consumer parts for multi GPU rigs.
smflx@reddit
Thank you so much!!
BTW, how do you feel nvfp4 quality? I had an experience of messy awq of glm-4.6. So, still staying in fp8.
Visual_Synthesizer@reddit (OP)
seems good so far, makes it a lot easier to run large models.
smflx@reddit
Thank you. I will try.
Aware_Photograph_585@reddit
I have 2x RTX PRO6000 on a EPYC 7003 platform (PCIE 4.0) running Ubuntu 22.04, and would like to implement some of this.
Can you explain more about: PCIe switch (PIX topology) — GPU-to-GPU through switch fabric, not CPU
If I understand correctly: when using the PLX chip, GPU-to-GPU communication occurs through the PLX chip, not PCIe lanes interconnect on CPU.
1) Does this work with any PLX chip? Or are there certain requirements?
2) I'm assuming the GPUs have to support P2P, which RTX PRO6000 does, but standard consumer GPUs do not.
Can you also explain further this part, I have no idea what it means:
Performance governor + pci=noacs + uvm_disable_hmm=1 — without these, P2P hangs and GPUs wedge
Thanks in advance.
Visual_Synthesizer@reddit (OP)
Your EPYC 7003 + 2x RTX PRO 6000 is a solid starting point.
PCIe switch: You're correct. P2P DMA goes through the switch silicon, CPU is completely bypassed. Every TP decode step does dozens of small allreduces and the GPU blocks on each one. For MoE models the messages are tiny (10B active params). Bandwidth doesn't matter, latency per sync does. Sub-microsecond through a switch vs microseconds through a CPU root complex, hundreds of times per second.
Which switches: Microchip/Microsemi (c-payne PM50100, PM40108) are what you want. Broadcom PEX890xx has a posted-write collapse bug (52 GB/s vs 196 GB/s on Microchip in 8-GPU community tests). For your Gen4 platform, Gen4 Microchip switches show up on eBay in old mining boards for $200-500. Also, rumor has it we will see gen6 switches with 160 lanes this summer.
P2P support: RTX PRO 6000 supports P2P natively. 3090 does too. Consumer cards (4090, 5090) have P2P disabled in driver but community patches exist.
Kernel params (critical, took me days to find):
`pci=noacs` -- disables Access Control Services. Without this, P2P still routes through the CPU even with a switch. Your switch becomes useless.

`uvm_disable_hmm=1` -- add `options nvidia_uvm uvm_disable_hmm=1` to `/etc/modprobe.d/uvm.conf`. Without this, sustained P2P DMA wedges the GPU into ERR! state after a few minutes. Hardest bug to find.

`performance` governor -- `echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`. \~5% uplift; CPU stops downclocking between allreduce calls.

Also add `iommu=pt` to kernel params and disable ASPM in BIOS.

Gen4 PLX with Blackwell GPUs: I haven't tested this combination so I can't confirm the latency advantage holds across mixed generations. The theory is sound but I'd want real numbers before claiming it. Also, the Gen5 c-payne PLX is programmable, so it's theoretically possible to configure a Gen4 upstream root that fans downstream to a Gen5 GPU cluster with custom firmware. I considered trying this but ran out of time and moved to a native Gen5 platform. If you experiment with it, I'd love to hear the results.
https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/hardware/topology.md
The ACS and HMM bugs are the ones that'll waste your time if you don't know about them in advance. Happy to help if you run into issues.
Aware_Photograph_585@reddit
Thanks for the info. I'm a bit familiar with simple multi-gpu topology and its effect on speed from playing with training models on FSDP/deepspeed.
Looks pretty easy to setup. I can get a PCIe 4.0 PEX88048 card for $150-200, and it supports DMA P2P. Cheap & easy to setup, just need to swap out my retimer cards for it.
Didn't know the RTX3090 supported P2P natively. I thought you needed the tinygrad drivers. I've got a 2x RTX3090 build I threw together from used parts, and the motherboard has PEX8747 chips. I'll research if I can also make it work with that.
Thanks again. Much appreciated.
Visual_Synthesizer@reddit (OP)
hey! i double checked, the p2p i had was through nvlink on my 3090s. looks like you do need to unlock p2p at the driver level with older cards. blackwell 6000s support it.
i also did more testing. turns out you dont even need a switch to get p2p. from my AI sysadmin:
Good questions. For 2× RTX PRO 6000 on EPYC 7003 Gen4, you probably do not need a PCIe switch at all.
Your board likely already gives you two direct Gen4 x16 slots from the CPU, and for a 2-GPU setup that is effectively the same as, or better than, adding a switch. The switch only becomes necessary if you want to scale to 4+ GPUs later.
Also, small correction to my own framing: I had mentally conflated NVLink with P2P. NVLink is just one interconnect for GPU-to-GPU communication. PCIe can also do P2P, as long as the hardware, topology, driver, and kernel settings allow it.
Here is the breakdown.
1. How PCIe switches work, and whether you need one
You are right that PCIe switches (PLX was one vendor; today it is mostly Microchip/Microsemi and Broadcom) create a private GPU-to-GPU fabric. When GPU 0 writes to GPU 1 through a switch, the data never touches the CPU. The switch routes it directly between downstream ports.
But the key point is this: direct-attach topologies do the same thing, just through the CPU root complex instead of switch silicon.
On a modern server CPU like EPYC 7003 with 128 Gen4 lanes, each GPU slot is a real x16 directly off the CPU, and P2P DMA between two slots on the same CPU works correctly. The CPU is not really "in the way" here. It behaves more like a passive router.
So at exactly 2 GPUs, the performance difference between two direct CPU-attached x16 slots and two slots behind a switch is basically zero, assuming both are configured correctly.
A switch becomes useful when you want 4+ GPUs in one system, because then you start running out of clean direct x16 slots and need fan-out.
Bottom line for your setup: skip the switch, use the board's native slots, and focus on kernel/driver config instead.
2. GPU requirements
You have this right.
RTX PRO 6000 supports hardware P2P natively. NVIDIA enables BAR1 peer-to-peer writes on workstation and datacenter cards.
Consumer cards like 4090 / 5090 have P2P disabled in the driver.
3090 is a weird middle case. It can sort of work through the UVM software path, but generally routes through host memory unless you use patches, so most people treat it as effectively disabled.
So for your setup: both RTX PRO 6000s do P2P natively, no driver patches needed.
3. What the kernel and modprobe settings actually do
CPU performance governor
Linux often defaults to `ondemand` or `powersave`. During LLM inference, the CPU can go briefly idle between allreduce sync points. The governor sees that and downclocks. Then the next sync arrives and the CPU has to ramp back up again. That ramp latency can eat a noticeable chunk of decode throughput.

Setting the governor to `performance` keeps clocks pinned and avoids that behavior. To make it persistent:
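A minimal sketch of one way to do that (assumes systemd; the unit name `cpu-performance.service` is mine, not from the thread):

```shell
# One-shot: pin all cores to the performance governor right now
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Persistent: a tiny systemd unit that re-applies it at every boot
sudo tee /etc/systemd/system/cpu-performance.service <<'EOF'
[Unit]
Description=Pin CPU governor to performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable cpu-performance.service
```

Installing `cpufrequtils` and setting `GOVERNOR=performance` in `/etc/default/cpufrequtils` is an equivalent route on Ubuntu.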
pci=noacs
ACS (Access Control Services) is a PCIe feature mainly meant for virtualization and isolation.
The problem is that ACS can force cross-device traffic back up through the CPU root complex in a way that hurts GPU-to-GPU P2P behavior.
`pci=noacs` disables that behavior and lets the GPUs DMA directly to each other. This is useful on a trusted host where you care about performance more than strict isolation.
On Ubuntu/GRUB, that means adding it to the kernel cmdline and rebooting.
uvm_disable_hmm=1
HMM (Heterogeneous Memory Management) is a newer path in NVIDIA UVM that can migrate pages between system RAM and VRAM.
That is useful in some cases, but under sustained P2P DMA load it can interact badly with the driver and produce the classic failure mode where the system works for a while, then a GPU wedges into
`ERR!` state and falls off the bus. Disabling HMM forces the older, more stable path.
This is one of those bugs that is hard to diagnose if you do not already know about it.
4. The extra thing that was missed originally
For direct-attach topologies like a normal 2-GPU EPYC system, there is one more modprobe setting worth knowing about.
Without it, SGLang's custom allreduce kernel can silently fall back to a much slower path.
What this does is force NVIDIA to use BAR1 P2P writes instead of host-memory staging.
Without it,
`--enable-pcie-oneshot-allreduce` may auto-disable itself for small message sizes and fall back to NCCL. With it, the custom allreduce path stays active and performs much better for the decode-sized allreduces that actually matter.
Important: only do this on direct-attach / NODE systems.
If your topology is PIX / PXB because you are using a real PCIe switch, do not apply it.
Check first: run `nvidia-smi topo -m` and see what sits between the two GPUs.
Rule of thumb: NODE or SYS (direct-attach through the CPU) means apply it; PIX or PXB (behind a real switch) means don't.
Verify after reboot that the module option actually loaded.
5. Complete setup sequence
Assuming Ubuntu 22.04 and starting clean:
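A sketch of what that sequence presumably looks like on Ubuntu/GRUB (stock paths; double-check your existing `GRUB_CMDLINE_LINUX_DEFAULT` before letting sed touch it):

```shell
# 1. Kernel cmdline: disable ACS, set IOMMU passthrough.
#    The sed '&' re-emits the matched prefix, then prepends our flags.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&pci=noacs iommu=pt /' /etc/default/grub
sudo update-grub

# 2. NVIDIA UVM: disable HMM to avoid the ERR!-state wedge under sustained P2P
echo 'options nvidia_uvm uvm_disable_hmm=1' | sudo tee /etc/modprobe.d/uvm.conf
sudo update-initramfs -u

# 3. CPU governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 4. Reboot so the cmdline and module options take effect
sudo reboot
```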
After reboot, verify:
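A sketch of the verification (the `nvidia_uvm` sysfs path assumes a recent driver that exposes its module parameters there):

```shell
# Flags actually applied at boot
grep -o -e 'pci=noacs' -e 'iommu=pt' /proc/cmdline

# UVM module option loaded? Expect: 1 (HMM disabled)
cat /sys/module/nvidia_uvm/parameters/uvm_disable_hmm

# Governor pinned on every core? Expect a single line: performance
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Link topology between GPUs looks right (PIX/PXB behind a switch, NODE direct)
nvidia-smi topo -m
```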
Also disable ASPM in BIOS for extra stability.
6. How to verify it is actually working
Build and run the CUDA sample:
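Presumably `p2pBandwidthLatencyTest` from NVIDIA's cuda-samples repo (build layout differs between releases; newer tags use CMake at the repo root, older ones ship per-sample Makefiles):

```shell
git clone --depth 1 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest

# Older tags: plain make in the sample dir; newer tags: cmake fallback
make || (mkdir -p build && cd build && cmake .. && make)

# Prints unidirectional/bidirectional bandwidth and latency matrices,
# each measured with P2P disabled and enabled
./p2pBandwidthLatencyTest
```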
You want to see the P2P-enabled bandwidth numbers well above the disabled ones, and enabled latency dropping to low single-digit microseconds on Gen4 x16.
If enabled bandwidth is low and latency is still \~14 µs, something is wrong. Usually that means:
- `uvm_disable_hmm=1` is not loaded

7. Final software-stack tip
Once the kernel side is correct, one of the biggest wins on 2× RTX PRO 6000 is running SGLang with the right container and kernels.
For Qwen3.5-122B NVFP4, people are seeing around 170-200 tok/s at single-user decode on tuned 2-GPU setups, which is substantially better than stock vLLM on the same workload.
So the practical conclusion is: at exactly 2 GPUs, skip the switch, use the board's native x16 slots, get the kernel and driver settings right, and run a tuned SGLang stack on top.
NoahFect@reddit
Really useful to see all this info in one place. Thanks!
fmlitscometothis@reddit
I thought P2P works without needing a dedicated switch?
Ruin-Capable@reddit
LOL $20K in GPUs and it's a "budget" build.
Fit-Statistician8636@reddit
Interesting and kudos :). For one or just two parallel requests, would running 122B on a single RTX PRO 6000 be significantly slower?
Visual_Synthesizer@reddit (OP)
would be slower yes.
ironmatrox@reddit
Interesting. I'm just standing up a dual 6000 pro as well. Will definitely look into your configuration. Thank you!
Visual_Synthesizer@reddit (OP)
updated the post with new info that can save you lots of time
david_0_0@reddit
the pcie switch topology insight is sharp - you're optimizing for tiny allreduce synchronization latency, which matters for moe sparse ops. but does the 18% gain hold with dense models or variable batch sizes? because the latency win disappears if you're padding batches to hide allreduce overhead. also curious if sglang's b12x moe kernels work equally well outside moe - the 26% over flashinfer might be infrastructure-specific.
Visual_Synthesizer@reddit (OP)
updated the post! i was actually wrong. did a bunch of testing. the switch mostly helps with scaling and offers equivalent speeds.
Edzomatic@reddit
A lot of people are commenting on the "budget". However a few years ago the A100 was 30k by itself and had 80gb of vram. I'm glad we can get more than double the vram for about the same price
Visual_Synthesizer@reddit (OP)
yeah! plus gen4 parts are cheap and abundant. switches really help those systems scale to more gpus. for example, i have a trx40. two 16x gen4. could run two switches and 8 GPUs on a 6 year old platform and get really close to gen5 speeds! even old am4 systems with one 16x could run 4 gpus with a cheap Chinese switch.
R_Duncan@reddit
I have available cheaper setup (single rtx 6000, epyc cpu) and interested in how much context and in particular:
| Qwen3.5-27B FP8 | 170 | vLLM DFlash | 2B drafter, 2 GPU |
1) why not NVFP4? should be faster, but in my setup (flash_attn as flashinfer seems to have issues on cuda 13.0) I barely reach 30 t/s at 256k context with vllm (llama.cpp is at about 60)
Visual_Synthesizer@reddit (OP)
updated the post with new testing i did last night. you can enable p2p on your system without a switch! commands in the post.
Dflash has really good MTP. IIRC it's diffusion based. looking forward to testing their 122b and 397b models that are coming out soon.
xgiovio@reddit
2 6000 are 20k euro. Imagine the api calls you can make to qwen 3.6 plus with the same money.
Budget build is 2x 3090 with a system that costs around 1500 euro, running models with ~30b parameters and 3-4b activated: 15-20 GB for the model with 4-bit quants and 20-25 GB for a 100k qkv cache context at full precision.
Bye
Hector_Rvkp@reddit
imagine you're a law firm, or have devs that dont want their code to be studied and stolen by Scam Altman & the likes. Budget is relative. The point of the OP is that this rig costs less than a threadripper build, i think.
xgiovio@reddit
Tr with 64 or 96 core have max bandwidth of 600/800GB/s but you can have TBs of ram. The rtx6000 is like a 5090 with 96gb of ram with a bandwidth of 2000 GB/s.
A tr gives you more ram but less inference speed. You can load large models.
A gpu less ram but more speed on inference. Is a tradeoff os size and speed
Hector_Rvkp@reddit
Didn't realize DDR5 ram could go that fast (bandwidth). LLMs are telling me that 256 GB (8×32 GB) DDR5-5600 ECC RDIMM Registered kit cost over $10k, though, and the CPU costs 7.5+, and the mobo 1.3k, so you're looking at $20k for something that's got 256gb ram, but way, way slower (and no CUDA stack).
Inversely, 2 blackwell 6000 w cheap everything (Ryzen 7, full PCI5 mobo like Gigabyte X870E Aorus Elite AX, little ram) also ends up costing you $20k.
That surely helps explain why RAM prices went up, didn't know you could get 600+ gb/s bandwidth w a CPU + DDR5.
At current prices though, given the speed & compute differences (dramatically better w Nvidia), i dont think it makes sense to run LLMs on DDR5 RAM. The RAM would have to cost way, way less. And i'm also shocked how expensive the threadripper pro mobo / CPU is.
Visual_Synthesizer@reddit (OP)
yeah ram is insane right now. i scored this am5 used, with 128gb ram and CPU plus switch for 5k.
i think its possible to just run 1 ram stick though? if you are optimizing for VRAM you dont need much. if you want to offload with llama.cpp thats another story.
I went for cheapest platform and max vram. would rather have a threadripper wx90 and tons of ram. but it would cost a lot more.
xgiovio@reddit
it depends on the size of the model. tr 7k and 9k use 8 channels and 12 channels, so that's why bandwidth is high. A normal non-pro cpu stalls at 250 gb/s on 4 channels. Consumer on 2 channels stalls at 100gb/s. Now yes, 2x 6000 give you 200gb of space, but you have to allocate space for the model and kv cache for the context. So if you play with models around 100gb it's ok, because you can use another 100gb for big contexts or multi-user serving. If you instead need more context than speed, or very big models, you have to add more gpus for more vram or buy more ram kits on a tr.
The best solution is having a tr with 96core, 2tb of ram and 6x6000. So you can have the best of all. But you have to spend around 100k or more
Infninfn@reddit
[insert snark here]
Visual_Synthesizer@reddit (OP)
Fair point lol. The GPUs are the GPUs. The "budget" part is the platform: $2K for board+CPU+RAM vs $15K+ for Threadripper Pro + DDR5. Same GPUs, 18% faster, $10K less. The PLX switch pays for itself. I scored the B650 used. Good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap!
UnifiedFlow@reddit
Show me where to get 128GB DDR5 ECC and a board and a cpu for $2k and I'll give you a $500 finders fee. That ram ALONE is 3-4k
Visual_Synthesizer@reddit (OP)
i bought it used. should have been in the post. my bad!
EbbNorth7735@reddit
Nothing about your post is budget
NoahFect@reddit
Not everybody is stone cold broke.
Hedede@reddit
You don't need $15K Threadripper Pro. You can buy EPYC 9124 for $200 + SP5 motherboard for $800 and have 128 PCIe lanes.
jleuey@reddit
Point us where
Hedede@reddit
eBay
texasdude11@reddit
Lol exactly.
Clear-Ad-9312@reddit
found something called rackrat, which seems decent, but looks to be only for ebay listings.
Visual_Synthesizer@reddit (OP)
am4 with a plx would smoke the old gen4 server parts and be cheaper
Hector_Rvkp@reddit
Interesting. so the idea is that the switch does the work of getting the GPUs to work together, instead of the CPU/mobo? Do you need a threadripper to run 2 blackwell 6000 properly though? I thought the threadripper was useful primarily because it increased the bandwidth on the DDR5 RAM?
Visual_Synthesizer@reddit (OP)
the idea was to see if i could get a cheap motherboard to scale to 4+GPUs with equal or more performance. could do this with gen4 boards.
Frosty_Chest8025@reddit
nice, didnt know about c-payne PM50100 Gen5 PCIe switch
thought 2x full pcie 16x gen5 slots would be the best possible option but thats not the case then.
Visual_Synthesizer@reddit (OP)
gen 5 does work well if its configured properly. I updated the post with new tests i did last night. the switches are great for scaling beyond 4+ GPUs or using cheaper motherboards to save money on builds
gurilagarden@reddit
So what you're saying is you put an F1 engine in a Kia Rio. What's even the point? Anyone who has the money for 2x 6k's isn't gonna slap them on a raspberry pi.
Visual_Synthesizer@reddit (OP)
having fun and seeing if it would cost me less for equal or more performance.
vishalgoklani@reddit
I’m new to pci switches can you explain. Where did you buy it and how much does it cost. Do you plug the cards directly into the switch ? Does it support the latest cuda drivers? Or are you using Georges hacked cuda driver? Thanks
Visual_Synthesizer@reddit (OP)
they help systems scale. you attach two 8x cables off the host adapter or MB MCIO connector to the switch. it expands your PCIe lanes. my c-payne has 100, others have more or less. it supports cuda. im not using a hacked driver. blackwell 6000 supports p2p. not sure about others, as i had nvlinks in my old system.
CalligrapherFar7833@reddit
You couldve bought a multi pcie lane mb and cpu for the price of that switch
Visual_Synthesizer@reddit (OP)
true! it was a fun experiment though. It unlocks a lot of lower end hardware on all gens. plus, ECC ram is really expensive on gen5 server boards. i guess i could buy just one 32gb stick and a threadripper or epyc might boot. dont really need much ram for inference. but i do a lot of loras and stuff for work so i need a decent amount of ram. came off 256 with the trx40. this setup will scale nice to 5 GPUs for me
smflx@reddit
Ahh, with this setup. Hot air will go from one GPU to another. Temperature will be different :)
Visual_Synthesizer@reddit (OP)
temps have been good so far. mining racks are pretty standard for ai rigs.
OmarBessa@reddit
How much money is all that?
FullOf_Bad_Ideas@reddit
You can run Qwen 3.5 397B on 2x RTX 6000 Pro with EXL3 and TabbyAPI at full 262k ctx and probably good speed, at quality that's probably higher than NVFP4 or GGUF Q3
speeds reported by vllm with speculative decoding enabled are often false, make sure to double check that they match with actual output tokens outside of VLLM
slop, RTX 6000 Pro does not have HBM memory.
exact_constraint@reddit
Nice! Someone using the C-Payne switch. Next big step in my setup is to make the switch (lol) so I can get full bandwidth between cards, hard to find people using them though. Interesting that you went that direction even w/ an Epyc system on two cards. Good to know the latency benefit is there.
Visual_Synthesizer@reddit (OP)
a bunch of people on the rtx6kpro discord using them. lots of info in the github: https://github.com/Visual-Synthesizer/rtx6kpro/blob/master/hardware/wrx90-cpayne-microchip-switches.md
exact_constraint@reddit
King shit, thanks.
rebelSun25@reddit
Budget?
Visual_Synthesizer@reddit (OP)
Fair point lol. The GPUs are the GPUs. The "budget" part is the platform: $2K for used board+CPU+RAM vs roughly $15K+ for Threadripper Pro + DDR5. Same GPUs, 18% faster. The PLX switch pays for itself. Good news is that this scales down to gen 4 parts, and gen 4 PLX switches are cheap! It's a blueprint
TableSurface@reddit
I regret going AM5 instead of the EPYC build I was looking at, especially now that RAM is so much more expensive. At the time, MoE models weren't a thing and I couldn't justify spending 2-3x for the platform.
In retrospect, having 8x platform memory bandwidth for 2-3x cost is cheap...
banecroft@reddit
Bro the GPU is like the most important and expensive part of this build.
Visual_Synthesizer@reddit (OP)
i can buy another 6000pro off the money i saved not going threadripper and paying the ram tax. GPUs are all that matter for inference. did you miss the part where i said you can run any generation of GPU on this blueprint? this unlocks more hardware for all types of budgets and offers optimal performance.
banecroft@reddit
I’m not saying it’s not efficient, I’m saying it’s not a budget build. You can have a setup that saves you 2 million, but if it costs 1 million to get it running - that’s not a budget build, you see what people are saying?
wektor420@reddit
What they mean is a single rtx 6000 pro is like 3 months of salary for a developer outside US
Blanketsniffer@reddit
in concurrent serving scenarios, at what concurrency does per-user throughput stay above a minimum of 50 tok/s?
Mean-Sprinkles3157@reddit
Can you share the parameters for sglang (b12x+NEXTN)? Thanks.
brickout@reddit
Budget. Lol
Visual_Synthesizer@reddit (OP)
the GPUs cost what the GPUs cost. I saved $10K on everything else and got 18% more speed. that's like complaining a race car is expensive while ignoring that the other guy's race car costs more and is slower
brickout@reddit
No, it isn't. It just shows how different your perspective is when you have money. It's even more worrisome that you're defending it so hard. You live in a different reality from the vast, vast majority and clearly don't appreciate it. It's condescending.
laterbreh@reddit
If you havent figured it out yet, you jebaited everyone with the title of your post. Im starting to think you did this on purpose so you can have passive-aggressive comments to people.
Available-Goose9245@reddit
These are solid numbers thanks for sharing
Own_Ambassador_8358@reddit
Have you tried qwen3.5-397b? 3bit quant? How fast would it go on your build?
Thanks :)
Visual_Synthesizer@reddit (OP)
yes! 79 tok/s
Own_Ambassador_8358@reddit
Thank you <3
SpicyWangz@reddit
Every time I use the 122B model in opencode, it starts getting into a loop of endlessly compacting. 35B never has this issue though.
Visual_Synthesizer@reddit (OP)
probably a low quality provider.
AlwaysLateToThaParty@reddit
And that's why I don't use cloud providers. You have no idea what you're talking to.
acetaminophenpt@reddit
2x rtx Pro 6000 - budget build I just literally spitted my coffee
Hector_Rvkp@reddit
spit spat spat, dummy.
Expert_Bat4612@reddit
What does a machine like this cost ?
Hector_Rvkp@reddit
A lot, but less than a threadripper. Ask an LLM. Each GPU alone is 8k+, and the switch is probably 2+k. The novelty here, afaik, is the use of the switch. Usually you would plug the gpus on the mobo and call it a day.
Fresh_Month_2594@reddit
has anyone experienced that when using MTP/Speculative Decoding with Qwen 3.5 and VLLM, structured outputs breaks/becomes unreliable ?
Prize_Negotiation66@reddit
I have 30 t/s with single 4090 + 32 ddr5 6400, at ud iq2
Visual_Synthesizer@reddit (OP)
thats great! nice work.
gurkburk76@reddit
Rtx PRO 6000 and budget... We do not live in the same universe clearly 😅
pmttyji@reddit
Wish me luck & prosperous situation soon onwards. In future I also want to do budget builds like this.
Visual_Synthesizer@reddit (OP)
good luck! start with whatever GPUs you can afford and a cheap PLX switch. the methodology scales down to any generation. the benchmarks don't care how much you paid
romedatascience@reddit
Is the budget in the room with us?
Visual_Synthesizer@reddit (OP)
it's hiding behind the $10K I saved on the platform
ga239577@reddit
"Budget Build" ... Uhm 😅
Visual_Synthesizer@reddit (OP)
i suppose its relative. i work for billionaires, and this is budget to them. i think the cool part is that this can scale to smaller budgets. super cheap am4 systems with gen4 chinese plx's running as many gpus as people can afford. this optimizes for inference speed and low tok per dollar costs.
david_0_0@reddit
the 2.4M token KV budget vs 131K context ceiling is interesting. that suggests youre hitting the cache efficiency wall before the model maxes out context. did you test whether enabling dynamic attention or switching to paged attention further improves throughput? also curious if the 26.4GB mamba state is per-token or fixed overhead - if its fixed, concurrent requests would tank the effective batch size.
Visual_Synthesizer@reddit (OP)
Great catch on the Mamba state. Pulled from server logs:
Mamba Cache (per GPU)
max_mamba_cache_size: 173
conv_state: 0.22 GB
ssm_state: 12.23 GB
intermediate_ssm_state: 13.92 GB
intermediate_conv_win: 0.24 GB
Total: \~26.6 GB per GPU
Key point:
- Mamba state is per-sequence, not per-token
- 173 slots = hard concurrency ceiling for this hybrid GDN model
Implications:
- KV cache supports \~2.4M tokens (\~18 users @ 131K ctx)
- But Mamba caps at 173 concurrent sequences regardless of length
- Explains why SGLang + NEXTN peaks at C=32 (\~1411 tok/s) instead of scaling like pure attention models
Notes:
- SGLang paged attention: page_size=1 (default), dynamic chunking disabled
- Linear attention backend: decode=triton, prefill=triton (not flashinfer)
Next:
- Test page_size + --enable-dynamic-chunking
- Try --linear-attn-decode-backend flashinfer (if supported)
Appreciate the technical pushback. Most replies don’t get into this layer.
ayushere@reddit
Did you use turboquant? For kv caches?
Visual_Synthesizer@reddit (OP)
not yet, hope to soon
_hypochonder_@reddit
Can you post the parameter for vLLM docker?
>| Qwen3.5-122B NVFP4 | 131 | vLLM MTP=1 | compressed-tensors |
I had 2x RTX PRO 6000 Blackwell Max-Q at work. We use a ThinkStation PX.
Now I run Qwen3.5-122B-A10B with:

docker run --gpus all --shm-size=16g -e NCCL_P2P_DISABLE=1 -v /home/ai/models/Qwen3.5-122B-A10B-GPTQ-Int4:/model -p 8000:8000 vllm/vllm-openai:cu130-nightly-x86_64 /model --host 0.0.0.0 --port 8000 --served-model-name Qwen3.5-122B-A10B --tensor-parallel-size 2 --gpu-memory-utilization 0.80 --max-model-len 131072 --enable-expert-parallel --disable-custom-all-reduce --reasoning-parser qwen3 --language-model-only --enable-prefix-caching --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'

Visual_Synthesizer@reddit (OP)
https://github.com/Visual-Synthesizer/rtx6kpro/tree/master/benchmarks/inference-throughput
sandropuppo@reddit
Budget build wtf
pfn0@reddit
"budget build".... lol
Visual_Synthesizer@reddit (OP)
its all relative. im og in the space. already had a 4x gen4 system and needed to go gen5 and blackwell for work. didnt want to pay the RAM tax. figured this out and im smoking builds that cost 10k more.
Thrumpwart@reddit
I appreciate the post. Professional budget build is still a thing and these results prove it. Great work.
pfn0@reddit
GPU does all the heavy lifting, not surprised (I also have an rtx6000pro, just 1 tho)
rangorn@reddit
So how useful is this for real world purpose such as coding?
Visual_Synthesizer@reddit (OP)
well these AI systems have paid my bills since 2020....
Hedede@reddit
You don't need $15K Threadripper Pro. You can buy EPYC 9124 for $200 + SP5 motherboard for $800 and have 128 PCIe lanes. Of course RAM is a problem but the price is more or less the same.
If you don't need Gen 5, you can buy Threadripper Pro 3945WX for $150 + $500 motherboard + DDR4 RAM.
Visual_Synthesizer@reddit (OP)
Yes thats the whole point of the post. But you're missing the option of the switch architecture. You don't need 128 PCIe lanes from the CPU at all.
A cheap AM5 board with 1x PCIe x16 slot + a PLX switch gives full x16 P2P between GPUs, lower latency than direct attach through any CPU. The GPUs talk through the switch fabric, not through the CPU.
Add a second switch and a second x16 slot = 4-8 GPUs. Still don't need 128 CPU lanes. The switches provide the interconnect, not the CPU.
The whole point — decouple GPU communication from CPU platform. Any cheap board with PCIe x16 slots works. The CPU just loads weights and runs the API server.
EPYC SP5 and TR PRO 3945WX are solid options. I have a 3960x I ran for 5 years, its in the BG. But you're paying for CPU lanes that sit idle during inference. The GPU-to-GPU traffic goes through the switch regardless.
ycnz@reddit
Does the length of the cables start affecting your performance?
Visual_Synthesizer@reddit (OP)
yes, you need a redriver for longer runs. best to optimize for short runs or mcio off the motherboard. i had these laying around, a bit long. have shorter ones coming to clean up the build next week.
anomaly256@reddit
'budget'
https://i.redd.it/ztd0eampr9ug1.gif
Visual_Synthesizer@reddit (OP)
inconceivable that a $2K motherboard could outperform a $15K one? and yet here we are
anomaly256@reddit
the motherboard is the only reasonably-priced part of that whole build, that doesn't make it a budget build
Visual_Synthesizer@reddit (OP)
I cant control the price of GPUs, but i did just show how to build a system that's cheaper and faster on any hardware generation. if i did it on 3090s people would still complain and miss the point
anomaly256@reddit
It's not just the GPUs though, that's a super expensive chunk of RAM, the pcie switch is 2k on its own, another k for the cpu, etc
Visual_Synthesizer@reddit (OP)
very true. i sniped this system used. even ddr4 prices are up these days, not much any of us can do about this. its supply and demand in action. everyone is having fun building local systems. if i was on a tighter budget i would deal hunt for an older am4 system and run a cheap chinese plx with as many GPUs as i can afford. its best to optimize GPUs over system if you only do inference and lora training.
Acceptable-Yam2542@reddit
budget build, 198 tok/s. sure. thats more than my entire setup costs.
Visual_Synthesizer@reddit (OP)
r/LocalLLaMA: "we need cheaper local inference" me: here's how to save $10K and go faster
r/LocalLLaMA: "too expensive"
KvAk_AKPlaysYT@reddit
What's the max multi stream throughput?
Visual_Synthesizer@reddit (OP)
Multi-stream throughput (ctx=0)
122B — SGLang (b12x + NEXTN)
C=1 → 207 tok/s (207/user)
C=4 → 490 tok/s (122/user)
C=8 → 823 tok/s (103/user)
C=32 → 1411 tok/s (44/user)
122B — vLLM (MTP=1)
C=1 → 133 tok/s (133/user)
C=8 → 672 tok/s (84/user)
C=32 → 1910 tok/s (60/user)
C=128 → 3851 tok/s (30/user)
Notes:
- SGLang peaks earlier (C=32) due to speculation overhead
- vLLM scales higher at large concurrency (lighter MTP=1)
Takeaway:
SGLang wins for single-user latency
vLLM wins for high-concurrency serving
Tema_Art_7777@reddit
how is 2x6000 pro a budget build? 😀
Visual_Synthesizer@reddit (OP)
its a blueprint, this can also work on am4 with cheap gen 4 chinese plx switches
BasaltLabs@reddit
X 96GB ??!? HOLY
someone383726@reddit
I’ve got 2x 6000 pros on AM5 so the motherboard drops to x8/x8 on the pcie. I’d be curious to run a side by side and see how much worse my performance is.
Visual_Synthesizer@reddit (OP)
that would be great! ive only been able to compare it to my old trx40 and a gen5 threadripper pro on the same stack. I was 15% faster than the threadripper pro wx90 running direct-attach pcie. you should be able to reproduce everything in the stack using the startup files I posted in the OP
nero10579@reddit
You don't really need full pcie 5.0 x16 for only 2 cards. I run 2x using pcie 4.0 x16 without any communication bottlenecks, even for training.
Visual_Synthesizer@reddit (OP)
For vLLM with NCCL, you're right — Gen4 x16 is fine for 2 GPUs. Our TRX40 (Gen4 direct attach) and B650D4U+PLX (Gen5) are within 2-5% on vLLM.
The difference shows up with SGLang's PCIe oneshot allreduce, which bypasses NCCL and uses raw P2P. That's where the PLX switch matters — dedicated switch fabric path, not going through the CPU.
The bigger win was platform cost — way cheaper than Threadripper + RDIMM RAM for 15% faster inference.
The PLX also scales — add a second switch and go to 5 GPUs without changing the CPU or board. Community is already running 8+ GPUs on 2x c-payne switches with flat topology. Gen 6 boards incoming with 160 lanes this summer.
For 2 GPUs today, PCIe gen barely matters. But the switch architecture pays off when you scale up.
nero10579@reddit
Oh yea I guess I haven’t tried sglang specifically but I assumed training would be even more bandwidth heavy than any inference and in my experience training will almost always max the TDP which to me looks like it isn’t bottlenecked.
Visual_Synthesizer@reddit (OP)
depends on the type of training. full weights, yeah, plx not ideal. for loras though, where everything is in vram, p2p is faster.
putrasherni@reddit
downvoted
Visual_Synthesizer@reddit (OP)
understandable. 198 tok/s can be hard to process
qwen_next_gguf_when@reddit
Budge build my bald ass, bro.
Visual_Synthesizer@reddit (OP)
the budget part is the platform, not the GPUs. nobody calls a Honda Civic with a Ferrari engine a luxury car
Visual_Synthesizer@reddit (OP)
my hair left when I saw RAM prices too
Shoddy_Bed3240@reddit
We believe you. In theory, the maximum throughput comes from taking the bandwidth (1792 GB/s) and dividing it by the memory required per iteration (about 6 GB per token for Q4), which works out to roughly 298 tokens per second.
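Spelling out that estimate (the 6 GB/token figure is the commenter's assumption for active weights read per forward pass at Q4):

```shell
# Roofline-style decode ceiling: memory bandwidth / bytes touched per token
awk 'BEGIN {
  bw_gbs = 1792   # RTX PRO 6000 memory bandwidth, GB/s
  gb_tok = 6      # GB read per generated token at Q4 (commenter assumption)
  printf "ceiling: ~%d tok/s\n", bw_gbs / gb_tok
  # The measured 198 tok/s lands at roughly 2/3 of this theoretical ceiling,
  # which is plausible once speculative-decode acceptance and kernel
  # overheads are accounted for.
}'
```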
iMrParker@reddit
Formatted table
Visual_Synthesizer@reddit (OP)
thank you!
libbyt91@reddit
I think people are reacting to a somewhat misleading topic sentence.
m94301@reddit
Budget build? Congrats on your budget and your excellent results!
Visual_Synthesizer@reddit (OP)
its 18% faster and roughly 10k cheaper than a threadripper on the exact same stack! this can trickle down to gen4 parts also! gen4 plx switches are really cheap ;-)