Running a GPT-OSS-120B model on 64GB unified memory — observations on quantization + MoE data access
Posted by manjunath_shiva@reddit | LocalLLaMA
I’ve been experimenting with running large MoE models locally on Apple Silicon and ran into an interesting set of constraints around memory and throughput.
The main issue is obvious: 120B-scale models don’t fit into 64GB unified memory in their original form.
I tried a 3-bit quantization approach using:
- randomized Hadamard rotations (to normalize weight distributions)
- Lloyd–Max quantization (for better codebook efficiency)
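To make the two steps above concrete, here's a rough NumPy sketch of the idea (not the actual implementation, which would operate on real model weights with per-group codebooks): a randomized sign flip followed by an orthonormal Hadamard rotation smooths out the weight distribution, and a 1-D Lloyd–Max iteration (k-means on scalars) fits a 3-bit codebook to the rotated values. All sizes here are made up for illustration.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so norms are preserved

def randomized_rotation(W, rng):
    # Flip row signs at random, then rotate with the Hadamard matrix.
    n = W.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) @ (signs[:, None] * W)

def lloyd_max(x, bits=3, iters=50):
    # Lloyd-Max: alternate nearest-centroid assignment and
    # centroid update until the codebook settles.
    levels = 2 ** bits
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c, idx

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))           # toy weight block
W_rot = randomized_rotation(W, rng)
codebook, codes = lloyd_max(W_rot.ravel())
W_hat = codebook[codes].reshape(W_rot.shape)  # dequantized approximation
```

Because the rotation is orthogonal, it changes the value distribution the quantizer sees without changing the layer's geometry; the inverse rotation can be folded into adjacent layers at load time.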
This brought a GPT-OSS 120B MoE model down to ~48 GB, with ~52 GB peak memory usage.
What surprised me more was the performance side.
Initial generation speed was extremely low (~1.7 tok/s), even after quantization.
The bottleneck turned out to be data access, not compute.
Since MoE models only activate a few experts per token, I changed the execution pattern to gather only the active experts instead of touching all experts during dequantization.
After doing this (via a fused Metal kernel), generation speed increased to ~44 tok/s on 120B and ~73 tok/s on a 20B MoE model.
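The access-pattern change is easiest to see in a toy NumPy sketch (the real version is a fused Metal kernel, and all the shapes below are hypothetical): the naive path dequantizes every expert's codes before the matmul, while the gather-first path indexes the active experts' codes up front, so only their bytes are ever read.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out, levels = 32, 64, 64, 8
# Quantized expert weights: 3-bit codes plus a shared scalar codebook.
codes = rng.integers(0, levels, size=(n_experts, d_in, d_out))
codebook = np.linspace(-1.0, 1.0, levels)

def moe_dense(x, active_ids):
    # Naive: dequantize *all* experts, then use only a few of them.
    w = codebook[codes]                    # touches every expert's memory
    return sum(x @ w[e] for e in active_ids)

def moe_gather(x, active_ids):
    # Gather-first: index the active experts' codes, dequantize just those.
    w = codebook[codes[active_ids]]        # (k, d_in, d_out) only
    return sum(x @ w_e for w_e in w)

x = rng.normal(size=(d_in,))
active = [3, 17]                           # router's top-k picks for this token
y_dense = moe_dense(x, active)
y_gather = moe_gather(x, active)
```

With top-2 routing over 32 experts, the gather path reads roughly 1/16th of the expert weights per token, which is where the throughput gain comes from once the workload is memory-bound.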
This feels like an under-discussed aspect of running MoE locally — quantization alone isn’t enough, access patterns matter just as much.
Curious if others working on local MoE inference have seen similar bottlenecks or tried different approaches here?
manjunath_shiva@reddit (OP)
Edit: Added a short demo video → https://x.com/i/status/2040844820433039410