Running a GPT-OSS-120B model on 64GB unified memory — observations on quantization + MoE data access
Posted by manjunath_shiva@reddit | LocalLLaMA
I’ve been experimenting with running large MoE models locally on Apple Silicon and ran into an interesting set of constraints around memory and throughput.
The main issue is obvious: 120B-scale models don’t fit into 64GB unified memory in their original form.
I tried a 3-bit quantization approach using:
- randomized Hadamard rotations (to normalize weight distributions)
- Lloyd–Max quantization (for better codebook efficiency)
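To make the two steps above concrete, here's a rough NumPy sketch of the idea (not the actual implementation, which would operate on real model weights with per-group codebooks): a randomized sign flip followed by an orthonormal Hadamard rotation smooths out the weight distribution, and a 1-D Lloyd–Max iteration (k-means on scalars) fits a 3-bit codebook to the rotated values. All sizes here are made up for illustration.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so norms are preserved

def randomized_rotation(W, rng):
    # Flip row signs at random, then rotate with the Hadamard matrix.
    n = W.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) @ (signs[:, None] * W)

def lloyd_max(x, bits=3, iters=50):
    # Lloyd-Max: alternate nearest-centroid assignment and
    # centroid update until the codebook settles.
    levels = 2 ** bits
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c, idx

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16))           # toy weight block
W_rot = randomized_rotation(W, rng)
codebook, codes = lloyd_max(W_rot.ravel())
W_hat = codebook[codes].reshape(W_rot.shape)  # dequantized approximation
```

Because the rotation is orthogonal, it changes the value distribution the quantizer sees without changing the layer's geometry; the inverse rotation can be folded into adjacent layers at load time.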
This brought a GPT-OSS 120B MoE model down to ~48 GB, with ~52 GB peak memory usage.
What surprised me more was the performance side.
Initial generation speed was extremely low (~1.7 tok/s), even after quantization.
The bottleneck turned out to be data access, not compute.
Since MoE models only activate a few experts per token, I changed the execution pattern to gather only the active experts instead of touching all experts during dequantization.
After doing this (via a fused Metal kernel), generation speed increased to ~44 tok/s on 120B and ~73 tok/s on a 20B MoE model.
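The access-pattern change is easiest to see in a toy NumPy sketch (the real version is a fused Metal kernel, and all the shapes below are hypothetical): the naive path dequantizes every expert's codes before the matmul, while the gather-first path indexes the active experts' codes up front, so only their bytes are ever read.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out, levels = 32, 64, 64, 8
# Quantized expert weights: 3-bit codes plus a shared scalar codebook.
codes = rng.integers(0, levels, size=(n_experts, d_in, d_out))
codebook = np.linspace(-1.0, 1.0, levels)

def moe_dense(x, active_ids):
    # Naive: dequantize *all* experts, then use only a few of them.
    w = codebook[codes]                    # touches every expert's memory
    return sum(x @ w[e] for e in active_ids)

def moe_gather(x, active_ids):
    # Gather-first: index the active experts' codes, dequantize just those.
    w = codebook[codes[active_ids]]        # (k, d_in, d_out) only
    return sum(x @ w_e for w_e in w)

x = rng.normal(size=(d_in,))
active = [3, 17]                           # router's top-k picks for this token
y_dense = moe_dense(x, active)
y_gather = moe_gather(x, active)
```

With top-2 routing over 32 experts, the gather path reads roughly 1/16th of the expert weights per token, which is where the throughput gain comes from once the workload is memory-bound.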
This feels like an under-discussed aspect of running MoE locally — quantization alone isn’t enough, access patterns matter just as much.
Curious if others working on local MoE inference have seen similar bottlenecks or tried different approaches here?
manjunath_shiva@reddit (OP)
Edit: Added a short demo video → https://x.com/i/status/2040844820433039410