Anybody running gpt-oss-120b on a MacBook Pro M4 Max 128GB?
Posted by Appomattoxx@reddit | LocalLLaMA | 15 comments
If you are, could you *please* let me know? I'm thinking of getting one, and want to know if I can run that particular model at a reasonable speed.
Thank you!
Badger-Purple@reddit
You can run much more than OSS 120B on that computer!
Amazing_Clock5847@reddit
Does it maintain proper quality across that immense context the whole time? 1 million is huge.
Badger-Purple@reddit
dude, in AI, a comment made 153 days ago is ancient history.
This model is outdated at this point. Qwen Next Coder is a finetuned version that does well for coding and more.
1 million context is now standard in other models without rotary positional embeddings. Some now even have rotary attention!
laerien@reddit
I can also confirm it works great. I'm seeing over 60 tok/sec with Unsloth's F16 GPT OSS 120B. That said, use Qwen3 Next 80B A3B 8-bit MLX since it's better and also above 60 tok/sec on an M4 Max 128GB.
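For context, here's a minimal sketch of how one might run an MLX model like that with Apple's mlx-lm package and measure speed. The mlx-community repo id is an assumption based on their usual naming, so treat it as a placeholder:

```python
# Minimal sketch, assuming `pip install mlx-lm` on an Apple-silicon Mac.
# The repo id is an assumed mlx-community upload, not verified.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a binary search in Swift."}],
    add_generation_prompt=True,
    tokenize=False,
)

# verbose=True prints prompt-processing and generation speed in tokens/sec,
# which is where numbers like "over 60 tok/sec" come from.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```

The same verbose output also reports prompt-processing speed, which is the number asked about below.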
committer@reddit
How fast is the prompt processing?
Appomattoxx@reddit (OP)
Thank you! Can you say what context windows you’re using?
Due_Mouse8946@reddit
Max context.
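For gpt-oss-120b, "max context" should mean the model's full 131,072-token window (assuming the published 128K limit). A hedged sketch of setting that with llama-cpp-python; the GGUF path is a hypothetical placeholder:

```python
# Sketch, assuming `pip install llama-cpp-python` built with Metal support.
from llama_cpp import Llama

llm = Llama(
    model_path="gpt-oss-120b-F16.gguf",  # hypothetical local path
    n_ctx=131072,     # full 128K window; KV-cache RAM grows with n_ctx
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal)
)
```

The trade-off is memory and speed: the KV cache grows as the context fills, and prompt processing slows down on long inputs.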
Gregory-Wolf@reddit
Does Unsloth's F16 GPT OSS 120B actually give better results than the original MXFP4, in your experience?
laerien@reddit
I think it's MXFP4 labeled as F16. They call it "gpt-oss-120b-F16.gguf", but I'm pretty sure you're right and it's plain MXFP4. Unsure if they mean unquantized MXFP4 or what.
Gregory-Wolf@reddit
The weights are probably the same (the size in GB is the same, at least), but they claim they made some fixes: the chat template and some precision changes here and there. Supposedly it should be more stable and in some cases give better results. That's why I ask.
I have an M3 Max 128GB, and I use MXFP4. I wondered if you'd compared vanilla MXFP4 to Unsloth's F16 and seen any difference, and whether that's why you switched to Unsloth's.
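One way to check this yourself is to run the same prompt at temperature 0 through both files and compare the outputs. A minimal A/B sketch with llama-cpp-python; both file names are hypothetical placeholders:

```python
# Sketch: deterministic A/B comparison of two GGUF quants of the same model.
from llama_cpp import Llama

PROMPT = "Explain the difference between a mutex and a semaphore."

for path in ["gpt-oss-120b-MXFP4.gguf", "gpt-oss-120b-F16.gguf"]:
    llm = Llama(model_path=path, n_ctx=8192, n_gpu_layers=-1, verbose=False)
    out = llm.create_completion(PROMPT, max_tokens=200, temperature=0.0)
    print(f"--- {path} ---")
    print(out["choices"][0]["text"])
    del llm  # free ~60 GB of weights before loading the next file
```

If the tensors really are identical, the template and precision fixes would mostly show up in chat-formatted use rather than in raw completions like this.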
StateSame5557@reddit
I get over 70 tok/sec with VCoder, a fine-tuned 120B from EpistemeAI
https://huggingface.co/nightmedia/VCoder-120b-1.0-qx86-hi-mlx
Daemonix00@reddit
Yeah even on the plane… it’s quite good
weasl@reddit
It works great (around 40 t/s) but I prefer GLM 4.5 Air or Qwen 3 Next
tiltology@reddit
Yeah, it works well. I used it with Xcode pointing at LM Studio as a coding test and it’s nice and fast. Not at the machine right now so I can’t tell you the tokens per second but it was definitely faster than reading speed.
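LM Studio works there because it serves an OpenAI-compatible API (default http://localhost:1234/v1), so any client, Xcode included, can point at it. A minimal sketch with the openai Python client; the model id is an assumption, so check the id LM Studio shows for the loaded model:

```python
# Sketch, assuming LM Studio's local server is running on the default port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # assumed id; copy it from LM Studio's model list
    messages=[{"role": "user", "content": "Write a SwiftUI view with one button."}],
)
print(resp.choices[0].message.content)
```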
Appomattoxx@reddit (OP)
Thank you! I’m excited about the idea of running that model off a Mac, but I wanted to confirm it’d work before making the purchase.