Help! What and how to run on M3 Ultra 512. (Coding)
Posted by matyhaty@reddit | LocalLLaMA | View on Reddit | 6 comments
Hello everyone
I could really do with some advice and help on what local coding AI to host on my Mac Studio M3 Ultra with 512GB. We will only use it for coding.
As I discovered over the weekend, it's not just a matter of what model to run, but also what server to run it on.
So far, I have discovered that LM Studio is completely unusable and spends ninety percent of the time processing the prompt.
I haven't had much time with Ollama, but I have experimented with llama.cpp and MLX. Both of those seem better, but not perfect. Then it's whether to use GGUF or MLX, then what quant, then what lab (Unsloth, etc.), and before you know it my head is fried.
As for models, we did loads of tests prior to purchase and found that GLM-5 is really good, but it's quite a big model and seems quite slow.
Obviously having a very large amount of VRAM opens a lot of doors, but this isn't just for one user, so it's a balance between reasonable speed and quality of output. If I had to choose, I would choose quality of output above all else.
I welcome any opinions and thoughts, especially on things which confuse me, like which server to run and the settings for it. Model-wise, we will just test them all!!!
Thank you.
Delicious-Storm-5243@reddit
512GB is a beast, you can run pretty much anything. My setup recommendations:
Server: mlx-lm (pip install mlx-lm). Built for Apple Silicon, way faster than llama.cpp on Mac. LM Studio's MLX backend also works if you prefer a GUI.
Models for coding (what I'd try first): 1. Qwen3.5-Coder-Next — latest and strongest for coding tasks 2. Qwen3.5-235B-A22B — with 512GB you can run the full MoE, active params are only 22B so speed stays reasonable 3. GLM-5 in MLX format — since you already liked it, the MLX quantized version will be noticeably faster
Key tips: - Skip GGUF on Mac. Go MLX format, night and day speed difference on Apple Silicon - For multi-user: mlx-lm.server gives you an OpenAI-compatible endpoint so everyone can share the same box - LM Studio was slow for me too with large models. It was the default backend — switching to MLX fixed it
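For the multi-user tip above, here is a minimal sketch of what a shared client could look like against mlx-lm's OpenAI-compatible endpoint. The launch command in the comment, the model name, and the port are assumptions for illustration, not details from the thread; only the Python standard library is used.

```python
# Minimal client sketch for a shared mlx-lm server, assuming it was
# started on the Mac Studio with something like (hypothetical model/port):
#   mlx_lm.server --model <some-mlx-model> --port 8080
import json
import urllib.request


def build_chat_payload(prompt: str, model: str = "default") -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # a low temperature is a common choice for coding
    }


def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST the request to the shared server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoint speaks the OpenAI wire format, any existing OpenAI-compatible coding tool pointed at the box's address should also work, so everyone in the office shares one model in memory.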
The 512GB gives you headroom most people dream of. Don't overthink quantization — at your memory level just run the biggest model that fits and iterate from there.
matyhaty@reddit (OP)
Thank you. So there are two LM Studios? LM Studio and LM Studio MLX?
Sorry if I'm being thick!!
Mountain_Station3682@reddit
LM Studio can run both GGUF and MLX, same program.
Fristender@reddit
Did I miss out on something or is there actually a Qwen3.5-Coder-Next or a Qwen3.5-235B-A22B? I thought 3.5 has 397B.
lolwutdo@reddit
It’s an AI-written comment, just downvote. There’s no official Qwen 3.5 235B unless they’re talking about the REAP model; even then, 512GB could run the full 397B just fine.
FuckDeRussianFuckers@reddit
Qwen3.5-Coder-Next is available here, though it's an Ollama pull; not sure if that works with MLX.
I haven't seen Qwen3.5-235B-A22B anywhere.