How I got faster local LLM inference on Apple Silicon by switching from llama.cpp to MLX format

Posted by Double-Astronaut-780@reddit | LocalLLaMA

Been running local models on my M-series Mac for a while. llama.cpp works fine, but I kept noticing it wasn't fully utilizing the Metal GPU the way Apple's MLX framework does.

After some digging, I found the bottleneck is the format — GGUF is designed around llama.cpp's runtime, not MLX's memory model. Converting to MLX format made a noticeable difference in throughput and memory usage.

The conversion process roughly involves:

  1. Parse the GGUF header (magic bytes, tensor count, metadata)

  2. Extract or map weights to MLX-compatible tensor layout

  3. Generate config.json, model.npz, tokenizer files

  4. Use mlx-lm (mlx_lm.convert) for architectures it supports natively
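
Step 1, for instance, is just a fixed-layout binary read. Here's a minimal sketch of the header parse (field layout per the GGUF spec: magic, uint32 version, uint64 tensor count, uint64 metadata KV count, all little-endian; the example values below are made up):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF file header: 4-byte magic, uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count (little-endian, 24 bytes)."""
    magic, version, tensor_count, kv_count = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header: GGUF v3 file claiming 291 tensors and 24 metadata keys
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
# → {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

The real metadata section after these 24 bytes is a typed key/value stream, which is where the architecture name, tokenizer, and quantization info live.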

As of March 2026, Ollama has also switched to MLX as its default backend on Apple Silicon — so the ecosystem is clearly moving in this direction.

Has anyone else gone down this path? Curious what models people are running and whether the MLX gains held up for them. I found it most noticeable on longer context runs where memory bandwidth matters most.

Happy to share more details on the conversion pipeline if there's interest.