How I got faster local LLM inference on Apple Silicon by switching from llama.cpp to MLX format
Posted by Double-Astronaut-780@reddit | LocalLLaMA | 1 comment
Been running local models on my M-series Mac for a while. llama.cpp works fine but I kept noticing it wasn't fully utilizing the Metal GPU the way Apple's MLX framework does.
After some digging, I found the bottleneck is the format — GGUF is designed around llama.cpp's runtime, not MLX's memory model. Converting to MLX format made a noticeable difference in throughput and memory usage.
The conversion process roughly involves:
- Parse the GGUF header (magic bytes, tensor count, metadata)
- Extract or map weights to an MLX-compatible tensor layout
- Generate config.json, model.npz, and tokenizer files
- Use mlx-lm (mlx_lm.convert) for architectures it supports natively
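The first step above — parsing the GGUF header — can be sketched in a few lines of Python. The GGUF file begins with a fixed little-endian header: a 4-byte magic (`GGUF`), a uint32 version, a uint64 tensor count, and a uint64 metadata key-value count. The function name and the sample values here are illustrative, not from an actual model file:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF file header.

    Layout (little-endian): 4-byte magic b"GGUF", uint32 version,
    uint64 tensor_count, uint64 metadata_kv_count.
    """
    magic, version, tensor_count, kv_count = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic example: a hypothetical v3 file with 291 tensors and 24 metadata keys.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(parse_gguf_header(sample))
# → {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```

The metadata key-value pairs and tensor descriptors follow immediately after this header; walking those is where the per-architecture mapping work in the later steps actually happens.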
As of March 2026, Ollama has also switched to MLX as its default backend on Apple Silicon — so the ecosystem is clearly moving in this direction.
Has anyone else gone down this path? Curious what models people are running and whether the MLX gains held up for them. I found it most noticeable on longer context runs where memory bandwidth matters most.
Happy to share more details on the conversion pipeline if there's interest.
ScrapEngineer_@reddit
Ollama spam, go back to your clanker friends