What is LLMFit Smoking? Can M1 Max run anything decently enough for agentic coding?
Posted by GoodhartMusic@reddit | LocalLLaMA | View on Reddit | 2 comments
As you can see in this analysis, LLMFit estimated 85 tokens per second with a 64B model. When I tried it, I got 9 t/s. :'( I'm extremely new to local inference and wonder whether an M1 Max can realistically take advantage of a model like that in a meaningful way, even if a substantial task takes hours?
PracticlySpeaking@reddit
It's all about the quant.
Last night I tried the Ollama MLX preview (which only runs the special Qwen3.5-35b-a3b-NVFP4 in 32GB) and it was outputting ~64 tok/sec on a binned M1 Max (24-core GPU).
Edenar@reddit
Did you use the Q4_K_M quant it suggests? I don't think it actually fits in your memory. Also, the parameter count is wrong for that one (it should be 122B!), so I'd guess the tool underestimates the memory required to run it.
With 64GB you are a bit stuck: you can run fast, smaller MoE models like GLM 4.7 flash or Qwen 3.5-35B-A3B, or go for dense models like Qwen 3.5 27B or Gemma4 31B. The dense ones will be slower than MoE, but they'll give you the best results for their size.
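For a rough sanity check on estimates like LLMFit's, you can do the back-of-the-envelope math yourself. This is a sketch under two common rules of thumb (both assumptions, not measurements): Q4_K_M averages roughly 4.8 bits (~0.6 bytes) per parameter, and generation on a memory-bound machine is limited to about memory bandwidth divided by the bytes of active weights read per token. The 400 GB/s figure used below is the M1 Max's nominal bandwidth.

```python
# Back-of-the-envelope memory and speed estimates for local inference.
# Assumptions (rules of thumb, not exact): Q4_K_M ~= 4.8 bits/param on
# average, and each generated token reads all active weights once.

def q4_k_m_size_gb(params_b: float, bits_per_param: float = 4.8) -> float:
    """Approximate in-memory size of a Q4_K_M quant, in GB (weights only,
    ignoring KV cache and runtime overhead)."""
    return params_b * bits_per_param / 8

def est_tokens_per_sec(active_params_b: float, bandwidth_gb_s: float) -> float:
    """Crude memory-bandwidth upper bound on generation speed."""
    return bandwidth_gb_s / q4_k_m_size_gb(active_params_b)

# A 122B dense model at Q4_K_M needs ~73 GB for weights alone -- more than 64GB:
print(round(q4_k_m_size_gb(122), 1))          # ~73.2 GB

# M1 Max (~400 GB/s nominal): a dense 64B Q4 model tops out around 10 t/s,
# close to the 9 t/s observed; an MoE with ~3B active params is far faster.
print(round(est_tokens_per_sec(64, 400)))     # ~10 t/s
print(round(est_tokens_per_sec(3, 400)))      # ~222 t/s
```

This is why the MoE suggestions above work well on this hardware: only the active parameters have to be streamed per token, so a 35B-A3B model generates at dense-3B speeds while the full weights just need to fit in RAM.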