Mac Users: New Mistral Large MLX Quants for Apple Silicon
Posted by thezachlandes@reddit | LocalLLaMA | View on Reddit | 15 comments
Hey! I’ve created q2 and q4 MLX quants of the new Mistral Large for Apple Silicon. The q2 is up, and the q4 is uploading. I used the MLX-LM library for conversion and quantization from the full Mistral release.
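For anyone who wants to reproduce this, the conversion step with mlx-lm looks roughly like the sketch below (a minimal sketch only; the exact function signature and parameter names can vary between mlx-lm versions, so check the docs for the one you have installed):

```python
# Minimal sketch of converting + quantizing with mlx-lm (pip install mlx-lm).
# Parameter names are from recent mlx-lm versions and may differ slightly.
from mlx_lm import convert

convert(
    hf_path="mistralai/Mistral-Large-Instruct-2411",  # full-precision release
    mlx_path="Mistral-Large-Instruct-2411-Q2-MLX",    # output directory
    quantize=True,
    q_bits=2,         # 2 for the q2 quant, 4 for the q4 quant
    q_group_size=64,  # default group size
)
```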
With q2 I got 7.4 tokens/sec on my M4 Max with 128GB RAM, and the model took about 42.3GB of RAM. These should run significantly faster than GGUF on M-series chips.
You can run this in LM Studio or any other system that supports MLX.
Models:
https://huggingface.co/zachlandes/Mistral-Large-Instruct-2411-Q2-MLX
https://huggingface.co/zachlandes/Mistral-Large-Instruct-2411-Q4-MLX
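If you'd rather script it than use LM Studio, a minimal sketch with mlx-lm (assuming `pip install mlx-lm` on an Apple Silicon Mac; argument names may differ slightly between versions):

```python
# Rough sketch of running the q2 quant directly with mlx-lm.
from mlx_lm import load, generate

model, tokenizer = load("zachlandes/Mistral-Large-Instruct-2411-Q2-MLX")

prompt = "Explain the difference between MLX and GGUF in one paragraph."
# verbose=True also prints generation stats such as tokens/sec.
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
print(response)
```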
Durian881@reddit
Thank you very much!
thenomadexplorerlife@reddit
How good would Mistral Large q2 be compared to Llama 70B q4? I'm getting an M4 Pro with 64GB, but I was feeling bad that I can't run Mistral Large q4 due to the limited memory.
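For a rough sense of what fits in a given amount of unified memory, here is a back-of-the-envelope sketch. The bits-per-weight values are approximations that include some quantization overhead; real usage adds KV cache and runtime overhead on top, which is consistent with the ~42GB the OP reported for q2:

```python
# Back-of-the-envelope estimate of quantized weight size (approximate only).
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Mistral Large 2411 is ~123B parameters; Llama 70B is ~70B.
print(approx_weight_gb(123, 2.5))  # ~38 GB -> q2 fits in 64GB
print(approx_weight_gb(123, 4.5))  # ~69 GB -> q4 does not fit in 64GB
print(approx_weight_gb(70, 4.5))   # ~39 GB -> 70B q4 fits comfortably
```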
matadorius@reddit
Damn, I am wondering if I should go for 64GB rather than 48 now.
thezachlandes@reddit (OP)
64GB on the Max chip has a higher memory bandwidth than 48GB. Double-check to be sure, but that's what I figured out from the table on the MacBook Pro Wikipedia page.
matadorius@reddit
Yeah, but if I step up to the 16-inch with the Max chip, I might as well pay the €600 extra and get 128GB. Still, it seems like a waste of money to pay 2x what I initially wanted.
cm8ty@reddit
Curious to know the tok/sec w/ q4. Congrats on the new beast-of-a-machine btw
thezachlandes@reddit (OP)
Very slow: 0.58 tokens/sec. I'm sure there are use cases!
SomeOddCodeGuy@reddit
What processing time are you seeing on a larger prompt? Really curious to see what the total time is for MLX vs GGUF; I've only ever tried GGUFs on the Mac.
MaxDPS@reddit
I did a comparison between MLX and GGUF with Codestral earlier today. MLX was roughly 20% faster.
thezachlandes@reddit (OP)
I saw about 20% in a test I did with another model.
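For anyone wanting to run this kind of MLX-vs-GGUF comparison themselves (including prompt-processing time), a rough timing sketch with mlx-lm is below; note that recent mlx-lm versions will also print prompt and generation speeds if you pass `verbose=True` to `generate`:

```python
# Rough sketch: time a generation with mlx-lm and compute tokens/sec,
# to compare against a GGUF runtime on the same prompt. Approximate only.
import time
from mlx_lm import load, generate

model, tokenizer = load("zachlandes/Mistral-Large-Instruct-2411-Q4-MLX")

prompt = "Write a 200-word summary of the history of the Mac."
start = time.time()
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.time() - start

n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.2f} tok/s")
```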
busylivin_322@reddit
Anyone know if MLX quants work with Ollama?
thezachlandes@reddit (OP)
It should
jzn21@reddit
For some reason all mistral large models run very slow on my M2 Ultra. Will try this one!
Such_Advantage_6949@reddit
I just got my Mac with a Max chip and I'm new to MLX. What library do I use to run it, and is there any format-enforcement option, like enforcing JSON output?
Special_System_6627@reddit
Try LM Studio.
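A hedged sketch of what JSON enforcement can look like once the model is loaded in LM Studio, via its OpenAI-compatible local server (default http://localhost:1234/v1). Structured-output support and the exact `response_format` shape depend on your LM Studio version, and the model name is whatever identifier LM Studio shows for the loaded model:

```python
# Sketch of schema-constrained JSON output through LM Studio's local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

schema = {
    "name": "capital_info",
    "schema": {
        "type": "object",
        "properties": {
            "country": {"type": "string"},
            "capital": {"type": "string"},
        },
        "required": ["country", "capital"],
    },
}

resp = client.chat.completions.create(
    model="Mistral-Large-Instruct-2411-Q4-MLX",  # name of the model loaded in LM Studio
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)
```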