Mac Studio M3 Ultra 512GB DeepSeek V3-0324 IQ2_XXS (2.0625 bpw) llama.cpp performance

Posted by WhereIsYourMind@reddit | LocalLLaMA

I saw a lot of results with abysmal tok/sec prompt processing. These numbers are from a self-compiled llama.cpp binary, commit f423981a.

./llama-bench -m ~/.lmstudio/models/unsloth/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-IQ2_XXS-00001-of-00005.gguf --n-gpu-layers 62 --flash-attn 0 -ctk f16,q8_0 -p 16384,32768,65536 -n 2048 -r 1 
| model                          |       size |     params | backend    | threads | type_k |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | ------------: | -------------------: |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |    f16 |       pp16384 |         51.17 ± 0.00 |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |    f16 |       pp32768 |         39.80 ± 0.00 |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |    f16 |       pp65536 |     467667.08 ± 0.00 | (failed, OOM)
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |    f16 |        tg2048 |         14.84 ± 0.00 |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |   q8_0 |       pp16384 |         50.95 ± 0.00 |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |   q8_0 |       pp32768 |         39.53 ± 0.00 |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |   q8_0 |       pp65536 |         25.27 ± 0.00 |
| deepseek2 671B IQ2_XXS - 2.0625 bpw | 203.63 GiB |   671.03 B | Metal,BLAS |      24 |   q8_0 |        tg2048 |         16.09 ± 0.00 |

build: f423981a (5022)
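
The pp65536 figure with the f16 K cache is not a real throughput number; that run failed with OOM. A minimal Python sketch for catching rows like that when post-processing llama-bench's markdown output, assuming the table is pasted in as text (columns trimmed here for brevity). The helper names `parse_bench_table` and `flag_suspect_rows` and the 10x-median threshold are illustrative choices, not part of llama-bench.

```python
from statistics import median

# Rows taken from the table above, with the model/size/params/backend/threads
# columns dropped for brevity.
TABLE = """\
| type_k |    test |              t/s |
| -----: | ------: | ---------------: |
|    f16 | pp16384 |     51.17 ± 0.00 |
|    f16 | pp32768 |     39.80 ± 0.00 |
|    f16 | pp65536 | 467667.08 ± 0.00 | (failed, OOM)
|    f16 |  tg2048 |     14.84 ± 0.00 |
|   q8_0 | pp16384 |     50.95 ± 0.00 |
|   q8_0 | pp32768 |     39.53 ± 0.00 |
|   q8_0 | pp65536 |     25.27 ± 0.00 |
|   q8_0 |  tg2048 |     16.09 ± 0.00 |
"""


def parse_bench_table(text):
    """Turn a llama-bench markdown table into a list of dicts keyed by column name."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header and the |---| separator
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        row = dict(zip(header, cells))  # zip drops trailing notes such as "(failed, OOM)"
        row["t/s"] = float(row["t/s"].split("±")[0])
        rows.append(row)
    return rows


def flag_suspect_rows(rows, factor=10.0):
    """Flag pp rows whose t/s exceeds `factor` times the median pp rate for the same type_k."""
    suspects = []
    for type_k in sorted({r["type_k"] for r in rows}):
        pp = [r for r in rows if r["type_k"] == type_k and r["test"].startswith("pp")]
        typical = median(r["t/s"] for r in pp)
        suspects += [(r["type_k"], r["test"]) for r in pp if r["t/s"] > factor * typical]
    return suspects


print(flag_suspect_rows(parse_bench_table(TABLE)))
```

With the rows above, only the f16 pp65536 entry is flagged; the q8_0 rows fall off smoothly with prompt length and pass the check.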

WhereIsYourMind@reddit (OP)

I noticed a slight improvement when using flash attention at lower context lengths. I'll run the larger prompt processing tests with flash attention overnight.

nomorebuttsplz@reddit

Yeah, idk if they just updated Metal, but my GGUF prompt processing speeds went up to like 60 t/s for the full UD_Q4_K_XL quant from unsloth. It was like 10 before.

Also, though it hasn't been integrated into LM Studio yet, I've heard that you can now get over 100 t/s prompt processing speed using MLX.

segmond@reddit

60 tk/s? For prompt processing? No way! What context size? What's the speed of prompt eval? How are you finding the quality of UD Q4? Wow, I almost want to get me a Mac right away.

nomorebuttsplz@reddit

60 t/s is prompt eval to be clear. We really need standardized terminology.
1. Prompt processing, PP, prompt evaluation, prefill, token evaluation = 60 t/s
2. Token generation, inference speed = about 17 t/s to start, quickly falls to 10 or so.

segmond@reddit

Nice performance, thanks for sharing! I'm living with 5 tk/s, so 10 is amazing. The question that remains is whether my pocket can live with parting with some $$$ for a Mac Studio. :-D

BahnMe@reddit

Might go up 30% pretty soon unless you can find something in stock somewhere

segmond@reddit

might also go down when lots of people become unemployed and are desperate for cash and selling their used stuff.

[-]

Professional-market Macs usually depreciate slower than typical consumer electronics. Used and refurbished M2 Ultra 64GB/1TB units go for $2500-$3200 on eBay, compared to a launch price of $5k 21 months ago. I think it helps that Apple rarely runs sales on their high-end stuff, which keeps the new prices high and gives headroom for the used market.

The 3090/4090 market could have an influx of supply; but because they are the top-end for their generation, I can't see many gamers selling them off. There could be gamers cashing out on their appreciated 4090s and going for a cheaper 5000 series card with more features and less performance.

segmond@reddit

Yeah, oh well. I'll manage with what I have if it comes to that.

WhereIsYourMind@reddit (OP)

RAG with 50k context tokens. I'm tweaking the size of documents relative to the number of documents, and 2 bit lets me test a lot of combinations. I'm hoping I don't need all 50k tokens and I can use a higher quant in the future.

terminoid_@reddit

oh man, 50k tokens is fucking brutal at those prompt processing speeds...you're looking at 16 minutes before you get your first output token =/
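
The 16-minute estimate follows from dividing the prompt length by the prefill rate. A minimal sketch of that arithmetic, using the q8_0 rows from the OP's table; the linear interpolation between the 32k and 64k measurements is an assumption made here to get a rate for a 50k prompt, not something llama-bench reports.

```python
# Measured prompt-processing rates from the OP's table (q8_0 K cache):
# prompt length in tokens -> average tokens/second over that prompt.
PP_RATE_Q8 = {16384: 50.95, 32768: 39.53, 65536: 25.27}


def prefill_minutes(prompt_tokens, tokens_per_sec):
    """Minutes spent on prompt processing before the first output token."""
    return prompt_tokens / tokens_per_sec / 60


def interpolated_rate(prompt_tokens, table=PP_RATE_Q8):
    """Linearly interpolate t/s between the two nearest measured prompt sizes.

    Assumption: throughput changes roughly linearly between measurements.
    """
    sizes = sorted(table)
    if not sizes[0] <= prompt_tokens <= sizes[-1]:
        raise ValueError("prompt size outside the measured range")
    for lo, hi in zip(sizes, sizes[1:]):
        if lo <= prompt_tokens <= hi:
            frac = (prompt_tokens - lo) / (hi - lo)
            return table[lo] + frac * (table[hi] - table[lo])


# Applying the 16k rate to a 50k prompt reproduces the figure quoted above.
print(f"{prefill_minutes(50_000, PP_RATE_Q8[16384]):.1f} min")
# Applying a rate interpolated for a 50k prompt gives a longer wait.
print(f"{prefill_minutes(50_000, interpolated_rate(50_000)):.1f} min")
```

The first figure comes out at roughly 16 minutes and the second at roughly 26, so the 16-minute number is the optimistic end: it assumes the 16k-prompt rate holds all the way to 50k, which the 32k and 64k rows suggest it does not.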

Cergorach@reddit

People need to realize that 50k input tokens is essentially 40% of a novel. None of us reads a novel in 40 minutes, not even speed readers at 50%+ comprehension.

50k tokens is a LOT of text to read AND comprehend. That a small, relatively cheap, personal device can do that is amazing by itself.

I would also assume that you don't ask these questions lightly when you need a 50k context window. When I get a simple question at work that I can answer directly, I'm pretty fast because of training and experience. For a more complex question with data that can change constantly, I need to do research, and that can take hours, days, or even weeks, depending on the complexity of the question and the amount of data to reference.

But the issue is never really how fast you do it, it's the quality of the output. And depending on what kinds of questions you're giving and what kind of answers you're expecting, I expect that such an overly shrunken model won't give you what you're looking for.

terminoid_@reddit

i agree this is cool, but damn...i just can't imagine where having 2 questions answered per hour is a huge productivity booster

WhereIsYourMind@reddit (OP)

It depends on the workflow, I think. I have plenty of coding tasks that I can shelve for a few hours and come back and evaluate multiple outputs. It's like having several junior engineers write solutions for the same problem, and then I pick the best and develop it further. Junior engineers can take a day or more, so waiting a few hours isn't terrible.

My eventual goal is to see how far I can reduce that 50k and still get informed, relevant output. Then, I'll compare memory footprints and (hopefully) be able to upgrade to a higher quant with smaller context. This should give me both higher quality generation and faster prompt processing. There's an argument that I should go the opposite way, choosing a higher quant and slowly increasing the context; I might try that next and see where the mid point is.
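
For the memory-footprint comparison, the weight size of a GGUF follows from parameter count times average bits per weight. A minimal sketch using the sizes and parameter counts printed in the two llama-bench tables in this thread. The UD-IQ2_XXS file works out to more than its nominal 2.0625 bpw label, presumably because the quant mixes tensor types. KV cache and compute buffers are not included, and the function names are illustrative.

```python
GIB = 2**30  # llama-bench reports sizes in GiB


def effective_bpw(size_gib, params_billion):
    """Average bits per weight implied by a file size and parameter count."""
    return size_gib * GIB * 8 / (params_billion * 1e9)


def weights_gib(params_billion, bpw):
    """Approximate weight size in GiB for a given average bits per weight."""
    return params_billion * 1e9 * bpw / 8 / GIB


# (size in GiB, parameters in billions) as printed by llama-bench in this thread
iq2_xxs = effective_bpw(203.63, 671.03)   # OP's UD-IQ2_XXS
iq4_k_r4 = effective_bpw(386.18, 672.05)  # fairydreaming's IQ4_K_R4

print(f"UD-IQ2_XXS: {iq2_xxs:.2f} bpw, IQ4_K_R4: {iq4_k_r4:.2f} bpw")
# Extra weight memory needed to move the OP's model to the higher average bpw
print(f"{weights_gib(671.03, iq4_k_r4) - 203.63:.0f} GiB more for weights")
```

Plugging in the reported numbers gives roughly 2.6 and 4.9 bpw. Whatever memory is left after the weights has to cover the KV cache for the chosen context length, which is the trade-off being tuned here.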

Hunting-Succcubus@reddit

You have 512GB of memory but are still using Q2; it's so sad.

fairydreaming@reddit

For comparison purposes, here's yesterday's run of Q4 DeepSeek V3 in llama-bench with 32k pp and tg:

$ ./bin/llama-bench --model /mnt/md0/huggingface/hub/models--ubergarm--DeepSeek-V3-0324-GGUF/snapshots/b1a65d72d72f66650a87c14c8508c556e1057cf6/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -ctk q8_0 -mla 2 -amb 512 -fa 1 -fmoe 1 -t 32 --override-tensor exps=CPU --n-gpu-layers 63 -p 32768 -n 32768 -r 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl | type_k | fa | mla |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --: | ----: | ---: | ------------: | ---------------: |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB |   672.05 B | CUDA       |  63 |   q8_0 |  1 |   2 |   512 |    1 |       pp32768 |     75.89 ± 0.00 |
| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB |   672.05 B | CUDA       |  63 |   q8_0 |  1 |   2 |   512 |    1 |       tg32768 |      9.70 ± 0.00 |

build: 6d405d1f (3618)

The hardware is Epyc 9374F 384GB RAM + 1 x RTX 4090. The model is DeepSeek-V3-0324-IQ4_K_R4. I ran it on ik_llama.cpp compiled from source code.

Also detailed pp/tg values:

Since RAM was almost full, I observed some swapping at the beginning; I guess that caused the performance fluctuations at small context sizes.

butidontwanto@reddit

Have you tried MLX?

Healthy-Nebula-3603@reddit

Bro... a Q2 model is useless for any real usage, plus you even used a compressed cache...

WhereIsYourMind@reddit (OP)

I've found larger models suffer less from quantization than smaller models do.

Healthy-Nebula-3603@reddit

But you still have to respect the laws of physics, and Q2 will always be a big degradation compared to Q4 or Q8.

And from my tests, even a Q8 cache degrades quality...

You can easily test how bad the quality is anyway... Run the same questions on your local Q2 and on DeepSeek's web page...

sandoz25@reddit

A man who is used to walking 10km to work every day is not upset that his new car is a Lada.

Healthy-Nebula-3603@reddit

That's the wrong comparison. Rather, it's a car built to tolerances of +/- 1 cm, even for engine parts...

Q2 produces pretty broken output and understands questions very poorly.

Cergorach@reddit

Depends on what kind of output you need. You don't need a bricklayer with an IQ of 130, but you don't want a chemist with an IQ of 70... If this setup works for this person, who are we to question that? We just need to realize that this setup might not work for the rest of us.

WhereIsYourMind@reddit (OP)

80k context allows me to provide a significant amount of documentation and source material directly. From my experience, when I include the source code itself within the context, the response quality greatly improves—far outweighing the degradation you might typically expect from Q2 versus higher quantization levels. While I agree Q4 or Q8 might produce higher-quality results in general queries, the benefit of having ample, precise context directly available often compensates for any quality loss.

Cergorach@reddit

But wouldn't a smaller, specialized model with a large context window produce better results? Or is this what you're trying to figure out? I'm also very curious whether you'd see any significant improvement if you provided the same context to the full model. And if you cluster M3 Ultra 512GB machines over Thunderbolt 5, would you get similar performance, or would performance go down drastically?

Healthy-Nebula-3603@reddit

Lie to yourself if you want. Q2 compression hurts models severely. Q2 models are very dumb whatever you say and are only a gimmick. Run a perplexity test and you'll find it is currently dumber than any 32B model, even at Q4_K_M...

Ok_Top9254@reddit

Yes, but that holds for dense models. A 70B Q2 will be better than a 33B Q3 or Q3L, but this is not quite true for MoE. DeepSeek only has 37B active parameters, so the impact would be bigger than on something like a 400B Llama (when comparing the same models against each other...).

DunderSunder@reddit

Is this supposed to be non-abysmal? 12 minutes for 30k context pp is not usable.

Serprotease@reddit

For this kind of model, it's quite hard to go above 40-50 tk/s for pp. 500+ GB of fast VRAM is outside consumer reach in price and energy requirements.
The only way to get better results is a dual-CPU Turin/Xeon 6 system with 2x512GB of RAM and a GPU running ktransformers, and even this will struggle to get more than 3-4 times the performance of the Mac Studio at this amount of context (for twice the price...).

That's the edge of local LLMs for now. It will be slow until hardware catches up.

henfiber@reddit

According to ktransformers, they have managed to reach 286 t/s for pp with dual 32-core Intel Gold 6454s and a 4090. Turin may not be as fast because it lacks the AMX instructions.

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md#v03-preview

Serprotease@reddit

Yes, it looks very promising for running this big model. But they gave us the number for 8k context. I'm really looking forward to seeing whether similar improvements hold at 16k/32k. That would be a big breakthrough.

JacketHistorical2321@reddit

Man, you PP snowflakes are everywhere