M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king
Posted by tolitius@reddit | LocalLLaMA | View on Reddit | 108 comments
The last Llama (Scout/Maverick) was released a year ago. Since then US-based releases have been super rare: Granite 3.3, GPT-OSS 20B & 120B, Nemotron 3 Nano / Super and now Gemma 4. It can't even compare to the solid Chinese open model output: Qwens, DeepSeeks, Kimis, MiniMaxes, GLMs, MiMos, Seeds, etc..
Gemma 4 is like a breath of fresh air. Not just the model itself, but the rollout, the beauty, the innovation: K=V in global attention, Per-Layer Embeddings, tri-modal minis (E4B, E2B), etc.
Most of my local LLM usage used to be via rented GPUs: Google Cloud, AWS, etc. But about a month ago I decided to bring it all home, and bought a shiny M5 Max MacBook Pro 128GB. It is a beast of a laptop, and it also opens up the kind of models I can run locally: 128GB of unified RAM and all.
Besides the cost, the true benefit of running models locally is privacy. I never felt easy sending my data to "OpenRouter => Model A" or even hosting it in AWS on P4d/P4de instances (NVIDIA A100): it is still my data, and it is not at home, where I am.
But my laptop is.
When it comes to LLMs, unless it is research or coding, finding utility is difficult. But I have kids, and they have school, and if anything is super messy in terms of organization, disconnected systems where the kids' data lives, and communication inconsistencies, it is US public schools. But being a parent is fun, and this mess is a great fit for LLMs to make sense of. Local LLMs solve the last piece: my kids' data stays on my laptop at home.
So it began. I loaded all I could onto my 128GB friendly beast and started looking at which models are good for what. The flow is not difficult: go to many different school-affiliated websites; some have APIs, some I need to screen scrape with Playwright, some are a little of both plus funky captchas and logins, etc. Then, when on "a" website, some teachers have things inside a slide deck on "slide 13", some in obscure folders, others on different systems buried under many irrelevant links. LLMs need to scout all this ambiguity and come back to me with clear signals of what is due tomorrow and this week; what the grades are, why they are what they are, etc. Again, a great use case for LLMs, since it is lots of unorganized text with a clear goal to optimize for.
You may be thinking just about now: "OpenClaw". And you would be correct, this is what I started from, but then I realized that OpenClaw is only as good as the set of LLMs behind it. Also, if I schedule a vanilla OS cron that invokes a "school skill", the number of tokens sent to the LLM goes from 10K to about 600. And while I do have an OpenClaw running on VPS / OpenRouter, this was not (maybe yet) a good use for it.
In order to rank local models I scavenged a few problems over the years that I had to solve with the big boys: Claude, OpenAI, Grok and Gemini. They are nice enough to record everything we talk about, which is anything but local, but in this case it gave me a chance to collect a few problems and convert them to prompts with rubrics.
I then wrote a script to start making sense of what works for me vs. what is advertised and/or works for others. The script grew fast, and was missing look and feel, so I added UI to it: https://github.com/tolitius/cupel
Besides the usual general problems, I used a few specific prompts that had tool use and multi-turns (multiple steps composed via tool calling) focused specifically on school-related activities.
After a few nights of trial and error, I found that "Qwen 3.5 122B A10B Q4" is the best and the closest to solving most of the tasks. A pleasant surprise, by the way, was "NVIDIA Nemotron 3 Super 120B A12B 4bit". I really like this model, it is fast and unusually great. "Unusually" because previous Nemotrons did not stand out the way this one does.
[pre Gemma 4]()
And then Gemma 4 came around.
Interestingly, at least for my use case, "Qwen 3.5 122B A10B Q4" still performs better than "Gemma 4 26B A4B", and is about 50/50 accuracy-wise with "Gemma 4 31B", but Qwen wins hands down on speed. "Gemma 4 31B" full precision is about 7 tokens per second on the M5 Max MacBook Pro 128GB, whereas "Qwen 3.5 122B A10B Q4" is 50 to 65 tokens / second.
[(here tested Gemma 4 via OpenRouter to avoid any misconfiguration on my side + 2x faster)]()
But I suspect I still need to learn "The Way of Gemma" to make it work much better. It really is a giant leap forward given its size vs. quality. After all, at 31B, although dense, it stands side by side with 122B.
bwjxjelsbd@reddit
Can you please keep updating this series of posts? With Minimax M2.7 coming out this weekend it’s going to be a fun one
tolitius@reddit (OP)
need to address lots of very solid feedback in this thread; I agree, one follow-up post might not address all of it
while the M5 Max 128GB is a great laptop, it is still just a laptop, and can only run one LLM test at a time (to be fair to that LLM / resourcing). I'll try to address most over the weekend
cupel does help me, as I have a place that I can make better to move faster, given the pace of everything
what would be the best way to have the series here on r/LocalLLaMa to make sure they are helpful vs. noise?
bwjxjelsbd@reddit
Would be posting the leaderboard tbh.
BTW MiniMax M2.7 just released! https://huggingface.co/MiniMaxAI/MiniMax-M2.7
slypheed@reddit
Just post to this leaderboard using the Anubis app, then we can all have results across different machines in one place: https://devpadapp.com/leaderboard.html
https://github.com/uncSoft/anubis-oss
BestSeaworthiness283@reddit
I like qwen3.5:9b for speed
tolitius@reddit (OP)
at full precision Qwen 3.5 9B is a little on the slow side for M5 Max
for example Qwen 35B A3B (given that it only does 3B at a time vs. 9B) is much faster
and even at Q4, Qwen 35B A3B quality-wise is comparable to Qwen 3.5 9B, and it runs at 112 tokens a second (on M5 Max)
Imaginary-Unit-3267@reddit
What I find odd here is that on my RTX 3060, it's the other way around - 9B gives me ~45 t/s, and 35B-A3B gives me ~21 t/s. I still haven't figured out how it's possible that the one that's three times bigger in active parameters runs twice as fast.
tolitius@reddit (OP)
this is most likely because in 35B-A3B all 35B need to fit and sit in RAM. 9B comfortably fits into an RTX 3060 (12GB?), but I think the 35B spills to CPU / system RAM
Imaginary-Unit-3267@reddit
I thought the entire point of mixture of experts models was that you don't have to use the whole model all at once, just a portion of it? Why can't inactive experts be on RAM and that be fine?
gfghgfghg@reddit
If the entire model were to fit on your card, it would be faster. MoE still needs to load the entire 35b model into vram, it just only cycles through / uses ~3b when responding, which is quicker than needing to go through the entire 35b model, even when all of it is loaded into vram
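A rough sketch of that resident-memory vs. per-token-traffic distinction, with illustrative numbers only (assuming ~2 bytes/weight at fp16 and ~0.5 bytes/weight at Q4, ignoring KV cache and runtime overhead):

```python
# Illustrative only: why a 35B-A3B MoE can outrun a 9B dense model *when it fits*.
# Assumed sizes: ~2 bytes/weight at fp16, ~0.5 bytes/weight at Q4; KV cache ignored.

def resident_gb(total_params_b: float, bytes_per_weight: float) -> float:
    """All weights must sit in (V)RAM, including the inactive experts."""
    return total_params_b * bytes_per_weight  # billions of weights * bytes = GB

def traffic_gb_per_token(active_params_b: float, bytes_per_weight: float) -> float:
    """Per-token memory traffic: only the *active* weights get read."""
    return active_params_b * bytes_per_weight

print(resident_gb(9, 2.0), traffic_gb_per_token(9, 2.0))   # dense 9B fp16: 18.0 GB resident, 18.0 GB/token
print(resident_gb(35, 0.5), traffic_gb_per_token(3, 0.5))  # 35B-A3B Q4: 17.5 GB resident, 1.5 GB/token
```

Similar footprint, but roughly 12x less memory traffic per token for the MoE, which is why spilling it out of VRAM hurts so much more than its active size suggests.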
BestSeaworthiness283@reddit
thank you very much, also i have a newbie question, i still cant wrap my head around what that q4 means :(
tolitius@reddit (OP)
LLMs are trained and released at "full precision". it means that if Qwen 3.5 9B has 9 billion weights / numbers, each weight is 16 bits (hence, fp16 / bf16). 9 billion * 16 bits (2 bytes) = 18GB of GPU RAM that is needed. another important note is memory bandwidth: how many weights need to "fire" / be used for each token. Qwen 3.5 9B is a dense model, which means all 9 billion weights need to be processed for each token: i.e. 18GB that need to be read through per token.
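The sizing arithmetic above as a quick sanity check (real quantized files come out a bit bigger because of scales and other metadata):

```python
# bytes = params * bits / 8; with params in billions the result is directly in GB.
# Q4 metadata (scales, etc.) is ignored here, so real files run slightly larger.

def model_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

print(model_gb(9, 16))  # dense 9B at fp16 -> 18.0 GB
print(model_gb(9, 4))   # same model at Q4 -> 4.5 GB
```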
this can get large and slow for given hardware / memory bandwidth limits, therefore there is a process called quantization, which takes each weight and rounds it to be represented in N bits vs 16. for example, it takes a 16-bit weight with the value "3.14159" and rounds it to "3.1", representable in just 4 bits vs. 16. this would be called Q4 (4 for 4 bits): the model needs a quarter of the original memory and bandwidth. "Q" stands for "Quantized".
BestSeaworthiness283@reddit
thank you very much
xraybies@reddit
BestSeaworthiness283@reddit
Ty very much guys, another thing i want to ask, do you know any online resource, like a website, which has these things explained in depth? i want to really get into this and understand, thank you again!
AdamDhahabi@reddit
For multi-GPU owners (Nvidia), my 'Frankenstein build' runs Qwen 3.5 122B IQ4_XS GGUF (Bartowski) at 50 t/s (first few thousands of tokens) which comes close to M5 Max in terms of speed.
Specs: 2x 5070 Ti + 3090 + 5060 Ti 16GB (mix of expensive Blackwells and a single 3090 to keep it affordable)
To compare with M5 platform, Nvidia multi-GPU should be faster for dense models and slower for MoE once you start offloading to slow system RAM. I have 72GB VRAM.
In case Qwen 3.5 122B can't hack a problem, or when doing research, I run a REAP Q2 version of its larger sibling 397B. It isn't as fast (especially because not fully in VRAM) but it's smarter and pretty usable at 2-bit. https://huggingface.co/OpenMOSE/Qwen3.5-REAP-262B-A17B-GGUF
tolitius@reddit (OP)
that's a really good point!
given close / similar RAM, Apple Silicon has an advantage for MoE models (that fit in that RAM, that is). I would suspect, besides offloading to CPU / system RAM, you can see a lot of PCIe communication traffic when the MoE router summons an expert that is split between the cards?
AdamDhahabi@reddit
For inference no high PCIe bandwidth is needed, I only have a consumer mainboard (very suboptimal) and still get good speeds with pipeline parallelism. I cannot even think about tensor parallelism (since yesterday experimental in llama.cpp). Here's my config: PCIe 4.0 x16 (CPU), PCIe 4.0 x4 (CPU), PCIe 4.0 x4 (Chipset), PCIe 3.0 x1 (Chipset). Especially the PCIe 3.0 x1 is not good, but no worries because of pipeline parallelism.
tolitius@reddit (OP)
I see. pipeline parallelism makes this not a problem. thanks for the distinction
since you use multiple GPUs, can you try https://github.com/tolitius/cupel (you can install and remove it if you don't need it)
after this conversation I realized that it did not support auto discovery for llama.cpp and multiple GPUs, I added it and just want to make sure it shows up (i.e. the multiple GPUs piece)
if it is a hassle, no problem: I'll stand up an EC2 instance with a few A100s or something later to test.
Accurate-Egg-6787@reddit
I reached the same conclusion for pretty much the same workload on my 128gb Strix Halo, though with less formal eval. Distilling school communication is a hero workload for parents! I set Gmail to auto-forward all the school emails to an api-only Gmail account I'd made years ago, and the agent accesses it via GWS skills [1] to create a daily breakdown of reminders, things needed at school the next day, and conversation starters based on curriculum and special school events. These get posted as events shared with my personal calendar.
Similar TG speed gap to you, but lower numbers on strix halo. Qwen-3.5-122B-A10B using bartowski Q6 on vulkan llama.cpp gets about 20 tk/s tg, and Gemma 27B Q8_0 hits about 6 tk/s tg. I found Gemma to be slightly better at improving my SKILL.md. They are really about the same when it comes to following the skill, with Qwen so much faster.
[1] https://github.com/googleworkspace/cli
tolitius@reddit (OP)
noice!
100%
unfortunately our public school sends a lot of email but not a lot useful context. and the context lives in many different places in many different formats.
do you really feel the difference with "Qwen-3.5-122B-A10B Q6" vs. Q4? I tried Q4, Q5 and Q6, and, at least for school skills, Q4 pulled ahead simply because of the speed to go over all the scraping data, HTML files, large JSON blobs, etc.
Accurate-Egg-6787@reddit
I never tried Q4, will give it a go.
Accurate-Egg-6787@reddit
Tried bartowski Q4_K_L (71G on disk) tonight vs Q6_K_L (99G on disk). Spouse and I both judged. We both felt Q6's output might be a bit better, but it was hard to tell. Both models made some mistakes, but inconsistent ones across runs. I probably need a more robust critique and correct step in my skill.
Performance: ~15% speed advantage for Q4 over Q6
- Q4_K_L averaged 241.0 tk/s pp and 23.8 tk/s tg
- Q6_K_L averaged 209.3 tk/s pp and 20.2 tk/s tg
Q6 fits comfortably on my headless server, so I'll probably stick with it. If it were on my laptop I'd probably go with Q4 to use the RAM for other things. I think I'd need a more objective/scalable eval to figure out if I really need Q6 or if I can take the speed win.
tolitius@reddit (OP)
added the feedback from this discussion to: https://github.com/tolitius/cupel/issues/1
will try to post the results next week
jeffwadsworth@reddit
King if you are using 128GB or less. GLM 5.1 is the master if you have the hardware. Too bad you can’t run your suite with it.
tolitius@reddit (OP)
yea, GLM 5.1 benched really well for me, I have it in a second screenshot in the post:
if you install https://github.com/tolitius/cupel and start it:
it will open with a set of sample (but real) run results, as you see above, if you want to look around. GLM 5.1 is there.
jeffwadsworth@reddit
But through an API. I run it locally and get better results via llama.cpp. The Q4 Unsloth version.
tolitius@reddit (OP)
GLM 5.1 local is great. which hardware do you run it on to make it perform locally?
jeffwadsworth@reddit
HP Z8 G4 with dual Xeons and 1.5 TB DDR4. Affordable a year ago.
tolitius@reddit (OP)
that's a beefy box
4 tokens per second? (my thinking: 200 GB/s with NUMA overhead)
jeffwadsworth@reddit
Sadly 1.7 t/s
slypheed@reddit
Try out the 122b mxfp4 version; fits in 128gb much easier.
tolitius@reddit (OP)
"mxfp4" is great advice
this one: nightmedia/Qwen3.5-122B-A10B-Text-mxfp4-mlx?
also do you see a difference in accuracy between that and mlx-community/Qwen3.5-122B-A10B-4bit?
slypheed@reddit
yep, that's the one.
It appears to be about the same quality-wise; honestly hard to tell; it def does dumb stuff, compared to say Opus or even Sonnet probably, but the original 4bit and even 6bit ones did as well.
tolitius@reddit (OP)
super, thanks
xraybies@reddit
My go-tos are mxfp8, or 4 if that doesn't fit. They're better in the sense of speed and lower heat generation... not sure if they're more or less accurate, but no obvious difference.
Gallardo994@reddit
M4 Max 128gb user here. For the love of god I just cannot understand how people get satisfactory results with Qwen3.5 122B. It keeps yapping and yapping at the easiest of tasks, making it honestly unusable for me, just as Qwq-32B was at launch. I use all the recommended sampling settings and I always update my llama-cpp in LM Studio. Qwen3.5 122B always takes much longer to reason before the final answer compared to both GPT-OSS-120B and Qwen3-Coder-Next. I tried both Unsloth Q4KM and nightmedia's mxfp4 text-only version. What am I doing wrong?
slypheed@reddit
Same. fwiw, it seems to be reasonably decent at actual coding.
But general workflow outside of coding (holy crap, I'd give my left kidney for local models to actually use git worktrees correctly!) is probably at least 10x slower than if i just did it manually.
tolitius@reddit (OP)
I found the local LLM universe to be very use case specific. I would love to get a few prompts (if you are ok to share them) that do not work well with Qwen 3.5 122B, but do work well with GPT-OSS-120B
understanding why it does not work could help to understand how to make it work
GPT-OSS-120B is a really good model, so is Nemotron 3 Super 120B A12B 4bit in the same class. But again when I say "good" I only mean these models have "been" good to me, for my use cases
Gallardo994@reddit
https://pastebin.com/2VBagFK2
Qwen3.5-122B Unsloth Q4_K_XL - 6 minutes 8 seconds (1.94s TTFT, 11201 tokens, 30.6 t/s)
GPT-OSS-120B MXFP4 - 18 seconds (1.41s TTFT, 1348 tokens, 80.42 t/s)
Time measured could be off by some seconds as I was using a stopwatch. Metal Llama.cpp v2.12.0 on LM Studio 0.4.9 build 1. Both models gave correct answers.
Qwen uses `Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0` recommended settings from the model card. GPT uses its own respective recommended settings, Medium reasoning.
I consistently get such long yapping on most of my prompts, except for super-easy ones like "Give me an example of FNV1A64 in C# and C++".
tolitius@reddit (OP)
super, thanks for sharing! will add to the collection and give it a go
I would probably make another separate post to share how it works on "my" Qwen 3.5 122B A10B Q4, to address other people's questions as well
xraybies@reddit
U have 2 issues:
1. System prompt.
2. LM Studio.
Accurate-Egg-6787@reddit
There have been a few posts about how the model can overthink if you use it without tool definitions, like this one [1]. Even without a custom system prompt, I find that having tools in scope helps a lot.
[1] https://www.reddit.com/r/LocalLLaMA/comments/1rf5y13/qwen3535ba3b_thinks_less_if_tools_available/
PraxisOG@reddit
I’m glad to see nemotron 3 super right behind Qwen 122b, it’s still a very capable model and personally I like its talking style more
tolitius@reddit (OP)
it is a little unfair to Nemotron 3 Super 120B A12B since it is not multimodal: one of the tests has images, and Nemotron is unable to score there
otherwise it would be almost exactly where "Qwen 3.5 122B A10B" is
I do really like the model, and I am hopeful with the latest NVIDIA investment in local models + their access to hardware Nemotron would only get better
magikfly@reddit
It's a breath of fucking fresh air to see a human written post here.
adobo_cake@reddit
This is the first thing I noticed and I read the entire thing.
vick2djax@reddit
Would you say you are pleased with the M5 Max 128GB or do you still end up dipping into Opus?
monjodav@reddit
I’m still heavily into opus with my m5 max
tolitius@reddit (OP)
I use both. I can't of course use Anthropic for anything that I need to stay local, such as the use case with the kids I described.
While the gap is definitely closing between the local models and frontier it is still a gap.
I really like my M5 Max 128GB though, it is probably one of the best laptops today, and with 614 GB/s memory bandwidth, I really feel them tiny superpowers when models like Qwen 3.5 35B A3B Q4, GPT-OSS-20B, etc.. are flying at 115 tokens per second. For example, daily reading many research papers and summarizing them takes a lot of time with dense or larger models, but Qwen 3.5 35B A3B Q4 does it fast and well.
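Those numbers line up with a back-of-envelope bandwidth ceiling: tokens/sec can't exceed memory bandwidth divided by bytes read per token. This sketch assumes decode is purely bandwidth-bound and ignores KV-cache reads and compute, so real speeds land below these ceilings:

```python
# Decode-speed ceiling: tokens/sec <= bandwidth / bytes read per token.
# Bytes per token ~= active params (billions) * bytes per weight (Q4 ~ 0.5),
# which gives GB per token directly. Upper bounds only.

def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_weight: float) -> float:
    return bandwidth_gbs / (active_params_b * bytes_per_weight)

BW = 614  # GB/s, the M5 Max unified memory bandwidth mentioned above

print(decode_ceiling_tps(BW, 3, 0.5))   # A3B at Q4: ~409 t/s ceiling (observed ~115)
print(decode_ceiling_tps(BW, 10, 0.5))  # A10B at Q4: ~123 t/s ceiling (observed 50-65)
```

The observed numbers sitting well under the ceilings is expected: attention, KV-cache traffic, and prompt processing all eat into the budget.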
Next would be to try Gemma 4 26B A4B Q4 for the same type of tasks
Excellent_Koala769@reddit
How many tps did you get for Gemma 4 31b 4-bit? I have the same laptop and I average about 26-28 tps running it on mlx.
tolitius@reddit (OP)
I did some oMLX bench for Gemma 4 31B 4-bit: depending on the context size it gets a little on the slow side
for example 2.3 tokens/second at 200K, while Qwen 3.5 122B A10B at 200K is 14.7 tokens/second
Choubix@reddit
Hi! Could you please share the size of the context window you can fit when using a 120B model on your 128GB of unified RAM?
tolitius@reddit (OP)
I did some benching with oMLX
Qwen3.5-122B-A10B-4bit with 1K and 200K context windows: +30.3GB for the 200K context (+20GB headroom remaining within my 117GB wired limit)
tokens per second slows down to 14.7/s at 200K, but it is not that bad given the size
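For a rough sense of where tens of GB of context memory go, here is a generic KV-cache estimator. The layer/head numbers below are placeholders for illustration, not Qwen's actual config, and real caches may be quantized:

```python
# KV-cache size ~= 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens.
# The architecture numbers used below are hypothetical, not Qwen's real config.

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """fp16 cache by default (2 bytes per element)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# e.g. a hypothetical 48-layer model with 8 KV heads of dim 128, fp16 cache:
print(kv_cache_gb(200_000, 48, 8, 128))  # -> ~39.3 GB at 200K tokens
```

Grouped-query attention (few KV heads) is what keeps this tractable; with full multi-head attention the same context would cost several times more.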
No_Individual_8178@reddit
Running qwen 2.5-72b q4 on an m2 max 96GB and the privacy thing resonates hard, same reason I went all local. At 96GB I can't fit the 122b models so I've been stuck in the 72b tier, which is fine for most structured tasks but tool calling gets shaky. Curious whether you noticed a big jump from 72b to 122b specifically on multi-turn tool use, or if the main difference is more about general reasoning quality.
DeepOrangeSky@reddit
Have you tried Qwen3 Coder Next 80b? It is MoE instead of dense, but, people kept saying it was the strongest coding model around up till ~230b back before all the new Qwen models came out.
Qwen2.5 72b is dense, right? So, on the one hand that should make it a lot stronger than the Qwen 80b MoE if they had come out at the same time, but it is a lot older too, so I'm not sure where they shake out compared to each other, given the 80b is newer and benefits from those advancements.
Did you try Qwen3.5 27b at a higher quant, too (since that one is supposed to be super strong for its size, and also you'd have enough memory to run it at significantly higher than Q4)? And if so, how did it compare to Qwen2.5 72b at q4?
Imaginary-Unit-3267@reddit
Not the person you're asking, but I've tried 80B and found it's really good at simple web apps but gets confused at anything really complex. 35B-A3B does a bit better and (on my machine) is a few t/s faster, but sometimes gets stuck in reasoning loops for thousands of tokens rethinking the same things over and over before responding. I still haven't fully gotten a sense of which is better for which tasks; and now I have Gemma 26B-A4B to compare them to, and it seems more "stable" than either so far, in the sense that it is less likely to suggest something that just flat out doesn't work or to (as both Qwens LOVE to do) quietly "simplify" something without asking in order to escape work.
No_Individual_8178@reddit
Haven't tried the coder 80b, 96GB would be tight for it. I do have qwen3.5 27b on the machine and it's solid for reasoning but for tool calling specifically I ended up sticking with the 9b. The 27b didn't justify the extra vram for structured output tasks. Might be different for coding though.
tolitius@reddit (OP)
I did, Qwen 3.5 family, in my tests, performed better than Qwen 2.5 at multi turns / tool calling
what are you pushing your 96GB to (iogpu.wired_limit_mb)? you can probably push it to 80GB to test mlx-community/Qwen3.5-122B-A10B-4bit
it takes ~70GB + the K/V | context
10GB may not be enough for actual work, but should probably show you the difference specifically for your "business" use case
No_Individual_8178@reddit
I haven't touched wired_limit_mb, just running Ollama defaults so yeah probably leaving headroom. Thanks for the pointer, might try bumping it to test 122b even with tight context. I have moved to 3.5 since then, and been on qwen3.5:9b for most tool calling stuff, way better than 2.5 was.
Thrumpwart@reddit
If you like Qwen 3.5 122b at Q4, check out the Apex I-Quality quant of it. It’s smarter and faster on Apple Silicon in my experience. I’ve been using it for a few days and it’s now my favourite model to run on the Mac.
tolitius@reddit (OP)
could not find the MLX version of it, is this the GGUF that you use?
will give it a try, but since it is llama.cpp, would it not be slower on Apple Silicon vs. oMLX?
thinking:
Thrumpwart@reddit
AFAIK it has not yet been ported to MLX yet but I really hope it will be.
I’ve never tried oMLX. Worth it?
tolitius@reddit (OP)
oMLX is gold!
the author (u/cryingneko) is really helpful and super active
Thrumpwart@reddit
Jesus you weren’t kidding, this is amazing!
Thrumpwart@reddit
Thank you, will give it a shot!
qubridInc@reddit
Qwen 3.5 122B stays king locally because it hits the rare sweet spot of frontier-level usefulness, real speed, and actual privacy.
moneylab_ai@reddit
The comparison between full-weight dense models and MoEs at different sizes is always going to be a little apples-to-oranges, but that's kind of the point — when you're running local, you care about what fits in your VRAM and what gives you the best output at that memory budget. Qwen 3.5 122B being a dense model that you can actually run on consumer hardware (with enough RAM) is its real advantage.
What I've found practically useful is tracking tokens/second at the quantization level you'll actually use daily, not just benchmark scores. A model that scores 2% higher on MMLU but runs at half the speed in Q4 isn't actually better for most workflows. The M5 Max with 128GB unified memory is an interesting test bed because it removes the multi-GPU complexity — you're testing the model, not your parallelism setup.
Curious whether you tested any long-context performance. That's where I've seen the biggest quality divergence between quant levels — Q4 and Q6 can score identically on short prompts but fall apart very differently past 16K context.
tolitius@reddit (OP)
you probably meant MoE, and yes, definitely an advantage
yep, that is exactly what I found with Gemma 4 31B, and the choice I had to make before with Qwen 3.5 27B as they are really good, but Qwen 3.5 122B is just more practical because of the speed
might need to do better testing, but what I found with longer contexts is that Q4, at rare times, goes into repetition looping, whereas Q6 does not. I could not attribute it to the length of the context vs. a certain / particular sequence of tokens that causes it
moneylab_ai@reddit
Good catch — yeah, MoE, not dense. Got my wires crossed. The Q4 repetition looping at longer contexts is interesting — that tracks with what I've seen too. It feels like a quantization artifact that only surfaces when the model needs to maintain coherence over longer dependency chains. Have you noticed it more with certain types of content (e.g., code vs. prose)?
tolitius@reddit (OP)
that's a difficult one, I only saw it a few times. some were with code, some were not. the way I found to fix it (reacting after it already happened) was to reword the prompt a little.
which is not super deterministic, as while changing the token sequence does not affect the context length much, it does affect the mythical pathways of Ks, Qs and Vs. I suspect model temperature would affect it as well
PiaRedDragon@reddit
I want to try this one, but I don't have the kit. Can you test and let us know if it is any good?
If it is I will bite the bullet and get a 128GB Studio.
https://huggingface.co/baa-ai/Qwen3.5-122B-A10B-RAM-100GB-MLX
MasterKoolT@reddit
If you can, wait for the M5 studios. They added neural accelerators to the GPUs so prompt processing is 4x faster than M4
PiaRedDragon@reddit
Good call. I was expecting them in the Spring release and held off buying a 512GB version of the M3, then they stopped selling them and increased the prices of the 256GBs, lol.
tolitius@reddit (OP)
sure will do, later tonight. will let you know
it is 93GB, not a lot of room for KV Cache / Context
and Qwen DOES like to think
but anything goes
RSultanMD@reddit
Wish I could get this to work lol
tolitius@reddit (OP)
you can try to install and run cupel
if you have oMLX, Ollama, LM Studio, SGLang, etc. running, it will auto-discover them
(you might need to add an API key for some if they require it)
if your hardware does not allow you to install / run LLMs, you can add something like OpenRouter:
you then can run a sample set of benchmarks or use models that would be also self discovered to create your own set
on local as well as remote models
let me know if you hit any problems
_derpiii_@reddit
Have you run into any thermal throttling?
tolitius@reddit (OP)
I did not, but.. when running multiple tests:
checking:
would be a good idea
will do
_derpiii_@reddit
Thank you for that power trick!
Zc5Gwu@reddit
Missing stepfun 3.5 flash.
tolitius@reddit (OP)
looks like it is 111GB at 4 bits (https://huggingface.co/mlx-community/Step-3.5-Flash-4bit)
it would be a bit difficult, but I can try
slypheed@reddit
It should run fine for single prompts, but context quickly overloads it (from testing on an m4 max 128).
I'd possibly use it over qwen 122b if not for that.
rosstafarien@reddit
Need to see some Gemma4 quants before I get too excited.
tolitius@reddit (OP)
makes sense
will do Gemma quants difference
I am interested myself
catplusplusok@reddit
Try MiniMax M2.5, I find it hard to beat for coding on a 128GB unified memory device (with some quantization / light REAP to fit)
tolitius@reddit (OP)
you mean the 3bit version?
it is 100GB, would not coding need some context to slurp the repos / parts of repos?
do you use a different one / how do you manage the K/V cache / context shortage?
DinoAmino@reddit
Neat opinions you got there. Guess you totally missed the Granite 4 releases, but that's easy to miss considering all the shilling in this sub centers on non-Western models.
tolitius@reddit (OP)
ah, yes
thanks for the correction, it is a bit smaller for the use case, that's probably why I did not mention it in the timeline, but definitely worth mentioning
Negative-Thinking@reddit
Hah, that is exactly the model I use on my M4 Mac 128GB and I totally agree - qwen is good (not as good as sonnet, but passable in many scenarios) . I am using Claude Pro for planning and code review, but delegate implementation of the plan to qwen 3.5. Qwen running through omlx. Claude sonnet/opus for final code review
tolitius@reddit (OP)
do you see a pattern in improvements Claude finds when reviewing Qwen's work?
I wonder whether you have enough intel from these reviews to summarize a do/don't markdown for Qwen specifically
Negative-Thinking@reddit
Nah, it's random and not enough data to come to any conclusions yet
RevolutionaryGold325@reddit
Can you please add the Qwen-3.5-397b-UD-IQ2_XXS
I want to see if others can reproduce my results of getting better output than with the 122b-Q4
tolitius@reddit (OP)
do you know whether it is available in MLX?
in Q2 it would probably need around 109GB, which leaves not much room for the context. plus in GGUF it would pay for CPU to GPU coordination (llama.cpp)
I'll certainly give it a try, but would appreciate any intel on how to best run it on 128GB Apple Silicon hardware
RevolutionaryGold325@reddit
Yeah that is a bit close to the limits. I'm using dgx spark for running it. I have not looked into MLX.
CarelessOrdinary5480@reddit
Such a goofy benchmark. Pitting a full-weight model against a MoE, at wildly different sizes.
tolitius@reddit (OP)
yea, agree: not apples to apples
but I was coming from "the best" I can run on 128GB (in reality ~117GB of unified RAM)
the best, accuracy wise from Gemma is "`Gemma 4 31B`" full precision
the best from Qwen is "`122B A10B Q4`"
and as anyone here I would love if Google releases Gemma 124B
CarelessOrdinary5480@reddit
Yea, I get it, I've considered that benchmark too. The world of agentic stuff for us on the strix halo is still qwen 3 coder or 3.5 122b.
Happy-Register3367@reddit
cool benchmarks, but I’d love to see more real world comparison (coding, reasoning tasks, long context ..). sometimes the headline numbers don't tell the full story.
tolitius@reddit (OP)
fair point
I was contemplating how to make it generic enough to be useful for all, but not so generic that it is useful to no one. I do have this breakdown:
there are more in a README here: https://github.com/tolitius/cupel
did not want to overload on pictures in the post
pseudonerv@reddit
You can run 120b a10b at q6, which works far better than q4 for me
tolitius@reddit (OP)
I did try Q6, it is significantly slower and eats a lot of RAM
interestingly enough it performed exactly the same as Q4 (for my use case that is)
also tried Q5, and landed on Q4, since it is really good and fast
Expensive-Paint-9490@reddit
"Still" the king for a model published when, one month ago? By a lab that is consistently SOTA. I hope you weren't expecting a Gemma model less than half its size to outperform it.
Far-Low-4705@reddit
thats not really a fair point
122b is a MOE, you are comparing it to a 31b DENSE model. there is a good chance it will be close just given that alone
tolitius@reddit (OP)
good point
interestingly enough "Gemma 4 31B" full precision did outperform "Qwen 122B A10B Q4" on a few prompts
however the reason Qwen pulls forward in my case is speed. It is ~50 to 65 tokens per second on M5 Max vs. "Gemma 4 31B" full precision at 7 tokens per second
Expensive-Paint-9490@reddit
The old (like, Mixtral 8x7B old) rule of thumb is that a MoE model performs like a dense model with the geometric mean of its total and active parameters. So (122*10)^0.5 ≈ 35. So, Gemma-4 dense at 31B can compare.
I was referring instead to the 26B-A4B, which by this rule is equivalent to a 10B model. I didn't get that dense 31B was in the picture.
reneil1337@reddit
neeed ~120B Gemma4 MoE
jgulla@reddit
Very interesting. Appreciate the detailed post!