mistralai/Mistral-Medium-3.5-128B · Hugging Face
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 299 comments
Mistral Medium 3.5 128B
Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessor Mistral Medium 3.1 and Magistral in Le Chat. It also replaces Devstral 2 in our coding agent Vibe. Concretely, expect better performance for instruct, reasoning and coding tasks in a new unified model compared with our previously released models.
Reasoning effort is configurable per request, so the same model can answer a quick chat reply or work through a complex agentic run. We trained the vision encoder from scratch to handle variable image sizes and aspect ratios.
Find more information on our blog.
TheWaffleKingg@reddit
Hey guys can I run this locally?
I have 4mb of ram and run with a core duo
Should work fine right?
Nobby_Binks@reddit
A token a day, but doable
Mart-McUH@reddit
I recommend asking only yes/no questions.
AlwaysLateToThaParty@reddit
And turn off thinking.
jacek2023@reddit (OP)
you need raspberry pi
onewheeldoin200@reddit
Yeah but the 5 tho not the 4
AlwaysLateToThaParty@reddit
Min 8GB model.
BaronRabban@reddit
Everyone should give this one a try:
https://huggingface.co/RecViking/Mistral-Medium-3.5-128B-NVFP4
It definitely works and is not brain dead. This model is not a lost cause! Almost all the quants don't work, but I can confirm this one does.
dumeheyeintellectual@reddit
The B number, considering I'm a D-grade brain at best: is there a B number at or below which an RTX 4090 will automatically be enough?
mhphilip@reddit
Anyone running this on a rtx 6000 pro?
quantier@reddit
Looking forward to the NVFP4 version 😍
AutonomousHangOver@reddit
2xRTX6000 Pro 262144 context size:
unsloth's quant
pp: 1100 t/s on an ~500-token test prompt (create a 3D spinning glass dodecahedron with inner light and orbiting lights, etc.)
And... it went berserk a second ago, looping all over again after ~1k tokens, on the newest build of llama.cpp
kaisurniwurer@reddit
Dense model looping? That sounds a bit worrying.
AutonomousHangOver@reddit
It's a llama.cpp issue. Unsloth removed gguf files as there were some problems, even with FP16.
I'm running it today on vLLM (0.21 dev nightly) with the EAGLE draft model.
vLLM logs show a very high draft acceptance ratio:
(APIServer pid=8762) INFO 04-30 11:59:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.7%, Prefix cache hit rate: 0.2%
(APIServer pid=8762) INFO 04-30 11:59:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.76, Accepted throughput: 40.30 tokens/s, Drafted throughput: 43.80 tokens/s, Accepted: 403 tokens, Drafted: 438 tokens, Per-position acceptance rate: 0.973, 0.932, 0.856, Avg Draft acceptance rate: 92.0%
Model is usable and seems pretty nice, but I don't have full tests finished.
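For anyone who wants to replicate this, the launch looks roughly like the following. This is a sketch, not my exact command: the --speculative-config keys and the method string depend on your vLLM version, so treat them as assumptions and check the docs for your build.

```bash
# Rough sketch: serve Mistral Medium 3.5 with the official EAGLE draft for speculative decoding.
# Assumptions: a recent vLLM that accepts --speculative-config as JSON; the method may need to
# be "eagle" or "eagle3" depending on the draft architecture and vLLM version.
vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --speculative-config '{"method": "eagle3", "model": "mistralai/Mistral-Medium-3.5-128B-EAGLE", "num_speculative_tokens": 3}'
```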
AutonomousHangOver@reddit
I'll continue by answering myself 😉
vLLM with the mistral tool-call parser is hitting a known error:
For now I've turned streaming off and I'm able to use e.g. Roo Code.
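In case it helps anyone else, the server side looks roughly like this. It's a sketch using vLLM's standard Mistral tool-calling flags (double-check them against your version); the streaming workaround itself is purely client-side, e.g. in Roo Code's settings.

```bash
# Rough sketch: Mistral-style tool calling in vLLM. The known error above shows up with
# streaming tool calls, so streaming stays disabled on the client for now.
vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tokenizer-mode mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```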
MiuraDude@reddit
If this is actually Sonnet level I love it!
Healthy-Nebula-3603@reddit
Sonnet 4.5 ..... that's old model
Qwen 3.6 27b easily beats that old model
VEHICOULE@reddit
Qwen models are bad in real-world use cases..
Healthy-Nebula-3603@reddit
Tell me you didn't use qwen 3.6 27b with opencode without telling me.
Artistic_Okra7288@reddit
Qwen3.6-27B is actually really good on real world use cases in my experience.
SKirby00@reddit
Only if you're comparing them against Sonnet or other models that are wayyy bigger. The Qwen 27B models are actually great in real world use for their size.
szansky@reddit
How about coding with it?
JLeonsarmiento@reddit
If I quantize this to 1 bit I can make it run on my machine…
BitGreen1270@reddit
/me looking for a 0.1-bit one that can run on a potato
JLeonsarmiento@reddit
🥔🤝💾
FullOf_Bad_Ideas@reddit
1.4bpw Mistral Large 123B was coherent, will that fit your machine?
IvGranite@reddit
DENSE
molbal@reddit
patricious@reddit
Denser than a snickers bar, I tell you that much.
hurdurdur7@reddit
this model is definitely thicker than a bowl of oatmeal ..
rpkarma@reddit
points and nods
Lissanro@reddit
This makes me feel nostalgic, because in the past the dense Mistral Large 123B was my most used model for a while. Then there were DeepSeek R1 and V3, later followed by the Kimi models, so it has been some time since I ran Mistral models. I will definitely give this new Medium 128B a try; it would be interesting to see how much progress Mistral has made by trying it on my actual use cases.
One more notable thing: they released a model for speculative decoding: https://huggingface.co/mistralai/Mistral-Medium-3.5-128B-EAGLE . This is great to see, because in the past, one of the big issues with Mistral Large 123B used to be that I had to use a mismatched Mistral 7B model for drafting, and even that gave a decent performance boost. Even though EAGLE is not supported in llama.cpp yet, this comment from about 3 weeks ago sounds encouraging that it may be available soon:
coder543@reddit
Unfortunately, no PR for that API refactoring has even been published, so... who knows if/when it will happen.
Supporting any one of EAGLE-3, MTP, or DFLASH would be a game changer for llama.cpp. I wish better specdec were being treated as the highest priority thing to develop in llama.cpp.
Nindaleth@reddit
I consider this PR to be relevant: https://github.com/ggml-org/llama.cpp/pull/22397 But he has several spec-related PRs going on, maybe it's a piece-by-piece effort.
coder543@reddit
Unfortunately, I can't even find a PR for that API refactoring, so... who knows if/when it will happen.
valtor2@reddit
What about at 10k context?
a_beautiful_rhind@reddit
The old Devstral 123B at Q4_K_XL:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 10240 | 1.641 | 624.04 | 9.690 | 26.42 |
| 1024 | 256 | 11264 | 1.656 | 618.42 | 9.785 | 26.16 |
| 1024 | 256 | 12288 | 1.674 | 611.71 | 9.889 | 25.89 |
| 1024 | 256 | 13312 | 1.688 | 606.62 | 9.994 | 25.61 |

This one should be similar. 4x 3090.
IvGranite@reddit
I ain't got that kinda time lol
valtor2@reddit
haha had to try 😇
temperature_5@reddit
Ugh, this means I'll get 1.6 t/s on 890m. But still, might be worth it on occasion if it's really smart!
TripleSecretSquirrel@reddit
That’s honestly better generation speed than I expected!
edsonmedina@reddit
I'm also on Strix Halo (128Gb) but the model fails to load (IQ4_NL)
harpysichordist@reddit
And he never reported back.
Creepy-Bell-4527@reddit
Similarly useless results on M3 Ultra. Needless to say I don’t think this model is for us 😉
exaknight21@reddit
Dense days return.
po_stulate@reddit
Waiting for dflash
Freonr2@reddit
128B THICC BOI
LetsGoBrandon4256@reddit
What a chonker.
grumd@reddit
128B dense is an interesting niche
TripleSecretSquirrel@reddit
Honestly just feels like it might be ahead of its time. I think once the next generation of GPUs come out — AMD will probably release AI Pro cards with 48GB and maybe even 96GB VRAM, and god willing, Intel will fix their driver issues making their cards more viable and offer bigger VRAM options — this model size might be the sweet spot for the high end of local inference.
Right now, for those of us with higher end consumer hardware (i.e., 32GB VRAM), Qwen 3.6:27B is basically the gold standard. You can run it at 4-bit precision with full context. Smaller models are getting better and better, but all else being equal, more parameters are pretty much always going to mean better output.
So I’m imagining that like next year, the bleeding edge of local inference will be models in the 80B-120B range instead of the ~30B dominance we’re seeing now.
HiddenoO@reddit
Why would they with current memory pricing? The optimal strategy for them is to allocate something like 90% of the memory to datacentres and then put the remaining 10% into relatively low-memory gaming GPUs so they don't lose brand awareness.
Just having the memory doesn't mean you can actually run them fast enough to be useful in practice.
nebteb2@reddit
W7800 48gb ai pro already exists lol
TripleSecretSquirrel@reddit
True, but I’m thinking RDNA4 platform.
falcongsr@reddit
My work gave me enough budget to buy a Mac Studio and I could only order the M3 Ultra with 96GB. I guess I could try this.
ahh1258@reddit
Yall hiring? 🤣
dtdisapointingresult@reddit
copium.jpeg.png
grumd@reddit
I wouldn't bet on companies releasing GPUs with more memory when memory tripled in price and is fully sold out
BubrivKo@reddit
Yup, I was so tired of these MoE models. They are not bad, but the ones with very few active parameters are actually stupid and not that useful. Sadly, I cannot run this Mistral, but I'm still happy that someone is still working on dense models, which are superior!
grumd@reddit
I think moe models are the future unfortunately, simply to crunch more knowledge into the model while not destroying the speed. The only mistake is making the active params count too low. Something like A30B is probably enough for it to not feel dumb. Even Qwen 122B A10B has been great for me locally
NandaVegg@reddit
I haven't had a chance to try this model yet, but generally a MoE with reasonably sized shared weights/activated parameters has a significant advantage over a large dense model, as LLM activation is mostly noise, which is empirically just bad rather than something useful (naively upping the number of active experts for an existing MoE model simply makes the model worse, low-pass-filter-type gating techniques work well with LLMs, etc.). The "partitioning" done by the MoE architecture works to filter out that noise.
A remarkable advantage of a large dense model usually comes with the large hidden dim (GPT-3 DaVinci was 175B with a 12288 hidden dim IIRC? Llama-3 405B is 16384, Mistral Medium 3.5 is also 12288), which can distinguish and partition extremely close features that would otherwise overlap in, say, a 5120-hidden-dim model, and (hopefully) keep them separated through the layers. That also means the (large hidden dim) model could place Paris and London (along with related things like Toulouse and Brighton) at polar opposite ends of the latent space if there is enough evidence in the datasets to do so. I'm not sure if that is good or bad. Good old GPT-3 DaVinci had a feel that the inference path diverges really hard (the model goes from one mode to another; by today's standards that would at least mean base/instruction/single-turn reasoning/terminal-agentic modes) off just one token.
For creativity and generalization on hard problems, one would generally want more layers rather than a larger hidden dim within the same parameter count, unless the hidden dim becomes too small to create meaningful basins.
toothpastespiders@reddit
I mourn GLM Air every day.
BubrivKo@reddit
Or why not 1T + 100B active 😃
Caffdy@reddit
we already got 1T + A40B~ models
AltruisticList6000@reddit
Yeah, I can't run big models like this, but I was thinking: what if, for example, there was something like a 35B MoE but with 9-10B active? That could spill over into RAM but would still have okay speed, and would probably be smarter and more knowledgeable than 12-14B dense models on the same hardware with barely any speed difference. Or they could just do 20-24B dense models like Mistral, which are still more intelligent in some ways than the 3B-active MoEs I tried, which don't feel smarter than 9B dense models.
Ardalok@reddit
Yeah, I get 25-30 tokens/s on 32GB DDR5 and an RTX 4060 with the 35B Qwen in Q4; it would be nice to have a smarter model at a slightly lower speed.
TokenRingAI@reddit
I think the engram method is the future, with small dense models retrieving information from slow storage.
Equivalent-Freedom92@reddit
It's pretty close to the upper limit of what's possible with consumer hardware. 4x 3090 for 96GB of VRAM and such, using a board like the ASUS ProArt with 3x 16x PCIe slots and then converting one of the M.2 NVMe slots into an additional 16x slot with an adapter. For inference, the lane bandwidth won't completely choke out quite yet with such a setup.
Freonr2@reddit
The ROMED8-2T beckons you.
Makers7886@reddit
I have two of these and they're my top recommendation... or they were until earlier today: I googled them and no retailers had them, and used ones on eBay were going for 3x what I got mine for ($1600+). Crazy times.
Prudent-Ad4509@reddit
If you need more than 2 GPUs on such boards, you are better off with a PCIe switch.
Real_Ebb_7417@reddit
A "you can run it locally, but you won't like the experience" niche 😂
But I'm happy to see them make a dense model; they have experience with them already, so hopefully this one will fare much better against similar-sized models than Mistral Small 4 did.
Sunija_Dev@reddit
For roleplay/writing, you can run it at home for ~1200€.
For that money you get 2x 3090, so you can run IQ2_M at ~5 tok/s. Since you probably already have a GPU, you can also run a bigger quant. In my experience, even the old Mistral 123B knocks everything out of the park at that size (for writing).
...and that is probably the best affordable thing you can run at home? MoEs get better at ~400B params, but the RAM is probably crazy expensive. Not sure about the speed.
__some__guy@reddit
Used 3090s are about 1000€ now (at least in Germany).
FullOf_Bad_Ideas@reddit
EXL3 is great for dense 120B Mistrals. 2.5bpw quants are actually pretty good.
The exllamav3 author got coherent output from 1.4bpw Mistral Large 123B, so 2.5bpw is plenty, and it should be better than GGUF at this size. It also supports tensor parallel, so it's pretty fast.
q-admin007@reddit
It comes with a bespoke draft model. Could be faster than Qwen 3.6 27b in the end.
Real_Ebb_7417@reddit
Does speculative decoding work well in llama.cpp though? (Serious question, didn’t test it so far)
dtdisapointingresult@reddit
There are various types of speculative decoding. llama.cpp doesn't support MTP or EAGLE-3, which is what the AI labs usually provide. For example, Qwen and GLM models have MTP, while this Mistral Medium release has an EAGLE-3 draft from Mistral.
With llama.cpp your only option is the more ghetto solution of using a small draft model, but finding a compatible model is a bitch. It's easy when they have the same tokenizer/vocabulary, for example using Gemma 3 to boost Gemma 4. You get a major speed boost; you might get double speed for free. But idk what you could use as a draft model for this Mistral Medium release.
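For reference, the ghetto version is wired up like this in llama.cpp. This is just a sketch: the draft GGUF name below is a placeholder, since finding a vocab-compatible draft for this release is exactly the open question.

```bash
# Classic draft-model speculation in llama.cpp (not EAGLE/MTP).
# -md/--model-draft must share the target model's tokenizer/vocabulary;
# the draft filename here is a placeholder, not a real release.
llama-server \
  -m Mistral-Medium-3.5-128B-Q4_K_M.gguf \
  -md placeholder-small-mistral-draft.gguf \
  --draft-max 16 --draft-min 1 \
  -ngl 99 -ngld 99
```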
RoomyRoots@reddit
There is a comment on a different reply. TL;DR, not yet.
coder543@reddit
Qwen3.6 27B has MTP built-in and DFlash support... don't see how Mistral Medium 3.5 could ever be faster just because of an EAGLE-3.
FullOf_Bad_Ideas@reddit
dense models run fast with tensor parallel. I had 16 or 20 t/s with Devstral 123B IIRC and I have 11 t/s with Hermes 405B.
I like dense models like that more than I like 1T MoEs that I have no memory for.
More-Curious816@reddit
They're probably targeting the middle-class to upper-middle-class GPU owners, not the GPU-poor class like us.
Herr_Drosselmeyer@reddit
It won't run well on anything below a 6000 PRO. I don't consider that middle class.
stoppableDissolution@reddit
The old Large in Q2 was beating everything else you could run in 48GB (2x3090) up until Gemma 4 31B got released, idk. We'll see how this one holds up.
jochenboele@reddit
This made me laugh out loud 🤣 Thanks
dtdisapointingresult@reddit
For most people, this model will only be used for writing, with reasoning disabled, at Q4. But the good part is that you don't really need more than 5 tok/sec for this kind of task.
Late-Assignment8482@reddit
Valid. The average adult reads somewhere in the 8-12 tokens/second (AKA 4-6 words) range, and as someone who writes a LOT of prose, I'd rarely need something to write faster than me. Creating copy is just not the same task as creating code. A model that takes five minutes but then doesn't need four more tries, just a read and an edit because it's mostly well written, is superior to one that took 10s to generate it.
Healthy-Nebula-3603@reddit
You're serious?
With 5 t/s that model is useless.
The only thing you can do with it is use it for simple chat conversations.
po_stulate@reddit
I think they're betting on speculative decodings like dflash. If a huge dense model like this is actually smart and can still have a good speed with a speculative decoder like dflash, then that's the biggest win.
Healthy-Nebula-3603@reddit
That's only in theory.
I've tested speculative decoding many times.
It is hard to get even more than 0.5 acceptance with such a small draft model, like 3.5 GB (FP16, so ~2B parameters?).
dtdisapointingresult@reddit
0.5 acceptance is a huge speed boost. With num_speculative_tokens=3 I get almost a double speedup with prompts like 'write me an essay about XYZ.'
But even 5 t/s is good enough. If you're asking it to create the occasional Instagram post for you, to write an email, or to help you with the story of the game you're developing, speed doesn't matter unless you have a massive pipeline. 5 tokens/sec is only slightly slower than the average person's reading speed.
The only reason we think it's slow is because agentic coding needs to generate like 5k tokens of reasoning + 2k of source code on every turn. This is not necessary for writing. If you want an Instagram post, it only takes 20-30 seconds.
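Back-of-the-envelope, using the textbook simplification of an independent per-token acceptance rate α and draft length γ (a simplifying assumption, real acceptance isn't i.i.d.): expected tokens per target-model pass ≈ (1 − α^(γ+1)) / (1 − α). With α = 0.5 and γ = 3 that's (1 − 0.0625) / 0.5 ≈ 1.9 tokens accepted per verification pass, i.e. close to half as many target forward passes per generated token if the draft is cheap, which is roughly where the "almost double" comes from.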
Healthy-Nebula-3603@reddit
0.5 acceptance from 16 proposed tokens ....
dtdisapointingresult@reddit
Well, I'm not an expert on that stuff. I experimented with different num_speculative_tokens values and found I got the fastest speedup with tokens=3. Even with a lower mean acceptance than you, this meant a huge speed boost. I'm talking double speed if I'm using MTP on Qwen, and a 50% speed boost using Eagle3 on Gemma 4 31B (it doesn't support MTP).
Don't stick to defaults without doing some quick experiments; I think it varies based on the compute of your hardware. Also try the same type of prompt you'd use in reality (for example, Python coding).
Healthy-Nebula-3603@reddit
Wait... did you say you're using a Qwen 3 draft on Gemma 4?
You know it doesn't work like that?
dtdisapointingresult@reddit
No.
LegacyRemaster@reddit
waiting for qwen 3.6 122b...
Nobby_Binks@reddit
Yes, and 397B
Freonr2@reddit
Right, even on an RTX 6000 that's probably <10 t/s based on what I get with other models.
reto-wyss@reddit
I'm getting 16 tg/s at around 10k tokens deep, but that's without MTP, on 2x Pro 6000. There appears to be something wrong with the KV-cache calculation in the vLLM nightly, though. PP may be over 1k/s, but I haven't run any real tests because of the KV-cache thingy.
FullOf_Bad_Ideas@reddit
2x RTX 6000 Pro with tensor parallel?
Healthy-Nebula-3603@reddit
Yes
That 120B dense model is practically useless at such speeds... And you still need an insanely expensive card... And it's still useless.
The only useful use case is simple chat.
Longjumping-Boot1886@reddit
For a MacBook with 128GB of VRAM, I think.
CYTR_@reddit
At 5tk/s lol.
q-admin007@reddit
50 t/s with EAGLE drafter. Do you still live in 2024?
CYTR_@reddit
You can't do everything with a draft, especially if the request is complex and involves lengthy contexts. I even linked the draft as soon as I saw it. There's no need to be unpleasant in your reply.
Healthy-Nebula-3603@reddit
Even 5 t/s will be challenging....
Eyelbee@reddit
I like the choice but it doesn't seem to deliver the performance
ambient_temp_xeno@reddit
124B Gemma niche.
alex_bit_@reddit
How many 3090s to run this thing?
InstaMatic80@reddit
Too big for my 3090 😅 Waiting for a 27B version
overand@reddit
Just don't forget to check out the "club 3090" https://github.com/noonghunna/club-3090 - it's a llama.cpp and/or vllm setup that gets surprisingly good speed for Qwen3.6-27b from one or two 3090s. With my 2x 3090 setup, I went from ~27 t/s to ~80 t/s. It's pretty wild.
InstaMatic80@reddit
😳 I’ll definitely take a look
silenceimpaired@reddit
If they released an MoE at this size it would be cozy for those with the RAM
reto-wyss@reddit
Qwen 27b, who is the densest now?
zenmagnets@reddit
Unfortunately Qwen3.6 27b is still the smarter model. It matches Mehstral 3.5 on SWE Verified, but the 27B is better at BrowseComp and agentic tasks.
Clank75@reddit
My experience of Qwen3.6 so far is that it is undoubtedly, by far, the most adept model at flat-out lying and fabrication I've ever used. It's genuinely remarkable - it hallucinates toolcalls it never made, it invents entirely false references that it never looked up, and perhaps most novel - when called out, it doubles down and creates more lies to cover its tracks.
It's certainly smart. Completely useless and untrustworthy, but smart. I predict a career in politics for Qwen3.6.
AndThenFlashlights@reddit
What quant are you running at? I haven't used 3.6 heavily yet, but 3 and 3.5 have been generally honest with me - frankly, hallucinating less than Claude and ChatGPT at certain focused coding tasks. I have seen issues where 3.6 will think and argue with itself (for a LONG time) if it's not sure about something, or if the context isn't clear about something.
Clank75@reddit
Unsloth's Q6. And yeah, 3.5 - while in general more prone to confabulation than other models - is not nearly as bad as 3.6 imx.
It's genuinely bad enough that I don't think it's a useful model for my purposes. You should never trust an AI, of course, but 3.6 seems to go up a notch into pathological lying.
AndThenFlashlights@reddit
Oof, that's annoying. And Q6 shouldn't be introducing oddness on its own.
Yeah that's really disappointing. Qwen3-30b-a3b was my daily driver for a long time because it lied the least out of all the similar models I used.
SqueakySquak@reddit
Color me surprised when, in the middle of a coding session, Qwen 3.6 27B tried to call the MCP tool to read my Gmail... When called out on it, it pretended it was a mistake and not trying to spy on me or anything. I'm running the model unquantized BTW. Apart from its spying tendencies it's actually pretty good at coding. But I wouldn't trust it, and certainly not unsupervised!
Agreeable_System_785@reddit
For your use case. As a European, I gladly embrace the work of Mistral. LLMs have more use cases besides coding. Still, you seem to have thoroughly tested this model already.
FullOf_Bad_Ideas@reddit
llama 3.1 405b
find me a denser model, I'll wait.
pkmxtw@reddit
Does franken-self-merge of L3.1 405B with like 1T dense parameters count?
https://huggingface.co/mlabonne/BigLlama-3.1-1T-Instruct
FullOf_Bad_Ideas@reddit
I guess so, if it produces coherent text.
Maleficent-Ad5999@reddit
Qwen has its sibling too, right? The 122B one?
JaredsBored@reddit
That's an MoE, with only 10B active per token. This is 128/128B active every token.
Maleficent-Ad5999@reddit
Ah right. Totally forgot. Thanks
Upstairs_Tie_7855@reddit
What about Gemma 4 31b?
Key_Papaya2972@reddit
first glance: another ~120B, nice, let's see what the active param count is.
second glance: 128B what?
jacek2023@reddit (OP)
MotokoAGI@reddit
119B was trash and 123B wasn't too bad. Glad to see this looks solid. I wish they compared it to a similar-sized model like Qwen 122B.
jacek2023@reddit (OP)
Qwen 122B is MoE, number of active parameters is totally different
overand@reddit
It doesn't look like there have been any dense model releases in this size range since March 2025 (CohereLabs/c4ai-command-a-03-2025), and before that it was 2024:
FullOf_Bad_Ideas@reddit
you forgot Devstral 2 123B.
Hermes 4 405B finetune released on 2025-08-26 too.
overand@reddit
Ah - yeah, if I'm going to include the two versions of Mistral-Large-Instruct, I definitely should have included those. The former is based on Mistral-Large-Instruct, right, and the latter on Llama-4?
FullOf_Bad_Ideas@reddit
Devstral 2 Instruct 123B is a different architecture and it has different tokenizer, it's most likely not the same base.
Hermes 4 405B is a llama 3.1 405B finetune.
overand@reddit
Yeah. Dense vs MoE is pretty significant.
mantafloppy@reddit
And already at re-release 2, of course...
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF/discussions/1
BaronRabban@reddit
They appear to have just taken down all of the ggufs
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF/tree/main
mantafloppy@reddit
Aiming for a 3rd re-release under 4h.
I need to start taking notes; might be a new record.
It's not like they would have known it wasn't loading if they had tested it...
But they were first, that's what's important.
Makers7886@reddit
fool me once
yoracale@reddit
We're working with Mistral on this, but it seems through further testing that the GGUF implementation needs more investigation. Prompting the model works the first few times, but afterwards it doesn't work properly. Mistral has now labelled GGUF implementations as a WIP. Seems most likely to be a parser issue.
tarruda@reddit
They deleted the repo, likely to erase discussion history. They could have just re-uploaded it.
yoracale@reddit
Not to erase discussion history lol. We're working with Mistral on this, but it seems through further testing that the GGUF implementation needs more investigation. Prompting the model works the first few times, but afterwards it doesn't work properly. Mistral has now labelled GGUF implementations as a WIP. Seems most likely to be a parser issue.
DJTsuckedoffClinton@reddit
i miss the old mistral
jacek2023@reddit (OP)
what does it mean?
Monad_Maya@reddit
He's a Kanye fan.
ttkciar@reddit
This is great news! Looking forward to giving it a try.
Devstral 2 Large was a huge disappointment, but hopefully MistralAI has learned from their past mistakes and cooked up this 128B right. Maybe this will finally be the 120B-class model which knocks GLM-4.5-Air off its perch?
ROS_SDN@reddit
It's a very niche model since it's fully dense.
It and GLM-4.5-Air don't really compare: one can tolerate hybrid inference, the other can't.
This is an entirely different beast at this level of density. It needs to absolutely cook to be worth the resources needed to run it interactively with a user, or to still really, really cook to be worth running back-end batch jobs on it at a crawl.
Very few local people will be able to use this as a chat agent; it's really an RTX 6000 Pro+ model.
GLM 4.5 Air, by contrast, you could respectably work with on 24-48GB VRAM + 64GB DDR5.
ttkciar@reddit
I am quite aware of the differences between MoE and dense.
The fact remains that GLM-4.5-Air is more competent at codegen than every 120B-class model I've tried, including Devstral 2 Large (which is 123B dense), at least for the specific codegen skills that matter most to me.
GLM-4.5-Air is quite weak at function-calling, but superior at instruction-following compared to GPT-OSS-120B, Qwen3.5-122B-A10B, and Devstral 2 Large. It is more prone to inferring bugs, but hallucinates less and is less prone to inferring design flaws. That's made it my go-to.
Every time a new 120B-class codegen model pops up, I think "surely this one will beat Air", but so far they just haven't, despite Air "only" having 106B total parameters.
I've downloaded Mistral Medium 3.5 128B, but won't be running it through my codegen tests until tonight. Maybe it will be "the one".
ROS_SDN@reddit
Pray for qwen3.6 122b then to hit the mark.
billy_booboo@reddit
Or perhaps not
rebelSun25@reddit
In before "Guys, can I run this on my single RTX 3060 ?"
We've all been there. And no, you can't. It's a chungus of a model.
cutebluedragongirl@reddit
Yeah... shit is fucked.
lolidkwtfrofl@reddit
How am I looking with my 4070ti? Much better odds right? Right?
cries in corner while looking at a RTX PRO 6000
one day, my love, one day.
__some__guy@reddit
I wonder if it can still compete with the older Mistral models in terms of creativity.
Its description doesn't sound promising.
Either way, nice to have a new large and dense model that can still be run locally with a reasonable setup.
mantafloppy@reddit
Backup of the IQ2_M if anyone wants to play with it before the Unsloth re-upload.
https://huggingface.co/mantafloppy/Mistral-Medium-3.5-128B-GGUF
BaronRabban@reddit
I just tried the Bartowski quant in llama.cpp and it is also brain damaged.
And even the vLLM nightly doesn't seem able to load GGUFs for this yet, so I am at an impasse.
ValueError: GGUF model with architecture mistral3 is not supported yet.
mantafloppy@reddit
I have not tried Bartowski, but the Unsloth version at IQ2_M gave me one answer, and the beginning of another before looping.
https://hastebin.com/share/okezetehol.xml
But it was loading in Lm Studio.
CYTR_@reddit
There is a draft : https://huggingface.co/mistralai/Mistral-Medium-3.5-128B-EAGLE
Zc5Gwu@reddit
What kind of speed increase do you think you could see I wonder…
DinoAmino@reddit
SD with EAGLE drafts is great when generating code. Often double the tps, or more. But it doesn't do very well with regular text generation. Sometimes it drops to half the normal tps. It would be great if it were possible to enable/disable SD per request for specific tasks.
simracerman@reddit
Someone posted the Strix Halo numbers for Q4 at 3.5tps. Double is still horrible at 7tps.
q-admin007@reddit
Waiting for GGUF! Should fly on my Strix Halo.
rangorn@reddit
I consider myself quite dense; I had no idea it had become quite the compliment.
Limp_Classroom_2645@reddit
I don't appreciate dense models anymore
_ballzdeep_@reddit
Why did everyone stop pushing 70B models?
Reddit_User_Original@reddit
I want to see more in the 35 to 40B range personally. That is the sweet spot for 64GB unified memory.
simracerman@reddit
How sweet is it at 5 t/s?
Reddit_User_Original@reddit
Breh wut??? I believe I'm getting about 50 t/s with qwen3.6 35b moe
simracerman@reddit
Wait, in the context of your reply I thought you meant dense. Ofc, would love to see more MoE competition in that range. The real sweet spot is the 80B with 8B active from Qwen.
FullOf_Bad_Ideas@reddit
you'd want to see more dense 70B models?
Would you run them if they came out?
_-_David@reddit
Yes!
RetiredApostle@reddit
For a Llama anniversary, Meta probably will.
ortegaalfredo@reddit
Too expensive to train. About the same as a 400B model IIRC.
Service-Kitchen@reddit
literally
Local_Phenomenon@reddit
My Man!
funding__secured@reddit
What happened to the models? Everything has been pulled.
BaronRabban@reddit
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B/discussions/2#69f26fcce689cdc885c82a2f
BaronRabban@reddit
I don't know what's wrong, but it feels brain damaged to me. Running the Q6 on five 3090s. Latest llama.cpp, which I just rebuilt. Something is definitely off, but not sure what. Just feels like brain damage.
tarruda@reddit
If it is unsloth gguf, I'd wait a few weeks before trying the weights.
But also, I no longer have high expectations with mistral models.
yoracale@reddit
It's most likely a GGUF parser issue, not the model or the quant algorithm. Unfortunately, no matter what GGUF you create with the model, it doesn't function properly, most likely due to the parser.
IrisColt@reddit
hmm...
yoracale@reddit
We’re working with Mistral on this, but further testing suggests the GGUF implementation needs more investigation. The model responds correctly to the first few prompts, but then begins behaving improperly. Mistral has now labeled GGUF implementations as a work in progress, and this appears most likely to be a parser issue.
Affectionate-Cap-600@reddit
From a fast reading of the config file, it seems to be a pure global softmax attention model... I mean, it doesn't seem to use sliding window in any of the layers.
Quite rare nowadays; even non-hybrid models use some kind of sliding window or sparse attention in some layers... those are 88 layers of pure attention. Also ~10k+ hidden size and ~20k+ MLP intermediate size.
Interesting for sure... we needed a model like that.
I assume they spent quite a lot training it. The memory footprint at 256k context will be crazy.
We will see if they release a report.
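If anyone wants to check the same fields, they can be pulled straight from the repo's config.json. This is a sketch assuming the usual transformers field names and a local copy of the weights; adjust the path and key names to what's actually in the file.

```bash
# Dump the attention/width-related fields; a null or missing sliding_window
# would mean full global attention in every layer.
jq '{num_hidden_layers, hidden_size, intermediate_size, num_attention_heads, sliding_window, max_position_embeddings}' \
  ./Mistral-Medium-3.5-128B/config.json
```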
alberto_467@reddit
Thank you for noticing that, it is really an interesting choice to not use any hybrid sparse layers.
seconDisteen@reddit
What a pleasant surprise!
As someone who only really uses local LLMs for creative writing/RP, I am still using Mistral Large 2 123B, mostly the Behemoth finetunes. Ever since things have shifted to fully MoE with a focus on coding, there hasn't been much I've gotten excited about. Yes, even many smaller MoE models are smart and can do creative writing, but with smaller active parameters they often don't have really dense knowledge on fandoms and other things I like to explore. Granted, ML2 still does almost everything I want it to, is pretty smart, and knows a lot about a lot, so I haven't really been griping for a new dense model, but I'll sure as hell take one! I can't wait to try this thing out!
IrisColt@reddit
Interesting... can't you just feed the relevant bits and cross your fingers? Gemma 4 excels at this.
seconDisteen@reddit
You can, and I have done that with a number of the newer, smaller models, but it can become tedious really quickly. Especially if you're working with a really expansive IP, like Harry Potter or Marvel. It's nice that dense models already have so much of the lore baked in, and you only need to tweak it with context. Even Miqu 70B, going back, what... 2.5 years now? was really dense with pop culture knowledge. With these newer, smaller models you have to do a lot of the heavy lifting in context, which is not only tedious but eats up context, especially as the story drags on. Not only that, but I've found that the smaller MoE models aren't nearly as good at tying the entire story together. If you have a 40k-context story, with multiple scenes, characters, and hooks, I find smaller models aren't as good at taking everything into account as things progress.
I have gotten some decent results with some of the newer stuff, particularly with GLM Air. But I felt it was only about as good as the results I get from Behemoth/ML2 when dealing with creative writing in large fandoms, despite being much newer with a better architecture. Yes, it's faster, but it requires more work to get more or less the same results.
jacek2023@reddit (OP)
oxygen_addiction@reddit
Wow, those BrowseComp numbers are horrendous.
Dany0@reddit
idk man, beating Sonnet 4.5 with 122B is fine imo. Sonnet models are likely in the 500-1000B range.
IrisColt@reddit
oof.gif
sterby92@reddit
So Qwen3.6-35B and 27B crush it with way less compute? 🤔
disgruntledempanada@reddit
That's not Qwen 3.6 35b unless you are referencing another benchmark.
sterby92@reddit
yeah, not in this benchmark. But qwen3.6 35b / 27b are around the quality of qwen3.5-397 in a lot of benchmarks.https://artificialanalysis.ai/?models=gpt-oss-120b%2Cgpt-oss-20b%2Cgpt-5-5%2Cgpt-5-4-mini%2Cgpt-5-4%2Cgpt-5-4-pro%2Cmuse-spark%2Cgemma-4-31b%2Cgemini-3-1-pro-preview%2Cgemini-3-flash-reasoning%2Cclaude-opus-4-7%2Cclaude-4-5-haiku-reasoning%2Cclaude-sonnet-4-6-adaptive%2Cmistral-small-4%2Cdeepseek-v4-pro%2Cdeepseek-v3-2-reasoning%2Cdeepseek-v4-flash%2Cgrok-4-20%2Cnova-2-0-pro-reasoning-medium%2Csolar-pro-3%2Cminimax-m2-7%2Cnvidia-nemotron-3-super-120b-a12b%2Ckimi-k2-6%2Cmimo-v2-5-pro%2Ck2-think-v2%2Cglm-5-1%2Cqwen3-5-397b-a17b%2Cqwen3-6-35b-a3b%2Cqwen3-6-27b%2Cqwen3-6-max
Dabalam@reddit
You can't exactly generalise in that way since different benchmarks measure different things and not all models are compared on the same benchmark. That said, if you look up the SWE verified leaderboard you can see this is slightly behind GLM-5 and Gemini Flash on this particular benchmark, and ahead of Qwen 27B, Kimi K2.5, and Qwen3.5 397B.
jacek2023@reddit (OP)
what do you mean?
RandumbRedditor1000@reddit
I really hate this. Give me back my 24B dense small model
jacek2023@reddit (OP)
You can download 24B models as many times as you want
atape_1@reddit
There we go, there is the big announcement.
RandumbRedditor1000@reddit
SWE is benchmaxxed unfortunately
arkuto@reddit
So basically it's a MoE with structure 128B-A128B. Nice.
Conscious_Cut_6144@reddit
First good Mistral model (for my usecases) in a hot minute.
Nice!
IrisColt@reddit
What do the benchmarks show?
Academic-Map268@reddit
So is this thing better than Mistral 3 Large 2512? (675B MoE)
fluffywuffie90210@reddit
18 tokens a second on 3x 5090 with Q4_XL. I just saw they took down the GGUF, so there must be some issue with it. I get 100 with Qwen 122B because it's a MoE, but hopefully the intelligence gain might be worth it.
PANIC_EXCEPTION@reddit
Now quantize this to 1.58 or 1-bit.
claykos@reddit
i dont know what to say .....
LosEagle@reddit
1 t/m here i come
jacek2023@reddit (OP)
I’m still unable to buy a fourth 3090, and this is exactly the moment when I need one.
Healthy-Nebula-3603@reddit
Even with four of those cards you won't get more than 6-7 t/s.
A 120B dense model is useless at home for real work, except simple chat.
TacGibs@reddit
Yet more BS: I was getting around 25 tok/s for the previous dense model with 4x RTX 3090, a Q4 quant, and ik_llama.cpp with graph mode.
Infantryman1977@reddit
Why not 50 tok/s using vLLM? llama.cpp, ollama, etc. are 50% slower for tensor parallelism. You can even get an extra 10% if using a custom P2P kernel.
jacek2023@reddit (OP)
I am trying to use vllm to run gemma 4 31B, any tips how to use 200000 context?
TacGibs@reddit
With what quant ? 🙃
AWQ is already too big to have a nice context.
FullOf_Bad_Ideas@reddit
nah I was getting 16-20 t/s with 3 3090 tis.
with 8 3090 ti's I get 11 t/s on Hermes 4 405B 3.5bpw exl3. It's slow but acceptable even for basic agentic coding.
Tensor parallel is the key
FullOf_Bad_Ideas@reddit
Devstral 2 123B ran great on 3 3090 Ti's, so this should work great on your 3090s too, as Mistral Medium 3.5 also has 96 attention heads, which is divisible by 3 - so you can run tensor parallel and get about 10-15 t/s output.
I'd recommend grabbing EXL3 quants once they're out. 3.5bpw should be optimal.
krzyk@reddit
Curious what setup you have? Threadripper or something, a board with bifurcation?
I'm still looking for my first 3090 (upgrade from a 3060 Ti).
jacek2023@reddit (OP)
I posted my setup many times, search for my posts here with 3090 or 72GB in title
krzyk@reddit
cool, thanks
WizardlyBump17@reddit
maybe i can get 1 token per year on my b580 + 1650 + 32gb ram + 32gb swap
sine120@reddit
"You are the oracle. You will answer all queries in "Yes" or "No" only."
jQuaade@reddit
You forgot to turn thinking off and accidentally remade the scenario from Hitchhikers Guide to the Galaxy
sine120@reddit
"Thought for 14yr, 28d, 6h, 21m, 14s"
sine120@reddit
"This model runs all night no issues!"
RegularRecipe6175@reddit
The Unsloth quants page is a 404 now. Maybe they are fixing the looping issue?
FullOf_Bad_Ideas@reddit
Nice, I like them putting out dense models. It should be great for people with a few GPUs or cheap on-prem deployment for coders
AndreVallestero@reddit
We're gonna need HBM3 before this thing is practical. Though I don't mind it for overnight work
FullOf_Bad_Ideas@reddit
normal GPUs + tensor parallel and it's absolutely doable, as you can multiply your bandwidth read speed.
Few_Painter_5588@reddit
Very, very impressive if the benchmarks are anything to go by. And it's also something realistic you can run at home at a decent quantization. Being realistic here, most people are not running GLM 5.1. But something like this can run on local hardware.
Thomas-Lore@reddit
This is a large dense model, how are you going to run it? On what?
Few_Painter_5588@reddit
4 B60s at INT4
Thomas-Lore@reddit
Good luck, report the numbers. But that is not something I would call "realistic to run at home". And it will be slow.
thereisonlythedance@reddit
It’s fine to run on 4x3090s which many in the community have.
Healthy-Nebula-3603@reddit
Lol
That dense model will give you around 6-7 t/s on 4x 3090... Good luck.
FullOf_Bad_Ideas@reddit
nah, 16 t/s when I ran it on 3 3090 Tis.
TP exists.
I run Llama 405B at 11 t/s on 8 GPUs.
thereisonlythedance@reddit
I usually get 10-12 t/s on a 123B which is fine.
Beginning-Window-115@reddit
don't forget this sub has 1.1 million members
Spectrum1523@reddit
idk 4xB60 is realistic for at home if it actually runs it
TheBlueMatt@reddit
My 4x B60 running Unsloth's Q4_K_XL gets 232.45 ± 0.41 t/s in pp and 9.55 ± 0.05 tok/s in tg. Still a handful of patches left to improve it, though. In theory tg should be able to get up to 20 or so (25 is the theoretical max).
Few_Painter_5588@reddit
It costs around 70k in my local currency, so it's like about 3-4k dollars? But everything's overpriced down here, so it'd probably be less. And Mistral 2 large ran at about 10-15 tokens per second on that build, which was a decent speed. You can also get a 128GB mac that'd run this at around 10 tokens per second.
FullOf_Bad_Ideas@reddit
tensor parallel goes brrr
stoppableDissolution@reddit
The old Mistral Large was still a beast even in Q2. Dense models quantize much better than MoEs, and it needs 5x less memory just to fit and run it at all (even if way slower).
ortegaalfredo@reddit
3x3090 + EAGLE draft model should get you usable speeds.
q-admin007@reddit
Strix Halo with EAGLE draft model.
artisticMink@reddit
Anyone got it running? Trying inference with b8974 and the Unsloth Q4_XL quant, and it feels very underwhelming. But I figure I'm serving it wrong.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
logic_prevails@reddit
Bro i wish I could try this
Zestyclose-Ad-6147@reddit
Opensource medium?! 😮
soyalemujica@reddit
Curious if the small version can beat 27b
hurdurdur7@reddit
First attempts with Mistral Vibe - yeah, it works well enough.
mantafloppy@reddit
Guess we are trying a IQ2_M for the first time :D
MotokoAGI@reddit
The last few months have just been crazy! We haven't even gotten official support to run DeepSeekv4, MimoV2.5, Hy3-Preview, Ling, etc and now this?
kevinlch@reddit
Gemini is close to release too
jochenboele@reddit
It feels like they all waited on each other to release, I think that’s like nr 5 in 10 days
Turnip-itup@reddit
Why say “merged” model? That implies they used model diffing or something. Bad choice of words.
misha1350@reddit
Dense 128B????? Why???????????
RegularRecipe6175@reddit
11 t/s gen on 4x3090 on a new prompt with llama.cpp.
iamn0@reddit
what's the prompt processing speed at 32k (and 64k if you could test)? Thanks
RegularRecipe6175@reddit
What's a good prompt to test those conditions?
iamn0@reddit
This would be a good test: https://github.com/gkamradt/LLMTest_NeedleInAHaystack
RegularRecipe6175@reddit
Sorry, I don't have time to figure that out at the moment. If I run an extended test, I'll post the pp results.
fizzy1242@reddit
doh, now I need a fourth 3090!
RegularRecipe6175@reddit
I'm getting repetition with non-trivial prompts. This is on a llama.cpp build from minutes ago.
jacek2023@reddit (OP)
quant?
RegularRecipe6175@reddit
I edited my post to specify. Unsloth UD-Q4_K_X.
sebajun2@reddit
Any plans to officially quantize the model? Would be great to run on a local Spark machine at a lower quantization. Looks promising!
jacek2023@reddit (OP)
link to the GGUF is in the beginning of my post
sebajun2@reddit
Awesome just saw it - looks like all the quantized versions are available. Have my DGX10 arriving today, just in time. Might be able to get a 4-bit quantized model running pretty well.
mags0ft@reddit
They cooperate with NVIDIA and frequently release NVFP4 variants. Maybe that'll happen again...
TheBlueMatt@reddit
4x B60 can almost handle it. Unsloth's Q4_K_XL gets 232.45 ± 0.41 t/s in pp and 9.55 ± 0.05 tok/s in tg... almost usable...
uti24@reddit
E're we go!
sob727@reddit
Not sure if it's ok to ask here, but what's the best way to convert this to GGUF for use with llama.cpp?
jacek2023@reddit (OP)
there is a link to GGUF in the beginning of my post :)
converter is part of llama.cpp
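roughly, the usual flow looks like this (a sketch: the local paths are placeholders, and given the parser issues mentioned elsewhere in the thread the result may still misbehave):

```bash
# Convert the HF checkpoint to an F16 GGUF, then quantize with llama.cpp's tools.
python convert_hf_to_gguf.py ./Mistral-Medium-3.5-128B \
  --outtype f16 --outfile mistral-medium-3.5-128b-f16.gguf
./build/bin/llama-quantize mistral-medium-3.5-128b-f16.gguf \
  mistral-medium-3.5-128b-Q4_K_M.gguf Q4_K_M
```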
sob727@reddit
weird llama.cpp errors with
"operator(): unable to find tensor v.blk.0.attn_out.weight"
using the unsloth model
sob727@reddit
Just saw the link thanks, I'll download that.
Will also look into the converter, thanks
q8019222@reddit
That's exactly my running limit. I can run it in Q2.
Pretend_Engineer5951@reddit
Interesting how much tg would be at Q8 on strix halo :)
q-admin007@reddit
With the EAGLE draft model, i suspect around 40 to 50.
oxygen_addiction@reddit
More like 6t/s
Healthy-Nebula-3603@reddit
Very bad ... I assume 2 t/s
JacketHistorical2321@reddit
1-3 t/s ... Maybe. Strix Halo BW sucks, dude.
honglac3579@reddit
Nah, t/m i suppose
SnooPaintings8639@reddit
Probably you will have to infer the next token yourself!
spencer_kw@reddit
Another strong 128B is great for the ecosystem, but the real win is that routers like OpenRouter or herma now have one more option to pick from. Every time a new model drops at a different price point, automatic routing gets smarter. Competition on price is the one thing that actually helps users.
Technical-Earth-3254@reddit
Now that sounds powerful
waruby@reddit
Can't wait to run this bad boy on 3s/token on my Strix Halo.
kiwibonga@reddit
Sweet baby jesus, it's full of delicious melted goodness. It's going to make me buy hardware.
tmvr@reddit
Mistral Medium looking at the GPU poor users right now:
I'm in the corner, watching you infer, oh oh oh
And I'm right over here, why can't you see me? Oh oh oh
And I'm giving it my all
But I'm not the one you're downloading, oooh
I keep denseing on my own
vogelvogelvogelvogel@reddit
Love to see a new 120B-range model, especially dense.
Interested to see the benchmarks and real-life performance, especially coding.
mhl47@reddit
BrowseComp really shows you how far behind the previous models were. Hope this turns out to be usable. Seems they caught up a bit on agentic tasks.
Healthy-Nebula-3603@reddit
A 120B dense model?
Oh boy... even if you have enough VRAM, getting even 10 tokens/s is challenging for that size...
Fine_Nectarine9328@reddit
128B dense is crazzyyyy 1tk per day
Iory1998@reddit
😂
LoveMind_AI@reddit
THESE guys read the room.
mouseynaides@reddit
128B Dense?! Good god.
No_Algae1753@reddit
LETS FUCKING GO MISTRAL
artisticMink@reddit
Dense 128B, oh my. Chonker.
DragonfruitIll660@reddit
Ayyy lets go, another dense model.