mistralai/Mistral-Medium-3.5-128B · Hugging Face
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 299 comments
Mistral Medium 3.5 128B
Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessor Mistral Medium 3.1 and Magistral in Le Chat. It also replaces Devstral 2 in our coding agent Vibe. Concretely, expect better performance for instruct, reasoning and coding tasks in a new unified model compared with our previously released models.
Reasoning effort is configurable per request, so the same model can answer a quick chat reply or work through a complex agentic run. We trained the vision encoder from scratch to handle variable image sizes and aspect ratios.
Find more information on our blog.
TheWaffleKingg@reddit
Hey guys can I run this locally?
I have 4mb of ram and run with a core duo
Should work fine right?
Nobby_Binks@reddit
A token a day, but doable
Mart-McUH@reddit
I recommend asking only yes/no questions.
AlwaysLateToThaParty@reddit
And turn off thinking.
jacek2023@reddit (OP)
you need raspberry pi
onewheeldoin200@reddit
Yeah but the 5 tho not the 4
AlwaysLateToThaParty@reddit
Min 8GB model.
BaronRabban@reddit
Everyone should give this one a try:
https://huggingface.co/RecViking/Mistral-Medium-3.5-128B-NVFP4
It definitely works and is not brain dead. This model is not a lost cause! Almost all the quants don't work, but I can confirm this one does.
dumeheyeintellectual@reddit
The B number, considering I'm a D-grade brain at best: is there a B number at or below which an RTX 4090 will automatically be enough?
mhphilip@reddit
Anyone running this on a rtx 6000 pro?
quantier@reddit
Looking forward to the NVFP4 version 😍
AutonomousHangOver@reddit
2xRTX6000 Pro 262144 context size:
unsloth's quant
pp: 1100 t/s on an ~500-token test prompt (create a 3D spinning glass dodecahedron with inner light and orbiting lights, etc.)
And... it went berserk a second ago, looping all over again after ~1k tokens, on the newest build of llama.cpp
kaisurniwurer@reddit
Dense model looping? That sounds a bit worrying.
AutonomousHangOver@reddit
It's a llama.cpp issue. Unsloth removed gguf files as there were some problems, even with FP16.
I'm running it today on vLLM (0.21 dev nightly) with the EAGLE draft model.
vLLM logs show a very high draft acceptance ratio:
(APIServer pid=8762) INFO 04-30 11:59:33 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.7%, Prefix cache hit rate: 0.2%
(APIServer pid=8762) INFO 04-30 11:59:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.76, Accepted throughput: 40.30 tokens/s, Drafted throughput: 43.80 tokens/s, Accepted: 403 tokens, Drafted: 438 tokens, Per-position acceptance rate: 0.973, 0.932, 0.856, Avg Draft acceptance rate: 92.0%
Model is usable and seems pretty nice, but I don't have full tests finished.
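For anyone who wants to replicate this, the launch looks roughly like the following. This is a sketch, not my exact command: the --speculative-config keys and the method string depend on your vLLM version, so treat them as assumptions and check the docs for your build.

```bash
# Rough sketch: serve Mistral Medium 3.5 with the official EAGLE draft for speculative decoding.
# Assumptions: a recent vLLM that accepts --speculative-config as JSON; the method may need to
# be "eagle" or "eagle3" depending on the draft architecture and vLLM version.
vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --speculative-config '{"method": "eagle3", "model": "mistralai/Mistral-Medium-3.5-128B-EAGLE", "num_speculative_tokens": 3}'
```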
AutonomousHangOver@reddit
I'll continue by answering myself 😉
vLLM with the mistral tool-call parser is hitting a known error:
For now I've turned streaming off and I'm able to use e.g. Roo Code.
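In case it helps anyone else, the server side looks roughly like this. It's a sketch using vLLM's standard Mistral tool-calling flags (double-check them against your version); the streaming workaround itself is purely client-side, e.g. in Roo Code's settings.

```bash
# Rough sketch: Mistral-style tool calling in vLLM. The known error above shows up with
# streaming tool calls, so streaming stays disabled on the client for now.
vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tokenizer-mode mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```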
MiuraDude@reddit
If this is actually Sonnet level I love it!
Healthy-Nebula-3603@reddit
Sonnet 4.5 ..... that's old model
Qwen 3.6 27b easily beats that old model
VEHICOULE@reddit
Qwen models are bad in real-world use cases..
Healthy-Nebula-3603@reddit
Tell me you didn't use qwen 3.6 27b with opencode without telling me.
Artistic_Okra7288@reddit
Qwen3.6-27B is actually really good on real world use cases in my experience.
SKirby00@reddit
Only if you're comparing them against Sonnet or other models that are wayyy bigger. The Qwen 27B models are actually great in real world use for their size.
szansky@reddit
How about coding with it?
JLeonsarmiento@reddit
If I quantize this to 1 bit I can make it run on my machine…
BitGreen1270@reddit
/me looking for a 0.1-bit one that can run on a potato
JLeonsarmiento@reddit
🥔🤝💾
FullOf_Bad_Ideas@reddit
1.4bpw Mistral Large 123B was coherent, will that fit your machine?
IvGranite@reddit
DENSE
molbal@reddit
patricious@reddit
Denser than a snickers bar, I tell you that much.
hurdurdur7@reddit
this model is definitely thicker than a bowl of oatmeal ..
rpkarma@reddit
points and nods
Lissanro@reddit
This makes me feel nostalgic, because in the past the dense Mistral Large 123B was my most used model for a while. Then there were DeepSeek R1 and V3, later followed by the Kimi models, so it has been some time since I ran Mistral models. I will definitely give this new Medium 128B a try; it would be interesting to see how much progress Mistral has made by trying it on my actual use cases.
One more notable thing: they released a model for speculative decoding: https://huggingface.co/mistralai/Mistral-Medium-3.5-128B-EAGLE . This is great to see, because in the past, one of the big issues with Mistral Large 123B used to be that I had to use a mismatched Mistral 7B model for drafting, and even that gave a decent performance boost. Even though EAGLE is not supported in llama.cpp yet, this comment from about 3 weeks ago sounds encouraging that it may be available soon:
coder543@reddit
Unfortunately, no PR for that API refactoring has even been published, so... who knows if/when it will happen.
Supporting any one of EAGLE-3, MTP, or DFLASH would be a game changer for llama.cpp. I wish better specdec were being treated as the highest priority thing to develop in llama.cpp.
Nindaleth@reddit
I consider this PR to be relevant: https://github.com/ggml-org/llama.cpp/pull/22397 But he has several spec-related PRs going on, maybe it's a piece-by-piece effort.
coder543@reddit
Unfortunately, I can't even find a PR for that API refactoring, so... who knows if/when it will happen.
valtor2@reddit
What about at 10k context?
a_beautiful_rhind@reddit
The old Devstral 123B at Q4_K_XL:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 10240 | 1.641 | 624.04 | 9.690 | 26.42 |
| 1024 | 256 | 11264 | 1.656 | 618.42 | 9.785 | 26.16 |
| 1024 | 256 | 12288 | 1.674 | 611.71 | 9.889 | 25.89 |
| 1024 | 256 | 13312 | 1.688 | 606.62 | 9.994 | 25.61 |

This one should be similar. 4x 3090.
IvGranite@reddit
I ain't got that kinda time lol
valtor2@reddit
haha had to try 😇
temperature_5@reddit
Ugh, this means I'll get 1.6 t/s on 890m. But still, might be worth it on occasion if it's really smart!
TripleSecretSquirrel@reddit
That’s honestly better generation speed than I expected!
edsonmedina@reddit
I'm also on Strix Halo (128Gb) but the model fails to load (IQ4_NL)
harpysichordist@reddit
And he never reported back.
Creepy-Bell-4527@reddit
Similarly useless results on M3 Ultra. Needless to say I don’t think this model is for us 😉
exaknight21@reddit
Dense days return.
po_stulate@reddit
Waiting for dflash
Freonr2@reddit
128B THICC BOI
LetsGoBrandon4256@reddit
What a chonker.
grumd@reddit
128B dense is an interesting niche
TripleSecretSquirrel@reddit
Honestly just feels like it might be ahead of its time. I think once the next generation of GPUs come out — AMD will probably release AI Pro cards with 48GB and maybe even 96GB VRAM, and god willing, Intel will fix their driver issues making their cards more viable and offer bigger VRAM options — this model size might be the sweet spot for the high end of local inference.
Right now, for those of us with higher end consumer hardware (i.e., 32GB VRAM), Qwen 3.6:27B is basically the gold standard. You can run it at 4-bit precision with full context. Smaller models are getting better and better, but all else being equal, more parameters are pretty much always going to mean better output.
So I’m imagining that like next year, the bleeding edge of local inference will be models in the 80B-120B range instead of the ~30B dominance we’re seeing now.
HiddenoO@reddit
Why would they with current memory pricing? The optimal strategy for them is to allocate something like 90% of the memory to datacentres and then put the remaining 10% into relatively low-memory gaming GPUs so they don't lose brand awareness.
Just having the memory doesn't mean you can actually run them fast enough to be useful in practice.
nebteb2@reddit
W7800 48gb ai pro already exists lol
TripleSecretSquirrel@reddit
True, but I’m thinking RDNA4 platform.
falcongsr@reddit
My work gave me enough budget to buy a Mac Studio and I could only order the M3 Ultra with 96GB. I guess I could try this.
ahh1258@reddit
Yall hiring? 🤣
dtdisapointingresult@reddit
copium.jpeg.png
grumd@reddit
I wouldn't bet on companies releasing GPUs with more memory when memory tripled in price and is fully sold out
BubrivKo@reddit
Yup, I was so tired of these MoE models. They are not bad, but the ones with very few active parameters are actually stupid and not that useful. Sadly, I cannot run this Mistral, but I'm still happy that someone is still working on dense models, which are superior!
grumd@reddit
I think moe models are the future unfortunately, simply to crunch more knowledge into the model while not destroying the speed. The only mistake is making the active params count too low. Something like A30B is probably enough for it to not feel dumb. Even Qwen 122B A10B has been great for me locally
NandaVegg@reddit
I haven't had a chance to try this model yet, but generally a MoE with reasonably sized shared weights/activated parameters has a significant advantage over a large dense model, as LLM activation is mostly noise, which is empirically just bad rather than something useful (naively upping the number of active experts for an existing MoE model simply makes the model worse, low-pass-filter-type gating techniques work well with LLMs, etc.). The "partitioning" done by the MoE architecture works to filter out that noise.
A remarkable advantage of a large dense model usually comes with the large hidden dim (GPT-3 DaVinci was 175B with a 12288 hidden dim IIRC? Llama-3 405B is 16384, Mistral Medium 3.5 is also 12288), which can distinguish and partition extremely close features that would otherwise overlap in, say, a 5120-hidden-dim model, and (hopefully) keep them separated through the layers. That also means the (large hidden dim) model could place Paris and London (along with related things like Toulouse and Brighton) at polar opposite ends of the latent space if there is enough evidence in the datasets to do so. I'm not sure if that is good or bad. Good old GPT-3 DaVinci had a feel that the inference path diverges really hard (the model goes from one mode to another; by today's standards that would at least mean base/instruction/single-turn reasoning/terminal-agentic modes) off just one token.
For creativity and generalization on hard problems, one would generally want more layers rather than a larger hidden dim within the same parameter count, unless the hidden dim becomes too small to create meaningful basins.
toothpastespiders@reddit
I mourn GLM Air every day.
BubrivKo@reddit
Or why not 1T + 100B active 😃
Caffdy@reddit
we already got 1T + A40B~ models
AltruisticList6000@reddit
Yeah, I can't run big models like this, but I was thinking: what if, for example, there was something like a 35B MoE but with 9-10B active? That could spill over into RAM but would still have okay speed, and would probably be smarter and more knowledgeable than 12-14B dense models on the same hardware with barely any speed difference. Or they could just do 20-24B dense models like Mistral, which are still more intelligent in some ways than the 3B-active MoEs I tried, which don't feel smarter than 9B dense models.
Ardalok@reddit
Yeah, I get 25-30 tokens/s on 32GB DDR5 and an RTX 4060 with the 35B Qwen in Q4; it would be nice to have a smarter model at a slightly lower speed.
TokenRingAI@reddit
I think the engram method is the future, with small dense models retrieving information from slow storage.
Equivalent-Freedom92@reddit
It's pretty close to the upper limit of what's possible with consumer hardware. 4x 3090 for 96GB of VRAM and such, using a board like the ASUS ProArt with 3x 16x PCIe slots and then converting one of the M.2 NVMe slots into an additional 16x slot with an adapter. For inference, the lane bandwidth won't completely choke out quite yet with such a setup.
Freonr2@reddit
The ROMED8-2T beckons you.
Makers7886@reddit
I have two of these and they're my top recommendation... or they were until earlier today: I googled them and no retailers had them, and used ones on eBay were going for 3x what I got mine for ($1600+). Crazy times.
Prudent-Ad4509@reddit
If you need more than 2 GPUs on such boards, you are better off with a PCIe switch.
Real_Ebb_7417@reddit
A "you can run it locally, but you won't like the experience" niche 😂
But I'm happy to see them make a dense model; they have experience with them already, so hopefully this one will fare much better against similar-sized models than Mistral Small 4 did.
Sunija_Dev@reddit
For roleplay/writing, you can run it at home for ~1200€.
For that money you get 2x 3090, so you can run IQ2_M at ~5 tok/s. Since you probably already have a GPU, you can also run a bigger quant. In my experience, even the old Mistral 123B knocks everything out of the park at that size (for writing).
...and that is probably the best affordable thing you can run at home? MoEs get better at ~400B params, but the RAM is probably crazy expensive. Not sure about the speed.
__some__guy@reddit
Used 3090s are about 1000€ now (at least in Germany).
FullOf_Bad_Ideas@reddit
EXL3 is great for dense 120B Mistrals. 2.5bpw quants are actually pretty good.
The exllamav3 author got coherent output from 1.4bpw Mistral Large 123B, so 2.5bpw is plenty, and it should be better than GGUF at this size. It also supports tensor parallel, so it's pretty fast.
q-admin007@reddit
It comes with a bespoke draft model. Could be faster than Qwen 3.6 27b in the end.
Real_Ebb_7417@reddit
Does speculative decoding work well in llama.cpp though? (Serious question, didn’t test it so far)
dtdisapointingresult@reddit
There are various types of speculative decoding. llama.cpp doesn't support MTP or EAGLE-3, which is what the AI labs usually provide. For example, Qwen and GLM models have MTP, while this Mistral Medium release has an EAGLE-3 draft from Mistral.
With llama.cpp your only option is the more ghetto solution of using a small draft model, but finding a compatible model is a bitch. It's easy when they have the same tokenizer/vocabulary, for example using Gemma 3 to boost Gemma 4. You get a major speed boost; you might get double speed for free. But idk what you could use as a draft model for this Mistral Medium release.
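For reference, the ghetto version is wired up like this in llama.cpp. This is just a sketch: the draft GGUF name below is a placeholder, since finding a vocab-compatible draft for this release is exactly the open question.

```bash
# Classic draft-model speculation in llama.cpp (not EAGLE/MTP).
# -md/--model-draft must share the target model's tokenizer/vocabulary;
# the draft filename here is a placeholder, not a real release.
llama-server \
  -m Mistral-Medium-3.5-128B-Q4_K_M.gguf \
  -md placeholder-small-mistral-draft.gguf \
  --draft-max 16 --draft-min 1 \
  -ngl 99 -ngld 99
```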
RoomyRoots@reddit
There is a comment on a different reply. TL;DR, not yet.
coder543@reddit
Qwen3.6 27B has MTP built-in and DFlash support... don't see how Mistral Medium 3.5 could ever be faster just because of an EAGLE-3.
FullOf_Bad_Ideas@reddit
dense models run fast with tensor parallel. I had 16 or 20 t/s with Devstral 123B IIRC and I have 11 t/s with Hermes 405B.
I like dense models like that more than I like 1T MoEs that I have no memory for.
More-Curious816@reddit
They're probably targeting the middle-class to upper-middle-class GPU owners, not the GPU-poor class like us.
Herr_Drosselmeyer@reddit
It won't run well on anything below a 6000 PRO. I don't consider that middle class.
stoppableDissolution@reddit
The old Large in Q2 was beating everything else you could run in 48GB (2x3090) up until Gemma 4 31B got released, idk. We'll see how this one holds up.
jochenboele@reddit
This made me laugh out loud 🤣 Thanks
dtdisapointingresult@reddit
For most people, this model will only be used for writing, with reasoning disabled, at Q4. But the good part is that you don't really need more than 5 tok/sec for this kind of task.
Late-Assignment8482@reddit
Valid. The average adult reads somewhere in the 8-12 tokens/second (AKA 4-6 words) range, and as someone who writes a LOT of prose, I'd rarely need something to write faster than me. Creating copy is just not the same task as creating code. A model that takes five minutes but then doesn't need four more tries, just a read and an edit because it's mostly well written, is superior to one that took 10s to generate it.
Healthy-Nebula-3603@reddit
You're serious?
With 5 t/s that model is useless.
The only thing you can do with it is use it for simple chat conversations.
po_stulate@reddit
I think they're betting on speculative decodings like dflash. If a huge dense model like this is actually smart and can still have a good speed with a speculative decoder like dflash, then that's the biggest win.
Healthy-Nebula-3603@reddit
That's only in theory.
I've tested speculative decoding many times.
It is hard to get even more than 0.5 acceptance with such a small draft model, like 3.5 GB (FP16, so ~2B parameters?).
dtdisapointingresult@reddit
0.5 acceptance is a huge speed boost. With num_speculative_tokens=3 I get almost a double speedup with prompts like 'write me an essay about XYZ.'
But even 5 t/s is good enough. If you're asking it to create the occasional Instagram post for you, to write an email, or to help you with the story of the game you're developing, speed doesn't matter unless you have a massive pipeline. 5 tokens/sec is only slightly slower than the average person's reading speed.
The only reason we think it's slow is because agentic coding needs to generate like 5k tokens of reasoning + 2k of source code on every turn. This is not necessary for writing. If you want an Instagram post, it only takes 20-30 seconds.
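Back-of-the-envelope, using the textbook simplification of an independent per-token acceptance rate α and draft length γ (a simplifying assumption, real acceptance isn't i.i.d.): expected tokens per target-model pass ≈ (1 − α^(γ+1)) / (1 − α). With α = 0.5 and γ = 3 that's (1 − 0.0625) / 0.5 ≈ 1.9 tokens accepted per verification pass, i.e. close to half as many target forward passes per generated token if the draft is cheap, which is roughly where the "almost double" comes from.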
Healthy-Nebula-3603@reddit
0.5 acceptance from 16 proposed tokens ....
dtdisapointingresult@reddit
Well, I'm not an expert on that stuff. I experimented with different num_speculative_tokens values and found I got the fastest speedup with tokens=3. Even with a lower mean acceptance than you, this meant a huge speed boost. I'm talking double speed if I'm using MTP on Qwen, and a 50% speed boost using Eagle3 on Gemma 4 31B (it doesn't support MTP).
Don't stick to defaults without doing some quick experiments; I think it varies based on the compute of your hardware. Also try the same type of prompt you'd use in reality (for example, Python coding).
Healthy-Nebula-3603@reddit
Wait... did you say you're using a Qwen 3 draft on Gemma 4?
You know it doesn't work like that?
dtdisapointingresult@reddit
No.
LegacyRemaster@reddit
waiting for qwen 3.6 122b...
Nobby_Binks@reddit
Yes, and 397B
Freonr2@reddit
Right, even on an RTX 6000 that's probably <10 t/s based on what I get with other models.
reto-wyss@reddit
I'm getting 16 tg/s at around 10k tokens deep, but that's without MTP, on 2x Pro 6000. There appears to be something wrong with the KV-cache calculation in the vLLM nightly, though. PP may be over 1k/s, but I haven't run any real tests because of the KV-cache thingy.
FullOf_Bad_Ideas@reddit
2x RTX 6000 Pro with tensor parallel?
Healthy-Nebula-3603@reddit
Yes
That 120B dense model is practically useless at such speeds... And you still need an insanely expensive card... And it's still useless.
The only useful use case is simple chat.
Longjumping-Boot1886@reddit
For a MacBook with 128GB of VRAM, I think.
CYTR_@reddit
At 5tk/s lol.
q-admin007@reddit
50 t/s with EAGLE drafter. Do you still live in 2024?
CYTR_@reddit
You can't do everything with a draft, especially if the request is complex and involves lengthy contexts. I even linked the draft as soon as I saw it. There's no need to be unpleasant in your reply.
Healthy-Nebula-3603@reddit
Even 5 t/s will be challenging....
Eyelbee@reddit
I like the choice but it doesn't seem to deliver the performance
ambient_temp_xeno@reddit
124B Gemma niche.
alex_bit_@reddit
How many 3090s to run this thing?
InstaMatic80@reddit
Too big for my 3090 😅 Waiting for a 27B version
overand@reddit
Just don't forget to check out the "club 3090" https://github.com/noonghunna/club-3090 - it's a llama.cpp and/or vllm setup that gets surprisingly good speed for Qwen3.6-27b from one or two 3090s. With my 2x 3090 setup, I went from ~27 t/s to ~80 t/s. It's pretty wild.
InstaMatic80@reddit
😳 I’ll definitely take a look
silenceimpaired@reddit
If they released an MoE at this size it would be cozy for those with the RAM
reto-wyss@reddit
Qwen 27b, who is the densest now?
zenmagnets@reddit
Unfortunately Qwen3.6 27b is still the smarter model. It matches Mehstral 3.5 on SWE Verified, but the 27B is better at BrowseComp and agentic tasks.
Clank75@reddit
My experience of Qwen3.6 so far is that it is undoubtedly, by far, the most adept model at flat-out lying and fabrication I've ever used. It's genuinely remarkable - it hallucinates toolcalls it never made, it invents entirely false references that it never looked up, and perhaps most novel - when called out, it doubles down and creates more lies to cover its tracks.
It's certainly smart. Completely useless and untrustworthy, but smart. I predict a career in politics for Qwen3.6.
AndThenFlashlights@reddit
What quant are you running at? I haven't used 3.6 heavily yet, but 3 and 3.5 have been generally honest with me - frankly, hallucinating less than Claude and ChatGPT at certain focused coding tasks. I have seen issues where 3.6 will think and argue with itself (for a LONG time) if it's not sure about something, or if the context isn't clear about something.
Clank75@reddit
Unsloth's Q6. And yeah, 3.5 - while in general more prone to confabulation than other models - is not nearly as bad as 3.6 imx.
It's genuinely bad enough that I don't think it's a useful model for my purposes. You should never trust an AI, of course, but 3.6 seems to go up a notch into pathological lying.
AndThenFlashlights@reddit
Oof, that's annoying. And Q6 shouldn't be introducing oddness on its own.
Yeah that's really disappointing. Qwen3-30b-a3b was my daily driver for a long time because it lied the least out of all the similar models I used.
SqueakySquak@reddit
Color me surprised when, in the middle of a coding session, Qwen 3.6 27B tried to call the MCP tool to read my Gmail... When called out on it, it pretended it was a mistake and not trying to spy on me or anything. I'm running the model unquantized BTW. Apart from its spying tendencies it's actually pretty good at coding. But I wouldn't trust it, and certainly not unsupervised!
Agreeable_System_785@reddit
For your use case. As a European, I gladly embrace the work of Mistral. LLMs have more use cases besides coding. Still, you seem to have thoroughly tested this model already.
FullOf_Bad_Ideas@reddit
llama 3.1 405b
find me a denser model, I'll wait.
pkmxtw@reddit
Does franken-self-merge of L3.1 405B with like 1T dense parameters count?
https://huggingface.co/mlabonne/BigLlama-3.1-1T-Instruct
FullOf_Bad_Ideas@reddit
I guess so, if it produces coherent text.
Maleficent-Ad5999@reddit
Qwen has its sibling too, right? The 122B one?
JaredsBored@reddit
That's an MoE, with only 10B active per token. This is 128/128B active every token.
Maleficent-Ad5999@reddit
Ah right. Totally forgot. Thanks
Upstairs_Tie_7855@reddit
What about Gemma 4 31b?
Key_Papaya2972@reddit
first glance: another ~120B, nice, let's see what the active param count is.
second glance: 128B what?
jacek2023@reddit (OP)
MotokoAGI@reddit
119B was trash and 123B wasn't too bad. Glad to see this looks solid. I wish they compared it to a similar-sized model like Qwen 122B.
jacek2023@reddit (OP)
Qwen 122B is MoE, number of active parameters is totally different
overand@reddit
It doesn't look like there have been any dense model releases in this size range since March 2025 (CohereLabs/c4ai-command-a-03-2025), and before that it was 2024:
FullOf_Bad_Ideas@reddit
you forgot Devstral 2 123B.
Hermes 4 405B finetune released on 2025-08-26 too.
overand@reddit
Ah - yeah, if I'm going to include the two versions of Mistral-Large-Instruct, I definitely should have included those. The former is based on Mistral-Large-Instruct, right, and the latter on Llama-4?
FullOf_Bad_Ideas@reddit
Devstral 2 Instruct 123B is a different architecture and it has different tokenizer, it's most likely not the same base.
Hermes 4 405B is a llama 3.1 405B finetune.
overand@reddit
Yeah. Dense vs MoE is pretty significant.
mantafloppy@reddit
And already at re-release 2, of course...
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF/discussions/1
BaronRabban@reddit
They appear to have just taken down all of the ggufs
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF/tree/main
mantafloppy@reddit
Aiming for a 3rd re-release under 4h.
I need to start taking notes; might be a new record.
It's not like they would have known it wasn't loading if they had tested it...
But they were first, that's what's important.
Makers7886@reddit
fool me once
yoracale@reddit
We're working with Mistral on this, but it seems through further testing that the GGUF implementation needs more investigation. Prompting the model works the first few times, but afterwards it doesn't work properly. Mistral has now labelled GGUF implementations as a WIP. Seems most likely to be a parser issue.
tarruda@reddit
They deleted the repo, likely to erase discussion history. They could have just re-uploaded it.
yoracale@reddit
Not to erase discussion history lol. We're working with Mistral on this, but it seems through further testing that the GGUF implementation needs more investigation. Prompting the model works the first few times, but afterwards it doesn't work properly. Mistral has now labelled GGUF implementations as a WIP. Seems most likely to be a parser issue.
DJTsuckedoffClinton@reddit
i miss the old mistral
jacek2023@reddit (OP)
what does it mean?
Monad_Maya@reddit
He's a Kanye fan.
ttkciar@reddit
This is great news! Looking forward to giving it a try.
Devstral 2 Large was a huge disappointment, but hopefully MistralAI has learned from their past mistakes and cooked up this 128B right. Maybe this will finally be the 120B-class model which knocks GLM-4.5-Air off its perch?
ROS_SDN@reddit
It's a very niche model since it's fully dense.
It and GLM-4.5-Air don't really compare: one can tolerate hybrid inference, the other can't.
This is an entirely different beast at this level of density. It needs to absolutely cook to be worth the resources needed to run it interactively with a user, or to still really, really cook to be worth running back-end batch jobs on it at a crawl.
Very few local people will be able to use this as a chat agent; it's really an RTX 6000 Pro+ model.
GLM 4.5 Air, by contrast, you could respectably work with on 24-48GB VRAM + 64GB DDR5.
ttkciar@reddit
I am quite aware of the differences between MoE and dense.
The fact remains that GLM-4.5-Air is more competent at codegen than every 120B-class model I've tried, including Devstral 2 Large (which is 123B dense), at least for the specific codegen skills that matter most to me.
GLM-4.5-Air is quite weak at function-calling, but superior at instruction-following compared to GPT-OSS-120B, Qwen3.5-122B-A10B, and Devstral 2 Large. It is more prone to inferring bugs, but hallucinates less and is less prone to inferring design flaws. That's made it my go-to.
Every time a new 120B-class codegen model pops up, I think "surely this one will beat Air", but so far they just haven't, despite Air "only" having 106B total parameters.
I've downloaded Mistral Medium 3.5 128B, but won't be running it through my codegen tests until tonight. Maybe it will be "the one".
ROS_SDN@reddit
Pray for qwen3.6 122b then to hit the mark.
billy_booboo@reddit
Or perhaps not
rebelSun25@reddit
In before "Guys, can I run this on my single RTX 3060 ?"
We've all been there. And no, you can't. It's a chungus of a model.
cutebluedragongirl@reddit
Yeah... shit is fucked.
lolidkwtfrofl@reddit
How am I looking with my 4070ti? Much better odds right? Right?
cries in corner while looking at a RTX PRO 6000
one day, my love, one day.
__some__guy@reddit
I wonder if it can still compete with the older Mistral models in terms of creativity.
Its description doesn't sound promising.
Either way, nice to have a new large and dense model that can still be run locally with a reasonable setup.
mantafloppy@reddit
Backup of the IQ2_M if anyone wants to play with it before the Unsloth re-upload.
https://huggingface.co/mantafloppy/Mistral-Medium-3.5-128B-GGUF
BaronRabban@reddit
I just tried the Bartowski quant in llama.cpp and it is also brain damaged.
And even the vLLM nightly doesn't seem able to load GGUFs for this yet, so I am at an impasse.
ValueError: GGUF model with architecture mistral3 is not supported yet.
mantafloppy@reddit
I have not tried Bartowski, but the Unsloth version at IQ2_M gave me one answer, and the beginning of another before looping.
https://hastebin.com/share/okezetehol.xml
But it was loading in Lm Studio.
CYTR_@reddit
There is a draft : https://huggingface.co/mistralai/Mistral-Medium-3.5-128B-EAGLE
Zc5Gwu@reddit
What kind of speed increase do you think you could see I wonder…
DinoAmino@reddit
SD with EAGLE drafts is great when generating code. Often double the tps, or more. But it doesn't do very well with regular text generation. Sometimes it drops to half the normal tps. It would be great if it were possible to enable/disable SD per request for specific tasks.
simracerman@reddit
Someone posted the Strix Halo numbers for Q4 at 3.5tps. Double is still horrible at 7tps.
q-admin007@reddit
Waiting for GGUF! Should fly on my Strix Halo.
rangorn@reddit
I consider myself quite dense; I had no idea it had become quite the compliment.
Limp_Classroom_2645@reddit
I don't appreciate dense models anymore
_ballzdeep_@reddit
Why did everyone stop pushing 70B models?
Reddit_User_Original@reddit
I want to see more in the 35 to 40B range personally. That is the sweet spot for 64GB unified memory.
simracerman@reddit
How sweet is it at 5 t/s?
Reddit_User_Original@reddit
Breh wut??? I believe I'm getting about 50 t/s with qwen3.6 35b moe
simracerman@reddit
Wait, in the context of your reply I thought you meant dense. Ofc, would love to see more MoE competition in that range. The real sweet spot is the 80B with 8B active from Qwen.
FullOf_Bad_Ideas@reddit
you'd want to see more dense 70B models?
Would you run them if they came out?
_-_David@reddit
Yes!
RetiredApostle@reddit
For a Llama anniversary, Meta probably will.
ortegaalfredo@reddit
Too expensive to train. About the same as a 400B model IIRC.
Service-Kitchen@reddit
literally
Local_Phenomenon@reddit
My Man!
funding__secured@reddit
What happened to the models? Everything has been pulled.
BaronRabban@reddit
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B/discussions/2#69f26fcce689cdc885c82a2f
BaronRabban@reddit
I don't know what's wrong, but it feels brain damaged to me. Running the Q6 on five 3090s. Latest llama.cpp, which I just rebuilt. Something is definitely off, but not sure what. Just feels like brain damage.
tarruda@reddit
If it is unsloth gguf, I'd wait a few weeks before trying the weights.
But also, I no longer have high expectations with mistral models.
yoracale@reddit
It's most likely a GGUF parser issue, not the model or the quant algorithm. Unfortunately, no matter what GGUF you create with the model, it doesn't function properly, most likely due to the parser.
IrisColt@reddit
hmm...
yoracale@reddit
We’re working with Mistral on this, but further testing suggests the GGUF implementation needs more investigation. The model responds correctly to the first few prompts, but then begins behaving improperly. Mistral has now labeled GGUF implementations as a work in progress, and this appears most likely to be a parser issue.
Affectionate-Cap-600@reddit
From a fast reading of the config file, it seems to be a pure global softmax attention model... I mean, it doesn't seem to use sliding window in any of the layers.
Quite rare nowadays; even non-hybrid models use some kind of sliding window or sparse attention in some layers... those are 88 layers of pure attention. Also ~10k+ hidden size and ~20k+ MLP intermediate size.
Interesting for sure... we needed a model like that.
I assume they spent quite a lot training it. The memory footprint at 256k context will be crazy.
We will see if they release a report.
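If anyone wants to check the same fields, they can be pulled straight from the repo's config.json. This is a sketch assuming the usual transformers field names and a local copy of the weights; adjust the path and key names to what's actually in the file.

```bash
# Dump the attention/width-related fields; a null or missing sliding_window
# would mean full global attention in every layer.
jq '{num_hidden_layers, hidden_size, intermediate_size, num_attention_heads, sliding_window, max_position_embeddings}' \
  ./Mistral-Medium-3.5-128B/config.json
```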
alberto_467@reddit
Thank you for noticing that, it is really an interesting choice to not use any hybrid sparse layers.
seconDisteen@reddit
What a pleasant surprise!
As someone who only really uses local LLMs for creative writing/RP, I am still using Mistral Large 2 123B, mostly the Behemoth finetunes. Ever since things have shifted to fully MoE with a focus on coding, there hasn't been much I've gotten excited about. Yes, even many smaller MoE models are smart and can do creative writing, but with smaller active parameters they often don't have really dense knowledge on fandoms and other things I like to explore. Granted, ML2 still does almost everything I want it to, is pretty smart, and knows a lot about a lot, so I haven't really been griping for a new dense model, but I'll sure as hell take one! I can't wait to try this thing out!
IrisColt@reddit
Interesting... can't you just feed the relevant bits and cross your fingers? Gemma 4 excels at this.
seconDisteen@reddit
You can, and I have done that with a number of the newer, smaller models, but it can become tedious really quickly. Especially if you're working with a really expansive IP, like Harry Potter or Marvel. It's nice that dense models already have so much of the lore baked in, and you only need to tweak it with context. Even Miqu 70B, going back, what... 2.5 years now? was really dense with pop culture knowledge. With these newer, smaller models you have to do a lot of the heavy lifting in context, which is not only tedious but eats up context, especially as the story drags on. Not only that, but I've found that the smaller MoE models aren't nearly as good at tying the entire story together. If you have a 40k-context story, with multiple scenes, characters, and hooks, I find smaller models aren't as good at taking everything into account as things progress.
I have gotten some decent results with some of the newer stuff, particularly with GLM Air. But I felt it was only about as good as the results I get from Behemoth/ML2 when dealing with creative writing in large fandoms, despite being much newer with a better architecture. Yes, it's faster, but it requires more work to get more or less the same results.
jacek2023@reddit (OP)
oxygen_addiction@reddit
Wow, those BrowseComp numbers are horrendous.
Dany0@reddit
idk man, beating Sonnet 4.5 with 122B is fine imo. Sonnet models are likely in the 500-1000B range.
IrisColt@reddit
oof.gif
sterby92@reddit
So Qwen3.6-35B and 27B crush it with way less compute? 🤔
disgruntledempanada@reddit
That's not Qwen 3.6 35b unless you are referencing another benchmark.
sterby92@reddit
yeah, not in this benchmark. But qwen3.6 35b / 27b are around the quality of qwen3.5-397 in a lot of benchmarks.https://artificialanalysis.ai/?models=gpt-oss-120b%2Cgpt-oss-20b%2Cgpt-5-5%2Cgpt-5-4-mini%2Cgpt-5-4%2Cgpt-5-4-pro%2Cmuse-spark%2Cgemma-4-31b%2Cgemini-3-1-pro-preview%2Cgemini-3-flash-reasoning%2Cclaude-opus-4-7%2Cclaude-4-5-haiku-reasoning%2Cclaude-sonnet-4-6-adaptive%2Cmistral-small-4%2Cdeepseek-v4-pro%2Cdeepseek-v3-2-reasoning%2Cdeepseek-v4-flash%2Cgrok-4-20%2Cnova-2-0-pro-reasoning-medium%2Csolar-pro-3%2Cminimax-m2-7%2Cnvidia-nemotron-3-super-120b-a12b%2Ckimi-k2-6%2Cmimo-v2-5-pro%2Ck2-think-v2%2Cglm-5-1%2Cqwen3-5-397b-a17b%2Cqwen3-6-35b-a3b%2Cqwen3-6-27b%2Cqwen3-6-max
Dabalam@reddit
You can't exactly generalise in that way since different benchmarks measure different things and not all models are compared on the same benchmark. That said, if you look up the SWE verified leaderboard you can see this is slightly behind GLM-5 and Gemini Flash on this particular benchmark, and ahead of Qwen 27B, Kimi K2.5, and Qwen3.5 397B.
jacek2023@reddit (OP)
what do you mean?
RandumbRedditor1000@reddit
I really hate this. Give me back my 24B dense small model
jacek2023@reddit (OP)
You can download 24B models as many times as you want
atape_1@reddit
There we go, there is the big announcement.
RandumbRedditor1000@reddit
SWE is benchmaxxed unfortunately
arkuto@reddit
So basically it's a MoE with structure 128B-A128B. Nice.
Conscious_Cut_6144@reddit
First good Mistral model (for my usecases) in a hot minute.
Nice!
IrisColt@reddit
What do the benchmarks show?
Academic-Map268@reddit
So is this thing better than Mistral 3 Large 2512? (675B MoE)
fluffywuffie90210@reddit
18 tokens a second on 3x 5090 with Q4_XL. I just saw they took down the GGUF, so there must be some issue with it. I get 100 with Qwen 122B because it's a MoE, but hopefully the intelligence gain might be worth it.
PANIC_EXCEPTION@reddit
Now quantize this to 1.58 or 1-bit.
claykos@reddit
i dont know what to say .....
LosEagle@reddit
1 t/m here i come
jacek2023@reddit (OP)
I’m still unable to buy a fourth 3090, and this is exactly the moment when I need one.
Healthy-Nebula-3603@reddit
Even with four of those cards you won't get more than 6-7 t/s.
A 120B dense model is useless at home for real work, except simple chat.
TacGibs@reddit
Yet more BS: I was getting around 25 tok/s for the previous dense model with 4x RTX 3090, a Q4 quant, and ik_llama.cpp with graph mode.
Infantryman1977@reddit
Why not 50 tok/s using vLLM? llama.cpp, ollama, etc. are 50% slower for tensor parallelism. You can even get an extra 10% if using a custom P2P kernel.
jacek2023@reddit (OP)
I am trying to use vllm to run gemma 4 31B, any tips how to use 200000 context?
TacGibs@reddit
With what quant ? 🙃
AWQ is already too big to have a nice context.
FullOf_Bad_Ideas@reddit
nah I was getting 16-20 t/s with 3 3090 tis.
with 8 3090 ti's I get 11 t/s on Hermes 4 405B 3.5bpw exl3. It's slow but acceptable even for basic agentic coding.
Tensor parallel is the key
FullOf_Bad_Ideas@reddit
Devstral 2 123B ran great on 3 3090 Ti's, so this should work great on your 3090s too, as Mistral Medium 3.5 also has 96 attention heads, which is divisible by 3 - so you can run tensor parallel and get about 10-15 t/s output.
I'd recommend grabbing EXL3 quants once they're out. 3.5bpw should be optimal.
krzyk@reddit
Curious what setup you have? Threadripper or something, a board with bifurcation?
I'm still looking for my first 3090 (upgrade from a 3060 Ti).
jacek2023@reddit (OP)
I posted my setup many times, search for my posts here with 3090 or 72GB in title
krzyk@reddit
cool, thanks
WizardlyBump17@reddit
maybe i can get 1 token per year on my b580 + 1650 + 32gb ram + 32gb swap
sine120@reddit
"You are the oracle. You will answer all queries in "Yes" or "No" only."
jQuaade@reddit
You forgot to turn thinking off and accidentally remade the scenario from Hitchhikers Guide to the Galaxy
sine120@reddit
"Thought for 14yr, 28d, 6h, 21m, 14s"
sine120@reddit
"This model runs all night no issues!"
RegularRecipe6175@reddit
The Unsloth quants page is a 404 now. Maybe they are fixing the looping issue?
FullOf_Bad_Ideas@reddit
Nice, I like them putting out dense models. It should be great for people with a few GPUs or cheap on-prem deployment for coders
AndreVallestero@reddit
We're gonna need HBM3 before this thing is practical. Though I don't mind it for overnight work
FullOf_Bad_Ideas@reddit
normal GPUs + tensor parallel and it's absolutely doable, as you can multiply your bandwidth read speed.
Few_Painter_5588@reddit
Very, very impressive if the benchmarks are anything to go by. And it's also something realistic you can run at home at a decent quantization. Being realistic here, most people are not running GLM 5.1. But something like this can run on local hardware.
Thomas-Lore@reddit
This is a large dense model, how are you going to run it? On what?
Few_Painter_5588@reddit
4 B60s at INT4
Thomas-Lore@reddit
Good luck, report the numbers. But that is not something I would call "realistic to run at home". And it will be slow.
thereisonlythedance@reddit
It’s fine to run on 4x3090s which many in the community have.
Healthy-Nebula-3603@reddit
Lol
That dense model will give you around 6-7 t/s on 4x 3090... Good luck.
FullOf_Bad_Ideas@reddit
nah, 16 t/s when I ran it on 3 3090 Tis.
TP exists.
I run Llama 405B at 11 t/s on 8 GPUs.
thereisonlythedance@reddit
I usually get 10-12 t/s on a 123B which is fine.
Beginning-Window-115@reddit
don't forget this sub has 1.1 million members
Spectrum1523@reddit
idk 4xB60 is realistic for at home if it actually runs it
TheBlueMatt@reddit
My 4x B60 running Unsloth's Q4_K_XL gets 232.45 ± 0.41 t/s in pp and 9.55 ± 0.05 tok/s in tg. Still a handful of patches left to improve it, though. In theory tg should be able to get up to 20 or so (25 is the theoretical max).
Few_Painter_5588@reddit
It costs around 70k in my local currency, so it's like about 3-4k dollars? But everything's overpriced down here, so it'd probably be less. And Mistral 2 large ran at about 10-15 tokens per second on that build, which was a decent speed. You can also get a 128GB mac that'd run this at around 10 tokens per second.
FullOf_Bad_Ideas@reddit
tensor parallel goes brrr
stoppableDissolution@reddit
The old Mistral Large was still a beast even in Q2. Dense models quantize much better than MoEs, and it needs 5x less memory just to fit and run it at all (even if way slower).
ortegaalfredo@reddit
3x3090 + EAGLE draft model should get you usable speeds.
q-admin007@reddit
Strix Halo with EAGLE draft model.
artisticMink@reddit
Anyone got it running? Trying inference with b8974 and the Unsloth Q4_XL quant, and it feels very underwhelming. But I figure I'm serving it wrong.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
logic_prevails@reddit
Bro i wish I could try this
Zestyclose-Ad-6147@reddit
Opensource medium?! 😮
soyalemujica@reddit
Curious if the small version can beat 27b
hurdurdur7@reddit
First attempts with Mistral Vibe - yeah, it works well enough.
mantafloppy@reddit
Guess we are trying a IQ2_M for the first time :D
MotokoAGI@reddit
The last few months have just been crazy! We haven't even gotten official support to run DeepSeekv4, MimoV2.5, Hy3-Preview, Ling, etc and now this?
kevinlch@reddit
Gemini is close to release too
jochenboele@reddit
It feels like they all waited on each other to release, I think that’s like nr 5 in 10 days
Turnip-itup@reddit
Why say “merged” model? That implies they used model diffing or something. Bad choice of words.
misha1350@reddit
Dense 128B????? Why???????????
RegularRecipe6175@reddit
11 t/s gen on 4x3090 on a new prompt with llama.cpp.
iamn0@reddit
what's the prompt processing speed at 32k (and 64k if you could test)? Thanks
RegularRecipe6175@reddit
What's a good prompt to test those conditions?
iamn0@reddit
This would be a good test: https://github.com/gkamradt/LLMTest_NeedleInAHaystack
RegularRecipe6175@reddit
Sorry, I don't have time to figure that out at the moment. If I run an extended test, I'll post the pp results.
fizzy1242@reddit
doh, now I need a fourth 3090!
RegularRecipe6175@reddit
I'm getting repetition with non-trivial prompts. This is on a llama.cpp build from minutes ago.
jacek2023@reddit (OP)
quant?
RegularRecipe6175@reddit
I edited my post to specify. Unsloth UD-Q4_K_X.
sebajun2@reddit
Any plans to officially quantize the model? Would be great to run on a local Spark machine at a lower quantization. Looks promising!
jacek2023@reddit (OP)
link to the GGUF is in the beginning of my post
sebajun2@reddit
Awesome just saw it - looks like all the quantized versions are available. Have my DGX10 arriving today, just in time. Might be able to get a 4-bit quantized model running pretty well.
mags0ft@reddit
They cooperate with NVIDIA and frequently release NVFP4 variants. Maybe that'll happen again...
TheBlueMatt@reddit
4x B60 can almost handle it. Unsloth's Q4_K_XL gets 232.45 ± 0.41 t/s in pp and 9.55 ± 0.05 tok/s in tg... almost usable...
uti24@reddit
E're we go!
sob727@reddit
Not sure if it's ok to ask here, but what's the best way to convert this to GGUF for use with llama.cpp?
jacek2023@reddit (OP)
there is a link to GGUF in the beginning of my post :)
converter is part of llama.cpp
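roughly, the usual flow looks like this (a sketch: the local paths are placeholders, and given the parser issues mentioned elsewhere in the thread the result may still misbehave):

```bash
# Convert the HF checkpoint to an F16 GGUF, then quantize with llama.cpp's tools.
python convert_hf_to_gguf.py ./Mistral-Medium-3.5-128B \
  --outtype f16 --outfile mistral-medium-3.5-128b-f16.gguf
./build/bin/llama-quantize mistral-medium-3.5-128b-f16.gguf \
  mistral-medium-3.5-128b-Q4_K_M.gguf Q4_K_M
```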
sob727@reddit
weird llama.cpp errors with
"operator(): unable to find tensor v.blk.0.attn_out.weight"
using the unsloth model
sob727@reddit
Just saw the link thanks, I'll download that.
Will also look into the converter, thanks
q8019222@reddit
That's exactly my running limit. I can run it in Q2.
Pretend_Engineer5951@reddit
Interesting how much tg would be at Q8 on strix halo :)
q-admin007@reddit
With the EAGLE draft model, i suspect around 40 to 50.
oxygen_addiction@reddit
More like 6t/s
Healthy-Nebula-3603@reddit
Very bad ... I assume 2 t/s
JacketHistorical2321@reddit
1-3 t/s ... Maybe. Strix Halo BW sucks, dude.
honglac3579@reddit
Nah, t/m i suppose
SnooPaintings8639@reddit
Probably you will have to infer the next token yourself!
spencer_kw@reddit
Another strong 128B is great for the ecosystem, but the real win is that routers like OpenRouter or herma now have one more option to pick from. Every time a new model drops at a different price point, automatic routing gets smarter. Competition on price is the one thing that actually helps users.
Technical-Earth-3254@reddit
Now that sounds powerful
waruby@reddit
Can't wait to run this bad boy on 3s/token on my Strix Halo.
kiwibonga@reddit
Sweet baby jesus, it's full of delicious melted goodness. It's going to make me buy hardware.
tmvr@reddit
Mistral Medium looking at the GPU poor users right now:
I'm in the corner, watching you infer, oh oh oh
And I'm right over here, why can't you see me? Oh oh oh
And I'm giving it my all
But I'm not the one you're downloading, oooh
I keep denseing on my own
vogelvogelvogelvogel@reddit
Love to see a new 120B-range model, especially dense.
Interested to see the benchmarks and real-life performance, especially coding.
mhl47@reddit
BrowseComp really shows you how far behind the previous models were. Hope this turns out to be usable. Seems they caught up a bit on agentic tasks.
Healthy-Nebula-3603@reddit
A 120B dense model?
Oh boy... even if you have enough VRAM, getting even 10 tokens/s is challenging for that size...
Fine_Nectarine9328@reddit
128B dense is crazzyyyy 1tk per day
Iory1998@reddit
😂
LoveMind_AI@reddit
THESE guys read the room.
mouseynaides@reddit
128B Dense?! Good god.
No_Algae1753@reddit
LETS FUCKING GO MISTRAL
artisticMink@reddit
Dense 128B, oh my. Chonker.
DragonfruitIll660@reddit
Ayyy lets go, another dense model.