[Ministral 3] Add ministral 3 - Pull Request #42498 · huggingface/transformers
Posted by bratao@reddit | LocalLLaMA | View on Reddit | 27 comments
Cool-Chemical-5629@reddit
Now wait a damn minute. Is this the reveal of Bert-Nebulon Alpha? Because if it is, then I'm all in!
FlamaVadim@reddit
not possible ☹️
brown2green@reddit
It has 256k tokens context and vision support like the model on OpenRouter. That one also has "small model smell" in some aspects.
sschuhmann@reddit
In the PR the context window is mentioned 😉
brown2green@reddit
To me it looks like 16k native context extended to 256k with YaRN scaling. https://github.com/huggingface/transformers/blob/2d4578a7e58c298a0398c297adaf73887aa36e5b/src/transformers/models/ministral3/configuration_ministral3.py
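(A minimal sketch of how that could be checked once the config ships in transformers; the repo id and the exact rope_scaling values below are assumptions for illustration, not confirmed by the PR.)
from transformers import AutoConfig

# Hypothetical repo id; the official checkpoint name isn't published yet.
config = AutoConfig.from_pretrained("mistralai/Ministral-3-8B-Instruct")

print(config.max_position_embeddings)  # advertised context window, e.g. 262144 (256k)
print(config.rope_scaling)             # e.g. {"rope_type": "yarn", "factor": 16.0,
                                       #       "original_max_position_embeddings": 16384}

# With YaRN, the native window is original_max_position_embeddings (16k in this guess)
# and native * factor gives the extended window: 16384 * 16 = 262144.
if config.rope_scaling and config.rope_scaling.get("rope_type") == "yarn":
    native = config.rope_scaling["original_max_position_embeddings"]
    print(native, int(native * config.rope_scaling["factor"]))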
Cool-Chemical-5629@reddit
Why not? We have to believe. 🙂
dampflokfreund@reddit
Compare Bert's speed to Ministral 8B. Bert is way slower on OpenRouter, so it's a much bigger model.
Cool-Chemical-5629@reddit
Sure, but whatever Ministral 3 is, it's a new architecture, because they are adding support for it to Transformers. If this was based on the original Ministral 8B, they wouldn't need to touch Transformers to support it, right? But here they are, removing 9 lines, adding 1403 new lines of code. This is not our old Ministral architecture, so the performance of the model is yet to be disclosed.
To be fair, my suggestion that Bert-Nebulon Alpha could be this Ministral 3 is just wishful thinking. There could really be a bigger model, maybe one using the same new architecture as this Ministral 3, but we don't know that for sure. What I do know is that last time I tried Bert-Nebulon Alpha, it had some serious flaws in its logic, which I commented on in other threads in this sub. At some point I compared it to Mistral Small 3.2 and concluded that if Bert-Nebulon Alpha really is a new Mistral model, its logic is weaker than Mistral Small 3.2's, but then again maybe they fixed that in the meantime. If not, it would make more sense that it really is the 8B model, and for that size it would be a really smart one.
In the meantime, someone knowledgeable could take a look at the code and figure out what kind of architecture it really is. I'm sure it would be appreciated.
Hoblywobblesworth@reddit
Haven't looked in detail at the code additions yet, but the comments on the PR suggest it's not a major architecture update beyond a minor change to the RoPE implementation.
Cool-Chemical-5629@reddit
Minor change with 1403 additions? Hmm, okay.
Hoblywobblesworth@reddit
Comment on the PR from ~6hrs ago:
"Out of interest: if the only difference here is that the attn layer now supports L4-style rope extension, why was a whole new arch made instead of extending the regular Mistral LM arch with L4 rope support?"
rerri@reddit
I'm wondering if the upcoming Flux.2 Klein will use this as a text encoder. The image model was said to be size-distilled, so maybe a smaller text encoder would make sense too.
Zestyclose-Ad-6147@reddit
Ministral? What's that? Did I miss something?
hainesk@reddit
It's Mistral for edge computing, an 8B model.
mpasila@reddit
The code also mentions a 3B model, so there might be more.
sourceholder@reddit
I hope you're joking.
Clear-Ad-9312@reddit
I guess it's just an update to the previous Ministral, which is a small, performant model in the 8B range. Seems to skip 2 and go straight to 3, maybe to line up with the current model names? Idk, it's odd.
No_Conversation9561@reddit
I'm waiting for a 100B+ open model from Mistral.
misterflyer@reddit
8x22B V2
ResidentPositive4122@reddit
Huh? Aren't they at Mistral Small 3.2 and Mistral Medium 3.1 already?
youcef0w0@reddit
If you read the PR, it's an upcoming 8B model.
It's gonna have base, instruct, and thinking variants.
random-tomato@reddit
dayum
Klutzy-Snow8016@reddit
Their last Ministral was an 8B model. Maybe they're updating that.
jacek2023@reddit
Nice to see a new Mistral, but I will be patiently waiting for something bigger than 24B.
lacerating_aura@reddit
Yeah. Strange that Mistral were among the first to explore MoE models but have been really quiet lately.
brown2green@reddit
Mistral Medium 3.x is probably a MoE model, but it's API only.
guiopen@reddit
Apache license! Very excited for this