Mistral removing ton of old models from API (preparing for a new launch?)
Posted by mpasila@reddit | LocalLLaMA | View on Reddit | 21 comments
They are going to be removing 9 models (the screenshot is missing one) from their API at the end of this month. So I wonder if that means they are preparing to release something in early December? I sure hope I finally get Nemo 2.0 or something... (it's been over a year since that was released).
Source: https://docs.mistral.ai/getting-started/models#legacy-models
AppearanceHeavy6724@reddit
Well, these are not popular - they are all derived from the Small 3.0 base, which has very stiff language and is very, very prone to looping. I have long since deleted 3 and 3.1 from my HDD.
kaisurniwurer@reddit
3.2 is also no saint in that regard. But it's excellent with context and instructions, so I gave it a pass over the 24B version.
Few_Painter_5588@reddit
Possibly - they've been working on a new Mistral Large for quite some time. It'd be cool to see a new Mixtral 8x22B.
stoppableDissolution@reddit
I really, really hope the new Mistral Large is still a 120B dense model and not a huge-ass MoE that can't be run on consumer hardware while being worse. I'm also, unfortunately, quite certain that it will in fact be a huge MoE :c
Few_Painter_5588@reddit
It's all about active parameters. Sure, you can get a 235B-A22B model or a 681B-A31B model, but the big models like Claude 4.5 and Gemini reportedly have upwards of 100B active parameters in their MoE structure.
usernameplshere@reddit
I hope so. Mistral is lagging behind the Asian models by a lot.
AppearanceHeavy6724@reddit
Mistral 3.2 is a far, far better generalist than almost all Chinese models of comparable size, except GLM-4 32B, but the latter has its own quirks.
Xandred_the_thicc@reddit
Qwen's instruction following is atrocious compared to Magistral 2509, imo. I've noticed the Chinese models tend to outright ignore instructions that "don't align with the community guidelines of..." no matter how extensively you explain how stupid of a reason that is to refuse to translate "controversial" text or transcribe an image.
Sea-Rope-31@reddit
I always forget these guys exist tbh
xxPoLyGLoTxx@reddit
Dunno why you are downvoted but it’s the same for me. It’s a crowded space now.
My_Unbiased_Opinion@reddit
Idk Magistral 1.2 is very solid. Not super verbose, but gets to the point.
macumazana@reddit
Does this mean you cannot use them via the API even if you were using them in production?
What happens - do you get an error on the request now, or do they redirect you to a newer model?
VicboyV@reddit
Or maybe they're cutting costs?
Ill_Barber8709@reddit
I'm not sure how removing old models helps cut costs.
Double_Cause4609@reddit
It's really hard to serve multiple models at varying levels of usage; every additional model you serve results in some underutilization of GPUs (unless you serve at high latency), so offering multiple variants of the same model gets really painful really quickly.
That is, if you need 100 GPUs to serve 1 model (just as an example) at full capacity, and you have, let's say, 150 GPUs worth of people using that one model, you need at least 200 GPUs, because you generally have to allocate GPUs in blocks (again, not the actual numbers, I'm just using easy, round numbers to illustrate).
But if you have, say, 4 versions of the same model, you now need 400 GPUs, even if you only have 150 GPUs' worth of people using the models... And it gets potentially worse! If 101 GPUs' worth of people are using the newest model (and this is not an unusual pattern), you actually need to allocate 500 full GPUs (5 full blocks) to meet that demand, even if it only happens for like 10% of the day!
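A rough sketch of that block math in Python (the 100-GPU block size and the demand figures are just the made-up round numbers from above, not real ones):

```python
import math

BLOCK_SIZE = 100  # made-up: GPUs needed to serve one copy of one model at full capacity

def gpus_allocated(demand_per_model):
    """Each model gets its own allocation, rounded up to whole blocks (minimum one block)."""
    return sum(max(1, math.ceil(d / BLOCK_SIZE)) * BLOCK_SIZE for d in demand_per_model)

print(gpus_allocated([150]))              # one model, 150 GPUs of demand -> 200 allocated
print(gpus_allocated([40, 40, 40, 30]))   # 4 versions splitting 150 GPUs of demand -> 400
print(gpus_allocated([101, 17, 16, 16]))  # demand shifts onto the newest version -> 500
```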
So I think it's actually pretty intuitive that when you're serving at scale offering tons of different models at once is actually kind of brutal.
EndlessZone123@reddit
I can't imagine the overhead is that high (100 vs 400) when serverless and dynamically adjusting servers are extremely common concepts. There is no reason they couldn't just load a different model onto each of their GPUs if demand changes. That takes only seconds.
Maybe like 5-10% more, but not 100% more per model.
Double_Cause4609@reddit
You are sort of right, but keep in mind that inference isn't clean like a lot of other service-based workloads; you can't just spin up a single extra GPU to get a few more requests per second. There are actually minimum viable cluster sizes necessary to be able to start serving a new batch.
Like, if you think about how inference works at scales like these, you have to allocate enough GPUs to load the model plus decent concurrency (could be something like 2-4 for Mistral Small, for example; might be 8 in an enterprise scenario), and *then* you can serve inference. This doesn't sound as bad as my example (especially if you're already serving like 50,000 to 100,000 other people), but selling per token (especially on a partially open-source model stack) is an extremely low-margin affair, and low GPU utilization in an allocated cluster means you're losing money if you priced tokens according to full allocation.
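To make the "priced according to full allocation" point concrete, here's a toy margin calculation - every number below is invented purely for illustration:

```python
GPU_HOUR_COST = 2.00           # assumed $ per GPU-hour
CLUSTER_GPUS = 8               # assumed smallest viable allocation for the model
FULL_TOKENS_PER_SEC = 10_000   # assumed cluster throughput when batches are full
PRICE_PER_M_TOKENS = 0.50      # assumed $ charged per million tokens

hourly_cost = GPU_HOUR_COST * CLUSTER_GPUS  # fixed at $16/h whether or not anyone shows up

def hourly_revenue(utilization):
    tokens_per_hour = FULL_TOKENS_PER_SEC * utilization * 3600
    return tokens_per_hour / 1_000_000 * PRICE_PER_M_TOKENS

for util in (1.0, 0.5, 0.2):
    print(f"{util:.0%} utilization: ${hourly_revenue(util):.2f}/h revenue vs ${hourly_cost:.2f}/h cost")
# With these made-up numbers, full utilization barely clears cost ($18 vs $16),
# while a half-idle or mostly-idle cluster loses money every hour it sits allocated.
```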
Additionally, it also depends on the deals you have with providers; when you're dealing in extremely large numbers of GPUs you actually do often have to commit to a certain number of them (tbh, it's kind of like having the worst parts of renting *and* owning), and you don't necessarily have the ability to scale up or down by thousands (which is often the order of magnitude you're dealing with here).
It's extremely difficult to allocate these types of resources at a fine-grained level, and having to serve multiple variants of every single model is really challenging.
Featherless had to build a completely different serving model with a bespoke engine to be able to serve finetunes on demand, and it significantly hurt their ability to serve at a high rate (and, in fact, at a low cost), which are also things that your users will complain about.
Cool-Chemical-5629@reddit
This. Some people don't seem to realize that these services run 24/7. It's not like your local LM Studio or whatever setup, where you load your model when you actually need to use it and unload it when you're finished so those resources can go to different tasks. This is an online service, which is a whole different world: the models must stay loaded permanently even if they are never actually used by the end-users, so they still hog resources that could at least go to more recent models.
Ill_Barber8709@reddit
And some people simply need an explanation because not everyone has all the answers.
Ill_Barber8709@reddit
Thanks for the explanation!
FullOf_Bad_Ideas@reddit
I am sure it makes sense for them, but what should devs/consultants build apps on if they don't want to use OpenAI/Google models and they want a project to stay maintenance-free for years?
Let's say you have a client who wants some workflow executed periodically - say, a report generated for every incoming invoice.
You want to ideally just set it up with some API, deploy it and let the app live forever. The less maintenance, the better, and sometimes you can't just swap the model endpoint to a new recommended one and not run into any undesired changes.
With an open model that's trivial to do with autoscaling serving platforms, where you're running a custom app that has the weights downloaded to local storage, and you can freeze the whole workflow to be stable regardless of dependencies and API deprecations (that's not an ad, so I won't name them here).
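For the open-model route, a minimal sketch of what freezing the weights locally can look like, assuming the weights come from the Hugging Face Hub - the repo id, revision, and directory below are placeholders, not a recommendation of any particular platform:

```python
from huggingface_hub import snapshot_download

# Pin an exact revision so the workflow never silently changes behavior,
# even if the upstream repo is updated or a hosted API version is deprecated.
MODEL_REPO = "mistralai/Mistral-Nemo-Instruct-2407"  # placeholder repo id
PINNED_REVISION = "main"  # replace with the specific commit hash you tested against

local_path = snapshot_download(
    repo_id=MODEL_REPO,
    revision=PINNED_REVISION,
    local_dir="./models/frozen-weights",  # the weights now live with the app
)
print(f"Weights frozen at: {local_path}")
```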
But how do you do that with a closed model from a company like Mistral if they'll deprecate the model 12 months after release?
A quick deprecation schedule is something that will make some customers think twice about building a small project on a given API model. gpt-3.5-turbo-1106 from November 2023 is still on the API, and I am sure it still has some customers.
I think there's surely some smart autoscaling setup they could run internally, or outsource to a third party, for this kind of "barely, but still there" API longevity.