Mistral Small 3 24b is the first model under 70b I’ve seen pass the “apple” test (even using Q4).

Still_Potato_415@reddit

Try this: Generate 10 English sentences, each starting with "banana" and ending with "apple."

Sky_Linx@reddit

Just tried this, and it completely failed, lol. On the other hand, Qwen2.5 14b works much better, though it still isn't perfect.

EmergencyLetter135@reddit

I am also positively surprised by the performance of the Mistral 24B. This model would be perfect with a longer context. Unfortunately, 32K is no longer sufficient for my purposes.

EmergencyLetter135@reddit

Text generation and RAG in German. I use several models: SuperNova Medius for RAG, for example, and various 70B models for text generation (Athene v2, Qwen, Nemotron, and Deepseek).

[-]

Are the 70B models generally able to write German without errors? My experience with smaller models (e.g. Qwen 14B) is that they do not reach production-ready quality. But the new Mistral 24B model does a nice job writing in German so far.

drifter_VR@reddit

Mistral AI makes some of the best multilingual models, and Small 3 is even better at multilingual tasks than Small 2.

rhinodevil@reddit

After playing around with it a bit more: There are the occasional missing syllables in (e.g.) German, but all in all very high quality.

EmergencyLetter135@reddit

That is also my experience. Mistral and Nemotron are sufficiently good in German for my purposes. Athene v2 is also okay ... but Qwen and R1 (distilled versions) are significantly worse in German, in my opinion.

BraceletGrolf@reddit

I'm curious about multilingual results. I'm working on an app with LLMs for the Western European market, so it has to be multilingual. So far Mistral is the only one advertising such results; I wonder how Llama performs on that.

Flashy_Management962@reddit

I too use SuperNova Medius and I think it even outperforms the Virtuoso models. Is it also true for you?

ds_nlp_practioner@reddit

Sounds good

Zenobody@reddit

Eh if you have enough VRAM for 64k+ context you probably have VRAM for larger models anyway.

Worth-Product-5545@reddit

Thing is, people are still testing LLMs with prompts that typically expose either (1) the tokenizer's weaknesses or (2) the sampling parameters (e.g. repetition penalty here). We need more realistic tests.

pkmxtw@reddit

/r/LocalLLaMA: Benchmarks are completely useless and we should never trust them! Also /r/LocalLLaMA: uses one random trivia question to compare models.

Many_SuchCases@reddit

Both of these statements are true though. Benchmarks have done a lot of damage to LLMs. Just look at the Phi models, they should be outperforming many others, and yet most people don't like them outside of some limited use cases.

jeremyckahn@reddit

I keep hearing this, but Phi 4 is consistently the best model for real world, day-to-day tasks for me. It’s such a good all-rounder.

NeedleworkerDeer@reddit

Phi 1 through 3 were amazing for creative writing for me; Phi 4 is the first I haven't used because it wasn't as good. Everyone has their own use case, I guess.

GrungeWerX@reddit

I tried it as well and it was crap for writing.

-Ellary-@reddit

I agree, it's a solid 14b for work and formatting, tools etc. Maybe even best in 7-14b range.

zekses@reddit

I've done some real-world testing of Mistral Small Q8. Mind you, I had to use the instruction template from the original Mistral since the current one is broken in oobabooga. My verdict: I don't need an LLM to be my yes-man. Mistral's C++ code reviews are skin-deep and don't go far enough to be useful. Qwen Coder 32b is way noisier, but it also detects way more problems.

Positive_Click_8963@reddit

Got 8/10 with Mistral Small 24B on my end... Q4KM, llama.cpp, linux. Still a fantastic model and has pretty much become my go-to.

uti24@reddit

Mistral Small 3 24b is good, really good; it's probably the best model below 70B. But this test may simply have leaked into the model's training data. I ran a small experiment with Mistral-Small-24B-Instruct-2501-Q6_K:

> Write 10 sentences that end with the word 'submarine.'

AI: ... 2. The captain gave the order to dive, and the submarine began its descent. ... 10. The marine biologist used the research submarine to explore the deepest parts of the ocean.
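
Scoring this kind of test by eye gets tedious, so here is a rough sketch of an automatic checker. It only looks at the last alphabetic word of each sentence. The function names are made up for illustration, and the third sample sentence is a hypothetical passing answer, not model output:

```python
import re


def ends_with_word(sentence: str, word: str) -> bool:
    """Return True if the last alphabetic word of `sentence` equals `word`."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return bool(tokens) and tokens[-1] == word.lower()


def score(sentences: list[str], word: str) -> int:
    """Count how many sentences end with `word` (case-insensitive)."""
    return sum(ends_with_word(s, word) for s in sentences)


samples = [
    # The two sentences quoted above: neither actually ends with "submarine".
    "The captain gave the order to dive, and the submarine began its descent.",
    "The marine biologist used the research submarine to explore the deepest parts of the ocean.",
    # Hypothetical passing sentence, added for contrast.
    "The crew boarded the submarine.",
]

print(f"{score(samples, 'submarine')}/{len(samples)} sentences end with 'submarine'")
```

Trailing punctuation and quotes are ignored, so "apple!" and "apple." both count as ending with "apple".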

dubesor86@reddit

Nemotron 51B is stronger, but I agree with the sentiment. R1 Distill Qwen 32B and Gemma 2 27B are competing, depends on task though.

-Ellary-@reddit

Why don't people talk about Nemotron 51b? It is better than Qwen 32b, and the best model below the 70b range.

AaronFeng47@reddit

Too large for 24gb card

drifter_VR@reddit

And too small for 2x24gb

Admirable-Star7088@reddit

I use 70b models quite a lot, and Mistral Small 3 24b is indeed very good and quite comparable. 70b models still have more depth, but Mistral Small 3 feels like a "70b light" model, lol.

drifter_VR@reddit

yeah the best 70b models like Midnight-Miqu-70B-v1.5 are still superior

Sea_Sympathy_495@reddit

your parameters are wrong for the task you want the model to perform.

uti24@reddit

I mean, maybe? Aren't parameters loaded with the model these days? If the model is that smart, maybe the default repetition penalty should be lower.

Sea_Sympathy_495@reddit

> I mean, maybe? Aren't parameters loaded with the model these days?

Nope.

> If the model is that smart, maybe the default repetition penalty should be lower.

You're using the word smart as if the model can adjust its own training weights on the fly lol. No, that's not how any of this works.

onil_gova@reddit

I use the word raspberry instead of strawberry. If the model has generalized, then it should be able to perform the same task with any other variation.

drifter_VR@reddit

Similarly, Mistral Small 3 can perfectly understand certain conceptual jokes that models under 70b or GPT-3.5 turbo struggle with. Jokes like: "The adult does not believe in Father Christmas. He votes.", "With my wife, we have sexual relations. But on the whole they don't come very often."...

Sea_Sympathy_495@reddit

Are you guys suffering from some sort of mental issue? This has been discussed in so much depth in here that posts like these make 0 sense. This has 0 to do with the model's intelligence and everything to do with your parameter settings. That's all. Your repeat penalty / top k / temperature are too high for the model. Even Llama 2 7b can do this. Please stop. How does this have so many upvotes?
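
For anyone wondering why repeat penalty matters for this particular test: the prompt forces the model to emit "apple" ten times, and a repetition penalty pushes down the logit of every token already in the context. Below is a minimal sketch of one common formulation (divide positive logits by the penalty, multiply negative ones). The three-word vocabulary and the logit values are made up for illustration; real inference engines apply this over the full vocabulary, often only over a recent window of tokens.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Penalize tokens that already appeared: positive logits shrink, negative ones grow more negative."""
    out = list(logits)
    for t in set(seen_token_ids):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


# Toy setup (all values are assumptions): index 0 stands for "apple".
vocab = ["apple", "pear", "tree"]
logits = [3.0, 1.0, 0.5]
seen = [0]  # "apple" was already generated earlier in the output

for penalty in (1.0, 1.3, 2.0):
    probs = softmax(apply_repetition_penalty(logits, seen, penalty))
    print(f"penalty={penalty}: P(apple)={probs[0]:.3f}")
```

The probability of "apple" drops as the penalty goes up, which is exactly what you do not want when the task requires repeating that word.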

vyralsurfer@reddit

Is there a source to find the optimal settings for the different models? For example, I heard that Deepseek needs a much lower than normal temperature, but found no sources to corroborate that. I'd be interested in the best parameters for Mistral, since I've been having a lot of luck with it and wonder if I could make it even better...

Hisma@reddit

Go to Hugging Face and look up the model there. They'll typically include the optimal settings in the model description, or you can just look at the model's config.json.
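
A small sketch of checking this locally. One caveat: in Hugging Face repos, sampling defaults usually sit in `generation_config.json` rather than `config.json`, and not every repo ships that file or fills in these fields. The helper name and the `path/to/model` directory are placeholders, not a real path:

```python
import json
from pathlib import Path

# Sampling-related fields that may appear in a generation_config.json.
SAMPLING_KEYS = ("temperature", "top_p", "top_k", "repetition_penalty")


def read_sampling_defaults(model_dir):
    """Return whichever sampling defaults the downloaded model directory declares, if any."""
    path = Path(model_dir) / "generation_config.json"
    if not path.exists():
        return {}  # Nothing shipped; fall back to the model card or experimentation.
    config = json.loads(path.read_text(encoding="utf-8"))
    return {key: config[key] for key in SAMPLING_KEYS if key in config}


# "path/to/model" is a placeholder for a locally downloaded model folder.
print(read_sampling_defaults("path/to/model"))
```

An empty result just means the repo does not declare defaults, in which case the model card is the next place to look.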

YRUTROLLINGURSELF@reddit

"typically" is doin a looooot of heavy lifting there, chief. We need WAY better standards for this particular thing :(

Sea_Sympathy_495@reddit

no just play around with the values

Sea_Sympathy_495@reddit

Here's Llama 3.1 8b performing this task flawlessly, proving it's your parameters that are the issue. https://imgur.com/a/uNhCmuw

beedunc@reddit

I don’t think ‘kids eating crunchy fresh apple SEEDS’ is a valid answer.

Sea_Sympathy_495@reddit

crunchy seeds of an apple, unless you're illiterate?

Porespellar@reddit (OP)

You modified the standard apple test prompt by adding the “coherent” part. Secondly, you changed the model temperature to 1.3, which is way past the default. Everyone has their own method, and to each their own. I personally just don’t like changing parameters to make a model pass a test; I just want to see what it does with out-of-the-box settings.

Sea_Sympathy_495@reddit

> You modified the standard apple test prompt by adding the “coherent” part.

Yeah, that's not how it works. Put shit in, get shit out. Work on prompting better.

> Secondly, you modified the default model temperature to 1.3 which is way past default.

There is no default. It's per model. Some settings work for some models and not for others. For example, there are models that become incoherent if you set repeat_penalty.

> I personally just don’t like trying to make a model do something by changing parameters to pass a test

What you like is irrelevant. These are mathematical equations; you need to tune them to the task you want executed. Imagine wanting to write a book with a low temperature. That is idiotic.
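
To make the "mathematical equations" point concrete, here is a minimal sketch of what temperature does: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The logits are made-up numbers, and the function name is just for illustration:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # Shift by the max for numerical stability.
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for three candidate tokens (assumed values).
logits = [2.0, 1.0, 0.0]

for t in (0.3, 1.0, 1.3):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At 0.3 almost all the probability mass lands on the top token; at 1.3 the alternatives get a real chance, which suits creative writing and hurts rigid formatting tasks.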

perturbe@reddit

I would not call this flawless. Some of these do not make sense, “A bright green apple ripened in the sun apple” in particular.

AmericanKamikaze@reddit

Hey, what are people's speeds with 16gb vram?

Gloomy_MTTime420@reddit

This is humans trying to be egotistical humans. So because you threw 10 sentences at a 70B model, you think you’ve identified the best. That’s laughable behavior and classic human fallacy…thinking we can actually comprehend what 1B of anything is, let alone 24B or 70B.

rookan@reddit

https://preview.redd.it/t1w9mqp7irge1.jpeg?width=1440&format=pjpg&auto=webp&s=e49b8ff9de5105c1fffd2f2564ec3a9acd448d6c Deepseek R1 8B Q4 KM failed this test

overnightmare@reddit

This model really surprised me, even at Q4, it knows Italian pretty perfectly and can write in a very coherent way in that language. It’s miles better than Gemma 27b.

Brilliant-Day2748@reddit

Could you share those 10 sentences? Would be interesting to see how creative the model got. Pretty impressive for a 24b model to match 70b performance. Makes you wonder what other tests it might ace.

first2wood@reddit

But it failed my task: writing five sentences where each sentence's last word is the first word of the next sentence. Maybe because I was testing a q4.
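
A quick sketch of how that chain condition could be checked automatically. The function names and both example lists are hypothetical, not the actual model output:

```python
import re


def words(sentence):
    """Lowercased alphabetic words of a sentence, punctuation stripped."""
    return re.findall(r"[a-z]+", sentence.lower())


def is_chained(sentences):
    """True if every sentence starts with the last word of the previous one."""
    for prev, nxt in zip(sentences, sentences[1:]):
        prev_words, next_words = words(prev), words(nxt)
        if not prev_words or not next_words or prev_words[-1] != next_words[0]:
            return False
    return True


# Hypothetical examples, written by hand for illustration.
good = [
    "I bought a red apple.",
    "Apple pie is my favorite dessert.",
    "Dessert always comes last.",
]
bad = [
    "I bought a red apple.",
    "Pie is my favorite dessert.",
]

print(is_chained(good), is_chained(bad))
```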
