Can somebody please do a benchmark of the new Mistral-Large-2411?
Posted by CH1997H@reddit | LocalLLaMA | View on Reddit | 21 comments
I can't find a single benchmark anywhere. Nobody has made one. Very unusual
Is it better than the previous Mistral Large? How does it compare against Llama 3.1 405B?
That's all up to your imagination 😊
segmond@reddit
lolzinventor@reddit
I'm using Mistral Large 2407 Q5 for generating a CoT training data set; it still has a couple of weeks to go. Do you think it's worth pausing the job and switching to 2411? Or would I possibly not even notice the 1% increase between 2407 and 2411?
FullOf_Bad_Ideas@reddit
Q5 quant for batch dataset creation? Is this cheaper than just using their API? By Q5 I am assuming a GGUF quant, yes?
lolzinventor@reddit
Yes GGUF. It is cheaper for me.
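Whether a local quantized run beats the API is mostly arithmetic. A minimal sketch of the break-even calculation, where every number (power draw, electricity price, throughput, API pricing) is a made-up placeholder to substitute with your own figures, not actual Mistral pricing:

```python
# Back-of-envelope check of when local batch generation beats an API.
# All inputs are hypothetical placeholders -- plug in your own numbers.
def breakeven_tokens(fixed_cost_usd, power_watts, usd_per_kwh,
                     local_tok_per_s, api_usd_per_mtok):
    """Token count at which the local run becomes cheaper than the API."""
    # Electricity cost per generated token for the local rig
    local_per_tok = (power_watts / 1000) * usd_per_kwh / (local_tok_per_s * 3600)
    api_per_tok = api_usd_per_mtok / 1_000_000
    if api_per_tok <= local_per_tok:
        return float("inf")  # API never breaks even against local power cost
    return fixed_cost_usd / (api_per_tok - local_per_tok)

# Example with made-up numbers: sunk hardware cost, 300 W rig at $0.20/kWh,
# 10 tok/s locally, versus a $6-per-million-token API.
print(breakeven_tokens(0, 300, 0.20, 10, 6.0))
```

With already-owned hardware (fixed cost 0) and cheap electricity, local wins from the first token; a slow quant on expensive power can flip the result.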
segmond@reddit
I don't know; that 1% could turn into more for you. But 2411 seems to perform CoT better without prompting, so it's probably worth it to grab the newest model and continue from there.
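For context, CoT dataset jobs like this typically wrap each question in a reasoning-eliciting template. A hypothetical minimal version (the template wording is illustrative, not the actual prompt used in the thread):

```python
# Hypothetical CoT prompt template for building a training set:
# ask for explicit step-by-step reasoning before the final answer.
COT_TEMPLATE = (
    "Question: {question}\n"
    "Think step by step, then give the final answer "
    "on a line starting with 'Answer:'."
)

def make_cot_prompt(question):
    """Wrap a raw question in the reasoning-eliciting template."""
    return COT_TEMPLATE.format(question=question)

print(make_cot_prompt("What is 17 * 24?"))
```

The point of segmond's comment is that 2411 reportedly produces this kind of step-by-step structure even without such explicit prompting.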
lolzinventor@reddit
There is no harm in downloading and having a look first...
lemon07r@reddit
I like my magical imagination
DragonfruitIll660@reddit
So far, from my very limited testing, it seems to be better than the previous version. It's not perfect (still hoping for an LLM that can accurately keep track of stats and status tags for an RPG game), but it seems more detailed at remembering lore. I'm running Q4_XS, so there may be even greater quality gains at higher quants.
cm8t@reddit
It is certainly better than 2407: better reasoning and big-picture understanding.
Caffeine_Monster@reddit
It certainly has better (more human) prose and stronger prompt following.
However, I don't think the model is fundamentally any stronger, which is why we keep seeing it land within the margin of error of 2407 on perplexity tests.
thezachlandes@reddit
I just uploaded a Q2 MLX quant in case anyone wants to try this on macOS! Loaded in LM Studio it's 42.31 GB. I'm going to upload the Q4 a little later.
get it here: https://huggingface.co/zachlandes/Mistral-Large-Instruct-2411-Q2-MLX/
Admirable-Star7088@reddit
I was wondering the same, and the best way to benchmark a model is trying it out yourself. My curiosity made me download Mistral-Large-2411 (Q4_K_M), even though I believed it would most likely crash my system. To my surprise, it actually worked (just right on my system's limit).
So far, in my own testing, it's a very powerful model with excellent reasoning and prompt following. I've had a bunch of fun feeding it hypothetical/logical questions and comparing it against Nemotron 70B and Athene-V2 70B. Mistral-Large-2411 stands strong and is one of my favorite models right now.
hello_2221@reddit
Just curious, what's your setup?
Admirable-Star7088@reddit
64 GB DDR5 RAM and 16 GB VRAM = 80 GB RAM in total. Getting ~1 t/s with Mistral Large Q4_K_M.
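Those numbers line up with a rough size estimate: a GGUF's weight footprint is roughly parameter count times average bits per weight, divided by 8. A sketch assuming ~123B parameters for Mistral Large and an approximate ~4.8 bits/weight average for Q4_K_M (both ballpark figures that vary by model and tensor mix):

```python
# Rough GGUF weight footprint: parameters * average bits per weight / 8.
# KV cache and activations come on top of this, so it's a lower bound.
params = 123e9          # Mistral Large is ~123B parameters
bits_per_weight = 4.8   # approximate Q4_K_M average (varies by tensor mix)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")
```

That lands in the low 70s of GB, close to the 80 GB RAM+VRAM ceiling, which explains both "just right on my system's limit" and the ~1 t/s from having most layers in system RAM.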
segmond@reddit
I haven't played with Athene-v2-70b, how does it compare to Nemotron70b?
Admirable-Star7088@reddit
I would need to test Athene-V2-70b a bit more to be more confident in my answer, but so far it has been roughly on par with Nemotron 70b for me, they are both very intelligent.
The main difference in a nutshell, as far as I can tell, is that Nemotron does deeper analysis on your hypothetical/logical requests, while Athene goes more straight to the point (but still makes good points).
For example, in story writing I have so far found Nemotron to write more fascinating and interesting text, as it fills in interesting details and makes it more awe-inspiring, while Athene tends to progress the story more straightforwardly (but still in a logical and intelligent way).
This is my "feel" about these models right now.
Healthy-Nebula-3603@reddit
2407
llama-perplexity.exe --model models/new3/Mistral-Large-Instruct-2407-Q3_K_S.gguf --threads 30 -ngl 35 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192 --no-mmap
390 90.76923077
391 90.53708440
392 90.56122449
393 90.58524173
394 90.60913706
395 90.63291139
396 90.65656566
397 90.68010076
398 90.70351759
399 90.72681704
400 90.75000000
2411
bin\llama-perplexity.exe --model models/new3/Mistral-Large-Instruct-2411-Q3_K_S.gguf --threads 30 -ngl 35 --hellaswag --hellaswag-tasks 400 -f models\hellaswag_val_full.txt -c 8192 --no-mmap
390 88.97435897
391 88.74680307
392 88.77551020
393 88.80407125
394 88.83248731
395 88.86075949
396 88.88888889
397 88.91687657
398 88.94472362
399 88.97243108
400 89.00000000
The newer one seems a bit worse on this HellaSwag run....
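The runs above print pairs of "task count, cumulative accuracy". A small sketch to parse that output and quantify the gap, hard-coding just the tail of the two runs quoted above:

```python
# Parse the "task_count cumulative_accuracy" pairs printed by
# llama-perplexity --hellaswag, keyed by task count.
def parse_scores(text):
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            try:
                scores[int(parts[0])] = float(parts[1])
            except ValueError:
                pass  # skip lines that aren't score pairs
    return scores

run_2407 = parse_scores("399 90.72681704\n400 90.75000000")
run_2411 = parse_scores("399 88.97243108\n400 89.00000000")
gap = run_2407[400] - run_2411[400]
print(f"gap at 400 tasks: {gap:.2f} points")  # 1.75 points in favor of 2407
```

A ~1.75-point HellaSwag gap at Q3_K_S is notable, though quantization can affect different model revisions unevenly, so a higher-quant rerun would make the comparison firmer.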
YearZero@reddit
You can find it here:
https://oobabooga.github.io/benchmark.html
https://aider.chat/docs/leaderboards/
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard (no 2407 on this one to compare against tho)
segmond@reddit
It is better than 2407.
Healthy-Nebula-3603@reddit
I'm going to do that today for perplexity... locally.