Cohere releases Aya Expanse multilingual AI model family

[-]

jd_3d@reddit

Looks really strong based on the win rates. Too bad it's non-commercial use.

[-]

Why does anyone care about that? There is no way to prove the origin of an LLM's outputs. Exactly zero ways to do that. As far as I am concerned, they only attach that phrase to this tech so that they are "protected" if some psycho goes and does something highly illegal with it.

[-]

Koksny@reddit

> There is no way to prove the origin of an LLM's outputs.

Until you get sued into discovery.

[-]

Serveurperso@reddit

I can't take it anymore. Wrong, this is straight out of Luc Julia.

[-]

LoafyLemon@reddit

What you're describing is currently impossible. Fingerprinting techniques are being developed, but no model uses them yet. Besides, the whole fingerprinting idea is dumb and won't hold up, because it only works if every model uses it, and we know that won't happen. People will keep using different models that break the metric simply by existing outside of it.

[-]

Koksny@reddit

We were able to fingerprint models even a year or two ago, just by asking the model to give a random number in the range 0 to n. It was accurate enough to pinpoint the particular release version. Remember, fingerprinting isn't about being able to tell a model's output from non-generated text. It's about being able to recognize the model given a large enough sample of output. And it's far from impossible, since I think everyone here has experienced model slop, slop that is often unique to a particular model or fine-tune.
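For anyone curious what the random-number trick might look like in practice, here is a minimal sketch. It is an illustration under assumptions, not the method the commenter actually used: it assumes you have already collected many "pick a number from 0 to n" answers from each candidate model, and it attributes an unknown sample to the candidate whose answer distribution is closest by total variation distance. The model names and sample data are hypothetical.

```python
from collections import Counter


def distribution(samples, n):
    """Empirical probability of each integer 0..n, with add-one smoothing."""
    counts = Counter(samples)
    total = len(samples) + (n + 1)
    return [(counts.get(k, 0) + 1) / total for k in range(n + 1)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def closest_model(observed, references, n):
    """Name of the reference model whose answer distribution best matches `observed`."""
    obs = distribution(observed, n)
    return min(
        references,
        key=lambda name: total_variation(obs, distribution(references[name], n)),
    )


# Hypothetical reference samples, standing in for answers collected per model.
references = {
    "model-a": [7] * 60 + [3] * 25 + [5] * 15,
    "model-b": [4] * 50 + [2] * 30 + [8] * 20,
}
# Hypothetical answers from the unknown model.
observed = [7] * 55 + [3] * 30 + [9] * 15

print(closest_model(observed, references, n=10))
```

Note that this only works when you can prompt the model directly, which is exactly the objection raised in the reply below.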

[-]

LoafyLemon@reddit

Fingerprinting is about discerning between models in the wild, not in a closed-off setting where you directly ask the model to output a number. Text models are trained on natural language, so this idea is bound to fail simply because people (that's us, yay!) write in the same style as large language models, which are trained on our creativity. The 'GPT slop', as you call it, can be reduced in a model to the point where your fingerprinting test would fail. Mind you, not all models are public, and the mere existence of models outside the spectrum would make the idea fail outright because, as I said before, they exist outside the metric.

[-]

silenceimpaired@reddit

Not true. They just released a paper on digital watermarks... and people also have consciences :)

[-]

silenceimpaired@reddit

I always downvote posts about this company for that reason.

[-]

appakaradi@reddit

That sucks.

[-]

MasterThread@reddit

It's the best model I've found for interactive storytelling, but it only supports 8k tokens of context. Any alternatives?

[-]

Quiet_Joker@reddit

I don't get why people are hating on this model. It's literally a model meant to translate from one language to another, not a model designed with tasks like coding in mind. I compared this with the previous version of the model, which was just named Aya, and this one is significantly better than the previous one when it comes to translating.

[-]

first2wood@reddit

Hey, have you done any tests, or are you already using it as a translator? How does the quality compare with Google Translate and the GPT-4 series?

[-]

Nekotekina@reddit

Not OP, but just tested it and it seems better than Gemma-2 27B at translation. Will test more, I'm intrigued so far.

[-]

first2wood@reddit

Yes, I tested too. It's quite decent as a translator!

[-]

Nekotekina@reddit

Gemma remains a tough opponent though. Its output looks more stable and solid, but at the same time, Aya seems more vivid and witty about nuances. Qwen2.5 (32B tested) seems the worst: very unstable and quirky, though its writing may look interesting.

If I cherry-pick "word stutter" recognition ("s-stutter"):

* Gemma2 may completely ignore the stutter in translation (seems okay).
* Aya Expanse works but not reliably.
* Qwen2 produces unnecessary transliteration of language fragments (very bad in my opinion).

I wish I could combine the strong sides of them all...

[-]

MRGRD56@reddit

The 32B model is also good for writing, at least in English; it seems to be coherent and creative enough. This model wasn't that good at reasoning: I gave it a riddle/problem (I made it up myself) that Qwen 2.5 32B was able to solve with correct reasoning, and that even Qwen 2.5 14B can sometimes solve, but Aya's reasoning was wrong. So it can't be a perfect ChatGPT alternative ("helpful assistant"), but it can be really good at writing, storytelling, roleplaying, and of course working with different languages. The 8B one was okay, but it didn't surprise me much. I've tested it a bit, and I think it should be alright for translation tasks.

[-]

Dyonizius@reddit

it also uses GQA, though 8k context for a translation model leaves a bit to be desired

[-]

dubesor86@reddit

I ran the models through my [personal benchmark](https://dubesor.de/benchtable); very weak for their size compared to the competition, not worth the storage space imo.

* Aya Expanse 8B (f16) - failed pretty much everything and was around L3.2 3B capability.
* Aya Expanse 32B (Q4_K_M) - weaker results than even Gemma 2 9B & Nemo 12B in my testing. It would be OK as like a 12B model due to being fairly uncensored. Gets absolutely stomped by Qwen2.5.

As always, YMMV! - but I'm deleting the models again.

[-]

MaycombBlume@reddit

The emphasis with this model is on its multilingual capabilities. Are your tests relevant to that domain?

[-]

dubesor86@reddit

While I do have some multilingual tasks, they're not the focus by any means. The marketing, however, claims an above-50% win rate against models with no emphasis on multilingual capabilities, and also in the English-specific win rates. (Link and charts are in the OP.) Either way, to me it's not really relevant, as I test each model's overall skill set regardless of its intended use (coders, tiny lightweight models, etc.).

[-]

walrusrage1@reddit

Thoughts on Nemo in general? I see it also ranks higher than GPT-3.5 in your evaluation table.

[-]

dubesor86@reddit

For 12B it's very good. I was surprised that it managed to do so well in my coding segment, and it's obviously far less censored than Gemma 2.

[-]

appakaradi@reddit

Why is no one comparing against Qwen 2.5?

[-]

Xhehab_@reddit

https://twitter.com/johnamqdang/status/1849883876245516594

[-]

dubesor86@reddit

Those would be insane win rates. My own testing of these models puts Aya dead last, but oh well. The low number of ties is also quite surprising.

https://preview.redd.it/jndwbizpbbxd1.png?width=661&format=png&auto=webp&s=d48e2220b8c1c47290e9c04fb1f7d9b60888df44
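On the point about ties: how they are counted can noticeably shift a reported win rate. The sketch below shows three common conventions side by side. Which convention the linked chart uses is not stated in this thread, so that part is an open question, and the counts in the example are made up purely for illustration.

```python
def win_rates(wins, losses, ties):
    """Win rate of model A over model B under three tie-handling conventions."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("need at least one comparison")
    decided = wins + losses
    return {
        # Each tie counts as half a win for both sides.
        "ties_as_half": (wins + 0.5 * ties) / total,
        # Ties are dropped; only decided comparisons count.
        "ties_excluded": wins / decided if decided else None,
        # Ties count against the model (strict wins only).
        "ties_as_non_wins": wins / total,
    }


# Hypothetical counts, not taken from any published chart.
rates = win_rates(wins=60, losses=35, ties=5)
for convention, value in rates.items():
    print(f"{convention}: {value:.3f}")
```

With few ties the three numbers stay close together, so a low tie count mostly means the choice of convention matters less.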

[-]

Healthy-Nebula-3603@reddit

That model is DESIGNED for TRANSLATION only. Win rates are connected to translations.

[-]

dubesor86@reddit

Where did you get the info that it's designed for 'TRANSLATION only'? There are certainly no such claims in their announcements. Either way, that's not relevant for my testing, which tests all aspects of a model regardless of what it's 'DESIGNED for'.

[-]

Healthy-Nebula-3603@reddit

Did you even read the README on their Hugging Face page?

[-]

BlueSwordM@reddit

Oh, very nice. I'm honestly surprised Qwen 2.5 is "better" than Gemma2 in this regard since Gemma2 has been top tier in terms of language performance for me.

[-]

fungnoth@reddit

I'm also surprised. Maybe it's because the comparisons are mostly short conversations. In my experience using the q4, longer contexts make it randomly output Chinese, trying to translate the previous sentences.

[-]

appakaradi@reddit

Qwen is a tough opponent

[-]

AdSuperb3336@reddit

8k context length

[-]

isr_431@reddit

Has this not been out for a week already? I've been using it in Ollama since then. Gemma 2 9b is better at translation, but supports fewer languages. Qwen 2.5 is still the best for most Asian languages.

[-]

nodating@reddit

Yeah but to be fair it kinda went under the radar, understandable as its main strength is multilingual use and we all know that most Americans can barely handle English (no offense) and there are just better models out there already if all you care about is English use-cases.
