I can't find any LLM that is better than gemma-2-9b-it-SimPO...
Posted by pumukidelfuturo@reddit | LocalLLaMA | 39 comments
... that you can drive, in a reasonable manner, with 8 GB of VRAM.
I've tried a lot of the new toys and I always end up with the same result.
I hope somebody tries to replicate the style (no more GPT-isms please, enough is enough) and makes something better in the ballpark of 8 to 10 billion parameters that you can drive locally on the most humble (actually affordable) GPUs.
Or maybe we just need Gemma 3.
Everlier@reddit
Granted it's 9B, you can check all the recent LLMs in the same range, notably Qwen 2.5 and Llama 3.1.
I'd avoid SimPO and other preference-optimization tunes, tbh; they're just tuned more towards "likeable" output, not necessarily a good one.
recitegod@reddit
What would be your take for 12 GB of VRAM?
Everlier@reddit
Qwen 14B with a lower context, if you're OK with occasional overfitting and some Chinese characters.
recitegod@reddit
Thank you so much, Everlier.
BlueSwordM@reddit
Virtuoso Small, probably. It's a neat little finetune that seems to outperform stock Qwen 2.5 Instruct, apart from losing a bit of instruction-following performance.
BlueSwordM@reddit
Yeah. I've personally tried SPPO and SimPO finetunes, and only SPPO has come close to the original performance with Gemma 2.
Even then, there are many instances where the original instruct just performs better, if only by a bit.
SomeOddCodeGuy@reddit
I really enjoy the intelligence and tone of the Gemma models, but their context absolutely kills me. 16k is the sweet spot for me; I struggle with 8192 these days.
mayo551@reddit
Then use 16k context.
Gemma 2 can do around 16k-20k context with RoPE scaling and it works pretty well.
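For anyone who wants to try that, here's a minimal sketch with llama-cpp-python, assuming a local GGUF quant of Gemma 2 9B (the file name and the exact scaling factor are illustrative, not the commenter's actual setup):

```python
from llama_cpp import Llama

# Gemma 2 is trained with an 8k context window; linear RoPE scaling stretches it.
llm = Llama(
    model_path="gemma-2-9b-it-Q4_K_M.gguf",  # placeholder path to a local quant
    n_ctx=16384,          # ask for 16k context
    rope_freq_scale=0.5,  # compress positions by 2x: 8192 native -> ~16384 effective
)

out = llm("Summarize the following document:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```

Stretching much beyond ~2x tends to degrade quality, which lines up with the 16k-20k range mentioned above.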
ttkciar@reddit
Give https://huggingface.co/bartowski/Gemma-2-Ataraxy-9B-GGUF a whirl.
I can't say whether it's better than gemma-2-9b-it-SimPO but it blows Tiger-Gemma-9B out of the water.
Majestical-psyche@reddit
I tried it a while back for stories and RP... I don't like it personally; it leans way towards purple prose.
Felladrin@reddit
Check also the leaderboard from the [GPU Poor LLM Arena](https://huggingface.co/spaces/k-mktr/gpu-poor-llm-arena) to get insights on similar-sized models.
Healthy-Nebula-3603@reddit
I like that name ;)
Admirable-Star7088@reddit
I do hope I am wrong, but I feel very small models (7B-9B) have plateaued; they all feel roughly the same nowadays. It's not like a year ago when, for example, Mistral 0.1 7B was released and we were blown away by its performance.
The last time I was blown away recently was by Nemotron 70B (an improved model built on Llama 3.1 70B); I got that Mistral 0.1 7B feeling again. It appears larger models still have room for improvement, but not small models.
As I said, I could be wrong, and please prove me wrong if you can.
ArsNeph@reddit
I have to agree. The convergence due to synthetic data makes it all the worse, and there's not as much variety in finetunes anymore. This generation of models seems to be hitting its limit. I pray that the next generation, Llama 4, is a generational leap. Otherwise, we're going to need BitNet if we want to dream of cutting-edge performance.
Admirable-Star7088@reddit
The dream would be a major breakthrough in the hardware industry, where instead of hardware becoming a little bit more powerful each year, it becomes something like 1000 times more powerful in one go for about the same price. There has been talk of atomic and quantum computers; maybe those techs could make it happen someday.
ArsNeph@reddit
That's already happened, though not all at once: over a few years, our compute power has already gone up more than 10,000x. Yet we're still struggling with these models because of how inefficient Transformers are. I think another 1000x is possible, but we may have to switch hardware architectures, like moving to photonic computing or even analog computing. However, unless it's plug-and-play with existing computers, the adoption period will unfortunately be horrible. Quantum computers are unlikely to be used in households for a long time yet: their cooling requirements are ridiculous, and they don't have enough qubits to do anything particularly effective right now. Unfortunately, it's far more likely we'll just have to wait for a company to challenge Nvidia's monopoly and release low-cost, high-VRAM dedicated AI accelerator cards.
Admirable-Star7088@reddit
I have my fingers crossed that Intel Arc will manage to establish itself as a big player in the GPU market; that could help hardware progress a bit faster.
ArsNeph@reddit
They've definitely cemented themselves as a reasonable budget option, but I don't know that they're bringing anything unique to the table. What's to stop the 5060 from making their entire line obsolete? (Probably the rumored 8 GB of VRAM, honestly.) The only way I can see Intel really gaining a customer base there is if they make low-cost, high-VRAM cards; then people will buy them up like hotcakes.
Admirable-Star7088@reddit
The question is, is it really possible to produce a GPU with a lot of VRAM cost-effectively? If it is, shouldn't AMD have released a relatively cheap card with 32 GB of VRAM (or maybe even more) a long time ago, just to beat Nvidia?
ArsNeph@reddit
Oh, it's very possible. VRAM is actually much, much cheaper to produce than we think, and accounting for manufacturing at scale, it's a relatively cheap resource. It's simply that Nvidia is overcharging for it because they can; there are no other companies in the world that are really capable of making GPUs for enterprise use. As for why AMD doesn't do it: partly they haven't been known for great business decisions in terms of AI (ROCm is a great example), but as the other half of a duopoly, they also want to cash in on that enterprise money. Another thing is that Nvidia CEO Jensen Huang and AMD CEO Lisa Su are actually cousins, which opens the door for all sorts of backroom dealings. Intel has only just started in the GPU game and struggled to even produce a functioning product in its first generation.
Admirable-Star7088@reddit
The Nvidia and AMD CEOs being cousins is a very bad sign, lol. It's probably pretty inevitable, then, that there will be partnerships behind the scenes. Well, unless their relationship is frosty, of course; not all family ties are close.
In any case, we'll see, and hope that Intel at least releases some cheap, high-VRAM Arc cards in the near future.
ArsNeph@reddit
I think a little bit of cash can warm up any frosty relations, lol. I hope so too, but I don't expect much out of Intel.
Educational_Gap5867@reddit
You say that, and yet Qwen 2.5 Coder 7B was a sweet surprise.
Admirable-Star7088@reddit
For coding specifically, Qwen2.5 7B Coder was nice indeed!
Thebadwolf47@reddit
I've personally measured a 10-15% improvement on our private benchmark at my lab between Llama 3 and Llama 3.1.
And for the 70B version, 3.3 seems a lot better than 3.1, so if they applied the 3.3 recipe to the 8B models, don't you think we'd get a big boost?
Also, we've begun doing true RL (with verifiable rewards) on LLMs, so there are still big gains on the way for smaller models.
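For anyone unfamiliar, "verifiable rewards" just means the reward comes from an automatic check against a known answer rather than a learned reward model. A minimal sketch of the idea in Python (the answer format and exact-match check are illustrative assumptions, not any lab's actual pipeline):

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known-correct answer, else 0.0."""
    # Assumes the prompt asked the model to finish with a line like "Answer: <value>".
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# This scalar reward then drives a policy-gradient update (PPO, GRPO, etc.).
print(verifiable_reward("2 + 2 = 4, so... Answer: 4", "4"))          # 1.0
print(verifiable_reward("I think the answer is probably 5.", "4"))   # 0.0 (no parseable answer line)
```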
pumukidelfuturo@reddit (OP)
I don't think so. There's a lot of room for improvement; new optimizations come out every day. I think it's still way too early to talk about a plateau.
Of course, I could be wrong too.
Admirable-Star7088@reddit
Then, let's pray I am the one who is wrong here. 🙏
extopico@reddit
Qwen 2.5 7B performed better for my task. It follows instructions and outputs valid JSON when sufficiently constrained.
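For illustration, "sufficiently constrained" can be as simple as JSON mode in llama-cpp-python, which restricts sampling with a grammar so the output always parses; the model path, prompt, and stack here are assumptions, not necessarily what this commenter is running:

```python
from llama_cpp import Llama

# Placeholder path to a local Qwen 2.5 7B Instruct quant.
llm = Llama(model_path="qwen2.5-7b-instruct-q4_k_m.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Reply only with a JSON object containing 'name' and 'age'."},
        {"role": "user", "content": "Extract the fields from: Alice is 30 years old."},
    ],
    response_format={"type": "json_object"},  # grammar-constrained sampling -> valid JSON
    temperature=0.0,
)
print(response["choices"][0]["message"]["content"])
```

Stricter setups pass a JSON schema or a GBNF grammar so the keys and types are enforced as well.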
ThinkExtension2328@reddit
That's easy: Gemma 27B.
steezy13312@reddit
Just curious, have you tried Tulu 3? https://ollama.com/library/tulu3 I was enjoying that before doing a GPU upgrade and moving up a tier.
ghosted_2020@reddit
I'm pretty happy with Nemo. It's 12B, IIRC. I run the Q8 quant with a 6 GB GPU and offload some of it to RAM. It's slow compared to models that fit entirely on the GPU, but blazingly fast compared to running QwQ (which I run as well).
I'll try out Gemma though. Getting quite a collection of LLMs lol.
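For reference, partial offload like that usually just means capping how many layers go to the GPU; a rough sketch with llama-cpp-python, where the file name and layer count are guesses for a 6 GB card rather than this commenter's exact setup:

```python
from llama_cpp import Llama

# Mistral Nemo 12B at Q8_0 is roughly 13 GB, so only part of it fits in 6 GB of VRAM.
llm = Llama(
    model_path="Mistral-Nemo-Instruct-2407-Q8_0.gguf",  # placeholder path
    n_gpu_layers=15,  # send about a third of the layers to the GPU; the rest run from system RAM
    n_ctx=8192,
)

out = llm("Write a haiku about swap space.\n", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising n_gpu_layers until you run out of VRAM is the usual way to find the sweet spot.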
clduab11@reddit
OP, while I do semi-concur, there are plenty of more-than-decent models this applies to (I also have 8 GB of VRAM).
When I get back to my playground, I’ll screenshot for you and share.
bearbarebere@reddit
For what, though? If it's for RP, I have a whole post: https://www.reddit.com/r/LocalLLaMA/s/p8ybcIyyGG
Olangotang@reddit
I believe the strength of Gemma is that it has a lot of media data in its training mix.
s101c@reddit
What's the use-case? For RP purposes, Mistral Nemo is expected to be better. Or is that specific finetune of Gemma really that good?
mpasila@reddit
https://huggingface.co/inflatebot/MN-12B-Mag-Mell-R1 is pretty good, IMO, and you can run it on an 8 GB GPU at around IQ4_XS with 8k context just fine (well, barely).
masterid000@reddit
I think Qwen2.5 14B at Q4 could be better.
Healthy-Nebula-3603@reddit
Try the new LLM from LG, a 7.8B model... from the table it seems much better than Gemma 9B.
Uncle___Marty@reddit
I have high hopes for Gemma 3. The second version felt kind of "human" and very natural in its replies, while still being impressively good for its size.