Shit post: Why no new 20-35B models to keep feeding my addiction?!
Posted by ParaboloidalCrest@reddit | LocalLLaMA | View on Reddit | 76 comments
t(ಠ益ಠt)
Affectionate-Hat-536@reddit
I run small models with 2GB VRAM and still have fun interacting with them (phi3.5, gemma2). With 24GB, you are super rich 🤑.
ParaboloidalCrest@reddit (OP)
Well, folks with iGPU or just no computer at all would envy you :P. It's human nature.
Affectionate-Hat-536@reddit
🫡
Downtown-Case-1755@reddit
What's wrong with Qwen 2.5 and its finetunes? Command-R?
I've recently found these two lovely models for writing:
https://huggingface.co/nbeerbower/Qwen2.5-Gutenberg-Doppel-32B
https://huggingface.co/EVA-UNIT-01/EVA-Qwen2.5-32B-v0.2
And they're being updated!
ParaboloidalCrest@reddit (OP)
Nothing wrong with them, just want MOOORE
Zyj@reddit
Have you considered adding more 24GB GPUs?
Xandrmoro@reddit
And 24 more... And 24 more...
(the number of PCIe slots was not the only thing I looked at when upgrading the mobo, I swear)
ParaboloidalCrest@reddit (OP)
That's a slippery slope XD.
skrshawk@reddit
Feed your addiction by learning how to merge models. You might just make the next Midnight Miqu.
ParaboloidalCrest@reddit (OP)
Any examples of merges across different model families and sizes, e.g. Mistral Small 22B and Qwen 2.5 32B?
skrshawk@reddit
These are known as frankenmerges. I think Fimbu or SOLAR was such a model, but it's even more of a crapshoot than merging models with the same base.
bearbarebere@reddit
I’m guessing I can’t merge them on my hardware if I can only run Q4 quants of them? :(
skrshawk@reddit
Merging can be done on CPU. It's slow, like inference, but people generally care a lot less because you can just run it in the background.
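For anyone who wants to try this, here is a minimal sketch of a same-base slerp merge with mergekit, using the two Qwen2.5-32B finetunes linked above purely as stand-ins; the layer range and `t` value are illustrative, not anyone's actual recipe.

```python
# Hedged sketch: slerp-merging two same-base Qwen2.5-32B finetunes with
# mergekit (https://github.com/arcee-ai/mergekit). Layer range and t are
# illustrative -- check each model's config.json before running for real.
import subprocess
import textwrap

config = textwrap.dedent("""\
    slices:
      - sources:
          - model: nbeerbower/Qwen2.5-Gutenberg-Doppel-32B
            layer_range: [0, 64]   # Qwen2.5-32B has 64 transformer layers
          - model: EVA-UNIT-01/EVA-Qwen2.5-32B-v0.2
            layer_range: [0, 64]
    merge_method: slerp
    base_model: nbeerbower/Qwen2.5-Gutenberg-Doppel-32B
    parameters:
      t: 0.5                       # 0 = all first model, 1 = all second
    dtype: bfloat16
""")

with open("merge-config.yml", "w") as f:
    f.write(config)

# Runs on CPU by default -- slow, but fine to leave in the background.
subprocess.run(["mergekit-yaml", "merge-config.yml", "./merged-qwen32b"], check=True)
```

The output directory is a normal HF-format model, which you can then quantize to GGUF as usual.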
rm-rf-rm@reddit
We just got Qwen2.5-Coder 32B a few days ago... GPT-4o-level performance.
Kep0a@reddit
Idk, the people on reddit who actually use it seemed less than thrilled.
rm-rf-rm@reddit
I've been using it, and though it hasn't shown itself to be better than Qwen2.5 yet in the few days I've used it, I've also not been let down so far.
Downtown-Case-1755@reddit
It's easy to hold local models wrong with a bad quant, settings, messed up sampling... Getting peak performance is very finicky.
MoffKalast@reddit
It's been days!
We must starve...
VongolaJuudaimeHime@reddit
Have you already tried the recent updates from Drummer? I saw that he has several new models within this range.
https://huggingface.co/TheDrummer
w4ldfee@reddit
and parasiticrogue just released a merge of those two and it's awesome :)
https://huggingface.co/ParasiticRogue/EVA-Instruct-32B-v2
Downtown-Case-1755@reddit
Indeed!
Any merge with Qwen Instruct will suffer past 32K, though. But the Gutenberg finetuner said they'll try a finetune on top of EVA instead, which I'm very excited about, as it should retain the 64K performance:
https://huggingface.co/nbeerbower/Qwen2.5-Gutenberg-Doppel-32B/discussions/1
And if it doesn't, a light slerp merge with Qwen base should bring it back.
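For reference, the "slerp" in these recipes is just spherical linear interpolation applied tensor by tensor. This is a toy sketch of the math, not mergekit's exact implementation, which also handles per-layer `t` schedules:

```python
# Toy sketch of slerp (spherical linear interpolation) on two weight tensors.
# Real merges do this per tensor across the whole model; this is just the math.
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    a_dir = a / (np.linalg.norm(a) + eps)
    b_dir = b / (np.linalg.norm(b) + eps)
    dot = np.clip(np.dot(a_dir, b_dir), -1.0, 1.0)
    if abs(dot) > 0.9995:                 # nearly parallel: fall back to plain lerp
        return (1.0 - t) * a + t * b
    omega = np.arccos(dot)                # angle between the two weight vectors
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

# t closer to 0 keeps the merge nearer the first model (e.g. the long-context base).
merged = slerp(np.random.randn(4096), np.random.randn(4096), t=0.25)
print(merged.shape)
```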
Jesus359@reddit
What's Command-R good for? I've never heard of that one.
Downtown-Case-1755@reddit
It's a RAG-focused model, and I find it's great for tasks that are "in the context," like summarization, continuing a long story, stuff like that, as opposed to recalling knowledge and data from the LLM itself.
The original Command-R was widely regarded for its storywriting prose, but consensus on the new model is more mixed. However, it's still a great model for 24GB cards, with 64K+ context at 4bpw being easily doable.
gmatheu@reddit
I bought a 24GB RTX 3090 last week for this same reason...
and now I can feel the card staring at me and saying "that's all you had???" (of course with only 1 fan spinning)
ParaboloidalCrest@reddit (OP)
Exactly! Gotta put this thing to good use, 'cause the idle GPU is the devil's workshop (e.g. video games).
Hunting-Succcubus@reddit
Because the world doesn't revolve around 4090s.
ctrl-brk@reddit
Look at this from yesterday:
https://www.reddit.com/r/LocalLLaMA/s/bp9JpG3MmB
ParaboloidalCrest@reddit (OP)
While I am addicted, I'm not that addicted ;).
VulpineKitsune@reddit
What do you mean?
ParaboloidalCrest@reddit (OP)
I mean I don't try finetunes except the more reputable ones such as Hermes. There's just tooo many of them.
Ravenpest@reddit
That's the literal guy who made Koboldcpp, what's more reputable than that
Swashybuckz@reddit
OCD.
2rememberyou@reddit
Do you have it in IV?
Healthy-Nebula-3603@reddit
Recently you got Qwen 32B Instruct, Aya 27B, and Qwen 32B Coder Instruct (current SOTA) in those sizes.
ttkciar@reddit
I've been really impressed with Qwen2.5-32B-AGI too, despite the cringe name.
Weary_Long3409@reddit
Been there through the back and forth of 8B, 70B, and GPT-4o... until Qwen 2.5 32B Instruct came to life. This model is like a weight off my shoulders.
matt23458798@reddit
Sadly there are only 1-3B or 70B+ models now.
Deep_Fried_Aura@reddit
Feed your addiction the way I do!
I'm always on the hunt for 8B-14B models that have an FP16/F16 variant.
I get so much more out of those than I do the regular quantized 8B-14B releases. Llama models in F16?? Chef's kiss.
MoffKalast@reddit
Anecdotal, for llama-3.x-8B I feel like fp16 model + 4 bit cache gives pretty great results. It's so overtrained that even Q8 slightly lobotomizes it lol.
ParaboloidalCrest@reddit (OP)
I use as high as Q6K for models around Mistral Small's size, but hardly feel any need to go for bigger quants or fp16.
Deep_Fried_Aura@reddit
In my PERSONAL experience, the quality of long context outputs is improved, but there's no scientific data to back up that claim, it's purely based on my usage, prompting, and frameworks for inference.
ParaboloidalCrest@reddit (OP)
It's worth trying. Thanks!
Deep_Fried_Aura@reddit
Remember too that with Ollama we can use the models on their website, plus we can create our own Modelfile using a .gguf from Hugging Face.
Happy hunting!
ParaboloidalCrest@reddit (OP)
Actually you can `ollama pull` HF models directly now: https://huggingface.co/docs/hub/en/ollama
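If you'd rather script it, the official `ollama` Python client can pull those hf.co GGUF repos the same way. The repo and quant tag below are just an example, swap in whatever you actually run:

```python
# Hedged sketch using the `ollama` Python client (pip install ollama).
# Model string format per the HF docs above: hf.co/{user}/{repo}:{quant}.
# The specific repo/quant here is only an example.
import ollama

model = "hf.co/bartowski/Mistral-Small-Instruct-2409-GGUF:Q4_K_M"
ollama.pull(model)   # equivalent to `ollama pull <model>` on the CLI

reply = ollama.chat(
    model=model,
    messages=[{"role": "user", "content": "Give me one fun fact about slerp merges."}],
)
print(reply["message"]["content"])
```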
Deep_Fried_Aura@reddit
Even though Nicolas Cage hasn't tried to kidnap you, YOU ARE A NATIONAL TREASURE!!
It's been a while since I looked over Ollama updates/Docs since I'm used to running things "as they are" but this is the best tidbit of info anybody has put me on to today. I came here thinking I'd share with you but you took my $1 and gave me back $100 🫡
farmasek@reddit
What do you want to use it for, that it makes you crave it so much? :D
ParaboloidalCrest@reddit (OP)
Mixture of a Million Agents! And I'm the aggregator.
VongolaJuudaimeHime@reddit
Wait... 24 GB still can't fit 70B models? ;____; I'm doomed... So do we actually need 3 12 GB GPUs then?
ArsNeph@reddit
Not 3x 12GB, 2x 24GB. 2 used 3090s at $600 each = $1200.
ASYMT0TIC@reddit
For Q4, but honestly you probably want Q4_K_M and a bit of room for context. I'd go with 48GB for 70B.
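Rough back-of-the-envelope math behind those numbers; the effective bits-per-weight figures are approximate, and real GGUF sizes vary by quant mix:

```python
# Back-of-the-envelope weight sizes for a 70B model at common quant levels.
# Bits-per-weight values are rough averages; KV cache and overhead come on top.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # GB: 1e9 params * bits / 8 / 1e9

for name, bpw in [("IQ2_XS", 2.3), ("Q4_K_M", 4.8), ("Q6_K", 6.6), ("FP16", 16.0)]:
    print(f"70B @ {name:7s} ~ {weights_gb(70, bpw):6.1f} GB of weights")

# IQ2_XS (~20 GB) just squeezes into a single 24 GB card; Q4_K_M (~42 GB)
# is why 2 x 24 GB is the usual 70B setup.
```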
VongolaJuudaimeHime@reddit
I see, thank you!
Maaaan, that's a lot of GPUs... .___.
SPACE_ICE@reddit
Plot twist: you can get a used crypto mining rig for cheap (usually just the case at that point) and buy loads of cheaper GPUs, like the 12GB 3060 (roughly just under $300 on Amazon); get 4 of those and you'll have the same VRAM as two 4090s. Main issue is the PCIe bandwidth per card is lower, so it will lower tokens per second as well.
Ulterior-Motive_@reddit
Some IQ2 quants just barely fit, and are surprisingly helpful, but still a huge downgrade from IQ4 or better ones.
LoafyLemon@reddit
IQ2 and lower quants are unsuitable for anything in my tests, including coding, trivia, or even RP. I have no idea what people use such low quants for, and at this point I'm afraid to ask. Lol
VongolaJuudaimeHime@reddit
Can you please enumerate specific models that still work okay in that quant? Gotta take note for future reference. Thank you so much!
Ulterior-Motive_@reddit
I know you can get Llama3.1 70B small enough with an IQ2_XS quant, I used that before, but I think even Qwen2.5 72B is too large.
Admirable-Star7088@reddit
This is how addiction works.
Your withdrawal level rises and you crave a new 30B model more and more. The momentary high happens when you finally download one, and it feels like heaven as it answers your hypothetical questions a bit better. Next thing you know, you're already itching for a newer, updated version.
This is just like people who suffer from food addiction and obesity. They can't wait for their big breakfast in the morning, and at the moment they have finished eating, they can't wait for lunch, and so on.
ParaboloidalCrest@reddit (OP)
Guilty.
Jesus359@reddit
I mean at that point just shell out the $20/mo to help your local one and make it multi-model. GPT only to assist your local when it's struggling.
No one is ever going to reach what a company can collectively put together, monetarily and infrastructure-wise. Lol.
No_Afternoon_4260@reddit
Have you tried SuperNova-Medius?
ParaboloidalCrest@reddit (OP)
Stumbled upon it while looking up "merging" as some suggested. Will try it.
jackcloudman@reddit
damn, "The more you buy, the more you save"
shroddy@reddit
Time for a new addiction. How about image gen? There's Stable Diffusion, Pony, Flux, AuraFlow, video generation with CogVideoX... just to name a few. And thousands of LoRAs, finetunes, merges...
dmitryplyaskin@reddit
Because 20–35B models feel like something in between 8–12B and 70B, and they don’t seem to make much sense. You still can’t run them on weak hardware without quantization, but their performance boost isn’t as significant as with 70B+ models.
ParaboloidalCrest@reddit (OP)
Q4_K* quants of 35B models run fine on 24GB VRAM.
Single_Ring4886@reddit
What are your most used models in this size today? Everyone is praising Qwen, and then I suppose Nemo from Mistral?
ParaboloidalCrest@reddit (OP)
Qwen 2.5 32B, Mistral Small, Gemma 2 27B, and Command-R 35B.
Single_Ring4886@reddit
Haven't tried Command-R; in what area is it best? Sorry for all the questions :)
ParaboloidalCrest@reddit (OP)
There's no rhyme or reason on which of those is best, and that's why I use all of them ¯\_(ツ)_/¯.
dmitryplyaskin@reddit
You can quantize a 70B model down to Q2 and fit it into 24GB, but those who train these models clearly don’t think about how users will quantize them. They primarily focus on their own hardware and not on subsequent quantization. In theory, a 32B model can fit into one or two A100 GPUs with full context.
Spirited_Example_341@reddit
MORE! - Kylo Ren on LLM models
Won3wan32@reddit
learn to merge
ParaboloidalCrest@reddit (OP)
Teach me!
2rememberyou@reddit
Which model would be the best for a Home Assistant LLM on a 4090? If anyone has a recommendation for a resource that would help someone starting out with local LLM for this purpose I would be grateful.
bigattichouse@reddit
This guy has a pretty good video:
https://www.youtube.com/watch?v=XvbVePuP7NY
jacek2023@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1gai2ol/list_of_models_to_use_on_single_3090_or_4090/