TheaterFire

Drummer's Cydonia 24B v3 - A Mistral 24B 2503 finetune!

Posted by TheLocalDrummer@reddit | LocalLLaMA | View on Reddit | 31 comments

Survey Time: I'm working on Skyfall v3 but need opinions on the upscale size. 31B sounds comfy for a 24GB setup? Do you have an upper/lower bound in mind for that range?

31 Comments

gcavalcante8808@reddit

In my experience, 22B/24B models are the ones that have worked well on my 7900 XTX card.
View on Reddit #58119509

RedditSucksMintyBall@reddit

Do you overclock your card for LLM stuff? I recently got the same one.
View on Reddit #58134610

gcavalcante8808@reddit

Nope, I use the default clock
View on Reddit #58726625

RottenPingu1@reddit

Curious for any pointers in using this card as mine shows up this week...
View on Reddit #58176544

NimbzxAkali@reddit

27B with Q5\_K\_L bartowski quants is the sweet spot for me for \~16k context, with some headroom for more context if needed. 31B should fill that headroom, but might be reasonable. I just don't like to let too many layers or too much context bleed into my slow DDR4 RAM, I guess. System: 24GB VRAM + 64GB DDR4 RAM
View on Reddit #58383718

Mr_Moonsilver@reddit

For the uninitiated, what is this?
View on Reddit #58178217

logseventyseven@reddit

their previous models are very popular for RP and writing
View on Reddit #58263016

Glittering-Bag-4662@reddit

31B is fine for me
View on Reddit #58248116

whiskers_z@reddit

Any notes on how this differs from v2.1? Granted I'm all the way down at Q2, but while this was still impressive on my initial test, v2.1 was a freaking magic trick.
View on Reddit #58200357

paranoidray@reddit

I love 24b models, 22b would be even better I think for some room to spare.
View on Reddit #58196507

Iory1998@reddit

I have an RTX 3090, and in my opinion, I'd rather have a model at Q6 with a large context size than a Q4 with a limited context. Also, I am not sure if upscaling a 24B model would do it any good. If it did, don't you think the labs that created those models would already be doing that?
View on Reddit #58138927

Phocks7@reddit

In my experience, lower quants of higher-parameter models perform better than higher quants of lower-parameter models, e.g. Q4 123B > Q6 70B.
View on Reddit #58160833

blahblahsnahdah@reddit

Agreed. It's not a small difference either: even a Q3 of a huge model will blow away a Q8 of a smaller model with the same weights file size when it comes to commonsense reasoning (I make no claims about benchmark scores).
View on Reddit #58162221
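To put rough numbers on the "same file size" comparison above, here is a minimal sketch. The bits-per-weight figures are approximate nominal values assumed for illustration (real GGUF files vary with the tensor mix), and the helper names are hypothetical.

```python
# Approximate bits per weight for a few GGUF quant types.
# These are assumptions for this sketch; actual files differ slightly.
BITS_PER_WEIGHT = {"Q3_K_S": 3.44, "Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def file_size_gb(params_billion, quant):
    """Approximate weights size in GB (10^9 bytes) for a given quant."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def equal_size_params(params_billion, from_quant, to_quant):
    """Parameter count (in billions) that takes the same space at to_quant."""
    return params_billion * BITS_PER_WEIGHT[from_quant] / BITS_PER_WEIGHT[to_quant]


if __name__ == "__main__":
    # The two models from the example earlier in the thread.
    print(f"123B @ Q4_K_M: {file_size_gb(123, 'Q4_K_M'):.1f} GB")
    print(f" 70B @ Q6_K:   {file_size_gb(70, 'Q6_K'):.1f} GB")
    # How large a model fits in the footprint of a 70B Q8_0 when stored at Q3_K_S?
    print(f"70B Q8_0 footprint at Q3_K_S: {equal_size_params(70, 'Q8_0', 'Q3_K_S'):.0f}B params")
```

Under these assumed figures, the Q4 123B vs Q6 70B example is not an equal-size comparison (the 123B file is noticeably larger), so it is worth checking footprints before comparing quality.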

AppearanceHeavy6724@reddit

Not sure about that. Qwen 2.5 Instruct 32B at IQ3_XS completely fell apart in fiction compared to the 14B at Q4_K_M. The latter sucked too, as Qwen 2.5 is unusable for creative writing anyway.
View on Reddit #58171216

blahblahsnahdah@reddit

32B isn't huge! We're talking about 100B plus. Yeah, small models have unusable brain damage at low quants.
View on Reddit #58174033

TheRealMasonMac@reddit

[https://github.com/QwenLM/ParScale](https://github.com/QwenLM/ParScale) is probably more interesting
View on Reddit #58155269

SomeoneSimple@reddit

> Also, I am not sure if upscaling a 24B model would do it any good. If it did, don't you think the labs that created those models would already be doing that?

My thoughts as well. I mean, the only guys that are making bank off LLMs are doing [the exact opposite](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1).
View on Reddit #58150578

_Cromwell_@reddit

In GGUFs, what are the ones that are _NL for? Or what do they do differently than the normal imatrix quants?
View on Reddit #58135594

toomuchtatose@reddit

For ARM devices, the inference speeds could be 1.5x to 8x faster.
View on Reddit #58155926

SkyFeistyLlama8@reddit

Use the IQ4_NL or Q4_0 GGUF files if you're running on ARM CPUs like Snapdragon X or Ampere. I prefer Q4_0 for Snapdragon X because the Adreno OpenCL backend also supports this format, so you can get fast inference on both CPU and GPU backends.
View on Reddit #58170459

_Cromwell_@reddit

Ahhh... okay. So it's for ARM. thanks
View on Reddit #58157203

Quazar386@reddit

The main thing about the IQ4\_NL quant, from what I can understand, is that it uses a non-linear quantization technique with a non-uniform codebook designed to better match LLM weight distributions. For practical use, though, most people pick IQ4\_XS, which has very similar (within margin of error) KL divergence to IQ4\_NL with better space savings, or Q4\_K\_S for overall faster speeds. So IQ4\_NL does not really have much of a place in practice, as other quants have either better space savings or faster speeds with similar KL divergence.
View on Reddit #58151558
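A toy sketch of what "non-linear quantization with a non-uniform codebook" means in the comment above. It maps each weight in a block to the nearest of 16 codebook entries, once with evenly spaced entries and once with entries packed more densely near zero. Both codebooks, the per-block scale rule, and all function names are assumptions made for illustration; this is not llama.cpp's actual IQ4\_NL implementation, which searches for a better scale.

```python
import random

# Two illustrative 16-entry (4-bit) codebooks on an int8-like range.
# UNIFORM is evenly spaced; NONLINEAR is denser near zero, where most
# weights of a roughly bell-shaped distribution fall. Both are assumptions.
UNIFORM = [-127 + round(i * 254 / 15) for i in range(16)]
NONLINEAR = [-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113]


def quantize_block(weights, codebook):
    """Return (scale, indices); each weight maps to its nearest codebook entry."""
    w_max = max(weights, key=abs)
    # Scale so the largest-magnitude weight lands exactly on codebook[0].
    scale = w_max / codebook[0] if w_max != 0 else 1.0
    indices = [
        min(range(len(codebook)), key=lambda i: abs(w / scale - codebook[i]))
        for w in weights
    ]
    return scale, indices


def dequantize_block(scale, indices, codebook):
    """Reconstruct approximate weights from the scale and 4-bit indices."""
    return [scale * codebook[i] for i in indices]


def rms_error(weights, codebook):
    """Root-mean-square reconstruction error for one block."""
    scale, indices = quantize_block(weights, codebook)
    restored = dequantize_block(scale, indices, codebook)
    return (sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)) ** 0.5


if __name__ == "__main__":
    random.seed(0)
    block = [random.gauss(0.0, 0.02) for _ in range(32)]  # synthetic weights
    print("uniform   RMS error:", rms_error(block, UNIFORM))
    print("nonlinear RMS error:", rms_error(block, NONLINEAR))
```

Both variants store 4 bits per weight plus one scale per block, so any quality difference comes only from where the 16 levels are placed.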

_Cromwell_@reddit

Thanks. Almost seems like there's too many options because people can't decide what's best. :) Or there's still debate on what's best. So people who prep these things just prep everything for everybody I guess, to avoid complaints they left something out.
View on Reddit #58151754

SkyFeistyLlama8@reddit

I just wanna know how this would compare to Valkyrie Nemotron 49B. That's a sweet model but it's huge.
View on Reddit #58123819

-Ellary-@reddit

Well, just download it, run it, test it, sniff it, rub it. What's the point of listening to random people? What if I say that it is better than Valkyrie, on my own specific nya cat girl test?
View on Reddit #58130393

Abandoned_Brain@reddit

The problem some people have is that their ISP (at least in the US) will have bandwidth caps of some type in place. Grabbing an 18GB model sight unseen can kill most hotspots' bandwidth for the month. (That's a problem with Hugging Face: fewer than about 1/4 of the models have cards that actually detail what the models are recommended for.)

I agree somewhat with you. It's a great time to be an AI hobbyist because you can download a different AI "brain" full of knowledge and personality every 5 minutes if you want to, but doing that causes other issues downstream for people. I had to block my model folder in my backup apps because they were constantly copying these new models to the cloud. My storage started costing me a lot more than in previous months, which took a bit for me to figure out. :)

BTW, where's your nya cat girl test? I'd be interested in trying it myself... :D
View on Reddit #58151208

IrisColt@reddit

Heh!
View on Reddit #58131787

MidAirRunner@reddit

Have you used it? How good is it?
View on Reddit #58126043

RickyRickC137@reddit

What are the recommended temperature and other parameters?
View on Reddit #58125019

Echo9Zulu-@reddit

Thanks for your work! So if we throw away questions about inference capability and just look at what benefits higher parameter counts provide, what do you think we gain from having more in this case?
View on Reddit #58123329

LagOps91@reddit

31b sounds good for 24gb assuming context isn't too heavy. I would want to run either 16k or preferably 32k context without quanting context (for some reason quanting context is really slow for me).
View on Reddit #58120024
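For the "does 31B fit in 24GB at 16k or 32k context" question that runs through this thread, a back-of-the-envelope sketch follows. Everything numeric here is an assumption: the bits-per-weight values are approximate, the 1.5 GiB overhead is a guess for compute buffers, and the layer count for a 31B upscale is hypothetical since no such model's architecture is given in the thread.

```python
GIB = 1024 ** 3

# Approximate bits per weight; assumed values, real GGUF files vary.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gib(params_billion, quant):
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    """KV cache size in GiB; bytes_per_elem=2 means an unquantized fp16 cache."""
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / GIB


def fits(params_billion, quant, context, vram_gib=24.0, overhead_gib=1.5, **arch):
    """Return (total_gib, fits_in_vram) for weights + KV cache + assumed overhead."""
    total = (
        weights_gib(params_billion, quant)
        + kv_cache_gib(context=context, **arch)
        + overhead_gib
    )
    return total, total <= vram_gib


if __name__ == "__main__":
    # Hypothetical architecture for a 31B upscale (not a published config).
    arch = dict(n_layers=52, n_kv_heads=8, head_dim=128)
    for ctx in (16384, 32768):
        total, ok = fits(31, "Q4_K_M", ctx, **arch)
        print(f"31B Q4_K_M @ {ctx} ctx: {total:.1f} GiB, fits in 24 GiB: {ok}")
```

Swapping in the real layer count, KV head count, and head dimension from a model's config gives a quick sanity check before downloading, though actual usage depends on the backend and its buffers.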