Drummer's Skyfall 31B v4.2 aka SKYFALL-31B-V4.2-UNCENSORED-OPUS-4.6-ROLEPLAYING-100000X-XTREME-VALUE
Posted by TheLocalDrummer@reddit | LocalLLaMA | 28 comments
Yes, Google stole my proprietary model size (31B). Yes, I plan to tune all the Gemma 4 models. Join us, and support the mission! Thank you all for the love <3
Specter_Origin@reddit
can someone make a finetune of 26b-b4a which is better at function calling for opencode and cline? it seems to fall flat over time on complex write calls xD
Specter_Origin@reddit
Btw, it was a parser issue in llama.cpp which has been fixed in a new release, so if you're experiencing it, please update your llama.cpp.
LoveMind_AI@reddit
Joining the choir of people who very much want a Drummer Gemma 4!
Nrgte@reddit
Okay what's the difference compared to 4.1?
Internet-Buddha@reddit
How does this compare to Magidonia, which is one of my favorite models?
rc_ym@reddit
It's worth a try. I like it. It seems to have richer language, but with a very, very slight increase in non-sequiturs and impossiblisms. Very similar performance even with the larger size.
Sirosky@reddit
It's an upscale of Mistral Small, so it'll be better just by virtue of being larger. But in general, this model is exceptional, even by upscale standards.
AnonLlamaThrowaway@reddit
Can you explain what an "upscale" is vs. a regular finetune?
ttkciar@reddit
It works by using Goddard's mergekit (or equivalent tooling) to make something called a "passthrough self-merge": a new model is assembled from the first two-thirds of the original's layers and the last two-thirds of its layers, appended to each other (or thereabouts; it usually takes some trial and error to find the right cut-offs).
This results in a model about 30% larger, because the middle third of its layers appears twice. That has two effects (see the config sketch below):
- Heuristics (generalized knowledge) encoded in those middle layers get applied twice, so they show up more strongly in the inference result.
- The duplication adds redundancy to the model's parameters, so that further training is less likely to obliterate something important (what the field calls "catastrophic forgetting"). The optimizer (AdamW or whatever) can repurpose some of the duplicated parameters to encode new heuristics without losing the old ones.
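For the curious, here's roughly what a passthrough self-merge config looks like in mergekit. The base model and layer cut-offs below are illustrative (a hypothetical 40-layer base), not Drummer's actual recipe:

```yaml
# Sketch of a passthrough self-merge (upscale) config for mergekit.
# Base model and cut-offs are illustrative; finding good cut-offs
# usually takes trial and error.
merge_method: passthrough
dtype: bfloat16
slices:
  - sources:
      - model: mistralai/Mistral-Small-24B-Instruct-2501
        layer_range: [0, 27]    # first ~two-thirds of the 40-layer stack
  - sources:
      - model: mistralai/Mistral-Small-24B-Instruct-2501
        layer_range: [13, 40]   # last ~two-thirds; layers 13-26 get duplicated
```

Run it with `mergekit-yaml config.yaml ./upscaled-model` and you get a 54-layer model out of a 40-layer base, about a third more transformer layers, which is where the "about 30% larger" figure comes from.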
The theory of why this works is still very much under development, but David Ng has been developing what he calls RYS theory, which describes part of it. Look him up if you want to learn more about it.
Chief_Broseph@reddit
Something similar to the RYS method? Been waiting for a good RP finetune of that one.
ttkciar@reddit
Yes, you will notice I mention RYS theory in the last paragraph.
Sirosky@reddit
My layman's understanding is that additional layers were added on top of the model before tuning, resulting in a fatter but (hopefully) superior finetune. All the Skyfall models back to v1 are upscales of Mistral Small and its derivatives.
Folks on the Discord server did a blind test of Skyfall v2 vs. the same-generation Cydonia, and the preference was overwhelmingly for Skyfall, so it seems like upscaling does work, even if it comes at a cost in VRAM requirements and speed.
freia_pr_fr@reddit
Should we start an r/locallamacirclejerk?
TheLocalDrummer@reddit (OP)
It exists. You just need to add one more L
r/localllamacirclejerk
evenyourcopdad@reddit
be the change you want to see in the world
unless you're not going to do it right. then just leave it for someone that will.
DragonfruitIll660@reddit
Crazy name lmao, ty for all the finetunes.
Altruistic_Heat_9531@reddit
not even a week https://www.reddit.com/r/LocalLLaMA/comments/1salgre/comment/odwsxjg/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
mycall@reddit
Please fix your link
-dysangel-@reddit
I'm waiting for v2
9r4n4y@reddit
yeah really crazy name
fractalcrust@reddit
damn i'm still on SKYFALL-31B-V4.2-UNCENSORED-OPUS-4.6-ROLEPLAYING-10000 i really need to upgrade
Hoppss@reddit
It's a free model for ya Jim!
seamonn@reddit
Time for Big Tiger Gemma 4 :D
MSXzigerzh0@reddit
You should sue
Sirosky@reddit
As the name suggests, this model is peak (I'd been testing it for about a month before the official release).
ttkciar@reddit
That's great to hear! :-) Will there be a Big Tiger anti-sycophancy finetune? Big-Tiger-Gemma-27B-v3 has been a serious workhorse!
Thanks for all you do! Waiting on the edge of my seat
jacek2023@reddit
Waiting for Drumminggemmas
LegacyRemaster@reddit
The name I needed