New 24B finetune: Impish_Magic_24B
Posted by Sicarius_The_First@reddit | LocalLLaMA | View on Reddit | 28 comments
It's the 20th of June, 2025, and the world is getting more and more chaotic, but let's look at the bright side: Mistral released a new model at a very good size of 24B. No more "sign here" or "accept this weird EULA"; just a proper Apache 2.0 license. Nice! 👍🏻
This model is based on mistralai/Magistral-Small-2506, so naturally I named it Impish_Magic. Truly excellent size: I tested it on my laptop (16GB GPU, a 4090m) and it works quite well.
Strong in both productivity and fun. Good for creative writing and writer-style emulation.
New unique data, see details in the model card:
https://huggingface.co/SicariusSicariiStuff/Impish_Magic_24B
The model will be on Horde at very high availability for the next few hours, so give it a try!
NoobMLDude@reddit
Interesting.
You mention this in model card: “This model went "full" fine-tune over 100m unique tokens. Why do I say "full"?
I've tuned specific areas in the model to attempt to change the vocabulary usage, while keeping as much intelligence as possible. So this is definitely not a LoRA, but also not exactly a proper full finetune, but rather something in-between.”
Could you please explain the fine-tuning technique? Is it training different LoRAs on different model layers and merging them? Some technical details would be helpful for understanding what was done. Thanks
vasileer@reddit
Probably it went through a full epoch, which is ~128 steps.
Sicarius_The_First@reddit (OP)
w-what? 🤨
vasileer@reddit
For everyone downvoting my comment:
An “epoch” is one full pass through your training dataset. The number of optimization steps in one epoch is simply:
steps_per_epoch = ⌈ dataset_size / batch_size ⌉
where dataset_size is the number of training examples and batch_size is the per-device batch size.
If you're using gradient accumulation over N mini-batches to form an effective batch, then:
steps_per_epoch = ⌈ dataset_size / (batch_size × N) ⌉
For example, 100,000 examples with a per-device batch size of 32 (and no accumulation) gives 100,000 / 32 = 3,125 steps per epoch.
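In code, the same arithmetic looks like this (a minimal sketch; the helper name is mine, not from any particular training framework):

import math

def steps_per_epoch(dataset_size: int, batch_size: int, grad_accum: int = 1) -> int:
    # One optimizer step per effective batch; a final partial batch still counts as a step.
    return math.ceil(dataset_size / (batch_size * grad_accum))

print(steps_per_epoch(100_000, 32))                 # 3125, the example above
print(steps_per_epoch(100_000, 32, grad_accum=4))   # 782, with 4-step gradient accumulation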
Sicarius_The_First@reddit (OP)
I think you might be mixing things up; a full fine-tune, in the context of comparing it to a LoRA, has nothing to do with datasets but with the depth of training.
LoRA only trains a limited depth (R = X) while FFT trains everything. Spectrum (as mentioned before) trains fully, at full depth, just like a full fine-tune, but you can be selective about the projection layers you tune.
Sicarius_The_First@reddit (OP)
I'll also add that while LoRA can also be selective about the projection layers and depth it tunes, it lacks the granularity of Spectrum (at least in the "vanilla", "naive" LoRA implementation).
Sicarius_The_First@reddit (OP)
To be even more specific, because I got these questions in my DMs as well: with LoRA you can be selective, but only at the module level, like this:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
But with spectrum, you can be extremely granular like this:
# self_attn.o_proj layers
#- model.layers.22.self_attn.o_proj
- model.layers.23.self_attn.o_proj
#- model.layers.24.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.13.self_attn.q_proj
- model.layers.14.self_attn.q_proj
#- model.layers.15.self_attn.q_proj
#- model.layers.16.self_attn.q_proj
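Mechanically, that kind of per-layer, full-rank selection boils down to freezing everything and then unfreezing only the listed parameters by name. A minimal PyTorch sketch of the principle (my own illustration, not Spectrum's actual code; the layer names are just the ones from the example above, and loading the base model this way is an assumption):

import torch
from transformers import AutoModelForCausalLM

# Layers to keep trainable, copied from the example list above; everything else stays frozen.
UNFROZEN_PREFIXES = [
    "model.layers.23.self_attn.o_proj",
    "model.layers.13.self_attn.q_proj",
    "model.layers.14.self_attn.q_proj",
]

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Magistral-Small-2506", torch_dtype=torch.bfloat16
)

# Freeze every parameter, then unfreeze only those whose name starts with an allowed prefix.
for name, param in model.named_parameters():
    param.requires_grad = any(name.startswith(p) for p in UNFROZEN_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")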
NoobMLDude@reddit
Thanks for sharing your fine tuning approach.
I have 2 follow-up questions:
1. How do you select which layers to fine-tune, and to what depth? What's the intuition for it?
2. I saw that the dataset comprises conversations from 4chan with negative and natural human interactions.
Sicarius_The_First@reddit (OP)
superb questions, rarely seen on reddit
Some of it is intuition; LLM training is far from an exact science, but you also have SNR scanning based on Spectrum (look it up on GitHub).
Indeed, normally 1k entries would overfit; however, because the data is authentic human data of the most chaotic kind (4chan), it cannot and will not overfit. In addition, this is a small part of a more generalized dataset; the key is mixing typical assistant data with different niches.
TheApadayo@reddit
I've messed around with this. You can do a full fine-tune of some blocks and LoRA other blocks. This makes it sound like the embedding blocks were trained normally while the rest were frozen and trained using LoRA, so the model more reliably recognizes new token IDs while reducing the effect the fine-tune has on the base model's performance.
Sicarius_The_First@reddit (OP)
Yeah there are more than 2 ways to skin a cat, so to speak :)
While LoRA lets you fine-tune at an arbitrary depth (R = X), you can also do a full-depth tune on specific projection layers (with Spectrum, as I mention in another comment).
There are myriad ways to tune models today; we live in abundance, thankfully :)
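To make the contrast concrete, here is a minimal sketch of the LoRA side using Hugging Face PEFT (my own illustration, not the recipe used for this model): r caps the depth of the update, and target_modules picks projection types, which in the vanilla setup are adapted uniformly across all layers.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Magistral-Small-2506")

# r bounds the rank ("depth") of the update; target_modules selects which projection
# types get adapters, applied to every layer in the vanilla setup.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()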
Sicarius_The_First@reddit (OP)
I've used spectrum.
AvaritiaGula@reddit
Wow, this model is quite good at story writing. The previous Mistral 24B was very dry, but the new model doesn't have such issues.
Sicarius_The_First@reddit (OP)
Glad to hear it. Indeed, there was a lot of interesting creative data, and the model surprises even me, especially with its ability to handle a complex Adventure format. It's even able to track items very well for its size.
I'll attach some examples to the model card under:
https://huggingface.co/SicariusSicariiStuff/Impish_Magic_24B/tree/main/Images/Adventure
Zestyclose_Yak_3174@reddit
You're a legend man! Loved your Negative Llama model.
Sicarius_The_First@reddit (OP)
Thank you so much :)
Negative Llama is great, but it's too big to be easily accessible, which is why I really like the 24B size!
Zestyclose_Yak_3174@reddit
Yeah, well, you did excellent work. Of course it's not perfect, but I have run, analyzed, and compared hundreds of models over the last few years, and that one came close to perfection as my personal/business life assistant without BS censoring or sugarcoating. Can't wait to try out your new 24B.
Echo9Zulu-@reddit
No mistral tekken? Acceleration frameworks gang rejoice!
Thanks for your work!
Sicarius_The_First@reddit (OP)
You're very welcome :)
Confident-Artist-692@reddit
Hi, I tried to load this model today: SicariusSicariiStuff\Impish_Magic_24B_GGUF\SicariusSicariiStuff_Impish_Magic_24B-Q4_K_M.gguf into LM Studio, but it flagged an error:
Failed to load model
Sicarius_The_First@reddit (OP)
This was tested with llama.cpp for GGUFs and worked fine; it might be an issue with your front end.
Sicarius_The_First@reddit (OP)
Advanced grammar correction with a breakdown example:
Repulsive-Memory-298@reddit
Could it do that before?
Sicarius_The_First@reddit (OP)
It could correct grammar before (every 3B model can), but not break it down like in the example, which helps a lot in improving language skills.
It doesn't just correct grammar (there are plenty of options for that); it analyzes and explains each correction.
NoIntention4050@reddit
I'm pretty sure your model name must include the name of the original model.
FullOf_Bad_Ideas@reddit
No, with Apache 2.0 it's not needed.
NoIntention4050@reddit
right, sorry
IrisColt@reddit
Thanks!!!