New 24B finetune: Impish_Magic_24B
Posted by Sicarius_The_First@reddit | LocalLLaMA | View on Reddit | 28 comments
It's the 20th of June, 2025, and the world is getting more and more chaotic, but let's look at the bright side: Mistral released a new model at a very good size of 24B. No more "sign here" or "accept this weird EULA"; just a proper Apache 2.0 license. Nice! 👍🏻
This model is based on mistralai/Magistral-Small-2506, so naturally I named it Impish_Magic. Truly excellent size: I tested it on my laptop (16GB GPU, a 4090m) and it works quite well.
Strong in both productivity and fun. Good for creative writing and writer-style emulation.
New unique data, see details in the model card:
https://huggingface.co/SicariusSicariiStuff/Impish_Magic_24B
The model will be on Horde at very high availability for the next few hours, so give it a try!
NoobMLDude@reddit
Interesting.
You mention this in model card: “This model went "full" fine-tune over 100m unique tokens. Why do I say "full"?
I've tuned specific areas in the model to attempt to change the vocabulary usage, while keeping as much intelligence as possible. So this is definitely not a LoRA, but also not exactly a proper full finetune, but rather something in-between.”
Could you please explain the fine-tuning technique? Is it training different LoRAs on different model layers and merging them? Some technical details would be helpful for understanding what was done. Thanks
vasileer@reddit
Probably it went through a full epoch, which is ~128 steps.
Sicarius_The_First@reddit (OP)
w-what? 🤨
vasileer@reddit
For everyone downvoting my comment:
An “epoch” is one full pass through your training dataset. The number of optimization steps in one epoch is simply:
steps_per_epoch = ⌈ dataset_size / batch_size ⌉
where dataset_size is the number of training examples and batch_size is the per-device batch size.
If you're using gradient accumulation over N mini-batches to form an effective batch, then:
steps_per_epoch = ⌈ dataset_size / (batch_size × N) ⌉
For example, 100,000 examples with a per-device batch size of 32 (and no accumulation) gives 100,000 / 32 = 3,125 steps per epoch.
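In code, the same arithmetic looks like this (a minimal sketch; the helper name is mine, not from any particular training framework):

import math

def steps_per_epoch(dataset_size: int, batch_size: int, grad_accum: int = 1) -> int:
    # One optimizer step per effective batch; a final partial batch still counts as a step.
    return math.ceil(dataset_size / (batch_size * grad_accum))

print(steps_per_epoch(100_000, 32))                 # 3125, the example above
print(steps_per_epoch(100_000, 32, grad_accum=4))   # 782, with 4-step gradient accumulation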
Sicarius_The_First@reddit (OP)
I think you might be mixing things up; a full fine-tune, in the context of comparing it to a LoRA, has nothing to do with datasets but with the depth of training.
LoRA only trains a limited depth (R = X) while FFT trains everything. Spectrum (as mentioned before) trains fully, at full depth, just like a full fine-tune, but you can be selective about the projection layers you tune.
Sicarius_The_First@reddit (OP)
I'll also add that while LoRA can also be selective about the projection layers and depth it tunes, it lacks the granularity of Spectrum (at least in the "vanilla", "naive" LoRA implementation).
Sicarius_The_First@reddit (OP)
To be even more specific, because I got these questions in my DMs as well: with LoRA you can be selective, but only at the module level, like this:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
But with spectrum, you can be extremely granular like this:
# self_attn.o_proj layers
#- model.layers.22.self_attn.o_proj
- model.layers.23.self_attn.o_proj
#- model.layers.24.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.13.self_attn.q_proj
- model.layers.14.self_attn.q_proj
#- model.layers.15.self_attn.q_proj
#- model.layers.16.self_attn.q_proj
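Mechanically, that kind of per-layer, full-rank selection boils down to freezing everything and then unfreezing only the listed parameters by name. A minimal PyTorch sketch of the principle (my own illustration, not Spectrum's actual code; the layer names are just the ones from the example above, and loading the base model this way is an assumption):

import torch
from transformers import AutoModelForCausalLM

# Layers to keep trainable, copied from the example list above; everything else stays frozen.
UNFROZEN_PREFIXES = [
    "model.layers.23.self_attn.o_proj",
    "model.layers.13.self_attn.q_proj",
    "model.layers.14.self_attn.q_proj",
]

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Magistral-Small-2506", torch_dtype=torch.bfloat16
)

# Freeze every parameter, then unfreeze only those whose name starts with an allowed prefix.
for name, param in model.named_parameters():
    param.requires_grad = any(name.startswith(p) for p in UNFROZEN_PREFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")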
NoobMLDude@reddit
Thanks for sharing your fine tuning approach.
I have 2 follow-up questions:
1. How do you select which layers to fine-tune, and to what depth? What's the intuition for it?
2. I saw that the dataset comprises conversations from 4chan with negative and natural human interactions.
Sicarius_The_First@reddit (OP)
superb questions, rarely seen on reddit
Some of it is intuition; LLM training is far from an exact science, but you also have SNR scanning based on Spectrum (look it up on GitHub).
Indeed, normally 1k entries would overfit; however, because the data is authentic human data of the most chaotic kind (4chan), it cannot and will not overfit. In addition, this is a small part of a more generalized dataset; the key is mixing typical assistant data with different niches.
TheApadayo@reddit
I've messed around with this. You can do a full fine-tune of some blocks and LoRA other blocks. This makes it sound like the embedding blocks were trained normally while the rest were frozen and trained using LoRA, so the model more reliably recognizes new token IDs while reducing the effect the fine-tune has on the base model's performance.
Sicarius_The_First@reddit (OP)
Yeah there are more than 2 ways to skin a cat, so to speak :)
While LoRA lets you fine-tune at an arbitrary depth (R = X), you can also do a full-depth tune on specific projection layers (with Spectrum, as I mention in another comment).
There are myriad ways to tune models today; we live in abundance, thankfully :)
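To make the contrast concrete, here is a minimal sketch of the LoRA side using Hugging Face PEFT (my own illustration, not the recipe used for this model): r caps the depth of the update, and target_modules picks projection types, which in the vanilla setup are adapted uniformly across all layers.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Magistral-Small-2506")

# r bounds the rank ("depth") of the update; target_modules selects which projection
# types get adapters, applied to every layer in the vanilla setup.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()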
Sicarius_The_First@reddit (OP)
I've used spectrum.
AvaritiaGula@reddit
Wow, this model is quite good at story writing. The previous Mistral 24B was very dry, but the new model doesn't have such issues.
Sicarius_The_First@reddit (OP)
Glad to hear it. Indeed, there was a lot of interesting creative data, and the model surprises even me, especially with its ability to handle a complex Adventure format. It's even able to track items very well for its size.
I'll attach some examples to the model card under:
https://huggingface.co/SicariusSicariiStuff/Impish_Magic_24B/tree/main/Images/Adventure
Zestyclose_Yak_3174@reddit
You're a legend man! Loved your Negative Llama model.
Sicarius_The_First@reddit (OP)
Thank you so much :)
Negative Llama is great, but it's too big to be easily accessible, which is why I really like the 24B size!
Zestyclose_Yak_3174@reddit
Yeah, well, you did excellent work. Of course it's not perfect, but I have run, analyzed, and compared hundreds of models over the last few years, and that one came close to perfection as my personal/business life assistant without BS censoring or sugarcoating. Can't wait to try out your new 24B.
Echo9Zulu-@reddit
No mistral tekken? Acceleration frameworks gang rejoice!
Thanks for your work!
Sicarius_The_First@reddit (OP)
You're very welcome :)
Confident-Artist-692@reddit
Hi, I tried to load this model today: SicariusSicariiStuff\Impish_Magic_24B_GGUF\SicariusSicariiStuff_Impish_Magic_24B-Q4_K_M.gguf into LM Studio, but it flagged an error:
Failed to load model
Sicarius_The_First@reddit (OP)
This was tested with llama.cpp for GGUFs and worked fine; it might be an issue with your front end.
Sicarius_The_First@reddit (OP)
Advanced grammar correction with a breakdown example:
Repulsive-Memory-298@reddit
Could it do that before?
Sicarius_The_First@reddit (OP)
It could correct grammar before (every 3B model can), but not break it down like in the example, which helps a lot in improving language skills.
It doesn't just correct grammar (there are plenty of options for that); it analyzes and explains each correction.
NoIntention4050@reddit
I'm pretty sure your model name must include the name of the original model.
FullOf_Bad_Ideas@reddit
No, with Apache 2.0 it's not needed.
NoIntention4050@reddit
right, sorry
IrisColt@reddit
Thanks!!!