Local fine-tuning will be the biggest competitive edge in 2026.
Posted by HerbHSSO@reddit | LocalLLaMA | 14 comments
While massive generalist models are incredibly versatile, a well-fine-tuned model that's specialized for your exact use case often outperforms them in practice even when the specialized model is significantly smaller and scores lower on general benchmarks.
To actually do this kind of effective fine-tuning today (especially parameter-efficient methods like LoRA/QLoRA that let even consumer hardware punch way above its weight), here are some open-source tools:
Unsloth: a specialized library designed to maximize the performance of individual GPUs. It achieves significant efficiencies by replacing standard PyTorch implementations with hand-written Triton kernels.
Axolotl: a high-level configuration wrapper that streamlines the end-to-end fine-tuning pipeline, with an emphasis on reproducibility and support for advanced training architectures.
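To see why LoRA-style methods let consumer hardware punch above its weight, here's a toy numpy sketch of the low-rank update itself (illustrative sizes, not any library's API):

```python
import numpy as np

# Toy LoRA illustration: instead of updating a full d x d weight matrix,
# train two low-rank factors B (d x r) and A (r x d). With d = 4096 and
# r = 16, the adapter is under 1% of the full matrix's parameters.
d, r, alpha = 4096, 16, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)).astype(np.float32)         # frozen base weight
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01  # trainable factor
B = np.zeros((d, r), dtype=np.float32)                     # trainable, init 0

# Effective weight at inference; since B starts at zero, the model is
# unchanged before any training happens.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Only A and B receive gradients and optimizer state, which is why a single consumer GPU can fine-tune a model it could never full-finetune.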
Do you know of other tools or ideas for training and fine-tuning local models?
whenhellfreezes@reddit
I'm lightly worried that it won't be. It's worth looking into the literature on GEPA, a prompt-optimization algorithm. The upshot is that you can have a prompt mutated by an LLM reflecting on how well the old version of the prompt did and what went wrong across multiple runs. GEPA takes roughly a third of the GPU time of a fine-tune and performs about as well. You can also have a stronger model do the reflection while a weaker model does the trial runs, essentially getting some distillation and cost savings.
Okay, and then what is the result? A nice starting prompt for getting your desired output. That prompt can be shared around, and you don't need to distribute and host a new model, just the prompt.
Alright, now consider that we already have agent skills in things like Claude Code and OpenCode. That's just a prompt that gets injected when it's needed, and there are plugin marketplaces that make finding and installing them easy... In many cases these skills are effectively human-done GEPA, or could be made via GEPA.
Anyways, I'm not sure open-model fine-tuning will be as good as I used to think compared to skills on a plugin marketplace.
Of course, fine-tune + GEPA actually outperforms both, and you can context-distill things in, etc. So, like, idk.
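The loop can be sketched in a few lines; here both model calls are stubbed (run_trials and reflect are hypothetical stand-ins, not a real GEPA implementation):

```python
def run_trials(prompt, tasks):
    """Stub for the weak model's trial runs: returns a score plus failure
    notes. Here the score just grows with prompt detail; a real run would
    execute the tasks and collect traces for the reflector to read."""
    score = min(1.0, 0.5 + prompt.count(".") / 20)
    failures = [] if score >= 0.95 else ["missed a formatting constraint"]
    return score, failures

def reflect(prompt, failures):
    """Stub for the stronger model's reflection: mutate the prompt based on
    what went wrong in the last batch of runs."""
    return prompt + " Watch out: " + failures[0] + "."

def gepa_loop(seed_prompt, tasks, rounds=5):
    prompt, best = seed_prompt, 0.0
    for _ in range(rounds):
        score, failures = run_trials(prompt, tasks)
        best = max(best, score)
        if not failures:   # good enough, stop mutating
            break
        prompt = reflect(prompt, failures)
    return prompt, best

final_prompt, final_score = gepa_loop("Summarize the report.", tasks=[])
```

The artifact you ship is final_prompt, which is the whole point: it's text, not weights.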
IrisColt@reddit
Gemma 4 31B can reflect on its prompts and the full chat history, then suggest tweaks to elicit the answer the user wants. When you rerun those revised prompts, they work not just for Gemma, but for other models too, like Qwen 3.6 35B A3B.
IrisColt@reddit
I know this thread is old, but I’ve gotta push back. Since 2025, every time I’ve used a fine‑tuned model it’s been a total disaster. When Gemma 3 and Qwen 3 (and their later versions) dropped, those models started forgetting how to do several of their basic tricks... and it’s even worse with Qwen 3.6 and Gemma 4. For example, the more you pump up the prose, the more they lose nuance, multilingual ability, or awareness, or you name it… I’m starting to think that as the models get smarter and the training gets tighter, fine‑tuning messes with them in ways that only a broad, interdisciplinary benchmark can pick up.
Other-Competition-86@reddit
If you're on Apple Silicon, try MLX-LM for LoRA. It's dramatically simpler than the PyTorch/QLoRA path and trains at about the same speed using Metal: ~20 minutes for 1,500 training pairs on an M1 Pro. The main limitation is that it's Apple-only, but if that's your platform it's a no-brainer over setting up bitsandbytes + PEFT.
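For reference, a minimal invocation looks roughly like this (model name and data path are placeholders; flag names follow mlx-lm's LoRA docs but may differ in your installed version):

```shell
pip install mlx-lm

# LoRA fine-tune on Apple Silicon; ./data should hold train.jsonl/valid.jsonl
# (see the mlx-lm data docs for the accepted formats).
mlx_lm.lora \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --train \
  --data ./data \
  --batch-size 4 \
  --iters 600
```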
Trick-Stress9374@reddit
I use AI TTS models to generate full audio-books. After testing over 15 different models, I found that most lack the expressiveness or the high voice variation I’m looking for. Some are more expressive but very unstable, while others offer high variation but produce inconsistent timbres.
I eventually found a reference audio prompt that works well with Spark-TTS, and it sounds great. However, it took many attempts and different settings to find the right result, and it only applies to one specific voice; zero-shot cloning for many other voices just doesn't sound good enough.
A week ago, I started fine-tuning a model using Unsloth with LoRA and was surprised by how fast training was and how little VRAM it used. The biggest challenge was finding a good dataset and the right training parameters, but once I figured them out, the results were excellent. The improvement in speaker-style similarity is outstanding. I also tested with only 30 minutes of audio and it still works very well; maybe even less would work, since the quality of the data matters much more than the quantity. Fine-tuning doesn't seem to make the model less stable, as it misses about the same number of words as the original model. It is stable enough that I can use STT to find errors and then regenerate those parts with a more stable model to fix most issues.
I also tried writing a training script for Echo TTS with LoRA (not Unsloth), which is a much larger model with better stability than Spark-TTS. I didn't use it as my primary model because it lacks voice variation and the output is noisier; not significantly, but it's apparent when using earphones. As it is a larger model, it requires more VRAM for training, but it stays around the same level as the inference requirements, so any card with 16GB (or even 12GB) of VRAM should handle it. Fine-tuning significantly improved speaker-style similarity compared to zero-shot. It now has high voice variation and expressiveness if the audio dataset has them. It is still noisier than Spark-TTS, but because it is much more stable, it is perfect for regenerating the parts where Spark-TTS has issues.
I will add that Spark-TTS can be very fast with an SGLang implementation: using large batches, bfloat16, and deterministic inference, you can get around 50x RTFx on an RTX 5070 Ti.
You can use vLLM too, but I could not get fully deterministic inference to work (even with the same seed): batch 1 versus batch 50 gives different results, though output is consistent for the same batch size if you set export VLLM_BATCH_INVARIANT=1. Without that variable, you get different results even with the same batch size and seed when running identical sentences one after another.
Also, Spark-TTS only outputs 16 kHz audio, so it can sound quite muffled, but if you use an audio super-resolution model like Flow-High or LavaSR to upsample to 48 kHz, it sounds fantastic. Flow-High runs at around 45x RTFx and LavaSR at around 700x. Which one produces better output really depends on the voice, but LavaSR is much faster (Flow-High is quite fast too).
Silver-Tie-3859@reddit
Could you please share how to apply VLLM to Spark TTS? Thank you very much.
catplusplusok@reddit
The problem is you need a large dataset (like thousands of long-context examples) of what you are trying to teach, plus padding with general training data to avoid losing existing capabilities. Plus you want reasoning blocks for thinking models. And if you can generate a synthetic dataset, that means you already have a model that does what you need, in which case you can just use it. Unless the goal is specifically to distill a large model's knowledge into a smaller local model, but that's not a typical home project.
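The "padding" step can be sketched as simple data mixing (a toy illustration; replay_frac is an assumption you would tune, not an established ratio):

```python
import random

def mix_dataset(task_examples, general_examples, replay_frac=0.3, seed=0):
    """Pad task-specific training data with general 'replay' examples so
    fine-tuning is less likely to erase existing capabilities.
    replay_frac is the fraction of the final mix drawn from general data."""
    rng = random.Random(seed)
    n_replay = round(len(task_examples) * replay_frac / (1 - replay_frac))
    replay = rng.sample(general_examples, min(n_replay, len(general_examples)))
    mixed = list(task_examples) + replay
    rng.shuffle(mixed)
    return mixed

mixed = mix_dataset([f"task-{i}" for i in range(700)],
                    [f"general-{i}" for i in range(10_000)])
print(len(mixed))  # 700 task examples + 300 general replay examples
```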
RevolutionaryLime758@reddit
Giving advice to a bot lol low iq
HerbHSSO@reddit (OP)
bruh https://x.com/leerob/status/2035035355364081694?s=20
LevianMcBirdo@reddit
Mistral seems to agree: they launched Forge yesterday so businesses can outsource the process.
RevolutionaryLime758@reddit
Slop post by slop person
FRAIM_Erez@reddit
I’d start with RAG + evals before investing in a full fine-tuning pipeline.
ttkciar@reddit
I think AllenAI's SERA project demonstrated how this can be done economically and to good effect. Anyone attempting to economically fine-tune local models for their own code repos should start with AllenAI's paper and training code.
Besides Unsloth and Axolotl, TRL is a commonly used fine-tuning framework. It's worth looking into.
There are other ways to modify local models than fine-tuning, too. Goddard's mergekit and -p-e-w-'s Heretic implement low-compute methods for improving model behavior.
Also, llama.cpp developers are working on native training functionality. It is as yet incomplete, but when done it should solidify llama.cpp as an all-purpose LLM solution with limited external dependencies.
LegacyRemaster@reddit
on my way