Question regarding fine tuning.
Posted by Fun-Agent9212@reddit | LocalLLaMA | 14 comments
What's the minimum record count you'd want in a fine-tuning dataset before you trust the results?
DinoAmino@reddit
Freakin bots man.
Fit-Produce420@reddit
I found decent results from LoRA with trainable parameters at around 2-4% of total parameters.
So the number changes a bit based on the size of the model, but for an 8-12GB model I used between 10,000 and 16,000 entries.
If you go way overboard with training you will definitely reduce general intelligence and the model will get stupid.
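For a rough sanity check on that 2-4% figure, here's a back-of-the-envelope sketch in Python. The dimensions and rank are made-up example values for an 8B-class transformer, not a recommendation:

```python
def lora_trainable_fraction(d_model, n_layers, rank, n_matrices=4):
    """Rough ratio of LoRA trainable params to a dense transformer's params.

    Each adapted weight matrix (d_model x d_model) gets two low-rank
    factors: A (d_model x rank) and B (rank x d_model).
    n_matrices = how many projections per layer you adapt (e.g. q, k, v, o).
    """
    lora_params = n_layers * n_matrices * 2 * d_model * rank
    # very rough dense estimate per layer: ~4*d^2 for attention
    # projections plus ~8*d^2 for the MLP
    base_params = n_layers * 12 * d_model ** 2
    return lora_params / base_params

# illustrative 8B-ish config: hidden size 4096, 32 layers, rank 128
frac = lora_trainable_fraction(4096, 32, rank=128)
print(f"{frac:.1%}")  # lands right around 2%
```

Scaling the rank up or down moves you linearly through that 2-4% band.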
Fun-Agent9212@reddit (OP)
The 2-4% of total parameters rule of thumb is really helpful, thanks. And the warning about reduced general intelligence from overtraining tracks with what others are saying too. When you say 10-16K entries, were those all task-specific, or did you mix in general data to keep the model balanced?
Fit-Produce420@reddit
If you're doing LoRA training, you are only adding domain- or task-specific examples.
For instance, training it on your codebase: it already knows Python, but it can be additionally trained on YOUR Python codebase. You wouldn't add generalized data for that.
I doubt you'll see much improvement training on generalized data; that's what base models already have.
GamerHaste@reddit
You’re going to need to test yourself with different amounts of data… it’s definitely annoying, and there’s no particular value that can be recommended for a specific task. It’s why in ML “making” a model is like “growing a brain”… there’s a lot of trial and error involved in running experiments and seeing the result. As others have said in this thread, there’s really no particular value. You’ll need to create different sizes of datasets and validate how the model performs before vs. after, then continue to run more ablations with more/less data.
Fun-Agent9212@reddit (OP)
This is solid advice, thank you. The benchmarking point especially — I've been so focused on generation quality that I hadn't given enough thought to how buyers would actually validate improvements on their end. Might be worth including suggested eval metrics alongside the datasets themselves. Gives people a starting point instead of just handing them raw data and saying good luck.
GamerHaste@reddit
If your goal is to sell some AI modeling service to customers, then from direct personal experience (just based on my job and what we do), the #1 most important thing you can do is try to find one (or maybe a few) specific, directly measurable, quantifiable metrics that you can look at before vs. after fine-tuning a model. It’s an orders-of-magnitude harder task to solve than just throwing raw text or some structured data into a training algorithm and saying “here, do something”… and a lot of the time benchmarking is the bottleneck. How do you measure a success case? Good luck with your project.
Fun-Agent9212@reddit (OP)
Thank you that's really helpful! I will definitely take that bottleneck into account.
AutomataManifold@reddit
2000.
I'm basing that on the LIMA results. In practice it depends on what you are trying to accomplish.
And, really, asking how much training data you need before trusting the results has it backwards.
Figure out your evaluation first. How are you going to measure when it's doing it right? Once you have that determined you can work backwards from there.
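To make "evaluation first" concrete, here's a minimal sketch. `predict` and `eval_set` are placeholders for whatever model interface and held-out pairs you actually have, and exact match is just one example metric; swap in whatever fits your task:

```python
def accuracy(predict, eval_set):
    """Score any callable prompt -> answer on held-out (prompt, expected) pairs.

    Exact string match is the simplest possible metric; the point is
    having *some* number you can compare before vs. after tuning.
    """
    hits = sum(
        1
        for prompt, expected in eval_set
        if predict(prompt).strip() == expected.strip()
    )
    return hits / len(eval_set)

# workflow: fix the eval first, then compare
# before = accuracy(base_model_predict, eval_set)
# after  = accuracy(tuned_model_predict, eval_set)
```

Once that number exists, "how much data do I need" becomes "how much data until the after-score stops improving."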
Fun-Agent9212@reddit (OP)
Evaluation-first makes a lot of sense, and the LIMA reference is a good anchor point. I'm coming at this from the supplier side so I'm partly trying to figure out what sizes buyers actually expect when they're shopping. But you're right that framing it around eval rather than raw count is probably the better conversation to have with customers too.
Crafty-Celery-2466@reddit
Depends on a lot of factors. Please add more as you see fit.
End of the day, get as much data as you can first. Start testing with a minimal count and see how it performs on eval data. Say a 1000/100/100 split. Then increase it slowly as you see fit. I started with 2000 or so and now I’m at 45K for a 4B model.
A smaller model might need less data if it’s a specialized task. A bigger model can take more and generalize a bit better.
All based on my experience. Might vary broadly. 🫡 good luck.
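For what it’s worth, a tiny sketch of that starting split (the function name and sizes are just illustrative; grow `n_train` between runs):

```python
import random

def split_dataset(records, n_train=1000, n_val=100, n_test=100, seed=0):
    """Shuffle once with a fixed seed, then carve out fixed-size
    train/val/test slices (the 1000/100/100 starting point above)."""
    rng = random.Random(seed)  # fixed seed so splits are reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

Keeping val/test fixed while only train grows is what lets the before/after comparisons stay honest.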
Fun-Agent9212@reddit (OP)
The token length overfitting thing is a useful warning I hadn't thought about. When did you catch that problem? And starting small then scaling makes sense. What field are you building for?
Regarding the task, I'm working on building synthetic dialogue datasets for HR/workplace conflict scenarios, labeled with severity, conflict type, quality scores, etc., mainly targeting people fine-tuning models for conflict detection or building HR chat tools. To clarify, I'm not fine-tuning myself, though I'm looking into that to further my knowledge. Right now I'm generating and selling the datasets, so I'm asking more from a market perspective: what sizes do buyers actually look for when they're shopping for training data?
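In case it helps picture what I mean, here's a hypothetical shape for one record (field names and values are illustrative, not a fixed schema):

```python
import json

# One synthetic HR-dialogue record; the labels ride alongside the turns
# so a buyer can fine-tune for conflict detection directly off the file.
record = {
    "dialogue": [
        {"speaker": "employee_a", "text": "You keep taking credit for my work."},
        {"speaker": "employee_b", "text": "That's not how I remember it."},
    ],
    "labels": {
        "conflict_type": "credit_attribution",
        "severity": 3,          # e.g. 1 (mild) to 5 (severe)
        "quality_score": 0.92,  # generator/judge confidence for filtering
    },
}

line = json.dumps(record)  # one JSONL line per record
```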
Thanks!
Crafty-Celery-2466@reddit
Oh well, if you find a nice blog or anything for good SDG (synthetic data generation), I’d love to hear. I am struggling with verification, as LLM-as-a-judge itself is unreliable.
Fun-Agent9212@reddit (OP)
What domain are you working in?