Question regarding fine tuning.
Posted by Fun-Agent9212@reddit | LocalLLaMA | 14 comments
What's the minimum record count you'd want in a fine-tuning dataset before you trust the results?
DinoAmino@reddit
Freakin bots man.
Fit-Produce420@reddit
I found decent results from LoRA with trainable parameters at around 2-4% of total parameters.
So the number changes a bit based on the size of the model, but for an 8-12GB model I used between 10,000 and 16,000 entries.
If you go way overboard with training you will definitely reduce general intelligence and the model will get stupid.
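For a rough sanity check on that 2-4% figure, here's a back-of-the-envelope sketch in Python. The dimensions and rank are made-up example values for an 8B-class transformer, not a recommendation:

```python
def lora_trainable_fraction(d_model, n_layers, rank, n_matrices=4):
    """Rough ratio of LoRA trainable params to a dense transformer's params.

    Each adapted weight matrix (d_model x d_model) gets two low-rank
    factors: A (d_model x rank) and B (rank x d_model).
    n_matrices = how many projections per layer you adapt (e.g. q, k, v, o).
    """
    lora_params = n_layers * n_matrices * 2 * d_model * rank
    # very rough dense estimate per layer: ~4*d^2 for attention
    # projections plus ~8*d^2 for the MLP
    base_params = n_layers * 12 * d_model ** 2
    return lora_params / base_params

# illustrative 8B-ish config: hidden size 4096, 32 layers, rank 128
frac = lora_trainable_fraction(4096, 32, rank=128)
print(f"{frac:.1%}")  # lands right around 2%
```

Scaling the rank up or down moves you linearly through that 2-4% band.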
Fun-Agent9212@reddit (OP)
The 2-4% of total parameters rule of thumb is really helpful, thanks. And the warning about reduced general intelligence from overtraining tracks with what others are saying too. When you say 10-16K entries, were those all task-specific, or did you mix in general data to keep the model balanced?
Fit-Produce420@reddit
If you're doing LoRA training, you are only adding domain- or task-specific examples.
For instance, training it on your codebase: it already knows Python, but it can be additionally trained on YOUR Python codebase. You wouldn't add generalized data for that.
I doubt you'll see much improvement training on generalized data; that's what base models already have.
GamerHaste@reddit
You’re going to need to test yourself with different amounts of data… it’s definitely annoying, and there’s no particular value that can be recommended for a specific task. It’s why in ML “making” a model is like “growing a brain”… there’s a lot of trial and error involved in running experiments and seeing the result. As others have said in this thread, there’s really no particular value. You’ll need to create different sizes of datasets and validate how the model performs before vs. after, then continue to run more ablations with more/less data.
Fun-Agent9212@reddit (OP)
This is solid advice, thank you. The benchmarking point especially — I've been so focused on generation quality that I hadn't given enough thought to how buyers would actually validate improvements on their end. Might be worth including suggested eval metrics alongside the datasets themselves. Gives people a starting point instead of just handing them raw data and saying good luck.
GamerHaste@reddit
If your goal is to sell some AI modeling service to customers, then from direct personal experience (just based on my job and what we do), the #1 most important thing you can do is try to find one (or maybe a few) specific, directly measurable, quantifiable metrics that you can look at before vs. after fine-tuning a model. It’s an orders-of-magnitude harder task to solve than just throwing raw text or some structured data into a training algorithm and saying “here, do something”… and a lot of the time benchmarking is the bottleneck. How do you measure a success case? Good luck with your project.
Fun-Agent9212@reddit (OP)
Thank you that's really helpful! I will definitely take that bottleneck into account.
AutomataManifold@reddit
2000.
I'm basing that on the LIMA results. In practice it depends on what you are trying to accomplish.
And, really, asking how much training data you need before trusting the results has it backwards.
Figure out your evaluation first. How are you going to measure when it's doing it right? Once you have that determined you can work backwards from there.
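To make "evaluation first" concrete, here's a minimal sketch. `predict` and `eval_set` are placeholders for whatever model interface and held-out pairs you actually have, and exact match is just one example metric; swap in whatever fits your task:

```python
def accuracy(predict, eval_set):
    """Score any callable prompt -> answer on held-out (prompt, expected) pairs.

    Exact string match is the simplest possible metric; the point is
    having *some* number you can compare before vs. after tuning.
    """
    hits = sum(
        1
        for prompt, expected in eval_set
        if predict(prompt).strip() == expected.strip()
    )
    return hits / len(eval_set)

# workflow: fix the eval first, then compare
# before = accuracy(base_model_predict, eval_set)
# after  = accuracy(tuned_model_predict, eval_set)
```

Once that number exists, "how much data do I need" becomes "how much data until the after-score stops improving."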
Fun-Agent9212@reddit (OP)
Evaluation-first makes a lot of sense, and the LIMA reference is a good anchor point. I'm coming at this from the supplier side so I'm partly trying to figure out what sizes buyers actually expect when they're shopping. But you're right that framing it around eval rather than raw count is probably the better conversation to have with customers too.
Crafty-Celery-2466@reddit
Depends on a lot of factors. Please add more as you see fit.
End of the day, get as much data as you can first. Start testing with a minimal count and see how it performs on eval data. Say a 1000/100/100 split. Then increase it slowly as you see fit. I started with 2000 or so and now I’m at 45K for a 4B model.
A smaller model might need less data if it’s a specialized task. A bigger model can take more and generalize a bit better.
All based on my experience. Might vary broadly. 🫡 good luck.
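For what it’s worth, a tiny sketch of that starting split (the function name and sizes are just illustrative; grow `n_train` between runs):

```python
import random

def split_dataset(records, n_train=1000, n_val=100, n_test=100, seed=0):
    """Shuffle once with a fixed seed, then carve out fixed-size
    train/val/test slices (the 1000/100/100 starting point above)."""
    rng = random.Random(seed)  # fixed seed so splits are reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

Keeping val/test fixed while only train grows is what lets the before/after comparisons stay honest.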
Fun-Agent9212@reddit (OP)
The token length overfitting thing is a useful warning I hadn't thought about. When did you catch that problem? And starting small then scaling makes sense. What field are you building for?
Regarding the task, I'm working on building synthetic dialogue datasets for HR/workplace conflict scenarios, labeled with severity, conflict type, quality scores, etc., mainly targeting people fine-tuning models for conflict detection or building HR chat tools. To clarify, I'm not fine-tuning myself, though I'm looking into that to further my knowledge. Right now I'm generating and selling the datasets, so I'm asking more from a market perspective: what sizes do buyers actually look for when they're shopping for training data?
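In case it helps picture what I mean, here's a hypothetical shape for one record (field names and values are illustrative, not a fixed schema):

```python
import json

# One synthetic HR-dialogue record; the labels ride alongside the turns
# so a buyer can fine-tune for conflict detection directly off the file.
record = {
    "dialogue": [
        {"speaker": "employee_a", "text": "You keep taking credit for my work."},
        {"speaker": "employee_b", "text": "That's not how I remember it."},
    ],
    "labels": {
        "conflict_type": "credit_attribution",
        "severity": 3,          # e.g. 1 (mild) to 5 (severe)
        "quality_score": 0.92,  # generator/judge confidence for filtering
    },
}

line = json.dumps(record)  # one JSONL line per record
```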
Thanks!
Crafty-Celery-2466@reddit
Oh well, if you find a nice blog or anything for good SDG (synthetic data generation), I’d love to hear. I am struggling with verification, as LLM-as-a-judge itself is unreliable.
Fun-Agent9212@reddit (OP)
What domain are you working in?