I have a question I've really never seen addressed well in all of the many fine-tuning videos, blogs, articles, etc. as most of them focus on training LLMs to respond to chats or instructions in a certain style or format.
At our work we use a specialized piece of software which is similar to VB but highly customized to the point where even a coding LLM that was trained on VB would still get things wrong. I have plenty of code examples as well as the developer documentation which is highly-detailed and definitely contains everything one would need to know in order to properly script something.
I understand the concepts of fine tuning and have done it plenty of times with text and image based models, but when it comes to training a coding LLM I get stuck. If you know of any good resources that go into greater detail on how best to do this I'd love to know about them. Perhaps you might even consider creating a fine-tuning notebook or blog article specifically about best practices for training a coding model.
Ideally, I'd like to have a model (or two, depending on suggestions) that can both generate code (input the requirements, get code out) as well as something that can be used conversationally to answer questions about the language, suggest code improvements, help correct errors in code, etc.
Some of the things that I get stuck on:
* Should I train a base model first to let it 'learn the patterns' of the language, then do instruction tuning for generating code and answering questions, or is the current state of models / fine-tuning sufficient to where I can skip straight to an existing instruction-trained coding model (perhaps one already trained on VB)?
* Between documentation, code examples, archived conversations between developers discussing the software and scripting concepts (email, forum posts) and synthetically generated Q&A or instructions/outputs, roughly how much of each should there be in the training data?
* How should chunking be approached with code? Even with some of the content I've found specifically about creating training data for coding LLMs, it's for languages which are easily split into multiple files and thus an entire file can fit into the context window. In the case of my custom scripting language, all code for a particular use case must be contained in a single file and can get quite large. If I have example code that's too long for the model's context window, do I simply throw it out? Cut out what I can so that it still remains valid? Simply truncate the file and add an indicator at the cut points that it's continued from elsewhere?
* When it comes to fine-tuning coding LLMs, how much training data should I aim for? (I suppose this might differ based on whether I'm using a model which is already familiar with VB vs one only trained for the usual languages, Python, HTML/CSS/JS etc)
* Any model suggestions for my use case?
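For the chunking question above, one of the options mentioned (cutting an oversized file and marking the cut points) could be sketched roughly as follows. This is a minimal illustration, not a recommendation from the thread: `chunk_script`, the `overlap_lines` idea, the whitespace token counter, and the VB-style `'` comment marker are all assumptions made for the example, and a real setup would count tokens with the target model's tokenizer.

```python
from typing import Callable, List


def chunk_script(
    source: str,
    max_tokens: int,
    overlap_lines: int = 2,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
    marker: str = "' ... continued ...",  # hypothetical comment syntax
) -> List[str]:
    """Split a long script on line boundaries into chunks near a token budget.

    Cut points get a marker line, and a few trailing lines are repeated at the
    start of the next chunk so the model sees some shared context.
    """
    chunks: List[str] = []
    body: List[str] = []   # real source lines in the current chunk
    continued = False      # True once the current chunk starts mid-file

    def render(lines: List[str], head: bool, tail: bool) -> str:
        parts = ([marker] if head else []) + lines + ([marker] if tail else [])
        return "\n".join(parts)

    for line in source.splitlines():
        # Would adding this line (plus a closing marker) exceed the budget?
        if body and count_tokens(render(body + [line], continued, True)) > max_tokens:
            chunks.append(render(body, continued, True))
            overlap = body[-overlap_lines:] if overlap_lines > 0 else []
            body = overlap + [line]
            continued = True
        else:
            body.append(line)

    if body:
        chunks.append(render(body, continued, False))
    return chunks


if __name__ == "__main__":
    demo = "\n".join(f"line{i}" for i in range(10))
    for chunk in chunk_script(demo, max_tokens=5, overlap_lines=1,
                              count_tokens=lambda s: len(s.splitlines())):
        print(chunk, end="\n---\n")
```

The budget is approximate: a single very long line, or a large overlap, can still push a chunk over the limit, and cutting on arbitrary lines does not guarantee each chunk is valid code on its own.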
I started down this road back when the first major Llama model came out and when Unsloth first came on the scene - I've been wanting to give it another shot with some of the newer models out there but it seems like if you stop paying attention to the space for a week you're already out of date!
I know I asked a lot of questions - any guidance you can provide on any of these points would be a tremendous help! Thanks in advance and thanks for all the work you've done for the community.
It technically works! See https://docs.unsloth.ai/basics/multi-gpu-training-with-unsloth - we're still working to make it much better and much more efficient!
Thanks for all of your hard work. Just a small query from my end. When does the team think it will be possible to fine-tune 120B GPT OSS and export to vLLM in 4bit? I believe it’s currently limited to FP16. Thanks!!!
That or MXFP4 - personally I have a novel use case for GPT-OSS-120B and love that it can fit into 1x H100. But as far as I understand, if we want to fine-tune it, we have to use the FP16 version, which has much higher VRAM requirements.
Thanks again
Hey! Great work with the Drummer models as usual! I remember you mentioned highlighting of dataset roles during the preparation stage - is this something that's still of interest?
Thank you! Agatha v1 and a couple more models were tuned using Unsloth because of the insane optimization tricks you guys did.
Helper functions for manipulating and previewing the dataset. In Axolotl, they do the following:
* Prints several samples from the dataset for inspection.
* Prints masked tokens in the color red, prints unmasked tokens in the color green.
* Prints the respective token id and attention mask values beside every token in the sample.
* Sample packing for even distribution (e.g., when I set seq\_len to 16k with sample packing, then I know the model is exposed to \~16k \* bsz in every training step)
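As a rough illustration of the preview helper described in the list above (masked tokens in red, unmasked in green, with the token id and attention mask beside each token), here is a minimal sketch. It is not Axolotl's or Unsloth's code: `format_sample` and its arguments are hypothetical names, and it assumes labels use `-100` for masked positions, which is the usual ignore index for the cross-entropy loss in PyTorch-based trainers.

```python
from typing import List, Sequence

RED, GREEN, RESET = "\033[31m", "\033[32m", "\033[0m"  # ANSI color codes
IGNORE_INDEX = -100  # assumed label value for tokens excluded from the loss


def format_sample(
    tokens: Sequence[str],
    input_ids: Sequence[int],
    labels: Sequence[int],
    attention_mask: Sequence[int],
) -> str:
    """Return one printable line: colored token followed by (token_id, attention_mask)."""
    parts: List[str] = []
    for tok, tid, lab, att in zip(tokens, input_ids, labels, attention_mask):
        color = RED if lab == IGNORE_INDEX else GREEN
        parts.append(f"{color}{tok}{RESET}({tid}, {att})")
    return " ".join(parts)


if __name__ == "__main__":
    # Toy sample: only the assistant's reply contributes to the loss.
    print(format_sample(
        tokens=["<user>", "Hi", "<assistant>", "Hello"],
        input_ids=[1, 42, 2, 77],
        labels=[-100, -100, -100, 77],
        attention_mask=[1, 1, 1, 1],
    ))
```

In practice the `tokens` strings would come from decoding each id with the model's tokenizer; they are passed in directly here to keep the sketch free of dependencies.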
There's probably a bunch more I've forgotten since we discussed these a few months ago.
What specific dataset preparation features would you like to see in Unsloth?
We currently have:

* Training on completions, which is actually very hard to implement
* Data preparation for vision datasets
* Tokenizer chat template preparation
* Synthetic data generation and more!

But we're always looking to improve Unsloth, so please list the top things you'd like us to include and we'll try to make it happen.
Masking out the input (prompt) tokens so that the loss is computed only on the assistant's responses generally increases accuracy by 1% or more, as seen in the [QLoRA paper](https://arxiv.org/pdf/2305.14314)
https://preview.redd.it/x9of8euyb7of1.jpeg?width=1200&format=pjpg&auto=webp&s=61e87499af8b77772be796a0729a31714f653585
The issue is that it's actually very complex, since tokenizers can merge adjacent tokens or handle newlines differently depending on context, so one has to be careful to mask out exactly the right tokens.
Simply tokenizing the assistant and user prompts separately unfortunately does not work, so we had to create a universal custom masking implementation in Unsloth. More details in our hyperparameters [guide](https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide#training-on-completions-only-masking-out-inputs)
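To make the idea of masking out inputs concrete, here is a minimal sketch of one way to build labels from a conversation that has already been tokenized as a whole. This is not Unsloth's implementation: `mask_inputs`, `find_all`, and the toy token ids are hypothetical, and the sketch assumes the turn markers appear as the same id sequences inside the full conversation, which is exactly the assumption that can break with real tokenizers.

```python
from typing import List, Sequence

IGNORE_INDEX = -100  # assumed label value that the loss function skips


def find_all(haystack: Sequence[int], needle: Sequence[int]) -> List[int]:
    """Return every start index where `needle` occurs inside `haystack`."""
    if not needle:
        raise ValueError("marker token sequence must not be empty")
    n = len(needle)
    return [
        i for i in range(len(haystack) - n + 1)
        if list(haystack[i:i + n]) == list(needle)
    ]


def mask_inputs(
    input_ids: Sequence[int],
    instruction_ids: Sequence[int],  # ids of the marker that opens a user turn
    response_ids: Sequence[int],     # ids of the marker that opens an assistant turn
) -> List[int]:
    """Copy input_ids into labels only inside assistant turns; mask everything else."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # The assistant turn runs until the next user marker or the end of the sample.
        end = next((i for i in instruction_starts if i >= begin), len(input_ids))
        labels[begin:end] = list(input_ids[begin:end])
    return labels


if __name__ == "__main__":
    # Toy ids: 10 opens a user turn, 11 opens an assistant turn.
    ids = [10, 5, 6, 11, 7, 8, 10, 9, 11, 3, 4]
    print(mask_inputs(ids, instruction_ids=[10], response_ids=[11]))
```

Because the markers are searched for in the ids of the fully tokenized conversation, the text is only tokenized once, which sidesteps some of the boundary effects of tokenizing each turn separately. It does not handle markers whose ids change depending on the surrounding text.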
What’s up with all these axolotl fanboys zerging every unsloth thread/topic? Did you even read the comment and reply or is trolling the only thing you are seeking?
Also, that’s not how training on completions works, “just tokenize everything”, do you have any clue? Like wtf are you on about? Why not reply to the question? What utilities? Jesus…
Hi r/LocalLLaMA 👋
We're excited for tomorrow's guests, **The Unsloth Team!** They're the folks behind the blazing-fast Unsloth fine-tuning library and a slew of community notebooks.
**Kicking things off tomorrow (Wednesday, Sept. 10th) 10 AM–1 PM PST**
⚠️ **Note:** The AMA itself will be hosted in a **separate thread,** please don’t post questions here.
Rukelele_Dixit21@reddit
yoracale@reddit
samplebitch@reddit
thesillystudent@reddit
danielhanchen@reddit
danielhanchen@reddit
sammcj@reddit
danielhanchen@reddit
yoracale@reddit
Mother_Context_2446@reddit
danielhanchen@reddit
Mother_Context_2446@reddit
danielhanchen@reddit
Mother_Context_2446@reddit
danielhanchen@reddit
chlobunnyy@reddit
danielhanchen@reddit
TheLocalDrummer@reddit
danielhanchen@reddit
TheLocalDrummer@reddit
danielhanchen@reddit
yoracale@reddit
TheLocalDrummer@reddit
danielhanchen@reddit
Educational_Rent1059@reddit
Educational_Rent1059@reddit
danielhanchen@reddit
XMasterrrr@reddit (OP)