Fine-tune Llama3 - Dataset format?

Posted by indrasmirror@reddit | LocalLLaMA

Hey guys, I'm hoping someone can help. I'm trying to fine-tune Llama3-8B on a form-filling task, and I'm wondering what the best way to structure the instruction dataset is. I've looked around and can't find a definitive format. This is my first LLM fine-tune, so I'm not sure whether I can train on any structure or data, or whether it's best to stick to the structure the model was originally trained on.

I was thinking of doing it like this:

    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {}

    ### Input:
    {}

    ### Response:
    {}"""

    # `tokenizer` is assumed to have been loaded earlier, together with the model.
    EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
        texts = []
        # `input_text` is used instead of `input` to avoid shadowing the builtin.
        for instruction, input_text, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
            texts.append(text)
        return {"text": texts}

Or can I structure it like this?

    system_message = """You are Llama, an AI assistant created by Vignesh to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

    def create_conversation(sample):
        # Samples that already start with a system message are returned unchanged.
        if sample["messages"][0]["role"] == "system":
            return sample
        else:
            # Otherwise prepend the default system message to the conversation.
            sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
            return sample

I couldn't find a clear definition anywhere of what the base structure of the dataset should be. Or is it something where I can hypothetically train on any structure? Is it possible to teach the model a structure of my own?
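To make the comparison concrete, here is a minimal sketch of how I think a single record would map from the Alpaca-style columns (instruction / input / output) to the messages layout. The record contents, the `to_messages` helper, and the short system message are all made up for illustration and are not taken from any library, so treat this as an assumption about the data shapes, not a recommended recipe.

```python
# Hypothetical form-filling record in the Alpaca-style column layout.
# The field values are invented purely for illustration.
record = {
    "instruction": "Fill in the form fields using the text provided.",
    "input": "Hi, my name is Jane Doe and I was born on 1990-01-01.",
    "output": '{"name": "Jane Doe", "date_of_birth": "1990-01-01"}',
}


def to_messages(record, system_message=None):
    """Convert one Alpaca-style record into a role/content message list.

    This helper is hypothetical; it only illustrates the shape of the data.
    """
    messages = []
    if system_message:
        messages.append({"role": "system", "content": system_message})

    # The instruction and the optional input are merged into one user turn.
    user_content = record["instruction"]
    if record["input"]:
        user_content += "\n\n" + record["input"]
    messages.append({"role": "user", "content": user_content})

    # The expected output becomes the assistant turn.
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}


sample = to_messages(record, system_message="You fill in forms from free text.")
for message in sample["messages"]:
    print(message["role"], "->", message["content"])
```

Both layouts carry the same three pieces of information, so my question is really about which wrapping of that information the model should see during training.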