Finetune LLama3 - Dataset format?

Posted by indrasmirror@reddit | LocalLLaMA | 4 comments

Hey guys, I'm hoping someone can help. I'm trying to fine-tune Llama3-8B on a form-filling task, and I'm wondering what the best way to structure the instruction dataset is. I've looked around and can't find a definitive structure. This is my first LLM fine-tune, so I'm not sure whether I can train on any structure or data, or whether it's best to stick to the structure of the model's base training dataset.

I was thinking of doing it like this:

    alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {}

    ### Input:
    {}

    ### Response:
    {}"""

    EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

    def formatting_prompts_func(examples):
        instructions = examples["instruction"]
        inputs = examples["input"]
        outputs = examples["output"]
        texts = []
        for instruction, input, output in zip(instructions, inputs, outputs):
            # Must add EOS_TOKEN, otherwise your generation will go on forever!
            text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
            texts.append(text)
        return {"text": texts}

Or can I structure it like this?

    system_message = """You are Llama, an AI assistant created by Vignesh to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

    def create_conversation(sample):
        # Prepend the default system message only if the sample has none.
        if sample["messages"][0]["role"] == "system":
            return sample
        else:
            sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
            return sample

I couldn't find a clear definition anywhere of what the base structure of the dataset should be. Or can I, hypothetically, train on any structure? Is it possible to teach the model to understand a structure of my own?

FullOf_Bad_Ideas@reddit

Are you starting the fine-tune from base 8B or Instruct 8B? How big is your dataset? If you're starting from base, you can use any dataset format you want; just print out a few samples before sending them to SFTTrainer or whatever trainer you're using, to check that all newlines and text are in place. I usually just go with the ChatML format. Please note that the base Llama 8B tokenizer has the Instruct chat template in it, so if you train on a different format, you will want to erase or overwrite it. My unsloth training script is on the [model page](https://huggingface.co/adamo1139/Llama-3-8B-AEZAKMI-run1); feel free to copy whatever you want. I had to overwrite the tokenizer chat template manually later.
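To make the "print a few samples" advice concrete, here is a minimal sketch of rendering a conversation as ChatML and inspecting it before training. The function name `render_chatml` and the sample conversation are hypothetical and are not taken from the linked training script; only the `<|im_start|>` / `<|im_end|>` markers are part of the ChatML convention.

```python
# Sketch only: render_chatml and the sample data are hypothetical,
# not taken from the linked training script.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as one ChatML string."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # At inference time, leave the assistant turn open for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


sample = [
    {"role": "system", "content": "You fill in forms from free text."},
    {"role": "user", "content": "Name: Jane Doe, born 1990-01-01"},
    {"role": "assistant", "content": '{"name": "Jane Doe", "dob": "1990-01-01"}'},
]

text = render_chatml(sample)
# repr() makes newlines and marker tokens visible, so a missing newline is easy to spot.
print(repr(text))
```

If the trainer is fed strings like `text`, the chat template stored in the tokenizer should describe the same format. With a Hugging Face tokenizer that usually means assigning a matching Jinja template to `tokenizer.chat_template`. Whether the ChatML markers also need to be added as special tokens depends on the tokenizer and is worth checking separately.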

Aggressive_Energy413@reddit

Should I follow the prompt format Llama3 used? It seems that when I use a format like this, the fine-tuned model performs worse. { "instruction": "……", "input": "……", "output": "……" }

FullOf_Bad_Ideas@reddit

That's just the JSON format of the data you enter, not the prompt format passed to the model. JSON/JSONL/Parquet files have to be parsed by the script later to insert a prompt format. Do you know whether your script formats and parses the dataset into any particular template before giving it to the trainer? If you're not formatting it, it's gonna give you bad results, since you're not using any prompt template. I suggest using the ChatML prompt format.
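The parsing step described above could look roughly like the sketch below: JSONL records with `instruction`, `input`, and `output` keys are read and turned into ChatML training strings. The helper names `record_to_chatml` and `load_jsonl_as_prompts` and the two example records are assumptions made for illustration; the thread does not show this code.

```python
import json


# Hypothetical helpers: the thread recommends ChatML but does not show this code.
def record_to_chatml(record):
    """Turn one {"instruction", "input", "output"} record into a ChatML string."""
    user_turn = record["instruction"]
    if record.get("input"):
        # Only append the input when it is non-empty.
        user_turn += "\n\n" + record["input"]
    return (
        f"<|im_start|>user\n{user_turn}<|im_end|>\n"
        f"<|im_start|>assistant\n{record['output']}<|im_end|>\n"
    )


def load_jsonl_as_prompts(lines):
    """Parse JSONL lines and return one formatted training string per record."""
    return [record_to_chatml(json.loads(line)) for line in lines if line.strip()]


# Stand-in for the lines of a .jsonl file.
raw_lines = [
    '{"instruction": "Extract the city.", "input": "I live in Oslo.", "output": "Oslo"}',
    '{"instruction": "Say hello.", "input": "", "output": "Hello!"}',
]

prompts = load_jsonl_as_prompts(raw_lines)
print(prompts[0])
```

The point is that the trainer sees the strings in `prompts`, never the raw JSON keys, so the model learns the prompt format and not the file format.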

Aggressive_Energy413@reddit

I'm using a framework named MLX on an Apple Studio. It seems that the dataset entries are not converted to a prompt template format before the tokenizer encodes them:

    import mlx.core as mx
    import numpy as np

    def iterate_batches(dset, tokenizer, batch_size, train=False):
        # Shuffle indices
        while True:
            indices = np.arange(len(dset))
            if train:
                indices = np.random.permutation(indices)

            # Collect batches from dataset
            for i in range(0, len(indices) - batch_size + 1, batch_size):
                # Encode batch (raw dataset entries go straight into the tokenizer)
                batch = [tokenizer.encode(dset[indices[i + j]]) for j in range(batch_size)]
                lengths = [len(x) for x in batch]

                # Check if any sequence is longer than 2048 tokens
                if max(lengths) > 2048:
                    print(
                        "[WARNING] Some sequences are longer than 2048 tokens. "
                        "Consider pre-splitting your data to save memory."
                    )

                # Pad to the max length
                batch_arr = np.zeros((batch_size, max(lengths)), np.int32)
                for j in range(batch_size):
                    batch_arr[j, : lengths[j]] = batch[j]
                batch = mx.array(batch_arr)

                # Inputs, next-token targets, and the unpadded lengths
                yield batch[:, :-1], batch[:, 1:], mx.array(lengths)

            if not train:
                break
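If the loop above really passes each dataset entry to `tokenizer.encode` unchanged, one option is to apply the template inside the dataset object, so that indexing already returns formatted text. The following is a minimal sketch under that assumption. The `TemplatedDataset` class, the template string, and the `"</s>"` end-of-sequence placeholder are all hypothetical and are not part of MLX.

```python
# Hypothetical wrapper: not part of MLX. It assumes the batching loop only
# needs len(dset) and dset[i] returning a string.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


class TemplatedDataset:
    """Wraps instruction/input/output records; indexing returns formatted text."""

    def __init__(self, records, template, eos_token=""):
        self._records = records
        self._template = template
        self._eos_token = eos_token

    def __len__(self):
        return len(self._records)

    def __getitem__(self, idx):
        # int() also accepts NumPy integer indices from a shuffled index array.
        record = self._records[int(idx)]
        return self._template.format(**record) + self._eos_token


records = [
    {"instruction": "Extract the city.", "input": "I live in Oslo.", "output": "Oslo"},
]
# "</s>" is a placeholder; use the real tokenizer's EOS string, and check
# whether the tokenizer already appends it during encoding.
dset = TemplatedDataset(records, TEMPLATE, eos_token="</s>")
print(dset[0])
```

Passing `dset` to a loop like `iterate_batches` would then tokenize the formatted prompt instead of the raw entry, without changing the loop itself.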
