How to SFT a diffusion large language model?
Posted by ProfessionalGuess884@reddit | LocalLLaMA | View on Reddit | 4 comments
I’m wondering if there’s any way to perform SFT (Supervised Fine-Tuning) on a diffusion-based large language model.
If anyone has experience with this, could you please share your insights?
F4k3r22@reddit
Okay, I'm working on a project where I'm building a Large Language Diffusion Model from scratch, and the SFT process is almost the same as pre-training (according to the LLaDA paper). You take pairs of prompts and their respective responses. You leave the prompt as is (you do NOT mask it), but you mask the response using a Bernoulli variable for each position, with probability t of masking and 1 − t of leaving the token alone.
Here, t is sampled uniformly between 0 and 1: when t is close to 0, only a few tokens of the response get masked (easy case); when t is close to 1, almost the entire response gets masked (hard case). This way you don't always mask everything, the model learns to condition its behavior on the prompt, and the loss only penalizes the masked response positions, pushing the model toward the expected response from the pairs.
And for masking, you'll use the mask_token_id that comes with the model and its tokenizer, so don't try to invent a new token for that.
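Roughly, building one training example looks like this. This is a minimal PyTorch sketch based on my description above, not code from the LLaDA repo; the function and variable names are just illustrative:

```python
import torch

def build_sft_example(prompt_ids, response_ids, mask_token_id):
    # Sample the masking ratio t uniformly from (0, 1)
    t = torch.rand(1).item()

    # Bernoulli(t) decision per response position: True = replace with the mask token
    masked = torch.bernoulli(torch.full((response_ids.numel(),), t)).bool()

    noisy_response = response_ids.clone()
    noisy_response[masked] = mask_token_id

    # Model input: clean (unmasked) prompt followed by the partially masked response
    input_ids = torch.cat([prompt_ids, noisy_response])

    # Labels: compute the loss only on the masked response positions
    labels = torch.full_like(input_ids, -100)  # -100 = ignore index for cross-entropy
    labels[prompt_ids.numel():][masked] = response_ids[masked]

    # t is returned so the loss can be reweighted (the LLaDA paper scales it by 1/t)
    return input_ids, labels, t
```

At training time you feed input_ids to the model and take cross-entropy against labels, so only the masked response tokens contribute to the gradient.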
I hope this helps you understand it a little better.
F4k3r22@reddit
If you want to see how my project to create a Large Language Diffusion Model from scratch is going, I'll leave you the GitHub repo. I'm still implementing the pre-training script, and then I'm going to create another one for the SFT. Repo: https://github.com/F4k3r22/LLaDA-from-scratch
Top-Effort677@reddit
Is it possible to perform PEFT for the SFT of LLaDA?
F4k3r22@reddit
I reviewed the paper and looked for more information, but there is almost nothing about doing PEFT for the SFT stage; almost all the fine-tuning in the paper was done on data mixed with long chain-of-thought.
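That said, nothing in the masked-response objective itself stops you from trying parameter-efficient tuning on top of it. Here's a minimal, untested sketch using the Hugging Face peft library; the checkpoint name and the target module names are assumptions (print the model and adjust to the real module names before using this):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed checkpoint name; swap in whichever LLaDA weights you actually use
model_name = "GSAI-ML/LLaDA-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # These module names are a guess; inspect print(model) and adjust
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here you'd run the same masked-response SFT loop described above,
# with only the LoRA adapters receiving gradients.
```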