By compressing the idea of RLHF into a single diagram, a lot of information is lost and it gets confusing and somewhat inaccurate. The lineage from the initial model (actually one that has already been through SFT, but that detail was dropped to make the diagram smaller) through "Generate Response" > "Human Evaluator" > "Reward Model" is fine. It does get you a reward model. But what's happening with the branching here? Is the finetune based on "Generate Responses" and "Human Evaluator" combined?? Why does it branch off before it reaches "Reward Model"? That doesn't really allow for a coherent reading of the diagram. Is "Human Evaluator" needed for every step of the training, even if we assume the branch comes from "Reward Model" and not from the space between it and "Human Evaluator"? Well, it sits in the lineage for every step, so you might assume so based on the graph.
[Here's](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2023/08/31/ML-14874_image001.jpg) an example of a diagram that actually explains it in a great way.
The best single-loop diagram I found is from [Wikipedia](https://upload.wikimedia.org/wikipedia/commons/b/b2/RLHF_diagram.svg), but it is way harder to read than the one from AWS.
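To make the "is the human needed at every step?" question concrete, here is a toy sketch of the three stages as they are usually described. Everything in it is made up for illustration: `sft_model`, `HumanEvaluator`, `train_reward_model` and `rl_finetune` are hypothetical stand-ins, the "reward model" is just a lookup table, and the last stage only scores samples instead of doing a real policy update (real setups typically use PPO or something similar). The only point is where the human shows up.

```python
import random


def sft_model(prompt, rng):
    # Hypothetical stand-in for the model after SFT: returns a response string
    # ending in a random digit, which we treat as its "quality".
    return f"{prompt} -> answer {rng.randint(0, 9)}"


class HumanEvaluator:
    # Toy labeler that counts how often it is consulted.
    def __init__(self):
        self.calls = 0

    def prefer(self, a, b):
        self.calls += 1
        # Toy preference: the response with the higher trailing digit wins.
        return a if int(a[-1]) >= int(b[-1]) else b


def collect_preferences(prompts, human, rng):
    # Stage 1: generate pairs of responses and have a human pick the better one.
    pairs = []
    for prompt in prompts:
        a, b = sft_model(prompt, rng), sft_model(prompt, rng)
        chosen = human.prefer(a, b)
        rejected = b if chosen is a else a
        pairs.append((chosen, rejected))
    return pairs


def train_reward_model(pairs):
    # Stage 2: fit a (toy) reward model on the human preferences.
    scores = {}
    for chosen, rejected in pairs:
        scores[chosen[-1]] = scores.get(chosen[-1], 0) + 1
        scores[rejected[-1]] = scores.get(rejected[-1], 0) - 1
    return lambda response: scores.get(response[-1], 0)


def rl_finetune(prompts, reward_model, rng, steps=100):
    # Stage 3: placeholder for the RL loop. The reward comes from the
    # reward model, not from a human. No actual policy update happens here.
    total_reward = 0
    for _ in range(steps):
        response = sft_model(rng.choice(prompts), rng)
        total_reward += reward_model(response)
    return total_reward


rng = random.Random(0)
human = HumanEvaluator()
prompts = ["q1", "q2", "q3"]

pairs = collect_preferences(prompts, human, rng)
calls_after_labeling = human.calls

reward_model = train_reward_model(pairs)
rl_finetune(prompts, reward_model, rng, steps=100)

# The human was consulted once per prompt during labeling and never again.
print(calls_after_labeling, human.calls)
```

In this sketch the human is called once per prompt while collecting preferences and not at all during the 100 "RL" steps, which is the part the branching in the compressed diagram makes hard to see.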
And that's the origin of the future posts saying "im-also-a-good-gpt2-chatbot got lobotomized, it used to be able to make these diagrams perfectly, now it's printing flawed diagrams"
I mean yeah, it's flawed. I was more impressed by the attempt than the exact execution though, because I have not seen that before in any other model unless I specifically asked for it. Here it was just part of its natural answer to the prompt shown in the top-right.
10 Comments
FullOf_Bad_Ideas@reddit
Healthy-Nebula-3603@reddit
FullOf_Bad_Ideas@reddit
hudimudi@reddit
VectorD@reddit
Enfiznar@reddit
dubesor86@reddit (OP)
Randomhkkid@reddit
CodeMurmurer@reddit
SUPR3M3Kai@reddit