Amazing work, congratulations!
In the paper I found:
>To train our TÜLU 3 models, we used between 4 and 16 8xH100 nodes with high speed interconnect
and:
>The PPO runtime is roughly 28 hours using two nodes
Could you share information about the number of nodes used and the duration in hours for remaining stages of the training recipe? I looked through the paper, but couldn't find this information. In [tulu3.md](https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md) there are commands for 8 nodes, but no information about execution time.
It would help people estimate the costs of replicating the training recipe run. Thanks!
Yeah lemme work on this, will add it to the paper.
DPO is pretty quick because there are fewer tokens. Like 12 hours or less on 2 nodes at 8B, \~24 hours on 4 nodes at 70B.
RL can really take a long time, depending on how long you want it to run.
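For anyone pricing this out, the figures quoted above convert to GPU-hours as nodes × GPUs per node × hours. Here is a rough sketch, assuming every node is the 8xH100 kind from the paper quote and taking the quoted durations at face value. The hourly price is a made-up placeholder, not a real quote, and the model size for the PPO figure is not stated in the quote.

```python
GPUS_PER_NODE = 8  # from the paper quote: 8xH100 nodes

# (stage, nodes, hours) as quoted in this thread; durations are rough
STAGES = [
    ("DPO 8B", 2, 12),    # "12 hours or less on 2 nodes"
    ("DPO 70B", 4, 24),   # "~24 hours at 4 nodes"
    ("PPO", 2, 28),       # "roughly 28 hours using two nodes"
]

# Hypothetical rental price per H100-hour; replace with your provider's rate.
HYPOTHETICAL_USD_PER_GPU_HOUR = 2.50


def gpu_hours(nodes, hours, gpus_per_node=GPUS_PER_NODE):
    """Total GPU-hours for one stage."""
    return nodes * gpus_per_node * hours


def estimate_cost(nodes, hours, usd_per_gpu_hour):
    """Rough rental cost for one stage at a flat per-GPU hourly rate."""
    return gpu_hours(nodes, hours) * usd_per_gpu_hour


for name, nodes, hours in STAGES:
    total = gpu_hours(nodes, hours)
    cost = estimate_cost(nodes, hours, HYPOTHETICAL_USD_PER_GPU_HOUR)
    print(f"{name}: {total} GPU-hours, about ${cost:,.0f} at the placeholder rate")
```

That works out to 192, 768, and 448 H100-hours for the three stages. SFT and the remaining stages are missing because no numbers for them were shared here.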
I am very impressed. I started with the usual "how are you" and the model felt it had to make a CoT reply on the only fact in the prompt (in SillyTavern my user card says "a woman in her forties"). It was quite a good set of considerations on women in their forties.
Apart from this fun start, I am impressed at how well the model takes meta-instructions from the conversation. I asked to act as a catgirl; it replied with a conversation between a catgirl persona and me. I told it to limit its replies to the catgirl persona and end with an EOS token. It never made the mistake again in 8,000 tokens of conversation. Very good, getting these meta things right is never a given. I see continuous improvements and this fine-tune moves in the right direction.
I checked this model in farel-bench and it performed a bit better than llama-3.1 70b, on par with qwen 2.5 72b. But it also made two errors in basic quizzes, so I have mixed feelings. Tomorrow I'll try it with CoT.
This model looks like utter trash if judging by these benchmarks. The “average score” is higher, but:
* no significant difference in the most valuable entries compared to Llama Instruct.
* degraded MMLU and HumanEval, so it's useless if you want a model that's smarter at coding/facts; Qwen is really outstanding here despite worse results in other entries.
* big leap on “Safety”, so it's useless censored garbage overall (funnily enough, it looks like this is the entry that contributes most to the “growth” of the model's “average score”)
Thankfully your angry rant at something that was given to you for free ends with "useless censored garbage" so I can immediately disqualify your opinion.
Looks like it is mostly large improvements in "PopQA" and "Safety" scores while sacrificing other scores, so kind of the opposite of useful benchmark improvements.
I think the benchmark scores are not very useful.
Plus, this model doesn't even use the ChatML format, so into the trash it goes....
(ChatML is used in many ways, including in MLLMs. I bet they didn't even consider any usage beyond benchmarking.)
For reasons I do not understand this model as 70b q4km does seem to work distributed using llamacpp RPC. I was running it over network at 12tok/sec. Seems like a good model too.
I uploaded GGUF quants for 8B to https://huggingface.co/unsloth/Llama-3.1-Tulu-3-8B-GGUF
70B quants ongoing! Also 4bit bitsandbytes to https://huggingface.co/unsloth/Llama-3.1-Tulu-3-8B-bnb-4bit
SOTA doesn't mean "the best", it means "The best, at this moment." In any arena where technology marches at a rapid pace, there have to be terms to indicate the state of things. The term has been around since the 1910s or thereabouts, used by both snek-oil salesmen and genuine inventors alike.
70B not as good as llama 3.1 8B in this case:
llama 3.1 8B:
You > how many R's are there in strawberry?
A.I. > There are 2 R's in the word "strawberry".
You > double check it
A.I. > There are actually 3 R's in the word "strawberry".
https://preview.redd.it/xn6wt7dnuc2e1.png?width=1008&format=png&auto=webp&s=809414fd31c97dc4887389d655b4f6d39a8ab2fd
Interesting read about state-of-the-art finetuning. Any advice on smaller-scale finetuning methods for companies that want to finetune LLMs to their specific needs?
Please change the title and say "Llama 3 finetunes" instead of "models" so as not to confuse people.
This will not diminish the work done at all, and will increase trust.
The official model name does include Llama-3.1 as you can see in the HF repos: [Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B/tree/main) & [Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B).
actually not, on HF they have the right prefix [https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B)
That's great. Finetuning VLMs is something I've found to be quite hard compared to LLMs. And there's huge potential there. Recent open VLMs are honestly amazing, seemingly better than the proprietary alternatives. And being able to finetune them with domain specific knowledge really takes them to the next level.
Most of my use cases are quite niche, so finetuning is a bit of a must. Which is why I've had limited use out of the recent VLMs released. A finetuned Qwen2-VL feels like it's going to be killer. So I'm looking forward to trying it out.
Being able to finetune just specific layers is also quite interesting. I assume that is a good way to teach new knowledge without hurting the general vision abilities. Which seems useful.
For anyone interested, this podcast video released today goes more into what went into the training of Tulu 3: [https://youtu.be/LVXtFnEbNU0](https://youtu.be/LVXtFnEbNU0)
Thanks I'll have to look at this later. All I know is we need to come up with some kind of trivia game that makes it fun to help build better training data for small shops like this. Anyone with me?
They're Llama 3 fine-tunes. For a moment I thought they're new models from scratch and that they're releasing the dataset and recipe for pretraining like open coder.
Trained on top of Llama 3 **base**, so they're guaranteed to be different from Llama 3 *instruction*. Plus, releasing the finetuning dataset and code makes it much easier to train on your own dataset, because you don't have to worry as much about matching the undisclosed data distribution...and you can train other foundation models on the same dataset if you like this one.
Plus, it looks like a good reference for how to do your own finetune from scratch at scale, which is handy.
Yes, Tulu is Allen AI's finetuned model. They also have an entirely open, made-from-scratch model named OLMo. Based on recent PRs to transformers and llama.cpp, it seems that model is also getting an update at some point this month.
Allen AI is in general one of the most open AI organizations around right now.
Nice, I love papers like this one which specify what worked and what didn't. Giving the specific training hyperparameters and data mixtures is incredibly helpful.
I've been very impressed by Allen AI's recent Molmo VLM and their general stance on model training and transparency, so I'm excited to try this model out. I remember their Llama 2 finetune being quite good as well.