TheaterFire

Tülu 3 -- a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms

Posted by Xhehab_@reddit | LocalLLaMA | View on Reddit | 42 comments

Reply to Post

42 Comments

robotphilanthropist@reddit

Hey -- co-lead here. All I will add to start is: OLMo soon as well.
View on Reddit #41188529

fairydreaming@reddit

Amazing work, congratulations! In the paper I found: >To train our TÜLU 3 models, we used between 4 and 16 8xH100 nodes with high speed interconnect and: >The PPO runtime is roughly 28 hours using two nodes Could you share information about the number of nodes used and the duration in hours for remaining stages of the training recipe? I looked through the paper, but couldn't find this information. In [tulu3.md](https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md) there are commands for 8 nodes, but no information about execution time. It would help people estimate the costs of replicating the training recipe run. Thanks!
View on Reddit #41194631

robotphilanthropist@reddit

Yeah lemme work on this, will add it to the paper. DPO is pretty quick because fewer tokens. Like 12 hours or less on 2 nodes at 8B, \~24 hours at 4 nodes on 70b. RL can really be a long time depending how long you want it to run.
View on Reddit #41291646

Billy462@reddit

I really love all the work you guys are doing. Learned so much from the olmo paper and code
View on Reddit #41228338

anemone_armada@reddit

I am very impressed. I started with the usual "how are you" and the model felt it had to make a CoT reply on the only fact in the prompt (in SillyTavern my user card says "a woman in her fourties). It was a quite good set of considerations on women in their forties. Apart from this fun start, I am impressed at how well the model takes meta-instructions from the conversation. I asked to act as a catgirl; it replied with a conversation between a catgirl persona and me. I told it to limit its replies to the catgirl persona and end with an EOS token. It never made the mistake again in 8,000 tokens of conversation. Very good, getting these meta things right is never a given. I see continuous improvements and this fine-tune moves in the right direction.
View on Reddit #41284755

Xhehab_@reddit (OP)

https://preview.redd.it/3oy7liozja2e1.png?width=1386&format=png&auto=webp&s=8342c5b9a3c1e2ce9bc4fab742485edcd7c6b930 Benchmarks ***TL;DR:*** *8B surpasses Qwen 2.5 7B Instruct* *70B surpasses Qwen 2.5 72B Instruct, GPT-4o Mini, Claude 3.5 Haiku*
View on Reddit #41173667

fairydreaming@reddit

I checked this model in farel-bench and it performed a bit better than llama-3.1 70b, on par with qwen 2.5 72b. But it also made two errors in basic quizzes, so I have mixed feelings. Tomorrow I'll try it with CoT.
View on Reddit #41274722

Wynneve@reddit

This model looks like utter trash if judging by these benchmarks. The “average score” is higher, but: * no significant difference among most valuable entries, if compared to Llama instruct. * degraded MMLU and HumanEval, so it's useless if you want a model smarter in coding/facts; Qwen is really outstanding here despise worse results in other entries. * big leap on “Safety”, so it's useless censored garbage overall (funnily enough, it looks like this is the entry that contributes most to the “growth” of the model's “average score”)
View on Reddit #41234704

bbsss@reddit

Thankfully your angry rant at something that was given to you for free ends with "useless censored garbage" so I can immediatly disqualify your opinion.
View on Reddit #41237451

LetterRip@reddit

Looks like it is mostly large improvements in "PopQA" and "Safety" scores while sacrificing other scores, so kind of the opposite of useful benchmark improvements.
View on Reddit #41182553

robotphilanthropist@reddit

We made sure to beat Llama on average without including safety... a lame benchmark to be the only one you win on.
View on Reddit #41188587

LewisJin@reddit

I think the benchmark scores is not very useful. plus, this model even not using chatml format, move it into trash.... (chatml used in many way including mllm, i bet they not even consider further usage except benchmarking.)
View on Reddit #41249015

Xhehab_@reddit (OP)

8B model: [https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B…](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) 70B model: [https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B…](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) Try it out: [https://playground.allenai.org](https://playground.allenai.org/) Learn more: [https://allenai.org/tulu](https://allenai.org/tulu)
View on Reddit #41173241

noneabove1182@reddit

I've got the SFT and DPO versions up in GGUF in case any one wants to compare:  https://huggingface.co/bartowski?search_models=Tulu-3
View on Reddit #41243411

Billy462@reddit

For reasons I do not understand this model as 70b q4km does seem to work distributed using llamacpp RPC. I was running it over network at 12tok/sec. Seems like a good model too.
View on Reddit #41234532

danielhanchen@reddit

I uploaded GGUF quants for 8B to https://huggingface.co/unsloth/Llama-3.1-Tulu-3-8B-GGUF 70B quants ongoing! Also 4bit bitsandbytes to https://huggingface.co/unsloth/Llama-3.1-Tulu-3-8B-bnb-4bit
View on Reddit #41185540

robotphilanthropist@reddit

Wow beat us to it at Ai2. We are excited to play a bit more in the local space next year.
View on Reddit #41188678

danielhanchen@reddit

Excited for all the Ai2 releases!! Keep up the fantastic work!
View on Reddit #41217181

EntertainmentBroad43@reddit

DROP scores are strange for qwen 72b and gpt4o mini. It can’t be worse than gpt3.5.
View on Reddit #41215862

LosEagle@reddit

Does the "state-of-the-art" claim carry any weight these days? I see it thrown around so often it feels more like a buzzword.
View on Reddit #41208302

Sabin_Stargem@reddit

SOTA doesn't mean "the best", it means "The best, at this moment." In any arena where technology marches at a rapid pace, there has to be terms to indicate the state of things. The term has been around since the 1910's or thereabouts, used by both snek-oil salesmen and genuine inventors alike.
View on Reddit #41213613

foldl-li@reddit

70B not as good as llama 3.1 8B in this case: llama 3.1 8B: You > how many R's are there in strawberry? A.I. > There are 2 R's in the word "strawberry". You > double check it A.I. > There are actually 3 R's in the word "strawberry". https://preview.redd.it/xn6wt7dnuc2e1.png?width=1008&format=png&auto=webp&s=809414fd31c97dc4887389d655b4f6d39a8ab2fd
View on Reddit #41207641

Any-Conference1005@reddit

The title is highly misleading and disappointing.
View on Reddit #41202394

Maokawaii@reddit

Interesting read about state of the art finetuning, any advice for smaller scale fine tuning methods for companies willing to fine tune LLMs to their specific needs?
View on Reddit #41197891

fairydreaming@reddit

Amazing work, congratulations! In the paper I found: >To train our TÜLU 3 models, we used between 4 and 16 8xH100 nodes with high speed interconnect and: >The PPO runtime is roughly 28 hours using two nodes Could you share information about the number of nodes used and the duration in hours for remaining stages of the training recipe? I looked through the paper, but couldn't find this information. In [tulu3.md](https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md) there are commands for 8 nodes, but no information about execution time. It would help people to estimate the cost needed to replicate the training run. Thanks!
View on Reddit #41193427

vasileer@reddit

please change the title and say "LLama3 finetunes" instead of "models" to not confuse people, this will not diminish the work done at all, and will increase trust
View on Reddit #41175685

robotphilanthropist@reddit

<3
View on Reddit #41188642

acr_vp@reddit

The model also goes against meta's license, it needs to have llama in the model name, not that I care, but it highlights a basic competency.
View on Reddit #41186506

mikael110@reddit

The official model name does include Llama-3.1 as you can see in the HF repos: [Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B/tree/main) & [Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B).
View on Reddit #41187047

vasileer@reddit

actually not, on HF they have the right prefix [https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B)
View on Reddit #41186682

Sky_Linx@reddit

Basically every model I am ready about lately is "state of the art". LOL
View on Reddit #41187696

mikael110@reddit

That's great. Finetuning VLMs is something I've found to be quite hard compared to LLMs. And there's huge potential there. Recent open VLMs are honestly amazing, seemingly better than the proprietary alternatives. And being able to finetune them with domain specific knowledge really takes them to the next level. Most of my use cases is quite niche so finetuning is a bit of a must. Which is why I've had limited use out of the recent VLMs released. A finetuned Qwen2-VL feels like it's going to be killer. So I'm looking forward to trying it out. Being able to finetune just specific layers is also quite interesting. I assume that is a good way to teach new knowledge without hurting the general vision abilities. Which seems useful.
View on Reddit #41186545

phoneixAdi@reddit

For anyone interested, this podcast video released today, goes more into what went into the training of Tulu 3: [https://youtu.be/LVXtFnEbNU0](https://youtu.be/LVXtFnEbNU0)
View on Reddit #41177586

MatlowAI@reddit

Thanks I'll have to look at this later. All I know is we need to come up with some kind of trivia game that makes it fun to help build better training data for small shops like this. Anyone with me?
View on Reddit #41184470

tucnak@reddit

The guy is a 100x fuck me.
View on Reddit #41183325

FullstackSensei@reddit

They're Llama 3 fine-tunes. For a moment I thought they're new models from scratch and that they're releasing the dataset and recipe for pretraining like open coder.
View on Reddit #41175299

jd_3d@reddit

Seems a bit dishonest to not mention Llama-3.1 in the marketing graphics.
View on Reddit #41181448

AutomataManifold@reddit

Trained on top of Llama 3 **base**, so they're guaranteed to be different from Llama 3 *instruction*. Plus, releasing the finetuning dataset and code makes it much easier to train your own dataset, because you don't have to worry as much about matching the undisclosed data distribution...and you can train other foundation models on the same dataset if you like this one. Plus, it looks like a good reference for how to do your own finetune from scratch at scale, which is handy.
View on Reddit #41176804

mikael110@reddit

Yes, Tulu is Allen AI's finetuned model. They do also have an entirely open made from scratch model named Olmo. Based on recent PRs to transformers and llama.cpp it seems that model is also getting an update at some point this month. Allen AI is in general one of the most open AI organizations around right now.
View on Reddit #41175923

HideLord@reddit

Nice, I love papers like this one which specify what worked and what didn't. Giving the specific training hyperparamerts and data mixtures is incredibly helpful.
View on Reddit #41177885

mikael110@reddit

I've been very impressed by Allen AI's recent Molmo VLM and their general stance to model training and transparency, so I'm excited to try this model out. I remember their Llama 2 finetune being quite good as well.
View on Reddit #41177815

mikael110@reddit

I've been very impressed by Allen AI's recent Molmo VLM and their general stance to model training and transparency, so I'm excited to try this model out.
View on Reddit #41176291