Tülu 3 -- a set of state-of-the-art instruct models with fully open data, eval code, and training algorithms

robotphilanthropist@reddit

Hey -- co-lead here. All I will add to start is: OLMo soon as well.

Amazing work, congratulations! In the paper I found:

> To train our TÜLU 3 models, we used between 4 and 16 8xH100 nodes with high speed interconnect

and:

> The PPO runtime is roughly 28 hours using two nodes

Could you share information about the number of nodes used and the duration in hours for remaining stages of the training recipe? I looked through the paper, but couldn't find this information. In [tulu3.md](https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md) there are commands for 8 nodes, but no information about execution time. It would help people estimate the costs of replicating the training recipe run. Thanks!

robotphilanthropist@reddit

Yeah, lemme work on this; will add it to the paper. DPO is pretty quick because there are fewer tokens: about 12 hours or less on 2 nodes at 8B, ~24 hours on 4 nodes at 70B. RL can really take a long time, depending on how long you want it to run.
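
For anyone trying to turn these figures into a budget, here is a rough back-of-the-envelope sketch. It assumes 8 GPUs per node (the paper excerpt quoted earlier in the thread mentions 8xH100 nodes) and treats the DPO hours as approximate. The function names and the price per GPU-hour are hypothetical placeholders, not numbers from Ai2; plug in whatever your provider charges.

```python
# Back-of-the-envelope GPU-hour arithmetic for the stage timings quoted in this thread.
# Assumption: every node has 8 GPUs ("8xH100 nodes" in the quoted paper excerpt).
GPUS_PER_NODE = 8

# (stage, nodes, wall-clock hours) as stated in the thread.
# The DPO hours are rough figures ("12 hours or less", "~24 hours");
# the thread does not say which model size the PPO figure refers to.
STAGES = [
    ("DPO 8B", 2, 12),
    ("DPO 70B", 4, 24),
    ("PPO (size not stated)", 2, 28),
]


def gpu_hours(nodes, hours, gpus_per_node=GPUS_PER_NODE):
    """Total GPU-hours = nodes * GPUs per node * wall-clock hours."""
    return nodes * gpus_per_node * hours


def estimate_cost(nodes, hours, usd_per_gpu_hour):
    """Cost estimate; usd_per_gpu_hour is a user-supplied, hypothetical rate."""
    return gpu_hours(nodes, hours) * usd_per_gpu_hour


for name, nodes, hours in STAGES:
    print(f"{name}: {gpu_hours(nodes, hours)} GPU-hours")
```

Multiplying the GPU-hours by your own hourly rate via `estimate_cost` gives a first-order cost for each stage. It ignores SFT, evaluation, and failed runs, none of which have timings in this thread.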

Billy462@reddit

I really love all the work you guys are doing. Learned so much from the OLMo paper and code.

anemone_armada@reddit

I am very impressed. I started with the usual "how are you" and the model felt it had to make a CoT reply on the only fact in the prompt (in SillyTavern my user card says "a woman in her forties"). It was quite a good set of considerations on women in their forties. Apart from this fun start, I am impressed at how well the model takes meta-instructions from the conversation. I asked it to act as a catgirl; it replied with a conversation between a catgirl persona and me. I told it to limit its replies to the catgirl persona and end with an EOS token. It never made the mistake again in 8,000 tokens of conversation. Very good, getting these meta things right is never a given. I see continuous improvements and this fine-tune moves in the right direction.

Xhehab_@reddit (OP)

https://preview.redd.it/3oy7liozja2e1.png?width=1386&format=png&auto=webp&s=8342c5b9a3c1e2ce9bc4fab742485edcd7c6b930

Benchmarks

***TL;DR:***

*8B surpasses Qwen 2.5 7B Instruct*

*70B surpasses Qwen 2.5 72B Instruct, GPT-4o Mini, Claude 3.5 Haiku*

fairydreaming@reddit

I checked this model in farel-bench and it performed a bit better than llama-3.1 70b, on par with qwen 2.5 72b. But it also made two errors in basic quizzes, so I have mixed feelings. Tomorrow I'll try it with CoT.

Wynneve@reddit

This model looks like utter trash judging by these benchmarks. The “average score” is higher, but:

* no significant difference among the most valuable entries compared to Llama instruct.
* degraded MMLU and HumanEval, so it's useless if you want a model smarter in coding/facts; Qwen is really outstanding here despite worse results in other entries.
* big leap on “Safety”, so it's useless censored garbage overall (funnily enough, it looks like this is the entry that contributes most to the “growth” of the model's “average score”)

bbsss@reddit

Thankfully, your angry rant at something that was given to you for free ends with "useless censored garbage", so I can immediately disqualify your opinion.

LetterRip@reddit

Looks like it is mostly large improvements in "PopQA" and "Safety" scores while sacrificing other scores, so kind of the opposite of useful benchmark improvements.

robotphilanthropist@reddit

We made sure to beat Llama on average without including safety... a lame benchmark to be the only one you win on.

LewisJin@reddit

I don't think the benchmark scores are very useful. Plus, this model doesn't even use the ChatML format, so into the trash it goes... (ChatML is used in many ways, including in MLLMs; I bet they didn't even consider any usage beyond benchmarking.)

Xhehab_@reddit (OP)

8B model: [https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B…](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B)

70B model: [https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B…](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B)

Try it out: [https://playground.allenai.org](https://playground.allenai.org/)

Learn more: [https://allenai.org/tulu](https://allenai.org/tulu)

noneabove1182@reddit

I've got the SFT and DPO versions up in GGUF in case any one wants to compare: https://huggingface.co/bartowski?search_models=Tulu-3

Billy462@reddit

For reasons I do not understand, the 70B Q4_K_M quant of this model does seem to work distributed over llama.cpp RPC. I was running it over the network at 12 tok/sec. Seems like a good model too.

danielhanchen@reddit

I uploaded GGUF quants for 8B to https://huggingface.co/unsloth/Llama-3.1-Tulu-3-8B-GGUF 70B quants ongoing! Also 4bit bitsandbytes to https://huggingface.co/unsloth/Llama-3.1-Tulu-3-8B-bnb-4bit

robotphilanthropist@reddit

Wow beat us to it at Ai2. We are excited to play a bit more in the local space next year.

danielhanchen@reddit

Excited for all the Ai2 releases!! Keep up the fantastic work!

EntertainmentBroad43@reddit

DROP scores are strange for qwen 72b and gpt4o mini. It can’t be worse than gpt3.5.

LosEagle@reddit

Does the "state-of-the-art" claim carry any weight these days? I see it thrown around so often it feels more like a buzzword.

Sabin_Stargem@reddit

SOTA doesn't mean "the best", it means "the best, at this moment." In any arena where technology marches at a rapid pace, there have to be terms to indicate the state of things. The term has been around since the 1910s or thereabouts, used by both snek-oil salesmen and genuine inventors alike.

foldl-li@reddit

70B not as good as llama 3.1 8B in this case:

llama 3.1 8B:

You > how many R's are there in strawberry?

A.I. > There are 2 R's in the word "strawberry".

You > double check it

A.I. > There are actually 3 R's in the word "strawberry".

https://preview.redd.it/xn6wt7dnuc2e1.png?width=1008&format=png&auto=webp&s=809414fd31c97dc4887389d655b4f6d39a8ab2fd
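
For reference, the letter count being tested here is easy to verify directly. A minimal Python sketch (not part of the original exchange, just a sanity check on the expected answer):

```python
# Count occurrences of the letter "r" in "strawberry", case-insensitively.
word = "strawberry"
r_count = word.lower().count("r")
print(r_count)  # prints 3
```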

Any-Conference1005@reddit

The title is highly misleading and disappointing.

Maokawaii@reddit

Interesting read about state-of-the-art finetuning. Any advice on smaller-scale fine-tuning methods for companies looking to fine-tune LLMs to their specific needs?

fairydreaming@reddit

Amazing work, congratulations! In the paper I found:

> To train our TÜLU 3 models, we used between 4 and 16 8xH100 nodes with high speed interconnect

and:

> The PPO runtime is roughly 28 hours using two nodes

Could you share information about the number of nodes used and the duration in hours for remaining stages of the training recipe? I looked through the paper, but couldn't find this information. In [tulu3.md](https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md) there are commands for 8 nodes, but no information about execution time. It would help people to estimate the cost needed to replicate the training run. Thanks!

vasileer@reddit

Please change the title and say "Llama 3 finetunes" instead of "models" so as not to confuse people. This will not diminish the work done at all, and will increase trust.

robotphilanthropist@reddit

<3

acr_vp@reddit

The model also goes against Meta's license: it needs to have Llama in the model name. Not that I care, but it highlights a basic competency.

mikael110@reddit

The official model name does include Llama-3.1 as you can see in the HF repos: [Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B/tree/main) & [Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B).

vasileer@reddit

actually not, on HF they have the right prefix [https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B)

Sky_Linx@reddit

Basically every model I am reading about lately is "state of the art". LOL

mikael110@reddit

That's great. Finetuning VLMs is something I've found to be quite hard compared to LLMs. And there's huge potential there. Recent open VLMs are honestly amazing, seemingly better than the proprietary alternatives. And being able to finetune them with domain-specific knowledge really takes them to the next level. Most of my use cases are quite niche, so finetuning is a bit of a must. Which is why I've had limited use out of the recent VLMs released. A finetuned Qwen2-VL feels like it's going to be killer. So I'm looking forward to trying it out. Being able to finetune just specific layers is also quite interesting. I assume that is a good way to teach new knowledge without hurting the general vision abilities. Which seems useful.

phoneixAdi@reddit

For anyone interested, this podcast video released today goes more into what went into the training of Tulu 3: [https://youtu.be/LVXtFnEbNU0](https://youtu.be/LVXtFnEbNU0)

MatlowAI@reddit

Thanks I'll have to look at this later. All I know is we need to come up with some kind of trivia game that makes it fun to help build better training data for small shops like this. Anyone with me?

tucnak@reddit

The guy is a 100x fuck me.

FullstackSensei@reddit

They're Llama 3 fine-tunes. For a moment I thought they're new models from scratch and that they're releasing the dataset and recipe for pretraining like open coder.

jd_3d@reddit

Seems a bit dishonest to not mention Llama-3.1 in the marketing graphics.

AutomataManifold@reddit

Trained on top of Llama 3 **base**, so they're guaranteed to be different from Llama 3 *instruction*. Plus, releasing the finetuning dataset and code makes it much easier to train your own dataset, because you don't have to worry as much about matching the undisclosed data distribution...and you can train other foundation models on the same dataset if you like this one. Plus, it looks like a good reference for how to do your own finetune from scratch at scale, which is handy.

mikael110@reddit

Yes, Tulu is Allen AI's finetuned model. They also have an entirely open, made-from-scratch model named OLMo. Based on recent PRs to transformers and llama.cpp, it seems that model is also getting an update at some point this month. Allen AI is in general one of the most open AI organizations around right now.

HideLord@reddit

Nice, I love papers like this one, which specify what worked and what didn't. Giving the specific training hyperparameters and data mixtures is incredibly helpful.

mikael110@reddit

I've been very impressed by Allen AI's recent Molmo VLM and their general stance to model training and transparency, so I'm excited to try this model out. I remember their Llama 2 finetune being quite good as well.
