robotphilanthropist's Comments


robotphilanthropist@reddit

Let us know how we can improve it :)


robotphilanthropist@reddit (OP)

It's definitely an underrated model. We also tend to weigh top-end performance and how many releases a lab does. But yeah, OpenAI was at the top of our specialists, almost in the noteworthy category, and leaving it out of honorable mentions may just have been personal bias. It's a super popular model.


robotphilanthropist@reddit (OP)

We offer expert curation only; votes are too messy ;)


robotphilanthropist@reddit

All good, we know we have a lot of work to do!


robotphilanthropist@reddit

Yes! Here’s an image, but the new version of the paper also has comparison columns.


robotphilanthropist@reddit

I personally spent hours writing regexes to do this. It removes most of the samples, but across billions of tokens in pretraining and post-training it's very hard to catch everything. The real need is to generate data about the model's own identity rather than patching the long tail of regexes.
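A minimal sketch of what this kind of regex filtering can look like, assuming the goal is to drop samples where the model claims another model's identity. The patterns, model and company names, and the `filter_identity_samples` helper are hypothetical illustrations, not the actual filters used. The last sample shows the long-tail problem: a trivial paraphrase slips past the patterns.

```python
import re

# Hypothetical identity patterns for illustration only; not the real training filters.
IDENTITY_PATTERNS = [
    r"\bI am (?:ChatGPT|GPT-4|Claude|Llama)\b",
    r"\b(?:developed|created|trained) by (?:OpenAI|Anthropic|Meta)\b",
]
IDENTITY_RE = re.compile("|".join(IDENTITY_PATTERNS), re.IGNORECASE)


def filter_identity_samples(samples):
    """Split samples into (kept, dropped) depending on whether any identity pattern matches."""
    kept, dropped = [], []
    for text in samples:
        if IDENTITY_RE.search(text):
            dropped.append(text)
        else:
            kept.append(text)
    return kept, dropped


samples = [
    "I am ChatGPT, a model developed by OpenAI.",
    "The capital of France is Paris.",
    "I'm ChatGPT!",  # paraphrase: "I'm" is not covered by the "I am" pattern
]
kept, dropped = filter_identity_samples(samples)
print("kept:", kept)
print("dropped:", dropped)
```

Every new paraphrase needs another pattern, which is the long tail the comment describes.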


robotphilanthropist@reddit

We agree, and will improve this in future models. But we also now have the instruct model at 32B with no thinking tokens.


robotphilanthropist@reddit

Working on it for the new version. We changed how we handled system prompts in training and didn't have an in-loop eval for this. It's high on my list to fix in the new year :)


robotphilanthropist@reddit

Sorry, likely made by a sleep-deprived team member minutes ahead of time. We'll do better in the future!


robotphilanthropist@reddit

For one, RL fine-tuning like this has been known in industry for years, just not talked about much. We were ahead of the curve in bringing it back into the conversation, but I wouldn't say DeepSeek "copied" RLVR.
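For readers unfamiliar with RLVR (RL with verifiable rewards): the reward comes from a programmatic check against a known answer rather than from a learned reward model. A minimal sketch, assuming a binary 1.0/0.0 reward and a hypothetical convention that completions end with a line `Answer: <value>`; the answer extraction and reward scale used in actual training may differ.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the value on the last 'Answer: ...' line (hypothetical output format)."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer exactly matches the ground truth, else 0.0."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


print(verifiable_reward("2 + 2 = 4\nAnswer: 4", "4"))
print(verifiable_reward("I think it's 5.\nAnswer: 5", "4"))
print(verifiable_reward("No final answer given.", "4"))
```

In an RL loop (e.g. PPO), a score like this would stand in for the reward-model output on prompts that have checkable answers.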


robotphilanthropist@reddit

We obviously know that our Tülu 3 recipe is not a reasoning model, but our early experiments worked very well with the same formulation as reasoning models. We're going to release full reasoning models in the future; good things take time. Both instruct models and reasoning models use this type of RL.


robotphilanthropist@reddit

Need better licenses.


robotphilanthropist@reddit

Yeah, lead on post-training here. Super excited that the 13B is comparable to or even BETTER than Llama 3.1 Instruct.


robotphilanthropist@reddit

Instruct is trained with a 4096-token sequence length. Most of the tokens are in SFT. For DPO we drop the length to 2048, but it doesn't change anything because preference data is short.
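A small sketch of the per-stage length limits described here, plus a quick check of how much preference data a 2048-token cap would actually truncate. The stage names, the `MAX_LEN` table, and the example lengths are hypothetical; only the 4096 and 2048 values come from the comment.

```python
# Per-stage maximum sequence lengths; the values are the ones stated in the comment.
MAX_LEN = {"sft": 4096, "dpo": 2048}


def truncate(token_ids, stage):
    """Cut a tokenized example down to the stage's maximum sequence length."""
    return token_ids[: MAX_LEN[stage]]


def fraction_truncated(lengths, stage):
    """Fraction of examples (given their token lengths) that exceed the stage's limit."""
    if not lengths:
        return 0.0
    limit = MAX_LEN[stage]
    return sum(1 for n in lengths if n > limit) / len(lengths)


# Hypothetical token lengths (prompt + response) for a handful of preference examples.
preference_lengths = [310, 820, 1450, 960, 2600]
print(fraction_truncated(preference_lengths, "dpo"))  # only the 2600-token example is over 2048
print(len(truncate(list(range(5000)), "sft")))
```

If most preference examples are short, a check like this comes back near zero, which is why lowering the cap for DPO barely matters.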


robotphilanthropist@reddit

Hmmm, let us look! Sorry.


robotphilanthropist@reddit

Yeah, lemme work on this; will add it to the paper. DPO is pretty quick because it uses fewer tokens: 12 hours or less on 2 nodes at 8B, ~24 hours on 4 nodes at 70B. RL can take a really long time depending on how long you want it to run.
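Converting those wall-clock numbers into rough GPU-hours makes the 8B vs 70B DPO cost easier to compare. This assumes 8 GPUs per node, which the comment does not state, and treats the 8B figure as an upper bound since it is "12 hours or less".

```python
def gpu_hours(nodes, wall_clock_hours, gpus_per_node=8):
    """Total GPU-hours for a job; gpus_per_node=8 is an assumption, not a stated figure."""
    return nodes * gpus_per_node * wall_clock_hours


dpo_8b = gpu_hours(nodes=2, wall_clock_hours=12)   # upper bound ("12 hours or less")
dpo_70b = gpu_hours(nodes=4, wall_clock_hours=24)  # approximate ("~24 hours")
print(dpo_8b, dpo_70b, dpo_70b / dpo_8b)
```

Under that assumption, 70B DPO costs roughly 768 GPU-hours versus at most 192 for 8B, so at least about 4x as much.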


robotphilanthropist@reddit

Wow, beat us to it at Ai2. We are excited to play a bit more in the local space next year.


robotphilanthropist@reddit

<3


robotphilanthropist@reddit

We made sure to beat Llama on average without including safety... a lame benchmark to be the only one you win on.


robotphilanthropist@reddit

Hey, co-lead here. All I'll add to start is: OLMo soon as well.


robotphilanthropist@reddit

not a good enough model ;)


robotphilanthropist@reddit

Some general comments on what you can expect from post-training behavior:
1. Most of the data is single-turn instruction following. We want to make a v2 that is better at multi-turn.
2. A moderate focus on code/reasoning, but we can still do more.
3. Not that much on system prompts / roleplay, so curious what people find.
4. Working on verifiable instruction following (IFEval). It isn't as good as Llama 3.1-type models, but it's much better than previous OLMos.
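Point 4 refers to verifiable instruction following, where each instruction comes with a programmatic check (the IFEval style). A minimal sketch of two such checks, a word limit and required keywords; the constraint names, the `CHECKS` registry, and `follows_all` are hypothetical and not taken from IFEval's actual code.

```python
def check_max_words(response, max_words):
    """True if the response has at most `max_words` whitespace-separated words."""
    return len(response.split()) <= max_words


def check_keywords(response, keywords):
    """True if every keyword appears in the response (case-insensitive substring match)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)


# Hypothetical registry mapping constraint names to checker functions.
CHECKS = {"max_words": check_max_words, "keywords": check_keywords}


def follows_all(response, constraints):
    """constraints: list of (name, kwargs) pairs; True only if every check passes."""
    return all(CHECKS[name](response, **kwargs) for name, kwargs in constraints)


constraints = [
    ("max_words", {"max_words": 10}),
    ("keywords", {"keywords": ["open"]}),
]
print(follows_all("OLMo is a fully open language model.", constraints))
print(follows_all("OLMo is a language model trained on public data.", constraints))
```

Because each check is a pass/fail function, the same kind of check could also be used as a verifiable reward during RL.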


robotphilanthropist@reddit

I say OLMo-y, but it is up for debate. Also Olmmm M O E


robotphilanthropist@reddit

We found that OLMoE was only about 20-40% faster to fine-tune than OLMo 7B (a dense model). I suspect some of that was from rough initial fine-tuning implementations in the HF ecosystem. I didn't look closely at utilization / batch size.


robotphilanthropist@reddit

The problem is that, colloquially, distillation covers two things:
1. the technical teacher-student distillation
2. any learning from a more powerful model via synthetic data
Both are popular today; the first is the second definition, and the second is what Zuck meant.
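The two senses differ in what the student learns from. In the technical sense (1), the student is trained to match the teacher's output distribution, typically via a KL divergence between temperature-softened softmaxes; in sense (2), the student is simply fine-tuned on text the teacher generated, so no teacher logits are needed. A minimal pure-Python sketch of sense (1) for a single token position; the temperature of 2.0 and the T² scaling are common conventions, not details from the comment.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (a common convention)."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's softened prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical logits give zero loss
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

Sense (2) would replace this loss with ordinary next-token cross-entropy on teacher-written samples.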
