TheaterFire

A team from MIT built a model that scores 61.9% on ARC-AGI-PUB using an 8B LLM plus Test-Time-Training (TTT). Previous record was 42%.

Posted by jd_3d@reddit | LocalLLaMA | View on Reddit | 65 comments


NoConcert8847@reddit

This paper is heavily optimizing to solve ARC specifically. Generally applicable ideas are not directly described. However people smarter than me may benefit from the ideas described here.
View on Reddit #40327631

coumineol@reddit

I strongly believe that instead of the jack-of-all-trades-master-of-none AI of today "heavily optimizing" is the way forward.
View on Reddit #40338155

eposnix@reddit

You need to elaborate on this idea. The way forward for what? Highly optimized AI is how we've *always* done things. LLMs with very general abilities are an extremely new phenomenon, but they still aren't reliable enough to take the place of specialized AI.
View on Reddit #40345661

Slimxshadyx@reddit

3.5 was a single general model, no? I don’t think “always” is the term here
View on Reddit #40354019

Mysterious-Rent7233@reddit

No. 3.5 was a language model. A narrow AI specialized for language production.
View on Reddit #40354822

Slimxshadyx@reddit

That’s obviously what we are talking about. Language models trained for specific benchmarks vs a general language model.
View on Reddit #40354913

Mysterious-Rent7233@reddit

Look at the title. We are talking about the path to AGI. AGI will not be a language model.
View on Reddit #40355291

Many-Membership6259@reddit

Just curious what you mean by "AGI will not be a language model"? Do you mean it does not need to have NLP capabilities?
View on Reddit #41048053

Mysterious-Rent7233@reddit

Look back in the thread. They were talking about ChatGPT 3.5 which was definitively and exclusively a model that communicated in language tokens. Not video. Not audio. Not spatial. Just language tokens. Yes, of course AGI will have NLP capabilities, but it will probably learn those the same way humans do, by observing language in context, not by being fed a billion books and doing fill-in-the-blank games. We only play those games because we don't know of an efficient and bio-realistic way to train the models.
View on Reddit #41052732

Ok-Passenger6988@reddit

u/Mysterious-Rent7233 I am curious what you think about this model: [https://arxiv.org/html/2408.11039v1](https://arxiv.org/html/2408.11039v1) I believe this is the way to ASI, what are your thoughts?
View on Reddit #41260080

Many-Membership6259@reddit

Whatever happens in other threads stays in those threads. I'm just curious about your statement. What does "AGI will not be a language model" even mean? A machine learning language model by definition is a machine learning model that can process natural languages, so if it has NLP capabilities, it is a language model, even if it is NOT JUST a language model.
View on Reddit #41053686

Mysterious-Rent7233@reddit

Are you a language model?
View on Reddit #41056827

Many-Membership6259@reddit

Funny question. I'm not a model because I'm human, but yes, I am a language processing creature. Have a nice day man.
View on Reddit #41074669

BullockHouse@reddit

Wildly incorrect. In no sense are decoder-only LMs "a narrow AI specialized for language production". There are essentially zero language-specific architectural features in language models. Arguably the choice of tokenization scheme, but if you remove that, it's literally just a sequence model. It even works on images (non-1D data serialized to 1D) just fine, as seen in the original DALL-E paper. LLMs are not chatbots. They are much, much more general than the other models you mentioned.
View on Reddit #40359358

Mysterious-Rent7233@reddit

ChatGPT 3.5 did not use byte level tokenization and was not multi-modal. It was essentially useless for any non-linguistic task.
View on Reddit #40363540

DigThatData@reddit

GPT 3.5 100% used a BPE tokenizer. Also, calling GPT3 "essentially useless" is only meaningful relative to more recent techniques. You might need to (re?)-read the paper to remind yourself of the historical context. https://arxiv.org/abs/2005.14165 More importantly, your contributions to this discussion are extremely unproductive. Take a breath. Drink some water. Maybe reconsider if engaging here is worth your emotional energy.
View on Reddit #40364994

Mysterious-Rent7233@reddit

> GPT 3.5 100% used a BPE tokenizer.

A BPE tokenizer is a tokenizer that is specialized to match its dataset, which for GPT 3.5 was language (and with a bias towards western languages). For a transformer to shed its bias towards language, it would use bits or bytes as its tokens: [https://arxiv.org/abs/2105.13626](https://arxiv.org/abs/2105.13626) [https://arxiv.org/html/2406.19223v1](https://arxiv.org/html/2406.19223v1) [https://arxiv.org/abs/2401.13660](https://arxiv.org/abs/2401.13660)

> Also, calling GPT3 "essentially useless" is only meaningful relative to more recent techniques. You might need to (re?)-read the paper to remind yourself of the historical context. [https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)

Before accusing me of being unhelpful, you should read what I actually write and spend the effort to try to understand what I am saying. First you confused byte-level tokenization with BPE so you didn't understand the relationship to multi-modal, which was the context of the conversation. Now you are making the same mistake, claiming that I said that GPT 3.5 was useless. I said that it was _essentially_ useless for NON-LINGUISTIC tasks. Which is borne out by the very first sentence of the link you sent me:

> Recent work has demonstrated substantial gains on many NLP tasks and benchmarks.

The L in NLP is "Language", meaning "Linguistic". GPT 3.5 was a huge breakthrough in linguistic processing and was essentially useless for anything else. If you remember the historical context then you'll remember what a huge deal it was when GPT-4 was shown to have some very rudimentary spatial abilities. https://www.linkedin.com/pulse/pink-unicorn-chatgpt-4-proves-can-think-bryan-brownlie-dq1ye/ GPT-3 and 3.5 could not do that, because they were excellent linguistic models and nothing more. It remains the case to this day that GPT 3.5 is useless for tasks that require spatial, aural, ... awareness.

This does not seem to me to be a controversial hot take but simply common sense.
View on Reddit #40366046
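
For readers following the byte-level versus BPE point above, here is a minimal sketch of what "bytes as tokens" means. It is only an illustration, not the tokenizer of any model mentioned in the thread, and the helper name `byte_tokens` is made up for this example. The point is that text and non-text data end up in the same 256-symbol vocabulary, whereas a BPE vocabulary is learned from (and fitted to) its training corpus.

```python
def byte_tokens(data):
    """Map text or raw bytes to byte-level token ids in the range 0-255.

    Hypothetical helper for illustration only.
    """
    if isinstance(data, str):
        data = data.encode("utf-8")  # text becomes a UTF-8 byte sequence
    return list(data)                # each byte is its own token id


# Text: the accented character takes two bytes, so it becomes two tokens.
text_ids = byte_tokens("héllo")      # [104, 195, 169, 108, 108, 111]

# Non-text data (e.g. three raw pixel intensities) uses the same vocabulary.
pixel_ids = byte_tokens(bytes([0, 128, 255]))
```

The trade-off is sequence length: byte-level sequences are longer than BPE sequences for the same text.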

Ok-Passenger6988@reddit

I read what both of you have to say and u/Mysterious-Rent7233 is correct. The thing about most people is they do not actually take the time to read, and they postulate based on incorrect data. Props to you u/Mysterious-Rent7233 for staying calm and explaining correctly. +1
View on Reddit #41256693

Onesens@reddit

General AI using specialized AIs. The general one needs base levels high enough to orchestrate, understand, review, plan and iterate on results given by specialized AIs, doesn't need to be on par with the specialized ones. Different objectives, different tasks.
View on Reddit #40741387

Caffeine_Monster@reddit

Or have the models adapt at inference time. I think the key is working out how to:
a. Efficiently train models / model routers during inference (i.e. test time training)
b. Build expert model weights that don't degrade the quality of the overall model.
MoE feels like a crude step in this direction.
View on Reddit #40377469

yaosio@reddit

Isn't that how a mixture of experts works? Or are the expert models or whatever it's called more general than I think?
View on Reddit #40376878

peculiarMouse@reddit

I feel that people massively underestimate the degradation of LLMs through "heavy optimization". Truly, ML algos in the last couple of decades were single-purpose/area oriented. But what makes LLMs what they are is the broadness of their dataset. As long as your dataset is clean, it seems to scale with the ability to reason correctly in any area. It's particularly easy to notice that LLMs acquire the ability to produce grammatically and syntactically correct text in any language before they learn reasoning for even just 1 language. Right now the approach seems to be essentially a condensation of a larger LLM into a smaller one, which is why they perform better than the previous generation, trained directly from an impure dataset. But unless you can train a specialized "L for Large" model, you won't be able to guarantee the quality of the condensed dataset. So we're still tied to larger models.
View on Reddit #40347032

martinerous@reddit

> LLMs acquire ability to produce grammatically and syntactically correct text in any language before they learn reasoning for even just 1 language.

This is also a very important distinction when comparing to how we humans learn. For us, it's the opposite: we learn very basic reasoning from the environment before we learn to speak. For example, a child playing with toys would soon learn which items could not be put into a small box. So later, when the child learns to speak, they will never mess up the fact that something does not fit somewhere. Even if you gave a child a book that has stories about horses living in pockets, the child would immediately know that it's fiction and the world does not work that way. An LLM, on the other hand, might easily come up with "And then I took a horse out of my pocket to ride to the castle." For LLMs, the priority seems to be defined by the number of occurrences and not by the actual ground truth about real-world facts. So, training with highly distilled data seems to be the best we can do for now. But how much of this distilled data do we need to counterweight all the possible wrong conclusions that an LLM might hallucinate? For a more general "intelligence" we need better architectures, so that the data could be prioritized by some kind of ground truth regarding basic logic and science. Only after getting a solid core can we throw a ton of text at it and be sure that it doesn't get confused.
View on Reddit #40359465

Mysterious-Rent7233@reddit

"Narrow AI" has always existed. It's just not nearly as economically valuable as integrated AI. Who wants to pay $100M to train an AI for just one thing? And who wants to pay millions to software developers (like me) to integrate all of these narrow AIs? And who wants to deal with the issues when the narrow AIs miscommunicate the way we see daily when ChatGPT and DALL-E cannot correct images to the user's preferences because one doesn't understand languages well enough and the other doesn't understand images well enough. Narrow AI is only "the way forward" if we fail to build what the market really wants: AGI.
View on Reddit #40354788

Any_Pressure4251@reddit

No it's not. We programmers have been doing this from the start. We need a general system that can optimise for all tasks and we will get there, when these systems are embodied.
View on Reddit #40346378

redonculous@reddit

And group those smaller models into a “mix” of experts if you will…
View on Reddit #40338618

DigThatData@reddit

*every* serious ARC submission is heavily optimizing to solve ARC specifically. There's $600K at stake here.
View on Reddit #40364489

abbumm@reddit

Lmao, literally not even worth a couple annual salaries...
View on Reddit #40418163

muchcharles@reddit

The big problem is it's using a language model that was likely contaminated on answers to the public dataset, so in some ways they may be coaxing that data out.
View on Reddit #40338599

Impossible_Belt_7757@reddit

Yeah that sounds about right, glad it seems to be what I guessed: the bitter lesson, as in we need to create some form of active inference for true intelligence to emerge. The real magic of neural networks doesn’t occur in the passive inference, it occurs in the training stage. As long as the model can dynamically modify itself while in use then we’re on our way to something really cool.
View on Reddit #40331135

OfficialHashPanda@reddit

Unfortunately it's not so trivial to apply it to more general tasks, whereas in ARC you're given explicit examples and tests.
View on Reddit #40336022

Crafty-Confidence975@reddit

But it’s not a completely dead path either. A lot of problem spaces where models underperform are peppered with analogous examples with solutions! LLMs themselves make it easier to retrieve them from a corpus. I haven’t tried this myself and the paper doesn’t seem to cover it (outside of some titles in graphs), but is the test time compute method for this narrow use case a lot better than just taking a bunch of examples and injecting them into the prompt? That’s the typical way to make models seem smarter in industry for admittedly easier stuff, like plain text to SQL.
View on Reddit #40733881

OfficialHashPanda@reddit

Yeah, things like RAG are a lot better for tasks that require knowledge or are very similar to existing problems with known solutions. The ARC challenge is a little harder to use RAG-like techniques on, since the given tasks require different reasoning patterns and reliably detecting those is one of the hardest parts. The test-time finetuning technique is specifically useful for finding these patterns, where they don't take outside examples from a corpus, but instead augment the examples that are given in the specific task (adding rotations, color permutations, etc) and then finetune on that. In language tasks, finetuning is something that is already done a lot. The only real difference here is that we do augmentation from a very small amount of samples and finetune on that to learn a given task.
View on Reddit #40735086
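
A rough sketch of the augmentation step described in the comment above, for anyone who wants something concrete. This is not the paper's code: the function names are invented for illustration, and the choice to permute all ten colors (including the background) is an assumption. The key idea is that the same transformation is applied to both the input and the output grid, so the augmented pair still follows the task's hidden rule.

```python
import random


def rotate90(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def permute_colors(grid, mapping):
    """Relabel every cell's color index through the given mapping."""
    return [[mapping[c] for c in row] for row in grid]


def augment_pair(inp, out, rng):
    """Apply one random rotation and one random color permutation
    to an (input, output) demonstration pair. Hypothetical helper."""
    k = rng.randrange(4)                  # number of quarter turns
    colors = list(range(10))              # ARC uses color indices 0-9
    shuffled = colors[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(colors, shuffled))
    for _ in range(k):
        inp, out = rotate90(inp), rotate90(out)
    return permute_colors(inp, mapping), permute_colors(out, mapping)


# Usage: turn one demonstration into several training pairs.
rng = random.Random(0)
demo_in, demo_out = [[0, 1]], [[1, 0]]   # toy rule: mirror the row
augmented = [augment_pair(demo_in, demo_out, rng) for _ in range(4)]
```

Not every transformation is safe for every task (a rule that depends on "up" does not survive rotation), so a real pipeline would need to choose its augmentations with care.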

Crafty-Confidence975@reddit

Seems like maybe it is a good idea then if we have real problems that fit and have (or could have) adjacent examples we can retrieve at inference time. Especially for weaker models that are so much cheaper to run.
View on Reddit #40735301

Healthy-Nebula-3603@reddit

It is easy, but the problem is the absurd compute power needed for it...
View on Reddit #40355880

OfficialHashPanda@reddit

In what sense is it easy? How would you do it?
View on Reddit #40355971

Healthy-Nebula-3603@reddit

An LLM has context... so after the context fills up during a conversation with newly learned things, the model should go into a sleep mode. Then it should filter the "useful" data and structure it in a proper way for assimilation. After that, the newly acquired data should be used to retrain the model weights. Such a process should be done every time we interact with the LLM. It should be done with bigger models like 30b or 70b+ because they are big enough to find better patterns and connections in data. But that needs an absurd amount of compute power.
View on Reddit #40356761

OfficialHashPanda@reddit

Yes just add sleep to LLMs. I recommend reading up on how LLMs work at a deeper level and see how your ideas relate to actual implementation. It is really not as simple as just blindly feeding its context in as a dataset.
View on Reddit #40357542

Healthy-Nebula-3603@reddit

I know it's not simple. Such a filled context must be filtered and properly structured, and the new data probably also needs extra examples, etc. ...and retraining the model each time with new data is heavy for current technology... absurd compute demand each time. But I think the idea is sound. We also have a context which is heavily filtered during the day, and new data is assimilated into long-term memory at night during sleep, and we wake up with a clean context ready to operate 😅.
View on Reddit #40358199

Head_Beautiful_6603@reddit

I'm curious, what's the difference between this and continuous learning? It sounds like the same concept.
View on Reddit #40403556

jd_3d@reddit (OP)

See the paper here: [https://ekinakyurek.github.io/papers/ttt.pdf](https://ekinakyurek.github.io/papers/ttt.pdf) I don't yet see their results on [https://arcprize.org/](https://arcprize.org/), so hopefully the arcprize team can validate it (and run it on their semi-private eval to make sure it hasn't overfit).
View on Reddit #40322754

SSGSS_goku@reddit

It will not show up on the private leaderboard because they validated on the public eval set and not on the private set for the competition. They couldn't validate with the private set since submission to the competition requires <12hrs completion time and their model takes longer than that. Since they use the public eval set, it could also mean that the model they used has already been exposed to the data in training before. They also described that in the limitations section of this paper.
View on Reddit #40362647

jd_3d@reddit (OP)

There is a private and a public leaderboard on the site, I was referring to the public leaderboard, ARC-AGI-PUB.
View on Reddit #40393311

Alive-Age-3034@reddit

it's not there, because ARC and ARC-AGI are two completely different benchmarks lmao
View on Reddit #40357360

DigThatData@reddit

Isn't TTT basically cheating though? I thought the TTT paper was like a shitpost or an April Fools joke or something like that. Maybe I'm thinking of something else.
View on Reddit #40328642

jd_3d@reddit (OP)

I think the key difference is if you are training on the answers or not. I think it's fair game to train as much as you want on the questions. The paper you are thinking of is: Pretraining on the Test Set Is All You Need
View on Reddit #40329077

MoffKalast@reddit

but isn't ARC as a benchmark sort of like this:
- example, input -> output
- example, input -> output
- example, input -> output
- input -> now write the output

It's literally giving you some ground truth examples to train on, so yes it's basically training on the test set.
View on Reddit #40343870

jd_3d@reddit (OP)

Isn't that how humans would take the test as well though? I've done about 100 of the hard evaluation set (they are fun puzzles) and after 20-30 of them I found it got a lot easier to do the rest because I learned all the tricks to them.
View on Reddit #40347806

BioSNN@reddit

It might be even more innocuous than this. What you're describing is something like learning on all previous test set questions to answer new test set questions better (as a way to get more training data). My understanding is that the insight of TTT comes from learning on the current question independently of all other test set questions (as a way to fix distribution mismatches between the current question and all training questions). See section 2.3 of the paper "Thus, TTT trains a specialized prediction model for each test input, obtained by fine-tuning a base model on a test-time dataset generated from that test input." That is, after coming up with some model M, for each test question i, we fine-tune M on the question statement (which includes (input, output) examples for just that question) to get M\_i. For example, when we go from question 1 to question 2 of the test set, we again restart with the original model M so M\_2 = train(M, Q2) instead of M\_2 = train(M\_1, Q2). Hopefully that makes sense.
View on Reddit #40371912
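
The "restart from M for every question" point above can be shown with a deliberately tiny toy. This is a sketch under heavy assumptions, not the paper's setup: the "model" is a single bias term for y = x + bias, and `fine_tune` and `predict` are made-up names. What it illustrates is only the control flow, M_i = train(M, Q_i), with the base model never modified.

```python
import copy


def fine_tune(model, examples, lr=0.1, steps=200):
    """Return a fine-tuned copy; the model passed in is left untouched."""
    tuned = copy.deepcopy(model)
    for _ in range(steps):
        # Gradient of mean squared error for the toy model y = x + bias.
        grad = sum(2 * ((x + tuned["bias"]) - y) for x, y in examples) / len(examples)
        tuned["bias"] -= lr * grad
    return tuned


def predict(model, x):
    return x + model["bias"]


base = {"bias": 0.0}  # stands in for the base model M
tasks = [
    {"examples": [(1, 4), (2, 5)], "query": 10},   # hidden rule: add 3
    {"examples": [(1, -1), (5, 3)], "query": 10},  # hidden rule: subtract 2
]

predictions = []
for task in tasks:
    m_i = fine_tune(base, task["examples"])  # always restart from base M
    predictions.append(predict(m_i, task["query"]))
```

Because each task starts from the same `base`, nothing learned on task 1 leaks into task 2, which is the distinction drawn above between M_2 = train(M, Q2) and M_2 = train(M_1, Q2).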

DigThatData@reddit

Did you learn those tricks just because you saw the example inputs? Or did you learn those tricks because you saw the correct answer to those examples and that helped you learn how this worked? My understanding here is that TTT is essentially a kind of in-context learning. You're conditioning your network on the features relevant to the input example. I think a better analogy to human test taking would be entering a "flow state" from focusing on a particular problem for a long time, such that all of your thoughts become "painted" by features of the problem at hand.
View on Reddit #40364318

Mysterious-Rent7233@reddit

It's training on the examples, not memorizing the answer key. Training on examples is exactly what a human or AGI should do. Memorizing the answer key is of course cheating.
View on Reddit #40355103

BioSNN@reddit

You're probably thinking of "Pretraining on the Test Set Is All You Need" (https://arxiv.org/abs/2309.08632), but as far as I understand it, TTT does not actually train on the answers. ARC is set up so you get a small sample of (input, output) pairs before having to predict a hidden output for one more input. The earlier examples can be used (along with various forms of augmentation) to form a simple training set that the model can be fine-tuned on before figuring out the last output.
View on Reddit #40329060

Bernafterpostinggg@reddit

"Data Leakage: Even though the base Llama-3 perform extremely poorly on the public validation set, the public availability of the dataset on various platforms (GitHub, Kaggle) introduces the possibility that these models may have encountered these examples during pre-training." The real deal ARC-AGI dataset is private. I'm not convinced this is anything but happy to be proven wrong.
View on Reddit #40338973

Mr_GGLS@reddit

It might be a data augmentation, where it seems that they used the training set to construct samples in Section 3.1
View on Reddit #40338413

Tiny_Arugula_5648@reddit

So basically if you have the test data then you can use that to create synthetic data to fine tune a LoRA so it can solve the tests?
View on Reddit #40333082

muchcharles@reddit

It's finetuning on the problem examples and augmentations of them without having access to the answer. ARC gives you a few examples with completions so that you can then solve the completion for a final problem example from what you observed in the example completions.
View on Reddit #40336863

OfficialHashPanda@reddit

They take the examples and act as if the examples are tests, which they then train on. In this benchmark you have n examples and k tests. They don't touch the tests during the finetuning, but simply look at the n examples and now for each of the n examples, take the other n-1 example input/output pairs and finetune the model to predict the output example from the left-out input.
View on Reddit #40335929
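
The leave-one-out construction described above can be sketched in a few lines. The function name and the dictionary keys are assumptions made for this illustration, not taken from the paper's code. Each of the n demonstrations takes a turn as a pseudo-test, with the other n-1 as context, and the real test inputs are never touched.

```python
def leave_one_out(examples):
    """Build fine-tuning items from a task's demonstration pairs.

    `examples` is a list of (input, output) pairs. For each pair, the
    remaining n-1 pairs become the context and the held-out pair becomes
    the query/target. Hypothetical helper for illustration.
    """
    items = []
    for i, (held_in, held_out) in enumerate(examples):
        context = examples[:i] + examples[i + 1:]
        items.append({"context": context, "query": held_in, "target": held_out})
    return items


# Usage with three toy demonstrations (grids shortened to single rows).
demos = [([[1, 0]], [[0, 1]]), ([[2, 0]], [[0, 2]]), ([[3, 0]], [[0, 3]])]
train_items = leave_one_out(demos)
```

With n demonstrations this yields n training items per task, which is why augmentation is usually layered on top to get a larger set.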

CATALUNA84@reddit

Test-time fine tuning (also leveraged by MindsAI) is a way to do on-the-fly recombination of the vector functions contained in a DL model to adapt to a new task. Here is Ekin's X Thread going into the details of the methodology https://x.com/akyurekekin/status/1855680785715478546?s=46&t=tMxZqJeuhNmuh3e0D8XHYw
View on Reddit #40334228

hiitkid@reddit

Arc is not safe anymore! We need more arcs. We need more FrontierMaths
View on Reddit #40333872

Tea_Pearce@reddit

isn't test time gradient updates on few-shot egs exactly what half the meta-learning community was doing circa-2019?
View on Reddit #40332058

Extension-Mastodon67@reddit

ARC Prize Daily Puzzle Task: 58743b76 ⏱️🟩🟩⬜️⬜️⬜️ 1:48 sec 🤔🟩⬜️⬜️⬜️⬜️ 1 attempt Can you solve it? arcprize.org/play Yay!! I'm not a TOTAL idiot!!!!
View on Reddit #40329861

AIPornCollector@reddit

Arc seemed unbeatable even just a few months ago and now they're at 62% jayzuz. Accelerate!
View on Reddit #40326382

davikrehalt@reddit

test-time training is also how AlphaProof beat the IMO. It will probably be the next paradigm; at least it needs a lot of compute, so in the bitter lesson sense it's good
View on Reddit #40323677

Mescallan@reddit

I wonder what the score would be if they ran it again post test time training.
View on Reddit #40322994

ArtArtArt123456@reddit

link?
View on Reddit #40322804