Reverse Engineering o1 Architecture (With a little help from our friend Claude)
Posted by TechnoTherapist@reddit | LocalLLaMA | View on Reddit | 55 comments
I fed Claude with released information from OpenAI (system card, blog posts, tweets from Noam Brown and others) and online discussions (Reddit, YouTube videos) relating to the o1 model.
After a bit of back and forth, this is what it came up with as a potential high level architecture for the model:
The bit about large-scale CoT storage feeding into the RL environment is my own (somewhat cheeky) assumption: I think OpenAI will likely use the CoTs generated in the real world to further RL-optimise the model.
Comments / thoughts / glaring mistakes/ potential improvements, all welcome!
tearo@reddit
It looks structured and detailed, which makes it come across as authoritative.
But treating this as a working hypothesis, can you separate the points that simply repeat OpenAI's press releases from the conclusions you reached yourself, or the connections you made from hands-on experience or even experiments with the product?
Then we could focus further investigation and experimentation on the latter.
AlternativePlum5151@reddit
A few months ago, I built a basic React app using Claude to attempt something similar. The concept was to have several agents with specific tasks: one to break down the prompt, others to conduct analysis and reasoning, and a quality control agent to review their responses before delivering a final output to the user. Each agent had its own system prompt, which could be adjusted to improve performance. The right side of the interface could be hidden away. I’m not a coder, just an electrician who enjoys tinkering. The app took me about an hour to create, and I had GPT-3.5 reliably solving the strawberry problem. Using o1 gave me a similar experience, with extended token outputs.
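A minimal sketch of that kind of pipeline (call_llm and the three system prompts here are placeholders, not the actual app code):

```python
# Rough sketch of a decompose -> analyse -> review pipeline.
# call_llm and the three system prompts are placeholders, not the real app code.

def call_llm(system_prompt: str, user_content: str) -> str:
    raise NotImplementedError("wire this up to your model API of choice")

def run_pipeline(user_prompt: str) -> str:
    # Agent 1: break the prompt into sub-tasks
    plan = call_llm(
        "You are a planner. Break the user's request into numbered sub-tasks.",
        user_prompt,
    )
    # Agent 2: reason through each sub-task step by step
    analysis = call_llm(
        "You are an analyst. Work through each sub-task, thinking step by step.",
        f"Request: {user_prompt}\n\nPlan:\n{plan}",
    )
    # Agent 3: quality control before anything reaches the user
    return call_llm(
        "You are a reviewer. Check the analysis for errors, then write only the final answer.",
        f"Request: {user_prompt}\n\nAnalysis:\n{analysis}",
    )
```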
tearo@reddit
Do you manually evaluate each agent's outputs in order to tune / improve each one?
Do you semi-manually filter and combine the outputs before the following steps?
Sadzip@reddit
I am quite interested in your work, could you plz provide more info about the process? Thanks a lot!
AlternativePlum5151@reddit
Sure.. the experiment was to make dumb models smart by having multiple agents of the same model play out both a CoT and a reflection process, assigning roles and traits through prompt engineering to make them experts in their specific role. My theory was that model performance would improve if you could slow down the response and filter it through critical analysis, and, I guess, allow it to think and reflect. In hindsight, I was ahead of the curve in my thinking and, in all honesty, just wanted to prove wrong a bunch of Reddit guys who said it wouldn't work 😂 If I can figure out how to post it to GitHub, I'll make it available for people to play with.
Sadzip@reddit
Slow thinking and critical reasoning are indeed (I guess) part of why o1 works so well, so you really were thinking ahead of the curve. I am a graduate student studying LLM safety, so I'd love to see your work on GitHub! I'd also be happy to help you post your project there.
tearo@reddit
Can you speak to the flip side, AI Threat or Fear?
dogcomplex@reddit
There's probably something like splitting the problem into subtasks, running chains of thought on each, scoring them (i.e. an ML reward), and integrating the whole. Then, if they're clever, each of those subproblems could have been paired with a particular LoRA-trained submodel (Mixture of Experts) that accomplishes the task a bit better than the base model, and that matchmaking process could be scored accordingly too (again, classic ML rewarding). Finally, it introspects on whether the answer is complete and each section has an adequate score according to testable subquestions, and if it's all about as good as it can get within a threshold, it passes the answer on.
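Something like this toy loop (the reward model, expert table, and threshold are all hypothetical stand-ins, not anything known about o1):

```python
# Toy version of the subtask -> expert -> score -> integrate loop described above.
# generate, score and the expert table are hypothetical placeholders.

THRESHOLD = 0.8

def generate(model: str, prompt: str) -> str: ...        # call the named (sub)model
def score(question: str, answer: str) -> float: ...      # learned reward model

EXPERTS = {"math": "base+math-lora", "code": "base+code-lora", "default": "base-model"}

def pick_expert(subtask: str) -> str:
    # In a real system this matchmaking step could itself be learned and rewarded.
    if any(ch.isdigit() for ch in subtask):
        return EXPERTS["math"]
    return EXPERTS["default"]

def solve(problem: str) -> str:
    subtasks = generate("base-model", f"Split into subtasks:\n{problem}").splitlines()
    parts = []
    for sub in subtasks:
        expert = pick_expert(sub)
        answer = generate(expert, f"Think step by step:\n{sub}")
        if score(sub, answer) < THRESHOLD:                # retry weak sections once
            answer = generate(expert, f"Try again, more carefully:\n{sub}")
        parts.append(answer)
    return generate("base-model", "Integrate these partial answers:\n" + "\n".join(parts))
```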
TechnoTherapist@reddit (OP)
Brilliant. Appreciate the insight.
KevinCola@reddit
Regarding RL, this diagram does not really mean anything. It just shows the very general steps and choices involved in setting up any RL pipeline (exploration vs exploitation; and in any neural network you have gradient calculation followed by a 'parameter update', which is just backpropagation).
A good diagram should indicate how RL actually helps train o1. What does adversarial learning have to do with it, and what is the objective of the opponent? How does meta learning help improve the model? And is the RL approach somehow multi-tiered: if there are multiple agents for different specializations in the CoT, is there one 'super' agent that decides which agent(s) to use for the current prompt? And is RLHF still used? Because hierarchical RL with three layers (specialization agents, super agent, verification agent) seems suuuper unstable to me.
Interesting that this could be generated, but it just seems to show a very general flow of RL combined with some CoT & LLM input & output.
Achrus@reddit
Correct me if I'm wrong, but isn't RLHF just human annotators feeding labels back into a continuous / online learning setup? The "reinforcement learning" part of RLHF is a stretch: technically correct, but not something anyone in machine learning would actually call RL. Kind of like how generative transformers got rebranded as "AI."
It just feels like this is another ad campaign obfuscated by buzzwords and acronyms that the coked out C-suite and their consultants can circle jerk about. I’d be much more excited if OpenAI just came out and said:
At least that’d be something real coming out of that company.
TechnoTherapist@reddit (OP)
Agreed and true. The challenge is that, beyond the very basic generics of the system, we'd just be speculating (until more information leaks out of OAI).
Complex_Candidate_28@reddit
seems like the standard RL pipeline
Hunting-Succcubus@reddit
Our friend ???
MoffKalast@reddit
"With a little help from some old friends..."
TechnoTherapist: Your skills are required for a job.
Claude: You son of a bitch, I'm in.
ankitm1@reddit
There is a classifier somewhere which determines whether the problem is easy or hard (i.e. whether it will take more compute or less). Specifically, I am referring to ideas in this paper and its predecessor MCTS paper. You use best-of-n when the problem is simpler, but perhaps go for a beam/lookahead search (or MCTS) if the problem is complex. That needs to be determined before the model starts generating tokens.
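A rough sketch of that kind of routing (the classifier, sampler, and scorer are placeholder stubs, not anything confirmed about o1):

```python
# Sketch of routing test-time compute by estimated difficulty.
# classify_difficulty, sample_answer and score are placeholder stubs.
import heapq

def classify_difficulty(prompt: str) -> float: ...    # 0 = easy, 1 = hard
def sample_answer(prompt: str) -> str: ...             # one sampled completion
def score(prompt: str, text: str) -> float: ...        # verifier / reward model

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [sample_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

def beam_search(prompt: str, width: int = 4, steps: int = 6) -> str:
    beams = [""]
    for _ in range(steps):
        expanded = [b + sample_answer(prompt + b) for b in beams for _ in range(width)]
        beams = heapq.nlargest(width, expanded, key=lambda b: score(prompt, b))
    return beams[0]

def answer(prompt: str) -> str:
    # Decide how much compute to spend *before* generating the real answer.
    if classify_difficulty(prompt) < 0.5:
        return best_of_n(prompt)
    return beam_search(prompt)
```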
dummyTukTuk@reddit
Side question: Is Claude able to create an architecture diagram?
TechnoTherapist@reddit (OP)
Claude can create diagrams in Mermaid, PlantUML, SVG, etc. I typically start with Mermaid and iterate until I have a comprehensive diagram. Then I have Claude convert it to an SVG with the desired visual features (aesthetics, which areas to highlight, etc.). I then use a Python script to convert the SVG to a PNG.
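For that last step, something like this works (cairosvg is just one common option; the exact script and filenames here are only examples):

```python
# Final step: SVG -> PNG. cairosvg is just one option; the exact script isn't specified.
import cairosvg

cairosvg.svg2png(url="o1_architecture.svg", write_to="o1_architecture.png", scale=2.0)
```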
Low88M@reddit
Great! I'm very interested in the process, if you wouldn't mind spelling it out a bit for a newbie like me. I have some (to me) relatively complex projects to make diagrams of and it would definitely ease the pain.
WhosAfraidOf_138@reddit
Not the actual image no
But models are surprisingly good at generating tree structures and ASCII stuff. Doesn't work all the time obviously
They can also generate the code to generate these diagrams too
toodimes@reddit
The actual SVG of the diagram itself, no. But it is really good at generating mermaid or other code that is then used to generate a diagram.
Tobiaseins@reddit
There is only one thing that matters here, and no OpenAI employee will ever talk about it: the reward function for the RL step
TechnoTherapist@reddit (OP)
You nailed it!
Bewinxed@reddit
OpenAI had Sam's fanboys on a leash only to release LangChain: Electric Boogaloo 🫵😹
Thomas-Lore@reddit
One thing to check would be whether the thinking sentences shown in the demos are long enough to align with how long the model thinks. Your post makes it seem like you missed the examples showing how it reasons here: https://openai.com/index/learning-to-reason-with-llms/
funky778@reddit
These are all speculations, guys; we need the system prompt, which someone will get sooner or later. It is a prompting technique + reinforcement learning with chain-of-thought technology.
JoMaster68@reddit
What I don't understand: okay, so during inference the model generates a large amount of CoT text. But how is the output then generated from this? Is it just the LLM's summary of the most important CoT parts?
ninjasaid13@reddit
Is the model trained with special thinking tags, where anything inside them is not shown to the user, while the output is generated outside of them?
_qeternity_@reddit
No, you're missing the whole point: you don't ever interact with the real o1 model (of which there may be multiple specialized versions). You only ever interact with a model that effectively summarizes the o1 CoT.
Original_Finding2212@reddit
That’s standard agents architecture
TechnoTherapist@reddit (OP)
From what I've pieced together so far (rough sketch below), it:
- Generates multiple CoTs
- Backtracks through them based on available test-time compute
- Uses the one(s) it finds most correct (based on its RL)
- Generates a response based on them
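A rough best-of-n style sketch of that loop (generate_cot, reward, and summarize are placeholders; nobody outside OpenAI knows what the real components look like):

```python
# Rough best-of-n style sketch of the steps above.
# generate_cot, reward and summarize are placeholders, not OpenAI components.

def generate_cot(prompt: str) -> str: ...
def reward(prompt: str, cot: str) -> float: ...   # the (secret) RL-trained scorer
def summarize(prompt: str, cot: str) -> str: ...  # produces the user-facing answer

def o1_like(prompt: str, compute_budget: int = 8) -> str:
    cots = [generate_cot(prompt) for _ in range(compute_budget)]   # multiple CoTs
    best = max(cots, key=lambda c: reward(prompt, c))              # keep the best one
    return summarize(prompt, best)                                 # only this is shown
```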
Igoory@reddit
Am I missing something or did you guys not see the example CoT output from OpenAI's blog post?
wolttam@reddit
There could be three tiers: the CoT model, the summarization model, and the output model. Each model is being intelligent about what it produces for the next model to consume, and can choose to pass a lot of specific detail or just a little to the final output model depending on the complexity of the task. The final output model then shapes its reply based on the information it has available to it.
Fr_kzd@reddit
You are overthinking it. It's just a wrapper: a prompt-response generation loop that simulates a train of thought. There are many toy examples using this technique that produce similar outputs to o1. The tokens used in the line-of-thought generation are referred to as "reasoning tokens", which is just a fancy term for the hidden tokens.
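A toy version of that wrapper loop would look something like this (call_llm is a stand-in for any chat-completion call; none of this is o1's actual code):

```python
# Toy "wrapper loop": hidden reasoning turns followed by a visible answer.
# call_llm is a stand-in for any chat-completion call; this is not o1's code.

def call_llm(prompt: str) -> str: ...

def answer_with_hidden_reasoning(question: str, max_turns: int = 5) -> str:
    thoughts: list[str] = []
    for _ in range(max_turns):
        step = call_llm(
            f"Question: {question}\nThoughts so far:\n" + "\n".join(thoughts)
            + "\nWrite the next thought, or DONE if finished:"
        )
        if step.strip() == "DONE":
            break
        thoughts.append(step)  # these "reasoning tokens" are never shown to the user
    return call_llm(
        f"Question: {question}\nUsing these private notes:\n" + "\n".join(thoughts)
        + "\nWrite the final answer:"
    )
```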
TechnoTherapist@reddit (OP)
Wouldn't that make it an easily reproducible agentic wrapper? From what I understand, there is also (allegedly) a major RL component (with secret sauce around how it actually rewards the model; no one outside OpenAI would know this), as well as some (again, alleged) CoT refinement during inference through search and backtracking that depends on variable test-time compute (which is not configurable with o1-preview but may become configurable when o1 is released).
Junior_Ad315@reddit
Anyone saying this is just simple CoT with multiple agents is wrong. It is doing that, but there's more to it. You're correct about the traditional RL component being significant, because this is scalable in the same way AlphaGo or other DeepMind projects are; its huge leaps in domains with deterministic answers indicate this. Noam Brown mentions it, and I've seen several others with intimate knowledge hint at this RL component being the next leap forward. Your diagram seems relatively in line with my understanding of the technique, from reading some of the DeepMind research papers that have come out in August and September and paying attention to relevant researchers.
You can get similar looking results from pure CoT prompts, and it does improve performance, but I truly doubt they will scale and perform this well on benchmarks.
randombsname1@reddit
Any examples of this?
I've seen nothing impressive with regards to reasoning. It falls apart quickly when prompting outside of its (almost certainly very narrow) training examples. See my response to another comment:
https://www.reddit.com/r/ClaudeAI/s/wrqpTBHjP4
Junior_Ad315@reddit
I’m mostly just saying that there is RL happening, and I don’t think OP is far off from the rough structure of how they did it. Based on the benchmarks, however flawed they are, it is working in certain realms, and RL is something that has been proven scalable for things with definite answers.
I still won’t use it for most things right now, I mostly just find it interesting what they’re doing.
Fr_kzd@reddit
The benchmarks themselves are flawed. What people want is multi-domain mastery, not some pre-cooked LeetCode challenge benchmark or student math problems that it takes up to 10,000 retry spams to solve.
Junior_Ad315@reddit
I totally agree benchmarks are flawed for testing whether something is AGI or can attain multi-domain mastery. I don't think they're useless though. I think that to solve novel multi-step problems you probably need to be able to reliably solve much smaller problems.
But you said he was overthinking it, and I think he's roughly close to the methodology they're using.
But who knows maybe it’s just a simple prompt pipeline with a couple agent overseers. I guess we’ll see.
Fr_kzd@reddit
Reproducible? Yes. Easy? No. It takes a lot of fine-tuning for this kind of architecture. Even my observation of o1 is that it tends to derail its line of thought easily.
And this kind of architecture doesn't perform well without context information injection mechanisms like RAG. You'd need to inject context information for every thought in the chain so it doesn't derail and hallucinate whatever output it wants.
az226@reddit
The wrapper is stateless.
The model trained this way makes connections that are very important.
They’re also not mutually exclusive. You can do both for even stronger results.
n-7ity@reddit
I think you can easily reproduce this as an agent; it will just take much longer. From my testing, this is basically a fine-tune that produces the CoTs as a guide, then hides them from the output. Seems like they fine-tuned an architecture around existing prompt engineering techniques.
For sure there’s a secret sauce in the RL but it seems like it’s not a massive architectural step change?
If so, it’s good because others will be able to replicate it
GoogleOpenLetter@reddit
I don't think it's just that. I think they generated a whole bunch of synthetic CoT data examples, especially focusing on cases where LLMs perform poorly at reasoning.
I'd be shocked if they didn't train this into the model and it was just some initial prompt hack.
Fr_kzd@reddit
A CoT architecture is not just an initial prompt hack, but sure. Go off.
Powerful_Pirate_9617@reddit
any examples you can share?
Additional_Test_758@reddit
lol at the downvotes :D
NearbyApplication338@reddit
If they did something simple, it should be something like this:
Generate and insert multiple intermediate steps into a conversation. Check for consistency and prefer conciseness. Train a new model on this dataset, then do this recursively, correcting, removing, or condensing the thinking steps.
And if they did something complex, it would be that they framed generating text as playing a game. Depending on the game, there are multiple ways to win, i.e. multiple distinct conclusions are valid. Steps can be seen as moves that the agent and user make to reach the conclusion; the agent can even see the user as a tool that defined its goal and can provide further constraints to narrow the search space. Do all of this in latent space; to give a thinking step a verbal form, let another LLM interpret the vector into a verbal representation. Use tricks previously demonstrated by systems like AlphaGo, such as looking ahead before deciding on a thinking step. And now we have an RL agent that does text instead of actions.
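A crude sketch of the simple version (every call here is a placeholder, not a known implementation):

```python
# Crude sketch of the "simple" recipe: generate intermediate steps, keep the
# consistent/concise ones, retrain, repeat. Every call here is a placeholder.

def generate_steps(model, conversation: str) -> list[str]: ...
def is_consistent(conversation: str, steps: list[str]) -> bool: ...
def train(model, dataset: list[tuple[str, list[str]]]): ...  # returns the new model

def bootstrap(model, conversations: list[str], rounds: int = 3):
    for _ in range(rounds):
        dataset = []
        for convo in conversations:
            steps = generate_steps(model, convo)
            steps = [s for s in steps if len(s) < 400]   # crude "prefer conciseness" filter
            if is_consistent(convo, steps):              # drop chains that don't hold together
                dataset.append((convo, steps))
        model = train(model, dataset)                    # recurse on the filtered data
    return model
```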
Born_Fox6153@reddit
If the system was this complex, we would be getting policy violation messages when trying to jailbreak the system prompt, as the response generation system is very hellbent on letting it out to the user lol
Still_Ad_4928@reddit
That CoT storage part would likely have to be chunked to oblivion to be at all reliable: recursively embedding, clustering, and summarizing CoT chunks and somehow joining them. Don't think PPO can do that.
Still_Ad_4928@reddit
Think the hardest part was the synth gen tbh.
Fine-tuning and training with PPO have been around for ages. The CoT storage thing is interesting, but it does imply some kind of at-inference mechanism, which would be way ahead of what a first CoT release would look like at the scale OpenAI is operating.
But interesting nonetheless.
muchcharles@reddit
Where was the commentary from the ARC Prize team posted? Got a link?
freedomachiever@reddit
Though this is over my head, what I want to know is how you prompted Claude to generate this diagram.
Lammahamma@reddit
"Guys, it's so simple. Quit overthinking things"
Alright, why hasn't anyone else done it yet?
balianone@reddit
Regarding the large-scale CoT storage, it is not entirely clear how it fits into the o1 model architecture. However, based on the search results, it appears that CoT storage refers to a type of energy storage technology. It is possible that the CoT storage in the context of the o1 model is a metaphor for a large-scale storage system that feeds into the RL environment, allowing the model to learn from its interactions with the environment.
The RL environment is a key component of the o1 model, as it enables the model to learn from its interactions with the environment and improve its performance over time. The search results provide more information on reinforcement learning and its applications in AI research.