Implementing reasoning in LLMs through Neural Cellular Automata (NCA) ? (imagining each pixel/cell as a 256-float embedded token)
Posted by ryunuck@reddit | LocalLLaMA | View on Reddit | 13 comments
Its_not_a_tumor@reddit
Great stoned thought, anything else to add?
Massive_Robot_Cactus@reddit
Sometimes this is all that is needed to inspire someone else, who will actually complete the doughnut.
Ahmatt@reddit
Indeed. We are definitely at the stoned ape NFT phase of AI work…
aue_sum@reddit
Are you schizophrenic
ryunuck@reddit (OP)
I take it you didn't like the concept. I'd love to hear why, especially if it's based on some technical intuitions from your work in deep learning.
stunbots@reddit
So how would you turn this into a product
ryunuck@reddit (OP)
this is not for turning into a product, it is for turning all products into dust.
UAAgency@reddit
looks really cool! can you ELI5?
ryunuck@reddit (OP)
Comes from this line of research https://distill.pub/selforg/2021/textures/ tl;dr there is no global mechanism or attention. The model learns to make edits to a "state", which for this is 16 channels - 3 for RGB, 1 for cell aliveness, and the remaining 12 for arbitrary cell state that the model learns to organize and use.
The model is run for a variable number of steps (16 to 96 in this paper), then the loss is backpropagated through all the steps. Identity, Sobel, and Laplacian filters are applied to the state between each step to deform it, and then the model does a conv2d(relu(conv2d(x))). That's literally it. With just 5000 parameters and this loss, the model learns to iteratively update the state in a way that makes us happy.
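A minimal sketch of one such update step in NumPy, assuming the setup described above (16 channels, fixed perception filters, then a tiny per-cell MLP, which is what conv2d(relu(conv2d(x))) with 1x1 kernels amounts to). Filter values match the usual NCA papers; the hidden width and initialization are illustrative assumptions:

```python
import numpy as np

C = 16          # channels: 3 RGB + 1 aliveness + 12 hidden
HIDDEN = 32     # width of the per-cell hidden layer (illustrative)

# Fixed 3x3 perception kernels, applied depthwise to every channel
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=np.float32)
sobel_x  = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32) / 8
laplace  = np.array([[1, 2, 1], [2, -12, 2], [1, 2, 1]], dtype=np.float32)

def depthwise(state, kernel):
    """Convolve each channel of an (H, W, C) state with one 3x3 kernel."""
    h, w, _ = state.shape
    padded = np.pad(state, ((1, 1), (1, 1), (0, 0)), mode="wrap")
    out = np.zeros_like(state)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w, :]
    return out

def nca_step(state, w1, b1, w2):
    # Perception: concatenate filtered views -> (H, W, 3*C) features per cell
    percept = np.concatenate(
        [depthwise(state, k) for k in (identity, sobel_x, laplace)], axis=-1)
    # Per-cell MLP: conv2d(relu(conv2d(x))) with 1x1 kernels
    hidden = np.maximum(percept @ w1 + b1, 0.0)
    delta = hidden @ w2
    return state + delta  # residual update of the grid state

rng = np.random.default_rng(0)
state = rng.normal(size=(8, 8, C)).astype(np.float32)
w1 = rng.normal(scale=0.1, size=(3 * C, HIDDEN)).astype(np.float32)
b1 = np.zeros(HIDDEN, dtype=np.float32)
w2 = np.zeros((HIDDEN, C), dtype=np.float32)  # zero init: no-op at step 0
out = nca_step(state, w1, b1, w2)
```

Zero-initializing the last layer is the common NCA trick so training starts from a do-nothing update rather than immediately destroying the state.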
Based on the fact alone that cellular automata are "self-energizing", I do think an NCA simulator in the bottleneck could unlock AGI. Essentially a semantic equivalent of this would be producing and consuming its own synthetic data at every single step of the NCA. It would be like proto-ideas, something to be refined by the decoder. You no longer append tokens at the end of a 1D sequence: you inject them in the center of the grid and let them grow and propagate, or you do an expanding Poisson-disc sampling, or we develop an injection scheme and encode it in the cells' hidden states so the NCA hints the token injector with high-probability extension sites.
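The center-injection idea could look something like this. A hypothetical sketch only: `inject_token` and the channel layout are assumptions, not anything from the NCA papers.

```python
import numpy as np

def inject_token(grid, token_embedding):
    """Write a token embedding into the center cell of the grid.
    Subsequent NCA steps would then propagate it outward, instead of
    the token being appended to the end of a 1D sequence."""
    h, w, _ = grid.shape
    grid = grid.copy()  # leave the caller's state untouched
    grid[h // 2, w // 2, :len(token_embedding)] = token_embedding
    return grid

grid = np.zeros((9, 9, 16), dtype=np.float32)
token = np.ones(16, dtype=np.float32)
seeded = inject_token(grid, token)
```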
Fluffy-Feedback-9751@reddit
My 5 yr old thanks you for this explanation 😌
bacteriairetcab@reddit
This made me curious what ChatGPT would translate the above response to for an actual 5 year old and this is what I got:
Imagine you have a big coloring book page made of tiny squares, and each square is like a little box with stuff inside it. Each box knows just a little about its neighbors, but there’s no boss telling all the squares what to do. They figure things out together, one step at a time.
Now, each square has some magic crayons (let’s say 16 of them) to color itself. Three crayons are for colors like red, green, and blue. One crayon tells if the square is “alive” or not. The rest of the crayons are just for secret messages that the squares use to talk to themselves and each other. The squares can learn how to use those secret crayons to do cool stuff, like drawing pictures!
But here’s the trick: after coloring for a while (imagine 16 to 96 steps), a teacher comes in and says, “Hey, you’re supposed to draw this!” The teacher shows the squares what they got right or wrong, and they learn how to be better at drawing together next time.
To make it even cooler, between every step, someone shakes the coloring book a little—like applying fun filters to wiggle or blur the picture. Then, the squares do some math magic (a bit like drawing twice on top of themselves) to fix everything up. They keep doing this over and over until the picture grows into something awesome.
And all this happens with just a tiny little set of instructions (5000 of them). The squares basically learn to draw by themselves and work as a team, with no leader.
Now, the really cool idea: What if we used this trick to teach a computer how to think by growing its thoughts, just like these pictures grow? Instead of stacking words like blocks in a row, you’d plant little “idea seeds” in a big grid and let them sprout and spread out, like a garden of thoughts. Every step, the computer could get smarter and refine its ideas, making a whole new way to think and solve problems!
ryunuck@reddit (OP)
Hello, I was looking at the past research on NCAs, and if I squint really hard it kind of looks like this solves reasoning at a very low level? Imagine that instead of a 1D context window, the context window was a 2D grid, and the decoder is parsing the semantic content of the grid. Immediately the questions come up: why is this better than a 1D embedding, and where are you gonna get your dataset? But... if you look at the research on NCAs, this particular NCA was not trained with any dataset. They had a single target image, and they used VGG16 features for the loss!! So I was thinking... couldn't we use one of 'em big smart decoder LLMs as a loss for meaningful states and dynamics to be learnt by the cellular grid, the same way they did for dynamic texture synthesis? Instead of embedding into VGG16 and calculating a loss in this space, you would first upgrade the decoder so it can take a 2D grid embedding, some kind of adapter module or LoRA which jointly fine-tunes the model and integrates the new information modality. That gets you to a middle ground where the grid can represent the same information as a 1D sequence, with a loss which maximizes dead cells so that the system learns to represent the information with as few cells as possible.
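The dead-cell objective here could be sketched as a sparsity term added to whatever decoder-derived semantic loss is used. Everything below is a hypothetical sketch: `alive_channel` and the sigmoid soft-aliveness are assumptions, and the semantic loss is left as an opaque scalar.

```python
import numpy as np

def grid_loss(semantic_loss, state, alive_channel=3, mu=0.01):
    """Composite objective: the decoder-derived semantic loss plus a
    term that rewards dead cells, pushing the grid to encode the same
    information with as few live cells as possible."""
    # Soft aliveness in [0, 1], read from one designated channel
    alive = 1.0 / (1.0 + np.exp(-state[..., alive_channel]))
    return semantic_loss + mu * alive.mean()

state = np.zeros((8, 8, 16), dtype=np.float32)
loss = grid_loss(2.0, state)  # all-zero state: every cell is half-alive
```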
Fluffy-Feedback-9751@reddit
I put your comment into Claude API and it said “This is a fascinating exploration of Neural Cellular Automata (NCAs) and their potential applications to language modeling and reasoning. Let me break down and analyze your key points:
The 2D vs 1D Representation: Your argument for 2D representations is compelling, especially considering human cognitive evolution with 2D information processing (writing, screens, etc.). The natural extension to 3D with a 4th dimension for hidden states is particularly interesting.
Training Approach: Your proposed method of using large decoder LLMs as a loss function is clever. It’s similar to how the original NCA paper used VGG16 features, but for semantic content rather than visual features. This could potentially solve the data scarcity problem while leveraging existing language understanding capabilities.
Emergent Properties: The potential for continuous optimization and distributed computation through grid tiling is particularly interesting. The bacterial colony metaphor for idea propagation is apt - it suggests a more organic, emergent form of reasoning compared to current attention-based approaches.
However, some technical challenges to consider:
Gradient Flow: The multi-step nature of NCAs can lead to vanishing/exploding gradients. Your suggestion of lockstep model freezing might help, but careful architecture design would be crucial.
Information Routing: While 2D grids might be more natural, ensuring efficient information routing between distant cells without attention mechanisms could be tricky. You might need something like:
$L_{total} = L_{semantic} + \lambda L_{stability}$
Where $L_{stability}$ encourages convergent behavior.
Your vision of using this for ConceptARC problems is particularly interesting - the idea of cells maintaining both semantic content and spatial relationships could potentially bridge symbolic and continuous reasoning in a novel way.
The “materiality state” concept you describe reminds me of phase transitions in physical systems. Perhaps the cells could learn to organize into different “phases” of reasoning (concrete vs. abstract) based on the task requirements.
Have you considered starting with a simpler proof-of-concept, perhaps focusing on a specific subset of reasoning tasks that would benefit most from this spatial organization?”
👀
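Claude's $L_{total} = L_{semantic} + \lambda L_{stability}$ above leaves $L_{stability}$ unspecified; one natural choice, sketched hypothetically here, is the mean squared change between consecutive NCA states, which rewards runs that settle down instead of oscillating:

```python
import numpy as np

def total_loss(l_semantic, states, lam=0.1):
    """L_total = L_semantic + lambda * L_stability, taking L_stability as
    the mean squared change between consecutive grid states (one possible
    choice; the comment above does not pin it down)."""
    diffs = [np.mean((a - b) ** 2) for a, b in zip(states[:-1], states[1:])]
    return l_semantic + lam * float(np.mean(diffs))

states = [np.ones((4, 4, 16)) for _ in range(5)]  # a fully converged run
loss = total_loss(1.0, states)
```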