I bolted an 8-arm reasoning MoE onto a frozen 1.4B Mamba backbone on a single RTX 3060. Here’s the mechanistic autopsy of what broke and what worked.

Posted by Just-Ad-6488@reddit | LocalLLaMA

I built Mamba-Titan-1.4B-Reasoning (a 2.54B-parameter MoE) entirely on a 12GB VRAM budget. I froze a 1.4B pure-PyTorch Mamba-1 backbone, grafted on 8 trainable expert arms at the mid-network level, and trained it on DeepSeek CoT traces to teach it latent reasoning. Along the way, I documented the exact tensor geometry of how models learn to stop thinking, an SSM-specific repetition failure mode, and the reality of PreNorm signal explosions.

Repo: batteryphil/Mamba-Titan-1.4B-Reasoning

The Architecture: Mamba-Titan

The goal was to get o1-style latent reasoning out of a sub-2B model without melting a consumer GPU.

Going through the SFT phases, I ran deep ablation studies and tensor dissections. Here is the mechanistic reality of what happens when you try to force a small State Space Model to think.

Finding 1: The "Vault Door" Geometry of Termination

During CoT training, the model has to learn when to emit the </think> token to end its internal monologue and output the final answer.

Looking at the LM head weight space at Step 34k, the network engineered a literal mechanical safety valve. The </think> token isolated itself at the 0th percentile of row norms: it has the smallest row norm in the entire 50,306-token vocabulary (1.991 vs. a mean of 4.742).
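A minimal sketch of how a row-norm percentile like this can be measured. It is not the repo's analysis script: the helper names are hypothetical, and the LM head is stood in for by a plain list of rows, one per vocabulary token.

```python
import math

def row_norms(weight):
    """L2 norm of each row of a weight matrix given as a list of rows."""
    return [math.sqrt(sum(v * v for v in row)) for row in weight]

def percentile_rank(norms, index):
    """Percent of rows whose norm is strictly smaller than row `index`."""
    target = norms[index]
    smaller = sum(1 for n in norms if n < target)
    return 100.0 * smaller / len(norms)

# Toy 4-token "vocabulary"; row 1 plays the role of the smallest-norm token.
toy_head = [[3.0, 4.0], [0.3, 0.4], [6.0, 8.0], [0.0, 2.0]]
norms = row_norms(toy_head)
rank_of_smallest = percentile_rank(norms, 1)
rank_of_largest = percentile_rank(norms, 2)
```

A rank of 0 means no other row has a smaller norm, which is what the 0th-percentile reading for </think> amounts to.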

Because its output embedding norm is so small, the </think> logit only overcomes the baseline probabilities of standard text tokens when the hidden state is large, highly specific, and closely aligned with that row. The model is very unlikely to "accidentally" stumble into terminating its thought process; it forces itself to build immense internal confidence before the safety valve unlocks.
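The geometry can be made concrete with a little arithmetic. This sketch assumes a bias-free LM head, so each logit is row_norm * hidden_norm * cos(angle); the function names are hypothetical, and the two norms are the ones reported above. Since every token sees the same hidden state, the hidden norm cancels when asking who wins, and only the row norms and alignments matter.

```python
THINK_END_NORM = 1.991  # row norm of </think> (from the post)
MEAN_ROW_NORM = 4.742   # mean row norm over the vocabulary (from the post)

def logit(row_norm, hidden_norm, cosine):
    """Dot product of an LM-head row and the hidden state (no bias assumed)."""
    return row_norm * hidden_norm * cosine

def required_cosine(rival_cosine, rival_norm=MEAN_ROW_NORM,
                    own_norm=THINK_END_NORM):
    """Alignment </think> needs to tie a rival token; None if it exceeds 1."""
    needed = rival_cosine * rival_norm / own_norm
    return needed if needed <= 1.0 else None

# An average-norm rival aligned at 0.3 forces </think> to roughly 0.71.
needed_vs_weak_rival = required_cosine(0.3)
# A rival aligned at 0.5 cannot be tied at all under these assumptions.
needed_vs_strong_rival = required_cosine(0.5)
```

The norm ratio 4.742 / 1.991 is about 2.38, so </think> needs roughly 2.4 times the alignment of an average-norm competitor just to draw level.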

Finding 2: The SSM "Attractor Lock" (Repetition Failure)

In an early run, my 64-dim MoE bridge was too weak. The arms were simulating reasoning but failing math (10% accuracy). To fix it, I expanded the bridge to 512-dim and added a scalar multiplier (moe_scale=5.0). Math accuracy instantly jumped to 50%, but it introduced a catastrophic repetition loop (e.g., outputting "56 56 56 56..." infinitely).

In Transformers, repetition is usually attention-head collapse. This is different: it is a failure specific to SSMs. Mamba compresses context into a rolling hidden state vector. By setting the scale to 5.0, the new MoE input signal completely overpowered the historical context. The continuous context collapsed into a fixed attractor. The state saturated, predicted the correct token, fed that token back into the saturated state, and gave itself localized amnesia regarding the EOS token.
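A toy illustration of that collapse, not Mamba's actual selective-scan recurrence: a scalar state updated as tanh(decay * state + scale * input). The tanh squashing, the decay of 0.9, and the function names are all assumptions made for the sketch. Two states start from opposite "histories" and are fed the same repeated token signal; the gap between them measures how much history survives.

```python
import math

def step(state, token_signal, scale, decay=0.9):
    """Toy update: decayed history plus scaled input, squashed by tanh."""
    return math.tanh(decay * state + scale * token_signal)

def history_gap(scale, steps=1):
    """Distance between two states with opposite starting histories
    after both receive the same repeated token signal."""
    a, b = 0.9, -0.9
    for _ in range(steps):
        a = step(a, 1.0, scale)
        b = step(b, 1.0, scale)
    return abs(a - b)

gap_gentle = history_gap(0.5)  # history still clearly visible after one step
gap_strong = history_gap(5.0)  # both states pinned near the same saturated value
```

With the small scale the two histories stay more than 1.0 apart after one step; with scale 5.0 they are within 0.001 of each other, so whatever the state remembered before the repeated token is effectively gone.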

Finding 3: The PreNorm Signal Explosion Trap

While debugging why the MoE arms were being ignored by the LM head early on, telemetry revealed a massive signal explosion. The activation norm grew roughly 41x from the embedding layer (norm 3.5) to layer 48 (norm 147.9).

The specialized MoE arms were injecting their signal at Layer 24 (norm ~13), but the frozen backbone amplified its own signal so aggressively in the remaining 24 layers that the MoE contribution was drowned out by the time it reached the LM head.

The intuitive fix is to insert an intermediate LayerNorm at Layer 25 to control the explosion. Do not do this to a frozen backbone. Mamba uses a PreNorm architecture; the frozen outer layers were pre-trained to expect massive input magnitudes. Resetting the norm mid-network blinds the entire second half of the model. The only mathematically viable solution without unfreezing the backbone is expanding the injection bridge dimensions and carefully tuning the scalar output of the MoE to match the backbone's acoustic volume.
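A back-of-the-envelope sketch of why the injected signal gets drowned and how large a scalar would need to be to compensate. It uses the norms reported above and makes one strong assumption: the injected MoE signal rides the residual stream to the LM head with its norm unchanged, while the backbone's own stream keeps growing. Real layers can amplify or attenuate it, so treat the result as orientation, not a tuning recipe. The function names are hypothetical.

```python
INJECT_NORM = 13.0   # approximate stream norm at layer 24 (from the post)
FINAL_NORM = 147.9   # stream norm at layer 48 (from the post)

def relative_share(signal_norm, stream_norm):
    """Size of an injected signal relative to the residual stream."""
    return signal_norm / stream_norm

def scale_to_restore_share(inject_norm=INJECT_NORM, final_norm=FINAL_NORM):
    """Multiplier that gives the injected signal the same share at the
    LM head that it had at the injection layer (unchanged-norm assumption)."""
    return final_norm / inject_norm

share_at_injection = relative_share(1.0, INJECT_NORM)
share_at_head = relative_share(1.0, FINAL_NORM)
break_even_scale = scale_to_restore_share()
```

Under these assumptions a unit-norm injection shrinks from about 7.7% of the stream at layer 24 to under 0.7% at the LM head, and the break-even multiplier comes out around 11.4.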

Finding 4: The Blackboard IPC Actually Works

I built a latent bus (the Blackboard) for the isolated MoE arms to write to and read from. Initially, it was dead weight. But after re-initializing the gates hot and training for 40k+ steps, the read-back signal grew 76x.
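The post does not say how the Blackboard gates are parameterized, so this is one plausible reading of "dead weight" versus "hot" gates, assuming a sigmoid gate with a scalar bias (hypothetical names throughout). When the gate is nearly closed, the read-back barely reaches the hidden state, and the gradient flowing back to the Blackboard content is scaled by the same small gate value.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_read(hidden, blackboard, gate_bias):
    """Add a gated copy of the blackboard vector to the hidden vector."""
    gate = sigmoid(gate_bias)
    return [h + gate * b for h, b in zip(hidden, blackboard)]

def content_grad_scale(gate_bias):
    """d(output)/d(blackboard) for this read, which is the gate value itself."""
    return sigmoid(gate_bias)

cold_gate = content_grad_scale(-4.0)  # nearly closed: tiny read-back and gradient
hot_gate = content_grad_scale(2.0)    # mostly open: read-back can actually train
half_open = gated_read([1.0, 2.0], [10.0, -10.0], 0.0)
```

In this picture a gate biased at -4 passes under 2% of the Blackboard signal, while one biased at +2 passes about 88%, which would be consistent with the bus only becoming useful after a hot re-initialization.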

In an ablation study, removing the Blackboard increased premature reasoning termination by 2.3 percentage points. The IPC bus is now acting as a structural anchor that actively keeps the model in a cognitive state, preventing it from speedrunning the latent space.

Where it's at now

It's currently hitting roughly 60% accuracy on factual QA and 50% on math, but struggles with multi-step algebra due to the token-budget constraint. SFT is ongoing to balance the moe_scale and cure the attractor locks without losing the deep reasoning.

If anyone else is experimenting with bolting MoEs onto frozen backbones or running pure PyTorch SSMs on consumer hardware, I'm happy to share the raw telemetry, cosine similarity matrices, or implementation details.