T³ v3.4.1 (124M) beats GPT-2 XL (1.5B) on BoolQ and leads the 125M class on reasoning — controlled A/B shows ecology decouples reasoning from perplexity

Posted by MirrorEthic_Anchor@reddit | LocalLLaMA

I've been building T³ (Toroidal Transformer with Temporal ecology) independently. The core idea: attention heads maintain ecological state variables coupled through Hamiltonian dynamics, producing an adaptive per-head attention temperature, combined with ACT (Adaptive Computation Time) for variable-depth processing.
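The post doesn't include code, so here's a minimal numpy sketch of what "adaptive per-head attention temperature" could mean in practice. All names are hypothetical, and the Hamiltonian ecology that would actually evolve `tau` during training is omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def eco_attention(q, k, v, tau):
    """Scaled dot-product attention with a per-head temperature tau.

    q, k, v: (heads, seq, d_head). tau: (heads,), a positive
    temperature per head (here it would come from the ecological
    state). tau > 1 flattens a head's attention distribution,
    tau < 1 sharpens it.
    """
    h, n, d = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (heads, seq, seq)
    scores = scores / tau[:, None, None]             # per-head temperature
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((2, 4, 8))
out_sharp = eco_attention(q, k, v, tau=np.array([0.5, 0.5]))  # peaky heads
out_flat  = eco_attention(q, k, v, tau=np.array([4.0, 4.0]))  # diffuse heads
```

In the limit of very large `tau` each head averages the values uniformly; very small `tau` approaches hard argmax attention. The interesting part of the claimed architecture is that `tau` is a learned, dynamically coupled state rather than a fixed hyperparameter.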

Experiment

Before comparing against other architectures, the most important result is the controlled A/B test.

Setup: take GPT-2 Small pretrained weights and continue training on the same data mix (roughly equivalent to MobileLLM's), same batch size, same LR schedule, same token budget (~4.2B continuation tokens, ~13B cumulative).

Run two versions:

  1. Vanilla — standard GPT-2 architecture, no modifications

  2. T³ v3.4.1 — same weights with ecological attention + ACT added

Results at step-matched (25k) evaluation:

| Task | Vanilla (cont.) | T³ v3.4.1 ACT | Delta (rel.) |
|---|---|---|---|
| BoolQ | 0.493 | 0.635 | +28.8% |
| HellaSwag | 0.305 | 0.445 | +45.9% |
| ARC-E | 0.405 | 0.460 | +13.6% |
| ARC-C | 0.232 | 0.290 | +25.0% |
| WinoGrande | 0.505 | 0.520 | +3.0% |
| PIQA | 0.619 | 0.625 | +1.0% |
| COPA | 0.660 | 0.640 | -3.0% |
| RTE | 0.542 | 0.530 | -2.2% |

Vanilla reaches PPL 23.4; T³ reaches PPL 27.7. The vanilla model has better perplexity and worse benchmarks: it plateaued hard at ~8K steps and never improved further, while T³ was still gaining at 120K.

The ecological dynamics produce representations that generalize to reasoning tasks more efficiently than minimizing language modeling loss alone. The model with higher perplexity reasons better. Same weights, same data. Ecology (T³ architecture) is the only variable.
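The Delta column above is the relative change over the vanilla control, (T³ − vanilla) / vanilla. A quick check of the arithmetic:

```python
# A/B scores as (vanilla, T3) pairs, straight from the table above.
ab = {
    "BoolQ":      (0.493, 0.635),
    "HellaSwag":  (0.305, 0.445),
    "ARC-E":      (0.405, 0.460),
    "ARC-C":      (0.232, 0.290),
    "WinoGrande": (0.505, 0.520),
    "PIQA":       (0.619, 0.625),
    "COPA":       (0.660, 0.640),
    "RTE":        (0.542, 0.530),
}

# Relative delta in percent, rounded to one decimal place.
deltas = {task: round(100 * (t3 - v) / v, 1) for task, (v, t3) in ab.items()}
# deltas["BoolQ"] → 28.8, deltas["HellaSwag"] → 45.9, deltas["COPA"] → -3.0
```

The computed values reproduce the table, so the deltas are relative (percent of the vanilla score), not absolute percentage points.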

Competitive Context: The 125M Class (T³ at 120K steps, ~4.2B-token continuation budget)

Here's how it compares to the best purpose-built sub-200M models:

| Task | GPT-2 S (124M) | Pythia (160M) | MobileLLM (125M) | SmolLM (135M) | T³ v3.4.1 (124M) | GPT-2 XL (1.5B) |
|---|---|---|---|---|---|---|
| BoolQ | 0.487 | ~0.50 | 0.563 | — | 0.635 | 0.623 |
| HellaSwag | 0.311 | ~0.35 | 0.401 | 0.422 | 0.445 | 0.529 |
| WinoGrande | 0.516 | ~0.52 | 0.513 | — | 0.520 | 0.589 |
| ARC-C | 0.227 | ~0.24 | 0.345 | 0.276 | 0.290 | 0.324 |
| ARC-E | 0.395 | ~0.41 | 0.560 | 0.576 | 0.460 | 0.532 |
| PIQA | 0.625 | ~0.64 | 0.655 | 0.693 | 0.625 | 0.731 |
| COPA | 0.660 | ~0.67 | — | — | 0.640 | 0.760 |
| RTE | 0.534 | ~0.53 | — | — | 0.530 | 0.534 |

- BoolQ 0.635 — best in the 125M class AND beats GPT-2 XL (1.5B, 0.623). 124M beating 1.5B.

- HellaSwag 0.445 — best in class, beating SmolLM (0.422) and MobileLLM (0.401)

- WinoGrande 0.520 — best in class

Training Budget Context

| Model | Params | Training Tokens |
|---|---|---|
| SmolLM-135M | 135M | ~600B |
| MobileLLM-125M | 125M | ~1T |
| T³ v3.4.1 | 124M | ~13B |

T³ saw 46-77× less training data than its competitors. SmolLM and MobileLLM win on factual recall benchmarks (ARC-E, PIQA) — tasks where data volume directly helps. T³ wins on reasoning benchmarks despite the massive data disadvantage. The ecology extracts more reasoning capability per token.
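The data-disadvantage figures check out as straight token-count ratios:

```python
# Training-token ratios behind the "46-77x less data" claim.
t3_tokens = 13e9                                  # ~13B cumulative for T3
competitors = {"SmolLM-135M": 600e9, "MobileLLM-125M": 1e12}

ratios = {name: round(tokens / t3_tokens) for name, tokens in competitors.items()}
# ratios → {"SmolLM-135M": 46, "MobileLLM-125M": 77}
```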

Where T³ Wins vs Loses

| T³ wins | T³ loses |
|---|---|
| BoolQ (passage inference) | ARC-E (factual science recall) |
| HellaSwag (contextual completion) | PIQA (physical knowledge) |
| WinoGrande (disambiguation) | ARC-C (hard factual reasoning) |
| *How you attend* | *What you've seen* |

The architecture seems to change what training buys you. More reasoning per token instead of more knowledge per token.

No MoE, no auxiliary losses, no external gating. End-to-end differentiable with the LM objective.
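For readers unfamiliar with ACT: it's Graves-style Adaptive Computation Time, where each position ponders for a variable number of steps before halting. The post doesn't show how it's wired in (and states there are no auxiliary losses, so presumably no ponder penalty), but the halting rule itself can be sketched with the standard library alone. The function name and wiring here are hypothetical:

```python
import math

def act_depth(halting_logits, eps=0.01):
    """Graves-style ACT halting rule for a single position.

    halting_logits: one logit per pondering step. Processing stops
    once the cumulative halting probability exceeds 1 - eps (or the
    step budget runs out); the final step receives the remainder
    weight, so the per-step weights always sum to 1.
    Returns (steps_used, weights for the weighted state average).
    """
    weights, cum = [], 0.0
    for n, logit in enumerate(halting_logits, start=1):
        p = 1.0 / (1.0 + math.exp(-logit))       # sigmoid halting prob
        if cum + p >= 1.0 - eps or n == len(halting_logits):
            weights.append(1.0 - cum)            # remainder weight
            return n, weights
        weights.append(p)
        cum += p

# Confident halting at step 1 vs. pondering for the full budget.
n_fast, w_fast = act_depth([10.0, 10.0, 10.0])     # halts immediately
n_slow, w_slow = act_depth([-10.0, -10.0, -10.0])  # uses all 3 steps
```

Because the halting probabilities are produced by a differentiable head and the output is a weighted average of intermediate states, this stays end-to-end differentiable with the LM objective, consistent with the claim above.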

Ablation Evidence

- Ecology disabled: perplexity rises 28.75%, so the mechanism is structurally load-bearing

- Controlled A/B: same weights, same data, same schedule → ecology produces better benchmarks at worse PPL

- Vanilla control plateaus at \~8K steps while T³ continues improving through 120K

- v3.4.1 variant selected through systematic ablation across multiple training configurations

Training Progression

| Task | T³ 100K | T³ 120K | Δ (pts) |
|---|---|---|---|
| BoolQ | 0.565 | 0.635 | +7.0 |
| HellaSwag | 0.470 | 0.445 | -2.5 |
| ARC-C | 0.280 | 0.290 | +1.0 |
| PIQA | 0.615 | 0.625 | +1.0 |
| COPA | 0.610 | 0.640 | +3.0 |
| WinoGrande | 0.515 | 0.520 | +0.5 |
| ARC-E | 0.485 | 0.460 | -2.5 |
| RTE | 0.540 | 0.530 | -1.0 |

Benchmarks oscillate as the ecology reallocates attention during training.

Caveats

- Not claiming SOTA at any absolute scale

- MobileLLM and SmolLM beat T³ on factual benchmarks. They were architecturally and data-optimized for this weight class

- Standard error at 124M is ±0.03-0.05; small differences (1-2 pts) are within noise

- Pythia and SmolLM numbers are approximate from published sources, not from our eval harness

- Not yet validated beyond 124M. 1.5B distributed training in progress

- These are intermediate checkpoints from an ongoing run

Setup

- Independent research, MirrorEthic LLC, one researcher

- Hardware: Legion 7i pro (RTX 5090 24GB) + x99 build (Tesla V100 16GB)

- Paper: "Semantic Space as Ising Model" in review at TMLR (co-authored with Nell Watson)

- Open source: github.com/GMaN1911/hologram-cognitive or `pip install hologram-cognitive` (Hologram-Pro is the newest release)

Happy to discuss architecture, ablations, the A/B methodology, training dynamics, or the Clifford algebra correspondence.