FlashLM v8.3 (6.5M CORTEX) beats v5.2 Transformer baseline — same 2h CPU, same data

Posted by Own-Albatross868@reddit | LocalLLaMA

After iterating from v6 to v8.3, FlashLM v8.3 outperforms the Transformer baseline on TinyStories generation quality.

Both models trained under identical constraints: the same 2h wall-clock CPU budget and the same TinyStories data.

The only variable is the architecture.

Models Compared

| Model | Architecture | Params | Training Tokens | PPL |
|---|---|---|---|---|
| v5.2 "Nova-Ignition" | Transformer + RoPE | 5.0M | full 574M (0.027 epochs) | 10.56 |
| v8.3 "CORTEX-VIII" | SWA + Gated Delta Memory | 6.5M | 10M subset (1.5 epochs) | 2.50 |

Note: v5.2 trained over the full 574M-token dataset, but the 2h budget covered only 0.027 epochs of it. v8.3's architectural efficiency lets it complete 1.5 full epochs over a 10M-token subset in the same time.
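The post doesn't spell out the exact memory update CORTEX uses, but a gated delta-rule memory (in the spirit of Gated DeltaNet-style recurrences) is typically a constant-size state that decays old content and overwrites the value stored under the current key. A minimal single-step sketch, assuming unit-norm keys and scalar gates — names and shapes here are illustrative, not the actual FlashLM code:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule memory.

    S     : (d_k, d_v) memory state
    k     : (d_k,) unit-norm key
    v     : (d_v,) value to write
    alpha : scalar in (0, 1], decay gate (forget old memory)
    beta  : scalar in (0, 1], write strength
    """
    S = alpha * S                      # decay the whole memory
    pred = k @ S                       # value the memory currently holds for k
    S = S + beta * np.outer(k, v - pred)  # delta rule: correct toward v
    return S

# Toy usage: a full-strength write stores v retrievably under k.
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 2.0, 3.0])
S = gated_delta_step(S, k, v, alpha=1.0, beta=1.0)
out = k @ S                            # ≈ v after the write
```

Because the state `S` is fixed-size regardless of sequence length, each token costs O(d_k · d_v) rather than growing with context, which is where the linear-complexity claim comes from.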

Generation Samples

Same generation parameters for both models: temperature=1.2, top_k=40 (v5.2) / top_p=0.85 (v8.3), max_tokens=100.
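For readers unfamiliar with the two filters: top-k keeps the k most likely tokens, top-p (nucleus) keeps the smallest set whose cumulative probability reaches p. A minimal sketch of both over a toy logits vector — the logits and the k=2 value are made up for illustration; only temperature=1.2 and top_p=0.85 come from the post:

```python
import numpy as np

def sample_filtered(logits, temperature=1.2, top_k=None, top_p=None, rng=None):
    """Temperature scaling followed by top-k or top-p (nucleus) filtering."""
    rng = rng or np.random.default_rng(0)
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        # Zero out everything outside the k most likely tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest prefix of tokens (by probability) with mass >= top_p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        kept = np.zeros_like(probs)
        kept[keep] = probs[keep]
        probs = kept
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])  # hypothetical 5-token vocab
tok_k = sample_filtered(logits, top_k=2)     # v5.2-style filter (toy k)
tok_p = sample_filtered(logits, top_p=0.85)  # v8.3-style filter
```

Top-p adapts the candidate set to the shape of the distribution, which tends to behave better than a fixed k when the model is confident.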

Prompt: "Once upon a time"

v5.2 (Transformer):
> Once upon a time on not pen cl nd grab wal . ily L , pl baby Sue dir , jump . aces park so luffy rec , igh made 's Lily star G began not gether ell G Tim ...

v8.3 (CORTEX):
> Once upon a time . sun like . helped look this !" began bed to . thought cake a and fish him Tom Mr Bunny fish . looked Ben place ! thinks book ?" butterfly the had and .

Prompt: "The little girl"

v5.2 (Transformer):
> The little girl ame < making c tak . nd ould One very His iled ay asked etter eating . ily too ay star j , help were ra se star re ook nicer r big poin .

Prompt: "One day a cat"

v5.2 (Transformer):
> One day a cat B er fused . nd V rot his , en Spot re M mommy r c loud . day too ay came made ot ven . day ought un there , pl cry not gether ell cl special there wal er L , pl coffee , help not Dad after by ap mommy .

v8.3 (CORTEX):
> One day a cat . wanted and . laughed the but she . looked looked Tom the . lived in ! did do do , in said had ." girl her and tree pretty loved home school rest She She tea every .

Observations

  1. v5.2 (Transformer) produces random word fragments and never forms a complete sentence. This is expected — 5M params and 0.027 epochs are simply not enough for a Transformer to learn syntax.
  2. v8.3 (CORTEX) shows clear syntactic structure. Subject-verb-object patterns appear (helped talk, wanted go, laughed the but she). Characters are named (Tom, Tim, Mr Bunny), actions are sequenced, and there's even a hint of emotion (loved home school rest).
  3. The repetition problem is largely solved. v8.1 used to output Lily Lily Lily Lily endlessly. v8.3 occasionally repeats (play play, do do do) but recovers and continues.
  4. PPL and generation quality are decoupled at this scale. v8.3's PPL (2.50) is worse than v7.4's (2.33), yet v8.3 generates much better text. Multiple epochs matter more than pure PPL for tiny models.
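On observation 4: perplexity is just the exponential of the mean per-token negative log-likelihood, which is why two models can sit close in PPL yet differ sharply in sample quality. A quick illustration with made-up per-token losses (not the post's actual training logs):

```python
import math

def perplexity(nll_per_token):
    """PPL = exp(mean negative log-likelihood per token), losses in nats."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# Hypothetical per-token cross-entropy losses:
ppl = perplexity([0.9, 0.95, 0.9, 0.92])
# A PPL near 2.5 means the model is, on average, about as uncertain as a
# uniform choice over ~2.5 tokens — it says nothing about sentence structure.
```

PPL averages away exactly the long-range coherence that the generation samples expose, which is the decoupling the observation describes.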

What Changed from v8.1 to v8.3?

Limitations (Honest)

Why This Matters

FlashLM's goal isn't to beat Llama-3. It's to find the highest possible intelligence density under extreme constraints.

CORTEX-VIII combines:

At 6.5M params and 2h CPU training, a linear-complexity architecture is already beating a Transformer on generation quality. That's a small but real data point for the "efficient architecture" camp.
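The SWA half of the architecture gets its linear cost from restricting each token to a fixed-width attention window. A minimal causal window-mask sketch — the window size of 3 is a toy choice, not FlashLM's actual setting:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: position i may attend to positions [i-window+1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, window=3)
# Each row has at most `window` True entries, so attention work grows
# linearly with sequence length instead of quadratically.
```

Pairing this local mask with a recurrent memory for long-range information is the standard recipe for keeping per-token cost constant.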

Code & Weights:

Questions welcome — happy to share training logs, hyperparameter sweeps, or failed experiments. The v6→v7 graveyard is especially educational.