FlashLM v8.3 (6.5M CORTEX) beats v5.2 Transformer baseline — same 2h CPU, same data
Posted by Own-Albatross868@reddit | LocalLLaMA
After iterating from v6 to v8.3, FlashLM v8.3 outperforms the Transformer baseline on TinyStories generation quality.
Both models trained under identical constraints:
- Hardware: 2 vCPU / 5GB RAM (free-tier cloud CPU)
- Time budget: 2 hours wall-clock
- Dataset: TinyStories (same tokenizer, vocab 4096)
- Training: from scratch, no pretraining, no distillation
The only variable is architecture.
Models Compared
| Model | Architecture | Params | Training Tokens | PPL |
|---|---|---|---|---|
| v5.2 "Nova-Ignition" | Transformer + RoPE | 5.0M | full 574M (0.027 epochs) | 10.56 |
| v8.3 "CORTEX-VIII" | SWA + Gated Delta Memory | 6.5M | 10M subset (1.5 epochs) | 2.50 |
Note: v5.2 streamed the full 574M-token dataset, but the 2h budget covered only 0.027 of an epoch. v8.3's architectural efficiency allows 1.5 full epochs over a 10M-token subset in the same time.
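Worth noting: the two budgets work out to almost the same number of tokens actually seen. A quick sanity check on the figures from the table:

```python
# Tokens actually seen by each model under the 2h budget (figures from the post).
full_dataset_tokens = 574_000_000

v52_epochs = 0.027  # v5.2: one partial pass over the full dataset
v52_tokens_seen = full_dataset_tokens * v52_epochs  # ~15.5M tokens

v83_subset_tokens = 10_000_000
v83_epochs = 1.5  # v8.3: 1.5 passes over a 10M-token subset
v83_tokens_seen = v83_subset_tokens * v83_epochs  # 15M tokens

# Both models see ~15M tokens; the real difference is repeated exposure
# to a small corpus vs. a single shallow pass over a huge one.
print(f"v5.2 saw ~{v52_tokens_seen / 1e6:.1f}M tokens, v8.3 saw {v83_tokens_seen / 1e6:.1f}M")
```

So the comparison is roughly token-matched; what differs is epochs over the data, not raw token count.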
Generation Samples
Generation parameters were matched where possible: temperature=1.2 and max_tokens=100 for both; v5.2 samples with top_k=40, v8.3 with top_p=0.85.
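For anyone unfamiliar, the two samplers differ only in how they truncate the distribution before drawing a token. A minimal NumPy sketch (function name and structure are mine, not from the repo):

```python
import numpy as np

def sample(logits, temperature=1.2, top_k=None, top_p=None, rng=None):
    """Temperature + top-k / top-p (nucleus) sampling over a 1-D logits array."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Keep only the k most probable tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass reaches p.
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

top_k=40 keeps a fixed-size candidate set regardless of how peaked the distribution is; top_p=0.85 adapts the set size to the model's confidence, which tends to interact better with the entropy regularization described below.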
Prompt: "Once upon a time"
| v5.2 (Transformer) | v8.3 (CORTEX) |
|---|---|
| `Once upon a time on not pen cl nd grab wal . ily L , pl baby Sue dir , jump . aces park so luffy rec , igh made 's Lily star G began not gether ell G Tim ...` | `Once upon a time . sun like . helped look this !" began bed to . thought cake a and fish him Tom Mr Bunny fish . looked Ben place ! thinks book ?" butterfly the had and .` |
Prompt: "The little girl"
| v5.2 (Transformer) | v8.3 (CORTEX) |
|---|---|
| `The little girl ame <` | `making c tak . nd ould One very His iled ay asked etter eating . ily too ay star j , help were ra se star re ook nicer r big poin .` |
Prompt: "One day a cat"
| v5.2 (Transformer) | v8.3 (CORTEX) |
|---|---|
| `One day a cat B er fused . nd V rot his , en Spot re M mommy r c loud . day too ay came made ot ven . day ought un there , pl cry not gether ell cl special there wal er L , pl coffee , help not Dad after by ap mommy .` | `One day a cat . wanted and . laughed the but she . looked looked Tom the . lived in ! did do do , in said had ." girl her and tree pretty loved home school rest She She tea every .` |
Observations
- v5.2 (Transformer) produces random word fragments. It never forms a complete sentence. This is expected — 5M params and 0.027 epochs simply isn't enough for a Transformer to learn syntax.
- v8.3 (CORTEX) shows clear syntactic structure. Subject-verb-object patterns appear (`helped talk`, `wanted go`, `laughed the but she`). Characters are named (`Tom`, `Tim`, `Mr Bunny`), actions are sequenced, and there's even a hint of emotion (`loved home school rest`).
- The repetition problem is largely solved. v8.1 used to output `Lily Lily Lily Lily` endlessly. v8.3 occasionally repeats (`play play`, `do do do`) but recovers and continues.
- PPL and generation quality are decoupled at this scale. v8.3's PPL (2.50) is worse than v7.4's (2.33), yet v8.3 generates much better text. Multiple epochs matter more than pure PPL for tiny models.
What Changed from v8.1 to v8.3?
- Subset training: 10M tokens instead of full 574M → 1.5 epochs in 2h (v8.1 only saw 0.027 epochs).
- Entropy regularization in loss (weight=0.01) — prevents peaked distributions.
- Zero weight decay on embedding/head — preserves low-frequency token distinctions.
- SWA window reduced to 32, FFN kept at 512 — better throughput, same expressiveness.
- Lookahead value heads down-weighted — they didn't help generation.
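The entropy regularization is simple to sketch: subtract a small multiple of the predictive entropy from the cross-entropy, which rewards flatter output distributions and counteracts the peaked predictions behind degenerate repetition loops. This is my own minimal NumPy version, not the repo's actual loss code:

```python
import numpy as np

def entropy_regularized_loss(logits, target, ent_weight=0.01):
    """Cross-entropy minus ent_weight * predictive entropy (single position).

    The entropy bonus penalizes overly peaked distributions, which at tiny
    scale tend to collapse into repetition loops like "Lily Lily Lily".
    """
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    probs = np.exp(log_probs)
    ce = -log_probs[target]                        # standard cross-entropy
    entropy = -(probs * log_probs).sum()           # H(p) >= 0
    return ce - ent_weight * entropy
```

In a real training loop this would be averaged over the batch and sequence; with weight=0.01 the bonus nudges the optimum without swamping the CE term.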
Limitations (Honest)
- Still not fluent. Sentences are broken, grammar is shaky. 6.5M parameters is below the "syntax threshold" for English (~10-20M).
- TinyStories only. This isn't a general-purpose LLM.
- v5.2 is 5M params, v8.3 is 6.5M. The quality gap is too large to be explained by the extra 1.5M params alone, but I'll test a 5M CORTEX variant to make the comparison perfectly matched.
Why This Matters
FlashLM's goal isn't to beat Llama-3. It's to find the highest possible intelligence density under extreme constraints.
CORTEX-VIII combines:
- Sliding Window Attention (local, O(T))
- Gated Delta Memory (global, linear recurrence)
- Ternary-friendly design (though this run used float32 for speed)
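The Gated Delta Memory follows the gated delta-rule family: a matrix state that is decayed, has the old association under the current key erased, then has the new key-value association written, all in O(d²) per token and O(T) overall. This is my own minimal sketch of that standard formulation; FlashLM's actual layer surely differs in gating details and parameterization:

```python
import numpy as np

def gated_delta_memory(q, k, v, alpha, beta):
    """Linear-time recurrent memory with a gated delta-rule update.

    Per token t:  S_t = alpha_t * (S_{t-1} - beta_t k_t k_t^T S_{t-1})
                        + beta_t k_t v_t^T
    i.e. decay the state, erase whatever was stored under key k_t,
    then write the new value; read with o_t = q_t^T S_t.

    q, k: (T, d) queries/keys; v: (T, d_v) values;
    alpha, beta: (T,) gates in [0, 1].
    """
    T, d = k.shape
    S = np.zeros((d, v.shape[1]))              # the matrix-valued state
    out = np.empty((T, v.shape[1]))
    for t in range(T):
        kt = k[t:t + 1].T                      # (d, 1) column key
        S = alpha[t] * (S - beta[t] * kt @ (kt.T @ S))  # decay + erase
        S = S + beta[t] * kt @ v[t:t + 1]      # write new association
        out[t] = (q[t:t + 1] @ S)[0]           # read current state
    return out
```

The erase-then-write step is what distinguishes a delta rule from a plain linear-attention accumulator: rewriting the same key replaces the stored value instead of summing into it, which is why the memory stays usable over long contexts.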
At 6.5M params and 2h CPU training, a linear-complexity architecture is already beating a Transformer on generation quality. That's a small but real data point for the "efficient architecture" camp.
Code & Weights:
- GitHub: github.com/changcheng967/FlashLM
- v5.2 weights: HF link
- v8.3 weights: HF link
Questions welcome — happy to share training logs, hyperparameter sweeps, or failed experiments. The v6→v7 graveyard is especially educational.