260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS

Posted by MironV@reddit | LocalLLaMA | 13 comments

I know this sub loves absurd LLM projects, so sharing my contribution with everyone while we wait for the new Qwen 3.7 models to drop!

I got a tiny LLM running inside an RTOS, which itself runs inside a custom-built JavaScript emulator for the Freescale ColdFire MCF5307 (a derivative of the legendary Motorola 68K that powered the original Mac and Sega Genesis).

I wrote the RTOS back in 2008 with three classmates for our university embedded systems course. It was lost to time, with the hardware and original ROM long gone. A few months ago, I decided to use Claude and Qwen to revive it, writing the CPU emulator from scratch and reverse-engineering the ROM from kernel calls. Once the original 2008 binary was booting, I wanted to go full inception and try running an LLM on the emulated stack.

As the starting point, I took Karpathy's llama2.c with the stories260K model trained on TinyStories. It's about half a megabyte of weights, which is tight but fits in the 16 MB of emulated memory after shrinking the kernel stack to free up room. The ColdFire has no FPU, so every float calculation goes through libgcc's software emulation. That means a forward pass would need millions of emulated instructions per token, which is a non-starter.

To get around this, I quantized the model to INT8 with a per-row scale factor, turning the critical matmuls into pure integer math and dropping the inner loop to a handful of instructions. For floats outside of matmul, I went old school and used Carmack's fast inverse square root (from Quake) and a whole bunch of lookup tables for RoPE to avoid trig calculations. The only things that stayed as emulated floating point are softmax and RMSNorm, but those get hit infrequently enough that it's still relatively fast.
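For anyone curious what per-row INT8 quantization looks like, here is a minimal Python sketch. It is not the project's actual code: the function names (`quantize_rows`, `quantize_vector`, `int8_matvec`) are made up for illustration, and it assumes the activation vector is also quantized to INT8 with a single scale, which the post does not spell out. On an FPU-less target the final rescale would presumably also be done in fixed point; here it is left as a float multiply for clarity.

```python
def quantize_rows(weights):
    """Quantize each row of a float matrix to int8 range, one scale per row."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in the row to 127 (avoid dividing by zero).
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def quantize_vector(x):
    """Quantize an activation vector to int8 range with a single scale (assumption)."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in x], scale


def int8_matvec(q_rows, row_scales, q_x, x_scale):
    """Matrix-vector product where the inner loop is integer-only."""
    out = []
    for q_row, row_scale in zip(q_rows, row_scales):
        acc = 0  # integer accumulator: only integer multiply-adds in the hot loop
        for w, v in zip(q_row, q_x):
            acc += w * v
        # One rescale per output row instead of one float op per weight.
        out.append(acc * row_scale * x_scale)
    return out


weights = [[0.5, -1.0, 0.25], [0.1, 0.2, -0.3]]
x = [1.0, 2.0, -1.0]
q_rows, row_scales = quantize_rows(weights)
q_x, x_scale = quantize_vector(x)
approx = int8_matvec(q_rows, row_scales, q_x, x_scale)
exact = [sum(w * v for w, v in zip(row, x)) for row in weights]
```

The point of the layout is that the float work shrinks from one multiply-add per weight to one rescale per output row, which is what makes the inner loop cheap on a CPU without an FPU.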

The whole model outputs at a blistering 2-4 seconds per token, generating mostly coherent (and sometimes weird) TinyStories-style English!

You can try it directly in your browser: just type %a to run the model. I have a longer write-up on my whole RTOS archeology project here.

Obviously, this is not useful for anything practical, but it's neat to see LLMs running on potato-level stacks. My next step is putting the whole stack on an FPGA that re-implements the original hardware, which should bring it up to actually usable speeds.

Ok-Internal9317@reddit

Qwen 3.7 128K, I'll wait for that

MironV@reddit (OP)

Haha I’ll be waiting for the Unsloth quants!

Disposable110@reddit

Imagine some madlad inventing the transformer architecture in the 90's and instead of having SETI at home we had to distribute training the damn thing on people's home PCs.

And then after booting it up and crunching matrix operations for a week it outputs 42. :D

ambient_temp_xeno@reddit

I wonder how far back in the past one could go and have enough training data and enough compute to make an LLM.

jazir55@reddit

Now just finish your time machine and everyone in the 90s will think you are a hero.

AndreVallestero@reddit

Any reason you didn't go with the latest Qwen 3.5 0.8B?

GrokiniGPT@reddit

Finally! Away with the big corporations, off with my 260k parameter model!

MironV@reddit (OP)

Haha, long live nano local models on obscure hardware!

Mickenfox@reddit

I haven't seen anyone run an LLM on a Nintendo 64. Just saying.

It could be a modern remake of Hey You, Pikachu!

MironV@reddit (OP)

I have something right up your alley then: https://github.com/sophiaeagent-beep/n64llm-legend-of-Elya

Mickenfox@reddit

Ooh

SV_SV_SV@reddit

Dude, you are a legend! Didn't understand most of the technical brief - made me actually check "Carmack's fast inverse square root" lol, and yes it is THE Carmack. Interesting learning about RTOS as well.
Massive props for the vintage computer LLM necromancy, very fun, and it makes you wonder: if we'd had this knowledge on the software side back then, could there have been any practical uses on old school 486 / etc architectures?
Like seeing 3D demos running on C64 systems etc
https://youtu.be/LE_D7H10GAo?si=L9pGQGjlMzXnFZqX

What are the actual limitations of old tech.. and what about the current one? :)

MironV@reddit (OP)

Thank you! If you go down the Carmack rabbit hole you’ll find that function lives in a Quake source file with a literal “what the fuck?” comment next to it lol. It’s a crazy optimization!
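For anyone who wants to poke at it without reading C, here is a rough Python port of the trick. This is an illustrative sketch, not the code running in the emulator: it uses `struct` to reinterpret the float32 bits as an integer, applies the well-known magic constant `0x5F3759DF`, and then does one Newton-Raphson refinement step. It assumes a positive, normal float input.

```python
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for positive x using the Quake III bit trick."""
    # Reinterpret the float32 bit pattern as a 32-bit unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Shift halves the exponent; the magic constant corrects the bias
    # and gives a decent first guess using integer ops only.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float32.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step to tighten the estimate.
    return y * (1.5 - 0.5 * x * y * y)


estimates = {x: fast_inv_sqrt(x) for x in (0.25, 1.0, 2.0, 100.0)}
```

The appeal on hardware without an FPU is that the first guess comes from an integer shift and subtract, so the only float work left is the handful of multiplies in the refinement step.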

Your bigger question is an interesting one…

Back of the envelope math says a 486 from 1992 could’ve run inference on a stories260K-class model, probably faster than my ColdFire emulator does, since it has a real FPU. Training is another story though. A single 486 would probably take 2-3 years to train this model, but a Pentium cluster of ~100 boxes could’ve maybe done the job.
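To make that kind of estimate reproducible, here is a small Python sketch of the arithmetic. It leans on two common rules of thumb (roughly 2 FLOPs per parameter per generated token for inference, roughly 6 FLOPs per parameter per training token) and is not how the numbers above were derived. The function names are made up, and the 1 MFLOPS figure in the example is a hypothetical placeholder for sustained throughput, not a measured 486 number.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_second(n_params: int, flops_per_second: float) -> float:
    """Rough inference speed: about one multiply and one add per weight per token."""
    flops_per_token = 2 * n_params
    return flops_per_second / flops_per_token


def training_days(n_params: int, n_tokens: float, flops_per_second: float) -> float:
    """Rough training time using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / flops_per_second / SECONDS_PER_DAY


# Hypothetical placeholder: 1 MFLOPS sustained (an assumption, not a benchmark).
ASSUMED_FLOPS = 1e6
inference_rate = tokens_per_second(260_000, ASSUMED_FLOPS)
```

Plug in your own throughput and training-token count; the result swings by orders of magnitude depending on both, which is why these are ballpark figures only.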

But even if someone had the compute and knew about transformer architectures, the issue is getting the data. For example, TinyStories is made up of synthetic short stories generated by GPT-3.5 and GPT-4, so you need a massive LLM to generate it in the first place. Real text corpora in 1995 would’ve been much harder to get. The web was tiny and the only other digital datasets were probably Project Gutenberg and Usenet.

The demoscene stuff from that era pushed the hardware of the time to do impossible-looking things, just in graphics rather than ML.

I definitely feel applying new ideas to old or constrained hardware is an interesting space right now!