I put a transformer model on a stock Commodore 64
Posted by gizmo64k@reddit | LocalLLaMA | 17 comments
Not a chatbot pretending. Not a lookup table in a trench coat. A proper decoder-only transformer. Attention, RMSNorm, feed-forward, residuals, the works. Two layers, four heads, about 25,000 parameters. All int8. Trained with quantization-aware training so the float model and the integer model agree on what the next token should be.
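For scale, here's a back-of-envelope parameter budget in Python. The dimensions below are placeholder guesses for illustration, not the exact config, but they show how a 2-layer decoder lands near 25k parameters:

```python
# Rough parameter budget for a 2-layer, 4-head decoder-only
# transformer. These dimensions are illustrative guesses,
# NOT the project's actual config.
d_model, d_ff, vocab, n_layers = 32, 64, 128, 2

attn = 4 * d_model * d_model       # Wq, Wk, Wv, Wo
ffn = 2 * d_model * d_ff           # up- and down-projection
per_layer = attn + ffn             # ignoring the tiny norm params
embed = vocab * d_model            # token embedding table
head = vocab * d_model             # output projection (untied)

total = n_layers * per_layer + embed + head
print(total)  # 24576 -- the same ballpark as "about 25,000"
```

At one byte per weight, that count is also why the whole thing fits on a floppy with room to spare.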
It lives on a floppy. It takes more than a minute per token. A full reply is several minutes of waiting while the border flashes colors and the SID chip beeps once per token to tell you it’s still in there, still pondering!
I’ve been sitting in the same room with it for days now. Occasional beep behind me. I still grin every single time it announces a token drop :D

Well, admittedly.. it’s not exactly smart, but considering that its 25,000 parameters are about 70 million times fewer than GPT-4’s, I think we can accept that. I trained my C64 on roughly a hundred short emotional-support exchanges (“i’m sad” -> “that sounds really hard”) and now it tries to be nice to me, in its broken little “me me, here here”-way.
“HELLO! RE SOUNDS ME. MEFUL!” is arguably nonsense, but the intention somehow shines through.. Or is it my mind tricking me into believing it’s deeper than it should be? All I can say is that the first time I read it I felt a deep satisfaction and a childhood dream coming true.. My C64 is alive now! Don’t ask me to defend that. I’m just reporting ;)
64k should be enough for every bot
25 KB of weights on a machine with 64 KB of RAM. After you load them, there’s still room for the code, the activation buffers, the tokenizer tables, BASIC, the KERNAL, all of it. The C64 has actual slack left over after hosting a real transformer. In hardware from 1982.
The trick is that every weight is a single byte. A per-tensor shift baked in during training lets int8 do the work that most frameworks hand to 32-bit floats. 4x less storage, 4x less bandwidth, and no accuracy cliff if you trained for it.
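The idea in a few lines of Python (a toy sketch of the scheme, not the shipped code):

```python
def quantize_per_tensor(ws, shift):
    """Round floats to int8 with a power-of-two scale:
    q = round(w * 2**shift), clamped to [-128, 127]."""
    return [max(-128, min(127, round(w * (1 << shift)))) for w in ws]

def int8_dot(qw, qx, shift):
    """Integer dot product. The products carry 2*shift fractional
    bits, so one right shift brings the result back to the
    activation's fixed-point scale."""
    acc = sum(a * b for a, b in zip(qw, qx))  # fits easily in 32 bits
    return acc >> shift

# Toy numbers: float reference is 0.5*1.0 - 0.25*0.5 + 0.75*(-1.0) = -0.375
w, x, shift = [0.5, -0.25, 0.75], [1.0, 0.5, -1.0], 6

qw = quantize_per_tensor(w, shift)  # [32, -16, 48]
qx = quantize_per_tensor(x, shift)  # [64, 32, -64]
print(int8_dot(qw, qx, shift) / (1 << shift))  # -0.375, matching the float path
```

Because the scale is a power of two, "rescale" is a single shift instruction instead of a float multiply, which is exactly what an 8-bit CPU wants.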
The 6510 has no multiplier, no divider, no floating point. So every matmul is shift-and-add. Division is restoring long division. RMSNorm wants a square root, so there’s an integer isqrt. Softmax is a 128-entry precomputed exp table. All of it in pure assembly, bit-exact against a Python reference before any of it touched my precious real hardware.
Who needs NVIDIA anyway?
The chip the C64 ships with can run the same architecture OpenAI or Google runs their models on. It’s just slower. Much, much, much slower. Proudly slower.
You can run your own AI chatbot on your own hardware! No excuses! :)
This whole project started as a joke and turned into something I actually mean.
Every headline about AI right now is about scale. Bigger models, bigger clusters, bigger data centers, bigger power draw, bigger water bills, bigger government contracts. Someone announces they’re buying the world’s supply of DRAM. Memory prices triple. They quietly walk it back. Prices don’t come down. Small builders everywhere get to clean up the mess. Retro repair folks can’t source chips. Game studios’ hardware budgets explode. The child who knocked the shelves over is already in the car.
And then the same people turn around and tell you the future requires more muscle. More compute. More everything. Trust them, Bro! The singularity needs another hundred billion dollars and it also needs your grid capacity and also your groundwater. The future isn’t more muscle. The future is better thinking. A 25k-parameter transformer with a thoughtfully trained tokenizer, sensible quantization, and honest arithmetic can have a (broken, tiny, sweet) conversation on a computer from 1982. Scale that insight up and you get models that are small enough to run on your phone, your fridge, your car, your Commodore, without anyone needing to own a power plant. The research is already pointing that way. Smaller models, better data, smarter training, sparsity, distillation. Every month there’s another paper saying “actually you can do this with a tenth of the parameters if you just…”
We won’t get to find out where that road leads. Not really. Because the people with the money decided the answer was “more” before anyone finished the sentence. The billionaires eat all the cake. The rest of us get told the cake shortage is our fault and also here’s a subscription.
Well, it doesn’t have to be that way.. and because actions speak louder than words: I put a real transformer on a 1 MHz home computer from the year E.T. came out, and I released it for you to experiment with…
Everything is on GitHub: https://github.com/gizmo64k/soulplayer-c64 .. weights, disk image... and soon the source, too
WhoRoger@reddit
A while ago I was bored and tried to calculate how fast an 8-bit home computer would run an 8B model (assuming it would be loaded and managed externally, and the computer was only used for calculations). I think it came to about 50 years per token?
Anyway, that's really nice. I keep wondering how far transformers can be pushed, considering the crazy stuff people do. Or what comes next. I guess a chatbot isn't ideal here, but maybe something almost useful can be done on a C64?
gizmo64k@reddit (OP)
the current state this implementation is in is merely "technically functional" and has severe limitations.. but I am convinced this can be pushed further by more experienced folks, which is why I am so eager to put this out in source asap.. i got far enough with this to think this is worth pushing some more..
WhoRoger@reddit
For sure. It's really cool, I love it.
rhinodevil@reddit
Also good for the electric bill, because the 6502 consumes the same amount of power, no matter what it is currently doing! ;-)
gizmo64k@reddit (OP)
Haha, really? I wasn't aware of that.. makes an even sweeter argument overall :)
HopePupal@reddit
so you told claude to copy https://github.com/ytmytm/llama2.c64 congrats i guess
gizmo64k@reddit (OP)
Respectfully, no. llama2.c64 *needs* a 2MB REU, a 32× RAM expansion. Mine runs on a stock, unmodified C64. 64KB. Nothing plugged in. That's not a small difference, that's the whole point.
Same project category. Completely different constraint, completely different code, but nice try!
Waarheid@reddit
You can't even write your own 2 line reddit comment?
gizmo64k@reddit (OP)
Enlighten me, be so kind.
HopePupal@reddit
idk why people keep depriving themselves of the joy of posting. our pet clankers are supposed to do the annoying parts, not the fun parts
unrelated: love your avatar. CITY was good but Nichijou is supreme
gizmo64k@reddit (OP)
ooohhh i see, should've said so sooner. here, go feral on another project of mine, too: [pure schlopp, no brakes](https://www.indiepixel.de/coc/) enjoy the meal! <3
HopePupal@reddit
soooo where's that source, then?
gizmo64k@reddit (OP)
Coming up on GitHub within a week or two. Full source, tests, trained weights, the .d64, and maybe a writeup. We'll see, I'm not as fast as Claude, so sorry that you have to wait a bit :)
Chromix_@reddit
You gave it a whole two layers? That means your model should be able to perform basic XOR. 😉
Now the question is: Which one is the layer that transforms the input into the internal representation, which layer handles the internal representation, and which one transforms it back? Maybe you can upgrade to 4 layer INT4 quant some day.
gizmo64k@reddit (OP)
Yes, indeed! My C64 has caught up to knowledge from 1969 :) And also yes, my C64 speaks without thinking, no middle ground to ponder ;)
gizmo64k@reddit (OP)
I'll put INT4 + 4 layers on the list.. halving the weights frees enough space to stack deeper, and maybe then BOB can finally afford an inner monologue, haha.
philanthropologist2@reddit
This is the actual future