Do state space models dream of recursive sheep?

Posted by Admirable_Dirt_2371@reddit | LocalLLaMA

Howdy all! I'm fairly new to the scene; I come from a web dev background. I'm not working at the moment, and I wanted to learn LLMs from the bottom up, so I started messing around with building micro/nano models that I could train locally and maybe enter into the BabyLM 2026 evaluation. The graphic shows the architecture of my current model; I'd love any feedback, criticism, or questions. I'll explain more about my setup and the architecture below.

I'm using my gaming PC to train, run, and code everything. It's a Micro Center pre-built with an RX 7600 GPU (8 GB VRAM), a Ryzen 5 7600X CPU, and 16 GB of DDR5 RAM. I'm writing all the code in Elixir/Nx, which compiles down to native code through EXLA/XLA. I started by running an Ubuntu VM under Windows 11 but couldn't get my GPU working in the VM, so I set up a dual boot with an Ubuntu partition. I use Livebook to execute all the Elixir code and run the EXLA/XLA compilation. It was a bit of a headache to get set up, but now that it's going it works great for what I'm doing.

When I first started looking into "traditional" LLM architectures, I felt that some aspects should have better solutions. So I wanted to try to build my own micro model based around character-level tokenization and structured closer to how a human brain processes language, i.e. letter -> syllable -> word. I don't expect to come up with anything close to revolutionary or even amusing; this is mostly just a learning exercise. That pursuit led me to this current architecture. All my training is done with the BabyLM 2026 strict-small data from Hugging Face.

It's not really a diffusion language model, like the title of the graphic implies (thanks, Gemini). It's more of a hierarchical state space model whose bottom level is a diffusion-based character encoder. I first trained the encoder to map the 128 base ASCII characters into a 512d embedding map. I then trained the second level, on top of the frozen encoder weights, to predict the coordinates (in the embedding map) of the next character. Finally, I trained the top level with the frozen embedding map and frozen level 1 weights, the idea being that it yields to the lower level's predictions when those are optimal but steps in to make higher-level, grammar-based predictions when needed.
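The staged train-then-freeze scheme can be sketched roughly like this. This is a minimal illustration in Python/NumPy rather than Elixir/Nx, and every name and dimension here is made up for the example, not taken from the actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three levels (tiny dims, illustrative only).
params = {
    "encoder": rng.normal(size=(128, 8)),  # char -> embedding
    "level2":  rng.normal(size=(8, 8)),    # next-char predictor
    "level3":  rng.normal(size=(8, 8)),    # higher-level SSM
}

def sgd_step(params, grads, trainable, lr=0.1):
    """Update only the groups named in `trainable`; frozen groups
    keep their weights untouched."""
    for name in params:
        if name in trainable:
            params[name] -= lr * grads[name]
    return params

# Stage 1: train the encoder alone.
# Stage 2: encoder frozen, train level2.
# Stage 3: encoder and level2 frozen, train level3.
stages = [{"encoder"}, {"level2"}, {"level3"}]
```

In a real training loop you'd compute `grads` from a loss at each stage; the point is just that each stage's `trainable` set excludes everything trained earlier.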

That's the thing I'm having the most trouble with. I get great results from the trained level 1 weights. A basic inference probe returns meaningful, correctly spelled English words (after just one epoch of training), though nothing structurally coherent, and all the words are on the shorter side. But that's exactly what I wanted. I did some basic BLiMP evals too, and it scored around 55% on the easier files but trended slightly below 50% on the harder files. Still, that's fine for what I wanted. But when I add the larger main SSM on top of everything else, it regresses slightly or just doesn't improve, even after 6+ epochs of training. I've tried a lot of different tweaks but haven't been able to figure it out yet. It could just be the limit of this architecture, but I'm stubborn and don't want to give up yet.

some numbers:

Training data: ~4.5M-character list that is the result of me combining and cleaning the 2026 BabyLM strict-small training data. I converted accented characters to their base ASCII forms (i.e. à to a) and removed any long sections of symbols.
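For the accent-folding step, the usual trick is Unicode decomposition followed by dropping anything non-ASCII. This is an assumption about how it could be done, not necessarily how the post's cleaning script works:

```python
import unicodedata

def to_base_ascii(text):
    """Fold accented Latin characters to their base ASCII forms
    (e.g. 'à' -> 'a') by decomposing into base char + combining
    marks, then dropping everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_base_ascii("à la carte"))  # -> a la carte
```

One caveat with this approach: characters with no decomposition (like 'ø' or 'ß') are silently dropped rather than mapped, so they may need special-casing.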

Diffusion embedding map: 128 vocab size, 512 dimensions, 65,536 parameters, trained with MSE loss for one epoch down to a value of 1.1245.

Level 2 (tinySSM): six matrices at 512d x 256 rank and one 512d scalar vector, 784,944 parameters, trained with a soft nearest-neighbor loss for one epoch down to a value of 2.4711.
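A soft nearest-neighbor objective over an embedding map is typically set up like the sketch below: softmax over negative distances from the predicted vector to every vocabulary embedding, then cross-entropy against the true character's index. This is a guess at the general shape of the loss, not the post's exact implementation:

```python
import numpy as np

def soft_nn_loss(pred, embeddings, target_idx, temperature=1.0):
    """Cross-entropy over a distance-based softmax: the closer `pred`
    is to embeddings[target_idx] relative to the rest of the vocab,
    the lower the loss."""
    d2 = np.sum((embeddings - pred) ** 2, axis=1)  # (vocab,)
    logits = -d2 / temperature
    logits -= logits.max()                         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]
```

Unlike plain MSE to the target embedding, this is contrastive: it rewards being closer to the right character than to the wrong ones, which suits a lookup-by-nearest-neighbor decoding step.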

Level 3 (mainSSM): three matrices at 512d x 512 rank and four more 512d scalar vectors, 788,480 parameters, trained with MSE, cosine distance, and soft-NN loss (separately and as a mix); none show meaningful improvement after multiple epochs.

For a grand total of 1,638,960 parameters.
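The per-level counts do sum to the grand total. (One small wrinkle: six full 512x256 matrices plus one 512-d vector would be 786,944 rather than 784,944, so level 2's figure is taken at face value here; perhaps one of the matrices is slightly smaller.)

```python
# Parameter counts as reported in the post.
embed  = 128 * 512                 # 65,536  (128 vocab x 512 dims)
level2 = 784_944                   # as reported (see note above)
level3 = 3 * 512 * 512 + 4 * 512   # 788,480 (three 512x512 + four 512-d vectors)

total = embed + level2 + level3
print(total)  # -> 1638960
```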

I'd love to hear what you all think!