Do state space models dream of recursive sheep?

Posted by Admirable_Dirt_2371@reddit | LocalLLaMA

Howdy all! I'm fairly new to the scene; I come from a web dev background. I'm not working at the moment, and I wanted to learn LLMs from the bottom up, so I started messing around with building micro/nano models that I could train locally and maybe enter into the BabyLM 2026 evaluation. The graphic shows the architecture of my current model; I'd love any feedback, criticism, or questions. I'll explain more about my setup and the architecture below.

I'm using my gaming PC to train, run, and code everything. It's a Micro Center pre-built with an RX 7600 GPU (8 GB VRAM), a Ryzen 5 7600X CPU, and 16 GB of DDR5 RAM. I'm writing all the code in Elixir/Nx, which compiles down to native code through EXLA/XLA. I started by running an Ubuntu VM under Windows 11 but couldn't get my GPU working in the VM, so I set up a dual boot with an Ubuntu partition. I use Livebook to execute all the Elixir code and run the EXLA/XLA compilation. It was a bit of a headache to get set up, but now that it's going it works great for what I'm doing.

When I first started looking into "traditional" LLM architectures, I felt that some aspects should have better solutions. So I wanted to try to build my own micro model based around character-level tokenization and structured closer to how a human brain processes language, i.e. letter -> syllable -> word. I don't expect to come up with anything close to revolutionary or even amusing; this is mostly just a learning exercise. That pursuit led me to this current architecture. All my training is done with the BabyLM 2026 strict-small data from Hugging Face.

It's not really a diffusion language model, like the title of the graphic implies (thanks, Gemini). It's more of a hierarchical state space model whose bottom level is a diffusion-based character encoder. I first trained the encoder to map the 128 base ASCII characters into a 512d embedding map. I then trained the second level, on top of the frozen encoder weights, to predict the coordinates (in the embedding map) of the next character. Finally, I trained the top level with the frozen embedding map and frozen level 1 weights, the idea being that it yields to the lower level's predictions when those are optimal but steps in to make higher-level, grammar-based predictions when needed.
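The staged train-then-freeze scheme can be sketched roughly like this. This is a minimal illustration in Python/NumPy rather than Elixir/Nx, and every name and dimension here is made up for the example, not taken from the actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three levels (tiny dims, illustrative only).
params = {
    "encoder": rng.normal(size=(128, 8)),  # char -> embedding
    "level2":  rng.normal(size=(8, 8)),    # next-char predictor
    "level3":  rng.normal(size=(8, 8)),    # higher-level SSM
}

def sgd_step(params, grads, trainable, lr=0.1):
    """Update only the groups named in `trainable`; frozen groups
    keep their weights untouched."""
    for name in params:
        if name in trainable:
            params[name] -= lr * grads[name]
    return params

# Stage 1: train the encoder alone.
# Stage 2: encoder frozen, train level2.
# Stage 3: encoder and level2 frozen, train level3.
stages = [{"encoder"}, {"level2"}, {"level3"}]
```

In a real training loop you'd compute `grads` from a loss at each stage; the point is just that each stage's `trainable` set excludes everything trained earlier.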

That's the thing I'm having the most trouble with. I get great results from the trained level 1 weights. A basic inference probe returns meaningful, correctly spelled English words (after just one epoch of training), though nothing structurally coherent, and all the words are on the shorter side. But that's exactly what I wanted. I did some basic BLiMP evals too, and it scored around 55% on the easier files but trended slightly below 50% on the harder files. Still, that's fine for what I wanted. But when I add the larger main SSM on top of everything else, it regresses slightly or just doesn't improve, even after 6+ epochs of training. I've tried a lot of different tweaks but haven't been able to figure it out yet. It could just be the limit of this architecture, but I'm stubborn and don't want to give up yet.

some numbers:

Training data: ~4.5M-character list that is the result of me combining and cleaning the 2026 BabyLM strict-small training data. I converted accented characters to their base ASCII forms (i.e. à to a) and removed any long sections of symbols.
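For the accent-folding step, the usual trick is Unicode decomposition followed by dropping anything non-ASCII. This is an assumption about how it could be done, not necessarily how the post's cleaning script works:

```python
import unicodedata

def to_base_ascii(text):
    """Fold accented Latin characters to their base ASCII forms
    (e.g. 'à' -> 'a') by decomposing into base char + combining
    marks, then dropping everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_base_ascii("à la carte"))  # -> a la carte
```

One caveat with this approach: characters with no decomposition (like 'ø' or 'ß') are silently dropped rather than mapped, so they may need special-casing.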

Diffusion embedding map: 128 vocab size, 512 dimensions, 65,536 parameters, trained with MSE loss for one epoch down to a value of 1.1245.

Level 2 (tinySSM): six matrices at 512d x 256 rank and one 512d scalar vector, 784,944 parameters, trained with a soft nearest-neighbor loss for one epoch down to a value of 2.4711.
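A soft nearest-neighbor objective over an embedding map is typically set up like the sketch below: softmax over negative distances from the predicted vector to every vocabulary embedding, then cross-entropy against the true character's index. This is a guess at the general shape of the loss, not the post's exact implementation:

```python
import numpy as np

def soft_nn_loss(pred, embeddings, target_idx, temperature=1.0):
    """Cross-entropy over a distance-based softmax: the closer `pred`
    is to embeddings[target_idx] relative to the rest of the vocab,
    the lower the loss."""
    d2 = np.sum((embeddings - pred) ** 2, axis=1)  # (vocab,)
    logits = -d2 / temperature
    logits -= logits.max()                         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_idx]
```

Unlike plain MSE to the target embedding, this is contrastive: it rewards being closer to the right character than to the wrong ones, which suits a lookup-by-nearest-neighbor decoding step.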

Level 3 (mainSSM): three matrices at 512d x 512 rank and four more 512d scalar vectors, 788,480 parameters, trained with MSE, cosine distance, and soft-NN loss (separately and as a mix); none show meaningful improvement after multiple epochs.

For a grand total of 1,638,960 parameters.
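The per-level counts do sum to the grand total. (One small wrinkle: six full 512x256 matrices plus one 512-d vector would be 786,944 rather than 784,944, so level 2's figure is taken at face value here; perhaps one of the matrices is slightly smaller.)

```python
# Parameter counts as reported in the post.
embed  = 128 * 512                 # 65,536  (128 vocab x 512 dims)
level2 = 784_944                   # as reported (see note above)
level3 = 3 * 512 * 512 + 4 * 512   # 788,480 (three 512x512 + four 512-d vectors)

total = embed + level2 + level3
print(total)  # -> 1638960
```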

I'd love to hear what you all think!