Baby Dragon Hatchling Training?
Posted by Clueless_Nooblet@reddit | LocalLLaMA | 10 comments
Hello, I'd like to try building a training set for the BDH (Baby Dragon Hatchling by Pathway). Since the architecture is quite different from that of a transformer, normal training sets won't work.
My question is: is there guidance out there on training one?
Thanks in advance.
Admirable_Dirt_2371@reddit
That's a cool concept that I'm not familiar with, thanks for sharing! It looks like it's a state space model under the hood. I've been working on my own state space model architecture, and I train mine on pretty normal data, in much the same way as a standard transformer.
They have a basic training example that uses the tiny Shakespeare data set; it should be easy enough to copy that pattern, or just replace the Shakespeare text with your preferred text data. Or do you mean you want to train it on something other than text?
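As a minimal sketch of what "replace the Shakespeare text with your own data" typically looks like for a byte-level setup (this is not BDH's actual loader, just the usual pattern from tiny Shakespeare examples):

```python
import numpy as np

def prepare_byte_dataset(text: str, val_fraction: float = 0.1):
    """Encode raw text as UTF-8 bytes (vocab size 256) and split
    it into contiguous train/validation chunks."""
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    split = int(len(data) * (1 - val_fraction))
    return data[:split], data[split:]

# Swap in any corpus you like; the path/text here is just a placeholder.
corpus = "First Citizen:\nBefore we proceed any further, hear me speak.\n" * 100
train, val = prepare_byte_dataset(corpus)
print(len(train), len(val), int(train.max()) < 256)
```

Sampling fixed-length windows from `train` for batching would work the same way regardless of which text you drop in.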
Clueless_Nooblet@reddit (OP)
Tiny Shakespeare is just a test set; it doesn't really teach language. After going through it, the model couldn't form a single coherent word. I tried it anyway, then continued with BabyLM 10M and 100M, and after that it could form the words it had learned, but the sentences still didn't make sense.
I've been building training sets for it for a while now, and am nowhere near finished, but I thought if there's anyone who looked into this model before, maybe I can learn something useful from it.
Admirable_Dirt_2371@reddit
Ah, gotcha. That's kinda the spot I'm at with my current architecture. I'm using the BabyLM strict-small (10M) data set, and it can get words right but not really longer syntax. I haven't looked at their code extensively yet, but one potential option would be increasing the recurrent depth, both for training and inference, if that's possible.
How many epochs of training are you doing? I'm not sure what your setup is like but are you able to scale to a larger/smaller version of the architecture to confirm that the data is the bottleneck? Are you able to adjust things like the learning rate or the loss function?
For my model, I found that cleaning the BabyLM data further helped boost learning and protect against runaway gradients. For example, I changed all accented characters to their base forms (e.g. à → a) and removed strings of 8+ repeating characters.
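The two cleaning passes described above can be sketched in a few lines; the 8-repeat threshold is just the one mentioned in the comment, not anything canonical:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # NFKD decomposition splits "à" into "a" plus a combining accent;
    # dropping the combining marks leaves the ASCII base character.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove any run of the same character repeated 8 or more times.
    return re.sub(r"(.)\1{7,}", "", folded)

print(clean_text("voilà!!!!!!!!!!"))  # "voila"
```

Note that NFKD folding only handles characters that decompose into a base letter plus marks; letters like "ø" or "ß" would need an explicit mapping table if they matter for your corpus.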
Clueless_Nooblet@reddit (OP)
I scaled it up to 150M now (base is 25M), will try again. I did 30-epoch runs, but BabyLM just seems to lack structure. For a normal transformer, that's not a problem, but BDH is finicky ;)
Admirable_Dirt_2371@reddit
Yeah, the BabyLM data is pretty unstructured, and SSMs in general are sensitive in my experience. I'm still pretty new to all of this, but 30 epochs seems a bit long. Have you tried testing it after 10?
For the inference tests, are you doing anything structured (like a BLiMP evaluation), or just eyeballing what it returns from random prompts?
If possible and not cost/resource prohibitive, I would try to set up two 10-epoch training runs (or 30 epochs with checkpoints every 10 so you can compare): one at the base 25M parameters and one scaled up to 50M. Train them both on the exact same data set, then evaluate them with some kind of structured test. That should hopefully give you a better idea of where the learning bottlenecks are.
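A BLiMP-style structured test boils down to scoring minimal pairs: the model "passes" a pair when it assigns a lower loss to the grammatical sentence. A sketch, where `score_fn` is a hypothetical stand-in for whatever per-sentence negative log-likelihood your checkpoint exposes:

```python
def minimal_pair_accuracy(pairs, score_fn):
    """pairs: list of (grammatical, ungrammatical) sentence pairs.
    score_fn: returns the total NLL of a sentence (lower = preferred).
    Returns the fraction of pairs where the model prefers the
    grammatical sentence."""
    wins = sum(1 for good, bad in pairs if score_fn(good) < score_fn(bad))
    return wins / len(pairs)

# Toy demo with a fake scorer that just prefers shorter strings.
pairs = [
    ("the dog runs", "the dog run fast sky"),
    ("she eats", "she eat eat eat"),
]
print(minimal_pair_accuracy(pairs, score_fn=len))  # 1.0
```

Running the same pair set against each checkpoint (10/20/30 epochs, 25M vs 50M) gives you one comparable number per run instead of eyeballed samples.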
Clueless_Nooblet@reddit (OP)
Yes, I think I'll do exactly that. I watched the loss, and since it came down smoothly, I let it run through the full 30 epochs, but I might have been too transformer-brained.
Eyelbee@reddit
What resources do you have to train this? Or are you just trying to make a training set?
Clueless_Nooblet@reddit (OP)
I've been creating my own set, since the BabyLM 100M set didn't seem to help at all, and the Cosmopedia story subset I used (200k stories) left the model stuck collapsing into "the the the".
Eyelbee@reddit
The paper says it rivals GPT-2 "for the same training data". Did you use enough training tokens? What exactly did you do with BabyLM and Cosmopedia? They recommend the BDH-GPU variant for scarce training data. Also, they used raw UTF-8 bytes as tokens, for a total vocab size of 256.
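Byte-level tokenization, as described above, needs no trained tokenizer at all: tokens are just the raw UTF-8 bytes, so the vocabulary is fixed at 256. A minimal sketch (BDH's actual data pipeline may differ):

```python
def encode(text: str) -> list[int]:
    """Text -> token IDs: each UTF-8 byte is one token (0..255)."""
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    """Token IDs -> text; invalid byte sequences become U+FFFD."""
    return bytes(tokens).decode("utf-8", errors="replace")

ids = encode("héllo")  # "é" takes two bytes, so 6 tokens for 5 chars
print(ids, max(ids) < 256, decode(ids))
```

One practical consequence: multi-byte characters inflate sequence length, so a "token budget" counted in bytes covers noticeably less text than the same budget under a BPE tokenizer.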
Clueless_Nooblet@reddit (OP)
I believe the data I used just wasn't compatible with the architecture. It wasn't a proper curriculum that builds on the Hebbian learning principle: BabyLM is random sentences, and Cosmopedia is very short stories. BabyLM trained normally over 30 epochs, and the losses looked good. In the end, the model could form sentences, but not coherent ones (something like "the dog swim sky hungry").