Baby Dragon Hatchling Training?
Posted by Clueless_Nooblet@reddit | LocalLLaMA | 10 comments
Hello, I'd like to try building a training set for the BDH (Baby Dragon Hatchling by Pathway). Since the architecture is quite different from that of a transformer, normal training sets won't work.
My question is: is there guidance out there on training one?
Thanks in advance.
Admirable_Dirt_2371@reddit
That's a cool concept that I'm not familiar with, thanks for sharing! It looks like it's a state space model under the hood. I've been working on my own state space model architecture, and I train mine on pretty normal data, in much the same way as a standard transformer.
They have a basic training example that uses the tiny Shakespeare data set; it should be easy enough to copy that pattern, or just replace the Shakespeare text with your preferred text data. Or do you mean you want to train it on something other than text?
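As a minimal sketch of what "replace the Shakespeare text with your own data" typically looks like for a byte-level setup (this is not BDH's actual loader, just the usual pattern from tiny Shakespeare examples):

```python
import numpy as np

def prepare_byte_dataset(text: str, val_fraction: float = 0.1):
    """Encode raw text as UTF-8 bytes (vocab size 256) and split
    it into contiguous train/validation chunks."""
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    split = int(len(data) * (1 - val_fraction))
    return data[:split], data[split:]

# Swap in any corpus you like; the path/text here is just a placeholder.
corpus = "First Citizen:\nBefore we proceed any further, hear me speak.\n" * 100
train, val = prepare_byte_dataset(corpus)
print(len(train), len(val), int(train.max()) < 256)
```

Sampling fixed-length windows from `train` for batching would work the same way regardless of which text you drop in.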
Clueless_Nooblet@reddit (OP)
Tiny Shakespeare is just a test set; it doesn't really teach language. After going through it, the model couldn't form a single coherent word. I tried it anyway, then continued with BabyLM 10M and 100M, and after that it could form the words it had learned, but the sentences still didn't make sense.
I've been building training sets for it for a while now, and am nowhere near finished, but I thought if there's anyone who looked into this model before, maybe I can learn something useful from it.
Admirable_Dirt_2371@reddit
Ah, gotcha. That's kinda the spot I'm at with my current architecture. I'm using the BabyLM strict-small (10M) data set, and it can get words right but not really longer syntax. I haven't looked at their code extensively yet, but one potential option would be increasing the recurrent depth, both for training and inference, if that's possible.
How many epochs of training are you doing? I'm not sure what your setup is like but are you able to scale to a larger/smaller version of the architecture to confirm that the data is the bottleneck? Are you able to adjust things like the learning rate or the loss function?
For my model, I found that cleaning the BabyLM data further helped boost learning and protect against runaway gradients. For example, I changed all accented characters to their base forms (e.g. à → a) and removed strings of 8+ repeating characters.
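The two cleaning passes described above can be sketched in a few lines; the 8-repeat threshold is just the one mentioned in the comment, not anything canonical:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # NFKD decomposition splits "à" into "a" plus a combining accent;
    # dropping the combining marks leaves the ASCII base character.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove any run of the same character repeated 8 or more times.
    return re.sub(r"(.)\1{7,}", "", folded)

print(clean_text("voilà!!!!!!!!!!"))  # "voila"
```

Note that NFKD folding only handles characters that decompose into a base letter plus marks; letters like "ø" or "ß" would need an explicit mapping table if they matter for your corpus.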
Clueless_Nooblet@reddit (OP)
I scaled it up to 150M now (base is 25M), will try again. I did 30-epoch runs, but BabyLM just seems to lack structure. For a normal transformer, that's not a problem, but BDH is finicky ;)
Admirable_Dirt_2371@reddit
Yeah, the BabyLM data is pretty unstructured, and SSMs in general are sensitive in my experience. I'm still pretty new to all of this, but 30 epochs seems a bit long. Have you tried testing it after 10?
For the inference tests, are you doing anything structured (like a BLiMP evaluation), or just eyeballing what it returns from random prompts?
If possible and not cost/resource prohibitive, I would try to set up two 10-epoch training runs (or 30 epochs with checkpoints every 10 so you can compare): one at the base 25M parameters and one scaled up to 50M. Train them both on the exact same data set, then evaluate them with some kind of structured test. That should hopefully give you a better idea of where the learning bottlenecks are.
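A BLiMP-style structured test boils down to scoring minimal pairs: the model "passes" a pair when it assigns a lower loss to the grammatical sentence. A sketch, where `score_fn` is a hypothetical stand-in for whatever per-sentence negative log-likelihood your checkpoint exposes:

```python
def minimal_pair_accuracy(pairs, score_fn):
    """pairs: list of (grammatical, ungrammatical) sentence pairs.
    score_fn: returns the total NLL of a sentence (lower = preferred).
    Returns the fraction of pairs where the model prefers the
    grammatical sentence."""
    wins = sum(1 for good, bad in pairs if score_fn(good) < score_fn(bad))
    return wins / len(pairs)

# Toy demo with a fake scorer that just prefers shorter strings.
pairs = [
    ("the dog runs", "the dog run fast sky"),
    ("she eats", "she eat eat eat"),
]
print(minimal_pair_accuracy(pairs, score_fn=len))  # 1.0
```

Running the same pair set against each checkpoint (10/20/30 epochs, 25M vs 50M) gives you one comparable number per run instead of eyeballed samples.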
Clueless_Nooblet@reddit (OP)
Yes, I think I'll do exactly that. I watched the loss, and since it came down smoothly, I let it run through the full 30 epochs, but I might have been too transformer-brained.
Eyelbee@reddit
What resources do you have to train this? Or are you just trying to make a training set?
Clueless_Nooblet@reddit (OP)
I've been creating my own set, since the BabyLM 100M set didn't seem to help at all, and the Cosmopedia story subset I used (200k stories) left the model stuck collapsing into "the the the".
Eyelbee@reddit
The paper says it rivals GPT-2 "for the same training data". Did you use enough training tokens? What exactly did you do with BabyLM and Cosmopedia? They recommend the BDH-GPU variant for scarce training data. Also, they used raw UTF-8 bytes as tokens, for a total vocab size of 256.
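Byte-level tokenization, as described above, needs no trained tokenizer at all: tokens are just the raw UTF-8 bytes, so the vocabulary is fixed at 256. A minimal sketch (BDH's actual data pipeline may differ):

```python
def encode(text: str) -> list[int]:
    """Text -> token IDs: each UTF-8 byte is one token (0..255)."""
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    """Token IDs -> text; invalid byte sequences become U+FFFD."""
    return bytes(tokens).decode("utf-8", errors="replace")

ids = encode("héllo")  # "é" takes two bytes, so 6 tokens for 5 chars
print(ids, max(ids) < 256, decode(ids))
```

One practical consequence: multi-byte characters inflate sequence length, so a "token budget" counted in bytes covers noticeably less text than the same budget under a BPE tokenizer.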
Clueless_Nooblet@reddit (OP)
I believe the data I used just wasn't compatible with the architecture. It wasn't a proper curriculum that builds on the Hebbian learning principle: BabyLM is random sentences, and Cosmopedia is very short stories. BabyLM trained normally over 30 epochs, and the losses looked good. In the end, the model could form sentences, but not coherent ones (something like "the dog swim sky hungry").