I spent months inside verl (an RL post-training framework), forked it, then stopped. Wrote up the internals, the tooling a fork costs, and a nasty NCCL bug.
Posted by ReinforcedKnowledge@reddit | LocalLLaMA
I wasn't sure whether to post this here, but a friend told me that a lot of researchers lurk in this subreddit and that it might help them. I think it might also help anyone tinkering with this stuff at home. I don't know how much post-training people do here, but I do see distills, fine-tunes, datasets, and benchmarks getting posted, so I think it might be interesting to you.
For context, I work on post-training for agentic and tool-use capabilities, and a while ago I spent a few months almost literally living inside verl, ByteDance's RL post-training framework. I read most of the source and absorbed almost everything it had to teach. As I worked with it, I started wanting a "better" version, something with a better dev experience for me, so I forked it (non-public, and since abandoned). I shipped a lot of fixes and built tooling around it, but at one point I had to stop. It left a hole in my chest, so I finally wrote the whole thing up, as an au revoir to it but also as a way to inherit from it all the knowledge and skill I picked up along the way.
It's a close read of the parts that actually run an RLHF loop, plus some of the engineering a fork drags in, nothing major though, and one debugging story I'm still a little proud of.
A quick tour of what the blog post is about:
- The orchestration layer's internals: the data structure (DataProto) that every stage (rollout, reward, advantage, update) passes along, and the API gotchas its names don't warn you about. There's also a half-finished migration to a plain TensorDict underneath it.
- The single-controller pattern: one driver process holds the schedule and fans work out to GPU workers through a "magic attribute" dispatch system. That one is nasty. It took me a long time to wrap my head around it, but now that I have, it feels natural, and it lets me work on my own little orchestration package with the utmost confidence and ease.
- Resource pools and colocation: how the actor, critic, rollout, and reference roles get fused into a single Ray actor per GPU.
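To make the single-controller idea a bit more concrete, here is a toy sketch in plain Python. It is not verl's code or API: `register`, `DISPATCH_ATTR`, `Batch`, `Worker`, and `WorkerGroup` are all hypothetical names made up for illustration, and the "workers" are ordinary in-process objects rather than Ray actors on GPUs. It only shows the general shape: a decorator tags a worker method with an attribute, and the driver-side group reads that attribute to decide how to split the batch, fan it out, and gather the results.

```python
from dataclasses import dataclass
from typing import Callable

DISPATCH_ATTR = "_toy_dispatch_mode"  # hypothetical marker name, not verl's


def register(mode: str) -> Callable:
    """Tag a worker method with how the driver should fan it out."""
    def decorator(fn: Callable) -> Callable:
        setattr(fn, DISPATCH_ATTR, mode)
        return fn
    return decorator


@dataclass
class Batch:
    """Toy stand-in for the batch object that every stage passes along."""
    rows: list

    def chunk(self, n: int) -> list["Batch"]:
        size = -(-len(self.rows) // n)  # ceiling division
        return [Batch(self.rows[i * size:(i + 1) * size]) for i in range(n)]

    @staticmethod
    def concat(parts: list["Batch"]) -> "Batch":
        return Batch([row for part in parts for row in part.rows])


class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    @register("data_parallel")
    def compute_reward(self, batch: Batch) -> Batch:
        # Placeholder "reward": the length of each response string.
        return Batch([len(text) for text in batch.rows])

    @register("broadcast")
    def get_rank(self) -> int:
        return self.rank


class WorkerGroup:
    """Driver-side handle: exposes every tagged worker method as its own method."""

    def __init__(self, worker_cls: type, world_size: int):
        self.workers = [worker_cls(rank) for rank in range(world_size)]
        for name, fn in vars(worker_cls).items():
            mode = getattr(fn, DISPATCH_ATTR, None)
            if mode is not None:
                setattr(self, name, self._bind(name, mode))

    def _bind(self, name: str, mode: str) -> Callable:
        def call(*args):
            if mode == "data_parallel":
                (batch,) = args
                shards = batch.chunk(len(self.workers))
                outs = [getattr(w, name)(s) for w, s in zip(self.workers, shards)]
                return Batch.concat(outs)
            # "broadcast": every worker receives the same arguments.
            return [getattr(w, name)(*args) for w in self.workers]
        return call


group = WorkerGroup(Worker, world_size=2)
rewards = group.compute_reward(Batch(["a", "bb", "ccc", "dddd", "eeeee"]))
```

The driver calls `group.compute_reward(...)` as if it were a local method, and the splitting and gathering happen behind the attribute lookup. That is the convenience of this kind of design, and also where the indirection comes from.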
Then I talk a little about the tooling a fork costs, because it's still an issue with verl. I don't think they can fix it, or at least it would be a real hassle, since they have to support so many different architectures and whatnot. The main issue is packaging that leaks: torch isn't in the core deps, one version constraint is copy-pasted three times, requirements.txt and setup.py disagree about what's required, and an unmaintained package is still imported on a live code path with its tests skipped. The other is tests that aren't standardized. I sank a lot of (ultimately wasted) effort into making the test suite squeaky clean, and even built a GPU-aware test scheduler that bin-packs tests onto whatever cards are free instead of letting some GPUs idle.
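For a rough idea of what "bin-packing tests onto free cards" can look like, here is a minimal sketch. It is not the scheduler from the fork: `pack_tests` and the test names are hypothetical, and it assumes each test declares up front how many GPUs it needs. It uses a simple greedy, largest-first strategy that groups tests into waves, where all tests in a wave run concurrently on disjoint GPUs.

```python
def pack_tests(gpu_needs: dict[str, int], free_gpus: list[int]) -> list[dict[str, list[int]]]:
    """Group tests into waves; tests in one wave get disjoint GPU ids."""
    # Largest tests first, ties broken by name so the result is deterministic.
    pending = sorted(gpu_needs.items(), key=lambda kv: (-kv[1], kv[0]))
    too_big = [name for name, need in pending if need > len(free_gpus)]
    if too_big:
        raise ValueError(f"not enough free GPUs for: {too_big}")

    waves = []
    while pending:
        available = list(free_gpus)
        wave, leftover = {}, []
        for name, need in pending:
            if need <= len(available):
                # Hand this test the next `need` free GPU ids.
                wave[name], available = available[:need], available[need:]
            else:
                leftover.append((name, need))
        waves.append(wave)
        pending = leftover
    return waves


waves = pack_tests(
    {"test_fsdp": 3, "test_tp": 2, "test_rollout": 2, "test_reward": 1},
    free_gpus=[0, 1, 2, 3],
)
# Wave 1: test_fsdp on [0, 1, 2] and test_reward on [3]
# Wave 2: test_rollout on [0, 1] and test_tp on [2, 3]
```

In this example the 1-GPU test fills the card the 3-GPU test leaves idle. A real scheduler would presumably start the next test as soon as cards free up rather than waiting for a whole wave, and would launch each test process with `CUDA_VISIBLE_DEVICES` set to its assigned ids.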
And I added a small bonus because I saw the same bug happen to a colleague: an NCCL issue. A multi-GPU test hung with no error, no timeout, no crash. The CPU barrier passed but the first NCCL collective hung. It came down to NCCL choosing a bonded network interface for its bootstrap socket whose IP didn't route back to itself, so rank 0 was listening on an address no peer could reach. The fix is one env var (`NCCL_SOCKET_IFNAME=lo` on a single node). Getting there meant pulling apart the TCPStore, Gloo, and NCCL layers and reading NCCL's own debug output.
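If you want to try the same fix on a single-node test run, one way is to build the environment for the test process explicitly. A minimal sketch, assuming you launch the tests as a subprocess: `single_node_nccl_env` is a hypothetical helper, while `NCCL_SOCKET_IFNAME`, `NCCL_DEBUG`, and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables. Pinning to `lo` only makes sense when all ranks are on one machine.

```python
import os


def single_node_nccl_env(base=None):
    """Return a copy of the environment with NCCL pinned to the loopback interface."""
    env = dict(os.environ if base is None else base)
    # The fix from the post: bootstrap over loopback (single node only).
    env["NCCL_SOCKET_IFNAME"] = "lo"
    # Ask NCCL to log its init and network setup, unless the caller
    # already chose a debug level.
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")
    return env


env = single_node_nccl_env({"PATH": "/usr/bin"})
```

You would then pass the result as `env=` to something like `subprocess.run([...], env=single_node_nccl_env())`. With `NCCL_DEBUG=INFO`, the init log lines should show which interface NCCL picked, which is the kind of output that points at this class of bug.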
But yeah, as you can expect, I stopped because every refactor I cared about had to keep pace with a framework shipping changes almost daily, and the cost of staying in sync outgrew the work itself. Those refactors were mostly in the orchestration layer: if you read the blog you'll see there's so much indirection and magic that it's hard to wrap your head around, and imho it's not a good base to develop on. I'm building my own little hobby orchestration layer now.
Obviously this kind of knowledge can go stale, but I think the orchestration part is worth understanding as fundamentals. Whatever the implementation, the abstractions will be more or less the same unless you change the paradigm (single controller to SPMD, for example). And if you're interested in contributing to verl, you'll get some nice ideas from the blog post, though I'm not sure they'll accept contributions to those areas.
Anyways, sorry about the yapping haha, here is the full writeup: https://reinforcedknowledge.com/posts/verl-retrospective/