HPC-Style Job Scripts in the Cloud
Posted by mrocklin@reddit | Python | 8 comments
The first parallel computing systems I ever used were job scripts on HPC job schedulers (like SLURM, PBS, SGE, ...). They had an API straight out of the 90s, but were super straightforward and helped me do research when I was still just a baby programmer.
The cloud is way more powerful than these systems, but kinda sucks from a UX perspective. I wanted to replicate my HPC experience in the cloud with cloud-based job arrays. It wasn't actually all that hard.
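For anyone who hasn't used job arrays: the scheduler runs the same script many times and tells each copy which task it is, usually through an environment variable. Here is a minimal sketch of that pattern, assuming SLURM's `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` variables and task IDs that run from 0 to N-1. The helper names and file names are made up for illustration, and a cloud tool may expose the task index under a different variable name.

```python
import os


def array_task_id(environ=None, names=("SLURM_ARRAY_TASK_ID",), default=0):
    """Return this task's index from the first environment variable that is set.

    `names` is a tuple so other schedulers' variable names can be added;
    only the SLURM name is included here.
    """
    environ = os.environ if environ is None else environ
    for name in names:
        value = environ.get(name)
        if value is not None:
            return int(value)
    return default


def inputs_for_task(all_inputs, task_id, n_tasks):
    """Give each task every n_tasks-th item, starting at its own index.

    Assumes task IDs are 0, 1, ..., n_tasks - 1.
    """
    return all_inputs[task_id::n_tasks]


if __name__ == "__main__":
    # Hypothetical input files; replace with your real work items.
    files = [f"data_{i:03d}.csv" for i in range(10)]
    task_id = array_task_id()
    n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    for path in inputs_for_task(files, task_id, n_tasks):
        print(f"task {task_id} would process {path}")
```

Run outside a scheduler, neither variable is set, so the script behaves as a single task that handles every file, which makes it easy to test locally before submitting.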
This is still super new (we haven't even put up proper docs yet) but I'm excited about the feature. Thoughts/questions/critiques welcome.
deathtopenguin5@reddit
If you want general purpose execution you can construct an HPC system in cloud VMs using HTCondor. Obviously you lose lots of nice features for working directly with dataframes in the cloud (which is probably a dealbreaker for most scenarios) but you do gain a lot of flexibility. You can even use some additional projects to automatically scale the number of worker VM instances up and down according to the waiting jobs in the job queue.
mrocklin@reddit (OP)
Yeah, I think what I like about this approach is that most of the users I interact with wouldn't know how to set up HTCondor very easily. This is designed to be a simple end-user tool.
Bach4Ants@reddit
Intriguing. How hard would it be to run an application that uses MPI on one of these clusters?
mrocklin@reddit (OP)
Good question. Honest answer is that I don't know. Deploying MPI is a lot more involved than just running a script a bunch of times. I looked into this several years ago and didn't get to an easy solution. I suspect that others here might know more.
brandonZappy@reddit
I just watched a presentation you gave at SC on Dask. Really enjoyed it!
mrocklin@reddit (OP)
Thanks!
collectablecat@reddit
Can this work with uv's single-file scripts with inline deps?
https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies
mrocklin@reddit (OP)
Sure, you'd just run `coiled batch run uv run ...`. Anything you can do on your computer you can do on remote computers too 🙂
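To make that concrete, here is a rough sketch of what such a script and command could look like. The script text uses the inline metadata format from the uv guide linked above; the file name `job.py`, the `numpy` dependency, and the `batch_command` helper are assumptions for illustration. Whether the script file gets shipped to the remote machine automatically in this form is worth checking once the docs are up.

```python
import shlex

# Contents of a hypothetical single-file script, job.py. The comment block
# at the top is the inline metadata that `uv run` reads to install the
# script's dependencies before running it.
SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
# ]
# ///
import numpy as np

print(np.arange(10).sum())
'''


def batch_command(script_path):
    """Build the command from the reply above: `coiled batch run uv run ...`."""
    return ["coiled", "batch", "run", "uv", "run", str(script_path)]


if __name__ == "__main__":
    # Print the shell command rather than executing it.
    print(shlex.join(batch_command("job.py")))
```

Locally the same script runs with `uv run job.py`, so it can be tested on a laptop first.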