About to start fine-tuning on RunPod. What should I know to not waste money?
Posted by BriefCardiologist656@reddit | LocalLLaMA | 12 comments
I was MLOps lead at an AI company managing 5000+ GPUs across GCP and CoreWeave. Left to start my own thing and now I'm back to renting GPUs like everyone else. The experience is rough.
Tried GCP first. Their sales team never got back to me about a quota increase.
RunPod seems like the obvious choice. But I've been reading posts here and on r/StableDiffusion and r/comfyui and honestly it's scaring me. Stuff like:
- Pods dying mid-training with no way to recover checkpoints
- Getting charged while pods fail to initialize or throw CUDA errors
- Download speeds so slow you can't even get your trained model off the machine
- Network volumes locked to one datacenter so if GPUs sell out there you're stuck
- Templates that look like they work but break in weird ways
Coming from managing infra at scale where none of this was a problem (automatic checkpointing, job migration on node failure, fast object storage), it feels insane that this is the state of things for individual users.
Not trying to bash RunPod. Genuinely want to know how people make it work day to day without wasting money on failed attempts.
zball_@reddit
Runpod is dogshit.
BriefCardiologist656@reddit (OP)
Any particular issue you faced?
Raise_Fickle@reddit
never had any issues with runpod, so what you read seems like one-off kind of things
BriefCardiologist656@reddit (OP)
Interesting, I will try it and see for myself.
sandshrew69@reddit
As someone who was just a casual RunPod user: I recently left because it was annoying me.
The pros:
network storage just works great
very fast boot to first run
never had any ssh or startup problems personally, it just works
The cons:
You set up nice network storage in your favorite datacenter, you use a CPU or cheap GPU to save costs, then when it comes to actually running your workload, guess what? NO GPUS AVAILABLE... waiting an entire day for a GPU to come online only for it to be taken instantly. Unless it's a B200/B300, which no one wants to pay $5-7 an hour for.
Randomly slow internet speed. Sometimes it works, sometimes computer says no... enjoy your dial-up modem speed for an entire hour.
Bottom line: it works when it works, but it's not reliable.
ForsookComparison@reddit
Lambda is the only one of these providers that gave me zero issues with long-running jobs.
That said it can be harder to get capacity through regular on-demand instances from them (you basically need to make a bot-sniper).
BriefCardiologist656@reddit (OP)
How long are your jobs typically? And when you say zero issues, do you mean the machines just don't die, or also that setup/environment is clean out of the box? The bot-sniper thing for capacity is interesting though. Someone should create a SaaS tool for that lol. We used something similar to find spot instance availability across GCP regions so we could move our inference workloads preemptively.
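The sniper we ran was basically a dumb poll loop. A minimal sketch of the idea, with the provider's capacity API abstracted behind a `fetch_offers` callable since every cloud exposes it differently (the function names here are illustrative, not any provider's real API):

```python
import time

def wait_for_capacity(fetch_offers, gpu_type, poll_s=10, max_polls=None):
    """Poll a provider's capacity listing until gpu_type shows up somewhere.

    fetch_offers() should return a dict of {instance_type: [regions_with_capacity]};
    for Lambda/RunPod you'd back it with their HTTP API. Returns the first
    region with capacity, or None if max_polls is exhausted.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        regions = fetch_offers().get(gpu_type, [])
        if regions:
            return regions[0]  # grab the first region that has stock
        polls += 1
        time.sleep(poll_s)  # don't hammer the API; capacity churns in minutes
    return None
```

The real version should launch the instance the moment this returns, since (as noted above) capacity gets taken instantly.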
ForsookComparison@reddit
Just that I haven't had any issues.
The jobs where I encounter issues on other platforms are mostly 7-10 days. RunPod machines just had weird issues on startup. Some outright didn't work with ssh, some had base image issues where I couldn't reproduce workflows that worked on the same distro on every other cloud, and one time I just had a session die (though to be fair I can't rule out that it was something I did rather than their infra)... all that with longer boot times.
I tried a few other clouds (haven't tried coreweave) that had a slew of issues too... Vast was probably the roughest, for obvious reasons lol
And yeah - I made an instance-sniper for Lambda and still had to wait a good bit lol
BriefCardiologist656@reddit (OP)
The RunPod SSH issues and base image inconsistencies match what I keep hearing from everyone. Weird that the same distro behaves differently there vs other clouds. What are you training that takes 7-10 days? Full fine-tune on a large model? And do you run on a single GPU or multi-node on Lambda? Curious if the reliability holds up on their multi-GPU setups too.
ForsookComparison@reddit
Oh I'm not often doing training lately (though sometimes I will), I'm doing large/private inference testing and benching for several days at a time.
For Lambda it depends on what the bot gets haha. If I get a single node I'll use that but I've stitched my workflow together across multiple nodes more than once to mixed degrees of success.
Conscious_Chapter_93@reddit
The main thing I’d do is treat the pod as disposable from the start. Put checkpoints and logs somewhere outside the pod, make resume-from-checkpoint part of the normal path, and run a 5-10 minute smoke train before the real run: load data, one optimizer step, checkpoint write, checkpoint restore, sample artifact upload.
Also keep a tiny run manifest: base model, dataset hash/version, commit, training args, pod type, image/template id, checkpoint path, and last successful step. When something fails, that manifest is what saves you from guessing whether you lost compute, data, or only the pod.
BriefCardiologist656@reddit (OP)
If I'm using a network volume for checkpoints, that volume is locked to one datacenter. If that DC runs out of the GPU I need, I can't just spin up somewhere else and continue. My checkpoints are trapped there until I download them (which people say can be painfully slow). How do you handle that? Do you push checkpoints to S3 during the training run?
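What I'm picturing is a background uploader so the training loop never blocks on the network. A sketch with the actual upload call injected (in practice `upload_fn` would wrap something like boto3's `upload_file`; the helper name here is mine, not a library API):

```python
import queue
import threading

def start_uploader(upload_fn):
    """Background thread that pushes checkpoint files as they're written,
    so a slow or flaky pipe delays uploads, not training steps.

    upload_fn(path) does the real transfer, e.g. with boto3:
        s3.upload_file(path, bucket, key)
    Returns (q, thread); put file paths on q, put None to shut down.
    """
    q = queue.Queue()

    def worker():
        while True:
            path = q.get()
            if path is None:  # sentinel: drain done, exit
                break
            upload_fn(path)   # if the pod dies, at most the in-flight file is lost
            q.task_done()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t

# in the training loop, after each checkpoint write:
#     q.put(ckpt_path)
# at clean shutdown:
#     q.join(); q.put(None); t.join()
```

With that, the network volume stops being the only copy, and losing the datacenter just means resuming from the last object in the bucket.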