About to start fine-tuning on RunPod. What should I know to not waste money?

Posted by BriefCardiologist656@reddit | LocalLLaMA | View on Reddit | 12 comments

I was MLOps lead at an AI company managing 5000+ GPUs across GCP and CoreWeave. Left to start my own thing and now I'm back to renting GPUs like everyone else. The experience is rough.

Tried GCP first. Their sales team never got back to me about a quota increase.

RunPod seems like the obvious choice. But I've been reading posts here and on r/StableDiffusion and r/comfyui and honestly it's scaring me. Stuff like:

- Pods dying mid-training with no way to recover checkpoints
- Getting charged while pods fail to initialize or throw CUDA errors
- Download speeds so slow you can't even get your trained model off the machine
- Network volumes locked to one datacenter so if GPUs sell out there you're stuck
- Templates that look like they work but break in weird ways

Coming from managing infra at scale where none of this was a problem (automatic checkpointing, job migration on node failure, fast object storage), it feels insane that this is the state of things for individual users.
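For anyone hitting the "pod died mid-training, checkpoint gone" failure mode: the cheapest defense is checkpointing that can't be corrupted by a mid-write kill. A minimal sketch of what I'd wrap around the training loop (paths and the JSON state format are placeholders; a real run would save model weights and sync the directory to object storage afterwards):

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, ckpt_dir: str, step: int, keep: int = 3) -> Path:
    """Atomically write a checkpoint file and prune older ones.

    Writing to a temp file, fsyncing, then renaming means a pod killed
    mid-write leaves the previous checkpoint intact instead of a
    half-written, corrupt file. Keeping the last `keep` checkpoints
    bounds disk usage on small pod volumes.
    """
    out_dir = Path(ckpt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    final = out_dir / f"ckpt-{step:08d}.json"  # zero-padded so name sort == step sort

    fd, tmp = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before the rename
    os.replace(tmp, final)  # atomic on POSIX filesystems

    # Prune everything but the newest `keep` checkpoints.
    for old in sorted(out_dir.glob("ckpt-*.json"))[:-keep]:
        old.unlink()
    return final
```

Then sync `ckpt_dir` to an S3-compatible bucket (rclone, aws cli, whatever) on a timer or after each save, so the checkpoint survives the pod itself dying, not just the process.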

Not trying to bash RunPod. Genuinely want to know how people make it work day to day without wasting money on failed attempts.