I built a "Doctor" for Proxmox clusters (cv4pve-diag) to stop configuration rot

Posted by Franklupog@reddit | sysadmin

I’m managing a few PVE clusters at work and I’ve reached my limit with "silent" failures.

You know the drill: a VM won't migrate because someone left a random ISO mounted six months ago. Or you check the storage and it's full because of some orphaned disk or a snapshot from 2023. It’s like a slow disease for the cluster.

I got tired of playing detective every time something felt "off," so I wrote cv4pve-diag.

I like to think of it as a doctor performing a check-up on a patient who refuses to admit they're sick. It scans the whole cluster and flags the "health" issues:

Storage rot: Finds orphaned disks and zombie snapshots.

Brain fog: Nodes running different PVE versions because an update was interrupted or forgotten.

Basic hygiene: Expired certs, stopped services, and the classic "forgotten mounted CD-ROM."

Backups: Flags if a job hasn't run lately (this saved my ass last week).
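
To make a couple of those checks concrete, here is a rough Python sketch of the "brain fog" and "forgotten CD-ROM" ideas. This is not how cv4pve-diag does it internally. The function names and sample data are made up for illustration, and the inputs are assumed to be plain dicts you have already pulled from the cluster (node name to PVE version string, and VMID to VM config, where a CD-ROM drive looks like `ide2: <volume>,media=cdrom` and an empty drive uses `none` as the volume).

```python
from collections import defaultdict

# Drive buses that can carry a CD-ROM entry in a VM config (assumption).
CDROM_BUSES = ("ide", "sata", "scsi")


def find_version_drift(node_versions):
    """Group nodes by PVE version; return {} when every node agrees."""
    by_version = defaultdict(list)
    for node, version in node_versions.items():
        by_version[version].append(node)
    return dict(by_version) if len(by_version) > 1 else {}


def find_mounted_isos(vm_configs):
    """Return (vmid, drive, volume) for each CD-ROM drive that still has media."""
    findings = []
    for vmid, config in vm_configs.items():
        for key, value in config.items():
            # Only look at drive keys whose value marks them as a CD-ROM.
            if not key.startswith(CDROM_BUSES) or "media=cdrom" not in str(value):
                continue
            volume = str(value).split(",", 1)[0]
            if volume != "none":
                findings.append((vmid, key, volume))
    return findings


# Hypothetical sample data, shaped like the assumptions described above.
node_versions = {"pve1": "8.2.4", "pve2": "8.2.4", "pve3": "8.1.10"}
vm_configs = {
    100: {"ide2": "local:iso/debian-12.iso,media=cdrom",
          "scsi0": "local-lvm:vm-100-disk-0,size=32G"},
    101: {"ide2": "none,media=cdrom", "scsihw": "virtio-scsi-pci"},
}

print(find_version_drift(node_versions))
print(find_mounted_isos(vm_configs))
```

With the sample data, `pve3` shows up in its own version group and VM 100 is reported for the ISO left in `ide2`, while VM 101 (empty drive) is ignored.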

I also just shoved a CVE scanner into the latest build. It pulls from the Debian Security Tracker and NVD. It’s disabled by default (I don't like tools calling home without permission), but you can toggle it on if you want to see which packages are vulnerable.
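
For anyone wondering what an opt-in check like that boils down to, here is a minimal Python sketch, again not the tool's actual code. It assumes you already have a tracker dump loaded as a nested dict (package, then CVE ID, then per-release status, which is my understanding of the Debian Security Tracker JSON layout) and a list of installed package names. The function name, the `enabled` flag and the CVE IDs are placeholders.

```python
def open_cves(installed, tracker, release="bookworm", enabled=False):
    """Map package -> CVE IDs still marked "open" for the given release.

    Returns {} unless explicitly enabled, mirroring an opt-in scanner.
    """
    if not enabled:
        return {}
    findings = {}
    for package in installed:
        for cve_id, entry in tracker.get(package, {}).items():
            status = entry.get("releases", {}).get(release, {}).get("status")
            if status == "open":
                findings.setdefault(package, []).append(cve_id)
    return findings


# Placeholder data: the CVE IDs below are fake and only show the shape.
tracker = {
    "openssl": {
        "CVE-0000-0001": {"releases": {"bookworm": {"status": "open"}}},
        "CVE-0000-0002": {"releases": {"bookworm": {"status": "resolved"}}},
    },
    "curl": {
        "CVE-0000-0003": {"releases": {"bookworm": {"status": "resolved"}}},
    },
}
installed = ["openssl", "curl", "zfsutils-linux"]

print(open_cves(installed, tracker))                # scanner off by default
print(open_cves(installed, tracker, enabled=True))  # explicit opt-in
```

The sketch leans on the tracker's own status field instead of comparing Debian version strings, which keeps it short but is less precise than a real version comparison.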

Exports to JSON/Excel if you need to prove to your boss that the cluster needs a cleanup.
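
If you want to roll your own boss-friendly report from homegrown checks, something like this Python sketch is enough to get started. The field names (`check`, `target`, `severity`) are hypothetical and are not the tool's export schema.

```python
import json
from collections import Counter


def build_report(findings):
    """Serialize findings to JSON with a per-severity count on top."""
    summary = Counter(item["severity"] for item in findings)
    report = {"summary": dict(summary), "findings": findings}
    return json.dumps(report, indent=2, sort_keys=True)


# Hypothetical findings, e.g. collected from checks like the ones above.
findings = [
    {"check": "mounted-iso", "target": "vm/100", "severity": "warning"},
    {"check": "backup-stale", "target": "vm/101", "severity": "critical"},
]

report_json = build_report(findings)
print(report_json)
```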

Curious to hear how you guys track this stuff. Do you just wait for things to explode or do you have some crazy custom scripts for it?

https://github.com/Corsinvest/cv4pve-diag