I built a "Doctor" for Proxmox clusters (cv4pve-diag) to stop configuration rot
Posted by Franklupog@reddit | sysadmin | 3 comments
I’m managing a few PVE clusters at work and I’ve reached my limit with "silent" failures.
You know the drill: a VM won't migrate because someone left a random ISO mounted six months ago. Or you check the storage and it's full because of some orphaned disk or a snapshot from 2023. It’s like a slow disease for the cluster.
I got tired of playing detective every time something felt "off," so I wrote cv4pve-diag.
I like to think of it as a doctor performing a check-up on a patient that refuses to admit they're sick. It scans the whole cluster and flags the "health" issues:
Storage rot: Finds orphaned disks and zombie snapshots.
Brain fog: Nodes running different PVE versions because an update was interrupted or forgotten.
Basic hygiene: Expired certs, stopped services, and the classic "forgotten mounted CD-ROM."
Backups: Flags if a job hasn't run lately (this saved my ass last week).
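For anyone wondering what checks like these boil down to, here is a rough Python sketch. It is not how cv4pve-diag implements them. The function names, the sample data, and the 24-hour backup threshold are all made up for illustration. The only thing taken from Proxmox itself is the config convention that a CD-ROM drive looks like "ide2: local:iso/foo.iso,media=cdrom" and an empty one looks like "none,media=cdrom".

```python
from collections import defaultdict


def find_version_drift(node_versions):
    """Group nodes by reported version; return {} when every node agrees."""
    by_version = defaultdict(list)
    for node, version in node_versions.items():
        by_version[version].append(node)
    return dict(by_version) if len(by_version) > 1 else {}


def find_mounted_isos(vm_config):
    """Return (drive, volume) pairs for CD-ROM drives that still hold media."""
    mounted = []
    for key, value in vm_config.items():
        # Only string values that declare media=cdrom are CD-ROM drives.
        if not isinstance(value, str) or "media=cdrom" not in value:
            continue
        volume = value.split(",", 1)[0]
        if volume != "none":  # "none" means the drive is empty
            mounted.append((key, volume))
    return mounted


def is_backup_stale(last_run_epoch, now_epoch, max_age_hours=24):
    """True when the last backup is older than the (assumed) threshold."""
    return (now_epoch - last_run_epoch) > max_age_hours * 3600


# Hypothetical sample data, shaped like what you would collect per node / per VM.
node_versions = {"pve1": "8.2.4", "pve2": "8.2.4", "pve3": "8.1.10"}
vm_config = {
    "ide2": "local:iso/debian-12.iso,media=cdrom",
    "sata0": "none,media=cdrom",
    "scsi0": "local-lvm:vm-100-disk-0,size=32G",
}

drift = find_version_drift(node_versions)
forgotten = find_mounted_isos(vm_config)
```

With the sample data, drift groups pve3 apart from the other two nodes and forgotten contains only the ide2 drive. The functions take plain dicts, so they can be fed from whatever you already use to pull data out of the cluster.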
I also just shoved a CVE scanner into the latest build. It pulls from the Debian Security Tracker and NVD. It’s disabled by default (I don't like tools calling home without permission), but you can toggle it on if you want to see which packages are vulnerable.
Exports to JSON/Excel if you need to prove to your boss that the cluster needs a cleanup.
Curious to hear how you guys track this stuff. Do you just wait for things to explode or do you have some crazy custom scripts for it?
Kumorigoe@reddit
Sorry, it seems this comment or thread has violated a sub-reddit rule and has been removed by a moderator.
Do Not Conduct Marketing Operations Within This Community.
Your content may be better suited for our companion sub-reddit: /r/SysAdminBlogs
If you wish to appeal this action please don't hesitate to message the moderation team.
St0nywall@reddit
There's a dedicated post for these.
Question: Is this command line or GUI?
Franklupog@reddit (OP)
Command line, in admin web