How do I log the output of running models and monitor performance?

Posted by shifty21@reddit | LocalLLaMA | 3 comments

Sorry for the slightly off-topic question/request. I have colleagues who have built some gnarly bare-metal LLM research workstations and servers, and they are having a lot of trouble debugging errors and warnings and monitoring performance.

I realize this is a two-fold request:

1. Add flags to the Python command so that logs are written to a flat log file.
2. Collect GPU metrics (GPU utilization, RAM usage, Tensor Core usage, etc.). We are an Nvidia shop.

The overall goal is to correlate logs and metrics for troubleshooting and analytics, since we are trying to justify buying bigger and better GPUs.
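For anyone landing here with the same question, one possible starting point is sketched below. It is only a sketch under some assumptions: it uses Python's standard `logging` module to write timestamped lines to a flat file, and it shells out to `nvidia-smi --query-gpu` for utilization and memory numbers so that both end up in the same file and can be lined up by timestamp. The logger name `llm_run`, the file name `run.log`, and the helper names are made up for illustration. Tensor Core activity is not one of the fields queried here; as far as I know that needs a profiling-level tool such as Nvidia DCGM.

```python
import logging
import shutil
import subprocess

# Fields requested from `nvidia-smi --query-gpu`. Tensor Core activity is not
# included; this sketch only covers utilization and memory.
GPU_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def make_logger(path):
    """Return a logger that appends timestamped lines to a flat file."""
    logger = logging.getLogger("llm_run")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def parse_gpu_line(line):
    """Turn one CSV row from nvidia-smi into a dict keyed by field name."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(GPU_FIELDS, values))


def sample_gpus():
    """Query nvidia-smi once; return one dict per GPU, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(GPU_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=False, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]


def log_gpu_metrics(logger, samples):
    """Write one key=value log line per GPU sample (memory values are in MiB)."""
    for sample in samples:
        logger.info(
            "gpu=%s util_pct=%s mem_used_mib=%s mem_total_mib=%s",
            sample["index"],
            sample["utilization.gpu"],
            sample["memory.used"],
            sample["memory.total"],
        )


# Example use inside a training or inference script:
#   logger = make_logger("run.log")
#   logger.info("model load started")
#   log_gpu_metrics(logger, sample_gpus())
```

If changing the script is not an option, plain shell redirection (for example `python -u your_script.py > run.log 2>&1`, where `your_script.py` is a placeholder) also captures stdout and stderr in a flat file; `-u` turns off output buffering so lines show up as they are printed. The key=value format above is just one choice that tends to be easy to parse later when correlating logs with metrics.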