How to log the output of running models and monitor performance?
Posted by shifty21@reddit | LocalLLaMA | 3 comments
Sorry for the slightly off-topic question/request.
I have colleagues who have built some gnarly bare-metal LLM research workstations and servers, and they are having a lot of trouble debugging errors and warnings and monitoring performance.
I realize this is a 2-fold request:
1. Add flags to the Python command to write logs to a flat log file
2. GPU metrics (GPU utilization, RAM usage, Tensor Core usage, etc.) - we are an Nvidia shop
The overall goal is to correlate logs and metrics for troubleshooting and analytics - trying to justify buying bigger and better GPUs.
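To make the two parts of the request concrete, here is a rough sketch of one way it could look, not a definitive setup. It assumes the model is launched from a Python script you can edit, and that `nvidia-smi` is on the PATH. The logger name `llm_run`, the file name `run.log`, and the helper function names are all made up for illustration. Application logs and GPU samples go into the same flat file with the same timestamp format, so they can be lined up later. Tensor Core activity is not covered by this query.

```python
import csv
import io
import logging
import shutil
import subprocess

# Fields requested from `nvidia-smi --query-gpu`.
GPU_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def setup_logger(path="run.log"):
    """Return a logger that appends timestamped records to a flat file."""
    logger = logging.getLogger("llm_run")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%S",
        ))
        logger.addHandler(handler)
    return logger


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(GPU_FIELDS, (v.strip() for v in record))))
    return rows


def sample_gpus():
    """Query nvidia-smi once; return [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(GPU_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


def log_gpu_sample(logger):
    """Write one log line per GPU, next to the application's own logs."""
    for gpu in sample_gpus():
        logger.info(
            "gpu=%s util_pct=%s mem_used_mib=%s mem_total_mib=%s",
            gpu["index"], gpu["utilization.gpu"],
            gpu["memory.used"], gpu["memory.total"],
        )


if __name__ == "__main__":
    log = setup_logger()
    log.info("model run starting")
    log_gpu_sample(log)
```

If the script cannot be edited, plain shell redirection such as `python -u your_script.py > run.log 2>&1` (script name hypothetical) captures stdout and stderr in one file. `-u` turns off output buffering so lines show up as they are printed.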
3 Comments
rbgo404@reddit
shifty21@reddit (OP)
Able-Locksmith-1979@reddit