How to log output of running models and performance monitoring?

Posted by shifty21@reddit | LocalLLaMA | 3 comments

Sorry for the slightly off-topic question/request. I have colleagues who have built some gnarly bare-metal LLM research workstations and servers, and they are having a lot of trouble debugging errors and warnings and monitoring performance. I realize this is a two-fold request:

1. Add flags to the Python command to write its output to a flat log file.
2. Collect GPU metrics (GPU utilization, RAM usage, Tensor Core usage, etc.). We are an Nvidia shop.

The overall goal is to correlate logs and metrics for troubleshooting and analytics, since we are trying to justify buying bigger and better GPUs.
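Both halves of the request can be sketched with the Python standard library alone. The block below is a minimal sketch, not a definitive setup: `run_and_log` wraps any command and writes its stdout and stderr to a flat file with a UTC timestamp on every line, and `sample_gpus` polls `nvidia-smi` once. The function names and the `GPU_FIELDS` selection are illustrative assumptions. The `--query-gpu` fields shown cover GPU utilization and memory only; Tensor Core usage is not among them.

```python
import csv
import io
import subprocess
from datetime import datetime, timezone

# Fields passed to nvidia-smi's --query-gpu option (an illustrative selection).
GPU_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def utc_now():
    """Current UTC time as an ISO 8601 string, shared by logs and metrics."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


def run_and_log(cmd, log_path):
    """Run cmd, append each output line to log_path with a timestamp.

    stderr is merged into stdout so warnings and errors stay in order.
    Returns the exit code of the command.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipe
        )
        for line in proc.stdout:
            log.write(f"{utc_now()} {line.rstrip()}\n")
            log.flush()  # keep the file current if the run crashes
        return proc.wait()


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output from nvidia-smi into dicts."""
    rows = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if row:
            rows.append(dict(zip(GPU_FIELDS, (v.strip() for v in row))))
    return rows


def sample_gpus():
    """Query nvidia-smi once and return one timestamped dict per GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(GPU_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    stamp = utc_now()
    return [{"ts": stamp, **row} for row in parse_gpu_csv(out)]
```

A call such as `run_and_log(["python", "-u", "run_model.py"], "model.log")` would capture a run, where `run_model.py` is a hypothetical script name and `-u` asks Python not to buffer its output. Using the same timestamp format for logs and GPU samples is what makes later correlation straightforward.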

3 Comments


rbgo404@reddit

I would suggest checking out Datadog for tracking and monitoring. Grafana is also pretty useful and widely adopted for ML monitoring.

shifty21@reddit (OP)

I can get practically everything I need for perfmon with collectd and send it via HEC. Nvidia's native monitoring service seems to be lacking Tensor Core and SM utilization metrics, but that's not the end of the world. As for logging, Datadog can log the perfmon data historically, but it can't correlate it with logs from running the models. I'm still struggling with how to log the output of running models. Once I figure that out, I can start correlating model output with perfmon.
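Once model output and perfmon samples both carry timestamps, the correlation step itself is small. The following is a rough sketch under the assumption that both streams are available as lists of `(ISO 8601 timestamp, payload)` tuples; the `correlate` function and that input shape are hypothetical, not part of collectd, HEC, or Datadog. Each log line is paired with the most recent metric sample taken at or before it.

```python
from bisect import bisect_right
from datetime import datetime


def correlate(log_events, metric_samples):
    """Pair each log event with the latest metric sample at or before it.

    Both arguments are lists of (iso_timestamp, payload) tuples.
    Returns a list of (iso_timestamp, message, metric_or_None) tuples.
    """
    # Sort samples by time only, so payloads never need to be comparable.
    samples = sorted(
        ((datetime.fromisoformat(ts), payload) for ts, payload in metric_samples),
        key=lambda sample: sample[0],
    )
    times = [t for t, _ in samples]

    joined = []
    for ts, message in log_events:
        # Index of the first sample strictly after this log line.
        i = bisect_right(times, datetime.fromisoformat(ts))
        metric = samples[i - 1][1] if i else None
        joined.append((ts, message, metric))
    return joined
```

A log line that precedes every sample gets `None`, which makes gaps in metric collection visible instead of silently borrowing a later reading.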

Able-Locksmith-1979@reddit

For 1, I would not do anything in Python; all the LLMs I see are just APIs. Just put a proxy between your program and the LLM API, and the proxy can log everything.
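The proxy idea might look roughly like this with only the standard library. It is a sketch under several assumptions: the model server listens at the hypothetical `UPSTREAM` address, requests are plain `POST` calls with non-streaming responses, and no auth headers need forwarding. `LOG_PATH`, `log_exchange`, and `LoggingProxy` are illustrative names. An unreachable upstream is not handled here.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://127.0.0.1:8080"  # assumption: where the LLM API listens
LOG_PATH = "llm_proxy.jsonl"        # assumption: flat file, one JSON per line


def log_exchange(path, request_body, status, response_body, log_path=LOG_PATH):
    """Append one timestamped request/response record to the log file."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "path": path,
        "status": status,
        "request": request_body.decode("utf-8", "replace"),
        "response": response_body.decode("utf-8", "replace"),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the client's request body in full.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)

        # Forward it unchanged to the upstream API.
        upstream_request = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={
                "Content-Type": self.headers.get("Content-Type", "application/json")
            },
            method="POST",
        )
        try:
            with urllib.request.urlopen(upstream_request) as resp:
                status, payload = resp.status, resp.read()
                content_type = resp.headers.get("Content-Type", "application/json")
        except urllib.error.HTTPError as err:
            # Upstream error responses are logged and relayed as well.
            status, payload = err.code, err.read()
            content_type = err.headers.get("Content-Type", "application/json")

        log_exchange(self.path, body, status, payload)

        # Relay the upstream answer back to the client.
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=9000):
    """Listen on localhost and proxy every POST to UPSTREAM."""
    ThreadingHTTPServer(("127.0.0.1", port), LoggingProxy).serve_forever()
```

Pointing the client at `http://127.0.0.1:9000` instead of the model server after calling `serve()` would route traffic through the proxy. Because each record is timestamped, the resulting file can be lined up against GPU metrics without touching the model code. Streaming responses would be buffered whole by this sketch, so a production setup would more likely use an existing reverse proxy with request logging.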
