Multi-GPU owners here? Cooling question + small experiment
Posted by aospan@reddit | LocalLLaMA | 13 comments
Hey folks, curious how people here test and monitor cooling on multi-GPU rigs.
Especially when cards are stacked close together, do you mostly rely on GPU temp graphs, fan curves, external sensors, or thermal cameras? Or has anyone gone completely overboard and modeled airflow with CFD? :)
Part of why I’m asking: we recently shipped a monitoring feature in Reefy.ai and added a Bench app that runs GPU stress tests using the open-source gpu-fryer project from Hugging Face.
If anyone has a multi-GPU rig and wants to try it: boot Reefy from a USB dongle, install Bench from the app catalog, run the GPU stress test, and share a screenshot of GPU utilization and temps. Monitoring works out of the box, no Grafana or agents to wire up :)
Curious to see how this works across different setups. I'd really appreciate it if anyone can try it and share a screenshot 🙏
Zealousideal-Lie8829@reddit
Tune fan curves to respond earlier and more aggressively to rising temps.
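To make "earlier and more aggressively" concrete, here is a rough sketch of a piecewise-linear fan curve and the same curve shifted 10°C lower. The breakpoints in `STOCK` and the helper names `fan_duty` and `shift_earlier` are made up for illustration and are not taken from any real card or tool.

```python
# Hypothetical curve points: (GPU temp in °C, fan duty in %).
# These numbers are illustrative only, not from any vendor default.
STOCK = [(40, 30), (60, 50), (75, 80), (85, 100)]


def fan_duty(temp_c, curve):
    """Linearly interpolate fan duty (%) for a temperature on a sorted curve."""
    if temp_c <= curve[0][0]:
        return float(curve[0][1])   # below the first point: hold minimum duty
    if temp_c >= curve[-1][0]:
        return float(curve[-1][1])  # above the last point: hold maximum duty
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


def shift_earlier(curve, degrees):
    """Move every breakpoint to a lower temperature so the fans ramp sooner."""
    return [(t - degrees, d) for t, d in curve]


# Same shape as STOCK, but every step kicks in 10°C earlier.
AGGRESSIVE = shift_earlier(STOCK, 10)

print(fan_duty(65, STOCK))       # duty at 65°C on the stock curve
print(fan_duty(65, AGGRESSIVE))  # duty at 65°C on the shifted curve
```

With these made-up points, the shifted curve already runs the fans noticeably harder at 65°C than the stock one does. How you actually apply a curve depends on your OS and fan control tool.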
Ill_Recipe7620@reddit
Put it in a server.
aospan@reddit (OP)
Wow, 4× NVIDIA RTX PRO 6000s, now that’s a flex! 😄
Temps look impressively uniform too, 53 to 57°C across the boards. Curious, how are you cooling them?
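If anyone wants to check uniformity without eyeballing a screenshot, a minimal sketch along these lines might help. It assumes the text comes from `nvidia-smi --query-gpu=index,temperature.gpu,power.draw,power.limit --format=csv,noheader,nounits`. The `SAMPLE` rows are invented for illustration (only the 53 and 57°C endpoints echo the numbers above), and `parse_gpu_csv`, `temp_spread`, and `hottest` are hypothetical helper names.

```python
import csv
import io

# Column order matches the --query-gpu list named in the lead-in.
FIELDS = ["index", "temp_c", "power_w", "limit_w"]

# Invented sample output: NOT real readings from anyone's rig.
SAMPLE = """0, 53, 412.50, 600.00
1, 55, 420.10, 600.00
2, 57, 431.75, 600.00
3, 54, 415.00, 600.00"""


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV rows into a list of dicts with numeric values.

    Assumes every field is numeric; a card that reports "[N/A]" for a
    field would need extra handling.
    """
    rows = []
    for raw in csv.reader(io.StringIO(text.strip())):
        values = [v.strip() for v in raw]
        if len(values) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, values))
        rows.append({
            "index": int(row["index"]),
            "temp_c": float(row["temp_c"]),
            "power_w": float(row["power_w"]),
            "limit_w": float(row["limit_w"]),
        })
    return rows


def temp_spread(rows):
    """Difference between the hottest and coolest GPU, in °C."""
    temps = [r["temp_c"] for r in rows]
    return max(temps) - min(temps)


def hottest(rows):
    """Index of the GPU with the highest temperature."""
    return max(rows, key=lambda r: r["temp_c"])["index"]


rows = parse_gpu_csv(SAMPLE)
print(temp_spread(rows), hottest(rows))
```

On stacked cards, a growing spread under sustained load usually points at the card with the worst airflow, which is easier to spot as one number than across several graphs.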
Ill_Recipe7620@reddit
MrCool
Fabulous_Fact_606@reddit
Nice. People pay hundreds for some fancy CPU fans. Buy a Costway 12000 BTU mini split for $450. (You have to buy or rent a micron gauge and vacuum pump, though.) Heat problem solved. Mine sits next to the exhaust of my Rheem heat pump water heater. Same concept.
Fabulous_Fact_606@reddit
You are not utilizing your GPUs at 100%. Temps too cold. Give me SSH access so I can install an LLM API WireGuard tunnel to a VPS so we can all use it. Share the wealth, bro.
Ok-Ask1962@reddit
GPU temps always lie to you anyway. Fan curves are the only thing I trust.
Fabulous_Fact_606@reddit
A 3090 starts to throttle its power draw once the GPU temp goes above 80C.
stoppableDissolution@reddit
Uh, is there a reason you don't have them power-limited? I've got mine set to ~270 W with negligible performance loss.
aospan@reddit (OP)
Yeah, the funny part is that after some point extra watts mostly stop turning into useful performance and start turning into heat. In this test, the 80% power limit looked like the sweet spot, only ~2.3% slower overall:
https://www.tomshardware.com/news/improving-nvidia-rtx-4090-efficiency-through-power-limiting
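Back-of-the-envelope version of that trade-off, as a small sketch. It assumes the card draws right at its power limit in both runs, which real workloads may not do, and `perf_per_watt_gain` is just a hypothetical helper name.

```python
def perf_per_watt_gain(power_fraction, slowdown_fraction):
    """Relative perf-per-watt versus stock, assuming draw equals the limit.

    power_fraction: power limit as a fraction of stock (0.80 for 80%).
    slowdown_fraction: performance lost versus stock (0.023 for 2.3%).
    """
    relative_perf = 1.0 - slowdown_fraction
    return relative_perf / power_fraction


# Numbers quoted above: 80% power limit, about 2.3% slower.
gain = perf_per_watt_gain(0.80, 0.023)
print(f"{(gain - 1) * 100:.1f}% more work per watt")  # 0.977 / 0.80 = 1.22125
```

So under that assumption you give up about 2.3% speed for roughly 22% better efficiency, and about 20% less heat to get out of the case. The limit itself is normally set with `nvidia-smi -pl <watts>`, which needs admin rights.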
Fabulous_Fact_606@reddit
Different card models: an EVGA 3090 Ti at 420 W and an MSI 3090 blower at 350 W. I could limit the power on both of them, but I want to squeeze every t/s out of them. The MSI 3090 was hitting >80C when I got it from eBay. I tightened the screws on the backplate and now it is stable at <80C.
aospan@reddit (OP)
Just noticed your two 3090s show different power limits: 350W and 420W. Curious, are they different card models/VBIOS, or did you set different power limits manually?
Khipu28@reddit
We use blowers