Dual RTX 6000, Blackwell and Ada Lovelace, with thermal imagery
Posted by Thalesian@reddit | LocalLLaMA | View on Reddit | 25 comments
This rig is more for training than local inference (though there is a lot of the latter with Qwen), but I thought it might be helpful to see how the new Blackwell card dissipates heat compared to the older blower-style coolers that have been prominent on Quadros since Ampere.
There are two IR color ramps: a standard heat map and a rainbow palette that is better at showing steep thresholds. You can see that the majority of the heat sits at the two inner-facing triangles toward the upper center of the Blackwell card (84 C), with exhaust moving up and outward to the side. Underneath, you can see how effective the lower two fans are at moving heat in the flow-through design, though the intake at the Ada Lovelace card’s fan is a fair bit cooler. The downside of the latter’s design is that the heat ramps up linearly along the card. The rainbow heatmap of the Blackwell shows how superior its engineering is: its surface is comparatively cooler overall despite the card drawing double the wattage.
A note on the setup: all of the system fans are oriented to exhaust inward, pushing air out the open side of the case. It seems like this shouldn’t work, but the Blackwell stays much cooler this way than with the standard arrangement of front fans as intake and back fans as exhaust. By feel, the coolest part of the rig is between the two cards.
CPU is liquid cooled, and completely unaffected by proximity to the Blackwell card.
Accomplished_Mode170@reddit
Is yours the Max-Q (300W) or the Server Edition (600W)? I’ve got the latter on its way from CDW and I’m curious about temps 📊 🙋
84C seems too good to be true for 600W 🤞🌡️
Thalesian@reddit (OP)
It’s the workstation 600W, actively cooled. The Max-Q should be thermally equivalent to the lower Ada Lovelace 6000. The Blackwell is training T5-3B on Sumerian texts in this photo. I managed to fit T5-11B on this with gradient accumulation and a batch size of 64, and it still stays below 92C.
Accomplished_Mode170@reddit
Awesome! TY! You got any workflows/notebooks/advice that’s configuration specific?
Was hoping to train small models EFFECTIVELY @ long context, i.e. Qwen 7B-1M but MORE
Thalesian@reddit (OP)
Long context is going to be tough. Look into Hugging Face's transformers package.
I'd recommend Adafactor with gradient accumulation. Another approach is paged 8-bit AdamW (paged_adamw_8bit) with gradient accumulation and gradient checkpointing. The core compromise there is that you save VRAM by increasing computational cost. Another effective way to do it - though one I've not explored yet - is DeepSpeed ZeRO stage 2 to offload the optimizer state and gradients to your CPU, but you'll need 128-256 GB of system RAM to do that effectively.
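In transformers, that combination looks roughly like the sketch below (illustrative values, not my exact config; the paged optimizer needs bitsandbytes installed):

```python
# Rough sketch of the memory-saving setup above: gradient accumulation,
# gradient checkpointing, and a low-memory optimizer. Values are placeholders.
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")
model.gradient_checkpointing_enable()        # trade extra compute for VRAM

args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,          # effective batch size of 64
    optim="adafactor",                       # or "paged_adamw_8bit"
    bf16=True,                               # mixed-precision training
    # deepspeed="zero2_offload.json",        # ZeRO-2 CPU offload needs lots of system RAM
)
```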
The challenge you will run into with any of these is that the lessons you learn training smaller models don't really apply to the big ones - I've read conflicting reports on bigger vs. smaller batch sizes. Ultimately you'll need to experiment. Another thing you could look into is LoRAs as opposed to training the full weights of the target model.
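For the LoRA route, the peft library handles it - roughly like this (the rank and target modules are just illustrative for T5):

```python
# Hedged LoRA sketch with peft: only small adapter matrices are trained,
# so optimizer state and gradients shrink dramatically.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q", "v"],               # T5's attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()           # a tiny fraction of the full model
```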
Lastly - a theoretical cheat that I don't see used enough is to not just train in mixed precision, but to load in mixed precision (e.g., load the model weights as BF16 on import), though I've not yet found a way to train these well.
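In transformers that's just a one-liner at load time (a sketch; it roughly halves the memory the weights occupy compared to FP32):

```python
# Sketch: load the weights directly in BF16 instead of FP32.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b", torch_dtype=torch.bfloat16)
```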
LA_rent_Aficionado@reddit
Why not run unsloth with multi-GPU processing? You’ll be able to reduce a ton of overhead
Thalesian@reddit (OP)
On the list. I’d prefer to train on a single GPU if possible.
LA_rent_Aficionado@reddit
You can still train with unsloth on a single GPU; the VRAM savings are incredible
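A minimal sketch of what that looks like (the model id, context length, and LoRA settings are placeholders, not a tested config):

```python
# Hedged unsloth sketch: 4-bit base weights + LoRA adapters on one GPU.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B",         # placeholder model id
    max_seq_length=32768,                    # long context is still VRAM-hungry; tune to fit
    load_in_4bit=True,                       # the main source of the VRAM savings
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",    # unsloth's offloaded checkpointing
)
```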
Thalesian@reddit (OP)
I will give it a go next time I have to plan a big run.
Accomplished_Mode170@reddit
Got the RAM and the willingness.
Was initially hoping to use an ABxJudge (read: n pairwise comparisons via K/V w/ multimodal input) to figure out ‘Good Enough Precision’ (e.g. appx 3.5 BPW 😆) based on a reference KV
Then do continued post-training (read: QAT) with configurable ‘total wall time’ based on the use case and newly set precision; the idea being ‘Automated SLA-definition & integration’ 📊
TY again for the encouragement and the specifics; be well 🏡
MengerianMango@reddit
That sounds like an interesting project. What are you trying to do?
Thalesian@reddit (OP)
The University of Chicago had a lot of students write down cuneiform signs for tens of thousands of tablets in Akkadian and Sumerian. I am working on training a model that will, as accurately as possible, translate them all.
MengerianMango@reddit
That is so cool!
I'm thinking it could be really cool to try talking to the model. They say that language shapes our minds in deep ways, like even to the point of enhancing or muting emotions or visual perception based on the presence or lack of precise wording within your native language.
Thalesian@reddit (OP)
I’ve posted them here: https://huggingface.co/Thalesian
D3c1m470r@reddit
Finally
getgoingfast@reddit
Curious, what kind of PSU are you using for this dual GPU rig?
Thalesian@reddit (OP)
This [one](https://www.amazon.com/dp/B08F1DKWX5?ref_=ppx_hzsearch_conn_dt_b_fed_asin_title_1&th=1)
getgoingfast@reddit
Gotcha, 1600W is what I expected and this one does not have 12VHPWR.
Mythril_Zombie@reddit
What kind of power supply are you using?
Thalesian@reddit (OP)
1600W [EVGA](https://www.amazon.com/dp/B08F1DKWX5?ref_=ppx_hzsearch_conn_dt_b_fed_asin_title_1&th=1). Earlier I had 2x 2080 Ti cards alongside the RTX Pro 6000, and even with 1200 watts going to the GPUs, everything ran seamlessly.
swagonflyyyy@reddit
Those are REALLY good temps for those cards.
Thalesian@reddit (OP)
Surface readings are just that - it looks like the temp bounces between 86C and 92C on the Blackwell chip itself, floating between 89C and 90C most of the time. The Ada Lovelace sticks to 84C, but that's half the wattage on a worse cooling system.
abnormal_human@reddit
I have four 6000 Adas packed back to back in a tower case. The default fan curve on them is braindead. I wrote a bit of code to implement a better fan curve (it updates every 5 seconds based on the temp) and can get them running around 65-75C during full-utilization training runs in a kinda warmish room. They also perform a few % better at the lower temps. They are definitely reliable at ~90-95C, and are designed to run there, but I don't love having them sit at that for days/weeks at a time when simply running the fan at 100% brings them down to what feels like a healthier temp.
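The logic is basically just a polling loop like the sketch below - made-up thresholds, not my actual script, and it assumes an X session with Coolbits enabled so nvidia-settings can drive the fans:

```python
#!/usr/bin/env python3
# Minimal fan-curve loop sketch. Adjust GPU/fan indices and thresholds for your rig.
import subprocess, time

# (temperature C threshold, fan duty %) pairs, low to high
CURVE = [(50, 40), (65, 60), (75, 80), (85, 100)]

def gpu_temp(gpu=0):
    # Read the current GPU temperature via nvidia-smi
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

def set_fan(speed, gpu=0, fan=0):
    # Force manual fan control and apply the target duty cycle
    subprocess.run(
        ["nvidia-settings",
         "-a", f"[gpu:{gpu}]/GPUFanControlState=1",
         "-a", f"[fan:{fan}]/GPUTargetFanSpeed={speed}"],
        check=True)

while True:
    t = gpu_temp()
    duty = next((d for limit, d in CURVE if t <= limit), 100)
    set_fan(duty)
    time.sleep(5)  # re-evaluate every 5 seconds, as described above
```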
Thalesian@reddit (OP)
That’s incredible! Do you have a link to the code?
No_Afternoon_4260@reddit
That's how they do it, with very hot metal!