Stop wasting electricity
Posted by OkFly3388@reddit | LocalLLaMA | View on Reddit | 190 comments
Run on my RTX 4090
llama.cpp params:
llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144
Power limit was set using sudo nvidia-smi -pl N
From my observation, the GPU is constantly hitting the power limit, so it's safe to say that's the actual consumption. You can cut power consumption by 40% without losing performance (and also reduce noise and heat from the PC, and extend the lifespan of the GPU).
BobbyL2k@reddit
How did you measure Energy used for the second graph? It seems off.
OkFly3388@reddit (OP)
power consumption * time to generate = total energy spent
dankfrankreynolds@reddit
This is definitely not correct. I've been running a multi-day job and took your advice and set my limit to 275. My performance dropped 10%. Without the limit, my 4090 is only pulling around 320W for the job.
You're assuming the thing is running at the limit and it is not. Or if it is, then that's for your specific workload, not a general truth.
OkFly3388@reddit (OP)
Oh yeah, if only I had put in the post the exact setup that gives me these results. Wait a minute, I did exactly that.
dankfrankreynolds@reddit
Why did you use "Power Limit" instead of "Power Draw" if you know they are identical?... It sure sounds like your GPU is at 100% and therefore you're saying the power limit is the power draw. I'm at 100% GPU and using 320W.
I'm not posting any of this to argue with you, but to let other readers like myself know that it's not what it seems. I don't really care to argue with you or cause any negative vibes.
OkFly3388@reddit (OP)
>It sure sounds like your GPU is at 100% and therefore you're saying the power limit is the power draw.
>From my observation, the GPU is constantly hitting the power limit, so it's safe to say that's the actual consumption.
Reread the post, all the info is there.
BobbyL2k@reddit
You **cannot** assume power limit = power consumption.
In this post, you will see that the GPU isn't reaching the power limit, as each stage of inference doesn't demand an equal amount of power.
If you want to measure energy, you need to use DCGM. Your graph isn't showing energy usage. It's showing power limit * time, which doesn't mean anything.
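If you don't want to set up DCGM, even integrating polled nvidia-smi readings gets you much closer than assuming the limit. A minimal sketch (100 ms polling; start it, run your generation, then Ctrl-C):

```python
import subprocess, time

# Stream actual power draw (watts) every ~100 ms and integrate over time.
proc = subprocess.Popen(
    ["nvidia-smi", "--query-gpu=power.draw",
     "--format=csv,noheader,nounits", "-lms", "100"],
    stdout=subprocess.PIPE, text=True)

joules, last = 0.0, time.monotonic()
try:
    for line in proc.stdout:
        now = time.monotonic()
        joules += float(line) * (now - last)  # W x s = J
        last = now
except KeyboardInterrupt:
    pass
finally:
    proc.terminate()
    print(f"Energy: {joules:.0f} J = {joules / 3.6e6:.4f} kWh")
```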
OkFly3388@reddit (OP)
This is technically correct, but I think such errors are within the tolerance range and don't really change the overall picture.
BobbyL2k@reddit
In the linked post, the redditor concluded that a 300W power limit was optimal for the Pro 6000, whereas they originally concluded it was 350W because of incorrect data.
Your overall conclusion that applying a power limit is beneficial is correct. But your second graph suggesting 250W is optimal for a 4090 is wrong.
MelodicRecognition7@reddit
after more tests I've decided that the optimal power limit for the Pro 6000 is 330W, and this conclusion is supported by Nvidia, because they set a 325W maximum power limit for the Max-Q version: https://old.reddit.com/r/LocalLLaMA/comments/1t4nhip/new_pro6k_maxq_are_power_limited_to_325w/
BobbyL2k@reddit
I missed your new post, thanks for the info.
OkFly3388@reddit (OP)
>But your second graph suggesting 250W is optimal for a 4090 is wrong.
You read it wrong. Best is 275W. It's about as efficient as 250W, but way faster.
dir3ctly@reddit
LACT now supports undervolting via Voltage-Frequency Curve
As of two weeks ago, it is now possible to modify the V/F curve on Linux just like in MSI Afterburner on Windows:
https://github.com/ilya-zlobintsev/LACT/releases/tag/v0.9.0
The benefits are less power consumption, heat, and noise, and it is much more effective than power limiting.
whiteamphora@reddit
Unfortunately, setting up an undervolt takes much more time than a simple nvidia-smi -pl command. I gave up on undervolting as the results weren't stable enough to make it worthwhile.
VoidAlchemy@reddit
this is the way! totally agree!
I tuned my 3090 Ti FE and it rides at a constant 1950 MHz now, never going into P-state 0 (which immediately throttles due to temp or power). So much smoother: cooler, more consistent performance, using less power, with the same or better output than untuned.
LACT is definitely better than a simple `nvidia-smi -pl 300` etc.
iamapizza@reddit
This is great, looks just like the MSI Afterburner way
tmvr@reddit
Decode (tg) is not an issue. You lose a bit more on prefill (pp), but still only about 20% if you go down from 450W to 270W.
Dany0@reddit
Sage advice for a while has been to use a ~80% power limit, OC the memory, and optimise the curve a little.
Applies to Turing+
a_beautiful_rhind@reddit
Rather than a power limit, it might help to prevent turbo, as that is where it starts to suck power massively.
nmkd@reddit
No.
Only adjust the power limit.
The GPU will boost as much as it can within its power limit either way; limiting clocks is never as reliable as limiting the power.
a_beautiful_rhind@reddit
That's opposite to my experience. When I switched to nvcurve instead of LACT and set a power limit, the GPU clocks went up and down and appeared to use more power for a given amount of work.
When I used LACT and the "old" way to undervolt, my power naturally stayed around my current power limit by itself, maybe even 10W less. Clocks above 1695 didn't really benefit my inference speeds and would just push over 300W for nothing.
I have 4x GPUs, and setting a clock limit of 0,1695 would keep things in check while maintaining prompt processing performance, image gen speeds, etc.
My hope in not limiting clocks and switching to a PL was to avoid the rise at idle and having to suspend/resume the cards, but it didn't seem to accomplish that. Instead they now bounce up to the undervolt point +/- 100MHz until they hit the PL.
nmkd@reddit
Ah, to be fair, my experience is mostly from Windows.
Things are a bit different on Linux I guess.
tmvr@reddit
Yeah, mine has been set to 360W (so 80%) after boot since the beginning as well. I could go lower for LLMs, because I would be fine with 15% lower pp instead of the 5% lower it is now, but this is a sane compromise for both that and gaming.
AeroelasticCowboy@reddit
15-20% is a HUGE performance hit, I wouldn't say it's "only 15-20%"
tmvr@reddit
You need to see it in context though. You get that by dropping power consumption by 40%, and it's still a few thousand tok/s. Let's say you get 2500 tok/s max speed. Processing a 128K (131072) initial prompt would take 52 sec. Do you really care if it's 60 or 63 sec instead? Not really, and this is a pretty extreme example; you don't regularly process such prompts when working with a model. If you are doing a lot of processing and not watching it live, it's still the better option. If your job runs for 10 hours at full speed, it would be 11.5-12 hours at reduced TDP. Maybe you would care, but it is also more expensive: the 10-hour run costs you 10x450 = 4.5 kWh versus 12x270 = 3.24 kWh for the 12-hour run, so you need only 72% of the energy/costs.
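If you want to plug in your own numbers, the arithmetic is just:

```python
def job_energy_kwh(hours: float, watts: float) -> float:
    """Energy for a run at a fixed average draw."""
    return hours * watts / 1000

full    = job_energy_kwh(10, 450)   # 4.5 kWh
limited = job_energy_kwh(12, 270)   # 3.24 kWh
print(f"{limited / full:.0%} of the energy")  # ~72%
```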
keyboardhack@reddit
Well for now. With mtp, dflash and other methods that trade compute for higher tg/s, these numbers will likely start getting affected by power limit.
jacek2023@reddit
I cut power a lot to have silent 3090s at night
Mchanger@reddit
What down to?
nmkd@reddit
3090s run pretty well at 250-270W
Clean_Initial_9618@reddit
How did you make sure the performance is not impacted? I have an RTX 3090 and want to reduce the power to reduce heat; I don't want my card melting. How do I find the safe spot? Running Windows, got MSI Afterburner installed.
nmkd@reddit
Write down FPS (in games) or TPS (LLMs), change the power limit, write the new FPS/TPS down, and then do the maths to see how much % speed you lose per % power reduced.
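If you want to automate the LLM side of that, a rough sketch (assumes Linux, passwordless sudo for nvidia-smi, and a local llama-server on port 8080; recent llama.cpp builds return a timings block in the response, otherwise fall back to the wall clock):

```python
import subprocess, time, requests

URL = "http://localhost:8080/completion"   # local llama-server
BODY = {"prompt": "Write a short story about a GPU.", "n_predict": 256}

for limit in (450, 400, 350, 300, 275, 250):
    subprocess.run(["sudo", "nvidia-smi", "-pl", str(limit)], check=True)
    requests.post(URL, json=BODY)          # warm-up run at the new limit
    t0 = time.monotonic()
    resp = requests.post(URL, json=BODY).json()
    wall = time.monotonic() - t0
    tg = resp.get("timings", {}).get("predicted_per_second")
    print(f"{limit:>3} W: {tg} tok/s (request took {wall:.1f} s)")
```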
cleversmoke@reddit
I ran benchmarks with llama.cpp:full-cuda and MSI Afterburner at +-5 ticks around 64% power (249W), where each % tick is about 4W. This gave me an idea of where my RTX 3090 performs and what the trade-offs are per W or %.
jacek2023@reddit
to the silence:
silenceimpaired@reddit
Aren’t there two ways to limit power for Nvidia GPUs?
nmkd@reddit
There's only one power limit setting.
Obviously you can underclock or something but that won't be a power limit per se.
silenceimpaired@reddit
Yeah, I was trying to recall the term underclock. I saw a post that argued that using power limiting and underclocking together can result in better performance than either alone.
nmkd@reddit
You're probably thinking about undervolting. This can get you higher clocks (=more speed) at the same power limit, even if you already reduced that.
Underclocking is kinda too rigid.
chimpera@reddit
can you check the prefill performance?
OkFly3388@reddit (OP)
PigSlam@reddit
interesting that 450W is more efficient than 150W
nmkd@reddit
At some point you just throttle it and lose any advantage.
BagComprehensive79@reddit
Okay, the sweet spot is clearer than I expected
BobsView@reddit
275W? Or did I read it wrong?
BagComprehensive79@reddit
Yes, correct. You can check the comparison between 150W and 275W on the plot.
OkFly3388@reddit (OP)
uti24@reddit
I mean, 30 t/s vs 31 t/s is not worth cranking the power up from 300W to 500W.
But improving prefill from 1700 to 2100, kinda maybe?
OkFly3388@reddit (OP)
Why?
For 1k tokens, reading time will be 0.58s at 1700 or 0.47s at 2100,
but 1k of generation will take 32s, so your total speedup (generation plus prefill time) will be less than 1%, which is absolutely not worth it. Basically, you would need a task that requires the agent to read ~100x more tokens than it generates to justify going full power.
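Spelled out with the numbers from the graphs:

```python
pp_low, pp_high = 1700, 2100       # prefill tok/s at ~300 W vs ~500 W
gen_time = 32.0                    # ~32 s for 1k generated tokens either way

t_low  = 1000 / pp_low  + gen_time   # 32.59 s total
t_high = 1000 / pp_high + gen_time   # 32.48 s total
print(f"{t_low / t_high - 1:.2%} total speedup")  # ~0.35%
```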
uti24@reddit
With agentic use you often have to read like 50k tokens; that might be worth it.
Opening-Broccoli9190@reddit
Yep, you usually read it once and then it's cached for the session.
uti24@reddit
When you are using limited hardware for local development, the agent switches tasks and contexts pretty often and then needs to rebuild it.
Opening-Broccoli9190@reddit
I get it, I am using just 32GB myself. The context is compacted roughly every hour or so and it takes seconds to load; the full other hour is me iterating with token-gen assist.
SailIntelligent2633@reddit
200k tokens would be 1m 56s with 1700 or 1m 34s with 2100. 2k generation would be 1m 7s vs 1m 5s. So 3m 3s vs 2m 39s, or 15% slower. So a 10 minute coding task near the end of the context window would take an additional minute and a half.
Opening-Broccoli9190@reddit
The model doesn't read the context every time you're posting a response, it uses cache generously.
OkFly3388@reddit (OP)
Enable the context cache. You don't need to reread the whole history every time the agent tries to generate code.
AeroelasticCowboy@reddit
The sweet spot isn't worth it unless you're running a giant farm of these for profit; who wants to leave 16% of the performance on the table? 350W is a better compromise.
Xamanthas@reddit
Me and my body temp. The difference between 1000W of heat output (CPU, two GPUs, PSU heat) and 680W is noticeable lol
SkyFeistyLlama8@reddit
1000W running for hours every day adds up in the summer. You're looking at an uncomfortably hot room without air conditioning.
Zc5Gwu@reddit
Nit: this chart might be clearer if it started at 0 on the y axis.
fullouterjoin@reddit
In general I'm a big fan of charts that don't lie, but this isn't one of those cases.
There is no data below 150W, let alone at prefill speed 0 and 0 watts. This chart is fine.
One thing that would make it clearer is prefill tok/s/W, i.e. token speed efficiency.
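Something like this, since OP is already in matplotlib (the arrays are placeholders, not the measured data):

```python
import matplotlib.pyplot as plt

# Placeholder numbers -- substitute the measured sweep from the post.
watts   = [150, 200, 250, 275, 300, 350, 400, 450]
prefill = [900, 1300, 1600, 1700, 1760, 1850, 1950, 2000]  # tok/s

eff = [p / w for p, w in zip(prefill, watts)]  # tok/s per watt
plt.plot(watts, eff, marker="o")
plt.xlabel("Power limit (W)")
plt.ylabel("Prefill tok/s per W")
plt.show()
```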
popiazaza@reddit
Welcome to Reddit. That's normal.
New-Implement-5979@reddit
nice one!
Timely_Intern_4994@reddit
Did u undervolt too?
OkFly3388@reddit (OP)
No, just the power limit. But I'll check it later and make another post if there's more than a 1% difference.
bryancr@reddit
My 5090 is overclocked and undervolted. Haven't touched max power values yet, but the highest GPU temp I generally see is 75°C.
Timely_Intern_4994@reddit
There sure will be; use the LACT app to control voltages easily.
Badhunter31415@reddit
what do you use to create those graphs ?
OkFly3388@reddit (OP)
matplotlib
Narrow-Belt-5030@reddit
I currently cap the power to my 5090 out of fear of it melting, but that graph suggests I should dig into it and cut the power even more. Thanks for this.
PooMonger20@reddit
Could you kindly elaborate on how you cap its power? What values are working best for you?
phazei@reddit
Best to under-volt with MSI Afterburner.
I set my 5090 to 860mV @ 2500MHz; it uses about 360W and only loses about 12% compared to uncapped.
You click the dot on the Y-axis of the mV you want to cap at, then drag it up to the max MHz you want, and then hit L.
You can try different MHz values at different mV; some will work, some will be ignored if they're too low. The lower the mV and the higher the clock, the better you're doing.
PooMonger20@reddit
The detailed explanation is highly appreciated. Thank you, kind sir.
Narrow-Belt-5030@reddit
Yes - I used the nvidia-smi command and was at 550W, but thanks to this post I've now set it to 400W. After running a benchmark tool I made (for LLMs - thanks Claude) I still get a very good TTFT & t/s on the model I use (Gemma4 26B MoE). From memory it was something like 160 t/s dropping to 140 t/s .. so acceptable.
old: sudo nvidia-smi -pl 550
new: sudo nvidia-smi -pl 400
jai5@reddit
Tell us more about your LLM benchmark tool.
Narrow-Belt-5030@reddit
It's just a Python script I asked Claude to write for me.
I asked him to create a script that sends prompts of varying sizes (1K, 4K, 8K) in batches of 1, 2, or 4. I use vLLM, which is extremely good at TTFT & throughput.
This is at 400W; I then asked Claude to up the prompt sizes as I raised the context limit to 128K, ready for Hermes Agent:
| Context | Concurrency | TTFT | Per-req tok/s | Aggregate tok/s |
|-----------|-------------|---------|---------------|-----------------|
| ~1K tok | 1 | 66 ms | 156 | 156 |
| ~1K tok | 2 | 75 ms | 128 | 256 |
| ~1K tok | 4 | 110 ms | 113 | 428 |
| ~4K tok | 1 | 221 ms | 149 | 149 |
| ~4K tok | 2 | 235 ms | 124 | 246 |
| ~4K tok | 4 | 377 ms | 103 | 352 |
| ~12K tok | 1 | 897 ms | 137 | 137 |
| ~12K tok | 2 | 1747 ms | 111 | 212 |
| ~12K tok | 4 | 2837 ms | 73 | 165 |
| ~32K tok | 1 | 3.2 s | 128 | 128 |
| ~32K tok | 2 | 5.0 s | 69 | 61 |
| ~32K tok | 4 | 8.5 s | 39 | 47 |
| ~64K tok | 1 | 5.4 s | 123 | 123 |
| ~64K tok | 2 | 8.2 s | 63 | 39 |
| ~64K tok | 4 | 13.7 s | 33 | 30 |
| ~120K tok | 1 | 5.4 s | 123 | 123 |
| ~120K tok | 2 | 8.2 s | 63 | 39 |
| ~120K tok | 4 | 13.7 s | 33 | 30 |
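If anyone wants to roll their own, the core of it is just concurrent streaming requests with a stopwatch. A bare-bones sketch (OpenAI-style endpoint assumed; "my-model" is a placeholder, TTFT is taken as time to the first streamed chunk, and counting stream lines is only a rough token proxy):

```python
import time, requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"   # vLLM's OpenAI-style endpoint

def one_request(prompt: str):
    body = {"model": "my-model", "prompt": prompt,
            "max_tokens": 256, "stream": True}
    t0, ttft, chunks = time.monotonic(), None, 0
    with requests.post(URL, json=body, stream=True) as r:
        for line in r.iter_lines():
            if not line:
                continue
            if ttft is None:
                ttft = time.monotonic() - t0   # time to first chunk
            chunks += 1                        # ~1 line per token (rough)
    return ttft or 0.0, chunks / (time.monotonic() - t0)

with ThreadPoolExecutor(max_workers=4) as ex:  # concurrency = 4
    for ttft, tps in ex.map(one_request, ["Tell me a story."] * 4):
        print(f"TTFT {ttft * 1000:.0f} ms, ~{tps:.0f} tok/s")
```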
PooMonger20@reddit
Big thank you for the answer. I will check it out.
notheresnolight@reddit
I cap the power to my 5090 out of fear of it melting the CPU. 450W brings everything down by 10°C.
SailIntelligent2633@reddit
Melting the CPU?
notheresnolight@reddit
figuratively speaking; the CPU would normally idle around 40°C if the GPU weren't blowing 80°C air its way
Skystunt@reddit
450W being considered lower power is absolutely crazy; sometimes I forget how high of a TDP the 5090 has
notheresnolight@reddit
yeah, it's basically a space heater that can compute matrix calculations
finevelyn@reddit
The lowest you can go with 5090 is 400W if I remember correctly. The driver/firmware won't allow it to be set lower.
phazei@reddit
Maybe I got really lucky with my silicon. I tried setting a max wattage, I don't recall what I set it at, but I got good results with undervolting and stayed with that.
At 860mV @ 2500MHz it uses about 360W with about a 12% loss compared to uncapped.
a_beautiful_rhind@reddit
Just cap the clock. It can spike over the power limit briefly anyways.
Plabbi@reddit
You want undervolting curves for best results, either MSI Afterburner in windows or LACT in Linux.
Enjoy the rabbit hole.
a_beautiful_rhind@reddit
I have been. In inference there are really only two "speeds": full go and idle. All those middle bits never really get used. At least not that I have seen.
silenceimpaired@reddit
What cards do you have and what are your settings?
a_beautiful_rhind@reddit
4x 3090. I think I flatten the point around 760mV and 1695MHz. I set a 275W PL at the moment because nvcurve doesn't have clock locking.
Prior to that it was a 200MHz offset and a 0-1695 clock limit. I am gonna move back over to LACT when I turn off my RAM overclock for the summer. I had to build it against new GNOME libs because they dropped support for Ubuntu 22.04. I don't even run GNOME but MATE.
Narrow-Belt-5030@reddit
Booo! Thank you though :-)
koloved@reddit
There's a BIOS that allows you to limit the 5090 to 300 watts. Visually, in the application it will still be 400, but in reality it will be 300 from the outlet.
I haven't tried it myself, but I want to.
https://www.reddit.com/r/LocalLLaMA/comments/1rbyg5x/if_you_have_a_rtx_5090_that_has_a_single/
Narrow-Belt-5030@reddit
I am not that brave .. but thank you - something to consider.
Crinkez@reddit
Datacenters hate this one trick!
Momsbestboy@reddit
... and in case you use a GPU of AMD: LACT on Linux is a nice program to tune the GPU: https://i.imgur.com/LRuhPom.png
I have set the power limit to 210W now (down from the 230W in the screenshot) and my card also runs stable at a -100mV undervolt. I ran a benchmark with llama-cli using the same prompt before and after, and t/s even increased, because the card hits the thermal throttle less often.
On top of that, the card draws less energy and the fan produces less noise. So it is a win-win-win situation.
farewellrif@reddit
Is this meaningfully different than rocm-smi --setpoweroverdrive [watts]?
computehungry@reddit
Yeah, the power limit is correlated with max clock frequency; the actually relevant graph would be what max frequency you get at each power limit. And the answer would be... different for every model, and also for the VRAM/RAM split, in case it is split, hahaha.
kwinz@reddit
I think tokens per Watt is more meaningful in this case. Not everything is compute bound.
Momsbestboy@reddit
Much better: give the tuning a try first, before thinking too much. The R9700 is VRAM-starved, so gaining a bit more on the GPU by running at higher frequencies is not really relevant.
All I did was run llama-cli with the same model and prompt, and then check the t/s. In my case, t/s increased after I undervolted and limited the power, because thermal throttling seems to have a much bigger impact than gaining some spikes in GPU frequency.
So for me(!) it was a simple decision. Less power drawn, faster, and the GPU runs cooler and makes less noise. Your results might differ, so test it.
computehungry@reddit
I guess, if you don't want to adjust the VF curve, you're completely right. You can undervolt and power limit at the same time though for better performance. At the minimum, it's healthier for the GPU since the voltage wouldn't oscillate all the time. If overclocked, the performance gain can be 30-50% at the same wattage, which is why I'm claiming this tuned performance is the more interesting number. But I understand not everyone wants to mess with the factory settings.
Timely-Perception-26@reddit
You're using LACT, but not a voltage/frequency curve?
https://imgur.com/a/78eeZaJ
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1nkycpq/gpu_power_limiting_measurements_update/
YetAnotherAnonymoose@reddit
Really useful since electricity is fairly expensive in Germany, thanks.
mr_zerolith@reddit
One problem is that prompt processing speed goes down quite a bit when you power limit.
To gain the speed back when power limiting, overclock the memory.
Blackwells can take a 2-3GHz memory boost because the memory chips are run below spec for thermal reasons; by power limiting, you are freeing up the thermals to overclock the RAM :)
OkFly3388@reddit (OP)
Check the prefill graphs here:
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct8k6/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct7k9/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Enough-Astronaut9278@reddit
makes sense — decode is memory bandwidth bound anyway, the CUDA cores are mostly sitting there waiting. prefill is where you'd actually feel the power cut. good data tho, gonna try this on my setup
FatheredPuma81@reddit
First thing I did with my 4090 was undervolt, downclock the core, and overclock the RAM. Power draw dropped by 100W and performance went up because of the RAM overclock.
RobotRobotWhatDoUSee@reddit
Do you have your test code posted anywhere?
ageek@reddit
I'm always undervolting my GPU, so it should be at top efficiency.
zhambe@reddit
I have 2x RTX 3090 in my rig, and I run them at 250W -- unthrottled, they trip the overload alarm on the UPS when they get properly going.
I lose maybe 10% off peak performance, and that's fine. Otherwise the machine runs so hot it cooks the whole room anyway.
Lissanro@reddit
Seeing the post makes me curious if I should look into optimizing the energy consumption of my GPUs. I am still "cooking" with four 3090s at the default 350W limit; with models loaded fully in VRAM my PC consumes about 2kW according to my UPS (like when running Qwen 3.6 27B with vLLM on all four GPUs), or about 1.2kW during RAM+VRAM inference (like with Kimi K2.6 Q4_X, which mostly ends up in RAM).
I have a big fan in the window near the PC that can remove the heat fast enough, so heat is not a concern in my case, but if I can reduce electricity cost per token, that's interesting, so I plan to experiment. Your 250W may be a good starting point for me to try, especially for overnight runs when I am not in a hurry, so thanks for your comment!
Smeetilus@reddit
My four 3090s are set to 200 watts on my 1600-watt EVGA-powered nonsense generator. I don't feel like it's slow. I think I could set them to 220, but I'd start having random reboots much higher than that, so I cut back to be safe. They still will briefly surge past whatever limit you set, from what I've read.
confused-photon@reddit
Dear god it’s mining undervolting all over again
Slaghton@reddit
When I train models on my 3090s, or really do anything on them, I power limit them all to 275W.
y4m4@reddit
Thanks for posting this! It inspired me to test my cards with Qwen3 VL 8B:
PNY 5090 OC - the most efficient point was 410W, which gave 212.35 tok/s. Topped out at 219 tok/s at 550W and decreased slowly as it climbed to 600W.
Nvidia 3090 FE - max efficiency at 240W → 106 tok/s, and increasing the power slowly increased the speed at every step, up to 122 tok/s at 350W. I settled on 270W to get an extra 11 tok/s.
qubridInc@reddit
Yeah the efficiency curve on these cards gets weirdly good once you stop chasing absolute max throughput.
A lot of people assume dropping from 550W → 400W would tank performance, but for many inference workloads the hit is surprisingly small compared to the heat/noise/power savings you get back.
Especially true for homelabs where the goal is usually “good sustained throughput” and not leaderboard benchmark numbers.
DataPhreak@reddit
2 things.
First, this is going to be different on every single card, and likely across model architectures.
Second, just because you're using all of your compute doesn't mean you are using all of the electricity. You have to put a watt meter on your machine. You're going to find that even when you set the power limit to 450, the GPU is not going to go over 300.
Iory1998@reddit
A few weeks ago I posted a guide here on undervolting your GPUs. The performance drop is often negligible, and you might even see performance gains with some GPUs.
Arcuru@reddit
Yea... if your metric is going to be tokens/watt then you should definitely power limit. Usually the optimal work/power ratio is far below the top-end speed; that's why your devices frequency-limit in the first place.
In this case it probably uses the extra power for number crunching (increasing the frequency of the compute cores), whereas token generation is largely bounded by memory bandwidth, which probably gets maxed at a lower power limit.
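You can sanity-check the bandwidth-bound part with a napkin estimate (both numbers below are ballpark assumptions):

```python
# Each generated token streams the active weights from VRAM roughly once.
mem_bandwidth_gb_s = 1008   # RTX 4090 spec-sheet bandwidth
active_weights_gb  = 16     # assumption: ~27B params at ~4-bit

print(f"~{mem_bandwidth_gb_s / active_weights_gb:.0f} tok/s decode ceiling")
# ~63 tok/s -- real numbers land well below, but compute isn't the wall
```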
hidden2u@reddit
I just undervolt, it's insanely efficient
silenceimpaired@reddit
GPU and settings, if you please?
NineThreeTilNow@reddit
My 4090 has always been slightly underpowered/underclocked.
Taking like ~10% off the top gives better 1% lows in video games.
It's pretty well known for the 4090 in gaming circles. I've left the setting on after seeing enough melted connectors.
Even when I use the card for model training, it's the same.
PooMonger20@reddit
Thanks for sharing; could you elaborate on what software you use to underpower/underclock it?
Organic_Scarcity_495@reddit
yeah the prefill speed drop from power limiting is the part people don't talk about. decode tokens per second doesn't tell the full story when your time-to-first-token doubles. for interactive use (chat, coding assistants) that latency hit matters a lot. for batch inference where you're just running through a queue of prompts, the power savings are worth it. the right answer depends on your workload mix
OkFly3388@reddit (OP)
There are prefill graphs in my other comments already, check them out:
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct7k9/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct8k6/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
cyberdork@reddit
Yeah, on my 3090 I noticed that reducing power by 25% only resulted in an 11% performance hit. So I always run at 275W max.
gabrielrfg@reddit
You should maybe look into setting up an undervolting curve; it usually gets you about 20-30% less power draw for the same performance in gaming. In my experience with LLM inference on my old 3080, I could reduce the clock speed a lot and overclock the memory frequency and see performance *increases* while lowering power draw dramatically.
Look_0ver_There@reddit
Graphs that don't start at zero are the work of the devil!
snorkelvretervreter@reddit
That is so extremely unnecessary in that second graph. That is just downright misleading. Bad OP!
Beamsters@reddit
You expect a 4090 to run at 50W?
Look_0ver_There@reddit
Y-axis mate
Borkato@reddit
They mean the y axis
gigaflops_@reddit
The difference in your electricity bill between running your local AI at 450 watts vs 300 watts is *negligible*. A 150-watt difference over, let's say, 45 seconds to respond to a typical prompt equals 0.00185 kilowatt-hours. At the average US electricity rate of 17 cents/kWh, that's **$0.000319 saved per prompt**. In other words, you'd have to send **3137 prompts** to your local model to save a *single dollar*.
ttkciar@reddit
//me looks at eval which has been running constantly for >30 hours and still not done.
We are not the same.
D2OQZG8l5BI1S06@reddit
Did you measure the actual consumption? I never hit power limit.
OkFly3388@reddit (OP)
>From my observation, the GPU is constantly hitting the power limit, so it's safe to say that's the actual consumption
D2OQZG8l5BI1S06@reddit
So you just multiplied by the power limit to draw the graph? LOL
Timely-Perception-26@reddit
An honest tip from one Linux LLM guy to another:
A simple, strict wattage limit causes the graphics card to continue using aggressive boost logic, which results in inefficient voltage/frequency fluctuations and load spikes, and shortens the lifespan of your graphics card.
A voltage/frequency curve, on the other hand, leads to more stable performance, lower voltage at the same clock speed, less heat, often fewer spikes, and better efficiency.
Don’t be stuck in the past—download LACT and set up your own curve. Best regards from someone who loves his graphics card.
This is my curve:
https://imgur.com/a/78eeZaJ
gwillen@reddit
Thanks, this is helpful. I have played with the power limit on my GPUs, mostly out of concern for total power draw (my PSU is only rated for 1000W sustained, and my GPUs together plus other draw can exceed that.) But I didn't have a good sense of what the curve was like. I should probably do tests of my own.
MutantEggroll@reddit
These are great charts! Thanks for sharing.
I've done similar with my 5090, and I found that I actually ended up with thermal headroom for a mild overclock. I'd be interested to hear whether your 4090 has similar headroom, and if you're able to recover or possibly even improve upon baseline performance.
OkFly3388@reddit (OP)
I actually don't really consider overclocking an option; the extra gains aren't worth the reduced lifespan, and especially not with the new connector, which is already hot as f*ck when I run at 450W. I even put a heatsink on it, lol.
MutantEggroll@reddit
My understanding was that overclocking alone doesn't harm the hardware - it's the overvolting that people often do along with the overclocking that does the damage. And in my case, I actually was able to undervolt my 5090 and still get a mild, stable overclock. And my temps rarely exceed 70C, and I've never seen it reach 80C, let alone the thermal limit of 90C.
Worth looking into, IMO, though I do very much understand the desire to keep your expensive baby safe, lol.
OkFly3388@reddit (OP)
Fair, I also plan to do some experiments with undervolting later.
alberto_467@reddit
Did you allow cool down time between runs and did you monitor the temps to make sure they're similar?
OkFly3388@reddit (OP)
No, a cooldown gives an artificial advantage to higher power limits, because they can temporarily boost performance for the first minute after a cold start. So it's the opposite: keep the GPU hot so there are no temporary boosts, only real long-term performance.
Temps also weren't similar; each raise in power also raised temps.
alberto_467@reddit
No, the only fair way is to start every scenario at the same temperature, otherwise you're comparing apples to oranges. If you don't want to consider the temporary boosts because you want to measure performance in a consistent workload you have to run a longer experiment.
BTW, local inference workloads are often not that consistent, and so will take advantage of some time to cool down and benefit from the performance boost.
That's perfectly fine, but the starting point has to be the same, if you're starting the higher power settings with a hotter temperature, you're penalizing them unfairly.
When you're trying to prove your thesis, you should never do it with an experiment that even slightly favors it.
OkFly3388@reddit (OP)
>consistent workload you have to run a longer experiment
That's exactly what I did.
>starting the higher power settings with a hotter temperature, you're penalizing them unfairly
They have consistent temps across the test, so it really shows how the GPU performs under load without the boost advantage, so it's perfectly fine.
alberto_467@reddit
Oh ok, sorry, I interpreted that wrong then.
If they're all starting at the same temperature (even the lowest power settings) then it's fine. Otherwise I'm sure that running the tests from lowest to highest power settings vs in reverse order would produce different results.
OkFly3388@reddit (OP)
No, I mean that for every single power value I run an initial warmup and then the actual test, so within a single test the temp is consistent. So every test has a steady temp, but the temp differs between tests.
alppawack@reddit
How do you make the nvidia-smi power limit persistent? It resets to default every time I restart. Do I have to write a boot script for it?
nmkd@reddit
MSI Afterburner
imgroot9@reddit
I just added it to the same script file I start the model with (I'm using llama-server.exe)
AvidCyclist250@reddit
I found 200W to be the goldilocks zone for my 4080
crantob@reddit
My testing showed similar peak efficiency for the 3090, somewhere around 250W.
FIdelity88@reddit
I can confirm!
iamrealadvait@reddit
This is actually super interesting — I didn’t expect the efficiency curve to drop off that hard after ~250–300W. Feels like a lot of people (including me tbh) just assume “more power = better throughput,” but this shows there’s a pretty clear sweet spot where you’re getting most of the performance without burning extra watts for marginal gains. Curious if this holds across different models or if it’s more GPU/architecture dependent? Also wondering how much this shifts with longer context windows or different batch sizes.
Would be cool to see the same plot with tokens per watt directly — might make the tradeoff even clearer.
crantob@reddit
Also help shut down "green energy" and allow markets to work again.
We can use nuclear waste for fuel, just need to get the idiots out of the way.
stddealer@reddit
Ok, but how much power is it actually consuming? The way I interpret it, these graphs could just indicate that above 275W, something other than the power supply is limiting the performance, so it might not actually consume any more.
OkFly3388@reddit (OP)
I am 100% sure it consumed all available power.
Fan speed, warm air from the PC, power consumption data, and temperatures all rose every time my script applied a new limit. So it's safe to say that's the actual consumption.
stddealer@reddit
That's pretty inefficient then, but not too surprising, I can believe it.
FIdelity88@reddit
Damn, this is great! I hated the energy consumption of the RTX 30XX series.
GPU's I have:
RTX 3090 24GB @ PCIe 5.0 x16
RTX 3080 20GB (vRAM modded) @ PCIe 4.0 x8
Model I run with layer split:
Qwen3.6-27B-Q6_K-mtp.gguf
The improvements are amazing:
A reduction of ~145W while tokens/s only dropped about ~7%.
My llama.cpp settings:
HongPong@reddit
great digging, who knew
GroundbreakingTea195@reddit
Would be great if there were a script to test this out!
vastaaja@reddit
Check out https://github.com/noonghunna/club-3090 for both results and an easy script to run your own numbers.
HavenTerminal_com@reddit
my gpu has been at 100% for 3 days asking it what to name a variable
ComplexType568@reddit
I appreciate how beautifully made these charts are.
Perfect-Flounder7856@reddit
Currently have my 6000 Blackwell set to 450W
FencingNerd@reddit
Inference is largely limited by memory bandwidth, not compute power.
Genebra_Checklist@reddit
I have uv panels, don't waste a dime
kwinz@reddit
pv? :D
Genebra_Checklist@reddit
exactly! thanks lol
Badger-Purple@reddit
Looks like your sweet spot is 275
New-Implement-5979@reddit
or 300W to get a bit faster prefill
Ell2509@reddit
Looks right to me.
My best machine for inference has 2x R9700 AI GPUs, and both can go up to 330W. I run both at 270W each and see almost no drop in speed. My GPUs will last longer, and I don't have to worry about burning out my 12v6 connectors.
ProfessionalJackals@reddit
Do you happen to have some numbers like how many hours you run them programming, and the difference in your power bill? I feel this is often overlooked in these topics...
PANIC_EXCEPTION@reddit
I wonder if most GPUs have an obvious knee point power limit like that?
Ok-Measurement-1575@reddit
Does q4_0 make you CPU-limited?
MatlowAI@reddit
You should try another one with batched token generation.
a_beautiful_rhind@reddit
We can lock clocks and undervolt now. You have to watch what happens to idle power as well; I doubled my idle by messing with these things. A 40W idle turning into 100W will eat more than you save at peak usage. Although this is for multiple cards. Have to remember to reset them when I'm finished.
JLeonsarmiento@reddit
Interesting... I also run my Mac in "Low Power" mode when doing long Q&A sessions with a local LLM... I don't need fast bursts of tokens if I'm checking the answer in, like, the next five or 10 minutes, since I'm doing other stuff on the side...
wanielderth@reddit
Will this work on 3000-series cards?
OkFly3388@reddit (OP)
I think it can be applied to any GPU, just the sweet spot will be a bit different.
unclesabre@reddit
Has anyone got a similar graph or info relating to Apple silicon? I have an M4 Max and I'm thinking perhaps similar logic applies and I could do something like limit cores, etc. Idk, but interesting observations, ty.
marlinspike@reddit
Johnny Depp, Pirates of the Caribbean
davew111@reddit
Cards like the RTX 6000 Ada (basically a 4090 with 48GB of RAM) have a power limit of 300W. The RTX 6000 Pro Max-Q too. Server cards like the L40 are also around 300W; some a little more, some a little less. Beyond the 300W mark you are often financially better off saving the extra electricity to pay for additional cards.
In your benchmarking it's interesting that performance actually drops beyond 400W, most likely due to thermal throttling. I've seen that in gaming and have always flattened my voltage curve to a slight downclock at higher voltages, figuring it will only thermal-throttle after a few minutes anyway. I'd prefer the card run at 2200MHz for hours than manage 2500MHz for a few minutes and then throttle itself.
Technical-Earth-3254@reddit
I've also noticed basically no impact on performance on my 3090 running at an 80% PL (300W). And it doesn't get as loud, which is a plus for me, because my stuff runs on my PC.
artisticMink@reddit
That chart is very RTX 4090-specific. In general the cost/efficiency sweet spot for almost all operations on the 4090 is ~75% to 80%, depending on who you ask.
Mileage on other GPUs may vary wildly.
SnooPaintings8639@reddit
If anyone wonders about the sweet spot for the popular 2x 3090 build: running Qwen 3.6 27B, it is a bit over 200W each (I keep it at exactly 200). At least on my build; and keep in mind that model type, especially MoE vs dense, does affect the shape of the curve.
crapaud_dindon@reddit
What clock speed do you get at 200W?
Glum-Atmosphere9248@reddit
Maybe a different graph for concurrent generations with vLLM?
StupidScaredSquirrel@reddit
I always power limit my GPU for exactly this reason. I can't stand the noise anyway.