Stop wasting electricity
Posted by OkFly3388@reddit | LocalLLaMA | View on Reddit | 190 comments
Run on my RTX 4090
llama.cpp params:
llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144
Power limit was set using sudo nvidia-smi -pl N
From my observation, the GPU is constantly hitting the power limit, so it's safe to say that's the actual consumption. You can cut power consumption by 40% without losing performance (and also reduce noise and heat from the PC, and extend the lifespan of the GPU).
BobbyL2k@reddit
How did you measure Energy used for the second graph? It seems off.
OkFly3388@reddit (OP)
power consumption * time to generate = total energy spent
dankfrankreynolds@reddit
This is definitely not correct. I've been running a multi-day job and took your advice and set my limit to 275. My performance dropped 10%. Without the limit, my 4090 is only pulling around 320W for the job.
You're assuming the thing is running at the limit and it is not. Or if it is, then that's for your specific workload, not a general truth.
OkFly3388@reddit (OP)
Oh yeah, if only I had put in the post the exact setup that gives me these results. Wait a minute, I did exactly that.
dankfrankreynolds@reddit
Why did you use "Power Limit" instead of "Power Draw" if you know they are identical?... It sure sounds like your GPU is at 100% and therefore you're saying the power limit is the power draw. I'm at 100% GPU and using 320W.
I'm not posting any of this to argue with you, but to let other readers like myself know that it's not what it seems. I don't really care to argue with you or cause any negative vibes.
OkFly3388@reddit (OP)
>It sure sounds like your GPU is at 100% and therefore you're saying the power limit is the power draw.
>From my observation, the GPU is constantly hitting the power limit, so it's safe to say that's the actual consumption.
Reread the post, all the info is there.
BobbyL2k@reddit
You **cannot** assume power limit = power consumption.
In this post, you will see that the GPU isn't reaching the power limit, as each stage of inference doesn't demand an equal amount of power.
If you want to measure energy, you need to use DCGM. Your graph isn't showing energy usage. It's showing power limit * time, which doesn't mean anything.
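If you don't want to set up DCGM, even integrating polled nvidia-smi readings gets you much closer than assuming the limit. A minimal sketch (100 ms polling; start it, run your generation, then Ctrl-C):

```python
import subprocess, time

# Stream actual power draw (watts) every ~100 ms and integrate over time.
proc = subprocess.Popen(
    ["nvidia-smi", "--query-gpu=power.draw",
     "--format=csv,noheader,nounits", "-lms", "100"],
    stdout=subprocess.PIPE, text=True)

joules, last = 0.0, time.monotonic()
try:
    for line in proc.stdout:
        now = time.monotonic()
        joules += float(line) * (now - last)  # W x s = J
        last = now
except KeyboardInterrupt:
    pass
finally:
    proc.terminate()
    print(f"Energy: {joules:.0f} J = {joules / 3.6e6:.4f} kWh")
```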
OkFly3388@reddit (OP)
This is technically correct, but I think such errors are within the tolerance range and don't really change the overall picture.
BobbyL2k@reddit
In the linked post, the redditor concluded that a 300W power limit was optimal for the Pro 6000, whereas they originally concluded it was 350W because of incorrect data.
Your overall conclusion that applying a power limit is beneficial is correct. But your second graph suggesting 250W is optimal for a 4090 is wrong.
MelodicRecognition7@reddit
after more tests I've decided that the optimal power limit for the Pro 6000 is 330W, and this conclusion is supported by Nvidia, because they set a 325W maximum power limit for the Max-Q version: https://old.reddit.com/r/LocalLLaMA/comments/1t4nhip/new_pro6k_maxq_are_power_limited_to_325w/
BobbyL2k@reddit
I missed your new post, thanks for the info.
OkFly3388@reddit (OP)
>But your second graph suggesting 250W is optimal for a 4090 is wrong.
You read it wrong. Best is 275W. It's about as efficient as 250W, but way faster.
dir3ctly@reddit
LACT now supports undervolting via Voltage-Frequency Curve
As of two weeks ago, it is now possible to modify the V/F curve on Linux just like in MSI Afterburner on Windows:
https://github.com/ilya-zlobintsev/LACT/releases/tag/v0.9.0
The benefits are less power consumption, heat, and noise, and it is much more effective than power limiting.
whiteamphora@reddit
Unfortunately, setting up an undervolt takes much more time than a simple nvidia-smi -pl command. I gave up on undervolting as the results weren't stable enough to make it worthwhile.
VoidAlchemy@reddit
this is the way! totally agree!
I tuned my 3090 Ti FE and it rides at a constant 1950 MHz now, never going into P-state 0 (which immediately throttles due to temp or power). So much smoother: cooler, more consistent performance, using less power, with the same or better output than untuned.
LACT is definitely better than a simple `nvidia-smi -pl 300` etc.
iamapizza@reddit
This is great, looks just like the MSI Afterburner way
tmvr@reddit
Decode (tg) is not an issue. You lose a bit more on prefill (pp), but still only about 20% if you go down from 450W to 270W.
Dany0@reddit
Sage advice for a while has been to use a ~80% power limit, OC the memory, and optimise the curve a little.
Applies to Turing+
a_beautiful_rhind@reddit
Rather than a power limit, it might help to prevent turbo, as that is where it starts to suck power massively.
nmkd@reddit
No.
Only adjust the power limit.
The GPU will boost as much as it can within its power limit either way; limiting clocks is never as reliable as limiting the power.
a_beautiful_rhind@reddit
That's opposite to my experience. When I switched to nvcurve instead of LACT and set a power limit, the GPU clocks went up and down and appeared to use more power for a given amount of work.
When I used LACT and the "old" way to undervolt, my power naturally stayed around my current power limit by itself, maybe even 10W less. Clocks above 1695 didn't really benefit my inference speeds and would just push over 300W for nothing.
I have 4x GPUs, and setting a clock limit of 0,1695 would keep things in check while maintaining prompt processing performance, image gen speeds, etc.
My hope in not limiting clocks and switching to a PL was to avoid the rise at idle and having to suspend/resume the cards, but it didn't seem to accomplish that. Instead they now bounce up to the undervolt point +/- 100MHz until they hit the PL.
nmkd@reddit
Ah, to be fair, my experience is mostly from Windows.
Things are a bit different on Linux I guess.
tmvr@reddit
Yeah, mine has been set to 360W (so 80%) after boot since the beginning as well. I could go lower for LLMs, because I would be fine with 15% lower pp instead of the 5% lower it is now, but this is a sane compromise for both that and gaming.
AeroelasticCowboy@reddit
15-20% is a HUGE performance hit, I wouldn't say it's "only 15-20%"
tmvr@reddit
You need to see it in context though. You get that by dropping power consumption by 40%, and it's still a few thousand tok/s. Let's say you get 2500 tok/s max speed. Processing a 128K (131072) initial prompt would take 52 sec. Do you really care if it's 60 or 63 sec instead? Not really, and this is a pretty extreme example; you don't regularly process such prompts when working with a model. If you are doing a lot of processing and not watching it live, it's still the better option. If your job runs for 10 hours at full speed, it would be 11.5-12 hours at reduced TDP. Maybe you would care, but it is also more expensive: the 10-hour run costs you 10x450 = 4.5 kWh versus 12x270 = 3.24 kWh for the 12-hour run, so you need only 72% of the energy/costs.
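If you want to plug in your own numbers, the arithmetic is just:

```python
def job_energy_kwh(hours: float, watts: float) -> float:
    """Energy for a run at a fixed average draw."""
    return hours * watts / 1000

full    = job_energy_kwh(10, 450)   # 4.5 kWh
limited = job_energy_kwh(12, 270)   # 3.24 kWh
print(f"{limited / full:.0%} of the energy")  # ~72%
```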
keyboardhack@reddit
Well for now. With mtp, dflash and other methods that trade compute for higher tg/s, these numbers will likely start getting affected by power limit.
jacek2023@reddit
I cut power a lot to have silent 3090s at night
Mchanger@reddit
What down to?
nmkd@reddit
3090s run pretty well at 250-270W
Clean_Initial_9618@reddit
How did you make sure the performance is not impacted? I have an RTX 3090 and want to reduce the power to reduce heat; I don't want my card melting. How do I find the safe spot? Running Windows, got MSI Afterburner installed.
nmkd@reddit
Write down FPS (in games) or TPS (LLMs), change the power limit, write the new FPS/TPS down, and then do the maths to see how much % speed you lose per % power reduced.
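If you want to automate the LLM side of that, a rough sketch (assumes Linux, passwordless sudo for nvidia-smi, and a local llama-server on port 8080; recent llama.cpp builds return a timings block in the response, otherwise fall back to the wall clock):

```python
import subprocess, time, requests

URL = "http://localhost:8080/completion"   # local llama-server
BODY = {"prompt": "Write a short story about a GPU.", "n_predict": 256}

for limit in (450, 400, 350, 300, 275, 250):
    subprocess.run(["sudo", "nvidia-smi", "-pl", str(limit)], check=True)
    requests.post(URL, json=BODY)          # warm-up run at the new limit
    t0 = time.monotonic()
    resp = requests.post(URL, json=BODY).json()
    wall = time.monotonic() - t0
    tg = resp.get("timings", {}).get("predicted_per_second")
    print(f"{limit:>3} W: {tg} tok/s (request took {wall:.1f} s)")
```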
cleversmoke@reddit
I ran benchmarks with llama.cpp:full-cuda and MSI Afterburner at +-5 ticks around 64% power (249W), where each % tick is about 4W. This gave me an idea of where my RTX 3090 performs and what the trade-offs are per W or %.
jacek2023@reddit
to the silence:
silenceimpaired@reddit
Aren’t there two ways to limit power for Nvidia GPUs?
nmkd@reddit
There's only one power limit setting.
Obviously you can underclock or something but that won't be a power limit per se.
silenceimpaired@reddit
Yeah, I was trying to recall the term underclock. I saw a post that argued that using power limiting and underclocking together can result in better performance than either alone.
nmkd@reddit
You're probably thinking about undervolting. This can get you higher clocks (=more speed) at the same power limit, even if you already reduced that.
Underclocking is kinda too rigid.
chimpera@reddit
can you check the prefill performance?
OkFly3388@reddit (OP)
PigSlam@reddit
interesting that 450W is more efficient than 150W
nmkd@reddit
At some point you just throttle it and lose any advantage.
BagComprehensive79@reddit
Okay, the sweet spot is clearer than I expected
BobsView@reddit
275W? Or did I read it wrong?
BagComprehensive79@reddit
Yes, correct. You can check the comparison between 150W and 275W on the plot.
OkFly3388@reddit (OP)
uti24@reddit
I mean, 30 t/s vs 31 t/s is not worth cranking the power up from 300W to 500W.
But improving prefill from 1700 to 2100, kinda maybe?
OkFly3388@reddit (OP)
Why?
For 1k tokens, reading time will be 0.58s at 1700 or 0.47s at 2100,
but 1k of generation will take 32s, so your total speedup (generation plus prefill time) will be less than 1%, which is absolutely not worth it. Basically, you would need a task that requires the agent to read ~100x more tokens than it generates to justify going full power.
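Spelled out with the numbers from the graphs:

```python
pp_low, pp_high = 1700, 2100       # prefill tok/s at ~300 W vs ~500 W
gen_time = 32.0                    # ~32 s for 1k generated tokens either way

t_low  = 1000 / pp_low  + gen_time   # 32.59 s total
t_high = 1000 / pp_high + gen_time   # 32.48 s total
print(f"{t_low / t_high - 1:.2%} total speedup")  # ~0.35%
```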
uti24@reddit
With agentic use you often have to read like 50k tokens; that might be worth it.
Opening-Broccoli9190@reddit
Yep, you usually read it once and then it's cached for the session.
uti24@reddit
When you are using limited hardware for local development, the agent switches tasks and contexts pretty often and then needs to rebuild it.
Opening-Broccoli9190@reddit
I get it, I am using just 32GB myself. The context is compacted roughly every hour or so and it takes seconds to load; the full other hour is me iterating with token-gen assist.
SailIntelligent2633@reddit
200k tokens would be 1m 56s with 1700 or 1m 34s with 2100. 2k generation would be 1m 7s vs 1m 5s. So 3m 3s vs 2m 39s, or 15% slower. So a 10 minute coding task near the end of the context window would take an additional minute and a half.
Opening-Broccoli9190@reddit
The model doesn't read the context every time you're posting a response, it uses cache generously.
OkFly3388@reddit (OP)
Enable the context cache. You don't need to reread the whole history every time the agent tries to generate code.
AeroelasticCowboy@reddit
The sweet spot isn't worth it unless you're running a giant farm of these for profit; who wants to leave 16% of the performance on the table? 350W is a better compromise.
Xamanthas@reddit
Me and my body temp. The difference between 1000W of heat output (CPU, two GPUs, PSU heat) and 680W is noticeable lol
SkyFeistyLlama8@reddit
1000W running for hours every day adds up in the summer. You're looking at an uncomfortably hot room without air conditioning.
Zc5Gwu@reddit
Nit: this chart might be clearer if it started at 0 on the y axis.
fullouterjoin@reddit
In general I'm a big fan of charts that don't lie, but this isn't one of those cases.
There is no data below 150W, let alone at prefill speed 0 and 0 watts. This chart is fine.
One thing that would make it clearer is prefill tok/s/W, i.e. token speed efficiency.
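Something like this, since OP is already in matplotlib (the arrays are placeholders, not the measured data):

```python
import matplotlib.pyplot as plt

# Placeholder numbers -- substitute the measured sweep from the post.
watts   = [150, 200, 250, 275, 300, 350, 400, 450]
prefill = [900, 1300, 1600, 1700, 1760, 1850, 1950, 2000]  # tok/s

eff = [p / w for p, w in zip(prefill, watts)]  # tok/s per watt
plt.plot(watts, eff, marker="o")
plt.xlabel("Power limit (W)")
plt.ylabel("Prefill tok/s per W")
plt.show()
```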
popiazaza@reddit
Welcome to Reddit. That's normal.
New-Implement-5979@reddit
nice one!
Timely_Intern_4994@reddit
Did u undervolt too?
OkFly3388@reddit (OP)
No, just the power limit. But I'll check it later and make another post if there's more than a 1% difference.
bryancr@reddit
My 5090 is overclocked and undervolted. Haven't touched max power values yet, but the highest GPU temp I generally see is 75°C.
Timely_Intern_4994@reddit
There sure will be; use the LACT app to control voltages easily.
Badhunter31415@reddit
what do you use to create those graphs ?
OkFly3388@reddit (OP)
matplotlib
Narrow-Belt-5030@reddit
I currently cap the power to my 5090 out of fear of it melting, but that graph suggests I should dig into it and cut the power even more. Thanks for this.
PooMonger20@reddit
Could you kindly elaborate on how you cap its power? What values are working best for you?
phazei@reddit
Best to under-volt with MSI Afterburner.
I set my 5090 to 860mV @ 2500MHz; it uses about 360W and only loses about 12% compared to uncapped.
You click the dot on the Y-axis of the mV you want to cap at, then drag it up to the max MHz you want, and then hit L.
You can try different MHz values at different mV; some will work, some will be ignored if they're too low. The lower the mV and the higher the clock, the better you're doing.
PooMonger20@reddit
The detailed explanation is highly appreciated. Thank you, kind sir.
Narrow-Belt-5030@reddit
Yes - I used the nvidia-smi command and was at 550W, but thanks to this post I've now set it to 400W. After running a benchmark tool I made (for LLMs - thanks Claude) I still get a very good TTFT & t/s on the model I use (Gemma4 26B MoE). From memory it was something like 160 t/s dropping to 140 t/s .. so acceptable.
old: sudo nvidia-smi -pl 550
new: sudo nvidia-smi -pl 400
jai5@reddit
Tell us more about your LLM benchmark tool.
Narrow-Belt-5030@reddit
It's just a Python script I asked Claude to write for me.
I asked him to create a script that sends prompts of varying sizes (1K, 4K, 8K) in batches of 1, 2, or 4. I use vLLM, which is extremely good at TTFT & throughput.
This is at 400W; I then asked Claude to up the prompt sizes as I raised the context limit to 128K, ready for Hermes Agent:
| Context | Concurrency | TTFT | Per-req tok/s | Aggregate tok/s |
|-----------|-------------|---------|---------------|-----------------|
| ~1K tok | 1 | 66 ms | 156 | 156 |
| ~1K tok | 2 | 75 ms | 128 | 256 |
| ~1K tok | 4 | 110 ms | 113 | 428 |
| ~4K tok | 1 | 221 ms | 149 | 149 |
| ~4K tok | 2 | 235 ms | 124 | 246 |
| ~4K tok | 4 | 377 ms | 103 | 352 |
| ~12K tok | 1 | 897 ms | 137 | 137 |
| ~12K tok | 2 | 1747 ms | 111 | 212 |
| ~12K tok | 4 | 2837 ms | 73 | 165 |
| ~32K tok | 1 | 3.2 s | 128 | 128 |
| ~32K tok | 2 | 5.0 s | 69 | 61 |
| ~32K tok | 4 | 8.5 s | 39 | 47 |
| ~64K tok | 1 | 5.4 s | 123 | 123 |
| ~64K tok | 2 | 8.2 s | 63 | 39 |
| ~64K tok | 4 | 13.7 s | 33 | 30 |
| ~120K tok | 1 | 5.4 s | 123 | 123 |
| ~120K tok | 2 | 8.2 s | 63 | 39 |
| ~120K tok | 4 | 13.7 s | 33 | 30 |
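If anyone wants to roll their own, the core of it is just concurrent streaming requests with a stopwatch. A bare-bones sketch (OpenAI-style endpoint assumed; "my-model" is a placeholder, TTFT is taken as time to the first streamed chunk, and counting stream lines is only a rough token proxy):

```python
import time, requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"   # vLLM's OpenAI-style endpoint

def one_request(prompt: str):
    body = {"model": "my-model", "prompt": prompt,
            "max_tokens": 256, "stream": True}
    t0, ttft, chunks = time.monotonic(), None, 0
    with requests.post(URL, json=body, stream=True) as r:
        for line in r.iter_lines():
            if not line:
                continue
            if ttft is None:
                ttft = time.monotonic() - t0   # time to first chunk
            chunks += 1                        # ~1 line per token (rough)
    return ttft or 0.0, chunks / (time.monotonic() - t0)

with ThreadPoolExecutor(max_workers=4) as ex:  # concurrency = 4
    for ttft, tps in ex.map(one_request, ["Tell me a story."] * 4):
        print(f"TTFT {ttft * 1000:.0f} ms, ~{tps:.0f} tok/s")
```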
PooMonger20@reddit
Big thank you for the answer. I will check it out.
notheresnolight@reddit
I cap the power to my 5090 out of fear of it melting the CPU. 450W brings everything down by 10°C.
SailIntelligent2633@reddit
Melting the CPU?
notheresnolight@reddit
figuratively speaking; the CPU would normally idle around 40°C if the GPU weren't blowing 80°C air its way
Skystunt@reddit
450W being considered lower power is absolutely crazy; sometimes I forget how high of a TDP the 5090 has
notheresnolight@reddit
yeah, it's basically a space heater that can compute matrix calculations
finevelyn@reddit
The lowest you can go with 5090 is 400W if I remember correctly. The driver/firmware won't allow it to be set lower.
phazei@reddit
Maybe I got really lucky with my silicon. I tried setting a max wattage, I don't recall what I set it at, but I got good results with undervolting and stayed with that.
At 860mV @ 2500MHz it uses about 360W with about a 12% loss compared to uncapped.
a_beautiful_rhind@reddit
Just cap the clock. It can spike over the power limit briefly anyways.
Plabbi@reddit
You want undervolting curves for best results, either MSI Afterburner in windows or LACT in Linux.
Enjoy the rabbit hole.
a_beautiful_rhind@reddit
I have been. In inference there are really only two "speeds": full go and idle. All those middle bits never really get used. At least not that I have seen.
silenceimpaired@reddit
What cards do you have and what are your settings?
a_beautiful_rhind@reddit
4x 3090. I think I flatten the point around 760mV and 1695MHz. I set a 275W PL at the moment because nvcurve doesn't have clock locking.
Prior to that it was a 200MHz offset and a 0-1695 clock limit. I am gonna move back over to LACT when I turn off my RAM overclock for the summer. I had to build it against new GNOME libs because they dropped support for Ubuntu 22.04. I don't even run GNOME but MATE.
Narrow-Belt-5030@reddit
Booo! Thank you though :-)
koloved@reddit
There's a BIOS that allows you to limit the 5090 to 300 watts. Visually, in the application it will still be 400, but in reality it will be 300 from the outlet.
I haven't tried it myself, but I want to.
https://www.reddit.com/r/LocalLLaMA/comments/1rbyg5x/if_you_have_a_rtx_5090_that_has_a_single/
Narrow-Belt-5030@reddit
I am not that brave .. but thank you - something to consider.
Crinkez@reddit
Datacenters hate this one trick!
Momsbestboy@reddit
... and in case you use a GPU of AMD: LACT on Linux is a nice program to tune the GPU: https://i.imgur.com/LRuhPom.png
I have set the power limit to 210W now (down from the 230W in the screenshot) and my card also runs stable at a -100mV undervolt. I ran a benchmark with llama-cli using the same prompt before and after, and t/s even increased, because the card hits the thermal throttle less often.
On top of that, the card draws less energy and the fan produces less noise. So it is a win-win-win situation.
farewellrif@reddit
Is this meaningfully different than rocm-smi --setpoweroverdrive [watts]?
computehungry@reddit
Yeah, the power limit is correlated with max clock frequency; the actually relevant graph would be what max frequency you get at each power limit. And the answer would be... different for every model, and also for the VRAM/RAM split, in case it is split, hahaha.
kwinz@reddit
I think tokens per Watt is more meaningful in this case. Not everything is compute bound.
Momsbestboy@reddit
Much better: give the tuning a try first, before thinking too much. The R9700 is VRAM-starved, so gaining a bit more on the GPU by running at higher frequencies is not really relevant.
All I did was run llama-cli with the same model and prompt, and then check the t/s. In my case, t/s increased after I undervolted and limited the power, because thermal throttling seems to have a much bigger impact than gaining some spikes in GPU frequency.
So for me(!) it was a simple decision. Less power drawn, faster, and the GPU runs cooler and makes less noise. Your results might differ, so test it.
computehungry@reddit
I guess, if you don't want to adjust the VF curve, you're completely right. You can undervolt and power limit at the same time though for better performance. At the minimum, it's healthier for the GPU since the voltage wouldn't oscillate all the time. If overclocked, the performance gain can be 30-50% at the same wattage, which is why I'm claiming this tuned performance is the more interesting number. But I understand not everyone wants to mess with the factory settings.
Timely-Perception-26@reddit
You're using LACT, but not a voltage/frequency curve?
https://imgur.com/a/78eeZaJ
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1nkycpq/gpu_power_limiting_measurements_update/
YetAnotherAnonymoose@reddit
Really useful since electricity is fairly expensive in Germany, thanks.
mr_zerolith@reddit
One problem is that prompt processing speed goes down quite a bit when you power limit.
To gain the speed back when power limiting, overclock the memory.
Blackwells can take a 2-3GHz memory boost because the memory chips are run below spec for thermal reasons; by power limiting, you are freeing up the thermals to overclock the RAM :)
OkFly3388@reddit (OP)
Check the prefill graphs here:
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct8k6/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct7k9/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
Enough-Astronaut9278@reddit
makes sense — decode is memory bandwidth bound anyway, the CUDA cores are mostly sitting there waiting. prefill is where you'd actually feel the power cut. good data tho, gonna try this on my setup
FatheredPuma81@reddit
First thing I did with my 4090 was undervolt, downclock the core, and overclock the RAM. Power draw dropped by 100W and performance went up because of the RAM overclock.
RobotRobotWhatDoUSee@reddit
Do you have your test code posted anywhere?
ageek@reddit
I'm always undervolting my GPU, so it should be at top efficiency.
zhambe@reddit
I have 2x RTX 3090 in my rig, and I run them at 250W -- unthrottled, they trip the overload alarm on the UPS when they get properly going.
I lose maybe 10% off peak performance, and that's fine. Otherwise the machine runs so hot it cooks the whole room anyway.
Lissanro@reddit
Seeing the post makes me curious if I should look into optimizing the energy consumption of my GPUs. I am still "cooking" with four 3090s at the default 350W limit; with models loaded fully in VRAM my PC consumes about 2kW according to my UPS (like when running Qwen 3.6 27B with vLLM on all four GPUs), or about 1.2kW during RAM+VRAM inference (like with Kimi K2.6 Q4_X, which mostly ends up in RAM).
I have a big fan in the window near the PC that can remove the heat fast enough, so heat is not a concern in my case, but if I can reduce electricity cost per token, that's interesting, so I plan to experiment. Your 250W may be a good starting point for me to try, especially for overnight runs when I am not in a hurry, so thanks for your comment!
Smeetilus@reddit
My four 3090s are set to 200 watts on my 1600-watt EVGA-powered nonsense generator. I don't feel like it's slow. I think I could set them to 220, but I'd start having random reboots much higher than that, so I cut back to be safe. They still will briefly surge past whatever limit you set, from what I've read.
confused-photon@reddit
Dear god it’s mining undervolting all over again
Slaghton@reddit
When I train models on my 3090s, or really do anything on them, I power limit them all to 275W.
y4m4@reddit
Thanks for posting this! It inspired me to test my cards with Qwen3 VL 8B:
PNY 5090 OC - the most efficient point was 410W, which gave 212.35 tok/s. Topped out at 219 tok/s at 550W and decreased slowly as it climbed to 600W.
Nvidia 3090 FE - max efficiency at 240W → 106 tok/s, and increasing the power slowly increased the speed at every step, up to 122 tok/s at 350W. I settled on 270W to get an extra 11 tok/s.
qubridInc@reddit
Yeah the efficiency curve on these cards gets weirdly good once you stop chasing absolute max throughput.
A lot of people assume dropping from 550W → 400W would tank performance, but for many inference workloads the hit is surprisingly small compared to the heat/noise/power savings you get back.
Especially true for homelabs where the goal is usually “good sustained throughput” and not leaderboard benchmark numbers.
DataPhreak@reddit
2 things.
First, this is going to be different on every single card, and likely across model architectures.
Second, just because you're using all of your compute doesn't mean you are using all of the electricity. You have to put a watt meter on your machine. You're going to find that even when you set the power limit to 450, the GPU is not going to go over 300.
Iory1998@reddit
A few weeks ago I posted a guide here on undervolting your GPUs. The performance drop is often negligible, and you might even see performance gains with some GPUs.
Arcuru@reddit
Yea... if your metric is going to be tokens/watt then you should definitely power limit. Usually the optimal work/power ratio is far below the top-end speed; that's why your devices frequency-limit in the first place.
In this case it probably uses the extra power for number crunching (increasing the frequency of the compute cores), whereas token generation is largely bounded by memory bandwidth, which probably gets maxed at a lower power limit.
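You can sanity-check the bandwidth-bound part with a napkin estimate (both numbers below are ballpark assumptions):

```python
# Each generated token streams the active weights from VRAM roughly once.
mem_bandwidth_gb_s = 1008   # RTX 4090 spec-sheet bandwidth
active_weights_gb  = 16     # assumption: ~27B params at ~4-bit

print(f"~{mem_bandwidth_gb_s / active_weights_gb:.0f} tok/s decode ceiling")
# ~63 tok/s -- real numbers land well below, but compute isn't the wall
```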
hidden2u@reddit
I just undervolt, it's insanely efficient
silenceimpaired@reddit
GPU and settings, if you please?
NineThreeTilNow@reddit
My 4090 has always been slightly underpowered/underclocked.
Taking like ~10% off the top gives better 1% lows in video games.
It's pretty well known for the 4090 in gaming circles. I've left the setting on after seeing enough melted connectors.
Even when I use the card for model training, it's the same.
PooMonger20@reddit
Thanks for sharing; could you elaborate on what software you use to underpower/underclock it?
Organic_Scarcity_495@reddit
yeah the prefill speed drop from power limiting is the part people don't talk about. decode tokens per second doesn't tell the full story when your time-to-first-token doubles. for interactive use (chat, coding assistants) that latency hit matters a lot. for batch inference where you're just running through a queue of prompts, the power savings are worth it. the right answer depends on your workload mix
OkFly3388@reddit (OP)
There are prefill graphs in my other comments already, check them out:
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct7k9/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/comment/olct8k6/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
cyberdork@reddit
Yeah, on my 3090 I noticed that reducing power by 25% only resulted in an 11% performance hit. So I always run at 275W max.
gabrielrfg@reddit
You should maybe look into setting up an undervolting curve; it usually gets you about 20-30% less power draw for the same performance in gaming. In my experience with LLM inference on my old 3080, I could reduce the clock speed a lot and overclock the memory frequency and see performance *increases* while lowering power draw dramatically.
Look_0ver_There@reddit
Graphs that don't start at zero are the work of the devil!
snorkelvretervreter@reddit
That is so extremely unnecessary in that second graph. That is just downright misleading. Bad OP!
Beamsters@reddit
You expect a 4090 to run at 50W?
Look_0ver_There@reddit
Y-axis mate
Borkato@reddit
They mean the y axis
gigaflops_@reddit
The difference in your electricity bill between running your local AI at 450 watts vs 300 watts is *negligible*. A 150-watt difference over, let's say, 45 seconds to respond to a typical prompt equals 0.00185 kilowatt-hours. At the average US electricity rate of 17 cents/kWh, that's **$0.000319 saved per prompt**. In other words, you'd have to send **3137 prompts** to your local model to save a *single dollar*.
ttkciar@reddit
//me looks at eval which has been running constantly for >30 hours and still not done.
We are not the same.
D2OQZG8l5BI1S06@reddit
Did you measure the actual consumption? I never hit power limit.
OkFly3388@reddit (OP)
>From my observation, the GPU is constantly hitting the power limit, so it's safe to say that's the actual consumption
D2OQZG8l5BI1S06@reddit
So you just multiplied by the power limit to draw the graph? LOL
Timely-Perception-26@reddit
An honest tip from one Linux LLM guy to another:
A simple, strict wattage limit causes the graphics card to continue using aggressive boost logic, which results in inefficient voltage/frequency fluctuations and load spikes, and shortens the lifespan of your graphics card.
A voltage/frequency curve, on the other hand, leads to more stable performance, lower voltage at the same clock speed, less heat, often fewer spikes, and better efficiency.
Don’t be stuck in the past—download LACT and set up your own curve. Best regards from someone who loves his graphics card.
This is my curve:
https://imgur.com/a/78eeZaJ
gwillen@reddit
Thanks, this is helpful. I have played with the power limit on my GPUs, mostly out of concern for total power draw (my PSU is only rated for 1000W sustained, and my GPUs together plus other draw can exceed that.) But I didn't have a good sense of what the curve was like. I should probably do tests of my own.
MutantEggroll@reddit
These are great charts! Thanks for sharing.
I've done similar with my 5090, and I found that I actually ended up with thermal headroom for a mild overclock. I'd be interested to hear whether your 4090 has similar headroom, and if you're able to recover or possibly even improve upon baseline performance.
OkFly3388@reddit (OP)
I actually don't really consider overclocking an option; the extra gains aren't worth the reduced lifespan, and especially not with the new connector, which is already hot as f*ck when I run at 450W. I even put a heatsink on it, lol.
MutantEggroll@reddit
My understanding was that overclocking alone doesn't harm the hardware - it's the overvolting that people often do along with the overclocking that does the damage. And in my case, I actually was able to undervolt my 5090 and still get a mild, stable overclock. And my temps rarely exceed 70C, and I've never seen it reach 80C, let alone the thermal limit of 90C.
Worth looking into, IMO, though I do very much understand the desire to keep your expensive baby safe, lol.
OkFly3388@reddit (OP)
Fair, I also plan to do some experiments with undervolting later.
alberto_467@reddit
Did you allow cool down time between runs and did you monitor the temps to make sure they're similar?
OkFly3388@reddit (OP)
No, a cooldown gives an artificial advantage to higher power limits, because they can temporarily boost performance for the first minute after a cold start. So it's the opposite: keep the GPU hot so there are no temporary boosts, only real long-term performance.
Temps also weren't similar; each raise in power also raised temps.
alberto_467@reddit
No, the only fair way is to start every scenario at the same temperature, otherwise you're comparing apples to oranges. If you don't want to consider the temporary boosts because you want to measure performance in a consistent workload you have to run a longer experiment.
BTW, local inference workloads are often not that consistent, and so will take advantage of some time to cool down and benefit from the performance boost.
That's perfectly fine, but the starting point has to be the same, if you're starting the higher power settings with a hotter temperature, you're penalizing them unfairly.
When you're trying to prove your thesis, you should never do it with an experiment that even slightly favors it.
OkFly3388@reddit (OP)
>consistent workload you have to run a longer experiment
That's exactly what I did.
>starting the higher power settings with a hotter temperature, you're penalizing them unfairly
They have consistent temps across the test, so it really shows how the GPU performs under load without the boost advantage, so it's perfectly fine.
alberto_467@reddit
Oh ok, sorry, I interpreted that wrong then.
If they're all starting at the same temperature (even the lowest power settings) then it's fine. Otherwise I'm sure that running the tests from lowest to highest power settings vs in reverse order would produce different results.
OkFly3388@reddit (OP)
No, I mean that for every single power value I run an initial warmup and then the actual test, so within a single test the temp is consistent. So every test has a steady temp, but the temp differs between tests.
alppawack@reddit
How do you make the nvidia-smi power limit persistent? It resets to default every time I restart. Do I have to write a boot script for it?
nmkd@reddit
MSI Afterburner
imgroot9@reddit
I just added it to the same script file I start the model with (I'm using llama-server.exe)
AvidCyclist250@reddit
I found 200W to be the goldilocks zone for my 4080
crantob@reddit
My testing showed similar peak efficiency for the 3090, somewhere around 250W.
FIdelity88@reddit
I can confirm!
iamrealadvait@reddit
This is actually super interesting — I didn’t expect the efficiency curve to drop off that hard after ~250–300W. Feels like a lot of people (including me tbh) just assume “more power = better throughput,” but this shows there’s a pretty clear sweet spot where you’re getting most of the performance without burning extra watts for marginal gains. Curious if this holds across different models or if it’s more GPU/architecture dependent? Also wondering how much this shifts with longer context windows or different batch sizes.
Would be cool to see the same plot with tokens per watt directly — might make the tradeoff even clearer.
crantob@reddit
Also help shut down "green energy" and allow markets to work again.
We can use nuclear waste for fuel, just need to get the idiots out of the way.
stddealer@reddit
Ok, but how much power is it actually consuming? The way I interpret it, these graphs could just indicate that above 275W, something other than the power supply is limiting the performance, so it might not actually consume any more.
OkFly3388@reddit (OP)
I am 100% sure it consumed all available power.
Fan speed, warm air from the PC, power consumption data, and temperatures all rose every time my script applied a new limit. So it's safe to say that's the actual consumption.
stddealer@reddit
That's pretty inefficient then, but not too surprising, I can believe it.
FIdelity88@reddit
Damn, this is great! I hated the energy consumption of the RTX 30XX series.
GPU's I have:
RTX 3090 24GB @ PCIe 5.0 x16
RTX 3080 20GB (vRAM modded) @ PCIe 4.0 x8
Model I run with layer split:
Qwen3.6-27B-Q6_K-mtp.gguf
The improvements are amazing:
A reduction of ~145W while tokens/s only dropped about ~7%.
My llama.cpp settings:
HongPong@reddit
great digging, who knew
GroundbreakingTea195@reddit
Would be great if there were a script to test this out!
vastaaja@reddit
Check out https://github.com/noonghunna/club-3090 for both results and an easy script to run your own numbers.
HavenTerminal_com@reddit
my gpu has been at 100% for 3 days asking it what to name a variable
ComplexType568@reddit
I appreciate how beautifully made these charts are.
Perfect-Flounder7856@reddit
Currently have my 6000 Blackwell set to 450W
FencingNerd@reddit
Inference is largely limited by memory bandwidth, not compute power.
Genebra_Checklist@reddit
I have uv panels, don't waste a dime
kwinz@reddit
pv? :D
Genebra_Checklist@reddit
exactly! thanks lol
Badger-Purple@reddit
Looks like your sweet spot is 275
New-Implement-5979@reddit
or 300W to get a bit faster prefill
Ell2509@reddit
Looks right to me.
My best machine for inference has 2x R9700 AI GPUs, and both can go up to 330W. I run both at 270W each and see almost no drop in speed. My GPUs will last longer, and I don't have to worry about burning out my 12v6 connectors.
ProfessionalJackals@reddit
Do you happen to have some numbers like how many hours you run them programming, and the difference in your power bill? I feel this is often overlooked in these topics...
PANIC_EXCEPTION@reddit
I wonder if most GPUs have an obvious knee point power limit like that?
Ok-Measurement-1575@reddit
Does q4_0 make you CPU-limited?
MatlowAI@reddit
You should try another one with batched token generation.
a_beautiful_rhind@reddit
We can lock clocks and undervolt now. You have to watch what happens to idle power as well; I doubled my idle by messing with these things. A 40W idle turning into 100W will eat more than you save at peak usage. Although this is for multiple cards. Have to remember to reset them when I'm finished.
JLeonsarmiento@reddit
Interesting... I also run my Mac in "Low Power" mode when doing long Q&A sessions with a local LLM... I don't need fast bursts of tokens if I'm checking the answer in, like, the next five or 10 minutes, since I'm doing other stuff on the side...
wanielderth@reddit
Will this work on 3000-series cards?
OkFly3388@reddit (OP)
I think it can be applied to any GPU, just the sweet spot will be a bit different.
unclesabre@reddit
Has anyone got a similar graph or info relating to Apple silicon? I have an M4 Max and I'm thinking perhaps similar logic applies and I could do something like limit cores, etc. Idk, but interesting observations, ty.
marlinspike@reddit
Johnny Depp, Pirates of the Caribbean
davew111@reddit
Cards like the RTX 6000 Ada (basically a 4090 with 48GB of RAM) have a power limit of 300W. The RTX 6000 Pro Max-Q too. Server cards like the L40 are also around 300W; some a little more, some a little less. Beyond the 300W mark you are often financially better off saving the extra electricity to pay for additional cards.
In your benchmarking it's interesting that performance actually drops beyond 400W, most likely due to thermal throttling. I've seen that in gaming and have always flattened my voltage curve to a slight downclock at higher voltages, figuring it will only thermal-throttle after a few minutes anyway. I'd prefer the card run at 2200MHz for hours than manage 2500MHz for a few minutes and then throttle itself.
Technical-Earth-3254@reddit
I've also noticed basically no impact on performance on my 3090 running at an 80% PL (300W). And it doesn't get as loud, which is a plus for me, because my stuff runs on my PC.
artisticMink@reddit
That chart is very RTX 4090-specific. In general the cost/efficiency sweet spot for almost all operations on the 4090 is ~75% to 80%, depending on who you ask.
Mileage on other GPUs may vary wildly.
SnooPaintings8639@reddit
If anyone wonders about the sweet spot for the popular 2x 3090 build: running Qwen 3.6 27B, it is a bit over 200W each (I keep it at exactly 200). At least on my build; and keep in mind that model type, especially MoE vs dense, does affect the shape of the curve.
crapaud_dindon@reddit
What clock speed do you get at 200W?
Glum-Atmosphere9248@reddit
Maybe a different graph for concurrent generations with vLLM?
StupidScaredSquirrel@reddit
I always power limit my GPU for exactly this reason. I can't stand the noise anyway.