Using the iGPU as the primary graphics card may improve token generation speed for PCIe graphics cards
Posted by janvitos@reddit | LocalLLaMA | 19 comments
A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At the time, I had my monitor plugged into my 4070 and didn't even remember I had an AMD iGPU.
Then, I decided to plug my monitor into my iGPU and see if this would liberate some VRAM on my 4070 and improve token generation speed. I was not wrong. Using the right llama.cpp parameters, the difference was immediately noticeable: Token generation speed went from 50 t/s to 55 t/s, a 10% improvement! I was pleasantly surprised by the result.
So, if you have an iGPU, make sure to use it as your main display adapter. This could free up some VRAM for your PCIe card so it can be exclusively used for LLM inference.
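If you want to sanity-check the effect yourself, one quick way (assuming an NVIDIA dGPU and the stock nvidia-smi tool) is to compare idle VRAM usage before and after moving the monitor cable:
# Idle VRAM on the dGPU; with nothing plugged into it, "used" should drop close to zero
nvidia-smi --query-gpu=memory.used,memory.total --format=csv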
Here are my llama.cpp launch parameters:
exec llama-server \
--model Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
--port 8080 \
--host 0.0.0.0 \
--sleep-idle-seconds 1800 \
--parallel 1 \
--fit on \
--fit-target 256 \
--flash-attn on \
--no-mmap \
--mlock \
--no-context-shift \
--fit-ctx 262144 \
--predict 32768 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--threads 8 \
--threads-batch 8 \
--no-warmup \
--chat-template-kwargs '{"preserve_thinking": true}'
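For completeness, once the server is up it exposes llama-server's OpenAI-compatible API on the port above, so a quick smoke test can look something like this (the prompt and token limit are just placeholders):
# Minimal chat completion request against the running server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'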
Cheers.
fantasticsid@reddit
I did something similar a while back. Of course, there's the related fun and games of convincing any graphical workloads not to use the "better" card and to copy frames to the iGPU. And the gfx1036 drivers on Linux are pretty shit, such that there are noticeable output issues when using hardware acceleration (not a DRAM problem, according to memtest86).
Still means I can use all 24GB of my 24GB card tho, so that's a win.
FoxiPanda@reddit
Yep, and it will improve stability too, especially on Windows-based OSes. If you open up some forsaken GPU-using application (looking at you, calculator.exe) that pushes you over the VRAM limit, all sorts of fascinating bad things happen while your drivers attempt to recover and not fully crash the system.
veinamond@reddit
There is also a weird quirk in how WDDM driver memory allocation works with CUDA on Windows, where it can allocate GPU memory but back it with virtualized system RAM, so the GPU ends up fetching data over the bus the whole time. It is rare but very frustrating.
janvitos@reddit (OP)
So I'm guessing anyone who has any Chromium-based browser with 10+ tabs will greatly benefit from this :)
FoxiPanda@reddit
Yes, absolutely. I have very sadly crashed my system this way... even with hardware acceleration turned off in the browser. There are shenanigans happening on the backend that I don't fully understand which still use GPU memory.
mr_Owner@reddit
What a coincidence, I have the same setup and did this today as well, and can confirm my Qwen3.6 27B IQ3_XXS jumped from 13 tps / 800 pps to 18 tps / 1000 pps. I also later added this, which speeds things up further: spec-type = ngram-mod, spec-ngram-size-n = 24, draft-min = 5, draft-max = 64
It sometimes hits 23 t/s.
janvitos@reddit (OP)
I tried using those parameters, but I was constantly getting lower t/s. Not sure if it's because the model is partially offloaded to the CPU.
mr_Owner@reddit
With MoE Qwen 3.6 it's apparently (according to Claude Sonnet) not stable. Other dense LLMs and the Gemma series do allegedly benefit from ngram.
Also, with --no-mmap I keep at most about 500 MB / 1 GB of VRAM headroom free so nothing spills over into shared RAM. So I first test how many -ngl layers fit like that, and after that enable --no-mmap to get the pps boost.
I see you set --fit on and --fit-target; I found it worth testing my own llama.cpp config starting from the initial fit params and tweaking those. The ubatch size, together with ctx size, has a big impact on pps and VRAM, a clear trade-off to make yourself (roughly like the sketch below).
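A rough sketch of that kind of headroom test, assuming an NVIDIA dGPU; the model path and the starting values for -ngl, -c and -ub are placeholders to tweak, not recommendations:
# Start conservative, then raise -ngl / -ub / -c until only ~0.5-1 GB of VRAM headroom remains
llama-server --model model.gguf -ngl 30 -c 32768 -ub 512 --no-mmap
# In a second terminal, watch VRAM usage refresh every second while the model loads
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1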
maxxell13@reddit
The anti-memes potential here is great. Everyone laughs about making sure you plug your monitor into your GPU. But local AI is about to dramatically change the meaning of finding someone with the monitor plugged into the on-motherboard display port.
bartskol@reddit
Shocker
Mashic@reddit
I was plugging the monitor into my RTX 3060 and using it for inference at the same time. I noticed the card was using 400 to 900 MB of VRAM even when no model was loaded; that's when the idea of using the iGPU for the monitor came to me, and it helped free that VRAM for bigger models/context.
What you should also do, at least on Windows, is open the Windows menu, type Graphics Settings, and set your browser to use the iGPU for hardware acceleration. That way, if you watch a YouTube video on the side, it doesn't use the inference GPU for decoding. You might want to use an extension like h264ify if your iGPU doesn't support decoding AV1 or VP9.
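If you'd rather script that than click through Settings, Windows keeps the per-app GPU preference in the registry; a hedged example below (the Chrome path is an assumption, point it at your browser's actual executable):
:: GpuPreference=1 forces the power-saving GPU (usually the iGPU); 2 would force the high-performance one
reg add "HKCU\Software\Microsoft\DirectX\UserGpuPreferences" /v "C:\Program Files\Google\Chrome\Application\chrome.exe" /t REG_SZ /d "GpuPreference=1;"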
And if you only need text generation, you can choose not to load the mmproj file at all; this can save 1-2 GB of VRAM.
janvitos@reddit (OP)
Great tips! Thanks :)
Look_0ver_There@reddit
I have this exact setup with my AI GPUs. The display is driven by the iGPU, which lets the GPUs run at full speed since they don't have to be interrupted to drive the display.
Just make sure, if you're using Vulkan or ROCm (for those running AMD GPUs with AMD CPUs), to exclude the iGPU from the list of available devices. Software like LM Studio does this automatically, but raw llama.cpp can pick up the iGPU as a device and shard some of the load onto it, which is exactly what you don't want to happen.
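For raw llama.cpp, a sketch of how that exclusion can look, assuming a recent build with device selection; the device names and model path are placeholders, so check the real names first:
# List the devices the build enumerates (the AMD iGPU will show up here too)
llama-server --list-devices
# Pin the run to the discrete card only, using the name reported above...
llama-server --model model.gguf --device Vulkan0 -ngl 99
# ...or hide the iGPU from the backend entirely (ROCm example, keeping only device 0)
ROCR_VISIBLE_DEVICES=0 llama-server --model model.gguf -ngl 99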
janvitos@reddit (OP)
Good advice!
vogelvogelvogelvogel@reddit
didn't know, thx!
ga239577@reddit
That's how I have my R9700 machine set up. The display runs on the 8600G iGPU (760M) and only AI runs on the R9700.
janvitos@reddit (OP)
I know it sounds obvious to some, but to me it was kind of a "revelation," and I don't recall seeing it mentioned much.
mrmontanasagrada@reddit
Yup! Saves about 20% even on my side, bringing Qwen 35B from 100 tokens per sec to 120. :-)
janvitos@reddit (OP)
Nice!