Using the iGPU as the primary graphics card may improve token generation speed for PCIe graphics cards
Posted by janvitos@reddit | LocalLLaMA | 19 comments
A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At the time, I had my monitor plugged into my 4070 and didn't even remember I had an AMD iGPU.
Then, I decided to plug my monitor into my iGPU and see if this would liberate some VRAM on my 4070 and improve token generation speed. I was not wrong. Using the right llama.cpp parameters, the difference was immediately noticeable: Token generation speed went from 50 t/s to 55 t/s, a 10% improvement! I was pleasantly surprised by the result.
So, if you have an iGPU, make sure to use it as your main display adapter. This could free up some VRAM for your PCIe card so it can be exclusively used for LLM inference.
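If you want to sanity-check the effect yourself, one quick way (assuming an NVIDIA dGPU and the stock nvidia-smi tool) is to compare idle VRAM usage before and after moving the monitor cable:
# Idle VRAM on the dGPU; with nothing plugged into it, "used" should drop close to zero
nvidia-smi --query-gpu=memory.used,memory.total --format=csv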
Here are my llama.cpp launch parameters:
exec llama-server \
--model Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
--port 8080 \
--host 0.0.0.0 \
--sleep-idle-seconds 1800 \
--parallel 1 \
--fit on \
--fit-target 256 \
--flash-attn on \
--no-mmap \
--mlock \
--no-context-shift \
--fit-ctx 262144 \
--predict 32768 \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--threads 8 \
--threads-batch 8 \
--no-warmup \
--chat-template-kwargs '{"preserve_thinking": true}'
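For completeness, once the server is up it exposes llama-server's OpenAI-compatible API on the port above, so a quick smoke test can look something like this (the prompt and token limit are just placeholders):
# Minimal chat completion request against the running server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'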
Cheers.
fantasticsid@reddit
I did something similar a while back. Of course, there's the related fun and games of convincing any graphical workloads not to use the "better" card and to copy frames to the iGPU. And the gfx1036 drivers on Linux are pretty shit, such that there are noticeable output issues when using hardware acceleration (not a DRAM problem, according to memtest86).
Still means I can use all 24GB of my 24GB card tho, so that's a win.
FoxiPanda@reddit
Yep, and it will improve stability too, especially on Windows-based OSes. If you open up some forsaken GPU-using application (looking at you, calculator.exe) that pushes you over the VRAM limit, all sorts of fascinating bad things happen while your drivers attempt to recover and not fully crash the system.
veinamond@reddit
There is also a weird quirk in how WDDM driver memory allocation works with CUDA on Windows, where it can allocate GPU memory but back it with virtualized system RAM, so the GPU ends up fetching data over the bus the whole time. It is rare but very frustrating.
janvitos@reddit (OP)
So I'm guessing anyone who has any Chromium-based browser with 10+ tabs will greatly benefit from this :)
FoxiPanda@reddit
Yes, absolutely. I have very sadly crashed my system this way... even with hardware acceleration turned off in the browser. There are shenanigans happening on the backend that I don't fully understand which still use GPU memory.
mr_Owner@reddit
What a coincidence, I have the same setup and did this today as well, and can confirm my Qwen3.6 27B IQ3_XXS jumped from 13 tps / 800 pps to 18 tps / 1000 pps. I also later added this, which speeds things up further: spec-type = ngram-mod, spec-ngram-size-n = 24, draft-min = 5, draft-max = 64
It sometimes hits 23 t/s.
janvitos@reddit (OP)
I tried using those parameters, but I was constantly getting lower t/s. Not sure if it's because the model is partially offloaded to the CPU.
mr_Owner@reddit
With MoE Qwen 3.6 it's apparently (according to Claude Sonnet) not stable. Other dense LLMs and the Gemma series do allegedly benefit from ngram.
Also, with --no-mmap I keep at most about 500 MB / 1 GB of VRAM headroom free so nothing spills over into shared RAM. So I first test how many -ngl layers fit like that, and after that enable --no-mmap to get the pps boost.
I see you set --fit on and --fit-target; I found it worth testing my own llama.cpp config starting from the initial fit params and tweaking those. The ubatch size, together with ctx size, has a big impact on pps and VRAM, a clear trade-off to make yourself (roughly like the sketch below).
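A rough sketch of that kind of headroom test, assuming an NVIDIA dGPU; the model path and the starting values for -ngl, -c and -ub are placeholders to tweak, not recommendations:
# Start conservative, then raise -ngl / -ub / -c until only ~0.5-1 GB of VRAM headroom remains
llama-server --model model.gguf -ngl 30 -c 32768 -ub 512 --no-mmap
# In a second terminal, watch VRAM usage refresh every second while the model loads
nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 1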
maxxell13@reddit
The anti-memes potential here is great. Everyone laughs about making sure you plug your monitor into your GPU. But local AI is about to dramatically change the meaning of finding someone with the monitor plugged into the on-motherboard display port.
bartskol@reddit
Shocker
Mashic@reddit
I was plugging the monitor into my RTX 3060 and using it for inference at the same time. I noticed the card was using 400 to 900 MB of VRAM even when no model was loaded; that's when the idea of using the iGPU for the monitor came to me, and it helped free that VRAM for bigger models/context.
What you should also do, at least on Windows, is open the Windows menu, type Graphics Settings, and set your browser to use the iGPU for hardware acceleration. That way, if you watch a YouTube video on the side, it doesn't use the inference GPU for decoding. You might want to use an extension like h264ify if your iGPU doesn't support decoding AV1 or VP9.
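If you'd rather script that than click through Settings, Windows keeps the per-app GPU preference in the registry; a hedged example below (the Chrome path is an assumption, point it at your browser's actual executable):
:: GpuPreference=1 forces the power-saving GPU (usually the iGPU); 2 would force the high-performance one
reg add "HKCU\Software\Microsoft\DirectX\UserGpuPreferences" /v "C:\Program Files\Google\Chrome\Application\chrome.exe" /t REG_SZ /d "GpuPreference=1;"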
And if you only need text generation, you can choose not to load the mmproj file at all; this can save 1-2 GB of VRAM.
janvitos@reddit (OP)
Great tips! Thanks :)
Look_0ver_There@reddit
I have this exact setup with my AI GPUs. The display is driven by the iGPU, which lets the GPUs run at full speed since they don't have to be interrupted to drive the display.
Just make sure, if you're using Vulkan or ROCm (for those running AMD GPUs with AMD CPUs), to exclude the iGPU from the list of available devices. Software like LM Studio does this automatically, but raw llama.cpp can pick up the iGPU as a device and shard some of the load onto it, which is exactly what you don't want to happen.
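For raw llama.cpp, a sketch of how that exclusion can look, assuming a recent build with device selection; the device names and model path are placeholders, so check the real names first:
# List the devices the build enumerates (the AMD iGPU will show up here too)
llama-server --list-devices
# Pin the run to the discrete card only, using the name reported above...
llama-server --model model.gguf --device Vulkan0 -ngl 99
# ...or hide the iGPU from the backend entirely (ROCm example, keeping only device 0)
ROCR_VISIBLE_DEVICES=0 llama-server --model model.gguf -ngl 99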
janvitos@reddit (OP)
Good advice!
vogelvogelvogelvogel@reddit
didn't know, thx!
ga239577@reddit
That's how I have my R9700 machine set up. The display runs on the 8600G iGPU (760M) and only AI runs on the R9700.
janvitos@reddit (OP)
I know it sounds obvious to some, but to me it was kind of a "revelation," and I don't recall seeing it mentioned much.
mrmontanasagrada@reddit
Yup! Saves about 20% even on my side, bringing Qwen 35B from 100 tokens per sec to 120. :-)
janvitos@reddit (OP)
Nice!