Using the iGPU as the primary graphics card may improve token generation speed for PCIe graphics cards

Posted by janvitos@reddit | LocalLLaMA | View on Reddit | 19 comments

A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At the time, I had my monitor plugged into my 4070 and didn't even remember I had an AMD iGPU.

Then, I decided to plug my monitor into my iGPU to see if this would free some VRAM on my 4070 and improve token generation speed. I was not wrong. With the right llama.cpp parameters, the difference was immediately noticeable: token generation went from 50 t/s to 55 t/s, a 10% improvement! I was pleasantly surprised by the result.

So, if you have an iGPU, make sure to use it as your main display adapter. This could free up some VRAM for your PCIe card so it can be exclusively used for LLM inference.
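If you want to verify the VRAM actually moved off your discrete card, you can compare `nvidia-smi` readings before and after switching the monitor over. Here's a small sketch; the helper names are mine, and it assumes `nvidia-smi` is on your PATH with the standard `--query-gpu` CSV output:

```python
import subprocess

def parse_memory_used(raw: str) -> int:
    # With --format=csv,noheader,nounits, nvidia-smi prints one
    # integer (MiB of used VRAM) per line, one line per GPU.
    return int(raw.strip().splitlines()[0])

def gpu_memory_used_mib(gpu_index: int = 0) -> int:
    # Query used VRAM on the given GPU (hypothetical convenience wrapper).
    raw = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", f"--id={gpu_index}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_used(raw)
```

Run it once with the monitor on the 4070 and once with it on the iGPU; the difference is the VRAM your desktop compositor was eating.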

Here are my llama.cpp launch parameters:

exec llama-server \
    --model Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
    --port 8080 \
    --host 0.0.0.0 \
    --sleep-idle-seconds 1800 \
    --parallel 1 \
    --fit on \
    --fit-target 256 \
    --flash-attn on \
    --no-mmap \
    --mlock \
    --no-context-shift \
    --fit-ctx 262144 \
    --predict 32768 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0 \
    --threads 8 \
    --threads-batch 8 \
    --no-warmup \
    --chat-template-kwargs '{"preserve_thinking": true}'
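Once llama-server is up, it exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the port given above. A minimal sketch of building a request against it, using only the standard library (the function name and host are my own placeholders):

```python
import json
import urllib.request

def build_chat_request(prompt: str, host: str = "http://localhost:8080"):
    # Payload for llama-server's OpenAI-compatible chat endpoint.
    # Sampling settings mirror the launch flags above (--temp, --top-p).
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "stream": False,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send it (needs the server running):
# reply = json.load(urllib.request.urlopen(build_chat_request("Hello")))
```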

Cheers.