Is it normal for Gemma 4 26B/31B to run this fast on an Intel laptop? (288V / CachyOS)
Posted by No-Key8555@reddit | LocalLLaMA | 25 comments
Hey everyone, I just got into local LLMs about a week ago. I tried Ollama and LM Studio on my Core Ultra 9 288V, but they kept failing or giving me "hard stops" on the MoE models, so I figured I'd just try building the environment myself.
I couldn’t get OpenVINO to play nice with the NPU for these larger models yet, so I just compiled a custom Vulkan bridge for the GPU instead. It seems to be working?
Performance Stats:
- Model: Gemma-4-26B-it-i1 (GGUF)
- Speed: 7-12 t/s (16k context)
- Hardware Use: 95-100% GPU, 10-40% CPU, 20-24GB RAM.
I also tried the 31B-it-i1-Q4_K_M.gguf version. It's a bit heavier but still totally usable:
- Speed: Decent/Fluid (4-8k context)
- Hardware Use: 100% GPU, ~30-60% CPU (the Xe2 and the logic cores seem to be sharing the load well).
- RAM: Pushing 26GB of the 29GB available, but 0GB of swap used so far.
Is this a normal result for integrated graphics? At first I only had it working on the CPU, which was faster but unsustainable; once the Vulkan bridge was built, the load became balanced. I'm using CachyOS if that makes a difference.
Just wanted to see if I’m missing something or if Intel Lunar Lake is actually this cracked for local MoE.
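For anyone who wants to poke at a similar setup without building a custom bridge, here's a minimal sketch using plain llama-cpp-python with a Vulkan-enabled build. The model filename and prompt are placeholders, and this is the stock path, not my bridge code:

```python
# Minimal sketch, not my custom bridge: stock llama-cpp-python built with
# Vulkan (e.g. CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-26b-it-i1.Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=-1,   # offload all layers to the Xe2 iGPU via Vulkan
    n_ctx=16384,       # the 16k context from the stats above
)

out = llm("Summarize memory-on-package in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```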
RIP26770@reddit
I run it at almost triple that speed on an Intel Core Ultra 7! Lol, at full native context, with the same quantization as you.
One_Panda_9925@reddit
what model did you use?
90hex@reddit
The GPU on newer Intel CPUs is quite decent, so not so surprising. If you ran it on pure CPU it'd be very different.
Low-Addition-5218@reddit
Well, if the CPU supports AVX-512 the speed is also very decent.
My 4-core CPU with AVX-512 (i7-1195g7) beat my 16-core CPU (ultra 7 255h) by several times.
Low-Addition-5218@reddit
Yeah, on my ultra 7 255h I got similar results using Intel's native libraries (ipex and sycl).
With Vulkan, my inference speed was roughly 2x lower.
charles25565@reddit
Interesting.
I have Alder Lake and using Vulkan for me just results in the same performance as CPU.
No-Key8555@reddit (OP)
That’s what I found too. On the 288V, the Vulkan performance is very close to the CPU—a bit slower in raw tokens, but the real win is the stability and zero-swap ceiling.
On Lunar Lake, the 32GB of LPDDR5X sits directly on the package and is shared by the CPU and GPU, which cuts out most of the bus latency you're likely hitting on Alder Lake. Running through the Vulkan bridge keeps that bandwidth fed steadily, so instead of redlining, the system just hums along at 12 t/s even with a 26B model loaded.
BigYoSpeck@reddit
Token generation is ultimately memory bandwidth limited so it's expected for CPU and iGPU performance to be in the same ballpark as one another
The real advantages are lower power consumption which means less heat, which means less likely to throttle and slow down when into heavy usages, and prompt processing should be much much faster which in simple tests like you have posted there doesn't matter much, but on much longer prompts it makes a big difference
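To put rough numbers on that ceiling (all assumptions: Lunar Lake's on-package LPDDR5X-8533 on a 128-bit bus, and a guess at how many bytes of weights a MoE actually touches per token):

```python
# Back-of-the-envelope token-rate ceiling from memory bandwidth.
# Both inputs are assumptions, not measured values.
bandwidth_gbs = 8533e6 * 16 / 1e9   # ~136.5 GB/s (8533 MT/s x 16 bytes/transfer)
active_bytes_per_token = 8e9        # hypothetical: ~8 GB of active expert weights at Q4

ceiling_tps = bandwidth_gbs * 1e9 / active_bytes_per_token
print(f"upper bound: ~{ceiling_tps:.0f} t/s")   # ~17 t/s, so OP's 7-12 t/s is plausible
```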
No-Key8555@reddit (OP)
True, bandwidth is the bottleneck, but theory doesn't always match practice. Most integrated setups fall apart trying to run a 31B model at any kind of fluid speed. Getting a consistent 7-12 t/s through a custom Vulkan bridge suggests the MoP (Memory on Package) is doing a lot more heavy lifting for latency than it does in standard LPDDR5 setups. It's the stability under load that surprised me.
charles25565@reddit
I do have dual-channel DDR4 RAM, which definitely makes a difference. The bottleneck for me is mainly how much data can flow to the processor, which is slow on DDR4. A unified 32 GB of on-package RAM would just push that bottleneck much, much higher.
MEGAnALEKS@reddit
Fast (7-12 t/s)
Bigkillerstorm1@reddit
Damn, I hate that I'm so slow at 3200-3400 t/s, but that's with batching hehe, and a very low context per batch (1024), so it's more of a benchmark.
MEGAnALEKS@reddit
Bro can generate 6-7 ai slop projects in a minute 💀
CatiStyle@reddit
When the model is loaded into VRAM it might run at about 50 tok/s; in RAM only, 5 tok/s. When you've got 16G of VRAM you need to reduce the context size to keep the model in VRAM; a full 256K context is too much for 16G of VRAM, so it starts using RAM and slows down.
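A quick way to sanity-check whether a given context will fit is to estimate the KV cache on top of the weights. The model config values below are made up for illustration, not Gemma's real ones:

```python
# Rough KV-cache size estimate; config values are illustrative, not real.
n_layers, n_kv_heads, head_dim = 48, 8, 128   # hypothetical model config
bytes_per_elem = 2                            # fp16 K/V entries

def kv_cache_gb(n_ctx: int) -> float:
    # 2x for K and V, per layer, per KV head, per head dim, per token
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9

for ctx in (16_384, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gb(ctx):.1f} GB of KV cache")
# With these made-up numbers: 16k is ~3 GB, but 256k alone is ~50 GB,
# far past a 16G card before the weights even load.
```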
Hytht@reddit
vLLM/transformers and OpenVINO (once the PRs are merged) should be the best way to run Gemma 4 on Intel: https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Gemma-4-Models-optimized-for-Intel-Hardware-Enabling-instant/post/1742983
Echo9Zulu-@reddit
I'm so hyped for the OpenVINO PRs to land. Upstream OpenVINO performance will be fantastic on the B70, especially for the MoE.
Regardless, OpenVINO is in dire need of Qwen3.5 landing as well.
sund00bie@reddit
Same. Very keen. I won't see the same performance as you, but I'm hoping to get something more out of my A770.
Firm-Fix-5946@reddit
what
Mayion@reddit
Speed: Decent/Fluid
VoiceApprehensive893@reddit
speed: Decent/fluid (seconds per token probably)
Ok-Measurement-1575@reddit
Runs like ass on my 2-year-old i7 laptop that hits 100°C under average load.
Frosty_Chest8025@reddit
No, it's not normal at all. You should definitely purchase a faster laptop.
No-Veterinarian8627@reddit
I tried Gemma 4 26B MoE (Q4_K_M, or the MX something something from Unsloth), at 16k context.
LM Studio (Vulkan): ~11 t/s
I'm on an older PC of mine (while visiting my parents); the CPU is an AMD Ryzen 5 8500G with a Radeon 740M. The CPU sits at 60-70% and the GPU at roughly 90%+. I don't have great diagnostics here, but I used Claude to quickly write me a script.
Hope it helps :)
Former-Ad-5757@reddit
A 26B should basically produce 100 t/s on a discrete GPU, and an iGPU has maybe a tenth of that memory bandwidth, so 10 looks about right.
mtmttuan@reddit
Speed seems about right for the memory bandwidth you likely have.