[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted)
Posted by Katostrofik@reddit | LocalLLaMA | View on Reddit | 12 comments
TL;DR: Q8_0 quantization on Intel Xe2 (Battlemage/Arc B-series) GPUs was achieving only 21% of theoretical memory bandwidth. My AI agent and I found the root cause and submitted a fix that brings it to 66% - a 3.1x speedup in token generation.
The problem:
On Intel Arc Pro B70, Q8_0 models ran at 4.88 t/s while Q4_K_M ran at 20.56 t/s - a 4x gap that shouldn't exist, since Q8_0 only has about 1.7x more data. After ruling out VRAM pressure, drivers, and backend issues, we traced it to the SYCL kernel dispatch path.
Root cause:
llama.cpp's SYCL backend has a "reorder" optimization that separates quantization scale factors from weight data for coalesced GPU memory access. This was implemented for Q4_0, Q4_K, and Q6_K - but Q8_0 was never added. Q8_0's 34-byte blocks (not power-of-2) make the non-reordered layout especially bad for GPU cache performance.
Sooo, the fix:
~200 lines of code extending the existing reorder framework to Q8_0. The most critical bug was actually a single line - Q8_0 tensors weren't getting the "extra" struct allocated during buffer init, so the reorder flag was silently never set.
Results on Qwen3.5-27B (Intel Arc Pro B70):
- Q8_0 before: 4.88 t/s (21% bandwidth)
- **Q8_0 after: 15.24 t/s (66% bandwidth) - 3.1x faster**
- Q4_K_M: 20.12 t/s (unchanged)
- Q6_K: 13.83 t/s (no reorder)
Q8_0 is now faster than Q6_K (15.24 vs 13.83 t/s) in my testing, while providing higher quality.
Validation: Before writing the fix, we binary-patched Intel's closed-source IPEX-LLM to run on my GPU (it doesn't support B70's PCI device ID). Their optimized Q8_0 kernels achieved 61% bandwidth, confirming the problem was solvable. My open-source implementation achieves 66%.
PR: https://github.com/ggml-org/llama.cpp/pull/21527
Issue: https://github.com/ggml-org/llama.cpp/issues/21517
Hardware: Intel Arc Pro B70, 32 GB GDDR6, 608 GB/s bandwidth
yon_impostor@reddit
That's incredible, please keep it up. I checked this on my B580 and it worked perfectly. Huge improvement. Took Llama 8B from 2043pp/10.7tg to 2256pp/34.8tg. Building your PR makes llama-bench warn about some enabled asserts, though. I wonder if IPEX-LLM has any other tricks like this left? It always seemed faster, even if it was sometimes broken.
If you want some other stuff to look at, BF16 support is missing (it dequants to fp32 and then goes) even though all the Arc cards should be able to do it, XMX has very low utilization, and SYCL flash attention probably still needs some work.
If you need any help at all testing your patches I have an A310, A380 and a B580. I also have an A770 but I don't currently have access to it.
sampdoria_supporter@reddit
Did you generate numbers for the A380?
Katostrofik@reddit (OP)
Thanks! And thanks for testing on your cards. I'm glad to see it helped more than just the B70s. I'll take a look at the BF16 issue, looks like it could be a similar situation to the Q8_0 one.
And I'll be happy to do some testing with the dual B70s. I'm still finishing up some initial benchmarking but looking forward to putting them to use. :)
yon_impostor@reddit
Yeah, allegedly there's support for BF16 in the commits, but after profiling it on both Alchemist and Battlemage, and checking it out with Claude, it doesn't seem to be real. Prompt processing is noticeably more fp32-like than fp16-like for BF16.
yon_impostor@reddit
Had to go down from 1.5B to 1B on the A310 due to it being a 4G card. Otherwise, great uplifts on both cards again.
B580, mainline, Qwen2.5-1.5B BF16: 3346.56 / 17.99 (11.3% BW)
B580, PR, Qwen: 3397.11 / 91.84 (58%! BW)
Bit surprised there wasn't more of a PP uplift. Maybe memory bandwidth limited?
A310, mainline, Llama3.2 1B BF16: 759.62 / 4.74 (8.8% BW)
A310, PR, Llama: 787.88 / 19.46 (36% BW)
Thanks again for your work!
Curiously FP16 still seems faster especially on prompt processing.
Mainline / PR both get ~11k PP for FP16 on my B580, basically same TG to your BF16 impl though.
Similar story on the A310: Llama 3.2 1B FP16 gets 2.1k PP instead of ~780 in BF16, and basically the same TG as your BF16.
Katostrofik@reddit (OP)
Great data, thanks for testing on Alchemist too! The PP numbers being similar is expected - this PR only changes the DMMV path (token generation).
PP uses the GEMM path which was already working for BF16, just slower. FP16 being faster on PP makes sense since the GEMM kernels are optimized for FP16 on these GPUs. The big win here was TG, and those numbers look solid across both cards. :-D
Academic_Pick6892@reddit
I'm currently building a decentralized GPU infrastructure project (Runfra) that focuses on optimizing LLM/Image-gen inference across heterogeneous hardware (including a 3 node cluster with 4080s/4060s). We've been eyeing Intel Arc's high VRAM-to-price ratio for our worker nodes.
A quick question: Since this reorder optimization significantly improves local VRAM bandwidth, have you observed any impact on the overhead when running these Q8_0 models in a multi-GPU or distributed setup? I'm curious if the memory remapping layer adds any noticeable latency to the P2P synchronization step.
RIP26770@reddit
That's amazing, thanks!
rahulsingh_ca@reddit
you should get your agent to sign up for clankerslist.ai !
Katostrofik@reddit (OP)
lol, we're good. A couple introverts focused on doing work over here. But thanks.