Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL)

Posted by Katostrofik@reddit | LocalLLaMA

If you're running dual Intel Arc GPUs with llama.cpp and your system RAM maxes out during multi-GPU inference, even though the model fits in VRAM, this post explains why and how to fix it.

I've been running dual Arc Pro B70s (32GB each, 64GB total VRAM) for local LLM inference with llama.cpp's SYCL backend. Every time I tried to split a model across both GPUs, my 64GB of system RAM would climb to 100% and the OOM killer would start taking out desktop processes until the system either crashed or dumped me at the login screen. This happened with every model size. A 15 GiB Q4_K_M model was eating 46 GiB of system RAM. It made no sense.

Turns out it's not a configuration issue, not a VRAM issue, and not about model size. It's a specific API call in llama.cpp's SYCL backend that triggers the wrong memory path in Intel's xe kernel driver.

What's actually happening

Every call to sycl::malloc_device() in the SYCL backend causes the xe kernel driver to create a 1:1 mirror of the GPU allocation in system RAM through DMA-buf/TTM staging. This happens at allocation time, not during inference. Every tensor, every KV cache buffer, every compute scratch buffer that gets allocated on the GPU also consumes an equal amount of your system RAM.

I confirmed this with a targeted test:

| Allocation method | 4 GiB on GPU | System RAM impact |
|---|---|---|
| sycl::malloc_device() | 4 GiB VRAM | +4,112 MiB |
| zeMemAllocDevice() | 4 GiB VRAM | +8 MiB |

Same VRAM allocation, same GPU, same driver. 500x difference in system RAM usage depending on which API you call.

The xe driver has two internal kernel paths for device memory:

  1. DMA-buf/TTM - mirrors VRAM in system RAM. This is what sycl::malloc_device() triggers.
  2. SVM/P2P - direct PCIe BAR access, virtually no system RAM. This is what Level Zero's zeMemAllocDevice() uses.

SYCL kernels can read zeMemAllocDevice pointers with zero issues. Full interop, no compatibility problems. The only difference is which kernel path gets triggered under the hood.

Symptoms you might recognize

The fix

Replace sycl::malloc_device() with zeMemAllocDevice() throughout llama.cpp's SYCL backend. I wrote centralized helper functions with automatic fallback:

```cpp
static void * ggml_sycl_malloc_device(size_t size, sycl::queue & q) {
    void * ptr = nullptr;
    try {
        // Unwrap the Level Zero context/device behind the SYCL queue.
        auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
        auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
        // Allocate directly through Level Zero: this takes the xe driver's
        // SVM/P2P path instead of the DMA-buf/TTM mirror path.
        ze_device_mem_alloc_desc_t alloc_desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
        ze_result_t r = zeMemAllocDevice(ze_ctx, &alloc_desc, size, /*alignment=*/64, ze_dev, &ptr);
        if (r == ZE_RESULT_SUCCESS && ptr != nullptr) {
            return ptr;
        }
    } catch (...) {
        // Level Zero interop unavailable - fall through to the SYCL allocator.
    }
    return sycl::malloc_device(size, q);  // original behavior as fallback
}
```

The fix touches 4 files, replaces 3 allocation sites and 3 free sites, and links against ze_loader. If Level Zero interop isn't available for some reason, it falls back to the original sycl::malloc_device behavior automatically.

Before and after

Q4_K_M (15.6 GiB model), 48K context, dual GPU:

| Metric | Before | After |
|---|---|---|
| Peak system RAM | 60,034 MiB (100%), OOM crash | ~6.7 GiB (10%), flat |
| Prompt processing | crash | 782 t/s |
| pp512 speed | 348 t/s | 359 t/s |
| tg128 speed | 17.92 t/s | 17.92 t/s |

Q8_0 (26.6 GiB model), 32K context, dual GPU:

| Metric | Before | After |
|---|---|---|
| Peak system RAM | 100%, OOM crash | flat, no issue |
| Prompt processing | crash | 915 t/s |

System RAM stays flat at around 10% throughout all dual-GPU tests. No OOM, no crashes, no performance regression. Output is byte-for-byte identical between single GPU and dual GPU (verified with seed=42).

Things we tried that didn't work

Before finding the real cause, we spent hours on other approaches. None of them fixed the problem.

The only fix is replacing the allocation function itself.

Why Nvidia and AMD don't have this problem

CUDA and ROCm have their own peer-to-peer memory management that doesn't go through the kernel's generic DMA-buf path. Intel's xe driver actually has a working P2P/SVM path in kernel 7.0+, but sycl::malloc_device() triggers the older DMA-buf export path instead of using it. Intel's own multi-GPU inference stack (llm-scaler, which uses vLLM) avoids this by using Level Zero APIs directly.

System details

What's next

I'm planning to submit this as a PR to llama.cpp. If you're hitting this issue and want to fix it locally, I'm happy to share the full patch and test programs.

This probably affects anyone using Intel multi-GPU with any SYCL-based inference engine, not just llama.cpp. The root cause is in how SYCL's allocation function interacts with the xe driver, not in llama.cpp specifically.

I also posted the initial findings on X before we found the fix, if you want to see the real-time investigation.