Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL)

Posted by Katostrofik@reddit | LocalLLaMA

If you're running dual Intel Arc GPUs with llama.cpp and your system RAM maxes out during multi-GPU inference, even though the model fits in VRAM, this post explains why and how to fix it.

I've been running dual Arc Pro B70s (32GB each, 64GB total VRAM) for local LLM inference with llama.cpp's SYCL backend. Every time I tried to split a model across both GPUs, my 64GB of system RAM would climb to 100% and the OOM killer would start taking out desktop processes until the system either crashed or dumped me at the login screen. This happened with every model size. A 15 GiB Q4_K_M model was eating 46 GiB of system RAM. It made no sense.

Turns out it's not a configuration issue, not a VRAM issue, and not about model size. It's a specific API call in llama.cpp's SYCL backend that triggers the wrong memory path in Intel's xe kernel driver.

What's actually happening

Every call to sycl::malloc_device() in the SYCL backend causes the xe kernel driver to create a 1:1 mirror of the GPU allocation in system RAM through DMA-buf/TTM staging. This happens at allocation time, not during inference. Every tensor, every KV cache buffer, every compute scratch buffer that gets allocated on the GPU also consumes an equal amount of your system RAM.

I confirmed this with a targeted test:

| Allocation method | 4 GiB on GPU | System RAM impact |
|---|---|---|
| sycl::malloc_device() | 4 GiB VRAM | +4,112 MiB |
| zeMemAllocDevice() | 4 GiB VRAM | +8 MiB |

Same VRAM allocation, same GPU, same driver. 500x difference in system RAM usage depending on which API you call.

The xe driver has two internal kernel paths for device memory:

  1. DMA-buf/TTM - mirrors VRAM in system RAM. This is what sycl::malloc_device() triggers.
  2. SVM/P2P - direct PCIe BAR access, virtually no system RAM. This is what Level Zero's zeMemAllocDevice() uses.

SYCL kernels can read zeMemAllocDevice pointers with zero issues. Full interop, no compatibility problems. The only difference is which kernel path gets triggered under the hood.

Symptoms you might recognize

The fix

Replace sycl::malloc_device() with zeMemAllocDevice() throughout llama.cpp's SYCL backend. I wrote centralized helper functions with automatic fallback:

```cpp
static void * ggml_sycl_malloc_device(size_t size, sycl::queue & q) {
    void * ptr = nullptr;
    try {
        // Unwrap the Level Zero context/device behind the SYCL queue.
        auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_context());
        auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(q.get_device());
        // Allocate directly through Level Zero: this takes the xe driver's
        // SVM/P2P path instead of the DMA-buf/TTM mirror path.
        ze_device_mem_alloc_desc_t alloc_desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
        ze_result_t r = zeMemAllocDevice(ze_ctx, &alloc_desc, size, /*alignment=*/64, ze_dev, &ptr);
        if (r == ZE_RESULT_SUCCESS && ptr != nullptr) {
            return ptr;
        }
    } catch (...) {
        // Level Zero interop unavailable - fall through to the SYCL allocator.
    }
    return sycl::malloc_device(size, q);  // original behavior as fallback
}
```

The fix touches 4 files, replaces 3 allocation sites and 3 free sites, and links against ze_loader. If Level Zero interop isn't available for some reason, it falls back to the original sycl::malloc_device behavior automatically.

Before and after

Q4_K_M (15.6 GiB model), 48K context, dual GPU:

| Metric | Before | After |
|---|---|---|
| Peak system RAM | 60,034 MiB (100%), OOM crash | ~6.7 GiB (10%), flat |
| Prompt processing | crash | 782 t/s |
| pp512 speed | 348 t/s | 359 t/s |
| tg128 speed | 17.92 t/s | 17.92 t/s |

Q8_0 (26.6 GiB model), 32K context, dual GPU:

| Metric | Before | After |
|---|---|---|
| Peak system RAM | 100%, OOM crash | flat, no issue |
| Prompt processing | crash | 915 t/s |

System RAM stays flat at around 10% throughout all dual-GPU tests. No OOM, no crashes, no performance regression. Output is byte-for-byte identical between single GPU and dual GPU (verified with seed=42).

Things we tried that didn't work

Before finding the real cause, we spent hours on other approaches. None of them fixed the problem.

The only fix is replacing the allocation function itself.

Why Nvidia and AMD don't have this problem

CUDA and ROCm have their own peer-to-peer memory management that doesn't go through the kernel's generic DMA-buf path. Intel's xe driver actually has a working P2P/SVM path in kernel 7.0+, but sycl::malloc_device() triggers the older DMA-buf export path instead of using it. Intel's own multi-GPU inference stack (llm-scaler, which uses vLLM) avoids this by using Level Zero APIs directly.

System details

What's next

I'm planning to submit this as a PR to llama.cpp. If you're hitting this issue and want to fix it locally, I'm happy to share the full patch and test programs.

This probably affects anyone using Intel multi-GPU with any SYCL-based inference engine, not just llama.cpp. The root cause is in how SYCL's allocation function interacts with the xe driver, not in llama.cpp specifically.

I also posted the initial findings on X before we found the fix, if you want to see the real-time investigation.