Intel Arc B70 Benchmarks/Comparison to Nvidia RTX 4070 Super

Posted by Dave_from_the_navy@reddit | hardware

Disclaimer: I dumped my raw benchmark data, bash scripts, and homelab notes into an LLM to format this post into something readable. I'm not a writer, and I'm not going to spend hours writing this, along with my explanations for my findings for everyone to come in here and call it AI slop anyway. The hardware, data, custom polling scripts, and the testing methodology are 100% mine.

Data available here. I'll eventually load up the repo with my scripts and the rest of the data (individual run json files and csv data), but quite frankly I've spent enough time on this today and I have other shit I have to do.

I’ve had the Intel Arc Pro B70 running in my homelab for about a week now. Getting it working for LLM inference was a chore (see my previous post on Getting An Intel Arc B70 Running on a Dell PowerEdge R730XD), but it is finally stable.

There aren't many real-world LLM benchmarks for this card out there, so I wrote some custom scripts to test it. I don't have an RTX 5090 or a 32GB Nvidia equivalent, so I benchmarked it against the RTX 4070 Super in my gaming PC using the exact same model.

The Hardware & System Mismatch

This isn't a 1:1 sterile test bench.

Nvidia RTX 4070 Super (Gaming PC)

VRAM: 12GB GDDR6X

Memory Bandwidth: 504 GB/s

Compute: 7168 CUDA Cores (~568 INT8 TOPS)

Host: Ryzen 7 9700X, 32GB DDR5, PCIe 4.0 x16, Bare Metal Linux Mint

Intel Arc Pro B70 (Server VM)

VRAM: 32GB GDDR6

Memory Bandwidth: 608 GB/s

Compute: 32 Xe2 Cores (367 INT8 TOPS)

Host: Dell PowerEdge R730XD (dual Xeon E5-2699 v4), 32GB DDR4, PCIe 3.0 x16, Ubuntu VM via Proxmox

Addressing the PCIe & VM Mismatch: Because the Dell server is limited to PCIe Gen 3.0, the initial load of the model weights into VRAM is slower. However, because I run these tests with all layers in VRAM (no system RAM offloading), the PCIe bus becomes irrelevant once the model is loaded: token generation and prompt ingestion rely entirely on the GPU's internal compute and memory bandwidth. The VM overhead from Proxmox PCIe passthrough is roughly ~1-2%, well within the margin of error. (Note: I manually resized the BAR on the R730XD to bypass Intel's usual performance cliff on systems without official ReBAR support.)
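For anyone who wants to check the same thing on their own box, the kernel exposes resizable BAR through sysfs. This is just a read-only sketch; the PCI address and BAR index are placeholders (the VRAM aperture isn't always BAR0), so adapt it to your own lspci output:

```shell
# Hypothetical PCI address for the GPU; find yours with: lspci | grep -i vga
GPU="0000:03:00.0"
RESIZE="/sys/bus/pci/devices/$GPU/resource0_resize"

if [ -r "$RESIZE" ]; then
  # The file holds a bitmask of BAR sizes the device supports;
  # writing a supported bit back (as root) resizes the BAR.
  MASK=$(cat "$RESIZE")
  echo "supported BAR size mask for $GPU: $MASK"
else
  MASK="n/a"
  echo "no resizable BAR exposed at $GPU on this host (or not readable)"
fi
```

On the R730XD the write has to happen before the driver binds the device, which is the fiddly part.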

Methodology & Model Choice

To isolate backend performance, I used llama.cpp's built-in llama-bench. I wrote a custom bash script that runs llama-bench across a matrix of prompt/generation sizes while a background process polls power draw and VRAM usage (via nvidia-smi on the Nvidia box and xpu-smi on the Intel box) several times a second.
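The harness looked roughly like this. It's a trimmed-down sketch, not my actual script; the model path, size matrix, and poll interval are placeholders, and it prints the run matrix instead of executing it:

```shell
#!/usr/bin/env bash
# Sketch of the benchmark harness; model path, sizes, and poll interval are placeholders.
MODEL="qwen3.5-9b-q5_k_m.gguf"
PROMPTS=(512 4096 16384 65536 131072)   # prefill sizes
GENS=(128 512)                          # decode sizes

poll_gpu() {
  # Background poller; on the Intel box this would call xpu-smi instead.
  while sleep 0.25; do
    nvidia-smi --query-gpu=power.draw,memory.used --format=csv,noheader
  done
}

# Build the full prompt x generation matrix (dry run: print, don't execute)
CMDS=()
for p in "${PROMPTS[@]}"; do
  for n in "${GENS[@]}"; do
    CMDS+=("llama-bench -m $MODEL -p $p -n $n -fa 0 -o json")
  done
done
printf '%s\n' "${CMDS[@]}"
```

In the real script, `poll_gpu` runs in the background (`poll_gpu > power.log &`), each llama-bench run writes its own JSON, and the two get stitched together by timestamp afterward.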

The Model: Qwen3.5-9B-Q5_K_M (GGUF)

Why test a 9B model on a 32GB card? Because the 4070 Super only has 12GB. A 9B Q5_K_M model requires roughly 6.5GB of VRAM for the weights. This leaves ~5.5GB of breathing room on the Nvidia card for the KV cache, letting me push massive context windows (up to 128k tokens) to see how the backends handle extreme prefill without immediately hitting an Out-Of-Memory error.
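The weight math, assuming Q5_K_M averages roughly 5.7 bits per weight (an approximation; the exact figure shifts a bit with the tensor mix):

```shell
# Approximate GGUF weight size: params * bits-per-weight / 8
# ~5.7 bits/weight for Q5_K_M is an assumption, not an exact spec
WEIGHTS_GB=$(awk 'BEGIN { printf "%.1f", 9e9 * 5.7 / 8 / 1e9 }')
echo "~${WEIGHTS_GB} GB of weights"   # leaves ~5.5 GB free on a 12GB card
```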

Flash Attention is currently broken in the upstream llama.cpp SYCL backend, so I tested the Nvidia card twice: once with FA off (for an apples-to-apples comparison) and once with FA on (to see the ceiling).

The Results: SYCL vs CUDA

  1. Token Generation Speed (Decode)

    Intel B70: ~32.6 Tokens/Second

    4070 Super (FA Off): ~67.1 Tokens/Second

Despite the B70 having a wider 256-bit bus and higher raw memory bandwidth (608 GB/s vs 504 GB/s), the 4070 Super outputs tokens twice as fast. Token generation is memory-bandwidth bound, meaning the B70 should win here. The fact that it doesn't is purely a software tax. The SYCL backend is currently unoptimized and failing to fully utilize the physical hardware compared to the highly mature CUDA backend.
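A back-of-the-envelope roofline makes the software tax concrete: at batch 1, every generated token has to stream the full ~6.4 GB of weights out of VRAM, so memory bandwidth sets a hard ceiling on decode speed (this ignores KV-cache reads, which push the real ceiling a bit lower):

```shell
# Decode ceiling ≈ memory bandwidth (GB/s) / GB streamed per token (~6.4 GB of weights)
B70_CEIL=$(awk 'BEGIN { printf "%.0f", 608 / 6.4 }')
NV_CEIL=$(awk 'BEGIN { printf "%.0f", 504 / 6.4 }')
# Fraction of the theoretical ceiling each card actually achieved
B70_EFF=$(awk 'BEGIN { printf "%.0f", 32.6 / (608/6.4) * 100 }')
NV_EFF=$(awk 'BEGIN { printf "%.0f", 67.1 / (504/6.4) * 100 }')
echo "B70:   ceiling ~${B70_CEIL} tok/s, achieved 32.6 (${B70_EFF}%)"
echo "4070S: ceiling ~${NV_CEIL} tok/s, achieved 67.1 (${NV_EFF}%)"
```

CUDA is extracting roughly 85% of what the memory system can theoretically deliver; SYCL is getting about a third.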

  2. Time To First Token (Prefill)

    Intel B70 (<=4k context): ~2,309 Tokens/Second

    4070 Super FA Off (<=4k context): ~3,705 Tokens/Second

    4070 Super FA On (<=4k context): ~4,329 Tokens/Second

Prefill is compute-bound, not memory-bound. Nvidia's raw matrix-multiplication dominance (568 TOPS vs 367 TOPS) combined with their hyper-optimized CUDA backend allowed the 4070 Super to calculate the attention mechanisms significantly faster.
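The raw compute ratio lines up almost exactly with the observed prefill gap, which supports the idea that prefill (unlike decode) is limited by the silicon rather than the backend:

```shell
# Hardware compute ratio vs observed prefill throughput ratio (FA off)
HW_RATIO=$(awk 'BEGIN { printf "%.2f", 568 / 367 }')
PF_RATIO=$(awk 'BEGIN { printf "%.2f", 3705 / 2309 }')
echo "INT8 TOPS ratio: ${HW_RATIO}x, prefill ratio: ${PF_RATIO}x"
```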

  3. The 128k Context Crash & Scratch Space

This is where the software differences become glaring. I pushed both cards to a 131,072-token (128k) context window.

    4070 Super (FA Off): Survived up to 64k tokens (using 11.6 GB VRAM). Crashed at 128k with a hard OOM error due to the 12GB physical limit.

    4070 Super (FA On): Handled the full 128k context using just 11.0 GB of VRAM.

    Intel B70: Handled 64k tokens but required 27.5 GB of VRAM to do it. At 128k tokens, it didn't OOM. Instead, it threw: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST).

Why did the Intel card need 16GB more VRAM for the exact same 64k context? Scratch space. Because SYCL doesn't have Flash Attention yet, it uses standard attention, whose memory footprint scales quadratically (O(N²)) with context length. During prefill, the backend creates temporary intermediate buffers to calculate attention. CUDA is fiercely optimized to pool and minimize these buffers; SYCL is not. When scaling to 128k, SYCL's intermediate allocations ballooned so badly that the compute kernel timed out, the host OS reset the driver (a TDR crash), and the GPU effectively disconnected mid-calculation.
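To get a feel for how fast naive attention scratch balloons, here's the size of a single full fp16 score matrix at each context length. This is illustrative only; llama.cpp tiles and reuses buffers across heads and layers, so the actual scratch is smaller, but the quadratic growth is the same:

```shell
# One full N x N fp16 attention score matrix: N^2 * 2 bytes
for N in 4096 65536 131072; do
  GIB=$(awk -v n="$N" 'BEGIN { printf "%.2f", n * n * 2 / 1073741824 }')
  echo "${N} tokens -> ${GIB} GiB per full fp16 score matrix"
done
```

Going from 64k to 128k quadruples that buffer, which is exactly the regime where the SYCL backend fell over.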

  4. Power Efficiency

    Intel B70 (290W TDP): Averaged 215W active inference.

    4070 Super (220W TDP): Averaged 177W active inference.

Nvidia is generating tokens twice as fast while pulling ~40 fewer watts.
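Folding speed and power together into tokens per joule (using the decode and average-power numbers above):

```shell
# Tokens per joule = (tokens/second) / (joules/second, i.e. watts)
B70_TPJ=$(awk 'BEGIN { printf "%.2f", 32.6 / 215 }')
NV_TPJ=$(awk 'BEGIN { printf "%.2f", 67.1 / 177 }')
echo "B70: ${B70_TPJ} tok/J  vs  4070S: ${NV_TPJ} tok/J"
```

That's roughly a 2.5x efficiency gap per token, which again is mostly the backend, not the silicon.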

Final Thoughts

If you want a plug-and-play experience, buy Nvidia.

If you are a tinkerer, the B70 is an interesting piece of hardware. We are watching Intel build their AI software stack in real time. The physical bandwidth is there, but the SYCL backend needs heavy optimization. I plan to re-run these benchmarks in six months to see whether OpenVINO or an updated SYCL backend unlocks the hardware's actual potential.

(A quick note on OpenVINO and Vulkan: I've heard I can get 30%-50% better performance running OpenVINO instead of SYCL. I haven't had success getting OpenVINO to run on my system, but if I have more time to tinker with it, I'll check it out! Regarding Vulkan, I'll try it, but I'm not particularly interested in Vulkan performance.)

Next up for the B70: testing models that simply can't fit on a 12GB card, specifically Qwen3.5-27B, Qwen3.5-35B-A3B, and Gemma4-31B. Let me know if there's anything else you want to see run on this thing.