Intel Arc B70 Benchmarks/Comparison to Nvidia RTX 4070 Super
Posted by Dave_from_the_navy@reddit | hardware | 2 comments
Good day everyone! You may remember me from such posts as Getting An Intel Arc B70 Running For LLM Inference on a Dell Poweredge R730XD. Maybe not. Probably not... Anyway, I've had this card for about a week now, I ordered it on launch day and have been beating my head against a wall with drivers and other issues until finally getting it running properly! Since then, I've realized there's a significant lack of people actually testing this card and getting some real benchmarks out into the community. Something something be the change you want to see in the world, something something... So I've done some testing, and this certainly won't be the last of my tests and benchmarks, but it'll certainly be the first.
I know what's on the community's mind. I hear you ask, "How does the new Intel card do against the most reasonable Nvidia 32GB card, like the RTX 5090 or RTX 5000?" I hear your question and I provide you the following answer: "I don't know... I don't have money to test that..." BUT! I DO have the Nvidia 4070 Super sitting in my gaming PC... So instead, I'll give you those benchmarks. Same model on both cards: how does each one do?
The Disclosure
Full disclosure: the test isn't fair. The Nvidia card is sitting in my gaming PC with the following specs:
Motherboard: Gigabyte B650 AORUS ELITE AX (PCIe 4.0 x16)
OS: Linux Mint 22.3
Linux Kernel: 6.17.0-19
CPU: AMD Ryzen 7 9700X 8-Core Processor
Graphics Card: Nvidia RTX 4070 Super
System RAM: 32GB 6000 MT/s DDR5
The Intel B70 is sitting in my home server in a VM with the following specs:
System: Dell Poweredge R730XD (PCIe 3.0 x16)
OS: Ubuntu 25.10, virtualized via Proxmox 9.1.2
Linux Kernel: 6.17.0-20
CPU: 4 cores of the dual Xeon E5-2699 v4 processors (all 4 cores are on the same processor, I made sure)
Graphics Card: Intel Arc B70
System RAM: 32GB 2133 MT/s DDR4
A couple of things to get out of the way. When benchmarking, Time To First Token (TTFT) and prompt processing (prefill) can be bottlenecked by PCIe bandwidth, especially for massive prompts. Because the server is bound by PCIe Gen 3.0, the prompt ingestion speed is guaranteed to be slower on my server than on the Gen 4.0 gaming PC. Could I have run both cards on the same machine for a better result? Yes. But my home server hosts multiple users and services, and I'm not bringing it down just to run a slightly more sterile benchmark.
"So, how much does this system mismatch actually matter for the benchmarks?"
Honestly? Barely at all for everything outside of the large context runs.
Because we are running these tests with all layers loaded into VRAM (no system RAM offloading), the host system's hardware takes a massive backseat. Here is how that system difference actually translates to the numbers you are about to see:
Impact on Generation Speed (Tokens/Second): Less than 3%. Once the model is loaded into VRAM, generating text relies almost entirely on the GPU's VRAM memory bandwidth and compute. The host CPU and system RAM are just twiddling their thumbs waiting for the GPU to spit out the next token.
The Proxmox VM Overhead: ~1-2%. I’m passing the B70 through to an Ubuntu VM. Modern PCIe passthrough is incredibly efficient. You lose a microscopic amount of performance compared to bare metal, but it’s well within the margin of error for a standard benchmark. (And yes, before you ask: although I don't have ReBAR enabled because it isn't officially supported on my platform, I have figured out how to manually resize the BAR. See [this post](https://www.reddit.com/r/LocalLLaMA/comments/1sajy2u/getting_an_intel_arc_b70_running_for_llm/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) for more details.)
The PCIe 3.0 vs 4.0 factor: The PCIe Gen 3.0 bottleneck on my Dell R730XD primarily affects initial model load times and the transfer of large prompts during the prefill phase. But once the prompt is digested and we move into actual token generation (decoding)? The PCIe bus is basically a ghost town, and it becomes a pure test of the GPU's VRAM bandwidth and compute.
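To put a rough number on that load-time penalty, here's some napkin math, assuming the ~6.5 GB Q5_K_M file discussed later in this post and the theoretical bus rates (real transfers land somewhat below these peaks):

```shell
# One-time model upload over the PCIe bus: file size / bus bandwidth.
# The 6.5 GB file size and the peak bus rates are assumptions.
GEN3_S=$(awk 'BEGIN{printf "%.2f", 6.5/15.75}')  # PCIe 3.0 x16, ~15.75 GB/s
GEN4_S=$(awk 'BEGIN{printf "%.2f", 6.5/31.5}')   # PCIe 4.0 x16, ~31.5 GB/s
echo "model upload: ~${GEN3_S}s on Gen 3 vs ~${GEN4_S}s on Gen 4"
```

Either way it's under half a second, which is why the bus only really shows up during huge prefill transfers, not steady-state decode.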
The Cards Themselves
Okay, now that we've got that out of the way, let's look at our two competitors.
Intel Arc Pro B70 (Server VM)
VRAM: 32GB GDDR6
Memory Bus: 256-bit
Memory Bandwidth: 608 GB/s
Compute: 32 Xe2 Cores (367 INT8 TOPS)
Interface Used: PCIe 3.0 x16 (~15.75 GB/s host bandwidth)
Nvidia RTX 4070 Super (Gaming PC)
VRAM: 12GB GDDR6X
Memory Bus: 192-bit
Memory Bandwidth: 504 GB/s
Compute: 7168 CUDA Cores (~568 INT8 TOPS)
Interface Used: PCIe 4.0 x16 (~31.5 GB/s host bandwidth)
What Do These Differences Actually Mean?
- Model Fit (VRAM Capacity)
The Expectation: The B70 will hold models roughly 167% larger than the 4070 Super.
The Why: VRAM is the absolute king of local LLM inference. The 4070 Super's 12GB is a hard ceiling. You can comfortably fit an 8B parameter model at high precision, or maybe squeeze a heavily quantized 14B model in there if you keep the context window small. The B70's massive 32GB of VRAM makes it possible to fit models that are simply not feasible on the 4070 Super. This is obviously what Intel was going for with this card, and almost all of their marketing material was based around this fact. (And we'll see why later...)
- Token Generation Speed / Decode
The Expectation: The B70 should be roughly 15% to 20% faster at generating text.
The Why: Token generation (decoding) is famously memory-bandwidth bound, not compute-bound. It doesn't matter how fast your GPU compute cores are if they are starved for data. The 4070 Super has faster GDDR6X memory, but it's choked by a narrower 192-bit bus, giving it 504 GB/s of total bandwidth. The B70 uses standard GDDR6 but has a wider 256-bit bus, resulting in 608 GB/s. Because decode is just furiously reading weights from memory to generate the next token, that ~20% raw bandwidth advantage directly translates to higher tokens-per-second.
- Time To First Token (TTFT) / Prefill
The Expectation: The 4070 Super will crush the B70, being anywhere from 100% to 150%+ faster at digesting large prompts.
The Why: Unlike decode, the prefill phase (reading your prompt and calculating the initial state) is highly compute-bound and host-bandwidth bound. This is where the Nvidia system flexes. First, the 4070 Super simply has vastly more raw compute muscle for matrix multiplication (~568 TOPS vs the B70's ~367 TOPS). Second, transferring a massive prompt from system RAM to the GPU relies on the PCIe bus. The 4070 Super is sitting on a PCIe 4.0 slot (~31.5 GB/s), while the B70 is bottlenecked by the Dell server's older PCIe 3.0 slot (~15.75 GB/s). Between double the PCIe transfer speed and significantly more compute power, the 4070 Super will ingest huge walls of text in less than half the time it takes the B70.
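You can sanity-check the decode expectation with napkin math: every generated token has to stream the full weight set out of VRAM once, so the theoretical ceiling is roughly bandwidth divided by model size (using ~6.5 GB as the Q5_K_M weight size; real-world numbers always land well under these ceilings):

```shell
# Theoretical decode ceiling = VRAM bandwidth / bytes read per token.
# 6.5 GB is an approximation of the Q5_K_M weight file size.
B70_TPS=$(awk 'BEGIN{printf "%.0f", 608/6.5}')   # Intel Arc B70
NV_TPS=$(awk 'BEGIN{printf "%.0f", 504/6.5}')    # RTX 4070 Super
echo "ceilings: B70 ~${B70_TPS} t/s, 4070 Super ~${NV_TPS} t/s"
```

Same ~20% gap as the raw bandwidth numbers, just expressed in tokens.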
"What's the purpose then of the benchmarks if we already know how the compute numbers stack up?"
For the software stack! Intel's software stack has been notoriously unreliable. With them moving around, using various forks of PyTorch, llama.cpp, IPEX, SYCL, OpenVINO... What does it all mean? Basically, it's a moving target at the moment and Intel is figuring it out in real time. What this means for us is that this test is primarily to see what the numbers look like right now, at launch, for the B70. I fully expect these to change 3-6 months down the line. We'll talk more about this later.
The Methodology & Model Choice
Before I throw the charts at you, let's talk about the setup. To make this as scientific and reproducible as possible, I used llama.cpp's built-in llama-bench tool. It is the gold standard for testing local LLM inference because it completely strips away the UI overhead and isolates the raw backend performance.
To run the tests, I wrote a custom bash shell script to automate a matrix of combinations. For every single configuration, the script does the following:
Starts a background process to poll the GPU (nvidia-smi for the 4070, xpu-smi for the B70) many times a second to log exact power draw and VRAM usage.
Executes llama-bench for that specific prompt size, generation size, and batch size.
Kills the polling tool, outputs the JSON data, and forces the script to sleep for 5 seconds so the GPU can cool down and clear its VRAM. (This prevents thermal throttling from skewing the later runs).
I ran a few different variations of these tests. If you look closely at the data later, you'll see 2 different major sets of runs. The variables for the first run included the following:
Prompt size (tokens): 512, 2048, 4096
Tokens generated: 128, 512
Batch size (tokens): 512, 1024
The variables for the second run were for "large prompt" data:
Prompt size (tokens): 8k, 16k, 32k, 64k, 128k
Tokens generated: 512
Batch size (tokens): 1024, 2048, 4096
The SYCL backend doesn't currently support Flash Attention; there's a known bug (an upstream issue in llama.cpp's SYCL implementation). To keep the test fair between the cards, I disabled Flash Attention on the Nvidia card as well. But because we're also trying to see how good the B70 might eventually be, I ran the Nvidia tests twice, the second time with Flash Attention on!
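For the curious, here's a stripped-down sketch of what that driver loop looks like. The llama-bench flags (-m, -p, -n, -b, -fa, -o json) are real llama.cpp options; the model path, the poller command, and the exact structure are reconstructed from memory, and this version just echoes the commands it would run:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the benchmark driver. Swap `echo` out for the real
# command (and uncomment the poller lines) to actually run the matrix.
RUNS=0
for p in 512 2048 4096; do          # prompt sizes (tokens)
  for n in 128 512; do              # tokens generated
    for b in 512 1024; do           # batch sizes (tokens)
      # 1. poll the GPU in the background (nvidia-smi here; xpu-smi on the B70)
      # nvidia-smi --query-gpu=power.draw,memory.used --format=csv -lms 100 \
      #   > "poll_p${p}_n${n}_b${b}.csv" & POLL_PID=$!
      # 2. run the actual benchmark for this cell, dumping JSON for parsing
      echo llama-bench -m model.gguf -p "$p" -n "$n" -b "$b" -fa 0 -o json
      # 3. stop polling, then let the GPU cool down and free its VRAM
      # kill "$POLL_PID"; sleep 5
      RUNS=$((RUNS + 1))
    done
  done
done
echo "matrix size: $RUNS runs"
```

For the Nvidia Flash-Attention-on pass, the only change is -fa 1.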
The Elephant in the Room: Why Qwen3.5-9B-Q5_K_M (gguf)?
If the Intel B70 has an ocean of 32GB of VRAM, why am I testing a 9B parameter model quantized to Q5_K_M?
Because the RTX 4070 Super only has 12GB.
To make this an actual test of the software stack, GPU compute and memory bandwidth, zero layers can be offloaded to system RAM. If the Nvidia card spills over into system RAM, its performance will absolutely tank, and the comparison becomes entirely useless.
A 9B model at Q5_K_M quantization takes up roughly 6.5GB of VRAM just for the model weights. That leaves roughly 5.5GB of VRAM breathing room on the 4070 Super.
Why do we need that breathing room? The KV Cache.
When you start testing massive context windows (my script pushes prompts up to 32k, 64k, and even 128k tokens), the GPU has to store the "memory" of that prompt in VRAM (the KV cache). In a standard transformer the KV cache grows linearly with context (it's the attention compute that grows quadratically); Qwen 3.5 is fantastic here because it natively supports massive context lengths and uses a hybrid gated deltanet architecture, where most layers keep a constant-size state, so the cache stays far smaller than a traditional model's. By using a 9B model at this quantization, I left just enough VRAM on the Nvidia card to allow us to test incredibly large prompt ingestions without instantly hitting an Out-Of-Memory (OOM) error. It forces both cards to sweat through massive prefill phases while staying strictly within the boundaries of their VRAM.
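To give a feel for the scale, with completely made-up illustrative hyperparameters (not Qwen3.5's actual config): in a standard transformer, the KV cache costs 2 × layers × KV heads × head dim × bytes per element for every token of context, and in a hybrid model only the handful of full-attention layers pay that cost.

```shell
# Illustrative KV cache sizing. ALL hyperparameters below are invented
# for the example -- they are not Qwen3.5-9B's real configuration.
ATTN_LAYERS=8                        # full-attention layers in a hybrid stack
KV_HEADS=8; HEAD_DIM=128; BYTES=2    # FP16 keys + values
PER_TOKEN=$((2 * ATTN_LAYERS * KV_HEADS * HEAD_DIM * BYTES))  # K and V
GIB_128K=$(awk -v pt="$PER_TOKEN" 'BEGIN{printf "%.1f", pt*131072/2^30}')
echo "~${PER_TOKEN} bytes/token -> ~${GIB_128K} GiB of KV cache at 128k"
```

With numbers in that ballpark, even a 128k context fits in the ~5.5GB of headroom the 4070 Super has left after loading the weights.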
Alright, enough preamble. You've waited long enough. Let's look at the data.
The Results: Hardware Theory vs Software Reality
Alright, time to look at the numbers. We ran the test matrix, parsed the data, and averaged the results across the runs.
If you read the hardware specs breakdown above, you might remember my prediction: The B70 should be 15-20% faster at token generation due to its wider memory bus and superior memory bandwidth.
Let's see how that prediction held up.
- Token Generation Speed (Decode)
Average Decode Speeds:
Intel B70: ~32.6 Tokens/Second
Nvidia 4070 Super (FA Off): ~67.1 Tokens/Second
Nvidia 4070 Super (FA On): ~68.0 Tokens/Second
The Verdict: The prediction was dead wrong. The 4070 Super completely and utterly decimated the B70, outputting tokens more than twice as fast.
Why? The Software Tax. This right here is why we test the software stack! On paper, the B70 has 608 GB/s of bandwidth, which should easily push this 9B model to 75+ tokens a second. But llama.cpp's SYCL backend is clearly still in its infancy compared to the deeply entrenched, hyper-optimized CUDA backend. The hardware is physically capable of so much more, but the software stack is currently leaving half of its performance entirely on the table.
- Time To First Token (Prefill)
Average Prefill Speeds (Standard Context <= 4k):
Intel B70: ~2,309 Tokens/Second
Nvidia 4070 Super (FA Off): ~3,705 Tokens/Second
Nvidia 4070 Super (FA On): ~4,329 Tokens/Second
Average Prefill Speeds (Massive Context > 4k):
Intel B70: ~1,745 Tokens/Second
Nvidia 4070 Super (FA Off): ~3,070 Tokens/Second
Nvidia 4070 Super (FA On): ~3,756 Tokens/Second
The Verdict: Exactly as predicted. Nvidia's raw matrix-multiplication compute dominance (568 TOPS vs 367 TOPS) combined with the PCIe Gen 4.0 transfer speeds allowed the 4070 Super to chew through huge walls of text significantly faster. Turning Flash Attention on for the Nvidia card widened the gap even further, letting it ingest massive context over twice as fast as the Intel setup.
- The 128k Context Limit Crunch (The Real Drama)
This was my favorite part of the entire benchmark run. Remember how I chose a 9B model specifically to give the 12GB 4070 Super some breathing room for the KV cache? We pushed both cards up to a 131,072 (128k) token context window.
The 4070 Super (Flash Attention OFF): It successfully handled everything up to 64k tokens, maxing out at 11,672 MiB of VRAM usage. It survived with just 193 MiB to spare before hitting the hard 12GB ceiling (11,865 MiB usable, with about 200 MiB claimed by the rest of the OS and the Cinnamon DE)! But at 131k tokens? It crashed. A hard Out-Of-Memory (OOM) error. It simply couldn't hold the context.
The 4070 Super (Flash Attention ON): By using Flash Attention to optimize memory, the 4070 Super squeezed the entire 128k context window into 11,009 MiB of VRAM and successfully completed the test.
The Intel B70: With its massive 32GB pool, it sailed past the 64k token mark using roughly 27.5 GB of VRAM. At 128k tokens, the test failed, throwing this massive error: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST). It didn't run out of memory; the SYCL backend completely buckled under the math and the GPU essentially dropped offline. I'll go into more detail on why this occurred later.
- Power Draw & Efficiency
Since my script was polling the GPU power sensors multiple times a second throughout the entire run, we can also look at how much electricity these cards were chugging to get these results.
Average Power Draw (During Active Inference):
Intel B70 (290W TDP): Averaged 215W, with momentary peaks hitting 287.5W.
Nvidia 4070 Super (220W TDP) - FA Off: Averaged 177W, peaking at 221.2W.
Nvidia 4070 Super (220W TDP) - FA On: Averaged 186W, peaking at 220.8W.
The Verdict: Nvidia's efficiency dominance is on full display here. Not only is the 4070 Super generating tokens twice as fast, but it is doing so while drawing roughly 30 to 40 fewer watts on average. The B70 gets dangerously close to its 290W limit during heavy prefill phases. Because SYCL is currently keeping the compute units waiting, the B70 is essentially burning power just staying awake for the data to arrive. Once the software stack matures and the operations execute faster, we should actually see the total power efficiency (tokens per watt) improve.
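Running the tokens-per-watt numbers on the averages above makes the gap concrete:

```shell
# Efficiency = average decode speed / average power draw during inference,
# using the measured averages from the runs above.
B70_TPW=$(awk 'BEGIN{printf "%.2f", 32.6/215}')   # Intel Arc B70
NV_OFF=$(awk 'BEGIN{printf "%.2f", 67.1/177}')    # 4070 Super, FA off
NV_ON=$(awk 'BEGIN{printf "%.2f", 68.0/186}')     # 4070 Super, FA on
echo "tokens/watt: B70 ${B70_TPW}, 4070S ${NV_OFF} (FA off), ${NV_ON} (FA on)"
```

So right now the 4070 Super is getting roughly 2.5× more work out of every watt.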
The Software Tax: SYCL vs. CUDA
In the end, it all comes down to the software backends: SYCL vs. CUDA.
Nvidia’s CUDA has been the unquestioned king of AI acceleration for over a decade. llama.cpp's CUDA backend has had thousands of brilliant developers meticulously optimizing every single matrix multiplication, memory pool, and tensor operation. It is arguably the most highly optimized piece of local AI software in existence.
Intel, on the other hand, is currently relying on SYCL (specifically via their OneAPI toolkit). SYCL is an open, cross-platform abstraction layer meant to break Nvidia's monopoly. It's incredibly promising, but right now, it is in its absolute infancy compared to CUDA.
When you look at the decode speeds (32 t/s vs 67 t/s), you are watching the SYCL backend failing to fully utilize the physical hardware. The B70's compute units and memory controllers are sitting idle for fractions of a millisecond between operations because the software hasn't been optimized to feed them efficiently.
The 128k Crash: Why Intel's 32GB Wasn't Enough
This software immaturity becomes painfully obvious when we look at the VRAM consumption during the 64k and 131k context runs.
At 64k tokens, the Nvidia card (even with Flash Attention turned OFF) only used 11.6 GB of VRAM. The Intel B70 needed 27.5 GB to do the exact same math.
Why the 16 GB discrepancy? It's not the KV Cache (the memory of the prompt), which is identical for both models. The difference is how the backends handle "scratch space." During prompt processing, the backend must create temporary intermediate buffers to calculate the attention mechanisms. CUDA has been fiercely optimized to keep these temporary buffers as small as possible.
The SYCL backend currently scales these scratch buffers terribly. As the context window grows, SYCL's intermediate memory allocations balloon out of control. When we hit 128k tokens, the Intel card didn't just quietly run out of memory; the SYCL backend completely buckled under the unoptimized math, causing the level_zero API to panic and essentially disconnect the GPU from the host system entirely (UR_RESULT_ERROR_DEVICE_LOST).
Final Thoughts: Should You Buy The B70?
Right now? If you want a plug-and-play experience, absolutely not. The RTX 4070 Ti Super or a used RTX 3090 are still the undeniable kings of this price bracket.
But if you are a tinkerer and don't mind waiting for the software stack to mature? The B70 is one of the most fascinating pieces of hardware on the market. We are literally watching Intel build their software stack in real time. Because the hardware physically possesses the bandwidth and compute, it is all but guaranteed that the B70’s performance will drastically improve over the next 6 to 12 months as the SYCL backend matures and Flash Attention is properly implemented. Intel has officially deprecated their IPEX backend and is now shifting all of its attention to OpenVINO and SYCL via llama.cpp. It looks very promising moving forward.
(A quick note on OpenVINO and Vulkan. I've heard that I can get 30%-50% better performance running on OpenVINO instead of SYCL. I haven't had success getting OpenVINO to run on my system, but if I have more time to tinker with it, I'll check it out! Regarding Vulkan, I'll try it, but I'm really not all that particularly interested in Vulkan performance.)
I'll be re-running these benchmarks in a few months to see how much performance Intel manages to unlock. Until then, I'll be running other benchmarks on other highly competitive models that can fit on this card!
I'm especially excited to see how the 32GB B70 handles the massive KV cache appetite of the dense Gemma4-31B (seeing as that's all anyone's talked about on r/LocalLLaMA lately...), compared to the highly efficient hybrid MoE architecture of Qwen3.5-35B-A3B.
The next models on the chopping block, specifically:
Qwen3.5-27B
Qwen3.5-35B-A3B
Gemma4-31B
Gemma4-26B-A4B
Let me know what else you want me to run on the card!
TL;DR: The RTX 4070 Super crushes the B70 in raw speed (67 t/s vs 32 t/s) because CUDA is highly optimized while Intel's SYCL backend is still in its infancy.
elkond@reddit
sycl is just a translation layer for cuda, openvino is an entire runtime and optimization environment that does things u wouldnt expect from a runtime (in positive sense). u can dm me if u r having issues getting it up and running
AutoModerator@reddit
Hello! It looks like this might be a question or a request for help that violates our rules on /r/hardware. If your post is about a computer build or tech support, please delete this post and resubmit it to /r/buildapc or /r/techsupport. If not please click report on this comment and the moderators will take a look. Thanks!
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.