Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4

Posted by Puzzleheaded_Base302@reddit | LocalLLaMA | View on Reddit | 75 comments

Posted something when I initially got the GPU on r/IntelArc. Did not have vllm working at the time, so no real use case numbers. After many nights fighting with vllm, I finally got it to work.

Here are some summery.

both llama.cpp and llm-scaler-vllm produce \~12tps token generation rate.
tensor parallel degrade performance in all fronts (this may have something to do with my PCIe topology)
pipeline parallel improves PP, but degrades TG at single query, improve both at high concurrency
high concurrency performance is a lot better. TG reach 135 tps at 32 concurrency, which is about 20% less than RTX PRO 4500 32GB
Power consumption at 32 concurrency is about 50% higher than RTX PRO 4500 32GB, which is consistent with spec. Power consumption is maxed out at PP step, it drop almost half during single query TG period. Power consumption does not maxed out during TG step even at high concurrency situation.
you will need the latest beta fork to get qwen3.5 working.
once you install ubuntu 26.04 (yes, pre-release version), no special driver installation is needed. i was not able to get ubuntu 24.04.4 working at all, and also not in any mood to install officially supported ubuntu 25.10, which will be obsolete in 3 months.

The below command-line prompt will get your vllm intel fork running qwen3.5 on Ubuntu 26.04 LTS

export HF_TOKEN="---your hf token---"

docker run -it --rm \

--name vllmb70 \

--ipc=host \

--shm-size=32gb \

--device /dev/dri:/dev/dri \

--privileged \

-p 8000:8000 \

-v \~/.cache/huggingface:/root/.cache/huggingface \

-e HF_TOKEN=$HF_TOKEN \

-e VLLM_TARGET_DEVICE="xpu" \

--entrypoint /bin/bash \

intel/llm-scaler-vllm:0.14.0-b8.1 \

-c "source /opt/intel/oneapi/setvars.sh --force && \

python3 -m vllm.entrypoints.openai.api\_server \\

\--model Intel/Qwen3.5-27B-int4-AutoRound \\

\--tokenizer Qwen/Qwen3.5-27B \\

\--served-model-name qwen3.5-27b \\

\--gpu-memory-utilization 0.92 \\

\--allow-deprecated-quantization \\

\--trust-remote-code \\

\--port 8000 \\

\--max-model-len 4096 \\

\--tensor-parallel-size 1 \\

\--pipeline-parallel-size 1 \\

\--enforce-eager \\

\--distributed-executor-backend mp"

Below are measured token rate:

Single GPU

Concurrency: 1

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048	1700.83 ± 7.03		1196.95 ± 13.22	1104.11 ± 13.22	1196.99 ± 13.22
qwen3.5-27b	tg512	13.43 ± 0.09	14.00 ± 0.00

Concurrency: 4

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c4)	1492.15 ± 93.77	802.83 ± 468.06			3155.68 ± 1403.00	3047.58 ± 1403.00	3155.71 ± 1402.98
qwen3.5-27b	tg512 (c4)	45.91 ± 0.46	12.03 ± 0.38	52.00 ± 0.00	13.00 ± 0.00

Concurrency: 8

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c8)	1554.80 ± 5.58	533.91 ± 466.39			5677.56 ± 2849.77	5580.43 ± 2849.77	5677.59 ± 2849.76
qwen3.5-27b	tg512 (c8)	84.37 ± 0.31	11.73 ± 0.72	112.00 ± 0.00	14.00 ± 0.00

Concurrency: 32 this basically saturates all the compute cores on B70.

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c32)	1503.41 ± 1.04	194.92 ± 302.24			20599.68 ± 11444.52	20509.48 ± 11444.52	20599.70 ± 11444.52
qwen3.5-27b	tg512 (c32)	130.90 ± 13.08	5.22 ± 0.91	288.00 ± 0.00	10.39 ± 1.60

Now Dual GPUs. Tensor Parallel 2

Concurrency: 1

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048	1019.80 ± 67.88		1962.77 ± 135.14	1835.82 ± 135.14	1962.82 ± 135.14
qwen3.5-27b	tg512	9.10 ± 0.45	11.00 ± 1.41

Concurrency: 32

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c32)	1057.36 ± 1.69	133.90 ± 206.98			29738.38 ± 16330.06	29597.02 ± 16330.06	29738.40 ± 16330.05
qwen3.5-27b	tg512 (c32)	140.30 ± 1.78	6.08 ± 1.14	320.00 ± 0.00	10.32 ± 0.47

Pipeline Parallel 2

Concurrency 1

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048	1680.59 ± 124.37		1367.69 ± 105.88	1161.99 ± 105.88	1367.74 ± 105.89
qwen3.5-27b	tg512	10.31 ± 0.01	12.00 ± 0.00

Concurrency 32

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c32)	2750.77 ± 1.96	261.41 ± 294.53			11889.30 ± 5927.16	11768.85 ± 5927.16	11889.32 ± 5927.16
qwen3.5-27b	tg512 (c32)	195.82 ± 4.09	7.14 ± 0.57	293.33 ± 7.54	9.51 ± 0.50

[-]

edison_reddit@reddit

int4 autoround never work with vollm here.
```
docker run -it --rm \

--name vllmb70 \

--ipc=host \

--shm-size=32gb \

--device /dev/dri:/dev/dri \

--privileged \

-p 8000:8000 \

-v \~/.cache/huggingface:/root/.cache/huggingface \

-v /home/intel/LLM:/llm/models \

-e VLLM_TARGET_DEVICE="xpu" \

--entrypoint /bin/bash \

intel/llm-scaler-vllm:0.14.0-b8.2.1 \

-c "source /opt/intel/oneapi/setvars.sh --force && \

python3 -m vllm.entrypoints.openai.api_server \

--model /llm/models/Qwen3.6-35B-A3B-int4-AutoRound \

--served-model-name qwen3.6-35b-a3b-int4 \

--gpu-memory-utilization 0.92 \

--allow-deprecated-quantization \

--trust-remote-code \

--port 8000 \

--max-model-len 4096 \

--tensor-parallel-size 1 \

--pipeline-parallel-size 1 \

--enforce-eager \

--distributed-executor-backend mp"

(APIServer pid=1) raise RuntimeError(

(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

```

[-]

Puzzleheaded_Base302@reddit (OP)

used Vulkan on LM-Studio, it is slower than vllm.

[-]

RIP26770@reddit

Please compile llama.cpp yourself and try again.

[-]

Obvious_Okra@reddit

I’m considering the B70 for single user and 24 tps is good for me. Did you test it with Vulkan? What was the model, quant, and context?

[-]

luancyworks@reddit

I am using all pretty much default downloads and settings and getting 27tk/s on Qwen 3.6 27B. Q4_K_M full context window. This is using the Vulkan llama.cpp v2.13.0

[-]

So, a total disappointment. I expected this to be a solid card for local LLMs like Qwen 3.5 27B or Gemma 4 31B with at least a 100k context. I considered a dual gpu setup, perhaps even a quad, but given these benchmarks, it seems I'm better off saving for Nvidia hardware. It might be viable for multi-agent systems, but for now, we just have to wait for software optimizations.

[-]

luancyworks@reddit

so far solid card, some issues with ollama at first but that worked out and for Qwen 3.6 27B getting 26 Tk/ s, and Qwen 3.6 35B A3B is around 100Tk/S. First couple of runs would get down in the 13-15 T/k when I didn't have the latest updated and some KV wasn't on the GPU.

[-]

overand@reddit

It looks "fine" for that use case, for single user (and maybe more.) But, it;s not knocking it out of the park. I wonder how much of it is the kinda "meh" memory bandwidth

[-]

suprjami@reddit

This is worse performance than a 3060.

15 tok/sec makes reasoning pretty unfeasible.

[-]

masterlafontaine@reddit

May I ask what is the performance in this 3 rtx 3060 setup? Pp and tg?

[-]

suprjami@reddit

With Qwen 3.5 27B I get tg around 14 tok/sec. It decreases a few tok/sec at long ctx like 30k+.

I power limit my cards to 100W so pp runs slower about 450 tok/sec. iirc at full 170W it ran faster like 650 tok/sec.

I submitted results with other models to Localscore as well:

https://www.localscore.ai/result/3062

https://www.localscore.ai/result/3063

https://www.localscore.ai/result/3064

[-]

masterlafontaine@reddit

Thank you! Yes, indeed, a very nice setup. I have one rtx 3060, and I intend to add an rtx 5060ti with 16gb.

[-]

Similar-Republic149@reddit

It's not worse than a 3060 but it is worse than an AMD MI50

[-]

suprjami@reddit

I own three 3060s. It is worse than a 3060.

[-]

Makers7886@reddit

I'm looking at my 3090s rn like they are the bruce willis of gpus

[-]

DataGOGO@reddit

It isn't the card.

[-]

Sound_and_the_fury@reddit

Yeah dam...was thinking of getting one should some stats alighen but not impressed

[-]

Pablo_the_brave@reddit

The same fillings. I was seriously thinking to buy two-three of them for my job but this... Thx OP for sharing. Currently in my home lab I have hybrid setup with Vulkan, 5070Ti+iGPU 780M and with Qwen3.5 27B i4_XS, kvcache 8_0, 85k context I've got prefill 700-300 t/s and decoding 14-11t/s (the lower for full context)...

[-]

Monad_Maya@reddit

That's kinda low for a single user single GPU scenario. I hope it's just a software optimization issue.

[-]

Hytht@reddit

Someone got the same 13 tk/s tg with even larger dynamic FP8 quant https://forum.level1techs.com/t/intel-b70-launch-unboxed-and-tested/247873/2
Something is wrong in this setup.

[-]

lawldoge@reddit

I'm going to say it's just the lack of maturity on the software side. Output between my B60's and B70's are nearly identical, assuming the models fit into the memory pool of the cards. Considering the performance capability of the B70 approaches double the B60 on paper, <10% difference between them in real world use under different versions and forks of vLLM says there's a lot left to be ironed out and tuned.

[-]

simracerman@reddit

Put in perspective. In main llama.cpp, my single 5070 Ti with iGPU offload does 16 t/s at empty and ~12-13 t/s at 64k context.

I had higher hopes for this intel variant. On paper it should be slightly slower than my 5070 Ti when both models are in VRAM.

[-]

lawldoge@reddit

Suppose I should go back around and check again, seem to remember them being above the 5070Ti but below the 5080 on paper (under ideal conditions).

[-]

simracerman@reddit

If that’s the case, intel drivers are the culprit which is a bigger problem IMO

[-]

lawldoge@reddit

Their software stack is a painful experience. I have both B60's and B70's and am really hoping for some maturity so the cards can finally shine. Being locked to specific versions of vLLM, or llm-scaler, or any of their dependencies is awful. llm-scaler is built on vLLM 0.14. Intel has a v0.17.0 in their docker repo, but nothing is validated. vLLM upstream is on v0.19 but seems to be dealing with regressions and doesnt necessarily perform well out-of-the-box. Qwen3.5 works on some of the lower releases but has no supported tool calling. On vLLM v.019 it does nothing but hallucinate. On v0.17 it core dumps on first token. Half the stack is delivered through install scripts, the other half through the repos. They have the docker images, but even those aren't necessarily being kept up to date and the best way is to pull the build files and customize it.

They also haven't gotten power tuning figured out yet. My 60s idle at about 40W. My 70s seem to idle around 90W.

It's all over the place right now. Not to say the product line doesnt have a future, but they certainly do have an uphill battle. They're not delivering a turn-key product by any means.

[-]

simracerman@reddit

Aside from premature hardware failures, You just went through my biggest nightmares of acquiring PC parts. Idle at 90W is a crime..

[-]

bennyb0y@reddit

Agree. Let’s go Intel we are all rooting for you.

[-]

seamonn@reddit

I still have PTSD from how quickly they abandoned Intel IPEX.

[-]

stormy1one@reddit

You and me both. IPEX made my Arc 170T scream

[-]

Puzzleheaded_Base302@reddit (OP)

i hope that is the case. otherwise, this card won't fly in datacenter either, even if they can beat on token/dollar figure of metric.

[-]

Puzzleheaded_Base302@reddit (OP)

LM Studio (llama.cpp vulkan) results in case people want to compare.

single gpu

concurrency 1

model	test	t/s	peak t/s	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048	454.01 ± 27.17		5034.88 ± 185.80	4145.24 ± 185.80	5034.88 ± 185.80
qwen3.5-27b	tg512	11.87 ± 0.01	19.67 ± 2.05

Concurrency 2

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c2)	320.37 ± 3.51	170.79 ± 6.93			11534.06 ± 383.42	11067.92 ± 383.42	11534.06 ± 383.42
qwen3.5-27b	tg512 (c2)	16.79 ± 3.72	8.45 ± 1.88	27.67 ± 4.78	17.67 ± 1.70

Concurrency 4

model	test	t/s (total)	t/s (req)	peak t/s	peak t/s (req)	ttfr (ms)	est_ppt (ms)	e2e_ttft (ms)
qwen3.5-27b	pp2048 (c4)	314.58 ± 5.19	93.29 ± 18.06			21316.60 ± 3255.12	20844.93 ± 3255.12	21316.60 ± 3255.12
qwen3.5-27b	tg512 (c4)	25.54 ± 0.21	6.90 ± 0.25	46.00 ± 0.82	16.67 ± 1.60

[-]

DistanceAlert5706@reddit

Crazy. It's 2 times slower than RTX5060ti for single use. Support is not there. And enforce eager in vLLM command so it's not using any graph optimizations.

[-]

This_Maintenance_834@reddit

removing the —force-eager argument will crash vllm. i don’t know if this card support the graph function or it is due to lack of driver support.

vllm literally complains it does not know how to build the graph. this is a intel fork of vllm. so, if the official intel fork doesn’t support it, who supports it?

[-]

lawldoge@reddit

Seems that it's slowly coming in, pyTorch 2.11 looks to be the first implementation of it. Quick reading suggests that it doesn't support parallelism yet, and need to wait for everything else to catch up now that it's somewhat available.

[-]

fallingdowndizzyvr@reddit

Here are the numbers from my A770 and Strix Halo. Which both are faster for TG and not that much slower for PP. Which is why I don't use my A770s anymore. Since the Strix Halo is pretty much like a 128GB A770. Which also why I probably won't get the B70. Since Strix Halo is pretty much like a 128GB B70.

A770

| model                          |       size |     params | backend    | ngl | fa | dev          | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | ---: | --------------: | -------------------: |
| qwen35 27B Q4_0                |  14.87 GiB |    26.90 B | Vulkan     |  99 |  1 | Vulkan0      |    0 |           pp512 |        285.39 ± 0.25 |
| qwen35 27B Q4_0                |  14.87 GiB |    26.90 B | Vulkan     |  99 |  1 | Vulkan0      |    0 |           tg128 |         13.04 ± 0.01 |

Strix Halo

| model                          |       size |     params | backend    | ngl | fa | dev          | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------ | ---: | --------------: | -------------------: |
| qwen35 27B Q4_0                |  14.87 GiB |    26.90 B | ROCm,Vulkan |  99 |  1 | Vulkan1      |    0 |           pp512 |       291.09 ± 11.54 |
| qwen35 27B Q4_0                |  14.87 GiB |    26.90 B | ROCm,Vulkan |  99 |  1 | Vulkan1      |    0 |           tg128 |         13.32 ± 0.03 |

[-]

Puzzleheaded_Base302@reddit (OP)

I think the performance will improve in a few months. The memory bandwidth of Strix Halo is quite less than B70. If kernels are fully optimzed, B70 should outperform strix halo, but not now.

[-]

fallingdowndizzyvr@reddit

It's been years for the A770. Still waiting. It also has better memory bandwidth than the Strix Halo. But you can't tell that from it's performance.

[-]

Thanks-Suitable@reddit

Its crazy how much the drivers matter for single concurrency scenario token generation! Hope the cards are actually available in stores in europe aswell so we can try fixing the software support :) or at least contribute

[-]

RaDDaKKa@reddit

The cards are in stock in Poland, but it looks like there’s a 10-day lead time

https://www.morele.net/karta-graficzna-intel-arc-pro-b70-32gb-gddr6-33p01ib0bb-15926398/

[-]

ea_man@reddit

Here's it is cheaper: https://www.dustin.dk/product/5020089421/arc-pro-b70-ai-workstation

[-]

Nvclead@reddit

cant order without an organisation number

[-]

ea_man@reddit

Can't you order as a guest?

[-]

Nvclead@reddit

nope, it asks for the orgnanisation number too

[-]

Pablo_the_brave@reddit

In EU also in proshop for similar price.

[-]

Thanks-Suitable@reddit

Its true but its around 1250 eur, so its closer to the AMD card at 1500 with solid drivers :/ But Thanks!

[-]

TheBlueMatt@reddit

There's definitely some trivial driver and optimization headroom, but we'll see how far it goes. With some trivial patches going upstream that shouldn't make a huge difference and the mesa opts from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162 on a single Arc Pro B60 using unsloth/Qwen3.5-27B-GGUF:Q4_0 (which I assume is what you used - its probably similar to the OP at least), I get concurrency 1 tg512 15.87 ± 0.40.

[-]

libregrape@reddit

It's crazy, how I literally get the better result (\~800ts on pp, and \~25ts on tg) with rtx 5060 ti 16GB + CUDA + llama.cpp in single-user scenarios. What a disappointment. I hope that Intel fixes their software.

[-]

brosvision@reddit

What quants?

[-]

libregrape@reddit

IQ3_XXS

[-]

brosvision@reddit

Is it any good? Do you use it for coding?

[-]

libregrape@reddit

It is amazing for me. Way beyond my expectations for such a model. And miles ahead of the MoE models, and even Qwen 3.5 27B for my usage. I don't use llms much for coding these days tho, so no, can't tell you much about that..

[-]

MiniCactpotBroker@reddit

even my old 3090 is miles ahead

[-]

Otherwise-Host9153@reddit

I did opus tune a little bit the llamacpp code - that's what i was possible to get right now:

Our result (llama.cpp SYCL b70-tuning, Qwopus3.5-27B Q4_K_M, B70):

  - pp2048: 
687.85 ± 2.88 t/s

  - tg512: 
22.47 ± 0.00 t/s


on

Qwopus3.5-27B-v3-Q4_K_M.gguf

[-]

Puzzleheaded_Base302@reddit (OP)

this result is not bad. could you share the command line for llama.cpp?

[-]

Otherwise-Host9153@reddit

  
Server (Prod-Config, korrekt + fast, ≤32k ctx, f16 KV):

  ONEAPI_DEVICE_SELECTOR=
'level_zero:0'
 \                                                                                                
  /home/llm/llama.cpp-main/build/bin/llama-server \                                                                                      
    -m /home/llm/.cache/huggingface/models/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-Q4_K_M.gguf \                                            
    --host 0.0.0.0 --port 8181 \                                                                                                         
    -ngl 99 -fa on -ub 2048 -c 32768 \                                                                                                   
    --parallel 1 -dev SYCL0 --jinja --reasoning-budget 0


Server (long ctx bis 96k, q8 KV eats ~20% tg):

  ONEAPI_DEVICE_SELECTOR=
'level_zero:0'
 \                                                                                                
  /home/llm/llama.cpp-main/build/bin/llama-server \                                                                                      
    -m /home/llm/.cache/huggingface/models/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-Q4_K_M.gguf \                                            
    --host 0.0.0.0 --port 8181 \                                                                                                         
    -ngl 99 -fa on -ub 2048 -c 98304 \                                                                                                   
    -ctk q8_0 -ctv q8_0 \                                                                                                                
    --parallel 1 -dev SYCL0 --jinja --reasoning-budget 0


Bench (Depth-Sweep, f16 KV):

  ONEAPI_DEVICE_SELECTOR=
'level_zero:0'
 \                                                                                                
  /home/llm/llama.cpp-main/build/bin/llama-bench \                                                                                     
    -m /home/llm/.cache/huggingface/models/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-Q4_K_M.gguf \                                            
    -p 2048 -n 32 -d 0,8000,16000,32000 \                                                                                              
    -ngl 99 -fa 1 -ub 2048 -r 2 -dev SYCL0


Bench (Long ctx, q8 KV):

  ONEAPI_DEVICE_SELECTOR=
'level_zero:0'
 \                                                                                                
  /home/llm/llama.cpp-main/build/bin/llama-bench \                                                                                       
    -m /home/llm/.cache/huggingface/models/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-Q4_K_M.gguf \                                            
    -p 2048 -n 32 -d 0,16000,32000,64000,96000 \                                             
    -ngl 99 -fa 1 -ub 2048 -ctk q8_0 -ctv q8_0 -r 2 -dev SYCL0

[-]

Otherwise-Host9153@reddit

the 23 t/s was with q_0 quant - not q_k_m

[-]

Pablo_the_brave@reddit

And with 100k context?

[-]

Otherwise-Host9153@reddit

Merged+patched q8_0 KV deep-depth bench:


  ┌─────┬────────┬───────┐                                                                                                               
  │  d  │ pp2048 │ tg32  │                                                                                                               
  ├─────┼────────┼───────┤
  │   0 │    923 │ 19.83 │                                                                                                               
  ├─────┼────────┼───────┤                                                                                                             
  │ 16k │    517 │ 14.30 │
  ├─────┼────────┼───────┤                                                                                                               
  │ 32k │    353 │ 11.19 │
  ├─────┼────────┼───────┤                                                                                                               
  │ 64k │    213 │  7.90 │                                                                                                             
  ├─────┼────────┼───────┤
  │ 96k │    150 │  6.05 │
  └─────┴────────┴───────┘


Side-by-side f16 vs q8 KV (alle merged+patched):


  ┌─────┬─────────────┬─────────────┐                                                                                                    
  │  d  │ f16 pp / tg │ q8 pp / tg  │                                                                                                  
  ├─────┼─────────────┼─────────────┤                                                                                                    
  │   0 │ 925 / 19.90 │ 923 / 19.83 │                                                                                                  
  ├─────┼─────────────┼─────────────┤
  │ 16k │ 517 / 17.55 │ 517 / 14.30 │
  ├─────┼─────────────┼─────────────┤                                                                                                    
  │ 32k │ 352 / 15.58 │ 353 / 11.19 │
  ├─────┼─────────────┼─────────────┤                                                                                                    
  │ 64k │         OOM │  213 / 7.90 │                                                                                                  
  ├─────┼─────────────┼─────────────┤                                                                                                    
  │ 96k │         OOM │  150 / 6.05 │
  └─────┴─────────────┴─────────────┘

[-]

Capital_Evening1082@reddit

Qwen3.5-27B-FP8 runs at 29t/s on 2x AMD R9700 for a single request. 524t/s at concurrency 32.
This is the league the B70 should play in. Less than 10t/s an concurrency 1 and 200t/s at concurrency 32 hints at a massive software issue.

[-]

fallingdowndizzyvr@reddit

That's how Intel rolls. It was the same with the A770. It should have been 3070/3080 performance based on paper, it was 3060 in reality.

[-]

munkiemagik@reddit

Well just maybe this is a chance to get hands on a relatively cheap product because they suck (sorry intel, you are trying and sincerely thank you for that)

But if/when they fix up, the price on these is surely going to skyrocket just like everything else due to demand because everyone and their granny will be trying to get one (or two or four)

[-]

mr_zerolith@reddit

Bought fourth tier hardware, got sixth tier performance

[-]

LocalLLaMa_reader@reddit

Are you intending to continue with llama.cpp or VLLM, not that you managed to set it up? Why?

Thank you so much for sharing and taking the plunge. Let's hope Intel indeed improves their software...

[-]

fallingdowndizzyvr@reddit

Let's hope Intel indeed improves their software...

Don't hold your breath. I'm still waiting for my A770s to reach their paper potential.

[-]

This_Maintenance_834@reddit

i kind of intended to sell the card. i have already got RTX PRO 4500.

this was a cheaper way to get 128GB VRAM, but the token rate is so horrible that I have no use of it even if I get to 128GB with four. This card with presnet driver support is only financially making sense in datacenter when concurrency is high.

[-]

D2OQZG8l5BI1S06@reddit

OpenVINO backend is faster.

[-]

Monkey_1505@reddit

That's the speed my mobile amd dgpu pushes out for tg when i'm using an moe that doesn't entirely fit in vram. NGL if I brought this card, I'd feel pretty bad about that.

[-]

an0maly33@reddit

Exactly what I was thinking. "I get that with my 8gb 3070 while it thrashes between vram and system ram."

[-]

Ok_Try_877@reddit

On the NVFP4 model of 27B I get 300 t/s+ aggregated output, running batches of 14 with 30K contexts and over 4000 t/s prompt processing with 2x 5060ti. They idle at 5w each and max out at 110 to 115w each without changing any voltage/power settings.

[-]

Winter_Tension5432@reddit

Hey, could you tell me what Runtime and config you're using?

[-]

Ok_Try_877@reddit

Hi,

I compiled VLLM from source with Cuda 13.1. I think there is a small patch/script you have to run as well to get native NVFP4 working with consumer Blackwell cards

CUDA_VISIBLE_DEVICES=0,1 vllm serve Kbenkhaled/Qwen3.5-27B-NVFP4 --tensor-parallel-size 2 --max-model-len 30000 --max-num-seqs 14 --skip-mm-profiling --gpu-memory-utilization 0.92 --kv-cache-dtype fp8

What's also a bit weird is this model has been taken off HuggingFace, or I can't find it..... I'm sure some of the others that have appeared since work just as well.

[-]

Tensor-parallel-size equals (more or less) the number of GPUs you have.