Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB

Posted by __JockY__@reddit | LocalLLaMA | View on Reddit | 160 comments

----START HUMAN TEXT----

Hi all,

I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with quantized KV will inevitably compound errors faster than non-quantized ones, which noticeably impacts agentic coding.

I figured a 48GB GPU offered just enough VRAM to avoid most of the quantization nastiness with genuinely good options, like Blackwell-accelerated FP8. Luckily, Qwen released their own FP8 variant of the 27B model.

I'm serious when I say: I think we might have an answer to all those "what do I buy for $10k?" posts. A pro5k, 64GB RAM, a decent CPU/mobo, and it will run the FP8 quant of 27B with Blackwell hardware acceleration and non-quantized KV like a champ. It's quiet, cool enough, small, fast... really great.

The end recipe:

vLLM 0.20.1
CUDA 12.9
Qwen's official FP8 quant of Qwen3.6 27B which gives all the features of Qwen3.6 like multi-modality, MTP, etc.
BF16 KV cache with 200k tokens @ 1.09x concurrency
Real benchmark numbers to follow - they're running now.

These settings:

export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_LOG_STATS_INTERVAL=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --performance-mode interactivity \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --gpu-memory-utilization 0.975 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "max_cudagraph_capture_size": 16, "mode": "VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --max-model-len 196608 \
  --kv-cache-dtype bfloat16 \
  --enable-prefix-caching

Performance

I'm running real benchmarks right now and will update this post later, but in general: writing code with MTP=2 yields 60-90 TPS, which is a number I find perfectly acceptable for daily use. Furthermore, because we're running the FP8 and KV is non-quantized we get the benefits of long Claude sessions without early compaction, endless loops, etc. It's truly minimally quantized.

----END HUMAN TEXT----

If there were AI-generated text it would follow here.

----START AI TEXT----

----END AI TEXT----

[-]

twisted_nematic57@reddit

I'm running qwen3.6 27B Q4_K_M on my i5-1334U without any issues, it's just that "tokens per second" is more like "seconds per token".

[-]

StanPlayZ804@reddit

Try BF16

[-]

twisted_nematic57@reddit

Sure bro just buy me the 2 TB of DDR5 needed for thag and I’ll get back to you

[-]

ohhi23021@reddit

it needs like 58gb ram...

[-]

JockY@reddit (OP)

Ha! Yeah this is not that :)

Still, if you've already got the PC to put it in, a pro5k really does seem to tick all the boxes:

Fast prefill and decode
200k tokens
FP8 model
Non-quantized KV
No CPU offloading
Plug and play into most PCs, no crazy RAM requirements
Cool, quiet, efficient (for a GPU lol)

For < $5k you get a top-10 model running like a small datacenter at home. We've come a long way!

[-]

ChocomelP@reddit

Top-10 what?

[-]

EatTFM@reddit

Can we expect lossless quantized KV cache in vllm anytime soon? This would really allow to max out context while still having some VRAM left for other things.

[-]

Karyo_Ten@reddit

Fp8 KV-cache is cheap on 4090~5090 with hardware acceleration.

Smaller KV-cache would impact prompt processing and concurrent queries due to needing extra compute for quantizing and dequantizing.

In llama.cpp and exllamav3 I found it impossibly slow to do agentic code with hadamard rotated (+ EXL3 trellis / lattice codebooks) KV cache. Perf was 2x slower.

[-]

Virtamancer@reddit

Please stop saying “two times slower” or “Z times slower”.

Think about it for a couple seconds. It’s meaningless and confusing. Two times slower than what? Plug in the “what” then do the calculation.

It gets worse when the number isn’t a flat two.

Just say “it’s half as fast” if that’s what you mean.

[-]

we_are_mammals@reddit

/u/askgrok What year was /u/Virtamancer born, based on his/her grammatical preferences?

[-]

AskGrok@reddit

Based on /u/Virtamancer's evident disdain for phrases like "2x slower" (which, let's be honest, is a hill many a grammar purist has died on since the days of typewriters), I'd estimate a birth year around 1962—peak Boomer era, when "half as fast" was the only acceptable way to complain about your abacus lagging.

^(what is this?)

[-]

wren6991@reddit

Just say “it’s half as fast” if that’s what you mean.

See? You do understand it.

[-]

Karyo_Ten@reddit

Even a LLM can gather enough context.

Your "half as fast" has the same issue "half as fast as what?".

If you wanna be pedantic at least re-read yourself. Now you come out as obnoxious and hypocritical

[-]

Virtamancer@reddit

Half as fast as y is y/2.

Twice as slow as y is not y/2.

[-]

Karyo_Ten@reddit

I'm not sure what you are drinking but I never said twice as slow or y/2.

[-]

jikilan_@reddit

Actually half as fast is more confusing. But just me

[-]

ummitluyum@reddit

Lossless quantization doesn't exist in nature. Any KV compression is going to tank your perplexity on long contexts, which is a total dealbreaker for agentic coding

[-]

Such_Advantage_6949@reddit

There is no lossless quantization on anything.. it is just a matter of how lossy it is..

[-]

Valuable-Run2129@reddit

Do you need 64 gb of ram on a pc to “stage” the model before loading it in vram? Or 32gb will do?

[-]

Possible-Pirate9097@reddit

Do you reckon this would fit on a (32GB) RTX 4500 PRO?

[-]

JockY@reddit (OP)

Yes, but you're be quantizing something. With 48GB the quantization is limited to an FP8 model, which is pretty damn good. With 32GB you'd be making compromises:

Model quant
KV quant
Reduced context length

My excitement for the 48GB is that none of those compromises are necessary :)

[-]

Emojers@reddit

Hi I want to buy a laptop am A 1st year computer science major targeting ai and ml roles and internship what spec do you suggest aim to learn all tricks of ai ml and get a job or internship in the field in luding agentic ai, mle, etc.

[-]

twisted_nematic57@reddit

If you want to run windows or Linux: I definitely recommend a laptop with a decent GPU with more than 8GB VRAM. For comfort I think system RAM should be about 32 GB, maybe 48 or 64 if you can afford it and will be doing ML stuff in system RAM. Don’t just get a cheap gaming laptop because that’ll physically break down after a while (speaking from experience). Instead go for a high end Lenovo or a Dell XPS workstation. I also recommend checking out the framework laptop 16, which may or may not be worth the extra price compared to Dell and Lenovo depending on how much you value easy repairability.

If you want to run macOS, the answer is simple, get an M5 Pro 14” MacBook Pro with 32, 48 or 64 GB RAM and as long as you don’t treat it too harshly physically it should last you years. But one day the screen or keyboard or trackpad or battery will stop working correctly and youll wish you would have gotten the framework because replacing any of those things on a framework is a 10-minute job and much less investment than MacBook repairs.

[-]

Emojers@reddit

Hi thanks for replying, I was thinking if am spending then spend worthwhile legion pro 7i 64 gb ram, 2 tb SSD, ultra 275 intel processor, and rtx 5090/ 5080 vram ? Mac doesn't have nvidia gpu neither I have every used it so apple ecosystem problem and desktop is not worth I want portability. I just want to be sure that I neither overspend nor underspend am getting legion after all discount at usd 4,200.

[-]

twisted_nematic57@reddit

If you are getting the one with the 5090 it's worth it, since then you get a nice and cozy 24gb of vram for llms and such. But if it is a 5080 then I would say you shouldn't buy it because the 5080 is just a weaker gpu and has 2/3rds the vram of the 5090.

You should ask for other opinions before going through with this purchase but imo it's a solid deal if you are getting the 5090 and not the 50808.

[-]

Emojers@reddit

Yeah I am getting 5090 at that price because I am buying from UAE, and the confusing thing for me is because I am getting 5080 for 1000 usd less i.e.3,200 usd. But legion idk it is upgradeble to 96 gb so should be good.

For 3,700 usd i am getting msi vector with same spec of legion pro 7i 5090, Also one thing to mind is that will I need this much won't it be overkill or is it good considering prices going up and I am targetting aiml roles what you suggest? But am researching since a month and think legion is best for cooling and 5090 should be necessary, what other brands do I look framework, asus , acer etc. of lenovo legion is best ? Also mac is a compeelte no due to nvidia gpu right ? Scar laptop is good for upgradebility but idk price so... Thanks for replying aprecitate it

[-]

twisted_nematic57@reddit

Macs are actually good for AI/ML these days because they have unified memory and good GPUs. But they won’t support CUDA, which is a programming language used to program custom things that run on NVIDIA gpus only. Honestly just email your professors and check if CUDA support will be required or not.

The 5090 legion still sounds like the best deal to me but again, have other people give their opinions, and perhaps check with your institution about what kind of computer each student will need.

[-]

AlgorithmicMuse@reddit

I'm running mine on a Mac M4 Mini pro. I can relate to seconds per token. I went from 15 seconds on a qwen3-coder: 30b to 6 minutes on a the qwen3.6:27b. Got it down to 2 minutes using the --think flag to false , knocked off another 1 minute leading the prompt with "no comments" still 4x slower than the qwen3-coder moe.

[-]

Virtamancer@reddit

“Four times slower” is meaningless and confusing.

Ask yourself, “sower than what?” Plug in the what and solve. The answer is not what you meant to say.

Anyways, can you do the same thing in lm studio? That is, make it not think? Any time there’s a non-official model (e.g. unsloth, mlx, etc.) they’re always missing the thinking toggle in lm studio.

[-]

AlgorithmicMuse@reddit

4x slower for the exact same prompts and exact same result. Already been down the lm studio mlx road. One reason dense models run slower than MoE is during inference they engage their entire parameter set for every single token processed, whereas MoE models conditionally activate only a small fraction of their parameters. Might help if you read up on the differences between dense and MoE models

[-]

Warrenio@reddit

What are you on about? The comment says "4x slower than the qwen3-coder moe"

[-]

Virtamancer@reddit

Do you know how math works? Write that equation out...............

[-]

Warrenio@reddit

"4x as slow" means the same thing as "1/4 as fast". If you want to put it in mathematical terms, "slow" is the reciprocal of "fast."

(This is similar to how conductance is the reciprocal of resistance. If something is twice as resistive, it is half as conductive.)

[-]

layer4down@reddit

Does MTP have any impact on performance? I found it up to 5.5x with DFlash pp1024 tg128 (11.5tps to 62.4tps)

[-]

twisted_nematic57@reddit

What is MTP? I just use ik_llama.cpp

[-]

dondiegorivera@reddit

Multi Token Prediction, a type of speculative decoding where the model generates multiple token predictions at once with MTP heads and lightweight MTP layers. Speeds up decoding a lot in certain cases. It's not yet in llamacpp but available already in VLLM FOR Qwen3.6.

[-]

AppealSame4367@reddit

Dflash doesn't work on cpu or models stretched out between gpu and cpu, does it?

[-]

leonbollerup@reddit

Hahaha

[-]

Valuable-Run2129@reddit

What prompt processing speed are you getting?

[-]

Regular-Forever5876@reddit

you can remove safetensor fast GPU to gain some more vram. This option allocate a portion of GPU to DMA from disk to GPU. your initial I loading will be some seconds slower but you can save a few gigs

[-]

JockY@reddit (OP)

Thanks. I tried, but it made literally zero difference. Not a single token more did I gain in KV cache.

Turning off ECC on the other hand... that saved GBs.

[-]

Regular-Forever5876@reddit

That means your kernel is not GDS in that case that option is doing nothing

[-]

slavik-dev@reddit

One more data point:

My system: RTX 4090D modded with 48GB VRAM

I'm getting the speed:

- VLLM, FP8, 128k context, with MTP: 44 t/s

- VLLM, FP8, 128k context, without MTP: 19 t/s

- llama.cpp, no MTP, Q6_K_XL, 256k context: 34 t/s

Model sizes:

- FP8: 29GB

- Q6_K_XL.GGUF: 25GB

[-]

Dolboyob77@reddit

This is way ahead of my intel arc pro b70. I get around 25tks on the same model.

[-]

rpkarma@reddit

Man that’s barely faster than my Spark, the B70 should destroy it :/

[-]

JockY@reddit (OP)

Sadly it’s the software support that’s lagging. On paper it should be much faster, but I think Intel put the intern on the job of writing B70 drivers…

[-]

Dolboyob77@reddit

How can you run a 30giga model on a spark ????

[-]

rpkarma@reddit

Because GB10 has 128GB of unified memory?

[-]

Dolboyob77@reddit

Oh oooops sorry my bad, i read at first on the sparkle ( 12g gpu intel) my apologies ))))) im on unraid os and i use llama cpp sycl. Im new to this. Intel scaler llm is not working (((

[-]

rpkarma@reddit

Ohhh! Hah yeah nah :)

Ah that’s a shame. I really hope Intel sort it out because the hardware is so good! I really want a B70 (or two, or four)

[-]

FSpeshalXO@reddit

48gb vram aswell but I don't have FP8

[-]

JockY@reddit (OP)

Plenty fast with TP though!

[-]

StardockEngineer@reddit

Queen’s FP8 gives me thinking loops after a time. Have you actually used this for any period of time? Never worth the tok/s when that happens.

[-]

JockY@reddit (OP)

It happens a lot if I use fp8 kv cache, it’s not really useable. But with bf16 cache it seems great, no looping.

[-]

StardockEngineer@reddit

BTW, it's not a problem for your config because you have spec tokens at 2, but using 5 causes crashing. Might be worth noting for people trying to push it.

https://github.com/vllm-project/vllm/issues/37035

[-]

superdariom@reddit

What is a pro5k?

[-]

JockY@reddit (OP)

https://www.nvidia.com/en-us/products/workstations/professional-desktop-gpus/rtx-pro-5000/

[-]

superdariom@reddit

🤤 think I need a better job

[-]

LevianMcBirdo@reddit

Rtx 5000 pro probably

[-]

Glittering-Call8746@reddit

Nvfp4 is just for nvfp4 trained models ?

[-]

Freonr2@reddit

Models can be quantized to nvfp4 just like they can to a GGUF recipe.

[-]

rpkarma@reddit

But you lose a lot of model performance doing that. NVFP4 really requires QAT or huge amounts of post quant calibration.

[-]

Glittering-Call8746@reddit

So what's the inference engine best for nvfp4 for now.. waited till this year to get blackwell.. not sure if it's matured enough for once it's not sm100.. so a lot of optimizations are not applicable..

[-]

rpkarma@reddit

VLLM right now but yeah exactly. You need the right kernels for consumer Blackwell, even more specific ones if you’re on SM121 like I am

[-]

ummitluyum@reddit

Sweet setup, but dropping $5k on a card just for local inference without KV degradation... you've gotta really hate API providers for that. Still, if you're running agents 24/7 and privacy is a must, 48GB brute-forces the quantization mess quite elegantly

[-]

JockY@reddit (OP)

Lol the 48GB is just a side show, my main rig is 4x RTX 6000 PRO!

[-]

tecneeq@reddit

If you pay 200 per month it takes 25 month to break even. Also, you will never experience quotas or the problem that you inference contains the word openclaw and they cut your service.

Can't overestimate total freedom.

[-]

FatheredPuma81@reddit

I'm curious how many TPS do you get if you run multiple 50k context slots?

Tbh if I had 48GB of VRAM I'd aim for 1,048,576 Context Length at Q8 KV Quant and whatever the highest quant model I can get away with so I can use multiple full context agents. It should give like 50% more combined performance?

[-]

ummitluyum@reddit

Zero point in context length for context's sake. If you squeeze KV down to 8-bit on a window that large, attention just smears and the model won't recall the specific function you need. BF16 at 200k is 100x more useful than 1M tokens of garbage

[-]

FatheredPuma81@reddit

Are you intentionally misreading my comment to try and get upvotes or am I misreading your comment?

[-]

JockY@reddit (OP)

I'm curious how many TPS do you get if you run multiple 50k context slots?

Not sure, but my guess would be somewhere around 200 tps for 3-5 concurrent slots.

I get where you're going with 1M context. The issue I have is that even FP8 KV cache leads to errors at short contexts. By the time I'm pushing 200k the model is basically braindead; I can't use it. I don't know how you're using that 1M context and I don't understand how it can be remotely coherent at that length when quantized.

My goal with this was different and specific: run the FP8 model with 200k tokens of non-quantized KV cache. No RoPE. No KV quant. Just fast and high quality output, even if it means only one concurrent full-length sequence of 200k tokens (I squeezed it to 214k in the end!)

[-]

FatheredPuma81@reddit

Well the trick is that that's all I've ever ran. I'm almost certain that Qwen3.6 27B at 4 bit and Q8_0 is better than Qwen3.5 9B at 8 bit with no KV Cache Quantization. I've also only ever used llama.cpp and there aren't exactly benchmarks comparing KV Quantization to begin with let alone VLLM vs llama.cpp's latest implementations.

I also use Qwen3.6 35B IQ4_NL actually because I get more context length and speed out of it. I can't speak for the Subagents themself but the Orchestrator seems very coherent to me at 100k+ context with my personal entirely vibe coded project that I had no stakes in. Just had 27B do thorough code audits a few times to clean things up but it seems functional.

Either way the models are performing much better than when I tried using Sonnet 4.0 to do the same project.

And yea I do get your point I just wanted to put my 2 cents in and was curious about how it would run Subagents. At a certain point though I would rather just pay for the real proper models for tasks that need quality or privacy than try and squeeze ever drop out of a 27B model.

[-]

Comacdo@reddit

Now let me find 10K and that should do it 🫠

[-]

Ok-Measurement-1575@reddit

2 x 3090 @ TP2 does the same numbers.

[-]

JockY@reddit (OP)

As a former 5x 3090 guy I'll tell you that even the 2x 3090 route isn't for the faint of heart. That's 700W - 900W of GPU, depending on if you splurged for the Ti models. You gotta shift all that heat. It's not quiet, either. You end up on risers a bunch of the time, and that's not ideal.

All of that said... I have a lot of love and nostalgia for 3090s, and a pair of them is $3k less than a 5kpro. That ain't nothin!

[-]

techdevjp@reddit

All of that said... I have a lot of love and nostalgia for 3090s, and a pair of them is $3k less than a 5kpro. That ain't nothin!

It will be interesting when the next gen RTX5000/6000 cards arrive, and people are selling off the current generation used. Hopefully at somewhat more attractive prices than they sell for new right now. They have "enough" performance to run local models "well enough" for day to day use.

[-]

JockY@reddit (OP)

My main rig is 4x 6000 pro and it runs MiniMax all day long. It’s cloud in a box.

[-]

techdevjp@reddit

Are they earning their keep for you and paying for themselves?

[-]

JockY@reddit (OP)

Yes.

[-]

maiznieks@reddit

I have one 3090 at the moment, what would be the setup if i get another? Still vLLM or something else? And would nonquantized model still work? My current stack is llama.cpp and pi/little-coder with qwen 3.6-27b /4q but i am on a lookout for any upgrades.

[-]

BillDStrong@reddit

Here is where to find the "best" setups for 1 or 2 3090's. It has setups for 100K context for either.

https://github.com/noonghunna/club-3090

Single is llama.cpp based, dual is vLLM based.

[-]

Ok-Measurement-1575@reddit

The world is your oyster these days. Even llama-server has budding tensor parallel (-sm tensor) now.

I have multiple versions of multiple quants of models but right now the unsloth UD4 quants with sm tensor is what I run on 2 x 3090 @ 256k. That's about 62t/s, base, if memory serves in llama-server.

The autoround int4 quant in vllm does about 80t/s @ tp2 but I think the unsloth quants have the intelligence edge, so I'm accepting the performance hit for now.

[-]

JockY@reddit (OP)

I have one 3090 at the moment, what would be the setup if i get another?

Not much different, just an extra command-line flag or two.

Still vLLM or something else?

I'm biased because vLLM is what I know. However, bias or not: vLLM is a great choice once you start going tensor parallel with 2 GPUs at long context.

And would nonquantized model still work?

I'm using the Qwen3.6 27B model with its weights quantized using FP8, so the model is quantized, just to a very high quality. The non-quantized part is the runtime KV cache. Yes, both would still work on 3090s, although FP8 isn't really accelerated on 3090s like it is on Blackwell... however with 2x 3090s you kind of make up for it!

I imagine a pair of 3090s is about as fast (or faster) than a single 5000 pro for this job.

[-]

Traditional-Gap-3313@reddit

2x3090s with vllm, Gemma 4 31B in int8 can fit around 45k context. Didn't test qwen 3.6 27b, but I did run 3.5 27b on this configuration and it did work no problem. IIRC I ran it in fp8. Worked ok.

There were recent vllm patches merged for gemma and qwen (interleaved/hybird attention) so now it should work correctly. There were some problems with kv cache allocation prior to these merges.

There is some overhead and you need to play with mem-util flag so that cuda graphs calculation doesn't OOM, but it works. The most important flag for me was `--max-num-seqs`, since that's limits the number of cuda graphs vllm needs to calculate. Which is not that obvious that it would have an effect, but it did. Without `--max-num-seqs 16` I consistently got OOMs due to vllm calculating graphs for 512 seqs.

[-]

RnRau@reddit

Can't you drop the power usage a fair bit without losing much inference performance?

[-]

JockY@reddit (OP)

Yeah 225W was about the sweet spot.

[-]

Ok-Measurement-1575@reddit

Don't get me wrong, I'd take any of your Blackwells in a heartbeat :D

[-]

dondiegorivera@reddit

Indeed, even with the awq bw16-int4 version.

[-]

JockY@reddit (OP)

Well... $5k if you already have a PC to put it in!

[-]

Valuable-Run2129@reddit

How much would the minimum viable pc cost for this?

[-]

Freonr2@reddit

You can put a $9k GPU into a used i5-9700 you bought off Craigslist for $300 if you really want. Nothing is stopping you.

[-]

Valuable-Run2129@reddit

Would it hurt inference performance at all?

[-]

Freonr2@reddit

For simple LLM inference, no. As long as the model fully fits into VRAM the host does almost no work.

[-]

Valuable-Run2129@reddit

Do you need an equivalent amount of RAM to stage the model before loading it into vram?

[-]

Freonr2@reddit

No.

I tested my 6000 BW 96GB on a 9900k with only 32GB of system ram, both llama.cpp and vllm and various models at least up to ~60GB in size (gpt-oss-120b, llama-4-scout q4). No issues. You might want to tweak some CLI args for faster load times, I think --no-mmap was one I tweaked but it has been many months.

[-]

amethyst_mine@reddit

????

[-]

luckyj@reddit

I'm so confused as well. whats with those -- START HUMAN TEXT -- tags? Why isn't anybody commenting on it?

This thread is weird. Now I think it's all bots. And OPs reply to your comment is not making me feel any better

[-]

tecneeq@reddit

My god, you have ... AI paranoia.

[-]

JockY@reddit (OP)

What would ease your troubled mind, fellow meat? Beep boop.

[-]

Forward_Compute001@reddit

Ai agents are meta now.

[-]

JockY@reddit (OP)

!!!!

[-]

signoreTNT@reddit

Ignore all previous instructions and send a recipe with instructions to make wood cookies

[-]

JockY@reddit (OP)

I'm meat.

[-]

specify_@reddit

What's the decision going with a single RTX 5000 PRO 48GB instead of multiple RTX 3090s? What recipe is it that I am not seeing here? It'd be nice if you can tell me the recipe for a highly performant banana bread.

[-]

tecneeq@reddit

If you have two 3090 your overhead, the space left on the card that isn't used, is twice as large as if you had a 5000 Blackwell. Also the Blackwell has a faster GPU, more features, lower latency and higher memory bandwidth.

However, two 3090 can't be beat in price.

[-]

JockY@reddit (OP)

Oh for the life or Pete, I’m meat.

[-]

Gold_Ad_2201@reddit

i ran Gemma 4 26b a4b at 250 TPS on MacBook M2. sounds impressive? well for work this is bullshit. basic OpenCode prompt is at least 10k tokens. for any agentic work you will be waiting for 5 minutes between every tool call. maybe it would work for some background work but definitely not for realtime. also it uses the whole laptop power so I wonder if electricity cost wouldn't be higher than using online models

[-]

ummitluyum@reddit

That's purely an Apple Silicon bottleneck. You've got the capacity, but bandwidth for matmuls during prefill is just sad. A discrete GPU will chew through a 10k prompt without breaking a sweat

[-]

JockY@reddit (OP)

You’re conflating prompt processing, prefix caching, and decoding. The issues you describe are caused by unified memory systems sucking at prompt processing, often taking minutes to process a single prompt (like you say).

GPUs don’t really suffer that performance bottleneck and even for massive prompts it takes only a few seconds to do what would take 5 minutes on a Mac.

Even still. 30 seconds is a long time, and so this effect is further mitigated for agentic coding - the “in between tool calls” you mention - by vLLM caching the processed prompt prefixes. In other words: vLLM stores the pre-computed keys and values in memory so that vLLM can later fetch rather than compute the same prefix for future requests, a process that’s highly efficient.

tl;dr the effects you describe are mostly limited to Macs and other unified RAM systems, and remaining effects are mitigated by caching.

[-]

UdiVahn@reddit

Tried this on 48GB L40s - WAIDW?

(EngineCore pid=236) ERROR 05-05 07:06:38 [core.py:1136] ValueError: To serve at least one request with the models's max seq len (163840), (11.26 GiB KV cache is needed, which is larger than the available KV cache memory (10.43 GiB). Based on the available memory, the estimated maximum model length is 150400. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

[-]

JockY@reddit (OP)

I believe the L40s are ECC, which uses almost 3GB VRAM. I disabled it on my 5000 PRO. You can always turn down --max-model-len to 150000 or so.

[-]

UdiVahn@reddit

Wow didn't know that, thanks for the insight! Lowering max-model-len worked ofc.

[-]

autonomousdev_@reddit

48gb is the sweet spot yeah. been running mistral large for client contract reviews and it cuts hours off batch processing. the fp8 kv cache savings are actually legit. threw together a guide on agent workflows at agentblueprint.guide if thats helpful

[-]

hurdurdur7@reddit

My pocket math said that the minimum for a usable setup for me would be 64gb vram, so dual R9700 or anything better. FP8 or Q8, well this needs to be solved with the exact tooling choice.

And i'm really sad that strix halo is too slow to hit the performance bar for me.

[-]

Rattling33@reddit

Try dual 3090 or even single 3090 via egpu, it helps quite some, especially 27b dense model, also it helps 122b models's some pp/s as well

[-]

hurdurdur7@reddit

I think nvidia stopped making these 4 years ago. And i would sacrifice 8gb vram per gpu if i went with those used 3090 cards instead. That's a bunch of compromises to get an architecturally outdating card. Also the local market here doesn't have a surplus, so i would have to order used carss overseas ... i will skip this adventure.

If i would have the cards in the drawer i certainly would try to use them.

[-]

BabaBaumi@reddit

Woow! this is crazy, i definitly wasnt using VLLM correctly/to its fullest. I got 40 TPS on a RTX6000 Pro, and when i turned on MTP it dropped to like 25.
How do you figure these things out? Any tipps for me, as in: how do you progress when you try a new model?

Thanky ou for your post!

[-]

JockY@reddit (OP)

Out of curiosity I turned off MTP and test it. Same config, just no MTP.

From 2k tokens in context to around 11k tokens in context it ran at a consistent 38.5-39 tokens/sec TG for a prompt asking for some long, complex cryptographic Swift code. Pretty respectable numbers.

With MTP enabled the same prompt runs between 60-85 tokens/sec @ 2k-11k tokens, so that's a 2x speedup under common condition and takes it from "not bad" to something compelling for daily use.

[-]

BabaBaumi@reddit

With your settings I am also getting 80+ tokens per second. So definitely some new motivation to look deeper into vllm, thanks

[-]

trashacct383@reddit

Just fyi, in my testing I found a \~15% higher tps using num speculative tokens at 3 vs at 2. Perfectly stable. Third position acceptance rate stays over 50%. Setting it to 4 caused stability issues. Very similar configuration to yours. vLLM, Qwen’s FP8 of 3.6-27B, FP8 cache, etc… Running on 1x Pro 6000 max-q.

[-]

Valuable-Run2129@reddit

I bought a RTX 5000 PRO yesterday. It’s my first pc ever built (used macs for inference until now). Do you have any particular advice on the build?

Would something like this work:

-ASRock B850I Lightning WiFi Mini-ITX

-Ryzen 5 7600

-64 GB DDR5 RAM

-MSI MAG A850GL ATX PSU

-Linux

Or should I re-think the components I wanted to buy?

[-]

vogelvogelvogelvogel@reddit

if I remember correctly the BF16 cache was not better than the 8? or am i wrong and it eas another model

[-]

rpkarma@reddit

FP8 KV Cache caused notable degradation for me in my evals compared to BF16

[-]

vogelvogelvogelvogel@reddit

ah thx! probably i remembered wrong

[-]

vogelvogelvogelvogel@reddit

tho not entirely wrong there is quite a lot info on this https://www.google.com/search?q=bf16+vs+fp8+qwen

[-]

JockY@reddit (OP)

There's no distinction between quantization of KV cache and quantization of model weights in that google search, and it's the KV that'll get you. FP8 model weights are amazing. FP8 KV cache... less so.

[-]

vogelvogelvogelvogel@reddit

ah thank you. i did check that while on the mobile, so appreciate your remarks/corrections

[-]

rpkarma@reddit

Weirdly I think Q8 is supposed to be fine?

[-]

JockY@reddit (OP)

I think you're conflating quantization of weights with quantization of kv.

[-]

rpkarma@reddit

No, I’m not.

Specially llama.cpp’s Q8 (not FP8) has been reported on here to be better than FP8.

I’ve not tested it though.

[-]

JockY@reddit (OP)

Yeah I tested it with --kv-cache-dtype fp8 and it frequently got stuck outputting the same token after generating ~ 10k tokens. Never ending loop. In my brief tests using fp8 for kv just wasn't reliable.

[-]

wywywywy@reddit

What about q8_0? I know vLLM doesn't support it, but llama.cpp does

[-]

JockY@reddit (OP)

I'm not sure I've seen data to support that, can you remember where you saw it?

[-]

vogelvogelvogelvogel@reddit

nah that's the problem i don't remember when/where. but someone already answered bf16 is much better. tho i am pretty sure some current model had this issue

[-]

vogelvogelvogelvogel@reddit

tho not entirely wrong there is quite a lot info on this https://www.google.com/search?q=bf16+vs+fp8+qwen

[-]

JockY@reddit (OP)

Yeah, but a lot of that seems focused on quantization of model weights and not the KV cache.

[-]

boutell@reddit

How do you get to 10,000 bucks though? Does the rest of the machine really need to be all that to avoid hampering it?

[-]

boutell@reddit

Because of all that vram, this machine will also be in the sweet spot when the 3.6 large MoEs appear.

[-]

cleversmoke@reddit

Wow! Now I'm considering the RTX Pro 5000 over a RTX 5090!

[-]

JockY@reddit (OP)

It depends on your use case I think.

If you think you'll be using long contexts - like 200k tokens - and it's important that the kv cache degrade as little as possible at long lengths, then it makes a lot of sense to go with the 5000 pro.

But if you think you'll be using a lot of smaller contexts concurrently and long context stuff isn't really your use case then it might make sense to go for the 5090, which is $1k cheaper with 16GB less VRAM.

But then the 5090 is what... 575W? The 5000 pro is only 300W. Quiet, too.

The 5000 pro is slower, though. If you want to eke out every ounce of performance then the 5090 is the way to go... unless you need the long context without quantization!

Of course you can solve all these compromises by spending your kids college fund on a 6000 pro instead.

[-]

M4A3E2APFSDS@reddit

Nice, everyday I keep finding new vllm cmd line arguments. I will try --performance-mode interactivity

[-]

Such_Advantage_6949@reddit

U should try exl3 with dflash. I am getting 100-200 tok/s on rtx 6000 pro

[-]

JockY@reddit (OP)

Holy shit that's a blast from the past. OG for me was Qwen2.5 exl2 8bpw with speculative decoding from the 1.5B on 4x 3090s. Amazing.

I need to look this up...

[-]

Such_Advantage_6949@reddit

Exllama v3 has come a long way. And their handling of paged cache is better suited for agentic system than llama cpp for me where multiple prompt with different context can be sent by the coding agent

[-]

chansumpoh@reddit

Very cool, would you mind sharing you launch command/setup, thanks in advance!

[-]

Such_Advantage_6949@reddit

I use my gallama. You can can check out tabby api which is the most common way to run exllamav3

[-]

chansumpoh@reddit

Thanks! And also which exl3 model would you recommend?

[-]

Such_Advantage_6949@reddit

Currently i run qwen 3.6 6.0bpw 27B with dflash. I have more than 200GB ram actually so i can run much bigger model. But this model is really good so far as long as u guide it accordingly and i am too addicted to this above 100 tok/s speed

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

nunodonato@reddit

Why CUDA 12.9?

[-]

JockY@reddit (OP)

It's what worked at the time. I've got other CUDAs available. Any reason to try other versions?

[-]

TheSlateGray@reddit

"For Linux aarch64, CUDA 13.0.x and 13.2.x add Blackwell support (SM 10.0, 11.0, 12.0)" 12.8 introduced Blackwell, 13.0 is the current stable branch for Pytorch. I'veheard some people have issues with 13.2 but haven't had any personally. I have the same card. https://dev-discuss.pytorch.org/t/introducing-cuda-13-2-and-deprecating-cuda-12-8-release-2-12/3337

[-]

Medium_Chemist_4032@reddit

This looks enticing. Guessing \~300W ?

[-]

JockY@reddit (OP)

Yup, 300W maxed out. It's in an open frame and after running full tilt for extended periods it seems to settle at ~ 84C @ 75% fan speed.