Rig For Qwen3.5 27B FP16

Posted by Ok-Internal9317@reddit | LocalLLaMA | 4 comments

What would you build to run specifically this model at half precision, with fast prompt processing and token generation at up to 500K context? How much would it cost?


4 Comments


Running Qwen3.5 27B at **FP16** with a **500k context** is a massive VRAM requirement.

- Model weights: ~54GB.
- KV Cache (500k context): At FP16, this is roughly 80-100GB on its own depending on the architecture details (GQA helps, but at 500k it's still huge).

You're looking at **160GB+ of VRAM** minimum to run this comfortably with room for the activation buffers.

**The Build:**

- **4x RTX 3090 (24GB) or 4090 (24GB):** Total 96GB. **NOT ENOUGH** for FP16 + 500k.
- **7-8x RTX 3090/4090:** This gets you to 168-192GB. This is a 'franken-server' build. You'd need an EPYC/Threadripper platform for the PCIe lanes.
- **Professional Option:** 4x RTX 6000 Ada (48GB each) = 192GB VRAM. This fits in a workstation.
- **Budget-conscious Option:** 8x used RTX 3090s.

**Cost:**

- 8x Used 3090s (~$8k) + EPYC Mobo/CPU/PSUs ($3k) = **~$11k**.
- 4x RTX 6000 Ada (~$28k new, maybe $18k used) = **$20k+**.

**Note on Speed:** Fast prompt processing at 500k context requires massive memory bandwidth. Even with a multi-GPU setup, the 'prefill' phase for 500k tokens will take time. If you can compromise on precision (e.g., 4-bit or 8-bit quants), your VRAM needs drop by 2-4x.
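For anyone who wants to redo the VRAM arithmetic above with their own numbers, here is a rough sketch. The layer count, KV head count, and head dimension in the example are placeholder assumptions, not Qwen3.5 27B's actual config, so substitute the values from the model's config file. Only layers that keep a standard KV cache should be counted in `n_attn_layers`, and activation buffers and framework overhead are ignored.

```python
import math


def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes). FP16 uses 2 bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def kv_cache_gb(context_tokens: int, n_attn_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GB for one sequence.

    The factor of 2 accounts for storing both keys and values.
    """
    bytes_per_token = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_tokens / 1e9


def gpus_needed(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum GPU count by capacity alone (no overhead margin)."""
    return math.ceil(total_gb / vram_per_gpu_gb)


# Placeholder architecture values (assumptions, NOT the real model config).
w = weights_gb(27)                                   # 27B params at FP16
kv = kv_cache_gb(500_000, n_attn_layers=48, n_kv_heads=8, head_dim=128)
total = w + kv

print(f"weights: {w:.1f} GB, KV cache: {kv:.1f} GB, total: {total:.1f} GB")
print("24 GB cards needed:", gpus_needed(total, 24))
print("48 GB cards needed:", gpus_needed(total, 48))
```

With these placeholder values the capacity-only estimate lands in the same range as the 7-8x 24GB or 4x 48GB builds listed above, but the KV cache figure moves a lot with the real layer and head counts, so treat it as a starting point.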


cchung261@reddit

I think you’ll need 64 GB of VRAM, so that’s two RTX PRO 4500 Blackwell cards if you want CUDA. There’s also Strix Halo or an Apple Mac Mini if the fastest performance is less important. You’re looking at roughly $3,000 on the low end up to about $9,500.


Ok_Technology_5962@reddit

You will have issues with 500k context, not because of hardware but because of how the attention mechanism handles it. The model is about 3-to-1 DeltaNet to full attention, which is much better, but I saw prefill drop from 400 tps to 100 tps at around 100k tokens of prompt. Your PP will be restricted by how much compute you have, so use a GPU if you need that to be fast (a 5090 is around 2k PP on the q4 version, so estimate around that for the RTX PRO 6000, which you will need for the FP16 27B model). Your Tgen is related mostly to bandwidth, so how fast the memory is. Balance cost vs output. I would suggest looking through the benchmarks on oMLX: [Community Benchmarks — oMLX](https://omlx.ai/benchmarks). It's Mac, but you can scale the speeds by memory bandwidth and compute to see what you would be satisfied with, and you can calculate the quadratic drop-off in speed at larger KV cache lengths as well.
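The bandwidth scaling idea can be sketched roughly like this. It assumes token generation is purely memory-bound, so each generated token reads all the weights plus the current KV cache once. That gives a ceiling, and real throughput will be lower. The bandwidth and benchmark numbers in the example are placeholders, not measured specs for any particular card or Mac.

```python
def decode_tps_ceiling(bandwidth_gb_s: float, weights_gb: float,
                       kv_cache_gb: float = 0.0) -> float:
    """Upper bound on tokens/s if decoding is limited only by memory bandwidth.

    Each generated token is assumed to read the weights and the KV cache once.
    """
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)


def scale_by_ratio(measured_tps: float, source_spec: float,
                   target_spec: float) -> float:
    """Scale a benchmark result from one device to another.

    Use memory bandwidth for token generation and compute (e.g. TFLOPS)
    for prompt processing. Both specs must be in the same unit.
    """
    return measured_tps * target_spec / source_spec


# Placeholder numbers (assumptions): 1000 GB/s of bandwidth, 54 GB of FP16 weights.
print(f"{decode_tps_ceiling(1000, 54):.1f} tok/s ceiling with an empty KV cache")
print(f"{decode_tps_ceiling(1000, 54, 50):.1f} tok/s ceiling with a 50 GB KV cache")

# A hypothetical benchmark of 20 tok/s on a 400 GB/s machine, scaled to 800 GB/s.
print(f"{scale_by_ratio(20, 400, 800):.1f} tok/s estimated on the faster device")
```

Linear scaling like this only holds while the same resource stays the bottleneck, so it is better for ranking options than for predicting exact speeds.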


Opteron67@reddit

bytebeast40@reddit

cchung261@reddit

Ok_Technology_5962@reddit