Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer
Posted by One_Slip1455@reddit | LocalLLaMA | View on Reddit | 57 comments
The angle here is native Windows, no WSL. Simple installation, open source, no telemetry. Not selling or promoting anything: https://github.com/devnen/qwen3.6-windows-server
Numbers (RTX 3090, Windows 10):
- 72 tok/s short prompt
- 64.5 tok/s long prompt (~25k tokens)
- 53.4 tok/s at 127k ctx (single GPU)
- 160k ctx on PP=2 (2×3090 GPUs)
Honestly, these aren't r/LocalLLaMA records. The community has hit 80–82 tok/s on a 3090 with TurboQuant 3-bit KV, and 160 tok/s on a 5090 on Linux. My launcher and patched vLLM close that gap on Windows.
Simple installation:
1. Download qwen3.6-windows-server-portable-x64.zip from the Release
2. Unzip anywhere. No admin, no pip, no Python required
3. Double-click start.bat, pick a snapshot, hit Enter
4. OpenAI-compatible endpoint at http://127.0.0.1:5001/v1
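Once it's up, any OpenAI-compatible client can talk to it. A minimal sanity check with the openai Python package (the model name below is a placeholder, use whatever /v1/models reports for your snapshot):

```python
# Minimal sanity check against the local server (assumes `pip install openai`).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5001/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # placeholder -- check /v1/models for the served name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```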
I had to build a patched vLLM fork for Windows to fix a few issues and make this work. I am including a portable launcher that ships the prebuilt wheel.
First run installs the bundled vLLM wheel + deps into the embedded Python (~5–15 min, one-time), then offers to auto-download the Lorbus AutoRound INT4 quant from HuggingFace if you don't already have it. Subsequent launches skip straight to the TUI.
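If you'd rather grab the weights ahead of time yourself, something like this works with huggingface_hub (the repo id below is a placeholder, use the exact one linked in the README):

```python
# Optional: pre-download the quant yourself instead of letting the launcher do it.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Lorbus/Qwen3.6-27B-int4-AutoRound",    # placeholder, use the repo from the README
    local_dir="models/Qwen3.6-27B-int4-AutoRound",  # point the snapshot's model path here
)
```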
Tested on Windows 10 + 2× RTX 3090 with the Lorbus AutoRound INT4 quant. Should work on any Ampere/Ada/Blackwell card (3090/4090/5090/A6000). Won't work on Pascal, Turing, Arc, or AMD.
I have a similar launcher and a patched vLLM for Linux with some very competitive numbers, but it is still a work in progress.
If you're on a 3090/4090/5090 on Windows, give it a spin and post your numbers.
Full details, patches, benchmarks, and config snapshots: https://github.com/devnen/qwen3.6-windows-server
pepedombo@reddit
Any chance it's going to work on 5070+5060 or 5060+5060? I'm having errors from cl.exe, Triton problems, and cuda-utils issues. GPT summary: vLLM 0.19.0 native Windows fails inspecting Qwen3_5ForConditionalGeneration; the traceback hits FLA/GatedDeltaNetAttention, then Triton compiling cuda_utils.c with MSVC cl.exe exits 2.
One_Slip1455@reddit (OP)
5070+5060 is a mixed pair of Blackwell SKUs. Tensor parallel generally wants matched architectures, but PP=2 across them is more likely to work than TP=2. 5060+5060 should be fine since they're identical. Either way, test on a single card first to confirm the wheel loads, then add the second.
The cl.exe / Triton error means MSVC 2022 isn't installed or the launcher can't find vcvars64.bat. Triton JIT-compiles attention kernels at first request and needs cl.exe on PATH. Install Visual Studio 2022 Community or just the Build Tools (free), pick the "Desktop development with C++" workload. The snapshots auto-shell vcvars64.bat after that.
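A quick way to confirm Triton can actually see the compiler, run from the same shell you'll launch from:

```python
# Check that MSVC's cl.exe is reachable the way Triton expects.
import shutil

cl = shutil.which("cl")
print(cl or "cl.exe not on PATH -- install the VS 2022 Build Tools and run vcvars64.bat")
```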
If you still hit errors after MSVC is installed, open an issue on GitHub with nvidia-smi, winver, and the boot section of logs\vllm_server..log. I don't have the hardware to test and be 100% sure, but someone may jump in with a solution.
Troubleshooting: https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/TROUBLESHOOTING.md
pepedombo@reddit
I copied the Python includes into qwen-server and it woke up. Just a quick image upload for now, I need to dig deeper. Until now I'd had poor results playing with vLLM on W11.
One_Slip1455@reddit (OP)
Nice, glad it's running! Copying the Python includes is interesting, that hints the embedded Python's Include/ dir wasn't where Triton expected it. If you can share the exact paths you copied from/to, I'll add a check to the runtime installer so the next person on a 5070/5060 doesn't hit it.
Tell me what tok/s you're seeing once it stabilizes. If you have time, please run windows_tools\check_coherence.py --port 5001 to confirm the output is clean before trusting any numbers; Blackwell + AutoRound INT4 is a combo I haven't validated.
Either way, an issue on GitHub with what you copied and your final config would help everyone with the same cards.
relmny@reddit
will this work on an rtx 4080 super (16gb) + rtx 3060 (12gb)?
calibrae@reddit
Windows? Who the fuck would use Windows?
One_Slip1455@reddit (OP)
I also have a Linux setup on the same 2x 3090 hardware, hitting around 90 tok/s with the Genesis/TurboQuant stack (TP=2, vision-off, 480k context, coherence-validated). Writing that one up as a separate post soon. Different OS, different ceiling, same model.
Plenty of people daily-drive Windows for work and don't want to dual-boot just to run a model. That's the niche this fills. If you can run Linux, run Linux. This is for the people who can't.
calibrae@reddit
Anyone can run Linux as soon as they stop being lazy bastards making Pedo Gates richer than he already is.
Fit_Split_9933@reddit
Is a 16GB graphics card capable of handling this? Currently I'm using llamacpp to get qwen3.6 27b iq4_xs to 100k context. I've heard that VLLM itself consumes VRAM, and your model is significantly larger...
hideo_kuze_@reddit
I'm surprised you can even use it at Q4.
Is it offloading to RAM? How many t/s are you getting?
Fit_Split_9933@reddit
See this, no offloading https://www.reddit.com/r/LocalLLaMA/comments/1su0il5/qwen_36_27b_iq4_xs_22_tps_on_rtx_5060ti_16b_24k/
One_Slip1455@reddit (OP)
No, not for the 27B. The Lorbus AutoRound INT4 weights alone are about 17 GiB, before any KV cache or activations. vLLM also pre-allocates KV at startup so it can't spill to RAM the way llama.cpp does. On a 16 GB card you'd OOM at boot.
Stick with llama.cpp + iq4_xs for your setup, that's the right tool for 16 GB. If you want to try vLLM specifically, wait for smaller Qwen3.6 models that fit comfortably; you should also get MTP if you find a quant with the BF16 head.
The launcher here is configurable so you can just add a new card by modifying launcher/configs.yaml. My vLLM fork should be compatible with future smaller Qwen3.6 models if they are released.
Anbeeld@reddit
Not trying to discredit anything, but stating speed as "on 3090" in the title is a bit dishonest when it's 2x3090 in reality, which also changes everything regarding context limitations.
One_Slip1455@reddit (OP)
Fair pushback, let me clarify: the 64.5 tok/s and 72 tok/s headlines are single-card decode, not TP=2 across both.
The model and KV cache live entirely on one 3090. The reason there are two cards in the test rig is the display tax: Windows eats 1-3 GiB on the GPU driving your monitor, so we run inference on GPU1 with no display attached and use GPU0 for the desktop.
If you have one 3090 and your monitor is on the iGPU or a cheap secondary card, you get the same numbers. If you have one 3090 driving your display, that's the start_gpu0_50k snapshot which is the realistic single-GPU-with-desktop case (~9-50k context depending on what's open).
The only snapshot that actually uses both GPUs is start_pp2_160k (43.5 tok/s, 160k context), and that one's labeled clearly. So the speed numbers don't change with one card vs two, but the context ceiling does drop from 127k (single card, no display) to ~50k (single card, display attached).
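If you want to measure the tax on your own box, a quick check with any CUDA build of PyTorch (e.g. the one the launcher installs) prints free vs total VRAM per GPU:

```python
# Print free vs total VRAM per visible GPU -- the gap on the display GPU is the "tax".
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU{i} ({name}): {free / 2**30:.1f} GiB free / {total / 2**30:.1f} GiB total")
```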
The full breakdown of the Windows desktop VRAM tax with measured numbers: https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/WINDOWS_VRAM_HEADLESS.md
Squallhorn_Leghorn@reddit
WSL2 is a Type-1 hypervisor, correct?
I can provision directly to CUDA?
Why is not using Linux a flex?
One_Slip1455@reddit (OP)
Not using Linux isn't a flex, and the README isn't framed that way. If you can run Linux, run Linux, mainline vLLM is strictly better. This project exists for people stuck on Windows for work reasons who don't want to dual-boot just to run a model.
The Windows host driver still owns the GPU and DWM keeps its allocation. That's the overhead someone in another thread measured as 85 tok/s in WSL vs 160 tok/s in native Ubuntu on the same hardware.
I checked the thread again, and it turns out updating WSL to 2.7.3 closes some of the gap, but it's still 115 tok/s in WSL vs 160 tok/s in Ubuntu.
Squallhorn_Leghorn@reddit
WSL2 is a Type-1 hypervisor - Hyper-V. Not sure where you get your metrics.
aspectop@reddit
Nice
cleversmoke@reddit
Solid numbers! I'm on Docker to keep everything containerized, but dang these numbers are making me reconsider.
One_Slip1455@reddit (OP)
If you're on one GPU the speed is great, but context is limited. On GPU0 with your desktop attached you'll get around 50k, because Windows eats 2 to 5 GB of VRAM for the desktop and apps. Use the start_gpu0_50k snapshot for that.
If you have two GPUs, set CUDA_VISIBLE_DEVICES=1 so the model runs on the second card with all VRAM free. That's the fast path, 64.5 tok/s at 90k context.
Or use both cards with start_pp2_160k for 160k context, but it drops to 43.5 tok/s. One card for speed, two cards when you need the bigger context.
cleversmoke@reddit
Your response gave me a thought to look into Task Manager, and it turns out a ton of my apps were utilizing my eGPU for a boost! I've gone and made sure every app only uses the iGPU via Windows 11 Settings > Display > Graphics, and wowza, it made a difference when the eGPU is under full load. No increase in toks, but at least no mouse and keyboard disconnecting. Thank you!
One_Slip1455@reddit (OP)
Glad that helped. The Windows 11 per-app GPU setting is one of the most useful options that most people miss.
There's a whole writeup of this in the repo with measured numbers and the full ranked workaround list (iGPU routing, secondary display GPU, HW accel toggles, etc.): https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/WINDOWS_VRAM_HEADLESS.md
flobernd@reddit
Question: why pp over tp? TP should be faster and also leaves a bit more room for kv-cache compared to pp.
One_Slip1455@reddit (OP)
You're right on Linux. On Windows it flips because there's no real NCCL. The wheel falls back to Gloo and CUDA tensors trip on direct allreduce, so the fork stages collectives through pinned CPU buffers.
TP fires an allreduce on every transformer layer, so the CPU-relay tax compounds and decode drops to about 7.5 tok/s on Qwen3.6-27B. PP only hands off the hidden state once per pipeline boundary, so the relay cost stays manageable and you get about 43 tok/s with 160k context.
KV-wise PP does eat slightly more, but on Windows TP is just unusable here. If NCCL ever lands on Windows or someone ports the collective ops differently, TP would probably win again.
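For the curious, this is roughly what the relay looks like conceptually. It's not the fork's actual code, and it assumes a Gloo process group is already initialized:

```python
# Conceptual sketch of relaying a collective through pinned host memory when only
# Gloo is available. Function name and structure are illustrative, not the fork's API.
import torch
import torch.distributed as dist

def cpu_relay_all_reduce(gpu_tensor: torch.Tensor) -> torch.Tensor:
    # Gloo can't operate on CUDA tensors, so stage through pinned host RAM.
    staging = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                          device="cpu", pin_memory=True)
    staging.copy_(gpu_tensor)   # device -> host
    dist.all_reduce(staging)    # Gloo allreduce on the host copy
    gpu_tensor.copy_(staging)   # host -> device
    return gpu_tensor
```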
Full writeup:
https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/HARDWARE.md#tensor-vs-pipeline-parallelism
https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/SPEC_DECODE_MATRIX.md
flobernd@reddit
Oh interesting! Didn’t know that.
alex20_202020@reddit
Can one not setup iGPU for graphics in Windows?
Craftkorb@reddit
Docker doesn't introduce perceivable overhead; it's just namespaces in the kernel. It's not a VM.
One_Slip1455@reddit (OP)
You're right on Linux, Docker there is just namespaces and the overhead is basically zero. On Windows it's different. Docker Desktop runs through WSL2, which is a real VM with its own kernel. So when a Windows user runs vLLM in Docker, they're paying the WSL2 tax, not a Docker tax.
Someone measured the same hardware going from 85 tok/s in WSL to 160 tok/s in native Ubuntu with the same settings:
https://www.reddit.com/r/LocalLLaMA/comments/1sw21op/comment/oid8d9n/
That gap is what this project is trying to close for Windows users who can't or won't dual-boot.
If you're already on Linux with Docker, stick with it, you're not losing anything.
jaMMint@reddit
For folks using Blackwell cards (e.g. 5090 or RTX 6000 Pro), here is a guide I wrote to reach up to 120 t/s for the dense 27B model, and up to 200 t/s for the 35B MoE Qwen 3.6: https://github.com/lastloop-ai/vllm-blackwell-guide
Dany0@reddit
I'm using the CobraPhil/qwen36-27b-single-5090 recipe right now and I'm loving it, looks quite close to yours.
What I don't understand is your base numbers in llama.cpp. With my llama.cpp setup on my 5090, I get 70-75 tok/s easily without self-spec, and with ngram-mod I get the same ~70 tok/s even at higher ctx, with random bumps of ~200 tps that average out to ~100 tps decode in practice. I use UD Q4_K_XL + Q8 KV cache at max ctx.
Dany0@reddit
It's also a little upsetting to me that the quant we're using wasn't converted using the highest quality AutoRound option.
In my experience these quantisation scripts usually require huge amounts of VRAM if done on the GPU, or a long time of CPU churning, like 4-8 hours minimum.
that kaitchup one seems nice too, has anyone tried it?
basically Lorbus autoround + nvfp4 + highest quality = holy grail
gAmmi_ua@reddit
Nice. Thanks for sharing. I will try to test it on my rtx pro 4000 sff card and share my results (if I manage to run it :)). I will have to go with q4 or something but still
jaMMint@reddit
Let me know how it goes, thanks for trying it out!
arcandor@reddit
For us peasants with slower or smaller vram Nvidia cards, would this also be optimally performant or close to it for other models?
One_Slip1455@reddit (OP)
The wheel serves any vLLM-supported model, the launcher just ships configs tuned for 27B. To run a smaller model: copy a snapshot in snapshots/, change the model path and a few constants, copy the matching .bat, then add a card to launcher/configs.yaml. The launcher picks it up automatically.
Stick to Qwen3 family quants if you want the tool-calling fix and MTP for free.
How to add a snapshot:
https://github.com/devnen/qwen3.6-windows-server/blob/main/snapshots/README.md
If you land a good config for a smaller card, please send a PR, I'd love to include validated configs in the launcher.
arcandor@reddit
I'm limited by my 12gb card, so I can't play with as many of the new "toys"! Thanks and I'll test it out soon.
havnar-@reddit
How does the 8bit one fare?
One_Slip1455@reddit (OP)
Haven't tested 8-bit on this stack. Qwen3.6-27B at INT8 is around 27 GB of weights alone, so it won't fit on a single 3090. You'd need PP=2 across both cards, which means no MTP (PP+MTP is broken in vLLM 0.19), so expect roughly the start_pp2_160k decode rate of ~43 tok/s minus whatever the larger weights cost in memory bandwidth.
Honestly INT4 AutoRound is the sweet spot on 3090-class cards. KLD vs INT8 is small for Qwen3.6, and you keep MTP, the speed, and headroom for big context. If you have a 5090 or A6000 with more VRAM and want to try INT8, would love to see numbers.
Spec-decode compatibility matrix:
https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/SPEC_DECODE_MATRIX.md
Important_Quote_1180@reddit
Well done. Community needs work like this.
urekmazino_0@reddit
Looking forward to running this on my Windows Server box with 2x 3090s.
One_Slip1455@reddit (OP)
Nice, 2x 3090 is the reference setup. start_speed on GPU1 will give you 64.5 tok/s with 90k context, and start_pp2_160k uses both cards for 160k context at 43.5 tok/s.
Also worth running windows_tools\check_coherence.py --port 5001 once after first boot to confirm the output is clean. If you hit anything weird, an issue with winver and nvidia-smi -q | head -25 would be useful since Server isn't in the tested matrix yet.
Ranmark@reddit
I won't be able to run this on two 1080 ti's?
One_Slip1455@reddit (OP)
Sorry, no. 1080 Ti is Pascal (sm_61) and this wheel needs Ampere or newer (sm_86+). The TRITON_ATTN backend and the AutoRound INT4 kernels both require it.
For Pascal your best bet is llama.cpp with a GGUF quant. You won't get MTP or the vLLM speed numbers, but Qwen3.6-27B Q4 will run, just slower.
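If you're not sure what generation a card is, a one-liner with any CUDA build of PyTorch tells you whether it clears the sm_86 bar:

```python
# Check compute capability; this wheel needs sm_86 (Ampere) or newer.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}{minor}:", "supported" if (major, minor) >= (8, 6) else "too old for this wheel")
```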
Hardware compatibility:
https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/HARDWARE.md
Training-Cup4336@reddit
How do I uninstall this once I am done with it?
One_Slip1455@reddit (OP)
It's portable, so uninstall is just deleting folders:
Stop the server if it's running. The launcher's stop button works, or run snapshots\stop_vllm.bat.
Delete the extracted folder (wherever you unzipped qwen3.6-windows-server-portable-x64). That removes the launcher, the embedded Python, the bundled wheel, and the installed deps.
Optional cleanup of stuff that lives outside the install folder:
- %LocalAppData%\qwen36-windows-server\ - logs and saved config when the install dir was read-only (Program Files installs).
- %USERPROFILE%\.cache\vllm\torch_compile_cache\ - torch compile cache, a few hundred MB.
- Your model weights folder - wherever you downloaded Qwen3.6-27B-int4-AutoRound. About 16 GB.
- %USERPROFILE%\.cache\huggingface\ - skip this if you use Hugging Face for other models; otherwise it's safe to clear.
That's it. Nothing touches your system Python, no DLLs registered, no services installed.
Training-Cup4336@reddit
ok thanks 🙏
Hurricane31337@reddit
Very interesting, thanks!
CMatUk@reddit
How's this compare with LM Studio server?
One_Slip1455@reddit (OP)
LM Studio is great for chat, but it runs on llama.cpp under the hood. That means no MTP speculative decoding, no AutoRound INT4 quant, GGUF only, and one request at a time. Tool calling is still beta there.
So this solution runs faster, plus full OpenAI-spec tool calling that works with your favorite coding agent.
Tradeoff is no GUI, you pick a snapshot in the TUI and that's it. If you like the GUI workflow, LM Studio is fine. If you want max tok/s and agentic tool calling on Windows without WSL, this is faster.
CMatUk@reddit
Do you have Claude API compatibility ready or planned?
One_Slip1455@reddit (OP)
The vLLM wheel implements Anthropic's /v1/messages endpoint natively, so Claude Code talks to it the same way it talks to api.anthropic.com. No proxy needed.
Setup is just env vars before launching claude:
ANTHROPIC_BASE_URL=http://127.0.0.1:5001
ANTHROPIC_API_KEY=dummy
ANTHROPIC_AUTH_TOKEN=dummy
ANTHROPIC_DEFAULT_OPUS_MODEL=any
ANTHROPIC_DEFAULT_SONNET_MODEL=any
ANTHROPIC_DEFAULT_HAIKU_MODEL=any
Or put them in ~/.claude/settings.json under "env". The model name can be literally anything because the wheel is patched with a wildcard served-model-name, so Claude Code's model picker won't reject it.
Tool calling works out of the box because every snapshot ships the tool-calling fix (PRs 35687 + 40861 + qwen3.5-enhanced.jinja + preserve_thinking=false).
For more information, please visit this web page:
https://github.com/devnen/qwen3.6-windows-server/blob/main/docs/CLAUDE_CODE.md
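If you want to test the endpoint outside Claude Code first, the anthropic Python SDK pointed at the local server should work too. A quick sketch, assuming `pip install anthropic` (the model name is arbitrary because of the wildcard):

```python
# Quick test of the local /v1/messages endpoint (assumes `pip install anthropic`).
from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:5001", api_key="dummy")
resp = client.messages.create(
    model="any",  # the wildcard served-model-name accepts anything
    max_tokens=128,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.content[0].text)
```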
Monad_Maya@reddit
Anything for AMD folks? :)
schuttdev@reddit
on it
One_Slip1455@reddit (OP)
Sorry, no. This is NVIDIA only, Ampere or newer (3090, 4090, 5090, A6000). The vLLM ROCm path doesn't ship in the Windows wheel, so AMD cards won't work here.
For AMD on Windows your best bet right now is llama.cpp with Vulkan or ROCm.
ihaag@reddit
You’ll get more with SGlang
vogelvogelvogelvogel@reddit
thx for sharing!
Ok-Measurement-1575@reddit
Very nice.