Qwen 3.6 27B - beginner questions
Posted by Jagerius@reddit | LocalLLaMA | 24 comments
Hi,
I would like to try running this model locally - I have RTX 4090, 64GB DDR5, Ryzen 9800X3D. Win11.
What is the best way to set this model up for local coding, using an IDE?
What would be the best one to download - Ollama, vLLM, LM Studio, llama.cpp?
And what's the best way to optimize performance on such a rig?
Appreciate any advice!
crablu@reddit
I also have a question. I can run Qwen3.6-27B-UD-Q4_K_XL.gguf with 128k context or Qwen3.6-27B-UD-Q5_K_XL.gguf with q8 kv cache. Which would be better?
Kiro369@reddit
You can try them both with one of those "create a game in a single HTML file" prompts and let us know =D
kayox@reddit
On Windows 11, I found LM Studio beginner friendly before transitioning over to llama.cpp. vLLM is better on Linux and a bit of a pain to get going on WSL. Follow the Unsloth guide for inference, and in LM Studio, for best performance, enable K and V Cache quantization at Q8_0 and make sure GPU Offload is maxed out (in the model load settings).
chisleu@reddit
lmstudio is the way
lmstudio and vscode and cline, and <3 emojis for variable names... jk, but not really, I like emojis for variable names.
pepedombo@reddit
lmstudio has absolutely no idea how to load models properly
No_Block8640@reddit
Everybody is suggesting llama.cpp, but I thought it's not the most efficient when the model fully loads into VRAM?! And I would strongly argue that Pi agent would be the top choice compared to OpenCode!
Mart-McUH@reddit
Assuming a single user (as most of us here are), you are going to be memory-speed bound anyway; no backend magic can change this physical limitation. Given that, llama.cpp is probably the simplest and most convenient way to run it (and there is usually a great range of GGUF quants available in the size you need).
If you need multi-user server inference then it might be good to look at other solutions like vLLM.
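As a rough ballpark: a Q4 quant of a 27B model is around 16-17 GB, so at the 4090's roughly 1 TB/s of VRAM bandwidth the theoretical ceiling is about 1000 / 16 ≈ 60 t/s, while the same math against dual-channel DDR5 (roughly 80-100 GB/s) gives only ~5 t/s. That's why keeping the whole model in VRAM matters far more than which backend you pick.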
Beautiful-Floor-5020@reddit
I honestly don't know too much about CUDA since I'm all AMD. I got the 32GB R9700.
Running Q6 XL with Vulkan, and the coding is fast. In Pi.dev. Insane. I can comfortably run at 131k and it one-shots so much with TINY edits.
Of course there's a long way to go, but it's amazing.
I use llama-server. Vulkan coopmat honestly didn't change much, and even with a lot of testing I found this to be the fastest.
ttkciar@reddit
1. Install llama.cpp, and download the Q4_K_M quant from Bartowski (on Huggingface).
2. Set up llama-server (part of llama.cpp) and make sure it's working well via its built-in web interface.
3. Download OpenCode and configure it to use your local llama-server OpenAI-compatible API endpoint.
There is ample documentation on the llama.cpp GitHub repo and the OpenCode website, but if you get stuck all of us here on LocalLLaMA are here for you!
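A minimal server command for step 2 might look something like this (the filename is just whichever quant you downloaded; adjust -c and -ngl for your VRAM):
llama-server -m Qwen_Qwen3.6-27B-Q4_K_M.gguf -ngl 999 -c 32768 --host 127.0.0.1 --port 8080
The built-in web interface is then at http://127.0.0.1:8080, and the OpenAI-compatible endpoint you point OpenCode at is http://127.0.0.1:8080/v1.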
bigh-aus@reddit
Then keep increasing context until just before you get OOM.
Particular_Pear_4596@reddit
You don't need OpenCode to vibe code, just llama-server is more than enough.
DeedleDumbDee@reddit
This is the way
exact_constraint@reddit
Yup. 👍🏽
Jagerius@reddit (OP)
Thanks a lot for all the tips, managed to get it running, compiled it with CUDA, here's my start.bat:
@echo off
cd /d F:\AI\Lokalnie\Llama\llama.cpp\build\bin\Release
llama-server.exe --model "F:\AI\Lokalnie\Qwen3.6_27B\Qwen_Qwen3.6-27B-Q4_K_M.gguf" --alias qwen36-27b-q4km --host 127.0.0.1 --port 8080 -c 131072 -ngl 999
pause
On the web UI I'm getting around 11-12 t/s - is this the expected performance? Any way to speed it up a little more?
Wildnimal@reddit
I just stumbled upon this in a Reddit thread here:
https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914?postPublishedType=repub
necile@reddit
You're giving that to a beginner... And on Windows?
Negative-Web8619@reddit
The easiest thing is reducing context until it fits into VRAM.
ttkciar@reddit
Yep, this. At 128K context you're guaranteed to be spilling to system RAM. Two things to do about it:
1. Quantize your K and V caches to q8_0, with "-fa on -ctk q8_0 -ctv q8_0".
2. Reduce your context limit to something really low like 16K, then slowly bump it up until performance tanks again, then bump it back down to just below that point.
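With both changes, your llama-server line would look something like this (16K is just the starting point to tune upward from):
llama-server.exe --model "F:\AI\Lokalnie\Qwen3.6_27B\Qwen_Qwen3.6-27B-Q4_K_M.gguf" --alias qwen36-27b-q4km --host 127.0.0.1 --port 8080 -c 16384 -ngl 999 -fa on -ctk q8_0 -ctv q8_0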
lemondrops9@reddit
LM Studio first, then figure out your go-to models, then move to llama.cpp.
Ollama is painfully slow and custom models make it more of a pain.
vLLM is more for when you're very serious and have dual, quad, or more GPUs.
CreamPitiful4295@reddit
LM Studio is simple enough. Ollama is even easier.
cviperr33@reddit
Start with LM Studio, test out all the different quants and settings/sizes; after a few days of testing 20-30 different quants/models you can switch to llama.cpp and gain a bit of extra performance.
The UI in LM Studio makes it much easier to understand what's going on, what the settings do, and why they are important. Model downloading/picking is also very easy: you just browse the Hugging Face repo directly inside LM Studio and it shows you things like most downloaded/most liked and upload/update dates.
Wildnimal@reddit
This is good advice.
Yayman123@reddit
LM Studio is very beginner friendly compared to the rest, and will more or less guide you through the process.
jacek2023@reddit
In llama.cpp you run:
llama-cli -m your_model.gguf
to play in the CLI,
and later:
llama-server -m your_model.gguf
to connect with your browser.
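Once llama-server is up, any OpenAI-compatible client can talk to it as well, for example (from a Unix-style shell; 8080 is the default port):
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen","messages":[{"role":"user","content":"hello"}]}'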