Qwen 3.6 27B - beginner questions
Posted by Jagerius@reddit | LocalLLaMA | 24 comments
Hi,
I would like to try running this model locally - I have RTX 4090, 64GB DDR5, Ryzen 9800X3D. Win11.
What is the best way to set this model up for local coding, using an IDE?
What would be the best one to download - Ollama, vLLM, LM Studio, llama.cpp?
And what's the best way to optimize performance on such a rig?
Appreciate any advice!
crablu@reddit
I also have a question. I can run Qwen3.6-27B-UD-Q4_K_XL.gguf with 128k context or Qwen3.6-27B-UD-Q5_K_XL.gguf with q8 kv cache. Which would be better?
Kiro369@reddit
You can try them both with one of those "create a game in a single HTML file" prompts and let us know =D
kayox@reddit
On Windows 11, I found LM Studio beginner friendly before transitioning over to llama.cpp. vLLM is better on Linux and a bit of a pain to get going on WSL. Follow the Unsloth guide for inference, and in LM Studio, for best performance, enable K and V Cache quantization at Q8_0 and make sure GPU Offload is maxed out (in the model load settings).
chisleu@reddit
lmstudio is the way
lmstudio and vscode and cline, and <3 emojis for variable names... jk, but not really, I like emojis for variable names.
pepedombo@reddit
lmstudio has absolutely no idea how to load models properly
No_Block8640@reddit
Everybody is suggesting llama.cpp, but I thought it's not the most efficient when the model fully loads into VRAM?! And I would strongly argue that Pi agent would be the top choice compared to OpenCode!
Mart-McUH@reddit
Assuming a single user (as most of us here are), you are going to be memory-speed bound anyway; no backend magic can change this physical limitation. Given that, llama.cpp is probably the simplest and most convenient way to run it (and there is usually a great range of GGUF quants available in the size you need).
If you need multi-user server inference then it might be good to look at other solutions like vLLM.
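As a rough ballpark: a Q4 quant of a 27B model is around 16-17 GB, so at the 4090's roughly 1 TB/s of VRAM bandwidth the theoretical ceiling is about 1000 / 16 ≈ 60 t/s, while the same math against dual-channel DDR5 (roughly 80-100 GB/s) gives only ~5 t/s. That's why keeping the whole model in VRAM matters far more than which backend you pick.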
Beautiful-Floor-5020@reddit
I honestly don't know too much about CUDA since I'm all AMD. I got the 32GB R9700.
Running Q6 XL with Vulkan, and the coding is fast. In Pi.dev. Insane. I can comfortably run at 131k and it one-shots so much with TINY edits.
Of course there's a long way to go, but it's amazing.
I use llama-server. Vulkan coopmat honestly didn't change much, and even with a lot of testing I found this to be the fastest.
ttkciar@reddit
1. Install llama.cpp, and download the Q4_K_M quant from Bartowski (on Huggingface).
2. Set up llama-server (part of llama.cpp) and make sure it's working well via its built-in web interface.
3. Download OpenCode and configure it to use your local llama-server OpenAI-compatible API endpoint.
There is ample documentation on the llama.cpp GitHub repo and the OpenCode website, but if you get stuck all of us here on LocalLLaMA are here for you!
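A minimal server command for step 2 might look something like this (the filename is just whichever quant you downloaded; adjust -c and -ngl for your VRAM):
llama-server -m Qwen_Qwen3.6-27B-Q4_K_M.gguf -ngl 999 -c 32768 --host 127.0.0.1 --port 8080
The built-in web interface is then at http://127.0.0.1:8080, and the OpenAI-compatible endpoint you point OpenCode at is http://127.0.0.1:8080/v1.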
bigh-aus@reddit
Then keep increasing context until just before you get OOM.
Particular_Pear_4596@reddit
You don't need OpenCode to vibe code, just llama-server is more than enough.
DeedleDumbDee@reddit
This is the way
exact_constraint@reddit
Yup. 👍🏽
Jagerius@reddit (OP)
Thanks a lot for all the tips, managed to get it running, compiled it with CUDA, here's my start.bat:
@echo off
cd /d F:\AI\Lokalnie\Llama\llama.cpp\build\bin\Release
llama-server.exe --model "F:\AI\Lokalnie\Qwen3.6_27B\Qwen_Qwen3.6-27B-Q4_K_M.gguf" --alias qwen36-27b-q4km --host 127.0.0.1 --port 8080 -c 131072 -ngl 999
pause
On the web UI I'm getting around 11-12 t/s - is this the expected performance? Any way to speed it up a little more?
Wildnimal@reddit
I just stumbled upon this in a Reddit thread here:
https://medium.com/@fzbcwvv/an-overnight-stack-for-qwen3-6-27b-85-tps-125k-context-vision-on-one-rtx-3090-0d95c6291914?postPublishedType=repub
necile@reddit
You're giving that to a beginner... And on Windows?
Negative-Web8619@reddit
The easiest thing is reducing context until it fits into VRAM.
ttkciar@reddit
Yep, this. At 128K context you're guaranteed to be spilling to system RAM. Two things to do about it:
1. Quantize your K and V caches to q8_0, with "-fa on -ctk q8_0 -ctv q8_0".
2. Reduce your context limit to something really low like 16K, then slowly bump it up until performance tanks again, then bump it back down to just below that point.
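With both changes, your llama-server line would look something like this (16K is just the starting point to tune upward from):
llama-server.exe --model "F:\AI\Lokalnie\Qwen3.6_27B\Qwen_Qwen3.6-27B-Q4_K_M.gguf" --alias qwen36-27b-q4km --host 127.0.0.1 --port 8080 -c 16384 -ngl 999 -fa on -ctk q8_0 -ctv q8_0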
lemondrops9@reddit
LM Studio first, then figure out your go-to models, then move to llama.cpp.
Ollama is painfully slow and custom models make it more of a pain.
vLLM is more for when you're very serious and have dual, quad, or more GPUs.
CreamPitiful4295@reddit
LM Studio is simple enough. Ollama is even easier.
cviperr33@reddit
Start with LM Studio, test out all the different quants and settings/sizes; after a few days of testing 20-30 different quants/models you can switch to llama.cpp and gain a bit of extra performance.
The UI in LM Studio makes it much easier to understand what's going on, what the settings do, and why they are important. Model downloading/picking is also very easy: you just browse the Hugging Face repo directly inside LM Studio and it shows you things like most downloaded/most liked and upload/update dates.
Wildnimal@reddit
This is good advice.
Yayman123@reddit
LM Studio is very beginner friendly compared to the rest, and will more or less guide you through the process.
jacek2023@reddit
In llama.cpp you run:
llama-cli -m your_model.gguf
to play in the CLI,
and later:
llama-server -m your_model.gguf
to connect with your browser.
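Once llama-server is up, any OpenAI-compatible client can talk to it as well, for example (from a Unix-style shell; 8080 is the default port):
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen","messages":[{"role":"user","content":"hello"}]}'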