I wonder how good the Qwen 3.6 4B will be, given the insane performance boost in the 27B and 36B
Posted by exaknight21@reddit | LocalLLaMA | 13 comments
I personally am a simpleton with crappy hardware. I still run Qwen 3 4B for my simple tasks and simple RAG. I personally cannot wait for the 4B Instruct model, as I believe it's my go-to "ChatGPT" replacement for dumb questions via OpenWebUI and vLLM.
I rock an old T5610: 64 GB DDR3, dual Xeon (sadly AVX-only) slow processors, a 256 GB SATA SSD, and an MI50 32 GB.
I run dockerized vLLM (nlzy's fork is archived, so I'm on the sweet mobydick branch). I run my in-home experiments with 8K context, usually cyankiwi's AWQ version, and it does wonders for me.
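For anyone curious what that setup looks like, here's a minimal sketch via vLLM's offline Python API. The AWQ checkpoint name is a placeholder, and the nlzy fork for MI50 cards may expose slightly different defaults:

```python
# Minimal sketch of the setup above via vLLM's offline Python API.
# The checkpoint name is a placeholder for cyankiwi's AWQ quant.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cyankiwi/Qwen3-4B-Instruct-AWQ",  # hypothetical repo id
    quantization="awq",
    max_model_len=8192,          # the 8K context mentioned above
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize retrieval-augmented generation in one paragraph."],
    params,
)
print(outputs[0].outputs[0].text)
```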
I pray the Qwen team releases this soon!
segmond@reddit
You have a 32GB GPU yet run a 4B? Why? You can clearly run the 36B model at Q6.
exaknight21@reddit (OP)
I use vLLM, only need 8K context, and have 2 beta users for my app as well. I can do a lot more, but I test with this every now and then. In testing, there isn't any sense in paying premium API prices if I can self-host the calls.
I will try the Qwen3.6:27B today at Q6 and see how far I can take the context with llama.cpp. I use Claude Code a lot, and this would be a banger.
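If it helps anyone, here's roughly what that experiment looks like through the llama-cpp-python bindings (the llama.cpp CLI works the same way; the GGUF path below is a placeholder):

```python
# Sketch of loading a Q6_K GGUF and probing a large context via the
# llama-cpp-python bindings. Model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3.6-27b-q6_k.gguf",  # hypothetical local file
    n_ctx=32768,       # push the context until VRAM runs out
    n_gpu_layers=-1,   # offload every layer to the GPU
)

out = llm("Q: What is 2+2?\nA:", max_tokens=16)
print(out["choices"][0]["text"])
```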
Blues520@reddit
I'm interested in self-hosting as well and am also looking at smaller models for lower latency. How do you go about self-hosting? Are you using a cloud instance or a bare-metal server, and if so, how do you expose it?
exaknight21@reddit (OP)
I have a VPS; on it I run dokploy plus a gateway. The VPS is connected to my home server via Tailscale.
I have API keys configured so only authorized clients can connect, such as my SaaS; an OpenWebUI instance would likewise need the endpoint (OpenAI-compatible) plus the API key. Works like a charm on my subdomain.
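Roughly, the client side is just the standard OpenAI SDK pointed at the subdomain. The domain, key, and model name below are placeholders:

```python
# Sketch of how a client (SaaS or OpenWebUI) would hit the gateway:
# the OpenAI-compatible API on a custom base_url plus the
# gateway-issued key.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.com/v1",  # hypothetical subdomain
    api_key="sk-my-gateway-key",            # key issued by the gateway
)

resp = client.chat.completions.create(
    model="qwen3-4b-instruct-awq",  # whatever vLLM was launched with
    messages=[{"role": "user", "content": "Hello from behind Tailscale!"}],
)
print(resp.choices[0].message.content)
```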
Blues520@reddit
That is brilliant. Tysm! Going to give it a try
gh0stwriter1234@reddit
You still need VRAM for context, so in a practical sense you can run the 27B at Q6 with some decent context...
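A quick back-of-envelope sketch of why that works on a 32 GB card (the layer/head counts are assumptions for illustration, not the actual 27B config):

```python
# Back-of-envelope VRAM math for a 27B at Q6_K on a 32 GB card.
# Layer/head counts below are illustrative assumptions.
params_b = 27e9
q6k_bits = 6.56                               # approx bits/weight for Q6_K
weights_gb = params_b * q6k_bits / 8 / 1e9    # ~22 GB of weights

layers, kv_heads, head_dim = 48, 8, 128       # assumed architecture
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V, fp16
ctx = 8192
kv_gb = kv_bytes_per_token * ctx / 1e9        # ~3 GB at 8K context

print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB at {ctx} ctx")
# ~22 GB + ~3 GB still leaves headroom on 32 GB, so 8K+ context is plausible.
```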
tamerlanOne@reddit
😁
putrasherni@reddit
There's even a 9B.
BothYou243@reddit
Where?
Insomniac1000@reddit
soon 🙏
Insomniac1000@reddit
I can't stop giggling. I'm truly excited for the 9B version.
I've been hesitant to pull the trigger on letting an AI assistant oversee my homelab. Maybe this is it.
CurrentNew1039@reddit
Yes, if it beats the 3.5 27B or 35B, I will jump through heaven.
Monad_Maya@reddit
Unlikely. The parameter count difference is way too large.