Workstation upgrade for 5 concurrent users (Qwen 3.6 27B)

Posted by DanielusGamer26@reddit | LocalLLaMA | View on Reddit | 23 comments

Hello, I'd like some advice from those of you who are already actively involved in this world.

Basically, I own this workstation:

Currently, I can code quite comfortably with Qwen3.6 27B IQ3_XXS via llama.cpp + llama-swap to implement small assigned tasks (I like staying low-level to direct the implementation myself, and I take advantage of the speed-up the model provides compared to writing everything by hand).

My config:

"Qwen3.6-27B":
    ttl: 0
    filters:
      strip_params: "top_p, top_k, presence_penalty, frequency_penalty, temperature, min_p"
      setParamsByID:
        "${MODEL_ID}:coding":
          temperature: 0.6
          top_p: 0.95
          top_k: 20
          min_p: 0.0
          presence_penalty: 0.0
        "${MODEL_ID}:general":
          temperature: 1.0
          top_p: 0.95
          top_k: 20
          min_p: 0.0
          presence_penalty: 1.5
        "${MODEL_ID}:instruct":
          chat_template_kwargs:
            enable_thinking: false
          temperature: 0.7
          top_p: 0.8
          top_k: 20
          min_p: 0.0
          presence_penalty: 1.5
        "${MODEL_ID}:reasoning":
          chat_template_kwargs:
            enable_thinking: true
          temperature: 1.0
          top_p: 0.95
          top_k: 20
          min_p: 0.0
          presence_penalty: 1.5
    cmd: |
      ${llama-server} --model /mnt/fast_data/models/huggingface/Qwen3.6-27B/Qwen3.6-27B-UD-IQ3_XXS.gguf \
      --threads 9 --ctx-size 180000 -fa 1 --jinja -np 3 -ngl 99 \
      -ctk q4_0 -ctv q4_0 \
      --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      -b 256 -ub 256 -kvu
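
For completeness, this is roughly how I select a profile on the client side; a minimal sketch, assuming the setParamsByID keys map directly onto the model name sent in the request (the endpoint and port are illustrative, not my actual setup):

    # minimal sketch of how a profile gets picked; assumes the setParamsByID keys
    # above map directly onto the model name in the request (port 8080 is just an
    # example for wherever llama-swap is listening)
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "Qwen3.6-27B:coding",
            "messages": [{"role": "user", "content": "Refactor this function to use a context manager."}]
          }'

As far as I understand it, the strip_params filter removes whatever sampling values the client sends, so only the per-profile values in the config actually reach llama-server.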

On average, I get about 900 tok/s in prefill (dropping to ~600 once the context reaches around 50-60k tokens) and about 25 tok/s in text generation.

However, lately I often find myself using the model in parallel: performing reviews in one terminal, git commits in another, and perhaps with Nanoclaw running to check the LocalLLaMA subreddit for useful news. This is where the workstation's limitations start to show; everything begins to slow down, and while it's doing the prefill for the Telegram bot, my own tasks freeze completely (obviously, llama.cpp is not designed for parallel requests).

So I was thinking of making a small upgrade/investment in the workstation: adding a modded RTX 3080 20GB for $370 (I still have a free PCIe slot on the motherboard) and getting my hands dirty with vLLM/SGLang and a 4-bit (maybe even higher?) quantization.
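
To make that concrete, the launch I have in mind looks roughly like this; a sketch only, not a tested command: the AWQ model path is a placeholder, and I'm assuming tensor parallel across two mismatched cards works well enough, which I'd still have to verify:

    # rough vLLM sketch (untested); model path, quant format and memory settings
    # are assumptions I would still need to validate on my actual hardware
    vllm serve /mnt/fast_data/models/Qwen3.6-27B-AWQ \
      --tensor-parallel-size 2 \
      --max-model-len 131072 \
      --max-num-seqs 5 \
      --kv-cache-dtype fp8 \
      --enable-prefix-caching \
      --gpu-memory-utilization 0.90

The prefix caching is the part I care about most: if it behaves as documented, the bot's recurring 80k prompt and my coding context should be able to coexist in the cache instead of evicting each other, provided the KV pool is large enough for both.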

Usually my tasks don't exceed 120k of context, but I'm worried about the batch-processing side. Specifically, the biggest limitation I'm hitting right now is that the cache for the task I'm working on gets invalidated when, for example, the periodic check for the Telegram bot (around 80k tokens) fires, so my task has to redo the entire prefill from scratch.
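
I suspect part of this is simply that the two jobs don't fit in the current KV budget at the same time; a back-of-the-envelope check, assuming -kvu means the 180k context is one pool shared by all slots (that's how I read the flag, but I could be wrong):

    # back-of-the-envelope KV budget with my current llama-server flags;
    # assumes -kvu gives one shared 180k-token pool rather than a per-slot split
    pool=180000
    long_task=120000        # worst case for one of my coding tasks
    telegram_check=80000    # the periodic bot job
    echo $(( long_task + telegram_check - pool ))   # -> 20000 tokens over budget

So whenever the bot fires in the middle of a long task, something has to be evicted, and it's usually my prefix.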

In your opinion, with vLLM and 36GB of total VRAM, will I have enough KV-cache space to avoid those invalidations while maintaining decent speeds with ~5 active parallel requests? I'm afraid of upgrading and then finding out I've wasted my money.
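
To get a feel for the numbers, I tried a rough KV-footprint estimate; every architecture value below is a guess (I haven't checked the real layer/head counts for this model), so treat it as a formula rather than a result:

    # rough KV-cache size per token:
    #   bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem
    # the layer/head numbers are GUESSES for a ~27B dense model, not checked
    layers=48; kv_heads=8; head_dim=128
    bytes_per_elem=1        # fp8 KV cache
    per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
    echo "$per_token bytes per token"                          # ~96 KiB with these guesses
    echo "$(( per_token * 200000 / 1024 / 1024 / 1024 )) GiB"  # 120k task + 80k bot cached together

If those guesses are anywhere near reality, ~18 GiB of KV on top of roughly 14-16 GB of 4-bit weights is right at the edge of 36 GB, which is exactly why I'd like a sanity check from someone who has already done this.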

Thank you very much for the help and all the knowledge I have acquired thanks to this subreddit <3