Workstation upgrade for 5 concurrent users (Qwen 3.6 27B)
Posted by DanielusGamer26@reddit | LocalLLaMA | View on Reddit | 23 comments
Hello, I'd like some advice from people who are already actively involved in this world.
Basically, I own this workstation:
- Ryzen 9 5900X
- 32GB of DDR4 RAM
- RTX 5060Ti
- PCCOOLER CPS YS1000 1000W
Currently, I can code quite comfortably with Qwen3.6 27B IQ3_XXS via llama.cpp + llama-swap, using it to implement small assigned tasks (I like staying low-level and directing the implementation myself, while taking advantage of the speed-up the model gives me over writing everything by hand).
My config:
"Qwen3.6-27B":
ttl: 0
filters:
strip_params: "top_p, top_k, presence_penalty, frequency_penalty, temperature, min_p"
setParamsByID:
"${MODEL_ID}:coding":
temperature: 0.6
top_p: 0.95
top_k: 20
min_p: 0.0
presence_penalty: 0.0
"${MODEL_ID}:general":
temperature: 1.0
top_p: 0.95
top_k: 20
min_p: 0.0
presence_penalty: 1.5
"${MODEL_ID}:instruct":
chat_template_kwargs:
enable_thinking: false
temperature: 0.7
top_p: 0.8
top_k: 20
min_p: 0.0
presence_penalty: 1.5
"${MODEL_ID}:reasoning":
chat_template_kwargs:
enable_thinking: false
temperature: 1.0
top_p: 0.95
top_k: 20
min_p: 0.0
presence_penalty: 1.5
cmd: |
${llama-server} --model /mnt/fast_data/models/huggingface/Qwen3.6-27B/Qwen3.6-27B-UD-IQ3_XXS.gguf \
--threads 9 --ctx-size 180000 -fa 1 --jinja -np 3 -ngl 99 -ctk q4_0 -ctv q4_0 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 --chat-template-kwargs '{"preserve_thinking": true}' -b 256 -ub 256 -kvu
On average, I get about 900 tk/s in prefill (dropping to ~600 once the context reaches 50-60k tokens) and about 25 tk/s in token generation.
However, lately I often find myself using the model in parallel: reviews in one terminal, git commits in another, and maybe Nanoclaw running to check the LocalLLaMA subreddit for useful news. This is where the workstation's limits start to show: everything slows down, and while it's doing the prefill for the Telegram bot, my own tasks freeze completely (obviously llama.cpp is not designed for parallel requests).
So I was thinking of a small upgrade/investment for the workstation: adding a modded RTX 3080 20GB for $370 (I still have a free PCIe slot on the motherboard) and getting my hands on vLLM/SGLang with 4-bit (maybe even higher-bit?) quantizations.
Usually my tasks don't exceed 120k tokens of context, but I'm concerned about batch processing capacity. The biggest limitation I'm hitting right now is cache invalidation: a periodic check from the Telegram bot (which uses around 80k tokens) gets triggered, and as a result my task has to redo its entire prefill from scratch because its cache was evicted.
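One mitigation I still have to test (a minimal sketch below, assuming my clients can pass extra request fields through; URL, port and prompt are placeholders): llama-server lets a request be pinned to a specific slot via id_slot, so the bot's periodic 80k check would keep reusing its own slot's cache instead of evicting my task's.

```bash
# Hedged sketch: pin the Telegram bot's periodic job to slot 2 (slots 0-2 exist
# because of -np 3), so its prefill reuses its own KV cache.
# id_slot and cache_prompt are llama-server request fields; talking to the
# server directly here, not through llama-swap.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Summarize the latest LocalLLaMA posts ...",
        "n_predict": 512,
        "id_slot": 2,
        "cache_prompt": true
      }'
```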
In your opinion, with vLLM and 36GB of total VRAM, will I have enough KV-cache space to avoid these invalidations while keeping decent speeds with ~5 active parallel requests? I'm afraid of upgrading and then finding out I've wasted my money.
Thank you very much for the help and all the knowledge I have acquired thanks to this subreddit <3
FriendlyTitan@reddit
With vLLM, I couldn't fully use the VRAM when the two GPUs had different capacities: it will only take 16GB on each of your GPUs (matching the smaller one), so 32GB total. Also, your two GPUs would have different architectures (Ampere vs Blackwell), so expect a lot of bugs.
DanielusGamer26@reddit (OP)
Does this come from your own experience? Thanks for the advice <3
FriendlyTitan@reddit
Yeah. As usual, please also check online to see whether my experience matches other people's. But last time I checked, I couldn't get vLLM to fully utilize all the VRAM with mismatched capacities, and my 4090+5090 couldn't start some models if they were in FP8. Mismatched GPU architectures are wonky on vLLM.
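For what it's worth, if you do end up experimenting with vLLM on mismatched cards, pipeline parallelism splits layers across the GPUs instead of mirroring every tensor, so uneven VRAM hurts a bit less in principle. A hedged sketch (the flags exist in recent vLLM; the model name is a placeholder, and mixed Ampere/Blackwell behavior is exactly the part people report as wonky):

```bash
# Pipeline parallelism across the 16GB 5060 Ti + 20GB 3080 (untested sketch)
vllm serve <some-4bit-awq-or-gptq-model> \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 120000
```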
DanielusGamer26@reddit (OP)
Okay, so I can stick with llama.cpp, but I need to solve the problem of serialized prefill: when one request starts its prefill, all the parallel generations get stuck...
FriendlyTitan@reddit
That's not to discourage you from trying vLLM. If you still want to buy another GPU, feel free to experiment with it. But in my experience it has been a bit of a PITA with a 4090 and a 5090.
DanielusGamer26@reddit (OP)
Thanks for the advice, but another user said that 4-bit quantizations in tensor format are worse than llama.cpp's. I think I'll get another GPU and keep using llama.cpp, rip.
FriendlyTitan@reddit
I think they meant that 4-bit on vLLM is worse than 4-bit quants on llama.cpp.
Since you are using the lowest 3-bit quant possible, I think 4-bit on vLLM would still be higher quality.
FriendlyTitan@reddit
If you really want to, you can run two llama-servers, one on each GPU. Use a higher context, or higher quality with a Q3_K_XL quant / Q8 KV cache on the 20GB card. Try putting the mmproj on the CPU to free up VRAM for context.
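A rough sketch of that two-server layout (ports, quant filenames, context sizes and the GPU-to-device mapping are assumptions, not tested values):

```bash
# 20GB modded 3080: bigger quant + Q8 KV cache (assuming it enumerates as device 0)
CUDA_VISIBLE_DEVICES=0 llama-server \
  --model /mnt/fast_data/models/huggingface/Qwen3.6-27B/Qwen3.6-27B-UD-Q3_K_XL.gguf \
  --port 8081 -ngl 99 -np 2 --ctx-size 80000 -fa 1 -ctk q8_0 -ctv q8_0 &

# 16GB 5060 Ti: the current IQ3_XXS setup as a second independent server
CUDA_VISIBLE_DEVICES=1 llama-server \
  --model /mnt/fast_data/models/huggingface/Qwen3.6-27B/Qwen3.6-27B-UD-IQ3_XXS.gguf \
  --port 8082 -ngl 99 -np 2 --ctx-size 120000 -fa 1 -ctk q4_0 -ctv q4_0 &
```

llama-swap (or any reverse proxy) can then route different workloads to the two ports, so the Telegram bot's prefill never touches the server your coding tasks use.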
MelodicRecognition7@reddit
> -np 3
Did you try to set this to 4?
> -ctk q4_0 -ctv q4_0
This is not a good idea, but if it works for you then OK.
> -b 256 -ub 256
This needs testing; higher values are usually faster.
DanielusGamer26@reddit (OP)
Hi, thanks for the response. Those parameters come from day-to-day tuning; with those flags I can use all the VRAM on my RTX. If I set -np to 4 and 4 tasks run in parallel, there's a chance llama.cpp crashes because there isn't enough space to allocate the KV cache; with 3 it's stable. Without 4-bit KV quantization I could only use 80k of context, which isn't enough for my use case. Setting the batch size to 256 frees up a bit of space that I can allocate to KV context. Threads = 9 is the sweet spot for my CPU, which has 12 physical cores; if I increase or decrease that value it gets slower.
MelodicRecognition7@reddit
I'm glad you've actually tested these params instead of blindly copy-pasting them from somewhere on the Internet. I'm not completely sure about the 3080, but my gut feeling is that it will make things slower, since your current 50xx would be bottlenecked by the older 30xx generation; perhaps buying another 5060 Ti would be a better choice.
Also, you could disable some OS+BIOS security features to gain an extra few percent: https://old.reddit.com/r/LocalLLaMA/comments/1qxgnqa/running_kimik25_on_cpuonly_amd_epyc_9175f/o3w9bjw/
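The OS-side version of that trick is usually the mitigations=off kernel parameter; a sketch (you trade CPU-vulnerability protection for a few percent of throughput, so benchmark before/after and only do this on a machine where you accept the risk):

```bash
# Add to the kernel command line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... mitigations=off"
sudo update-grub && sudo reboot
# Check what is (still) enabled after the reboot:
grep . /sys/devices/system/cpu/vulnerabilities/*
```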
BitGreen1270@reddit
A 3080 for $370 sounds like a good price to me. I've heard people in this sub say that two cards give a tremendous boost to performance. No experience myself, though.
But I have found vast.ai quite affordable for my needs. I can get a 3090 for $0.17/hour, and they're usually readily available. The only problem is the transient nature of it, which requires you to install everything and download the models every time you rent an instance.
But they support custom Docker templates, so last weekend I spent a few hours setting up my own, and now it works great: precompiled llama.cpp, and it auto-downloads the models. It takes about 5-10 minutes after renting an instance to be ready, and I delete the instance at night. So 12 hours costs about as much as a bus fare, which feels reasonable.
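For anyone copying the idea, a hypothetical sketch of what such a start-up script can look like (repo name, paths and flags are placeholders, not the actual template): the image ships a prebuilt llama.cpp, so on boot it only has to fetch the model and start serving.

```bash
#!/usr/bin/env bash
set -e
# Pull the GGUF (placeholder repo/file names)
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --local-dir /workspace/models
# Serve with the llama.cpp build baked into the Docker image
/opt/llama.cpp/llama-server \
  --model /workspace/models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -ngl 99 -fa 1 --host 0.0.0.0 --port 8080
```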
DanielusGamer26@reddit (OP)
In my country electricity costs about 15 cents per kWh, so I'd rather stay fully local instead of relying on vast.ai, which can raise prices outside my control. Also, those low-cost machines on Vast come from unvetted community hosts, so I don't really trust them much. Just a personal feeling.
Farmadupe@reddit
PS: the official "4-bit" Qwen-team quant of 3.5-27B is actually ~30GB.
DanielusGamer26@reddit (OP)
Hmm, I didn't know this, thanks for the info!
timbo2m@reddit
You need to use vLLM or SGLang for concurrency; llama.cpp is good for one call at a time.
Creepy-Bell-4527@reddit
You may be better off spending that money on an SSD for paged KV caching.
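llama.cpp already has something adjacent built in (hedged sketch; check the server README for the exact endpoint spelling): with --slot-save-path set, a slot's KV cache can be dumped to disk and restored later, so a long prefix can survive being evicted.

```bash
# Start the server with a directory for KV snapshots (paths are illustrative)
llama-server --model <model.gguf> -np 3 --slot-save-path /mnt/fast_data/kv_cache/

# Snapshot the coding task's slot before the big Telegram-bot prefill runs,
# then restore it afterwards (slot id and filename are illustrative)
curl -X POST "http://localhost:8080/slots/0?action=save"    -d '{"filename": "coding_task.bin"}'
curl -X POST "http://localhost:8080/slots/0?action=restore" -d '{"filename": "coding_task.bin"}'
```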
DanielusGamer26@reddit (OP)
But currently, with my configuration, I can't run the 27B via vLLM; I can't find ~3-bit quantizations like the IQ3_XXS I use. Does llama.cpp also have a config like that?
overand@reddit
I'm not sure I follow the question, but if you're asking whether llama.cpp supports 3-bit GGUF quants: yes, several. Take a look at Unsloth's GGUF repo for this model; the quants there should all be fully compatible with llama.cpp and range from roughly 12.0 to 14.5 GB:
- UD-IQ3_XXS
- Q3_K_S
- Q3_K_M
- UD-Q3_K_XL
Jumpy-Possibility754@reddit
You're hitting cache invalidation, not a raw compute limit. Parallel requests with long contexts will constantly blow out the KV cache and force full prefills.
vLLM helps, but only if you control batching and reuse patterns; otherwise you just move the bottleneck.
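In vLLM terms, that roughly means turning on automatic prefix caching and capping concurrency, something like this (sketch; model name and limits are placeholders):

```bash
# Keep shared prompt prefixes resident and limit how many requests batch together
vllm serve <model> \
  --enable-prefix-caching \
  --max-num-seqs 5 \
  --max-model-len 120000 \
  --gpu-memory-utilization 0.90
```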
Safe-Introduction946@reddit
Vast.ai has A40 rentals too. You can spin one up for a few hours and test FP8 vs those "4-bit" safetensors on the exact same hardware. If a "4-bit" file is ~20-30GB, it's probably a packed/higher-bit quant (AWQ/GPTQ), so compare quality and memory use on the same GPU before committing.
DeltaSqueezer@reddit
Max KV cache requires 16GB VRAM, leaving you 20GB for the weights. So it is doable.
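A back-of-envelope way to sanity-check that figure against whichever model and cache type you actually run (the architecture numbers below are placeholders, not Qwen3.6-27B's real ones; llama-server also prints the allocated KV size in its startup log):

```bash
# KV bytes per token = 2 (K+V) x n_layers x n_kv_heads x head_dim x bytes_per_element
# Example: 48 layers, 8 KV heads, head_dim 128, fp16 cache, 180k-token context
echo "$(( 2 * 48 * 8 * 128 * 2 * 180000 / 1024 / 1024 / 1024 )) GiB"
# A q8_0 cache is roughly half of that, q4_0 roughly a quarter.
```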