how would you set up a local llm server for a business of 7 people?
Posted by snowieslilpikachu69@reddit | LocalLLaMA | View on Reddit | 42 comments
Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun
but say my role at a company is to set up a local LLM, since we obviously don't want confidential data going to other companies etc.
main use case would be queries, rag, general use nothing crazy except for maybe 1 or 2 people using it for programming purposes.
i was thinking of gemma 4 26/31 or qwen 3.6 27/35. how do these models scale with concurrent users? i know i could run one of these on a 5090 and some extra, or a 48gb macbook pro w/ unified memory, but i'm not sure how these scale with multiple users.
tecneeq@reddit
Same situation here, confidential data moves in the company, so we decided to build a local stack.
Users have Windows Notebooks with VirtualBox (never install the extension packs or Oracle comes after you). I prepared a VM with Hermes Agent for vibing.
Next plan is to get LiteLLM so i can measure who uses what.
We are pretty happy so far.
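A rough sketch of what that LiteLLM step could look like; the model name, port, and backend URL below are placeholders, not our actual config:
```
# Point the LiteLLM proxy at a local OpenAI-compatible backend
# (llama.cpp server, vLLM, etc.); everything here is illustrative.
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: local-qwen
    litellm_params:
      model: openai/local-qwen
      api_base: http://127.0.0.1:8080/v1
      api_key: "none"
EOF

pip install 'litellm[proxy]'
litellm --config litellm_config.yaml --port 4000
# Per-user usage can then be tracked by handing each person a virtual key
# generated through the proxy.
```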
relmny@reddit
I'm curious why you're using llama.cpp instead of vLLM or sglang?
tomz17@reddit
woof... on 2x 6000's running a 35B model. Leaving so much performance on the floor.
tf2ftw@reddit
What tasks and use cases does this set up address?
Perfect-Flounder7856@reddit
LXC seems genius!
RipperFox@reddit
Are you aware of the problems with this combination, e.g. as mentioned here: https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/ https://github.com/ggml-org/llama.cpp/issues/21255 https://github.com/unslothai/unsloth/issues/4849 https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/discussions/12
tecneeq@reddit
I don't use IQ quants or Q3 or smaller, only full precision.
marutthemighty@reddit
Thank you for these tips. Useful comment.
planemsg@reddit
👆this is the setup needed for a professional environment 🚀💯🔥
FusionCow@reddit
Ok, well the 1-2 people using it for programming purposes puts a wrench in things, because that means they need a genuinely good model.
You have a few options:
- 8x Pro 6000 (~100k) to run kimi k2.6
- 1x Pro 6000 with a lot of RAM (20-40k; price can change between DDR4 and DDR5) to run kimi k2.6
- Mac Studio 512GB (10-15k; these are hard to find used, but even if you do find one, they aren't great for developers because the prefill speed is bad)
- 2x Pro 6000 (~30k) to run a model like deepseek v4 flash or similar sized. This won't be nearly as good a model as kimi k2.6, but your developers may be able to scrape by with it
- 1x 5090 machine (~6k); this would be able to run qwen 3.6 27b, which to be honest isn't good enough for any serious developer, but it would work for the more general audience.
Honestly in my opinion, you should go with the 5090 machine and run qwen 3.6 35b, which will be fast and snappy for your regular users, then give your developers a kimi or claude subscription.
To actually set up a server like this: if you have NO idea what you're doing, set up LM Studio, which supports concurrent outputs. But if you have used any command-line program before, you should set up llama.cpp.
Also make sure you use linux on whatever box you buy, it's much faster for this stuff than windows
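If you go the llama.cpp route, a starting point might look roughly like this; the model path, context size, and slot count are placeholders you'd tune to your hardware:
```
# llama.cpp's built-in OpenAI-compatible server; values are illustrative.
# -c is the total context (shared across slots), --parallel sets the number
# of concurrent request slots, -ngl 99 offloads all layers to the GPU.
./llama-server \
  -m ./models/your-model-Q8_0.gguf \
  -c 32768 \
  --parallel 4 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080
```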
TheTerrasque@reddit
This has not been my experience so far, but I've only evaluated it for a few weeks still. What problems have you faced with it?
FusionCow@reddit
it can do a lot of simple stuff, it's great for smaller things, but it just isn't a kimi k2.6, which makes sense
TheTerrasque@reddit
That I understand, but that wasn't what I asked. It can do some pretty complicated stuff, and I've found it to be good enough for serious developer work.
So I was wondering where your experience differs.
FusionCow@reddit
I mean I've found it to work fine, but that's just for me. I tend to only use AI to make tools; whenever I want to make something I care about, I'll just hand-program it. But I wanted to make a website to let me Elo-sort videos, with qwen3 vl embed for grouping, and it couldn't do it: every change I made introduced a new bug somewhere else. k2.6 one-shotted it.
marutthemighty@reddit
Hmm. Noted. Once I make enough money, I will look into this. Thank you for posting this detailed reply.
planemsg@reddit
If you want to have any substantial context you will need at minimum 1x Pro 6000. The Mac can give you the context, but you won't have the inference speed; the queries will be much slower.
IMO, for any professional setting 1x Pro 6000 is required.
Then you will need to provide the needed software and tooling to take it to another level.
Maleficent_Bridge_41@reddit
2x RTX 6000 is overkill for just two people. Here's a fresh benchmark with the brilliant repne/vllm:v5 Docker container and Qwen 3.6 27B BF16 on two RTX cards. They can drive a whole team with 16-32 concurrent requests.
planemsg@reddit
Are you using vllm? Currently running 1x Pro 6000 with 3.6 35b a3b fp8 at 262k max context. 2 concurrent requests work, but it cuts the speed by easily more than half.
Maleficent_Bridge_41@reddit
Yep, vLLM, but with a build & Docker container that includes an RTX 6000-optimized, PR-cherry-picked environment maintained by repne.
That's the one that ran for the above benchmark (https://github.com/voipmonitor/llm-inference-bench) - adapt the TP to 1 and increase the utilization and you should be more than good to go on "only" the one as well.
```
docker run --rm \
--runtime nvidia \
--gpus all \
--ipc=host \
--shm-size=32g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--network host \
--volume ~/certificates:/root/certificates \
--volume ~/.cache/huggingface:/root/.cache/huggingface \
--volume ~/.cache/vllm:/root/.cache/vllm \
--volume ~/.triton/cache:/root/.triton/cache \
--env OMP_NUM_THREADS=8 \
--env VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
--env VLLM_WORKER_MULTIPROC_METHOD=spawn \
--env VLLM_ALLREDUCE_USE_SYMM_MEM=0 \
--env HUGGING_FACE_HUB_TOKEN=hf_REDACTED \
repne/vllm:v5 \
-O3 \
--model Qwen/Qwen3.6-27B \
--served-model-name Qwen3.6-27B \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.80 \
--max-model-len 262144 \
--max-num-seqs 128 \
--max-num-batched-tokens 32768 \
--max-cudagraph-capture-size 256 \
--language-model-only \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 3 \
--speculative-config.attention_backend flashinfer \
--speculative-config.draft_sample_method greedy \
--attention-backend flashinfer \
--load-format instanttensor
```
planemsg@reddit
currently just running the base commands with the following generation config overrides:
FusionCow@reddit
you can easily get max context on a 5090, what are you talking about?
planemsg@reddit
Yea, with 6k context. For a professional setting this is nothing. With agent prompts and a small codebase you can easily start at 10-20k context before any user prompts.
This will technically work with simple chat, but not legit coding projects. Also, you'd probably be running at Q2-Q4, which will have a major impact on reasoning performance.
Also, if you spill the KV cache over to system RAM it will work, but the inference speed is going to be very slow compared to everything running in VRAM.
Note: Qwen 3.6 35b a3b 8Q takes 30-35gb to run.
1beb@reddit
I strongly recommend using rentals/API before making a purchase decision. Use cases can quickly outgrow on prem resources. Give people generic access, watch what they do for a month or two, then decide.
The privacy issue can be solved by using trustworthy data centers with residency in places you trust. AWS, OpenRouter, and Vast all have providers with data policies or raw GPU rentals ("secure cloud"). This might be good enough for a trial. Vast in particular lets you rent consumer hardware, which might be a good start for you.
Enki_40@reddit
It's amazing how many people who ostensibly want to support a business don't understand that the enterprise plans out there, like ChatGPT's and Anthropic's, not only keep data private and don't train on it, but even carry HIPAA and FedRAMP-High level certification, and yet they rush out to buy their own on-prem setups without a clue what they are getting into.
*Any* business should start with either what you suggest (GPU rentals) or an enterprise-grade plan and get experience with that before splashing out money and, more importantly, potentially having downtime, bad experiences, the wrong hardware, etc. It's fun to build home labs, but silly to do it as a learning experience supporting a company.
swagonflyyyy@reddit
Linux
vLLM
Qwen3.6-27b-q8_0
Claude Code pointing at vLLM.
Allows for concurrent request processing and coding. That model in particular is the best agentic vibecoding model I've come across by far. Been working on two separate projects with it already.
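If it helps, the vLLM side of a stack like this can be as simple as the sketch below; the model ID and context length are placeholders, not the exact setup described above:
```
# Minimal single-GPU vLLM launch exposing an OpenAI-compatible API;
# swap in whatever model/quant you actually use.
pip install vllm
vllm serve Qwen/Qwen3-30B-A3B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
# Any OpenAI-compatible client (or a coding agent behind a proxy) can then
# talk to http://localhost:8000/v1.
```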
noticedbyai@reddit
With concurrent users you will run into issues scaling the KV cache, which is directly related to context window size. Chances are, even at a small company, everyone is working the same hours.
So let's say you find a model that works for your use case and it fits in VRAM. Then you find out how much VRAM you need for KV cache, and multiply that by the number of concurrent users. You can quantize, but unless it's your day job to work with this hardware, it's probably easier and cheaper to find an inference provider (what happens when it goes down?)
Quick math: let’s say qwen 3.6 27B at Q4. ~18gb for model weights. Then KV cache with int 8 quantization might be 3-5gb per user. (Double check this, I’m going off my own vram usage for similar size models, with no concurrency)
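To make that quick math concrete, here's a back-of-the-envelope sketch; the layer and head counts are assumed round numbers for a dense ~27B model with GQA, not the real architecture:
```
# Per-user KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
LAYERS=48 KV_HEADS=8 HEAD_DIM=128 BYTES=1 CTX=32768   # int8 cache, 32k context
echo "$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * CTX / 1024 / 1024 )) MiB per user"   # ~3 GiB
```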
The bottleneck might not be VRAM though, it might be memory bandwidth. VRAM is basically a yes-or-no question of whether it fits, but you are running lots of calculations that load weights in and out of memory. The 5090 would fare much better with over 3x the bandwidth.
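And a similar rough ceiling for the bandwidth point; the numbers below are assumptions, not measurements:
```
# Each generated token has to stream the active weights from VRAM once,
# so single-stream tokens/s is roughly capped at bandwidth / weight size.
BW_GBPS=1792      # ~RTX 5090 memory bandwidth in GB/s
WEIGHTS_GB=18     # ~27B dense model at Q4
echo "$(( BW_GBPS / WEIGHTS_GB )) tok/s ceiling, single stream"   # ~99 tok/s
```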
exaknight21@reddit
Hear me out.
Mi50 32 GB, power capped at 225 Watts with Ai-Infos vLLM fork (completely stable for mi50). Run Qwen3.5:4B/Cyankiwi’s AWQ with TurboQuant k8_v8 and with MTP.
(Or find a 32 GB Nvidia card within budget.) Run 16K context and 8192 max generation, and set the thinking budget to 4096.
Continuous batching and smart RAG techniques make for an efficient system, and vLLM's continuous batching is great for this.
Fedor_Doc@reddit
3.5 4B for programming? The 35B does stupid shit and hallucinates functions and arguments; the 4B won't be good.
exaknight21@reddit
No, the 4B is for RAG and general use; the Qwen 3.6 models are great for coding. I use qwen 3.6:35B-A3B q4_k_xl with llama.cpp + opencode + mtp and turboquant q8 for both KV caches.
It’s great at 64K context all GPU.
planemsg@reddit
It's great all the way up to 200k+ context. 💯🔥
Starts to slow down and break down past 225k.
Fedor_Doc@reddit
I also use Qwen 3.6 35B with the same quant and no cache quantization. I was writing a quick and dirty UI in Python 3 with tkinter and had to fix several functions manually to make the UI work correctly. So, talking from experience.
riceinmybelly@reddit
I feel like I'm missing something, but for queries you can run a classification model first on a recent Mac, even with just LM Studio, and do your RAG on that. Then have the devs rent GPUs online from a trusted source with a DPA? A Mac Studio is slow on prefill, but if it just needs to look things up and compare, why would you need the big cards?
I would at least separate what you need for the devs from where your company’s logic lives
T0nd3@reddit
For 7 people with that use profile, you don't need to overthink the hardware. The key insight is that 7 users rarely query at the exact same moment — realistically you're handling 2-3 concurrent requests at peak. That changes the math significantly.
On concurrency: Ollama queues requests by default but supports --parallel (up to 4-8 depending on VRAM). vLLM does proper continuous batching and handles concurrent users much better at scale; for a team of 7 it's probably overkill, but worth knowing if load grows. llama.cpp with --parallel 4 --ctx-size 32768 is a solid middle ground.
On the models: Qwen3-30B-A3B (the MoE one) is worth a serious look. It only activates 3B parameters per token, so you get 30B-quality output at much lower VRAM and higher throughput than a dense 27B. For the 1-2 devs doing coding, Qwen3 outperforms Gemma 4 on coding benchmarks in most comparisons I've seen.
On hardware:
Stack I'd recommend: Ollama + Open WebUI. Open WebUI gives you user management, conversation history, RAG pipelines built in — perfect for a small team. Takes maybe an afternoon to set up properly.
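A minimal version of that stack might look like the sketch below; the model tag, port, and parallel-slot count are placeholders to tune:
```
# Ollama serving on all interfaces with a few parallel request slots,
# plus Open WebUI in Docker pointed at the host's Ollama. Values are illustrative.
OLLAMA_HOST=0.0.0.0 OLLAMA_NUM_PARALLEL=4 ollama serve &
ollama pull qwen3:30b-a3b   # adjust the tag to whatever you settle on

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```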
sagiroth@reddit
Bot account recommending old models
Hypilein@reddit
Why is it that so many bots post here? What’s the point? I don’t see this in any of the non tech related subs I frequent.
sagiroth@reddit
Karma farming and then advertisement. There is karma requirement in place on this sub.
fligglymcgee@reddit
Oh really? What made you choose Qwen3-30B over the widely accepted sota current generation models? What a strangely specific recommendation that just happens to coincide with major llm training date cutoffs.
Fedor_Doc@reddit
Banana bread recipe, please
havnar-@reddit
Cloud provider with privacy clauses.
Muted_Masterpiece342@reddit
My entire company exists to make this easy for a company to do
sagiroth@reddit
Multiple gpus + vllm
Real_Chard5666@reddit
32GB isn't enough for model plus context in a professional sense. With a 27/35b at Q4 I get between 90-120k tokens at Q8 KV cache, and that's just using it by myself. Running Cline I can hit max tokens and then it spills over to RAM and slows down. That's fine for me by myself, but when several people are using it in a professional or work scenario, it will quickly become very frustrating.
One of the larger-VRAM (48/64/96GB) Nvidia pro cards would be better.