how would you set up a local llm server for a business of 7 people?
Posted by snowieslilpikachu69@reddit | LocalLLaMA | View on Reddit | 42 comments
Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun
but say my role at a company is to set up a local LLM, since we obviously don't want confidential data going to other companies etc.
main use case would be queries, rag, general use nothing crazy except for maybe 1 or 2 people using it for programming purposes.
i was thinking of gemma 4 26/31 or qwen 3.6 27/35. how do these models scale with concurrent users? i know i could run one of these on a 5090 and some extra, or a 48gb macbook pro w/ unified memory, but i'm not sure how these scale with multiple users.
tecneeq@reddit
Same situation here, confidential data moves in the company, so we decided to build a local stack.
Users have Windows Notebooks with VirtualBox (never install the extension packs or Oracle comes after you). I prepared a VM with Hermes Agent for vibing.
Next plan is to get LiteLLM so i can measure who uses what.
We are pretty happy so far.
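A rough sketch of what that LiteLLM step could look like; the model name, port, and backend URL below are placeholders, not our actual config:
```
# Point the LiteLLM proxy at a local OpenAI-compatible backend
# (llama.cpp server, vLLM, etc.); everything here is illustrative.
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: local-qwen
    litellm_params:
      model: openai/local-qwen
      api_base: http://127.0.0.1:8080/v1
      api_key: "none"
EOF

pip install 'litellm[proxy]'
litellm --config litellm_config.yaml --port 4000
# Per-user usage can then be tracked by handing each person a virtual key
# generated through the proxy.
```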
relmny@reddit
I'm curious why you're using llama.cpp instead of vLLM or sglang?
tomz17@reddit
woof... on 2x 6000's running a 35B model. Leaving so much performance on the floor.
tf2ftw@reddit
What tasks and use cases does this set up address?
Perfect-Flounder7856@reddit
LXC seems genius!
RipperFox@reddit
Are you aware of the problems with this combination, e.g. as mentioned here: https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/ https://github.com/ggml-org/llama.cpp/issues/21255 https://github.com/unslothai/unsloth/issues/4849 https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/discussions/12
tecneeq@reddit
I don't use IQ quants or Q3 or smaller, only full precision.
marutthemighty@reddit
Thank you for these tips. Useful comment.
planemsg@reddit
👆this is the setup needed for a professional environment 🚀💯🔥
FusionCow@reddit
Ok, well the 1-2 people using it for programming purposes puts a wrench in things, because that means they need a genuinely good model.
You have a few options:
- 8x Pro 6000 (~100k) to run kimi k2.6
- 1x Pro 6000 with a lot of RAM (20-40k; price can change between DDR4 and DDR5) to run kimi k2.6
- Mac Studio 512GB (10-15k; these are hard to find used, but even if you do find one, they aren't great for developers because the prefill speed is bad)
- 2x Pro 6000 (~30k) to run a model like deepseek v4 flash or similar sized. This won't be nearly as good a model as kimi k2.6, but your developers may be able to scrape by with it
- 1x 5090 machine (~6k); this would be able to run qwen 3.6 27b, which to be honest isn't good enough for any serious developer, but it would work for the more general audience.
Honestly in my opinion, you should go with the 5090 machine and run qwen 3.6 35b, which will be fast and snappy for your regular users, then give your developers a kimi or claude subscription.
To actually set up a server like this: if you have NO idea what you're doing, set up LM Studio, which supports concurrent outputs. But if you have used any command-line program before, you should set up llama.cpp.
Also make sure you use linux on whatever box you buy, it's much faster for this stuff than windows
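If you go the llama.cpp route, a starting point might look roughly like this; the model path, context size, and slot count are placeholders you'd tune to your hardware:
```
# llama.cpp's built-in OpenAI-compatible server; values are illustrative.
# -c is the total context (shared across slots), --parallel sets the number
# of concurrent request slots, -ngl 99 offloads all layers to the GPU.
./llama-server \
  -m ./models/your-model-Q8_0.gguf \
  -c 32768 \
  --parallel 4 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080
```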
TheTerrasque@reddit
This has not been my experience so far, but I've only evaluated it for a few weeks still. What problems have you faced with it?
FusionCow@reddit
it can do a lot of simple stuff, it's great for smaller things, but it just isn't a kimi k2.6, which makes sense
TheTerrasque@reddit
That I understand, but that wasn't what I asked. It can do some pretty complicated stuff, and I've found it to be good enough for serious developer work.
So I was wondering where your experience differs.
FusionCow@reddit
I mean I've found it to work fine, but that's just for me. I tend to only use AI to make tools; whenever I want to make something I care about, I'll just hand-program it. But I wanted to make a website to let me Elo-sort videos, with qwen3 vl embed for grouping, and it couldn't do it: every change I made introduced a new bug somewhere else. k2.6 one-shotted it.
marutthemighty@reddit
Hmm. Noted. Once I make enough money, I will look into this. Thank you for posting this detailed reply.
planemsg@reddit
If you want to have any substantial context you will need at minimum 1x Pro 6000. The Mac can give you the context, but you won't have the inference speed; the queries will be much slower.
IMO, for any professional setting 1x Pro 6000 is required.
Then you will need to provide the needed software and tooling to take it to another level.
Maleficent_Bridge_41@reddit
2x RTX 6000 is overkill for just two people. Here's a fresh benchmark with the brilliant repne/vllm:v5 Docker container and Qwen 3.6 27B BF16 on two RTX cards. They can drive a whole team with 16-32 concurrent requests.
planemsg@reddit
Are you using vllm? Currently running 1x Pro 6000 with 3.6 35b a3b fp8 at 262k max context. 2 concurrent requests work, but it cuts the speed by easily more than half.
Maleficent_Bridge_41@reddit
Yep, vLLM, but with a build & Docker container that includes an RTX 6000-optimized, PR-cherry-picked environment maintained by repne.
That's the one that ran for the above benchmark (https://github.com/voipmonitor/llm-inference-bench) - adapt the TP to 1 and increase the utilization and you should be more than good to go on "only" the one as well.
```
docker run --rm \
--runtime nvidia \
--gpus all \
--ipc=host \
--shm-size=32g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--network host \
--volume ~/certificates:/root/certificates \
--volume ~/.cache/huggingface:/root/.cache/huggingface \
--volume ~/.cache/vllm:/root/.cache/vllm \
--volume ~/.triton/cache:/root/.triton/cache \
--env OMP_NUM_THREADS=8 \
--env VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
--env VLLM_WORKER_MULTIPROC_METHOD=spawn \
--env VLLM_ALLREDUCE_USE_SYMM_MEM=0 \
--env HUGGING_FACE_HUB_TOKEN=hf_REDACTED \
repne/vllm:v5 \
-O3 \
--model Qwen/Qwen3.6-27B \
--served-model-name Qwen3.6-27B \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.80 \
--max-model-len 262144 \
--max-num-seqs 128 \
--max-num-batched-tokens 32768 \
--max-cudagraph-capture-size 256 \
--language-model-only \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 3 \
--speculative-config.attention_backend flashinfer \
--speculative-config.draft_sample_method greedy \
--attention-backend flashinfer \
--load-format instanttensor
```
planemsg@reddit
currently just running the base commands with the following generation config overrides:
FusionCow@reddit
you can easily get max context on a 5090, what are you talking about?
planemsg@reddit
Yea, with 6k context. For a professional setting this is nothing. With agent prompts and a small codebase you can easily start at 10-20k context before any user prompts.
This will technically work with simple chat, but not legit coding projects. Also, you'd probably be running at Q2-Q4, which will have a major impact on reasoning performance.
Also, if you spill the KV cache over to system RAM it will work, but the inference speed is going to be very slow compared to everything running in VRAM.
Note: Qwen 3.6 35b a3b 8Q takes 30-35gb to run.
1beb@reddit
I strongly recommend using rentals/API before making a purchase decision. Use cases can quickly outgrow on prem resources. Give people generic access, watch what they do for a month or two, then decide.
The privacy issue can be solved by using trustworthy data centers with residency in places you trust. AWS, OpenRouter, and Vast all have providers with data policies or raw GPU rentals ("secure cloud"). This might be good enough for a trial. Vast in particular lets you rent consumer hardware, which might be a good start for you.
Enki_40@reddit
It's amazing how many people who ostensibly want to support a business don't understand that the enterprise plans out there, like ChatGPT's and Anthropic's, not only keep data private and don't train on it, but even carry HIPAA and FedRAMP-High level certification, and yet they rush out to buy their own on-prem setups without a clue what they are getting into.
*Any* business should start with either what you suggest (GPU rentals) or an enterprise-grade plan and get experience with that before splashing out money and, more importantly, potentially having downtime, bad experiences, the wrong hardware, etc. It's fun to build home labs, but silly to do it as a learning experience supporting a company.
swagonflyyyy@reddit
Linux
vLLM
Qwen3.6-27b-q8_0
Claude Code pointing at vLLM.
Allows for concurrent request processing and coding. That model in particular is the best agentic vibecoding model I've come across by far. Been working on two separate projects with it already.
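If it helps, the vLLM side of a stack like this can be as simple as the sketch below; the model ID and context length are placeholders, not the exact setup described above:
```
# Minimal single-GPU vLLM launch exposing an OpenAI-compatible API;
# swap in whatever model/quant you actually use.
pip install vllm
vllm serve Qwen/Qwen3-30B-A3B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000
# Any OpenAI-compatible client (or a coding agent behind a proxy) can then
# talk to http://localhost:8000/v1.
```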
noticedbyai@reddit
With concurrent users you will run into issues scaling the KV cache, which is directly related to context window size. Chances are, even at a small company, everyone is working the same hours.
So let's say you find a model that works for your use case and it fits in VRAM. Then you find out how much VRAM you need for KV cache, and multiply that by the number of concurrent users. You can quantize, but unless it's your day job to work with this hardware, it's probably easier and cheaper to find an inference provider (what happens when it goes down?)
Quick math: let’s say qwen 3.6 27B at Q4. ~18gb for model weights. Then KV cache with int 8 quantization might be 3-5gb per user. (Double check this, I’m going off my own vram usage for similar size models, with no concurrency)
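To make that quick math concrete, here's a back-of-the-envelope sketch; the layer and head counts are assumed round numbers for a dense ~27B model with GQA, not the real architecture:
```
# Per-user KV cache = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
LAYERS=48 KV_HEADS=8 HEAD_DIM=128 BYTES=1 CTX=32768   # int8 cache, 32k context
echo "$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * CTX / 1024 / 1024 )) MiB per user"   # ~3 GiB
```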
The bottleneck might not be VRAM though, it might be memory bandwidth. VRAM is basically a yes-or-no question of whether it fits, but you are running lots of calculations that load weights in and out of memory. The 5090 would fare much better with over 3x the bandwidth.
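And a similar rough ceiling for the bandwidth point; the numbers below are assumptions, not measurements:
```
# Each generated token has to stream the active weights from VRAM once,
# so single-stream tokens/s is roughly capped at bandwidth / weight size.
BW_GBPS=1792      # ~RTX 5090 memory bandwidth in GB/s
WEIGHTS_GB=18     # ~27B dense model at Q4
echo "$(( BW_GBPS / WEIGHTS_GB )) tok/s ceiling, single stream"   # ~99 tok/s
```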
exaknight21@reddit
Hear me out.
Mi50 32 GB, power capped at 225 Watts with Ai-Infos vLLM fork (completely stable for mi50). Run Qwen3.5:4B/Cyankiwi’s AWQ with TurboQuant k8_v8 and with MTP.
(Or find a 32 GB Nvidia card within budget.) Run 16K context and 8192 max generation, and set the thinking budget to 4096.
Continuous batching and smart RAG techniques make for an efficient system, and vLLM's continuous batching is great for this.
Fedor_Doc@reddit
3.5 4B for programming? The 35B does stupid shit and hallucinates functions and arguments; the 4B won't be good.
exaknight21@reddit
No, the 4B is for RAG and general use; the Qwen 3.6 models are great for coding. I use qwen 3.6:35B-A3B q4_k_xl with llama.cpp + opencode + mtp and turboquant q8 for both KV caches.
It’s great at 64K context all GPU.
planemsg@reddit
It's great all the way up to 200k+ context. 💯🔥
Starts to slow down and break down past 225k.
Fedor_Doc@reddit
I also use Qwen 3.6 35B with the same quant and no cache quantization. I was writing a quick and dirty UI in Python 3 with tkinter and had to fix several functions manually to make the UI work correctly. So, talking from experience.
riceinmybelly@reddit
I feel like I'm missing something, but for queries you can run a classification model first on a recent Mac, even with just LM Studio, and do your RAG on that. Then have the devs rent GPUs online from a trusted source with a DPA? A Mac Studio is slow on prefill, but if it just needs to look things up and compare, why would you need the big cards?
I would at least separate what you need for the devs from where your company’s logic lives
T0nd3@reddit
For 7 people with that use profile, you don't need to overthink the hardware. The key insight is that 7 users rarely query at the exact same moment — realistically you're handling 2-3 concurrent requests at peak. That changes the math significantly.
On concurrency: Ollama queues requests by default but supports --parallel (up to 4-8 depending on VRAM). vLLM does proper continuous batching and handles concurrent users much better at scale; for a team of 7 it's probably overkill, but worth knowing if load grows. llama.cpp with --parallel 4 --ctx-size 32768 is a solid middle ground.
On the models: Qwen3-30B-A3B (the MoE one) is worth a serious look. It only activates 3B parameters per token, so you get 30B-quality output at much lower VRAM and higher throughput than a dense 27B. For the 1-2 devs doing coding, Qwen3 outperforms Gemma 4 on coding benchmarks in most comparisons I've seen.
On hardware:
Stack I'd recommend: Ollama + Open WebUI. Open WebUI gives you user management, conversation history, RAG pipelines built in — perfect for a small team. Takes maybe an afternoon to set up properly.
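A minimal version of that stack might look like the sketch below; the model tag, port, and parallel-slot count are placeholders to tune:
```
# Ollama serving on all interfaces with a few parallel request slots,
# plus Open WebUI in Docker pointed at the host's Ollama. Values are illustrative.
OLLAMA_HOST=0.0.0.0 OLLAMA_NUM_PARALLEL=4 ollama serve &
ollama pull qwen3:30b-a3b   # adjust the tag to whatever you settle on

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```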
sagiroth@reddit
Bot account recommending old models
Hypilein@reddit
Why is it that so many bots post here? What’s the point? I don’t see this in any of the non tech related subs I frequent.
sagiroth@reddit
Karma farming and then advertisement. There is karma requirement in place on this sub.
fligglymcgee@reddit
Oh really? What made you choose Qwen3-30B over the widely accepted sota current generation models? What a strangely specific recommendation that just happens to coincide with major llm training date cutoffs.
Fedor_Doc@reddit
Banana bread recipe, please
havnar-@reddit
Cloud provider with privacy clauses.
Muted_Masterpiece342@reddit
My entire company exists to make this easy for a company to do
sagiroth@reddit
Multiple gpus + vllm
Real_Chard5666@reddit
32GB isn't enough for model plus context in a professional sense. With a 27/35b at Q4 I get between 90-120k tokens at Q8 KV cache, and that's just using it by myself. Running Cline I can hit max tokens and then it spills over to RAM and slows down. That's fine for me by myself, but when several people are using it in a professional or work scenario, it will quickly become very frustrating.
One of the larger-VRAM (48/64/96GB) Nvidia pro cards would be better.