Serving 1B+ tokens/day locally in my research lab
Posted by SessionComplete2334@reddit | LocalLLaMA | View on Reddit | 13 comments
I lead a research lab at a university hospital and spent the last few weeks configuring our internal LLM server. I put a lot of thought into the server config, software stack, and model. Now I am at a point where I am happy: it actually holds up under load, and we are pushing more than 1B tokens/day (roughly 2/3 ingestion, 1/3 decode) through 2x H200 serving GPT-OSS-120B. I thought this could be interesting for others looking to do something similar, and I am also hoping to get some feedback. So I am sharing my software stack below, as well as some considerations for why I chose GPT-OSS-120B.
Disclaimer: I used Claude to help write this.
Hardware
Our server has two H200 GPUs; apart from that it is not very beefy: 124 GB RAM, a 16-core CPU, and 512 GB of disk space. Enough to hold the models, Docker images, and logs.
Model
I tried a bunch of models a couple of weeks ago. Qwen 3 models, GLM-Air and GPT-OSS. GPT-OSS-120B seemed to be the best for us:
- Throughput is important, as we have multiple jobs processing large amounts of data. For GPT-OSS single-user decode hits up to ~250 tok/s (mostly ~220 tok/s). Other models I tried got to ~150 tok/s at most. Only GPT-OSS-20B was faster, but not by that much (300 tok/s). Unfortunately the 20B model is a lot dumber than the 120B.
- The model is reasonably smart. Good enough for clinical structuring, adheres well to JSON output, calls tools reliably. It still makes dumb mistakes, but at least it makes them very fast.
- I trust the published evals of GPT-OSS-120B more, because the deployed weights are the evaluated weights (it was trained in mxfp4). With community quants you are always a bit uncertain whether the claimed performance is the true performance, which makes the models hard to compare.
- It seems like mxfp4 is just really well supported on vLLM and Hopper GPUs.
Things I tried that were worse on H200:
- nvfp4/GGUF → ~100-150 tok/s single user
- Speculative decoding for GPT-OSS-120B → ~150 tok/s (the draft model overhead killed it for this setup)
mxfp4 on H200 just seems extremely well optimized right now. Still, I am always looking for models with better performance. Currently eyeing Mistral Small 4 (vision, 120B as well), Qwen 3.5, and Gemma 4. However, Gemma being dense makes me skeptical it can match this throughput, and I do not trust the smaller MoE models to be as smart as a 120B model. Same with the Qwen models. Currently I also can't take GPT-OSS offline anymore to test more models properly because the demand is too high. But as soon as we scale hardware, I would like to try more.
Architecture
I run everything in Docker with one big docker compose file (see below).
Client → LiteLLM proxy (4000) → vLLM GPU 0 (8000)
                              → vLLM GPU 1 (8000)
                ↓
         PostgreSQL (keys, usage, spend)

Prometheus (scrapes vLLM /metrics every 5s)
Grafana (dashboards)
MkDocs (user docs)
- vLLM does the actual serving, one container per GPU
- LiteLLM for OpenAI-compatible API, handles keys, rate limits, the priority queue, and routing
- Postgres to store usage data
- Prometheus + Grafana for nice dashboards
I picked one instance per GPU over tensor parallel across both because at this model size with mxfp4 it fits comfortably on a single H200, and two independent replicas give better throughput with no NCCL communication overhead. KV cache is also not a bottleneck for us. With simple-shuffle routing the load split is almost perfect (2.10B vs 2.11B prompt tokens after ~6 days of uptime). Other routing strategies did not work as well (LiteLLM also recommends simple-shuffle in their docs).
vLLM
--quantization mxfp4
--max-model-len 128000
--gpu-memory-utilization 0.80
--max-num-batched-tokens 8192
--enable-chunked-prefill
--enable-prefix-caching
--max-num-seqs 128
Plus environment:
VLLM_USE_FLASHINFER_MXFP4_MOE=1
NCCL_P2P_DISABLE=1
TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
For details on this:
VLLM_USE_FLASHINFER_MXFP4_MOE=1 needed for this model on H200.
NCCL_P2P_DISABLE=1 is needed even though each container only sees one GPU. If I remember right, without it NCCL throws cryptic errors.
TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken normally the container would download the tiktoken files itself, but behind our firewall it cannot connect to the web, so I have to provide the tokenizer manually.
--enable-prefix-caching we send a lot of near-identical system prompts (templated structuring tasks, agent scaffolds). The cache hit rate is high, so TTFT drops noticeably with this.
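Prefix caching only pays off if the shared part of the prompt is byte-identical and comes first. A minimal sketch of how a structuring job can keep the static instructions in front and the per-document payload last (the template text is a made-up example, not our production prompt):

```python
# Sketch: keep static instructions byte-identical and first, so vLLM's
# prefix cache can reuse KV blocks across requests. The template below
# is hypothetical.
STATIC_SYSTEM = (
    "You are a clinical structuring assistant. "
    "Return a JSON object with the fields requested below."
)

def build_messages(document_text: str) -> list[dict]:
    # Variable content goes last; everything before it is cacheable.
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": f"Structure this report:\n{document_text}"},
    ]

a = build_messages("Report A ...")
b = build_messages("Report B ...")
# The cacheable prefix (the system message) is identical across requests.
assert a[0] == b[0]
```

Anything that varies per request (timestamps, request IDs) inside the system prompt would break the shared prefix and kill the hit rate.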
--max-num-seqs 128 per instance, so 256 concurrent sequences across the box. KV cache is rarely the bottleneck for us (Grafana usually shows 25-30% usage, with occasional spikes toward 90% under bursts); the actual ceiling is decode throughput. Raising max-num-seqs would just slow each individual stream down without buying real headroom. I tried up to 512 parallel requests and decode throughput did not exceed ~3,000 tok/s; individual responses just got slower.
--gpu-memory-utilization 0.80 and --max-num-batched-tokens 8192 (the latter not active currently, but I will swap it in if needed) are both there for logprobs requests. After some mysterious crashes of the vLLM servers, I found that if a client requests top-k logprobs on a long context, vLLM materializes a chunk of memory that scales fast, leading to GPU OOM and crashing the server. Capping batched tokens at 8k limits the burst size, since logprobs are then computed for at most 8,192 tokens at a time, and leaving 20% VRAM headroom absorbs those spikes without hurting steady-state throughput. As KV cache is not a limiting factor for us, I keep gpu-memory-utilization at 0.8 permanently.
Healthcheck start_period: 900s. Loading a 120B MoE takes 10-15 minutes from cold. Anything shorter and LiteLLM spams its logs about unhealthy upstreams.
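A back-of-the-envelope estimate shows why long-context logprobs requests blow up. The vocabulary size and logits dtype below are assumptions (an o200k-family vocab and fp32), and the exact buffer vLLM allocates may differ, but the scaling is the point:

```python
# Rough estimate of the logits buffer for a logprobs request:
# one row of vocab-size floats per prompt token in the batch.
# VOCAB and BYTES are assumptions, not measured values.
VOCAB = 201_088      # assumed o200k-family vocabulary size
BYTES = 4            # assumed fp32 logits

def logits_gib(num_tokens: int) -> float:
    return num_tokens * VOCAB * BYTES / 2**30

full_ctx = logits_gib(128_000)   # a whole 128k-token prompt at once
chunked  = logits_gib(8_192)     # capped by --max-num-batched-tokens
print(f"128k tokens: {full_ctx:.0f} GiB, 8k chunk: {chunked:.1f} GiB")
```

Even under these rough assumptions, an uncapped 128k-token logprobs request wants tens of GiB of transient logits, which is exactly the kind of spike the 20% VRAM headroom and the 8k token cap are there to absorb.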
docker-compose (vLLM + LiteLLM)
Stripped down to just vLLM and LiteLLM; Postgres, Prometheus, and Grafana are left out since they are standard.
services:
  vllm-gpt-oss-120b:
    image: vllm/vllm-openai:latest
    container_name: vllm-gpt-oss-120b
    environment:
      - VLLM_USE_FLASHINFER_MXFP4_MOE=1
      - NCCL_P2P_DISABLE=1
      - TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
    volumes:
      - /srv/cache/tiktoken:/root/.cache/tiktoken:ro
      - /srv/models/gpt-oss-120b:/models/gpt-oss-120b
    expose:
      - "8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 20
      start_period: 900s
    # add --max-num-batched-tokens 8192 to the command below if needed
    # (a "#" inside the folded scalar would be passed to vLLM literally)
    command: >
      /models/gpt-oss-120b
      --served-model-name gpt-oss-120b
      --quantization mxfp4
      --max-model-len 128000
      --gpu-memory-utilization 0.80
      --enable-chunked-prefill
      --enable-prefix-caching
      --max-num-seqs 128

  vllm-gpt-oss-120b_2:
    image: vllm/vllm-openai:latest
    container_name: vllm-gpt-oss-120b_2
    environment:
      - VLLM_USE_FLASHINFER_MXFP4_MOE=1
      - NCCL_P2P_DISABLE=1
      - TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
    volumes:
      - /srv/cache/tiktoken:/root/.cache/tiktoken:ro
      - /srv/models/gpt-oss-120b:/models/gpt-oss-120b
    expose:
      - "8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 20
      start_period: 900s
    # add --max-num-batched-tokens 8192 to the command below if needed
    command: >
      /models/gpt-oss-120b
      --served-model-name gpt-oss-120b_2
      --quantization mxfp4
      --max-model-len 128000
      --gpu-memory-utilization 0.80
      --enable-chunked-prefill
      --enable-prefix-caching
      --max-num-seqs 128

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
    command: >
      --config /app/config.yaml
      --port 4000
      --num_workers 4
    depends_on:
      vllm-gpt-oss-120b:
        condition: service_healthy
      vllm-gpt-oss-120b_2:
        condition: service_healthy
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
The served model name on the second replica is deliberately gpt-oss-120b_2 (not gpt-oss-120b), because LiteLLM's upstream model field needs to disambiguate them even though the public-facing name is the same.
LiteLLM config
model_list:
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://vllm-gpt-oss-120b:8000/v1
      api_key: "EMPTY"
      timeout: 600
      stream_timeout: 60
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b_2
      api_base: http://vllm-gpt-oss-120b_2:8000/v1
      api_key: "EMPTY"
      timeout: 600
      stream_timeout: 60

router_settings:
  routing_strategy: "simple-shuffle"  # best under heavy load; tried "least-busy" and others, they did not perform as well
  cooldown_time: 5  # bring a cooled-down vLLM instance back quickly; failures are often just vLLM-side rate limiting, so a long cooldown is not warranted
  enable_priority_queue: true
  redis_host: "litellm-redis"
  redis_port: 6379

litellm_settings:
  cache: false
  max_parallel_requests: 196
  request_timeout: 600
  num_retries: 20
  allowed_fails: 200
  drop_params: true  # apparently for Claude Code compatibility, not tested
Two model entries with the same model_name is how you get LiteLLM to load balance across them; it does this natively, with no extra configuration needed.
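From the client side the two replicas are invisible: you call the proxy with the shared public model name and LiteLLM picks a replica. A minimal stdlib-only sketch of what such a request looks like (the virtual key is a placeholder); it builds the request without sending it:

```python
import json
import urllib.request

def chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat request
    to the LiteLLM proxy from the compose file above."""
    payload = {
        "model": "gpt-oss-120b",  # the shared public name; LiteLLM shuffles replicas
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "http://localhost:4000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": "Bearer sk-your-litellm-virtual-key",  # placeholder
            "Content-Type": "application/json",
        },
    )

req = chat_request("Say hi in one word.")
print(req.full_url)
```

In practice the OpenAI Python package with `base_url="http://localhost:4000/v1"` does the same thing with less boilerplate.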
Numbers after ~6 days uptime
| Metric | Value |
|---|---|
| Total tokens processed | 6.57B |
| Prompt tokens | 4.20B |
| Generation tokens | 2.36B |
| Input:output ratio | 1.78:1 |
| Total requests | 2.76M |
| Avg tokens per request | ~2,380 |
Throughput
| Metric | 1-min rate | 1-hour avg |
|---|---|---|
| Generation tok/s | 2,879 | 2,753 |
| Prompt tok/s | 24,782 | 21,472 |
| Combined tok/s | 27,661 | 24,225 |
Per-instance load split
| Instance | Prompt | Generation |
|---|---|---|
| GPU 0 | 2.10B | 1.18B |
| GPU 1 | 2.11B | 1.19B |
Latency under heavy load
This was captured at a moment with 173 running and 29 queued requests.
| Metric | p50 | p95 | p99 |
|---|---|---|---|
| TTFT | 17.8s | 37.8s | 39.6s |
| E2E | 41.3s | 175.3s | 750.7s |
| ITL | 35ms | 263ms | — |
| Queue wait | 18.7s | 29.4s | — |
TTFT is dominated by queue time (p50 queue 18.7s vs p50 TTFT 17.8s). Under lighter load TTFT is in the low seconds. The E2E p99 of 750s is one user generating 4k+ tokens off a 100k context, which is fine and expected. Still, one current issue is the ping-pong effect I detail below.
ITL p50 of 35ms means each individual stream sees ~28 tok/s when the box is full, which is probably fine for most interactive use.
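Both derived figures above follow directly from the tables; a quick sanity check:

```python
# Sanity-check the derived numbers from the tables above.
itl_p50_ms = 35
per_stream_toks = 1000 / itl_p50_ms    # tokens per second per stream
avg_tokens = 6.57e9 / 2.76e6           # total tokens / total requests
print(f"{per_stream_toks:.1f} tok/s per stream, ~{avg_tokens:.0f} tokens/request")
```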
Cost tracking
LiteLLM tracks "equivalent spend" against configured per-token rates. I set ours to GPT-OSS-120B pricing on Amazon Bedrock ($0.15/M in, $0.60/M out). Over the last 7 days the hypothetical spend is $1,909 USD. The H200s cost us about $25k each, so the server basically pays for itself after a year.
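The equivalent-spend arithmetic is just token counts times the configured rates. Applying the Bedrock rates to the ~6-day token totals above (note this is a slightly different window than the 7-day $1,909 figure, so the numbers will not match exactly):

```python
# Equivalent spend at the configured Bedrock-style rates for GPT-OSS-120B.
IN_RATE = 0.15 / 1e6    # $ per prompt token
OUT_RATE = 0.60 / 1e6   # $ per generated token

def spend(prompt_tokens: float, gen_tokens: float) -> float:
    return prompt_tokens * IN_RATE + gen_tokens * OUT_RATE

# Token totals from the ~6-day table above: 4.20B prompt, 2.36B generation.
print(f"${spend(4.20e9, 2.36e9):,.0f}")
```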
Stuff I am still unhappy with
When one vLLM replica returns too many errors in a window, LiteLLM cools it down. The other replica then takes the full load, starts erroring under the doubled pressure, and gets cooled down too. By then the first has come back, but now it gets the burst and starts throwing errors again. The whole proxy is then effectively at 50% capacity even though both GPUs are perfectly healthy. I have played with cooldown_time, allowed_fails, and num_retries but cannot find a setting that distributes the load well without this ping-pong effect.
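One mitigation that is independent of LiteLLM settings is making the batch jobs themselves back off with jitter instead of hammering whichever replica is still up. A generic client-side sketch (not our production code):

```python
import random
import time

def call_with_backoff(fn, max_tries: int = 6, base: float = 1.0, cap: float = 30.0):
    """Retry fn() with capped exponential backoff and full jitter,
    so retries from many concurrent jobs do not all land at once."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries, surface the error
            # Full jitter: sleep a random amount up to the capped window.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

With hundreds of jobs retrying, full jitter turns the synchronized retry spike that re-triggers the cooldown into a gradual ramp on the recovering replica.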
Happy to share the prometheus.yml, the Grafana dashboard JSON, or the metrics collection script if anyone wants them. Also very curious what others running similar scale setups are doing for admission control and retry handling, since that is where I feel most of my remaining headroom is.
Screenshots from Grafana below.
intellidumb@reddit
Would be curious if you swapped LiteLLM for this, which has been in my bookmark list for a while: https://github.com/maximhq/bifrost
SessionComplete2334@reddit (OP)
I also read in a blog post that it is apparently better than LiteLLM for larger-scale inference. Definitely want to try it out soon.
somerussianbear@reddit
Great stuff man! I’m not familiar with how vLLM handles prefix cache, mind to elaborate on how you can get it to work with this “little” memory and so many concurrent users?
SessionComplete2334@reddit (OP)
With 65 GB taken by the model, we have ~50 GB left for KV cache, which is room for several million tokens. Most requests are not that long.
This setup is not fault-tolerant, but a research server can have occasional downtime, so I am driving it at the edge on purpose.
4xi0m4@reddit
That prefix caching setup is clean. The Tiktoken cache directory trick behind a firewall is a solid workaround. For anyone running similar setups, enabling --enable-prefix-caching on vLLM with repeated system prompts is basically free performance if your prompts share common prefixes. The 128 seqs per instance limit also gives good headroom for concurrent users without starving the KV cache.
jzn21@reddit
Don’t be afraid of quants, they still can be quite smart. You should definitely give Gemma 4 31b a try. In my comprehensive tests, it is (much) smarter than OSS 120b in terms of data processing.
SessionComplete2334@reddit (OP)
Yeah, I am probably a bit too conservative here. How is it speed-wise as a dense model?
AFruitShopOwner@reddit
What are users actually using it for? Do you use a RAG system? What tools does it have access to? What front end do you use?
SessionComplete2334@reddit (OP)
A lot of structuring workflows. Structuring radiology reports (about 2M reports) and other clinical documents. Then also agentic workflows. Those are not projects I code myself, so I don't know the details of their stack.
Our main endpoint is the OpenAI compatible API. Personally I use the OpenAI python package a lot with it. Good support in guided generation and easy to use.
Structuring is implemented with langchain. For the agentic workflows I believe my PhD student built his own harness and tools.
For the user interface I vibe coded a few applications, and we also work with a company that provides a user interface with a secure backend (encrypted user database, good access roles, etc.)
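On the guided-generation point above: a structured-output request against an OpenAI-compatible vLLM endpoint can pin the model to a JSON schema. A sketch of the request body (the schema and field names are made up, and the exact parameter names for guided decoding depend on the vLLM version):

```python
import json

# Hypothetical schema for a structuring task; field names are invented.
schema = {
    "type": "object",
    "properties": {
        "finding": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["finding", "severity"],
}

body = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Structure: small pleural effusion."}],
    # OpenAI-style structured output; recent vLLM maps this onto guided decoding.
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "report", "schema": schema},
    },
}
print(json.dumps(body)[:80])
```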
AFruitShopOwner@reddit
Thanks
streppelchen@reddit
thanks for sharing!
tremendous_turtle@reddit
Great write-up! Lots of good insight here for real production deployments. I hope the LiteLLM team sees this, very useful real world feedback around improving its load balancing features.
How was throughput on Qwen 3.5 122B-A10B compared to GPT OSS 120B? I’d expect it should be very fast on an H200, and I think would be a considerable upgrade in model capability.
draconisx4@reddit
That's a solid setup for handling that volume, kudos on getting it stable. In a hospital lab, make sure you're locking down governance around data privacy and model oversight to prevent any unintended leaks or biases creeping in.