Serving 1B+ tokens/day locally in my research lab
Posted by SessionComplete2334@reddit | LocalLLaMA | View on Reddit | 13 comments
I lead a research lab at a university hospital and spent the last few weeks configuring our internal LLM server. I put a lot of thought into the server config, software stack, and model. Now I am at a point where I am happy: it actually holds up under load, and we are pushing more than 1B tokens/day (roughly 2/3 ingestion, 1/3 decode) through 2x H200 serving GPT-OSS-120B. I thought this could be interesting for others looking to do something similar, and I am also hoping to get some feedback. So I am sharing my software stack below, as well as some considerations for why I chose GPT-OSS-120B.
Disclaimer: I used Claude to help write this.
Hardware
Our server has two H200 GPUs; apart from that it is not very beefy: 124 GB RAM, a 16-core CPU, and 512 GB of disk space. Enough to hold the models, Docker images, and logs.
Model
I tried a bunch of models a couple of weeks ago. Qwen 3 models, GLM-Air and GPT-OSS. GPT-OSS-120B seemed to be the best for us:
- Throughput is important, as we have multiple jobs processing large amounts of data. For GPT-OSS single-user decode hits up to ~250 tok/s (mostly ~220 tok/s). Other models I tried got to ~150 tok/s at most. Only GPT-OSS-20B was faster, but not by that much (300 tok/s). Unfortunately the 20B model is a lot dumber than the 120B.
- The model is reasonably smart. Good enough for clinical structuring, adheres well to JSON output, calls tools reliably. It still makes dumb mistakes, but at least it makes them very fast.
- I trust the published evals of GPT-OSS-120B more, because the deployed weights are the evaluated weights (it was trained in mxfp4). With community quants you are always a bit uncertain whether the claimed performance is the true performance, which makes the models hard to compare.
- It seems like mxfp4 is just really well supported on vLLM and Hopper GPUs.
Things I tried that were worse on H200:
- nvfp4/GGUF → ~100-150 tok/s single user
- Speculative decoding for GPT-OSS-120B → ~150 tok/s (the draft model overhead killed it for this setup)
mxfp4 on H200 just seems extremely well optimized right now. Still, I am always looking for models with better performance. Currently eyeing Mistral Small 4 (vision, 120B as well), Qwen 3.5, and Gemma 4. However, Gemma being dense makes me skeptical it can match this throughput, and I do not trust the smaller MoE models to be as smart as a 120B model. Same with the Qwen models. Currently I also can't take GPT-OSS offline anymore to test more models properly because the demand is too high. But as soon as we scale hardware, I would like to try more.
Architecture
I run everything in Docker with one big docker compose file (see below).
Client → LiteLLM proxy (4000) → vLLM GPU 0 (8000)
                              → vLLM GPU 1 (8000)
                ↓
         PostgreSQL (keys, usage, spend)

Prometheus (scrapes vLLM /metrics every 5s)
Grafana (dashboards)
MkDocs (user docs)
- vLLM does the actual serving, one container per GPU
- LiteLLM for OpenAI-compatible API, handles keys, rate limits, the priority queue, and routing
- Postgres to store usage data
- Prometheus + Grafana for nice dashboards
I picked one instance per GPU over tensor parallel across both because at this model size with mxfp4 it fits comfortably on a single H200, and two independent replicas give better throughput with no NCCL communication overhead. KV cache is also not a bottleneck for us. With simple-shuffle routing the load split is almost perfect (2.10B vs 2.11B prompt tokens after ~6 days of uptime). Other routing strategies did not work as well (LiteLLM also recommends simple-shuffle in their docs).
vLLM
--quantization mxfp4
--max-model-len 128000
--gpu-memory-utilization 0.80
--max-num-batched-tokens 8192
--enable-chunked-prefill
--enable-prefix-caching
--max-num-seqs 128
Plus environment:
VLLM_USE_FLASHINFER_MXFP4_MOE=1
NCCL_P2P_DISABLE=1
TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
For details on this:
VLLM_USE_FLASHINFER_MXFP4_MOE=1 needed for this model on H200.
NCCL_P2P_DISABLE=1 is needed even though each container only sees one GPU. If I remember right, without it NCCL throws cryptic errors.
TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken normally the container would download the tiktoken files itself, but behind our firewall it cannot connect to the web, so I have to provide the tokenizer manually.
--enable-prefix-caching we send a lot of near-identical system prompts (templated structuring tasks, agent scaffolds). The cache hit rate is high, so TTFT drops noticeably with this.
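Prefix caching only pays off if the shared part of the prompt is byte-identical and comes first. A minimal sketch of how a structuring job can keep the static instructions in front and the per-document payload last (the template text is a made-up example, not our production prompt):

```python
# Sketch: keep static instructions byte-identical and first, so vLLM's
# prefix cache can reuse KV blocks across requests. The template below
# is hypothetical.
STATIC_SYSTEM = (
    "You are a clinical structuring assistant. "
    "Return a JSON object with the fields requested below."
)

def build_messages(document_text: str) -> list[dict]:
    # Variable content goes last; everything before it is cacheable.
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": f"Structure this report:\n{document_text}"},
    ]

a = build_messages("Report A ...")
b = build_messages("Report B ...")
# The cacheable prefix (the system message) is identical across requests.
assert a[0] == b[0]
```

Anything that varies per request (timestamps, request IDs) inside the system prompt would break the shared prefix and kill the hit rate.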
--max-num-seqs 128 per instance, so 256 concurrent sequences across the box. KV cache is rarely the bottleneck for us (Grafana usually shows 25-30% usage, with occasional spikes toward 90% under bursts); the actual ceiling is decode throughput. Raising max-num-seqs would just slow each individual stream down without buying real headroom. I tried up to 512 parallel requests and decode throughput did not exceed ~3,000 tok/s; individual responses just got slower.
--gpu-memory-utilization 0.80 and --max-num-batched-tokens 8192 (the latter not active currently, but I will swap it in if needed) are both there for logprobs requests. After some mysterious crashes of the vLLM servers, I found that if a client requests top-k logprobs on a long context, vLLM materializes a chunk of memory that scales fast, leading to GPU OOM and crashing the server. Capping batched tokens at 8k limits the burst size, since logprobs are then computed for at most 8,192 tokens at a time, and leaving 20% VRAM headroom absorbs those spikes without hurting steady-state throughput. As KV cache is not a limiting factor for us, I keep gpu-memory-utilization at 0.8 permanently.
Healthcheck start_period: 900s. Loading a 120B MoE takes 10-15 minutes from cold. Anything shorter and LiteLLM spams its logs about unhealthy upstreams.
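A back-of-the-envelope estimate shows why long-context logprobs requests blow up. The vocabulary size and logits dtype below are assumptions (an o200k-family vocab and fp32), and the exact buffer vLLM allocates may differ, but the scaling is the point:

```python
# Rough estimate of the logits buffer for a logprobs request:
# one row of vocab-size floats per prompt token in the batch.
# VOCAB and BYTES are assumptions, not measured values.
VOCAB = 201_088      # assumed o200k-family vocabulary size
BYTES = 4            # assumed fp32 logits

def logits_gib(num_tokens: int) -> float:
    return num_tokens * VOCAB * BYTES / 2**30

full_ctx = logits_gib(128_000)   # a whole 128k-token prompt at once
chunked  = logits_gib(8_192)     # capped by --max-num-batched-tokens
print(f"128k tokens: {full_ctx:.0f} GiB, 8k chunk: {chunked:.1f} GiB")
```

Even under these rough assumptions, an uncapped 128k-token logprobs request wants tens of GiB of transient logits, which is exactly the kind of spike the 20% VRAM headroom and the 8k token cap are there to absorb.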
docker-compose (vLLM + LiteLLM)
Stripped down to just vLLM and LiteLLM; Postgres, Prometheus, and Grafana are left out since they are standard.
services:
  vllm-gpt-oss-120b:
    image: vllm/vllm-openai:latest
    container_name: vllm-gpt-oss-120b
    environment:
      - VLLM_USE_FLASHINFER_MXFP4_MOE=1
      - NCCL_P2P_DISABLE=1
      - TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
    volumes:
      - /srv/cache/tiktoken:/root/.cache/tiktoken:ro
      - /srv/models/gpt-oss-120b:/models/gpt-oss-120b
    expose:
      - "8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 20
      start_period: 900s
    # add --max-num-batched-tokens 8192 to the command below if needed
    # (a "#" inside the folded scalar would be passed to vLLM literally)
    command: >
      /models/gpt-oss-120b
      --served-model-name gpt-oss-120b
      --quantization mxfp4
      --max-model-len 128000
      --gpu-memory-utilization 0.80
      --enable-chunked-prefill
      --enable-prefix-caching
      --max-num-seqs 128

  vllm-gpt-oss-120b_2:
    image: vllm/vllm-openai:latest
    container_name: vllm-gpt-oss-120b_2
    environment:
      - VLLM_USE_FLASHINFER_MXFP4_MOE=1
      - NCCL_P2P_DISABLE=1
      - TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
    volumes:
      - /srv/cache/tiktoken:/root/.cache/tiktoken:ro
      - /srv/models/gpt-oss-120b:/models/gpt-oss-120b
    expose:
      - "8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 20
      start_period: 900s
    # add --max-num-batched-tokens 8192 to the command below if needed
    command: >
      /models/gpt-oss-120b
      --served-model-name gpt-oss-120b_2
      --quantization mxfp4
      --max-model-len 128000
      --gpu-memory-utilization 0.80
      --enable-chunked-prefill
      --enable-prefix-caching
      --max-num-seqs 128

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
    command: >
      --config /app/config.yaml
      --port 4000
      --num_workers 4
    depends_on:
      vllm-gpt-oss-120b:
        condition: service_healthy
      vllm-gpt-oss-120b_2:
        condition: service_healthy
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
The served model name on the second replica is deliberately gpt-oss-120b_2 (not gpt-oss-120b), because LiteLLM's upstream model field needs to disambiguate them even though the public-facing name is the same.
LiteLLM config
model_list:
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://vllm-gpt-oss-120b:8000/v1
      api_key: "EMPTY"
      timeout: 600
      stream_timeout: 60
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b_2
      api_base: http://vllm-gpt-oss-120b_2:8000/v1
      api_key: "EMPTY"
      timeout: 600
      stream_timeout: 60

router_settings:
  routing_strategy: "simple-shuffle"  # best under heavy load; tried "least-busy" and others, they did not perform as well
  cooldown_time: 5  # bring a cooled-down vLLM instance back quickly; failures are often just vLLM-side rate limiting, so a long cooldown is not warranted
  enable_priority_queue: true
  redis_host: "litellm-redis"
  redis_port: 6379

litellm_settings:
  cache: false
  max_parallel_requests: 196
  request_timeout: 600
  num_retries: 20
  allowed_fails: 200
  drop_params: true  # apparently for Claude Code compatibility, not tested
Two model entries with the same model_name is how you get LiteLLM to load balance across them; it does this natively, with no extra configuration needed.
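From the client side the two replicas are invisible: you call the proxy with the shared public model name and LiteLLM picks a replica. A minimal stdlib-only sketch of what such a request looks like (the virtual key is a placeholder); it builds the request without sending it:

```python
import json
import urllib.request

def chat_request(prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat request
    to the LiteLLM proxy from the compose file above."""
    payload = {
        "model": "gpt-oss-120b",  # the shared public name; LiteLLM shuffles replicas
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "http://localhost:4000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": "Bearer sk-your-litellm-virtual-key",  # placeholder
            "Content-Type": "application/json",
        },
    )

req = chat_request("Say hi in one word.")
print(req.full_url)
```

In practice the OpenAI Python package with `base_url="http://localhost:4000/v1"` does the same thing with less boilerplate.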
Numbers after ~6 days uptime
| Metric | Value |
|---|---|
| Total tokens processed | 6.57B |
| Prompt tokens | 4.20B |
| Generation tokens | 2.36B |
| Input:output ratio | 1.78:1 |
| Total requests | 2.76M |
| Avg tokens per request | ~2,380 |
Throughput
| Metric | 1-min rate | 1-hour avg |
|---|---|---|
| Generation tok/s | 2,879 | 2,753 |
| Prompt tok/s | 24,782 | 21,472 |
| Combined tok/s | 27,661 | 24,225 |
Per-instance load split
| Instance | Prompt | Generation |
|---|---|---|
| GPU 0 | 2.10B | 1.18B |
| GPU 1 | 2.11B | 1.19B |
Latency under heavy load
This was captured at a moment with 173 running and 29 queued requests.
| Metric | p50 | p95 | p99 |
|---|---|---|---|
| TTFT | 17.8s | 37.8s | 39.6s |
| E2E | 41.3s | 175.3s | 750.7s |
| ITL | 35ms | 263ms | — |
| Queue wait | 18.7s | 29.4s | — |
TTFT is dominated by queue time (p50 queue 18.7s vs p50 TTFT 17.8s). Under lighter load TTFT is in the low seconds. The E2E p99 of 750s is one user generating 4k+ tokens off a 100k context, which is fine and expected. Still, one current issue is the ping-pong effect I detail below.
ITL p50 of 35ms means each individual stream sees ~28 tok/s when the box is full, which is probably fine for most interactive use.
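Both derived figures above follow directly from the tables; a quick sanity check:

```python
# Sanity-check the derived numbers from the tables above.
itl_p50_ms = 35
per_stream_toks = 1000 / itl_p50_ms    # tokens per second per stream
avg_tokens = 6.57e9 / 2.76e6           # total tokens / total requests
print(f"{per_stream_toks:.1f} tok/s per stream, ~{avg_tokens:.0f} tokens/request")
```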
Cost tracking
LiteLLM tracks "equivalent spend" against configured per-token rates. I set ours to GPT-OSS-120B pricing on Amazon Bedrock ($0.15/M in, $0.60/M out). Over the last 7 days the hypothetical spend is $1,909 USD. The H200s cost us about $25k each, so the server basically pays for itself after a year.
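The equivalent-spend arithmetic is just token counts times the configured rates. Applying the Bedrock rates to the ~6-day token totals above (note this is a slightly different window than the 7-day $1,909 figure, so the numbers will not match exactly):

```python
# Equivalent spend at the configured Bedrock-style rates for GPT-OSS-120B.
IN_RATE = 0.15 / 1e6    # $ per prompt token
OUT_RATE = 0.60 / 1e6   # $ per generated token

def spend(prompt_tokens: float, gen_tokens: float) -> float:
    return prompt_tokens * IN_RATE + gen_tokens * OUT_RATE

# Token totals from the ~6-day table above: 4.20B prompt, 2.36B generation.
print(f"${spend(4.20e9, 2.36e9):,.0f}")
```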
Stuff I am still unhappy with
When one vLLM replica returns too many errors in a window, LiteLLM cools it down. The other replica then takes the full load, starts erroring under the doubled pressure, and gets cooled down too. By then the first has come back, but now it gets the burst and starts throwing errors again. The whole proxy is then effectively at 50% capacity even though both GPUs are perfectly healthy. I have played with cooldown_time, allowed_fails, and num_retries but cannot find a setting that distributes the load well without this ping-pong effect.
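One mitigation that is independent of LiteLLM settings is making the batch jobs themselves back off with jitter instead of hammering whichever replica is still up. A generic client-side sketch (not our production code):

```python
import random
import time

def call_with_backoff(fn, max_tries: int = 6, base: float = 1.0, cap: float = 30.0):
    """Retry fn() with capped exponential backoff and full jitter,
    so retries from many concurrent jobs do not all land at once."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries, surface the error
            # Full jitter: sleep a random amount up to the capped window.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

With hundreds of jobs retrying, full jitter turns the synchronized retry spike that re-triggers the cooldown into a gradual ramp on the recovering replica.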
Happy to share the prometheus.yml, the Grafana dashboard JSON, or the metrics collection script if anyone wants them. Also very curious what others running similar scale setups are doing for admission control and retry handling, since that is where I feel most of my remaining headroom is.
Screenshots from Grafana below.
intellidumb@reddit
Would be curious if you swapped LiteLLM for this, which has been in my bookmark list for a while: https://github.com/maximhq/bifrost
SessionComplete2334@reddit (OP)
I also read in a blog post that it is apparently better than LiteLLM for larger-scale inference. Definitely want to try it out soon.
somerussianbear@reddit
Great stuff man! I’m not familiar with how vLLM handles prefix cache, mind to elaborate on how you can get it to work with this “little” memory and so many concurrent users?
SessionComplete2334@reddit (OP)
With 65 GB taken by the model, we have ~50 GB left for KV cache, which is room for several million tokens. Most requests are not that long.
This setup is not fault-tolerant, but a research server can have occasional downtime, so I am driving it at the edge on purpose.
4xi0m4@reddit
That prefix caching setup is clean. The Tiktoken cache directory trick behind a firewall is a solid workaround. For anyone running similar setups, enabling --enable-prefix-caching on vLLM with repeated system prompts is basically free performance if your prompts share common prefixes. The 128 seqs per instance limit also gives good headroom for concurrent users without starving the KV cache.
jzn21@reddit
Don’t be afraid of quants, they still can be quite smart. You should definitely give Gemma 4 31b a try. In my comprehensive tests, it is (much) smarter than OSS 120b in terms of data processing.
SessionComplete2334@reddit (OP)
Yeah, I am probably a bit too conservative here. How is it speed-wise as a dense model?
AFruitShopOwner@reddit
What are users actually using it for? Do you use a RAG system? What tools does it have access to? What front end do you use?
SessionComplete2334@reddit (OP)
A lot of structuring workflows. Structuring radiology reports (about 2M reports) and other clinical documents. Then also agentic workflows. Those are not projects I code myself, so I don't know the details of their stack.
Our main endpoint is the OpenAI compatible API. Personally I use the OpenAI python package a lot with it. Good support in guided generation and easy to use.
Structuring is implemented with langchain. For the agentic workflows I believe my PhD student built his own harness and tools.
For the user interface I vibe coded a few applications, and we also work with a company that provides a user interface with a secure backend (encrypted user database, good access roles, etc.)
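On the guided-generation point above: a structured-output request against an OpenAI-compatible vLLM endpoint can pin the model to a JSON schema. A sketch of the request body (the schema and field names are made up, and the exact parameter names for guided decoding depend on the vLLM version):

```python
import json

# Hypothetical schema for a structuring task; field names are invented.
schema = {
    "type": "object",
    "properties": {
        "finding": {"type": "string"},
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["finding", "severity"],
}

body = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Structure: small pleural effusion."}],
    # OpenAI-style structured output; recent vLLM maps this onto guided decoding.
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "report", "schema": schema},
    },
}
print(json.dumps(body)[:80])
```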
AFruitShopOwner@reddit
Thanks
streppelchen@reddit
thanks for sharing!
tremendous_turtle@reddit
Great write-up! Lots of good insight here for real production deployments. I hope the LiteLLM team sees this, very useful real world feedback around improving its load balancing features.
How was throughput on Qwen 3.5 122B-A10B compared to GPT OSS 120B? I’d expect it should be very fast on an H200, and I think would be a considerable upgrade in model capability.
draconisx4@reddit
That's a solid setup for handling that volume, kudos on getting it stable. In a hospital lab, make sure you're locking down governance around data privacy and model oversight to prevent any unintended leaks or biases creeping in.