Serving 1B+ tokens/day locally in my research lab

Posted by SessionComplete2334@reddit | LocalLLaMA | View on Reddit | 13 comments

I lead a research lab at a university hospital and spent the last few weeks configuring our internal LLM server. I put a lot of thought into the server config, software stack, and model. Now I am at a point where I am happy: it actually holds up under load, and we are pushing more than 1B tokens/day (roughly 2/3 ingestion, 1/3 decode) through 2x H200 serving GPT-OSS-120B. I thought this could be interesting for others looking to do something similar, and I am also hoping to get some feedback, so I am sharing my software stack below as well as some considerations for why I chose GPT-OSS-120B.

Disclaimer: I used Claude to help write this.

Hardware

Our server has two H200 GPUs; apart from that it is not very beefy: 124 GB RAM, a 16-core CPU, and 512 GB of disk space. Enough to hold the models, Docker images, and logs.

Model

I tried a bunch of models a couple of weeks ago: Qwen 3 models, GLM-Air, and GPT-OSS. GPT-OSS-120B seemed to be the best for us.

Things I tried that were worse on H200:

mxfp4 on H200 just seems extremely well optimized right now. Still, I am always looking for models with better performance. Currently eyeing Mistral Small 4 (vision, 120B as well), Qwen 3.5, and Gemma 4. However, Gemma being dense makes me skeptical that it can match throughput, and I do not trust the smaller MoE models to be as smart as a 120B model; same for the Qwen models. I also cannot take GPT-OSS offline anymore to test other models properly because demand is too high, but as soon as we scale hardware I would like to try more.

Architecture

I run everything in Docker with one big docker-compose file (see below):

Client → LiteLLM proxy (4000) → vLLM GPU 0 (8000)
                              → vLLM GPU 1 (8000)
                ↓
          PostgreSQL       (keys, usage, spend)
          Prometheus       (scrapes vLLM /metrics every 5s)
          Grafana          (dashboards)
          MkDocs           (user docs)

I picked one instance per GPU over tensor parallelism across both because at this model size with mxfp4 the model fits comfortably on a single H200, and two independent replicas give better throughput with no NCCL communication overhead. KV cache is also not a bottleneck for us. With simple-shuffle routing the load split is almost perfect (2.10B vs 2.11B prompt tokens after ~6 days of uptime). Other routing strategies did not work as well (LiteLLM also recommends simple-shuffle in its docs).
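Conceptually, simple-shuffle just picks a replica uniformly at random per request, with no load awareness at all. A minimal sketch (not LiteLLM's actual implementation) of why that converges to an almost perfect split over millions of requests:

```python
import random
from collections import Counter

# Hypothetical replica list mirroring the two vLLM instances above.
replicas = ["vllm-gpt-oss-120b:8000", "vllm-gpt-oss-120b_2:8000"]

def simple_shuffle(replicas):
    """Pick one replica uniformly at random, independent of current load."""
    return random.choice(replicas)

# Over many requests the split converges to ~50/50, which matches the
# near-even 2.10B vs 2.11B prompt-token split observed after ~6 days.
counts = Counter(simple_shuffle(replicas) for _ in range(100_000))
print(counts)
```

With homogeneous replicas and a high request rate, this stateless strategy avoids the feedback loops that load-aware strategies can develop.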

vLLM

--quantization mxfp4
--max-model-len 128000
--gpu-memory-utilization 0.80
--max-num-batched-tokens 8192
--enable-chunked-prefill
--enable-prefix-caching
--max-num-seqs 128

Plus environment:

VLLM_USE_FLASHINFER_MXFP4_MOE=1
NCCL_P2P_DISABLE=1

For details on this:

VLLM_USE_FLASHINFER_MXFP4_MOE=1 is needed for this model on the H200.

NCCL_P2P_DISABLE=1 is needed even though each container only sees one GPU. If I remember right, without it NCCL throws cryptic errors.

TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken Normally the container would download the tiktoken files, but behind our firewall it cannot reach the web, so I provide the tokenizer manually.

--enable-prefix-caching We send a lot of near-identical system prompts (templated structuring tasks, agent scaffolds). The cache hit rate is high, so TTFT drops noticeably with this.
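Prefix caching reuses KV-cache blocks for the longest shared token prefix, so requests that start with the same templated system prompt skip recomputing it. A toy illustration of why our templated prompts hit (the real matching is block-wise over token IDs, not character-wise; the sequences below are made-up stand-ins):

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical "token" sequences: a long templated system prompt
# followed by a short user-specific tail.
system = list(range(4000))           # stands in for a ~4k-token system prompt
req_a = system + [9001, 9002]
req_b = system + [7001, 7002, 7003]

# Nearly the whole prompt is shared, so the second request only has to
# prefill its short tail -> much lower TTFT.
print(shared_prefix_len(req_a, req_b))  # 4000
```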

--max-num-seqs 128 per instance, so 256 concurrent sequences across the box. KV cache is rarely the bottleneck for us (Grafana usually shows 25-30% usage, with occasional spikes toward 90% under bursts); the actual ceiling is decode throughput. Raising max-num-seqs would just slow each individual stream down without buying real headroom. I tried up to 512 parallel requests: decoding speed does not exceed 3,000 tokens/s, the individual responses just get slower.

--gpu-memory-utilization 0.80 and --max-num-batched-tokens 8192 (the latter not used currently, but I will swap it in if needed) are both there for logprobs requests. After some mysterious crashes of the vLLM servers, I found that if a client requests top-k logprobs on a long context, vLLM materializes a chunk of memory that scales fast, leading to OOM on the GPU and crashing the server. Capping batched tokens at 8k limits the burst size, since logprobs are then computed for at most 8,192 tokens at a time, and leaving 20% VRAM headroom absorbs the spikes without hurting steady-state throughput. As KV cache is not a limiting factor for us, I keep gpu-memory-utilization at 0.80 permanently.
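A rough back-of-envelope for why these logprobs bursts are so dangerous. The numbers below are assumptions, not measurements: a vocabulary around 200k entries and 2-byte (fp16/bf16) logits materialized as a [tokens, vocab] tensor.

```python
# Rough estimate of the memory needed to materialize logits for a prompt.
# vocab_size ~200k and 2 bytes per logit are assumptions for illustration.
def logits_gib(num_tokens, vocab_size=200_000, bytes_per_logit=2):
    """GiB for a [num_tokens, vocab_size] logits tensor."""
    return num_tokens * vocab_size * bytes_per_logit / 2**30

# Uncapped: a full 128k-token prompt prefilled in one chunk.
print(f"{logits_gib(128_000):.1f} GiB")  # tens of GiB -> OOM risk

# With --max-num-batched-tokens 8192 each chunk stays bounded.
print(f"{logits_gib(8_192):.1f} GiB")
```

Even if the real allocation path differs, the scaling (linear in chunk size times vocab) explains why capping the batched tokens plus 20% VRAM headroom absorbs the spikes.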

Healthcheck start_period: 900s. Loading a 120B MoE takes 10-15 minutes from cold. Anything shorter and LiteLLM spams its logs about unhealthy upstreams.

docker-compose (vLLM + LiteLLM)

Stripped down to just vLLM and LiteLLM. Postgres, Prometheus, and Grafana are left out; they are standard.

services:
  vllm-gpt-oss-120b:
    image: vllm/vllm-openai:latest
    container_name: vllm-gpt-oss-120b
    environment:
      - VLLM_USE_FLASHINFER_MXFP4_MOE=1
      - NCCL_P2P_DISABLE=1
      - TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
    volumes:
      - /srv/cache/tiktoken:/root/.cache/tiktoken:ro
      - /srv/models/gpt-oss-120b:/models/gpt-oss-120b
    expose:
      - "8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 20
      start_period: 900s
    command: >
      /models/gpt-oss-120b
      --served-model-name gpt-oss-120b
      --quantization mxfp4
      --max-model-len 128000
      --gpu-memory-utilization 0.80
      --enable-chunked-prefill
      --enable-prefix-caching
      --max-num-seqs 128
#      --max-num-batched-tokens 8192

  vllm-gpt-oss-120b_2:
    image: vllm/vllm-openai:latest
    container_name: vllm-gpt-oss-120b_2
    environment:
      - VLLM_USE_FLASHINFER_MXFP4_MOE=1
      - NCCL_P2P_DISABLE=1
      - TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken
    volumes:
      - /srv/cache/tiktoken:/root/.cache/tiktoken:ro
      - /srv/models/gpt-oss-120b:/models/gpt-oss-120b
    expose:
      - "8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 20
      start_period: 900s
    command: >
      /models/gpt-oss-120b
      --served-model-name gpt-oss-120b_2
      --quantization mxfp4
      --max-model-len 128000
      --gpu-memory-utilization 0.80
      --enable-chunked-prefill
      --enable-prefix-caching
      --max-num-seqs 128
#      --max-num-batched-tokens 8192

  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    ports:
      - "4000:4000"
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    environment:
      - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
      - DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@postgres:5432/litellm
    command: >
      --config /app/config.yaml
      --port 4000
      --num_workers 4
    depends_on:
      vllm-gpt-oss-120b:
        condition: service_healthy
      vllm-gpt-oss-120b_2:
        condition: service_healthy
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

The served model name on the second replica is deliberately gpt-oss-120b_2 (not gpt-oss-120b), because LiteLLM's upstream model field needs to disambiguate them even though the public-facing name is the same.

LiteLLM config

model_list:
  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b
      api_base: http://vllm-gpt-oss-120b:8000/v1
      api_key: "EMPTY"
      timeout: 600
      stream_timeout: 60

  - model_name: gpt-oss-120b
    litellm_params:
      model: openai/gpt-oss-120b_2
      api_base: http://vllm-gpt-oss-120b_2:8000/v1
      api_key: "EMPTY"
      timeout: 600
      stream_timeout: 60

router_settings:
  routing_strategy: "simple-shuffle"  # best under heavy load, tried "least-busy" and others, did not perform well.
  cooldown_time: 5  # brings the vLLM instance back quickly if too many requests fail; failures can be vLLM-side rate limits, so a long real cooldown is not needed
  enable_priority_queue: true
  redis_host: "litellm-redis"
  redis_port: 6379

litellm_settings:
  cache: false
  max_parallel_requests: 196
  request_timeout: 600
  num_retries: 20
  allowed_fails: 200
  drop_params: true   # apparently for Claude Code compatibility, not tested.

Two model entries with the same model_name are how you get LiteLLM to load-balance across them. It does this natively; no extra configuration needed.
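The grouping can be illustrated by collapsing the model_list above by model_name: both deployments surface under one public model, and any request for that name is routed to one member of the group.

```python
from collections import defaultdict

# The two deployments from litellm_config.yaml, as plain dicts.
model_list = [
    {"model_name": "gpt-oss-120b",
     "litellm_params": {"model": "openai/gpt-oss-120b",
                        "api_base": "http://vllm-gpt-oss-120b:8000/v1"}},
    {"model_name": "gpt-oss-120b",
     "litellm_params": {"model": "openai/gpt-oss-120b_2",
                        "api_base": "http://vllm-gpt-oss-120b_2:8000/v1"}},
]

# LiteLLM groups deployments by model_name; the routing strategy then
# picks one api_base from this group per request.
groups = defaultdict(list)
for entry in model_list:
    groups[entry["model_name"]].append(entry["litellm_params"]["api_base"])

print(groups["gpt-oss-120b"])  # both api_bases under one public name
```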

Numbers after ~6 days uptime

Metric                    Value
Total tokens processed    6.57B
Prompt tokens             4.20B
Generation tokens         2.36B
Input:output ratio        1.78:1
Total requests            2.76M
Avg tokens per request    ~2,380

Throughput

                  1-min rate    1-hour avg
Generation tok/s  2,879         2,753
Prompt tok/s      24,782        21,472
Combined tok/s    27,661        24,225

Per-instance load split

Instance    Prompt    Generation
GPU 0       2.10B     1.18B
GPU 1       2.11B     1.19B

Latency under heavy load

This was captured at a moment with 173 running and 29 queued requests.

            p50      p95      p99
TTFT        17.8s    37.8s    39.6s
E2E         41.3s    175.3s   750.7s
ITL         35ms     263ms
Queue wait  18.7s    29.4s

TTFT is dominated by queue time (p50 queue 18.7s vs p50 TTFT 17.8s). Under lighter load TTFT is in the low seconds. The E2E p99 of 750s is one user generating 4k+ tokens off a 100k context, which is fine and expected. Still, one open issue is the ping-pong effect I detail below.

ITL p50 of 35ms means each individual stream sees ~28 tok/s when the box is full, which is probably fine for most interactive use.
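The per-stream number follows directly from the inter-token latency: tokens/s per stream is just 1 / ITL. A quick consistency check (the "concurrent streams" figure is a rough implication that ignores prefill time):

```python
# Per-stream decode speed implied by inter-token latency: tok/s = 1 / ITL.
itl_p50 = 0.035                    # 35 ms p50 from the table above
per_stream_tps = 1 / itl_p50
print(round(per_stream_tps, 1))    # ~28.6 tok/s per stream

# Rough implication (ignoring prefill): aggregate decode throughput
# divided by per-stream speed gives the number of streams decoding at once.
decoding_streams = 2_879 * itl_p50
print(round(decoding_streams))     # ~101
```

That ~101 decoding streams lines up roughly with the 173 running requests once you account for requests still in prefill.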

Cost tracking

LiteLLM tracks "equivalent spend" against configured per-token rates. I set ours to GPT-OSS-120B pricing on Amazon Bedrock ($0.15/M input, $0.60/M output). Over the last 7 days the hypothetical spend is $1,909 USD. The H200s cost us about $25k each, so the server basically pays for itself within a year.
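The payback arithmetic, as a sketch. This only counts the GPUs (not power, cooling, or the rest of the server) and assumes the last 7 days are representative:

```python
# Back-of-envelope payback time from the equivalent-spend number.
weekly_equiv_spend = 1_909         # USD, from LiteLLM cost tracking
hardware_cost = 2 * 25_000         # two H200s at ~$25k each

weeks_to_payback = hardware_cost / weekly_equiv_spend
print(round(weeks_to_payback, 1))  # ~26 weeks
```

On GPU cost alone that is about half a year; with the rest of the server and operating costs folded in, "within a year" is the conservative version.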

Stuff I am still unhappy with

When one vLLM replica returns too many errors in a window, LiteLLM cools it down. The other replica then takes the full load, starts erroring under the doubled pressure, and gets cooled down too. In the meantime the first one comes back, but now it gets the bursts and starts throwing errors again. The whole proxy ends up at effectively 50% capacity even though both GPUs are perfectly healthy. I have played with cooldown_time, allowed_fails, and num_retries but cannot find a setting that distributes the load well without this ping-pong effect.

Happy to share the prometheus.yml, the Grafana dashboard JSON, or the metrics collection script if anyone wants them. Also very curious what others running similar scale setups are doing for admission control and retry handling, since that is where I feel most of my remaining headroom is.

Screenshots from Grafana below.