llama-server is cooking! gemma3 27b, 100K context, vision on one 24GB GPU.

Posted by No-Statement-0001@reddit | LocalLLaMA

llama-server has improved a lot recently. With vision support, SWA (sliding window attention), and general performance improvements, I get 35 tok/sec on a 3090 and 11.8 tok/sec on a P40. Multi-GPU performance is up too: dual 3090s reach 38.6 tok/sec (600W power limit) and dual P40s hit 15.8 tok/sec (320W power limit)! Rejoice, P40 crew.

I've been writing more guides for the llama-swap wiki and was very surprised by the results, especially how usable the P40s still are!
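
To see why SWA plus Q8 KV quantization is what makes 100K context fit on 24GB, here's a back-of-envelope estimate of the KV cache size. This is a sketch assuming Gemma 3 27B's published shape (62 layers, 16 KV heads of dim 128, a 1024-token sliding window on 5 of every 6 layers); llama.cpp's actual allocation will differ a bit:

```python
# Rough KV cache size for Gemma 3 27B at 100K context.
# Model shape is assumed from the Gemma 3 report; treat the output
# as an estimate, not what llama.cpp allocates exactly.
CTX = 102_400                        # matches --ctx-size below
LAYERS, KV_HEADS, HEAD_DIM = 62, 16, 128
SWA_WINDOW = 1024                    # local layers only keep ~1K tokens
GLOBAL_LAYERS = LAYERS // 6          # every 6th layer attends globally
LOCAL_LAYERS = LAYERS - GLOBAL_LAYERS

def kv_gib(bytes_per_val, swa):
    # slots = (layer, token) pairs actually cached
    slots = GLOBAL_LAYERS * CTX + LOCAL_LAYERS * SWA_WINDOW if swa else LAYERS * CTX
    values = slots * 2 * KV_HEADS * HEAD_DIM     # K and V tensors
    return values * bytes_per_val / 2**30

print(f"f16, no SWA: {kv_gib(2.0, False):5.1f} GiB")   # ~48 GiB: hopeless
print(f"f16, SWA:    {kv_gib(2.0, True):5.1f} GiB")    # ~8 GiB: tight next to ~16.5 GiB of Q4 weights
print(f"q8_0, SWA:   {kv_gib(1.0625, True):5.1f} GiB") # ~4.4 GiB: fits on one 24GB card
```

(q8_0 stores 32 values in 34 bytes, hence the 1.0625 bytes per value.)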

llama-swap config ([source wiki page](https://github.com/mostlygeek/llama-swap/wiki/gemma3-27b-100k-context)):

```yaml
macros:
  "server-latest":
    /path/to/llama-server/llama-server-latest
    --host 127.0.0.1 --port ${PORT}
    --flash-attn -ngl 999 -ngld 999
    --no-mmap

  # quantize KV cache to Q8, increases context but
  # has a small effect on perplexity
  # https://github.com/ggml-org/llama.cpp/pull/7412#issuecomment-2120427347
  "q8-kv": "--cache-type-k q8_0 --cache-type-v q8_0"

models:
  # fits on a single 24GB GPU w/ 100K context
  # requires Q8 KV quantization
  "gemma":
    env:
      # 3090 - 35 tok/sec
      - "CUDA_VISIBLE_DEVICES=GPU-6f0"

      # P40 - 11.8 tok/sec
      #- "CUDA_VISIBLE_DEVICES=GPU-eb1"
    cmd: |
      ${server-latest}
      ${q8-kv}
      --ctx-size 102400
      -ngl 99
      --model /path/to/models/google_gemma-3-27b-it-Q4_K_L.gguf
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf
      --temp 1.0
      --repeat-penalty 1.0
      --min-p 0.01
      --top-k 64
      --top-p 0.95

  # Requires 30GB VRAM
  #  - Dual 3090s, 38.6 tok/sec
  #  - Dual P40s, 15.8 tok/sec
  "gemma-full":
    env:
      # 3090s
      - "CUDA_VISIBLE_DEVICES=GPU-6f0,GPU-f10"

      # P40s
      # - "CUDA_VISIBLE_DEVICES=GPU-eb1,GPU-ea4"
    cmd: |
      ${server-latest}
      --ctx-size 102400
      -ngl 99
      --model /path/to/models/google_gemma-3-27b-it-Q4_K_L.gguf
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf
      --temp 1.0
      --repeat-penalty 1.0
      --min-p 0.01
      --top-k 64
      --top-p 0.95
      # uncomment if using P40s
      # -sm row
```
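
With llama-swap running in front, requests go to its OpenAI-compatible endpoint and the `model` field selects which config entry to load. A minimal sketch of a vision request, assuming llama-swap is listening on 127.0.0.1:8080 (port and image path are placeholders for your setup):

```python
# Minimal vision request through llama-swap's OpenAI-compatible proxy.
import base64, json, urllib.request

with open("photo.jpg", "rb") as f:          # any local image
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma",   # must match a model name in the config above
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The first request to "gemma" triggers llama-swap to start llama-server with that config; switching the `model` field to "gemma-full" swaps to the dual-GPU setup.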