llama-server is cooking! gemma3 27b, 100K context, vision on one 24GB GPU.
Posted by No-Statement-0001@reddit | LocalLLaMA | 54 comments
llama-server has really improved a lot recently. With vision support, SWA (sliding window attention) and performance improvements, I've got 35 tok/sec on a 3090. A P40 gets 11.8 tok/sec. Multi-GPU performance has improved too: dual 3090s go up to 38.6 tok/sec (600W power limit) and dual P40s get 15.8 tok/sec (320W power max)! Rejoice, P40 crew.
I've been writing more guides for the llama-swap wiki and was very surprised by the results, especially how usable the P40s still are!
llama-swap config ([source wiki page](https://github.com/mostlygeek/llama-swap/wiki/gemma3-27b-100k-context)):
macros:
  "server-latest": |
    /path/to/llama-server/llama-server-latest
    --host 127.0.0.1 --port ${PORT}
    --flash-attn -ngl 999 -ngld 999
    --no-mmap

  # quantize KV cache to Q8, increases context but
  # has a small effect on perplexity
  # https://github.com/ggml-org/llama.cpp/pull/7412#issuecomment-2120427347
  "q8-kv": "--cache-type-k q8_0 --cache-type-v q8_0"

models:
  # fits on a single 24GB GPU w/ 100K context
  # requires Q8 KV quantization
  "gemma":
    env:
      # 3090 - 35 tok/sec
      - "CUDA_VISIBLE_DEVICES=GPU-6f0"
      # P40 - 11.8 tok/sec
      #- "CUDA_VISIBLE_DEVICES=GPU-eb1"
    cmd: |
      ${server-latest}
      ${q8-kv}
      --ctx-size 102400
      -ngl 99
      --model /path/to/models/google_gemma-3-27b-it-Q4_K_L.gguf
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf
      --temp 1.0
      --repeat-penalty 1.0
      --min-p 0.01
      --top-k 64
      --top-p 0.95

  # Requires 30GB VRAM
  # - Dual 3090s, 38.6 tok/sec
  # - Dual P40s, 15.8 tok/sec
  "gemma-full":
    env:
      # 3090s
      - "CUDA_VISIBLE_DEVICES=GPU-6f0,GPU-f10"
      # P40s
      # - "CUDA_VISIBLE_DEVICES=GPU-eb1,GPU-ea4"
    cmd: |
      ${server-latest}
      --ctx-size 102400
      -ngl 99
      --model /path/to/models/google_gemma-3-27b-it-Q4_K_L.gguf
      --mmproj /path/to/models/gemma-mmproj-model-f16-27B.gguf
      --temp 1.0
      --repeat-penalty 1.0
      --min-p 0.01
      --top-k 64
      --top-p 0.95
      # uncomment if using P40s
      # -sm row
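Once llama-swap loads this config, any OpenAI-compatible client can hit it and it will start (or swap to) the matching llama-server instance based on the model name. A minimal sketch, assuming llama-swap is listening on localhost:8080 (adjust host/port to your setup):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma",
    "messages": [{"role": "user", "content": "Describe this setup in one sentence."}]
  }'
```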
HabbleBabbleBubble@reddit
I'm sorry to be such a n00b here, but can someone please explain this to me? Is it possible to fit f16 precision Gemma 27B on 24GB with this, and how does that work? Why are we providing two different models on --model and --mmproj? Is the point of this only to get more context, and not to fit models of a higher quant onto the same card? I'm having trouble working with Danish on a 24GB L4 and would like to run a higher quant without it being incredibly slow :D
Electronic-Site8038@reddit
well did you fit that 27b on your 24gb vram?
rorowhat@reddit
what's the advantage of using llama-server as opposed to llama-cli?
InterstellarReddit@reddit
Any ideas on how I can process videos through ollama ?
Scotty_tha_boi007@reddit
Can open web UI do it?
InterstellarReddit@reddit
Actually I need to be able to do it from a command line
extopico@reddit
For command line just use llama.cpp directly. Why use a weird abstraction layer like ollama?
Scotty_tha_boi007@reddit
Based opinion
InterstellarReddit@reddit
Thank you
coding_workflow@reddit
100K context with 27B? What quant is this? I have trouble doing the math, as I see 100K even with Q4 needing far more than the 24GB, while OP shows Q8?
What kind of magic is going on here?
ttkciar@reddit
I think SWA reduces the memory overhead of long contexts.
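A rough back-of-envelope, treating the Gemma 3 27B architecture numbers as approximate (~62 layers, 16 KV heads, head dim 128, 1024-token sliding window on all but roughly every sixth layer):

```
% KV-cache sizing sketch (all figures approximate)
\text{KV bytes per token} \approx 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per element}
```

At f16 with full attention that works out to roughly 0.5 MB per token, i.e. around 50 GB for a 100K context. With SWA only the handful of global layers keep the full window (the rest cache ~1024 tokens), and Q8 roughly halves the bytes again, so the KV cache lands in the ~5 GB range, which is why it fits next to the Q4 weights on a 24GB card.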
coding_workflow@reddit
But that could have a huge impact on the output. It means the model's output is no longer taking notice of the long specs I have added.
I'm not sure this is very effective, and this will likely fail needle-in-a-haystack tests often!
Mushoz@reddit
SWA is lossless compared to how the old version of llama.cpp was doing it. So you will not receive any penalties by using this.
coding_workflow@reddit
How is it lossless?
The attention sink phenomenon Xiao et al. (2023), where LLMs allocate excessive attention to initial tokens in sequences, has emerged as a significant challenge for SWA inference in Transformer architectures. Previous work has made two key observations regarding this phenomenon. First, the causal attention mechanism in Transformers is inherently non-permutation invariant, with positional information emerging implicitly through token embedding variance after softmax normalization Chi et al. (2023). Second, studies have demonstrated that removing normalization from the attention mechanism can effectively eliminate the attention sink effect Gu et al. (2024).
https://arxiv.org/html/2502.18845v1
There will be loss. If you reduce the input/context, it will lose focus.
Mushoz@reddit
SWA obviously has its drawbacks compared to other forms of attention. But what I meant with my comment, is that enabling SWA for Gemma under llama.cpp will have identical quality as with it disabled. Enabling or disabling it doesn't change Gemma's architecture, meaning it will have the exact same attention mechanism and therefore performance. But enabling SWA will reduce the memory footprint.
LostHisDog@reddit
I feel so out of the loop asking this but... how do I run this? I mostly poke around in LM Studio, played with Ollama a bit, but this script looks like model setup instructions for llama.cpp or is it something else entirely?
Anyone got any tips for kick starting me a bit? I've been playing on the image generation side of AI news and developments too much and would like to at least be able to stay somewhat current with LLMs... plus a decent model with 100k on my 3090 would be lovely for some writing adventures I've backburnered.
Thanks!
LostHisDog@reddit
NVM mostly... I keep forgetting that ChatGPT is like 10x smarter than a year or so ago and can actually just explain stuff like this... think I have enough to get started.
SporksInjected@reddit
Just because ChatGPT may not know: llama.cpp now has releases of their binaries. Building them yourself used to be a lot of the challenge, but now it's just download and run the binary with whatever flags, like you see above.
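A rough sketch of what that looks like (model paths are placeholders; grab the release build that matches your OS and GPU from https://github.com/ggml-org/llama.cpp/releases, extract it, then run the server binary with flags like the config above):

```
./llama-server \
  --model /path/to/google_gemma-3-27b-it-Q4_K_L.gguf \
  --mmproj /path/to/gemma-mmproj-model-f16-27B.gguf \
  -ngl 99 --ctx-size 102400 --flash-attn \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 127.0.0.1 --port 8080
```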
LostHisDog@reddit
Yeah, ChatGPT wanted me to build it out but there were very obviously binaries now so that helped. It's kind of like having a super techie guy sitting next to you helping all the way... but, you know, the guy has a bit of the alzheimer's and sometimes is going to be like "Now insert your 5 1/4 floppy disk and make sure your CRT is turned on."
SporksInjected@reddit
“I am your Pentium based digital assistant”
extopico@reddit
Yes, current LLMs are very familiar with llama.cpp, but for the latest features you'll need to consult the GitHub issues.
iwinux@reddit
Is it possible to load models larger than the 24GB VRAM by offloading something to RAM?
IllSkin@reddit
This example uses -ngl 999, which means put at most 999 layers on the GPU. Gemma3 27b has 63 layers (I think), so that means all of them.
If you want to load a huge model, you can pass something like -ngl 20 to load just 20 layers to VRAM and keep the rest in RAM. You will need to experiment a bit to find the best offload value for each model and quant.
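For instance (model path is a placeholder):

```
# offload only 20 layers to the GPU, keep the rest in system RAM
./llama-server --model /path/to/some-huge-model.gguf -ngl 20 --ctx-size 8192 --flash-attn
```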
presidentbidden@reddit
can this be used in production ?
No-Statement-0001@reddit (OP)
Depends on what you mean by "production". :)
sharpfork@reddit
Prod or prod-prod? Are you done or done-done?
Environmental-Metal9@reddit
People underestimate how much smoke and mirrors go into hiding that a lot of deployment pipelines are exactly like this: the high school assignment naming convention, except in practice rather than in the naming. Even worse are the staging envs that are actually prod, because if they break then CI breaks and nobody can ship until not-prod-prod-prod is restored.
Only_Situation_4713@reddit
Engineering practices are insanely bad in 80% of companies and 90% of teams. I've worked with contractors that write tests to return true always and the tech lead doesn't care.
SkyFeistyLlama8@reddit
That's funny as hell. Expect it to become even worse when always-true tests become part of LLM training data.
Environmental-Metal9@reddit
Don’t forget the
# This is just a placeholder. In a real application you would implement this function
lazy comments we already get…
SkyFeistyLlama8@reddit
When you see that in corporate code, it's time to scream and walk away.
Environmental-Metal9@reddit
My favorite is working on legacy code and finding 10yo comments like “wtf does this even do? Gotta research the library next sprint” and no indication of the library anywhere in code. On one hand it’s good they came back and did something over the years but now this archeological code fossil is left behind to confuse explorers for the duration of that codebase
SporksInjected@reddit
Yep I’m in one of those teams
Anka098@reddit
Final-final-prod-prod-2
jazir5@reddit
Final-final-prod-prod2-lastversion
extopico@reddit
Well, it’s more production ready than LLM tools already in production.
Nomski88@reddit
How? My 5090 crashes because it runs out of memory if I try 100k context. Running the Q4 model on LM Studio....
extopico@reddit
Well, use llama-server instead and its built-in GUI on localhost:8080.
No-Statement-0001@reddit (OP)
My guess is that LM Studio doesn't have SWA from llama.cpp ([commit](https://github.com/ggml-org/llama.cpp/pull/13194)) shipped yet.
LA_rent_Aficionado@reddit
It looks like it's because he's quantizing the KV cache, which should reduce context VRAM IIRC, already on top of a Q4 quant.
Scotty_tha_boi007@reddit
Have you played with any of the AMD Instinct cards? I got an MI60 and I have been using it with llama-swap, trying different configs for Qwen 3. I haven't run Gemma 3 on it yet so I can't compare, but I feel like it's pretty usable for a local setup. I ordered two MI50s too; they should be in soon!
shapic@reddit
Tested some SWA. Without it I could fit a 40K Q8 cache; with it, 100K. While it looks awesome, past 40K context the model becomes barely usable, recalculating the cache every time and then timing out without any output.
bjivanovich@reddit
Is it possible in lm studio?
shapic@reddit
No swa yet
ggerganov@reddit
The unnecessary recalculation issue with SWA models will be fixed with https://github.com/ggml-org/llama.cpp/pull/13833
dampflokfreund@reddit
Great news and thanks a lot. Fantastic work here, yet again!
PaceZealousideal6091@reddit
Bro, thanks a lot for all your contributions. Without llama.cpp being what it is now, local LLMs wouldn't be where they are! A sincere thanks, man. Keep up the awesome work!
No-Statement-0001@reddit (OP)
“enable swa speculative decoding” … does this mean i can use a draft model that also has a swa kv?
also thanks for making all this stuff possible. 🙏🏼
ggerganov@reddit
Yes, for example Gemma 12b (target) + Gemma 1b (draft).
Thanks for llama-swap as well!
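For reference, the pairing would look roughly like this on the command line (paths are placeholders; double-check the exact option names with llama-server --help on your build):

```
# target model + smaller draft model for speculative decoding
./llama-server \
  --model /path/to/gemma-3-12b-it-Q4_K_M.gguf \
  --model-draft /path/to/gemma-3-1b-it-Q4_K_M.gguf \
  -ngl 99 -ngld 99 \
  --ctx-size 32768 --flash-attn
```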
skatardude10@reddit
REALLY loving the new iSWA support. Went from chugging along at like 3 tokens per second when Gemma3 27B first came out at like 32K context to 13 tokens per second now with iSWA, some tensor overrides and 130K context (Q8 KV cache) on a 3090.
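For anyone wondering what tensor overrides look like: it's the --override-tensor / -ot flag, which maps tensors matching a regex to a backend buffer. A hypothetical example (the block range and regex are illustrative; tune them to your VRAM):

```
# keep everything on the GPU except the FFN tensors of blocks 40-61,
# which get pushed to CPU/system RAM
-ot "blk\.([45][0-9]|6[01])\.ffn_.*=CPU"
```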
FullstackSensei@reddit
Wasn't aware of those macros! Really nice to shorten the commands with all the common parameters!
No-Statement-0001@reddit (OP)
I just landed the PR last night.
FullstackSensei@reddit
Haven't had much time to update llama-swap in the last few weeks. Still need to edit my configurations to make use of groups :(
TheTerrasque@reddit
Awesome! I had a feature request for something like this that got closed, glad to see it's in now!