PSA: llama.cpp patch doubled my max context size
Posted by No-Statement-0001@reddit | LocalLLaMA | 12 comments
A few days ago llama.cpp quietly landed a patch with this one-liner:
Row split mode (`-sm row`): KV and other non-matrix weights are split among the available GPUs in the same way as split by layer mode.
I'm running 3x P40s (72GB VRAM total). Previously, with 70B models at Q4 quantization, my maximum context size was 60K tokens. After this change I can fit 120K tokens. Huge win for P40 users who use row split mode.
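For reference, my launch command looks roughly like this (the model path and exact context value are placeholders, not my exact invocation):
# -sm row splits the weights by rows; with this patch the KV cache is now spread across the GPUs too
./llama-server -m /models/70b-q4_k_m.gguf -ngl 99 -sm row -fa -c 120000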
Here's the VRAM usage before the patch (60K context). Note the first P40 is using 23GB of its 24GB VRAM while the other two are at 14GB/24GB.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 On | 00000000:05:00.0 Off | Off |
| N/A 33C P0 52W / 160W | 23983MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla P40 On | 00000000:06:00.0 Off | Off |
| N/A 35C P0 51W / 160W | 14109MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla P40 On | 00000000:09:00.0 Off | Off |
| N/A 30C P0 50W / 160W | 14509MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:0A:00.0 Off | N/A |
| 40% 26C P8 24W / 275W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Here's the VRAM usage after the patch with 128K context. VRAM is much more evenly distributed. Another benefit is that GPU utilization is more balanced across the GPUs; previously GPU0 would run about 10% to 15% higher. Inference speed does not seem to have changed for better or worse with this change.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 On | 00000000:05:00.0 Off | Off |
| N/A 33C P0 53W / 160W | 21339MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla P40 On | 00000000:06:00.0 Off | Off |
| N/A 36C P0 53W / 160W | 21585MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla P40 On | 00000000:09:00.0 Off | Off |
| N/A 31C P0 53W / 160W | 21721MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:0A:00.0 Off | N/A |
| 40% 26C P8 24W / 275W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
_hypochonder_@reddit
I hope ROCm performance doesn't suffer from this update.
I'll see in future koboldcpp-rocm builds.
https://github.com/ggerganov/llama.cpp/pull/10026#issuecomment-2443855758
firearms_wtf@reddit
PSA: You can use nvidia-pstated to keep your P40s in P8 (8-10W) while loaded and idle.
No-Statement-0001@reddit (OP)
I use this tool, it's great. The author was also quite responsive in fixing a bug I found. Definitely a must-use for the P40 crew. I run it with CUDA_VISIBLE_DEVICES so it only manages the P40s and not the 3090.
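Something like this (assuming the P40s enumerate as CUDA devices 0-2, matching the nvidia-smi output above):
# only expose the P40s to nvidia-pstated, leaving the 3090 unmanaged
CUDA_VISIBLE_DEVICES=0,1,2 nvidia-pstated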
harrro@reddit
Same. Highly recommend it as well for P40.
The CUDA_VISIBLE_DEVICES trick is no longer needed with the latest versions of pstated -- there's now a `-i <gpu_id>` flag that lets you select the device directly.
segmond@reddit
Maybe it matters based on the quantization? For llama3.1-70B Q8 on 4x 3090s with "-sm row" I can load 64000 context before I run out of VRAM; without it, 61000, so a difference of 3000 across 4 GPUs. I just built my branch about an hour ago. Can you show the entire set of CLI flags you are passing to llama?
I'm using -ngl, -fa, -ts and -sm.
No-Statement-0001@reddit (OP)
I'm running qwen-72B Q4. The key seems to be using an 8-bit quantized KV cache instead of the default 16-bit. Try it with these flags: `--cache-type-k q8_0 --cache-type-v q8_0`. I haven't noticed any difference, though I'm not even sure how I'd measure it. :)
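Rough napkin math on why that helps (based on llama.cpp's q8_0 block format, which stores 32 8-bit values plus a 2-byte scale): the cache drops from 2 bytes per element at f16 to about 1.06 bytes per element, so the same context takes roughly half the VRAM. The flags just get appended to the normal launch line, something like this (model filename is a placeholder):
# quantized K/V cache; as far as I know, quantizing the V cache also requires -fa
./llama-server -m qwen2-72b-q4_k_m.gguf -ngl 99 -sm row -fa -c 120000 --cache-type-k q8_0 --cache-type-v q8_0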
LinkSea8324@reddit
Ran a few tests yesterday and I get better performance with layer split mode than with row split, weird
Steuern_Runter@reddit
Does this only apply to the CUDA backend?
Judtoff@reddit
3x P40 gang here, this is great news. It was a real pain choosing a model that fit across all 3 GPUs while keeping all the context on the main GPU. I ended up just using large models with small context, since a really long context even on small models didn't really benefit from my multiple cards (with -sm row; without it, sure, but that was super slow).
For reference: ~/llama.cpp/llama-server -m ~/llama.cpp/models/magnum-v2-123b.Q4_K_S.gguf -ngl 89 --split-mode row --tensor-split 32,31,28 --flash-attn -mg 2 -c 7000 --port 8080 --host 192.168.50.126 --log-format text >> ~/llama.cpp/llama-server.log
It'll be interesting to see how this patch affects things. I've got a P4 in the same server doing nothing; maybe I'll be able to offload some context to it...
-my_dude@reddit
Oh shit I gotta check this out on my 2 P40s
a_beautiful_rhind@reddit
So now it's the best of both worlds? Model split by row and better handling of the cache.
DinoAmino@reddit
Cool! Instead of increasing my context I'll be moving up to q8 \m/