PSA: llama.cpp patch doubled my max context size
Posted by No-Statement-0001@reddit | LocalLLaMA | 12 comments
A few days ago llama.cpp quietly landed a patch with this one-liner:
Row split mode (`-sm row`): KV and other non-matrix weights are split among the available GPUs in the same way as split by layer mode.
I'm running 3x P40s (72GB VRAM total). Previously, with 70B models at Q4 quantization, my maximum context size was 60K tokens. After this change I can fit 120K tokens. Huge win for P40 users who use row split mode.
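For reference, my launch command looks roughly like this (the model path and exact context value are placeholders, not my exact invocation):
# -sm row splits the weights by rows; with this patch the KV cache is now spread across the GPUs too
./llama-server -m /models/70b-q4_k_m.gguf -ngl 99 -sm row -fa -c 120000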
Here's the VRAM usage before the patch (60K context). Note the first P40 is using 23GB of its 24GB VRAM while the other two are at 14GB/24GB.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 On | 00000000:05:00.0 Off | Off |
| N/A 33C P0 52W / 160W | 23983MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla P40 On | 00000000:06:00.0 Off | Off |
| N/A 35C P0 51W / 160W | 14109MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla P40 On | 00000000:09:00.0 Off | Off |
| N/A 30C P0 50W / 160W | 14509MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:0A:00.0 Off | N/A |
| 40% 26C P8 24W / 275W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Here's the VRAM usage after the patch with 128K context. VRAM is much more evenly distributed. Another benefit is that GPU utilization is more balanced across the GPUs; previously GPU0 would run about 10% to 15% higher. Inference speed does not seem to have changed for better or worse with this change.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 On | 00000000:05:00.0 Off | Off |
| N/A 33C P0 53W / 160W | 21339MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla P40 On | 00000000:06:00.0 Off | Off |
| N/A 36C P0 53W / 160W | 21585MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla P40 On | 00000000:09:00.0 Off | Off |
| N/A 31C P0 53W / 160W | 21721MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:0A:00.0 Off | N/A |
| 40% 26C P8 24W / 275W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
_hypochonder_@reddit
I hope ROCm performance doesn't suffer from this update.
I'll see in future koboldcpp-rocm builds.
https://github.com/ggerganov/llama.cpp/pull/10026#issuecomment-2443855758
firearms_wtf@reddit
PSA: You can use nvidia-pstated to keep your P40s in P8 (8-10W) while loaded and idle.
No-Statement-0001@reddit (OP)
I use this tool, it's great. The author was also quite responsive in fixing a bug I found. Definitely a must-use for the P40 crew. I run it with CUDA_VISIBLE_DEVICES so it only manages the P40s and not the 3090.
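Something like this (assuming the P40s enumerate as CUDA devices 0-2, matching the nvidia-smi output above):
# only expose the P40s to nvidia-pstated, leaving the 3090 unmanaged
CUDA_VISIBLE_DEVICES=0,1,2 nvidia-pstated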
harrro@reddit
Same. Highly recommend it as well for P40.
The CUDA_VISIBLE_DEVICES trick is no longer needed with the latest versions of pstated -- there's now a `-i <gpu_id>` flag that lets you select the device directly.
segmond@reddit
Maybe it matters based on the quantization? For llama3.1-70B Q8 on 4x 3090s with "-sm row" I can load 64000 context before I run out of VRAM; without it, 61000, so a difference of 3000 across 4 GPUs. I just built my branch about an hour ago. Can you show the entire set of CLI flags you are passing to llama?
I'm using -ngl, -fa, -ts and -sm.
No-Statement-0001@reddit (OP)
I'm running qwen-72B Q4. The key seems to be using an 8-bit quantized KV cache instead of the default 16-bit. Try it with these flags: `--cache-type-k q8_0 --cache-type-v q8_0`. I haven't noticed any difference, though I'm not even sure how I'd measure it. :)
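Rough napkin math on why that helps (based on llama.cpp's q8_0 block format, which stores 32 8-bit values plus a 2-byte scale): the cache drops from 2 bytes per element at f16 to about 1.06 bytes per element, so the same context takes roughly half the VRAM. The flags just get appended to the normal launch line, something like this (model filename is a placeholder):
# quantized K/V cache; as far as I know, quantizing the V cache also requires -fa
./llama-server -m qwen2-72b-q4_k_m.gguf -ngl 99 -sm row -fa -c 120000 --cache-type-k q8_0 --cache-type-v q8_0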
LinkSea8324@reddit
Ran a few tests yesterday and I get better performance with layer split mode than with row split, weird
Steuern_Runter@reddit
Does this only apply to the CUDA backend?
Judtoff@reddit
3x P40 gang here, this is great news. It was a real pain choosing a model that fit across all 3 GPUs while keeping all the context on the main GPU. I ended up just using large models with small context, since a really long context even on small models didn't really benefit from my multiple cards (with -sm row; without it, sure, but that was super slow).
For reference: ~/llama.cpp/llama-server -m ~/llama.cpp/models/magnum-v2-123b.Q4_K_S.gguf -ngl 89 --split-mode row --tensor-split 32,31,28 --flash-attn -mg 2 -c 7000 --port 8080 --host 192.168.50.126 --log-format text >> ~/llama.cpp/llama-server.log
It'll be interesting to see how this patch affects things. I've got a P4 in the same server doing nothing; maybe I'll be able to offload some context to it...
-my_dude@reddit
Oh shit I gotta check this out on my 2 P40s
a_beautiful_rhind@reddit
So now it's the best of both worlds? Model split by row and better handling of the cache.
DinoAmino@reddit
Cool! Instead of increasing my context I'll be moving up to q8 \m/