PSA: llama.cpp patch doubled my max context size

Posted by No-Statement-0001@reddit | LocalLLaMA

A few days ago llama.cpp quietly landed a patch with this one-liner in the notes:

Row split mode (-sm row): KV and other non-matrix weights are split among the available GPUs in the same way as split by layer mode.

I'm running 3x P40s (72GB VRAM total). Previously, with 70B models at Q4 quantization, my maximum context size was 60K tokens. After this change I can fit 120K tokens. A huge win for P40 users who run row split mode.
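If you want to try it, the relevant flag is -sm row (--split-mode row). A launch command along these lines should do it; the model path, context size, port, and the CUDA_VISIBLE_DEVICES mask for the three P40s are placeholders you'd adjust for your own setup:

# restrict llama.cpp to the three P40s and split KV/rows across them
# (model path, context size, and port below are just placeholders)
CUDA_VISIBLE_DEVICES=0,1,2 ./llama-server \
  -m /models/70b-q4_k_m.gguf \
  -ngl 99 \
  -sm row \
  -c 120000 \
  --host 0.0.0.0 --port 8080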

Here's the VRAM usage before the patch (60K context). Note the first P40 is using 23GB of its 24GB while the other two sit around 14GB/24GB.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      On  |   00000000:05:00.0 Off |                  Off |
| N/A   33C    P0             52W /  160W |   23983MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P40                      On  |   00000000:06:00.0 Off |                  Off |
| N/A   35C    P0             51W /  160W |   14109MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla P40                      On  |   00000000:09:00.0 Off |                  Off |
| N/A   30C    P0             50W /  160W |   14509MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:0A:00.0 Off |                  N/A |
| 40%   26C    P8             24W /  275W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Here's the VRAM usage after the patch with 128K context. Memory is distributed much more evenly. Another benefit is that GPU utilization is better balanced across the GPUs; previously GPU0 would run about 10% to 15% higher than the others. Inference speed does not seem to have changed for better or worse.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      On  |   00000000:05:00.0 Off |                  Off |
| N/A   33C    P0             53W /  160W |   21339MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P40                      On  |   00000000:06:00.0 Off |                  Off |
| N/A   36C    P0             53W /  160W |   21585MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla P40                      On  |   00000000:09:00.0 Off |                  Off |
| N/A   31C    P0             53W /  160W |   21721MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        On  |   00000000:0A:00.0 Off |                  N/A |
| 40%   26C    P8             24W /  275W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+