Unexpectedly low tokens/s on my V100 32GB GPU setup.
Posted by abmateen@reddit | LocalLLaMA | 10 comments
I am running a hobbyist setup for local LLMs on a somewhat old server: a Dell PowerEdge R730 with 64GB DDR4 (2x32GB, 2133MHz). Recently I got hold of a V100 32GB, the original PCIe version. I am doing proper passthrough with the VFIO drivers in a Proxmox VM, so there is no driver overhead or conflict between host and guest.
The issue is that I am getting unexpectedly low tokens per second when I run smaller models, e.g. a Llama-3.1-3B Q4_K_M GGUF from unsloth. I get only 180 tok/s, even though the V100's device-to-device bandwidth reported by bandwidthTest is around 800 GB/s. Bandwidth utilisation stays around 35% when I run smaller models (3-7B), but when I run a 31B dense model I get 30 tok/s, which is more or less expected, with bandwidth utilisation of 82%.
I did all the optimisations like NUMA binding, the driver is the latest from Nvidia, and I am using llama.cpp with Flash Attention enabled and all layers on the GPU.
Is anybody using V100 / Tesla cards, or a local GPU setup, who has optimised this? I don't quite get the math: smaller models should give higher tokens/s given the GPU bandwidth.
What could the bottleneck be in this setup?
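A back-of-the-envelope check for the numbers above: at batch size 1, each decoded token has to stream essentially all model weights from VRAM, so the bandwidth-bound ceiling is roughly bandwidth divided by weight size. This is a sketch with assumed model sizes (a 3B Q4_K_M at ~2 GB, a 31B Q4 at ~18 GB are rough guesses), not measurements.

```python
# Rough bandwidth-bound decode ceiling: every generated token streams the
# model weights from VRAM once, so tok/s <= bandwidth / weight_bytes.
# Model sizes below are assumptions for illustration, not measurements.

def decode_ceiling_toks(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when decode is memory-bandwidth bound."""
    return bandwidth_gbs / weights_gb

bw = 800.0  # GB/s, the D2D figure from bandwidthTest in the post

small = decode_ceiling_toks(bw, 2.0)   # assumed 3B Q4_K_M weight size ~2 GB
large = decode_ceiling_toks(bw, 18.0)  # assumed 31B Q4 weight size ~18 GB

print(f"3B ceiling ~{small:.0f} tok/s; observed 180 is {180/small:.0%} of it")
print(f"31B ceiling ~{large:.0f} tok/s; observed 30 is {30/large:.0%} of it")
```

Under these assumptions the 31B run sits much closer to its ceiling than the 3B run, which is consistent with the reported 82% vs 35% bandwidth utilisation.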
SSOMGDSJD@reddit
Your numbers match mine; I run an SXM2 V100 32GB via an adapter board. The small models are compute bound: they don't move enough memory per token to saturate your HBM2 bandwidth. The dense model at 82% saturation and 30 tok/s is pretty close to optimal for these old dogs learning new tricks, and tells you your setup is in good shape.
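One way to reason about compute-bound vs bandwidth-bound is a toy roofline: compare per-token compute time (≈2 FLOPs per parameter) against per-token weight-streaming time. The hardware figures below are rough spec-sheet assumptions for a V100, and real batch-1 decode on small models also pays per-kernel launch and CPU overhead that this toy model ignores, which is often what actually keeps utilisation down.

```python
# Toy roofline for batch-1 decode: per token, do ~2 FLOPs per parameter
# and stream the quantized weights once; the larger time wins.
# TFLOPS and bandwidth figures are rough assumptions, not measurements.

def per_token_times_ms(params_b: float, weight_gb: float,
                       tflops: float, bw_gbs: float):
    """Return (compute_ms, memory_ms) per decoded token."""
    compute_s = 2.0 * params_b * 1e9 / (tflops * 1e12)
    memory_s = weight_gb / bw_gbs
    return compute_s * 1e3, memory_s * 1e3

# Assumed V100 figures: ~14 TFLOPS effective, 800 GB/s HBM2;
# assumed 3B Q4 weight size ~2 GB.
c_ms, m_ms = per_token_times_ms(params_b=3.0, weight_gb=2.0,
                                tflops=14.0, bw_gbs=800.0)
print(f"3B: compute {c_ms:.2f} ms/tok vs memory {m_ms:.2f} ms/tok")
```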
MelodicRecognition7@reddit
1 or 2 CPU setup? If 2, run llama.cpp on the CPU the GPU is physically connected to, to avoid NUMA overhead.
abmateen@reddit (OP)
Already did; NUMA is properly bound for both CPU and GPU.
MelodicRecognition7@reddit
Disable Hyper-Threading and enable Turbo Boost; high CPU frequency is crucial even for "GPU-only" inference, because the CPU is still doing something like 20% of the work.
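If the CPU really accounts for ~20% of per-token time as suggested, Amdahl's law bounds what faster clocks can buy. A quick sketch: the 20% figure is the commenter's estimate, and the speedup factors are hypothetical.

```python
# Amdahl's law: if fraction p of per-token time is CPU-side work,
# making the CPU s times faster gives overall speedup 1/((1-p) + p/s).
# p = 0.20 is the commenter's estimate; the s values are hypothetical.

def overall_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

for s in (1.2, 1.5, 2.0):
    print(f"CPU {s:.1f}x faster -> {overall_speedup(0.20, s):.2f}x overall")
```

So even doubling effective CPU speed would cap out around an 11% overall gain under this estimate; worth trying, but it won't close a 2x gap on its own.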
abmateen@reddit (OP)
Thanks I will try it.
Plastic-Stress-6468@reddit
Maybe the GPU is the bottleneck? Instead of being memory bandwidth bound, maybe you are compute bound?
abmateen@reddit (OP)
How to check? Any idea?
MelodicRecognition7@reddit
check nvidia-smi
abmateen@reddit (OP)
nvitop and nvidia-smi say SM% is 99, but GPU memory bandwidth utilisation is very low when I run a smaller model, hardly 35%.