Unexpectedly low tokens/s on my V100 32GB GPU setup.
Posted by abmateen@reddit | LocalLLaMA | 10 comments
I am running a hobbyist setup for local LLMs on a somewhat old server: a Dell PowerEdge R730 with 64GB DDR4 (2x32GB, 2133MHz). Recently I got hold of a V100 32GB, the original PCIe version. I am doing proper passthrough with the VFIO drivers in a Proxmox VM, so there is no driver overhead or conflict between host and guest.
The issue is that I am getting unexpectedly low tokens per second when I run smaller models, e.g. a Llama-3.1-3B Q4_K_M GGUF from unsloth. I get only 180 tok/s, even though the V100's device-to-device bandwidth reported by bandwidthTest is around 800 GB/s. Bandwidth utilisation stays around 35% when I run smaller models (3-7B), but when I run a 31B dense model I get 30 tok/s, which is more or less expected, with bandwidth utilisation of 82%.
I did all the optimisations like NUMA binding, the driver is the latest from Nvidia, and I am using llama.cpp with Flash Attention enabled and all layers on the GPU.
Is anybody using V100 / Tesla cards, or a local GPU setup, who has optimised this? I don't quite get the math: smaller models should give higher tokens/s given the GPU bandwidth.
What could the bottleneck be in this setup?
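A back-of-the-envelope check for the numbers above: at batch size 1, each decoded token has to stream essentially all model weights from VRAM, so the bandwidth-bound ceiling is roughly bandwidth divided by weight size. This is a sketch with assumed model sizes (a 3B Q4_K_M at ~2 GB, a 31B Q4 at ~18 GB are rough guesses), not measurements.

```python
# Rough bandwidth-bound decode ceiling: every generated token streams the
# model weights from VRAM once, so tok/s <= bandwidth / weight_bytes.
# Model sizes below are assumptions for illustration, not measurements.

def decode_ceiling_toks(bandwidth_gbs: float, weights_gb: float) -> float:
    """Upper bound on tokens/s when decode is memory-bandwidth bound."""
    return bandwidth_gbs / weights_gb

bw = 800.0  # GB/s, the D2D figure from bandwidthTest in the post

small = decode_ceiling_toks(bw, 2.0)   # assumed 3B Q4_K_M weight size ~2 GB
large = decode_ceiling_toks(bw, 18.0)  # assumed 31B Q4 weight size ~18 GB

print(f"3B ceiling ~{small:.0f} tok/s; observed 180 is {180/small:.0%} of it")
print(f"31B ceiling ~{large:.0f} tok/s; observed 30 is {30/large:.0%} of it")
```

Under these assumptions the 31B run sits much closer to its ceiling than the 3B run, which is consistent with the reported 82% vs 35% bandwidth utilisation.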
SSOMGDSJD@reddit
Your numbers match mine; I run an SXM2 V100 32GB via an adapter board. The small models are compute bound: they don't move enough memory per token to saturate your HBM2 bandwidth. The dense model at 82% saturation and 30 tok/s is pretty close to optimal for these old dogs learning new tricks, and tells you your setup is in good shape.
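One way to reason about compute-bound vs bandwidth-bound is a toy roofline: compare per-token compute time (≈2 FLOPs per parameter) against per-token weight-streaming time. The hardware figures below are rough spec-sheet assumptions for a V100, and real batch-1 decode on small models also pays per-kernel launch and CPU overhead that this toy model ignores, which is often what actually keeps utilisation down.

```python
# Toy roofline for batch-1 decode: per token, do ~2 FLOPs per parameter
# and stream the quantized weights once; the larger time wins.
# TFLOPS and bandwidth figures are rough assumptions, not measurements.

def per_token_times_ms(params_b: float, weight_gb: float,
                       tflops: float, bw_gbs: float):
    """Return (compute_ms, memory_ms) per decoded token."""
    compute_s = 2.0 * params_b * 1e9 / (tflops * 1e12)
    memory_s = weight_gb / bw_gbs
    return compute_s * 1e3, memory_s * 1e3

# Assumed V100 figures: ~14 TFLOPS effective, 800 GB/s HBM2;
# assumed 3B Q4 weight size ~2 GB.
c_ms, m_ms = per_token_times_ms(params_b=3.0, weight_gb=2.0,
                                tflops=14.0, bw_gbs=800.0)
print(f"3B: compute {c_ms:.2f} ms/tok vs memory {m_ms:.2f} ms/tok")
```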
MelodicRecognition7@reddit
1 or 2 CPU setup? If 2, run llama.cpp on the CPU the GPU is physically connected to, to avoid NUMA overhead.
abmateen@reddit (OP)
Already did; NUMA is properly bound for both CPU and GPU.
MelodicRecognition7@reddit
Disable Hyper-Threading and enable Turbo Boost; high CPU frequency is crucial even for "GPU-only" inference, because the CPU is still doing something like 20% of the work.
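If the CPU really accounts for ~20% of per-token time as suggested, Amdahl's law bounds what faster clocks can buy. A quick sketch: the 20% figure is the commenter's estimate, and the speedup factors are hypothetical.

```python
# Amdahl's law: if fraction p of per-token time is CPU-side work,
# making the CPU s times faster gives overall speedup 1/((1-p) + p/s).
# p = 0.20 is the commenter's estimate; the s values are hypothetical.

def overall_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

for s in (1.2, 1.5, 2.0):
    print(f"CPU {s:.1f}x faster -> {overall_speedup(0.20, s):.2f}x overall")
```

So even doubling effective CPU speed would cap out around an 11% overall gain under this estimate; worth trying, but it won't close a 2x gap on its own.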
abmateen@reddit (OP)
Thanks I will try it.
Plastic-Stress-6468@reddit
Maybe the GPU is the bottleneck? Instead of being memory bandwidth bound, maybe you are compute bound?
abmateen@reddit (OP)
How to check? Any idea?
MelodicRecognition7@reddit
check nvidia-smi
abmateen@reddit (OP)
nvitop and nvidia-smi say SM% is 99, but GPU memory bandwidth utilisation is very low when I run a smaller model, hardly 35%.