Unexpectedly low tokens/s on my V100 32GB GPU setup

Posted by abmateen@reddit | LocalLLaMA

I am running a hobbyist local-LLM setup on a somewhat old Dell PowerEdge R730 server with 64 GB of DDR4 (2x32 GB at 2133 MHz). I recently got hold of a V100 32GB, the original PCIe version. I pass it through to a Proxmox VM with the vfio drivers, so there is no driver overhead or conflict between host and guest.

The issue is that I am getting unexpectedly low tokens per second with smaller models. With Llama-3.1-3B Q4_K_M GGUF from unsloth I get only 180 tok/s, even though the V100's device-to-device bandwidth reported by the bandwidthTest sample is around 800 GB/s. Bandwidth utilisation stays around 35% with smaller 3-7B models, but with a 31B dense model I get 30 tok/s, which is roughly what I'd expect, at 82% bandwidth utilisation.

I have applied all the usual optimisations (NUMA bindings etc.), the driver is the latest from NVIDIA, and I am running llama.cpp with Flash Attention enabled and all layers on the GPU.

Is anybody here running V100 / Tesla cards or a similar local GPU setup who has tuned this? I am not quite getting the math behind it: smaller models should give higher tokens/s given the GPU's bandwidth.
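Here is the rough math I'm working from: during decode each generated token has to read all the weights once, so tokens/s should be capped at roughly bandwidth divided by model size. A quick sketch of that estimate below; the weight sizes are my ballpark guesses for Q4_K_M files, not measured numbers:

```python
# Rough decode-speed roofline: each token reads all weights once,
# so tok/s <= bandwidth / model_size for a dense model.
def roofline_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if decoding is purely bandwidth-limited."""
    return bandwidth_gb_s / model_size_gb

V100_BW = 800.0  # GB/s, as reported by bandwidthTest (D2D)

# Hypothetical Q4_K_M weight sizes (roughly 0.6 bytes per parameter):
small_3b = 2.0    # GB, ~3B params
dense_31b = 18.0  # GB, ~31B params

print(f"3B ceiling:  {roofline_tok_s(V100_BW, small_3b):.0f} tok/s")   # ~400
print(f"31B ceiling: {roofline_tok_s(V100_BW, dense_31b):.0f} tok/s")  # ~44

# My measured 180 tok/s on the 3B model is well under half its ceiling,
# while 30 tok/s on the 31B model is much closer to its ceiling --
# matching the 35% vs 82% bandwidth utilisation I see.
```

By this arithmetic the small model is the one furthest from its bandwidth ceiling, which is exactly the part I don't understand: something per-token (kernel launches, CPU-side sampling?) seems to dominate before the memory bus is saturated.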

What could the bottleneck in this setup be?