Is the inference speed of the llama3.3 70B model on my setup too slow?

Posted by caetydid@reddit | LocalLLaMA

My setup is a Dell Precision T5820 with a Xeon W-2245 (8 cores), 160GB RAM, and (24+8)GB VRAM (RTX3090 + RTX4000). The RTX3090 is connected at PCIe x8 and the RTX4000 at PCIe x4.

When I run models smaller than 24GB, they fit entirely in the VRAM of my RTX3090, which gives great speeds of 30-50 t/s. With larger models, however, I don't seem to benefit at all from my second GPU with its 8GB of VRAM.

|llama3.3|size|token/s|load (rtx3090, rtx4000)|
|:-|:-|:-|:-|
|70b-instruct-q3\_K\_M|34GB|4.7|25%, 20%|
|70b-instruct-q3\_K\_S|30GB|6.7|35%, 30%|
|70b-instruct-q2\_K|26GB|12.9|55%, 45%|

Is this supposed to be the case? Are these cards too different to work together smoothly, or am I doing something wrong in my setup? I'd really like to understand this issue so I can run larger models such as the llama3.3 70B variants.

Thanks in advance!
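For reference, here is a rough back-of-the-envelope sketch of how the three model sizes compare to my VRAM. It is only an estimate under a loud assumption: it counts the model file size alone and ignores KV cache, context buffers, and runtime overhead, so the real headroom is smaller than what it prints. The function name `placement` is just something I made up for illustration.

```python
# Sizes in GB, taken from the table above; VRAM figures are from my setup.
VRAM_GB = {"rtx3090": 24, "rtx4000": 8}
MODELS_GB = {
    "70b-instruct-q3_K_M": 34,
    "70b-instruct-q3_K_S": 30,
    "70b-instruct-q2_K": 26,
}


def placement(model_gb, vram_gb=VRAM_GB):
    """Return (GB that cannot fit on the RTX3090, GB that cannot fit on both GPUs).

    Assumption: weights only. KV cache and runtime overhead are ignored,
    so these numbers are a lower bound on what spills over.
    """
    total_vram = sum(vram_gb.values())
    beyond_3090 = max(0, model_gb - vram_gb["rtx3090"])
    beyond_both = max(0, model_gb - total_vram)
    return beyond_3090, beyond_both


for name, size in MODELS_GB.items():
    beyond_3090, beyond_both = placement(size)
    print(f"{name}: {size}GB, {beyond_3090}GB beyond the RTX3090, "
          f"{beyond_both}GB beyond both GPUs")
```

Even on this optimistic estimate, every one of these models has to be split across both cards, and the q3\_K\_M variant does not fit into the combined 32GB at all, so part of it has to end up in system RAM.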