Is the inference speed of the llama3.3 70B model on my setup too slow?
Posted by caetydid@reddit | LocalLLaMA | 15 comments
My setup is a Dell Precision T5820 with a Xeon W-2245 (8 cores), 160GB RAM, and (24+8)GB VRAM (RTX3090 + RTX4000). The RTX3090 is connected at PCIe x8 and the RTX4000 at PCIe x4.
When I run models smaller than 24GB, they fit in the VRAM of my RTX3090, which yields great speeds of 30-50 t/s. It seems, however, that I cannot benefit at all from my second GPU with 8GB of VRAM.
|llama3.3 model|size|tokens/s|GPU load (RTX3090, RTX4000)|
|:-|:-|:-|:-|
|70b-instruct-q3\_K\_M|34GB|4.7|25%,20%|
|70b-instruct-q3\_K\_S|30GB|6.7|35%,30%|
|70b-instruct-q2\_K|26GB|12.9|55%,45%|
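
As a rough sanity check on the table, here is a minimal Python sketch that compares each model size with the combined (24+8)GB of VRAM. It assumes the nominal VRAM is fully usable and ignores the KV cache, context buffers, and runtime overhead, so the real headroom is smaller than what it prints. The names `VRAM_GB`, `MODELS`, `vram_headroom`, and `report` are made up for illustration.

```python
# Rough check: how does each quantized model compare with the combined VRAM?
# Sizes and tokens/s are taken from the table; VRAM figures are nominal.
# Assumption: all VRAM is usable (no KV cache, context, or driver overhead).

VRAM_GB = {"RTX3090": 24, "RTX4000": 8}

MODELS = [
    # (tag, size in GB, measured tokens/s)
    ("70b-instruct-q3_K_M", 34, 4.7),
    ("70b-instruct-q3_K_S", 30, 6.7),
    ("70b-instruct-q2_K", 26, 12.9),
]


def vram_headroom(size_gb, vram_gb=VRAM_GB):
    """Return combined VRAM minus model size in GB (negative means spillover)."""
    return sum(vram_gb.values()) - size_gb


def report(models=MODELS):
    """Return (tag, size, tokens/s, headroom) for every model in the table."""
    return [(tag, size, tps, vram_headroom(size)) for tag, size, tps in models]


if __name__ == "__main__":
    for tag, size, tps, headroom in report():
        status = "fits nominally" if headroom >= 0 else "exceeds combined VRAM"
        print(f"{tag}: {size} GB, {tps} t/s, headroom {headroom:+d} GB ({status})")
```

Under these assumptions only the q3\_K\_M variant exceeds the combined VRAM outright, while q3\_K\_S and q2\_K have just 2GB and 6GB of nominal headroom before any overhead is counted.
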
It seems I hardly benefit from the second GPU (RTX4000). Is this expected? Are these cards too different to work together smoothly, or am I doing something wrong in my setup?
I'd really like to understand this issue in order to run some larger models such as the llama3.3 70B variants.
Thanks in advance!