Is the inference speed of the llama3.3 70B model on my setup too slow?

Posted by caetydid@reddit | LocalLLaMA | View on Reddit | 15 comments

My setup is a Dell Precision T5820 with a Xeon W-2245 (8 cores), 160GB RAM, and (24+8)GB VRAM (RTX3090 + RTX4000). The RTX3090 is connected at x8 PCIe and the RTX4000 at x4 PCIe. Models smaller than 24GB fit in the VRAM of my RTX3090, which yields great speeds of 30-50 t/s. It seems, however, that I cannot benefit at all from my second GPU with 8GB of VRAM.

|llama3.3|size|token/s|load (RTX3090, RTX4000)|
|:-|:-|:-|:-|
|70b-instruct-q3\_K\_M|34GB|4.7|25%, 20%|
|70b-instruct-q3\_K\_S|30GB|6.7|35%, 30%|
|70b-instruct-q2\_K|26GB|12.9|55%, 45%|

As it seems, I hardly benefit from the second GPU (RTX4000). Is this supposed to be the case? Are these cards too different to work together smoothly, or am I doing something wrong in my setup? I'd really like to understand this issue so that I can run some larger models such as the llama3.3 70B variants. Thanks in advance!
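One rough way to read the numbers in the post is to compare each quantized model's size with the VRAM that is realistically usable across both cards. The sketch below is only back-of-the-envelope arithmetic: the model sizes and card capacities come from the post, while the 1.5 GB per-card reserve (CUDA context, KV cache, buffers) is an assumed placeholder, not a measured value.

```python
# Back-of-the-envelope check: how much of each quantized model could spill out of VRAM.
# Model sizes and VRAM capacities come from the post; RESERVE_GB is an assumption.

VRAM_GB = {"RTX3090": 24.0, "RTX4000": 8.0}
RESERVE_GB = 1.5  # assumed per-card overhead (CUDA context, KV cache, buffers)

MODELS_GB = {
    "70b-instruct-q3_K_M": 34.0,
    "70b-instruct-q3_K_S": 30.0,
    "70b-instruct-q2_K": 26.0,
}


def usable_vram_gb(vram_gb, reserve_gb):
    """Total VRAM left for weights after subtracting the per-card reserve."""
    return sum(max(size - reserve_gb, 0.0) for size in vram_gb.values())


def spill_gb(model_gb, usable_gb):
    """Gigabytes of weights that would not fit in VRAM (0 if everything fits)."""
    return max(model_gb - usable_gb, 0.0)


usable = usable_vram_gb(VRAM_GB, RESERVE_GB)
for name, size in MODELS_GB.items():
    spilled = spill_gb(size, usable)
    print(f"{name}: {size:.0f} GB, ~{spilled:.1f} GB outside VRAM "
          f"({100 * spilled / size:.0f}% of weights)")
```

Under these assumptions only the q2\_K variant fits entirely in VRAM, which would be consistent with it being the fastest of the three. Any layers that end up in system RAM are typically run on the CPU, and those can dominate the time per token.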


15 Comments


koalfied-coder@reddit

Could you please post your vLLM command or similar? At those speeds I'm not sure you are using the GPUs at all. Please try this:

vllm serve "casperhansen/llama-3.3-70b-instruct-awq" --gpu-memory-utilization 0.95 --max-model-len 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser llama3\_json

Not the fastest, but it should reach at least 15-20 t/s, I imagine.


caetydid@reddit (OP)

This only works when I specify CUDA\_VISIBLE\_DEVICES=1, because otherwise it breaks with an error about an unsupported CUDA version. I guess that means only my RTX3090 is being used. Instead of uvicorn launching on localhost:8000, I get a Ray instance on [127.0.0.1:8265](http://127.0.0.1:8265) ... I'm not sure how to invoke a completion on that one in order to check performance.


koalfied-coder@reddit

My apologies, I use that command for letta. Try this:

vllm serve "casperhansen/llama-3.3-70b-instruct-awq" --gpu-memory-utilization 0.95 --max-model-len 8000 --tensor-parallel-size 2 --enable-auto-tool-choice --tool-call-parser llama3\_json
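Once a vLLM server is actually up, throughput can be checked by sending a single completion request and dividing the generated tokens by the elapsed time. The following is a minimal sketch, not a benchmark tool. It assumes the server listens on http://localhost:8000 (vLLM's default port) and that the response carries a `usage.completion_tokens` field, as OpenAI-compatible servers normally return. The prompt and the helper names are made up for illustration.

```python
# Minimal throughput probe for an OpenAI-compatible completions endpoint.
# Assumptions: the server listens on http://localhost:8000 and the response
# contains a "usage" object with "completion_tokens".
import json
import time
import urllib.request

MODEL = "casperhansen/llama-3.3-70b-instruct-awq"  # model name from the vllm serve command


def build_payload(model, prompt, max_tokens=128):
    """Request body for POST /v1/completions."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0}


def tokens_per_second(completion_tokens, elapsed_s):
    """Generated tokens divided by wall-clock seconds (includes prompt processing)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url="http://localhost:8000", model=MODEL, prompt="Explain PCIe lanes briefly."):
    """Send one completion request and return the observed tokens per second."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(result["usage"]["completion_tokens"], elapsed)


# Example (requires a running server):
# print(f"{measure():.1f} tokens/s")
```

Because the timer wraps the whole request, the figure includes prompt processing and HTTP overhead, so it will read slightly lower than the pure generation speed reported by most UIs.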


caetydid@reddit (OP)

Thanks for the hint. Actually I use ollama, and I did the inference testing with open webui, hence I don't know the exact command line. But anyway, vllm is something else that I have not tried yet. If I can use it with open webui and it is known to be faster than ollama, I might consider it.


koalfied-coder@reddit

I was certain a w2245 can run 2 cards at x8. Any particular reason yours aren't?


caetydid@reddit (OP)

Yeah, the motherboard has a stupid division of the PCIe slots. I can't cram the cards next to each other or the fans will be blocked!


koalfied-coder@reddit

Ahh I gotcha


koalfied-coder@reddit

[https://forums.developer.nvidia.com/t/can-i-use-a-quadro-rtx-4000-rtx-3060-in-one-system/226309](https://forums.developer.nvidia.com/t/can-i-use-a-quadro-rtx-4000-rtx-3060-in-one-system/226309)


Back2Game_8888@reddit

I second vLLM. I've used both ollama and vllm, and vllm always gives me much better performance. But I am not sure if Open WebUI supports vLLM; last time I checked, it seemed to be built around Ollama only.


caetydid@reddit (OP)

Supposedly it can be added to open webui via the OpenAI API interface as long as there is still an ollama instance running


FrederikSchack@reddit

Can you see the memory usage on the cards when running these? The KV cache also takes up some memory, and its size grows with the context length. If part of the model or cache is stored in system memory, then the PCIe bus will easily become a bottleneck.
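Per-card memory and utilization can be read while a model is generating by querying `nvidia-smi` in CSV mode. This is a small sketch that assumes `nvidia-smi` is on the PATH. The parsing helper is an illustrative name, and any sample rows fed to it are hypothetical values rather than measurements from this thread.

```python
# Read per-GPU memory use and utilization via nvidia-smi's CSV query mode.
# Assumption: nvidia-smi is installed and on the PATH.
import subprocess

QUERY = "index,name,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Turn 'index, name, used, total, util' lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, name, used, total, util = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),    # memory is reported in MiB with 'nounits'
            "total_mib": int(total),
            "util_pct": int(util),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU rows."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(output)


# Example (requires an NVIDIA driver):
# for gpu in query_gpus():
#     print(f"GPU {gpu['index']} {gpu['name']}: "
#           f"{gpu['used_mib']}/{gpu['total_mib']} MiB, {gpu['util_pct']}% busy")
```

If both cards show nearly full memory but low utilization, that would point toward the GPUs waiting on something else, such as layers running on the CPU.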


caetydid@reddit (OP)

VRAM usage is quite high, even higher than what I'd anticipate given the model sizes. The GPU usage suggests there is some bottlenecking taking place, probably in the CPU, since it is maxed out at 1600%. Maybe the KV cache is large with these llama3.3 models? However, the cards also behave quite differently in speed. Let's say I run ollama3.1 3b on the RTX4000 vs. the RTX3090: the latter will be much faster!
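The KV cache question can be estimated with simple arithmetic: per token, the cache stores one key and one value vector per layer and per KV head. The sketch below assumes the Llama 3 70B architecture values as I understand them from the published model config (80 layers, 8 KV heads, head dimension 128) and a 16-bit cache. Treat both as assumptions, since runtimes can quantize the cache or allocate extra buffers.

```python
# Rough KV cache size estimate for a grouped-query-attention model.
# Assumptions: 80 layers, 8 KV heads, head_dim 128 (Llama 3 70B config as
# commonly published) and 2 bytes per cached value (fp16).


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes stored per token: keys and values (factor 2) for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes_per_token(**LLAMA_70B)
print(f"{per_token / 1024:.0f} KiB per token")

for context in (2048, 8000, 32768):
    gib = per_token * context / 2**30
    print(f"context {context:>6}: ~{gib:.2f} GiB of KV cache")
```

With these numbers an 8000-token context needs roughly 2.4 GiB, which is a noticeable share of an 8GB card and could explain VRAM usage above the raw model size.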
