Fastest local inference options for 2 x 3090 with NVLink

Posted by IngeniousIdiocy@reddit | LocalLLaMA

I’m currently getting about 15 tps with llama.cpp, but both cards sit at only ~50% utilization during inference on deepseek-r1:70b (q4_k_m).
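For reference, here’s roughly the kind of setup I mean, sketched with llama-cpp-python (the model path and context size are placeholders, adjust for your own quant). My guess is the ~50% comes from the default layer split, where the two cards work on different layers one after the other; row split is supposed to keep both busy, especially with NVLink:

```python
# Sketch only: llama-cpp-python equivalent of my setup, with row split enabled.
# Model path and context size are placeholders; adjust for your build/quant.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,                             # offload all layers to the GPUs
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_ROW,   # split each layer across both cards
    tensor_split=[0.5, 0.5],                     # even VRAM split between the two 3090s
    n_ctx=8192,
)

out = llm("Explain NVLink in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```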

I’m wondering if anyone has experience with vLLM, ExLlama, or TensorRT-LLM, and what kind of throughput they’ve seen with Llama 3.3 70B-class models at 4-bit?
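In case anyone wants to compare like for like, this is roughly the vLLM setup I’d try for tensor parallelism across the two cards. It’s a sketch under my own assumptions: the AWQ checkpoint path is a placeholder, and the memory/context settings are just what I’d start with on 24 GB cards:

```python
# Sketch of a vLLM run: tensor parallelism across both 3090s with a
# 4-bit (AWQ) Llama 3.3 70B checkpoint. The model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/llama-3.3-70b-instruct-awq",  # placeholder: any 4-bit AWQ build
    quantization="awq",
    tensor_parallel_size=2,        # shard every layer across both GPUs
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain NVLink in one sentence."], params)
print(outputs[0].outputs[0].text)
```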