Llama-3.3 and Qwen2.5 speed comparisons on a 4-GPU / 120GB VRAM system

Posted by __JockY__@reddit | LocalLLaMA

I did a couple of speed tests with Llama-3.3 70B Instruct, Qwen-2.5 72B Instruct, and Qwen2.5 Coder 32B Instruct where I asked each of them to "write a flappy bird game in Python that will run on a MacBook". For this test I didn't care about code quality or results; I just wanted the model to output approximately 1k tokens of code, and writing flappy bird is perfect for that. The only data I was really interested in comparing was raw prompt processing speed and inference/generation speed.

I figured some of the folks round here might be curious about the numbers, so here they are.

**Hardware setup**

* Supermicro M12SWA-TF motherboard
* AMD Ryzen Threadripper Pro 3945WX
* 128GB DDR4 RAM
* NVMe SSDs
* 1x EVGA RTX 3090 Ti 24GB
* 1x PNY RTX A6000 48GB
* 2x EVGA RTX 3090 FTW3 24GB
* EVGA 2000W PSU running on dedicated 240V/20A 60Hz (USA)
* All GPUs throttled at 250W

**Software setup**

* Ubuntu Linux
* tabbyAPI / exllamav2 (8bpw exl2 quants unless otherwise noted)
* tensor parallel enabled
* speculative decoding (draft mode) enabled
* context lengths for Llama and Qwen are empirically the ceiling of what I can fit in available VRAM (120GB)

**Llama-3.3 70B Instruct with 3B draft model**

* Draft Model: Llama-3.2-3B-Instruct-exl2_8.0bpw
* Main Model: Llama-3.3-70B-Instruct-exl2_8.0bpw
* Context Size: 108,544 tokens
* Cache Mode: FP16
* **Process**: 44.12 T/s
* **Generate**: 30.89 T/s

**Qwen-2.5 72B Instruct with 3B draft model**

* Draft Model: Qwen2.5-3B-Instruct-exl2_8.0bpw
* Main Model: Qwen2.5-72B-Instruct-exl2_8.0bpw
* Context Size: 128,512 tokens
* Cache Mode: FP16
* **Process**: 97.83 T/s
* **Generate**: 37.93 T/s

**Qwen-2.5 Coder 32B Instruct with 1.5B draft model**

* Draft Model: Qwen2.5-Coder-1.5B-Instruct-exl2_8bpw (6 head bits)
* Main Model: Qwen2.5-Coder-32B-Instruct-exl2_8bpw (6 head bits)
* Context Size: 32,768 tokens
* Cache Mode: FP16
* **Process**: 246.16 T/s
* **Generate**: 65.24 T/s

**Qwen-2.5 Coder 32B Instruct with 3B draft model**

* Draft Model: Qwen2.5-Coder-3B-Instruct-exl2_8bpw (6 head bits)
* Main Model: Qwen2.5-Coder-32B-Instruct-exl2_8bpw (6 head bits)
* Context Size: 32,768 tokens
* Cache Mode: FP16
* **Process**: 201.84 T/s
* **Generate**: 57.08 T/s

I find it interesting that Qwen 72B generated a whopping 7 tokens/sec faster than Llama 70B, despite each model using a 3B draft model, Qwen being 2B parameters larger, and Qwen having an extra 20k tokens of context. My guess is that the output of the smaller Qwen model is more closely matched to its larger counterpart than is the case for the Llama models, which would boost the speed of speculative decoding... but I'm just pulling guesses out of my butt. What do you think?
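
For anyone who wants to poke at that guess, here is a rough Python sketch. The first part just redoes the arithmetic on the generation speeds listed above. The second part uses the standard simplified model of speculative decoding, in which each drafted token is accepted independently with some probability, to show how the expected number of tokens per main-model pass grows with the acceptance rate. The acceptance rates and the draft length of 4 are made-up illustration values, not measurements from these runs, and the function names are hypothetical.

```python
# Generation speeds (T/s) copied from the results in the post.
generate_tps = {
    "Llama-3.3 70B + 3B draft": 30.89,
    "Qwen2.5 72B + 3B draft": 37.93,
    "Qwen2.5 Coder 32B + 1.5B draft": 65.24,
    "Qwen2.5 Coder 32B + 3B draft": 57.08,
}


def relative_speedup(fast, slow):
    """Return how much faster `fast` is than `slow`, as a fraction."""
    return fast / slow - 1.0


def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass of the main model.

    Simplifying assumption: every drafted token is accepted independently
    with probability `acceptance_rate`, and `draft_len` tokens are drafted
    per pass. The main model always contributes one token itself.
    """
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    llama = generate_tps["Llama-3.3 70B + 3B draft"]
    qwen = generate_tps["Qwen2.5 72B + 3B draft"]
    print(f"Qwen 72B vs Llama 70B: +{qwen - llama:.2f} T/s "
          f"({relative_speedup(qwen, llama):.1%} faster)")

    # Hypothetical acceptance rates, purely for illustration.
    for rate in (0.6, 0.7, 0.8):
        tokens = expected_tokens_per_step(rate, draft_len=4)
        print(f"acceptance {rate:.0%}: ~{tokens:.2f} tokens per main-model pass")
```

If the independence assumption is anywhere near right, a modest difference in how often the draft model's tokens get accepted is enough to produce a noticeable gap in generation speed, which would be consistent with the guess above. Confirming it would require logging the actual acceptance rates for both model pairs.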