mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Posted by EricBuehler@reddit | LocalLLaMA | View on Reddit | 35 comments

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput.

The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:

The full report includes all steps to reproduce these results. The results hold up across quantization type (eQ8_0, Q4K), model (dense and MoE), and GPU. Please see the full report for more details: https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md

If you want to try this out, you can install mistral.rs easily:

# Mac/Linux:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

# Windows
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

Then, you can start a OpenAI-compatible server on port 1234 and a web chat UI with built-in agentic features:

mistralrs serve --agent -m google/gemma-4-E4B-it --quant 4

Reproductions, criticism, and benchmark suggestions are welcome!

Check out the GitHub for more details, documentation, and examples: https://github.com/EricLBuehler/mistral.rs

https://reddit.com/link/1tttevw/video/z0ayf1f1go4h1/player