mistral.rs v0.8.2: up to 2.8x faster CUDA inference than llama.cpp on GB10, B200, and H100

Posted by EricBuehler@reddit | LocalLLaMA | View on Reddit | 35 comments

Hey all! I’ve been working on CUDA performance in mistral.rs, and v0.8.2 is focused on CUDA throughput.

The result: on Gemma 4 (dense & MoE), mistral.rs is faster than llama.cpp at every point in my release sweep on GB10/H100/B200. See some results below on GB10 and B200:

The full report includes all steps to reproduce these results. The results hold up across quantization type (eQ8_0, Q4K), model (dense and MoE), and GPU. Please see the full report for more details: https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md

If you want to try this out, you can install mistral.rs easily:

# Mac/Linux:
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

# Windows
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

Then, you can start a OpenAI-compatible server on port 1234 and a web chat UI with built-in agentic features:

mistralrs serve --agent -m google/gemma-4-E4B-it --quant 4

Reproductions, criticism, and benchmark suggestions are welcome!

Check out the GitHub for more details, documentation, and examples: https://github.com/EricLBuehler/mistral.rs

https://reddit.com/link/1tttevw/video/z0ayf1f1go4h1/player

[-]

a_beautiful_rhind@reddit

I think the philosophy of no knobs and dials doesn't really work for me. Plus I already have models and I didn't use HF downloader.

From what I read you only read them from HF cache? Plus should be mentioned that this is custom GGUF support so everyone's normal GGUF won't work.

Have you compared your quants vs llama.cpp, vllm, ik_llama, exl3, etc? Not just speed but quality.

[-]

dtdisapointingresult@reddit

Can you update the README to show how to use an already-downloaded model? For example I have cyankiwi/MiniMax-M2.7-AWQ-4bit already saved to /usb/models/, how do I use it without redownloading?

[-]

noatoms@reddit

Does it support Gemma4 MTP? How is VRAM usage compared to VLLM? Sorry if it's been asked before..

[-]

EricBuehler@reddit (OP)

No worries 🙂 ! Gemma 4 MTP is supported: https://ericlbuehler.github.io/mistral.rs/guides/perf/gemma4-mtp/

VRAM usage is going to be very similar to vLLM.

[-]

Voxandr@reddit

Can you post DGX SPark numbers for lets say Qwen 3.5 122B ?

[-]

EricBuehler@reddit (OP)

Will do in future benchmarks with more models being demonstrated.

[-]

gusbags@reddit

Any chance it can do multi node support like a 2x DGX Spark cluster?

[-]

EricBuehler@reddit (OP)

Yes! Check out: https://ericlbuehler.github.io/mistral.rs/guides/perf/multi-gpu-distributed/

[-]

Jipok_@reddit

Can test it on 26B A4B? It seems to me that the llama.cpp has problems with ExB models.

[-]

EricBuehler@reddit (OP)

Yes! The full Gemma 4 lineup and all modalities are supported.

[-]

Jipok_@reddit

Ah, I found the numbers. There's essentially no difference with llama.cpp on the other model. It seems it really does have some kind of problem with ExB.

[-]

takoulseum@reddit

Nice to see the project actively developped, even if I love the llamacpp team work, this sub is just a hole of fanboys who do not understand diversity is the key

[-]

JockY@reddit

Does it have a native Anthropic-compatible API or would I require a translation layer like litellm?

How’s prefill performance vs vLLM on Blackwell sm120?

How’s long context (150k+ tokens) decode on sm120 with mistral.rs vs vLLM?

[-]

EricBuehler@reddit (OP)

Hey! No Anthropic-compatible API yet but that is coming very soon.

I didn't measure context prefill at 128k+ tokens yet, but I expect it will be very competitive with vllm.

For prefill performance vs vLLM, it is very good - see the technical report linked in my post or these figures:

[-]

jake_that_dude@reddit

the missing chart is concurrency=8 with prefix cache on/off. single request tok/s is useful, but it does not tell you whether paged KV is actually packing well under agent workloads. I would run ShareGPT or a fixed 4k prompt mix at n=1/4/8/16, then publish p50/p95 latency, decode tok/s, TTFT, and peak KV bytes. if it still beats llama.cpp there, the claim gets way more interesting.

[-]

FullstackSensei@reddit

Maybe I'm being a bit dense, but how's multi GPU support and support for older architectures (namely Pascal to Ampere)? Not everyone has spare kidneys to exchange for new data center GPUs, and in my personal experience many sellers seem reluctant to accept exchanges for human organs.

[-]

EricBuehler@reddit (OP)

No worries 😄

Multi-gpu support is fully supported (https://ericlbuehler.github.io/mistral.rs/explanation/device-mapping/#multi-gpu-layouts). mistral.rs will automatically use the most performant method, which on CUDA is NCCL.

These optimizations are systemic, and apply across architectures (i.e. Blackwell, Hopper). While I haven't tested older GPUs beyond Hopper yet, I would expect that the story is very similar.

[-]

FullstackSensei@reddit

Sounds like you're targeting vllm audience?

While NCCL works on my P40s, it doesn't on cards that don't support p2p out of the box like 3090, 4090, etc. Specific to Pascal, fp16 is really bad and llama.cpp has custom kernels that upcast everything to fp32.

I really like your project and have been a fan for a long time, but haven't been able to use it until now, mainly because of lack of clarity on whether and how I can make it work with my hardware.

[-]

GaelOffMySoul@reddit

Well if you're motivated, you still have: https://github.com/aikitoria/open-gpu-kernel-modules

[-]

EricBuehler@reddit (OP)

Thanks for the feedback! I should make the hardware story much clearer in the docs.

I’m not trying to target only the vLLM audience. There are really two lanes:

High-end CUDA / datacenter GPUs
Local inference / agents, where the goal is easy deployment across consumer CUDA, Metal, and CPU.

For older GPUs, I agree it needs more explicit documentation, especially regarding the multi-GPU situation. CUDA multi-GPU is supported and does not only rely on NCCL (it can fall back to P2P in bf16/f16), but this should be better documented.

So while this release is mainly a CUDA performance report on newer GPUs, I think that it should generalize to local GPUs.

[-]

anzzax@reddit

Are those all charts for single request only? Can you give us similar charts for n = 4, 8 ,16?
How KV space reserved/used? I'm looking for vLLM alternatives but all what I tried are less efficient with 8 batched requests either worse in total throughput or inefficiently use available VRAM for cache and context.
We are past single thread chats - parallel agents/requests and effective prefix caching are mandatory.

[-]

OsmanthusBloom@reddit

Is it GPU-only or can it do partial CPU offload for MoEs like Qwen3.6-35B-A3B if you don't have enough VRAM?

[-]

EricBuehler@reddit (OP)

Yes, it can do partial CPU offload for MoEs. If you run an MoE and dont have enough VRAM it will place layers on your GPU and CPU to be able to run the model.

[-]

Remove_Ayys@reddit

Are you observing the same speed differences for other combinations of models and GPUs? How representative is this particular data point of the average case?

[-]

EricBuehler@reddit (OP)

I measured the cases in the report (https://github.com/EricLBuehler/mistral.rs/blob/master/releases/v0.8.2/report.md), but since the changes made to mistral.rs are general and apply to all CUDA GPUs, I expect that this data point should be representative.

[-]

Remove_Ayys@reddit

I read the report, I don't agree that it has sufficient coverage to claim a general speed advantage.

[-]

DrBearJ3w@reddit

Does it support B300? I think I have few lying around.

[-]

EricBuehler@reddit (OP)

Yep! If you have a B300 it should work 😄 We support CUDA compute Turing and up.

[-]

DrBearJ3w@reddit

Oh no. It says 7900 XTX. Might have switched up first 2 letters. Welp. I will probably skip this one.

[-]

EricBuehler@reddit (OP)

AMD support is coming once we make some changes to the multi-GPU backend support in candle.

[-]

DrBearJ3w@reddit

Oh. Good luck with AMD support 💪

[-]

alew3@reddit

Does it have these speedups on RTX 5090 (Blackwell) ?

[-]

EricBuehler@reddit (OP)

Yes! Any blackwell machine will benefit from this, you should see improvements similar to the B200 and GB10 blackwell machines I benchmarked.

[-]

nullbyte420@reddit

Nice! How come it's so much faster? I do happen to have access to some h200 gpus

[-]

EricBuehler@reddit (OP)

Thanks!

I think the speedup is mostly from the CUDA execution path and how models are run in mistral.rs.

For this release, I think the biggest factors were optimized paged attention and flash decoding paths, CUDA graphs/low launch overhead. This was not one magic trick so much as a bunch of deep engine-level work adding up.

For vLLM: I would not say mistral.rs is “better than vLLM” generally. vLLM is still excellent for high-throughput/batched BF16 serving, and we haven't benchmarked for large concurrency yet. However, I think that mistral.rs's continuous batching features should enable efficient small-batch serving compared to vLLM.

If you have H200 access, I would love to see a reproduction!