What's the most optimized engine to run on a H100?
Posted by Obamos75@reddit | LocalLLaMA | View on Reddit | 9 comments
Hey guys,
I was wondering what is the best/fastest engine to run LLMs on a single H100? I'm guessing vLLM is great but not the fastest. Thank you in advance.
I'm running a Llama 3.1 8B model.
hurdurdur7@reddit
Llama 3.1 8B ... on an H100? This is like doing DoorDash in a Ford F550 ...
MrAlienOverLord@reddit
idk what these guys are talking about with llama.cpp, it won't accelerate anything on the H100. For single-user use the H100 is wasted, you are better off with a 6000 Pro. If you do run on the H100, use lmdeploy / vLLM / SGLang .. and make sure you optimise prefill
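For anyone landing here from search: a minimal sketch of what serving Llama 3.1 8B with vLLM on a single H100 might look like. The model ID and flag values below are illustrative choices on my part, not something the commenter specified; check the vLLM docs for your version before copying.

```shell
# Sketch: serve Llama 3.1 8B Instruct with vLLM on one GPU.
# Flag values are illustrative, tune for your workload.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

Chunked prefill (on by default in recent vLLM versions) is one of the knobs behind the "optimise prefill" advice, since it interleaves prefill and decode to keep latency down under load.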
spky-dev@reddit
If you give me one I’ll figure that out for you :)
Probably a nightly build of llama.cpp with the latest CUDA, for single-user throughput. vLLM will be best for multi.
Obamos75@reddit (OP)
okok thank you.
ea_nasir_official_@reddit
llama.cpp with CUDA and flash attention. Use Q8 or Q4 on the model and Q8 on the KV cache. Try mmap or mlock as well. Compile it yourself on your machine for your specific CPU instructions. Try adding --prio 2 --prio-batch 3.
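A rough sketch of the build-and-run steps this comment describes, assuming a current llama.cpp checkout; the model filename is a placeholder and exact flag spellings can differ between llama.cpp versions, so verify against `llama-server --help` on your build.

```shell
# Sketch: build llama.cpp with CUDA support, optimized for the local CPU.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Sketch: run the server with flash attention, all layers on the GPU,
# a Q8_0 KV cache, and memory locking. Model path is a placeholder.
./build/bin/llama-server \
  -m ./llama-3.1-8b-q8_0.gguf \
  -ngl 99 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --mlock \
  --prio 2
```

`-ngl 99` just means "offload every layer"; with an 8B model an H100 fits the whole thing with room to spare, which is also why the KV-cache quantization matters less here than on smaller cards.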
Obamos75@reddit (OP)
ok thank you for the tips!
twnznz@reddit
Can I offer you a banana for that H100? It's a really good banana
Obamos75@reddit (OP)
at least 3 godly bananas or we're not even talking
Stochastic_berserker@reddit
Anything using Flash Attention