What's the most optimized engine to run on a H100?
Posted by Obamos75@reddit | LocalLLaMA | View on Reddit | 9 comments
Hey guys,
I was wondering what is the best/fastest engine to run LLMs on a single H100? I'm guessing vLLM is great but not the fastest. Thank you in advance.
I'm running a Llama 3.1 8B model.
hurdurdur7@reddit
Llama 3.1 8B ... on an H100? This is like doing DoorDash in a Ford F550 ...
MrAlienOverLord@reddit
idk what these guys are talking about with llama.cpp, it won't accelerate anything on the H100. For single-user use the H100 is wasted, you are better off with a 6000 Pro. If you do run on the H100, use lmdeploy / vLLM / SGLang .. and make sure you optimise prefill
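For anyone landing here from search: a minimal sketch of what serving Llama 3.1 8B with vLLM on a single H100 might look like. The model ID and flag values below are illustrative choices on my part, not something the commenter specified; check the vLLM docs for your version before copying.

```shell
# Sketch: serve Llama 3.1 8B Instruct with vLLM on one GPU.
# Flag values are illustrative, tune for your workload.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```

Chunked prefill (on by default in recent vLLM versions) is one of the knobs behind the "optimise prefill" advice, since it interleaves prefill and decode to keep latency down under load.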
spky-dev@reddit
If you give me one I’ll figure that out for you :)
Probably a nightly build of llama.cpp with the latest CUDA, for single-user throughput. vLLM will be best for multi.
Obamos75@reddit (OP)
okok thank you.
ea_nasir_official_@reddit
llama.cpp with CUDA and flash attention. Use Q8 or Q4 on the model and Q8 on the KV cache. Try mmap or mlock as well. Compile it yourself on your machine for your specific CPU instructions. Try adding --prio 2 --prio-batch 3.
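A rough sketch of the build-and-run steps this comment describes, assuming a current llama.cpp checkout; the model filename is a placeholder and exact flag spellings can differ between llama.cpp versions, so verify against `llama-server --help` on your build.

```shell
# Sketch: build llama.cpp with CUDA support, optimized for the local CPU.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Sketch: run the server with flash attention, all layers on the GPU,
# a Q8_0 KV cache, and memory locking. Model path is a placeholder.
./build/bin/llama-server \
  -m ./llama-3.1-8b-q8_0.gguf \
  -ngl 99 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --mlock \
  --prio 2
```

`-ngl 99` just means "offload every layer"; with an 8B model an H100 fits the whole thing with room to spare, which is also why the KV-cache quantization matters less here than on smaller cards.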
Obamos75@reddit (OP)
ok thank you for the tips!
twnznz@reddit
Can I offer you a banana for that H100? It's a really good banana
Obamos75@reddit (OP)
at least 3 godly bananas or we're not even talking
Stochastic_berserker@reddit
Anything using Flash Attention