How to run Hunyuan-Large (389B)? Llama.cpp doesn't support it
Posted by TackoTooTallFall@reddit | LocalLLaMA | View on Reddit | 11 comments
I have a homelab server that can run Llama 3.1 405B on CPU. I'm trying to run the new Hunyuan-Large 389B MoE model using Llama.cpp but I can't figure out how to do it.
If I try to use llama-cli with the FP8 Instruct safetensors files directly, I get "main: error: unable to load model". If I try to convert it to GGUF using convert_hf_to_gguf.py, I get "Model HunYuanForCausalLM is not supported".
How are others running this?
ambient_temp_xeno@reddit
I wish someone would test it, somehow!
arousedsquirel@reddit
Use the code sample in the model card: copy it into a large enough model and ask it to verify whether a Gradio interface is provided; otherwise, ask it to refactor the code to include one. Agents would be nice to have for automating test runs until the code works. And then you have your interface on your model, running locally.
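Something like this minimal sketch, assuming the model card's sample already gives you a working generation function (the function name `generate_reply` and the Gradio settings here are placeholders):

```python
import gradio as gr

def generate_reply(prompt: str) -> str:
    # Placeholder: call whatever generation code the model card's sample provides
    # (tokenize the prompt, run model.generate(), decode the output).
    raise NotImplementedError("wire this up to the model card's sample code")

# Wrap the generation function in a simple text-in / text-out web UI.
demo = gr.Interface(fn=generate_reply, inputs="text", outputs="text",
                    title="Hunyuan-Large (local)")
demo.launch()  # serves on http://127.0.0.1:7860 by default
```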
zotero-chatpdf@reddit
How is it possible to run a model this large locally?
Conscious_Cut_6144@reddit
He said he's doing CPU inference, so all that's needed is a few hundred gigs of RAM and a lot of patience.
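Rough back-of-the-envelope for the weights alone (ballpark numbers, ignoring KV cache and other overhead):

```python
# Ballpark weight-memory estimate for a 389B-parameter model at different precisions.
# Weights only; KV cache and runtime overhead come on top of this.
params = 389e9
for name, bytes_per_param in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# FP16: ~778 GB, Q8: ~389 GB, Q4: ~195 GB
```

And since it's an MoE with roughly 52B active parameters per token, CPU generation should be faster than a dense model of the same total size, though you still need RAM for all of the weights.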
zotero-chatpdf@reddit
Guess it would take at least 30 seconds per request.
AmericanNewt8@reddit
I'm afraid your only option this early, unless you want to do the legwork yourself, is going to be straight transformers. It's not the worst out there, but it's nowhere near as performant as vLLM, llama.cpp, or the other custom engines.
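Roughly this shape, as a sketch; the repo id and exact arguments are assumptions on my part, so check the model card for the real snippet:

```python
# Minimal sketch: running Hunyuan-Large with plain transformers on CPU.
# The repo id below is an assumption -- use whatever the model card gives you.
# Custom architectures like HunYuanForCausalLM need trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Tencent-Hunyuan-Large"  # placeholder: check the model card

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # loads on CPU by default; needs RAM for all weights
    trust_remote_code=True,
)

prompt = "Explain mixture-of-experts models in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

You'd `pip install torch transformers` (plus whatever else the card lists) before running it, and expect it to be slow on CPU.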
TackoTooTallFall@reddit (OP)
Is there a guide on how to use this? I've never heard of it
arthurwolf@reddit
Typically, the GitHub, Hugging Face, or website of the model will have a simple Python code example that shows how to run the model. Typically, that code will use the `transformers` library. So the plan is, essentially: find instructions, follow instructions. Do whatever they tell you to do (often `pip install` some stuff, copy the script, paste the script into a file, and do `python script.py`). And then it should run. In theory.
TheDreamWoken@reddit
Be the change you want to see
fallingdowndizzyvr@reddit
If you want to run it with llama.cpp, you have to go here and open an issue. Then hopefully someone will deem it important enough to implement.
https://github.com/ggerganov/llama.cpp/issues
TackoTooTallFall@reddit (OP)
Yes, makes sense. Thank you for weighing in.