dual spark with llama.cpp

Posted by koibKop4@reddit | LocalLLaMA | 6 comments

I'm daily-driving dual Asus GX10s (Spark) with vLLM and it's fantastic.
But I want to try a model that is GGUF-only and won't fit on a single Spark.

I couldn't find any how-tos about running llama.cpp on dual Sparks.

Has anyone tried it? Any suggestions on how to run it?
I want to run uncensored MiniMax.

g_rich@reddit

Why don't you just use https://huggingface.co/llmfan46/MiniMax-M2.7-BF16-ultra-uncensored-heretic with vLLM? You'd need to quantize the existing model because this one is in BF16, but that's not too difficult and it's a good skill to learn.

segmond@reddit

llama.cpp has RPC options; search for them and use them. You start one machine as the RPC host, then on the main server you start llama.cpp and add the remote RPC host as one of your devices.
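
A minimal sketch of that two-machine setup, written as a small Python helper that only prints the two command lines instead of running them. It assumes llama.cpp was built on both Sparks with the RPC backend enabled (`-DGGML_RPC=ON`), which provides the `rpc-server` binary. The IP address, port, and model path are hypothetical placeholders, and `-ngl 99` is just one common way to ask for all layers to be offloaded.

```python
import shlex

# Hypothetical values for illustration: adjust to your own network and files.
WORKER_HOST = "192.168.100.2"       # IP of the second Spark (assumption)
RPC_PORT = 50052                    # port picked for rpc-server (assumption)
MODEL_PATH = "/models/model.gguf"   # placeholder path to the GGUF file


def worker_command(port=RPC_PORT):
    """Command for the second Spark: expose its device over RPC."""
    # -H sets the bind address, -p the listening port.
    return ["rpc-server", "-H", "0.0.0.0", "-p", str(port)]


def main_command(model=MODEL_PATH, workers=((WORKER_HOST, RPC_PORT),)):
    """Command for the main Spark: add the remote host(s) as extra devices."""
    # --rpc takes a comma-separated list of host:port pairs.
    rpc_list = ",".join(f"{host}:{port}" for host, port in workers)
    return ["llama-server", "-m", model, "--rpc", rpc_list, "-ngl", "99"]


if __name__ == "__main__":
    print("on the worker:", shlex.join(worker_command()))
    print("on the main:  ", shlex.join(main_command()))
```

Start the worker command first, then the main one. The RPC server has no authentication, so keep it on the private link between the two machines rather than an open network.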

ImportancePitiful795@reddit

Why do you want to use llama.cpp when you're already using vLLM? The latter is far superior in multi-node setups, especially on dual DGX.

koibKop4@reddit (OP)

tell me how to use vLLM with GGUF please

-dysangel-@reddit

btw IQ2_XXS of MiniMax M2.7 is very good. Basically the same KLD loss as Q4. If you could get a similar quant of that heretic model, it should work great on a single Spark.
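
A rough back-of-the-envelope check of the "fits on a single Spark" claim, as a sketch only. It assumes about 230B total parameters (the figure published for MiniMax-M2; whether M2.7 matches is an assumption here), 128 GB of unified memory per Spark, and approximate bits-per-weight values for each format. It counts weights only, so KV cache, runtime overhead, and memory reserved by the OS all come on top.

```python
def gguf_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes); ignores KV cache and overhead."""
    return params_billion * bits_per_weight / 8


ASSUMED_PARAMS_B = 230   # assumption: total parameter count in billions
SPARK_MEMORY_GB = 128    # unified memory per Spark; usable amount is lower

# Bits-per-weight values are approximate, not exact for any given file.
FORMATS = [("BF16", 16.0), ("Q4 (~4.8 bpw)", 4.8), ("IQ2_XXS (~2.06 bpw)", 2.06)]

if __name__ == "__main__":
    for name, bpw in FORMATS:
        size = gguf_size_gb(ASSUMED_PARAMS_B, bpw)
        fits = size < SPARK_MEMORY_GB
        print(f"{name}: ~{size:.0f} GB of weights, under 128 GB: {fits}")
```

Under these assumptions only the 2-bit quant leaves real headroom on one machine, which is why the 4-bit and BF16 variants push you toward two Sparks.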

koibKop4@reddit (OP)

thanks, let me know where to find IQ2 XXS of uncensored M2.7