Is it possible to run a quantized Llama 70B model on a CPU / iGPU / APU?
Posted by grigio@reddit | hardware | 7 comments
I can run it at 2 tokens/s, but I'd like to run it at at least 10 tokens/s. I don't want a GPU.
Wrong-Quail-8303@reddit
Great question. This sub needs more AI-related questions and answers to build knowledge in the explosive AI field.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
If you have a good SSD (NVMe PCIe 3.0 or newer, etc.), then when you run out of RAM, your system will use the page file.
You won't get much higher than 2 tokens/s on a CPU. A GPU such as a 4090 will do 140 tokens/s. Your goal of 10 tokens/s is achievable at Q4 with a cheap low end GPU. See here:
https://new.reddit.com/r/LocalLLaMA/comments/1ci9i0a/llm_inference_speed_table/
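A quick way to sanity-check the sizing rule in that comment is to estimate a quant's file size from the parameter count and its bits per weight, then compare against your memory budget with the suggested 1-2GB of headroom. A minimal sketch; the bits-per-weight figures are approximate and the GGUF quant names are assumptions on my part, not something from the thread:

```python
# Rough sizing check for a quantized model against a memory budget.
# Bits-per-weight values are approximate; real GGUF files vary a bit
# because some tensors stay at higher precision.
QUANTS = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def quant_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate file size in GB for a given parameter count and quant."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def fits(params_billions: float, budget_gb: float, headroom_gb: float = 2.0):
    """List quants whose estimated size leaves `headroom_gb` free within the budget."""
    return {
        name: round(quant_size_gb(params_billions, bpw), 1)
        for name, bpw in QUANTS.items()
        if quant_size_gb(params_billions, bpw) + headroom_gb <= budget_gb
    }

# Example: a 70B model against 64 GB of combined VRAM + system RAM.
print(fits(70, budget_gb=64))  # Q2_K (~23 GB), Q4_K_M (~42 GB), Q5_K_M (~50 GB) fit; Q8_0 (~74 GB) does not.
```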
LingonberryGreen8881@reddit
The OP was a short sentence and you ignored every variable in it. He said 70B model and you gave advice for a 7B model. He said "I don't want a GPU" and almost your whole post was about GPUs. We get it; GPU is much better... but answer the question as it was asked.
Azzcrakbandit@reddit
They literally said that OP wouldn't get much higher than 2 tokens/s on a CPU.
grigio@reddit (OP)
A faster CPU and RAM could probably improve the situation, but I already have a Ryzen 7 and DDR5 RAM.
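Worth noting why faster RAM only helps so much: for token generation the bottleneck is usually memory bandwidth rather than compute, since roughly the whole set of weights is streamed from memory once per generated token. A back-of-envelope sketch; the bandwidth and model-size numbers are illustrative assumptions, not measurements from this thread:

```python
# Back-of-envelope decode ceiling: tokens/s ≈ memory bandwidth / bytes read per token.
# For dense models, roughly the whole weight file is streamed once per generated token.

def decode_ceiling_tps(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when generation is memory-bandwidth bound."""
    return bandwidth_gb_s / model_size_gb

model_q4_70b = 40.0  # ~40 GB for a 70B model at ~4.5 bits/weight (approximate)

# Dual-channel DDR5-6000 peaks around 90 GB/s; a 4090's GDDR6X is ~1000 GB/s.
print(decode_ceiling_tps(model_q4_70b, 90))    # ~2.3 tok/s: about what OP is seeing on CPU
print(decode_ceiling_tps(model_q4_70b, 1000))  # ~25 tok/s, if the whole model could fit in that much VRAM
```

By that math, hitting 10 tokens/s on a ~40 GB quant needs roughly 400 GB/s of memory bandwidth, which is why the later replies point toward many memory channels (Threadripper/Xeon) or unified-memory machines rather than just a faster desktop CPU.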
meodd8@reddit
I think the r/localllama sub would have better suggestions.
The real answer is that you effectively can't run those larger models on an iGPU system. They are designed to run on multiple GPUs in parallel and also across clusters.
From my limited experience, 4-bit quantized models take about as much VRAM as you see in the model name (a bit less), so for 70B I'd probably want ~60GB worth of VRAM+RAM. You can find lower-quant models, but I haven't tried them, and I've heard they can be pretty dumb.
Now, the question is, is a model with more parameters but a very low quant better than a model with fewer parameters but much higher precision?
A system with a unified memory architecture like the Apple products could work if you just really don't want a dedicated GPU.
grigio@reddit (OP)
I tried, but I'm shadowbanned there; I don't know why.
Affectionate-Memory4@reddit
You'll get the best performance with the whole model in memory. Having enough RAM and enough bandwidth is likely the path forward. Probably 48-64GB and as fast as you can get. I don't know how much compute you're going to need for 70B Llama at 10T/s, but I can say that a pair of 7900XTXs manages a bit over that.
Any current iGPU is going to lack the compute power needed, but in theory can do some of the work for the CPU as an offload device. Your best bet on consumer hardware is probably a 9950X right now, and if that can't do it, maybe something in the Xeon or Threadripper family can muster enough cores and memory bandwidth to throw at it.
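If the stack in question is llama.cpp-based (the thread never says, so treat this as an assumption), the kind of CPU-plus-iGPU offload described above looks roughly like the sketch below using the llama-cpp-python bindings; the model filename and layer count are placeholders, not values from the thread.

```python
# Minimal sketch: run a GGUF quant mostly on CPU, offloading a few layers to the iGPU.
# Assumes llama-cpp-python built with a GPU backend (e.g. Vulkan/ROCm for an AMD APU).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-70b.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,        # context window
    n_threads=16,      # CPU threads for the non-offloaded layers
    n_gpu_layers=8,    # small offload; 0 = pure CPU, -1 = offload everything
)

out = llm("Explain memory-bandwidth-bound inference in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Whether the iGPU offload actually helps depends on the hardware: on most APUs the iGPU shares the same DDR5 bandwidth as the CPU cores, so the bandwidth ceiling from the earlier sketch still applies.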