Are more cores faster?
Posted by VolkoTheWorst@reddit | LocalLLaMA | 18 comments
I would like to build a server to run big models (slowly).
I will run on CPU (or maybe add a GPU, but the model would be mostly offloaded to RAM).
I was wondering if I should get an old Xeon (more cores) or a more classic CPU (fewer cores, but each one faster).
Basically, does llama.cpp use all cores? Can it suffer from having too many cores?
Thanks ^^
PS: I think I will run it on DDR3. I know it will be very, very slow, but it's just so much cheaper.
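For reference, llama.cpp will spread work across however many threads you ask for, and going past the physical core count usually hurts rather than helps. A minimal sketch of the thread settings using the llama-cpp-python bindings; the model path and thread counts are placeholder assumptions to tune on the actual machine:

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# Model path and thread counts are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="model.Q4_K_M.gguf",  # any GGUF quant
    n_ctx=4096,
    n_threads=8,         # token generation: physical cores, not hyperthreads
    n_threads_batch=16,  # prompt processing is compute-bound and scales further
)

out = llm("Q: Are more cores faster?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```

Generation (`n_threads`) saturates memory bandwidth quickly, while prompt processing (`n_threads_batch`) is compute-bound and can profitably use more cores.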
Mart-McUH@reddit
It should not matter much for inference (memory bound), but it will matter a lot for prompt processing (compute bound) until you add that GPU, which you should definitely do.
VolkoTheWorst@reddit (OP)
PP matters more for me than TG, so yeah, I think I will add a GPU for that. Is there a particular amount of VRAM required? Should most of the model fit in VRAM? I would like to run big models that will never fit in the VRAM.
Mart-McUH@reddit
For dense models it would still matter (but large dense models are almost non-existent now). For MoE it matters less, but you ideally still want to put all non-expert layers on the GPU, as well as the KV cache. How much that is depends on the model. For the huge ones it can still be quite a lot, but I do not have hard numbers as I do not run those behemoths. I would guess ~10-20% of the model size (it depends on quants and desired context size too).
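One way to express "non-expert layers and KV cache on the GPU, experts in RAM" is llama.cpp's tensor-override flag. A sketch launching llama-server from Python; the model path and the tensor-name regex are assumptions that depend on the specific MoE's tensor naming:

```python
# Sketch: offload everything to the GPU except the MoE expert tensors,
# which stay in system RAM. Paths and the regex are assumptions.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "moe-model.Q4_K_M.gguf",   # placeholder MoE GGUF
    "-ngl", "99",                    # offload all layers to the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",   # ...but keep expert tensors in RAM
    "-c", "8192",                    # context size drives KV-cache VRAM use
])
```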
VolkoTheWorst@reddit (OP)
Okay
sinevilson@reddit
Is this your first computer? How damn cute is this... you go, you go-getter, you.
VolkoTheWorst@reddit (OP)
?
jacek2023@reddit
CPU on DDR3 will be slow.
You can say "I don't care about speed" but that's not true.
Waiting minutes for each answer will make you just stop trying.
VolkoTheWorst@reddit (OP)
I never said I was going to use it as a chatbot.
My_Unbiased_Opinion@reddit
It doesn't matter much after 4 cores. The biggest factor will be total memory bandwidth, which comes down to how many memory channels you have and the RAM speed. But if you are using this as a general-use server, I would take more cores. You can spread the load over many cores using llama.cpp, ik-llama.cpp, or even LM Studio if you want a GUI.
This will free up performance for other tasks you want like game servers, etc.
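A rough way to see why bandwidth dominates: generation speed is approximately memory bandwidth divided by the bytes read per token. The numbers below are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope: tokens/s ~= bandwidth / bytes read per token.
def est_tok_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    return bandwidth_gb_s / gb_read_per_token

# Dense model: roughly the whole quantized model is read every token.
print(est_tok_per_s(50.0, 40.0))  # 40 GB dense quant at ~50 GB/s -> ~1.3 t/s
# MoE: only the active experts are read, e.g. ~5B active params at ~4.5 bpw.
print(est_tok_per_s(50.0, 3.0))   # ~3 GB read per token -> ~17 t/s
```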
VolkoTheWorst@reddit (OP)
Thanks a lot !
--Rotten-By-Design--@reddit
It won't just be slow, it will be EXTREMELY slow, no matter what old CPU you decide to run it on.
If I offload half of the Llama 3.3 70B model to my 3090 and the other half to my CPU/RAM (a 12600K and 64GB of DDR4-3600), token generation drops to about 2 t/s, which is utterly useless. Your experience will be worse... Don't...
My_Unbiased_Opinion@reddit
MoE is pretty viable these days on CPU.
--Rotten-By-Design--@reddit
So you are saying an old Xeon, or an i5-8600k for that matter, and DDR3 RAM will give a good experience on a modern MoE model?
My_Unbiased_Opinion@reddit
If he is going the DDR3 Xeon route, he could have 4 (or more) memory channels, compared to dual channel on consumer DDR4 boards. This can counter the lower RAM speed of DDR3.
Also, there is a recent very-high-sparsity model that is 17B with less than 1B active parameters. I have not tried it myself, so I can't vouch for how good it is, but it's an option if OP wants very high speeds even on CPU.
https://huggingface.co/AIDC-AI/Marco-Mini-Instruct
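The channel arithmetic is easy to check: theoretical peak bandwidth is channels × transfer rate × 8 bytes per transfer. The two configurations below are examples, not recommendations:

```python
# Theoretical peak memory bandwidth in GB/s.
def peak_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # 8 bytes per 64-bit transfer

print(peak_gb_s(4, 1866))  # quad-channel DDR3-1866 -> ~59.7 GB/s
print(peak_gb_s(2, 3600))  # dual-channel DDR4-3600 -> ~57.6 GB/s
```

So an old quad-channel Xeon board can, on paper, match a dual-channel DDR4 desktop.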
--Rotten-By-Design--@reddit
Such a model might run at something OP can accept.
I don't know that specific model, but when OP wrote "big model", I imagined something bigger.
My_Unbiased_Opinion@reddit
Understandable. The biggest model I think he might be able to get away with is the 120B Derestricted. It's better than the stock model, and it has only 5B activated parameters. It's going to be slow, but probably fast enough for the creative use I assume the OP has in mind.
VolkoTheWorst@reddit (OP)
Yes, this is exactly what I wanted to run. Didn't know about the 4-memory-channel thing, thanks a lot!
I don't need it to be fast; even 2 tok/s can be enough. I might use it for agentic tasks like verifying my code on pull requests, automating stuff with OpenClaw, or things like that. I will probably also use the server for other stuff (self-hosted things).
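For that kind of agentic use, llama-server exposes an OpenAI-compatible endpoint, so a review script can stay small. A sketch assuming the default port; the model name and diff file are placeholders:

```python
# Sketch: ask a local llama-server to review a diff over its
# OpenAI-compatible API. Port, model name, and file are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

with open("changes.diff") as f:  # hypothetical diff pulled from a PR
    diff = f.read()

resp = client.chat.completions.create(
    model="local",  # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": f"Review this diff:\n{diff}"}],
)
print(resp.choices[0].message.content)
```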
StardockEngineer@reddit
Don't bother.