CPU-only LLM performance - t/s with llama.cpp
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 64 comments
How many of you use CPU-only inference from time to time (at least rarely)? .... I really miss CPU-only performance threads here in this sub.
^(Possibly a few of you are waiting to grab one or more 96GB GPUs at a cheaper price later, so for now you're doing CPU-only inference with bulk RAM.)
I think bulk RAM (128GB-1TB) is more than enough to run small/medium models, provided it comes with enough memory bandwidth.
My System Info:
Intel Core i7-14700HX 2.10 GHz | 32 GB RAM | DDR5-5600 | 65GB/s Bandwidth |
llama-bench command (I used Q8 for the KV cache to get decent t/s with my 32GB RAM):
llama-bench -m modelname.gguf -fa 1 -ctk q8_0 -ctv q8_0
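For reference, the 65GB/s figure above can be sanity-checked against the usual peak-bandwidth formula, channels × MT/s × 8 bytes per transfer. A quick sketch (assuming a dual-channel DDR5-5600 laptop; measured bandwidth is normally a fraction of this theoretical number):

```shell
# Theoretical peak = channels * MT/s * 8 bytes, in MB/s.
# Dual-channel DDR5-5600: 2 * 5600 * 8 = 89600 MB/s (~89.6 GB/s peak;
# the ~65 GB/s quoted above is a typical real-world result against that peak).
echo $((2 * 5600 * 8))
```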
CPU-only performance stats (Model Name with Quant - t/s):
Qwen3-0.6B-Q8_0 - 86
gemma-3-1b-it-UD-Q8_K_XL - 42
LFM2-2.6B-Q8_0 - 24
LFM2-2.6B.i1-Q4_K_M - 30
SmolLM3-3B-UD-Q8_K_XL - 16
SmolLM3-3B-UD-Q4_K_XL - 27
Llama-3.2-3B-Instruct-UD-Q8_K_XL - 16
Llama-3.2-3B-Instruct-UD-Q4_K_XL - 25
Qwen3-4B-Instruct-2507-UD-Q8_K_XL - 13
Qwen3-4B-Instruct-2507-UD-Q4_K_XL - 20
gemma-3-4b-it-qat-UD-Q6_K_XL - 17
gemma-3-4b-it-UD-Q4_K_XL - 20
Phi-4-mini-instruct.Q8_0 - 16
Phi-4-mini-instruct-Q6_K - 18
granite-4.0-micro-UD-Q8_K_XL - 15
granite-4.0-micro-UD-Q4_K_XL - 24
MiniCPM4.1-8B.i1-Q4_K_M - 10
Llama-3.1-8B-Instruct-UD-Q4_K_XL - 11
Qwen3-8B-128K-UD-Q4_K_XL - 9
gemma-3-12b-it-Q6_K - 6
gemma-3-12b-it-UD-Q4_K_XL - 7
Mistral-Nemo-Instruct-2407-IQ4_XS - 10
Huihui-Ling-mini-2.0-abliterated-MXFP4_MOE - 58
inclusionAI_Ling-mini-2.0-Q6_K_L - 47
LFM2-8B-A1B-UD-Q4_K_XL - 38
ai-sage_GigaChat3-10B-A1.8B-Q4_K_M - 34
Ling-lite-1.5-2507-MXFP4_MOE - 31
granite-4.0-h-tiny-UD-Q4_K_XL - 29
granite-4.0-h-small-IQ4_XS - 9
gemma-3n-E2B-it-UD-Q4_K_XL - 28
gemma-3n-E4B-it-UD-Q4_K_XL - 13
kanana-1.5-15.7b-a3b-instruct-i1-MXFP4_MOE - 24
ERNIE-4.5-21B-A3B-PT-IQ4_XS - 28
SmallThinker-21BA3B-Instruct-IQ4_XS - 26
Phi-mini-MoE-instruct-Q8_0 - 25
Qwen3-30B-A3B-IQ4_XS - 27
gpt-oss-20b-mxfp4 - 23
So it seems I would get 3-4X performance if I build a desktop with 128GB DDR5 RAM at 6000-6600 MT/s. For example, the t/s above times 4 for 128GB (32GB * 4). And 256GB could give 7-8X, and so on. Of course I'm aware of the context limits of the models here.
Qwen3-4B-Instruct-2507-UD-Q8_K_XL - 52 (13 * 4)
gpt-oss-20b-mxfp4 - 92 (23 * 4)
Qwen3-8B-128K-UD-Q4_K_XL - 36 (9 * 4)
gemma-3-12b-it-UD-Q4_K_XL - 28 (7 * 4)
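As a sanity check on estimates like these: for dense models, decode speed is bounded by memory bandwidth divided by the bytes read per token (roughly the quantized model size), so it scales with bandwidth, not capacity. A hedged sketch with illustrative numbers (not measurements):

```shell
# Rough dense-model decode bound: t/s ~= effective bandwidth / model size.
# Example: ~65 GB/s effective bandwidth, ~6.4 GB Q4 12B model -> ~10 t/s
# upper bound (real numbers land lower due to compute and cache overheads).
awk 'BEGIN { printf "%.1f\n", 65 / 6.4 }'
```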
I stopped bothering with 12B+ dense models since even Q4 quants of 12B dense models bleed tokens in the single digits (e.g. Gemma3-12B at just 7 t/s). But I'd really like to know the CPU-only performance of 12B+ dense models, as it would help me decide how much RAM I need for my expected t/s. Sharing a list for reference; it would be great if someone shared stats for these models.
Seed-OSS-36B-Instruct-GGUF
Mistral-Small-3.2-24B-Instruct-2506-GGUF
Devstral-Small-2507-GGUF
Magistral-Small-2509-GGUF
phi-4-gguf
RekaAI_reka-flash-3.1-GGUF
NVIDIA-Nemotron-Nano-9B-v2-GGUF
NVIDIA-Nemotron-Nano-12B-v2-GGUF
GLM-Z1-32B-0414-GGUF
Llama-3_3-Nemotron-Super-49B-v1_5-GGUF
Qwen3-14B-GGUF
Qwen3-32B-GGUF
NousResearch_Hermes-4-14B-GGUF
gemma-3-12b-it-GGUF
gemma-3-27b-it-GGUF
Please share your stats with your config (total RAM, RAM type - MT/s, total bandwidth) and whatever models (quant, t/s) you've tried.
And let me know if any changes are needed in my llama-bench command to get better t/s. Hopefully there are a few. Thanks
ttkciar@reddit
I sometimes infer pure-CPU on my dual Xeon E5-2660v3 with all eight channels filled with DDR4-2133. As you can imagine it is quite slow, but some tasks don't need high performance.
Some inference speeds are tabulated here -- http://ciar.org/h/performance.html -- but I haven't updated that in a while.
More recently:
Valkyrie-49B-v2: 0.9 tokens/second
GLM-4.5-Air: 1.2 tokens/second
Qwen3-235B-A22B-Instruct-2507: 1.7 tokens/second
Granite-4.0-h-small: 4.0 tokens/second
pmttyji@reddit (OP)
:) It's already in my bookmarks. It just needs an update.
What's the total RAM on those 2-channel and 8-channel setups? And how much bandwidth are you getting?
Yeah, DDR4's bandwidth is low compared to DDR5.
Thanks for your stats! But I expected to see models under 20B.
StardockEngineer@reddit
I don’t know how you can look at those numbers and think “this is what I want”. For the price of the board and starting RAM you could get an RTX Pro and a 5090 and be able to run Qwen3 235B.
Also, your plan to buy some RAM now and add more later could backfire. DDR5 is notoriously fickle, and it is very common to buy the exact same memory from the exact same manufacturer at a later date and have it not work together. I implore you to research this point. Bundled kits are often validated together.
There is no amount of CPU you can buy that will outperform GPUs on a tokens-per-dollar basis.
pmttyji@reddit (OP)
I blame myself for unintentionally painting a [CPU vs GPU] picture over my thread.
I just replied to another comment about that.
Regarding the RAM purchase: as you probably know, RAM prices have roughly doubled or tripled since last September. So it's impossible for me to buy 320-512GB of RAM now. 128GB for sure, and I'll try for 256GB if possible.
StardockEngineer@reddit
I understand the RAM situation, which is why I’m imploring you to abandon it.
Your hybrid setup will be inefficient. Offloading experts to the CPU comes with a huge performance hit. It’s a better-than-nothing solution for people without options. But you’re building from scratch. It makes no sense to aim for this.
pmttyji@reddit (OP)
So what do you recommend for the requirements mentioned in my other thread?
StardockEngineer@reddit
I can’t keep track of your threads. Can you relink me?
pmttyji@reddit (OP)
Direct thread link. Thanks
https://www.reddit.com/r/LocalLLaMA/comments/1ov7idh/ai_llm_workstation_setup_run_up_to_100b_models/
StardockEngineer@reddit
You are conflating requirements. One is your actual use case - running agents and MoE models - and the other is your assumed specs.
Sticking with just your use case - a single RTX Pro will do everything you want if you can live with Q6 quants for the largest models at 100b. The best 100bish MoE is gpt-oss-120b, which is mxfp4, and it fits comfortably at full context.
It’ll be 5-7 times faster than your best effort CPU machine at 240-260 tok/s. And that’s without speculative decoding, which can reach 300+ tok/s. And prompt processing speeds are absolutely no comparison.
If you were to point Claude Code at a CPU only machine, it would take 4-10 minutes to even get the first token.
Agentic coding and agents in general need horsepower.
pmttyji@reddit (OP)
Right now I can't afford an RTX Pro, which is why I'm planning to get a 5090 first with enough RAM (128GB minimum). With that setup I can run GPT-OSS-120B at usable token rates since it's a 65GB model. The RTX Pro comes later, possibly around the end of next year.
ttkciar@reddit
The i7-9750H had 32GB when those measurements were taken. It is now 64GB, but I don't think its performance has changed. Hypothetical peak bandwidth is 41.8 GB/s, per Intel ARK.
The E5-2660v3 has 256GB. Hypothetically each processor's peak bandwidth is 68 GB/s, per Intel ARK, but in practice inference performance is only slightly better on two processors than on one. I suspect there is an interprocessor channel which is saturating, which is why I fiddled with NUMA settings, trying to improve upon it, with limited success.
Quite welcome! I don't usually do pure-CPU inference with smaller models since I have some decent GPUs now, but I quickly ran some tests on the E5-2660v3 just now:
Gemma3-270M: 90 tokens/second
Phi-4 (14B): 6.0 tokens/second
Qwen3-14B: 5.8 tokens/second
Qwen3-4B: 15.2 tokens/second
Qwen3-8B: 10.4 tokens/second
Tulu3-8B: 10.5 tokens/second
Tiger-Gemma-12B-v3: 6.4 tokens/second
Again, all are Q4_K_M.
pmttyji@reddit (OP)
Thanks again!
xxPoLyGLoTxx@reddit
For those models, does all of it fit in the available memory? Also, do you have any GPU at all?
ttkciar@reddit
Yes. I constrain and/or quantize context as needed to make sure it all fits in memory. Hitting swap at all tanks performance.
Yes, but for these tests I did not use any GPU at all. OP was interested in pure-CPU inference.
I have a 32GB MI50, a 32GB MI60, and a 16GB V340, all in different systems. The MI50 normally hosts Phi-4-25B, the MI60 normally hosts Big-Tiger-Gemma-27B-v3, and the V340 gets switched around between different smaller models a lot.
xxPoLyGLoTxx@reddit
For the systems with AMD MI50 and whatnot, how much ram do you have in those systems? I’d be curious of your speeds with large MoE models where the model still fits in RAM + VRAM.
Successful-Arm-3967@reddit
Epyc 9115 & 12 x DDR5 4800 here.
gpt-oss-120b 32-35 t/s
gpt-oss-20b \~80 t/s
Probably still throttling on the CPU.
I use the NEO IQ4_NL quant, which for some reason is much faster on CPU, and I like its responses more than the unsloth quants.
slavik-dev@reddit
Running ggml-org/gpt-oss-120b-GGUF
- Intel Xeon 3425 (12 cores)
- DDR5 4800 * 8 channels (not sure if I'm getting max memory speed)
- prompt eval time: 43.03 tokens per second
- eval time: 15.56 tokens per second
pmttyji@reddit (OP)
Thanks. Total RAM & bandwidth?
slavik-dev@reddit
512GB (8 * 64GB)
Theoretically I should get 307 GB/s bandwidth, but when I run Intel mlc, it reports ~190GB/s
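For anyone checking: the 307 GB/s figure matches the usual channels × MT/s × 8 bytes estimate (a sketch; mlc reporting well under the theoretical peak is normal on Xeon platforms):

```shell
# 8 channels * 4800 MT/s * 8 bytes = 307200 MB/s (~307 GB/s theoretical);
# Intel mlc measuring ~190 GB/s (~62% of peak) is within the usual range.
echo $((8 * 4800 * 8))
```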
pmttyji@reddit (OP)
4800's bandwidth is a bit slow compared to the 5600 or 6400 series.
You could try overclocking.
Thanks for the details.
pmttyji@reddit (OP)
Thanks. How much total RAM do you have, and how much total bandwidth?
Have you tried any other models? Many people recommend the MXFP4 quant for GPT-OSS models.
Successful-Arm-3967@reddit
I tried the ggml-org and unsloth F16 quants, which as I understand are MXFP4, as well as a few other unsloth quants, but all of them run at only 18-20 t/s.
No idea why gpt-oss is so fast with DavidAU's IQ4_NL. I didn't notice that speed boost with other models. https://www.reddit.com/r/LocalLLaMA/comments/1ndx2tq/gptoss_120b_on_cpu_is_50_faster_with_iq4_nl/
384GB total, and GPT says its theoretical bandwidth is 460.8 GB/s. But I didn't notice literally any performance boost beyond 8 RAM sticks with that CPU.
pmttyji@reddit (OP)
Sorry, I was talking specifically about the GPT-OSS-20B model, which my 8GB VRAM can handle. The one below is literally what gave me better t/s:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF
A few people in your link mentioned ik_llama.cpp. Unfortunately my laptop doesn't have AVX-512 support (which is usually great for ik_llama's optimizations).
Thanks for all the other details.
Successful-Arm-3967@reddit
I use llama.cpp, not ik_llama, and it is still faster. There is also a 20B version: https://huggingface.co/DavidAU/Openai_gpt-oss-20b-NEO-GGUF
pmttyji@reddit (OP)
OK, I'll try that one when I get a chance. I got 40 t/s with ggml's GGUF (default context) with just 8GB VRAM + 32GB RAM.
xxPoLyGLoTxx@reddit
What I’m wondering about is using an old server with 256GB-512GB DDR4, such as a Xeon server, but then also adding a new Nvidia GPU (e.g. a 5090). I wonder what the speed would be for MoE models where all the active experts fit in VRAM and the rest of the model fits in the DDR4 RAM.
Anyone have any info on that?
pmttyji@reddit (OP)
It's possible. We have an expert here; check his comment history for more info.
Njee_@reddit
It does make a difference, especially for prompt processing.
This is gpt-oss-120b on a pretty bulky 64-core Epyc with 2400 MHz DDR4 RAM.
CPU only
prompt eval time = 12053.37 ms / 1459 tokens ( 8.26 ms per token, 121.05 tokens per second)
eval time = 142469.75 ms / 2073 tokens ( 68.73 ms per token, 14.55 tokens per second)
total time = 154523.12 ms / 3532 tokens
With the experts on a (slow) GPU it's about 1.7x the speed, using only
9712MiB / 12288MiB on an NVIDIA GeForce RTX 3060
prompt eval time = 7498.49 ms / 1552 tokens ( 4.83 ms per token, 206.98 tokens per second)
eval time = 84381.14 ms / 2097 tokens ( 40.24 ms per token, 24.85 tokens per second)
total time = 91879.63 ms / 3649 tokens
xxPoLyGLoTxx@reddit
Thanks for sharing!
StardockEngineer@reddit
With the price of RAM and server parts, just get a Strix Halo or DGX. 128GB of consumer-level DDR5 RAM is almost $2k alone.
And that machine will be far faster than your CPU only machine.
You’re going to be limited by compute. You think it’s just memory speed, but it’s not. Prompt processing (half of the work) is all compute. Token generation is only memory-bound if the compute is present, and it is not with a CPU.
All that memory and you won’t be able to reasonably run any large models, even MoE.
pmttyji@reddit (OP)
Not interested in unified setups.
That's my laptop, actually, and I can't upgrade it anymore.
I agree with the later part of your comment. I just replied to another comment, which should clarify the purpose of this thread.
Thanks
Lissanro@reddit
With today's models I feel GPU+CPU is the best compromise. In my case, I have four 3090 that can hold full 128K context cache, common expert tensors and some full layers when running K2 / DeepSeek 671B IQ4 quants (or alternatively 96 GB VRAM can hold 256K cache without full layers for Q4_X quant of K2 Thinking), and I get around 100-150 tokens/s prompt processing.
With just relying on RAM (CPU-only inference), I would be getting around 3 times slower prompt processing and over 2 times slower inference (like 3 tokens/s instead of 8 tokens/s, given EPYC 7763 CPU). I have 8-channel 1 TB 3200 MHz RAM.
pmttyji@reddit (OP)
I remember your config & comments :)
Frankly, the point of this thread is to get the highest t/s possible with CPU-only inference; that way I'll also pick up other llama.cpp (or ik_llama) optimizations from the comments here. New things (parameters, optimizations, etc.) usually appear after a while. For example, -ncmoe came later (previously -ot with a regex was the only way, which is tough for newbies like me).
Of course I'm getting GPU(s) .... (a 32GB one first, and a 96GB one later once prices come down). I definitely need those for image/video generation, which is my prime requirement after building the PC.
My plan is to build a good setup for hybrid (CPU+GPU) inference. I even posted a thread on this :) please check it out. I'd love your reply since you're one of the few folks in this sub who runs LLMs with 1TB RAM. What would you do in my case? Please share here or there. Thanks in advance.
https://www.reddit.com/r/LocalLLaMA/comments/1ov7idh/ai_llm_workstation_setup_run_up_to_100b_models/
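For readers new to the flags mentioned above, a minimal sketch of the two styles of expert offload in llama.cpp (model path, layer count, and the exact tensor-name regex are placeholders; check your build's --help and the model's tensor names):

```shell
# Older way: push MoE expert tensors to CPU via a tensor-override regex,
# while -ngl 99 keeps everything else on the GPU:
llama-server -m model.gguf -ngl 99 -ot "\.ffn_.*_exps\.=CPU"

# Newer, simpler way: keep the experts of the first N layers on CPU:
llama-server -m model.gguf -ngl 99 --n-cpu-moe 30
```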
gofiend@reddit
bulk RAM? In this economy?
pmttyji@reddit (OP)
:D Sorry, this thread sat in my drafts for a long time.
They ruined my build plan :(
_realpaul@reddit
ASRock has some funky mainboards that take laptop SO-DIMMs; maybe you can scrounge together some RAM from decommissioned laptops. But it will be DDR4, not DDR5, so YMMV.
starkruzr@reddit
God, ASRock loves to build the weirdest shit. bless them tbh.
_realpaul@reddit
Yeah, they embody the spirit of "we did it and didn't ask if we should have" 😂
Chimpuat@reddit
I am temporarily running a couple of Qwen3 7B models on my R730 server: dual E5-2698v4 CPUs and 512GB of DDR4 LRDIMMs (bought in the pre-price-hike days for 1/4 of what it would cost today).
They average about 12-15 tokens/sec, sometimes a bit more.
I ran a DeepSeek R1 70B variant that did about 3 tokens/sec. It does over 30 on a 16GB Nvidia T4.
The Qwens are running in VMs with between 8 and 12 virtual cores and 64GB RAM.
I'm just learning this stuff. I just know some models get along perfectly fine on just CPU… and some struggle. 🙂
pmttyji@reddit (OP)
Maybe it's worth also trying Llama-3_3-Nemotron-Super-49B (a derivative of Llama-3.3-70B-Instruct), since it could give you better t/s due to its smaller size. NVIDIA has more Nemotron models in their collection.
Icy_Resolution8390@reddit
Tomorrow I'll run some speed tests on my models and share them with my specs. I have more than 300 models stored for benchmarking.
pmttyji@reddit (OP)
Please share. Thanks
Icy_Resolution8390@reddit
I hope the AI companies can adapt the MoE architecture to what the average enthusiast can afford - MoE models that run on a $300-400 GPU at most - because we also have to keep buying motherboards with more RAM, and DDR modules aren't cheap either. People need to eat too; they can't live on hobby toys alone. The average individual can afford one card of this type per year - that's the limit - and in exchange we ask for models with double the data and more intelligence, worth spending the money on. Offline AI is the new hobby of this generation: image generation, 3D objects, conversation, programming - every useful thing you'd want to own and run without depending on an internet connection. That's really why I pay for this RTX and keep this industry alive: the ability to do all this offline, not depending on anybody. NVIDIA and OpenAI know this better than we think.
Icy_Resolution8390@reddit
I prefer quality over speed, but some tasks need speed, and that's why small MoE models that can run on old server motherboards with lots of RAM and one GPU are the key. I think dense models will never return to the market, because the companies must keep shipping better models that run on hardware an enthusiast can afford - I figure 300-400 euros maximum - since they have to calculate what the average user among millions is willing to pay to have these models offline. It's good business, and addictive for enthusiasts: we hoard information like Diogenes syndrome and want it all offline, "in our hands".
Icy_Resolution8390@reddit
The open-source community is wrong to cast the big companies as the villains of the film. Nobody here is good or evil; everyone must collaborate. They do the work, and we should pay for the results. I'd rather have closed models that are intelligent and very useful than fully open models that don't run as well as what the private companies make. The companies should invest in open-source development rather than just consume it; open source should be at their backs, pushing the private companies to invest and develop, and then everyone wins: we enjoy their products offline, we buy their hardware, and we use their online AI portals for the cases where online AI is genuinely useful. But never forget the end goal that sustains this whole business: providing good offline AI for enthusiasts like us, the people who buy expensive RTX cards. That's why they give us MoE technology (it's not a gift) - all of us are paying for it. Now they must keep developing the software engineering to run ever-bigger models on modest consumer hardware that enthusiasts can afford.
Terminator857@reddit
I wonder what the t/s is for systems with lots of memory channels. Is 8 channels per CPU the max?
pmttyji@reddit (OP)
Nope, 12-channel platforms exist too. I've even heard of 16 and 24.
Terminator857@reddit
I doubt it. The numbers are fudged when there is more than one socket.
Icy_Resolution8390@reddit
It's only fair that they ask money for GPUs, because training these models has costs: energy, data collection, etc. Users should support NVIDIA, OpenAI, and this whole business, because if they want to make money they must offer a product worth buying, and all of us want a GPT-5-class model offline in our houses. That's the race: they develop ever bigger and more capable software, we buy the cards from NVIDIA, and that money also flows back to OpenAI - a synergistic business where everyone collaborates and everyone wins. They know there are millions of users who want this technology offline; we can use it online too, but we want to own it. It's also good for the whole computer industry - they sell HDDs and SSDs to store these models, and so on - it moves an industry and makes it bigger and bigger, and the results come back to all of us. I hope this bubble never bursts and these companies make trillions while still offering what we want: a piece of this magical technology, trained up to the latest date, in our hands, without depending on the internet.
Icy_Resolution8390@reddit
It's a collaboration between AI enthusiasts and NVIDIA and the big AI companies: they give us more capable models every time, and we send them money for their "free open-source models", you understand? It's a well-made, win-win business for all parties. They need to reinvest some of that money into developing more capabilities, and the companies can see that this business of offline, disconnected AI owned by the user is unlimited: we don't have the hardware to train models, but we want models updated with last year's information and more intelligent every time. They do the work, and we send them our money to have this technology offline, which is what we want - not depending on an internet connection to have artificial intelligence in our houses. There's a large market of enthusiasts who see buying GPUs as a good investment for having this magical technology in our hands, with the latest data.
Icy_Resolution8390@reddit
We'll keep buying more hardware as parameter counts grow - more GPUs, and old motherboards with lots of RAM. But the companies know that if they ship huge dense models, users won't buy NVIDIA cards (that's the business: OpenAI collaborating with NVIDIA). So they release open models that roughly double the parameters each time while keeping the MoE active size small enough to fit in a single consumer GPU, while users hunt for motherboards with lots of DDR slots on the second-hand market (eBay, etc.). The next step could be selling 24GB VRAM cards for 200B models with ~20B active experts, because users always want to run bigger models with higher-quality data of every type. In short: double the model size, double the expert size, double the VRAM each time - at a price an enthusiast will pay for this technology, 300-400 euros at most.
Pentium95@reddit
I think the main reason CPU-only inference is not popular is that there are mainly two types of local-LLM users: "I've got a gaming rig with 16/24GB of VRAM, what can I run?" (including offloading MoE experts to the CPU), and "I've got $10k, how many RTX 6000 Pros should I buy?"
Also, CPU-only inference needs at least 6-8 channels of DDR5 RAM, which needs a proper CPU and motherboard, usually server-grade hardware. With dual-channel (or even quad-channel) memory you're not gonna go far, unless you go for really sparse MoEs like GPT-OSS.
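The channel-count point can be made concrete with the same peak-bandwidth arithmetic (a sketch, assuming 8 bytes per channel per transfer; real measured bandwidth lands below these peaks):

```shell
# Theoretical peak GB/s for a few platforms: channels * MT/s * 8 bytes / 1000.
awk 'BEGIN {
  printf "dual-channel DDR5-6000: %d GB/s\n", 2  * 6000 * 8 / 1000
  printf "8-channel DDR5-4800:    %d GB/s\n", 8  * 4800 * 8 / 1000
  printf "12-channel DDR5-6000:   %d GB/s\n", 12 * 6000 * 8 / 1000
}'
```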
pmttyji@reddit (OP)
That's the plan.
Icy_Resolution8390@reddit
Today I tested Qwen3-Next-80B-A3B. It's a bit slower than gpt-oss-120b, but I think this Qwen model, given good prompts, is better for coding than gpt-oss in some areas. gpt-oss-120b is also a good model; the two are very similar.
pmttyji@reddit (OP)
A quick question: is the title of the thread below a typo or not? Please share your stats for that model with your config. I asked the same question there.
https://www.reddit.com/r/LocalLLM/comments/1p8xlnw/run_qwen3next_locally_guide_30gb_ram/
Icy_Resolution8390@reddit
If they can do this, users can buy second-hand dual-Xeon motherboards with 256GB and move to the next step of doubled parameters. If you look at it, this race between two companies has been about validating their technologies with end users: each release doubles the parameters while targeting consumer hardware, and end users must make the effort to buy ever more powerful hardware to run it.
Icy_Resolution8390@reddit
With 128GB of RAM you can run MoE models at low or decent speed, from 4 to 10 tk/sec.
pmttyji@reddit (OP)
Another comment clarified yours. I'm going for a server CPU precisely for the extra memory channels.
Icy_Resolution8390@reddit
You should also have a GPU to speed up the MoE experts. You can combine an old server, 128GB of RAM, and an RTX 3060 to run these models. With an Nvidia GPU the models run faster, and for serious work a GPU is needed.
pmttyji@reddit (OP)
Of course I'm gonna get one - probably a 5090, and a bigger one laterrr.
I'm trying to build a server for better hybrid CPU+GPU performance.
Icy_Resolution8390@reddit
We should pray that OpenAI and Alibaba keep competing to ship more models. I think the next step is these companies fighting to release the next size tier, up to 256GB of RAM - it could be a 200B model with 100 experts of 5 or 10B each.
shockwaverc13@reddit
there's not much to talk about when it comes to CPU: your only real choice is llama.cpp, you can only run up to ~8B (activated) parameters, and performance will never get past 10 t/s unless you go sub-1B.
czktcx@reddit
RAM bandwidth does not scale with capacity - why do you think you can multiply performance by 4?
Consumer CPUs only support 2 channels, and 256GB is the max.
If you really want a pure-CPU machine, pick a server CPU.
pmttyji@reddit (OP)
That was just my naive rough calculation.
I'm going for a server platform like Epyc with 8 or 12 channels. I used 128GB in the above calculation because of the recent RAM price hike :( Initially I planned to buy 320-512GB of RAM, but right now that's a huge expense *sigh*