New llama.cpp options make MoE offloading trivial: `--n-cpu-moe`

[-]

jacek2023@reddit

My name was mentioned ;) so I tested it today in the morning with GLM

llama-server -ts 18/17/18 -ngl 99 -m ~/models/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf --n-cpu-moe 2 --jinja --host 0.0.0.0

I am getting over 45 t/s on 3x3090

[-]

TacGibs@reddit

Would love to know how much t/s you can get on 2 3090 !

[-]

jacek2023@reddit

It's easy: you just need to use a lower quant (smaller file).
for the same file, you’d need to offload the difference to the CPU, so you need fast CPU/RAM

[-]

TacGibs@reddit

I'm not talking about a lower quant, just what kind of performance you can get using a Q4 with 2 3090 :)

Going lower than Q4 with only 12B active parameters isn't something goof quality wise !

[-]

jacek2023@reddit

As you can see in this discussion another person has an opposite opinion :)

I can test 2x3090 speed for you but as I said, it will be affected by my slow DDR4 RAM on x399

[-]

TacGibs@reddit

Please do it !

I think a lot of people got 2 3090 with DDR4 :)

[-]

jacek2023@reddit

for two 3090s, the magic command is:

CUDA_VISIBLE_DEVICES=0,1 llama-server -ts 15/8 -ngl 99 -m ~/models/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf --n-cpu-moe 18 --jinja --host 0.0.0.0

the memory looks like that:

load_tensors: offloaded 48/48 layers to GPU

load_tensors: CUDA0 model buffer size = 21625.63 MiB

load_tensors: CUDA1 model buffer size = 21586.17 MiB

load_tensors: CPU_Mapped model buffer size = 25527.93 MiB

llama_context: CUDA_Host output buffer size = 0.58 MiB

llama_kv_cache_unified: CUDA0 KV buffer size = 512.00 MiB

llama_kv_cache_unified: CUDA1 KV buffer size = 224.00 MiB

llama_kv_cache_unified: size = 736.00 MiB ( 4096 cells, 46 layers, 1/1 seqs), K (f16): 368.00 MiB, V (f16): 368.00 MiB

llama_context: CUDA0 compute buffer size = 862.76 MiB

llama_context: CUDA1 compute buffer size = 852.01 MiB

llama_context: CUDA_Host compute buffer size = 20.01 MiB

and the speed over 20 t/s

[-]

TacGibs@reddit

Pretty good speed ! Thanks a lot for your time 👍

[-]

jacek2023@reddit

If you can't fit model into your GPUs try experimenting with -ts option

[-]

ivanrdn@reddit

Sorry for necroposting, but why do you suggest an uneven tensor split for dual 3090? More than that, how the heck does it work if the 1,1 split doesn't

[-]

jacek2023@reddit

Because --n-cpu-moe, but for some Nemotron it was broken even without it

[-]

ivanrdn@reddit

Hmmm, so I had a GLM-4.5-air setup with even layer split and cpu offload, and it works worse than your setup.

I launched with --n-gpu-layers 32 --split-mode layer --tensor-split 1,1
So basically 16 cpu layers, irrespective of their "denseness" on CPU,

and your command
--tensor-split 15,8 -ngl 99 --n-cpu-moe 18
have 18 excluding the dense on cpu.

That i kinda get, but why the uneven split works - that is still a mystery.
Could you please tell me how you ended up with that split, what was the logic?
Or is it purely empirical?

[-]

jacek2023@reddit

I start with no options then I modify ts to fill the VRAM, I don't understand what is your use case, maybe post your setup and full command line

[-]

ivanrdn@reddit

x99 xeon, 128Gb DDR4, 2 x 3090 on Gen 3.

My usecase is coding, I am hitting some kind of bottleneck on longer contexts (8K+), the t/s drops from 15 to 4, the prefill speed also drops but not as much as generation speed.

The reason I am asking is I have a spare A4000 16Gb but it will have to sit on x8 slot. And I need to figure out how to split the model. GLM-4.5-Air IQ4_XS quant won't fit into 24+24+16Gb with 64K kv cache, even with 1 slot/ parallel 1. So I'm still gonna have to offload something to CPU.

This is the command.

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
CUDA_VISIBLE_DEVICES=0,1
/mnt/nvme/LLM/llama.cpp/build/bin/llama-server \
--model GLM-4.5-Air-IQ4_XS.gguf --alias z-ai/glm-4.5-air \
--host 0.0.0.0   --port 11434 \
--ctx-size 65536   --n-gpu-layers 32   --split-mode layer   --tensor-split 1,1 \
--cache-type-k q8_0   --cache-type-v q8_0   --batch-size 1024   --threads 20 -kvu  --parallel 1   --flash-attn on \
--jinja --api_key test

[-]

jacek2023@reddit

If you want to speed up don't use ngl with moe, use --n-cpu-moe instead, ngl is now max by default Check llama.cpp log ouput to see is your VRAM usage maximized

[-]

ivanrdn@reddit

Hmmm, I'll try that, thanks

[-]

McSendo@reddit

I can also confirm this, 20 tok/s 2x3090, 64gb ddr4 3600 on ancient AM4 X370 chipset.

[-]

McSendo@reddit

Some more stats:
prompt eval time = 161683.19 ms / 16568 tokens ( 9.76 ms per token, 102.47 tokens per second)

eval time = 104397.18 ms / 1553 tokens ( 67.22 ms per token, 14.88 tokens per second)

total time = 266080.38 ms / 18121 tokens

[-]

serige@reddit

Can you share your command? I am getting like 8t/s with 16k ctx. My build has 7950x, 256gb ddr5 5600, 3x 3090, I must have done something wrong.

[-]

McSendo@reddit

LLAMA_SET_ROWS=1 llama-server -m GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf --n-cpu-moe 20 -c 30000 --n-gpu-layers 999 --temp 0.6 -fa --jinja --host 0.0.0.0 --port 1234 -a glm_air --no-context-shift -ts 15,8 --no-mmap --swa-full --reasoning-format none

[-]

Educational_Sun_8813@reddit

with 2x3090 and ddr3 i'm getting 15t/s

[-]

RedKnightRG@reddit

Thanks for this man, nice to see some setups from other folks. With max ctx-size, flash-attention, and q8 KV cache quantization I have to keep 27 layers on CPU:

--ctx-size 131072 \

--flash-attn \

--n-gpu-layers 99 \

--tensor-split 32,14 \

--n-cpu-moe 27 \

--cache-type-k q8_0 \

--cache-type-v q8_0 \

--jinja

I'm seeing about 8 t/s with the above setup on a machine with a Ryzen 9950x and 128GB of DDR5 running at 6000mt/s. I'm guessing you're seeing similar scaling if you turn up the context?

[-]

csixtay@reddit

Wow this is fantastic news.

[-]

gofiend@reddit

Am I right in thinking that your performance would be no better with a typical desktop DDR5 motherboard? Quad channel DDR4 @ 3200 Mt/s vs . dual channel DDR5 @ 6400 Mt/s?

[-]

jacek2023@reddit

The reason I use the x399 is its 4 PCIe slots and open frame (I replaced a single 3090 with a single 5070 on my i7-13700 DDR5 desktop)

[-]

gofiend@reddit

Gotcha I've been debating 4x4 splitting PCI with an AM5 vs. picking up an older threadripper setup. What you have is probably a lot easier to setup and keep running ...

[-]

TacGibs@reddit

PCIe speed doesn't really matter for inference once the model is loaded, but it's a totally different story for fine-tuning !

[-]

gofiend@reddit

Yeah if I'm picking up something to run 4 GPUs ... probably good to use it to run trial finetunes etc. vs. spending $2/hr in the cloud

[-]

jacek2023@reddit

some photos https://www.reddit.com/r/LocalLLaMA/comments/1kooyfx/llamacpp_benchmarks_on_72gb_vram_setup_2x_3090_2x/

[-]

Paradigmind@reddit

I would personally prefer a higher quant an lower speeds.

[-]

jacek2023@reddit

But the question was about speed in the two 3090s. It depends on your CPU/RAM speed if you offload big part of the model.

[-]

Green-Ad-3964@reddit

I guess we'll have huge advantages with ddr6 and socamm models, but they are still far away

[-]

Educational_Sun_8813@reddit

15.7 t/s with ddr3

[-]

Tx3hc78@reddit

Wait isn't this worse then using `-ot` to offload first few layers to GPU instead of CPU?

> -ncmoe N keep the Mixture of Experts (MoE) weights of the first N layers in the CPU

Wouldn't it be better if it was last N layers?

[-]

jacek2023@reddit

could you test both cases?

[-]

Tx3hc78@reddit

Tried testing some but my setup is shitty so someone can provide better results.
I have 9070 XT and 64GB of RAM.

```sh
-ot "blk.([0-9]).ffn.*shexp.=Vulkan0,.ffn.*shexp.*=Vulkan0,ffn_.*_exps.=CPU"

```

Only this provides better results while using half of GPU then `--n-cpu-moe 30` which uses whole GPU. 6.29 vs 4.11

Should try with this, probably gonna get even better results:

```sh
-ot "blk.([0-9]).ffn.*shexp.=Vulkan0,.ffn.*shexp.*=Vulkan0" --n-cpu-moe 30
```

Will try and report back.

[-]

jacek2023@reddit

I don't really understand why you compare 10 with 30, please explain, maybe I am missing something (GLM has 47 layers)

[-]

Tx3hc78@reddit

Turns out I'm smooth brained. Removed comments to avoid causing more confusion.

[-]

Tx3hc78@reddit

Not getting much better results then first -ot command but still better then just using --n-cpu-moe.

Can anybody else test?

[-]

LagOps91@reddit

why not have a slightly smaller quant and offload nothing to cput?

[-]

jacek2023@reddit

Because smaller quant means worse quality.

My result shows that I should use Q5 or Q6, but because files are huge it takes both time and disk space, so I must explore slowly.

[-]

LagOps91@reddit

you could just use Q4_K_M or something, hardly any different. you don't need to drop to Q3.

Q5/Q6 for a model of this size should hardly make a difference.

[-]

Whatforit1@reddit

Depends on the use case IMO. For creative writing/general chat, Q4 is typically fine. If you're using it for code gen, the loss of precision can lead to malformed/invalid syntax

[-]

skrshawk@reddit

In the case of Qwen 235B using Unsloth Q3 I find sufficient since the gates that need higher quants to avoid quality degradation are already there.

Also if for general/writing purposes I find using 8-bit KV cache to be fine but I would not want to do that for code for the same reason, syntax will break.

[-]

CheatCodesOfLife@reddit

Weirdly, I disagree with this. Code gen seems less affected than creative writing. It's more subtle but the prose is significantly worse with smaller quants.

I also noticed you get a much larger speed boost coding vs writing (more acceptance from the draft model).

Note: This is with R1 and Comamnd-A, I haven't compared glm4.5 or Qwen3 yet.

[-]

LagOps91@reddit

that's true - but in this case Q5 and Q6 don't help either. And in the post we are talking to going from Q4 XL to Q4 M... there really hardly is any difference there. i see no reason not to do it if it helps me avoid offloading to ram.

[-]

Paradigmind@reddit

People were saying that MoE is more prone to degradation from lower quants.

[-]

LagOps91@reddit

really? the data doesn't seem to support this. especially for models with shared experts you can simply quant those at higher bits while lowering overall size.

[-]

Paradigmind@reddit

Maybe I mixed something up.

[-]

CheatCodesOfLife@reddit

You didn't mix it up. People were saying this. But from what I could tell, it was an assumption (eg. Mixtral being degraded as much as a 7b model vs llama-2-70b).

It doesn't seem to hold up though.

[-]

Paradigmind@reddit

Ah okay thanks for clarifying.

[-]

jacek2023@reddit

Do you have some specific test results explaining why there is no big difference between Q4 and Q6 for bigger models?

[-]

LagOps91@reddit

yes. the most testing has been done for the large qwen moe and particularly r1. here are some results: https://www.reddit.com/r/LocalLLaMA/comments/1lz1s8x/some_small_ppl_benchmarks_on_deepseek_r1_0528/

as you can see, Q4 quants are just barely (0.5%-1.5%) worse than the Q8 quant. there really is no point at all in sacreficing speed to get a tiny bit of quality.

now, GLM-4.5 air is a smaller model and it's not yet known how the quant quality looks like, but i am personally running dense 32b models are Q4 and that is already entirely fine. i can't imagine it being any worse for GLM-4.5 air.

[-]

jacek2023@reddit

Thanks for reminding me that I must explore perplexity more :)

As for differences you can find that a very unpopular llama scout is better than qwen 32B because qwen has no as much knowledge about western culture and maybe you need that in your prompt. That's why I would like to see Mistral MoE. But maybe the OpenAI model will be released soon?

Largest model I run is 235B and I use Q3

[-]

LagOps91@reddit

different models have different strengths, that's true. I am also curious if mistral will also release MoE models in the future.

as for perplexity, it's a decent enough proxy for quality, at least if the perplexity drop is very low. for R1 in particular i have heard that even the Q2 quants offer high quality in practice and are sometimes even preferred as they run faster due to the smaller memory footprint (and thus smaller reads).

i can't confirm any of that tho, since i can't run the model on my setup. but as i said, Q4 was perfectly fine for me when using dense 32b models. it makes the most out of my hardware as smaller models at a higher quant are typically worse.

[-]

jacek2023@reddit

I read paper from Nvidia that small models are enough for agents, by small they mean like 4-12B. That's another topic I need to explore - to run a swarm of models on my computer :)

[-]

jonasaba@reddit

How is I am to use this for Qwen 30B A3B?

[-]

MrTooWrong@reddit

did you found an answer?

[-]

jonasaba@reddit

Yes. You can use `-ngl 49` and just pass `--n-cpu-moe 20`. Also add `-fa` and `-ctk q8_0 -ctv q8_0`.

Larger the number, less seem to be GPU load. The performance does not seem to drop a lot, not as much as it does if I just reduce `-ngl`.

[-]

MrTooWrong@reddit

Thaaaaank you! I'll give a try tonight

[-]

Infamous_Jaguar_2151@reddit

But this still isn’t as good as ik-llama for loading just the moe active params into the gpus and offloading the rest into ram right? I was just about the start with ik llama for this reason, I want to try runningthe new glm4.5 like this as I have 768 gb ram and 2 rtx 6000

[-]

Marksta@reddit

No, this is just a quality of life option they added to llama.cpp. It doesn't impact how you run MoE models besides you write and edit less lines of ot regex patterns.

Yes, you should probably still use ik_llama.cpp if you want to use SOTA quants and get better CPU performance. Use either if you're all in GPU but if you're dumping 200gb+ of moe experts onto CPU, 100% use ik. Also those quants are really amazing, ~Q4s that are on par with Q8. Literally need half the half hardware to run.

[-]

Infamous_Jaguar_2151@reddit

Hey, thanks for the clarification! Just to make sure I’m understanding this right, here’s my situation:

I’ve got a workstation with 2×96 GB RTX 6000 GPUs (192 GB VRAM total) and 768 GB RAM (on an EPYC CPU).
My plan is to run huge MoE models like DeepSeek R1 or GLM 4.5 locally, aiming for high accuracy and long context windows.
My understanding is that for these models, only the “active” parameters (i.e., the selected experts per inference step—maybe 30–40B params) need to be in VRAM for max speed, and the rest can be offloaded to RAM/CPU.

My question is: Given my hardware and goals, do you think mainline llama.cpp (with the new --cpu-moe or --n-cpu-moe flags) is now just as effective as ik_llama.cpp for this hybrid setup? Or does ik_llama.cpp still give me a real advantage for handling massive MoE models with heavy CPU offload?

Any practical advice for getting the best balance of performance and reliability here?

[-]

Marksta@reddit

So to be more clear, the new flags are nothing new you couldn't have done before. (But very happy they added them and hope ik_llama.cpp mimics it soon too for the simplicity it adds) So wouldn't really focus on it.

So for your setup, take note you're pretty close to running almost all in VRAM for even big MoE models depending on what model we're talking about like the brand new 120B from openAI can all get in there. So also think about vLLM and tp=2, using both your RTX 6000s at 'full speed' in parallel instead of sequentially. But that's a whole different beast of setup and documentation to flip through.

For ik_llama.cpp vs. llama.cpp argument, 1000% EPYC CPU and going to off load to CPU, it's not question you want to be on ik_llama.cpp for that. The speed up is 2-3x on token generation. Flip through Ubergarm's model list and compare it to Unsloth's releases. They're seriously packing Q8 intelligence into Q4, which with the method they're using currently only runs on ik_llama.cpp not main line. While with your beast setup you could really fit the Q8, it matters even more since with the IQ4_KS_R4 368GiB R1 vs. the ~666GiB Q8, you can get that fancy Q4 at least 30+% of the weights into your GPUs too. The speed up there will be massive. For most of us, we just have enough GPU VRAM to barely fit in the KV cache, the dense layers, and maybe 1 set of experts and we get 10 tokens/second TG. You, you're going to get like, 25 sets of the experts if you go with these compact quants. I'm thinking you see maybe 20 tokens/second TG on R1, maybe even higher.

only the “active” parameters need to be in VRAM for max speed

The architecture is very usable and good to run like this, but it's still more ideal if you had 1TB of VRAM. That's what the big business datacenters are doing and how they provide their huge models at blazing 50-100 tokens/second for you on their services. It's just we're very happy at 5-10 t/s at all with our $ optimized setup putting the dense layers and cache to GPU. The experts are 'active' too, but not for every pass of the model. So the always active (dense) layers in GPU is definitely key (-ngl 99) and then the CPU taking on the extra alternating use of randomly selected experts gets us up and running.

Any practical advice for getting the best balance of performance and reliability here?

Reliability as far as the setup running isn't really problematic once you dial something in that works. You can use llama-sweep-bench on ik_llama.cpp to test and I don't usually use it for production use, but when dialing settings in set --no-mmap if you're testing at out-of-memory's edge. This will fail your test run way quicker. Mmap is good for a start up speed-up, but it also allows you to go 'over' your limit and then your performance drops hard or go out of memory later on. But yeah, once you figure out how many experts can go into your GPU RAM and run for a few minutes of llama-sweep-bench, there's no more variables that'll change and mess things up. Setup should be rock solid and you can bring those settings over to llama-server and use it for work or whatever.

So just go download ik_llama.cpp from the github, build it, and learn from Ubergarm's model cards recommended commands to run to get started and he comments on here too. Great guy, he's working on GLM 4.5 right now too. But you can get started with an Unsloth release, they're great too but just focused on llama.cpp main line compatible quants.

[-]

waiting_for_zban@reddit

I know ik_llama is doing great work, but it's still gguf quants, which sometimes end up a bit unreliable (the method calling issues with Qwen3). How does it compare to ktransformers, where you can use INT8 models in this case?

[-]

Infamous_Jaguar_2151@reddit

Think I watched your k-transformers video on yt? Great question and interested to hear the responses too. I think the issue with k-transformers is how finicky it can be to get running.

[-]

VoidAlchemy@reddit

Really appreciate you spreading the good word! (i'm ubergarm)!! Finding this gem brought a smile to my face! I'm currently updating perplexity graphs for my https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF and interestingly the larger version is misbehaving perplexity-wise haha...

[-]

Infamous_Jaguar_2151@reddit

That’s awesome 🙌🏻

[-]

VoidAlchemy@reddit

Yeah I have tried openwebui a little bit but ended up just vibe coding a simple python async streaming client. I had been using litellm but wanted something even more simple and had a hard time understanding their docs for some reason.

I call it `dchat` as it was originally for deepseek and counts incoming tokens on the client side to give a live refreshing estimate of token generation tok/sec with a simple status bar from enlighten.

Finally it has primp there too for scraping http to markdown to inject a URL into the prompt. Otherwise very simple and keeps track of a chat thread and works with any llama-server /chat/completions endpoint. the requirements.txt has: aiohttp enlighten deepseek-tokenizer primp

[-]

Infamous_Jaguar_2151@reddit

That’s cool I’ll try Kani and gradio, indeed the minimalist approach and flexibility

[-]

Wooden-Potential2226@reddit

Hugely informative thx!

[-]

Muted-Celebration-47@reddit

Yeah, I found this way is easier than find the best -ot by yourself. This --n-cpu-moe option is perfect fit with GLM4.5-Air gguf case.

[-]

DistanceSolar1449@reddit

I tried with a dual GPU setup, and --n-cpu-moe consistently puts only 500mb of tensors on one of my GPUs, which is annoying. Manually setting -ot still works.

[-]

thenomadexplorerlife@reddit

This seems a good enhancement! Just curious and may be a bit off-topic, is there a way to do something similar using two machines? For example, I have a Mac mini 64GB RAM and another linux laptop with 32GB RAM. It would be nice if I can run some layers in Mac GPU and remaining layers in linux laptop. This will allow me to run larger models by combining the RAM of two machines to load the model. New models are becoming bigger and buying a new machine with more RAM is out of budget for me.

[-]

Zyguard7777777@reddit

You can use llama.cpp's RPC feature, https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc

[-]

johnerp@reddit

Oh interesting, didn’t know this was a thing I assume network bandwidth / latency would prevent this. Does it work due to different requirements when handing off been components of an LLM architecture?

[-]

segmond@reddit

it makes it possible to run models you won't be able to run, but network bandwidth/latency is a thing! it's the difference between 0tk/sec and 3tk/sec. Pick one.

[-]

CheatCodesOfLife@reddit

Latency specifically. I was using this to fully offload R1 to GPUs, and found my prompt processing was capped at about 12t/s. Ended up faster to use the CPU + local GPUs.

But network traffic was nowhere near the 2.5gbit link limit.

I hope they optimize this in the future as vllm is fast when running across multiple machines (meaning there's room for optimization).

[-]

DistanceSolar1449@reddit

It’s not optimizable. You cant transfer data in parallel.

Prompt processing has to be machine 1 process layers 1-30, network transfer the kv cache, machine 2 processes layers 31-60, transfers the modified kvcache back

[-]

CheatCodesOfLife@reddit

It’s not optimizable.

It is; running box1[4x3090] box2[2x3090] with vllm, is very fast with either -tp 2 -pp 3, or just -pp 6. Almost no loss in speed compared with box1[6x3090]

Prompt processing has to be machine 1 process layers 1-30, network transfer the kv cache, machine 2 processes layers 31-60, transfers the modified kvcache back, rinse repeat.

Nope, you can use --tensor-split and the -ot regex to keep the KV cache on box1, fill the GPUs on box2 with expert tensors and avoid sending the kv cache over the network.

This is a limitation of the transformers architecture. You can’t fix this.

I can't fix this because I'm not smart enough, but it can be done, big labs setup multiple nodes of 8xH100 to serve 1 model.

[-]

DistanceSolar1449@reddit

Nope, you can use --tensor-split and the -ot regex to keep the KV cache on box1, fill the GPUs on box2 with expert tensors and avoid sending the kv cache over the network.

That’s… not how it works.

First off, llama.cpp automatically stores the kv cache with the compute. So for layers in gpu, the kv cache is in gpu. For layers on cpu, kv cache is in system ram. kv_cache_init() always allocates K & V on the same backend as the layer’s attention weights, so layers on RPC back-ends keep their KV on that remote GPU; layers on the host keep KV in system RAM.

Secondly, there is a kv cache for each layer. KV_cache = (Key + Value) = 2 × num_heads × head_dim × dtype_size. So for something like Qwen3 235b, you get 73.7KB per layer per token. The transformer architecture literally demands you do matmuls to multiply the kv cache for that layer with the attention weights of that layer, so you can’t win- if they’re stored on different devices, then either you transfer the kv cache over, or you transfer the weights over.

[-]

spookperson@reddit

Note for others reading this thread. Last week I started experimenting with using both -ot and RPC. You can use -ot to specify a named RPC buffer. I didn't spend enough time on it to figure out if it actually helps in terms of speed though in my case (as the comments in this thread seem to be confirming). I have been hoping to use a 4090 in a Linux box to speed up MoE models that I can fit using a M1 Ultra 128gb

[-]

CheatCodesOfLife@reddit

Actually, I think you're correct (I'll review more carefully when I have a chance).

On my other point though, vllm is "blazing fast' across 2 machines with 2.5gbit Ethernet. Therefor, I see no reason why:

It’s not optimizable.

Though perhaps it's not about the network layer. I recall reading a comment where someone noticed a huge performance penalty running 2 rpc instances on the same machine.

[-]

JMowery@reddit

I have a question, perhaps a dumb one. How does this work in relation to gpu-layers count? When I load models on llama.cpp to my 4090, I try to squeeze out the highest number possible while maintaining a decent context size.

If I add in this --n-cpu-moe number, how does this work in relation? What takes precedence? What is the optimal number?

I'm still relatively new to all of this, so an ELI5 would be much appreciated!

[-]

henk717@reddit

In the next KoboldCpp we will have --moecpu which is a remake of that PR (Since the launcher for koboldcpp is different).

[-]

arousedsquirel@reddit

It's about llama.ccp not kobold promotion dude. So what about llama.ccp?

[-]

henk717@reddit

I'm not allowed to tell users that we will be implementing this when we are based on llamacpp?

2 people asked me about it today, so I figured i'd let people know what our plans are as far as this PR go since KoboldCpp is based on llamacpp but its not a given that projects implement this feature.

To me its an on topic comment since it relates to this PR and people have been asking. So I don't see why giving official confirmation we will implement this command is a bad thing.

[-]

arousedsquirel@reddit

If your group thinks so, yet it is about llama.cpp, not promoting a derivate.

[-]

relmny@reddit

Will that work with things like:

"\.(4|5|6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9]).ffn_(gate|up|down)_exps.=CPU"

or is that too specific?

[-]

silenceimpaired@reddit

Hopefully future revisions will intelligently offload. I assume some parts of the model are better on GPU. Would be nice if this considered this on a per model basis - perhaps all future models added could have these parts marked and existing ones could be patched in when this was added. Or maybe I’m talking silly talk.

[-]

Marksta@reddit

A little silly talk. There is dense layers and then there is the moe sparse layers, or the 'experts' layers. With this option or the older way of handling it via -ot, the dense layers are already accounted for via setting -ngl 99. So all dense layers (usually 1-3 of them) all go to GPU and sparse layers to CPU, and then if you can fit it add some of the sparse layers to GPU too instead of CPU.

There is some more inner logic to consider of keeping experts 'together', not sure how this handles it here or any real performance implications. But most people regex'ed experts as units to keep them together so this new arg probably does too.

[-]

TheTerrasque@reddit

I'm guessing some of the experts are "hotter" than others, and moving those to gpu would help more than moving random ones.

Basically it could keep track of which layers saw the most activation and move them to the gpu. If the distribution is uniform or near uniform, this of course isn't a viable thing to do.

[-]

Former-Ad-5757@reddit

I would guess which experts are hot or not would be a combination of training, model and question. So it would be userspecific. Perhaps it could be a feature request or pr to keep a log of activated layers/expert in a run. And then a simple recalculation tool which could read the log and generate the perfect regex for your situation but it would be a totally new feature

[-]

TheTerrasque@reddit

Could just be as simple as keeping a table of each layer and a counter for when it's activated, and now and then rearrange layers based on the count. It would be a new feature, yes.

[-]

Former-Ad-5757@reddit

Actually in theory it should not be that hard I would guess, if you account for enough ram to hold all the tensors (Ram is usually not the problem, vram is) and load all tensors to ram then everything is at least in the slowest place. And then you could copy a tensor to gpu, after that is done just change the router which says where everything is located.

Worst case scenario is that it isn't in vram but you will know it is in ram as a fallback.

[-]

Secure_Reflection409@reddit

Excellenté!

Really impressed with LCP's web interface, too.

If it had a context estimator like LMS it would prolly be perfect.

[-]