Run Deepseek locally on a 24g GPU: Quantizing on our Giga Computing 6980P Xeon
Posted by atape_1@reddit | LocalLLaMA | View on Reddit | 50 comments
Few-Yam9901@reddit
Does anyone have updated DeepSeek V3 quants for llama.cpp? The ones from more than 4 weeks ago all take too much space for the KV cache.
VoidAlchemy@reddit
A few days ago I released the equivalent IQ1_S_R4 for DeepSeek-V3 on my huggingface ubergarm collection because people wanted non-thinking versions. It uses smaller tensors for the GPU offload portion, so it runs in 16GB VRAM, or with more context if you have more VRAM.
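(If it helps anyone, pulling a single quant out of a collection looks roughly like this; the repo name and file pattern below are hypothetical placeholders, so check the actual collection page for the real ones:)
```bash
pip install -U "huggingface_hub[cli]"

# Download only the IQ1_S_R4 split files (the repo name here is a placeholder).
huggingface-cli download ubergarm/DeepSeek-V3-GGUF \
  --include "*IQ1_S_R4*" \
  --local-dir ./DeepSeek-V3-IQ1_S_R4
```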
notdba@reddit
Thank you so much! This IQ1_S_R4 quant is just amazing. It turns out DeepSeek V3 can actually run on a laptop.
And not only that. I have a use case where the output has to be precise. The IQ1_S_R4 quant is able to give the exact same output as the FP8 version from Fireworks. And it does so with just `-ser 5,1`. Mind blown.
(To be fair, I can also get the precise output with Qwen2.5-Coder-32B-Instruct at Q4_K_M, Gemini 2.5 flash with reasoning disabled, Gemini 2.5 pro, and all the Claude models since Sonnet 3.5, again with reasoning disabled. But still.)
A couple of notes:
I copied your recipe and made a smaller quant using IQ1_S_R4 for the 3-60 ffn_down_exps layers as well, which leaves me about 10% of usable memory free after loading the model. File sizes I got (down, gate/up):
- IQ1_S_R4, IQ1_S_R4: 132,998,282,944 bytes
- IQ1_S, IQ1_S: 137,772,449,472 bytes
- IQ1_M_R4, IQ1_S_R4: 139,809,832,608 bytes
- IQ1_M, IQ1_S: 142,881,111,744 bytes
It does make the model a little dumber though, so I have to compensate with `-ser 6,1`. On this laptop with i9-11950H, 128GB DDR4 2933 MHz memory, and RTX A5000 mobile, I can get about 3.5 tok/sec generation, and 13 tok/sec for pp256, 30 tok/sec for pp512, 50 tok/sec for pp1024, and 85 tok/sec for pp2048. While the generation speed is slower than the 5 tok/sec I got from DeepSeek V2, the model has 3x more parameters, and can complete the task successfully.
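(For anyone wanting to reproduce that kind of pp256/pp512/pp1024/pp2048 sweep, llama.cpp's bench tool can do it in one run; the model path and thread count below are placeholders, and you'd add your usual offload flags on top, checking `llama-bench --help` for what your build supports:)
```bash
# -p takes a comma-separated list of prompt lengths (prompt-processing test)
# and -n the number of tokens to generate (token-generation test); -t is threads.
./build/bin/llama-bench \
  -m /models/DeepSeek-V3-IQ1_S_R4.gguf \
  -t 16 \
  -p 256,512,1024,2048 \
  -n 64
```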
Few-Yam9901@reddit
Ok I'll try it again. Couldn't get it to compile last time I tried it.
celsowm@reddit
How many tokens per seconds?
No_Afternoon_4260@reddit
About 10 for a q4 and 13 for a q1.
These are ik quants
He has dual Xeon 6 6980P, 12x MRDIMM 8800. He said he might try CXL modules later.
He spoke a lot about quantization; surprisingly he didn't talk about single vs dual socket performance.
smflx@reddit
+1 Dual socket is not worth it. Expect about a 10% boost. Accessing memory across NUMA nodes is quite slow.
The CPU matters for actual RAM & compute speed. A single CPU with all memory channels populated will do about the same.
I don't think CXL modules will be helpful ...
VoidAlchemy@reddit
Heya, good seein' u around!
Yeah it's hard to use these big rigs to run a single large workload imo. I think leaving a big dual socket 6980P in its default 6x NUMA nodes and running VMs pinned to the correct CPU cores <-> memory nodes would work well.
Then you could use a CXL module for any remaining tasks running on the hypervisor, which don't have to be as fast as the VM tasks. So you'd make full use of all the available RAM for end users etc.
Or I suppose CXL might be good for some big database workloads where it is at least faster to have stuff cached there than read off disk.
But yeah it's no panacea, especially given the costs!
smflx@reddit
Good to see you improving deepseek quants. Happy to use them without -rtr :)
Yeah, running many tasks or VMs on NUMA works well. That's what the server is designed for! But it's hard to run a big LLM on it. Tensor parallelism over two CPUs (complicated) or duplicating the weights over both CPUs (2x memory use) seem like the possible ways. The latter is what ktransformers does.
Yeah, CXL is for the purposes you state. I saw a manufacturer hoping it could serve as a memory add-on for GPUs. I meant it's not, because that's just the CPU memory offloading we're already doing.
No_Afternoon_4260@reddit
I don't think so either. Maybe someday, probably as storage, but our backends need a lot of work before it's even worth it.
VoidAlchemy@reddit
The top/bottom split screen was running on a single socket using `numactl -N 0 -m 0 llama-server .... --numa numactl`. You are correct that dual socket is not great on any CPU inferencing engine that I know of, and there will be a hit given lower RAM latency/bandwidth from outside the NUMA node. The rig is configured with `SNC=Disable`, so 1x NUMA node per socket. Dual socket does benefit token generation speeds given that is more CPU bottlenecked.
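(Spelled out a bit more, the launch looks roughly like this; the model path and thread count are placeholders rather than the exact values from the video:)
```bash
# Pin compute (-N 0) and memory allocation (-m 0) to socket 0, and tell
# llama.cpp to respect that placement with --numa numactl.
numactl -N 0 -m 0 ./build/bin/llama-server \
  --model /models/DeepSeek-R1-0528-IQ1_S_R4.gguf \
  --threads 64 \
  --numa numactl
```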
No_Afternoon_4260@reddit
Thanks for the reply, iirc you are the level1tech guy. Thanks for testing that CPU, been looking at it for months and nobody seemed to have it lol.
Have to say I'm kind of underwhelmed by its performance.
You should try ktransformers as it leverages the AMX instructions on these Intel CPUs. Iirc you have 768GB of RAM per socket, and ktransformers keeps a copy of the model for each socket, so you may also be able to leverage dual socket.
VoidAlchemy@reddit
Yeah, I wrote a getting started guide for ktransformers a few months ago before they released the full AMX code: https://github.com/ubergarm/r1-ktransformers-guide
The problem afaict is that ktransformers *requires* at least a single GPU, which is not installed on this test rig. I tried some experimental PRs for llama.cpp allocating essentially `USE_NUMA=1` like they are doing, but couldn't get much more out of it. Have a whole discussion on the multi-NUMA challenges for Intel Xeon here https://github.com/ggml-org/llama.cpp/discussions/12088 and fairydreaming has a good one on Epyc as well (linked on there).
Interestingly AMX hasn't helped much over ik's AVX2 and AVX512 CPU implementations from what I can tell so far, though things are moving fast on both projects I'm sure.
And yeah I'm having trouble saturating the expected RAM bandwidth on a single socket in terms of theoretical max token generation speeds. The AMD Threadripper Pro 24-core in NPS1 hits around 70% of expected theoretical max TG. My home rig's 9950X with overclocked "gear 1" infinity fabric and 2x48GB DDR5-6400 DIMMs can hit over 90% of theoretical max.
I assume the ability to hit theoretical max TG is related to both instruction set implementation latency stuff as well as how the system BIOS is handling the NUMA node <-> CPU cores mapping under the hood.
Anyway fun benchmarking these monster rigs lol.. Sounds like you know your stuff! Cheers!
No_Afternoon_4260@reddit
Hey thanks been around for a couple of years now lol
I forgot that the numbers stated in the video are for cpu only (no gpu), that changes my perspective a lot!
Iirc from fairydreaming's work in the sub, he could get 80% of max theoretical RAM bandwidth on Genoa (9375F iirc) and 90% on Turin. And I'm speaking about synthetic RAM speed benchmarks. Obviously the numbers are lower with our backends because ofc they aren't optimised for NUMA domains.
So it's kind of normal that smaller beasts are closer to their theoretical max perf.
Have you tried benchmarking ram bandwidth at all?
I joined your level1tech forum yesterday, really good vibes there!
VoidAlchemy@reddit
On a Threadripper Pro 24-core with 256GB RAM and 24GB CUDA VRAM you can run the ik_llama.cpp quants at over 10 tok/sec token generation and 100+ tok/sec prompt processing by increasing batch size. Up to 32k context or so.
On an AM5 gaming rig with 2x64GB DDR5 kit you're lucky to get say 80GB/s RAM bandwidth which directly limits token generation speeds as that is memory i/o bottlenecked.
So if you have more VRAM you can increase context and/or offload more exps layers for faster speed.
LagOps91@reddit
So the huge cpu from the video isn't actually that important? And how large would the impact of ram speeds be on inference? If you double the ram speed (for example 4 channel vs 2 channel), how much of a % increase can you expect?
On that gaming rig you mentioned, what kind of speed would be possible?
The new deepseek also has multi token prediction - is that supported yet and does it meaningfully change things?
smflx@reddit
RAM speed is important for token generation. I get 17 tok/sec with 12 channels (350GB/s), 15 tok/sec with 8 channels (200GB/s).
His two RAM sticks are high-speed DDR5-6400.
LagOps91@reddit
Yeah I'm thinking about getting 2x64 sticks with 6400mt/s, so I'm interested in whether or not that makes any sense for running large MoE models.
smflx@reddit
It's the best choice for a gaming rig because it has only two memory channels.
If you're building new, even an old Epyc server CPU/mainboard with 8 channels of DDR4 is better. A 64GB stick is too big for its memory bandwidth; 8 sticks of 32GB is better, and also the sweet spot price-wise.
LagOps91@reddit
yeah unfortunately i bought my gaming rig just before local ai was taking off. i'm well aware that a server build would be better, but i already bought the hardware and just want to know if i can sensibly get a usable performance if i buy more ram. if i only get 1-2 t/s then it's not worth it, but if i would get 5+ t/s with usable pp speed... yeah that would be quite tempting for me.
VoidAlchemy@reddit
I've run on 2x48GB DDR5-6400 with a PCIe Gen 5 NVMe, paging the Linux cache hard. Might be able to pull off 3~5 tok/sec at short context even without enough RAM, given mmap() does its magic.
I have an old video doing it with ktransformers, but my new quants on ik's fork are better now: https://www.youtube.com/watch?v=4ucmn3b44x4
Give it a shot before investing in those new 2x64GB DDR5 kits. (though it is tempting for me too) hah...
admajic@reddit
32k context isn't very useful for agentic coding. I use a 32B version and 100k context to get the job done.
LA_rent_Aficionado@reddit
Especially when the cline system prompt takes up like 15k lol
VoidAlchemy@reddit
I mentioned how this model does support 160k context (longer than most others, especially without YaRN).
But yeah those system prompts are huge, it is kinda ridic imo. I tend to use no system prompt and have my own little python async vibe coding client as shown in the video. Works great for 1shot or few-shot code generations / refactors.
Have fun vibing!
VoidAlchemy@reddit
This full DeepSeek model does support the full 160k context if you have the RAM/VRAM and/or patience. It uses Multi-head Latent Attention (MLA), which is *much* more efficient than standard Grouped Query Attention (GQA), as the MLA latent KV cache stays far smaller than a GQA cache as context grows.
I designed my larger quants to support 32k in 24GB VRAM. But the smaller models do support 64k context in under 24GB VRAM since I used smaller tensors for the GPU offload.
But yeah if you can't run anything larger than a 32B then enjoy what you have!
macumazana@reddit
You mean seconds per token
waywardspooky@reddit
asking the real questions
AdventurousSwim1312@reddit
What rough speed would I get on 2x3090 + Ryzen 9 3950X + 128GB DDR4 @ 3600?
Are we talking tokens per minute? Tokens per second? Tens of tokens per second?
radamantis12@reddit
I get 6 tok/sec at best using ik_llama.cpp for the 1-bit quant with the same setup, except with a Ryzen 7 5700X and DDR4-3200.
VoidAlchemy@reddit
Great to hear you got it going! Pretty good for ddr4-3200! How many extra exps layers can you offload into VRAM for speedups?
radamantis12@reddit
The best I got was 6 layers on each GPU for a balance between prompt processing and token generation:
The downside on my PC is the lower prompt processing, something between 20-40 t/s. It's possible to add one more layer, maybe two if I lower the batches, but it would hurt the prompt speed more.
I saw someone with the same config but a 3rd-gen Threadripper who was able to get around 160 t/s prompt processing, so my guess is that memory bandwidth, instructions, or even the core count has a huge impact here.
Oh and I forgot to mention that I overclock my Ryzen to reach the 6 t/s.
VoidAlchemy@reddit
Very cool! Glad you got it running and seems decent speeds for a gaming rig.
I stopped using `--tensor-split` as it seemed to cause issues combining with `-ot` for me. Also if you aren't already, you could try compiling with `-DGGML_CUDA_F16=ON`. I explain my reasoning on that here.
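For anyone following along, here's a rough sketch of what that `-ot`-based placement can look like on a two-GPU rig. The build flags, model path, and which expert layers go to which card are illustrative guesses, not a tuned configuration:
```bash
# Build with CUDA; the F16 flag is the one referenced above (an assumption, not
# the exact build line used here).
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=ON
cmake --build build --config Release -j

# -ngl 99 keeps attention/shared-expert tensors on GPU, the first two -ot rules
# pin a few routed-expert layers to each card (layer numbers are placeholders),
# and the final catch-all sends the remaining experts to CPU RAM.
./build/bin/llama-server \
  --model /models/DeepSeek-R1-0528-IQ1_S_R4.gguf \
  -ngl 99 \
  -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
  -ot "blk\.(6|7|8)\.ffn_.*_exps=CUDA1" \
  -ot exps=CPU
```
Iirc the first matching rule wins, so the per-GPU overrides have to come before the `exps=CPU` catch-all.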
radamantis12@reddit
Oh, you are the GOAT ubergarm! Your comments in the repo definitely helped me and I love the Q1 that you cooked.
Currently I use this build:
I will try the -DGGML_CUDA_F16 flag later, but inspired by this discussion I decided to monitor my PCIe speeds and CUDA0 was running at PCIe Gen 2 x4 speeds. I will try to fix this and see if that was the problem; even with high batch sizes I guess the PCIe speed still hurts and was probably the main cause of the lower pp.
FormalAd7367@reddit
How is your setup with the distilled model?
I have 4x 3090 + DDR4, but my family wants to build another one. I have two 3090s lying around, so I want to know if that would be enough to run a small model.
AdventurousSwim1312@reddit
I'm using my setup with models up to 80B in Q4.
Usual speeds with tensor parallelism:
- 70B alone: 20 t/s
- 70B with 3B draft model: 30 t/s
- 32B alone: 55 t/s
- 32B with 1.5B draft model: 65-70 t/s
- 14B: 105 t/s
- 7B: 160 t/s
Engine: vLLM / ExLlamaV2. Quants: AWQ, GPTQ, EXL2 4.0bpw.
VoidAlchemy@reddit
you can run the small distill models on a single 3090...
Threatening-Silence-@reddit
Probably looking at 3 tokens a second or thereabouts.
smflx@reddit
That server is very loud, like a jet plane. Don't even think of getting one at home :)
premium0@reddit
This is so pointless it hurts.
GreenTreeAndBlueSky@reddit
At that size I'd be interested to see how it fares compared to Qwen3 235B at 4-bit.
VoidAlchemy@reddit
I have a Qwen3-235B-A22B quant that fits on 96GB RAM + 24GB VRAM. If possible I would prefer to run the smallest DeepSeek-R1-0528. DeepSeek arch is nice because you can put all the attention, shared expert, and first 3 "dense layers" all onto GPU for good speedups while offloading the rest with `-ngl 99 -ot exps=CPU`.
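For reference, that starting recipe looks roughly like this as a full command (the model path and context size are placeholders):
```bash
# -ngl 99 pushes everything toward the GPU, then the -ot override sends every
# routed-expert tensor (names containing "exps") back to CPU RAM, so only the
# attention, shared expert, and dense layers end up in VRAM.
./build/bin/llama-server \
  --model /models/DeepSeek-R1-0528-IQ1_S_R4.gguf \
  -ngl 99 \
  -ot exps=CPU \
  -c 32768
```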
Zc5Gwu@reddit
It would be interesting to see full benchmark comparisons... i.e. GPQA score for the full model versus the 1bit quantized model, live bench scores, etc.
VoidAlchemy@reddit
If you find The Great Quant Wars of 2025 reddit post i wrote, me and bartowski do that for the Qwen3-30B-A3B quants. That informed some of my quantization strategy with this larger model.
Doing those full benchmarks is really slow though even at say 15 tok/sec generation. Also benchmarks of lower quants sometimes score *better* which is confusing. There is a paper called "Accuracy is all you need" which discusses it more and suggests looking at "flips" in benchmarking.
Anyway, perplexity and KLD are fairly straightforward and accepted ways to measure the relative quality of a quant against its original. They are not useful for measuring quality across different models/architectures.
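(For anyone wanting to run those numbers themselves, a minimal sketch with llama.cpp's perplexity tool; the file names are placeholders and it's worth checking `--help` on your build for the exact flags:)
```bash
# 1) Run the reference (full-quality) model over a test text and save its
#    per-token logits as the baseline.
./build/bin/llama-perplexity -m model-bf16.gguf -f wiki.test.raw \
  --kl-divergence-base logits-base.bin

# 2) Run the quantized model against that baseline to get perplexity plus
#    KL-divergence stats relative to the original.
./build/bin/llama-perplexity -m model-iq1_s.gguf -f wiki.test.raw \
  --kl-divergence-base logits-base.bin --kl-divergence
```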
Meronoth@reddit
Big asterisk of 24G GPU plus 128G RAM, but seriously impressive stuff
mark-haus@reddit
Can you shard models and compute of models between CPU/RAM & GPU/VRAM?
VoidAlchemy@reddit
Yup, I recommend running this DeepSeek-R1-0528 with `-ngl 99 -ot exps=CPU` as a start, then tune the command for your specific rig and VRAM from there.
Hybrid CPU+GPU inferencing is great on this model.
There is also the concept of RPC to shard across machines, but it doesn't work great yet afaict and requires super fast networking to even be possible hah...
MINIMAN10001@reddit
Models can shard across anything at the layer level
The petals project was created for distributing model load across multiple users.
Threatening-Silence-@reddit
Of course.
You use `--override-tensor` with a custom regex to selectively offload the experts to CPU/RAM while keeping the attention tensors and shared experts on GPU.
Thireus@reddit
Big shout-out to /u/VoidAlchemy 👋
VoidAlchemy@reddit
Aww thanks! Been enjoying watching you start cooking your own quants too Thireus!!!