llama.cpp DeepSeek v4 Flash experimental inference
Posted by antirez@reddit | LocalLLaMA | View on Reddit | 33 comments
Hi, here you can find experimental support for DeepSeek v4, and here is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even quantized to 2 bits, looks very solid in my limited testing, and the speed of 17 t/s on my MacBook M3 Max is quite interesting; I would say we are in the usable zone.
What I did was to heavily quantize the routed experts to 2 bits, using two different 2-bit quants to balance error and size. All the rest of the model, including the shared expert of each layer, is Q8: it is not worth touching the most sensitive parts of the model when the bulk of the weights is in the routed experts.
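To give an idea of the split, here is a rough Python sketch of the kind of per-tensor selection I mean; the tensor-name patterns follow the usual llama.cpp GGUF naming for MoE experts, but the exact assignment of the two 2-bit types below is only an illustration, not the precise recipe used for this GGUF:

import re

# Illustrative mapping: routed experts get one of two 2-bit types, everything
# else (attention, norms, embeddings, shared experts, output head) stays Q8_0.
# Which 2-bit type goes to which projection is an assumption for the example.
def pick_quant_type(tensor_name: str) -> str:
    if re.search(r"ffn_(gate|up)_exps", tensor_name):
        return "IQ2_XXS"  # smaller 2-bit type for the less sensitive projections
    if re.search(r"ffn_down_exps", tensor_name):
        return "IQ2_S"    # slightly bigger 2-bit type where the error hurts more
    return "Q8_0"         # the most sensitive parts stay at 8 bit

if __name__ == "__main__":
    for name in ("blk.3.ffn_up_exps.weight",
                 "blk.3.ffn_down_exps.weight",
                 "blk.3.ffn_up_shexp.weight",
                 "blk.3.attn_q.weight"):
        print(f"{name:32s} -> {pick_quant_type(name)}")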
I have the feeling that, even quantized to 2 bits, this will prove to be a stronger model than Qwen 3.6 27B, but this is only a feeling based on the quality of the replies I get chatting with it. There is more to experiment with, and benchmarks to run.
tarruda@reddit
Have you considered trying IQ3_XXS? It might also fit in 128G
antirez@reddit (OP)
We are already at the limit with 86GB of weights... Also, I tested it with the Pi agent and tool calling works perfectly: it is able to modify C code, check files, commit, and so forth. Even though the v4 KV cache is smaller than other models', you still need space for the OS and for the cache.
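Tool calling just goes through the standard llama-server OpenAI-compatible endpoint (you can see the POST /v1/chat/completions in the logs below). A minimal request looks something like this; the read_file tool here is a made-up example, not what Pi actually sends:

import json
import urllib.request

# Minimal tool-calling request against llama-server's OpenAI-compatible API.
# Address and route match the server logs in this thread; the tool schema is
# purely illustrative.
payload = {
    "model": "deepseek-v4-flash",  # llama-server serves the single loaded model
    "messages": [
        {"role": "user", "content": "Show me the first lines of server.cpp"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from the working directory",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    msg = json.load(resp)["choices"][0]["message"]
    # If the model decided to call the tool, the call shows up in tool_calls.
    print(msg.get("tool_calls") or msg.get("content"))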
LegacyRemaster@reddit
So with Minimax 2.7 Q4 UD running locally I was able to convert the cpu-metal code to cuda ---> compile --> load.
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 44/44 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1010.00 MiB
load_tensors: CUDA0 model buffer size = 81687.67 MiB
....................................................................................................
common_init_result: added <|end▁of▁sentence|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 1048576
llama_context: n_ctx_seq = 1048576
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.0625
llama_context: CUDA_Host output buffer size = 1.97 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 1048576 cells
llama_kv_cache: size = 0.00 MiB (1048576 cells, 0 layers, 4/1 seqs), K (f16): 0.00 MiB, V (f16): 0.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 0
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 0
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: CUDA0 KV buffer size = 43.00 MiB
llama_kv_cache: size = 43.00 MiB ( 1024 cells, 43 layers, 4/1 seqs), K (f16): 43.00 MiB, V (f16): 0.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512
llama_memory_recurrent: CUDA0 RS buffer size = 82.00 MiB
llama_memory_recurrent: size = 82.00 MiB ( 4 cells, 43 layers, 4 seqs), R (f32): 41.00 MiB, S (f32): 41.00 MiB
llama_memory_hybrid_iswa: CUDA0 DeepSeek4 compressed KV buffer size = 27520.00 MiB
sched_reserve: reserving ...
sched_reserve: layer 2 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: CUDA0 compute buffer size = 281.57 MiB
sched_reserve: CUDA_Host compute buffer size = 224.42 MiB
sched_reserve: graph nodes = 9961 (with bs=512), 6936 (with bs=1)
sched_reserve: graph splits = 856 (with bs=512), 690 (with bs=1)
sched_reserve: reserve took 97.05 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 4
common_context_can_seq_rm: the target context does not support partial sequence removal
srv load_model: speculative decoding will use checkpoints
no implementations specified for speculative decoding
slot load_model: id 0 | task -1 | new slot, n_ctx = 1048576
no implementations specified for speculative decoding
slot load_model: id 1 | task -1 | new slot, n_ctx = 1048576
no implementations specified for speculative decoding
slot load_model: id 2 | task -1 | new slot, n_ctx = 1048576
no implementations specified for speculative decoding
slot load_model: id 3 | task -1 | new slot, n_ctx = 1048576
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: init: idle slots will be saved to prompt cache and cleared upon starting a new task
init: chat template, example_format: '<|begin▁of▁sentence|>You are a helpful assistant<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
LegacyRemaster@reddit
will try on my rtx 6000 96
tarruda@reddit
Not sure he has added a CUDA kernel yet, but you can try on CPU if you have enough RAM.
LegacyRemaster@reddit
compiling
fragment_me@reddit
Well how did it go?
LegacyRemaster@reddit
testing
Then-Topic8766@reddit
This could be the reason for my build error (tried with -DGGML_CUDA=ON).
LegacyRemaster@reddit
128gb ram yeah
thereisonlythedance@reddit
It’s a shame DeepSeek and llama.cpp devs don’t coordinate much. Their architecture is complex and seemingly not well supported by llama.cpp (still no DSA, though I know one of the great devs is working on it).
LegacyRemaster@reddit
CMake Error at tools/CMakeLists.txt:22 (add_subdirectory):
add_subdirectory given source "deepseek4-quantize" which is not an existing
directory.
---------- added dir--------->
CMake Error at tools/CMakeLists.txt:22 (add_subdirectory):
The source directory
C:/llm/llama.cpp-deepseek-v4-flash/tools/deepseek4-quantize
does not contain a CMakeLists.txt file.
antirez@reddit (OP)
Fixed, sorry: I used a tool that is not ready for prime time to generate the GGUF and forgot to remove it from the CMake files.
LegacyRemaster@reddit
no cuda right?
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)
load_tensors: offloading output layer to GPU
load_tensors: offloading 42 repeating layers to GPU
load_tensors: offloaded 44/44 layers to GPU
load_tensors: CUDA0 model buffer size = 81687.67 MiB
load_tensors: CUDA_Host model buffer size = 1010.00 MiB
...................................................................................................
common_init_result: added <|end▁of▁sentence|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 8192
llama_context: n_ctx_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.0625
llama_context: n_ctx_seq (8192) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 1.97 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 8192 cells
llama_kv_cache: size = 0.00 MiB ( 8192 cells, 0 layers, 4/1 seqs), K (f16): 0.00 MiB, V (f16): 0.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 0
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 0
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: CUDA0 KV buffer size = 43.00 MiB
llama_kv_cache: size = 43.00 MiB ( 1024 cells, 43 layers, 4/1 seqs), K (f16): 43.00 MiB, V (f16): 0.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512
llama_memory_recurrent: CUDA0 RS buffer size = 82.00 MiB
llama_memory_recurrent: size = 82.00 MiB ( 4 cells, 43 layers, 4 seqs), R (f32): 41.00 MiB, S (f32): 41.00 MiB
llama_memory_hybrid_iswa: CUDA0 DeepSeek4 compressed KV buffer size = 215.00 MiB
sched_reserve: reserving ...
sched_reserve: layer 2 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: CUDA0 compute buffer size = 281.57 MiB
sched_reserve: CUDA_Host compute buffer size = 224.42 MiB
sched_reserve: graph nodes = 9961 (with bs=512), 6936 (with bs=1)
sched_reserve: graph splits = 856 (with bs=512), 690 (with bs=1)
sched_reserve: reserve took 57.50 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 4
common_context_can_seq_rm: the target context does not support partial sequence removal
srv load_model: speculative decoding will use checkpoints
no implementations specified for speculative decoding
slot load_model: id 0 | task -1 | new slot, n_ctx = 8192
no implementations specified for speculative decoding
slot load_model: id 1 | task -1 | new slot, n_ctx = 8192
no implementations specified for speculative decoding
slot load_model: id 2 | task -1 | new slot, n_ctx = 8192
no implementations specified for speculative decoding
slot load_model: id 3 | task -1 | new slot, n_ctx = 8192
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: init: idle slots will be saved to prompt cache and cleared upon starting a new task
init: chat template, example_format: '<|begin▁of▁sentence|>You are a helpful assistant<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
srv get_availabl: updating prompt cache
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 8192 tokens, 8589934592 est)
srv get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, task.n_tokens = 25
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 21, batch.n_tokens = 21, progress = 0.840000
C:\llm\llama.cpp-deepseek-v4-flash\build\bin\Release>
antirez@reddit (OP)
CPU / Metal right now. But GPT 5.5 should likely be able to implement the CUDA support just by looking at the Metal kernels.
MotokoAGI@reddit
Are you going to do it?
LegacyRemaster@reddit
we will see...
LegacyRemaster@reddit
testing on cpu only now but... on windows --> no way
MotokoAGI@reddit
Can you please upload a q8 gguf?
Then-Topic8766@reddit
Me too :(
LegacyRemaster@reddit
Fixed. The deepseek4-quantize directory exists but is empty - it has no CMakeLists.txt or source files. I removed the add_subdirectory(deepseek4-quantize) reference from tools/CMakeLists.txt (line 22).
tarruda@reddit
Yea got this too.
LegacyRemaster@reddit
always a good occasion to test something :d
LegacyRemaster@reddit
Fixed. The deepseek4-quantize directory exists but is empty - it has no CMakeLists.txt or source files. I removed the add_subdirectory(deepseek4-quantize) reference from tools/CMakeLists.txt (line 22).
Then-Topic8766@reddit
Thank you, it is building now.
MotokoAGI@reddit
Good stuff, thanks antirez, gonna test drive this.
antirez@reddit (OP)
For the first time, even with this selective 2-bit quantization, I feel like I have a frontier model running on my computer. The quality of the replies is incredible, and so is its mental order: the fact that it thinks the right amount of time based on the complexity of the question, the language it uses. Incredibly cool.
markole@reddit
Omg, it's the Redis guy. Thanks!
Then-Topic8766@reddit
Thank you! I was waiting for something like that.
Monkey_1505@reddit
Huh, 86GB. Could run that entirely on a Blackwell 6000 and get solid speeds at near full context. Mind you, it's 13B active, so you could probably offload some of the experts to CPU instead and use a slightly higher quant, something like the sketch below.
Interesting. I didn't think it would fit this small.
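For example, with the mainline llama.cpp tensor-override flag; whether that flag carries over to this experimental branch (or whether it builds with CUDA at all yet) is an assumption, and the filename, layer range and context size are placeholders:

import subprocess

# Keep attention, shared experts and the lower layers on the GPU, and push the
# routed experts of the upper layers (25-42 here, adjust to taste) to system
# RAM. The --override-tensor regex syntax is the mainline llama.cpp convention.
subprocess.run([
    "./build/bin/llama-server",
    "-m", "deepseek-v4-flash-IQ2-Q8.gguf",   # placeholder filename
    "--n-gpu-layers", "99",                  # offload every layer...
    "--override-tensor", r"blk\.(2[5-9]|3[0-9]|4[0-2])\.ffn_.*_exps\.=CPU",
    "-c", "32768",                           # placeholder context size
], check=True)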