llama.cpp DeepSeek v4 Flash experimental inference

[-]

tarruda@reddit

Have you considered trying IQ3_XXS? It might also fit in 128G

[-]

We are already at the limit with 86GB of weights... Also I tested it with the Pi agent and tool calling works perfectly, it is able to modify C code, check files, commit, and so forth. Even if v4 KV cache is smaller than other models, still you need space for the OS and for the cache.

[-]

LegacyRemaster@reddit

So with Minimax 2.7 Q4 UD local I was able to convert cpu-metal to cuda ---> compile --> load. load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

load_tensors: offloading output layer to GPU

load_tensors: offloading 42 repeating layers to GPU

load_tensors: offloaded 44/44 layers to GPU

load_tensors: CPU_Mapped model buffer size = 1010.00 MiB

load_tensors: CUDA0 model buffer size = 81687.67 MiB

....................................................................................................

common_init_result: added <｜end▁of▁sentence｜> logit bias = -inf

llama_context: constructing llama_context

llama_context: n_seq_max = 4

llama_context: n_ctx = 1048576

llama_context: n_ctx_seq = 1048576

llama_context: n_batch = 2048

llama_context: n_ubatch = 512

llama_context: causal_attn = 1

llama_context: flash_attn = auto

llama_context: kv_unified = true

llama_context: freq_base = 10000.0

llama_context: freq_scale = 0.0625

llama_context: CUDA_Host output buffer size = 1.97 MiB

llama_kv_cache_iswa: creating non-SWA KV cache, size = 1048576 cells

llama_kv_cache: size = 0.00 MiB (1048576 cells, 0 layers, 4/1 seqs), K (f16): 0.00 MiB, V (f16): 0.00 MiB

llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 0

llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 0

llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells

llama_kv_cache: CUDA0 KV buffer size = 43.00 MiB

llama_kv_cache: size = 43.00 MiB ( 1024 cells, 43 layers, 4/1 seqs), K (f16): 43.00 MiB, V (f16): 0.00 MiB

llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512

llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512

llama_memory_recurrent: CUDA0 RS buffer size = 82.00 MiB

llama_memory_recurrent: size = 82.00 MiB ( 4 cells, 43 layers, 4 seqs), R (f32): 41.00 MiB, S (f32): 41.00 MiB

llama_memory_hybrid_iswa: CUDA0 DeepSeek4 compressed KV buffer size = 27520.00 MiB

sched_reserve: reserving ...

sched_reserve: layer 2 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)

[0msched_reserve: Flash Attention was auto, set to disabled

[0msched_reserve: resolving fused Gated Delta Net support:

sched_reserve: fused Gated Delta Net (autoregressive) enabled

sched_reserve: fused Gated Delta Net (chunked) enabled

sched_reserve: CUDA0 compute buffer size = 281.57 MiB

sched_reserve: CUDA_Host compute buffer size = 224.42 MiB

sched_reserve: graph nodes = 9961 (with bs=512), 6936 (with bs=1)

sched_reserve: graph splits = 856 (with bs=512), 690 (with bs=1)

sched_reserve: reserve took 97.05 ms, sched copies = 1

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

[0msrv load_model: initializing slots, n_slots = 4

common_context_can_seq_rm: the target context does not support partial sequence removal

[0msrv load_model: speculative decoding will use checkpoints

[0mno implementations specified for speculative decoding

[0mslot load_model: id 0 | task -1 | new slot, n_ctx = 1048576

no implementations specified for speculative decoding

[0mslot load_model: id 1 | task -1 | new slot, n_ctx = 1048576

no implementations specified for speculative decoding

[0mslot load_model: id 2 | task -1 | new slot, n_ctx = 1048576

no implementations specified for speculative decoding

[0mslot load_model: id 3 | task -1 | new slot, n_ctx = 1048576

srv load_model: prompt cache is enabled, size limit: 8192 MiB

[0msrv load_model: use `--cache-ram 0` to disable the prompt cache

[0msrv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391

[0msrv init: init: idle slots will be saved to prompt cache and cleared upon starting a new task

init: chat template, example_format: '<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>'

srv init: init: chat template, thinking = 1

main: model loaded

main: server is listening on http://127.0.0.1:8080

[-]

LegacyRemaster@reddit

will try on my rtx 6000 96

[-]

tarruda@reddit

Not sure he put a cuda kernel yet, but you can try on CPU if you have enough RAM

[-]

LegacyRemaster@reddit

load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)

load_tensors: offloading output layer to GPU

load_tensors: offloading 42 repeating layers to GPU

load_tensors: offloaded 44/44 layers to GPU

load_tensors: CPU_Mapped model buffer size = 1010.00 MiB

load_tensors: CUDA0 model buffer size = 81687.67 MiB

....................................................................................................

common_init_result: added <｜end▁of▁sentence｜> logit bias = -inf

llama_context: constructing llama_context

llama_context: n_seq_max = 4

llama_context: n_ctx = 1048576

llama_context: n_ctx_seq = 1048576

llama_context: n_batch = 2048

llama_context: n_ubatch = 512

llama_context: causal_attn = 1

llama_context: flash_attn = auto

llama_context: kv_unified = true

llama_context: freq_base = 10000.0

llama_context: freq_scale = 0.0625

llama_context: CUDA_Host output buffer size = 1.97 MiB

llama_kv_cache_iswa: creating non-SWA KV cache, size = 1048576 cells

llama_kv_cache: size = 0.00 MiB (1048576 cells, 0 layers, 4/1 seqs), K (f16): 0.00 MiB, V (f16): 0.00 MiB

llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 0

llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 0

llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells

llama_kv_cache: CUDA0 KV buffer size = 43.00 MiB

llama_kv_cache: size = 43.00 MiB ( 1024 cells, 43 layers, 4/1 seqs), K (f16): 43.00 MiB, V (f16): 0.00 MiB

llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512

llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512

llama_memory_recurrent: CUDA0 RS buffer size = 82.00 MiB

llama_memory_recurrent: size = 82.00 MiB ( 4 cells, 43 layers, 4 seqs), R (f32): 41.00 MiB, S (f32): 41.00 MiB

llama_memory_hybrid_iswa: CUDA0 DeepSeek4 compressed KV buffer size = 27520.00 MiB

sched_reserve: reserving ...

sched_reserve: layer 2 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)

[0msched_reserve: Flash Attention was auto, set to disabled

[0msched_reserve: resolving fused Gated Delta Net support:

sched_reserve: fused Gated Delta Net (autoregressive) enabled

sched_reserve: fused Gated Delta Net (chunked) enabled

sched_reserve: CUDA0 compute buffer size = 281.57 MiB

sched_reserve: CUDA_Host compute buffer size = 224.42 MiB

sched_reserve: graph nodes = 9961 (with bs=512), 6936 (with bs=1)

sched_reserve: graph splits = 856 (with bs=512), 690 (with bs=1)

sched_reserve: reserve took 97.05 ms, sched copies = 1

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

[0msrv load_model: initializing slots, n_slots = 4

common_context_can_seq_rm: the target context does not support partial sequence removal

[0msrv load_model: speculative decoding will use checkpoints

[0mno implementations specified for speculative decoding

[0mslot load_model: id 0 | task -1 | new slot, n_ctx = 1048576

no implementations specified for speculative decoding

[0mslot load_model: id 1 | task -1 | new slot, n_ctx = 1048576

no implementations specified for speculative decoding

[0mslot load_model: id 2 | task -1 | new slot, n_ctx = 1048576

no implementations specified for speculative decoding

[0mslot load_model: id 3 | task -1 | new slot, n_ctx = 1048576

srv load_model: prompt cache is enabled, size limit: 8192 MiB

[0msrv load_model: use `--cache-ram 0` to disable the prompt cache

[0msrv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391

[0msrv init: init: idle slots will be saved to prompt cache and cleared upon starting a new task

init: chat template, example_format: '<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>'

srv init: init: chat template, thinking = 1

main: model loaded

main: server is listening on http://127.0.0.1:8080

[-]

LegacyRemaster@reddit

compiling

[-]

fragment_me@reddit

Well how did it go?

[-]

LegacyRemaster@reddit

testing

[-]

Then-Topic8766@reddit

This can be a reason for my building error (tried with -DGGML_CUDA=ON)

[-]

LegacyRemaster@reddit

128gb ram yeah

[-]

thereisonlythedance@reddit

It’s a shame DeepSeek and llama.cpp devs don’t coordinate much. Their architecture is complex and seemingly not well supported by llama.cpp (still no DSA, though I know one of the great devs is working on it).

[-]

LegacyRemaster@reddit

CMake Error at tools/CMakeLists.txt:22 (add_subdirectory):

add_subdirectory given source "deepseek4-quantize" which is not an existing

directory.

---------- added dir--------->

CMake Error at tools/CMakeLists.txt:22 (add_subdirectory):

The source directory

C:/llm/llama.cpp-deepseek-v4-flash/tools/deepseek4-quantize

does not contain a CMakeLists.txt file.

[-]

antirez@reddit (OP)

Fixed, sorry I used a tool that is not ready for prime time in order to generate the GGUF but forgot to remove from CMake.

[-]

LegacyRemaster@reddit

no cuda right ? print_info: FIM MID token = 128802 '<｜fim▁end｜>'

print_info: EOG token = 1 '<｜end▁of▁sentence｜>'

print_info: max token length = 256

load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = true)

load_tensors: offloading output layer to GPU

load_tensors: offloading 42 repeating layers to GPU

load_tensors: offloaded 44/44 layers to GPU

load_tensors: CUDA0 model buffer size = 81687.67 MiB

load_tensors: CUDA_Host model buffer size = 1010.00 MiB

...................................................................................................

common_init_result: added <｜end▁of▁sentence｜> logit bias = -inf

llama_context: constructing llama_context

llama_context: n_seq_max = 4

llama_context: n_ctx = 8192

llama_context: n_ctx_seq = 8192

llama_context: n_batch = 2048

llama_context: n_ubatch = 512

llama_context: causal_attn = 1

llama_context: flash_attn = auto

llama_context: kv_unified = true

llama_context: freq_base = 10000.0

llama_context: freq_scale = 0.0625

llama_context: n_ctx_seq (8192) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized

[0mllama_context: CUDA_Host output buffer size = 1.97 MiB

llama_kv_cache_iswa: creating non-SWA KV cache, size = 8192 cells

llama_kv_cache: size = 0.00 MiB ( 8192 cells, 0 layers, 4/1 seqs), K (f16): 0.00 MiB, V (f16): 0.00 MiB

llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 0

llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 0

llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells

llama_kv_cache: CUDA0 KV buffer size = 43.00 MiB

llama_kv_cache: size = 43.00 MiB ( 1024 cells, 43 layers, 4/1 seqs), K (f16): 43.00 MiB, V (f16): 0.00 MiB

llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512

llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512

llama_memory_recurrent: CUDA0 RS buffer size = 82.00 MiB

llama_memory_recurrent: size = 82.00 MiB ( 4 cells, 43 layers, 4 seqs), R (f32): 41.00 MiB, S (f32): 41.00 MiB

llama_memory_hybrid_iswa: CUDA0 DeepSeek4 compressed KV buffer size = 215.00 MiB

sched_reserve: reserving ...

sched_reserve: layer 2 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)

[0msched_reserve: Flash Attention was auto, set to disabled

[0msched_reserve: resolving fused Gated Delta Net support:

sched_reserve: fused Gated Delta Net (autoregressive) enabled

sched_reserve: fused Gated Delta Net (chunked) enabled

sched_reserve: CUDA0 compute buffer size = 281.57 MiB

sched_reserve: CUDA_Host compute buffer size = 224.42 MiB

sched_reserve: graph nodes = 9961 (with bs=512), 6936 (with bs=1)

sched_reserve: graph splits = 856 (with bs=512), 690 (with bs=1)

sched_reserve: reserve took 57.50 ms, sched copies = 1

common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

[0msrv load_model: initializing slots, n_slots = 4

common_context_can_seq_rm: the target context does not support partial sequence removal

[0msrv load_model: speculative decoding will use checkpoints

[0mno implementations specified for speculative decoding

[0mslot load_model: id 0 | task -1 | new slot, n_ctx = 8192

no implementations specified for speculative decoding

[0mslot load_model: id 1 | task -1 | new slot, n_ctx = 8192

no implementations specified for speculative decoding

[0mslot load_model: id 2 | task -1 | new slot, n_ctx = 8192

no implementations specified for speculative decoding

[0mslot load_model: id 3 | task -1 | new slot, n_ctx = 8192

srv load_model: prompt cache is enabled, size limit: 8192 MiB

[0msrv load_model: use `--cache-ram 0` to disable the prompt cache

[0msrv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391

[0msrv init: init: idle slots will be saved to prompt cache and cleared upon starting a new task

init: chat template, example_format: '<｜begin▁of▁sentence｜>You are a helpful assistant<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>'

srv init: init: chat template, thinking = 1

main: model loaded

main: server is listening on http://127.0.0.1:8080

main: starting the main loop...

srv update_slots: all slots are idle

srv params_from_: Chat format: peg-native

slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1

srv get_availabl: updating prompt cache

[0msrv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000

[0msrv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 8192 tokens, 8589934592 est)

[0msrv get_availabl: prompt cache update took 0.01 ms

[0mslot launch_slot_: id 3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist

slot launch_slot_: id 3 | task 0 | processing task, is_child = 0

slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, task.n_tokens = 25

slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)

srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 21, batch.n_tokens = 21, progress = 0.840000

C:\llm\llama.cpp-deepseek-v4-flash\build\bin\Release>

[-]

antirez@reddit (OP)

CPU / Metal right now. But GPT 5.5 should likely be able to implement CUDA just looking at the kernels of Metal.

[-]

LegacyRemaster@reddit

# Plan: Implement CUDA Kernels for DeepSeek v4 (DSV4) Operations


## Overview


The DeepSeek v4 model has 4 operations that only exist in the Metal backend and are missing from CUDA:
- `GGML_OP_DSV4_HC_SPLIT_SINKHORN`
- `GGML_OP_DSV4_HC_EXPAND`
- `GGML_OP_DSV4_FP8_KV_QUANTIZE`
- `GGML_OP_DSV4_ROPE_TAIL`


This plan outlines porting these 4 kernels from Metal to CUDA.


## Files to Create


### 1. `ggml/src/ggml-cuda/dsv4.cuh`
Header file declaring the 4 CUDA operation functions:
```cpp
bool ggml_cuda_op_dsv4_hc_split_sinkhorn(ggml_cuda_ctx_t * ctx, struct ggml_tensor * dst);
bool ggml_cuda_op_dsv4_hc_expand(ggml_cuda_ctx_t * ctx, struct ggml_tensor * dst);
bool ggml_cuda_op_dsv4_fp8_kv_quantize(ggml_cuda_ctx_t * ctx, struct ggml_tensor * dst);
bool ggml_cuda_op_dsv4_rope_tail(ggml_cuda_ctx_t * ctx, struct ggml_tensor * dst);
```


### 2. `ggml/src/ggml-cuda/dsv4.cu`
Implementation file containing:
- Kernel argument structs (mirroring Metal's `ggml_metal_kargs_dsv4_*`)
- CUDA kernel implementations for each operation
- Helper functions (e.g., E4M3 quantization for `dsv4_fp8_kv_quantize`)


## Files to Modify


### 1. `ggml/src/ggml-cuda/ggml-cuda.cu`
Add dispatch cases in `ggml_cuda_compute_forward()` switch statement:
- Case `GGML_OP_DSV4_HC_SPLIT_SINKHORN` -> call `ggml_cuda_op_dsv4_hc_split_sinkhorn`
- Case `GGML_OP_DSV4_HC_EXPAND` -> call `ggml_cuda_op_dsv4_hc_expand`
- Case `GGML_OP_DSV4_FP8_KV_QUANTIZE` -> call `ggml_cuda_op_dsv4_fp8_kv_quantize`
- Case `GGML_OP_DSV4_ROPE_TAIL` -> call `ggml_cuda_op_dsv4_rope_tail`


### 2. `ggml/src/ggml-cuda/CMakeLists.txt`
Add `dsv4.cu` to the CUDA backend source file list.


## Kernel Implementations


### 1. `dsv4_hc_split_sinkhorn`
**Source:**
 `ggml-metal.metal` lines 2076-2180


- 
**Grid:**
 `(n_rows + nth - 1) / nth` threadgroups, `nth = min(256, max(1, n_rows))`
- 
**Parameters:**
 `n_hc`, `sinkhorn_iters`, `n_rows`, `mix_hc`, `nb01`, `nb1`, `eps`
- 
**Logic:**
  - Applies sigmoid activation to first/second HC elements
  - Computes HCxHC coupling matrix C with comb_scale
  - Runs Sinkhorn algorithm for `sinkhorn_iters` iterations
  - Outputs processed mixture with coupling matrix appended


### 2. `dsv4_hc_expand`
**Source:**
 `ggml-metal.metal` lines 2182-2211


- 
**Grid:**
 `(n_elem + nth - 1) / nth` threadgroups, `nth = min(256, max(1, n_elem))`
- 
**Parameters:**
 `n_embd`, `n_hc`, `n_tokens`, multiple stride values
- 
**Logic:**
  - For each token t, embedding d, dst_hc: reads `block_out[d,t]` and `post[dst_hc,t]`
  - Accumulates `block_v * post_v`
  - Sums contributions: `sum(comb[dst_hc, src_hc, t] * residual[d, src_hc, t])`


### 3. `dsv4_fp8_kv_quantize`
**Source:**
 `ggml-metal.metal` lines 2239-2287


- 
**Grid:**
 `n_rows = ne01*ne02*ne03` threadgroups, 64 threads per threadgroup
- 
**Parameters:**
 `ne00-ne03`, `nb00-nb03`, `nb0-nb3`, `n_rot`
- 
**Logic:**
  - Quantizes f32 to FP8 E4M3 format
  - Uses threadgroup shared memory for reduction to find amax
  - Processes `n_nope = ne00 - n_rot` dimensions in chunks of 64
  - Rotated dimensions copied directly without quantization
- 
**Helper:**
 E4M3 dequantization function needed


### 4. `dsv4_rope_tail`
**Source:**
 `ggml-metal.metal` lines 4817-4904


- 
**Grid:**
 `(ne01, ne02, ne03)` threadgroups, `min(256, ne00)` threads per threadgroup
- 
**Parameters:**
 `ne00-ne03`, strides, `n_dims`, `mode`, `n_ctx_orig`, `freq_base`, `freq_scale`, YaRN parameters
- 
**Logic:**
  - Copies first `n_nope = ne00 - n_dims` dimensions unchanged
  - Applies YaRN (Yet another RoPE extension) RoPE to last `n_dims`
  - Supports Neox (mode==2) and interleaved styles
  - Uses position tensor for per-token theta_base


## Implementation Order


1. Create `dsv4.cuh` with function declarations
2. Create `dsv4.cu` with `dsv4_hc_split_sinkhorn` kernel and launch wrapper
3. Add `dsv4_hc_expand` kernel and wrapper
4. Add `dsv4_fp8_kv_quantize` kernel with E4M3 helpers and wrapper
5. Add `dsv4_rope_tail` kernel with YaRN helpers and wrapper
6. Update `ggml-cuda.cu` with dispatch cases
7. Update `CMakeLists.txt`
8. Build and test


## Verification


After implementation, verify by:
1. Compiling with CUDA enabled (`GGML_CUDA=ON`)
2. Running DeepSeek v4 model inference with CUDA backend
3. Comparing outputs with Metal backend for correctness

[-]

MotokoAGI@reddit

Are you going to do it?

[-]

LegacyRemaster@reddit

[-]

LegacyRemaster@reddit

we will see...

[-]

LegacyRemaster@reddit

testing on cpu only now but... on windows --> no way

[-]

MotokoAGI@reddit

Can you please upload a q8 gguf?

[-]

Then-Topic8766@reddit

Me too :(

[-]

LegacyRemaster@reddit

Fixed. The deepseek4-quantize directory exists but is empty - it has no CMakeLists.txt or source files. I removed the add_subdirectory(deepseek4-quantize) reference from tools/CMakeLists.txt (line 22).

[-]

tarruda@reddit

Yea got this too.

[-]

LegacyRemaster@reddit

always a good occasion to test something :d

[-]

LegacyRemaster@reddit

Fixed. The deepseek4-quantize directory exists but is empty - it has no CMakeLists.txt or source files. I removed the add_subdirectory(deepseek4-quantize) reference from tools/CMakeLists.txt (line 22).

[-]

Then-Topic8766@reddit

Thank you, it is building now.

[-]

MotokoAGI@reddit

Good stuff, thanks antirez, gonna test drive this.

[-]

antirez@reddit (OP)

For the first time, even with this selective 2 bit quantization, I feel like I have a frontier model running on my computer. The quality of the replies is incredible, and its mental order, the fact that it thinks the right amount of time based on the question complexity. The language used. Incredibly cool.

[-]