Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build?

Posted by LegacyRemaster@reddit | LocalLLaMA


It all started yesterday with this post by u/antirez

https://www.reddit.com/r/LocalLLaMA/comments/1sw3stb/llamacpp_deepseek_v4_flash_experimental_inference/

I was intrigued by the first DeepSeek V4 Flash GGUF small enough to fit on an RTX 6000 with 96 GB.

I wasn't optimistic, and I was right: as stated in that post, it only ran on CPU/Metal, and there were several problems. The author said the editing was done with GPT 5.5, so I wondered: should we try porting the CPU/Metal code to CUDA with a 100% local setup?

I started with VSCode + KiloCode but then switched to Cline.

After an hour of work, with MiniMax served locally via `start C:\llm\llamaVulkan\build\bin\Release\llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 1920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1`, I finally managed to compile.

The result is still not satisfactory, and I'm now trying to improve execution on CUDA.

However, I did get an initial answer to the question: can a 100% local setup, which costs me nothing but electricity, run the first local GGUF of V4 Flash before any official release, starting from "AI-based" work on GitHub?

The answer is yes. Local LLMs are becoming very powerful.

My pull request: https://github.com/antirez/llama.cpp-deepseek-v4-flash/pulls

Here's the next step (in progress):

## Analysis: DeepSeek-V4-Flash Graph Split Bottleneck

### 1. Why DeepSeek-V4-Flash Generates 856 Graph Splits

The `ggml_backend_sched_print_assignments()` debug output reports the splits created by the scheduler; schematically, a new split starts whenever the assigned backend changes between consecutive graph nodes:

```cpp
// Simplified schematic of the split counting: a new split starts whenever the
// backend assigned to a node differs from the backend of the current split.
for (int i = 0; i < graph->n_nodes; i++) {
    int node_backend_id = tensor_backend_id(graph->nodes[i]); // assigned backend
    if (node_backend_id != cur_backend_id) {
        cur_split++;                      // each backend transition = 1 split
        cur_backend_id = node_backend_id;
    }
}
```

The high split count occurs for the following reasons (a toy counting sketch follows the list):

- **Mamba/GDN recurrent state tensors** alternate between CUDA (computation) and CPU (state storage)

- Each Mamba layer has ~20+ operations: `select`, `gate`, `scan`, `normalize`, `sigmoid`, `element-wise multiply`, etc.

- The **state tensors** (hidden state, gate state) appear to be allocated on CPU, forcing synchronization per layer

- With batch size 512: more parallel operations and more state-tensor accesses → 856 splits

- With batch size 1: fewer parallel operations → 690 splits
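
As a rough sanity check on those numbers, here is a toy sketch (not llama.cpp code; the layer count and per-layer transition count are made-up assumptions) of how per-layer CUDA↔CPU bounces multiply into hundreds of splits:

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers, for illustration only: if each of ~45 layers
    // bounces between CUDA and CPU ~19 times per graph evaluation, the
    // scheduler ends up with roughly 45 * 19 = 855 splits.
    const int n_layers              = 45; // assumed layer count
    const int transitions_per_layer = 19; // assumed CUDA<->CPU transitions
    printf("approx splits: %d\n", n_layers * transitions_per_layer);
    return 0;
}
```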

### 2. Which Tensors/Operations Are Assigned to CPU

Split causes identified in code:

- `"1.dst"` - dst tensor has no backend → CPU fallback

- `"1.vsrc"` - view source tensor on different backend

- `"1.inp"` - input tensor on CPU backend

- `"1.off"` - operation not offloaded to CUDA

- `"2.sup"` - unsupported operation on CUDA

- `"3.upg"` - upgraded tensor (copy needed)

- `"4.cpy"` - tensor copy between backends

**Most likely culprits for DeepSeek-V4:**

  1. **Mamba state tensors** - recurrent hidden states allocated on CPU

  2. **Custom DSV4 operations** if CUDA kernels aren't registered

  3. **Element-wise operations on state** that require CPU reads

### 3. Can Mamba/GDN Operations Be Fused or Kept on CUDA?

**Yes, this is the key optimization:**

  1. **Fuse Mamba step into single CUDA kernel:**

- Combine: select → gate → scan → normalize → output

- Currently likely broken into 10-20 separate node operations (see the conceptual sketch after this list)

  2. **Keep recurrent state on CUDA:**

- Use `ggml_backend_cuda_buffer_from_host()` for state tensors

- Avoids CPU↔GPU transfer per token

  3. **Register custom DSV4 ops as CUDA-offloadable:**

```cpp
// In ggml-backend-impl.h / the CUDA backend: report the custom DSV4 ops as
// supported so the scheduler keeps them on CUDA instead of falling back to CPU.
static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    // add GGML_OP_DSV4_* to the existing checks on op->op
    return true; // placeholder: the real function inspects op->op and tensor types
}
```
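
For item 1, a conceptual sketch of the problem (weight names and ops are stand-ins, not the real DSV4 graph): today the recurrent step is assembled from many small ggml nodes, and every node boundary is a point where the scheduler may switch backends.

```cpp
#include "ggml.h"

// Conceptual only: a per-layer recurrent step built from several small nodes.
// Each node boundary is a potential CUDA<->CPU split point.
static struct ggml_tensor * build_recurrent_step(struct ggml_context * ctx,
                                                 struct ggml_tensor * x,
                                                 struct ggml_tensor * state,
                                                 struct ggml_tensor * w_gate,
                                                 struct ggml_tensor * w_out) {
    struct ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w_gate, x));      // 2 nodes
    struct ggml_tensor * hs   = ggml_mul(ctx, ggml_norm(ctx, state, 1e-5f), gate); // 2 nodes
    // ... plus many more nodes for the selective scan and state update ...
    return ggml_mul_mat(ctx, w_out, hs);                                           // 1 node
}
```

A fused implementation would replace this whole chain with a single custom op backed by one CUDA kernel, leaving roughly one node per layer for the scheduler to place.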

### 4. How to Force All Operations to CUDA

**Option A: Environment Variable (Immediate)**

```bash
export GGML_SCHED_DEBUG=1   # view the per-node backend assignments
# then modify ggml-backend.cpp to force CUDA where CPU fallbacks appear
```

**Option B: Modify Backend Assignment (Recommended)**

In the backend assignment logic in `ggml-backend.cpp` (around line 1410 in this tree), force the tensors onto the CUDA backend instead of letting them fall back to CPU:

```cpp
// Find where tensors are assigned to the CPU fallback. Currently:
cur_backend_id = sched->n_backends - 1; // last backend (assumed CPU)
// Instead: look up the CUDA backend's index and use it for the state tensors.
```
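
A minimal sketch of that change, assuming (as in upstream `ggml-backend.cpp`) that the CPU backend is the last entry in `sched->backends` and that a GPU backend, if present, comes before it:

```cpp
// Prefer the first non-CPU backend (the CUDA backend in a single-GPU build)
// instead of unconditionally falling back to the CPU backend.
const int cpu_backend_id = sched->n_backends - 1;        // CPU is the last backend
const int gpu_backend_id = (sched->n_backends > 1) ? 0   // first backend = GPU, if any
                                                   : cpu_backend_id;
cur_backend_id = gpu_backend_id;
```

Note this is a blunt instrument: it only helps if the CUDA backend actually supports the ops being forced onto it.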

**Option C: CUDA Buffer Allocation for State**

In `src/models/deepseek4.cpp`, allocate recurrent state tensors on CUDA:

```cpp
// Instead of ggml_view_tensor(ctx, prev_state, ...) over a CPU-allocated buffer,
// back the recurrent state with a CUDA device buffer so it stays on the GPU
// (see the sketch below).
```
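
A minimal sketch of that allocation using the generic ggml backend API (the tensor name and dimensions are made-up placeholders, and the context is assumed to be created with `no_alloc = true`):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

// Create the recurrent state tensor and back it with a CUDA device buffer,
// so the scheduler never has to copy it to the CPU between layers.
static ggml_backend_buffer_t alloc_state_on_gpu(struct ggml_context * ctx) {
    struct ggml_tensor * state = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 128, 4096); // placeholder sizes
    ggml_set_name(state, "dsv4_recurrent_state");                                   // placeholder name
    // allocate every tensor created in ctx in a CUDA buffer on device 0
    return ggml_backend_alloc_ctx_tensors_from_buft(ctx, ggml_backend_cuda_buffer_type(0));
}
```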

### Recommended Solution Path

  1. **Enable debug logging** to identify exact split causes:

```bash
GGML_SCHED_DEBUG=1 ./deepseek4 inference...
```

  2. **Implement CUDA kernels for:**

- `GGML_OP_DSV4_HC_SPLIT_SINKHORN`

- Mamba selective scan

- GDN normalization

  3. **Keep recurrent state on GPU** - allocate persistent state tensors in a CUDA buffer

  4. **Target:** Reduce from 856 → <100 splits by:

- Fusing Mamba layer into 1-2 nodes instead of 20+

- Keeping state tensor on CUDA (eliminates ~20 splits per layer)

- Using persistent compute graphs

### Key Files to Modify

  1. `ggml/src/ggml-backend.cpp` - Backend assignment logic

  2. `ggml/src/ggml-cuda/mmq.cu` or new file - Mamba CUDA kernels

  3. `src/models/deepseek4.cpp` - State tensor allocation

  4. `ggml/src/ggml.c` - Define the new DSV4 ops (the CUDA backend's `supports_op` then reports them as CUDA-supported)