Is it possible to edit llama.cpp with Cline + VSCode + MiniMax 2.7 Q4_K_S and get a working build?
Posted by LegacyRemaster@reddit | LocalLLaMA | 11 comments
It all started yesterday with this post by u/antirez
I was intrigued by the first DeepSeek V4 Flash GGUF small enough to fit on an RTX 6000 96GB.
I wasn't optimistic, and I was right: as stated, it ran on CPU/Metal and there were several problems. The author said the editing was done with GPT 5.5, so I wondered: should we try porting the CPU/Metal code to CUDA with a 100% local setup?
I started with VSCode + KiloCode but then migrated to Cline.
After an hour of work, I finally managed to compile. Launch command:
```bash
start C:\llm\llamaVulkan\build\bin\Release\llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 1920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
```
The result is still not satisfactory, and now I'm trying to improve execution on CUDA.
However, I did get an initial answer to the question: can a 100% local setup that costs me nothing but electricity run the first local GGUF of V4 Flash before any official release, starting from "AI-based" work on GitHub?
The answer is yes. Local LLMs are becoming very powerful.
My pull request: https://github.com/antirez/llama.cpp-deepseek-v4-flash/pulls
Here's the next step (in progress):
## Analysis: DeepSeek-V4-Flash Graph Split Bottleneck
### 1. Why DeepSeek-V4-Flash Generates 856 Graph Splits
The `ggml_backend_sched_print_assignments()` function counts splits:
```cpp
// simplified excerpt: walk the graph and count backend transitions
for (int i = 0; i < graph->n_nodes; i++) {
    if (*node_backend_id != cur_backend_id) {
        cur_split++; // each backend transition = one split
    }
    *node_backend_id = cur_backend_id;
}
```
The high split count occurs because (see the inspection sketch after this list):
- **Mamba/GDN recurrent state tensors** alternate between CUDA (computation) and CPU (state storage)
- Each Mamba layer has ~20+ operations: `select`, `gate`, `scan`, `normalize`, `sigmoid`, `element-wise multiply`, etc.
- The **state tensors** (hidden state, gate state) appear to be allocated on CPU, forcing synchronization per layer
- With bs=512: more parallel operations, more state tensor accesses → 856 splits
- With bs=1: fewer parallel operations → 690 splits
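To confirm which nodes are actually bouncing between backends, one option is to walk the scheduled graph and log every backend transition. This is only a sketch against the public `ggml-backend.h` API; the helper name and print format are made up:
```cpp
#include <stdio.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical debug helper: after the scheduler has assigned backends,
// walk the graph and report every backend transition (i.e. every split).
static void dump_backend_transitions(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    ggml_backend_t prev = NULL;
    int n_transitions = 0;
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        ggml_backend_t cur = ggml_backend_sched_get_tensor_backend(sched, node);
        if (cur != prev) {
            n_transitions++;
            printf("node %4d %-32s -> %s\n", i, node->name, cur ? ggml_backend_name(cur) : "(unassigned)");
            prev = cur;
        }
    }
    printf("backend transitions: %d\n", n_transitions);
}
```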
### 2. Which Tensors/Operations Are Assigned to CPU
Split causes identified in code:
- `"1.dst"` - dst tensor has no backend → CPU fallback
- `"1.vsrc"` - view source tensor on different backend
- `"1.inp"` - input tensor on CPU backend
- `"1.off"` - operation not offloaded to CUDA
- `"2.sup"` - unsupported operation on CUDA
- `"3.upg"` - upgraded tensor (copy needed)
- `"4.cpy"` - tensor copy between backends
**Most likely culprits for DeepSeek-V4:**
- **Mamba state tensors** - recurrent hidden states allocated on CPU
- **Custom DSV4 operations** if CUDA kernels aren't registered
- **Element-wise operations on state** that require CPU reads
### 3. Can Mamba/GDN Operations Be Fused or Kept on CUDA?
**Yes, this is the key optimization:**
- **Fuse the Mamba step into a single CUDA kernel** (see the fused-kernel sketch below):
  - Combine: select → gate → scan → normalize → output
  - Currently likely broken into 10-20 separate node operations
- **Keep recurrent state on CUDA:**
  - Allocate the state tensors in a CUDA device buffer (e.g. via `ggml_backend_cuda_buffer_type()`)
  - Avoids a CPU↔GPU transfer per token
- **Register custom DSV4 ops as CUDA-offloadable:**
```cpp
// in ggml-cuda (the backend's supports_op hook)
static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    // add GGML_OP_DSV4_* to the CUDA-supported ops before the existing per-op checks
    return true; // simplified; the real function dispatches on op->op
}
```
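To make the fusion idea above concrete, here is a minimal sketch of a fused per-token recurrent step: one kernel instead of separate gate/decay/multiply/activation nodes. The update rule is a simplified stand-in, not the actual DSV4/GDN math, and all names are hypothetical:
```cpp
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical fused kernel: one simplified gated recurrent update per state
// element, replacing what would otherwise be several separate graph nodes.
__global__ void fused_gated_step_f32(
        const float * x,      // input projection,  [d_state]
        const float * gate,   // gate logits,       [d_state]
        const float * decay,  // per-channel decay, [d_state]
        float       * state,  // recurrent state, updated in place, [d_state]
        float       * out,    // output,            [d_state]
        int           d_state) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= d_state) {
        return;
    }
    const float g = 1.0f / (1.0f + expf(-gate[i]));  // sigmoid gate
    const float h = decay[i] * state[i] + g * x[i];  // recurrent update
    state[i] = h;                                    // state never leaves VRAM
    out[i]   = h * g;                                // gated output
}

// Launch helper: one thread per state element.
static void fused_gated_step(const float * x, const float * gate, const float * decay,
                             float * state, float * out, int d_state, cudaStream_t stream) {
    const int block = 256;
    const int grid  = (d_state + block - 1) / block;
    fused_gated_step_f32<<<grid, block, 0, stream>>>(x, gate, decay, state, out, d_state);
}
```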
### 4. How to Force All Operations to CUDA
**Option A: Environment Variable (Immediate)**
```bash
export GGML_SCHED_DEBUG=1 # View assignment
# Then modify ggml-backend.cpp to force CUDA
```
**Option B: Modify Backend Assignment (Recommended)**
In `ggml-backend.cpp`, in the scheduler's backend-assignment pass (`ggml_backend_sched_split_graph()`; `ggml_backend_sched_print_assignments()` only reports the result), force the tensors that currently fall back to CPU onto the CUDA backend:
```cpp
// Find where tensors are assigned to CPU and modify:
// Change:
cur_backend_id = sched->n_backends - 1; // last backend (assumed CPU)
// To check for CUDA and use that instead:
// Find CUDA backend index and use it for state tensors
```
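A less invasive variant of Option B, assuming the model code can hand the scheduler its per-layer state tensors, is to pin them explicitly with `ggml_backend_sched_set_tensor_backend()` before graph compute. This is a sketch; the helper and the backend-index-0 assumption are mine:
```cpp
#include "ggml-backend.h"

// Hypothetical: pin the recurrent state tensors to the CUDA backend so the
// scheduler never falls back to CPU for them. Assumes backend index 0 is CUDA.
static void pin_state_tensors_to_cuda(ggml_backend_sched_t sched,
                                      struct ggml_tensor ** state_tensors, int n_states) {
    ggml_backend_t cuda = ggml_backend_sched_get_backend(sched, 0);
    for (int i = 0; i < n_states; i++) {
        ggml_backend_sched_set_tensor_backend(sched, state_tensors[i], cuda);
    }
}
```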
**Option C: CUDA Buffer Allocation for State**
In `src/models/deepseek4.cpp`, allocate recurrent state tensors on CUDA:
```cpp
// Instead of ggml_view_tensor(ctx, prev_state, ...) backed by a CPU buffer,
// allocate the recurrent state tensor in a CUDA-backed buffer so it stays on the GPU
```
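With the existing ggml allocation API this could look roughly like the sketch below, assuming the recurrent states are created in their own `ggml_context` (the helper name is made up):
```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

// Hypothetical: allocate every tensor in ctx_state directly in VRAM so the
// scheduler never has to copy the recurrent state between CPU and GPU.
static ggml_backend_buffer_t alloc_state_on_cuda(struct ggml_context * ctx_state) {
    ggml_backend_buffer_type_t buft = ggml_backend_cuda_buffer_type(0); // device 0
    return ggml_backend_alloc_ctx_tensors_from_buft(ctx_state, buft);
}
```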
### Recommended Solution Path
- **Enable debug logging** to identify exact split causes:
```bash
GGML_SCHED_DEBUG=1 ./deepseek4 inference...
```
- **Implement CUDA kernels for:**
  - `GGML_OP_DSV4_HC_SPLIT_SINKHORN`
  - Mamba selective scan
  - GDN normalization
- **Keep recurrent state on GPU** - allocate persistent tensors on a CUDA buffer
- **Target:** reduce from 856 → <100 splits by:
  - Fusing the Mamba layer into 1-2 nodes instead of 20+
  - Keeping the state tensor on CUDA (eliminates ~20 splits per layer)
  - Using persistent compute graphs
### Key Files to Modify
- `ggml/src/ggml-backend.cpp` - backend assignment logic
- `ggml/src/ggml-cuda/mmq.cu` or a new file - Mamba CUDA kernels
- `src/models/deepseek4.cpp` - state tensor allocation
- `ggml/src/ggml.c` - register DSV4 ops as CUDA-supported
Farmadupe@reddit
Not sure whether renaming `float` to `__half` or bumping the entire project from `cpp17` to `cpp20` is more adventurous, but good luck getting your deepseek4 implementation merged!
LegacyRemaster@reddit (OP)
We need:
- C++17-compatible syntax in concat.cu (replace `if constexpr` with regular if/else or template specialization)
- A separate parallel implementation without modifying the F32 path

This makes the PR minimal and safe to merge.
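For illustration, a minimal sketch of the kind of replacement described above (hypothetical names, not taken from concat.cu):
```cpp
#include <cuda_fp16.h>
#include <type_traits>

// if constexpr form (this construct is itself C++17)
template <typename dst_t>
__device__ void store_as(dst_t * dst, float v) {
    if constexpr (std::is_same_v<dst_t, half>) {
        *dst = __float2half(v);
    } else {
        *dst = v;
    }
}

// equivalent without if constexpr: ordinary overloads selected at compile time
__device__ inline void store_as_plain(half * dst, float v)  { *dst = __float2half(v); }
__device__ inline void store_as_plain(float * dst, float v) { *dst = v; }
```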
fairydreaming@reddit
There are already hundreds of `if constexpr` in llama.cpp CUDA code. There are 4 in the very concat.cu file where you want to replace them with regular ifs. Hint: it helps if you actually read and understand the code that you want to modify.
LegacyRemaster@reddit (OP)
This is not the purpose of this "demo". The purpose is to demonstrate automation with minimax 2.7, not to replace the developers.
fairydreaming@reddit
I just wanted to say that removing all "if constexpr" won't make your PR minimal like you said above.
LegacyRemaster@reddit (OP)
It will never be merged into llama.cpp. Every day I see posts like "I did Tetris" or other ephemeral things. Here's a DeepSeek V4 Flash that runs slowly, but it runs. And all with a local setup. That's the news.
Formal-Exam-8767@reddit
Wasn't support for some models already added using a similar method, i.e. porting Python transformers code to llama.cpp with the help of an LLM?
HyperWinX@reddit
Keep that slop to yourself please.
LegacyRemaster@reddit (OP)
I'm just showing that something that was pure utopia months ago is now becoming possible. Whether this is "quality" is another topic.
HyperWinX@reddit
People here know that perfectly fine, no need to spam.
LegacyRemaster@reddit (OP)
Adding an image of it running.