Is it possible to edit llama.cpp with Cline + VSCode + MiniMax 2.7 Q4_K_S and get a working build?
Posted by LegacyRemaster@reddit | LocalLLaMA | 11 comments
It all started yesterday with this post by u/antirez
I was intrigued by the first DeepSeek V4 Flash GGUF small enough to fit on an RTX 6000 96GB.
I wasn't optimistic, and I was right: as stated, it ran on CPU/Metal and there were several problems. The author said the editing was done with GPT 5.5, so I wondered: should we try porting the CPU/Metal code to CUDA with a 100% local setup?
I started with VSCode + KiloCode but then migrated to Cline.
After an hour of work, I finally managed to compile. Launch command:
```bash
start C:\llm\llamaVulkan\build\bin\Release\llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 1920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
```
The result is still not satisfactory, and now I'm trying to improve execution on CUDA.
However, I did get an initial answer to the question: can a 100% local setup that costs me nothing but electricity run the first local GGUF of V4 Flash before any official release, starting from "AI-based" work on GitHub?
The answer is yes. Local LLMs are becoming very powerful.
My pull request: https://github.com/antirez/llama.cpp-deepseek-v4-flash/pulls
Here's the next step (in progress):
## Analysis: DeepSeek-V4-Flash Graph Split Bottleneck
### 1. Why DeepSeek-V4-Flash Generates 856 Graph Splits
The `ggml_backend_sched_print_assignments()` function counts splits:
```cpp
// simplified excerpt: walk the graph and count backend transitions
for (int i = 0; i < graph->n_nodes; i++) {
    if (*node_backend_id != cur_backend_id) {
        cur_split++; // each backend transition = one split
    }
    *node_backend_id = cur_backend_id;
}
```
The high split count occurs because (see the inspection sketch after this list):
- **Mamba/GDN recurrent state tensors** alternate between CUDA (computation) and CPU (state storage)
- Each Mamba layer has ~20+ operations: `select`, `gate`, `scan`, `normalize`, `sigmoid`, `element-wise multiply`, etc.
- The **state tensors** (hidden state, gate state) appear to be allocated on CPU, forcing synchronization per layer
- With bs=512: more parallel operations, more state tensor accesses → 856 splits
- With bs=1: fewer parallel operations → 690 splits
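To confirm which nodes are actually bouncing between backends, one option is to walk the scheduled graph and log every backend transition. This is only a sketch against the public `ggml-backend.h` API; the helper name and print format are made up:
```cpp
#include <stdio.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical debug helper: after the scheduler has assigned backends,
// walk the graph and report every backend transition (i.e. every split).
static void dump_backend_transitions(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    ggml_backend_t prev = NULL;
    int n_transitions = 0;
    for (int i = 0; i < ggml_graph_n_nodes(gf); i++) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        ggml_backend_t cur = ggml_backend_sched_get_tensor_backend(sched, node);
        if (cur != prev) {
            n_transitions++;
            printf("node %4d %-32s -> %s\n", i, node->name, cur ? ggml_backend_name(cur) : "(unassigned)");
            prev = cur;
        }
    }
    printf("backend transitions: %d\n", n_transitions);
}
```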
### 2. Which Tensors/Operations Are Assigned to CPU
Split causes identified in code:
- `"1.dst"` - dst tensor has no backend → CPU fallback
- `"1.vsrc"` - view source tensor on different backend
- `"1.inp"` - input tensor on CPU backend
- `"1.off"` - operation not offloaded to CUDA
- `"2.sup"` - unsupported operation on CUDA
- `"3.upg"` - upgraded tensor (copy needed)
- `"4.cpy"` - tensor copy between backends
**Most likely culprits for DeepSeek-V4:**
- **Mamba state tensors** - recurrent hidden states allocated on CPU
- **Custom DSV4 operations** if CUDA kernels aren't registered
- **Element-wise operations on state** that require CPU reads
### 3. Can Mamba/GDN Operations Be Fused or Kept on CUDA?
**Yes, this is the key optimization:**
- **Fuse the Mamba step into a single CUDA kernel** (see the fused-kernel sketch below):
  - Combine: select → gate → scan → normalize → output
  - Currently likely broken into 10-20 separate node operations
- **Keep recurrent state on CUDA:**
  - Allocate the state tensors in a CUDA device buffer (e.g. via `ggml_backend_cuda_buffer_type()`)
  - Avoids a CPU↔GPU transfer per token
- **Register custom DSV4 ops as CUDA-offloadable:**
```cpp
// in ggml-cuda (the backend's supports_op hook)
static bool ggml_backend_cuda_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
    // add GGML_OP_DSV4_* to the CUDA-supported ops before the existing per-op checks
    return true; // simplified; the real function dispatches on op->op
}
```
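To make the fusion idea above concrete, here is a minimal sketch of a fused per-token recurrent step: one kernel instead of separate gate/decay/multiply/activation nodes. The update rule is a simplified stand-in, not the actual DSV4/GDN math, and all names are hypothetical:
```cpp
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical fused kernel: one simplified gated recurrent update per state
// element, replacing what would otherwise be several separate graph nodes.
__global__ void fused_gated_step_f32(
        const float * x,      // input projection,  [d_state]
        const float * gate,   // gate logits,       [d_state]
        const float * decay,  // per-channel decay, [d_state]
        float       * state,  // recurrent state, updated in place, [d_state]
        float       * out,    // output,            [d_state]
        int           d_state) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= d_state) {
        return;
    }
    const float g = 1.0f / (1.0f + expf(-gate[i]));  // sigmoid gate
    const float h = decay[i] * state[i] + g * x[i];  // recurrent update
    state[i] = h;                                    // state never leaves VRAM
    out[i]   = h * g;                                // gated output
}

// Launch helper: one thread per state element.
static void fused_gated_step(const float * x, const float * gate, const float * decay,
                             float * state, float * out, int d_state, cudaStream_t stream) {
    const int block = 256;
    const int grid  = (d_state + block - 1) / block;
    fused_gated_step_f32<<<grid, block, 0, stream>>>(x, gate, decay, state, out, d_state);
}
```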
### 4. How to Force All Operations to CUDA
**Option A: Environment Variable (Immediate)**
```bash
export GGML_SCHED_DEBUG=1 # View assignment
# Then modify ggml-backend.cpp to force CUDA
```
**Option B: Modify Backend Assignment (Recommended)**
In `ggml-backend.cpp`, in the scheduler's backend-assignment pass (`ggml_backend_sched_split_graph()`; `ggml_backend_sched_print_assignments()` only reports the result), force the tensors that currently fall back to CPU onto the CUDA backend:
```cpp
// Find where tensors are assigned to CPU and modify:
// Change:
cur_backend_id = sched->n_backends - 1; // last backend (assumed CPU)
// To check for CUDA and use that instead:
// Find CUDA backend index and use it for state tensors
```
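A less invasive variant of Option B, assuming the model code can hand the scheduler its per-layer state tensors, is to pin them explicitly with `ggml_backend_sched_set_tensor_backend()` before graph compute. This is a sketch; the helper and the backend-index-0 assumption are mine:
```cpp
#include "ggml-backend.h"

// Hypothetical: pin the recurrent state tensors to the CUDA backend so the
// scheduler never falls back to CPU for them. Assumes backend index 0 is CUDA.
static void pin_state_tensors_to_cuda(ggml_backend_sched_t sched,
                                      struct ggml_tensor ** state_tensors, int n_states) {
    ggml_backend_t cuda = ggml_backend_sched_get_backend(sched, 0);
    for (int i = 0; i < n_states; i++) {
        ggml_backend_sched_set_tensor_backend(sched, state_tensors[i], cuda);
    }
}
```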
**Option C: CUDA Buffer Allocation for State**
In `src/models/deepseek4.cpp`, allocate recurrent state tensors on CUDA:
```cpp
// Instead of ggml_view_tensor(ctx, prev_state, ...) backed by a CPU buffer,
// allocate the recurrent state tensor in a CUDA-backed buffer so it stays on the GPU
```
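With the existing ggml allocation API this could look roughly like the sketch below, assuming the recurrent states are created in their own `ggml_context` (the helper name is made up):
```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

// Hypothetical: allocate every tensor in ctx_state directly in VRAM so the
// scheduler never has to copy the recurrent state between CPU and GPU.
static ggml_backend_buffer_t alloc_state_on_cuda(struct ggml_context * ctx_state) {
    ggml_backend_buffer_type_t buft = ggml_backend_cuda_buffer_type(0); // device 0
    return ggml_backend_alloc_ctx_tensors_from_buft(ctx_state, buft);
}
```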
### Recommended Solution Path
- **Enable debug logging** to identify exact split causes:
```bash
GGML_SCHED_DEBUG=1 ./deepseek4 inference...
```
- **Implement CUDA kernels for:**
  - `GGML_OP_DSV4_HC_SPLIT_SINKHORN`
  - Mamba selective scan
  - GDN normalization
- **Keep recurrent state on GPU** - allocate persistent tensors on a CUDA buffer
- **Target:** reduce from 856 → <100 splits by:
  - Fusing the Mamba layer into 1-2 nodes instead of 20+
  - Keeping the state tensor on CUDA (eliminates ~20 splits per layer)
  - Using persistent compute graphs
### Key Files to Modify
- `ggml/src/ggml-backend.cpp` - backend assignment logic
- `ggml/src/ggml-cuda/mmq.cu` or a new file - Mamba CUDA kernels
- `src/models/deepseek4.cpp` - state tensor allocation
- `ggml/src/ggml.c` - register DSV4 ops as CUDA-supported
Farmadupe@reddit
Not sure whether renaming `float` to `__half` or bumping the entire project from `cpp17` to `cpp20` is more adventurous, but good luck getting your deepseek4 implementation merged!
LegacyRemaster@reddit (OP)
We need:
- C++17-compatible syntax in concat.cu (replace `if constexpr` with regular if/else or template specialization)
- A separate parallel implementation without modifying the F32 path

This makes the PR minimal and safe to merge.
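For illustration, a minimal sketch of the kind of replacement described above (hypothetical names, not taken from concat.cu):
```cpp
#include <cuda_fp16.h>
#include <type_traits>

// if constexpr form (this construct is itself C++17)
template <typename dst_t>
__device__ void store_as(dst_t * dst, float v) {
    if constexpr (std::is_same_v<dst_t, half>) {
        *dst = __float2half(v);
    } else {
        *dst = v;
    }
}

// equivalent without if constexpr: ordinary overloads selected at compile time
__device__ inline void store_as_plain(half * dst, float v)  { *dst = __float2half(v); }
__device__ inline void store_as_plain(float * dst, float v) { *dst = v; }
```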
fairydreaming@reddit
There are already hundreds of `if constexpr` in llama.cpp CUDA code. There are 4 in the very concat.cu file where you want to replace them with regular ifs. Hint: it helps if you actually read and understand the code that you want to modify.
LegacyRemaster@reddit (OP)
This is not the purpose of this "demo". The purpose is to demonstrate automation with minimax 2.7, not to replace the developers.
fairydreaming@reddit
I just wanted to say that removing all "if constexpr" won't make your PR minimal like you said above.
LegacyRemaster@reddit (OP)
It will never be merged into llama.cpp. Every day I see posts like "I did Tetris" or other ephemeral things. Here's a DeepSeek V4 Flash that runs slowly, but it runs. And all with a local setup. That's the news.
Formal-Exam-8767@reddit
Wasn't support for some models already added using a similar method, i.e. porting Python transformers code to llama.cpp with the help of an LLM?
HyperWinX@reddit
Keep that slop to yourself please.
LegacyRemaster@reddit (OP)
I'm just showing that something that was pure utopia months ago is now becoming possible. Whether this is "quality" is another topic.
HyperWinX@reddit
People here know that perfectly fine, no need to spam.
LegacyRemaster@reddit (OP)
Adding an image of it running.