I found a perfect coder model for my RTX4090+64GB RAM

Posted by srigi@reddit | LocalLLaMA | View on Reddit | 92 comments

Disappointed with vanilla Qwen3-coder-30B-A3B, I browsed models at mradermacher. I had a good experience with YOYO models in the past. I stumbled upon **mradermacher/Qwen3-Yoyo-V3-42B-A3B-Thinking-TOTAL-RECALL-ST-TNG-III-i1-GGUF**. First, I was a little worried that **42B** won't fit, and offloading MoEs to CPU will result in poor perf. But thankfully, I was wrong. Somehow this model consumed only about 8GB with `--cpu-moe` (keep all Mixture of Experts weights on the CPU) and Q4_K_M, and 32k ctx. So I tuned llama.cpp invocation to fully occupy 24GB of RTX 4090 and put the rest into the CPU/RAM: ```bash llama-server --model Qwen3-Yoyo-V3-42B-A3B-Thinking-TOTAL-RECALL-ST-TNG-III.i1-Q4_K_M.gguf \ --ctx-size 131072 \ --flash-attn on \ --jinja \ --cache-type-k q8_0 \ --cache-type-v q8_0 \ --batch-size 1024 \ --ubatch-size 512 \ --n-cpu-moe 28 \ --n-gpu-layers 99 \ --repeat-last-n 192 \ --repeat-penalty 1.05 \ --threads 16 \ --host 0.0.0.0 \ --port 8080 \ --api-key secret ``` With these settings, it eats 23400MB of VRAM and 30GB of RAM. It processes the RooCode's system prompt (around 16k tokens) at around 10s and generates at 44tk/s. With 100k context window. And the best thing - the RooCode tool-calling is very reliable (vanilla Qwen3-coder failed at this horribly). This model can really code and is fast on a single RTX 4090! Here is a 1 minute demo of adding a small code-change to medium sized [code-base](https://github.com/srigi/type-graphql): ![adding a small code-change](https://i.postimg.cc/cHp8sP9m/Screen-Flow.gif)