Best coding agent + model for strix halo 128 machine
Posted by Fireforce008@reddit | LocalLLaMA | View on Reddit | 20 comments
I recently got my hands on a Strix Halo machine and was very excited to test it on my coding project. My stack is Next.js and Python for the most part. I tried Qwen3-Coder-Next at 4-bit quantization with 64k context in OpenCode, but I kept running into a failed tool-calling loop on file writes every time the context reached around 20k.
Is that what other people are experiencing? Is there a better way to run a local coding agent?
Due_Net_3342@reddit
You have 128 GB of memory, why use a 4-bit quant? Whoever tells you those quants don't lose quality is wrong; they're for people who are short on RAM. Try Q8, as you should for this type of hardware.
Fireforce008@reddit (OP)
I'm operating out of fear of context size. Given that ~80 GB will go to the model, what do you think is the right context size, given that this will be working on a big codebase?
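If it helps with the fear factor, the KV-cache cost of extra context can be estimated up front rather than guessed at. A minimal back-of-envelope sketch, assuming illustrative layer/head counts (check the GGUF metadata, which llama.cpp prints at load time, for the model's real values) and roughly 1 byte per element for a q8_0 cache:

```python
# Rough KV-cache sizing. The layer/head numbers are illustrative
# assumptions, not Qwen3-Coder-Next's real config -- check the GGUF
# metadata for the actual values.
def kv_cache_gib(ctx, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elt=1):
    """GiB for the K + V caches; bytes_per_elt=1 approximates q8_0."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 1024**3

for ctx in (65536, 131072, 262144):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx):.1f} GiB")
```

With these assumed dimensions the cache scales linearly, about 6 GiB at 64k and 24 GiB at 256k, so a big-context run still leaves room next to an ~80 GB model.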
Look_0ver_There@reddit
Host Setup: https://github.com/kyuz0/amd-strix-halo-toolboxes?tab=readme-ov-file#kernel-parameters-tested-on-fedora-42
That will work on any Linux system that uses GRUB, though.
Grab the latest llama-server binaries from here: https://github.com/ggml-org/llama.cpp/releases
Direct Link to the latest set: https://github.com/ggml-org/llama.cpp/releases/download/b8664/llama-b8664-bin-ubuntu-vulkan-x64.tar.gz
Then run llama-server, substituting the host, port, and exact model name to suit the model you downloaded.
llama-server --host 0.0.0.0 --port 8033 --jinja \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 \
    --repeat-penalty 1.0 --threads 12 \
    --batch-size 4096 --ubatch-size 1024 \
    --flash-attn on --kv-unified --mlock \
    --ctx-size 262144 --parallel 1 --swa-full \
    --cache-ram 16384 --ctx-checkpoints 128 \
    --model ./Qwen3-Coder-Next-Q8_0.gguf \
    --alias Qwen3-Coder-Next-Q8_0
This is what's running on my machine right now. Still working fine at this moment at 180K context depth. I'm using ForgeCode as my coding harness -> https://forgecode.dev/
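Once the server is up, a quick request against its OpenAI-compatible endpoint confirms the alias and chat template are working before you point a coding harness at it (adjust host, port, and model name to match your own flags; the prompt here is just a placeholder):

```shell
# Sanity-check llama-server's OpenAI-compatible chat endpoint.
curl -s http://localhost:8033/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-Coder-Next-Q8_0",
        "messages": [{"role": "user", "content": "Reply with one word."}],
        "max_tokens": 8
      }'
```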
JumpyAbies@reddit
How many tokens/sec can you get with this setup?
Look_0ver_There@reddit
Using llama-benchy on the running end-point as per above.
Command to run test:
uvx llama-benchy --base-url http://localhost:8033/v1 --tg 128 --pp 512 \
    --model unsloth/Qwen3-Coder-Next-GGUF --tokenizer qwen/Qwen3-Coder-Next
Results:
pp512 = 650.1
tg128 = 42.2
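Those two numbers add up per agent turn: prompt processing time plus generation time. A rough sketch, assuming an illustrative 20k-token prompt and 500-token response, and ignoring that both throughputs degrade as context depth grows:

```python
# Back-of-envelope latency for one agent turn from the benchmark
# numbers above. Prompt/response sizes are illustrative assumptions.
PP_TPS = 650.1   # prompt processing, tokens/sec (pp512)
TG_TPS = 42.2    # token generation, tokens/sec (tg128)

def turn_seconds(prompt_tokens, response_tokens):
    """Time to ingest the prompt plus time to generate the reply."""
    return prompt_tokens / PP_TPS + response_tokens / TG_TPS

print(f"~{turn_seconds(20000, 500):.0f}s per turn")
```

At these rates a deep-context turn is dominated by prompt processing, which is why PP matters as much as TG for agentic coding.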
JumpyAbies@reddit
42 tok/s is quite reasonable. With TurboQuant, it should improve even further.
Local LLMs are already fully viable. And I'm eager to see what the next generation from AMD will bring.
Look_0ver_There@reddit
Even today you can get ~90-100 tg/s single-client with Qwen3-Coder-Next @ Q8_0 with 3 x R9700Pro's for ~$5K for a full system.
JumpyAbies@reddit
Thank you for the information. I really appreciate it.
Until the arrival of Qwen 3.5 (3.6), Nemotron, Gemma4, and TurboQuant, I felt that a Strix Halo, excellent as it is, would not quite be enough to deliver at least 40 tok/s. As a result, I was tempted to build a system with an RTX 6000 + RTX 5090. I have the funds, but that would hurt a lot.
However, the progress in smaller models, which now produce very impressive results, has made me realize that something like a Strix Halo or AMD’s next generation will be more than sufficient for home use.
Look_0ver_There@reddit
Here are some extra results for you to ponder. They highlight the difficulty the Strix Halo has with dense models vs. MoE models.
Strix Halo:
Dense:
Qwen3.5-27B @ Q6_K -> PP=310, TG=9.7
Qwen3.5-27B @ Q8_0 -> PP=325, TG=7.8
Gemma4-31B @ Q6_K -> PP=270, TG=8.5
Gemma4-31B @ Q8_0 -> PP=275, TG=6.7
MoE:
Qwen3.5-35B-A3B @ Q6_K -> PP=956, TG=61.1
Qwen3.5-35B-A3B @ Q8_0 -> PP=1153, TG=54.6
Gemma4-26B-A4B @ Q6_K -> PP=1235, TG=52.9
Gemma4-26B-A4B @ Q8_0 -> PP=1365, TG=47.7
Bonus Big Brain MoE:
MiniMax-M2.5 @ IQ3_XXS : PP=226, TG=37.0
That MiniMax result is exactly the type of model the Strix Halo really shines with. Even at IQ3_XXS, it's way smarter than any of the other models listed, and perfectly usable as a "Planning/Analysis" model for local coding, even if the PP is pretty slow.
A pair of 32GB R9700Pro's in a single system will run all of the smaller models, as well as a quantized Qwen3-Coder-Next, at twice the speed of the Strix Halo.
IMO, this is where the recent price rises of the Strix Halo machines have really hurt its viability. When the 128GB Strix Halos were just $1800 they made a lot of sense. Now that they're pushing $3000 each, a system with 2 or 3 R9700Pro's suddenly starts asking the hard questions and eating the Strix Halo's lunch. It's only the ability to run models like MiniMax-M2.5 above, or other ~200B models, that really justifies the Strix Halo nowadays.
Hmm, I didn't start out this response meaning to be critical of the Strix Halo. I have two of them, but I also have another system with a 9700XTX + R9700Pro, and now I'm starting to ask myself whether I'd be better off returning one of the Strix Halos, picking up 2 more R9700Pro's, and keeping the single Strix Halo for the MiniMax-style models.
Look_0ver_There@reddit
You can run Qwen3-Coder-Next at Q8_0 with 262144 context size on the 128GB Strix Halo just fine, and still have room for your desktop and whatever else you're doing.
Assuming you're using Linux, make sure you follow the strix-halo-toolboxes system configuration by kyuz0 on GitHub. It tells you what to change in your GRUB config to get the Strix Halo to use up to 124GB of memory as unified VRAM (not that you'll need that much).
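For reference, the change is a kernel-parameter tweak in /etc/default/grub. The values below are illustrative assumptions only; take the exact parameters and numbers from kyuz0's README for your kernel and distro:

```shell
# Illustrative /etc/default/grub fragment -- the exact parameters and
# values are assumptions; use the ones from kyuz0's strix-halo-toolboxes
# README. ttm limits are counted in 4 KiB pages, so ~27.6M pages is
# roughly 105 GiB made available as GTT "unified VRAM".
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off ttm.pages_limit=27648000 ttm.page_pool_size=27648000"

# Then regenerate the GRUB config and reboot, e.g.:
#   sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # Fedora
#   sudo update-grub                              # Debian/Ubuntu
```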
MaybeOk4505@reddit
Use GLM 4.7 REAP. It's the best model that will fit in this class of system. Use https://huggingface.co/unsloth/GLM-4.7-REAP-218B-A32B-GGUF @ 3bit quant, all will fit. Pick the biggest one that still gives you enough for context and your system RAM requirements.
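"Pick the biggest one that still fits" can be checked mechanically. The quant file sizes below are placeholders, not the real GGUF sizes; read those off the Hugging Face repo, and adjust the reserved headroom for your context and desktop:

```python
# "Pick the biggest quant that still fits" as a mechanical check.
# File sizes here are illustrative assumptions -- use the real GGUF
# sizes listed on the Hugging Face model page.
quants = {              # quant name -> approximate file size in GiB
    "UD-IQ3_XXS": 85.0,
    "Q3_K_M": 104.0,
    "Q4_K_M": 125.0,
}
BUDGET_GIB = 124        # max unified VRAM on a 128 GB Strix Halo
RESERVED_GIB = 20       # KV cache, compute buffers, desktop, headroom

def best_fit(quants, budget=BUDGET_GIB, reserved=RESERVED_GIB):
    """Largest quant whose file plus reserved headroom fits the budget."""
    fitting = {name: size for name, size in quants.items()
               if size + reserved <= budget}
    return max(fitting, key=fitting.get) if fitting else None

print(best_fit(quants))  # with these placeholder sizes: Q3_K_M
```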
Fireforce008@reddit (OP)
At a 3-bit quant, UD-IQ3_XXS is the only option due to context size.
RevolutionaryGold325@reddit
are you using the turboquants for context?
Due_Net_3342@reddit
mradermacher/MiniMax-M2.5-REAP-172B-A10B-i1-GGUF at Q4 is very good, but you need Linux, and run it with q8 KV cache for around 120,000 context. Stop chasing context; it degrades anyway.
RevolutionaryGold325@reddit
I have not tried this. Is it better than the Qwen-3.5-397b IQ2_XXS?
RevolutionaryGold325@reddit
Qwen-3.5-397b IQ2_XXS with 200k context using turboquants
PvB-Dimaginar@reddit
I've had good results with Qwen3 Coder Next 80B Q6 UD K XL on Python and Jupyter projects. With Rust projects, however, it really struggles. If I have time I'll try other models for this, like Gemma4. If someone has advice on which local model is good for Rust, Tauri, and React, please let me know!
TheWaywardOne@reddit
Nemotron Cascade 2 30B-A2B runs snappy and fits the full 1mil context into memory with room to spare. It's decent at tool calling but I usually laid out a lot of planning with a smarter/bigger model beforehand. Decent code output, not awesome.
Gemma 4 26B A4B is feeling better, but the runtimes are still catching up with patches, so maybe wait a bit on that one. My preliminary experience with Gemma 4 has been phenomenal compared to the other MoE models I've been coding with. Excited for updates on this. I tested it day 1, and even with all the bugs it one-shotted a test game prompt I'd been using and blew away everything else; even some of my paid models stumbled on it.
Qwen 3.5 35B A3B is a good all rounder, has been default for a while.
Qwen 122B A10B is too slow for coding IMO, but it's a good 'lead' model to run alongside. So is Nemotron Super; I've liked it for planning, not so much for coding.
I never really had good luck with Qwen 3 Coder Next. It was fast, but I couldn't get consistently good code from it for some reason. Not a config or harness thing; I just personally didn't like its code.
To answer your question, play around with them to find one you like. I think my future default is Gemma 4. 262K context is nice. A good harness and agent chain can do a lot more than 1mil context can.
sleepingsysadmin@reddit
Strix Halo can run medium MoE models:
https://artificialanalysis.ai/models/open-source/medium
Find the bench that most fits your use case.
In my case, Term Bench Hard is where it's at.
Qwen3.5 122B seems like a no-brainer to me. I would certainly give Nemotron 3 Super a try.
Worth_Peak7741@reddit
I have one of these machines and am running that coder model at the same quant. You need to up your context; mine is set to 200k.