JSVD2

BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

Posted by Anbeeld@reddit | LocalLLaMA | View on Reddit | 10 comments

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

That is so true tho. Good point!!!!! My Beelink went from $2500 to $4400 in just a few months. Ridiculous actually.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair. The star wording was a mistake here, so I removed it. It made the post look more promotional than I intended. I am sure there is actual value in the guide if you read it tho.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

That is a really useful datapoint, thanks. The hosted playground makes me more interested in Ultra’s quality, even if it does not change the one-box 128GB local constraint yet.

The correction from this thread is that I skipped Super. I tested Super UD-IQ4_XS locally on Strix Halo and it runs directly:

- pp512/tg128 r3: 292.51 pp512 / 17.94 tg128
- p0/tg128 r3: 17.73 t/s generation-only

So my corrected map is: Ultra = quality/watchlist, Super = runnable 120B middle route, Nano = faster smaller route.

If you try Q4_K_M on the 256GB Xeon box, I’d be very interested in the artifact/runtime/backend and tg128 number.
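To give a feel for what the Super numbers above mean in practice, here is a minimal sketch that turns a pp/tg pair into a rough wall-clock estimate. It assumes both rates stay constant, which is a simplification: measured throughput usually drops as context grows. The function name `estimate_seconds` is hypothetical and not part of any tool.

```python
def estimate_seconds(prompt_tokens: int, gen_tokens: int,
                     pp_tps: float, tg_tps: float) -> float:
    """Rough request time: prompt processing plus token generation.

    Assumes constant throughput, which real runs do not guarantee.
    """
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Rates quoted in the comment above for Super UD-IQ4_XS on Strix Halo.
SUPER_PP_TPS = 292.51  # pp512
SUPER_TG_TPS = 17.94   # tg128

seconds = estimate_seconds(512, 128, SUPER_PP_TPS, SUPER_TG_TPS)
print(f"512 prompt + 128 generated tokens: about {seconds:.1f} s")
```

For a 512-token prompt and 128 generated tokens this comes to roughly 8.9 seconds, most of it spent on generation rather than prompt processing.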

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair point. The guide link probably made it look more promotional than intended. The useful correction from this thread is that I skipped Super. I tested it now: Nemotron 3 Super 120B-A12B UD-IQ4_XS runs directly via llama.cpp Vulkan/RADV at 292.51 pp512 / 17.94 tg128, with 17.73 t/s generation-only. So the better map is: Ultra = watchlist, Super = runnable 120B middle route, Nano = faster smaller route.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Good catch. You’re right: Super is the missing middle route here, and that is actually the better objection than the wording critique.

My post jumped from Ultra -> Nano because I was checking the newest Ultra release first, and Nano was the route I already had a clean direct llama-bench result for. But if Nemotron 3 Super has a practical GGUF route that fits a 128GB Strix Halo box, then yes, that is the more interesting next test than Nano.

So let’s make it concrete:

- Which Super GGUF / quant would you test?
- IQ2_M?
- Q4_K_M if it fits?
- llama.cpp Vulkan/RADV?
- ROCm/HIP?
- long-context serving instead of short llama-bench?

I’m not trying to defend Nano as the best Nemotron path. I’m trying to turn “this obviously won’t run” into an actual runnable route with raw numbers. If Super is the better target, I’ll test Super and add the result. That is exactly the kind of correction I’m looking for.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair enough on the wording. “Reality check” is probably overused.

But I’ll push back on the second part a bit: if the answer is “obviously no”, then what is the actual runnable path people should use today? That is the gap I’m trying to document.

Every big model release creates the same vague local-AI question: “can this run on my machine?” The useful answer is not just parameter count. It is artifact format, quant, file size, backend, memory fit, and measured command output.

For Ultra, the current public artifacts I found are:

- BF16: ~1.1 TB
- NVFP4: ~352 GB
- safetensors / Transformers
- no direct GGUF / llama.cpp route I could find

So yes, I agree: not a single-box 128GB target today. But then the practical question becomes: what is the nearest runnable Nemotron path on this hardware? That is why I tested Nano 30B-A3B GGUF and posted the result.

If there is a better Ultra route, I’d actually like someone to point to it:

- smaller quant?
- GGUF?
- multi-node?
- different runtime?
- offload strategy?
- 128GB unified-memory trick I missed?

I’m not attached to this result. I’m trying to replace “lol obviously no” with a concrete tested path or a better counterexample.
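The memory-fit part of that checklist is simple arithmetic, sketched below with the artifact sizes quoted in the comment. The 16 GB headroom for KV cache, OS, and runtime is an assumed placeholder, not a measured value, and `fits_one_box` and `min_boxes` are hypothetical helper names. The multi-node count ignores interconnect and sharding overhead, so treat it as a lower bound.

```python
import math

# Approximate artifact sizes in GB, as quoted in the comment above.
ARTIFACTS_GB = {
    "Ultra BF16": 1100.0,
    "Ultra NVFP4": 352.0,
}


def fits_one_box(artifact_gb: float, memory_gb: float = 128.0,
                 headroom_gb: float = 16.0) -> bool:
    """True if the weights plus an assumed headroom fit in one machine."""
    return artifact_gb + headroom_gb <= memory_gb


def min_boxes(artifact_gb: float, memory_gb: float = 128.0,
              headroom_gb: float = 16.0) -> int:
    """Lower bound on machines needed if weights were split evenly."""
    usable_gb = memory_gb - headroom_gb
    return math.ceil(artifact_gb / usable_gb)


for name, size_gb in ARTIFACTS_GB.items():
    print(name, "fits one box:", fits_one_box(size_gb),
          "| min boxes:", min_boxes(size_gb))
```

Under these assumptions neither Ultra artifact fits a single 128GB machine, which matches the conclusion in the comment.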

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair pushback. The 352GB NVFP4 point is exactly why I’m not claiming Ultra is practical on one 128GB box. “Watchlist” here means: wait for a smaller artifact, GGUF route, multi-node route, or a different runtime path. I agree the current NVFP4 artifact itself is not a one-box fit. And yes, running a 30B MoE on Strix Halo is not surprising by itself. The useful part I was trying to document is the practical route after the Ultra release: which Nemotron artifact actually has a GGUF/llama.cpp path today, what quant fits, and what speed it gets on this hardware. So the takeaway is not “wow, 30B fits.” It is: Ultra is not a direct one-box 128GB llama.cpp target right now; Nano is the currently runnable Nemotron GGUF route I could verify. If you know a better Ultra path, I’d genuinely like to test it.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Can I ask what you tried to come to this conclusion? It's very interesting to me. I have merely been benchmarking, but real-life results are maybe even more useful.

NVIDIA Nemotron 3 Ultra is out.

Posted by justdoitanddont@reddit | LocalLLaMA | View on Reddit | 1 comments

JSVD2@reddit

Couldn't make it work on Strix Halo just yet. The practical NVIDIA Nemotron route I could actually run was Nemotron 3 Nano 30B-A3B GGUF. [https://github.com/hogeheer499-commits/strix-halo-guide/blob/main/CURRENT\_MODELS.md](https://github.com/hogeheer499-commits/strix-halo-guide/blob/main/CURRENT_MODELS.md)

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 138 comments

JSVD2@reddit

I checked this against a 128GB Strix Halo / Ryzen AI MAX+ 395 box. Ultra itself does not look like a practical direct llama.cpp target yet:

- BF16 artifact: ~1.1 TB
- NVFP4 artifact: ~352.4 GB
- safetensors / Transformers
- no direct GGUF route found for Ultra yet

So I tested the practical NVIDIA route instead: Nemotron 3 Nano 30B-A3B IQ4_XS GGUF.

Direct llama.cpp Vulkan/RADV on Radeon 8060S:

- 619.00 pp512 / 65.45 tg128
- 66.60 t/s generation-only

Raw evidence: [https://github.com/hogeheer499-commits/strix-halo-guide](https://github.com/hogeheer499-commits/strix-halo-guide)
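For anyone who wants to reproduce rows like these, the sketch below builds `llama-bench` command lines for a pp512/tg128 run and a generation-only run. The model path is a hypothetical placeholder, the repetition count of 3 is an assumption, and mapping "generation-only" to `-p 0` is my reading rather than the exact command used for the numbers above. The linked repository is the place to check for the real invocations.

```python
import shlex
from typing import List


def llama_bench_cmd(model_path: str, n_prompt: int = 512,
                    n_gen: int = 128, reps: int = 3) -> List[str]:
    """Build an argv list for llama-bench (prompt, generation, repetitions)."""
    return [
        "llama-bench",
        "-m", model_path,      # GGUF model file
        "-p", str(n_prompt),   # prompt tokens; 0 skips the prompt test
        "-n", str(n_gen),      # generated tokens
        "-r", str(reps),       # repetitions to average over
    ]


# Hypothetical local path, not taken from the post.
MODEL = "models/nemotron-3-nano-30b-a3b-iq4_xs.gguf"

print(shlex.join(llama_bench_cmd(MODEL)))              # pp512 / tg128
print(shlex.join(llama_bench_cmd(MODEL, n_prompt=0)))  # generation-only
```

Printing the command instead of running it keeps the sketch usable on machines without llama.cpp installed; pass the list to `subprocess.run` to execute it.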

llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 40 comments

JSVD2@reddit

I have a lot of them here, from 7 systems: [https://github.com/hogeheer499-commits/strix-halo-guide](https://github.com/hogeheer499-commits/strix-halo-guide)

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint

Posted by fulgencio_batista@reddit | LocalLLaMA | View on Reddit | 146 comments

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

* Experimental server route: Qwen3.6 MTP at 101.1 t/s with `llama-server` speculative decoding.

Testing with this too.

I found what I was looking for in Qwen 3.7.

Posted by CosmicRiver827@reddit | LocalLLaMA | View on Reddit | 9 comments

JSVD2@reddit

Grok is excellent at browsing and real-time information for stocks, btw. This is useful. Not sure why the post is deleted.

Strix Halo 128Gb: what models, which quants are optimal?

Posted by DevelopmentBorn3978@reddit | LocalLLaMA | View on Reddit | 50 comments

JSVD2@reddit

This might help you. [https://github.com/hogeheer499-commits/strix-halo-guide](https://github.com/hogeheer499-commits/strix-halo-guide)

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Very good suggestion. I can share some results with Q6 MTP! I do have this path too tho, if interested:

* Experimental server route: Qwen3.6 MTP at 101.1 t/s with `llama-server` speculative decoding.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Nice numbers. Can you share the raw `llama-bench` row and exact command/build? My post is specifically about direct Strix Halo Vulkan/RADV results, not trying to beat a 5090. A 5090 should obviously win on decode. Also, 10k pp is prompt processing; my headline is tg/decode. I’m mainly collecting reproducible rows, so model, quant, backend, commit, batch/ubatch, context and power numbers would be useful.
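One way to keep submitted rows comparable is a small record type that flags what is still missing. This is only an illustrative sketch: the `BenchRow` class, its field names, and `missing_fields` are hypothetical and are not the schema used in the linked guide.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchRow:
    """One benchmark result with the details requested in the comment."""
    model: Optional[str] = None
    quant: Optional[str] = None
    backend: Optional[str] = None
    commit: Optional[str] = None        # llama.cpp build/commit
    batch: Optional[int] = None
    ubatch: Optional[int] = None
    context: Optional[int] = None
    power_watts: Optional[float] = None
    tg_tps: Optional[float] = None      # decode tokens per second


def missing_fields(row: BenchRow) -> List[str]:
    """Names of fields that were not filled in, in declaration order."""
    return [f.name for f in fields(row) if getattr(row, f.name) is None]


# Only the details stated in the thread title and comment are filled in.
row = BenchRow(model="Qwen3 30B-A3B", backend="Vulkan/RADV", tg_tps=100.0)
print("still needed:", missing_fields(row))
```

A row with an empty `missing_fields` result is one that someone else has a fair chance of reproducing.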

Stop traumatizing AI into loops and turn hallucinations into an honest "I don't know!" by being NICE to them (Proof of Concept, Research, I don't want to sell anything)

Posted by OttoRenner@reddit | LocalLLaMA | View on Reddit | 349 comments

JSVD2@reddit

With T3 it uses a 250K context window. It gives me better results this way. It's something at least! Yep, I'm gonna try it if it happens. https://preview.redd.it/ia15m928vy4h1.png?width=624&format=png&auto=webp&s=ee32041745a2a68e0c7b3ff302f9feef5e9034f9

Shoutout to Gemma4 as a conversational assistant / agent

Posted by goldcakes@reddit | LocalLLaMA | View on Reddit | 66 comments

1-bit Bonsai Image 4B and Ternary Bonsai Image 4B Image Generation for Local Devices with just 0.93 GB and 1.21 GB respectively of Diffusion Transformer Footprint. So tiny!

Posted by Addyad@reddit | LocalLLaMA | View on Reddit | 16 comments

I have become George Jetson: my job is now Yes/No supervision for a machine I don’t fully understand.

Posted by Helpful_Today7449@reddit | LocalLLaMA | View on Reddit | 72 comments

what do you use your local llm?

Posted by FormalAd7367@reddit | LocalLLaMA | View on Reddit | 39 comments

Qwen3.6 35B-A3B successfully completed the FoodTruck Bench!

Posted by PulseVector@reddit | LocalLLaMA | View on Reddit | 20 comments

Putting together a pc. Are my assumptions correct?

Posted by Competitive_Wait_267@reddit | LocalLLaMA | View on Reddit | 17 comments

DIY Local 2x DGX Spark cluster cooler with automatic temperature controlled fan.

Posted by Porespellar@reddit | LocalLLaMA | View on Reddit | 6 comments

My home data center

Posted by alecKarfonta@reddit | LocalLLaMA | View on Reddit | 86 comments

125 tok/s for Qwen3.6 q4xl on 2x 4060ti is insane perf/dollar

Posted by Chuyito@reddit | LocalLLaMA | View on Reddit | 100 comments

What memory system are you using for your agents?

Posted by Mr_Moonsilver@reddit | LocalLLaMA | View on Reddit | 60 comments

My new home office radiator 🥵

Posted by lantern_lol@reddit | LocalLLaMA | View on Reddit | 73 comments

DolphinGemma release when?

Posted by Environmental-Metal9@reddit | LocalLLaMA | View on Reddit | 12 comments

How do you prove an open model actually improved?

Posted by tonyblu331@reddit | LocalLLaMA | View on Reddit | 22 comments

So qwen3.7-4b when?

Posted by ab2377@reddit | LocalLLaMA | View on Reddit | 47 comments
