JSVD2

BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

Posted by Anbeeld@reddit | LocalLLaMA | View on Reddit | 10 comments

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

That is so true tho. Good point!!!!! My Beelink went from $2500 to $4400 in just a few months. Ridiculous actually.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Totally bait and switch. Just so I get 10000 stars.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Consider giving him a star if you find it useful LOL

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

I do use AI tho. Of course I do. But I use it as an assistant and to run benchmarks.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

What counts as lacking common sense, and what counts as being convincing? Sounds like opinions tho.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair. The star wording was a mistake here, so I removed it. It made the post look more promotional than I intended. I am sure there is actual value in the guide if you read it tho.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

That is a really useful datapoint, thanks. The hosted playground makes me more interested in Ultra’s quality, even if it does not change the one-box 128GB local constraint yet.

The correction from this thread is that I skipped Super. I tested Super UD-IQ4_XS locally on Strix Halo and it runs directly:

- pp512/tg128 r3: 292.51 pp512 / 17.94 tg128
- p0/tg128 r3: 17.73 t/s generation-only

So my corrected map is: Ultra = quality/watchlist, Super = runnable 120B middle route, Nano = faster smaller route.

If you try Q4_K_M on the 256GB Xeon box, I’d be very interested in the artifact/runtime/backend and tg128 number.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair point. The guide link probably made it look more promotional than intended.

The useful correction from this thread is that I skipped Super. I tested it now: Nemotron 3 Super 120B-A12B UD-IQ4_XS runs directly via llama.cpp Vulkan/RADV at 292.51 pp512 / 17.94 tg128, with 17.73 t/s generation-only.

So the better map is: Ultra = watchlist, Super = runnable 120B middle route, Nano = faster smaller route.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Absolutely. I will take that as feedback 😄

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

I love your comment

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Good catch. You’re right: Super is the missing middle route here, and that is actually the better objection than the wording critique.

My post jumped from Ultra -> Nano because I was checking the newest Ultra release first, and Nano was the route I already had a clean direct llama-bench result for. But if Nemotron 3 Super has a practical GGUF route that fits a 128GB Strix Halo box, then yes, that is the more interesting next test than Nano.

So let’s make it concrete:

- Which Super GGUF / quant would you test?
- IQ2_M?
- Q4_K_M if it fits?
- llama.cpp Vulkan/RADV?
- ROCm/HIP?
- long-context serving instead of short llama-bench?

I’m not trying to defend Nano as the best Nemotron path. I’m trying to turn “this obviously won’t run” into an actual runnable route with raw numbers. If Super is the better target, I’ll test Super and add the result. That is exactly the kind of correction I’m looking for.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair enough on the wording. “Reality check” is probably overused.

But I’ll push back on the second part a bit: if the answer is “obviously no”, then what is the actual runnable path people should use today? That is the gap I’m trying to document.

Every big model release creates the same vague local-AI question: “can this run on my machine?” The useful answer is not just parameter count. It is artifact format, quant, file size, backend, memory fit, and measured command output.

For Ultra, the current public artifacts I found are:

- BF16: ~1.1 TB
- NVFP4: ~352 GB
- safetensors / Transformers
- no direct GGUF / llama.cpp route I could find

So yes, I agree: not a single-box 128GB target today. But then the practical question becomes: what is the nearest runnable Nemotron path on this hardware? That is why I tested Nano 30B-A3B GGUF and posted the result.

If there is a better Ultra route, I’d actually like someone to point to it:

- smaller quant?
- GGUF?
- multi-node?
- different runtime?
- offload strategy?
- 128GB unified-memory trick I missed?

I’m not attached to this result. I’m trying to replace “lol obviously no” with a concrete tested path or a better counterexample.
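The memory-fit part of that checklist can be sketched as simple arithmetic. This is only an illustration, not a measurement: the artifact sizes are the approximate figures quoted in the comment, the 550B parameter count comes from the Ultra model name, and the `reserve_gb` allowance and all helper names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    size_gb: float   # approximate size as quoted in the thread
    has_gguf: bool   # whether a direct GGUF / llama.cpp route was found


def fits_one_box(artifact: Artifact, memory_gb: float = 128.0,
                 reserve_gb: float = 16.0) -> bool:
    """Return True if the weights alone fit in memory after a reserve.

    reserve_gb is a hypothetical allowance for the OS, KV cache and
    runtime buffers; it is an assumption, not a measured value.
    """
    return artifact.size_gb <= memory_gb - reserve_gb


def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight size: parameters x bits per weight, in decimal GB."""
    return params_billion * bits_per_weight / 8.0


ULTRA_ARTIFACTS = [
    Artifact("Ultra BF16", 1100.0, has_gguf=False),
    Artifact("Ultra NVFP4", 352.0, has_gguf=False),
]

for art in ULTRA_ARTIFACTS:
    verdict = "fits" if fits_one_box(art) else "does not fit"
    print(f"{art.name}: ~{art.size_gb:.0f} GB, {verdict} in 128 GB")

# Sanity check on the BF16 figure: 550B parameters at 16 bits each.
print(f"550B @ 16 bpw = ~{weights_size_gb(550, 16):.0f} GB")
```

The same helper can be pointed at any GGUF file size to get a first yes/no before downloading; actual fit still depends on context length and backend overhead.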

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Fair pushback. The 352GB NVFP4 point is exactly why I’m not claiming Ultra is practical on one 128GB box. “Watchlist” here means: wait for a smaller artifact, GGUF route, multi-node route, or a different runtime path. I agree the current NVFP4 artifact itself is not a one-box fit.

And yes, running a 30B MoE on Strix Halo is not surprising by itself. The useful part I was trying to document is the practical route after the Ultra release: which Nemotron artifact actually has a GGUF/llama.cpp path today, what quant fits, and what speed it gets on this hardware.

So the takeaway is not “wow, 30B fits.” It is: Ultra is not a direct one-box 128GB llama.cpp target right now; Nano is the currently runnable Nemotron GGUF route I could verify. If you know a better Ultra path, I’d genuinely like to test it.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Can I ask what you tried to come to this conclusion? It's very interesting to me. I have merely been benchmarking, but real-life results are maybe even more useful.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 34 comments

JSVD2@reddit (OP)

Thank you!

NVIDIA Nemotron 3 Ultra is out.

Posted by justdoitanddont@reddit | LocalLLaMA | View on Reddit | 1 comments

JSVD2@reddit

Couldn't make it work on Strix Halo just yet. The practical NVIDIA Nemotron route I could actually run was Nemotron 3 Nano 30B-A3B GGUF. [https://github.com/hogeheer499-commits/strix-halo-guide/blob/main/CURRENT_MODELS.md](https://github.com/hogeheer499-commits/strix-halo-guide/blob/main/CURRENT_MODELS.md)

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 138 comments

JSVD2@reddit

What have you done with it? Curious.

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 138 comments

JSVD2@reddit

I checked this against a 128GB Strix Halo / Ryzen AI MAX+ 395 box. Ultra itself does not look like a practical direct llama.cpp target yet:

- BF16 artifact: ~1.1 TB
- NVFP4 artifact: ~352.4 GB
- safetensors / Transformers
- no direct GGUF route found for Ultra yet

So I tested the practical NVIDIA route instead: Nemotron 3 Nano 30B-A3B IQ4_XS GGUF.

Direct llama.cpp Vulkan/RADV on Radeon 8060S:

- 619.00 pp512 / 65.45 tg128
- 66.60 t/s generation-only

Raw evidence: [https://github.com/hogeheer499-commits/strix-halo-guide](https://github.com/hogeheer499-commits/strix-halo-guide)
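As a rough way to read these numbers: pp512 and tg128 are throughput figures in tokens per second, so a back-of-envelope request time is prompt tokens divided by the prompt rate plus generated tokens divided by the generation rate. The sketch below is an estimate under strong assumptions (it ignores context-length effects, batching, and server overhead); `estimate_seconds` is a helper name invented for the example, and the rates are the Nano figures quoted above.

```python
def estimate_seconds(prompt_tokens: int, gen_tokens: int,
                     pp_tps: float, tg_tps: float) -> float:
    """Back-of-envelope request time from llama-bench style rates (t/s)."""
    if pp_tps <= 0 or tg_tps <= 0:
        raise ValueError("rates must be positive")
    # Prompt processing and token generation happen one after the other,
    # so their times simply add up in this simplified model.
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Nano 30B-A3B IQ4_XS rates quoted above: 619.00 pp512 / 65.45 tg128.
nano = estimate_seconds(512, 128, pp_tps=619.00, tg_tps=65.45)
print(f"Nano, 512 prompt + 128 generated tokens: ~{nano:.1f} s")
```

Plugging in rates from another row gives a quick like-for-like comparison, as long as both rows were measured with the same prompt and generation lengths.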

llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 40 comments

JSVD2@reddit

I have a lot of them here, from 7 systems: [https://github.com/hogeheer499-commits/strix-halo-guide](https://github.com/hogeheer499-commits/strix-halo-guide)

gemma-4-12b-it vs Qwen3.5-9B on shared benchmarks: Qwen is overall winner beating gemma in 5/8 benchmarks despite a smaller footprint

Posted by fulgencio_batista@reddit | LocalLLaMA | View on Reddit | 146 comments

JSVD2@reddit

Good share.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

* Experimental server route: Qwen3.6 MTP at 101.1 t/s with `llama-server` speculative decoding.

I'm testing with this too.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Yep, totally agree. Thank you for the feedback.

I found what I was looking for in Qwen 3.7.

Posted by CosmicRiver827@reddit | LocalLLaMA | View on Reddit | 9 comments

JSVD2@reddit

Grok is excellent at browsing and real-time information for stocks, btw. This is useful. Not sure why the post is deleted.

Strix Halo 128Gb: what models, which quants are optimal?

Posted by DevelopmentBorn3978@reddit | LocalLLaMA | View on Reddit | 50 comments

JSVD2@reddit

This might help you. [https://github.com/hogeheer499-commits/strix-halo-guide](https://github.com/hogeheer499-commits/strix-halo-guide)

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

I like the way of thinking.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Absolutely.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Very good suggestion. I can share some results with Q6 MTP! I do have this path too tho if interested:

* Experimental server route: Qwen3.6 MTP at 101.1 t/s with `llama-server` speculative decoding.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Good question.

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

Nice numbers. Can you share the raw `llama-bench` row and exact command/build?

My post is specifically about direct Strix Halo Vulkan/RADV results, not trying to beat a 5090. A 5090 should obviously win on decode. Also, 10k pp is prompt processing; my headline is tg/decode.

I’m mainly collecting reproducible rows, so model, quant, backend, commit, batch/ubatch, context and power numbers would be useful.
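One way to keep such rows comparable is to record every field listed above in a fixed structure and flag what is still missing. The following is a hypothetical sketch: the `BenchRow` class and its field names are invented for illustration and are not the guide's actual format or the `llama-bench` output schema, and the batch, ubatch and context values in the example row are placeholders.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BenchRow:
    model: str
    quant: str
    backend: str
    commit: str                 # llama.cpp build/commit identifier
    batch: int
    ubatch: int
    context: int
    test: str                   # e.g. "pp512" or "tg128"
    tokens_per_sec: float
    power_watts: Optional[float] = None  # left empty when not measured

    def missing_fields(self) -> list[str]:
        """Names of fields that are empty and would block reproduction."""
        return [k for k, v in asdict(self).items() if v in ("", None)]


# Example row: model, backend and speed echo the thread title; quant and
# commit are left blank on purpose, and the batch/ubatch/context numbers
# are placeholder values, not reported settings.
row = BenchRow(model="Qwen3 30B-A3B", quant="", backend="Vulkan/RADV",
               commit="", batch=2048, ubatch=512, context=4096,
               test="tg128", tokens_per_sec=100.0)
print(json.dumps(asdict(row), indent=2))
print("still needed:", row.missing_fields())
```

A row that reports an empty `missing_fields()` list is one someone else can attempt to reproduce without asking follow-up questions.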

Direct 100.0 t/s on Strix Halo with Qwen3 30B-A3B. Can anyone reproduce or beat this?

Posted by JSVD2@reddit | LocalLLaMA | View on Reddit | 23 comments

JSVD2@reddit (OP)

hahaha no problem.

Stop traumatizing AI into loops and turn hallucinations into an honest "I don't know!" by being NICE to them (Proof of Concept, Research, I don't want to sell anything)

Posted by OttoRenner@reddit | LocalLLaMA | View on Reddit | 349 comments

JSVD2@reddit

With T3 it uses a 250K context window. It gives me better results this way. It's something at least! Yep, I'm gonna try it if it happens.

https://preview.redd.it/ia15m928vy4h1.png?width=624&format=png&auto=webp&s=ee32041745a2a68e0c7b3ff302f9feef5e9034f9