alexp702

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 138 comments

alexp702@reddit

Yeah, but look across all the models they are comparing with. They all win and lose in different areas. If you include 27B you'd see that it actually beats some of these huge models: [https://huggingface.co/Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B). No model seems to be giving me head-and-shoulders better results. It's now nuanced. For reference I have been running 397B, 3.6 27B and 9B - all Q8. 27B is "good enough" that 397B is used for rare occasions where I think the problem domain might be too wide. I have seen better results occasionally from it, but not that often. BTW the results are generally Good Enough, so I am certainly not complaining!

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 138 comments

alexp702@reddit

Feels like models are topping out. Benchmarks are hardly shooting up. “Good enough” should move us towards “efficient use of hardware” over bigger, better, more…

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context?

Posted by My_Unbiased_Opinion@reddit | LocalLLaMA | View on Reddit | 129 comments

alexp702@reddit

From memory it’s about 53GB - LM Studio has a little slider that shows memory requirements pretty accurately.
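A rough way to sanity-check a figure like this is to add the weight size to the KV cache size. The sketch below is only that: the layer, head and head-dimension values are hypothetical placeholders, not the model's real config, and it ignores compute buffers and runtime overhead, so it should not be read as confirming or contradicting the number above.

```python
def estimate_memory_gb(params_billion, bits_per_weight, kv_layers, kv_heads,
                       head_dim, context_tokens, kv_bytes_per_value=2):
    """Return (weights_gb, kv_cache_gb) in decimal gigabytes."""
    # Weights: parameter count times bits per weight, converted to bytes.
    weights_bytes = params_billion * 1e9 * bits_per_weight / 8
    # KV cache: one K and one V vector per KV head, per attention layer, per token.
    kv_bytes = 2 * kv_layers * kv_heads * head_dim * context_tokens * kv_bytes_per_value
    return weights_bytes / 1e9, kv_bytes / 1e9


weights_gb, kv_gb = estimate_memory_gb(
    params_billion=27,
    bits_per_weight=8.5,      # GGUF Q8_0 stores roughly 8.5 bits per weight
    kv_layers=16,             # hypothetical placeholder
    kv_heads=4,               # hypothetical placeholder
    head_dim=256,             # hypothetical placeholder
    context_tokens=262_144,   # the 262K context from the thread title
)
print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
      f"total ~{weights_gb + kv_gb:.1f} GB")
```

Swapping in the real values from the model's config file (and the KV cache precision actually in use) is what makes the estimate meaningful.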

NVIDIA GB300 Grace Blackwell Ultra pricetags

Posted by X-N2O@reddit | LocalLLaMA | View on Reddit | 126 comments

alexp702@reddit

Why do the bottom two have 748GB??

I have 2x PC's. One with a 5090 and one with a 4080. Is there an easy way to use both together networked?

Posted by F0UR_TWENTY@reddit | LocalLLaMA | View on Reddit | 34 comments

alexp702@reddit

Telling people to just use Claude is not really an answer. It does not cover the anecdotes covered in the growing answers. Claude will find you a way, but miss the nuances of recent user experience - especially if everyone answers like you!

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

Posted by Uiqueblhats@reddit | LocalLLaMA | View on Reddit | 16 comments

alexp702@reddit

That’s not local though. If you want the absolute best locally use Qwen 397b - I have been able to find the differences between it and 9b in torture tests. However for general tasks 9b is “good enough”.

For users that have both 6000 PRO MaxQ and Workstation Edition (or Server Edition), how much slower is the MaxQ vs the WS/SV on compute? (Prompt processing, Diffusion, etc)

Posted by panchovix@reddit | LocalLLaMA | View on Reddit | 32 comments

alexp702@reddit

Max-Q if you intend to run 2 or more cards now or in the future, and care about your power bill. The 2-slot blower design means you can actually fit them without turning your case into an octopus of PCI extenders. If you only ever expect one and really want the extra <25% that the WS brings, go for that. Personally it’s Max-Q all the way for me. Fitting bar heaters to my computer for a little extra juice, with no easy routes to do more, does not appeal.

Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

Posted by Uiqueblhats@reddit | LocalLLaMA | View on Reddit | 16 comments

alexp702@reddit

Try Qwen - it’s unreal at vision tasks. 9B+ outscores Opus on the benchmarks, and I can believe it.

M5 vs DGX Spark vs Strix Halo vs RTX 6000

Posted by Signal_Ad657@reddit | LocalLLaMA | View on Reddit | 261 comments

alexp702@reddit

Shouldn’t be a problem unless you choose disk based prompt caching - that’s what kills the drive

The RTX 5000 PRO (48GB) arrived and it is better than I expected.

Posted by Valuable-Run2129@reddit | LocalLLaMA | View on Reddit | 212 comments

alexp702@reddit

Man buys 4300 dollar gpu - surprised it’s good. What times we live in!

Bad news: Apple drops high-memory Mac Studio configs

Posted by jzn21@reddit | LocalLLaMA | View on Reddit | 137 comments

alexp702@reddit

You're assuming it's a Mac Studio as we see today. Apple could pack multiple M5U into a package with HBM memory next time round.

Bad news: Apple drops high-memory Mac Studio configs

Posted by jzn21@reddit | LocalLLaMA | View on Reddit | 137 comments

alexp702@reddit

Nvidia: DGX 100k workstation! Apple: hold my beer!

Sell my 3090FE for a 5060ti 16gb? Does it make sense for energy consumption?

Posted by ThrowRA_194_M@reddit | buildapc | View on Reddit | 18 comments

alexp702@reddit

If you care about power and are a video editor get a Mac. Sips power, probably faster at everything video related due to optimised code

Forgive my ignorance but how is a 27B model better than 397B?

Posted by No_Conversation9561@reddit | LocalLLaMA | View on Reddit | 286 comments

alexp702@reddit

We have just run a test on our agentic flow - 397B\_Q8 is still better than 27B\_Q8\_K\_XL. It handles our particular documents more accurately. Shame, it would be great if 27B was actually better, but it isn't yet. On our Mac Studio 397B runs faster too, so let's hope they update 397B to 3.6 standards...

Given how good Qwen become, is it time to grab a 128gb m5 max?

Posted by Rabus@reddit | LocalLLaMA | View on Reddit | 151 comments

alexp702@reddit

He’s talking prompt processing which is in line with M5 Max post earlier

What starts to become possible with two 3090s that wasn't with just one?

Posted by GotHereLateNameTaken@reddit | LocalLLaMA | View on Reddit | 83 comments

alexp702@reddit

Tool calls noticeably fail more with Q4 compared to Q8. This ruins agentic flows. You can also see the difference in image processing quite starkly. A good Q8 is my personal quality floor.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Thanks, see above edit. With limited memory I now fit in 384MB, and I think it's stable enough for my purposes now. Yes, the Node services are pretty small too. Node seems to have a floor of 64-128MB if serving stuff - it's hungry too. PHP uses the memory you'd expect for a 64-bit interpreter with some buffers - i.e. bugger all. Now I know why all the WordPress instances are out there. They are much cheaper to deliver than something using a newer language.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

This is not production, it's development.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

It does matter to me. We’re building a system for the future and whilst this component is not large or high frequency in use it is important.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Thanks - I will check that as some of that may be happening.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Htop shows the usage all in the “uv run uvicorn” process.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Not doing a full implementation just rapidly prototyping a solution to see the memory usage. If the AI can get it up and running in 10 minutes warts and all that’s good for this purpose. I’m impressed - in about 2hrs it has allowed me to test almost every proposal here. I just spotted that in the output, the 4 year unmaintained problem still stands.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Yes, everything is pretty similar to a Windows PC. Side note: we were using amd64 images on ARM Macs - they use about 30% more RAM to emulate. Will pick up tomorrow I think!

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Yes, but bjoern uses the older V2 WSGI protocol which is now (apparently according to my AI) WSGI V3. Personally I don't go for stuff that's not obviously maintained and relatively active - it causes problems down the line.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

I am running on a Mac - Arm64 image for all. Node seems to have a baseline of 128MB - seems a well documented thing to do with the garbage collector. You can reduce it with some command line flags, but it then starts to become unstable. My actual Python program is using about 70MB on start up (possibly due to libraries) - which I can live with. My surprise is how hard it seems to be to serve this without eating up RAM. We have a bunch of 16GB Macs, developing a Docker-based system. Most code-based containers are Node, with a Python one stuffed in there. I want to make sure the team has as much RAM left as possible, which began this investigation. Disk isn't a problem - just RAM as we grow the number of services.
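For a baseline to compare the heavier servers against, something like the sketch below could help. It uses only the standard library (`wsgiref` and `resource`) and reports the process's own peak memory in the response. It is not the setup described in this thread: the port, the `serve` helper and the JSON shape are made up for illustration, and `wsgiref` is single-threaded and meant for development only.

```python
import json
import resource
import sys
from wsgiref.simple_server import make_server


def peak_rss_mb():
    """Peak resident set size of this process in megabytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def app(environ, start_response):
    """Tiny WSGI app that echoes the path and the current peak memory."""
    body = json.dumps({
        "path": environ.get("PATH_INFO", "/"),
        "peak_rss_mb": round(peak_rss_mb(), 1),
    }).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(port=8000):
    """Run the development server until interrupted (hypothetical helper)."""
    with make_server("0.0.0.0", port, app) as server:
        server.serve_forever()
```

Calling `serve()` and requesting any path shows what the interpreter plus a bare server costs before any application libraries are imported, which separates the server's footprint from the libraries' footprint.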

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

I am using uv run in the container - I think it may be part of the problem, as it seems to not matter what I try it stubbornly wants 512MB.

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

bjoern seems very old - 4 years since update!

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

I need Python - various libraries on the endpoint require it, unless there is some trick to Go <-> Python?

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Currently python:3.13-slim

What’s a low memory way to run a Python http endpoint?

Posted by alexp702@reddit | Python | View on Reddit | 96 comments

alexp702@reddit (OP)

Sorry all Megabytes. All the others are irrelevant to me ;-)

RDMA Mac Studio cluster - performance questions beyond generation throughput

Posted by quietsubstrate@reddit | LocalLLaMA | View on Reddit | 3 comments

alexp702@reddit

All seems very prototype, personally. I prefer stable-ish production. Very interested too to hear if anyone has actually used this kind of configuration for anything real. A recent article by the Google engineer using B200 confirmed my suspicions - keep the model on a single piece of hardware for best overall throughput.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do.

Posted by affenhoden@reddit | LocalLLaMA | View on Reddit | 90 comments

alexp702@reddit

Didn’t mean to be patronising - I have run many useless benchmarks in the fever of a new machine. However most are interested - myself included - in proper M5 Max benchmarks. Hoping the OP updates this with more information.

M5 Max 128G Performance tests. I just got my new toy, and here's what it can do.

Posted by affenhoden@reddit | LocalLLaMA | View on Reddit | 90 comments

alexp702@reddit

Having fun with a new toy eh😉? When you calm down, prompt processing is the only metric that matters to most normal people - coding or openclawing you spend the whole time there. Llama.cpp does prompt caching properly now with Qwen3.5, giving such a speed-up that actual token generation speeds are blurred by how much or little can be cached. Also with 128GB you should be running 27B at BF16, or at least Q8, if you care about quality - which you should if you’re not just playing. Enjoy!

Qwen3.5 MLX vs GGUF Performance on Mac Studio M3 Ultra 512GB

Posted by BitXorBit@reddit | LocalLLaMA | View on Reddit | 57 comments

alexp702@reddit

I agree, llama.cpp 397B Q8 seems built to run well on M3 Ultra. You can actually fit 1M context with 4 parallels. This helps the prompt cache if used on different tasks. Prefill is much better than it was on past models.

Whats up with MLX?

Posted by gyzerok@reddit | LocalLLaMA | View on Reddit | 54 comments

alexp702@reddit

I have given up on the idea of MLX for now - llama.cpp running Qwen3.5 keeps getting better, and in ways that are not only performance related - as you say, quality matters most. At some point I expect to swap to vLLM MLX, but that’s another system that feels like it needs to cook more. Basically, while things are moving quickly in the space, speed of stable delivery matters more than speed of inference.

A few early (and somewhat vague) LLM benchmark comparisons between the M5 Max Macbook Pro and other laptops - Hardware Canucks

Posted by themixtergames@reddit | LocalLLaMA | View on Reddit | 60 comments

alexp702@reddit

He does test that later on in the video??

A few early (and somewhat vague) LLM benchmark comparisons between the M5 Max Macbook Pro and other laptops - Hardware Canucks

Posted by themixtergames@reddit | LocalLLaMA | View on Reddit | 60 comments

alexp702@reddit

PP speeds are much better with M5 Max: https://youtu.be/XGe7ldwFLSE?si=AFTdqPV4Np0gsgj-

Qwen 3.5 VS Qwen 3

Posted by SlowFail2433@reddit | LocalLLaMA | View on Reddit | 18 comments

alexp702@reddit

Compared to 8, not really, but there were slightly more errors across my test images. Below 8 I was seeing huge mistakes. Putting text on the wrong line on an off-axis photo was the biggest failure mode I noticed across all quants of the smaller models. The big one had no problem (but with thinking on it thought for 4 minutes, which was excessive). I have some horrible low-light phone shots of printed schedules covered in handwritten notes. These are our use case and quickly separate out good from bad. I must say all failed in some way with the smaller models, and the bigger model is quantifiably better even on a smallish test. However the small models are very good. Ironically the BF16 9B actually performs at similar speeds to the 397B 8-bit (bandwidth and all that) - so I am unsure if we’ll actually use it!
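The "bandwidth and all that" point can be sketched as back-of-envelope arithmetic: if decoding is limited by memory bandwidth, speed is roughly bandwidth divided by the bytes of weights read per token. The numbers below are assumptions, not measurements from this comment. In particular, the roughly 17B active parameters for the 397B mixture-of-experts model and the round 800 GB/s bandwidth are placeholders to check against the actual model card and hardware.

```python
def decode_tokens_per_second(active_params_billion, bytes_per_param, bandwidth_gb_per_s):
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_per_s / gb_read_per_token


BANDWIDTH_GB_PER_S = 800  # assumed round figure, not a measured value

# Dense 9B at BF16: 2 bytes per parameter, about 18 GB read per token.
dense_9b_bf16 = decode_tokens_per_second(9, 2.0, BANDWIDTH_GB_PER_S)

# 397B mixture-of-experts at 8-bit, ASSUMING about 17B active parameters per token:
# 1 byte per parameter, about 17 GB read per token.
moe_397b_q8 = decode_tokens_per_second(17, 1.0, BANDWIDTH_GB_PER_S)

print(f"9B BF16: ~{dense_9b_bf16:.0f} tok/s, 397B Q8: ~{moe_397b_q8:.0f} tok/s (upper bounds)")
```

Under those assumptions the two land within a few percent of each other, which would be consistent with the observation above. Real speeds will be lower, and prompt processing is a separate, compute-bound story.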

Qwen 3.5 VS Qwen 3

Posted by SlowFail2433@reddit | LocalLLaMA | View on Reddit | 18 comments

alexp702@reddit

Running less quantized 3.5 compared to 3 and it’s a big step change from 4->16 bit. The smaller models perform very well on our image recognition tasks, with the 9B at BF16 almost comparable to 235B at Q4. We didn’t do as many tests at higher quants before, as people seemed to imply all this marginal perplexity increase didn’t matter. For us it does, so we’re interested in 8-bit or higher only. The new models fit neatly into GPUs, and we have a Mac Studio for the big ones.

Which one are you waiting for more: 9B or 35B?

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 220 comments

alexp702@reddit

A draft model for 397b!

Post your hardware/software/model quant and measured performance of Kimi K2.5

Posted by fairydreaming@reddit | LocalLLaMA | View on Reddit | 47 comments

alexp702@reddit

RemindMe! 10 days

Has anyone got GLM 4.7 flash to not be shit?

Posted by synth_mania@reddit | LocalLLaMA | View on Reddit | 130 comments

alexp702@reddit

Sorry GLM 4.6v

Has anyone got GLM 4.7 flash to not be shit?

Posted by synth_mania@reddit | LocalLLaMA | View on Reddit | 130 comments

alexp702@reddit

Found Qwen 4.6V to work pretty well at 8_0. Perhaps they don’t quantise well?

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb

Posted by BitXorBit@reddit | LocalLLaMA | View on Reddit | 34 comments

alexp702@reddit

NB we also use Macs for app development, so another Mac that is a bit overspecced is always welcome even when it's outlasted LLM work.

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb

Posted by BitXorBit@reddit | LocalLLaMA | View on Reddit | 34 comments

alexp702@reddit

It works - and the quality is good. It's a good R&D device - allowing you to bring up different models on it without drama. It responds decently quickly on smaller models. We've currently just swapped to GLM 4.6V - which is 100b and 22b active (we need vision as well for other purposes). This runs full BF16 with maximum context size happily. That kind of flexibility will cost triple on Nvidia, albeit with a faster output. However, OpenRouter, if you don't care about data visibility, is much cheaper and quicker (well, sometimes - some providers are quite bad, failing randomly and generally being slower than you'd hope).

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb

Posted by BitXorBit@reddit | LocalLLaMA | View on Reddit | 34 comments

alexp702@reddit

I have been using Qwen Coder 480B for a while on the M3 Ultra. It’s slow. I found it works OK with Cline, but context processing is go away and come back in an hour. It definitely works, so for a background task it does a good job. Code quality is much better than smaller models. Output speed is good too, it’s just that darn prompt processing - you’re looking at 100s of tokens a second, so on a 100k context it’s 1000 seconds. The box in general is awesome - being able to have lots of models to hand, and just fire up a different model or two is perfect for R&D. Production wise it’s OK if you have slow agentic flows. Just don’t expect snappy interactions.
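The 1000-second figure is just context length divided by prompt-processing speed. A minimal sketch of that arithmetic follows, with an optional `cached_tokens` parameter added as an illustration of why prompt caching matters; the function name and that parameter are not from any tool mentioned here.

```python
def prefill_seconds(context_tokens, pp_tokens_per_second, cached_tokens=0):
    """Time to process a prompt, skipping any prefix already held in the prompt cache."""
    uncached = max(context_tokens - cached_tokens, 0)
    return uncached / pp_tokens_per_second


# The case from the comment: 100k tokens at 100 tokens/s of prompt processing.
cold = prefill_seconds(100_000, 100)

# Illustrative only: the same prompt if 90k tokens of it were already cached.
warm = prefill_seconds(100_000, 100, cached_tokens=90_000)

print(f"cold: {cold:.0f} s, warm: {warm:.0f} s")
```

This is why agentic tools that rebuild a large prompt on every step feel so slow on this hardware, while flows that keep a stable prefix are much more tolerable.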

Mac Studio as an inference machine with low power draw?

Posted by aghanims-scepter@reddit | LocalLLaMA | View on Reddit | 41 comments

alexp702@reddit

Agree Cline is too slow - that’s the crazy prompts it creates though. I have other uses that need shorter prompts and more precision, so the Mac is well suited. A 48GB Nvidia solution doesn’t work if the model you need requires 200GB+ of RAM to run at all.

Mac Studio as an inference machine with low power draw?

Posted by aghanims-scepter@reddit | LocalLLaMA | View on Reddit | 41 comments

alexp702@reddit

Mac stability is pretty rock solid. Have had one running Qwen 480b for weeks - no restarts. Performance is slow, but then so is most stuff on that size model. Prompt processing is slow for sure. But running large unquantized models is nothing to be sniffed at.

🧠 Inference seems to be splitting: cloud-scale vs local-first

Posted by Code-Forge-Temple@reddit | LocalLLaMA | View on Reddit | 9 comments

alexp702@reddit

Macs are the unsung king of local private inference. Load a high quality 600+b parameter model, run queries against it slowly, but fast enough. Cost 10k. Nvidia’s offerings are horrid in this basic use case.

Start of 2026 what’s the best open coding model?

Posted by alexp702@reddit | LocalLLaMA | View on Reddit | 57 comments

alexp702@reddit (OP)

Interesting, is that because of 128k context?