[-]

jkcoxson@reddit

I’ve personally settled on qwen3.5-122b. I get roughly 40 t/s using oMLX, which is faster than any other program I tried. I use OpenCode, and generally leave it while doing other things like sleeping or socializing. I give it clear specs of what I want done and how I think it should be done, as well as a way for it to test its own work. Usually it’ll iterate for a few hours and eventually get done the list of tasks I have for it.

Eventually I want to write my own harness, since I feel like OpenCode is too loose. I know how to write code, I know how I want things implemented, I know how to test for success, so I need more structure for the LLM.

Basically it’s not super fast, but can get things done in time that is otherwise occupied by my life.

[-]

Poolunion1@reddit

Look at pi.dev it’s pretty flexible what you can do with it. It’s almost like a harness starter kit.

[-]

PinkySwearNotABot@reddit

i used to use MLX variants for the speed boost, but then i started reading about them not performing as well as their GGUF counterparts. so now I'm strictly going for high quality GGSUF quants. what's your experience with MLX?

[-]

jkcoxson@reddit

GGUF is abysmally slow on Apple Silicon, roughly half the performance for me. I haven’t heard anything saying that MLX is lower quality, just another format to store parameters. If you have any source, I’d be interested in reading it.

[-]

PinkySwearNotABot@reddit

lol unfortunately the 2 sources it cited are Reddit links...but if you do deep dive into this and find anything interesting, please share the knowledge! I'm definitely open to switching formats again if these guys are leapfrogging one another

For your 16-inch M1 Max with 64GB of RAM, the choice between GGUF (llama.cpp) and MLX has shifted significantly in 2026. While MLX used to be the "de facto" recommendation for Mac, recent benchmarks and the maturity of GGUF's K-Quants and Flash Attention have leveled the playing field.

Here is the breakdown based on the current 2026 landscape:

At a Glance: Which one to use?

Use Case	Recommended Format	Why?
Agentic Coding / Long Context	GGUF	Better prompt caching and stable performance as context grows.
Creative Writing / Chat	MLX	Higher raw generation speed (tokens/sec) for short-to-medium prompts.
Maximum Quality (4-bit)	GGUF (Q4_K_M)	K-Quants are smarter than MLX's uniform quantization, retaining more "intelligence."
Multi-Modal (Vision/Audio)	MLX	Native support for Apple’s newest multi-modal unified architectures (like Qwen 3.5-VL).

Export to Sheets

1. Performance & "Effective" Speed

There is a major distinction between Generation Speed (how fast words appear) and Prefill Speed (how fast the AI reads your prompt).

MLX is often 2x faster at generating text (e.g., 50+ tps vs. 25 tps on an M1 Max).
GGUF (via llama.cpp/LM Studio) is often much faster at prompt processing.
The Trap: In 2026, many users find MLX "feels" slower for long conversations because its prompt caching is less stable. If you are doing agentic workflows (like using OpenCode or coding agents), GGUF’s ability to "remember" the previous turn without re-processing the whole history gives it the edge in Effective TPS.

2. Quantization Quality

Not all 4-bit models are equal.

GGUF (K-Quants): Uses a "importance matrix" to keep more precision in the layers that matter most for logic.
MLX: Historically used uniform quantization. While newer MLX quants are improving, GGUF still generally wins on perplexity (a measure of how "smart" the model remains after shrinking). If a model feels "lobotomized" in MLX, try the GGUF Q4_K_M or Q6_K version.

3. The "Ollama" Factor

If you use Ollama, you are using GGUF. However, in 2026, the Go-wrapper in Ollama has been noted to add a 20–30% performance overhead compared to running the same GGUF in LM Studio or raw llama.cpp.

4. Hardware Synergy (Your M1 Max)

Your M1 Max has 400 GB/s memory bandwidth.

MLX excels at utilizing this bandwidth for raw throughput.
GGUF is better at managing the 64GB capacity, especially with "Unified Desktop" (UD) quants that allow you to run massive models (like 70B+ variants) more efficiently by squeezing the KV cache.

Final Recommendation

Since you’re doing CS50 and web agency work:

Use GGUF for your coding assistants (where you paste large blocks of code). It handles the "prefill" and context much more reliably.
Use MLX for exploratory chatting or trying out the absolute newest models (like the latest Qwen or Gemma releases) which often hit the MLX-community HuggingFace page a few days before GGUF conversions are optimized.

For your 16-inch M1 Max with 64GB of RAM, the choice between GGUF (llama.cpp) and MLX has shifted significantly in 2026. While MLX used to be the "de facto" recommendation for Mac, recent benchmarks and the maturity of GGUF's K-Quants and Flash Attention have leveled the playing field.Here is the breakdown based on the current 2026 landscape:At a Glance: Which one to use?Use Case Recommended Format Why?
Agentic Coding / Long Context GGUF Better prompt caching and stable performance as context grows.
Creative Writing / Chat MLX Higher raw generation speed (tokens/sec) for short-to-medium prompts.
Maximum Quality (4-bit) GGUF (Q4_K_M) K-Quants are smarter than MLX's uniform quantization, retaining more "intelligence."
Multi-Modal (Vision/Audio) MLX Native support for Apple’s newest multi-modal unified architectures (like Qwen 3.5-VL).
Export to Sheets1. Performance & "Effective" SpeedThere is a major distinction between Generation Speed (how fast words appear) and Prefill Speed (how fast the AI reads your prompt).MLX is often 2x faster at generating text (e.g., 50+ tps vs. 25 tps on an M1 Max).

GGUF (via llama.cpp/LM Studio) is often much faster at prompt processing.

The Trap: In 2026, many users find MLX "feels" slower for long conversations because its prompt caching is less stable. If you are doing agentic workflows (like using OpenCode or coding agents), GGUF’s ability to "remember" the previous turn without re-processing the whole history gives it the edge in Effective TPS.2. Quantization QualityNot all 4-bit models are equal.GGUF (K-Quants): Uses a "importance matrix" to keep more precision in the layers that matter most for logic.

MLX: Historically used uniform quantization. While newer MLX quants are improving, GGUF still generally wins on perplexity (a measure of how "smart" the model remains after shrinking). If a model feels "lobotomized" in MLX, try the GGUF Q4_K_M or Q6_K version.3. The "Ollama" FactorIf you use Ollama, you are using GGUF. However, in 2026, the Go-wrapper in Ollama has been noted to add a 20–30% performance overhead compared to running the same GGUF in LM Studio or raw llama.cpp.Tip: For the best performance on your M1 Max, use LM Studio with the GGUF backend or oMLX if you prefer the MLX route.4. Hardware Synergy (Your M1 Max)Your M1 Max has 400 GB/s memory bandwidth.MLX excels at utilizing this bandwidth for raw throughput.

GGUF is better at managing the 64GB capacity, especially with "Unified Desktop" (UD) quants that allow you to run massive models (like 70B+ variants) more efficiently by squeezing the KV cache.Final RecommendationSince you’re doing CS50 and web agency work:Use GGUF for your coding assistants (where you paste large blocks of code). It handles the "prefill" and context much more reliably.

Use MLX for exploratory chatting or trying out the absolute newest models (like the latest Qwen or Gemma releases) which often hit the MLX-community HuggingFace page a few days before GGUF conversions are optimized.

https://www.reddit.com/r/LocalLLaMA/comments/1rwaq47/qwen35_mlx_vs_gguf_performance_on_mac_studio_m3/#:\~:text=5%20models%20prompt%20processing%20is,like%20working%20technology%20on%20llama.

https://www.reddit.com/r/LocalLLaMA/comments/1s49lvh/gguf_llamacpp_vs_mlx_round_2_your_feedback_tested/#:\~:text=Verified%20by%20compiling%20with%20Metal,wrapper%20seems%20to%20be%20expensive.

https://www.reddit.com/r/LocalLLaMA/comments/1s49lvh/gguf_llamacpp_vs_mlx_round_2_your_feedback_tested/#:\~:text=The%20Ollama%20overhead%20has%20a,matters%20more%20than%20the%20engine.

[-]

jkcoxson@reddit

That's a lot of AI generated words, did you mean to reply to my comment with that lol.

[-]

PinkySwearNotABot@reddit

yes i did. you said you were interested in GGUF vs MLX comparisons

[-]

ResearchCrafty1804@reddit

Here you can see a companion between MLX and GGUF on accuracy for agentic coding.

[-]

PinkySwearNotABot@reddit

hey thanks for this -- very helpful. where did you pull this from? i can't find what "BPM" is

[-]

PinkySwearNotABot@reddit

another question: are you using some sort of proxy (e.g. bifrost, liteLLM) to switch between all your models or are you manually switching them?

[-]

JuliaMakesIt@reddit

QwenML has a Python Module for doing agentic workflow that supports tools, rag, and tool use.

It’s model agnostic so you don’t have to use Qwen models. I use it with Llama.cpp or MLX in router mode loading models as needed.

If you’re going to write your own harness for agentic workflows and are comfortable with writing Python, it handles all the boring stuff for you. They have a lot of examples in their GitHub.

https://github.com/QwenLM/Qwen-Agent

(Disclosure: I’m not a Qwen employee and have nothing to do with this code. I just use it in projects.)

[-]

jkcoxson@reddit

Because others have mentioned smaller models, I find 27b and 31b too stupid to write Rust and C well. They loop over work they’ve already tried even with a 250k context window.

[-]

shansoft@reddit

Very happy about it. I am able to run Qwen 3.5 122B with 200k+ context while maintaining 40-60tok/s and 500-3000 ppt/s. Upgraded from M3 Max and the speed difference is wild. Local model is now on Sonnet level and it can easily do what SOTA cloud model can do these days, as long as you prompt / plan it right. It also fixed some secret keys problem I was having when Opus 4.6 / GPT 5.4 both just put a hack patch instead of fixing the root cause.

[-]

Acrobatic-Desk3266@reddit

Could you say more about your setup? Very intrigued with you saying it compress to SOTA. I'm assuming you're coding and are a software engineer?

[-]

shansoft@reddit

Yes, I am a software engineer and it is used for coding. I used both my laptop and 5090 desktop at the same time for different purposes. Most backend and web related task I use Qwen3.5 122B 4bit on oMLX since its pretty reliable and decent speed for typescripts and swift vapor code. For mobile, since its somewhat related to UI, I mostly tackle it with Gemma4 31B 5bit or Qwen3.6 27B 5bit on Llamacpp. I also used ComfyUI with custom setup to create assets when I need to. Mobile coding in general seems to be a problem for all the models out there, doesn't matter if it is Opus or GPT or local model, its much better to breakdown the task and code along with the LLM together. I mostly use opencode with these models. I still use claude code / codex from time to time to try different things, but I failed to see any value it provide that I couldn't get from my local setup.

[-]

Acrobatic-Desk3266@reddit

Awesome to hear,thanks!

[-]

_derpiii_@reddit (OP)

Very happy about it. I am able to run Qwen 3.5 122B with 200k+ context while maintaining 40-60tok/s and 500-3000 ppt/s. Upgraded from M3 Max and the speed difference is wild. Local model is now on Sonnet level and it can easily do what SOTA cloud model can do these days, as long as you prompt / plan it right.

Woah, sounds like your configuration is setup well! I'm seeing so many varying experiences. Yours seems the most positive so f ar.

[-]

zorgis@reddit

Interested too.

I still can't decide if the max 128gb is worth it over the pro 64gb.

[-]

_derpiii_@reddit (OP)

Same! I'm playing around with the m2 ultra 64GB and quickly finding how disappointing it is. And I don't see how having double the memory would help much.

So if I'm going to continue local inference with Apple, might as well wait for M5 512GB.

[-]

ImpressiveHair3798@reddit

Spec ? Taille ? Écran ? Ssd ?

[-]

SkyFeistyLlama8@reddit

Prompt processing is much improved on the M5 Max. Anything before that can run large models slowly and with glacial prompt processing, which makes agentic coding or long workflows excruciatingly slow.

[-]

_derpiii_@reddit (OP)

Could you share some of your workflows you're using local inference for? I'm trying to get an idea of what's pleasantly feasible.

[-]

JuliaMakesIt@reddit

I have both (Mac mini M4 Pro 64GB) and MacBook Pro M5 Max 128GB).

It’s night and day both in speed and the scale of model / context length I can run. I don’t regret paying the extra for the Max with more RAM. It was 100% worth it.

[-]

ImpressiveHair3798@reddit

Spec ? Taille ? Écran ? Ssd ?

[-]

zorgis@reddit

What do you run with the 128gb?

What do you use the model for? Coding, local agent, chat?

[-]

JuliaMakesIt@reddit

I don’t do much chat or writing with AI.

I automate a lot of workflows, mostly with smaller local models (20-35B with large but managed context)

I use the Qwen-Agent Python module for a lot of things as it has great tool calling, RAG, code execution all built-in. I have built maybe 6 of these agents since Qwen3 came out. Think of them as mini OpenClaw’s but built for specific missions.

They can use any model, but I’ve been mostly running the Qwen3.5 series. I use 4B as a super fast automation support model, then based on the task upscale to Qwen3.5 27B, 35B-A3B or the 122B-A10B model.

It’s really all about the tooling. A smaller model with a good RAG document set and tools will do a lot more than a big model without.

The other big models I sometimes use are: NVIDIA-Nemotron-3-Super-120B-A12B and GPT-OSS-120B. The Gemma4 models are also quite nice.

The M5 Max will reduce your time to first token quite a bit.

The 128GB will let you have some things running in Docker, and a few very solid models loaded. You can also jettison the smaller models and load in a Q4_K_M 120B model if you need.

I use the Colima engine to run Docker with full METAL MPS access.

I use either MLX or Llama.cpp in router mode to dynamically pick and load models on the fly.

Yup, wall of text. Sorry just wanted to share my setup. It’s been super useful for me. Also, I’m human and like pineapple on pizza.

[-]

bwjxjelsbd@reddit

How much quicker in your experience for coding task?

[-]

JuliaMakesIt@reddit

Now and then one of my agents will write scripts or some Python as part of a task.

I’m pretty old-school and write most of my code by hand. I do sometimes use Claude Code for code review, writing tests or refactoring; it’s useful for grunt work.

I don’t think even Claude Opus-level models are really that useful for doing detailed coding. Local models have a ways to go.

[-]

ResearchCrafty1804@reddit

How do use Colima to run Docker with Metal MPS access? Can you share a bit more about this?

[-]

JuliaMakesIt@reddit

If you install the Docker Desktop App on MacOS, you end up with containers that can’t access METAL/MPS. Delete all of that and use Brew to install Docker on top of the Mac native Colima engineer instead.

It’s as simple as:

% brew install colima
% brew install docker
% colima start --arch aarch64 --vm-type=vz  --cpu 4 --memory 2 --disk 64
% docker info

I set my container host up with 4 cpu cores, 2GB of ram and 64GB disk, but depending on what and how many docker images you run, you can adjust it. You can even run Intel containers using Rosetta 2 this way.

[-]

ResearchCrafty1804@reddit

But can you run mlx or use metal api inside the docker containers that run through colima?

[-]

JuliaMakesIt@reddit

http://colima.run

Check under the “AI Workloads / GPU-Accelerated” section.

[-]

rm-rf-rm@reddit

excellent write up! not a wall of text. This is giving me the idea of doing a megathread for folk to share their setups from hardware through applications.

Qwen-Agent Python module for a lot of things

Never heard of it before actually. And as a Openclaw skeptic, it looks interesting. What are you using it for and how much effort does it take to setup for reliable use?

[-]

Flashy_Koala9976@reddit

I'm torn between the M4 Max 128GB and the M5 Max 128GB. Is the M5 really necessary?

[-]

JuliaMakesIt@reddit

The M5 Max does prompt processing much faster (like a lot 50-80%) Inference is a bit faster too (10-15%).

In my opinion the prompt processing boost makes it worth considering the M5 if you have the extra cash for it.

[-]

txdv@reddit

800$ for 64gb of ram sounds like a good deal nowadays

[-]

_derpiii_@reddit (OP)

jesus. checking my email, you know how much I sold my G.SKILL TridentZ RGB 64GB?

$50 + $10 shipping 🥲

Jan 4, 2025. Just a year ago. wtf

[-]

txdv@reddit

time to let it go, i looked at btc at 100$ and decided they hit the roof

[-]

rm-rf-rm@reddit

Still not too late to buy. Its either going to be bigger than gold's market cap (so something like a 30x multiplier left) or will tend to 0 as time goes to infinity. So its only a question of your worldview and risk apetite

[-]

thelebaron@reddit

Just imagine some guy is posting somewhere about the score he got in light of today’s prices and how happy he is though

[-]

Consumerbot37427@reddit

Right? It shows how out of hand things have gotten. The "Apple Tax" for RAM or disk upgrades has always been signficant.

[-]

xraybies@reddit

Apart from macOS being a cluster of interlinked junk consuming >5GB on load and being hard to debloat, the hardware is not bad. I have an M5 Max 128GB; if I let the agents do their thing, the fans kick on in 10s and you can watch the battery go down 1% every 20s with any model above an active 3B. MLX is pretty good, but realistically you only use 118GB (54GB on the 64GB) for models, so you still cannot run \~120B Q8, at best Q6.https://omlx.ai/benchmarks will give you a good idea of what you can run. I ordered both the 64GB and 128GB versions and, apart from the SSDs (SanDisk vs. Toshiba), they performed identically. The keyboard also felt very slightly different, just a tad firmer on the 64GB.

Think of it as an RTX 5060 with 110GB VRAM + i7 12700k.

Image gen is \~1/3 the t/s (DrawThings) vs an RTX 4090 (ComfyUI)

As for workloads: heretic Qwen 122B, Nemotron 120B and GPT OSS 120GB q6 or mxfp4.

Overall 6.5/10 if it weren't for macOS be such a bloated PoS it would be 8/10.

[-]

TheOriginalG2@reddit

good luck with intel and nvidia, their hardware is garbage with constantly crashing and problems plus no ram. Apple has scores a 10/10 with the m5 max.

[-]

_derpiii_@reddit (OP)

the fans kick on in 10s and you can watch the battery go down 1% every 20s with any model above an active 3B.

Have you run into any thermal throttling issues? Was curious if thermals would be an issue (vs let's say a mac studio).

realistically you only use 118GB (54GB on the 64GB) for models, so you still cannot run ~120B Q8, at best Q6

Oh wow, that's 10GB taken up. I was expecting 5GB max. That's Really good to know.

https://omlx.ai/benchmarks will give you a good idea of what you can run. I ordered both the 64GB and 128GB versions and, apart from the SSDs (SanDisk vs. Toshiba), they performed identically.

I am very surprised. I would think 2x RAM would yield nonlinear gain, but sounds like... exponential decay in diminishing returns?

What about people that connect like 4x512GB for 2TB of ram? Are they seeing any benefit at all, or is just for the memes?

[-]

xraybies@reddit

I was expecting about 3GB for the OS and VS Code, but look at a normal workload before loading any local AI platform.

It thermal throttles with a delay (longer than my x86 machines), so if your prompts are chat turn style, I would say no. But if you ask for a code review it throttles. My 14" MPR becomes very loud and hot, and since I use it as a "lap"top, uncomfortable.

With regard to performance (64 vs 128) I'm referring to synthetic bench... so margin of error mem bandwidth, cpu, gpu, etc scores.
My original plan was to get the 64GB bcos who the f wants to shell out $1k for 64GB!? But then the realization kicked in that with 64GB you've only got \~40GB for a model and context size is also increasing and sure to w/ TurboQ I figured I might limit my options so I kept the 128GB.

I tried to reduce the OS footprint (https://github.com/rayone/machete) which was mostly unsuccessful (-2GB on boot), but it has decluttered the macOS experience. My last macOS experience was 12yrs ago and it's only gotten much worse... they just pile crap on top of junk and make it all interdependent.. like remove one component and then you find out that some completely unrelated icon in settings requires it. The idiots had a golden opportunity to start fresh with ARM64 but instead just bolted crap on... beyond belief for a multi TRILLION $ company. Some dumb ass in Apple probably had a requirement for their to be 1 macOS package for every supportable machine and so Intel get ARM junk and ARM get AMD and Intel crap with some Rossetta on top.

[-]

_derpiii_@reddit (OP)

That’s the unfortunate nature of modern operating systems, esp Apple. And probably 90% of this unnecessary bloat is for UI polish like liquid glass bleh

[-]

No_Mountain_5569@reddit

I thing one thing to remember is that it’s unified ram. The system needs ram to run but it also needs some video ram for rendering the screens

[-]

_derpiii_@reddit (OP)

I thing one thing to remember is that it’s unified ram. The system needs ram to run but it also needs some video ram for rendering the screens

As obvious as that sounds, I didn't realize it 😂

I'm so used to that being handled by an integrated graphics card. Thank you for that insight :)

[-]

PinkySwearNotABot@reddit

i started with MLX and now are moving more towards high quality GGUF quants. MLX are a bit faster, but I hear they can be "dumber" -- which makes sense. your experience?

[-]

cobquecura@reddit

I have one and I have found that while it is not incredibly fast, with Qwen3-Coder-Next in conjunction with OpenCode and OpenSpec I am able to consistently get features added with only occasional intervention. Something around 500 t/s prompt processing and 50 t/s generation up to 200k~ context.

I also make heavy use of kubernetes locally and having a ton of memory is a huge help for that too.

[-]

SkyFeistyLlama8@reddit

At 200k context and PP 500 t/s you're waiting 7 minutes for the first token. I hope there's a way of caching that huge context.

[-]

MiaBchDave@reddit

No, you don’t wait 7 mins. Unless you’re deleting KV every time (or using LM Studio instead of oMLX).

[-]

TheOriginalG2@reddit

I find oMLX is slower than LM Studio actually both prompt processing and TPS is slower.

[-]

SkyFeistyLlama8@reddit

You are waiting 7 mins if you kill the inference process to save RAM for another program or if you switch models without using a model router.

Sometimes I deal with large documents that are stuffed completely into context. There could be different documents used for each chat run so it's a fresh context load at 100k for each new chat.

[-]

MiaBchDave@reddit

oMLX has hot/cold KV cache on SSD. Is that what you’re looking for? Check their settings. LM Studio does not handle cache well at all through any harness using MLX (or batching for that matter).

[-]

ImpressiveHair3798@reddit

Pourquoi ta pas pris 8 pour avoir les débits de 14,5 max car en 4 to je crois que c’est 12,5

[-]

SkyFeistyLlama8@reddit

I'm not on Apple, I use llama-server from llama.cpp as my main inference runner.

[-]

MiaBchDave@reddit

Ahhhh… it was an Apple topic so I assumed. I think there’s swap disk cache or something through vLLM possibly, but I haven’t experience there.

[-]

SkyFeistyLlama8@reddit

No prob, it's good to see what's working and what's not across different inference solutions and hardware architectures. We're all a bunch of mad scientists tinkering in home labs.

[-]

dgdosen@reddit

assuming you're running a fork of that same 200k context - could prompt caching be a savior?

[-]

_derpiii_@reddit (OP)

200k context and PP 500 t/s you're waiting 7 minutes for the first token

ohhh, I never made that connection. You could just straight math it 🤣

tbf, that's the entire context window.

I guess realistically, for light coding (prototyping, no massive codebase loading), if you restart session at 20% that's around 90 seconds.

And that's just token processing/intake. Inference/thinking is another step right?

[-]

SkyFeistyLlama8@reddit

To be fair, I rarely use that much fresh context. You gotta calculate TTFT (time to first token) to see how responsive a rig would be, depending on model and context size.

I don't have an M Apple beast machine so I'm getting pathetic numbers like PP 150 t/s at smaller contexts like 20k tokens. At that speed and context size, I'm waiting about 2 minutes for prompt processing to finish.

Inference can be with reasoning or non-thinking. With reasoning, I could be waiting another minute before the final output tokens appear.

[-]

_derpiii_@reddit (OP)

To be fair, I rarely use that much fresh context.

Hmm. Are you implying there's prompt caching optimizations?

[-]

No_Afternoon_4260@reddit

Ofc there is

[-]

skilesare@reddit

Ok..fine ..I'll ask...link? I run llama.cpp and could certainly benefit from caching as I run a ton of agentic stuff with different personalities and emphasis.

[-]

SkyFeistyLlama8@reddit

Based on my own usage of llama-server, it caches prompts in memory up to a certain size. I think it looks at the first few thousand tokens to find a cache match, then it runs prompt processing only on the difference between that cache hit and the new prompt.

It works if you run a long chat session against a fixed document corpus or you run coding agents against a fixed code base. If your context is always changing, prompt caching doesn't work.

I wish there was a way to save a huge prompt cache to disk and then reload it using the same llama-server web interface as regular chat history.

[-]

hurdurdur7@reddit

Llama.cpp caches prompts just fine.

[-]

DifficultyFit1895@reddit

There are ways to cache context. LM Studio has it for most GGUF models and oMLX has it for MLX.

[-]

somerussianbear@reddit

I’m getting good cache on oMLX but can’t see the GGUF work on LMS. Mind pointing to how to set up/how to get it to work? Or just which HF models you’re using that works for you. oMLX makes it super clear to see the cache and the impact while LMS and llama.cpp have no visuals (that I’m aware) so hard to see it.

[-]

SkyFeistyLlama8@reddit

Caching works on llama.cpp or llama-server but only if the coding harness or chat UI doesn't add a prefix to chat history.

I don't have much use for long context caching because I'm switching models, agents and conversations a lot. Killing a llama-server process kills any cached prompts. Can you save and reload processed context from disk?

[-]

somerussianbear@reddit

oMLX’s hot/cold cache would make that almost instant

[-]

Narrow-Belt-5030@reddit

That still feels surprisingly good though - 200K context (claude code) + 50t/s is fast enough to work with (imho anything below 10 is a nightmare, and above 20 is ok)

[-]

TheOriginalG2@reddit

try Qwen3.6-35B-A3B I can get 90TPS and about 77tps loaded up with 80k context and about 50 at 200k its 100% fast enough.

[-]

rm-rf-rm@reddit

below 10 is a nightmare, and above 20 is ok)

Honestly, for most tasks that most devs have, its more amenable to just get the agent running and only revisit after 1-2hrs - the dev can/should move on to other work. If its done in minutes, you are in that no mans land where you cant switch to doing something else as by the time you start with that, the agent is awaiting input. And even if you are getting somethnig else done (could even be other agents), the constant context switching is unproductive and taxing

[-]

kyr0x0@reddit

Until it decides to rm -rf and hallucinated a symlink before. Are you running everything in a container?

[-]

rm-rf-rm@reddit

yup only ever run in a container.

[-]

hurdurdur7@reddit

After seeing what the hallucinating does, i am at the same point, the agent runs always in a container

[-]

AlwaysLateToThaParty@reddit

50t/s is fast enough to work with (imho anything below 10 is a nightmare, and above 20 is ok)

Yeah, that's pretty much where I'm at. On an edge device, it doesn't matter if it's 5-10, as they generally aren't time-critical. But if you're doing back and forth inference, whether it's coding or RP, it's just not usable below that.

[-]

ImpressiveHair3798@reddit

Ta quel Mac

[-]

jonydevidson@reddit

Are you hosting in oMLX?

[-]

cobquecura@reddit

I tried, but couldn’t get oMLX working as well as LM Studio. I ran into looping issues, and I can add a system prompt in LM Studio that I can’t seem to do in oMLX. I’d like to try again with oMLX in the near future because it seems the processing speed would improve even more.

[-]

MartiniCommander@reddit

I'm using oMLX without a hitch an I'm pretty illiterate. Start from scratch.

[-]

Gipetto@reddit

This actually sounds great for me. I'm heavy handed with my monitoring of AI coding, so I have it go in steps and I monitor the diffs. I think I'm gonna be just fine when mine finally arrives.

[-]

_derpiii_@reddit (OP)

with Qwen3-Coder-Next in conjunction with OpenCode and OpenSpec I am able to consistently get features added with only occasional intervention. Something around 500 t/s prompt processing and 50 t/s generation up to 200k~ context.

I know this isn't a fair comparison but, how would you rate it against Claude Code w/ Opus? I'm not expecting parity of course :)

I hear of hybrid workflows of using Opus to plan, and local to implement.

[-]

Broad_Stuff_943@reddit

Opus to plan and then local or a cheap model to implement is exactly what I do. It works well. The key is to have Opus outline an implementation plan so the "lesser" model has very little thinking to do.

[-]

_derpiii_@reddit (OP)

The key is to have Opus outline an implementation plan so the "lesser" model has very little thinking to do.

That sounds like the best of all worlds :)

Are there any clever prompting tricks you give to make Opus aware it's handing off to a local agent model? Or does it just figure it out from the config?

[-]

Broad_Stuff_943@reddit

I tell it to create a plan and write it as though a junior is going to pick it up. It adds a lot more information that way.

Whenever I've told Opus that I'm planning to hand off to a different model the output isn't as good. By telling it that a junior is picking up the work, it seems to add a lot more context and examples.

[-]

_derpiii_@reddit (OP)

I tell it to create a plan and write it as though a junior is going to pick it up. It adds a lot more information that way.

That's a great idea. Thank you!

[-]

Broad_Stuff_943@reddit

You're welcome!

[-]

Last_Mastod0n@reddit

This is the way

[-]

Randomdotmath@reddit

Yeah, this is classic Cline-style workflow from the early days. Those tools were built around the idea that even if the LLM isn’t that smart overall, if you break the task down into tiny, well-defined pieces, it can perform surprisingly well.

The meta hasn’t really changed — except now your local Qwen is probably stronger than GPT-4o. So using frontiers for planning and a strong local model for implementation is actually excellent.

[-]

Last_Mastod0n@reddit

Its okay if you give it explicit instructions. Claude code with Opus is going to fill in all the gaps and design the implementation MUCH better than any open model at the moment.

[-]

_derpiii_@reddit (OP)

I also make heavy use of kubernetes locally

How does Kubernetes fit in with local LLM workflows? Is each pod running its own LLM?

[-]

cobquecura@reddit

I just mean that it’s another valuable use of the substantial RAM that goes beyond LLMs, not that I use kubernetes with LLMs (in containers or something).

[-]

_derpiii_@reddit (OP)

ah gotcha! I was envisioning some sort of elastic LLM 🤣. Kubernetes is so cool.

[-]

rhapsodyvm@reddit

I’m on the same doubt. I have a M1 Pro 32gb. I can run a few quantized 35b and 24b models, like qwen3.6, qwen coder, devstral, etc. I have two different response speeds: Ok speed - for simple/small questions, small project, it reply and generate code in a “ok” speed. But it’s completely unusable when it needs to gather out info and read files to work. The thinking and token processing steps takes forever. Looking at the logs looks it seems the reason is that the actual speed is very low for medium/big reasoning tasks, as it generate a lot of tokens and process a lot of prompts while it thinks.

It might be ok if you will leave it working in background for you throughout the day. I tested qwen 9b and it was far more responsive, but sometimes it makes dumb errors and waste a lot of trying to solve something it really doesn’t know how to.

The tests I did: a have a medium project that already have about 15k loc. It has a very standard of adding new crud screens. I asked it to create the screen for user pets crud operations, following the project standards, as described in the agents.md.

Using qwen 3.6 35b a3b iq2_m from unsloth, it generate the new pets feature with 13 new files (small ones). The files have the edit/show form with the fields of pets model, the list table, schema of the models, hook to use the pets data. It followed the project guidelines correctly, using base form, fields and data table of the project. In the first run it generate the files with a few import and lint errors. It didn’t create any test. It took 40 minutes for this task. The same task was done in less than 5 min by GitHub copilot with GPT-5.4.

Honestly, it did a good job. But a junior dev would do it in about the same time, because it just need basically copy another crud feature and rename to “pet”.

So I’m wondering: with a m5 max, how would it perform? I know that my m1 pro doesn’t compare with a m5 max, but the point is: in a medium sized project with real tasks, how would it perform? Which model size would be realistically performant?

Loading a big model and ask isolated for something isn’t the same as actually using it to work on real projects.

[-]

MiaBchDave@reddit

The M5 Max is the first system that can sorta work locally with large enough code bases. I currently am trying a few things. Qwen3.5 122B is “ok” for getting one-shots done. Will be trying new Gemma4 26B MoE as well.

Harness stack: OpenCode > oMLX > https://huggingface.co/andrzejmontano/Qwen3.5-122B-A10B-Vision-MLX-Mixed-5bit

If you pull up oMLX’s website, it has a great amount of uploaded model benchmarks (which the UI can run) to get an idea of PP speeds… just filter by M5 Max (40 core GPU): https://omlx.ai/benchmarks

I find the context cache in oMLX makes relatively quick work with 100-200k context sizes.

[-]

dgdosen@reddit

Those benchmarks are exactly what i was looking for...

[-]

ImpressiveHair3798@reddit

Ta quels config ?

[-]

ImpressiveHair3798@reddit

Ta quel config exact avec ssd ?

[-]

Particular-Pumpkin42@reddit

I bought this one as a successor to my MacBook M1 64GB and don't regret. I run Qwen3.5 122B 6bit with full context as my daily driver and it's a huge help for my professional work as a software developer. The M1's prompt processing was becoming a stopper for the more recent capable models. However note: money-wise it didn't hurt me much. And I started with local interference with my Macbook and never really Had hands-on experience with large models in a GPU Cluster. And I do not use Cloud interference at all for professional work due to privacy concerns. That's why I never felt that the M5 Max is not intelligent or fast enough as I am not comparing :) Only you can decide what's right for you, but to me, it feels magical having a LLM model strong enough for professionell work running locally in an all-in-one portable machine with Keyboard, Touchpad and Display :)

[-]

PinkySwearNotABot@reddit

how much ram does that model take up on your 128GB?

[-]

dankfrankreynolds@reddit

Around 90GB For me it crashes my Mac Studio around 80k context window (The screen just suddenly goes black and reboots)

I switched to MFXP4 and it’s closer to 70GB. I don’t have a lot of anecdotal data on quality, but it’s still more competent than the other models I tried.

I’m currently trying to switch between that and gemma31b as a very slow “was this actually implemented the best way?” second pass

[-]

_derpiii_@reddit (OP)

That's awesome! How's the thermals? I'm sold as long as it has no thermal limitations over the upcoming M5 Ultra.

Qwen3.5 122B 6bit with full context

Taking note, thank you.

[-]

lolwutdo@reddit

how fast is your PP with 122b?

[-]

wouldntyaliktono@reddit

I stepped up to M5 128gb from M1 64gb and it's a night and day difference, mostly because of the prompt processing speed. It's made local Claude Code a realistic option for offline development. Qwen Coder Next 70b with the 8-bit quant has been my go-to, but I've also had some success running the 4-bit quant plus a smaller secondary model for sub-agent tasks. Here's a quick comparison I just did of my new machine vs. the M1 I was using previously: https://www.youtube.com/watch?v=k8YCLZ-OAuk

[-]

_derpiii_@reddit (OP)

https://www.youtube.com/watch?v=k8YCLZ-OAuk

Ok wow, that's QUITE the difference.

What's your main use case?

Taking note your favorite model is: Qwen Coder Next 70b 8-bit

[-]

wouldntyaliktono@reddit

At the moment, my use case is fast iteration on firmware for embedded electronics. But I also have a bunch of ML projects cooking as well. So the extra ram helps for both serving of language models for development, and training / evaluation of the models I'm building for my projects.

[-]

_derpiii_@reddit (OP)

That’s awesome. What a framework/tooling are you using for training/evaluation?

[-]

wouldntyaliktono@reddit

PyTorch mostly.

[-]

_derpiii_@reddit (OP)

PyTorch mostly.

Thank you - I'm trying to figure out the eval tooling options. Will take a look

[-]

BidWestern1056@reddit

i use some of the 120s but they arent enough of a jump in intelligence over the 30b class to justify the drop in speed usually

[-]

_derpiii_@reddit (OP)

Oh wow, that's a nuance I would not have expected.

So it's diminishing returns.

So it's more beneficial from the standpoint of having multiple models in memory at once (3x 30b models at each point in pipeline)?

[-]

t_krett@reddit

Why? You don't get more speed by loading a model 3 times. The only advantage is you can have more specific and fine tuned models without loading them into memory. But unless you have specific models for a use case it would probably be better to go with the bigger model.

[-]

_derpiii_@reddit (OP)

e.g. running a RAG with 3 different models: embedding/retrieval, requery/re-ranking, generation

[-]

BidWestern1056@reddit

yeah i'd say so, one of the things i'm working on with npcpy and npc ecosystem is to make it possible to achieve greater quality from ensembling responses from 1b-10b class models so you could likewise get 10 responses in parallel and then a smarter (30b-100b) model at the end synthesizes .

https://github.com/npc-worldwide/npcpy

https://github.com/npc-worldwide/npcsh

[-]

somerussianbear@reddit

But 120s MoE are way faster than 30s dense right?

[-]

Themash360@reddit

Depends on active parameter count, but yeah recent models go for like 10b active for instance , so if all 120b fits in fast memory speeds will be really good.

I do unfortunately think qwen 27b is better quality wise, but if you’re low on compute it will be a tough model to run.

[-]

BidWestern1056@reddit

yeea but like qwen 35b is also moe so its way faster lol

[-]

ResearchCrafty1804@reddit

But can you run mlx or use metal api inside the docker containers that run through colima?

[-]

Varmez@reddit

I bought one, it hasn't arrived yet. I figured with how much belt tightening the online models are doing, in conjunction with the expectation itl'l last me\~5 years, that the extra between 64-128gb is "cheap insurance" .

My hope is that I can effectively get by on a $20 plan or two, use something liek Codex as the "planner" in something like Cline, then have a local model, likely qwen or gemma do the actual implementation. I've been trying this on my M1 Max in a round about way by using some of the offline available models on openrouter to get a feel for it and it works pretty well for my uses.

[-]

shuwatto@reddit

it works pretty well for my uses.

Could you care for sharing your usecases and workflows?

I've tried the same thing and miserably failed. :/

[-]

Varmez@reddit

My use is probably considered pretty basic, just crafting n8n workflows that interact with shopify and zoho apis, do a fair bit of data parsing from scraping vendor sites / controling browserless. Along with a bit of liquid code related stuff on shopify themes / templates.

I have docs / scafolding that i've refined continuously overtime that probably help a fair bit though

[-]

eaz135@reddit

I have a max 64, and a pc with a 5090 (with 192gb ram). Find my hands automatically wanting to work with the PC. I run local qwopus 27b v3, and getting very good results with it.

I treat the Mac as more of a beastly machine for working with cloud inference (cursor, codex, cc, etc). Good specs to be running many agents simultaneously on ghostty, building multiple things at once, etc.

I get more done with the Mac in the setup above, I treat the PC local AI setup mostly as entertainment / hobby. Don’t get my wrong I’m very productive on it and get a lot done, and it’s very fun at the same time running it locally - but it’s not the same as having 10 terminals open simultaneously with Opus / Codex cranking away in each of them.

[-]

_derpiii_@reddit (OP)

Good specs to be running many agents simultaneously on ghostty, building multiple things at once, etc.

I get more done with the Mac in the setup above, I treat the PC local AI setup mostly as entertainment / hobby. Don’t get my wrong I’m very productive on it and get a lot done, and it’s very fun at the same time running it locally - but it’s not the same as having 10 terminals open simultaneously with Opus / Codex cranking away in each of them.

Appreciate the insight into your workflow. I'm new and still plebbing out with iterm2 🤣

I keep on seeing Ghostty mentioned. Could you go more into your setup/tooling (ghostty with tmux?)?

Also surprised you're using all 3 (cursor, codex, CC). Are they collaborating together or are they individual projects?

[-]

eaz135@reddit

I'm subscribed and I have access to pretty much everything, deciding which tools I want to use today on certain projects is kind of like deciding what clothes I want to wear on the day.

I've been working in software (big tech, investment banks, scale-ups, etc) since 2010 so I have a good sense of tech. My style with AI is generally giving very direct, bite-sized tasks for execution. With this approach a lot of the tools are good enough - because I'm driving a lot of the direction myself, so it almost doesn't really matter what I pick, the end outcome is going to be very similar. I just like to get my hands dirty with all the tools, I find it fun

[-]

xXy4bb4d4bb4d00Xx@reddit

it’s great

[-]

somerussianbear@reddit

Solid argument

[-]

xXy4bb4d4bb4d00Xx@reddit

thanks

[-]

abnormal_human@reddit

If you want a mid-competent chat assistant with predictable latency and privacy they're great, but it's no RTX 6000 when it comes to running large models quickly with long context and it's too compute poor for significant parallel work.

[-]

Consumerbot37427@reddit

when it comes to running large models quickly with long context and it's too compute poor for significant parallel work

To elaborate on that point: my observation is that parallel sub-agents basically freeze entirely whenever there is any prompt processing to be done. I can only assume that a machine with multiple graphics cards wouldn't behave this way.

[-]

_derpiii_@reddit (OP)

my observation is that parallel sub-agents basically freeze entirely whenever there is any prompt processing to be done.

Are the subagents sharing the same model or does each have its own LLM? If they're sharing the same model, I can understand it pausing.

But if they each have their own LLM, I wonder what the bottleneck is (not memory, nor bandwidth - maybe some queue switching issue?).

[-]

Consumerbot37427@reddit

I was using the "parallel slots" feature in LM Studio.

[-]

somerussianbear@reddit

Same. Noticed that we definitely have to disable background agents.

[-]

victor_lowther@reddit

It is good stuff. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit is a local sweet spot -- I average around 50tok/s generation, and oMLX's prompt cache makes prompt processing a total non-issue especially compared to lm studio. Currently experimenting with pi + oh-pi, but the ant colony agent style is driving the system into swap -- currently getting 1k tok/s prompt, 20 tok/s gen. Haven't experimented with turboquant yet -- it and Gemma are next on the list once oMLX support stabilizes.

[-]

_derpiii_@reddit (OP)

Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit is a local sweet spot

Awesome. Saving this for later, thank you

[-]

synn89@reddit

Depends on what you want to do with it. I have a M1 Ultra 128GB and it's been wonderful for chat models. It's low enough power I can just leave it on, all the time, and 128GB of RAM is a lot of breathing room for 120B and down models. Even though right now I'm running a Drummer Skyfall-31B, which doesn't need all the RAM, it's nice to have when I want to run a 120/122B and I can squeeze in a 235B if I really want to.

It's quiet, sips power and is very flexible.

[-]

_derpiii_@reddit (OP)

I have a M1 Ultra 128GB and it's been wonderful for chat models.

I'm actually setting up a RAG on M2 Ultra 64 for a friend. What do you like to use for chat models (chat is generation right?)?

[-]

habachilles@reddit

Get the ultra if you can

[-]

PrinceOfLeon@reddit

I have a M3 Max 128 GB and have been happy since picking it up the week it came out.

I keep Qwen3-Coder-Next @ Q8 w/256k context running at all times, with Qwen2.5-VL-7B-Instruct (for occasional vision) alongside.

There is enough memory left over that I have felt no impact with dozens of browser windows, my IDE, Docker, email client, and so on. As in looking right now there is still 8 GB of memory free.

I still use Claude Code for primary development work, but with a custom hooks-based AI monitor leveraging the local LLMs (via llama.cpp server) to watch what Claude is doing and analyzing risky tool calls and other operations (reading is green, writing or network transfers are orange, and delete is red), as well as evaluating "drift" if it looks like CC is doing things which are not aligned with user prompts or CLAUDE.md instructions. This results in periodic, brief bursts of GPU usage which don't have any perceivable effect on my workflow. I'm not actively waiting for replies and performance-wise I wouldn't know the LLMs were kicking in if I didn't have a CPU/GPU/RAM/Network monitor going in the taskbar.

I've tried pointing CC at Qwen3-Coder-Next for development, and it can get the job done, but I've actually had better results using Mistral Vibe (currently) or OpenCode (previously) as the harness. "Better" as in I get responses back quicker and sometimes CC seems to just get "lost" an will still be appear to be processing files an hour later but with no clear end in sight. I only tend to go entirely local more for routine sysadmin tasks or editing things like Home Assistant and Frigate configurations (things I don't want to leave the private network).

In short, having that level of headroom for memory means not only being able to run "large-ish" models locally, but being able to run useful models while still using the system to get actual work done, without compromises.

[-]

JonSwift2023@reddit

How's it compare directly to the M4 Max 128GB? Anybody do the upgrade and have real numbers?

[-]

Southern_Sun_2106@reddit

Loving it. Had M3 128GB before. PP is the real deal with this generation. My fav models are GLM 4.5 Air and the new dense 27b Qwen. This M5 makes running those two and smaller omnicoder 8b instances all together (as little agents) very nice. I recommend taking it for a drive for 14 days as apple allows, if you have such an opportunity, and decide for yourself, trying it for your uses. PLUS it is an awesome laptop, with a magnificent screen and super-nice sound, think and slick.

[-]

Pleasant-Shallot-707@reddit

This makes me happy to hear as someone who’s picking up their new m5 max 128gb soon.

[-]

drewbiez@reddit

Running Gemma4 moe, it flies, does well for my use case and I’m done paying for ai plans for a while :-)

[-]

_derpiii_@reddit (OP)

What's your use cases and workflows like?

[-]

drewbiez@reddit

I'm also experimenting with the lower end models too, but I figured, my laptop has the ram, why not use the big baddie.

[-]

drewbiez@reddit

Light automation and just general use really... Web scrapers with n8n, some automation with n8n, some data ETL workflows for prepping data for clients, general questions, light scripting/coding, nothing too wild.

[-]

monjodav@reddit

Ok ish but honestly not that fast you’ll need gpus to achieve anything opus-ready at more than 40t/s

Been using qwen 122b and it’s incredibly slow but makes the job

Lets see which models are coming next

[-]

_derpiii_@reddit (OP)

What's your use case and applications?

And even if it's slow, How's the quality of the output?

If it's 1/10th the speed of opus with 90% the results, I would just toss it in a ralph loop and sleep in a toasty room.

[-]

New_Public_2828@reddit

No no. Hold up 3 fingers in front of your face. Only way to know for sure this isn't AI.

Commenting because I'm also curious

[-]

Hey_Gonzo@reddit

I almost died reading this. That was the dumbest interviewee.

[-]

_derpiii_@reddit (OP)

it's an interview? I thought it was from the Indian filter scammer?

[-]

_derpiii_@reddit (OP)

Instructions unclear, face stuck in 3 dicks

[-]

lolexecs@reddit

3 dicks? Shouldn’t this be base 2 - like four dicks - you know going tip to tip?

[-]

-Crash_Override-@reddit

Ok grok

[-]

StandardKey7566@reddit

Common problem, wait it out, it'll either get better or a lot worse!

[-]

GymRatNowCovidFat@reddit

I think the qwen 3 next coder 8bit seems like a really good model so far. I almost find myself wishing they gave us the option for 256 GB for the macbook pro. I think I could have been OK with 96 GB if it existed. I don't think 64GB would have worked for me because I'm constantly seeing how far I can push large local models.

[-]

wgg_3@reddit

It’s ok

[-]

rorowhat@reddit

Atriz halo 💯, more versatile and cheaper.

[-]

_derpiii_@reddit (OP)

ahaha, I acknowledge the PCmasterrace :)

At the moment, I get the impression, local LLMs are best suited for brute force coding (vs orchestration). And linux + GPU wins by a landslide.

Apple's unified memory only has the advantage in orchestration, but sounds like the models aren't good enough yet.

[-]

rorowhat@reddit

Macs are great for casual use, but for real work the flexibility of the Pac always wins.

[-]

Look_0ver_There@reddit

Sadly nowhere near as cheap as they used to be. Started off at $1800 for the 128GB models. Now it's rare to find them for less than $3000. Above the $3000 mark it all gets uncomfortably close to being better off to just throw a bunch of R9700Pro's at a PC instead.

[-]

DoorStuckSickDuck@reddit

Eh, the Bosgame M5 used to be $1800 at release, I got mine at $2000 (\~2 months ago), and now it's $2400 for the 128GB model. Will likely grow even higher in the future.

They're very good machines for specific use cases. People here parrot throwing together rigs with 3090's until they look at the power consumption, noise, etc etc. It's not the fastest, but it's very efficient for what it does, and it's great for an always-on server.

[-]

Look_0ver_There@reddit

Yeah, Bosgame are the only remaining vendor below $2500. Last I looked two days ago, everyone else is now at the $3000+ mark. It will be surprising if Bosgame don't go to $3000 within a week.

Don't get me wrong. I have two Strix Halo machines myself linked together via USB4, and I can run 400B models on them using rpc-server at 20tg/sec. Fantastic machines. They unfortunately struggle with dense models. I also have a pair of R9700 Pro's in my PC for handling dense models at an acceptable speed.

When I got the Strix Halo's, they were $2000 each (I got them just as the prices started to go up). Today though when I look at what the R9700 Pro's can do, I find myself asking the question if Strix Halo's make sense any more at the $3000+ mark.

That's where I'm coming from. I ain't no parrot. This is based on direct experience.

[-]

_derpiii_@reddit (OP)

Sadly nowhere near as cheap as they used to be.

Wait the M5 Max has gone up? Oh sheesh, I thought Apple would keep the old retail pricing

[-]

Look_0ver_There@reddit

I was referring to the Strix Halo's, since the comment I responded to explicitly mentioned the Strix Halo machines.

[-]

SwordsAndElectrons@reddit

Read again what this person replied to.

[-]

IsThisStillAIIs2@reddit

if your expectation is “near cloud model performance locally,” you’ll be disappointed, but if it’s “fast, private, always-on inference,” it’s actually great. the sweet spot tends to be \~20b–70b quantized models for coding, assistants, and structured tasks, anything bigger starts to feel slow or memory-constrained even with 128gb.

[-]

droning-on@reddit

Are you using OpenClaw?

Curious if it's able to handle a decently complex context and still perform some coding tasks.

Ie code and follow a workflow at the same time.

[-]

Velocita84@reddit

I'm getting really tired of these "honest take"s

[-]

_derpiii_@reddit (OP)

wdym? I did a search and didn't find anything.

[-]

Velocita84@reddit

It an overused slop phrase, like "curious what you guys/the community thinks". You might've picked it up from other ai generated posts if you didn't ask an llm to make you a title

[-]

brendanl79@reddit

as someone looking to buy a fat Mac in the next few months this question and thread interested me. go sulk in a corner

[-]

_derpiii_@reddit (OP)

are you waiting for the M5 Ultra announcement too? 😆

[-]

brendanl79@reddit

hahaha exactly

[-]

Velocita84@reddit

My problem isn't with the subject matter, it's the literal string "honest take"

[-]

_derpiii_@reddit (OP)

idk what's worse: my writing style triggering you (a small minority), or you commenting something that's so negative (which is going to negatively impact more people).

If you don't like it, please go complain on another thread that's actual slop. I wrote this myself, and yes I even use - dashes, long before reddit even existed.

Your problem with me is just you.

[-]

_derpiii_@reddit (OP)

It an overused slop phrase

Or maybe LLMs are trained on old school redditors with a penchant for this kind of writing style.

[-]

Velocita84@reddit

I have NEVER seen it overused this much before LLMs, and it's only in this sub so you know it's because of LLMs

[-]

FastDecode1@reddit

Yeah, like wtf are people expecting by saying that? That someone with a dishonest take won't give it to you anyway?

Kinda like combating illegal guns by making it more difficult to obtain a gun legally. I'm sure the criminals will care about the new laws buddy, great job.

[-]

_derpiii_@reddit (OP)

Yeah, like wtf are people expecting by saying that? That someone with a dishonest take won't give it to you anyway?

hah, fair enough. I agree logically it makes no sense. It's more of a colloquial rhetoric I find helpful in... natural sounding discussions?

Either way, not sure why you're so tilted.

[-]

That_Country_7682@reddit

got one last month. 70b quants run surprisingly well, the unified memory is no joke.

[-]

nickludlam@reddit

Which 70B models? My impression was that the 70B size had fallen out of fashion, and we're seeing more of a cluster around 30B and 120B.

[-]

_derpiii_@reddit (OP)

got one last month. 70b quants run surprisingly well,

What applications and workflows do you find it best for? And which models and quantization model would you recommend?

[-]

New_Public_2828@reddit

I heard it's ok for large models but if you need speed you need gpus

[-]

_derpiii_@reddit (OP)

That's my impression as well, would be curious to hear some concrete examples in this thread

[-]

Its_Powerful_Bonus@reddit

MoE ~120B works really well. Prompt processing improved dramatically vs M3 Max. In token generation there is some improvements, but I’ve expected more difference - maybe I’ll see it after software will adapt. For sure it is not rtx 6000 pro 96gb speed, but for device which I can run in travel it’s wonderful.

[-]

That_Country_7682@reddit

got one last month. 70b quants run surprisingly well, the unified memory is no joke.