Are there actually people here that get real productivity out of models fitting in 32-64GB RAM, or is that just playing around with little genuine usefulness?

[-]

cmk1523@reddit

I enjoy quality over speed so that means q8 versions and thus: I need tons of ram.

[-]

Lazzollin@reddit

I daily drive qwen3.6 35B for coding in various projects (64gb ram and 24gb vram)

[-]

I had Claude spec requirements and divide them into slices then with a team of opencode agents running with local qwen 3.6 27b I had them implemented.. /loop in Claude made it wake up every 5 minutes, check in opencode status directly(it's just a sqlite database where it stores all agents messages tool calls etc) and upon completion of a slice, start a new headless opencode with a custom command to pick up the new slice.

Absolute madness how much tokens the setup works but once a couple endpoints and the solution structure was set up, it was just churning code. Unit tested and compliant wit the functional requirements. It's not teh best code, sometimes repetitive and I would have tackled some problems differently but it's not terrible. Most important it's homogenous and 100 percent unit tested.

So far I built the backend of an “split the bill“ demo app, about 15 endpoints all unit tested. Pretty happy about how it all worked out. All code was written by the local model 100 percent.

I was planning on doing the front-end today and that comes with other pain points.. Playwright etc... But then I had an idea to make kefta BBQ and now here I am with my Sunday afternoon low 😂

Qwen 3.6 27b fp8 with bf16 context. Vllm. It has a bug that sometimes it outputs the tool call before ending the thinking tag and other times it just stops so that is why I believe an overseer is necessary. The overseer could be a local model, too. The logic is not too heavy.

[-]

arrty@reddit

Anyone having good results on M4 pro?

[-]

RootBeerWitch@reddit

I have a small always on linux server with an i7 and 32gb of ram. No gpu. I recently realized that I can run ollama 8-14b models and have my scripts and overnight processes hit this. Yes, it takes a minute or two to answer but because it's for scheduled jobs and most of them run overnight or every few hours, I don't notice the lag. And now even 8b models can be genuinely useful.

[-]

ciaoshescu@reddit

What do the scripts do?

[-]

RootBeerWitch@reddit

Mostly personal automation, a mix of life OS stuff and data pipelines:

Daily health pipeline: pulls sleep, steps, and workout data from my phone (Health Connect/Fitbit/strava), builds a CSV, then generates an "On This Day" note in Obsidian comparing today's stats to past entries
At-a-glance briefing: runs 3x/day, compiles a summary dashboard from my notes, calendar, health data, and finances
Plaid sync: pulls bank transactions automatically
Obsidian utilities: auto-tags notes so i dont have to organize them and extracts ideas I've flagged for later
Nutrition bot: separate Telegram bot where I log food (including photos), it estimates macros and logs everything to CSV, syncs to dashboard

Nothing that needs instant responses, it's all "fire and forget." Not all of these hit the ollama models, maybe half of them now.

[-]

Haeppchen2010@reddit

Yes. (on 2 old GPUs 16+8GB).

Use Qwen 3.6 35B (3.5 27B before) for coding (OpenCode, compared to Anthropic models used at work, I put my home setup somewhere between Haiku 4.6 and Sonnet 4.6). Also for general "Chat" directly via llama-server builtin web app, useful to get inspiration, rubberduck with a nonexistent conversation partner, etc.

[-]

robogame_dev@reddit

The trick is to separate tasks that require high intelligence and world-knowledge from tasks that don't.

Small models are genuinely useful for constrained tasks, the fewer the tools and the better the harness the better.

For example, I ran Qwen VL 8b overnight two nights to digitize 1600 page of hand-drawn notes and diagrams, turning them into mermaid diagrams and descriptions of the figures, etc. Was it perfect? Nah, but it was definitely good enough to make them fully searchable and useful as AI context. Great task for a small model.

[-]

King_Tofu@reddit

Are small models less prone to error when doing a specific repetitious task? i.e. would you try to use a bigger model if you could if the speed was relatively the same?

I'm still learning about this area as I save for a local LLM set-up.

[-]

robogame_dev@reddit

If latency and cost and privacy were not concerns, I’d use the SOTA-est frontier model for everything - I am not aware of any task that small LLMs perform better than big ones outside of the speed / locally runnable elements.

[-]

DeepWisdomGuy@reddit

When is this question going to die? Qwen3.6 27B being equal to the most advanced frontier model from 7 months ago should finish it off once and for all.

[-]

cafedude@reddit

I've got a Strix Halo box with 128GB. I've been getting work done with the Qwen3* 27b models (did a lot with 3.6 yesterday!). It's slow - about 8 tps. But I can give it some instructions go off and do something else for a while and come back later to see what it's accomplished. I've also done some work with Qwen3-coder-next which runs really well on my box. So yes, I'm getting actual things done and it's been great since the Claude pro plan is really cutting down on tokens of late. If I have something that I think is beyond the local model's capabilities (and that seems to be less and less) I can give it to Claude to do planning and then have the local model implement.

[-]

Zealousideal_Lie_850@reddit

What harness are you using?

[-]

cafedude@reddit

OpenCode

[-]

SkyFeistyLlama8@reddit

I'm looking forward to a smaller Qwen Coder Next that can fit into 64 GB. The 80GB model now barely fits, leaving maybe 10 GB for other apps to run.

Since 3.6 35B and 27B dropped, I find I never use the older Next 80B. The newer models are a lot smaller and given enough thinking time, they're on par with the 80B.

[-]

Simon-RedditAccount@reddit

I have 32GB RAM + 8GB VRAM. I find Qwen and Gemma family useful.

I'm using local LLMs for drafting mostly Python scripts that process private data (i.e., take this bank statement and that code that does X, and write code that will do X to bank statement) and also for drafting Bash scripts, also some HTML+JS+CSS scaffolding.

For other tasks (where privacy is not mandatory), I go to APIs - they are faster and give superior results.

[-]

EggDroppedSoup@reddit

Ditto, hitting 30 t/s on unsloth qwen 3.6 35b a3b ud k xl gguf. Extemelly usable, since alot of free web models have been getting slower for some reason

[-]

Simon-RedditAccount@reddit

What quants are you using? On Q4_K_M, I'm getting half of that.

[-]

Xyrus2000@reddit

You need to offload the expert layers (not weights, those remain on the GPU). On a 4080 super I'm getting around 65 tokens per second with 20 offloaded. I could probably keep a few more loaded, but I like having a larger context.

[-]

Simon-RedditAccount@reddit

Thanks! I'm using LM Studio but I'm considering switching to llama.cpp. Could you please share your llama.cpp command?

[-]

Xyrus2000@reddit

I am using LM Studio. :)

[-]

Simon-RedditAccount@reddit

Thanks!

[-]

ameeno1@reddit

How does one offload? Which app, is, settings?

[-]

tmvr@reddit

Use llamacpp and use the -fit parameter, does all the work for you.

[-]

Independent_Solid151@reddit

llama.cpp using llama-server, you use the --cpu-moe flag.

[-]

EggDroppedSoup@reddit

unlsoth K XL, max gpu at 99 and cpu at 38 (i have a good cpu)

[-]

LocalLLaMa_reader@reddit

He was asking for which K XL? Q4, Q5, Q6..?

[-]

bgravato@reddit

speed depends heavily on your hardware...

[-]

cosmicr@reddit

Not for coding, but for everything else it's great. I've got 28gb VRAM and 64gb RAM.

[-]

protoanarchist@reddit

Running 16gb locally with 128gb system RAM.

I throw both my CPU and GPU at things, it's not as fast, but starting to see usable output. Still figuring out what the most productive agents and project structures are.

(Love reading about peoples setups!)

Eyeing Intel B70s 😉

[-]

Monkey_1505@reddit

If you can get a dense 27-32b parameter model to think iteratively, it's likely fairly decent as an agent or for coding. Not top tier. But then you also in that scenario want fast prompt processing, which is not a macs strong point.

[-]

jimmytoan@reddit

64GB is where things get genuinely useful for multi-file coding tasks. With 32GB, you're typically capping at 8-16K context if you want reasonable decode speed, which is fine for isolated functions but limits the model's ability to reason across a whole codebase. At 64GB, Qwen3.6 35B at Q4_K_M fits with room for 32-64K context - that's where the coding quality starts to feel more like "understands the system" vs "writes code that looks right." The difference shows up most when the task involves refactoring that needs to stay consistent across many files.

[-]

Long_War8748@reddit

Cloud Models are for work, local models are for fun projects ~

[-]

Mister_bruhmoment@reddit

I can give a useful example of mine. I needed to collect 300 seprate bibliographical descriptions for uni work. I was also rquired to order chronologically and then alphabetically.

The problem was that the ISO 690 descriptions were not the same across all my entries, so I had to clean them up and equalize them.

That's when I decided to try and give a local LLM some real work.

After countless attempts and optimizations I got Qwen 3.5 9B in a loop where it would do the following: - see what my entries were and what exactly needed to be trimmed (after a hefty and detailed prompt!) - write a python script to do the sorting and cleaning (also removing the capitalization from names eg. TYLER -> Tyler) - see the results from the script - evaluate -itterate on the script until it was perfect

The whole job took around 10 minutes and I got basically a flawless result.

It is not that complex of a task but the repetition is what made it very hard. Cant wait for 3.6 9B!!!

[-]

HopePupal@reddit

i run Qwen 3.6 27B Q6_K (previously 3.5) on my 32 GB R9700 and that gives me a second workstream for my off-hours projects.

annoying mostly mechanical refactoring? throw it at Qwen.
need a parser for an uninteresting text format? throw it at Qwen.
web GUI that's not particularly important? throw it at Qwen.
updated some dependency and now a bunch of other stuff broke? throw it at Qwen.

[-]

SnooPaintings8639@reddit

I experiment with really good result with using Codex and/or Opus for analyzing and planing the work, but then asking them to push the actual coding task to pi (non-interactive mode) with local Qwen 3.6 35B instance.

Both I and Codex are quite happy with how Qwen is executing, but I still keep larger models reviewing the output, but I plan to change it soon.

This way I can use my subscriptions for longer without hitting limits, and I plan to push it as far as I can with OpenCode + OpenRouter instead of Codex and Claude Code as orchestrator models.

[-]

sarcasmguy1@reddit

Do you get Codex to steer Pi?

[-]

SnooPaintings8639@reddit

Yes, I just tell it to pass a task to pi, that's all. It checks --help when needed, and I ask to start a new session when we delegate a new task or to resume a previous session when we pass a review comments. I guess it could be turned into a skill to keep it more structured.

[-]

sarcasmguy1@reddit

Smart idea. Do you have a prompt or skill, or is it literally just "use pi to implement this feature"

[-]

Bulb93@reddit

Wow thats super smart. Iirc you're using the smart/codex model to delegate tasks to local model then having the smart/codex model to review the work. So that way you us less paid usage and get the most out of local model with smart planning?

[-]

gbrennon@reddit

yes BUT i dont like models writing code hehe

even if i dont like code written by models i did prepare multiple custom models that the code looks much better than code written from commercial models

[-]

vladlearns@reddit

yep, I don't use them for coding. I have a couple of agents and apps built around gemma and qwen, from gemma 2 to gemma 4, with lots of qwen models in between, before that it was llama

you can do a ton with those

[-]

themaven@reddit

I have a setup I'm calling the Fossil.

2015 Gigabyte GA-Z170XP-SLI motherboard
2015 Intel i7-6700K CPU
2021 NVidia RTX3060 12GB
2019 NVidia GTX1660 6GB

I'm running a random tiny Gemma 4: gemma-4-26B-A4B-it-uncensored-IQ3_XXS

Works brilliantly as the lightweight model for some OpenClaw Agents. Also does Obsidian RAG for me.

55 t/s on latest llama-cpp

[-]

Skiata@reddit

Guess I am the one driving the Miata here:

8G M2 Macbook Air and I get stuff done--

I can just fit Qwen .5B and can do experiments but can't fine tune. But I am doing basic research on learning techniques so it is the entry level for what I will eventually run on an A100 with 7B and 32B models.

BIG BENEFIT RUNNING LOCALLY: I get Claude Code keeping in the loop which is the main reason I don't just bugger off to RunPod for my runs. A hard restart is always painful--there appears to be an in-memory element that I don't quite understand.

I'd like more memory, 64G sounds nice but maybe silly since there is no matching GPU to work with--American muscle car with horse power but no handling kind of sense. For me I'd like 16G so I can at least test my LoRA fine tuning harnesses before going to the cloud. So maybe Miata with a turbo.

[-]

capybara75@reddit

Yes, small models for code autocomplete, larger models for when I can't access internet and need broader programming assistance, and also for text processing and categorisation

[-]

gpalmorejr@reddit

I run Qwen3.6-35B-A3B-Q4_K_M on a GTX1060 6GB with MoE offloading. I use it for coding and such but mostly use it to help me solve large STEM problems and learn for college. Has been great.

[-]

Vapourium@reddit

Mind tossing your llama.cpp/other inference server setup?

What's your RAM situation? DDR4 or DDR5? How many tokens per second do you usually get?

[-]

gpalmorejr@reddit

Ryzen 7 5700 32GB DDR4 3600MT/s GTX 1060 6GB

I almost always use LM Studio on Fedora 43 KDE.

Qwen3.6-35B-A3B gets between 19 and 27 tok/s depending on quantization and such. Generally around 19 to 21 since I most often used Q4_K_M or Q5_K_M. I have used as small as IQ2_M with pretty good results, though.

Settings in picture.

[-]

DiscipleofDeceit666@reddit

A GUI guy and a fedora guy walk into a bar. THEYRE THE SAME PERSON?!

[-]

gpalmorejr@reddit

Almost everything in Fedora is GUI based. Lol. I rarely use a terminal for most things. I do some compiling and the occasion script. But even for a lot of that, script it and never touch it again. I rarely have to mess around under the hood. I'm not scared of it, I just rarely have a reason to dig in that much. Most things just work. It's quite nice.

[-]

DiscipleofDeceit666@reddit

I use fedora too but I uninstalled my desktop environment

[-]

gpalmorejr@reddit

Interesting. Yeah, I like haven't my KDE touch lol

[-]

Prestigious_Thing797@reddit

I use them when I need to run LLM processing on large numbers of documents.
Ofc, you then still need a lot of hardware to run them at that scale, but it's those smaller models that make it possible on something close to a sensible budget/completion time.

[-]

qubridInc@reddit

Yes, people get real work done (coding, local RAG, drafting, automation) with 32–64GB models, but 128GB mainly adds headroom and larger context rather than a totally different class of usefulness.

[-]

chisleu@reddit

Qwen 3.6 27b 8bit mlx is the way

[-]

Greedy-Sense-9402@reddit

I use offline transcription. Then have a python script via ollama that summarises them. Neat way to summarise regular meeting while keeping things confidential. Use it for regular work.

[-]

Euphoric_Emotion5397@reddit

yes, to vibe code is near impossible with my skills. So I use antigravity with gemini pro to get my vibe coding kick.

But the local reasoning model like qwen 3.6 moe is already up there with the rest when you add in all the connectors like search, memory , etc etc. ANd when you connect to your app for analysis . I think it is damn good! :)

[-]

wtfihavetonamemyself@reddit

I just got the m5 pro MacBook Pro 16” with 64GB. I would’ve loved the 128GB, but couldn’t justify $1800 to go to max plus the ram cost. I’m currently running qwen 3.6 Moe (and switching between that and Gemma 4 Moe), Gemma 4 e4b, and e5 large v2 embed. 81k context on the Moe’s. Kv cache of 10GB capped.

I have the Moe’s in Claude code and get about 35 tokens/. About 55 outside of Claude code (trying to address that)

Though the Max memory bandwidth would’ve nice.

[-]

SkyFeistyLlama8@reddit

Not many 64 GB laptop users replying on this post so I'll join in. 64 GB seems to be sweet spot now without going into insane pricing for 128 GB or more.

I'm on another unified RAM laptop platform, Snapdragon X1 (120 GB/s RAM bandwidth) which should be similar to an M3 or M4 MacBook base chip without the Pro or Max GPU upgrades. I also run the same models as you do: Qwen 3.6 or Gemma 4 MOEs, Gemma 4 E4B as a smaller completion or summarizer model, but with Qwen 4 embedding for vectors.

I used to run Qwen 4B or Granite 3B on the NPU before Qualcomm completely nerfed the Nexa AI SDK recently.

The MOEs get about 20-25 t/s on shorter contexts so I'm fine with that. If I need more speed and more brains, I use Gemma 31B or Qwen 3.5 27B on overnight runs. I also use a big fan pointed at the laptop's bottom panel on a laptop riser because laptop inference quickly makes the CPU hit thermal and power draw limits.

[-]

AbeIndoria@reddit

what do you use it for?

My own system where self-replicating, persistent 'agents' manage my homelab. Each of them has an area of stewardship. They usually coordinate with each other for new deployments, updates, fixes, etc etc. Automating tracking of my diet, biometric stuff, homeassistant, tasks etc etc is insanely helpful, not to mention I don't have to worry about stuff that silently breaks in the background.

FWIW, the public repo is still in alpha.

[-]

IamIANianIam@reddit

Absolutely- I’ve set a local Qwen model to parse every page of a 1700-page composite PDF of a client’s confidential medical records (I’m an attorney), extract the relevant info from each one into a JSON schema, identify which pages belong together as discrete documents, create an index, then go through and locate/note every page with info relevant to our case (based on criteria I gave it ahead of time) and give me a full list with a summary so i can go check out the exact pages.

Extremely necessary but brutally tedious task, that I don’t feel comfortable offloading to cloud models for confidentiality/ethics reasons. And it woulda taken a paralegal a full workweek to produce by hand, and they would have definitely missed more stuff (as I know from my time as a paralegal). And that’s just one example- local LLMs are changing the way I’m able to (responsibly and ethically) practice law and serve my clients.

RTX5090 + Intel Core Ultra 9 285K, 64GB DDR5 RAM. Run via Dockerized Ollama instance (switching to llama.cpp soon) in WSL, serving on my homelab LAN, where I mainly interact with it via an M4 Mac mini.

[-]

SkyFeistyLlama8@reddit

As much as I'm skeptical about LLM usage in licensed fields, I agree with what you're doing. If you're going to use an LLM and everyone either is or will be, then better to do it right. Better to keep scopes small and limited with plenty of human-in-the-loop action to keep outputs grounded.

I've also done a similar thing with confidential reports running as overnight jobs. LLMs excel at information retrieval and classification as long as you don't feed it a gigantic context.

[-]

Morphon@reddit

64GB RAM, 16GB VRAM.

Use it all the time. Summarizing, extracting bullet points, rewrites, some basic "programming language reference" (depending on the language, ofc), etc....

Qwen 3.6-35b-a3b Q6K. Amazing.

[-]

notlongnot@reddit

Is this with external and GPU on Mac?

[-]

Morphon@reddit

PC. 4080Super on a 12700k.

[-]

Jeidoz@reddit

Can you share instuctions/guide how to rightly offload to GPU that "active 3b" and rest to RAM for optimal efficiency? I am relativly new to local LLMs and using LM Studio or llama.cpp in most cases.

[-]

Morphon@reddit

I'm using LM Studio, and pushing all the expert weights to the CPU, leaving GPU resources free for context and the routing weights.

On my work PC (AMD 5900XT 64GB DDR4 + RTX 5070 12GB) gets me about 36t/s. Very reasonable for local tasks. If I need up over 100t/s I have to go fully on the GPU, so I miss out on all the domain knowledge that a 35B model can hold.

Doing the same thing with a 120B model (Bigger Qwen 3.5, Nemotron Super) drops speed down to 12-15t/s. Still reasonable when I need the bigger knowledge base.

[-]

GymnasticStick@reddit

9800X3D 32GB DDR5 + RYX 5070 12GB running Qwen3.6 35B at 4Q, 52t/s What Q you run those 36 at?

[-]

Morphon@reddit

Q6K. A little bigger to get the least possible reduction in accuracy.

[-]

Haiku-575@reddit

I'm using LM Studio, trying to fit everything in my 24gb VRAM, but would love to compare the Q4's vs the Q6's. Should I just ask an LLM how to push all the expert weights to the CPU, or can you take a screenshot of the setting?

[-]

Morphon@reddit

Yeah, just push that slider all the way over. You can tweak it later. But that gets it up and running so you can compare the outputs.

[-]

notlongnot@reddit

Ahh thx!

[-]

shanehiltonward@reddit

I use it for vehicle accident forensic reports, Department of Environmental Quality permitting, and ISO-9001 internal audits. It costs... $0.00 so it's hard to write off on taxes. I guess I'll take more potential customers to lunch instead of paying for tokens.

[-]

Dry_Yam_4597@reddit

Define productivity.

My local, self built, ai agent, generates videos, images, reports, pdfs, emails, summaries, searches the web, manages my google drive, calendar, monitors servers, does releases via jenkins, reads grafana logs for issues, sends items via whatsapp, edits pdfs and images, makes music, on my behalf, without me having to open any of these. Does some basic thinking too so I don't have to worry about reading a PDF extracting the text making sense of it then preparing an email.

All local, all self built - except ofc the model. Saves tons of time.

[-]

morbmo@reddit

Definitely getting real productivity out of local models in this range, but I'd add a caveat: the bottleneck for me isn't model size, it's memory management for agent workflows.

I've been running agent memory systems on local setups and the real constraint is context window management. Even with a 7B-13B model that fits comfortably, once you start feeding it conversation history and retrieved context, you hit the wall fast.

My approach has been to use bounded retrieval -- instead of vector top-K which gives unpredictable token counts, I use a tag-graph that fills up to an exact token budget. Lets me fit memory precisely within a 4K-8K window without blowing past it.

For your Macbook decision: if you're doing agent work or anything that needs persistent context, 128GB would be a noticeable step up from 64GB. 32-64GB is fine for chat/inference but gets tight once you layer in memory and tools.

[-]

rinaldo23@reddit

32RAM+12VRAM here. Gemma4 is genuinely useful for overviews and extracting info from private documents. I just used it to check a bunch of invoices, it did a great job finding the dates and amounts and adding them up.

It's not much but it's honest private work.

[-]

muyuu@reddit

There's stuff you can do with models that size, to do with summarisation, voice, etc. For coding they're fairly useless, that's true, just a glorified autocomplete to save some typing of code or comments.

[-]

NicksTechTricks@reddit

I can get simple things done like email triage, image prompts, script drafts or outlines. Im interested in qwen 3.6 27b for coding, but haven't tried it yet. I have 56gb of vram.

[-]

Current-Nectarine923@reddit

64GB is meaningfully different for coding use cases specifically. At 32GB you're realistically running Q4 of 27B-class models. At 64GB you can fit Q6/Q8 of 32-35B models, which matters for complex reasoning and reduces hallucination on larger codebases. You also have headroom for OS + IDE without swapping.

The 32GB ceiling I've noticed in practice: can't hold a large file's full context while generating a meaningful response without context truncation. The 64GB machine can run Qwen3.6 35B at longer context lengths without degradation.

For isolated tasks — quick script drafting, summarizing a single document, explaining a function — 32GB is totally fine. The gap shows up on multi-file refactors, long conversations where you're maintaining context across a big codebase, or when you want to run the model alongside other memory-heavy tools (browser, IDE, Docker).

If you're buying for 2-3 years, 64GB. The models keep getting bigger and context windows keep expanding. 32GB will feel tight within 18 months for serious local inference.

[-]

rgar132@reddit

I’m fortunate to have several Blackwell rigs and can run MiniMax, GLM, and Kimi locally (all excellent models) but to be honest qwen3.6-35ba3b is what I’m running many of the agents on at this point. It’s incredibly good overall, almost boringly good, and for its size it’s mind blowing; it’s gotten through every analytics test case that the others have solved without much struggle and uses the entire complement of 115 tools built out almost flawlessly. So with 32GB I’d just run that.

TLDR - run qwen3.6-35B and send it off to do some work, it’s good.

[-]

DiamondQ2@reddit

Using Gemma 4 on an RTX 5070ti with 16 GB VRAM. Using it to pull apart EPUB books for TTS. Using to determine speaker identification, background noises, special effects, speaker vocal qualities, etc.

It's working pretty good. Needs quite a bit of hand holding, examples, etc. Definitely worse than using Minimax (my other test LLM) but it's free so there is that.

[-]

wuphonsreach@reddit

64GB M3 mac + Qwen MoE 35B A3B models are working fairly well for guided coding tasks. It can go off the rails if I give it too vague of instructions. But for focused tasks it does pretty well.

Will try the 27B model now that it's out.

We've hit the inflection point where the models that fit into 64-128GB of unified memory are good enough to take over many of the coding tasks.

[-]

Ok_Technology_5962@reddit

Get as much as possible. Thats the answer....

[-]

MalabaristaEnFuego@reddit

I'm getting solid productivity out of models running on a Ryzen 5 7235hs, RTX 4050 6GB, and 32 GB RAM. It's all about how you stack and manage your workflow.

[-]

luncheroo@reddit

Depends on the work, too. Qwen 3.5 4b and Gemma 4 e4b are both surprisingly capable I their own ways. Qwen can act as a decent every day basic model with mcp tools and grounding and I've found Gemma 4 e4b to be convincingly sophisticated in prompts like "Identify the top philosophies for human flourishing and apply them as a synthesis to this difficult hypothetical personal problem." I have the runway to use Q8 for both but I think they would probably be decent down to Q4.

[-]

MalabaristaEnFuego@reddit

I use Claude Pro Opus 4.6 Thinking for iteration and engineering of ideas, and creating markdowns for handing off to other models, then use the smaller models for their tasks, producing LaTeX documents for example or writing boiler plate code, then pass that back to the flagship guy for review with my review of everything, then pass it on to the final coding guy with specific instructions in VSCode. Organize it all via project outlines, README, and changelogs in Claude Projects and github. Backup on SD cards and/or external drives.

I spend only $20 per month on Claude Pro, but with Qwen 3.6 and Gemma 4 now, I may be able to work completely SaaS free. Qwen 3 Coder 30b is also severely underrated in the open source community, and people dog on GPT OSS for being 6 months old, but it's still amazing for what it does. If you chuck in a Google API connection, you can do some serious work for essentially pennies on the dollar in terms of costs.

[-]

bigh-aus@reddit

Always get the fastest / largest you can afford. The faster the feedback loop the faster you can learn / get stuff done. AI is only making this more important.

Been playing with Qwen3.6 27b and 35b-a3b on a 24gb 3090 ... it's VERY good at basic rust coding.

[-]

sine120@reddit

I have a 16GB GPU and 64GB RAM. Qwen3.6-35B-A3B fits and is actually useful. 128GB opens up some other 120B options, but the 35B is very usable in 64GB.

[-]

samandiriel@reddit

We're just starting to get good traction with our set up (see this comment of mine and follow up comments) with 48GB VRAM and 128GB DDR5 RAM.

Modern models and a highly tuned set up go a long, long way for productivity and actual usefulness. We're happy with the time/energy/money spent so far.

[-]

tollforturning@reddit

I have 48 GB of RAM. I use to summarize and label pi coding sessions, then have a pi extension that provides session history to agents based on the summaries and labels. It actually works quite well for that. That and then some experiments with gemma models that involve training and probings where a smaller model is actually preferred.

[-]

lightskinloki@reddit

16gb vram 32 ram. Qwen 3.6 is, for my purposes, just as good as haiku 4.6 even up to sonnet 4.5 in some very specific use cases. Additionally, for image and video generation, I dont even bother with cloud models anymore, I'm matching grok imagine with hunyan

[-]

lightskinloki@reddit

I mostly do game development. Sometimes extending into creating writing

[-]

Unlucky_Milk_4323@reddit

I have a 6k token prompt to gen 1200 words weekly. Claude was the only model that could provide good output. Gemma 4 beats it.

[-]

Expert_Bat4612@reddit

It falls short of my needs atm

[-]

n4pst3r3r@reddit

My GPU is a 3090, 24GB VRAM, system RAM is DDR5 6000.

Qwen 3.5 27B Q4 and Qwen3.6 both 35B Q5 and 27B Q4 get real work done in C++, I am using Mistral Vibe as a harness. Not for heavy reasoning of course, but for smaller refactorings they are great. It takes some load off of me, even if it's not much faster than doing it manually. I'm not as drained after work this way.

I really like how fast and still reliable Qwen 27B is with reasoning disabled. And self-speculative decoding makes it even faster. Best case I get like 300t/s decode on mostly empty context. On average somewhere around 25-30.

[-]

chiaplotter4u@reddit

96 GB VRAM in the translation business. Earns extra thousands a month I didn't have a couple of months ago.

[-]

philmarcracken@reddit

I use them to grunt out the code. I use the bigger online models for planning the todo and all the prompts that borderline legalize so theres no miscommunication

[-]

UniForceMusic@reddit

64gb M2 here

Gemma4 for writing, and Qwen 3.6 for coding. More than enough RAM to fit these models, and yss major productivity boost

[-]

t_krett@reddit

I have almost all configuration files linked in a dot file folder. I use qwen 27b to actually write the commit messages for me, using git hunk so that each isolated change gets its own commit. I have a strix halo with 128GB, and a rx7900xtx with 24GB and right now they are about equally useful. The xtx is faster so I usually use that for speed and the strix halo for slow high precision. Also fingers crossed for qwen3.6-122b coming out

[-]

denis-craciun@reddit

I find Qwen3.6:35b-a3b (the coding version, I don’t remember the tag) at 8 bit precision pretty useful. Not at the same levels as opus 3.6 obv, but when I can’t work with opus (for privacy reasons) qwen actually does very good. Sometimes it loses himself it its own thoughts (using pi.dev btw) but eventually it gets there if you direct him well. I am on a MBP 64GB with the m5 pro. It really doesn’t run slow, especially since Ollama released the mlx version of the model I cited, which runs faster than normal on MacBook. And you’d be able to run it even faster dropping Ollama obviously

[-]

claudiollm@reddit

yeah. i run qwen 32b q4 on a 3090 + 64gb ram setup, mostly for code review and paper drafts when i don't want to send research stuff to anthropic. genuinely useful for boilerplate, refactoring, summarizing arxiv pdfs.

honest take: for hard reasoning or novel research questions i still hit claude/gpt. for 'rephrase this paragraph', 'find the bug in this 200 line function', 'draft a related work section from these 8 abstracts', the local model is fine and i'm not burning api credits all day.

128gb on apple silicon gets you into 70b territory which is a different conversation. that's where it stops feeling like a downgrade for most tasks. but 32-64 puts you in solid 27b/32b range and it really depends on whether your workflow tolerates the quality gap

[-]

kweglinski@reddit

over last 7 days I went through ~280M tokens (~40M daily) and that is only because I don't have enough time (using cloud at work). Is it a lot? probably not, but large enough to be important to me. Somewhere around gpt-oss turned from playing around to actual useful (to me). Does all range of things - some coding (I am a developer), some automations, OCR + actions (it's even useful to turn paper event calenards like trash type day into ics), translations, summarization, web search, paperless, monitors github issues in my opensource projects and comes back with summary and potential fix, things like that. Does it make money directly? no. Does it make my life easier? definitely. Totally worth it.

Also I'm a consultant for a massive company and have access to latest cloud etc - even they are fed with heavy quants by anthropic/openai and there are days where qwen 3.6 does tasks significantly better than big guys. Not because it's smarter but because it's always at q8 so it doesn't loose it's marbles.

[-]

JacketHistorical2321@reddit

There are people who can build an entire house with a hammer and there are those who can barely hit a nail. Its the same thing with anything else.

Proficiency matters just as much as the tool

[-]

StardockEngineer@reddit

Qwen 3.6 27B put us over the hump at 32GB. A real monster that can do hard work.

[-]

DeltaSqueezer@reddit

Qwen3.5 9B has been a massive boost for me. It's cut down massively on my API budget as it has replaced a lot. I wrote my own streamlined agentic harness and set it off to do tasks and come back 15 mins later to find a nice report after dozens of tool calls. It's fast and more reliable than API (annoying when you come back to find that the run stopped because server was busy or disconnected).

[-]

Mister__Mediocre@reddit

I'm just playing around with it. I find it cool and I'm confident things I learn now will come very handy someday.

So far, it's about quality of life improvements. Say I have a bunch of documents, I can put them all in a read-only folder and get a local agent to work with these documents and show me the results to view in a website. Fairly easy for an agent, but without an LLM, this would have been a grindingly manual process.

[-]

kiwibonga@reddit

RAM is slow, 128GB isn't that good.

27B can write 100% of my code.

[-]

lqvz@reddit

Yes.

[-]

zRevengee@reddit

I have 2 machines a windows ddr4 128gb + 40gb vram and a Macbook Pro M4 Max 48gb , mostly used for coding and work ( Requirements, test cases, company llm wiki, simulink vision suggestion)

For a decent and fast output even if i can use bigger models Qwen 3.6 27/35a3b and Gemma 31/26a4b is what i use, i prefer those model with more context than a big model with just 32k context, you can put real work done from 128k + context windows.

[-]

NaturalProcessed@reddit

Even if just for ingestion and retrieval, local RAG is very possible with all local or mostly local models that I've run on 32GB of RAM or less. Takes time, but not tons and tons. Depending on the generation task it can make sense to call a more serious model over API, but I don't find it's necessary very often when working with text retrieval.

[-]

logic_prevails@reddit

At least 64 for local imo

[-]

Chupa-Skrull@reddit

Productivity - yes. I do a lot of non-coding text manipulation and they're already extremely good at that and at local PC chore work.

Coding - eh. Most of my bash scripting is taken care of but I'd have to put much more effort into my spec writing to approach the convenience of a major model subscription for now.

[-]

ceo_of_banana@reddit (OP)

I do a lot of non-coding text manipulation

like what?

[-]

Chupa-Skrull@reddit

Mostly document processing (filtering, summary, sorting, refactoring), non-quant research, agenda coordination (mail, calendar, tasks, etc.)

[-]

exact_constraint@reddit

Qwen 27B and 35B A3B are genuinely useful for programming. Set 3.6 27b loose on a codebase last night, fixed 89 bugs while I slept, no regressions. 5 of them had been bugging (lol) the hell out of me for a week, trying to track them down.

As a test, I threw 3.6 35B @ a codebase refactor over the weekend (game engine). It took some handholding, but got it done.

I’m not sure what kind of performance you’ll get out of a Mac, I’m running an R9700. So I can confirm you can get real work done. You’ll have to determine for yourself whether or not the speed will be useful for your workflow.

[-]

Radiant_Condition861@reddit

dual 3090 with nvlink, qwen3.6-27b on vllm, use pi coding agent to create a food app this morning: 2hrs from start to working on my phone.

The app is for hospice patient that is losing their appetite and/or causing symptoms. I create a menu of items offered and what's been eaten to track food intake. Deployed in docker container in my home lab. will beta test it a few weeks, then publish to github or whatever. Works like that last mile 3d printer for parts nobody bothers to make a quantity of 1 for.

For more heavy duty, I have to break things up into steps or components until I get a multi agentic harness going. Many other professional use cases also (eg, can you tell me why I made this project? I forgot)

[-]

Master_Studio_6106@reddit

I'm on 1x RTX 3090. I use Gemma 4 31B for writing assistance and brainstorming; Qwen3.6 27B/35B for light coding tasks. If you pair local models with cloud models (which would be in charge of coming up with PRD/writing step-by-step implementation plan), you can save a lot of API requests.

For private search I always use Gemma 4 31B paired with a searxng server hosted on a raspberry pi.

[-]

sergeialmazov@reddit

Good question. Looks like investors they want their money back and now we observe “Uber” moment. Markets are conquered and now big players will tighten tariff plans and squeeze as much as possible from current user base.

That said - buy as much as possible. You will need it

[-]

Pretend_Engineer5951@reddit

Well fancy Qwen3.6 models are really handy

[-]

One-Mud-1556@reddit

Looking at 16GB of VRAM, just playing and learning of course, my Mac Mini struggles with 9B models, but from last year to this year, local models are getting more and more useful.

[-]

Resident_Bell_4457@reddit

That is what I want to know too. I am planning to buy a big spec M5-6Pro later but I wonder will it be able to substitute my current CC workflow?

[-]

WoodCreakSeagull@reddit

I use big boy models for planning, decomposition, and code review but delegate to 3.6 locally to save tokens on nitty gritty coding, implementation, and little changes I want to make. The small local models can be perfectly suited for those tasks but you always need to account for the limitations.

[-]

thenaquad@reddit

A lighter setup: RTX 4090 24GB with 128GB RAM, running Qwen 3.6 35B A3B UD IQ4_NL (CTX KV q8), supports up to 201 984 tokens at ~130 tokens/s. Previously, I used Gemma 4 26B A4B Q4 K M, Qwen 3.5 9B UD Q8 K XL, and other models.

Daily workflow:

My work is primarily research and prototyping — reproducing papers from arXiv, building small projects (mostly in Python), and doing a lot of data analysis. I sometimes use Go, and occasionally C++, though the latter is more of an exercise in breaking components into separate sessions.

NeoVIM combined with CodeCompanion for snippet generation and Context7 as a documentation MCP has largely replaced my need to look through official docs. This became especially valuable after Claude introduced weekly usage limits.

There is a learning curve — you can't just throw a problem at it and expect "the AI to figure it out" — but once you get the hang of it, it becomes indispensable.

P. S. Wording & English corrected locally ;)

[-]

Sticking_to_Decaf@reddit

At 64gb vram you can run Qwen3.6-27B at FP8 with at least 180k context with kv cache at FP8. With speculative decoding using mtp and token prediction 3, I am getting about 90 tokens per second on a single Pro 6000. 2x 5090s might be faster?

That setup has done some pretty heavy lifting for me on server setup, debugging smaller code bases, research, summary, and analysis.

I added a reranker and embedding model as well as a speech to text model and am running robust local RAG.

I had been fine tuning smaller models on a 4090 for work but am now starting on fine tuning these 20-30B models for custom analysis of data. 64gb VRAM would enable some of that with QLoRA, but 96gb opens more options for LoRA fine tuning.

[-]

Snoo_27681@reddit

They are underwhelming agents compared to Opus but they are good LLM's for use in systematic workflows. I can get Qwen3.6-35B to systematically solve easy and some medium real tasks. But it takes work and there's no silver bullet.

[-]

scottgal2@reddit

Buit 5 customer systems (and a dozen OSS ones) which run on 1b class models so YES...:) You can builod systems which USE models (and optimize for the smaller context / smaller training corpus) but trying to build a 'UX + prompt engineering' app is going to be harder.

[-]

StupidityCanFly@reddit

I have a live SaaS running on Qwen3.5-27B, with LoRAs being trained on Qwen3.6-27B as I write this.

[-]

Ok-Measurement-1575@reddit

Yep... but we ain't using macbooks.

[-]

w00t_loves_you@reddit

progress like this makes me very hopeful about local models

https://github.com/LukeBailey181/sgs

[-]

TripleSecretSquirrel@reddit

I’ve got 24gb VRAM and get lots of use out of local models. I get a lot of emails for my job, so I have an administrative assistant bot that feeds me daily email digests and summaries of individual emails as they come in, which is an easy first pass of which ones I can ignore and which I should read further.

I also have it draft quick responses to some emails that are well-suited to that — e.g., I tell it “go find my most recent email from Jacob, draft a response requesting marketing samples.” I generally still tweak and edit, but I find that having it open and queue up a response takes a lot of the mental friction off of my plate.

[-]

rosstafarien@reddit

I use a 31b model on my M5 Max MBP for coding and for the chat LLM within our privacy-first agentic framework.

Works great.

[-]

Fit_Window_8508@reddit

I run qwen3.5 on a 48gb MacBook and I get use from it. I run local LLMs for routine or cheap workloads to save API costs and latency, and reserve cloud APIs for truly complex or high value tasks.

[-]

sha256md5@reddit

Yes, but it depends on the use case. There's a ton that can be done with very tiny models, but you won't achieve really good agentic coding for example.

[-]

Equivalent_Job_2257@reddit

Qwen models give me real productivity boost in coding since beginning. Qwen3.5 or 3.6 27B is super great.

[-]

Fine_League311@reddit

Lokal alles Spielerei, wenn für LLM Power haben willst brauchst schon ne A100 oder H200 und am Ende sogar günstiger durchs mieten bei HF und co.

[-]

Terminator857@reddit

Summarize meeting transcripts for confidential meetings. Also does simple stuff well. Hard to mess up when you tell it to commit all and push. Does well, but too slow, when you tell it to implement plan created by larger model. Will be fun when we get upgraded strix halo computers. I'm using qwen 3.5 122b Q4.

[-]

NoxinDev@reddit

I've gotten decent sql/python function starting points that come from models that fit in 12gb of vram - as an assistant they still work great and don't need thousands of dollars of vram to be useful - if you are a corporation you should be getting real hardware, but as an individual you don't need SOTA running locally to be productive.

[-]

GMerton@reddit

Spent two full weeks building but nothing in sight yet. But I’m hopeful.