I hate this group but not literally
Posted by No_Run8812@reddit | LocalLLaMA | View on Reddit | 133 comments
True story,
I got interested in AI after seeing it at work and wanted to run models locally. I started with an M3 Ultra 96GB, quickly learned it was not enough for what I wanted, and kept upgrading hardware (including refurbished Mac Studios at 256GB/512GB and now an RTX Pro 6000 that arrived today). I tested many model families (Qwen, DeepSeek, Gemma, Minimax, etc.). My current favorite is MiniMax M2.7 230B/A10B. I’m also waiting for LM Studio support for DeepSeek v4 Flash.
I have mixed feelings: excitement about local speed/bandwidth and sadness about how much money I spent learning this stack. Also funny point: my 16GB MacBook Pro has been more stable than my 512GB setup, which crashed multiple times.
Still, I’m convinced local LLMs are the future, and this community helped me learn a lot. Thank you to everyone here.
Question for the group: For people running high-end local setups, what gave you the biggest real-world stability + speed gains (not just benchmark wins)?
If you want, I can also give you a more technical version focused on benchmarks/specs.
No-Comfortable-2284@reddit
unless you spend 100k+ you won't get the inference capabilities of SOTA models. even then your options are limited by what's open source. you won't be able to run anything at Claude Opus or Codex capability locally. People buy hardware such as the RTX Pro 6000 etc to train on massive datasets, not for local inference. it's a big waste of money not using API keys for inference. the only time it makes sense is for privacy requirements such as clinical or legal contexts. I run a local LLM host for a local medical clinic which they use to summarize doctor notes into headers (saving them a lot of time versus doing it manually), but for everything else they use ChatGPT or other API sources like everyone else.
cointegration@reddit
I'm going the opposite way, trying smaller and smaller models that can do the job satisfactorily. If i need a frontier model i use that, but for local deployments, smaller makes sense. Many small models to do a very specific job, 9B and below. Watching the ternary model space very closely.
No_Run8812@reddit (OP)
I tried 30B models and was disappointed with their tool calling that I didn't dare to go smaller, what do you use the 9B models for? it might help me.
ego100trique@reddit
Qwen 3.5 9b with an MCP/tool to use the web is pretty good for scraping and compiling a bunch of information quickly. I use it on my 7900XT and M1 Pro MacBook and it is really one of the models I prefer because it's just super fast. Tried 3.6 27b with KV quant and that thing is sloooooooow on my hardware, such a pain in the butt.
cointegration@reddit
Semantic chunking, for tool calling small models, use Nemotron 3
darktotheknight@reddit
If only there was a method, to mix different smaller expert models together... /s
Techngro@reddit
So you're splitting/dividing elements of a single task among multiple smaller models? How do you combine the outputs to get the finished task?
Dazzling_Equipment_9@reddit
I'm the same. I use many small models depending on the scenario.
Bromlife@reddit
What kind of tasks are you having them do?
Perfect-Flounder7856@reddit
What 9B are you using?
Silver-Champion-4846@reddit
Me too, primarily for cpu-only inference since I have no gpu
TheRealSol4ra@reddit
Qwen3.6 27b and 35b is literally sonnet 4.5 at home
__JockY__@reddit
Haha same deal. Started in early 2023 with a 3090. Then a pair of 3090. Then 4x. Then 5x. This rig ran Qwen2.5 72B exl2 8-bit at an amazing 70-ish tokens/sec using speculative decoding with Qwen2.5 1.5B.
Upgraded to a pool of 4x 48GB RTX A6000 Ampere. Upgraded again to a pair of RTX 6000 PRO Blackwell. Upgraded again to 4x RTX 6000 PRO.
And now, like you, I run MiniMax-M2.7, although with 4x GPUs I can run it in FP8 with BF16 KV cache.
I'm already eyeing the next big jump: 8x 6kpro to run things like GLM5.1, Kimi2.6, MiMi Pro, DS4 Pro, etc.
running101@reddit
What board do you need for 8x?
__JockY__@reddit
Doesn’t matter, I’d just use my existing H14SSL-N and throw a 100-lane PCIe 5 switch in it, like this one: https://c-payne.com/products/pcie-gen5-mcio-switch-100-lane-microchip-switchtec-pm50100
BadUsername_Numbers@reddit
Not as bad but can relate. Just went from macbook m1 pro w 32GB ram to a macbook m5 max with 128GB ram and... yeah sure, there's a difference. But I have yet to really appreciate it fully, I hope.
my_name_isnt_clever@reddit
Have you tried one of the big ~120b models, like Qwen 3.5 122b? I have the Strix Halo with 128gb and that model is making it pay for itself.
comp21@reddit
Are you using the a10b or the standard 122b model? I have a Corsair 300 with 96gigs GPU ram and the a10b model runs so very slow. I do genetic research with my server and my llama3.3:70b ran the report in about 12 hours but the qwen a10b was still running after 36 hours.
my_name_isnt_clever@reddit
The only Qwen 3.5 122b is the a10b MoE, not sure what you mean. I get 27 t/s baseline on mine, it's not amazing but very usable. I swap to 35b a3b when I need speed.
comp21@reddit
Yes that's the model I mean. The 122b-a10b... Not sure why but it runs very very slow on my home server
Bromlife@reddit
Just curious, how is it making it pay for itself?
my_name_isnt_clever@reddit
Actually I probably misused that phrase. That model is making the purchase worth it to me, it's just a hobby.
Bromlife@reddit
That's cool, wasn't an interrogation. I'm just curious how people are putting their local models to work.
BadUsername_Numbers@reddit
Thanks mate, really appreciate the suggestion. Gonna look into this, cheers!
FoxSideOfTheMoon@reddit
Oof. That’s an expensive upgrade too…
No_Run8812@reddit (OP)
my biggest mistake was the exploration phase, I went from video generation to trying a 450B LLM. Just stick to something very small with a 30B model. It will get complex in no time, and then you can explore bigger models.
Ok_Sprinkles_6998@reddit
I'm just sitting with my 8gb vram not knowing what and when to jump in 🫠
FunkyMuse@reddit
I have 4gb vram if that comforts you 🫣
Candid-Camp-8928@reddit
This is us literally
Zerohero2112@reddit
Can I have a say with 1050 2gb VRAM ...
FullOf_Bad_Ideas@reddit
You can mess with finetuning even on 8GB VRAM and learn a lot about it.
BitGreen1270@reddit
Look at the rich guy here showing off with his 8gb vram. Meanwhile I'm dragging my igpu kicking and screaming through moe models on 32gb ram.
redmctrashface@reddit
Man, on this sub ppl are casually throwing "I had 8 3090, now I got 4 RTX 6000. I think I will have 1 or 2 lol". Meanwhile Im here with 16gb lmao
Imaginary-Unit-3267@reddit
12GB here, you lucky bastard with your extra four.
BitGreen1270@reddit
I'm with you - I'm renting instances on vast.ai and trying my best to optimize for my laptop.
hustla17@reddit
Most relatable comment in this whole sub.
Silver-Champion-4846@reddit
I'm just sitting with my 8gb ram (no v in the beginning) browsing posts and hoping for someone to keep compressing llms until I can run useful models on cpu. So be glad you have 8gb vram, train a small model or something. Or co-design an optimized agentic harness with a frontier LLM that can squeeze the highest performance out of 4b models like qwen
ItsNoahJ83@reddit
What size llms can you run?
Silver-Champion-4846@reddit
3b/4b max. No moes
nervehammer1004@reddit
Similar - running dual GTX 1080Ti’s. They are remarkably capable though, giving good speed and context length with Qwen3.6 35B_A3B. That plus OpenCode has been very reliable for my use case, which is just playing with local models.
mechasquare@reddit
Honestly, the first step is having a use case. I'm in the same boat but decided to jump into the world of trying to do local roleplay with a local hosted LLM. I've learned so much but that first step was the steepest.
gpalmorejr@reddit
8GB would be an improvement for me. I have a GTX1060 6GB. I don't even have tensor cores or INT4/INT8/FP8 support. I run Qwen3.6-35B-A3B using MoE offload, which helps, but I still only get 20 tok/s at Q4. 😭
Zodiexo@reddit
Have you run any local models
beyonceblow@reddit
same! i have a 5060 and a dream lololol
sagiroth@reddit
I am like a golden retriever just always happy to be around and part of whatever is going on with LLMs
segmond@reddit
I see obsession with speed of token generation. But really, it's about speed of prompt processing. If you are doing serious work, then you are going to have a lot of prompts. So really it's total speed of generation, PP + TG. However, I'll extend this further and say that total time of generation is irrelevant. The only speed that matters is the speed to a complete and correct solution. This speed is heavily dependent on your personal skill, and less on your hardware. You can have a high end setup and if you're lazy or an idiot, you will either take too long to get to the right solution or get none at all. Meanwhile, someone with a meager setup who is gritty and resourceful will get to the right solution...
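To make the PP + TG point concrete, here's a rough sketch of end-to-end request time. The speeds and token counts below are made-up illustrative numbers, not measurements from any particular rig:

```python
# End-to-end request time = prompt processing + token generation.
# All numbers are illustrative assumptions, not benchmarks.

def request_time(prompt_tokens: int, output_tokens: int,
                 pp_speed: float, tg_speed: float) -> float:
    """Seconds for one request, given PP and TG throughput in tokens/sec."""
    return prompt_tokens / pp_speed + output_tokens / tg_speed

# A long coding prompt on a box with decent TG but weak PP:
t = request_time(prompt_tokens=30_000, output_tokens=1_000,
                 pp_speed=300, tg_speed=40)
print(f"~{t:.0f} s per request")   # ~125 s, and ~100 s of that is prefill
```

Which is why, for agentic/coding workloads with huge prompts, PP dominates long before TG does.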
Clear-Ad-9312@reddit
recently saw a youtube video of someone pairing a DGX Spark for Prompt Processing with a Mac Studio for Token Generation. The results were interesting, as each one completed their task exceptionally well, along with the fact that once the DGX Spark finished, the next prompt could be processed/queued up while the Mac Studio was still completing the TG of the first prompt. The hardware being dedicated to one side of the task is fascinating especially if you think about how well it would perform for parallel requests. The PP will not slow down due to TG, and vice versa.
segmond@reddit
can you share the video? my bet is that the model they were running was smaller than the memory in each system, which I don't find interesting. can you do it with a model that won't fit in either system, where you have to split the layers across both machines?
Clear-Ad-9312@reddit
https://www.youtube.com/watch?v=D2oZHzC_M28
https://blog.exolabs.net/nvidia-dgx-spark/
They did the reliable older models, but at this point, I wish these damn reviewers would just stop testing on models that are more than a year old...
Not sure what you mean by smaller models, <30B or <= 8B. Either way, the whole setup is incredibly expensive for only 2-4x the performance. As the model size gets larger, the Mac Studio struggles to give good TG, to the point that it roughly matches the DGX Spark.
Even he recognized at the end that the RTX Pro 6000 would make more sense to buy.
Technology keeps moving along, if you strictly care about low power use then the Mac Studio paired with a cheaper NVIDIA GPU to help with PP is better than going for a full GPU setup. The DGX Spark is interesting, but it's too expensive for what it does and does not even support external GPUs. It doesn't have much of an advantage compared to a normal GPU.
Imaginary-Unit-3267@reddit
Yes! Which is why I am spending lots of effort designing tools to help my LLMs edit text better so they can stop making so many goddamn mistakes and mangling files, lol. The upfront cost will pay off over time.
Pretend_Engineer5951@reddit
Well, I'm waking up from this LLM madness. At first I bought a Strix Halo and had much fun, but I wanted a real alternative to Claude or Codex. I expected Qwen 3 Coder Next could handle writing small/medium pet projects. But soon I realized it couldn't. Tried others - no success. All of them narrowed the workflow down to a loop: fix syntax or import mistakes, attempt to build or run tests until "all green". Nothing was like "stop and rethink what you've done". So I assumed these LLMs were just too small, lacking expertise, or too heavily quantized -> need more RAM. So I bought a second Strix. But the game change didn't happen. So I sold it and I'm now rethinking usage with only one machine.
I could've upgraded to more capable hardware, but what for?
Imaginary-Unit-3267@reddit
Did you somehow forget that you have a brain and can work WITH your models rather than leaving them to figure out everything alone? These are tools, not replacements for your own skill.
Pretend_Engineer5951@reddit
I just didn't know what these models were capable of. They wrote something like a solution, but it didn't work at all.
misha1350@reddit
You should use Qwen3.5 122B A10B instead. Or Minimax M2.7 at Q4_K_S or IQ4_K_S.
getstackfax@reddit
This is the part of local LLMs that feels under-discussed.
It’s easy to keep chasing the next hardware jump because each upgrade unlocks something, but the real-world win seems to come from the boring stability layer as much as the raw specs.
The question I’d want answered after every upgrade is:
- did the daily workflow get faster?
- did crashes go down?
- did context handling improve?
- did model switching get simpler?
- did I reduce cloud/API spend?
- did I actually ship more, or just benchmark more?
The 16GB MacBook being more stable than the huge setup is honestly the perfect reminder that “bigger stack” and “better working stack” are not always the same thing.
At some point the best upgrade might be less about another machine and more about a clean runbook: which models for which tasks, what runs where, what the fallback is, and what workload actually deserves the big rig.
No_Run8812@reddit (OP)
you are right, I am very poor with resource management. Going forward I am going to put 3 of them in my success metrics. Thanks for sharing.
getstackfax@reddit
That sounds like the right move honestly…Raw specs are fun, but success metrics keep the setup honest.
Even just tracking 3 things consistently — shipped output, stability/crashes, and cloud/API spend — would probably tell you more than another benchmark score.
peter941221@reddit
I just bought a 5090, which is super expensive in my country, but found local LLMs are so stupid compared to GPT.
No_Run8812@reddit (OP)
you can generate videos using LTX. I don't think anyone here claims they can replace GPT with any amount of RAM, unless you are setting up a datacenter with 2TB of RAM and loading the full version of DeepSeek v4 Pro.
LukeLikesReddit@reddit
Try applying for LM Studio link and then you can basically link all that hardware together, although I only got accepted last night so I haven't had the chance to check what exactly I can do with it.
BustyMeow@reddit
It just lets you run LLMs on one computer remotely.
ea_man@reddit
I think local LLMs are a thing, yet buying hardware for them ain't a thing right about now.
In fact this could be the worst time ever to buy the kind of hw you need for it; it's also the time when this tech is advancing faster than ever, so it ain't the time to throw stupid money at a particular arch paradigm like MoE (big slow unified memory stuff) or dense (small fast GPU).
You may get some new tech that radically changes caches, parallel loads, and model sizes, while today we are buying even 6-year-old hw for outrageous money that could become obsolete the moment a new paradigm comes out.
So decide a budget that's friendly to your curiosity to enjoy the moment and don't overspend for the "final setup".
UniqueIdentifier00@reddit
I hate it but I have to agree with this perspective. Local LLM will progress but this time period feels like I’m buying a windows XP comp with 512mb of RAM for 3-4 weeks of pay. Seems great in the moment.
ea_man@reddit
On the happy side, we've reached the point where we can run useful models like QWEN3.6 on 16GB and even 12GB if you do it right. RDNA2 AMD 6700xt and 6800 can do 20-50 tok/sec, they cost 200-300 used, and can be undervolted easily to <150W, so you can probably run even one more without a new PSU. They are not superstars, no shame in plugging one into a 4x PCIe slot...
techcodes@reddit
Well.. the advantage of owning the devices is the experimentation phase. No need to worry about rate limits, api costs, high usage windows
SnooPaintings8639@reddit
I wonder if you're all rich geniuses, or indebted weirdos. There is a lot of talk about 512GB RAM Macs and RTX 6000s around here lately.
darktotheknight@reddit
For a hobby, 10k USD/EUR is a lot of money and pure luxury. But if you're doing work (e.g. IT), that kind of price tag is justifiable. Often you can deduct it from taxes and get a tax return. E.g. if I bought a card for 10k, I would get nearly 5k back the next year from taxes (yes, I pay a lot of taxes).
FullOf_Bad_Ideas@reddit
I don't have the mindset of selling my stuff. When would you recommend selling 3090 Tis (8) and moving to 5090s or RTX 6000 Pro?
darktotheknight@reddit
I can't give financial advice. I don't know how the prices will be tomorrow or the week after. You have to decide for yourself. I bought the 3090 Ti at 800€ new from a retailer and the current prices are about 1000€ - 1400€ right now. At these prices, I personally (!) think this is the peak and I will sell mine. Again, I could be totally wrong.
My take is: the need for VRAM will not vanish overnight. The memory shortage will not be fixed anytime soon (they estimate 2028, maybe end of 2027 - if nothing special happens). And there are no signs 3090 support will be dropped anytime soon, despite them being 6-year-old cards. At the same time, I can't see RTX 5000 prices falling significantly in the near future.
There might be other factors, too, which you should consider, such as power consumption, memory bandwidth, warranty, or tax return. I think 8x 3090 (= 192GB VRAM, ~9000€) should buy you 1x RTX Pro 6000 (96GB, ~9000€). You would get half the VRAM, but also a much simpler setup (1 vs 8 cards), faster inference, and a much more modern card (better resale value in the future). When selling, it also matters if you have leftover warranty, so don't hold any cards too long or you might lose resale value (some manufacturers only offer limited warranty to the original buyer). Of course, selling an RTX 3090 or even an RTX 5090 will be much easier than a 9000€ RTX Pro 6000, so keep that in mind.
You also don't have to go the "all or nothing" route; you can sell only a portion of your cards, if you want. But that really depends on your workloads, what you want to do, and what benefits you expect from the newer generations. That being said, afaik the 3090 is still one of the sweet spots for a CUDA card in terms of €/VRAM.
mc_nu1ll@reddit
"buying pc hardware is like buying gold" and this is why we can't afford DDR5 RAM, kids
brickout@reddit
Seriously. I've never owned a car that was as expensive as half the setups i see here. I feel lucky that i scored a laptop with 32GB ram before prices went crazy...
super1701@reddit
I did a "mid size" build, and all my IRL friends called me stupid (for spending money on something that wouldn't make me money). It has been a wonderful experience, and it's crazy seeing the world of LocalLLMs progress. Shout out to all the providers/contributors who make this possible <3.
Fit-Statistician8636@reddit
What about genius weirdos? :) It is about priorities. 4x RTX 6000 are not more expensive than a regular new car, and cars sell like hotcakes.
boutell@reddit
Some people really do need a car 😀
Fit-Statistician8636@reddit
Sure 😀. And some people really do need to run AI locally. As I said - it is about priorities.
Weekest_links@reddit
Where are people seeing 512??
__JockY__@reddit
Never get into debt for depreciating assets!
SnooPaintings8639@reddit
Damn, I wish GPUs were depreciating 😭
larrytheevilbunnie@reddit
I’m betting there’s a lot of SWEs here, who earn enough to be able to burn like 10k without much issue
Bromlife@reddit
I do wonder how anyone can possibly justify these rigs for personal uses.
somerussianbear@reddit
Until yesterday my plan was to buy a Mac Studio M5 Ultra 1TB on release day. Then I calculated my daily token consumption and checked the API price of DeepSeek (even not discounted). I laughed so freaking hard alone at home. I spend an average of 2 to 3 dollars a day to work 8h. That’s $60 a month, $720 a year. That Mac Studio wouldn’t be less than 15 grand. That’s 20 years of API down on a computer.
Most of us don’t need to buy hardware to run stuff. If it happens to run on our local existing hardware (no extra investment) good, if it doesn’t, API.
Even with the 15 grand hardware I wouldn’t be able to run DeepSeek v4 Pro locally (or would, but it would be slow AF!), and v4 Flash wouldn’t be as flashy as it is on the API.
I think most people think API cost is high due to insane Opus and GPT SOTAs, when token cost on the pay for play APIs is incredibly low.
All I can say is: do the math before buying stuff. If you can prove that you’ll profit in less than 2 years, then ping me cause I’m really curious to see your workflow and learn how to use more tokens and get more value out of it. I think most people have no clue that they’re just wasting GPU cycles.
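If you want to rerun that break-even math yourself, here's a minimal sketch using the rough numbers from the comment above (estimates, not quotes; it also ignores electricity, resale value, and any privacy requirements that might force local inference):

```python
# Break-even estimate: buying local hardware vs. paying per token via API.
# All figures are the rough estimates from the comment above, not real quotes.
hardware_cost = 15_000      # hypothetical Mac Studio M5 Ultra build, USD
monthly_api_cost = 60       # ~$2-3/day of API usage over 8h workdays

years_to_break_even = hardware_cost / (monthly_api_cost * 12)
print(f"~{years_to_break_even:.1f} years to break even")   # ~20.8 years
```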
running101@reddit
They are increasing the costs at the providers. Won't stay cheap for long.
somerussianbear@reddit
I believe they’re increasing the price of subscriptions (all you can eat buffet), but the token cost is not increasing if you consider the level of intelligence available on the market.
In other words, GPT and Opus are indeed higher than ever per million tokens, but the intelligence of current open source models, which reflects American SOTAs from 6 months ago, is pennies on the dollar.
The reality is that SOTA from a year ago is more than enough for our normal use (and by normal I mean SWE work). People think they need Opus 4.7 to write a couple of tests.
No_Run8812@reddit (OP)
Not looking for any profit, I would have doubled the money by investing in AI infra stocks. It's for learning a new skill; I am pretty sure I won't have my job anymore because of these APIs.
Thebandroid@reddit
The future is not locally run frontier models that can do everything.
The future is lightweight routing, specialised models, and smart decisions about which one to use for a specific task, and then you spend a few dollars on a frontier model API for the really complex tasks.
FullOf_Bad_Ideas@reddit
Deepseek was using small specialized models for handling some web search tool calls. They are now using the main model for it - KV cache reuse saves more money than doing prefill + decode on the big model. So routing works either at low context, or when you can plug the KV cache into a different model - but even a LoRA swap shouldn't have its KV cache reused.
running101@reddit
Agreed
anitamaxwynnn69@reddit
Looks like I'm down the same path. Started with nothing. Thought I'd get a 5060ti 16gb to get started since I wanted to "game" too. Got another 5060ti because 16GB of VRAM wasn't enough. And now I'm trying to convince myself to go bigger... but 3090 prices are just brutal.
rmhubbert@reddit
Ha, exact same story for me. Now up to 8 3090s in a dedicated Epyc system, which works brilliantly, but aye, those costs add up.
I bought my first second hand 3090 in January for just over £500. Bought my eighth recently for £850.
FullOf_Bad_Ideas@reddit
You are similar time-wise to me and my setup. I bought my first 3090 Ti somewhere in 2023 or 2024. The second one came around 9 months ago. During the winter holidays I got into it and bought 6 from December to February, and now I'm thinking about getting a 9th.
see_spot_ruminate@reddit
If you keep going with 5060ti, I think the sweeeet spot is 4. Anything more and the cost and setup become too outrageous.
fizzy1242@reddit
hahah yup, once you get a 2nd gpu, you'll want a 3rd one. that's when you realize you've fallen into the rabbit hole and there's no getting out
No_Run8812@reddit (OP)
yeah, RAM is the new memory card of the Nokia era.
reefine@reddit
How I feel about every one of my regular subreddits
WTF3rr0r@reddit
I can’t even test a 256 Mac Studio to check if if can do something useful actually
qfox337@reddit
You're not wrong that the large models are better, and depending on your use case, AI/LLMs are overhyped even in more reasonable groups like this one.
You can actually go sell the hardware that didn't work out though!
MaruluVR@reddit
Honestly the fun starts when you get into fine-tuning and continual pre-training to make models do exactly what you want them to, correctly. You don't need a 240B model if a custom-trained one at 30B or even smaller outperforms it at your needed task.
No_Run8812@reddit (OP)
I want to fine tune a model, what was your experience, which model did you fine tune? Did you fine tune on your own dataset?
MaruluVR@reddit
I started by using datasets that were out there on Hugging Face for things like improving languages, roleplaying etc. Once I got those working I started looking at CPT with ebooks to further improve language support. Now I am looking at custom datasets for a game I am working on where every NPC is AI, so they roleplay with custom lore slang while using tools specific to the game.
To start, avoid MoE models and try Unsloth's fine-tuning UI they just released; when you get comfortable you can always try Axolotl. First test your fine-tune on the smallest model of the family you are looking at and do a small test run. If everything works you will know without waiting days, and then you can start looking at the bigger models.
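If it helps to picture what that small sanity-check run looks like, here's a minimal QLoRA sketch in Python with Unsloth + TRL. The model name, dataset path, and hyperparameters are placeholders, and exact arguments shift between Unsloth/TRL versions, so treat it as a starting point rather than a recipe:

```python
# Minimal QLoRA fine-tuning sketch with Unsloth + TRL (a sanity-check run).
# Model name, dataset path and hyperparameters are placeholders; start with the
# smallest model in the family and a tiny max_steps to validate the pipeline.
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-0.5B-Instruct",  # assumed HF repo id
    max_seq_length=2048,
    load_in_4bit=True,                           # keeps it inside ~8GB VRAM
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="my_dataset.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",      # assumes each row has a "text" column
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=50,                   # tiny test run before the real one
        output_dir="lora_test",
    ),
)
trainer.train()
```

Once the small run trains cleanly and the loss actually moves, swap in the bigger model and the full dataset.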
swagonflyyyy@reddit
I just use whatever multimodal model makes my personal assistant useful and fun to talk to.
D2OQZG8l5BI1S06@reddit
Bro you even copied the follow-up question GPT suggested to you...
No_Run8812@reddit (OP)
Yes my bad, I asked it to look for spelling errors 😂
DeepOrangeSky@reddit
In regards to the 512GB M3 Ultra not being as stable as the Mac mini and crashing numerous times, is this likely more to do with it having 512GB of memory, or more to do with it having an interconnected double-chip design, do you think? For example, did the smaller M3 Ultras you tried (the 256GB and the 96GB ones) also seem similarly unstable?
I have an M4 Max Studio with 128GB memory, and so far it has been good, but I am curious if instability issues for the Mac studios tends to be more to do with how much total memory it has, or something to do with double-chip vs single-chip designs, or something to do with M3 vs M4, or what. Like, am I probably "safe" with the one I have? Or how does it work?
Miserable-Dare5090@reddit
I used to think my dual DGX Sparks were a waste of money, and that the Strix Halo and M2 Ultra machines were not. Now I can tell you I feel the complete opposite.
PaceZealousideal6091@reddit
Well, my take is it's better to go with a multi-LLM (small to mid-size) setup using a well-oiled harness that takes advantage of the specialized capabilities of each of those LLMs, rather than trying to brute-force everything with one gigantic LLM. I think this is where things are heading. I don't think there is a need to keep going for eye-wateringly crazy hardware. My expectation is that 64 GB of VRAM might be a sweet spot for everything local if we optimize everything. I can already see things aligning for this to happen.
ego100trique@reddit
I hope 20GB of VRAM will be enough once we figure out how to do lossless compression for bigger models
PaceZealousideal6091@reddit
20 should be enough for most purposes. 64 GB should be perfect for having multiple models loaded at the same time with ample context.
bnightstars@reddit
Consider the fact that there are companies that run 2B/4B models for their workflows on CPU only (those old Xeon servers), and here we are complaining about RTX Pro 6000s. From a practical perspective we need to figure out what our workflows are and stick with them. Personally, anything that is Sonnet 4.5 level at coding is good enough for me.
No_Run8812@reddit (OP)
tool calling is a very important part of my workflow; what do you achieve with 4B models? I fail to understand.
SettingAgile9080@reddit
Simple classification, sorting, extraction within a constrained use case work very well on small models.
Couple of real examples I use small models for in production:
* If you need to work out if a user's message includes any discussion of a time period, you can run it through a programmatic extractor that catches 80% of use cases and have it fall back to a small LLM model that takes text like "compare this month to the time around Black Friday 2 years ago" and extracts it to a structured date object.
* Sentiment analysis - "score whether this text message is negative or positive from 1 to -1".
* "Take this database table schema and write a short description of what each column is for"
However there's significant overlap with what is better done programmatically or with classical ML models. The advantage is it's quick to implement/change and can be done with a small number of prior examples.
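A hedged sketch of that first pattern (cheap programmatic pass first, small local model as fallback). The endpoint, port, and model name are placeholders for whatever OpenAI-compatible local server you happen to run (llama.cpp, LM Studio, etc.):

```python
# Date-period extraction: regex pass first, small local LLM as fallback.
# Endpoint and model name are placeholders for a local OpenAI-compatible server.
import json
import re

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_period(text: str) -> dict:
    match = ISO_DATE.search(text)
    if match:                      # explicit ISO date: no model call needed
        return {"start": match.group(0), "end": match.group(0)}
    # Fuzzy phrasing falls back to a small instruct model.
    resp = client.chat.completions.create(
        model="qwen2.5-3b-instruct",   # placeholder small model
        temperature=0,
        messages=[{
            "role": "user",
            "content": ('Extract the time period from this text as JSON '
                        '{"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}. '
                        "Reply with JSON only.\n" + text),
        }],
    )
    # Real code would validate/repair the JSON before trusting it.
    return json.loads(resp.choices[0].message.content)

print(extract_period("compare this month to the time around Black Friday 2 years ago"))
```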
my_name_isnt_clever@reddit
Using really small models is a tradeoff: you work around their restrictions in exchange for speed and lower compute requirements.
They could be used as a summarization tool called by a larger and slower model, for example. I saw someone here who regularly uses Qwen 3.5 0.8b for writing long files, orchestrated by a smarter model.
bnightstars@reddit
It really depends on the use case. I have a friend working as a data scientist in a small logistics company and they use 2B/4B models on CPU to process documents for logistics. I guess they don't need as much tool calling as a regular coding agent does.
Important_Quote_1180@reddit
Small models doing an extremely well defined repetitive task is not going to get any sexy points. It’s not going to beat GPT 5.4 in a 60 step tool call heavy super prompt. It’s going to work every time and will save money. It can run in ram and you won’t care.
All that said, I’m loving my 27b on 3090. 256k context, tool calling, vision, 40-60 toks. Or 125k context vision and 80toks. All on a single 3090. Happy to share specs but you can check the source and you can just point CC or Codex at it. https://github.com/noonghunna/club-3090/blob/0df8f743192809dbdcda942887b625b0f48699f2/docs/CLIFFS.md
Silver-Champion-4846@reddit
Would 5090 run full precision qwen3.6 27b? Or is fp8 enough?
No_Run8812@reddit (OP)
27B at full precision would be around 54 gigs, plus you need room for KV cache. fp8 maybe.
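Rough back-of-the-envelope in Python, in case the arithmetic helps. This is weights only; the real KV cache cost depends on layer count, head config, and context length:

```python
# Approximate weight memory for a dense 27B model at different precisions.
params = 27e9
bytes_per_weight = {"bf16": 2.0, "fp8": 1.0, "q4": 0.5}

for fmt, b in bytes_per_weight.items():
    print(f"{fmt}: ~{params * b / 1e9:.0f} GB of weights")
# bf16: ~54 GB, fp8: ~27 GB, q4: ~14 GB (weights only, before KV cache
# and runtime overhead).
```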
Silver-Champion-4846@reddit
Thanks
Important_Quote_1180@reddit
Likely but with limited context window. Fp8 is virtually lossless and will give you a full context window of 256k I bet
Silver-Champion-4846@reddit
Interesting. Thanks.
Ell2509@reddit
Did you keep all the devices? If so, what is your stack like now?
BannedGoNext@reddit
I have a strix halo. It's for testing and learning. I'll upgrade when there is clear ROI.
RagingAnemone@reddit
Hi, I'm RagingAnemone and I run local LLMs.
I have an M3 Ultra 256 and a Strix Halo. ROCm is kinda unstable, but it used to be very unstable. It's getting better. But I can feel the 275GB/s memory bandwidth vs the Mac. I'm thinking about building a big box with 1TB+ RAM and trying to do CPU inferencing, as long as I can get 12-channel memory.
my_name_isnt_clever@reddit
Are we AA now? Or LA, Localllama Anonymous?
I use Vulkan on my Strix Halo, it's so much more stable. The PP speed is worse, but with some benchmarks on the batch sizes I've been able to speed it up quite a bit.
Silver-Champion-4846@reddit
What cpu, threadripper?
my_name_isnt_clever@reddit
I used llama-bench to find the best batch sizes for my hardware and it's helped way more than I thought, I get +50% pp speeds for free.
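For anyone wanting to try the same thing, a minimal sketch of that kind of sweep. The model path and candidate sizes are placeholders, and the flag names are the ones I believe llama-bench uses, so double-check against your build:

```python
# Sweep llama-bench over batch/ubatch sizes to find the best prompt-processing
# throughput on this hardware. Model path and candidate sizes are placeholders.
import itertools
import subprocess

MODEL = "models/qwen3-30b-a3b-q4_k_m.gguf"   # placeholder model path

for b, ub in itertools.product([512, 1024, 2048], [128, 256, 512]):
    result = subprocess.run(
        ["llama-bench", "-m", MODEL,
         "-p", "2048", "-n", "0",            # measure prompt processing only
         "-b", str(b), "-ub", str(ub)],
        capture_output=True, text=True,
    )
    print(f"batch={b} ubatch={ub}\n{result.stdout}")
```

Then just reuse the winning batch/ubatch values when launching your server.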
AldebaranBefore@reddit
You can chase speed or you can chase stability. If you want it stable, find the simplest setup you can and don’t f^<k with it.
jacek2023@reddit
This sub is filled with 8GB people hyping DeepSeek models
silenceimpaired@reddit
This sub is filled with bots hyping models on API. Fixed it for you. :P
Long_comment_san@reddit
this revelation struck them with the force of a physical blow.
Silver-Champion-4846@reddit
You are absolutely right – you've hit the nail on the head. It's not just foolish–it's insane! Next time you see a 256GB setup being described as worthless, remember the peaceful old days.
Kodix@reddit
> (including refurbished Mac Studios at 256GB/512GB and now an RTX Pro 6000 that arrived today)
I hate you and literally.
Silver-Champion-4846@reddit
Seriously, imagine the thousands of concurrent experiments you could do training a bunch of 20M-param TTS models to find the best, most efficient architecture with 256GB of RAM, or 512! And these people are dropping them like flies!
TraditionalAd8415@reddit
yes, I am interested