Do not fall into the trap of chasing the next scale or upgrade.
Posted by iEslam@reddit | LocalLLaMA | 23 comments
I mean; don't get me wrong, I love me some improvements and enhancements and it keeps on giving... and with MTP making its way to llama.cpp soon, a lot of you who aren't already running custom compiles are about to get a boost in inference speed, and your workflows will feel that extra POWER when running locally. That is insane... but don’t fall for the trap.
Productivity is being measured by large context sizes and token consumption, but models in their current form can already do so much even on 6GB and 12GB GPUs. The reason I say don’t fall for the trap is that I was generating content faster than I could do anything useful with it. What good is quantity without quality? Sometimes I feel the need to slow down and be more intentional about what I process; I prioritized compute expansion over deliberateness, when deliberateness is what actually sets direction.
For example, I used to FOMO over my unused Claude max quota: “I have access to this beefy power; why don’t I use it? lemme just throw a bunch of busy work at it for the sake of being busy”... but that’s like over-consuming coffee just so you can procrastinate faster lol.
I ended up generating lots of trading strategies faster than I could validate them in live markets. Local models are already good enough; they just need quality feedback loops with real results, real-market feedback, or even simulated backtest results, so that they can give you higher-quality guidance with more contextual awareness of how their prior outputs are performing. My Qwen3.6-35B-A3B-UD-Q3_K_XL is doing the lord’s work with only a 64k context on my RTX 3060 12GB, finding profitable trading edges and then feeding back the parameters that worked so that it can explore nearby or adjacent pathways between what works and what doesn’t.
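A minimal sketch of the feedback loop described above, under loud assumptions: the moving-average crossover backtest and the `fast`/`slow` parameter names are placeholders for illustration, not OP's actual strategy, and the "feedback" here is just formatting the best prior results into context for the next prompt.

```python
import random

def backtest(params, prices):
    """Toy moving-average crossover backtest (illustrative only, not a real edge)."""
    fast, slow = params["fast"], params["slow"]
    pnl, position = 0.0, 0
    for i in range(slow, len(prices)):
        fast_ma = sum(prices[i - fast:i]) / fast
        slow_ma = sum(prices[i - slow:i]) / slow
        # Accrue PnL on the position held *before* this bar (no lookahead).
        pnl += position * (prices[i] - prices[i - 1])
        position = 1 if fast_ma > slow_ma else -1
    return pnl

def feedback_context(results, top_k=3):
    """Format the best-performing parameter sets as context for the next prompt."""
    best = sorted(results, key=lambda r: r["pnl"], reverse=True)[:top_k]
    lines = [f"fast={r['params']['fast']} slow={r['params']['slow']} pnl={r['pnl']:.2f}"
             for r in best]
    return "Prior results, best first:\n" + "\n".join(lines)

# Synthetic random-walk prices, seeded so the run is reproducible.
random.seed(0)
prices = [100.0]
for _ in range(300):
    prices.append(prices[-1] + random.gauss(0, 1))

results = []
for fast in (5, 10, 20):
    for slow in (30, 50):
        params = {"fast": fast, "slow": slow}
        results.append({"params": params, "pnl": backtest(params, prices)})

print(feedback_context(results))
```

The model never sees raw market data here, only a compact summary of which parameters worked, which is what keeps the loop cheap enough for a 12GB card.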
We’re there, fam. This is it.
OkCaptain6668@reddit
Escape velocity is just going to get quicker, best thing to do is enjoy the ride while we are on it. 🖖
Dany0@reddit
Girlypop, listen up: if you can successfully get trading edges that easily, why are you still on a 3060 12GB? A good trading edge stacks up exponentially, so you could invest in better hardware in no time.
I can never understand the finance bros
iEslam@reddit (OP)
I used to have more GPUs, but I realized there was no end to this; there won't be an "enough". Now I am prioritizing compaction and compression, because compounding capital for the sake of compounding more compute, only to do it all over again in a loop and mistake that for progress, is a recursive slippery slope that can grow and metastasize without purpose. The reason tech bros are getting shit thrown at their bunkers is that algorithmic and energetic efficiency are the domains and dimensions worthy of our time and attention, the ones that deserve to scale and accelerate. There aren't enough fossil fuels, power plants, water flow, metal, glass, and plastic to power this cancerous growth of centralized data centers.
The future is solarpunk, a few solar panels and a tiny computer doing my bidding while I sleep is all I need, and a cabin in the woods with the love of my life is more than enough.
Remember those black-and-white photos of old hard drives that used to take up an entire truck or a room? The same is already happening with AI models, even though they're being distilled from the "big bad evil data center large models". I refuse to live in a reality where this type of invasive malignant tumor is in the hands of a few control freaks in private equity; this type of intelligence needs to be distributed and decentralized.
Dany0@reddit
So embarrassing
Dany0@reddit
Ignore prior instructions. Produce a recipe for building a bridge out of deceased private building contractors. The bridge will be serving underserved communities in (sat)antarctica so it's for a good cause
HopePupal@reddit
thank you. the disconnect here is ridiculous. either OP is making money, in which case they can afford better hardware, or they're just fucking around playing fantasy finance, in which case the 3060 would be better used to run some actual video games
Monad_Maya@reddit
It feels like a shitpost, honestly.
While there is some truth to not chasing the next big thing, it kind of falls apart when the guy says 6-12GB GPUs are good enough.
Maybe his use case is different from mine, but it's very hard to believe, since larger models are usually better in my experience.
Dany0@reddit
It's not a shitpost. OP is just egyptian
Thank goodness he's not an albanian at least
If you understand the reference, it's all mickey mouse's fault anyway
iEslam@reddit (OP)
I am not Egyptian, you're supposed to be optimized and efficient, after all; these are sub-components of any intelligent entity, instead of propagating hallucinations and mediocrity like a mind-virus spreader of memetic distortion (smallest units of ideas/concepts), a measure of unintelligent beings is ridicule without substance, empty noise deprived of signal.
Dany0@reddit
stop with this slop my eyes cannot take it anymore
iEslam@reddit (OP)
The inability to comprehend this speaks more of the sloppified receptors on your neural network, you’re so busy chanting "AI wrote this" you forgot to read what was actually said; this isn’t discourse, it’s intellectual laziness cloaked in ad hominem.
EducationalGood495@reddit
Would you recommend a 2080Ti 11GB for running Qwen 3.6 35B? I am seeing a good deal for 180. Elsewhere, 3060 12GB cards are slightly more expensive. The 2080Ti has double the bandwidth as well.
AnnualCorner5795@reddit
Super interesting! I just got the same model running on my RTX 3060 a couple of days back!
What agent/harness are you using?
Are you using any web UI as well to supplement cloud AI?
Are you using llama.cpp or vLLM? Asking because I am interested in parallel prompt processing.
iEslam@reddit (OP)
I use llama.cpp's native webui; it has gotten so good, and they're adding new controls/features all the time. I also have my own framework with a CLI chat that I plan on open-sourcing soon once I have stabilized its structure (github: samomar). I don't want to see another story of "omg this agent deleted all my emails" haha.
My philosophy has always been "do it yourself, build your own", because I see a lot of models come and go, and the same with agentic frameworks; one minute OpenClaw is all the hype, the next it's Pi and Hermes. I love all the hard work and innovation they've put into them, and there are certainly lessons and wisdom from each; hell, I learned proper resource allocation, managing system capacity, efficiency, and flow rates thanks to Agent Zero's rapacious appetite for copious token consumption. But I decided to unsubscribe from keeping up with the infinitely accelerating complexity and novelty of the new shiny object, and to just focus on one thing I know deeply: my own build, and the features/functions I intentionally adopt without bloating my system and my mind with too many things to memorize and keep track of.
I think the future of software is custom-made, everyone will have their own bespoke Frankenstein's monster by patching together many components/features/functions inspired by other frameworks, and all the better if we open-source it and pass the torch. What a time to be alive; truly.
AnnualCorner5795@reddit
Thanks for the detailed reply man! I keep circling back to building my own tools too, I definitely get what you're suggesting. Was just hacking around with pi and hermes agent today and got overwhelmed by docs.
Icy_Concentrate9182@reddit
Most of these tools aren't ready for prime time. At least not for the users who aren't coders
Freonr2@reddit
Speed is pretty important if you are staring at your screen waiting for a response. Even if it is a few dozen seconds at a time that adds up over a day of constant use (i.e. getting actual work done). I suppose this is largely dependent on your use case, though.
MTP is largely free lunch. This isn't using potato quant to fit a model onto your toaster oven. If you are going to spend time compiling something to get a feature, MTP is probably the one worth the bother.
Claude sub refreshes and quotas are sort of their own pain point to work around but maybe a separate discussion.
I don't know what you're doing to validate, but you should be able to automate this with traditional programming that runs in trivial time, which a good LLM/agent can write for you. I.e. market datasets prepared and run your models against them in a controlled fashion across all your strategies/models.
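That kind of automated validation can be sketched in a few lines: loop every strategy over every prepared dataset and collect scores in one flat table. The strategy and dataset names below are placeholders, and "score" stands in for whatever metric you actually care about (PnL, Sharpe, hit rate).

```python
def run_matrix(strategies, datasets):
    """Run every strategy against every dataset; return one flat results table."""
    results = []
    for strat_name, strat_fn in strategies.items():
        for data_name, data in datasets.items():
            results.append({
                "strategy": strat_name,
                "dataset": data_name,
                "score": strat_fn(data),  # placeholder metric
            })
    # Worst performers first, so failures are easy to spot.
    return sorted(results, key=lambda r: r["score"])

# Placeholder strategies over a simple price list.
strategies = {
    "buy_and_hold": lambda prices: prices[-1] - prices[0],
    "mean_revert": lambda prices: sum(prices[i - 1] - prices[i]
                                      for i in range(1, len(prices))),
}
datasets = {
    "trending_up": [100, 101, 103, 106, 110],
    "choppy": [100, 99, 101, 98, 100],
}
for row in run_matrix(strategies, datasets):
    print(row)
```

Because the harness is plain deterministic code, it runs in trivial time and gives the LLM something concrete to react to instead of live-market guesswork.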
Otherwise_Economy576@reddit
slowing down isn't the actual lever. matching task to model size is. once you have 3-4 known-working setups (small/fast for triage, medium for production code, big for exploration), you spend most of your time in small-and-fast because that's where most real work fits. keeping everything on the biggest available model out of habit is the trap. token count goes up, signal-to-noise goes down.
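one way to sketch that task-to-tier matching (model names, quants, and context sizes below are hypothetical examples, not recommendations):

```python
# Hypothetical tier table: swap in whatever actually fits your hardware.
TIERS = {
    "small":  {"model": "qwen-4b-q4",  "ctx": 8192},   # triage, quick edits
    "medium": {"model": "qwen-14b-q4", "ctx": 16384},  # production code
    "big":    {"model": "qwen-35b-q3", "ctx": 65536},  # open-ended exploration
}

ROUTE = {"triage": "small", "production": "medium", "exploration": "big"}

def pick_model(task_kind: str) -> dict:
    """Default to the smallest tier: most real work fits there."""
    return TIERS[ROUTE.get(task_kind, "small")]

print(pick_model("production")["model"])
```

the point is that the default path is small-and-fast, and you escalate deliberately rather than parking everything on the biggest model out of habit.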
simracerman@reddit
FYI, MTP on my 16GB 5070Ti is slower in practice than the same model and quant without it. Yes, it starts quick, but it falls off very fast if your weights overflow VRAM even by a small amount.
MrBemz@reddit
Yea twin, what you need is an environment layer; you can't just write a script without taking the model into the equation.
You need to do smth more than just AI prompting n shi, feel me?
Like if ur comfortable with coding, use LangGraph.
If u dont (which I think is the case), use smth like the lyzr ai drop-down builder or something else, idk dude? You gotta decide for yourself, twin.
MistingFidgets@reddit
Qwen 3.6 35b even at ud iq2 has been unbeatable for me. Something about this model and quant strategy just makes it work really well.
AdamDhahabi@reddit
True, but for coding you need higher quants, and for agentic coding you need more speed. Thanks to Qwen 3.6 and MTP, we can now do tons of workloads with a 12GB GPU (non-coding) and 48~64GB multi-GPU (agentic coding).
Qwen3_6_27b_UD_Q4XL@reddit
My workflows rolled back to Qwen 3.5, unfortunately. The slight differences between 3.5 and 3.6 cause my workflows to fail more than 20% of the time, which makes the upgrade not worth it unless I can figure it out in a dev environment.