Local AI is the best
Posted by fake_agent_smith@reddit | LocalLLaMA | 50 comments
Funny image, but also I'd like to add that I love how much freedom and honesty I can fine-tune the model for. No glazing, no censorship, no data harvesting. I can discuss and analyze personal stuff with peace of mind knowing that it stays in my home. I'm eternally grateful to the llama.cpp developers, everyone involved in open-weight model development, and everyone else involved in these tools.
artisticMink@reddit
I want to see the reasoning so bad.
RebouncedCat@reddit
llama.cpp is goated
Limp_Classroom_2645@reddit
So good
Kerem-6030@reddit
it's the base of almost every other local AI app (like LM Studio, Ollama, etc.)
united_we_ride@reddit
No idea why you're being downvoted. I've heard there's a little more going on under Ollama's hood now, but llama.cpp was originally the full backend.
Same goes for LM Studio, Jan, etc. People are weird.
Kerem-6030@reddit
i don't know, they always do it to me
Gunch_@reddit
I've noticed a correlation between people who use emojis and people that constantly get downvoted
somersetyellow@reddit
Redditors are absolute chad Linux kernel programmers who grew up using pure ASCII to communicate.
We do not use the emoji and shun the normies who do =_=
Foreign-Beginning-49@reddit
I've also noticed high effort posts that are definitely not ai getting downvoted because "ai". It's a scary world where too much effort is penalized out of suspicion it's ai. What's this gonna do to Homo sapiens? Cray cray future for sure.
Webfarer@reddit
Just out of curiosity, what base model do you use? And what hardware?
fake_agent_smith@reddit (OP)
Currently Gemma 4. I tend to stay in the 20-35B range, because my hardware is just an RX 9070 XT and 64GB DDR5. For convos and simple issues, most of the time I use MoE variants for speed.
camracks@reddit
"Just"
LienniTa@reddit
stuff like 4x3090 are pretty common in this sub tbh
camracks@reddit
I know but those are like the top posts in the sub, on average I don't think most people doing Local LLM have 4 3090s, a lot of people don't even have a graphics card at all and are limited to APIs, but you're not going to see a post of someone flexing their $100 in API usage lmao
fake_agent_smith@reddit (OP)
Well, people in this sub often share amazing specs with multi-GPU setups or other solutions capable of running GLM-5.1 and other amazing models (at least something 100B+). So in comparison my hardware is rather humble.
Force88@reddit
Yep, people here are really dedicated to LLMs. Compared to their A6000 or 4x 5090 rigs, my 2x 5060 Ti 16GB is just child's play.
Btw, is llama.cpp hard to use? I'm a newbie and only just learnt to use Ollama due to how easy it is.
rainbyte@reddit
I think llama.cpp is easier if you interact with the community, because you can share the exact command you are running, and other users can suggest adding or removing options.
Syntax is literally: llama-server -m model.gguf --option-a value-a --option-b value-b
Give it a try!
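For example, a minimal command might look like this (the model filename is just a placeholder, swap in whatever GGUF you downloaded):
llama-server -m ~/models/my-model-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080
-c sets the context size, -ngl is how many layers to offload to the GPU, and once it's running you open http://localhost:8080 in the browser for the built-in web UI.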
Devatator_@reddit
No, you basically just run llama-server with whatever model you want and open the URL that shows up
True-Lychee@reddit
Try LMStudio if you want a user-friendly UI for llama.cpp
Far-Low-4705@reddit
i got two AMD MI50s for 64GB of VRAM for like $200-300ish, but i sold old hardware that i got for free to offset the cost, so im very happy with my purchase
Just sucks that software support lags behind so much, and the software is nowhere near as optimized as CUDA. the MI50 has better specs than a 3090 but comes out like 50 or 75% as fast as a 3090 :/
CheatCodesOfLife@reddit
It has slightly more memory bandwidth, but no equivalent to tensor cores. No amount of software optimization will change that.
A 3090 will always be >5x faster once compute bound, e.g. prompt processing.
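Rough back-of-envelope from the public spec sheets (numbers approximate, from memory): the MI50 has ~1 TB/s of HBM2 vs ~0.94 TB/s on the 3090, so bandwidth-bound token generation is competitive, but FP16 compute is roughly 26 TFLOPS on the MI50 vs ~140 TFLOPS via the 3090's tensor cores, which is where the ~5x gap in prompt processing comes from.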
Blizado@reddit
Yeah, your setup is pretty normal, nothing AI-special here. I have an RTX 4090 and 64GB DDR5, and that is more a typical high-end gaming setup (since only an RTX 5090 is better) than an AI PC. In the r/LocalLLaMA field that is still more low-to-mid tier. :D
camracks@reddit
I have half as much hardware as you, and even then there's people who have way less than half of what I do. There's always going to be more and better, but you definitely have way more than the average person running LLMs.
MerePotato@reddit
26B or 31B in this case?
fake_agent_smith@reddit (OP)
Both, but most of the time the 26B is enough and very fast.
MerePotato@reddit
I've found the same, its answers are less stable and it's definitely way dumber, but it's smart enough for most tasks you'd actually want to use an on-device model for
Blizado@reddit
Yeah, no wonder, 5B fewer parameters and on top of that it's "only" a MoE. In the end it's always the same question: do you want to sacrifice quality for speed? For normal daily use yes, for special cases, which are rarer, no. So maybe a setup would be good where you can switch between models as quickly as possible.
Far-Low-4705@reddit
what did you fuck up that badly??
CheatCodesOfLife@reddit
I see you've used Qwen3 before lol
Far-Low-4705@reddit
Hah yep lol
It's interesting, these models tend to tell you what you already think is true and validate your correct or incorrect conclusions, or your worst fears.
They very rarely give you an objective conclusion.
CheatCodesOfLife@reddit
Yeah I don't think they can do it tbh. You give it the "brutal critic" prompt, it'll roast whatever you send it.
Then a few iterations later when it starts to glaze, if you copy the glazed version -> paste into a fresh chat, it'll roast the shit out of it anyway.
Far-Low-4705@reddit
haha lmaoooo true
fake_agent_smith@reddit (OP)
Health. I almost drove myself to death with overwork because I was stupid and overly ambitious. After 10+ years of drive I ended up with ANS collapse, HRV < 10ms, RHR above 100 bpm, severe dehydration, orthostatic tachycardia and an impaired immune system (which caused a severe influenza episode). All caused by irregular and too-short sleep, poor dietary choices (lack of fruit and vegetables + occasional fast food), above-moderate alcohol consumption and often working 14-16 hours a day.
Fortunately, I'm still rather young, and the ANS crash got me out of this madness. After I received medical help and guidelines I've been recovering, with the prospect of a full recovery in 12-18 months. Out of curiosity I tend to benchmark various models with a long prompt describing my situation in detail and, well, Gemma was rather critical I'd say.
RegisteredJustToSay@reddit
That's the best (non)ad I've seen for Gemma in a while. Sycophancy is a bug, not a feature. Out of curiosity, was this the MoE 26B or the dense 31B?
fake_agent_smith@reddit (OP)
In this particular case the MoE.
PatinhoGamer@reddit
fix your strongest existential fears to improve your nervous system
fake_agent_smith@reddit (OP)
Yeah, good advice. I started therapy a few months ago as well, besides fixing my terrible lifestyle choices.
FluoroquinolonesKill@reddit
This. That is probably what is motivating bro to work like that to begin with. Bro needs to do serious self examination.
Source: me, an intellectual, judging people for making mistakes I recently learned to stop making.
Echo9Zulu-@reddit
A fantastic benchmark my friend, imo gold standard
Sergei-_@reddit
hi, I'm new to running local models. in the process of setting up Gemma 4 atm. what is this app you're using to chat with the model and choose reasoning?
Shiny-Squirtle@reddit
What was your system prompt for the model to respond like this?
bilinenuzayli@reddit
I love local AI as well, the answers are just class when used clean through the llama.cpp web server. I'm convinced you could replace frontier AIs with a medium-tier model in the 25-35B range for most people who aren't doing super complex tasks, and they wouldn't even notice they're using a model tens of times smaller. This local AI stuff is also enough for what I need. But I'm curious, what's the solution when there's a large conversation, like a long chat? Any harness that supports long conversations that I've tried reduces reasoning quality and partially lobotomises the model (any harness with a large and demanding system prompt does this for me, on Qwen 3.5 and Gemma 4; when I move the system prompt to the user role the response quality bumps up a little, but it's still not as good as a fresh chat). Personally that's the biggest setback for me in local AI with small models.
letsgoiowa@reddit
I tested Minimax m2.7 to just spitball ideas about the new mysterious "Elephant" model on Openrouter that's like a gazillion tokens per second, but is incredibly stupid. Here's a snippet of its response and I SWEAR I didn't prompt in anything like this:
"The Key Clue The fact it's 100B and underperforms 27B says something specific: this lab can't optimize for shit. DeepSeek, OpenAI, Anthropic all have excellent inference optimization. Qwen/Alibaba does too."
THIS LAB CAN'T OPTIMIZE FOR SHIT lmao I'm dying
Kerbiter@reddit
what's that UI? bit new to the local AI models but curious, only tried Lemonade so far (AMD iGPU here)
fake_agent_smith@reddit (OP)
It's the new web UI baked into llama.cpp. Still in early stages of development, but super clean and it's enough for me. Lots of details here: https://github.com/ggml-org/llama.cpp/discussions/16938
pfn0@reddit
I wish they would split it out into a separate SPA. It is the perfect lightweight LLM test client.
Tall-Ad-7742@reddit
To be honest. Yes. Yes it is.
unngh_yugstyx@reddit
It certainly feels less sycophantic and more truthful
Icy-Degree6161@reddit
Now I need context
Mean_Media_2775@reddit
I am new to local hosting, and out of curiosity, what is the most you can do with a 9070 XT + 64GB RAM? It is at the highest end of my budget and I want to keep my expectations in check.