Local AI is the best
Posted by fake_agent_smith@reddit | LocalLLaMA | 50 comments
Funny image, but also I'd like to add that I love how much freedom and honesty I can fine-tune the model for. No glazing, no censorship, no data harvesting. I can discuss and analyze personal stuff with peace of mind knowing that it stays in my home. I'm eternally grateful to the llama.cpp developers, everyone involved in open-weight model development, and everyone else involved in these tools.
artisticMink@reddit
I want to see the reasoning so bad.
RebouncedCat@reddit
llama.cpp is goated
Limp_Classroom_2645@reddit
So good
Kerem-6030@reddit
it's the base of almost every other local AI app (like LM Studio, Ollama, etc.)
united_we_ride@reddit
No idea why you're being downvoted. I've heard there's a little more going on under Ollama's hood now, but llama.cpp was originally the full backend.
Same goes for LM Studio, Jan, etc. People are weird.
Kerem-6030@reddit
i don't know, they always do it to me
Gunch_@reddit
I've noticed a correlation between people who use emojis and people that constantly get downvoted
somersetyellow@reddit
Redditors are absolute chad Linux kernel programmers who grew up using pure ASCII to communicate.
We do not use the emoji and shun the normies who do =_=
Foreign-Beginning-49@reddit
I've also noticed high effort posts that are definitely not ai getting downvoted because "ai". It's a scary world where too much effort is penalized out of suspicion it's ai. What's this gonna do to Homo sapiens? Cray cray future for sure.
Webfarer@reddit
Just out of curiosity, what base model do you use? And what hardware?
fake_agent_smith@reddit (OP)
Currently Gemma 4. I tend to stay in the 20-35B range, because my hardware is just an RX 9070 XT and 64GB DDR5. For convos and simple issues, most of the time I use MoE variants for speed.
camracks@reddit
"Just"
LienniTa@reddit
stuff like 4x3090 are pretty common in this sub tbh
camracks@reddit
I know but those are like the top posts in the sub, on average I don't think most people doing Local LLM have 4 3090s, a lot of people don't even have a graphics card at all and are limited to APIs, but you're not going to see a post of someone flexing their $100 in API usage lmao
fake_agent_smith@reddit (OP)
Well, people in this sub often share amazing specs with multi-GPU setups or other solutions capable of running GLM-5.1 and other amazing models (at least something 100B+). So in comparison my hardware is rather humble.
Force88@reddit
Yep, people here are really dedicated to LLMs. Compared to their A6000 or 4x 5090 rigs, my 2x 5060 Ti 16GB is just child's play.
Btw, is llama.cpp hard to use? I'm a newbie and only just learnt to use Ollama due to how easy it is.
rainbyte@reddit
I think llama.cpp is easier if you interact with the community, because you can share the exact command you are running, and other users can suggest adding or removing options.
Syntax is literally: llama-server -m model.gguf --option-a value-a --option-b value-b
Give it a try!
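For example, a minimal command might look like this (the model filename is just a placeholder, swap in whatever GGUF you downloaded):
llama-server -m ~/models/my-model-Q4_K_M.gguf -c 8192 -ngl 99 --port 8080
-c sets the context size, -ngl is how many layers to offload to the GPU, and once it's running you open http://localhost:8080 in the browser for the built-in web UI.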
Devatator_@reddit
No, you basically just run llama-server with whatever model you want and open the URL that shows up
True-Lychee@reddit
Try LMStudio if you want a user-friendly UI for llama.cpp
Far-Low-4705@reddit
i got two AMD MI50s for 64GB of VRAM for like $200-300ish, but i sold old hardware that i got for free to offset the cost, so im very happy with my purchase
Just sucks that software support lags behind so much, and the software is nowhere near as optimized as CUDA. the MI50 has better specs than a 3090 but comes out like 50 or 75% as fast as a 3090 :/
CheatCodesOfLife@reddit
It has slightly more memory bandwidth, but no equivalent to tensor cores. No amount of software optimization will change that.
A 3090 will always be >5x faster once compute bound, e.g. prompt processing.
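Rough back-of-envelope from the public spec sheets (numbers approximate, from memory): the MI50 has ~1 TB/s of HBM2 vs ~0.94 TB/s on the 3090, so bandwidth-bound token generation is competitive, but FP16 compute is roughly 26 TFLOPS on the MI50 vs ~140 TFLOPS via the 3090's tensor cores, which is where the ~5x gap in prompt processing comes from.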
Blizado@reddit
Yeah, your setup is pretty normal, nothing AI-special here. I have an RTX 4090 and 64GB DDR5, and that is more a typical high-end gaming setup (since only an RTX 5090 is better) than an AI PC. In the r/LocalLLaMA field that is still more low-to-mid tier. :D
camracks@reddit
I have half as much hardware as you, and even then there's people who have way less than half of what I do. There's always going to be more and better, but you definitely have way more than the average person running LLMs.
MerePotato@reddit
26B or 31B in this case?
fake_agent_smith@reddit (OP)
Both, but most of the time the 26B is enough and very fast.
MerePotato@reddit
I've found the same, its answers are less stable and it's definitely way dumber, but it's smart enough for most tasks you'd actually want to use an on-device model for
Blizado@reddit
Yeah, no wonder, 5B fewer parameters and on top of that it's "only" a MoE. In the end it's always the same question: do you want to sacrifice quality for speed? For normal daily use yes, for special cases, which are rarer, no. So maybe a setup would be good where you can switch between models as quickly as possible.
Far-Low-4705@reddit
what did you fuck up that badly??
CheatCodesOfLife@reddit
I see you've used Qwen3 before lol
Far-Low-4705@reddit
Hah yep lol
It's interesting, these models tend to tell you what you already think is true and validate your correct or incorrect conclusions, or your worst fears.
They very rarely give you an objective conclusion.
CheatCodesOfLife@reddit
Yeah I don't think they can do it tbh. You give it the "brutal critic" prompt, it'll roast whatever you send it.
Then a few iterations later when it starts to glaze, if you copy the glazed version -> paste into a fresh chat, it'll roast the shit out of it anyway.
Far-Low-4705@reddit
haha lmaoooo true
fake_agent_smith@reddit (OP)
Health. I almost drove myself to death with overwork because I was stupid and overly ambitious. After 10+ years of drive I ended up with ANS collapse, HRV < 10ms, RHR above 100 bpm, severe dehydration, orthostatic tachycardia and an impaired immune system (which caused a severe influenza episode). All caused by irregular and too-short sleep, poor dietary choices (lack of fruit and vegetables + occasional fast food), above-moderate alcohol consumption and often working 14-16 hours a day.
Fortunately, I'm still rather young, and the ANS crash got me out of this madness. After I received medical help and guidelines I've been recovering, with the prospect of a full recovery in 12-18 months. Out of curiosity I tend to benchmark various models with a long prompt describing my situation in detail and, well, Gemma was rather critical I'd say.
RegisteredJustToSay@reddit
That's the best (non)ad I've seen for Gemma in a while. Sycophancy is a bug, not a feature. Out of curiosity, was this the MoE 26B or the dense 31B?
fake_agent_smith@reddit (OP)
In this particular case the MoE.
PatinhoGamer@reddit
fix your strongest existential fears to improve your nervous system
fake_agent_smith@reddit (OP)
Yeah, good advice. I started therapy a few months ago as well, besides fixing my terrible lifestyle choices.
FluoroquinolonesKill@reddit
This. That is probably what is motivating bro to work like that to begin with. Bro needs to do serious self examination.
Source: me, an intellectual, judging people for making mistakes I recently learned to stop making.
Echo9Zulu-@reddit
A fantastic benchmark my friend, imo gold standard
Sergei-_@reddit
hi, I'm new to running local models. in the process of setting up Gemma 4 atm. what is this app you're using to chat with the model and choose reasoning?
Shiny-Squirtle@reddit
What was your system prompt for the model to respond like this?
bilinenuzayli@reddit
I love local AI as well, the answers are just class when used clean through the llama.cpp web server. I'm convinced you could replace frontier AIs with a medium-tier model in the 25-35B range for most people who aren't doing super complex tasks, and they wouldn't even notice they're using a model tens of times smaller. This local AI stuff is also enough for what I need. But I'm curious, what's the solution when there's a large conversation, like a long chat? Any harness that supports long conversations that I've tried reduces reasoning quality and partially lobotomises the model (any harness with a large and demanding system prompt does this for me, on Qwen 3.5 and Gemma 4; when I move the system prompt to the user role the response quality bumps up a little, but it's still not as good as a fresh chat). Personally that's the biggest setback for me in local AI with small models.
letsgoiowa@reddit
I tested Minimax m2.7 to just spitball ideas about the new mysterious "Elephant" model on Openrouter that's like a gazillion tokens per second, but is incredibly stupid. Here's a snippet of its response and I SWEAR I didn't prompt in anything like this:
"The Key Clue The fact it's 100B and underperforms 27B says something specific: this lab can't optimize for shit. DeepSeek, OpenAI, Anthropic all have excellent inference optimization. Qwen/Alibaba does too."
THIS LAB CAN'T OPTIMIZE FOR SHIT lmao I'm dying
Kerbiter@reddit
what's that UI? bit new to the local AI models but curious, only tried Lemonade so far (AMD iGPU here)
fake_agent_smith@reddit (OP)
It's the new web UI baked into llama.cpp. Still in early stages of development, but super clean and it's enough for me. Lots of details here: https://github.com/ggml-org/llama.cpp/discussions/16938
pfn0@reddit
I wish they would split it out into a separate SPA. It is the perfect lightweight LLM test client.
Tall-Ad-7742@reddit
To be honest. Yes. Yes it is.
unngh_yugstyx@reddit
It certainly feels less sycophantic and more truthful
Icy-Degree6161@reddit
Now I need context
Mean_Media_2775@reddit
I am new to local hosting, and out of curiosity, what is the most you can do with a 9070 XT + 64GB RAM? It is at the highest end of my budget and I want to keep my expectations in check.