How do I run AI locally? And what is the most efficient model / software?
Posted by 24_1378@reddit | LocalLLaMA | View on Reddit | 13 comments
Hey everyone. I'll admit - Sam Altman and OpenAI just give me a really bad gut feeling. And to be honest, even if they're well-intentioned, truly care about people's well-being, and try their best to keep conversations private, someone could just hack the server and leak whatever users have. He'll also be forced, if a frivolous law or court case is filed, to hand data over to people who may not have the best intentions, or who may abuse a moral panic such as children's safety or mental health for purposes of power. Don't get me wrong, these issues need to be cared about - but they're often used as a trojan horse by politicians to abuse power.
And now with them handing this data over to the police automatically, I am even more concerned. Police departments are rife with corruption and abuses of power, as are courts. Etc.
But this technology is amazing. I think when used properly - as a tool to help people out, to let people learn and be more creative - it could very well better humanity. So I was curious: what software can I use to run this on my own hardware? I've tried out Ollama, but I've heard it isn't the most up to date, though I'm still fucking amazed. And which model is the best and most advanced for local use? I'm a total noob at this.
SiEgE-F1@reddit
Things to consider:
llama.cpp - Has prebuilt binaries you can download and run. Launches an OpenAI-compatible server with a web UI (see the sketch after this list). Your best option if you want something lightweight that runs most of the stuff out there and doesn't get in the way with unnecessary UI. It is capable of CPU offloading, making use of both the CPU and GPU.
Koboldcpp - Same as llama.cpp, but with a friendly UI to launch things from. For the cases where you don't want to terrorize yourself with terminal applications and learning their launch parameters.
Ollama - is ALMOST the same as llama.cpp, but has drifted into its own little walled garden, with its own "requantized" models. Ollama used to be the first tool to download models for you, but llama.cpp can now do that on its own.
LM Studio - I have no clue what that is, as I've never used it. I've heard it cannot use your RAM or your CPU? I might be mistaking it for vLLM, though.
vLLM - Also no idea.
Oobabooga's text generation webui - Was the first "everything included" tool. Compared to the tools above, this one can run loads of different quant types, like ExLlamaV2, ExLlamaV3, llama.cpp, Transformers, etc. But it always lags behind in terms of updates.
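Since llama.cpp, Koboldcpp and friends all expose an OpenAI-compatible API, any OpenAI client library can talk to them. A minimal sketch in Python, assuming llama.cpp's llama-server is already running on its default http://localhost:8080 (the port and model name here are placeholders - use whatever you started the server with):

```python
# Minimal sketch of talking to a local OpenAI-compatible server
# (llama.cpp's llama-server, Koboldcpp, LM Studio, etc. all expose one).
# Assumes a server is already running on localhost:8080 - adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this name
    messages=[{"role": "user", "content": "Explain CPU offloading in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The same snippet works against the other backends in this thread by just changing the base_url.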
Then there are loads of other apps, or ways to run inference on your models, but I'd consider those apps to be split into 2 different teams:
Team 1 is C++ exe-file apps. Used mostly to run your stuff on your PC with little to no prerequisites (except maybe GPU drivers and GPU-acceleration stuff like CUDA).
Team 2 is "Python hell" apps. You need at least a basic Python background and the essential Python tooling to make them work. Due to the nature of Python as a programming language, you might not have a very good time trying different Python apps, as they may break due to env or dependency version conflicts.
Whatever you choose, keep in mind that if you face a weird issue, don't shy away from looking up the "Issues" tab on the corresponding project's GitHub page. Sometimes "old, tried and true" things end up broken, and you might have to try something else, or wait a while before the people in the know have a chance to fix it. That is the nature of this "all new thing".
Lissanro@reddit
Great overview. I can share some additional backends to consider depending on the use case:
For large MoE models, ik_llama.cpp is one of the best backends. I shared details here on how to build and use it, in case someone wants to give it a try and hasn't already. It can be much faster than llama.cpp, especially at larger context sizes. Notably, ik_llama.cpp comes with the Mikupad frontend built in, but of course any other frontend that supports an OpenAI-compatible API can be connected to it, like SillyTavern or Open WebUI.
TabbyAPI (ExLlamaV2/V3) is an excellent option when the model fully fits in VRAM, since it tends to provide better performance than other backends.
vLLM, already mentioned in your overview, is one of the best options for batch inference and multi-user setups on rigs with plenty of VRAM, though it is less memory efficient in my experience, and also more cumbersome to set up.
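To illustrate the batch-inference point, here is a minimal sketch of vLLM's offline Python API (the model name is only an example - pick one that actually fits your VRAM):

```python
# Minimal vLLM offline batch-inference sketch: many prompts processed at once,
# which is where vLLM shines compared to single-user backends.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a haiku about GPUs.",
    "Explain quantization in one sentence.",
]
params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # example model, must fit in VRAM
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```

For serving multiple users you would instead run its OpenAI-compatible server and point your frontend at it, same as with the other backends.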
AXYZE8@reddit
Download LM Studio. It's beginner friendly and will suggest what will run on your hardware. For models, you can check out GPT-OSS 20B or Qwen3 30B - these are small, efficient models. If it says you don't have enough memory to handle those, try Gemma3 4B.
Try them and see if you like it. If you do then you should learn about topics like:
- Creating personality and character by system prompt
- Tool calling (for example searching the web)
- Dense vs MoE
- Quantization (for example in GGUF format)
- Hardware (GPU vs CPU inference, Apple Silicon vs PC)
and after learning about these you will have a much better idea of local LLMs and will be able to get the most out of your hardware. That being said, it's never going to be close to the performance of ChatGPT, and you will likely need to spend some money to run better models. If you want to check how much better the responses get with bigger models, you should register on OpenRouter and chat with models there - for example, compare GPT-OSS 20B to the 120B variant.
Then you may like to try out a different interface - I recommend Open WebUI. However, you should start with LM Studio, as it has a much easier UX for learning the basics.
24_1378@reddit (OP)
Just downloaded LM Studio. It's incredible.
And by different interface - do you mean there's an option to make LM Studio look like ChatGPT's website?
AXYZE8@reddit
Yes. Open WebUI does exactly that and lets you add tools in an easy way. The model can then call things like:
- Web search (Google/Brave/any other search engine; thanks to that, even a small, efficient model can have up-to-date expert knowledge)
- File parser/OCR (you can upload a PDF or JPG and it will see what's inside)
- Code execution (LLMs do not compute values accurately, so instead of relying on them you let them write a few lines of code and then execute it. Way more accurate results.)
Additionally, you have things like speech-to-text, you can easily switch between different personalities, etc.
One personality can be an expert in medicine that does a lot of reasoning and tries to validate its own findings with web search. Another one can be a creative writer that doesn't care about factuality or reality... you can get whatever response you want by sculpting the system prompt. This is why so many people are interested in local LLMs.
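Under the hood, a "personality" is just a different system prompt sent to the same model. A minimal sketch against any OpenAI-compatible local server (LM Studio's local server usually listens on http://localhost:1234/v1; the persona texts are just examples):

```python
# Minimal sketch: two "personalities" are just two different system prompts
# sent to the same local model over an OpenAI-compatible API.
# Assumes a local server such as LM Studio's (typically localhost:1234).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

personas = {
    "careful medic": "You are a cautious medical researcher. Reason step by step "
                     "and clearly flag anything you are unsure about.",
    "dream writer": "You are a surreal creative writer. Ignore factual accuracy; "
                    "prioritize vivid imagery and mood.",
}

for name, system_prompt in personas.items():
    reply = client.chat.completions.create(
        model="local-model",  # LM Studio serves whatever model is currently loaded
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Describe a sleepless night."},
        ],
        max_tokens=150,
    )
    print(f"--- {name} ---\n{reply.choices[0].message.content}\n")
```

Open WebUI just wraps this kind of thing in a nice interface so you can save and switch personas with a click.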
An hour ago I also tried Gemma3 E4B again. I hated this model because it's way too censored (popular slang words are wrongly interpreted as hateful, harmful, or sometimes even suicidal language), but I see there is an 'abliterated' version now. That abliterated version doesn't have that problem anymore, and the quality of multilingual responses is still at the same very high level, so if you care about some less popular languages you should definitely try out that model. Gemma3 E4B is different from Gemma3 4B.
You can read about abliteration; it will help you understand how LLMs work:
https://huggingface.co/blog/mlabonne/abliteration
Flaky_Comedian2012@reddit
I would recommend KoboldCPP. You just run the exe and load up the model, and it has a bunch of features that other backends like LM Studio lack, for example "story mode"/pure autocomplete and character cards.
I also find its API much more compatible.
Awwtifishal@reddit
I recommend jan.ai instead of LM Studio because Jan is open source. Both are better than Ollama. As for the best model, that depends on your hardware (particularly your VRAM and RAM) and your needs.
xAdakis@reddit
The most beginner-friendly and easiest to set up is LM Studio.
24_1378@reddit (OP)
Very nice. In your opinion - why might it be better than Ollama?
xAdakis@reddit
It's just easier to set up and comes with a graphical user interface.
I only used Ollama for a short time myself, but it also didn't seem to have the same performance as LM Studio.
24_1378@reddit (OP)
Thank you. I'm using LM Studio right now and by god it is fucking amazing. I went offline and it still works - running Mistral right now. That is crazy to me and a reminder that we do live in the future.
bucolucas@reddit
Figure out what level of model you want to run. MoE models run great on most RAM+CPU setups. Tell us your setup for more specific advice.
ba2sYd@reddit
You can check out LM Studio, there are a lot of models you can download there.
As for the best model, it really depends on your hardware. Basically, if you have a GPU with 24GB of VRAM, you can usually run 24B models. If you have 12GB, you'll be limited to around 12B models. LM Studio will tell you which models you can or can't run. Try to download models that say full GPU offload is possible.
Also, pay attention to quantization. You can think of quantization like compression: it reduces RAM usage, but it can also affect quality (q3 < q4 < q5, and so on). I wouldn't really recommend using models with lower quantization than q4, or q3 at the very lowest.
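As a rough back-of-the-envelope check (the bits-per-weight numbers below are approximate for GGUF K-quants and are my assumption, not exact figures - real files also need extra room for context/KV cache):

```python
# Rough estimate of the memory footprint of a quantized model.
# Approximation only: some tensors stay at higher precision, and the
# KV cache grows with context length, so treat the output as a ballpark.
def estimate_model_gb(params_billion: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return weights_gb + overhead_gb                    # overhead ~ KV cache + buffers

for quant, bits in [("Q3_K_M", 3.9), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7)]:
    print(f"24B at {quant}: ~{estimate_model_gb(24, bits):.1f} GB")
```

A 24B model at a q4-class quant lands around 15-16 GB of weights plus cache, which is roughly why 24GB of VRAM is quoted as the comfortable size for 24B models.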
As for good models, check out Gemma 3, the Mistral models (Mistral Small is a good one), GPT-OSS, and the Qwen models.