For what purpose do you use local LLMs?
Posted by mrscript_lt@reddit | LocalLLaMA
There are a lot of discussions about which model is best, but I keep asking myself: why would the average person need an expensive setup to run an LLM locally when you can get ChatGPT 3.5 for free and GPT-4 for 20 USD/month?
My story: for day-to-day questions I use ChatGPT 4. It seems impractical to run an LLM constantly, or to spin one up whenever I need a quick answer.
I came to the local LLM world because I have a specific use case: I need to generate lots of descriptions from my database to be published on the web. For that, it seems cheaper to run locally than to pay for the ChatGPT API. But my use case is a complex one and I'm still at the beginning of my journey. Likely I will need fine-tuning, RAG, etc.
What are the purposes you use local LLMs for, and why is ChatGPT not an option?
Tiny_Cockerel@reddit
Censorship. So-called alignment. Refusals.
"As a large language model created by OpenAI..." "It's important to note..."
When I put bread in my toaster, it toasts bread. It doesn't argue with me about the ethics of toast making. It simply does the job required of it.
I expect the same from any product I use.
kvothe_10@reddit
Well, I'd hope that if a crazy, well-funded biologist asks an advanced model how to create the next COVID, it would say 'No'.
Your toaster doesn't have advanced reasoning skills and access to the entire knowledge of the internet.
Early_Pie5524@reddit
I would add that if I put a banana in my toaster, or a porkchop, it will not argue either.
Zerohero2112@reddit
Well said, I love you Tiny_Cock
CulturedNiichan@reddit
Creative writing. An uncensored model is essential for anything creative. Note that uncensored doesn't necessarily mean NSFW. Corporate models are so censored, portraying a very particular, biased (and of course non-human) view of the world in their vicious desire to control how people think and what they're allowed to say or write, that unless you are writing a children's book (and a modern one; God forbid you were writing a traditional one where bad things actually happen), they won't cut it.
Also, as a side effect of the fine-tuning for instruction mode, the prose they write is sterile, bland, and assistant-like. Nothing good for creative writing unless very heavy prompting is used (and even then, you have to bypass the agenda).
FluffnPuff_Rebirth@reddit
Very much agreed. Usually when people hear "uncensored", the first thing that comes to mind is the lewd and snuff stuff. But if you want a good story exploring the central struggles of the human experience, a model that can't acknowledge the existence of some of the really dark and serious issues the world has will limit your ability to create anything of profound artistic value with the AI tool.
Many of the works we consider staples of literature these days couldn't have been written if topics like suicide, or the messed-up and nuanced relationships between abuser and abused, couldn't be explored beyond the corporate-approved PG-13 standard. And if an LLM is made incapable of generating anything ethically questionable unless it unequivocally denounces it, writing believable antagonists with grievances grounded in reality, or a protagonist with serious internal struggles to overcome, becomes very difficult.
Creative writing with censored models quickly veers away from using the tool to express yourself better, toward changing the way you express yourself to fit within arbitrary guidelines a PR team came up with ad hoc, so that some 4chan weirdos couldn't make it type out naughty things.
Tbh, I believe the core of the problem is that the public doesn't understand what an LLM is. Most people believe it to be an entity with opinions of its own, so if it says bad things, the model must somehow hold a moral standard or belief. It should be communicated much better that an LLM is not a thinking entity living inside a computer that writes things out for you; it's closer to an advanced version of the word auto-complete feature our phones have had for decades. A user being able to "manipulate" it is neither that impressive nor that concerning.
CulturedNiichan@reddit
I suppose if people realized what LLMs really are, it would be bad for marketing. I've seen a ton of articles like "these are the X most beautiful cities in Y country, according to ChatGPT", as if it were a truth revealed by a celestial being. This... mysticism helps sell the product.
It's also pretty... pathetic that people are afraid an LLM may write 'unethical' or 'illegal' content. Depending on where you live, it may even be hard to come up with written material that is truly illegal, yet suddenly a lot of people seem eerily comfortable with the notion that written material can and should be illegal; talk about normalizing and internalizing censorship in our current world. What they don't realize is that anything an LLM writes could just as well be written by the person who wants that content written. I see no danger in it being an LLM writing it as per your instructions.
xlogic87@reddit
What are some good uncensored open source models?
ttkciar@reddit
My use-cases and the models I use for them:
NousResearch-Nous-Capybara-3B-V1 for RAG
Medalpaca-13B as a copilot for medical reading
Starling-LM-11B-alpha as a copilot for physics research
Either Starling-LM-11B-alpha or PuddleJumper-13B-v2 for analysis of social, political, historical, or philosophical issues
Mistral-7B-OpenOrca for creative writing (sci-fi, not ERP)
NoroCetacean-20B-10K for creative writing (neither sci-fi nor ERP)
Phind-CodeLlama-34B-v2 for bulk code generation
Rift-Coder-7B and Refact-1.6B-fim as coding copilots
Scarlett-33B for casual topics and informal prose
Starling-LM-11B-alpha, Mistral-7B-SciPhi-32k, and Vicuna-33B for synthetic dataset generation
As for why not ChatGPT, mainly future-proofing.
Every AI Summer thus far has been followed by an AI Winter, as a consequence of overhyping and overpromising AI technologies. I'm too young to have been aware of the first AI Winter, but I was active in the industry during the second one. The forces which caused that AI Winter were highly analogous to what we see today, with media speculating wildly and AI companies promising the moon.
When the moon doesn't materialize, there will be disillusionment and backlash, investment will dry up, and services we take for granted now (like ChatGPT or HuggingFace) might change for the worse or entirely cease to exist.
None of that will impact the models I have at home, though. Those will keep working no matter what happens, and open source LLM development can continue during the AI Winter.
That holds true for less dramatic changes, too. OpenAI keeps changing ChatGPT's behavior and their price schedules, not always for the better, and they censor it pretty harshly as well. There are topics on which it simply will not infer. My local models are immune to all of that as well.
Since I have no faith that ChatGPT will be usable in the long term, I'd rather not get into the habit of using it at all. Instead I am developing the skills and the technology which will serve me in the indefinite future.
tuxedo0@reddit
Thanks for the detailed answer. Do you mind letting us know a bit about the tools you use (ooba, VS Code plugins, etc.)?
ttkciar@reddit
I use llama.cpp for everything, wrapped with my own scripts written in Bash, Python, or Perl.
My RAG system is written in a mix of Perl and Python (using Perl's Inline::Python module, so I can use the nltk Python library for summarization). It uses a Lucy index for document retrieval, which I have populated with a Wikipedia dump.
What makes my RAG system different from others is that it doesn't vectorize documents until inference time. This makes it slower, but it allows me to summarize relevant documents to condense them down to only the information most relevant to the prompt, and fill context more efficiently and effectively with information-dense data. This also means I can switch inferring models at a whim, because the database doesn't have to be rebuilt with documents vectorized using the new model's embeddings.
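In outline the flow is: retrieve documents for the prompt, summarize each one until a rough context budget is filled, then infer. A minimal bash sketch of that flow, with hypothetical `retrieve.pl` and `summarize.py` helpers standing in for the Lucy lookup and the nltk summarization (the model path and the character budget are made up too):

```bash
#!/usr/bin/env bash
# Sketch only: retrieve.pl prints one matching document path per line,
# summarize.py condenses a document to a few information-dense sentences.
PROMPT="$1"
CONTEXT=""
while IFS= read -r doc; do
    SUMMARY="$(./summarize.py "$doc")"
    # Stop packing summaries once a rough context budget is hit.
    if [ $(( ${#CONTEXT} + ${#SUMMARY} )) -gt 6000 ]; then break; fi
    CONTEXT="${CONTEXT}${SUMMARY}"$'\n'
done < <(./retrieve.pl "$PROMPT")
# Hand the condensed context plus the question to llama.cpp's main binary.
./main -m model.gguf -p "${CONTEXT}
Question: ${PROMPT}
Answer:"
```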
As for Refact and Rift-Coder, both of these can be used with their respective VS plugins, but I don't actually use VS, so I can't do that. GGUFs are available for both models, and those work with llama.cpp just fine.
I wrote some scripts which watch a source file and iterate their main loop when the file's modification time changes.
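The watch loop itself is just mtime polling; a minimal sketch (not the actual scripts; `run_inference` stands in for the model-specific step described below):

```bash
#!/usr/bin/env bash
# Re-run inference whenever the watched file's modification time changes,
# i.e. every time the editor saves it.
FILE="$1"
LAST=""
while true; do
    NOW="$(stat -c %Y "$FILE")"   # GNU stat; use `stat -f %m` on BSD/macOS
    if [ "$NOW" != "$LAST" ]; then
        LAST="$NOW"
        run_inference "$FILE"     # placeholder for the model-specific step
    fi
    sleep 1
done
```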
The Rift-Coder wrapper looks for code leading up to a single-line comment `# INFER`, feeds that to Rift-Coder, and spits out its output in its own terminal window. To make it do the right thing, I write some comments in the code I'm editing just before the `# INFER` comment, so that when I hit "save", Rift-Coder responds to those comments in the context of the preceding code.
The Refact script is similar (it started life as a copy of the Rift-Coder script), but since Refact is a Fill-in-Middle model, the script gives Refact the code both before and after the `# INFER` comment and splits it to make the prompt Refact expects: `<fim_prefix>$CODE1<fim_suffix>$CODE2<fim_middle>`. It then takes Refact's reply, appends it to $CODE1, and repeats the process, five times by default (controlled by a command line argument), then prints the combined output in its own terminal window. It's very fast, so if I don't like what I see I can just hit "save" again, which triggers the script to iterate again.
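Assembling that prompt from the watched file might look roughly like this (a sketch under assumptions: the real script also handles the iteration, and the model filename is a guess):

```bash
#!/usr/bin/env bash
# Split the file at the marker comment and build the Fill-in-Middle prompt.
FILE="$1"
CODE1="$(sed '/# INFER/,$d' "$FILE")"   # everything before the marker
CODE2="$(sed '1,/# INFER/d' "$FILE")"   # everything after it
./main -m refact-1.6b-fim.gguf -p "<fim_prefix>${CODE1}<fim_suffix>${CODE2}<fim_middle>"
```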
My medical and physics assistants are cruder than this. I have bash scripts `med` and `star11` which take prompts as arguments and pass them to llama.cpp's `main` executable with the appropriate prompt formatting and command line parameters. To use them, I append a question to my notes document (I write my notes in plaintext) and use command line interpolation in a different terminal window:
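For example, with `notes.txt` standing in for the notes file:

```bash
star11 "`cat notes.txt`"
```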
...and Starling will infer on the entire contents of my notes file (the backticks tell the shell to execute the command inside them and interpolate its output, and `cat` is a utility which outputs whatever is in the file passed to it). Since my notes end with a question, it tries to answer that question. It's crude, but good enough that I haven't been arsed to make something better.
Sometimes I'll do the same thing with my RAG system, when it's working (which it frequently is not), usually because there's not much in my notes yet and I hope there's something relevant in Wikipedia:
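For example (the `rag` script name here is a guess):

```bash
rag "`cat notes.txt`"
```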
All of this works for me since I more or less do all of my work in side-by-side terminals (or one or more terminal to one side and xpdf showing a document on the other side).
Here's a screenshot I took back when I was still using PuddleJumper-13B-v2 for my physics assistant, to give you some idea of what this looks like in practice: http://ciar.org/h/infer_work.png
Starling works much better for almost everything, so I only rarely use PuddleJumper anymore, mostly for language translation (it's really good for translating between English, German, Yiddish, and Russian).
nunodonato@reddit
How do you use RAG locally? Is there a local model that generates embeddings?