"What do you guys even use local LLMs for?" Me: A lot
Posted by andy2na@reddit | LocalLLaMA | 76 comments
Created separate private API keys for each service within LiteLLM and started logging the usage via Prometheus to view in Grafana. Surprised at how quickly the Frigate GenAI summary tokens add up!
This view is only the past 6 hours.
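For anyone wondering how the pieces connect: a hypothetical litellm config.yaml sketch with the Prometheus callback enabled. Model names, ports, and backends here are placeholders, not my exact file.

```yaml
# Hypothetical LiteLLM proxy config sketch; names and ports are placeholders.
model_list:
  - model_name: qwen-local
    litellm_params:
      model: openai/qwen-local         # any OpenAI-compatible backend
      api_base: http://vllm:8000/v1    # e.g. a vLLM or llama.cpp server
      api_key: none

litellm_settings:
  callbacks: ["prometheus"]            # exposes token metrics on /metrics
```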
Sn0opY_GER@reddit
sad claude noise :/ i spend more $ on tokens than that - my 5090 paid for itself after 2 weeks or so 😃
ebayironman@reddit
Exactly. And as time moves on and the major online AI providers have to generate positive revenue, you're going to see their prices go up and their leniency go down, and the cost-effectiveness of having a local LLM will increase. It's a no-brainer anyway.
Void-kun@reddit
You aren't wrong, but imho current hardware isn't good enough for local LLM development.
What local models can run multiple parallel sub-agents, or a team of agents, across multiple large repos without the work taking hours and hours to complete before you can review and test?
I will 100% switch to local LLMs but I need the hardware to catch up first.
ebayironman@reddit
I was writing a good story last night; the first chapter looks good! But I ran out of context before chapter 2... I don't pay for any LLMs, but I have invested several hundred in my local LLM setup and can do a bunch of stuff on my home computer. But yes, I need to ramp up. I can see spending another few $K to get to where I want to be on hardware. At least the software is free...
Void-kun@reddit
But at that point you're spending $10k on hardware now that will be outdated and need replacing in a few years,
compared to spending $10k on API costs over the next few years until the hardware catches up.
ebayironman@reddit
$10K on hardware now will most likely get you something like $250K worth of tokens if you had to buy them over the course of a year or so. One guy says he paid off his 5090 ($4K) in 2 weeks.
Void-kun@reddit
Token prices could be anything at the moment, and that doesn't take into account the subsidised subscriptions.
marco89nish@reddit
1.2M tokens? You need to be in Bs to say a lot
DataCraftsman@reddit
You mean the Ts
Awkward-Customer@reddit
Q's or GTFO.
DeProgrammer99@reddit
Right? I generated around 25 million with Qwen3.5-27B across several days, and that was just one project (adding some nuance notes to my ~22k Japanese flash cards, redoing a few thousand because some of them returned reasoning instead of a final answer).
ShadowyTreeline@reddit
tangential question - have you used local AI in other ways for learning foreign languages? I was wondering if it would be useful to have an AI voice chat to practice with.
yourgamermomthethird@reddit
Many ways. There are Japanese-native models if you want full Japanese. I'm actually working on fine-tuning a Japanese model for ingesting Japanese papers and explaining them at different grade levels; not sure if it will turn out well, I'm new to fine-tuning. Many models support multiple languages too, but I can't speak for Japanese quality specifically because I haven't done enough testing. There are datasets for certain keigo formats as well. Voice might be hard though; I haven't seen what's out there yet. I do have a transcription model that understands Japanese, but it's alright, not great.
Anyways it depends on your level more than anything.
DeProgrammer99@reddit
I haven't, really. This app generates notes for existing cards based on whatever knowledge the model has; I have another app that generates flash cards from plain text/HTML/PDF data it's given; I upgraded MNN Chat into an interpreted chatroom server; and I made an eval generation/execution tool, just so I have one that isn't Python, and used it to evaluate some tiny models for that MNN Chat app...
I tried chatting with Shisa v2 70B and telling it to correct anything I say wrong, but it got too caught up in the conversation to follow directions and just generally acted like old local LLMs.
PferdOne@reddit
I don't wanna shill anything, but take a look at ISSEN. It's a Y Combinator startup (https://www.ycombinator.com/companies/issen) and I just signed up with them for a year. Maybe you can take inspiration from them if you wanna build something for yourself.
gpalmorejr@reddit
I once managed to generate (everyone stand back) a 500-token script with 27B in 15 minutes. *Gasp*. I know, everyone calm down. I think I could get a little more (crazy, I know), but I have a slight suspicion that the low token rate may be related to my GTX 1060 6GB card. But I'm still looking into it. Could be anything.
deenspaces@reddit
what flash cards app do you use?
DeProgrammer99@reddit
Just Anki. My note generation app directly modifies the Anki database, which is synced by the Anki desktop app, and then I study with it on my Android.
simracerman@reddit
How do you keep a tally of all tokens used? Llama-swap has a nice calculator, but it sadly resets once you relaunch or modify the config file.
DeProgrammer99@reddit
I estimated based on the reasoning budget (which it almost always hits), my original output limit, and the fact that I had to redo a few thousand of them, and then redo some of those, because about 3% of the time it degenerated into a loop or returned reasoning as if it were the final answer.
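Roughly this kind of back-of-the-envelope math; the per-card numbers below are hypothetical placeholders, only the ~22k card count and ~3% redo rate come from this thread:

```python
# Back-of-the-envelope output-token tally (ignores prompt/input tokens).
# Per-card budgets are hypothetical placeholders.
cards = 22_000
reasoning_budget = 800   # tokens; almost always hit
output_limit = 200       # original output cap per card
redo_rate = 0.03         # ~3% looped or returned reasoning as the answer

per_card = reasoning_budget + output_limit
total = cards * per_card * (1 + redo_rate + redo_rate**2)  # redos of redos
print(f"~{total / 1e6:.1f}M tokens")  # ~22.7M with these placeholders
```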
Guinness@reddit
Yeah that’s barely over 1 single context session. Just tagging a @script.py and calling Serena to ask a question can burn 100k tokens.
ToInfinityAndAbove@reddit
Exactly my first thought. Dude, I burned 3M tokens in 20 mins yesterday
dbenc@reddit
I'm using like a billion a week on opus (like 95% input but still)
marco89nish@reddit
I'm definitely doing 1B/month on Opus at work
andy2na@reddit (OP)
I just set up Prometheus and grafana last night and that's just showing the last 6 hours
Over the last 3 days, I've used 23 million https://imgur.com/a/6layCmz 👍
SirBardBarston@reddit
Please share your setup. Is llama.cpp exporting metrics?
National_Meeting_749@reddit
There we go. Let's pump those numbers up! The first time I checked my token usage after I set up Hermes I was at 8 mil and quickly climbing.
ValenciaTangerine@reddit
Bs is just chatbot inference at this point. 1.2M for Frigate vision summaries actually maps to a real workflow: that's every camera frame triggering a description. Compare to running deepseek-coder agentic loops, where I've hit 5M in an afternoon and produced one usable function. Token volume is a vanity metric; the question is what you got out of it.
wakIII@reddit
Yeah when I saw frigate as a big user I was like wut, does bro have 10000 cameras
spencer_kw@reddit
Code review on every commit before it hits the API model. Local qwen catches maybe 60% of the obvious mistakes for free, which means when I do send something to opus it's already been through one round of cleanup. Saves about $80/mo in API costs just from not sending garbage upstream.
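For anyone curious what that gate can look like: a hypothetical pre-commit hook sketch, assuming a local OpenAI-compatible server (llama.cpp or vLLM) on localhost:8080. The endpoint, port, and model name are assumptions, not spencer_kw's actual setup.

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit (hypothetical sketch): send the staged diff to a
# local OpenAI-compatible server for a free first-pass review.
import json
import subprocess
import sys
import urllib.request

diff = subprocess.run(["git", "diff", "--cached"],
                      capture_output=True, text=True).stdout
if not diff.strip():
    sys.exit(0)  # nothing staged, nothing to review

request = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # placeholder endpoint
    data=json.dumps({
        "model": "qwen-coder",  # whatever the local server is serving
        "messages": [
            {"role": "system",
             "content": "You are a strict code reviewer. List obvious bugs, "
                        "typos, and dead code in this diff. Be brief."},
            {"role": "user", "content": diff},
        ],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])

sys.exit(0)  # advisory only: never block the commit
```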
Houston_NeverMind@reddit
How does opus know that the local model cleaned up the garbage? How do you filter them out of the commit diff?
horserino@reddit
What do you use for that? Opencode? Hermes? Something else?
spencer_kw@reddit
I've been using pi coding agent. It definitely requires a decent bit of manual setup, and I also tuned it to take the best principles from other agents, but it just feels far less cluttered and gets to the point. When I tried opencode it felt like it had too much clutter. Hermes is definitely getting better, so I embed hermes in my pi coding agent for specific things.
Pretend_Engineer5951@reddit
I use a local llm for reviewing too. My best agent so far is the Roocode extension.
specify_@reddit
Rookie numbers, smh. I use opencode with free Claude Opus provided by my university, and self-hosted Qwen 3.6 27B/35B-A3B for the subagents that the orchestrator spawns. vLLM with four RTX 5060 Ti 16GB cards.
CalligrapherFar7833@reddit
Try finding a Coral for first-pass detection before feeding lots of useless frames to a vision LLM.
andy2na@reddit (OP)
I already filter a lot of stuff via OpenVINO. If you enable GenAI summaries in 0.17, each one can use up to 32k tokens unless you specifically set it lower. IIRC it sends every frame in the event to give an accurate GenAI summary; this is different from the regular AI summaries, which send either a snapshot or a few frames.
I don't mind at all; that's why I got into local LLMs, to not worry about token usage and privacy.
FantasyMaster85@reddit
It doesn't send every frame; it sends a maximum of twenty pre-selected frames from the entirety of the event.
Within Frigate you can set your LLM's available context, and it will automatically pick the number of frames to send so that it uses about 98% of that context window. If you've got a huge context window (I run my local LLM with Frigate at about 150k), it will still only send a maximum of 20 frames and doesn't come anywhere near using all that context.
For anyone interested, you can read about that here: https://docs.frigate.video/configuration/genai/genai_review/
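The provider block is roughly this shape (a from-memory sketch, so verify key names against the docs linked above; endpoint and model are placeholders):

```yaml
# Rough sketch of Frigate's genai provider config; verify against the docs.
genai:
  enabled: true
  provider: openai                  # any OpenAI-compatible local server
  base_url: http://litellm:4000/v1  # placeholder endpoint
  api_key: "{FRIGATE_GENAI_API_KEY}"
  model: qwen-local                 # placeholder model name
```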
dark-light92@reddit
Why are you still using LiteLLM after the security disaster they had?
a_slay_nub@reddit
As far as I'm aware, the security incident wasn't even their fault; they just got unlucky with the software they used for their security checking. The real issue was with Trivy.
c4talystza@reddit
Which one? A new SQL injection CVE was just published for them, and I had some automated probing this morning, after less than 48 hours in the wild.
andy2na@reddit (OP)
None of my llm things are open to the Internet, and I only have local inference linked in litellm, no cloud AI services are on it, so the risk is low. Additionally, all the recent security incidents were addressed in a timely manner. You can't go boycotting every service or application that has had a security incident.
dark-light92@reddit
LiteLLM's security incident had nothing to do with LLMs having access to the internet. Simply having litellm installed and running on your system was enough for it to steal your secrets.
You can use it if you want, but I wouldn't touch it with a 10-foot pole, because it's an overcomplicated mess of a codebase for what it does. I used it for a couple of months but stopped a month before the security incident; going through its documentation to figure out what I needed to do, I just didn't feel the project was managed by responsible people. A month later, my fears were justified.
Hello_my_name_is_not@reddit
As a 3rd party reading this chain what do you use instead?
dark-light92@reddit
llama-swap is enough for my needs.
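For reference, a hypothetical llama-swap config sketch; model names, paths, and flags are placeholders, so check the llama-swap README for the exact keys:

```yaml
# Hypothetical llama-swap config.yaml; llama-swap starts and stops the
# matching llama-server process on demand and proxies requests to it.
models:
  "qwen-coder":
    cmd: >
      llama-server --port ${PORT}
      -m /models/qwen-coder.gguf -ngl 99 -c 32768
    ttl: 300   # unload after 5 minutes idle
  "gemma":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gemma.gguf -ngl 99
```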
Maximum-Wishbone5616@reddit
1.2M is nothing. It is not even a day of work.
Mistic92@reddit
Just yesterday I used 350 million tokens on DeepSeek.
devinprater@reddit
We have to write reports monthly, following a template. Since the reports contain personal information, the LLMs need to stay local. I've tried lots of models, but Qwen3.6 27b on our 4090 gets it right every time now, with just a little correction. Of course, it only runs at like 20 tokens per second on Ollama, but I'll wait if it means less fixing.
GCoderDCoder@reddit
Insert slow clap...
synth_mania@reddit
What software is this dashboard? You dropped a lot of names I'm not familiar with in the post body.
andy2na@reddit (OP)
Sorry-
Using LiteLLM to route to models on different inference engines (like vLLM and llama.cpp): https://github.com/BerriAI/litellm
LiteLLM exports metrics to Prometheus, and those metrics can then be pulled into Grafana for the dashboard: https://github.com/grafana/grafana
My docker-compose stack for this setup:
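Roughly along these lines; a trimmed sketch rather than my exact file, with placeholder tags and paths (see the version warning below):

```yaml
# Hypothetical docker-compose sketch; pin real release tags, don't use
# main-latest/main-stable. prometheus.yml needs a scrape job pointed at
# litellm:4000/metrics.
services:
  litellm:
    image: ghcr.io/berriai/litellm:<pinned-version>
    command: ["--config", "/app/config.yaml"]
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    ports:
      - "4000:4000"

  prometheus:
    image: prom/prometheus:<pinned-version>
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:<pinned-version>
    depends_on:
      - prometheus
    ports:
      - "3000:3000"
```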
c4talystza@reddit
Don't use the main-latest tag. Don't use main-stable either (it has a SQL injection CVE); use specific versions.
Travnewmatic@reddit
dude this is dope, thanks for the weekend project :)
MmmmMorphine@reddit
Nice! I'm just making this layer on my system. I'm adding a lot of trace logging along the way to help train an intelligent router later.
Was curious how you route traffic between the different apps (hermes, goose, opencode, etc.) - is it all manually selected or how did you manage this part of things?
andy2na@reddit (OP)
I haven't gotten fancy with litellm routing, so I just give all the apps the litellm endpoint and manually select the model to use. I do give each app its own API key, which lets me see stats for each app.
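Minting a per-app key looks something like this hypothetical call against LiteLLM's /key/generate endpoint (master key and alias are placeholders):

```python
# Hypothetical example: create a dedicated LiteLLM key for one app.
import json
import urllib.request

request = urllib.request.Request(
    "http://localhost:4000/key/generate",
    data=json.dumps({"key_alias": "frigate"}).encode(),
    headers={
        "Authorization": "Bearer sk-master-placeholder",  # LITELLM_MASTER_KEY
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(request) as resp:
    print(json.load(resp)["key"])  # hand this key to the app's config
```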
MmmmMorphine@reddit
Gotcha. Well if I succeed in training a smart router for this sort of thing, I'll be sure to post it
ComplexType568@reddit
Prometheus for data gathering and Grafana for displaying. It's a classic :)
FullstackSensei@reddit
graffana
revoked@reddit
What models are you running and what's your rig?
andy2na@reddit (OP)
RTX 3090 holding qwen3.6-26B, and a 5060 Ti holding gemma4-e4b for STT and light tasks. The 5060 Ti also holds a couple of TTS models like omnivoice and kokoro.
revoked@reddit
What are your Hermes and OpenCode model dependencies? How do you feel about the performance? I have the same qwen model with openclaw, and I constantly miss the Claude models.
andy2na@reddit (OP)
For me at least, the biggest issue with local llms for agents and coding is the max context window (usually 256k at most). The initial answers with qwen3.6-26B are usually great, which is why people love to focus on "one/two shots" of building a Tetris game or whatever. I only use local llms for coding and agents sparingly (but they take up a ton of the context usage).
Clear-Ad-9312@reddit
Have you tried Dynamic Context Pruning (DCP)?
https://github.com/badlogic/pi-mono/discussions/330
jacek2023@reddit
I use Gemma 26B with context up to 200000 for agentic coding. Yes, it makes mistakes later on, but after some practice I now know how to handle them.
wombweed@reddit
lol i have the exact same model selection! and, this reminded me to actually hook up my litellm stats to my grafana. thanks!
andy2na@reddit (OP)
Haven't found the perfect use for hermes yet; I primarily use it to summarize pages quickly instead of opening up openwebui or similar.
Still just messing around with it
ZiXXiV@reddit
I feel like all these "proactive" agents are useless. Sure, they can deliver daily digests, but a simple script that scrapes RSS feeds can do that too.
Cupakov@reddit
What do you use Hermes for?
Nyghtbynger@reddit
To me these dashboards are cute but not actionable. The only number I really care about is when I look at a provider's dashboard and see how much money/cache is used, or how much time my local model takes.
Clean_Initial_9618@reddit
Sorry, new to this. How does this setup help?
Gringe8@reddit
So you like to answer questions with irrelevant info that doesn't answer them?
andy2na@reddit (OP)
I think the answer to the question is clearly shown in the picture
Gringe8@reddit
You use local llm to measure your llm usage?
andy2na@reddit (OP)
Look at the whole image; the top-right graph shows llm token usage per application (frigate, home assistant, vane, etc).
Gringe8@reddit
Ok, i understand now lol. Ignore me.
andy2na@reddit (OP)
Lol no worries