"What do you guys even use local LLMs for?" Me: A lot

[-]

Sn0opY_GER@reddit

sad claude noise :/ i have more $ token than that - my 5090 payed for itself after 2 weeks or so 😃

[-]

Exactly, And as time moves on and the online major AI providers have to generate positive revenue you're going to see the online AI providers prices go up, there leniency go down and the cost effectiveness of having a local LLM will increase, it's a no-brainer anyway.

[-]

Void-kun@reddit

You aren't wrong, but right now current hardware imho isn't good enough for local LLM development.

What models can you run multiple parallel sub-agents or a team of agents locally and work on multiple large repos without work taking hours and hours to complete before you can review and test?

I will 100% switch to local LLMs but I need the hardware to catch up first.

[-]

ebayironman@reddit

Was writing a good story last night, first chapter looks good! But ran out of context before chapter 2... I don't pay for any LLMs. But have invested several hundred in developing local LLMs and can do a bunch of stuff on my home computer, but yes, need to ramp up. I can see spending another few $K to get to where I want to be on hardware. At least the software is free...

[-]

Void-kun@reddit

But at that point spending 10k on hardware now that in a few years will be outdated and need replacing.

Compared to spending 10k on API costs for the next few years till the hardware catches up

[-]

ebayironman@reddit

$10K on hardware now will get you like $250K in tokens if you had to buy them in the course of a year or so most likely. One guy says he paid for his 5090 ($4K) in 2 weeks.

[-]

Void-kun@reddit

Token prices could be anything at the moment and it doesn't take into account the subsidised subscriptions.

[-]

marco89nish@reddit

1.2M tokens? You need to be in Bs to say a lot

[-]

DataCraftsman@reddit

You mean the Ts

[-]

Awkward-Customer@reddit

Q's or GTFO.

[-]

DeProgrammer99@reddit

Right? I generated around 25 million with Qwen3.5-27B across several days, and that was just one project (adding some nuance notes to my ~22k Japanese flash cards, redoing a few thousand because some of them returned reasoning instead of a final answer).

[-]

ShadowyTreeline@reddit

~22k Japanese flash cards

tangential question - have you used local AI in other ways for learning foreign languages? I was wondering if it would be useful to have an AI voice chat to practice with.

[-]

yourgamermomthethird@reddit

Many ways, there are Japanese native models if you want full Japanese, I’m actually working on making a fine tuned Japanese model for ingesting Japanese papers and making it explain it in different grade levels not sure if it will turn out good I’m new to fine tuning, also many models support multi language but can’t talk for Japanese quality specifically because I haven’t done enough testing, there are datasets to talk in certain keigo formats as well, voice might be hard though haven’t seen what’s out there yet I have a transcription model that understands Japanese tho it’s alright not great.

Anyways it depends on your level more than anything.

[-]

DeProgrammer99@reddit

I haven't really. This app generates notes for existing cards based on whatever knowledge the model has, and I have another app that generates flash cards based on plain text/HTML/PDF data it's given, and I upgraded MNN Chat into an interpreted chatroom server, and I made an eval generation/execution tool just so I have one that isn't Python and used it to evaluate some tiny models for that MNN Chat app...

I tried chatting with Shisa v2 70B and telling it to correct anything I say wrong, but it got too caught up in the conversation to follow directions and just generally acted like old local LLMs.

[-]

PferdOne@reddit

I don‘t wanna shill anything, but take a look at ISSEN. It‘s a ycombinator startup (https://www.ycombinator.com/companies/issen) and I just signed up with them for a year. Maybe you can take inspiration from them if you wanna build something for yourself.

[-]

gpalmorejr@reddit

I once manages to generation (everyone stand back) a 500 token script with 27B in 15 minutes. * Gasp *. I know everyone calm down. I think I could get I little more (crazy I know), but I have a slight suspicion that the low token rate may be related to my GTX1060 6GB card. But I'm still looking into it. Could be anything.

[-]

deenspaces@reddit

what flash cards app do you use?

[-]

DeProgrammer99@reddit

Just Anki. My note generation app is directly modifying the Anki database, which is synced by the Anki desktop app, and then I study with it on my Android.

[-]

simracerman@reddit

How do you keep tally of all tokens used? Llama-swap has a nice calculator but it sadly resets once you relaunch or modify the config file

[-]

DeProgrammer99@reddit

I estimated based on the reasoning budget (which it almost always hits), my original output limit, and the fact that I had to redo a few thousand of them and then redo some of those because it degenerated into a looped or returned reasoning as if it were the final answer about 3% of the time.

[-]

Guinness@reddit

Yeah that’s barely over 1 single context session. Just tagging a @script.py and calling Serena to ask a question can burn 100k tokens.

[-]

ToInfinityAndAbove@reddit

Exactly my first thought. Dude, I burned 3M tokens in 20 mins yesterday

[-]

dbenc@reddit

I'm using like a billion a week on opus (like 95% input but still)

[-]

marco89nish@reddit

I'm definitely doing 1B/month on Opus at work

[-]

andy2na@reddit (OP)

I just set up Prometheus and grafana last night and that's just showing the last 6 hours

Over the last 3 days, I've used 23 million https://imgur.com/a/6layCmz 👍

[-]

SirBardBarston@reddit

Please share setup. Is llama.cpp exporting metrics?

[-]

National_Meeting_749@reddit

There we go. Let's pump those numbers up! The first time I checked my token usage after I set up Hermes I was at 8 mil and quickly climbing.

[-]

ValenciaTangerine@reddit

Bs is just chatbot inference at this point. 1.2M for frigate vision summaries actually maps to a real workflow, thats every camera frame triggering a description. compare to running deepseek-coder agentic loops where ive hit 5M in an afternoon and produced one usable function. token volume is a vanity metric, the question is what you got out of it.

[-]

wakIII@reddit

Yeah when I saw frigate as a big user I was like wut, does bro have 10000 cameras

[-]

spencer_kw@reddit

code review on every commit before it hits the API model. local qwen catches maybe 60% of the obvious mistakes for free, which means when I do send something to opus it's already been through one round of cleanup. saves about $80/mo in API costs just from not sending garbage upstream.

[-]

Houston_NeverMind@reddit

How does opus know that the local model cleaned up the garbage? How do you filter them out of the commit diff?

[-]

horserino@reddit

What do you use for that? Opencode? Hermes? Something else?

[-]

spencer_kw@reddit

i've been using pi coding agent. It definitely requires a decent bit of manual setup and i also tuned it to take best principles from other agents but it just feels far less cluttered and gets to the point. when i tried using opencode it just felt it had too much clutter and hermes is def getting better so i embed hermes in my pi coding agent for specific things

[-]

Pretend_Engineer5951@reddit

I use local llm for reviewing too. My best agent is Roocode extension so far.

[-]

specify_@reddit

Rookie numbers smh. i use opencode with free Claude Opus provided by my university and selfhosted Qwen 3.6 27B/35B-A3B for subagents that the orchestrator spawns. vLLM with four RTX 5060 Ti 16GB

[-]

CalligrapherFar7833@reddit

Try finding a coral for first detection before feeding lots of useless frames to a vision llm

[-]

andy2na@reddit (OP)

I already filter a lot of stuff via openvino, if you enable GenAI summaries in 0.17, each one can use up to 32k tokens unless you specifically set it lower. IIRC, it sends every frame in the event to give an accurate GenAI summary - this is different than the regular AI summaries which either sends a snapshot or a few frames.

I dont mind at all, thats why I got into local LLMs, to not worry about token usage and privacy

[-]

FantasyMaster85@reddit

It doesn’t send every frame, it sends a maximum of twenty pre selected frames of the entirety of the event.

Within Frigate you can set what your available context is within your LLM and it will automatically determine the number of frames to send to automatically use about 98% of that context window. If you’ve got a huge context window (I run my local LLM with Frigate and it has a context window of about 150k) it still will only send a maximum of 20 frames and doesn’t come anywhere near using all that context.

For anyone interested, you can read about that here: https://docs.frigate.video/configuration/genai/genai_review/

[-]

dark-light92@reddit

Why are you still using LiteLLM after the security disaster they had?

[-]

a_slay_nub@reddit

As far as I'm aware, the security incident wasn't even their fault. They just got unlucky with the software they used for their security checking. the real issue was with Trivvy.

[-]

c4talystza@reddit

Which one? They just released a new CVE for SQL injection and i had some automated probing this morning after less than 48 hours in the wild.

[-]

andy2na@reddit (OP)

None of my llm things are open to the Internet and I only have local inferences linked in tinyllm, no cloud AI services are on it - so risk is low. Additionally, all the recent security incidences were addressed in a timely manner. You can't go boycotting every service or application that has had a security incident

[-]

dark-light92@reddit

LiteLLM's security incident had nothing to do with LLMs having access to internet. It simply stole your secrets for having litellm installed and running on your system.

You can't go boycotting every service or application that has had a security incident

You can use it if you want but I wouldn't touch it with a 10 foot pole. Because it's an over complicated mess of a codebase for what it does. I used it for a couple of months but stopped using it a month before the security incident because I just didn't feel the project was managed by responsible people when going through its document to figure out the things I needed to do. A month later, my fears were justified.

[-]

Hello_my_name_is_not@reddit

As a 3rd party reading this chain what do you use instead?

[-]

dark-light92@reddit

llama-swap is enough for my needs.

[-]

Maximum-Wishbone5616@reddit

1.2M is nothing. It is not even a day of work.

[-]

Mistic92@reddit

Just yesterday I have used 350mln tokens on deepseek.

[-]

devinprater@reddit

We have to write reports monthly, following a template. Since the reports contain personal information, the Llm's need to stay local. I've tried lots of models, but Qwen3.6 27b on our 4090 gets it right, with just a little correction, every time now. Of course, it only runs at like 20 tokens per second on Ollama, but I'll wait if it means less fixing.

[-]

GCoderDCoder@reddit

Insert slow clap...

[-]

synth_mania@reddit

What software is this dashboard? You dropped a lot of names I'm not familiar with in the post body.

[-]

andy2na@reddit (OP)

Sorry-

Using LiteLLM to route to models on different inference engines (like vLLM and llama.cpp): https://github.com/BerriAI/litellm

LiteLLM provides metrics to Prometheus, in which those metrics can be pulled into Grafana for the dashboard: https://github.com/grafana/grafana

My docker-compose stack for this setup:

services:
  litellm-db:
    image: postgres:15
    container_name: litellm-db
    restart: unless-stopped
    environment:
      - POSTGRES_DB=litellm
      - POSTGRES_USER=litellm
      - POSTGRES_PASSWORD=litellm_secure_pass
    volumes:
      - /mnt/AI/litellm/db_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U litellm"]
      interval: 5s
      timeout: 5s
      retries: 5
    networks:
      - ai-net

  litellm-proxy:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "8484:4000" 
    volumes:
      - /mnt/AI/litellm/config.yaml:/app/config.yaml:ro
    environment:
      - OPENAI_API_KEY=litellm
      - USE_PRISMA_MIGRATE="True"
      - LITELLM_MASTER_KEY="sk-yourkey"
      - LITELLM_SALT_KEY="sk-salt-yoursaltkey"
      - UI_USERNAME=admin   
      - UI_PASSWORD=password
      - DATABASE_URL=postgresql://litellm:litellm_secure_pass@litellm-db:5432/litellm
      - STORE_MODEL_IN_DB="True"
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      litellm-db:
        condition: service_healthy
    networks:
      - ai-net

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - /mnt/AI/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /mnt/AI/prometheus/data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    networks:
      - ai-net

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3007:3000"
    volumes:
      - /mnt/AI/grafana/data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin # Change this on first login
    networks:
      - ai-net

networks:
  ai-net:
    external: true

[-]

c4talystza@reddit

Don't use main-latest tag. Don't use main-stable either (it has a sql injection CVE) use specific versions

[-]

Travnewmatic@reddit

dude this is dope, thanks for the weekend project :)

[-]

MmmmMorphine@reddit

Nice! I'm just making this layer on my system. I'm adding a lot of trace logging along the way to help train an intelligent router later.

Was curious how you route traffic between the different apps (hermes, goose, opencode, etc.) - is it all manually selected or how did you manage this part of things?

[-]

andy2na@reddit (OP)

I haven't gotten fancy with litellm routing, so I just give all the apps the litellm endpoint and manually select the model to use. I do give each app it's own API key, which allows me to know stats for each app

[-]

MmmmMorphine@reddit

Gotcha. Well if I succeed in training a smart router for this sort of thing, I'll be sure to post it

[-]

ComplexType568@reddit

Prometheus for data gathering and Grafana for displaying. It's a classic :)

[-]

FullstackSensei@reddit

graffana

[-]

revoked@reddit

What models are you running and what's your rig?

[-]

andy2na@reddit (OP)

Rtx 3090 holding qwen3.6-26B and 5060ti holding gemma4-e4b for STT and light tasks. 5060ti also holds a couple TTS models little omnivoice and kokoro

[-]

revoked@reddit

What’s your Hermes and OpenCode model dependencies? How do you feel about the performance? I have the same qwen model with openclaw and I constantly miss the Claude models

[-]

andy2na@reddit (OP)

The biggest issue with local llms, for me at least, with agents and coding is the max context window (usually max of 256k). The initial answers with qwen3.6-26B are usually great, and why people love to focus on "one/two shots" of building a Tetris game or whatever. I only use local llms to code and agents sparingly (but they take up a ton of the context usage).

[-]