Are any of you using local llms for "real" work?
Posted by hmsenterprise@reddit | LocalLLaMA | View on Reddit | 154 comments
I am having fun personally tinkering with local models and workflows and such, but sometimes it feels like we're all still stuck in the "fun experimentation" phase with local LLMs and not actually producing any "production grade" outputs or using it in real workflows.
Idk if it's just the gap between what "personal" LLM-capable rigs can handle vs the compute needs of current best-in-class models or what.
Am I wrong here?
syncro2008@reddit
I get automated recruiter cold email sequences all the time and I hate having no way to fight back.
Tools like Jukebox, Gem, etc. allow recruiters to set up these repulsive “3 email sequence” scripts and then blast them out to 1000s of inboxes.
“Hi {{candidate}}, the founders personally want to wipe your ass with venture $$$”, “Checking in, I know we’re all busy but we raised more $$$$ to wipe your ass with”
“Trying one more time…”
(I’ve received some of these sequences where the {{candidate}} variable wasn’t replaced.) (And yes, this is a first world problem for people who would love to get such inbound in a tough job market, so let’s spare that virtue side quest)
Anyways, Gmail filters that hardcode specific words or phrases always run the risk of false positives.
And I don’t see it in Google’s corporate interests to make solving this problem easy for end users anytime soon.
So I set up the following workflow:
1. A Google Apps Script runs every 5 minutes and fetches today's emails that have not yet been labeled “processed”.
2. It creates a batch request with the email title, body, and sender address for each element.
3. It sends the batch to my home server (Beelink SER5 Max, 6800U) running Qwen 3 8B behind a FastAPI server. The system prompt describes the specific situation of tech-recruiter cold-email patterns, with some few-shot examples, and instructs the model to return a spam boolean for each element in the batch.
4. The Apps Script receives the reply and applies a Gmail label of “processed” to each email so it doesn't get refetched in step 1. For the ones with spam: true, it applies a “recruiter-spam” label and removes them from the Inbox.
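For reference, the classification logic in step 3 can be sketched like this. This is a minimal version of my own, not syncro2008's actual code; the prompt wording, function names, and fail-closed behavior are assumptions, and the real thing would sit behind a FastAPI endpoint that forwards the prompt to the local Qwen server:

```python
import json

# Hypothetical system prompt; the actual few-shot prompt isn't shown in the post.
SYSTEM_PROMPT = (
    "You classify emails. Return ONLY a JSON array of booleans, one per email, "
    "true if the email is part of a tech-recruiter cold-email sequence."
)

def build_user_prompt(emails):
    """Pack a batch of {sender, subject, body} dicts into one prompt."""
    parts = []
    for i, e in enumerate(emails):
        parts.append(f"[{i}] from={e['sender']} subject={e['subject']}\n{e['body'][:500]}")
    return "\n---\n".join(parts)

def parse_booleans(reply, expected):
    """Parse the model's JSON reply; fail closed (not spam) on any error,
    so a flaky model run never hides real mail."""
    try:
        flags = json.loads(reply)
        if isinstance(flags, list) and len(flags) == expected:
            return [bool(f) for f in flags]
    except json.JSONDecodeError:
        pass
    return [False] * expected
```

Failing closed matters here: a malformed model reply should leave mail in the Inbox rather than silently burying it.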
In this way, when I “choose” to see inbound, I can find it in its right place, in the “recruiter-spam” labeled section of Gmail.
And my Inbox holds on to being human just a little while longer.
“My friend… what you allow, will continue.”
hmsenterprise@reddit (OP)
Ha as I read this, I anticipated you were going to be more nefarious and have your server engage in long running fake conversations with the recruiters to eat up their time or something like that
deepaklaksman@reddit
If you are wondering whether your hardware can run llms locally, checkout https://www.fitllms.com/
SM8085@reddit
The most "real" work I've done is that Qwen3-VL-30B-A3B-Thinking is currently going through videos 10 seconds at a time. Based on the bot's True/False boolean output, a wrapping program keeps track of which segments the target subject is within. At the end, we're done using Qwen3-VL and the wrapping program uses the segment information to have FFmpeg make a clipped version where the subject is always present.
^-- Screenshot from the wrapping program/script. Copying the frames into Yes/No (True/False) directories helps massively for checking the bot's accuracy: you can scroll through the directories, and if you see the subject in the No/False directory you know it flubbed. You can also see how slow my hardware is, 12 minutes to analyze 20 frames and output JSON. Oddly, even though the Thinking model chooses not to think, it gave me higher accuracy than the Instruct, so I use it anyway.
A general example is you could have the bot look for every explosion in an action film. Maybe it considers muzzle flash from a gun to be an explosion, not technically wrong as far as I know. So you could prompt that it should only be explosions not from gunfire and try that.
I have a lot of video footage from live sources, so having the bot trim down a small 500mb file to say 30mb clip of what I'm interested in is literally saving NAS space.
If Qwen3-Omni would get llama.cpp support I'd be more open to using Qwen3-Omni to send the audio as well.
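For anyone curious what the wrapper's bookkeeping looks like, here's a minimal sketch of the segment-merging and clipping steps (my own function names and file naming, not SM8085's actual script; the VLM call itself is omitted):

```python
def flags_to_ranges(flags, seg_len=10):
    """Turn per-segment True/False flags into merged (start, end) ranges in seconds."""
    ranges, start = [], None
    for i, keep in enumerate(flags):
        if keep and start is None:
            start = i * seg_len                   # subject just appeared
        elif not keep and start is not None:
            ranges.append((start, i * seg_len))   # subject just disappeared
            start = None
    if start is not None:
        ranges.append((start, len(flags) * seg_len))
    return ranges

def ffmpeg_cut_cmds(src, ranges):
    """One stream-copy ffmpeg command per kept range (fast, no re-encode)."""
    return [
        ["ffmpeg", "-ss", str(a), "-to", str(b), "-i", src,
         "-c", "copy", f"clip_{a}_{b}.mp4"]
        for a, b in ranges
    ]
```

`flags_to_ranges([True, True, False, True])` gives `[(0, 20), (30, 40)]`; concatenating the resulting clips (e.g. with ffmpeg's concat demuxer) yields the trimmed file.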
b_nodnarb@reddit
Looks like people want this. Would you consider putting it on AgentSystems? It lets you discover, run, and distribute self-hosted AI agents like they're apps: https://github.com/agentsystems/agentsystems (full disclosure, I'm a maintainer).
Leopold_Boom@reddit
Oh for god's sake, I spent like 7 min on your GitHub and site and I can't find a link to actual example agents. What's the point of a discovery platform without a big list of "this is our best stuff"?
b_nodnarb@reddit
Good point. The whole system is federated, so I'd need to consume the index API to surface the agents on the site, not just inside the UI. In hindsight that's super obvious, but I haven't done it yet. Will reply here once that's live!
Leopold_Boom@reddit
👍
relmny@reddit
I wonder if it could be used to remove scenes from videos? I guess I'll need some workflow with agents maybe? And some scripts to take big files, cut them into smaller pieces so Qwen can read them, remove frames, and then put it back together as a single file. And so on?
SM8085@reddit
Not sure if it's totally related, but ffmpeg does have some basic scene change detection where you can take frames when there's a major percentage of pixels changed. It seems oddly complicated to turn that into a timecode for each image for some reason. I have been meaning to try to tackle that problem eventually.
With subtitles, you could even intertwine the text between the scene images.
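For what it's worth, the timecode problem is tractable by pairing ffmpeg's `select` scene filter with `showinfo`, which logs a `pts_time` for every selected frame. A sketch (the threshold value and parsing are my own, not SM8085's):

```python
import re

# Extract one frame per detected scene change; showinfo logs frame metadata to stderr.
SCENE_CMD = [
    "ffmpeg", "-i", "input.mp4",
    "-vf", "select='gt(scene,0.4)',showinfo",  # 0.4 = scene-change threshold, tune per source
    "-vsync", "vfr", "frames/%04d.png",
]

def parse_pts_times(ffmpeg_stderr):
    """Each showinfo line carries pts_time, the selected frame's timecode in seconds."""
    return [float(t) for t in re.findall(r"pts_time:(\d+(?:\.\d+)?)", ffmpeg_stderr)]
```

Run `SCENE_CMD` with `subprocess.run(..., capture_output=True, text=True)` and feed `result.stderr` to `parse_pts_times`; the Nth timecode in the list belongs to the Nth extracted frame.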
Mayion@reddit
Porn auto-tagging?
bloomsburyDS@reddit
clockwise rim job, anti-clockwise rim job, gals with dxxk etc.
PersonalCitron2328@reddit
There's no gals with dxxs just dudes with teets
Mayion@reddit
Dude seeing your reply notification made me go, "Wtf did I say for someone to reply with THAT"
Borkato@reddit
This is actually really cool, wtf. I’d love to see this in a git repo
SM8085@reddit
It's there, it's all free software.
hmsenterprise@reddit (OP)
But, out of curiosity and I don't mean this critically, what is the point of that workflow even? Like, is it actually a valuable task? I'm just trying to find someone/anyone who is using local LLMs on "consumer"-y hardware for something more than hobby tinkering
SM8085@reddit
It's valuable to me. When the accuracy is high enough I can delete the original video and save space as I mentioned. My NAS has a bunch of videos I've saved over the years that the bot can go through. That seems "real" to me, but I wouldn't call it "production grade" either.
GP_103@reddit
That could be incredibly useful to millions of people. I have Gigs of video , that could use this
spaceuniversal@reddit
Really spectacular. Let’s imagine all this integrated into a platform like YouTube with the possibility of doing such professional search queries as historical research “find me only black and white videos with vintage sewing machines”
hmsenterprise@reddit (OP)
Nice -- makes sense to me!
ryfromoz@reddit
Yes, me.
mfarmemo@reddit
I used gpt-oss-120b to create a data dictionary for healthcare databases today.
hmsenterprise@reddit (OP)
Nice -- did you feed it the schema and just let it rip? or how did you rig it up?
mfarmemo@reddit
Full metadata export for each data source, standardized to common data types since there are different syntaxes, then chunked by schema with the detailed metadata. Used RAG for internal docs that give additional context, as well as linked tables with shared keys when applicable. Thinking set to medium. Output to JSON.
I'll still need to verify a sample to reach at least 80% accuracy but overall the model does well interpreting the technical language.
Adventurous-Date9971@reddit
Big win: add column profiling and clinical code-set mapping so the model stops guessing. Compute per-column stats (null rate, distincts, min/max) and feed masked samples; ask it to map fields to FHIR resources and LOINC/SNOMED value sets, and to link likely FK targets. Gate acceptance on retrieval score; below threshold, return unknown and queue review. Emit JSON with a JSON Schema (name, description, pii_tag, constraints, fk_targets, citations) and auto-generate Great Expectations checks from it. I've used dbt for type standardization and profiles, OpenMetadata for lineage/glossary, and DreamFactory to expose a consistent REST layer over mixed databases for the RAG and a light reviewer UI. How are you handling cross-source synonyms and PHI masking in the examples? The profiling plus controlled vocab with citations is what makes it production-safe.
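The profiling step is the easy part to make concrete. A minimal pure-Python sketch of per-column stats plus sample masking (my own function names; a real pipeline would use dbt or a proper profiler as mentioned above):

```python
def profile_column(values):
    """Null rate, distinct count, min/max -- feed these stats (not raw rows) to the model."""
    nonnull = [v for v in values if v is not None]
    return {
        "null_rate": round(1 - len(nonnull) / len(values), 3) if values else 0.0,
        "distinct": len(set(nonnull)),
        "min": min(nonnull) if nonnull else None,
        "max": max(nonnull) if nonnull else None,
    }

def mask_value(v, keep=2):
    """Crude PHI masking for sample values: keep a short prefix, star the rest."""
    s = str(v)
    return s[:keep] + "*" * max(len(s) - keep, 0)
```

Feeding the model summary stats and masked samples instead of raw rows is what keeps PHI out of the prompt while still letting it infer types and semantics.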
SituationMan@reddit
I am a tutor. I use local LLM to make worksheets for students.
-dysangel-@reddit
I tried using Claude for real work for a while, but overall I felt quite disconnected from the code, and things ended up going slower because of overly messy implementations that I had to tidy up. At the moment I primarily code by hand, though with auto-complete, occasionally asking for suggestions or farming out drudgery to the AI. For example, I just had it write a test plan for a feature in our app, and I only had to tweak a few things. It's great for exactly the types of things that are so easy or repetitive that they're boring.
_murb@reddit
I use it quite a bit for boilerplate code, tedious SQL queries, and data analysis (natural language to base script). I'm in the midst of taking processed data and feeding it into various LLMs and system prompts to automate weekly/monthly/quarterly reports.
For personal use I just bought a Strix Halo box and am going to run gpt-oss and GLM Air for financial analysis, small agents, batch processing, CVE/patch management, home automation, and so on.
giant3@reddit
I have tried to use LLMs for this exact task. It didn't work, and I ended up wasting a month trying to get it working.
What works is 100 line Perl programs.
The biggest problem is that LLMs slow down the more tokens they ingest, and I haven't found a way to optimize that. Also, if you have 100K records that need to be processed and you miss even 1 record, the customer won't accept it.
_murb@reddit
Luckily I am using Bedrock, so access to high-quality models (Sonnet 4.5) is possible and it does work pretty well, just not quite production grade. I'm using deterministic code and taking the outputs into the LLMs to help reduce the hallucinations. My data is many multiples of 100k records/day, so loading the whole dataset into the LLM isn't feasible, nor do I want to for performance/quality reasons.
giant3@reddit
I think a better option is to use LLM to write Perl code. I gave the LLM all variations of the data and asked it to write the code to handle it. Then I did some minor fixing and integration with other tools.
Lissanro@reddit
I think local models are very capable, but it depends on what hardware you have and what model you choose. I have been using mostly Kimi K2 these past few months, starting with its first version and later 0905, along with R1 when I needed thinking, and then Terminus after it was released. Recently I also downloaded Kimi K2 Thinking, which is promising, but I need to test and use it more before I can decide if it will replace Terminus for me.
For professional purposes, I use LLMs mostly for coding tasks. Often Roo Code. I use ik_llama.cpp rather than the mainline llama.cpp, since I find ik_llama.cpp has better performance, especially at longer context.
Also, since you mentioned workflows, this reminds me... I had experience with ChatGPT in the past, starting from its beta research release and some time after, and one thing I noticed is that as time went by, my workflows kept breaking: the same prompt could start giving explanations, partial results, or even refusals, even though it had worked in the past with a high success rate. Retesting every workflow I ever made and trying to find workarounds for each, every time they do some unannounced update without my permission, is just not feasible for professional use. Usually when I need to reuse a workflow, I don't have time to experiment. Not to mention, for serious work I don't even have the right to send anything to a third party, and I wouldn't want to send personal stuff either. Hence why I had to go local.
ResearcherSoft7664@reddit
Yes. I used qwen 3 vl locally offline to convert images into structured documents. It’s accurate and safe
relmny@reddit
I do. Every single day at work.
It's like my own assistant/colleague/expert. Multiple times a day.
Adventurous_Cat_1559@reddit
Yes, I have one integrated with a few mcp servers I’ve written to manage my obsidian notes. Eg., “hey, add this GitHub link to my todo list as urgent for project X” and having it scrape / make the files in the right folder is great. Also ask it “for each open task, check if there’s a GitHub link, if there is summarise any recent comments” or “hey, so and so on slack said this, add it to the notes on project Y”. So when I get to my issues I’ve everything summarised in my personal notes.
cointegration@reddit
Using qwen3 vl to parse vids to extract car plate numbers and other vehicle markings
Photoperiod@reddit
Does this work well on poor quality video? For example a lot of cctv at street intersections is kinda bad quality and plates are far from the camera.
cointegration@reddit
Let's just say if your eyes can make it out easily, qwen can read it
alfamadorian@reddit
I want to do that in real time, so when I drive down the streets or into a parking lot, I know if I passed someone I know.
cointegration@reddit
I'm using qwen3vl coz i have the luxury of time to slowly parse the vids, if you want realtime you gonna need to use opencv/yolo or train your own cudnn
hmsenterprise@reddit (OP)
Why do you need to do that though? Is it just for fun or actually something important/valuable to you?
cointegration@reddit
coz that's what the client wants
oodelay@reddit
You're fishing for ideas?
hmsenterprise@reddit (OP)
No not at all, surprisingly. I am actually working on a purely cloud based AI writing tool as my main job right now. I just personally am into local AI stuff and just had this growing feeling that I've seen a ton of chatter about fun little hobbyist use cases but nobody is really doing anything super valuable with them yet. I originally tried to make my product support local model workflows but it was a major pain in the ass and people weren't into it (at least at the time -- couple months ago).
oodelay@reddit
For now it's not changing the work environment for everyone but those who use it now will understand it better when it blows up.
Pvt_Twinkietoes@reddit
Personal use? Or you work for the police?
MitsotakiShogun@reddit
I use an LLM, a scraper to fetch security vulnerabilities (CVEs), and an internal API that lists my running services, and the LLM generates a daily report for me about whether any of the software I'm running might have been mentioned.
I've used local LLMs for coding (mostly Qwen-2.5/3 and Mistral/Devstral).
I've used local LLMs (ollama in WSL with a <1B model) for prototyping various stuff in my job, e.g. writing an LLM proxy, or setting up an environment for interview tasks.
And finally, various personal side-projects that have potential to become actual products.
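The CVE-report use case is easy to approximate without a custom scraper; a rough sketch against the public NVD 2.0 API (the report prompt and function names are my assumptions, not MitsotakiShogun's actual code):

```python
import json
import urllib.parse
import urllib.request

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_cve_ids(keyword, per_page=20):
    """Query the NVD 2.0 API for CVEs mentioning a keyword."""
    qs = urllib.parse.urlencode({"keywordSearch": keyword, "resultsPerPage": per_page})
    with urllib.request.urlopen(f"{NVD_URL}?{qs}") as resp:
        data = json.load(resp)
    return [v["cve"]["id"] for v in data.get("vulnerabilities", [])]

def build_report_prompt(services, cve_summaries):
    """What the local LLM sees: the running services plus today's CVE text."""
    return (
        "Services I run:\n" + "\n".join(services)
        + "\n\nToday's CVEs:\n" + "\n".join(cve_summaries)
        + "\n\nList any CVE that plausibly affects one of my services, or say 'none'."
    )
```

The LLM only does the fuzzy matching between service names and CVE descriptions; the fetching stays deterministic.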
Pvt_Twinkietoes@reddit
Not using any OCR tool to extract the text to guide the VLM?
MitsotakiShogun@reddit
Not necessary, the VLM works well enough, an accountant double-checks and does data entry in the actual tax forms, and the tax office checks everything too, and may send back for corrections, so there is no fine or other issue even if the VLM gets it wrong.
Even without the accountant doing data entry / double-checking, it would still have been perfectly safe. Unlike what may be the case in other countries, the Swiss tax office is generally pretty chill about mistakes and they'll happily send you corrections (up to years later, if necessary). >!Although I'm not so sure if they'll be chill with omissions, so definitely be careful about that! D:!<
No-Consequence-1779@reddit
Perfect reason why ‘ai’ will not be eliminating many jobs.
Borkato@reddit
I would love to know what model and scraper for that first paragraph!
MitsotakiShogun@reddit
I wrote the scraper myself, mainly because I can, but there are probably more than a few feeds you can use that are simpler and more robust, e.g. DuckDuckGo points to:
* https://nvd.nist.gov/developers/vulnerabilities
* https://cvefeed.io
The model I'm using currently (for all things) is `mistralai/Mistral-Small-3.2-24B-Instruct-2506` in vLLM.
Borkato@reddit
This is super helpful, thank you!!!
b_nodnarb@reddit
I was actually thinking about deploying something like this to AgentSystems (it allows the local AI community to discover and run self-hosted AI agents like they're apps): https://github.com/agentsystems/agentsystems - might be interesting to package the tax reporting agent on there for others to use. Full disclosure, I'm the maintainer and am looking for people with solid local-first agents. People seem to like this one.
ryfromoz@reddit
I do the same thing for security vulnerabilities!
Empty-Tourist3083@reddit
were you fine-tuning the models or using them out of the box?
hmsenterprise@reddit (OP)
Nice! This is the closest I've seen to "actually valuable/important work being done by my local llm setup" in this thread so far
No-Consequence-1779@reddit
Absolutely. But you need to know what real work is. This is where experience is a contributing factor.
Working in organizations with different business apps, both internal and external, and different workflows for information workers, you identify the common bottlenecks, where time can be saved, and other inefficiencies.
Specifically, time is money, and anything that provides an ROI via time savings is a candidate.
So many people are hobbyist coders and have no experience to reference. So they think small, because that is their domain of knowledge.
In summary, a dummy will struggle for useful ideas.
LLMs for coding are immense, and integrated into all enterprise-level IDEs.
LLMs for template-based text or image processing, including meta tagging, classification, and OCR.
LLMs for automating non-deterministic processes, monitored by deterministic algorithms.
There is so much.
CorpusculantCortex@reddit
I have deployed a local 14B model as a sentiment/content summarizer in a self-hosted microservice used by a bot that scrapes select news sources and logs the summaries to a RAG store I use in another bot that ultimately makes me money. It's not for my job or primary income, but it is perpetually doing work for ROI.
hmsenterprise@reddit (OP)
That's pretty cool. What does the other bot do? Is that running on the same machine? Also, is that machine separate from your daily-use PC/laptop?
CorpusculantCortex@reddit
I would rather not say specifically, but there are a few use cases for the auto-populated RAG setup. It all runs on the same machine; I run Ubuntu, and all components are systemd services that I have set up, or DAG orchestration depending on frequency of operation. The other bot does feed the RAG to a cloud LLM, because the final step benefits from power I don't have locally, but all the RAG and summarization helps reduce costs by limiting the tokens that get sent with my structured prompt, so cloud costs are barely anything per month.
Whether the system is separate from my daily driver is... complicated to answer; I have 4 computers on my desk. The tl;dr is it runs on my most-utilized personal system. The longer explanation: my most-used machine is my work laptop, though that is mostly a client for cloud services/servers. I have two PCs. The Ubuntu machine (newer, higher-powered) technically dual-boots Windows, but I rarely boot into Windows; that is the personal machine I use most throughout the day for random tasks, as I have a small portable monitor I can run on the side. My other machine typically runs Windows; it's my older system, but it also dual-boots Ubuntu. That one I use for gaming and photo/design stuff, because Adobe doesn't like Linux. I use it somewhat less frequently, mostly to Steam Link in and play games from my phone, or for periodic photo downloads and editing. And I have my laptop/tablet (a Surface Pro), which sits idle unless I have a specific project; I use it as a client to SSH into my Ubuntu machine if I'm working on a code-based project, or for the stylus if needed for design stuff.
NoWorking8412@reddit
I find local LLMs to be useful for data analysis related tasks when dealing with sensitive data.
kc858@reddit
I use NuExtract-2.0-8B to extract structured data from unstructured data, at least weekly.
sunkencity999@reddit
All of the time. They're great for solving local IT networking and support issues.
Academic-Air7112@reddit
Yep, we use "local" LLMs to write some of our own systems code for research.
false79@reddit
Yep, you're wrong here. I'm using my setup to write boilerplate code, add incremental features, write documentation, and group files for commits.
This technology is very capable. But you have to have a self awareness of what you do most often day to day and breakdown that problem that can be automated.
With that self-awareness comes the realization that you don't need triple-digit-B LLMs to do your work if you can give the model sufficient context for the task.
Then it just frees up time to do more important things.
hmsenterprise@reddit (OP)
Ok how do you switch to local models for just those tasks? Do you have to cognitively evaluate the nature of every task you take on and whether your setup can handle it well?
I have experimented with local models for simple coding tasks (e.g., boilerplate or adding logs to a file or whatever) -- but even just the cognitive load of switching wasn't worth the 10 cents or whatever I'd save.
I-cant_even@reddit
It is also feasible to build a workflow that has a small fast model evaluate the requirements of the request and route to the appropriate model.
false79@reddit
The evaluation is near none. I have VS Code + Cline setup. In my prompt I say refer to these files, do it this way. And like 95% of the time, it does it.
Cline is advertised to hit any of the paid models but its also capable of hitting any Open AI compatible webservice.
These things write directly to the code base and (if you set things up properly) perform linting, run tests, fix imports, take their best guess at why the code won't compile, and make an attempt to resolve it before human intervention.
SkyFeistyLlama8@reddit
Automatically generating commit messages has been a lot of fun. I can write code most days with enough coffee but communicating what I've done is something else entirely. I'm glad to throw it to Qwen or Devstral to help me out.
Savantskie1@reddit
I am learning this very slowly and have to stop myself from just assuming that the model can just understand my intent. Because if I don’t express what I want every model usually misunderstands and takes a path I didn’t want.
Borkato@reddit
Have you tried aider? I just started with it and it’s amazing
Pvt_Twinkietoes@reddit
Oh that's actually quite interesting. How do you go about doing that?
false79@reddit
Do you use Cline workflows? If not read up on it.
/Prep-commit-files-and-commit-message.md
Pvt_Twinkietoes@reddit
Oh this is pretty cool. Thanks will check out.
I-cant_even@reddit
I probably am on the edge of what we consider a 'local' system (4x 3090s) but I'm using quantized versions of Llama 3 R1 Distill 70B and GLM 4.6 Air amongst others to great success. The system I've built is designed to handle some of the nuances of working with 'dumber' models but will be able to take advantage of any model, the quality of the final product varies by model quality but commercial viability is there.
redoubt515@reddit
> Are any of you using local llms for "real" work?
> but sometimes it feels like we're all still stuck in the "fun experimentation"
That's why I'm here.
I work with my hands, my interest in LocalLLMs has little to nothing to do with my "real work."
AI is a fun technical side-interest for me, and my path toward local LLMs stems from an interest in self-hosting, privacy and security, and control/transparency.
onetimeiateaburrito@reddit
This is why I mess with local models too. I'm a truck driver; I can't think of anything an LLM could help me with that I'd want to deal with setting up.
I could get it to look at construction reports state by state and enter in the pickup/delivery times and miles for my load and locations. But really, it would all be superfluous for me.
It is cool tinkering with them, though. I ran a LoRA tuning on Gemma 3 4B and it was difficult, but seeing what changed (and broke) in the model's outputs was interesting.
Firesworn@reddit
DeepSeek-OCR is a remarkably powerful system I can run on the same 3080 TI I use to play games.
hmsenterprise@reddit (OP)
But what do you use it for?
Firesworn@reddit
Processing documents that are too sensitive to use with cloud based LLMs. Being completely local we control the data flow and can promise clients that their data isn't being fed into the giant data collection they are training on.
bluesformetal@reddit
We run many products at scale, ~20M users (e-commerce), and currently use 7-12B LLMs. Of course they are useful for real work. You don't need gpt5-high for many use cases.
Goldstein1997@reddit
Elaborate?
makegeneve@reddit
I do a bunch of things professionally, using a cheap RTX4060 Ti 16GB. Krita + AI for generating ideation brainstorms in object design sessions. LMStudio with Qwen3-coder for kick-starting glue code when patching together workflows from different systems. LMStudio with gpt-oss 20B for drafting emails/reports/doing translations between English and French. All of these save me time, and time is money. On my todo list is PDF invoices to input into my ERP.
MercyChalk@reddit
My work involves developing local LLMs, but I accelerate that work almost exclusively with proprietary LLMs. Turns out the convenient systems, free credits, and fear of missing out trump any benefits of local LLMs.
I would guess the people seriously replying "yes" to this question all have relevant privacy concerns.
timedacorn369@reddit
I have used qwen 3 4b and other smaller models to read my work emails/chats and automatically give me action items/ due dates. Its a simple python code i used where I just fed in the chats/emails and asked to give action items and then used notion API to create tasks.
Still a work in progress but I did see some good results in limited testing.
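The glue for that kind of pipeline is small: one prompt builder and one Notion API payload builder. A sketch (the database property names "Name" and "Due" are assumptions about the target database, not timedacorn369's actual setup):

```python
import json
import urllib.request

def action_item_prompt(messages):
    """Ask the model for action items as machine-readable JSON."""
    return (
        "Extract action items with due dates from these messages as JSON, "
        'shaped like [{"task": "...", "due": "YYYY-MM-DD" or null}]:\n'
        + "\n---\n".join(messages)
    )

def notion_task_payload(database_id, task, due=None):
    """Request body for POST https://api.notion.com/v1/pages."""
    props = {"Name": {"title": [{"text": {"content": task}}]}}
    if due:
        props["Due"] = {"date": {"start": due}}
    return {"parent": {"database_id": database_id}, "properties": props}

def create_task(token, database_id, task, due=None):
    """Create the Notion page (needs an integration token with database access)."""
    req = urllib.request.Request(
        "https://api.notion.com/v1/pages",
        data=json.dumps(notion_task_payload(database_id, task, due)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)
```

Asking for a fixed JSON shape up front makes the small model's output parseable enough to feed straight into the Notion call.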
PhotographMain3424@reddit
I use nvidia/Llama3-ChatQA-1.5-8B to index 2M similar insurance docs using ollama. I load the index into Meilisearch and sell access to it. I did this after a trip to Micro Center and spending a little over $3k.
Busy_Leopard4539@reddit
Yes, for research in history: generating metadata and OCR tasks.
hmsenterprise@reddit (OP)
Very cool
PhotographMain3424@reddit
It really does a great job of normalizing named entities, better than NLP, trained NLP or Regex. I verify the output tokens were in my input tokens which seems to cut down on rare hallucinations when the answer is not in the document.
hmsenterprise@reddit (OP)
Oh damn that is a great idea ... Idk why I have never thought to do that lol. I have done some entity extraction stuff and had a hell of a time keeping it consistently "factual" in its output
RemarkableAd66@reddit
Well, I have a 128GB MacBook Pro. I can run gpt-oss or GLM-4.5-Air in Roo Code and get OK results. But these models still do worse than the paid models, and they run slower due to poor prompt-processing speed on Apple silicon.
So if I am doing some coding I can either do:
1. Use Roo Code set to my local llama.cpp server.
2. Wait slightly longer than a paid model.
3. If results are not good enough then switch to a paid model and redo the work.
Or, I can
1. Use roo code set to Claude/Gemini/Deepseek/GLM/Kimi depending on the price I'm willing to spend.
2. Get results faster and with lower chance of having to redo the work.
And using the paid models can be very cheap if you stick to something like Deepseek 3.1 or GLM-4.6 and only bump it up to Gemini/Claude for more difficult tasks.
So I tend to use paid models for real work even when local models could do the job.
Baldur-Norddahl@reddit
I have been 95% local for coding for some time. I switched to GPT 120b because it is actually fast even with prompt processing. This is on a M4 Max MacBook Pro 128 GB.
Now I also have a server with RTX 6000 Pro. I run GLM 4.5 Air on that. This setup is faster than cloud.
Yes, the model is not the best. But it is my impression that few people actually pay for the very best for 100% of their jobs. That is just too expensive and often also too slow. So I don't think there is actually a big difference between what I am doing and what people running everything in the cloud are doing.
hmsenterprise@reddit (OP)
This is very interesting. How do you use that server though? Are you using it for dev mostly? And if so, do you just somehow point whatever Agentic IDE you're using to that server endpoint?
Baldur-Norddahl@reddit
Yes. There is no built-in security, so I use an SSH tunnel to port-forward and a firewall rule that blocks general access. I use vLLM, and my testing shows that a great number of developers could share this setup, although currently it is just me.
SkyFeistyLlama8@reddit
I've got half your RAM on another architecture but the same complaints apply. Prompt processing isn't great compared to a discrete RTX GPU and it's much slower than a cloud model.
I end up using a mix of cloud, medium and small LLMs for code. Throw architecture questions and Q&A to a cloud model, get something like Devstral or Qwen Coder 30B to write functions, and I keep a tiny 4B model on the NPU for autocorrect and syntax fixes.
hmsenterprise@reddit (OP)
Yeah this is exactly what I mean. I had a similar setup for some writing tasks and just always quickly gravitate back to paid cloud models.
hmsenterprise@reddit (OP)
and fwiw, I also have a 128GB MBP M4 Max
garloid64@reddit
Yeah qwen3-coder 30b can do small tasks for me in my actual job
Goozoon@reddit
Check also:
https://www.reddit.com/r/LocalLLaMA/s/RFTppCZxeU
Forgot_Password_Dude@reddit
Ya.
fab_space@reddit
I use it to learn and code monsters.
Not vibecoding
Past-Grapefruit488@reddit
Qwen3 Coder and Qwen3 VL for processing in air-gapped environments. This system extracted data from the last 10 years of documents.
Electrical_Job_4949@reddit
No. The quality gap with frontier api models is too wide.
chisleu@reddit
I've used models as small as qwen 3 coder 30b to do real work. https://convergence.ninja/post/blogs/000017-Qwen3Coder30bRules.md
I use GLM 4.6 locally every day for real work. Hell yeah, local LLMs are here bro. Hardware to run them is still expensive. But that will change a ton over the next decade as vendors have quickly realized LLM performance is critical to sales in the future.
uriahlight@reddit
I self-host an LLM solution I developed for a large construction company that looks for potential liabilities in multi-million dollar contacts alongside the applicable material spec sheets. It allows their estimators to identify pain points before the contract is manually reviewed by an attorney. The web app I built for them is hosted on AWS but the fine-tuning and inference is done on my local hardware.
Mescallan@reddit
I am a dev on Loggr.info. We built it around Gemma 3 4B to categorize daily journal entries locally and generate lifestyle recommendations.
Anything you can run locally is not going to be insanely powerful, so you need to build the flow around minimizing the complexity of each request.
The real advantage is privacy and cost. If either of those is something you are struggling with on API models, building with local models will be worth it; if you are not having issues in those categories, APIs are probably better overall.
pmttyji@reddit
Not yet, but from next year onwards (coding, writing, etc.).
FZNNeko@reddit
Yeah guys, I’m doing ‘work’ over here too. Don’t mind why my one-hand typing skills are as good as they are. Definitely a skill I developed through ‘work’ ;).
hmsenterprise@reddit (OP)
Lol yes I assume many people are trying to get around the cloud model guardrails with local llms
Pvt_Twinkietoes@reddit
I use them as zeroshot NER/Classification models. They're pretty decent at it out of the box. Training isn't too complicated, but for small repetitive tasks, they're often good enough.
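A zero-shot classification call along those lines usually comes down to a strict-JSON prompt plus a defensive parser. A minimal sketch, where the label set is made up and the actual model call is left out:

```python
import json

LABELS = ["invoice", "complaint", "spam", "other"]  # hypothetical label set

def build_prompt(text: str) -> str:
    """Zero-shot classification prompt that demands strict JSON output."""
    return (
        "Classify the text into exactly one of these labels: "
        + ", ".join(LABELS)
        + '.\nReply with JSON only, e.g. {"label": "spam"}.\n\nText:\n'
        + text
    )

def parse_reply(reply: str) -> str:
    """Pull the label out of the model's reply, tolerating extra prose
    around the JSON and falling back to 'other' on anything malformed."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        return "other"
    try:
        label = json.loads(reply[start : end + 1]).get("label", "other")
    except json.JSONDecodeError:
        return "other"
    return label if label in LABELS else "other"
```

The defensive parsing matters more than the prompt: small local models frequently wrap the JSON in commentary, and clamping unknown labels to a fallback keeps the pipeline from silently inventing categories.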
hmsenterprise@reddit (OP)
What are you doing that for, though? Is it an "important" task? That's what I'm trying to understand: how many people are actually using them for important workflows.
EpicSpaniard@reddit
The "fun experimentation" of local LLMs is work, to me. I work in security. Local LLMs provide value that SaaS providers can't: an actual guarantee of privacy and security of our data.
Also I'm using it to automate tasks at home that I have to do but don't want to. Nutrition tracking, organising my documentation of my home network, etc, all count as work to me.
hmsenterprise@reddit (OP)
Agreed on the privacy/security front ofc. Why do you need an LLM to help with home network documentation though? How is that better than sticking it in an apple note or something? (genuinely curious--not criticizing)
EpicSpaniard@reddit
I have ADHD and my personal documentation is beyond unintelligible - or so I thought. AI tidies it up nicely, rewrites it for me, and I can actually understand it when I need to reference it 3 weeks later.
I cannot structure documentation for the life of me - so my alphabet soup mess of notes becomes somehow professionally displayed and visually appealing.
hmsenterprise@reddit (OP)
OK it's ironic that I said "why don't you stick it in an apple note" ... because 1. I also have diagnosed executive functioning issues (ADHD etc), 2. My home network is a freakin byzantine disaster and my documentation for it is scattered everywhere and I dread making changes or interacting with the network for this reason lol.
Do you store it in a markdown file on your PC and edit it with AI in your IDE/VSCode? or how do you do it?
EpicSpaniard@reddit
Obsidian, all markdown, and use it integrated with ollama running locally. Just open up a chat bot with it having access to the current note as context, and request that it rewrites it. Juggle the prompt a little until it's right.
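That workflow can be sketched against Ollama's REST API (`/api/generate` is the real endpoint; the model name and prompt wording here are assumptions):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_rewrite_prompt(note: str) -> str:
    """Prompt asking the model to restructure a messy markdown note."""
    return (
        "Rewrite the following markdown note so it is clearly structured, "
        "with headings and bullet lists. Keep every fact; invent nothing.\n\n"
        + note
    )

def rewrite_note(note: str, model: str = "qwen3:8b") -> str:
    """Send the note to a local Ollama server and return the rewrite."""
    payload = json.dumps(
        {"model": model, "prompt": build_rewrite_prompt(note), "stream": False}
    ).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Pointed at a note exported from Obsidian, `rewrite_note(Path(note).read_text())` would return the cleaned-up markdown, assuming Ollama is serving the model locally.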
hmsenterprise@reddit (OP)
🙌
imtourist@reddit
Yeah, I use a local LLM and a vector database to help me with my corporate taxes. I first fed the past 3 or 4 years of expenses, along with their categorizations (meals, fuel, office supplies, etc.), into a Postgres pgvector database, embedding the data with the Nomic model. I then classified several hundred transactions for the past tax year against this vector set.
What normally takes me a few hours, it did in a few seconds, and then I did some cleanup afterwards. Of course it took me several hours of software development to write the whole thing, but I learned a lot and now have a tool to use in the future.
hmsenterprise@reddit (OP)
OK now we're talking! This is one of the more interesting "local LLMs did something very useful for me on consumer-ish hardware"
Available_Hornet3538@reddit
No. Can't afford the hardware. Would need to shell out 5k or so. Just not worth it. Love to play with them though.
ayylmaonade@reddit
Absolutely. I use Qwen3-VL-30B-A3B (thinking variant) as my "personal assistant" on a day-to-day basis and have done with local LLMs since the start of this year. My main usecases are general Q&A, using it like a search engine, researching topics, explaining things to me, coding, image/video analysis + identification, solving problems for me such as programming related issues, general stuff, or maths, and occasional translation.
I do use it for fun too, but it's mainly a genuine assistant. Also, I disagree with what you're saying about there being a gap between local and cloud. The only real restraint locally is compute, and nowadays we have incredibly intelligent models (especially MoEs) that can easily fit into 24GB of VRAM, or hell, much lower if using both CPU + GPU inference offloading, particularly with MoE models. Stuff like Qwen3, Mistral Small, GPT-OSS-20B, Ernie-4.5-21B-A3B, Magistral, etc.
So yeah, being able to use models that are legitimately better than Gemini-2.5-Flash (Reasoning) locally is amazing to me and extremely valuable for my daily life and also my job. I feel like I've got a 24/7 secretary.
hmsenterprise@reddit (OP)
Wow. Do you run all of that on the same machine you dev on? or do you have a separate "server" rig?
ayylmaonade@reddit
Yep! Run it all on the same machine. I keep Qwen3-VL-30B-A3B-Thinking fully offloaded to my GPU pretty much 24/7, set to a context length of 104K, with flash attention and the K/V cache quantized to Q8_0. It ends up using about ~22GB of my 7900 XTX's VRAM, but as an MoE with only 3.3B active params during inference, it runs at about 150 tokens/second.
You really don't need much more than an average gaming PC to run some of the best local models. My buddy who has an RX 6600 (a rather weak 8GB GPU) runs Qwen3-VL-4B-Instruct @ Q4_K_M on his machine, fully GPU offloaded. He gets about 40 tokens/second.
Local AI has come a really long way recently.
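For reference, that kind of setup might map to llama.cpp's `llama-server` flags roughly like this (the model filename, quant, and port are assumptions):

```shell
# A sketch of llama.cpp server flags matching the setup described above;
# the model path and port are assumptions.
llama-server \
  -m ./Qwen3-VL-30B-A3B-Thinking-Q4_K_M.gguf \
  -ngl 99 \
  -c 104000 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080
# -ngl 99 offloads all layers to the GPU; -c sets the ~104K context;
# -fa enables flash attention; --cache-type-k/v quantize the KV cache to Q8_0.
```

Quantizing the KV cache to Q8_0 is what makes a 100K+ context fit alongside the weights in 24GB of VRAM.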
hmsenterprise@reddit (OP)
Wow! You've got quite the optimized setup 🙌
Savantskie1@reddit
You’d be surprised what you can get away with from general gaming gear. Especially graphics cards with more than 8GB of ram.
clebo99@reddit
My MSI works great for my local LLM.
Blaze344@reddit
I take a LOT of value out of it for myself.
I have a gaming PC with a 7900XT with 20gb of VRAM and I happily use GPT-OSS-20b, KQ8 VQ8 with max context (130k), at 130 tok/s at the start and 100 tok/s at the end of context. I provide it to my local network using LM studio and connect by URL from my company's laptop either using Codex (with an LM studio profile) or using Jan (for adhoc Q/A). This works absolutely great in the sense that I can freely send all kinds of crazy stuff directly to the assistant and not have a care in the world about sensitive data, let it run free in private repositories and glance at anything it needs and not have a damn in the world regarding what I feed it. OSS-20b is consistent enough to be able to work with codex itself and can create one-shot ad-hoc scripts for several things if instructed to do so, as well as entirely manage my git usage with more complex stuff that would be part of a release that would otherwise bore me to death (like figuring out which commits are which features, creating branches for them, merging them all, guaranteeing it's all nice and clean).
With Codex, I know what ticks the latent space in the right way so I provide the right context in my code (functions, methods, names, locations, files) and ask the right questions (descriptively narrate the problem, what I can do, what is the expected result) and the model toils away at boring / menial stuff while I do something else in the 6-8 minutes per task. (At some point, one might argue that it would have been better to do some of those things myself if I do describe them with such detail, but I'd argue that I'm much better at breaking down higher level concepts into lower levels than I am with managing pure syntax, so the assistant is really damn helpful. And I really hate doing some of the more menial stuff).
It also works reasonably well to translate my ideas into appropriate spark code when I forget the names of some of the spark utility helpers, and Codex also allows the model to do the aforementioned git boring stuff by itself and assisting me with some more esoteric git stuff. Finally, I also use it as a "final pass" for code reviewing merge requests (mine and from my team's) as it sometimes catches small inconsistencies that I glanced over (thanks to the thinking stuff and the large context and the speed of inference). The model is REALLY GOOD at short bursts of capabilities, in the sense that you wouldn't feed it an entire SQL script for Pyspark adaptation and hope it works one-shot, but rather, I can parse the original SQL script using an AST and feed it small parts (CTEs, large self-contained expressions, etc) and be 99% sure it will do it okay.
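The parse-and-chunk idea can be sketched with a paren-balanced scanner. A real pipeline would use a proper SQL AST parser (e.g. sqlglot); this simplified version only handles a single top-level WITH clause:

```python
import re

def split_ctes(sql: str) -> dict[str, str]:
    """Split a top-level WITH clause into {cte_name: body} chunks so each
    piece can be sent to the model separately. Paren-balanced scan; a real
    pipeline would use a proper SQL AST parser instead."""
    ctes = {}
    m = re.match(r"\s*WITH\s+", sql, re.IGNORECASE)
    if not m:
        return ctes
    i = m.end()
    while True:
        name_m = re.match(r"\s*(\w+)\s+AS\s*\(", sql[i:], re.IGNORECASE)
        if not name_m:
            break
        name = name_m.group(1)
        j = i + name_m.end()          # position just after the opening '('
        depth, start = 1, j
        while j < len(sql) and depth:  # scan to the matching close paren
            if sql[j] == "(":
                depth += 1
            elif sql[j] == ")":
                depth -= 1
            j += 1
        ctes[name] = sql[start : j - 1].strip()
        i = j
        comma = re.match(r"\s*,", sql[i:])
        if not comma:
            break
        i += comma.end()
    return ctes

sql = """WITH orders_2024 AS (SELECT * FROM orders WHERE yr = 2024),
         totals AS (SELECT cust, SUM(amt) AS s FROM orders_2024 GROUP BY cust)
         SELECT * FROM totals"""
```

Each extracted CTE body then becomes a small self-contained prompt, which is exactly the "short bursts of capability" regime where a 20B model is reliable.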
Finally, in my most recent experiment, I downloaded all of the Databrick's documentation and converted it to pure raw text, and set up a folder where I can just ask Codex running on that folder to "deep research" in the documentation to ask whatever I want of it. OSS-20B does that very well and fast and has not failed me so far. I suspect I can improve its capabilities by doing something similar by providing my local agent access to updated documentation that it can read and grep freely, and improve its context management as it looks for its own "few shot" examples in the documentation.
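The grep-style "deep research" retrieval over a local docs dump could look roughly like this (the `.txt` extension, context window, and hit limit are assumptions):

```python
from pathlib import Path

def grep_docs(root: str, query: str, context: int = 2, limit: int = 5) -> list[str]:
    """Naive retrieval over a converted-to-text docs folder: scan every
    .txt file under root and return matching lines with a few lines of
    surrounding context, ready to paste into a model prompt."""
    hits = []
    needle = query.lower()
    for path in Path(root).rglob("*.txt"):
        lines = path.read_text(errors="ignore").splitlines()
        for i, line in enumerate(lines):
            if needle in line.lower():
                lo, hi = max(0, i - context), i + context + 1
                hits.append(f"{path.name}:{i + 1}\n" + "\n".join(lines[lo:hi]))
                if len(hits) >= limit:
                    return hits
    return hits
```

An agent that can call something like this (or plain `grep`) effectively fetches its own few-shot examples from the documentation, which is what makes the approach work without a vector store.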
hmsenterprise@reddit (OP)
This is excellent commentary. Thank you for taking the time to share. Re: the 6-8min task length and "why not do yourself" -- related to what you describe is the frame of mind context shifting required between high level conceptual thinking and design vs lower level pipeworks code implementation. For some engineers, this isn't that big of a deal, but for me it is extremely difficult and costly to context switch mentally between those modes. So, I hear you!
I like your idea of having a downloaded "verified context data source" to pull from as needed
Far_Statistician1479@reddit
I work with local LLMs professionally. So does everyone else in an industry with any degree of data sensitivity.
Aggressive-Bother470@reddit
Of course.
96GB VRAM, gpt120, 160t/s base. When I need more context I go back to Qwen 2507. I occasionally use Seed and Devstral, too.
You have to get creative sometimes and coax it to not attempt to read every file in the project, line by line.
hmsenterprise@reddit (OP)
So you're using it to write code? What wrapper/IDE are you using them in?
Aggressive-Bother470@reddit
roo
munkiemagik@reddit
As someone who doesn't have anything to do with tech/IT professionally or aspirationally, my interest and experimentation with LLMs has just been to scratch an itch. I've taken it to the point of building a Threadripper server with multiple 3090s (and a 5090, but that 5090 was specifically bought for PCVR use and is moonlighting in the LLM server, because who doesn't want 80GB of VRAM if it's just sitting there).
But because I don't really have an aim, I don't really have any idea of what to do with all this in a meaningful way. So I'm just tinkering and pottering about willy-nilly, trying out random things that catch my attention, which makes learning and understanding less efficient. That's why I dip into these posts all the time.
The most useful things I've managed to do with it the last month or two have been a few random apps and scripts, but the purpose of those, again, was really just to see if I could, rather than something I genuinely needed.
I think, just for my own case, I'm slowly coming to the conclusion that it's OK to say I'm doing it for fun. But if I really intended to achieve something with it, then even with the not-insignificant cost of hardware I've committed, I still feel it's better to use non-local LLMs to actually get productive things (within my sphere of conceivability) done. The cost for me so far far outweighs its usefulness and massively overshadows how little it would have cost me to get the same things done had I just paid for tokens.
So for now I'm just putting it mentally in the same hobby box as PCVR simracing, 3D printing, etc.: something I dip into now and then when I feel like it. I'll spend what I want on it as long as it's giving me continued enjoyment in return, e.g. being able to run GPT-OSS-120B at over 40t/s was worth buying another 3090, lol. And if I ever get bored of it all I can always sell the hardware, so it's never really money down the drain.
And I lurk around here in these subs constantly in the hope that some day I see something that inspires me to feverish diligence and genuinely productive output.
hmsenterprise@reddit (OP)
Yes it is a blast, without a doubt! Not trying to besmirch that. I have multiple PCs just for AI stuff running in my garage. This question just came from curiosity after realizing I haven't actually met anyone who is using local LLMs for much beyond uncensored image generation and the occasional Very Technical Person who is using them for specific, constrained tasks within their development workflows.
robberviet@reddit
Coding assistant, RAG, and deep research are real work for me. Not with local LLMs, though; local LLMs are for fun.
XiRw@reddit
I like asking it real questions, pondering philosophy and psychology, and having it as a supplemental therapist, but it wouldn't make sense for coding, since even the current best models struggle with that.
Savantskie1@reddit
I disagree. The big models available online to download are pretty good for coding as long as you lay out a plan with them and do it in chunks. That's how I've been able to create my first memory system considering that I barely understand the basics of coding. And I do it in chunks so I can learn as I go. I'm not coding completely on my own yet, but I can finally grasp the basics and general flow of what's happening. Granted, it's only Python, which seems to be the only noob-friendly language other than early HTML, but I'm learning.
neoscript_ai@reddit
Sure, I am using local LLMs in healthcare for research, questioning, transcription and summarization
DuncanEyedaho@reddit
The only real "work" related use for me is summarizing documents where I don't want to upload them, because I don't have HIPAA or BAA agreements with any of the big companies (yet).
I've been pleasantly surprised with what little old llama_3.2_3B_q4_instruct can pull off. It's nowhere near any of the big models, but it handles language pretty darn well.
I even plopped it in a little project of mine and wrote my own RAG based episodic memory, but that was purely experimentation.
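A toy sketch of that episodic-memory idea, with token overlap standing in for real embedding retrieval (the stored snippets are made up):

```python
class EpisodicMemory:
    """Toy sketch of RAG-style episodic memory: store past exchanges and
    retrieve the most relevant ones by token overlap. A real version would
    use embeddings and a vector index instead."""

    def __init__(self):
        self.episodes: list[str] = []

    def remember(self, text: str) -> None:
        self.episodes.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda ep: len(q & set(ep.lower().split())),
            reverse=True,
        )
        return scored[:k]

mem = EpisodicMemory()
mem.remember("user prefers dark roast coffee")
mem.remember("user's router is at 10.0.0.1")
print(mem.recall("what coffee does the user like", k=1))
# → ['user prefers dark roast coffee']
```

The recalled episodes get prepended to the model's prompt each turn, which is how a 3B model can appear to "remember" earlier conversations.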
FullOf_Bad_Ideas@reddit
yes I use them in real workflows but they're not hosted on my own hardware.
Many companies and people rent GPUs. What do you think they run there? They run models, often those based on open weights.
I believe that llama 3.3 70B instruct was popular with in-house enterprise projects recently.
hmsenterprise@reddit (OP)
That's kind of my point, though. It's a much different experience to be able to hit "rented GPU" endpoints at will and in parallel, most of which are beefier than consumer hardware.
Idk about llama-3.3-70b in enterprise projects -- haven't seen anything but interested to learn what they're actually doing
MitsotakiShogun@reddit
One of the other departments in the (S&P500) company I work at used it as a base to finetune, for usage in one of our most critical products. It's a surprisingly decent base, especially for more serious tasks. Also another team used finetuned small models (<10B) for "less critical" products... with ~3M daily page views :D
But we also use models on/from Anthropic, OpenAI, Bedrock, Azure, and probably every other commercial offering available from NA/EU countries.
FullOf_Bad_Ideas@reddit
When you want to do work you want to parallelize and make it efficient.
Kinda like having people working in an office/factory instead of building Ford cars in their gardens and bringing parts together for an assembly in a football field. It's natural.
I do use local GLM 4.5 Air for coding assist on some work related projects which I know it can help me with. I have access to Codex and CC so it's not saving anyone any money, but I have local bias.
hmsenterprise@reddit (OP)
Nice
Terminator857@reddit
I setup local code development and advised a client on how to do the same. They wanted to make sure the code did not leave the premises.
I suggested using Qwen3 Coder. For a low-end workstation I'd suggest Strix Halo; for mid-tier, a 5090 system; and for high-end, an RTX Pro 6000 build.
hmsenterprise@reddit (OP)
Have you set up any rigs for clients with rtx pro 6000s? What were they doing
Terminator857@reddit
No, I don't set up, I just advise. They were just coding for a startup.
oodelay@reddit
Lots of classification of photos and texts, plus keyword extraction. Although BERT can do some of it, I find I get a better retrieval score using Mistral 24B for technical documents in different languages. I'm going to use Qwen OCR to strip PDFs with tables and graphs. I also use Gemma 12B to read the beginning of a JSON and create a parsing formula I can drop in anywhere else to grab those files, and Mistral again to create thousands of question-answer pairs for fine-tuning a LoRA on a much smaller model... and more every day.
hmsenterprise@reddit (OP)
Wait I don't understand that "parsing formula I can put anywhere else to grab those files" ... what does that mean?
Also why do you want to classify all that stuff (photos, texts, etc) -- for search reasons?
Freonr2@reddit
Relevant post over on /r/cline recently that is a bit more specific to using as a programming assistant, you can read my response there:
https://old.reddit.com/r/CLine/comments/1osi1wq/which_open_source_model_do_you_recommend_that_i/
hmsenterprise@reddit (OP)
Yes that has been roughly my experience as well