Is there any use case for large models with very slow token output for batch processing?
Posted by Last_Bad_2687@reddit | LocalLLaMA | 17 comments
Maybe I'm influenced by the sci-fi story "The Last Answer" by Isaac Asimov, but I've always gotten a kick out of imagining a huge model like Kimi running on, say, disk, even if it runs at 0.001 tok/sec and you ask complex questions and get an answer in a week.
Is there any use or community focused on this?
yaosio@reddit
Not really. Currently every token takes the same amount of compute to produce. I remember there was research into variable compute per token, but I'm not aware of any model that uses it. Maybe that became mixture of experts, though, in which case models do use it. Going very slow will not produce better output.
Last_Bad_2687@reddit (OP)
I don't mean going slow to produce better output. I mean something like going full swap, or having a Linux driver that presents your SSD as RAM, and then doing a full offload onto a 2-4TB SSD instead of VRAM. That's why 0.01 tok/sec: you are using swap/page file instead of RAM after offloading from VRAM.
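For a rough feel of what disk offload does to speed, here is a back-of-the-envelope sketch. It assumes decoding is limited purely by how fast weights can be read, that every token needs the active weights read once, and that nothing stays cached in RAM. All the numbers (parameter counts, 4-bit weights, 3 GB/s SSD reads) are illustrative assumptions, not measurements, and real swap access is usually slower than a sequential read.

```python
def estimated_tokens_per_second(bytes_read_per_token: float,
                                read_bandwidth_bytes_per_s: float) -> float:
    """Upper-bound decode speed when reading weights is the only bottleneck."""
    if bytes_read_per_token <= 0 or read_bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return read_bandwidth_bytes_per_s / bytes_read_per_token


# Illustrative assumptions only (not taken from any benchmark):
TOTAL_PARAMS = 1e12          # hypothetical 1T-parameter model
ACTIVE_PARAMS = 32e9         # hypothetical active parameters per token for an MoE
BYTES_PER_PARAM = 0.5        # 4-bit quantized weights
SSD_READ_BANDWIDTH = 3e9     # 3 GB/s sequential read

# Case 1: every weight is read from disk for every token.
dense_tps = estimated_tokens_per_second(TOTAL_PARAMS * BYTES_PER_PARAM,
                                        SSD_READ_BANDWIDTH)

# Case 2: only the active experts are read for each token.
moe_tps = estimated_tokens_per_second(ACTIVE_PARAMS * BYTES_PER_PARAM,
                                      SSD_READ_BANDWIDTH)

print(f"all weights read per token:    {dense_tps:.4f} tok/s")
print(f"active weights read per token: {moe_tps:.4f} tok/s")
```

Under these assumptions the first case lands in the same ballpark as the 0.01 tok/sec guess, and a sparse model reading only its active weights would be noticeably faster.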
yaosio@reddit
You can already do that automatically in your favorite LLM front end. For me that's LM Studio. You can pick how many layers are loaded into VRAM. You can also pick CPU only. If you don't have enough VRAM it automatically offloads for you.
Last_Bad_2687@reddit (OP)
Exactly, but this is usually a last resort, no? A lot of people seem to have a target tok/sec of say 20-40 and deem anything under that as non-viable.
I am mainly asking:
1) Which use cases could tolerate a much lower tok/sec in exchange for a much larger model (e.g. I can run Kimi 2.5 "super-offloaded" to disk/swap/page, but is it useful?)
2) Are there any communities focused on this "quadrant" of the speed/size 2x2 matrix?
yaosio@reddit
Collaborative writing might work because you're not generating many tokens per turn, and it gives you more time to think about what's being written and what you want to write. For things with objective quality I can't see a use for slowness because there's no way to know in advance the quality of the answer. You don't want to wait a day to find out the output is wrong.
sahanpk@reddit
slow works when the job is offline, retryable, and you care more about privacy/control than turnaround. otherwise smaller + scaffolded wins.
Last_Bad_2687@reddit (OP)
do you have a real world example of offline + retryable + privacy?
Eg: Huge law case with lots of context + huge model + weekend question
Puzzleheaded_Base302@reddit
let's say a cloud API costs $5/million tokens (I made it up) and runs at 100 TPS (I also made it up). The 0.001 TPS rig will take 1 day to do something the cloud API finishes in under a second.
for a million tokens, it will take 1 billion seconds to produce from your local disk. 1 billion seconds is about 32 years. so you can only get 2 million tokens or so from your local disk across your whole life, which can easily be achieved with about $10 paid to Kimi, and done in hours.
during the 32 years or so it takes to generate your 1 million output tokens, your motherboard will die, your hard disk will fail, and your power supply will likely get blown caps. your baby will also finish college and create a grandchild for you.
logarithm math works in a very interesting way
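A quick way to sanity-check this kind of arithmetic, as a minimal sketch. The two rates are the made-up figures from the comment (0.001 tok/sec local, 100 tok/sec cloud); the function name and constants are just illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Wall-clock seconds needed to emit `tokens` at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


LOCAL_TPS = 0.001   # disk-offloaded rig (figure from the thread)
CLOUD_TPS = 100.0   # cloud API (made-up figure from the thread)

# How much the local rig produces in one day, and how long the cloud
# would need for the same number of tokens.
tokens_per_day_local = LOCAL_TPS * SECONDS_PER_DAY
cloud_seconds_for_same = seconds_to_generate(tokens_per_day_local, CLOUD_TPS)

# Time to produce one million tokens on each side.
local_years_per_million = seconds_to_generate(1_000_000, LOCAL_TPS) / SECONDS_PER_YEAR
cloud_hours_per_million = seconds_to_generate(1_000_000, CLOUD_TPS) / 3600

print(f"local rig per day:       {tokens_per_day_local:.1f} tokens")
print(f"cloud time for the same: {cloud_seconds_for_same:.2f} s")
print(f"1M tokens locally:       {local_years_per_million:.1f} years")
print(f"1M tokens in the cloud:  {cloud_hours_per_million:.1f} hours")
```

With those rates this works out to roughly 86 tokens per day locally (under a second of cloud time), and about 31.7 years versus about 2.8 hours for a million tokens.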
Last_Bad_2687@reddit (OP)
Now do the logarithmic math of how much value AI companies get when everyone trains their models for free with real use cases, instead of trying to keep things local
sn2006gy@reddit
Have you ever read the Hitchhiker's Guide to the Galaxy?
42
ShengrenR@reddit
Sure, but what's the question?
Vusiwe@reddit
1TB RAM plus 1x Max-Q, running ~1TB+ models inc G 5.1, DS 3.2, K 2.6 at 1/4 thru half quality - small 4096 max context, 1k sized prompts mostly, 0.25 t/s speed
MUCH lower nominal t/s once all QA, retries, etc is taken into account… might be <0.03 t/s of final output
Fiction
Input is a Looong outline (100K) with QA-oriented reusable metaprogramming written within the outline itself. The outline is hundreds of small steps each typically with a QA check. Internal variable support, internal analysis, etc.
Custom engine that makes use of one of the UIs
Use case is unattended storytelling, with on-demand section rewriting, manual override GUI, automated retry, historical state reversion/resumption at any point in the story
It goes so slow, you actually get to live your life, be with family, while it runs in the background
Slow is the price of >SOTA level writing
GrokiniGPT@reddit
yeah, the brokies(me)
Former-Ad-5757@reddit
Brokies use cloud in this example...
Wear and tear and power usage will raise the bill far over the cost of a few cloud seconds.
beekersavant@reddit
I am looking at something like this as a teacher: putting my whole curriculum in, assignment by assignment, as separate programs. Being able to scan in handwriting and convert it to text, then grade each rubric item as a separate program, with error checks and human correction with model revision. The 128 GB mini computers seem to be the way to go. Assignments come in in batches of 150 at a time and they take 7 minutes each (roughly, and not counting fatigue). It's very repetitive, especially year after year.
Anyhow, time constraints are not an issue. It can take a week and I can hand grade odd ones out then update the model. 90% of assignments are just going through the motions. Token speed doesn't seem to matter unless very slow.
TripleSecretSquirrel@reddit
I've been chasing my holy grail of a fully local coding agent. The standard caveats are that I'm not a software engineer, I'm not writing very complex software, and I'm not writing anything for anyone else's use aside from me and maybe a couple friends.
This isn't quite as dramatic as what you're describing of course, but while qwen 3.6 27B is very capable, it takes a long time on my hardware, especially once you have big pools of context built up.
I have an automation script now that keeps my agent coding without needing my input until there's a working prototype. Depending on the size of the project, that can represent several hours of runtime. So I've changed my workflow to run overnight when my electricity is super cheap. I'll conduct a planning session and lay out the sprint, then when I'm getting in bed at night, I ssh into my server from my phone, set it to run, then when I wake up in the morning, I have a working prototype.
Baldur-Norddahl@reddit
The use case is for people doing it for hobby and learning.
For business I am going to claim that most use of the largest models is for coding or agentic workflows. Being very slow is not practical for that.
Many other uses do not actually require the largest models. Use a smaller model and provide the domain knowledge directly, through instructions in the system prompt, RAG databases, specific tools, etc. You don't need to rely on the built-in world knowledge of a 1T model.