Things I've learned about running LLMs locally (memory is everything)
Posted by Confident_County_140@reddit | LocalLLaMA | 4 comments
After spending a lot of time trying to push local inference as far as possible, here's where I've landed:
CPU/RAM-only inference hits a wall fast. Past ~4B parameters, you drop to 3–4 tokens/second: technically functional but painful for anything interactive.
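The reason for that wall is memory bandwidth: every generated token has to stream essentially all of the model's weights from RAM once, so system memory bandwidth caps decode speed regardless of how fast the cores are. A rough back-of-envelope sketch (the bandwidth and model-size numbers below are illustrative assumptions, not measurements):

```python
# Back-of-envelope decode-speed ceiling for CPU/RAM inference.
# Each generated token streams all active weights from memory once,
# so: tokens/sec <= memory_bandwidth / model_size_in_bytes.
def max_tokens_per_sec(params_billions, bytes_per_param, mem_bw_gb_s):
    """Upper bound on decode speed; real throughput is lower
    (compute overhead, KV cache traffic, imperfect bandwidth use)."""
    model_gb = params_billions * bytes_per_param
    return mem_bw_gb_s / model_gb

# Illustrative: a 7B model at 8-bit (1 byte/param) on dual-channel
# DDR4 with ~50 GB/s of effective bandwidth.
print(round(max_tokens_per_sec(7, 1.0, 50), 1))  # 7.1
```

Even the theoretical ceiling is single digits for mid-size dense models, and measured speeds land well below it, which matches the 3–4 tok/s experience above.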
BitNet / 1-bit architectures are interesting but not there yet. Projects like Microsoft's BitNet are rethinking how models compute to make CPU-friendly inference practical. The catch: it requires training models from scratch with that architecture, and active development has slowed. You can't apply it to existing models.
Distributed inference isn't worth it right now — at least not on consumer hardware. Network latency kills throughput unless you're on a high-bandwidth interconnect. The only viable exception seems to be multi-GPU setups with something like vLLM, which is a different use case entirely.
What's actually working for me: quantization + MoE offloading. This is the most mature path by far. The sweet spot I've found is MoE models with expert layers offloaded to RAM, which lets me run models larger than my VRAM at usable token speeds. Not perfect, but the best trade-off available today.
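For anyone who hasn't tried this: the trick works because an MoE model only activates a few experts per token, so the expert weights can live in slower RAM while the always-hot dense/attention layers stay in VRAM. A sketch of what that looks like with llama.cpp (treat the model filename and the tensor-name regex as examples; check the flags and tensor names for your build and model before copying):

```shell
# Illustrative llama.cpp invocation, not a drop-in recipe.
# -ngl 99 offloads all layers to the GPU first, then --override-tensor
# sends tensors matching the regex (here: MoE expert FFN weights, which
# hold most of the parameters) back to CPU RAM. Net effect: dense layers
# and attention stay in VRAM, bulky experts page in from system memory.
./llama-server \
  -m mixtral-8x7b-instruct.Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps.=CPU" \
  -c 8192
```

The exact tensor-name pattern varies by model architecture, so it's worth dumping the GGUF tensor names first to confirm what the expert layers are called.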
One product worth watching: Tiiny AI Pocket Lab
This caught my attention recently. It's a pocket-sized ARM device with 80GB of LPDDR5X unified memory that claims to run 120B parameter models locally at around 20 tokens/second. The key tech enabling this is something they call TurboSparse — neuron-level sparse activation, meaning the chip only fires the parts of the model actually needed for a given query. That's how a 30W chip handles a 120B model without melting.
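To make the sparse-activation idea concrete, here's a toy sketch of neuron-level sparsity in a feed-forward layer. This is my illustration of the general technique, not Tiiny's TurboSparse (which is closed source); real systems like PowerInfer use a small learned predictor to guess which neurons will fire instead of computing the full pre-activation as done here:

```python
import numpy as np

def sparse_ffn(x, W_up, W_down, keep_frac=0.1):
    """Toy neuron-level sparse FFN: keep only the top fraction of
    neurons by pre-activation magnitude and compute just their rows
    of the down-projection. Illustrative only; production systems
    predict the active set cheaply rather than scoring every neuron."""
    pre = x @ W_up                                 # per-neuron pre-activations
    k = max(1, int(keep_frac * pre.size))          # how many neurons to keep
    idx = np.argpartition(np.abs(pre), -k)[-k:]    # indices of active neurons
    # ReLU the surviving neurons, multiply only their W_down rows:
    return np.maximum(pre[idx], 0) @ W_down[idx]

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W_up = rng.standard_normal((64, 256))
W_down = rng.standard_normal((256, 64))
y = sparse_ffn(x, W_up, W_down)
print(y.shape)  # (64,)
```

With `keep_frac=0.1`, the down-projection touches a tenth of the weight rows, which is the kind of saving that lets a low-power chip skip most of the memory traffic per token. The open question, as below, is how much output quality that skipping costs.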
The concept is exactly what I've been looking for: a personal device that attaches to your machine and handles inference privately, offline, no cloud dependency.
That said, I have real reservations:
- The 120B claim depends entirely on their proprietary compression and sparsity research. There's no independent benchmark yet — just their own demos, including one running on a 14-year-old PC (which, yes, is a marketing stunt).
- TurboSparse is not open source. They say it will be, eventually. But right now the company is clearly prioritizing shipping hardware and generating revenue. That's rational for a startup — doesn't mean the open-source promise will materialize on any useful timeline.
- Sparse activation at this level trades compute for potential quality degradation. Nobody outside the company has stress-tested whether the 120B outputs are actually 120B-quality or effectively something much smaller.
The underlying idea — sparsity-based inference on high-bandwidth unified memory — is technically sound. PowerInfer (from SJTU, one of the same research groups involved) is partially open and showed real results. So this isn't vaporware territory. But there's a gap between "the research works" and "the product ships and performs as claimed."
Has anyone here actually ordered one or tested it? Curious whether the token speeds hold up on realistic prompts (long context, reasoning chains) rather than cherry-picked demos.
Silver-Champion-4846@reddit
I'm also interested in that thing. Not sure how accessible it would be for screen readers. If the quality reduction is negligible, I don't need a 120b model, I would just keep experimenting on llama3 70b finetunes lol.
Woof9000@reddit
Man, could you please stop spamming all over the place with your bots
Silver-Champion-4846@reddit
You think I'm a bot? I'm really not.
Woof9000@reddit
AI slop