Things I've learned about running LLMs locally (memory is everything)

Posted by Confident_County_140@reddit | LocalLLaMA

After spending a lot of time trying to push local inference as far as possible, here's where I've landed:

CPU/RAM-only inference hits a wall fast. Past ~4B parameters, you drop to 3–4 tokens/second: technically functional but painful for anything interactive.
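That wall is mostly memory bandwidth, not compute: autoregressive decoding reads (roughly) every weight once per token. A quick sketch makes the ceiling visible. The bandwidth and quantization figures here are my illustrative assumptions, not measurements:

```python
# Back-of-envelope: decoding is memory-bandwidth bound, so
# tokens/sec <= effective memory bandwidth / bytes read per token.

def tokens_per_second(params_billion, bytes_per_param, bandwidth_gbps):
    """Rough upper bound assuming every weight is read once per token."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / model_bytes

# Assumed dual-channel DDR4 desktop: ~50 GB/s effective bandwidth.
# 7B model at 4-bit quantization (~0.5 bytes/param):
print(round(tokens_per_second(7, 0.5, 50), 1))
# Same 7B model at fp16 (2 bytes/param):
print(round(tokens_per_second(7, 2.0, 50), 1))
```

The fp16 case lands right around the 3–4 tok/s figure above, which is why quantization is the first lever everyone reaches for.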

BitNet / 1-bit architectures are interesting but not there yet. Projects like Microsoft's BitNet are rethinking how models compute to make CPU-friendly inference practical. The catch: it requires training models from scratch with that architecture, and active development has slowed. You can't apply it to existing models.
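To see why these architectures are CPU-friendly, here's a toy sketch of ternary ("1.58-bit") weight quantization in the spirit of BitNet b1.58: every weight becomes -1, 0, or +1 plus one per-tensor scale, so matmuls reduce to signed additions. This is only an illustration of the representation, not the training recipe; as noted above, real BitNet models must be trained this way from scratch, and converting an existing fp16 checkpoint like this would wreck its quality:

```python
import numpy as np

def ternarize(w):
    """Quantize a weight tensor to {-1, 0, +1} with an absmean scale."""
    scale = np.abs(w).mean()                    # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)     # ternary weights
    return q.astype(np.int8), scale

w = np.array([[0.9, -0.1, -1.2], [0.3, 1.1, -0.4]])
q, s = ternarize(w)
print(q)  # every entry is -1, 0, or +1

# Inference then needs no weight multiplications at all:
x = np.array([1.0, 2.0, 3.0])
y = s * (q @ x)  # q @ x is just signed sums of activations
```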

Distributed inference isn't worth it right now — at least not on consumer hardware. Network latency kills throughput unless you're on a high-bandwidth interconnect. The only viable exception seems to be multi-GPU setups with something like vLLM, which is a different use case entirely.
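The latency problem is easy to quantify: when a model is pipeline-split across machines, each generated token's activations must cross the network at every split point, and that round trip is serialized into the decode loop. The numbers below are hypothetical illustrations, not benchmarks:

```python
# Sketch: per-token latency for pipeline-split decoding.
# Each split boundary adds one network round trip per generated token.

def decode_tok_per_s(compute_ms_per_token, n_splits, net_rtt_ms):
    total_ms = compute_ms_per_token + n_splits * net_rtt_ms
    return 1000.0 / total_ms

# Assumed 50 ms of compute per token, split across 2 machines (1 boundary):
print(decode_tok_per_s(50, 1, 1))    # ~1 ms RTT wired LAN: small penalty
print(decode_tok_per_s(50, 1, 20))   # ~20 ms RTT loaded Wi-Fi: real penalty
```

With a datacenter interconnect the RTT term nearly vanishes, which is exactly the multi-GPU/vLLM exception mentioned above.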

What's actually working for me: quantization + MoE offloading. The most mature path by far. The sweet spot I've found is MoE models with expert layers offloaded to RAM, which lets me run models larger than my VRAM at usable token speeds. Not perfect, but the best trade-off available to me today.
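The reason this works: per token, the router only activates a couple of experts, so the bytes actually touched are a fraction of the full model even though all the weights have to live somewhere. A rough sketch, using a Mixtral-8x7B-like layout (2 of 8 experts per token) as an assumed example:

```python
# Fraction of weights read per token in an MoE model:
# all dense (shared) params, plus the selected experts' share.

def active_fraction(n_experts, experts_per_token, expert_share_of_params):
    dense_share = 1.0 - expert_share_of_params
    return dense_share + expert_share_of_params * experts_per_token / n_experts

# Assume ~75% of parameters sit in expert FFNs, 2 of 8 experts fire per token:
frac = active_fraction(8, 2, 0.75)
print(f"{frac:.0%} of weights touched per token")
```

So RAM holds the cold experts while the hot path stays small, which is why offloading expert layers costs far less throughput than offloading a dense model of the same total size.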

One product worth watching: Tiiny AI Pocket Lab

This caught my attention recently. It's a pocket-sized ARM device with 80GB of LPDDR5X unified memory that claims to run 120B-parameter models locally at around 20 tokens/second. The key tech enabling this is something they call TurboSparse: neuron-level sparse activation, meaning the chip only fires the parts of the model actually needed for a given query. That's how a 30W chip handles a 120B model without melting.
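You can sanity-check that claim with the same bandwidth arithmetic as above. All the figures here (quantization level, LPDDR5X bandwidth, sparsity ratio) are my assumptions for illustration, not the vendor's published specs:

```python
# Bandwidth needed to hit a target token rate, given how much of the
# model is actually read per token.

def required_bandwidth_gbps(params_b, bytes_per_param, tok_per_s, active_frac):
    return params_b * 1e9 * bytes_per_param * tok_per_s * active_frac / 1e9

# 120B params at ~4-bit (0.5 bytes/param), targeting 20 tok/s:
dense = required_bandwidth_gbps(120, 0.5, 20, 1.0)    # every weight read
sparse = required_bandwidth_gbps(120, 0.5, 20, 0.1)   # assume ~10% active

print(dense)   # dense decoding: way beyond any LPDDR5X bus
print(sparse)  # sparse decoding: plausible for a wide LPDDR5X setup
```

In other words, the 20 tok/s number is only achievable if the sparse-activation tech works as advertised; dense decoding at that size and speed would need datacenter-class HBM bandwidth.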

The concept is exactly what I've been looking for: a personal device that attaches to your machine and handles inference privately, offline, no cloud dependency.

That said, I have real reservations:

The underlying idea, sparsity-based inference on high-bandwidth unified memory, is technically sound. PowerInfer (from SJTU, one of the same research groups involved) is partially open-source and has shown real results. So this isn't vaporware territory. But there's a gap between "the research works" and "the product ships and performs as claimed."

Has anyone here actually ordered one or tested it? Curious whether the token speeds hold up on realistic prompts (long context, reasoning chains) rather than cherry-picked demos.