The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens?
Posted by jochenboele@reddit | LocalLLaMA | View on Reddit | 13 comments
Xiaomi open-sourced MiMo-V2.5-Pro. 1.02 trillion parameters, 42B active (MoE), 1M context, MIT license. On paper, this is exciting. In practice, I'm stuck on the math.
What I've been doing with it
I've been running V2.5-Pro via the API through Claude Code for autonomous coding sessions: not one-shot prompts, but extended multi-hour runs where the model picks its own tasks, debugs its own code, and keeps going across sessions using file-based memory.
Over \~125 sessions it built a full SaaS product from an empty repo: interactive API cost calculator with real-time pricing across 33 models and 10 providers, serverless API endpoints, Stripe checkout integration, embeddable widget system, RSS feed, newsletter infrastructure, SEO with structured data, and 60+ pages of content. 301 commits, all autonomous. It also ran quality audits on its own output: found issues across multiple files and fixed them without being asked.

This isn't "generate me a landing page." It's sustained autonomous development where the model maintains context across sessions, manages its own backlog, and makes architectural decisions. The kind of work where you'd notice immediately if the model was weak at instruction following or long-context reasoning.
The caching makes it absurdly cheap
Here's my billing:
| Metric | Value |
|---|---|
| Total tokens | 387,380,436 |
| Cache hit tokens | 373,124,480 (96.3%) |
| Cache miss tokens | 11,600,665 (3.0%) |
| Output tokens | 2,655,291 (0.7%) |
| Total cost | $70.12 |

96% cache hit rate. Claude Code reuses context heavily between tool calls within a session, and V2.5-Pro's caching means you're paying almost nothing for input after the first few calls. $70.12 for 387 million tokens across 125 sessions.
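The arithmetic roughly checks out. A quick sketch using the per-million rates from the pricing table in this post (cache hit $0.14/M, cache miss $1.00/M, output $3.00/M); the small gap versus the $70.12 billed is presumably rounding or plan discounts:

```python
# Back-of-envelope reconstruction of the bill from the token counts above.
# Rates are the MiMo list prices quoted in this post.
CACHE_HIT, CACHE_MISS, OUTPUT = 373_124_480, 11_600_665, 2_655_291

def cost(tokens: int, rate_per_million: float) -> float:
    """Cost in dollars for a token count at a per-million-token rate."""
    return tokens / 1e6 * rate_per_million

total = cost(CACHE_HIT, 0.14) + cost(CACHE_MISS, 1.00) + cost(OUTPUT, 3.00)
print(f"${total:.2f}")  # prints $71.80, close to the $70.12 actually billed
```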
How it compares
| | MiMo-V2.5-Pro | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| Input | $1.00/M | $15.00/M | $2.50/M |
| Cached input | $0.14/M (86%) | $1.50/M (90%) | $0.25/M (90%) |
| Output | $3.00/M | $75.00/M | $15.00/M |
| 387M token workload | $70 (actual) | \~$350-450 (est.) | \~$180-240 (est.) |
The MiMo cost is actual measured data from my testing. Claude and GPT estimates are based on published API pricing with conservative cache hit assumptions (90% vs MiMo's 96%), though not for the exact same workload.
Then I got excited about open-source
MIT license. Open weights. I can run this myself. No rate limits, no API dependency, full data privacy.
Then I looked at the specs. 1.02T total parameters. Even with MoE (42B active), the full model weights are massive. FP8 quantized, you're looking at \~1TB.
My hardware: a MacBook Pro M4 with 48GB unified memory and a desktop with an RTX 4090 (24GB VRAM). The 4090 handles 70B models fine; I run quantized Qwen and DeepSeek on it regularly. But 1.02T parameters? Not even close.
Realistically, this model is very difficult to run locally. You'd need serious multi-GPU infrastructure: 4x A100 80GB at minimum, probably more. That's $15,000-20,000 in hardware, or $6/hr on cloud GPU rental. For a developer running coding sessions a few hours a day, the economics don't work.
Where the API wins (and where it doesn't)
For intermittent usage like mine, a few hours of coding sessions per day, the API with 96% cache hits is genuinely hard to beat. I'm spending \~$0.56 per session on average. The equivalent cloud GPU time would cost $6/hr just for the hardware, before you even factor in setup and maintenance.
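To put a number on "hard to beat," a hedged break-even sketch using the figures above ($0.56/session measured, $6/hr rental assumed, and the assumption that the rig must stay rented 24/7 to be available on demand):

```python
# Hedged break-even: at what daily usage does renting a multi-GPU rig
# beat the cached API? Figures come from the post; rental rate is an estimate.
API_COST_PER_SESSION = 0.56   # measured average across 125 sessions
GPU_RENTAL_PER_HOUR = 6.00    # quoted cloud rental rate for a suitable rig

gpu_daily = GPU_RENTAL_PER_HOUR * 24              # $144/day if always on
breakeven_sessions = gpu_daily / API_COST_PER_SESSION
print(f"{breakeven_sessions:.0f} sessions/day to break even")  # prints 257
```

Roughly 257 sessions a day, versus the handful I actually run.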

Where self-hosting would win:
• Data privacy (the real killer feature for enterprise)
• Fine-tuning on proprietary codebases
• Running at scale 24/7 where the per-hour cost amortizes
• No rate limits (I hit API limits a few times during heavy testing)
But for most developers? The caching on the API side is doing too much heavy lifting.
Xiaomi also offers token plans with discounted credit multipliers and off-peak pricing, which may further reduce costs depending on workload patterns and usage intensity.
The question
Has anyone actually attempted running the open-source V2.5-Pro locally yet? What hardware are you looking at? I'm curious whether anyone's working on quantized versions or GGUF conversions, though at 1.02T params even Q4 is going to be enormous.
The model is genuinely good at sustained autonomous coding. I just can't figure out when self-hosting it makes financial sense for someone who isn't running it around the clock.
MindPsychological140@reddit
The 96.3% cache hit is the actual story — effective cost \~$0.18/M tokens, brutal to match self-hosted. For 1T MoE you're looking at 4x A100 80GB for decent latency, or 1x 3090 + 512GB DDR5 with `--n-cpu-moe` if you tolerate expert offload latency. Either way you'd need prefix caching (vLLM/SGLang both support it) to approach that effective cost, and prefix caching eats VRAM you'd want for context. The API has amortized cache infrastructure you'd be rebuilding from scratch.
coyo-teh@reddit
what's a recipe for a banana cake?
jochenboele@reddit (OP)
They've got all that caching stuff running at scale across tons of users, which is why the effective cost is so low. Rebuilding that myself for a couple hours of coding a day just doesn't make sense. Would need to run it nonstop to even come close.
MindPsychological140@reddit
Yeah, that's the inflection. Rough back-of-envelope: at the API's effective $0.18/M, you'd need to push hundreds of millions of tokens/day before owning a multi-A100 rig for a 1T MoE starts looking sane on raw cost, and that's assuming you can sustain similar cache hit rates locally, which honestly nobody outside the hyperscalers can.
The one place self-hosting still wins is when the workload can't go to the API for non-cost reasons: compliance, air-gapped data, or sustained agentic loops where you're hammering 24/7 and want full control over batching/scheduling. For "a couple hours of coding a day," API every time.
Side note though: agentic loops specifically ruin your cache hit rate because tool-call results keep mutating the context tail. If anyone's running long-running agents through the API, watch your cache write/read ratio; I've seen workloads drop to <40% hit rate just from messy context management. Different problem from "should I self-host," but worth instrumenting.
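Sketching that back-of-envelope under simple assumptions (24/7 rental at the $6/hr quoted above, against the API's ~$0.18/M effective blended rate):

```python
# Hedged sanity check on the "hundreds of millions of tokens/day" claim.
# Both rates are rough figures from this thread, not exact quotes.
daily_rig_cost = 6.00 * 24                       # $144/day, always-on rental
breakeven_tokens = daily_rig_cost / 0.18 * 1e6   # tokens/day at $0.18/M
print(f"{breakeven_tokens / 1e6:.0f}M tokens/day")  # prints 800M tokens/day
```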
jochenboele@reddit (OP)
I think the reason my cache rate stayed so high (96%) is that Claude Code keeps the system prompt and file contents as a stable prefix, and only the tool results change at the end. So most of the context stays cacheable between calls within a session. Would be interesting to see how other agent frameworks compare.
MindPsychological140@reddit
exactly, append-only structure is the cache-friendly property. Claude Code's design is "stable prefix (system + repo context + memory) built once, everything else appended as turns." That maps cleanly onto how Anthropic's `cache_control` breakpoints work: the prefix never moves, so reads stay cheap.
The frameworks that fight this are the ones with history-mutation patterns, rebuilding `system → human → AI → tool → AI` per iteration. Cacheable if you structure carefully, brutal if you don't.
Append-only + an explicit cache breakpoint at the boundary between "stuff that never changes" and "stuff that grows" is the whole game. Most non-Anthropic agent frameworks were built before prompt caching existed, so they don't structure context with that in mind.
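A minimal sketch of that layout, shaped like an Anthropic Messages request with an ephemeral `cache_control` breakpoint (placeholder content throughout; this is an illustration of the pattern, not Claude Code's actual internals):

```python
# Append-only context with one cache breakpoint at the prefix boundary.
# The prefix is built once and never touched; turns only ever grow the tail.
SYSTEM_PROMPT = "You are an autonomous coding agent."          # placeholder
REPO_CONTEXT = "<repo file tree and key file contents>"        # placeholder
MEMORY_FILE = "<persisted notes from earlier sessions>"        # placeholder

request = {
    "system": [{
        "type": "text",
        "text": "\n\n".join([SYSTEM_PROMPT, REPO_CONTEXT, MEMORY_FILE]),
        "cache_control": {"type": "ephemeral"},  # breakpoint: prefix ends here
    }],
    "messages": [],  # the growing, append-only tail
}

def append_turn(req: dict, role: str, content: str) -> None:
    """Append to the tail only; earlier entries and the prefix are never edited."""
    req["messages"].append({"role": role, "content": content})

append_turn(request, "user", "Run the test suite and fix any failures.")
append_turn(request, "assistant", "Tests ran; patching two failing cases.")
```

Because nothing before the breakpoint ever mutates, every call after the first reads the whole prefix at the cached rate.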
FWIW I've been working on this from the other direction actually deduping the chunks that go into context before they hit the model, since even Claude Code accumulates duplicate file snippets and tool results across long sessions. Open-sourced it as Merlin: github.com/corbenicai/merlin-community
MIT, runs locally, MCP tool. Measured 22% chunk-level dedup on typical agent sessions on my end, up to 71% on RAG-heavy stuff. Different problem from cache invalidation but stacks well with it.
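The core idea behind chunk-level dedup is just content hashing; a minimal sketch (my illustration of the general technique, not Merlin's actual implementation):

```python
import hashlib

def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop any chunk whose exact content has already been forwarded."""
    seen: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        if digest not in seen:        # only forward chunks we haven't sent yet
            seen.add(digest)
            kept.append(chunk)
    return kept

chunks = ["def foo(): ...", "README intro", "def foo(): ..."]
print(dedup_chunks(chunks))  # the duplicate file snippet is dropped
```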
MelodicRecognition7@reddit
this question gets asked every single week, and the answer is always the same: NO, it is not worth it to run large models at home; the only thing that justifies the enormous cost of running trillion-parameter models locally is data privacy.
but YES, it is totally worth it to run SMALL models at home, not 1T ones.
theawesomew@reddit
Can I ask, what are AI labs deploying into production that enables them to have such low costs compared to self-hosting models?
Every time I compute the aggregate cost of purchasing the necessary hardware plus the electricity costs of running frontier open-weight models (DeepSeek V4 Pro 1.6T A49B, Kimi K2.6, GLM-5.1, etc.), the result implies their API pricing is outright unprofitable, and I haven't been able to find a good explanation.
LagOps91@reddit
you are not in any way, shape, or form saving money by running this locally. your electricity would have to be basically free, and you would have to keep the model doing stuff 24/7. even then it might still be better to sell the AI rig and get subscriptions instead.
the only real benefits are full control over which model runs at what quant and settings, and full privacy. so it's only worth going local with such a huge model if you really value those that highly.
jochenboele@reddit (OP)
That's basically where I landed too. The API with that cache rate is just too cheap to justify the hardware. Privacy is the main argument for local.
seamonn@reddit
and Quality and Consistency.
For now your Model is fine but you never know when it might be quantized, updated, or whatever that might break your workflows.
I was testing Kimi K2.5 (before K2.6 was released) a few weeks back and from some vendors, it was pretty much unusable because of how heavily it was quantized.
ItilityMSP@reddit
You don't need a huge model for successful code runs. Don't vibecode: architect the code in chat, break it down into functions, subfunctions, and project helper code, one file for each. Each small local worker works on one file. Let the architecture do the work.
jochenboele@reddit (OP)
That's a solid approach for structured work. The difference in my case is I wanted to test fully autonomous sessions where the model handles the planning AND execution without me breaking things down. More of an experiment to see how far it can go on its own. But yeah for production work, your approach is probably more reliable.