Is there appetite for hosting 3b/8b size models at an affordable rate?
Posted by No-Fig-8614@reddit | LocalLLaMA | View on Reddit | 21 comments
I don't want this to be a promotional post, even though it kind of is. We are looking for people who want to host 3b/8b models from the Llama, Gemma, and Mistral model families. We are working towards expanding to Qwen and eventually larger model sizes. We are using new hardware that hasn't really been publicized, along the lines of Groq, SambaNova, Cerebras, or even specialized cloud services like TPUs.
We are running an experiment and would love to know if anyone is interested in hosting 3b/8b size models. Would there be interest in this? I'd love to know if people would find value in a service like this.
I am not here to sell this. I just want to know if people would be interested, or whether it's not worth it until we reach larger parameter sizes, since a lot of folks can self-host models of this size. But maybe it makes sense if you run multiple fine-tunes of this size.
This isn't tiny LoRA adapters running on crowded public serverless endpoints - we run your entire custom model on a dedicated instance for an incredible price, with tokens-per-second rates better than NVIDIA options.
Would love to hear from some people, and I know the parameter sizes and model families are not ideal, but it's just the start as we continue to build this out.
The hardware is still in trial, so we are aiming to match what a 3b/8b class model would get on equivalent hardware. Obviously Blackwell and A100/H100 class hardware will be much faster, but we are targeting 3090/4090 class hardware with these models.
Our new service is called: https://www.positron.ai/snap-serve
ForsookComparison@reddit
My use-cases for LLMs of this size are such that it becomes more affordable (and reasonable) to just do CPU inference on the same server that hosts the API/website.
And anything on-prem I can slap a $60 GPU in and probably have a good time.
Herdnerfer@reddit
8b for $60 a month? Yall smoking crack.
ForsookComparison@reddit
I mean, hold up. /u/No-Fig-8614 says limitless.
Free-tier smaller models (e.g. OpenRouter) throttle you pretty quickly, and I bet I can get over $60 of API usage on all of the regular open-weight providers.
If Lambda Labs offers Llama3 8b at $0.04/1M output tokens now, and we guess we'll spend a penny on input for every one of those (in a generation-heavy use-case)...
napkin math says that if I nonstop, for 1 month, have a use-case that requires more than 462 tokens/second across all of my customers, then OP's service becomes price-competitive.
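A quick way to reproduce that napkin math (just a sketch, assuming a 30-day month, $0.04/1M output plus the guessed ~$0.01 of input spend per 1M output tokens, and a $60/month flat rate):

```python
# Break-even throughput for a $60/month flat rate vs pay-per-token pricing.
# The rates below are the assumptions from the comment above, not quoted prices.
flat_rate_usd = 60.0                 # assumed monthly subscription
cost_per_1m_tokens = 0.04 + 0.01     # $0.04 output + ~$0.01 of input per 1M output tokens
seconds_per_month = 30 * 24 * 3600   # 2,592,000 seconds

tokens_covered = flat_rate_usd / cost_per_1m_tokens * 1_000_000
breakeven_tps = tokens_covered / seconds_per_month
print(f"Break-even: {breakeven_tps:.0f} tokens/s sustained all month")
# -> roughly 463 tokens/s, in line with the ~462 figure above
```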
No-Fig-8614@reddit (OP)
Can you help me understand: is $60 a month, which has to cover the hardware, electricity, datacenter costs, and DevOps of hosting a model like that, too much? What would be your desired price range?
kmouratidis@reddit
In the other post you mentioned 100 TPS. Assuming:
- you run that non-stop
- you get 100 TPS (as mentioned in the other post) for the 8B model (you probably don't; more likely you get that for the 3B, but whatever)
- the cost is $60/month

that's ~260M tokens/month and $0.23/1M tokens.
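For reference, the arithmetic behind those figures (a sketch, assuming 100 TPS sustained over a 30-day month at $60):

```python
# Effective per-token price of a $60/month instance running flat out at 100 TPS
tps = 100
seconds_per_month = 30 * 24 * 3600          # 2,592,000 seconds
tokens_per_month = tps * seconds_per_month  # ~259.2M tokens
price_per_1m = 60.0 / (tokens_per_month / 1_000_000)
print(f"{tokens_per_month / 1e6:.0f}M tokens/month, ${price_per_1m:.2f}/1M tokens")
# -> ~259M tokens/month at ~$0.23/1M
```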
Assuming a 2:1 input/output blend, and only looking at OpenRouter (not sure about quants), for comparable prices you can get:
- ~$0.10/1M -> Ministral 8B
- ~$0.13/1M -> Gemma3 27B
- ~$0.16/1M -> Llama 3.1 70B or Qwen3 32B
- ~$0.20/1M -> Gemini 2.0 Flash or Qwen2.5 72B
- ~$0.30/1M -> Gemini 2.5 Flash (non-thinking), 4o-mini, Llama4-Maverick, or Qwen3-235B-A22B

Then there is Fireworks, charging:
- $0.10/1M for <4B models
- $0.20/1M for 4-16B models
No-Fig-8614@reddit (OP)
Yeah, I totally understand those numbers, and you're assuming only one concurrent request... but I get where your math is coming from. Throw in 10+ concurrent requests, though, and it changes. Again, this is for discovery.
The one thing to note is that this is meant for folks hosting fine-tunes and running them almost like an MoE model, routing each request to the right fine-tune. There are companies running 10+ tailored fine-tunes, and for those folks it makes more sense.
RedditDiedLongAgo@reddit
Your business is already a failure and your cope is sad. Stop spamming Reddit. It's embarrassing.
Your "we" probably agrees but is too kind to tell you.
No-Fig-8614@reddit (OP)
Thanks for the kind words
iamMess@reddit
I think most people here can run inference on a 3b or 8b model themselves. For $60 you can get A LOT of serverless inference at RunPod, or about 120 hours of an RTX 3090. I doubt many people are actually using these models actively that much per month.
No-Fig-8614@reddit (OP)
120 hours is 5 days, versus the 30 days we are going for...
Willing_Landscape_61@reddit
That's the thing: you have to consider that you might mostly attract customers who will use your servers 24/7, so a "gym membership" business model might not work.
RedditDiedLongAgo@reddit
lol, as if you're paying a flat rate
Federal_Order4324@reddit
So is the hook of your service that one can run inference with models we've fine-tuned?
AppealSame4367@reddit
I would ask the same question: who needs a 3b or 8b model badly enough to pay for it? There are new models like the Qwen3 8B R1 0528 distill that are quite good for their size, but paying for them?
You get this kind of power almost anywhere for free right now. Chutes AI, Cursor Free with GPT 4.1, Windsurf. Probably some more.
I think you should aim higher: If you serve better models to some paying customers, you can at least start making some money and maybe buy more hardware slowly, then faster and faster.
No-Fig-8614@reddit (OP)
So would increasing the parameter size to the next tier, 20-40B, be more appealing?
AppealSame4367@reddit
Yes, maybe Qwen3 A22B or even DeepSeek R1 0528. Nvidia Nemotron, maybe?
Now that I see you talk about "massive scaling", I'm thinking: ok, maybe not thinking models, or only thinking models where thinking can be safely disabled and enabled.
What's the use case for your services?
I guess massive agent-based workflows. And thinking models might be impractical for most steps in an agent workflow, but maybe you could advertise more modern models than Llama 3 and Gemma 3.
It's complicated, the more I think about it, and I'm just some dude on the Internet.
No-Fig-8614@reddit (OP)
I like to think about it this way: you have a chain of models you run, or you need a model always on for your process, and you don't want to spend on a service that doesn't make economic sense. It's also like buying a parking space for that monthly cost: you can slot in whatever model you want, but that space is reserved for you.
I've seen some workflows that act like MoE models: the request comes in and gets routed to the right fine-tuned model, so not all the models are running full throttle all the time, but they do need a bunch of fine-tunes running at the same time, and the load across them varies.
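For illustration, a minimal sketch of what that routing layer might look like (all model names, endpoints, and the classifier here are hypothetical; a real setup might use a small classifier model or the serving platform's own routing):

```python
# Hypothetical router: send each request to a dedicated fine-tune,
# so each small model only handles its slice of the traffic.
FINETUNE_ENDPOINTS = {  # hypothetical names and URLs
    "support": "https://api.example.com/v1/llama3-8b-support-ft",
    "sql":     "https://api.example.com/v1/llama3-8b-sql-ft",
    "summary": "https://api.example.com/v1/llama3-8b-summarizer-ft",
}

def classify(prompt: str) -> str:
    """Toy rule-based classifier; in practice this could be a small model."""
    lowered = prompt.lower()
    if "select" in lowered or "table" in lowered:
        return "sql"
    if "summarize" in lowered:
        return "summary"
    return "support"

def route(prompt: str) -> str:
    """Return the fine-tune endpoint that should serve this request."""
    return FINETUNE_ENDPOINTS[classify(prompt)]

print(route("Summarize this ticket thread for the on-call engineer."))
```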
AppealSame4367@reddit
But wouldn't it be easier to just offer modern MoE models to your customers, then?
The scenario you describe sounds highly specific. And it sounds like you are not sure who you wanna advertise to.
-vwv-@reddit
No.
RedditDiedLongAgo@reddit
Fuck off corpo.
Commercial-Celery769@reddit
The only good-performing small model I know of is Wan 1.3b, and it's not an LLM, it's a t2v and i2v model. Not really worth putting a lot of effort into small LLMs in their current state IMO, unless it's only for using them to help with simple tasks. I've used many 8b models, including the new Qwen distill, and they are a bit dumb.