3B and 8B Serving with New Hardware
Posted by No-Fig-8614@reddit | LocalLLaMA | 4 comments
I don't want this to be a promotional post, even though it kind of is. We are looking for people who want to host 3B/8B models from the Llama, Gemma, and Mistral model families. We are working towards expanding to Qwen and eventually larger model sizes: https://www.positron.ai/snap-serve
We are running an experiment to test our hardware at $30 a month for 3B models and $60 a month for 8B models. If you have a fine-tune that you want running and can help test our hardware, the first 5 people will get a free month at the 3B size and half off at the 8B size. We are looking for folks to try out the system on this new, non-Nvidia hardware.
This isn't tiny LoRA adapters running on crowded public serverless endpoints: we run your entire custom model on a dedicated instance for an incredible price, with token-per-second rates double those of comparable NVIDIA options.
Would love for some people to give it a try. I know the parameter sizes and model family selection are not ideal, but it's just the start as we continue to expand.
kmouratidis@reddit
How many is that?
No-Fig-8614@reddit (OP)
On comparable hosting hardware, a 3090 or 4090 is around $450 a month, and that's not including all the DevOps work it takes. We expose an OpenAI-compatible endpoint with all the bells and whistles. On this hardware right now we are seeing about 100 TPS at FP16, and FP8 is much faster.

That's for $30/$60 a month, and we plan to add the Qwen family of models soon, plus larger parameter sizes; we see the next class being 20-40B models hosted for around $150 a month, but that's just us noodling around. This is brand new hardware that is still maturing in every respect, which is why we are doing it at $30/$60 for those sizes: we really need folks who want to host fine-tunes and don't want to pay for a month on Nvidia hardware, or who host locally and don't want it consuming the majority of their resources.
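For reference, here is a minimal sketch of how a client would talk to an OpenAI-compatible endpoint like the one described above, using the official `openai` Python package; the base URL, API key, and model name are placeholders, not Positron's actual values:

```python
# Minimal sketch: querying an OpenAI-compatible endpoint with the official
# openai client. Base URL, API key, and model name are placeholders,
# not Positron's actual values.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.example/v1",  # hypothetical URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="my-finetuned-3b",  # whatever name the host assigns your fine-tune
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```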
kmouratidis@reddit
Is that 100 TPS on a single query, or max throughput? And this is your custom hardware? It's probably about the same as or less than a 3090, but I guess the selling point is the lower cost?
Speaking of which, why does a 3090 or 4090 cost $450/month? vast.ai puts the 3090 at ~$220/month. Assuming you don't need a 3090 for a 3B model, you can get T4 16 GB instances (e.g. g4dn) on AWS for as "little" as $130-150/month... or buy a used one and then pay ~$100/month for electricity (assuming it's as expensive as mine!).
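As a rough check on that electricity figure (the wattage and $/kWh below are illustrative assumptions, not numbers from this thread), a card drawing ~350 W around the clock at an expensive ~$0.40/kWh rate lands right around $100/month:

```python
# Rough electricity cost for a GPU running 24/7.
# Wattage and price per kWh are illustrative assumptions, not thread figures.
watts = 350            # approximate sustained draw of a 3090 under load
hours = 24 * 30        # one month
price_per_kwh = 0.40   # deliberately high residential rate, in USD

kwh = watts * hours / 1000                   # ~252 kWh
print(f"~${kwh * price_per_kwh:.0f}/month")  # ~$101/month
```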
I haven't tested this yet (due to account issues), but Oracle offers free A1 instances (up to 4 cores and 24 GB RAM) that could run both 3B and 8B models. It will be much slower, but it's free. And most people can already run most of these models on their own computers anyway, even without GPUs. It might become a hard sell.
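On the "runs without GPUs" point, a minimal CPU-only sketch with llama-cpp-python looks something like the following; the GGUF path and model are placeholders, and the thread count matches the 4-core A1 shape mentioned above:

```python
# Minimal sketch: CPU-only inference with llama-cpp-python.
# The GGUF path is a placeholder; any quantized 3B/8B model would do.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-finetuned-8b.Q4_K_M.gguf",  # hypothetical path
    n_ctx=2048,    # context window
    n_threads=4,   # match the 4-core A1 shape mentioned above
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```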
smahs9@reddit
Do you have throughput numbers in terms of "standard" metrics like pp512 and tg128 at various RPS and batching rates? That would allow for a fair comparison with existing hardware. Using just TPS may make sense for enthusiast use, but not for business use.
Your prices look interesting if the quality is similar and the throughput is similar to or higher than a single vLLM/TensorRT-LLM instance running optimized kernels (w8a8, w4a16 Marlin/Machete, etc.) on a Hopper/Blackwell chip.
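For anyone who wants to run the throughput comparison themselves, a crude way to probe aggregate throughput on any OpenAI-compatible endpoint is to fire a batch of concurrent requests and divide total completion tokens by wall-clock time; the URL, key, and model name below are placeholders, and this measures serving throughput under load rather than llama.cpp-style pp512/tg128:

```python
# Crude aggregate-throughput probe for an OpenAI-compatible endpoint:
# send N concurrent requests and divide completion tokens by wall-clock time.
# Base URL, API key, and model name are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint.example/v1", api_key="YOUR_API_KEY")

def one_request(_):
    resp = client.chat.completions.create(
        model="my-finetuned-8b",
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

concurrency = 16  # vary this to approximate different RPS / batching levels
start = time.time()
with ThreadPoolExecutor(max_workers=concurrency) as pool:
    total_tokens = sum(pool.map(one_request, range(concurrency)))
elapsed = time.time() - start
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} aggregate TPS")
```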