Best Cloud GPU / inference option / costs for per hour agentic coding

Posted by AdSuccessful4905@reddit | LocalLLaMA | View on Reddit | 4 comments

Hey folks,

I'm finding Copilot is sometimes quite slow and I would like to be able to chose models and hosting options instead of paying the large flat fee. I'm part of a software engineering team and we'd like to find a solution... Does anyone have any suggestions for GPU Cloud hosts that can host modern coding models? I was thinking about Qwen3 Coder, and what kind of GPU would be required to run the smaller 30B and the larger 480B parameter model- or are there newer SOTA models that outperform that as well?

I have been researching GPU Cloud providers and am curious about running our own inferencing on https://northflank.com/pricing or something like that... Do folks think that would take a lot of time to setup and that the costs would be significantly greater than using an inferencing service such as Fireworks.AI or DeepInfra?

Thanks,
Mark

[-]

punkrock3000@reddit

Did you end up going with one of these? How’d it go?

carl_peterson1@reddit

Co-founder of Thunder Compute here, honestly this depends on your throughput. If the inference server is underutilized you'll be better off paying per token with something like baseten or AWS bedrock. If utilization is high check us out, we have competitive prices for H100s ($1.38/hr) and A100s ($0.78/hr)

Theio666@reddit

If you want 100% match for the model, glm/minimax/kimi have coding plans, and with these you'll be sure that you're getting as accurate api as possible.

If you want to risk a bit but have more flexibility with model choise(or where you wanna use it), nanoGPT and chutes have most open source models with 60k queries month/2k queries a day. But you might get problems, I personally don't see issues with GLM via nanogpt, but kimi k2 seems to be fucked in one or another way among all 3rd party providers.

If you have low token usage then openrouter for token based billing, but it will be more expensive than both mentioned options.

Betadoggo_@reddit

Because of batching and general market forces specific model hosts will always be cheaper than running the models yourself on rented gpus. https://openrouter.ai/ is what most use. Avoid using 3rd party providers where possible, most of them (especially the cheaper ones) have degraded models.