How much will it cost to host something like qwen3.6 35b a3b in a cloud?
Posted by Euphoric_North_745@reddit | LocalLLaMA | View on Reddit | 144 comments
I keep hearing the model is good, but I don't have the hardware for it, and I will wait until the end of the year for the hardware to evolve.
But I still need it for coding, and people are saying qwen3.6 35b a3b is good, so the question now is how much it will cost me to host it somewhere until I get new hardware.
albertgao@reddit
You’ve done zero research on this topic; just asking any LLM would give you an action list you could work through in minutes…
Ollama cloud, opencode cloud, OpenRouter
Euphoric_North_745@reddit (OP)
None of the LLMs will give you the amount of quantization knowledge and install combinations mentioned here. Yes, if I do that research next year, the LLM will have a copy of this Reddit thread and will give me more info. It is a knowledge DB after all, not a human.
albertgao@reddit
You didn’t even try, did you? I copied the same question to Gemini and GPT; both gave me table-based reports with detailed pricing.
Your question is primarily search-based, with a report at the end to combine the knowledge. Why do you think current LLMs couldn't pull it off? What do you think is hard in this case? I could even hand-craft one in record time without the help of GenAI.
It is fine if you are new to this stuff.
Euphoric_North_745@reddit (OP)
I want to talk to humans, you..., humanssss. I can't find anything with AI, and we can all leave Reddit, which is now partially AI anyway.
tracagnotto@reddit
I use r/ShadowPC, so about $30-35/month for me. It's a 16GB machine, so don't expect miracles. I use it for gaming anyway, so testing AI on it came as a natural follow-up.
Glum-Atmosphere9248@reddit
That's an on-demand gaming PC, right? Isn't it against their TOS to do exactly this sort of hosting?
tracagnotto@reddit
It probably was, I guess, but they are advertising it: https://shadow.tech/eu/pro/solutions/ia-dev/
I think it's against the TOS if you use some sort of automation to keep the PC awake while you're AFK, but I'm actually using it and am mostly present on the machine, since I work with said AI.
If you set up a dedicated server you'll soon discover it turns off after a short idle period, and if you use automations to keep it awake, then yes, that's against the TOS.
Glum-Atmosphere9248@reddit
I see, so no 24/7
ANR2ME@reddit
Interesting, their highest tier is only $60/mo with 28GB RAM and 20GB VRAM (RTX A4500), but there is an 8-hours-per-session limitation; fortunately it's not 8 hours per day 😅
upalse@reddit
A single RTX PRO 6000 (or similar 96GB card) is about 30 bucks a day. It can do batch inference at batch size 16-32, with 20-30 tps per stream. Token cost works out to about $0.3/M.
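A quick sanity check on that math, purely as a sketch (the dollar and throughput figures are the ones quoted above, and it assumes the card stays saturated around the clock; idle time makes the per-token cost proportionally worse):

# back-of-envelope cost per token for a rented 96GB card
# (all numbers are the poster's estimates above, not measured values)
dollars_per_day = 30.0
batch_size = 24          # midpoint of the quoted 16-32 concurrent streams
tps_per_stream = 25      # midpoint of the quoted 20-30 tokens/s per stream
seconds_per_day = 24 * 3600

tokens_per_day = batch_size * tps_per_stream * seconds_per_day   # ~51.8M
cost_per_million = dollars_per_day / (tokens_per_day / 1e6)      # ~$0.58/M

print(f"{tokens_per_day / 1e6:.1f}M tokens/day -> ${cost_per_million:.2f}/M")
# with batch 32 at 30 tps this drops to roughly $0.36/M, i.e. in the same
# ballpark as the quoted ~$0.3/M, assuming near-24/7 utilization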
rduito@reddit
Everyone is saying GLM, DeepSeek, etc. These are great, but a $20/month sub gets you a nice chunk of GPT-5.5 (the best sub now, after Copilot changed).
It won't last forever. You really don't want to miss the Codex happy time. (It was Copilot for a long time, and before that Rovodev was 20M tokens/day free for months with top models.)
Euphoric_North_745@reddit (OP)
Codex $200 subscription weekly limit reached in 2.5 days; today I had to pay $40 in credits to get some stuff done.
Must find alternatives.
rduito@reddit
Ooof. Would be great to know what your final solution is and how it compares to using GPT-5.5 (or whatever you were doing via Codex).
rduito@reddit
Sorry. It's localllama, I know
KFSys@reddit
Running something like that in the cloud can get expensive pretty quickly, especially if you keep it running all the time.
For a model that size you’re usually looking at a decent GPU (A100-level or similar), and those are billed hourly. It’s fine if you’re just spinning it up for a few hours here and there, but 24/7 usage adds up fast.
What I’ve done is just use cloud GPUs on demand when I need them. DigitalOcean has options for that and it works well for testing or short runs, but I wouldn’t leave it running full time unless you’re okay with the cost.
For coding specifically, you might be better off using hosted APIs for now and only using cloud GPUs when you actually want to experiment with the model itself.
sigiel@reddit
You're tripping. I can run it on a single 3090 with 64k context at Q5 or Q6; that's about 800 bucks. Two of them and you get Q8, pretty much the same as the original, and that's about 2000 bucks. At 100 t/s you don't need an H100 for Gemma.
KFSys@reddit
If you own a 3090 (or two), of course it’s cheaper long term. But that’s $800–$2000 upfront, which isn’t what I was talking about. I was talking about running it in the cloud without owning hardware. In that case, even a “mid-tier” GPU running 24/7 adds up fast, which was the whole point.
Euphoric_North_745@reddit (OP)
I was looking at Alibaba's cloud this morning; the ad says Model as a Service, but I'm not sure if that means paying per token or per model-hour.
Microsoft Azure, I think, has GPT as a model that can be rented per hour; I have to look at that as well.
rm-rf-rm@reddit
This thread was reported for being off-topic. While that is true in the strictest reading of the sub's purpose, it is an adjacent topic of interest and value to the community, as evidenced by the number of upvotes and comments (it's complementary to running locally, i.e. information on where to do what). We are also sort of the default place for any actual/serious discussion on AI, so approving it, though of course we want to keep such content to a minimum.
gnaarw@reddit
Thank you. Renting a server in the cloud should still be considered safer with regard to privacy than calling API endpoints with all your text right there... Plus, setup-wise it's veeery similar to running llama.cpp/vLLM/whatever on one's own machine. I wouldn't even consider this just adjacent but the very same thing we do at home, minus the rented B200 or similar 😅
rm-rf-rm@reddit
Yeah, it really is a fully continuous spectrum with no hard and fast boundaries. Plus I'm sure in the future we will end up with multi-agent, multi-model workflows that do some inference locally and some in a VPS/cloud.
georgemp@reddit
InferX is pretty good. While they are in beta, they charge $20/model/month. I've been running qwen-3.6-27b-fp8 at full context (262144) without any issues. Their support is also great. The promotional rate of course won't last, but while they have it, it is a great deal.
That said, I don't think the model is anywhere near as good as GLM-5.1. It's good for quick fixes, but not for major changes. Your experience may be different.
GradatimRecovery@reddit
Can run DeepSeek 4? Kimi 2.6?
georgemp@reddit
No. Their GPUs are limited to 70GB of VRAM, so you can only run small/mid-size models.
MrKresi@reddit
How does it work? $20, no extra cost, and I can use it 24/7?
georgemp@reddit
I believe while in beta, yes. But I have no idea how long that is going to last.
datathe1st@reddit
Around $20 USD a month for a more capable model, Qwen 3.6 27B (www.codewithfabric.com).
gpalmorejr@reddit
How bad is your hardware? I run it MoE offloaded on a GTX1060 6GB at 20tok/s. How does your hardware compare to that?
Euphoric_North_745@reddit (OP)
How do I install that on Ollama?
gpalmorejr@reddit
Install what?
Euphoric_North_745@reddit (OP)
"How bad is your hardware? I run it MoE offloaded on a GTX1060 6GB at 20tok/s. How does your hardware compare to that?" ?????????? context 😄 ?
gpalmorejr@reddit
What hardware are you using?
Euphoric_North_745@reddit (OP)
GTX 1080 8GB
gpalmorejr@reddit
Oh sweet. Got like 32GB of RAM?
If so, you can use llama.cpp (Ollama is slow and lacks support for some things) or LM Studio (or Unsloth Studio); these also use the llama.cpp runtime.
Then you can set up MoE offload to split the MoE model between the GPU and CPU in a different way than the normal sequential layer split, which is quite a bit faster.
I generally use LM Studio for ease of use. These are the settings I use:
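(Purely as an illustration of the same idea, not the LM Studio settings above: a llama.cpp launch that does this kind of MoE offload might look roughly like the sketch below, driven from Python. The model filename and numbers are placeholders, and the --n-cpu-moe flag is assumed to be available in a reasonably recent llama.cpp build.)

# illustrative only: start llama-server with all layers "on the GPU" but the
# MoE expert tensors kept in system RAM (values are placeholders)
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.6-35B-A3B-Q4_K_M.gguf",  # hypothetical local GGUF path
    "-ngl", "99",          # offload all layers to the GPU...
    "--n-cpu-moe", "99",   # ...but keep the MoE expert weights in CPU RAM
    "-c", "32768",         # context length; raise it if RAM allows
    "--port", "8080",
]
subprocess.run(cmd, check=True)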
FatheredPuma81@reddit
Don't lol. If you're going to use a model in the cloud you might as well use the subsidized models that are extremely cheap like Minimax, Kimi, or Deepseek. But they'll probably ruin your experience when you eventually get the hardware. You can use Qwen3.6 35B in the Cloud but it's virtually the same price as Minimax M2.7 and more expensive than Deepseek V4 Flash so...
somerussianbear@reddit
I’m trying to spend A DOLLAR A DAY in DeepSeek and boy it is hard
shreddicated@reddit
Are you using flash model only?
somerussianbear@reddit
No, Pro is currently discounted at 75% off so it’s almost the same price.
Finanzamt_Endgegner@reddit
This, this, this. DeepSeek V4 Flash is basically free at this point lol
dbenc@reddit
and if you use batch inference some providers will give you 50% off
tillybowman@reddit
What's batch inference?
Bennie-Factors@reddit
There are 2 main ways batch processing saves money. One is running the batch at off-peak compute times. The other is saving on context switching of processed data, i.e. memory and/or prompt processing.
ImpressiveSuperfluit@reddit
Essentially, it means sending multiple requests to the same "instance" of the model, for lack of better vocabulary. It gets a little muddy because we don't necessarily know exactly how everyone is routing their stuff, and the details get technical and past my knowledge limit. But think about it this way: if two people send requests, they might be using different settings, different connections, different prompts, and whatever else.
Un-spaghettifying that has a cost; having 50 requests come in that share most of those variables makes it a lot easier to saturate the hardware with useful work. And, chances are, you'll mess with the cache a lot less, though I'm unsure if that's priced in, since they specifically mentioned batching, not caching.
Couldn't tell you the exact details on how it works in the average cloud, but memory is virtually always the bottleneck, so if you can keep everything the same (1x memory footprint), but then run 50 generations with those exact settings, then you can use all the compute that was just sitting there without memory to feed it. Thus, it's free compute, at least as far as hardware cost goes. Hence, they'd make it cheaper.
Note: lots of talking out of my ass here, details may vary drastically in a cloud setup. But the general idea should vaguely apply. Probably.
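A toy version of the "free compute" argument in numbers (every value below is invented purely for illustration; real serving stacks are more complicated):

# decode speed for a single stream is roughly limited by how fast the weights
# can be streamed from memory once per generated token
active_params_gb = 20.0      # e.g. a quantized model's weights read per token
mem_bandwidth_gbs = 1000.0   # e.g. a GPU with ~1 TB/s of memory bandwidth

single_stream_tps = mem_bandwidth_gbs / active_params_gb   # ~50 tok/s ceiling
print(f"1 stream:   ~{single_stream_tps:.0f} tok/s")

# the same weight reads can feed many streams at once, so until compute (or
# KV-cache memory) becomes the limit, total throughput scales with batch size
batch = 32
print(f"{batch} streams: ~{single_stream_tps * batch:.0f} tok/s total, "
      f"for nearly the same hardware cost per hour")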
FatheredPuma81@reddit
I find it best to ask Gemini in AI Studio these types of questions I don't know the answer to, because it has powerful search.
Anyway, Gemini says that you're thinking of hardware/local batching. A batch inference API is when you upload a file (maybe files?) containing a bunch of requests, the provider processes it when the servers have low demand, and 12-24 hours later you download the results.
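For what it's worth, the "file of requests" style usually looks something like the sketch below, using OpenAI-style batch fields; other providers name things differently, so treat the field names and model id as placeholders and check the actual docs:

# illustrative sketch of building a JSONL input file for a batch API
import json

requests = [
    {"custom_id": f"req-{i}",
     "method": "POST",
     "url": "/v1/chat/completions",
     "body": {"model": "your-model-name-here",   # placeholder model id
              "messages": [{"role": "user", "content": text}]}}
    for i, text in enumerate(["summarize doc A", "summarize doc B"])
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")
# you then upload batch_input.jsonl, the provider runs it off-peak (often at a
# ~50% discount), and you download the results file when the job finishes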
BrewHog@reddit
Where are you finding that pricing? Genuinely curious to try it out
Finanzamt_Endgegner@reddit
https://api-docs.deepseek.com/quick_start/pricing/
FatheredPuma81@reddit
Yea I just edited my comment about that.
jon23d@reddit
The last time I tried paying for deepseek, I was terribly disappointed. It felt like the model couldn’t follow instructions, and was prone to hanging. I actually still have credits that I’d forgotten about! Has it gotten substantially better?
orph_reup@reddit
V4 flash is great.
LeonidasTMT@reddit
Deepseek works great for me now.
Pacoboyd@reddit
GLM 5.1 for me. Spoiled me rotten.
HornyGooner4402@reddit
Bought the Lite plan a couple of months ago for $36 a year. I get 30 million tokens every 5 hours with no weekly limits, including GLM 5.1, for a couple of months.
Shit is fucking good, it's basically a free LLM for a year.
AnarchistAtHeartt@reddit
$36 a year? Is this offer still available?
HornyGooner4402@reddit
No, unfortunately.
It was $72 ($6 a month) with 50% off the first purchase, so I just went with the yearly. The monthly cost then went to $12, then $18, and they decreased the usage limit from 40M to around ~30M per 5 hours and added weekly limits when they introduced GLM-5 (which also burns twice as many tokens).
RIP Legacy Plan
AnarchistAtHeartt@reddit
Well, nothing good ever lasts. I never knew this plan existed, and like a moron, I paid for the Gemini Pro plan, what a shit show that has become.
HornyGooner4402@reddit
It seems like I've been on a lucky streak: bought local hardware before the RAM price hike -> got a decent yearly subscription for a cloud LLM -> local LLMs finally got good enough for coding and agentic tasks before my plan ends.
Ven_is@reddit
What’s your next move man
HornyGooner4402@reddit
all in on black
CloudProvided@reddit
LET IT RIDE
evandena@reddit
Can you tell which model you're using in Claude Code?
seeKAYx@reddit
Those were the prices before GLM 5 was released. Now it costs several times as much.
The_2nd_Coming@reddit
wtf lol 30m tokens every 5 hours for $36 a year?!
HornyGooner4402@reddit
Well, it was a 40M hard limit for a bit; now it's between ~30-35M when not in peak hours.
The_2nd_Coming@reddit
Jesus that's mad. You might need to change your username to TokenGooner!
cunasmoker69420@reddit
Not for long though:
_bones__@reddit
One reason to self host is to avoid sending all your data to an American or Chinese provider. This can include credentials, keys, etc.
Cheap? Yes, they get to mine your shit.
FatheredPuma81@reddit
His post says he's already using the cloud and is going to be using it until he buys the hardware to go local...
Euphoric_North_745@reddit (OP)
OK, I am interested; any of them could be a good idea. As for tokens, my daily session with Codex CLI is about 1 to 2 trillion tokens: a few million plus the trillion cached, whatever that means; that is what Codex shows in its stats.
How can I get DeepSeek Flash, for example? Or MiniMax? Just buy API access? The coding tools burn a dollar every few minutes on Kimi K2.6, for example; I tested one.
vtkayaker@reddit
If you want to try a slightly older coding model that's super cheap to run, try a hosted GLM 4.5 Air or GPT-OSS 120B. Some places still host these, and I've rented GLM 4.5 Air for literally pennies before. No, they won't match Opus 4.7 (or even Sonnet 4.5).
(Also, I find that pi.dev starts with only a couple thousand tokens in context and works surprisingly well with smaller models, which can save some money when making small changes and clearing context between each change.)
FatheredPuma81@reddit
Use OpenRouter? It lets you use any model you want through one API.
I think wherever you're using Kimi K2.6 must be really expensive, because I can't see you burning $1 without hitting the context limit at least 2 or 3 times...
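For the "just buy API?" part, it really is one OpenAI-compatible endpoint. A minimal sketch with the Python openai client, using the DeepSeek Flash slug quoted elsewhere in this thread (verify the exact model id and pricing on openrouter.ai before relying on it):

# minimal sketch: talking to OpenRouter through the standard openai client
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    # model slug as quoted elsewhere in the thread; check the listing first
    model="deepseek/deepseek-v4-flash",
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
)
print(resp.choices[0].message.content)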
overand@reddit
Are you sure you don't have the hardware for it?
What are your hardware specs?
Euphoric_North_745@reddit (OP)
1080 😄 I paid $1,000 for it back then 😄 bragged about it, and then it just became ooooold 😄
Hot_Turnip_3309@reddit
Run the model on that, it'll be fine, 20-30 tk/sec.
Euphoric_North_745@reddit (OP)
35B on 8GB? Doesn't it need at least 40 to 50 GB to produce a good result?
overand@reddit
The 35B model is a "mixture of experts" model; it only activates about 3B parameters per token. If your system has enough RAM (preferably 64 GB, ideally DDR5, but I run all my stuff on a DDR4 system), you can probably get it working usably. Not fast, but usable, depending on the tasks.
Hot_Turnip_3309@reddit
Expert offloading means you only need 6-8GB of VRAM, with the rest in CPU RAM. Anything with at least 16GB of CPU RAM will run it. You're missing out. -ncmoe 999 is the parameter in llama.cpp.
Awwtifishal@reddit
The shared parameters fit in an 8GB GPU, while the experts can be offloaded to the CPU.
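Rough numbers on why that works (the bits-per-weight figure is an approximation for a typical Q4-class GGUF, not an exact file size):

# why a "35B" MoE doesn't need 40-50GB: quantized size plus the active split
total_params_b = 35      # total parameters, billions
active_params_b = 3      # parameters used per token (the "A3B" part)
bits_per_weight = 4.5    # roughly Q4_K_M-class; varies by quant

total_gb = total_params_b * bits_per_weight / 8     # ~19.7 GB for all weights
active_gb = active_params_b * bits_per_weight / 8   # ~1.7 GB touched per token

print(f"whole model  ~{total_gb:.1f} GB -> fits in 32GB of system RAM")
print(f"per token    ~{active_gb:.1f} GB -> why CPU offload stays usable")
# shared/attention layers plus KV cache go on the 8GB GPU; the expert weights
# sit in system RAM and only a small slice of them is read for each token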
Naiw80@reddit
I run Qwen 3.6 with 28GB (sharded between an RTX 4080 and a Tesla P100), 128K context at 60 t/s; quality is fine.
Naiw80@reddit
Sorry I meant RTX 4070… I blame the phone :)
wasnt_in_the_hot_tub@reddit
It's 35B A3B: 3 billion active parameters
fuelburning@reddit
It's not great, but I am running it on a 1070 8GB, offloading most of the model to DDR4, and it runs surprisingly well. Not blazing fast, but you can set a task in opencode and just check on it later.
Toshik777@reddit
Just curious, how many t/s do you find to be "surprisingly well" in your case?
fuelburning@reddit
total duration: 3m31.255608919s
load duration: 173.173271ms
prompt eval count: 253 token(s)
prompt eval duration: 2.424034874s
prompt eval rate: 104.37 tokens/s
eval count: 1144 token(s)
eval duration: 3m28.100890445s
eval rate: 5.50 tokens/s
fuelburning@reddit
Running on Ollama since it still supports ancient CUDA and offloading from the GPU, but here are some numbers.
Small prompt:
total duration: 55.900583463s
load duration: 235.805594ms
prompt eval count: 113 token(s)
prompt eval duration: 1.930217202s
prompt eval rate: 58.54 tokens/s
eval count: 298 token(s)
eval duration: 53.580469275s
eval rate: 5.56 tokens/s
ea_man@reddit
Man, you can run the A3B and the 27B on a 16GB GPU (just 100k context for the 27B; if you want more, you buy 2x or a 24GB card). You can get one used for like $300 and resell it when you are done.
e979d9@reddit
Where would you find a 16GB GPU for $300? Asking for a European friend.
ea_man@reddit
I bought a used AMD 6800 for €260 two weeks ago; I just put new thermal paste in it :)
Usually they go for ~€290 around here; I guess you can offer a little less.
Hylleh@reddit
Sorry, can I ask what quant you are running the 27B with on a 16GB GPU? I always thought you needed a 24GB GPU for a 27B dense model.
ea_man@reddit
You run as big a quant as you can for the context length you want to have.
Anyway, I would recommend https://huggingface.co/cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF for up to 110K context at q4_0.
Randomshortdude@reddit
Fairly cheap, honestly. Cheap enough that you may want to consider just outright purchasing the necessary hardware. Off the top of my head, an RTX 3090 is the cheapest option with 24GB VRAM (not sure how quantization efforts go with MoE models, but you should be able to get it to fit there with sufficient room for a solid context window). Alongside the RTX 3090, you'll need a solid enclosure (for the eGPU setup). That's gonna run you about $150-200 on the cheap end of things (shouldn't be too hard to find, or require too much bargain hunting to stumble across listings in that range for legit products). You'll also need an external PSU (probably 850W or more). Right now, you can scoop a solid one up off eBay for about $100 or so. You may need to shell out a few extra bucks for some connectors/dongles/adapters if you don't have them already (although these might come with the aforementioned products on your purchase list). Assuming you do, tack on another $40 to your bill. So altogether, we're looking at $800 + $170 + $100 + $40, which comes out to roughly $1.1k total. I don't know what your budget looks like, but if you're looking at hosted server options, then you were probably anticipating that the upfront cost was going to be greater than that. But that's really all it takes if you want to be able to leverage local inference for models that are roughly ~32B params or less.
Compare that to renting a server, which is going to run you approximately $100 or so a month, give or take (for a decent one like an A10, which has 24GB VRAM and should be sufficient for your purposes). At $100/month, you'll exceed the total sunk cost of the alternative at-home hardware investment in less than a year. So it's all up to you when it comes to evaluating whether this is 'worth it' or not. If you can't afford to dole out that lump sum out of the gate and you need to get your hands on something comparable to local inference for the sake of running that Qwen model ASAP, then I'd go ahead and rent an A10 from one of the popular GPU VPS neo-cloud providers out there (don't wanna name any names bc that might be against the rules, but I'm sure you can find some).
But yeah, that about sums it up if you're looking for a breakdown of the economic cost(s) of your available options.
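If it helps, the break-even point in that comparison is easy to check with the rough figures above (your actual prices will differ):

# break-even between buying the ~$1.1k eGPU setup and renting an A10-class
# server at ~$100/month (both figures are the rough estimates from above)
upfront_hardware = 800 + 170 + 100 + 40   # 3090 + enclosure + PSU + cables
rental_per_month = 100

months_to_break_even = upfront_hardware / rental_per_month
print(f"hardware cost: ${upfront_hardware}")
print(f"break-even after ~{months_to_break_even:.0f} months of renting")
# after roughly 11 months the rental has cost more, and you own nothing you
# can resell, which is exactly the "is it worth it" trade-off described above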
Kahvana@reddit
First of all, you can try out qwen3.6-25b-a3b on chat.qwen.ai and see if you like it, do the same for qwen3.6-27b. If you like it in the chat interface from Qwen, then consider buying the hardware.
You can bridge the gap by using DeepSeek over API. DeepSeek V4 Flash is very affordable and supports a large context window. qwen3.6-25b-a3b over API isn't worth it for the prices offered.
SM8085@reddit
openrouter.ai/qwen/qwen3.6-35b-a3b: $0.16-0.23 per million input, $0.9653-1.80 per million output.
But compare DeepSeek V4, as people are mentioning. For Pro (openrouter.ai/deepseek/deepseek-v4-pro), the DeepSeek endpoint is $0.435/M input, $0.87/M output, so Pro actually has cheaper output.
Flash is $0.14/M input, $0.28/M output: openrouter.ai/deepseek/deepseek-v4-flash
And then there's openrouter.ai/qwen/qwen3.6-plus at $0.325/M input, $1.95/M output under 256K context, very close to the more expensive 35B-A3B hosts.
If you were renting a DigitalOcean droplet to do the same thing, Qwen3.6-35B-A3B-Q8_0 would take over 50GB of RAM, so you'd presumably need something with 64GB,
which is like a dollar an hour, and not great speeds.
You would need one of their GPU droplets to get decent speeds. If you can snag a GPU droplet with an RTX 6000 it's only $1.57/hour, but those are also currently fully rented.
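Putting the rental option next to the per-token option, with the prices quoted above and an assumed monthly volume (the volume is a guess, not a measurement):

# renting the RTX 6000 droplet 24/7 vs paying OpenRouter per token
gpu_hourly = 1.57
gpu_monthly = gpu_hourly * 24 * 30        # ~$1,130/month if left running

input_price, output_price = 0.20, 1.00    # rough $/M near the cheaper end of
                                          # the 35B-A3B range quoted above
in_tokens_m, out_tokens_m = 300, 30       # assumed heavy coding month
api_monthly = in_tokens_m * input_price + out_tokens_m * output_price

print(f"dedicated GPU droplet: ~${gpu_monthly:,.0f}/month")
print(f"OpenRouter at that volume: ~${api_monthly:,.0f}/month")
# unless you keep the card busy around the clock (or need privacy), the
# per-token route wins by a wide margin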
Euphoric_North_745@reddit (OP)
Wow, I am not ready for this pricing 😄 😞
SM8085@reddit
My CPU LLM rig that runs at about the same speed cost me less than the monthly price for the droplet in the screenshot.
So it almost always makes sense to buy hardware rather than using something like DigitalOcean/etc. for inference.
The only rational time to use those services would be if you just needed it for a short time to test something private.
Otherwise OpenRouter seems better.
It looks like 'runpod' is now calling itself Lambda (https://lambda.ai/pricing). The cheaper GPUs seem like a good deal compared to DigitalOcean.
$0.79/gpu/hour seems alright. Could probably load an MoE into the VRAM and then the rest into that ample RAM.
zxyzyxz@reddit
Plus the fact that models are getting smaller, better, faster, stronger every passing day; just compare with last year around this time, or even 6 months ago. Hardware doesn't improve nearly as fast as models do over the same time period.
jazir55@reddit
https://api-docs.deepseek.com/quick_start/pricing/
Do not go with OpenRouter for DeepSeek, muuuuuuuch cheaper directly from their own API.
rbur0425@reddit
I wouldn't use OpenRouter because providers there serve quantized versions of the models, as proven by Moonshot, the makers of Kimi.
Hofled@reddit
Be aware that the current DeepSeek V4 Pro pricing is 75% discounted by DeepSeek and will be offered at this price only until 2026/05/31 according to their official website: https://api-docs.deepseek.com/quick_start/pricing/#model-details
I assume that after that point the OpenRouter prices will reflect the non-discounted price as well.
Warsel77@reddit
Honestly, if you're already going with a hosted model you might as well use Codex or Claude Code instead. They are still better, and hosting Qwen 3.6 for a longer period isn't free either.
OmarBessa@reddit
24/7? between 150 and 200 bucks
BannedGoNext@reddit
RIGHT NOW, hosting a model in the cloud is almost never worth it unless you're testing it out to plan your own hardware purchase for a well-defined use case. But you are looking at a couple of dollars an hour.
Adventurous_Papaya87@reddit
20-30 dollars a day for dedicated?
MasterLJ@reddit
You can do ephemeral workloads for like $1.00 - $4.00 USD per hour on platforms like Modal, Runpod, vast.ai, AWS etc.
If you want privacy over your own models this is the way to go and if you set it up with some tuning the coldstarts are pretty snappy (Modal has great documentation and tooling).
If this is something you're doing at volume then the hosted API to the same models is cheaper.
ElectronSpiderwort@reddit
It seems like you would get consistency over time and avoid rug-pulls and silent downgrades with this method. Haven't tried it, but sounds valuable
Finanzamt_Endgegner@reddit
Honestly, if you don't need privacy you should just use DeepSeek V4 Flash; that thing is literally dirt cheap. If you want privacy, though, then you should go fully local. IMO renting inference power from the cloud ain't worth it, but that's just my 2 cents.
Middle_Bullfrog_6173@reddit
Renting can make sense when running large batches, especially with small models, which tend to be relatively overpriced in APIs. For peaky interactive work it's expensive.
Finanzamt_Endgegner@reddit
Well yeah, but why would you run batched small models when there are cheaper big ones, except for benchmarking? I mean, sure, if you need to benchmark at FP16, but other than that I don't really see the point, especially since you don't get the benefits of true local inference like privacy and independence.
Middle_Bullfrog_6173@reddit
I mean say you have 100k texts you need to generate summaries for. Cheapest is to rent some hardware and run your own inference using a small model that is good enough. Unless you happen to have a lot of hardware and time to run it locally.
Euphoric_North_745@reddit (OP)
Codex CLI reduced their internal limits; it will now cost me $400 to $600 a month to use it (2 or 3 subscriptions). They got everyone addicted with the $200 subscription and now it lasts 2 days a week, but the rest of the week is still there and we have to work 😞. So if a hosted LLM can do the job, even better.
ohhi23021@reddit
Maybe, but the API without a sub would cost you $3-5K a month, which is probably more on par with the real cost. So enjoy the cheap $200 subs or use a cheaper model. Running locally is only for uncensored models and/or private-data type stuff.
Euphoric_North_745@reddit (OP)
I am looking for an AI instance billed per hour, to use for 5 or 6 hours a day.
Finanzamt_Endgegner@reddit
Well, you probably won't get Codex quality, but something like 90% of it. TBH, if you're willing to pay, say, 100 dollars and don't have issues with DeepSeek getting your data, just use their Pro variant, at least until its promo is over; it's a bit better than Flash and atm still dirt cheap. After that, either keep using it at a slightly higher price or go with Flash. You can literally shove hundreds of millions of tokens through Flash and it will cost you a few dollars lmao.
extopico@reddit
Hm, that model, when quantized to i4 (4-bit), can almost run on a potato, unless you want full context; that takes up RAM.
Euphoric_North_745@reddit (OP)
I also want it to think better than a potato 😄
extopico@reddit
Oh, the i4 quant from Unsloth is perfectly fine. I use it even though I could go bigger, because it's good and fast. I have enough RAM for its full context.
Euphoric_North_745@reddit (OP)
Can it call tools? How long before it gets confused?
extopico@reddit
I use it with Nous Research Hermes; it calls all the tools perfectly fine. I do not know how long it takes before it gets confused; I am not doing much heavy lifting yet, so I cannot tell what is confusion and what is a basic lack of ability.
AnomalyNexus@reddit
If you're using the cloud anyway, go for an API rather than a self-hosted model.
APIs can achieve higher usage density, and thus better economics, than a model rented just for you that sits idle 90% of the time.
jonnywhatshisface@reddit
Make life easier and buy a Mac…
I’ll prepare myself for the flame wars. 😂
But seriously - for the cost of running it in the cloud you could just buy a laptop that can run many models.
If you need a solution while waiting, then just install opencode and use their big pickle model. It's free but rate-limited with a 16-hour gap. If you put a few bucks in your zen account the rate limits are 5x higher and reset every 5 hours. You only need about $20 in the zen account to get massive free usage of big pickle. But be aware that the model IS being trained on your code. That's why it's free.
With totally free usage (no money in zen at all and not even signed in) I get about 3-4 hours of usage from it for free before the 16-hour limit kicks in. The model is surprisingly good, but some days it has a bit of an "off day." You can tell when they're testing more heavily quantized versions of it under the hood.
Steus_au@reddit
You can run it for as low as $0.24 per hour on RunPod, only when you need it. When it's stopped, it's just $0.01 per hour. I do it all the time.
yellow_golf_ball@reddit
I'm testing Qwen3.6-35B-A3B-FP8 on an A100 (80GB VRAM) in Azure and it's $3.673/hr = $88.15/day = ~$2,644/month.
nunodonato@reddit
That's what I pay for an H200 lol, Azure is crazy expensive.
Euphoric_North_745@reddit (OP)
Tokens per second, please? This info will be important for deploying it for a business. And how many concurrent requests?
merica420_69@reddit
I think with Ollama Pro you can upload models. I know it's $20 a month but I don't know the details.
Danwando@reddit
Ollama performance is terrible and unreliable, with t/s varying from 0.1 to 30 within a single query.
BumbleSlob@reddit
Do not give Ollama any recommendations. Trash software from scumbag trash developers who steal actual OSS code and try to slap their branding on it.
gaspoweredcat@reddit
Just go with OpenRouter or something; you can run whatever you like on there and many models are cheap as chips. But, as others say, DeepSeek V4 Flash is stupid cheap; it's hard to argue with. Do the donkey work with that and tidy up with a SOTA model at the end.
kmouratidis@reddit
Zero, if you're okay with a very slow CPU/RAM. Oracle cloud offers a free tier which fits a ~Q4 quantized Qwen*-30B-A3B, but it will likely be slower than running it on a used gaming PC or even a mini PC.
pacmanpill@reddit
Why? It would cost you a lot of money. Use it through OpenRouter if you don't care about privacy.
siegevjorn@reddit
Google Colab has some GPUs you can try. They recently added the RTX PRO 6000 (G4).
pixelizedgaming@reddit
Via API or through a rented GPU?
The API probably costs near nothing; a rented GPU would probably be like 20 cents/hr?
Savantskie1@reddit
Maybe a year ago lol
dmigowski@reddit
Don't. I did this, and the model is only good if you can't already code. For the real thing you need frontier models or >300B-parameter models. Either pay for hosted access to them or pay 100k to buy the hardware and host them yourself.
dumbass1337@reddit
Why would you want to host it over just using what's out there?
dataexception@reddit
What sub are you expecting to be in?
BitGreen1270@reddit
Assuming you're using vast.ai with a 3090, that would be roughly $0.20 USD/hr. You'll need to spend some time making a custom container and experimenting a bit. After that it will probably require about 10 minutes of startup time every time you rent an instance. It will depend on how many hours you use it every day.
ea_man@reddit
I guess that for people who don't run tasks 24/7 it may still be a sound option, even more so if you are not writing code all week and are often away.
BitGreen1270@reddit
Don't know if the economics work for OP if he's coding like 10 hours every day. There's also the added effort of fine-tuning the params for llama-server, dealing with KV cache depletion, sudden crashes, flaky instances, etc. It works for me as an enthusiast and someone who only plays with it over the weekend, but it might become frustrating if it keeps getting in the way of your main job.
ea_man@reddit
Oh, OP just has to find a cheap API; if your job is coding 10 hours a day, you don't use Qwen A3B.
Sirius_Sec_@reddit
I'm running the 27B for about $1 an hour renting an RTX 6000. I easily spend that in API usage when I'm doing heavy coding work. Plus I'm not giving my private info to any big company.
running101@reddit
Where do you rent from?
Sirius_Sec_@reddit
I already have a GKE cluster, so I just set up a GPU node for vLLM to use. Spot pricing is 99¢ an hour. It's pretty stable, and when it's not available I just use an external API.
dumbass1337@reddit
You can rent an 80GB A100 for like 700 dollars a month.
swizzex@reddit
It depends on usage, so that's hard to say, but it's very cheap compared to buying hardware.