Best GPU setup for under $500 USD
Posted by milesChristi16@reddit | LocalLLaMA | View on Reddit | 77 comments
Hi, I'm looking to run an LLM locally and I wanted to know what would be the best GPU(s) to get with a $500 budget. I want to be able to run models on par with gpt-oss 20b at a usable speed. Thanks!
redoubt515@reddit
Can't "run models on par with gpt-oss 20b at a usable speed" already be achieved with a $0 GPU budget?
I run Qwen3-30B-A3B purely from slow DDR4. No GPU, not even DDR5 (not even fast DDR4 for that matter), at what I would consider usable but lackluster speeds (~10 tk/s).
What would you consider "usable speed"?
RelicDerelict@reddit
What system do you have?
redoubt515@reddit
Echo9Zulu-@reddit
A rig like this screams OpenVINO
redoubt515@reddit
How so? (I'm not familiar with OpenVino)
Echo9Zulu-@reddit
Check out my project OpenArc; there is discussion in the README about OpenVINO.
CPU performance is really solid on edge devices, definitely worth a shot
redoubt515@reddit
I'll look into it. Thanks for the recommendation
Echo9Zulu-@reddit
No problem. Also, I forgot this post was about GPUs lol. The B50 could be a good choice for low power on a budget with Vulkan until OpenVINO support gets implemented in llama.cpp, which has an open PR.
Commercial-Clue3340@reddit
Would you share what software you use to run it?
redoubt515@reddit
llama.cpp (llama-server) running in a podman container (podman is an alternative to docker). Specifically this image: ghcr.io/ggml-org/llama.cpp:server-vulkan
I was previously using Ollama, but it lacks Vulkan support so inference was pretty slow on CPU/iGPU only. Switching to llama.cpp led to a meaningful speedup.
The specific model is: unsloth/Qwen3-30B-A3B-Thinking-2507-UD-Q4_K_XL.gguf
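If anyone wants to script against that setup: llama-server exposes an OpenAI-compatible HTTP API, so something like this works as a quick client. A minimal sketch, assuming the container publishes llama-server's default port 8080 on localhost (adjust to however you mapped it):

```python
# Minimal client for llama-server's OpenAI-compatible endpoint.
# Assumes the podman container publishes port 8080 (llama-server's default).
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "In two sentences, what is a MoE model?"}
        ],
        "max_tokens": 256,
        "temperature": 0.7,
    },
    timeout=600,  # CPU-only generation can be slow, so allow plenty of time
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```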
Commercial-Clue3340@reddit
Sounds cool thank you
Wrong-Historian@reddit
At least 200 T/s on prefill. Token generation doesn't even matter that much, but to have 'usable speed' you need fast context processing. Preferably much higher than 200 T/s (which is the absolute bare minimum), ideally >1000 T/s. You're not going to process a 50k context at the ~50 T/s you get on CPU DDR4; that's over 16 minutes of waiting before the first generated token.
redoubt515@reddit
> At least 200 T/s on prefill.
It's true PP is quite slow on CPU only, but it also seems we have very different conceptions of the word "usable". To me it means the absolute minimum necessary for me to be willing to use it, i.e. 'acceptable, but just barely' (as in: "How's that Harbor Freight welder you bought?" "Eh, the duty cycle is shit, it's far from perfect, I wouldn't recommend it, but it's usable").
It also seems we use LLMs in different ways. From context, it seems you use them for coding, which is a context where prompt processing speed at very long context lengths matters a lot. That isn't how I use local LLMs, and I'm not sure what OP's use case is.
> even a simple GPU like a 3060Ti will give much faster context processing
Slow prompt processing on my system can certainly be tedious and not ideal--I'd add a GPU if the form factor allowed for it, but it is still definitely usable to me.
"usable" is the best I can hope for until my next upgrade which is probably still a couple years off. At that point, I'll probably go with a modest GPU, for PP and for enough VRAM to speed up moderate sized MoE models or fully run small dense models. I derive no income from LLMs, its purely a hobby.
roadwaywarrior@reddit
My pp goes in seconds
hainesk@reddit
If you can spend $700 you might be able to find a used 3090. That would get you good performance and 24gb VRAM. Otherwise you might just try CPU inference with gpt-oss 20b.
DistanceSolar1449@reddit
If he's trying to run gpt-oss-20b then he's better off with a $500 20GB 3080 from China. Just 4GB less VRAM than a 3090, but 2/3 the price. About $100 more than a regular 3080.
It’ll run gpt-oss-20b or even Qwen3 30b a3b.
koalfied-coder@reddit
Naw I would pay the extra 200 for the stability and VRAM easily. Them 3080s are a bit sketchy
DistanceSolar1449@reddit
VRAM amount, yes; stability, no. It's the same places in Shenzhen that put the original VRAM on the PCBs, so quality and failure rates won't be much different from the original. The Chinese 4090 48GBs people have been buying for the past year have been fine as well.
Keep in mind these are datacenter cards; they're made for 24/7 use in a Chinese datacenter (because they couldn't get their hands on B100s). The fact that a few of them get sold in the USA actually isn't their main purpose.
koalfied-coder@reddit
Do you have a machine I could SSH into to test those cards? As a proud owner of Galax 48GB 4090s I can attest that some Chinese cards are great. Those particular cards and the limited uplift make me say to just get a different card. The quality outside of Galax I really find lacking, and I don't see any other company coming to the plate outside of modders.
DistanceSolar1449@reddit
Not 3080s no
Any-Ask-5535@reddit
I run Qwen 30b a3b fine on a 12gb 3060 and didn't buy some sketch card from Alibaba with weird driver support
Icy-Appointment-684@reddit
I wonder if getting three 32GB MI50s from Alibaba for around $450 would be better than a 3090?
That's 96GB vs 24GB
koalfied-coder@reddit
Personally no, as the compatibility and speed are not there.
DistanceSolar1449@reddit
Easy answer. Nvidia 3080 20GB from china
https://www.alibaba.com/x/B0cjeg
https://www.alibaba.com/x/B0cjeo
Unlucky-Message8866@reddit
Love this, are any of them legit?
Lissanro@reddit
Those cards do actually exist, yes. You can check the seller's reputation and how long they have been in business. There is one caveat though: shipping cost may not be included, and depending on where you live, you may have to pay not just shipping but also a freight forwarder, and possibly customs fees.
For example, for me, the modded 3080 with 20 GB from Alibaba would end up costing close to a used 3090 that I can buy locally without too much trouble. But like I said, it depends on where you live, so everybody has to do their own research to decide what's the best option.
For the OP's case, just buying a 3060 12GB may be the simplest solution; small models like Qwen3 30B-A3B or GPT-OSS 20B will run great with ik_llama.cpp, with their cache in VRAM and whatever doesn't fit on the GPU running on the CPU. I shared details here on how to set up ik_llama.cpp if someone wants to give it a try. Basically, it is based on llama.cpp but with additional optimizations for MoE and CPU+GPU inference. Great for a limited-budget system.
In case extra speed is needed, buying a second 3060 later is an option; then such small MoE models would fit entirely in VRAM and run at even better speed. If buying used 3060 cards, it may be possible to get two for under $500, but that depends on local used-market prices.
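For anyone who would rather poke at the same idea from Python instead of the ik_llama.cpp CLI, here is a rough sketch using plain llama-cpp-python (not ik_llama.cpp itself): it keeps some layers on the GPU and leaves the rest on CPU. The model path and n_gpu_layers value are placeholders you would tune for a 12GB card:

```python
# Rough partial-offload sketch with llama-cpp-python (built with GPU support).
# This is plain llama.cpp layer offloading, not ik_llama.cpp's MoE-specific
# optimizations; the path and layer count below are placeholders to tune.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q4_K_M.gguf",  # placeholder GGUF path
    n_gpu_layers=24,   # as many layers as fit in 12GB; the rest stay on CPU
    n_ctx=8192,        # modest context to leave VRAM for the KV cache
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```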
koalfied-coder@reddit
This is the proper response. They exist and are fine for the purpose, but better options are out there.
DistanceSolar1449@reddit
Yes, alibaba sellers are generally legit. Just expect long shipping times.
Unlucky-Message8866@reddit
Sorry, the question was not for you; you have zero trust from me given you are the one advertising it and have a month-old account.
Every-Most7097@reddit
No, this is all scams and garbage. Read the news: people are trying to smuggle cards into China and getting caught. If these existed they wouldn't have to smuggle them in hahaha
DistanceSolar1449@reddit
There's a big difference between smuggling brand new $35k B100s into China versus buying old GPUs like the 3080 or 4090 out of China.
https://www.tomshardware.com/news/old-rtx-3080-gpus-repurposed-for-chinese-ai-market-with-20gb-and-blower-style-cooling
DistanceSolar1449@reddit
I'm in the USA lol and not connected to any seller. Chinese gpus just tend to be the best deal at certain price points (below $500). I just make new accounts frequently because it's good internet hygiene. Old accounts are easier for governments to track.
Every-Most7097@reddit
These are all scams. China has been trying to get their hands on good AI cards. People literally fly here to smuggle our cards into China due to their lack of cards. These are all trash and scams.
redwurm@reddit
Dual 3060 12gb can be done for under $500.
koalfied-coder@reddit
Ye if OP cannot yet spring for a 3090 this is the way
ccbadd@reddit
You might look at an AMD MI60 and run it under Vulkan. It's a 32GB server card, but you would have to add cooling as it does not have a fan. They are generally under $300 on eBay.
Psychological_Ear393@reddit
You can get the 32GB MI50s for $130 USD on Alibaba.
ccbadd@reddit
Cool. I try not to order stuff like that from China especially when they list something for $150 and then have $200 in shipping.
No_Efficiency_1144@reddit
Used 3090 if you can
im-ptp@reddit
For less than 500 where?
redoubt515@reddit
nowhere
No_Efficiency_1144@reddit
It's a currency issue: I see £500, which is about $700 USD, so we are seeing the same price.
redoubt515@reddit
Thanks for clarifying.
Just to be clear though, OP's stated max budget is $500 (USD), which would be roughly £375 (GBP)
No_Efficiency_1144@reddit
I see, this budget is not working yeah
Marksta@reddit
It just depends on the "eye out" price vs the "I need it right now" price. Yeah, it's definitely $800 each if you want to pick up 4 of them shipped at a moment's notice. But sit around with notifications or check local listings and things change up.
Herr_Drosselmeyer@reddit
5060ti 16GB is the best you can get new for under $500. Should run gpt-oss 20b fine.
our_sole@reddit
Agreed. I found that the 5060ti has about the best price/performance ratio, at least right now.
I picked up a new MSI Gaming OC 5060 Ti 16GB on Amazon for around $550 I believe. Don't get the 8GB version of this btw.
I have the GPU on a Beelink GTi14 Intel Ultra 9 185H with 64GB of DDR5 and an external GPU dock, and have been happy with it.
I have run models up to 30b with decent results, including gpt-oss 20b. 1440p gaming is also good.
Cheers
HopkinGr33n@reddit
Most people in these threads are focused on consumer cards, perhaps for good reasons. However enterprise server cards were designed for this kind of workload.
You're going to want as much VRAM as you can squeeze into your available space and power profile. You're going to want to run more than just the model you're thinking about now, and probably helper models for prompt management and reranking too. I promise, VRAM will be your biggest bottleneck and biggest advantage unless you're an advanced coder or tinkerer with superpowers, and even then, wouldn't you prefer to spend some time with loved ones? Run out of VRAM and, generally, you've got no result and none of the speed benchmarks count.
Also for single stream workloads and normal sized (e.g. chat sized, not document sized) prompts, FP performance and tensor cores matter much less than memory bandwidth.
Check out a Tesla P40: with 24GB (double a typical 3060/3080), more CUDA cores than a 3060, and comparable clock speeds and memory bandwidth to a 3060, these things are workhorses and within your budget range, though the 250W power draw can be trouble. If you're not switching models a lot, I think you'll find the P40 to be a very reliable inference companion.
Also, server cards are passively cooled if you care about noise. Though to be fair, I’ve only put these things in servers that are designed for that, and make a helluva racket on their own anyway. I’ve no idea how hot a P40 would run in a desktop PC.
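To put some numbers on the bandwidth point above: each generated token has to read roughly the whole set of quantized weights for a dense model, so tokens/s is capped at about bandwidth divided by model size. A back-of-envelope sketch with approximate spec-sheet values (real-world throughput is lower):

```python
# Back-of-envelope decode-speed ceiling: memory bandwidth / bytes read per token.
# Bandwidth figures are approximate spec-sheet values, not measurements.
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

model_gb = 13.0  # e.g. a ~13 GB Q4 quant of a ~24B dense model (illustrative)
cards = [
    ("Tesla P40 (~346 GB/s)", 346.0),
    ("RTX 3060 12GB (~360 GB/s)", 360.0),
    ("RTX 3090 (~936 GB/s)", 936.0),
]
for name, bw in cards:
    print(f"{name}: ~{tokens_per_sec_ceiling(bw, model_gb):.0f} tok/s ceiling")
```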
smayonak@reddit
I use a quant of gpt-oss-20b that is fine-tuned for coding, on a non-production machine with a cheap 6GB Intel card and 32GB of RAM, and it runs great. There's about 10 seconds of lead time before it starts answering, but the speed is around 6 tokens/sec. It's really good.
QFGTrialByFire@reddit
Hi, where did you get the fine-tuned-for-coding version? OSS 20B runs well on my 3080 Ti and is great for agentic calls, but Qwen 14B is better for coding so I keep switching. OSS 20B is too big to fine-tune on my setup, so it would be great to get a coding fine-tuned version.
smayonak@reddit
The Neo Codeplus abliterated version of OSS 20B has supposedly been fine-tuned on three different datasets, so in theory it should be good for coding. I don't use it for code generation but rather for explaining coding concepts and fundamental techniques.
QFGTrialByFire@reddit
Ha, no worries, I was just curious. It's kinda difficult to tell; looking at the HF card it says "NEOCode dataset", but that doesn't seem to be a published dataset or available anywhere, so I'm not sure what it's been trained on.
I'd create a fine-tune of OSS 20B if I had the VRAM. If someone has around 24GB, please create some synthetic data using Qwen 30B coder+instruct and train OSS 20B on that. 24GB will be enough to train the model. I know, a lot to ask :)
Lower_Bedroom_2748@reddit
I run an old Z270 board with an i7-6700 and 64GB of DDR4 2400 RAM. I have a 3080 Ti I got off eBay for $500 with free shipping. My favorite is Cydonia 22B: it starts at just over 2 tok/sec, but by the time I am at 6k context it's down to just over 1 tok/sec. I wouldn't go bigger. EVA Qwen 32B is less than 1 tok/sec. My CPU never hits 100%; the bottleneck is the RAM. Still, it can be done depending on your desired tok/sec speed. Just my .02.
PermanentLiminality@reddit
I run the 20B OSS model on 2x P102-100 that cost $40 each. I get prompt processing of 950 tk/s and token generation of 40 tk/s with small context. At larger context it slows down to about 30. With 20GB of VRAM I can do the full 131k of context.
This is with the cards turned down to 165 watts. I'll test it at full context and full power.
BrilliantAudience497@reddit
If that's a hard $500 budget, your only Nvidia option is the 5060 Ti. It runs the 20b just fine at ~100 tokens/second. If the budget is a little flexible and you can wait a bit, the 5070 Super that should be coming out in the next couple of months (assuming rumors are accurate) will be ~50% better performance for ~$550, while the 5070 Ti Super would be better performance and significantly more VRAM for ~$750 (giving you more room later for bigger models). If you can't wait but can go up in budget, used 3090s should have similar pricing and performance to the 5070 Ti Super, but they're available now (although used).
You've also got AMD and Intel options, but I don't know them particularly well and TBH if you're asking a question like this you probably don't want the headache of trying to get them to perform well. The reason almost everyone uses Nvidia for LLMs is because everyone else uses it and it's well supported.
koalfied-coder@reddit
5060ti and 5070ti are perhaps the worst cards for LLM unfortunately.
AppearanceHeavy6724@reddit
Ahaha no. 2x3060. $400.
Wrong-Historian@reddit
3080 Ti. 900 GB/s of memory bandwidth and lots of BF16 flops. This will give you fast prefill/context processing.
Then, offload the MoE layers to CPU DDR.
AppearanceHeavy6724@reddit
12 GiB? Nah.
Wrong-Historian@reddit
Do you understand what I'm saying?
8GB is enough to load all the attention/KV of GPT-OSS-120B. The MoE layers run fine on CPU DDR. It's prefill that matters, and 12GB of VRAM will allow for fast prefill of GPT-OSS-120B, which ultimately determines the user experience.
AppearanceHeavy6724@reddit
Why do you get so worked up? It is a bad idea to buy a GPU only to run a handful of MoE models. One may want to run dense models too.
Wrong-Historian@reddit
Don't think so. They're utterly obsolete (IMO)
AppearanceHeavy6724@reddit
Okay, whatever you believe. Meanwhile, the majority of models I care about are dense.
BrilliantAudience497@reddit
Do you have benchmarks on that? It will certainly run, but for a workload like gpt-oss-20b that fits pretty well in 16GB, I'd be skeptical it would have competitive performance (obviously it would be better than CPU offloading in the 16-24GB range, but that's not what OP was asking).
AppearanceHeavy6724@reddit
What kind of benchmark do you want? Both the 5060 Ti and the 3060 have comparable memory bandwidth; the 5060 Ti is just 25% better, which is not that important: theoretically you'd get 30 tps instead of 37 tps. In practice the difference will be smaller, but with 2x3060 you'll be able to run them in parallel with vLLM at around 45 tps. Far more importantly, with 2x3060 you can run 32B models easily. 16GB is simply too puny for anything semi-serious, like Qwen 30B A3B, Mistral Small, GLM-4, Gemma 3 27B, OSS-36B, you name it.
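For the vLLM bit, here is roughly what the two-card setup looks like in code. A sketch only, untested on my end, with a placeholder model id: you would need a quantized (AWQ/GPTQ-style) build small enough to split across ~24GB of total VRAM:

```python
# Rough tensor-parallel sketch for 2x RTX 3060 (12GB each) with vLLM.
# Untested config; the model id is a placeholder for a quantized build
# small enough to split across ~24GB of total VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-quantized-Qwen3-30B-A3B",  # placeholder, not a real repo id
    tensor_parallel_size=2,                # shard weights across both 3060s
    gpu_memory_utilization=0.90,
    max_model_len=8192,                    # keep context modest for KV cache
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```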
DistanceAlert5706@reddit
I haven't seen them cheaper than $250, idk where you get those prices. It will be 2 times slower than a 5060 Ti and more expensive, not really an option.
AppearanceHeavy6724@reddit
You buy them used duh.
Background-Ad-5398@reddit
5060 Ti for an easy setup you can still use as a gaming PC; two 3060s might get you more VRAM, but it's not a great setup for anything else.
brahh85@reddit
Spend a bit more on a 3090. You might think there is not a big difference between 16GB and 24GB of VRAM, but that extra price and VRAM lets you run more models at bigger contexts. If you buy a 16GB card, you will regret not going for the 24GB.
Wrong-Historian@reddit
Hot take: get a 3080 Ti 12GB. Raw compute and memory bandwidth matter more than the amount of VRAM. You can run all the non-MoE layers of GPT-OSS-120B even in 8GB, and with a 3080 Ti you will get fast prefill/context processing. Then you can run all the MoE layers on CPU for token generation, which is fine.
3080 Ti + 96GB DDR5 (as fast as you can get it). With that I have GPT-OSS-120B running at 30 T/s TG and 210 T/s PP.
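That 30 T/s figure is roughly what a bandwidth estimate predicts, if you assume ~5.1B active parameters per token for gpt-oss-120b, something like 4.25 bits/weight for the CPU-side experts, and ~90 GB/s of effective dual-channel DDR5 bandwidth (my assumptions, not measurements):

```python
# Sanity check of CPU-side MoE decode speed, not a benchmark.
# Assumptions: ~5.1B active params/token, ~4.25 bits/weight, ~90 GB/s DDR5.
active_params = 5.1e9
bits_per_weight = 4.25
bandwidth_gb_s = 90.0

gb_per_token = active_params * bits_per_weight / 8 / 1e9  # ~2.7 GB read per token
print(f"~{gb_per_token:.1f} GB/token -> ~{bandwidth_gb_s / gb_per_token:.0f} tok/s ceiling")
```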
DistanceSolar1449@reddit
The 3080 Ti 12GB is a bad option when you can get a 3080 20GB on eBay or Alibaba for a bit more.
cantgetthistowork@reddit
The 3080 Ti has nearly the same specs as a 3090, just with less VRAM.
DistanceSolar1449@reddit
20GB for $500 > 12GB for $400 though
Unlucky-Message8866@reddit
I haven't factually checked, but I bet a second-hand 3090 is still the best bang for the buck. I don't know your whereabouts, but I can find some here for ~500€.
woodanalytics@reddit
You might be able to find a good deal on a used 3090, but it will likely be another couple hundred dollars ($700).
The reason everyone recommends the 3090 is that it is the best value for money for VRAM at an "affordable" price.
Ok_Needleworker_5247@reddit
You might want to explore the used market for AMD cards like the RX 6700 XT, which offers good performance and falls within your budget. AMD cards generally have less hassle with availability and might fit your needs if you're comfortable with some setup tweaks. This article might provide additional insights into GPU performance for ML tasks.
AppearanceHeavy6724@reddit
2x3060. $400.