RTX 3080 20GB - A comprehensive review of Chinese card
Posted by No-Refrigerator-1672@reddit | LocalLLaMA | View on Reddit | 31 comments
Hello! Recently, the RTX 3080 20GB became available on Chinese sites like Alibaba. In light of rising prices for the RTX 3090, I decided to give these cards a try and ordered a pair of them. In this post I'll share plenty of performance benchmarks, compare the card to the 3090, share my ordering experience, and discuss whether the purchase is worthwhile.
Overview of the card
The cards feature blower-style cooling. The physical dimensions match those of a server card, like the Mi50 or Tesla series. It takes two PCIe slots and has its power connectors on the shorter side. Power is supplied by two regular GPU power connectors (not EPS12V like on Tesla cards), with a default power limit of 320W. The card is clearly designed for installation inside server enclosures.

It looks like the card is based on a custom PCB. The PCB features an NVLink connector; however, it is taped over with Kapton tape, and at the moment I can't verify whether it is operational. The card also has video outputs (1 HDMI, 3 DisplayPort) and can function as a regular GPU. The enclosure is made entirely of metal. From the side, a full copper heatsink is visible, with thermal pads connecting it both to the PCB and to the external shroud. The card feels heavy, sturdy, and well-built.
Test bench
I will test the cards in my personal inference server, which is based on a consumer motherboard. Because of this, the upper card gets a PCIe 3.0 x16 link, while the lower card only gets PCIe 2.0 x2. This degrades performance in tensor parallel mode; however, pipeline parallel mode and single-card benchmarks remain largely unaffected. I opted to install the proprietary Nvidia drivers; the cards were instantly recognized and worked out of the box. Despite being unofficial mods, they don't require any software modifications on the PC side. Full system specs are featured below:
root@proxmox:~# neofetch
.://:` `://:. root@proxmox
`hMMMMMMd/ /dMMMMMMh` ------------
`sMMMMMMMd: :mMMMMMMMs` OS: Proxmox VE 8.4.14 x86_64
`-/+oo+/:`.yMMMMMMMh- -hMMMMMMMy.`:/+oo+/-` Host: AX370-Gaming 3
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:` Kernel: 6.8.12-16-pve
`/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/` Uptime: 3 days, 13 hours, 53 mins
./ooooooo+- +NMMMMMMMMN+ -+ooooooo/. Packages: 1348 (dpkg)
.+ooooooo+-`oNMMMMNo`-+ooooooo+. Shell: bash 5.2.15
-+ooooooo/.`sMMs`./ooooooo+- Terminal: /dev/pts/6
:oooooooo/`..`/oooooooo: CPU: AMD Ryzen 5 5600G with Radeon Graphics (12) @ 4.464GHz
:oooooooo/`..`/oooooooo: GPU: NVIDIA GeForce RTX 3080
-+ooooooo/.`sMMs`./ooooooo+- GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
.+ooooooo+-`oNMMMMNo`-+ooooooo+. GPU: NVIDIA GeForce RTX 3080
./ooooooo+- +NMMMMMMMMN+ -+ooooooo/. GPU: NVIDIA P102-100
`/oooooooo:`:mMMMMMMMMMMMMm:`:oooooooo/` Memory: 18843MiB / 31458MiB
`:oooooooo/`-hMMMMMMMyyMMMMMMMh-`/oooooooo:`
`-/+oo+/:`.yMMMMMMMh- -hMMMMMMMy.`:/+oo+/-`
`sMMMMMMMm: :dMMMMMMMs`
`hMMMMMMd/ /dMMMMMMh`
`://:` `://:`
root@proxmox:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 On | 00000000:01:00.0 Off | N/A |
| 50% 47C P8 14W / 320W | 18781MiB / 20480MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA P102-100 On | 00000000:05:00.0 Off | N/A |
| 0% 30C P8 6W / 125W | 8393MiB / 10240MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3080 On | 00000000:08:00.0 Off | N/A |
| 50% 53C P8 16W / 320W | 19001MiB / 20480MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 641329 C VLLM::Worker_PP0 18772MiB |
| 1 N/A N/A 753366 C ./llama-server 8386MiB |
| 2 N/A N/A 641331 C VLLM::Worker_PP1 18992MiB |
+-----------------------------------------------------------------------------------------+
All performance measurements were taken with vllm bench serve. All tests were run without KV cache quantization.
Single card: performance in various inference engines
For this test, I chose two models that a person could run on a single card without CPU offloading: one dense (Qwen3 14B AWQ) and one MoE (GPT-OSS 20B). For llama.cpp, I used unsloth/Qwen3-14B-GGUF:Q4_K_XL and ggml-org/gpt-oss-20b-GGUF. I also wanted to test HuggingFace TGI, but since it supports neither of the test models (or any of the newer ones, for that matter), I decided to skip it.
Engine launch commands:
vLLM:
vllm serve /models/mxfp4/gpt-oss-20b/ --max-model-len 65536 --max-num-seqs 1
llama.cpp:
./llama-server -ngl 999 --no-mmap -fa on --no-webui -c 65536 --parallel 1 -m /models/gguf/gpt-oss-20b-mxfp4.gguf
SGLang:
python3 -m sglang.launch_server --model-path /models/mxfp4/gpt-oss-20b/ --log-level info --max-running-requests 1 --max-total-tokens 65536
Note: For GPT-OSS, SGLang refused to allocate KV cache for more than 59k tokens even when explicitly told to, so the 64k test for SGLang failed. During initial runs, vLLM suggested in its output log that I install FlashInfer for a speedup, so I did. All engines were installed in full accordance with their official docs, and no other optimization steps were taken.
For this test, I've used the following command with various input lengths:
vllm bench serve --dataset-name random --backend openai --host vllm_host --port 8000 --endpoint "/v1/completions" --model "openai/gpt-oss-20b" --max-concurrency 1 --num-prompts 20 --random-input-len 16000 --random-output-len 512
Prompt processing speed is calculated as prompt length divided by time to first token.
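As a minimal sketch (with made-up numbers, not measurements), that works out to:

```python
def pp_speed(prompt_tokens: int, ttft_s: float) -> float:
    """Prompt processing speed in tokens/s: prompt length over time to first token."""
    return prompt_tokens / ttft_s

# e.g. a 16,000-token prompt that takes 2.0 s to reach the first output token:
print(pp_speed(16_000, 2.0))  # 8000.0 tok/s
```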


We can see that for the mxfp4 MoE model, vLLM outperforms the other engines on prompt processing (PP) by a huge margin. For whatever reason, llama.cpp is very efficient at token generation (TG) for short sequences; however, this edge is not enough to compensate for its very slow PP. SGLang lags behind significantly, but this is to be expected, as SGLang itself states that mxfp4 support is not yet optimized.
For more traditional quantization types, SGLang maintains an edge over vLLM in TG while matching it in PP for sequences longer than 4k tokens. Llama.cpp loses across the board in this test. I conclude that for the single-card, single-user case, SGLang is probably the best choice for this particular card, if you have a compatible model.
Single card: available KV cache in vLLM
openai/gpt-oss-20b:
(EngineCore_DP0 pid=1874) INFO 11-16 08:01:36 [gpu_worker.py:298] Available KV cache memory: 3.65 GiB
(EngineCore_DP0 pid=1874) INFO 11-16 08:01:37 [kv_cache_utils.py:1087] GPU KV cache size: 79,744 tokens
(EngineCore_DP0 pid=1874) INFO 11-16 08:01:37 [kv_cache_utils.py:1091] Maximum concurrency for 65,536 tokens per request: 2.36x
cpatonn/Devstral-Small-2507-AWQ-4bit (cache manually set to 5GB):
(EngineCore_DP0 pid=1451) INFO 11-16 20:07:47 [kv_cache_utils.py:1087] GPU KV cache size: 32,768 tokens
(EngineCore_DP0 pid=1451) INFO 11-16 20:07:47 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 1.00x
Qwen/Qwen3-14B-AWQ:
(EngineCore_DP0 pid=1796) INFO 11-16 20:55:30 [gpu_worker.py:298] Available KV cache memory: 7.94 GiB
(EngineCore_DP0 pid=1796) INFO 11-16 20:55:30 [kv_cache_utils.py:1087] GPU KV cache size: 52,032 tokens
(EngineCore_DP0 pid=1796) INFO 11-16 20:55:30 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 1.59x
The amounts of available cache memory are reasonable. Personally, I would have liked more, but 30k is a usable amount, and GPT-OSS 20B has enough to cover most typical use cases.
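For reference, the "maximum concurrency" figures in these logs are essentially the KV cache size divided by the per-request context length. This is a sketch, not vLLM's exact code; the GPT-OSS figure comes out higher than this simple ratio, presumably because vLLM accounts for its sliding-window attention layers separately:

```python
def max_concurrency(kv_cache_tokens: int, tokens_per_request: int) -> float:
    """How many maximum-length requests fit into the KV cache at once."""
    return kv_cache_tokens / tokens_per_request

# Qwen/Qwen3-14B-AWQ figures from the log above:
print(round(max_concurrency(52_032, 32_768), 2))  # 1.59 - matches vLLM's report
```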
Single card: Performance vs power limit
In some circumstances, people may want to limit a card's power draw to maintain cooler temperatures, lower noise, save on the electricity bill, or fit multiple GPUs on a limited power supply. To investigate this, I measured single-card performance against power limits imposed via nvidia-smi. All tests were done with single requests to GPT-OSS 20B with 16k-token prompts.

We can see that the card maintains relatively good performance down to 220W. When the power limit is lowered by 30%, performance degrades by only 10%, making power limiting a viable option for reducing fan noise and the power bill.
Dual cards: pipeline parallel performance for single user
As stated previously, due to the consumer motherboard, I only get PCIe 2.0 x2 to the second card. Preliminary testing showed that in tensor parallel mode, the second card maxes out its PCIe bandwidth and PP speeds plummet to completely unacceptable numbers. Pipeline parallel mode, however, seems to stay mostly unaffected, so I decided to feature only it in this review. For this test, I chose much more popular models: cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit for dense and cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit for MoE. For llama.cpp, I chose unsloth/Qwen3-VL-32B-Instruct-GGUF:Q4_K_XL and unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_XL. SGLang, despite advertising support for Qwen3 VL, threw errors when I made requests to both models, so I decided it wasn't worth the time.


So, we can see that these cards perform very well on the 30B MoE model. Prompt processing for the 32B dense model looks very odd, probably hindered by the second card's narrow PCIe link. I would conclude that if you want a multi-card setup, either go with MoE models or use a Threadripper/Epyc platform to get proper PCIe connectivity. llama.cpp performs really badly, which isn't a big surprise. It's a shame that SGLang failed to run inference on these models; maybe I will revisit this test after a few updates.
Dual cards: available KV cache in vLLM
cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit:
(EngineCore_DP0 pid=566) INFO 11-17 13:11:03 [kv_cache_utils.py:1087] GPU KV cache size: 152,912 tokens
(EngineCore_DP0 pid=566) INFO 11-17 13:11:03 [kv_cache_utils.py:1091] Maximum concurrency for 131,072 tokens per request: 1.17x
cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit:
(EngineCore_DP0 pid=810) INFO 11-17 14:08:46 [kv_cache_utils.py:1087] GPU KV cache size: 53,248 tokens
(EngineCore_DP0 pid=810) INFO 11-17 14:08:46 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 1.62x
The cache situation looks similar to the single-card case: MoE models get plenty of cache, probably enough for any use case, while dense models get enough to be decent for single requests.
Dual cards: multi-user performance scaling
Systems like RAG or agentic automation (e.g. n8n) really like to make parallel requests, so even if you're buying these cards for yourself, you may still be interested in serving multiple requests in parallel. To investigate, I chose Qwen3 VL 30B, set maximum concurrency to 16 in vLLM, then launched vllm bench serve with various concurrency values, using this command:
vllm bench serve --dataset-name random --backend openai --host vllm_host --port 8000 --endpoint "/v1/completions" --model "cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit" --max-concurrency 4 --num-prompts 100 --random-input-len 8000 --random-output-len 512
By design of this test, there were no requests queued on the inference engine side, so I define combined PP speed as prompt length divided by time to first token, multiplied by the number of parallel requests.
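The combined figure defined above can be sketched like this (illustrative numbers, not measurements):

```python
def combined_pp_speed(prompt_tokens: int, ttft_s: float, concurrency: int) -> float:
    """Aggregate prompt processing throughput across parallel streams."""
    return prompt_tokens / ttft_s * concurrency

# e.g. 4 parallel 8,000-token prompts, each seeing a 4.0 s time to first token:
print(combined_pp_speed(8_000, 4.0, 4))  # 8000.0 tok/s combined
```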

These GPUs are very good at processing simultaneous requests for their price. The sweet spot for Qwen3 30B MoE seems to be 12 requests. You could easily run a heavy-duty RAG solution like RAGFlow or build a cheap private AI setup for a small company.
Dual cards: comparison against 3090
Of course, you'll want to know how well this card stacks up against the 3090. To answer this, I rented a Runpod instance with dual 3090s and ran an identical test on it. This test also serves a second purpose: if the performance curves are similar, we can be confident that my dual-card measurements aren't heavily affected by the second card's limited connectivity.
This test was run with cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit, vllm 0.11.0, in pipeline parallel mode.

During my testing, I noticed that time to first token was consistently 300-400ms higher for Runpod's 3090s than for my 3080s, which made the 3090's results for sequences shorter than 16k unrealistically low. Because of this, I decided to subtract 350ms from Runpod's 3090 measurements before processing the data for the graph. As we can see, the 3090 offers 30% more TG performance, but its PP performance is equal to the 3080's.
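The correction is a simple fixed-offset subtraction before computing PP speed. A sketch with hypothetical numbers; 0.35 s is the offset I used for the graph:

```python
def corrected_pp_speed(prompt_tokens: int, ttft_s: float, overhead_s: float = 0.35) -> float:
    """PP speed after removing a fixed network/scheduling overhead from TTFT."""
    return prompt_tokens / (ttft_s - overhead_s)

# hypothetical: a 4,000-token prompt measured at 1.35 s TTFT on the rented pod
print(round(corrected_pp_speed(4_000, 1.35)))  # 4000 tok/s
```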
Purchasing experience and pricing
At the moment, I was unable to find any source for these GPUs other than Alibaba. The platform has a more personalized purchase flow: you're supposed to message a supplier of your choice and negotiate, after which the supplier sends you an offer. Typically, you'll get the first response within half a day. To request a shipping cost estimate, you'll need to tell them your country, city, and postal code. Once all the order details were finalized, I sent them my shipping address and received an official offer. In my case, within 24 hours of paying via PayPal, the seller sent me a video of my cards running FurMark and GPU-Z on test benches. Within the next day, they sent me pictures of the package and the shipping paperwork, and asked me to verify the details. After that, the shipment was handed to DHL. Overall, it took 6 days from payment to receiving the parcel. I would rate the experience as good.
People report that this site has a number of scammers. Alibaba itself provides buyer protection, but it only works if all your communication and transactions happen on the platform. Therefore, if a supplier asks you to switch to WhatsApp or pay via wire transfer, refuse and find another one. If you open a supplier's profile on Alibaba, there is a "Company Overview" page where Alibaba openly states the number of transactions that supplier has completed; try to find one with a big number, as that indicates they deal within the platform and your buyer protection will be in place. My GPU supplier had 300+ transactions and a storefront full of PC components.
My bill for the GPUs was structured as follows: $415 x2 for the cards, $80 for shipping, $25 for shipping insurance (applied by Alibaba), $25 in PayPal transaction fees, and 160 EUR for import customs. In total, I paid 1008.53 EUR, so the final price is roughly 500 EUR per card.
Was this a good purchase, and should you get one?
Let's talk about the price. At the time of writing, the cheapest 3090 in Europe on eBay is 730 EUR including shipping. This makes the 3080 20GB the better value: it costs 25 EUR per GB of VRAM, versus roughly 30 EUR/GB for the 3090. From the performance comparison, we can see that the price/performance ratio of the two cards is roughly equal. Given that the card is physically well prepared to fit workstations and servers, it also has an edge over the 3090 and other gaming cards for multi-GPU setups. However, there are caveats: as the single-card KV cache measurements show, the missing 4GB significantly limits available prompt lengths, restricting long-context use cases to MoE models only. On the other hand, at the time of writing, only 16GB Nvidia cards are available for 500 EUR, so on a price-per-card basis, the 3080 20GB has an edge over any other option.
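For transparency, the EUR-per-GB comparison is just the card price over its VRAM size:

```python
def eur_per_gb(price_eur: float, vram_gb: int) -> float:
    """Cost of VRAM capacity: card price divided by memory size."""
    return price_eur / vram_gb

print(round(eur_per_gb(500, 20)))  # 25 - RTX 3080 20GB at my landed price
print(round(eur_per_gb(730, 24)))  # 30 - cheapest eBay RTX 3090 at the time
```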
There are also longevity concerns: this 3080 is most likely built from GPU cores and VRAM salvaged from mining cards, so the reliability of such a product is unknown. On this sub, I've seen some people claim that a modded 2080Ti 22GB worked for them for a very long time, while others said theirs failed within a month; so a modded card can be reliable, but it isn't guaranteed. I decided to take the risk, and at the moment I'm happy with my purchase. These cards will work 24/7 in my personal inference server, and I pledge to update this post if they ever fail in the coming years.
I hope you found this set of benchmarks useful and that this post sparks more discussion about these Chinese-made Nvidia cards, as they currently seem to stay out of sight of the majority of this subreddit. Later, when I have some more spare time, I'll also benchmark these cards in ComfyUI for image/video generation.
RecognitionOk7218@reddit
u/No-Refrigerator-1672 great write-up! I was wondering if you could share your thoughts on this exact setup (if you are still running it) with some of the current top models like Qwen 3.6 27B or Gemma 4? Are you using some of the cyankiwi models, and how would you advise someone starting with local LLMs to approach this setup to complement an agent harness like Hermes, or as an offloading system for Claude Code driven projects?
I found the guide super helpful, and I am currently awaiting two 3080 20GB cards myself - used 3090 prices are $1300 in my part of the world, and it simply doesn't make sense.
Thank you again!
No-Refrigerator-1672@reddit (OP)
The setup is still running; moreover, since November I've processed 105M prompt tokens and generated 5.5M tokens, according to LiteLLM usage statistics. The cards run 24/7 and are only powered down when I want to tinker with the server's hardware, so reliability is solid. Still totally satisfied with the cards.
Qwen 3.6 27B is out of the question for me, for now, because PP speed is crippled by the motherboard's PCIe. I've found a way around it and am waiting for an NVMe-to-PCIe adapter to arrive in the mail; this should hopefully unlock dense models. For now, I'm using Qwen 3.6 35B with vLLM - the cards have enough memory to allocate the full 256k context. With a single request, I get 6300 PP / 100 TG for 4k prompts, 5000 PP / 43 TG for 32k prompts, and 2500 PP / 9 TG for 128k prompts. Performance for Qwen 3.5/3.6 tanks quite a bit more than Qwen 3 at long prompts, but remains usable. On the other hand, with Qwen 3.6 35B the sweet spot for parallel requests is 16 streams, versus 12 streams for Qwen 3 30B. This is pipeline parallel mode, as I'm limited on PCIe for now, and without MTP, as vLLM supports MTP only with tensor parallelism. I'll see if bumping up my PCIe connectivity unlocks tensor parallelism and MTP and allows for faster generation, but that remains to be seen.
Yes, I'm running cyankiwi 4-bit AWQ quants; they work reliably for me. I failed to make Claude Code work - it seems they have some proprietary tool-calling format, and my Qwen 3.6 35B just fails to use tools. OpenCode, however, works well. I was recently asked to share my agentic coding config and published it in this comment; you can use it as a guide for best results with these cards, including vLLM configs.
You're welcome! When building my config, I was frustrated by the lack of proper technical reviews of this hardware, so I decided to do my part in sharing proper knowledge, and I'm glad it's helping people.
RecognitionOk7218@reddit
Wow - this is a goldmine of reference knowledge!
Just a few follow-ups that might help me adapt this to my setup: 9700X CPU, 64GB DDR5 RAM, dual RTX 3080 20GB (eventually), X870 motherboard with PCIe 5.0 x16 + 4.0 x4
- Which NVMe-to-PCIe adapter are you using? Do you expect it to unlock tensor parallelism for dense models like Qwen3.6-27B, or is pipeline parallelism still the way to go for my x4 slot?
- Would you recommend trying tensor parallelism despite the x4 bottleneck? Also, do you enable prefix caching and auto tool choice for non-MoE models like the 27B?
Thank you.
No-Refrigerator-1672@reddit (OP)
Knockoff AliExpress stuff. I'm adapting NVMe to MCIO (PCIe over cable for servers), then MCIO into an x16 carrier board. This will give me Gen3 x4 speed to the card - still low, but 4 times faster than the Gen2 x2 I have now. The contraption can go up to Gen4 x4, but my motherboard is capped at Gen3. Check back with me later, once I've actually tested this setup - they say it has passed customs and should arrive soon.
For me, tensor parallelism for dense models is questionable, as they require a lot of bandwidth. Probably only MoE; but I'll test both for sure. For you - you should test it, but you have good chances. Somebody in the comments here mentioned that they run tensor parallelism just fine with PCIe Gen5 x4, so with Gen4 the chances are good.
Tool choice must be enabled for agentic loads; and if you're using OpenWebUI, I highly encourage you to enable tool choice there too and set tool calling to "native" in OWUI - you'll be surprised how much smarter your AI becomes when you give it tons of tools to use. As for prefix caching: first of all, prefix caching is on by default for most models in vLLM; it's just disabled for Mamba-style models (the Qwen 3.5 family), because prefix caching for Mamba is "experimental". In my experiments with Qwen 3.6, it reduces raw performance by about 5%; but because prefix caching lets vLLM pick up previously processed prompts on new requests, it massively speeds up multi-turn conversations and any agentic workflow. It makes sense to enable it for most use cases.
alm6iri@reddit
Check this out - I bought an RTX 3080 20GB with 3 fans from Alibaba, for gaming:
https://www.alibaba.com/x/1lAVqP3?ck=pdp
eviloni@reddit
Can you tell me the dimensions of the card? specifically the length?
m0nsky@reddit
Hi, could you check the idle power usage for me in nvidia-smi? With ASPM L0sL1s configured in the motherboard BIOS, my RTX 5060 Ti cards drop down to 2W at idle (even with a model loaded). I'm looking to upgrade to some faster dual-slot cards, but I want to keep the server power efficient.
No-Refrigerator-1672@reddit (OP)
I can't reconfigure the BIOS right now to check the setting you've mentioned. With the default settings of my Gigabyte AX370 motherboard, the cards draw 6-7W at idle with no software running; with vLLM loaded, they sometimes idle at 6-7W and sometimes at 15W. The exact idle power changes between server reboots; both cards may drop to 6W or get stuck at 15W. This is a common problem with Ampere, and you can read more about it here, including some potential fixes.
m0nsky@reddit
Thanks for the reply and all of the detailed info in this thread. I think I'm going to pass on these (mainly due to the idle power), but they definitely seem like a good option for the money. I currently deploy 12B Q6/Q8 models for 24/7 usage and I'm looking to scale up to 27B Q6/Q8.
I see a lot of listings (for example: https://www.ebay.nl/itm/187434263061) on eBay that offer the same card (?) for EUR 499 + free shipping. I sent them an offer a while ago and they were OK with going down to 478 per card; however, I'll still have to deal with taxes and import costs. I have no experience with Alibaba - it always seemed a bit odd to me (even though your experience seems to have been great).
Because these cards are dual-slot, do you think you could stack two of them right next to each other without issues? The cards will be idle 90% of the time. I currently have a board with 4x full-width PCIe slots, so I was looking to stack 4 cards in the board, but I think they'd need at least a millimeter of breathing space so as not to completely choke them.
No-Refrigerator-1672@reddit (OP)
These GPUs are just a few mm narrower than their PCI bracket, so dense stacking will leave breathing room by design. Blower-style GPUs in general are intended to be used with minimal breathing room, so no worries there. However, thermals are tight. I have deployed them with triple-slot spacing, and under full prolonged load they reach 85°C with 100% fans. With dual-slot spacing, I would dial back the power target a bit to avoid potential throttling. Right now I'm running at 270W, and this only spins the fans up to 80%, so I think dual-slot deployment will be solid with that constraint.
m0nsky@reddit
Thanks a lot, I guess I'll give it a second thought. If you ever come across the ASPM/L0sL1s features in your BIOS, then I would love to hear your results on idle power.
Long_comment_san@reddit
Makes no sense whatsoever. Old repurposed mining junk with no warranty and giant heat output. If it were 30GB, it would have made a lot of sense at $500-550.
No-Refrigerator-1672@reddit (OP)
So what's your proposed alternative? Is there any other Nvidia card that provides equal or better price per GB and has as good software support? Keep in mind that the 3090 isn't available at $500 anymore in many parts of the world.
Long_comment_san@reddit
Price per GB doesn't matter at all in the 12-20GB range, and there's no substitute for actual bang for the buck. Price per GB is only a functional metric if the end result is acceptable.
This one is not acceptable. It takes the problems of a second-hand 3090 Ti, then amplifies them with re-soldered memory of unspecified origin and no warranty.
The closest alternatives are the 5060 Ti at $500 - a new card with warranty, 16GB, and 4-bit support (which is a huge deal in compatible models) - and the 3090 Ti with 24GB at 600-650, which is also a much faster card than the 3080. The R9700 at $1300 is also a decent alternative with RDNA 4, 32GB of VRAM, and a warranty.
seamonn@reddit
To each their own! Ryzen 395 is too slow for many use cases.
DiamondTasty6049@reddit
Is the fan noisy? is it quiet enough to use it in study room? thank you
No-Refrigerator-1672@reddit (OP)
It depends on your tolerance. At idle, the card stops its fan completely and is dead silent. Under load, they do get loud - but it's manageable; I can drown them out just by listening to music. In my inference server, I have 3x 120mm Asus ROG cooling fans; those three fans spinning at 80% RPM are significantly louder than my 3080s, but my 3080s are significantly louder than gaming-style cards. That's the best comparison I can give you.
fallingdowndizzyvr@reddit
What seller did you buy from?
No-Refrigerator-1672@reddit (OP)
This one. I've dealt with this seller twice, both times my experience was good.
Autumnrain@reddit
Have to buy 2 cards minimum?
No-Refrigerator-1672@reddit (OP)
Depends on the seller. Most of them do require you to buy at least two.
fallingdowndizzyvr@reddit
Thank you.
kryptkpr@reddit
Fantastic write-up, the vLLM vs SgLang comparison in particular has me thinking I really need to play with SgLang more..
Smooth-Cow9084@reddit
In my country, in Europe, I can buy 3090s at lower 500s. I actually got one for 500€ + 40€ shipping, fees...
Maybe second hand markets other than eBay paint a different picture
No-Refrigerator-1672@reddit (OP)
Yeah, I see the 3090 situation varies country by country. In my area (Latvia), second-hand 3090s start at 700 EUR, even on Facebook Marketplace and other local sites. Maybe you can find one at 600, but you'd have to monitor the marketplace daily. The last time I saw a 3090 locally for 500 EUR was in March. Some countries are just luckier than others 😁 Meanwhile, these 3080s are available on demand in any quantity.
Smooth-Cow9084@reddit
Damn. Any reason why they would be so high there? I believe Latvia isn't richer than my country, so I doubt it's purchasing power.
No-Refrigerator-1672@reddit (OP)
I can only come up with one idea: on the local classifieds board, there are some "I'll buy your 3090/4090" ads. I guess any seller who wants to sell low and fast gets swooped up by those bulk buyers, leaving only the high-priced options. Multiply that by the fact that there are only 2M people here, and you get a classic supply-and-demand situation.
Endlesscrysis@reddit
What country is this? And what site?
false79@reddit
What's the memory bandwidth on this card? I have a feeling you can go cheaper, faster, and bigger with AMD.
No-Refrigerator-1672@reddit (OP)
It matches the unmodded spec - 760GB/s; I validated it by running the bandwidthTest sample from the CUDA SDK. AMD is questionable. First, just like Nvidia, AMD's best card under 500 EUR has just 16GB. But software compatibility is what really bothers me. ROCm officially supports only the 7900 XT/XTX and, I believe, the 9070 XT; every other consumer model is unofficial. I've had a pair of Mi50s before, and I learned that basically everything beyond llama.cpp requires CUDA, and I'm not willing to risk my money again on a card that will be held back by software.
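For reference, 760GB/s lines up with the stock RTX 3080 memory spec: a 320-bit bus at a 19 Gbps effective data rate (presumably the mod keeps the same bus and just swaps in larger memory modules):

```python
def gddr_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical memory bandwidth: bytes per transfer times effective data rate."""
    return bus_width_bits / 8 * data_rate_gbps

# stock RTX 3080: 320-bit bus, 19 Gbps GDDR6X
print(gddr_bandwidth_gb_s(320, 19))  # 760.0 GB/s
```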
false79@reddit
That's cool - thanks for the confirmation. Definitely in agreement.