RTX 3080 20GB - A comprehensive review of a Chinese card
Posted by No-Refrigerator-1672@reddit | LocalLLaMA | 16 comments
Hello! Recently, the RTX 3080 20GB became available on Chinese sites like Alibaba. In light of rising prices for the RTX 3090, I decided to give these cards a try and ordered a pair of them. In this post I'll feature lots of performance benchmarks, compare them to the 3090, share my ordering experience, and discuss the feasibility of this purchase.
Overview of the card
The cards feature blower-style cooling. The physical dimensions match those of a server card, like the Mi50 or the Tesla series. It takes 2 PCIe slots and has its power connectors on the shorter side. Power is supplied by two regular GPU power connectors (not EPS12V like on Tesla cards), with a default power limit of 320W. The card is clearly prepared for installation inside server enclosures.

It looks like the card is based on a custom PCB. The PCB features an NVLink connector; however, it is taped over with Kapton tape, and at this moment I can't verify whether it is operational. The card also has video connectors (1 HDMI, 3 DisplayPort) and can function like a regular GPU. The card's enclosure is fully made out of metal. From the side, a full copper heatsink is visible, with thermal pads connecting it to both the PCB and the external shroud. The card feels heavy, sturdy, and well-built.
Test bench
I will test the cards in my personal inference server, which is based on a consumer motherboard. Due to this, the upper card gets a PCIe 3.0 x16 link, while the lower card only gets PCIe 2.0 x2. This leads to degraded performance in tensor parallel mode; however, pipeline parallel mode and single card benchmarks remain largely unaffected. I've opted to install the proprietary Nvidia drivers on my system; the cards were instantly recognized by the drivers and worked out of the box. Despite being unofficial mods, they don't require any software modifications on the PC side. Full system specs are featured below:
root@proxmox:~# neofetch
root@proxmox
------------
OS: Proxmox VE 8.4.14 x86_64
Host: AX370-Gaming 3
Kernel: 6.8.12-16-pve
Uptime: 3 days, 13 hours, 53 mins
Packages: 1348 (dpkg)
Shell: bash 5.2.15
Terminal: /dev/pts/6
CPU: AMD Ryzen 5 5600G with Radeon Graphics (12) @ 4.464GHz
GPU: NVIDIA GeForce RTX 3080
GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
GPU: NVIDIA GeForce RTX 3080
GPU: NVIDIA P102-100
Memory: 18843MiB / 31458MiB
root@proxmox:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 On | 00000000:01:00.0 Off | N/A |
| 50% 47C P8 14W / 320W | 18781MiB / 20480MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA P102-100 On | 00000000:05:00.0 Off | N/A |
| 0% 30C P8 6W / 125W | 8393MiB / 10240MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 3080 On | 00000000:08:00.0 Off | N/A |
| 50% 53C P8 16W / 320W | 19001MiB / 20480MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 641329 C VLLM::Worker_PP0 18772MiB |
| 1 N/A N/A 753366 C ./llama-server 8386MiB |
| 2 N/A N/A 641331 C VLLM::Worker_PP1 18992MiB |
+-----------------------------------------------------------------------------------------+
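Since PCIe connectivity matters for the multi-GPU tests later on, one way to check what link each card actually negotiates is an nvidia-smi query like the one below (fields per nvidia-smi's documented --query-gpu options; the exact output will differ per system):
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv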
All performance measurements were performed with vllm bench serve. All tests were run without KV cache quantization.
Single card: performance in various inference engines
For this test, I've chosen two models that a person could run on a single card without CPU offloading: one dense (Qwen3 14B AWQ) and one MoE (GPT-OSS 20B). In the case of llama.cpp, I've used unsloth/Qwen3-14B-GGUF:Q4_K_XL and ggml-org/gpt-oss-20b-GGUF. I also wanted to test HuggingFace TGI, but since it supports neither of the test models (or any of the newer ones, for that matter), I decided to skip it.
Engine launch commands:
vLLM:
vllm serve /models/mxfp4/gpt-oss-20b/ --max-model-len 65536 --max-num-seqs 1
llama.cpp:
./llama-server -ngl 999 --no-mmap -fa on --no-webui -c 65536 --parallel 1 -m /models/gguf/gpt-oss-20b-mxfp4.gguf
SGLang:
python3 -m sglang.launch_server --model-path /models/mxfp4/gpt-oss-20b/ --log-level info --max-running-requests 1 --max-total-tokens 65536
Note: For GPT-OSS, SGLang refused to allocate more KV cache than 59k tokens even when explicitly told to. Therefore, the 64k-long test for SGLang failed. During initial runs, vLLM asked me in its output log to install FlashInfer for a speedup, so I did. All engines were installed in full accordance with their official docs, and no other optimization actions were taken.
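For completeness, the dense model (Qwen3 14B AWQ) was served in the same way; an illustrative vLLM command for it (analogous flags, not copied verbatim from my shell history) would be:
vllm serve Qwen/Qwen3-14B-AWQ --max-model-len 32768 --max-num-seqs 1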
For this test, I've used the following command with various input lengths:
vllm bench serve --dataset-name random --backend openai --host vllm_host --port 8000 --endpoint "/v1/completions" --model "openai/gpt-oss-20b" --max-concurrency 1 --num-prompts 20 --random-input-len 16000 --random-output-len 512
Prompt Processing speed is calculated as prompt length divided by time to first token.
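For example, under this definition a 16,000-token prompt answered with a time to first token of 8 seconds corresponds to roughly 2,000 t/s of PP (illustrative round numbers, not a measurement from the charts).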


We can see that for the mxfp4 MoE model, vLLM outperforms the other engines in Prompt Processing (PP) by a huge amount. For whatever reason, llama.cpp is very efficient at Token Generation (TG) for short sequences; however, this edge is not enough to compensate for its very slow PP. SGLang lags behind significantly, but this is to be expected, as SGLang itself states that mxfp4 support is not optimized yet.
For more traditional quantization types, SGLang maintains an edge over vLLM in TG, while matching it in PP for sequences longer than 4k tokens. Llama.cpp loses across the board in this test. I can conclude that for the single card, single user case, SGLang is probably the best choice for this particular card, if you have a compatible model.
Single card: available KV cache in vLLM
openai/gpt-oss-20b:
(EngineCore_DP0 pid=1874) INFO 11-16 08:01:36 [gpu_worker.py:298] Available KV cache memory: 3.65 GiB
(EngineCore_DP0 pid=1874) INFO 11-16 08:01:37 [kv_cache_utils.py:1087] GPU KV cache size: 79,744 tokens
(EngineCore_DP0 pid=1874) INFO 11-16 08:01:37 [kv_cache_utils.py:1091] Maximum concurrency for 65,536 tokens per request: 2.36x
cpatonn/Devstral-Small-2507-AWQ-4bit (cache manually set to 5GB):
(EngineCore_DP0 pid=1451) INFO 11-16 20:07:47 [kv_cache_utils.py:1087] GPU KV cache size: 32,768 tokens
(EngineCore_DP0 pid=1451) INFO 11-16 20:07:47 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 1.00x
Qwen/Qwen3-14B-AWQ:
(EngineCore_DP0 pid=1796) INFO 11-16 20:55:30 [gpu_worker.py:298] Available KV cache memory: 7.94 GiB
(EngineCore_DP0 pid=1796) INFO 11-16 20:55:30 [kv_cache_utils.py:1087] GPU KV cache size: 52,032 tokens
(EngineCore_DP0 pid=1796) INFO 11-16 20:55:30 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 1.59x
The amounts of available cache memory are reasonable. Personally, I would've liked to have more, but 30k is a usable amount, and GPT-OSS 20B has enough to cover most typical use cases.
Single card: Performance vs power limit
In some circumstances, people may want to limit the power usage of a card to maintain cooler temperatures, lower noise, save on the electricity bill, or install multiple GPUs with a limited power supply. To investigate this, I've measured single card performance versus the power limit imposed via nvidia-smi. All tests are done with single requests to GPT-OSS 20B with 16k-long prompts.
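For reference, the limit itself is set per GPU with nvidia-smi; a typical invocation (GPU index 0 here, adjust to your system; requires root) looks like:
sudo nvidia-smi -i 0 -pl 220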

We can see that the card maintains relatively good performance down to 220W. When the power limit is lowered by 30%, the card's performance degrades by only 10%, making power limiting a viable option for reducing fan noise and the power bill.
Dual cards: pipeline parallel performance for single user
As I've stated previously, due to the consumer motherboard, I only get PCIe 2.0 x2 to the second card. Preliminary testing showed that in tensor parallel mode, the second card maxes out its PCIe bandwidth and PP speeds plummet to completely unacceptable numbers. Pipeline parallel mode, however, seems to stay mostly unaffected, so I've decided to feature only it in this review. For this test, I've chosen much more popular model options: cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit to test a dense model, and cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit to test MoE. For llama.cpp, I've chosen unsloth/Qwen3-VL-32B-Instruct-GGUF:Q4_K_XL and unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_XL. SGLang, despite advertising support for Qwen3 VL, threw errors when I made requests to both of the models, so I decided it wasn't worth the time.
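For reference, an illustrative vLLM launch for the pipeline parallel runs (model and context length shown as examples, not my exact command) would be:
vllm serve cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit --pipeline-parallel-size 2 --max-model-len 131072 --max-num-seqs 1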


So, we can see that these cards perform very well for the 30B MoE model. Prompt processing for the 32B dense model looks very weird, probably hindered by the narrow PCIe link of the second card. I would conclude that if you want to go for a multi-card setup, either go with MoE models, or use a Threadripper/Epyc platform to get proper PCIe connectivity. llama.cpp seems to perform really badly, which isn't a big surprise. It is a shame that SGLang failed to do inference on these models; maybe I will revisit this test after a few updates.
Dual cards: available KV cache in vLLM
cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit:
(EngineCore_DP0 pid=566) INFO 11-17 13:11:03 [kv_cache_utils.py:1087] GPU KV cache size: 152,912 tokens
(EngineCore_DP0 pid=566) INFO 11-17 13:11:03 [kv_cache_utils.py:1091] Maximum concurrency for 131,072 tokens per request: 1.17x
cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit:
(EngineCore_DP0 pid=810) INFO 11-17 14:08:46 [kv_cache_utils.py:1087] GPU KV cache size: 53,248 tokens
(EngineCore_DP0 pid=810) INFO 11-17 14:08:46 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 1.62x
The cache situation looks similar to the single card case. MoE models get lots of cache that probably covers any use case, while dense models get enough cache to be decent for single requests.
Dual cards: multi-user performance scaling
Systems like RAG or agentic automation like n8n really like to make parallel requests, so even if you're buying these cards for yourself, you may still be interested in serving multiple parallel requests. To investigate that, I've chosen Qwen3 VL 30B, set the maximum concurrency to 16 in vLLM, and then launched vllm bench serve with various concurrency numbers, using this command:
vllm bench serve --dataset-name random --backend openai --host vllm_host --port 8000 --endpoint "/v1/completions" --model "cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit" --max-concurrency 4 --num-prompts 100 --random-input-len 8000 --random-output-len 512
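The server-side concurrency cap mentioned above corresponds to vLLM's --max-num-seqs option; illustratively, the pipeline parallel serve command from earlier becomes:
vllm serve cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit --pipeline-parallel-size 2 --max-model-len 131072 --max-num-seqs 16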
By design of this test, there were no requests sitting in the queue on the inference engine side, so I'm defining combined PP speed as prompt length divided by time to first token, multiplied by the number of parallel requests.

These GPUs are very good at processing simultaneous requests for their price. It seems like the sweet spot for Qwen3 30B MoE is 12 requests. You could easily run a heavy-duty RAG solution like RAGFlow or create a cheap private AI setup for a small company.
Dual cards: comparison against 3090
Of course, you would want to know how well this card stacks up against the 3090. To answer this question, I've rented a runpod with dual 3090s and ran an identical test on it. This test also serves a second purpose: if the performance curves are similar, then we can be sure that my dual-card measurements aren't heavily affected by the limited connectivity of the second card.
This test was run with cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit, vllm 0.11.0, in pipeline parallel mode.

During my testing, I've noticed that time to first token is consistently 300-400ms higher for Runpod's 3090s than for my 3080s, which made the 3090 results for sequences shorter than 16k unrealistically low. Due to this, I've decided to subtract 350ms from Runpod's 3090 measurements before processing the data for the graph. As we can see, the 3090 offers 30% more TG performance, but PP performance is equal to the 3080.
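To illustrate why this correction matters (with made-up round numbers): a 2,000-token prompt processed at 2,000 t/s gives a 1-second time to first token, so an extra 350ms of network/scheduling overhead would understate apparent PP by roughly 25%, while for a 16,000-token prompt (8 s) the same overhead is only about a 4% error.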
Purchasing experience and pricing
At this moment, I was unable to find any source for these GPUs other than Alibaba. This platform has more of a customer-personalized flow: you're supposed to message a supplier of your choice and negotiate, then the supplier will send you an offer. Typically, you'll get the first response within half a day. To request a shipping cost estimate, you'll need to tell them your country, city, and postal code. Once all order details were finalized, I had to send them my shipping address and received the official offer. In my case, within 24 hours of payment via PayPal, the seller sent me a video of my cards running FurMark and GPU-Z in test benches. Within the next day, they sent me pictures of the package and the shipping paperwork, and asked me to verify the details. After that, the shipment was handed over to DHL. Overall, it took 6 days from the moment I paid to the moment I received the parcel. I would rate the experience as good.
People report that this site has a number of scammers. Alibaba itself provides customer protection, but it only works if all your communication and transactions are done via the platform. Therefore, if a supplier asks you to switch to WhatsApp, or to pay via wire transfer - refuse and find another one. If you open a supplier's profile on Alibaba, there will be a "Company Overview" page, where Alibaba openly states the number of transactions done by that supplier - try to find one with a big number, as that guarantees they deal within the platform and your customer protection will be in place. My GPU supplier had 300+ transactions and a storefront full of PC components.
My bill for the GPUs was structured in the following way: $415 x2 for the cards, $80 for shipping, $25 for shipping insurance (applied by Alibaba), $25 in PayPal transaction fees, and 160 EUR for import customs. In total, I've paid 1008.53 EUR, so the final price is roughly 500 EUR per card.
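(For context on the total: the USD charges add up to $960, which was roughly 850 EUR at the exchange rate at the time; adding the 160 EUR of customs gives the 1008.53 EUR total, i.e. about 504 EUR per card.)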
Was this a good purchase, and should you get one?
Let's talk about the price. At the moment of writing, the cheapest 3090 in Europe on eBay is 730 EUR including shipping. This makes the 3080 20GB the better value: it costs 25 EUR per GB of VRAM, versus roughly 30 EUR/GB for the 3090. From the performance comparison we can see that the price/performance ratio of the two cards is roughly equal. Given that this card is physically well prepared to fit into workstations and servers, it also has an edge over the 3090 and other gaming cards for multi-GPU setups. However, there are some caveats: as we can see from the single card KV cache measurements, those missing 4GB significantly limit the available prompt lengths, restricting long-context use cases to MoE models only. On the other hand, at the moment of writing, only 16GB Nvidia cards are available for 500 EUR, so when price-per-card is considered, the 3080 20GB has an edge over any other option.
There are also some concerns about longevity: this 3080 is most likely built from GPU cores and VRAM salvaged from mining cards, so the reliability of such a product is unknown. On this sub, I've seen some people claiming that a modded 2080 Ti 22GB worked for a very long time for them, while others claimed that it failed within a month, so we can draw the conclusion that a modded card can be reliable, but this isn't guaranteed. I've decided to take this risk, and at this moment I'm happy with my purchase. These cards will work 24/7 in my personal inference server, and I promise to update this post if they ever fail in the upcoming years.
I hope that you found this set of benchmarks useful, and that this post will spark more discussion about these Chinese-made Nvidia cards, as at the moment these options seem to stay out of sight of the majority of this subreddit. Later, when I have some more spare time, I'll also benchmark these cards in ComfyUI for image/video generation.
Long_comment_san@reddit
Makes no sense whatsoever. Old repurposed mining junk with no warranty and giant heat draw. If it was 30GB, then it would have made a lot of sense at $500-550.
No-Refrigerator-1672@reddit (OP)
So what's your proposed alternative? Is there any other Nvidia card that provides equal or better price per GB and has as good software support? Keep in mind that the 3090 isn't available at $500 anymore in many parts of the world.
fallingdowndizzyvr@reddit
What seller did you buy from?
No-Refrigerator-1672@reddit (OP)
This one. I've dealt with this seller twice, both times my experience was good.
Autumnrain@reddit
Have to buy 2 cards minimum?
No-Refrigerator-1672@reddit (OP)
Depends on the seller. Most of them do require you to buy at least two.
fallingdowndizzyvr@reddit
Thank you.
kryptkpr@reddit
Fantastic write-up, the vLLM vs SgLang comparison in particular has me thinking I really need to play with SgLang more..
Smooth-Cow9084@reddit
In my country, in Europe, I can buy 3090s in the low 500s. I actually got one for 500€ + 40€ shipping, fees...
Maybe second hand markets other than eBay paint a different picture
No-Refrigerator-1672@reddit (OP)
Yeah, I see that the 3090 situation varies country by country. In my place (Latvia), second hand 3090s start at 700 EUR, even on Facebook Marketplace and other local sites. Maybe you can find one at 600, but you'll have to monitor the marketplace daily. The last time I saw a 3090 locally for 500 EUR was in March. Some countries are just luckier than others 😁 At the same time, those 3080s are available on demand in any quantity.
Smooth-Cow9084@reddit
Damn. Any reason why they could be so high there? I believe Latvia is not richer than my country, so I doubt it's purchasing power.
No-Refrigerator-1672@reddit (OP)
I can only come up with one idea: on the local advertisement board, there are some "I'll buy your 3090/4090" ads. I guess that any seller who wants to sell low and fast gets swooped up by those bulk buyers, with only the high price options being left. Multiply that by the fact that there's only 2M people here, and you get a classic supply and demand situation.
Endlesscrysis@reddit
What country is this? And what site?
false79@reddit
What's the memory bandwidth on this card? I have a feeling you can go cheaper, faster, and bigger with AMD.
No-Refrigerator-1672@reddit (OP)
It matches the unmodded spec - 760GB/s; I've validated it by running cuda_bandwidthtest from the default SDK. AMD is questionable. First, just like Nvidia, AMD's best card in the under-500-EUR range is just 16GB. But software compatibility is what really bothers me. ROCm officially supports only the 7900 XT/XTX and, I believe, the 9070 XT; every other consumer model is unofficial. I've had a pair of Mi50s before, and I've learned that basically everything beyond llama.cpp requires CUDA, and I'm not willing to risk my money again by getting a card that will be held back by software.
false79@reddit
That's cool - thanks for the confirmation. Definitely in agreement.