2x Instinct MI50 32G running vLLM results

Posted by NaLanZeYu@reddit | LocalLLaMA | 81 comments

I picked up these two AMD Instinct MI50 32G cards from a second-hand trading platform in China. Each card cost me 780 CNY, plus an additional 30 CNY for shipping. I also grabbed two cooling fans to go with them, each costing 40 CNY. In total, I spent 1730 CNY, which is approximately 230 USD.

Even though it’s a second-hand trading platform, the seller claimed they were brand new. Three days after I paid, the cards arrived at my doorstep. Sure enough, they looked untouched, just like the seller promised.

The MI50 cards can’t output video (even though they have a miniDP port). To use them, I had to disable CSM completely in the motherboard BIOS and enable the Above 4G decoding option.

System Setup

Hardware Setup

Intel Xeon E5-2666V3
RDIMM DDR3 1333 32GB*4
JGINYUE X99 TI PLUS

One MI50 is plugged into a PCIe 3.0 x16 slot, and the other is in a PCIe 3.0 x8 slot. There’s no Infinity Fabric Link between the two cards.

Software Setup

PVE 8.4.1 (Linux kernel 6.8)
Ubuntu 24.04 (LXC container)
ROCm 6.3
vLLM 0.9.0

The vLLM I used is a modified version. The official vLLM support on AMD platforms has some issues. GGUF, GPTQ, and AWQ all have problems.

vllm serve Parameters

docker run -it --rm --shm-size=2g --device=/dev/kfd --device=/dev/dri \
    --group-add video -p 8000:8000 -v /mnt:/mnt nalanzeyu/vllm-gfx906:v0.9.0-rocm6.3 \
    vllm serve --max-model-len 8192 --disable-log-requests --dtype float16 \
    /mnt/<MODEL_PATH> -tp 2

vllm bench Parameters

# for decode
vllm bench serve \
    --model /mnt/<MODEL_PATH> \
    --num-prompts 8 \
    --random-input-len 1 \
    --random-output-len 256 \
    --ignore-eos \
    --max-concurrency <CONCURRENCY>

# for prefill
vllm bench serve \
    --model /mnt/<MODEL_PATH> \
    --num-prompts 8 \
    --random-input-len 4096 \
    --random-output-len 1 \
    --ignore-eos \
    --max-concurrency 1

Results

~70B 4-bit

Model	Size / Quant	1x Concurrency	2x Concurrency	4x Concurrency	8x Concurrency	Prefill
Qwen2.5	72B GPTQ	17.77 t/s	33.53 t/s	57.47 t/s	53.38 t/s	159.66 t/s
Llama 3.3	70B GPTQ	18.62 t/s	35.13 t/s	59.66 t/s	54.33 t/s	156.38 t/s

~30B 4-bit

Model	Size / Quant	1x Concurrency	2x Concurrency	4x Concurrency	8x Concurrency	Prefill
Qwen3	32B AWQ	27.58 t/s	49.27 t/s	87.07 t/s	96.61 t/s	293.37 t/s
Qwen2.5-Coder	32B AWQ	27.95 t/s	51.33 t/s	88.72 t/s	98.28 t/s	329.92 t/s
GLM 4 0414	32B GPTQ	29.34 t/s	52.21 t/s	91.29 t/s	95.02 t/s	313.51 t/s
Mistral Small 2501	24B AWQ	39.54 t/s	71.09 t/s	118.72 t/s	133.64 t/s	433.95 t/s

~30B 8-bit

Model	Size / Quant	1x Concurrency	2x Concurrency	4x Concurrency	8x Concurrency	Prefill
Qwen3	32B GPTQ	22.88 t/s	38.20 t/s	58.03 t/s	44.55 t/s	291.56 t/s
Qwen2.5-Coder	32B GPTQ	23.66 t/s	40.13 t/s	60.19 t/s	46.18 t/s	327.23 t/s
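
One way to read the tables above is to ask how well decode throughput scales with concurrency. The sketch below is only an illustration: it copies a few rows from the tables and computes per-stream speed and scaling efficiency relative to the 1x figure. The helper names (`per_stream`, `scaling_efficiency`) are made up for this example and are not part of vLLM.

```python
# Decode throughput in tokens/s, copied from the result tables, keyed by concurrency.
results = {
    "Qwen2.5 72B GPTQ (4-bit)": {1: 17.77, 2: 33.53, 4: 57.47, 8: 53.38},
    "Qwen3 32B AWQ (4-bit)": {1: 27.58, 2: 49.27, 4: 87.07, 8: 96.61},
    "Qwen3 32B GPTQ (8-bit)": {1: 22.88, 2: 38.20, 4: 58.03, 8: 44.55},
}


def per_stream(total_tps, concurrency):
    """Average tokens/s seen by each concurrent request."""
    return total_tps / concurrency


def scaling_efficiency(by_concurrency, concurrency):
    """Aggregate throughput divided by perfect linear scaling of the 1x result."""
    return by_concurrency[concurrency] / (concurrency * by_concurrency[1])


if __name__ == "__main__":
    for model, by_conc in results.items():
        for conc in sorted(by_conc):
            print(
                f"{model:26s} {conc}x: "
                f"{per_stream(by_conc[conc], conc):6.2f} t/s per stream, "
                f"{scaling_efficiency(by_conc, conc):5.1%} of linear"
            )
```

Read this way, the 4-bit 32B models keep gaining up to 8x concurrency, while the 70B 4-bit and 32B 8-bit runs peak at 4x and drop at 8x.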

[-]

henfiber@reddit

Performance-wise, this is roughly equivalent to a 96GB M3 Ultra, for $250 + old server parts?

Roughly 20% slower in compute and 25% faster in memory bandwidth.

[-]

fallingdowndizzyvr@reddit

old server parts?

For only two cards, I would get new desktop parts. Recently you could get a 265K + 64GB DDR5 + 2TB of SSD + MB with x16 and 2x4 + a bunch of games for $529. Add a case and PSU and you have something that can house 2 or 3 GPUs.

[-]

ashirviskas@reddit

Recently you could get a 265K + 64GB DDR5 + 2TB of SSD + MB with 1x16 and 2x4 + a bunch of games for $529

Damn, I wanna go back

[-]

Qwen30bEnjoyer@reddit

Are you also wistfully looking at what used to be cheap?

[-]

Qwen30bEnjoyer@reddit

The good ol’ days.

[-]

dragonbornamdguy@reddit

Won't this limit cross-card communication? The cards support PCIe 4.0 x16, but your setup will have something like x4 or x8 on the second card.

[-]

fallingdowndizzyvr@reddit

The communication is a few KB a token. Even x1 is fine for that.

[-]

apatheticonion@reddit

How on earth did you get that price? 😂

[-]

london_invest@reddit

Which platform did you buy them from?

[-]

london_invest@reddit

This is very interesting. Does your setup act like a 256GB GPU or 8 X 32GB?

[-]

extopico@reddit

Well you win the junkyard wars. This is great performance at a bargain price…at the expense of knowledge and time to set it up.

[-]

No-Refrigerator-1672@reddit

Actually, the time to set up those cards is almost equal to Nvidia, and the knowledge required is minimal. llama.cpp supports them out of the box; you just have to compile the project yourself, which is easy enough to do. Ollama supports them out of the box, no configuration needed at all. Also, mlc-llm runs on the MI50 out of the box with the official distribution. The only problems I've encountered so far are getting LXC container passthrough to work (which isn't required for regular people), getting vLLM to work (which is nice to have, but not essential), and getting llama.cpp to work with dual cards (tensor parallelism fails miserably; pipeline parallelism works flawlessly for some models and then fails for others). I would say that for the price I paid for them, this was a bargain.

[-]

moderately-extremist@reddit

Did you get LXC container passthrough to work?

[-]

No-Refrigerator-1672@reddit

Yes, I did, but it wasn't a pleasant experience whatsoever. I don't remember which guide I ended up following, but in the end, most of my problems were caused by having a Ryzen iGPU in the system, like I've shared here.

[-]

moderately-extremist@reddit

Hey thanks for the link, I'll have to give that a try tomorrow.

[-]

Extension_Ada@reddit

@NaLanZeYu, sent you a DM. Having a hard time configuring vLLM parallelism with 3 Mi50GB. Willing to pay 400USD if you can help :)

[-]

Som1tokmynam@reddit

I don't know if anyone answered you. The issue with vLLM is that tensor parallelism needs a power-of-two number of GPUs (2^N), so 1, 2, 4, 8, etc.
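
As far as I know, the underlying constraint is that vLLM requires the model's attention-head count to be divisible by the tensor-parallel size. That is why GPU counts of 1, 2, 4, and 8 usually work and 3 usually does not. A minimal sketch of that check, assuming a hypothetical model with 64 attention heads (the head count and the function name `valid_tp_sizes` are illustrative, not read from any real config or taken from vLLM's code):

```python
def valid_tp_sizes(num_attention_heads, max_gpus):
    """Return the GPU counts up to max_gpus that evenly divide the head count."""
    return [
        tp for tp in range(1, max_gpus + 1)
        if num_attention_heads % tp == 0
    ]


if __name__ == "__main__":
    # Hypothetical head count, chosen only for illustration.
    heads = 64
    print(valid_tp_sizes(heads, 8))  # 3, 5, 6 and 7 are rejected for 64 heads
```

With three cards, the practical options are to run tensor parallelism on two of them or to add a fourth.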

[-]

Extension_Ada@reddit

Thanks! No, nobody had answered. Guess I'll have to go with 2 or buy one more to get to 4.

[-]

MLDataScientist@reddit

thank you for sharing! Great results! I will have a 8xMI50 32GB setup soon. Can't wait to try out your vLLM fork!

[-]

BeeNo7094@reddit

Do you have any numbers with the 8x setup? What motherboard did you choose?

[-]

MLDataScientist@reddit

Hi! I got an ASRock ROMED8-2T with 8x32GB 3200 MHz DDR4. Waiting for the CPU now, an AMD EPYC 7532. It should arrive later this week. All of them together cost me $1k. I think it was a good deal. Once I get my CPU, I will run 8 GPUs at PCIe 4.0 x16 and post benchmark results in this subreddit.

[-]

Potential-Leg-639@reddit

interesting stuff!
what's the power draw of that monster with all those GPUs and stressing them a bit with a larger model?

[-]

MLDataScientist@reddit

Hi! I just completed the build today. Idle power usage is 350w. llama.cpp model running on all 8 GPUs averages around 750w (spikes up to 1100W for a second).

[-]

CauliflowerOdd6543@reddit

Could you please make a post with your results on larger models? 😊

[-]

net3x@reddit

I think giving 8 lanes to each GPU is overkill; 4 lanes should work just fine if you are constrained. People overestimate PCIe lanes for GPUs, especially if you run on PCIe 4.

[-]

BeeNo7094@reddit

I have the same motherboard, it only has 7 x16 slots, how are you planning to use the 8th GPU?

[-]

MLDataScientist@reddit

I have PCIe 4.0 x16 to x16 x16 active switches (Gigabyte branded). I will use two of them. 8x MI50 32GB GPUs and one RTX 3090.

[-]

BeeNo7094@reddit

Can you please share a link or serial number that I can search for?

[-]

MLDataScientist@reddit

Yes, search for Gigabyte G292-Z20 Riser Card. eBay still has some of them at around $45.

[-]

BeeNo7094@reddit

https://ebay.us/m/H7YWji Is this an active switch riser? I have a x16 to x8x8 bifurcator but simply don’t have the physical space between two risers to get it plugged into the motherboard and also plug in 2 risers in the bifurcator. What case/cabinet are you planning for?

[-]

MLDataScientist@reddit

Yes, that is an active switch but you don't need the case. This one is also fine and cheaper without the case: https://ebay.us/m/fZOuXj

[-]

BeeNo7094@reddit

I am also using an open rack mining rig. I have kind of run out of physical space to mount GPUs. I have an Arctic Freezer 4U CPU cooler, and mounting 7 GPUs with 200mm risers was a pain. 400mm risers could help, I suppose.

[-]

BeeNo7094@reddit

How would you plug multiple risers alongside riser cables? The pcie connector also looks a bit proprietary, it has a second smaller connector

[-]

MLDataScientist@reddit

Note that there are two versions of this active switch card.

Someone had this version in which the two x16 female slots were on the right side of the power connectors. They used SATA cable and soldered the other end as follows:

12V and GND: https://i.imgur.com/2OG2Wso.jpeg

3.3V: https://i.imgur.com/QFUanAL.jpeg

I had this version where two female PCIE slots are on the left side of the power connector:

The first pin on the right (shown with an arrow in the image) should be connected to 3.3V and back side for the same pin should have 12V and next pin should be GND line. The male PCIE on the right should be connected to your motherboard (via a 300-400mm PCIE4.0 riser cable) and the two female PCIE slots on the left are used for direct GPU connection (2x MI50) in my case.

[-]

BeeNo7094@reddit

Thanks a lot for the details, can’t imagine how long it took you to dig that info up.

What’s your opinion on backplanes like this https://ebay.us/m/LHAghB ?

[-]

MLDataScientist@reddit

Interesting. But you will be limited to SlimSAS 8i speeds when all slots are used. A SlimSAS 8i connection provides 16 GT/s per channel and has 8 channels (ref: https://www.amphenol-ast.com/v3/en/product_view.aspx?id=235 ). That works out to (16 GT/s / 8 bits * 2 directions) ~4 GB/s of two-way bandwidth per channel, so the whole SlimSAS 8i link carries 4 * 8 = ~32 GB/s two-way. A single PCIe 4.0 x16 slot has ~64 GB/s of two-way bandwidth. So this backplane limits each GPU to 32 GB/s / 10 = 3.2 GB/s, and 64 GB/s / 3.2 GB/s = a 20x decrease in speed. Unless you are doing mining, this is not worth the investment. A single PCIe 4.0 x16 slot offers more bandwidth than one SlimSAS 8i.
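
The estimate above can be reproduced as a quick back-of-the-envelope script. This is just the same arithmetic written out, using the figures quoted in the comment. It ignores line-encoding and protocol overhead, and the divisor of 10 is taken from the comment as the number of GPUs sharing the link.

```python
# All inputs are the figures quoted in the comment, not measured values.
GT_PER_S_PER_CHANNEL = 16   # SlimSAS 8i line rate per channel
CHANNELS = 8                # channels in one SlimSAS 8i link
GPUS_SHARING_LINK = 10      # the "/ 10" divisor used in the comment
PCIE4_X16_GB_S = 64         # two-way bandwidth of a PCIe 4.0 x16 slot


def two_way_gb_per_s(gt_per_s, channels):
    """One transfer moves one bit per channel: /8 for bytes, *2 for both directions."""
    return gt_per_s / 8 * 2 * channels


link_total = two_way_gb_per_s(GT_PER_S_PER_CHANNEL, CHANNELS)
per_gpu = link_total / GPUS_SHARING_LINK
slowdown = PCIE4_X16_GB_S / per_gpu

if __name__ == "__main__":
    print(f"SlimSAS 8i total: {link_total:.1f} GB/s two-way")
    print(f"Per GPU:          {per_gpu:.1f} GB/s two-way")
    print(f"Versus a dedicated PCIe 4.0 x16 slot: {slowdown:.0f}x less")
```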

[-]

Sisuuu@reddit

Any update on this? Performance etc

[-]

MLDataScientist@reddit

8x MI50 rig is still in the making (llama.cpp works but vllm needs more power due to tensor parallelism). Here is the 4x MI50 results: https://www.reddit.com/r/LocalLLaMA/comments/1nme5xy/4x_mi50_32gb_reach_22_ts_with_qwen3_235ba22b_and/

[-]

seesharpshooter@reddit

Can someone help? It's not working for me.
I have Ubuntu 22.04, 3 MI50 32GB cards, and a Huananzhi X99 F8D Plus motherboard. I am getting the error below.

(VllmWorkerProcess pid=231) INFO 10-06 10:43:22 [rocm.py:193] Using ROCmFlashAttention backend.

ERROR 10-06 10:43:22 [engine.py:454] HIP error: invalid argument

ERROR 10-06 10:43:22 [engine.py:454] Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

[-]

UnknownProcess@reddit

I have the same problem with OP's docker image.

I also have a feeling that it might be something to do with the driver and possibly related to https://github.com/ollama/ollama/issues/9302

I have tried ROCm 5.7.0 and 6.3.0, no luck.
I have tried different kernel versions for Ubuntu 22, no luck.

The only thing that works for me is ROCm 5.7.0 + the latest Ollama, but only a few models work, such as Qwen3. Some models trigger this issue. I never got vLLM to work.

[-]

seesharpshooter@reddit

working with Ubuntu 24.04 out of the box. just install OS and install docker and run the commands

[-]

Ok_Cow1976@reddit

Thrilled to see you post here. I also got 2 MI50s. Could you please share the model cards of the quants? I have problems running GLM 4 and some other models. Thanks a lot for your great work!

[-]

NaLanZeYu@reddit (OP)

From https://huggingface.co/Qwen : Qwen series models except Qwen3 32B GPTQ-Int8

From https://modelscope.cn/profile/tclf90 : Qwen3 32B GPTQ-Int8 / GLM 4 0414 32B GPTQ-Int4

From https://huggingface.co/hjc4869 : Llama 3.3 70B GPTQ-Int4

From https://huggingface.co/casperhansen : Mistral Small 2501 24B AWQ

[-]

Familiar_Wish1132@reddit

pls help ^^

command: vllm serve --enable-expert-parallel --max-model-len 8192 --disable-log-requests --dtype float16 /mnt/Qwen3-Coder-30B-A3B-Instruct-AWQ -tp 1

vllm-gfx906-1 | File "/opt/torchenv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 153, in build_async_engine_client

vllm-gfx906-1 | async with build_async_engine_client_from_engine_args(

vllm-gfx906-1 | File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

vllm-gfx906-1 | return await anext(self.gen)

vllm-gfx906-1 | ^^^^^^^^^^^^^^^^^^^^^

vllm-gfx906-1 | File "/opt/torchenv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 280, in build_async_engine_client_from_engine_args

vllm-gfx906-1 | raise RuntimeError(

vllm-gfx906-1 | RuntimeError: Engine process failed to start. See stack trace for the root cause.

vllm-gfx906-1 exited with code 1

[-]

Ok_Cow1976@reddit

huge thanks!

[-]

Familiar_Wish1132@reddit

were you able to run glm ?

[-]

Potential-Leg-639@reddit

Hey.
Can you do some fresh tests with newer models?
That would be awesome!
Thanks mate

[-]

dazzou5ouh@reddit

How would this compare to a dual 3090 setup?

[-]

Ok-Nefariousness486@reddit

hey, i know im a bit late to this , but u/NaLanZeYu could you point to where you got them that cheap? ebay has them at 190 euro a pop

[-]

gurkburk76@reddit

How much does it draw? I was thinking of a 5060 Ti 16GB, but this is twice the memory at half the price from what I can find.

[-]

jetaudio@reddit

I'm encountering a strange issue with my system. It fails to cold boot with an AMD Instinct MI50 32GB using a specific firmware (https://www.techpowerup.com/vgabios/276180/276180). To get the system to start, I have to follow this sequence:

Press the power button. The boot check LED flashes, but the screen remains black, and the PC does not boot.
Press the reset button. The system then starts up and runs normally.

Interestingly, I can boot without any issues when using a "Chinese" MI50 (which is recognized as a Radeon Pro VII 16GB).

My system specifications are:

Motherboard: MSI H410M-A PRO
CPU: Intel i5-10400
RAM: 32GB DDR4 2666MHz

Can you give me some advice?

[-]

AppropriateWay4215@reddit

I had similar issues. They were caused by the BIOS reverting to CSM mode, because the motherboard expected a UEFI GOP capable video card. The MI50 is not one, as it is a compute card, unlike the Radeon VII. In my case I sorted it out by adding a cheap Quadro P620 in the third PCIe slot. So, all in all, adding a cheap dummy GPU (UEFI GOP capable) resolved my issues. Obviously it all depends on the motherboard, BIOS, etc., but it's worth trying. Hope it helps.

[-]

jetaudio@reddit

I disabled CSM completely in the BIOS, but it still cannot boot normally. So now I use a 16GB MI50 flashed with the Radeon Pro VII BIOS as my dummy GPU, and it runs.

[-]

ThunderousHazard@reddit

Great find, great price and great post.

I have a similar setup with Proxmox (LXC Debian with the cards mounted in it), and it's great being able to share cards simultaneously across various LXCs.

Seems like for barely 230$ you could support up to 4 users with "decent" (given the cost) speeds (assuming at least ~60tk/s in total, so ~15tk/s each).

I would assume these tests are not done with a lot of data in the context? Would be nice to see the deterioration as the used ctx size increases.

[-]

NaLanZeYu@reddit (OP)

During the decode phase, the performance remains relatively stable when the context size is below 7.5k. However, when the context size reaches about 8k, decode performance suddenly drops by half.

[-]

jetaudio@reddit

I believe that it's because of the pcie3.0 limitation

[-]

Scotty_tha_boi007@reddit

I've had some trouble getting gpu passthrough working on my mi60, did you do anything special?

[-]

HilLiedTroopsDied@reddit

Vendor reset? Search GitHub for it

[-]

No-Break-7922@reddit

Seems like for barely 230$ you could support up to 4 users with "decent" (given the cost) speeds

Whoever is going big on GPUs today is up for a rude awakening next year with China entering the GPU manufacturing scene. It was hilarious seeing all the comments today on nvidia's earnings announcements.

[-]

gpupoor@reddit

Unless they only do small tasks, you probably can't. Prompt processing speeds aren't fabulous since the cards don't have tensor cores.

[-]

zekken523@reddit

FOR ALL INTERESTED IN GFX-906 (mi50/60, Radeon VII/Pro) --> https://discord.gg/k8H4kAfg6N

[-]

Affectionate-Main385@reddit

I love your work. Please keep it up ❤️

[-]

theanoncollector@reddit

How are your long context results? From my testing long contexts seem to get exponentially slower.

[-]

No-Refrigerator-1672@reddit

Using the linked vllm-gfx906 with 2xMi50 32 GB, official Qwen3-32B-AWQ image, and all generation parameters left default, I get the following results while serving a single client's 17.5k long request. The falloff is noticeable, but, I'd say, reasonable. Unfortunately, right now I don't have anything that can generate even longer prompt for testing.

INFO 05-31 06:49:00 [metrics.py:486] Avg prompt throughput: 114.9 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.4%, CPU KV cache usage: 0.0%.
INFO 05-31 06:49:05 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.5%, CPU KV cache usage: 0.0%.
INFO 05-31 06:49:10 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.6%, CPU KV cache usage: 0.0%.
INFO 05-31 06:49:15 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.7%, CPU KV cache usage: 0.0%.
INFO 05-31 06:49:20 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.8%, CPU KV cache usage: 0.0%.
INFO 05-31 06:49:25 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 18.9%, CPU KV cache usage: 0.0%.
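
If you want to track falloff like this over a longer run, the metrics lines are regular enough to parse. The following is a minimal sketch that assumes the log format stays exactly as shown above; the function name `parse_metrics` and the dictionary keys are my own choices, not anything vLLM provides.

```python
import re

# Matches the three fields of interest in a vLLM metrics line as logged above.
METRICS_RE = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r".*?GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def parse_metrics(line):
    """Return prompt/generation throughput and GPU KV cache usage, or None."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    return {
        "prompt_tps": float(match["prompt"]),
        "gen_tps": float(match["gen"]),
        "gpu_kv_pct": float(match["kv"]),
    }


if __name__ == "__main__":
    sample = (
        "INFO 05-31 06:49:05 [metrics.py:486] Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, "
        "Pending: 0 reqs, GPU KV cache usage: 18.5%, CPU KV cache usage: 0.0%."
    )
    print(parse_metrics(sample))
```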

[-]

HilLiedTroopsDied@reddit

Are you still using nalanzeyu's triton+vllm fork? I took the base parts with my own Ubuntu + ROCm and compiled llama.cpp into a 30GB container, OOF. It performs about the same, but slightly slower than Nalz's vLLM with comparable models. The PP is what kills its performance for me.

[-]

No-Refrigerator-1672@reddit

No, I don't. This fork, with Q4 AWQ and GPTQ quants, either outright refuses to load a multimodal LLM, or requires so much VRAM that I can only process 8k tokens on 2x32GB cards for a 32B model, which is hilarious. It's only usable for text-only models, which does not suit me. I do, however, re-test its compatibility each time nlzy makes an update, but no luck so far.

[-]

HilLiedTroopsDied@reddit

PP is indeed the real slowdown on these cards, they do a good amount of tk/s for 32GB cards.

[-]

SillyLilBear@reddit

Can you run Qwen 32B Q4 & Q8 and report your tokens/sec?

[-]

NaLanZeYu@reddit (OP)

I guess you're asking about GGUF quantization.

In the case of 1x concurrency, GGUF's q4_1 is slightly faster than AWQ. Qwen2.5 q4_1 initially achieved around 34 tokens/second, while AWQ reached 28 tokens/second. However, under more concurrency, GGUF becomes much slower.

q4_1 is not very commonly used. Its precision is approximately equal to q4_K_S and inferior to q4_K_M, but it runs faster than q4_K on the MI50.

BTW as of now, vLLM still does not support GGUF quantization for Qwen3.

[-]

MLDataScientist@reddit

Why is Q4_1 faster on the MI50 compared to other quants? Does Q4_1 use an int4 data type that is supported by the MI50? I know that the MI50 has around 110 TOPS of int4 performance.

[-]

NaLanZeYu@reddit (OP)

GGUF kernels all work by dequantizing weights to int8 first and then performing dot product operations. So they're actually leveraging INT8 performance, not INT4 performance.

Hard to say for sure if that's why GGUF q4_1 is a bit faster than Exllama AWQ. Could be the reason, or might not be. The Exllama kernel and GGUF kernel are pretty different in how they arrange weights and handle reduction sums.

As for why q4_1 is faster than q4_K, that's pretty clear, q4_1 has a much simpler data structure and dequantization process compared to q4_K.
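
To make "much simpler" concrete, here is a rough pure-Python sketch of q4_1 dequantization. It assumes ggml's q4_1 block layout as I understand it: 32 weights per block, one scale `d`, one minimum `m`, and 16 bytes that each hold two 4-bit values, with low nibbles filling the first half of the block and high nibbles the second half. The function name is made up for illustration, and real kernels work on packed fp16 data rather than Python floats. q4_K, by contrast, groups sub-blocks into larger super-blocks whose per-sub-block scales and minimums are themselves quantized, so there is an extra unpacking step before this one.

```python
QK4_1 = 32  # weights per q4_1 block (assumed from ggml's layout)


def dequantize_q4_1_block(d, m, qs):
    """Expand one q4_1 block: each weight is d * q + m, with q in 0..15.

    d  -- per-block scale
    m  -- per-block minimum (offset)
    qs -- 16 bytes, two 4-bit quantized values per byte
    """
    if len(qs) != QK4_1 // 2:
        raise ValueError("a q4_1 block carries exactly 16 packed bytes")
    low = [d * (byte & 0x0F) + m for byte in qs]   # weights 0..15
    high = [d * (byte >> 4) + m for byte in qs]    # weights 16..31
    return low + high


if __name__ == "__main__":
    # Every byte is 0x21: low nibble 1, high nibble 2.
    block = dequantize_q4_1_block(0.5, -1.0, bytes([0x21] * 16))
    print(block[0], block[16])
```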

[-]

MLDataScientist@reddit

Thanks! By the way, I ran your fork with MI50 cards and I was not able to reach a PP of ~300 t/s for Qwen3-32B-autoround-4bit-gptq. I tried AWQ as well with 2x MI50. I am getting 230 t/s in vLLM. TG is great! It reaches 32 t/s. I was running your fork of vLLM 0.9.2.dev1+g5273453b6. My question is: did something change between vLLM 0.9.0, which you tested, and the new version that results in a 25% performance loss in prefill speed? By the way, I connected both of them with PCIe 4.0 x8. System: AMD 5950X, Ubuntu 24.04.02, ROCm 6.3.4.

[-]

NaLanZeYu@reddit (OP)

Try setting the environment variable VLLM_USE_V1=0. PP on V1 is slower than V0 because they use different Triton attention implementations.

V1 became the default after v0.9.2 in upstream vLLM. Additionally, V1's attention is faster on TG and works fine with Gemma models. Therefore, I have switched to V1 as the default like the upstream did.

[-]

MLDataScientist@reddit

Thanks! Also, not related to vLLM, I tested the exllamav2 backend and API. Even though TG was slow for Qwen3 32B 5bpw at 13 t/s with 2x MI50, I saw PP reaching 450 t/s. So there might be room in vLLM to improve PP by 50%+.

[-]

woahdudee2a@reddit

1) is DDR3 a typo? i think x99 is DDR4

2) did you have to order the cards through an agent?

3) that vLLM fork says MoE quants dont work, I wonder if that's WIP? you could add another pair of MI50s and give Qwen3 A235B Q3 a shot

[-]

NaLanZeYu@reddit (OP)

1) Not a typo. Some Xeon E5 V3/V4 CPUs have both DDR3 and DDR4 controllers.
2) No. I live in China and dealt with the seller directly.
3) I am the author of that fork. I have no plans for MoE models.

[-]

AendraSpades@reddit

Can u provide a link to modified version of vllm?

[-]

NaLanZeYu@reddit (OP)

https://github.com/nlzy/vllm-gfx906

[-]

fallingdowndizzyvr@reddit

Even though it’s a second-hand trading platform, the seller claimed they were brand new. Three days after I paid, the cards arrived at my doorstep. Sure enough, they looked untouched, just like the seller promised.

My Mi25 was sold as used. But if it was used, it must have been in the cleanest datacenter on earth. Not a speck of dust on it, even deep in the heatsink, and not even a fingerprint smudge.

[-]

segmond@reddit

Very solid numbers!

[-]

a_beautiful_rhind@reddit

I thought you can reflash to different bios. At least for Mi25 it enables the output.

Very decent t/s speed, not that far from 3090 on 70b initially. Weaker on prompt processing. How bad does it fall as you add context?

Those cards used to be $500-600 USD and are now cheaper than a P40, wow.