AMD BC-250 and the search for Cheap Compute

Posted by dugganmania@reddit | LocalLLaMA | 24 comments

I've been searching for disused/underappreciated compute vectors for a few months, since the MI50 shot up in price - in comes the BC-250, a salvaged PS5 APU on a standalone board: Zen 2, 16 GB unified GDDR6, RDNA 2 (gfx1013). They're $50-150 on eBay and ship with 24 of 40 CUs enabled.

Got curious and started reading through the amdgpu source. It turns out two registers control CU availability:

CC_GC_SHADER_ARRAY_CONFIG, tells the driver how many CUs exist
SPI_PG_ENABLE_STATIC_WGP_MASK, tells the shader processor where to send work

Both are writable from inside the driver init path, it turns out, clearing the hardware registers. You have to set both; either one alone does nothing.
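Why both registers have to change can be pictured with a toy model. This is only a sketch, not the real register layout: the bit positions, the mask polarity, and the names `mask_of` and `usable_cus` are all assumptions made for illustration, and the actual patch lives in the repo. The only hardware fact it relies on is that RDNA 2 groups CUs in pairs (WGPs), so 40 CUs is 20 WGPs and 24 CUs is 12.

```python
# Toy model only: real register layouts and bit polarity are NOT shown here.
CUS_PER_WGP = 2          # RDNA 2 pairs CUs into work-group processors (WGPs)
TOTAL_WGPS = 20          # 40 CUs / 2
STOCK_WGPS = 12          # 24 CUs / 2


def mask_of(n_wgps):
    """Hypothetical helper: a mask with the low n_wgps bits set."""
    return (1 << n_wgps) - 1


def usable_cus(driver_wgp_mask, spi_wgp_mask):
    """CUs that the driver knows about AND that the dispatcher sends work to."""
    both = driver_wgp_mask & spi_wgp_mask
    return CUS_PER_WGP * bin(both).count("1")


stock = mask_of(STOCK_WGPS)
full = mask_of(TOTAL_WGPS)

print("stock / stock:", usable_cus(stock, stock))  # neither register changed
print("full  / stock:", usable_cus(full, stock))   # only the driver-side view changed
print("stock / full :", usable_cus(stock, full))   # only the dispatch mask changed
print("full  / full :", usable_cus(full, full))    # both changed
```

In this model the usable set is the intersection of the two masks, which is one way to read "either one alone does nothing".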

pp512 numbers (Vulkan, llama.cpp):

Config	tok/s	Power	Temp
24 CU @ 1500 MHz	230	55W	71C
40 CU @ 1500 MHz	372	125W	83C
40 CU @ 2 GHz	466	181W	96C
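
To put the table in perspective, here is a small sketch that derives tok/s per watt and the speedup over the stock 24 CU config. The numbers are copied from the table; the "ideal" speedup assumes throughput scales linearly with CUs x clock, which is a simplifying assumption rather than a measured property of this board.

```python
# Rows copied from the pp512 table: (config, CUs, MHz, tok/s, watts)
RESULTS = [
    ("24 CU @ 1500 MHz", 24, 1500, 230, 55),
    ("40 CU @ 1500 MHz", 40, 1500, 372, 125),
    ("40 CU @ 2 GHz", 40, 2000, 466, 181),
]


def summarize(rows):
    """Derive efficiency and scaling relative to the first (stock) row."""
    _, base_cus, base_mhz, base_tps, _ = rows[0]
    summary = []
    for config, cus, mhz, tps, watts in rows:
        summary.append({
            "config": config,
            "tok_per_watt": tps / watts,
            "speedup": tps / base_tps,
            # Assumption: ideal scaling is linear in CUs x clock.
            "ideal_speedup": (cus * mhz) / (base_cus * base_mhz),
        })
    return summary


for row in summarize(RESULTS):
    print(f'{row["config"]:<18} {row["tok_per_watt"]:.2f} tok/s/W  '
          f'{row["speedup"]:.2f}x measured  {row["ideal_speedup"]:.2f}x ideal')
```

On these figures the unlock at the same clock lands close to linear CU scaling, while tok/s per watt drops as CUs and clocks go up.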

I've also been working on a custom HIP kernel for gfx1013, since there isn't one, nor are there any optimizations available in Tensile. HIP already beats Vulkan on token generation (48 vs 30 tok/s on a 9B model); prefill is still behind but closing. The Vulkan backend uses fp16 FMA dequant, which is hard to match with HIP's int8 dp4a path, but we're building a custom MMQ kernel that restructures the data flow to match what RADV's compiler does. Early results are promising: already +63% pp on Q6_K over baseline HIP.
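
For anyone unfamiliar with the int8 path mentioned above: a dp4a-style instruction takes two 32-bit words, treats each as four packed signed 8-bit lanes, multiplies lane by lane, and adds the sum to a 32-bit accumulator. The sketch below emulates that in plain Python purely to show the semantics. The helper names are made up for illustration, and it says nothing about how the actual MMQ kernel is structured.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def dp4a(a_word, b_word, acc):
    """Emulate a signed dp4a: acc + sum(a[i] * b[i]) over four int8 lanes."""
    a = struct.unpack("<4b", struct.pack("<I", a_word))
    b = struct.unpack("<4b", struct.pack("<I", b_word))
    return acc + sum(x * y for x, y in zip(a, b))


a = pack_int8x4([1, -2, 3, -4])
b = pack_int8x4([5, 6, -7, 8])
print(dp4a(a, b, 10))  # 10 + (5 - 12 - 21 - 32) = -50
```

The appeal is four multiply-adds per instruction on quantized data, versus dequantizing to fp16 first and using FMAs as the Vulkan path does.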

repo: https://github.com/duggasco/bc250-40cu-unlock

discord if you have one of these boards: discord.gg/8eZfFWhczz

snapo84@reddit

Would you be able to do a test with tinygrad and beamsearch = 4 on a GGUF of your choice?
tinygrad can talk directly to AMD GPUs (and should in theory also talk to this one), and beam search is for finding optimized kernels. I wonder what speed you would get...

dugganmania@reddit (OP)

Working on it now! gfx1013 is an odd bird so some patching will be necessary

Noxusequal@reddit

How did you benchmark at 40 CUs? I thought it only has 24 that are accessible.

dugganmania@reddit (OP)

Nope, through some research it turns out a few register writes from software were all it took - see my repo for the patch!

Qwen_os_has_died@reddit

The MI50 is a better solution.

fallingdowndizzyvr@reddit

The MI50 is an overpriced solution. Sure, when it was $130, why not? But at the current prices, why?

dugganmania@reddit (OP)

I agree - this is $130 price without need for anything but a power source. PXE boot over network and a server PSU and you can run 10x of them for the price of a single 3090 at the current inflated prices, with 160gb of unified GDDR6 mem
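
The arithmetic behind the 10-board figure, as a quick sketch. The 16 GB per board and the $130 price come from this thread; the function name is just for illustration. Note the total is an aggregate across separate boards, each with its own 16 GB, not a single address space.

```python
BOARD_MEMORY_GB = 16  # per-board GDDR6, from the post


def cluster_totals(boards, price_per_board_usd):
    """Aggregate memory and board cost for a cluster of BC-250s."""
    return {
        "boards": boards,
        "total_memory_gb": boards * BOARD_MEMORY_GB,
        "total_cost_usd": boards * price_per_board_usd,
    }


print(cluster_totals(10, 130))
print(cluster_totals(12, 130))
```

This leaves out PSUs, networking, and cooling, which vary too much by setup to put a number on here.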

mdziekon@reddit

Is it really possible to connect a bunch of BC-250s and achieve unified VRAM for inference? So far all I've heard about such a setup involves llama.cpp's RPC mode, and that its performance is not great. So what's the solution here, a different inference engine? Or is my research wrong?

fallingdowndizzyvr@reddit

PXE boot over network and a server PSU and you can run 10x of them for the price of a single 3090 at the current inflated prices, with 160gb of unified GDDR6 mem

If that's the goal, then you are about a year late, since the ASRock servers with 12 BC-250s in them were $350. That's why people were selling these BC-250s individually for $50. They bought the entire server for $350 and broke it apart.

dugganmania@reddit (OP)

Fair, but like everything else, the price of compute has only gone up over the past year. I think at the current price point it’s still feasible. $300-400? Even $200? Nah.

dugganmania@reddit (OP)

I think the two are different in terms of usefulness. The BC-250 is a self-contained system. It pretty much matches perf, excluding mem bandwidth, with Vulkan. I think with a custom HIP kernel these could be even more interesting.

Formal-Exam-8767@reddit

If you got 12 of these BC-250 boards together in a 4U rackmount server chassis (like the ASRock 4U12G), that would be 192GB and no issues with cooling, right?

machinegunkisses@reddit

Madlads! Disused hardware, register hacks, custom kernels, that's legit. This is the quality content the Internet was made for. Good luck!

Glittering-Call8746@reddit

What's the comparison with the MI50 16GB?

dugganmania@reddit (OP)

I think the MI50 16GB wins on pp, but tg is close. HBM wins for mem latency and raw cores, but the BC-250 wins on simplicity and overall price if you're starting from the ground up (self-contained). I only have an MI50 32GB to run benchmarks on, but would be happy to compare if someone has a 16GB on hand to test against.

fallingdowndizzyvr@reddit

On one GPU of a $50 V340, its PP is 318 tk/s. But there are two GPUs on the card, so once TP works better, it should be better than that.

fallingdowndizzyvr@reddit

It's awesome that you were able to figure this out. I would have just assumed it was impossible, since I thought AMD learned its lesson from the pencil days and started cutting physical traces on the die. But in this case, I guess they took a shortcut.

They're $50-150 on eBay and ship with 24 of 40 CUs enabled

They were $50 about a year ago. I haven't seen it that cheap in a while.

dugganmania@reddit (OP)

$50 recently for "as is" boards - I bought 5 of these and 3/5 were fine. The issues so far, in my experience, range from needing to clear CMOS (easy) to needing to recap the electrolytics, as they're mostly only rated for 5k hours.

Subject-Ad-9934@reddit

I've got 10 of these.
How would I connect them all up?

dugganmania@reddit (OP)

That's the next step... I'm working on building a cluster of 12 myself. Quickest/easiest is through the GbE port, but I'm testing unlocking the PCIe 4.0 lanes on the M.2 and passing through 10GbE via an M.2 -> PCIe 4.0 card.

reto-wyss@reddit

Neat! I've looked into these, but the heat sink design makes it annoying or jank, and idle power is really bad, although there may be patches.

CU unlock sounds interesting for gaming as well. Have you run any tests on that?

dugganmania@reddit (OP)

Sleep is one of the things being floated. There have been a LOT of gaming benchmarks run on the discord since I released the patch a few days ago since a lot of folks are using the board as a “steam box adjacent”. FPS perf seems to average between 40-60% depending on the board

reto-wyss@reddit

Any reports that some boards have defective CUs so far?

dugganmania@reddit (OP)

Yes, it depends on the stock harvested-CU map. If the harvesting is in the last 8 CU pairs, there seems to be a better shot at having a clean, working 40 CUs. It is currently an active area of research in the community.