AMD BC-250 and the search for Cheap Compute

Posted by dugganmania@reddit | LocalLLaMA

I've been searching for disused/underappreciated compute options for a few months, ever since the MI50 shot up in price. In comes the AMD BC-250: a salvaged PS5 APU on a standalone board, with Zen 2 cores, 16 GB of unified GDDR6, and an RDNA 2 GPU (gfx1013). They're $50-150 on eBay and ship with 24 of 40 CUs enabled.

Got curious and started reading through the amdgpu source. It turns out two registers control CU availability.

Both are writable from inside the driver init path, which makes it possible to clear the hardware registers. You have to set both; either one alone does nothing.
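For anyone wondering why changing a single register would have no effect, here is a toy model of one plausible explanation. It is an assumption for illustration, not code from the driver or the repo: if the driver ORs two disable masks together to decide which CUs are off, then any CU disabled in both stays disabled until both are cleared. The names `mask_a`, `mask_b`, and `STOCK_MASK` are hypothetical, and a single flat 40-bit mask is a simplification of however the hardware actually lays the bits out.

```python
TOTAL_CUS = 40  # physical CUs on the die, per the post


def active_cus(mask_a: int, mask_b: int, total: int = TOTAL_CUS) -> int:
    """Count the CUs that neither disable mask turns off.

    A set bit means "this CU is disabled". The two masks are ORed,
    so a CU is usable only if its bit is clear in BOTH of them.
    """
    disabled = (mask_a | mask_b) & ((1 << total) - 1)
    return total - bin(disabled).count("1")


# Hypothetical stock state: both masks disable the same 16 CUs (bits 24..39).
STOCK_MASK = ((1 << 16) - 1) << 24

stock = active_cus(STOCK_MASK, STOCK_MASK)
only_a_cleared = active_cus(0, STOCK_MASK)
only_b_cleared = active_cus(STOCK_MASK, 0)
both_cleared = active_cus(0, 0)

print(stock, only_a_cleared, only_b_cleared, both_cleared)  # 24 24 24 40
```

Under that assumption, clearing either mask alone leaves the count at 24, and only clearing both exposes all 40.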

pp512 numbers (Vulkan, llama.cpp):

Config             pp512 (tok/s)   Power (W)   Temp (°C)
24 CU @ 1500 MHz   230             55          71
40 CU @ 1500 MHz   372             125         83
40 CU @ 2000 MHz   466             181         96
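To put the table in perspective, here is a small sketch that works out speedup and throughput per watt from those three rows. The "ideal" column assumes prompt processing scales linearly with CUs x clock, which is my simplification and not something measured; `ROWS` and `summarize` are names made up for this example.

```python
# (config, CUs, clock in MHz, pp512 tok/s, power in W), copied from the table above
ROWS = [
    ("24 CU @ 1500 MHz", 24, 1500, 230, 55),
    ("40 CU @ 1500 MHz", 40, 1500, 372, 125),
    ("40 CU @ 2000 MHz", 40, 2000, 466, 181),
]


def summarize(rows):
    """Compare each config against the first row (the stock 24 CU setup)."""
    _, base_cus, base_mhz, base_toks, _ = rows[0]
    out = []
    for label, cus, mhz, toks, watts in rows:
        out.append({
            "config": label,
            # Measured pp512 speedup over stock.
            "speedup": round(toks / base_toks, 2),
            # Speedup if throughput scaled linearly with CUs x clock (assumption).
            "ideal_speedup": round((cus * mhz) / (base_cus * base_mhz), 2),
            # Efficiency: prompt tokens per second per watt.
            "tok_per_s_per_watt": round(toks / watts, 2),
        })
    return out


for row in summarize(ROWS):
    print(row)
```

By this arithmetic the CU unlock alone gives about 1.62x against an ideal 1.67x, and the unlock plus overclock gives about 2.03x against an ideal 2.22x. Efficiency goes the other way, from roughly 4.18 tok/s per watt at stock to about 2.57 at 40 CU @ 2000 MHz.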

I've also been working on a custom HIP kernel for gfx1013, since there isn't one and there are no optimizations for it in Tensile either. HIP already beats Vulkan on token generation (48 vs 30 tok/s on a 9B model); prefill is still behind but closing. The Vulkan backend uses fp16 FMA dequant, which is hard to match with HIP's int8 dp4a path, but we're building a custom MMQ kernel that restructures the data flow to match what RADV's compiler does. Early results are promising: we've already got +63% prompt processing (pp) on Q6_K over baseline HIP.
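For readers who haven't looked at the two approaches, here is a rough pure-Python sketch of the difference in data flow. It is not the actual kernel code: it uses a single per-block scale instead of the real Q6_K layout, Python floats instead of fp16, and it assumes the int8 path quantizes activations to int8 as well. All function names are made up for the illustration.

```python
def dp4a(a4, b4, acc):
    """Model of a dp4a-style op: dot four int8 pairs, add to an integer accumulator."""
    assert len(a4) == len(b4) == 4
    assert all(-128 <= v <= 127 for v in (*a4, *b4))
    return acc + sum(x * y for x, y in zip(a4, b4))


def dot_int8_path(qw, scale_w, qx, scale_x):
    """Integer path: accumulate in ints four lanes at a time, apply the scales once."""
    acc = 0
    for i in range(0, len(qw), 4):
        acc = dp4a(qw[i:i + 4], qx[i:i + 4], acc)
    return acc * scale_w * scale_x


def dot_dequant_fma_path(qw, scale_w, x):
    """Float path: dequantize each weight, then multiply-accumulate against float activations."""
    acc = 0.0
    for q, xv in zip(qw, x):
        acc += (q * scale_w) * xv
    return acc


# Toy block of 8 quantized weights and 8 quantized activations (made-up values).
qw, scale_w = [3, -2, 7, 1, -5, 4, 0, 6], 0.5
qx, scale_x = [2, 1, -1, 3, 4, -2, 5, 1], 0.25
x = [q * scale_x for q in qx]  # the same activations in float form

int8_result = dot_int8_path(qw, scale_w, qx, scale_x)
fma_result = dot_dequant_fma_path(qw, scale_w, x)
print(int8_result, fma_result)  # -2.75 -2.75
```

Both paths produce the same dot product here. The difference is where the work happens: the integer path does one scale multiply per block, while the float path does a dequant and a multiply-add per element. That is the part of the data flow a custom kernel can rearrange.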

repo: https://github.com/duggasco/bc250-40cu-unlock

discord if you have one of these boards: discord.gg/8eZfFWhczz