Beating cuBLAS in SGEMM from Scratch

Posted by salykova@reddit | LocalLLaMA

A while ago, I shared my article here about optimizing matrix multiplication on CPUs, achieving performance that outpaced NumPy - Beating NumPy's matrix multiplication in 150 lines of C code

I received positive feedback from you, and today I'm excited to share my second blog post. This one focuses on an SGEMM implementation that outperforms cuBLAS, with its (modified?) CUTLASS kernel, across a wide range of matrix sizes. The post explains how to benchmark code on CUDA devices and walks through the algorithm's design and the optimization techniques used: inlined PTX, asynchronous memory copies, double-buffering, avoiding shared memory bank conflicts, and efficient coalesced storage via shared memory. The code is super easy to tweak, so you can customize it for your own projects with kernel fusion or just drop it into your libraries as-is.

If you have any questions, feel free to comment or send me a direct message - I'd love to hear your feedback! Below, I've included performance comparisons against cuBLAS and Simon Boehm's highly cited work, which is now integrated into llamafile as tinyBLAS.

Blog post: https://salykova.github.io/sgemm-gpu
Code: https://github.com/salykova/sgemm.cu
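
To give a flavor of two of the techniques mentioned above (asynchronous global-to-shared copies issued via inlined PTX, plus double-buffering), here's a minimal toy sketch. It is not the kernel from the repo - just a simplified illustration that assumes an sm_80+ GPU, a 32x32 tile, one output element per thread, and N divisible by the tile size:

```
// Toy double-buffered tiled SGEMM sketch (NOT the kernel from the repo).
// Shows: cp.async issued via inlined PTX, and double-buffering so the next
// tile streams into shared memory while the current tile is being consumed.
// Assumptions: sm_80+ GPU, N a multiple of TILE. Compile: nvcc -arch=sm_80 sketch.cu
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 32;

// Issue a 4-byte async copy from global memory into shared memory (inlined PTX).
__device__ __forceinline__ void cp_async4(void* smem_dst, const void* gmem_src) {
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" :: "r"(dst), "l"(gmem_src));
}

// Commit all previously issued cp.async operations as one group.
__device__ __forceinline__ void cp_async_commit() {
    asm volatile("cp.async.commit_group;\n" ::);
}

// Wait until at most one group is still in flight (all older groups are done).
__device__ __forceinline__ void cp_async_wait_one() {
    asm volatile("cp.async.wait_group 1;\n" ::);
}

__global__ void sgemm_toy_double_buffered(const float* A, const float* B, float* C, int N) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;
    const int numTiles = N / TILE;

    float acc = 0.0f;

    // Prefetch tile 0 into buffer 0 and commit it as its own async group.
    cp_async4(&As[0][ty][tx], &A[row * N + tx]);
    cp_async4(&Bs[0][ty][tx], &B[ty * N + col]);
    cp_async_commit();

    for (int t = 0; t < numTiles; ++t) {
        const int cur = t & 1;
        const int nxt = cur ^ 1;

        // Start loading tile t+1 into the other buffer before consuming tile t.
        if (t + 1 < numTiles) {
            const int k = (t + 1) * TILE;
            cp_async4(&As[nxt][ty][tx], &A[row * N + k + tx]);
            cp_async4(&Bs[nxt][ty][tx], &B[(k + ty) * N + col]);
        }
        cp_async_commit();  // empty group on the last iteration, which is allowed

        // Keep at most the newest group in flight: guarantees tile t has landed
        // in shared memory while tile t+1 may still be streaming in.
        cp_async_wait_one();
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[cur][ty][k] * Bs[cur][k][tx];

        // All threads must finish reading this buffer before the next iteration
        // starts overwriting it with tile t+2.
        __syncthreads();
    }

    C[row * N + col] = acc;
}

int main() {
    const int N = 256;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * N * sizeof(float));
    cudaMallocManaged(&B, N * N * sizeof(float));
    cudaMallocManaged(&C, N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);
    sgemm_toy_double_buffered<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    // Every entry should equal 2 * N.
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

The actual kernel described in the post is of course far more involved (it also avoids shared memory bank conflicts and uses efficient coalesced storage via shared memory, among other things); the sketch only shows how the cp.async groups and the two shared-memory buffers interleave, i.e. while tile t is being multiplied, the copy for tile t+1 is already in flight.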