Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations

Posted by Venom1806@reddit | LocalLLaMA

Got tired of my RTX 3050 not supporting FP8, so I built a workaround. It packs lower-precision values into FP32 words using bitwise operations and Triton kernels.
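
For anyone wondering what the packing idea looks like, here is a minimal pure-Python sketch. It is not the code from the repo. It assumes the E5M2 flavor of FP8 (which happens to be the top byte of an FP16 value), truncation instead of proper rounding, and a little-endian lane order inside the 32-bit word. The function names are made up for illustration.

```python
import struct

def fp8_e5m2_encode(x: float) -> int:
    """Truncate x to FP8 E5M2 by keeping the top byte of its FP16 encoding.

    Assumption: E5M2 layout (1 sign, 5 exponent, 2 mantissa bits).
    Values outside the FP16 range make struct.pack raise OverflowError.
    """
    (half_bits,) = struct.unpack("<H", struct.pack("<e", x))
    return half_bits >> 8  # drop the low 8 mantissa bits (round toward zero)

def fp8_e5m2_decode(byte: int) -> float:
    """Expand one FP8 E5M2 byte back to a Python float via FP16."""
    (value,) = struct.unpack("<e", struct.pack("<H", (byte & 0xFF) << 8))
    return value

def pack4(values):
    """Pack four floats into one 32-bit integer, lane 0 in the lowest byte."""
    assert len(values) == 4
    word = 0
    for lane, v in enumerate(values):
        word |= fp8_e5m2_encode(v) << (8 * lane)
    return word

def unpack4(word: int):
    """Recover the four FP8 lanes from a 32-bit word with shifts and masks."""
    return [fp8_e5m2_decode((word >> (8 * lane)) & 0xFF) for lane in range(4)]
```

With this layout, `hex(pack4([1.0, -2.0, 0.5, 1.75]))` gives `'0x3f38c03c'`, and `unpack4` on that word returns the same four values because each fits in two mantissa bits. A value like 1.1 comes back as 1.0, which is the precision you give up. On the GPU the same shifts and masks would run inside a kernel after loading the packed word.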

Results: 3x faster on memory-bound operations (GEMV, FlashAttention)
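
Rough intuition for why memory-bound ops benefit: a GEMV reads every weight once, so runtime is dominated by how many bytes cross the memory bus. The sketch below is back-of-envelope arithmetic, not a benchmark. The matrix size is a hypothetical example, and the ratio is an upper bound from traffic reduction alone, since unpacking costs some compute.

```python
def weight_bytes(rows: int, cols: int, bits_per_value: int) -> int:
    """Bytes read from memory to stream one rows x cols weight matrix."""
    return rows * cols * bits_per_value // 8

def ideal_speedup(baseline_bits: int, packed_bits: int = 8) -> float:
    """Best-case speedup if runtime scaled purely with bytes moved."""
    return baseline_bits / packed_bits

# Hypothetical 4096 x 4096 layer, chosen only for illustration.
fp32_traffic = weight_bytes(4096, 4096, 32)
fp8_traffic = weight_bytes(4096, 4096, 8)
```

Here `fp32_traffic` is 67,108,864 bytes and `fp8_traffic` is 16,777,216 bytes, so `ideal_speedup(32)` is 4.0 and `ideal_speedup(16)` is 2.0.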

Works on any GPU, including RTX 30/20 series and older cards without native FP8 support. It's early stage but functional, and I'm open to feedback.

Article Link | GitHub Link