Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations
Posted by Venom1806@reddit | LocalLLaMA | 61 comments
Got tired of my RTX 3050 not supporting FP8, so I built a workaround. Packs lower-precision values into FP32 using bitwise operations + Triton kernels.
Results: 3x faster on memory-bound operations (GEMV, FlashAttention)
Works on any GPU - RTX 30/20 series, older cards without native FP8 support. Early stage but functional. Open to feedback.
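The packing idea can be sketched in plain Python (illustrative only; the actual kernels do this with bitwise ops inside Triton): four FP8 bytes fit into one 32-bit word, so the GPU can move them as a single FP32-width value and peel the bytes back out with shifts and masks.

```python
def pack4_fp8(bs):
    """Pack four FP8 bytes into one 32-bit word (little-endian byte order)."""
    w = 0
    for i, b in enumerate(bs):
        w |= (b & 0xFF) << (8 * i)
    return w

def unpack4_fp8(w):
    """Recover the four FP8 bytes from a packed 32-bit word."""
    return [(w >> (8 * i)) & 0xFF for i in range(4)]
```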
lolxdmainkaisemaanlu@reddit
Damn I didn't know RTX 3xxx series didn't support FP8? I'm a noob and thought it was supported - coz I've been using fp8 / fp8 scaled models on my RTX 3060 and they do work..?
az226@reddit
Basically Volta added FP16, Ampere added BF16, Hopper did FP8, and Blackwell FP4.
RoaRene317@reddit
Almost correct
Volta : FP16
Turing (RTX 20 series) : FP16, INT8, INT4
Ampere (RTX 30 series) : FP16, BFLOAT16, TF32 (TensorFloat-32), INT8, INT4
Ada Lovelace (RTX 40 series) : FP16, BFLOAT16, TF32 (TensorFloat-32), FP8, INT8, INT4
Blackwell (RTX 50 series) : FP16, BFLOAT16, TF32 (TensorFloat-32), FP8, INT8, FP6, NVFP4, INT4
Vera Rubin (RTX 60 series): FP16, BFLOAT16, TF32 (TensorFloat-32), INT32, FP8, INT8, FP6, NVFP4, INT4
az226@reddit
r/ConfidentlyIncorrect.
1) I didn’t say “only added” I said added. And I said basically. I was simplifying the key changes. The ones that mattered.
2) Vera is a CPU, Rubin is a GPU.
3) There is no record of an RTX 60 series, it's speculative at this point.
But thanks for trying to correct something that wasn’t incorrect.
john0201@reddit
It saves memory but you’re still using 16 bit cores
phazei@reddit
I'm not sure where the memory saving comes in for existing 3090 fp8 pipelines. In comfy it loads the fp8 model into system memory, then moves that to vram as fp8 afaik, and then upcasts to fp16 when it does the calculation. So if I'm running a model such as Zimage which only takes 8 gigs of space, where does Feather come in and help?
john0201@reddit
It doesn’t need to store a 16 bit result, it is only using 16 bits for the computation.
phazei@reddit
so you're saying that currently it does store it as 16 bits after the computation and Feather will move it back to 8 bits?
john0201@reddit
I think you’re getting too far into the weeds. It’s 8 bit in memory and 8 bit on disk; the internals of the chip run the computation on the same silicon used for 16 bit math, so the math itself is no faster than 16 bit math. It is only during the actual computation that the value is temporarily represented as a 16 bit number.
phazei@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1q0x8ci/software_fp8_for_gpus_without_hardware_support_3x/nxc4nd6/
phazei@reddit
Right, I get that, I guess I'm just confused on what would be "memory-bound" to provide the speedup.
spaceman_@reddit
16 bit ALUs. You can run 8bit, 16bit, 32bit etc on the same core.
There's no such thing as an 8bit core, but there are dedicated hardware components called ALUs that actually do the math bits and they are operation and operand size specific. In some cases these ALUs are actually shared between cores.
This leads to unintuitive situations on some hardware. For example, on older hardware that mostly ran 32bit float graphics work, 16bit workloads sometimes ran at half speed compared to 32bit, despite requiring half the memory bandwidth, because each core had its own 32bit ALUs but the 16bit units were shared per pair.
Same thing existed on the CPU side - AMD Bulldozer cores had their own integer ALUs but shared floating point and SIMD hardware between two cores.
john0201@reddit
Nvidia likes to refer to CUDA ALUs as “cores,” I blame their marketing department.
spaceman_@reddit
AMD got hit with a class action over that kind of marketing.
phazei@reddit
hijacking top comment to clarify:
For anyone confused by "memory-bound" here, it's not about VRAM capacity. It means the GPU cores are waiting on data to arrive from memory. The bottleneck isn't the math, it's feeding the cores fast enough. FP8 is half the bytes of FP16, so it transfers twice as fast from VRAM to the registers where compute actually happens. The clever bit is that Feather does the upcast inside the kernel (in registers, basically free) rather than before it (which would mean a separate VRAM round-trip). That's where the 3x speedup comes from.
I was confused at first since the README made no specific clarification and when I think of a GPU, I basically just think of the VRAM.
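The in-register upcast itself is just bit surgery. A minimal Python sketch of decoding one E4M3 byte (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits; this assumes the E4M3 variant with no infinities, where only exponent 15 with mantissa 7 encodes NaN):

```python
def decode_e4m3(b):
    """Decode one FP8 E4M3 byte to a Python float (bit-level sketch)."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                    # only NaN encoding; no infs in E4M3
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```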
CheatCodesOfLife@reddit
Yeah, that threw me off like a year ago when I was trying to run FP8 quants. I think vllm prints a warning about it and it works, but kind of annoying since the 4xxx series got it.
eghost355@reddit
Hi, may I ask whether this can work on the gaming side? Nvidia just released DLSS 4.5, which requires FP8, so RTX 30/20 suffer a performance loss. But I am not sure if DLSS's new Transformer model works on Triton kernels or not. Thanks a lot!
Routine_Day8121@reddit
This is exactly the kind of lifehack the community needs. FP8 is getting hype everywhere, but hardware adoption is slow. If software workarounds like this are stable, it could extend the life of mid tier GPUs for serious training experiments. Curious to see benchmarks on larger models and mixed workloads though, sometimes GEMV gains do not fully translate.
CheatCodesOfLife@reddit
Lol, what model wrote this, Sonnet?
Due-Function-4877@reddit
Us "boomers" with a degree write like that. The models are trained on real writing. Next time, I'll make sure to use all lower case and say "bruh" a few times for you.
CheatCodesOfLife@reddit
And then I'll ask which model again, because that's what GLM does when you tell it to write a low effort shitpost ;) I don't make comments like that just because I see curly quotes and em-dashes. Look at his post carefully:
That's what Sonnet 4.5 says whenever I've had it help me plan an idea about modifying an inference engine. I got it when I vibe-coded the anthropic /messages endpoint into TabbyAPI, and when I got it to help me re-implement the deprecated training code in llama.cpp.
Notice how it says "lifehack"? Because this project is OOD so the model picks a vague positive phrase that wouldn't really fit a comment on a git repo.
It also uses these exact same phrases:
They don't quite fit, e.g. an A100 was never a mid-tier GPU. That "serious training experiments" is what it said when it helped me get unsloth working on an A770 half a year ago.
Finally that classic "Curious to see..." thing it likes to end with after the 4th turn of bouncing ideas off it.
AppearanceHeavy6724@reddit
How about you go and fuck yourself? Who cares what is the model they used if any, if the point communicated well?
Guinness@reddit
You’re literally arguing for bots taking over Reddit. Please go back to Xitter if you like bots.
AppearanceHeavy6724@reddit
How about you go and fuck yourself too, and then move to r/antiai? There is a big fucking difference between writing empty vague shit with ChatGPT and using LLMs to formalize your thoughts, stupid asshole.
CheatCodesOfLife@reddit
Did you read it? Was it communicated well?
AppearanceHeavy6724@reddit
Yes. Yes.
bigfatstinkypoo@reddit
this writing does not stink that bad, it's just corpo positivity speak
colin_colout@reddit
You're absolutely right to question my identity!
TheThoccnessMonster@reddit
Yup - and there are plenty of model layers that are heavily convolutional and that, even when offloaded to DLA/FP8, just upcast to FP16 anyway. QAT and dedicated hardware for convolutions and unsupported activation functions stand to get us a lot more bang for our buck.
Karyo_Ten@reddit
That has been supported on the 4000 series for a couple of years now, and it's supported on the latest AMD and Intel GPUs AFAIK.
Inevitable_Host_1446@reddit
I guess you could see that two ways - hardware adoption as in the hardware is slow to come out, or as in people are slow to get the latest. The latter has certainly been true with what a shitshow GPU prices have remained since the days of crypto boom at least. And now RAM is ridiculous as well and Nvidia are talking about cloud gaming...
gittubaba@reddit
Wow, just a few days ago I was arguing about this with chatgpt, it said this isn't possible :P. Can this be plugged into comfyui?
On my RTX 2060 Super, fp8 gets cast to fp16 and bf16 gets cast to fp32 when running inference.
a_beautiful_rhind@reddit
I think it's better to use the triton patch in comfy. https://github.com/woct0rdho/triton-windows/commit/440e3c42a640a4188dd356225e1b13a56b45a377
Also found it's possible to load BF16/FP16 as E4M3 and then save the vram without an extra file. Somehow my quality went up.
Unfortunately there is some bug in pytorch 2.9 where FP8_scaled gets passed directly into the triton compiler as FP8 and then cast to i8 by llvm. Torch 2.7 works flawless or you can just de-scale the weights.
You sorta want the calcs in FP16 and you wanna avoid BF16->FP32 conversion if speed is the goal. Int8 calcs can be tried by using sage attention. Not always better.
woct0rdho@reddit
My patch only enables fp8 to fp16 cast in Triton, not fp8 matmul. OP's kernels enable fp8 matmul and that's what we need for the next step.
a_beautiful_rhind@reddit
I did see but no movement. Hopefully at least they fix scaled/mixed FP8 as that seems to crash on compile for me with newer pytorch.
Also just found https://github.com/silveroxides/ComfyUI-QuantOps so giving int8 a go to see if it's any better/faster. Didn't know it was a thing.
Call me paranoid, but supporting FP8 on pre-Ada is something I feel has been silently slow-walked in major projects even when people like yourself and OP put in the work.
Venom1806@reddit (OP)
Not sure about ComfyUI, but I'm working on implementing a functional API for torch.
a_beautiful_rhind@reddit
Comfy does torch, and FP8/FP8_scaled is used there much more than for LLMs. IME, on Turing FP32 is going to be a slow ride vs FP16.
For my uses, compiling FP8 image gen weights was a huge speedup. I wonder if somehow your library can hijack FP8 ops to work seamlessly. Right now i'm having to compile triton from source and I doubt quantization/dequantization is accelerated.
Alarmed_Wind_4035@reddit
I used fp8 in comfy and saw no speedup, mind sharing how?
a_beautiful_rhind@reddit
The speedups can really only come from a few places.
You have HW accelerated FP8 support and don't accidentally cast to BF16/FP16 for the multiplication.
You are able to now compile the model and gain speed from that.
Smaller size of weights, because there's really no int8 support besides GGUF.
You didn't say what you're trying to do.
Alarmed_Wind_4035@reddit
generating images or video using comfyui, I have 5060ti so I should be able to run fp8 but when I use the startup argument for fp8 I see no difference in speed.
a_beautiful_rhind@reddit
What did it say in the console when models load?
If it's like: model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
Then you have your answer. You may also have to pass --fast fp8_matrix_mult
getmevodka@reddit
LLMs always say something isn't real/possible or doable if it is not part of their training data. Especially the newer LLMs are trained to only do things as efficiently and completely as possible, which makes them severely dumber in hypothetical cases than the older LLMs, because they always do only the least amount of work necessary to keep things simple enough and not make mistakes, as that is a heavy negative reward in their system. Imho it's too aggressive, and the older LLMs like deepseek3.1 or qwen2.5 72b are better suited for hypothetical, exploratory work or fantasizing about potential ideas, while the newest generation of LLMs will do exceptional work within the scope of their trained abilities.
gittubaba@reddit
What are you even saying bro?
getmevodka@reddit
Older big LLMs are better at creative talk because they're not trained to do the least amount of work possible to avoid mistakes, while newer big LLMs are better at problem solving but not at accepting ideas outside their training data, because their algo punishes them too hard for making mistakes while being trained.
About that
batonac@reddit
Could this be useful for increasing LLM performance on the Tesla P40?
johndeuff@reddit
Interested. Got p40 too.
bbjurn@reddit
What'd it take to get this to work with vLLM or other inference software?
elsung@reddit
Yeaaaa! I was just trying to get vLLM to load nemotron3-nano on my 2x 3090s but couldn’t get it working because FP8 isn’t supported (and there’s no AWQ quant). Gotta be honest tho, not sure how I would implement this in vLLM to get things working. Might need to vibe code this to see about implementing the solution lol
rainbyte@reddit
There is a GPTQ quant, do you know if it's good?
elsung@reddit
i actually tried it and it wouldn’t work. i’m literally trying to make my own awq quant myself right now, no idea if it will work. vibe coding so far to get this feather thing working with vllm seems to be a tall task cuz claude / gpt is telling me no way jose lol
Venom1806@reddit (OP)
Idk, anything that uses torch.Tensor or is convertible to this format should work. Probably huggingface will work ig.
Flashy_Squirrel4745@reddit
Wasn't it already implemented with marlin ( https://github.com/vllm-project/vllm/blob/adcf682fc7d1835d037da331922751e880c8bc25/csrc/quantization/gptq_marlin/generate_kernels.py )? It is likely faster with its optimized cuda c code.
KingKoro@reddit
Would this also benefit RDNA3 ?
tw_numba_one@reddit
I believe so. If your environment has PyTorch support, it should work.
ethertype@reddit
Is this conceptually the same trick pytorch uses to handle MXFP4 on Ampere-class hardware? Which does not support MXFP4 natively.
heretic will do its magic on the original gpt-oss-20b safetensor in MXFP4 format. (The end result is 3x the original size, though.) I have been told heretic doesn't do anything in the code for this to occur, so I assume pytorch owns all the glory.
I can also load the native MXFP4 ggufs of gpt-oss-120b (converted by GG) perfectly fine on my 3090s with llama.cpp: 120 t/s at empty context. Can't say if this is due to pytorch or if llama.cpp special-cases this on its own.
tynej@reddit
Very nice work. Could we use a similar trick on the Hopper architecture to get FP4 speed?
Venom1806@reddit (OP)
We could just pack 8 fp4 values instead of 4 fp8, we don't need a Hopper.
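It's the same packing trick, just with nibbles: eight 4-bit values per 32-bit word instead of four bytes. An illustrative Python sketch (not the project's actual kernels):

```python
def pack8_fp4(nibbles):
    """Pack eight 4-bit values into one 32-bit word (little-endian nibble order)."""
    w = 0
    for i, n in enumerate(nibbles):
        w |= (n & 0xF) << (4 * i)
    return w

def unpack8_fp4(w):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(w >> (4 * i)) & 0xF for i in range(8)]
```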
FastDecode1@reddit
Pick one.
Venom1806@reddit (OP)
Sorry. Should work on RTX 20/30; there's no advantage to using it with the 40 series.
az226@reddit
Does it work for V100? Training too or just inference?
ab2377@reddit
wow 😳 👍