Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations
Posted by Venom1806@reddit | LocalLLaMA | 61 comments
Got tired of my RTX 3050 not supporting FP8, so I built a workaround. Packs lower-precision values into FP32 using bitwise operations + Triton kernels.
Results: 3x faster on memory-bound operations (GEMV, FlashAttention)
Works on any GPU - RTX 30/20 series, older cards without native FP8 support. Early stage but functional. Open to feedback.
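The packing idea can be sketched in plain Python (illustrative only; the actual kernels do this with bitwise ops inside Triton): four FP8 bytes fit into one 32-bit word, so the GPU can move them as a single FP32-width value and peel the bytes back out with shifts and masks.

```python
def pack4_fp8(bs):
    """Pack four FP8 bytes into one 32-bit word (little-endian byte order)."""
    w = 0
    for i, b in enumerate(bs):
        w |= (b & 0xFF) << (8 * i)
    return w

def unpack4_fp8(w):
    """Recover the four FP8 bytes from a packed 32-bit word."""
    return [(w >> (8 * i)) & 0xFF for i in range(4)]
```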
lolxdmainkaisemaanlu@reddit
Damn I didn't know RTX 3xxx series didn't support FP8? I'm a noob and thought it was supported - coz I've been using fp8 / fp8 scaled models on my RTX 3060 and they do work..?
az226@reddit
Basically Volta added FP16, Ampere added BF16, Hopper did FP8, and Blackwell FP4.
RoaRene317@reddit
Almost correct
Volta : FP16
Turing (RTX 20 series) : FP16, INT8, INT4
Ampere (RTX 30 series) : FP16, BFLOAT16, TF32 (TensorFloat-32), INT8, INT4
Ada Lovelace (RTX 40 series) : FP16, BFLOAT16, TF32 (TensorFloat-32), FP8, INT8, INT4
Blackwell (RTX 50 series) : FP16, BFLOAT16, TF32 (TensorFloat-32), FP8, INT8, FP6, NVFP4, INT4
Vera Rubin (RTX 60 series): FP16, BFLOAT16, TF32 (TensorFloat-32), INT32, FP8, INT8, FP6, NVFP4, INT4
az226@reddit
r/ConfidentlyIncorrect.
1) I didn’t say “only added” I said added. And I said basically. I was simplifying the key changes. The ones that mattered.
2) Vera is a CPU, Rubin is a GPU.
3) There is no record of an RTX 60 series, it's speculative at this point.
But thanks for trying to correct something that wasn’t incorrect.
john0201@reddit
It saves memory but you’re still using 16 bit cores
phazei@reddit
I'm not sure where the memory saving comes in for existing 3090 fp8 pipelines. In comfy it loads the fp8 model into system memory, then moves that to vram as fp8 afaik, and then upcasts to fp16 when it does the calculation. So if I'm running a model such as Zimage which only takes 8 gigs of space, where does Feather come in and help?
john0201@reddit
It doesn’t need to store a 16 bit result, it is only using 16 bits for the computation.
phazei@reddit
so you're saying that currently it does store it as 16 bits after the computation and Feather will move it back to 8 bits?
john0201@reddit
I think you’re getting too far into the weeds. It’s 8 bit in memory and 8 bit on disk; the internals of the chip run the computation on the same silicon used for 16 bit math, so the math itself is no faster than 16 bit math. It is only during the actual computation that the value is temporarily represented as a 16 bit number.
phazei@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1q0x8ci/software_fp8_for_gpus_without_hardware_support_3x/nxc4nd6/
phazei@reddit
Right, I get that, I guess I'm just confused on what would be "memory-bound" to provide the speedup.
spaceman_@reddit
16 bit ALUs. You can run 8bit, 16bit, 32bit etc on the same core.
There's no such thing as an 8bit core, but there are dedicated hardware components called ALUs that actually do the math bits and they are operation and operand size specific. In some cases these ALUs are actually shared between cores.
This leads to unintuitive situations on some hardware. For example, on older hardware that mostly ran 32bit float graphics work, 16bit workloads sometimes ran at half speed compared to 32bit, despite requiring half the memory bandwidth, because each core had its own 32bit ALUs but the 16bit units were shared per pair.
Same thing existed on the CPU side - AMD Bulldozer cores had their own integer ALUs but shared floating point and SIMD hardware between two cores.
john0201@reddit
Nvidia likes to refer to CUDA ALUs as “cores,” I blame their marketing department.
spaceman_@reddit
AMD got hit with a class action over that kind of marketing.
phazei@reddit
hijacking top comment to clarify:
For anyone confused by "memory-bound" here, it's not about VRAM capacity. It means the GPU cores are waiting on data to arrive from memory. The bottleneck isn't the math, it's feeding the cores fast enough. FP8 is half the bytes of FP16, so it transfers twice as fast from VRAM to the registers where compute actually happens. The clever bit is that Feather does the upcast inside the kernel (in registers, basically free) rather than before it (which would mean a separate VRAM round-trip). That's where the 3x speedup comes from.
I was confused at first since the README made no specific clarification and when I think of a GPU, I basically just think of the VRAM.
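The in-register upcast itself is just bit surgery. A minimal Python sketch of decoding one E4M3 byte (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits; this assumes the E4M3 variant with no infinities, where only exponent 15 with mantissa 7 encodes NaN):

```python
def decode_e4m3(b):
    """Decode one FP8 E4M3 byte to a Python float (bit-level sketch)."""
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                    # only NaN encoding; no infs in E4M3
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```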
CheatCodesOfLife@reddit
Yeah, that threw me off like a year ago when I was trying to run FP8 quants. I think vllm prints a warning about it and it works, but kind of annoying since the 4xxx series got it.
eghost355@reddit
Hi, may I ask whether this can work on the gaming side? Nvidia just released DLSS 4.5, which requires FP8, so RTX 30/20 suffer a performance loss. But I am not sure if DLSS's new Transformer model works on Triton kernels or not. Thanks a lot!
Routine_Day8121@reddit
This is exactly the kind of lifehack the community needs. FP8 is getting hype everywhere, but hardware adoption is slow. If software workarounds like this are stable, it could extend the life of mid tier GPUs for serious training experiments. Curious to see benchmarks on larger models and mixed workloads though, sometimes GEMV gains do not fully translate.
CheatCodesOfLife@reddit
Lol, what model wrote this, Sonnet?
Due-Function-4877@reddit
Us "boomers" with a degree write like that. The models are trained on real writing. Next time, I'll make sure to use all lower case and say "bruh" a few times for you.
CheatCodesOfLife@reddit
And then I'll ask which model again, because that's what GLM does when you tell it to write a low effort shitpost ;) I don't make comments like that just because I see curly quotes and em-dashes. Look at his post carefully:
That's what Sonnet 4.5 says whenever I've had it help me plan an idea about modifying an inference engine. I got it when I vibe-coded the anthropic /messages endpoint into TabbyAPI, and when I got it to help me re-implement the deprecated training code in llama.cpp.
Notice how it says "lifehack"? Because this project is OOD so the model picks a vague positive phrase that wouldn't really fit a comment on a git repo.
It also uses these exact same phrases:
They don't quite fit, e.g. an A100 was never a mid-tier GPU. That "serious training experiments" is what it said when it helped me get unsloth working on an A770 half a year ago.
Finally that classic "Curious to see..." thing it likes to end with after the 4th turn of bouncing ideas off it.
AppearanceHeavy6724@reddit
How about you go and fuck yourself? Who cares what is the model they used if any, if the point communicated well?
Guinness@reddit
You’re literally arguing for bots taking over Reddit. Please go back to Xitter if you like bots.
AppearanceHeavy6724@reddit
How about you go and fuck yourself too, and then move to r/antiai? There is a big fucking difference between writing empty vague shit with ChatGPT and using LLMs to formalize your thoughts, stupid asshole.
CheatCodesOfLife@reddit
Did you read it? Was it communicated well?
AppearanceHeavy6724@reddit
Yes. Yes.
bigfatstinkypoo@reddit
this writing does not stink that bad, it's just corpo positivity speak
colin_colout@reddit
You're absolutely right to question my identity!
TheThoccnessMonster@reddit
Yup - and there are plenty of model layers that are heavily convolutional and that, even when offloaded to DLA/FP8, just upcast to FP16 anyway. QAT and dedicated hardware for convolutions and unsupported activation functions stand to get us a lot more bang for our buck.
Karyo_Ten@reddit
That has been supported on the 4000 series for a couple of years now, and it's supported on the latest AMD and Intel GPUs AFAIK.
Inevitable_Host_1446@reddit
I guess you could see that two ways - hardware adoption as in the hardware is slow to come out, or as in people are slow to get the latest. The latter has certainly been true with what a shitshow GPU prices have remained since the days of crypto boom at least. And now RAM is ridiculous as well and Nvidia are talking about cloud gaming...
gittubaba@reddit
Wow, just a few days ago I was arguing about this with chatgpt, it said this isn't possible :P. Can this be plugged into comfyui?
On my RTX 2060 Super, fp8 gets cast to fp16 and bf16 gets cast to fp32 when running inference.
a_beautiful_rhind@reddit
I think it's better to use the triton patch in comfy. https://github.com/woct0rdho/triton-windows/commit/440e3c42a640a4188dd356225e1b13a56b45a377
Also found it's possible to load BF16/FP16 as E4M3 and then save the vram without an extra file. Somehow my quality went up.
Unfortunately there is some bug in pytorch 2.9 where FP8_scaled gets passed directly into the triton compiler as FP8 and then cast to i8 by llvm. Torch 2.7 works flawless or you can just de-scale the weights.
You sorta want the calcs in FP16 and you wanna avoid BF16->FP32 conversion if speed is the goal. Int8 calcs can be tried by using sage attention. Not always better.
woct0rdho@reddit
My patch only enables fp8 to fp16 cast in Triton, not fp8 matmul. OP's kernels enable fp8 matmul and that's what we need for the next step.
a_beautiful_rhind@reddit
I did see but no movement. Hopefully at least they fix scaled/mixed FP8 as that seems to crash on compile for me with newer pytorch.
Also just found https://github.com/silveroxides/ComfyUI-QuantOps so giving int8 a go to see if it's any better/faster. Didn't know it was a thing.
Call me paranoid, but supporting FP8 on pre-Ada is something I feel has been silently slow-walked in major projects even when people like yourself and OP put in the work.
Venom1806@reddit (OP)
Not sure about ComfyUI, but I'm working on implementing a functional API for torch.
a_beautiful_rhind@reddit
Comfy does torch, and FP8/FP8_scaled is used there much more than for LLMs. IME, on Turing FP32 is going to be a slow ride vs FP16.
For my uses, compiling FP8 image gen weights was a huge speedup. I wonder if somehow your library can hijack FP8 ops to work seamlessly. Right now i'm having to compile triton from source and I doubt quantization/dequantization is accelerated.
Alarmed_Wind_4035@reddit
I used fp8 in comfy and saw no speedup, mind sharing how?
a_beautiful_rhind@reddit
The speedups can really only come from a few places.
You have HW accelerated FP8 support and don't accidentally cast to BF16/FP16 for the multiplication.
You are able to now compile the model and gain speed from that.
Smaller size of weights, because there's really no int8 support besides GGUF.
You didn't say what you're trying to do.
Alarmed_Wind_4035@reddit
generating images or video using comfyui, I have 5060ti so I should be able to run fp8 but when I use the startup argument for fp8 I see no difference in speed.
a_beautiful_rhind@reddit
What did it say in the console when models load?
If it's like: model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
Then you have your answer. You may also have to pass --fast fp8_matrix_mult
getmevodka@reddit
LLMs always say something isn't real/possible or doable if it is not part of their training data. Especially the newer LLMs are trained to only do things as efficiently and completely as possible, which makes them severely dumber in hypothetical cases than the older LLMs, because they always do only the least amount of work necessary to keep things simple enough and not make mistakes, as that is a heavy negative reward in their system. Imho it's too aggressive, and the older LLMs like deepseek3.1 or qwen2.5 72b are better suited for hypothetical, exploratory work or fantasizing about potential ideas, while the newest generation of LLMs will do exceptional work within the scope of their trained abilities.
gittubaba@reddit
What are you even saying bro?
getmevodka@reddit
Older big LLMs are better at creative talk because they're not trained to do the least amount of work possible to avoid mistakes, while newer big LLMs are better at problem solving but not at accepting ideas outside their training data, because their algo punishes them too hard for making mistakes while being trained.
About that
batonac@reddit
Could this be useful for increasing LLM performance on the Tesla P40?
johndeuff@reddit
Interested. Got p40 too.
bbjurn@reddit
What'd it take to get this to work with vLLM or other inference software?
elsung@reddit
Yeaaaa! I was just trying to get vLLM to load nemotron3-nano on my 2x 3090s but couldn’t get it working because FP8 isn’t supported (and there’s no AWQ quant). Gotta be honest tho, not sure how I would implement this in vLLM to get things working. Might need to vibe code this to see about implementing the solution lol
rainbyte@reddit
There is a GPTQ quant, do you know if it's good?
elsung@reddit
i actually tried it and it wouldn’t work. i’m literally trying to make my own awq quant myself right now, no idea if it will work. vibe coding so far to get this feather thing working with vllm seems to be a tall task cuz claude / gpt is telling me no way jose lol
Venom1806@reddit (OP)
Idk, anything that uses torch.Tensor or is convertible to this format should work. Probably huggingface will work ig.
Flashy_Squirrel4745@reddit
Wasn't it already implemented with marlin ( https://github.com/vllm-project/vllm/blob/adcf682fc7d1835d037da331922751e880c8bc25/csrc/quantization/gptq_marlin/generate_kernels.py )? It is likely faster with its optimized cuda c code.
KingKoro@reddit
Would this also benefit RDNA3 ?
tw_numba_one@reddit
I believe so. If your environment has PyTorch support, it should work.
ethertype@reddit
Is this conceptually the same trick pytorch uses to handle MXFP4 on Ampere-class hardware? Which does not support MXFP4 natively.
heretic will do its magic on the original gpt-oss-20b safetensor in MXFP4 format. (The end result is 3x the original size, though.) I have been told heretic doesn't do anything in the code for this to occur, so I assume pytorch owns all the glory.
I can also load the native MXFP4 ggufs of gpt-oss-120b (converted by GG) perfectly fine on my 3090s with llama.cpp: 120 t/s at empty context. Can't say if this is due to pytorch or if llama.cpp special-cases this on its own.
tynej@reddit
Very nice work. Could we use a similar trick on the Hopper architecture to get FP4 speed?
Venom1806@reddit (OP)
We could just pack 8 fp4 values instead of 4 fp8, we don't need a Hopper.
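It's the same packing trick, just with nibbles: eight 4-bit values per 32-bit word instead of four bytes. An illustrative Python sketch (not the project's actual kernels):

```python
def pack8_fp4(nibbles):
    """Pack eight 4-bit values into one 32-bit word (little-endian nibble order)."""
    w = 0
    for i, n in enumerate(nibbles):
        w |= (n & 0xF) << (4 * i)
    return w

def unpack8_fp4(w):
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(w >> (4 * i)) & 0xF for i in range(8)]
```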
FastDecode1@reddit
Pick one.
Venom1806@reddit (OP)
Sorry. Should work on RTX 20/30; there's no advantage to using it with the 40 series.
az226@reddit
Does it work for V100? Training too or just inference?
ab2377@reddit
wow 😳 👍