araujoms@reddit
Having fixed tons of bugs in implementations of custom floating point formats, I have deep empathy for her. And a firm belief that the authors of IEEE 754 are psychos cackling madly about all the suffering they have unleashed on mankind.
adzm@reddit
I tried to make a 128 bit float for fractal calculations back in the 486 days. I thought it would be easy, but was very difficult. I had to reach out to newsgroups through a teacher at my elementary school and actually got responses helping me and pointing me to get Knuth from the library.
sards3@reddit
Are you saying you were in elementary school at the time you worked on this, or am I misreading it? Were you some kind of child prodigy?
df1dcdb83cd14e6a9f7f@reddit
bruh I was about to write this comment, wtf. I feel like maybe it was a teacher from elementary school but OP was older
adzm@reddit
well I was lucky enough to have access to an older computer at home (pentiums had been out for a few years at that point iirc) and a copy of turbo c++ that I got someone else's dad to copy onto floppies for me lol. Thankfully a lot of books on computer graphics had the source code in there so I just copied it to have a starting point, and then could just play around and tweak things from there. It would have been really tricky to start from scratch with how confusing graphics programming was in those days. But this way I already had a canvas so to speak and could just write the code that calculates the color values etc
df1dcdb83cd14e6a9f7f@reddit
that actually makes reasonable sense, nice good for you man. wish i started that early
adzm@reddit
I was in fifth grade at the time yeah. But really I was just following along with some books I had about computer graphics and fractals and etc. I wrote a simple fractal generator in borland c++ and started hitting accuracy issues as I zoomed deeper. I had already made a custom type for complex numbers which was fun.
inio@reddit
In general for most fractals (e.g. Mandelbrot), floating point doesn't help much if you want to do really deep zooms. The interesting bits aren't near the origin, so the exponent doesn't really buy you anything. What you want is fixed-point int64s, but then you need to write your own multiply routines since few environments natively support 64*64->128.
adzm@reddit
This would have been helpful in 1995! I had already made my own type for the complex math, so this seemed to be the next reasonable step.
That said it was horribly slow either way. I remember it taking nearly an hour to generate an image.
East-Barnacle-7473@reddit
How does the kernel fscale fit in?
quetzalcoatl-pl@reddit
> A great thing about _bfloat16_ is that it has no spec, so we can implement it however we want!
> A horrible thing with _bfloat16_ is that it has no spec, so we can implement it however we want!
awesome xD
max123246@reddit
Let me just tell you, testing GPU kernels that work in custom floating point formats as small as 4 bits... I don't understand how anything works, because our testing strategy is just to bump up the error tolerance whenever a test trips randomly. It's very hard to determine what is actually a GPU kernel bug versus an error tolerance that just needs to be bumped up.
And there is some research into testing strategies, but new GPU kernel implementations that break the assumptions needed for testing pop up faster than the research can keep up.
Don't ask me how I know all this...
mccoyn@reddit
bf16 might be faster than f16 on a CPU simply because it has the same number of exponent bits as f32. It’s trivial to convert to f32, use the fast hardware implementation, then convert back to bf16.
max123246@reddit
Ohhh that's my bad. I thought significand bits were the exponent bits, not the mantissa bits. So many names for this stuff...
quetzalcoatl-pl@reddit
super interesting, thanks! I've never heard of 4-bit floats.. that sounds insane, or at least useless. But it's not my domain of expertise, so, uh, I guess these are useful in some way if you saw them there :)
re: error tolerance vs bugs - this one I know from my own experience.. but not GPU kernels, just some scientific simulation code..
max123246@reddit
Specifically in AI, for inference. Typically you train the model at higher f16/f32 precision, then for inference you quantize all the way down to e2m1. Somehow, inference quality doesn't degrade much despite such a small per-weight floating point width.
Nvidia has something called block scaling with their nvfp4 type. The idea is you have a group of f4 types that have a shared scale factor. So you get the size advantages of f4 but retain greater dynamic range using that shared scale factor
It's weird. I also dislike the marketing terminology. nvfp4 is not a floating point type: it's e2m1 plus a block-scale factor (typically shared by 32 f4 elements) that can be e8m0 or e4m3. (E for exponent, M for mantissa. There are even more niche specifics that I don't know off the top of my head, but they're elaborated in Nvidia's tech blogs on nvfp4 and in the PTX documentation.)
Dineshvk18@reddit
“Floating point from scratch is wild, respect for anyone attempting this”
Fajan_@reddit
This is one of those projects where each time you find yourself writing float x = 0.1f; you think long and hard about it 😭
Super awesome seeing how someone managed to make the journey from theory to C to hardware and was like “yeah, but I still don’t understand the ins-and-outs of floating point numbers”
Also, dropping the IEEE implementation just to squeeze more performance into a limited area is such a true engineering dilemma, not something you come across in your everyday “built-from-scratch” project.
It’d be really interesting to see how this holds up for bigger applications like machine learning pipelines and whether those simplifications might bite you in the butt later on.
MarekKnapek@reddit
Shameless plug, I have an interactive website about floating point numbers, it is located at: https://marekknapek.github.io/float-analyzer/binary32/
GergelyKiss@reddit
This is an amazing article, love the style. Have to admit you lost me at the hardware part, but still feel like I've learnt a lot.
Truly a humbling experience, thank you!
Ravek@reddit
I didn’t read the whole thing but I just wanted to call out that round to even is needed to make algorithms like TwoSum work. So I do think it’s a good default.
DiscipleofDeceit666@reddit
Can you know for certain that there aren’t other algorithms waiting to be discovered under this new rounding system? ✨Imagine✨ if this was what we needed to calculate primes in O(1).