Are floats faster than shorts (data type) in modern times?
Posted by rufian_balas@reddit | learnprogramming | 38 comments
I've been wanting to use more precise data types for my programs, and recently I asked the IA some questions about data driven design. In short, one of the things it told me was that modern CPUs are more capable when handling floats than shorts, and even if it sounds counterintuitive, it's almost always faster to use floats.
This doesn't make sense to me, since I've been learning that to make the best use of the CPU's memory cache it's best to use the lightest data types you can.
I was using C# when this came up, but I understand this is a language-agnostic thing.
mredding@reddit
I've worked in both game dev and trading systems.
I think you're getting way ahead of yourself. You would have to write the code, see what machine code the compiler generates (for C#, you'll have to configure AOT), then look at the instructions, look them up in the architecture documentation, and basically count the clock cycles for each instruction.
And it really doesn't matter, most of the time, because this isn't where you're slow. It's everything else that's going to add up. You're typically IO bound or memory bound. Even a mid-range CPU of 20 years ago was several orders of magnitude faster than the memory bus.
Learn how to use a profiler, because you can't really learn anything from raw sample rates; then improve where you're slow on the critical path.
It's more important that you properly represent your data, and there's a certain art to that. If you're working with integers, then use integers. Floats are scientific numbers that accumulate error. Use smaller types in memory; use word sizes for parameters, loops, and functions.
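That storage-versus-computation split can be sketched in C++ like this (an illustrative example only; the `Sample` struct and `sum` function are made up for the sketch):

```cpp
#include <cstdint>
#include <vector>

// Storage: smallest type with at least 16 bits, so arrays pack tightly
// in cache and across the memory bus.
struct Sample {
    std::int_least16_t value;
};

// Computation: int_fast32_t lets the compiler pick the fastest type with
// at least 32 bits -- wide enough that summing many 16-bit values won't overflow.
std::int_fast32_t sum(const std::vector<Sample>& xs) {
    std::int_fast32_t total = 0;
    for (const Sample& s : xs) total += s.value;
    return total;
}
```

On typical desktop targets `int_least16_t` ends up as a 16-bit type and the `fast` variants as native register widths, but the standard only guarantees the minimum widths.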
So in terms of C++, which I'm more familiar with, I would use an
int_least16_t for the smallest signed integer type that has at least 16 bits of precision in my data type. This way, I can pack as much data into memory as possible and make the most efficient use of cache and the memory bus when transferring data. Then I'd write my functions in terms of int_fast16_t, or perhaps int_fast32_t, for the most efficient type (maybe the larger size to deal with overflow) that has at least 16 bits. The compiler will pick the most efficient size for the CPU, which is probably going to be a native WORD size, or a multiple thereof, for the registers.
white_nerdy@reddit
Theoretically, this could be possible for a couple reasons.
In practice, I am skeptical. Is there a benchmark to prove it?
I agree that "Use smaller data types when you can" is more reasonable performance advice than "Use floats instead of shorts when you can."
However, I'm still not sure this is good practice for general development. For most programs, most of the time is spent in a small part of the code; most memory is taken by a small number of data structures. Optimization has costs (developer time and code complexity). Identifying and intensely optimizing those key parts is going to be a much more cost-effective use of those resources than replacing "int" with "short" everywhere you can.
Benchmarks and empirical evidence are your friend. So is the question: "Should we be optimizing this at all, or is it good enough as-is?"
"Premature optimization is the root of all evil." -- Donald Knuth
Cutalana@reddit
Floats are not comparable to shorts, as a float will have imprecision (.1 + .2 will not equal .3), so they should be used for different cases. Floats will typically be faster than a short (or other integer types), as most processors have a specialized unit for them. However, it's very rare that a data type will be the bottleneck for a program, so much so that I would recommend not focusing on this until you have a good understanding of data structures and algorithms, as they are much more influential on the speed of a program.
dmazzoni@reddit
While this is true for fractions, floats can represent small integers (up to around 16 million for a 32-bit float) perfectly. So doing math on numbers where the operands and results are all less than 16 million is totally safe with floats and will give results identical to integers.
All modern processors made in the last 20 years have had floating-point support built-in with the exception of extremely tiny embedded processors. Unless you're programming a tiny embedded microcontroller you can assume floating-point is included. However you're wrong that it's faster. To a first approximation they're the same these days. However, when you use SIMD, 16-bit shorts might still have an edge over 32-bit floats.
This is true.
RandomOne4Randomness@reddit
I believe a reasonable addendum to these points is that, like most things, the subject of optimal performance is much more complex and nuanced than can be fully explained here.
For example, SIMD performance is going to depend not just on the size of the data type, but more so on the algorithm, the micro-architecture/pipeline, data locality and prefetching, and how your compiler optimizes it (or, if you're using assembly, how well you can optimize it).
Say I'm doing something as simple as looping over pairs of whole numbers and adding them together. Considerations like whether the values live exclusively in the CPU cache for the operations vs. being fetched from main memory vs. NVMe vs. disk, whether the data is stored in contiguous memory, data type cache line alignment, etc., likely matter a lot more than the opcode that does the add.
If you're targeting a specific CPU SKU with a specific compiler, you can be a lot more confident about the impact of small choices than if you're targeting something broad like the x86, x64, arm32, or arm64 architectures, or some combination of those over various generations of micro-architectures.
When the highest performance on the hardware really matters but you cannot say exactly what it will execute on, you have to model assumptions about commonalities in the hardware implementations it will likely run on, e.g. target a theoretical machine supporting 8-byte words, speculative prefetch into 64-byte-wide cache lines, at least 8 KiB of data cache, specific instruction types, etc.
alexbusiness102@reddit
Thanks for posting.
stevevdvkpe@reddit
I have never heard of architectures where floating-point operations are faster than integer operations. Adding two integers just requires a binary adder. Adding two floating-point numbers requires decomposing the exponents and mantissas, aligning the mantissas when the exponents are different, doing a binary addition, and normalizing the result and adjusting the exponent. There's basically no way floating-point could ever be faster than integer operations.
DSrcl@reddit
Integer divisions are slower than fp divisions on many devices.
MatthiasWM@reddit
Back in the early 90s, SGI released the R10000 CPU, which was optimized for 3D graphics tasks and was supposedly faster at floating point operations than at integer. I never tested it, but for "back then", FP was very fast.
Axman6@reddit
Integer multiplication used to be slower than it is these days; there are quite long dependency chains in a multiplier (bit 16 of the result depends on all bits of both inputs), so multiplies were split into multiple cycles to allow for higher clock speeds. This is still mostly true: 64-bit multiplies can have a three-cycle latency (Apple's Firestorm architecture has a three-cycle latency, and can also perform a 64x64->128 multiply in three cycles because they use the same circuitry). FMUL on the same architecture has a four-cycle latency.
AdministrativeLeg14@reddit
The parenthesised caveat is technically accurate, but since the question asks about “modern times”…are there modern processors that lack FPUs? Sure, back when we were running 386 or 486 CPUs (oh, the amazing speed of that old 486, at the time!), there were basic models and fancy models with FPUs, but does anyone make CPUs without FPUs today?
(This isn’t strictly rhetorical. Maybe there are low-power embedded systems even today where they don’t add FPUs? I’ve never worked in that sector.)
GeneticsGuy@reddit
Ya, for example, learning and understanding Big O Notation is far more critical for software design choices.
DavidRoyman@reddit
You talk about wanting precision, and then you mention that you want to optimize the CPU's memory cache. Those are two different goals.
If you require precision, a float could be sufficient for most applications. If a float doesn't offer sufficient precision, you should use a "decimal" type; most programming languages already have this type implemented in the standard library.
If you need speed, the bottleneck isn't the type, it's the context. Several similar operations will run faster if they can fit into a single SIMD instruction. Compilers don't always optimize for this, so it's quite a lot of work, and I can't see a reason for you to delve into that.
I'd stick with programming for your other needs, and only optimize if necessary. If you really care a lot about it, you should start from this video: https://www.youtube.com/watch?v=qin-Eps3U_E
Tosh97@reddit
Floats and shorts serve different purposes and their performance can vary based on the specific use case and hardware architecture. Generally, modern CPUs are optimized for handling 32-bit and 64-bit data types, which means that while floats may have a specialized processing unit, the actual speed difference often depends on how the data aligns with the CPU's architecture and memory access patterns. Always consider the context of your application when choosing data types.
keithstellyes@reddit
That isn't a well-defined concept. There's data-oriented design, but that's something different.
Like many optimization questions, the answer is: it depends, and also it probably doesn't matter, and when it does matter, you would test it.
patternrelay@reddit
The trick is that “smaller type = faster” is not always true on modern hardware. Most CPUs like working with 32 or 64 bit values, so a short often gets converted to a 32 bit int under the hood before doing actual arithmetic. That means you can end up paying for extra conversions without really saving time. Where smaller types help is with memory bandwidth and cache when you have big arrays and tight loops that are bottlenecked by memory, not math. For normal code in C# it is usually fine to default to int/float for clarity, then only drop to short or byte when you know memory layout and size really matter.
patternrelay@reddit
What you heard is sort of “context dependent true,” which is why it sounds confusing. On most modern CPUs the native word size is 32 or 64 bits, so int and float often map nicely to what the hardware likes, while short tends to get promoted to int in actual operations anyway. That means you do not really get free speed from using short, and sometimes you even add extra instructions for sign extension or packing/unpacking. The cache argument only really starts to matter when you are working on huge arrays and doing tight inner loops where memory bandwidth dominates. In typical C# business or game logic, picking the smallest type for performance is rarely worth it compared to just using int or float and focusing on clean algorithms. If you want to experiment, you could try BenchmarkDotNet on a tiny loop that does a lot of math on arrays of short vs float and see what your actual CPU does.
AdministrativeLeg14@reddit
I don’t know if it’s actually true on any given (or indeed any) piece of modern hardware.
If it is true, it presumably has to do with data alignment. Yes, smaller is generally faster; but your CPU may be designed to operate on memory in 32-bit words, for example, and then reading a 32-bit float just involves reading a word, which is extremely quick and efficient. If it’s a 16-bit short, the CPU may have to do a bit of extra work to align the data properly to a word boundary, especially if you’re going to run vector calculations.
Of course, in 99.99% of cases, this probably doesn’t matter and you should optimise for correctness, for readability, and for efficient algorithms. In the extremely rare cases where this kind of optimisation is actually important, you’re going to need benchmarks anyway to find out for sure (and quantify) on relevant hardware platforms.
RainbowCrane@reddit
100% this - this kind of micro optimization is almost never going to give you the same performance boost you would get from profiling your application and fixing tight loops, poorly performing database queries, etc. Questions like this are philosophically interesting but I’ve never seen a case where this type of optimization was high on the list of things affecting application performance
khooke@reddit
Learning to distinguish between those philosophical discussions and the things that are important and add real value is a super valuable skill as you progress through your career. When you're starting out it's hard to see the difference, because everything seems important.
RainbowCrane@reddit
Yep, that was a big part of others mentoring me when I was younger, and of my mentoring others as I advanced. One of my first lessons in the difference between theoretical best programming practices vs practical programming was the explanation by the senior database programmers that fully normalized databases mostly weren’t a thing in the real world of 1995 databases, where performance and space constraints led to non-normalized data
SymbolicDom@reddit
If I understand alignment correctly, it's the bigger data types that need to be aligned, and reading bytes is fine anywhere, even on 64-bit machines.
Count2Zero@reddit
In general, floating point operations require more CPU cycles.
Multiplying a short by 16 can be done by shifting the value four bits to the left, which is a single-cycle operation on just about any modern CPU (assuming the value is already in a register).
Multiplying a floating point number by 16 is more work for the CPU, because it has to handle the mantissa and exponent separately. There are different ways to implement floating-point multiplication, but they are all going to take more cycles than a simple shift.
With modern CPUs running at several GHz (billions of CPU cycles per second), this really isn't much of an issue anymore - you need to do a shit-ton of multiplication before you notice a significant difference.
claythearc@reddit
I wouldn’t stress about this — it’s the most extreme case of premature optimization. But the AI isn’t entirely wrong, it’s just answering a question that doesn’t matter for you.
Modern CPUs are optimized for 32/64-bit operations, and short arithmetic can require extra sign-extension instructions. Float SIMD (SSE/AVX) is also absurdly well-optimized compared to integer SIMD. So in a tight loop doing pure math on individual values, floats can genuinely be faster per-operation.
But that’s not what data-oriented design is about. The win from smaller types is memory bandwidth and cache — if you’re iterating over millions of values, smaller types mean more fit in a cache line, fewer misses, faster overall. The AI conflated ALU throughput with memory access patterns, which are completely different concerns.
None of this matters unless you’re in a game engine’s hot path or crunching massive datasets. In normal application code, you’ll burn way more time on the code smells: constant casts when APIs expect int, subtle bugs from implicit narrowing, maintaining mental overhead about which precise type lives where.
Constrained types absolutely make sense for semantic clarity though — using sbyte for a deck index or ushort for a port number communicates intent and gives you some compile-time bounds enforcement. That’s a readability and correctness win, just not a performance one.
ParsingError@reddit
Not exactly. ARM, x86, and RISC-V all have sign extending and zero extending load instructions. The only time it requires extra instructions on x86 is if you're doing an arith operation directly on a memory address AND you want a higher-precision result (in which case it requires loading into a register).
Float SSE/AVX instructions have worse latency and same-or-worse throughput than equivalent integer SIMD instructions, and 16-bit int instructions process twice as many values at once (assuming you want 16-bit outputs) so the int versions are definitely faster.
claythearc@reddit
After diving back down a rabbit hole, I'm pretty sure you are correct. I don't think it changes the core advice of "don't care about it for efficiency, but do care about it for semantics", but I was a bit sloppy.
peterlinddk@reddit
Data driven design usually means that you use data from your users to design / re-design the application - and not that you should use certain data types in your code. There is (or was) something called "data-oriented design" where you designed your code around the data being manipulated, and optimized it for whatever hardware (usually video) had to use that data. However, it makes no sense in languages with virtual machines, since you can't control the actual data-layout in memory anyway.
When programming in high-level languages like C# and similar, you should write your code for humans rather than for the machine. Use float if you want to communicate that the value may involve fractions, that you aren't too concerned about precision, and that space somehow matters (since you didn't use a double). Use short if you want to communicate that the variable only ever contains whole numbers that fit in 16 bits, if that is for some reason important (since you didn't use an int).
Otherwise you check the performance, and if it becomes a problem, you can change your data type.
But don't trust anyone, no articles, no redditors, no IA or whatever, who says that something is faster than another. They don't know your application, your data, and your system!
Leverkaas2516@reddit
Native-size floats are not going to be faster than native-size integers. If you find that using shorts is a performance problem, try using 32- or 64-bit integers. Whether that helps will depend on the machine architecture. If it helps at all, it probably won't help very much.
Floats are not ints. They behave differently. Use the data type that behaves as your program design requires, otherwise you'll get wrong answers.
PyroNine9@reddit
That's going to be highly dependent on the processor and what you're doing with the data. If you absolutely need to get the last cycle out of it, you'll need to benchmark the two approaches.
I have seen some software that does a LOT of math (mostly iterative models) actually benchmark multiple approaches on program start and then choose the fastest variant function for the computation.
trailing_zero_count@reddit
Only definitively for division, as integer division is very slow. For addition / multiplication, I think shorts would be slightly faster.
https://www.agner.org/optimize/instruction_tables.pdf
zhivago@reddit
Generally you should expect double and int to give the best performance.
Only use the short versions if space is more important than speed or a profiler tells you to.
BranchLatter4294@reddit
You asked "IA"? Lol
Ok
Maybe do some testing to find out the answer?
mjmvideos@reddit
Create some tests and find out.
Independent_Art_6676@reddit
On most CPUs, there is a lot of conversion going on. The legacy x87 FPU converts every floating point type to a larger 80-bit format with excess bits to help reduce error, so a 32-bit float gets converted up, the computation is done, and the result is converted back to the smaller value. Integers don't suffer as much from this on some hardware; it just depends. x86-type chips, for example, have overlapped registers for 1-, 2-, 4-, and 8-byte integers all in the same physical circuits, so they don't have to promote a 32-bit int to 64 bits and back to do the computations. Also, integer circuitry *in general* is simpler than floating point, due to how floats have to handle the exponent and mantissa while integer math is comparatively simple. I mean, a multiply in binary is either a copy of the value (multiplied by 1) or all zeros (multiplied by 0), so a multiply is just steps that shift and add. A multiply in floating point is notably more complicated.
I don't know the modern timings. FPUs are VERY fast. Integer math circuits are also VERY fast. But all things being equal, the extra work for floating point makes me feel pretty confident that it's probably at least microscopically slower than integer math, if you can actually do the same work using both types. It may take a billion-plus repeated operations to show a meaningful difference when timing it, though.
countsachot@reddit
I would wager it depends on a combination of the compiler or runtime optimizations and the processor make and model itself. You can always benchmark on targeted hardware to find out.
spinwizard69@reddit
Actually, language can make a difference, as does implementation. Floats can be 32-bit or 64-bit. Then there is the question of whether they are IEEE floats or something else. That being said, I'm not sure there are any FP units that do 64-bit math faster than 32-bit.
As for your programs, you should always ask why you need more precision. IEEE supports FP32, FP64, and FP128. What you are calling shorts, I suspect, is language-dependent and is not a floating point type. Often we talk about "floats" as 32 bits and "doubles" as 64 bits (C++). There is also the potential for long doubles. Again, in the context of C++, shorts are integer types, so shorts would almost always go through the integer unit rather than the floating point unit. The big "but" is that this can be execution-unit dependent. I can't help with C#, as I know zip about it, but if shorts are integers there too, then you might want to look into your understanding of the language. Integers and floats have two different use cases. It is important to understand that floats are approximations.
So when you ask which is faster, you need to ask what hardware it is running on. GPUs can really muck things up because of special types and optimizations. One of those optimizations is what can be done in parallel. For example, if a GPU can do 128 16-bit integer operations per cycle as opposed to, say, 32 doubles per cycle, which is faster? What I'm trying to say is that it depends on the specifics of the language and the hardware the code runs on. With the made-up numbers above, the processor would complete 4x more integer operations per cycle.
Caches are a wholly different discussion. Generally, shorter types mean more data in cache, but you also need to consider cache line length and other behaviors. If you have truly time-critical code, you may need to optimize for cache behavior on a specific processor. It is entirely possible to arrange floating point operations such that cache might not be a problem compared to integer.
Aggressive_Ad_5454@reddit
Floats and shorts are like apples and asterisks. Completely different purposes. Floats are for physical things, like the weight of a bag of potatoes or the temperature. You can also use them to hold integers, but they’ll prove imprecise for numbers with absolute values greater than some threshold.
Shorts are for inherently small integers, like the x/y position of an item on a screen (but not a printer).
Unless you’re doing down-to-the-metal image processing you’ll have a hard time measuring any differences in speed. Adds are fast, divides are slower, etc, for both.
Are shorts (16-bit integers) slower than longs (32-bit integers)? They might be if they’re stored unaligned. You can read about that. But again, you’ll need to know a lot about assembly language and processor architecture to even measure the difference. We’re talking nanoseconds or less.
If you have millions of them the space saving might help. But not for hundreds of them (unless you’re writing firmware for a tiny microwave oven controller).
high_throughput@reddit
Someone correct me, but aren't 16-bit operations in C# performed at 32 bits and then truncated?
In any case, you're 100% right that smaller types are better for cache and RAM throughput, but you have to weigh that against the time per operation, taking SIMD into account.
If you spend many microseconds on each element computing a fractal or something, then you need to focus on compute since it's just a few MB/s.
If you just increment each number then it'll be strongly bottlenecked by memory and you should optimize for that.