araujoms@reddit
I'd only bother to implement the block version; everything else is a job for the compiler.
Better yet is to not transpose until you absolutely have to, like Haskell and Julia do.
amaurea@reddit (OP)
But the whole point of this article is that you can't rely on the compiler to turn simple, sensible-looking code into fast machine code. If you could, the sensible-looking starting point that most of us would write wouldn't be 30x too slow. Sure, blocking captures the essence of the problem, but even then you're a factor of 4 behind what he reaches in the end.
The way I read the article was that the transpose was just a particularly clear example of issues that can appear in many other situations. I think the lessons were meant to be general, not something one could ignore by just saying "well, just don't transpose, then!"
JiminP@reddit
As someone with no real experience in this kind of optimization, the most important lesson I took from the article was that reads and writes are not symmetrical.
By the way, I think there's one important bit that the article overlooks.
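> Since these values are constant, they obviously are always served from L1 and do not have any noticeable negative impact on performance. However, they increase counter values in the same way as truly heavy data loads do. That’s why we observe two extra loads from L1d per each processed element than we expected.

I think those extra loads are not exactly negligible, although the overall picture would stay mostly the same. I think they happen because of pointer aliasing: it's technically possible that a write to `dst->data()[n * r + c]` overwrites the pointers `src._data` and `dst->_data`, so the compiler can't treat them as constant and reloads them on every iteration. (`src` is a `const Mat&`, so I'm not 100% certain the compiler has to account for that case.) Storing `src.data()` and `dst->data()` in local variables should eliminate the excess loads.

`transpose_Blocks` does store the data pointers locally, so I think that's why there are no extra L1 loads for `transpose_Blocks`.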
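To make the aliasing point concrete, here is a minimal sketch of the two variants. The `Mat` class below is a hypothetical stand-in, not the article's actual class, and the function names `transpose_naive` and `transpose_hoisted` are made up for illustration. The `std::uint8_t` element type is also an assumption: stores through a char-typed pointer are allowed to alias any object, which is the case where the reload concern is most plausible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the article's Mat: a square n x n matrix, row-major.
class Mat {
public:
    explicit Mat(std::size_t n) : _n(n), _data(n * n) {}
    std::size_t n() const { return _n; }
    std::uint8_t* data() { return _data.data(); }
    const std::uint8_t* data() const { return _data.data(); }

private:
    std::size_t _n;
    std::vector<std::uint8_t> _data;
};

// Calls data() inside the inner loop. The store goes through a char-typed
// pointer, so the compiler may be unable to prove that it leaves the
// matrices' internal pointers untouched, and may reload them each iteration.
void transpose_naive(const Mat& src, Mat* dst) {
    const std::size_t n = src.n();
    assert(dst->n() == n && &src != dst);  // out-of-place, same size
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            dst->data()[n * r + c] = src.data()[n * c + r];
}

// Reads both data pointers once into locals. A store through `out` cannot
// change the local variables `in` and `out` themselves, so they can stay in
// registers for the whole loop.
void transpose_hoisted(const Mat& src, Mat* dst) {
    const std::size_t n = src.n();
    assert(dst->n() == n && &src != dst);  // out-of-place, same size
    const std::uint8_t* in = src.data();
    std::uint8_t* out = dst->data();
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            out[n * r + c] = in[n * c + r];
}
```

Both functions compute the same result. Whether the first one really emits the extra loads depends on the compiler, the optimization flags and the element type, so checking the disassembly or the L1d load counters is the only way to be sure.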