Native bit size, Arbitrary-Precision Arithmetic

Posted by justpeachypay@reddit | Python | View on Reddit | 13 comments

Okay so I have a homework question to compare the int factorial and the float factorial of 200 and to explain what I find. For reference 200! results in a answer and 200.0! results in ∞. I’m like 90% sure the professor want an answer along the lines of “the calculated value exceeds (overflows) the maximum value that can be represented by the data type.” Especially given this is not a compsci class. Great, that’s easy. BUT I was like why, and so it took me a while because I’m a pretty new to programming and struggle with a lot of the terminology but I ended up coming across bignum. Now, the general idea of bignum makes sense to me… I understand how the arrays are structured beyond the bit size of the cpu, what I don’t understand is where the bit size comes in. Here’s my thought process: I understand that python uses the bit size to binarily encode the output - I think this is our base array and each individual 0 or 1 is an array. In order to maximize performance python can assign anywhere from 0 to 1073741823 = 2^30 digits to each array… ie 10^1073741823 (I’m not sure why this helps me visualize this number but it does)…. Now here’s where I get tripped up, how are you able to assign another number to a binary system (I know we’ve turned them into arrays but that’s still not making sense to me)? And why does this result in 2^31 -1 possible int digits? (I am assuming this is because it’s using 32 bits as parent arrays? I have python 64 bit so I’m not really understanding the correlation. Am I thinking about this completely wrong?

I’m sorry if some of this is unclear.

[-]

ElHeim@reddit

Ok, let's address the clarified question:

I understand that there are 2^(32) arrays [...]

Wait. There are not "2^(32) arrays". I'm not quite sure what exactly you mean with that. My parsin gives me: "there are over 4 billion arrays [...]". No, no.

CPython's internal representation of an integer type can be found here, along with an explanation of why 2^(30) instead of 2^(32). It looks like this:

typedef struct _PyLongValue {
    uintptr_t lv_tag; /* Number of digits, sign and flags */
    digit ob_digit[1];
} _PyLongValue;

struct _longobject {
    PyObject_HEAD
    _PyLongValue long_value;
};

That digit there is normally defined as an uint32_t, i.e., an unsigned 32-bit integer.

Python will then use an array of uint32_t to store the digits of an integer. In base 30. One digit per element of that array.

So, if you were to store, say, the number 123456, which fits in a single 2^(30) digit, the array that contains it will be of a single element, containing exactly that number. So, something like this:

[ 123456 ]

If you want to represent the number 4294967296, also known as 2^(32), which clearly does not fit in a single 2^(30), the representation will be instead "4 * 2^(30) + 0", like this:

[ 0, 4 ]

in a 2-element array (with the most significant digit on the right hand side).

I think what I’m confused about is that the max number python can compute is dependent upon your systems memory

Well, for native types you're limited by the size of the largest "word" that can be represented in your computer's architecture. These days that means typically 2^(32) or 2^(64). But what limits an array's size? Simplifying it a lot: the amount of memory you have.

And that's why Python doesn't come with an inherent largest integer value it can represent.

[-]

justpeachypay@reddit (OP)

2^32=4,294,967,296 so this is over 4 billion arrays. Are you saying this is not right? As I was understanding these are how many unsigned integer bit arrays exists.

But okay, so I get it now, I was thinking each digit was an element in the array, so the array could only support 2^30 elements. What you’ve stated makes significantly more sense. Now can think about it that way my previous thought process literally doesn’t make sense for such an efficient system, as that would be a pretty inefficient way to store numbers.

So I was playing around with this last night and it seems that I can set my max int to sys.set_int_max_str_digits(2147483647) which is 2^31 -1. I cannot set it larger. So a number of this magnitude is the largest number that can be stored in an array using multiple elements of 2^30 digits. Is this correct?

Lastly, I read that this max int of 2^31 -1 is usually calculated as 2^bitsizeofyourcomputer -1, but my computer is 64 bits. Could you direct me to what I should look up to read more about that?

Seriously thank you so much for taking the time to comment and explain this to me. It has really helped a ton.

[-]

ElHeim@reddit

2^(32) = 4,294,967,296 so this is over 4 billion arrays. Are you saying this is not right? As I was understanding these are how many unsigned integer bit arrays exists

I agree with u/fortunatefaileur that the whole "array" thing is just getting you confused. Very, very confused.

it seems that I can set my max int to sys.set_int_max_str_digits(2147483647) which is 2^(31) -1

I'm afraid you're mixing things here. Read the doc about sys.str_int_max_str_digits. This sets a limit on how large can be a number when representing it as a string. That has to do with the number of (usually decimal) digits you can write down.

No matter what value you set in there, you'll still be able to work with numbers as usual: the limit kicks in when you want to print that value out. This is because of practical reasons, as the operation can be very costly, and can be used to attack remote systems and make them crawl or run out of memory.

Lastly, I read that this max int of 2^(31) -1 is usually calculated as 2^(bitsizeofyourcomputer) -1, but my computer is 64 bits. Could you direct me to what I should look up to read more about that?

That's what I meant in the last paragraphs of my second answer, when talking about the "word" size. If Python were using the native type for integers, then yes: the largest integer you could natively represent in Python would be 2^(64)-1 (unsigned) or 2^(63)-1 (signed). With "natively" we mean storing it as a number that the computer can "understand" and operate upon using single CPU instructions, instead of having to write a function to simply add or multiply a number, like Python does.

But Python uses an arbitrary internal representation for its integers, meaning that it can go much (much) further than that.

If the assignment got you interested in digital representation of numbers, I'd suggest reading some introductory material on digital arithmetic. That should give you a general idea. But beware, it's a huge rabbit hole that can lead to computer architecture and stuff like that. Your professor might suggest reading material.

[-]

ElHeim@reddit

You're right in that Python is using bignums: otherwise 200! would result on an overflow.

And why does this result in 2^(31) -1 possible int digits

Let's start there. I take you did some decimal decomposition at some point during elementary school, right?

I.e., if I ask you "decompose 629" you would come up with "600 + 20 + 9" or "6 * 100 + 2 * 10 + 9 * 1", aka "6 * 10^(2) + 2 * 10^(1) + 9 * 10^(0)". That's the basics of positional number systems: each digit in a number has a value that depends on itself and a certain factor that is a function of the radix (the number of unique digits, 10 in this case) to the power of its position (0, 1, 2, 3, ...).

It's exactly the same with vanilla integers in computers. 100101001b (for binary) represents "1*2^(8) + 1*2^(5) + 1*2^(3) + 1*2^(0)" or "256 + 32 + 8 + 1" or 297d (for decimal). Notice the "vanilla" there: we can have different ways of encoding numbers.

No matter the number of bits you have available, the lowest number you can represent is 0 (all zeroes). Now, what's the largest number you can represent with so many bits? Say that we have 2 bits: 00, 01, 10, 11; i.e. 0, 1, 2, 3. That was easy. You can always figure out that largest number by assigning 1s to every digit, and adding up their values. That's tedious though, so we can do faster. Say we have 8 bits: 11111111, with the lowest bit having 2^(0) as its value, and the largest 2^(7). The next largest number would need 9 bits: 100000000, and it's exactly 2^(8). So it follows that the largest number we can represent with 8 bits is 2^(8)-1. Easy, right?

So, what if you have a 32 bit computer? The largest number you can represent is... 2^(32)-1. 64 bits? 2^(64)-1.

"Wait a minute", you ask. "I said 2^(31)-1 possible int digits, not 2^(32)-1. How come? Am I wrong?"

Not exactly. I told you that's the largest amount of "vanilla" integers you can represent with N bits. But there are other ways to represent numbers. For example, can you figure out how to represent negative numbers? Look that up and you'll find your answer.

About bignums, a hint. You know that we represent numbers using radix 10 (decimal numbers), and computers using radix 2 (binary numbers). There's nothing preventing anyone the use of radix 3, radix 16 (hexadecimal numbers!) or radix 200, for that matter: all the arithmetics work the same, only with different amount of digits for every position. You just need to make up symbols for all those extra digits, and you're golden.

So... What if you'd represent numbers in, say, radix 2^(32)? That way, if you consider a whole 32-bit integer as one digit, then you can use a string of 32-bit integers to represent a very large number, right? Or arrays, if you prefer. Well, for reasons out of the scope of this comment, Python uses 2^(30) instead, but the concept is the same..

After that, look up IEEE 759. With that in hand you'll understand why you get "inf" for the floats.

[-]

justpeachypay@reddit (OP)

Thank you, this helps a lot. So for clarity, 200.0! is not “overflowing” Python, since python supports significantly larger integers? Does python store the floats differently than decimal decomposition? If the data type is float, and it’s utilizing decimal decomposition, and the number becomes larger than what float can represent, why doesn’t it overflow?

And when you say look up how you represent negative numbers, you mean in the arrays?

And python is using symbols for everything in base 2^30 (besides 0-9 I assume)? I thought it was using decimal decomposition. (Now I’m more confused?)

[-]

ElHeim@reddit

Btw, further talking about 200.0! (i.e., using floats). Python "knows" that a number is too large to be represented with a float and tells you so. E.g.:

>>> fac(200)
3943289336823952517761816069660925311475679888435866316473712666221797249817016714601521420059923119520886060694598194151288213951213185525309633124764149655567314286353816586186984944719612228107258321201270166459320656137141474266387621212037869516201606287027897843301130159520851620311758504293980894611113948118519486873600000000000000000000000000000000000000000000000
>>> float(fac(200))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too large to convert to float

But if you start from a number that can be represented and then you operate on it, Python will have to abide by the floating point rules:

>>> float(fac(171)) * 172
inf

BTW, we get Inf because of the specifics of our computer's floating point implementation: that's the expected behavior when the result of an operation is too large to be represented in round-to-nearest mode.

[-]

justpeachypay@reddit (OP)

Gotcha, thank you so much!!!

[-]

ElHeim@reddit

Read my other answer, which addresses your clarification.

About floats: Python uses arbitrary arithmetics only for integers. For floats it uses the system's native type (typically the equivalent of a C's double). Follow the reference: IEEE 754. Wikipedia has a decent introduction to that, including the number representation.

BTW, given that you're studying Physics, you want to know about floating points and their trade-offs (particularly about how their precision - or lack thereof - changes with the value).

And when you say look up how you represent negative numbers, you mean in the arrays?

No. Up until then I was talking about binary numbers. Just read about the difference between unsigned and signed integers in any computer architecture. Unsigned numbers are represented as those "vanilla" ones I was talking about the whole time. For signed numbers you need a way to tell positive from negative. There are many ways, but one that simplifies a lot the arithmetic circuitry is the so-called "two's complement". Read up about that.

And python is using symbols for everything in base 2^(30) (besides 0-9 I assume)? I thought it was using decimal decomposition. (Now I’m more confused?)

Not "symbols". I was just explaining everything in generic terms. Python can't use internally anything different than binary. Again, refer to my other comment.

[-]

fortunatefaileur@reddit

You’re mixing up a bunch of things.

“Arrays” are just confusing you, I think. Binary numbers are just numbers using binary, same as decimal numbers are just a series of decimal digits. Neither are special (well, binary is special in that it is the smallest normal base), and any integer can be represented in any base - trinary, hexadecimal, centidecimal. They’re just ways of writing down a number, they’re not the number itself.
For positive integers, the number of digits limits how big a number can be - two decimal digits lets you do 00-99 (which is 100 numbers, 10^2), 32 binary digits lets you do 0-4 billion (2^32).
Negative integers can be handled in a similar way, by cutting the range in half, hence a “32-bit signed integer” can hold -2 billion to +2 billion.
CPUs have integer sizes they just natively support, ranging from 8-bits on an old microcontroller, to 64-bits on an Alpha, to whatever SIMD does now on intel/amd64. Key word is natifley, you can write code to handle other sizes, that’s what “big num” means - code to handle much longer numbers.

[-]

justpeachypay@reddit (OP)

I understand that there are 2^32 arrays which can all support 2^30 digit numbers in python. I think what I’m confused about is that the max number python can compute is dependent upon your systems memory, I guess I’m just a bit confused by that because shouldn’t the set of arrays put a limit on the number? Everything I’ve read says the max value is unbounded and only dependent upon the system you are using. I think by trying to logic my way through how that might work has only confused me more. I’m trying to understand why this number is dependent upon your systems memory and not upon the max value python can compute. I also don’t understand how exactly python is utilizing the memory of my computer, and how if python is utilizing my cpu how it’s not limited by the binary (I mean it is, but how is python able to utilize it such that it can compute extremely large values beyond what the compute itself can compute). How does python not have some inherent max value? In my brain the max value must be due to the arrays python “supports” (not sure that’s the right word). Sorry that is very obviously unclear in what I’ve originally stated. I was quite tired.

[-]

dasnoob@reddit

We aren't here to do your homework for you.

The answer has nothing to do with the language or precision or anything else you are saying. The answer has to do with how mathematics defines a factorial.

[-]

justpeachypay@reddit (OP)

I do know for a fact that’s not true… you can calculate the factorial of any non-integer positive number in math. Just because you never made it far enough in math to know that doesn’t mean you are right. I doubt you have even heard of a gamma function which has so many uses beyond calculating factorials of non integer numbers.

There are literally different limits on the largest integer the computer can compute and the largest float a computer can compute.. the computer will calculate the factorial of say 4.0! I’ve completed the homework. I’m asking for clarification on something beyond the scope of the class. please use your brain next time! math is not perfect on computers… everyone in the programming world should know that because you have to do things to mitigate it in your code??? I knew that coming from freaking matlab.

[-]

dasnoob@reddit

Ah youth to assume makes an ass out of u.