Investigating the Computer's Arithmetic

John Burkardt

On any computer, it can be useful to know how integers and reals are
represented internally, and how the computer carries out arithmetic
operations.  This information can help the programmer to estimate the
effects of roundoff, to recognize when one number is so small compared
to another as to be insignificant, and to tell when quantities are
close to blowing up.  The MACHINE package has all of this information
stored in DATA statements.  The MACHAR and PARANOIA packages try to
determine these facts dynamically, in a way that will work on any
machine.

We describe below some simple dynamic tests done on the Cray and the
VAX to try to find out for ourselves what's going on inside the
computer.  Along the way, we stumbled across our share of surprises!


Cray Maximum and Minimum Integers

The Cray integer format uses a full 64 bit word.  Note that, in order
to speed up integer operations, the default CFT77 compiler option on
the Cray assumes that integers in arithmetic operations are restricted
to 46 bits.  The investigations carried out in conjunction with this
article require that the "-i64" option be used, to allow the full 64
bits of integer arithmetic.  Some of the information about number
storage on the Cray comes from the CFT77 manual, appendix G.

Range of Integers on the Cray:

On the Cray, any integer between -(2^63) and (2^63)-1 can be
represented.  Note that (2^63)-1 = 9,223,372,036,854,775,807.

Cray Representation of Positive Integers:

0 is represented by a word each bit of which is zero.  A positive
integer is represented by a word which holds the "naive" binary
representation of the number.  That means that 1 is 63 "0" bits
followed by a "1" in the final position, 2 is 62 "0" bits followed by
"10", and 19 is:

  00000...010011

All positive integers must have a beginning bit of 0.
This is why the largest positive integer that may be represented is
(2^63)-1, which is

  011111...11111

Cray Representation of Negative Integers:

A negative integer is stored in "2's complement" format.  First the
64 bit binary representation of the absolute value of the number is
computed.  Then every bit is reversed.  Then a binary 1 is added to
that binary number.  As an example, here is how -11 would be
represented:

  11 (decimal) = 1011 (binary)

  Set the word to   0000.....000001011   (64 bits)
  Reverse to        1111.....111110100
  Add 1             1111.....111110101

If the addition of 1 caused a carry, of course, more than just the
last bit of the word would be altered.  These operations guarantee
that the first bit of the word is "1", which marks the word as
representing a negative value.

Converting the Cray Representation of an Integer to its Integer Value:

From what has been said so far, you should now understand that if we
have a Cray word which we know represents an integer, then we figure
out what that value is as follows:

  A) If the first bit is zero, a positive integer is being
     represented.  Simply convert from the binary format to decimal.

  B) If the first bit is one, a negative integer is being represented.
     Subtract one (in binary arithmetic), reverse all the bits, and
     then convert from the binary format to decimal.

Computing the Largest Positive Integer on the Cray:

We can compute the largest positive integer easily.  Assuming that
integers are stored in base 2, the largest integer should be of the
form 1+2+4+8+...  So we simply calculate successively 1, 1+2, 1+2+4
and so on.  We write each result out to a file in case we hit integer
overflow.  Also, just to be sure that we don't have a silent overflow
(which can happen!), we check that each result is indeed larger than
the previous value using an arithmetic comparison operation.
The results of this investigation are consistent with the
"theoretical" values deduced from the writeup in the manual, namely,
that the largest positive integer representable on the Cray is

  0111....1111 = (2^63)-1

Computing the Most Negative Integer on the Cray:

From what has gone before, we can try to guess what the most negative
integer is going to be.  There are two obvious choices:

  1000....0001 = -(2^63)+1

which is the negative of the largest positive integer, or

  1000....0000 = -(2^63)

If I were making the decision, I would have chosen the first value,
but the Cray designers picked -(2^63).  The reason that -(2^63) is not
an ideal choice for the most negative integer is that it has no
counterpart as a positive integer.  In fact, on the Cray, if you take
the absolute value of -(2^63), you get back -(2^63), which is surely
an undesirable feature!

Now let us see how we would try to compute the most negative integer.
We can repeat the procedure for the most positive integer, at first,
by starting with -1, -1-2, -1-2-4, and so on.  Each time we compute a
new result, we write it to a file, and we use an arithmetic comparison
to make sure that the new value really does represent a more negative
result.  In this way, we arrive at the number -(2^63)+1.  We know this
must be roughly the right result, because if we subtract the next
power of two, we don't get a smaller number.  (In fact, we get
positive numbers, but let's talk about that further on!)

However, as mentioned earlier, we have reason to suspect that we can
subtract 1 one more time and possibly get an even more negative
number.  Doing this, we find a new value which is reported to be
negative, and less than the previous value.  And a second subtraction
of 1 maps us into the positive values again.  So we can be fairly sure
that the most negative integer on the Cray is -(2^63).
Bizarre "Wrap Around" of Cray Integers:

We mentioned that, by subtracting successive powers of 2, we could go
lower and lower in the integers, until we almost reached -(2^63), and
how subtracting one more got us to -(2^63) exactly.  There is no
"smaller" integer than this value, but, believe it or not, you can
keep on subtracting!  Strangely enough, on the Cray, subtracting one
more gets us a positive number, rather than signalling an integer
overflow or some other warning.  This means that integers on the Cray
behave as though they were ordered in the following way:

  -1, 0, 1, ... (2^63)-2, (2^63)-1, -(2^63), -(2^63)+1, ... -1, 0, 1
                              ^          ^
                        "Infinity" "-Infinity"

Yes, integers silently wrap around on the Cray!  Just imagine the
problems that causes the unwary programmer!


Representation of Real Numbers on the Cray

Single precision real values on the Cray are stored in a 64 bit word.
The word is divided into three fields: the sign bit S, the exponent E,
and the mantissa M, representing the value as S * M * 2^E.  Here S is
assumed to be plus or minus 1, M is a real value greater than or equal
to 1/2 and less than 1, and E is an integer exponent.

The sign bit is the first bit, labeled bit 63.  It is 0 for a positive
number and for zero, and 1 for negative numbers.

The exponent occurs in the second through sixteenth bits, labeled 62
down to 48.  Given an exponent E, it will be stored in the exponent
field by adding 16,384 to it first, and then storing the binary
representation of that result.  16,384, by the way, has a 15 bit
binary representation of 100 000 000 000 000.  However, there is a
range limitation on exponents: no exponent E may be larger than 8191,
nor smaller than -8192, although there is room in the exponent field
for values not allowed by this limit.

Finally, the mantissa is represented as a 48 bit binary fraction, in
the bits labeled 47 down to 0.
Thus, for example, the representation of 10 can be deduced as follows:

  10 = 1010 (binary) = +1 * .1010 * 2^4 (binary)

so the word will look like

  0 100000000000100 1010000000...000

whereas -10 would be identical, except for a first bit of 1.  And 5/16
would be

  5/16 = 5 * 2^(-4) = 101 (binary) * 2^(-4) = +1 * .101 * 2^(-1)

so the word will look like

  0 011111111111111 101000000...000


The "Machine Precision" of the Cray

Loosely speaking, the machine precision, or "machine epsilon", is the
smallest positive number that can be added to 1 and make a difference.
This tells us about the properties of addition for all numeric values.
If we are, say, adding up a series of terms which get smaller and
smaller, at some point we must stop.  The logical place to stop is
when the terms are too small to make any difference whatsoever in the
sum.  Assuming for the moment that SUM and TERM are positive, this
happens when

  TERM < EPS*SUM

because then

  SUM+TERM < SUM + EPS*ABS(SUM) = SUM * (1+EPS)

But SUM*(1+EPS) is the first machine number that is larger than SUM,
and so the inequality is saying that TERM is so small it makes no
difference at all when added to SUM.

The machine precision is thus a very good measure of the effects of
using a finite arithmetic scheme.  It tells us the limits to accuracy,
the point at which we can expect severe roundoff problems, and the
relative spacing between machine numbers.

The loose definition of the machine precision might be: the smallest
positive number EPS such that 1 < 1+EPS.  However, for various
reasons, this definition is not tight enough for us.  It doesn't
guarantee, for instance, that 1-EPS < 1, even though we might expect
this.  Certain rounding schemes might defeat this expectation.
So for our purposes in defining the machine precision, let us make a
stricter definition.  The MACHINE EPSILON, "EPS", is defined to be the
smallest power of 2 with the properties:

  1) 1+EPS > 1
  2) 1-EPS < 1
  3) (1+EPS)-EPS = 1
  4) (1-EPS)+EPS = 1

Note that there may be smaller numbers, in particular smaller powers
of two, for which SOME (but not all) of these results are true.  But
such behavior can be because of clever rounding, or guard digits, or
some such trick.  We want to know the size of a "small" number that
still behaves fairly reasonably with respect to 1.

This number gives us a scale we can use for ANY computation.  Given a
number X, we know that Y is "tiny" compared to X if

  ABS(Y) < EPS * ABS(X).

Any number larger than this threshold can still be added to or
subtracted from X and make some difference, however slight.  (This is
assuming the number Y isn't "huge", which is another problem.)  After
all, we should be able to add EPS*ABS(X) to X, because we can add EPS
to 1, and EPS*ABS(X)+X is the same as X*(1+EPS) when X is positive, or
X*(1-EPS) when X is negative, and so any number greater than
EPS*ABS(X) should also be reasonable.

Such a number can be useful when, for example, trying to estimate the
derivative of a function.  We might estimate the derivative of
F(X)=X*X at X=2.0 by the ratio

  F'(2.0) = (F(X+DELTA)-F(X))/DELTA.

In this example, the top of the ratio would be

  (X*X + 2*DELTA*X + DELTA*DELTA) - (X*X)

but notice that if DELTA is too small, then the terms 2*DELTA*X and
DELTA*DELTA will not show up against X*X.  In this case, using a DELTA
of roughly the square root of the machine epsilon would give a
reasonable result.  For more complicated functions, estimating a
judicious value for DELTA can be very difficult, but knowing the value
of the machine precision is a guide in this process.

Many people try to compute the machine precision by a loop like:

      EPS=1.0
   10 CONTINUE
      EPS=EPS/2.0
      IF(EPS+1.0.GT.1.0)GO TO 10

First of all, the final value of EPS returned is half as big as it
should be!
Also, because of compiler optimization, rounding, and other artifacts
of the machine, this result is very often NOT correct.  It could be
smaller than the value we define above, reflecting the higher
precision of the machine's internal arithmetic unit, where temporary
results are computed, but it will not have the right properties.  Try
it and see if your results are correct!  On the VAX, I got a result
that was slightly smaller than it should be, but on the PC my answer
was absurdly low.

Computing the Cray Machine Precision:

In order to try to avoid the known problems of the simpler procedure
for estimating the machine precision, we carried out the following
computation:

      EPS=1.0
   10 CONTINUE
      EPSOLD=EPS
      EPS=EPS/2.0
C
      TEMP1=1.0+EPS
      TEMP2=1.0
      TEMP3=1.0-EPS
C
      IF(TEMP1.EQ.TEMP2)GO TO 20
      IF(TEMP2.EQ.TEMP3)GO TO 20
      IF(TEMP1-EPS.NE.TEMP2)GO TO 20
      IF(TEMP3+EPS.NE.TEMP2)GO TO 20
C
      GO TO 10
   20 CONTINUE
      EPS=EPSOLD

The result we got was EPS=2^(-47), which makes sense, since the Cray
has a 48 bit mantissa.  With a normalized mantissa, 1.0 is a 1 in bit
47 followed by 47 zeroes.  The 47th zero corresponds to 2^(-47), so
that is the smallest number we can add to 1.  The decimal value of EPS
is roughly 0.7105 * 10^(-14).  The internal binary representation of
the machine epsilon is:

  0 011111111010010 100000000000000000000000000000000000000000000000
  S E               M

The exponent has had 16,384 added to it.  The number represented is

  +1 * .1000... * 2^(011111111010010 - 100000000000000)  (binary)
  = 0.5 * 2^(-46)
  = 2^(-47)

The Cray Has Distinct Numbers Whose Difference Is Zero!

The IEEE floating point arithmetic standard was carefully defined to
avoid certain problems that can arise in current arithmetic
formulations.  The IEEE standard includes the following feature, which
you would expect from any sane system: if X is not equal to Y, then
X-Y is not equal to 0.  However, on the Cray system, it is in fact
possible to have X and Y not equal, but X-Y equal to 0!
A simple example of this can be concocted:

  Let Y=RMIN, the minimum power of 2.
  Let X=(1+EPS)*RMIN, a "slightly" larger number.

Then X-Y=EPS*RMIN, which is a power of two that is smaller than RMIN.
Since RMIN is the minimum power of 2 that can be represented, X-Y has
the value 0.  Consider the effect of this feature of Cray arithmetic
on code like:

      IF(X.NE.Y)TEMP=1.0/(X-Y)

which, for safety's sake, should actually read:

      IF(X-Y.NE.0.0)TEMP=1.0/(X-Y)

Computing the Cray Minimum Positive Real Value:

To compute the minimum positive real number (or really, the minimum
power of two), we compute successively RMIN=1, RMIN=1/2, and so on.
Each time, we save the previous value of RMIN in RMINOLD, and we stop
when RMIN*2 is not equal to RMINOLD.  At that point, RMINOLD is the
smallest power of two.

Now what do we expect to get?  The sign bit must be 0, and the
mantissa must be a 1 followed by all zeroes, because we are asking for
a power of 2.  The exponent field is the only quantity that can affect
the outcome.  We know that the Cray exponent field has unusual
restrictions on it.  If there were no restrictions, then an exponent
field of zero would give us the minimum power of 2, once we subtracted
the exponent bias of 16,384.  The minimum power of 2 would then be
1/2 * 2^(-16384), or 2^(-16385).

However, the biased exponent is not allowed to decrease below 8191.
Hence, for the Cray, we get 1/2 * 2^(8191-16384) = 1/2 * 2^(-8193), or
2^(-8194), which, strangely enough, prints out as 0.0!  It's not zero,
though: when we take its square root, we get a nonzero, printable
number.  The decimal value of the smallest Cray power of 2 is roughly
10^(-2466).  The binary representation of the smallest Cray power of 2
is:

  0 001111111111111 100000000000000000000000000000000000000000000000
  S E               M

  = +1 * 0.1000... * 2^(001111111111111 - 100000000000000)  (binary)
  = 0.5 * 2^(8191-16384)
  = 0.5 * 2^(-8193)
  = 2^(-8194)

Notice the strange fact that Cray exponents have a minimum value
(001111111111111) which is rather high!
If we allowed the exponent to drop to (000000000000000), we could
represent 0.5*2^(-16384) = 2^(-16385).  I have no idea why this
artificial restriction was made, unless, perhaps, it guarantees that
any two real numbers that can be stored can also be multiplied, and
the result stored in a 64 bit word.  Then, however, that sixty four
bit result must be checked for "legality"!

The Cray manual claims that the smallest legal exponent is -8192,
which means the smallest power of two would be 0.5*2^(-8192), or
2^(-8193); this does not agree with our results.

Computing the Cray Maximum Real Value:

When computing the maximum real, the first thing to do is pack the
mantissa.  By now, you might guess that we would start with 1, and add
1/2, 1/4, and so on, down to 2^(-47), giving us a mantissa of
2-(2^(-47)).  Then we simply multiply by two over and over.  We write
each result to a file, so that if we hit a floating overflow, the last
number we printed out was the biggest legal number.  Our result is
roughly .999... * 2^8191, or 0.54 * 10^2466, or, in bits:

  0 101111111111111 111111111111111111111111111111111111111111111111
  S E               M

  = +1 * 0.1111... * 2^(101111111111111 - 100000000000000)  (binary)
  = 0.9999... * 2^(8191)

This value agrees with that reported in the Cray CFT77 manual.


VAX Maximum and Minimum Integers

On the VAX, as mentioned earlier, the format used to store integers is
similar to that used by the Cray, except that since the words are only
32 bits long, the minimum integer is -(2^31) and the maximum is
(2^31)-1.  Note that (2^31)-1 = 2,147,483,647.  Again, there is a
negative number which has no positive counterpart.  However, on the
VAX, if you add 1 to the largest integer, or carry out any arithmetic
operation which would result in a positive result too large to
represent, the VAX triggers an "integer overflow" and halts the
program.  Integers do not "wrap around" on the VAX.
Representation of Real Numbers on the VAX

Single precision real values on the VAX are stored in a 32 bit word.
The word is divided into three fields: the sign bit S, the exponent E,
and the mantissa M, representing the value as S * M * 2^E.  Here S is
assumed to be plus or minus 1, M is a real value greater than or equal
to 1/2 and less than 1, and E is an integer exponent.

For unfathomable reasons, the VAX uses bytes in an order that is
unlike the order that any sensible person would choose.  This means
that, for real numbers at least, the situation looks like this:

  Bits:  31-24  23-16  15-8  7-0
  Bytes:   1      0      3    2

Here we label the bytes 3, 2, 1, 0, and intend to suggest that byte 3
contains the "most significant" information.  Thus, the natural way to
print out the number is in the order "Byte 3, Byte 2, Byte 1, Byte 0",
and we will do so further on, to avoid utter confusion.  However, note
that this hides the bizarre way the VAX actually stores the values.
If you want to address the bytes of a word that contains real values,
you will need to keep in mind the order mentioned above.  Let's
pretend that the actual ordering is as follows:

  Bits:  31-24  23-16  15-8  7-0
  Bytes:   3      2      1    0

What this means is that, for example, a number which we will describe
as looking like this:

  1 01101110 11111111111000000000011
  S E        M

will actually be stored in the VAX as

  1111000000000011 1 01101110 1111111
  M continued      S E        M

The Layout of a VAX Real Number:

Using our ordering, and not the VAX's, the sign bit of a real number
is the first bit, labeled bit 31.  It is 0 for a positive number and
for zero, and 1 for negative numbers.

The exponent occurs in the eight bits labeled 30 to 23.  Given an
exponent E, it will be stored in the exponent field by adding 128 to
it first, and then storing the binary representation of that result.
128, by the way, has an 8 bit binary representation of 10 000 000.
Thus, we expect that the minimum legal exponent will be -128, and the
maximum will be +127.
Finally, the mantissa is represented as a 23 bit binary fraction, in
the bits labeled 22 down to 0.  However, because every binary fraction
(except for that representing 0) has a leading 1, the VAX drops this
1, thus gaining space for one extra place of binary accuracy.  In
other words, a mantissa of the form .101 will cause the digits "01" to
be stored, and a mantissa of .1111 will cause the digits "111" to be
stored.

Thus, for example, the representation of 10 can be deduced as follows:

  10 = 1010 (binary) = +1 * .1010 * 2^4 (binary)

so the word will look like

  0 10000100 010000000...000

whereas -10 would be identical, except for a first bit of 1.  And 5/16
would be

  5/16 = 5 * 2^(-4) = 101 (binary) * 2^(-4) = +1 * .101 * 2^(-1)

so the word will look like

  0 01111111 01000000...000

Computing the VAX Machine Precision:

Running the same program we used on the Cray, we got EPS=2^(-23),
which makes sense, since the VAX effectively uses a 24 bit mantissa
(23 bits plus the "hidden bit").  The decimal value is roughly
0.1192 * 10^(-6).  In binary:

  0 01101010 00000000000000000000000
  S E        M

The sign bit is over the "S".  The exponent, 01101010, begins at "E"
and ends just before "M".  The mantissa begins at "M".  The exponent
must have 128 subtracted from it, and the mantissa should be prefixed
by a "1" (this is called hidden or implicit bit normalization).  Thus
this number reads:

  +1 * .(1)000... * 2^(01101010 - 10000000)  (binary)
  = 0.5 * 2^(106-128)
  = 0.5 * 2^(-22)
  = 2^(-23)

Computing the VAX Minimum Positive Real Value:

Using the same technique we used on the Cray, we find the minimum
positive real number on the VAX to be 2^(-128), roughly
0.2938 * 10^(-38), or, in bits:

  0 00000001 00000000000000000000000
  S E        M

  = +1 * .(1)0000... * 2^(00000001 - 10000000)  (binary)
  = 0.5 * 2^(-127)
  = 2^(-128)

Note that neither the VAX nor the Cray seems to support
"unnormalized" numbers: that is, typically, a number with an exponent
of 0 and a mantissa whose first digit is NOT a 1.
This would be a way of getting smaller numbers than you could
ordinarily represent, and it would guarantee that the difference of
two nonequal numbers would never be zero, but it apparently is not
implemented.  The consequence is that there are numbers X and Y on the
Cray and on the VAX for which X and Y are different, but for which X-Y
is zero.

Computing the VAX Maximum Real Value:

Using the same program that was used for the Cray, we get for the
maximum VAX real a value of .999... * 2^127:

  0 11111111 11111111111111111111111
  S E        M

  = +1 * .(1)111... * 2^(11111111 - 10000000)  (binary)
  = .9999... * 2^(127)
  = 0.1701 * 10^39


Summary

We've sketched out some preliminary investigations into the
arithmetic format of the computer.  You might be interested in the
alternative (and more sensible) IEEE format, described in
PSCDOC:IEEE.DOC.  You might want to compare what we found with what
the routines in MACHINE report, and with what the dynamic routines
MACHAR and PARANOIA compute.  PARANOIA also probes aspects of the
arithmetic used on the computer, including the use of guard digits,
truncation, and higher precision for intermediate results.

Even though the representation of numbers on a computer, and the
implementation of arithmetic with those numbers, should in most cases
be tangential to your line of work, I hope I've shown that there are
surprises for the wise and traps for the unwary in this field!