Investigating the Computer's Arithmetic

John Burkardt

On any computer, it can be useful to know how integers and reals are
represented internally, and how the computer carries out arithmetic
operations.  This information can help the programmer to estimate the
effects of roundoff, to recognize when one number is so small compared
to another as to be insignificant, and to tell when quantities are
close to blowing up.  The MACHINE package has all of this information
stored in DATA statements.  The MACHAR and PARANOIA packages try to
determine these facts dynamically, in a way that will work on any
machine.

We describe below some simple dynamic tests done on the Cray and the
VAX to try to find out for ourselves what's going on inside the
computer.  Along the way, we stumbled across our share of surprises!


Cray Maximum and Minimum Integers

The Cray integer format uses a full 64 bit word.  Note that, in order
to speed up integer operations, the default CFT77 compiler option on
the Cray assumes that integers in arithmetic operations are restricted
to 46 bits.  The investigations carried out in conjunction with this
article require that the "-i64" option be used, to allow the full 64
bits of integer arithmetic.  Some of the information about number
storage on the Cray comes from the CFT77 manual, appendix G.

Range of Integers on the Cray:

On the Cray, any integer between -(2^63) and (2^63)-1 can be
represented.  Note that (2^63)-1 = 9,223,372,036,854,775,807.

Cray Representation of Positive Integers:

0 is represented by a word each bit of which is zero.  A positive
integer is represented by a word which holds the "naive" binary
representation of the number.  That means that 1 is 63 "0" bits
followed by a "1" in the final position, 2 is 62 "0" bits followed by
"10", and 19 is:

  00000...010011

All positive integers must have a beginning bit of 0.
This is why the largest positive integer that may be represented is
(2^63)-1, which is

  011111...11111

Cray Representation of Negative Integers:

A negative integer is stored in "2's complement" format.  First the
64 bit binary representation of the absolute value of the number is
computed.  Then every bit is reversed.  Then a binary 1 is added to
that binary number.  As an example, here is how -11 would be
represented:

  11 (decimal) = 1011 (binary)

  Set the word to   0000.....000001011   (64 bits)
  Reverse to        1111.....111110100
  Add 1             1111.....111110101

If the addition of 1 caused a carry, of course, more than just the
last bit of the word would be altered.  These operations guarantee
that the first bit of the word is "1", which marks the word as
representing a negative value.

Converting the Cray Representation of an Integer to its Integer Value:

From what has been said so far, you should now understand that if we
have a Cray word which we know represents an integer, then we figure
out what that value is as follows:

  A) If the first bit is zero, a positive integer is being
     represented.  Simply convert from the binary format to decimal.

  B) If the first bit is one, a negative integer is being represented.
     Subtract one (in binary arithmetic), reverse all the bits, and
     then convert from the binary format to decimal.

Computing the Largest Positive Integer on the Cray:

We can compute the largest positive integer easily.  Assuming that
integers are stored in base 2, the largest integer should be of the
form 1+2+4+8+...  So we simply calculate successively 1, 1+2, 1+2+4
and so on.  We write each result out to a file in case we hit integer
overflow.  Also, just to be sure that we don't have a silent overflow
(which can happen!), we check that each result is indeed larger than
the previous value using an arithmetic comparison operation.
The results of this investigation are consistent with the
"theoretical" values deduced from the writeup in the manual, namely,
that the largest positive integer representable on the Cray is

  0111....1111 = (2^63)-1

Computing the Most Negative Integer on the Cray:

From what has gone before, we can try to guess what the most negative
integer is going to be.  There are two obvious choices:

  1000....0001 = -(2^63)+1

which is the negative of the largest positive integer, or

  1000....0000 = -(2^63)

If I were making the decision, I would have chosen the first value,
but the Cray designers picked -(2^63).  The reason that -(2^63) is not
an ideal choice for the most negative integer is that it has no
counterpart as a positive integer.  In fact, on the Cray, if you take
the absolute value of -(2^63), you get back -(2^63), which is surely
an undesirable feature!

Now let us see how we would try to compute the most negative integer.
We can repeat the procedure for the most positive integer, at first,
by starting with -1, -1-2, -1-2-4, and so on.  Each time we compute a
new result, we write it to a file, and we use an arithmetic comparison
to make sure that the new value really does represent a more negative
result.  In this way, we arrive at the number -(2^63)+1.  We know this
must be roughly the right result, because if we subtract the next
power of two, we don't get a smaller number.  (In fact, we get
positive numbers, but let's talk about that further on!)

However, as mentioned earlier, we have reason to suspect that we can
subtract 1 one more time and possibly get an even more negative
number.  Doing this, we find a new value which is reported to be
negative, and less than the previous value.  And a second subtraction
of 1 maps us into the positive values again.  So we can be fairly sure
that the most negative integer on the Cray is -(2^63).
Bizarre "Wrap Around" of Cray Integers:

We mentioned that, by subtracting successive powers of 2, we could go
lower and lower in the integers, until we almost reached -(2^63), and
how subtracting one more got us to -(2^63) exactly.  There is no
"smaller" integer than this value, but, believe it or not, you can
keep on subtracting!  Strangely enough, on the Cray, subtracting one
more gets us a positive number, rather than signalling an integer
overflow or some other warning.  This means that integers on the Cray
behave as though they were ordered in the following way:

  -1, 0, 1, ... (2^63)-2, (2^63)-1, -(2^63), -(2^63)+1, ... -1, 0, 1
                              ^          ^
                        "Infinity" "-Infinity"

Yes, integers silently wrap around on the Cray!  Just imagine the
problems that causes the unwary programmer!


Representation of Real Numbers on the Cray

Single precision real values on the Cray are stored in a 64 bit word.
The word is divided into three fields: the sign bit S, the exponent E,
and the mantissa M, representing the value as S * M * 2^E.  Here S is
assumed to be plus or minus 1, M is a real value greater than or equal
to 1/2 and less than 1, and E is an integer exponent.

The sign bit is the first bit, labeled bit 63.  It is 0 for a positive
number and for zero, and 1 for negative numbers.

The exponent occurs in the second through sixteenth bits, labeled 62
down to 48.  Given an exponent E, it will be stored in the exponent
field by adding 16,384 to it first, and then storing the binary
representation of that result.  16,384, by the way, has a 15 bit
binary representation of 100 000 000 000 000.  However, there is a
range limitation on exponents: no exponent E may be larger than 8191,
nor smaller than -8192, although there is room in the exponent field
for values not allowed by this limit.

Finally, the mantissa is represented as a 48 bit binary fraction, in
the bits labeled 47 down to 0.
Thus, for example, the representation of 10 can be deduced as follows:

  10 = 1010 (binary) = +1 * .1010 * 2^4 (binary)

so the word will look like

  0 100000000000100 1010000000...000

whereas -10 would be identical, except for a first bit of 1.  And 5/16
would be

  5/16 = 5 * 2^(-4) = 101 (binary) * 2^(-4) = +1 * .101 * 2^(-1)

so the word will look like

  0 011111111111111 101000000...000


The "Machine Precision" of the Cray

Loosely speaking, the machine precision, or "machine epsilon", is the
smallest positive number that can be added to 1 and make a difference.
This tells us about the properties of addition for all numeric values.
If we are, say, adding up a series of terms which get smaller and
smaller, at some point we must stop.  The logical place to stop is
when the terms are too small to make any difference whatsoever in the
sum.  Assuming for the moment that SUM and TERM are positive, this
happens when

  TERM < EPS*SUM

because then

  SUM+TERM < SUM + EPS*ABS(SUM) = SUM * (1+EPS)

But SUM*(1+EPS) is the first machine number that is larger than SUM,
and so the inequality is saying that TERM is so small it makes no
difference at all when added to SUM.

The machine precision is thus a very good measure of the effects of
using a finite arithmetic scheme.  It tells us the limits to accuracy,
the point at which we can expect severe roundoff problems, and the
relative spacing between machine numbers.

The loose definition of the machine precision might be: the smallest
positive number EPS such that 1 < 1+EPS.  However, for various
reasons, this definition is not tight enough for us.  It doesn't
guarantee, for instance, that 1-EPS < 1, even though we might expect
this.  Certain rounding schemes might defeat this expectation.
So for our purposes in defining the machine precision, let us make a
stricter definition.  The MACHINE EPSILON, "EPS", is defined to be the
smallest power of 2 with the properties:

  1) 1+EPS > 1
  2) 1-EPS < 1
  3) (1+EPS)-EPS = 1
  4) (1-EPS)+EPS = 1

Note that there may be smaller numbers, in particular smaller powers
of two, for which SOME (but not all) of these results are true.  But
such behavior can be because of clever rounding, or guard digits, or
some such trick.  We want to know the size of a "small" number that
still behaves fairly reasonably with respect to 1.

This number gives us a scale we can use for ANY computation.  Given a
number X, we know that Y is "tiny" compared to X if

  ABS(Y) < EPS * ABS(X).

Any number larger than this threshold can still be added to or
subtracted from X and make some difference, however slight.  (This is
assuming the number Y isn't "huge", which is another problem.)  After
all, we should be able to add EPS*ABS(X) to X, because we can add EPS
to 1, and EPS*ABS(X)+X is the same as X*(1+EPS) when X is positive, or
X*(1-EPS) when X is negative, and so any number greater than
EPS*ABS(X) should also be reasonable.

Such a number can be useful when, for example, trying to estimate the
derivative of a function.  We might estimate the derivative of
F(X)=X*X at X=2.0 by the ratio

  F'(2.0) = (F(X+DELTA)-F(X))/DELTA.

In this example, the top of the ratio would be

  (X*X + 2*DELTA*X + DELTA*DELTA) - (X*X)

but notice that if DELTA is too small, then the terms 2*DELTA*X and
DELTA*DELTA will not show up against X*X.  In this case, using a DELTA
of roughly the square root of the machine epsilon would give a
reasonable result.  For more complicated functions, estimating a
judicious value for DELTA can be very difficult, but knowing the value
of the machine precision is a guide in this process.

Many people try to compute the machine precision by a loop like:

      EPS=1.0
   10 CONTINUE
      EPS=EPS/2.0
      IF(EPS+1.0.GT.1.0)GO TO 10

First of all, the final value of EPS returned is half as big as it
should be!
Also, because of compiler optimization, rounding, and other artifacts
of the machine, this result is very often NOT correct.  It could be
smaller than the value we define above, reflecting the higher
precision of the machine's internal arithmetic unit, where temporary
results are computed, but it will not have the right properties.  Try
it and see if your results are correct!  On the VAX, I got a result
that was slightly smaller than it should be, but on the PC my answer
was absurdly low.

Computing the Cray Machine Precision:

In order to try to avoid the known problems of the simpler procedure
for estimating the machine precision, we carried out the following
computation:

      EPS=1.0
   10 CONTINUE
      EPSOLD=EPS
      EPS=EPS/2.0
C
      TEMP1=1.0+EPS
      TEMP2=1.0
      TEMP3=1.0-EPS
C
      IF(TEMP1.EQ.TEMP2)GO TO 20
      IF(TEMP2.EQ.TEMP3)GO TO 20
      IF(TEMP1-EPS.NE.TEMP2)GO TO 20
      IF(TEMP3+EPS.NE.TEMP2)GO TO 20
C
      GO TO 10
   20 CONTINUE
      EPS=EPSOLD

The result we got was EPS=2^(-47), which makes sense, since the Cray
has a 48 bit mantissa.  With a normalized mantissa, 1.0 is a 1 in bit
47 followed by 47 zeroes.  The 47th zero corresponds to 2^(-47), so
that is the smallest number we can add to 1.  The decimal value of EPS
is roughly 0.7105 * 10^(-14).  The internal binary representation of
the machine epsilon is:

  0 011111111010010 100000000000000000000000000000000000000000000000
  S E               M

The exponent has had 16,384 added to it.  The number represented is

  +1 * .1000... * 2^(011111111010010 - 100000000000000)  (binary)
  = 0.5 * 2^(-46)
  = 2^(-47)

The Cray Has Distinct Numbers Whose Difference Is Zero!

The IEEE floating point arithmetic standard was carefully defined to
avoid certain problems that can arise in current arithmetic
formulations.  The IEEE standard includes the following feature, which
you would expect from any sane system: if X is not equal to Y, then
X-Y is not equal to 0.  However, on the Cray system, it is in fact
possible to have X and Y not equal, but X-Y equal to 0!
A simple example of this can be concocted:

  Let Y=RMIN, the minimum power of 2.
  Let X=(1+EPS)*RMIN, a "slightly" larger number.

Then X-Y=EPS*RMIN, which is a power of two that is smaller than RMIN.
Since RMIN is the minimum power of 2 that can be represented, X-Y has
the value 0.  Consider the effect of this feature of Cray arithmetic
on code like:

      IF(X.NE.Y)TEMP=1.0/(X-Y)

which, for safety's sake, should actually read:

      IF(X-Y.NE.0.0)TEMP=1.0/(X-Y)

Computing the Cray Minimum Positive Real Value:

To compute the minimum positive real number (or really, the minimum
power of two), we compute successively RMIN=1, RMIN=1/2, and so on.
Each time, we save the previous value of RMIN in RMINOLD, and we stop
when RMIN*2 is not equal to RMINOLD.  At that point, RMINOLD is the
smallest power of two.

Now what do we expect to get?  The sign bit must be 0, and the
mantissa must be a 1 followed by all zeroes, because we are asking for
a power of 2.  The exponent field is the only quantity that can affect
the outcome.  We know that the Cray exponent field has unusual
restrictions on it.  If there were no restrictions, then an exponent
field of zero would give us the minimum power of 2, once we subtracted
the exponent bias of 16,384.  The minimum power of 2 would then be
1/2 * 2^(-16384), or 2^(-16385).

However, the biased exponent is not allowed to decrease below 8191.
Hence, for the Cray, we get 1/2 * 2^(8191-16384) = 1/2 * 2^(-8193), or
2^(-8194), which, strangely enough, prints out as 0.0!  It's not zero,
though: when we take its square root, we get a nonzero, printable
number.  The decimal value of the smallest Cray power of 2 is roughly
10^(-2466).  The binary representation of the smallest Cray power of 2
is:

  0 001111111111111 100000000000000000000000000000000000000000000000
  S E               M

  = +1 * 0.1000... * 2^(001111111111111 - 100000000000000)  (binary)
  = 0.5 * 2^(8191-16384)
  = 0.5 * 2^(-8193)
  = 2^(-8194)

Notice the strange fact that Cray exponents have a minimum value
(001111111111111) which is rather high!
If we allowed the exponent to drop to (000000000000000), we could
represent 0.5*2^(-16384) = 2^(-16385).  I have no idea why this
artificial restriction was made, unless, perhaps, it guarantees that
any two real numbers that can be stored can also be multiplied, and
the result stored in a 64 bit word.  Then, however, that sixty four
bit result must be checked for "legality"!

The Cray manual claims that the smallest legal exponent is -8192,
which means the smallest power of two would be 0.5*2^(-8192), or
2^(-8193); this does not agree with our results.

Computing the Cray Maximum Real Value:

When computing the maximum real, the first thing to do is pack the
mantissa.  By now, you might guess that we would start with 1, and add
1/2, 1/4, and so on, down to 2^(-47), giving us a mantissa of
2-(2^(-47)).  Then we simply multiply by two over and over.  We write
each result to a file, so that if we hit a floating overflow, the last
number we printed out was the biggest legal number.  Our result is
roughly .999... * 2^8191, or 0.54 * 10^2466, or, in bits:

  0 101111111111111 111111111111111111111111111111111111111111111111
  S E               M

  = +1 * 0.1111... * 2^(101111111111111 - 100000000000000)  (binary)
  = 0.9999... * 2^(8191)

This value agrees with that reported in the Cray CFT77 manual.


VAX Maximum and Minimum Integers

On the VAX, as mentioned earlier, the format used to store integers is
similar to that used by the Cray, except that since the words are only
32 bits long, the minimum integer is -(2^31) and the maximum is
(2^31)-1.  Note that (2^31)-1 = 2,147,483,647.  Again, there is a
negative number which has no positive counterpart.  However, on the
VAX, if you add 1 to the largest integer, or carry out any arithmetic
operation which would result in a positive result too large to
represent, the VAX triggers an "integer overflow" and halts the
program.  Integers do not "wrap around" on the VAX.
Representation of Real Numbers on the VAX

Single precision real values on the VAX are stored in a 32 bit word.
The word is divided into three fields: the sign bit S, the exponent E,
and the mantissa M, representing the value as S * M * 2^E.  Here S is
assumed to be plus or minus 1, M is a real value greater than or equal
to 1/2 and less than 1, and E is an integer exponent.

For unfathomable reasons, the VAX uses bytes in an order that is
unlike the order that any sensible person would choose.  This means
that, for real numbers at least, the situation looks like this:

  Bits:  31-24  23-16  15-8  7-0
  Bytes:   1      0      3    2

Here we label the bytes 3, 2, 1, 0, and intend to suggest that byte 3
contains the "most significant" information.  Thus, the natural way to
print out the number is in the order "Byte 3, Byte 2, Byte 1, Byte 0",
and we will do so further on, to avoid utter confusion.  However, note
that this hides the bizarre way the VAX actually stores the values.
If you want to address the bytes of a word that contains real values,
you will need to keep in mind the order mentioned above.  Let's
pretend that the actual ordering is as follows:

  Bits:  31-24  23-16  15-8  7-0
  Bytes:   3      2      1    0

What this means is that, for example, a number which we will describe
as looking like this:

  1 01101110 11111111111000000000011
  S E        M

will actually be stored in the VAX as

  1111000000000011 1 01101110 1111111
  M continued      S E        M

The Layout of a VAX Real Number:

Using our ordering, and not the VAX's, the sign bit of a real number
is the first bit, labeled bit 31.  It is 0 for a positive number and
for zero, and 1 for negative numbers.

The exponent occurs in the eight bits labeled 30 to 23.  Given an
exponent E, it will be stored in the exponent field by adding 128 to
it first, and then storing the binary representation of that result.
128, by the way, has an 8 bit binary representation of 10 000 000.
Thus, we expect that the minimum legal exponent will be -128, and the
maximum will be +127.
Finally, the mantissa is represented as a 23 bit binary fraction, in
the bits labeled 22 down to 0.  However, because every binary fraction
(except for that representing 0) has a leading 1, the VAX drops this
1, thus gaining space for one extra place of binary accuracy.  In
other words, a mantissa of the form .101 will cause the digits "01" to
be stored, and a mantissa of .1111 will cause the digits "111" to be
stored.

Thus, for example, the representation of 10 can be deduced as follows:

  10 = 1010 (binary) = +1 * .1010 * 2^4 (binary)

so the word will look like

  0 10000100 010000000...000

whereas -10 would be identical, except for a first bit of 1.  And 5/16
would be

  5/16 = 5 * 2^(-4) = 101 (binary) * 2^(-4) = +1 * .101 * 2^(-1)

so the word will look like

  0 01111111 01000000...000

Computing the VAX Machine Precision:

Running the same program we used on the Cray, we got EPS=2^(-23),
which makes sense, since the VAX effectively uses a 24 bit mantissa
(23 bits plus the "hidden bit").  The decimal value is roughly
0.1192 * 10^(-6).  In binary:

  0 01101010 00000000000000000000000
  S E        M

The sign bit is over the "S".  The exponent, 01101010, begins at "E"
and ends just before "M".  The mantissa begins at "M".  The exponent
must have 128 subtracted from it, and the mantissa should be prefixed
by a "1" (this is called hidden or implicit bit normalization).  Thus
this number reads:

  +1 * .(1)000... * 2^(01101010 - 10000000)  (binary)
  = 0.5 * 2^(106-128)
  = 0.5 * 2^(-22)
  = 2^(-23)

Computing the VAX Minimum Positive Real Value:

Using the same technique we used on the Cray, we find the minimum
positive real number on the VAX to be 2^(-128), roughly
0.2938 * 10^(-38), or, in bits:

  0 00000001 00000000000000000000000
  S E        M

  = +1 * .(1)0000... * 2^(00000001 - 10000000)  (binary)
  = 0.5 * 2^(-127)
  = 2^(-128)

Note that neither the VAX nor the Cray seems to support
"unnormalized" numbers: that is, typically, a number with an exponent
of 0 and a mantissa whose first digit is NOT a 1.
This would be a way of getting smaller numbers than you could
ordinarily represent, and it would guarantee that the difference of
two nonequal numbers would never be zero, but it apparently is not
implemented.  The consequence is that there are numbers X and Y on the
Cray and on the VAX for which X and Y are different, but for which X-Y
is zero.

Computing the VAX Maximum Real Value:

Using the same program that was used for the Cray, we get for the
maximum VAX real a value of .999... * 2^127:

  0 11111111 11111111111111111111111
  S E        M

  = +1 * .(1)111... * 2^(11111111 - 10000000)  (binary)
  = .9999... * 2^(127)
  = 0.1701 * 10^39


Summary

We've sketched out some preliminary investigations into the
arithmetic format of the computer.  You might be interested in the
alternative (and more sensible) IEEE format, described in
PSCDOC:IEEE.DOC.  You might want to compare what we found with what
the routines in MACHINE report, and with what the dynamic routines
MACHAR and PARANOIA compute.  PARANOIA also probes aspects of the
arithmetic used on the computer, including the use of guard digits,
truncation, and higher precision for intermediate results.

Even though the representation of numbers on a computer, and the
implementation of arithmetic with those numbers, should in most cases
be tangential to your line of work, I hope I've shown that there are
surprises for the wise and traps for the unwary in this field!