MATMUL, AN INTERACTIVE COMPUTER BENCHMARK
_________________________________________

  Introduction
  What is a Benchmark Program?
  Why Use a Benchmark at a Supercomputer Center?
  Why Use Matrix Multiplication for Benchmarking?
  The Method of Matrix Multiplication
  Turning a Method into an Algorithm
  Turning an Algorithm into a Program
  What the MATMUL Program Does
  How to Use MATMUL
  Measures of Computer Performance
  Measures of Program Performance
  Comparing C and FORTRAN
  Test 1: Compare Computers
  Test 2: Compare Rates for Different Problem Sizes
  Test 3: Compare Times for Different Problem Sizes
  Test 4: Compare Rates for Different Algorithms
  Suggested Projects for Further Study

Introduction
____________

These notes describe a computer program called MATMUL. MATMUL is a
benchmark program that measures how fast a computer carries out a large
problem involving multiplication and addition. MATMUL can compare
computers, languages and compilers, or algorithms.

Some of the results can be explained very simply; other things that
MATMUL finds may require learning about the computer's insides. Even if
we don't understand WHY the MATMUL program runs slowly on some problems,
we can use the information as a warning to avoid certain methods.

What is a Benchmark Program?
____________________________

A benchmark program measures computer power in an understandable way.

  * The program is chosen because it is "average" or "typical".
  * The program is timed on the computer.
  * The timing suggests how fast the computer will run similar programs.

The timing results can also open up new questions:

  * How long does it take to solve a problem twice as big?
  * Is it twice as hard to get an answer that is twice as accurate?
  * Are there methods that are twice as fast as other methods?
  * Are there methods that use half as much memory?
  * Does this computer have extra trouble with some kinds of
    computations?

One amusing thing to say about a benchmark program: We don't care about
its results.
We just want to know how long it took to compute them!

Why Use a Benchmark at a Supercomputer Center?
______________________________________________

Supercomputers are much different from personal computers. There's a
lot of money tied up in a supercomputer, sometimes as much as twenty or
thirty million dollars. The company or lab or university acquiring a
supercomputer needs to answer some very serious questions:

  * How can I tell "how much" computer to buy?
  * How can I tell "how fast" a given computer can go?
  * How can someone who's using the computer tell when their program is
    running as fast as possible?

A benchmark program can help answer these questions.

Why Use Matrix Multiplication for Benchmarking?
_______________________________________________

The "sample" work that the MATMUL program carries out is matrix
multiplication. Two large tables of numbers are combined using
multiplication and addition.

Matrix multiplication is an EASY test to set up. It's easy to:

  * describe the task to be accomplished.
  * write a program to carry out this task, in many languages.
  * set up a problem of any size.
  * figure out the amount of work required.

Matrix multiplication has some features that make it a USEFUL test:

  * We can make the amount of work increase by making the problem
    larger.
  * The work is typical of scientific programs.
  * There are several versions of the algorithm.

The Method of Matrix Multiplication
___________________________________

MATMUL assumes a matrix is a square table of numbers, such as:

  11  5 -3
   2 19  7
   6  0 16

Matrix multiplication combines two such tables into a new one. Each
entry of the new table is computed by using a row of the first matrix
and a column of the second. We simply multiply corresponding pairs of
values and add up all the results.
For instance, suppose we had two matrices which we'll call "B" and "C":

  11  5 -3        1  2  3
   2 19  7        4  5  6
   6  0 16        7  8  9

      B               C

If we want to multiply the two matrices, we take a row of B and multiply
by a column of C. For instance, if we take the second row of B and
multiply by the third column of C:

  2 * 3 + 19 * 6 + 7 * 9 = 6 + 114 + 63 = 183

which gives us the value for the second row and third column of the
result. If we complete this calculation, we get this matrix, which
we'll call "A":

   10  23  36
  127 155 183
  118 140 162

If we want to write a formula for how the elements of A were computed,
we might recall how we computed element A(2,3). (A(2,3) means the
number from the matrix A that's in row 2 and column 3.)

  A(2,3) = B(2,1)*C(1,3) + B(2,2)*C(2,3) + B(2,3)*C(3,3)

We multiplied elements in B whose first index was "2" by elements of C
whose last index was "3". And we always chose a pair of elements to
multiply where the last index of the B entry was equal to the first
index of the C entry.

Turning a Method into an Algorithm
__________________________________

To use a computer, we have to turn our method of calculation into an
algorithm. An algorithm is a very specific blueprint for how the
mathematical operations of a computation are to be carried out.

If we continue with our example of multiplying 3 by 3 matrices, we need
to write a formula for ANY entry of the matrix A, not just row 2,
column 3. Remember, our original formula could be written:

  A(2,3) = B(2,1)*C(1,3) + B(2,2)*C(2,3) + B(2,3)*C(3,3)

If we want to rewrite this for any row or column, we need to use
symbolic names for the row and column. Let "I" stand for the row, and
"J" the column of the entry of A that we wish to calculate:

  A(I,J) = B(I,1)*C(1,J) + B(I,2)*C(2,J) + B(I,3)*C(3,J)

This description is abstract enough to describe how to compute any entry
of a 3 by 3 matrix. But it wouldn't work with a matrix of a different
size.
To take care of that problem, we have to take one more step into
abstraction. Rather than write out the sum, we need to describe the
parts of the sum, and add them up one by one.

First we need to describe how we pick each pair of numbers to multiply.
We know that that's simply every pair of entries of B and C where B's
second index is the same as C's first one. Let's call that common value
"K". Then a typical pair of values is:

  B(I,K) and C(K,J)

And we know that we have to multiply this pair and add it to A(I,J).

Here's what's interesting, though. There's no particular order that we
have to do that in. Suppose our first step is to add B(1,1)*C(1,1) to
A(1,1). What do we do second?

  * We could keep working on A(1,1), by adding the next term,
    B(1,2)*C(2,1).
  * We could compute the next term involving B(1,1), namely
    B(1,1)*C(1,2), and add that to A(1,2).
  * We could compute the next term involving C(1,1), namely
    B(2,1)*C(1,1), and add that to A(2,1).

It should be clear that there are at least three ways of going about
this computation already. Actually, there turn out to be six, because
each of the three ways we've talked about offers one more choice between
two things! In fact, there are several other interesting ways of
carrying out this calculation, which people have claimed are faster than
these simple methods. Using MATMUL, we will be able to test a claim
like that!

We have described how we would compute any matrix entry, but our
description is not yet an algorithm. It doesn't say which element
should be computed first. It doesn't say whether we should complete the
calculation for one element before starting the next one. All of these
details must be decided for an algorithm.

The natural way to do matrix multiplication would be to compute the
values of the result matrix one at a time. Presumably we want to start
with A(1,1).
When we add up the entries for A(1,1), we might as well work from left
to right. Once we've computed A(1,1), however, we still have to decide
whether we compute A(1,2) or A(2,1) next. Let's say we do A(1,2) next.
This is enough to decide what our algorithm will be. A sketch of our
computation would be as follows:

  Set all entries of A to 0.
  For each row I from 1 to 3, and
    For each column J from 1 to 3, and
      For each value K from 1 to 3,
        Add B(I,K)*C(K,J) to A(I,J)

With a few translations, this sketch can be written down line for line
as a computer program to carry out matrix multiplication.

Turning an Algorithm into a Program
___________________________________

In order to "implement" an algorithm, we have to choose a computer
language, such as FORTRAN, and write instructions in that language.
Here is a portion of a FORTRAN program that would carry out the IJK
method. We're assuming that the matrices B and C have already been
assigned their values:

  real a(n,n)
  real b(n,n)       <-- Set aside space for the matrices
  real c(n,n)

  do i = 1, n
    do j = 1, n
      a(i,j) = 0.0  <-- Set A to zero
    end do
  end do

  do i = 1, n
    do j = 1, n
      do k = 1, n
        a(i,j) = a(i,j) + b(i,k) * c(k,j)  <-- The key line
      end do
    end do
  end do

What the MATMUL Program Does
____________________________

The MATMUL program multiplies two matrices, and reports the time
required. MATMUL has been written so that it is:

  * interactive; a user controls what the program does.
  * flexible; the user has many choices about what to test.
  * portable; the program can be run on different computers.

The maximum size matrix that can be used is controlled by MATMUL, and
varies depending on the computer. We don't want to allow the user to
try a problem that needs too much memory, or takes too much time!
  Computer                Maximum Size   Number of    Number of
                            Allowed       Entries     Operations
                            (N by N)      (3*N*N)     (2*N*N*N)

  IBM PC, Macintosh, &
    DECstation 2100         65 by  65      12,675        549,250
  VAX/VMS                  257 by 257     198,147     33,949,186
  Cray YMP                 513 by 513     789,507    270,011,394

How to Use MATMUL
_________________

You start the MATMUL program simply by typing "MATMUL". The program
will say hello, and then ask you for a command. The program will
execute your commands until you tell the program to stop.

In a simple test, you might tell MATMUL:

  * the size of the matrix to use;                 (N=10)
  * the multiplication method to use;              (ORDER=IJK)
  * to go ahead and carry out the multiplication;  (M)

When MATMUL is done, it prints out a table of information describing how
much work it did, how long it took, and so on, like this:

  ORDER  LDA    N        Time     Ops   MFLOPS
  IJK    257   10  0.5448E-04    2000  36.7108

The first three columns report values you specified. The other entries
are:

  "Time" is the time required to multiply the matrices.
  "Ops" is the number of additions and multiplications carried out.
  "MFLOPS" is the "rate" at which the computer solved the problem,
    measured in millions of floating point operations per second.

The legal multiplication methods include:

  * any one of IJK, IKJ, JIK, JKI, KIJ, KJI;
  * the SGEMM method;
  * the SAXPY method;
  * ALL methods.

Measures of Computer Performance
________________________________

There are many ways to rate a computer, including:

  * the computer's clock speed.
  * the number of instructions per second that can be issued.
  * the number of floating point operations per second that can be done.

Each rating ignores some features and emphasizes others:

  * the clock speed does NOT represent the time an instruction takes.
    Many instructions take several or many clock ticks to complete.
  * many computer instructions carry out no useful work from the user's
    point of view.
  * especially on the Cray, a single computer instruction may result in
    50 or 60 computations being carried out.
  * the floating point speed won't be a useful measure for programs that
    use predominantly integer operations, or input/output.

For scientific computing, which typically deals with vast lists of
decimal numbers to be multiplied and added, the floating point speed is
the most useful measure of speed and performance. Floating point speed
is measured in "millions of floating point operations per second", a
unit also called "MegaFLOPS".

A computer can be rated by figuring out the maximum amount of floating
point work it can do in a second. This is a "theoretical" limit.
There's no guarantee that a program will run that fast; it just can't
run any faster than that rate.

A computer like the VAX runs at roughly 1 MegaFLOP. Each processor of
the Cray YMP can run at up to 333 MegaFLOPS, and there are 8 processors
on the PSC's Cray YMP. This means that, in terms of power, a Cray YMP
running at top speed is roughly 8 * 333 = 2,664 times as powerful as a
VAX. And a VAX is not a cheap, or tiny, or very slow machine, by any
means!

Measures of Program Performance
_______________________________

A single processor on the Cray CAN run as fast as 333 MegaFLOPS. When a
real program is run on the Cray, the processor WILL run slower, usually
much slower! It's important to know how to measure this speed, so that
we can find out how close our program comes to the maximum speed, to
compare algorithms, and to help us improve program speeds.

We can figure out the speed at which a program runs if we have two
quantities:

  * the total processing time used,
  * the number of floating point operations carried out.

Most computers have a way of reporting the processing time. The number
of floating point operations can be harder to account for. But for
matrix multiplication it's easy.

For instance, suppose we know that we multiplied two 300 by 300 matrices
in 10 seconds on a particular computer. Multiplying two 300 by 300
matrices involves 2*300*300*300 operations.
So if we divide the amount of work by the time, we can compute the rate:

  (2*300*300*300 operations) / (10 seconds) = 5,400,000 FLOPS
                                            = 5.4 MegaFLOPS.

Once we have a rate like "5.4 MegaFLOPS" for our program, we can use it
to make comparisons with the rate:

  * for top speed on that computer,
  * of the same program on another computer,
  * of the same program in a different language,
  * of an "improved version" of the program.

Comparing C and FORTRAN
_______________________

A copy of MATMUL was recently converted to C. The copy only implements
the six DO loop methods. However, we can make some interesting
comparisons with the FORTRAN code.

A large problem was solved with both C and FORTRAN. The methods IJK,
IKJ, JIK, JKI, KIJ and KJI were all run. The highest and lowest
MegaFLOP rates were recorded:

  Computer          FORTRAN        C   FORTRAN       C
                       High     High       Low     Low

  Cray YMP           213      213       154      31
  VAX/VMS              2.99     1.36      0.65    0.63
  DECstation 5000      7.20     2.07      4.92    0.94
  DECstation 2100      3.32     0.98      2.66    0.65

These results are not surprising; FORTRAN was designed to be highly
efficient for numerical calculations. C has other priorities.

We can also use the C version of MATMUL to investigate the C compiler
itself. For instance, most of our computers have a C compiler provided
by the computer's vendor. But there is also a compiler, known as "GCC",
provided by GNU, which is freely available and portable to most
computers. We have compared the performance of GCC and the vendor's C
compiler, and in general have found the vendor's C compiler to be
faster.

C compilers also have an optimization option. Generally, a C compiler
will not perform optimization unless explicitly asked to, via a switch
such as "-O" on the compile command. We have seen speedups of a factor
of 2 or 3 when optimization is requested while compiling MATMUL.

Finally, there is the question of precision. The variables in MATMUL
are declared as "floats".
However, a version of MATMUL could be prepared in which the variables
were declared as "double", which would normally allocate twice as much
space, and produce an answer roughly twice as accurate. Such higher
accuracy can slow down a computation, and not necessarily by a factor
of 2.

Moreover, surprisingly, a "double" is not necessarily twice as long as a
"float". On the Cray, both types are allocated a single 64 bit word.
So in fact, to investigate this phenomenon, the "long double" type had
to be specified, which resulted in a 128 bit word.

Test 1: Compare Computers
_________________________

The following test was done using just the IJK algorithm. In order to
include the IBM PC in the tables, we have to run small problems. The
Cray will not be running at peak efficiency for small problems, but we
get a feel for how different a PC and a Cray can be! Times are in
seconds.

   N   IBM PC   Macintosh   DECstation   VAX/VMS   Cray YMP
                   Plus         2100       6520       8/32
  __   ______   _________   __________   _______   ________

  10      0.8        1.0       0.0039      0.001    0.00005
  20      6.3        4.0       0.0420      0.010    0.00022
  30     21.0       13.0       0.1445      0.040    0.00055
  40     49.4       30.0       0.3398      0.090    0.00109
  50     96.1       60.0       0.6601      0.190    0.00186
  60    165.7      103.0       1.1410      0.320    0.00296

The VAX takes about 100 times as long as the Cray, the DECstation 300
times as long, the Macintosh 30,000 times as long, and the IBM PC
50,000 times as long.

Test 2: Compare Rates for Different Problem Sizes
_________________________________________________

This test was run on the Cray. We used the IJK algorithm, with many
problem sizes, and printed the rate.

    N   MFLOPS   Graph of Performance
                 (One * equals 10 MegaFLOPS.)
  ___   ______   ____________________________

    1        1   .
    2        2   .
    5       14   .*
   10       34   .***
   20       73   .*******
   30      102   .**********
   40      120   .************
   50      137   .*************
   60      150   .***************
   80      150   .***************
  100      165   .****************
  120      179   .*****************
  140      170   .*****************
  160      183   .******************
  180      191   .*******************
  200      182   .******************
  220      190   .*******************
  240      194   .*******************

What does this information tell us? The Cray can run at a top speed of
333 MegaFLOPS. We never get that high. We do get to within about 60%
of top speed. We also see that our program takes a while to
"accelerate" to its cruising speed of roughly 190 MegaFLOPS. For small
problems, the Cray does not run very fast!

Test 3: Compare Times for Different Problem Sizes
_________________________________________________

We repeated the tests, but recorded the times instead. The time is
measured in millionths of a second.

    N     Time   Graph of Performance
                 (One * equals .005 seconds.)
  ___   ______   ____________________________

    1        5   .
    2        6   .
    5       18   .
   10       58   .
   20      220   .
   30      532   .
   40     1068   .
   50     1827   .
   60     2891   .
   80     6892   .*
  100    12145   .**
  120    19343   .***
  140    32137   .******
  160    44842   .********
  180    61095   .************
  200    88029   .*****************
  220   111846   .**********************
  240   142367   .****************************

What does this information tell us? The time required is very small for
small problems, but "blows up" as N increases. The time does not
increase linearly. It is going up more sharply than that. We know that
as N increases, the amount of work goes up like N*N*N. It makes sense
that the time would increase in a similar way, that is, cubically.

Test 4: Compare Rates for Different Algorithms
______________________________________________

The following test was done using a 256 by 256 matrix, and all methods.

  Method   Cray YMP   Graph of Cray Performance
             8/32     (One * equals 10 MegaFLOPS.)
  ______   ________   ____________________________
  IJK         184     .******************
  IKJ         127     .************
  JIK         190     .*******************
  JKI         186     .******************
  KIJ         125     .************
  KJI         185     .******************
  MXMA        308     .******************************
  SAXPY       104     .**********
  SAXPYC      160     .****************
  SGEMM       175     .*****************
  SGEMMC      305     .******************************
  SGEMMS      355     .***********************************

The value of the arithmetic rate is very useful. It allows us first of
all to compare algorithms to each other. So overall, we can say that
the SGEMMS algorithm is the fastest of this bunch.

Secondly, the rate allows us to ask questions about groups of related
algorithms. The simplest six codes seem to fall into two groups: IJK,
JIK, JKI and KJI, all of which run at about 190 MegaFLOPS, and IKJ and
KIJ, which run at about 130 MegaFLOPS. The two slow codes both have "J"
as the inner index. This is a clue to WHY they are slow.

Similarly, we can see that the special Cray versions of some routines
are faster than the regular ones: SAXPYC is faster than SAXPY, and
SGEMMC is faster than SGEMM.

Finally, we can compare the maximum theoretical speed of 333 MegaFLOPS
against the speed we achieve. Here we come to an extremely strange
result. The SGEMMS routine seems to run at 355 MegaFLOPS, faster than
possible. This is not a mistake; there's actually some "magic" in this
routine that allows it to get the answer faster than we would have
expected was possible.

Suggested Projects for Further Study
____________________________________

You can try to run MATMUL on your own computer. We can give you the
program on a computer disk for an IBM or Macintosh. You can simply call
(412)-268-6485 and talk to John Burkardt, the software librarian, about
how to get a copy. The disk includes the source code as well, so you
can see how the program works, and if you have a FORTRAN compiler, you
can change the program as well.

Is PASCAL faster than C? Is interpreted BASIC slower than compiled
BASIC?
Just because someone says so doesn't mean it's true. And how much
faster is one language than another? It probably depends on the kind of
work you are doing, as well. If you know two different languages, try
to set up a program in both that solves the same problem, and see what
this tells you about the speed of the languages.

There are many interactive programs available which allow you to set up
and solve mathematical problems. Such programs include MACSYMA, MAPLE,
MATHEMATICA, MATHCAD, MATLAB, and REDUCE. You should be able to set up
and solve a matrix multiplication problem. If you're lucky, you'll also
be able to request timing information from the program; otherwise, you
may have to use a stopwatch! Can you compare these packages, in terms
of the maximum problem size you can work on, and in terms of speed?

Can you figure out a formula that will explain the results? For
instance, we might expect that the formula would be something like

  Time = c * N*N*N

where N is the order of the matrix, and "c" is a number we don't know
yet. We have enough data to estimate what "c" should be, though, don't
we?

Instead of choosing a problem of a given size, and then comparing how
long it takes to solve that problem with different algorithms,
languages, or machines, you might ask instead: how large a problem can I
solve in one second? This is a different way of looking at the problem,
and it helps you to see some of the results more clearly. In that case,
for matrix multiplication, the answer would be about 10 for the PC and
Macintosh, 80 for the VAX, and 450 for the Cray.

You might be interested in writing a program like MATMUL that compares
various algorithms for a problem more interesting to you. Some problems
for which there are many algorithms of varying speeds include:

  * Sorting a list of data into order
  * Finding the shortest path between two cities
  * Solving a set of linear equations