MATMUL, AN INTERACTIVE COMPUTER BENCHMARK
_________________________________________

  Introduction
  What is a Benchmark Program?
  Why Use a Benchmark at a Supercomputer Center?
  Why Use Matrix Multiplication for Benchmarking?
  The Method of Matrix Multiplication
  Turning a Method into an Algorithm
  Turning an Algorithm into a Program
  What the MATMUL Program Does
  How to Use MATMUL
  Measures of Computer Performance
  Measures of Program Performance
  Comparing C and FORTRAN
  Test 1: Compare Computers
  Test 2: Compare Rates for Different Problem Sizes
  Test 3: Compare Times for Different Problem Sizes
  Test 4: Compare Rates for Different Algorithms
  Suggested Projects for Further Study

Introduction
____________

These notes describe a computer program called MATMUL. MATMUL is a
benchmark program that measures how fast a computer carries out a large
problem involving multiplication and addition. MATMUL can compare
computers, languages and compilers, or algorithms.

Some of the results can be explained very simply; other things that
MATMUL finds may require learning about the computer's insides. Even if
we don't understand WHY the MATMUL program runs slowly on some problems,
we can use the information as a warning to avoid certain methods.

What is a Benchmark Program?
____________________________

A benchmark program measures computer power in an understandable way.

  * The program is chosen because it is "average" or "typical".
  * The program is timed on the computer.
  * The timing suggests how fast the computer will run similar programs.

The timing results can also open up new questions:

  * How long does it take to solve a problem twice as big?
  * Is it twice as hard to get an answer that is twice as accurate?
  * Are there methods that are twice as fast as other methods?
  * Are there methods that use half as much memory?
  * Does this computer have extra trouble with some kinds of
    computations?

One amusing thing to say about a benchmark program: We don't care about
its results.
We just want to know how long it took to compute them!

Why Use a Benchmark at a Supercomputer Center?
______________________________________________

Supercomputers are much different from personal computers. There's a
lot of money tied up in a supercomputer, sometimes as much as twenty or
thirty million dollars. The company or lab or university acquiring a
supercomputer needs to answer some very serious questions:

  * How can I tell "how much" computer to buy?
  * How can I tell "how fast" a given computer can go?
  * How can someone who's using the computer tell when their program is
    running as fast as possible?

A benchmark program can help answer these questions.

Why Use Matrix Multiplication for Benchmarking?
_______________________________________________

The "sample" work that the MATMUL program carries out is matrix
multiplication. Two large tables of numbers are combined using
multiplication and addition.

Matrix multiplication is an EASY test to set up. It's easy to:

  * describe the task to be accomplished.
  * write a program to carry out this task, in many languages.
  * set up a problem of any size.
  * figure out the amount of work required.

Matrix multiplication has some features that make it a USEFUL test:

  * We can make the amount of work increase by making the problem
    larger.
  * The work is typical of scientific programs.
  * There are several versions of the algorithm.

The Method of Matrix Multiplication
___________________________________

MATMUL assumes a matrix is a square table of numbers, such as:

  11  5 -3
   2 19  7
   6  0 16

Matrix multiplication combines two such tables into a new one. Each
entry of the new table is computed by using a row of the first matrix
and a column of the second. We simply multiply corresponding pairs of
values and add up all the results.
For instance, suppose we had two matrices which we'll call "B" and "C":

  11  5 -3        1  2  3
   2 19  7        4  5  6
   6  0 16        7  8  9

      B               C

If we want to multiply the two matrices, we take a row of B and multiply
by a column of C. For instance, if we take the second row of B and
multiply by the third column of C:

  2 * 3 + 19 * 6 + 7 * 9 = 6 + 114 + 63 = 183

which gives us the value for the second row and third column of the
result. If we complete this calculation, we get this matrix, which
we'll call "A":

   10  23  36
  127 155 183
  118 140 162

If we want to write a formula for how the elements of A were computed,
we might recall how we computed element A(2,3). (A(2,3) means the
number from the matrix A that's in row 2 and column 3.)

  A(2,3) = B(2,1)*C(1,3) + B(2,2)*C(2,3) + B(2,3)*C(3,3)

We multiplied elements in B whose first index was "2" by elements of C
whose last index was "3". And we always chose a pair of elements to
multiply where the last index of the B entry was equal to the first
index of the C entry.

Turning a Method into an Algorithm
__________________________________

To use a computer, we have to turn our method of calculation into an
algorithm. An algorithm is a very specific blueprint for how the
mathematical operations of a computation are to be carried out.

If we continue with our example of multiplying 3 by 3 matrices, we need
to write a formula for ANY entry of the matrix A, not just row 2,
column 3. Remember, our original formula could be written:

  A(2,3) = B(2,1)*C(1,3) + B(2,2)*C(2,3) + B(2,3)*C(3,3)

If we want to rewrite this for any row or column, we need to use
symbolic names for the row and column. Let "I" stand for the row, and
"J" the column of the entry of A that we wish to calculate:

  A(I,J) = B(I,1)*C(1,J) + B(I,2)*C(2,J) + B(I,3)*C(3,J)

This description is abstract enough to describe how to compute any entry
of a 3 by 3 matrix. But it wouldn't work with a matrix of a different
size.
To take care of that problem, we have to take one more step into
abstraction. Rather than write out the sum, we need to describe the
parts of the sum, and add them up one by one.

First we need to describe how we pick each pair of numbers to multiply.
We know that that's simply every pair of entries of B and C where B's
second index is the same as C's first one. Let's call that common value
"K". Then a typical pair of values is:

  B(I,K) and C(K,J)

And we know that we have to multiply this pair and add it to A(I,J).

Here's what's interesting, though. There's no particular order that we
have to do that in. Suppose our first step is to add B(1,1)*C(1,1) to
A(1,1). What do we do second?

  * We could keep working on A(1,1), by adding the next term,
    B(1,2)*C(2,1).
  * We could compute the next term involving B(1,1), namely
    B(1,1)*C(1,2), and add that to A(1,2).
  * We could compute the next term involving C(1,1), namely
    B(2,1)*C(1,1), and add that to A(2,1).

It should be clear that there are at least three ways of going about
this computation already. Actually, there turn out to be six, because
each of the three ways we've talked about offers one more choice between
two things! In fact, there are several other interesting ways of
carrying out this calculation, which people have claimed are faster than
these simple methods. Using MATMUL, we will be able to test a claim
like that!

We have described how we would compute any matrix entry, but our
description is not yet an algorithm. It doesn't say which element
should be computed first. It doesn't say whether we should complete the
calculation for one element before starting the next one. All of these
details must be decided for an algorithm.

The natural way to do matrix multiplication would be to compute the
values of the result matrix one at a time. Presumably we want to start
with A(1,1).
When we add up the entries for A(1,1), we might as well work from left
to right. Once we've computed A(1,1), however, we still have to decide
whether we compute A(1,2) or A(2,1) next. Let's say we do A(1,2) next.
This is enough to decide what our algorithm will be. A sketch of our
computation would be as follows:

  Set all entries of A to 0.
  For each row I from 1 to 3, and
    For each column J from 1 to 3, and
      For each value K from 1 to 3,
        Add B(I,K)*C(K,J) to A(I,J)

With a few translations, this sketch can be written down line for line
as a computer program to carry out matrix multiplication.

Turning an Algorithm into a Program
___________________________________

In order to "implement" an algorithm, we have to choose a computer
language, such as FORTRAN, and write instructions in that language.
Here is a portion of a FORTRAN program that would carry out the IJK
method. We're assuming that the matrices B and C have already been
assigned their values:

  real a(n,n)
  real b(n,n)       <-- Set aside space for the matrices
  real c(n,n)

  do i = 1, n
    do j = 1, n
      a(i,j) = 0.0  <-- Set A to zero
    end do
  end do

  do i = 1, n
    do j = 1, n
      do k = 1, n
        a(i,j) = a(i,j) + b(i,k) * c(k,j)  <-- The key line
      end do
    end do
  end do

What the MATMUL Program Does
____________________________

The MATMUL program multiplies two matrices, and reports the time
required. MATMUL has been written so that it is:

  * interactive; a user controls what the program does.
  * flexible; the user has many choices about what to test.
  * portable; the program can be run on different computers.

The maximum size matrix that can be used is controlled by MATMUL, and
varies depending on the computer. We don't want to allow the user to
try a problem that needs too much memory, or takes too much time!
  Computer                Maximum Size   Number of    Number of
                            Allowed       Entries     Operations
                            (N by N)      (3*N*N)     (2*N*N*N)

  IBM PC, Macintosh, &
    DECstation 2100         65 by  65      12,675        549,250
  VAX/VMS                  257 by 257     198,147     33,949,186
  Cray YMP                 513 by 513     789,507    270,011,394

How to Use MATMUL
_________________

You start the MATMUL program simply by typing "MATMUL". The program
will say hello, and then ask you for a command. The program will
execute your commands until you tell the program to stop.

In a simple test, you might tell MATMUL:

  * the size of the matrix to use;                 (N=10)
  * the multiplication method to use;              (ORDER=IJK)
  * to go ahead and carry out the multiplication;  (M)

When MATMUL is done, it prints out a table of information describing how
much work it did, how long it took, and so on, like this:

  ORDER  LDA    N        Time     Ops   MFLOPS
  IJK    257   10  0.5448E-04    2000  36.7108

The first three columns report values you specified. The other entries
are:

  "Time" is the time required to multiply the matrices.
  "Ops" is the number of additions and multiplications carried out.
  "MFLOPS" is the "rate" at which the computer solved the problem,
    measured in millions of floating point operations per second.

The legal multiplication methods include:

  * any one of IJK, IKJ, JIK, JKI, KIJ, KJI;
  * the SGEMM method;
  * the SAXPY method;
  * ALL methods.

Measures of Computer Performance
________________________________

There are many ways to rate a computer, including:

  * the computer's clock speed.
  * the number of instructions per second that can be issued.
  * the number of floating point operations per second that can be done.

Each rating ignores some features and emphasizes others:

  * the clock speed does NOT represent the time an instruction takes.
    Many instructions take several or many clock ticks to complete.
  * many computer instructions carry out no useful work from the user's
    point of view.
  * especially on the Cray, a single computer instruction may result in
    50 or 60 computations being carried out.
  * the floating point speed won't be a useful measure for programs that
    use predominantly integer operations, or input/output.

For scientific computing, which typically deals with vast lists of
decimal numbers to be multiplied and added, the floating point speed is
the most useful measure of speed and performance. Floating point speed
is measured in "millions of floating point operations per second", a
unit also called "MegaFLOPS".

A computer can be rated by figuring out the maximum amount of floating
point work it can do in a second. This is a "theoretical" limit.
There's no guarantee that a program will run that fast; it just can't
run any faster than that rate.

A computer like the VAX runs at roughly 1 MegaFLOP. Each processor of
the Cray YMP can run at up to 333 MegaFLOPS, and there are 8 processors
on the PSC's Cray YMP. This means that, in terms of power, a Cray YMP
running at top speed is roughly 8 * 333 = 2,664 times as powerful as a
VAX. And a VAX is not a cheap, or tiny, or very slow machine, by any
means!

Measures of Program Performance
_______________________________

A single processor on the Cray CAN run as fast as 333 MegaFLOPS. When a
real program is run on the Cray, the processor WILL run slower, usually
much slower! It's important to know how to measure this speed, so that
we can find out how close our program comes to the maximum speed, to
compare algorithms, and to help us improve program speeds.

We can figure out the speed at which a program runs if we have two
quantities:

  * the total processing time used,
  * the number of floating point operations carried out.

Most computers have a way of reporting the processing time. The number
of floating point operations can be harder to account for. But for
matrix multiplication it's easy.

For instance, suppose we know that we multiplied two 300 by 300 matrices
in 10 seconds on a particular computer. Multiplying two 300 by 300
matrices involves 2*300*300*300 operations.
So if we divide the amount of work by the time, we can compute the rate:

  (2*300*300*300 operations) / (10 seconds) = 5,400,000 FLOPS
                                            = 5.4 MegaFLOPS.

Once we have a rate like "5.4 MegaFLOPS" for our program, we can use it
to make comparisons with the rate:

  * for top speed on that computer,
  * of the same program on another computer,
  * of the same program in a different language,
  * of an "improved version" of the program.

Comparing C and FORTRAN
_______________________

A copy of MATMUL was recently converted to C. The copy only implements
the six DO loop methods. However, we can make some interesting
comparisons with the FORTRAN code.

A large problem was solved with both C and FORTRAN. The methods IJK,
IKJ, JIK, JKI, KIJ and KJI were all run. The highest and lowest
MegaFLOP rates were recorded:

  Computer          FORTRAN        C   FORTRAN       C
                       High     High       Low     Low

  Cray YMP           213      213       154      31
  VAX/VMS              2.99     1.36      0.65    0.63
  DECstation 5000      7.20     2.07      4.92    0.94
  DECstation 2100      3.32     0.98      2.66    0.65

These results are not surprising; FORTRAN was designed to be highly
efficient for numerical calculations. C has other priorities.

We can also use the C version of MATMUL to investigate the C compiler
itself. For instance, most of our computers have a C compiler provided
by the computer's vendor. But there is also a compiler, known as "GCC",
provided by GNU, which is freely available and portable to most
computers. We have compared the performance of GCC and the vendor's C
compiler, and in general have found the vendor's C compiler to be
faster.

C compilers also have an optimization option. Generally, a C compiler
will not perform optimization unless explicitly asked to, via a switch
such as "-O" on the compile command. We have seen speedups of a factor
of 2 or 3 when optimization is requested while compiling MATMUL.

Finally, there is the question of precision. The variables in MATMUL
are declared as "floats".
However, a version of MATMUL could be prepared in which the variables
were declared as "double", which would normally allocate twice as much
space, and produce an answer roughly twice as accurate. Such higher
accuracy can slow down a computation, and not necessarily by a factor
of 2.

Moreover, surprisingly, a "double" is not necessarily twice as long as a
"float". On the Cray, both types are allocated a single 64 bit word.
So in fact, to investigate this phenomenon, the "long double" type had
to be specified, which resulted in a 128 bit word.

Test 1: Compare Computers
_________________________

The following test was done using just the IJK algorithm. In order to
include the IBM PC in the tables, we have to run small problems. The
Cray will not be running at peak efficiency for small problems, but we
get a feel for how different a PC and a Cray can be! Times are in
seconds.

   N   IBM PC   Macintosh   DECstation   VAX/VMS   Cray YMP
                   Plus         2100       6520       8/32
  __   ______   _________   __________   _______   ________

  10      0.8        1.0       0.0039      0.001    0.00005
  20      6.3        4.0       0.0420      0.010    0.00022
  30     21.0       13.0       0.1445      0.040    0.00055
  40     49.4       30.0       0.3398      0.090    0.00109
  50     96.1       60.0       0.6601      0.190    0.00186
  60    165.7      103.0       1.1410      0.320    0.00296

The VAX takes about 100 times as long as the Cray, the DECstation 300
times as long, the Macintosh 30,000 times as long, and the IBM PC
50,000 times as long.

Test 2: Compare Rates for Different Problem Sizes
_________________________________________________

This test was run on the Cray. We used the IJK algorithm, with many
problem sizes, and printed the rate.

    N   MFLOPS   Graph of Performance
                 (One * equals 10 MegaFLOPS.)
  ___   ______   ____________________________

    1        1   .
    2        2   .
    5       14   .*
   10       34   .***
   20       73   .*******
   30      102   .**********
   40      120   .************
   50      137   .*************
   60      150   .***************
   80      150   .***************
  100      165   .****************
  120      179   .*****************
  140      170   .*****************
  160      183   .******************
  180      191   .*******************
  200      182   .******************
  220      190   .*******************
  240      194   .*******************

What does this information tell us? The Cray can run at a top speed of
333 MegaFLOPS. We never get that high. We do get to within about 60%
of top speed. We also see that our program takes a while to
"accelerate" to its cruising speed of roughly 190 MegaFLOPS. For small
problems, the Cray does not run very fast!

Test 3: Compare Times for Different Problem Sizes
_________________________________________________

We repeated the tests, but recorded the times instead. The time is
measured in millionths of a second.

    N     Time   Graph of Performance
                 (One * equals .005 seconds.)
  ___   ______   ____________________________

    1        5   .
    2        6   .
    5       18   .
   10       58   .
   20      220   .
   30      532   .
   40     1068   .
   50     1827   .
   60     2891   .
   80     6892   .*
  100    12145   .**
  120    19343   .***
  140    32137   .******
  160    44842   .********
  180    61095   .************
  200    88029   .*****************
  220   111846   .**********************
  240   142367   .****************************

What does this information tell us? The time required is very small for
small problems, but "blows up" as N increases. The time does not
increase linearly. It is going up more sharply than that. We know that
as N increases, the amount of work goes up like N*N*N. It makes sense
that the time would increase in a similar way, that is, cubically.

Test 4: Compare Rates for Different Algorithms
______________________________________________

The following test was done using a 256 by 256 matrix, and all methods.

  Method   Cray YMP   Graph of Cray Performance
             8/32     (One * equals 10 MegaFLOPS.)
  ______   ________   ____________________________
  IJK         184     .******************
  IKJ         127     .************
  JIK         190     .*******************
  JKI         186     .******************
  KIJ         125     .************
  KJI         185     .******************
  MXMA        308     .******************************
  SAXPY       104     .**********
  SAXPYC      160     .****************
  SGEMM       175     .*****************
  SGEMMC      305     .******************************
  SGEMMS      355     .***********************************

The value of the arithmetic rate is very useful. It allows us first of
all to compare algorithms to each other. So overall, we can say that
the SGEMMS algorithm is the fastest of this bunch.

Secondly, the rate allows us to ask questions about groups of related
algorithms. The simplest six codes seem to fall into two groups: IJK,
JIK, JKI and KJI, all of which run at about 190 MegaFLOPS, and IKJ and
KIJ, which run at about 130 MegaFLOPS. The two slow codes both have "J"
as the inner index. This is a clue to WHY they are slow.

Similarly, we can see that the special Cray versions of some routines
are faster than the regular ones: SAXPYC is faster than SAXPY, and
SGEMMC is faster than SGEMM.

Finally, we can compare the maximum theoretical speed of 333 MegaFLOPS
against the speed we achieve. Here we come to an extremely strange
result. The SGEMMS routine seems to run at 355 MegaFLOPS, faster than
possible. This is not a mistake; there's actually some "magic" in this
routine that allows it to get the answer faster than we would have
expected was possible.

Suggested Projects for Further Study
____________________________________

You can try to run MATMUL on your own computer. We can give you the
program on a computer disk for an IBM or Macintosh. You can simply call
(412)-268-6485 and talk to John Burkardt, the software librarian, about
how to get a copy. The disk includes the source code as well, so you
can see how the program works, and if you have a FORTRAN compiler, you
can change the program as well.

Is PASCAL faster than C? Is interpreted BASIC slower than compiled
BASIC?
Just because someone says so doesn't mean it's true. And how much
faster is one language than another? It probably depends on the kind of
work you are doing, as well. If you know two different languages, try
to set up a program in both that solves the same problem, and see what
this tells you about the speed of the languages.

There are many interactive programs available which allow you to set up
and solve mathematical problems. Such programs include MACSYMA, MAPLE,
MATHEMATICA, MATHCAD, MATLAB, and REDUCE. You should be able to set up
and solve a matrix multiplication problem. If you're lucky, you'll also
be able to request timing information from the program; otherwise, you
may have to use a stopwatch! Can you compare these packages, in terms
of the maximum problem size you can work on, and in terms of speed?

Can you figure out a formula that will explain the results? For
instance, we might expect that the formula would be something like

  Time = c * N*N*N

where N is the order of the matrix, and "c" is a number we don't know
yet. We have enough data to estimate what "c" should be, though, don't
we?

Instead of choosing a problem of a given size, and then comparing how
long it takes to solve that problem with different algorithms,
languages, or machines, you might ask instead: how large a problem can I
solve in one second? This is a different way of looking at the problem,
and it helps you to see some of the results more clearly. In that case,
for matrix multiplication, the answer would be about 10 for the PC and
Macintosh, 80 for the VAX, and 450 for the Cray.

You might be interested in writing a program like MATMUL that compares
various algorithms for a problem more interesting to you. Some problems
for which there are many algorithms of varying speeds include:

  * Sorting a list of data into order
  * Finding the shortest path between two cities
  * Solving a set of linear equations