PSCDOC:PERFORMANCE.DOC                                 30 September 1991

PERFORMANCE utilities monitor and report on user program efficiency.
Some quantities that can be observed include the elapsed time, the
number of arithmetic operations, the number of I/O requests, and the
MegaFLOP rate.  Reports can be prepared for a whole program, for each
subroutine, or in some cases for individual loops or pieces of code.
Cray Research has provided a number of utilities and routines, the
most useful of which include HPM, FLOWTRACE, PERFTRACE and SECOND.

Help:      man performance
Document:  performance.doc
Examples:  io.f, linsol.f, lintst.f, quad.f, sample.f

UNICOS usage:

  The instructions for each performance monitor are in separate
  documents, which you should refer to.

See also:  FLOWTRACE, FORGE, HPM, JA, LOOPMARK, MATMUL, PCA,
           PERFTRACE, PROCSTAT, PROF, PSR, SCILIB, SCOUNT, SECOND,
           TIME, TIMEX, VECTORIZE


                      Cray Performance Monitoring

                          30 September 1991

                  Pittsburgh Supercomputing Center


 1). Introduction
 2). Exercise: Elapsed CPU time
 3). Use of SECOND and SECONDR
 4). Counting floating point operations, and floating point speed
 5). Exercise: The MegaFLOP rate of matrix multiplication
 6). Getting a MegaFLOP rate using HPM
 7). Exercise: Using HPM
 8). Tracking down a problem with FLOWTRACE
 9). Exercise: Using FLOWTRACE
10). Conclusions

Exercises for further study:

11). When does a loop vectorize?
12). Using SECOND to time little things
13). Using LOOPMARK to get a nice compiler listing
14). Using SCOUNT to count statement executions
15). Using HPM to compare multiplication to division
16). Using FLOWTRACE on the LINPACK benchmark
17). When memory requests collide
18). A sample program to rewrite and benchmark


1). Introduction

Performance monitoring is the measurement of the computer resources
used by a program.  Typical quantities measured can include:

  the elapsed CPU time,
  the maximum and average memory used,
  the number of instructions executed,
  the number of processors used, if running in parallel,
  the number of logical and arithmetic operations,
  the number of floating point operations,
  the number of I/O operations carried out.

The Cray has a limited ability to run programs using more than one
processor.  For simplicity's sake, we will assume that we are using
just one processor.  That's the default, anyway.

For typical supercomputing applications, the two important quantities
are the number of floating point operations and the elapsed CPU time.
Their ratio, measured in millions of floating point operations per
second, or MegaFLOPS, is commonly used to rate scientific computers
and programs.  We will restrict our attention from now on to just the
quantities that define MegaFLOPS, that is, the elapsed CPU time and
the number of operations.


2). Exercise: Elapsed CPU time

Let's run a program which carries out some work and reports the time
and work required.  The work being done is matrix multiplication.
We'll have the program solve a problem using all the algorithms it
knows about.

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=order      <-- Turn on print out of ORDER
  show=time       <-- Turn on print out of TIME
  lda=513         <-- Sets the size of the FORTRAN array
  order=all       <-- Chooses all the algorithms
  n=256           <-- Picks a problem size.
  m               <-- Carries out the multiplication.
  q               <-- quit
  y               <-- Yes, I really want to quit!
Put your results here (two decimals are enough):

  ORDER     Time

  IJK     __________
  IKJ     __________
  JIK     __________
  JKI     __________
  KIJ     __________
  KJI     __________
  MXMA    __________
  SAXPY   __________
  SAXPYC  __________
  SGEMM   __________
  SGEMMC  __________
  SGEMMS  __________

Here are the results I got, to two decimal places:

  IJK     0.18
  IKJ     0.26
  JIK     0.18
  JKI     0.18
  KIJ     0.27
  KJI     0.18
  MXMA    0.11
  SAXPY   0.35
  SAXPYC  0.22
  SGEMM   0.20
  SGEMMC  0.11
  SGEMMS  0.09

You would NEVER be able to guess from the source code which method
would run faster.  You have to know about the machine you're using to
make that judgment.  The SECOND routine can help you to see the
difference.


3). The use of SECOND and SECONDR for measuring performance

In the MATMUL program that you just ran, the program printed out the
value of the elapsed CPU time.  This measurement was made by calling
the Cray system routine SECOND.  SECOND measures the amount of CPU
time that has elapsed since the job began.  By calling SECOND before
and after some operation, and subtracting the values returned, you
can measure the "cost" of that operation in CPU time.

Here is a typical use of the routine:

      CALL SECOND(TIME1)

      DO 10 I=1,1000
        X(I)=SQRT(Y(I))
   10 CONTINUE

      CALL SECOND(TIME2)

      WRITE(*,*)'Elapsed CPU time=',TIME2-TIME1,' seconds.'

Using SECOND you can time anything: a DO loop, a subroutine, a whole
program.  SECOND is ideal when:

  you KNOW what portion of a program you want to measure,
  you want to measure just a few chunks of the program,
  the measured portions are executed only a few times.

SECOND can be painful to use if:

  you're searching for the portion of a program that uses a lot of
    time,
  you're interested in timing many portions of a program,
  the code to be timed executes very quickly (timings will not be
    accurate),
  the code is executed many times (you have to save the timings and
    print them later).

Moreover, you have to have access to the source code of the program
you want to evaluate.  And SECOND, by itself, doesn't tell you
anything about how efficiently you used the Cray.

There is a similar routine, called SECONDR, for computing the elapsed
real time, that is, the actual time that has passed, including time
when your program is swapped out and not actually executing.  The
time value measured by SECONDR should usually be larger than that
measured by SECOND.  However, a serious discrepancy in the time
values may indicate excessive overhead and swapping due to poorly
handled I/O.
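Here is a minimal sketch of how you might watch both clocks at once.
This assumes that SECONDR, like SECOND, is a subroutine called with a
single real argument; the array size and the loop body are arbitrary:

      PARAMETER (N=100000)
      DIMENSION X(N)
C
C  Read both clocks, do some work, and read them again.
C
      CALL SECOND(CPU1)
      CALL SECONDR(WALL1)

      DO 10 I=1,N
        X(I)=SQRT(REAL(I))
   10 CONTINUE

      CALL SECOND(CPU2)
      CALL SECONDR(WALL2)
C
C  Print one result so the compiler can't discard the loop.
C
      WRITE(*,*)'X(N)     =',X(N)
      WRITE(*,*)'CPU time =',CPU2-CPU1,' seconds.'
      WRITE(*,*)'Real time=',WALL2-WALL1,' seconds.'
      STOP
      END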
4). Counting floating point operations, and floating point speed

We know how to time a program or program fragment.  But we also need
to measure the amount of work done in that time.  For scientific
programs running on scientific computers, a good measure of work is
the number of floating point arithmetic operations: additions,
multiplications, and sometimes divisions.

If we can estimate the number of operations involved in a program,
and get the elapsed CPU time via SECOND, we can compute the number of
floating point operations per second, called FLOPS.  Usually, this is
counted in millions, or MegaFLOPS.  This gives us a rate, or speed,
of computation.

Such a rate can give us an absolute scale of measurement, because the
maximum MegaFLOP rating on the Cray is known.  You can figure out the
Cray's computational rate from two facts:

  The Cray has a clock cycle of 6 nanoseconds.
  The Cray can do a (vectorized) add and multiply in one clock cycle.

Thus we can do 2 floating point operations in 6 nanoseconds.  The
maximum MegaFLOP rating on the Cray is therefore

  2 ops / 6 nanoseconds = 2 thousand million ops / 6 seconds
                        = 333 MegaFLOPS.

This value of 333 can be used to measure your program's performance.
Roughly speaking, if your program is performing at over 100
MegaFLOPS, you are achieving superior performance.  If you are
performing at under 10 MegaFLOPS, your program probably shouldn't be
running on the Cray at all.

Let's do this measurement for the simple case of matrix
multiplication.  Assuming the computer organizes the calculation the
way we would, we can figure out that each element of the result
matrix will require being set to zero, and then having N pairs of
values multiplied together and added:

      C(I,J)=0.0
      C(I,J)=C(I,J)+A(I,1)*B(1,J)
      C(I,J)=C(I,J)+A(I,2)*B(2,J)
      ...
      C(I,J)=C(I,J)+A(I,N)*B(N,J)

Ignoring the initial assignment to zero, each element will require N
multiplications and N additions, or 2*N operations.  And so the whole
N*N matrix will require 2*N*N*N operations to compute.

Armed with this information, we can choose a matrix size, figure out
the amount of work required, use SECOND to get the elapsed CPU time,
and come up with a computational rate.  This will tell us how well
the program is doing on the Cray, compared to the maximal rate of 333
MegaFLOPS.


5). Exercise: The MegaFLOP rate of matrix multiplication

Let's rerun the MATMUL program, but this time ask it to report the
MegaFLOP rating for each method:

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=order      <-- Turn on print out of ORDER
  show=mflops     <-- Turn on print out of MFLOPS
  lda=513         <-- Sets the size of the FORTRAN array
  order=all       <-- Chooses all the algorithms
  n=256           <-- Picks a problem size.
  m               <-- Carries out the multiplication.
  q               <-- quit
  y               <-- Yes, I really want to quit!

  Order     MFLOPS

  IJK     _________________
  IKJ     _________________
  JIK     _________________
  JKI     _________________
  KIJ     _________________
  KJI     _________________
  MXMA    _________________
  SAXPY   _________________
  SAXPYC  _________________
  SGEMM   _________________
  SGEMMC  _________________
  SGEMMS  _________________

Now these MegaFLOP ratings are more useful than the timings we got
earlier.  For one thing, it's easier to think in terms of rates
rather than times.  For another, we know the maximum rate on the
Cray, so we can compare the speed of each method with that maximum
rate, and get a real measure of how efficiently the Cray is being
used.

For matrix multiplication, we know it's easy to compute the number of
operations, and hence to compute a MegaFLOP rate.  The MATMUL program
knows how large N is, and how long the computation took, and simply
computes the MegaFLOP rate as 2*N**3 / (1,000,000 * T).
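If you'd like to make the same measurement by hand, here is a minimal
sketch that times a plain IJK multiply with SECOND and converts the
2*N**3 operation count to a rate.  The sizes match the MATMUL runs
above, and the data values are arbitrary:

      PARAMETER (N=256,LDA=513)
      DIMENSION A(LDA,N),B(LDA,N),C(LDA,N)
C
C  Fill A and B with arbitrary values.
C
      DO 20 J=1,N
        DO 10 I=1,N
          A(I,J)=1.0
          B(I,J)=2.0
   10   CONTINUE
   20 CONTINUE
C
C  Time the IJK multiplication.
C
      CALL SECOND(TIME1)
      DO 50 I=1,N
        DO 40 J=1,N
          C(I,J)=0.0
          DO 30 K=1,N
            C(I,J)=C(I,J)+A(I,K)*B(K,J)
   30     CONTINUE
   40   CONTINUE
   50 CONTINUE
      CALL SECOND(TIME2)
C
C  2*N**3 operations, converted to MegaFLOPS.
C
      T=TIME2-TIME1
      WRITE(*,*)'C(N,N)   =',C(N,N)
      WRITE(*,*)'MegaFLOPS=',2.0*REAL(N)**3/(1.0E+06*T)
      STOP
      END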
For most programs, however, it is not possible to do such a simple
calculation.  We'll see in a moment how to get a MegaFLOP rate, no
matter how complicated the program is.


6). Getting a MegaFLOP rate using HPM

From the discussion and exercises, you should see that the MegaFLOP
rating for a program can be a useful indicator of its efficiency on
the Cray.  The problem is, we have to be able to estimate the amount
of work we're asking the Cray to do in order to figure out the rate,
and that's not usually easy to do.

Fortunately, there's a program available which counts the number of
operations, times the code, and computes the MegaFLOP rating, all
with very little effort on our part.  The program which does this is
called HPM.  You use HPM when you have an executable program to run.
In the simplest case, you simply type

  hpm program

where "program" is the name of the program or command you want to
monitor.  The program will execute, with HPM "observing" the run, and
collecting statistics which will be reported at the end of the run.
The only interesting values for a beginning user are the counts of
floating point operations, and the MegaFLOP rate.  Here's some sample
HPM output:

Group 0:  CPU seconds   :      5.70  CP executing     :     950345302

Million inst/sec (MIPS) :     27.86  Instructions     :     158811193
Avg. clock periods/inst :      5.98
% CP holding issue      :     72.50  CP holding issue :     688963041
Inst.buffer fetches/sec :      0.53M Inst.buf. fetches:       3019615
Floating adds/sec       :     58.65M F.P. adds        :     334340355
Floating multiplies/sec :     58.65M F.P. multiplies  :     334339392
Floating reciprocal/sec :      0.00M F.P. reciprocals :          2048
CPU mem. references/sec :    180.46M CPU references   :    1028668469
I/O mem. references/sec :      2.21M I/O references   :      12580870

Floating ops/CPU second :    117.31M


7). Exercise: Using HPM

Retrieve the sample program TEST.F from the EXAMPLES directory.  Do
this with the UNICOS command:

  cfs get /usr/local/examples/performance/test.f

Now compile and load TEST.F with the default options, and run it with
HPM:

  cf77 test.f
  hpm a.out

Let's compare a run that uses the enhanced vectorization
preprocessor:

  cf77 -Zv test.f
  hpm a.out

Let's look at a run which uses "inlining" of small routines:

  cf77 -Wf"-o inline" test.f
  hpm a.out

And let's try the "aggressive" optimization option:

  cf77 -Wf"-o aggress" test.f
  hpm a.out

MegaFLOP ratings reported by HPM:

  Compile statement             MegaFLOP rating

  cf77 test.f                   _________________
  cf77 -Zv test.f               _________________
  cf77 -Wf"-o inline" test.f    _________________
  cf77 -Wf"-o aggress" test.f   _________________
  ____________________________  _________________  <-- Try other switches!

Don't be surprised if some of the MegaFLOP ratings go down as you use
a compiler option that's supposed to help.  Sometimes a problem is
too small to be optimized, and there are other reasons why one
optimization strategy will never work for all programs!  The
important thing is that we're not helpless.  We can easily see what
the effect of our choices is.
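If you can't retrieve TEST.F, any small compute-bound program can
stand in for it in these comparisons.  The following program is a
hypothetical substitute (its sizes and loop body are arbitrary);
compile it and run it under HPM exactly as above:

      PROGRAM TEST
      PARAMETER (N=100000,NREP=100)
      DIMENSION X(N),Y(N)
C
C  Arbitrary vectorizable work, repeated so HPM has
C  something substantial to count.
C
      DO 10 I=1,N
        X(I)=REAL(I)
        Y(I)=0.0
   10 CONTINUE

      DO 30 IREP=1,NREP
        DO 20 I=1,N
          Y(I)=Y(I)+2.5*X(I)
   20   CONTINUE
   30 CONTINUE
C
C  Print one value so the work can't be optimized away.
C
      WRITE(*,*)'Y(N)=',Y(N)
      STOP
      END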
8). Tracking down a problem with FLOWTRACE

Suppose you were trying to fix a program with a bad grade from HPM.
How do you track down the problem?  There are a couple of programs
that can help us here, including PERFTRACE, FLOWTRACE and PROF.  We
will demonstrate one of the simpler programs, FLOWTRACE.

FLOWTRACE operates essentially by calling a library subroutine every
time control is passed from one subroutine or function to another.
This library subroutine is responsible for keeping track of the
elapsed CPU time, and totaling up the amount of time spent in each
routine.  At the end of the run, FLOWTRACE reports the number of
times each routine was called, and how much time was spent there.
This information identifies where your execution time is spent.

FLOWTRACE is invoked at compile time, using the following option:

  cf77 -F myprog.f
or
  cc -F myprog.c

When your program is run, a data file "flow.data" is created, which
must be interpreted by the related "flowview" program, as in:

  a.out
  flowview -LA

Keep in mind these facts:

  Your program may take longer to execute with FLOWTRACE on.  Thus
    the timing results will be somewhat distorted.
  Subroutines that are called hundreds of times, or which execute
    very quickly, are not measured accurately.
  FLOWTRACE results are only going to be useful to you if your
    program is divided up into subroutines which share the work.
    Otherwise, you probably already know where all the time is being
    spent.  (One user complained that the FLOWTRACE results weren't
    very helpful.  The user's program had no subroutines at all!)
  You must have access to (most of) the source code.  If your
    program calls library routines, in which most of the time is
    spent, FLOWTRACE will not be able to report this.

Here's part of the output you get from FLOWTRACE:

                     TOP 5 SIGNIFICANT ROUTINES
                   ------------------------------
                  (CPU Times are Shown in Seconds)

Routine Name  Tot Time   # Calls  Avg Time  Percentage  Accum%
------------  --------  --------  --------  ----------  ------
TAXPY         9.38E-01    125749  7.46E-06       68.99   68.99  *****************
TGEFA         3.84E-01         1  3.84E-01       28.27   97.26  *******
ITAMAX        2.89E-02       499  5.79E-05        2.13   99.38
TGESL         3.28E-03         1  3.28E-03        0.24   99.62
TSCAL         2.64E-03       499  5.30E-06        0.19   99.82


9). Exercise: Using FLOWTRACE

FLOWTRACE is used by compiling the program of interest with the
FLOWTRACE option on, and then running the program.  Aside from the
cost of the recompile and relink, the program may also require as
much as three times as much time to run!  The compile and load
command needed is

  cf77 -F myprog.f     for FORTRAN, or
  cc -F myprog.c       for C.

For our purposes, let's use FLOWTRACE on that FORTRAN test program
again.  First run the program "plain", and record the time.  Then run
it with FLOWTRACE.

  cf77 test.f
  a.out

  cf77 -F test.f
  a.out
  flowview -LA | more     <-- only do this interactively!

  CPU time without FLOWTRACE  _________________
  CPU time with FLOWTRACE     _________________

Look at the output from FLOWTRACE, and see how a couple of routines
really stick out.  Clearly, it is in our interest to make sure that
those routines are well written.  If the program is performing
poorly, we know which routines to search first.

  What routine is called most often?          __________________
  What routine takes the most time per call?  __________________
  What routine takes the most total time?     __________________


10). Conclusions

The most important thing I want you to understand is how easy it is
to get an idea of your program's efficiency on the Cray using HPM.
The MegaFLOP rating provided by HPM gives you an excellent idea of
how busy you're keeping the Cray.

The second thing I hope you understand is the importance of using
both HPM ratings and timings in order to make comparisons.  For
comparisons, the timing information is much more important than the
HPM numbers.  A program with a high MegaFLOP rate looks better than
one with a low MegaFLOP rate.  But if they both solve the same
problem, and the low MegaFLOP rate program solves it faster, the low
MegaFLOP rate program is better.  You're charged for time, not
efficiency.  So your goal is to bring down the time charges.
Increasing the MegaFLOP rate is simply one way to TRY to speed up the
program, and it's certainly not a goal in itself.

The third point I'd like to be sure you consider is the importance of
focusing your optimization, that is, figuring out the smallest
portion of the program that you should try to work on.  Using
FLOWTRACE (or PERFTRACE) you can at least find the slow subroutines
or functions.  This minimizes your work (and also minimizes the
damage you can cause by misguided optimization of the wrong things!).

I haven't had a chance to discuss some other interesting performance
monitoring programs, including FORGE, LOOPMARK, PERFTRACE and SCOUNT.
Each of these topics is described in an online document at the PSC.
If you are interested in pursuing this topic, you might also be
interested in:

  UNICOS Performance Utilities Reference Manual
  SR-2040
  Cray Research, Inc.

which is available from Cray Research, or directly from our
Publications Coordinator, Vivian Benton.  A copy of this manual is
available online in the documentation directories, under the name
"performance.big".  The document may also be examined (on the Cray
only!) using the DOCVIEW program.

Finally, I recommend that you try some of the exercises that follow.


11). Exercise: When does a loop vectorize?

No matter how well you write a FORTRAN DO loop, its actual efficiency
will depend in part on the number of iterations carried out.  There
are two main reasons for this:

  There is some overhead involved in setting up a vector loop.
  There is some more overhead involved after every 64th iteration.

This means that a DO loop has an "asymptotic" efficiency that you
only see when the loop is "large enough".  When the loop is small, it
may perform worse than scalar code, and the performance will be
strongly affected by problem sizes that are small, or near a (small)
multiple of 64.

We can see some of this occurring, particularly the "bumps" at 64,
128, and so on, using the MATMUL program.  Let's restrict our
attention to one algorithm, say "IJK", and simply run it for a bunch
of values of N:

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=n          <-- Turn on print out of N
  show=mflops     <-- Turn on print out of MFLOPS
  lda=513         <-- Sets the size of the FORTRAN array
  order=ijk       <-- Chooses the algorithm
  n=1             <-- N=1
  m               <-- Carries out the multiplication.
  n=2             <-- Set N=2
  m               <-- Carries out the multiplication.

  Repeat for N=63, 64, 65, 66, and 127, 128, 129, 130 to see "bumps".
  Repeat for N=1, 2, 4, 8, 16, 32, 64, 128, 256, 512 to see the
  asymptotic rate.

  q               <-- quit
  y               <-- Yes, I really want to quit!

There are actually ways to estimate the asymptotic rate of a loop,
which, by the way, is sometimes called "R-infinity".  The reason for
doing so is to try to understand the behavior of a loop for various
sizes of N, and to be able to compare two loops even if their
relative performance varies for small N.  A related concept is
"N(1/2)", which is the size of the loop at which it has reached half
of its asymptotic speed.

Use the actual MegaFLOP rating for the loop at N=512, and pretend
that that is the asymptotic rate.  Now try to find the value of N at
which half that rate is achieved.

  Rate at N=512     ____________________
  Estimated N(1/2)  ____________________


12). Exercise: Using SECOND to time little things

Use the SECOND routine to get an idea of the relative cost of several
kinds of operations on the Cray.  Let's time 1,000 iterations of a
loop containing various computations, and record the average time per
operation.

      PARAMETER (N=1000)
      DIMENSION X(N),Y(N),Z(N)

      DO 10 I=1,N
        X(I)=RANF()
        Y(I)=RANF()
        Z(I)=RANF()
   10 CONTINUE

      CALL SECOND(TIME1)

      DO 20 I=1,N
        Insert some operation here.
   20 CONTINUE

      CALL SECOND(TIME2)

      WRITE(*,*)'X(N)=',X(N)
      TIME=(TIME2-TIME1)/REAL(N)
      WRITE(*,*)'Each operation required ',TIME,' seconds.'
      STOP
      END

Use each of the following operations inside loop 20:

  Operation              CPU time per operation

  X(I)=Y(I)              _________________
  X(I)=Y(I)+Z(I)         _________________
  X(I)=Y(I)*Z(I)         _________________
  X(I)=X(I)+Y(I)*Z(I)    _________________
  X(I)=Y(I)/Z(I)         _________________
  X(I)=SQRT(Y(I))        _________________
  X(I)=EXP(Y(I))         _________________

Let's guess that copying one vector into another is about as fast an
operation as you can do.  How much longer does each operation take,
compared to the copy?  You should find that EXP, in particular, is
very expensive.

Repeat the experiment for X(I)=X(I)+Y(I)*Z(I), but insert the
statement

      IMPLICIT DOUBLE PRECISION (A-H,O-Z)

just before the DIMENSION statement, and see what happens to the
time!  Perhaps you can see why we really urge users to avoid DOUBLE
PRECISION.

  X(I)=X(I)+Y(I)*Z(I)    _________________   (Double Precision)

WARNING: Printing X(N) is important in the above example.  If you
don't believe me, try removing that WRITE statement, and repeat the
experiment.  The Cray realizes that you never check the X array, and
so it doesn't do the calculation!


13). Exercise: Using LOOPMARK

The Cray RANF routine may be used to compute "random" numbers between
0 and 1.  RANF is a vectorizable routine, and so a loop containing a
call to RANF may still vectorize.  Type in the following program, and
call it "random.f":

      PARAMETER (N=100)
      REAL A(N),B(N),C(N)

      DO 10 I=1,N
        A(I)=RANF()
   10 CONTINUE

      DO 20 I=1,N
        B(I)=RANF()
        C(I)=RANF()
   20 CONTINUE

      WRITE(*,*)A(1),B(1),C(1)
      STOP
      END

Which loops do you think should vectorize?  Want some hints?  Calls
to subroutines or functions inside a loop can inhibit vectorization.
However, most FORTRAN functions (SQRT for instance) may be used in a
loop without inhibiting vectorization.  RANF is not a standard
FORTRAN function, but may also be used in this way.

Now use LOOPMARK to determine which of the loops in the program
vectorize.  Use the command

  cf77 -c -Wf"-em" random.f

The LOOPMARK listing will show up in the file "random.l".

You should notice a discrepancy between the loops you expected to
vectorize and the loops that actually did.  If you are wondering
about the results, consider the following: the Cray compiler wants to
vectorize a loop only if it believes the loop will still produce
identical results.  The random number generator RANF, like most such
routines, is "deterministic"; there is a formula used to produce each
value from the previous one.  Since we don't really care whether our
random numbers get shuffled, we can use a "CDIR$ IVDEP" directive
just before the loop to "urge" the compiler to vectorize it, as shown
in the sketch below.  Repeat the exercise, and see if you can
convince the compiler to vectorize the loop this time.
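Here is a sketch of how the directive might be placed, assuming loop
20 is the one the compiler declined to vectorize (CDIR$ must begin in
column 1):

C  The IVDEP directive tells the compiler to ignore suspected
C  vector dependences in the loop that follows.
CDIR$ IVDEP
      DO 20 I=1,N
        B(I)=RANF()
        C(I)=RANF()
   20 CONTINUE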
14). Exercise: Using SCOUNT to count statement executions

The SCOUNT program can be used to show you how many times each
statement of a program gets executed.  The output of SCOUNT is a copy
of the original program, with each line labeled by the number of
times it was executed.  When trying to analyze the performance and
behavior of a program, it can be very useful to get the statement
count, even though this does not have a direct relationship to the
time or work involved in a particular line.

To try SCOUNT, get the SCOUNT example program:

  cfs get /usr/local/examples/scount/scoprb.f

Then use the following commands:

  scount -b -iscoprb.f
  cf77 cftsrc.f
  a.out
  scount -a -iscoprb.f > report.txt

Take a look at the listing in "report.txt".  Pay particular attention
to the subroutine TAXPY.  There are lines there that get executed a
lot, and lines that don't get executed at all.  Similarly, in
subroutine TGEFA, there are some lines of code that are used to swap
two rows of the matrix.  Are these lines used?


15). Exercise: Using HPM to compare multiplication to division

Among other things, HPM counts the number of multiplies, adds, and
divides.  Let's see how well it does with this program, which we can
call "test.f":

      PARAMETER (N=100000)
      REAL A(N),B(N),C(N)

      DO 10 I=1,N
        A(I)=RANF()
   10 CONTINUE

      DO 20 I=1,N
        B(I)=RANF()
   20 CONTINUE

      DO 30 I=1,N
        C(I)=A(I)*B(I)
   30 CONTINUE

      WRITE(*,*)'C(1)=',C(1)
      STOP
      END

Count the number of floating point operations, as well as the
MegaFLOP rate, using the commands

  cf77 test.f
  hpm a.out

Put your results here:

  Number of multiplies  = __________
  Number of reciprocals = __________
  MegaFLOP rate         = __________

Can you guess what will happen to the counts and the MegaFLOP rate if
we change DO loop 30 to

      C(I)=A(I)/B(I)

  Number of multiplies  = __________
  Number of reciprocals = __________
  MegaFLOP rate         = __________

Hint: How does the Cray compute A/B?  First it computes 1/B, in a
process which requires two multiplications, and then it multiplies
A*(1/B).  For every A/B operation, therefore, HPM will report 1
floating point reciprocal and 3 multiplications!


16). Exercise: Using FLOWTRACE on the LINPACK benchmark

Get the source code for the LINPACK benchmark in FORTRAN or C,
whichever is your favorite language:

  cfs get /usr/local/src/bin/lbench/lbench.f
or
  cfs get /usr/local/src/bin/lbenchc/lbenchc.c

Compile, load and run the program with FLOWTRACE:

  cf77 -F lbench.f
  a.out
  flowview -LA
or
  cc -DUNICOS -F lbenchc.c
  a.out
  flowview -LA

If you are logged in interactively, use the command

  flowview -LA | more

instead!


17). Exercise: When memory requests collide

For this exercise, we'll use the MATMUL program that we used in
section 2.  The MATMUL program has an input parameter called "LDA",
which has no effect on the numerical results of the computation, but
is simply the leading dimension of the arrays used to hold the
matrices.  In other words, the N by N matrix will be stored in an
array declared as

      DIMENSION A(LDA,N)

LDA must be at least as big as N, otherwise there isn't enough space
to store the array.  LDA may be larger than N, without altering the
numerical results; we just have extra space allocated.  However, the
value of LDA affects the "layout" of the data, and on some machines,
particularly the Cray, this can make a big difference in performance.

The purpose of this exercise is to exhibit the problem.  We can't
really go into a detailed explanation here.  But watch what happens
as we exhibit what's called a "memory bank conflict".  Use the
following input to MATMUL:

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=lda        <-- Turn on print out of LDA
  show=time       <-- Turn on print out of TIME
  order=ijk       <-- Chooses the algorithm
  lda=511         <-- Sets the size of the FORTRAN array
  n=256           <-- Picks a problem size.
  m               <-- Carries out the multiplication.
  lda=512
  m
  lda=513
  m
  q               <-- quit
  y               <-- Yes, I really want to quit!

Put your results here:

  LDA     Time

  511   __________
  512   __________
  513   __________

The problem occurs when a memory bank can't deliver a number to the
processor fast enough.  The problem can often be avoided by making
the first dimension of the array odd.


18). A sample program to rewrite and benchmark

Here is a program which integrates a simple function over an
interval.  It is written in a very natural, but inefficient, way.
The inner loop goes from 1 to 3, and even then it won't vectorize!
The outer loop goes from 1 to 100000, and is an ideal candidate for
vectorization.  But first you have to rewrite the code!

Run HPM on the original version and record the MegaFLOP rate.  Then
see if you can improve the speed by trying things like finding a way
of using DO loops with large iteration counts, and use HPM to see if
you've improved the code's performance.  You may need to repeat this
exercise a few times.

  Original MegaFLOP rate:      ________________________
  Revised code MegaFLOP rate:  ________________________
  __________________________   ________________________

I got this code to run 50 times faster, but I had to really bash at
it.  See the example QUAD.FOR in the PERFORMANCE examples directory.

      DIMENSION XL(3)
      DIMENSION WL(3)
C
      A=0.0
      B=1.0
      NSUB=100000
C
C  Weights and abscissas of simple Gauss-Legendre rule.
C
      WL(1)=5.0/9.0
      WL(2)=8.0/9.0
      WL(3)=5.0/9.0
      XL(1)=-.7746
      XL(2)=0.0
      XL(3)=.7746
C
      Q=0.0
C
C  Break up (A,B) into NSUB equal subintervals, (ASUB,BSUB).
C
      DO 20 I=1,NSUB

        ASUB=((NSUB-I+1)*A+(I-1)*B)/REAL(NSUB)
        BSUB=((NSUB-I)*A+I*B)/REAL(NSUB)
C
C  Apply Gauss-Legendre rule on subinterval.
C
        DO 10 J=1,3
          X=0.5*(ASUB+BSUB)+0.5*(ASUB-BSUB)*XL(J)
          Q=Q+0.5*WL(J)*(B-A)*FUN(X)/REAL(NSUB)
   10   CONTINUE

   20 CONTINUE
C
      WRITE(*,*)'Integral=',Q
      STOP
      END

      REAL FUNCTION FUN(X)
C
      FUN=X*X
      RETURN
      END
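If you want a nudge rather than the answer, here is one possible
restructuring (a sketch, not the actual QUAD.FOR solution): move the
short Gauss-point loop outside, make the long subinterval loop the
inner one, and write the integrand inline so the function call does
not inhibit vectorization.

      DIMENSION XL(3)
      DIMENSION WL(3)
C
      A=0.0
      B=1.0
      NSUB=100000
C
      WL(1)=5.0/9.0
      WL(2)=8.0/9.0
      WL(3)=5.0/9.0
      XL(1)=-.7746
      XL(2)=0.0
      XL(3)=.7746
C
      H=(B-A)/REAL(NSUB)
      Q=0.0
C
C  Short loop outside, long loop inside.  With FUN written
C  inline as X*X, the inner loop is a vectorizable sum.
C
      DO 20 J=1,3
        QJ=0.0
        DO 10 I=1,NSUB
          ASUB=A+REAL(I-1)*H
          X=ASUB+0.5*H+0.5*H*XL(J)
          QJ=QJ+X*X
   10   CONTINUE
        Q=Q+0.5*WL(J)*H*QJ
   20 CONTINUE
C
      WRITE(*,*)'Integral=',Q
      STOP
      END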