PSCDOC:PERFORMANCE.DOC                                 30 September 1991

PERFORMANCE utilities monitor and report on user program efficiency.
Some quantities that can be observed include the elapsed time, the
number of arithmetic operations, the number of I/O requests, and the
MegaFLOP rate.  Reports can be prepared for a whole program, for each
subroutine, or in some cases for individual loops or pieces of code.
Cray Research has provided a number of utilities and routines, the
most useful of which include HPM, FLOWTRACE, PERFTRACE and SECOND.

Help:      man performance
Document:  performance.doc
Examples:  io.f, linsol.f, lintst.f, quad.f, sample.f

UNICOS usage:

  The instructions for each performance monitor are in separate
  documents, which you should refer to.

See also:  FLOWTRACE, FORGE, HPM, JA, LOOPMARK, MATMUL, PCA,
           PERFTRACE, PROCSTAT, PROF, PSR, SCILIB, SCOUNT, SECOND,
           TIME, TIMEX, VECTORIZE


                      Cray Performance Monitoring

                          30 September 1991

                  Pittsburgh Supercomputing Center


 1). Introduction
 2). Exercise: Elapsed CPU time
 3). Use of SECOND and SECONDR
 4). Counting floating point operations, and floating point speed
 5). Exercise: The MegaFLOP rate of matrix multiplication
 6). Getting a MegaFLOP rate using HPM
 7). Exercise: Using HPM
 8). Tracking down a problem with FLOWTRACE
 9). Exercise: Using FLOWTRACE
10). Conclusions

Exercises for further study:

11). When does a loop vectorize?
12). Using SECOND to time little things
13). Using LOOPMARK to get a nice compiler listing
14). Using SCOUNT to count statement executions
15). Using HPM to compare multiplication to division
16). Using FLOWTRACE on the LINPACK benchmark
17). When memory requests collide
18). A sample program to rewrite and benchmark


1). Introduction

Performance monitoring is the measurement of the computer resources
used by a program.  Typical quantities measured can include:

  the elapsed CPU time,
  the maximum and average memory used,
  the number of instructions executed,
  the number of processors used, if running in parallel,
  the number of logical and arithmetic operations,
  the number of floating point operations,
  the number of I/O operations carried out.

The Cray has a limited ability to run programs using more than one
processor.  For simplicity's sake, we will assume that we are using
just one processor.  That's the default, anyway.

For typical supercomputing applications, the two important quantities
are the number of floating point operations and the elapsed CPU time.
Their ratio, measured in millions of floating point operations per
second, or MegaFLOPS, is commonly used to rate scientific computers
and programs.  We will restrict our attention from now on to just the
quantities that define MegaFLOPS, that is, the elapsed CPU time and
the number of operations.


2). Exercise: Elapsed CPU time

Let's run a program which carries out some work and reports the time
and work required.  The work being done is matrix multiplication.
We'll have the program solve a problem using all the algorithms it
knows about.

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=order      <-- Turn on print out of ORDER
  show=time       <-- Turn on print out of TIME
  lda=513         <-- Sets the size of the FORTRAN array
  order=all       <-- Chooses all the algorithms
  n=256           <-- Picks a problem size.
  m               <-- Carries out the multiplication.
  q               <-- quit
  y               <-- Yes, I really want to quit!
Put your results here (two decimals are enough):

  ORDER     Time

  IJK     __________
  IKJ     __________
  JIK     __________
  JKI     __________
  KIJ     __________
  KJI     __________
  MXMA    __________
  SAXPY   __________
  SAXPYC  __________
  SGEMM   __________
  SGEMMC  __________
  SGEMMS  __________

Here are the results I got, to two decimal places:

  IJK     0.18
  IKJ     0.26
  JIK     0.18
  JKI     0.18
  KIJ     0.27
  KJI     0.18
  MXMA    0.11
  SAXPY   0.35
  SAXPYC  0.22
  SGEMM   0.20
  SGEMMC  0.11
  SGEMMS  0.09

You would NEVER be able to guess from the source code which method
would run faster.  You have to know about the machine you're using to
make that judgment.  The SECOND routine can help you to see the
difference.


3). The use of SECOND and SECONDR for measuring performance

In the MATMUL program that you just ran, the program printed out the
value of the elapsed CPU time.  This measurement was made by calling
the Cray system routine SECOND.  SECOND measures the amount of CPU
time that has elapsed since the job began.  By calling SECOND before
and after some operation, and subtracting the values returned, you
can measure the "cost" of that operation in CPU time.

Here is a typical use of the routine:

      CALL SECOND(TIME1)

      DO 10 I=1,1000
        X(I)=SQRT(Y(I))
   10 CONTINUE

      CALL SECOND(TIME2)

      WRITE(*,*)'Elapsed CPU time=',TIME2-TIME1,' seconds.'

Using SECOND you can time anything: a DO loop, a subroutine, a whole
program.  SECOND is ideal when:

  you KNOW what portion of a program you want to measure,
  you want to measure just a few chunks of the program,
  the measured portions are executed only a few times.

SECOND can be painful to use if:

  you're searching for the portion of a program that uses a lot of
    time,
  you're interested in timing many portions of a program,
  the code to be timed executes very quickly (timings will not be
    accurate),
  the code is executed many times (you have to save the timings and
    print them later).

Moreover, you have to have access to the source code of the program
you want to evaluate.  And SECOND, by itself, doesn't tell you
anything about how efficiently you used the Cray.

There is a similar routine, called SECONDR, for computing the elapsed
real time, that is, the actual time that has passed, including time
when your program is swapped out and not actually executing.  The
time value measured by SECONDR should usually be larger than that
measured by SECOND.  However, a serious discrepancy in the time
values may indicate excessive overhead and swapping due to poorly
handled I/O.
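Here is a minimal sketch of how you might watch both clocks at once.
This assumes that SECONDR, like SECOND, is a subroutine called with a
single real argument; the array size and the loop body are arbitrary:

      PARAMETER (N=100000)
      DIMENSION X(N)
C
C  Read both clocks, do some work, and read them again.
C
      CALL SECOND(CPU1)
      CALL SECONDR(WALL1)

      DO 10 I=1,N
        X(I)=SQRT(REAL(I))
   10 CONTINUE

      CALL SECOND(CPU2)
      CALL SECONDR(WALL2)
C
C  Print one result so the compiler can't discard the loop.
C
      WRITE(*,*)'X(N)     =',X(N)
      WRITE(*,*)'CPU time =',CPU2-CPU1,' seconds.'
      WRITE(*,*)'Real time=',WALL2-WALL1,' seconds.'
      STOP
      END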
4). Counting floating point operations, and floating point speed

We know how to time a program or program fragment.  But we also need
to measure the amount of work done in that time.  For scientific
programs running on scientific computers, a good measure of work is
the number of floating point arithmetic operations: additions,
multiplications, and sometimes divisions.

If we can estimate the number of operations involved in a program,
and get the elapsed CPU time via SECOND, we can compute the number of
floating point operations per second, called FLOPS.  Usually, this is
counted in millions, or MegaFLOPS.  This gives us a rate, or speed,
of computation.

Such a rate can give us an absolute scale of measurement, because the
maximum MegaFLOP rating on the Cray is known.  You can figure out the
Cray's computational rate from two facts:

  The Cray has a clock cycle of 6 nanoseconds.
  The Cray can do a (vectorized) add and multiply in one clock cycle.

Thus we can do 2 floating point operations in 6 nanoseconds.  The
maximum MegaFLOP rating on the Cray is therefore

  2 ops / 6 nanoseconds = 2 thousand million ops / 6 seconds
                        = 333 MegaFLOPS.

This value of 333 can be used to measure your program's performance.
Roughly speaking, if your program is performing at over 100
MegaFLOPS, you are achieving superior performance.  If you are
performing at under 10 MegaFLOPS, your program probably shouldn't be
running on the Cray at all.

Let's do this measurement for the simple case of matrix
multiplication.  Assuming the computer organizes the calculation the
way we would, we can figure out that each element of the result
matrix will require being set to zero, and then having N pairs of
values multiplied together and added:

      C(I,J)=0.0
      C(I,J)=C(I,J)+A(I,1)*B(1,J)
      C(I,J)=C(I,J)+A(I,2)*B(2,J)
      ...
      C(I,J)=C(I,J)+A(I,N)*B(N,J)

Ignoring the initial assignment to zero, each element will require N
multiplications and N additions, or 2*N operations.  And so the whole
N*N matrix will require 2*N*N*N operations to compute.

Armed with this information, we can choose a matrix size, figure out
the amount of work required, use SECOND to get the elapsed CPU time,
and come up with a computational rate.  This will tell us how well
the program is doing on the Cray, compared to the maximal rate of 333
MegaFLOPS.


5). Exercise: The MegaFLOP rate of matrix multiplication

Let's rerun the MATMUL program, but this time ask it to report the
MegaFLOP rating for each method:

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=order      <-- Turn on print out of ORDER
  show=mflops     <-- Turn on print out of MFLOPS
  lda=513         <-- Sets the size of the FORTRAN array
  order=all       <-- Chooses all the algorithms
  n=256           <-- Picks a problem size.
  m               <-- Carries out the multiplication.
  q               <-- quit
  y               <-- Yes, I really want to quit!

  Order     MFLOPS

  IJK     _________________
  IKJ     _________________
  JIK     _________________
  JKI     _________________
  KIJ     _________________
  KJI     _________________
  MXMA    _________________
  SAXPY   _________________
  SAXPYC  _________________
  SGEMM   _________________
  SGEMMC  _________________
  SGEMMS  _________________

Now these MegaFLOP ratings are more useful than the timings we got
earlier.  For one thing, it's easier to think in terms of rates
rather than times.  For another, we know the maximum rate on the
Cray, so we can compare the speed of each method with that maximum
rate, and get a real measure of how efficiently the Cray is being
used.

For matrix multiplication, we know it's easy to compute the number of
operations, and hence to compute a MegaFLOP rate.  The MATMUL program
knows how large N is, and how long the computation took, and simply
computes the MegaFLOP rate as 2*N**3 / (1,000,000 * T).
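If you'd like to make the same measurement by hand, here is a minimal
sketch that times a plain IJK multiply with SECOND and converts the
2*N**3 operation count to a rate.  The sizes match the MATMUL runs
above, and the data values are arbitrary:

      PARAMETER (N=256,LDA=513)
      DIMENSION A(LDA,N),B(LDA,N),C(LDA,N)
C
C  Fill A and B with arbitrary values.
C
      DO 20 J=1,N
        DO 10 I=1,N
          A(I,J)=1.0
          B(I,J)=2.0
   10   CONTINUE
   20 CONTINUE
C
C  Time the IJK multiplication.
C
      CALL SECOND(TIME1)
      DO 50 I=1,N
        DO 40 J=1,N
          C(I,J)=0.0
          DO 30 K=1,N
            C(I,J)=C(I,J)+A(I,K)*B(K,J)
   30     CONTINUE
   40   CONTINUE
   50 CONTINUE
      CALL SECOND(TIME2)
C
C  2*N**3 operations, converted to MegaFLOPS.
C
      T=TIME2-TIME1
      WRITE(*,*)'C(N,N)   =',C(N,N)
      WRITE(*,*)'MegaFLOPS=',2.0*REAL(N)**3/(1.0E+06*T)
      STOP
      END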
For most programs, however, it is not possible to do such a simple
calculation.  We'll see in a moment how to get a MegaFLOP rate, no
matter how complicated the program is.


6). Getting a MegaFLOP rate using HPM

From the discussion and exercises, you should see that the MegaFLOP
rating for a program can be a useful indicator of its efficiency on
the Cray.  The problem is, we have to be able to estimate the amount
of work we're asking the Cray to do in order to figure out the rate,
and that's not usually easy to do.

Fortunately, there's a program available which counts the number of
operations, times the code, and computes the MegaFLOP rating, all
with very little effort on our part.  The program which does this is
called HPM.  You use HPM when you have an executable program to run.
In the simplest case, you simply type

  hpm program

where "program" is the name of the program or command you want to
monitor.  The program will execute, with HPM "observing" the run, and
collecting statistics which will be reported at the end of the run.
The only interesting values for a beginning user are the counts of
floating point operations, and the MegaFLOP rate.  Here's some sample
HPM output:

Group 0:  CPU seconds   :      5.70  CP executing     :     950345302

Million inst/sec (MIPS) :     27.86  Instructions     :     158811193
Avg. clock periods/inst :      5.98
% CP holding issue      :     72.50  CP holding issue :     688963041
Inst.buffer fetches/sec :      0.53M Inst.buf. fetches:       3019615
Floating adds/sec       :     58.65M F.P. adds        :     334340355
Floating multiplies/sec :     58.65M F.P. multiplies  :     334339392
Floating reciprocal/sec :      0.00M F.P. reciprocals :          2048
CPU mem. references/sec :    180.46M CPU references   :    1028668469
I/O mem. references/sec :      2.21M I/O references   :      12580870

Floating ops/CPU second :    117.31M


7). Exercise: Using HPM

Retrieve the sample program TEST.F from the EXAMPLES directory.  Do
this with the UNICOS command:

  cfs get /usr/local/examples/performance/test.f

Now compile and load TEST.F with the default options, and run it with
HPM:

  cf77 test.f
  hpm a.out

Let's compare a run that uses the enhanced vectorization
preprocessor:

  cf77 -Zv test.f
  hpm a.out

Let's look at a run which uses "inlining" of small routines:

  cf77 -Wf"-o inline" test.f
  hpm a.out

And let's try the "aggressive" optimization option:

  cf77 -Wf"-o aggress" test.f
  hpm a.out

MegaFLOP ratings reported by HPM:

  Compile statement             MegaFLOP rating

  cf77 test.f                   _________________
  cf77 -Zv test.f               _________________
  cf77 -Wf"-o inline" test.f    _________________
  cf77 -Wf"-o aggress" test.f   _________________
  ____________________________  _________________  <-- Try other switches!

Don't be surprised if some of the MegaFLOP ratings go down as you use
a compiler option that's supposed to help.  Sometimes a problem is
too small to be optimized, and there are other reasons why one
optimization strategy will never work for all programs!  The
important thing is that we're not helpless.  We can easily see what
the effect of our choices is.
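If you can't retrieve TEST.F, any small compute-bound program can
stand in for it in these comparisons.  The following program is a
hypothetical substitute (its sizes and loop body are arbitrary);
compile it and run it under HPM exactly as above:

      PROGRAM TEST
      PARAMETER (N=100000,NREP=100)
      DIMENSION X(N),Y(N)
C
C  Arbitrary vectorizable work, repeated so HPM has
C  something substantial to count.
C
      DO 10 I=1,N
        X(I)=REAL(I)
        Y(I)=0.0
   10 CONTINUE

      DO 30 IREP=1,NREP
        DO 20 I=1,N
          Y(I)=Y(I)+2.5*X(I)
   20   CONTINUE
   30 CONTINUE
C
C  Print one value so the work can't be optimized away.
C
      WRITE(*,*)'Y(N)=',Y(N)
      STOP
      END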
8). Tracking down a problem with FLOWTRACE

Suppose you were trying to fix a program with a bad grade from HPM.
How do you track down the problem?  There are a couple of programs
that can help us here, including PERFTRACE, FLOWTRACE and PROF.  We
will demonstrate one of the simpler programs, FLOWTRACE.

FLOWTRACE operates essentially by calling a library subroutine every
time control is passed from one subroutine or function to another.
This library subroutine is responsible for keeping track of the
elapsed CPU time, and totaling up the amount of time spent in each
routine.  At the end of the run, FLOWTRACE reports the number of
times each routine was called, and how much time was spent there.
This information identifies where your execution time is spent.

FLOWTRACE is invoked at compile time, using the following option:

  cf77 -F myprog.f
or
  cc -F myprog.c

When your program is run, a data file "flow.data" is created, which
must be interpreted by the related "flowview" program, as in:

  a.out
  flowview -LA

Keep in mind these facts:

  Your program may take longer to execute with FLOWTRACE on.  Thus
    the timing results will be somewhat distorted.
  Subroutines that are called hundreds of times, or which execute
    very quickly, are not measured accurately.
  FLOWTRACE results are only going to be useful to you if your
    program is divided up into subroutines which share the work.
    Otherwise, you probably already know where all the time is being
    spent.  (One user complained that the FLOWTRACE results weren't
    very helpful.  The user's program had no subroutines at all!)
  You must have access to (most of) the source code.  If your
    program calls library routines, in which most of the time is
    spent, FLOWTRACE will not be able to report this.

Here's part of the output you get from FLOWTRACE:

                     TOP 5 SIGNIFICANT ROUTINES
                   ------------------------------
                  (CPU Times are Shown in Seconds)

Routine Name  Tot Time   # Calls  Avg Time  Percentage  Accum%
------------  --------  --------  --------  ----------  ------
TAXPY         9.38E-01    125749  7.46E-06       68.99   68.99  *****************
TGEFA         3.84E-01         1  3.84E-01       28.27   97.26  *******
ITAMAX        2.89E-02       499  5.79E-05        2.13   99.38
TGESL         3.28E-03         1  3.28E-03        0.24   99.62
TSCAL         2.64E-03       499  5.30E-06        0.19   99.82


9). Exercise: Using FLOWTRACE

FLOWTRACE is used by compiling the program of interest with the
FLOWTRACE option on, and then running the program.  Aside from the
cost of the recompile and relink, the program may also require as
much as three times as much time to run!  The compile and load
command needed is

  cf77 -F myprog.f     for FORTRAN, or
  cc -F myprog.c       for C.

For our purposes, let's use FLOWTRACE on that FORTRAN test program
again.  First run the program "plain", and record the time.  Then run
it with FLOWTRACE.

  cf77 test.f
  a.out

  cf77 -F test.f
  a.out
  flowview -LA | more     <-- only do this interactively!

  CPU time without FLOWTRACE  _________________
  CPU time with FLOWTRACE     _________________

Look at the output from FLOWTRACE, and see how a couple of routines
really stick out.  Clearly, it is in our interest to make sure that
those routines are well written.  If the program is performing
poorly, we know which routines to search first.

  What routine is called most often?          __________________
  What routine takes the most time per call?  __________________
  What routine takes the most total time?     __________________


10). Conclusions

The most important thing I want you to understand is how easy it is
to get an idea of your program's efficiency on the Cray using HPM.
The MegaFLOP rating provided by HPM gives you an excellent idea of
how busy you're keeping the Cray.

The second thing I hope you understand is the importance of using
both HPM ratings and timings in order to make comparisons.  For
comparisons, the timing information is much more important than the
HPM numbers.  A program with a high MegaFLOP rate looks better than
one with a low MegaFLOP rate.  But if they both solve the same
problem, and the low MegaFLOP rate program solves it faster, the low
MegaFLOP rate program is better.  You're charged for time, not
efficiency.  So your goal is to bring down the time charges.
Increasing the MegaFLOP rate is simply one way to TRY to speed up the
program, and it's certainly not a goal in itself.

The third point I'd like to be sure you consider is the importance of
focusing your optimization, that is, figuring out the smallest
portion of the program that you should try to work on.  Using
FLOWTRACE (or PERFTRACE) you can at least find the slow subroutines
or functions.  This minimizes your work (and also minimizes the
damage you can cause by misguided optimization of the wrong things!).

I haven't had a chance to discuss some other interesting performance
monitoring programs, including FORGE, LOOPMARK, PERFTRACE and SCOUNT.
Each of these topics is described in an online document at the PSC.
If you are interested in pursuing this topic, you might also be
interested in:

  UNICOS Performance Utilities Reference Manual
  SR-2040
  Cray Research, Inc.

which is available from Cray Research, or directly from our
Publications Coordinator, Vivian Benton.  A copy of this manual is
available online in the documentation directories, under the name
"performance.big".  The document may also be examined (on the Cray
only!) using the DOCVIEW program.

Finally, I recommend that you try some of the exercises that follow.


11). Exercise: When does a loop vectorize?

No matter how well you write a FORTRAN DO loop, its actual efficiency
will depend in part on the number of iterations carried out.  There
are two main reasons for this:

  There is some overhead involved in setting up a vector loop.
  There is some more overhead involved after every 64th iteration.

This means that a DO loop has an "asymptotic" efficiency that you
only see when the loop is "large enough".  When the loop is small, it
may perform worse than scalar code, and the performance will be
strongly affected by problem sizes that are small, or near a (small)
multiple of 64.

We can see some of this occurring, particularly the "bumps" at 64,
128, and so on, using the MATMUL program.  Let's restrict our
attention to one algorithm, say "IJK", and simply run it for a bunch
of values of N:

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=n          <-- Turn on print out of N
  show=mflops     <-- Turn on print out of MFLOPS
  lda=513         <-- Sets the size of the FORTRAN array
  order=ijk       <-- Chooses the algorithm
  n=1             <-- N=1
  m               <-- Carries out the multiplication.
  n=2             <-- Set N=2
  m               <-- Carries out the multiplication.

  Repeat for N=63, 64, 65, 66, and 127, 128, 129, 130 to see "bumps".
  Repeat for N=1, 2, 4, 8, 16, 32, 64, 128, 256, 512 to see the
  asymptotic rate.

  q               <-- quit
  y               <-- Yes, I really want to quit!

There are actually ways to estimate the asymptotic rate of a loop,
which, by the way, is sometimes called "R-infinity".  The reason for
doing so is to try to understand the behavior of a loop for various
sizes of N, and to be able to compare two loops even if their
relative performance varies for small N.  A related concept is
"N(1/2)", which is the size of the loop at which it has reached half
of its asymptotic speed.

Use the actual MegaFLOP rating for the loop at N=512, and pretend
that that is the asymptotic rate.  Now try to find the value of N at
which half that rate is achieved.

  Rate at N=512     ____________________
  Estimated N(1/2)  ____________________


12). Exercise: Using SECOND to time little things

Use the SECOND routine to get an idea of the relative cost of several
kinds of operations on the Cray.  Let's time 1,000 iterations of a
loop containing various computations, and record the average time per
operation.

      PARAMETER (N=1000)
      DIMENSION X(N),Y(N),Z(N)

      DO 10 I=1,N
        X(I)=RANF()
        Y(I)=RANF()
        Z(I)=RANF()
   10 CONTINUE

      CALL SECOND(TIME1)

      DO 20 I=1,N
        Insert some operation here.
   20 CONTINUE

      CALL SECOND(TIME2)

      WRITE(*,*)'X(N)=',X(N)
      TIME=(TIME2-TIME1)/REAL(N)
      WRITE(*,*)'Each operation required ',TIME,' seconds.'
      STOP
      END

Use each of the following operations inside loop 20:

  Operation              CPU time per operation

  X(I)=Y(I)              _________________
  X(I)=Y(I)+Z(I)         _________________
  X(I)=Y(I)*Z(I)         _________________
  X(I)=X(I)+Y(I)*Z(I)    _________________
  X(I)=Y(I)/Z(I)         _________________
  X(I)=SQRT(Y(I))        _________________
  X(I)=EXP(Y(I))         _________________

Let's guess that copying one vector into another is about as fast an
operation as you can do.  How much longer does each operation take,
compared to the copy?  You should find that EXP, in particular, is
very expensive.

Repeat the experiment for X(I)=X(I)+Y(I)*Z(I), but insert the
statement

      IMPLICIT DOUBLE PRECISION (A-H,O-Z)

just before the DIMENSION statement, and see what happens to the
time!  Perhaps you can see why we really urge users to avoid DOUBLE
PRECISION.

  X(I)=X(I)+Y(I)*Z(I)    _________________   (Double Precision)

WARNING: Printing X(N) is important in the above example.  If you
don't believe me, try removing that WRITE statement, and repeat the
experiment.  The Cray realizes that you never check the X array, and
so it doesn't do the calculation!


13). Exercise: Using LOOPMARK

The Cray RANF routine may be used to compute "random" numbers between
0 and 1.  RANF is a vectorizable routine, and so a loop containing a
call to RANF may still vectorize.  Type in the following program, and
call it "random.f":

      PARAMETER (N=100)
      REAL A(N),B(N),C(N)

      DO 10 I=1,N
        A(I)=RANF()
   10 CONTINUE

      DO 20 I=1,N
        B(I)=RANF()
        C(I)=RANF()
   20 CONTINUE

      WRITE(*,*)A(1),B(1),C(1)
      STOP
      END

Which loops do you think should vectorize?  Want some hints?  Calls
to subroutines or functions inside a loop can inhibit vectorization.
However, most FORTRAN functions (SQRT for instance) may be used in a
loop without inhibiting vectorization.  RANF is not a standard
FORTRAN function, but may also be used in this way.

Now use LOOPMARK to determine which of the loops in the program
vectorize.  Use the command

  cf77 -c -Wf"-em" random.f

The LOOPMARK listing will show up in the file "random.l".

You should notice a discrepancy between the loops you expected to
vectorize and the loops that actually did.  If you are wondering
about the results, consider the following: the Cray compiler wants to
vectorize a loop only if it believes the loop will still produce
identical results.  The random number generator RANF, like most such
routines, is "deterministic"; there is a formula used to produce each
value from the previous one.  Since we don't really care whether our
random numbers get shuffled, we can use a "CDIR$ IVDEP" directive
just before the loop to "urge" the compiler to vectorize it, as shown
in the sketch below.  Repeat the exercise, and see if you can
convince the compiler to vectorize the loop this time.
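Here is a sketch of how the directive might be placed, assuming loop
20 is the one the compiler declined to vectorize (CDIR$ must begin in
column 1):

C  The IVDEP directive tells the compiler to ignore suspected
C  vector dependences in the loop that follows.
CDIR$ IVDEP
      DO 20 I=1,N
        B(I)=RANF()
        C(I)=RANF()
   20 CONTINUE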
14). Exercise: Using SCOUNT to count statement executions

The SCOUNT program can be used to show you how many times each
statement of a program gets executed.  The output of SCOUNT is a copy
of the original program, with each line labeled by the number of
times it was executed.  When trying to analyze the performance and
behavior of a program, it can be very useful to get the statement
count, even though this does not have a direct relationship to the
time or work involved in a particular line.

To try SCOUNT, get the SCOUNT example program:

  cfs get /usr/local/examples/scount/scoprb.f

Then use the following commands:

  scount -b -iscoprb.f
  cf77 cftsrc.f
  a.out
  scount -a -iscoprb.f > report.txt

Take a look at the listing in "report.txt".  Pay particular attention
to the subroutine TAXPY.  There are lines there that get executed a
lot, and lines that don't get executed at all.  Similarly, in
subroutine TGEFA, there are some lines of code that are used to swap
two rows of the matrix.  Are these lines used?


15). Exercise: Using HPM to compare multiplication to division

Among other things, HPM counts the number of multiplies, adds, and
divides.  Let's see how well it does with this program, which we can
call "test.f":

      PARAMETER (N=100000)
      REAL A(N),B(N),C(N)

      DO 10 I=1,N
        A(I)=RANF()
   10 CONTINUE

      DO 20 I=1,N
        B(I)=RANF()
   20 CONTINUE

      DO 30 I=1,N
        C(I)=A(I)*B(I)
   30 CONTINUE

      WRITE(*,*)'C(1)=',C(1)
      STOP
      END

Count the number of floating point operations, as well as the
MegaFLOP rate, using the commands

  cf77 test.f
  hpm a.out

Put your results here:

  Number of multiplies  = __________
  Number of reciprocals = __________
  MegaFLOP rate         = __________

Can you guess what will happen to the counts and the MegaFLOP rate if
we change DO loop 30 to

      C(I)=A(I)/B(I)

  Number of multiplies  = __________
  Number of reciprocals = __________
  MegaFLOP rate         = __________

Hint: How does the Cray compute A/B?  First it computes 1/B, in a
process which requires two multiplications, and then it multiplies
A*(1/B).  For every A/B operation, therefore, HPM will report 1
floating point reciprocal and 3 multiplications!


16). Exercise: Using FLOWTRACE on the LINPACK benchmark

Get the source code for the LINPACK benchmark in FORTRAN or C,
whichever is your favorite language:

  cfs get /usr/local/src/bin/lbench/lbench.f
or
  cfs get /usr/local/src/bin/lbenchc/lbenchc.c

Compile, load and run the program with FLOWTRACE:

  cf77 -F lbench.f
  a.out
  flowview -LA
or
  cc -DUNICOS -F lbenchc.c
  a.out
  flowview -LA

If you are logged in interactively, use the command

  flowview -LA | more

instead!


17). Exercise: When memory requests collide

For this exercise, we'll use the MATMUL program that we used in
section 2.  The MATMUL program has an input parameter called "LDA",
which has no effect on the numerical results of the computation, but
is simply the leading dimension of the arrays used to hold the
matrices.  In other words, the N by N matrix will be stored in an
array declared as

      DIMENSION A(LDA,N)

LDA must be at least as big as N, otherwise there isn't enough space
to store the array.  LDA may be larger than N, without altering the
numerical results; we just have extra space allocated.  However, the
value of LDA affects the "layout" of the data, and on some machines,
particularly the Cray, this can make a big difference in performance.

The purpose of this exercise is to exhibit the problem.  We can't
really go into a detailed explanation here.  But watch what happens
as we exhibit what's called a "memory bank conflict".  Use the
following input to MATMUL:

  matmul          <-- Starts up the MATMUL program
  noshow          <-- Turn off print out of all values.
  show=lda        <-- Turn on print out of LDA
  show=time       <-- Turn on print out of TIME
  order=ijk       <-- Chooses the algorithm
  lda=511         <-- Sets the size of the FORTRAN array
  n=256           <-- Picks a problem size.
  m               <-- Carries out the multiplication.
  lda=512
  m
  lda=513
  m
  q               <-- quit
  y               <-- Yes, I really want to quit!

Put your results here:

  LDA     Time

  511   __________
  512   __________
  513   __________

The problem occurs when a memory bank can't deliver a number to the
processor fast enough.  The problem can often be avoided by making
the first dimension of the array odd.


18). A sample program to rewrite and benchmark

Here is a program which integrates a simple function over an
interval.  It is written in a very natural, but inefficient, way.
The inner loop goes from 1 to 3, and even then it won't vectorize!
The outer loop goes from 1 to 100000, and is an ideal candidate for
vectorization.  But first you have to rewrite the code!

Run HPM on the original version and record the MegaFLOP rate.  Then
see if you can improve the speed by trying things like finding a way
of using DO loops with large iteration counts, and use HPM to see if
you've improved the code's performance.  You may need to repeat this
exercise a few times.

  Original MegaFLOP rate:      ________________________
  Revised code MegaFLOP rate:  ________________________
  __________________________   ________________________

I got this code to run 50 times faster, but I had to really bash at
it.  See the example QUAD.FOR in the PERFORMANCE examples directory.

      DIMENSION XL(3)
      DIMENSION WL(3)
C
      A=0.0
      B=1.0
      NSUB=100000
C
C  Weights and abscissas of simple Gauss-Legendre rule.
C
      WL(1)=5.0/9.0
      WL(2)=8.0/9.0
      WL(3)=5.0/9.0
      XL(1)=-.7746
      XL(2)=0.0
      XL(3)=.7746
C
      Q=0.0
C
C  Break up (A,B) into NSUB equal subintervals, (ASUB,BSUB).
C
      DO 20 I=1,NSUB

        ASUB=((NSUB-I+1)*A+(I-1)*B)/REAL(NSUB)
        BSUB=((NSUB-I)*A+I*B)/REAL(NSUB)
C
C  Apply Gauss-Legendre rule on subinterval.
C
        DO 10 J=1,3
          X=0.5*(ASUB+BSUB)+0.5*(ASUB-BSUB)*XL(J)
          Q=Q+0.5*WL(J)*(B-A)*FUN(X)/REAL(NSUB)
   10   CONTINUE

   20 CONTINUE
C
      WRITE(*,*)'Integral=',Q
      STOP
      END

      REAL FUNCTION FUN(X)
C
      FUN=X*X
      RETURN
      END
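If you want a nudge rather than the answer, here is one possible
restructuring (a sketch, not the actual QUAD.FOR solution): move the
short Gauss-point loop outside, make the long subinterval loop the
inner one, and write the integrand inline so the function call does
not inhibit vectorization.

      DIMENSION XL(3)
      DIMENSION WL(3)
C
      A=0.0
      B=1.0
      NSUB=100000
C
      WL(1)=5.0/9.0
      WL(2)=8.0/9.0
      WL(3)=5.0/9.0
      XL(1)=-.7746
      XL(2)=0.0
      XL(3)=.7746
C
      H=(B-A)/REAL(NSUB)
      Q=0.0
C
C  Short loop outside, long loop inside.  With FUN written
C  inline as X*X, the inner loop is a vectorizable sum.
C
      DO 20 J=1,3
        QJ=0.0
        DO 10 I=1,NSUB
          ASUB=A+REAL(I-1)*H
          X=ASUB+0.5*H+0.5*H*XL(J)
          QJ=QJ+X*X
   10   CONTINUE
        Q=Q+0.5*WL(J)*H*QJ
   20 CONTINUE
C
      WRITE(*,*)'Integral=',Q
      STOP
      END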